All posts by Raks Khare

Achieve near real time operational analytics using Amazon Aurora PostgreSQL zero-ETL integration with Amazon Redshift

Post Syndicated from Raks Khare original https://aws.amazon.com/blogs/big-data/achieve-near-real-time-operational-analytics-using-amazon-aurora-postgresql-zero-etl-integration-with-amazon-redshift/

“Data is at the center of every application, process, and business decision. When data is used to improve customer experiences and drive innovation, it can lead to business growth,”

— Swami Sivasubramanian, VP of Database, Analytics, and Machine Learning at AWS, in With a zero-ETL approach, AWS is helping builders realize near-real-time analytics.

Customers across industries are becoming more data driven and looking to increase revenue, reduce cost, and optimize their business operations by implementing near real time analytics on transactional data, thereby enhancing agility. Based on customer needs and their feedback, AWS is investing and steadily progressing towards bringing our zero-ETL vision to life so that builders can focus more on creating value from data, instead of preparing data for analysis.

Our zero-ETL integration with Amazon Redshift facilitates point-to-point data movement and gets the data ready for analytics, artificial intelligence (AI), and machine learning (ML) on petabytes of data in Amazon Redshift. Within seconds of transactional data being written into supported AWS databases, zero-ETL seamlessly makes the data available in Amazon Redshift, removing the need to build and maintain complex data pipelines that perform extract, transform, and load (ETL) operations.

To help you focus on creating value from data instead of investing undifferentiated time and resources in building and managing ETL pipelines between transactional databases and data warehouses, we announced four AWS database zero-ETL integrations with Amazon Redshift at AWS re:Invent 2023.

In this post, we provide step-by-step guidance on how to get started with near real time operational analytics using the Amazon Aurora PostgreSQL zero-ETL integration with Amazon Redshift.

Solution overview

To create a zero-ETL integration, you specify an Amazon Aurora PostgreSQL-Compatible Edition cluster (compatible with PostgreSQL 15.4 and zero-ETL support) as the source, and a Redshift data warehouse as the target. The integration replicates data from the source database into the target data warehouse.

You must create Aurora PostgreSQL DB provisioned clusters within the Amazon RDS Database Preview Environment and a Redshift provisioned preview cluster or serverless preview workgroup, in the US East (Ohio) AWS Region. For Amazon Redshift, make sure that you choose the preview_2023 track in order to use zero-ETL integrations.

The following diagram illustrates the architecture implemented in this post.

The following are the steps needed to set up the zero-ETL integration for this solution. For complete getting started guides, refer to Working with Aurora zero-ETL integrations with Amazon Redshift and Working with zero-ETL integrations.


After Step 1, you can also skip Steps 2–4 and directly start creating your zero-ETL integration from Step 5, in which case Amazon RDS will show a message about missing configurations and you can choose Fix it for me to let Amazon RDS automatically configure those steps.

  1. Configure the Aurora PostgreSQL source with a customized DB cluster parameter group.
  2. Configure the Amazon Redshift Serverless destination with the required resource policy for its namespace.
  3. Update the Redshift Serverless workgroup to enable case-sensitive identifiers.
  4. Configure the required permissions.
  5. Create the zero-ETL integration.
  6. Create a database from the integration in Amazon Redshift.
  7. Start analyzing the near real time transactional data.

Configure the Aurora PostgreSQL source with a customized DB cluster parameter group

For Aurora PostgreSQL DB clusters, you must create the custom parameter group within the Amazon RDS Database Preview Environment, in the US East (Ohio) Region. You can directly access the Amazon RDS Preview Environment.

To create an Aurora PostgreSQL database, complete the following steps:

  1. On the Amazon RDS console, choose Parameter groups in the navigation pane.
  2. Choose Create parameter group.
  3. For Parameter group family, choose aurora-postgresql15.
  4. For Type, choose DB Cluster Parameter Group.
  5. For Group name, enter a name (for example, zero-etl-custom-pg-postgres).
  6. Choose Create.
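If you prefer to script this step, a roughly equivalent AWS CLI call is sketched below; the group name matches the example above, and the RDS Database Preview Environment may require a preview-specific endpoint or Region setting.

# Sketch: create the custom DB cluster parameter group with the AWS CLI.
# Run against the RDS Database Preview Environment endpoint if required.
aws rds create-db-cluster-parameter-group \
  --db-cluster-parameter-group-name zero-etl-custom-pg-postgres \
  --db-parameter-group-family aurora-postgresql15 \
  --description "Custom cluster parameter group for zero-ETL integration" \
  --region us-east-2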

Aurora PostgreSQL zero-ETL integrations with Amazon Redshift require specific values for the Aurora DB cluster parameters, including enabling enhanced logical replication (aurora.enhanced_logical_replication).

  1. On the Parameter groups page, select the newly created parameter group.
  2. On the Actions menu, choose Edit.
  3. Set the following Aurora PostgreSQL (aurora-postgresql15 family) cluster parameter settings:
    • rds.logical_replication=1
    • aurora.enhanced_logical_replication=1
    • aurora.logical_replication_backup=0
    • aurora.logical_replication_globaldb=0
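Alternatively, if you prefer scripting over the console edit, the same parameter values can be applied from the AWS CLI. The following is a minimal sketch, assuming the parameter group created earlier; static parameters such as rds.logical_replication take effect after a cluster reboot.

# Sketch: apply the zero-ETL parameter values to the custom cluster parameter group
aws rds modify-db-cluster-parameter-group \
  --db-cluster-parameter-group-name zero-etl-custom-pg-postgres \
  --parameters \
    "ParameterName=rds.logical_replication,ParameterValue=1,ApplyMethod=pending-reboot" \
    "ParameterName=aurora.enhanced_logical_replication,ParameterValue=1,ApplyMethod=pending-reboot" \
    "ParameterName=aurora.logical_replication_backup,ParameterValue=0,ApplyMethod=pending-reboot" \
    "ParameterName=aurora.logical_replication_globaldb,ParameterValue=0,ApplyMethod=pending-reboot"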

Enabling enhanced logical replication (aurora.enhanced_logical_replication) automatically sets the REPLICA IDENTITY parameter to FULL, which means that all column values are written to the write ahead log (WAL).

  1. Choose Save changes.
  2. Choose Databases in the navigation pane, then choose Create database.
  3. For Engine type, select Amazon Aurora.
  4. For Edition, select Amazon Aurora PostgreSQL-Compatible Edition.
  5. For Available versions, choose Aurora PostgreSQL (compatible with PostgreSQL 15.4 and Zero-ETL Support).
  6. For Templates, select Production.
  7. For DB cluster identifier, enter zero-etl-source-pg.
  8. Under Credentials Settings, enter a password for Master password or use the option to automatically generate a password for you.
  9. In the Instance configuration section, select Memory optimized classes.
  10. Choose a suitable instance size (the default is db.r5.2xlarge).
  11. Under Additional configuration, for DB cluster parameter group, choose the parameter group you created earlier (zero-etl-custom-pg-postgres).
  12. Leave the default settings for the remaining configurations.
  13. Choose Create database.

In a few minutes, this should spin up an Aurora PostgreSQL cluster with one writer and one reader instance, and the status will change from Creating to Available. The newly created Aurora PostgreSQL cluster will be the source for the zero-ETL integration.


The next step is to create a named database in Amazon Aurora PostgreSQL for the zero-ETL integration.

The PostgreSQL resource model allows you to create multiple databases within a cluster. Therefore, during the zero-ETL integration creation step, you need to specify which database you want to use as the source for your integration.

When setting up PostgreSQL, you get three standard databases out of the box: template0, template1, and postgres. Whenever you create a new database in PostgreSQL, you are actually basing it off one of these three databases in your cluster. The database created during Aurora PostgreSQL cluster creation is based on template0. The CREATE DATABASE command works by copying an existing database, and if not explicitly specified, by default, it copies the standard system database template1. For the named database for zero-ETL integration, the database is required to be created using template1 and not template0. Therefore, if an initial database name is added under Additional configuration, that would be created using template0 and cannot be used for zero-ETL integration.

  1. To create a new named database using CREATE DATABASE within the new Aurora PostgreSQL cluster zero-etl-source-pg, first get the endpoint of the writer instance of the PostgreSQL cluster.
  2. From a terminal or using AWS CloudShell, connect to the Aurora PostgreSQL cluster and run the following commands to install psql and create a new database zeroetl_db:
    sudo dnf install postgresql15
    psql --version
    psql -h <RDS Write Instance Endpoint> -p 5432 -U postgres
    create database zeroetl_db template template1;

Adding template template1 is optional, because by default, if not mentioned, CREATE DATABASE will use template1.

You can also connect via a client and create the database. Refer to Connect to an Aurora PostgreSQL DB cluster for the options to connect to the PostgreSQL cluster.

Configure Redshift Serverless as destination

After you create your Aurora PostgreSQL source database cluster, you configure a Redshift target data warehouse. The data warehouse must comply with the following requirements:

  • Created in preview (for Aurora PostgreSQL sources only)
  • Uses an RA3 node type (ra3.16xlarge, ra3.4xlarge, or ra3.xlplus) with at least two nodes, or Redshift Serverless
  • Encrypted (if using a provisioned cluster)

For this post, we create and configure a Redshift Serverless workgroup and namespace as the target data warehouse, following these steps:

  1. On the Amazon Redshift console, choose Serverless dashboard in the navigation pane.

Because the zero-ETL integration for Amazon Aurora PostgreSQL to Amazon Redshift has been launched in preview (not for production purposes), you need to create the target data warehouse in a preview environment.

  1. Choose Create preview workgroup.

The first step is to configure the Redshift Serverless workgroup.

  1. For Workgroup name, enter a name (for example, zero-etl-target-rs-wg).
  2. Optionally, you can choose the capacity to limit the compute resources of the data warehouse. The capacity can be configured in increments of 8, from 8 to 512 RPUs. For this post, set it to 8 RPUs.
  3. Choose Next.

Next, you need to configure the namespace of the data warehouse.

  1. Select Create a new namespace.
  2. For Namespace, enter a name (for example, zero-etl-target-rs-ns).
  3. Choose Next.
  4. Choose Create workgroup.
  5. After the workgroup and namespace are created, choose Namespace configurations in the navigation pane and open the namespace configuration.
  6. On the Resource policy tab, choose Add authorized principals.

An authorized principal identifies the user or role that can create zero-ETL integrations into the data warehouse.


  1. For IAM principal ARN or AWS account ID, you can enter either the ARN of the AWS user or role, or the ID of the AWS account that you want to grant access to create zero-ETL integrations. (An account ID is stored as an ARN.)
  2. Choose Save changes.

After the Authorized principal is configured, you need to allow the source database to update your Redshift data warehouse. Therefore, you must add the source database as an authorized integration source to the namespace.

  1. Choose Add authorized integration source.
  2. For Authorized source ARN, enter the ARN of the Aurora PostgreSQL cluster, because it’s the source of the zero-ETL integration.

You can obtain the ARN of the Aurora PostgreSQL cluster on the Amazon RDS console, on the Configuration tab, under Amazon Resource Name.

  1. Choose Save changes.
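If you script your environments, the Redshift Serverless namespace and workgroup themselves can be created with the AWS CLI along the lines of the following sketch; the preview-only settings and the resource policy edits above still need to be applied separately.

# Sketch: create the target namespace and workgroup (names match the example above)
aws redshift-serverless create-namespace \
  --namespace-name zero-etl-target-rs-ns --region us-east-2
aws redshift-serverless create-workgroup \
  --workgroup-name zero-etl-target-rs-wg \
  --namespace-name zero-etl-target-rs-ns \
  --base-capacity 8 --region us-east-2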

Update the Redshift Serverless workgroup to enable case-sensitive identifiers

Amazon Aurora PostgreSQL is case sensitive by default, whereas case sensitivity is disabled by default on all Redshift provisioned clusters and Redshift Serverless workgroups. For the integration to be successful, the case sensitivity parameter enable_case_sensitive_identifier must be enabled for the data warehouse.

In order to modify the enable_case_sensitive_identifier parameter in a Redshift Serverless workgroup, you need to use the AWS Command Line Interface (AWS CLI), because the Amazon Redshift console doesn’t currently support modifying Redshift Serverless parameter values. Run the following command to update the parameter:

aws redshift-serverless update-workgroup --workgroup-name zero-etl-target-rs-wg --config-parameters parameterKey=enable_case_sensitive_identifier,parameterValue=true --region us-east-2

A simple way to connect to the AWS CLI is to use CloudShell, which is a browser-based shell that provides command line access to the AWS resources and tools directly from a browser. The following screenshot illustrates how to run the command in the CloudShell.
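After the update completes, you can confirm that the parameter is set by describing the workgroup, for example:

# Check the workgroup's configuration parameters
aws redshift-serverless get-workgroup \
  --workgroup-name zero-etl-target-rs-wg \
  --query 'workgroup.configParameters' \
  --region us-east-2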


Configure required permissions

To create a zero-ETL integration, your user or role must have an attached identity-based policy with the appropriate AWS Identity and Access Management (IAM) permissions. An AWS account owner can configure the required permissions for users or roles who may create zero-ETL integrations. The sample policy allows the associated principal to perform the following actions:

  • Create zero-ETL integrations for the source Aurora DB cluster.
  • View and delete all zero-ETL integrations.
  • Create inbound integrations into the target data warehouse. Amazon Redshift has a different ARN format for provisioned and serverless:
    • Provisioned cluster – arn:aws:redshift:{region}:{account-id}:namespace:namespace-uuid
    • Serverless – arn:aws:redshift-serverless:{region}:{account-id}:namespace/namespace-uuid

This permission is not required if the same account owns the Redshift data warehouse and this account is an authorized principal for that data warehouse.

Complete the following steps to configure the permissions:

  1. On the IAM console, choose Policies in the navigation pane.
  2. Choose Create policy.
  3. Create a new policy called rds-integrations using the following JSON. For the Amazon Aurora PostgreSQL preview, all ARNs and actions within the Amazon RDS Database Preview Environment have -preview appended to the service namespace. Therefore, in the following policy, instead of rds, you need to use rds-preview. For example, rds-preview:CreateIntegration.
{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "rds:CreateIntegration"
        ],
        "Resource": [
            "arn:aws:rds:{region}:{account-id}:cluster:source-cluster",
            "arn:aws:rds:{region}:{account-id}:integration:*"
        ]
    },
    {
        "Effect": "Allow",
        "Action": [
            "rds:DescribeIntegration"
        ],
        "Resource": ["*"]
    },
    {
        "Effect": "Allow",
        "Action": [
            "rds:DeleteIntegration"
        ],
        "Resource": [
            "arn:aws:rds:{region}:{account-id}:integration:*"
        ]
    },
    {
        "Effect": "Allow",
        "Action": [
            "redshift:CreateInboundIntegration"
        ],
        "Resource": [
            "arn:aws:redshift:{region}:{account-id}:cluster:namespace-uuid"
        ]
    }]
}
  1. Attach the policy you created to your IAM user or role.
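If you manage IAM through the CLI, the equivalent of the preceding two steps looks roughly like the following; the policy JSON is assumed to be saved locally as rds-integrations.json, and the role name is a placeholder.

# Sketch: create the policy and attach it to the role that will create the integration
aws iam create-policy \
  --policy-name rds-integrations \
  --policy-document file://rds-integrations.json
aws iam attach-role-policy \
  --role-name <your-integration-role> \
  --policy-arn arn:aws:iam::{account-id}:policy/rds-integrations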

Create the zero-ETL integration

To create the zero-ETL integration, complete the following steps:

  1. On the Amazon RDS console, choose Zero-ETL integrations in the navigation pane.
  2. Choose Create zero-ETL integration.
  3. For Integration identifier, enter a name (for example, zero-etl-demo).
  4. Choose Next.
  5. For Source database, choose Browse RDS databases.
  6. Select the source database zero-etl-source-pg and choose Choose.
  7. For Named database, enter the name of the new database created in Amazon Aurora PostgreSQL (zeroetl_db).
  8. Choose Next.
  9. In the Target section, for AWS account, select Use the current account.
  10. For Amazon Redshift data warehouse, choose Browse Redshift data warehouses.

We discuss the Specify a different account option later in this section.

  1. Select the Redshift Serverless destination namespace (zero-etl-target-rs-ns), and choose Choose.
  2. Add tags and encryption, if applicable, and choose Next.
  3. Verify the integration name, source, target, and other settings, and choose Create zero-ETL integration.

You can choose the integration on the Amazon RDS console to view its details and monitor its progress. It takes about 30 minutes for the status to change from Creating to Active, depending on the size of the dataset already available in the source.
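You can also track the integration status from the AWS CLI; a minimal sketch follows (in the preview environment, the service prefix and endpoint may differ, for example rds-preview).

# List zero-ETL integrations and their current status
aws rds describe-integrations \
  --query 'Integrations[].{Name:IntegrationName,Status:Status}' \
  --region us-east-2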


To specify a target Redshift data warehouse that’s in another AWS account, you must create a role that allows users in the current account to access resources in the target account. For more information, refer to Providing access to an IAM user in another AWS account that you own.

Create a role in the target account with the following permissions:

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Action":[
            "redshift:DescribeClusters",
            "redshift-serverless:ListNamespaces"
         ],
         "Resource":[
            "*"
         ]
      }
   ]
}

The role must have the following trust policy, which specifies the target account ID. You can do this by creating a role with a trusted entity as an AWS account ID in another account.

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Principal":{
            "AWS": "arn:aws:iam::{external-account-id}:root"
         },
         "Action":"sts:AssumeRole"
      }
   ]
}

The following screenshot illustrates creating this on the IAM console.


Then, while creating the zero-ETL integration, for Specify a different account, choose the destination account ID and the name of the role you created.
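If you prefer to create this cross-account role with the CLI, a sketch follows; the role name is hypothetical, and the two JSON documents are the permissions and trust policies shown above, saved locally.

# Sketch: create the role in the target account and attach the describe permissions
aws iam create-role \
  --role-name zero-etl-cross-account-role \
  --assume-role-policy-document file://trust-policy.json
aws iam put-role-policy \
  --role-name zero-etl-cross-account-role \
  --policy-name zero-etl-describe-targets \
  --policy-document file://permissions-policy.json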

Create a database from the integration in Amazon Redshift

To create your database, complete the following steps:

  1. On the Redshift Serverless dashboard, navigate to the zero-etl-target-rs-ns namespace.
  2. Choose Query data to open the query editor v2.
  3. Connect to the Redshift Serverless data warehouse by choosing Create connection.
  4. Obtain the integration_id from the svv_integration system table:
    SELECT integration_id FROM svv_integration; -- copy this result, use in the next sql

  5. Use the integration_id from the previous step to create a new database from the integration. You must also include a reference to the named database within the cluster that you specified when you created the integration.
    CREATE DATABASE aurora_pg_zetl FROM INTEGRATION '<result from above>' DATABASE zeroetl_db;
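The same two statements can also be scripted through the Redshift Data API instead of query editor v2; the following is a minimal sketch, assuming the Serverless workgroup created earlier.

# Sketch: run the integration lookup via the Redshift Data API
STMT_ID=$(aws redshift-data execute-statement \
  --workgroup-name zero-etl-target-rs-wg \
  --database dev \
  --sql "SELECT integration_id FROM svv_integration;" \
  --query Id --output text)

# Check statement status, then fetch the result once it has finished
aws redshift-data describe-statement --id "$STMT_ID"
aws redshift-data get-statement-result --id "$STMT_ID"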


The integration is now complete, and an entire snapshot of the source is reflected as is in the destination. Ongoing changes will be synced in near real time.

Analyze the near real time transactional data

Now you can start analyzing the near real time data from the Amazon Aurora PostgreSQL source to the Amazon Redshift target:

  1. Connect to your source Aurora PostgreSQL database. In this demo, we use psql to connect to Amazon Aurora PostgreSQL:
    psql -h <amazon_aurora_postgres_writer_endpoint> -p 5432 -d zeroetl_db -U postgres


  1. Create a sample table with a primary key. Make sure that all tables to be replicated from source to target have a primary key. Tables without a primary key can’t be replicated to the target.
CREATE TABLE NATION  ( 
N_NATIONKEY  INTEGER NOT NULL PRIMARY KEY, 
N_NAME       CHAR(25) NOT NULL,
N_REGIONKEY  INTEGER NOT NULL,
N_COMMENT    VARCHAR(152));
  1. Insert dummy data into the nation table and verify if the data is properly loaded:
INSERT INTO nation VALUES (1, 'USA', 1 , 'united states of america');
SELECT * FROM nation;


This sample data should now be replicated in Amazon Redshift.

Analyze the source data in the destination

On the Redshift Serverless dashboard, open query editor v2 and connect to the database aurora_pg_zetl you created earlier.

Run the following query to validate the successful replication of the source data into Amazon Redshift:

SELECT * FROM aurora_pg_zetl.public.nation;


You can also use the following query to validate the initial snapshot or ongoing change data capture (CDC) activity:

SELECT * FROM sys_integration_activity ORDER BY last_commit_timestamp desc;


Monitoring

There are several options to obtain metrics on the performance and status of the Aurora PostgreSQL zero-ETL integration with Amazon Redshift.

If you navigate to the Amazon Redshift console, you can choose Zero-ETL integrations in the navigation pane. You can choose the zero-ETL integration you want and display Amazon CloudWatch metrics related to the integration. These metrics are also directly available in CloudWatch.


For each integration, there are two tabs with information available:

  • Integration metrics – Shows metrics such as the number of tables successfully replicated and lag details
  • Table statistics – Shows details about each table replicated from Amazon Aurora PostgreSQL to Amazon Redshift

In addition to the CloudWatch metrics, you can query system views such as svv_integration and sys_integration_activity (used earlier in this post), which provide information about the integrations.

Clean up

When you delete a zero-ETL integration, your transactional data isn’t deleted from Aurora or Amazon Redshift, but Aurora doesn’t send new data to Amazon Redshift.

To delete a zero-ETL integration, complete the following steps:

  1. On the Amazon RDS console, choose Zero-ETL integrations in the navigation pane.
  2. Select the zero-ETL integration that you want to delete and choose Delete.
  3. To confirm the deletion, enter confirm and choose Delete.

Conclusion

In this post, we explained how you can set up the zero-ETL integration from Amazon Aurora PostgreSQL to Amazon Redshift, a feature that reduces the effort of maintaining data pipelines and enables near real time analytics on transactional and operational data.

To learn more about zero-ETL integration, refer to Working with Aurora zero-ETL integrations with Amazon Redshift and Limitations.


About the Authors

Raks Khare is an Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers architect data analytics solutions at scale on the AWS platform.

Juan Luis Polo Garzon is an Associate Specialist Solutions Architect at AWS, specialized in analytics workloads. He has experience helping customers design, build, and modernize their cloud-based analytics solutions. Outside of work, he enjoys traveling, the outdoors, hiking, and attending live music events.

Sushmita Barthakur is a Senior Solutions Architect at Amazon Web Services, helping enterprise customers architect their workloads on AWS. With a strong background in data analytics and data management, she has extensive experience helping customers architect and build business intelligence and analytics solutions, both on premises and in the cloud. Sushmita is based out of Tampa, FL and enjoys traveling, reading, and playing tennis.

Data load made easy and secure in Amazon Redshift using Query Editor V2

Post Syndicated from Raks Khare original https://aws.amazon.com/blogs/big-data/data-load-made-easy-and-secure-in-amazon-redshift-using-query-editor-v2/

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to analyze all your data efficiently and securely. Users such as data analysts, database developers, and data scientists use SQL to analyze their data in Amazon Redshift data warehouses. Amazon Redshift provides a web-based Query Editor V2 in addition to supporting connectivity via ODBC/JDBC or the Amazon Redshift Data API.

Amazon Redshift Query Editor V2 makes it easy to query your data using SQL and gain insights by visualizing your results using charts and graphs with a few clicks. With Query Editor V2, you can collaborate with team members by easily sharing saved queries, results, and analyses in a secure way.

Analysts performing ad hoc analyses in their workspace need to load sample data into Amazon Redshift by creating a table and loading data from their desktop. They then want to join that data with the curated data in their data warehouse. Data engineers and data scientists have test data and want to load it into Amazon Redshift for their machine learning (ML) or analytics use cases.

In this post, we walk through a new feature in Query Editor V2 to easily load data files either from your local desktop or Amazon Simple Storage Service (Amazon S3).

Prerequisites

Complete the following prerequisite steps:

    1. Create an Amazon Redshift provisioned cluster or Serverless endpoint.
    2. Provide access to Query Editor V2 for your end-users. To enable your users to access Query Editor V2 using IAM, as an administrator, you can attach one of the following AWS-managed policies to the AWS Identity and Access Management (IAM) user or role to grant permission:
      • AmazonRedshiftQueryEditorV2FullAccess – Grants full access to the Query Editor V2 operations and resources.
      • AmazonRedshiftQueryEditorV2NoSharing – Grants the ability to work with Query Editor V2 without sharing resources.
      • AmazonRedshiftQueryEditorV2ReadSharing – Grants the ability to work with Query Editor V2 with limited sharing of resources. The granted principal can read the resources shared with its team but can’t update them.
      • AmazonRedshiftQueryEditorV2ReadWriteSharing – Grants the ability to work with Query Editor V2 with sharing of resources. The granted principal can read and update the resources shared with its team.
    3. Provide access to the S3 bucket to load data from a local desktop file.
      • To enable your users to load data from a local desktop using Query Editor V2, as an administrator, you have to specify a common S3 bucket, and the user account must be configured with proper permissions. You can use the following IAM policy as an example to configure your IAM user or role:
        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": [
                        "s3:ListBucket",
                        "s3:GetBucketLocation"
                    ],
                    "Resource": [
                        "arn:aws:s3:::<staging-bucket-name>>"
                    ]
                },
                {
                    "Effect": "Allow",
                    "Action": [
                        "s3:PutObject",
                        "s3:GetObject",
                        "s3:DeleteObject"
                    ],
                    "Resource": [
                        "arn:aws:s3:::<staging-bucket-name>[/<optional-prefix>]/${aws:userid}/*"
                    ]
                }
            ]
        }
        

      • It’s also recommended to have proper separation of data access when loading data files from your local desktop. You can use the following S3 bucket policy as an example to separate data access between users of the staging bucket you configured:
        {
         "Version": "2012-10-17",
            "Statement": [
                {"Sid": "userIdPolicy",
                    "Effect": "Deny",
                    "Principal": "*",
                    "Action": ["s3:PutObject",
                               "s3:GetObject",
                               "s3:DeleteObject"],
                    "NotResource": [
                        "arn:aws:s3:::<staging-bucket-name>[/<optional-prefix>]/${aws:userid}/*"
                    ]
                 }
            ]
        }
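To apply this separation from the command line, you could attach the bucket policy with something like the following sketch; the bucket name is a placeholder, and the JSON is the policy above saved locally.

# Sketch: attach the per-user separation policy to the staging bucket
aws s3api put-bucket-policy \
  --bucket <staging-bucket-name> \
  --policy file://staging-bucket-policy.json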
        

Configure Query Editor V2 for your AWS account

As an admin, you must first configure Query Editor V2 before providing access to your end-users. On the Amazon Redshift console, choose Query editor v2 in the navigation pane.

If you’re accessing Query Editor v2 for the first time, you must configure your account by providing AWS Key Management Service (AWS KMS) encryption and, optionally, an S3 bucket.

By default, an AWS-owned key is used to encrypt resources. Optionally, you can create a symmetric customer managed key to encrypt Query Editor V2 resources such as saved queries and query results using the AWS KMS console or AWS KMS API operations.

The S3 bucket URI is required when loading data from your local desktop. You can provide the S3 URI of the same bucket that you configured earlier as a prerequisite.


If you have previously configured Query Editor V2 with only AWS KMS encryption, you can choose Account Settings after launching the interface to update the S3 URI to support loading from your local desktop.


Load data from your local desktop

Users such as data analysts, database developers, and data scientists can now load local files up to 5 MB in size into Amazon Redshift tables from Query Editor V2, without using the COPY command. The supported data formats are CSV, JSON, DELIMITER, FIXEDWIDTH, SHAPEFILE, AVRO, PARQUET, and ORC. Complete the following steps:

      1. On the Amazon Redshift console, navigate to Query Editor V2.
      2. Choose Load data.
      3. Choose Load from local file and Browse to choose a local file. You can download the student_info.csv file to use as an example.
      4. If your file has column headers as the first row, keep the default selection of Ignore header rows as 1 to ignore the first row.
      5. If your file has date columns, choose Data conversion parameters.
      6. Select Date format, set it to auto, and choose Next.
      7. Choose Load new table to automatically infer the file schema.
      8. Specify the values for Cluster or workgroup, Database, Schema, and Table (for example, Student_info) to load data to.
      9. Choose Create table.

A success message appears that the table was created. Now you can load data into the newly created table from a local file.

      1. Choose Load data.

A message appears that the data load was successful.

      1. Query the Student_info table to see the data.

Load data from Amazon S3

You can easily load data from Amazon S3 into an Amazon Redshift table using Query Editor V2. Complete the following steps:

      1. On the Amazon Redshift console, launch Query Editor V2 and connect to your cluster.
      2. Browse to the database name (for example, dev), the public schema, and expand Tables.
      3. You can automatically infer the schema of an S3 file, similar to the Load from local file option shown earlier; however, for this demo, we also show how to load data into an existing table. Run the following CREATE TABLE script to make a sample table (for this example, public.customer):
CREATE TABLE customer ( 
	c_custkey int8 NOT NULL , 
	c_name varchar(25) NOT NULL, 
	c_address varchar(40) NOT NULL, 
	c_nationkey int4 NOT NULL, 
	c_phone char(15) NOT NULL, 
	c_acctbal numeric(12,2) NOT NULL, 
	c_mktsegment char(10) NOT NULL, 
	c_comment varchar(117) NOT NULL, 
PRIMARY Key(C_CUSTKEY) 
) DISTKEY(c_custkey) sortkey(c_custkey);
      1. Choose Load data.
      2. Choose Load from S3 bucket.
      3. For this post, we load data from the TPCH Sample data GitHub repo, so for the S3 URI, enter s3://redshift-downloads/TPC-H/2.18/10GB/customer.tbl.
      4. For S3 file location, choose us-east-1.
      5. For File format, choose Delimiter.
      6. For Delimiter character, enter |.
      7. Choose Data conversion parameters, then select Time format and Date format as auto.
      8. Choose Back.

Refer to Data conversion parameters for more details.


      1. Choose Load operations.
      2. Select Automatic update for compression encodings.
      3. Select Stop loading when maximum number of errors has been exceeded and specify a value (for example, 100).
      4. Select Statistics update and ON, then choose Next.

Refer to Data load operations for more details.


      1. Choose Load existing table.
      2. Specify the Cluster or workgroup, Database, Schema (for example, public), and Table name (for example, customer).
      3. For IAM role, choose a suitable IAM role.
      4. Choose Load data.

Query Editor V2 generates the COPY command and runs it on the Amazon Redshift cluster. The results of the COPY command are displayed in the Result section upon completion.
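For reference, the generated statement is an ordinary Redshift COPY. The exact text query editor v2 produces may differ, but a roughly equivalent command, run here through the Redshift Data API, would look like the following sketch (cluster identifier, database user, and IAM role ARN are placeholders).

# Sketch: an equivalent COPY for the same S3 load, issued via the Data API
aws redshift-data execute-statement \
  --cluster-identifier <your-cluster-id> \
  --db-user <your-db-user> \
  --database dev \
  --sql "COPY public.customer FROM 's3://redshift-downloads/TPC-H/2.18/10GB/customer.tbl' \
         IAM_ROLE '<your-redshift-iam-role-arn>' \
         DELIMITER '|' DATEFORMAT 'auto' TIMEFORMAT 'auto' \
         REGION 'us-east-1' COMPUPDATE ON STATUPDATE ON MAXERROR 100;"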


Conclusion

In this post, we showed how Amazon Redshift Query Editor V2 has simplified the process to load data into Amazon Redshift from Amazon S3 or your local desktop, thereby accelerating the data analysis. It’s an easy-to-use feature that your teams can start using to load and query datasets. If you have any questions or suggestions, please leave a comment.


About the Authors

Raks Khare is an Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers architect data analytics solutions at scale on the AWS platform.

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Erol Murtezaoglu, a Technical Product Manager at AWS, is an inquisitive and enthusiastic thinker with a drive for self-improvement and learning. He has a strong and proven technical background in software development and architecture, balanced with a drive to deliver commercially successful products. Erol highly values the process of understanding customer needs and problems, in order to deliver solutions that exceed expectations.

Sapna Maheshwari is a Sr. Solutions Architect at Amazon Web Services. She has over 18 years of experience in data and analytics. She is passionate about telling stories with data and enjoys creating engaging visuals to unearth actionable insights.

Karthik Ramanathan is a Software Engineer with Amazon Redshift and is based in San Francisco. He brings close to two decades of development experience across the networking, data storage and IoT verticals. When not at work he is also a writer and loves to be in the water.

Albert Harkema is a Software Development Engineer at AWS. He is known for his curiosity and deep-seated desire to understand the inner workings of complex systems. His inquisitive nature drives him to develop software solutions that make life easier for others. Albert’s approach to problem-solving emphasizes efficiency, reliability, and long-term stability, ensuring that his work has a tangible impact. Through his professional experiences, he has discovered the potential of technology to improve everyday life.

Share and publish your Snowflake data to AWS Data Exchange using Amazon Redshift data sharing

Post Syndicated from Raks Khare original https://aws.amazon.com/blogs/big-data/share-and-publish-your-snowflake-data-to-aws-data-exchange-using-amazon-redshift-data-sharing/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. Today, tens of thousands of AWS customers—from Fortune 500 companies, startups, and everything in between—use Amazon Redshift to run mission-critical business intelligence (BI) dashboards, analyze real-time streaming data, and run predictive analytics. With the constant increase in generated data, Amazon Redshift customers continue to achieve successes in delivering better service to their end-users, improving their products, and running an efficient and effective business.

In this post, we discuss a customer who is currently using Snowflake to store analytics data. The customer needs to offer this data to clients who are using Amazon Redshift via AWS Data Exchange, the world’s most comprehensive service for third-party datasets. We explain in detail how to implement a fully integrated process that will automatically ingest data from Snowflake into Amazon Redshift and offer it to clients via AWS Data Exchange.

Overview of the solution

The solution consists of four high-level steps:

  1. Configure Snowflake to push the changed data for identified tables into an Amazon Simple Storage Service (Amazon S3) bucket.
  2. Use a custom-built Redshift Auto Loader to load this Amazon S3 landed data to Amazon Redshift.
  3. Merge the data from the change data capture (CDC) S3 staging tables to Amazon Redshift tables.
  4. Use Amazon Redshift data sharing to license the data to customers via AWS Data Exchange as a public or private offering.

The following diagram illustrates this workflow.


Prerequisites

To get started, you need the following prerequisites:

Configure Snowflake to track the changed data and unload it to Amazon S3

In Snowflake, identify the tables that you need to replicate to Amazon Redshift. For the purpose of this demo, we use the data in the TPCH_SF1 schema’s Customer, LineItem, and Orders tables of the SNOWFLAKE_SAMPLE_DATA database, which comes out of the box with your Snowflake account.

  1. Make sure that the Snowflake external stage name unload_to_s3 created in the prerequisites is pointing to the S3 prefix s3-redshift-loader-source created in the previous step.
  2. Create a new schema BLOG_DEMO in the DEMO_DB database:
    CREATE SCHEMA demo_db.blog_demo;
  3. Duplicate the Customer, LineItem, and Orders tables in the TPCH_SF1 schema to the BLOG_DEMO schema:
    CREATE TABLE CUSTOMER AS 
    SELECT * FROM snowflake_sample_data.tpch_sf1.CUSTOMER;
    CREATE TABLE ORDERS AS
    SELECT * FROM snowflake_sample_data.tpch_sf1.ORDERS;
    CREATE TABLE LINEITEM AS 
    SELECT * FROM snowflake_sample_data.tpch_sf1.LINEITEM;

  4. Verify that the tables have been duplicated successfully:
    SELECT table_catalog, table_schema, table_name, row_count, bytes
    FROM INFORMATION_SCHEMA.TABLES
    WHERE TABLE_SCHEMA = 'BLOG_DEMO'
    ORDER BY ROW_COUNT;


  5. Create table streams to track data manipulation language (DML) changes made to the tables, including inserts, updates, and deletes:
    CREATE OR REPLACE STREAM CUSTOMER_CHECK ON TABLE CUSTOMER;
    CREATE OR REPLACE STREAM ORDERS_CHECK ON TABLE ORDERS;
    CREATE OR REPLACE STREAM LINEITEM_CHECK ON TABLE LINEITEM;

  6. Perform DML changes to the tables (for this post, we run UPDATE on all tables and MERGE on the customer table):
    UPDATE customer 
    SET c_comment = 'Sample comment for blog demo' 
    WHERE c_custkey between 0 and 10; 
    UPDATE orders 
    SET o_comment = 'Sample comment for blog demo' 
    WHERE o_orderkey between 1800001 and 1800010; 
    UPDATE lineitem 
    SET l_comment = 'Sample comment for blog demo' 
    WHERE l_orderkey between 3600001 and 3600010;
    MERGE INTO customer c 
    USING 
    ( 
    SELECT n_nationkey 
    FROM snowflake_sample_data.tpch_sf1.nation s 
    WHERE n_name = 'UNITED STATES') n 
    ON n.n_nationkey = c.c_nationkey 
    WHEN MATCHED THEN UPDATE SET c.c_comment = 'This is US based customer1';

  7. Validate that the stream tables have recorded all changes:
    SELECT * FROM CUSTOMER_CHECK; 
    SELECT * FROM ORDERS_CHECK; 
    SELECT * FROM LINEITEM_CHECK;

    For example, we can query the following customer key value to verify how the events were recorded for the MERGE statement on the customer table:

    SELECT * FROM CUSTOMER_CHECK where c_custkey = 60027;

    We can see the METADATA$ISUPDATE column as TRUE, and we see DELETE followed by INSERT in the METADATA$ACTION column.

  8. Run the COPY command to offload the CDC from the stream tables to the S3 bucket using the external stage name unload_to_s3. In the following code, we're also copying the data to S3 folders ending with _stg to ensure that when Redshift Auto Loader automatically creates these tables in Amazon Redshift, they get created and marked as staging tables:
    COPY INTO @unload_to_s3/customer_stg/
    FROM (select *, sysdate() as LAST_UPDATED_TS from demo_db.blog_demo.customer_check)
    FILE_FORMAT = (TYPE = PARQUET)
    OVERWRITE = TRUE HEADER = TRUE;

    COPY INTO @unload_to_s3/orders_stg/
    FROM (select *, sysdate() as LAST_UPDATED_TS from demo_db.blog_demo.orders_check)
    FILE_FORMAT = (TYPE = PARQUET)
    OVERWRITE = TRUE HEADER = TRUE;

    COPY INTO @unload_to_s3/lineitem_stg/ 
    FROM (select *, sysdate() as LAST_UPDATED_TS from demo_db.blog_demo.lineitem_check) 
    FILE_FORMAT = (TYPE = PARQUET) 
    OVERWRITE = TRUE HEADER = TRUE;

  9. Verify the data in the S3 bucket. There will be three sub-folders created in the s3-redshift-loader-source folder of the S3 bucket, and each will have .parquet data files. You can also automate the preceding COPY commands using tasks, which can be scheduled to run at a set frequency for automatic copy of CDC data from Snowflake to Amazon S3.
  10. Use the ACCOUNTADMIN role to assign the EXECUTE TASK privilege. In this scenario, we’re assigning the privileges to the SYSADMIN role:
    USE ROLE accountadmin;
    GRANT EXECUTE TASK, EXECUTE MANAGED TASK ON ACCOUNT TO ROLE sysadmin;

  11. Use the SYSADMIN role to create three separate tasks to run three COPY commands every 5 minutes:
    USE ROLE sysadmin;
    /* Task to offload Customer CDC table */ 
    CREATE TASK sf_rs_customer_cdc 
    WAREHOUSE = SMALL 
    SCHEDULE = 'USING CRON */5 * * * * UTC' 
    AS 
    COPY INTO @unload_to_s3/customer_stg/ 
    FROM (select *, sysdate() as LAST_UPDATED_TS from demo_db.blog_demo.customer_check) 
    FILE_FORMAT = (TYPE = PARQUET) 
    OVERWRITE = TRUE 
    HEADER = TRUE;
    /*Task to offload Orders CDC table */ 
    CREATE TASK sf_rs_orders_cdc 
    WAREHOUSE = SMALL 
    SCHEDULE = 'USING CRON */5 * * * * UTC' 
    AS 
    COPY INTO @unload_to_s3/orders_stg/ 
    FROM (select *, sysdate() as LAST_UPDATED_TS from demo_db.blog_demo.orders_check)
    FILE_FORMAT = (TYPE = PARQUET)
    OVERWRITE = TRUE HEADER = TRUE;

    /* Task to offload Lineitem CDC table */ 
    CREATE TASK sf_rs_lineitem_cdc 
    WAREHOUSE = SMALL 
    SCHEDULE = 'USING CRON */5 * * * * UTC' 
    AS 
    COPY INTO @unload_to_s3/lineitem_stg/ 
    FROM (select *, sysdate() as LAST_UPDATED_TS from demo_db.blog_demo.lineitem_check)
    FILE_FORMAT = (TYPE = PARQUET)
    OVERWRITE = TRUE HEADER = TRUE;

    When the tasks are first created, they’re in a SUSPENDED state.

  12. Alter the three tasks and set them to RESUME state:
    ALTER TASK sf_rs_customer_cdc RESUME;
    ALTER TASK sf_rs_orders_cdc RESUME;
    ALTER TASK sf_rs_lineitem_cdc RESUME;

  13. Validate that all three tasks have been resumed successfully:
    SHOW TASKS;
    Now the tasks will run every 5 minutes and look for new data in the stream tables to offload to Amazon S3. As soon as data is migrated from Snowflake to Amazon S3, Redshift Auto Loader automatically infers the schema and instantly creates corresponding tables in Amazon Redshift. Then, by default, it starts loading data from Amazon S3 to Amazon Redshift every 5 minutes. You can also change the default setting of 5 minutes.
  14. On the Amazon Redshift console, launch the query editor v2 and connect to your Amazon Redshift cluster.
  15. Browse to the dev database, public schema, and expand Tables.
    You can see three staging tables created with the same name as the corresponding folders in Amazon S3.
  16. Validate the data in one of the tables by running the following query:
    SELECT * FROM "dev"."public"."customer_stg";

Configure the Redshift Auto Loader utility

The Redshift Auto Loader makes data ingestion to Amazon Redshift significantly easier because it automatically loads data files from Amazon S3 to Amazon Redshift. The files are mapped to the respective tables by simply dropping files into preconfigured locations on Amazon S3. For more details about the architecture and internal workflow, refer to the GitHub repo.

We use an AWS CloudFormation template to set up Redshift Auto Loader. Complete the following steps:

  1. Launch the CloudFormation template.
  2. Choose Next.
  3. For Stack name, enter a name.
  4. Provide the following parameters.

    • RedshiftClusterIdentifier – The Amazon Redshift cluster identifier.
    • DatabaseUserName – The Amazon Redshift database user name that has access to run the SQL script.
    • DatabaseName – The name of the Amazon Redshift primary database where the SQL script is run.
    • DatabaseSchemaName – The Amazon Redshift schema name where the tables are created.
    • RedshiftIAMRoleARN – Default, or a valid IAM role ARN attached to the Amazon Redshift cluster. If your default IAM role is set for the cluster and has access to your S3 bucket, leave it at the default.
    • CopyCommandOptions – Additional COPY command data format parameters; the default is delimiter '|' gzip. If InitiateSchemaDetection = Yes, the process attempts to detect the schema and automatically set suitable COPY command options. If schema detection fails, or when InitiateSchemaDetection = No, this value is used as the default COPY command options to load data.
    • SourceS3Bucket – The S3 bucket where the data is stored. Make sure the IAM role that is associated with the Amazon Redshift cluster has access to this bucket.
    • InitiateSchemaDetection – Yes/No. Set to Yes to dynamically detect the schema prior to file load and create a table in Amazon Redshift if it doesn't already exist. If a table already exists, it won't drop or recreate the table. If schema detection fails, the process uses the default COPY options as specified in CopyCommandOptions.

    The Redshift Auto Loader uses the COPY command to load data into Amazon Redshift. For this post, set CopyCommandOptions as follows, and configure any supported COPY command options:

    delimiter '|' dateformat 'auto' TIMEFORMAT 'auto'

  5. Choose Next.
  6. Accept the default values on the next page and choose Next.
  7. Select the acknowledgement check box and choose Create stack.
  8. Monitor the progress of the stack creation and wait until it is complete.
  9. To verify the Redshift Auto Loader configuration, sign in to the Amazon S3 console and navigate to the S3 bucket you provided.
    You should see that a new directory, s3-redshift-loader-source, has been created.

Copy all the data files exported from Snowflake under s3-redshift-loader-source.
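You can quickly confirm that the exported files are where the Auto Loader expects them with a listing such as the following (the bucket name is a placeholder).

# List the staged CDC files
aws s3 ls s3://<your-bucket-name>/s3-redshift-loader-source/ \
  --recursive --human-readable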

Merge the data from the CDC S3 staging tables to Amazon Redshift tables

To merge your data from Amazon S3 to Amazon Redshift, complete the following steps:

  1. Create a temporary staging table merge_stg and insert all the rows from the S3 staging table that have metadata$action as INSERT, using the following code. This includes all the new inserts as well as the updates.
    CREATE TEMP TABLE merge_stg AS
    SELECT * FROM
    (
    SELECT *, DENSE_RANK() OVER (PARTITION BY c_custkey ORDER BY last_updated_ts DESC) AS rnk
    FROM customer_stg
    )
    WHERE rnk = 1 AND metadata$action = 'INSERT';

    The preceding code uses the window function DENSE_RANK() to select the latest entry for a given c_custkey by assigning a rank to each row for that c_custkey and ordering the data in descending order of last_updated_ts. We then select the rows with rnk = 1 and metadata$action = 'INSERT' to capture all the inserts and updates.

  2. Use the S3 staging table customer_stg to delete the records from the base table customer, which are marked as deletes or updates:
    DELETE FROM customer 
    USING customer_stg 
    WHERE customer.c_custkey = customer_stg.c_custkey;

    This deletes all the rows that are present in the CDC S3 staging table, which takes care of rows marked for deletion and updates.

  3. Use the temporary staging table merge_stg to insert the records marked for updates or inserts:
    INSERT INTO customer 
    SELECT c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment, c_comment 
    FROM merge_stg;

  4. Truncate the staging table, because we have already updated the target table:
    truncate customer_stg;
  5. You can also run the preceding steps as a stored procedure:
    CREATE OR REPLACE PROCEDURE merge_customer()
    AS $$
    BEGIN
    /*CREATING TEMP TABLE TO GET THE MOST LATEST RECORDS FOR UPDATES/NEW INSERTS*/
    CREATE TEMP TABLE merge_stg AS
    SELECT * FROM
    (
    SELECT *, DENSE_RANK() OVER (PARTITION BY c_custkey ORDER BY last_updated_ts DESC ) AS rnk
    FROM customer_stg
    )
    WHERE rnk = 1 AND metadata$action = 'INSERT';
    /* DELETING FROM THE BASE TABLE USING THE CDC STAGING TABLE ALL THE RECORDS MARKED AS DELETES OR UPDATES*/
    DELETE FROM customer
    USING customer_stg
    WHERE customer.c_custkey = customer_stg.c_custkey;
    /*INSERTING NEW/UPDATED RECORDS IN THE BASE TABLE*/ 
    INSERT INTO customer
    SELECT c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment, c_comment
    FROM merge_stg;
    truncate customer_stg;
    END;
    $$ LANGUAGE plpgsql;

    For example, let’s look at the before and after states of the customer table when there’s been a change in data for a particular customer.

    The following screenshot shows the new changes recorded in the customer_stg table for c_custkey = 74360.
    We can see two records for the customer with c_custkey = 74360: one with metadata$action as DELETE and one with metadata$action as INSERT. That means the record with this c_custkey was updated at the source, and these changes need to be applied to the target customer table in Amazon Redshift.

    The following screenshot shows the current state of the customer table before these changes have been merged using the preceding stored procedure:

  6. Now, to update the target table, we can run the stored procedure as follows:
    CALL merge_customer();
    The following screenshot shows the final state of the target table after the stored procedure is complete.

Run the stored procedure on a schedule

You can also run the stored procedure on a schedule via Amazon EventBridge. The scheduling steps are as follows:

  1. On the EventBridge console, choose Create rule.
  2. For Name, enter a meaningful name, for example, Trigger-Snowflake-Redshift-CDC-Merge.
  3. For Event bus, choose default.
  4. For Rule Type, select Schedule.
  5. Choose Next.
  6. For Schedule pattern, select A schedule that runs at a regular rate, such as every 10 minutes.
  7. For Rate expression, enter Value as 5 and choose Unit as Minutes.
  8. Choose Next.
  9. For Target types, choose AWS service.
  10. For Select a Target, choose Redshift cluster.
  11. For Cluster, choose the Amazon Redshift cluster identifier.
  12. For Database name, choose dev.
  13. For Database user, enter a user name with access to run the stored procedure. It uses temporary credentials to authenticate.
  14. Optionally, you can also use AWS Secrets Manager for authentication.
  15. For SQL statement, enter CALL merge_customer().
  16. For Execution role, select Create a new role for this specific resource.
  17. Choose Next.
  18. Review the rule parameters and choose Create rule.

After the rule has been created, it automatically triggers the stored procedure in Amazon Redshift every 5 minutes to merge the CDC data into the target table.
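The same schedule can also be created from the AWS CLI; the following is a hedged sketch only (the cluster ARN, execution role, and database user are placeholders, and the role must allow EventBridge to call the Redshift Data API).

# Sketch: schedule the merge procedure every 5 minutes via EventBridge
aws events put-rule \
  --name Trigger-Snowflake-Redshift-CDC-Merge \
  --schedule-expression "rate(5 minutes)"

aws events put-targets \
  --rule Trigger-Snowflake-Redshift-CDC-Merge \
  --targets '[{"Id":"redshift-merge-target",
               "Arn":"arn:aws:redshift:<region>:<account-id>:cluster:<cluster-identifier>",
               "RoleArn":"arn:aws:iam::<account-id>:role/<eventbridge-redshift-role>",
               "RedshiftDataParameters":{"Database":"dev","DbUser":"<db-user>","Sql":"CALL merge_customer();"}}]'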

Configure Amazon Redshift to share the identified data with AWS Data Exchange

Now that you have the data stored inside Amazon Redshift, you can publish it to customers using AWS Data Exchange.

  1. In Amazon Redshift, using any query editor, create the data share and add the tables to be shared:
    CREATE DATASHARE salesshare MANAGEDBY ADX;
    ALTER DATASHARE salesshare ADD SCHEMA tpch_sf1;
    ALTER DATASHARE salesshare ADD TABLE tpch_sf1.customer;


  2. On the AWS Data Exchange console, create your dataset.
  3. Select Amazon Redshift datashare.
  4. Create a revision in the dataset.
  5. Add assets to the revision (in this case, the Amazon Redshift data share).
  6. Finalize the revision.

After you create the dataset, you can publish it to the public catalog or directly to customers as a private product. For instructions on how to create and publish products, refer to NEW – AWS Data Exchange for Amazon Redshift.
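These AWS Data Exchange steps can also be scripted; a minimal CLI sketch follows (the dataset name is illustrative and the dataset ID is a placeholder).

# Sketch: create an ADX dataset backed by a Redshift datashare, then add a revision
aws dataexchange create-data-set \
  --asset-type REDSHIFT_DATA_SHARE \
  --name salesshare-dataset \
  --description "Customer data shared via Amazon Redshift datashare"
aws dataexchange create-revision \
  --data-set-id <data-set-id>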

Clean up

To avoid incurring future charges, complete the following steps:

  1. Delete the CloudFormation stack used to create the Redshift Auto Loader.
  2. Delete the Amazon Redshift cluster created for this demonstration.
  3. If you were using an existing cluster, drop the created external table and external schema.
  4. Delete the S3 bucket you created.
  5. Delete the Snowflake objects you created.

Conclusion

In this post, we demonstrated how you can set up a fully integrated process that continuously replicates data from Snowflake to Amazon Redshift and then uses Amazon Redshift to offer data to downstream clients over AWS Data Exchange. You can use the same architecture for other purposes, such as sharing data with other Amazon Redshift clusters within the same account, cross-accounts, or even cross-Regions if needed.


About the Authors

Raks Khare is an Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers architect data analytics solutions at scale on the AWS platform.

Ekta Ahuja is a Senior Analytics Specialist Solutions Architect at AWS. She is passionate about helping customers build scalable and robust data and analytics solutions. Before AWS, she worked in several different data engineering and analytics roles. Outside of work, she enjoys baking, traveling, and board games.

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling
and cooking.

Ahmed Shehata is a Senior Analytics Specialist Solutions Architect at AWS based in Toronto. He has more than two decades of experience helping customers modernize their data platforms. Ahmed is passionate about helping customers build efficient, performant, and scalable analytics solutions.

Migrate from Snowflake to Amazon Redshift using AWS Glue Python shell

Post Syndicated from Raks Khare original https://aws.amazon.com/blogs/big-data/migrate-from-snowflake-to-amazon-redshift-using-aws-glue-python-shell/

As the most widely used cloud data warehouse, Amazon Redshift makes it simple and cost-effective to analyze your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to analyze exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics without having to manage the data warehouse infrastructure. It natively integrates with other AWS services, facilitating the process of building enterprise-grade analytics applications in a manner that is not only cost-effective, but also avoids point solutions.

We are continuously innovating and releasing new features of Amazon Redshift, enabling the implementation of a wide range of data use cases and meeting requirements with performance and scale. For example, Amazon Redshift Serverless allows you to run and scale analytics workloads without having to provision and manage data warehouse clusters. Other features that help power analytics at scale with Amazon Redshift include automatic concurrency scaling for read and write queries, automatic workload management (WLM) for concurrency scaling, automatic table optimization, the new RA3 instances with managed storage to scale cloud data warehouses and reduce costs, cross-Region data sharing, data exchange, and the SUPER data type to store semi-structured data or documents as values. For the latest feature releases for Amazon Redshift, see Amazon Redshift What’s New. In addition to improving performance and scale, you can also gain up to three times better price performance with Amazon Redshift than other cloud data warehouses.

To take advantage of the performance, security, and scale of Amazon Redshift, customers are looking to migrate their data from their existing cloud warehouse in a way that is both cost optimized and performant. This post describes how to migrate a large volume of data from Snowflake to Amazon Redshift using AWS Glue Python shell in a manner that meets both these goals.

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning (ML), and application development. AWS Glue provides all the capabilities needed for data integration, allowing you to analyze your data in minutes instead of weeks or months. AWS Glue supports Python shell jobs, which run Python scripts as a shell, enabling you to author ETL processes in a familiar language. In addition, AWS Glue allows you to manage ETL jobs using AWS Glue workflows, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), and AWS Step Functions, automating and facilitating the orchestration of ETL steps.

Solution overview

The following architecture shows how an AWS Glue Python shell job migrates the data from Snowflake to Amazon Redshift in this solution.

Architecture

The solution comprises two stages:

  • Extract – The first part of the solution extracts data from Snowflake into an Amazon Simple Storage Service (Amazon S3) data lake
  • Load – The second part of the solution reads the data from the same S3 bucket and loads it into Amazon Redshift

For both stages, we connect the AWS Glue Python shell jobs to Snowflake and Amazon Redshift using database connectors for Python. The first AWS Glue Python shell job reads a SQL file from an S3 bucket to run the relevant COPY commands on the Snowflake database using Snowflake compute capacity and parallelism to migrate the data to Amazon S3. When this is complete, the second AWS Glue Python shell job reads another SQL file, and runs the corresponding COPY commands on the Amazon Redshift database using Redshift compute capacity and parallelism to load the data from the same S3 bucket.
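The following is a minimal sketch of the pattern the unload job follows, not the exact script deployed by the CloudFormation template: it reads the generated SQL file from Amazon S3, fetches Snowflake credentials from AWS Secrets Manager, and runs each COPY command through the Snowflake Connector for Python. The bucket, key, and secret names are hypothetical placeholders, and the secret is assumed to store account, user, password, and warehouse keys.

import json
import boto3
import snowflake.connector  # provided through the .whl package referenced by the job

# Hypothetical locations for illustration; the deployed job receives these as parameters
SQL_SCRIPT_BUCKET = "sf-migration-scripts"
SQL_SCRIPT_KEY = "snowflake_unload.sql"
SNOWFLAKE_SECRET_ARN = "arn:aws:secretsmanager:us-east-1:111122223333:secret:snowflake-credentials"

# Read the SQL file containing the generated COPY commands from Amazon S3
s3 = boto3.client("s3")
sql_text = s3.get_object(Bucket=SQL_SCRIPT_BUCKET, Key=SQL_SCRIPT_KEY)["Body"].read().decode("utf-8")

# Fetch the Snowflake connection details stored in AWS Secrets Manager
secret = json.loads(
    boto3.client("secretsmanager").get_secret_value(SecretId=SNOWFLAKE_SECRET_ARN)["SecretString"]
)

# Run each COPY command; the Snowflake warehouse does the heavy lifting
conn = snowflake.connector.connect(
    account=secret["account"],
    user=secret["user"],
    password=secret["password"],
    warehouse=secret["warehouse"],
)
try:
    cur = conn.cursor()
    # Naive statement split on semicolons, sufficient for the generated COPY script
    for statement in filter(None, (s.strip() for s in sql_text.split(";"))):
        cur.execute(statement)
finally:
    conn.close()

The load job mirrors this pattern, using a Redshift driver such as redshift_connector to run the generated COPY commands against the target database.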

Both jobs are orchestrated using AWS Glue workflows, as shown in the following screenshot. The workflow pushes data processing logic down to the respective data warehouses by running COPY commands on the databases themselves, minimizing the processing capacity required by AWS Glue to just the resources needed to run the Python scripts. The COPY commands load data in parallel both to and from Amazon S3, providing one of the fastest and most scalable mechanisms to transfer data from Snowflake to Amazon Redshift.

Because all heavy lifting around data processing is pushed down to the data warehouses, this solution is designed to provide a cost-optimized and highly performant mechanism to migrate a large volume of data from Snowflake to Amazon Redshift with ease.

Glue Workflow

The entire solution is packaged in an AWS CloudFormation template for simplicity of deployment and automatic provisioning of most of the required resources and permissions.

The high-level steps to implement the solution are as follows:

  1. Generate the Snowflake SQL file.
  2. Deploy the CloudFormation template to provision the required resources and permissions.
  3. Provide Snowflake access to the newly created S3 bucket.
  4. Run the AWS Glue workflow to migrate the data.

Prerequisites

Before you get started, you can optionally build the latest version of the Snowflake Connector for Python package locally and generate the wheel (.whl) package. For instructions, refer to How to build.

If you don't provide the latest version of the package, the CloudFormation template uses a pre-built .whl file that may not be the most current version of the Snowflake Connector for Python.

By default, the CloudFormation template migrates data from all tables in the TPCH_SF1 schema of the SNOWFLAKE_SAMPLE_DATA database, which is a sample dataset provided by Snowflake when an account is created. The following stored procedure is used to dynamically generate the Snowflake COPY commands required to migrate the dataset to Amazon S3. It accepts the database name, schema name, and stage name as the parameters.

CREATE OR REPLACE PROCEDURE generate_copy(db_name VARCHAR, schema_name VARCHAR, stage_name VARCHAR)
   returns varchar not null
   language javascript
   as
   $$
var return_value = "";
/* Look up every table in the requested schema */
var sql_query = "select table_catalog, table_schema, lower(table_name) as table_name from " + DB_NAME + ".information_schema.tables where table_schema = '" + SCHEMA_NAME + "'";
var sql_statement = snowflake.createStatement({ sqlText: sql_query });
/* Create the result set */
var result_scan = sql_statement.execute();
/* Build one COPY INTO @<stage>/<table>/ command per table */
while (result_scan.next()) {
    return_value += "\n";
    return_value += "COPY INTO @" + STAGE_NAME + "/" + result_scan.getColumnValue(3) + "/";
    return_value += "\n";
    return_value += "FROM " + result_scan.getColumnValue(1) + "." + result_scan.getColumnValue(2) + "." + result_scan.getColumnValue(3);
    return_value += "\n";
    return_value += "FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' COMPRESSION = GZIP)";
    return_value += "\n";
    return_value += "OVERWRITE = TRUE;";
    return_value += "\n";
}
return return_value;
$$
;
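To illustrate how this procedure might be used to produce the Snowflake unload SQL file referenced by the CloudFormation template, the following sketch calls it through the Snowflake Connector for Python and stages the generated script in Amazon S3. All values in angle brackets are placeholders, and the bucket and key names are hypothetical.

import boto3
import snowflake.connector

# Placeholder connection values for illustration only
conn = snowflake.connector.connect(account="<account>", user="<user>", password="<password>",
                                    warehouse="<warehouse>")
cur = conn.cursor()

# Generate one COPY INTO @unload_to_s3/<table>/ command per table in TPCH_SF1
cur.execute("CALL generate_copy('SNOWFLAKE_SAMPLE_DATA', 'TPCH_SF1', 'unload_to_s3')")
copy_script = cur.fetchone()[0]
conn.close()

# Stage the generated script in S3 so the AWS Glue unload job can reference it
boto3.client("s3").put_object(Bucket="<scripts-bucket>", Key="snowflake_unload.sql",
                              Body=copy_script.encode("utf-8"))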

Deploy the required resources and permissions using AWS CloudFormation

You can use the provided CloudFormation template to deploy this solution. This template automatically provisions an Amazon Redshift cluster with your desired configuration in a private subnet, maintaining a high standard of security.

  1. Sign in to the AWS Management Console, preferably as an admin user.
  2. Select your desired Region, preferably the same Region where your Snowflake instance is provisioned.
  3. Choose Launch Stack:
  4. Choose Next.
  5. For Stack name, enter a meaningful name for the stack, for example, blog-resources.

The Parameters section is divided into two subsections: Source Snowflake Infrastructure and Target Redshift Configuration.

  1. For Snowflake Unload SQL Script, the default is the S3 location (URI) of a SQL file that migrates the sample data in the TPCH_SF1 schema of the SNOWFLAKE_SAMPLE_DATA database.
  2. For Data S3 Bucket, enter a prefix for the name of the S3 bucket that is automatically provisioned to stage the Snowflake data, for example, sf-migrated-data.
  3. For Snowflake Driver, if applicable, enter the S3 location (URI) of the .whl package built earlier as a prerequisite. By default, it uses a pre-built .whl file.
  4. For Snowflake Account Name, enter your Snowflake account name.

You can use the following query in Snowflake to return your Snowflake account name:

SELECT CURRENT_ACCOUNT();
  1. For Snowflake Username, enter your user name to connect to the Snowflake account.
  2. For Snowflake Password, enter the password for the preceding user.
  3. For Snowflake Warehouse Name, enter the warehouse name for running the SQL queries.

Make sure the aforementioned user has access to the warehouse.

  1. For Snowflake Database Name, enter the database name. The default is SNOWFLAKE_SAMPLE_DATA.
  2. For Snowflake Schema Name, enter the schema name. The default is TPCH_SF1.

CFN Param Snowflake

  1. For VPC CIDR Block, enter the desired CIDR block for the VPC that hosts the Redshift cluster. The default is 10.0.0.0/16.
  2. For Subnet 1 CIDR Block, enter the CIDR block of the first subnet. The default is 10.0.0.0/24.
  3. For Subnet 2 CIDR Block, enter the CIDR block of the second subnet. The default is 10.0.1.0/24.
  4. For Redshift Load SQL Script, the default is the S3 location (URI) of a SQL file that loads the sample data from Amazon S3 into Amazon Redshift.

The following database view in Redshift is used to dynamically generate Redshift COPY commands required to migrate the dataset from Amazon S3. It accepts the schema name as the filter criteria.

CREATE OR REPLACE VIEW v_generate_copy
AS
SELECT
    schemaname ,
    tablename  ,
    seq        ,
    ddl
FROM
    (
        SELECT
            table_id   ,
            schemaname ,
            tablename  ,
            seq        ,
            ddl
        FROM
            (
                --COPY TABLE
                SELECT
                    c.oid::bigint  as table_id   ,
                    n.nspname      AS schemaname ,
                    c.relname      AS tablename  ,
                    0              AS seq        ,
                    'COPY ' + n.nspname + '.' + c.relname + ' FROM ' AS ddl
                FROM
                    pg_namespace AS n
                INNER JOIN
                    pg_class AS c
                ON
                    n.oid = c.relnamespace
                WHERE
                    c.relkind = 'r'
                --COPY TABLE continued                
                UNION                
                SELECT
                    c.oid::bigint as table_id   ,
                    n.nspname     AS schemaname ,
                    c.relname     AS tablename  ,
                    2             AS seq        ,
                    '''${' + '2}' + c.relname + '/'' iam_role ''${' + '1}'' gzip delimiter ''|'' EMPTYASNULL REGION ''us-east-1''' AS ddl
                FROM
                    pg_namespace AS n
                INNER JOIN
                    pg_class AS c
                ON
                    n.oid = c.relnamespace
                WHERE
                    c.relkind = 'r'
                --END SEMICOLON                
                UNION                
                SELECT
                    c.oid::bigint as table_id  ,
                    n.nspname     AS schemaname,
                    c.relname     AS tablename ,
                    600000005     AS seq       ,
                    ';'           AS ddl
                FROM
                    pg_namespace AS n
                INNER JOIN
                    pg_class AS c
                ON
                    n.oid = c.relnamespace
                WHERE
                    c.relkind = 'r' 
             )
        ORDER BY
            table_id  ,
            schemaname,
            tablename ,
            seq 
    );

SELECT ddl
FROM v_generate_copy
WHERE schemaname = 'tpch_sf1';
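The ${1} and ${2} tokens in the generated DDL are positional placeholders for the Amazon Redshift IAM role and the S3 data path, which are presumably filled in at run time from the job's PARAMS values (see the appendix). A minimal, hypothetical sketch of that substitution:

# Hypothetical values; in the deployed solution these come from the Glue job parameters
iam_role_arn = "arn:aws:iam::111122223333:role/redshift-copy-role"
s3_data_path = "s3://sf-migrated-data/"

# ddl is one row returned by the v_generate_copy view, for example:
ddl = "'${2}customer/' iam_role '${1}' gzip delimiter '|' EMPTYASNULL REGION 'us-east-1'"

# Substitute the positional placeholders to produce the final COPY clause
final_ddl = ddl.replace("${1}", iam_role_arn).replace("${2}", s3_data_path)
print(final_ddl)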
  1. For Redshift Database Name, enter your desired database name, for example, dev.
  2. For Number of Redshift Nodes, enter the desired compute nodes, for example, 2.
  3. For Redshift Node Type, choose the desired node type, for example, ra3.4xlarge.
  4. For Redshift Password, enter your desired password with the following constraints: it must be 8–64 characters in length, and contain at least one uppercase letter, one lowercase letter, and one number.
  5. For Redshift Port, enter the Amazon Redshift port number to connect to. The default port is 5439.

CFN Param Redshift 1 CFN Param Redshift 2

  1. Choose Next.
  2. Review and choose Create stack.

It takes around 5 minutes for the template to finish creating all resources and permissions. Most of the resources have the prefix of the stack name you specified for easy identification of the resources later. For more details on the deployed resources, see the appendix at the end of this post.

Create an IAM role and external Amazon S3 stage for Snowflake access to the data S3 bucket

In order for Snowflake to access the TargetDataS3Bucket created earlier by the CloudFormation template, you must create an AWS Identity and Access Management (IAM) role and an external Amazon S3 stage that grant Snowflake access to the S3 bucket. For instructions, refer to Configuring Secure Access to Amazon S3.

When you create an external stage in Snowflake, use the value for TargetDataS3Bucket on the Outputs tab of your deployed CloudFormation stack for the Amazon S3 URL of your stage.

CF Output

Make sure to name the external stage unload_to_s3 if you’re migrating the sample data using the default scripts provided in the CloudFormation template.
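If you prefer to script this step, the following sketch creates the stage through the Snowflake Connector for Python. It assumes a storage integration (named s3_int here) has already been configured for the bucket per the Snowflake documentation linked above; all values in angle brackets are placeholders.

import snowflake.connector

# Placeholders for illustration; the storage integration (s3_int) must already exist
conn = snowflake.connector.connect(account="<account>", user="<user>", password="<password>",
                                    warehouse="<warehouse>", database="<database>", schema="<schema>")
cur = conn.cursor()
# Create the stage that the generated COPY INTO commands reference as @unload_to_s3
cur.execute("""
    CREATE OR REPLACE STAGE unload_to_s3
      URL = 's3://<TargetDataS3Bucket>/'
      STORAGE_INTEGRATION = s3_int
""")
conn.close()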

Convert Snowflake tables to Amazon Redshift

You can simply run the following DDL statements to create TPCH_SF1 schema objects in Amazon Redshift. You can also use AWS Schema Conversion Tool (AWS SCT) to convert Snowflake custom objects to Amazon Redshift. For instructions on converting your schema, refer to Accelerate Snowflake to Amazon Redshift migration using AWS Schema Conversion Tool.

CREATE SCHEMA TPCH_SF1;
SET SEARCH_PATH to TPCH_SF1;
CREATE TABLE customer (
  c_custkey int8 not null ,
  c_name varchar(25) not null,
  c_address varchar(40) not null,
  c_nationkey int4 not null,
  c_phone char(15) not null,
  c_acctbal numeric(12,2) not null,
  c_mktsegment char(10) not null,
  c_comment varchar(117) not null,
  Primary Key(C_CUSTKEY)
) ;

CREATE TABLE lineitem (
  l_orderkey int8 not null ,
  l_partkey int8 not null,
  l_suppkey int4 not null,
  l_linenumber int4 not null,
  l_quantity numeric(12,2) not null,
  l_extendedprice numeric(12,2) not null,
  l_discount numeric(12,2) not null,
  l_tax numeric(12,2) not null,
  l_returnflag char(1) not null,
  l_linestatus char(1) not null,
  l_shipdate date not null ,
  l_commitdate date not null,
  l_receiptdate date not null,
  l_shipinstruct char(25) not null,
  l_shipmode char(10) not null,
  l_comment varchar(44) not null,
  Primary Key(L_ORDERKEY, L_LINENUMBER)
)  ;

CREATE TABLE nation (
  n_nationkey int4 not null,
  n_name char(25) not null ,
  n_regionkey int4 not null,
  n_comment varchar(152) not null,
  Primary Key(N_NATIONKEY)                                
) ;

CREATE TABLE orders (
  o_orderkey int8 not null,
  o_custkey int8 not null,
  o_orderstatus char(1) not null,
  o_totalprice numeric(12,2) not null,
  o_orderdate date not null,
  o_orderpriority char(15) not null,
  o_clerk char(15) not null,
  o_shippriority int4 not null,
  o_comment varchar(79) not null,
  Primary Key(O_ORDERKEY)
) ;

CREATE TABLE part (
  p_partkey int8 not null ,
  p_name varchar(55) not null,
  p_mfgr char(25) not null,
  p_brand char(10) not null,
  p_type varchar(25) not null,
  p_size int4 not null,
  p_container char(10) not null,
  p_retailprice numeric(12,2) not null,
  p_comment varchar(23) not null,
  PRIMARY KEY (P_PARTKEY)
) ;

CREATE TABLE partsupp (
  ps_partkey int8 not null,
  ps_suppkey int4 not null,
  ps_availqty int4 not null,
  ps_supplycost numeric(12,2) not null,
  ps_comment varchar(199) not null,
  Primary Key(PS_PARTKEY, PS_SUPPKEY)
) ;

CREATE TABLE region (
  r_regionkey int4 not null,
  r_name char(25) not null ,
  r_comment varchar(152) not null,
  Primary Key(R_REGIONKEY)                             
) ;

CREATE TABLE supplier (
  s_suppkey int4 not null,
  s_name char(25) not null,
  s_address varchar(40) not null,
  s_nationkey int4 not null,
  s_phone char(15) not null,
  s_acctbal numeric(12,2) not null,
  s_comment varchar(101) not null,
  Primary Key(S_SUPPKEY)
);

Run an AWS Glue workflow for data migration

When you’re ready to start the data migration, complete the following steps:

  1. On the AWS Glue console, choose Workflows in the navigation pane.
  2. Select the workflow to run (<stack name>snowflake-to-redshift-migration).
  3. On the Actions menu, choose Run.
    Glue Workflow Run
  4. To check the status of the workflow, choose the workflow and on the History tab, select the Run ID and choose View run details.
    Glue Workflow Status
  5. When the workflow is complete, navigate to the Amazon Redshift console and launch the Amazon Redshift query editor v2 to verify the successful migration of data.
  6. Run the following query in Amazon Redshift to get row counts of all tables migrated from Snowflake to Amazon Redshift. Make sure to adjust the table_schema value accordingly if you’re not migrating the sample data.
SELECT tab.table_schema,
       tab.table_name,
       nvl(tinf.tbl_rows,0) tbl_rows,
       nvl(tinf.size,0) size
FROM svv_tables tab
LEFT JOIN svv_table_info tinf 
          on tab.table_schema = tinf.schema 
          and tab.table_name = tinf."table"
WHERE tab.table_type = 'BASE TABLE'
      and tab.table_schema in ('tpch_sf1')
ORDER BY tbl_rows;

Redshift Editor

  1. Run the following query in Snowflake to compare and validate the data:
USE DATABASE snowflake_sample_data;
SELECT  TABLE_CATALOG,
        TABLE_SCHEMA,
        TABLE_NAME,
        ROW_COUNT,
        BYTES AS SIZE,
        COMMENT
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'TPCH_SF1'
ORDER BY ROW_COUNT;

Snowflake Editor
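If you prefer to automate this check, a small script can pull both row-count result sets and compare them. The following is a rough sketch, not part of the deployed solution; it assumes the redshift_connector and snowflake-connector-python packages are installed, and all values in angle brackets are placeholders.

import redshift_connector
import snowflake.connector

# Placeholder connection values for illustration only
rs = redshift_connector.connect(host="<redshift-endpoint>", database="dev",
                                user="<user>", password="<password>", port=5439)
sf = snowflake.connector.connect(account="<account>", user="<user>", password="<password>",
                                 warehouse="<warehouse>", database="SNOWFLAKE_SAMPLE_DATA")

# Row counts per table on the Amazon Redshift side
rs_cur = rs.cursor()
rs_cur.execute("""
    SELECT tab.table_name, nvl(tinf.tbl_rows, 0)
    FROM svv_tables tab
    LEFT JOIN svv_table_info tinf
      ON tab.table_schema = tinf.schema AND tab.table_name = tinf."table"
    WHERE tab.table_type = 'BASE TABLE' AND tab.table_schema = 'tpch_sf1'
""")
redshift_counts = {name.lower(): int(rows) for name, rows in rs_cur.fetchall()}

# Row counts per table on the Snowflake side
sf_cur = sf.cursor()
sf_cur.execute("""
    SELECT table_name, row_count
    FROM INFORMATION_SCHEMA.TABLES
    WHERE table_schema = 'TPCH_SF1'
""")
snowflake_counts = {name.lower(): int(rows) for name, rows in sf_cur.fetchall()}

# Report any tables whose counts do not match
for table, expected in snowflake_counts.items():
    actual = redshift_counts.get(table, 0)
    status = "OK" if actual == expected else "MISMATCH"
    print(f"{table}: snowflake={expected} redshift={actual} {status}")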

Clean up

To avoid incurring future charges, delete the resources you created as part of the CloudFormation stack by navigating to the AWS CloudFormation console, selecting the stack blog-resources, and choosing Delete.
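If you prefer to script the cleanup, the following boto3 sketch empties the data bucket (a non-empty bucket would otherwise block stack deletion) and then deletes the stack. The bucket name is a placeholder; use the TargetDataS3Bucket value from the stack outputs.

import boto3

STACK_NAME = "blog-resources"
DATA_BUCKET = "<TargetDataS3Bucket>"  # placeholder; take the value from the stack outputs

# Empty the data bucket first; CloudFormation cannot delete a bucket that still has objects
bucket = boto3.resource("s3").Bucket(DATA_BUCKET)
bucket.objects.all().delete()

# Delete the stack and wait for the deletion to complete
cfn = boto3.client("cloudformation")
cfn.delete_stack(StackName=STACK_NAME)
cfn.get_waiter("stack_delete_complete").wait(StackName=STACK_NAME)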

Conclusion

In this post, we discussed how to perform an efficient, fast, and cost-effective migration from Snowflake to Amazon Redshift. Migrations from one data warehouse environment to another can typically be very time-consuming and resource-intensive; this solution uses the power of cloud-based compute by pushing down the processing to the respective warehouses. Orchestrating this migration with the AWS Glue Python shell provides additional cost optimization.

With this solution, you can facilitate your migration from Snowflake to Amazon Redshift. If you’re interested in further exploring the potential of using Amazon Redshift, please reach out to your AWS Account Team for a proof of concept.

Appendix: Resources deployed by AWS CloudFormation

The CloudFormation stack deploys the following resources in your AWS account:

  • Networking resources – Amazon Virtual Private Cloud (Amazon VPC), subnets, ACL, and security group.
  • Amazon S3 bucket – This is referenced as TargetDataS3Bucket on the Outputs tab of the CloudFormation stack. This bucket holds the data being migrated from Snowflake to Amazon Redshift.
  • AWS Secrets Manager secrets – Two secrets in AWS Secrets Manager store credentials for Snowflake and Amazon Redshift.
  • VPC endpoints – Two VPC endpoints are deployed to establish a private connection from VPC resources like AWS Glue to services that run outside the VPC, such as Secrets Manager and Amazon S3.
  • IAM roles – IAM roles for AWS Glue, Lambda, and Amazon Redshift. If you deploy the CloudFormation template in a production environment, adjust the IAM policies so they're not as permissive as those presented in this post, which were kept broad for simplicity and demonstration. In particular, AWS Glue and Amazon Redshift don't require all the actions granted in the *FullAccess policies, which would be considered overly permissive.
  • Amazon Redshift cluster – An Amazon Redshift cluster is created in a private subnet, which isn’t publicly accessible.
  • AWS Glue connection – The connection for Amazon Redshift makes sure that the AWS Glue job runs within the same VPC as Amazon Redshift. This also ensures that AWS Glue can access the Amazon Redshift cluster in a private subnet.
  • AWS Glue jobs – Two AWS Glue Python shell jobs are created (see the parameter-reading sketch after this list):
    • <stack name>-glue-snowflake-unload – The first job runs the SQL scripts in Snowflake to copy data from the source database to Amazon S3. The Python script is available in S3. The Snowflake job accepts two parameters:
      • SQLSCRIPT – The Amazon S3 location of the SQL script to run in Snowflake to migrate data to Amazon S3. This is referenced as the Snowflake Unload SQL Script parameter in the input section of the CloudFormation template.
      • SECRET – The Secrets Manager ARN that stores Snowflake connection details.
    • <stack name>-glue-redshift-load – The second job runs another SQL script in Amazon Redshift to copy data from Amazon S3 to the target Amazon Redshift database. The Python script is available in S3. The Amazon Redshift job accepts three parameters:
      • SQLSCRIPT – The Amazon S3 location of the SQL script to run in Amazon Redshift to migrate data from Amazon S3. If you provide a custom SQL script to migrate the Snowflake data to Amazon S3 (as mentioned in the prerequisites), the file location is referenced as LoadFileLocation on the Outputs tab of the CloudFormation stack.
      • SECRET – The Secrets Manager ARN that stores Amazon Redshift connection details.
      • PARAMS – This includes any additional parameters required for the SQL script, including the Amazon Redshift IAM role used in the COPY commands and the S3 bucket staging the Snowflake data. Multiple parameter values can be provided separated by a comma.
  • AWS Glue workflow – The orchestration of Snowflake and Amazon Redshift AWS Glue Python shell jobs is managed via an AWS Glue workflow. The workflow <stack name>snowflake-to-redshift-migration runs later for actual migration of data.
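The following minimal sketch shows how a Python shell job might read the parameters listed above. getResolvedOptions from awsglue.utils is the standard way Glue jobs resolve their arguments; the exact handling in the deployed scripts may differ, and the variable names here are illustrative.

import sys
from awsglue.utils import getResolvedOptions

# Resolve the job parameters passed by the AWS Glue workflow
args = getResolvedOptions(sys.argv, ["SQLSCRIPT", "SECRET", "PARAMS"])

sql_script_s3_uri = args["SQLSCRIPT"]   # S3 URI of the SQL file to run
secret_arn = args["SECRET"]             # Secrets Manager ARN with connection details
# PARAMS holds comma-separated extras such as the Redshift IAM role and the staging bucket
extra_params = [p.strip() for p in args["PARAMS"].split(",")]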

About the Authors

Raks Khare is an Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers architect data analytics solutions at scale on the AWS platform.

Julia Beck is an Analytics Specialist Solutions Architect at AWS. She supports customers in validating analytics solutions by architecting proof of concept workloads designed to meet their specific needs.