All posts by Seetha Sarma

Running queries securely from the same VPC where an Amazon Redshift cluster is running

Post Syndicated from Seetha Sarma original https://aws.amazon.com/blogs/big-data/running-queries-securely-from-the-same-vpc-where-an-amazon-redshift-cluster-is-running/

Customers who don’t need to set up a VPN or a private connection to AWS often use public endpoints to access AWS. Although this is acceptable for testing out the services, most production workloads need a secure connection to their VPC on AWS. If you’re running your production data warehouse on Amazon Redshift, you can run your queries in the Amazon Redshift query editor or use Amazon WorkSpaces from Amazon Virtual Private Cloud (Amazon VPC) to connect to Amazon Redshift securely, and then analyze and visualize a data summary in your favorite business intelligence (BI) or data visualization desktop tool.

With Amazon Redshift, you can query petabytes of structured and semi-structured data across your data warehouse, operational database, and your data lake using standard SQL. Amazon WorkSpaces is a managed, secure Desktop-as-a-Service (DaaS) solution deployed within an Amazon VPC. In this post, we show how you can run SQL queries on Amazon Redshift securely without VPN using Amazon Redshift query editor and Amazon WorkSpaces. First, we discuss how to run queries that return large datasets from the Amazon Redshift query editor using the UNLOAD command. Next, we discuss how to set up Amazon WorkSpaces and use it to securely run queries on Amazon Redshift. We cover the detailed steps for setting up Amazon WorkSpaces and show different scenarios to test Amazon Redshift queries with Amazon WorkSpaces.

The following diagram illustrates these architectures.

Using the Amazon Redshift query editor with the UNLOAD command

Amazon Redshift has a query editor on its console that is typically used to run short queries and explore the data in the Amazon Redshift database. You may have a production scenario with queries that return large result sets. For instance, you may want to unload CSV data for use by a downstream process. In this case, you can run your query in the query editor and use the UNLOAD command to send the output directly to Amazon Simple Storage Service (Amazon S3) and get notified when the data is uploaded.

  1. Create separate S3 buckets for each user. You can use the default configuration for the bucket.
  2. On the Amazon Redshift console, choose Editor.
  3. In the Connect to database section, enter the database connection details.
  4. Choose Connect to database.

  5. Use the following format for the queries run from the query editor, using the IAM role for Amazon Redshift, so the results are uploaded to Amazon S3:
    UNLOAD
    ('select id, name
    from <your_schema>.<your_table>')
    TO 's3://<username>/<yy-mm-dd-hh24-mi-ss>/'
    FORMAT AS CSV
    iam_role 'arn:aws:iam::<myaccount>:role/MyAmazon_RedshiftUnloadRole';

This query creates multiple files in the designated user’s S3 bucket under the date/time prefix. The user can preview or download the individual files on the Amazon S3 console.

A large unload may take some time. You can configure Amazon Simple Notification Service (Amazon SNS) to send a notification when the results are uploaded to Amazon S3.

  1. On the Amazon SNS console, choose Topics.
  2. Choose Create topic.
  3. Create an SNS topic with a meaningful description text, like Your query results are uploaded to S3.

In the next steps, you edit the access policy of the SNS topic to give permission for Amazon S3 to publish to it.

  1. Change the Principal from "AWS": "*" to "Service": "s3.amazonaws.com".
  2. Scroll down to “Action” and delete everything except “SNS:Publish”. Make sure to delete the extra comma.
  3. Scroll down to “Condition” and modify the text "StringEquals": { "AWS:SourceOwner": <Your account id>} to "ArnLike": { "aws:SourceArn": "arn:aws:s3:*:*:<user-bucket-name>" }. The resulting policy statement should look similar to the sample after these steps.
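For reference, after these edits the topic’s access policy statement should resemble the following sketch; the Sid, Region, account ID, topic name, and bucket name are placeholders, not values from this post:

    {
      "Version": "2008-10-17",
      "Statement": [
        {
          "Sid": "AllowS3ToPublish",
          "Effect": "Allow",
          "Principal": { "Service": "s3.amazonaws.com" },
          "Action": "SNS:Publish",
          "Resource": "arn:aws:sns:<region>:<account-id>:<topic-name>",
          "Condition": {
            "ArnLike": { "aws:SourceArn": "arn:aws:s3:*:*:<user-bucket-name>" }
          }
        }
      ]
    }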

  1. In the navigation pane, choose Subscriptions.
  2. Choose Create subscription.
  3. Subscribe to the SNS notification with the user’s email address as the endpoint.
  4. Make sure the user confirms the subscription from their email.
  5. On the Amazon S3 console, choose the Properties tab of the user’s S3 bucket.
  6. Under Event Notifications, choose Create event notification.
  7. Select All object create events.
  8. For Destination, select SNS topic.
  9. For Specify SNS topic, select Choose from your SNS topics.
  10. For SNS topic, choose the topic you created.
  11. Save your settings.
  12. To test the notification, on the Amazon Redshift console, open the query editor.
  13. Edit the UNLOAD query and change the date/time prefix in the S3 path to the current date and time.
  14. Run the query and check if the user gets the email notification.
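If you prefer to script this notification setup instead of using the console, the following AWS CLI sketch does the same wiring; the topic name, email address, Region, account ID, and bucket name are placeholders, not values from this post.

    # Create the topic and subscribe the user's email address (all values are placeholders)
    aws sns create-topic --name <topic-name>
    aws sns subscribe \
        --topic-arn arn:aws:sns:<region>:<account-id>:<topic-name> \
        --protocol email \
        --notification-endpoint <user@example.com>

    # Send the bucket's object-created events to the topic
    aws s3api put-bucket-notification-configuration \
        --bucket <user-bucket-name> \
        --notification-configuration '{
            "TopicConfigurations": [
                {
                    "TopicArn": "arn:aws:sns:<region>:<account-id>:<topic-name>",
                    "Events": ["s3:ObjectCreated:*"]
                }
            ]
        }'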

Using Amazon WorkSpaces to run Amazon Redshift queries

In this section, we cover setting up Amazon WorkSpaces, including the Amazon VPC prerequisites, creating an Amazon VPC endpoint for Amazon S3, launching Amazon WorkSpaces in the same VPC where an Amazon Redshift cluster is running, setting up the Amazon WorkSpaces client, installing PSQL or a SQL client, and connecting to the Amazon Redshift cluster.

When setup is complete, we show different scenarios to test with Amazon WorkSpaces, such as testing a SQL command from the Amazon WorkSpaces client, testing the SCREEN program to run SQL in the background, and testing PSQL with Amazon S3 and getting a notification through Amazon SNS.

Prerequisites

By default, AWS Identity and Access Management (IAM) users and roles can’t perform tasks using the AWS Management Console and don’t have permission to create or modify Amazon VPC resources. Make sure you have administrator privileges, or that an administrator has created IAM policies that grant sufficient permissions to edit the route table, edit the VPC security group, and enable a DNS hostname for the VPC.

When you have the correct permissions, complete the following prerequisite steps:

  1. On the Amazon Redshift console, check the cluster subnet groups to make sure the Amazon Redshift cluster is created in an Amazon VPC with at least two subnets that are in separate Availability Zones.
  2. On the Amazon VPC console, edit the route table and make sure to associate these two subnets.
  3. Make sure the Amazon VPC security group has a self-referencing inbound rule for the security group for all traffic (not all TCP). The self-referencing rule restricts the source to the same security group in the VPC, and it’s not open to all networks. Consider limiting it to just the protocol and port needed for Amazon Redshift to talk to Amazon WorkSpaces (see the CLI sketch after this list).
  4. Edit the DNS hostname setting of the Amazon VPC and enable it.
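If you later decide to narrow that rule from the command line, the following is a minimal sketch; the security group ID is a placeholder, and 5439 is the default Amazon Redshift port.

    # Self-referencing inbound rule limited to the Amazon Redshift port (placeholder security group ID)
    aws ec2 authorize-security-group-ingress \
        --group-id <sg-xxxxxxxx> \
        --protocol tcp \
        --port 5439 \
        --source-group <sg-xxxxxxxx>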

Creating an Amazon VPC endpoint for Amazon S3 for software downloads

In this step, you create your Amazon VPC endpoint for Amazon S3. This gives you Amazon S3 access to download PSQL from the Amazon repository. Alternatively, you could set up a NAT Gateway and download PSQL or other SQL clients from the internet.

  1. On the Amazon VPC console, choose Endpoints.
  2. Choose Create endpoint.
  3. Search for Service Name: S3.
  4. Select the S3 service gateway.
  5. Select the Amazon VPC where the Amazon Redshift cluster is running.
  6. Select the route table.
  7. Enter the following custom policy for the endpoint to access the Amazon Linux AMI repositories:
    {
        "Version": "2008-10-17",
        "Statement": [
            {
                "Sid": "Amazon Linux AMI Repository Access",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": [
                    "arn:aws:s3:::*.amazonaws.com",
                    "arn:aws:s3:::*.amazonaws.com/*"
                ]
            }
        ]
    }
  8. Create the endpoint (a CLI sketch of this step follows the list).
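For reference, creating the same gateway endpoint with the AWS CLI looks roughly like the following; the VPC ID, route table ID, Region, and policy file name are placeholders.

    # Gateway endpoint for Amazon S3 in the Amazon Redshift cluster's VPC (placeholder IDs)
    aws ec2 create-vpc-endpoint \
        --vpc-id <vpc-xxxxxxxx> \
        --service-name com.amazonaws.<region>.s3 \
        --route-table-ids <rtb-xxxxxxxx> \
        --policy-document file://s3-endpoint-policy.json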

Launching Amazon WorkSpaces in the VPC where the Amazon Redshift cluster runs

You’re now ready to launch Amazon WorkSpaces.

  1. On the Amazon WorkSpaces console, choose Launch WorkSpaces.
  2. For Directory types, select Simple AD.

Directory Service Solutions helps you store information and manage access to resources. For this post, we choose Simple AD.

  3. For Directory size, select Small.
  4. Enter your directory details.

  5. For VPC, choose the VPC where the Amazon Redshift cluster is running.
  6. For Subnets, choose the two subnets you created.

It may take a few minutes to provision the directory. You see the status show as Active when it’s ready.

  7. When the directory is provisioned, choose the directory and subnets you created.
  8. Choose Next Step.

  9. Create and identify your users.
  10. Use the default settings for the compute bundle.
  11. For Running Mode, select AlwaysOn.

Alternatively, select AutoStop and adjust the time in order to run long-running queries.

  12. Review and launch WorkSpaces.

The WorkSpace may take up to 20 minutes to become available.

Setting up the Amazon WorkSpaces client

In this section, you configure your Amazon WorkSpaces client.

  1. Use the link from your email to download the Amazon WorkSpaces client.
  2. Register it using the registration code from the email.
  3. Log in with the username from the email and your newly created password.
  4. In the Amazon WorkSpaces client, open the terminal.
  5. Run the following command to capture the IP address:
    hostname -I | awk '{print $2}'

The following screenshot shows your output.

  6. On the Amazon Redshift console, choose Clusters.
  7. Choose your cluster.
  8. Save the endpoint information to use later.
  9. Choose the Properties tab.

  10. In the Network and security section, note the VPC security group.
  11. On the Amazon VPC console, under Security, choose Security Groups.
  12. Select the security group that the Amazon Redshift cluster uses.
  13. Add an inbound rule with the type Redshift and, as the source, the IP address you captured followed by /32.

  14. On the Amazon WorkSpaces client, use the Amazon Redshift hostname from the endpoint you saved earlier and verify the VPC setup with the following code:
     nslookup <Amazon Redshift hostname>

If you see an IP address within your subnet range, the private endpoint setup for Amazon Redshift was successful.

Testing a SQL command from PSQL or a SQL client in the Amazon WorkSpaces client

To test a SQL command, complete the following steps:

  1. From the terminal in the Amazon WorkSpaces client, run the following command to install PostgreSQL:
    sudo yum install postgresql-server

Alternatively, set up a NAT Gateway and download a SQL client such as SQL Workbench on the Amazon WorkSpaces client:

sudo wget https://www.sql-workbench.eu/Workbench-Build125-with-optional-libs.zip

Then unzip the content of the downloaded file and save it to a directory:

unzip Workbench-Build125-with-optional-libs.zip -d ~/Workbench
  2. Use the Amazon Redshift hostname, port, and database name of the Amazon Redshift cluster endpoint you copied earlier and try connecting to the database:
    psql -h <Amazon Redshift hostname> -p <port> -d <database> -U <username> -W

  3. Enter your password when prompted.
  4. Run a SQL command and check the results.

Testing the SCREEN program to run SQL in the background

You can use the SCREEN program to run the SQL command in the background and resume it later to see the results.

  1. From the terminal in the Amazon WorkSpaces client, install the SCREEN program:
    sudo yum install screen

  2. Run the program:
    screen

  3. Connect to PSQL:
    psql -h <Amazon Redshift hostname> -p <port> -d <database> -U <username> -W

  4. Enter the password when prompted.
  5. Run the SQL command.
  6. Press Ctrl+A followed by D to detach from the screen.

The SQL command is now running in the background. You can check by running ps -ef | grep psql.

  7. To go back to that screen, run the following command:
    screen -r

  8. To quit SCREEN, press the following key sequence:
    Ctrl+A followed by \

Testing PSQL with Amazon S3 and Amazon SNS

Similar to the UNLOAD command we used from the Amazon Redshift query editor in the beginning of this post, you can run PSQL from the Amazon WorkSpaces client, send the output to an S3 bucket, and get an Amazon SNS notification for an object create event.

  1. From the terminal in the Amazon WorkSpaces client, run aws configure to set up AWS credentials with write access to the S3 bucket.
  2. Run the following command to write a single output file, copy it to Amazon S3, and trigger the email notification:
    psql -h <Amazon Redshift hostname> -p <port> -d <database> -U <username> -W -c 'select column1, column2 from myschema.mytable' > output.txt
    aws s3 cp output.txt s3://<username>/<yy-mm-dd-hh24-mi-ss>/output.txt

Conclusion

This post explained how to securely query Amazon Redshift databases running in an Amazon VPC using the UNLOAD command with the Amazon Redshift query editor and using Amazon WorkSpaces running in the same VPC. Try out this solution to securely access Amazon Redshift databases without a VPN and run long-running queries. If you have any questions or comments, please share your thoughts in the comments section.


About the Authors

Seetha Sarma is a Senior Database Specialist Solutions Architect with Amazon Web Services. Seetha provides guidance to customers on using AWS services for distributed data processing. In her spare time she likes to go on long walks and enjoy nature.

Moiz Mian is a Solutions Architect for AWS Strategic Accounts. He focuses on enabling customers to build innovative, scalable, and secure solutions for the cloud. In his free time, he enjoys building out Smart Home systems and driving at race tracks.

Migrating from Vertica to Amazon Redshift

Post Syndicated from Seetha Sarma original https://aws.amazon.com/blogs/big-data/migrating-from-vertica-to-amazon-redshift/

Amazon Redshift powers analytical workloads for Fortune 500 companies, startups, and everything in between. With Amazon Redshift, you can query petabytes of structured and semi-structured data across your data warehouse, operational database, and your data lake using standard SQL.

When you use Vertica, you have to install and upgrade Vertica database software and manage the cluster OS and hardware. Amazon Redshift is a fully managed cloud solution; you don’t have to install and upgrade database software and manage the OS and the hardware. In this post, we discuss the best practices for migrating from a self-managed Vertica cluster to the fully managed Amazon Redshift solution. We discuss how to plan for the migration, including sizing your Amazon Redshift cluster and strategies for data placement. We look at the tools for schema conversion and see how to choose the right keys for distributing and sorting your data. We also see how to speed up the data migration to Amazon Redshift based on your data size and network connectivity. Finally, we cover how cluster management on Amazon Redshift differs from Vertica.

Migration planning

When planning your migration, start with where you want to place the data. Your business use case drives what data gets loaded to Amazon Redshift and what data remains on the data lake. In this section, we discuss how to size the Amazon Redshift cluster based on the size of the Vertica dataset that you’re moving to Amazon Redshift. We also look at the Vertica schema and decide the best data distribution and sorting strategies to use for Amazon Redshift, if you choose to do it manually.

Data placement

Amazon Redshift powers the lake house architecture, which enables you to query data across your data warehouse, data lake, and operational databases to gain faster and deeper insights not possible otherwise. In a Vertica data warehouse, you plan the capacity for all your data, whereas with Amazon Redshift, you can plan your data warehouse capacity much more efficiently. If you have a huge historical dataset being shared by multiple compute platforms, then it’s a good candidate to keep on Amazon Simple Storage Service (Amazon S3) and utilize Amazon Redshift Spectrum. Also, streaming data coming from Kafka and Amazon Kinesis Data Streams can add new files to an existing external table by writing to Amazon S3 with no resource impact to Amazon Redshift. This has a positive impact on concurrency. Amazon Redshift Spectrum is good for heavy scan and aggregate work. For tables that are frequently accessed from a business intelligence (BI) reporting or dashboarding interface and for tables frequently joined with other Amazon Redshift tables, it’s optimal to have tables loaded in Amazon Redshift.

Vertica has Flex tables to handle JSON data. You don’t need to load the JSON data to Amazon Redshift. You can use external tables to query JSON data stored on Amazon S3 directly from Amazon Redshift. You create external tables in Amazon Redshift within an external schema.
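As an illustration, the following is a minimal sketch of querying JSON files on Amazon S3 through Amazon Redshift Spectrum; the schema, table, columns, IAM role, and S3 path are placeholders, and the OpenX JSON SerDe is one commonly used option for JSON files.

-- Hedged sketch: external schema and external table over JSON files in Amazon S3 (placeholders)
CREATE EXTERNAL SCHEMA s3_external_schema
FROM DATA CATALOG
DATABASE 'my_external_db'
IAM_ROLE 'arn:aws:iam::<myaccount>:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE s3_external_schema.events_json (
    event_id bigint,
    event_type varchar(50),
    created_at timestamp
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/json-events/';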

Vertica users typically create a projection on a Vertica table to optimize for a particular query. If necessary, use materialized views in Amazon Redshift. Vertica also has aggregate projection, which acts like a synchronized materialized view. With materialized views in Amazon Redshift, you can store the pre-computed results of queries and efficiently maintain them by incrementally processing the latest changes made to the source tables. Subsequent queries referencing the materialized views use the pre-computed results to run much faster. You can create materialized views based on one or more source tables using filters, inner joins, aggregations, grouping, functions, and other SQL constructs.
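For example, a Vertica aggregate projection might map to a materialized view along the lines of the following sketch (table and column names are placeholders):

-- Hedged sketch: materialized view replacing a Vertica aggregate projection (placeholders)
CREATE MATERIALIZED VIEW mv_daily_sales AS
SELECT sale_date, product_id, SUM(quantity) AS total_quantity
FROM my_schema.sales
GROUP BY sale_date, product_id;

-- Refresh to incrementally process the latest changes to the base table
REFRESH MATERIALIZED VIEW mv_daily_sales;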

Cluster sizing

When you create a cluster on the Amazon Redshift console, you can get a recommendation of your cluster configuration based on the size of your data and query characteristics (see the following screenshot).

Amazon Redshift offers different node types to accommodate your workloads. We recommend using RA3 nodes so you can size compute and storage independently to achieve improved price and performance. Amazon Redshift takes advantage of optimizations such as data block temperature, data block age, and workload patterns to optimize performance and manage automatic data placement across tiers of storage in the RA3 clusters.

ETL pipelines and BI reports typically use temporary tables that are only valid for a session. Vertica has local and global temporary tables. If you’re using Vertica local temporary tables, no change is required during migration. Vertica local tables and Amazon Redshift temporary tables have similar behavior. They’re visible only to the session and get dropped when the session ends. Vertica global tables persist across sessions until they are explicitly dropped. If you use them now, you have to change them to permanent tables in Amazon Redshift and drop them when they’re no longer needed.
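A minimal sketch of a session-scoped temporary table in Amazon Redshift (names are placeholders):

-- Visible only to this session and dropped automatically when the session ends;
-- a Vertica global temporary table instead becomes a permanent table that you
-- drop explicitly when it's no longer needed.
CREATE TEMPORARY TABLE tmp_daily_sales (
    sale_date date,
    amount decimal(12,2)
);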

Data distribution, sorting, and compression

Amazon Redshift optimizes for performance by distributing the data across compute nodes and sorting the data. Make sure to set the sort key, distribution style, and compression encoding of the tables to take full advantage of the massively parallel processing (MPP) capabilities. The choice of distribution style and sort keys varies based on the data model and access patterns. Use the data distribution and column order of the Vertica tables to help choose the distribution keys and sort keys on Amazon Redshift.

Distribution keys

Choose a column with high cardinality of evenly spread out values as the distribution key. Profile the data for the columns used for distribution keys. Vertica has segmentation that specifies how to distribute data for superprojections of a table, where the data to be hashed consists of one or more column values. The columns used in segmentation are most likely good candidates for distribution keys on Amazon Redshift. If you have multiple columns in segmentation, pick the column that provides the highest cardinality to reduce the possibility of high data skew.

Besides supporting data distribution by key, Amazon Redshift also supports other distribution styles: ALL, EVEN, and AUTO. Use ALL distribution for small dimension tables and EVEN distribution for larger tables, or use AUTO distribution, where Amazon Redshift changes the distribution style from ALL to EVEN as the table size reaches a threshold.
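The following DDL sketch shows what each distribution style looks like; the table and column names are placeholders:

-- Small dimension table replicated to every node
CREATE TABLE dim_product (product_id int, product_name varchar(50)) DISTSTYLE ALL;

-- Larger table spread evenly across slices
CREATE TABLE stage_events (event_id bigint, payload varchar(max)) DISTSTYLE EVEN;

-- Let Amazon Redshift switch from ALL to EVEN as the table grows
CREATE TABLE fact_orders (order_id bigint, product_id int, quantity int) DISTSTYLE AUTO;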

Sort keys

Amazon Redshift stores your data on disk in sorted order using the sort key. The Amazon Redshift query optimizer uses the sort order for optimal query plans. Review whether one of the raw columns used in the Vertica table’s Order By clause is the best column to use as the sort key in the Amazon Redshift table.

The order by fields in Vertica superprojections are good candidates for a sort key in Amazon Redshift, but the design criteria for sort order in Amazon Redshift differ from what you use in Vertica. In a Vertica projection’s Order By clause, you use the low-cardinality columns with high probability of having RLE encoding before the high-cardinality columns. In Amazon Redshift, you can set the SORTKEY to AUTO, choose a column as the SORTKEY, or define a compound sort key. You define compound sort keys using multiple columns, starting with the most frequently used column first. All the columns in the compound sort key are used, in the order in which they are listed, to sort the data. You can use a compound sort key when query predicates use a subset of the sort key columns in order. Amazon Redshift stores the table rows on disk in sorted order and uses metadata to track the minimum and maximum values for each 1 MB block, called a zone map. Amazon Redshift uses the zone map and the sort key for filtering blocks, thereby reducing the scanning cost to efficiently handle range-restricted predicates.

Profile the data for the columns used for sort keys. Make sure the first column of the sort key is not encoded. Choose timestamp columns or columns used in frequent range filtering, equality filtering, or joins as sort keys in Amazon Redshift.
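A minimal DDL sketch of a compound sort key, with the most frequently filtered column listed first (names are placeholders):

-- Hedged sketch: compound sort key on the range-filtered timestamp column first (placeholders)
CREATE TABLE my_schema.daily_sales (
    sale_date date,
    customer_id int,
    amount decimal(12,2)
)
DISTKEY (customer_id)
COMPOUND SORTKEY (sale_date, customer_id);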

Encoding

You don’t always have to select compression encodings; Amazon Redshift automatically assigns RAW compression for columns that are defined as sort keys, AZ64 compression for the numeric and timestamp columns, and LZO compression for the VARCHAR columns. When you select compression encodings manually, choose AZ64 for numeric and date/time data stored in Amazon Redshift. AZ64 encoding has consistently better performance and compression than LZO. It has comparable compression to ZSTD but much better performance.
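A short sketch of explicit column encodings; the names are placeholders, and the sort key column is left as RAW per the guidance above:

-- Hedged sketch: explicit column encodings (placeholders)
CREATE TABLE my_schema.orders (
    order_ts timestamp ENCODE RAW,        -- first sort key column left unencoded
    order_id bigint ENCODE AZ64,
    order_total decimal(12,2) ENCODE AZ64,
    order_status varchar(20) ENCODE LZO
)
SORTKEY (order_ts);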

Tooling

After we decide the data placement, cluster size, distribution keys, and sort keys, the next step is to look at the tooling for schema conversion and data migration.

You can use AWS Schema Conversion Tool (AWS SCT) to convert your schema, which can automate about 80% of the conversion, including the conversion of DISTKEY and SORTKEY, or you can choose to convert the Vertica DDLs to Amazon Redshift manually.

To efficiently migrate your data, you want to choose the right tools depending on the data size. If you have a dataset that is smaller than a couple of terabytes, you can migrate your data using AWS Data Migration Service (AWS DMS) or AWS SCT data extraction agents. When you have more than a few terabytes of data, your tool choice depends on your network connectivity. When there is no dedicated network connection, you can run the AWS SCT data extraction agents to copy the data to AWS Snowball Edge and ship the device back to AWS to complete the data export to Amazon S3. If you have a dedicated network connection to AWS, you can run the S3EXPORT or S3EXPORT_PARTITION commands available in Vertica 9.x directly from the Vertica nodes to copy the data in parallel to the S3 bucket.

The following diagram visualizes the migration process.

Schema conversion

AWS SCT uses extension pack schema to implement system functions of the source database that are required when writing your converted schema to your target database instance. Review the database migration assessment report for compatibility. AWS SCT can use source metadata and statistical information to determine the distribution key and sort key. AWS SCT adds a sort key in the Amazon Redshift table for the raw column used in the Vertica table’s Order By clause.

The following code is an example of Vertica CREATE TABLE and CREATE PROJECTION statements:

CREATE TABLE My_Schema.My_Table
(
    Product_id int,
    Product_name varchar(50),
    Product_type varchar(50),
    Product_category varchar(50),
    Quantity int,
    Created_at timestamp DEFAULT "sysdate"()
)
PARTITION BY (date_trunc('day', My_Table.Created_at));


CREATE PROJECTION My_Schema.My_Table_Projected
(
 Product_id ENCODING COMMONDELTA_COMP,
 Product_name,
 Product_type ENCODING RLE,
 Product_category ENCODING RLE,
 Quantity,
 Created_at ENCODING GCDDELTA
)
AS
 SELECT Product_id,
        Product_name,
        Product_type,
        Product_category,
        Quantity,
        Created_at
 FROM My_Schema.My_Table 
 ORDER BY Product_type,
          Product_category,
          Product_id,
          Product_name
SEGMENTED BY hash(Product_id) ALL NODES KSAFE 1;

The following code is the corresponding Amazon Redshift CREATE TABLE statement:

CREATE TABLE My_Schema.My_Table
(
    Product_id integer,
    Product_name varchar(50),
    Product_type varchar(50),
    Product_category varchar(50),
    Quantity integer,
    Created_at timestamp DEFAULT sysdate
)
DISTKEY (Product_id) 
SORTKEY (Created_at);

Data migration

To significantly reduce the data migration time from large Vertica clusters (if you have a dedicated network connection from your premises to AWS with good bandwidth), run the S3EXPORT or S3EXPORT_PARTITION function in Vertica 9.x, which exports the data in parallel from the Vertica nodes directly to Amazon S3.

The Parquet files generated by S3EXPORT don’t have any partition key on them, because partitioning consumes time and resources on the database where the S3EXPORT runs, which is typically the Vertica production database. The following code is one command you can use:

SELECT S3EXPORT( * USING PARAMETERS url='s3://myBucket/myTable') OVER(PARTITION BEST) FROM myTable;

The following code is another command option:

SELECT S3EXPORT_PARTITION(* USING PARAMETERS url='s3://mytable/bystate.date', multipart=false)
OVER (PARTITION by state, year) from myTable;

Performance

In this section, we look at best practices for ETL performance while copying the data from Amazon S3 to Amazon Redshift. We also discuss how to handle Vertica partition swapping and partition dropping scenarios in Amazon Redshift.

Copying using an Amazon S3 prefix

Make sure the ETL process is running from Amazon Elastic Compute Cloud (Amazon EC2) servers or other managed services within AWS. Exporting your data from Vertica as multiple files to Amazon S3 gives you the option to load your data in parallel to Amazon Redshift. While converting the Vertica ETL scripts, use the COPY command with an Amazon S3 object prefix to load an Amazon Redshift table in parallel from data files stored under that prefix on Amazon S3. See the following code:

copy mytable
from 's3://mybucket/data/mytable/' 
iam_role 'arn:aws:iam::<myaccount>:role/MyRedshiftRole';

Loading data using Amazon Redshift Spectrum queries

When you want to transform the exported Vertica data before loading to Amazon Redshift, or when you want to load only a subset of data into Amazon Redshift, use an Amazon Redshift Spectrum query. Create an external table in Amazon Redshift pointing to the exported Vertica data stored in Amazon S3 within an external schema. Put your transformation logic in a SELECT query, and ingest the result into Amazon Redshift using a CREATE TABLE or SELECT INTO statement:

CREATE TABLE mytable AS SELECT … FROM s3_external_schema.xxx WHERE …;

SELECT … INTO mytable FROM s3_external_schema.xxx WHERE …;

Handling Vertica partitions

Vertica has partitions, and the data loads use partition swapping and partition dropping. In Amazon Redshift, we can use the sort key, staging table, and alter table append to achieve similar results. First, the Amazon Redshift ETL job should use the sort key as filter conditions to insert the incremental data into a staging table or a temporary table in Amazon Redshift, for example the date from the MyTimeStamp column between yesterday and today. The ETL job should then delete data from the primary table that matches the filter conditions. The delete operation is very efficient in Amazon Redshift because of the sort key on the source partition column. The Amazon Redshift ETL jobs can then use alter table append to move the new data to the primary table. See the following code:

INSERT INTO stage_table
SELECT * FROM source_table
WHERE date_trunc('day', source_table.MyTimestamp) BETWEEN <yesterday> AND <today>;

DELETE FROM target_table_name
USING stage_table
WHERE <target_table_name.key> = <stage_table.key>;

ALTER TABLE target_table_name APPEND FROM stage_table
[ IGNOREEXTRA | FILLTARGET ]

Cluster management

When a Vertica node fails, Vertica remains queryable but the performance is degraded until all the data is restored to the recovered node. When an Amazon Redshift node fails, Amazon Redshift automatically detects and replaces a failed node in your data warehouse cluster and replays the ReadOnly queries. Amazon Redshift makes your replacement node available immediately and loads your most frequently accessed data from the S3 bucket first to allow you to resume querying your data as quickly as possible.

Vertica cluster resize, similar to Amazon Redshift classic resize, takes a few hours depending on data volume to rebalance the data when nodes are added or removed. With Amazon Redshift elastic resize, the cluster resize completes within minutes. We recommend elastic resize for most use cases to shorten the cluster downtime and schedule resizes to handle seasonal spikes in your workload.
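If you script cluster operations, an elastic resize can also be triggered from the AWS CLI; the following is a minimal sketch with placeholder values.

# Elastic resize to a new node count (placeholder cluster identifier)
aws redshift resize-cluster \
    --cluster-identifier <my-redshift-cluster> \
    --number-of-nodes 4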

Conclusion

This post shared some best practices for migrating your data warehouse from Vertica to Amazon Redshift. It also pointed out the differences between Amazon Redshift and Vertica in handling queries, data management, cluster management, and temporary tables. Create your cluster on the Amazon Redshift console and convert your schema using AWS SCT to start your migration to Amazon Redshift. If you have any questions or comments, please share your thoughts in the comments section.


About the Authors

Seetha Sarma is a Senior Database Solutions Architect with Amazon Web Services.

Veerendra Nayak is a Senior Database Solutions Architect with Amazon Web Services.