Find the best Amazon Redshift configuration for your workload using Redshift Test Drive


Amazon Redshift is a widely used, fully managed, petabyte-scale cloud data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads. With the launch of Amazon Redshift Serverless and the various deployment options Amazon Redshift provides (such as instance types and cluster sizes), customers are looking for tools that help them determine the optimal data warehouse configuration to support their Redshift workloads.

In this post, we answer that question by using Redshift Test Drive, an open-source tool that lets you evaluate which data warehouse configuration options are best suited for your workload. We created Redshift Test Drive from SimpleReplay and redshift-config-compare (see Compare different node types for your workload using Amazon Redshift for more details) to provide a single entry point for finding the best Amazon Redshift configuration for your workload. Redshift Test Drive also provides additional features, such as a self-hosted analysis UI and the ability to replicate external objects that a Redshift workload may interact with.

Amazon Redshift RA3 with managed storage is the newest instance type for provisioned clusters. It allows you to scale and pay for compute and storage independently, as well as use advanced features such as cross-cluster data sharing and cross-Availability Zone cluster relocation. Many customers using previous-generation instance types want to upgrade their clusters to RA3 instance types. In this post, we show you how to use Redshift Test Drive to evaluate the performance of an RA3 cluster configuration for your Redshift workloads.

Solution overview

At its core, Redshift Test Drive replicates a workload by extracting queries from the source Redshift data warehouse logs (shown as the Workload Extractor in the following figure) and replaying the extracted workload against the target Redshift data warehouses (Workload Replayer).

If these workloads interact with external objects via Amazon Redshift Spectrum (such as the AWS Glue Data Catalog) or COPY commands, Redshift Test Drive offers an external object replicator utility to clone these objects to facilitate replay.

Workload replicator architecture

Redshift Test Drive uses this process of workload replication for two main functionalities: comparing configurations and comparing replays.

Compare Amazon Redshift configurations

Redshift Test Drive’s ConfigCompare utility (based on the redshift-config-compare tool) helps you find the best Redshift data warehouse configuration by using your workload to run performance and functional tests on different configurations in parallel. The utility’s automation starts by creating a new AWS CloudFormation stack based on this CloudFormation template. The CloudFormation stack creates an AWS Step Functions state machine, which internally uses AWS Lambda functions to trigger AWS Batch jobs that run the workload comparison across different Redshift instance types. These jobs extract the workload from the source Redshift data warehouse log location for the specified workload time window (as provided in the config parameters) and then replay the extracted workload against the list of target Redshift data warehouse configurations provided in the configuration file. When the replay is complete, the Step Functions state machine uploads the performance stats for the target configurations to an Amazon Simple Storage Service (Amazon S3) bucket and creates external schemas that can be queried from any Redshift target to identify a configuration that meets your performance requirements.

The following diagram illustrates the architecture of this utility.

Architecture of ConfigCompare utility

Compare replay performance

Redshift Test Drive also provides the ability to compare replay runs visually using a self-hosted UI tool. This tool reads the stats generated by the workload replicator (stored in Amazon S3) and helps compare the replay runs across key performance indicators such as longest-running queries, error distribution, queries with the most latency deviation across runs, and more.

The following diagram illustrates the architecture for the UI.

Replay Performance analysis UI architecture

Walkthrough overview

In this post, we provide a step-by-step walkthrough of using Redshift Test Drive to automatically replay your workload against different Amazon Redshift configurations with the ConfigCompare utility. We then use the self-hosted analysis UI utility to analyze the output of ConfigCompare and determine the optimal target warehouse configuration to migrate or upgrade to. The following diagram illustrates the workflow.

Walkthrough Steps

Prerequisites

The following prerequisites should be addressed before we run the ConfigCompare utility:

  • Enable audit logging and user-activity logging in your source cluster.
  • Take a snapshot of the source Redshift data warehouse. (Example AWS CLI commands for these first two prerequisites are sketched after this list.)
  • Export your source parameter group and WLM configurations to Amazon S3. The parameter group can be exported using the AWS Command Line Interface (AWS CLI), for example, using CloudShell, by running the following code:
    aws redshift describe-cluster-parameters --parameter-group-name <YOUR-SOURCE-CLUSTER-PARAMETER-GROUP-NAME> --output json >> param_group_src.json
    
    aws s3 cp param_group_src.json s3://<YOUR-BUCKET-NAME>/param_group_src.json

  • The WLM configurations can be copied as JSON from the Amazon Redshift console, pasted into a file, and uploaded to Amazon S3. If you want to test any alternative WLM configurations (such as comparing manual vs. auto WLM or enabling concurrency scaling), you can create a separate file with that target configuration and upload it to Amazon S3 as well.
  • Identify the target configurations you want to test. If you’re upgrading from DC2 to RA3 node types, refer to Upgrading to RA3 node types for recommendations.
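
The following is a minimal sketch of how the first two prerequisites might be handled with the AWS CLI. The cluster identifier, bucket name, parameter group name, and snapshot identifier are hypothetical placeholders; note that user-activity logging is controlled by the enable_user_activity_logging parameter in the cluster’s parameter group rather than by enable-logging.

# Turn on audit logging to Amazon S3 for the source cluster (hypothetical names)
aws redshift enable-logging \
  --cluster-identifier my-source-cluster \
  --bucket-name my-audit-log-bucket \
  --s3-key-prefix RSLogs/

# User-activity logging is a parameter group setting, not part of enable-logging
aws redshift modify-cluster-parameter-group \
  --parameter-group-name my-source-parameter-group \
  --parameters ParameterName=enable_user_activity_logging,ParameterValue=true

# Take a manual snapshot of the source data warehouse
aws redshift create-cluster-snapshot \
  --cluster-identifier my-source-cluster \
  --snapshot-identifier redshift-test-drive-baseline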

For this walkthrough, let’s assume you have an existing Redshift data warehouse configuration with a two-node dc2.8xlarge provisioned cluster. You want to validate whether upgrading your current configuration to a decoupled architecture using the RA3 provisioned node type or Redshift Serverless would meet your workload price/performance requirements.

The following table summarizes the Redshift data warehouse configurations that are evaluated as part of this test.

Warehouse Type      | Number of Nodes / Base RPU | Option
dc2.8xlarge         | 2 nodes                    | default auto WLM
ra3.4xlarge         | 4 nodes                    | default auto WLM
Redshift Serverless | 64 RPU                     | auto scaling
Redshift Serverless | 128 RPU                    | auto scaling

Run the ConfigCompare utility

Before you run the utility, customize the details of the workload to replay, including the time period and the target warehouse configurations to test, in a JSON file. Upload this file to Amazon S3 and copy the S3 URI path to use as an input parameter for the CloudFormation template that deploys the resources for the remaining orchestration.

You can read more about the individual components and inputs of the JSON file in the Readme.

For our use case, we use the following JSON file as an input to the utility:

{
   "SNAPSHOT_ID": "redshift-cluster-manual-snapshot",
   "SNAPSHOT_ACCOUNT_ID": "123456789012",

   "PARAMETER_GROUP_CONFIG_S3_PATH": "s3://nodeconfig-artifacts/pg_config.json",

   "DDL_AND_COPY_SCRIPT_S3_PATH": "N/A",
   "SQL_SCRIPT_S3_PATH":"N/A",
   "NUMBER_OF_PARALLEL_SESSIONS_LIST": "N/A",
   "SIMPLE_REPLAY_LOG_LOCATION":"s3://redshift-logging-xxxxxxxx/RSLogs/",
   "SIMPLE_REPLAY_EXTRACT_START_TIME":"2023-01-28T15:45:00+00:00",
   "SIMPLE_REPLAY_EXTRACT_END_TIME":"2023-01-28T16:15:00+00:00",

   "SIMPLE_REPLAY_EXTRACT_OVERWRITE_S3_PATH":"N/A",
   "SIMPLE_REPLAY_OVERWRITE_S3_PATH":"N/A",

   "SIMPLE_REPLAY_UNLOAD_STATEMENTS": "true",

   "AUTO_PAUSE": true,
   "DATABASE_NAME": "database_name",

   "CONFIGURATIONS": [
    	{
      	"TYPE": "Provisioned",
      	"NODE_TYPE": "dc2.8xlarge",
      	"NUMBER_OF_NODES": "6",
      	"WLM_CONFIG_S3_PATH": "s3://nodeconfig-artifacts/wlm.json"
     },
     {
      	"TYPE": "Provisioned",
      	"NODE_TYPE": "ra3.4xlarge",
      	"NUMBER_OF_NODES": "12",
      	"WLM_CONFIG_S3_PATH": "s3://nodeconfig-artifacts/wlm.json"
     },
     {
      	"TYPE": "Serverless",
      	"BASE_RPU": "128"
     },
     {
      	"TYPE": "Serverless",
      	"BASE_RPU": "64"
     }
   ]
}

The utility deploys all the data warehouse configurations included in the CONFIGURATIONS section of the JSON file. A replica of the source configuration is also included to be used for a baseline of the existing workload performance.
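
As a hedged example, the configuration file could be uploaded with the AWS CLI before launching the stack; the file name and S3 key shown here are hypothetical placeholders.

# Upload the ConfigCompare input file and note its S3 URI for the CloudFormation parameters
aws s3 cp config_compare_input.json s3://nodeconfig-artifacts/config_compare_input.json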

After this file is fully configured and uploaded to Amazon S3, navigate to the AWS CloudFormation console and create a new stack based on this CloudFormation template, specifying the relevant parameters. For more details on the individual parameters, refer to the GitHub repo. The following screenshot shows the parameters used for this walkthrough.

Configuration parameters for Cloudformation Template

After the parameters are set, proceed with the subsequent steps on the AWS CloudFormation console to launch the stack.
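
If you prefer the AWS CLI over the console, a stack can be launched with a command along the following lines. This is a sketch only: the stack name is arbitrary, and the parameter key shown (ConfigJsonS3Path) is hypothetical; use the parameter names defined in the actual template.

# Launch the ConfigCompare stack from a downloaded copy of the template (hypothetical parameter key)
aws cloudformation create-stack \
  --stack-name redshift-config-compare \
  --template-body file://redshift-config-compare.yaml \
  --parameters ParameterKey=ConfigJsonS3Path,ParameterValue=s3://nodeconfig-artifacts/config_compare_input.json \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM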

When the stack is fully created, select the stack and open the Resources tab. Search for the term StepFunctions and choose the hyperlink next to the RedshiftConfigTestingStepFunction physical ID to open the Step Functions state machine that runs the utility.

Searching for ConfigTestingStepFunction

On the Step Functions page that opens, choose Start execution. Leave the default values and choose Start execution to trigger the run. Monitor the progress of the state machine’s run on the graph view of the page. The full run will take approximately the same time as the time window that was specified in the JSON configuration file.

StepFunction Execution example

When the status of the run changes from Running to Succeeded, the run is complete.
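
As a hedged alternative to watching the graph view, the run can also be started and monitored from the AWS CLI; the state machine and execution ARNs below are hypothetical placeholders.

# Start a run of the ConfigCompare state machine (hypothetical ARN)
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:RedshiftConfigTestingStepFunction

# Check whether the execution is still RUNNING or has SUCCEEDED/FAILED
aws stepfunctions describe-execution \
  --execution-arn arn:aws:states:us-east-1:123456789012:execution:RedshiftConfigTestingStepFunction:my-run \
  --query 'status'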

Analyze the results

When the Step Functions state machine run is complete, the performance metrics are uploaded to the S3 bucket created by the CloudFormation template initially. To analyze the performance of the workload across different configurations, you can use the self-hosted UI tool that comes with Redshift Test Drive. Set up this tool for your workload by following the instructions provided in this Readme.
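
Before pointing the UI at the results, you may want to confirm that the replay stats were written to the expected location. The bucket and prefix below are hypothetical; use the S3 bucket created by the CloudFormation stack.

# List the performance stats produced by the replays (hypothetical bucket and prefix)
aws s3 ls s3://redshift-config-compare-results/analysis/ --recursive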

After you point the UI to the S3 location that has the stats from the ConfigCompare run, the Replays section will populate with the analysis for replays found in the input S3 location. Select the target configurations you want to compare and choose Analysis to navigate to the comparisons page.

AnalysisUI example list of replays

You can use the Filter Results section to choose which query types, users, and time frame to compare, and the Analysis section will expand to provide an analysis of all the selected replays. Here you can see a comparison of the SELECT queries run by the ad hoc user during the replay.

AnalysisUI filter results

The following screenshot shows an example of the analysis of a replay. These results show the distribution of queries completed over the full run for a given user and query type, allowing us to identify periods of high and low activity. We can also see the runtimes of these queries, aggregated as percentiles, average, and standard deviation. For example, the P50 value indicates that 50% of queries ran within 26.564 seconds. The filters for specific users, query types, and runtimes can be updated dynamically, so you can investigate the results and comparisons against the specific performance requirements of your use case.

AnalysisUI compare throughput example

Troubleshooting

As shown in the solution architecture, the main moving parts in the ConfigCompare automation are AWS CloudFormation, Step Functions (internally using Lambda), and AWS Batch.

If any resource in the CloudFormation stack fails to deploy, we recommend troubleshooting the issue based on the error shown on the AWS CloudFormation console.
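
A hedged way to surface the underlying deployment error from the CLI is to list the stack events and look for the first CREATE_FAILED resource; the stack name is the one you chose when launching the template.

# Show failed resources and their status reasons for the stack
aws cloudformation describe-stack-events \
  --stack-name redshift-config-compare \
  --query "StackEvents[?ResourceStatus=='CREATE_FAILED'].[LogicalResourceId,ResourceStatusReason]" \
  --output table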

To troubleshoot errors with the Step Functions state machine, locate the Amazon CloudWatch logs for a step by navigating to the state machine’s latest run on the Step Functions console and choosing CloudWatch Logs for the failed Step Functions step. After resolving the error, you can restart the state machine by choosing New execution.

Troubleshooting Step Function

For AWS Batch errors, locate the AWS Batch logs by navigating to the AWS CloudFormation console and choosing the Resources tab in the CloudFormation stack. On this tab, search for LogGroup to find the AWS Batch run logs.
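
Once you have the log group name from the Resources tab, the AWS Batch logs can also be pulled with the AWS CLI (v2); the log group name below is a hypothetical placeholder.

# Tail the last hour of AWS Batch job logs for the utility (hypothetical log group name)
aws logs tail /aws/batch/redshift-config-compare --since 1h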

Troubleshooting Cloudwatch logs

For more information about common errors and their solutions, refer to the Test Drive Readme.

Clean up

When you have completed the evaluation, we recommend manually deleting the deployed Redshift warehouses to avoid any on-demand charges that could accrue. After this, you can delete the CloudFormation stack to clean up other resources.
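
A hedged sketch of the cleanup with the AWS CLI follows; the cluster, workgroup, namespace, and stack names are hypothetical placeholders for the resources created for your run.

# Delete a provisioned test cluster without keeping a final snapshot
aws redshift delete-cluster \
  --cluster-identifier ra3-4xlarge-target \
  --skip-final-cluster-snapshot

# Delete a Redshift Serverless test workgroup and its namespace
aws redshift-serverless delete-workgroup --workgroup-name config-compare-64rpu
aws redshift-serverless delete-namespace --namespace-name config-compare-64rpu

# Remove the remaining resources created by the template
aws cloudformation delete-stack --stack-name redshift-config-compare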

Limitations

Some of the limitations for the WorkloadReplicator (the core utility supporting the ConfigCompare tool) are outlined in the Readme.

Conclusion

In this post, we demonstrated the process of finding the right Redshift data warehouse configuration using Redshift Test Drive. The tool offers an easy-to-use way to replay the workload of your choice against customizable data warehouse configurations. It also provides a self-hosted analysis UI to help you dive deeper into the stats generated during the replay process.

Get started with Test Drive today by following the instructions provided in the Readme. For an in-depth overview of the config compare automation, refer to Compare different node types for your workload using Amazon Redshift. If you’re migrating from DC2 or DS2 node types to RA3, refer to our recommendations on node count and type as a benchmark.


About the Authors

Sathiish Kumar is a Software Development Manager at Amazon Redshift and has worked on building end-to-end applications using different database and technology solutions over the last 10 years. He is passionate about helping his customers find the quickest and the most optimized solution to their problems by leveraging open-source technologies.

Julia Beck is an Analytics Specialist Solutions Architect at AWS. She supports customers in validating analytics solutions by architecting proof of concept workloads designed to meet their specific needs.

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 16 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with cloud solutions.