How to evaluate the benefits of AQUA for your Amazon Redshift workloads

Post Syndicated from Dinesh Kumar original https://aws.amazon.com/blogs/big-data/how-to-evaluate-the-benefits-of-aqua-for-your-amazon-redshift-workloads/

Amazon Redshift is the cloud data warehouse of choice for tens of thousands of customers, who use it to analyze exabytes of data to gain business insights. Customers have been asking us for better performance at scale as the volume, variety, velocity, and veracity of their data grows. We have added several features to Amazon Redshift that enable you to get up to three times better price performance with Amazon Redshift than other cloud data warehouses.

In addition, we launched AQUA (Advanced Query Accelerator) for Amazon Redshift to help you cost-effectively run analytics at the new scale of data. AQUA is a distributed and hardware-accelerated cache that enables Amazon Redshift to run an order of magnitude faster than other enterprise cloud data warehouses by automatically boosting certain types of queries. AQUA uses AWS-designed processors with AWS Nitro chips adapted to speed up data encryption and compression, and custom analytics processors, implemented in FPGAs, to accelerate operations such as scans, filtering, and aggregation. AQUA is available with the RA3.16xlarge, RA3.4xlarge, and RA3.xlplus nodes at no additional charge and requires no code changes. You can enable AQUA for your existing Amazon Redshift RA3 clusters or launch a new AQUA-enabled RA3 cluster via the AWS Management Console, API, or AWS Command Line Interface (AWS CLI). To learn more about AQUA, see Working with AQUA (Advanced Query Accelerator).

This post walks you through steps to quantify expected benefits from AQUA for Amazon Redshift for your workloads. We cover the methodology used for testing AQUA and share the scripts, a sample dataset, and queries so you can test AQUA in your own environment. We have published scripts on GitHub (along with a README file), which you need as you follow the steps in this post. Download all scripts to your working directory from where you intend to connect to your Redshift clusters.

Solution overview

At a high-level, the AQUA test process involves the following steps:

  1. Create a test cluster to evaluate AQUA and make sure that AQUA is turned on. For instructions, see Working with AQUA (Advanced Query Accelerator).
  2. Analyze your workload for AQUA and capture eligible queries. To learn more about the types of queries accelerated by AQUA, refer to When does Amazon Redshift use AQUA to run queries?
  3. Run the workload with AQUA activated and deactivated on your test cluster.
  4. Compare performance results.

This approach is a straightforward yet practical way to measure the performance gain you can expect to get for your workload by turning on AQUA.

With the exception of one workload parsing and analyzing script (aqua_capture_query.sh), you can run the rest of the scripts on a test cluster. We recommend using a test cluster to minimize the impact to your production cluster.

Create a test cluster

Create a snapshot from a production cluster that has read-heavy workloads and restore it as test cluster.

Analyze your workload and capture eligible queries

Run aqua_capture_query.sh on the production cluster to determine the queries suitable for acceleration by AQUA. We recommend choosing workloads with SELECT queries that use LIKE or SIMILAR TO functions in the WHERE clause and scan, filter, and aggregate large datasets. aqua_capture_query.sh scans the query history on your production cluster and captures queries that AQUA can accelerate.

The script runtime may exceed several minutes depending on the selected time interval (analyze_starttime and analyze_endtime), size of the cluster, and workload complexity. We suggest restricting the script runtime by limiting the time interval to the lesser of your workload runtime or 3 hours. Alternately, you can evaluate AQUA using the Amazon Reviews sample dataset, which we demonstrate later in this post.

aqua_capture_query.sh saves an output file named aqua_eligible_queries to your working directory and contains a subset of most suitable AQUA-eligible queries.

The following is a sample output by the script:

select count(*) from amazon_reviews where product_title SIMILAR TO '%lap%' group by star_rating ORDER BY star_rating desc;
select count(*) from amazon_reviews where product_title ilike '%e%|%E%' or customer_ID like '3%__%45__3';

If your workload history doesn’t have enough AQUA-eligible queries, the script reports no eligible queries found. If this happens, run the script with different date/time parameters. If you still don’t see any queries, you can try using the sample dataset and queries provided in this post. For more information about AQUA-eligible queries, refer to When does Amazon Redshift use AQUA to run queries?

Run the workload on your test cluster

Run aqua_execute_query.sh on your test cluster. The script runs the captured queries on your cluster repeatedly to get consistent performance by reducing the impact of runtime difference due to environmental factors with AQUA activated and deactivated. The script records the start date/time and end date/time to a file named workload_datetime.txt in your working directory.

Compare performance results

When aqua_execute_query.sh script is complete, run aqua_perf_compare.sh, which generates a CSV file named aqua_benefit in your working directory. The following table summarizes the sample output.

Query ID Amazon Redshift with AQUA turned on Query ID Amazon Redshift with AQUA turned off Runtime in seconds with AQUA turned on Runtime in seconds with AQUA turned off Speedup
(Column C/Column B)
1153194 1153370 2.7 59.4 22.0
1153214 1153456 22.8 104.9 4.6
334629 334631 1.66643 35.8 22.25
334850 334672 1.71297 26.9 15.7
334984 334998 1.85051 26.9 14.5
334830 334793 6.47884 87.2 13.4
334828 334740 6.6974 87.1 13

This table shows the query identifiers and runtime of the queries with AQUA activated and deactivated. You can compare the benefits offered by AQUA by reviewing the speedup column.

Example use case with the Amazon Reviews dataset

To test AQUA with the Amazon Reviews sample dataset, perform the following steps:

  1. Create a two-node RA3.4xlarge cluster by issuing the following command:
    aws redshift create-cluster --cluster-identifier test-amazon-reviews --node-type ra3.4xlarge --number-of-nodes 2 --master-username adminuser --master-user-password <xxpasswordxx> --aqua-configuration-status enabled

  2. Create a test database on our Amazon Redshift cluster by issuing the following command:
    CREATE DATABASE TestDB;

  3. Load the table with the Amazon Reviews dataset by running the script load_amazon_sentiments_data.sql.
  4. Run a few AQUA-eligible queries (similar to the following) multiple times with and without AQUA activated using the script execute_test_queries.sh:
    select count(*) from amazon_reviews WHERE product_title SIMILAR TO '%lap%' or product_title SIMILAR TO '%hope%' or product_title SIMILAR TO '%nice%' or product_title SIMILAR TO '%soa%';

More evaluation queries are available on the GitHub repo.

  1. Run aqua_perf_compare.sh on the test cluster.

The following table shows that AQUA accelerated the queries 5–22 times faster.

Query ID Amazon Redshift with AQUA turned on Query ID Amazon Redshift with AQUA turned off Runtime in seconds with AQUA turned on Runtime in seconds with AQUA turned off Speedup
(Column C/Column B)
364202 364217 2.33411 51.56207 22.09073
334984 334998 1.85051 26.9 14.5
334830 334793 6.47884 87.2 13.4
334828 334740 6.6974 87.1 13

Summary

This post provides a straightforward yet practical way to measure the performance gain you can expect to get for your workload by turning on AQUA. You can use the scripts provided and test with your own dataset and queries, or use the sample dataset and queries to determine AQUA’s impact.

We continue to invest and launch new capabilities like AQUA for Amazon Redshift to make sure Amazon Redshift continues to improve as your data warehouse needs grow. For AQUA, we continue to add hardware acceleration for more SQL operators, functions, predicates, data types, and file formats so more scans, filters, and aggregations can be pushed down to AQUA. Pushdowns to AQUA remain transparent so Amazon Redshift decides when to push queries down to AQUA to take advantage of hardware acceleration. And when queries don’t get pushed down to AQUA, they continue to run on Amazon Redshift as before.

We invite you to test AQUA for yourself and share the findings.


About the Authors

Dinesh Kumar is a Database Engineer at AWS focusing on Amazon AQUA. He works with customers to build highly scalable data warehouse and high performant database solutions. Outside work, he enjoys gardening and spending time with his family.

Manan Goel is a Product Go-To-Market Leader for AWS Analytics Services including Amazon Redshift & AQUA at AWS. He has more than 25 years of experience and is well versed with databases, data warehousing, business intelligence, and analytics. Manan holds a MBA from Duke University and a BS in Electronics & Communications engineering.