Use AQUA with Amazon Redshift RA3.xlplus nodes

Post Syndicated from Quan Li original https://aws.amazon.com/blogs/big-data/use-aqua-with-amazon-redshift-ra3-xlplus-nodes/

Amazon Redshift RA3 is the latest generation node type that allows you to scale compute and storage for your data warehouses independently. The RA3 node family includes RA3.16xlarge, RA3.4xlarge, and RA3.xlplus nodes for large, medium, and small workloads, respectively. RA3.xlplus, the latest member of the RA3 node family, offers one third of the computing power of RA3.4xlarge and costs one third of the price. RA3.xlplus is the smallest node in the RA3 family, but it offers the same advanced functionalities. It has been widely used in environments with light computing demand such as QA, data analytics for small teams, or processing smaller datasets.

In 2021, Amazon Redshift introduced AQUA (Advanced Query Accelerator) for Amazon Redshift to boost performance of analytical queries that scan, filter, and aggregate large datasets. AQUA uses AWS-designed processors with the AWS Nitro chip adapter to speed up data encryption and compression, and custom analytical processors implemented in FPGAs to accelerate applications requiring text search of a very large dataset, such as marketing and personalization.

Customers have asked us to support AQUA for RA3.xlplus, and we recently launched AQUA for RA3.xlplus nodes. In this post, we continue to build on the post AQUA (Advanced Query Accelerator) – A Speed Boost for Your Amazon Redshift Queries and show that with AQUA support, RA3.xlplus provides the same benefit as the existing supported RA3 nodes in the following areas:

Automatically boosting certain types of queries
Reducing the impact on your Amazon Redshift cluster by offloading certain queries that scan, filter, and aggregate large datasets to AQUA

Test environment

To test AQUA for RA3.xlplus, we started by creating an RA3.xlplus cluster with the following details:

Amazon Redshift cluster – 2-node RA3.xlplus
Dataset – 3 TB TPC-DS, 3 TB TPC-H
Query set – Sample queries based on the TPC-H and TPC-DS workload

Sample queries

To test AQUA, we created six text search queries that scan, filter, and aggregate the lineitem table in the TPC-H dataset, which has 18 billion rows with a WHERE clause predicate against the l_comment column.

The following table summarizes our table definition.

table	encoded	diststyle	sortkey1	rows
lineitem	Y	KEY	l_shipdate	18,000,048,306

We randomly generated a query set with queries of various complexity. The queries are designed to measure scan cost, which are an area of focus for AQUA. Each query has a predicate with LIKE and OR. The number of LIKE or OR predicates gets progressively higher to simulate complex workloads.

For example, Query 1 has one OR predicate:

SELECT COUNT(l_orderkey)
FROM lineitem
WHERE (l_comment LIKE '%across%') OR (l_comment LIKE '%brave,%');

In contrast, Query 4 has 50 OR predicates:

SELECT COUNT(l_orderkey)
  FROM lineitem
  WHERE (l_comment LIKE '%outsi%') OR
  (l_comment LIKE '%uthless%') OR
  (l_comment LIKE '%capades%') OR
  (l_comment LIKE '%horses%') OR
  (l_comment LIKE '%ornis%' AND l_comment LIKE '%phins?%') OR
  (l_comment LIKE '%affix%') OR
  (l_comment LIKE '%integrat%') OR
....
  (l_comment LIKE '%ithin%' AND l_comment LIKE '%quiet%') OR
  (l_comment LIKE '%taphs%') OR
  (l_comment LIKE '%dugouts%' AND l_comment LIKE '%ches%') OR
  (l_comment LIKE '%telets%' AND l_comment LIKE '%detect!%') OR
  (l_comment LIKE '%grow%') OR
  (l_comment LIKE '%promise!%') OR
  (l_comment LIKE '%was%') OR
  (l_comment LIKE '%accounts%') OR
  (l_comment LIKE '%idly%' AND l_comment LIKE '%deposits%') OR
  (l_comment LIKE '%integrate!%' AND l_comment LIKE '%depend%') OR
  (l_comment LIKE '%ins%' AND l_comment LIKE '%uses!%') OR
  (l_comment LIKE '%epitaphs!%' AND l_comment LIKE '%breac%') OR
  (l_comment LIKE '%pliers%' AND l_comment LIKE '%phins%') OR
  (l_comment LIKE '%hogs%' AND l_comment LIKE '%sentiments%') OR
  (l_comment LIKE '%ctions%' AND l_comment LIKE '%daringly%') OR
  (l_comment LIKE '%ies%' AND l_comment LIKE '%esias%');

The following table summarizes the complexity of each query.

Query Number	Number of OR	Number of LIKE
Query 1	1	2
Query 2	5	7
Query 3	10	12
Query 4	50	66

Scan performance improvement with AQUA

We ran the four queries sequentially without any other workload on the system. With AQUA, the performance improvements range from approximately 7–13 times faster, as summarized in the following table.

Query Number	Amazon Redshift with AQUA (seconds)	Amazon Redshift Only (seconds)	Improvement
Query 1	78.53	635.89	709.74%
Query 2	92.75	810.04	773.36%
Query 3	130.68	956.83	632.19%
Query 4	137.68	1950.9	1316.98%

AQUA impact on multiple workloads

In this environment, we simulated a multi-user workflow using TPC-DS queries on the Amazon Redshift cluster. We recorded query runtime for three scenarios:

Baseline – We measured the end-to-end runtime running all TPC-DS queries serially on the Amazon Redshift cluster. In this scenario, AQUA was off and no additional workload was run (a single user was on the cluster).
Baseline with additional workload – This was the same as the baseline scenario with an additional workload run in parallel. We simulated a user load by running text scan queries randomly selected from Query 1, Query 2 and Query 3. These queries have relatively short runtimes. We had two variations of this scenario:
- AQUA turned off
- AQUA turned on

From the results, we observed the following:

With AQUA turned on for all workloads, the impact of a text scan query on the baseline runtime was negligible.
Without AQUA, the baseline runtime was impacted by the additional workload created with text scan queries. In our case, overhead was about 31%.

	Baseline	Baseline with additional workload		Improvement with AQUA
		AQUA turned off	AQUA turned on
TPC-DS End-to-End Time	3:43:35	4:54:50	3:44:36	31.27%

Single-node RA3.xlplus support

AQUA also supports the recently released Amazon Redshift single-node RA3.xlplus. In a single-node configuration, the resource is shared among all Amazon Redshift operations, which are traditionally handled separately by a leader node and compute nodes. A single-node configuration is commonly used in a personal or small group environment for data exploration.

We ran the same set of queries as before using Query 1, 2 and Query 3. The results demonstrated that AQUA provides a similar level of accelerations for these queries in a single-node environment.

Query Number	Amazon Redshift with AQUA (seconds)	Amazon Redshift Only (seconds)	Improvement
Query 1	157.91	1,254.03	694.13%
Query 2	193.64	2,037.79	952.36%
Query 3	260.75	2,495.85	857.19%

Summary

In this post, we ran a set of simulated performance tests on the Amazon Redshift RA3.xlplus platform with AQUA. With AQUA on, RA3.xlplus provides the same benefit as earlier supported platforms. It provides a query scan performance boost with AQUA-supported operators, which will expand over time. It can reduce the performance impact of your existing workflow by offloading the scan to AQUA.

We invite you to share your comments and use cases with the Amazon Redshift AQUA team.

For more information about how AQUA accelerates Amazon Redshift, see AQUA (Advanced Query Accelerator) for Amazon Redshift.

For more information about queries accelerated by AQUA, see When does Amazon Redshift use AQUA to run queries?

About the Authors

Quan Li is a Senior Database Engineer at Amazon Redshift. His focus is enabling customers to deliver maximum business value. Quan is passionate about optimizing high-performance analytical databases. During his spare time, he enjoys traveling and experiencing different types of cuisines with his family.

Steffen Rochel is a Sr. Software Development Manager at AWS. He is focused on data analytics acceleration. He has expertise in hardware-software design and operation of large-scale, high-performance distributed systems.

Noise