Optimizing tables in Amazon Redshift using Automatic Table Optimization

Post Syndicated from Paul Lappas original https://aws.amazon.com/blogs/big-data/optimizing-tables-in-amazon-redshift-using-automatic-table-optimization/

Amazon Redshift is the most popular and fastest cloud data warehouse that lets you easily gain insights from all your data using standard SQL and your existing business intelligence (BI) tools. Amazon Redshift automates common maintenance tasks and is self-learning, self-optimizing, and constantly adapting to your actual workload to deliver the best possible performance.

Amazon Redshift has several features that automate performance tuning: automatic vacuum delete, automatic table sort, automatic analyze, and Amazon Redshift Advisor for actionable insights into optimizing cost and performance. In addition, automatic workload management (WLM) makes sure that you use cluster resources efficiently, even with dynamic and unpredictable workloads. Amazon Redshift can even automatically refresh and rewrite materialized views, speeding up query performance by orders of magnitude with pre-computed results. These capabilities use machine learning (ML) to adapt as your workloads shift, enabling you to get insights faster without spending valuable time managing your data warehouse.

Although Amazon Redshift provides industry-leading performance out of the box for most workloads, some queries benefit even more by pre-sorting and rearranging how data is physically set up on disk. In Amazon Redshift, you can set the proper sort and distribution keys for tables and allow for significant performance improvements for the most demanding workloads.

Automatic table optimization is a new self-tuning capability that helps you achieve the performance benefits of sort and distribution keys without manual effort. Automatic table optimization continuously observes how queries interact with tables and uses ML to select the best sort and distribution keys to optimize performance for the cluster’s workload. If Amazon Redshift determines that applying a key will improve cluster performance, tables are automatically altered within hours without requiring administrator intervention. Optimizations made by the Automatic table optimization feature have been shown to increase cluster performance by 24% and 34% using the 3 TB and 30 TB TPC-DS benchmark, respectively, versus a cluster without Automatic table optimization.

In this post, we illustrate how you can take advantage of the Automatic table optimization feature for your workloads and easily manage several thousands of tables with zero administration.

Solution overview

The following diagram is an architectural illustration of how Automatic table optimization works:

As you can notice, as users query the data in Amazon Redshift, automatic table optimization collects the query statistics that are analyzed using a machine learning service to predict recommendations about the sort and distribution keys. These recommendations are later applied using online ALTER statements into the respective Amazon Redshift tables automatically.

For this post, we consider a simplified version of the of the star schema benchmark (SSB), which consists of the lineitem fact table along with the part and orders dimensional tables.

We use the preceding dimensional setup to create the tables using the defaults and illustrate how Automatic table optimization can automatically optimize it based on the query patterns.

To try this solution in your AWS account, you need access to an Amazon Redshift cluster and a SQL client such as SQLWorkbench/J. For more information, see Create a sample Amazon Redshift cluster.

Creating SSB tables using the defaults

Let’s start with creating a representative set of tables from the SSB schema and letting Amazon Redshift pick the default settings for the table design.

Create the following tables to set up the dimensional model for the retail system dataset:

CREATE TABLE orders 
(
  O_ORDERKEY        BIGINT NOT NULL,
  O_CUSTKEY         BIGINT,
  O_ORDERSTATUS     VARCHAR(1),
  O_TOTALPRICE      DECIMAL(18,4),
  O_ORDERDATE       DATE,
  O_ORDERPRIORITY   VARCHAR(15),
  O_CLERK           VARCHAR(15),
  O_SHIPPRIORITY    INTEGER,
  O_COMMENT         VARCHAR(79)
);
CREATE TABLE part 
(
  P_PARTKEY       BIGINT NOT NULL,
  P_NAME          VARCHAR(55),
  P_MFGR          VARCHAR(25),
  P_BRAND         VARCHAR(10),
  P_TYPE          VARCHAR(25),
  P_SIZE          INTEGER,
  P_CONTAINER     VARCHAR(10),
  P_RETAILPRICE   DECIMAL(18,4),
  P_COMMENT       VARCHAR(23)
);

CREATE TABLE lineitem 
(
  L_ORDERKEY        BIGINT NOT NULL,
  L_PARTKEY         BIGINT,
  L_SUPPKEY         BIGINT,
  L_LINENUMBER      INTEGER NOT NULL,
  L_QUANTITY        DECIMAL(18,4),
  L_EXTENDEDPRICE   DECIMAL(18,4),
  L_DISCOUNT        DECIMAL(18,4),
  L_TAX             DECIMAL(18,4),
  L_RETURNFLAG      VARCHAR(1),
  L_LINESTATUS      VARCHAR(1),
  L_SHIPDATE        DATE,
  L_COMMITDATE      DATE,
  L_RECEIPTDATE     DATE,
  L_SHIPINSTRUCT    VARCHAR(25),
  L_SHIPMODE        VARCHAR(10),
  L_COMMENT         VARCHAR(44)
);

As you can see from the table DDL, apart from the table column definition, no other options are specified. Amazon Redshift defaults the sort key and distribution style to AUTO.

We now load data from the public Amazon Simple Storage Service (Amazon S3) bucket to our new tables. Use any SQL client tool and run the following command, providing your AWS account ID and Amazon Redshift role:

COPY orders from 's3://salamander-us-east-1/atoblog/orders/' iam_role 'arn:aws:iam::[Your-AWS_Account_Id]:role/[Your-Redshift-Role]'  CSV gzip region 'us-east-1';
COPY part from 's3://salamander-us-east-1/atoblog/part/' iam_role 'arn:aws:iam::[Your-AWS_Account_Id]:role/[Your-Redshift-Role]'  CSV gzip region 'us-east-1';
COPY lineitem from 's3://salamander-us-east-1/atoblog/lineitem/' iam_role 'arn:aws:iam::[Your-AWS_Account_Id]:role/[Your-Redshift-Role]'  CSV gzip region 'us-east-1';

Wait until the table COPY is complete.

Amazon Redshift automatically assigns the data encoding for the columns and chooses the sort and distribution style based on the size of the table.

Use the following query to review the decisions that Amazon Redshift makes for the column encoding:

/* Find the column encoding */
SELECT tablename,
       "column",
       encoding
FROM pg_table_def
WHERE tablename = 'lineitem'
AND   schemaname = 'public' LIMIT 5;
tablename column encoding
lineitem	l_orderkey	az64
lineitem	l_partkey	az64
lineitem	l_suppkey	az64
lineitem	l_linenumber	az64
lineitem	l_quantity	az64

Verify the table design choices for the sort and distribution key with the following code:

SELECT "table",
       diststyle,
       sortkey1
FROM svv_table_info
WHERE "table" IN ('part','lineitem','orders')
table diststyle sortkey1;
--Output
part	AUTO(ALL)	AUTO(SORTKEY)
orders	AUTO(ALL)	AUTO(SORTKEY)
lineitem	AUTO(EVEN)	AUTO(SORTKEY)

The tables distribution is set to AUTO(EVEN) or AUTO(ALL), depending on the size of table, and sort key is AUTO(SORTKEY).

Until now, because no active workloads were ran against these tables, no specific key choices have been made other than marking them as AUTO.

Querying the SSB tables to emulate the workload

Now end-users can use the created tables, and Amazon Redshift can support out-of-box performance.

The following are some sample queries that we can run using this SSB schema. These queries are run a few repeated times to have Amazon Redshift learn the access patterns for sort and distribution key optimization. To run the query several times, we use the \watch option available with the psql client. Otherwise just run this a few dozen times:

$ psql -h example-corp.cfgio0kcsmjy.us-west-2.redshift.amazonaws.com -U awsuser -d dw -p 5492
dw=# # \timing on
Timing is on.
--turn off result set cache so that each query execution is counted towards a workload sample

dw=# set enable_result_cache_for_session to off;
SET
dw=# /* query 1 */ SELECT L_SHIPMODE,SUM(l_quantity) AS quantity FROM lineitem JOIN part ON P_PARTKEY = l_PARTKEY where L_SHIPDATE='1992-02-28' GROUP BY L_SHIPMODE;
 l_shipmode |  quantity   
------------+-------------
 MAIL       | 436272.0000
 FOB        | 440959.0000
Time: 10020.200 ms
dw=# \watch 2
dw=# /* query 2 */ SELECT COUNT(o_orderkey) AS orders_count, SUM(l_quantity) AS quantity FROM lineitem JOIN orders ON l_orderkey = o_orderkey WHERE L_SHIPDATE = '1992-02-29'
Time: 8932.200 ms
dw=# \watch 2

The preceding queries are run a few hundred times every 2 seconds, and you can press Ctrl+C to cancel the queries.

Alternatively, you can also use the query editor to schedule the query and run it multiple times.

Reviewing recommended sort and distribution keys

Automatic table optimization uses Amazon Redshift Advisor sort and distribution key recommendations. The Advisor continuously monitors the cluster’s workload and proposes the right sort and distribution keys to improve query speed. With Automatic Table Optimization, the Advisor recommendations are visible in the SVV_ALTER_TABLE_RECOMMENDATIONS system table. This view shows recommendations for all tables, whether or not they are defined for automatic optimization. Recommendations that have auto_eligible = False are not automatically applied, but you can run the DDL to apply the recommendation manually. See the following code:

select * from svv_alter_table_recommendations
type      | database | table_id | group_id | ddl                                                                                                                                                 | auto_eligible
diststyle | db0      | 117892   | 2        | ALTER TABLE /*dkru-558bc9ee-468a-457a-99a9-e73ee7da1a18-g0-0*/ "public"."orders" ALTER DISTSTYLE KEY DISTKEY "o_orderkey" SORT                                                                               | t
 diststyle | db0      | 117885   | 1        | ALTER TABLE /*dkru-558bc9ee-468a-457a-99a9-e73ee7da1a18-g0-1*/ "public"."lineitem" ALTER DISTSTYLE KEY DISTKEY "l_orderkey" SORT| t
 sortkey   | db0      | 117890   | -1       | ALTER TABLE /*skru-15a98513-cf0f-46e8-b454-8bf61ee30c6e-g0-0*/ "public"."lineitem" ALTER SORTKEY ("l_shipdate");|t

Applying recommendations to the target tables

Amazon Redshift takes advantage of the new Automatic table optimization feature to apply the optimization made by the Advisor to the target tables. The conversion is run by the automation during periods of low workload intensity so as to minimize impact on user queries. This can be verified by running the following query:

SELECT "table",
       diststyle,
       sortkey1
FROM svv_table_info
WHERE "table" IN ('part','lineitem','orders')
AND   SCHEMA = 'public';
table,diststyle,sortkey1
part	AUTO(EVEN)	AUTO(SORTKEY)
lineitem	AUTO(KEY(l_orderkey))	AUTO(SORTKEY(l_shipdate)) 
orders	AUTO(KEY(o_orderkey))	AUTO(SORTKEY)

You can view all the optimizations that are applied on the tables using the following query:

select * from svl_auto_worker_action 
1395658	sortkey	Start                                                                                                                           	2020-10-27 22:37:23.367456	0	                                                                                                                                                                                                        
1395658	sortkey	Complete                                                                                                                        	2020-10-27 22:37:23.936958	0	SORTKEY: None;

Amazon Redshift can self-learn based on the workload, learn from the table access patterns, and apply the table design optimizations automatically.

Now let’s run the sample workload queries again after optimization:

$ psql -h example-corp.cfgio0kcsmjy.us-west-2.redshift.amazonaws.com -U awsuser -d dw -p 5492
dw=# # \timing on
Timing is on.
--turn off result set cache so that each query is executed as if executed first time
dw=# set enable_result_cache_for_session to off;
SET
dw=# /* query 1 */ SELECT L_SHIPMODE,SUM(l_quantity) AS quantity FROM lineitem JOIN part ON P_PARTKEY = l_PARTKEY where L_SHIPDATE='1992-02-28' GROUP BY L_SHIPMODE;
 l_shipmode |  quantity   
------------+-------------
 MAIL       | 436272.0000
 FOB        | 440959.0000
Time: 4020.900 ms
dw=# /* query 2 */ SELECT COUNT(o_orderkey) AS orders_count, SUM(l_quantity) AS quantity FROM lineitem JOIN orders ON l_orderkey = o_orderkey WHERE L_SHIPDATE = '1992-02-29'
Time: 3532.200 ms

With the sort and distribution optimization, query 1 and query 2 run with 40% less time elapsed, also shown by the following visual.

Converting existing tables for optimization

You can easily convert existing tables for Automatic table optimization using the ALTER table command and switch the sort and distribution styles to AUTO so that it can be automatically optimized by Amazon Redshift. See the following code:

/* Convert a table to a diststyle AUTO table */
ALTER TABLE <tbl> ALTER DISTSTYLE AUTO; 
/* Convert  table to a sort key AUTO table */
ALTER TABLE <tbl> ALTER SORTKEY AUTO;

Lead time to apply the recommendation

Amazon Redshift continuously learns from workloads, and optimizations are inserted into the svv_alter_table_recomendations. When an optimization is available, it runs within a defined frequency, as well as in periods of low workload intensity, so as to minimize impact on user queries. For more information about the lead time of applying the recommendation, see https://docs.aws.amazon.com/redshift/latest/dg/t_Creating_tables.html

Conclusion

Automatic table optimization for Amazon Redshift is a new capability that applies sort and distribution keys without the need for administrator intervention. Using automation to tune the design of tables lets you get started more easily and decreases the amount of administrative effort. Automatic table optimization enables easy management of large numbers of tables in a data warehouse because Amazon Redshift self-learns, self-optimizes, and adapts to your actual workload to deliver you the best possible performance.

About the Author

Paul Lappas is a Principal Product Manager at Amazon Redshift. Paul is responsible for Amazon Redshift’s self-tuning capabilities including Automatic Table Optimization, Workload Manager, and the Amazon Redshift Advisor. Paul is passionate about helping customers leverage their data to gain insights and make critical business decisions. In his spare time Paul enjoys playing tennis, cooking, and spending time with his wife and two boys.

Thiyagarajan Arumugam is a Principal Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

KamCheung Ting is a Senior Software Engineer at Amazon Redshift. He joined Redshift in 2014 and specializes in storage engine, autonomous DB and concurrency scaling.

Noise