
Improve performance of workloads containing repetitive scan filters with multidimensional data layout sort keys in Amazon Redshift

Post Syndicated from Yanzhu Ji original https://aws.amazon.com/blogs/big-data/improve-performance-of-workloads-containing-repetitive-scan-filters-with-multidimensional-data-layout-sort-keys-in-amazon-redshift/

Amazon Redshift, the most widely used cloud data warehouse, has evolved significantly to meet the performance requirements of the most demanding workloads. This post covers one such new feature—the multidimensional data layout sort key.

Amazon Redshift now improves your query performance by supporting multidimensional data layout sort keys, a new type of sort key that sorts a table’s data by commonly occurring filter predicates instead of by the table’s physical columns. Multidimensional data layout sort keys can significantly improve the performance of table scans, especially when your query workload contains repetitive scan filters.

Amazon Redshift already provides the capability of automatic table optimization (ATO), which automatically optimizes the design of tables by applying sort and distribution keys without the need for administrator intervention. In this post, we introduce multidimensional data layout sort keys as an additional capability offered by ATO and fortified by Amazon Redshift’s sort key advisor algorithm.

Multidimensional data layout sort keys

When you define a table with the AUTO sort key, Amazon Redshift ATO will analyze your query history and automatically select either a single-column sort key or multidimensional data layout sort key for your table, based on which option is better for your workload. When multidimensional data layout is selected, Amazon Redshift will construct a multidimensional sort function that co-locates rows that are typically accessed by the same queries, and the sort function is subsequently used during query runs to skip data blocks and even skip scanning the individual predicate columns.
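A table opts into this behavior simply by declaring the AUTO sort key. The following DDL sketch uses the titles table from the example below; the column types are assumptions, since the post doesn’t show the actual table definition:

```sql
-- Hypothetical DDL; column types are assumed. SORTKEY AUTO lets automatic
-- table optimization choose between a single-column sort key and a
-- multidimensional data layout sort key based on your workload.
CREATE TABLE titles (
    season    integer,
    subregion varchar(64),
    metric2   bigint
)
SORTKEY AUTO;
```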

Consider the following user query, which is a dominant query pattern in the user’s workload:

SELECT season, sum(metric2) AS "__measure__0"
FROM titles
WHERE lower(subregion) like '%United States%'
GROUP BY 1
ORDER BY 1;

Amazon Redshift stores data for each column in 1 MB disk blocks and stores the minimum and maximum values in each block as part of the table’s metadata. If a query uses a range-restricted predicate, Amazon Redshift can use the minimum and maximum values to rapidly skip over large numbers of blocks during table scans. However, this query’s filter on the subregion column can’t be used to determine which blocks to skip based on minimum and maximum values, and as a result, Amazon Redshift scans all rows from the titles table:

SELECT table_name, input_rows, step_attribute
FROM sys_query_detail
WHERE query_id = 123456789;

When the user’s query was run with titles using a single-column sort key on subregion, the result of the preceding query is as follows:

  table_name | input_rows | step_attribute
-------------+------------+---------------
  titles     | 2164081640 | 
(1 row)

This shows that the table scan read 2,164,081,640 rows.
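The block-skipping idea described above can be sketched in a few lines of Python. This is an illustrative simulation of min/max zone maps, not Redshift internals; note how a range predicate prunes blocks via metadata alone, which is exactly what a `like '%...%'` filter cannot do:

```python
# Illustrative simulation of min/max zone maps; not actual Redshift internals.
# Each "block" stores its rows plus min/max metadata for one column.

def build_blocks(values, block_size):
    """Split a column's values into blocks and record min/max per block."""
    blocks = []
    for i in range(0, len(values), block_size):
        chunk = values[i:i + block_size]
        blocks.append({"rows": chunk, "min": min(chunk), "max": max(chunk)})
    return blocks

def scan_greater_than(blocks, threshold):
    """Scan only blocks whose max exceeds the threshold; skip the rest."""
    scanned, matched = 0, []
    for b in blocks:
        if b["max"] <= threshold:      # whole block fails 'col > threshold'
            continue                   # skipped using metadata alone
        scanned += len(b["rows"])
        matched.extend(v for v in b["rows"] if v > threshold)
    return scanned, matched

# Sorted data keeps similar values together, so zone maps prune well.
blocks = build_blocks(sorted(range(100)), block_size=10)
scanned, matched = scan_greater_than(blocks, 74)
print(scanned, len(matched))   # scans 30 rows (3 of 10 blocks), matches 25
```

With the data sorted, only 3 of the 10 blocks are read. A predicate like `lower(subregion) like '%United States%'` gets no help from min/max metadata, which is the gap that multidimensional data layout sort keys close.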

To improve scans on the titles table, Amazon Redshift might automatically decide to use a multidimensional data layout sort key. All rows that satisfy the lower(subregion) like '%United States%' predicate would be co-located to a dedicated region of the table, and therefore Amazon Redshift will only scan data blocks that satisfy the predicate.

When the user’s query is run with titles using a multidimensional data layout sort key that includes lower(subregion) like '%United States%' as a predicate, the result of the sys_query_detail query is as follows:

  table_name | input_rows | step_attribute
-------------+------------+---------------
  titles     | 152324046  | multi-dimensional
(1 row)

This shows that the table scan read 152,324,046 rows, which is only 7% of the original, and it used the multidimensional data layout sort key.

Note that this example uses a single query to showcase the multidimensional data layout feature, but Amazon Redshift will consider all the queries running against the table and can create multiple regions to satisfy the most commonly run predicates.

Let’s take another example, with more complex predicates and multiple queries this time.

Imagine having a table items (cost int, available int, demand int) with four rows as shown in the following example.

#id  cost  available  demand
1    4     3          3
2    2     23         6
3    5     4          5
4    1     1          2

Your dominant workload consists of two query patterns:

  • 70% of queries follow this pattern:
    select * from items where cost > 3 and available < demand

  • 20% of queries follow this pattern:
    select avg(cost) from items where available < demand

With traditional sorting techniques, you might choose to sort the table on the cost column, so that the evaluation of cost > 3 benefits from the sort. After sorting on the single cost column, the items table looks like the following.

Region #1, with cost <= 3:
#id  cost  available  demand
4    1     1          2
2    2     23         6

Region #2, with cost > 3:
#id  cost  available  demand
1    4     3          3
3    5     4          5

By using this traditional sort, we can immediately exclude the two rows in Region #1 (ID 4 and ID 2), because they don’t satisfy cost > 3.

On the other hand, with a multidimensional data layout sort key, the table will be sorted based on a combination of the two commonly occurring predicates in the user’s workload, which are cost > 3 and available < demand. As a result, the table’s rows are sorted into four regions.

Region #1, with cost <= 3 and available < demand:
#id  cost  available  demand
4    1     1          2

Region #2, with cost <= 3 and available >= demand:
#id  cost  available  demand
2    2     23         6

Region #3, with cost > 3 and available < demand:
#id  cost  available  demand
3    5     4          5

Region #4, with cost > 3 and available >= demand:
#id  cost  available  demand
1    4     3          3

This concept is even more powerful when applied to entire blocks instead of single rows, when applied to complex predicates that use operators not suitable for traditional sorting techniques (such as like), and when applied to more than two predicates.
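The region assignment above can be sketched as a sort over predicate outcomes. The real Redshift sort function is internal, so region_of() below is a made-up stand-in that reproduces the four-region example:

```python
# Illustrative sketch of multidimensional region assignment; region_of() is a
# hypothetical stand-in for Redshift's internal multidimensional sort function.

ROWS = [
    {"id": 1, "cost": 4, "available": 3, "demand": 3},
    {"id": 2, "cost": 2, "available": 23, "demand": 6},
    {"id": 3, "cost": 5, "available": 4, "demand": 5},
    {"id": 4, "cost": 1, "available": 1, "demand": 2},
]

def region_of(row):
    """Map a row to a region by the outcomes of the workload's two predicates.

    Ordering matches the example above: regions 1-2 hold cost <= 3 and
    regions 3-4 hold cost > 3; within each pair, available < demand comes first.
    """
    return (row["cost"] > 3, not (row["available"] < row["demand"]))

# Co-locate rows whose predicate outcomes agree.
ordered = sorted(ROWS, key=region_of)
print([r["id"] for r in ordered])   # [4, 2, 3, 1]

# The dominant query (cost > 3 and available < demand) now only needs to
# read region #3 -- the rows where region_of(row) == (True, False).
hits = [r for r in ordered if region_of(r) == (True, False)]
print([r["id"] for r in hits])      # [3]
```

Applied to 1 MB blocks instead of single rows, the same bucketing lets the scan skip every block outside the matching region.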

System tables

The following Amazon Redshift system tables show whether multidimensional data layouts are used on your tables and in your queries:

  • To determine if a particular table is using a multidimensional data layout sort key, you can check whether sortkey1 in svv_table_info is equal to AUTO(SORTKEY(padb_internal_mddl_key_col)).
  • To determine if a particular query uses multidimensional data layout to accelerate table scans, you can check step_attribute in the sys_query_detail view. The value will be equal to multi-dimensional if the table’s multidimensional data layout sort key was used during the scan.
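Putting both checks together, the lookups can be sketched as follows (the query ID is a placeholder to replace with your own):

```sql
-- Which tables currently use a multidimensional data layout sort key?
SELECT "table", sortkey1
FROM svv_table_info
WHERE sortkey1 = 'AUTO(SORTKEY(padb_internal_mddl_key_col))';

-- Did a specific query's scans use it? (123456789 is a placeholder query ID)
SELECT table_name, input_rows, step_attribute
FROM sys_query_detail
WHERE query_id = 123456789
  AND step_attribute = 'multi-dimensional';
```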

Performance benchmarks

We performed internal benchmark testing across multiple workloads with repetitive scan filters and found that introducing multidimensional data layout sort keys produced the following results:

  • A 74% total runtime reduction compared to having no sort key.
  • A 40% total runtime reduction compared to having the best single-column sort key on each table.
  • An 80% reduction in total rows read from tables compared to having no sort key.
  • A 47% reduction in total rows read from tables compared to having the best single-column sort key on each table.

Feature comparison

With the introduction of multidimensional data layout sort keys, your tables can now be sorted by expressions based on the commonly occurring filter predicates in your workload. The following table provides a feature comparison for Amazon Redshift against two competitors.

Feature                                                             Amazon Redshift   Competitor A   Competitor B
Support for sorting on columns                                      Yes               Yes            Yes
Support for sorting by expression                                   Yes               Yes            No
Automatic column selection for sorting                              Yes               No             Yes
Automatic expression selection for sorting                          Yes               No             No
Automatic selection between column sorting and expression sorting   Yes               No             No
Automatic use of sorting properties for expressions during scans    Yes               No             No

Considerations

Keep in mind the following when using a multidimensional data layout:

  • Multidimensional data layout is enabled when you set your table as SORTKEY AUTO.
  • Amazon Redshift Advisor will automatically choose either a single-column sort key or multidimensional data layout for the table by analyzing your historical workload.
  • Amazon Redshift ATO adjusts the multidimensional data layout sorting results as your ongoing query workload evolves.
  • Amazon Redshift ATO maintains multidimensional data layout sort keys the same way as it currently does for existing sort keys. Refer to Working with automatic table optimization for more details on ATO.
  • Multidimensional data layout sort keys will work with both provisioned clusters and serverless workgroups.
  • Multidimensional data layout sort keys will work with your existing data as long as the AUTO SORTKEY is enabled on your table and a workload with repetitive scan filters is detected. The table will be reorganized based on the results of the multidimensional sort function.
  • To disable multidimensional data layout sort keys for a table, use alter table: ALTER TABLE table_name ALTER SORTKEY NONE. This disables the AUTO sort key feature on the table.
  • Multidimensional data layout sort keys are preserved when restoring or migrating your provisioned cluster to a serverless cluster or vice versa.

Conclusion

In this post, we showed that multidimensional data layout sort keys can significantly improve query runtime performance for workloads where dominant queries have repetitive scan filters.

To create a preview cluster from the Amazon Redshift console, navigate to the Clusters page and choose Create preview cluster. You can create a cluster in the US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), Europe (Ireland), and Europe (Stockholm) Regions and test your workloads.

We would love to hear your feedback on this new feature and look forward to your comments on this post.


About the authors

Yanzhu Ji is a Product Manager in the Amazon Redshift team. She has experience in product vision and strategy in industry-leading data products and platforms. She has outstanding skill in building substantial software products using web development, system design, database, and distributed programming techniques. In her personal life, Yanzhu likes painting, photography, and playing tennis.

Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift.

Jialin Ding is an Applied Scientist in the Learned Systems Group, specializing in applying machine learning and optimization techniques to improve the performance of data systems such as Amazon Redshift.

Use the new SQL commands MERGE and QUALIFY to implement and validate change data capture in Amazon Redshift

Post Syndicated from Yanzhu Ji original https://aws.amazon.com/blogs/big-data/use-the-new-sql-commands-merge-and-qualify-to-implement-and-validate-change-data-capture-in-amazon-redshift/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads.

Amazon Redshift has added many features to enhance analytical processing, such as ROLLUP, CUBE, and GROUPING SETS, which were demonstrated in the post Simplify Online Analytical Processing (OLAP) queries in Amazon Redshift using new SQL constructs such as ROLLUP, CUBE, and GROUPING SETS. Amazon Redshift has recently added many SQL commands and expressions. In this post, we talk about two new SQL features, the MERGE command and QUALIFY clause, which simplify data ingestion and data filtering.

A familiar task in most downstream applications is change data capture (CDC): applying changes from a source system to target tables. This task requires examining the source data to determine whether each row is an update or an insert to existing target data. Without the MERGE command, you needed to test the new dataset against the existing dataset using a business key: when a key didn’t match, you inserted a new row; otherwise, you updated the existing row with the new dataset’s values.

The MERGE command conditionally merges rows from a source table into a target table. Traditionally, this could only be achieved by using multiple insert, update, or delete statements separately. When using multiple statements to update or insert data, there is a risk of inconsistencies between the different operations. The MERGE operation reduces this risk by ensuring that all operations are performed together in a single transaction.
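The update-then-insert pattern that MERGE replaces can be sketched with any SQL engine. SQLite (via Python’s standard library) stands in for the warehouse here; the table and column names are made up, and the single transaction is what MERGE gives you atomically:

```python
import sqlite3

# Sketch of the classic update-then-insert CDC pattern that MERGE replaces.
# SQLite stands in for the warehouse; table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE source (id INTEGER PRIMARY KEY, email TEXT);
    INSERT INTO target VALUES (1, 'old@example.com'), (2, 'keep@example.com');
    INSERT INTO source VALUES (1, 'new@example.com'), (3, 'add@example.com');
""")

with conn:  # one transaction: both statements commit (or roll back) together
    # Update rows whose business key already exists in the target
    conn.execute("""
        UPDATE target
        SET email = (SELECT s.email FROM source s WHERE s.id = target.id)
        WHERE id IN (SELECT id FROM source)
    """)
    # Insert rows whose business key is new
    conn.execute("""
        INSERT INTO target
        SELECT * FROM source WHERE id NOT IN (SELECT id FROM target)
    """)

print(conn.execute("SELECT id, email FROM target ORDER BY id").fetchall())
# [(1, 'new@example.com'), (2, 'keep@example.com'), (3, 'add@example.com')]
```

If the two statements ran in separate transactions, a failure between them could leave updates applied but inserts missing; MERGE collapses both branches into one atomic statement.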

The QUALIFY clause filters the results of a previously computed window function according to user‑specified search conditions. You can use the clause to apply filtering conditions to the result of a window function without using a subquery. This is similar to the HAVING clause, which applies a condition to further filter rows from a WHERE clause. The difference between QUALIFY and HAVING is that filtered results from the QUALIFY clause could be based on the result of running window functions on the data. You can use both the QUALIFY and HAVING clauses in one query.
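QUALIFY itself is Redshift syntax, but the subquery it eliminates can be sketched with any engine that supports window functions. This SQLite example (illustrative data) shows the subquery form, with the QUALIFY equivalent in a comment:

```python
import sqlite3

# QUALIFY is Redshift-specific; engines without it need the subquery below.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (name TEXT, batch INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 1), ("b", 2), ("c", 3), ("d", 3)])

# Without QUALIFY: compute the rank in a subquery, filter in the outer query.
latest = conn.execute("""
    SELECT name FROM (
        SELECT name, RANK() OVER (ORDER BY batch DESC) AS rnk
        FROM events
    ) ranked
    WHERE rnk = 1
    ORDER BY name
""").fetchall()
print(latest)  # [('c',), ('d',)] -- both rows from the most recent batch

# The Redshift equivalent with QUALIFY needs no subquery:
#   SELECT name FROM events
#   QUALIFY RANK() OVER (ORDER BY batch DESC) = 1;
```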

In this post, we demonstrate how to use the MERGE command to implement CDC and how to use QUALIFY to simplify validation of those changes.

Solution overview

In this use case, we have a data warehouse, in which we have a customer dimension table that needs to always get the latest data from the source system. This data must also reflect the initial creation time and last update time for auditing and tracking purposes.

A simple way to solve this is to overwrite the customer dimension fully every day; however, that wouldn’t capture the update history that auditing mandates, and it might not be feasible for bigger tables.

You can load sample data from Amazon S3 by following the instruction here. Using the existing customer table under sample_data_dev.tpcds, we create a customer dimension table and a source table that will contain both updates for existing customers and inserts for new customers. We use the MERGE command to merge the source table data with the target table (customer dimension). We also show how to use the QUALIFY clause to simplify validating the changes in the target table.

To follow along with the steps in this post, we recommend downloading the accompanying notebook, which contains all the scripts to run for this post. To learn about authoring and running notebooks, refer to Authoring and running notebooks.

Prerequisites

You should have the following prerequisites:

Create and populate the dimension table

We use the existing customer table under sample_data_dev.tpcds to create a customer_dimension table. Complete the following steps:

  1. Create a table using a few selected fields, including the business key, and add a couple of maintenance fields for insert and update timestamps:
     -- create the customer dimension table
     DROP TABLE IF EXISTS customer_dim CASCADE;
    CREATE TABLE customer_dim ( 
    customer_dim_id     bigint GENERATED BY DEFAULT AS IDENTITY(1, 1), 
    c_customer_sk integer NOT NULL ENCODE az64 distkey,
    c_first_name character(20) ENCODE lzo,
    c_last_name character(30) ENCODE lzo,
    c_current_addr_sk integer ENCODE az64,
    c_birth_country character varying(20) ENCODE lzo,
    c_email_address character(50) ENCODE lzo,
    record_insert_ts    timestamp WITHOUT time ZONE DEFAULT current_timestamp ,
    record_upd_ts       timestamp WITHOUT time ZONE DEFAULT NULL
    )
    SORTKEY (c_customer_sk);

  2. Populate the dimension table:
    -- populate dimension 
    insert into customer_dim 
           (c_customer_sk, c_first_name,c_last_name, c_current_addr_sk, c_birth_country, c_email_address) 
    select  c_customer_sk, c_first_name,c_last_name, c_current_addr_sk, c_birth_country, c_email_address
    from "sample_data_dev"."tpcds"."customer";

  3. Validate the row count and the contents of the table:
    -- check customers count and look at sample data
    select count(1) from customer_dim; 
    select * from customer_dim limit 10;

Simulate customer table changes

Use the following code to simulate changes made to the table:

-- create a source table with some updates and some inserts
-- Update- Email has changed for 100 customers 
drop table if exists src_customer;
create table src_customer distkey(c_customer_sk) as 
select c_customer_sk , c_first_name , c_last_name, c_current_addr_sk, c_birth_country, 'x'||c_email_address as c_email_address, getdate() as effective_dt
from   customer_dim 
where  c_email_address is not null
limit 100;


-- also let's add three completely new customers
insert into src_customer values 
(15000001, 'Customer#15','000001', 10001 ,'USA'    , 'Customer#[email protected]', getdate() ),
(15000002, 'Customer#15','000002', 10002 ,'MEXICO' , 'Customer#[email protected]', getdate() ),
(15000003, 'Customer#15','000003', 10003 ,'CANADA' , 'Customer#[email protected]', getdate() );

-- check source count
select count(1) from src_customer;

Merge the source table into the target table

Now you have a source table with some changes you need to merge with the customer dimension table.

Before the MERGE command, this type of task needed two separate UPDATE and INSERT commands to implement:

-- merge changes to dim customer
BEGIN TRANSACTION;
-- update current records
UPDATE customer_dim
SET    c_first_name      = src.c_first_name      ,
       c_last_name       = src.c_last_name       , 
       c_current_addr_sk = src.c_current_addr_sk , 
       c_birth_country   = src.c_birth_country   , 
       c_email_address   = src.c_email_address   ,
       record_upd_ts     = current_timestamp
from   src_customer AS src
where  customer_dim.c_customer_sk = src.c_customer_sk ;
-- Insert new records
INSERT INTO customer_dim (c_customer_sk, c_first_name,c_last_name, c_current_addr_sk, c_birth_country, c_email_address) 
select src.c_customer_sk, src.c_first_name,src.c_last_name, src.c_current_addr_sk, src.c_birth_country, src.c_email_address 
from   src_customer AS src
where  src.c_customer_sk NOT IN (select c_customer_sk from customer_dim);
-- end merge operation
COMMIT TRANSACTION;

The MERGE command uses a more straightforward syntax, in which we use the key comparison result to decide if we perform an update DML operation (when matched) or an insert DML operation (when not matched):

MERGE INTO customer_dim using src_customer AS src ON customer_dim.c_customer_sk = src.c_customer_sk
WHEN MATCHED THEN UPDATE 
SET c_first_name      = src.c_first_name      ,
    c_last_name       = src.c_last_name       , 
    c_current_addr_sk = src.c_current_addr_sk , 
    c_birth_country   = src.c_birth_country   , 
    c_email_address   = src.c_email_address   ,
    record_upd_ts     = current_timestamp
WHEN NOT MATCHED THEN INSERT (c_customer_sk, c_first_name,c_last_name, c_current_addr_sk, c_birth_country, c_email_address) 
                      VALUES (src.c_customer_sk, src.c_first_name,src.c_last_name, src.c_current_addr_sk, src.c_birth_country, src.c_email_address );

Validate the data changes in the target table

Now we need to validate the data has made it correctly to the target table. We can first check the updated data using the update timestamp. Because this was our first update, we can examine all rows where the update timestamp is not null:

-- Check the changes
-- to get updates
select * 
from customer_dim
where record_upd_ts is not null

Use QUALIFY to simplify validation of the data changes

We need to examine the data inserted in this table most recently. One way to do that is to rank the data by its insert timestamp and get those with the first rank. This requires using the window function rank() and also requires a subquery to get the results.

Before the availability of QUALIFY, we needed to build that using a subquery like the following:

select customer_dim_id,c_customer_sk ,c_first_name ,c_last_name ,c_current_addr_sk,c_birth_country ,c_email_address ,record_insert_ts ,record_upd_ts 
from 
( select rank() OVER (ORDER BY DATE_TRUNC('second',record_insert_ts) desc) AS rnk, 
         customer_dim_id,c_customer_sk ,c_first_name ,c_last_name ,c_current_addr_sk,c_birth_country ,c_email_address ,record_insert_ts ,record_upd_ts 
  from customer_dim
  where record_upd_ts is null) as ranked
where rnk = 1;

The QUALIFY clause eliminates the need for the subquery, as in the following code snippet:

-- to get the newly inserted rows we can make use of Qualify feature
select * 
from customer_dim
where record_upd_ts is null
qualify rank() OVER (ORDER BY DATE_TRUNC('second',record_insert_ts) desc) = 1;

Validate all data changes

We can union the results of both queries to get all the inserts and update changes:

-- To get all changes
select *
from (
select 'Updates' as operations, cd.* 
from   customer_dim as cd
where  cd.record_upd_ts is not null
union 
select 'Inserts' as operations, cd.* 
from customer_dim cd
where cd.record_upd_ts is null
qualify rank() OVER (ORDER BY DATE_TRUNC('second',cd.record_insert_ts) desc) = 1 
) as all_changes order by 1;

Clean up

To clean up the resources used in the post, delete the Redshift provisioned cluster or Redshift Serverless workgroup and namespace you created for this post (this will also drop all the objects created).

If you used an existing Redshift provisioned cluster or Redshift Serverless workgroup and namespace, use the following code to drop these objects:

DROP TABLE IF EXISTS customer_dim CASCADE;
DROP TABLE IF EXISTS src_customer CASCADE;

Conclusion

When using multiple statements to update or insert data, there is a risk of inconsistencies between the different operations. The MERGE operation reduces this risk by ensuring that all operations are performed together in a single transaction. For Amazon Redshift customers who are migrating from other data warehouse systems or who regularly need to ingest fast-changing data into their Redshift warehouse, the MERGE command is a straightforward way to conditionally insert, update, and delete data from target tables based on existing and new source data.

In most analytic queries that use window functions, you may need to filter on those window functions in your WHERE clause as well. However, this is not permitted; instead, you have to build a subquery that contains the required window function and then filter on its results in the parent query’s WHERE clause. Using the QUALIFY clause eliminates the need for a subquery, which simplifies the SQL statement and makes it easier to write and read.

We encourage you to start using those new features and give us your feedback. For more details, refer to MERGE and QUALIFY clause.


About the authors

Yanzhu Ji is a Product Manager in the Amazon Redshift team. She has experience in product vision and strategy in industry-leading data products and platforms. She has outstanding skill in building substantial software products using web development, system design, database, and distributed programming techniques. In her personal life, Yanzhu likes painting, photography, and playing tennis.

Ahmed Shehata is a Senior Analytics Specialist Solutions Architect at AWS based in Toronto. He has more than two decades of experience helping customers modernize their data platforms. Ahmed is passionate about helping customers build efficient, performant, and scalable analytic solutions.

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 16 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with cloud solutions.