Post Syndicated from Po Hong original https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
Amazon Redshift Spectrum enables you to run Amazon Redshift SQL queries against data that is stored in Amazon S3. With Amazon Redshift Spectrum, you can extend the analytic power of Amazon Redshift beyond the data that is stored on local disks in your data warehouse. You can query vast amounts of data in your Amazon S3 “data lake” without having to go through a tedious and time-consuming extract, transfer, and load (ETL) process. Amazon Redshift Spectrum applies sophisticated query optimization and scales processing across thousands of nodes to deliver fast performance.
In this blog post, we have collected 10 important best practices for Amazon Redshift Spectrum by grouping them into several different functional groups.
These guidelines are the product of many interactions and direct project work with Amazon Redshift customers.
Amazon Redshift vs. Amazon Athena
AWS customers often ask us: Amazon Athena or Amazon Redshift Spectrum? When should I use one over the other?
When to use Amazon Athena
Amazon Athena supports a use case in which you want interactive ad-hoc queries to run against data that is stored in Amazon S3 using SQL. The serverless architecture in Amazon Athena frees you from having to provision a cluster to perform queries. You are charged based on the amount of S3 data scanned by each query. You can get significant cost savings and better performance by compressing, partitioning, or converting your data into a columnar format, which reduces the amount of data that Amazon Athena needs to scan to execute a query. All the major BI tools and SQL clients that use JDBC can be used with Amazon Athena. You can also use Amazon QuickSight for easy visualization.
When to use Amazon Redshift
We recommend using Amazon Redshift on large sets of structured data. Amazon Redshift Spectrum gives you the freedom to store your data where you want, in the format you want, and have it available for processing when you need it. With Amazon Redshift Spectrum, you don’t have to worry about scaling your cluster. It lets you separate storage and compute, allowing you to scale each independently. You can even run multiple Amazon Redshift clusters against the same Amazon S3 data lake, enabling limitless concurrency. Amazon Redshift Spectrum automatically scales out to thousands of instances. So queries run quickly, whether they are processing a terabyte, a petabyte, or even an exabyte.
Set up the test environment
For information about prerequisites and steps to get started in Amazon Redshift Spectrum, see Getting Started with Amazon Redshift Spectrum.
You can use any data set to perform the tests to validate the best practices we have outlined in this blog post. One important requirement is that the S3 files for the largest table need to be in three separate data formats: CSV, non-partitioned Parquet as well as partitioned Parquet. How to convert from one file format to another is beyond the scope of this blog post. For more information on how this can be done, check out the following resources:
Creating the external schema
Use the Amazon Athena data catalog as the metadata store, and create an external schema named “spectrum” as follows:
The Redshift cluster and the data files in Amazon S3 must be in the same AWS region. Your Redshift cluster needs authorization to access your external data catalog in Amazon Athena and your data files in Amazon S3. You provide that authorization by referencing an AWS Identity and Access Management (IAM) role (e.g. aod-redshift-role) that is attached to your cluster. For more information, see Create an IAM Role for Amazon Redshift.
Defining external tables
As examples, an Amazon Redshift Spectrum external table using partitioned Parquet files and another external table using CSV files are defined as follows:
To recap, Amazon Redshift Spectrum uses external tables to query data that is stored in Amazon S3. You can query an external table using the same SELECT syntax you use with other Amazon Redshift tables. You can’t write to external tables because they are read-only.
You first create an external schema that references an external database, which can reside in either an Amazon Athena data catalog or an Apache Hive metastore, such as Amazon EMR. Then you create an external table in Amazon Redshift using this external schema. You must reference the external table in your SELECT statements by prefixing the table name with the schema name, without needing to create and load the table into Amazon Redshift.
The external schema references a database in the external data catalog. This requires an IAM role that authorizes your cluster to access Amazon S3 and Amazon Athena on your behalf.
If you would like to perform your tests using Amazon Redshift Spectrum, the following two queries would be a good start:
This query includes only one table and it can be used to highlight the additional processing power provided by the Amazon Redshift Spectrum layer.
This query has joins of three tables and can be very useful to compare Amazon Redshift Spectrum’s performance with that of native Amazon Redshift.
Best practices for concurrency
These recommended practices can help you optimize your concurrent workload performance using Amazon Redshift Spectrum.
1. Use Amazon Redshift Spectrum to improve scan-intensive concurrent workloads
Amazon Redshift Spectrum resides on dedicated Amazon Redshift servers that are independent of your cluster. It pushes many compute-intensive tasks, such as predicate filtering and aggregation, down to the Amazon Redshift Spectrum layer, so queries use much less of your cluster’s processing capacity. In addition, Amazon Redshift Spectrum scales intelligently. Based on the demands of your queries, Amazon Redshift Spectrum can potentially use thousands of instances to take advantage of massively parallel processing (MPP).
For some use cases of concurrent scan and/or aggregate intensive workloads, Amazon Redshift Spectrum may perform better than native Amazon Redshift on average.
The most resource-intensive aspect of any MPP system is the data-load process. This is because it competes with active analytic queries not only for compute resources, but also for locking on the tables through multiversion concurrency control (MVCC). By contrast, if you add new files to an existing external table using Amazon Redshift Spectrum by writing to Amazon S3, and then updating the meta-data to include them as new partitions, you eliminate this workload from the Amazon Redshift cluster. This has an immediate and direct positive impact on concurrency.
2. Use multiple on-demand Amazon Redshift clusters to scale concurrency
Amazon Redshift Spectrum stores data in Amazon S3, which can be accessed by multiple Amazon Redshift clusters to improve concurrent workload performance. A common Amazon Redshift customer scenario is what to do with seasonal spiky, highly concurrent query workloads. Before Amazon Redshift Spectrum, to handle the increased concurrency, customers often had to spin up multiple “read-only” Amazon Redshift clusters by restoring from a snapshot. The problem with this approach is that for a large Amazon Redshift data warehouse with hundreds of terabytes of data, the restore process can take a long time, resulting in data latency issues.
With Amazon Redshift Spectrum, you can move the largest tables to Amazon S3, and each Amazon Redshift cluster needs to keep only a small amount of data on local disks. Because of the reduction of data volume, it would be much faster to spin up or restore multiple “read-only” Amazon Redshift clusters to handle these seasonal spiky query workloads (see Figure 1). To reduce cost, you should terminate these “on-demand” Amazon Redshift clusters as soon as they have finished the job.
Figure 1: Multiple read-only Amazon Redshift clusters access a shared Redshift Spectrum layer
Amazon Redshift customers have been using pgbouncer-rr to simplify and control client query routing when deploying multiple Redshift clusters to scale concurrency. More details can be found in the blog: Query Routing and Rewrite: Introducing pgbouncer-rr for Amazon Redshift and PostgreSQL.
Best practices for storage
In terms of storage optimization considerations, think about reducing the I/O workload at every step. That tends toward a columnar-based file format, using compression to fit more records into each storage block, and using a format that supports data partitioning. The file formats supported in Amazon Redshift Spectrum include CSV, TSV, Parquet, Sequence, and RCFile.
A further optimization is to use compression. Currently, Amazon Redshift Spectrum supports Gzip, Snappy, and BZ2.
3. Use Apache Parquet files for better performance and lower cost
Apache Parquet is a columnar storage format that is available to any project in the Apache Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. For more details, see the Apache Parquet information page.
Amazon Redshift Spectrum reads from S3 only the columns of a file that are needed for the query. Amazon Redshift Spectrum supports predicate pushdown (also called predicate filtering).
Amazon Redshift Spectrum charges you by the amount of data that is scanned from S3 per query. Parquet stores data in a columnar format, so Amazon Redshift Spectrum can eliminate unneeded columns during the scan. As an example, it would be interesting to compare the difference in query performance between CSV text files and Parquet partitioned files.
Various tests have shown that partitioned Parquet files are not only performing faster, but they are also much more cost-effective than non-partitioned row-based CSV files.
You can use SVL_S3QUERY_SUMMARY to gain insight into several interesting S3 metrics for the query that uses partitioned Parquet files:
Pay special attention to two interesting metrics: s3_scanned_rows and s3query_returned_rows. You would notice the tremendous reduction in the amount of data that returns from Amazon Redshift Spectrum to Amazon Redshift native for the final processing when compared to CSV files.
4. Partition Parquet files on frequently used columns
When you’re deciding on the optimal partition columns, consider the following:
- Columns that are used as common filters are good candidates.
- Excessively granular partitioning adds time for retrieving partition information. However, it can help in partition pruning and reduce the amount of data scanned from S3.
- Actual performance varies depending on file placement, query pattern, file size distribution, number of files in a partition, number of qualified partitions, etc.
- Measure and avoid data skew on partitioning columns.
- One important consideration is that file size distribution should be as uniform as possible. That is, it’s better to have ten 256-MB Parquet files than one 1 GB file and six 256-MB files.
To illustrate the powerful benefits of partition pruning, you should consider creating two external tables using the Parquet format: One table is not partitioned, and the other is partitioned at the day level.
Scanning a partitioned external table can be 2x-4x times faster than a non-partitioned external table.
How do you know if “partition-pruning” is in effect? You can use the following SQL to analyze the effectiveness of partition pruning. If the query touches only a few partitions, you can verify if everything behaves as expected:
Best practices for cluster configuration
5. Optimize Redshift Spectrum performance with the right Amazon Redshift cluster configuration
There are two levels of Amazon S3 request parallelism control for an Amazon Redshift Spectrum query:
- Query level (up to 10 per query per slice)
- This number depends on how many concurrent queries are running
- Works to limit the number of threads used to support S3 scanning
- Node level (for all S3 queries that run on a node; value varies for node type)
- The more powerful the node type, the higher the limit
The simple math is as follows: when the total number of files <= parallelism_per_query (e.g., 10) * total_slices, provisioning a cluster with more nodes may not increase performance. The guidance is to check how many files an Amazon Redshift Spectrum table has. Then measure to show a trend that up to certain cluster size (in term of number of slices), the performance plateaus even as the cluster node count continues to increase. The optimal Amazon Redshift cluster size (for a given node type) is the point where no further performance gain can be achieved.
Best practices for query performance
You can use a few simple techniques to improve the performance of your queries on Amazon S3.
6. Achieve faster scan and aggregation-intensive queries with Amazon Redshift Spectrum
Certain queries like Query 1 do not have joins and their performance is usually dominated by physical I/O costs like scan speed. For these queries, Amazon Redshift Spectrum may actually be faster than native Amazon Redshift. On the other hand, for queries like Query 2, when multiple table joins are involved, it’s not surprising that the highly optimized native Amazon Redshift tables that use local storage come out the winner.
7. Improve Amazon S3 query performance with predicate pushdown
The processing that is done in the Amazon Redshift Spectrum layer (S3 scan, projection, filtering, and aggregation) is independent from an individual Amazon Redshift cluster. It also does not consume resources in the Amazon Redshift cluster.
There are certain SQL operations that can be pushed down to the Amazon Redshift Spectrum layer. You want to take advantage of these wherever possible. The following are some examples:
- GROUP BY clauses and some string functions
- Equal predicates and pattern-matching conditions such as LIKE
- Common aggregate functions such as COUNT, SUM, AVG, MIN, MAX, and many others
- Functions such as regex_replace and many others
Certain SQL operations like DISTINCT and ORDER BY must be performed in Amazon Redshift because they can’t be pushed down to Amazon Redshift Spectrum. If possible, you should minimize or avoid using them.
If you perform a quick test using the following two queries, you would notice a huge performance difference between these two queries. A natural question is: Why?
In the first query’s explain plan, S3 Aggregate is being pushed down to the Amazon Redshift Spectrum layer, and only the aggregated results are returned to Amazon Redshift for final processing.
On the other hand, if you take a close look at the second query’s explain plan, you would notice that there is no S3 aggregate in the Amazon Redshift Spectrum layer because Amazon Redshift Spectrum doesn’t support DATE as a regular data type or the DATE transform function. As a result, this query is forced to bring back a huge amount of data from S3 into Amazon Redshift to transform and process.
Another method you can use is to query against the SVL_S3QUERY_SUMMARY system view (column s3query_returned_rows) for these two SQL statements. What you will find is that there is a big difference in the number of rows returned from Amazon Redshift Spectrum to Amazon Redshift.
8. Replace DISTINCT with GROUP BY
Certain SQL operators, such as GROUP BY, MIN/MAX/COUNT, etc., can be pushed down to the Amazon Redshift Spectrum layer. Other SQL operators, such as DISTINCT and ORDER BY, can’t be pushed down. In general, any operations that can be pushed down to Amazon Redshift Spectrum experience a performance boost because of the powerful infrastructure that supports Amazon Redshift Spectrum.
As an example, examine the following two functionally equivalent SQL statements:
It turns out that there is no pushdown in the first query (because of DISTINCT). Instead, a large number of rows are returned to Amazon Redshift to be sorted and de-duped. In the second query, S3 HashAggregate is pushed to Redshift Spectrum, where most of the heavy lifting and aggregation is done. Querying against SVL_S3QUERY_SUMMARY confirms the explain plan differences:
The lesson learned is that you should replace “DISTINCT” with “GROUP BY” in your SQL statements wherever possible.
Best practices for table placement
These simple guidelines can help you determine the best place to store your tables for the highest performance.
9. Put large fact tables in S3 and leave others in Amazon Redshift
Consider Query 2 which has joins of three tables. A natural question is: What happens if all three tables are externalized as partitioned Parquet files in S3? Will the performance be better or worse compared to having all three tables in native Amazon Redshift?
It’s not surprising that a set of join-optimized Amazon Redshift tables (with proper distribution and sort keys) would outperform Amazon Redshift Spectrum. Amazon Redshift Spectrum external tables do NOT support statistics. The database engine uses heuristics or simple row counts to determine the join order. This may not be optimal in some cases. The AWS recommendation is to put only the largest fact tables in Amazon S3, while leaving the small/medium dimension tables in Amazon Redshift. This gives the optimizer heuristics the best chance to be effective.
In future Amazon Redshift releases, the Amazon Redshift optimizer will use external table statistics to generate more robust execution plans.
We added support to set the table statistics (numRows) with a TABLE PROPERTIES clause in the CREATE EXTERNAL TABLE and ALTER TABLE commands. With this piece of information, you can set the accurate row number of tables and instruct the optimizer to generate more optimal execution plans. For more information, see CREATE EXTERNAL TABLE in the Amazon Redshift documentation.
We challenge you to define an additional query that has at least three external tables and executes using the correct join order (which is available by using Explain). We offer this challenge to prevent you from incorrectly concluding that you should never join multiple external tables using Amazon Redshift Spectrum.
10. Be careful about putting large tables that frequently join together in Amazon S3
Native Amazon Redshift can outperform Amazon Redshift Spectrum by almost three times across a wide range of concurrency for queries that look like Query 2. The difference between Query 1 and Query 2 is that Query 1 involves only aggregate operations on one table, whereas Query 2 includes joins of three relatively large tables.
To be more explicit, the main reason for the performance difference is that joins are executed in Amazon Redshift. All the data that needs to participate in the joins must be loaded from S3 first and distributed on the fly to individual slices in a Amazon Redshift cluster. As a result, the latency is significantly higher using Amazon Redshift Spectrum than when accessing Amazon Redshift local storage.
So if you have several large Amazon Redshift tables that participate in joins frequently, and your query workloads are subject to tight SLAs, you may not want to put all these tables in Amazon S3.
This post provides some important best practices to improve the performance of Amazon Redshift Spectrum. Because each use case is unique, you should evaluate how you can apply these recommendations to your specific situations.
If you have any questions or suggestions, please leave your feedback in the comment section. If you need further assistance in optimizing your Amazon Redshift cluster, contact your AWS account team.
Learn how to Analyze Database Audit Logs for Security and Compliance Using Amazon Redshift Spectrum.
About the authors
Po Hong, PhD, is a Big Data Consultant in the Global Big Data & Analytics Practice of AWS Professional Services.
Peter Dalton is a Principal Consultant in AWS Professional Services.