All posts by Anusha Challa

Best practices for querying Apache Iceberg data with Amazon Redshift

Post Syndicated from Anusha Challa original https://aws.amazon.com/blogs/big-data/best-practices-for-querying-apache-iceberg-data-with-amazon-redshift/

Apache Iceberg is an open table format that helps combine the benefits of using both data warehouse and data lake architectures, giving you choice and flexibility for how you store and access data. See Using Apache Iceberg on AWS for a deeper dive on using AWS Analytics services for managing your Apache Iceberg data. Amazon Redshift supports querying Iceberg tables directly, whether they’re fully-managed using Amazon S3 Tables or self-managed in Amazon S3. Understanding best practices for how to architect, store, and query Iceberg tables with Redshift helps you meet your price and performance targets for your analytical workloads.

In this post, we discuss the best practices that you can follow while querying Apache Iceberg data with Amazon Redshift

1. Follow the table design best practices

Selecting the right data types for Iceberg tables is important for efficient query performance and maintaining data integrity. It is important to match the data types of the columns to the nature of the data they store, rather than using generic or overly broad data types.

Why follow table design best practices?

  • Optimized Storage and Performance: By using the most appropriate data types, you can reduce the amount of storage required for the table and improve query performance. For example, using the DATE data type for date columns instead of a STRING or TIMESTAMP type can reduce the storage footprint and improve the efficiency of date-based operations.
  • Improved Join Performance: The data types used for columns participating in joins can impact query performance. Certain data types, such as numeric types (such as, INTEGER, BIGINT, DECIMAL), are generally more efficient for join operations compared to string-based types (such as, VARCHAR, TEXT). This is because numeric types can be easily compared and sorted, leading to more efficient hash-based join algorithms.
  • Data Integrity and Consistency: Choosing the correct data types helps with data integrity by enforcing the appropriate constraints and validations. This reduces the risk of data corruption or unexpected behavior, especially when data is ingested from multiple sources.

How to follow table design best practices?

  • Leverage Iceberg Type Mapping: Iceberg has built-in type mapping that translates between different data sources and the Iceberg table’s schema. Understand how Iceberg handles type conversions and use this knowledge to define the most appropriate data types for your use case.
  • Select the smallest possible data type that can accommodate your data. For example, use INT instead of BIGINT if the values fit within the integer range, or SMALLINT if they fit even smaller ranges.
  • Utilize fixed-length data types when data length is consistent. This can help with predictable and faster performance.
  • Choose character types like VARCHAR or TEXT for text, prioritizing VARCHAR with an appropriate length for efficiency. Avoid over-allocating VARCHAR lengths, which can waste space and slow down operations.
  • Match numeric precision to your actual requirements. Using unnecessarily high precision (such as, DECIMAL(38,20) instead of DECIMAL(10,2) for currency) demands more storage and processing, leading to slower query execution times for calculations and comparisons.
  • Employ date and time data types (such as, DATE, TIMESTAMP) rather than storing dates as text or numbers. This optimizes storage and allows for efficient temporal filtering and operations.
  • Opt for boolean values (such as, BOOLEAN) instead of using integers to represent true/false states. This saves space and potentially enhances processing speed.
  • If the column will be used in join operations, favor data types that are typically used for indexing. Integers and date/time types generally allow for faster searching and sorting than larger, less efficient types like VARCHAR(MAX).

2. Partition your Apache Iceberg table on columns that are most frequently used in filters

When working with Apache Iceberg tables in conjunction with Amazon Redshift, one of the most effective ways to optimize query performance is to partition your data strategically. The key principle is to partition your Iceberg table based on columns that are most frequently used in query filters. This approach can significantly improve query efficiency and reduce the amount of data scanned, leading to faster query execution and lower costs.

Why partitioning Iceberg tables matters?

  • Improved Query Performance: When you partition on columns commonly used in WHERE clauses, Amazon Redshift can eliminate irrelevant partitions, reducing the amount of data it needs to scan. For example, if you have a sales table partitioned by date and you run a query to analyze sales data for January 2024, Amazon Redshift will only scan the January 2024 partition instead of the entire table. This partition pruning can dramatically improve query performance—in this scenario, if you have five years of sales data, scanning just one month means examining only 1.67% of the total data, potentially reducing query execution time from minutes to seconds.
  • Reduced Scan Costs: By scanning less data, you can lower the computational resources required and, consequently the associated costs.
  • Better Data Organization: Logical partitioning helps in organizing data in a way that aligns with common query patterns, making data retrieval more intuitive and efficient.

How to partition Iceberg tables?

  • Analyze your workload to determine which columns are most frequently used in filter conditions. For example, if you always filter your data for the last 6months, then that date will be a good partition key.
  • Select columns that have high cardinality but not too high to avoid creating too many small partitions. Good candidates often include:
    • Date or timestamp columns (such as, year, month, day)
    • Categorical columns with a moderate number of distinct values (such as, region, product category)
  • Define Partition Strategy: Use Iceberg’s partitioning capabilities to define your strategy. For example if you are using Amazon Athena to create a partitioned Iceberg table, you can use the following syntax.
CREATE TABLE [db_name.]table_name (col_name data_type [COMMENT col_comment] [, …]
[PARTITIONED BY (col_name | transform, … )]
LOCATION 's3://amzn-s3-demo-bucket/your-folder/'
TBLPROPERTIES ( 'table_type' = 'ICEBERG' [, property_name=property_value] )

Example

CREATE TABLE iceberg_db.my_table ( id INT, date DATE, region STRING, sales DECIMAL(10,2) )
PARTITIONED BY (date, region)
LOCATION 's3://amzn-s3-demo-bucket/your-folder/'
TBLPROPERTIES ( 'table_type' = 'ICEBERG' )
  • Ensure your Redshift queries take advantage of the partitioning scheme by including partition columns in the WHERE clause whenever possible.

Walk-through with a sample usecase
Let’s take an example to understand how to pick the best partition key by following best practices. Consider an e-commerce company looking to optimize their sales data analysis using Apache Iceberg tables with Amazon Redshift. The company maintains a table called sales_transactions, which has data for 5 years across four regions (North America, Europe, Asia, and Australia) with five product categories (Electronics, Clothing, Home & Garden, Books, and Toys). The dataset includes key columns such as transaction_id, transaction_date, customer_id, product_id, product_category, region, and sale_amount.

The data science team uses transaction_date and region columns frequently in filters, while product_category is used less frequently. The transaction_date column has high cardinality (one value per day), region has low cardinality (only 4 distinct values) and product_category has moderate cardinality (5 distinct values).

Based on this analysis, an effective partition strategy would be to partition by year and month from the transaction_date, and by region. This creates a manageable number of partitions while improving the most common query patterns. Here’s how we could implement this strategy using Amazon Athena:

CREATE TABLE iceberg_db.sales_transactions ( transaction_id STRING, transaction_date DATE, customer_id STRING, product_id STRING, product_category STRING, region STRING, sale_amount DECIMAL(10,2))
PARTITIONED BY (transaction_date, region)
LOCATION 's3://athena-371178653860-22hcl401/sales-data/'
TBLPROPERTIES ('table_type' = 'ICEBERG');

3. Optimize by selecting only the necessary columns for query

Another best practice for working with Iceberg tables is to only select the columns that are necessary for a given query, and to avoid using the SELECT * syntax.

Why should you select only necessary columns?

  • Improved Query Performance: In analytics workloads, users typically analyze subsets of data, performing large-scale aggregations or trend analyses. To optimize these operations, analytics storage systems and file formats are designed for efficient column-based reading. Examples include columnar open file formats like Apache Parquet and columnar databases such as Amazon Redshift. A key best practice to select only the required columns in your queries, so the query engine can reduce the amount of data that needs to be processed, scanned, and returned. This can lead to significantly faster query execution times, especially for large tables.
  • Reduced Resource Utilization: Fetching unnecessary columns consumes additional system resources, such as CPU, memory, and network bandwidth. Limiting the columns selected can help optimize resource utilization and improve the overall efficiency of the data processing pipeline.
  • Lower Data Transfer Costs: When querying Iceberg tables stored in cloud storage (e.g., Amazon S3), the amount of data transferred from the storage service to the query engine can directly impact the data transfer costs. Selecting only the required columns can help minimize these costs.
  • Better Data Locality: Iceberg partitions data based on the values in the partition columns. By selecting only the necessary columns, the query engine can better leverage the partitioning scheme to improve data locality and reduce the amount of data that needs to be scanned.

How to only select necessary columns?

  • Identify the Columns Needed: Carefully analyze the requirements of each query and determine the minimum set of columns required to fulfill the query’s purpose.
  • Use Selective Column Names: In the SELECT clause of your SQL queries, explicitly list the column names you need, rather than using SELECT *.

4. Generate AWS Glue data catalog column level statistics

Table statistics play an important role in database systems that utilize Cost-Based Optimizers (CBOs), such as Amazon Redshift. They help the CBO make informed decisions about query execution plans. When a query is submitted to Amazon Redshift, the CBO evaluates multiple possible execution plans and estimates their costs. These cost estimates heavily depend on accurate statistics about the data, including: Table size (number of rows), column value distributions, Number of distinct values in columns, Data skew information, and more.

AWS Glue Data Catalog supports generating statistics for data stored in the data lake including for Apache Iceberg. The statistics include metadata about the columns in a table, such as minimum value, maximum value, total null values, total distinct values, average length of values, and total occurrences of true values. These column-level statistics provide valuable metadata that helps optimize query performance and improve cost efficiency when working with Apache Iceberg tables.

Why generating AWS Glue statistics matter?

  • Amazon Redshift can generate better query plans using column statistics, thereby improve performance on queries due to optimized join orders, better predicate push-down and more accurate resource allocation.
  • Costs will be optimized. Better execution plans lead to reduced data scanning, more efficient resource utilization and overall lower query costs.

How to generate AWS Glue statistics?

The Sagemaker Lakehouse Catalog enables you to generate statistics automatically for updated and created tables with a one-time catalog configuration. As new tables are created, the number of distinct values (NDVs) are collected for Iceberg tables. By default, the Data Catalog generates and updates column statistics for all columns in the tables on a weekly basis. This job analyzes 50% of records in the tables to calculate statistics.

  • On the Lake Formation console, choose Catalogs in the navigation pane.
  • Select the catalog that you want to configure, and choose Edit on the Actions menu.
  • Select Enable automatic statistics generation for the tables of the catalog and choose an IAM role. For the required permissions, see Prerequisites for generating column statistics.
  • Choose Submit.

You can override the defaults and customize statistics collection at the table level to meet specific needs. For frequently updated tables, statistics can be refreshed more often than weekly. You can also specify target columns to focus on those most commonly queried. You can set what percentage of table records to use when calculating statistics. Therefore, you can increase this percentage for tables that need more precise statistics, or decrease it for tables where a smaller sample is sufficient to optimize costs and statistics generation performance.These table-level settings can override the catalog-level settings previously described.

Read the blog Introducing AWS Glue Data Catalog automation for table statistics collection for improved query performance on Amazon Redshift and Amazon Athena for more information.

5. Implement Table Maintenance Strategies for Optimal Performance

Over time, Apache Iceberg tables can accumulate various types of metadata and file artifacts that impact query performance and storage efficiency. Understanding and managing these artifacts is crucial for maintaining optimal performance of your data lake. As you use Iceberg tables, three main types of artifacts accumulate:

  • Small Files: When data is ingested into Iceberg tables, especially through streaming or frequent small batch updates, many small files can accumulate because each write operation typically creates new files rather than appending to existing ones.
  • Deleted Data Artifacts: Iceberg uses copy-on-write for updates and deletes. When records are deleted, Iceberg creates “delete markers” rather than immediately removing the data. These markers need to be processed during reads to filter out deleted records.
  • Snapshots: Every time you make changes to your table (insert, update, or delete data), Iceberg creates a new snapshot—essentially a point-in-time view of your table. While valuable for maintaining history, these snapshots increase metadata size over time, impacting query planning and execution.
  • Unreferenced Files: These are files that exist in storage but aren’t linked to any current table snapshot. They occur in two main scenarios:
    1. When old snapshots are expired, the files exclusively referenced by those snapshots become unreferenced
    2. When write operations are interrupted or fail midway, creating data files that aren’t properly linked to any snapshot

Why table maintenance matters?

Regular table maintenance delivers several important benefits:

  • Enhanced Query Performance: Consolidating small files reduces the number of file operations required during queries, while removing excess snapshots and delete markers streamlines metadata processing. These optimizations allow query engines to access and process data more efficiently.
  • Optimized Storage Utilization: Expiring old snapshots and removing unreferenced files frees up valuable storage space, helping you maintain cost-effective storage utilization as your data lake grows.
  • Improved Resource Efficiency: Maintaining well-organized tables with optimized file sizes and clean metadata requires less computational resources for query execution, allowing your analytics workloads to run faster and more efficiently.
  • Better Scalability: Properly maintained tables scale more effectively as data volumes grow, maintaining consistent performance characteristics even as your data lake expands.

How to perform table maintenance?

Three key maintenance operations help optimize Iceberg tables:

  1. Compaction: Combines smaller files into larger ones and merges delete files with data files, resulting in streamlined data access patterns and improved query performance.
  2. Snapshot Expiration: Removes old snapshots that are no longer needed while maintaining a configurable history window.
  3. Unreferenced File Removal: Identifies and removes files that are no longer referenced by any snapshot, reclaiming storage space and reducing the total number of objects the system needs to track.

AWS offers a fully managed Apache Iceberg data lake solution called S3 tables that automatically takes care of table maintenance, including:

  • Automatic Compaction: S3 Tables automatically perform compaction by combining multiple smaller objects into fewer, larger objects to improve Apache Iceberg query performance. When combining objects, compaction also applies the effects of row-level deletes in your table. You can manage compaction process based on the configurable table level properties.
    • targetFileSizeMB: Default is 512 MB. Can be configured to a value between between 64 MiB and 512 MiB.

Apache Iceberg offers various methods like Binpack, Sort, Z-order to compact data. By default Amazon S3 selects the best of these three compaction strategy automatically based on your table sort order

  • Automated Snapshot Management: S3 Tables automatically expires older snapshots based on configurable table level properties
    • MinimumSnapshots (1 by default): Minimum number of table snapshots that S3 Tables will retain
    • MaximumSnapshotAge (120 hours by default): This parameter determines the maximum age, in hours, for snapshots to be retained
  • Unreferenced File Removal: Automatically identifies and deletes objects not referenced by any table snapshots based on configurable bucket level properties:
    • unreferencedDays (3 days by default): Objects not referenced for this duration are marked as noncurrent
    • nonCurrentDays (10 days by default): Noncurrent objects are deleted after this duration

Note: Deletes of noncurrent objects are permanent with no way to recover these objects.

If you are managing Iceberg tables yourself, you’ll need to implement these maintenance tasks:

Using Athena:
  • Run OPTIMIZE command using the following syntax:
    OPTIMIZE [database_name.]<table_name>;

    This command triggers the compaction process, which utilizes a bin-packing algorithm to group small data files into larger ones. It also merges delete files with existing data files, effectively cleaning up the table and improving its structure.

  • Set the following table properties during iceberg table creation: vacuum_min_snapshots_to_keep (Default 1): Minimum snapshots to retain vacuum_max_snapshot_age_seconds (Default 432000 seconds or 5 days)
  • Periodically run the VACUUM command to expire old snapshots and remove unreferenced files. Recommended after performing operations like merge on iceberg tables. Syntax: VACUUM [database_name.]target_table. VACUUM performs snapshot expiration and orphan file removal
Using Spark SQL:
  • Schedule regular compaction jobs with Iceberg’s rewrite files action
  • Use expireSnapshots operation to remove old snapshots
  • Run deleteOrphanFiles operation to clean up unreferenced files
  • Establish a maintenance schedule based on your write patterns (hourly, daily, weekly)
  • Run these operations in sequence, typically compaction followed by snapshot expiration and unreferenced file removal
  • It’s especially important to run these operations after large ingest jobs, heavy delete operations, or overwrite operations

6. Create incremental materialized views on Apache Iceberg tables in Redshift to improve performance of time sensitive dashboard queries

Organizations across industries rely on data lake powered dashboards for time-sensitive metrics like sales trends, product performance, regional comparisons, and inventory rates. With underlying Iceberg tables containing billions of records and growing by millions daily, recalculating metrics from scratch during each dashboard refresh creates significant latency and degrades user experience.

The integration between Apache Iceberg and Amazon Redshift enables creating incremental materialized views on Iceberg tables to optimize dashboard query performance. These views enhance efficiency by:

  • Pre-computing and storing complex query results
  • Using incremental maintenance to process only recent changes since last refresh
  • Reducing compute and storage costs compared to full recalculations

Why incremental materialized views on Iceberg tables matter?

  • Performance Optimization: Pre-computed materialized views significantly accelerate dashboard queries, especially when accessing large-scale Iceberg tables
  • Cost Efficiency: Incremental maintenance through Amazon Redshift processes only recent changes, avoiding expensive full recomputation cycles
  • Customization: Views can be tailored to specific dashboard requirements, optimizing data access patterns and reducing processing overhead

How to create incremental materialized views?

  • Determine which Iceberg tables are the primary data sources for your time-sensitive dashboard queries.
  • Use the CREATE MATERIALIZED VIEW statement to define the materialized views on the Iceberg tables. Ensure that the materialized view definition includes only the necessary columns and any applicable aggregations or transformations.
  • If you have used all operators that are eligible for an incremental refresh, Amazon Redshift automatically creates an incrementally refresh-able materialized view. Refer to limitations for incremental refresh to understand the operations that are not eligible for an incremental refresh
  • Regularly refresh the materialized views using REFRESH MATERIALIZED VIEW command

7. Create Late binding views (LBVs) on Iceberg table to encapsulate business logic.

Amazon Redshift’s support for late binding views on external tables, including Apache Iceberg tables, allows you to encapsulate your business logic within the view definition. This best practice provides several benefits when working with Iceberg tables in Redshift.

Why create LBVs?

  • Centralized Business Logic: By defining the business logic in the view, you can ensure that the transformation, aggregation, and other processing steps are consistently applied across all queries that reference the view. This promotes code reuse and maintainability.
  • Abstraction from Underlying Data: Late binding views decouple the view definition from the underlying Iceberg table structure. This allows you to make changes to the Iceberg table, such as adding or removing columns, without having to update the view definitions that depend on the table.
  • Improved Query Performance: Redshift can optimize the execution of queries against late binding views, leveraging techniques like predicate pushdown and partition pruning to minimize the amount of data that needs to be processed.
  • Enhanced Data Security: By defining access controls and permissions at the view level, you can grant users access to only the data and functionality they require, improving the overall security of your data environment.

How to create LBVs?

  • Identify suitable Apache Iceberg tables: Determine which Iceberg tables are the primary data sources for your business logic and reporting requirements.
  • Create late binding views(LBVs): Use the CREATE VIEW statement to define the late binding views on the external Iceberg tables. Incorporate the necessary transformations, aggregations, and other business logic within the view definition.
    Example:

    CREATE VIEW my_iceberg_view AS
    SELECT col1,col2,SUM(col3) AS total_col3
    GROUP BY col1, col2
    WITH NO SCHEMA BINDING;
    

  • Grant View Permissions: Assign the appropriate permissions to the views, granting access to the users or roles that require access to the encapsulated business logic.

Conclusion

In this post, we covered best practices for using Amazon Redshift to query Apache Iceberg tables, focusing on fundamental design decisions. One key area is table design and data type selection, as this can have the greatest impact on your storage size and query performance. Additionally, using Amazon S3 Tables to have a fully-managed tables automatically handle essential maintenance tasks like compaction, snapshot management, and vacuum operations, allowing you to focus building your analytical applications.

As you build out your workflows to use Amazon Redshift with Apache Iceberg tables, considering the following best practices to help you achieve your workload goals:

  • Adopting Amazon S3 Tables for new implementations to leverage automated management features
  • Auditing existing table designs to identify opportunities for optimization
  • Developing a clear partitioning strategy based on actual query patterns
  • For self-managed Apache Iceberg tables on Amazon S3, implementing automated maintenance procedures for statistics generation and compaction


About the authors

Anusha Challa

Anusha Challa

Anusha is a Senior Analytics Specialist Solutions Architect focused on Amazon Redshift. She has helped 100s of customers build large-scale analytics solutions in the cloud and on premises. She is passionate about data analytics, data science and AI.

Mohammed Alkateb

Mohammed Alkateb

Mohammed is an Engineering Manager at Amazon Redshift. Mohammed has 18 US patents, and he has publications in research and industrial tracks of premier database conferences including EDBT, ICDE, SIGMOD and VLDB. Mohammed holds a PhD in Computer Science from The University of Vermont, and MSc and BSc degrees in Information Systems from Cairo University.

Jonathan Katz

Jonathan Katz

Jonathan is a Core Team member of the open source PostgreSQL project and an active open source contributor.

Enrich, standardize, and translate streaming data in Amazon Redshift with generative AI

Post Syndicated from Anusha Challa original https://aws.amazon.com/blogs/big-data/enrich-standardize-and-translate-streaming-data-in-amazon-redshift-with-generative-ai/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it straightforward and cost-effective to analyze your data. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.

Amazon Redshift ML is a feature of Amazon Redshift that enables you to build, train, and deploy machine learning (ML) models directly within the Redshift environment. Now, you can use pretrained publicly available large language models (LLMs) in Amazon SageMaker JumpStart as part of Redshift ML, allowing you to bring the power of LLMs to analytics. You can use pretrained publicly available LLMs from leading providers such as Meta, AI21 Labs, LightOn, Hugging Face, Amazon Alexa, and Cohere as part of your Redshift ML workflows. By integrating with LLMs, Redshift ML can support a wide variety of natural language processing (NLP) use cases on your analytical data, such as text summarization, sentiment analysis, named entity recognition, text generation, language translation, data standardization, data enrichment, and more. Through this feature, the power of generative artificial intelligence (AI) and LLMs is made available to you as simple SQL functions that you can apply on your datasets. The integration is designed to be simple to use and flexible to configure, allowing you to take advantage of the capabilities of advanced ML models within your Redshift data warehouse environment.

In this post, we demonstrate how Amazon Redshift can act as the data foundation for your generative AI use cases by enriching, standardizing, cleansing, and translating streaming data using natural language prompts and the power of generative AI. In today’s data-driven world, organizations often ingest real-time data streams from various sources, such as Internet of Things (IoT) devices, social media platforms, and transactional systems. However, this streaming data can be inconsistent, missing values, and be in non-standard formats, presenting significant challenges for downstream analysis and decision-making processes. By harnessing the power of generative AI, you can seamlessly enrich and standardize streaming data after ingesting it into Amazon Redshift, resulting in high-quality, consistent, and valuable insights. Generative AI models can derive new features from your data and enhance decision-making. This enriched and standardized data can then facilitate accurate real-time analysis, improved decision-making, and enhanced operational efficiency across various industries, including ecommerce, finance, healthcare, and manufacturing. For this use case, we use the Meta Llama-3-8B-Instruct LLM to demonstrate how to integrate it with Amazon Redshift to streamline the process of data enrichment, standardization, and cleansing.

Solution overview

The following diagram demonstrates how to use Redshift ML capabilities to integrate with LLMs to enrich, standardize, and cleanse streaming data. The process starts with raw streaming data coming from Amazon Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka (Amazon MSK), which is materialized in Amazon Redshift as raw data. User-defined functions (UDFs) are then applied to the raw data, which invoke an LLM deployed on SageMaker JumpStart to enrich and standardize the data. The enhanced, cleansed data is then stored back in Amazon Redshift, ready for accurate real-time analysis, improved decision-making, and enhanced operational efficiency.

To deploy this solution, we complete the following steps:

  1. Choose an LLM for the use case and deploy it using foundation models (FMs) in SageMaker JumpStart.
  2. Use Redshift ML to create a model referencing the SageMaker JumpStart LLM endpoint.
  3. Create a materialized view to load the raw streaming data.
  4. Call the model function with prompts to transform the data and view results.

Example data

The following code shows an example of raw order data from the stream:

Record1: {
    "orderID":"101",
    "email":" john. roe @example.com",
    "phone":"+44-1234567890",
    "address":"123 Elm Street, London",
    "comment": "please cancel if items are out of stock"
}
Record2: {
    "orderID":"102",
    "email":" jane.s mith @example.com",
    "phone":"(123)456-7890",
    "address":"123 Main St, Chicago, 12345",
    "comment": "Include a gift receipt"
}
Record3: {
    "orderID":"103",
    "email":"max.muller @example.com",
    "phone":"+498912345678",
    "address":"Musterstrabe, Bayern 00000",
    "comment": "Bitte nutzen Sie den Expressversand"
}
Record4: {
    "orderID":"104",
    "email":" julia @example.com",
    "phone":"(111) 4567890",
    "address":"000 main st, los angeles, 11111",
    "comment": "Entregar a la puerta"
}
Record5: {
    "orderID":"105",
    "email":" roberto @example.com",
    "phone":"+33 3 44 21 83 43",
    "address":"000 Jean Allemane, paris, 00000",
    "comment": "veuillez ajouter un emballage cadeau"
}

The raw data has inconsistent formatting for email and phone numbers, the address is incomplete and doesn’t have a country, and comments are in various languages. To address the challenges with the raw data, we can implement a comprehensive data transformation process using Redshift ML integrated with an LLM in an ETL workflow. This approach can help standardize the data, cleanse it, and enrich it to meet the desired output format.

The following table shows an example of enriched address data.

orderid Address Country (Identified using LLM)
101 123 Elm Street, London United Kingdom
102 123 Main St, Chicago, 12345 USA
103 Musterstrabe, Bayern 00000 Germany
104 000 main st, los angeles, 11111 USA
105 000 Jean Allemane, paris, 00000 France

The following table shows an example of standardized email and phone data.

orderid email

cleansed_email

(Using LLM)

Phone Standardized Phone (Using LLM)
101 john. roe @example.com john.roe@example.com +44-1234567890 +44 1234567890
102 jane.s mith @example.com jane.smith@example.com (123)456-7890 +1 1234567890
103 max.muller @example.com max.muller@example.com 498912345678 +49 8912345678
104 julia @example.com julia@example.com (111) 4567890 +1 1114567890
105 roberto @example.com roberto@example.com +33 3 44 21 83 43 +33 344218343

The following table shows an example of translated and enriched comment data.

orderid Comment

english_comment

(Translated using LLM)

comment_language

(Identified by LLM)

101 please cancel if items are out of stock please cancel if items are out of st English
102 Include a gift receipt Include a gift receipt English
103 Bitte nutzen Sie den Expressversand Please use express shipping German
104 Entregar a la puerta Leave at door step Spanish
105 veuillez ajouter un emballage cadeau Please add a gift wrap French

Prerequisites

Before you implement the steps in the walkthrough, make sure you have the following prerequisites:

Choose an LLM and deploy it using SageMaker JumpStart

Complete the following steps to deploy your LLM:

  1. On the SageMaker JumpStart console, choose Foundation models in the navigation pane.
  2. Search for your FM (for this post, Meta-Llama-3-8B-Instruct) and choose View model.
  3. On the Model details page, review the End User License Agreement (EULA) and choose Open notebook in Studio to start using the notebook in Amazon SageMaker Studio.
  4. In the Select domain and user profile pop-up, choose a profile, then choose Open Studio.
  5. When the notebook opens, in the Set up notebook environment pop-up, choose t3.medium or another instance type recommended in the notebook, then choose Select.
  6. Modify the notebook cell that has accept_eula = False to accept_eula = True.
  7. Select and run the first five cells (see the highlighted sections in the following screenshot) using the run icon.
  1. After you run the fifth cell, choose Endpoints under Deployments in the navigation pane, where you can see the endpoint created.
  2. Copy the endpoint name and wait until the endpoint status is In Service.

It can take 30–45 minutes for the endpoint to be available.

Use Redshift ML to create a model referencing the SageMaker JumpStart LLM endpoint

In this step, you create a model using Redshift ML and the bring your own model (BYOM) capability. After the model is created, you can use the output function to make remote inference to the LLM model. To create a model in Amazon Redshift for the LLM endpoint you created previously, complete the following steps:

  1. Log in to the Redshift endpoint using the Amazon Redshift Query Editor V2.
  2. Make sure you have the following AWS Identity and Access Management (IAM) policy added to the default IAM role. Replace <endpointname> with the SageMaker JumpStart endpoint name you captured earlier:
    {
      "Statement": [
          {
              "Action": "sagemaker:InvokeEndpoint",
              "Effect": "Allow",
              "Resource": "arn:aws:sagemaker:<region>:<AccountNumber>:endpoint/<endpointname>",
              "Principal": "*"
          }
      ]
    }

  3. In the query editor, run the following SQL statement to create a model in Amazon Redshift. Replace <endpointname> with the endpoint name you captured earlier. Note that the input and return data type for the model is the SUPER data type.
    CREATE MODEL meta_llama_3_8b_instruct
    FUNCTION meta_llama_3_8b_instruct(super)
    RETURNS SUPER
    SAGEMAKER '<endpointname>'
    IAM_ROLE default;

Create a materialized view to load raw streaming data

Use the following SQL to create materialized view for the data that is being streamed through the customer-orders stream. The materialized view is set to auto refresh and will be refreshed as data keeps arriving in the stream.

CREATE EXTERNAL SCHEMA kinesis_streams FROM KINESIS
IAM_ROLE default;

CREATE MATERIALIZED VIEW mv_customer_orders AUTO REFRESH YES AS
    SELECT 
    refresh_time,
    approximate_arrival_timestamp,
    partition_key,
    shard_id,
    sequence_number,
    --json_parse(from_varbyte(kinesis_data, 'utf-8')) as rawdata,
    json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'orderID',true)::character(36) as orderID,
    json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'email',true)::character(36) as email,
    json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'phone',true)::character(36) as phone,
    json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'address',true)::character(36) as address,
    json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'comment',true)::character(36) as comment
    FROM kinesis_streams."customer-orders";

After you run these SQL statements, the materialized view mv_customer_orders will be created and continuously updated as new data arrives in the customer-orders Kinesis data stream.

Call the model function with prompts to transform data and view results

Now you can call the Redshift ML LLM model function with prompts to transform the raw data and view the results. The input payload is a JSON with prompt and model parameters as attributes:

  • Prompt – The prompt is the input text or instruction provided to the generative AI model to generate new content. The prompt acts as a guiding signal that the model uses to produce relevant and coherent output. Each model has unique prompt engineering guidance. Refer to the Meta Llama 3 Instruct model card for its prompt formats and guidance.
  • Model parameters – The model parameters determine the behavior and output of the model. With model parameters, you can control the randomness, number of tokens generated, where the model should stop, and more.

In the Invoke endpoint section of the SageMaker Studio notebook, you can find the model parameters and example payloads.

k

The following SQL statement calls the Redshift ML LLM model function with prompts to standardize phone number and email data, identify the country from the address, and translate comments into English and identify the original comment’s language. The output of the SQL is stored in the table enhanced_raw_data_customer_orders.

create table enhanced_raw_data_customer_orders as
select phone,email,comment, address
  ,meta_llama_3_8b_instruct(json_parse('{"inputs":"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nConvert this phone number into a standard format: '||phone||'\n\nA standard phone number had plus sign followed by CountryCode followed by a space and the rest of the phone number without any spaces or dashes. \n\nExamples: +1 1234567890, +91 1234567890\n\nReturn only the standardized phone number, nothing else<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n","parameters":{"stop": "<|eot_id|>"}}')) as standardized_phone
  ,meta_llama_3_8b_instruct(json_parse('{"inputs":"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nConvert this email into a standard format:'||email||'\n\nA standard email ID does not have spaces and is lower case. Return only the standardized email and nothing else<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n","parameters":{"stop": "<|eot_id|>"}}')) as standardized_email
  ,meta_llama_3_8b_instruct(json_parse('{"inputs":"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nIdentify which country is this address in:'||address||'\n\nReturn only the country and nothing else<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n","parameters":{"stop": "<|eot_id|>"}}')) as country
  ,meta_llama_3_8b_instruct(json_parse('{"inputs":"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nTranslate this statement to english if it is not in english:'||comment||'\n\nReturn the english comment and nothing else. Output only english<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n","parameters":{"stop": "<|eot_id|>"}}')) as translated_comment
  ,meta_llama_3_8b_instruct(json_parse('{"inputs":"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nIdentify which language this statement is in:'||comment||'\n\nReturn only the language and nothing else<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n","parameters":{"stop": "<|eot_id|>"}}')) as orig_comment_language
  from mv_customer_orders;

Query the enhanced_raw_data_customer_orders table to view the data. The output of LLM is in JSON format with the result in the generated_text attribute. It’s stored in the SUPER data type and can be queried using PartiQL:

select 
    phone as raw_phone
    , standardized_phone.generated_text :: varchar as standardized_phone 
    , email as raw_email
    , standardized_email.generated_text :: varchar as standardized_email
    , address as raw_address
    , country.generated_text :: varchar as country
    , comment as raw_comment
    , translated_comment.generated_text :: varchar as translated_comment
    , orig_comment_language.generated_text :: varchar as orig_comment_language
from enhanced_raw_data_customer_orders;

The following screenshot shows our output.

Clean up

To avoid incurring future charges, delete the resources you created:

  1. Delete the LLM endpoint in SageMaker JumpStart by running the cell in the Clean up section in the Jupyter notebook.
  2. Delete the Kinesis data stream.
  3. Delete the Redshift Serverless workgroup or Redshift cluster.

Conclusion

In this post, we showed you how to enrich, standardize, and translate streaming data in Amazon Redshift with generative AI and LLMs. Specifically, we demonstrated the integration of the Meta Llama 3 8B Instruct LLM, available through SageMaker JumpStart, with Redshift ML. Although we used the Meta Llama 3 model as an example, you can use a variety of other pre-trained LLM models available in SageMaker JumpStart as part of your Redshift ML workflows. This integration allows you to explore a wide range of NLP use cases, such as data enrichment, content summarization, knowledge graph development, and more. The ability to seamlessly integrate advanced LLMs into your Redshift environment significantly broadens the analytical capabilities of Redshift ML. This empowers data analysts and developers to incorporate ML into their data warehouse workflows with streamlined processes driven by familiar SQL commands.

We encourage you to explore the full potential of this integration and experiment with implementing various use cases that integrate the power of generative AI and LLMs with Amazon Redshift. The combination of the scalability and performance of Amazon Redshift, along with the advanced natural language processing capabilities of LLMs, can unlock new possibilities for data-driven insights and decision-making.


About the authors

Anusha Challa is a Senior Analytics Specialist Solutions Architect focused on Amazon Redshift. She has helped many customers build large-scale data warehouse solutions in the cloud and on premises. She is passionate about data analytics and data science.

Run queries concurrently and see query history using Amazon Redshift Query Editor v2

Post Syndicated from Anusha Challa original https://aws.amazon.com/blogs/big-data/run-queries-concurrently-and-see-query-history-using-amazon-redshift-query-editor-v2/

Amazon Redshift is a fast, fully managed, petabyte-scale cloud data warehouse. You have the flexibility to choose from provisioned and serverless compute modes. You can start loading and querying large datasets conveniently in Amazon Redshift using Amazon Redshift Query Editor v2, a web-based SQL client application.

Query Editor v2 empowers your technical and business teams by providing several easy-to-use features. The following are some notable actions you can perform:

  • Browse through multiple database storage and code objects using a hierarchical tree-view panel.
  • Create databases, schemas, tables, functions, and more using an easy-to-follow GUI.
  • Load industry standard sample datasets such as tpcds, tpch, and tickit in just a few clicks.
  • Load data from Amazon Simple Storage Service (Amazon S3). You can query external datasets in Amazon S3 or Amazon Relational Database Service (Amazon RDS) PostgreSQL and MySQL databases.
  • Create multiple SQL editors and SQL notebooks in separate tabs to author and run queries. This offers the following features:
    • Use each tab to run queries on a different provisioned cluster’s database or serverless workgroup’s database. You can choose where to connect or change where you’re connected using a drop-down menu.
    • Create charts to visualize the output using the built-in chart wizard. It supports different types of charts, such as histogram, bar chart, area chart, and more.
    • Export query results into JSON or CSV formats.
    • Turn on explain graph to display a graphical representation of your query’s explain plan.
  • Save queries and share them to collaborate with your teams.
  • Use SQL notebooks to organize, annotate, and share multiple SQL queries in a single document. With SQL notebooks, you can present a compelling data story to your stakeholders.
  • Define session-level variables.

In this post, we describe two of Query Editor v2’s most requested features:

  • Run multiple queries concurrently
  • View the query history for an individual tab or consolidated query history for all tabs

Run queries concurrently

With Amazon Redshift Query Editor v2, you can run multiple queries concurrently on a provisioned cluster’s database or serverless workgroup’s database. In the past, you had to wait for query runs on other tabs to complete in order to start a new query run. This is no longer the case in Query Editor v2. You can use multiple editors or notebooks that are using isolated sessions to run multiple queries concurrently.

To run queries in SQL editors or SQL notebooks in Query Editor v2, you start by connecting to a serverless workgroup or provisioned cluster’s database. Then you create a SQL editor tab or SQL notebook tab to author queries and loads. Each tab can either use an isolated session or a shared session. New tabs in Query Editor v2 use an isolated session by default. If a tab uses an isolated session, the queries in other tabs can’t see the session-level changes made by it. For example, a temporary table is valid only within a session. If you create a temporary table using a SQL editor tab that is using an isolated session, the other tabs—even if they’re connected to the same endpoint and database—can’t see this temporary table.

You can change an isolated session to a shared session by turning off the Isolated session option. Tabs using a shared session can see the session-level changes (such as temporary tables) created in other tabs that use shared connections to the same database. A connection is unique to a provisioned cluster’s database or a serverless workgroup’s database. Tabs connected to the same endpoint (provisioned cluster or serverless workgroup) but to different databases can’t share the same connection because the databases they’re connected to are different.

In Query Editor v2, you can run queries concurrently on the same database from tabs that use isolated sessions. Tabs using the shared connection must wait until a query run is complete in other tabs that are sharing the same connection.

Let’s see how you can load data into multiple tables concurrently using Query Editor V2. The tables we loading for this example are orders, supplier, and customer from the tpcds dataset.

Follow these steps to run queries concurrently:

  1. On the Amazon Redshift console, navigate to Amazon Redshift Query Editor v2.
  2. In the tree-view panel, choose the Amazon Redshift provisioned cluster or Amazon Redshift Serverless workgroup you want to connect.

You can navigate through multiple connections and view objects, as shown in the following screenshot.

redshift query v2

  1. Connect by choosing one of the authentication modes.
  2. Choose the plus sign to create as many SQL editors as the concurrent queries you require. Because we’re going to run queries to load three tables concurrently, we created three editors.

Redshift query editor v2

  1. To save SQL queries, choose (double-click) the name and enter a name to describe the query (for example, query1, query2, query3).
  2. You can choose the serverless workgroup or provisioned cluster to connect by using the drop-down menu, as shown in the following screenshot. You can also choose the database to connect. Because we want to run queries to load all three tables on the same database, choose the same compute and database for all SQL editor tabs.

serverless: workgroup

  1. Author queries you want to run concurrently in each SQL editor.
  2. Choose Run on each tab to run the queries concurrently.

Like in SQL editors, you can run queries in multiple SQL notebooks concurrently. After you author the notebooks, choose Run all in each of the notebooks.

run queries in multiple SQL notebooks

Account settings for concurrent connections

By default, you can have three concurrent connections running queries using Query Editor v2. This is an account-level setting that can only be changed by an admin user. In account settings, you can change the maximum concurrent connections value from the default value 3 to a value between 1–10. To open account settings, choose the settings icon and choose Account settings.

account settings

Under Connection settings, choose a number between 1–10 for Maximum concurrent connections and choose Save. It can take up to 10 minutes for the setting change to take effect.

connection settings

This lets you to control the number of queries your users can run concurrently, so that they don’t put a large load on the database. If your users run more than the allowed number of concurrent queries, they receive an error indicating that “The current limit of <<?>> connections has been reached. Close another connection, use a non-isolated session or contact your Query Editor v2 account administrator to adjust the limit.”

View connections

To see connections, choose the settings icon and choose Connections.

connections

For each connection, the cluster or workgroup name, database name, database user, type of session (isolated or shared) and status (busy if a query is actively running, idle if no query is running) are displayed. You can choose Go to tab to navigate to the tab associated with the connection. You can choose Close to close the connection.

close the connection

View query history

You can see the history of the last 1,000 queries that ran in Query Editor v2 in the query history. If you forgot to save your queries, you can retrieve them from the query history. You can also see the duration, status, runtime, and query text of your queries. Queries that ran from all SQL editors and SQL notebooks are available in the query history.

To get started, navigate to the Query history page in Query Editor v2. You can choose to see the last 1,000 queries that ran in the last 3 days, this week, this month, this year, or for all time.

Query history

Search for queries in query history

You can also search for queries. For example, to search for queries that have the word “nation,” enter nation in the search box and press Enter. The query history page refreshes to show queries with the keyword “nation.”

nation

Similarly, you can search for the queries that ran on a provisioned cluster or serverless workgroup. For example, to search for queries that ran on the Amazon Redshift Serverless workgroup that have the phrase “curate” in its name, enter curate in the search box.

curate

You can also search for queries that ran on a database. Enter the name of the database in the search box. For example, the following are the queries that ran on the database sample_data_dev.

sample_data_dev

View query details

For any query in the query history, you can see query details by choosing View query details on the Actions menu for that query.

view query details

For the chosen query, you can see the following details:

  • Local time when the query run started
  • Local time when the query run ended
  • Total query runtime
  • Query status (Running, Succeeded, Failed, or Canceled)
  • Cluster or workgroup in which the query ran
  • Database in which the query ran

query details

From the query details, to go back to the query history, simply choose Back.

Open query in a new tab

You can open a query in a new tab by selecting the query in the query history and choosing Open query in a new tab on the Actions menu.

Open query in a new tab

The query opens in a new untitled SQL editor.

opens in a new untitled SQL editor

Open saved queries, saved notebooks, or the source tab

You can save SQL editors and SQL notebooks authored using Query Editor v2. To see them, you can navigate to the Saved queries and Saved notebooks pages, respectively. If the query you’re seeing from the query history is part of a saved query, the Open saved query option is available on the Actions menu. You can then open the saved query by choosing that option.

Open saved query

If the query you’re seeing in the query history is part of a saved notebook, the Open saved notebook option is available on the Actions menu. You can then open the saved notebook by choosing that option.

Open saved notebook

If you ran your query from an un-saved editor tab, and the tab isn’t closed yet, you can open the source tab used for the query run by choosing the Open source tab option on the Actions menu.

Open source tab

View tab history

In Query Editor v2, in addition to seeing a consolidated query history for all SQL editors and SQL notebooks, you can see the tab-level query run history. Tabs can represent either an editor or a SQL notebook. You can see tab history for both of them.

View tab history for SQL editor tabs

SQL editor tabs are represented by the file icon. To see tab history, choose the options menu (three dots) and choose Tab history.

Tab history

When the tab history opens, you can see the queries that were run in that SQL editor tab and how long ago were they run. You can copy the query, open it in a new SQL editor tab, or see query details by choosing the options menu (three dots) next to each query.

see query details by choosing the options menu

View tab history for notebook tabs

Notebook tabs are represented by the notebook icon. On the notebook tab, to see tab history, choose the options menu (three dots) and choose Tab history.

Tab history 2

When the tab history opens, you can see the queries that were run in that notebook tab and how long ago were they run. You can copy the query, open it in a new SQL editor tab, or see query details by choosing the options menu (three dots) next to each query.

tab history for notebook tabs

Conclusion

In this post, we introduced you to concurrent query runs and query history features of Amazon Redshift Query Editor v2. It has powerful yet easy-to-use features that your teams can use to query and load datasets. If you have any questions or suggestions, please leave a comment.

Happy querying!


About the Authors

Anusha Challa is a Senior Analytics Specialist Solutions Architect focused on Amazon Redshift. She has helped many customers build large scale data warehouses in the cloud and on premises.

Bahadir Özavci is a Senior Software Engineer focused on Amazon Redshift. He primarily works on designing and building features for Amazon Redshift customers to provide a great IDE experience. Outside of work, you can find him cooking or playing roguelike video games.

Mohamed Shaaban is a Senior Software Engineer in Amazon Redshift and is based in Berlin, Germany. He has over 12 years of experience in the software engineering. He is passionate about cloud services and building solutions that delight customers. Outside of work, he is an amateur photographer who loves to explore and capture unique moments.

Erol Murtezaoglu, a Technical Product Manager at AWS, is an inquisitive and enthusiastic thinker with a drive for self-improvement and learning. He has a strong and proven technical background in software development and architecture, balanced with a drive to deliver commercially successful products. Erol highly values the process of understanding customer needs and problems in order to deliver solutions that exceed expectations.

Use Amazon Redshift Spectrum with row-level and cell-level security policies defined in AWS Lake Formation

Post Syndicated from Anusha Challa original https://aws.amazon.com/blogs/big-data/use-amazon-redshift-spectrum-with-row-level-and-cell-level-security-policies-defined-in-aws-lake-formation/

Data warehouses and data lakes are key to an enterprise data management strategy. A data lake is a centralized repository that consolidates your data in any format at any scale and makes it available for different kinds of analytics. A data warehouse, on the other hand, has cleansed, enriched, and transformed data that is optimized for faster queries. Amazon Redshift is a cloud-based data warehouse that powers a lake house architecture, which enables you to query the data in a data warehouse and an Amazon Simple Storage Service (Amazon S3) data lake using familiar SQL statements and gain deeper insights.

Data lakes often contain data for multiple business units, users, locations, vendors, and tenants. Enterprises want to share their data while balancing compliance and security needs. To satisfy compliance requirements and to achieve data isolation, enterprises often need to control access at the row level and cell level. For example:

  • If you have a multi-tenant data lake, you may want each tenant to be able to view only those rows that are associated to their tenant ID
  • You may have data for multiple portfolios in the data lake and you need to control access for various portfolio managers
  • You may have sensitive information or personally identifiable information (PII) that can be viewed by users with elevated privileges only

AWS Lake Formation makes it easy to set up a secure data lake and access controls for these kinds of use cases. You can use Lake Formation to centrally define security, governance, and auditing policies, thereby achieving unified governance for your data lake. Lake Formation supports row-level security and cell-level security:

  • Row-level security allows you to specify filter expressions that limit access to specific rows of a table to a user
  • Cell-level security builds on row-level security by allowing you to apply filter expressions on each row to hide or show specific columns

Amazon Redshift is the fastest and most widely used cloud data warehouse. Amazon Redshift Spectrum is a feature of Amazon Redshift that enables you to query data from and write data back to Amazon S3 in open formats. You can query open file formats such as Parquet, ORC, JSON, Avro, CSV, and more directly in Amazon S3 using familiar ANSI SQL. This gives you the flexibility to store highly structured, frequently accessed data in an Amazon Redshift data warehouse, while also keeping up to exabytes of structured, semi-structured, and unstructured data in Amazon S3. Redshift Spectrum integrates with Lake Formation natively. This integration enables you to define data filters in Lake Formation that specify row-level and cell-level access control for users on your data and then query it using Redshift Spectrum.

In this post, we present a sample multi-tenant scenario and describe how to define row-level and cell-level security policies in Lake Formation. We also show how these policies are applied when querying the data using Redshift Spectrum.

Solution overview

In our use case, Example Corp has built an enterprise data lake on Amazon S3. They store data for multiple tenants in the data lake and query it using Redshift Spectrum. Example Corp maintains separate AWS Identity and Access Management (IAM) roles for each of their tenants and wants to control access to the multi-tenant dataset based on their IAM role.

Example Corp needs to ensure that the tenants can view only those rows that are associated to them. For example, Tenant1 should see only those rows where tenantid = 'Tenant1' and Tenant2 should see only those rows where tenantid = 'Tenant2'. Also, tenants can only view sensitive columns such as phone, email, and date of birth associated to specific countries.

The following is a screenshot of the multi-tenant dataset we use to demonstrate our solution. It has data for two tenants: Tenant1 and Tenant2. tenantid is the column that distinguishes data associated to each tenant.

To solve this use case, we implement row-level and cell-level security in Lake Formation by defining data filters. When Example Corp’s tenants query the data using Redshift Spectrum, the service checks filters defined in Lake Formation and returns only the data that the tenant has access to.

Lake Formation metadata tables contain information about data in the data lake, including schema information, partition information, and data location. You can use them to access underlying data in the data lake and manage that data with Lake Formation permissions. You can apply row-level and cell-level security to Lake Formation tables. In this post, we provide a walkthrough using a standard Lake Formation table.

The following diagram illustrates our solution architecture.

The solution workflow consists of the following steps:

  1. Create IAM roles for the tenants.
  2. Register an Amazon S3 location in Lake Formation.
  3. Create a database and use AWS Glue crawlers to create a table in Lake Formation.
  4. Create data filters in Lake Formation.
  5. Grant access to the IAM roles in Lake Formation.
  6. Attach the IAM roles to the Amazon Redshift cluster.
  7. Create an external schema in Amazon Redshift.
  8. Create Amazon Redshift users for each tenant and grant access to the external schema.
  9. Users Tenant1 and Tenant2 assume their respective IAM roles and query data using the SQL query editor or any SQL client to their external schemas inside Amazon Redshift.

Prerequisites

This walkthrough assumes that you have the following prerequisites:

Create IAM roles for the tenants

Create IAM roles Tenant1ReadRole and Tenant2ReadRole for users with elevated privileges for the two tenants, with Amazon Redshift as the trusted entity, and attach the following policy to both roles:

{
	"Version": "2012-10-17",
	"Statement": [{
		"Effect": "Allow",
		"Action": [
			"lakeformation:GetDataAccess",
			"glue:GetTable",
			"glue:GetTables",
			"glue:SearchTables",
			"glue:GetDatabase",
			"glue:GetDatabases",
			"glue:GetPartition",
			"glue:GetPartitions"
		],
		"Resource": "*"
	}]
}

Register an Amazon S3 location in Lake Formation

We use the sample multi-tenant dataset SpectrumRowLevelFiltering.csv. Complete the following steps to register the location of this dataset in Lake Formation:

  1. Download the dataset and upload it to the Amazon S3 path s3://<your_bucket>/order_details/SpectrumRowLevelFiltering.csv.
  2. On the Lake Formation console, choose Data lake locations in the navigation pane.
  3. Choose Register location.
  4. For Amazon S3 path, enter the S3 path of your dataset.
  5. For IAM role, choose either the AWSServiceRoleForLakeFormationDataAccess service-linked role (the default) or the Lake Formation administrator role mentioned in the prerequisites.
  6. Choose Register location.

Create a database and a table in Lake Formation

To create your database and table, complete the following steps:

  1. Sign in to the AWS Management Console as the data lake administrator.
  2. On the Lake Formation console, choose Databases in the navigation pane.
  3. Choose Create database.
  4. For Name, enter rs_spectrum_rls_blog.
  5. If Use only IAM access control for new tables in this database is selected, uncheck it.
  6. Choose Create database.Next, you create a new data lake table.
  7. On the AWS Glue console, choose Crawlers in navigation pane.
  8. Choose Add crawler.
  9. For Crawler name, enter order_details.
  10. For Specify crawler source type, keep the default selections.
  11. For Add data store, choose Include path, and choose the S3 path to the dataset (s3://<your_bucket>/order_details/).
  12. For Choose IAM Role, choose Create an IAM role, with the suffix rs_spectrum_rls_blog.
  13. For Frequency, choose Run on demand.
  14. For Database, choose database you just created (rs_spectrum_rls_blog).
  15. Choose Finish to create the crawler.
  16. Grant CREATE TABLE permissions and DESCRIBE/ALTER/DELETE database permissions to the IAM role you created in Step 12.
  17. To run the crawler, in the navigation pane, choose Crawlers.
  18. Select the crawler order_details and choose Run crawler.When the crawler is complete, you can find the table order_details created under the database rs_spectrum_rls_blog in the AWS Glue Data Catalog.
  19. On the AWS Glue console, in the navigation pane, choose Databases.
  20. Select the database rs_spectrum_rls_blog and choose View tables.
  21. Choose the table order_details.

The following screenshot is the schema of the order_details table.

Create data filters in Lake Formation

To implement row-level and cell-level security, first you create data filters. Then you choose that data filter while granting SELECT permission on the tables. For this use case, you create two data filters: one for Tenant1 and one for Tenant2.

  1. On the Lake Formation console, choose Data catalog in the navigation pane, then choose Data filters.
  2. Choose Create new filter.
    Let’s create the first data filter filter-tenant1-order-details restricting the rows Tenant1 is able to see in table order_details.
  3. For Data filter name, enter filter-tenant1-order-details.
  4. For Target database, choose rs_spectrum_rls_blog.
  5. For Target table, choose order_details.
  6. For Column-level access, select Include columns and then choose the following columns: c_emailaddress, c_phone, c_dob, c_firstname, c_address, c_country, c_lastname, and tenanted.
  7. For Row filter expression, enter tenantid = 'Tenant1' and c_country in  (‘USA’,‘Spain’).
  8. Choose Create filter.
  9. Repeat these steps to create another data filter filter-tenant2-order-details, with row filter expression tenantid = 'Tenant2' and c_country in (‘USA’,‘Canada’).

Grant access to IAM roles in Lake Formation

After you create the data filters, you need to attach them to the table to grant access to a principal. First let’s grant access to order_details to the IAM role Tenant1ReadRole using the data filter we created for Tenant1.

  1. On the Lake Formation console, in the navigation pane, under Permissions, choose Data Permissions.
  2. Choose Grant.
  3. In the Principals section, select IAM users and roles.
  4. For IAM users and roles, choose the role Tenant1ReadRole.
  5. In the LF-Tags or catalog resources section, choose Named data catalog resources.
  6. For Databases, choose rs_spectrum_rls_blog.
  7. For Tables, choose order_details.
  8. For Data filters, choose filter-tenant1-order-details.
  9. For Data filter permissions, choose Select.
  10. Choose Grant.
  11. Repeat these steps with the IAM role Tenant2ReadRole and data filter filter-tenant2-order-details.

Attach the IAM roles to the Amazon Redshift cluster

To attach your roles to the cluster, complete the following steps:

  1. On the Amazon Redshift console, in the navigation menu, choose CLUSTERS, then select the name of the cluster that you want to update.
  2. On the Actions menu, choose Manage IAM roles.The IAM roles page appears.
  3. Either choose Enter ARN and enter an ARN of the Tenant1ReadRole IAM role, or choose the Tenant1ReadRole IAM role from the list.
  4. Choose Add IAM role.
  5. Choose Done to associate the IAM role with the cluster.The cluster is modified to complete the change.
  6. Repeat these steps to add the Tenant2ReadRole IAM role to the Amazon Redshift cluster.

Amazon Redshift allows up to 50 IAM roles to attach to the cluster to access other AWS services.

Create an external schema in Amazon Redshift

Create an external schema on the Amazon Redshift cluster, one for each IAM role, using the following code:

CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_tenant1
FROM DATA CATALOG DATABASE 'rs_spectrum_rls_blog'
IAM_ROLE '<<Tenant1ReadRole ARN>>'
REGION 'us-east-1';

CREATE EXTERNAL SCHEMA IF NOT EXISTS  spectrum_tenant2
FROM DATA CATALOG DATABASE  'rs_spectrum_rls_blog'
IAM_ROLE '<<Tenant2ReadRole ARN>>'
REGION 'us-east-1';

Create Amazon Redshift users for each tenant and grant access to the external schema

Complete the following steps:

  1. Create Amazon Redshift users to restrict access to the external schemas (connect to the cluster with a user that has permission to create users or superusers) using the following code:
    CREATE USER tenant1_user WITH PASSWORD '<password>';
    CREATE USER tenant2_user WITH PASSWORD '<password>';

  2. Let’s create the read-only role (tenant1_ro) to provide read-only access to the spectrum_tenant1 schema:
    create role tenant1_ro;

  3. Grant usage on spectrum_tenant1 schema to the read-only tenant1_ro role:
    grant usage on schema spectrum_tenant1 to role tenant1_ro;

  4. Now assign the user to the read-only tenant1_ro role:
    grant role tenant1_ro to tenant1_user;

  5. Repeat the same steps to grant permission to the user tenant2_user:
    create role tenant2_ro;
    grant usage on schema spectrum_tenant2 to role tenant2_ro;
    grant role tenant2_ro to tenant2_user;

Tenant1 and Tenant2 users run queries using the SQL editor or a SQL client

To test the permission levels for different users, connect to the database using the query editor with that user.

In the Query Editor in the Amazon Redshift console, connect to the cluster with tenant1_user and run the following query:

-- Query table 'order_details' in schema spectrum_tenant1 with role Tenant1ReadRole

SELECT * FROM spectrum_tenant1.order_details;

In the following screenshot, tenant1_user is only able to see records where the tenantid value is Tenant1 and only the customer PII fields specific to the US and Spain.

To validate the Lake Formation data filters, the following screenshot shows that Tenant1 can’t see any records for Tenant2.

Reconnect to the cluster using tenant2_user and run the following query:

-- Query table 'order_details' in schema spectrum_tenant2 with role Tenant2ReadRole

SELECT * FROM spectrum_tenant2.order_details;

In the following screenshot, tenant2_user is only able to see records where the tenantid value is Tenant2 and only the customer PII fields specific to the US and Canada.

To validate the Lake Formation data filters, the following screenshot shows that Tenant2 can’t see any records for Tenant1.

Conclusion

In this post, you learned how to implement row-level and cell-level security on an Amazon S3-based data lake using data filters and access control features in Lake Formation. You also learned how to use Redshift Spectrum to access the data from Amazon S3 while adhering to the row-level and cell-level security policies defined in Lake Formation.

You can further enhance your understanding of Lake Formation row-level and cell-level security by referring to Effective data lakes using AWS Lake Formation, Part 4: Implementing cell-level and row-level security.

To learn more about Redshift Spectrum, refer Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes—No Loading Required.

For more information about configuring row-level access control natively in Amazon Redshift, refer to Achieve fine-grained data security with row-level access control in Amazon Redshift.


About the authors

Anusha Challa is a Senior Analytics Specialist Solutions Architect at AWS. Her expertise is in building large-scale data warehouses, both on premises and in the cloud. She provides architectural guidance to our customers on end-to-end data warehousing implementations and migrations.

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS.

Automate Amazon Redshift Cluster management operations using AWS CloudFormation

Post Syndicated from Anusha Challa original https://aws.amazon.com/blogs/big-data/automate-amazon-redshift-cluster-management-operations-using-aws-cloudformation/

Amazon Redshift is the fastest and most widely used cloud data warehouse. Tens of thousands of customers run business-critical workloads on Amazon Redshift. Amazon Redshift offers many features that enable you to build scalable, highly performant, cost-effective, and easy-to-manage workloads. For example, you can scale an Amazon Redshift cluster up or down based on your workload requirements, pause clusters when not in use to suspend on-demand billing, and enable relocation. You can automate these management activities either using the Amazon Redshift API, AWS Command Line Interface (AWS CLI), or AWS CloudFormation.

AWS CloudFormation helps you model and set up your AWS resources so that you can spend less time managing those resources and more time focusing on your applications that run in AWS. You create a template that describes all the AWS resources that you want, and AWS CloudFormation takes care of provisioning and configuring those resources for you.

In this post, we walk through how to use AWS CloudFormation to automate some of the most common Amazon Redshift cluster management operations:

  • Create an Amazon Redshift cluster via the following methods:
    • Restore a cluster from a snapshot
    • Create an encrypted Amazon Redshift cluster
  • Perform cluster management operations:
    • Pause or resume a cluster
    • Perform elastic resize or classic resize
    • Add or remove Identity and Access Management (IAM) roles to cluster permissions
    • Rotate encryption keys
    • Modify snapshot retention period for automated and manual snapshots
    • Enable or disable snapshot copy to another AWS Region
    • Create a parameter group with required workload management (WLM) configuration and associate it to the Amazon Redshift cluster
    • Enable concurrency scaling by modifying WLM configuration
    • Enable or disable audit logging

For a complete list of operations that you can automate using AWS CloudFormation, see Amazon Redshift resource type reference.

Benefits of using CloudFormation templates

Many of our customers build fully automated production environments and use AWS CloudFormation to aid automation. AWS CloudFormation offers an easy way to create and manage AWS Infrastructure, by treating infrastructure as code. CloudFormation templates create infrastructure resources in a group called a stack, and allow you to define and customize all components. CloudFormation templates introduce the ability to implement version control, and the ability to quickly and reliably replicate your infrastructure. This significantly simplifies your continuous integration and continuous delivery (CI/CD) pipelines and keeps multiple environments in sync. The CloudFormation template becomes a repeatable, single source of truth for your infrastructure. You can create CloudFormation templates in YAML and JSON formats. The templates provided in this post use YAML format. Amazon Redshift clusters are one of the resources that you can provision and manage using AWS CloudFormation.

Create an Amazon Redshift cluster using AWS CloudFormation

With AWS CloudFormation, you can automate Amazon Redshift cluster creation. In this section, we describe two additional ways of creating Amazon Redshift clusters: by restoring from an existing snapshot or creating using default options. In the next section, we describe how you can manage the lifecycle of a cluster by performing cluster management operations using AWS CloudFormation.

Restore an Amazon Redshift cluster from a snapshot

Snapshots are point-in-time backups of a cluster. Amazon Redshift periodically takes automated snapshots of the cluster. You can also take a snapshot manually any time. A snapshot contains data from any databases that are running on the cluster. Snapshots enable data protection, and you can also use them to build new environments to perform application testing, data mining, and more. When you start a new development or enhancement project, you may want to build a new Amazon Redshift cluster that has the same code and data as that of your production, so you can develop and test your code there before deploying it. To do this, create the new cluster by restoring from your production cluster’s snapshot.

You can use AWS CloudFormation to automate the restore operation. If you have multiple projects with different timelines or requirements, you can build multiple Amazon Redshift clusters in an automatic and scalable fashion using AWS CloudFormation.

To create a new Amazon Redshift cluster by restoring from an existing snapshot, create a CloudFormation stack using a CloudFormation template that has the AWS::Redshift::Cluster resource with the following mandatory properties:

  • SnapshotIdentifier – The name of the snapshot from which to create the new cluster
  • ClusterIdentifier – A unique identifier of your choice for the cluster
  • NodeType – The node type to be provisioned for the cluster
  • ClusterType – The type of the cluster, either single-node or multi-node (recommended for production workloads)
  • NumberofNodes – The number of compute nodes in the cluster
  • DBName – The name of the first database to be created in the new cluster
  • MasterUserName – The user name associated with the admin user account for the cluster that is being created
  • MasterUserPassword – The password associated with the admin user account for the cluster that is being created

The following is a sample CloudFormation template that restores a snapshot with identifier cfn-blog-redshift-snapshot and creates a two-node Amazon Redshift cluster with identifier cfn-blog-redshift-cluster and node type ra3.4xlarge:

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password 
Resources:
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      SnapshotIdentifier: "cfn-blog-redshift-snapshot"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      NodeType: "ra3.4xlarge"
      NumberOfNodes: "2"
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

Create an encrypted Amazon Redshift cluster

You can enable database encryption for your clusters to protect data at rest.

Use the following sample CloudFormation template to create an encrypted Amazon Redshift cluster. This template has basic properties only to make this walkthrough easy to understand. For your production workload, we recommend following the best practices as described in the post Automate Amazon Redshift cluster creation using AWS CloudFormation.

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password. Must be 8-64 characters long. Must contain at least one uppercase letter, one lowercase letter and one number. Can be any printable ASCII character except “/”, ““”, or “@”.
Resources:
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      NodeType: "ra3.4xlarge"
      NumberOfNodes: "2"
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

In AWS CloudFormation, create a stack using this template. When the creation of CloudFormation stack is complete, you can see a new encrypted Amazon Redshift cluster called cfn-blog-redshift-cluster. For the rest of this post, we use this CloudFormation template as the base and explore how to modify the template to perform various Amazon Redshift management operations.

To create a new CloudFormation stack that uses the preceding template via the AWS CloudFormation console, complete the following steps:

  1. On the AWS CloudFormation console, choose Create Stack.
  2. On the drop-down menu, choose With new resources (standard).
  3. For Prepare template, choose Template is ready.
  4. For Specify template, choose Upload a template file.
  5. Save the provided CloudFormation template in a .yaml file and upload it.
  6. Choose Next.
  7. Enter a name for the stack. For this post, we use RedshiftClusterStack-CFNBlog.
  8. Choose Next.
  9. Choose Next again.
  10. Choose Create Stack.

To create the CloudFormation stack using the AWS CLI, run the following command:

aws cloudformation create-stack \
--stack-name RedshiftClusterStack-CFNBlog \
--template-body <<CloudFormation template’s file name>> \

The status of the stack changes to CREATE_IN_PROGRESS. After the Amazon Redshift cluster is created, the status changes to CREATE_COMPLETE. Navigate to the Amazon Redshift console to verify that the cluster is created and the status is Available.

Perform cluster management operations

Amazon Redshift customers have the flexibility to perform various cluster operations to implement workload security, perform cost-optimization, and manage scale. Often, we see our customers perform these operations in all their environments, such as DEV, QA, and PROD, to keep them in sync. You can automate these operations using CloudFormation stack updates by updating the CloudFormation template you used to create the cluster.

To perform these management operations using CloudFormation stack updates, you can create your initial CloudFormation stack in one of the two ways:

  • If your cluster isn’t already created, you can create it using AWS CloudFormation.
  • If you have an existing cluster, create a CloudFormation stack with the using existing resources option. Provide a template of the existing cluster and have a CloudFormation stack associated with the resource.

Each subsequent cluster management operation is an update to the base CloudFormation stack’s template. CloudFormation stack updates enable you to make changes to a stack’s resources by performing an update to the stack instead of deleting it and creating a new stack. Update to either add, modify, or remove relevant property values in the AWS::Redshift::Cluster resource to trigger the respective Amazon Redshift cluster management operation. AWS CloudFormation compares the changes you submit with the current state of your stack and applies only the changes. For a summary of the update workflow, see How does AWS CloudFormation work?

The following steps describe how to perform a CloudFormation stack update on the RedshiftClusterStack-CFNBlog stack that you created in the previous section.

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Choose the stack you want to update (for this post, RedshiftClusterStack-CFNBlog).
  3. Choose Update.
  4. In the Prerequisite – Prepare template section, choose Replace current template.
  5. For Specify template, choose Upload a template file.
  6. Make changes to your current CloudFormation template based on the operation you wish to perform on the Amazon Redshift cluster.
  7. Save the updated CloudFormation template .yaml file and upload it.
  8. Choose Next.
  9. Choose Next again.
  10. Choose Next again.
  11. Choose Update Stack.

To update CloudFormation stack using the AWS CLI, run the following command:

aws cloudformation update-stack \
--stack-name RedshiftClusterStack-CFNBlog \
--template-body <<Updated CloudFormation template’s file name>> \

In this section, we look at some AWS CloudFormation properties available for Amazon Redshift clusters and dive deep to understand how to update the CloudFormation template and add, remove, or modify these properties to automate some of the most common cluster management operations. We use the RedshiftClusterStack-CFNBlog stack that you created in the previous section as an example.

Pause or resume cluster

If an Amazon Redshift cluster isn’t being used for a certain period of time, you can pause it easily and suspend on-demand billing. For example, you can suspend on-demand billing on a cluster that is used for development when it’s not in use. While the cluster is paused, you’re only charged for the cluster’s storage. This adds significant flexibility in managing operating costs for your Amazon Redshift clusters. You can resume a paused cluster when you’re ready to use it. To pause a cluster, update the cluster’s current CloudFormation stack template to add a new property called ResourceAction with the value pause-cluster. To pause the cluster created using the RedshiftClusterStack-CFNBlog stack, you can perform a stack update and use the following updated CloudFormation template:

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      #Added this property to pause cluster
      ResourceAction: "pause-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      NodeType: "ra3.4xlarge"
      NumberOfNodes: "2"
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

Follow the steps in the previous section to update the stack. When the stack update operation is in progress, the cluster’s status changes from Available to Modifying, Pausing.

When the stack update is complete, the cluster’s status changes to Paused.

When you’re ready to use the cluster, you can resume it. To resume the cluster, update the cluster’s current CloudFormation stack template and change the value of the ResourceAction property to resume-cluster. You can use the following template to perform a stack update operation to resume the cluster created using the RedshiftClusterStack-CFNBlog stack:

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      #Added this property to resume cluster
      ResourceAction: "resume-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      NodeType: "ra3.4xlarge"
      NumberOfNodes: "2"
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

Follow the steps in the previous section to update the stack. When the stack update is in progress, the cluster’s status changes from Paused to Modifying, Resuming.

When the stack update is complete, the cluster’s status changes to Available.

Perform elastic resize or classic resize

Data warehouse workloads often have changing needs. You may add a new line of business and thereby ingest more data into the data warehouse, or you may have a new analytics application for your business users and add new ETL processes to support it. When your compute requirements change due to changing needs, you can resize your Amazon Redshift cluster using one of the following approaches:

  • Elastic resize – This changes the node type, number of nodes, or both. Typically, it completes within 10–15 minutes when adding or removing nodes of the same type. Cross-instance elastic resize can take up to 45 minutes. We recommend using elastic resize whenever possible, because it completes much more quickly than classic resize. Elastic resize has some growth and reduction limits on the number of nodes.
  • Classic resize – You can also use classic resize to change the node type, number of nodes, or both. We recommend this option only when you’re resizing to a configuration that isn’t available through elastic resize, because it takes considerably more time depending on your data size.

You can automate both elastic resize and classic resize operations on Amazon Redshift clusters using AWS CloudFormation. The default resize operation when initiated using a CloudFormation stack update is elastic resize. If elastic resize isn’t possible for your configuration, AWS CloudFormation throws an error. You can force the resize operation to be classic resize by specifying the value of the property Classic to Boolean true in the CloudFormation template provided in the update stack operation. If you don’t provide this parameter or set the value to false, the resize type is elastic. To initiate the resize operation, update the cluster’s current CloudFormation stack template and change the value of NodeType and NumberOfNodes properties as per your requirement.

To perform an elastic resize from the initial two-node RA3 4xlarge configuration to NumberOfNodes:2 and NodeType:ra3.16xlarge configuration on the Amazon Redshift cluster cfn-blog-redshift-cluster, you can update the current template of the RedshiftClusterStack-CFNBlog stack as shown in the following CloudFormation template:

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      ResourceAction: "resume-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      #Modified the below 2 properties to perform elastic resize
      NodeType: "ra3.16xlarge"
      NumberOfNodes: "2"
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

Alternatively, if you want to force this resize to be a classic resize, use the following CloudFormation template to update the RedshiftClusterStack-CFNBlog stack. This template has an additional property, Classic with Boolean value true, to initiate classic resize, in addition to having updated NodeType and NumberOfNodes properties.

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      ResourceAction: "resume-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      #Modified the below 3 properties to perform classic resize
      NodeType: "ra3.16xlarge"
      NumberOfNodes: "2"
      Classic: true
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

Follow the steps earlier in this post to update the stack. For both elastic resize and classic resize, when the stack update is in progress, the cluster’s status changes from Available to Resizing.

Add or remove IAM roles to cluster permissions

Your Amazon Redshift cluster needs permissions to access other AWS services on your behalf. For the required permissions, add IAM roles to cluster permissions. You can add up to 10 IAM roles. For instructions on creating roles, see Create an IAM role.

To add IAM roles to the cluster, update the cluster’s current CloudFormation stack template and add the IamRoles property with a list of IAM roles you want to add. For example, to add IAM roles cfn-blog-redshift-role-1 and cfn-blog-redshift-role-2 to the cluster cfn-blog-redshift-cluster, you can update the RedshiftClusterStack-CFNBlog stack using the following CloudFormation template. In this template, the new array property IamRoles has been added with values cfn-blog-redshift-role-1 and cfn-blog-redshift-role-2.

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      ResourceAction: "resume-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      NodeType: "ra3.4xlarge"
      NumberOfNodes: "2"
      #Added IAMRoles property with ARNs for 2 roles
      IamRoles: [
                    !Sub "arn:${AWS::Partition}:iam::${AWS::AccountId}:role/cfn-blog-redshift-role-1",
                    !Sub "arn:${AWS::Partition}:iam::${AWS::AccountId}:role/cfn-blog-redshift-role-2"
                ]
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

Now, if you want to remove the IAM role cfn-blog-redshift-role-2 from the cfn-blog-redshift-cluster cluster, you can perform another CloudFormation stack update on RedshiftClusterStack-CFNBlog using the following CloudFormation template. This template contains only those IAM roles you want to retain.

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      ResourceAction: "resume-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      NodeType: "ra3.4xlarge"
      NumberOfNodes: "2"
      #Updated IAMRoles property to remove role-2
      IamRoles: [
                    !Sub "arn:${AWS::Partition}:iam::${AWS::AccountId}:role/cfn-blog-redshift-role-1"
                ]
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

Follow the steps earlier in this post to update the stack. When the stack update is in progress, the cluster’s status changes from Available to Available, Modifying.

Navigate to cluster’s properties tab and the Cluster Permissions section to validate that the IAM roles were associated to the cluster after the stack update is complete.

Rotate encryption keys on the cluster

You can enable database encryption for your Amazon Redshift clusters to protect data at rest. You can rotate encryption keys using AWS CloudFormation. To rotate encryption keys, update the base CloudFormation template to add the RotateEncryptionKey property and set it to Boolean true. For example, you can use the following CloudFormation template to rotate the encryption key for cfn-blog-redshift-cluster by performing an update on the CloudFormation stack RedshiftClusterStack-CFNBlog:

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      ResourceAction: "resume-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      NodeType: "ra3.4xlarge"
      NumberOfNodes: "2"
      #Added RotateEncryptionKey property
      RotateEncryptionKey: true
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

Follow the steps earlier in this post to update the stack. When the stack update is in progress, the cluster’s status changes from Available to Modifying, Rotating Keys. It takes approximately 2 minutes to rotate the encryption keys.

Modify the snapshot retention period for automated and manual snapshots

Amazon Redshift takes periodic automated snapshots of the cluster. By default, automated snapshots are retained for 24 hours. You can change the retention period of automated snapshots to 0–35 days. Amazon Redshift deletes automated snapshots at the end of a snapshot’s retention period, when you disable automated snapshots for the cluster or delete the cluster.

If you set the automated snapshot retention period to 0 days, the automated snapshots feature is disabled and any existing automated snapshots are deleted. Exercise caution before setting the automated snapshot retention period to 0.

You can take manual snapshots of the cluster any time. By default, manual snapshots are retained indefinitely, even after you delete the cluster. You can also specify the retention period when you create a manual snapshot. When the snapshot retention period is modified on automated snapshots, it applies to both existing and new automated snapshots. In contrast, when the snapshot retention period is modified on manual snapshots, it applies to new manual snapshots only.

To modify the retention period on snapshots, update your current cluster’s CloudFormation stack template. Add or update the AutomatedSnapshotRetentionPeriod property with an integer value (must be between 0–35) indicating the new retention period in days for automated snapshots, and the ManualSnapshotRetentionPeriod property with an integer value (must be between 1–3653) indicating the new retention period in days for manual snapshots.

The following CloudFormation template sets the AutomatedSnapshotRetentionPeriod to 7 days and ManualSnapshotRetentionPeriod to 90 days on cfn-blog-redshift-cluster when you update the current CloudFormation stack RedshiftClusterStack-CFNBlog:

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      ResourceAction: "resume-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      NodeType: "ra3.4xlarge"
      NumberOfNodes: "2"
      #Add snapshot retention properties
      AutomatedSnapshotRetentionPeriod: 7
      ManualSnapshotRetentionPeriod: 90
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

Enable or disable snapshot copies to another Region

You can configure your Amazon Redshift cluster to copy all new manual and automated snapshots for a cluster to another Region. You can choose how long to keep copied automated or manual snapshots in the destination Region. If the cluster is encrypted, because AWS Key Management Service (AWS KMS) keys are specific to a Region, you must configure a snapshot copy grant for a primary key in the destination Region. For information on how to create a snapshot copy grant, see Copying AWS KMS–encrypted snapshots to another AWS Region. Make sure that the snapshot copy grant is created before enabling snapshot copy to another Region using CloudFormation templates.

To enable snapshot copy to another Region, update your current cluster’s CloudFormation stack template and add or update the following properties:

  • DestinationRegion – Required to enable snapshot copy. It specifies the destination Region that snapshots are automatically copied to.
  • SnapshotCopyRetentionPeriod – Optional. Modifies the number of days to retain snapshots in the destination Region. If this property is not specified, the retention period is the same as that of the source Region. By default, this operation only modifies the retention period of existing and new copied automated snapshots. To change the retention period of copied manual snapshots using this property, set the SnapshotCopyManual property to true.
  • SnapshotCopyManual – Indicates whether to apply the snapshot retention period to newly copied manual snapshots instead of automated snapshots. If you set this option, only newly copied manual snapshots have the new retention period.
  • SnapshotCopyGrantName – The name of the snapshot copy grant.

To copy snapshots taken from cfn-blog-redshift-cluster into the Region us-west-1 and to modify the retention period of the newly copied manual snapshots to 90 days, update the current CloudFormation stack RedshiftClusterStack-CFNBlog with the following CloudFormation template:

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      ResourceAction: "resume-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      NodeType: "ra3.4xlarge"
      NumberOfNodes: "2"
      #Add cross-region snapshot copy properties
      DestinationRegion: "us-west-1"
      SnapshotCopyGrantName: "cfn-blog-redshift-cross-region-snapshot-copy-grant"
      SnapshotCopyManual: true
      SnapshotCopyRetentionPeriod: 90
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

To disable cross-Region snapshot copy, update your current CloudFormation stack’s template and remove the properties DestinationRegion, SnapshotCopyRetentionPeriod, SnapshotCopyManual and SnapshotCopyGrantName.

Create a parameter group with required WLM configuration and assign it to an Amazon Redshift cluster

You can use AWS CloudFormation to create an Amazon Redshift parameter group and associate it to an Amazon Redshift cluster. If you don’t associate a parameter group, the default parameter group is assigned, which has the defaults for parameter values and WLM configuration. When you create a new parameter group, it also has the defaults for parameters, unless you override them. The following CloudFormation template overrides the default values for the require_ssl and wlm_json_configuration parameters. The WLM configuration is specified in JSON format. In this template, automatic WLM configuration is defined on cfn-blog-redshift-cluster with three queues: etl_queue, reporting_queue, and the default queue, with the following specifications:

  • Priority for etl_queue is set to highest. It’s configured to route all queries run by users belonging to the group named etl_group to etl_queue.
  • Priority for reporting_queue is set to normal. It’s configured to route all queries run by users belonging to any group name with the word report in it or any query having a query group with the word report in it to the reporting_queue.
  • The following three query monitoring rules are defined to protect reporting_queue from bad queries:
    • When query runtime is greater than 7,200 seconds (2 hours), the query is stopped.
    • If a query has a nested loop join with more than 1,000,000 rows, its priority is changed to lowest.
    • If any query consumes more than 50% CPU utilization, it is logged.
  • All other queries are routed to the default queue. Priority for default_queue is set to lowest.
  • The following three query monitoring rules are defined to protect the default queue from bad queries:
    • If a query has a nested loop join with more than 1,000,000 rows, it is stopped.
    • If a query has a large return set and is returning more than 1,000,000 rows, it is stopped.
    • If more than 10 GB spilled to disk for a query, it is stopped.
AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  #Add parameter group resource
  RedshiftClusterParameterGroup: 
    Type: "AWS::Redshift::ClusterParameterGroup"
    Properties: 
      Description: "CFNBlog-Redshift-Cluster-parameter-group"
      ParameterGroupFamily: "redshift-1.0"
      Parameters: 
        - 
          ParameterName: "require_ssl"
          ParameterValue: "true"
        -
          ParameterName: "wlm_json_configuration"
          ParameterValue: "[
                            {
                              \"name\":\"etl_queue\",
                              \"user_group\":[\"etl_user\"],
                              \"auto_wlm\":true,
                              \"queue_type\":\"auto\",
                              \"priority\":\"highest\"
                            },
                            {
                              \"name\":\"reporting_queue\",
                              \"user_group\":[\"%report%\"],
                              \"user_group_wild_card\":1,
                              \"query_group\":[\"%report%\"],
                              \"query_group_wild_card\":1,
                              \"auto_wlm\":true,
                              \"queue_type\":\"auto\",
                              \"priority\":\"normal\",
                              \"rules\":[
                                        {
                                          \"rule_name\":\"timeout_2hours\",
                                          \"predicate\":[
                                                        {
                                                          \"metric_name\":\"query_execution_time\",
                                                          \"operator\":\">\",\"value\":7200
                                                        }
                                                      ],
                                          \"action\":\"abort\"
                                        },
                                        {
                                          \"rule_name\":\"nested_loop_reporting\",
                                          \"action\":\"change_query_priority\",
                                          \"predicate\":[
                                                        {
                                                          \"metric_name\":\"nested_loop_join_row_count\",
                                                          \"operator\":\">\",\"value\":1000000
                                                        }
                                                      ],
                                          \"value\":\"lowest\"
                                        },
                                        {
                                          \"rule_name\":\"expensive_computation\",
                                          \"predicate\":[
                                                        {
                                                          \"metric_name\":\"query_cpu_usage_percent\",
                                                          \"operator\":\">\",\"value\":50
                                                        }
                                                      ],
                                          \"action\":\"log\"
                                        }
                                      ]
                            },
                            {
                              \"name\":\"Default queue\",
                              \"auto_wlm\":true,
                              \"priority\":\"lowest\",
                              \"rules\":[
                                        {
                                          \"rule_name\":\"nested_loop\",
                                          \"action\":\"abort\",
                                          \"predicate\":[
                                                        {
                                                          \"metric_name\":\"nested_loop_join_row_count\",
                                                          \"operator\":\">\",\"value\":1000000
                                                        }
                                                      ]
                                        },
                                        {
                                          \"rule_name\":\"large_return_set\",
                                          \"action\":\"abort\",
                                          \"predicate\":[
                                                        {
                                                          \"metric_name\":\"return_row_count\",
                                                          \"operator\":\">\",\"value\":1000000
                                                        }
                                                      ]
                                        },
                                        {
                                          \"rule_name\":\"large_spill_to_disk\",
                                          \"action\":\"abort\",
                                          \"predicate\":[
                                                        {
                                                          \"metric_name\":\"query_temp_blocks_to_disk\",
                                                          \"operator\":\">\",
                                                          \"value\":10000
                                                        }
                                                      ]
                                        }
                                      ]
                            },
                            {
                              \"short_query_queue\":true
                            }
                          ]"
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      ResourceAction: "resume-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      NodeType: "ra3.4xlarge"
      NumberOfNodes: "2"
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
      #Add parameter to associate parameter group
      ClusterParameterGroupName: !Ref RedshiftClusterParameterGroup
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

Enable concurrency scaling by modifying WLM configuration

Concurrency scaling is an autoscaling feature of Amazon Redshift that enables you to support virtually unlimited concurrent users. When you turn on concurrency scaling, Amazon Redshift automatically adds additional cluster capacity to process an increase in both read queries and write queries.

You can enable concurrency scaling at an individual WLM queue level. To enable concurrency scaling using AWS CloudFormation, update you current stack’s CloudFormation template and change the parameter value for wlm_json_configuration to add a property called concurrency_scaling and set its value to auto.

The following CloudFormation template sets concurrency scaling to auto on reporting_queue. It also overrides the value for the max_concurrency_scaling_clusters parameter from default 1 to 5.

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  RedshiftClusterParameterGroup: 
    Type: "AWS::Redshift::ClusterParameterGroup"
    Properties: 
      Description: "CFNBlog-Redshift-Cluster-parameter-group"
      ParameterGroupFamily: "redshift-1.0"
      Parameters: 
        - 
          ParameterName: "require_ssl"
          ParameterValue: "true"
        #add parameter to change default value for max number of concurrency scaling clusters parameter
        - 
          ParameterName: "max_concurrency_scaling_clusters"
          ParameterValue: "5"
        #Updated wlm configuration to set concurrency_scaling to auto
        -
          ParameterName: "wlm_json_configuration"
          ParameterValue: "[
                            {
                              \"name\":\"etl_queue\",
                              \"user_group\":[\"etl_user\"],
                              \"auto_wlm\":true,
                              \"queue_type\":\"auto\",
                              \"priority\":\"highest\"
                            },
                            {
                              \"name\":\"reporting_queue\",
                              \"user_group\":[\"%report%\"],
                              \"user_group_wild_card\":1,
                              \"query_group\":[\"%report%\"],
                              \"query_group_wild_card\":1,
                              \"auto_wlm\":true,
                              \"queue_type\":\"auto\",
                              \"priority\":\"normal\",
                              \"concurrency_scaling\":\"auto\"
                            },
                            {
                              \"name\":\"Default queue\",
                              \"auto_wlm\":true,
                              \"priority\":\"lowest\"
                            },
                            {
                              \"short_query_queue\":true
                            }
                          ]"
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      ResourceAction: "resume-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      NodeType: "ra3.4xlarge"
      NumberOfNodes: "2"
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
      ClusterParameterGroupName: !Ref RedshiftClusterParameterGroup
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

Enable or disable audit logging

When you enable audit logging, Amazon Redshift creates and uploads the connection log, user log, and user activity logs to Amazon Simple Storage Service (Amazon S3). You can automate enabling and disabling audit logging using AWS CloudFormation. To enable audit logging, update the base CloudFormation template to add the LoggingProperties property with the following sub-properties:

  • BucketName – The name of an existing S3 bucket where the log files are to be stored.
  • S3KeyPrefix – The prefix applied to the log file names

Also update the parameter group to change the value of the enable_user_activity_logging parameter to true.

To enable audit logging on cfn-blog-redshift-cluster and deliver log files to BucketName: cfn-blog-redshift-cluster-audit-logs with the S3KeyPrefix:cfn-blog, update the current CloudFormation stack RedshiftClusterStack-CFNBlog with the following CloudFormation template:

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  RedshiftClusterParameterGroup: 
    Type: "AWS::Redshift::ClusterParameterGroup"
    Properties: 
      Description: "CFNBlog-Redshift-Cluster-parameter-group"
      ParameterGroupFamily: "redshift-1.0"
      Parameters: 
        - 
          ParameterName: "require_ssl"
          ParameterValue: "true"
        - 
          ParameterName: "max_concurrency_scaling_clusters"
          ParameterValue: "5"
        #add parameter to enable user activity logging
        -
          ParameterName: "enable_user_activity_logging"
          ParameterValue: "true"
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      ResourceAction: "resume-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      NodeType: "ra3.4xlarge"
      NumberOfNodes: "2"
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
      #Add LoggingProperties and its sub-properties
      LoggingProperties:
          BucketName: "cfn-blog-redshift-cluster-audit-logs"
          S3KeyPrefix: "cfn-blog/"
      ClusterParameterGroupName: !Ref RedshiftClusterParameterGroup
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

Follow the steps earlier in this post to update the stack. To disable audit logging, update the cluster’s CloudFormation stack template and remove the LoggingProperties property.

Concurrent operations

You can perform multiple cluster management operations using a single CloudFormation stack update. For example, you can perform elastic resize and rotate encryption keys on the cfn-blog-redshift-cluster using the following CloudFormation template. This template updates the values for NodeType and NumberOfNodes, which results in an elastic resize operation, and also sets the RotateEncryptionKey parameter value to Boolean true, which results in the encryption key rotation.

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      ResourceAction: "resume-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      # Change Node type, number of nodes and rotate encryption
      NodeType: "ra3.16xlarge"
      NumberOfNodes: "2"
      RotateEncryptionKey: true
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

Conclusion

You have now learned how to automate management operations on Amazon Redshift clusters using AWS CloudFormation. For a full list of properties you can update using this process, see Properties. For more sample CloudFormation templates, see Amazon Redshift template snippets.


About the Authors

Anusha Challa is a Senior Analytics Specialist Solutions Architect at AWS. She has over a decade of experience in building large-scale data warehouses, both on-premises and in the cloud. She provides architectural guidance to our customers on end-to-end data warehousing implementations and migrations.

Shweta Yakkali is a Software Engineer for Amazon Redshift, where she works on developing features for Redshift Cloud infrastructure. She is passionate about innovations in cloud infrastructure and enjoys learning new technologies and building enhanced features for Redshift. She holds M.S in Computer Science from Rochester Institute of Technology, New York. Outside of work, she enjoys dancing, painting and playing badminton.

Zirui Hua is a Software Development Engineer for Amazon Redshift, where he works on developing next generation features for Redshift. His main focuses are on networking and proxy of database. Outside of work, he likes to play tennis and basketball.