All posts by Ranjan Burman

Enable Multi-AZ deployments for your Amazon Redshift data warehouse

Post Syndicated from Ranjan Burman original https://aws.amazon.com/blogs/big-data/enable-multi-az-deployments-for-your-amazon-redshift-data-warehouse/

November 2023: This post was reviewed and updated with the general availability of Multi-AZ deployments for provisioned RA3 clusters.
Originally published on December 9th, 2022.

Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse that enables you to analyze large datasets using standard SQL. Data warehouse workloads are increasingly being used with mission-critical analytics applications that require the highest levels of resilience and availability. Amazon Redshift supports multiple recovery capabilities to address unforeseen outages and minimize downtime. Amazon Redshift RA3 instance types store their data in Redshift Managed Storage (RMS), which is backed by Amazon Simple Storage Service (Amazon S3), making it highly available and durable by default. Amazon Redshift also supports automatic backups that can recover a data warehouse, automatic failure remediation, and cluster relocation to a different Availability Zone without changes to applications. Although many customers benefit from these features, enterprise data warehouse customers require a low Recovery Time Objective (RTO) and higher availability to support business continuity with minimal impact to applications.

Amazon Redshift has announced the general availability of Multi-AZ deployments for provisioned RA3 clusters, which support running your data warehouse in two Availability Zones simultaneously and can continue operating in unforeseen failure scenarios. A Multi-AZ deployment is intended for customers with mission-critical analytics applications that require the highest levels of resilience and availability.

A Redshift Multi-AZ deployment uses compute resources in two Availability Zones to scale data warehouse workload processing. In situations with a high level of concurrency, Redshift automatically uses the resources in both Availability Zones to scale the workload for both read and write requests.

Our pre-launch tests found that Amazon Redshift Multi-AZ deployments reduce recovery time to 60 seconds or less in the unlikely case of an Availability Zone failure.

Single-AZ vs. Multi-AZ deployment

Amazon Redshift requires a cluster subnet group to create a cluster in your VPC. The cluster subnet group includes information about the VPC ID and a list of subnets in your VPC. When you launch a cluster, Amazon Redshift either creates a default cluster subnet group automatically or uses a cluster subnet group that you choose, so that it can provision your cluster in one of the subnets in the VPC. You can configure your cluster subnet group to add subnets from the different Availability Zones that you want Amazon Redshift to use for cluster deployment.

All Amazon Redshift clusters today are created and situated in a particular Availability Zone within an AWS Region and thus called Single-AZ deployments. For a Single-AZ deployment, Amazon Redshift selects the subnet from one of the Availability Zones within a Region and deploys the cluster there. You can choose an Availability Zone for deployment, and Amazon Redshift will deploy your cluster in the chosen Availability Zone based on the subnets provided.

On the other hand, a Multi-AZ deployment is provisioned in two Availability Zones simultaneously. For a Multi-AZ deployment, Amazon Redshift automatically selects two subnets from two different Availability Zones and deploys an equal number of compute nodes in each Availability Zone. All these compute nodes are accessed through a single endpoint, because compute nodes from both Availability Zones are used for workload processing.

As shown in the following diagrams, Amazon Redshift deploys a cluster in a single Availability Zone for Single-AZ deployment, and two Availability Zones for Multi-AZ deployment.

Auto recovery of a Multi-AZ deployment

In the unlikely event of an Availability Zone failure, Amazon Redshift Multi-AZ deployments continue to serve your workloads by automatically using resources in the other Availability Zone. You are not required to make any application changes to maintain business continuity during unforeseen outages since a multi-AZ deployment is accessed as a single data warehouse with one endpoint. Amazon Redshift Multi-AZ deployments are designed to ensure there is no data loss, and you can query all data committed up until the point of failure.

As shown in the following diagram, if an unlikely event causes the compute nodes in AZ1 to fail, a Multi-AZ deployment automatically recovers to use the compute resources in AZ2. Amazon Redshift also automatically provisions identical compute nodes in another Availability Zone (AZ3) to continue operating simultaneously in two Availability Zones (AZ2 and AZ3).

An Amazon Redshift Multi-AZ deployment not only protects against the possibility of Availability Zone failures, but can also maximize your data warehouse performance by automatically distributing workload processing across two Availability Zones. A Multi-AZ deployment always processes an individual query using compute resources from only one Availability Zone, but it can automatically distribute the processing of multiple simultaneous queries to both Availability Zones to increase overall performance for high concurrency workloads.

It’s a good practice to set up automatic retries in your extract, transform, and load (ETL) processes and dashboards so that queries can be reissued and served by the compute resources in the other Availability Zone when an unlikely failure happens in the primary Availability Zone. If a connection is dropped, it can be retried or reestablished immediately. In addition, queries and loads that were running in the failed Availability Zone are aborted. New queries issued at or after the time of failure may experience runtime delays while the Multi-AZ data warehouse recovers back to a two-AZ setup.
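As a minimal illustration of such retry logic, the following bash sketch resubmits a statement through the Redshift Data API with a linear backoff. The cluster identifier, database, user, and table are placeholders to adapt to your environment.

#!/bin/bash
# Hypothetical retry wrapper: resubmit a statement through the Redshift
# Data API until submission succeeds, backing off between attempts.
CLUSTER_ID="redshift-cluster-1"   # placeholder cluster identifier
MAX_RETRIES=5

for attempt in $(seq 1 "$MAX_RETRIES"); do
  if STATEMENT_ID=$(aws redshift-data execute-statement \
        --cluster-identifier "$CLUSTER_ID" \
        --database dev \
        --db-user awsuser \
        --sql "select count(*) from sales;" \
        --output text --query 'Id'); then
    echo "Submitted statement $STATEMENT_ID"
    break
  fi
  echo "Attempt $attempt failed; retrying in $((attempt * 10))s..."
  sleep $((attempt * 10))
done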

Overview of solution

In this post, we provide a walkthrough of how to create and manage a Multi-AZ deployment for Amazon Redshift using the AWS Management Console. We also test the fault tolerance of an Amazon Redshift Multi-AZ data warehouse and monitor queries in your Multi-AZ deployment.

Create a new Multi-AZ deployment from the console

You can easily create a new Multi-AZ deployment through the Amazon Redshift console. Amazon Redshift deploys the same number of nodes in each of the two Availability Zones for a Multi-AZ deployment. All nodes of a Multi-AZ deployment can perform read and write workload processing during normal operation. Multi-AZ deployments are supported only for provisioned RA3 clusters.

Follow these steps to create an Amazon Redshift provisioned cluster in two Availability Zones:

  1. On the Amazon Redshift console, in the navigation pane, choose Clusters.
  2. Choose Create cluster.

For general information about creating clusters, see Creating a cluster.

  3. Choose one of the RA3 node types on the Node type drop-down menu. The Multi-AZ deployment option only becomes available when you choose an RA3 node type.
  4. For Multi-AZ deployment, select the Multi-AZ option.
  5. For Number of nodes per AZ, enter the number of nodes that you need for your cluster.

  6. Under Database configurations, enter the Admin user name and Admin user password.
  7. Turn off Use defaults next to Additional configurations to modify the default settings.
  8. Under Network and security, specify the following:
    1. For Virtual private cloud (VPC), choose the VPC you want to deploy the cluster in.
    2. For VPC security groups, either leave as default or add the security groups of your choice.
    3. For Cluster subnet group, either leave as default or add a cluster subnet group of your choice. For a Multi-AZ deployment, the cluster subnet group must include subnets from at least three different Availability Zones.

For general information about managing cluster subnet groups, see Cluster subnet groups.

  9. Under Database configuration, for Database port, either use the default value 5439 or choose a value from the ranges 5431–5455 and 8191–8215.
  10. In the Database encryption section, to use a custom AWS Key Management Service (AWS KMS) key other than the default KMS key, choose Customize encryption settings. This option is deselected by default.
  11. For Choose an AWS KMS key, you can either choose an existing KMS key or choose Create an AWS KMS key to create a new one.

For more information about creating keys in AWS KMS, refer to Creating keys.

  12. Choose Create cluster.

When the cluster creation succeeds, you can view the details on the cluster details page.

Under General information, you can see Multi-AZ as Yes.

On the Properties tab, under Network and security settings, you can find the details on the primary and secondary Availability Zone.

Create a new Multi-AZ deployment from the CLI

The following create-cluster AWS CLI command shows how to create a Multi-AZ cluster:

aws redshift create-cluster \
    --port 5439 \
    --master-username master \
    --master-user-password ###### \
    --node-type ra3.4xlarge \
    --number-of-nodes 2 \
    --profile maz-test \
    --endpoint-url https://redshift.eu-west-1.amazonaws.com \
    --region eu-west-1 \
    --cluster-identifier redshift-cluster-1 \
    --multi-az \
    --maintenance-track-name CURRENT \
    --encrypted
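After the cluster is created, you can confirm its Multi-AZ status from the CLI as well. The following is a sketch; the MultiAZ and MultiAZSecondary output fields reflect our reading of the current API and may differ by CLI version:

aws redshift describe-clusters \
    --region eu-west-1 \
    --cluster-identifier redshift-cluster-1 \
    --query 'Clusters[0].{Status:ClusterStatus,MultiAZ:MultiAZ,SecondaryAZ:MultiAZSecondary.AvailabilityZone}'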

Convert a Single-AZ deployment to Multi-AZ deployment

To convert an existing Single-AZ deployment to a Multi-AZ deployment, go to the Amazon Redshift console, select the Redshift cluster that is currently a Single-AZ setup, and on the Actions menu, choose Activate Multi-AZ. Your Single-AZ cluster must be encrypted for a successful conversion to Multi-AZ. During conversion to Multi-AZ, Redshift doubles the total number of nodes, distributing them equally across the two Availability Zones. Redshift doesn't allow you to split the existing number of nodes while converting to Multi-AZ, in order to maintain consistent query performance.

Complete the following steps to convert your Single-AZ deployment to a Multi-AZ deployment:

  1. On the Amazon Redshift console, in the navigation pane, choose Clusters.
  2. Select your cluster and navigate to the cluster details page.
  3. On the Actions menu, choose Activate Multi-AZ.

  4. Review the modification summary and confirm by choosing Activate Multi-AZ.

You can also use the following AWS CLI command to convert a Single-AZ Redshift data warehouse to Multi-AZ:

aws redshift modify-cluster \
    --profile maz-test \
    --endpoint-url https://redshift.eu-west-1.amazonaws.com \
    --region eu-west-1 \
    --cluster-identifier redshift-cluster-1 \
    --multi-az

Convert a Multi-AZ deployment to Single-AZ deployment

Redshift also supports converting a Multi-AZ deployment into a Single-AZ deployment. This option provides the flexibility to switch between deployment types with a few easy steps:

  1. On the Amazon Redshift console, in the navigation pane, choose Clusters.
  2. Select your cluster and navigate to the cluster details page.
  3. On the Actions menu, choose Deactivate Multi-AZ.

  4. Review the modification summary and confirm by choosing Deactivate Multi-AZ.
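You can script the same change: the modify-cluster command accepts a --no-multi-az flag to deactivate Multi-AZ, as in the following sketch:

aws redshift modify-cluster \
    --region eu-west-1 \
    --cluster-identifier redshift-cluster-1 \
    --no-multi-az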

Creating a Multi-AZ data warehouse restored from a snapshot

Existing customers can also create a Multi-AZ deployment by restoring a snapshot from an existing Single-AZ deployment. Complete the following steps:

  1. On the Amazon Redshift console, in the navigation pane, choose Clusters.
  2. Select the cluster and navigate to the cluster details page.
  3. Choose the Maintenance tab.
  4. Select a snapshot and choose Restore snapshot, Restore to provisioned cluster.
  5. Review the Cluster configuration and Cluster details values of the new cluster to be created using the snapshot information.
  6. Select the Multi-AZ option and update the properties of the new cluster, then choose Restore cluster from snapshot at the bottom of the page.

Resizing a Multi-AZ data warehouse

The Redshift Multi-AZ feature also supports resizing a Multi-AZ cluster to change the cluster configuration based on scaling needs. You can change both the number and type of nodes as needed. Complete the following steps:

  1. On the Amazon Redshift console, in the navigation pane, choose Clusters.
  2. Select your cluster and navigate to the cluster details page.
  3. On the Actions menu, choose Resize.
  4. On the resize screen, choose the type and number of nodes, then choose Resize cluster.
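The same resize can be performed from the CLI with the resize-cluster command. The following is a sketch for the example cluster used earlier; check the current documentation for how --number-of-nodes is counted for a Multi-AZ deployment:

aws redshift resize-cluster \
    --cluster-identifier redshift-cluster-1 \
    --node-type ra3.4xlarge \
    --number-of-nodes 4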

Failing over Multi-AZ deployment

In addition to the automatic recovery process, you can also trigger failover manually for your data warehouse using the Failover primary compute option. You can use this approach to manage operational maintenance and other planned procedures as needed for your environment. When the cluster successfully recovers, the Multi-AZ deployment becomes available. Your Multi-AZ deployment also automatically provisions new compute nodes in another Availability Zone as soon as one is available.

Let’s manually trigger a failover of your Redshift Multi-AZ deployment:

  1. On the Amazon Redshift console, choose Clusters in the navigation pane.
  2. Navigate to the cluster details page.
  3. From Actions, choose Failover primary compute.
  4. When prompted, choose Confirm.
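You can also trigger the failover from the CLI with the failover-primary-compute command, for example:

aws redshift failover-primary-compute \
    --cluster-identifier redshift-cluster-1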


After the cluster is back to Available status, you can observe that the primary and secondary Availability Zones have changed.

The following screenshot shows the status before injecting failure.

The following screenshot shows the status after injecting failure.

Restore a table from snapshot

You can restore a single table from a snapshot from your Multi-AZ cluster. When you restore a single table from a snapshot, you specify the source snapshot, database, schema, and table name, and the target database, schema, and a new table name for the restored table.

To restore a table from a snapshot:

  1. On the Amazon Redshift console, in the navigation pane, choose Clusters.
  2. Select your cluster and navigate to the cluster details page.
  3. On the Actions menu, choose Restore table.
  4. Enter the information about which snapshot, source table, and target table to use, and then choose Restore table.
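The equivalent CLI operation is restore-table-from-cluster-snapshot. The following sketch restores a hypothetical public.orders table under a new name; the snapshot identifier is a placeholder:

aws redshift restore-table-from-cluster-snapshot \
    --cluster-identifier redshift-cluster-1 \
    --snapshot-identifier rs:redshift-cluster-1-2023-11-01-00-00-00 \
    --source-database-name dev \
    --source-schema-name public \
    --source-table-name orders \
    --target-database-name dev \
    --target-schema-name public \
    --new-table-name orders_restored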

Enable public connections for your Multi-AZ data warehouse

  1. From the navigation menu, choose CLUSTERS.
  2. Choose the Multi-AZ cluster that you want to modify.
  3. Choose Actions.
  4. Choose Turn on Publicly accessible.
  5. Choose an Elastic IP address. If you don't choose one, an address is assigned automatically.
  6. Choose Save changes.
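If you prefer to manage this setting from the CLI, modify-cluster accepts the corresponding flags; the Elastic IP below is a documentation placeholder:

aws redshift modify-cluster \
    --cluster-identifier redshift-cluster-1 \
    --publicly-accessible \
    --elastic-ip 203.0.113.10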


Monitor queries for Multi-AZ deployments

A Multi-AZ deployment uses compute resources that are deployed in both Availability Zones and can continue operating in the event that the resources in a given Availability Zone are not available. All the compute resources are used at all times, which allows full operation across two Availability Zones in both read and write operations.

You can query the SYS monitoring views in the pg_catalog schema to monitor Multi-AZ query runs. These views cover query run activities and statistics from compute resources in both Availability Zones.

Follow these steps to monitor the query run on Multi-AZ deployment from the Amazon Redshift Console:

  1. On the Amazon Redshift console, connect to the database in your Multi-AZ deployment and run queries through the query editor.
  2. Run any sample query on the Multi-AZ Redshift deployment.
  3. For a Multi-AZ deployment, you can identify a query and the Availability Zone where it ran (primary or secondary) by using the compute_type column in the SYS_QUERY_HISTORY view. The valid values of the compute_type column are as follows:
    1. primary – The query ran in the primary Availability Zone of the Multi-AZ deployment.
    2. secondary – The query ran in the secondary Availability Zone of the Multi-AZ deployment.

The following is a sample query using the compute_type column to monitor a query:

dev=# select compute_type, left(query_text, 50) as query_text from sys_query_history order by start_time desc;

 compute_type | query_text
--------------+----------------------------------------------------
 secondary    | select count(*) from t1;
 primary      | select count(*) from t2;
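To see how your workload is distributed across the two Availability Zones in aggregate, you can group on the same column. The following is a simple sketch that counts queries per compute type over the last hour:

select compute_type, count(*) as query_count
from sys_query_history
where start_time > dateadd(hour, -1, getdate())
group by compute_type;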

You can also access the query history from the console to analyze your query diagnostics.

  1. On the Query monitoring tab, choose Connect to database.
  2. For Connection, choose Create a new connection.
  3. For Authentication, choose Temporary credentials.
  4. For Database name, enter the database name (for example, dev).
  5. For Database user, enter the database user name (for example, awsuser).
  6. Choose Connect.

After you’re connected, under Query Monitoring, on the Query history tab, you can view all the queries and loads, as shown in the following screenshot.

Under Metric filters, you can use the various filters in the Additional filtering options section to view query history based on Time interval, Users, Databases, or SQL commands.

There are a few limitations when working with Amazon Redshift Multi-AZ deployments; refer to the Amazon Redshift documentation for the current list of limitations.

Customer feedback

Janssen Pharmaceuticals, a subsidiary of Johnson & Johnson, researches and manufactures medicines with a focus on the changing needs of patients and the healthcare industry.

“Janssen Pharmaceutical uses Amazon Redshift to enable critical insights that drive important business decisions for our data scientists, data stewards, business users, and external stakeholders. With Amazon Redshift Multi-AZ, we can be confident that our data warehouse will always be available without any disruptions that might delay or impact our ability to make critical business decisions.”

– Shyam Mohapatra, Director of Information Technology – Janssen Pharmaceutical Companies of Johnson & Johnson

Stripe is a technology company that builds economic infrastructure for the internet. Stripe’s products power payments for online and in-person retailers, subscriptions businesses, software platforms and marketplaces, and everything in between.

Millions of companies use Stripe’s software and APIs to accept payments, send payouts, and manage their businesses online. “Access to their Stripe data via leading data warehouses like Amazon Redshift has been a top request from our customers. Our customers needed highly available, secure, fast, and integrated analytics at scale without building complex data pipelines or moving and copying data around. With Stripe Data Pipeline for Amazon Redshift, we’re helping our customers set up a direct and reliable data pipeline in a few clicks.

Stripe Data Pipeline enables our customers to automatically share their complete, up-to-date Stripe data with their Amazon Redshift data warehouse, and take their business analytics and reporting to the next level.”

– Brian Brunner, Senior Manager, Engineering at Stripe

Conclusion

This post demonstrated how to configure an Amazon Redshift Multi-AZ deployment in two Availability Zones and test the fault tolerance of your workloads during an unlikely failure of an Availability Zone. An Amazon Redshift Multi-AZ deployment also helps improve the overall performance of your data warehouse, because compute nodes in both Availability Zones are used for read and write operations. An Amazon Redshift Multi-AZ data warehouse helps meet the demands of customers with mission-critical analytics applications that require the highest levels of availability and resiliency. For more details, refer to Configuring Multi-AZ deployment.


About the Authors

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 16 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with cloud solutions.

Saurav Das is part of the Amazon Redshift Product Management team. He has more than 16 years of experience in working with relational databases technologies and data protection. He has a deep interest in solving customer challenges centered around high availability and disaster recovery.

Anusha Challa is a Senior Analytics Specialist Solutions Architect focused on Amazon Redshift. She has helped many customers build large-scale data warehouse solutions in the cloud and on premises. She is passionate about data analytics and data science.

Nita Shah is an Analytics Specialist Solutions Architect at AWS based out of New York. She has been building data warehouse solutions for over 20 years and specializes in Amazon Redshift. She is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.

Suresh Patnam is a Principal BDM – GTM AI/ML Leader at AWS. He works with customers to build IT strategy, making digital transformation through the cloud more accessible by using data and AI/ML. In his spare time, Suresh enjoys playing tennis and spending time with his family.

Simplify data analysis and collaboration with SQL Notebooks in Amazon Redshift Query Editor V2.0

Post Syndicated from Ranjan Burman original https://aws.amazon.com/blogs/big-data/simplify-data-analysis-and-collaboration-with-sql-notebooks-in-amazon-redshift-query-editor-v2-0/

Amazon Redshift Query Editor V2.0 is a web-based analyst workbench that you can use to author and run queries on your Amazon Redshift data warehouse. You can visualize query results with charts, and explore, share, and collaborate on data with your teams in SQL through a common interface.

With SQL Notebooks, Amazon Redshift Query Editor V2.0 simplifies organizing, documenting, and sharing data analysis with SQL queries. The notebook interface enables users such as data analysts, data scientists, and data engineers to author SQL code more easily by organizing multiple SQL queries and annotations in a single document. You can also collaborate with your team members by sharing notebooks, and visualize query results using charts. SQL Notebooks provide an alternative way to embed all the queries required for a complete data analysis in a single document using SQL cells. Query Editor V2.0 simplifies the development of SQL notebooks with query versioning and export/import features. You can use the built-in version history feature to track changes in your SQL and markdown cells, and with the export/import feature, you can easily move your notebooks from development to production accounts or share them with team members across Regions and accounts.

In this post, we demonstrate how to use SQL Notebooks using Query Editor V2.0 and walk you through some of the new features.

Use cases for SQL Notebooks

Customers want to use SQL notebooks when they need reusable SQL code with multiple SQL statements and annotations or documentation. For example:

  • A data analyst might create temporary tables and run multiple SQL queries in sequence to derive insights, and then perform visual analysis of the results.
  • A data scientist might create a notebook that creates some training data, creates a model, tests the model, and runs sample predictions.
  • A data engineer might have a script to create schema and tables, load sample data, and run test queries.

Solution overview

For this post, we use the Global Database of Events, Language, and Tone (GDELT) dataset, which monitors news across the world, and the data is stored for every second of every day. This information is freely available as part of the Registry of Open Data on AWS.

For our use case, a data scientist wants to perform unsupervised learning with Amazon Redshift ML by creating a machine learning (ML) model, and then generate insights from the dataset, create multiple versions of the notebook, visualize using charts, and share the notebook with other team members.

Prerequisites

To use the SQL Notebooks feature, you must add a policy for SQL Notebooks to a principal—an AWS Identity and Access Management (IAM) user or role—that already has one of the Query Editor V2.0 managed policies. For more information, see Accessing the query editor V2.0.

Import the sample notebook

To import the sample SQL notebook in Query Editor V2.0, complete the following steps:

  1. Download the sample SQL notebook.
  2. On the Amazon Redshift console, choose Query Editor V2 in the navigation pane. Query Editor V2.0 opens in a new browser tab.
  3. To connect to a database, choose the cluster or workgroup name.
  4. If prompted, enter your connection parameters.  For more information about different authentication methods, refer to Connecting to an Amazon Redshift database.
  5. When you’re connected to the database, choose Notebooks in the navigation pane.
  6. Choose Import to use the SQL notebook downloaded in the first step.
    After the notebook is imported successfully, it will be available under My notebooks.
  7. To open the notebook, right-click on the notebook and choose Open notebook, or double-click on the notebook.

Perform data analysis

Let’s explore how you can run different queries from the SQL notebook cells for your data analysis.

  1. Let’s start by creating the table.
  2. Next, we load data into the table using COPY command. Before running the COPY command in the notebook, you need to have a default IAM role attached to your Amazon Redshift cluster, or replace the default keyword with the IAM role ARN attached to the Amazon Redshift cluster:
    COPY gdelt_data FROM 's3://gdelt-open-data/events/1979.csv'
    region 'us-east-1' iam_role 'arn:aws:iam::<account-id>:role/<role-name>' csv delimiter '\t';

    For more information, refer to Creating an IAM role as default in Amazon Redshift.

    Before we create the ML model, let’s examine the training data.

  3. Before you run the cell to create the ML model, replace the <your-amazon-s3-bucket-name> with the S3 bucket of your account to store intermediate results.
  4. Create the ML model (a sketch of this cell appears after these steps).
  5. To check the status of the model, run the notebook cell Show status of the model.  The model is ready when the Model State key value is READY.
  6. Let’s identify the clusters associated with each GlobalEventId.
  7. Let’s get insights into the data points assigned to one of the clusters.
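For reference, the model-creation cell from step 4 is along these lines. This is a sketch of a Redshift ML K-means model; the column list matches the cluster function used later in this post, but the model name, preprocessor setting, number of clusters K, and S3 bucket are assumptions to adapt to your notebook:

CREATE MODEL news_data_clusters        -- model name is an assumption
FROM (
    SELECT AvgTone, EventCode, NumArticles,
           Actor1Geo_Lat, Actor1Geo_Long,
           Actor2Geo_Lat, Actor2Geo_Long
    FROM gdelt_data
)
FUNCTION news_monitoring_cluster       -- function used in the queries below
IAM_ROLE default
AUTO OFF
MODEL_TYPE KMEANS
PREPROCESSORS 'none'                   -- assumption; adjust to your notebook
HYPERPARAMETERS DEFAULT EXCEPT (K '7') -- K is an assumption
SETTINGS (S3_BUCKET '<your-amazon-s3-bucket-name>');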

In the results, we can observe the data points assigned to the clusters. We see clusters of events corresponding to interactions between the US and China (probably due to the establishment of diplomatic relations), between the US and RUS (probably corresponding to the SALT II Treaty), and those involving Iran (probably corresponding to the Iranian Revolution).

To add text and format the appearance to provide context and additional information for your data analysis tasks, you can add a markdown cell. For example, in our sample notebook, we have provided a description about the query in the markdown cells to make it simpler to understand. For more information on markdown cells, refer to Markdown Cells.

To run all the queries in the SQL notebook at once, choose Run all.

Add new SQL and markdown cells

To add new SQL queries or markdown cells, complete the following steps:

  1. After you open the SQL notebook, hover over the cell and choose Insert SQL to add a SQL cell or Insert markdown to add a markdown cell.
  2. The new cell is added before the cell you selected.
  3. You can also move the new cell after a specific cell by choosing the up or down icon.

Visualize notebook results using charts

Now that you can run the SQL notebook cell and get the results, you can display a graphic visualization of the results by using the chart option in Query Editor V2.0.

Let’s run the following query to get more insights into the data points assigned to one of the cluster’s results and visualize using charts.

To visualize the query results, configure a chart on the Results tab. Choose actor2name for the X-axis and totalarticles for the Y-axis dropdown. By default, the graph type is a bar chart.

Charts can be plotted in every cell, and each cell can have multiple result tables, but only one of them can have a chart. For more information about working with charts in Query Editor V2.0, refer to Visualizing query results.

Versioning in SQL Notebooks

Version control enables easier collaboration with your peers and reduces the risks of any mistakes. You can create multiple versions of the same SQL notebook by using the Save version option in Query Editor V2.0.

  1. In the navigation pane, choose Notebooks.
  2. Choose the SQL notebook that you want to open.
  3. Choose the options menu (three dots) and choose Save version.

    SQL Notebooks creates the new version and displays a message that the version has been created successfully.

    Now we can view the version history of the notebook.
  4. Choose the SQL notebook for which you created the version (right-click) and choose Version history.

    You can see a list of all the versions of the SQL notebook.
  5. To revert to a specific version of the notebook, choose the version you want and choose Revert to version.
  6. To create a new notebook from a version, choose the version you want and choose Create a new notebook from the version.

Duplicate the SQL notebook

While working with your peers, you might need to share your notebook, but you also need to continue making changes in your notebook. To avoid any impact with the shared version, you can duplicate the notebook and keep working on your changes in the duplicate copy of the notebook.

  1. In the navigation pane, choose Notebooks.
  2. Open the SQL notebook.
  3. Choose the options menu (three dots) and choose Duplicate.
  4. Provide the duplicate notebook name.
  5. Choose Duplicate.

Share notebooks

You often need to collaborate with other teams, for example to share the queries for integration testing, deploy the queries from dev to the production account, and more. You can achieve this by sharing the notebook with your team.

A team is defined for a set of users who collaborate and share Query Editor V2.0 resources. An administrator can create a team by adding a tag to an IAM role.

Before you start sharing your notebook with your team, make sure that you have the principal tag sqlworkbench-team set to the same value as the rest of your team members in your account. For example, an administrator might set the value to accounting-team for everyone in the accounting department. To create a team and tag, refer to Permissions required to use the query editor v2.0.

To share a SQL notebook with a team in the same account, complete the following steps:

  1. Open the SQL notebook you want to share.
  2. Choose the options menu (three dots) and choose Share with my team.

Notebooks that are shared to the team can be seen on the notebooks panel’s Shared to my team tab, and notebooks that are shared by the user can be seen on the Shared by me tab.

You can also use the export/import feature for other use cases. For example, developers can deploy notebooks from lower environments to production, or customers providing a SaaS solution can share notebooks with their end-users in different accounts or Regions. Complete the following steps to export and import SQL notebooks:

  1. Open the SQL notebook you want to share.
  2. Choose the options menu (three dots) and choose Export. SQL Notebooks saves the notebook to your local desktop as a .ipynb file.
  3. Import the notebook into another account or Region.

Run parameterized queries in a SQL notebook

Database users often need to pass parameters to queries with different values at runtime. You can achieve this in SQL Notebooks by using parameterized queries. A parameter can be defined in the query as ${parameter_name}, and when the query is run, you’re prompted to set a value for the parameter.

Let’s look at the following example, in which we pass the events_cluster parameter.

  1. Insert a SQL cell in the SQL notebook and add the following SQL query:
    select news_monitoring_cluster ( AvgTone, EventCode, NumArticles, Actor1Geo_Lat, Actor1Geo_Long, Actor2Geo_Lat, Actor2Geo_Long ) as events_cluster, eventcode, actor1name, actor2name, sum(numarticles) as totalarticles
    from gdelt_data
    where events_cluster = ${events_cluster}
    and actor1name <> ' ' and actor2name <> ' '
    group by 1,2,3,4
    order by 5 desc;

  2. When prompted, enter the value of the events_cluster parameter (for this post, we set the value to 4).
  3. Choose Run now to run the query.

The following screenshot shows the query results with the events_cluster parameter value set to 4.

Conclusion

In this post, we introduced SQL Notebooks using the Amazon Redshift Query Editor V2.0. We used a sample notebook to demonstrate how it simplifies data analysis tasks for a data scientist and how you can collaborate using notebooks with your team.


About the Authors

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 15 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with the use of cloud solutions.

Erol Murtezaoglu, a Technical Product Manager at AWS, is an inquisitive and enthusiastic thinker with a drive for self-improvement and learning. He has a strong and proven technical background in software development and architecture, balanced with a drive to deliver commercially successful products. Erol highly values the process of understanding customer needs and problems in order to deliver solutions that exceed expectations.

Cansu Aksu is a Frontend Engineer at AWS. She has several years of experience in building user interfaces that simplify complex actions and contribute to a seamless customer experience. In her career in AWS, she has worked on different aspects of web application development, including front end, backend, and application security.

Andrei Marchenko is a Full Stack Software Development Engineer at AWS. He works to bring notebooks to life on all fronts—from the initial requirements to code deployment, from the database design to the end-user experience. He uses a holistic approach to deliver the best experience to customers.

Debu Panda is a Senior Manager, Product Management at AWS. He is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences such as re:Invent, Oracle Open World, and Java One. He is lead author of EJB 3 in Action (Manning Publications, 2007 and 2014) and Middleware Management (Packt, 2009).

Automate Amazon Redshift Serverless data warehouse management using AWS CloudFormation and the AWS CLI

Post Syndicated from Ranjan Burman original https://aws.amazon.com/blogs/big-data/automate-amazon-redshift-serverless-data-warehouse-management-using-aws-cloudformation-and-the-aws-cli/

Amazon Redshift Serverless makes it simple to run and scale analytics without having to manage the instance type, instance size, lifecycle management, pausing, resuming, and so on. It automatically provisions and intelligently scales data warehouse compute capacity to deliver fast performance for even the most demanding and unpredictable workloads, and you pay only for what you use. Just load your data and start querying right away in the Amazon Redshift Query Editor or in your favorite business intelligence (BI) tool and continue to enjoy the best price performance and familiar SQL features in an easy-to-use, zero administration environment.

Redshift Serverless separates compute and storage and introduces two abstractions:

  • Workgroup – A workgroup is a collection of compute resources. It groups together compute resources like Redshift Processing Units (RPUs), VPC subnet groups, and security groups.
  • Namespace – A namespace is a collection of database objects and users. It groups together data objects, such as databases, schemas, tables, users, or AWS Key Management Service (AWS KMS) keys for encrypting data.

Some organizations want to automate the creation of workgroups and namespaces for automated infrastructure management and consistent configuration across environments, and provide end-to-end self-service capabilities. You can automate the workgroup and namespace management operations using the Redshift Serverless API, the AWS Command Line Interface (AWS CLI), or AWS CloudFormation, which we demonstrate in this post.

Solution overview

In the following sections, we discuss the automation approaches for various tasks involved in Redshift Serverless data warehouse management using AWS CloudFormation (for more information, see RedshiftServerless resource type reference) and the AWS CLI (see redshift-serverless).

The following are some of the key use cases and appropriate automation approaches to use with AWS CloudFormation:

  • Enable end-to-end self-service from infrastructure setup to querying
  • Automate data consumer onboarding for data provisioned through AWS Data Exchange
  • Accelerate workload isolation by creating endpoints
  • Create a new data warehouse with consistent configuration across environments

The following are some of the main use cases and approaches for the AWS CLI:

  • Automate maintenance operations:
    • Backup and limits
    • Modify RPU configurations
    • Manage limits
  • Automate migration from provisioned to serverless

Prerequisites

To run the operations described in this post, make sure that your user or role has the AWS Identity and Access Management (IAM) policy arn:aws:iam::aws:policy/AWSCloudFormationFullAccess attached, and either the administrator permission arn:aws:iam::aws:policy/AdministratorAccess or the full Amazon Redshift permission arn:aws:iam::aws:policy/AmazonRedshiftFullAccess policy attached. Refer to Security and connections in Amazon Redshift Serverless for further details.

You should have at least three subnets, and they must span three Availability Zones; three subnets created in the same Availability Zone aren't sufficient. To create a new VPC and subnets, use the following CloudFormation template to deploy in your AWS account.

Create a Redshift Serverless namespace and workgroup using AWS CloudFormation

AWS CloudFormation helps you model and set up your AWS resources so that you can spend less time on infrastructure setup and more time focusing on your applications that run in AWS. You create a template that describes all the AWS resources that you want, and AWS CloudFormation takes care of provisioning and configuring those resources based on the given input parameters.

To create the namespace and workgroup for a Redshift Serverless data warehouse using AWS CloudFormation, complete the following steps:

  1. Choose Launch Stack to launch AWS CloudFormation in your AWS account with a template:
  2. For Stack name, enter a meaningful name for the stack, for example, rsserverless.
  3. Enter the parameters as detailed below:
    • Namespace (no default) – The name of the namespace of your choice to be created.
    • Database Name (default: dev) – The name of the first database in the Redshift Serverless environment.
    • Admin User Name (default: admin) – The administrator’s user name for the Redshift Serverless namespace being created.
    • Admin User Password (no default) – The password associated with the admin user.
    • Associate IAM Role (no default; comma-delimited list of IAM role ARNs) – Associate an IAM role with your Redshift Serverless namespace (optional).
    • Log Export List (default: userlog, connectionlog, useractivitylog) – Provide comma-separated values from the list userlog, connectionlog, useractivitylog. If left blank, LogExport is turned off.
    • Workgroup (no default) – The workgroup name of your choice to be created.
    • Base RPU (default: 128; minimum 32, maximum 512) – The base RPU for the Redshift Serverless workgroup.
    • Publicly accessible (default: false; true or false) – Indicates whether the Redshift Serverless instance is publicly accessible.
    • Subnet Ids (no default) – You must have at least three subnets, and they must span three Availability Zones.
    • Security Group Id (no default) – The list of security group IDs in your VPC.
    • Enhanced VPC Routing (default: false; true or false) – Specifies whether to enable enhanced VPC routing, which forces Redshift Serverless to route traffic through your VPC.
  4. Pass the parameters provided to the AWS::RedshiftServerless::Namespace and AWS::RedshiftServerless::Workgroup resource types:
    Resources:
      RedshiftServerlessNamespace:
        Type: 'AWS::RedshiftServerless::Namespace'
        Properties:
          AdminUsername:
            Ref: AdminUsername
          AdminUserPassword:
            Ref: AdminUserPassword
          DbName:
            Ref: DatabaseName
          NamespaceName:
            Ref: NamespaceName
          IamRoles:
            Ref: IAMRole
          LogExports:
            Ref: LogExportsList        
      RedshiftServerlessWorkgroup:
        Type: 'AWS::RedshiftServerless::Workgroup'
        Properties:
          WorkgroupName:
            Ref: WorkgroupName
          NamespaceName:
            Ref: NamespaceName
          BaseCapacity:
            Ref: BaseRPU
          PubliclyAccessible:
            Ref: PubliclyAccessible
          SubnetIds:
            Ref: SubnetId
          SecurityGroupIds:
            Ref: SecurityGroupIds
          EnhancedVpcRouting:
            Ref: EnhancedVpcRouting        
        DependsOn:
          - RedshiftServerlessNamespace
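Alternatively, you can deploy the same template from the command line. The following is a sketch; the template file name and parameter values are placeholders, and the parameter keys must match those defined in your template:

aws cloudformation deploy \
    --template-file redshift-serverless.yaml \
    --stack-name rsserverless \
    --parameter-overrides \
        NamespaceName=cfn-blog-ns \
        WorkgroupName=cfn-blog-wg \
        AdminUsername=cfn-blog-admin \
        AdminUserPassword='<password>' \
        DatabaseName=cfn-blog-db \
        SubnetId='subnet-1111,subnet-2222,subnet-3333' \
        SecurityGroupIds=sg-0123456789abcdef0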

Perform namespace and workgroup management operations using the AWS CLI

The AWS CLI is a unified tool to manage your AWS services. With just one tool to download and configure, you can control multiple AWS services from the command line and automate them through scripts.

To run the Redshift Serverless CLI commands, you need to install the latest version of AWS CLI. For instructions, refer to Installing or updating the latest version of the AWS CLI.

Now you’re ready to complete the following steps:

Use the following command to create a new namespace:

aws redshift-serverless create-namespace \
    --admin-user-password '<password>' \
    --admin-username cfn-blog-admin \
    --db-name cfn-blog-db \
    --namespace-name 'cfn-blog-ns'

The following screenshot shows an example output.


Use the following command to create a new workgroup mapped to the namespace you just created:

aws redshift-serverless create-workgroup \
    --base-capacity 128 \
    --namespace-name 'cfn-blog-ns' \
    --no-publicly-accessible \
    --security-group-ids "sg-0269bd680e0911ce7" \
    --subnet-ids "subnet-078eedbdd99398568" "subnet-05defe25a59c0e4c2" "subnet-0f378d07e02da3e48"\
    --workgroup-name 'cfn-blog-wg'

The following is an example output.


To allow instances and devices outside the VPC to connect to the workgroup, use the publicly-accessible option in the create-workgroup CLI command.

To verify the workgroup has been created and is in AVAILABLE status, use the following command:

aws redshift-serverless get-workgroup \
--workgroup-name 'cfn-blog-wg' \
--output text \
--query 'workgroup.status'

The following screenshot shows our output.

Regardless of whether your snapshot was made from a provisioned cluster or serverless workgroup, it can be restored into a new serverless workgroup. Restoring a snapshot replaces the namespace and workgroup with the contents of the snapshot.

Use the following command to restore from a snapshot:

aws redshift-serverless restore-from-snapshot \
--namespace-name 'cfn-blog-ns' \
--snapshot-arn arn:aws:redshift:us-east-1:<account-id>:snapshot:<cluster-identifier>/<snapshot-identifier> \
--workgroup-name 'cfn-blog-wg'

The following is an example output.

To check the workgroup status, run the following command:

aws redshift-serverless get-workgroup \
--workgroup-name 'cfn-blog-wg' \
--output text \
--query 'workgroup.status'

To create a snapshot from an existing namespace, run the following command:

aws redshift-serverless create-snapshot \
--namespace-name cfn-blog-ns \
--snapshot-name cfn-blog-snapshot-from-ns \
--retention-period 7

The following is an example output.

Redshift Serverless creates recovery points of your namespace that are available for 24 hours. To keep your recovery point longer than 24 hours, convert it to a snapshot.

To find the recovery points associated to your namespace, run the following command:

aws redshift-serverless list-recovery-points \
--namespace-name cfn-blog-ns \
--no-paginate

The following is an example output with the list of all the recovery points.


Let’s take the latest recoveryPointId from the list and convert it to a snapshot.

To create a snapshot from a recovery point, run the following command:

aws redshift-serverless convert-recovery-point-to-snapshot \
--recovery-point-id f9eaf9ac-a98d-4809-9eee-869ef03e98b4 \
--retention-period 7 \
--snapshot-name cfn-blog-snapshot-from-rp

The following is an example output.


In addition to restoring a snapshot to a serverless namespace, you can also restore from a recovery point.

  1. First, you need to find the recovery point identifier using the list-recovery-points command.
  2. Then use the following command to restore from a recovery point:
aws redshift-serverless restore-from-recovery-point \
--namespace-name cfn-blog-ns \
--recovery-point-id 15c55fb4-d973-4d8a-a8fe-4741e7911137 \
--workgroup-name cfn-blog-wg

The following is an example output.


The base RPU determines the starting capacity for your serverless environment.

Use the following command to modify the base RPU based on your workload requirements:

aws redshift-serverless update-workgroup \
--base-capacity 256 \
--workgroup-name 'cfn-blog-wg'

The following is an example output.

Run the following command to verify the workgroup base RPU capacity has been modified to 256:

aws redshift-serverless get-workgroup \
--workgroup-name 'cfn-blog-wg' \
--output text \
--query 'workgroup.baseCapacity'


To keep costs predictable for Redshift Serverless, you can set the maximum RPU hours used per day, per week, or per month, and choose an action to take when the limit is reached: write a log entry to a system table, receive an alert, or turn off user queries.

Use the following command to first get the workgroup ARN:

aws redshift-serverless get-workgroup --workgroup-name 'cfn-blog-wg' \
--output text \
--query 'workgroup.workgroupArn'

The following screenshot shows our output.

Use the workgroupArn output from the preceding command with the following command to set the daily RPU usage limit and set the action behavior to log:

aws redshift-serverless create-usage-limit \
--amount 256 \
--breach-action log \
--period daily \
--resource-arn arn:aws:redshift-serverless:us-east-1:<aws-account-id>:workgroup/1dcdd402-8aeb-432e-8833-b1f78a112a93 \
--usage-type serverless-compute

The following is an example output.

Conclusion

You have now learned how to automate management operations on Redshift Serverless namespaces and workgroups using AWS CloudFormation and the AWS CLI. To automate creation and management of Amazon Redshift provisioned clusters, refer to Automate Amazon Redshift Cluster management operations using AWS CloudFormation.


About the Authors

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 15 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with the use of cloud solutions.

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 16 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Urvish Shah is a Senior Database Engineer at Amazon Redshift. He has more than a decade of experience working on databases, data warehousing and in analytics space. Outside of work, he enjoys cooking, travelling and spending time with his daughter.

Best practices to optimize your Amazon Redshift and MicroStrategy deployment

Post Syndicated from Ranjan Burman original https://aws.amazon.com/blogs/big-data/best-practices-to-optimize-your-amazon-redshift-and-microstrategy-deployment/

This is a guest blog post co-written by Amit Nayak at MicroStrategy. In their own words, “MicroStrategy is the largest independent publicly traded business intelligence (BI) company, with the leading enterprise analytics platform. Our vision is to enable Intelligence Everywhere. MicroStrategy provides modern analytics on an open, comprehensive enterprise platform used by many of the world’s most admired brands in the Fortune Global 500. Optimized for cloud and on-premises deployments, the platform features HyperIntelligence, a breakthrough technology that overlays actionable enterprise data on popular business applications to help users make smarter, faster decisions.”


Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse. It provides a simple and cost-effective way to analyze all your data using your existing BI tools. Amazon Redshift delivers fast query performance by using columnar storage technology to improve I/O efficiency and parallelizing queries across multiple nodes. Amazon Redshift has custom JDBC and ODBC drivers that you can download from the Connect Client tab on the Amazon Redshift console, allowing you to use a wide range of familiar BI tools.

When using your MicroStrategy application with Amazon Redshift, it’s important to understand how to optimize Amazon Redshift to get the best performance to meet your workload SLAs.

In this post, we look at the best practices for optimized deployment of MicroStrategy using Amazon Redshift.

Optimize Amazon Redshift

In this section, we discuss ways to optimize Amazon Redshift.

Amazon Redshift RA3 instances

RA3 nodes with managed storage help you optimize your data warehouse by scaling and paying for the compute capacity and managed storage independently. With RA3 instances, you can choose the number of nodes based on your performance requirements, and only pay for the managed storage that you use. Size your RA3 cluster based on the amount of data you process daily without increasing your storage costs.

For additional details about RA3 features, see Amazon Redshift RA3 instances with managed storage.

Distribution styles

When you load data into a table, Amazon Redshift distributes the rows of the table to each of the compute nodes according to the table’s distribution style. When you run a query, the query optimizer redistributes the rows to the compute nodes as needed to perform any joins and aggregations. The goal in choosing a table distribution style is to minimize the impact of the redistribution step by locating the data where it needs to be before the query is run.

When you create a table, you can designate one of four distribution styles: AUTO, EVEN, KEY, or ALL. If you don’t specify a distribution style, Amazon Redshift uses AUTO distribution. With AUTO distribution, Amazon Redshift assigns an optimal distribution style based on the size of the table data. You can use automatic table optimization to get started with Amazon Redshift easily or optimize production workloads while decreasing the administrative effort required to get the best possible performance.

MicroStrategy, like any SQL application, transparently takes advantage of the distribution style defined on base tables. MicroStrategy recommends following Amazon Redshift recommended best practices when implementing the physical schema of the base tables.

Sort keys

Defining a table with a sort key results in the physical ordering of data in the Amazon Redshift cluster nodes based on the sort type and the columns chosen in the key definition. Sorting enables efficient handling of range-restricted predicates to scan the minimal number of blocks on disk to satisfy a query. A contrived example would be having an orders table with 5 years of data with a SORTKEY on the order_date column. Now suppose a query on the orders table specifies a date range of 1 month on the order_date column. In this case, you can eliminate up to 98% of the disk blocks from the scan. If the data isn’t sorted, more of the disk blocks (possibly all of them) have to be scanned, resulting in the query running longer.

We recommend creating your tables with SORTKEY AUTO. This way, Amazon Redshift uses automatic table optimization to choose the sort key. If Amazon Redshift determines that applying a SORTKEY improves cluster performance, tables are automatically altered within hours from the time the cluster was created, with minimal impact to queries.

We also recommend using the sort key on columns often used in the WHERE clause of the report queries. Keep in mind that SQL functions (such as data transformation functions) applied to sort key columns in queries reduce the effectiveness of the sort key for those queries. Instead, make sure that you apply the functions to the compared values so that the sort key is used. This is commonly found on DATE columns that are used as sort keys.
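For example, with the orders table described above sorted on order_date, the first query below defeats block pruning by transforming the sort key column, while the second preserves it by applying the logic to the compared values (a sketch):

-- Avoid: the function on the sort key column reduces sort key effectiveness
select count(*) from orders
where date_part(year, order_date) = 2023;

-- Prefer: compare the sort key column directly against computed values
select count(*) from orders
where order_date between '2023-01-01' and '2023-12-31';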

Amazon Redshift Advisor provides recommendations to help you improve the performance and decrease the operating costs for your Amazon Redshift cluster. The Advisor analyzes your cluster’s workload to identify the most appropriate distribution key and sort key based on the query patterns of your cluster.

Compression

Compression settings can also play a big role when it comes to query performance in Amazon Redshift. Compression conserves storage space and reduces the size of data that is read from storage, which reduces the amount of disk I/O and therefore improves query performance.

By default, Amazon Redshift automatically manages compression encoding for all columns in a table. You can specify the ENCODE AUTO option for the table to enable Amazon Redshift to automatically manage compression encoding for all columns in a table. You can alternatively apply a specific compression type to the columns in a table manually when you create the table, or you can use the COPY command to analyze and apply compression automatically.
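As a minimal sketch, the following hypothetical table definition delegates the encoding decision (along with distribution and sort) to Amazon Redshift:

create table orders (
    order_id   bigint,
    order_date date,
    amount     decimal(12,2)
)
diststyle auto
sortkey auto
encode auto;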

We don’t recommend compressing the first column in a compound sort key because it might result in scanning more rows than expected.

Amazon Redshift materialized views

Materialized views can significantly boost query performance for repeated and predictable analytical workloads such as dashboarding, queries from BI tools, and extract, load, and transform (ELT) data processing.

Materialized views are especially useful for queries that are predictable and repeated over and over. Instead of performing resource-intensive queries on large tables, applications can query the pre-computed data stored in the materialized view.

For example, consider the scenario where a set of queries is used to populate a collection of charts for a dashboard. This use case is ideal for a materialized view, because the queries are predictable and repeated over and over again. Whenever a change occurs in the base tables (data is inserted, deleted, or updated), the materialized views can be automatically or manually refreshed to represent the current data.

Amazon Redshift can automatically refresh materialized views with up-to-date data from its base tables when materialized views are created with or altered to have the auto-refresh option. Amazon Redshift auto-refreshes materialized views as soon as possible after base tables changes.

To update the data in a materialized view manually, you can use the REFRESH MATERIALIZED VIEW statement at any time. There are two strategies for refreshing a materialized view:

  • Incremental refresh – In an incremental refresh, it identifies the changes to the data in the base tables since the last refresh and updates the data in the materialized view
  • Full refresh – If incremental refresh isn’t possible, Amazon Redshift performs a full refresh, which reruns the underlying SQL statement, replacing all the data in the materialized view

Amazon Redshift automatically chooses the refresh method for a materialized view depending on the SELECT query used to define the materialized view. For information about the limitations for incremental refresh, see Limitations for incremental refresh.
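For example, the following sketch (the view definition is illustrative) creates a materialized view with auto-refresh enabled, and also refreshes it manually:

create materialized view mv_daily_sales
auto refresh yes
as
select ss_store_sk,
       sum(ss_sales_price) as total_sales
from store_sales
group by ss_store_sk;

-- Refresh manually at any time
refresh materialized view mv_daily_sales;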

The following are some of the key advantages of using materialized views:

  • You can speed up queries by pre-computing the results of complex queries, including multiple base tables, predicates, joins, and aggregates
  • You can simplify and accelerate ETL and BI pipelines
  • Materialized views support Amazon Redshift local, Amazon Redshift Spectrum, and federated queries
  • Amazon Redshift can automatically rewrite eligible queries to use existing materialized views, even when the queries don’t reference them explicitly

For example, let’s consider a sales team that wants to build a report showing product sales across different stores. The following dashboard query runs against a 3 TB Cloud DW benchmark dataset derived from the TPC-DS benchmark.

In this first step, you create a regular view. See the following code:

create view vw_product_sales
as
select 
	i_brand,
	i_category,
	d_year,
	d_quarter_name,
	s_store_name,
	sum(ss_sales_price) as total_sales_price,
	sum(ss_net_profit) as total_net_profit,
	sum(ss_quantity) as total_quantity
from
store_sales ss, item i, date_dim d, store s
where ss.ss_item_sk=i.i_item_sk
and ss.ss_store_sk = s.s_store_sk
and ss.ss_sold_date_sk=d.d_date_sk
and d_year = 2000
group by i_brand,
	i_category,
	d_year,
	d_quarter_name,
	s_store_name;

The following code is a report to analyze the product sales by category:

SELECT 
	i_category,
	d_year,
	d_quarter_name,
    sum(total_quantity) as total_quantity
FROM vw_product_sales
GROUP BY 
i_category,
	d_year,
	d_quarter_name
ORDER BY 3 desc;

The preceding reports take approximately 15 seconds to run. As more products are sold, this elapsed time gradually gets longer. To speed up those reports, you can create a materialized view to precompute the total sales per category. See the following code:

create materialized view mv_product_sales
as
select 
	i_brand,
	i_category,
	d_year,
	d_quarter_name,
	s_store_name,
	sum(ss_sales_price) as total_sales_price,
	sum(ss_net_profit) as total_net_profit,
	sum(ss_quantity) as total_quantity
from 
store_sales ss, item i, date_dim d, store s
where ss.ss_item_sk=i.i_item_sk
and ss.ss_store_sk = s.s_store_sk
and ss.ss_sold_date_sk=d.d_date_sk
and d_year = 2000
group by i_brand,
	i_category,
	d_year,
	d_quarter_name,
	s_store_name;

The following code analyzes the product sales by category against the materialized view:

SELECT 
	i_category,
	d_year,
	d_quarter_name,
    sum(total_quantity) as total_quantity
FROM mv_product_sales
GROUP BY 
i_category,
	d_year,
	d_quarter_name
ORDER BY 3 desc;

The same reports against a materialized view took around 4 seconds because the new queries access precomputed joins, filters, grouping, and partial sums instead of the multiple, larger base tables.

Workload management

Amazon Redshift workload management (WLM) enables you to flexibly manage priorities within workloads so that short, fast-running queries don’t get stuck in queues behind long-running queries. You can use WLM to define multiple query queues and route queries to the appropriate queues at runtime.

You can configure WLM in two modes:

  • Automatic WLM – Amazon Redshift manages the resources required to run queries. Amazon Redshift determines how many queries run concurrently and how much memory is allocated to each dispatched query, using machine learning algorithms to make these decisions. Query priority is a feature of automatic WLM that lets you assign priority ranks to different user groups or query groups, to ensure that higher-priority workloads get more resources for consistent query performance, even during busy times. For example, consider a critical dashboard report query that is more important than an ETL job: you can assign the highest priority to the report query and high priority to the ETL query. No queries are ever starved of resources, and lower-priority queries always complete, but may take longer to do so.
  • Manual WLM – With manual WLM, you manage system performance by modifying the WLM configuration to create separate queues for long-running queries and short-running queries. You can define up to eight queues to separate workloads from each other. Each queue contains a number of query slots, and each queue is associated with a portion of the available memory.

You can also use the Amazon Redshift query monitoring rules (QMR) feature to set metrics-based performance boundaries for workload management (WLM) queues, and specify what action to take when a query goes beyond those boundaries. For example, for a queue that’s dedicated to short-running queries, you might create a rule that cancels queries that run for more than 60 seconds. To track poorly designed queries, you might have another rule that logs queries that contain nested loops. You can use predefined rule templates in Amazon Redshift to get started with QMR.

We recommend the following configuration for WLM:

  • Enable automatic WLM
  • Enable concurrency scaling to handle an increase in concurrent read queries, with consistent fast query performance
  • Create QMR rules to track and handle poorly written queries

After you create and configure different WLM queues, you can use a MicroStrategy query label to set the Amazon Redshift query group for queue assignment. This tells Amazon Redshift which WLM queue to send the query to.

You can set the following as a report pre-statement in MicroStrategy:

set query_group to 'mstr_dashboard';

You can use MicroStrategy query labels to identify the MicroStrategy submitted SQL statements within Amazon Redshift system tables.

Query labels work with all SQL statement types, so we recommend using them for multi-pass SQL reports. When the label of a query is stored in the system view stl_query, it’s truncated to 15 characters (30 characters are stored in all other system tables). For this reason, be thoughtful when choosing the value for the query label.
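For example, the following query (a sketch) locates MicroStrategy-submitted statements in the system tables by their query label:

select query, label, starttime, endtime
from stl_query
where label like 'MSTR%'
order by starttime desc;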

You can set the following as a report pre-statement:

set query_group to 'MSTR=!o;Project=!p;User=!u;Job=!j;'

This collects information on the server side about variables like project name, report name, user, and more.

To clean up the query group and release resources, use the cleanup post-statement:

reset query_group;

MicroStrategy allows the use of wildcards that are replaced by values retrieved at a report’s run time, as shown in the pre- and post-statements. The following table provides an example of pre- and post-statements.

VLDB Category | VLDB Property | Setting Value Example
Pre/Post Statements | Report Pre-statement | set query_group to 'MSTR=!o;Project=!p;User=!u;Job=!j;'
Pre/Post Statements | Cleanup Post-statement | reset query_group;

For example, setting the report pre-statement to set query_group to 'MSTRReport=!o;' produces the following statement at report runtime:

set query_group to 'MSTRReport=Cost, Price, and Profit per Unit;'

Query prioritization in MicroStrategy

In general, you may have multiple applications submitting queries to Amazon Redshift in addition to MicroStrategy. You can use Amazon Redshift query groups to identify MicroStrategy submitted SQL to Amazon Redshift, along with its assignment to the appropriate Amazon Redshift WLM queue.

The Amazon Redshift query group for a MicroStrategy report is set and reset through the use of the following report-level MicroStrategy VLDB properties.

VLDB Category | VLDB Property | Setting Value Example
Pre/Post Statements | Report Pre-statement | set query_group to 'MSTR_High=!o;'
Pre/Post Statements | Cleanup Post-statement | reset query_group;

A MicroStrategy report job can submit one or more queries to Amazon Redshift. In such cases, all queries for a MicroStrategy report are labeled with the same query group and are therefore assigned to the same queue in Amazon Redshift.

The following is an example implementation of MicroStrategy Amazon Redshift WLM:

  • High-priority MicroStrategy reports are set with report pre-statement MSTR_HIGH=!o;, medium priority reports with MSTR_MEDIUM=!o;, and low priority reports with MSTR_LOW=!o;.
  • Amazon Redshift WLM queues are created and associated with corresponding query groups. For example, the MSTR_HIGH_QUEUE queue is associated with the MSTR_HIGH=*; query group (where * is an Amazon Redshift wildcard).

Concurrency scaling

With concurrency scaling, you can configure Amazon Redshift to handle spikes in workloads while maintaining consistent SLAs by elastically scaling the underlying resources as needed. When concurrency scaling is enabled, Amazon Redshift continuously monitors the designated workload. If the queries start to get backlogged because of bursts of user activity, Amazon Redshift automatically adds transient cluster capacity and routes the requests to these new clusters. You manage which queries are sent to the concurrency scaling cluster by configuring the WLM queues. This happens transparently in a matter of seconds, so your queries continue to be served with low latency. In addition, every 24 hours that the Amazon Redshift main cluster is in use, you accrue a 1-hour credit towards using concurrency scaling. This enables 97% of Amazon Redshift customers to benefit from concurrency scaling at no additional charge.

For more details on concurrency scaling pricing, see Amazon Redshift pricing.

Amazon Redshift removes the additional transient capacity automatically when activity on the cluster decreases. You can enable concurrency scaling for the MicroStrategy report queue so that, under heavy load on the cluster, those queries run on a concurrency scaling cluster, improving overall dashboard performance and maintaining a consistent user experience.

To make concurrency scaling work with MicroStrategy, use derived tables instead of temporary tables, which you can do by setting the VLDB property Intermediate table type to Derived table.

In the following example, we enable concurrency scaling on the Amazon Redshift cluster for the MicroStrategy dashboard queries. We create a user group in Amazon Redshift, and all the dashboard queries are allocated to this user group’s queue. With concurrency scaling in place for the report queries, we can see a significant reduction in query wait time.
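To verify which queries were routed to the concurrency scaling cluster, you can check the concurrency_scaling_status column of the STL_QUERY system table (a value of 1 indicates the query ran on a concurrency scaling cluster); the following is a sketch:

select concurrency_scaling_status,
       count(*) as query_count
from stl_query
where starttime > dateadd(hour, -1, getdate())
group by concurrency_scaling_status;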

For this example, we created one WLM queue to run our dashboard queries with highest priority and another ETL queue with high priority. Concurrency scaling is turned on for the dashboard queue, as shown in the following screenshot.

As part of this test, we ran several queries in parallel on the cluster, some of which are ETL jobs (insert, delete, update, and copy), and some are complex select queries, such as dashboard queries. The following graph illustrates how many queries are waiting in the WLM queues and how concurrency scaling helps to address those queries.

In the preceding graph, several queries are waiting in the WLM queues; concurrency scaling automatically starts in seconds to process queries without any delays, as shown in the following graph.

This example has demonstrated how concurrency scaling helps handle spikes in user workloads by adding transient clusters as needed to provide consistent performance even as the workload grows to hundreds of concurrent queries.

Amazon Redshift federated queries

Customers using MicroStrategy often connect various relational data sources to a single MicroStrategy project for reporting and analysis purposes. For example, you might integrate an operational (OLTP) data source (such as Amazon Aurora PostgreSQL) and data warehouse data to get meaningful insights into your business.

With federated queries in Amazon Redshift, you can query and analyze data across operational databases, data warehouses, and data lakes. The federated query feature allows you to integrate queries from Amazon Redshift on live data in external databases with queries across your Amazon Redshift and Amazon Simple Storage Service (Amazon S3) environments.

Federated queries help incorporate live data as part of your MicroStrategy reporting and analysis, without the need to connect to multiple relational data sources from MicroStrategy.

You can also use federated queries with MySQL-compatible databases, such as Amazon Aurora MySQL and Amazon RDS for MySQL.

This simplifies the multi-source reports use case by having the ability to run queries on both operational and analytical data sources, without the need to explicitly connect and import data from different data sources within MicroStrategy.
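For example, the following sketch sets up an external schema over an Aurora PostgreSQL database and joins live operational data with warehouse data; the endpoint, ARNs, and the apg_sales.orders table are placeholders:

create external schema apg_sales
from postgres
database 'ordersdb' schema 'public'
uri 'aurora-cluster.example.us-east-1.rds.amazonaws.com' port 5432
iam_role 'arn:aws:iam::123456789012:role/RedshiftFederatedRole'
secret_arn 'arn:aws:secretsmanager:us-east-1:123456789012:secret:apg-creds';

-- Join live operational orders with a local warehouse dimension table
select d.d_year,
       count(*) as order_count
from apg_sales.orders o
join date_dim d on o.order_date = d.d_date
group by d.d_year;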

Redshift Spectrum

The MicroStrategy Amazon Redshift connector includes support for Redshift Spectrum, so you can connect directly to query data in Amazon Redshift and analyze it in conjunction with data in Amazon S3.

With Redshift Spectrum, you can efficiently query and retrieve structured and semi-structured data (such as Parquet, JSON, and CSV) from files in Amazon S3 without having to load the data into Amazon Redshift tables. It allows customers with large datasets stored in Amazon S3 to query that data from within the Amazon Redshift cluster using Amazon Redshift SQL queries with no data movement; you pay only for the data scanned. Redshift Spectrum also allows multiple Amazon Redshift clusters to concurrently query the same dataset in Amazon S3 without the need to make copies of the data for each cluster. Based on the demands of the queries, Redshift Spectrum can intelligently scale out to take advantage of massively parallel processing.
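For example, the following sketch registers an external schema against the AWS Glue Data Catalog and queries Parquet files in Amazon S3 alongside a local table; the role ARN, database, and external table names are placeholders:

create external schema spectrum_sales
from data catalog
database 'spectrumdb'
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'
create external database if not exists;

-- Query S3-resident data without loading it into Amazon Redshift
select i.i_category,
       sum(e.ss_sales_price) as total_sales
from spectrum_sales.store_sales_ext e
join item i on e.ss_item_sk = i.i_item_sk
group by i.i_category;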

Use cases that might benefit from using Redshift Spectrum include:

  • A large volume of less-frequently accessed data
  • Heavy scan-intensive and aggregation-intensive queries
  • Selective queries that can use partition pruning and predicate pushdown, so the output is fairly small

Redshift Spectrum gives you the freedom to store your data where you want, in the format you want, and have it available for processing when you need it.

With Redshift Spectrum, you take advantage of a fast, cost-effective engine that minimizes data processed with dynamic partition pruning. You can further improve query performance by reducing the amount of data scanned, for example by partitioning and compressing data and by using a columnar format for storage.

For more details on how to optimize Redshift Spectrum query performance and cost, see Best Practices for Amazon Redshift Spectrum.

Optimize MicroStrategy

In this section, we discuss ways to optimize MicroStrategy.

SQL optimizations

With MicroStrategy 2021, MicroStrategy delivered support for 70 new advanced customizable functions, enhancing usability and capability compared to the previously existing Apply functions. Application architects can customize these functions and make them available to regular users, such as business analysts. For more information on how to use these new customizable functions, visit the MicroStrategy community site.

SQL Global Optimization

This setting can substantially reduce the number of SQL passes generated by MicroStrategy. In MicroStrategy, SQL Global Optimization reduces the total number of SQL passes with the following optimizations:

  • Eliminates unused SQL passes – For example, a temp table is created but not referenced in a later pass
  • Reuses redundant SQL passes – For example, when the exact same temp table would otherwise be created multiple times, a single temp table is created and reused
  • Combines SQL passes where the SELECT list is different – For example, two temp tables that have the same FROM clause, joins, WHERE clause, and GROUP BY are combined into a single SELECT statement
  • Combines SQL passes where the WHERE clause is different – For example, for two temp tables that have the same SELECT list, FROM clause, joins, and GROUP BY, the predicates from the WHERE clause are moved into CASE statements in the SELECT list

The default setting for Amazon Redshift is to enable SQL Global Optimization at its highest level. If your database instance is configured as an earlier version of Amazon Redshift, you may have to enable this setting manually. For more information, see the MicroStrategy System Administration Guide.

Set Operator Optimization

This setting is used to combine multiple subqueries into a single subquery using set operators (such as UNION, INTERSECT, and EXCEPT). The default setting for Amazon Redshift is to enable Set Operator Optimization.
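As an illustration, two filter subqueries that would otherwise run as separate passes can be combined with a set operator along these lines (a sketch using tables from the subquery example later in this post):

select store_nbr
from   items
where  class_nbr = 1
intersect
select store_nbr
from   items
where  class_nbr = 2;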

SQL query generation

The MicroStrategy query engine is able to combine multiple passes of SQL that access the same table (typically the main fact table). This can improve performance by eliminating multiple table scans of large tables. For example, this feature significantly reduces the number of SQL passes required to process datasets with custom groups.

Technically, the WHERE clauses of different passes are resolved in CASE statements of a single SELECT clause, which doesn’t contain qualifications in the WHERE clause. Generally, this elimination of WHERE clauses causes a full table scan on a large table.
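As an illustrative sketch, two passes that differ only in their WHERE clause (for example, d_year = 1999 vs. d_year = 2000) can be merged into a single scan along these lines:

select sum(case when d.d_year = 1999 then ss.ss_sales_price end) as sales_1999,
       sum(case when d.d_year = 2000 then ss.ss_sales_price end) as sales_2000
from   store_sales ss
       join date_dim d
         on ss.ss_sold_date_sk = d.d_date_sk;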

In some cases (on a report-by-report basis), this approach can be slower than many highly qualified SELECT statements. Because any performance difference between approaches is mostly impacted by the reporting requirement and implementation in the MicroStrategy application, it’s necessary to test both options for each dataset to identify the optimal case.

The default behavior is to merge all passes with different WHERE clauses (level 4). We recommend testing any option for this setting, but most commonly the biggest performance improvements (if any) are observed by switching to the option Level 2: Merge Passes with Different SELECT.

VLDB Category | VLDB Property | Setting Value
Query Optimizations | SQL Global Optimization | Level 2: Merge Passes with Different SELECT

SQL size

As we explained earlier, MicroStrategy tries to submit a single query statement containing the analytics of multiple passes in the derived table syntax. This can lead to sizeable SQL query syntax. It’s possible for such a statement to exceed the capabilities of the driver or database. For this reason, MicroStrategy governs the size of generated queries and throws an error message if this is exceeded. Starting with MicroStrategy 10.9, this value is tuned to current Amazon Redshift capabilities (16 MB). Earlier versions specify a smaller limit that can be modified using the following VLDB setting on the Amazon Redshift DB instance in Developer.

VLDB Category | VLDB Property | Setting Value
Governing | SQL Size/MDX Size | 16777216

Subquery type

There are many cases in which the SQL engine generates subqueries (query blocks in the WHERE clause):

  • Reports that use relationship filters
  • Reports that use NOT IN set qualification, such as AND NOT
  • Reports that use attribute qualification with M-M relationships; for example, showing revenue by category and filtering on catalog
  • Reports that raise the level of a filter; for example, dimensional metric at Region level, but qualify on store
  • Reports that use non-aggregatable metrics, such as inventory metrics
  • Reports that use dimensional extensions
  • Reports that use attribute-to-attribute comparison in the filter

The default setting for subquery type for Amazon Redshift is Where EXISTS(select (col1, col2…)):

create table T00001 (
       year_id NUMERIC(10, 0),
       W000001 DOUBLE PRECISION)
DISTSTYLE EVEN;

insert into T00001
select a12.year_id  year_id,
       sum(a11.tot_sls_dlr)  W000001
from   items2 a11
       join   dates    a12
         on   (a11.cur_trn_dt = a12.cur_trn_dt)
where ((exists (select r11.store_nbr
       from   items r11
       where  r11.class_nbr = 1
        and   r11.store_nbr = a11.store_nbr))
 and a12.year_id > 1993)
group by a12.year_id;

Some reports may perform better with the option of using a temporary table and falling back to IN for a correlated subquery. Reports that include a filter with an AND NOT set qualification (such as AND NOT relationship filter) will likely benefit from using temp tables to resolve the subquery. However, such reports will probably benefit more from using the Set Operator Optimization option discussed earlier. The other settings are not likely to be advantageous with Amazon Redshift.

VLDB Category | VLDB Property | Setting Value
Query Optimizations | Subquery Type | Use temporary table, falling back to IN for correlated subquery

Full outer join support

Full outer join support is enabled in the Amazon Redshift object by default. Levels at which you can set this are database instance, report, and template.

For example, the following query shows the use of full outer join with the states_dates and regions tables:

select pa0.region_id W000000,
       pa2.month_id W000001,
       sum(pa1.tot_dollar_sales) Column1
from   states_dates pa1
       full outer join       regions     pa0
         on (pa1.region_id = pa0.region_id)
       cross join    LU_MONTH      pa2
group by      pa0.region_id, pa2.month_id;

DISTINCT or GROUP BY option (for no aggregation and no table key)

If no aggregation is needed and the attribute defined on the table isn’t a primary key, this property tells the SQL engine whether to use SELECT DISTINCT, GROUP BY, or neither.

Possible values for this setting include:

  • Use DISTINCT
  • No DISTINCT, no GROUP BY
  • Use GROUP BY

The DISTINCT or GROUP BY option property controls the generation of DISTINCT or GROUP BY in the SELECT SQL statement. The SQL engine doesn’t consider this property if it can make the decision based on its own knowledge. Specifically, the SQL engine ignores this property in the following situations:

  • If there is aggregation, the SQL engine uses GROUP BY, not DISTINCT
  • If there is no attribute (only metrics), the SQL engine doesn’t use DISTINCT
  • If there is COUNT (DISTINCT …) and the database doesn’t support it, the SQL engine performs a SELECT DISTINCT pass and then a COUNT(*) pass
  • If for certain selected column data types, the database doesn’t allow DISTINCT or GROUP BY, the SQL engine doesn’t do it
  • If the SELECT level is the same as the table key level and the table’s true key property is selected, the SQL engine doesn’t issue a DISTINCT

When none of the preceding conditions are met, the SQL engine uses the DISTINCT or GROUP BY property.
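For illustration, when a non-key attribute is selected without aggregation, the property determines which of the following equivalent forms is generated (using the states_dates table from the earlier example):

-- Use DISTINCT
select distinct region_id from states_dates;

-- Use GROUP BY
select region_id from states_dates group by region_id;

-- No DISTINCT, no GROUP BY (may return duplicate rows)
select region_id from states_dates;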

Use the latest Amazon Redshift drivers

For running MicroStrategy reports on Amazon Redshift, we encourage upgrading whenever new versions of the Amazon Redshift drivers are available. Running an application on the latest driver provides better performance, bug fixes, and improved security. To get the latest driver version for your operating system, see Drivers and Connectors.

Conclusion

In this post, we discussed various Amazon Redshift cluster optimizations, data model optimizations, and SQL optimizations within MicroStrategy for optimizing your Amazon Redshift and MicroStrategy deployment.


About the Authors

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 13 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with the use of cloud solutions.

Nita Shah is an Analytics Specialist Solutions Architect at AWS based out of New York. She has been building data warehouse solutions for over 20 years and specializes in Amazon Redshift. She is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.

Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS with over 20 years of experience working with database and analytics products from enterprise database vendors and cloud providers. He has helped large technology companies design data analytics solutions and has led engineering teams in designing and implementing data analytics platforms and data products.

Amit Nayak is responsible for driving the Gateways roadmap at MicroStrategy, focusing on relational and big data databases, as well as authentication. Amit joined MicroStrategy after completing his master’s in Business Analytics at George Washington University and has maintained an oversight of the company’s gateways portfolio for the 3+ years he has been with the company.