Tag Archives: Amazon Redshift

Cybersecurity Awareness Month: Learn about the job zero of securing your data using Amazon Redshift

Post Syndicated from Thiyagarajan Arumugam original https://aws.amazon.com/blogs/big-data/cybersecurity-awareness-month-learn-about-the-job-zero-of-securing-your-data-using-amazon-redshift/

Amazon Redshift is the most widely used cloud data warehouse. It allows you to run complex analytic queries against terabytes to petabytes of structured and semi-structured data, using sophisticated query optimization, columnar storage on high-performance disk, and massively parallel query execution.

At AWS, we embrace the culture that security is job zero, by which we mean it’s even more important than any number one priority. AWS provides comprehensive security capabilities to satisfy the most demanding requirements, and Amazon Redshift provides data security out of the box at no extra cost. Amazon Redshift uses the AWS security frameworks to implement industry-leading security in the areas of authentication, access control, auditing, logging, compliance, data protection, and network security.

Cybersecurity Awareness Month raises awareness about the importance of cybersecurity, ensuring everyone has the resources they need to be safer and more secure online. This post highlights some of the key out-of-the-box capabilities available in Amazon Redshift to manage your data securely.

Authentication

Amazon Redshift supports industry-leading security with built-in AWS Identity and Access Management (IAM) integration, identity federation for single sign-on (SSO), and multi-factor authentication. You can easily federate database user authentication with Amazon Redshift using IAM and a third-party SAML 2.0 identity provider (IdP), such as AD FS, PingFederate, or Okta.
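
For example, once a user is authenticated through your IdP, an application can request temporary, auto-expiring database credentials instead of storing database passwords. The following is a minimal sketch using the AWS SDK for Python (Boto3); the cluster identifier, database, and user names are placeholders, not values from this post.

import boto3

redshift = boto3.client("redshift")  # uses the calling IAM identity's credentials

# Request short-lived database credentials for an IAM-authenticated user.
# DbUser, DbName, and ClusterIdentifier below are placeholder values.
response = redshift.get_cluster_credentials(
    DbUser="analyst1",
    DbName="dev",
    ClusterIdentifier="my-redshift-cluster",
    DurationSeconds=900,   # credentials expire automatically
    AutoCreate=False       # set True to create the database user if it doesn't exist
)

print(response["DbUser"])       # temporary user name (prefixed with IAM:)
print(response["Expiration"])   # when the temporary password expires
# response["DbPassword"] is then passed to your SQL client or driver to connect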

To get started, see the following posts:

Access control

Granular row-level and column-level security controls ensure users see only the data they should have access to. You can achieve column-level access control for data in Amazon Redshift by using the column-level GRANT and REVOKE statements, without having to implement view-based access control or use another system. Amazon Redshift is also integrated with AWS Lake Formation, which ensures that the column-level and row-level access defined in Lake Formation is enforced for Amazon Redshift queries on the data in the data lake.
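
As an illustration, the following sketch issues a column-level GRANT through the Amazon Redshift Data API using Boto3. The cluster, database, user, table, and column names are hypothetical, not from this post.

import boto3

data_api = boto3.client("redshift-data")

# Allow a reporting user to read only non-sensitive columns of the customer table.
# All identifiers below are placeholders.
resp = data_api.execute_statement(
    ClusterIdentifier="my-redshift-cluster",
    Database="dev",
    DbUser="admin_user",
    Sql="GRANT SELECT (customer_id, name) ON customer TO reporting_user;"
)

# The Data API is asynchronous; poll describe_statement to confirm the grant finished.
status = data_api.describe_statement(Id=resp["Id"])["Status"]
print(status)  # for example SUBMITTED, STARTED, FINISHED, or FAILED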

The following guides can help you implement fine-grained access control on Amazon Redshift:

Auditing and logging

To monitor Amazon Redshift for suspicious activities, you can take advantage of the auditing and logging features. Amazon Redshift logs information about connections and user activities, which can be uploaded to Amazon Simple Storage Service (Amazon S3) if you enable the audit logging feature. API calls to Amazon Redshift are logged to AWS CloudTrail, and you can create a log trail by configuring CloudTrail to deliver logs to Amazon S3. For more details, see Database audit logging, Analyze logs using Amazon Redshift Spectrum, Querying AWS CloudTrail Logs, and System object persistence utility.
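
For example, you can turn on audit logging and check its status with Boto3. The bucket and cluster names below are placeholders, and the bucket must already grant Amazon Redshift permission to write logs.

import boto3

redshift = boto3.client("redshift")

# Deliver connection and user activity logs to an S3 bucket (placeholder names).
redshift.enable_logging(
    ClusterIdentifier="my-redshift-cluster",
    BucketName="my-redshift-audit-logs",
    S3KeyPrefix="audit/"
)

# Verify that audit logging is active for the cluster.
status = redshift.describe_logging_status(ClusterIdentifier="my-redshift-cluster")
print(status["LoggingEnabled"], status.get("BucketName"))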

Compliance

Amazon Redshift is assessed by third-party auditors for compliance with multiple programs. If your use of Amazon Redshift is subject to compliance with standards like HIPAA, PCI, or FedRAMP, you can find more details at Compliance validation for Amazon Redshift.

Data protection

To protect data both at rest and in transit, Amazon Redshift provides options to encrypt the data. Although the encryption settings are optional, we highly recommend enabling them. When you enable encryption at rest for your cluster, it encrypts both the data blocks and the metadata, and there are multiple ways to manage the encryption key (see Amazon Redshift database encryption). To safeguard your data while it’s in transit from your SQL clients to the Amazon Redshift cluster, we highly recommend configuring the security options as described in Configuring security options for connections.
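
As a rough sketch with Boto3 (cluster, key, and parameter group names are placeholders), you can turn on encryption at rest for an existing cluster and require SSL for client connections through the cluster’s parameter group:

import boto3

redshift = boto3.client("redshift")

# Encrypt an existing cluster at rest with an AWS KMS key (placeholder identifiers).
redshift.modify_cluster(
    ClusterIdentifier="my-redshift-cluster",
    Encrypted=True,
    KmsKeyId="arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"
)

# Require SSL for in-transit encryption via the cluster's parameter group.
redshift.modify_cluster_parameter_group(
    ParameterGroupName="my-cluster-parameter-group",
    Parameters=[{"ParameterName": "require_ssl", "ParameterValue": "true"}]
)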

For additional data protection options, see the following resources:

Amazon Redshift also enables you to use an AWS Lambda function as a user-defined function (UDF). You can write Lambda UDFs to enable external tokenization of data and dynamic data masking, as illustrated in Amazon Redshift – Dynamic Data Masking.
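
The following is a minimal sketch of a Python Lambda handler that follows the Lambda UDF request/response contract (a batch of argument rows in, one result per row out). Here it masks everything except the last four characters of a single string argument; the masking rule itself is only an example, not the method from the linked post.

import json

def lambda_handler(event, context):
    # Amazon Redshift invokes the function with a batch of rows in event["arguments"];
    # each row is a list holding the UDF's arguments (one string argument in this sketch).
    try:
        results = []
        for row in event["arguments"]:
            value = row[0]
            # Example masking rule: keep only the last 4 characters.
            results.append(None if value is None else "*" * max(len(value) - 4, 0) + value[-4:])
        return json.dumps({"success": True, "results": results})
    except Exception as e:
        return json.dumps({"success": False, "error_msg": str(e)})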

Network security

Amazon Redshift is a service that runs within your VPC. There are multiple configurations to ensure access to your Amazon Redshift cluster is secured, whether the connection is from an application within your VPC or an on-premises system. For more information, see VPCs and subnets. Amazon Redshift support for AWS PrivateLink ensures that all API calls from your VPC to Amazon Redshift stay within the AWS network. For more information, see Connecting to Amazon Redshift using an interface VPC endpoint.
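
For instance, you can create an interface VPC endpoint for the Amazon Redshift API with Boto3 so that API calls don’t traverse the public internet; the VPC, subnet, and security group IDs below are placeholders.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create an interface endpoint (powered by AWS PrivateLink) for the Redshift API.
# All resource IDs are placeholders.
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.redshift",
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    PrivateDnsEnabled=True
)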

Customer success stories

You can run your security-demanding analytical workload using out-of-the-box features. For example, SoePay, a Hong Kong–based payments solutions provider, uses AWS Fargate and Amazon Elastic Container Service (Amazon ECS) to scale its infrastructure, AWS Key Management Service (AWS KMS) to manage cryptographic keys, and Amazon Redshift to store data from merchants’ smart devices.

With AWS services, GE Renewable Energy has created a data lake where it collects and analyzes machine data captured at GE wind turbines around the world. GE relies on Amazon S3 to store and protect its ever-expanding collection of wind turbine data, and on Amazon Redshift to gain new insights from the data it collects.

For more customer stories, see Amazon Redshift customers.

Conclusion and Next Steps

In this post, we discussed some of the key out-of-the-box capabilities available in Amazon Redshift at no extra cost to manage your data securely, such as authentication, access control, auditing, logging, compliance, data protection, and network security.

You should periodically review your AWS workloads to ensure security best practices have been implemented. The AWS Well-Architected Framework helps you understand the pros and cons of decisions you make while building systems on AWS. This framework can help you learn architectural best practices for designing and operating reliable, secure, efficient, and cost-effective systems in the cloud. Review the Security Pillar provided in this framework.

In addition, AWS Security Hub provides a comprehensive view of your security state within AWS and helps you check your compliance with security industry standards and best practices.

To adhere to the security needs of your organization, you can automate the deployment of an Amazon Redshift cluster in an AWS account using AWS CloudFormation and AWS Service Catalog. For more information, see Automate Amazon Redshift cluster creation using AWS CloudFormation and Automate Amazon Redshift Cluster management operations using AWS CloudFormation.


About the Authors

Kunal Deep Singh is a Software Development Manager at Amazon Web Services (AWS) and leads development of security features for Amazon Redshift. Prior to AWS he has worked at Amazon Ads and Microsoft Azure. He is passionate about building customer solutions for cloud, data and security.

Thiyagarajan Arumugam is a Principal Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

Stream data from relational databases to Amazon Redshift with upserts using AWS Glue streaming jobs

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/stream-data-from-relational-databases-to-amazon-redshift-with-upserts-using-aws-glue-streaming-jobs/

Traditionally, read replicas of relational databases are often used as a data source for non-online transactions of web applications such as reporting, business analysis, ad hoc queries, operational excellence, and customer services. Due to the exponential growth of data volume, it became common practice to replace such read replicas with data warehouses or data lakes to have better scalability and performance. In most real-world use cases, it’s important to replicate the data from a source relational database to the target in real time. Change data capture (CDC) is one of the most common design patterns to capture the changes made in the source database and relay them to other data stores.

AWS offers a broad selection of purpose-built databases for your needs. For analytic workloads such as reporting, business analysis, and ad hoc queries, Amazon Redshift is a powerful option. With Amazon Redshift, you can query and combine exabytes of structured and semi-structured data across your data warehouse, operational database, and data lake using standard SQL.

To achieve CDC from Amazon Relational Database Service (Amazon RDS) or other relational databases to Amazon Redshift, the simplest solution is to create an AWS Database Migration Service (AWS DMS) task from the database to Amazon Redshift. This approach works well for simple data replication. To have more flexibility to denormalize, transform, and enrich the data, we recommend using Amazon Kinesis Data Streams and AWS Glue streaming jobs between AWS DMS tasks and Amazon Redshift. This post demonstrates how this second approach works in a customer scenario.

Example use case

For our example use case, we have a database that stores data for a fictional organization that holds sports events. We have three dimension tables: sport_event, ticket, and customer, and one fact table: ticket_activity. The table sport_event stores sport type (such as baseball or football), date, and location. The table ticket stores seat level, location, and ticket policy for the target sport event. The table customer stores individual customer names, email addresses, and phone numbers, which are sensitive information. When a customer buys a ticket, the activity (for example, who purchased the ticket) is recorded in the table ticket_activity. One record is inserted into the table ticket_activity every time a customer buys a ticket, so new records are continuously ingested into this fact table. Records in ticket_activity are updated only occasionally, when an administrator maintains the data.

We assume the persona of a data analyst who is responsible for analyzing trends in the sports activity from this continuous data in real time. To use Amazon Redshift as a primary data mart, the data analyst needs to enrich and clean the data so that users like business analysts can understand and use it easily.

The following are examples of the data in each table.

The following is the dimension table sport_event.

event_id sport_type start_date location
35 Baseball 9/1/2021 Seattle, US
36 Baseball 9/18/2021 New York, US
37 Football 10/5/2021 San Francisco, US

The following is the dimension table ticket (the field event_id is the foreign key for the field event_id in the table sport_event).

ticket_id event_id seat_level seat_location ticket_price
1315 35 Standard S-1 100
1316 36 Standard S-2 100
1317 37 Premium P-1 300

The following is the dimension table customer.

customer_id name email phone
222 Teresa Stein [email protected] +1-296-605-8486
223 Caleb Houston [email protected] 087-237-9316x2670
224 Raymond Turner [email protected] +1-786-503-2802x2357

The following is the fact table ticket_activity (the field purchased_by is the foreign key for the field customer_id in the table customer).

ticket_id purchased_by created_at updated_at
1315 222 8/15/2021 8/15/2021
1316 223 8/30/2021 8/30/2021
1317 224 8/31/2021 8/31/2021

To make the data easy to analyze, the data analyst wants to have only one table that includes all the information, instead of joining all four tables every time they want to analyze it. They also want to mask the field phone_number and tokenize the field email_address, because they contain sensitive information. To meet this requirement, we merge these four tables into one table and denormalize, tokenize, and mask the data.
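
To make the transformation concrete, the following is a minimal sketch (plain Python, not the actual job code) of the kind of masking and tokenization applied here: the email address is tokenized with a SHA-256 hash, and every digit of the phone number is replaced with an asterisk. The sample address below is an invented example.

import hashlib
import re

def tokenize_email(email: str) -> str:
    # One-way tokenization: a SHA-256 hex digest stands in for the raw address.
    return hashlib.sha256(email.encode("utf-8")).hexdigest()

def mask_phone(phone: str) -> str:
    # Replace every digit, keeping separators such as '+', '-', and 'x'.
    return re.sub(r"\d", "*", phone)

print(tokenize_email("someone@example.com"))  # 64-character hex token
print(mask_phone("+1-296-605-8486"))          # +*-***-***-****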

The following is the destination table for analysis, sport_event_activity.

ticket_id event_id sport_type start_date location seat_level seat_location ticket_price purchased_by name email_address phone_number created_at updated_at
1315 35 Baseball 9/1/2021 Seattle, USA Standard S-1 100 222 Teresa Stein 990d081b6a420d04fbe07dc822918c7ec3506b12cd7318df7eb3af6a8e8e0fd6 +*-***-***-**** 8/15/2021 8/15/2021
1316 36 Baseball 9/18/2021 New York, USA Standard S-2 100 223 Caleb Houston c196e9e58d1b9978e76953ffe0ee3ce206bf4b88e26a71d810735f0a2eb6186e ***-***-****x**** 8/30/2021 8/30/2021
1317 37 Football 10/5/2021 San Francisco, US Premium P-1 300 224 Raymond Turner 885ff2b56effa0efa10afec064e1c27d1cce297d9199a9d5da48e39df9816668 +*-***-***-****x**** 8/31/2021 8/31/2021

Solution overview

The following diagram depicts the architecture of the solution that we deploy using AWS CloudFormation.

We use an AWS DMS task to capture the changes in the source RDS instance, Kinesis Data Streams as a destination of the AWS DMS task CDC replication, and an AWS Glue streaming job to read changed records from Kinesis Data Streams and perform an upsert into the Amazon Redshift cluster. In the AWS Glue streaming job, we enrich the sports-event records.

Set up resources with AWS CloudFormation

This post includes a CloudFormation template for a quick setup. You can review and customize it to suit your needs.

The CloudFormation template generates the following resources:

  • An Amazon RDS database instance (source).
  • An AWS DMS replication instance, used to replicate the table ticket_activity to Kinesis Data Streams.
  • A Kinesis data stream.
  • An Amazon Redshift cluster (destination).
  • An AWS Glue streaming job, which reads from Kinesis Data Streams and the RDS database instance, denormalizes, masks, and tokenizes the data, and upserts the records into the Amazon Redshift cluster.
  • Three AWS Glue Python shell jobs:
    • rds-ingest-data-initial-<CloudFormation Stack name> creates four source tables on Amazon RDS and ingests the initial data into the tables sport_event, ticket, and customer. Sample data is automatically generated at random by the Faker library (see the minimal sketch after this list).
    • rds-ingest-data-incremental-<CloudFormation Stack name> ingests new ticket activity data into the source table ticket_activity on Amazon RDS continuously. This job simulates customer activity.
    • rds-upsert-data-<CloudFormation Stack name> upserts specific records in the source table ticket_activity on Amazon RDS. This job simulates administrator activity.
  • AWS Identity and Access Management (IAM) users and policies.
  • An Amazon VPC, a public subnet, two private subnets, an internet gateway, a NAT gateway, and route tables.
    • We use private subnets for the RDS database instance, AWS DMS replication instance, and Amazon Redshift cluster.
    • We use the NAT gateway to provide reachability to pypi.org, so the AWS Glue Python shell jobs can use MySQL Connector for Python. It also provides reachability to Kinesis Data Streams and an Amazon Simple Storage Service (Amazon S3) API endpoint.
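
Referenced from the initial-ingest job above, this is a minimal sketch of how random sample rows could be generated with the Faker library; the column names follow the tables in this post, and the actual job’s code may differ.

from faker import Faker

fake = Faker()

# Generate a few random customer rows similar to the sample data in this post.
customers = [
    {
        "name": fake.name(),
        "email": fake.email(),
        "phone": fake.phone_number(),
    }
    for _ in range(3)
]
print(customers)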

The following diagram illustrates this architecture.

To set up these resources, you must have the following prerequisites:

To launch the CloudFormation stack, complete the following steps:

  1. Sign in to the AWS CloudFormation console.
  2. Choose Launch Stack:
  3. Choose Next.
  4. For S3BucketName, enter the name of your new S3 bucket.
  5. For VPCCIDR, enter the CIDR IP address range that doesn’t conflict with your existing networks.
  6. For PublicSubnetCIDR, enter the CIDR IP address range within the CIDR you gave in VPCCIDR.
  7. For PrivateSubnetACIDR and PrivateSubnetBCIDR, enter the CIDR IP address range within the CIDR you gave for VPCCIDR.
  8. For SubnetAzA and SubnetAzB, choose the subnets you want to use.
  9. For DatabaseUserName, enter your database user name.
  10. For DatabaseUserPassword, enter your database user password.
  11. Choose Next.
  12. On the next page, choose Next.
  13. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  14. Choose Create stack.

Stack creation can take about 20 minutes.

Ingest new records

In this section, we walk you through the steps to ingest new records.

Set up an initial source table

To set up an initial source table in Amazon RDS, complete the following steps:

  1. On the AWS Glue console, choose Jobs.
  2. Select the job rds-ingest-data-initial-<CloudFormation stack name>.
  3. On the Actions menu, choose Run job.
  4. Wait for the Run status to show as SUCCEEDED.

This AWS Glue job creates the source tables on the RDS database instance and ingests the initial data.

Start data ingestion to the source table on Amazon RDS

To start data ingestion to the source table on Amazon RDS, complete the following steps:

  1. On the AWS Glue console, choose Triggers.
  2. Select the trigger periodical-trigger-<CloudFormation stack name>.
  3. On the Actions menu, choose Activate trigger.
  4. Choose Enable.

This trigger runs the job rds-ingest-data-incremental-<CloudFormation stack name> to ingest one record every minute.

Start data ingestion to Kinesis Data Streams

To start data ingestion from Amazon RDS to Kinesis Data Streams, complete the following steps:

  1. On the AWS DMS console, choose Database migration tasks.
  2. Select the task rds-to-kinesis-<CloudFormation stack name>.
  3. On the Actions menu, choose Restart/Resume.
  4. Wait for the Status to show as Load complete, replication ongoing.

The AWS DMS replication task ingests data from Amazon RDS to Kinesis Data Streams continuously.

Start data ingestion to Amazon Redshift

Next, to start data ingestion from Kinesis Data Streams to Amazon Redshift, complete the following steps:

  1. On the AWS Glue console, choose Jobs.
  2. Select the job streaming-cdc-kinesis2redshift-<CloudFormation stack name>.
  3. On the Actions menu, choose Run job.
  4. Choose Run job again.

This AWS Glue streaming job is implemented based on the guidelines in Updating and inserting new data. It performs the following actions:

  • Creates a staging table on the Amazon Redshift cluster using the Amazon Redshift Data API
  • Reads from Kinesis Data Streams, and creates a DataFrame with filtering only INSERT and UPDATE records
  • Reads from three dimension tables on the RDS database instance
  • Denormalizes, masks, and tokenizes the data
  • Writes into a staging table on the Amazon Redshift cluster
  • Merges the staging table into the destination table
  • Drops the staging table

About 2 minutes after you start the job, the data should be ingested into the Amazon Redshift cluster.
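
The merge into the destination table follows the standard staging-table upsert pattern. The following is a simplified sketch of that step using the Amazon Redshift Data API with Boto3; the cluster, user, and table names are placeholders, and the actual job generates equivalent statements.

import boto3

data_api = boto3.client("redshift-data")

# Upsert pattern: delete rows that are being replaced, then insert the staged rows.
# Identifiers below are placeholders; the statements run as a single transaction.
data_api.batch_execute_statement(
    ClusterIdentifier="cdc-sample-cluster",
    Database="dev",
    DbUser="dbmaster",
    Sqls=[
        "DELETE FROM sport_event_activity USING stage_sport_event_activity "
        "WHERE sport_event_activity.ticket_id = stage_sport_event_activity.ticket_id;",
        "INSERT INTO sport_event_activity SELECT * FROM stage_sport_event_activity;",
        "DROP TABLE stage_sport_event_activity;",
    ],
)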

Validate the ingested data

To validate the ingested data in the Amazon Redshift cluster, complete the following steps:

  1. On the Amazon Redshift console, choose EDITOR in the navigation pane.
  2. Choose Connect to database.
  3. For Connection, choose Create a new connection.
  4. For Authentication, choose Temporary credentials.
  5. For Cluster, choose the Amazon Redshift cluster cdc-sample-<CloudFormation stack name>.
  6. For Database name, enter dev.
  7. For Database user, enter the user that was specified in the CloudFormation template (for example, dbmaster).
  8. Choose Connect.
  9. Enter the query SELECT * FROM sport_event_activity and choose Run.

Now you can see the ingested records in the table sport_event_activity on the Amazon Redshift cluster. Let’s note the value of ticket_id from one of the records. For this post, we choose 1317 as an example.

Update existing records

Your Amazon Redshift cluster now has the latest data ingested from the tables on the source RDS database instance. Let’s update the data in the source table ticket_activity on the RDS database instance to see that the updated records are replicated to the Amazon Redshift cluster side.

The CloudFormation template creates another AWS Glue job. This job upserts the data with specific IDs in the source table ticket_activity. To upsert the records in the source table, complete the following steps:

  1. On the AWS Glue console, choose Jobs.
  2. Choose the job rds-upsert-data-<CloudFormation stack name>.
  3. On the Actions menu, choose Edit job.
  4. Under Security configuration, script libraries, and job parameters (optional), for Job parameters, update the following parameters:
    1. For Key, enter --ticket_id_to_be_updated.
    2. For Value, replace 1 with one of the ticket IDs you observed on the Amazon Redshift console.
  5. Choose Save.
  6. Choose the job rds-upsert-data-<CloudFormation stack name>.
  7. On the Actions menu, choose Run job.
  8. Choose Run job.

This AWS Glue Python shell job simulates customer activity of buying a ticket. It updates a record in the source table ticket_activity on the RDS database instance using the ticket ID passed in the job argument --ticket_id_to_be_updated. It automatically selects one customer, updates the field purchased_by with that customer’s ID, and updates the field updated_at with the current timestamp.
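
A stripped-down sketch of such a Python shell job is shown below, assuming MySQL Connector for Python and placeholder connection settings; the actual job also selects a random customer and handles credentials differently.

import sys
import mysql.connector
from awsglue.utils import getResolvedOptions

# Read the ticket ID passed as a job parameter (--ticket_id_to_be_updated).
args = getResolvedOptions(sys.argv, ["ticket_id_to_be_updated"])
ticket_id = int(args["ticket_id_to_be_updated"])

# Placeholder connection settings; the real job reads these from its configuration.
conn = mysql.connector.connect(
    host="my-rds-endpoint.example.com",
    user="admin",
    password="example-password",
    database="sportsdb",
)

cur = conn.cursor()
cur.execute(
    "UPDATE ticket_activity SET purchased_by = %s, updated_at = NOW() WHERE ticket_id = %s",
    (224, ticket_id),  # 224 is an example customer_id
)
conn.commit()
conn.close()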

To validate the ingested data in the Amazon Redshift cluster, run the same query SELECT * FROM sport_event_activity. You can filter the record with the ticket_id value you noted earlier.

According to the rows returned by the query, the record ticket_id=1317 has been updated. The field updated_at has been updated from 2021-08-16 06:05:01 to 2021-08-16 06:53:52, and the field purchased_by has been updated from 449 to 14. From this result, you can see that this record has been successfully updated on the Amazon Redshift cluster side as well. You can also choose Queries in the left pane to see past query runs.

Clean up

Now to the final step, cleaning up the resources.

  1. Stop the AWS DMS replication task rds-to-kinesis-<CloudFormation stack name>.
  2. Stop the AWS Glue streaming job streaming-cdc-kinesis2redshift-<CloudFormation stack name>.
  3. Delete the CloudFormation stack.

Conclusion

In this post, we demonstrated how you can stream data—not only new records, but also updated records from relational databases—to Amazon Redshift. With this approach, you can easily achieve upsert use cases on Amazon Redshift clusters. In the AWS Glue streaming job, we demonstrated the common technique to denormalize, mask, and tokenize data for real-world use cases.


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He enjoys collaborating with different teams to deliver results like this post. In his spare time, he enjoys playing video games with his family.

Roman Gavrilov is an Engineering Manager at AWS Glue. He has over a decade of experience building scalable Big Data and Event-Driven solutions. His team works on Glue Streaming ETL to allow near real time data preparation and enrichment for machine learning and analytics.

New – AWS Data Exchange for Amazon Redshift

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-aws-data-exchange-for-amazon-redshift/

Back in 2019 I told you about AWS Data Exchange and showed you how to Find, Subscribe To, and Use Data Products. Today, you can choose from over 3600 data products in ten categories:

In my introductory post I showed you how you could subscribe to data products and then download the data sets into an Amazon Simple Storage Service (Amazon S3) bucket. I then suggested various options for further processing, including AWS Lambda functions, an AWS Glue crawler, or an Amazon Athena query.

Today we are making it even easier for you to find, subscribe to, and use third-party data with the introduction of AWS Data Exchange for Amazon Redshift. As a subscriber, you can directly use data from providers without any further processing, and with no need for an extract, transform, and load (ETL) process. Because you don’t have to do any processing, the data is always current and can be used directly in your Amazon Redshift queries. AWS Data Exchange for Amazon Redshift takes care of managing all entitlements and payments for you, with all charges billed to your AWS account.

As a provider, you now have a new way to license your data and make it available to your customers.

As I was writing this post, it was cool to realize just how many existing aspects of Redshift and Data Exchange played central roles. Because Redshift has a clean separation of storage and compute, along with built-in data sharing features, the data provider allocates and pays for storage, and the data subscriber does the same for compute. The provider does not need to scale their cluster in proportion to the size of their subscriber base, and can focus on acquiring and providing data.

Let’s take a look at this feature from two vantage points: subscribing to a data product, and publishing a data product.

AWS Data Exchange for Amazon Redshift – Subscribing to a Data Product
As a data subscriber I can browse through the AWS Data Exchange catalog and find data products that are relevant to my business, and subscribe to them.

Data providers can also create private offers and extend them to me for access via the AWS Data Exchange Console. I click My product offers, and review the offers that have been extended to me. I click on Continue to subscribe to proceed:

Then I complete my subscription by reviewing the offer and the subscription terms, noting the data sets that I will get, and clicking Subscribe:

Once the subscription is completed, I am notified and can move forward:

From the Redshift Console, I click Datashares, select From other accounts, and I can see the subscribed data set:

Next, I associate it with one or more of my Redshift clusters by creating a database that points to the subscribed datashare, and use the tables, views, and stored procedures to power my Redshift queries and my applications.
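
If I wanted to script that step instead of using the console, the association boils down to a CREATE DATABASE ... FROM DATASHARE statement. Here is a minimal sketch using the Redshift Data API with Boto3; the cluster name, provider account, and namespace are placeholders (the real values appear on the Datashares page of the Redshift console).

import boto3

data_api = boto3.client("redshift-data")

# Create a local database that points at the subscribed datashare.
# The share name, provider account, and namespace below are placeholders.
data_api.execute_statement(
    ClusterIdentifier="my-consumer-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=(
        "CREATE DATABASE area_codes_db FROM DATASHARE area_code_reference "
        "OF ACCOUNT '111122223333' NAMESPACE 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee';"
    ),
)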

AWS Data Exchange for Amazon Redshift – Publishing a Data Product
As a data provider I can include Redshift tables, views, schemas and user-defined functions in my AWS Data Exchange product. To keep things simple, I’ll create a product that includes just one Redshift table.

I use the spiffy new Redshift Query Editor V2 to create a table that maps US area codes to a city and a state:

Then I examine the list of existing datashares for my Redshift cluster, and click Create datashare to make a new one:

Next, I go through the usual process for creating a datashare. I select AWS Data Exchange datashare, assign a name (area_code_reference), pick the database within the cluster, and make the datashare accessible to publicly accessible clusters:

Then I scroll down and click Add to move forward:

I choose my schema (public), opt to include only tables and views in my datashare, and then add the area_codes table:

At this point I can click Add to wrap up, or Add and repeat to make a more complex product that contains additional objects.

I confirm that the datashare contains the table, and click Create datashare to move forward:
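
If I were scripting the provider side as well, the console steps above roughly correspond to a few SQL statements, sketched here through the Redshift Data API. The cluster and user names are placeholders, and I’m assuming the MANAGEDBY ADX clause, which marks a datashare as managed by AWS Data Exchange.

import boto3

data_api = boto3.client("redshift-data")

# SQL equivalent of the console steps above (cluster and user names are placeholders).
data_api.batch_execute_statement(
    ClusterIdentifier="my-provider-cluster",
    Database="dev",
    DbUser="awsuser",
    Sqls=[
        # Create a datashare managed by AWS Data Exchange, then add the schema and table.
        "CREATE DATASHARE area_code_reference MANAGEDBY ADX;",
        "ALTER DATASHARE area_code_reference ADD SCHEMA public;",
        "ALTER DATASHARE area_code_reference ADD TABLE public.area_codes;",
    ],
)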

Now I am ready to start publishing my data! I visit the AWS Data Exchange Console, expand the navigation on the left, and click Owned data sets:

I review the Data set creation steps, and click Create data set to proceed:

I select Amazon Redshift datashare, give my data set a name (United States Area Codes), enter a description, and click Create data set to proceed:

I create a revision called v1:

I select my datashare and click Add datashare(s):

Then I finalize the revision:

I showed you how to create a datashare and a data set, and how to publish a product using the console. If you are publishing multiple products and/or making regular revisions, you can automate all of these steps using the AWS Command Line Interface (CLI) and the AWS Data Exchange APIs.

Initial Data Products
Multiple data providers are working to make their data products available to you through AWS Data Exchange for Amazon Redshift. Here are some of the initial offerings and the official descriptions:

  • FactSet Supply Chain Relationships – FactSet Revere Supply Chain Relationships data is built to expose business relationship interconnections among companies globally. This feed provides access to the complex networks of companies’ key customers, suppliers, competitors, and strategic partners, collected from annual filings, investor presentations, and press releases.
  • Foursquare Places 2021: New York City Sample – This trial dataset contains Foursquare’s integrated Places (POI) database for New York City, accessible as a Redshift Data Share. Instantly load Foursquare’s Places data into a Redshift table for further processing and analysis. Foursquare data is privacy-compliant, uniquely sourced, and trusted by top enterprises like Uber, Samsung, and Apple.
  • Mathematica Medicare Pilot Dataset – Aggregate Medicare HCC counts and prevalence by state, county, payer, and filtered to the diabetic population from 2017 to 2019.
  • COVID-19 Vaccination in Canada – This listing contains sample datasets for COVID-19 Vaccination in Canada data.
  • Revelio Labs Workforce Composition and Trends Data (Trial data) – Understand the workforce composition and trends of any company.
  • Facteus – US Card Consumer Payment – CPG Backtest – Historical sample from panel of SKU-level transaction detail from cash and card transactions across hundreds of Consumer-Packaged Goods sold at over 9,000 urban convenience stores and bodegas across the U.S.
  • Decadata Argo Supply Chain Trial Data – Supply chain data for CPG firms delivering products to US Grocery Retailers.

Jeff;

Automate Amazon Redshift Cluster management operations using AWS CloudFormation

Post Syndicated from Anusha Challa original https://aws.amazon.com/blogs/big-data/automate-amazon-redshift-cluster-management-operations-using-aws-cloudformation/

Amazon Redshift is the fastest and most widely used cloud data warehouse. Tens of thousands of customers run business-critical workloads on Amazon Redshift. Amazon Redshift offers many features that enable you to build scalable, highly performant, cost-effective, and easy-to-manage workloads. For example, you can scale an Amazon Redshift cluster up or down based on your workload requirements, pause clusters when not in use to suspend on-demand billing, and enable relocation. You can automate these management activities either using the Amazon Redshift API, AWS Command Line Interface (AWS CLI), or AWS CloudFormation.

AWS CloudFormation helps you model and set up your AWS resources so that you can spend less time managing those resources and more time focusing on your applications that run in AWS. You create a template that describes all the AWS resources that you want, and AWS CloudFormation takes care of provisioning and configuring those resources for you.

In this post, we walk through how to use AWS CloudFormation to automate some of the most common Amazon Redshift cluster management operations:

  • Create an Amazon Redshift cluster via the following methods:
    • Restore a cluster from a snapshot
    • Create an encrypted Amazon Redshift cluster
  • Perform cluster management operations:
    • Pause or resume a cluster
    • Perform elastic resize or classic resize
    • Add or remove Identity and Access Management (IAM) roles to cluster permissions
    • Rotate encryption keys
    • Modify snapshot retention period for automated and manual snapshots
    • Enable or disable snapshot copy to another AWS Region
    • Create a parameter group with required workload management (WLM) configuration and associate it to the Amazon Redshift cluster
    • Enable concurrency scaling by modifying WLM configuration
    • Enable or disable audit logging

For a complete list of operations that you can automate using AWS CloudFormation, see Amazon Redshift resource type reference.

Benefits of using CloudFormation templates

Many of our customers build fully automated production environments and use AWS CloudFormation to aid automation. AWS CloudFormation offers an easy way to create and manage AWS infrastructure by treating infrastructure as code. CloudFormation templates create infrastructure resources in a group called a stack, and allow you to define and customize all components. CloudFormation templates introduce the ability to implement version control, and the ability to quickly and reliably replicate your infrastructure. This significantly simplifies your continuous integration and continuous delivery (CI/CD) pipelines and keeps multiple environments in sync. The CloudFormation template becomes a repeatable, single source of truth for your infrastructure. You can create CloudFormation templates in YAML and JSON formats. The templates provided in this post use YAML format. Amazon Redshift clusters are one of the resources that you can provision and manage using AWS CloudFormation.

Create an Amazon Redshift cluster using AWS CloudFormation

With AWS CloudFormation, you can automate Amazon Redshift cluster creation. In this section, we describe two ways of creating Amazon Redshift clusters: restoring from an existing snapshot and creating a new encrypted cluster. In the next section, we describe how you can manage the lifecycle of a cluster by performing cluster management operations using AWS CloudFormation.

Restore an Amazon Redshift cluster from a snapshot

Snapshots are point-in-time backups of a cluster. Amazon Redshift periodically takes automated snapshots of the cluster. You can also take a snapshot manually any time. A snapshot contains data from any databases that are running on the cluster. Snapshots enable data protection, and you can also use them to build new environments to perform application testing, data mining, and more. When you start a new development or enhancement project, you may want to build a new Amazon Redshift cluster that has the same code and data as that of your production, so you can develop and test your code there before deploying it. To do this, create the new cluster by restoring from your production cluster’s snapshot.

You can use AWS CloudFormation to automate the restore operation. If you have multiple projects with different timelines or requirements, you can build multiple Amazon Redshift clusters in an automatic and scalable fashion using AWS CloudFormation.

To create a new Amazon Redshift cluster by restoring from an existing snapshot, create a CloudFormation stack using a CloudFormation template that has the AWS::Redshift::Cluster resource with the following mandatory properties:

  • SnapshotIdentifier – The name of the snapshot from which to create the new cluster
  • ClusterIdentifier – A unique identifier of your choice for the cluster
  • NodeType – The node type to be provisioned for the cluster
  • ClusterType – The type of the cluster, either single-node or multi-node (recommended for production workloads)
  • NumberOfNodes – The number of compute nodes in the cluster
  • DBName – The name of the first database to be created in the new cluster
  • MasterUsername – The user name associated with the admin user account for the cluster that is being created
  • MasterUserPassword – The password associated with the admin user account for the cluster that is being created

The following is a sample CloudFormation template that restores a snapshot with identifier cfn-blog-redshift-snapshot and creates a two-node Amazon Redshift cluster with identifier cfn-blog-redshift-cluster and node type ra3.4xlarge:

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password 
Resources:
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      SnapshotIdentifier: "cfn-blog-redshift-snapshot"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      NodeType: "ra3.4xlarge"
      NumberOfNodes: "2"
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

Create an encrypted Amazon Redshift cluster

You can enable database encryption for your clusters to protect data at rest.

Use the following sample CloudFormation template to create an encrypted Amazon Redshift cluster. This template has basic properties only to make this walkthrough easy to understand. For your production workload, we recommend following the best practices as described in the post Automate Amazon Redshift cluster creation using AWS CloudFormation.

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password. Must be 8-64 characters long. Must contain at least one uppercase letter, one lowercase letter and one number. Can be any printable ASCII character except /, ", or @.
Resources:
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      NodeType: "ra3.4xlarge"
      NumberOfNodes: "2"
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

In AWS CloudFormation, create a stack using this template. When the creation of the CloudFormation stack is complete, you can see a new encrypted Amazon Redshift cluster called cfn-blog-redshift-cluster. For the rest of this post, we use this CloudFormation template as the base and explore how to modify the template to perform various Amazon Redshift management operations.

To create a new CloudFormation stack that uses the preceding template via the AWS CloudFormation console, complete the following steps:

  1. On the AWS CloudFormation console, choose Create Stack.
  2. On the drop-down menu, choose With new resources (standard).
  3. For Prepare template, choose Template is ready.
  4. For Specify template, choose Upload a template file.
  5. Save the provided CloudFormation template in a .yaml file and upload it.
  6. Choose Next.
  7. Enter a name for the stack. For this post, we use RedshiftClusterStack-CFNBlog.
  8. Choose Next.
  9. Choose Next again.
  10. Choose Create Stack.

To create the CloudFormation stack using the AWS CLI, run the following command:

aws cloudformation create-stack \
--stack-name RedshiftClusterStack-CFNBlog \
--template-body file://<CloudFormation template file name>

The status of the stack changes to CREATE_IN_PROGRESS. After the Amazon Redshift cluster is created, the status changes to CREATE_COMPLETE. Navigate to the Amazon Redshift console to verify that the cluster is created and the status is Available.

Perform cluster management operations

Amazon Redshift customers have the flexibility to perform various cluster operations to implement workload security, perform cost-optimization, and manage scale. Often, we see our customers perform these operations in all their environments, such as DEV, QA, and PROD, to keep them in sync. You can automate these operations using CloudFormation stack updates by updating the CloudFormation template you used to create the cluster.

To perform these management operations using CloudFormation stack updates, you can create your initial CloudFormation stack in one of the two ways:

  • If your cluster isn’t already created, you can create it using AWS CloudFormation.
  • If you have an existing cluster, create a CloudFormation stack for it with the using existing resources (import resources) option. Provide a template that describes the existing cluster so that a CloudFormation stack is associated with the resource.

Each subsequent cluster management operation is an update to the base CloudFormation stack’s template. CloudFormation stack updates enable you to make changes to a stack’s resources by updating the stack instead of deleting it and creating a new stack. Update the template to add, modify, or remove the relevant property values in the AWS::Redshift::Cluster resource to trigger the respective Amazon Redshift cluster management operation. AWS CloudFormation compares the changes you submit with the current state of your stack and applies only the changes. For a summary of the update workflow, see How does AWS CloudFormation work?

The following steps describe how to perform a CloudFormation stack update on the RedshiftClusterStack-CFNBlog stack that you created in the previous section.

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Choose the stack you want to update (for this post, RedshiftClusterStack-CFNBlog).
  3. Choose Update.
  4. In the Prerequisite – Prepare template section, choose Replace current template.
  5. For Specify template, choose Upload a template file.
  6. Make changes to your current CloudFormation template based on the operation you wish to perform on the Amazon Redshift cluster.
  7. Save the updated CloudFormation template .yaml file and upload it.
  8. Choose Next.
  9. Choose Next again.
  10. Choose Next again.
  11. Choose Update Stack.

To update the CloudFormation stack using the AWS CLI, run the following command:

aws cloudformation update-stack \
--stack-name RedshiftClusterStack-CFNBlog \
--template-body file://<updated CloudFormation template file name>

In this section, we look at some AWS CloudFormation properties available for Amazon Redshift clusters and dive deep to understand how to update the CloudFormation template and add, remove, or modify these properties to automate some of the most common cluster management operations. We use the RedshiftClusterStack-CFNBlog stack that you created in the previous section as an example.

Pause or resume cluster

If an Amazon Redshift cluster isn’t being used for a certain period of time, you can pause it easily and suspend on-demand billing. For example, you can suspend on-demand billing on a cluster that is used for development when it’s not in use. While the cluster is paused, you’re only charged for the cluster’s storage. This adds significant flexibility in managing operating costs for your Amazon Redshift clusters. You can resume a paused cluster when you’re ready to use it. To pause a cluster, update the cluster’s current CloudFormation stack template to add a new property called ResourceAction with the value pause-cluster. To pause the cluster created using the RedshiftClusterStack-CFNBlog stack, you can perform a stack update and use the following updated CloudFormation template:

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      #Added this property to pause cluster
      ResourceAction: "pause-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      NodeType: "ra3.4xlarge"
      NumberOfNodes: "2"
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

Follow the steps in the previous section to update the stack. When the stack update operation is in progress, the cluster’s status changes from Available to Modifying, Pausing.

When the stack update is complete, the cluster’s status changes to Paused.

When you’re ready to use the cluster, you can resume it. To resume the cluster, update the cluster’s current CloudFormation stack template and change the value of the ResourceAction property to resume-cluster. You can use the following template to perform a stack update operation to resume the cluster created using the RedshiftClusterStack-CFNBlog stack:

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      #Added this property to resume cluster
      ResourceAction: "resume-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      NodeType: "ra3.4xlarge"
      NumberOfNodes: "2"
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

Follow the steps in the previous section to update the stack. When the stack update is in progress, the cluster’s status changes from Paused to Modifying, Resuming.

When the stack update is complete, the cluster’s status changes to Available.

Perform elastic resize or classic resize

Data warehouse workloads often have changing needs. You may add a new line of business and thereby ingest more data into the data warehouse, or you may have a new analytics application for your business users and add new ETL processes to support it. When your compute requirements change due to changing needs, you can resize your Amazon Redshift cluster using one of the following approaches:

  • Elastic resize – This changes the node type, number of nodes, or both. Typically, it completes within 10–15 minutes when adding or removing nodes of the same type. Cross-instance elastic resize can take up to 45 minutes. We recommend using elastic resize whenever possible, because it completes much more quickly than classic resize. Elastic resize has some growth and reduction limits on the number of nodes.
  • Classic resize – You can also use classic resize to change the node type, number of nodes, or both. We recommend this option only when you’re resizing to a configuration that isn’t available through elastic resize, because it takes considerably more time depending on your data size.

You can automate both elastic resize and classic resize operations on Amazon Redshift clusters using AWS CloudFormation. The default resize operation when initiated using a CloudFormation stack update is elastic resize. If elastic resize isn’t possible for your configuration, AWS CloudFormation throws an error. You can force the resize operation to be classic resize by setting the property Classic to Boolean true in the CloudFormation template provided in the update stack operation. If you don’t provide this parameter or set the value to false, the resize type is elastic. To initiate the resize operation, update the cluster’s current CloudFormation stack template and change the values of the NodeType and NumberOfNodes properties as required.

To perform an elastic resize from the initial two-node RA3 4xlarge configuration to NumberOfNodes:2 and NodeType:ra3.16xlarge configuration on the Amazon Redshift cluster cfn-blog-redshift-cluster, you can update the current template of the RedshiftClusterStack-CFNBlog stack as shown in the following CloudFormation template:

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      ResourceAction: "resume-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      #Modified the below 2 properties to perform elastic resize
      NodeType: "ra3.16xlarge"
      NumberOfNodes: "2"
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

Alternatively, if you want to force this resize to be a classic resize, use the following CloudFormation template to update the RedshiftClusterStack-CFNBlog stack. This template has an additional property, Classic with Boolean value true, to initiate classic resize, in addition to having updated NodeType and NumberOfNodes properties.

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      ResourceAction: "resume-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      #Modified the below 3 properties to perform classic resize
      NodeType: "ra3.16xlarge"
      NumberOfNodes: "2"
      Classic: true
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

Follow the steps earlier in this post to update the stack. For both elastic resize and classic resize, when the stack update is in progress, the cluster’s status changes from Available to Resizing.

Add or remove IAM roles to cluster permissions

Your Amazon Redshift cluster needs permissions to access other AWS services on your behalf. To grant the required permissions, add IAM roles to the cluster permissions. You can add up to 10 IAM roles. For instructions on creating roles, see Create an IAM role.

To add IAM roles to the cluster, update the cluster’s current CloudFormation stack template and add the IamRoles property with a list of IAM roles you want to add. For example, to add IAM roles cfn-blog-redshift-role-1 and cfn-blog-redshift-role-2 to the cluster cfn-blog-redshift-cluster, you can update the RedshiftClusterStack-CFNBlog stack using the following CloudFormation template. In this template, the new array property IamRoles has been added with values cfn-blog-redshift-role-1 and cfn-blog-redshift-role-2.

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      ResourceAction: "resume-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      NodeType: "ra3.4xlarge"
      NumberOfNodes: "2"
      #Added IAMRoles property with ARNs for 2 roles
      IamRoles: [
                    !Sub "arn:${AWS::Partition}:iam::${AWS::AccountId}:role/cfn-blog-redshift-role-1",
                    !Sub "arn:${AWS::Partition}:iam::${AWS::AccountId}:role/cfn-blog-redshift-role-2"
                ]
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

Now, if you want to remove the IAM role cfn-blog-redshift-role-2 from the cfn-blog-redshift-cluster cluster, you can perform another CloudFormation stack update on RedshiftClusterStack-CFNBlog using the following CloudFormation template. This template contains only those IAM roles you want to retain.

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      ResourceAction: "resume-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      NodeType: "ra3.4xlarge"
      NumberOfNodes: "2"
      #Updated IAMRoles property to remove role-2
      IamRoles: [
                    !Sub "arn:${AWS::Partition}:iam::${AWS::AccountId}:role/cfn-blog-redshift-role-1"
                ]
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

Follow the steps earlier in this post to update the stack. When the stack update is in progress, the cluster’s status changes from Available to Available, Modifying.

After the stack update is complete, navigate to the cluster’s Properties tab and the Cluster permissions section to validate that the IAM roles are associated with the cluster.

Rotate encryption keys on the cluster

You can enable database encryption for your Amazon Redshift clusters to protect data at rest. You can rotate encryption keys using AWS CloudFormation. To rotate encryption keys, update the base CloudFormation template to add the RotateEncryptionKey property and set it to Boolean true. For example, you can use the following CloudFormation template to rotate the encryption key for cfn-blog-redshift-cluster by performing an update on the CloudFormation stack RedshiftClusterStack-CFNBlog:

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      ResourceAction: "resume-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      NodeType: "ra3.4xlarge"
      NumberOfNodes: "2"
      #Added RotateEncryptionKey property
      RotateEncryptionKey: true
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

Follow the steps earlier in this post to update the stack. When the stack update is in progress, the cluster’s status changes from Available to Modifying, Rotating Keys. It takes approximately 2 minutes to rotate the encryption keys.

Modify the snapshot retention period for automated and manual snapshots

Amazon Redshift takes periodic automated snapshots of the cluster. By default, automated snapshots are retained for 24 hours. You can change the retention period of automated snapshots to 0–35 days. Amazon Redshift deletes automated snapshots at the end of a snapshot’s retention period, when you disable automated snapshots for the cluster, or when you delete the cluster.

If you set the automated snapshot retention period to 0 days, the automated snapshots feature is disabled and any existing automated snapshots are deleted. Exercise caution before setting the automated snapshot retention period to 0.

You can take manual snapshots of the cluster any time. By default, manual snapshots are retained indefinitely, even after you delete the cluster. You can also specify the retention period when you create a manual snapshot. When the snapshot retention period is modified on automated snapshots, it applies to both existing and new automated snapshots. In contrast, when the snapshot retention period is modified on manual snapshots, it applies to new manual snapshots only.

To modify the retention period for snapshots, update your current cluster’s CloudFormation stack template. Add or update the AutomatedSnapshotRetentionPeriod property with an integer value between 0 and 35 indicating the new retention period in days for automated snapshots, and the ManualSnapshotRetentionPeriod property with an integer value between 1 and 3653 indicating the new retention period in days for manual snapshots.

The following CloudFormation template sets the AutomatedSnapshotRetentionPeriod to 7 days and ManualSnapshotRetentionPeriod to 90 days on cfn-blog-redshift-cluster when you update the current CloudFormation stack RedshiftClusterStack-CFNBlog:

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      ResourceAction: "resume-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      NodeType: "ra3.4xlarge"
      NumberOfNodes: "2"
      #Add snapshot retention properties
      AutomatedSnapshotRetentionPeriod: 7
      ManualSnapshotRetentionPeriod: 90
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

Enable or disable snapshot copies to another Region

You can configure your Amazon Redshift cluster to copy all new manual and automated snapshots for a cluster to another Region. You can choose how long to keep copied automated or manual snapshots in the destination Region. If the cluster is encrypted, because AWS Key Management Service (AWS KMS) keys are specific to a Region, you must configure a snapshot copy grant for a KMS key in the destination Region. For information on how to create a snapshot copy grant, see Copying AWS KMS–encrypted snapshots to another AWS Region. Make sure that the snapshot copy grant is created before enabling snapshot copy to another Region using CloudFormation templates.
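
One way to create the grant ahead of time is with a short Boto3 call in the destination Region; the grant name matches the one used in the template below, and the KMS key ARN is a placeholder.

import boto3

# The grant must exist in the destination Region (us-west-1 in this example).
redshift_dest = boto3.client("redshift", region_name="us-west-1")

redshift_dest.create_snapshot_copy_grant(
    SnapshotCopyGrantName="cfn-blog-redshift-cross-region-snapshot-copy-grant",
    KmsKeyId="arn:aws:kms:us-west-1:111122223333:key/EXAMPLE-KEY-ID"  # placeholder key ARN
)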

To enable snapshot copy to another Region, update your current cluster’s CloudFormation stack template and add or update the following properties:

  • DestinationRegion – Required to enable snapshot copy. It specifies the destination Region that snapshots are automatically copied to.
  • SnapshotCopyRetentionPeriod – Optional. Modifies the number of days to retain snapshots in the destination Region. If this property is not specified, the retention period is the same as that of the source Region. By default, this operation only modifies the retention period of existing and new copied automated snapshots. To change the retention period of copied manual snapshots using this property, set the SnapshotCopyManual property to true.
  • SnapshotCopyManual – Indicates whether to apply the snapshot retention period to newly copied manual snapshots instead of automated snapshots. If you set this option, only newly copied manual snapshots have the new retention period.
  • SnapshotCopyGrantName – The name of the snapshot copy grant.

To copy snapshots taken from cfn-blog-redshift-cluster into the Region us-west-1 and to modify the retention period of the newly copied manual snapshots to 90 days, update the current CloudFormation stack RedshiftClusterStack-CFNBlog with the following CloudFormation template:

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      ResourceAction: "resume-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      NodeType: "ra3.4xlarge"
      NumberOfNodes: "2"
      #Add cross-region snapshot copy properties
      DestinationRegion: "us-west-1"
      SnapshotCopyGrantName: "cfn-blog-redshift-cross-region-snapshot-copy-grant"
      SnapshotCopyManual: true
      SnapshotCopyRetentionPeriod: 90
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

To disable cross-Region snapshot copy, update your current CloudFormation stack’s template and remove the properties DestinationRegion, SnapshotCopyRetentionPeriod, SnapshotCopyManual and SnapshotCopyGrantName.

Create a parameter group with required WLM configuration and assign it to an Amazon Redshift cluster

You can use AWS CloudFormation to create an Amazon Redshift parameter group and associate it with an Amazon Redshift cluster. If you don’t associate a parameter group, the default parameter group is assigned, which has the default parameter values and WLM configuration. When you create a new parameter group, it also has the defaults for parameters unless you override them. The following CloudFormation template overrides the default values for the require_ssl and wlm_json_configuration parameters. The WLM configuration is specified in JSON format. In this template, automatic WLM is configured on cfn-blog-redshift-cluster with three queues: etl_queue, reporting_queue, and the default queue, with the following specifications:

  • Priority for etl_queue is set to highest. It’s configured to route all queries run by users belonging to the user group etl_user to etl_queue.
  • Priority for reporting_queue is set to normal. It’s configured to route to reporting_queue all queries run by users belonging to any user group whose name contains the word report, as well as any query whose query group contains the word report.
  • The following three query monitoring rules are defined to protect reporting_queue from bad queries:
    • When query runtime is greater than 7,200 seconds (2 hours), the query is stopped.
    • If a query has a nested loop join with more than 1,000,000 rows, its priority is changed to lowest.
    • If any query consumes more than 50% CPU utilization, it is logged.
  • All other queries are routed to the default queue. Priority for default_queue is set to lowest.
  • The following three query monitoring rules are defined to protect the default queue from bad queries:
    • If a query has a nested loop join with more than 1,000,000 rows, it is stopped.
    • If a query has a large return set and is returning more than 1,000,000 rows, it is stopped.
    • If a query spills more than 10 GB to disk, it is stopped.

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  #Add parameter group resource
  RedshiftClusterParameterGroup: 
    Type: "AWS::Redshift::ClusterParameterGroup"
    Properties: 
      Description: "CFNBlog-Redshift-Cluster-parameter-group"
      ParameterGroupFamily: "redshift-1.0"
      Parameters: 
        - 
          ParameterName: "require_ssl"
          ParameterValue: "true"
        -
          ParameterName: "wlm_json_configuration"
          ParameterValue: "[
                            {
                              \"name\":\"etl_queue\",
                              \"user_group\":[\"etl_user\"],
                              \"auto_wlm\":true,
                              \"queue_type\":\"auto\",
                              \"priority\":\"highest\"
                            },
                            {
                              \"name\":\"reporting_queue\",
                              \"user_group\":[\"%report%\"],
                              \"user_group_wild_card\":1,
                              \"query_group\":[\"%report%\"],
                              \"query_group_wild_card\":1,
                              \"auto_wlm\":true,
                              \"queue_type\":\"auto\",
                              \"priority\":\"normal\",
                              \"rules\":[
                                        {
                                          \"rule_name\":\"timeout_2hours\",
                                          \"predicate\":[
                                                        {
                                                          \"metric_name\":\"query_execution_time\",
                                                          \"operator\":\">\",\"value\":7200
                                                        }
                                                      ],
                                          \"action\":\"abort\"
                                        },
                                        {
                                          \"rule_name\":\"nested_loop_reporting\",
                                          \"action\":\"change_query_priority\",
                                          \"predicate\":[
                                                        {
                                                          \"metric_name\":\"nested_loop_join_row_count\",
                                                          \"operator\":\">\",\"value\":1000000
                                                        }
                                                      ],
                                          \"value\":\"lowest\"
                                        },
                                        {
                                          \"rule_name\":\"expensive_computation\",
                                          \"predicate\":[
                                                        {
                                                          \"metric_name\":\"query_cpu_usage_percent\",
                                                          \"operator\":\">\",\"value\":50
                                                        }
                                                      ],
                                          \"action\":\"log\"
                                        }
                                      ]
                            },
                            {
                              \"name\":\"Default queue\",
                              \"auto_wlm\":true,
                              \"priority\":\"lowest\",
                              \"rules\":[
                                        {
                                          \"rule_name\":\"nested_loop\",
                                          \"action\":\"abort\",
                                          \"predicate\":[
                                                        {
                                                          \"metric_name\":\"nested_loop_join_row_count\",
                                                          \"operator\":\">\",\"value\":1000000
                                                        }
                                                      ]
                                        },
                                        {
                                          \"rule_name\":\"large_return_set\",
                                          \"action\":\"abort\",
                                          \"predicate\":[
                                                        {
                                                          \"metric_name\":\"return_row_count\",
                                                          \"operator\":\">\",\"value\":1000000
                                                        }
                                                      ]
                                        },
                                        {
                                          \"rule_name\":\"large_spill_to_disk\",
                                          \"action\":\"abort\",
                                          \"predicate\":[
                                                        {
                                                          \"metric_name\":\"query_temp_blocks_to_disk\",
                                                          \"operator\":\">\",
                                                          \"value\":10000
                                                        }
                                                      ]
                                        }
                                      ]
                            },
                            {
                              \"short_query_queue\":true
                            }
                          ]"
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      ResourceAction: "resume-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      NodeType: "ra3.4xlarge"
      NumberOfNodes: "2"
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
      #Add parameter to associate parameter group
      ClusterParameterGroupName: !Ref RedshiftClusterParameterGroup
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster
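
After you update the stack with this template, you can optionally confirm the queue definitions from any SQL client by querying the WLM configuration system table. The following query is a minimal sketch; service class numbering and column names can vary slightly across Amazon Redshift versions:

select service_class, trim(name) as queue_name, num_query_tasks, query_working_mem
from stv_wlm_service_class_config
where service_class >= 6
order by service_class;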

Enable concurrency scaling by modifying WLM configuration

Concurrency scaling is an autoscaling feature of Amazon Redshift that enables you to support virtually unlimited concurrent users. When you turn on concurrency scaling, Amazon Redshift automatically adds additional cluster capacity to process an increase in both read queries and write queries.

You can enable concurrency scaling at an individual WLM queue level. To enable concurrency scaling using AWS CloudFormation, update your current stack’s CloudFormation template and change the parameter value for wlm_json_configuration to add a property called concurrency_scaling with the value auto.

The following CloudFormation template sets concurrency scaling to auto on reporting_queue. It also overrides the value of the max_concurrency_scaling_clusters parameter from the default of 1 to 5.

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  RedshiftClusterParameterGroup: 
    Type: "AWS::Redshift::ClusterParameterGroup"
    Properties: 
      Description: "CFNBlog-Redshift-Cluster-parameter-group"
      ParameterGroupFamily: "redshift-1.0"
      Parameters: 
        - 
          ParameterName: "require_ssl"
          ParameterValue: "true"
        #add parameter to change default value for max number of concurrency scaling clusters parameter
        - 
          ParameterName: "max_concurrency_scaling_clusters"
          ParameterValue: "5"
        #Updated wlm configuration to set concurrency_scaling to auto
        -
          ParameterName: "wlm_json_configuration"
          ParameterValue: "[
                            {
                              \"name\":\"etl_queue\",
                              \"user_group\":[\"etl_user\"],
                              \"auto_wlm\":true,
                              \"queue_type\":\"auto\",
                              \"priority\":\"highest\"
                            },
                            {
                              \"name\":\"reporting_queue\",
                              \"user_group\":[\"%report%\"],
                              \"user_group_wild_card\":1,
                              \"query_group\":[\"%report%\"],
                              \"query_group_wild_card\":1,
                              \"auto_wlm\":true,
                              \"queue_type\":\"auto\",
                              \"priority\":\"normal\",
                              \"concurrency_scaling\":\"auto\"
                            },
                            {
                              \"name\":\"Default queue\",
                              \"auto_wlm\":true,
                              \"priority\":\"lowest\"
                            },
                            {
                              \"short_query_queue\":true
                            }
                          ]"
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      ResourceAction: "resume-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      NodeType: "ra3.4xlarge"
      NumberOfNodes: "2"
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
      ClusterParameterGroupName: !Ref RedshiftClusterParameterGroup
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster
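
After the stack update completes and concurrent queries start to queue, you can check whether concurrency scaling clusters are being engaged by querying the usage system view. This is a minimal sketch based on the SVCS_CONCURRENCY_SCALING_USAGE view, which reports one row per usage period:

select start_time, end_time, queries, usage_in_seconds
from svcs_concurrency_scaling_usage
order by start_time desc;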

Enable or disable audit logging

When you enable audit logging, Amazon Redshift creates and uploads the connection log, user log, and user activity logs to Amazon Simple Storage Service (Amazon S3). You can automate enabling and disabling audit logging using AWS CloudFormation. To enable audit logging, update the base CloudFormation template to add the LoggingProperties property with the following sub-properties:

  • BucketName – The name of an existing S3 bucket where the log files are to be stored.
  • S3KeyPrefix – The prefix applied to the log file names.

Also update the parameter group to change the value of the enable_user_activity_logging parameter to true.

To enable audit logging on cfn-blog-redshift-cluster and deliver log files to the bucket cfn-blog-redshift-cluster-audit-logs with the S3 key prefix cfn-blog, update the current CloudFormation stack RedshiftClusterStack-CFNBlog with the following CloudFormation template:

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  RedshiftClusterParameterGroup: 
    Type: "AWS::Redshift::ClusterParameterGroup"
    Properties: 
      Description: "CFNBlog-Redshift-Cluster-parameter-group"
      ParameterGroupFamily: "redshift-1.0"
      Parameters: 
        - 
          ParameterName: "require_ssl"
          ParameterValue: "true"
        - 
          ParameterName: "max_concurrency_scaling_clusters"
          ParameterValue: "5"
        #add parameter to enable user activity logging
        -
          ParameterName: "enable_user_activity_logging"
          ParameterValue: "true"
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      ResourceAction: "resume-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      NodeType: "ra3.4xlarge"
      NumberOfNodes: "2"
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
      #Add LoggingProperties and its sub-properties
      LoggingProperties:
          BucketName: "cfn-blog-redshift-cluster-audit-logs"
          S3KeyPrefix: "cfn-blog/"
      ClusterParameterGroupName: !Ref RedshiftClusterParameterGroup
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

Follow the steps earlier in this post to update the stack. To disable audit logging, update the cluster’s CloudFormation stack template and remove the LoggingProperties property.
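
While audit logging is enabled, you can also sanity-check that connection events are being captured by querying the connection log system table from any SQL client. The following query is a minimal sketch; the log files delivered to Amazon S3 contain the same events with additional detail:

select recordtime, event, username, remotehost
from stl_connection_log
order by recordtime desc
limit 20;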

Concurrent operations

You can perform multiple cluster management operations using a single CloudFormation stack update. For example, you can perform an elastic resize and rotate encryption keys on the cfn-blog-redshift-cluster using the following CloudFormation template. This template updates the values for NodeType and NumberOfNodes, which triggers an elastic resize operation, and also sets the RotateEncryptionKey property to true, which rotates the encryption key.

AWSTemplateFormatVersion: 2010-09-09
Description: Redshift CFN Blog Cluster Stack
Parameters:
  MasterUserPasswordParam:
    NoEcho: true
    Type: String
    Description: Enter Master User Password
Resources:
  RedshiftCluster:
    Type: "AWS::Redshift::Cluster"
    Properties:
      ResourceAction: "resume-cluster"
      ClusterIdentifier: "cfn-blog-redshift-cluster"
      ClusterType: "multi-node"
      # Change Node type, number of nodes and rotate encryption
      NodeType: "ra3.16xlarge"
      NumberOfNodes: "2"
      RotateEncryptionKey: true
      DBName: "dev"
      MasterUsername: "username"
      Encrypted: true
      MasterUserPassword: !Ref MasterUserPasswordParam
Outputs:
  ClusterName:
    Value: !Ref RedshiftCluster

Conclusion

You have now learned how to automate management operations on Amazon Redshift clusters using AWS CloudFormation. For a full list of properties you can update using this process, see Properties. For more sample CloudFormation templates, see Amazon Redshift template snippets.


About the Authors

Anusha Challa is a Senior Analytics Specialist Solutions Architect at AWS. She has over a decade of experience in building large-scale data warehouses, both on-premises and in the cloud. She provides architectural guidance to our customers on end-to-end data warehousing implementations and migrations.

Shweta Yakkali is a Software Engineer for Amazon Redshift, where she works on developing features for Redshift Cloud infrastructure. She is passionate about innovations in cloud infrastructure and enjoys learning new technologies and building enhanced features for Redshift. She holds an M.S. in Computer Science from Rochester Institute of Technology, New York. Outside of work, she enjoys dancing, painting, and playing badminton.

Zirui Hua is a Software Development Engineer for Amazon Redshift, where he works on developing next-generation features for Redshift. His main focus areas are networking and database proxies. Outside of work, he likes to play tennis and basketball.

Compare different node types for your workload using Amazon Redshift

Post Syndicated from Manash Deb original https://aws.amazon.com/blogs/big-data/compare-different-node-types-for-your-workload-using-amazon-redshift/

Amazon Redshift is a fast, fully managed, widely popular cloud data warehouse that offers different node types to meet the flexible workload patterns of our customers. Amazon Redshift RA3 with managed storage is the newest instance type in Amazon Redshift, which allows you to scale and pay for compute and storage independently, and also allows advanced features like AQUA (Advanced Query Accelerator), cross-cluster data sharing, and cross-Availability Zone cluster relocation.

Many customers start their workload on Amazon Redshift with RA3 instances as their default choice; RA3 currently offers three node sizes.

Node size vCPU RAM (GiB) Managed storage quota per node
ra3.xlplus 4 32 32 TB
ra3.4xlarge 12 96 128 TB
ra3.16xlarge 48 384 128 TB

A very common question we get from customers is “Which RA3 instance type and number of nodes should we choose for our workload?” In this post, we answer that question with the following two-step process:

  1. Use the Amazon Redshift sizing calculator on the Amazon Redshift console, which estimates your cluster configuration based on data size and data access frequency.
  2. Use the Amazon Redshift Node Configuration Comparison utility to find the right configuration for your cluster based on your query performance expectations for sequential or concurrently running queries. If you’re already using Amazon Redshift and want to migrate from your existing DC2 or DS2 instances to RA3, you can follow our recommendations on node count and type when upgrading. Before doing that, you can use the utility to evaluate the new cluster’s performance by replaying your past workloads; it integrates with the Amazon Redshift Simple Replay utility to compare performance metrics across different Amazon Redshift configurations. We describe this utility in detail later in this post.

Amazon Redshift sizing calculator

The Amazon Redshift console provides the Help me choose option when you create a new Amazon Redshift cluster. This option allows you to get an estimate of your cluster configuration based on your data size and data access frequency. In this wizard, you need to provide the following information:

  • The compressed or uncompressed size of your total dataset
  • Whether your data is time-based:
    • If your data is time based, how many months of data the data warehouse contains and how many of those months are queried frequently
    • If your data is not time based, what percentage of data do your queries access frequently

Based on your response, you receive a recommended node configuration, as shown in the following screenshot.

You can use this recommendation to create your Amazon Redshift cluster and quickly get started with your workload to gain insights from your data in the Amazon Redshift data warehouse.

Amazon Redshift Node Configuration Comparison utility

If you have stringent service level agreements (SLAs) for query performance in your Amazon Redshift cluster, or you want to explore different Amazon Redshift configurations based on the price/performance of your workload, you can use the Amazon Redshift Node Configuration Comparison utility. This utility evaluates the performance of your queries against different Amazon Redshift cluster configurations in parallel and compares the results to find the best cluster configuration that meets your needs.

To perform this comparison, you can use the Amazon Redshift sizing calculator recommendations in the previous step as a benchmark and test other configurations with it in parallel to compare performance. If you’re migrating from DC2 or DS2 node types to RA3, you can use our recommendations on node count and type as a benchmark.

Solution overview

The solution uses AWS Step Functions, AWS Lambda, and AWS Batch to run an end-to-end automated orchestration to find the best Amazon Redshift configuration based on your price/performance requirements. We use an AWS CloudFormation template to deploy and run this solution in your AWS account. Along with other resources, this template creates an Amazon Simple Storage Service (Amazon S3) bucket to store all data and metadata related to this process. The following diagram shows the solution architecture.

You need to create a JSON file to provide the following input configurations for your test:

  • Amazon Redshift cluster configurations
  • DDL and load script (optional)
  • Amazon Redshift snapshot identifier (optional)
  • SQL script to conduct sequential and concurrency test (optional)
  • Amazon Redshift audit log location and simple replay time window (optional)

You need to store this file in an existing S3 bucket and then use the following CloudFormation template to deploy this solution. This solution uses AWS CloudFormation to automatically provision all the required resources in your AWS accounts. For more information, see Getting started with AWS CloudFormation.

Deploying the solution also initiates an iteration of this test by invoking a Step Functions state machine in your AWS account.

Prerequisites

If you’re already running an Amazon Redshift workload in production, you can use this solution to replay your past workload using the Amazon Redshift Simple Replay utility. This helps you conduct what-if analysis on Amazon Redshift and evaluate how your workload performs with different configurations. For example, you can use the tool to benchmark your actual workload on a new instance type like RA3, evaluate a new feature like AQUA, or assess different cluster configurations. It replays your workload with the same time interval between connections, transactions, and queries as the source. It also supports extract, transform, and load (ETL) workloads including COPY and UNLOAD statements. The Amazon Redshift Node Configuration Comparison utility uses Simple Replay to automate all these configuration tests. As a prerequisite to use Simple Replay, you need to enable audit logging and user-activity logging in your Amazon Redshift cluster.

Example use case

As an example, assume you have an existing Amazon Redshift cluster with two nodes of DC2.8XLarge instances. You want to evaluate migrating this cluster to RA3 node types. You used the Amazon Redshift sizing calculator in the previous step, which recommends using four nodes of RA3.4XL, but you also want to evaluate the performance with five nodes of RA3.4XL to meet your future growth demands. For that, you want to run five test queries sequentially as well as in 5 and 10 parallel sessions in all these clusters. You also want to replay the workload in the past hour on these clusters and compare their performance.

You also want to test your workload performance with concurrency scaling enabled on those clusters, which helps improve concurrent workloads with consistently fast query performance.

The following table summarizes the five cluster configurations that are evaluated as part of this test.

Node Type Number of Nodes Option
dc2.8xlarge 2 concurrency scaling – off
ra3.4xlarge 4 concurrency scaling – off
ra3.4xlarge 4 concurrency scaling – on
ra3.4xlarge 5 concurrency scaling – off
ra3.4xlarge 5 concurrency scaling – on

To perform this test using the Amazon Redshift Node Configuration Comparison utility, you provide these configurations in a JSON file and store it in an S3 bucket. You then use the provided CloudFormation template to deploy this utility, which performs the end-to-end performance testing in all the clusters in parallel and produces a performance evaluation summary, which can help you decide which configuration works best for you.

JSON file parameters

You need to provide a configuration JSON file to use this solution. The following list explains the input parameters for this JSON file.

  • SNAPSHOT_ID (N/A or an Amazon Redshift snapshot identifier) – Enter the snapshot identifier if you want to create new Amazon Redshift clusters by restoring from a snapshot. The snapshot identifier is mandatory if you’re using Simple Replay in this test to replay your past workload. If you’re using this solution in a different AWS account, make sure to share your Amazon Redshift cluster snapshot with this account. For more information, see How do I copy an Amazon Redshift cluster to a different AWS account. Enter N/A if not applicable.
  • SNAPSHOT_ACCOUNT_ID (N/A or an AWS account ID) – The AWS account ID where the snapshot was created. Enter N/A if not applicable.
  • PARAMETER_GROUP_CONFIG_S3_PATH (N/A or an Amazon S3 URI) – If you use a custom parameter group for this test, provide its Amazon S3 URI. Enter N/A if not applicable. You can get this JSON by running the following command in the AWS Command Line Interface (AWS CLI):
    aws redshift describe-cluster-parameters --parameter-group-name your-custom-param-group --output json
  • DDL_AND_COPY_SCRIPT_S3_PATH (N/A or an Amazon S3 URI) – If you create tables and load data into them before performing the test, provide the Amazon S3 URI of that script. Enter N/A if not applicable.
  • SQL_SCRIPT_S3_PATH (N/A or an Amazon S3 URI) – If you run performance testing of your queries, provide the Amazon S3 URI of the script containing all your SQL commands, delimited by semicolons (;). We recommend keeping only SELECT statements in this script. Enter N/A if not applicable.
  • NUMBER_OF_PARALLEL_SESSIONS_LIST (N/A or comma-separated numbers) – Enter comma-separated numbers to denote the number of parallel sessions in which you want to run the preceding script. Enter N/A if not applicable.
  • SIMPLE_REPLAY_LOG_LOCATION (N/A or an Amazon S3 URI) – If you’re already running an Amazon Redshift workload and your cluster has audit logging enabled, provide the Amazon S3 URI of your Amazon Redshift audit logging location. If you’re using this solution in a different AWS account, copy these logs from your source cluster’s audit logging bucket to the S3 bucket in this account. Enter N/A if not applicable.
  • SIMPLE_REPLAY_EXTRACT_START_TIME (N/A or a time in ISO-8601 format) – If you’re using Simple Replay in this test to replay your past workload, provide the start time of that workload in ISO-8601 format (for example, 2021-01-26T21:41:16+00:00). Enter N/A if not applicable.
  • SIMPLE_REPLAY_EXTRACT_END_TIME (N/A or a time in ISO-8601 format) – If you’re using Simple Replay in this test to replay your past workload, provide the end time of that workload in ISO-8601 format (for example, 2021-01-26T21:41:16+00:00). Enter N/A if not applicable.
  • SIMPLE_REPLAY_EXTRACT_OVERWRITE_S3_PATH (N/A or an Amazon S3 URI) – If you’re using Simple Replay and you want to use a custom extract.yaml file, provide its Amazon S3 URI. Enter N/A if not applicable.
  • SIMPLE_REPLAY_OVERWRITE_S3_PATH (N/A or an Amazon S3 URI) – If you’re using Simple Replay and you want to use a custom replay.yaml file, provide its Amazon S3 URI. Enter N/A if not applicable.
  • AUTO_PAUSE (true or false) – Enter true if you want to automatically pause all Amazon Redshift clusters created for this test except the first one; otherwise, enter false.
  • CONFIGURATIONS (JSON array with the parameters NODE_TYPE, NUMBER_OF_NODES, and WLM_CONFIG_S3_PATH) – Enter a JSON array describing the Amazon Redshift cluster configurations for which you want to perform this test. The following three entries explain its parameters.
  • NODE_TYPE (ra3.xlplus, ra3.4xlarge, ra3.16xlarge, dc2.large, dc2.8xlarge, ds2.xlarge, or ds2.8xlarge) – The Amazon Redshift cluster node type that you want to run for this test.
  • NUMBER_OF_NODES (a number between 1–128) – The number of nodes for your Amazon Redshift cluster.
  • WLM_CONFIG_S3_PATH (N/A or an Amazon S3 URI) – If you want to use custom workload management settings for the different Amazon Redshift clusters, provide the Amazon S3 URI of the WLM configuration. Enter N/A if not applicable.

The following code is a sample configuration JSON file used to implement this example use case:

{
"SNAPSHOT_ID": "redshift-cluster-manual-snapshot",
"SNAPSHOT_ACCOUNT_ID": "123456789012",

"PARAMETER_GROUP_CONFIG_S3_PATH": "s3://node-config-compare-bucket/pg_config.json",

"DDL_AND_COPY_SCRIPT_S3_PATH": "s3://node-config-compare-bucket/ddl.sql",
"SQL_SCRIPT_S3_PATH":"s3://node-config-compare-bucket/test_queries.sql",
"NUMBER_OF_PARALLEL_SESSIONS_LIST": "1,5,10",

"SIMPLE_REPLAY_LOG_LOCATION":"s3://redshift-logging-xxxxxxxx/RSLogs/",
"SIMPLE_REPLAY_EXTRACT_START_TIME":"2021-08-28T11:15:00+00:00",
"SIMPLE_REPLAY_EXTRACT_END_TIME":"2021-08-28T12:00:00+00:00",

"SIMPLE_REPLAY_EXTRACT_OVERWRITE_S3_PATH":"N/A",
"SIMPLE_REPLAY_OVERWRITE_S3_PATH":"N/A",

"AUTO_PAUSE": true,

"CONFIGURATIONS": [
	{
	"NODE_TYPE": "dc2.8xlarge",
	"NUMBER_OF_NODES": "2",
	"WLM_CONFIG_S3_PATH": "s3://node-config-compare-bucket/source-wlm.json"
	},	
	{
	"NODE_TYPE": "ra3.4xlarge",
	"NUMBER_OF_NODES": "4",
	"WLM_CONFIG_S3_PATH": "s3://node-config-compare-bucket/source-wlm.json"
	},
	{
	"NODE_TYPE": "ra3.4xlarge",
	"NUMBER_OF_NODES": "4",
	"WLM_CONFIG_S3_PATH": "s3://node-config-compare-bucket/wlm-concurrency-scaling.json"
	},
{
	"NODE_TYPE": "ra3.4xlarge",
	"NUMBER_OF_NODES": "5",
	"WLM_CONFIG_S3_PATH": "s3://node-config-compare-bucket/source-wlm.json"
	},
{
	"NODE_TYPE": "ra3.4xlarge",
	"NUMBER_OF_NODES": "5",
	"WLM_CONFIG_S3_PATH": "s3://node-config-compare-bucket/ wlm-concurrency-scaling.json"
	}
]
}

Make sure to use the same S3 bucket to store all your configurations for this test. For example, we use the S3 bucket node-config-compare-bucket to store all configuration scripts in the preceding JSON configuration. After you populate all the parameters in this JSON file, save the file in the same S3 bucket in your AWS account.

Deploy the solution using AWS CloudFormation

After you save the configuration JSON file in an S3 bucket, you can use the provided CloudFormation template to deploy this solution, which also initiates an iteration of this test. This template provisions the required AWS resources except for the Amazon Redshift clusters, which are created in the subsequent step by a Step Functions state machine. The following list describes the required parameters in the template.

  • ConfigJsonS3Path (Amazon S3 URI) – Enter the Amazon S3 URI where you stored your JSON configuration file from the previous step. The template grants access on this S3 bucket to the appropriate AWS resources created by this solution.
  • ClusterIdentifierPrefix (prefix for Amazon Redshift cluster identifiers) – Enter a valid string, such as rs, to be used as the prefix of the Amazon Redshift cluster identifiers created by this solution.
  • PreExistingS3BucketToGrantRedshiftAccess (N/A or an Amazon S3 bucket name) – If you’re using Simple Replay, provide the Amazon Redshift audit logging bucket name so the template can grant appropriate permissions to the AWS resources. You can also add an existing S3 bucket in the same AWS Region that can be accessed by Amazon Redshift. Enter N/A if not applicable.
  • GrantS3ReadOnlyAccessToRedshift (Yes or No) – If you’re using Simple Replay in the same AWS account as the source Amazon Redshift cluster, enter Yes, which grants AmazonS3ReadOnlyAccess to the new Amazon Redshift clusters so they can replay COPY statements within the account. Otherwise, enter No; COPY statements can’t be replayed in a different AWS account without configuring access manually.
  • SourceRedshiftClusterKMSKeyARN (N/A or an AWS KMS key ARN) – Provide the AWS Key Management Service (AWS KMS) key ARN (Amazon Resource Name) if your source Amazon Redshift cluster is encrypted (available on the stack Outputs tab). If you’re using Simple Replay and your source cluster is encrypted, you need to run extract and replay in the same AWS account. Enter N/A if not applicable.
  • OnPremisesCIDR (CIDR notation) – The IP range (in CIDR notation) for your existing infrastructure to access the target and replica clusters from a SQL client. If unsure, enter your corporate desktop’s CIDR address. For instance, if your desktop’s IP address is 10.156.87.45, enter 10.156.87.45/32.
  • VPC (VPC ID) – An existing Amazon Virtual Private Cloud (Amazon VPC) where you want to deploy the clusters and Amazon Elastic Compute Cloud (Amazon EC2) instances.
  • SubnetId (subnet ID) – An existing subnet within the VPC in which you deploy the Amazon Redshift clusters and the AWS Batch compute environment.

Performance evaluation

The CloudFormation stack deployed in the previous step runs a Step Functions state machine to orchestrate the end-to-end workflow. Navigate to Step Functions on the AWS Management Console to view the state machine deployed by this solution, as shown in the following screenshot.

This state machine creates the Amazon Redshift clusters you defined in the configuration JSON file. Then, it runs the performance evaluation tests on these clusters. If you provided the value true for parameter AUTO_PAUSE in the input JSON file, it also pauses the Amazon Redshift clusters except for the first cluster. At the end, it unloads the performance comparison results to your Amazon S3 bucket.

The following screenshot shows the clusters that were created as part of this state machine run for our use case. The state machine automatically paused all clusters except the first one.

This state machine creates an external schema redshift_config_comparison and three external tables (comparison_stats, cluster_config, and pricing) in that schema to read the raw data that this solution creates in an S3 bucket. Based on these external tables, it creates the views redshift_config_comparison_results and redshift_config_comparison_raw in the public schema of your Amazon Redshift clusters to compare their performance metrics. You can access these objects from any of the Amazon Redshift clusters created by this solution.
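
To confirm that these external objects exist, you can query the SVV_EXTERNAL_TABLES system view from any of those clusters. The following query is a minimal sketch:

select schemaname, tablename, location
from svv_external_tables
where schemaname = 'redshift_config_comparison';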

REDSHIFT_CONFIG_COMPARISON_RESULTS

This view provides the aggregated comparison summary of your Amazon Redshift clusters.

It provides the raw value and a percentage improvement for KPIs such as total, mean, median, and maximum query times, as well as the 75th and 90th percentiles, to help you compare and find the most performant cluster among the different cluster configurations. The test_type column in this view indicates whether the test replayed your past workload using the Simple Replay utility or ran a concurrency test with your queries in parallel at different concurrency levels. We use the following query to view our outcome:

select * from public.redshift_config_comparison_results;

The following table summarizes the performance comparison results:

Test Type Cluster Identifier Total Query Time in Seconds Improved Total Query Time Mean Query Time in Seconds Improved Mean Query Time Median Query Time in Seconds Improved Median Query Time
simple-replay rs-dc2-8xlarge-2 98.75 0% 4.94 0% 5.11 0%
simple-replay rs-ra3-4xlarge-4 94.16 5% 4.71 5% 4.08 25%
simple-replay rs-ra3-4xlarge-4-cs 19.36 410% 0.97 409% 0.68 651%
simple-replay rs-ra3-4xlarge-5 16.21 509% 0.85 481% 0.65 686%
simple-replay rs-ra3-4xlarge-5-cs 14.66 574% 0.73 577% 0.6 752%
concurrency-1 rs-dc2-8xlarge-2 49.6 0% 9.92 0% 10.67 0%
concurrency-1 rs-ra3-4xlarge-4 45.2 10% 9.51 4% 10.3 4%
concurrency-1 rs-ra3-4xlarge-4-cs 45.2 10% 9.51 4% 10.4 3%
concurrency-1 rs-ra3-4xlarge-5 43.93 13% 8.79 13% 7.7 39%
concurrency-1 rs-ra3-4xlarge-5-cs 43.94 13% 8.78 13% 7.78 37%
concurrency-5 rs-dc2-8xlarge-2 24.8 0% 0.99 0% 0.79 8%
concurrency-5 rs-ra3-4xlarge-4 21.24 17% 0.85 16% 0.7 22%
concurrency-5 rs-ra3-4xlarge-4-cs 22.49 10% 0.9 10% 0.85 0%
concurrency-5 rs-ra3-4xlarge-5 19.73 26% 0.79 25% 0.72 19%
concurrency-5 rs-ra3-4xlarge-5-cs 18.73 32% 0.75 32% 0.72 18%
concurrency-10 rs-dc2-8xlarge-2 98.2 0% 1.96 0% 1.63 0%
concurrency-10 rs-ra3-4xlarge-4 85.46 15% 1.71 15% 1.58 3%
concurrency-10 rs-ra3-4xlarge-4-cs 88.09 11% 1.76 11% 1.59 2%
concurrency-10 rs-ra3-4xlarge-5 77.54 27% 1.55 26% 1.36 20%
concurrency-10 rs-ra3-4xlarge-5-cs 74.68 31% 1.49 32% 1.31 24%

Table 1: Summary of Performance Comparison Results

Based on these results, we can conclude that five nodes of RA3.4XLarge with concurrency scaling enabled was the best-performing configuration.

REDSHIFT_CONFIG_COMPARISON_RAW

This view provides the query-level comparison summary of your Amazon Redshift clusters. We use the following query:

select * from public.redshift_config_comparison_raw;

The following table shows the performance comparison results per query:

Query Hash Cluster Identifier Execution Time in Seconds Total Query Time in Seconds Compile Time in Seconds Queue Time in Seconds Username Query
0531f3b54885afb rs-dc2-8xlarge-2 2 7 5 0 awsuser 599
0531f3b54885afb rs-ra3-4xlarge-4 2 5 3 0 awsuser 510
0531f3b54885afb rs-ra3-4xlarge-4-cs 2 2 0 0 awsuser 499
0531f3b54885afb rs-ra3-4xlarge-5 1 6 4 0 awsuser 498
0531f3b54885afb rs-ra3-4xlarge-5-cs 1 1 0 0 awsuser 457
10ef3990f05c9f8 rs-dc2-8xlarge-2 0 0 0 0 awsuser 516
10ef3990f05c9f8 rs-ra3-4xlarge-4 0 0 0 0 awsuser 427
10ef3990f05c9f8 rs-ra3-4xlarge-4-cs 0 0 0 0 awsuser 423
10ef3990f05c9f8 rs-ra3-4xlarge-5 0 0 0 0 awsuser 412
10ef3990f05c9f8 rs-ra3-4xlarge-5-cs 0 0 0 0 awsuser 390

Table 2: Performance Comparison Per Query

Access permissions and security

To deploy this solution, you need administrator access on the AWS accounts where you plan to deploy the AWS CloudFormation resources for this solution.

The CloudFormation template provisions all the required resources using security best practices based on the principle of least privilege, and hosts all resources within your account’s VPC. Access to the Amazon Redshift clusters is controlled with the CloudFormation template parameter OnPremisesCIDR, which you provide to allow on-premises users to connect to the new clusters using their SQL clients on the Amazon Redshift port.

Access permissions for all the resources are controlled using AWS Identity and Access Management (IAM) roles granting appropriate permissions to Amazon Redshift, AWS Lambda, AWS Step Functions, AWS Glue, and AWS Batch. Read and write access privileges are granted to the Amazon Redshift clusters and AWS Batch jobs on the S3 bucket created by the CloudFormation template so that it can read and update data and metadata configurations from that bucket. Read and write access privileges are also granted on the S3 bucket where the user configuration JSON file is uploaded.

Troubleshooting

There might be some rare instances in which failures occur in the state machine running this solution. To troubleshoot, refer to its logs, along with logs from the AWS Batch jobs in Amazon CloudWatch Logs. To view the AWS Batch logs, navigate to the Amazon CloudWatch console and choose Logs in the navigation pane. Find the log group with name <Your CloudFormation Stack Name>/log and choose the latest log streams.

To view the Step Functions logs, navigate to the state machine’s latest run on the Step Functions console and choose CloudWatch Logs for the failed Step Functions step.

After you fix the issue, you can restart the state machine by choosing New execution.

Clean up

Running this template in your AWS account may have some cost implications because it provisions new Amazon Redshift clusters, which may be charged as on-demand instances if you don’t have Reserved Instances. When you complete your evaluations, we recommend deleting the Amazon Redshift clusters to save cost. We also recommend pausing your clusters when not in use. If you don’t plan to run this test in future, you should also delete the CloudFormation stack, which deletes all the resources it created.

Conclusion

In this post, we walked you through the process of finding the right size for your Amazon Redshift cluster based on your performance requirements. You can start with our easy-to-use, out-of-the-box Amazon Redshift sizing calculator, or you can use the Amazon Redshift Node Configuration Comparison tool to evaluate the performance of your workload and find the best cluster configuration that meets your needs.


About the Authors

Manash Deb is a Software Development Engineer in the AWS Directory Service team. He has worked on building end-to-end applications with different database technologies for over 15 years. He loves learning new technologies and solving, automating, and simplifying customer problems on AWS.

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 13 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with cloud solutions.

Migrate to an Amazon Redshift Lake House Architecture from Snowflake

Post Syndicated from Soujanya Konka original https://aws.amazon.com/blogs/big-data/migrate-to-an-amazon-redshift-lake-house-architecture-from-snowflake/

The need to derive meaningful and timely insights increases proportionally with the amount of data being collected. Data warehouses play a key role in storing, transforming, and making data easily accessible to enable a wide range of use cases, such as data mining, business intelligence (BI) and reporting, and diagnostics, as well as predictive, prescriptive, and cognitive analysis.

Several new features of Amazon Redshift address a wide range of data requirements and improve the performance of extract, load, and transform (ELT) jobs and queries. Examples include concurrency scaling, the new RA3 instance types, elastic resize, materialized views, federated query, which allows you to query data stored in your Amazon Aurora or Amazon Relational Database Service (Amazon RDS) PostgreSQL operational databases directly from Amazon Redshift, and the SUPER data type, which can store semi-structured data or documents as values. The new distributed and hardware-accelerated cache with AQUA (Advanced Query Accelerator) for Amazon Redshift delivers up to 10 times more performance than other cloud warehouses. The machine learning (ML) based self-tuning capability that sets sort and distribution keys for tables significantly improves query performance, which previously had to be handled manually. For the latest feature releases for AWS services, see What’s New with AWS?

To take advantage of these capabilities and future innovation, you need to migrate from your current data warehouse, like Snowflake, to Amazon Redshift, which involves two primary steps:

  • Migrate raw, transformed, and prepared data from Snowflake to Amazon Simple Storage Service (Amazon S3)
  • Reconfigure data pipelines to move data from sources to Amazon Redshift and Amazon S3, which provide a unified, natively integrated storage layer of our Lake House Architecture

In this post, we show you how to migrate data from Snowflake to Amazon Redshift. We cover the second step, reconfiguring pipelines, in a later post.

Solution overview

Our solution is designed in two stages, as illustrated in the following architecture diagram.

The first part of our Lake House Architecture is to ingest data into the data lake. We use AWS Glue Studio with AWS Glue custom connectors to connect to the source Snowflake database and extract the tables we want and store them in Amazon S3. To accelerate extracting business insights, we load the frequently accessed data into an Amazon Redshift cluster. The infrequently accessed data is cataloged in the AWS Glue Data Catalog as external tables that can be easily accessed from our cluster.

For this post, we consider three tables: Customer, Lineitem, and Orders, from the open-source TPCH_SF10 dataset. An AWS Glue ETL job, created by AWS Glue Studio, moves the Customer and Orders tables from Snowflake into the Amazon Redshift cluster, and the Lineitem table is copied to Amazon S3 as an external table. A view is created in Amazon Redshift to combine internal and external datasets.

Prerequisites

Before we begin, complete the steps required to set up and deploy the solution:

  1. Create an AWS Secrets Manager secret with the credentials to connect to Snowflake: username, password, and warehouse details. For instructions, see Tutorial: Creating and retrieving a secret.
  2. Download the latest Snowflake JDBC JAR file and upload it to an S3 bucket. This bucket is referenced as SnowflakeConnectionbucket in the CloudFormation step.
  3. Identify the tables in your Snowflake database that you want to migrate.

Create a Snowflake connector using AWS Glue Studio

To complete a successful connection, you should be familiar with the Snowflake ecosystem and the associated parameters for Snowflake database tables. These can be passed as job parameters at runtime. The following screenshot from a Snowflake test account shows the parameter values used in the sample job.

The following screenshot shows the account credentials and database from Secrets Manager.

To create your AWS Glue custom connector for Snowflake, complete the following steps:

  1. On the AWS Glue Studio console, under Connectors, choose Create custom connector.
  2. For Connector S3 URL, browse to the S3 location where you uploaded the Snowflake JDBC connector JAR file.
  3. For Name, enter a logical name.
  4. For Connector type, choose JDBC.
  5. For Class name, enter net.snowflake.client.jdbc.SnowflakeDriver.
  6. Enter the JDBC URL base in the following format: jdbc:snowflake://<snowflakeaccountinfo>/?user=${Username}&password=${Password}&warehouse=${warehouse}.
  7. For URL parameter delimiter, enter &.
  8. Optionally, enter a description to identify your connector.
  9. Choose Create connector.

Set up a Snowflake JDBC connection

To create a JDBC connection to Snowflake, complete the following steps:

  1. On the AWS Glue Studio console, choose Connectors.
  2. Choose the connector you created.
  3. Choose Create connection.

  4. For Name and Description, enter a logical name and description for your reference.
  5. For Connection credential type, choose default.
  6. For AWS Secret, choose the secret created as a part of the prerequisites.
  7. Optionally, you can specify the credentials in plaintext format.
  8. Under Additional options, add the following key-value pairs:
    1. Key db with the Snowflake database name
    2. Key schema with the Snowflake database schema
    3. Key warehouse with the Snowflake warehouse name
  9. Choose Create connection.

Configure other resources and permissions using AWS CloudFormation

In this step, we create additional resources with AWS CloudFormation, including an Amazon Redshift cluster, AWS Identity and Access Management (IAM) roles with policies, an S3 bucket, and AWS Glue jobs to copy tables from Snowflake to Amazon S3 and from Amazon S3 to Amazon Redshift.

  1. Sign in to the AWS Management Console as an IAM power user, preferably an admin user.
  2. Choose your Region as us-east-1.
  3. Choose Launch Stack:
  4. Choose Next.
  5. For Stack name, enter a name for the stack, for example, snowflake-to-aws-blog.
  6. For Secretname, enter the secret name created in the prerequisites.
  7. For SnowflakeConnectionName, enter the Snowflake JDBC connection you created.
  8. For Snowflake Connection bucket, enter the name of the S3 bucket where the Snowflake connector is uploaded.
  9. For SnowflakeTableNames, enter the list of tables to migrate from Snowflake. For example, Lineitem,customers,order.
  10. For RedshiftTableNames, enter the list of the tables to load into your warehouse (Amazon Redshift). For example, customers,order.
  11. You can specify your choice of Amazon Redshift node type, number of nodes, and Amazon Redshift username and password, or use the default values.
  12. For the MasterUserPassword, enter a password for your master user, keeping in mind the following constraints: it must be 8 to 64 characters in length, and it must contain at least one uppercase letter, one lowercase letter, and one number.
  13. Choose Create stack.

Run AWS Glue jobs for the data load

The stack takes about 7 minutes to complete. After the stack is deployed successfully, perform the following actions:

  1. On the AWS Glue Studio console, under Databases, choose Connections.
  2. Select the connection redshiftconnection from the list and choose Test Connection.
  3. Choose the IAM role ExecuteGlueSnowflakeJobRole from the drop-down menu and choose Test connection.

If you receive an error, verify or edit the username and password and try again.

  1. After the connection is tested successfully, on the AWS Glue Studio console, select the job Snowflake-s3-load-job.
  2. On the Action menu, choose Run job.

When the job is complete, all the tables mentioned in the SnowflakeTableNames parameter are loaded into your S3 bucket. The time it takes to complete this job varies depending on the number and size of the tables.

Now we load the identified tables in Amazon Redshift.

  1. Run the job s3-redshift-load-job.
  2. After the job is complete, navigate to the Amazon Redshift console.
  3. Use the query editor to connect to your cluster to verify that the tables specified in RedshiftTableNames are loaded successfully.

You can now view and query datasets from Amazon Redshift. The Lineitem dataset is on Amazon S3 and queried by Amazon Redshift Spectrum. The following screenshot shows how to create an Amazon Redshift external schema that allows you to query Amazon S3 data from Amazon Redshift.
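
A minimal sketch of such a statement follows. The schema name spectrum_schema, the Data Catalog database snowflake_migration_db, and the IAM role ARN are illustrative placeholders; substitute the names created in your account:

create external schema spectrum_schema
from data catalog
database 'snowflake_migration_db'
iam_role 'arn:aws:iam::111122223333:role/MyRedshiftSpectrumRole'
create external database if not exists;

After the external schema exists, you can query the Lineitem data in Amazon S3 like any other table, for example: select count(*) from spectrum_schema.lineitem;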

Tables loaded to Amazon Redshift managed storage appear as in the following screenshot.

The AWS Glue job, using the standard worker type to move Snowflake data into Amazon S3, completed in approximately 21 minutes, loading overall 2.089 GB (about 76.5 million records). The following screenshot from the Snowflake console shows the tables and their sizes, which we copied to Amazon S3.

You have the ability to customize the AWS Glue worker type, worker nodes, and max concurrency to adjust distribution and workload.

AWS Glue allows parallel data reads from the data store by partitioning the data on a column. You must specify the partition column, the lower partition bound, the upper partition bound, and the number of partitions. This feature enables you to use data parallelism and multiple Spark executors allocated to the Spark application.

This completes our migration from Snowflake to Amazon Redshift, which enables a Lake House Architecture and the ability to analyze data in more ways. We would like to take it a step further and discuss features of Amazon Redshift that can help extend this architecture for data democratization and modernize your data warehouse.

Modernize your data warehouse

Amazon Redshift powers the Lake House Architecture, which enables queries from your data lake, data warehouse, and other stores. Amazon Redshift can access the data lake using Redshift Spectrum. Amazon Redshift automatically engages nodes from a separate fleet of Redshift Spectrum nodes. These nodes run queries directly against Amazon S3, run scans and aggregations, and return the data to the compute nodes for further processing.

AWS Lake Formation provides a governance solution for data stored in an Amazon S3-based data lake and offers a central permission model with fine-grained access controls at the column and row level. Lake Formation uses the AWS Glue Data Catalog as a central metadata repository and makes it simple to ingest and catalog data using blueprints and crawlers.

The following screenshot shows the tables from Snowflake represented in the AWS Glue Data Catalog and managed by Lake Formation.

With the Amazon Redshift data lake export feature, you can also save data back in Amazon S3 in open formats like Apache Parquet, to use with other analytics services like Amazon Athena and Amazon EMR.
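
As an illustration, a hedged sketch of exporting a table back to the data lake in Parquet format looks like the following; the bucket name and IAM role ARN are placeholders:

unload ('select * from orders')
to 's3://<your-datalake-bucket>/export/orders/'
iam_role 'arn:aws:iam::111122223333:role/MyRedshiftSpectrumRole'
format as parquet;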

Distributed storage

Amazon Redshift RA3 gives the flexibility to scale compute and storage independently. Amazon Redshift data is stored on Amazon Redshift managed storage backed by Amazon S3. Distribution of datasets between cluster storage and Amazon S3 allows you to benefit from bringing the appropriate compute to the data depending on your use case. You can query data from Amazon S3 without accessing Amazon Redshift.

Let’s look at an example with a star schema. We can save a fact table that we expect to grow rapidly in Amazon S3, with its schema saved in the Data Catalog, and keep dimension tables in cluster storage. You can then use views that union data from both Amazon S3 and the attached Amazon Redshift managed storage.

Another model for data distribution can be based on whether data is hot or cold, with hot data in Amazon Redshift managed storage and cold data in Amazon S3. In this example, we have the datasets lineitem, customer, and orders. The customer and orders datasets are infrequently updated compared to lineitem. We can create an external table to read lineitem data from Amazon S3, with its schema in the Data Catalog database, and load customer and orders into Amazon Redshift tables. The following screenshot shows a join query between the datasets.
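
A sketch of such a join, assuming the standard TPC-H column names and the illustrative external schema name spectrum_schema for the Amazon S3 data, might look like the following:

select c.c_mktsegment,
       sum(l.l_extendedprice * (1 - l.l_discount)) as revenue
from spectrum_schema.lineitem l
join orders o on l.l_orderkey = o.o_orderkey
join customer c on o.o_custkey = c.c_custkey
group by c.c_mktsegment
order by revenue desc;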

It would be interesting to know the overall run statistics for this query, which can be queried from system tables. The following code gets the stats from the preceding query using svl_s3query_summary:

select elapsed, s3_scanned_rows, s3_scanned_bytes,
s3query_returned_rows, s3query_returned_bytes, files, avg_request_parallelism
from svl_s3query_summary
where query = 1918
order by query, segment;

The following screenshot shows query output.

For more information about this query, see Using the SVL_QUERY_SUMMARY view.

Automated table optimization

Distribution and sort keys are table properties that define how data is physically stored, and they can be managed automatically by Amazon Redshift. Automatic table optimization continuously observes how queries interact with tables and uses ML to select the best sort and distribution keys to optimize performance for the cluster's workload. When Amazon Redshift determines that a key will enhance performance, it alters the tables automatically.

In the preceding scenario, the lineitem table had distkey (L_ORDERKEY), the customer table had distribution ALL, and orders had distkey (O_ORDERKEY).

Storage optimization

Your choice of data format (such as JSON, CSV, or Parquet) depends on the data size and how it's queried. Redshift Spectrum currently supports Avro, CSV, Grok, Amazon Ion, JSON, ORC, Parquet, RCFile, RegexSerDe, Sequence, Text, and TSV data formats. When you choose your format, consider the overall data scanned and I/O efficiency, such as with a small dataset in CSV or JSON format versus the same dataset in columnar Parquet format. For smaller scans, Parquet consumes more compute capacity than CSV and may end up taking around the same time as CSV. In most cases, Parquet is the optimal choice, but you need to consider other inputs like volume, cost, and latency.

SUPER data type

The SUPER data type offers native support for semi-structured data. It supports nested data formats such as JSON and Ion files. This allows you to ingest, store, and query nested data natively in Amazon Redshift. You can store JSON formatted data in SUPER columns.

You can query the SUPER data type through an easy-to-use SQL extension powered by PartiQL. PartiQL is a SQL-compatible query language that makes it easy to efficiently query data regardless of its format, whether the data is structured or semi-structured.
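
The following sketch (with hypothetical table and column names) shows how a JSON document can be stored in a SUPER column and navigated with PartiQL:

CREATE TABLE customer_orders (
  customer_id INT,
  order_doc   SUPER
);

-- store a JSON document natively in the SUPER column
INSERT INTO customer_orders
SELECT 100, JSON_PARSE('{"order_id": 5001, "items": [{"sku": "A1", "qty": 2}]}');

-- navigate the nested structure with PartiQL dot and bracket notation
SELECT o.customer_id,
       o.order_doc.order_id,
       o.order_doc.items[0].sku
FROM customer_orders o;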

Pause and resume

Pause and resume lets you easily start and stop a cluster to save costs for intermittent workloads. This way, you can cost-effectively manage a cluster with infrequently accessed data.

You can apply pause and resume via the console, API, and user-defined schedules.

AQUA

AQUA (Advanced Query Accelerator) for Amazon Redshift is a large high-speed cache architecture on top of Amazon S3 that can scale out to process data in parallel across many nodes. It flips the current paradigm of bringing the data to the compute: AQUA brings the compute to the storage layer so the data doesn't have to move back and forth between the two, which enables Amazon Redshift to run queries much faster.

Data sharing

The data sharing feature allows multiple Amazon Redshift clusters to query data located in RA3 clusters and their managed storage. This is ideal for workloads that are isolated from each other but need to share data for cross-group collaboration without actually copying it.
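
The following sketch outlines the flow, using hypothetical object names and placeholder namespace GUIDs. On the producer cluster, you create a datashare, add objects to it, and grant usage to the consumer namespace; on the consumer cluster, you create a database from the datashare and query it:

-- on the producer cluster
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA public;
ALTER DATASHARE sales_share ADD TABLE public.sales_local;
GRANT USAGE ON DATASHARE sales_share TO NAMESPACE '<consumer-namespace-guid>';

-- on the consumer cluster
CREATE DATABASE sales_db FROM DATASHARE sales_share OF NAMESPACE '<producer-namespace-guid>';
SELECT COUNT(*) FROM sales_db.public.sales_local;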

Concurrency scaling

Amazon Redshift automatically adds transient clusters in seconds to serve sudden spikes in concurrent requests with consistently fast performance. For every 24 hours that your main cluster runs, you accrue 1 hour of concurrency scaling at no charge.

Conclusion

In this post, we discussed an approach to migrate a Snowflake data warehouse to a Lake House Architecture with a central data lake accessible through Amazon Redshift.

We covered how to use AWS Glue to move data from sources like Snowflake into your data lake, catalog it, and make it ready to analyze in a few simple steps. We also saw how to use Lake Formation to enable governance and fine-grained security in the data lake. Lastly, we discussed several new features of Amazon Redshift that make it easy to use, perform better, and scale to meet business demands.


About the Authors

Soujanya Konka is a Solutions Architect and Analytics Specialist at AWS, focused on helping customers build their ideas on the cloud. She has expertise in the design and implementation of business information systems and data warehousing solutions. Before joining AWS, Soujanya worked at companies such as HSBC and Cognizant.

Shraddha Patel is a Solutions Architect and Big data and Analytics Specialist at AWS. She works with customers and partners to build scalable, highly available and secure solutions in the AWS cloud.

Implement a slowly changing dimension in Amazon Redshift

Post Syndicated from Milind Oke original https://aws.amazon.com/blogs/big-data/implement-a-slowly-changing-dimension-in-amazon-redshift/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. A star schema is a database organization structure optimized for use in a data warehouse. In a star schema, a dimension is a structure that categorizes the facts and measures in order to enable you to answer business questions. The attributes (or columns) of the dimension table provide the business meaning to the measures of the fact table. Rows in a dimension table are identified using a unique identifier like a customer identification key, and the fact table’s rows have a referential key pointing to the dimension table’s primary key. Dimension and fact tables are joined using the dimension table’s primary key and the fact table’s foreign key.

Over time, the attributes of a given row in a dimension table may change. For example, the shipping address for a customer may change. This phenomenon is called a slowly changing dimension (SCD). For historical reporting purposes, it may be necessary to keep a record of the fact that the customer has a change in address. The range of options for dealing with this involves SCD management methodologies referred to as type 1 to type 7. Type 0 is when no changes are allowed to the dimension, for example a date dimension that doesn’t change. The most common types are 1, 2 and 3:

  • Type 1 (No history) – The dimension table reflects the latest version; no history is maintained
  • Type 2 (Maintain history) – All changes are recorded and versions are tracked with dates and flags
  • Type 3 (Previous value) – The [latest – 1] value for specific columns is maintained as a separate attribute

Prerequisites

For this walkthrough, you should have the following prerequisites:

Overview of solution

This post walks you through the process of implementing SCDs on an Amazon Redshift cluster. We go through the best practices and anti-patterns. To demonstrate this, we use the customer table from the TPC-DS benchmark dataset. We show how to create a type 2 dimension table by adding slowly changing tracking columns, and we go over the extract, transform, and load (ETL) merge technique, demonstrating the SCD process.

The following figure is the process flow diagram.

The following diagram shows how a regular dimensional table is converted to a type 2 dimension table.

Implement slowly changing dimensions

To get started, we use one of two AWS CloudFormation templates from Amazon Redshift Labs:

In this post, we only show the important SQL statements; the complete SQL code is available in scd2_sample_customer_dim.sql.

The first step to implement SCD for a given dimension table is to create the dimension table with SCD tracking attributes. For example, record effective date, record end date, and active record indicator are typically added to track whether a record is active. These fields are collectively referred to as the SCD fields (shown in the following code) throughout this post.

These SCD fields are added so that when a field is changed, for example, a customer’s address, the existing record in the dimension table is updated to indicate that the record isn’t active and a new record is inserted with an active flag. This way, every change to an SCD field is stored in the table and business users can run queries to see historical performance of a dimension for a given change that is being tracked.

We also introduce the following:

  • Record hash value to easily track if the customer data fields have changed their values. This hash column is computed over all the customer fields. This single hash column is compared instead of comparing multiple individual columns to determine if the data has changed.
  • Record insert and update timestamps to capture when the actual dimension row was added to the table and updated.

The following code shows the SCD fields added to the dimension table:

drop table if exists customer_dim cascade;
create table customer_dim ( 
customer_dim_id     bigint generated by default as identity(1, 1), 
c_custkey           bigint distkey, 
c_name              character varying(30), 
c_address           character varying(50), 
c_nationkey         integer, 
c_phone             character varying(20), 
c_acctbal           numeric(12, 2), 
c_mktsegment        character varying(10), 
c_comment           character varying(120), 
track_hash          bigint, 
record_start_ts     timestamp without time zone 
                    default '1970-01-01 00:00:00'::timestamp without time zone, 
record_end_ts       timestamp without time zone 
                    default '2999-12-31 00:00:00'::timestamp without time zone, 
record_active_flag  smallint default 1, 
record_upd_ts       timestamp without time zone default current_timestamp, 
record_insert_ts    timestamp without time zone default current_timestamp 
)
diststyle key 
sortkey (c_custkey);

Next, we perform the initial load to the dimension table. Because this is the first time the dimension records are loaded, the SCD tracking attributes are set to active. For example, the record start date is set to a low date, like 1900-01-01, or to a business date value to reflect when a particular change became effective. The record end date is set to a high date, like 2999-12-31, and the active record indicator is set to 1, indicating these rows are active.
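
A minimal sketch of the initial load might look like the following, assuming the src_customer source table used later in this post; the record start date, end date, and active flag pick up the defaults defined in the earlier DDL, and the complete version is in scd2_sample_customer_dim.sql:

INSERT INTO customer_dim
   (c_custkey, c_name, c_address, c_nationkey, c_phone,
    c_acctbal, c_mktsegment, c_comment, track_hash)
SELECT custkey, name, address, nationkey, phone,
       acctbal, mktsegment, comment,
       FNV_HASH(address, FNV_HASH(phone))
FROM src_customer;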

After the initial load is complete, we create a staging table to load the incremental changes that come from the source system. This table acts as a temporary holding place for incoming records. To identify whether a change has occurred for a given record, we left outer join the customer staging table to the customer dimension table on the customer primary key (c_custkey). We use a left outer join because we want to flag matching records for the update process and unmatched records for the insert process. Left outer joining the staging table to the customer dimension table projects both matched and unmatched rows; matched rows are treated as updates and unmatched rows are treated as inserts.

In our data warehouse system, let’s assume we have to meet the following criteria:

  • Track changes on the address and phone fields only—type 2 with start and end timestamps
  • Other attributes are required to be kept up to date without creating history records—type 1
  • The source system provides incremental delta change records

If your source systems can’t provide delta change records and instead provide a full load every time, the data warehouse needs logic to identify the changed records. For such a workload, we build a second, uniquely identifiable value by using a built-in Amazon Redshift hash function on all the dimension columns to identify the changed rows.

The customer address and phone are being tracked as slowly changing dimensions. We use FNV_HASH to generate a 64-bit signed integer that accommodates 18.4 quintillion unique values. For smaller dimension tables, we can also use CHECKSUM to generate a 32-bit signed integer that accommodates 4.4 billion unique values.
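
As a quick, hypothetical illustration with literal sample values (not the actual pipeline code), both functions can be called directly in SQL:

SELECT FNV_HASH('555-0100', FNV_HASH('100 Main Street')) AS hash_64,
       CHECKSUM('100 Main Street') AS hash_32;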

We determine if the dimension row is a new record by using new_ind, or if the dimension row is changed by comparing the record hash and using track_ind for the change indicator.

Changes are identified by joining the staging table and target table on the primary key. See the following code:

truncate table stg_customer;
insert into stg_customer 
with stg as (
    select
        custkey as stg_custkey, name as stg_name, 
        address as stg_address, nationkey as stg_nationkey, 
        phone as stg_phone, acctbal as stg_acctbal,
        mktsegment as stg_mktsegment, comment as stg_comment, 
        effective_dt as stg_effective_dt,
        FNV_HASH(address,FNV_HASH(phone)) as stg_track_hash
    from
        src_customer
    )
select 
    s.* , 
    case when c.c_custkey is null then 1 else 0 end new_ind,
    case when c.c_custkey is not null 
          and s.stg_track_hash <> track_hash then 1 else 0 end track_ind
 from
    stg s
left join customer_dim c
    on s.stg_custkey = c.c_custkey
;

Rows that aren’t matched (completely new records, where new_ind = 1) are inserted into the dimension table with the SCD tracking attributes set to new values and the active record flag set to 1.

For matched records, two possibilities could happen:

  • SCD type 2 field has changed – For this category, we use a two-step process to retain the previous version of the customer record and also record the latest version of the customer record for type 2 fields in our data warehouse. This satisfies our first business requirement. The steps are as follows:
    • Step 1 – Update the existing record in the target customer dimension table as inactive by setting the record end date to the current timestamp and active record indicator to 0.
    • Step 2 – Insert the new rows from the customer staging table into the customer target table with the record start date set to the current timestamp, record end date set to a high date, and the record active flag set to 1.
  • SCD type 1 field has changed – For this category, the row in the customer target table is updated directly with the latest rows from staging table. While doing so, we don’t update any SCD tracking date fields or flags. With this step, we retain only the latest version of the record for type 1 fields in our data warehouse. This satisfies our second business requirement.

Apply changes to the dimension table with the following code:

-- merge changes to dim customer
begin transaction;

-- close current type 2 active record based of staging data where change indicator is 1
update customer_dim
set record_end_ts = stg_effective_dt - interval '1 second',
    record_active_flag = 0,
    record_upd_ts = current_timestamp 
from stg_customer
where c_custkey = stg_custkey
and record_end_ts = '2999-12-31'
and track_ind = 1;

-- create latest version type 2 active record from staging data
-- this includes Changed + New records
insert into customer_dim
   (c_custkey,c_name,c_address,c_nationkey,c_phone,c_acctbal,
    c_mktsegment,c_comment,track_hash,record_start_ts,record_end_ts, 
    record_active_flag, record_insert_ts, record_upd_ts) 
select
    stg_custkey, stg_name, stg_address, stg_nationkey, stg_phone,
    stg_acctbal, stg_mktsegment, stg_comment, stg_track_hash, 
    stg_effective_dt as record_start_ts, '2999-12-31' as record_end_ts,
    1 as record_active_flag, current_timestamp as record_insert_ts, 
    current_timestamp as record_upd_ts
from
    stg_customer
where
    track_ind = 1 or new_ind = 1;

-- update type 1 current active records for non-tracking attributes
update customer_dim
set c_name = stg_name,
    c_nationkey = stg_nationkey,
    c_acctbal = stg_acctbal,
    c_mktsegment = stg_mktsegment,
    c_comment = stg_comment,
    record_upd_ts = current_timestamp
from
    stg_customer
where
    c_custkey = stg_custkey
and record_end_ts = '2999-12-31'
and track_ind = 0 and new_ind = 0;

-- end merge operation
commit transaction;

Best practices

The Amazon Redshift cloud data warehouse can process a large number of updates efficiently. To achieve this, have a staging table that shares the same table definition as your target dimension table. Then, as shown in the earlier code snippet, you can join the staging and the target dimension tables and perform the update and insert in a transaction block. This operation performs bulk updates and inserts on the target table, yielding good performance.

The Amazon Redshift shared nothing architecture typically performs at its peak when operations can be run by each node independently with minimal data movement between nodes. The target customer dimension table and the intermediate staging table created with matched distribution keys provide the best performance because all operations can be completed within the node.

Anti-patterns

You could also approach this by comparing dimension records in a row-by-row fashion using cursors and then updating or inserting a particular row in the target table. Although this method works on smaller tables, for larger tables we advise using the bulk operations method explained in this post.

Clean up

To avoid incurring future charges, you can delete all the resources created by the CloudFormation template by deleting the CloudFormation stack.

Conclusion

In this post, you learned about slowly changing dimensions, implementing SCDs on Amazon Redshift, best practices for running the ETL operations against the target table by using intermediate staging tables, and finally anti-patterns to avoid.

Refer to Amazon Redshift data loading best practices for further materials and additional best practices, and see Updating and inserting new data for instructions to implement updates and inserts.


About the Authors

Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift. He is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.

Bhanu Pittampally is an Analytics Specialist Solutions Architect based out of Dallas. He specializes in building analytical solutions. His background is in data warehouses—architecture, development, and administration. He has been in the data and analytical field for over 13 years.

WeatherBug reduced ETL latency to 30 times faster using Amazon Redshift Spectrum

Post Syndicated from Anton Morozov original https://aws.amazon.com/blogs/big-data/weatherbug-reduced-etl-latency-to-30-times-faster-using-amazon-redshift-spectrum/

This post is co-written with data engineers, Anton Morozov and James Phillips, from Weatherbug.

WeatherBug is a brand owned by GroundTruth, based in New York City, that provides location-based advertising solutions to businesses. WeatherBug offers a mobile app that reports live and forecast hyperlocal weather data to consumers. The WeatherBug Data Engineering team has built a modern analytics platform, completely on AWS, to serve multiple use cases, including weather forecasting and location-based advertising. They use an Amazon Simple Storage Service (Amazon S3) data lake to store clickstream data and Amazon Redshift as their cloud data warehouse platform.

In this post, we share how WeatherBug reduced their extract, transform, and load (ETL) latency using Amazon Redshift Spectrum.

Amazon Redshift Spectrum overview

Amazon Redshift is the most widely used cloud data warehouse. It makes it fast, simple, and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. It allows you to run complex analytic queries against terabytes to petabytes of structured and semi-structured data, using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query execution.

Redshift Spectrum allows you to query open format data directly in the S3 data lake without having to load the data or duplicate your infrastructure. With Redshift Spectrum, you can query open file formats such as Apache Parquet, ORC, JSON, Avro, and CSV. For more information, see Amazon Redshift Spectrum overview and Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes—No Loading Required.

Redshift Spectrum runs on a massive compute fleet independent of your Amazon Redshift cluster. Redshift Spectrum pushes many compute-intensive tasks, such as predicate filtering and aggregation, down to the Redshift Spectrum layer. Therefore, Redshift Spectrum queries use much less of your cluster’s processing capacity than other queries. With Redshift Spectrum, you can efficiently query and retrieve structured and semi-structured data from files in Amazon S3 without having to load the data into Amazon Redshift tables. Redshift Spectrum queries employ massive parallelism to run very fast against large datasets. Much of the processing occurs in the Redshift Spectrum layer, and most of the data remains in Amazon S3. Multiple clusters can concurrently query the same dataset in Amazon S3 without the need to make copies of the data for each cluster.

Previous solution

To summarize performance metrics for internal BI needs, WeatherBug has to move a lot of data between their S3 data lake and their Amazon Redshift cluster using complex ETL processing. They used Apache Airflow to orchestrate the pipeline during the ETL process and Apache Hive for large-scale ETL jobs in order to offload the data from the Amazon Redshift cluster. The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. You can project this structure onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.

WeatherBug chose Hive because it was easy to use with their Python/Airflow pipeline and they were able to launch multiple jobs in parallel. This solution worked well for WeatherBug, but it required more engineering effort to build the ETL processes and added operational overhead because it involved multiple technologies.

The following diagram illustrates this previous architecture.

New Amazon Redshift Spectrum-based solution

To optimize the current Hadoop-based solution to move data between their Amazon Redshift cluster and S3 buckets, the WeatherBug team considered Redshift Spectrum as an alternative solution. The following diagram shows their updated architecture.

WeatherBug created Redshift Spectrum external tables that pointed to the data stored in their S3 buckets. This let them move and transform data from Amazon Redshift to Amazon S3 using Insert into <external table> select from <Redshift table>, and from Amazon S3 to Amazon Redshift using Insert into <Redshift table> select from <external table>, with the data transformations expressed in the inline SQL queries.
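
The following sketch shows the shape of those two statements, with hypothetical table and column names:

-- move data from Amazon Redshift to the S3 data lake
INSERT INTO spectrum_schema.clickstream_s3
SELECT event_id, user_id, event_ts
FROM clickstream_redshift
WHERE event_ts < '2021-01-01';

-- load data from the S3 data lake into Amazon Redshift
INSERT INTO clickstream_redshift
SELECT event_id, user_id, event_ts
FROM spectrum_schema.clickstream_s3
WHERE event_ts >= '2021-01-01';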

During the pilot implementation on a few sample tables, WeatherBug found it very easy to learn how to use the Redshift Spectrum features, and not long after they had a proof of concept far superior to their existing Hadoop-based solution. They reduced the targeted pipeline’s runtime from 17 minutes to 30 seconds, a 3,300% improvement, with the additional benefit of eliminating the cost and management of a Hadoop cluster. They were excited to apply this approach to additional pipelines that support their Amazon Redshift clusters. This was a nice win for them: improved processing times and reduced cost and overhead with low engineering effort.

In addition to these improvements, they replaced some of their jobs that use Apache Hive to query tables with Amazon Redshift Spectrum.

In their initial testing, WeatherBug is seeing costs for Redshift Spectrum of $0.14 per day instead of $16.12 per day on Hive for a selected sample job, which is roughly a 115-fold reduction in cost.

Conclusion

When you have a data lake in Amazon S3 and a cloud data warehouse in Amazon Redshift, you may need frequent ETL jobs between the two systems for different use cases. Amazon Redshift Spectrum provides an easy-to-implement, cost-effective, and high-performance way to interact between Amazon Redshift and Amazon S3: you can query Amazon S3 data from Amazon Redshift, join Amazon Redshift tables with S3 objects, and transform data using simple SQL queries. Many data engineering use cases similar to the WeatherBug example discussed here can be optimized using Amazon Redshift Spectrum.


About the Authors

Anton Morozov is a WeatherBug data engineer working on analytics pipelines. He considers Amazon Redshift an essential technology for understanding data for business intelligence needs.

James Phillips is a WeatherBug data engineer who uses many AWS technologies on a daily basis. Some of these include Amazon Redshift, Amazon EMR, and Amazon SageMaker.

Avijit Goswami is a Principal Solutions Architect at AWS, helping his digital native and startup customers become tomorrow’s enterprises using AWS services. He is part of the analytics specialists field community in AWS and works closely with the WeatherBug Data Engineering team.

Automate your Amazon Redshift performance tuning with automatic table optimization

Post Syndicated from Adam Gatt original https://aws.amazon.com/blogs/big-data/automate-your-amazon-redshift-performance-tuning-with-automatic-table-optimization/

Amazon Redshift is a cloud data warehouse database that provides fast, consistent performance running complex analytical queries on huge datasets scaling into petabytes and even exabytes with Amazon Redshift Spectrum. Although Amazon Redshift has excellent query performance out of the box, with up to three times better price performance than other cloud data warehouses, you can further improve its performance by physically tuning tables in a data model. You do so by sorting table rows and rearranging rows across a cluster’s nodes. In Amazon Redshift, you implement this by setting sort and distribution key table attributes.

In the past, setting sort and distribution keys was an involved manual process that required a skilled resource to analyze a cluster’s workload and choose and implement the right keys for every table in the data model. More recently, Amazon Redshift Advisor provided suggestions, but these still had to be manually implemented. At AWS re:Invent 2020, Amazon Redshift announced a new feature to automate this process: automatic table optimization (ATO). ATO automatically monitors a cluster’s workload and table metadata, runs artificial intelligence algorithms over the observations, and implements sort and distribution keys online in the background, without requiring any manual intervention, and without interrupting any running queries.

In this post, I explain what sort and distribution keys are and how they improve query performance. I also explain how ATO works and how to enable and disable it. Then I outline the steps to set up and run a test of ATO on the Cloud DW benchmark derived from TPC-H using a 30 TB dataset. Finally, I present the results of a test that show ATO improved performance on this benchmark, without requiring any manual tuning.

Distribution and sort keys

In this section, I give a high-level overview of distribution and sort keys, then I explain how they’re automatically set by ATO.

Distribution keys

Amazon Redshift has a massively parallel processing (MPP) architecture, where data is distributed across multiple compute nodes (see the following diagram). This allows Amazon Redshift to run queries against each compute node in parallel, dramatically increasing query performance.

To achieve the best possible query performance, data needs to be distributed across the compute nodes in a way that is optimal to the specific workload that is being run on the cluster. For example, the optimal way to distribute data for tables that are commonly joined is to store rows with matching join keys on the same nodes. This enables Amazon Redshift to join the rows locally on each node without having to move data around the nodes. Data distribution also affects the performance of GROUP BY operations.

In Amazon Redshift, the data distribution pattern is determined by two physical table settings: distribution style (DISTSTYLE) and distribution key (DISTKEY).

Amazon Redshift has three distribution styles:

  • All – A copy of the entire table is replicated to every node
  • Even – The data in the table is spread evenly across the nodes in a cluster in a round-robin distribution
  • Key – The data is distributed across the nodes by the values in the column defined as the DISTKEY

If a table’s distribution style is key, then a single column in the table can be set as the DISTKEY.

Sort keys

Sort keys determine how rows are physically sorted in a table. Having table rows sorted improves the performance of queries with range-bound filters. Amazon Redshift stores the minimum and maximum values of each of its data blocks in metadata. When a query filters on a column (or multiple columns), the execution engine can use the metadata to skip blocks that are out of the filter’s range. For example, if a table has a sort key on the column created_date and a query has a filter WHERE created_date BETWEEN '2020-02-01' AND '2020-02-02', the execution engine can identify which blocks don’t contain data for February 1 and 2 given their metadata. The execution engine can then skip over these blocks, reducing the amount of data read and the number of rows that need to be materialized and processed, which improves the query performance.

Sort keys can be set on a single column in a table, or multiple columns (known as a compound sort key). They can also be interleaved.

Manually set distribution and sort keys

When you create a table in Amazon Redshift, you can manually set the distribution style and key, and the sort key in the CREATE TABLE DDL.

The following code shows two simplified example DDL statements for creating a dimension and fact table in a typical star schema model that manually set distribution and sort keys:

CREATE TABLE customer_dim (
   customer_key INT  
  ,customer_id INT
  ,first_name VARCHAR(100)
  ,last_name VARCHAR(100)
  ,join_date DATE) 
DISTKEY ( customer_key );

CREATE TABLE sale_fact (
   date_key DATE SORTKEY
  ,customer_key INT  
  ,product_key INT
  ,sale_amount DECIMAL(7,2))
DISTKEY ( customer_key );

Both tables have the customer_key column set as the distribution key (DISTKEY). When rows are inserted into these tables, Amazon Redshift distributes them across the cluster based on the values of the customer_key column. For example, all rows with a customer_key of 100 are moved to the same node, and likewise all rows with a customer_key of 101 are also moved to the same node (this may not be the same node as the key of 100), and so on for every row in the table.

In the sale_fact table, the date_key column has been set as the sort key (SORTKEY). When rows are inserted into this table, they’re physically sorted on the disk in the order of dates in the date_key column.

After data has been loaded into the tables, we can use the svv_table_info system view to see what keys have been set:

INSERT INTO customer_dim (customer_key, customer_id, first_name, last_name, join_date)
VALUES (100, 1001, 'John', 'Smith', SYSDATE);

INSERT INTO sale_fact (date_key, customer_key, product_key, sale_amount)
VALUES (SYSDATE, 100, 203, 98.76);

SELECT ti."table"      
      ,ti.diststyle      
      ,ti.sortkey1
FROM svv_table_info ti
WHERE ti.database = 'dev'
AND ti.schema = 'public'
AND ti."table" IN ('customer_dim', 'sale_fact')
ORDER BY 1;

The following table shows our results.

table diststyle sortkey1
customer_dim KEY(customer_key) AUTO(SORTKEY)
sale_fact KEY(customer_key) date_key

The sort key for customer_dim has been set to AUTO(SORTKEY), which I explain in the next section.

Now, if we populate these tables and run a business analytics query on them, like the following simplified query, we can see how the distribution and sort keys improve the performance of the query:

SELECT cd.customer_id      
     ,SUM(sf.sale_amount) AS sale_amount
FROM sale_fact sf
INNER JOIN customer_dim cd   
   ON sf.customer_key = cd.customer_key
WHERE sf.date_key BETWEEN '2021-03-01' AND '2021-03-07'
GROUP BY cd.customer_id;

The customer_dim table is joined to the sale_fact table on the customer_key column. Because both the table’s rows are distributed on the customer_key column, this means the related rows (such as customer_key of 100) are co-located on the same node, so when Amazon Redshift runs the query on this node, it doesn’t need to move related rows across the cluster from other nodes (a process known as redistribution) for the join. Also, there is a range bound filter on the date_key column (WHERE sf.date_key BETWEEN '2021-03-01' AND '2021-03-07'). Because there is a sort key on this column, the Amazon Redshift execution engine can more efficiently skip blocks that are out of the filter’s range.

Automatically set distribution and sort keys

Amazon Redshift ATO is enabled by default. When you create a table and don’t explicitly set distribution or sort keys in the CREATE TABLE DDL (as in the previous example), the following happens:

  • The distribution style is set to AUTO(ALL) for small tables and AUTO(EVEN) for large tables.
  • The sort key is set to AUTO(SORTKEY). This means no sort key is currently set on the table.

The AUTO keyword indicates the style and key are being managed by ATO.

For example, we can create a table with no keys explicitly defined:

CREATE TABLE customer_dim (
   customer_key INT  
  ,customer_id INT  
  ,first_name VARCHAR(100)
  ,last_name VARCHAR(100)
  ,join_date DATE);

Then we insert some data and look at the results from svv_table_info:

INSERT INTO customer_dim (customer_key, customer_id, first_name, last_name, join_date)
VALUES (100, 1001, 'John', 'Smith', SYSDATE);

SELECT ti."table"      
      ,ti.diststyle      
      ,ti.sortkey1
FROM svv_table_info ti
WHERE ti.database = 'dev'
AND ti.schema = 'public'
AND ti."table" IN ('customer_dim', 'sale_fact')
ORDER BY 1;

The following table shows our results.

table diststyle sortkey1
customer_dim AUTO(ALL) AUTO(SORTKEY)

The distribution and sort key are being managed by ATO, and the distribution style has already been set to ALL.

ATO now monitors queries that access customer_dim and analyzes the table’s metadata, and makes several observations. An example of an observation is the amount of data that is moved across nodes to perform a join, or the number of times a column was used in a range-scan filter that would have benefited from sorted data.

ATO then analyzes the observations using AI algorithms to determine if the introduction or change of a distribution or sort key will improve the workload’s performance.

For distribution keys, Amazon Redshift constructs a graph representation of the SQL join history, and uses this graph to calculate the optimal table distribution to reduce data transfer across nodes when joining tables (see the following diagram). You can find more details of this process in the scientific paper Fast and Effective Distribution-Key Recommendation for Amazon Redshift.

For sort keys, a table’s queries are monitored for columns that are frequently used in filter and join predicates. A column is then chosen based on the frequency and selectivity of those predicates.

When an optimal configuration is found, ATO implements the new keys in the background, redistributing rows across the cluster and sorting tables. For sort keys, another Amazon Redshift feature, automatic table sort, handles physically sorting the rows in the table, and maintains the sort order over time.

This whole process, from monitoring to implementation, completes in hours to days, depending on the number of queries that are run.

Convert existing tables to automatic optimization

You can enable ATO on existing tables by setting the distribution style and sort key to AUTO with the ALTER TABLE statement. For example:

ALTER TABLE customer_dim ALTER DISTSTYLE AUTO;

ALTER TABLE customer_dim ALTER SORTKEY AUTO;

If the table has existing sort and/or distribution keys that were explicitly set, then currently they will be preserved and won’t be changed by ATO.

Disable automatic table optimization

To disable ATO on a table, you can explicitly set a distribution style or key. For example:

ALTER TABLE customer_dim ALTER DISTSTYLE ALL;

ALTER TABLE customer_dim ALTER DISTSTYLE EVEN;

ALTER TABLE customer_dim ALTER DISTKEY customer_id;

You can then explicitly set a sort key or set the sort key to NONE:

ALTER TABLE customer_dim ALTER SORTKEY (join_date);

ALTER TABLE customer_dim ALTER SORTKEY NONE;

Performance test and results

Cloud DW benchmark derived from TPC-H

TPC-H is an industry standard benchmark designed to measure ad hoc query performance for business analytics workloads. It consists of 8 tables and 22 queries designed to simulate a real-world decision support system. For full details of TPC-H, see TPC BENCHMARK H.

The following diagram shows the eight tables in the TPC-H data model.

The Cloud DW Benchmark is derived from TPC-H and uses the same set of tables, queries, and a 30 TB dataset in Amazon Simple Storage Service (Amazon S3), which was generated using the official TPC-H data generator. This provides an easy way to set up and run the test on your own Amazon Redshift cluster.

Because the Cloud DW benchmark is derived from the TPC-H benchmark, it isn’t comparable to published TPC-H results, because the results of our tests don’t fully comply with the specification. This post uses the Cloud DW benchmark.

Set up the test

The following steps outline how to set up the TPC-H tables on an Amazon Redshift cluster, import the data, and set up the Amazon Redshift scheduler to run the queries.

Create an Amazon Redshift cluster

We recommend running this test on a five-node ra3.16xlarge cluster. You can run the test on a smaller cluster, but the query run times will be slower, and you may need to adjust the frequency the test queries are run at, such as every 4 hours instead of every 3 hours. For more information about creating a cluster, see Step 2: Create a sample Amazon Redshift cluster. We also recommend creating the cluster in the us-east-1 Region to reduce the amount of time required to copy the test data.

Set up permissions

For this test, you require permissions to run the COPY command on Amazon Redshift to load the test data, and permissions on the Amazon Redshift scheduler to run the test queries.

AWS Identity and Access Management (IAM) roles grant permissions to AWS services. The following steps describe how to create and set up two separate roles with the required permissions.

The RedshiftCopyRole role grants Amazon Redshift read-only access on Amazon S3 so it can copy the test data.

  1. On the IAM console, on the Role page, create a new role called RedshiftCopyRole using Redshift – Customizable as the trusted entity use case.
  2. Attach the policy AmazonS3ReadOnlyAccess.

The RedshiftATOTestingRole role grants the required permissions for setting up and running the Amazon Redshift scheduler.

  1. On the IAM console, create a new role called RedshiftATOTestingRole using Redshift – Customizable as the trusted entity use case.
  2. Attach the following policies:
    1. AmazonRedshiftDataFullAccess
    2. AmazonEventBridgeFullAccess
  3. Create a new policy called RedshiftTempCredPolicy with the following JSON and attach it to the role. Replace {DB_USER_NAME} with the name of the Amazon Redshift database user that runs the query.
{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "UseTemporaryCredentials",
			"Effect": "Allow",
			"Action": "redshift:GetClusterCredentials",
			"Resource": [
			"arn:aws:redshift:*:*:dbuser:*/{DB_USER_NAME}"
			]
		}
	]
}

This policy assigns temporary credentials to allow the scheduler to log in to Amazon Redshift.

  1. Check that the role has the policies RedshiftTempCredPolicy, AmazonRedshiftDataFullAccess, and AmazonEventBridgeFullAccess attached.
  2. Now edit the role’s trust relationships and add the following policy:
   {
      "Sid": "S1",
      "Effect": "Allow",
      "Principal": {
        "Service": "events.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }

The role should now have the trusted entities as shown in the following screenshot.

Now you can attach the role to your cluster.

  1. On the Amazon Redshift console, choose your test cluster.
  2. On the Properties tab, choose Cluster permissions.
  3. Select Manage IAM roles and attach RedshiftCopyRole.

It may take a few minutes for the roles to be applied.

The IAM user that sets up and runs the Amazon Redshift scheduler also needs to have the right permissions. The following steps describe how to assign the required permissions to the user.

  1. On the IAM console, on the Users page, choose the user to define the schedule.
  2. Attach the policy AmazonEventBridgeFullAccess directly to the user.
  3. Create a new policy called AssumeATOTestingRolePolicy with the following JSON and attach it to the user. Replace {AWS ACCOUNT_NUMBER} with your AWS account number.
{
	"Version": "2012-10-17",
	"Statement": [
		{
		"Sid": "AssumeIAMRole",
		"Effect": "Allow",
		"Action": "sts:AssumeRole",
		"Resource": "arn:aws:iam::{AWS ACCOUNT_NUMBER}:role/RedshiftATOTestingRole"
		}
	]
}

This policy allows the user to assume the role RedshiftATOTestingRole, which has required permissions for the scheduler.

The user should now have the AssumeATOTestingRolePolicy and AmazonEventBridgeFullAccess policies directly attached.

Create the database tables and copy the test data

The following script is an untuned version of the TPC-H ddl.sql file that creates all required tables for the test and loads them with the COPY command. To untune the tables, all the sort and distribution keys have been removed. The original tuned version is available on the amazon-redshift-utils GitHub repo.

Copy the script into your Amazon Redshift SQL client of choice, replace the <AWS ACCOUNT_NUMBER> string with your AWS account number, and run the script. On a five-node ra3.16xlarge cluster in the us-east-1 Region, the copy should take approximately 3 hours.

The script creates tables in the default schema public. For this test, I have created a database called tpch_30tb, which the script is run on.

ddl_untuned.sql

When the copy commands have finished, run the following queries to see the table’s physical settings:

SELECT ti."table"      
      ,ti.tbl_rows      
      ,ti.size      
      ,ti.diststyle      
      ,ti.sortkey1      
      ,ti.sortkey_num      
      ,ti.encoded
FROM svv_table_info ti
WHERE ti.database = 'tpch_30tb'
AND ti.schema = 'public'
ORDER BY ti.size;

The output of this query shows some optimizations have already been implemented by the COPY command. The smaller tables have the diststyle set to ALL (replicating all rows across all data nodes), and the larger tables are set to EVEN (a round-robin distribution of rows across the data nodes). Also, the encoding (compression) has been set for all the tables.

table tbl_rows size diststyle sortkey1 sortkey_num encoded
region 5 30 AUTO(ALL) AUTO(SORTKEY) 0 Y, AUTO(ENCODE)
nation 25 35 AUTO(ALL) AUTO(SORTKEY) 0 Y, AUTO(ENCODE)
supplier 300000000 25040 AUTO(EVEN) AUTO(SORTKEY) 0 Y, AUTO(ENCODE)
part 6000000000 316518 AUTO(EVEN) AUTO(SORTKEY) 0 Y, AUTO(ENCODE)
customer 4500000000 399093 AUTO(EVEN) AUTO(SORTKEY) 0 Y, AUTO(ENCODE)
partsupp 24000000000 1521893 AUTO(EVEN) AUTO(SORTKEY) 0 Y, AUTO(ENCODE)
orders 45000000000 2313421 AUTO(EVEN) AUTO(SORTKEY) 0 Y, AUTO(ENCODE)
lineitem 179997535081 8736977 AUTO(EVEN) AUTO(SORTKEY) 0 Y, AUTO(ENCODE)
SELECT td.tablename      
      ,td."column"      
      ,td.encoding
FROM pg_table_def td
WHERE td.tablename = 'customer'
AND   td.schemaname = 'public';

This output shows the encoding type set for each column of the customer table. Because Amazon Redshift is a columnar database, the compression can be set differently for each column, as opposed to a row-based database, which can only set compression at the row level.

tablename column encoding
customer c_custkey az64
customer c_name lzo
customer c_address lzo
customer c_nationkey az64
customer c_phone lzo
customer c_acctbal az64
customer c_mktsegment lzo
customer c_comment lzo

Schedule the test queries

The TPC-H benchmark uses a set of 22 queries that have a wide variation of complexity, amount of data scanned, answer set size, and elapsed time. For this test, the queries are run serially from a single script file, query0.sql. Because Query 11 (Q11) returns a large number of rows and the goal of this test is to measure execution time, this benchmark includes a LIMIT 1000 clause on that query to ensure the time measured is predominantly execution time rather than return time. This script is available on the amazon-redshift-utils GitHub repo under the src/BlogContent/ATO directory.

At the top of the script, enable_result_cache_for_session is set to off. This forces Amazon Redshift to rerun the queries on each test run, and not just return the query results from the results cache.
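
For reference, disabling the result cache for a session looks like the following:

SET enable_result_cache_for_session TO off;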

Use the following steps to schedule the test script:

  1. Download query0.sql from GitHub.
  2. Make sure the IAM user has been granted the necessary permissions (set in a previous step).
  3. On the Amazon Redshift console, open the query editor.
  4. If the current tab is blank, enter any text to enable the Schedule button.
  5. Choose Schedule.
  6. Under Scheduler permissions for IAM role, specify the IAM role you created in a previous step (RedshiftATOTestingRole).
  7. For Cluster, choose your cluster.
  8. Enter values for Database name and Database user.
  9. Under Query information, for Scheduled query name, enter tpch-30tb-test.
  10. Choose Upload Query and choose the query0.sql file you downloaded in a previous step.
  11. Under Scheduling options, change Repeat every: to 3 hours.

If you have a smaller cluster, each test run requires more time. To determine the schedule interval you need, run query0.sql once to measure the runtime, and add an hour for ATO to perform its processing.

  1. For Repeat on:, select Everyday.
  2. Choose Save changes.

The test queries are now set to run in the background at the given schedule. Leave the schedule running for approximately 48 hours to run a full test.

Check the table changes

Amazon Redshift automatically monitors the workload on the cluster and uses AI algorithms to calculate the optimal sort and distribution keys. Then ATO implements the table changes online, without disrupting running queries.

The following queries show the changes ATO has made and when they were made:

SELECT ti."table"      
      ,ti.diststyle      
      ,ti.sortkey1
FROM svv_table_info ti
WHERE ti.database = 'tpch_30tb'
AND ti.schema = 'public'
ORDER BY ti."table";

The following table shows our output.

table diststyle sortkey1
customer AUTO(KEY(c_custkey)) AUTO(SORTKEY)
lineitem AUTO(KEY(l_orderkey)) AUTO(SORTKEY(l_shipdate))
nation AUTO(ALL) AUTO(SORTKEY)
orders AUTO(KEY(o_orderkey)) AUTO(SORTKEY(o_orderdate))
part AUTO(KEY(p_partkey)) AUTO(SORTKEY(p_type))
partsupp AUTO(KEY(ps_partkey)) AUTO(SORTKEY)
region AUTO(ALL) AUTO(SORTKEY)
supplier AUTO(KEY(s_suppkey)) AUTO(SORTKEY(s_nationkey))

Now we can see all the distribution and sort keys that have been set by ATO.

svl_auto_worker_action is an Amazon Redshift system view that allows us to see a log of the changes made by ATO:

SELECT st."table"      
      ,wa.type      
      ,wa.status      
      ,wa.eventtime         
      ,wa.previous_state
FROM svl_auto_worker_action wa
INNER JOIN svv_table_info st   
   ON wa.table_id = st.table_id
ORDER BY eventtime;

Data in the following table has been abbreviated to save space.

table type status eventtime previous_state
supplier distkey Start 2021-06-25 04:16:54.628556
supplier distkey Complete: 100% 2021-06-25 04:17:13.083246 DIST STYLE: diststyle even;
part distkey Start 2021-06-25 04:20:13.087554
part distkey Checkpoint: progress 11.400000% 2021-06-25 04:20:36.626592
part distkey Start 2021-06-25 04:20:46.627278
part distkey Checkpoint: progress 22.829400% 2021-06-25 04:21:06.137430
part distkey Start 2021-06-25 04:23:17.421084
part distkey Checkpoint: progress 80.055012% 2021-06-25 04:23:36.653153
part distkey Start 2021-06-25 04:23:46.653869
part distkey Complete: 100% 2021-06-25 04:24:47.052317 DIST STYLE: diststyle even;
orders distkey Start 2021-06-25 04:27:47.057053
orders distkey Checkpoint: progress 1.500000% 2021-06-25 04:28:03.040231
orders distkey Start 2021-06-25 04:28:13.041467
orders distkey Checkpoint: progress 2.977499% 2021-06-25 04:28:28.049088
orders distkey Start 2021-06-25 04:57:46.254614
orders distkey Checkpoint: progress 97.512168% 2021-06-25 04:58:07.643284
orders distkey Start 2021-06-25 04:58:17.644326
orders distkey Complete: 100% 2021-06-25 05:01:44.110385 DIST STYLE: diststyle even;
customer distkey Start 2021-06-25 05:04:44.115405
customer distkey Checkpoint: progress 9.000000% 2021-06-25 05:04:56.730455
customer distkey Start 2021-06-25 05:05:06.731523
customer distkey Checkpoint: progress 18.008868% 2021-06-25 05:05:19.295817
customer distkey Start 2021-06-25 05:08:03.054506
customer distkey Checkpoint: progress 90.127292% 2021-06-25 05:08:14.604731
customer distkey Start 2021-06-25 05:08:24.605273
customer distkey Complete: 100% 2021-06-25 05:09:19.532081 DIST STYLE: diststyle even;
partsupp distkey Start 2021-06-26 04:34:14.548875
partsupp distkey Checkpoint: progress 2.300000% 2021-06-26 04:34:33.730693
partsupp distkey Start 2021-06-26 04:34:43.731784
partsupp distkey Checkpoint: progress 4.644800% 2021-06-26 04:34:59.185233
partsupp distkey Start 2021-06-26 04:52:42.980895
partsupp distkey Checkpoint: progress 95.985631% 2021-06-26 04:52:59.498615
partsupp distkey Start 2021-06-26 04:53:09.499277
partsupp distkey Complete: 100% 2021-06-26 04:55:55.539695 DIST STYLE: diststyle even;
lineitem distkey Start 2021-06-26 04:58:55.544631
lineitem distkey Checkpoint: progress 0.400000% 2021-06-26 04:59:18.864780
lineitem distkey Start 2021-06-26 04:59:28.865949
lineitem distkey Checkpoint: progress 0.798400% 2021-06-26 04:59:46.540671
lineitem distkey Start 2021-06-26 08:31:43.484178
lineitem distkey Checkpoint: progress 99.525163% 2021-06-26 08:32:05.838456
lineitem distkey Start 2021-06-26 08:32:15.839239
lineitem distkey Complete: 100% 2021-06-26 08:41:17.083716 DIST STYLE: diststyle even;
supplier sortkey Start 2021-06-26 16:17:50.458629
supplier sortkey Complete: 100% 2021-06-26 16:17:51.381247 SORTKEY: None;
part sortkey Start 2021-06-26 16:17:51.381595
part sortkey Complete: 100% 2021-06-26 16:17:52.253288 SORTKEY: None;
orders sortkey Start 2021-06-26 16:17:52.253648
orders sortkey Complete: 100% 2021-06-26 16:17:53.164057 SORTKEY: None;
lineitem sortkey Start 2021-06-26 16:17:53.164408
lineitem sortkey Complete: 100% 2021-06-26 16:17:54.289620 SORTKEY: None;

We can see when each change was implemented by ATO. First, a DISTKEY was added to the supplier table at 4:17 AM on June 25. Because this table is relatively small (300 million rows), ATO was able to apply the DISTKEY in one step.

Next, ATO started implementing a DISTKEY on the part table at 4:20 AM. For DISTKEY changes on large tables (part has 6 billion rows), ATO creates a copy of the source table in the background (a shadow table) and then copies data from the source table into the shadow table, redistributing the data according to the new DISTKEY. This process is done in batches known as checkpoints. After all the data has been copied, the metadata is updated, swapping the shadow table with the source table. There were seven checkpoints for part, and the full conversion was completed at 4:24 AM.

Then from 4:27 AM on June 25 until 4:55 AM on June 26, distribution keys were implemented on orders, customer, and partsupp. The last DISTKEY was then implemented on the largest table in the model, lineitem (179 billion rows), finishing at 8:41 AM. Because this table is so large, there were 261 checkpoints.

Finally, sort keys on supplier, part, orders and lineitem were created at 4:17 PM on June 26.

View the test results

Now we look at the output of the test and see what impact ATO’s changes had on the overall performance.

The following line graph shows the runtime of the query0.sql script over time. The vertical reference lines show when ATO changed a DISTKEY or SORTKEY. The first distribution keys were created before the test had actually started. This is because ATO was able to use table properties such as constraints to determine appropriate distribution keys. Over time, ATO may change these distribution keys based on workload observations.

The rest of the distribution keys were then added, and lastly the sort keys, reducing the runtime from approximately 4,750 seconds to 3,597.

The data for this graph is taken from the following query:

SELECT q.pid     
      ,COUNT(DISTINCT q.query) num_queries     
      ,MIN(q.starttime) starttime     
      ,MAX(q.endtime) endtime     
      ,DATEDIFF(SECOND, MIN(q.starttime), MAX(q.endtime)) elapsed_sec     
      ,SUM(DATEDIFF(SECOND, c.starttime, c.endtime)) compile_time     
      ,DATEDIFF(SECOND, MIN(q.starttime), MAX(q.endtime)) - SUM(DATEDIFF(SECOND, c.starttime, c.endtime)) exec_time
FROM stl_query q
INNER JOIN svl_compile c   
   ON q.query = c.query
WHERE q.userid > 1  
AND q.label LIKE 'RSPERF TPC-H%'
GROUP BY q.pid
ORDER BY starttime ASC;

The following table shows our results.

pid num_queries starttime endtime elapsed_sec elapsed_hour compile_time_sec exec_time_sec % improvement
1758 32 2021-06-25 12:00:02.475253 2021-06-25 13:39:58.554643 5996 1.67 35 5961
25569 32 2021-06-25 15:00:02.699906 2021-06-25 16:18:54.603617 4732 1.31 5 4727
4226 32 2021-06-25 18:00:02.113902 2021-06-25 19:19:12.088981 4750 1.32 0 4750 BASELINE
35145 32 2021-06-25 21:00:02.625849 2021-06-25 22:19:28.852209 4766 1.32 0 4766 0.34%
12862 32 2021-06-26 00:00:02.226747 2021-06-26 01:19:20.345285 4758 1.32 0 4758 0.17%
36919 32 2021-06-26 03:00:01.794476 2021-06-26 04:18:54.110546 4733 1.31 0 4733 -0.36%
11631 32 2021-06-26 06:00:02.300287 2021-06-26 07:19:52.082589 4790 1.33 21 4769 0.40%
33833 32 2021-06-26 09:00:02.281647 2021-06-26 10:05:40.694966 3938 1.09 27 3911 -17.66%
3830 32 2021-06-26 12:00:01.873699 2021-06-26 13:06:37.702817 3996 1.11 0 3996 -15.87%
24134 32 2021-06-26 15:00:02.203329 2021-06-26 16:06:24.548732 3982 1.11 0 3982 -16.17%
48465 32 2021-06-26 18:00:02.215612 2021-06-26 19:13:07.665636 4385 1.22 6 4379 -7.81%
26016 32 2021-06-26 21:00:02.298997 2021-06-26 22:05:38.413672 3936 1.09 0 3936 -17.14%
2076 32 2021-06-27 00:00:02.297759 2021-06-27 01:01:09.826855 3667 1.02 0 3667 -22.80%
26222 32 2021-06-27 03:00:02.485152 2021-06-27 04:00:39.922720 3637 1.01 0 3637 -23.43%
1518 32 2021-06-27 06:00:02.075845 2021-06-27 07:00:33.151602 3631 1.01 0 3631 -23.56%
23629 32 2021-06-27 09:00:01.776684 2021-06-27 10:00:08.432630 3607 1 0 3607 -24.06%
42169 32 2021-06-27 12:00:02.341020 2021-06-27 13:00:13.290535 3611 1 0 3611 -23.98%
13299 32 2021-06-27 15:00:02.394744 2021-06-27 16:00:00.383514 3598 1 1 3597 -24.27%

We take a baseline measurement of the runtime on the third run, which ensures any additional compile time is excluded from the test. When we look at the test runs from midnight on June 26 (after ATO had made its changes), we can see the performance improvement.

Clean up

After you have viewed the test results and ATO changes, be sure to decommission your cluster to avoid having to pay for unused resources. Also, delete the IAM policy RedshiftTempCredPolicy and the IAM roles RedshiftCopyRole and RedshiftATOTestingRole.

Cloud DW benchmark derived from TPC-H 3 TB results

We ran the same test using the 3 TB version of the Cloud DW benchmark derived from TPC-H. It ran on a 10-node ra3.4xlarge cluster with query0.sql run every 30 minutes. The results of the test showed ATO achieved a significant increase in performance of up to 25%.

Summary

With automatic table optimization, Amazon Redshift has further increased its automation capabilities to cover query performance tuning in addition to administration tasks.

In this post, I explained how distribution and sort keys improve performance and how they’re automatically set by ATO. I showed how ATO increased performance by up to 24% on a 30 TB industry-standard benchmark with no manual tuning required. I also outlined the steps for setting up the same test yourself.

I encourage you to try out ATO by setting up an Amazon Redshift cluster and running the test, or enabling ATO on existing and new tables on your current cluster and monitoring the results.


About the Author

Adam Gatt is a Senior Specialist Solution Architect for Analytics at AWS. He has over 20 years of experience in data and data warehousing and helps customers build robust, scalable and high-performance analytics solutions in the cloud.

Simplify your data analysis with Amazon Redshift Query Editor v2

Post Syndicated from Srikanth Sopirala original https://aws.amazon.com/blogs/big-data/simplify-your-data-analysis-with-amazon-redshift-query-editor-v2/

Amazon Redshift is a fast, fully managed cloud data warehouse that provides a web-based query editor in addition to supporting connectivity via ODBC/JDBC or the Redshift Data API. Tens of thousands of customers use Amazon Redshift as their analytics platform. Data analysts, database developers, and data scientists use SQL to analyze their data in Amazon Redshift data warehouses. Amazon Redshift Query Editor v2 is a web-based SQL client application that you can use to author and run queries on your Amazon Redshift data warehouse. You can visualize query results with charts and collaborate by sharing queries with members of your team.

Query Editor v2 provides several capabilities, such as the ability to browse and explore multiple databases, external tables, views, stored procedures, and user-defined functions. It provides wizards to create schemas, tables, and user-defined functions. It simplifies the management and collaboration of saved queries. You can also gain faster insights by visualizing the results with a single click.

Query Editor v2 enhances and builds upon the functionality of the prior version of the query editor with, among other improvements, an increased maximum query size, the ability to author and run multi-statement queries, and support for session variables and query parameters.

You can provide Query Editor v2 to end-users such as data analysts, database developers, and data scientists without providing the privileges required to access the Amazon Redshift console.

In this post, we walk through how to create an AWS Identity and Access Management (IAM) role to provide access to Query Editor v2 for end-users, easily connect to your clusters, run SQL queries, load data in your clusters, create charts, and share queries directly from the console.

Configure Query Editor v2 for your AWS account

As an admin, you must first configure Query Editor v2 before providing access to your end-users.

You can access Query Editor v2 from the Amazon Redshift console.

When you choose Query Editor v2 from the Editor options, a new tab in your browser opens with the Query Editor v2 interface.

By default, an AWS-owned key is used to encrypt resources. Optionally, you can create a symmetric customer managed key to encrypt Query Editor v2 resources such as saved queries and query results using the AWS Key Management Service (AWS KMS) console or AWS KMS API operations.

Provide access to Query Editor v2 for your end-users

Enterprises want to democratize access to data in the data warehouse securely by providing a web-based query editor to their end-users. You can either use IAM users or integrate the AWS console with your single sign-on (SSO) provider to provide access to end-users. In a future post, we will document all necessary steps to integrate your SSO provider with the query editor.

To enable your users to access Query Editor v2 using IAM, you can, as an administrator, attach one of the following AWS managed policies to an IAM user or role to grant permission. These managed policies also grant access to other required services. You can also create your own customer managed policy if you want to customize permissions for your end-users.

  • AmazonRedshiftQueryEditorV2FullAccess – Grants full access to Query Editor v2 operations and resources. This is primarily intended for administrators.
  • AmazonRedshiftQueryEditorV2NoSharing – Grants the ability to work with Query Editor v2 without sharing resources. Users can’t share their queries with their team members.
  • AmazonRedshiftQueryEditorV2ReadSharing – Grants the ability to work with Query Editor v2 with limited sharing of resources. The granted principal can read the saved queries shared with its team but can’t update them.
  • AmazonRedshiftQueryEditorV2ReadWriteSharing – Grants the ability to work with Query Editor v2 with sharing of resources. The granted principal can read and update the resources shared with its team.

For example, if you have a group of users in marketing_group and you want them to collaborate by sharing their queries, you can create an IAM role for them and attach the AmazonRedshiftQueryEditorV2ReadSharing policy. You then tag the role with the key sqlworkbench-team and the value marketing_group.

You can use the IAM console to attach IAM policies to an IAM user or an IAM role. After you attach a policy to a role, the permissions apply to anyone who assumes that role.

To attach the IAM policies to an IAM role, complete the following steps:

  1. On the IAM console, choose Roles.
  2. Choose the role that needs access to Query Editor v2. In this example, the role is named marketing_role.
  3. Choose Attach policies.
  4. For Policy names, choose the policies that we described previously based on your requirement.
  5. Choose Attach policy.

Now you can add the sqlworkbench-team tag to the marketing_role IAM role.

  1. In the navigation pane, choose Roles and select the name of the role that you want to edit.
  2. Choose the Tags tab and choose Add tags.
  3. Add the tag key sqlworkbench-team and the value marketing_group.
  4. Choose Save changes.

Now the end-users with marketing_role can access Query Editor v2 with limited sharing of resources.

Work with Query Editor v2

You can use Query Editor v2 to author and run queries, visualize results, and share your work with your team. With Query Editor v2, you can create databases, schemas, tables, and user-defined functions (UDFs) with visual wizards. In a tree-view panel, for each of your clusters, you can view its schemas. For each schema, you can view its tables, views, functions (UDFs), and stored procedures.

Open Query Editor v2

After you log in to the console and navigate to Query Editor v2, you see a page like the following screenshot.

Query Editor v2 now provides a more IDE-like experience to its users and offers both dark and light themes. You can switch between themes by choosing the moon icon at the lower left of the page.

The left navigation pane shows the list of clusters that you have access to. If you don’t have an Amazon Redshift cluster, use the Getting Started with Amazon Redshift cluster with sample data option. In this post, we use the sample data (Tickets database) as examples.

Connect to an Amazon Redshift database

You can connect to a cluster by choosing a cluster and entering your credentials.

You can connect using either a database user name and password or temporary credentials. Query Editor v2 creates a secret on your behalf stored in AWS Secrets Manager. This secret contains credentials to connect to your database. With temporary credentials, Query Editor v2 generates a temporary password to connect to the database.

Browse a database

You can browse one or more databases in the cluster that you’re connected to. Within a database, you can manage schemas, tables, views, functions, and stored procedures in the tree-view panel. If you have integrated your cluster with the AWS Glue Data Catalog, you see the Data Catalog schema and external tables. Similarly, you can browse the external tables if you create external schemas using Amazon Redshift data sharing, Amazon Redshift Spectrum, or federated queries.

You can perform an operation on an object by opening its context menu (right-click) and choosing from the menu options.

Author and run queries

Query Editor v2 allows you to run your queries by selecting a specific database. If you have multiple databases, make sure that you choose the correct database.

You can enter a query in the editor or select a saved query from the Queries list and choose Run. The query editor also provides several shortcuts, which you can access by choosing the content assist option.

By default, the Limit 100 option is turned on, which limits the results to 100 rows. You can turn off this option to return a larger result set. If you turn it off, consider including a LIMIT clause in your SQL statement to avoid very large result sets.
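
For example, the following minimal sketch, which assumes the sample Tickets database, caps the result set in the statement itself:

-- Return at most 1,000 events, oldest first
SELECT eventname, starttime
FROM event
ORDER BY starttime
LIMIT 1000;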

Use multiple SQL statements in a query

The query editor supports multiple queries, session variables, and temporary tables. If you run a query that contains multiple SQL statements, the results are displayed on separate tabs, as in the following example.
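
The following minimal sketch, again assuming the sample Tickets database, creates a temporary table and then queries it in the same run:

-- Stage revenue per event in a temporary table
CREATE TEMP TABLE event_revenue AS
SELECT eventid, SUM(pricepaid) AS revenue
FROM sales
GROUP BY eventid;

-- Query the temporary table created by the previous statement
SELECT eventid, revenue
FROM event_revenue
ORDER BY revenue DESC
LIMIT 10;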

Run long queries

You don’t have to wait for long queries to complete to view results. The queries run even if the browser window is closed. You can view the results the next time you log in to Query Editor v2.

Run parameterized queries

You can use parameters with your query instead of hardcoding certain values, as in the following code:

SELECT sum(qtysold) 
FROM   sales, date 
WHERE  sales.dateid = date.dateid 
AND    sellerId >= ${sellerid};

When you run a query with a parameter, you’re prompted with a form.

Run the explain plan

You can optimize your queries by turning on the Explain option to display a query plan in the results area. You can choose Save to save the query to the Queries folder.

Export results

You can export the query results on the current page to a file in JSON or CSV format. To save the file in the format you want, open the context menu (right-click) in the results area, then choose Export current page and either JSON or CSV. You can also select rows and export the results for specific rows.

Visual analysis of your results

You can perform a visual analysis of your results for a query by turning on Chart to display a graphic visualization of the results. Choose Traces to display the results as a chart. For Type, choose the style of chart as Bar, Line, and so on. For Orientation, you can choose Vertical or Horizontal. For X, select the table column that you want to use for the horizontal axis. For Y, choose the table column that you want to use for the vertical axis.

Choose Refresh to update the chart display. Choose Fullscreen to expand the chart display.

To create a chart, complete the following steps:

  1. Run a query and get results.
  2. Turn on Chart.
  3. Choose a chart style from the available options.

  1. Choose Trace and start to visualize your data.
  2. Choose Style to customize the appearance, including colors, axes, legend, and annotations.
  3. Choose Annotations to add text, shapes, and images.

For certain chart types, you can add transforms to filter, split, aggregate, and sort the underlying data for the chart.

You can also save, export, and browse the charts you created.

Collaborate and share with your team members

You can share queries with others on your team. As we discussed earlier, an administrator sets up a team based on the IAM policy associated with an IAM user or IAM role. For example, if you’re a member of marketing_group, you can share your queries with your team members.

Save, organize, and browse queries

Before you can share your query with your team, save your query. You can also view and delete saved queries.

To save your query, choose Save, enter a title, and choose Save again.

To browse for saved queries, choose Queries from the navigation pane. You can view queries that are My queries, Shared by me, or Shared to my team. These queries can appear as individual queries or within folders you created.

Organize your queries with folders

You can organize your queries by creating folders and dragging and dropping a saved query to a folder.

Share a query

You can share your queries with your team.

  1. Choose Queries in the navigation pane.
  2. Open the context menu (right-click) of the query that you want to share.
  3. Choose Share with my team.

Manage query versions

You can also view the history of saved queries and manage query versions. Every time you save an SQL query, Query Editor v2 saves it as a new version. You can view or store 20 different versions of your query and browse earlier query versions, save a copy of a query, or restore a query.

  1. Choose Queries in the navigation pane.
  2. Open the context menu (right-click) for the query that you want to work with.
  3. Choose Version history to open a list of versions of the query.
  4. On the Version history page, choose one of the following options:
    • Revert to selected – Revert to the selected version and continue your work with this version.
    • Save selected as – Create a new query in the editor.

Conclusion

In this post, we introduced you to Amazon Redshift Query Editor v2, which provides a rich set of features to manage and run your SQL statements securely, including the ability to browse and explore multiple databases, external tables, views, stored procedures, and user-defined functions. It also provides wizards to create schemas, tables, and user-defined functions, simplifies the management and sharing of saved queries, and lets you analyze and visualize results with a single click.

If you have any questions or suggestions, please leave a comment.

Happy querying!


About the Author

Srikanth Sopirala is a Principal Analytics Specialist Solutions Architect at AWS. He is a seasoned leader with over 20 years of experience, who is passionate about helping customers build scalable data and analytics solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys reading, spending time with his family, and road cycling.

Debu Panda, a Principal Product Manager at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences such as re:Invent, Oracle Open World, and Java One. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

Eren Baydemir, a Technical Product Manager at AWS, has 15 years of experience in building customer facing products and is currently creating data analytics solutions in the Amazon Redshift team. He was the CEO and co-founder of DataRow which was acquired by Amazon in 2020.

Erol Murtezaoglu, a Technical Product Manager at AWS, is an inquisitive and enthusiastic thinker with a drive for self improvement and learning. He has a strong and proven technical background in software development and architecture, balanced with a drive to deliver commercially successful products. Erol highly values the process of understanding customer needs and problems, in order to deliver solutions that exceed expectations.

Introducing Amazon Redshift Query Editor V2, a Free Web-based Query Authoring Tool for Data Analysts

Post Syndicated from Alex Casalboni original https://aws.amazon.com/blogs/aws/amazon-redshift-query-editor-v2-web-query-authoring/

When it comes to manipulating and analyzing relational data, Structured Query Language (SQL) has been an international standard since 1986, a couple of years before I was born. And yet, it sometimes takes hours to get access to a new database or data warehouse, configure credentials or single sign-on, download and install multiple desktop libraries or drivers, and get familiar with the new schema—all this before you even run a query. Not to mention the challenge of sharing queries, results, and analyses securely between members of the same team or across teams.

Today, I’m glad to announce the general availability of Amazon Redshift Query Editor V2, a web-based tool that you can use to explore, analyze, share, and collaborate on data stored in Amazon Redshift using SQL. It supports data warehouses on Amazon Redshift and data lakes through Amazon Redshift Spectrum.

Amazon Redshift Query Editor V2 provides a free serverless web interface that reduces the operational costs of managing query tools and infrastructure. Because it’s a managed SQL editor in your browser and it’s integrated with your single sign-on provider, the Query Editor V2 reduces the number of steps to the first query so you gain insights faster. You also get in-place visual analysis of query results (no data download required), all in one place. As an additional team productivity boost, it improves collaboration with saved queries and the ability to share results and analyses between users.

From a security standpoint, analysts can access Query Editor V2 without requiring any admin privileges on the Amazon Redshift cluster, using an IAM role for READ, WRITE, or ADMIN access. Check out the documentation for more details.

Connection Setup for Amazon Redshift Query Editor V2
First, you’ll need to configure the connection to your Amazon Redshift cluster.

After you have configured the connection, you can reuse it for future sessions. And, of course, you can edit or delete a connection at any time.

Simply click on a cluster to connect with Query Editor V2.

Amazon Redshift Query Editor V2 in Action
The web interface allows you to browse schemas, tables, views, functions, and stored procedures. You can also preview a table’s columns with one click and create or delete schemas, tables, or functions.

The interface is intuitive for newcomers and expert users alike. You can resize panels, create tabs, and configure your editor preferences.

Running or explaining a query is quite straightforward: You simply write (or paste) the query and choose Run. You can visualize and interact with the result set in the bottom pane. For example, you might want to change the row ordering or search for a specific word. Even though Amazon Redshift Query Editor V2 is a browser-based tool, the data movement between your browser and the Amazon Redshift cluster is optimized, so your browser doesn’t need to download any raw data. A lot of the filtering and reordering happens directly in the browser, without any wait time.

To export a result set as a JSON or CSV file on your local machine, simply right-click it.

So far so good! Running queries is the minimum you’d expect from a Query Editor. Let’s have a look at some of the more interesting features.

Team Collaboration with Amazon Redshift Query Editor V2
Amazon Redshift Query Editor V2 allows you to manage the permissions of your team members based on their IAM roles, so that you can easily share queries and cluster access in a secure way.

For example, you can use IAM managed policies such as AmazonRedshiftQueryEditorV2FullAccess, AmazonRedshiftQueryEditorV2ReadSharing, or AmazonRedshiftQueryEditorV2ReadWriteSharing. Also, don’t forget to include the redshift:GetClusterCredentials permission.

After you’ve set up the IAM roles for your team, choose Save to save a query.

After you save a query, the tab title changes from Untitled to the query name. From then on, you can edit the saved query and choose Save again to update it.

Individual users with WRITE access can run, edit, and delete shared queries, while users with READ access can only run shared queries.

If you work on multiple projects and collaborate with many different teams, it might be difficult to remember query names or even find them in a long list. In Amazon Redshift Query Editor V2, saved and shared queries are available from the left navigation in Queries. You can keep your queries organized into folders. Even nested folders are supported.

Last but not least, each saved query is versioned and the version history is always available. That’s pretty useful when you need to restore an older version.

Plot Your Queries with Amazon Redshift Query Editor V2
Sharing queries with teammates is great, but wouldn’t it be even better if you could visualize a result set, export it as PNG or JPEG, and save the chart for later? Amazon Redshift Query Editor V2 allows you to perform in-place visualizations of your results. When you’re happy with the look and feel of your chart, you can save it for later and organize all your saved charts into folders. This allows you to simply choose a saved chart, rerun the corresponding query, and export the new image. No need to configure the plot from scratch or remember the configuration of hundreds of charts and queries across different projects.

Available Today
Amazon Redshift Query Editor V2 is available today in all commercial AWS Regions, except the AP-Northeast-3 Region. It requires no license and it’s free to use; you pay only for your Amazon Redshift cluster.

You can interact with the service using the Amazon Redshift console. It doesn’t require any driver or software on your local machine.

For more information, see the Amazon Redshift Query Editor V2 technical documentation or take a look at this video:

We look forward to your feedback.

Alex

Accelerate your data warehouse migration to Amazon Redshift – Part 4

Post Syndicated from Michael Soo original https://aws.amazon.com/blogs/big-data/part-4-accelerate-your-data-warehouse-migration-to-amazon-redshift/

This is the fourth in a series of posts. We’re excited to share dozens of new features to automate your schema conversion; preserve your investment in existing scripts, reports, and applications; accelerate query performance; and potentially reduce your overall cost to migrate to Amazon Redshift.

Check out the previous posts in the series:

Amazon Redshift is the leading cloud data warehouse. No other data warehouse makes it as easy to gain new insights from your data. With Amazon Redshift, you can query exabytes of data across your data warehouse, operational data stores, and data lake using standard SQL. You can also integrate other AWS services like Amazon EMR, Amazon Athena, Amazon SageMaker, AWS Glue, AWS Lake Formation, and Amazon Kinesis to use all the analytic capabilities in the AWS Cloud.

Many customers have asked for help migrating from self-managed data warehouse engines, like Teradata, to Amazon Redshift. In these cases, you typically have terabytes or petabytes of data, a heavy reliance on proprietary features, and thousands of extract, transform, and load (ETL) processes and reports built over a few years (or decades) of use.

Until now, migrating a data warehouse to AWS was complex and involved a significant amount of manual effort. You needed to manually remediate syntax differences, inject code to replace proprietary features, and manually tune the performance of queries and reports on the new platform.

For example, you may have a significant investment in BTEQ (Basic Teradata Query) scripting for database automation, ETL, or other tasks. Previously, you needed to manually recode these scripts as part of the conversion process to Amazon Redshift. Together with supporting infrastructure (job scheduling, job logging, error handling), this was a significant impediment to migration.

Today, we’re happy to share with you a new, purpose-built command line tool called Amazon Redshift RSQL. Some of the key features added in Amazon Redshift RSQL are enhanced flow control syntax and single sign-on support. You can also describe properties or attributes of external tables in an AWS Glue catalog or Apache Hive Metastore, external databases in Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL-Compatible Edition, and tables shared using Amazon Redshift data sharing.

We have also enhanced the AWS Schema Conversion Tool (AWS SCT) to automatically convert BTEQ scripts to Amazon Redshift RSQL scripts. The converted scripts run on Amazon Redshift with little to no changes.

In this post, we describe some of the features of Amazon Redshift RSQL, show example scripts, and demonstrate how to convert BTEQ scripts into Amazon Redshift RSQL scripts.

Amazon Redshift RSQL features

If you currently use Amazon Redshift, you may already be running scripts on Amazon Redshift using the PSQL command line client. These scripts operate on Amazon Redshift RSQL with no modification. You can think of Amazon Redshift RSQL as an Amazon Redshift-native version of PSQL.
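
For instance, the following minimal sketch relies only on standard PSQL meta-commands and plain SQL, so, per that compatibility, it should behave the same whether you run it with psql or with Amazon Redshift RSQL (the testschema.employees table is the one used in the examples later in this post):

\echo Starting row-count check
SELECT 'employees' AS table_name, count(*) AS row_count
FROM testschema.employees;
\echo Done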

In addition, we have designed Amazon Redshift RSQL to make it easy to transition BTEQ scripts to the tool. The following are some examples of Amazon Redshift RSQL commands that make this possible. (For full details, see Amazon Redshift RSQL.)

  • \EXIT – This command is an extension of the PSQL \quit command. Like \quit, \EXIT terminates the execution of Amazon Redshift RSQL. In addition, you can specify an optional exit code with \EXIT.
  • \LOGON – This command creates a new connection to a database. \LOGON is an alias for the PSQL \connect command. You can specify connection parameters using positional syntax or as a connection string.
  • \REMARK – This command prints the specified string to the output. \REMARK extends the PSQL \echo command by adding the ability to break the output over multiple lines, using // as a line break.
  • \RUN – This command runs the Amazon Redshift RSQL script contained in the specified file. \RUN extends the PSQL \i command by adding an option to skip any number of lines in the specified file.
  • \OS – This is an alias for the PSQL \! command. \OS runs the operating system command that is passed as a parameter. Control returns to Amazon Redshift RSQL after running the OS command.
  • \LABEL – This is a new command for Amazon Redshift RSQL. \LABEL establishes an entry point for execution, as the target for a \GOTO command.
  • \GOTO – This command is a new command for Amazon Redshift RSQL. It’s used in conjunction with the \LABEL command. \GOTO skips all intervening commands and resumes processing at the specified \LABEL. The \LABEL must be a forward reference. You can’t jump to a \LABEL that lexically precedes the \GOTO.
  • \IF (\ELSEIF, \ELSE, \ENDIF) – This command is an extension of the PSQL \if (\elif, \else, \endif) command. \IF and \ELSEIF support arbitrary Boolean expressions including AND, OR, and NOT conditions. You can use the \GOTO command within a \IF block to control conditional execution.
  • \EXPORT – This command specifies the name of an export file that Amazon Redshift RSQL uses to store database information returned by a subsequent SQL SELECT statement.

We’ve also added some variables to Amazon Redshift RSQL to support converting your BTEQ scripts.

  • :ACTIVITYCOUNT – This variable returns the number of rows affected by the last submitted request. For a data-returning request, this is the number of rows returned to Amazon Redshift RSQL from the database. ACTIVITYCOUNT is similar to the PSQL variable ROW_COUNT; however, ROW_COUNT doesn’t report the affected-row count for SELECT, COPY, or UNLOAD.
  • :ERRORCODE – This variable contains the return code for the last submitted request to the database. A zero signifies the request completed without error. The ERRORCODE variable is an alias for the variable SQLSTATE.
  • :ERRORLEVEL – This variable assigns severity levels to errors. Use the severity levels to determine a course of action based on the severity of the errors that Amazon Redshift RSQL encounters.
  • :MAXERROR – This variable designates a maximum error severity level beyond which Amazon Redshift RSQL terminates job processing.
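
For example, the following minimal sketch (not taken from the original post) combines :ERRORCODE with the \IF, \GOTO, and \LABEL commands described above to stop a job when a statement fails. The staging table testschema.employees_staging is hypothetical, and whether processing reaches the \IF after a failed statement also depends on your :MAXERROR setting:

-- Hypothetical load step; :ERRORCODE is checked afterward
INSERT INTO testschema.employees
SELECT * FROM testschema.employees_staging;

\if :ERRORCODE = 0
  \remark '****Load succeeded****'
\else
  \remark '****Load failed****'
  \goto BAILOUT
\endif
\exit 0

\label BAILOUT
\exit 1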

An example Amazon Redshift RSQL script

Let’s look at an example. First, we log in to an Amazon Redshift database using Amazon Redshift RSQL. You specify the connection information on the command line as shown in the following code. The port and database are optional and default to 5439 and dev respectively if not provided.

$ rsql -h testcluster1.<example>.redshift.amazonaws.com -U testuser1 -d myredshift -p 5439
Password for user testuser1: 
DSN-less Connected
DBMS Name: Amazon Redshift
Driver Name: Amazon Redshift ODBC Driver
Driver Version: 1.4.34.1000
Rsql Version: 1.0.1
Redshift Version: 1.0.29551
Type "help" for help.

(testcluster1) testuser1@myredshift=#

If you choose to change the connection from within the client, you can use the \LOGON command:

(testcluster1) testuser1@myredshift=# \logon testschema testuser2 testcluster2.<example>.redshift.amazonaws.com
Password for user testuser2: 
DBMS Name: Amazon Redshift
Driver Name: Amazon Redshift ODBC Driver
Driver Version: 1.4.34.1000
Rsql Version: 1.0.1

Now, let’s run a simple script that runs a SELECT statement, checks for output, then branches depending on whether data was returned or not.

First, we inspect the script by using the \OS command to print the file to the screen:

(testcluster1) testuser1@myredshift=# \os cat activitycount.sql
select * from testschema.employees;

\if :ACTIVITYCOUNT = 0
  \remark '****No data found****'
  \goto LETSQUIT
\else
  \remark '****Data found****'
  \goto LETSDOSOMETHING
\endif

\label LETSQUIT
\remark '****We are quitting****'
\exit 0

\label LETSDOSOMETHING
\remark '****We are doing it****'
\exit 0

The script prints one of two messages depending on whether data is returned by the SELECT statement or not.

Now, let’s run the script using the \RUN command. The SELECT statement returns 11 rows of data. The script prints a “data found” message, and jumps to the LETSDOSOMETHING label.

(testcluster1) testuser1@myredshift=# \run file=activitycount.sql
  id  | name    | manager_id | last_promo_date
 -----+---------+------------+-----------------
 112  | Britney | 201        | 2041-03-30
 101  | Bob     | 100        |
 110  | Mark    | 201        |
 106  | Jeff    | 102        |
 201  | Ana     | 104        |
 104  | Chris   | 103        |
 111  | Phyllis | 103        |
 102  | Renee   | 101        | 2021-01-01
 100  | Caitlin |            | 2021-01-01
 105  | David   | 103        | 2021-01-01
 103  | John    | 101        |
 (11 rows)

****Data found****
\label LETSQUIT ignored
\label LETSDOSOMETHING processed
****We are doing it****

That’s Amazon Redshift RSQL in a nutshell. If you’re developing new scripts for Amazon Redshift, we encourage you to use Amazon Redshift RSQL and take advantage of its additional capabilities. If you have existing PSQL scripts, you can run those scripts using Amazon Redshift RSQL with no changes.

Use AWS SCT to automate your BTEQ conversions

If you’re a Teradata developer or DBA, you’ve probably built a library of BTEQ scripts that you use to perform administrative work, load or transform data, or to generate datasets and reports. If you’re contemplating a migration to Amazon Redshift, you’ll want to preserve the investment you made in creating those scripts.

AWS SCT has long had the ability to convert BTEQ to AWS Glue. Now, you can also use AWS SCT to automatically convert BTEQ scripts to Amazon Redshift RSQL. AWS SCT supports all the new Amazon Redshift RSQL features like conditional execution, escape to the shell, and branching.

Let’s see how it works. We create two Teradata tables, product_stg and product. Then we create a simple ETL script that uses a MERGE statement to update the product table using data from the product_stg table:

CREATE TABLE testschema.product_stg (
  prod_id INTEGER
, description VARCHAR(100) CHARACTER SET LATIN
, category_id INTEGER)
UNIQUE PRIMARY INDEX ( prod_id );

CREATE TABLE testschema.product (
  prod_id INTEGER
, description VARCHAR(100) CHARACTER SET LATIN
, category_id INTEGER)
UNIQUE PRIMARY INDEX ( prod_id );

We embed the MERGE statement inside a BTEQ script. The script tests error conditions and branches accordingly:

.SET WIDTH 100

SELECT COUNT(*) 
FROM testschema.product_stg 
HAVING COUNT(*) > 0;

.IF ACTIVITYCOUNT = 0 then .GOTO NODATA;

MERGE INTO testschema.product tgt 
USING testschema.product_stg stg 
   ON tgt.prod_id = stg.prod_id
WHEN MATCHED THEN UPDATE SET
      description = stg.description
    , category_id = stg.category_id
WHEN NOT MATCHED THEN INSERT VALUES (
  stg.prod_id
, stg.description
, stg.category_id
);

.GOTO ALLDONE;

.LABEL NODATA

.REMARK 'Staging table is empty. Stopping'

.LABEL ALLDONE 

.QUIT;               

Now, let’s use AWS SCT to convert the script to Amazon Redshift RSQL. AWS SCT converts the BTEQ commands to their Amazon Redshift RSQL and Amazon Redshift equivalents. The converted script is as follows:

\rset width 100
SELECT
    COUNT(*)
    FROM testschema.product_stg
    HAVING COUNT(*) > 0;
\if :ACTIVITYCOUNT = 0
    \goto NODATA
\endif
UPDATE testschema.product
SET description = stg.description, category_id = stg.category_id
FROM testschema.product_stg AS stg
JOIN testschema.product AS tgt
    ON tgt.prod_id = stg.prod_id;
INSERT INTO testschema.product
SELECT
    stg.prod_id, stg.description, stg.category_id
    FROM testschema.product_stg AS stg
    WHERE NOT EXISTS (
        SELECT 1
        FROM testschema.product AS tgt
        WHERE tgt.prod_id = stg.prod_id);
\goto ALLDONE
\label NODATA
\remark 'Staging table is empty. Stopping'
\label ALLDONE
\quit :ERRORLEVEL

The following are the main points of interest in the conversion:

  • The BTEQ .SET WIDTH command is converted to the Amazon Redshift RSQL \RSET WIDTH command.
  • The BTEQ ACTIVITYCOUNT variable is converted to the Amazon Redshift RSQL ACTIVITYCOUNT variable.
  • The BTEQ MERGE statement is converted into an UPDATE followed by an INSERT statement. Currently, Amazon Redshift doesn’t support a native MERGE statement.
  • The BTEQ .LABEL and .GOTO statements are translated to their Amazon Redshift RSQL equivalents \LABEL and \GOTO.

Let’s look at the actual process of using AWS SCT to convert a BTEQ script.

After starting AWS SCT, you create a Teradata migration project and navigate to the BTEQ scripts node in the source tree window pane. Right-click and choose Load scripts.

Then select the folder that contains your BTEQ scripts. The folder appears in the source tree. Open it and navigate to the script you want to convert. In our case, the script is contained in the file merge.sql. Right-click the file, choose Convert script, then choose Convert to RSQL. You can inspect the converted script in the bottom middle pane. When you’re ready to save the script to a file, do so from the target tree on the right side.

If you have many BTEQ scripts, you can convert an entire folder at once by selecting the folder instead of an individual file.

Convert shell scripts

Many applications run BTEQ commands from within shell scripts. For example, you may have a shell script that redirects log output and controls login credentials, as in the following:

bteq <<EOF >> ${LOG} 2>&1

.run file $LOGON;

SELECT COUNT(*) 
FROM testschema.product_stg 
HAVING COUNT(*) > 0;
…

EOF

If you use shell scripts to run BTEQ, we’re happy to share that AWS SCT can help you convert those scripts. AWS SCT supports bash scripts now, and we’ll add additional shell dialects in the future.

The process to convert shell scripts is very similar to BTEQ conversion. You select a folder that contains your scripts by navigating to the Shell node in the source tree and then choosing Load scripts.

After the folder is loaded, you can convert one (or more) scripts by selecting them and choosing Convert script.

As before, the converted script appears in the UI, and you can save it from the target tree on the right side of the page.

Conclusion

We’re happy to share Amazon Redshift RSQL and expect it to be a big hit with customers. If you’re contemplating a migration from Teradata to Amazon Redshift, Amazon Redshift RSQL and AWS SCT can simplify the conversion of your existing Teradata scripts and help preserve your investment in existing reports, applications, and ETL.

All of the features described in this post are available for you to use today. You can download Amazon Redshift RSQL and AWS SCT and give it a try.

We’ll be back soon with the next installment in this series. Check back for more information on automating your migrations from Teradata to Amazon Redshift. In the meantime, you can learn more about Amazon Redshift, Amazon Redshift RSQL, and AWS SCT. Happy migrating!


About the Authors

Michael Soo is a Senior Database Engineer with the AWS Database Migration Service team. He builds products and services that help customers migrate their database workloads to the AWS cloud.

Po Hong, PhD, is a Principal Data Architect of the Lake House Global Specialty Practice, AWS Professional Services. He is passionate about supporting customers to adopt innovative solutions to reduce time to insight. Po specializes in migrating large scale MPP on-premises data warehouses to the AWS Lake House architecture.

Entong Shen is a Software Development Manager of Amazon Redshift. He has been working on MPP databases for over 9 years and has focused on query optimization, statistics and migration related SQL language features such as stored procedures and data types.

Adekunle Adedotun is a Sr. Database Engineer with Amazon Redshift service. He has been working on MPP databases for 6 years with a focus on performance tuning. He also provides guidance to the development team for new and existing service features.

Asia Khytun is a Software Development Manager for the AWS Schema Conversion Tool. She has 10+ years of software development experience in C, C++, and Java.

Illia Kratsov is a Database Developer with the AWS Project Delta Migration team. He has 10+ years experience in data warehouse development with Teradata and other MPP databases.

Accelerate your data warehouse migration to Amazon Redshift – Part 3

Post Syndicated from Michael Soo original https://aws.amazon.com/blogs/big-data/part-3-accelerate-your-data-warehouse-migration-to-amazon-redshift/

This is the third post in a multi-part series. We’re excited to share dozens of new features to automate your schema conversion; preserve your investment in existing scripts, reports, and applications; accelerate query performance; and reduce your overall cost to migrate to Amazon Redshift.

Check out the previous posts in the series:

Amazon Redshift is the leading cloud data warehouse. No other data warehouse makes it as easy to gain new insights from your data. With Amazon Redshift, you can query exabytes of data across your data warehouse, operational data stores, and data lake using standard SQL. You can also integrate other services such as Amazon EMR, Amazon Athena, and Amazon SageMaker to use all the analytic capabilities in the AWS Cloud.

Many customers have asked for help migrating from self-managed data warehouse engines, like Teradata, to Amazon Redshift. In these cases, you may have terabytes (or petabytes) of historical data, a heavy reliance on proprietary features, and thousands of extract, transform, and load (ETL) processes and reports built over years (or decades) of use.

Until now, migrating a Teradata data warehouse to AWS was complex and involved a significant amount of manual effort.

Today, we’re happy to share recent enhancements to Amazon Redshift and the AWS Schema Conversion Tool (AWS SCT) that make it easier to automate your Teradata to Amazon Redshift migrations.

In this post, we introduce new automation for merge statements, a native function to support ASCII character conversion, enhanced error checking for string to date conversion, enhanced support for Teradata cursors and identity columns, automation for ANY and SOME predicates, automation for RESET WHEN clauses, automation for two proprietary Teradata functions (TD_NORMALIZE_OVERLAP and TD_UNPIVOT), and automation to support analytic functions (QUANTILE and QUALIFY).

Merge statement

Like its name implies, the merge statement takes an input set and merges it into a target table. If an input row already exists in the target table (a row in the target table has the same primary key value), then the target row is updated. If there is no matching target row, the input row is inserted into the table.

Until now, if you used merge statements in your workload, you were forced to manually rewrite the merge statement to run on Amazon Redshift. Now, we’re happy to share that AWS SCT automates this conversion for you. AWS SCT decomposes a merge statement into an update on existing records followed by an insert for new records.

Let’s look at an example. We create two tables in Teradata: a target table, employee, and a delta table, employee_delta, where we stage the input rows:

CREATE TABLE testschema.employee(
  id INTEGER
, name VARCHAR(20)
, manager INTEGER)
UNIQUE PRIMARY INDEX (id)
;

CREATE TABLE testschema.employee_delta (
  id INTEGER
, name VARCHAR(20)
, manager INTEGER)
UNIQUE PRIMARY INDEX(id)
;

Now we create a Teradata merge statement that updates a row if it exists in the target, otherwise it inserts the new row. We embed this merge statement into a macro so we can show you the conversion process later.

REPLACE MACRO testschema.merge_employees AS (
  MERGE INTO testschema.employee tgt
  USING testschema.employee_delta delta
    ON delta.id = tgt.id
  WHEN MATCHED THEN
    UPDATE SET name = delta.name, manager = delta.manager
  WHEN NOT MATCHED THEN
    INSERT (delta.id, delta.name, delta.manager);
);

Now we use AWS SCT to convert the macro. (See Accelerate your data warehouse migration to Amazon Redshift – Part 1 for details on macro conversion.) AWS SCT creates a stored procedure that contains an update (to implement the WHEN MATCHED condition) and an insert (to implement the WHEN NOT MATCHED condition).

CREATE OR REPLACE PROCEDURE testschema.merge_employees()
AS $BODY$
BEGIN
    UPDATE testschema.employee
    SET name = "delta".name, manager = "delta".manager
    FROM testschema.employee_delta AS delta JOIN testschema.employee AS tgt
        ON "delta".id = tgt.id;
      
    INSERT INTO testschema.employee
    SELECT
      "delta".id
    , "delta".name
    , "delta".manager
    FROM testschema.employee_delta AS delta
    WHERE NOT EXISTS (
      SELECT 1
      FROM testschema.employee AS tgt
      WHERE "delta".id = tgt.id
    );
END;
$BODY$
LANGUAGE plpgsql;

This example showed how to use merge automation for macros, but you can convert merge statements in any application context: stored procedures, BTEQ scripts, Java code, and more. Download the latest version of AWS SCT and try it out.

ASCII() function

The ASCII function takes as input a string and returns the ASCII code, or more precisely, the UNICODE code point, of the first character in the string. Previously, Amazon Redshift supported ASCII as a leader-node only function, which prevented its use with user-defined tables.

We’re happy to share that the ASCII function is now available on Amazon Redshift compute nodes and can be used with user-defined tables. In the following code, we create a table with some string data:

CREATE TABLE testschema.char_table (
  id INTEGER
, char_col  CHAR(10)
, varchar_col VARCHAR(10)
);

INSERT INTO testschema.char_table VALUES (1, 'Hello', 'world');

Now you can use the ASCII function on the string columns:

# SELECT id, char_col, ascii(char_col), varchar_col, ascii(varchar_col) FROM testschema.char_table;

 id |  char_col  | ascii | varchar_col | ascii 
  1 | Hello      |    72 | world       |   119

Lastly, if your application code uses the ASCII function, AWS SCT automatically converts any such function calls to Amazon Redshift.

The ASCII feature is available now—try it out in your own cluster.

TO_DATE() function

The TO_DATE function converts a character string into a DATE value. A quirk of this function is that it can accept a string value that isn’t a valid date and translate it into a valid date.

For example, consider the string 2021-06-31. This isn’t a valid date because the month of June has only 30 days. However, the TO_DATE function accepts this string and returns the “31st” day of June (July 1):

# SELECT to_date('2021-06-31', 'YYYY-MM-DD');
 to_date 
 2021-07-01
(1 row)

Customers have asked for strict input checking for TO_DATE, and we’re happy to share this new capability. Now, you can include a Boolean value in the function call that turns on strict checking:

# SELECT to_date('2021-06-31', 'YYYY-MM-DD', TRUE);
ERROR: date/time field date value out of range: 2021-6-31

You can turn off strict checking explicitly as well:

# SELECT to_date('2021-06-31', 'YYYY-MM-DD', FALSE);
 to_date 
 2021-07-01
(1 row)

Also, the Boolean value is optional. If you don’t include it, strict checking is turned off, and you see the same behavior as before the feature was launched.

You can learn more about the TO_DATE function and try out strict date checking in Amazon Redshift now.

CURSOR result sets

A cursor is a programming language construct that applications use to manipulate a result set one row at a time. Cursors are more relevant for OLTP applications, but some legacy applications built on data warehouses also use them.

Teradata provides a diverse set of cursor configurations. Amazon Redshift supports a more streamlined set of cursor features.

Based on customer feedback, we’ve added automation to support Teradata WITH RETURN cursors. These types of cursors are opened within stored procedures and returned to the caller for processing of the result set. AWS SCT will convert a WITH RETURN cursor to an Amazon Redshift REFCURSOR.

For example, consider the following procedure, which contains a WITH RETURN cursor. The procedure opens the cursor and returns the result to the caller as a DYNAMIC RESULT SET:

REPLACE PROCEDURE testschema.employee_cursor (IN p_mgrid INTEGER) DYNAMIC RESULT SETS 1
BEGIN
   DECLARE result_set CURSOR WITH RETURN ONLY FOR 
     SELECT id, name, manager 
     FROM testschema.employee
     WHERE manager = to_char(p_mgrid); 
   OPEN result_set;
END;

AWS SCT converts the procedure as follows. An additional parameter is added to the procedure signature to pass the REFCURSOR:

CREATE OR REPLACE PROCEDURE testschema.employee_cursor(par_p_mgrid IN INTEGER, dynamic_return_cursor INOUT refcursor)
AS $BODY$
DECLARE
BEGIN
    OPEN dynamic_return_cursor FOR
    SELECT
        id, name, manager
        FROM testschema.employee
        WHERE manager = to_char(par_p_mgrid, '99999');
END;
$BODY$
LANGUAGE plpgsql;
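
To call the converted procedure from a SQL client, you pass a cursor name along with the regular arguments and then fetch from that cursor within the same transaction. The following is a minimal sketch rather than AWS SCT output; the manager ID 201 and the cursor name mycursor are arbitrary:

BEGIN;
CALL testschema.employee_cursor(201, 'mycursor');
FETCH ALL FROM mycursor;  -- retrieve the rows returned through the REFCURSOR
COMMIT;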

IDENTITY columns

Teradata supports several non-ANSI compliant features for IDENTITY columns. We have enhanced AWS SCT to automatically convert these features to Amazon Redshift, whenever possible.

Specifically, AWS SCT now converts the Teradata START WITH and INCREMENT BY clauses to the Amazon Redshift SEED and STEP clauses, respectively. For example, consider the following Teradata table:

CREATE TABLE testschema.identity_table (
  a2 BIGINT GENERATED ALWAYS AS IDENTITY (
    START WITH 1 
    INCREMENT BY 20
  )
);

The GENERATED ALWAYS clause indicates that the column is always populated automatically—a value can’t be explicitly inserted or updated into the column. The START WITH clause defines the first value to be inserted into the column, and the INCREMENT BY clause defines the next value to insert into the column.

When you convert this table using AWS SCT, the following Amazon Redshift DDL is produced. Notice that the START WITH and INCREMENT BY values are preserved in the target syntax:

CREATE TABLE IF NOT EXISTS testschema.identity_table (
  a2 BIGINT IDENTITY(1, 20)
)
DISTSTYLE KEY
DISTKEY
(a2)
SORTKEY
(a2);

Also, by default, an IDENTITY column in Amazon Redshift only contains auto-generated values, so that the GENERATED ALWAYS property in Teradata is preserved:

# INSERT INTO testschema.identity_table VALUES (100);
ERROR:  cannot set an identity column to a value

IDENTITY columns in Teradata can also be specified as GENERATED BY DEFAULT. In this case, a value can be explicitly defined in an INSERT statement. If no value is specified, the column is filled with an auto-generated value like normal. Before, AWS SCT didn’t support conversion for GENERATED BY DEFAULT columns. Now, we’re happy to share that AWS SCT automatically converts such columns for you.

For example, the following table contains an IDENTITY column that is GENERATED BY DEFAULT:

CREATE TABLE testschema.identity_by_default (
  a1 BIGINT GENERATED BY DEFAULT AS IDENTITY (
     START WITH 1
     INCREMENT BY 20
  )
)
PRIMARY INDEX (a1);

The IDENTITY column is converted by AWS SCT as follows. The converted column uses the Amazon Redshift GENERATED BY DEFAULT clause:

CREATE TABLE testschema.identity_by_default (
  a1 BIGINT GENERATED BY DEFAULT AS IDENTITY(1,20) DISTKEY
)                                                          
 DISTSTYLE KEY                                               
 SORTKEY (a1);

There is one additional syntax issue that requires attention. In Teradata, an auto-generated value is inserted when NULL is specified for the column value:

INSERT INTO identity_by_default VALUES (null);

Amazon Redshift uses a different syntax for the same purpose. Here, you include the keyword DEFAULT in the values list to indicate that the column should be auto-generated:

INSERT INTO testschema.identity_by_default VALUES (default);

We’re happy to share that AWS SCT automatically converts the Teradata syntax for INSERT statements like the preceding example. For example, consider the following Teradata macro:

REPLACE MACRO testschema.insert_identity_by_default AS (
  INSERT INTO testschema.identity_by_default VALUES (NULL);
);

AWS SCT removes the NULL and replaces it with DEFAULT:

CREATE OR REPLACE PROCEDURE testschema.insert_identity_by_default()
AS $$
BEGIN
  INSERT INTO testschema.identity_by_default VALUES (DEFAULT);
END;
$$
LANGUAGE plpgsql;
IDENTITY column automation is available now in AWS SCT. You can download the latest version and try it out.

ANY and SOME filters with inequality predicates

The ANY and SOME filters determine if a predicate applies to one or more values in a list. For example, in Teradata, you can use <> ANY to find all employees who don’t work for a certain manager:

REPLACE MACRO testschema.not_in_103 AS (
  SELECT *
  FROM testschema.employee 
  WHERE manager <> ANY (103)
;
);

Of course, you can rewrite this query using a simple not equal filter, but you often see queries from third-party SQL generators that follow this pattern.

Amazon Redshift doesn’t support this syntax natively. Before, any queries using this syntax had to be manually converted. Now, we’re happy to share that AWS SCT automatically converts ANY and SOME clauses with inequality predicates. The macro above is converted to a stored procedure as follows.

CREATE OR REPLACE PROCEDURE testschema.not_in_103(macro_out INOUT refcursor)
AS $BODY$
BEGIN
    OPEN macro_out FOR
    SELECT *
    FROM testschema.employee
    WHERE ((manager <> 103));
END;
$BODY$
LANGUAGE plpgsql;

If the values list following ANY contains two or more values, AWS SCT converts this into a series of OR conditions, one for each element in the list. For example, a filter like manager <> ANY (103, 104) becomes manager <> 103 OR manager <> 104.

ANY/SOME filter conversion is available now in AWS SCT. You can try it out in the latest version of the application.

Analytic functions with RESET WHEN

RESET WHEN is a Teradata feature used in SQL analytical window functions. It’s an extension to the ANSI SQL standard. RESET WHEN determines the partition over which a SQL window function operates based on a specified condition. If the condition evaluates to true, a new dynamic sub-partition is created inside the existing window partition.

For example, the following view uses RESET WHEN to compute a running total by store. The running total accumulates as long as sales increase month over month. If sales drop from one month to the next, the running total resets.

CREATE TABLE testschema.sales (
  store_id INTEGER
, month_no INTEGER
, sales_amount DECIMAL(9,2)
)
;

REPLACE VIEW testschema.running_total (
  store_id
, month_no
, sales_amount
, cume_sales_amount
)
AS
SELECT 
  store_id
, month_no
, sales_amount
, SUM(sales_amount) OVER (
     PARTITION BY store_id 
     ORDER BY month_no
     RESET WHEN sales_amount < SUM(sales_amount) OVER (
       PARTITION BY store_id
       ORDER BY month_no
       ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
     )
     ROWS UNBOUNDED PRECEDING 
  )
FROM testschema.sales;

To demonstrate, we insert some test data into the table:

INSERT INTO testschema.sales VALUES (1001, 1, 35000.00);
INSERT INTO testschema.sales VALUES (1001, 2, 40000.00);
INSERT INTO testschema.sales VALUES (1001, 3, 45000.00);
INSERT INTO testschema.sales VALUES (1001, 4, 25000.00);
INSERT INTO testschema.sales VALUES (1001, 5, 30000.00);
INSERT INTO testschema.sales VALUES (1001, 6, 30000.00);
INSERT INTO testschema.sales VALUES (1001, 7, 50000.00);
INSERT INTO testschema.sales VALUES (1001, 8, 35000.00);
INSERT INTO testschema.sales VALUES (1001, 9, 60000.00);
INSERT INTO testschema.sales VALUES (1001, 10, 80000.00);
INSERT INTO testschema.sales VALUES (1001, 11, 90000.00);
INSERT INTO testschema.sales VALUES (1001, 12, 100000.00);

The sales amounts drop after months 3 and 7. The running total is reset accordingly at months 4 and 8.

SELECT * FROM testschema.running_total;

   store_id     month_no  sales_amount  cume_sales_amount
-----------  -----------  ------------  -----------------
       1001            1      35000.00           35000.00
       1001            2      40000.00           75000.00
       1001            3      45000.00          120000.00
       1001            4      25000.00           25000.00
       1001            5      30000.00           55000.00
       1001            6      30000.00           85000.00
       1001            7      50000.00          135000.00
       1001            8      35000.00           35000.00
       1001            9      60000.00           95000.00
       1001           10      80000.00          175000.00
       1001           11      90000.00          265000.00
       1001           12     100000.00          365000.00

AWS SCT converts the view as follows. The converted code uses a subquery to emulate the RESET WHEN. Essentially, a marker attribute is added to the result that flags a month over month sales drop. The flag is then used to determine the longest preceding run of increasing sales to aggregate.

CREATE OR REPLACE VIEW testschema.running_total (
  store_id
, month_no
, sales_amount
, cume_sales_amount) AS
SELECT
  store_id
, month_no
, sales_amount
, sum(sales_amount) OVER (
    PARTITION BY k1, store_id
    ORDER BY month_no NULLS FIRST
    ROWS UNBOUNDED PRECEDING)
FROM (
  SELECT
   store_id
 , month_no
 , sales_amount
 , SUM(CASE WHEN k = 1 THEN 0 ELSE 1 END) OVER (
     PARTITION BY store_id
     ORDER BY month_no NULLS FIRST
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS k1
 FROM (
   SELECT
     store_id
   , month_no
   , sales_amount
   , CASE WHEN sales_amount < SUM(sales_amount) OVER 
      (PARTITION BY store_id ORDER BY month_no 
        ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) 
      OR sales_amount IS NULL THEN 0 ELSE 1 END AS k
   FROM testschema.sales
  )
);

We expect that RESET WHEN conversion will be a big hit with customers. You can try it now in AWS SCT.

TD_NORMALIZE_OVERLAP() function

The TD_NORMALIZE_OVERLAP function combines rows that have overlapping PERIOD values. The resulting normalized row contains the earliest starting bound and the latest ending bound from the PERIOD values of all the rows involved.

For example, we create a Teradata table that records employee salaries with the following code. Each row in the table is timestamped with the period that the employee was paid the given salary.

CREATE TABLE testschema.salaries (
  emp_id INTEGER
, salary DECIMAL(8,2)
, from_to PERIOD(DATE)
);

Now we add data for two employees. For emp_id = 1 and salary = 2000, there are two overlapping rows. Similarly, the two rows with emp_id = 2 and salary = 3000 are overlapping.

SELECT * FROM testschema.salaries ORDER BY emp_id, from_to;

     emp_id      salary  from_to
-----------  ----------  ------------------------
          1     1000.00  ('20/01/01', '20/05/31')
          1     2000.00  ('20/06/01', '21/02/28')
          1     2000.00  ('21/01/01', '21/06/30')
          2     3000.00  ('20/01/01', '20/03/31')
          2     3000.00  ('20/02/01', '20/04/30')

Now we create a view that uses the TD_NORMALIZE_OVERLAP function to normalize the overlapping data:

REPLACE VIEW testschema.normalize_salaries AS 
WITH sub_table(emp_id, salary, from_to) AS (
  SELECT 
    emp_id
  , salary
  , from_to
  FROM testschema.salaries
)
SELECT *
FROM 
  TABLE(TD_SYSFNLIB.TD_NORMALIZE_OVERLAP (NEW VARIANT_TYPE(sub_table.emp_id, sub_table.salary), sub_table.from_to)
    RETURNS (emp_id INTEGER, salary DECIMAL(8,2), from_to PERIOD(DATE))
    HASH BY emp_id
    LOCAL ORDER BY emp_id, salary, from_to
  ) AS DT(emp_id, salary, duration)
;

We can check that the view data is actually normalized:

select * from testschema.normalize_salaries order by emp_id, duration;

     emp_id      salary  duration
-----------  ----------  ------------------------
          1     1000.00  ('20/01/01', '20/05/31')
          1     2000.00  ('20/06/01', '21/06/30')
          2     3000.00  ('20/01/01', '20/04/30')

You can now use AWS SCT to convert any TD_NORMALIZE_OVERLAP statements. We first convert the salaries table to Amazon Redshift (see Accelerate your data warehouse migration to Amazon Redshift – Part 2 for details about period data type automation):

CREATE TABLE testschema.salaries (
  emp_id integer distkey
, salary numeric(8,2) ENCODE az64
, from_to_begin date ENCODE az64
, from_to_end date ENCODE az64    
)                                   
DISTSTYLE KEY                       
SORTKEY (emp_id);

# SELECT * FROM testschema.salaries ORDER BY emp_id, from_to_begin;
 emp_id | salary  | from_to_begin | from_to_end 
      1 | 1000.00 | 2020-01-01    | 2020-05-31
      1 | 2000.00 | 2020-06-01    | 2021-02-28
      1 | 2000.00 | 2021-01-01    | 2021-06-30
      2 | 3000.00 | 2020-01-01    | 2020-03-31
      2 | 3000.00 | 2020-02-01    | 2020-04-30

Now we use AWS SCT to convert the normalize_salaries view. AWS SCT adds a column that marks the start of a new group of rows. It then produces a single row for each group with a normalized timestamp.

CREATE VIEW testschema.normalize_salaries (emp_id, salary, from_to_begin, from_to_end) AS
WITH sub_table AS (
  SELECT
    emp_id
  , salary
  , from_to_begin AS start_date
  , from_to_end AS end_date
  , CASE
      WHEN start_date <= lag(end_date) OVER (PARTITION BY emp_id, salary ORDER BY start_date, end_date) THEN 0 
      ELSE 1
    END AS GroupStartFlag
    FROM testschema.salaries
  )
SELECT
  t2.emp_id
, t2.salary
, min(t2.start_date) AS from_to_begin
, max(t2.end_date) AS from_to_end
FROM (
  SELECT
    emp_id
  , salary
  , start_date
  , end_date
  , sum(GroupStartFlag) OVER (PARTITION BY emp_id, salary ORDER BY start_date ROWS UNBOUNDED PRECEDING) AS GroupID
  FROM 
    sub_table
) AS t2
GROUP BY 
  t2.emp_id
, t2.salary
, t2.GroupID;

We can check that the converted view returns the correctly normalized data:

# SELECT * FROM testschema.normalize_salaries ORDER BY emp_id;
 emp_id | salary  | from_to_begin | from_to_end 
      1 | 1000.00 | 2020-01-01    | 2020-05-31
      1 | 2000.00 | 2020-06-01    | 2021-06-30
      2 | 3000.00 | 2020-01-01    | 2020-04-30

You can try out TD_NORMALIZE_OVERLAP conversion in the latest release of AWS SCT. Download it now.

TD_UNPIVOT() function

The TD_UNPIVOT function transforms columns into rows. Essentially, we use it to take a row of similar metrics over different time periods and create a separate row for each metric.

For example, consider the following Teradata table. The table records customer visits by year and month for small kiosk stores:

CREATE TABLE TESTSCHEMA.kiosk_monthly_visits (
  kiosk_id INTEGER
, year_no INTEGER
, jan_visits INTEGER
, feb_visits INTEGER
, mar_visits INTEGER
, apr_visits INTEGER
, may_visits INTEGER
, jun_visits INTEGER
, jul_visits INTEGER
, aug_visits INTEGER
, sep_visits INTEGER
, oct_visits INTEGER
, nov_visits INTEGER
, dec_visits INTEGER)
PRIMARY INDEX (kiosk_id);

We insert some sample data into the table:

INSERT INTO testschema.kiosk_monthly_visits VALUES (100, 2020, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200);

Next, we create a view that unpivots the table so that the monthly visits appear on separate rows. The single row in the pivoted table creates 12 rows in the unpivoted table, one row per month.

REPLACE VIEW testschema.unpivot_kiosk_monthly_visits (
  kiosk_id
, year_no
, month_name
, month_visits
)
AS
SELECT 
  kiosk_id
, year_no
, month_name (FORMAT 'X(10)')
, month_visits
FROM TD_UNPIVOT (
 ON (SELECT * FROM testschema.kiosk_monthly_visits)
 USING
 VALUE_COLUMNS ('month_visits')
 UNPIVOT_COLUMN('month_name')
 COLUMN_LIST(
   'jan_visits'
 , 'feb_visits'
 , 'mar_visits'
 , 'apr_visits'
 , 'may_visits'
 , 'jun_visits'
 , 'jul_visits'
 , 'aug_visits'
 , 'sep_visits'
 , 'oct_visits'
 , 'nov_visits'
 , 'dec_visits'
 )
 COLUMN_ALIAS_LIST (
   'jan'
 , 'feb'
 , 'mar'
 , 'apr'
 , 'may'
 , 'jun'
 , 'jul'
 , 'aug'
 , 'sep'
 , 'oct'
 , 'nov'
 , 'dec'
 )
) a;

When you select from the view, the monthly visits are unpivoted into 12 separate rows:

SELECT * FROM testschema.unpivot_kiosk_monthly_visits;

   kiosk_id      year_no  month_name  month_visits
-----------  -----------  ----------  ------------
        100         2020  jan                 1100
        100         2020  feb                 1200
        100         2020  mar                 1300
        100         2020  apr                 1400
        100         2020  may                 1500
        100         2020  jun                 1600
        100         2020  jul                 1700
        100         2020  aug                 1800
        100         2020  sep                 1900
        100         2020  oct                 2000
        100         2020  nov                 2100
        100         2020  dec                 2200

Now we use AWS SCT to convert the view into ANSI SQL that can be run on Amazon Redshift. The conversion creates a common table expression (CTE) to place each month in a separate row. It then joins the CTE and the remaining attributes from the original pivoted table.

CREATE OR REPLACE VIEW testschema.unpivot_kiosk_monthly_visits (kiosk_id, year_no, month_name, month_visits) AS
WITH cols
AS (SELECT
    'jan' AS col
UNION ALL
SELECT
    'feb' AS col
UNION ALL
SELECT
    'mar' AS col
UNION ALL
SELECT
    'apr' AS col
UNION ALL
SELECT
    'may' AS col
UNION ALL
SELECT
    'jun' AS col
UNION ALL
SELECT
    'jul' AS col
UNION ALL
SELECT
    'aug' AS col
UNION ALL
SELECT
    'sep' AS col
UNION ALL
SELECT
    'oct' AS col
UNION ALL
SELECT
    'nov' AS col
UNION ALL
SELECT
    'dec' AS col)
SELECT
    t1.kiosk_id, t1.year_no, col AS "month_name",
    CASE col
        WHEN 'jan' THEN "jan_visits"
        WHEN 'feb' THEN "feb_visits"
        WHEN 'mar' THEN "mar_visits"
        WHEN 'apr' THEN "apr_visits"
        WHEN 'may' THEN "may_visits"
        WHEN 'jun' THEN "jun_visits"
        WHEN 'jul' THEN "jul_visits"
        WHEN 'aug' THEN "aug_visits"
        WHEN 'sep' THEN "sep_visits"
        WHEN 'oct' THEN "oct_visits"
        WHEN 'nov' THEN "nov_visits"
        WHEN 'dec' THEN "dec_visits"
        ELSE NULL
    END AS "month_visits"
    FROM testschema.kiosk_monthly_visits AS t1
    CROSS JOIN cols
    WHERE month_visits IS NOT NULL;

You can check that the converted view produces the same result as the Teradata version:

# SELECT * FROM testschema.unpivot_kiosk_monthly_visits;
 kiosk_id | year_no | month_name | month_visits 
      100 |    2020 | oct        |        2000
      100 |    2020 | nov        |        2100
      100 |    2020 | jul        |        1700
      100 |    2020 | feb        |        1200
      100 |    2020 | apr        |        1400
      100 |    2020 | aug        |        1800
      100 |    2020 | sep        |        1900
      100 |    2020 | jan        |        1100
      100 |    2020 | mar        |        1300
      100 |    2020 | may        |        1500
      100 |    2020 | jun        |        1600
      100 |    2020 | dec        |        2200

You can try out the conversion support for TD_UNPIVOT in the latest version of AWS SCT.

QUANTILE function

QUANTILE is a ranking function. It partitions the input set into a specified number of groups, each containing an equal portion of the total population. QUANTILE is a proprietary Teradata extension of the NTILE function found in ANSI SQL.

For example, we can compute the quartiles of the monthly visit data using the following Teradata view:

REPLACE VIEW testschema.monthly_visit_rank AS
SELECT
  kiosk_id
, year_no
, month_name
, month_visits
, QUANTILE(4, month_visits) qtile
FROM
 testschema.unpivot_kiosk_monthly_visits
;

When you select from the view, the QUANTILE function computes the quartile and applies it as an attribute on the output:

SELECT * FROM monthly_visit_rank;

   kiosk_id      year_no  month_name  month_visits        qtile
-----------  -----------  ----------  ------------  -----------
        100         2020  jan                 1100            0
        100         2020  feb                 1200            0
        100         2020  mar                 1300            0
        100         2020  apr                 1400            1
        100         2020  may                 1500            1
        100         2020  jun                 1600            1
        100         2020  jul                 1700            2
        100         2020  aug                 1800            2
        100         2020  sep                 1900            2
        100         2020  oct                 2000            3
        100         2020  nov                 2100            3
        100         2020  dec                 2200            3

Amazon Redshift supports NTILE, an ANSI-compliant window function that can implement QUANTILE. We’ve enhanced AWS SCT to automatically convert QUANTILE function calls to equivalent NTILE function calls. Because QUANTILE numbers its groups starting from 0 and NTILE starts from 1, the converted code subtracts 1 from the NTILE result.

For example, when you convert the preceding Teradata view, AWS SCT produces the following Amazon Redshift code:

SELECT 
  unpivot_kiosk_monthly_visits.kiosk_id
, unpivot_kiosk_monthly_visits.year_no
, unpivot_kiosk_monthly_visits.month_name
, unpivot_kiosk_monthly_visits.month_visits
, (ntile(4) OVER (ORDER BY unpivot_kiosk_monthly_visits.month_visits ASC NULLS FIRST) - 1) AS qtile
FROM 
  testschema.unpivot_kiosk_monthly_visits
;

QUANTILE conversion support is available now in AWS SCT.

QUALIFY filter

The QUALIFY clause in Teradata filters rows produced by an analytic function. Let’s look at an example. We use the following table, which contains store revenue by month. Our goal is to find the top five months by revenue:

CREATE TABLE testschema.sales (
  store_id INTEGER
, month_no INTEGER
, sales_amount DECIMAL(9,2))
PRIMARY INDEX (store_id);


SELECT * FROM sales;

   store_id     month_no  sales_amount
-----------  -----------  ------------
       1001            1      35000.00
       1001            2      40000.00
       1001            3      45000.00
       1001            4      25000.00
       1001            5      30000.00
       1001            6      30000.00
       1001            7      50000.00
       1001            8      35000.00
       1001            9      60000.00
       1001           10      80000.00
       1001           11      90000.00
       1001           12     100000.00

The data shows that July, September, October, November, and December were the top five sales months.

We create a view that uses the RANK function to rank each month by sales, then use the QUALIFY function to select the top five months:

REPLACE VIEW testschema.top_five_months(
  store_id
, month_no
, sales_amount
, month_rank
) as
SELECT
  store_id
, month_no
, sales_amount
, RANK() OVER (PARTITION BY store_id ORDER BY sales_amount DESC) month_rank
FROM
  testschema.sales
QUALIFY RANK() OVER (PARTITION by store_id ORDER BY sales_amount DESC) <= 5
;

Before, if you used the QUALIFY clause, you had to manually recode your SQL statements. Now, AWS SCT automatically converts QUALIFY into Amazon Redshift-compatible, ANSI-compliant SQL. For example, AWS SCT rewrites the preceding view as follows:

CREATE OR REPLACE VIEW testschema.top_five_months (
  store_id
, month_no
, sales_amount
, month_rank) AS
SELECT
  qualify_subquery.store_id
, qualify_subquery.month_no
, qualify_subquery.sales_amount
, month_rank
FROM (
  SELECT
    store_id
  , month_no
  , sales_amount
  , rank() OVER (PARTITION BY store_id ORDER BY sales_amount DESC NULLS FIRST) AS month_rank
  , rank() OVER (PARTITION BY store_id ORDER BY sales_amount DESC NULLS FIRST) AS qualify_expression_1
  FROM testschema.sales) AS qualify_subquery
  WHERE 
    qualify_expression_1 <= 5;

AWS SCT converts the original query into a subquery and applies the QUALIFY expression as a filter on the subquery. AWS SCT adds an additional column to the subquery for the purpose of filtering. This extra column isn’t strictly needed, but it simplifies the converted code when column aliases aren’t used.

You can try QUALIFY conversion in the latest version of AWS SCT.

Summary

We’re happy to share these new features with you. If you’re contemplating a migration to Amazon Redshift, these capabilities can help automate your schema conversion and preserve your investment in existing reports and applications. If you’re looking to get started on a data warehouse migration, you can learn more about Amazon Redshift and AWS SCT from our public documentation.

This post described a few of the dozens of new features we’re introducing to automate your Teradata migrations to Amazon Redshift. We’ll share more in upcoming posts about automation for proprietary Teradata features and other exciting new capabilities.

Check back soon for more information. Until then, you can learn more about Amazon Redshift and the AWS Schema Conversion Tool. Happy migrating!


About the Authors

Michael Soo is a Senior Database Engineer with the AWS Database Migration Service team. He builds products and services that help customers migrate their database workloads to the AWS cloud.

Raza Hafeez is a Data Architect within the Lake House Global Specialty Practice of AWS Professional Services. He has over 10 years of professional experience building and optimizing enterprise data warehouses and is passionate about enabling customers to realize the power of their data. He specializes in migrating enterprise data warehouses to AWS Lake House Architecture.

Po Hong, PhD, is a Principal Data Architect of Lake House Global Specialty Practice, AWS Professional Services. He is passionate about supporting customers to adopt innovative solutions to reduce time to insight. Po is specialized in migrating large scale MPP on-premises data warehouses to the AWS Lake House architecture.

Entong Shen is a Software Development Manager of Amazon Redshift. He has been working on MPP databases for over 9 years and has focused on query optimization, statistics, and migration-related SQL language features such as stored procedures and data types.

Sumit Singh is a database engineer with the Database Migration Service team at Amazon Web Services. He works closely with customers and provides technical assistance to migrate their on-premises workloads to the AWS Cloud. He also assists in continuously improving the quality and functionality of AWS data migration products.

Nelly Susanto is a Senior Database Migration Specialist of AWS Database Migration Accelerator. She has over 10 years of technical background focusing on migrating and replicating databases along with data-warehouse workloads. She is passionate about helping customers in their cloud journey.

What to Consider when Selecting a Region for your Workloads

Post Syndicated from Saud Albazei original https://aws.amazon.com/blogs/architecture/what-to-consider-when-selecting-a-region-for-your-workloads/

The AWS Cloud is an ever-growing network of Regions and points of presence (PoP), with a global network infrastructure that connects them together. With such a vast selection of Regions, costs, and services available, it can be challenging for startups to select the optimal Region for a workload. This decision must be made carefully, as it has a major impact on compliance, cost, performance, and services available for your workloads.

Evaluating Regions for deployment

There are four main factors that play into evaluating each AWS Region for a workload deployment:

  1. Compliance. If your workload contains data that is bound by local regulations, then selecting the Region that complies with the regulation overrides other evaluation factors. This applies to workloads that are bound by data residency laws where choosing an AWS Region located in that country is mandatory.
  2. Latency. A major factor to consider for user experience is latency. Reduced network latency can have a substantial impact on the user experience. Choosing an AWS Region in close proximity to your user base lowers network latency. It can also increase communication quality, given that network packets have fewer exchange points to travel through.
  3. Cost. AWS services are priced differently from one Region to another. Some Regions have lower cost than others, which can result in a cost reduction for the same deployment.
  4. Services and features. Newer services and features are deployed to Regions gradually. Although all AWS Regions have the same service level agreement (SLA), some larger Regions are usually first to offer newer services, features, and software releases. Smaller Regions may not get these services or features in time for you to use them to support your workload.

Evaluating all these factors can make coming to a decision complicated. This is where your priorities as a business should influence the decision.

Assess potential Regions for the right option

Start by shortlisting potential Regions and evaluating each one.

  • Check if these Regions are compliant and have the services and features you need to run your workload using the AWS Regional Services website.
  • Check feature availability of each service and versions available, if your workload has specific requirements.
  • Calculate the cost of the workload on each Region using the AWS Pricing Calculator.
  • Test the network latency between your user base location and each AWS Region (see the latency probe sketch after this list).
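
The following is a minimal latency probe, intended only as an illustration: it measures the median TCP connection time from your location to the public EC2 endpoint in each candidate Region. The Region list and sample count are assumptions to adjust for your own shortlist.

import socket
import statistics
import time

# Candidate Regions to compare; adjust to your own shortlist.
REGIONS = ["me-south-1", "eu-west-1", "us-east-1"]

def median_connect_ms(host, port=443, samples=5):
    """Median TCP connect time to host:port, in milliseconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=5):
            pass
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)

for region in REGIONS:
    endpoint = f"ec2.{region}.amazonaws.com"  # public regional service endpoint
    print(f"{region}: ~{median_connect_ms(endpoint):.0f} ms")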

At this point, you should have a list of AWS Regions with varying cost and network latency that looks something like Table 1:

Region     Compliance   Latency   Cost   Services / Features
Region A                15 ms     $$
Region B                20 ms     $$$    X
Region C                80 ms     $

Table 1. Region evaluation matrix

Many workloads, such as high performance computing (HPC), analytics, and machine learning (ML), are not directly linked to a customer-facing application. These are not sensitive to network latency, so you may want to select the Region with the lowest cost.

Alternatively, you may have a backend service for a game or mobile application in which network latency has a direct impact on user experience. Measure the difference in network latency between each Region, and determine if it is worth the increased cost. You can leverage the Amazon CloudFront edge network, which helps reduce latency and increase communication quality. This is because it uses a fully managed AWS network infrastructure, which connects your application to the edge location nearest to your users.

Multi-Region deployment

You can also split the workload across multiple Regions. The same workload may have some components that are sensitive to network latency and some that are not. You may determine you can benefit from both lower network latency and reduced cost at the same time. Here’s an example:

Figure 1. Multi-Region deployment optimized for feature availability

Figure 1 shows a serverless application deployed in the Bahrain Region (me-south-1), which is in close proximity to the customer base in Riyadh, Saudi Arabia. Application users benefit from lower network latency when connecting to the AWS Cloud. Analytics workloads are deployed in the Ireland Region (eu-west-1), which has a lower cost for Amazon Redshift and other features.

Note that data transfer between Regions is not free and, in this example, costs $0.115 per GB. However, even with this additional cost factored in, running the analytical workload in Ireland (eu-west-1) is still more cost-effective. You can also benefit from additional capabilities and features that may have not yet been released in the Bahrain (me-south-1) Region.

This multi-Region setup could also be beneficial for applications with a global user base. The application can be deployed in multiple secondary AWS Regions closer to the user base locations. It uses a primary AWS Region with a lower cost for consolidated services and latency-insensitive workloads.

Figure 2. Multi-Region deployment optimized for network latency

Figure 2 shows an application that spans multiple Regions to serve read requests with the lowest possible network latency. Each client is routed to the nearest AWS Region. For read requests, an Amazon Route 53 latency routing policy is used. For write requests, an endpoint routed to the primary Region is used. This primary endpoint can also have periodic health checks to fail over to a secondary Region for disaster recovery (DR).
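
As a rough sketch of the latency routing piece, the following upserts one latency-based record per Region with the Route 53 API. The hosted zone ID, domain, and regional endpoints are hypothetical placeholders.

import boto3

route53 = boto3.client("route53")

# One latency-based record per Region; Route 53 answers each query with the
# record whose Region has the lowest measured latency to the caller.
regional_endpoints = {
    "me-south-1": "app-bahrain.example.com",
    "eu-west-1": "app-ireland.example.com",
}

for region, endpoint in regional_endpoints.items():
    route53.change_resource_record_sets(
        HostedZoneId="Z0EXAMPLE",  # hypothetical hosted zone ID
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "api.example.com",
                "Type": "CNAME",
                "SetIdentifier": region,   # distinguishes the latency records
                "Region": region,
                "TTL": 60,
                "ResourceRecords": [{"Value": endpoint}],
            },
        }]},
    )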

Other factors may also apply for certain applications, such as those that require Amazon EC2 Spot Instances. Regions differ in size, with some having three Availability Zones (AZs) and others up to six. This results in varying Spot Instance capacity available for Amazon EC2. Larger Regions offer larger Spot capacity, and a multi-Region deployment offers the most Spot capacity.

Conclusion

Selecting the optimal AWS Region is an important first step when deploying new workloads. There are many other scenarios in which splitting the workload across multiple AWS Regions can result in a better user experience and cost reduction. The four factors mentioned in this blog post can be evaluated together to find the most appropriate Region to deploy your workloads.

If the workload is bound by any regulations, shortlist the Regions that are compliant. Measure the network latency between each Region and the location of the user base. Estimate the workload cost for each Region. Check that the shortlisted Regions have the services and features your workload requires. And finally, determine if your workload can benefit from running in multiple Regions.

Dive deeper into the AWS Global Infrastructure Website for more information.

How Rapid7 built multi-tenant analytics with Amazon Redshift using near-real-time datasets

Post Syndicated from Rahul Monga original https://aws.amazon.com/blogs/big-data/how-rapid7-built-multi-tenant-analytics-with-amazon-redshift-using-near-real-time-datasets/

This is a guest post co-written by Rahul Monga, Principal Software Engineer at Rapid7.

Rapid7 InsightVM is a vulnerability assessment and management product that provides visibility into the risks present across an organization. It equips you with the reporting, automation, and integrations needed to prioritize and fix those vulnerabilities in a fast and efficient manner. InsightVM has more than 5,000 customers across the globe, runs exclusively on AWS, and is available for purchase on AWS Marketplace.

To provide near-real-time insights to InsightVM customers, Rapid7 has recently undertaken a project to enhance the dashboards in their multi-tenant software as a service (SaaS) portal with metrics, trends, and aggregated statistics on vulnerability information identified in their customer assets. They chose Amazon Redshift as the data warehouse to power these dashboards due to its ability to deliver fast query performance on gigabytes to petabytes of data.

In this post, we discuss the design options that Rapid7 evaluated to build a multi-tenant data warehouse and analytics platform for InsightVM. We will deep dive into the challenges and solutions related to ingesting near-real-time datasets and how to create a scalable reporting solution that can efficiently run queries across more than 3 trillion rows. This post also discusses an option to address the scenario where a particular customer outgrows the average data access needs.

This post uses the terms customers, tenants, and organizations interchangeably to represent Rapid7 InsightVM customers.

Background

To collect data for InsightVM, customers can use scan engines or Rapid7’s Insight Agent. Scan engines allow you to collect vulnerability data on every asset connected to a network. This data is only collected when a scan is run. Alternatively, you can install the Insight Agent on individual assets to collect and send asset change information to InsightVM numerous times each day. The agent also ensures that asset data is sent to InsightVM regardless of whether or not the asset is connected to your network.

Data from scans and agents is sent in the form of packed documents, in micro-batches of hundreds of events. Around 500 documents per second are received across customers, and each document is around 2 MB in size. On a typical day, InsightVM processes 2–3 trillion rows of vulnerability data, which translates to around 56 GB of compressed data for a large customer. This data is normalized and processed by InsightVM’s vulnerability management engine and streamed to the data warehouse system for near-real-time availability of data for analytical insights to customers.

Architecture overview

In this section, we discuss the overall architectural setup for the InsightVM system.

Scan engines and agents collect and send asset information to the InsightVM cloud. Asset data is pooled, normalized, and processed to identify vulnerabilities. This is stored in an Amazon ElastiCache for Redis cluster and also pushed to Amazon Kinesis Data Firehose for use in near-real time by InsightVM’s analytics dashboards. Kinesis Data Firehose delivers raw asset data to an Amazon Simple Storage Service (Amazon S3) bucket. The data is transformed using a custom-developed ingestor service and stored in a new S3 bucket. The transformed data is then loaded into the Amazon Redshift data warehouse. Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), and AWS Lambda are used to orchestrate this data flow. In addition, to identify the latest timestamp of vulnerability data for assets, an auxiliary table is maintained and updated periodically with the update logic in the Lambda function, which is triggered through an Amazon CloudWatch event rule. Custom-built middleware components interface between the web user interface (UI) and the Amazon Redshift cluster to fetch asset information for display in dashboards.

The following diagram shows the implementation architecture of InsightVM, including the data warehouse system:

Rapid-7 Multi-tenant Architecture

The architecture has built-in tenant isolation because data access is abstracted through the API. The application uses a dimensional model to support low-latency queries and extensibility for future enhancements.

Amazon Redshift data warehouse design: Options evaluated and selection

Considering Rapid7’s need for near-real-time analytics at any scale, the InsightVM data warehouse system is designed to meet the following requirements:

  • Ability to view asset vulnerability data at near-real time, within 5–10 minutes of ingest
  • Less than 5 seconds’ latency, measured at the 95th percentile (p95), for reporting queries
  • Ability to support 15 concurrent queries per second, with the option to support more in the future
  • Simple and easy-to-manage data warehouse infrastructure
  • Data isolation for each customer or tenant

Rapid7 evaluated Amazon Redshift RA3 instances to support these requirements. When designing the Amazon Redshift schema to support these goals, they evaluated the following strategies:

  • Bridge model – Storage and access to data for each tenant is controlled at the individual schema level in the same database. In this approach, multiple schemas are set up, where each schema is associated with a tenant, with the same exact structure of the dimensional model.
  • Pool model – Data is stored in a single database schema for all tenants, and a new column (tenant_id) is used to scope and control access to individual tenant data. Access to the multi-tenant data is controlled using API-level access to the tables. Tenants aren’t aware of the underlying implementation of the analytical system and can’t query them directly.

For more information about multi-tenant models, see Implementing multi-tenant patterns in Amazon Redshift using data sharing.

When Rapid7 initially evaluated the bridge model, it offered the advantage of querying only a single tenant’s data, plus the ability to decouple a tenant to an independent cluster if it outgrows the resources available in the single cluster. Also, when the p95 metrics were evaluated in this setup, the query response times were less than 5 seconds, because each tenant’s data is isolated into smaller tables. However, the major concern with this approach was the near-real-time data ingestion into over 50,000 tables (5,000 customer schemas x approximately 10 tables per schema) every 5 minutes. Having thousands of commits every minute into an online analytical processing (OLAP) system like Amazon Redshift can exhaust most resources in the ingestion process. As a result, the application suffers query latencies as data grows.

The pool model provides a simpler setup, but the concern was query latency when multiple tenants access the application from the same tables. Rapid7 hoped that these concerns would be addressed by Amazon Redshift’s support for massively parallel processing (MPP), which enables fast execution of the most complex queries operating on large amounts of data. With the right table design using the right sort and distribution keys, it’s possible to optimize the setup. Furthermore, with automatic table optimization, the Amazon Redshift cluster can make these determinations without any manual input.

Rapid7 evaluated both the pool and bridge model designs, and decided to implement the pool model. This model provides simplified data ingestion and can support query latencies of under 5 seconds at p95 with the right table design. The following table summarizes the results of p95 tests conducted with the pool model setup.

Query | P95
Large customer: Query with multiple joins, which lists assets, their vulnerabilities, and all their related attributes, with aggregated metrics for each asset, and filters to scope assets by attributes like location, names, and addresses | Less than 4 seconds
Large customer: Query to return vulnerability content information given a list of vulnerability identifiers | Less than 4 seconds

Tenant isolation and security

Tenant isolation is fundamental to the design and development of SaaS systems. It enables SaaS providers to reassure customers that, even in a multi-tenant environment, their resources can’t be accessed by other tenants.

With the Amazon Redshift table design using the pool model, Rapid7 built a separate data access layer in the middleware that templatized queries, augmented with runtime parameter substitution to uniquely filter specific tenant and organization data.

The following is a sample templatized query:

<#if useDefaultVersion()>
currentAssetInstances AS (
SELECT tablename.*
FROM tablename
<#if (applyTags())>
JOIN dim_asset_tag USING (organization_id, attribute2, attribute3)
</#if>
WHERE organization_id ${getOrgIdFilter()}
<#if (applyTags())>
AND tag_id IN ($(tagIds))
</#if>
),
</#if>

The following is a Java interface snippet to populate the template:

public interface TemplateParameters {

    boolean useDefaultVersion();

    boolean useVersion();

    default Set<String> getVersions() {
        return null;
    }

    default String getVersionJoin(String var1) {
        return "";
    }

    String getTemplateName();

    String getOrgIdString();

    default String getOrgIdFilter() {
        return "";
    }
}

Every query uses organization_id and additional parameters to uniquely access tenant data. During runtime, organization_id and other metadata are extracted from the secured JWT token that is passed to middleware components after the user is authenticated in the Rapid7 cloud platform.

Best practices and lessons learned

To fully realize the benefits of the Amazon Redshift architecture for multiple tenants and near-real-time ingestion, careful table design lets you take full advantage of massively parallel processing and columnar data storage. In this section, we discuss the best practices and lessons learned from building this solution.

Sort key for effective data pruning

Sorting a table on an appropriate sort key can accelerate query performance, especially queries with range-restricted predicates, by requiring fewer table blocks to be read from disk. To have Amazon Redshift choose the appropriate sort order, the AUTO option was utilized. Automatic table optimization continuously observes how queries interact with tables and discovers the right sort key for the table. To effectively prune the data by the tenant, organization_id is identified as the sort key to perform the restricted scans. Furthermore, because all queries are routed through the data access layer, organization_id is automatically added in the predicate conditions to ensure effective use of the sort keys.
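
The following is a minimal sketch of these two pieces, assuming a hypothetical pool-model table and access layer (not Rapid7’s actual schema): automatic table optimization manages the sort key, and every query issued by the access layer carries the organization_id predicate so scans stay range-restricted.

# Hypothetical table and access-layer helper, for illustration only.
ENABLE_AUTO_SORTKEY = (
    "ALTER TABLE analytics.fact_asset_vulnerability ALTER SORTKEY AUTO;"
)

def tenant_scoped_query(base_sql: str, params: tuple, organization_id: str) -> tuple:
    """Append the tenant filter so every scan is range-restricted on the sort key."""
    scoped_sql = f"{base_sql.rstrip(';')} AND organization_id = %s;"
    return scoped_sql, params + (organization_id,)

sql, params = tenant_scoped_query(
    "SELECT asset_id, vulnerability_id, severity "
    "FROM analytics.fact_asset_vulnerability "
    "WHERE severity >= %s",
    (7,),
    "org-12345",
)
# sql and params can now be run through any Amazon Redshift client or driver.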

Micro-batches for data ingestion

Amazon Redshift is designed for large data ingestion, rather than transaction processing. The cost of commits is relatively high, and excessive use of commits can result in queries waiting for access to the commit queue. Data is micro-batched during ingestion as it arrives for multiple organizations. This results in fewer transactions and commits when ingesting the data.

Load data in bulk

If you use multiple concurrent COPY commands to load one table from multiple files, Amazon Redshift is forced to perform a serialized load, and this type of load is much slower.

The Amazon Redshift manifest file is used to ingest the datasets that span multiple files in a single COPY command, which allows fast ingestion of data in each micro-batch.
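
The following is a hedged sketch of such a micro-batch load. The bucket, manifest, IAM role, table, and connection details are placeholders, and the files are assumed to be gzipped CSV.

import redshift_connector

# Connection and object names below are hypothetical placeholders.
conn = redshift_connector.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    database="vulndb",
    user="ingest_user",
    password="...",
)
conn.autocommit = True

# One COPY per micro-batch: the manifest lists every file in the batch,
# so Redshift loads them in parallel within a single commit.
copy_sql = """
    COPY analytics.asset_vulnerability_staging
    FROM 's3://example-ingest-bucket/microbatches/batch-000123.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole'
    FORMAT AS CSV GZIP
    MANIFEST;
"""

cur = conn.cursor()
cur.execute(copy_sql)
cur.close()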

RA3 instances for data sharing

Rapid7 uses Amazon Redshift RA3 instances, which enable data sharing so you can securely and easily share live data across Amazon Redshift clusters for reads. In this multi-tenant architecture, when a tenant outgrows the average data access needs, it can easily be isolated to a separate cluster and independently scaled using data sharing. This is accomplished by monitoring the STL_SCAN table to identify different tenants and isolate them to allow for independent scalability as needed.

Concurrency scaling for consistently fast query performance

When concurrency scaling is enabled, Amazon Redshift automatically adds additional cluster capacity when you need it to process an increase in concurrent read queries. To meet the uptick in user requests, the concurrency scaling feature is enabled to dynamically bring up additional capacity to provide consistent p95 values that meet Rapid7’s defined requirements for the InsightVM application.
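
The following sketch shows one way a WLM queue could be switched to automatic concurrency scaling through the Amazon Redshift API; the parameter group name and queue settings are assumptions, not Rapid7’s actual configuration.

import boto3
import json

redshift = boto3.client("redshift")

# Hypothetical WLM configuration: a single queue that bursts to transient
# clusters when concurrent read queries start to queue up.
wlm_config = [
    {
        "query_group": [],
        "user_group": [],
        "query_concurrency": 5,
        "concurrency_scaling": "auto",
    }
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="example-insightvm-params",  # hypothetical parameter group
    Parameters=[
        {
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
            "ApplyType": "dynamic",
        }
    ],
)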

Results and benefits

Rapid7 saw the following results from this architecture:

  • The new architecture has reduced the time required to make data accessible to customers to less than 5 minutes on average. The previous architecture had higher level of processing time variance, and could sometimes exceed 45 minutes
  • Dashboards load faster and have enhanced drill-down functionality, improving the end-user experience
  • With all data in a single warehouse, InsightVM has a single source of truth, compared to the previous solution where InsightVM had copies of data maintained in different databases and domains, which could occasionally get out of sync
  • The new architecture lowers InsightVM’s reporting infrastructure cost by almost three times, as compared to the previous architecture

Conclusion

With Amazon Redshift, the Rapid7 team has been able to centralize asset and vulnerability information for InsightVM customers. The team has simultaneously met its performance and management objectives with the use of a multi-tenant pool model and optimized table design. In addition, data ingestion via Kinesis Data Firehose and custom-built microservices to load data into Amazon Redshift in near-real time enabled Rapid7 to deliver asset vulnerability information to customers more than nine times faster than before, improving the InsightVM customer experience.


About the Authors

Rahul Monga is a Principal Software Engineer at Rapid7, currently working on the next iteration of InsightVM. Rahul’s focus areas are highly distributed cloud architectures and big data processing. Originally from the Washington DC area, Rahul now resides in Austin, TX with his wife, daughter, and adopted pup.

Sujatha Kuppuraju is a Senior Solutions Architect at Amazon Web Services (AWS). She works with ISV customers to help design secured, scalable and well-architected solutions on the AWS Cloud. She is passionate about solving complex business problems with the ever-growing capabilities of technology.

Thiyagarajan Arumugam is a Principal Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

Analyze daily trading activity using transaction data from Amazon Redshift in Amazon FinSpace

Post Syndicated from Mariia Berezina original https://aws.amazon.com/blogs/big-data/analyze-daily-trading-activity-using-transaction-data-from-amazon-redshift-in-amazon-finspace/

Financial services organizations use data from various sources to discover new insights and improve trading decisions. Finding the right dataset and getting access to the data can frequently be a time-consuming process. For example, to analyze daily trading activity, analysts need to find a list of available databases and tables, identify its owner’s contact information, get access, understand the table schema, and load the data. They repeat this process for every additional dataset needed for the analysis.

Amazon FinSpace makes it easy for analysts and quants to discover, analyze, and share data, reducing the time it takes to find and access financial data from months to minutes. To get started, FinSpace admins create a category and an attribute set to capture relevant external reference information such as database type and table name. After connecting to a data source or uploading data directly through the FinSpace user interface (UI), you can create datasets in FinSpace that include schema and other relevant information. Analysts can then search the catalog for necessary datasets and connect to them using the FinSpace web interface or through the FinSpace JupyterLab notebook.

Amazon Redshift is a popular choice for storing and querying exabytes of structured and semi-structured data such as trade transactions. In this post, we explore how to connect to an Amazon Redshift data warehouse from FinSpace through a Spark SQL JDBC connection and populate the FinSpace catalog with metadata such as schema details, dataset owner, and description. We then show how simple it is to use the FinSpace catalog to discover available data and to connect to an Amazon Redshift cluster from a Jupyter notebook in FinSpace to read daily trades for Amazon (AMZN) stock. Finally, we evaluate how well our stock purchases were executed by comparing the transactions stored in Amazon Redshift to the trading history for the stock stored in FinSpace.

Solution overview

The blog post covers the following steps:

  1. Configure your FinSpace catalog to describe your Amazon Redshift tables.
  2. Use FinSpace notebooks to connect to Amazon Redshift.
  3. Populate the FinSpace catalog with tables from Amazon Redshift. Add description, owner, and attributes to each dataset to help with data discovery and access control.
  4. Search the FinSpace catalog for data.
  5. Use FinSpace notebooks to analyze data from both FinSpace and Amazon Redshift to evaluate trade performance based on the daily price for AMZN stock.

The diagram below provides the complete solution overview.

Prerequisites

Before you get started, make sure you have the following prerequisites:

  • Download Jupyter notebooks covering Amazon Redshift dataset import and analysis. Import them into FinSpace by cloning the GitHub repo or by dropping them into FinSpace. The code provided in this blog post should be run from the FinSpace notebooks.
  • Setup a FinSpace environment. For instructions on creating a new environment, see Create an Amazon FinSpace Environment.
  • Install Capital Markets sample data bundle, as explained in the “Sample Data Bundle” guide.
  • Ensure you have permissions to manage categories and controlled vocabularies and manage attribute sets in FinSpace.
  • Create an Amazon Redshift cluster in the same AWS account as the FinSpace environment. For instructions, see Create a cluster. Additionally, create a superuser and ensure that the cluster is publicly accessible.
  • Create a table in Amazon Redshift and insert trading transaction data using these SQL queries (a hypothetical sketch of such a table follows this list).
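
The linked SQL queries define the actual table. Purely as a hypothetical illustration, consistent with the column names used later in this post (trans_date, trans_type, price), such a table and a sample insert might look like the following.

import redshift_connector

# Hypothetical schema and connection details, for illustration only;
# the linked SQL queries define the actual trade_history table.
conn = redshift_connector.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="...",
)
conn.autocommit = True
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS public.trade_history (
        trans_date DATE,
        ticker     VARCHAR(10),
        trans_type CHAR(1),          -- 'P' = purchase
        price      DECIMAL(12,4),
        quantity   INTEGER
    );
""")

cur.execute(
    "INSERT INTO public.trade_history (trans_date, ticker, trans_type, price, quantity) "
    "VALUES (%s, %s, %s, %s, %s);",
    ("2021-01-02", "AMZN", "P", 1877.00, 10),
)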

Configure your FinSpace catalog to describe your Amazon Redshift tables

FinSpace users can discover relevant datasets by using search or by navigating across categories under the Browse Data menu. Categories allow for cataloging of datasets by commonly used business terms (such as source, data class, type, industry, and so on). An attribute set holds additional metadata for each dataset, including categories and table details to enable you to connect to the data source directly from a FinSpace notebook. Analysts can browse and search attributes to find datasets based on the values assigned to them.

Complete the following steps to create a new subcategory called Redshift under the Source category, and create an attribute set called Redshift Table Attributes. In the following section, we use the subcategory and attribute set to tag datasets from Amazon Redshift. FinSpace users can then browse for the data from the Amazon Redshift source from the Browse Data menu and filter datasets in FinSpace for the tables that are located in the company’s Amazon Redshift data warehouse.

  1. On the FinSpace console, choose Settings (gear icon).
  2. Choose Categories.
  3. Hover over the Source category and choose Edit this Category.

  4. On the Edit Category page, hover over the Source category again and choose Add Sub-Category.
  5. Add Redshift as a source subcategory and Financial data from company's Amazon Redshift data warehouse as the description.

Next, create an attribute set called Redshift Table Attributes to capture additional business context for each dataset.

  1. On the FinSpace console, choose Settings (gear icon).
  2. Choose Attribute Sets.
  3. Choose CREATE ATTRIBUTE SET.
  4. Create a new attribute set called Redshift Table Attributes.
  5. Add the following fields:
    1. Catalog – Data String type
    2. Schema – Data String type
    3. Table – Data String type
    4. Source – Categorization Source type

Use FinSpace notebooks to connect to Amazon Redshift

The notebooks downloaded as part of the prerequisites provide the integration between FinSpace and Amazon Redshift. The steps below explain the code so you can run and extend it as needed.
  1. Connect to the Spark cluster by running the following code:
%local
from aws.finspace.cluster import FinSpaceClusterManager

# if this was already run, no need to run again
if 'finspace_clusters' not in globals():
    finspace_clusters = FinSpaceClusterManager()
    finspace_clusters.auto_connect()
else:
    print(f'connected to cluster: {finspace_clusters.get_connected_cluster_id()}')

After the connection is established, you see a connected to cluster message. Establishing the cluster connection may take 5–8 minutes.

  1. Add the JDBC driver to Spark jars by running the following code:
%%configure -f
{ "conf":{
          "spark.jars": "https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/2.0.0.7/redshift-jdbc42-2.0.0.7.jar"
         }
}

In this example, we use the latest driver version available (2.0). To download the latest JDBC driver, see Download the Amazon Redshift JDBC driver, version 2.0.

  1. Run cells 1.3–1.4 in the notebook (collapsed to improve readability) to add FinSpace helper classes found in public GitHub examples and to add utility functions.

Python helper classes help with schema and table creation, cluster management, and more. The utility functions help translate Amazon Redshift data to a FinSpace schema.

Next, you update the user group ID that should get access to the datasets, and update the Amazon Redshift connection parameters.

  1. On the FinSpace console, choose Settings (gear icon).
  2. Chose Users and User Groups.
  3. Select a group and copy the group ID from the URL.
  4. On the Amazon Redshift console, open your cluster.
  5. Note the cluster endpoint information from the General information section.
  6. Note your database name, port, and admin user name in the Database configurations section.

If you don’t know your user name or password, contact your Amazon Redshift administrator.

Populate the FinSpace catalog with tables from Amazon Redshift

Now we’re ready to import table metadata from Amazon Redshift into FinSpace. For each table, we create a FinSpace dataset, populate the attribute set we created with the metadata about the table (catalog, schema, table names, and Redshift subcategory for the Source category), and associate the populated attribute set to the created dataset.

  1. Use spark.read to retrieve a list of tables and columns as a Spark DataFrame:
spark.read.format("jdbc").option("driver","com.amazon.redshift.jdbc42.Driver").option("url", urlStr).option("query", Query).load()
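
The urlStr and Query variables are defined in earlier notebook cells. A hypothetical construction, shown only to make the call above concrete, might look like the following; the endpoint and the exact catalog query are assumptions.

# Hypothetical values; the actual notebook cells build these from the
# cluster details captured earlier (endpoint, port, database, user, password).
urlStr = (
    "jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev"
    "?user=awsuser&password=..."
)

# The notebook issues one query for tables and one for columns; a table
# listing against the SVV_TABLES catalog view could look like this.
Query = """
    SELECT table_catalog, table_schema, table_name, remarks AS comment
    FROM svv_tables
    WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
"""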

As a result, you get two DataFrames, tablesDF and schemaDF, containing a list of tables and associated metadata (database, schema, table names, and comments) as shown in the following screenshot.

  1. Get the attribute set Redshift Table Attributes that we created earlier by running finspace.attribute_set(att_name). We use its identifiers for populating the metadata for each dataset we create in FinSpace.
# Get the attribute set
sfAttrSet = finspace.attribute_set(att_name)

att_def = None
att_fields = None

# Get the fields of the attribute set
att_resp = finspace.describe_attribute_set(sfAttrSet['id'])

if 'definition' in att_resp: 
    att_def = att_resp['definition']
    
if 'fields' in att_def:
    att_fields = att_def['fields']
  1. Get an ID for the Redshift subcategory to populate the attribute set and identify the datasets with the Amazon Redshift source:
source_cls = finspace.classification('Source')

source_fields = finspace.describe_classification(source_cls['id'])
source_key = None

for n in source_fields['definition']['nodes']:
    if n['fields']['name'] == source_name: 
        source_key = n['key']

# this is the key for source in the Category
print(f'Source: {source_name} Key: {source_key}')

As an output, you get the source_key ID for the Redshift subcategory.

  1. Use list_dataset_metadata_by_taxonomy_node(taxonomyId, source_key) to get the list of existing datasets in FinSpace to avoid duplicating the data if an Amazon Redshift table already exists in FinSpace:
# Get all the datasets from Redshift (classification type Source, with values ‘Redshift’)
resp = finspace.client.list_dataset_metadata_by_taxonomy_node(taxonomyId=source_cls['id'], taxonomyNodeKey=source_key)

# Get a list of datasets to iterate over
datasets = resp['datasetMetadataSummaries']

# Build the lookup table for existing datasets from Redshift to avoid creating duplicates
types_list = []

for s in datasets:

        # end of the arn is the dataset ID
        dataset_id = os.path.basename(s['datasetArn'])

        # get the details of the dataset (name, description, etc)
        dataset_details_resp = finspace.client.describe_dataset_details(datasetId=dataset_id)

        dataset_details = None
        dataset_types   = None
        owner_info = None
        taxonomy_info = None
        
        if 'dataset' in dataset_details_resp:
            dataset_details = dataset_details_resp["dataset"]

        if 'datasetTypeContexts' in dataset_details_resp:
            dataset_types = dataset_details_resp["datasetTypeContexts"]

        if 'ownerinfo' in dataset_details_resp:
            owner_info = dataset_details_resp["ownerinfo"]

        if 'taxonomyNodesinfo' in dataset_details_resp:
            taxonomy_info = dataset_details_resp["taxonomyNodesinfo"]
            
        # Pull Redshift attribute set from the list of dataset_types

        # first check the definition, then extract the values against the definition
        # have the keys of values/labels as the column header?
        for dt in dataset_types:
            if (dt['definition']['name'] != att_name):
                continue

            dd = {
                'dataset_id' : dataset_id
            }

            # used to map the field name (id) to the tile seen in the UI
            field_map = {}

            # get the field titles for name
            for f in dt['definition']['fields']:
                field_map[f['name']] = f['title']

            # human readable, else the keys would be numbers
            for v in dt['values']:
                dd[field_map[v['field']]] = v['values']

            types_list.append(dd)

types_pdf = pd.DataFrame(types_list)

If you already have tables tagged with Redshift as a source, your output looks similar to the following screenshot.

  1. Set permissions and owner details by updating the following code with your desired values:
basicPermissions = [
"ViewDatasetDetails",
"ReadDatasetData",
"AddDatasetData",
"CreateSnapshot",
"EditDatasetMetadata",
"ManageDatasetPermissions",
"DeleteDataset"
]

# All datasets have ownership
basicOwnerInfo = {
"phoneNumber" : "12125551000",
"email" : "[email protected]",
"name" : "Jane Doe"
}
  1. Create a DataFrame with a list of tables in Amazon Redshift to iterate over:
tablesPDF = tablesDF.select('TABLE_CATALOG', 'TABLE_SCHEMA', 'TABLE_NAME', 'COMMENT').toPandas()
  1. Run the following code to:
    1. Check if a table already exists in FinSpace;
    2. If it doesn’t exist, get table’s schema and create an attribute set;
    3. Add the description and the attribute set to the dataset (Catalog, Schema, Table names, and Source).
c = 0
create=True

# For each table, create a dataset with the necessary attribute set populated and associated to the dataset
for index, row in tablesPDF.iterrows():
    
    c = c + 1
        
    catalog = row.TABLE_CATALOG
    schema  = row.TABLE_SCHEMA
    table   = row.TABLE_NAME
    
    # do we already have this dataset?
    exist_i = None
    for ee_i, ee in types_pdf.iterrows():
        if catalog in ee.Catalog:
            if schema in ee.Schema:
                if table in ee.Table:
                    exist_i = ee_i

    if exist_i is not None:
        print(f"Table exists in FinSpace: \n{types_pdf.iloc[[exist_i]]}")
        continue

    # Attributes and their populated values
    att_values = [
        { 'field' : get_field_by_name(att_fields, 'Catalog'), 'type' : get_field_by_name(att_fields, 'Catalog', 'type')['name'], 'values' : [ catalog ] },
        { 'field' : get_field_by_name(att_fields, 'Schema'),  'type' : get_field_by_name(att_fields, 'Schema', 'type')['name'],  'values' : [ schema ] },
        { 'field' : get_field_by_name(att_fields, 'Table'),   'type' : get_field_by_name(att_fields, 'Table', 'type')['name'],   'values' : [ table ] },
        { 'field' : get_field_by_name(att_fields, 'Source'),  'type' : get_field_by_name(att_fields, 'Source', 'type')['name'],  'values' : [ source_key ] },
    ]

    # get this table's schema from Redshift
    tableSchemaPDF = schemaDF.filter(schemaDF.table_name == table).filter(schemaDF.table_schema == schema).select('ORDINAL_POSITION', 'COLUMN_NAME', 'IS_NULLABLE', 'DATA_TYPE', 'COMMENT').orderBy('ORDINAL_POSITION').toPandas()

    print(tableSchemaPDF)
    # translate Redshift schema to FinSpace Schema
    fs_schema = get_finspace_schema(tableSchemaPDF)

    # name and description of the dataset to create
    name = f'{table}'
    description = f'Redshift table from catalog: {catalog}'
    
    if row.COMMENT is not None:
        description = row.COMMENT
    
    print(f'name: {name}')
    print(f'description: {description}')

    print("att_values:")
    for i in att_values:
        print(i)

    print("schema:")
    for i in fs_schema['columns']:
        print(i)
    
    if (create):
        # create the dataset
        dataset_id = finspace.create_dataset(
            name = name,
            description = description,
            permission_group_id = group_id,
            dataset_permissions = basicPermissions,
            kind = "TABULAR",
            owner_info = basicOwnerInfo,
            schema = fs_schema
        )

        print(f'Created, dataset_id: {dataset_id}')

        time.sleep(20)

        # associate the attributes to the dataset
        if (att_name is not None and att_values is not None):
            print(f"Associating values to attribute set: {att_name}")
            finspace.associate_attribute_set(att_name=att_name, att_values=att_values, dataset_id=dataset_id) 

Search the FinSpace catalog for data

Analysts can search for datasets available to them in FinSpace and refine the results using category filters. To analyze our trading activity in the next section, we need to find two datasets: all trades of AMZN stock, and the buy and sell orders from the Amazon Redshift database.

  1. Search for “AMZN” or “US Equity TAQ Sample” to find the “US Equity TAQ Sample – 14 Symbols 6 Months – Sample” dataset provided as part of the Capital Markets Sample Data Bundle.

You can explore the dataset schema and review the attribute set.

  1. Copy the dataset ID and data view ID on the Data View Details page.

We use these IDs in the next section to connect to the data view in FinSpace and analyze our trading activity.

Next, we find the trade_history dataset that we created from the Amazon Redshift table and copy its dataset ID.

  1. On the FinSpace console, choose Source under BROWSE DATA and choose Redshift.
  2. Open the trade_history table.
  3. Copy the dataset ID located in the URL.

Users with permissions to create datasets can also update the dataset with additional information, including a description and owner contact information if those details have changed since the dataset was created in FinSpace.

Use FinSpace notebooks to analyze data from both FinSpace and Amazon Redshift

We’re now ready to analyze the data.

  1. Import the analysis notebook to JupyterLab in FinSpace.
  2. Follow the steps covered in the previous section, Connect to Amazon Redshift from a FinSpace Jupyter notebook using JDBC, to connect to the FinSpace cluster and add a JDBC driver to Spark jars. Add helper and utility functions.
  3. Set up your database connection and date parameters. In this scenario, we analyze trading activity for January 2, 2021.
  4. Connect to Amazon Redshift and query the table directly. Import the data as a Spark DataFrame.
myTrades  = get_dataframe_from_database(dataset_id = dataset_id_db, att_name = db_att_name)

As a result, you get the data stored in the Amazon Redshift database as a Spark DataFrame.

  1. Filter for stock purchase transactions (labeled as P) and calculate an average price paid:
avgPrice = (myTrades.filter( myTrades.trans_date == aDate )
                    .filter(myTrades.trans_type == "P")
                    .select('price')
                    .agg({'price':'avg'}))
  1. Get trading data from the FinSpace Capital Markets dataset:
df = finspace.read_view_as_spark(dataset_id = dataset_id, view_id = view_id)
  1. Apply date, ticker, and trade type filters:
import datetime as dt
import pandas as pd

fTicker = 'AMZN'

pDF = (
    df.filter( df.date == aDate )
    .filter(df.eventtype == "TRADE NB")
    .filter(df.ticker == fTicker)
    .select('price', 'quantity')
).toPandas()
  1. Compare the average purchase price to the daily trading price and plot them to compare how close we got to the lowest price.
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1, figsize=(12, 6))

pDF["price"].plot(kind="hist", weights=pDF["quantity"], bins=50, figsize=(12,6))
plt.axvline(x=avgPrice.toPandas(), color='red')

# Add labels
plt.title(f"{fTicker} Price Distribution vs Avg Purchase Price")
plt.ylabel('Trades')
plt.xlabel('Price')
plt.subplots_adjust(bottom=0.2)

%matplot plt

As a result, you get a distribution of AMZN stock prices traded on January 2, 2021, which we got from a dataset in FinSpace. The red line in the following graph is the average price we paid for the stock, calculated from the transaction data stored in Amazon Redshift. Although we didn’t pay the highest price traded that day, our execution was about average: we paid $1,877 per share versus the lowest price of $1,865.

Clean up

If your work with FinSpace or Amazon Redshift is complete, delete the Amazon Redshift cluster or the FinSpace environment to avoid incurring additional fees.

Conclusion

In this post, we reviewed how to connect the Amazon Redshift database and FinSpace in order to create new datasets in FinSpace using the table metadata from Amazon Redshift. We then explored how to look for available data in the FinSpace web app to find two datasets that can help us evaluate how close we got to the best daily price. Finally, we used FinSpace dataset details to import the data into two DataFrames and plot price distribution versus the average price we paid. As a result, we reduced the time it takes to discover and connect to datasets needed for analyzing trading transactions.

Download the import and analysis Jupyter notebooks discussed in this blog post on GitHub.

Visit the FinSpace user guide to learn more about the service, or contact us to discuss FinSpace or Amazon Redshift in more detail.


About the Authors

Mariia Berezina is a Sr. Launch Manager at AWS. She is passionate about building new products to help customers get the most out of data. When not working, she enjoys mentoring women in technology, diving, and traveling the world.

Vincent Saulys is a Principal Solutions Architect at AWS working on FinSpace. Vincent has over 25 years of experience solving some of the world’s most difficult technical problems in the financial services industry. He is a launcher and leader of many mission-critical breakthroughs in data science and technology on behalf of Goldman Sachs, FINRA, and AWS.

Federated authentication to Amazon Redshift using AWS Single Sign-On

Post Syndicated from Manash Deb original https://aws.amazon.com/blogs/big-data/federated-authentication-to-amazon-redshift-using-aws-single-sign-on/

Managing database users through identity federation allows you to manage authentication and authorization procedures centrally. Amazon Redshift, a fast, fully managed cloud data warehouse, provides browser-based plugins for JDBC/ODBC drivers, which help you easily implement identity federation with multi-factor authentication (MFA) to secure your data warehouse, and also help you automate and enforce data access policies across the organization.

AWS Single Sign-On (AWS SSO) provides tools to federate user access to the AWS environment. AWS SSO integrates with AWS Organizations to manage access to all the AWS accounts under the organization. In our previous post, we explained how you can integrate the Amazon Redshift browser-based Security Assertion Markup Language (SAML) plugin to add SSO and MFA capability with your federation identity provider (IdP). We expand on that in this post to show how you can set up this federated authentication to connect users to Amazon Redshift through AWS SSO integrated with a supported identity source directory of your choice, such as the native AWS SSO identity store, AWS managed, self-managed, or on-premises Microsoft Active Directory (AD), or an external IdP such as Okta, Azure AD, or Ping.

Solution overview

When you connect to Amazon Redshift using a JDBC/ODBC client, you can use the Amazon Redshift browser SAML plugin to launch a custom AWS SSO SAML application. The application authenticates the user against the identity source directory that you have integrated with AWS SSO and then provides the SAML attributes required to connect to Amazon Redshift. The SAML application uses the user’s identity source credentials to get their user and group attributes, and grants the appropriate Amazon Redshift database access privileges to the user. The following diagram illustrates this workflow.

The following are the high-level steps for this setup:

  1. Enable AWS SSO (linked with your preferred identity source directory) and set up custom SAML applications in AWS SSO with the appropriate user or group association and attribute mappings.
  2. Set up a SAML IdP for AWS SSO and link it to an AWS Identity and Access Management (IAM) role with appropriate permissions to access Amazon Redshift.
  3. Set up the Amazon Redshift cluster and database groups that correspond to your AWS SSO directory groups.
  4. Configure the JDBC/ODBC client to authenticate with your AWS SSO SAML application and obtain federated IAM credentials to access the Amazon Redshift cluster.

Prerequisites

You need the following prerequisites to set up this solution:

Use case

In this example use case, we use AWS SSO integrated with AWS Managed Microsoft AD as the identity source to connect to an Amazon Redshift cluster as users from two different AD groups: BI users and analysts. We create two SAML applications in AWS SSO to map these two groups with their respective users, and then connect to Amazon Redshift with the SQL Workbench/J client via AWS SSO using their Microsoft AD user credentials.

As a prerequisite step, we have already set up an AWS Managed Microsoft AD directory with sample directory groups and user mappings, and attached it to AWS SSO as the identity source.

The following screenshots show our AD groups and user mappings.

The following screenshot shows our AWS SSO identity source mapping.

Configure AWS SSO

As mentioned in the prerequisites section, you need to enable AWS SSO in your account, and map it with a supported identity source. If AWS SSO isn’t configured in your account, follow the steps in Getting Started.

In this step, you create two custom SAML applications in AWS SSO.

  1. On the AWS SSO console, choose Applications in the navigation pane.
  2. Choose Add a new application.
  3. Choose Add a custom SAML 2.0 application.
  4. For Display name, enter an appropriate name for the SAML application (for this post, because we create two applications, we first enter Redshift-SAML-BI-User).
  5. In the Application metadata section, choose the option to manually enter the metadata values.
  6. For Application ACS URL, enter http://localhost:7890/redshift/.
  7. For Application SAML audience, enter urn:amazon:webservices:redshift.
  8. On the Configuration tab, choose Download to download the AWS SSO SAML metadata file.

We use this file later to create the IdP.

  9. On the Assigned users tab, choose Assign users to add bi_users_group to this application.
  10. On the Attribute mappings tab, add the following custom attribute mappings.
Each entry lists the user attribute in the application, the string value or AWS SSO user attribute it maps to, and a description:

  • Subject – ${user:email} – User identity
  • https://aws.amazon.com/SAML/Attributes/RoleSessionName – ${user:email} – Identification for the user session, which in most cases is the email ID of the user
  • https://redshift.amazon.com/SAML/Attributes/AutoCreate – True – If this parameter is set, new users authenticated by the IdP are automatically created in Amazon Redshift
  • https://aws.amazon.com/SAML/Attributes/Role – arn:aws:iam::<yourAWSAccountID>:role/redshift-federation-role,arn:aws:iam::<yourAWSAccountID>:saml-provider/redshift-federation-saml-provider – The IAM role ARN and the SAML IdP ARN, in the format aws_idp_iam_role_arn,aws_identity_provider_arn
  • https://redshift.amazon.com/SAML/Attributes/DbUser – ${user:email} – The Amazon Redshift database user for the session, which in most cases is the email ID of the user
  • https://redshift.amazon.com/SAML/Attributes/DbGroups – bi_users_group – Amazon Redshift database group names for the user, which in most cases are the same as the directory groups the user belongs to

The IAM role and IdP names and ARN strings entered for the https://aws.amazon.com/SAML/Attributes/Role attribute mapping must match the names given while creating those IAM resources for the BI user group during the IAM role setup in the next section.

  11. Choose Dashboard in the navigation pane and choose the User portal URL link to launch the user portal’s login page in a new browser window.
  12. Log in to the AWS SSO user portal using your Microsoft AD user credentials for the BI user.

After logging in, you can see the new SAML application we created.

  13. Choose the application (right-click) and copy the link address.

We need this application login URL in a later step to connect to Amazon Redshift using AWS SSO federation.

  14. Similar to the preceding steps, create another SAML application called Redshift-SAML-Analyst-User and assign the analyst group to this application.
  15. On the application’s Attribute mappings tab, add all the attribute mappings similar to the previous application, but with different mapping values for the Role and DbGroups attributes:
    1. The DbGroups parameter should be mapped to analysts_group.
    2. The Role parameter value entered should match the names of IAM resources created for the analyst user group in the next section.
  16. Log in to the AWS SSO user portal using your Microsoft AD user credentials for the analyst user.
  17. Copy the application link address, which you need in a later step to connect to Amazon Redshift using AWS SSO federation.

Set up IAM roles and SAML IdPs

In this step, you set up two SAML IdPs and two IAM roles with appropriate permissions for the two AD directory groups to access your Amazon Redshift cluster. You need the SAML Metadata XML files downloaded from your AWS SSO SAML applications from the previous section. As outlined in our earlier post, you may follow the step-by-step process to add the IdPs and IAM roles manually, or use the following AWS CloudFormation template.
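If you prefer to script these resources rather than use the console or the CloudFormation template, the following is a minimal boto3 sketch for the BI user group (you would repeat it for the analyst group with its own role and provider). The resource names match the attribute mappings shown earlier, the metadata file name is a placeholder for the file downloaded from the AWS SSO SAML application, and the wildcard Resource in the permissions policy is a simplification you should scope down to your cluster, dbuser, dbname, and dbgroup ARNs for production.

import json
import boto3

iam = boto3.client("iam")

# Create the SAML identity provider from the AWS SSO application metadata (placeholder file name)
with open("Redshift-SAML-BI-User-metadata.xml") as f:
    saml_metadata = f.read()

idp = iam.create_saml_provider(
    Name="redshift-federation-saml-provider",
    SAMLMetadataDocument=saml_metadata,
)

# Trust policy that lets users federated through the SAML IdP assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Federated": idp["SAMLProviderArn"]},
        "Action": "sts:AssumeRoleWithSAML",
        "Condition": {"StringEquals": {"SAML:aud": "http://localhost:7890/redshift/"}},
    }],
}

iam.create_role(
    RoleName="redshift-federation-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Permissions that let the federated user get temporary cluster credentials,
# be auto-created, and join the bi_users_group database group
permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "redshift:GetClusterCredentials",
            "redshift:CreateClusterUser",
            "redshift:JoinGroup",
        ],
        "Resource": "*",  # scope this down for production use
    }],
}

iam.put_role_policy(
    RoleName="redshift-federation-role",
    PolicyName="redshift-federation-access",
    PolicyDocument=json.dumps(permissions_policy),
)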

Next, you need to confirm or replace the role ARN and the IdP ARN values in the two SAML applications’ attribute mappings. Refer to the following screenshot for the location of the two ARN values on the IAM console.

Set up an Amazon Redshift cluster

If you haven’t set up an Amazon Redshift cluster yet, see Getting started with Amazon Redshift for a step-by-step guide to create a new cluster in your AWS account.

If you already have an Amazon Redshift cluster, note the admin user credentials for that cluster and connect to that cluster using a SQL client like SQL Workbench/J and the latest Amazon Redshift JDBC driver.

After logging in to your Amazon Redshift cluster as an admin user, you can set up database objects and appropriate access permissions for them. In the following code, we set up two schemas for analysts and BI users, and then grant access on them to the relevant groups:

CREATE GROUP analysts_group;
CREATE GROUP bi_users_group;

CREATE SCHEMA IF NOT EXISTS analysts_schema;
GRANT USAGE ON SCHEMA analysts_schema TO GROUP analysts_group;
ALTER DEFAULT PRIVILEGES IN SCHEMA analysts_schema GRANT SELECT ON TABLES TO GROUP analysts_group;
GRANT SELECT ON ALL TABLES IN SCHEMA analysts_schema TO GROUP analysts_group;

CREATE SCHEMA IF NOT EXISTS bi_schema;
GRANT USAGE ON SCHEMA bi_schema TO GROUP bi_users_group;
ALTER DEFAULT PRIVILEGES IN SCHEMA bi_schema GRANT SELECT ON TABLES TO GROUP bi_users_group;
GRANT SELECT ON ALL TABLES IN SCHEMA bi_schema TO GROUP bi_users_group;

Connect to Amazon Redshift with AWS SSO federation

In this step, you connect to your Amazon Redshift cluster from your SQL Workbench/J client using AWS SSO federation.

  1. Create a new connection in SQL Workbench/J and choose Amazon Redshift as the driver.
  2. Enter the following code in the URL section of your connection properties (provide your Amazon Redshift cluster endpoint):
    jdbc:redshift:iam://your_cluster_endpoint

  3. Choose Extended Properties and add the following three properties:
    1. login_url – Enter the BI user group’s SAML application’s login URL you copied in an earlier step.
    2. plugin_name – Enter com.amazon.redshift.plugin.BrowserSamlCredentialsProvider.
    3. idp_response_timeout – Enter 60.
  4. Choose OK, and connect to your cluster.

This launches your AWS SSO SAML application’s sign-in page in a browser window. After you successfully authenticate using the BI user’s AD user credentials in the browser, the SQL client connects you to Amazon Redshift as the BI user under the bi_users_group database group. You can verify the user’s database session and group association by running the following SQL:

select * from stv_sessions where starttime > sysdate-1 order by 1 desc;

  5. Similar to the BI user, you can create a new database connection to test an analyst user login. Instead of adding the extended properties in SQL Workbench, you can also use an initialization file to add the JDBC connection properties. For that, create a file rsjdbc.ini on your file system with the following contents (provide the analyst group’s SAML application’s login URL you copied earlier):
[ANALYST]
login_url=https://xxxxxxxxxx.awsapps.com/start/#/saml/default/Redshift-SAML-Analyst-User/ins-xxxxxxxx
plugin_name=com.amazon.redshift.plugin.BrowserSamlCredentialsProvider
idp_response_timeout=60
  6. Enter the following in the URL section of your connection properties (provide your cluster endpoint and file system path for the rsjdbc.ini file you created in the previous step):
jdbc:redshift:iam://your_cluster_endpoint?inifile=/network_path/rsjdbc.ini&IniSection=ANALYST

Your connection profile should look like the following screenshot.

  7. Choose OK to connect.

After you authenticate using the analyst user’s AD user credentials in the browser, you should be logged in to Amazon Redshift as the analyst user mapped to the analysts_group, as shown in the following screenshot.

Conclusion

In this post, we showed how you can use AWS SSO with AWS Managed Microsoft AD to federate access to Amazon Redshift. The same setup instructions also work with any other supported identity source directory of your choice, such as the native AWS SSO identity store, self-managed or on-premises Active Directory, or an external identity provider (IdP) such as Okta, Azure AD, or Ping.


About the Authors

Manash Deb is a Software Development Engineer on the AWS Redshift team. He has worked on building end-to-end data-driven solutions in different database and data warehousing technologies for over 15 years. He loves learning new technologies and solving, automating, and simplifying customer problems with easy-to-use cloud data solutions on AWS.

Manish Vazirani is an Analytics Specialist Solutions Architect at Amazon Web Services.

Rajesh Mohan is an SDE-II at Amazon Studios where he engineers solutions to build a media supply chain to integrate content vendors with Prime Video. In his free time, he enjoys wandering the streets of New York City, trekking the mountains of California and binging on food videos while chomping away on delicious food from around the world.

Get started with the Amazon Redshift Data API

Post Syndicated from David Zhang original https://aws.amazon.com/blogs/big-data/get-started-with-the-amazon-redshift-data-api/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that enables you to analyze your data at scale. Tens of thousands of customers use Amazon Redshift to process exabytes of data to power their analytical workloads.

The Amazon Redshift Data API is an Amazon Redshift feature that simplifies access to your Amazon Redshift data warehouse by removing the need to manage database drivers, connections, network configurations, data buffering, credentials, and more. You can run SQL statements using the AWS Software Development Kit (AWS SDK), which supports different languages such as C++, Go, Java, JavaScript, .Net, Node.js, PHP, Python, and Ruby.

With the Data API, you can programmatically access data in your Amazon Redshift cluster from different AWS services such as AWS Lambda, Amazon SageMaker notebooks, AWS Cloud9, and also your on-premises applications using the AWS SDK. This allows you to build cloud-native, containerized, serverless, web-based, and event-driven applications on the AWS Cloud.

The Data API also enables you to run analytical queries on Amazon Redshift’s native tables, external tables in your data lake via Amazon Redshift Spectrum, and also across Amazon Redshift clusters, which is known as data sharing. You can also perform federated queries with external data sources such as Amazon Aurora.

In an earlier post, we shared in detail how you can use the Data API to interact with your Amazon Redshift data warehouse. In this post, we learn how to get started with the Data API in different languages and also discuss various use cases in which customers are using it to build modern applications combining modular, serverless, and event-driven architectures. The Data API was launched in September 2020, and thousands of our customers are already using it for a variety of use cases:

  • Extract, transform, and load (ETL) orchestration with AWS Step Functions
  • Access Amazon Redshift from applications
  • Access Amazon Redshift from SageMaker Jupyter notebooks
  • Access Amazon Redshift with REST endpoints
  • Event-driven extract, load, transformation
  • Event-driven web application design
  • Serverless data processing workflows

Key features of the Data API

In this section, we discuss the key features of the Data API.

Use the programming language of your choice

The Data API integrates with the AWS SDK to run queries. Therefore, you can use any language supported by the AWS SDK to build your application with it, such as C++, Go, Java, JavaScript, .NET, Node.js, PHP, Python, and Ruby.

Run individual or batch SQL statements

With the Data API, you can run individual queries from your application or submit a batch of SQL statements within a transaction, which is useful to simplify your workload.
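As a sketch with the Python SDK, you might submit several statements as one batch with the BatchExecuteStatement operation, which runs them in order as a single transaction; the cluster, secret, and SQL statements below are placeholders:

import boto3

client = boto3.client("redshift-data")

# Submit multiple SQL statements; they run in order as a single transaction
response = client.batch_execute_statement(
    ClusterIdentifier="my-redshift-cluster",   # placeholder
    Database="dev",                            # placeholder
    SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-creds",  # placeholder
    Sqls=[
        "create temp table stage_sales (like sales)",
        "insert into stage_sales select * from sales where sale_date = current_date",
        "analyze stage_sales",
    ],
)
print(response["Id"])  # one statement ID covers the whole batch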

Run SQL statements with parameters

With the Data API, you can run parameterized SQL queries, which lets you write reusable ETL code by passing parameters into a SQL template instead of concatenating them into each query. This also makes it easier to migrate code from existing applications that need parameterization. In addition, parameterization makes your code more secure by helping prevent SQL injection.
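For example, a sketch of a parameterized query with the Python SDK might look like the following; the :name placeholders in the SQL text are bound from the Parameters list, and the cluster, database, and user names are placeholders:

import boto3

client = boto3.client("redshift-data")

# Parameters are passed separately from the SQL text, so values are never concatenated into the query
response = client.execute_statement(
    ClusterIdentifier="my-redshift-cluster",   # placeholder
    Database="dev",                            # placeholder
    DbUser="awsuser",                          # placeholder
    Sql="select * from sales where event_id = :event_id and sale_date = :sale_date",
    Parameters=[
        {"name": "event_id", "value": "1000"},
        {"name": "sale_date", "value": "2021-01-02"},
    ],
)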

No drivers needed

With the Data API, you can interact with Amazon Redshift without having to configure JDBC or ODBC drivers. The Data API eliminates the need for configuring drivers and managing database connections. You can run SQL commands to your Amazon Redshift cluster by calling a Data API secured API endpoint.

No network or security group setup

The Data API doesn’t need a persistent connection with Amazon Redshift. Instead, it provides a secure HTTP endpoint, which you can use to run SQL statements. Therefore, you don’t need to set up and manage a VPC, security groups, and related infrastructure to access Amazon Redshift with the Data API.

No password management

The Data API provides two options to provide credentials:

  • IAM temporary credentials – With this approach, you only need to provide the database user name, and the caller needs the AWS Identity and Access Management (IAM) redshift:GetClusterCredentials permission; no password is required. The Data API automatically handles retrieving your credentials using IAM temporary credentials.
  • AWS Secrets Manager secret – With this approach, you store your username and password credentials in an AWS Secrets Manager secret and allow the Data API to access those credentials from that secret.

You can also use the Data API when working with federated logins through IAM credentials. You don’t have to pass database credentials via API calls when using identity providers such as Okta or Azure Active Directory, or when using database credentials stored in Secrets Manager. If you’re using Lambda, the Data API provides a secure way to access your database without the additional overhead of launching Lambda functions in Amazon Virtual Private Cloud (Amazon VPC).
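The following sketch shows the same call with either credential option using the Python SDK; all identifiers are placeholders:

import boto3

client = boto3.client("redshift-data")

# Option 1: IAM temporary credentials -- supply the database user name only
client.execute_statement(
    ClusterIdentifier="my-redshift-cluster",   # placeholder
    Database="dev",
    DbUser="awsuser",
    Sql="select 1",
)

# Option 2: AWS Secrets Manager -- supply the ARN of a secret holding the user name and password
client.execute_statement(
    ClusterIdentifier="my-redshift-cluster",   # placeholder
    Database="dev",
    SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-creds",  # placeholder
    Sql="select 1",
)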

Asynchronous

The Data API is asynchronous. You can run long-running queries without having to wait for them to complete, which is key in developing a serverless, microservices-based architecture. Each query returns a query ID, and you can use this ID to check the status and response of the query. In addition, query results are stored for 24 hours.
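A minimal polling sketch with the Python SDK, assuming placeholder cluster and user names, might look like the following; in practice you can also rely on event notifications (see the next section) instead of a polling loop:

import time
import boto3

client = boto3.client("redshift-data")

statement = client.execute_statement(
    ClusterIdentifier="my-redshift-cluster",   # placeholder
    Database="dev",
    DbUser="awsuser",
    Sql="select count(*) from sales",
)

# Poll the statement status using the query ID returned by ExecuteStatement
while True:
    desc = client.describe_statement(Id=statement["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

# Results remain available for 24 hours after the query finishes
if desc["Status"] == "FINISHED":
    result = client.get_statement_result(Id=statement["Id"])
    print(result["Records"])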

Event notifications

You can monitor Data API events in Amazon EventBridge, which delivers a stream of real-time data from your source application to targets such as Lambda. This option is available when you’re running your SQL statements in the Data API using the WithEvent parameter set to true. When a query is complete, the Data API can automatically send event notifications to EventBridge, which you may use to take further actions. This helps you design event-driven applications with Amazon Redshift. For more information, see Monitoring events for the Amazon Redshift Data API in Amazon EventBridge.
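For example, a minimal sketch of submitting a statement with notifications enabled might look like the following; the cluster, user, and stored procedure names are placeholders, and the EventBridge rule that consumes the resulting event is configured separately:

import boto3

client = boto3.client("redshift-data")

# WithEvent=True asks the Data API to publish an event to EventBridge when the statement
# finishes; an EventBridge rule can then route that event to a target such as a Lambda function
client.execute_statement(
    ClusterIdentifier="my-redshift-cluster",   # placeholder
    Database="dev",
    DbUser="awsuser",                          # placeholder
    Sql="call sp_refresh_daily_sales()",       # placeholder long-running statement
    WithEvent=True,
)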

Get started with the Data API

It’s easy to get started with the Data API using the AWS SDK. You can use the Data API to run your queries on Amazon Redshift using different languages such as C++, Go, Java, JavaScript, .Net, Node.js, PHP, Python and Ruby.

All API calls from different programming languages follow similar parameter signatures. For instance, you can run the ExecuteStatement API to run individual SQL statements in the AWS Command Line Interface (AWS CLI) or different languages such as Python and JavaScript (NodeJS).

The following code is an example using the AWS CLI:

aws redshift-data execute-statement \
    --cluster-identifier my-redshift-cluster \
    --database dev \
    --db-user awsuser \
    --sql 'select data from redshift_table'

The following example code uses Python:

boto3.client("redshift-data").execute_statement(
    ClusterIdentifier='my-redshift-cluster',
    Database='dev',
    DbUser='awsuser',
    Sql='select data from redshift_table')

The following code uses JavaScript (NodeJS):

new AWS.RedshiftData().executeStatement({
	ClusterIdentifier: 'my-redshift-cluster',
	Database: 'dev',
	DbUser: 'awsuser',
	Sql: 'select data from redshift_table'
}).promise();

We have also published a GitHub repository showcasing how to get started with the Data API in different languages such as Go, Java, JavaScript, Python, and TypeScript. You may go through the step-by-step process explained in the repository to build your custom application in all these languages using the Data API.

Use cases with the Data API

You can use the Data API to modernize and simplify your application architectures by creating modular, serverless, event-driven applications with Amazon Redshift. In this section, we discuss some common use cases.

ETL orchestration with Step Functions

When performing ETL workflows, you have to complete a number of steps. As your business scales, the steps and dependencies often become complex and difficult to manage. With the Data API and Step Functions, you can easily orchestrate complex ETL workflows. You can explore the following example use case and AWS CloudFormation template demonstrating ETL orchestration using the Data API and Step Functions.
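As a simplified illustration (not the packaged example), a Step Functions workflow might use two small Lambda task handlers like the following: one submits a batch of ETL statements and returns the statement ID, and another checks its status so a Choice state can wait, retry, or continue. The cluster name and the event fields are assumptions.

import boto3

client = boto3.client("redshift-data")

def submit_etl(event, context):
    """Step Functions task: submit a batch of ETL statements and return the statement ID."""
    response = client.batch_execute_statement(
        ClusterIdentifier="my-redshift-cluster",   # placeholder
        Database="dev",
        SecretArn=event["secret_arn"],             # assumed to be passed in by the state machine
        Sqls=event["sqls"],                        # list of SQL statements for this ETL step
    )
    return {"statement_id": response["Id"]}

def check_status(event, context):
    """Step Functions task: return the statement status so a Choice state can wait or continue."""
    desc = client.describe_statement(Id=event["statement_id"])
    return {"statement_id": event["statement_id"], "status": desc["Status"]}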

Access Amazon Redshift from custom applications

If you’re designing your custom application in any programming language that is supported by the AWS SDK, the Data API simplifies data access from your applications, which may be an application hosted on Amazon Elastic Compute Cloud (Amazon EC2) or Amazon Elastic Container Service (Amazon ECS) and other compute services or a serverless application built with Lambda. You can explore an example use case and CloudFormation template showcasing how to easily work with the Data API from Amazon EC2 based applications.

Access Amazon Redshift from SageMaker Jupyter notebooks

SageMaker notebooks are very popular among the data science community to analyze and solve machine learning problems. The Data API makes it easy to access and visualize data from your Amazon Redshift data warehouse without troubleshooting issues on password management or VPC or network issues. You can learn more about this use case along with a CloudFormation template showcasing how to use the Data API to interact from a SageMaker Jupyter notebook.
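For instance, a notebook cell might run a query and load the result into a pandas DataFrame along these lines; the query and cluster details are placeholders, and the record parsing assumes each cell comes back as a single-key dict (for example, {'stringValue': ...}), which is how the Data API represents values:

import time
import boto3
import pandas as pd

client = boto3.client("redshift-data")

statement = client.execute_statement(
    ClusterIdentifier="my-redshift-cluster",   # placeholder
    Database="dev",
    DbUser="awsuser",                          # placeholder
    Sql="select venuename, venuecity from venue limit 10",   # placeholder query
)

# Wait for the query to finish (simple loop; see the asynchronous section above)
while client.describe_statement(Id=statement["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)

result = client.get_statement_result(Id=statement["Id"])

# Convert the Data API response into a pandas DataFrame for analysis or plotting
columns = [col["name"] for col in result["ColumnMetadata"]]
rows = [[list(cell.values())[0] for cell in record] for record in result["Records"]]
df = pd.DataFrame(rows, columns=columns)
df.head()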

Access Amazon Redshift with REST endpoints

In addition to the AWS SDK, you can invoke the Data API operations directly as REST API calls using HTTP methods such as GET or POST. For more information, see REST for Redshift Data API.

Event-driven ELT

Event-driven applications are popular with many customers, where applications run in response to events. A primary benefit of this architecture is the decoupling of producer and consumer processes, which allows greater flexibility in application design and building decoupled processes. For more information, see Building an event-driven application with AWS Lambda and the Amazon Redshift Data API. The following CloudFormation template demonstrates the same.

Event-driven web application design

Similar to event-driven ELT applications, event-driven web applications are also becoming popular, especially if you want to avoid long-running database queries, which create bottlenecks for the application servers. For example, you may be running a web application that has a long-running database query taking a minute to complete. Instead of designing that web application with long-running API calls, you can use the Data API and Amazon API Gateway WebSockets, which create a lightweight WebSocket connection with the browser and submit the query to Amazon Redshift using the Data API. When the query is finished, the Data API sends a notification to EventBridge about its completion. When the data is available in the Data API, it’s pushed back to this browser session and the end user can view the dataset. You can explore an example use case along with a CloudFormation template showcasing how to build an event-driven web application using the Data API and API Gateway WebSockets.

Serverless data processing workflows

With the Data API, you can design a serverless data processing workflow, where you can design an end-to-end data processing pipeline orchestrated using serverless AWS components such as Lambda, EventBridge, and the Data API client. Typically, a data pipeline involves multiple steps, for example:

  1. Load raw sales and customer data to a data warehouse.
  2. Integrate and transform the raw data.
  3. Build summary tables or unload this data to a data lake so subsequent steps can consume this data.

The example use case Serverless Data Processing Workflow using Amazon Redshift Data Api demonstrates how to chain multiple Lambda functions in a decoupled fashion and build an end-to-end data pipeline. In that code sample, a Lambda function is run through a scheduled event that loads raw data from Amazon Simple Storage Service (Amazon S3) to Amazon Redshift. On its completion, the Data API generates an event that triggers an event rule in EventBridge to invoke another Lambda function that prepares and transforms raw data. When that process is complete, it generates another event triggering a third EventBridge rule to invoke another Lambda function and unloads the data to Amazon S3. The Data API enables you to chain this multi-step data pipeline in a decoupled fashion.
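The first step of such a pipeline could be sketched as the following Lambda handler, which loads raw data with a COPY command and publishes a completion event for the next stage; the table, bucket, IAM role, and cluster names are placeholders, and the packaged sample on GitHub remains the authoritative version.

import boto3

client = boto3.client("redshift-data")

def load_raw_data(event, context):
    """Scheduled Lambda: load raw data from Amazon S3 into Amazon Redshift via the Data API."""
    copy_sql = (
        "copy raw_sales from 's3://my-bucket/raw/sales/' "                      # placeholder bucket/prefix
        "iam_role 'arn:aws:iam::123456789012:role/RedshiftCopyRole' "           # placeholder role
        "format as csv ignoreheader 1"
    )
    response = client.execute_statement(
        ClusterIdentifier="my-redshift-cluster",   # placeholder
        Database="dev",
        DbUser="awsuser",                          # placeholder
        Sql=copy_sql,
        WithEvent=True,   # emit an EventBridge event on completion to trigger the transform step
    )
    return {"statement_id": response["Id"]}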

Conclusion

The Data API offers many additional benefits when integrating Amazon Redshift into your analytical workload. The Data API simplifies and modernizes current analytical workflows and custom applications. You can perform long-running queries without having to pause your application for the queries to complete. This enables you to build event-driven applications as well as fully serverless ETL pipelines.

The Data API functionalities are available in many different programming languages to suit your environment. As mentioned earlier, there are a wide variety of use cases and possibilities where you can use the Data API to improve your analytical workflow. To learn more, see Using the Amazon Redshift Data API.


About the Authors

David Zhang is an AWS Solutions Architect who helps customers design robust, scalable, and data-driven solutions across multiple industries. With a background in software engineering, David is an active leader and contributor to AWS open-source initiatives. He is passionate about solving real-world business problems and continuously strives to work from the customer’s perspective.

Bipin Pandey is a Data Architect at AWS. He loves to build data lake and analytics platforms for his customers. He is passionate about automating and simplifying customer problems with the use of cloud solutions.

Manash Deb is a Senior Analytics Specialist Solutions Architect at AWS. He has worked on building end-to-end data-driven solutions in different database and data warehousing technologies for over 15 years. He loves learning new technologies and solving, automating, and simplifying customer problems with easy-to-use cloud data solutions on AWS.

Bhanu Pittampally is an Analytics Specialist Solutions Architect based out of Dallas. He specializes in building analytical solutions. His background is in data warehouse architecture, development, and administration. He has been in the data and analytics field for over 13 years. His LinkedIn profile is here.

Chao Duan is a software development manager at Amazon Redshift, where he leads the development team focusing on enabling self-maintenance and self-tuning with comprehensive monitoring for Redshift. Chao is passionate about building high-availability, high-performance, and cost-effective databases to empower customers with data-driven decision making.

Debu Panda, a Principal Product Manager at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world.