Tag Archives: Amazon Redshift

Restrict Amazon Redshift Spectrum external table access to Amazon Redshift IAM users and groups using role chaining

Post Syndicated from Harsha Tadiparthi original https://aws.amazon.com/blogs/big-data/restrict-amazon-redshift-spectrum-external-table-access-to-amazon-redshift-iam-users-and-groups-using-role-chaining/

With Amazon Redshift Spectrum, you can query the data in your Amazon Simple Storage Service (Amazon S3) data lake using a central AWS Glue metastore from your Amazon Redshift cluster. This capability extends your petabyte-scale Amazon Redshift data warehouse beyond local storage limits, which allows you to scale to exabytes of data cost-effectively. Like Amazon EMR, you get the benefits of open data formats and inexpensive storage, and you can scale out to thousands of Redshift Spectrum nodes to pull, filter, project, aggregate, group, and sort data. Like Amazon Athena, Redshift Spectrum is serverless and there’s nothing to provision or manage. You only pay $5 for every 1 TB of data scanned. This post discusses how to configure Amazon Redshift security to enable fine-grained access control using role chaining to achieve high-fidelity user-based permission management.

As you start using the lake house approach, which integrates Amazon Redshift with the Amazon S3 data lake using Redshift Spectrum, you need more flexibility when it comes to granting access to different external schemas on the cluster. For example, in the following use case, you have two Redshift Spectrum schemas, SA and SB, mapped to two databases, A and B, respectively, in an AWS Glue Data Catalog, and you want to allow access for the following when queried from Amazon Redshift:

  • Select access for schema SA only to IAM user group Grp1
  • Select access for schema SB only to IAM user group Grp2
  • No access for IAM user group Grp3 to schemas SA and SB

By default, the policies defined under the AWS Identity and Access Management (IAM) role assigned to the Amazon Redshift cluster manage Redshift Spectrum table access, which is inherited by all users and groups in the cluster. This IAM role associated with the cluster cannot easily be restricted to different users and groups. This post details the configuration steps necessary to achieve fine-grained authorization policies for different users in an Amazon Redshift cluster and control access to different Redshift Spectrum schemas and tables using IAM role chaining. When using role chaining, you don’t have to modify the cluster; you can make all modifications on the IAM side. Adding new roles doesn’t require any changes in Amazon Redshift. Even when using AWS Lake Formation, as of this writing, you can’t achieve this level of isolated, coarse-grained access control on the Redshift Spectrum schemas and tables. For more information about cross-account queries, see How to enable cross-account Amazon Redshift COPY and Redshift Spectrum query for AWS KMS–encrypted data in Amazon S3.

Prerequisites

This post uses a TPC-DS 3 TB public dataset from Amazon S3 cataloged in AWS Glue by an AWS Glue crawler and an example retail department dataset. To get started, you must complete the following prerequisites. The first two are outside the scope of this post; you can use an existing cluster and your own dataset in your Amazon S3 data lake.

  1. Create an Amazon Redshift cluster with or without an IAM role assigned to the cluster.
  2. Create an AWS Glue Data Catalog with a database using data from the data lake in Amazon S3, with either an AWS Glue crawler, Amazon EMR, AWS Glue, or Athena. The database should have one or more tables pointing to different Amazon S3 paths. This post uses an industry-standard TPC-DS 3 TB dataset, but you can also use your own dataset.
  3. Create IAM users and groups to use later in Amazon Redshift:
    • Create new IAM groups named grpA and grpB without any policies.
    • Create users a1 and b1 and add them to groups grpA and grpB, respectively. Use lower-case usernames.
    • Add the following policy to all the groups you created to allow IAM users to get temporary credentials when authenticating against Amazon Redshift:
      {
          "Version": "2012-10-17",
          "Statement": {
              "Effect": "Allow",
              "Action": "redshift:GetClusterCredentials",
              "Resource": "arn:aws:redshift:us-east-1:0123456789:dbuser:*/*"
          }
      }

      For additional security, you may want to restrict this policy to specific database users and groups in the cluster.

  4. Create database users and groups on the Amazon Redshift cluster that match the IAM users and groups, without any password:
    • To create user a1, enter the following code:
      create user a1 password disable;

    • To create grpA, enter the following code:
      create group grpA with user a1;

    • Repeat these steps for user b1 and add the user to grpB, as shown below.
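      For example, the equivalent statements for user b1 and grpB are:
      create user b1 password disable;
      create group grpB with user b1;
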
  5. Install a JDBC SQL query client such as SQL Workbench/J on the client machine.

Use case

In the following use case, you have an AWS Glue Data Catalog with a database named tpcds3tb. Tables in this database point to Amazon S3 under a single bucket, but each table is mapped to a different prefix under the bucket. The following screenshot shows the different table locations.

You use the tpcds3tb database and create a Redshift Spectrum external schema named schemaA. You create groups grpA and grpB with different IAM users mapped to the groups. The goal is to grant different access privileges to grpA and grpB on external tables within schemaA.

This post presents two options for this solution:

  • Use the Amazon Redshift grant usage statement to grant grpA access to external tables in schemaA. The groups can access all tables in the data lake defined in that schema regardless of which Amazon S3 locations the tables are mapped to.
  • Configure role chaining to Amazon S3 external schemas that isolate group access to specific data lake locations and deny access to tables in the schema that point to different Amazon S3 locations.

Isolating user and group access using the grant usage privilege

You can use the Amazon Redshift grant usage privilege on schemaA, which allows grpA access to all objects under that schema. You don’t grant any usage privilege to grpB; users in that group should see access denied when querying.

  1. Create an IAM role named mySpectrum for the Amazon Redshift cluster to allow Redshift Spectrum to read Amazon S3 objects using the following policy:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:Get*",
                    "s3:List*"
                ],
                "Resource": "*"
            }
        ]
    }

  2. Add a trust relationship to allow users in Amazon Redshift to assume roles assigned to the cluster. See the following code:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "redshift.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }

  3. Choose your cluster name.
  4. Choose Properties.
  5. Choose Manage IAM roles.
  6. For Available IAM roles, choose your new role. If you don’t see the role listed, select Enter ARN and enter the role’s ARN.
  7. Choose Done, as seen in the following screenshot.
  8. Use the Amazon Redshift JDBC driver that includes the AWS SDK, which you can download from the Amazon Redshift console (see the following screenshot), and connect to the cluster using the IAM connection string from a SQL client such as SQL Workbench/J. The screenshots below depict the JAR file type you select and the connection configuration in SQL Workbench/J.
  9. As an Amazon Redshift admin user, create an external schema named schemaA mapped to the AWS Glue database tpcds3tb (use the IAM role you created earlier to allow Redshift Spectrum access to Amazon S3). See the following code:
    create external schema schemaA from data catalog
    database 'tpcds3tb'
    iam_role 'arn:aws:iam::0123456789:role/mySpectrum' region 'us-east-1';

  10. Verify the schema is in the Amazon Redshift catalog with the following code:
    select * from svv_external_schemas;

  11. Grant usage privilege to grpA. See the following code:
    Grant usage on schema schemaA to group grpA;

  12. Query tables in schemaA as user a1 in grpA using your SQL client. See the following code:
    Select * from schemaA.customer limit 3;
    Select * from schemaA.call_center limit 3;

    The following screenshot shows the successful query results.

  13. Query tables in schemaA as user b1 in grpB. The following screenshot shows the error message you receive.

This option gives great flexibility to isolate user access on Redshift Spectrum schemas, but what if user b1 is authorized to access one or more tables in that schema but not all tables? The second option creates coarse-grained access control policies.

Isolating user and group access using IAM policies and role chaining

You can use IAM policies mapped to IAM roles with a trust relationship to specific users and groups based on Amazon S3 location access and assign it to the cluster. For this use case, grpB is authorized to only access the table catalog_page located at s3://myworkspace009/tpcds3t/catalog_page/, and grpA is authorized to access all tables but catalog_page located at s3://myworkspace009/tpcds3t/*. The following steps help you configure for the given security requirement.

The following diagram depicts how role chaining works.

You first create IAM roles with policies specific to grpA and grpB. The first role is a generic cluster role that allows users to assume this role using a trust relationship defined in the role.

  1. On the IAM console, create a new role named myblog-redshift-assumeRole.

  2. Add the following two policies to this role:
    • Add a managed policy named AWSGlueConsoleFullAccess. You might consider adding your own inline policy with least privileges instead of this managed policy for more restricted security needs.
    • Add an inline policy called myblog-redshift-assumerole-inline with the following rules:
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "Stmt1487639602000",
                  "Effect": "Allow",
                  "Action": [
                      "sts:AssumeRole"
                  ],
                  "Resource": "arn:aws:iam::0123456789:role/*"
              }
          ]
      }

  3. Add a trust relationship that allows the users in the cluster to assume this role. You can choose to limit this to specific users as necessary. See the following code:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "redshift.amazonaws.com"
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringLike": {
              "sts:ExternalId": "arn:aws:redshift:us-east-1:0123456789:dbuser:<YourRedshiftClusterName>/*"
            }
          }
        }
      ]
    }

  4. Create a new Redshift-customizable role specific to grpA with a policy allowing access only to the Amazon S3 locations that this group is allowed to access. Make sure you omit the Amazon S3 location for the catalog_page table; you don’t want to authorize this group to view that data.
    • Name the role myblog-grpA-role.
  5. Add the following two policies to this role:
    • Add a managed policy named AWSGlueConsoleFullAccess to the role. (Note: You might consider adding your own inline policy with least privileges instead of this managed policy for more restricted security needs.)
    • Add an inline policy named myblog-grpA-access-policy with the following rules (modify it to fit your security needs and allow minimal permissions):
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "AllowS3",
                  "Effect": "Allow",
                  "Action": [
                      "s3:*"
                  ],
                  "Resource": [
                      "arn:aws:s3:::myworkspace009/tpcds3t/call_center/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/catalog_returns/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/catalog_sales/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/customer/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/customer_address/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/customer_demographics/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/date_dim/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/item/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/promotion/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/store/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/store_sales/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/web_page/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/web_sales/*"
                  ]
              }
          ]
      }

  6. Add a trust relationship that explicitly lists all users in grpA, so that only they can assume this role. To do so, choose the Trust relationships tab and edit it to add the following policy, updating the relevant account details. The trust relationship has to be updated for each user added to this role, or you can build a new role for each user; it is fairly easy to script this update for each new user.
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": " arn:aws:iam::0123456789:role/myblog-redshift-assumeRole"
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": {
              "sts:ExternalId": "arn:aws:redshift:us-east-1:0123456789:dbuser:<YourRedshiftClusterName>/a1"
            }
          }
        }
      ]
    }

  7. Create another Redshift-customizable role specific to grpB with a policy restricting access only to the Amazon S3 locations that this group is allowed to access.
    • Name the role myblog-grpB-role.
  8. Add the following two policies to this role. Create these managed policies to reflect the data access per database group and attach them to the roles that are assumed on the cluster:
    • Add a managed policy named AWSGlueConsoleFullAccess to the role. You might consider adding your own inline policy with least privileges instead of this managed policy for more restricted security needs.
    • Add an inline policy named myblog-grpB-access-policy with the following rules (modify it to fit your security needs and allow minimal permissions):
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "AllowS3",
                  "Effect": "Allow",
                  "Action": [
                      "s3:*"
                  ],
                  "Resource": [
                      "arn:aws:s3:::myworkspace009/tpcds3t/catalog_page/*"
                  ]
              }
          ]
      }

  9. Add a trust relationship that explicitly lists all users in grpB, so that only they can assume this role. To do so, choose the Trust relationships tab and edit it to add the following policy, updating the relevant account details. This trust relationship has to be updated for each user added to this role, or you can build a role for each user; it is fairly easy to script this update.
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": " arn:aws:iam::0123456789:role/myblog-redshift-assumeRole"
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": {
              "sts:ExternalId": "arn:aws:redshift:us-east-1:0123456789:dbuser:<YourRedshiftClusterName>/b1"
            }
          }
        }
      ]
    }

  10. Attach the three roles to the Amazon Redshift cluster and remove any other roles mapped to the cluster. If you don’t find any roles in the drop-down menu, use the role ARN.
  11. As an admin user, create a new external schema for grpA and grpB, respectively, using role chaining with the two roles you created.
    • For grpA, enter the following code:
      create external schema rc_schemaA from data catalog
      database 'tpcds3tb'
      iam_role 'arn:aws:iam::0123456789:role/myblog-redshift-assumeRole,arn:aws:iam::0123456789:role/myblog-grpA-role' region 'us-east-1';
      
      grant usage on schema rc_schemaA to group grpA;

    • For grpB, enter the following code:
      create external schema rc_schemaB from data catalog
      database 'tpcds3tb'
      iam_role 'arn:aws:iam::0123456789:role/myblog-redshift-assumeRole,arn:aws:iam::0123456789:role/myblog-grpB-role' region 'us-east-1';
      
      grant usage on schema rc_schemaB to group grpB;

  12. Query the external schema as users in grpA and grpB.
    • To query the customer table and catalog_page table as user a1 in grpA, enter the following code:
      select * from rc_schemaA.customer limit 3;
      select * from rc_schemaA.catalog_page limit 3;

The following screenshot shows the query results; user a1 can access the customer table successfully.

The following screenshot shows that user a1 can’t access catalog_page.

    • Query the customer table and catalog_page table as user b1 in grpB.

The following screenshot shows that user b1 can access catalog_page.

The following screenshot shows that user b1 can’t access the customer table.

Conclusion

This post demonstrated two different ways to isolate user and group access to external schemas and tables. With the first option of using grant usage statements, the granted group has access to all tables in the schema regardless of which Amazon S3 data lake paths the tables point to. This approach gives great flexibility to grant access with ease, but it doesn’t allow or deny access to specific tables in that schema.

With the second option, you manage user and group access at the grain of Amazon S3 objects, which gives more control of data security and lowers the risk of unauthorized data access. This approach has some additional configuration overhead compared to the first approach, but can yield better data security.

In both approaches, building the right governance model upfront on Amazon S3 paths, external schemas, and table mapping based on how groups of users access them is paramount to providing the best security with low operational overhead.

Special acknowledgment goes to AWS colleague Martin Grund for his valuable comments and suggestions.

 


About the Authors

Harsha Tadiparthi is a Specialist Sr. Solutions Architect, AWS Analytics. He enjoys solving complex customer problems in Databases and Analytics and delivering successful outcomes. Outside of work, he loves to spend time with his family, watch movies, and travel whenever possible.


Harshida Patel is a Data Warehouse Specialist Solutions Architect with AWS.


How Wind Mobility built a serverless data architecture

Post Syndicated from Pablo Giner original https://aws.amazon.com/blogs/big-data/how-wind-mobility-built-a-serverless-data-architecture/

Guest post by Pablo Giner, Head of BI, Wind Mobility.

Over the past few years, urban micro-mobility has become a trending topic. With pollution indexes hitting historic highs, cities and companies worldwide have been introducing regulations and working on a wide spectrum of solutions to alleviate the situation.

We at Wind Mobility strive to make commuters’ lives more sustainable and convenient by bringing short-distance urban transportation to cities worldwide.

At Wind Mobility, we scale our services at the same pace as our users demand them, and we do it in an economically and environmentally viable way. We optimize our fleet distribution to avoid overcrowding cities with more scooters than those that are actually going to be used, and we position them just meters away from where our users need them and at the time of the day when they want them.

How do we do that? By optimizing our operations to their fullest. To do so, we need to be very well informed about our users’ behavior under varying conditions and understand our fleet’s potential.

Scalability and flexibility for rapid growth

We knew that before we could solve this challenge, we needed to collect data from many different sources, such as user interactions with our application, user demand, IoT signals from our scooters, and operational metrics. To analyze the numerous datasets collected and extract actionable insights, we needed to build a data lake. While the high-level goal was clear, the scope was less so. We were working hard to scale our operation as we continued to launch new markets. The rapid growth and expansion made it very difficult to predict the volume of data we would need to consume. We were also launching new microservices to support our growth, which resulted in more data sources to ingest. We needed an architecture that allowed us to be agile and quickly adapt to meet our growth. It became clear that a serverless architecture was best positioned to meet those needs, so we started to design our 100% serverless infrastructure.

The first challenge was ingesting and storing data from our scooters in the field, events from our mobile app, operational metrics, and partner APIs. We use AWS Lambda to capture changes in our operational databases and mobile app and push the events to Amazon Kinesis Data Streams, which allows us to take action in real time. We also use Amazon Kinesis Data Firehose to write the data to Amazon Simple Storage Service (Amazon S3), which we use for analytics.

After the data was in Amazon S3 and adequately partitioned according to its most common use cases (we partition by date, region, and business line, depending on the data source), we had to find a way to query this data for both data profiling (understanding structure, content, and interrelationships) and ad hoc analysis. For that we chose AWS Glue crawlers to catalog our data and Amazon Athena to read from the AWS Glue Data Catalog and run queries. However, ad hoc analysis and data profiling are relatively sporadic tasks in our team, because most of the data processing computing hours are actually dedicated to transforming the multiple data sources into our data warehouse: consolidating the raw data, modeling it, adding new attributes, and picking the data elements, which constitute 95% of our analytics and predictive needs.

This is where all the heavy lifting takes place. We parse through millions of scooter and user events generated daily (over 300 events per second) to extract actionable insight. We selected AWS Glue to perform this task. Our primary ETL job reads the newly added raw event data from Amazon S3, processes it using Apache Spark, and writes the results to our Amazon Redshift data warehouse. AWS Glue plays a critical role in our ability to scale on demand. After careful evaluation and testing, we concluded that AWS Glue ETL jobs meet all our needs and free us from procuring and managing infrastructure.

Architecture overview

The following diagram represents our current data architecture, showing two serverless data collection, processing, and reporting pipelines:

  • Operational databases from Amazon Relational Database Service (Amazon RDS) and MongoDB
  • IoT and application events, followed by Athena for data profiling and Amazon Redshift for reporting

Our data is curated and transformed multiple times a day using an automated pipeline running on AWS Glue. The team can now focus on analyzing the data and building machine learning (ML) applications.

We chose Amazon QuickSight as our business intelligence tool to help us visualize and better understand our operational KPIs. Additionally, we use Amazon Elastic Container Registry (Amazon ECR) to store our Docker images containing our custom ML algorithms and Amazon Elastic Container Service (Amazon ECS) where we train, evaluate, and host our ML models. We schedule our models to be trained and evaluated multiple times a day. Taking as input curated data about demand, conversion, and flow of scooters, we run the models to help us optimize fleet utilization for a particular city at any given time.

The following diagram represents how data from the data lake is incorporated into our ML training, testing, and serving system. First, our developers work in the application code and commit their changes, which are built into new Docker images by our CI/CD pipeline and stored in the Amazon ECR registry. These images are pushed into Amazon ECS and tested in DEV and UAT environments before moving to PROD (where they are triggered by the Amazon ECS task scheduler). During their execution, the Amazon ECS tasks (some train the demand and usage forecasting models, some produce the daily and hourly predictions, and others optimize the fleet distribution to satisfy the forecast) read their configuration and pull data from Amazon S3 (which has been previously produced by scheduled AWS Glue jobs), finally storing their results back into Amazon S3. Executions of these pipelines are tracked via MLFlow (in a dedicated Amazon Elastic Compute Cloud (Amazon EC2) server) and the final result indicating the fleet operations required is fit into a Kepler map, which is then consumed by the operators on the field.

Conclusion

We at Wind Mobility place data at the forefront of our operations. For that, we need our data infrastructure to be as flexible as the industry and the context we operate in, which is why we chose serverless. Over the course of a year, we have built a data lake, a data warehouse, a BI suite, and a variety of (production) data science applications. All of that with a very small team.

Also, within the last 12 months, we have scaled up several of our data pipelines by a factor of 10, without slowing our momentum or redesigning any part of our architecture. When we had to double our fleet in 1 week and increase the frequency at which we capture data from scooters by a factor of 10, our serverless data architecture scaled with no issues. This allowed us to focus on adding value by simplifying our operation, reacting to changes quickly, and delighting our users.

We have measured our success in multiple dimensions:

  • Speed – Serverless is faster to deploy and expand; we believe we have reduced our time to market for the entire infrastructure by a factor of 2
  • Visibility – We have 360 degree visibility of our operations worldwide, accessible by our city managers, finance team, and management board
  • Optimized fleet deployment – We know, at any minute of the day, the number of scooters that our customers need over the next few hours, which reduces unsatisfied demand by more than 50%

If you face a similar challenge, our advice is clear: go fully serverless and use the spectrum of solutions available from AWS.

Follow us and discover more about Wind Mobility on Facebook, Instagram and LinkedIn.

 


About the Author

Pablo Giner is Head of BI at Wind Mobility. Pablo’s background is in wheels (motorcycle racing > vehicle engineering > collision insurance > eScooters sharing…) and for the last few years he has specialized in forming and developing data teams. At Wind Mobility, he leads the data function (data engineering + analytics + data science), and the project he is most proud of is what they call smart fleet rebalancing, an AI backed solution to reposition their fleet in real-time. “In God we trust. All others must bring data.” – W. Edward Deming


Process data with varying data ingestion frequencies using AWS Glue job bookmarks

Post Syndicated from Dipankar Kushari original https://aws.amazon.com/blogs/big-data/process-data-with-varying-data-ingestion-frequencies-using-aws-glue-job-bookmarks/

We often have data processing requirements in which we need to merge multiple datasets with varying data ingestion frequencies. Some of these datasets are ingested one time in full, received infrequently, and always used in their entirety, whereas other datasets are incremental, received at certain intervals, and joined with the full datasets to generate output. To address this requirement, this post demonstrates how to build an extract, transform, and load (ETL) pipeline using AWS Glue.

Using AWS Glue

AWS Glue provides a serverless environment to extract, transform, and load a large number of datasets from several sources for analytics purposes. It has a feature called job bookmarks to process incremental data when rerunning a job on a scheduled interval. A job bookmark is composed of the states for various job elements, such as sources, transformations, and targets. This is done by persisting state information from a job run that helps AWS Glue prevent reprocessing old data.

For this use case, we use an AWS Glue job with job bookmarks enabled to process files received in varying frequencies (a full dataset signifying files that are received one time, and incremental datasets signifying files that are received in certain regular intervals). These files are merged together. In addition to enabling job bookmarks, we also use an optional parameter transformation_ctx (transformation context) in an AWS Glue PySpark dynamic frame. This acts as a unique identifier for the ETL operator instance to identify state information within a job bookmark for a given operator. AWS Glue uses transformation_ctx to index the key to the bookmark state.

You can capture and maintain state information for incremental datasets and avoid reprocessing by using transformation context. Transformation context is omitted for the full dataset file, which results in the job run state information not getting captured for the full dataset and allowing it to participate in the next processing event in its entirety. Even though the job bookmark flag is enabled at the AWS Glue job level, because transformation context is omitted for the full dataset, every time the job runs, the entire data from the full dataset is used as part of the job. In contrast, only the newly arrived datasets are processed for the incremental datasets.

Solution overview

To demonstrate the job bookmark utility of AWS Glue, we use TLC Trip Record Data datasets. We use NYC yellow taxi trip data monthly files as the incremental dataset, and NYC taxi zone lookup as the full dataset. The monthly yellow taxi trip data has a field named PULocationID (where a customer was picked up), which is joined with the LocationID field from the NYC taxi zone lookup file to create an output dataset that contains Borough, Zone, and service_zone from the NYC taxi zone lookup dataset and all the fields (except the PULocationID field) from the monthly NYC taxi trip data file.
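
To make the join logic concrete, the following SQL sketch shows an equivalent transformation. It is illustrative only: the table names yellow_tripdata and taxi_zone_lookup are hypothetical placeholders, and the actual job performs this join in PySpark, as shown later in this post.

-- Illustrative sketch of the join performed by the AWS Glue job (the actual implementation is in PySpark).
-- Table names are hypothetical placeholders for the monthly trip data and the zone lookup.
SELECT z.Borough,
       z.Zone,
       z.service_zone,
       t.*                      -- all trip data columns; the job drops PULocationID after the join
FROM yellow_tripdata t
JOIN taxi_zone_lookup z
  ON t.PULocationID = z.LocationID;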

The following diagram depicts a high-level architecture of the process.

The diagram includes the following components:

  • Two Amazon S3 raw bucket locations are used for storing the incoming CSV source data: the NYC taxi monthly files (incremental dataset) and the NYC taxi zone lookup file (full dataset).
  • A bookmark-enabled AWS Glue job joins the monthly trip data file with the taxi zone lookup file to generate output Parquet files, and creates the NYC taxi trip table in the AWS Glue Data Catalog and the Amazon Redshift database.
  • An S3 curated bucket location is used to store the processed monthly NYC taxi Parquet files.

Creating the AWS CloudFormation stack

You use an AWS CloudFormation template to create the required resources in your preferred AWS account and Region.

Additionally, make sure you have an Amazon EC2 key pair created in the account and Region you’re launching the stack from.

To provide the stack parameters, complete the following steps:

  1. For Stack name, enter BigDataBlog-GlueBookmark-Stack.

  2. For RedshiftClusterIdentifier, enter bigdatablogrscluster.
  3. For NodeType, choose large.
  4. For NumberOfNodes, choose 2.
  5. For DatabaseName, enter bigdatablogdev.

  6. For MasterUserName, enter bigdatabloguser.
  7. For MasterUserPassword, enter a password for the master user account.
  8. For Maintenancewindow, enter sun:05:00-sun:05:30.
  9. For EC2InstanceType, choose micro.
  10. For SubscriptionEmail, enter your preferred email.
  11. For MyIPAddressCidr, enter your IP address.

You can find your IP Address by browsing https://www.whatismyip.com/ and looking up the value for My Public IPv4 is:. Add /32 at the end to make it CIDR-compatible and most restrictive.

  12. For DestinationPrefixListId, enter your prefix list ID.

To find your ID, set AWS credentials by entering aws configure in the command prompt. Run aws ec2 describe-prefix-lists to get the PrefixListId where PrefixListName is com.amazonaws.<<AWS region>>.s3 from the output.

  13. For NewS3BucketName, enter the name of your S3 bucket.

  14. For gluedatabase, enter bigdatabloggluedb.
  15. For EC2KeyName, enter the name of your key pair.

For instructions on creating a stack, see Creating a Stack on the AWS CloudFormation Console.

Make sure the stack is complete before moving to the next steps.

Creating the AWS Glue job

To create your AWS Glue job, complete the following steps:

  1. Download NYC yellow monthly trip data for October 2019 and November 2019 and save them under the s3://<<Your S3 Bucket>>/tripdata/ prefix.
  2. Download the NYC Taxi Zone lookup table and save it under the s3://<<Your S3 Bucket>>/tripdata-lookup/ prefix.
  3. Use the following PySpark script and change the piece of the code enclosed inside <<…>>.

You can find the values for the following keys on the Outputs tab for the CloudFormation stack:

    • S3Bucket
    • Snstopic

You can find the values for the following keys on the Parameters tab for the CloudFormation stack:

    • EC2KeyName
    • MyIPAddressCidr
    • NewS3BucketName
    • SubscriptionEmail

  4. When the AWS Glue script is ready, upload it to the S3 bucket under the s3://<<Your S3 Bucket>>/glue-script/ prefix.

You refer to this when you create the AWS Glue job.

  5. On the AWS Glue console, under ETL, choose Jobs.
  6. Choose Create job.
  7. For Name, enter a name for the job. For more information about AWS Glue job naming, see Jobs.
  8. For IAM role, choose the role the CloudFormation template created. Use the value for the key Glueaccessrole from the stack outputs.
  9. For Type, choose Spark.
  10. For Glue version, choose Spark 2.4, Python 3 (Glue Version 1.0).
  11. For This job runs, choose An existing script that you provide.
  12. For S3 path where the script is stored, choose the script file that you saved earlier under the s3://<<Your S3 Bucket>>/glue-script/ prefix.
  13. In the Advanced properties section, for Job bookmark, choose Enable.
  14. For Catalog options, select Use Glue Data Catalog as the Hive metastore.
  15. For Connections, enter the value of the key GlueConnection from the stack outputs.
  16. Choose Save job and edit script.

Creating an Amazon Redshift database schema

Before you run the AWS Glue job, you need to connect to the Amazon Redshift cluster and create an Amazon Redshift database schema named Glue_bookmark_redshift_schema. To connect to the cluster, use one of the JDBC client-based SQL tools, such as SQL Workbench/J. For instructions, see How can I access a private Amazon Redshift cluster from my local machine?
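
A minimal example of creating this schema, run while connected to the cluster, might look like the following (the identifier is shown in lowercase, which is how Amazon Redshift stores unquoted names):

CREATE SCHEMA glue_bookmark_redshift_schema;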

To access the cluster, you use the Amazon Redshift master user bigdatabloguser (the value for MasterUserName on the Parameters tab of the CloudFormation stack) and the password you provided when creating the stack.

Running the AWS Glue job

The AWS Glue job takes only one argument: the name of the file being processed. Pass the file name, such as yellow_tripdata_2019-10.csv, while processing that file. This enables you to track the records that belong to a specific file so that it’s easier to evaluate the result of multiple job runs using different files.

When the Glue job run is successful, you can see the output Parquet files under the /tripdata-joined-output/ prefix inside the S3 bucket you created by running the CloudFormation template. You can also use Amazon Athena to query the data from the table created in the Data Catalog. For more information, see Running SQL Queries Using Amazon Athena.

Query the Amazon Redshift database table named redshift_bookmark_table and review the output.
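
For example, a quick validation query might look like the following. It assumes the table was created in the Glue_bookmark_redshift_schema schema, and the month expression assumes the output retains the trip pickup timestamp column, which is named tpep_pickup_datetime in the public yellow taxi dataset; adjust the names to match your table.

-- Count loaded rows per month; the timestamp column name is an assumption
-- based on the public NYC yellow taxi schema.
SELECT date_trunc('month', tpep_pickup_datetime) AS trip_month,
       count(*) AS trip_count
FROM glue_bookmark_redshift_schema.redshift_bookmark_table
GROUP BY 1
ORDER BY 1;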

Explaining the solution

A bookmark-enabled AWS Glue job (in PySpark) is created that reads the NYC yellow taxi trip’s monthly file, joins it with the NYC taxi zone lookup file, produces files in Parquet format, and saves them in an Amazon S3 location.

A Data Catalog table is created that refers to the Parquet files’ location in Amazon S3. The resulting dataset is also loaded into an Amazon Redshift table using the AWS Glue PySpark job.

The AWS Glue job bookmark transformation context is used while the AWS Glue dynamic frame is created by reading a monthly NYC taxi file, whereas the transformation context is disabled while reading and creating the dynamic frame for the taxi zone lookup file (because the entire file is required for processing each monthly trip file). This allows you to process each monthly trip file exactly one time and reuse the entire taxi zone lookup file as many times as required because the missing transformation context for the lookup file doesn’t allow the bookmark context to be set for that file.

When a new NYC trip data monthly file arrives and the AWS Glue job runs, it only processes the newly arrived monthly file and ignores any previously processed monthly files. Similarly, when the Data Catalog table data is copied into Amazon Redshift, it only copies the newly processed underlying Parquet files’ data and appends it to the Amazon Redshift table. At this time the transformation context is enabled to utilize the job bookmark, and the AWS Glue dynamic frame is created by reading the Data Catalog table.

The following PySpark code uses the transformation context to create an AWS Glue dynamic frame while reading the monthly incremental file:

taxidata = GlueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": [InputDir]},
    format="csv",
    format_options={"withHeader": True, "separator": ",", "quoteChar": '"', "escaper": '"'},
    transformation_ctx="taxidata")

However, the following code omits transformation context when creating the AWS Glue dynamic frame for the lookup file:

Lookupdata = GlueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": [InputLookupDir]},
    format="csv",
    format_options={"withHeader": True, "separator": ",", "quoteChar": '"', "escaper": '"'})

Additionally, the following code uses the transformation context while reading the Data Catalog table, which is loaded into an Amazon Redshift table:

datasource0 = GlueContext.create_dynamic_frame.from_catalog(
    database=Glue_catalog_database,
    table_name=Glue_table_name,
    transformation_ctx="datasource0")

The following screenshot shows that the October 2019 yellow taxi trip data file (the incremental dataset) has arrived for processing.

To process each month’s data, you need the taxi zone lookup (full dataset).

The following screenshot shows the output of the AWS Glue job after processing the 2019 October trip data, saved in Parquet format.

The following two screenshots show the Amazon Redshift table: the first displays the record count for the October 2019 taxi data, and the second confirms that only the October 2019 taxi data file has been processed so far.

The following screenshot shows that the November 2019 NYC taxi data file has arrived for processing.

The following screenshot shows the output of the AWS Glue job after processing the 2019 November trip data, saved in Parquet format. The job only processed the November data and didn’t reprocess the October data, because the job bookmark and transformation context were enabled.

The following screenshot shows that the Amazon Redshift table now has both October and November data and shows the total record count.

The following screenshot shows individual record count for each month.

Querying with Athena

You can also review the dataset in Athena, which uses the same Glue Data Catalog. The following screenshot of an Athena query shows the Data Catalog table has both October and November data, with the total record count.

The following screenshot of an Athena query shows the individual record count for each month.

The following screenshot shows the location information, including borough, zone, and service zone, which is available in the taxi zone lookup and is joined with the October taxi trip data.

The following screenshot shows the output for the same query on the November data.

Cleaning up

When you’re done using this solution, you should delete the CloudFormation stack to avoid incurring any further charges.

Conclusion

This post describes how you can merge datasets received in different frequencies as part of your ETL pipeline processing using AWS Glue job bookmarks. The use case demonstrated how to use job bookmarks and transformation context to build an ETL pipeline for processing several incremental datasets.

 


About the Authors

Dipankar is a Senior Data Architect with AWS Professional Services, helping customers build analytics platform and solutions. He has a keen interest in distributed computing. Dipankar enjoys spending time playing chess and watching old Hollywood movies.


Ashok Padmanabhan is a big data consultant with AWS Professional Services, helping customers build big data and analytics platform and solutions. When not building and designing data lakes, Ashok enjoys spending time at beaches near his home in Florida.

Moovit embraces data lake architecture by extending their Amazon Redshift cluster to analyze billions of data points every day

Post Syndicated from Yonatan Dolan original https://aws.amazon.com/blogs/big-data/moovit-embraces-data-lake-architecture-by-extending-their-amazon-redshift-cluster-to-analyze-billions-of-data-points-every-day/

Amazon Redshift is a fast, fully managed, cloud-native data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence tools.

Moovit is a leading Mobility as a Service (MaaS) solutions provider and maker of the top urban mobility app. Guiding over 800 million users in more than 3,200 cities across 103 countries to get around town effectively and conveniently, Moovit has experienced exponential growth of their service in the last few years. The company amasses up to 6 billion anonymous data points a day to add to the world’s largest repository of transit and urban mobility data, aided by Moovit’s network of more than 685,000 local editors that help map and maintain local transit information in cities that would otherwise be unserved.

Like Moovit, many companies today are using Amazon Redshift to analyze data and perform various transformations on the data. However, as data continues to grow and become even more important, companies are looking for more ways to extract valuable insights from the data, such as big data analytics, numerous machine learning (ML) applications, and a range of tools to drive new use cases and business processes. Companies are looking to access all their data, all the time, by all users and get fast answers. The best solution for all those requirements is for companies to build a data lake, which is a centralized repository that allows you to store all your structured, semi-structured, and unstructured data at any scale.

With a data lake built on Amazon Simple Storage Service (Amazon S3), you can easily run big data analytics using services such as Amazon EMR and AWS Glue. You can also query structured data (such as CSV, Avro, and Parquet) and semi-structured data (such as JSON and XML) by using Amazon Athena and Amazon Redshift Spectrum. You can also use a data lake with ML services such as Amazon SageMaker to gain insights.

Moovit uses an Amazon Redshift cluster to allow different company teams to analyze vast amounts of data. They wanted a way to extend the collected data into the data lake and allow additional analytical teams to access more data to explore new ideas and business cases.

Additionally, Moovit was looking to manage their storage costs and evolve to a model that allowed cooler data to be maintained at the lowest cost in Amazon S3, while keeping the hottest data in Amazon Redshift for the most efficient query performance. The proposed solution implemented a hot/cold storage pattern using Amazon Redshift Spectrum and reduced the local disk utilization on the Amazon Redshift cluster to keep costs under control. Moovit is currently evaluating the new RA3 node with managed storage as an additional level of flexibility that will allow them to easily scale the amount of hot/cold storage without limit.

In this post we demonstrate how Moovit, with the support of AWS, implemented a lake house architecture by employing the following best practices:

  • Unloading data into Amazon Simple Storage Service (Amazon S3)
  • Instituting a hot/cold pattern using Amazon Redshift Spectrum
  • Using AWS Glue to crawl and catalog the data
  • Querying data using Athena

Solution overview

The following diagram illustrates the solution architecture.

The solution includes the following steps:

  1. Unload data from Amazon Redshift to Amazon S3
  2. Create an AWS Glue Data Catalog using an AWS Glue crawler
  3. Query the data lake in Amazon Athena
  4. Query Amazon Redshift and the data lake with Amazon Redshift Spectrum

Prerequisites

To complete this walkthrough, you must have the following prerequisites:

  1. An AWS account.
  2. An Amazon Redshift cluster.
  3. The following AWS services and access: Amazon Redshift, Amazon S3, AWS Glue, and Athena.
  4. The appropriate AWS Identity and Access Management (IAM) permissions for Amazon Redshift Spectrum and AWS Glue to access Amazon S3 buckets. For more information, see IAM policies for Amazon Redshift Spectrum and Setting up IAM Permissions for AWS Glue.

Walkthrough

To demonstrate the process Moovit used to build their data architecture, we use the industry-standard TPC-H dataset provided publicly by the TPC organization.

The Orders table has the following columns:

Column           Type
O_ORDERKEY       int4
O_CUSTKEY        int4
O_ORDERSTATUS    varchar
O_TOTALPRICE     numeric
O_ORDERDATE      date
O_ORDERPRIORITY  varchar
O_CLERK          varchar
O_SHIPPRIORITY   int4
O_COMMENT        varchar
SKIP             varchar

Unloading data from Amazon Redshift to Amazon S3

Amazon Redshift allows you to unload your data using a data lake export to an Apache Parquet file format. Parquet is an efficient open columnar storage format for analytics. Parquet format is up to twice as fast to unload and consumes up to six times less storage in Amazon S3, compared with text formats.

To unload cold or historical data from Amazon Redshift to Amazon S3, you need to run an UNLOAD statement similar to the following code (substitute your IAM role ARN):

UNLOAD ('select o_orderkey, o_custkey, o_orderstatus, o_totalprice, o_orderdate, o_orderpriority, o_clerk, o_shippriority, o_comment, skip
FROM tpc.orders
ORDER BY o_orderkey, o_orderdate') 
TO 's3://tpc-bucket/orders/' 
CREDENTIALS 'aws_iam_role=arn:aws:iam::<account_number>:role/<Role>'
FORMAT AS parquet allowoverwrite PARTITION BY (o_orderdate);

It is important to define a partition key or column that minimizes Amazon S3 scans as much as possible based on the query patterns intended. The query pattern is often by date ranges; for this use case, use the o_orderdate field as the partition key.

Another important recommendation when unloading is to have file sizes between 128 MB and 512 MB. By default, the UNLOAD command splits the results to one or more files per node slice (virtual worker in the Amazon Redshift cluster) which allows you to use the Amazon Redshift MPP architecture. However, this can potentially cause files created by every slice to be small. In Moovit’s use case, the default UNLOAD using PARALLEL ON yielded dozens of small (MBs) files. For Moovit, PARALLEL OFF yielded the best results because it aggregated all the slices’ work into the LEADER node and wrote it out as a single stream controlling the file size using the MAXFILESIZE option.
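
The following UNLOAD sketch illustrates how the PARALLEL OFF and MAXFILESIZE options might be combined; the bucket name and role ARN are placeholders, and you should tune the file size to your own data volumes:

UNLOAD ('SELECT * FROM tpc.orders')
TO 's3://tpc-bucket/orders_single_stream/'
IAM_ROLE 'arn:aws:iam::<account_number>:role/<Role>'
FORMAT AS PARQUET
PARALLEL OFF
MAXFILESIZE 256 MB
ALLOWOVERWRITE;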

Another performance enhancement applied in this use case was the use of Parquet’s min and max statistics. Parquet files have min_value and max_value column statistics for each row group that allow Amazon Redshift Spectrum to prune (skip) row groups that are out of scope for a query (range-restricted scan). To use row group pruning, you should sort the data by frequently-used columns. Min/max pruning helps scan less data from Amazon S3, which results in improved performance and reduced cost.
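
As an illustration, after the external table is defined (the spectrum.orders table created later in this walkthrough), a range-restricted filter on the sort column lets Redshift Spectrum skip row groups whose min/max statistics fall outside the predicate:

-- Row groups whose o_orderkey min/max range falls outside the predicate can be
-- skipped, because the unloaded data was sorted by o_orderkey.
SELECT count(*)
FROM spectrum.orders
WHERE o_orderkey BETWEEN 1000000 AND 2000000;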

After unloading the data to your data lake, you can view your Parquet file’s content in Amazon S3 (assuming it’s under 128 MB). From the Actions drop-down menu, choose Select from.

You’re now ready to populate your Data Catalog using an AWS Glue crawler.

Creating a Data Catalog with an AWS Glue crawler

To query your data lake using Athena, you must catalog the data. The Data Catalog is an index of the location, schema, and runtime metrics of the data.

An AWS Glue crawler accesses your data store, extracts metadata (such as field types), and creates a table schema in the Data Catalog. For instructions, see Working with Crawlers on the AWS Glue Console.

Querying the data lake in Athena

After you create the crawler, you can view the schema and tables in AWS Glue and Athena, and can immediately start querying the data in Athena. The following screenshot shows the table in the Athena Query Editor.

Querying Amazon Redshift and the data lake using a unified view with Amazon Redshift Spectrum

Amazon Redshift Spectrum is a feature of Amazon Redshift that allows multiple Redshift clusters to query the same data in the data lake. It enables the lake house architecture and allows data warehouse queries to reference data in the data lake as they would any other table. Amazon Redshift clusters transparently use the Amazon Redshift Spectrum feature when the SQL query references an external table stored in Amazon S3. Large multiple queries in parallel are possible by using Amazon Redshift Spectrum on external tables to scan, filter, aggregate, and return rows from Amazon S3 back to the Amazon Redshift cluster.

Following best practices, Moovit decided to persist all their data in their Amazon S3 data lake and only store hot data in Amazon Redshift. They could query both hot and cold datasets in a single query with Amazon Redshift Spectrum.

The first step is creating an external schema in Amazon Redshift that maps a database in the Data Catalog. See the following code:

CREATE EXTERNAL SCHEMA spectrum 
FROM data catalog 
DATABASE 'datalake' 
iam_role 'arn:aws:iam::<account_number>:role/mySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
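
You can verify that the tables cataloged by the crawler are visible through the new schema by querying the SVV_EXTERNAL_TABLES system view; a quick check might look like the following:

-- List external tables exposed through the Spectrum schema
SELECT schemaname, tablename, location
FROM svv_external_tables
WHERE schemaname = 'spectrum';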

After the crawler creates the external table, you can start querying in Amazon Redshift using the mapped schema that you created earlier. See the following code:

SELECT * FROM spectrum.orders;

Lastly, create a late binding view that unions the hot and cold data:

CREATE OR REPLACE VIEW lake_house_joint_view AS
(SELECT * FROM public.orders WHERE o_orderdate >= dateadd('day', -90, date_trunc('day', getdate()))
UNION ALL
SELECT * FROM spectrum.orders WHERE o_orderdate < dateadd('day', -90, date_trunc('day', getdate())))
WITH NO SCHEMA BINDING;
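
Users can then query the unified view without needing to know whether a given row is served from the local cluster tables or from the data lake; for example:

-- Example query against the unified hot/cold view
SELECT o_orderstatus, count(*) AS order_count
FROM lake_house_joint_view
GROUP BY o_orderstatus;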

Summary

In this post, we showed how Moovit unloaded data from Amazon Redshift to a data lake. By doing that, they exposed the data to many additional groups within the organization and democratized the data. These benefits of data democratization are substantial because various teams within Moovit can access the data, analyze it with various tools, and come up with new insights.

As an additional benefit, Moovit reduced their Amazon Redshift utilized storage, which allowed them to maintain cluster size and avoid additional spending by keeping all historical data within the data lake and only hot data in the Amazon Redshift cluster. Keeping only hot data on the Amazon Redshift cluster prevents Moovit from deleting data frequently, which saves IT resources, time, and effort.

If you are looking to extend your data warehouse to a data lake and leverage various tools for big data analytics and machine learning (ML) applications, we invite you to try out this walkthrough.

 


About the Authors

Yonatan Dolan is a Business Development Manager at Amazon Web Services. He is located in Israel and helps customers harness AWS analytical services to leverage data, gain insights, and derive value.


Alon Gendler is a Startup Solutions Architect at Amazon Web Services. He works with AWS customers to help them architect secure, resilient, scalable and high performance applications in the cloud.


Vincent Gromakowski is a Specialist Solutions Architect for Amazon Web Services.


Best practices for Amazon Redshift Federated Query

Post Syndicated from Joe Harris original https://aws.amazon.com/blogs/big-data/amazon-redshift-federated-query-best-practices-and-performance-considerations/

Amazon Redshift Federated Query enables you to use the analytic power of Amazon Redshift to directly query data stored in Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL databases. For more information about setting up an environment where you can try out Federated Query, see Accelerate Amazon Redshift Federated Query adoption with AWS CloudFormation.

Federated Query enables real-time data integration and simplified ETL processing. You can now connect live data sources directly in Amazon Redshift to provide real-time reporting and analysis. Previously, you needed to extract data from your PostgreSQL database to Amazon Simple Storage Service (Amazon S3) and load it to Amazon Redshift using COPY or query it from Amazon S3 with Amazon Redshift Spectrum. For more information about the benefits of Federated Query, see Build a Simplified ETL and Live Data Query Solution using Amazon Redshift Federated Query.

This post discusses 10 best practices to help you maximize the benefits of Federated Query when you have large federated data sets, when your federated queries retrieve large volumes of data, or when you have many Redshift users accessing federated data sets. These techniques are not necessary for general usage of Federated Query. They are intended for advanced users who want to make the most of this exciting feature.

The best practices are divided into two sections: the first for advice that applies to your Amazon Redshift cluster, and the second for advice that applies to your Aurora PostgreSQL and Amazon RDS for PostgreSQL environments.

The code examples provided in this post derive from the data and queries in the CloudDataWarehouseBenchmark GitHub repo (based on TPC-H and TPC-DS).

Best practices to apply in Amazon Redshift

The following best practices apply to your Amazon Redshift cluster when using federated queries to access your Aurora or Amazon RDS for PostgreSQL instances.

1. Use separate external schemas for each use case

Consider creating separate Amazon Redshift external schemas, using separate remote PostgreSQL users, for each specific Amazon Redshift use case. This practice allows you to have extra control over the users and groups who can access the external database. For instance, you may want to have one external schema for ETL usage, with an associated PostgreSQL user that has broad access, and another schema, with an associated PostgreSQL user, for ad hoc reporting and analysis with access limited to specific resources.

The following code example creates two external schemas for ETL use and ad-hoc reporting use. Each schema uses a different SECRET_ARN containing credentials for separate users in the PostgreSQL database.

-- ETL usage - broad access
CREATE EXTERNAL SCHEMA IF NOT EXISTS apg_etl
FROM POSTGRES DATABASE 'tpch' SCHEMA 'public'
URI 'aurora-postgres-ro.cr7d8lhiupkf.us-west-2.rds.amazonaws.com' PORT 8192
IAM_ROLE 'arn:aws:iam::123456789012:role/apg-federation-role'
SECRET_ARN 'arn:aws:secretsmanager:us-west-2:123456789012:secret:apg-redshift-etl-secret-187Asd'
;
-- Ad-Hoc usage - limited access
CREATE EXTERNAL SCHEMA IF NOT EXISTS apg_adhoc
FROM POSTGRES DATABASE 'tpch' SCHEMA 'public'
URI 'aurora-postgres-ro.cr7d8lhiupkf.us-west-2.rds.amazonaws.com' PORT 8192
IAM_ROLE 'arn:aws:iam::123456789012:role/apg-federation-role'
SECRET_ARN 'arn:aws:secretsmanager:us-west-2:123456789012:secret:apg-redshift-adhoc-secret-187Asd'
;

2. Use query timeouts to limit total runtimes

Consider setting a timeout on the users or groups that have access to your external schemas. User queries could unintentionally try to retrieve a very large number of rows from the external relation and remain running for an extended time, which holds open resources in both Amazon Redshift and PostgreSQL.

To limit the total runtime of a user’s queries, you can set a statement_timeout for all of that user’s queries. The following code example sets a 2-hour timeout for an ETL user:

-- Set ETL user timeout to 2 hours
ALTER USER etl_user SET statement_timeout TO 7200000;

If many users have access to your external schemas, it may not be practical to define a statement_timeout for each individual user. Instead, you can add a query monitoring rule in your WLM configuration using the query_execution_time metric. The following screenshot shows an Auto WLM configuration with an Adhoc Reporting queue for users in the adhoc group, with a rule that cancels queries that run for longer than 1,800 seconds (30 minutes).
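
To confirm that such a rule is firing, you can check the rule action log. The following is a minimal sketch that assumes the stl_wlm_rule_action system table; adjust the filter to match your own rule name and time range.

-- Sketch: list recent query monitoring rule actions
SELECT userid, query, service_class, rule, action, recordtime
FROM stl_wlm_rule_action
WHERE action = 'abort'
ORDER BY recordtime DESC
LIMIT 20;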

3. Make sure the Amazon Redshift query plan is efficient

Review the overall query plan and query metrics of your federated queries to make sure that Amazon Redshift processes them efficiently. For more information about query plans, see Evaluating the query plan.

Viewing the Amazon Redshift query explain plan

You can retrieve the plan for your query by prefixing your SQL with EXPLAIN and running that in your SQL client. The following code example is the explain output for a sample query:

<< REDSHIFT >> QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 XN Aggregate  (cost=6396670427721.37..6396670427721.37 rows=1 width=32)
   ->  XN Hash Join DS_BCAST_INNER  (cost=499986.50..6396670410690.30 rows=6812425 width=32)
         Hash Cond: ("outer".l_partkey = ("inner".p_partkey)::bigint)
         ->  XN Seq Scan on lineitem  (cost=0.00..2997629.29 rows=199841953 width=40)
               Filter: ((l_shipdate < '1994-03-01'::date) AND (l_shipdate >= '1994-02-01'::date))
         ->  XN Hash  (cost=449987.85..449987.85 rows=19999460 width=4)
               ->  XN PG Query Scan part  (cost=0.00..449987.85 rows=19999460 width=4)
                     ->  Remote PG Seq Scan apg_tpch_100g.part  (cost=0.00..249993.25 rows=19999460 width=4)
                           Filter: ((p_type)::text ~~ 'PROMO%'::text)

The operator XN PG Query Scan indicates that Amazon Redshift runs a query against the federated PostgreSQL database for this part of the query; we refer to this as the “federated subquery” in this post. When your query uses multiple federated data sources, Amazon Redshift runs a federated subquery for each source. Amazon Redshift runs each federated subquery from a randomly selected node in the cluster.

Below the XN PG Query Scan line, you can see Remote PG Seq Scan followed by a line with a Filter: element. These two lines define how Amazon Redshift accesses the external data and the predicate used in the federated subquery. You can see that the federated subquery will run against the federated table apg_tpch_100g.part.

You can also see from rows=19999460 that Amazon Redshift estimates that the query can return up to 20 million rows from PostgreSQL. It creates this estimate by asking PostgreSQL for statistics about the table.
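
Those statistics are only as accurate as what PostgreSQL has collected. If the row estimate looks far off, running ANALYZE on the remote table in the PostgreSQL database (not in Amazon Redshift) refreshes the statistics that Amazon Redshift retrieves, as in the following sketch:

-- In PostgreSQL: refresh planner statistics for the table behind the federated subquery
ANALYZE public.part;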

Joins

Since each federated subquery runs from a single node in the cluster, Amazon Redshift must choose a join distribution strategy to send the rows returned from the federated subquery to the rest of the cluster to complete the joins in your query. The choice of a broadcast or distribution strategy is indicated in the explain plan. Operators that start with DS_BCAST broadcast a full copy of the data to all nodes. Operators that start with DS_DIST distribute a portion of the data to each node in the cluster.

It’s usually most efficient to broadcast small results and distribute larger results. When the planner has a good estimate of the number of rows that the federated subquery will return, it chooses the correct join distribution strategy. However, if the planner’s estimate isn’t accurate, it may choose to broadcast a result that is too large, which can slow down your query.

Join Order

Joins should use the smaller result as the inner relation. When your query joins two tables (or two federated subqueries), Amazon Redshift must choose how best to perform the join. The query planner may not perform joins in the order declared in your query. Instead, it uses the information it has about the relations being joined to create estimated costs for a variety of possible plans. It uses the plan, including join order, that has the lowest expected cost.

When you use a hash join, the most common join, Amazon Redshift constructs a hash table from the inner table (or result) and compares it to every row from the outer table. You want to use the smallest result as the inner so that the hash table can fit in memory. The chosen join order may not be optimal if the planner’s estimate doesn’t reflect the real size of the results from each step in the query.
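
After the query runs, one way to check whether the planner worked from reasonable estimates is to compare them with the actual rows processed at each step. The following is a sketch that assumes the svl_query_summary system view; 123456 is a hypothetical query ID.

-- Sketch: actual rows per plan step for a completed query
SELECT query, seg, step, label, rows
FROM svl_query_summary
WHERE query = 123456
ORDER BY seg, step;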

Improving query efficiency

The following is high-level advice for improving efficiency. For more information, see Analyzing the query plan.

  • Examine the plan for separate parts of your query. If your query has multiple joins or uses subqueries, you can review the explain plan for each join or subquery to check whether the query benefits from being simplified. For instance, if you use several joins, examine the plan for a simpler query using only one join to see how Amazon Redshift plans that join on its own.
  • Examine the order of outer joins and use an inner join. The planner can’t always reorder outer joins. If you can convert an outer join to an inner join, it may allow the planner to use a more efficient plan.
  • Reference the distribution key of the largest Amazon Redshift table in the join. When a join references the distribution key, Amazon Redshift can complete the join on each node in parallel without moving the rows from the Redshift table across the cluster.
  • Insert the federated subquery result into a table. Amazon Redshift has optimal statistics when the data comes from a local temporary or permanent table. In rare cases, it may be most efficient to store the federated data in a temporary table first and join it with your Amazon Redshift data (see the sketch after this list).
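
The following is a minimal sketch of that last approach, using the tables from the earlier examples; the temporary table name tmp_part and the predicates are illustrative.

-- Sketch: materialize the federated result locally, then join it to the local table
CREATE TEMP TABLE tmp_part AS
SELECT p_partkey, p_type
FROM apg_tpch.part
WHERE p_type LIKE 'PROMO%';

-- Refresh statistics so the planner sizes the join correctly
ANALYZE tmp_part;

SELECT AVG((l_extendedprice * l_discount) / l_quantity) AS avg_promo_disc_val
FROM public.lineitem
JOIN tmp_part ON l_partkey = p_partkey
WHERE l_shipdate >= DATE '1994-02-01'
  AND l_shipdate < DATE '1994-03-01';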

4. Make sure predicates are pushed down to the remote query

Amazon Redshift’s query optimizer is very effective at pushing predicate conditions down to the federated subquery that runs in PostgreSQL. Review the query plan of important or long-running federated queries to check that Amazon Redshift applies all applicable predicates to each subquery.

Consider the following example query, in which the predicate is inside a CASE statement and the federated relation is within a CTE subquery:

WITH cte 
AS (SELECT p_type, l_extendedprice, l_discount, l_quantity 
    FROM public.lineitem
    JOIN apg_tpch.part --<< PostgreSQL table
         ON l_partkey = p_partkey
    WHERE l_shipdate >= DATE '1994-02-01'
      AND l_shipdate < (DATE '1994-02-01' + INTERVAL '1 month')
)
SELECT CASE WHEN p_type LIKE 'PROMO%' --<< PostgreSQL filter predicate pt1
            THEN TRUE ELSE FALSE END AS is_promo
     , AVG( ( l_extendedprice * l_discount) / l_quantity ) AS avg_promo_disc_val
FROM cte
WHERE is_promo IS TRUE --<< PostgreSQL filter predicate pt2
GROUP BY 1;

Amazon Redshift can still effectively optimize the federated subquery by pushing a filter down to the remote relation. See the following plan:

<< REDSHIFT >> QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 XN HashAggregate  (cost=17596843176454.83..17596843176456.83 rows=200 width=87)
   ->  XN Hash Join DS_BCAST_INNER  (cost=500000.00..17596843142391.79 rows=6812609 width=87)
         Hash Cond: ("outer".l_partkey = ("inner".p_partkey)::bigint)
         ->  XN Seq Scan on lineitem  (cost=0.00..2997629.29 rows=199841953 width=40)
               Filter: ((l_shipdate < '1994-03-01'::date) AND (l_shipdate >= '1994-02-01'::date))
         ->  XN Hash  (cost=450000.00..450000.00 rows=20000000 width=59)-- Federated subquery >>
               ->  XN PG Query Scan part  (cost=0.00..450000.00 rows=20000000 width=59)
                     ->  Remote PG Seq Scan apg_tpch.part  (cost=0.00..250000.00 rows=20000000 width=59)
                           Filter: (CASE WHEN ((p_type)::text ~~ 'PROMO%'::text) THEN true ELSE false END IS TRUE)-- << Federated subquery

If Amazon Redshift can’t push your predicates down as needed, or the query still returns too much data, consider the advice in the following two sections regarding materialized views and syncing tables. To rewrite queries so that filters can be pushed down effectively, also consider the advice in the final best practice about replacing restrictive joins with a remote view.

5. Use materialized views to cache frequently accessed data

Amazon Redshift now supports the creation of materialized views that reference federated tables in external schemas.

Cache queries that run often

Consider caching frequently run queries in your Amazon Redshift cluster using a materialized view. When many users run the same federated query regularly, the remote content of the query must be retrieved again for each execution. With a materialized view, the results can instead be retrieved from your Amazon Redshift cluster without getting the same data from the remote database. You can then schedule the refresh of the materialized view to happen at a specific time, depending upon the change rate and importance of the remote data.

The following code example demonstrates the creation, querying, and refresh of a materialized view from a query that uses a federated source table:

-- Create the materialized view
CREATE MATERIALIZED VIEW mv_store_quantities_by_quarter AS
SELECT ss_store_sk
     , d_quarter_name
     , COUNT(ss_quantity) AS count_quantity
     , AVG(ss_quantity) AS avg_quantity
FROM public.store_sales
JOIN apg_tpcds.date_dim --<< federated table
    ON d_date_sk = ss_sold_date_sk
GROUP BY ss_store_sk, d_quarter_name
ORDER BY ss_store_sk
;
--Query the materialized view
SELECT * 
FROM mv_store_quantities_by_quarter
WHERE d_quarter_name = '1998Q1'
;
--Refresh the materialized view
REFRESH MATERIALIZED VIEW mv_store_quantities_by_quarter
;

Cache tables that are used by many queries

Also consider locally caching tables used by many queries with a materialized view. When many different queries use the same federated table, it’s often better to create a materialized view for that federated table, which the other queries can then reference instead.

The following code example demonstrates the creation and querying of a materialized view on a single federated source table:

-- Create the materialized view
CREATE MATERIALIZED VIEW mv_apg_part AS
SELECT * FROM apg_tpch_100g.part
;
--Query the materialized view
SELECT SUM(l_extendedprice * (1 - l_discount)) AS discounted_price
FROM public.lineitem, mv_apg_part
WHERE l_partkey = p_partkey
  AND l_shipdate BETWEEN '1997-03-01' AND '1997-04-01'
;

As of this writing, you can’t reference a materialized view inside another materialized view. Other views that use the cached table need to be regular views.
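
A regular view can still layer additional logic on top of the cached table. The following is a minimal sketch that assumes the mv_apg_part materialized view from the previous example; vw_part_promo is a hypothetical name.

-- Sketch: a regular (non-materialized) view built on the cached federated table
CREATE VIEW vw_part_promo AS
SELECT p_partkey, p_type
FROM mv_apg_part
WHERE p_type LIKE 'PROMO%';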

Balance caching against refresh time and frequency

The use of materialized views is best suited for queries that run quickly relative to the refresh schedule. For example, a materialized view refreshed hourly should run in a few minutes, and a materialized view refreshed daily should run in less than an hour. As of this writing, materialized views that reference external tables aren’t eligible for incremental refresh. A full refresh occurs when you run REFRESH MATERIALIZED VIEW, which recreates the entire result.

Limit remote access using materialized views

Also consider using materialized views to reduce the number of users who can issue queries directly against your remote databases. You can grant external schema access only to a user who refreshes the materialized views and grant other Amazon Redshift users access only to the materialized view.
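
The following sketch illustrates this pattern with the apg_tpch schema and mv_apg_part materialized view from the earlier examples; the user etl_refresher and the group analysts are hypothetical names.

-- Sketch: only the refreshing user can reach the external schema; analysts read the cached data
GRANT USAGE ON SCHEMA apg_tpch TO etl_refresher;
GRANT SELECT ON mv_apg_part TO GROUP analysts;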

Limiting the scope of access in this way is a general best practice for data security when querying from remote production databases that contain sensitive information.

6. Sync large remote tables to a local copy

Consider keeping a copy of the remote table in a permanent Amazon Redshift table. When your remote table is large and a full refresh of a materialized view is time-consuming, it’s more effective to use a sync process to keep a local copy updated.

Sync newly added remote data

When your large remote table only has new rows added, with no rows updated or deleted, you can synchronize your Amazon Redshift copy by periodically inserting the new rows from the remote table into the copy. You can automate this sync process using the example stored procedure sp_sync_get_new_rows on GitHub.

This example stored procedure requires the source table to have an auto-incrementing identity column as its primary key. It finds the current maximum in your Amazon Redshift table, retrieves all rows in the federated table with a higher ID value, and inserts them into the Amazon Redshift table.
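
The following is a simplified sketch of the core logic such a sync performs, using the example tables created below; it is not the stored procedure itself.

-- Sketch: insert only the rows whose ID is higher than the current local maximum
INSERT INTO public.rs_target
SELECT pk_col, data_col
FROM apg_tpch.pg_source
WHERE pk_col > (SELECT COALESCE(MAX(pk_col), 0) FROM public.rs_target);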

The following code examples demonstrate a sync from a federated source table to an Amazon Redshift target table. First, you create a source table with four rows in the PostgreSQL database:

CREATE TABLE public.pg_source ( 
      pk_col   BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY
    , data_col VARCHAR(20));
INSERT INTO public.pg_source (data_col)
VALUES ('aardvark'),('aardvarks'),('aardwolf'),('aardwolves')
;

Create a target table with two rows in your Amazon Redshift cluster:

CREATE TABLE public.rs_target (
      pk_col   BIGINT PRIMARY KEY
    , data_col VARCHAR(20));
INSERT INTO public.rs_target
VALUES (1,'aardvark'), (2,'aardvarks')
;

Call the Amazon Redshift stored procedure to sync the tables:

CALL sp_sync_get_new_rows(SYSDATE,'apg_tpch.pg_source','public.rs_target','pk_col','public.sp_logs',0);
-- INFO:  SUCCESS - 2 new rows inserted into `target_table`.

SELECT * FROM public.rs_target;
--  pk_col |  data_col  
-- --------+------------
--       1 | aardvark  
--       2 | aardvarks  
--       4 | aardwolves
--       3 | aardwolf 

Merge remote data changes

After you update or insert rows in your remote table, you can synchronize your Amazon Redshift copy by periodically merging the changed rows and new rows from the remote table into the copy. This approach works best when changes are clearly marked in the table so that you can easily retrieve just the new or changed rows. You can automate this sync process using the example stored procedure sp_sync_merge_changes on GitHub.

This example stored procedure requires the source to have a date/time column that indicates the last time each row was modified. It uses this column to find changes that you need to sync and either updates the changed rows or inserts new rows in the Amazon Redshift copy. The stored procedure also requires the table to have a primary key declared. It uses the primary key to identify which rows to update in the local copy of the data.
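
The following is a simplified sketch of the merge pattern such a procedure applies, using the example tables created below; it is not the stored procedure itself.

-- Sketch: stage rows changed since the last sync, then delete-and-insert them locally
CREATE TEMP TABLE stage_changes AS
SELECT pk_col, data_col, last_mod
FROM apg_tpch.pg_tbl
WHERE last_mod > (SELECT COALESCE(MAX(last_mod), '1900-01-01'::timestamp) FROM public.rs_tbl);

DELETE FROM public.rs_tbl
USING stage_changes
WHERE public.rs_tbl.pk_col = stage_changes.pk_col;

INSERT INTO public.rs_tbl
SELECT pk_col, data_col, last_mod FROM stage_changes;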

The following code examples demonstrate a refresh from a federated source table to an Amazon Redshift target table. First, create a sample table with two rows in your Amazon Redshift cluster:

CREATE TABLE public.rs_tbl ( 
      pk_col   INTEGER PRIMARY KEY
    , data_col VARCHAR(20)
    , last_mod TIMESTAMP);
INSERT INTO public.rs_tbl 
VALUES (1,'aardvark', SYSDATE), (2,'aardvarks', SYSDATE);

SELECT * FROM public.rs_tbl;
--  pk_col |  data_col  |     last_mod
-- --------+------------+---------------------
--       1 | aardvark   | 2020-04-01 18:01:02
--       2 | aardvarks  | 2020-04-01 18:01:02

Create a source table with four rows in your PostgreSQL database:

CREATE TABLE public.pg_tbl (
      pk_col   INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY
    , data_col VARCHAR(20)
    , last_mod TIMESTAMP);
INSERT INTO public.pg_tbl (data_col, last_mod)
VALUES ('aardvark', NOW()),('aardvarks', NOW()),('aardwolf', NOW()),('aardwolves', NOW());

Call the Amazon Redshift stored procedure to sync the tables:

CALL sp_sync_merge_changes(SYSDATE,'apg_tpch.pg_tbl','public.rs_tbl','last_mod','public.sp_logs',0);
-- INFO:  SUCCESS - 4 rows synced.

SELECT * FROM public.rs_tbl;
--  pk_col |  data_col  |      last_mod
-- --------+------------+---------------------
--       1 | aardvark   | 2020-04-01 18:09:56
--       2 | aardvarks  | 2020-04-01 18:09:56
--       4 | aardwolves | 2020-04-01 18:09:56
--       3 | aardwolf   | 2020-04-01 18:09:56

Best practices to apply in Aurora or Amazon RDS

The following best practices apply to your Aurora or Amazon RDS for PostgreSQL instances when using them with Amazon Redshift federated queries.

7. Use a read replica to minimize Aurora or RDS impact

Aurora and Amazon RDS allow you to configure one or more read replicas of your PostgreSQL instance. As of this writing, Federated Query doesn’t allow writing to the federated database, so you should use a read-only endpoint as the target for your external schema. This also makes sure that the federated subqueries Amazon Redshift issues have the minimum possible impact on the master database instance, which often runs a large number of small and fast write transactions.

For more information about read replicas, see Adding Aurora Replicas to a DB Cluster and Working with PostgreSQL Read Replicas in Amazon RDS.

The following code example creates an external schema using a read-only endpoint. You can see the -ro naming in the endpoint URI configuration:

--In Amazon Redshift: 
CREATE EXTERNAL SCHEMA IF NOT EXISTS apg_etl
FROM POSTGRES DATABASE 'tpch' SCHEMA 'public'
URI 'aurora-postgres-ro.cr7d8lhiupkf.us-west-2.rds.amazonaws.com' PORT 8192
IAM_ROLE 'arn:aws:iam::123456789012:role/apg-federation-role'
SECRET_ARN 'arn:aws:secretsmanager:us-west-2:123456789012:secret:apg-redshift-etl-secret-187Asd';

8. Use specific and limited PostgreSQL users for each use case

As mentioned in the first best practice regarding separate external schemas, consider creating separate PostgreSQL users for each federated query use case. Having multiple users allows you to grant only the permissions needed for each specific use case. Each user needs a different SECRET_ARN, containing its access credentials, for the Amazon Redshift external schema to use. See the following code:

-- Create an ETL user who will have broad access
CREATE USER redshift_etl WITH PASSWORD '<<example>>';
-- Create an Ad-Hoc user who will have limited access
CREATE USER redshift_adhoc WITH PASSWORD '<<example>>';

Apply a user timeout to limit query runtime

Consider setting a statement_timeout on your PostgreSQL users. A user query could accidentally try to retrieve many millions of rows from the external relation and remain running for an extended time, which holds open resources in both Amazon Redshift and PostgreSQL. To prevent this, specify different timeout values for each user according to their expected usage. The following code example sets timeouts for an ETL user and an ad-hoc reporting user:

-- Set ETL user timeout to 1 hour
ALTER USER redshift_etl SET statement_timeout TO 3600000;
-- Set Ad-Hoc user timeout to 15 minutes
ALTER USER redshift_adhoc SET statement_timeout TO 900000;

9. Make sure the PostgreSQL table is correctly indexed

Consider adding or modifying PostgreSQL indexes to make sure Amazon Redshift federated queries run efficiently. Amazon Redshift retrieves data from PostgreSQL using regular SQL queries against your remote database. Queries are often faster when using an index, particularly when the query returns a small portion of the table.

Consider the following code example of an Amazon Redshift federated query on the lineitem table:

SELECT AVG( ( l_extendedprice * l_discount) / l_quantity ) AS avg_disc_val
FROM apg_tpch.lineitem
WHERE l_shipdate >= DATE '1994-02-01'
  AND l_shipdate < (DATE '1994-02-01' + INTERVAL '1 day');

Amazon Redshift rewrites this into the following federated subquery to run in PostgreSQL:

SELECT pg_catalog."numeric"(l_discount)
     , pg_catalog."numeric"(l_extendedprice)
     , pg_catalog."numeric"(l_quantity) 
FROM public.lineitem  
WHERE (l_shipdate < '1994-02-02'::date) 
  AND (l_shipdate >= '1994-02-01'::date);

Without an index, you get the following plan from PostgreSQL:

<< POSTGRESQL >> QUERY PLAN [No Index]
--------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..16223550.40 rows=232856 width=17)
   Workers Planned: 2
   ->  Parallel Seq Scan on lineitem  (cost=0.00..16199264.80 rows=97023 width=17)
         Filter: ((l_shipdate < '1994-02-02'::date) AND (l_shipdate >= '1994-02-01'::date))

You can add the following index to cover exactly the data this query needs:

CREATE INDEX lineitem_ix_covering 
ON public.lineitem (l_shipdate, l_extendedprice, l_discount, l_quantity);

With the new index in place, you see the following plan:

<< POSTGRESQL >> QUERY PLAN [With Covering Index]
------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on lineitem  (cost=7007.35..839080.74 rows=232856 width=17)
   Recheck Cond: ((l_shipdate < '1994-02-02'::date) AND (l_shipdate >= '1994-02-01'::date))
   ->  Bitmap Index Scan on lineitem_ix_covering  (cost=0.00..6949.13 rows=232856 width=0)
         Index Cond: ((l_shipdate < '1994-02-02'::date) AND (l_shipdate >= '1994-02-01'::date))

In the revised plan, the maximum cost is 839,080 versus the original 16,223,550, roughly 19 times lower. The reduced cost suggests that the query is faster when using the index, but testing is needed to confirm this.

Indexes require careful consideration. The detailed tradeoffs of adding additional indexes in PostgreSQL, the specific PostgreSQL index types available, and index usage techniques are beyond the scope of this post.

10. Replace restrictive joins with a remote view

Many analytic queries use joins to restrict the rows that the query returns. For instance, you might apply a predicate such as calendar_quarter='2019Q4' to your date_dim table and join to your large fact table. The filter on date_dim reduces the rows returned from the fact table by an order of magnitude. However, as of this writing, Amazon Redshift can’t push such join restrictions down to the federated relation.

Consider the following example query with a join between two federated tables:

SELECT ss_store_sk
,COUNT(ss_quantity) AS count_quantity
,AVG(ss_quantity) AS avg_quantity
FROM apg_tpcds.store_sales
JOIN apg_tpcds.date_dim
  ON d_date_sk = ss_sold_date_sk
WHERE d_quarter_name = '1998Q1'
GROUP BY ss_store_sk
ORDER BY ss_store_sk
LIMIT 100;

When you EXPLAIN this query in Amazon Redshift, you see the following plan:

<< REDSHIFT >> QUERY PLAN [Original]
----------------------------------------------------------------------------------------------------------------------------------
<< snip >>
  ->  XN PG Query Scan store_sales  (cost=0.00..576019.84 rows=28800992 width=12)
        ->  Remote PG Seq Scan store_sales  (cost=0.00..288009.92 rows=28800992 width=12)
  ->  XN Hash  (cost=1643.60..1643.60 rows=73049 width=4)
        ->  XN PG Query Scan date_dim  (cost=0.00..1643.60 rows=73049 width=4)
              ->  Remote PG Seq Scan date_dim  (cost=0.00..913.11 rows=73049 width=4)
                    Filter: (d_quarter_name = '1998Q1'::bpchar)

The query plan shows that date_dim is filtered, but store_sales doesn’t have a filter. This means Amazon Redshift retrieves all rows from store_sales and only then uses the join to filter the rows. Because store_sales is a very big table, this probably takes too long, especially if you want to run this query regularly.

As a solution, you can create the following view in PostgreSQL that encapsulates this join:

CREATE VIEW vw_store_sales_date
AS SELECT ss.*, dd.d_quarter_name ss_quarter_name 
FROM store_sales   ss 
JOIN date_dim      dd 
    ON ss.ss_sold_date_sk = dd.d_date_sk;

Rewrite the Amazon Redshift query to use the view as follows:

SELECT ss_store_sk
      ,COUNT(ss_quantity) AS count_quantity
      ,AVG(ss_quantity) AS avg_quantity
FROM apg_tpcds.vw_store_sales_date
WHERE ss_quarter_name = '1998Q1'
GROUP BY ss_store_sk
ORDER BY ss_store_sk
LIMIT 100;

When you EXPLAIN this rewritten query in Amazon Redshift, you see the following plan:

<< REDSHIFT >> QUERY PLAN [Remote View]
----------------------------------------------------------------------------------------------------------------------------------
<< snip >>
  ->  XN HashAggregate  (cost=30.00..31.00 rows=200 width=8)
        ->  XN PG Query Scan vw_store_sales_date  (cost=0.00..22.50 rows=1000 width=8)
              ->  Remote PG Seq Scan vw_store_sales_date  (cost=0.00..12.50 rows=1000 width=8)
                    Filter: (ss_quarter_name = '1998Q1'::bpchar)

Amazon Redshift now pushes the filter down to your view. The join restriction is applied in PostgreSQL and many fewer rows are returned to Amazon Redshift. You may notice that Remote PG Seq Scan now shows rows=1000; this is a default value that the query optimizer uses when PostgreSQL can’t provide table statistics.

Summary

This post reviewed 10 best practices to help you maximize the performance of Amazon Redshift federated queries. Every use case is unique, so carefully evaluate how you can apply these recommendations to your specific situation.

AWS will continue to enhance and improve Amazon Redshift Federated Query, and welcomes your feedback. If you have any questions or suggestions, leave your feedback in the comments. If you need further assistance in optimizing your Amazon Redshift cluster, contact your AWS account team.

Special thanks go to AWS colleagues Sriram Krishnamurthy, Entong Shen, Niranjan Kamat, Vuk Ercegovac, and Ippokratis Pandis for their help and support with this post.

 


About the Author

Joe Harris is a senior Redshift database engineer at AWS, focusing on Redshift performance. He has been analyzing data and building data warehouses on a wide variety of platforms for two decades. Before joining AWS he was a Redshift customer from launch day in 2013 and was the top contributor to the Redshift forum.

 

Monitor and control the storage space of a schema with quotas with Amazon Redshift

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/monitor-and-control-the-storage-space-of-a-schema-with-quotas-with-amazon-redshift/

Yelp connects people with great local businesses. Since its launch in 2004, Yelp has grown from offering services for just one city—its headquarters home of San Francisco—to a multinational presence spanning major metros across more than 30 countries. The company’s performance-based advertising and transactional business model led to revenues of more than $500 million during 2015, a 46% increase over the prior year. Yelp has evolved into a mobile-centric company, with more than 70% of searches, and more than 58% of content originating from mobile devices.

Yelp uses Amazon Redshift to analyze mobile app usage data and ad data on customer cohorts, auctions, and ad metrics. Yelp has immediately benefited by the new Amazon Redshift schema storage quota feature.

“Amazon Redshift is a managed data warehouse service that allows Yelp to focus on data analytics without spending time on database administration,” says Steven Moy, Lead Engineer for Yelp’s Metrics Platform. The Metrics Platform provides long-term persistent data storage and SQL-on-anything query capabilities for Yelp’s Engineering teams. “A key strategy for our data warehouse users to iterate quickly is to have a writable schema called ‘tmp’ for users to prototype various table schema. However, we occasionally faced challenges when there was not enough free space during a query execution, degrading the entire data warehouse query operation. With the new schema quota feature, we can provision a storage quota ceiling on the ‘tmp’ schema to safeguard runaway storage issues. We look forward to all the autonomous features coming from Amazon Redshift.”

Many organizations are moving toward self-service analytics, where different personas create their own insights from the ever-growing volume, variety, and velocity of data to keep up with the acceleration of business. This data democratization creates the need to enforce data governance, control cost, and prevent data mismanagement. Controlling the storage quota of different personas is a significant challenge for data governance and data storage operations. This post shows you how to set up Amazon Redshift storage quotas for different personas.

Prerequisites

Before starting this walkthrough, you must have the following:

  • An Amazon Redshift cluster. The US East (N. Virginia) Region is preferred because you need to load data from Amazon Simple Storage Service (Amazon S3) in us-east-1.
  • A database user with superuser permission.

Setting up and testing the schema quota

To set up the environment and implement the use case, complete the following steps:

  1. Connect to your Amazon Redshift cluster using your preferred SQL client as a superuser or user with CREATE SCHEMA privileges.
  2. Create the user sales with the following code:
CREATE USER sales WITH password 'Abcd1234!';

The user in the preceding code is just for the purposes of this walkthrough. You should use identity federation in AWS Identity and Access Management (IAM) to use a corporate user rather than creating one manually. For more information, see Authentication.

  3. Set up a schema quota with the CREATE SCHEMA statement. See the following code:
CREATE SCHEMA [ IF NOT EXISTS ] schema_name [ AUTHORIZATION username ]
           [ QUOTA {quota [MB | GB | TB] | UNLIMITED} ] [ schema_element [ ... ] ]

CREATE SCHEMA AUTHORIZATION username [ QUOTA {quota [MB | GB | TB] | UNLIMITED} ] [ schema_element [ ... ] ]

The QUOTA parameter defines the maximum amount of disk space that the specified schema can use.

  4. Create the schema sales_schema with a quota of 2 GB for the user sales. See the following code:
CREATE SCHEMA sales_schema AUTHORIZATION sales QUOTA 2 GB; 
  5. Impersonate the user sales with the following code:
SET SESSION AUTHORIZATION 'sales';
SELECT CURRENT_USER;
  6. Create the tables region and lineitem with the following code:
CREATE TABLE sales_schema.region (
  r_regionkey int4,
  r_name char(25),
  r_comment varchar(152)                            
) DISTSTYLE EVEN;
 
CREATE TABLE sales_schema.lineitem (
  l_orderkey int8,
  l_partkey int8,
  l_suppkey int4,
  l_linenumber int4,
  l_quantity numeric(12,2),
  l_extendedprice numeric(12,2),
  l_discount numeric(12,2),
  l_tax numeric(12,2),
  l_returnflag char(1),
  l_linestatus char(1),
  l_shipdate date,
  l_commitdate date,
  l_receiptdate date,
  l_shipinstruct char(25),
  l_shipmode char(10),
  l_comment varchar(44)
) DISTSTYLE EVEN;

Single statement transaction

To test the single-statement transaction use case, complete the following steps:

  1. Load data into the table region with the following code:
COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';

 <Your-IAM-Role> is the ARN of your IAM role with the necessary permissions associated with your Amazon Redshift cluster.

 The COPY command should complete with five rows loaded. See the following code:

dev=> COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO:  Load into table 'region' completed, 5 record(s) loaded successfully.
COPY
  2. Check the quota and disk usage for the schema from the system view svv_schema_quota_state. See the following code:
SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';

The statement should show a 2,048 MB quota with 30 MB disk usage on your four-node DC2.large cluster. Disk usage varies with different instance types and number of nodes. See the following code:

dev=> SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';
 schema_name  | quota | disk_usage | disk_usage_pct 
--------------+-------+------------+----------------
 sales_schema |  2048 |         30 |           1.46
(1 row)
  3. Load data into the table lineitem with the following code:
COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';

You should get an error that the transaction is aborted due to exceeding the disk space quota. See the following code:

dev=> COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO:  Load into table 'lineitem' completed, 59986052 record(s) loaded successfully.
ERROR:  Transaction 40895 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 2048, Current Disk Usage: 2798).
Free up disk space or request increased quota for the schema(s).

Amazon Redshift checks each transaction for quota violations before committing the transaction. The size limit can exceed the quota temporarily within a transaction before it’s committed because the quota violation check occurs at the end of a transaction. That’s why you initially see the message that the load into the table lineitem completed successfully. When a transaction exceeds the quota, Amazon Redshift aborts the transaction, prohibits subsequent ingestions, and reverts all changes until you free up disk space.

The quota violation occurrence is stored in the system table stl_schema_quota_violations.

  4. Run the following code to see the violation:
SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;

It should show the quota, disk usage, disk usage percentage out of the configured quota, and the timestamp of when the violation happened. See the following code:

dev=> SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;
 userid | schema_name  | quota | disk_usage | disk_usage_pct |         timestamp          
--------+--------------+-------+------------+----------------+----------------------------
    104 | sales_schema |  2048 |       2798 |         136.62 | 2020-04-20 20:09:25.494723
(1 row)
  5. Change the schema quota to 4 GB using the ALTER SCHEMA statement. See the following code:
RESET SESSION AUTHORIZATION; 
ALTER SCHEMA sales_schema QUOTA 4 GB;
SET SESSION AUTHORIZATION 'sales';
SELECT CURRENT_USER;

You have to be a database superuser or user with CREATE SCHEMA permissions to set a schema quota during schema creation. However, only a superuser can change a schema quota. You use RESET SESSION AUTHORIZATION to reset the user impersonation back to the original superuser.

  6. Check the quota and disk usage for the schema with the following code:
SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';

The statement should return a 4,096 MB quota with 30 MB disk usage on your four-node DC2.large cluster. Because the lineitem table was pristine (it had never committed any data), the disk space from the earlier failed COPY statement was automatically reclaimed when the transaction was aborted. See the following code:

dev=> SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';
 schema_name  | quota | disk_usage | disk_usage_pct 
--------------+-------+------------+----------------
 sales_schema |  4096 |         30 |           0.73
(1 row)
  7. Rerun the COPY command to load data into the table lineitem with the following code:
COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';

It should return the following message:

Load into table 'lineitem' completed, 59986052 record(s) loaded successfully
  8. Check the quota and disk usage for the schema with the following code:
SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';

The statement should return a 4,096 MB quota and 2798 MB disk usage on a four-node DC2.large cluster. Disk usage varies with different instance types and number of nodes. See the following code:

dev=> SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';
 schema_name  | quota | disk_usage | disk_usage_pct 
--------------+-------+------------+----------------
 sales_schema |  4096 |       2798 |          68.31
(1 row)
  9. Rerun the COPY command to load a large amount of data into the table lineitem to exceed the quota. See the following code:
COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';

It should return an error that the transaction is aborted due to exceeding the disk space quota. See the following code:

dev=# COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO:  Load into table 'lineitem' completed, 59986052 record(s) loaded successfully.
ERROR:  Transaction 86438 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 4096, Current Disk Usage: 5486).
Free up disk space or request increased quota for the schema(s).
  10. Execute the following code to run subsequent small COPY/INSERT/UPDATE statements after the quota violation:
COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INSERT INTO sales_schema.region VALUES(100, 'Test','This is a test');
UPDATE sales_schema.region SET r_name = 'Fail' WHERE r_regionkey = 100;

All statements should return an error that the transaction is aborted due to exceeding the disk space quota. See the following code:

dev=# COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO:  Load into table 'region' completed, 5 record(s) loaded successfully.
ERROR:  Transaction 86478 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 4096, Current Disk Usage: 5486).
Free up disk space or request increased quota for the schema(s).
dev=# INSERT INTO sales_schema.region VALUES(100, 'Test','This is a test');
ERROR:  Transaction 86479 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 4096, Current Disk Usage: 5486).
Free up disk space or request increased quota for the schema(s).
dev=# UPDATE sales_schema.region SET r_name = 'Fail' WHERE r_regionkey = 100;
ERROR:  Transaction 86483 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 4096, Current Disk Usage: 5486).
Free up disk space or request increased quota for the schema(s).

For non-pristine tables, although the offending transaction that exceeded the quota was aborted (for example, COPY to lineitem), the transaction still consumes space and the schema is blocked until disk space is reclaimed. That is why the subsequent COPY/INSERT/UPDATE to the region table is aborted as well.

  11. Double-check the quota violation from the system table with the following code:
SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;

It should return disk usage of 5,486 MB with a four-node DC2.large cluster. See the following code:

dev=> SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;
 userid | schema_name  | quota | disk_usage | disk_usage_pct |         timestamp          
--------+--------------+-------+------------+----------------+----------------------------
    104 | sales_schema |  4096 |       5486 |         133.94 | 2020-04-20 21:27:29.653489
    104 | sales_schema |  4096 |       5486 |         133.94 | 2020-04-20 21:27:29.469287
    104 | sales_schema |  4096 |       5486 |         133.94 | 2020-04-20 21:27:29.197434
    104 | sales_schema |  4096 |       5486 |         133.94 | 2020-04-20 20:28:01.344333
    104 | sales_schema |  2048 |       2798 |         136.62 | 2020-04-20 20:09:25.494723
(5 rows)

Single statement shrink-only transaction

A shrink-only transaction is a transaction composed solely of statements that have the potential to directly or indirectly reduce the amount of disk space used, such as DELETE, TRUNCATE, DROP TABLE, and VACUUM.

To test the single-statement shrink-only transaction use case, complete the following steps:

  1. Delete all data from the lineitem table with the following code:
DELETE sales_schema.lineitem;
SELECT COUNT(*) FROM sales_schema.lineitem;

It should return a record count of zero:

dev=> DELETE sales_schema.lineitem;
DELETE 59986052
dev=> SELECT COUNT(*) FROM sales_schema.lineitem;
 count 
-------
     0
(1 row)

Although the schema quota was exceeded, the DELETE transaction is committed because the shrink-only transaction is an exception to the quota violation check.

The TRUNCATE command is much more efficient than DELETE to delete all data and doesn’t require a VACUUM and ANALYZE command. The preceding code is for illustration purposes only.

  2. Rerun the COPY command to load data into the table lineitem with the following code:
COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';

It should return the following error message:

Transaction is aborted due to exceeding the disk space quota

See the following code:

dev=> COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO:  Load into table 'lineitem' completed, 59986052 record(s) loaded successfully.
ERROR:  Transaction 87024 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 4096, Current Disk Usage: 8150).
Free up disk space or request increased quota for the schema(s).

Because DELETE doesn’t automatically free up disk space, the preceding COPY command exceeds the quota and the transaction is aborted.

  3. Double-check the quota violation from the system table with the following code:
SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;

The latest row should return a disk usage of 8,150 MB with a four-node DC2.large cluster. This shows that disk usage increases with the aborted transaction. See the following code:

dev=> SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;
 userid | schema_name  | quota | disk_usage | disk_usage_pct |         timestamp          
--------+--------------+-------+------------+----------------+----------------------------
    104 | sales_schema |  4096 |       8150 |         198.97 | 2020-04-20 21:30:54.354669
    104 | sales_schema |  4096 |       5486 |         133.94 | 2020-04-20 21:27:29.653489
    104 | sales_schema |  4096 |       5486 |         133.94 | 2020-04-20 21:27:29.469287
    104 | sales_schema |  4096 |       5486 |         133.94 | 2020-04-20 21:27:29.197434
    104 | sales_schema |  4096 |       5486 |         133.94 | 2020-04-20 20:28:01.344333
    104 | sales_schema |  2048 |       2798 |         136.62 | 2020-04-20 20:09:25.494723
(6 rows)
  4. Run VACUUM to free up disk space:
VACUUM sales_schema.lineitem;

The schema quota is exceeded, but because Amazon Redshift disregards the quota violation and commits transactions in certain cases like VACUUM, this completes without error.

Amazon Redshift now automatically runs the VACUUM DELETE operation in the background to reclaim disk space.
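
If you want to check whether a vacuum is still in progress before retrying the load, the following sketch assumes the svv_vacuum_progress system view:

-- Sketch: status of the most recent vacuum operation
SELECT table_name, status, time_remaining_estimate
FROM svv_vacuum_progress;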

  5. Rerun the COPY command to load data into the table lineitem with the following code:
COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';

It should return the following message:

Load into table 'lineitem' completed, 59986052 record(s) loaded successfully

Because VACUUM reclaimed the disk space, the COPY command ran successfully.

CREATE TABLE AS (CTAS), ALTER TABLE APPEND, and transactions on a temporary schema

To test the CREATE TABLE AS (CTAS) statement, the ALTER TABLE APPEND statement, and transactions on a temporary schema, complete the following steps:

  1. Execute the CTAS statement to create another table using a query. See the following code:
CREATE TABLE sales_schema.lineitem_ctas AS SELECT * FROM sales_schema.lineitem;

It should return an error that the transaction is aborted due to exceeding the disk space quota. CTAS statements are subject to the same quota checks in the schema. See the following code:

dev=> CREATE TABLE sales_schema.lineitem_ctas AS SELECT * FROM sales_schema.lineitem;
ERROR:  Transaction 112229 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 4096, Current Disk Usage: 5624).
Free up disk space or request increased quota for the schema(s).
  2. Create a temp table and populate data with the following code:
CREATE TEMP TABLE lineitem_temp AS SELECT * FROM sales_schema.lineitem;
SELECT COUNT(*) from lineitem_temp; 

It should return 59,986,052 rows. The schema quota doesn’t consider temporary tables created as part of a temporary namespace or schema.

  3. Create a copy of the lineitem table in the public schema and load data into it using the following code:
CREATE TABLE public.lineitem_ctas AS SELECT * FROM sales_schema.lineitem;
SELECT COUNT(*) FROM lineitem_ctas;

It should return 59,986,052 rows.

Ingestion into a different schema has no effect on the sales_schema quota, and the transaction is committed.

  4. Append data into the table lineitem in sales_schema from another table using the ALTER TABLE APPEND statement. See the following code:
ALTER TABLE sales_schema.lineitem APPEND FROM lineitem_ctas;

It should return an error that the transaction is aborted due to exceeding the disk space quota. The quota violation check applies to the whole schema.

  5. Append data from the sales_schema lineitem table to another table in the public schema with the following code:
ALTER TABLE lineitem_ctas APPEND FROM sales_schema.lineitem;
SELECT COUNT(*) FROM sales_schema.lineitem;
SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';

It should return a record count of zero from the SELECT COUNT statement and show that the disk usage has dropped. The ALTER TABLE APPEND statement empties the source table immediately and reclaims disk space when the transaction is complete. See the following code:

dev=> ALTER TABLE lineitem_ctas APPEND FROM sales_schema.lineitem;
INFO:  ALTER TABLE APPEND "lineitem_ctas" from "lineitem" is complete.
ALTER TABLE APPEND and COMMIT TRANSACTION
dev=> SELECT COUNT(*) FROM sales_schema.lineitem;
 count 
-------
     0
(1 row)
 
dev=> SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';
 schema_name  | quota | disk_usage | disk_usage_pct 
--------------+-------+------------+----------------
 sales_schema |  4096 |         30 |           0.73
(1 row)

Concurrent transactions

To test the concurrent transactions use case, complete the following steps:

  1. Increase the quota to 6 GB using the following code:
RESET SESSION AUTHORIZATION;
ALTER SCHEMA sales_schema QUOTA 6 GB;
SET SESSION AUTHORIZATION 'sales';

For the following test, you need two separate SQL client sessions connected to the Amazon Redshift cluster so that you can run the code concurrently.

  2. On the first session, enter the following code:
CREATE TABLE sales_schema.lineitem_txn1 AS SELECT * FROM lineitem_ctas;
  3. While the first session is still running, on the second session, enter the following code:
CREATE TABLE sales_schema.lineitem_txn2 AS SELECT * FROM lineitem_ctas;

The statement from the first session should return a warning message:

Schema(s) close to reaching their allocated disk space quotas

This warning indicates that the schema is close to its quota, but the statement still completes. See the following code:

dev=> CREATE TABLE sales_schema.lineitem_txn1 AS SELECT * FROM lineitem_ctas;
WARNING:  Schema(s) close to reaching their allocated disk space quotas: (Schema: sales_schema, Quota: 6144, Current Disk Usage: 5606).
Free up disk space or request increased quota for the schema(s).
SELECT

The statement from the second session should return an error that the transaction is aborted due to exceeding the disk space quota. Because the first transaction from the first session was committed first, the disk space usage increased to close to the quota. The second transaction checks the quota violation before committing the transaction and thus aborts the transaction.

Reset quota below existing disk space usage

To test the use case of resetting the quota below the existing disk space usage, complete the following steps:

  1. Run the following code to reset the quota to 2 GB:
RESET SESSION AUTHORIZATION;
ALTER SCHEMA sales_schema QUOTA 2 GB;
SET SESSION AUTHORIZATION 'sales';
CREATE TABLE sales_schema.test AS SELECT 1;

You can set the quota below the existing disk space usage; however, all future ingestion transactions are aborted until you free disk space.  See the following code:

dev=> RESET SESSION AUTHORIZATION;
RESET
dev=# ALTER SCHEMA sales_schema QUOTA 2 GB;
ALTER SCHEMA
dev=# SET SESSION AUTHORIZATION 'sales';
SET
dev=> CREATE TABLE sales_schema.test AS SELECT 1;
ERROR:  Transaction 112557 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 2048, Current Disk Usage: 5610).
Free up disk space or request increased quota for the schema(s).

Multi-statement transactions

To test the multi-statement transaction use case, complete the following steps:

  1. Enter the following code to test a multi-statement transaction:
BEGIN;
	DELETE FROM sales_schema.region WHERE r_regionkey = 0;
	INSERT INTO sales_schema.region VALUES(100, 'Test','This is a test');
COMMIT;
BEGIN;
	DELETE FROM sales_schema.region WHERE r_regionkey = 0;
	UPDATE sales_schema.region SET r_name = 'Fail' WHERE r_regionkey = 1;
COMMIT;
BEGIN;
	DELETE FROM sales_schema.region WHERE r_regionkey = 0;
	COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
COMMIT;

All INSERT/UPDATE/COPY transactions should be aborted because the quota is exceeded. See the following code:

dev=> BEGIN;
BEGIN
dev=> DELETE FROM sales_schema.region WHERE r_regionkey = 0;
DELETE 0
dev=> INSERT INTO sales_schema.region VALUES(100, 'Test','This is a test');
INSERT 0 1
dev=> COMMIT;
ERROR:  Transaction 114291 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 2048, Current Disk Usage: 5606).
Free up disk space or request increased quota for the schema(s).
dev=> BEGIN;
BEGIN
dev=> DELETE FROM sales_schema.region WHERE r_regionkey = 0;
DELETE 0
dev=> UPDATE sales_schema.region SET r_name = 'Fail' WHERE r_regionkey = 1;
UPDATE 1
dev=> COMMIT;
ERROR:  Transaction 114295 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 2048, Current Disk Usage: 5606).
Free up disk space or request increased quota for the schema(s).
dev=> BEGIN;
BEGIN
dev=> DELETE FROM sales_schema.region WHERE r_regionkey = 0;
DELETE 0
dev=> COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO:  Load into table 'region' completed, 5 record(s) loaded successfully.
COPY
dev=> COMMIT;
ERROR:  Transaction 114303 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 2048, Current Disk Usage: 5624).
Free up disk space or request increased quota for the schema(s).
  2. Enter the following code to test a multi-statement transaction with DROP TABLE in the same transaction:
BEGIN;
	INSERT INTO sales_schema.region VALUES(100, 'Test','This is a test');
	UPDATE sales_schema.region SET r_name = 'Fail' WHERE r_regionkey = 100;
	COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
	DROP TABLE sales_schema.region;
COMMIT;

The transaction should be committed successfully. Because the table that the COPY/INSERT/UPDATE statements target is dropped within the same transaction, those changes no longer count against the quota when the transaction is committed. See the following code:

dev=> BEGIN;
BEGIN
dev=> INSERT INTO sales_schema.region VALUES(100, 'Test','This is a test');
INSERT 0 1
dev=> UPDATE sales_schema.region SET r_name = 'Fail' WHERE r_regionkey = 100;
UPDATE 3
dev=> COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO:  Load into table 'region' completed, 5 record(s) loaded successfully.
COPY
dev=> DROP TABLE sales_schema.region;
DROP TABLE
dev=> COMMIT;
COMMIT
  3. Run shrink-only statements in a multi-statement transaction with the following code:
BEGIN;
	DELETE FROM sales_schema.lineitem WHERE l_orderkey = 53924162;
	DROP TABLE sales_schema.lineitem;
COMMIT;

The transaction should be committed successfully. A transaction containing shrink-only statements should succeed. See the following code:

dev=> BEGIN;
BEGIN
dev=> DELETE FROM sales_schema.lineitem WHERE l_orderkey = 53924162;
DELETE 0
dev=> DROP TABLE sales_schema.lineitem;
DROP TABLE
dev=> COMMIT;
COMMIT

Best practices

The following recommended practices can help you optimize your workload when using storage quotas:

  • The schema feature works best when users are aligned with schemas – Use schemas to logically segregate the users; for example, the MAIN schema holds the base and aggregated tables owned by data engineers, and the ANALYST schema, controlled with a storage quota, is owned by data analysts who perform their own aggregation by querying the base data from the MAIN schema into the ANALYST schema. The data analysts only have read permissions on the MAIN schema but read and write permissions on the ANALYST schema.
  • Revoke usage on the public schema – All users have CREATE and USAGE privileges on the PUBLIC schema of a database by default. Because no storage quota applies to the PUBLIC schema by default, all users should be disallowed from creating objects in the PUBLIC schema of a database. They should instead use their own aligned schemas as per the previous best practice. Use the REVOKE command to remove that privilege (see the sketch after this list).
  • Use physical tables instead of temporary tables – Because temporary tables aren’t subject to storage quotas, use physical tables instead, but remember to drop them at the end of the transaction. An alternative is to create a separate, quota-controlled schema dedicated to scratch tables: instead of using temporary tables, create physical tables in that dedicated schema. This eases administration, and you can automate cleaning up tables in the dedicated schema on a regular basis.
  • Use data shrinking operations – Use data shrinking operations like DELETE, VACUUM DELETE, DROP, and TRUNCATE to reclaim disk space, especially if it’s nearing the quota.
  • Use workload management (WLM) with query priority to control resources for different users – Data ingestion and aggregation performed by different users might be resource-intensive and impact the throughput of other mission-critical loads or queries running on the cluster. Amazon Redshift Advisor automatically analyzes the current WLM usage and can make recommendations to get more throughput from your cluster. Periodically reviewing the suggestions from Advisor helps you get the best performance.
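
The following sketch illustrates the second and third recommendations above; the group analysts and the schema analyst_scratch are hypothetical names.

-- Sketch: stop object creation in PUBLIC and give analysts a quota-controlled scratch schema
REVOKE CREATE ON SCHEMA public FROM PUBLIC;
CREATE SCHEMA analyst_scratch QUOTA 10 GB;
GRANT CREATE, USAGE ON SCHEMA analyst_scratch TO GROUP analysts;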

Monitoring

You should periodically monitor for any storage quota violations so that you can take corrective action and provide business continuity. The system table stl_schema_quota_violations is a good resource to get quota violation events. To query the violation information, enter the following code:

SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;

You can automate the execution of the preceding query and send notifications to users if any violations occur using the following solutions:

Cleaning up

When you’re done with testing, you can remove the objects and users with the following code:

RESET SESSION AUTHORIZATION;
DROP TABLE IF EXISTS sales_schema.lineitem_txn1; 
DROP TABLE IF EXISTS lineitem_temp;
DROP TABLE IF EXISTS lineitem_ctas;
DROP SCHEMA sales_schema;
DROP USER IF EXISTS sales;

Summary

Amazon Redshift supports stringent compliance and security requirements at no extra cost, which makes it ideal for highly regulated industries. Managing data storage with quotas is part of an important data governance framework that provides compliance and operational efficiency. A simple QUOTA parameter in a CREATE or ALTER SCHEMA statement enables disk usage quotas on an Amazon Redshift schema. You can assign Amazon Redshift database users, who represent personas in the organization, to different schemas with quotas to control the amount of disk space they can consume. This is crucial for data governance and cost control, and it avoids operational issues like disk full errors when users keep creating objects and ingesting data, which might otherwise halt other mission-critical analytic operations.
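
For reference, the syntax looks like the following minimal sketch; the schema name, owner, and quota values are placeholders you would replace with your own:

CREATE SCHEMA sales_schema AUTHORIZATION sales QUOTA 2 GB;
ALTER SCHEMA sales_schema QUOTA 4 GB;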

This post showed you how easy it is to control the storage quota for Amazon Redshift using schemas. You went through the steps to create and alter the quota on a schema and authorize specific users to the schema. You also explored different use cases of ingesting data into the schema and monitoring the different outcomes and disk usage using the system view. You can adopt this feature to support your business needs.

About the Authors

BP Yau is a Data Warehouse Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate the Oracle Data Warehouse to Amazon Redshift and built the next generation big data analytics platform using AWS technologies.

Ahmad Barghout is a Software Development Engineer on the Amazon Redshift storage team. He has worked on cluster resize optimizations and currently works on ease-of-use features such as schema quota. Before that, he was a full stack developer at Cisco. He holds a master’s degree from The University of Ottawa.

Himanshu Raja is a Principal Product Manager for Amazon Redshift. Himanshu loves solving hard problems with data and cherishes moments when data goes against intuition. In his spare time, Himanshu enjoys cooking Indian food and watching action movies.

Migrating your Netezza data warehouse to Amazon Redshift

Post Syndicated from John Hwang original https://aws.amazon.com/blogs/big-data/migrating-your-netezza-data-warehouse-to-amazon-redshift/

With IBM announcing Netezza reaching end-of-life, you’re faced with the prospect of having to migrate your data and workloads off your analytics appliance. For some, this presents an opportunity to transition to the cloud.

Enter Amazon Redshift.

Amazon Redshift is a cloud-native data warehouse platform built to handle workloads at scale, and it shares key similarities with Netezza that make it an excellent candidate to replace your on-premises appliance. You can migrate your data and applications to Amazon Redshift in less time and with fewer changes than migrating to other analytics platforms. For developers, this means less time spent retraining on the new database. For stakeholders, it means a lower cost of, and time to, migration. For more information, see How to migrate a large data warehouse from IBM Netezza to Amazon Redshift with no downtime.

This post discusses important similarities and differences between Netezza and Amazon Redshift, and how they could impact your migration timeline.

Similarities

Three significant similarities between Netezza and Amazon Redshift are their compatibility with Postgres, their massively parallel processing (MPP) architecture, and the Amazon Redshift feature Advanced Query Accelerator (AQUA) compared to Netezza’s use of FPGAs.

Postgres compatibility

Both Netezza and Amazon Redshift share some compatibility with Postgres, an open-source database. This means that Netezza SQL and Amazon Redshift SQL have a similar syntax. In particular, both support many features of PL/pgSQL, Postgres’s procedural language. Your Netezza stored procedures can translate to Amazon Redshift with little-to-no rewriting of code.

You can also use the AWS Schema Conversion Tool, which can automatically migrate a large percentage of Netezza stored procedures to Amazon Redshift syntax with zero user effort. And because both databases are built for analytics rather than transactional workloads, the two databases share similar characteristics. For example, neither Netezza nor Amazon Redshift enforces primary keys, which improves performance, though you can still define primary keys on your tables to help the optimizer create better query plans.
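
For example, the following sketch (with hypothetical table and column names) declares a primary key in Amazon Redshift; the constraint is informational only and isn't enforced, but it can help the optimizer create better query plans:

CREATE TABLE orders
(
  order_id     BIGINT PRIMARY KEY, -- informational only; not enforced
  customer_id  BIGINT,
  order_total  NUMERIC(12,2)
);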

MPP architecture

Both Netezza and Amazon Redshift are MPP databases. This means that a query is sent to a leader node, which then compiles a set of commands that it sends to multiple compute nodes. Each compute node performs its task in parallel and returns the results to the leader node, where the results are aggregated and returned to the user. This means that you can apply similar architectural strategies from Netezza, such as zone maps and distribution styles, to Amazon Redshift. For more information, see Amazon Redshift Engineering’s Advanced Table Design Playbook: Preamble, Prerequisites, and Prioritization.
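
As a minimal sketch (hypothetical table and column names), the following DDL shows how distribution and sort keys are declared in Amazon Redshift; they play a role similar to the distribution and zone-map strategies you may already use on Netezza:

CREATE TABLE sales_fact
(
  sale_id    BIGINT,
  store_id   INT,
  sale_date  DATE,
  amount     NUMERIC(12,2)
)
DISTSTYLE KEY
DISTKEY (store_id)
SORTKEY (sale_date);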

Amazon Redshift AQUA and Netezza’s FPGA

AWS recently announced the new Amazon Redshift feature AQUA, which is in preview as of this writing. AQUA uses AWS-designed processors to bring certain compute tasks closer to the storage layer. Compression, encryption, filtering, aggregation, and other tasks can now happen in this intermediary layer between storage and compute, which leaves the CPU free to handle more complex tasks. When it's released, AQUA will accelerate queries by up to 10 times.

Netezza uses FPGAs to perform simple compute tasks before data reaches the CPU. Applications designed to employ these FPGA features on Netezza (for example, queries that rely heavily on certain aggregate functions and data filtering) translate well to Amazon Redshift clusters using AQUA.

Differences

For all the similarities that Amazon Redshift and Netezza share, they also have differences. There are three important differences that could have significant impact on your data and application architecture when migrating from Netezza to Amazon Redshift: column store vs. row store, concurrency scaling, and data lake integration.

Column store vs. row store

Netezza stores each row of data onto disk in data blocks, whereas Amazon Redshift stores each column of data. For many analytics workloads, a column store can have dramatic performance benefits over a row store. Typically, an analytics database has tables with many columns, but only a small handful of those columns are used in any one query. For example, assume that you have a table in a row store database with 100 columns and 100 million rows. If you want to sum the entire value of one column in this table, your query has to suffer the I/O penalty of scanning the entire table to retrieve the data in this single column. In a columnar database with the same table, the query only faces I/O for the single column. In addition to enjoying the improved performance for this type of workload in Amazon Redshift, this gives you options for designing wide tables without having to weigh the increase in I/O for typical analytics workloads the way you do in Netezza.
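
As a simple illustration (the table and column names are hypothetical), the following query touches only one of the table's many columns, so a columnar engine reads only that column's blocks from disk:

SELECT SUM(order_total) FROM sales_wide_table;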

Concurrency scaling

Although both Netezza and Amazon Redshift offer queue priority and short query acceleration to help reduce concurrency issues, Amazon Redshift also uses the benefits of the cloud to offer additional options to handle concurrency.

One option is to take unrelated workloads from a single Netezza appliance and migrate them to separate Amazon Redshift clusters, each with an instance type and number of nodes sized appropriately for the workload it has to support. Netezza’s more traditional license and support pricing model, combined with its limited options for appliance sizes, can make this type of architecture untenable for your organization’s budget.

Additionally, Amazon Redshift offers Concurrency Scaling to scale out (and scale back in) additional compute capacity automatically and on the fly. If you want to add or remove nodes in the cluster, or experiment with other node types, you can do so in just a few minutes using elastic resize (to change the number of nodes) and classic resize (to change instance types).

This type of scaling and resizing isn’t feasible on Netezza because the on-premises appliance has a fixed number of blades. Concurrency scaling in Amazon Redshift can support virtually unlimited concurrent users and concurrent queries, and its ability to automatically add and remove additional capacity means you only pay for the time the concurrency scaling clusters are in use.

Data lake integration

Netezza offers the ability to create an external table from a data file either on the Netezza host or a remote host. However, querying those tables means migrating the entire dataset internally before running the query. With Amazon Redshift Spectrum, rather than using external tables as a convenient way to migrate entire datasets to and from the database, you can run analytical queries against data in your data lake the same way you do against an internal table. As the volume of your data in the lake grows, partitioning can help keep the amount of data scanned for each query (and therefore performance) consistent. You can even join the data from your external tables in Amazon Simple Storage Service (Amazon S3) to your internal tables in Amazon Redshift.
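
The following is a minimal sketch of that pattern; the external schema, AWS Glue Data Catalog database, IAM role ARN, and table names are placeholders:

CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'my_glue_database'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';

SELECT c.customer_id, SUM(e.sales_amount) AS total_sales
FROM spectrum_schema.daily_sales e   -- external table over Amazon S3 data
JOIN customer_dim c                  -- internal Amazon Redshift table
  ON e.customer_id = c.customer_id
GROUP BY c.customer_id;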

Not only does this mean true integration of your data warehouse and data lake, which can lower your warehouse’s more expensive storage requirements, but because the Amazon Redshift Spectrum layer uses its own dedicated infrastructure outside of your cluster, you can offload many compute-intensive tasks from your cluster and push them down to the Redshift Spectrum layer. For more information, see Amazon Redshift Spectrum overview.

Conclusion

Amazon Redshift can accelerate migration from Netezza and potentially lower the time and cost of moving your analytics workloads into the cloud. The similarity between the two systems eases the pain of migration, but Amazon Redshift has important differences that can offer additional benefits around performance and cost. The expertise of AWS and its partner network can help with your migration strategy and offer guidance to avoid potential roadblocks and pitfalls.

About the Author

John Hwang is a senior solutions architect at AWS.

Brian McDermott is a senior sales manager at AWS.

Build an AWS Well-Architected environment with the Analytics Lens

Post Syndicated from Nikki Rouda original https://aws.amazon.com/blogs/big-data/build-an-aws-well-architected-environment-with-the-analytics-lens/

Building a modern data platform on AWS enables you to collect data of all types, store it in a central, secure repository, and analyze it with purpose-built tools. Yet you may be unsure of how to get started and of the impact of certain design decisions. To address the need to provide advice tailored to specific technology and application domains, AWS added the concept of well-architected lenses in 2017. AWS is now happy to announce the Analytics Lens for the AWS Well-Architected Framework. This post provides an introduction to its purpose, topics covered, common scenarios, and services included.

The new Analytics Lens offers comprehensive guidance to make sure that your analytics applications are designed in accordance with AWS best practices. The goal is to give you a consistent way to design and evaluate cloud architectures, based on the following five pillars:

  • Operational excellence
  • Security
  • Reliability
  • Performance efficiency
  • Cost optimization

The tool can help you assess the analytics workloads you have deployed in AWS by identifying potential risks and offering suggestions for improvements.

Using the Analytics Lens to address common requirements

The Analytics Lens models both the data architecture at the core of the analytics applications and the application behavior itself. These models are organized into the following six areas, which encompass the vast majority of analytics workloads deployed on AWS:

  1. Data ingestion
  2. Security and governance
  3. Catalog and search
  4. Central storage
  5. Processing and analytics
  6. User access

The following diagram illustrates these areas and their related AWS services.

There are a number of common scenarios where the Analytics Lens applies, such as the following:

  • Building a data lake as the foundation for your data and analytics initiatives
  • Efficient batch data processing at scale
  • Building a platform for streaming ingest and real-time event processing
  • Handling big data processing and streaming
  • Data-preparation operations

Whichever of these scenarios fits your needs, building to the principles of the Analytics Lens in the AWS Well-Architected Framework can help you implement best practices for success.

The Analytics Lens explains when and how to use the core services in the AWS analytics portfolio. These include Amazon Kinesis, Amazon Redshift, Amazon EMR, Amazon Athena, AWS Glue, and AWS Lake Formation. It also explains how Amazon Simple Storage Service (Amazon S3) can serve as the storage for your data lake and how to integrate with relevant AWS security services. With reference architectures, best practices advice, and answers to common questions, the Analytics Lens can help you make the right design decisions.

Conclusion

Applying the lens to your existing architectures can validate the stability and efficiency of your design (or provide recommendations to address the gaps that are identified). AWS is committed to the Analytics Lens as a living tool; as the analytics landscape evolves and new AWS services come on line, we’ll update the Analytics Lens appropriately. Our mission will always be to help you design and deploy well-architected applications.

For more information about building your own Well-Architected environment using the Analytics Lens, see the Analytics Lens whitepaper.

Special thanks to the following individuals who contributed to building this resource, among many others who helped with review and implementation: Radhika Ravirala, Laith Al-Saadoon, Wallace Printz, Ujjwal Ratan, and Neil Mukerje.

Are there questions you’d like to see answered in the tool? Share your thoughts and questions in the comments.

About the Authors

Nikki Rouda is the principal product marketing manager for data lakes and big data at Amazon Web Services. Nikki has spent 20+ years helping enterprises in 40+ countries develop and implement solutions to their analytics and IT infrastructure challenges. Nikki holds an MBA from the University of Cambridge and an ScB in geophysics and math from Brown University.

Radhika Ravirala is a specialist solutions architect at Amazon Web Services, where she helps customers craft distributed analytics applications on the AWS platform. Prior to her cloud journey, she worked as a software engineer and designer for technology companies in Silicon Valley.

Monitor and optimize queries on the new Amazon Redshift console

Post Syndicated from Debu Panda original https://aws.amazon.com/blogs/big-data/monitor-and-optimize-queries-on-the-new-amazon-redshift-console/

Tens of thousands of customers use Amazon Redshift to power their workloads to enable modern analytics use cases, such as Business Intelligence, predictive analytics, and real-time streaming analytics. As an administrator or data engineer, it’s important that your users, such as data analysts and BI professionals, get optimal performance. You can use the Amazon Redshift console to monitor and diagnose query performance issues.

The Amazon Redshift console features a monitoring dashboard and updated flows to create, manage, and monitor Amazon Redshift clusters. For more information, see Simplify management of Amazon Redshift clusters with the Redshift console.

This post discusses how you can use the new Amazon Redshift console to monitor your user queries, identify slow queries, and terminate runaway queries. The post also reviews details such as query plans, execution details for your queries, in-place recommendations to optimize slow queries, and how to use the Advisor recommendations to improve your query performance.

User query vs. rewritten query

Any query that users submit to Amazon Redshift is a user query. Analysts either author a user query or a BI tool such as Amazon QuickSight or Tableau generates the query. Amazon Redshift typically rewrites queries for optimization purposes. It can rewrite a user query into a single query or break it down into multiple queries. These queries are rewritten queries.

The following steps are performed by Amazon Redshift for each query:

  1. The leader node receives and parses the query.
  2. The parser produces an initial query tree, which is a logical representation of the original query. Amazon Redshift inputs this query tree into the query optimizer.
  3. The optimizer evaluates and, if necessary, rewrites the query to maximize its efficiency. This process sometimes results in creating multiple queries to replace a single query.

The query rewrite is done automatically and is transparent to the user.

Query monitoring with the original Amazon Redshift console and system tables

Previously, you could monitor the performance of rewritten queries in the original Amazon Redshift console or system tables. However, it was often challenging to find the SQL your users submitted.
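
For example, a query against the STL_QUERY system table like the following sketch returns recent statements, but only in the rewritten form that Amazon Redshift actually ran, not necessarily the SQL as the user authored it:

SELECT query, userid, starttime, endtime, TRIM(querytxt) AS sql_text
FROM stl_query
WHERE userid > 1          -- skip internal system queries
ORDER BY starttime DESC
LIMIT 20;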

The following comparison shows the query monitoring differences between the original Amazon Redshift console, system tables, and the new console.

Original console:

  • User query not supported
  • Monitors only rewritten queries
  • Shows only the top 100 queries

System tables:

  • User query not available
  • Rewritten queries only
  • All rewritten queries

New console:

  • Supports user queries
  • Shows all queries available in system tables
  • Allows you to correlate rewritten queries with user queries

The new console simplifies monitoring user queries and provides visibility to all query monitoring information available in the system.

Monitoring and diagnosing queries

The Amazon Redshift console provides information about the performance of queries that run in the cluster. You can use this information to identify and diagnose queries that take a long time to process and create bottlenecks that prevent other queries from executing efficiently.

The following list shows some of the common questions you may have when monitoring, isolating, and diagnosing query performance issues.

Monitor:

  • How is my cluster doing in terms of query performance and resource utilization?
  • How is my cluster throughput, concurrency, and latency looking?
  • Are queries being queued in my cluster?
  • Which queries or loads are running now?
  • Which queries or loads are taking longer than usual?
  • What are my top queries by duration in the last hour or last 24 hours?
  • Which queries have failed?
  • Is the average query latency for my cluster increasing or decreasing over time?

Isolate and diagnose:

  • A user complained about performance issues at a specific time. How do I identify that SQL and diagnose problems?
  • Which other queries were running when my query was slow? Were all queries slow?
  • Is my database overloaded with queries from other users? Is my queue depth increasing or decreasing?
  • How do I identify queries that a specific user runs?
  • The resources of my cluster are running very high. How do I find out which queries are running?

Optimize:

  • How can I optimize the SQL that our end users author?
  • Is there any optimization required in my schema design?
  • Is there any tuning required for my WLM queues?
  • Can I get any benefit if I enable concurrency scaling?

You can answer these questions by either using the Amazon Redshift console or developing scripts using the system catalog.

Monitoring queries

You can monitor your queries on the Amazon Redshift console on the Queries and loads page or on the Query monitoring tab on the Clusters page. While both options are similar for query monitoring, you can quickly get to your queries for all your clusters on the Queries and loads page. You have to select your cluster and period for viewing your queries.

You can view the queries using List view on the Query monitoring tab on the Clusters page. The query monitoring page visually shows the queries in a Gantt chart.

Each bar represents a user query, and the length of the bar represents runtime for a query. The X-axis shows the selected period, and the location of the bar indicates when a query started and ended. The queries include both standard SQL statements such as SELECT, INSERT, and DELETE, and loads such as COPY commands.

Monitoring top queries

By default, the Query monitoring page shows the top 100 longest queries by runtime or duration for the selected time window. You can change the time window to view the top queries for that period. The top queries also include completed queries and running queries. The completed queries are sorted by descending order of query runtime or duration.

Identifying running queries

You can find out your running queries by choosing Running queries from the drop-down menu.

To see the query’s details such as SQL text, runtime details, related rewritten queries, and execution details, choose the query ID.

The Duration column shows the estimated duration and runtime for a query. You can terminate a query by selecting the query and choosing Terminate query. You need to have the redshift:CancelQuerySession action added to your IAM policy to cancel a query.
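
If you prefer to work from a SQL session, a rough equivalent (the process ID below is only an example value) is to look up the session in the STV_RECENTS system table and cancel it:

SELECT pid, user_name, starttime, duration, TRIM(query) AS sql_text
FROM stv_recents
WHERE status = 'Running';

CANCEL 18764;   -- replace 18764 with the pid returned by the previous query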

Viewing loads

As a data engineer or Redshift administrator, ensuring that your load jobs complete correctly and meet required performance SLAs is a major priority.

You can view all your load jobs by choosing Loads from the drop-down menu on the Query monitoring page. You can then zoom in on the desired time window.

The preceding Gantt chart shows all loads completed successfully. The query status indicates if the load failed or if an administrator terminated it.

Identifying failed queries

You can identify failed queries by choosing Failed or stopped queries from the drop-down menu on the Query monitoring page and then zooming in on the desired time.

The query page shows 50 queries by default, and you have to paginate to view more results. You can change the page size by choosing the settings gear icon.

In the Preferences section, you can customize what fields you want to see on the Queries and loads list. For example, you can see the PID and not the transaction ID. These changes persist across browser sessions.

Monitoring long-running queries

Amazon Redshift categorizes a query as long-running if the query or load runs for more than 10 minutes. You can filter long-running queries by choosing Long queries from the drop-down menu. Similarly, you can also filter medium and short queries.
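
If you want the same cut of data from the system tables, a sketch like the following against SVL_QLOG returns queries that ran for more than 10 minutes (the elapsed column is in microseconds):

SELECT query, starttime, elapsed / 1000000 AS elapsed_seconds
FROM svl_qlog
WHERE elapsed > 600 * 1000000
ORDER BY elapsed DESC;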

Isolating problematic queries

The following section looks at some use cases in which you use the console to diagnose query performance issues.

Query performance issues

For this use case, a user complains that their queries as part of the dashboards are slow, and you want to identify the associated queries. These queries might not be part of the top queries. To isolate these queries, you can either choose Completed queries or All queries from the drop-down menu and specify the time window by choosing Custom.

You can also drill down to view the queries in a specific period, or filter for queries from one particular user by searching their user name.

You can also filter your queries by searching SQL query text.

As with the earlier charts, the size of a bar represents a relative duration of the runtime for a query. In this period, the highlighted query is the slowest. If you mouse over a bar in the Gantt chart, it provides helpful information about the query such as query ID, part of the query text, and runtime. To view details about a specific query, choose Query ID.

Identifying systemic query performance problems

For this use case, many of your users are complaining about longer-than-normal query runtimes. You want to diagnose what is happening in your cluster. You can customize your time and switch to the graph view, which helps you to correlate longer runtimes with what is happening in the cluster. As the following Gantt chart and CPU utilization graph show, many queries were running at that time, and CPU utilization almost reached 100%.

The concurrency scaling feature of Amazon Redshift could have helped maintain consistent performance throughout the workload spike.

High CPU utilization

You can correlate query performance with cluster performance and highlight a given metric, such as CPU utilization, to see which queries were running at that time.

Monitoring workload performance

You can get a detailed view of your workload’s performance by looking at the Workload execution breakdown chart. You can find out how long it took to plan, wait, and execute your workload. You can also view time spent in operations such as INSERT, UPDATE, DELETE, COPY, UNLOAD, or CTAS. The chosen time in the query history is stored when you navigate between pages.

In the preceding screenshot, you can see several waits in the workload breakdown graph. You can take advantage of concurrency scaling to process a burst of queries.

Correlating query throughput with query duration

You can view the trend of the performance of your queries, such as duration or execution time for your long, medium, and short queries, and correlate with the query throughput.

This information can offer insight into how well the cluster serves each query category with its current configuration.

Monitoring workload for your WLM queues

You can correlate query performance with cluster performance and highlight a given metric such as CPU utilization to see which queries were running at that time. You can view the average throughput, average duration, and average queue time by different WLM queues. Insight from this graph might help you tune your queries; for example, by assigning the right priority for your WLM queue or enabling concurrency scaling for your WLM queue.

Diagnosing and optimizing query performance

After you isolate a slow query, you can drill down to the execution details of the query by choosing Query ID. The following screenshot shows multiple query IDs for a query that has been rewritten to multiple queries.

The Query details page shows you the parent query and all rewritten queries.

You can also find out whether any of the rewritten queries ran on a concurrency scaling cluster.

You can view the query plans, execution statistics such as the cost of each step of the plan, and data scanned for the query.

You can also view the cluster metrics at the time the query ran on the cluster. The following screenshot shows the problematic steps for your query plan. Choosing a problematic step reveals in-place recommendations to improve this query.

Implementing Advisor recommendations

Amazon Redshift Advisor provides recommendations that could improve workload performance. Amazon Redshift uses machine learning to look at your workload and provide customized recommendations. Amazon Redshift monitors and offers guidance for improved performance in the following crucial areas:

  • Short query acceleration (SQA) – Checks for query patterns and reports the number of recent queries in which you can reduce latency and the daily queue time for SQA-eligible queries by enabling SQA, thus improving your query performance
  • Sort key for tables – Analyzes the workload for your data warehouse over several days to identify a beneficial sort key for your tables and makes sort key recommendations
  • Distribution key for tables – Analyzes your workload to identify the most appropriate distribution key for tables that can significantly benefit from a key distribution style

The following screenshot shows a recommendation to alter the distribution key for the table.

Enabling concurrency scaling

To deliver optimal performance for your users, you can monitor user workloads and take action if you diagnose a problem. You can drill down to the query history for that specific time, and see several queries running at that time.

If you aren’t using concurrency scaling, your queries might be getting queued. You can also see that on the Workload concurrency tab. In the following screenshot, you can see that many queries are queued during that time because you didn’t enable concurrency scaling.

You can monitor all submitted queries and enable concurrency scaling when queued queries are increasing.
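
One way to watch for queuing is a minimal sketch like the following against the STV_WLM_QUERY_STATE system table, which counts queries whose state indicates they are waiting in a WLM queue:

SELECT service_class, COUNT(*) AS queued_queries
FROM stv_wlm_query_state
WHERE state LIKE 'Queued%'
GROUP BY service_class;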

View a demo of Query Monitoring to learn more about the feature:

Conclusion

This post showed you the new features in the Amazon Redshift console that allow you to monitor user queries and help you diagnose performance issues in your user workload. The console also allows you to view your top queries by duration, filter failed and long-running queries, and drill down to view related rewritten queries and their execution details, which you can use to tune your queries. Start using the query monitoring features of the new Amazon Redshift console to monitor your user workload today!

About the Authors

Debu Panda, a senior product manager at AWS, is an industry leader in analytics, application platform, and database technologies. He has more than 20 years of experience in the IT industry, has published numerous articles on analytics, enterprise Java, and databases, and has presented at multiple conferences. He is the lead author of EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

Apurva Gupta is a user experience designer at AWS. She specializes in databases, analytics, and AI solutions. Previously, she has worked with companies both big and small, leading end-to-end design and helping teams set up design-first product development processes, design systems, and accessibility programs.

Chao Duan is a software development manager at Amazon Redshift, where he leads the development team focusing on enabling self-maintenance and self-tuning with comprehensive monitoring for Redshift. Chao is passionate about building high-availability, high-performance, and cost-effective databases to empower customers with data-driven decision making.

Sudhakar Reddy is a full stack software development engineer with Amazon Redshift. He specializes in building cloud services and applications for big data, databases, and analytics.

Zayd Simjee is a software development engineer with Amazon Redshift.

Federate Amazon Redshift access with Microsoft Azure AD single sign-on

Post Syndicated from Harshida Patel original https://aws.amazon.com/blogs/big-data/federate-amazon-redshift-access-with-microsoft-azure-ad-single-sign-on/

Recently, we helped a large enterprise customer who was building their data warehouse on Amazon Redshift and using Azure AD as their corporate directory. Their requirement was to let data warehouse users query data in Amazon Redshift with their corporate credentials. Doing so not only provided a better user experience (users don't have to remember extra passwords), but it also made maintenance easier, because all corporate users are managed in a single place. The solution was to set up Azure AD federated access to Amazon Redshift.

You can use federation to centrally manage access to Amazon Redshift. This simplifies administration by allowing you to control user access at a central location and reducing the overhead of creating and maintaining database users. In addition to Active Directory Federation Service (ADFS), PingFederate, and Okta, Amazon Redshift also supports Microsoft Azure Active Directory (Azure AD) federation.

For more information about using ADFS with Amazon Redshift, see Federate Database User Authentication Easily with IAM and Amazon Redshift. For more information about integrating Azure AD, see Setting Up JDBC or ODBC Single Sign-on Authentication with Microsoft Azure AD.

This post illustrates how to set up federation using Azure AD and IAM. Azure AD manages the users and provides federated access to Amazon Redshift using IAM. You do not need to create separate database users in Amazon Redshift with this setup.

Solution overview

The solution to configure the federation between Azure AD and IAM to allow seamless access to Amazon Redshift using an SQL client contains the following components:

  • Azure AD serves as the identity provider for user authentication.
  • AWS serves as the service provider for authorization.
  • Security Assertion Markup Language (SAML 2.0) simplifies the interaction between identity provider and service provider.
  • AWS STS requests temporary credentials to connect to Amazon Redshift.

The following diagram illustrates the architecture of the solution.

The solution contains the following steps:

  1. You configure a JDBC or ODBC driver in SQL client to use Azure AD federation.
  2. The client makes an authentication request to the enterprise application in Azure using Azure AD credentials.
  3. The Azure enterprise application queries Azure AD and generates a SAML response, which includes the IAM roles assigned to the user.
  4. The client receives the SAML response.
  5. The SAML assertion goes to the AWS federation endpoint, which invokes the AssumeRoleWithSAML API of AWS STS and generates temporary IAM credentials.
  6. You use the temporary credentials to connect to the Amazon Redshift cluster.

Prerequisites

This blog post assumes that you have the following:

On Azure AD side:

On Amazon Redshift side:

Walkthrough overview

This walkthrough consists of the following three sections:

  1. Setting up the Azure enterprise non-gallery application using single sign-on (SSO) with SAML.
  2. Setting up the IAM provider and roles, which includes the following steps:
    • Creating the SAML identity provider.
    • Creating an IAM role for access to the Amazon Redshift cluster.
    • Creating an IAM provider and an IAM role to use SAML-based federation.
    • Testing the SSO setup.
  3. Configuring the JDBC client to use Azure AD user credentials to log in to the Amazon Redshift cluster. This post uses a JDBC client. However, you can use the same setup to support ODBC clients.

Setting Up Azure Enterprise Application with SAML single sign on

To set up the Azure enterprise application, complete the following steps:

  1. Log in to Azure Portal.
  2. Choose Enterprise applications.
  3. Choose New application.
  4. Choose Non-gallery application.
  5. For Name, enter Redshift.
  6. Choose Add.
  7. Under Manage¸ choose Single sign-on.
  8. In the Basic SAML Configuration section, for Entity ID and Reply URL, enter https://signin.aws.amazon.com/saml.
  9. In the User Attributes & Claims section, choose Edit.
  10. For Additional claims, Add new claim with the following values (if they do not exist):
    • For http://schemas.microsoft.com/ws/2008/06/identity/claims/groups, choose user.groups [SecurityGroup].
    • For https://redshift.amazon.com/SAML/Attributes/RoleSessionName, choose user.userprincipalname.
    • For https://redshift.amazon.com/SAML/Attributes/DbGroups, choose user.assignedroles
    • For https://redshift.amazon.com/SAML/Attributes/DbUser, choose user.userprincipalname.
    • For https://redshift.amazon.com/SAML/Attributes/AutoCreate, enter true. Note that the values of DbUser and DbGroups must be lowercase, begin with a letter, contain only alphanumeric characters, underscore (‘_’), plus sign (‘+’), dot (‘.’), at (‘@’), or hyphen (‘-‘), and be less than 128 characters.
  11. In addition to the claims added in the previous step, add a new claim:
    • For Manage claim, for Name, enter Role.
    • For Namespace, enter https://aws.amazon.com/SAML/Attributes.
    • For Source, select Attribute.
    • For Source attribute, enter arn:aws:iam::<yourAWSAccount>:role/AzureSSO,arn:aws:iam::<yourAWSAccount>:saml-provider/AzureADProvider. Ensure the Role claim is configured properly, and make sure you replace <yourAWSAccount> with your own AWS account ID. You set up the role AzureSSO and the identity provider AzureADProvider in AWS Identity and Access Management later in this walkthrough.
  12. In the Single sign-on section, under SAML Signing Certificate, for Federation Metadata XML, choose Download. Save this file locally. Often, the name of the file is the same as the application name; for example, Redshift.xml. You use this file to configure the IAM identity provider in the next section.
  13. On the App registrations page, choose the application Redshift. If you don’t see your application in the list, choose All apps from the drop-down menu and search for it. Register it if it is not registered.
  14. Record the Application (client) ID and Directory (tenant) ID. You use these values in the JDBC connection when you connect to Amazon Redshift.
  15. Under API permissions, choose Add a permission.
  16. Choose Microsoft Graph.
  17. Choose Delegated permissions.
  18. For User, choose User.Read.
  19. Choose Add permission. This allows the Amazon Redshift enterprise application, with admin consent, to read the user profile and perform login using SSO.
  20. Under Security, choose Permissions.
  21. Add Users/groups to the application or grant universal admin consent for the entire organization.
  22. Choose Certificates & secrets.
  23. Generate the client secret key.
  24. Record the client secret key to use when you configure the JDBC connection.

You have now set up the enterprise application in Azure. You are ready to set up the IAM provider and role to use this Azure federation.

Setting Up IAM Identity Provider and Roles in AWS

An IAM identity provider is an entity in IAM that describes an external identity provider (IdP) service that supports the SAML 2.0 standard. You use an IAM identity provider when you want to establish trust between AWS and a SAML 2.0-compatible IdP such as Azure AD. It is important to name this identity provider AzureADProvider to match the SAML claims you created earlier.

For this walkthrough, you create the following in IAM:

  • IAM SAML identity provider AzureADProvider
  • IAM role AzureSSO
  • IAM policy redshiftAccessPolicy, which you attach to the role AzureSSO to provide permission to access the Amazon Redshift cluster

Creating the IAM SAML identity provider

To create your IAM SAML identity provider, complete the following steps:

  1. Sign in to the AWS Management Console as an admin.
  2. Under Security, Identity, & Compliance, choose IAM.
  3. Choose Identity providers.
  4. Choose Create Provider.
  5. On the Configure Provider page, for Provider Type, choose SAML.
    • For Provider Name, enter AzureADProvider.
    • For Metadata Document, upload the federation metadata XML file you downloaded earlier (for example, Redshift.xml).
  6. Choose Next Step.
  7. Choose Create.

Creating the IAM role for access to the Amazon Redshift cluster

You now create a role for SAML-based federation. The name of this role needs to match what you named the SAML claim you created earlier: AzureSSO. Complete the following steps:

  1. On the IAM console, choose Roles.
  2. Choose Create role.
  3. You will be directed to the Create role page, where you can choose from several different options. To set up Azure AD authentication, choose SAML 2.0 federation.
  4. Under Choose a SAML 2.0 provider:
    • For SAML provider, choose AzureADProvider from the drop down.
    • Select Allow programmatic and AWS Management Console access.
    • For Attribute, keep the default SAML:aud.
    • For Value, keep the default https://signin.aws.amazon.com/saml.
    • For Condition, you can optionally add additional conditions of who can use SSO. For this walkthrough, do not add any conditions.
  5. Choose Next: Permissions. You attach the policy that provides permission to Amazon Redshift after you create the policy.
  6. Choose Next: Tags. You can optionally add key-value tags to help better organize, track, or control access for this role. For this post, do not add any tags.
  7. Choose Next: Review.
  8. On the Review page, for Role name, enter AzureSSO.
    • For Role description, you can optionally enter a brief description of what your role does.
    • For Trusted entities, verify the ARN of the provider you specified earlier is correct.
  9. Choose Create role.

Creating the IAM Amazon Redshift access policy

In the following steps, you allow the role AzureSSO to generate an AWS STS token to connect to Amazon Redshift by adding an IAM permissions policy redshiftAccessPolicy to the role.

  1. On the IAM console, choose Policies.
  2. Choose Create policy.
  3. On the Create policy page, choose JSON.
  4. For the policy, enter the following JSON:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "redshift:DescribeClusters"
                ],
                "Resource": [
                    "arn:aws:redshift:<YOUR-REGION>:<AWS-ACCOUNT-NUMBER>:cluster:<YOUR-REDSHIFT-CLUSTER-NAME>"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "redshift:GetClusterCredentials"
                ],
                "Resource": [
                    "arn:aws:redshift:<YOUR-REGION>:<AWS-ACCOUNT-NUMBER>:dbname:<YOUR-REDSHIFT-CLUSTER-NAME>/dev",
                    "arn:aws:redshift:<YOUR-REGION>:<AWS-ACCOUNT-NUMBER>:dbuser:<YOUR-REDSHIFT-CLUSTER-NAME>/${redshift:DbUser}"
                ],
                "Condition": {
                    "StringEquals": {
                        "aws:userid": "<YOUR-ROLE-ID>:${redshift:DbUser}"
                    }
                }
            },
            {
                "Effect": "Allow",
                "Action": [
                    "redshift:CreateClusterUser"
                ],
                "Resource": [
                    "arn:aws:redshift:<YOUR-REGION>:<AWS-ACCOUNT-NUMBER>:dbuser:<YOUR-REDSHIFT-CLUSTER-NAME>/${redshift:DbUser}"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "redshift:JoinGroup"
                ],
                "Resource": [
                    "arn:aws:redshift:<YOUR-REGION>:<AWS-ACCOUNT-NUMBER>:dbgroup:<YOUR-REDSHIFT-CLUSTER-NAME>/<YOUR-DB-GROUP>"
                ]
            }
        ]
    }

    In the preceding code, replace the placeholders with the appropriate values. The following list explains each placeholder:

  • <YOUR-REGION> – The Region hosting your solution.
  • <AWS-ACCOUNT-NUMBER> – Your AWS account number.
  • <YOUR-REDSHIFT-CLUSTER-NAME> – Your Amazon Redshift cluster name; for example, cluster-1.
  • <YOUR-ROLE-ID> – The unique identifier of the role AzureSSO, which you created in the previous section. You can run aws iam get-role --role-name AzureSSO and use the RoleId value from the output. For more information, see IAM Identifiers.
  • <YOUR-DB-GROUP> – The database group that the user can join. For this post, use developer_grp.

  1. Choose Review policy.
  2. In the Review policy section, for Name, enter the name of your policy; for example, redshiftAccessPolicy.
  3. For Description, you can optionally enter a brief description of what the policy does.
  4. For Summary, review your policy components and make sure to resolve any warnings or errors.
  5. Choose Create policy.
  6. Add this policy to the IAM role AzureSSO. For instructions, see Adding IAM Identity Permissions (Console).
    • Choose Managed IAM Policy.

Testing the SSO setup

You can now test the SSO setup. Complete the following steps.

  1. On the Azure Portal, for your Amazon Redshift application, choose Single sign-on.
  2. Choose Test.
  3. Choose Sign in as current user.

If the setup is correct, this brings you to the AWS Management Console (it may open in a new tab in some browsers). The following screenshot shows that the user test is signed in using the assumed role AzureSSO.

After you verify the SSO setup, you are ready to connect the SQL client to Amazon Redshift using Azure AD federation.

Setting up your JDBC SQL client to use Azure AD federation

You can use any client tool that supports an Amazon Redshift JDBC or ODBC driver to connect to Amazon Redshift using Azure SSO. For this post, use the SQL Workbench/J client to connect to Amazon Redshift using JDBC. For instructions on installing SQL Workbench/J, see Connect to Your Cluster by Using SQL Workbench/J.

To connect to the Amazon Redshift cluster using your Azure AD credentials, complete the following steps. If you are using another client, you can pass all these parameters in the JDBC URL as a URL parameter.

To use Azure AD with JDBC, the Amazon Redshift JDBC driver must be version 1.2.37.1061 or later. To use Azure AD with ODBC, the Amazon Redshift ODBC driver must be version 1.4.10.1000 or later. For this walkthrough, use the driver with AWS SDK.

  1. Start SQLWorkbenchJ.
  2. On the Select Connection Profile page, choose Add Profile Group.
  3. Choose New Connection Profile. This opens a new profile in the profile group you created. You can choose an appropriate name; this post uses Azure.
  4. For Driver, choose your Amazon Redshift driver.
  5. The next step is to construct the JDBC URL and enter it in the URL field, using the following pattern: jdbc:redshift:iam://<your-cluster-identifier-here>:<your-cluster-region-here>/<your-DB-name-here>. Replace <your-cluster-identifier-here>, <your-cluster-region-here>, and <your-DB-name-here> with the values matching your Amazon Redshift cluster. This post uses cluster-identifier, us-west-1, and dev.
  6. Choose Extended Properties.

Add the following key-value pairs in the Extended Properties dialog of SQL Workbench/J:

  • plugin_name – To use Azure AD federation, use com.amazon.redshift.plugin.AzureCredentialsProvider.
  • idp_tenant – The tenant name of your company configured on your IdP (Azure). This value can be either the tenant name or the tenant unique ID with hyphens. For this walkthrough, use the tenant ID that you recorded earlier.
  • client_secret – The client secret value that you created earlier.
  • client_id – The client ID (with hyphens) of the Redshift application. Use the client ID that you recorded earlier.

If you completed these steps correctly, you can connect to the Amazon Redshift cluster using your Azure AD credentials.
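
As a quick sanity check after connecting, you can confirm which database user the federation mapped you to; with the AutoCreate claim set to true, a matching user should appear in pg_user:

SELECT current_user;
SELECT usename FROM pg_user WHERE usename = current_user;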

Summary

In this post, we provided a step-by-step guide to configure and use Azure AD as your IdP to enable federation to an Amazon Redshift cluster.

About the Authors

Harshida Patel is a Data Warehouse Specialist Solutions Architect with AWS.

Pir Abdul Rasool Qureshi is a Solutions Architect at AWS. He enjoys solving complex customer problems in databases and analytics and delivering successful outcomes. Outside of work, he loves to spend time with his family, watch movies, and play cricket whenever possible.

Achieve finer-grained data security with column-level access control in Amazon Redshift

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/achieve-finer-grained-data-security-with-column-level-access-control-in-amazon-redshift/

Amazon Redshift is the most popular cloud data warehouse because it provides fast insights at a low cost. Customers can confidently run mission critical workloads, even in highly regulated industries, because Amazon Redshift comes with out of the box security and compliance. The security features, combined with the ability to easily analyze data in-place and in open formats, along with compute and storage elasticity, and ease of use are what makes tens of thousands of customers choose Amazon Redshift.

Many organizations store sensitive data, commonly classified as personally identifiable information (PII) or sensitive personal information (SPI), in Amazon Redshift, and access to this data is restricted for different personas in the organization. For example, your human resources, finance, sales, data science, and marketing departments may all have the required access privileges to view customer data, whereas only the finance department should have access to sensitive data like PII or payment card industry (PCI) data.

Previously, views or AWS Lake Formation on Amazon Redshift Spectrum were used to manage such scenarios; however, this adds extra overhead in creating and maintaining views or Redshift Spectrum external schemas. The view-based approach is also difficult to scale and can lead to a lack of security controls. Amazon Redshift column-level access control is a new feature that supports access control at the column level for data in Amazon Redshift. You can use column-level GRANT and REVOKE statements to help meet your security and compliance needs, similar to managing any other database object.

This post shows you how to set up Amazon Redshift column-level access control on tables, views, and materialized views.

Use Case

There are two tables that store customer demographic and account balance data. The finance department can see all customer data, while the sales department can only view and update market segment and account balance data, because the rest of the customer demographic data, such as customer name, phone, and nation, is considered PII and should have restricted access. This is a good use case for column-level access control to secure the PII data. Below is a simple entity relationship diagram for the two tables.

Prerequisites

Before trying out the illustration in this blog, note the following prerequisites:

  1. An Amazon Redshift cluster.
  2. A database user with permission to create tables, or a superuser.

Setting up the environment

To setup the environment and implement the use case, complete the following steps:

  1. Connect to your Amazon Redshift cluster using any SQL client of your choice, with a user that has permission to create tables or with a superuser.
  2. Create two tables with the following code:

    CREATE TABLE customer 
    (
      customerid       INT8 NOT NULL,
      customername     VARCHAR(25) NOT NULL,
      phone            CHAR(15) NOT NULL,
      nationid        INT4 NOT NULL,
      marketsegment    CHAR(10) NOT NULL,
      accountbalance   NUMERIC(12,2) NOT NULL
    );
    CREATE TABLE nation 
    (
      nationid    INT4 NOT NULL,
      nationname   CHAR(25) NOT NULL
    );

  3. Populate some sample data into the two tables with the following code:

    INSERT INTO customer VALUES
    (1, 'Customer#000000001', '33-687-542-7601', 3, 'HOUSEHOLD', 2788.52),
    (2, 'Customer#000000002', '13-806-545-9701', 1, 'MACHINERY', 591.98),
    (3, 'Customer#000000003', '13-312-472-8245', 1, 'HOUSEHOLD', 3332.02),
    (4, 'Customer#000000004', '23-127-851-8031', 2, 'MACHINERY', 9255.67),
    (5, 'Customer#000000005', '13-137-193-2709', 1, 'BUILDING', 5679.84)
    ;
    INSERT INTO nation VALUES
    (1, 'UNITED STATES'),
    (2, 'AUSTRALIA'),
    (3, 'UNITED KINGDOM')
    ;

  4. Create a view and a materialized view with the following code:

    CREATE OR REPLACE VIEW customer_vw AS SELECT customername, phone, marketsegment, accountbalance, CASE WHEN accountbalance < 1000 THEN 'low' WHEN accountbalance > 1000 AND accountbalance < 5000 THEN 'middle' ELSE 'high' END AS incomegroup FROM customer;
    
    CREATE MATERIALIZED VIEW customernation_mv AS SELECT customername, phone, nationname, marketsegment, sum(accountbalance) AS accountbalance FROM customer c INNER JOIN nation n ON c.nationid = n.nationid GROUP BY customername, phone, nationname, marketsegment;

  5. The purpose of the view customer_vw is to implement the business rule of categorizing customers into income groups based on the customer dataset.
  6. Analytical dashboards frequently access this dataset by joining and aggregating the tables customer and nation; thus, the materialized view customernation_mv is created to speed up the performance of such queries significantly.
  7. Create and grant table-level permissions to user finance, which represents finance department users. Note that the following users are created only for illustration purposes. We recommend that you use AWS IAM federation to bring in your corporate users without creating them manually in Amazon Redshift. For more information, see https://docs.aws.amazon.com/redshift/latest/mgmt/redshift-iam-authentication-access-control.html#authentication.
    CREATE USER finance WITH password 'Abcd1234!';
    CREATE USER sales WITH password 'Abcd1234!'; 
    GRANT SELECT, UPDATE ON customer TO finance;
    GRANT SELECT ON customer_vw TO finance;
    GRANT SELECT ON customernation_mv TO finance;

  8. Note that user finance has SELECT and UPDATE permission on all columns of the customer table.
  9. Test and validate that user finance is able to view all data from the customer table, the customer_vw view, and the customernation_mv materialized view, and to update data on the customer table.
  10. Enter the following code:
    SET SESSION AUTHORIZATION 'finance';
    SELECT CURRENT_USER;
    SELECT * FROM customer;
    SELECT * FROM customer_vw;
    SELECT * FROM customernation_mv;
    UPDATE customer SET accountbalance = 1000 WHERE marketsegment = 'BUILDING';
    RESET SESSION AUTHORIZATION;

    Note that the SQL statement SET SESSION AUTHORIZATION 'finance' is used to impersonate user finance in the above code.

    Each select statement should return five rows and the update statement should return one row updated. See the following code:

    dev=# SET SESSION AUTHORIZATION 'finance';
    SET
    dev=> SELECT CURRENT_USER;
     current_user 
    --------------
     finance
    (1 row)
     
    dev=> SELECT * FROM customer;
     customerid |    customername    |      phone      | nationid | marketsegment | accountbalance 
    ------------+--------------------+-----------------+----------+---------------+----------------
              1 | Customer#000000001 | 33-687-542-7601 |        3 | HOUSEHOLD     |        2788.52
              2 | Customer#000000002 | 13-806-545-9701 |        1 | MACHINERY     |         591.98
              3 | Customer#000000003 | 13-312-472-8245 |        1 | HOUSEHOLD     |        3332.02
              4 | Customer#000000004 | 23-127-851-8031 |        2 | MACHINERY     |        9255.67
              5 | Customer#000000005 | 13-137-193-2709 |        1 | BUILDING      |        5679.84
    (5 rows)
     
    dev=> SELECT * FROM customer_vw;
        customername    |      phone      | marketsegment | accountbalance | incomegroup 
    --------------------+-----------------+---------------+----------------+-------------
     Customer#000000001 | 33-687-542-7601 | HOUSEHOLD     |        2788.52 | middle
     Customer#000000002 | 13-806-545-9701 | MACHINERY     |         591.98 | low
     Customer#000000003 | 13-312-472-8245 | HOUSEHOLD     |        3332.02 | middle
     Customer#000000004 | 23-127-851-8031 | MACHINERY     |        9255.67 | high
     Customer#000000005 | 13-137-193-2709 | BUILDING      |        5679.84 | high
    (5 rows)
     
    dev=> SELECT * FROM customernation_mv;
        customername    |      phone      |        nationname         | marketsegment | accountbalance 
    --------------------+-----------------+---------------------------+---------------+----------------
     Customer#000000005 | 13-137-193-2709 | UNITED STATES             | BUILDING      |        5679.84
     Customer#000000004 | 23-127-851-8031 | AUSTRALIA                 | MACHINERY     |        9255.67
     Customer#000000003 | 13-312-472-8245 | UNITED STATES             | HOUSEHOLD     |        3332.02
     Customer#000000002 | 13-806-545-9701 | UNITED STATES             | MACHINERY     |         591.98
     Customer#000000001 | 33-687-542-7601 | UNITED KINGDOM            | HOUSEHOLD     |        2788.52
    (5 rows)
     
    dev=> UPDATE customer SET accountbalance = 1000 WHERE marketsegment = 'BUILDING';
    UPDATE 1
    dev=>  
    dev=> RESET SESSION AUTHORIZATION;
    RESET

You have now successfully set up table-level permissions for user finance to view and update all customer data.

Setting up Amazon Redshift column-level access control

Column-level access control can be enabled and disabled by using GRANT and REVOKE statements with the following syntax:

GRANT { { SELECT | UPDATE } ( column_name [, ...] ) [, ...] | ALL [ PRIVILEGES ] ( column_name [,...] ) }
ON { [ TABLE ] table_name [, ...] }
TO { username | GROUP group_name | PUBLIC } [, ...]
 
REVOKE { { SELECT | UPDATE } ( column_name [, ...] ) [, ...] | ALL [ PRIVILEGES ] ( column_name [,...] ) }
ON { [ TABLE ] table_name [, ...] }
FROM { username | GROUP group_name | PUBLIC } [, ...]
[ CASCADE | RESTRICT ]

To set up column-level privileges, complete the following steps:

  1. To determine which users have column-level access control, you can query the PG_ATTRIBUTE_INFO system view. Enter the following code:
    SELECT b.attacl, b.attname, c.relname FROM pg_catalog.pg_attribute_info b JOIN pg_class c ON c.oid=b.attrelid WHERE c.relname in ('customer','customer_vw','customernation_mv') AND b.attacl IS NOT NULL ORDER BY c.relname, b.attname;

  2. The query should return zero records because you have not implemented column-level access control yet.
  3. Grant user sales SELECT permission on the columns marketsegment and accountbalance of the customer table, the customer_vw view, and the customernation_mv materialized view. Also grant UPDATE permission on the columns marketsegment and accountbalance of the customer table by entering the following code:
    RESET SESSION AUTHORIZATION;
    GRANT SELECT (marketsegment, accountbalance) ON customer TO sales WITH GRANT OPTION;
    GRANT SELECT (marketsegment, accountbalance),UPDATE (marketsegment, accountbalance) ON customer TO sales;
    GRANT SELECT (marketsegment, accountbalance) ON customer_vw TO sales;
    GRANT SELECT (marketsegment, accountbalance) ON customernation_mv TO sales;

  4. The error message “Grant options are not supported for column privileges” should be returned for the first statement. This is because, to keep the security model simple, only a table’s owner or a superuser can grant column-level privileges, and the WITH GRANT OPTION clause isn’t supported for them.
  5. Validate that the preceding permissions have been granted with the following code:
    SELECT b.attacl, b.attname, c.relname FROM pg_catalog.pg_attribute_info b  JOIN pg_class c ON c.oid=b.attrelid WHERE c.relname in ('customer','customer_vw','customernation_mv') AND b.attacl IS NOT NULL ORDER BY c.relname, b.attname;

  6. The query should return six rows. See the following code:
    dev=# SELECT b.attacl, b.attname, c.relname FROM pg_catalog.pg_attribute_info b  JOIN pg_class c ON c.oid=b.attrelid WHERE c.relname in ('customer','customer_vw','customernation_mv') AND b.attacl IS NOT NULL ORDER BY c.relname, b.attname;
          attacl       |    attname     |      relname      
    -------------------+----------------+-------------------
     {sales=rw/fqdemo} | accountbalance | customer
     {sales=rw/fqdemo} | marketsegment  | customer
     {sales=r/fqdemo}  | accountbalance | customer_vw
     {sales=r/fqdemo}  | marketsegment  | customer_vw
     {sales=r/fqdemo}  | accountbalance | customernation_mv
     {sales=r/fqdemo}  | marketsegment  | customernation_mv
    (6 rows)

    The preceding output shows the following:

    • User sales (attacl column) has read/write permissions (attacl value “rw”) on the columns accountbalance and marketsegment (attname column) of the table customer (relname column), granted by fqdemo (attacl column).
    • User sales (attacl column) has read permission (attacl value “r”) on the columns accountbalance and marketsegment (attname column) of customer_vw and customernation_mv (relname column), granted by fqdemo (attacl column).

  7. After you confirm that the column-level access controls are correct, run the following code as user sales to query the customer table, the customer_vw view, and the customernation_mv materialized view:
    SET SESSION AUTHORIZATION 'sales';
    SELECT CURRENT_USER;
    SELECT * FROM customer;
    SELECT * FROM customer_vw;
    SELECT * FROM customernation_mv;

  8. Each SELECT statement should return a permission denied error because the user doesn’t have permissions on all columns of the objects being queried. See the following code:
    dev=# SET SESSION AUTHORIZATION 'sales';
    SET
    dev=> SELECT CURRENT_USER;
     current_user 
    --------------
     sales
    (1 row)
     
    dev=> SELECT * FROM customer;
    ERROR:  permission denied for relation customer
    dev=> SELECT * FROM customer_vw;
    ERROR:  permission denied for relation customer_vw
    dev=> SELECT * FROM customernation_mv;
    ERROR:  permission denied for relation customernation_mv

  9. Query only the columns marketsegment and accountbalance from the customer table, the customer_vw view, and the customernation_mv materialized view with the following code:
    SELECT marketsegment, accountbalance FROM customer;
    SELECT marketsegment, accountbalance FROM customer_vw;
    SELECT marketsegment, accountbalance FROM customernation_mv;

  10. Each SELECT statement should return five rows because user sales has permission to query the columns marketsegment and accountbalance. See the following code:
    dev=> SELECT marketsegment, accountbalance FROM customer;
     marketsegment | accountbalance 
    ---------------+----------------
     HOUSEHOLD     |        2788.52
     MACHINERY     |         591.98
     HOUSEHOLD     |        3332.02
     MACHINERY     |        9255.67
     BUILDING      |        1000.00
    (5 rows)
     
    dev=> SELECT marketsegment, accountbalance FROM customer_vw;
     marketsegment | accountbalance 
    ---------------+----------------
     HOUSEHOLD     |        2788.52
     MACHINERY     |         591.98
     HOUSEHOLD     |        3332.02
     MACHINERY     |        9255.67
     BUILDING      |        1000.00
    (5 rows)
     
    dev=> SELECT marketsegment, accountbalance FROM customernation_mv;
     marketsegment | accountbalance 
    ---------------+----------------
     MACHINERY     |        9255.67
     BUILDING      |        5679.84
     MACHINERY     |         591.98
     HOUSEHOLD     |        3332.02
     HOUSEHOLD     |        2788.52
    (5 rows)

  11. Update the accountbalance column with the following code:
    UPDATE customer SET accountbalance = 2000 WHERE marketsegment = 'BUILDING';
    SELECT accountbalance FROM customer WHERE marketsegment = 'BUILDING';

  12. The SELECT statement should return one row with the value 2000. See the following code:
    dev=> UPDATE customer SET accountbalance = 2000 WHERE marketsegment = 'BUILDING';
    UPDATE 1
    dev=> SELECT accountbalance FROM customer WHERE marketsegment = 'BUILDING';
     accountbalance 
    ----------------
            2000.00
    (1 row)

  13. Update the accountbalance column with the condition nationid = 1 by using the following code:
    UPDATE customer SET accountbalance = 3000 WHERE nationid = 1;

  14. The UPDATE statement should return a permission denied error because user sales doesn’t have column-level privileges on the column nationid in the WHERE clause.
  15. Query the record count grouped by nationid with the following code:
    SELECT COUNT(*) FROM customer GROUP BY nationid;

  16. The SELECT statement should return a permission denied error because user sales doesn’t have column-level privileges on the column nationid in the GROUP BY clause.
  17. Note that column-level privileges are checked for columns not only in the SELECT list, but also in the WHERE, GROUP BY, ORDER BY, HAVING, and other clauses of a query that require SELECT or UPDATE privileges on a column, as the following example shows.
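    For example, while you are still impersonating user sales (before the REVOKE in the next step), referencing nationid in an ORDER BY clause should also fail even though the selected column itself is permitted. This is an illustrative check; the exact error text can vary:
    SELECT accountbalance FROM customer ORDER BY nationid;
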
  18. Remove the column marketsegment from column-level access control for user sales using the REVOKE command and see what happens. Enter the following code:
    RESET SESSION AUTHORIZATION;
    REVOKE SELECT (marketsegment) ON customer FROM sales;
    SET SESSION AUTHORIZATION 'sales';
    SELECT CURRENT_USER;
    SELECT marketsegment, accountbalance FROM customer;
    SELECT accountbalance FROM customer;

  19. As you can see, user sales can no longer view marketsegment from the customer table. See the following code:
    dev=> SELECT marketsegment, accountbalance FROM customer;
    ERROR:  permission denied for relation customer
    dev=> SELECT accountbalance FROM customer;
     accountbalance 
    ----------------
            2788.52
             591.98
            3332.02
            9255.67
            2000.00
    (5 rows)

  20. Enter the following code to query column marketsegment from view customer_vw:
    SELECT marketsegment FROM customer_vw;

  21. The statement should return five rows because user sales still has access to the column marketsegment on the view, even though column-level privileges have been revoked from the customer table. Views execute with the permissions of the view owner, so the view continues to work as long as its owner still has column-level or table-level privileges on the base tables used by the view. To prevent unauthorized access to the sensitive data, the column-level privileges for user sales should be revoked from the view as well.
  22. Revoke all permissions for user sales with the following code:
    RESET SESSION AUTHORIZATION;
    SELECT CURRENT_USER;
    REVOKE SELECT ON customernation_mv FROM sales;
    REVOKE SELECT ON customer_vw FROM sales;
    REVOKE SELECT ON customer FROM sales;
    REVOKE UPDATE ON customer FROM sales;

  23. Query the table, view, and materialized view again as user sales using the following code:
    SET SESSION AUTHORIZATION 'sales';
    SELECT CURRENT_USER;
    SELECT marketsegment, accountbalance FROM customer;
    SELECT marketsegment, accountbalance FROM customer_vw;
    SELECT marketsegment, accountbalance FROM customernation_mv;

  24. A permission denied error should be returned for each statement, which shows that REVOKE removed all the permissions.

In summary, a simple GRANT statement enables column-level access control on an Amazon Redshift table, view, or materialized view, and a REVOKE statement removes it. This eliminates the complexity of legacy view-based access control when you need fine-grained read and write access control.
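
The same statements work at the group level. For example, the following sketch grants and then removes column-level read access for a hypothetical sales_group user group (the group is not created elsewhere in this post and is shown only for illustration):

CREATE GROUP sales_group;
GRANT SELECT (marketsegment, accountbalance) ON customer TO GROUP sales_group;
REVOKE SELECT (marketsegment, accountbalance) ON customer FROM GROUP sales_group;
DROP GROUP sales_group;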

Clean up

When you are done with the preceding testing, you can remove the objects and users with the following code:

RESET SESSION AUTHORIZATION;
REVOKE SELECT ON customernation_mv FROM finance;
REVOKE SELECT ON customer_vw FROM finance;
REVOKE SELECT ON customer FROM finance;
REVOKE UPDATE ON customer FROM finance;
DROP VIEW customer_vw;
DROP MATERIALIZED VIEW customernation_mv;
DROP TABLE nation;
DROP TABLE customer;
DROP USER IF EXISTS sales;
DROP USER IF EXISTS finance;

Summary

Amazon Redshift is secure by default, and security doesn’t cost extra. It provides authentication and federation with identity providers such as Active Directory, Okta, Ping Federate, and Azure AD, and comes pre-integrated with AWS IAM and AWS KMS. It has also supported table-based access control for data in Amazon Redshift, and column-level access control for data in Amazon S3 through Amazon Redshift Spectrum, since September 2019. Amazon Redshift now supports column-level access control for local tables, eliminating the need to implement view-based access control or use another system.

This post showed you how easy it is to set up Amazon Redshift column-level access control. The use case in this post demonstrated how to confirm that you have fine-grained access control on the table, view, and materialized view. You can adopt this feature to support your business needs.

If you have any questions or suggestions, please leave a comment below.

 


About the Authors

BP Yau is a Data Warehouse Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate the Oracle Data Warehouse to Amazon Redshift and built the next generation big data analytics platform using AWS technologies.

 

 

 

Srikanth Sopirala is a Sr. Specialist Solutions Architect focused on Analytics at AWS. He is passionate about helping customers build scalable data and analytics solutions in the cloud.

Speed up your ELT and BI queries with Amazon Redshift materialized views

Post Syndicated from Juan Yu original https://aws.amazon.com/blogs/big-data/speed-up-your-elt-and-bi-queries-with-amazon-redshift-materialized-views/

The Amazon Redshift materialized views feature helps you achieve significantly faster query performance on repeated or predictable workloads, such as dashboard queries from business intelligence (BI) tools like Amazon QuickSight. It also speeds up and simplifies extract, load, and transform (ELT) data processing. You can use materialized views to store frequently used precomputations and seamlessly use them to achieve lower latency on subsequent analytical queries.

This post demonstrates how to create a materialized view, refresh it after data ingestion, and speed up your BI workload.

Setting up your sample datasets

This walkthrough uses the Amazon Customer Reviews Dataset. It is a public dataset stored in the us-east-1 Region. You will create the following three tables:

  • product_reviews – Contains customer reviews for a specific product
  • customer – Contains customer profile data
  • customer_address – Contains customer address information

The following diagram shows the relationship of the three tables.

To download the script and set up the tables, choose mv_blog.sql.

Creating and using materialized views

For this use case, your marketing team wants to build a report that shows how many customers per state like your products. You also want to drill down to each product category when needed.

In this first step, you create a regular view. See the following code:

CREATE VIEW v_reviews_byprod_and_state AS
SELECT PR.product_category,
       A.ca_state AS customer_state,
       count(PR.star_rating) AS cnt
FROM product_reviews PR,
     customer C,
     customer_address A
WHERE PR.customer_id = C.c_customer_sk
  AND C.c_current_addr_sk = A.ca_address_sk
  AND PR.marketplace = 'US'
GROUP BY 1,
         2;

The following code is a report to analyze the product review count per state:

SELECT customer_state,
       sum(cnt)
FROM v_reviews_byprod_and_state
GROUP BY 1
ORDER BY 2;

The following code is a report to analyze the product review count per state for specific categories:

SELECT customer_state,
       sum(cnt)
FROM v_reviews_byprod_and_state
WHERE product_category IN ('Home',
                           'Grocery')
GROUP BY 1
ORDER BY 2;

The preceding reports take approximately 4 seconds to run. As you sell more products and get more reviews, this elapsed time gradually gets longer. To speed up those reports, you can create a materialized view to precompute the count of reviews per product category and per state. See the following code:

CREATE MATERIALIZED VIEW mv_reviews_byprod_and_state AS
SELECT PR.product_category,
       A.ca_state AS customer_state,
       count(PR.star_rating) AS cnt
FROM product_reviews PR,
     customer C,
     customer_address A
WHERE PR.customer_id = C.c_customer_sk
  AND C.c_current_addr_sk = A.ca_address_sk
  AND PR.marketplace = 'US'
GROUP BY 1,
         2;

The following code shows the reports that analyze the product reviews against the materialized view:

SELECT customer_state,
       sum(cnt)
FROM mv_reviews_byprod_and_state
GROUP BY 1
ORDER BY 2;

SELECT customer_state,
       sum(cnt)
FROM mv_reviews_byprod_and_state
WHERE product_category IN ('Home',
                           'Grocery')
GROUP BY 1
ORDER BY 2;

The same reports against materialized views take less than 200 milliseconds because the new queries access precomputed joins, filters, grouping, and partial sums instead of the multiple, larger base tables.
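
If you want to see where the time savings come from, you can compare the two plans with EXPLAIN. The plan for the regular view references the base tables and joins, whereas the plan for the materialized view reads the precomputed result; the exact plan text depends on your cluster and data:

EXPLAIN SELECT customer_state, sum(cnt) FROM v_reviews_byprod_and_state GROUP BY 1 ORDER BY 2;
EXPLAIN SELECT customer_state, sum(cnt) FROM mv_reviews_byprod_and_state GROUP BY 1 ORDER BY 2;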

Speeding up and simplifying ELT data processing

To achieve similar performance without the use of materialized views, many users use the CREATE TABLE AS (CTAS) command. However, as you update base tables with new data inserts, updates, or deletes, the CTAS tables become stale; you must recreate them to keep them up-to-date with the latest changes from the base tables. Now with Amazon Redshift materialized views, you can overcome this problem by efficiently and incrementally refreshing the materialized views with supported SQL. For example, the following code ingests another 10,000 reviews:

INSERT INTO product_reviews
SELECT   marketplace, 
  cast(customer_id as bigint) customer_id, 
  review_id, 
  product_id, 
  cast(product_parent as bigint) product_parent, 
  product_title, 
  star_rating, 
  helpful_votes, 
  total_votes, 
  vine, 
  verified_purchase, 
  review_headline, 
  review_body, 
  review_date, 
  year,
  product_category
FROM demo.products_reviews
WHERE review_date = '2015-07-01' LIMIT 10000;

Now the materialized view is out-of-date. To refresh the materialized view, enter the following code:

REFRESH MATERIALIZED VIEW mv_reviews_byprod_and_state;

Within 200 milliseconds, the materialized view is up-to-date again. Your report queries have the same consistent, fast performance.

The following screenshot is the query log that shows query performance. The log shows newer statements at the top.

The materialized view refresh is much faster because it’s incremental: Amazon Redshift only uses the new data to update the materialized view instead of recomputing the entire materialized view from the base tables. For more information, see REFRESH MATERIALIZED VIEW.
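
To confirm how a given refresh was performed, you can check the refresh history. The following query is a sketch that assumes the SVL_MV_REFRESH_STATUS system view is available on your cluster; its status column indicates whether the refresh was incremental or a full recompute:

-- inspect recent refreshes of the materialized view created earlier in this post
SELECT mv_name, status, starttime, endtime
FROM svl_mv_refresh_status
WHERE mv_name = 'mv_reviews_byprod_and_state'
ORDER BY starttime DESC;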

Materialized views also make ELT simpler and more efficient. Without materialized views, you might create an ELT job and use CTAS to precompute the product analysis data. The ELT job recomputes this data after new data is ingested and stores the data in the precomputed product analysis table to meet the dashboard latency requirement.

In particular, the ELT job drops and recreates the precomputed product analysis table after each ingestion. See the following code:

BEGIN;
    DROP TABLE IF EXISTS latest_product_analysis;
    CREATE TABLE latest_product_analysis as SELECT ...;
END;

With materialized views, you just need to create the materialized view one time and refresh to keep it up-to-date. To refresh materialized views after ingesting new data, add REFRESH MATERIALIZED VIEW to the ELT data ingestion scripts. Redshift will automatically and incrementally bring the materialized view up-to-date.
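
As a sketch of what this looks like for the walkthrough data, the drop-and-recreate block shown earlier reduces to your existing ingestion statement followed by a refresh (the INSERT line is a placeholder for the ingestion logic, mirroring the CTAS example above):

BEGIN;
    -- ingest new data into the base table, for example the INSERT INTO product_reviews shown earlier
    INSERT INTO product_reviews SELECT ...;
END;
-- incrementally bring the materialized view up-to-date
REFRESH MATERIALIZED VIEW mv_reviews_byprod_and_state;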

Achieving faster performance for BI dashboards

You can use materialized views to help your BI team build a dashboard to analyze product trends.

For example, to create a materialized view to join customer and customer_address dimension tables and precompute reviews and ratings, enter the following code:

CREATE MATERIALIZED VIEW mv_product_analysis 
    sortkey(product_category, Customer_State, review_date) 
AS
SELECT PR.product_category,
       A.ca_state AS Customer_State,
       PR.review_date AS Review_Date,
       PR.product_title,
       COUNT(1) AS review_total,
       SUM(PR.star_rating) AS rating
FROM product_reviews PR,
     customer C,
     customer_address A
WHERE PR.customer_id = C.c_customer_sk
  AND C.c_current_addr_sk = A.ca_address_sk
  AND marketplace = 'US'
GROUP BY 1,2,3,4;

You access materialized views the same way as you do a regular table. For this walkthrough, choose a materialized view as the source for an Amazon QuickSight dataset, as shown in the following screenshot.

You can preview data of the materialized view in Amazon QuickSight to understand what information can be used to build the dashboard. The following screenshot shows the sample data of mv_product_analysis.

To track how many reviews customers post over time, use review_date as the X-axis and Sum(review_total) as the Y-axis. The following graph shows this visualization.

The following screenshot shows a complete dashboard “Product trend” that analyzes the top product category, product popularity by state, and more.

Because you are using materialized views, the product trend dashboard loads in seconds and is always up-to-date. You can gain the latest insights, understand customer purchase behavior, and identify business opportunities and optimizations.

You can compare the performance of materialized views with other possible alternatives, such as using regular views and using CTAS. The following graph shows the overall query execution for the product trend dashboard. Materialized views not only improve query performance by more than an order of magnitude compared to using a regular view, but also have low maintenance costs compared to using a CTAS because the incremental refresh time is proportional to the delta of changes in the base tables. In contrast, the CTAS recreate approach needs to process all the data in the base tables.

The following animated gif shows the actual response time for the product trend dashboard built using Amazon QuickSight in direct query mode.

Conclusion

This post showed how to create Amazon Redshift materialized views with one or more base tables to speed up both BI queries and ELT. You can easily build and maintain efficient data processing pipelines and seamlessly extend the low latency query execution benefits of materialized views to data analysis.

 


About the Authors

Juan Yu is a Data Warehouse Specialist Solutions Architect at AWS.

 

 

 

 

Jose Kunnackal John is principal product manager for Amazon QuickSight, AWS’ cloud-native, fully managed BI service. Jose started his career with Motorola, writing software for telecom and first responder systems. Later he was Director of Engineering at Trilibis Mobile, where he built a SaaS mobile web platform using AWS services. Jose is excited by the potential of cloud technologies and looks forward to helping customers with their transition to the cloud.

 

 

 

Accelerate Amazon Redshift Federated Query adoption with AWS CloudFormation

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/accelerate-amazon-redshift-federated-query-adoption-with-aws-cloudformation/

Amazon Redshift Federated Query allows you to combine the data from one or more Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL databases with data already in Amazon Redshift. You can also combine such data with data in an Amazon S3 data lake.

This post shows you how to set up Aurora PostgreSQL and Amazon Redshift with a 10 GB TPC-H dataset, and Amazon Redshift Federated Query using AWS CloudFormation. For more information about using Federated Query, see Build a Simplified ETL and Live Data Query Solution using Redshift Federated Query. You can use the environment you set up in this post to experiment with various use cases in the preceding post.

Benefits of using CloudFormation templates

The standard workflow of setting up Amazon Redshift Federated Query involves six steps. For more information, see Querying Data with Federated Query in Amazon Redshift. With a CloudFormation template, you can condense these manual procedures into a few steps listed in a text file. The declarative code in the file captures the intended state of the resources to create and allows you to automate the creation of AWS resources to support Amazon Redshift Federated Query. You can further enhance this template to become the single source of truth for your infrastructure.

A CloudFormation template acts as an accelerator. It helps you automate the deployment of technology and infrastructure in a safe and repeatable manner across multiple Regions and multiple accounts with the least amount of effort and time.

Architecture overview

The following diagram illustrates the solution architecture.

The CloudFormation templates provision the following components in the architecture:

  • VPC
  • Subnets
  • Route tables
  • Internet gateway
  • Amazon Linux Bastion host
  • Secrets
  • Aurora PostgreSQL cluster with TPC-H dataset preloaded
  • Amazon Redshift cluster with TPC-H dataset preloaded
  • Amazon Redshift IAM role with required permissions

Prerequisites

Before you create your resources in AWS CloudFormation, you must complete the following prerequisites:

  • Have an IAM user with sufficient permissions to interact with the AWS Management Console and related AWS services. Your IAM permissions must also include access to create IAM roles and policies via the CloudFormation template.
  • Create an Amazon EC2 key pair in the us-east-1 Region. Make sure that you save the private key; this is the only time you can do so. You use this key pair as an input parameter when you set up the CloudFormation stack.

Setting up the resources with AWS CloudFormation

This post provides a CloudFormation template as a general guide. You can review and customize it to suit your needs. Some of the resources that this stack deploys incur costs when in use.

To create these resources, complete the following steps:

  1. Sign in to the console.
  2. Choose the us-east-1 Region in which to create the stack.
  3. Choose Launch Stack:
  4. Choose Next. This automatically launches AWS CloudFormation in your AWS account with a template. It prompts you to sign in as needed. You can view the CloudFormation template from within the console.
  5. For Stack name, enter a stack name.
  6. For Session, leave as the default.
  7. For ec2KeyPair, choose the key pair you created earlier.
  8. Choose Next.
  9. On the next screen, choose Next.
  10. Review the details on the final screen and select I acknowledge that AWS CloudFormation might create IAM resources.
  11. Choose Create. Stack creation can take up to 45 minutes.
  12. After the stack creation is complete, in the Outputs section, record the value of the key for the following components, which you use in a later step:
  • AuroraClusterEndpoint
  • AuroraSecretArn
  • RedshiftClusterEndpoint
  • RedshiftClusterRoleArn

You are now ready to log in to both the Aurora PostgreSQL and Amazon Redshift cluster and run some basic commands to test them.

Logging in to the clusters using the Amazon Linux Bastion host

The following steps assume that you use a computer with an SSH client to connect to the Bastion host. For more information about connecting using various clients, see Connect to Your Linux Instance.

  1. Move the private key of the EC2 key pair (that you saved previously) to a location on your SSH client, where you are connecting to the Amazon Linux Bastion host.
  2. Change the permission of the private key using the following code, so that it’s not publicly viewable:
    chmod 400 <private key file name; for example, bastion-key.pem>
  3. On the Amazon EC2 console, choose Instances.
  4. Choose the Amazon Linux Bastion host that the CloudFormation stack created.
  5. Choose Connect.
  6. Copy the value for SSHCommand.
  7. On the SSH client, change the directory to the location where you saved the EC2 private key, and paste the SSHCommand value.
  8. On the console, open the Secrets Manager dashboard.
  9. Choose the secret secretAuroraMasterUser-*.
  10. Choose Retrieve secret value.
  11. Record the password under Secret key/value, which you use to log in to the Aurora PostgreSQL cluster.
  12. Choose the secret SecretRedshiftMasterUser.
  13. Choose Retrieve secret value.
  14. Record the password under Secret key/value, which you use to log in to the Amazon Redshift cluster.
  15. Log in to both the Aurora PostgreSQL and Amazon Redshift databases using the psql client. The CloudFormation template has already set up the psql client binaries on the Amazon Linux Bastion host.
  16. Enter the following code in the command prompt of the Bastion host (substitute <RedshiftClusterEndpoint> with the value from the AWS CloudFormation output):
    psql -h <RedshiftClusterEndpoint> -d dev -p 5439 -U fqdemo
  17. When prompted, enter the database user password you recorded earlier.
  18. Enter the following SQL command:
    select "table" from svv_table_info where schema='public';

    You should see the following eight tables as the output:

    dev=# select "table" from svv_table_info where schema='public';
     table   
    ----------
     orders
     customer
     region
     nation
     supplier
     part
     lineitem
     partsupp
    (8 rows)

  19. Launch another command prompt session of the Bastion host and enter the following code (substitute <AuroraClusterEndpoint> with the value from the AWS CloudFormation output):
    psql -h <AuroraClusterEndpoint> -d dev -p 5432 -U awsuser
  20. When prompted, enter the database user password you recorded earlier.
  21. Enter the following SQL command:
    select tablename from pg_tables where schemaname='public';

    You should see the following eight tables as the output:

    dev=# select tablename from pg_tables where schemaname='public';
     tablename 
    -----------
     region
     nation
     lineitem
     orders
     part
     supplier
     partsupp
     customer
    (8 rows)

Completing Federated Query setup

The final step is to create an external schema to connect to the Aurora PostgreSQL instance. The following example code shows the CREATE EXTERNAL SCHEMA statement that you need to run on your Amazon Redshift cluster to complete this step:

CREATE EXTERNAL SCHEMA IF NOT EXISTS pg 
FROM POSTGRES 
DATABASE 'dev' 
SCHEMA 'public' 
URI '<AuroraClusterEndpoint>' 
PORT 5432 
IAM_ROLE '<IAMRole>' 
SECRET_ARN '<SecretARN>'

Use the following parameters:

  • URI – AuroraClusterEndpoint value from the CloudFormation stack outputs. Value is in the format <stackname>-cluster.<randomcharacter>.us-east-1.rds.amazonaws.com
  • IAM_ROLE – RedshiftClusterRoleArn value from the CloudFormation stack outputs. Value is in the format arn:aws:iam::<accountnumber>:role/<stackname>-RedshiftClusterRole-<randomcharacter>
  • SECRET_ARN – AuroraSecretArn value from the CloudFormation stack outputs. Value is in the format arn:aws:secretsmanager:us-east-1:<accountnumber>:secret:secretAuroraMasterUser-<randomcharacter>
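
After you run the statement with your values substituted, you can optionally confirm that the external schema was registered. The following check is a sketch that assumes the SVV_EXTERNAL_SCHEMAS system view, which lists the external schemas defined on the cluster:

SELECT schemaname, databasename, esoptions
FROM svv_external_schemas
WHERE schemaname = 'pg';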

Testing Federated Query

Now that you have set up Federated Query, you can start testing the feature using the TPC-H dataset that was preloaded into both Aurora PostgreSQL and Amazon Redshift.

The following query shows the parts and supplier relationship. Tables PARTSUPP and PART are stored in Amazon Redshift, and the SUPPLIER table in the subquery is from Aurora PostgreSQL:

SELECT TOP 10 P_BRAND,
       P_TYPE,
       P_SIZE,
       COUNT(DISTINCT PS_SUPPKEY) AS SUPPLIER_CNT
FROM PARTSUPP,
     PART
WHERE P_PARTKEY = PS_PARTKEY
AND   P_BRAND <> 'Brand#23'
AND   P_TYPE NOT LIKE 'MEDIUM ANODIZED%'
AND   P_SIZE IN (1,32,33,46,7,42,21,40)
AND   PS_SUPPKEY NOT IN (SELECT S_SUPPKEY
                         FROM pg.SUPPLIER
                         WHERE S_COMMENT LIKE '%Customer%Complaints%')
GROUP	BY P_BRAND,
         P_TYPE,
         P_SIZE
ORDER	BY SUPPLIER_CNT DESC,
         P_BRAND,
         P_TYPE,
         P_SIZE;

The following query shows the order priority by combining ORDERS table data from Amazon Redshift and Aurora PostgreSQL. This demonstrates the use case of live data query from an OLTP source federated with historical data on a data warehouse:

SELECT O_ORDERPRIORITY,
       COUNT(*) AS ORDER_COUNT
FROM (SELECT O_ORDERPRIORITY
      FROM ORDERS o
      WHERE O_ORDERDATE < '1997-07-01'
      AND   O_ORDERDATE >= CAST(DATE '1997-07-01' - INTERVAL '3 months' AS DATE)
      UNION ALL
      SELECT O_ORDERPRIORITY
      FROM pg.ORDERS o
      WHERE O_ORDERDATE >= '1997-07-01'
      AND   O_ORDERDATE < CAST(DATE '1997-07-01' +INTERVAL '1 day' AS DATE))
GROUP	BY O_ORDERPRIORITY
ORDER	BY O_ORDERPRIORITY;

You can continue to experiment with the dataset and explore the three main use cases from the post, Build a Simplified ETL and Live Data Query Solution using Redshift Federated Query.

Deleting the CloudFormation stack

When you are finished, delete the CloudFormation stack; some of the AWS resources in this walkthrough incur a cost if you continue to use them. Complete the following steps:

  1. On the AWS CloudFormation console, choose Stacks.
  2. Choose the stack you launched in this walkthrough. The stack must be currently running.
  3. In the stack details pane, choose Delete.
  4. Choose Delete stack.

Summary

This post showed you how to automate the creation of an Aurora PostgreSQL and Amazon Redshift cluster preloaded with the TPC-H dataset, the prerequisites of the new Amazon Redshift Federated Query feature using AWS CloudFormation, and a single manual step to complete the setup. The post also provided some example federated queries using the TPC-H dataset, which you can use to accelerate your learning and adoption of the new features. You can continue to modify the CloudFormation templates from this post to support your business needs.

If you have any questions or suggestions, please leave a comment.

 


About the Authors

BP Yau is a Data Warehouse Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate the Oracle Data Warehouse to Amazon Redshift and built the next generation big data analytics platform using AWS technologies.

 

 

 

Srikanth Sopirala is a Sr. Specialist Solutions Architect focused on Analytics at AWS. He is passionate about helping customers build scalable data and analytics solutions in the cloud.

Build a Simplified ETL and Live Data Query Solution using Redshift Federated Query

Post Syndicated from Tito Mijares original https://aws.amazon.com/blogs/big-data/build-a-simplified-etl-and-live-data-query-solution-using-redshift-federated-query/

You may have heard the saying that the best ETL is no ETL. Amazon Redshift now makes this possible with Federated Query. In its initial release, this feature lets you query data in Amazon Aurora PostgreSQL or Amazon RDS for PostgreSQL using Amazon Redshift external schemas. Federated Query also exposes the metadata from these source databases through system views and driver APIs, which allows business intelligence tools like Tableau and Amazon QuickSight to connect to Amazon Redshift and query data in PostgreSQL without having to make local copies. This enables a new data warehouse pattern—live data query—in which you can seamlessly retrieve data from PostgreSQL databases, or build data into a late binding view, which combines operational PostgreSQL data, analytical Amazon Redshift local data, and historical Amazon Redshift Spectrum data in an Amazon S3 data lake.

Simplified ETL use case

For this ETL use case, you can simplify the familiar upsert pattern with a federated query. You can bypass the need for incremental extracts in Amazon S3 and the subsequent load via COPY by querying the data in place within its source database. This change can be a single line of code that replaces the COPY command with a query to an external table. See the following code:

BEGIN;
CREATE TEMP TABLE staging (LIKE ods.store_sales);
-- replace the following COPY from S3 
COPY staging FROM 's3://yourETLbucket/daily_store_sales/' 
     IAM_ROLE 'arn:aws:iam::<account_id>:role/<s3_reader_role>' DELIMITER '|' COMPUPDATE OFF;
-- with this federated query to load staging data from PostgreSQL source
INSERT INTO staging SELECT * FROM pg.store_sales p
	WHERE p.last_updated_date > (SELECT MAX(last_updated_date) FROM ods.store_sales);
DELETE FROM ods.store_sales USING staging s WHERE ods.store_sales.id = s.id;
INSERT INTO ods.store_sales SELECT * FROM staging;
DROP TABLE staging;
COMMIT;

In the preceding example, the table pg.store_sales resides in PostgreSQL, and you use a federated query to retrieve fresh data to load into a staging table in Amazon Redshift, keeping the actual delete and insert operations unchanged. This pattern is likely the most common application of federated queries.

Setting up an external schema

The external schema pg in the preceding example was set up as follows:

CREATE EXTERNAL SCHEMA IF NOT EXISTS pg                                                                         
FROM POSTGRES                                                                                                           
DATABASE 'dev' 
SCHEMA 'retail'                                                                                     
URI 'database-1.cluster-ro-samplecluster.us-east-1.rds.amazonaws.com'                                                    
PORT 5432                                                                                                               
IAM_ROLE 'arn:aws:iam::555566667777:role/myFederatedQueryRDS'                                                           
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:555566667777:secret:MyRDSCredentials-TfzFSB'

If you’re familiar with the CREATE EXTERNAL SCHEMA command from using it in Spectrum, note some new parameter options to enable federated queries.

FROM POSTGRES                                                                                                           
DATABASE 'dev' 
SCHEMA 'retail'

Whereas Amazon Redshift Spectrum references an external data catalog that resides within AWS Glue, Amazon Athena, or Hive, this code points to a Postgres catalog. Also, expect more keywords used with FROM, as Amazon Redshift supports more source databases for federated querying. By default, if you do not specify SCHEMA, it defaults to public.

Within the target database, you identify DATABASE ‘dev’ and SCHEMA ‘retail’, so any queries to the Amazon Redshift table pg.<some_table> get issued to PostgreSQL as a request for retail.<some_table> in the dev database. For Amazon Redshift, query predicates are pushed down and run entirely in PostgreSQL, which reduces the result set returned to Amazon Redshift for subsequent operations. Going further, the query planner derives cardinality estimates for external tables to optimize joins between Amazon Redshift and PostgreSQL. From the preceding example:

URI 'database-1.cluster-ro-samplecluster.us-east-1.rds.amazonaws.com'                                                    
PORT 5432

The URI and PORT parameters that reference both the PostgreSQL endpoint and port are self-explanatory, but there are a few things to consider in your configuration:

  • Use a read replica endpoint in Aurora or Amazon RDS for PostgreSQL to reduce load on the primary instance.
  • Set up your Amazon RDS for PostgreSQL instance, Aurora serverless or provisioned instances, and Amazon Redshift clusters to use the same VPC and subnet groups. That way, you can add the security group for the cluster to the inbound rules of the security group for the Aurora or Amazon RDS for PostgreSQL instance.
  • If both Amazon Redshift and Aurora or Amazon RDS for PostgreSQL are on different VPCs, set up VPC peering. For more information, see What is VPC Peering?

Configuring AWS Secrets Manager for remote database credentials

To retrieve AWS Secrets Manager remote database credentials, our example uses the following code:

IAM_ROLE 'arn:aws:iam::555566667777:role/myFederatedQueryRDS'                                                           
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:555566667777:secret:MyRDSCredentials-TfzFSB'

These two parameters are interrelated because the SECRET_ARN is also embedded in the IAM policy for the role.

If a service like Secrets Manager didn’t exist and you wanted to issue a federated query from Amazon Redshift to PostgreSQL, you would need to supply the database credentials to the CREATE EXTERNAL SCHEMA command via a parameter like CREDENTIALS, which you also use with the COPY command. However, this hardcoded approach doesn’t take into account that the PostgreSQL credentials could expire.

You avoid this problem by keeping PostgreSQL database credentials within Secrets Manager, which provides a centralized service to manage secrets. Because Amazon Redshift retrieves and uses these credentials, they are transient and not stored in any generated code and are discarded after query execution.

Storing credentials in Secrets Manager takes up to a few minutes. To store a new secret, complete the following steps:

 

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. In the Store a new secret section, complete the following:

 

  • Supply your PostgreSQL database credentials
  • Name the secret; for example, MyRDSCredentials
  • Configure rotation (you can enable this at a later time)
  • Optionally, copy programmatic code for accessing your secret using your preferred programming languages (which is not needed for this post)
  4. Choose Next.

You can also retrieve the credentials easily.

  1. On the Secrets Manager console, choose your secret.
  2. Choose Retrieve secret value.

The following screenshot shows you the secret value details.

This secret is now an AWS resource referenced via a secret ARN. See the following screenshot.

Setting up an IAM role

You can now pull everything together by embedding the secret ARN into an IAM policy, naming the policy, and attaching it to an IAM role. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AccessSecret",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetResourcePolicy",
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret",
                "secretsmanager:ListSecretVersionIds"
            ],
            "Resource": "arn:aws:secretsmanager:us-east-1:555566667777:secret:MyRDSCredentials-TfzFSB"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetRandomPassword",
                "secretsmanager:ListSecrets"
            ],
            "Resource": "*"
        }
    ]
}

The following screenshot shows the details of the IAM role called myFederatedQueryRDS, which contains the MyRDSSecretPolicy policy. It’s the same role that’s supplied in the IAM_ROLE parameter of the CREATE EXTERNAL SCHEMA DDL.

Finally, attach the same IAM role to your Amazon Redshift cluster.

  1. On the Amazon Redshift console, choose your cluster.
  2. From the Actions drop-down menu, choose Manage IAM roles.
  3. Choose and add the IAM role you just created.

You have now completed the following steps:

  1. Create an IAM policy and role
  2. Store your PostgreSQL database credentials in Secrets Manager
  3. Create an Amazon Redshift external schema definition that uses the secret and IAM role to authenticate with a PostgreSQL endpoint
  4. Apply a mapping from an Amazon Redshift database and schema to a PostgreSQL database and schema so that Amazon Redshift can issue queries to PostgreSQL tables.

You only need to complete this configuration one time.
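
As a quick sanity check of the finished configuration, a simple federated count against the external schema confirms that Amazon Redshift can reach PostgreSQL. This example assumes the pg external schema and store_sales table used earlier in this post; substitute any table that exists in your source schema:

SELECT COUNT(*) FROM pg.store_sales;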

Querying live operational data

This section explores another use case: querying operational data across multiple source databases. In this use case, a global online retailer has databases deployed by different teams across distinct geographies:

  • Region us-east-1 runs serverless Aurora PostgreSQL.
  • Region us-west-1 runs provisioned Aurora PostgreSQL, which is also configured as a global database with a read replica in us-east-1.
  • Region eu-west-1 runs an Amazon RDS for PostgreSQL instance with a read replica in us-east-1.

Serverless and provisioned Aurora PostgreSQL and Amazon RDS for PostgreSQL are visible in the Amazon RDS console in Region us-east-1. See the following screenshot:

For this use case, assume that you configured the read replicas for Aurora and Amazon RDS to share the same VPC and subnets in us-east-1 with the local serverless Aurora PostgreSQL. Furthermore, you have already created secrets for each of these instances’ credentials, and also an IAM role with a MyCombinedRDSSecretPolicy policy, which is more permissive and allows Amazon Redshift to retrieve the value of any Amazon RDS secret within any Region. Be mindful of security in actual production use, however, and explicitly specify the resource ARNs for each secret in separate statements in your IAM policy. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AccessSecret",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetResourcePolicy",
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret",
                "secretsmanager:ListSecretVersionIds"
            ],
            "Resource": "arn:aws:secretsmanager:*:555566667777:secret:*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetRandomPassword",
                "secretsmanager:ListSecrets"
            ],
            "Resource": "*"
        }
    ]
}

External schema DDLs in Amazon Redshift can then reference the combined IAM role and individual secret ARNs. See the following code:

CREATE EXTERNAL SCHEMA IF NOT EXISTS useast
FROM POSTGRES
DATABASE 'dev'
URI 'us-east-1-aurora-pg-serverless.cluster-samplecluster.us-east-1.rds.amazonaws.com'
PORT 5432
IAM_ROLE 'arn:aws:iam::555566667777:role/MyCombinedRDSFederatedQuery'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:555566667777:secret:MyEastUSAuroraServerlessCredentials-dXOlEq'
;

CREATE EXTERNAL SCHEMA IF NOT EXISTS uswest
FROM POSTGRES
DATABASE 'dev'
URI 'global-aurora-pg-west-coast-stores-instance-1.samplecluster.us-east-1.rds.amazonaws.com'
PORT 5432
IAM_ROLE 'arn:aws:iam::555566667777:role/MyCombinedRDSFederatedQuery'
SECRET_ARN 'arn:aws:secretsmanager:us-west-1:555566667777:secret:MyWestUSAuroraGlobalDBCredentials-p3sV9m'
;

CREATE EXTERNAL SCHEMA IF NOT EXISTS europe
FROM POSTGRES
DATABASE 'dev'
URI 'eu-west-1-postgres-read-replica.samplecluster.us-east-1.rds.amazonaws.com'
PORT 5432
IAM_ROLE 'arn:aws:iam::555566667777:role/MyCombinedRDSFederatedQuery'
SECRET_ARN 'arn:aws:secretsmanager:eu-west-1:555566667777:secret:MyEuropeRDSPostgresCredentials-mz2u9L'
;

This late binding view abstracts the underlying queries to TPC-H lineitem test data within all PostgreSQL instances. See the following code:

CREATE VIEW global_lineitem AS
SELECT 'useast' AS region, * from useast.lineitem
UNION ALL
SELECT 'uswest', * from uswest.lineitem
UNION ALL
SELECT 'europe', * from europe.lineitem
WITH NO SCHEMA BINDING
;

Amazon Redshift can query live operational data across multiple distributed databases and aggregate results into a unified view with this feature. See the following code:

dev=# SELECT region, extract(month from l_shipdate) as month,
      sum(l_extendedprice * l_quantity) - sum(l_discount) as sales
      FROM global_lineitem
      WHERE l_shipdate >= '1997-01-01'
      AND l_shipdate < '1998-01-01'
      AND month < 4
      GROUP BY 1, 2
      ORDER BY 1, 2
;
 region | month |      sales
--------+-------+------------------
 europe |     1 | 16036160823.3700
 europe |     2 | 15089300790.7200
 europe |     3 | 16579123912.6700
 useast |     1 | 16176034865.7100
 useast |     2 | 14624520114.6700
 useast |     3 | 16645469098.8600
 uswest |     1 | 16800599170.4600
 uswest |     2 | 14547930407.7000
 uswest |     3 | 16595334825.9200
(9 rows)

If you examine Remote PG Seq Scan in the following query plan, you see that predicates are pushed down for execution in all three PostgreSQL databases. Unlike your initial simplified ETL use case, no ETL is performed because data is queried and filtered in place. See the following code:

dev=# EXPLAIN SELECT region, extract(month from l_shipdate) as month,
      sum(l_extendedprice * l_quantity) - sum(l_discount) as sales
FROM global_lineitem
WHERE l_shipdate >= '1997-01-01'
AND l_shipdate < '1998-01-01'
AND month < 4
GROUP BY 1, 2
ORDER BY 1, 2
;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 XN Merge  (cost=1000000060145.67..1000000060146.17 rows=200 width=100)
   Merge Key: derived_col1, derived_col2
   ->  XN Network  (cost=1000000060145.67..1000000060146.17 rows=200 width=100)
         Send to leader
         ->  XN Sort  (cost=1000000060145.67..1000000060146.17 rows=200 width=100)
               Sort Key: derived_col1, derived_col2
               ->  XN HashAggregate  (cost=60136.52..60138.02 rows=200 width=100)
                     ->  XN Subquery Scan global_lineitem  (cost=20037.51..60130.52 rows=600 width=100)
                           ->  XN Append  (cost=20037.51..60124.52 rows=600 width=52)
                                 ->  XN Subquery Scan "*SELECT* 1"  (cost=20037.51..20041.51 rows=200 width=52)
                                       ->  XN HashAggregate  (cost=20037.51..20039.51 rows=200 width=52)
                                             ->  XN PG Query Scan lineitem  (cost=0.00..20020.84 rows=1667 width=52)
                                                   ->  Remote PG Seq Scan useast.lineitem  (cost=0.00..20000.00 rows=1667 width=52)
                                                         Filter: ((l_shipdate < '1998-01-01'::date) AND (l_shipdate >= '1997-01-01'::date) AND ("date_part"('month'::text, l_shipdate) < 4))
                                 ->  XN Subquery Scan "*SELECT* 2"  (cost=20037.51..20041.51 rows=200 width=52)
                                       ->  XN HashAggregate  (cost=20037.51..20039.51 rows=200 width=52)
                                             ->  XN PG Query Scan lineitem  (cost=0.00..20020.84 rows=1667 width=52)
                                                   ->  Remote PG Seq Scan uswest.lineitem  (cost=0.00..20000.00 rows=1667 width=52)
                                                         Filter: ((l_shipdate < '1998-01-01'::date) AND (l_shipdate >= '1997-01-01'::date) AND ("date_part"('month'::text, l_shipdate) < 4))
                                 ->  XN Subquery Scan "*SELECT* 3"  (cost=20037.51..20041.51 rows=200 width=52)
                                       ->  XN HashAggregate  (cost=20037.51..20039.51 rows=200 width=52)
                                             ->  XN PG Query Scan lineitem  (cost=0.00..20020.84 rows=1667 width=52)
                                                   ->  Remote PG Seq Scan europe.lineitem  (cost=0.00..20000.00 rows=1667 width=52)
                                                         Filter: ((l_shipdate < '1998-01-01'::date) AND (l_shipdate >= '1997-01-01'::date) AND ("date_part"('month'::text, l_shipdate) < 4))
(24 rows)

Combining the data lake, data warehouse, and live operational data

In this next use case, you join Amazon Redshift Spectrum historical data with current data in Amazon Redshift and live data in PostgreSQL. You use a 3TB TPC-DS dataset and unload data from 1998 through 2001 from the store_sales table in Amazon Redshift to Amazon S3. The unloaded files are stored in Parquet format with ss_sold_date_sk as partitioning key.

To access this historical data via Amazon Redshift Spectrum, create an external table. See the following code:

CREATE EXTERNAL TABLE spectrum.store_sales_historical
(
  ss_sold_time_sk int ,
  ss_item_sk int ,
  ss_customer_sk int ,
  ss_cdemo_sk int ,
  ss_hdemo_sk int ,
  ss_addr_sk int ,
  ss_store_sk int ,
  ss_promo_sk int ,
  ss_ticket_number bigint,
  ss_quantity int ,
  ss_wholesale_cost numeric(7,2) ,
  ss_list_price numeric(7,2) ,
  ss_sales_price numeric(7,2) ,
  ss_ext_discount_amt numeric(7,2) ,
  ss_ext_sales_price numeric(7,2) ,
  ss_ext_wholesale_cost numeric(7,2) ,
  ss_ext_list_price numeric(7,2) ,
  ss_ext_tax numeric(7,2) ,
  ss_coupon_amt numeric(7,2) ,
  ss_net_paid numeric(7,2) ,
  ss_net_paid_inc_tax numeric(7,2) ,
  ss_net_profit numeric(7,2)
)
PARTITIONED BY (ss_sold_date_sk int)
STORED AS PARQUET
LOCATION 's3://mysamplebucket/historical_store_sales/';   

The external spectrum schema is defined as the following:

CREATE EXTERNAL SCHEMA spectrum
FROM data catalog DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::555566667777:role/mySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

Instead of an Amazon S3 read-only policy, the IAM role mySpectrumRole contains both the AmazonS3FullAccess and AWSGlueConsoleFullAccess policies; the former allows Amazon Redshift to write to Amazon S3. See the following code:

UNLOAD ('SELECT * FROM tpcds.store_sales WHERE ss_sold_date_sk < 2452276')
TO 's3://mysamplebucket/historical_store_sales/'
IAM_ROLE 'arn:aws:iam::555566667777:role/mySpectrumRole'
FORMAT AS PARQUET
PARTITION BY (ss_sold_date_sk) ALLOWOVERWRITE;

To make partitioned data visible, the ALTER TABLE ... ADD PARTITION command needs to specify all partition values. For this use case, 2450816 through 2452275 correspond to dates 1998-01-02 through 2001-12-31, respectively. To generate these DDLs quickly, use the following code:

WITH partitions AS (SELECT * FROM generate_series(2450816, 2452275))
SELECT 'ALTER TABLE spectrum.store_sales_historical ADD PARTITION (ss_sold_date_sk='|| generate_series || ') '
    || 'LOCATION \'s3://mysamplebucket/historical_store_sales/ss_sold_date_sk=' || generate_series || '/\';'
FROM partitions;

You can run the generated ALTER TABLE statements individually or as a batch to make partition data visible. See the following code:

ALTER TABLE spectrum.store_sales_historical 
ADD PARTITION (ss_sold_date_sk=2450816)
LOCATION 's3://mysamplebucket/historical_store_sales/ss_sold_date_sk=2450816/';
-- repeated for all partition values

The three combined sources in the following view consist of historical data in Amazon S3 for 1998 through 2001, current data local to Amazon Redshift for 2002, and live data for two months of 2003 in PostgreSQL. When you create this late binding view, you have to re-order Amazon Redshift Spectrum external table columns because the previous UNLOAD operation specifying ss_sold_date_sk as partition key shifted that column’s order to last. See the following code:

CREATE VIEW store_sales_integrated AS
SELECT * FROM uswest.store_sales_live
UNION ALL
SELECT * FROM tpcds.store_sales_current
UNION ALL
SELECT ss_sold_date_sk, ss_sold_time_sk, ss_item_sk, ss_customer_sk, ss_cdemo_sk, 
       ss_hdemo_sk, ss_addr_sk, ss_store_sk, ss_promo_sk, ss_ticket_number, 
       ss_quantity, ss_wholesale_cost, ss_list_price, ss_sales_price, 
       ss_ext_discount_amt, ss_ext_sales_price, ss_ext_wholesale_cost, 
       ss_ext_list_price, ss_ext_tax, ss_coupon_amt, ss_net_paid, 
       ss_net_paid_inc_tax, ss_net_profit
FROM spectrum.store_sales_historical
WITH NO SCHEMA BINDING;

You can now run a query on the view to aggregate date and join tables across the three sources. See the following code:

dev=# SELECT extract(year from b.d_date), count(a.ss_sold_date_sk)
FROM store_sales_integrated a
JOIN tpcds.date_dim b on (a.ss_sold_date_sk = b.d_date_sk)
GROUP BY 1
ORDER BY 1
;
 date_part |   count
-----------+------------
      1998 | 1632403114
      1999 | 1650163390
      2000 | 1659168880
      2001 | 1641184375
      2002 | 1650209644
      2003 |   17994540
(6 rows)

Time: 77624.926 ms (01:17.625)

This federated query ran on a two-node DC2.8XL cluster and took 1 minute and 17 seconds to join store sales in Amazon S3, PostgreSQL, and Amazon Redshift with the date dimension table in Amazon Redshift, aggregating and sorting row counts by year. The following code shows its query plan:

dev=# EXPLAIN SELECT extract(year from b.d_date), count(a.ss_sold_date_sk)
FROM store_sales_integrated a
JOIN tpcds.date_dim b on (a.ss_sold_date_sk = b.d_date_sk)
GROUP BY 1
ORDER BY 1;

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 XN Merge  (cost=1461036320912.29..1461036321094.91 rows=73049 width=8)
   Merge Key: "date_part"('year'::text, b.d_date)
   ->  XN Network  (cost=1461036320912.29..1461036321094.91 rows=73049 width=8)
         Send to leader
         ->  XN Sort  (cost=1461036320912.29..1461036321094.91 rows=73049 width=8)
               Sort Key: "date_part"('year'::text, b.d_date)
               ->  XN HashAggregate  (cost=461036314645.93..461036315011.18 rows=73049 width=8)
                     ->  XN Hash Join DS_DIST_ALL_NONE  (cost=913.11..428113374829.91 rows=6584587963204 width=8)
                           Hash Cond: ("outer".ss_sold_date_sk = "inner".d_date_sk)
                           ->  XN Subquery Scan a  (cost=0.00..263498674836.70 rows=6584587963204 width=4)
                                 ->  XN Append  (cost=0.00..197652795204.66 rows=6584587963204 width=4)
                                       ->  XN Subquery Scan "*SELECT* 1"  (cost=0.00..539836.20 rows=17994540 width=4)
                                             ->  XN PG Query Scan store_sales_live  (cost=0.00..359890.80 rows=17994540 width=4)
                                                   ->  Remote PG Seq Scan uswest.store_sales_live  (cost=0.00..179945.40 rows=17994540 width=4)
                                       ->  XN Subquery Scan "*SELECT* 2"  (cost=0.00..33004193.28 rows=1650209664 width=4)
                                             ->  XN Seq Scan on store_sales_current  (cost=0.00..16502096.64 rows=1650209664 width=4)
                                       ->  XN Subquery Scan "*SELECT* 3"  (cost=0.00..197619251175.18 rows=6582919759000 width=4)
                                             ->  XN Partition Loop  (cost=0.00..131790053585.18 rows=6582919759000 width=4)
                                                   ->  XN Seq Scan PartitionInfo of spectrum.store_sales_historical  (cost=0.00..10.00 rows=1000 width=4)
                                                   ->  XN S3 Query Scan store_sales_historical  (cost=0.00..131658395.18 rows=6582919759 width=0)
                                                         ->  S3 Seq Scan spectrum.store_sales_historical location:"s3://mysamplebucket/historical_store_sales" format:PARQUET (cost=0.00..65829197.59 rows=6582919759 width=0)
                           ->  XN Hash  (cost=730.49..730.49 rows=73049 width=8)
                                 ->  XN Seq Scan on date_dim b  (cost=0.00..730.49 rows=73049 width=8)
(23 rows)

The query plan shows full sequential scans on the three source tables, with the returned row counts totaling approximately 8.2 billion. Because Amazon Redshift Spectrum doesn't generate statistics for external tables, you manually set the numRows property to the row count for the historical data in Amazon S3. See the following code:

ALTER TABLE spectrum.store_sales_historical SET TABLE PROPERTIES ('numRows' = '6582919759');

You can join with another dimension table local to Amazon Redshift, this time the 30 million row customer table, and filter by column c_birth_country. See the following code:

dev=# SELECT extract(year from b.d_date), count(a.ss_sold_date_sk)
FROM store_sales_integrated a
JOIN tpcds.date_dim b on (a.ss_sold_date_sk = b.d_date_sk)
JOIN tpcds.customer c on (a.ss_customer_sk = c.c_customer_sk)
AND c.c_birth_country = 'UNITED STATES'
GROUP BY 1
ORDER BY 1
;
 date_part |  count
-----------+---------
      1998 | 7299277
      1999 | 7392156
      2000 | 7416905
      2001 | 7347920
      2002 | 7390590
      2003 |   81627
(6 rows)

Time: 77878.586 ms (01:17.879)

The following is the query plan:

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 XN Merge  (cost=1363288861214.20..1363288861396.83 rows=73049 width=8)
   Merge Key: "date_part"('year'::text, b.d_date)
   ->  XN Network  (cost=1363288861214.20..1363288861396.83 rows=73049 width=8)
         Send to leader
         ->  XN Sort  (cost=1363288861214.20..1363288861396.83 rows=73049 width=8)
               Sort Key: "date_part"('year'::text, b.d_date)
               ->  XN HashAggregate  (cost=363288854947.85..363288855313.09 rows=73049 width=8)
                     ->  XN Hash Join DS_DIST_ALL_NONE  (cost=376252.50..363139873158.03 rows=29796357965 width=8)
                           Hash Cond: ("outer".ss_sold_date_sk = "inner".d_date_sk)
                           ->  XN Hash Join DS_BCAST_INNER  (cost=375339.39..362394963295.79 rows=29796357965 width=4)
                                 Hash Cond: ("outer".ss_customer_sk = "inner".c_customer_sk)
                                 ->  XN Subquery Scan a  (cost=0.00..263498674836.70 rows=6584587963204 width=8)
                                       ->  XN Append  (cost=0.00..197652795204.66 rows=6584587963204 width=8)
                                             ->  XN Subquery Scan "*SELECT* 1"  (cost=0.00..539836.20 rows=17994540 width=8)
                                                   ->  XN PG Query Scan store_sales_live  (cost=0.00..359890.80 rows=17994540 width=8)
                                                         ->  Remote PG Seq Scan uswest.store_sales_live  (cost=0.00..179945.40 rows=17994540 width=8)
                                             ->  XN Subquery Scan "*SELECT* 2"  (cost=0.00..33004193.28 rows=1650209664 width=8)
                                                   ->  XN Seq Scan on store_sales_current  (cost=0.00..16502096.64 rows=1650209664 width=8)
                                             ->  XN Subquery Scan "*SELECT* 3"  (cost=0.00..197619251175.18 rows=6582919759000 width=8)
                                                   ->  XN Partition Loop  (cost=0.00..131790053585.18 rows=6582919759000 width=8)
                                                         ->  XN Seq Scan PartitionInfo of spectrum.store_sales_historical  (cost=0.00..10.00 rows=1000 width=4)
                                                         ->  XN S3 Query Scan store_sales_historical  (cost=0.00..131658395.18 rows=6582919759 width=4)
                                                               ->  S3 Seq Scan spectrum.store_sales_historical location:"s3://mysamplebucket/historical_store_sales" format:PARQUET (cost=0.00..65829197.59 rows=6582919759 width=4)
                                 ->  XN Hash  (cost=375000.00..375000.00 rows=135755 width=4)
                                       ->  XN Seq Scan on customer c  (cost=0.00..375000.00 rows=135755 width=4)
                                             Filter: ((c_birth_country)::text = 'UNITED STATES'::text)
                           ->  XN Hash  (cost=730.49..730.49 rows=73049 width=8)
                                 ->  XN Seq Scan on date_dim b  (cost=0.00..730.49 rows=73049 width=8)
(28 rows)

Query performance hardly changed from the previous query. Because the query only needed to scan two columns of the fact table (ss_sold_date_sk and ss_customer_sk), it benefits from Parquet's columnar structure for the historical data subquery. To put it another way, if the historical data were stored as CSV, all the data would be scanned, which would degrade performance significantly.

Additionally, the TPC-DS model does not store date values in the store_sales fact table; instead, a foreign key references the date_dim table. If you plan to implement something similar and frequently filter by a date column, consider adding that column to the fact table and using it as the sort key, and also making it a partition column for the Amazon Redshift Spectrum external table. That way, Amazon Redshift can skip blocks more efficiently for the local data, prune partitions for the Amazon S3 data, and push filtering criteria down to Redshift Spectrum.
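
The following is a minimal sketch of that pattern; the table, column, and bucket names are illustrative rather than the TPC-DS definitions used earlier:

-- Local fact table: carry the date column and make it the sort key
CREATE TABLE sales_local (
  sold_date DATE,
  item_sk   INT,
  amount    DECIMAL(12,2)
)
DISTSTYLE EVEN
SORTKEY (sold_date);

-- External Redshift Spectrum table: partition by the same column so that
-- filters on sold_date prune Amazon S3 partitions
CREATE EXTERNAL TABLE spectrum.sales_historical (
  item_sk INT,
  amount  DECIMAL(12,2)
)
PARTITIONED BY (sold_date DATE)
STORED AS PARQUET
LOCATION 's3://my-sample-bucket/sales_historical/';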

Conclusion

Applications of live data integration in real-world scenarios include data discovery, data preparation for machine learning, operational analytics, IoT telemetry analytics, fraud detection, and compliance and security audits. Whereas Amazon Redshift Spectrum extends the reach of Amazon Redshift into the AWS data lake, Federated Query extends its reach into operational databases and beyond.

For more information about data type differences between these databases, see Data Type Differences Between Amazon Redshift and Supported RDS PostgreSQL or Aurora PostgreSQL Databases. For more information about accessing federated data with Amazon Redshift, see Limitations and Considerations When Accessing Federated Data with Amazon Redshift.

About the Authors

Tito Mijares is a Data Warehouse Specialist Solutions Architect at AWS. He helps AWS customers adopt and optimize their use of Amazon Redshift. Outside of AWS, he enjoys playing jazz guitar and working on audio recording and playback projects.

Entong Shen is a Senior Software Development Engineer for AWS Redshift. He has been working on MPP databases for over 8 years and has focused on query optimization, statistics and SQL language features such as stored procedures and federated query. In his spare time, he enjoys listening to music of all genres and working in his succulent garden.

Niranjan Kamat is a software engineer on the Amazon Redshift query processing team. His focus of PhD research was in interactive querying over large databases. In Redshift, he has worked in different query processing areas such as query optimization, analyze command and statistics, and federated querying. In his spare time, he enjoys playing with his three year old daughter, practicing table tennis (was ranked in top 10 in Ohio, USATT rating 2143), and chess.

Sriram Krishnamurthy is a Software Development Manager for AWS Redshift Query Processing team. He is passionate about Databases and has been working on Semi Structured Data Processing and SQL Compilation & Execution for over 15 years. In his free time, you can find him on the tennis court, often with his two young daughters in tow.

Improved speed and scalability in Amazon Redshift

Post Syndicated from Naresh Chainani original https://aws.amazon.com/blogs/big-data/improved-speed-and-scalability-in-amazon-redshift/

Since Amazon Redshift launched in 2012, its focus has always been on providing you with the best possible performance, at scale, and at the lowest possible cost. For most organizations across the globe, the amount of data going into their data warehouse is growing exponentially, and the number of people who want insights from that data increases daily. Because of this, Amazon Redshift is continually innovating to handle this growing volume of data and demand for insights.

Amazon Redshift delivers fast performance, at scale, for the most demanding workloads. Getting there was not easy, and it takes consistent investment across a variety of technical focus areas to make this happen. This post breaks down what it takes to build the world’s fastest cloud data warehouse.

Amazon Redshift's performance investments address a broad spectrum of needs and use extensive fleet telemetry data from tens of thousands of customers to guide development work. Your needs may vary based on a variety of factors, including the size of your data warehouse, the number of concurrent users, the skillset of those users, and workload characteristics such as frequency of usage and query latency requirements.

Based on this, Amazon Redshift has performance investments anchored in four key areas:

  • Out-of-the-box performance – Amazon Redshift is over twice as fast out of the box as it was 6 months ago, and keeps getting faster without any additional manual optimization and tuning
  • Automatic optimizations – Amazon Redshift is self-learning, self-optimizing, and constantly adapts to your actual workload to deliver the best possible performance
  • Scalability – Amazon Redshift can boost throughput by more than 35 times to support increases in concurrent users, and scales linearly for a wide range of workloads
  • Price-performance – Amazon Redshift provides predictable performance at significantly lower total cost than other data warehouse solutions by optimizing resource utilization

Out-of-the-box performance

Amazon Redshift has always enabled you to manually tune workload performance based on statically partitioning memory and specifying the number of concurrent queries. Amazon Redshift also uses advanced machine learning (ML) algorithms to tune configuration settings automatically for optimal data warehouse performance.

These algorithms are extremely effective because they are trained with real-world telemetry data generated from processing over two exabytes of data a day. Additionally, because Amazon Redshift has more customers than any other cloud data warehouse, the telemetry data makes the configuration recommendations from the ML algorithms even more accurate.

In addition, Amazon Redshift Advisor guides optimization by recommending sort keys, distribution keys, and more. However, if you want to override the learning, self-tuning behavior of Amazon Redshift, you still have fine-grained control.

The out-of-the-box performance of Amazon Redshift is continually improving. In November 2019, our Cloud Data Warehouse benchmark[1] showed that the out-of-the-box performance of Amazon Redshift was twice as fast as it was 6 months earlier. This ongoing improvement in performance is the culmination of many technical innovations. This post presents three improvements that have had the most impact.

Improved compression

Amazon Redshift uses AZ64, a novel compression encoding, which delivers high compression ratios and significantly improved query performance. With AZ64, you no longer need to make the trade-off between storage size and performance. AZ64 compresses data, on average, 35% more than the popular, high-performance LZO algorithm, and processes the data 40% faster. AZ64 achieves this by efficiently compressing small groups of data values and uses CPU vector instructions and single instruction, multiple data (SIMD) for parallel processing of AZ64-encoded columns. To benefit from this, you simply select the data type for each column, and Amazon Redshift chooses the compression encoding method. Five months after launch, the AZ64 encoding has become the fourth most popular encoding option in Amazon Redshift with millions of columns.
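
If you prefer to pin the encoding yourself rather than let Amazon Redshift choose it, you can set AZ64 explicitly with the column-level ENCODE attribute. The following is a minimal sketch with illustrative table and column names; AZ64 applies to numeric, date, and timestamp types:

CREATE TABLE sales_example (
  sale_id   BIGINT        ENCODE az64,
  sale_date DATE          ENCODE az64,
  amount    DECIMAL(12,2) ENCODE az64,
  notes     VARCHAR(256)  ENCODE lzo   -- AZ64 does not apply to character types
);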

Efficient large-scale join operations

When complex queries join large tables, massive amounts of data transfers over the network for the join processing on the Amazon Redshift compute nodes. This used to create network bottlenecks that impacted query performance.

Amazon Redshift now uses distributed bloom filters to enable high-performance joins for such workloads. Distributed bloom filters efficiently filter rows at the source that do not match the join relation, which greatly reduces the amount of data transferred over the network. Across the fleet, we see millions of selective bloom filters deployed each day, with some of them filtering more than one billion rows. This removes the network from being the bottleneck and improves overall query performance. Amazon Redshift automatically determines what queries are suitable for this optimization and adapts the optimization at runtime. In addition, Amazon Redshift uses CPU cache-optimized processing to make query execution more efficient for large-scale joins and aggregations. Together, these features improve performance for over 70% of the queries that Amazon Redshift processes daily.

Enhanced query planning

Query planning determines the quickest way to process a given query by evaluating the costs associated with various query plans. Costs include many factors, for example, the number of I/O operations required, the amount of disk buffer space, the time to read from disk, the parallelism of the query, and whether table statistics such as the number of rows and the number of distinct values in a column are fresh and relevant.

The Amazon Redshift query planner factors in the capabilities of modern hardware, including the network stack, to take full advantage of the performance that the hardware offers. Its statistic collection process is automated, and it computes statistics by using algorithms like HyperLogLog (HLL), which improves the quality of statistics and therefore enables the cost-based planner to make better choices.

Automatic optimizations

Amazon Redshift uses ML techniques and advanced graph algorithms to continuously improve performance. Different workloads have different data access patterns, and Amazon Redshift observes the workload to learn and adapt. It automatically adjusts data layout, distribution keys, and query plans to provide optimal performance for a given workload.

The automatic optimizations in Amazon Redshift make intelligent decisions, such as how to distribute data across nodes, which datasets are frequently queried together and should be co-located, how to sort data, and how many parallel and concurrent queries should run on the system based on the complexity of the query. Amazon Redshift automatically adjusts these configurations to optimize throughput and performance as you use the data warehouse. To make sure these optimizations do not interrupt your workload, Amazon Redshift runs them incrementally and only during the periods when clusters have low utilization.

For maintenance operations, Amazon Redshift reduces the amount of compute resources required by operating only on frequently accessed tables and portions within those tables. Amazon Redshift prioritizes which portions of the table to operate on by analyzing query patterns. When relevant, it provides prescriptive guidance through recommendations in Amazon Redshift Advisor. You can then evaluate and apply those recommendations as needed.

AWS also constantly evaluates and assesses how effective its ML-based self-learning systems are in delivering fast query performance when compared to traditional methods such as expert human tuning.

Amazon Redshift has been adding automatic optimizations for years. The following timeline shows some of the automatic optimizations delivered over the last 12 months. 

For more information about specific optimizations, see the posts about automatic VACUUM DELETE, automatic ANALYZE, distribution key recommendations, sort key recommendations, automatic table sort, and automatic distribution style.

Scalability

Amazon Redshift can boost throughput by more than 35 times to support increases in concurrent users, and scales linearly for simple and mixed workloads.

Scaling to support a growing number of users

As the number of users accessing the data warehouse grows, you should not experience delays in your query responses. Most of the time, your workloads are not predictable and can vary throughout the day. Traditional data warehouses are typically provisioned for peak usage to avoid delays and missed SLAs, and you end up paying for resources that you are not fully using.

Concurrency Scaling in Amazon Redshift allows the data warehouse to handle spikes in workloads while maintaining consistent SLAs by elastically scaling the underlying resources as needed. Amazon Redshift continuously monitors the designated workload. If the queries start to get backlogged because of bursts of user activity, Amazon Redshift automatically adds transient cluster capacity and routes the requests to these new clusters. This transient capacity is available in a few seconds, so your queries continue to be served with low latency. Amazon Redshift removes the additional transient capacity automatically when activity reduces on the cluster.

You can choose if you want to elastically scale for certain workloads and by how much with a simple one-step configuration. Every 24 hours that the Amazon Redshift main cluster is in use, you accrue a 1-hour credit. This makes concurrency scaling free for more than 97% of use cases.

With the ability to automatically add and remove additional capacity, Amazon Redshift data warehouses can improve overall throughput by over 35 times. This post demonstrates how far you can dynamically allocate more compute power to satisfy the demands of concurrent users with the following experiment. First, take a baseline measurement using the Cloud Data Warehouse benchmark and five concurrent users. You can then enable Concurrency Scaling and add more and more users with each iteration. As soon as Amazon Redshift detects queuing, it allocates additional scaling clusters automatically. Ultimately, this experiment ran over 200 concurrent queries on Amazon Redshift and generated more than 35 times greater throughput. This many concurrently executing queries represents a concurrent user population of several thousand. This demonstrates how you can support virtually unlimited concurrent users on your Amazon Redshift data warehouses.

The scaling for concurrent users is also linear. You get consistent increases in performance with every extra dollar spent on the Concurrency Scaling clusters. This helps to keep data warehouse costs predictable as business needs grow. With Concurrency Scaling, AWS can perform benchmark tests with tens of thousands of queries per hour, with hundreds of queries running concurrently and providing linear scale. This represents a real-world workload in enterprises with thousands of concurrent users connecting to the data warehouse.

Scaling while running multiple mixed workloads

As data warehouses grow over time, the number and complexity of the workloads that run on the data warehouse also increase. For example, if you migrate from an on-premises data warehouse to Amazon Redshift, you might first run traditional analytics workloads, and eventually bring more operational and real-time data into the cluster to build new use cases and applications. To scale any data warehouse effectively, you must be able to prioritize and manage multiple types of workloads concurrently. Automatic workload management (WLM) and query priorities are two recent capabilities added to Amazon Redshift that enable you to do just that.

Automatic WLM makes sure that you use cluster resources efficiently, even with dynamic and unpredictable workloads. With automatic WLM, Amazon Redshift uses ML to classify incoming queries based on their memory and processing requirements and routes them to appropriate internal queues to start executing in parallel. Amazon Redshift dynamically adjusts the number of queries to execute in parallel to optimize overall throughput and performance. When queries that require large amounts of resources are in the system (for example, hash joins between large tables), the concurrency is lower. When you submit lighter queries (such as inserts, deletes, or simple aggregations), concurrency is higher. There is a feedback loop to continuously monitor system utilization and regulate admission into the cluster.

However, not all queries may be equally important to you; the performance of one workload or set of users might be more important than others. Query priorities in Amazon Redshift address this. You can give higher-priority workloads preferential treatment, including more resources during busy times, for consistent query performance. Amazon Redshift workload management uses intelligent algorithms to make sure that lower-priority queries continue to make progress and don’t stall.

You can combine Amazon Redshift Concurrency Scaling and automatic WLM with query priorities to solve complex data warehouse use cases. For example, the following table summarizes an Amazon Redshift configuration that effectively mixes ETL with analytics workloads.

WLM queue                       | Queue priority | Concurrency Scaling | Notes
ETL                             | High           | Off                 | When ETL runs, it gets the highest priority
BI queries                      | Normal         | On                  | When BI workload suddenly increases, Concurrency Scaling adds capacity to maintain user SLAs
One-time or exploratory queries | Low            | Off                 | Cluster offers analytic access for casual users and data scientists when resources are available

For this use case, and many more, you can maintain SLAs, achieve efficiencies with your cluster utilization, and get sufficient flexibility to invest according to business priorities.

Price performance

You can measure price performance by calculating both the total cost of the computing service consumed and the total amount of computing work performed. Maximum performance for minimum cost gives you the best price performance.

As your data warehouses grow, Amazon Redshift gets more efficient. It moderates cost increases and keeps costs predictable, even as your needs grow. This sets Amazon Redshift apart from others in the market that increase in price much more as the number of users grows.

The investments in automatic optimizations, out-of-the-box performance, and scale all contribute to the unbeatable price performance that Amazon Redshift offers. When you compare typical customer quotes and investments, you find that Amazon Redshift costs 50–75% less than other cloud data warehouses.

Measuring performance

AWS measures performance, throughput, and price-performance on a nightly basis. AWS also runs larger and more comprehensive benchmarks regularly to make sure the tests extend beyond your current needs. For benchmark results to be useful, they need to be well defined and easily reproducible. AWS uses the Cloud DW benchmark based on current TPC-DS and TPC-H benchmarks without any query or data modifications and compliant with TPC rules and requirements.

It's important that anyone can reproduce these benchmarks; you can download the benchmark code and scripts from GitHub and the accompanying dataset from a public Amazon S3 bucket.

Summary

Amazon Redshift is self-learning, self-optimizing, and consistently uses telemetry of the actual workload to deliver the best possible performance. Amazon Redshift is more than twice as fast out of the box as it was 6 months ago, and keeps getting faster without any manual optimization and tuning. Amazon Redshift can boost throughput by more than 35 times to support increases in concurrent users and scales linearly for simple and mixed workloads.

In addition to software improvements, AWS continues to build data warehouses on the best hardware available. The new RA3 node type with managed storage features high-bandwidth networking and sizable high-performance SSDs as local caches. RA3 nodes use your workload patterns and advanced data management techniques, such as automatic fine-grained data eviction and intelligent data pre-fetching, to deliver the performance of local SSDs while scaling storage automatically to Amazon S3. The hardware-based performance improvements in preview with AQUA (Advanced Query Accelerator) bring even more dramatic performance improvements and drive costs down with a new distributed and hardware-accelerated cache.

These performance improvements are the cumulative result of years of strategic and sustained product investment and technical innovation across multiple areas such as automatic optimizations, out-of-the-box performance, and scalability. Additionally, price-performance remains a priority so you receive the best value.

Each dataset and workload has unique characteristics, and a proof of concept is the best way to understand how Amazon Redshift performs in your unique situation. When running your own proof of concept, it's important that you focus on the right metrics: query throughput (number of queries per hour) and price-performance for your workload.

You can make a data-driven decision by requesting assistance with a proof of concept or working with a system integration and consulting partner. It's also important to consider not only how a data warehouse performs with your current needs, but also its future performance with increasingly complex workloads, datasets, and users.

To stay up-to-date with the latest developments in Amazon Redshift, subscribe to the What’s New in Amazon Redshift RSS feed.

———
[1] The TPC Benchmark, TPC-DS and TPC-H are trademarks of the Transaction Processing Performance Council www.tpc.org


About the Author

Naresh Chainani is a Senior Software Development Manager at Amazon Redshift where he leads Query Processing, Query Performance, Distributed Systems and Workload Management with a strong team. Naresh is passionate about building high-performance databases to enable customers to gain timely insights and make critical business decisions. In his spare time, Naresh enjoys reading and playing tennis.

Berni Schiefer is a Senior Development Manager for EMEA at Amazon Web Services, leading the Amazon Redshift development team in Berlin, Germany. The Berlin team focuses on Redshift Performance and Scalability, SQL Query Compilation, Spatial Data support, and Redshift Spectrum. Previously, Berni was an IBM Fellow working in the area of Private Cloud, Db2, Db2 Warehouse, BigSQL, with a focus on SQL-based engines, query optimization and performance.

Neeraja Rentachintala is a seasoned Product Management and GTM leader at Amazon Web Services, bringing over 20 years of experience in Product Vision, Strategy and Leadership roles in industry-leading data products and platforms. During her career, she delivered products in Analytics, Big data, Databases, Data and Application Integration, AI/ML serving Fortune 500 enterprise and ventures including MapR (acquired by HPE), Microsoft, Oracle, Informatica and Expedia.com. Currently Neeraja is a Principal Product Manager with Amazon Web Services building Amazon Redshift – the world’s most popular, highest performance and most scalable cloud data warehouse. Neeraja earned a Bachelor of Technology in Electronics and Communication Engineering from the National Institute of Technology in India and various business program certifications from the University of Washington, MIT Sloan School of Management and Stanford University.

Amazon Redshift update – ra3.4xlarge instances

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/amazon-redshift-update-ra3-4xlarge-instances/

Since we launched Amazon Redshift as a cloud data warehouse service more than seven years ago, tens of thousands of customers have built their workloads using it. We are always listening to your feedback and, in December last year, we announced our third-generation RA3 node type, which gives you the ability to scale compute and storage separately. Previous-generation DS2 and DC2 nodes had a fixed amount of storage and required adding more nodes to your cluster to increase storage capacity. The new RA3 nodes let you determine how much compute capacity you need to support your workload and then scale the amount of storage based on your needs. The first member of the RA3 family was the ra3.16xlarge, which many customers told us was fantastic, but more capacity than their workloads needed.

Today we are adding a new smaller member to the RA3 family: the ra3.4xlarge.

The RA3 node type is based on AWS Nitro and includes support for Redshift managed storage. Redshift managed storage automatically manages data placement across tiers of storage and caches the hottest data in high-performance SSD storage while automatically offloading colder data to Amazon Simple Storage Service (S3). Redshift managed storage uses advanced techniques such as block temperature, data block age, and workload patterns to optimize performance.

RA3 nodes with managed storage are a great fit for analytics workloads that require massive storage capacity and can be a great fit for workloads such as operational analytics, where the subset of data that is most important evolves constantly over time. In the past, there was pressure to offload or archive old data to other storage because of fixed storage limits. This made maintaining the operational analytics data set and the larger historical dataset difficult to query when needed.

The new ra3.4xlarge node provides 12 vCPUs, 96 GiB of RAM, and addresses up to 64 TB of managed storage. A cluster can contain up to 32 of these instances, for a total storage of 2048 TB (that's 2 petabytes!).

The differences between ra3.16xlarge and ra3.4xlarge nodes are summarized in the table below.

Node type    | vCPU | Memory  | Addressable Storage | I/O      | Price (US East (N. Virginia))
ra3.4xlarge  | 12   | 96 GiB  | 64 TB RMS           | 2 GB/sec | $3.26 per Hour
ra3.16xlarge | 48   | 384 GiB | 64 TB RMS           | 8 GB/sec | $13.04 per Hour

To create a new cluster, I use the Amazon Redshift console or the AWS Command Line Interface (AWS CLI). In the console, I choose Create cluster and select ra3.4xlarge instances.

If you have a DS2 or DC2 instance-based cluster, you can create a new RA3 cluster to evaluate the new instances with managed storage. Use a recent snapshot of your Amazon Redshift DS2 or DC2 cluster to create a new cluster based on ra3.4xlarge instances, and keep the two clusters running in parallel to evaluate the compute needs of your application.

You can resize your RA3 cluster at any time by using elastic resize to add or remove compute capacity. If elastic resize is not available for your chosen configuration, you can do a classic resize instead.

RA3 instances are now available in 14 AWS Regions: US East (Ohio), US East (N. Virginia), US West (N. California), US West (Oregon), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Paris), Canada (Central), and South America (São Paulo).

Prices vary from one Region to another, starting at $3.26 per node per hour in US East (N. Virginia). Check the Amazon Redshift pricing page for details.

— seb

Speeding up Etleap models at AXS with Amazon Redshift materialized views

Post Syndicated from Christian Romming original https://aws.amazon.com/blogs/big-data/speeding-up-etleap-models-at-axs-with-amazon-redshift-materialized-views/

The materialized views feature in Amazon Redshift is now generally available and has been benefiting customers and partners in preview since December 2019. One customer, AXS, is a leading ticketing, data, and marketing solutions provider for live entertainment venues in the US, UK, Europe, and Japan. Etleap, an Amazon Redshift partner, is an extract, transform, load, and transform (ETLT) service built for AWS. AXS uses Etleap to ingest data into Amazon Redshift from a variety of sources, including file servers, Amazon S3, relational databases, and applications. These ingestion pipelines parse, structure, and load data into Amazon Redshift tables with appropriate column types and sort and distribution keys.

Improving dashboard performance with Etleap models

To analyze data, AXS typically runs queries against large tables that originate from multiple sources. One of the ways that AXS uses Amazon Redshift is to power interactive dashboards. To achieve fast dashboard load times, AXS pre-computes partial answers to the queries the dashboards use. These partial answers are orders of magnitude smaller, in terms of the number of rows, than the tables on which they are based. By querying the Amazon Redshift tables that hold these pre-computed partial answers, dashboards load much faster than they would if they queried the base tables directly.

Etleap supports creating and managing such pre-computations through a feature called models. A model consists of a SELECT query and triggers that determine when it should be updated. An example of a trigger is a change to a base table, that is, a table used by the SELECT statement that defines the model. This way, the model can remain consistent with its base tables.

The following screenshot shows an Etleap model with two base table dependencies.

Etleap represents models as tables in Amazon Redshift. To create the model table, Etleap wraps the SELECT statement in a CREATE TABLE AS (CTAS) query. When an update is triggered, for example, due to base table inserts, updates, or deletes, Etleap recomputes the model table through the following code:

CREATE TABLE model_temporary AS SELECT …;
DROP TABLE model;
ALTER TABLE model_temporary RENAME TO model;

Analyzing CTAS performance as data grows

AXS manages a large number of Etleap models. For one particular model, the CTAS query takes over 6 minutes, on average. This query performs an aggregation on a join of three different tables, including an event table that is constantly ingesting new data and contains over a billion rows. The following graph shows that the CTAS query time increases as the event table increases in number of rows.

There are two key problems with the query taking longer:

  • There’s a longer delay before the updated model is available to analysts
  • The model update consumes more Amazon Redshift cluster resources

To address this, AXS would have to resort to workarounds that are either inconvenient or costly, such as archiving older data from the event table or expanding the Amazon Redshift cluster to increase available resources.

Comparing CTAS to materialized views

Etleap decided to run an experiment to verify that Amazon Redshift’s materialized views feature is an improvement over the CTAS approach for this AXS model. First, they built the materialized view by wrapping the SELECT statement in a CREATE MATERIALIZED VIEW AS query. For updates, instead of recreating the materialized view every time that data in a base table changes, a REFRESH MATERIALIZED VIEW query is sufficient. The expectation was that using materialized views would be significantly faster than the CTAS-based procedure. The following graph compares query times of CTAS to materialized view refresh.

Running REFRESH MATERIALIZED VIEW was 7.9 times faster than the CTAS approach—it took 49 seconds instead of 371 seconds on average at the current scale. Additionally, the update time was roughly proportional to the number of rows that were added to the base table since the last update, rather than the total size of the base table. In this use case, this number is 3.8 million, which corresponds to the approximate number of events ingested per day.

This is great news. The solution solves the previous problems because the delay caused by the model update stays constant as new data comes in, and so do the resources that Amazon Redshift consumes (assuming the growth rate of the base table is constant). In other words, using materialized views eliminates the need for workarounds, such as archiving or cluster expansion, as the dataset grows. It also simplifies the refresh procedure for model updates by reducing the number of SQL statements from three (CREATE, DROP, and RENAME) to one (REFRESH).
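
As a minimal sketch of that simplification (the model name and SELECT body here are hypothetical; the real model definition lives in Etleap), the three-statement rebuild shown earlier collapses to a materialized view plus a single refresh:

-- Define the model once as a materialized view (hypothetical query body)
CREATE MATERIALIZED VIEW model AS
SELECT e.event_date, e.venue_id, COUNT(*) AS ticket_count
FROM ticket_sales t
JOIN events e ON t.event_id = e.id
GROUP BY e.event_date, e.venue_id;

-- On each trigger, a single statement brings the model up to date,
-- and Amazon Redshift refreshes it incrementally when possible
REFRESH MATERIALIZED VIEW model;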

Achieving fast refresh performance with materialized views

Amazon Redshift can refresh a materialized view efficiently and incrementally. It keeps track of the last transaction in the base tables up to which the materialized view was previously refreshed. During subsequent refreshes, Amazon Redshift processes only the newly inserted, updated, or deleted tuples in the base tables, referred to as a delta, to bring the materialized view up-to-date with its base tables. In other words, Amazon Redshift can incrementally maintain the materialized view by reading only base table deltas, which leads to faster refresh times.

For AXS, Amazon Redshift analyzed their materialized view definitions, which join multiple tables, filters, and aggregates, to figure out how to incrementally maintain their specific materialized view. Each time AXS refreshes the materialized view, Amazon Redshift quickly determines if a refresh is needed, and if so, incrementally maintains the materialized view. As records are ingested into the base table, the materialized view refresh times shown are much faster and grow very slowly because each refresh reads a delta that is small and roughly the same size as the other deltas. In comparison, the refresh times using CTAS are much slower because each refresh reads all the base tables. Moreover, the refresh times using CTAS grow much faster because the amount of data that each refresh reads grows with the ingest rate.

You are in full control of when to refresh your materialized views. For example, AXS refreshes their materialized views based on triggers defined in Etleap. As a result, transactions that are run on base tables do not incur additional cost to maintain dependent materialized views. Decoupling the base tables’ updates from the materialized view’s refresh gives AXS an easy way to insulate their dashboard users and offers them a well-defined snapshot to query, while ingesting new data into base tables. When AXS vets the next batch of base table data via their ETL pipelines, they can refresh their materialized views to offer the next snapshot of dashboard results.

In addition to efficiently maintaining their materialized views, AXS also benefits from the simplicity of Amazon Redshift storing each materialized view as a plain table. Queries on the materialized view perform with the same world-class speed that Amazon Redshift runs any query. You can organize a materialized view like other tables, which means that you can exploit distribution key and sort columns to further improve query performance. Finally, when you need to process many queries at peak times, Amazon Redshift’s concurrency scaling kicks in automatically to elastically scale query processing capacity.
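
For example, a materialized view can carry its own distribution and sort keys just like a table; the following sketch uses the same hypothetical names as above:

-- Hypothetical example: physical tuning attributes on a materialized view
CREATE MATERIALIZED VIEW model_by_venue
DISTKEY (venue_id)
SORTKEY (event_date)
AS
SELECT e.venue_id, e.event_date, COUNT(*) AS ticket_count
FROM ticket_sales t
JOIN events e ON t.event_id = e.id
GROUP BY e.venue_id, e.event_date;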

Conclusion

Now that the materialized views feature is generally available, Etleap gives you the option of using materialized views rather than tables when creating models. You can use models more actively as part of your ETLT strategies, and also choose more frequent update schedules for your models, due to the performance benefits of incremental refreshes.

For more information about Amazon Redshift materialized views, see Materialize your Amazon Redshift Views to Speed Up Query Execution and Creating Materialized Views in Amazon Redshift.

About the Author

Christian Romming is the founder and CEO of Etleap.  Etleap is a managed ETL solution for AWS that doesn’t require extensive engineering work to set up, maintain, and scale.

Prasad Varakur is a Database, Big Data & Distributed Systems enthusiast, and Product Manager at Amazon Web Services. Prior to this, he has developed Database and Storage engines at SAP/Sybase, Couchbase, Huawei, Novell, EMC, and Veritas. He holds 11 patents in database systems and distributed computing, and his thesis has contributed foundational works of Parametric Query Optimization. He holds Master’s degree in Computer Science from IIT, Kanpur.

Vuk Ercegovac is a principal engineer for Redshift at AWS.

Halodoc: Building the Future of Tele-Health One Microservice at a Time

Post Syndicated from Annik Stahl original https://aws.amazon.com/blogs/architecture/halodoc-building-the-future-of-tele-health-one-microservice-at-a-time/

Halodoc, a Jakarta-based healthtech platform, uses tele-health and artificial intelligence to connect patients, doctors, and pharmacies. Join builder Adrian De Luca for this special edition of This is My Architecture as he dives deep into the solutions architecture of this Indonesian healthtech platform that provides healthcare services in one of the most challenging traffic environments in the world.

Explore how the company evolved its monolithic backend into decoupled microservices with Amazon EC2 and Amazon Simple Queue Service (SQS), adopted serverless to cost effectively support new user functionality with AWS Lambda, and manages the high volume and velocity of data with Amazon DynamoDB, Amazon Relational Database Service (RDS), and Amazon Redshift.

For more content like this, subscribe to our YouTube channels This is My Architecture, This is My Code, and This is My Model, or visit the This is My Architecture AWS website, which has search functionality and the ability to filter by industry, language, and service.

Materialize your Amazon Redshift Views to Speed Up Query Execution

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/materialize-your-amazon-redshift-views-to-speed-up-query-execution/

At AWS, we take pride in building state-of-the-art virtualization technologies to simplify the management of and access to cloud services such as networks, computing resources, and object storage.

In a relational database management system (RDBMS), a view is virtualization applied to tables: it is a virtual table representing the result of a database query. Views are frequently used when designing a schema to present a subset of the data, to summarize data (such as aggregated or transformed data), or to simplify data access across multiple tables. When using a data warehouse such as Amazon Redshift, a view simplifies access to aggregated data from multiple tables for business intelligence (BI) tools such as Amazon QuickSight or Tableau.

Views provide ease of use and flexibility, but they do not speed up data access. The database system must evaluate the underlying query representing the view each time your application accesses the view. When performance is key, data engineers use CREATE TABLE AS (CTAS) as an alternative. A CTAS is a table defined by a query. The query is executed at table creation time and your applications can use the result like a normal table, with the downside that the CTAS data set is not refreshed when the underlying data is updated. Furthermore, the CTAS definition is not stored in the database system. It is not possible to know whether a table was created by a CTAS, which makes it difficult to track which CTAS tables need to be refreshed and which are current.
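
To make the distinction concrete, the following sketch uses the sales and store tables from the demo later in this post (the view and table names here are illustrative). The view is re-evaluated on every query, while the CTAS result goes stale as base data changes:

-- A regular view: the underlying query is re-evaluated every time it is queried
CREATE VIEW city_sales_view AS
SELECT st.city, SUM(sa.amount) AS total_sales
FROM sales sa, store st
WHERE sa.store_id = st.id
GROUP BY st.city;

-- A CTAS: the query runs once at creation time, and the result
-- is not refreshed when the base tables change
CREATE TABLE city_sales_ctas AS
SELECT st.city, SUM(sa.amount) AS total_sales
FROM sales sa, store st
WHERE sa.store_id = st.id
GROUP BY st.city;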

Today, we are introducing materialized views for Amazon Redshift. A materialized view (MV) is a database object containing the data of a query. A materialized view is like a cache for your view. Instead of building and computing the data set at run-time, the materialized view pre-computes, stores and optimizes data access at the time you create it. Data are ready and available to your queries just like regular table data.

Using materialized views in your analytics queries can speed up the query execution time by orders of magnitude because the query defining the materialized view is already executed and the data is already available to the database system.

Materialized views are especially useful for queries that are predictable and repeated over and over. Instead of performing resource-intensive queries on large tables, applications can query the pre-computed data stored in the materialized view.

When the data in the base tables are changing, you refresh the materialized view by issuing a Redshift SQL statement “refresh materialized view“. After issuing a refresh statement, your materialized view contains the same data as would have been returned by a regular view. Refreshes can be incremental or full refreshes (recompute). When possible, Redshift incrementally refreshes data that changed in the base tables since the materialized view was last refreshed.

Let's see how it works. I create a sample schema to store sales information: each sales transaction and details about the store where the sale took place.
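
The table definitions themselves aren't shown in this post; the following is a minimal sketch that is consistent with the queries below (the column types are assumptions):

-- Minimal sketch of the demo schema; column types are assumed
CREATE TABLE store (
  id   INT,
  city VARCHAR(64)
);

CREATE TABLE sales (
  id          INT,
  item        VARCHAR(128),
  store_id    INT,
  customer_id INT,
  amount      DECIMAL(12,2)
);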

To view the total amount of sales per city, I create a materialized view with the CREATE MATERIALIZED VIEW SQL statement. I connect to the Amazon Redshift console, select the query editor, and enter the following statement to create a materialized view (city_sales) that joins records from the two tables and aggregates the sales amount (sum(sales.amount)) per city (group by city):

CREATE MATERIALIZED VIEW city_sales AS (
  SELECT st.city, SUM(sa.amount) as total_sales
  FROM sales sa, store st
  WHERE sa.store_id = st.id
  GROUP BY st.city
);

The resulting schema is below:

Now I can query the materialized view just like a regular view or table and issue statements like "SELECT city, total_sales FROM city_sales" to get the following results. The join between the two tables and the aggregation (sum and group by) are already computed, resulting in significantly less data to scan.

When the data in the underlying base tables changes, the materialized view doesn't automatically reflect those changes. The data stored in the materialized view can be refreshed on demand with the latest changes from the base tables using the SQL REFRESH MATERIALIZED VIEW command. Let's see a practical example:

-- let's add a row in the sales base table
INSERT INTO sales (id, item, store_id, customer_id, amount)
VALUES(8, 'Gaming PC Super ProXXL', 1, 1, 3000);

SELECT city, total_sales FROM city_sales WHERE city = 'Paris';

city |total_sales|
-----|-----------|
Paris|        690|

-- the new sale is not taken into account!

-- let's refresh the materialized view
REFRESH MATERIALIZED VIEW city_sales;

SELECT city, total_sales FROM city_sales WHERE city = 'Paris';

city |total_sales|
-----|-----------|
Paris|       3690|

-- now the view has the latest sales data

The full code for this very simple demo is available as a gist.

You can start to use materialized views today in all AWS Regions.

There is nothing to change in your existing clusters to start using materialized views; you can create them today at no additional cost.

Happy building!

Lower your costs with the new pause and resume actions on Amazon Redshift

Post Syndicated from Sain Das original https://aws.amazon.com/blogs/big-data/lower-your-costs-with-the-new-pause-and-resume-actions-on-amazon-redshift/

Today's analytics workloads typically require a data warehouse to be available 24 hours a day, 7 days a week. However, there may be times when you need an Amazon Redshift cluster for a short duration of time at frequent (or infrequent) intervals. For example, you may run a periodic ETL job or use a cluster for testing and development and not use it during off-hours or weekends. In these cases, you may want an easy way to keep the data warehouse up and running only part of the time. Previously, you could accomplish this by making a backup, terminating the cluster, and later restoring the cluster from the snapshot. The pause and resume actions on Amazon Redshift are a much simpler way to suspend billing; they are designed for clusters that are out of operation for hours at a time, especially on a regularly scheduled basis.

Pausing a cluster suspends compute and retains the underlying data structures and data so you can resume the cluster at a later point in time. You can configure this through the Amazon Redshift console or by using the AWS CLI.

When the cluster is paused, the data warehouse’s storage incurs charges. On-demand compute billing is suspended and resumed on a per-second basis. Paused clusters still appear as an entry in the console. You can also automate the pause and resume actions by using a schedule that matches your operational needs.

Using the actions via the Amazon Redshift console

To use the pause and resume actions on the Amazon Redshift console, complete the following steps:

  1. On the Amazon Redshift console, choose Clusters.
  2. Choose your desired cluster.
  3. Choose Actions.
  4. Choose Pause.
  5. To determine when to pause the cluster, choose from the following three options:
    • To pause the cluster immediately, select Pause now.
    • To pause the cluster at a later point, select Pause later.
    • To pause and resume the cluster according to a set schedule, select Pause and resume on schedule.
  6. For this walkthrough, select Pause now.
  7. Choose Pause now.

The cluster is now in Modifying status. It can take up to a few minutes for the cluster to change to a Paused state, but the cost accrual for compute resources is suspended immediately.

The following screenshot shows a view of the cluster status.

Amazon Redshift processes any outstanding queries before it pauses the cluster. When the cluster is paused, you can still view it on the Amazon Redshift console, and the Resume action is available.

To resume the cluster, complete the following steps:

  1. On the Amazon Redshift console, choose Clusters.
  2. Choose your desired cluster.
  3. Choose Actions.
  4. Choose Resume.
  5. Choose when to resume the cluster. The options are the same as those for the pause action.
  6. For this walkthrough, select Resume now.
  7. Choose Resume now.

The cluster moves to Modifying status. Depending upon the size of the cluster, it can take several minutes to resume a cluster before queries can be processed. Billing only resumes when the cluster is available.

The following screenshot shows the view of the cluster status.

Using the actions via CLI

The following two commands pause and resume the cluster:

  • Pause-cluster
  • Resume-cluster

To pause a given cluster, enter the following code:

aws redshift pause-cluster --cluster-identifier <insert cluster identifier here>

To resume a paused cluster, enter the following code:

aws redshift resume-cluster --cluster-identifier <insert cluster identifier here>

Scheduling pause and resume actions

You can schedule pause and resume actions for specific times of the day and week. For example, this walkthrough pauses a cluster on Friday at 8:00 p.m. and resumes it on Monday at 7:00 a.m. You can configure this via the Amazon Redshift console or APIs.

Scheduling via the Amazon Redshift console

To schedule pause and resume actions on the Amazon Redshift console, complete the following steps:

  1. On the Amazon Redshift console, choose Clusters.
  2. Choose your desired cluster.
  3. Choose Actions.
  4. Choose Pause.
  5. Select Pause and resume on schedule.
  6. For Schedule name, enter a name for this schedule.
  7. Optionally, for Starts on and Ends on, enter the dates and times to start and end the schedule.
  8. For Pause every and Resume every, choose the time and day to pause and resume.
  9. Choose Schedule recurring pause and resume.

You can review existing pause and resume schedules on the Schedules tab. See the following screenshot.

Scheduling via CLI

The following CLI commands allow you to create, modify, and delete scheduled pause and resume tasks.

To create a scheduled action to occur one time, enter the following code:

aws redshift create-scheduled-action --scheduled-action-name test-resume --schedule "at(2020-02-21T02:00:00)" --target-action "{\"ResumeCluster\":{\"ClusterIdentifier\":\"redshift-cluster-1\"}}" --iam-role arn:aws:iam::<Account ID>:role/<Redshift Role>

To create a recurring scheduled action, enter the following code:

aws redshift create-scheduled-action --scheduled-action-name "scheduled-pause-repetitive" --target-action "{\"PauseCluster\":{\"ClusterIdentifier\":\"redshift-cluster-1\"}}" --schedule "cron(30 20 * * ? *)" --iam-role "arn:aws:iam::<Account ID>:role/<Redshift Role>"

The preceding code example pauses a cluster daily at 8:30 p.m. (UTC).

To modify an existing scheduled action, enter the following code:

aws redshift modify-scheduled-action --scheduled-action-name "scheduled-pause-repetitive" --schedule "cron(30 * * * ? *)"

The preceding code example modifies the scheduled-pause-repetitive schedule to run every hour at 30 minutes past the hour.

To delete a scheduled action, enter the following code:

aws redshift delete-scheduled-action --scheduled-action-name "scheduled-pause-repetitive"

Summary

The pause and resume actions on Amazon Redshift allow you to easily pause and resume clusters that may not be in operation at all times. You can schedule these actions to run at specific times on a recurring basis, or initiate a pause and a later resume manually. Flexible on-demand pricing and per-second billing give you greater control over the costs of your Amazon Redshift compute clusters while keeping your data simple to manage. You can run your data warehouse at the lowest cost possible without having to purchase a fixed amount of resources up front.

About the Author

Sain Das is a data warehouse specialist solutions architect with AWS.