Tag Archives: database

How Wind Mobility built a serverless data architecture

Post Syndicated from Pablo Giner original https://aws.amazon.com/blogs/big-data/how-wind-mobility-built-a-serverless-data-architecture/

Guest post by Pablo Giner, Head of BI, Wind Mobility.

Over the past few years, urban micro-mobility has become a trending topic. With pollution indexes hitting historic highs, cities and companies worldwide have been introducing regulations and working on a wide spectrum of solutions to alleviate the situation.

We at Wind Mobility strive to make commuters’ lives more sustainable and convenient by bringing short-distance urban transportation to cities worldwide.

At Wind Mobility, we scale our services at the same pace as our users demand them, and we do it in an economically and environmentally viable way. We optimize our fleet distribution to avoid overcrowding cities with more scooters than will actually be used, and we position them just meters away from where our users need them, at the time of day when they want them.

How do we do that? By optimizing our operations to their fullest. To do so, we need to be very well informed about our users’ behavior under varying conditions and understand our fleet’s potential.

Scalability and flexibility for rapid growth

We knew that before we could solve this challenge, we needed to collect data from many different sources, such as user interactions with our application, user demand, IoT signals from our scooters, and operational metrics. To analyze the numerous datasets collected and extract actionable insights, we needed to build a data lake. While the high-level goal was clear, the scope was less so. We were working hard to scale our operation as we continued to launch new markets. The rapid growth and expansion made it very difficult to predict the volume of data we would need to consume. We were also launching new microservices to support our growth, which resulted in more data sources to ingest. We needed an architecture that allowed us to be agile and quickly adapt to meet our growth. It became clear that a serverless architecture was best positioned to meet those needs, so we started to design our 100% serverless infrastructure.

The first challenge was ingesting and storing data from our scooters in the field, events from our mobile app, operational metrics, and partner APIs. We use AWS Lambda to capture changes in our operational databases and mobile app and push the events to Amazon Kinesis Data Streams, which allows us to take action in real time. We also use Amazon Kinesis Data Firehose to write the data to Amazon Simple Storage Service (Amazon S3), which we use for analytics.
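
To make the ingestion step concrete, the following is a minimal sketch of a Lambda handler that forwards change events to Kinesis Data Streams. The stream name, event shape, and partition key are hypothetical, not Wind Mobility’s actual implementation.

import json
import os

import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = os.environ.get("EVENT_STREAM_NAME", "mobility-events")  # hypothetical stream name

def handler(event, context):
    # Forward each application or scooter event to Kinesis Data Streams.
    records = event.get("events", [])
    for record in records:
        kinesis.put_record(
            StreamName=STREAM_NAME,
            Data=json.dumps(record).encode("utf-8"),
            PartitionKey=str(record.get("scooter_id", "unknown")),  # hypothetical key
        )
    return {"forwarded": len(records)}

From the stream, Kinesis Data Firehose delivers the same events to Amazon S3 for analytics, as described above.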

After the data was in Amazon S3 and adequately partitioned according to its most common use cases (we partition by date, region, and business line, depending on the data source), we had to find a way to query this data for both data profiling (understanding structure, content, and interrelationships) and ad hoc analysis. For that we chose AWS Glue crawlers to catalog our data and Amazon Athena to read from the AWS Glue Data Catalog and run queries. However, ad hoc analysis and data profiling are relatively sporadic tasks in our team, because most of the data processing compute hours are dedicated to transforming the multiple data sources into our data warehouse: consolidating the raw data, modeling it, adding new attributes, and picking the data elements that together serve 95% of our analytics and predictive needs.
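
For the ad hoc side, an Athena query against the partitioned, crawled data might be submitted as in the following sketch. The database, table, partition columns, and output location are hypothetical and only illustrate the date/region/business-line partitioning mentioned above.

import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="""
        SELECT business_line, COUNT(*) AS events
        FROM scooter_events
        WHERE event_date = '2020-05-01' AND region = 'eu-central'
        GROUP BY business_line
    """,
    QueryExecutionContext={"Database": "mobility_data_lake"},  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/adhoc/"},
)
print(response["QueryExecutionId"])

Most of our processing hours, however, go to the warehouse transformations described next.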

This is where all the heavy lifting takes place. We parse through millions of scooter and user events generated daily (over 300 events per second) to extract actionable insight. We selected AWS Glue to perform this task. Our primary ETL job reads the newly added raw event data from Amazon S3, processes it using Apache Spark, and writes the results to our Amazon Redshift data warehouse. AWS Glue plays a critical role in our ability to scale on demand. After careful evaluation and testing, we concluded that AWS Glue ETL jobs meet all our needs and free us from procuring and managing infrastructure.
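
For illustration, a heavily trimmed sketch of such a job follows; the bucket, connection, and table names are hypothetical, and the real job contains far more transformation logic.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read newly added raw events from S3 (path is hypothetical).
events = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-raw-events/scooters/"]},
    format="json",
)

# ... Spark transformations (consolidation, modeling, new attributes) go here ...

# Write the results to Amazon Redshift through a Glue connection
# (connection, database, and table names are hypothetical).
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=events,
    catalog_connection="example-redshift-connection",
    connection_options={"dbtable": "analytics.events", "database": "dwh"},
    redshift_tmp_dir="s3://example-glue-temp/redshift/",
)

job.commit()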

Architecture overview

The following diagram represents our current data architecture, showing two serverless data collection, processing, and reporting pipelines:

  • Operational databases from Amazon Relational Database Service (Amazon RDS) and MongoDB
  • IoT and application events, followed by Athena for data profiling and Amazon Redshift for reporting

Our data is curated and transformed multiple times a day using an automated pipeline running on AWS Glue. The team can now focus on analyzing the data and building machine learning (ML) applications.

We chose Amazon QuickSight as our business intelligence tool to help us visualize and better understand our operational KPIs. Additionally, we use Amazon Elastic Container Registry (Amazon ECR) to store our Docker images containing our custom ML algorithms and Amazon Elastic Container Service (Amazon ECS) where we train, evaluate, and host our ML models. We schedule our models to be trained and evaluated multiple times a day. Taking as input curated data about demand, conversion, and flow of scooters, we run the models to help us optimize fleet utilization for a particular city at any given time.

The following diagram represents how data from the data lake is incorporated into our ML training, testing, and serving system. First, our developers work on the application code and commit their changes, which are built into new Docker images by our CI/CD pipeline and stored in the Amazon ECR registry. These images are pushed into Amazon ECS and tested in DEV and UAT environments before moving to PROD (where they are triggered by the Amazon ECS task scheduler). During their execution, the Amazon ECS tasks (some train the demand and usage forecasting models, some produce the daily and hourly predictions, and others optimize the fleet distribution to satisfy the forecast) read their configuration and pull data from Amazon S3 (which has been previously produced by scheduled AWS Glue jobs), and finally store their results back into Amazon S3. Executions of these pipelines are tracked via MLflow (on a dedicated Amazon Elastic Compute Cloud (Amazon EC2) server), and the final result, indicating the required fleet operations, is fed into a Kepler map, which is then consumed by the operators in the field.

Conclusion

We at Wind Mobility place data at the forefront of our operations. For that, we need our data infrastructure to be as flexible as the industry and the context we operate in, which is why we chose serverless. Over the course of a year, we have built a data lake, a data warehouse, a BI suite, and a variety of (production) data science applications. All of that with a very small team.

Also, within the last 12 months, we have scaled up several of our data pipelines by a factor of 10, without slowing our momentum or redesigning any part of our architecture. When we had to double our fleet in 1 week and increase the frequency at which we capture data from scooters by a factor of 10, our serverless data architecture scaled with no issues. This allowed us to focus on adding value by simplifying our operation, reacting to changes quickly, and delighting our users.

We have measured our success in multiple dimensions:

  • Speed – Serverless is faster to deploy and expand; we believe we have reduced our time to market for the entire infrastructure by a factor of 2
  • Visibility – We have 360 degree visibility of our operations worldwide, accessible by our city managers, finance team, and management board
  • Optimized fleet deployment – We know, at any minute of the day, the number of scooters that our customers need over the next few hours, which reduces unsatisfied demand by more than 50%

If you face a similar challenge, our advice is clear: go fully serverless and use the spectrum of solutions available from AWS.

Follow us and discover more about Wind Mobility on Facebook, Instagram and LinkedIn.

About the Author

Pablo Giner is Head of BI at Wind Mobility. Pablo’s background is in wheels (motorcycle racing > vehicle engineering > collision insurance > eScooters sharing…) and for the last few years he has specialized in forming and developing data teams. At Wind Mobility, he leads the data function (data engineering + analytics + data science), and the project he is most proud of is what they call smart fleet rebalancing, an AI-backed solution to reposition their fleet in real time. “In God we trust. All others must bring data.” – W. Edwards Deming

Process data with varying data ingestion frequencies using AWS Glue job bookmarks

Post Syndicated from Dipankar Kushari original https://aws.amazon.com/blogs/big-data/process-data-with-varying-data-ingestion-frequencies-using-aws-glue-job-bookmarks/

We often have data processing requirements in which we need to merge multiple datasets with varying data ingestion frequencies. Some of these datasets are ingested one time in full, received infrequently, and always used in their entirety, whereas other datasets are incremental, received at certain intervals, and joined with the full datasets to generate output. To address this requirement, this post demonstrates how to build an extract, transform, and load (ETL) pipeline using AWS Glue.

Using AWS Glue

AWS Glue provides a serverless environment to extract, transform, and load a large number of datasets from several sources for analytics purposes. It has a feature called job bookmarks to process incremental data when rerunning a job on a scheduled interval. A job bookmark is composed of the states for various job elements, such as sources, transformations, and targets. This is done by persisting state information from a job run that helps AWS Glue prevent reprocessing old data.

For this use case, we use an AWS Glue job with job bookmarks enabled to process files received in varying frequencies (a full dataset signifying files that are received one time, and incremental datasets signifying files that are received in certain regular intervals). These files are merged together. In addition to enabling job bookmarks, we also use an optional parameter transformation_ctx (transformation context) in an AWS Glue PySpark dynamic frame. This acts as a unique identifier for the ETL operator instance to identify state information within a job bookmark for a given operator. AWS Glue uses transformation_ctx to index the key to the bookmark state.

You can capture and maintain state information for incremental datasets and avoid reprocessing by using transformation context. Transformation context is omitted for the full dataset file, which results in the job run state information not getting captured for the full dataset and allowing it to participate in the next processing event in its entirety. Even though the job bookmark flag is enabled at the AWS Glue job level, because transformation context is omitted for the full dataset, every time the job runs, the entire data from the full dataset is used as part of the job. In contrast, only the newly arrived datasets are processed for the incremental datasets.
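
A minimal sketch of that pattern follows (the S3 paths are hypothetical); note that job.init and job.commit are what actually load and persist the bookmark state between runs.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # loads the bookmark state for this run

# Incremental dataset: transformation_ctx lets the bookmark track what was already read.
incremental = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/tripdata/"]},  # hypothetical path
    format="csv",
    format_options={"withHeader": True},
    transformation_ctx="incremental",
)

# Full dataset: transformation_ctx is omitted, so the whole file is read on every run.
full = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/tripdata-lookup/"]},  # hypothetical path
    format="csv",
    format_options={"withHeader": True},
)

# ... join, transform, and write the output here ...

job.commit()  # persists the updated bookmark state for the next run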

Solution overview

To demonstrate the job bookmark utility of AWS Glue, we use TLC Trip Record Data datasets. We use NYC yellow taxi trip data monthly files as the incremental dataset, and NYC taxi zone lookup as the full dataset. The monthly yellow taxi trip data has a field named PULocationID (where a customer was picked up), which is joined with the LocationID field from the NYC taxi zone lookup file to create an output dataset that contains Borough, Zone, and service_zone from the NYC taxi zone lookup dataset and all the fields (except the PULocationID field) from the monthly NYC taxi trip data file.

The following diagram depicts a high-level architecture of the process.

The architecture includes the following components:

  • Two Amazon S3 raw bucket locations are used for storing the incoming CSV source data: the NYC taxi monthly files (incremental dataset) and the NYC taxi zone lookup file (full dataset).
  • A bookmark-enabled AWS Glue job joins the monthly trip data file with the taxi zone lookup file to generate output Parquet files, and creates the NYC taxi trip table in the AWS Glue Data Catalog and the Amazon Redshift database.
  • An S3 curated bucket is used to store the processed NYC taxi monthly Parquet files.

Creating the AWS CloudFormation stack

You use the following AWS CloudFormation template to create the required resources in your preferred AWS account and Region.

Additionally, make sure you have an Amazon EC2 key pair created in the account and Region you’re launching the stack from.

To provide the stack parameters, complete the following steps:

  1. For Stack name, enter BigDataBlog-GlueBookmark-Stack.
  2. For RedshiftClusterIdentifier, enter bigdatablogrscluster.
  3. For NodeType, choose large.
  4. For NumberOfNodes, choose 2.
  5. For DatabaseName, enter bigdatablogdev.
  6. For MasterUserName, enter bigdatabloguser.
  7. For MasterUserPassword, enter a password for the master user account.
  8. For Maintenancewindow, enter sun:05:00-sun:05:30.
  9. For EC2InstanceType, choose micro.
  10. For SubscriptionEmail, enter your preferred email.
  11. For MyIPAddressCidr, enter your IP address.

You can find your IP Address by browsing https://www.whatismyip.com/ and looking up the value for My Public IPv4 is:. Add /32 at the end to make it CIDR-compatible and most restrictive.

  12. For DestinationPrefixListId, enter your prefix list ID.

To find your ID, set AWS credentials by entering aws configure in the command prompt. Run aws ec2 describe-prefix-lists to get the PrefixListId where PrefixListName is com.amazonaws.<<AWS region>>.s3 from the output.
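
If you prefer to look the ID up programmatically, the following sketch performs the same lookup with boto3 (substitute your own Region):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # use the Region of your stack
response = ec2.describe_prefix_lists(
    Filters=[{"Name": "prefix-list-name", "Values": ["com.amazonaws.us-east-1.s3"]}]
)
print(response["PrefixLists"][0]["PrefixListId"])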

  13. For NewS3BucketName, enter the name of your S3 bucket.

  14. For gluedatabase, enter bigdatabloggluedb.
  15. For EC2KeyName, enter the name of your key pair.

For instructions on creating a stack, see Creating a Stack on the AWS CloudFormation Console.

Make sure the stack is complete before moving to the next steps.

Creating the AWS Glue job

To create your AWS Glue job, complete the following steps:

  1. Download NYC yellow monthly trip data for October 2019 and November 2019 and save them under the s3://<<Your S3 Bucket>>/tripdata/ prefix.
  2. Download the NYC Taxi Zone lookup table and save it under the s3://<<Your S3 Bucket>>/tripdata-lookup/ prefix.
  3. Use the following PySpark script and change the piece of the code enclosed inside <<…>>.

You can find the values for the following keys on the Outputs tab for the CloudFormation stack:

    • S3Bucket
    • Snstopic

You can find the values for the following keys on the Parameters tab for the CloudFormation stack:

    • EC2KeyName
    • MyIPAddressCidr
    • NewS3BucketName
    • SubscriptionEmail

  4. When the AWS Glue script is ready, upload it to the S3 bucket under the s3://<<Your S3 Bucket>>/glue-script/ prefix.

You refer to this when you create the AWS Glue job.

  5. On the AWS Glue console, under ETL, choose Jobs.
  6. Choose Create job.
  7. For Name, enter a name for the job. For more information about AWS Glue job naming, see Jobs.
  8. For IAM role, choose the role the CloudFormation template created. Use the value for the key Glueaccessrole from the stack outputs.
  9. For Type, choose Spark.
  10. For Glue version, choose Spark 2.4, Python 3 (Glue Version 1.0).
  11. For This job runs, choose An existing script that you provide.
  12. For S3 path where the script is stored, choose the script file that you saved earlier under the s3://<<Your S3 Bucket>>/glue-script/ prefix.
  13. In the Advanced properties section, for Job bookmark, choose Enable.
  14. For Catalog options, select Use Glue Data Catalog as the Hive metastore.
  15. For Connections, enter the value of the key GlueConnection from the stack outputs.
  16. Choose Save job and edit script.

Creating an Amazon Redshift database schema

Before you run the AWS Glue job, you need to connect to the Amazon Redshift cluster and create an Amazon Redshift database schema named Glue_bookmark_redshift_schema. To connect to the cluster, use one of the JDBC client-based SQL tools, such as SQL Workbench/J. For instructions, see How can I access a private Amazon Redshift cluster from my local machine?

To access the cluster, you use the Amazon Redshift master user bigdatabloguser (the value for MasterUserName on the Parameters tab of the CloudFormation stack) and the password you provided when creating the stack.

Running AWS Glue job

The AWS Glue job takes only one argument: the name of the file being processed. Pass the file name, such as yellow_tripdata_2019-10.csv, when processing that file. This enables you to track the records that belong to a specific file so that it’s easier to evaluate the result of multiple job runs using different files.
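
Inside the script, that single argument is typically read with getResolvedOptions, as in the following sketch. The parameter name file_name is an assumption, not necessarily the name used in the original script.

import sys

from awsglue.utils import getResolvedOptions

# "file_name" is assumed here; pass it to the job as --file_name yellow_tripdata_2019-10.csv
args = getResolvedOptions(sys.argv, ["JOB_NAME", "file_name"])
file_being_processed = args["file_name"]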

When the Glue job run is successful, you can see the output Parquet files under the /tripdata-joined-output/ prefix inside the S3 bucket you created by running the CloudFormation template. You can also use Amazon Athena to query the data from the table created in the Data Catalog. For more information, see Running SQL Queries Using Amazon Athena.

Query the Amazon Redshift database table named redshift_bookmark_table and review the output.

Explaining the solution

A bookmark-enabled AWS Glue job (in PySpark) is created that reads the monthly NYC yellow taxi trip file, joins it with the NYC taxi zone lookup file, produces files in Parquet format, and saves them in an Amazon S3 location.

A Data Catalog table is created that refers to the Parquet files’ location in Amazon S3. The resulting dataset is also loaded into an Amazon Redshift table using the AWS Glue PySpark job.

The AWS Glue job bookmark transformation context is used while the AWS Glue dynamic frame is created by reading a monthly NYC taxi file, whereas the transformation context is disabled while reading and creating the dynamic frame for the taxi zone lookup file (because the entire file is required for processing each monthly trip file). This allows you to process each monthly trip file exactly one time and reuse the entire taxi zone lookup file as many times as required because the missing transformation context for the lookup file doesn’t allow the bookmark context to be set for that file.

When a new NYC trip data monthly file arrives and the AWS Glue job runs, it only processes the newly arrived monthly file and ignores any previously processed monthly files. Similarly, when the Data Catalog table data is copied into Amazon Redshift, it only copies the newly processed underlying Parquet files’ data and appends it to the Amazon Redshift table. At this time the transformation context is enabled to utilize the job bookmark, and the AWS Glue dynamic frame is created by reading the Data Catalog table.

The following PySpark code uses the transformation context to create an AWS Glue dynamic frame while reading the monthly incremental file:

taxidata = GlueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": [InputDir]},
    format="csv",
    format_options={"withHeader": True, "separator": ",", "quoteChar": '"', "escaper": '"'},
    transformation_ctx="taxidata")

However, the following code omits transformation context when creating the AWS Glue dynamic frame for the lookup file:

Lookupdata = GlueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": [InputLookupDir]},
    format="csv",
    format_options={"withHeader": True, "separator": ",", "quoteChar": '"', "escaper": '"'})

Additionally, the following code uses the transformation context while reading the Data Catalog table, which is loaded into an Amazon Redshift table:

datasource0 = GlueContext.create_dynamic_frame.from_catalog(database = Glue_catalog_database, table_name = Glue_table_name, transformation_ctx = "datasource0")

You can see in the screenshot below that the 2019 October yellow taxi trip data file has arrived for processing (the incremental dataset).

To process each month’s data, you need the taxi zone lookup (full dataset).

The following screenshot shows the output of the AWS Glue job after processing the 2019 October trip data, saved in Parquet format.

The following two screenshots show the Amazon Redshift table: one displays the record count for the October 2019 taxi data, and the other confirms that only the October 2019 taxi data file has been processed so far.

The following screenshot shows that the November 2019 NYC taxi data file has arrived for processing.

The following screenshot shows the output of the AWS Glue job after processing the 2019 November trip data, saved in Parquet format. The job processed only the November data and did not reprocess the October data, because the job bookmark and transformation context were enabled.

The following screenshot shows that the Amazon Redshift table now has both October and November data and shows the total record count.

The following screenshot shows individual record count for each month.

Querying with Athena

You can also review the dataset in Athena, which uses the same Glue Data Catalog. The following screenshot of an Athena query shows the Data Catalog table has both October and November data, with the total record count.

The following screenshot of an Athena query shows the individual record count for each month.

The following screenshot shows the location information, including borough, zone, and service zone, which is available in the taxi zone lookup and is joined with the October taxi trip data.

The following screenshot shows the output for the same query on the November data.

Cleaning up

When you’re done using this solution, you should delete the CloudFormation stack to avoid incurring any further charges.

Conclusion

This post describes how you can merge datasets received in different frequencies as part of your ETL pipeline processing using AWS Glue job bookmarks. The use case demonstrated how to use job bookmarks and transformation context to build an ETL pipeline for processing several incremental datasets.

About the Authors

Dipankar is a Senior Data Architect with AWS Professional Services, helping customers build analytics platforms and solutions. He has a keen interest in distributed computing. Dipankar enjoys spending time playing chess and watching old Hollywood movies.

Ashok Padmanabhan is a big data consultant with AWS Professional Services, helping customers build big data and analytics platforms and solutions. When not building and designing data lakes, Ashok enjoys spending time at beaches near his home in Florida.

Moovit embraces data lake architecture by extending their Amazon Redshift cluster to analyze billions of data points every day

Post Syndicated from Yonatan Dolan original https://aws.amazon.com/blogs/big-data/moovit-embraces-data-lake-architecture-by-extending-their-amazon-redshift-cluster-to-analyze-billions-of-data-points-every-day/

Amazon Redshift is a fast, fully managed, cloud-native data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence tools.

Moovit is a leading Mobility as a Service (MaaS) solutions provider and maker of the top urban mobility app. Guiding over 800 million users in more than 3,200 cities across 103 countries to get around town effectively and conveniently, Moovit has experienced exponential growth of their service in the last few years. The company amasses up to 6 billion anonymous data points a day to add to the world’s largest repository of transit and urban mobility data, aided by Moovit’s network of more than 685,000 local editors that help map and maintain local transit information in cities that would otherwise be unserved.

Like Moovit, many companies today are using Amazon Redshift to analyze data and perform various transformations on the data. However, as data continues to grow and become even more important, companies are looking for more ways to extract valuable insights from the data, such as big data analytics, numerous machine learning (ML) applications, and a range of tools to drive new use cases and business processes. Companies are looking to access all their data, all the time, by all users and get fast answers. The best solution for all those requirements is for companies to build a data lake, which is a centralized repository that allows you to store all your structured, semi-structured, and unstructured data at any scale.

With a data lake built on Amazon Simple Storage Service (Amazon S3), you can easily run big data analytics using services such as Amazon EMR and AWS Glue. You can also query structured data (such as CSV, Avro, and Parquet) and semi-structured data (such as JSON and XML) by using Amazon Athena and Amazon Redshift Spectrum. You can also use a data lake with ML services such as Amazon SageMaker to gain insights.

Moovit uses an Amazon Redshift cluster to allow different company teams to analyze vast amounts of data. They wanted a way to extend the collected data into the data lake and allow additional analytical teams to access more data to explore new ideas and business cases.

Additionally, Moovit was looking to manage their storage costs and evolve to a model that allowed cooler data to be maintained at the lowest cost in S3, and maintain the hottest data in Redshift for the most efficient query performance. The proposed solution implemented a hot/cold storage pattern using Amazon Redshift Spectrum and reduced the local disk utilization on the Amazon Redshift cluster to make sure costs are maintained. Moovit is currently evaluating the new RA3 node with managed storage as an additional level of flexibility that will allow them to easily scale the amount of hot/cold storage without limit.

In this post we demonstrate how Moovit, with the support of AWS, implemented a lake house architecture by employing the following best practices:

  • Unloading data into Amazon Simple Storage Service (Amazon S3)
  • Instituting a hot/cold pattern using Amazon Redshift Spectrum
  • Using AWS Glue to crawl and catalog the data
  • Querying data using Athena

Solution overview

The following diagram illustrates the solution architecture.

The solution includes the following steps:

  1. Unload data from Amazon Redshift to Amazon S3
  2. Create an AWS Glue Data Catalog using an AWS Glue crawler
  3. Query the data lake in Amazon Athena
  4. Query Amazon Redshift and the data lake with Amazon Redshift Spectrum

Prerequisites

To complete this walkthrough, you must have the following prerequisites:

  1. An AWS account.
  2. An Amazon Redshift cluster.
  3. The following AWS services and access: Amazon Redshift, Amazon S3, AWS Glue, and Athena.
  4. The appropriate AWS Identity and Access Management (IAM) permissions for Amazon Redshift Spectrum and AWS Glue to access Amazon S3 buckets. For more information, see IAM policies for Amazon Redshift Spectrum and Setting up IAM Permissions for AWS Glue.

Walkthrough

To demonstrate the process Moovit used during their data architecture, we use the industry-standard TPC-H dataset provided publicly by the TPC organization.

The Orders table has the following columns:

Column           Type
O_ORDERKEY       int4
O_CUSTKEY        int4
O_ORDERSTATUS    varchar
O_TOTALPRICE     numeric
O_ORDERDATE      date
O_ORDERPRIORITY  varchar
O_CLERK          varchar
O_SHIPPRIORITY   int4
O_COMMENT        varchar
SKIP             varchar

Unloading data from Amazon Redshift to Amazon S3

Amazon Redshift allows you to unload your data using a data lake export to an Apache Parquet file format. Parquet is an efficient open columnar storage format for analytics. Parquet format is up to twice as fast to unload and consumes up to six times less storage in Amazon S3, compared with text formats.

To unload cold or historical data from Amazon Redshift to Amazon S3, you need to run an UNLOAD statement similar to the following code (substitute your IAM role ARN):

UNLOAD ('SELECT o_orderkey, o_custkey, o_orderstatus, o_totalprice, o_orderdate, o_orderpriority, o_clerk, o_shippriority, o_comment, skip
FROM tpc.orders
ORDER BY o_orderkey, o_orderdate') 
TO 's3://tpc-bucket/orders/' 
CREDENTIALS 'aws_iam_role=arn:aws:iam::<account_number>:role/<Role>'
FORMAT AS parquet allowoverwrite PARTITION BY (o_orderdate);

It is important to define a partition key or column that minimizes Amazon S3 scans as much as possible based on the query patterns intended. The query pattern is often by date ranges; for this use case, use the o_orderdate field as the partition key.

Another important recommendation when unloading is to have file sizes between 128 MB and 512 MB. By default, the UNLOAD command splits the results into one or more files per node slice (virtual worker in the Amazon Redshift cluster), which allows you to use the Amazon Redshift MPP architecture. However, this can potentially cause files created by every slice to be small. In Moovit’s use case, the default UNLOAD using PARALLEL ON yielded dozens of small (megabyte-sized) files. For Moovit, PARALLEL OFF yielded the best results because it aggregated all the slices’ work into the LEADER node and wrote it out as a single stream, controlling the file size using the MAXFILESIZE option.
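
As a sketch of that approach, the following submits an UNLOAD with PARALLEL OFF and MAXFILESIZE through the Amazon Redshift Data API. The cluster, database, user, bucket, and role names are hypothetical.

import boto3

redshift_data = boto3.client("redshift-data")

unload_sql = """
UNLOAD ('SELECT * FROM tpc.orders ORDER BY o_orderkey, o_orderdate')
TO 's3://tpc-bucket/orders_unpartitioned/'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
FORMAT AS PARQUET
ALLOWOVERWRITE
PARALLEL OFF
MAXFILESIZE 256 MB;
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="example-cluster",  # hypothetical
    Database="dev",                       # hypothetical
    DbUser="awsuser",                     # hypothetical
    Sql=unload_sql,
)
print(response["Id"])  # statement ID; poll describe_statement to check completion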

Another performance enhancement applied in this use case was the use of Parquet’s min and max statistics. Parquet files have min_value and max_value column statistics for each row group that allow Amazon Redshift Spectrum to prune (skip) row groups that are out of scope for a query (range-restricted scan). To use row group pruning, you should sort the data by frequently-used columns. Min/max pruning helps scan less data from Amazon S3, which results in improved performance and reduced cost.

After unloading the data to your data lake, you can view your Parquet file’s content in Amazon S3 (assuming it’s under 128 MB). From the Actions drop-down menu, choose Select from.

You’re now ready to populate your Data Catalog using an AWS Glue crawler.

Creating a Data Catalog with an AWS Glue crawler

To query your data lake using Athena, you must catalog the data. The Data Catalog is an index of the location, schema, and runtime metrics of the data.

An AWS Glue crawler accesses your data store, extracts metadata (such as field types), and creates a table schema in the Data Catalog. For instructions, see Working with Crawlers on the AWS Glue Console.
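
For reference, a crawler for the unloaded Parquet prefix could be created and started with boto3 along the lines of the following sketch (the crawler name, role, database, and path are hypothetical):

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="orders-datalake-crawler",                                 # hypothetical
    Role="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",   # hypothetical
    DatabaseName="datalake",
    Targets={"S3Targets": [{"Path": "s3://tpc-bucket/orders/"}]},
)
glue.start_crawler(Name="orders-datalake-crawler")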

Querying the data lake in Athena

After you create the crawler, you can view the schema and tables in AWS Glue and Athena, and can immediately start querying the data in Athena. The following screenshot shows the table in the Athena Query Editor.

Querying Amazon Redshift and the data lake using a unified view with Amazon Redshift Spectrum

Amazon Redshift Spectrum is a feature of Amazon Redshift that allows multiple Amazon Redshift clusters to query the same data in the data lake. It enables the lake house architecture and allows data warehouse queries to reference data in the data lake as they would any other table. Amazon Redshift clusters transparently use the Amazon Redshift Spectrum feature when the SQL query references an external table stored in Amazon S3. Multiple large queries can run in parallel by using Amazon Redshift Spectrum on external tables to scan, filter, aggregate, and return rows from Amazon S3 back to the Amazon Redshift cluster.

Following best practices, Moovit decided to persist all their data in their Amazon S3 data lake and only store hot data in Amazon Redshift. They could query both hot and cold datasets in a single query with Amazon Redshift Spectrum.

The first step is creating an external schema in Amazon Redshift that maps a database in the Data Catalog. See the following code:

CREATE EXTERNAL SCHEMA spectrum 
FROM data catalog 
DATABASE 'datalake' 
iam_role 'arn:aws:iam::<account_number>:role/mySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

After the crawler creates the external table, you can start querying in Amazon Redshift using the mapped schema that you created earlier. See the following code:

SELECT * FROM spectrum.orders;

Lastly, create a late binding view that unions the hot and cold data:

CREATE OR REPLACE VIEW lake_house_joint_view AS
(SELECT * FROM public.orders WHERE o_orderdate >= dateadd('day',-90,date_trunc('day',getdate()))
UNION ALL
SELECT * FROM spectrum.orders WHERE o_orderdate < dateadd('day',-90,date_trunc('day',getdate())))
WITH NO SCHEMA BINDING;

Summary

In this post, we showed how Moovit unloaded data from Amazon Redshift to a data lake. By doing that, they exposed the data to many additional groups within the organization and democratized the data. These benefits of data democratization are substantial because various teams within Moovit can access the data, analyze it with various tools, and come up with new insights.

As an additional benefit, Moovit reduced their utilized Amazon Redshift storage, which allowed them to maintain cluster size and avoid additional spending by keeping all historical data within the data lake and only hot data in the Amazon Redshift cluster. Keeping only hot data on the Amazon Redshift cluster also means Moovit no longer needs to delete data frequently, which saves IT resources, time, and effort.

If you are looking to extend your data warehouse to a data lake and leverage various tools for big data analytics and machine learning (ML) applications, we invite you to try out this walkthrough.

About the Authors

Yonatan Dolan is a Business Development Manager at Amazon Web Services. He is located in Israel and helps customers harness AWS analytical services to leverage data, gain insights, and derive value.

Alon Gendler is a Startup Solutions Architect at Amazon Web Services. He works with AWS customers to help them architect secure, resilient, scalable and high performance applications in the cloud.

Vincent Gromakowski is a Specialist Solutions Architect for Amazon Web Services.

Best practices for Amazon Redshift Federated Query

Post Syndicated from Joe Harris original https://aws.amazon.com/blogs/big-data/amazon-redshift-federated-query-best-practices-and-performance-considerations/

Amazon Redshift Federated Query enables you to use the analytic power of Amazon Redshift to directly query data stored in Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL databases. For more information about setting up an environment where you can try out Federated Query, see Accelerate Amazon Redshift Federated Query adoption with AWS CloudFormation.

Federated Query enables real-time data integration and simplified ETL processing. You can now connect live data sources directly in Amazon Redshift to provide real-time reporting and analysis. Previously, you needed to extract data from your PostgreSQL database to Amazon Simple Storage Service (Amazon S3) and load it to Amazon Redshift using COPY or query it from Amazon S3 with Amazon Redshift Spectrum. For more information about the benefits of Federated Query, see Build a Simplified ETL and Live Data Query Solution using Amazon Redshift Federated Query.

This post discusses 10 best practices to help you maximize the benefits of Federated Query when you have large federated data sets, when your federated queries retrieve large volumes of data, or when you have many Redshift users accessing federated data sets. These techniques are not necessary for general usage of Federated Query. They are intended for advanced users who want to make the most of this exciting feature.

The best practices are divided into two sections: the first for advice that applies to your Amazon Redshift cluster, and the second for advice that applies to your Aurora PostgreSQL and Amazon RDS for PostgreSQL environments.

The code examples provided in this post derive from the data and queries in the CloudDataWarehouseBenchmark GitHub repo (based on TPC-H and TPC-DS).

Best practices to apply in Amazon Redshift

The following best practices apply to your Amazon Redshift cluster when using federated queries to access your Aurora or Amazon RDS for PostgreSQL instances.

1. Use separate external schemas for each use case

Consider creating separate Amazon Redshift external schemas, using separate remote PostgreSQL users, for each specific Amazon Redshift use case. This practice allows you to have extra control over the users and groups who can access the external database. For instance, you may want one external schema for ETL usage, with an associated PostgreSQL user that has broad access, and another schema, with its own associated PostgreSQL user, for ad hoc reporting and analysis with access limited to specific resources.

The following code example creates two external schemas for ETL use and ad-hoc reporting use. Each schema uses a different SECRET_ARN containing credentials for separate users in the PostgreSQL database.

-- ETL usage - broad access
CREATE EXTERNAL SCHEMA IF NOT EXISTS apg_etl
FROM POSTGRES DATABASE 'tpch' SCHEMA 'public'
URI 'aurora-postgres-ro.cr7d8lhiupkf.us-west-2.rds.amazonaws.com' PORT 8192
IAM_ROLE 'arn:aws:iam::123456789012:role/apg-federation-role'
SECRET_ARN 'arn:aws:secretsmanager:us-west-2:123456789012:secret:apg-redshift-etl-secret-187Asd'
;
-- Ad-Hoc usage - limited access
CREATE EXTERNAL SCHEMA IF NOT EXISTS apg_adhoc
FROM POSTGRES DATABASE 'tpch' SCHEMA 'public'
URI 'aurora-postgres-ro.cr7d8lhiupkf.us-west-2.rds.amazonaws.com' PORT 8192
IAM_ROLE 'arn:aws:iam::123456789012:role/apg-federation-role'
SECRET_ARN 'arn:aws:secretsmanager:us-west-2:123456789012:secret:apg-redshift-adhoc-secret-187Asd'
;

2. Use query timeouts to limit total runtimes

Consider setting a timeout on the users or groups that have access to your external schemas. User queries could unintentionally try to retrieve a very large number of rows from the external relation and remain running for an extended time, which holds open resources in both Amazon Redshift and PostgreSQL.

To limit the total runtime of a user’s queries, you can set a statement_timeout for all a user’s queries. The following code example sets a 2-hour timeout for an ETL user:

-- Set ETL user timeout to 2 hours
ALTER USER etl_user SET statement_timeout TO 7200000;

If many users have access to your external schemas, it may not be practical to define a statement_timeout for each individual user. Instead, you can add a query monitoring rule in your WLM configuration using the query_execution_time metric. The following screenshot shows an Auto WLM configuration with an Adhoc Reporting queue for users in the adhoc group, with a rule that cancels queries that run for longer than 1,800 seconds (30 minutes).

3. Make sure the Amazon Redshift query plan is efficient

Review the overall query plan and query metrics of your federated queries to make sure that Amazon Redshift processes them efficiently. For more information about query plans, see Evaluating the query plan.

Viewing the Amazon Redshift query explain plan

You can retrieve the plan for your query by prefixing your SQL with EXPLAIN and running that in your SQL client. The following code example is the explain output for a sample query:

<< REDSHIFT >> QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 XN Aggregate  (cost=6396670427721.37..6396670427721.37 rows=1 width=32)
   ->  XN Hash Join DS_BCAST_INNER  (cost=499986.50..6396670410690.30 rows=6812425 width=32)
         Hash Cond: ("outer".l_partkey = ("inner".p_partkey)::bigint)
         ->  XN Seq Scan on lineitem  (cost=0.00..2997629.29 rows=199841953 width=40)
               Filter: ((l_shipdate < '1994-03-01'::date) AND (l_shipdate >= '1994-02-01'::date))
         ->  XN Hash  (cost=449987.85..449987.85 rows=19999460 width=4)
               ->  XN PG Query Scan part  (cost=0.00..449987.85 rows=19999460 width=4)
                     ->  Remote PG Seq Scan apg_tpch_100g.part  (cost=0.00..249993.25 rows=19999460 width=4)
                           Filter: ((p_type)::text ~~ 'PROMO%'::text)

The operator XN PG Query Scan indicates that Amazon Redshift runs a query against the federated PostgreSQL database for this part of the query; we refer to this as the “federated subquery” in this post. When your query uses multiple federated data sources, Amazon Redshift runs a federated subquery for each source. Amazon Redshift runs each federated subquery from a randomly selected node in the cluster.

Below the XN PG Query Scan line, you can see Remote PG Seq Scan followed by a line with a Filter: element. These two lines define how Amazon Redshift accesses the external data and the predicate used in the federated subquery. You can see that the federated subquery will run against the federated table apg_tpch_100g.part.

You can also see from rows=19999460 that Amazon Redshift estimates that the query can return up to 20 million rows from PostgreSQL. It creates this estimate by asking PostgreSQL for statistics about the table.
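
Because that estimate comes from PostgreSQL’s own planner statistics, it helps to keep them current on the remote table, for example by running ANALYZE against the writer instance. The endpoint and credentials below are hypothetical; replicas are read-only, so ANALYZE must run on the writer, and the refreshed statistics are then visible through the replica.

import psycopg2

# Connect to the Aurora PostgreSQL writer (endpoint and credentials are hypothetical).
conn = psycopg2.connect(
    host="aurora-postgres.cluster-example.us-west-2.rds.amazonaws.com",
    port=8192,
    dbname="tpch",
    user="admin_user",
    password="<<example>>",
)
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("ANALYZE public.part;")  # refresh planner statistics for the federated table
conn.close()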

Joins

Since each federated subquery runs from a single node in the cluster, Amazon Redshift must choose a join distribution strategy to send the rows returned from the federated subquery to the rest of the cluster to complete the joins in your query. The choice of a broadcast or distribution strategy is indicated in the explain plan. Operators that start with DS_BCAST broadcast a full copy of the data to all nodes. Operators that start with DS_DIST distribute a portion of the data to each node in the cluster.

It’s usually most efficient to broadcast small results and distribute larger results. When the planner has a good estimate of the number of rows that the federated subquery will return, it chooses the correct join distribution strategy. However, if the planner’s estimate isn’t accurate, it may choose broadcast for a result that is too large, which can slow down your query.

Join Order

Joins should use the smaller result as the inner relation. When your query joins two tables (or two federated subqueries), Amazon Redshift must choose how best to perform the join. The query planner may not perform joins in the order declared in your query. Instead, it uses the information it has about the relations being joined to create estimated costs for a variety of possible plans. It uses the plan, including join order, that has the lowest expected cost.

When you use a hash join, the most common join, Amazon Redshift constructs a hash table from the inner table (or result) and compares it to every row from the outer table. You want to use the smallest result as the inner so that the hash table can fit in memory. The chosen join order may not be optimal if the planner’s estimate doesn’t reflect the real size of the results from each step in the query.

Improving query efficiency

The following is high-level advice for improving efficiency. For more information, see Analyzing the query plan.

  • Examine the plan for separate parts of your query. If your query has multiple joins or uses subqueries, you can review the explain plan for each join or subquery to check whether the query benefits from being simplified. For instance, if you use several joins, examine the plan for a simpler query using only one join to see how Amazon Redshift plans that join on its own.
  • Examine the order of outer joins and use an inner join. The planner can’t always reorder outer joins. If you can convert an outer join to an inner join, it may allow the planner to use a more efficient plan.
  • Reference the distribution key of the largest Amazon Redshift table in the join. When a join references the distribution key Amazon Redshift can complete the join on each node in parallel without moving the rows from the Redshift table across the cluster.
  • Insert the federated subquery result into a table. Amazon Redshift has optimal statistics when the data comes from a local temporary or permanent table. In rare cases, it may be most efficient to store the federated data in a temporary table first and join it with your Amazon Redshift data, as sketched after this list.
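
A sketch of that last approach using the redshift_connector driver follows; the cluster endpoint, credentials, and column choices are hypothetical.

import redshift_connector

conn = redshift_connector.connect(
    host="example-cluster.abc123.us-west-2.redshift.amazonaws.com",  # hypothetical
    database="dev",
    user="etl_user",
    password="<<example>>",
)
conn.autocommit = True
cursor = conn.cursor()

# Materialize the federated subquery in a local temp table so Amazon Redshift
# has accurate statistics for the subsequent join.
cursor.execute("""
    CREATE TEMP TABLE tmp_part AS
    SELECT p_partkey, p_type
    FROM apg_tpch.part
    WHERE p_type LIKE 'PROMO%';
""")

# Join the local temp table with the local Amazon Redshift table.
cursor.execute("""
    SELECT COUNT(*)
    FROM public.lineitem l
    JOIN tmp_part p ON l.l_partkey = p.p_partkey;
""")
print(cursor.fetchone())
conn.close()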

4. Make sure predicates are pushed down to the remote query

Amazon Redshift’s query optimizer is very effective at pushing predicate conditions down to the federated subquery that runs in PostgreSQL. Review the query plan of important or long-running federated queries to check that Amazon Redshift applies all applicable predicates to each subquery.

Consider the following example query, in which the predicate is inside a CASE statement and the federated relation is within a CTE subquery:

WITH cte 
AS (SELECT p_type, l_extendedprice, l_discount, l_quantity 
    FROM public.lineitem
    JOIN apg_tpch.part --<< PostgreSQL table
         ON l_partkey = p_partkey
    WHERE l_shipdate >= DATE '1994-02-01'
      AND l_shipdate < (DATE '1994-02-01' + INTERVAL '1 month')
)
SELECT CASE WHEN p_type LIKE 'PROMO%' --<< PostgreSQL filter predicate pt1
            THEN TRUE ELSE FALSE END AS is_promo
     , AVG( ( l_extendedprice * l_discount) / l_quantity ) AS avg_promo_disc_val
FROM cte
WHERE is_promo IS TRUE --<< PostgreSQL filter predicate pt2
GROUP BY 1;

Amazon Redshift can still effectively optimize the federated subquery by pushing a filter down to the remote relation. See the following plan:

<< REDSHIFT >> QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 XN HashAggregate  (cost=17596843176454.83..17596843176456.83 rows=200 width=87)
   ->  XN Hash Join DS_BCAST_INNER  (cost=500000.00..17596843142391.79 rows=6812609 width=87)
         Hash Cond: ("outer".l_partkey = ("inner".p_partkey)::bigint)
         ->  XN Seq Scan on lineitem  (cost=0.00..2997629.29 rows=199841953 width=40)
               Filter: ((l_shipdate < '1994-03-01'::date) AND (l_shipdate >= '1994-02-01'::date))
         ->  XN Hash  (cost=450000.00..450000.00 rows=20000000 width=59)-- Federated subquery >>
               ->  XN PG Query Scan part  (cost=0.00..450000.00 rows=20000000 width=59)
                     ->  Remote PG Seq Scan apg_tpch.part  (cost=0.00..250000.00 rows=20000000 width=59)
                           Filter: (CASE WHEN ((p_type)::text ~~ 'PROMO%'::text) THEN true ELSE false END IS TRUE)-- << Federated subquery

If Redshift can’t push your predicates down as needed, or the query still returns too much data, consider the advice in the following two sections regarding materialized views and syncing tables. To easily rewrite your queries to achieve effective filter pushdown, consider the advice in the final best practice regarding persisting frequently queried data.

5. Use materialized views to cache frequently accessed data

Amazon Redshift now supports the creation of materialized views that reference federated tables in external schemas.

Cache queries that run often

Consider caching frequently run queries in your Amazon Redshift cluster using a materialized view. When many users run the same federated query regularly, the remote content of the query must be retrieved again for each execution. With a materialized view, the results can instead be retrieved from your Amazon Redshift cluster without getting the same data from the remote database. You can then schedule the refresh of the materialized view to happen at a specific time, depending upon the change rate and importance of the remote data.

The following code example demonstrates the creation, querying, and refresh of a materialized view from a query that uses a federated source table:

-- Create the materialized view
CREATE MATERIALIZED VIEW mv_store_quantities_by_quarter AS
SELECT ss_store_sk
     , d_quarter_name
     , COUNT(ss_quantity) AS count_quantity
     , AVG(ss_quantity) AS avg_quantity
FROM public.store_sales
JOIN apg_tpcds.date_dim --<< federated table
    ON d_date_sk = ss_sold_date_sk
GROUP BY ss_store_sk, d_quarter_name
ORDER BY ss_store_sk
;
--Query the materialized view
SELECT * 
FROM mv_store_quantities_by_quarter
WHERE d_quarter_name = '1998Q1'
;
--Refresh the materialized view
REFRESH MATERIALIZED VIEW mv_store_quantities_by_quarter
;
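
To schedule the refresh mentioned above, one option is a small scheduled function (for example, triggered by an EventBridge rule) that issues the REFRESH statement through the Amazon Redshift Data API. The following sketch uses hypothetical cluster and database names.

import boto3

redshift_data = boto3.client("redshift-data")

def handler(event, context):
    # Runs on a schedule and refreshes the materialized view.
    return redshift_data.execute_statement(
        ClusterIdentifier="example-cluster",  # hypothetical
        Database="dev",                       # hypothetical
        DbUser="awsuser",                     # hypothetical
        Sql="REFRESH MATERIALIZED VIEW mv_store_quantities_by_quarter;",
    )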

Cache tables that are used by many queries

Also consider locally caching tables used by many queries using a materialized view. When many different queries use the same federated table it’s often better to create a materialized view for that federated table which can then be referenced by the other queries instead.

The following code example demonstrates the creation and querying of a materialized view on a single federated source table:

-- Create the materialized view
CREATE MATERIALIZED VIEW mv_apg_part AS
SELECT * FROM apg_tpch_100g.part
;
--Query the materialized view
SELECT SUM(l_extendedprice * (1 - l_discount)) AS discounted_price
FROM public.lineitem, mv_apg_part
WHERE l_partkey = p_partkey
  AND l_shipdate BETWEEN '1997-03-01' AND '1997-04-01'
;

As of this writing, you can’t reference a materialized view inside another materialized view. Other views that use the cached table need to be regular views.

Balance caching against refresh time and frequency

The use of materialized views is best suited for queries that run quickly relative to the refresh schedule. For example, a materialized view refreshed hourly should run in a few minutes, and a materialized view refreshed daily should run in less than an hour. As of this writing, materialized views that reference external tables aren’t eligible for incremental refresh. A full refresh occurs when you run REFRESH MATERIALIZED VIEW and recreate the entire result.

Limit remote access using materialized views

Also consider using materialized views to reduce the number of users who can issue queries directly against your remote databases. You can grant external schema access only to a user who refreshes the materialized views and grant other Amazon Redshift users access only to the materialized view.

Limiting the scope of access in this way is a general best practice for data security when querying from remote production databases that contain sensitive information.

6. Sync large remote tables to a local copy

Consider keeping a copy of the remote table in a permanent Amazon Redshift table. When your remote table is large and a full refresh of a materialized view is time-consuming it’s more effective to use a sync process to keep a local copy updated.

Sync newly added remote data

When your large remote table only has new rows added (no rows are updated or deleted), you can synchronize your Amazon Redshift copy by periodically inserting the new rows from the remote table into the copy. You can automate this sync process using the example stored procedure sp_sync_get_new_rows on GitHub.

This example stored procedure requires the source table to have an auto-incrementing identity column as its primary key. It finds the current maximum in your Amazon Redshift table, retrieves all rows in the federated table with a higher ID value, and inserts them into the Amazon Redshift table.

The following code examples demonstrate a sync from a federated source table to an Amazon Redshift target table. First, you create a source table with four rows in the PostgreSQL database:

CREATE TABLE public.pg_source ( 
      pk_col   BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY
    , data_col VARCHAR(20));
INSERT INTO public.pg_source (data_col)
VALUES ('aardvark'),('aardvarks'),('aardwolf'),('aardwolves')
;

Create a target table with two rows in your Amazon Redshift cluster:

CREATE TABLE public.rs_target (
      pk_col   BIGINT PRIMARY KEY
    , data_col VARCHAR(20));
INSERT INTO public.rs_target
VALUES (1,'aardvark'), (2,'aardvarks')
;

Call the Amazon Redshift stored procedure to sync the tables:

CALL sp_sync_get_new_rows(SYSDATE,'apg_tpch.pg_source','public.rs_target','pk_col','public.sp_logs',0);
-- INFO:  SUCCESS - 2 new rows inserted into target_table.

SELECT * FROM public.rs_target;
--  pk_col |  data_col  
-- --------+------------
--       1 | aardvark  
--       2 | aardvarks  
--       4 | aardwolves
--       3 | aardwolf 

Merge remote data changes

After you update or insert rows in your remote table, you can synchronize your Amazon Redshift copy by periodically merging the changed rows and new rows from the remote table into the copy. This approach works best when changes are clearly marked in the table so that you can easily retrieve just the new or changed rows. You can automate this sync process using the example stored procedure sp_sync_merge_changes, on GitHub.

This example stored procedure requires the source to have a date/time column that indicates the last time each row was modified. It uses this column to find changes that you need to sync and either updates the changed rows or inserts new rows in the Amazon Redshift copy. The stored procedure also requires the table to have a primary key declared. It uses the primary key to identify which rows to update in the local copy of the data.

The following code examples demonstrate a refresh from a federated source table to an Amazon Redshift target table. First, create a sample table with two rows in your Amazon Redshift cluster:

CREATE TABLE public.rs_tbl ( 
      pk_col   INTEGER PRIMARY KEY
    , data_col VARCHAR(20)
    , last_mod TIMESTAMP);
INSERT INTO public.rs_tbl 
VALUES (1,'aardvark', SYSDATE), (2,'aardvarks', SYSDATE);

SELECT * FROM public.rs_tbl;
--  pk_col |  data_col  |     last_mod
-- --------+------------+---------------------
--       1 | aardvark   | 2020-04-01 18:01:02
--       2 | aardvarks  | 2020-04-01 18:01:02

Create a source table with four rows in your PostgreSQL database:

CREATE TABLE public.pg_tbl ( 
      pk_col   INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY
    , data_col VARCHAR(20)
    , last_mod TIMESTAMP);
INSERT INTO public.pg_tbl (data_col, last_mod)
VALUES ('aardvark', NOW()),('aardvarks', NOW()),('aardwolf', NOW()),('aardwolves', NOW());

Call the Amazon Redshift stored procedure to sync the tables:

CALL sp_sync_merge_changes(SYSDATE,'apg_tpch.pg_tbl','public.rs_tbl','last_mod','public.sp_logs',0);
-- INFO:  SUCCESS - 4 rows synced.

SELECT * FROM public.rs_tbl;
--  pk_col |  data_col  |      last_mod
-- --------+------------+---------------------
--       1 | aardvark   | 2020-04-01 18:09:56
--       2 | aardvarks  | 2020-04-01 18:09:56
--       4 | aardwolves | 2020-04-01 18:09:56
--       3 | aardwolf   | 2020-04-01 18:09:56

Best practices to apply in Aurora or Amazon RDS

The following best practices apply to your Aurora or Amazon RDS for PostgreSQL instances when using them with Amazon Redshift federated queries.

7. Use a read replica to minimize Aurora or RDS impact

Aurora and Amazon RDS allow you to configure one or more read replicas of your PostgreSQL instance. As of this writing, Federated Query doesn’t allow writing to the federated database, so you should use a read-only endpoint as the target for your external schema. This also makes sure that the federated subqueries Amazon Redshift issues have the minimum possible impact on the master database instance, which often runs a large number of small and fast write transactions.

For more information about read replicas, see Adding Aurora Replicas to a DB Cluster and Working with PostgreSQL Read Replicas in Amazon RDS.

The following code example creates an external schema using a read-only endpoint. You can see the -ro naming in the endpoint URI configuration:

--In Amazon Redshift: 
CREATE EXTERNAL SCHEMA IF NOT EXISTS apg_etl
FROM POSTGRES DATABASE 'tpch' SCHEMA 'public'
URI 'aurora-postgres-ro.cr7d8lhiupkf.us-west-2.rds.amazonaws.com' PORT 8192
IAM_ROLE 'arn:aws:iam::123456789012:role/apg-federation-role'
SECRET_ARN 'arn:aws:secretsmanager:us-west-2:123456789012:secret:apg-redshift-etl-secret-187Asd';

8. Use specific and limited PostgreSQL users for each use case

As mentioned in the first best practice regarding separate external schemas, consider creating separate PostgreSQL users for each federated query use case. Having multiple users allows you to grant only the permissions needed for each specific use case. Each user needs a different SECRET_ARN, containing its access credentials, for the Amazon Redshift external schema to use. See the following code:

-- Create an ETL user who will have broad access
CREATE USER redshift_etl WITH PASSWORD '<<example>>';
-- Create an Ad-Hoc user who will have limited access
CREATE USER redshift_adhoc WITH PASSWORD '<<example>>';

Apply a user timeout to limit query runtime

Consider setting a statement_timeout on your PostgreSQL users. A user query could accidentally try to retrieve many millions of rows from the external relation and remain running for an extended time, which holds open resources in both Amazon Redshift and PostgreSQL. To prevent this, specify different timeout values for each user according to their expected usage. The following code example sets timeouts for an ETL user and an ad-hoc reporting user:

-- Set ETL user timeout to 1 hour
ALTER USER redshift_etl SET statement_timeout TO 3600000;
-- Set Ad-Hoc user timeout to 15 minutes
ALTER USER redshift_adhoc SET statement_timeout TO 900000;

9. Make sure the PostgreSQL table is correctly indexed

Consider adding or modifying PostgreSQL indexes to make sure Amazon Redshift federated queries run efficiently. Amazon Redshift retrieves data from PostgreSQL using regular SQL queries against your remote database. Queries are often faster when using an index, particularly when the query returns a small portion of the table.

Consider the following code example of an Amazon Redshift federated query on the lineitem table:

SELECT AVG( ( l_extendedprice * l_discount) / l_quantity ) AS avg_disc_val
FROM apg_tpch.lineitem
WHERE l_shipdate >= DATE '1994-02-01'
  AND l_shipdate < (DATE '1994-02-01' + INTERVAL '1 day');

Amazon Redshift rewrites this into the following federated subquery to run in PostgreSQL:

SELECT pg_catalog."numeric"(l_discount)
     , pg_catalog."numeric"(l_extendedprice)
     , pg_catalog."numeric"(l_quantity) 
FROM public.lineitem  
WHERE (l_shipdate < '1994-02-02'::date) 
  AND (l_shipdate >= '1994-02-01'::date);

Without an index, you get the following plan from PostgreSQL:

<< POSTGRESQL >> QUERY PLAN [No Index]
--------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..16223550.40 rows=232856 width=17)
   Workers Planned: 2
   ->  Parallel Seq Scan on lineitem  (cost=0.00..16199264.80 rows=97023 width=17)
         Filter: ((l_shipdate < '1994-02-02'::date) AND (l_shipdate >= '1994-02-01'::date))

You can add the following index to cover exactly the data this query needs:

CREATE INDEX lineitem_ix_covering 
ON public.lineitem (l_shipdate, l_extendedprice, l_discount, l_quantity);

With the new index in place, you see the following plan:

<< POSTGRESQL >> QUERY PLAN [With Covering Index]
------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on lineitem  (cost=7007.35..839080.74 rows=232856 width=17)
   Recheck Cond: ((l_shipdate < '1994-02-02'::date) AND (l_shipdate >= '1994-02-01'::date))
   ->  Bitmap Index Scan on lineitem_ix_covering  (cost=0.00..6949.13 rows=232856 width=0)
         Index Cond: ((l_shipdate < '1994-02-02'::date) AND (l_shipdate >= '1994-02-01'::date))

In the revised plan, the maximum cost is 839,080 versus the original 16,223,550, roughly 19 times lower. The reduced cost suggests the query runs faster with the index in place, but test to confirm.

Indexes require careful consideration. The detailed tradeoffs of adding additional indexes in PostgreSQL, the specific PostgreSQL index types available, and index usage techniques are beyond the scope of this post.

10. Replace restrictive joins with a remote view

Many analytic queries use joins to restrict the rows that the query returns. For instance, you might apply a predicate such as calendar_quarter='2019Q4' to your date_dim table and join to your large fact table. The filter on date_dim reduces the rows returned from the fact table by an order of magnitude. However, as of this writing, Amazon Redshift can’t push such join restrictions down to the federated relation.

Consider the following example query with a join between two federated tables:

SELECT ss_store_sk
,COUNT(ss_quantity) AS count_quantity
,AVG(ss_quantity) AS avg_quantity
FROM apg_tpcds.store_sales
JOIN apg_tpcds.date_dim
  ON d_date_sk = ss_sold_date_sk
WHERE d_quarter_name = '1998Q1'
GROUP BY ss_store_sk
ORDER BY ss_store_sk
LIMIT 100;

When you EXPLAIN this query in Amazon Redshift, you see the following plan:

<< REDSHIFT >> QUERY PLAN [Original]
----------------------------------------------------------------------------------------------------------------------------------
<< snip >>
  ->  XN PG Query Scan store_sales  (cost=0.00..576019.84 rows=28800992 width=12)
        ->  Remote PG Seq Scan store_sales  (cost=0.00..288009.92 rows=28800992 width=12)
  ->  XN Hash  (cost=1643.60..1643.60 rows=73049 width=4)
        ->  XN PG Query Scan date_dim  (cost=0.00..1643.60 rows=73049 width=4)
              ->  Remote PG Seq Scan date_dim  (cost=0.00..913.11 rows=73049 width=4)
                    Filter: (d_quarter_name = '1998Q1'::bpchar)

The query plan shows that date_dim is filtered, but store_sales doesn’t have a filter. This means Amazon Redshift retrieves all rows from store_sales and only then uses the join to filter the rows. Because store_sales is a very big table, this probably takes too long, especially if you want to run this query regularly.

As a solution, you can create the following view in PostgreSQL that encapsulates this join:

CREATE VIEW vw_store_sales_date
AS SELECT ss.*, dd.d_quarter_name ss_quarter_name 
FROM store_sales   ss 
JOIN date_dim      dd 
    ON ss.ss_sold_date_sk = dd.d_date_sk;

Rewrite the Amazon Redshift query to use the view as follows:

SELECT ss_store_sk
      ,COUNT(ss_quantity) AS count_quantity
      ,AVG(ss_quantity) AS avg_quantity
FROM apg_tpcds.vw_store_sales_date
WHERE ss_quarter_name = '1998Q1'
GROUP BY ss_store_sk
ORDER BY ss_store_sk
LIMIT 100;

When you EXPLAIN this rewritten query in Amazon Redshift, you see the following plan:

<< REDSHIFT >> QUERY PLAN [Remote View]
----------------------------------------------------------------------------------------------------------------------------------
<< snip >>
  ->  XN HashAggregate  (cost=30.00..31.00 rows=200 width=8)
        ->  XN PG Query Scan vw_store_sales_date  (cost=0.00..22.50 rows=1000 width=8)
              ->  Remote PG Seq Scan vw_store_sales_date  (cost=0.00..12.50 rows=1000 width=8)
                    Filter: (ss_quarter_name = '1998Q1'::bpchar)

Amazon Redshift now pushes the filter down to your view. The join restriction is applied in PostgreSQL and many fewer rows are returned to Amazon Redshift. You may notice that Remote PG Seq Scan now shows rows=1000; this is a default value that the query optimizer uses when PostgreSQL can’t provide table statistics.

Summary

This post reviewed 10 best practices to help you maximize the performance of Amazon Redshift federated queries. Every use case is unique, so carefully evaluate how you can apply these recommendations to your specific situation.

AWS will continue to enhance and improve Amazon Redshift Federated Query, and welcomes your feedback. If you have any questions or suggestions, leave your feedback in the comments. If you need further assistance in optimizing your Amazon Redshift cluster, contact your AWS account team.

Special thanks go to AWS colleagues Sriram Krishnamurthy, Entong Shen, Niranjan Kamat, Vuk Ercegovac, and Ippokratis Pandis for their help and support with this post.

 


About the Author

Joe Harris is a senior Redshift database engineer at AWS, focusing on Redshift performance. He has been analyzing data and building data warehouses on a wide variety of platforms for two decades. Before joining AWS he was a Redshift customer from launch day in 2013 and was the top contributor to the Redshift forum.

 

Monitor and control the storage space of a schema with quotas with Amazon Redshift

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/monitor-and-control-the-storage-space-of-a-schema-with-quotas-with-amazon-redshift/

Yelp connects people with great local businesses. Since its launch in 2004, Yelp has grown from offering services for just one city—its headquarters home of San Francisco—to a multinational presence spanning major metros across more than 30 countries. The company’s performance-based advertising and transactional business model led to revenues of more than $500 million during 2015, a 46% increase over the prior year. Yelp has evolved into a mobile-centric company, with more than 70% of searches and more than 58% of content originating from mobile devices.

Yelp uses Amazon Redshift to analyze mobile app usage data and ad data on customer cohorts, auctions, and ad metrics. Yelp has immediately benefited from the new Amazon Redshift schema storage quota feature.

“Amazon Redshift is a managed data warehouse service that allows Yelp to focus on data analytics without spending time on database administration,” says Steven Moy, Lead Engineer for Yelp’s Metrics Platform. The Metrics Platform provides long-term persistent data storage and SQL-on-anything query capabilities for Yelp’s Engineering teams. “A key strategy for our data warehouse users to iterate quickly is to have a writable schema called ‘tmp’ for users to prototype various table schema. However, we occasionally faced challenges when there was not enough free space during a query execution, degrading the entire data warehouse query operation. With the new schema quota feature, we can provision a storage quota ceiling on the ‘tmp’ schema to safeguard runaway storage issues. We look forward to all the autonomous features coming from Amazon Redshift.”

Many organizations are moving toward self-service analytics, where different personas create their own insights on the evolved volume, variety, and velocity of data to keep up with the acceleration of business. This data democratization creates the need to enforce data governance, control cost, and prevent data mismanagement. Controlling the storage quota of different personas is a significant challenge for data governance and data storage operation. This post shows you how to set up Amazon Redshift storage quotas by different personas.

Prerequisites

Before starting this walkthrough, you must have the following:

  • An Amazon Redshift cluster. The US East (N. Virginia) Region is preferred because you need to load data from Amazon Simple Storage Service (Amazon S3) in us-east-1.
  • A database user with superuser permission.

Setting up and testing the schema quota

To set up the environment and implement the use case, complete the following steps:

  1. Connect to your Amazon Redshift cluster using your preferred SQL client as a superuser or user with CREATE SCHEMA privileges.
  1. Create the user sales with the following code:
CREATE USER sales WITH password 'Abcd1234!';

The user in the preceding code is just for the purposes of this walkthrough. You should use identity federation in AWS Identity and Access Management (IAM) to use a corporate user rather than creating one manually. For more information, see Authentication.

  1. Set up a schema quota with the CREATE SCHEMA command. See the following code:
CREATE SCHEMA [ IF NOT EXISTS ] schema_name [ AUTHORIZATION username ]
           [ QUOTA {quota [MB | GB | TB] | UNLIMITED} ] [ schema_element [ ... ] ]
 
CREATE SCHEMA AUTHORIZATION username [ QUOTA {quota [MB | GB | TB] | UNLIMITED} ] [ schema_element [ ... ] ]

The QUOTA parameters define the maximum amount of disk space that the specified schema can use.

  1. Create the schema sales_schema with a quota of 2 GB for the user sales. See the following code:
CREATE SCHEMA sales_schema AUTHORIZATION sales QUOTA 2 GB; 
  1. Impersonate the user sales with the following code:
SET SESSION AUTHORIZATION 'sales';
SELECT CURRENT_USER;
  1. Create the tables region and lineitem with the following code:
CREATE TABLE sales_schema.region (
  r_regionkey int4,
  r_name char(25),
  r_comment varchar(152)                            
) DISTSTYLE EVEN;
 
CREATE TABLE sales_schema.lineitem (
  l_orderkey int8,
  l_partkey int8,
  l_suppkey int4,
  l_linenumber int4,
  l_quantity numeric(12,2),
  l_extendedprice numeric(12,2),
  l_discount numeric(12,2),
  l_tax numeric(12,2),
  l_returnflag char(1),
  l_linestatus char(1),
  l_shipdate date,
  l_commitdate date,
  l_receiptdate date,
  l_shipinstruct char(25),
  l_shipmode char(10),
  l_comment varchar(44)
) DISTSTYLE EVEN;

Single statement transaction

To test single statement transaction use case, complete the following steps:

  1. Load data into the table region with the following code:
COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';

 <Your-IAM-Role> is the ARN of your IAM role with the necessary permissions associated with your Amazon Redshift cluster.

 The COPY command should complete with five rows loaded. See the following code:

dev=> COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO:  Load into table 'region' completed, 5 record(s) loaded successfully.
COPY
  1. Check the quota and disk usage for the schema from system view svv_schema_quota_state. See the following code:
SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';

The statement should show a 2,048 MB quota with 30 MB disk usage on your four-node DC2.large cluster. Disk usage varies with different instance types and number of nodes. See the following code:

dev=> SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';
 schema_name  | quota | disk_usage | disk_usage_pct 
--------------+-------+------------+----------------
 sales_schema |  2048 |         30 |           1.46
(1 row)
  1. Load data into the table lineitem with the following code:
COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';

You should get an error that the transaction is aborted due to exceeding the disk space quota. See the following code:

dev=> COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO:  Load into table 'lineitem' completed, 59986052 record(s) loaded successfully.
ERROR:  Transaction 40895 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 2048, Current Disk Usage: 2798).
Free up disk space or request increased quota for the schema(s).

Amazon Redshift checks each transaction for quota violations before committing the transaction. The size limit can exceed the quota temporarily within a transaction before it’s committed because the quota violation check occurs at the end of a transaction. That’s why you initially see the message that the load into the lineitem table completed successfully. When a transaction exceeds the quota, Amazon Redshift aborts the transaction, prohibits subsequent ingestions, and reverts all changes until you free up disk space.

The quota violation occurrence is stored in the system table stl_schema_quota_violations.

  1. Run the following code to see the violation:
SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;

It should show the quota, disk usage, disk usage percentage out of the configured quota, and the timestamp of when the violation happened. See the following code:

dev=> SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;
 userid | schema_name  | quota | disk_usage | disk_usage_pct |         timestamp          
--------+--------------+-------+------------+----------------+----------------------------
    104 | sales_schema |  2048 |       2798 |         136.62 | 2020-04-20 20:09:25.494723
(1 row)
  1. Change the schema quota to 4 GB using the ALTER SCHEMA command. See the following code:
RESET SESSION AUTHORIZATION; 
ALTER SCHEMA sales_schema QUOTA 4 GB;
SET SESSION AUTHORIZATION 'sales';
SELECT CURRENT_USER;

You have to be a database superuser or user with CREATE SCHEMA permissions to set a schema quota during schema creation. However, only a superuser can change a schema quota. You use RESET SESSION AUTHORIZATION to reset the user impersonation back to the original superuser.

  1. Check the quota and disk usage for the schema with the following code:
SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';

The statement should return a 4,096 MB quota with 30 MB disk usage on your four-node DC2.large cluster. Because lineitem was still a pristine table (it had no previously committed data), the disk space from the earlier failed COPY was automatically reclaimed when that transaction was aborted. See the following code:

dev=> SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';
 schema_name  | quota | disk_usage | disk_usage_pct 
--------------+-------+------------+----------------
 sales_schema |  4096 |         30 |           0.73
(1 row)
  1. Rerun the COPY command to load data into the table lineitem with the following code:
COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';

It should return the following message:

Load into table 'lineitem' completed, 59986052 record(s) loaded successfully
  1. Check the quota and disk usage for the schema with the following code:
SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';

The statement should return a 4,096 MB quota and 2798 MB disk usage on a four-node DC2.large cluster. Disk usage varies with different instance types and number of nodes. See the following code:

dev=> SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';
 schema_name  | quota | disk_usage | disk_usage_pct 
--------------+-------+------------+----------------
 sales_schema |  4096 |       2798 |          68.31
(1 row)
  1. Rerun the COPY command to load a large amount of data into the table lineitem to exceed the quota. See the following code:
COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';

It should return an error that the transaction is aborted due to exceeding the disk space quota. See the following code:

dev=# COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO:  Load into table 'lineitem' completed, 59986052 record(s) loaded successfully.
ERROR:  Transaction 86438 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 4096, Current Disk Usage: 5486).
Free up disk space or request increased quota for the schema(s).
  1. Execute the following code to perform subsequent small COPY/INSERT/UPDATE after the quota violation:
COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INSERT INTO sales_schema.region VALUES(100, 'Test','This is a test');
UPDATE sales_schema.region SET r_name = 'Fail' WHERE r_regionkey = 100;

All statements should return an error that the transaction is aborted due to exceeding the disk space quota. See the following code:

dev=# COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO:  Load into table 'region' completed, 5 record(s) loaded successfully.
ERROR:  Transaction 86478 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 4096, Current Disk Usage: 5486).
Free up disk space or request increased quota for the schema(s).
dev=# INSERT INTO sales_schema.region VALUES(100, 'Test','This is a test');
ERROR:  Transaction 86479 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 4096, Current Disk Usage: 5486).
Free up disk space or request increased quota for the schema(s).
dev=# UPDATE sales_schema.region SET r_name = 'Fail' WHERE r_regionkey = 100;
ERROR:  Transaction 86483 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 4096, Current Disk Usage: 5486).
Free up disk space or request increased quota for the schema(s).

For non-pristine tables, although the offending transaction that exceeded the quota was aborted (for example, COPY to lineitem), the transaction still consumes space and the schema is blocked until disk space is reclaimed. That is why the subsequent COPY/INSERT/UPDATE to the region table is aborted as well.

  1. Double-check the quota violation from system table with the following code:
SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;

It should return disk usage of 5,486 MB with a four-node DC2.large cluster. See the following code:

dev=> SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;
 userid | schema_name  | quota | disk_usage | disk_usage_pct |         timestamp          
--------+--------------+-------+------------+----------------+----------------------------
    104 | sales_schema |  4096 |       5486 |         133.94 | 2020-04-20 21:27:29.653489
    104 | sales_schema |  4096 |       5486 |         133.94 | 2020-04-20 21:27:29.469287
    104 | sales_schema |  4096 |       5486 |         133.94 | 2020-04-20 21:27:29.197434
    104 | sales_schema |  4096 |       5486 |         133.94 | 2020-04-20 20:28:01.344333
    104 | sales_schema |  2048 |       2798 |         136.62 | 2020-04-20 20:09:25.494723
(5 rows)

Single statement shrink-only transaction

A shrink-only transaction is a transaction composed solely of statements that have the potential to directly or indirectly reduce the amount of disk space used, such as DELETE, TRUNCATE, DROP TABLE, and VACUUM.

To test single statement shrink-only transaction use case, complete the following steps:

  1. Delete all data from the lineitem table with the following code:
DELETE sales_schema.lineitem;
SELECT COUNT(*) FROM sales_schema.lineitem;

It should return a record count of zero:

dev=> DELETE sales_schema.lineitem;
DELETE 59986052
dev=> SELECT COUNT(*) FROM sales_schema.lineitem;
 count 
-------
     0
(1 row)

Although the schema quota was exceeded, the DELETE transaction is committed because the shrink-only transaction is an exception to the quota violation check.

The TRUNCATE command is much more efficient than DELETE to delete all data and doesn’t require a VACUUM and ANALYZE command. The preceding code is for illustration purposes only.
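
For reference only (not part of this walkthrough, because the following steps rely on the DELETE behavior), the equivalent shrink-only operation with TRUNCATE would be:

-- Removes all rows, commits immediately, and reclaims disk space
TRUNCATE sales_schema.lineitem;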

  1. Rerun the COPY command to load data into the table lineitem with the following code:
COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';

It should return the following error message:

Transaction is aborted due to exceeding the disk space quota

See the following code:

dev=> COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO:  Load into table 'lineitem' completed, 59986052 record(s) loaded successfully.
ERROR:  Transaction 87024 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 4096, Current Disk Usage: 8150).
Free up disk space or request increased quota for the schema(s).

Because DELETE doesn’t automatically free up disk space, the preceding COPY command exceeds the quota and the transaction is aborted.

  1. Double-check the quota violation from the system table with the following code:
SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;

The latest row should return a disk usage of 8,150 MB with a four-node DC2.large cluster. This shows that disk usage increases with the aborted transaction. See the following code:

dev=> SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;
 userid | schema_name  | quota | disk_usage | disk_usage_pct |         timestamp          
--------+--------------+-------+------------+----------------+----------------------------
    104 | sales_schema |  4096 |       8150 |         198.97 | 2020-04-20 21:30:54.354669
    104 | sales_schema |  4096 |       5486 |         133.94 | 2020-04-20 21:27:29.653489
    104 | sales_schema |  4096 |       5486 |         133.94 | 2020-04-20 21:27:29.469287
    104 | sales_schema |  4096 |       5486 |         133.94 | 2020-04-20 21:27:29.197434
    104 | sales_schema |  4096 |       5486 |         133.94 | 2020-04-20 20:28:01.344333
    104 | sales_schema |  2048 |       2798 |         136.62 | 2020-04-20 20:09:25.494723
(6 rows)
  1. Run VACUUM to free up disk space:
VACUUM sales_schema.lineitem;

The schema quota is exceeded, but because Amazon Redshift disregards the quota violation and commits transactions in certain cases like VACUUM, this completes without error.

Amazon Redshift now automatically runs the VACUUM DELETE operation in the background to reclaim disk space.
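
If you want to confirm that space is being reclaimed, the following is a minimal sketch that checks vacuum progress from the SVV_VACUUM_PROGRESS system view:

SELECT table_name, status, time_remaining_estimate
FROM svv_vacuum_progress;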

  1. Rerun the COPY command to load data into the table lineitem with the following code:
COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';

It should return the following message:

Load into table 'lineitem' completed, 59986052 record(s) loaded successfully

Because VACUUM reclaimed the disk space, the COPY command ran successfully.

CREATE TABLE AS (CTAS), ALTER TABLE APPEND statements, and transactions on a temporary schema

To test the CTAS, ALTER TABLE APPEND, and temporary schema use cases, complete the following steps:

  1. Execute the CTAS statement to create another table using a query. See the following code:
CREATE TABLE sales_schema.lineitem_ctas AS SELECT * FROM sales_schema.lineitem;

It should return that the transaction is aborted due to exceeding the disk space quota. CTAS is subject to similar quota checks in the same schema. See the following code:

dev=> CREATE TABLE sales_schema.lineitem_ctas AS SELECT * FROM sales_schema.lineitem;
ERROR:  Transaction 112229 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 4096, Current Disk Usage: 5624).
Free up disk space or request increased quota for the schema(s).
  1. Create a temp table and populate data with the following code:
CREATE TEMP TABLE lineitem_temp AS SELECT * FROM sales_schema.lineitem;
SELECT COUNT(*) from lineitem_temp; 

It should return 59,986,052 rows. The schema quota doesn’t consider temporary tables created as part of a temporary namespace or schema.

  1. Create the same table lineitem in the public schema and load data into it using the following code:
CREATE TABLE public.lineitem_ctas AS SELECT * FROM sales_schema.lineitem;
SELECT COUNT(*) FROM lineitem_ctas;

It should return 59,986,052 rows.

Ingestion into a different schema isn’t affected by the sales_schema quota, so the transaction is committed.

  1. Append data into the table lineitem in sales_schema from another table using the ALTER TABLE APPEND command. See the following code:
ALTER TABLE sales_schema.lineitem APPEND FROM lineitem_ctas;

It should return an error that the transaction is aborted due to exceeding the disk space quota. The quota check applies to the schema as a whole, regardless of which table in the schema is the target.

  1. Append data from the sales_schema lineitem table to another table in the public schema with the following code:
ALTER TABLE lineitem_ctas APPEND FROM sales_schema.lineitem;
SELECT COUNT(*) FROM sales_schema.lineitem;
SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';

It should return a record count of zero from the SELECT COUNT statement, and the disk usage should drop. The ALTER TABLE APPEND statement empties the source table immediately and reclaims disk space when the transaction is complete. See the following code:

dev=> ALTER TABLE lineitem_ctas APPEND FROM sales_schema.lineitem;
INFO:  ALTER TABLE APPEND "lineitem_ctas" from "lineitem" is complete.
ALTER TABLE APPEND and COMMIT TRANSACTION
dev=> SELECT COUNT(*) FROM sales_schema.lineitem;
 count 
-------
     0
(1 row)
 
dev=> SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';
 schema_name  | quota | disk_usage | disk_usage_pct 
--------------+-------+------------+----------------
 sales_schema |  4096 |         30 |           0.73
(1 row)

Concurrent transactions

To test concurrent transactions use case, complete the following steps:

  1. Increase the quota to 6 GB using the following code:
RESET SESSION AUTHORIZATION;
ALTER SCHEMA sales_schema QUOTA 6 GB;
SET SESSION AUTHORIZATION 'sales';

You need two separate SQL client sessions connected to the Amazon Redshift cluster and run the code concurrently for the following test.

  1. On the first session, enter the following code:
CREATE TABLE sales_schema.lineitem_txn1 AS SELECT * FROM lineitem_ctas;
  1. While the first session is still running, on the second session, enter the following code:
CREATE TABLE sales_schema.lineitem_txn2 AS SELECT * FROM lineitem_ctas;

The statement from the first session should return a warning message:

Schema(s) close to reaching their allocated disk space quotas

This is a warning message that the schema is close to its quota, but the statement is still complete. See the following code:

dev=> CREATE TABLE sales_schema.lineitem_txn1 AS SELECT * FROM lineitem_ctas;
WARNING:  Schema(s) close to reaching their allocated disk space quotas: (Schema: sales_schema, Quota: 6144, Current Disk Usage: 5606).
Free up disk space or request increased quota for the schema(s).
SELECT

The statement from the second session should return an error that the transaction is aborted due to exceeding the disk space quota. Because the first transaction from the first session was committed first, the disk space usage increased to close to the quota. The second transaction checks the quota violation before committing the transaction and thus aborts the transaction.

Reset quota below existing disk space usage

To test resetting quota below existing disk space usage use case, complete the following:

  1. Run the following code to reset the quota to 2 GB:
RESET SESSION AUTHORIZATION;
ALTER SCHEMA sales_schema QUOTA 2 GB;
SET SESSION AUTHORIZATION 'sales';
CREATE TABLE sales_schema.test AS SELECT 1;

You can set the quota below the existing disk space usage; however, all future ingestion transactions are aborted until you free disk space.  See the following code:

dev=> RESET SESSION AUTHORIZATION;
RESET
dev=# ALTER SCHEMA sales_schema QUOTA 2 GB;
ALTER SCHEMA
dev=# SET SESSION AUTHORIZATION 'sales';
SET
dev=> CREATE TABLE sales_schema.test AS SELECT 1;
ERROR:  Transaction 112557 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 2048, Current Disk Usage: 5610).
Free up disk space or request increased quota for the schema(s).

Multi-statement transactions

To test multi-statement transaction use case, complete the following steps:

  1. Enter the following code to test a multi-statement transaction:
BEGIN;
	DELETE FROM sales_schema.region WHERE r_regionkey = 0;
	INSERT INTO sales_schema.region VALUES(100, 'Test','This is a test');
COMMIT;
BEGIN;
	DELETE FROM sales_schema.region WHERE r_regionkey = 0;
	UPDATE sales_schema.region SET r_name = 'Fail' WHERE r_regionkey = 1;
COMMIT;
BEGIN;
	DELETE FROM sales_schema.region WHERE r_regionkey = 0;
	COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
COMMIT;

All INSERT/UPDATE/COPY transactions should be aborted because the quota is exceeded. See the following code:

dev=> BEGIN;
BEGIN
dev=> DELETE FROM sales_schema.region WHERE r_regionkey = 0;
DELETE 0
dev=> INSERT INTO sales_schema.region VALUES(100, 'Test','This is a test');
INSERT 0 1
dev=> COMMIT;
ERROR:  Transaction 114291 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 2048, Current Disk Usage: 5606).
Free up disk space or request increased quota for the schema(s).
dev=> BEGIN;
BEGIN
dev=> DELETE FROM sales_schema.region WHERE r_regionkey = 0;
DELETE 0
dev=> UPDATE sales_schema.region SET r_name = 'Fail' WHERE r_regionkey = 1;
UPDATE 1
dev=> COMMIT;
ERROR:  Transaction 114295 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 2048, Current Disk Usage: 5606).
Free up disk space or request increased quota for the schema(s).
dev=> BEGIN;
BEGIN
dev=> DELETE FROM sales_schema.region WHERE r_regionkey = 0;
DELETE 0
dev=> COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO:  Load into table 'region' completed, 5 record(s) loaded successfully.
COPY
dev=> COMMIT;
ERROR:  Transaction 114303 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 2048, Current Disk Usage: 5624).
Free up disk space or request increased quota for the schema(s).
  1. Enter the following code to test a multi-statement transaction with DROP TABLE at the same transaction:
BEGIN;
	INSERT INTO sales_schema.region VALUES(100, 'Test','This is a test');
	UPDATE sales_schema.region SET r_name = 'Fail' WHERE r_regionkey = 100;
	COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
	DROP TABLE sales_schema.region;
COMMIT;

The transaction should be committed successfully. Because the target table is dropped within the same transaction, the data ingested by the COPY/INSERT/UPDATE statements no longer counts against the schema quota when the transaction is committed. See the following code:

dev=> BEGIN;
BEGIN
dev=> INSERT INTO sales_schema.region VALUES(100, 'Test','This is a test');
INSERT 0 1
dev=> UPDATE sales_schema.region SET r_name = 'Fail' WHERE r_regionkey = 100;
UPDATE 3
dev=> COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO:  Load into table 'region' completed, 5 record(s) loaded successfully.
COPY
dev=> DROP TABLE sales_schema.region;
DROP TABLE
dev=> COMMIT;
COMMIT
  1. Run shrink-only statements in a multi-statement transaction with the following code:
BEGIN;
	DELETE FROM sales_schema.lineitem WHERE l_orderkey = 53924162;
	DROP TABLE sales_schema.lineitem;
COMMIT;

The transaction should be committed successfully. A transaction containing shrink-only statements should succeed. See the following code:

dev=> BEGIN;
BEGIN
dev=> DELETE FROM sales_schema.lineitem WHERE l_orderkey = 53924162;
DELETE 0
dev=> DROP TABLE sales_schema.lineitem;
DROP TABLE
dev=> COMMIT;
COMMIT

Best practices

These following recommended practices can help you optimize your workload using storage quotas:

  • The schema feature works best when users are aligned with schemas – Use schemas to logically segregate the users, for example, the MAIN schema for the base and aggregated tables owned by data engineers, and the ANALYST schema, controlled with a storage quota and owned by data analysts who perform their own aggregations by querying the base data from the MAIN schema into the ANALYST schema. The data analysts only have read permissions on the MAIN schema but read and write permissions on the ANALYST schema.
  • Revoke usage on public schema – All users have CREATE and USAGE privileges on the PUBLIC schema of a database by default. Because storage quota isn’t applicable to the PUBLIC schema by default, all users should be disallowed from creating objects in the PUBLIC schema of a database. They should instead use their own aligned schemas as per the previous best practice. Use the REVOKE command to remove that privilege (see the sketch after this list).
  • Use physical tables instead of temporary tables – Because temporary tables aren’t subject to storage quota, use physical tables instead—but remember to drop the physical tables at the end of the transaction. An alternative is to create a separate schema with a storage quota to cater to temporary physical tables, so instead of using temporary tables, you create physical tables in the dedicated schema. This eases administration, and you can automate cleaning up tables in this dedicated schema on a regular basis.
  • Use data shrinking operations – Use data shrinking operations like DELETE, VACUUM DELETE, DROP, and TRUNCATE to reclaim disk space, especially if it’s nearing the quota.
  • Use workload management (WLM) with query priority to control resources for different users – Data ingestion and aggregation performed by different users might be resource-intensive and impact the throughput of other mission-critical loads or queries running on the cluster. Amazon Redshift Advisor automatically analyzes the current WLM usage and can make recommendations to get more throughput from your cluster. Periodically reviewing the suggestions from Advisor helps you get the best performance.
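
The following is a minimal sketch of the REVOKE approach from the second best practice above; run it in each database where you want to lock down the PUBLIC schema:

-- Prevent all users from creating objects in the public schema
REVOKE CREATE ON SCHEMA public FROM PUBLIC;
-- Optionally also remove the default usage privilege
REVOKE USAGE ON SCHEMA public FROM PUBLIC;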

Monitoring

You should periodically monitor for any storage quota violations so that you can take corrective action and provide business continuity. The system table stl_schema_quota_violations is a good resource to get quota violation events. To query the violation information, enter the following code:

SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;

You can automate running the preceding query and send notifications to users if any violations occur.

Cleaning up

When you’re done with testing, you can remove the objects and users with the following code:

RESET SESSION AUTHORIZATION;
DROP TABLE IF EXISTS sales_schema.lineitem_txn1; 
DROP TABLE IF EXISTS lineitem_temp;
DROP TABLE IF EXISTS lineitem_ctas;
DROP SCHEMA sales_schema;
DROP USER IF EXISTS sales;

Summary

Amazon Redshift supports stringent compliance and security requirements with no extra cost, which makes it ideal for highly regulated industries. Data storage operation with quotas is part of an important data governance framework that provides compliance and operation efficiency. A simple QUOTA parameter in a CREATE/ALTER SCHEMA statement enables disk usage quotas on an Amazon Redshift schema. You can assign Amazon Redshift database users who represent personas in the organization to different schemas with quotas to control the amount of disk space they can consume. This is crucial for data governance and cost control, and avoids running into operation issues like disk full errors, especially if they keep creating objects and ingesting data that might halt other mission-critical analytic operations.

This post showed you how easy it is to control the storage quota for Amazon Redshift using schemas. You went through the steps to create and alter the quota on a schema and authorize specific users to the schema. You also explored different use cases of ingesting data into the schema and monitoring the different outcomes and disk usage using the system view. You can adopt this feature to support your business needs.

 


About the Authors

BP Yau is a Data Warehouse Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate the Oracle Data Warehouse to Amazon Redshift and built the next generation big data analytics platform using AWS technologies.

Ahmad Barghout is a Software Development Engineer on the Amazon Redshift storage team. He has worked on cluster resize optimizations and currently works on ease-of-use features such as schema quota. Before that, he was a full stack developer at Cisco. He holds a master’s degree from The University of Ottawa.

Himanshu Raja is a Principal Product Manager for Amazon Redshift. Himanshu loves solving hard problems with data and cherishes moments when data goes against intuition. In his spare time, Himanshu enjoys cooking Indian food and watching action movies.

Migrating your Netezza data warehouse to Amazon Redshift

Post Syndicated from John Hwang original https://aws.amazon.com/blogs/big-data/migrating-your-netezza-data-warehouse-to-amazon-redshift/

With IBM announcing Netezza reaching end-of-life, you’re faced with the prospect of having to migrate your data and workloads off your analytics appliance. For some, this presents an opportunity to transition to the cloud.

Enter Amazon Redshift.

Amazon Redshift is a cloud-native data warehouse platform built to handle workloads at scale, and it shares key similarities with Netezza that make it an excellent candidate to replace your on-premises appliance. You can migrate your data and applications to Amazon Redshift in less time and with fewer changes than migrating to other analytics platforms. For developers, this means less time spent retraining on the new database. For stakeholders, it means a lower cost of, and time to, migration. For more information, see How to migrate a large data warehouse from IBM Netezza to Amazon Redshift with no downtime.

This post discusses important similarities and differences between Netezza and Amazon Redshift, and how they could impact your migration timeline.

Similarities

Three significant similarities between Netezza and Amazon Redshift are their compatibility with Postgres, their massively parallel processing (MPP) architecture, and the Amazon Redshift feature Advanced Query Accelerator (AQUA) compared to Netezza’s use of FPGAs.

Postgres compatibility

Both Netezza and Amazon Redshift share some compatibility with Postgres, an open-source database. This means that Netezza SQL and Amazon Redshift SQL have a similar syntax. In particular, both support many features of PL/pgSQL, Postgres’s procedural language. Your Netezza stored procedures can translate to Amazon Redshift with little-to-no rewriting of code.
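
As an illustration of that compatibility, the following is a minimal sketch of a PL/pgSQL stored procedure that runs in Amazon Redshift; the procedure and table names are hypothetical, and broadly similar constructs exist in Netezza stored procedures:

CREATE OR REPLACE PROCEDURE sp_table_row_count(IN tbl VARCHAR(128))
AS $$
DECLARE
  cnt BIGINT;
BEGIN
  -- Dynamic SQL, local variables, and RAISE messages translate with little change
  EXECUTE 'SELECT COUNT(*) FROM ' || tbl INTO cnt;
  RAISE INFO 'Table % contains % rows', tbl, cnt;
END;
$$ LANGUAGE plpgsql;

CALL sp_table_row_count('public.sales');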

You can also use the AWS Schema Conversion Tool, which can automatically migrate a large percentage of Netezza storage procedures to Amazon Redshift syntax with zero user effort. And because both databases are built for analytics and not transactional workloads, there are similar characteristics between the two databases. For example, both Netezza and Amazon Redshift don’t enforce primary keys to improve performance, though you can still define primary keys on your tables to help the optimizer create better query plans.

MPP architecture

Both Netezza and Amazon Redshift are MPP databases. This means that a query is sent to a leader node, which then compiles a set of commands that it sends to multiple compute nodes. Each worker node performs its task in parallel and returns the results to the leader node, where the results are aggregated and returned to the user. This means that you can apply similar architectural strategies from Netezza, such as zone maps and distribution styles, to Amazon Redshift. For more information, see Amazon Redshift Engineering’s Advanced Table Design Playbook: Preamble, Prerequisites, and Prioritization.
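
For example, the following hedged sketch shows how Netezza-style design decisions map onto Amazon Redshift table properties; the table and column names are hypothetical:

CREATE TABLE store_sales_fact (
  sale_id      BIGINT,
  store_sk     INTEGER,
  sold_date    DATE,
  quantity     INTEGER,
  net_paid     DECIMAL(12,2),
  PRIMARY KEY (sale_id)   -- informational only; not enforced, but helps the optimizer
)
DISTKEY (store_sk)        -- analogous to a Netezza distribution key
SORTKEY (sold_date);      -- sort key zone maps play a role similar to Netezza zone maps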

Amazon Redshift AQUA and Netezza’s FPGA

AWS recently announced the new Amazon Redshift feature AQUA, which is in preview as of this writing. AQUA uses AWS-designed processors to bring certain compute tasks closer to the storage layer. Compression, encryption, filtering, aggregation, and other tasks can now happen in this intermediary layer between storage and compute, which leaves the CPU free to handle more complex tasks. When it’s released, AQUA is expected to run queries up to 10 times faster.

Netezza uses FPGAs to perform simple compute tasks before data reaches the CPU. Applications designed to employ these FPGA features on Netezza (for example, queries that rely heavily on certain aggregate functions and data filtering) translate well to Amazon Redshift clusters using AQUA.

Differences

For all the similarities that Amazon Redshift and Netezza share, they also have differences. There are three important differences that could have significant impact on your data and application architecture when migrating from Netezza to Amazon Redshift: column store vs. row store, concurrency scaling, and data lake integration.

Column store vs. row store

Netezza stores each row of data onto disk in data blocks, whereas Amazon Redshift stores each column of data. For many analytics workloads, a column store can have dramatic performance benefits over a row store. Typically, an analytics database has tables with many columns, but only a small handful of those columns are used in any one query. For example, assume that you have a table in a row store database with 100 columns and 100 million rows. If you want to sum the entire value of one column in this table, your query has to suffer the I/O penalty of scanning the entire table to retrieve the data in this single column. In a columnar database with the same table, the query only faces I/O for the single column. In addition to enjoying the improved performance for this type of workload in Amazon Redshift, this gives you options for designing wide tables without having to weigh the increase in I/O for typical analytics workloads the way you do in Netezza.
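
To make the I/O difference concrete, consider a hedged example using the hypothetical table from the earlier sketch. In a column store, a query like the following only reads the blocks for the single column it touches:

-- Scans only the net_paid column's blocks, not the other columns in the wide table
SELECT SUM(net_paid) FROM store_sales_fact;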

Concurrency scaling

Although both Netezza and Amazon Redshift offer queue priority and short query acceleration to help reduce concurrency issues, Amazon Redshift also uses the benefits of the cloud to offer additional options to handle concurrency.

One option is to take unrelated workloads from a single Netezza appliance and migrate them to separate Amazon Redshift clusters, each with an instance type and number of nodes sized accordingly for the workload it has to support. Netezza’s more traditional license and support pricing model, combined with its limited options for appliance sizes, can make this type of architecture untenable for your organization’s budget.

Additionally, Amazon Redshift offers Concurrency Scaling to scale out (and scale back in) additional compute capacity automatically and on the fly. If you want to add or remove nodes in the cluster, or experiment with other nodes types, you can do so in just a few minutes using elastic resize (to change the number of nodes) and classic resize (to change instance types).

This type of scaling and resizing isn’t feasible on Netezza because the on-premises appliance has a fixed number of blades. Concurrency scaling in Amazon Redshift can support virtually unlimited concurrent users and concurrent queries, and its ability to automatically add and remove additional capacity means you only pay for the time the concurrency scaling clusters are in use.

Data lake integration

Netezza offers the ability to create an external table from a data file either on the Netezza host or a remote host. However, querying those tables means migrating the entire dataset internally before running the query. With Amazon Redshift Spectrum, rather than using external tables as a convenient way to migrate entire datasets to and from the database, you can run analytical queries against data in your data lake the same way you do an internal table. As the volume of your data in the lake grows, partitioning can help keep the amount of data scanned for each query (and therefore performance) consistent. You can even join the data from your external tables in Amazon Simple Storage Service (Amazon S3) to your internal tables in Amazon Redshift.
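
The following is a minimal sketch of that pattern; the Data Catalog database, S3 location, IAM role, and table names are all placeholders:

-- Register the data lake as an external schema
CREATE EXTERNAL SCHEMA spectrum_sales
FROM DATA CATALOG
DATABASE 'sales_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define an external table over Parquet files in Amazon S3
CREATE EXTERNAL TABLE spectrum_sales.web_sales (
  order_id    BIGINT,
  customer_id INTEGER,
  sale_date   DATE,
  amount      DECIMAL(12,2)
)
STORED AS PARQUET
LOCATION 's3://my-data-lake-bucket/web_sales/';

-- Join external (S3) data with an internal Amazon Redshift table
SELECT c.customer_name, SUM(w.amount) AS total_amount
FROM spectrum_sales.web_sales w
JOIN public.dim_customer c ON c.customer_id = w.customer_id
GROUP BY c.customer_name;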

Not only does this mean true integration of your data warehouse and data lake, which can lower your warehouse’s more expensive storage requirements, but because the Amazon Redshift Spectrum layer uses its own dedicated infrastructure outside of your cluster, you can offload many compute-intensive tasks from your cluster and push them down to the Redshift Spectrum layer. For more information, see Amazon Redshift Spectrum overview.

Conclusion

Amazon Redshift can accelerate migration from Netezza and potentially lower the time and cost of moving your analytics workloads into the cloud. The similarity between the two systems eases the pain of migration, but Amazon Redshift has important differences that can offer additional benefits around performance and cost. The expertise of AWS and its partner network can help with your migration strategy and offer guidance to avoid potential roadblocks and pitfalls.

 


About the Author

John Hwang is a senior solutions architect at AWS.

Brian McDermott is a senior sales manager at AWS.

Monitor and optimize queries on the new Amazon Redshift console

Post Syndicated from Debu Panda original https://aws.amazon.com/blogs/big-data/monitor-and-optimize-queries-on-the-new-amazon-redshift-console/

Tens of thousands of customers use Amazon Redshift to power their workloads to enable modern analytics use cases, such as Business Intelligence, predictive analytics, and real-time streaming analytics. As an administrator or data engineer, it’s important that your users, such as data analysts and BI professionals, get optimal performance. You can use the Amazon Redshift console to monitor and diagnose query performance issues.

The Amazon Redshift console features a monitoring dashboard and updated flows to create, manage, and monitor Amazon Redshift clusters. For more information, see Simplify management of Amazon Redshift clusters with the Redshift console.

This post discusses how you can use the new Amazon Redshift console to monitor your user queries, identify slow queries, and terminate runaway queries. The post also reviews details such as query plans, execution details for your queries, in-place recommendations to optimize slow queries, and how to use the Advisor recommendations to improve your query performance.

User query vs. rewritten query

Any query that users submit to Amazon Redshift is a user query. Analysts either author a user query or a BI tool such as Amazon QuickSight or Tableau generates the query. Amazon Redshift typically rewrites queries for optimization purposes. It can rewrite a user query into a single query or break it down into multiple queries. These queries are rewritten queries.

The following steps are performed by Amazon Redshift for each query:

  1. The leader node receives and parses the query.
  2. The parser produces an initial query tree, which is a logical representation of the original query. Amazon Redshift inputs this query tree into the query optimizer.
  3. The optimizer evaluates and, if necessary, rewrites the query to maximize its efficiency. This process sometimes results in creating multiple queries to replace a single query.

The query rewrite is done automatically and is transparent to the user.

Query monitoring with the original Amazon Redshift console and system tables

Previously, you could monitor the performance of rewritten queries in the original Amazon Redshift console or system tables. However, it was often challenging to find the SQL your users submitted.
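
For reference, a typical system table approach looks something like the following hedged sketch, which lists the most recent queries and their (possibly rewritten) SQL text from STL_QUERY:

SELECT query, userid, starttime, endtime, aborted,
       TRIM(querytxt) AS sql_text
FROM stl_query
WHERE starttime > DATEADD(hour, -1, GETDATE())
ORDER BY starttime DESC
LIMIT 100;

Correlating these rewritten entries back to the SQL a user actually submitted is the part that the new console now does for you.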

The following table shows the comparison of query monitoring differences between the original Amazon Redshift console, system tables, and the new console.

Original Console:

  • User query not supported
  • Monitor only rewritten queries
  • Shows only top 100 queries

System Tables:

  • User query not available
  • Rewritten queries
  • All rewritten queries

New Console:

  • Supports user queries
  • Shows all queries available in system tables
  • Allows you to correlate rewritten queries with user queries

The new console simplifies monitoring user queries and provides visibility to all query monitoring information available in the system.

Monitoring and diagnosing queries

The Amazon Redshift console provides information about the performance of queries that run in the cluster. You can use this information to identify and diagnose queries that take a long time to process and create bottlenecks that prevent other queries from executing efficiently.

The following table shows some of the common questions you may have when monitoring, isolating, and diagnosing query performance issues.

Monitor:

  • How is my cluster doing in terms of query performance and resource utilization?
  • How is my cluster throughput, concurrency, and latency looking?
  • Are queries being queued in my cluster?
  • Which queries or loads are running now?
  • Which queries or loads are taking longer than usual?
  • What are my top queries by duration in the last hour or last 24 hours?
  • Which queries have failed?
  • Is the average query latency for my cluster increasing or decreasing over time?

Isolate and diagnose:

  • A user complained about performance issues at a specific time. How do I identify that SQL and diagnose problems?
  • Which other queries were running when my query was slow? Were all queries slow?
  • Is my database overloaded with queries from other users? Is my queue depth increasing or decreasing?
  • How do I identify queries that a specific user runs?
  • The resources of my cluster are running very high. How do I find out which queries are running?

Optimize:

  • How can I optimize the SQL that our end users author?
  • Is there any optimization required in my schema design?
  • Is there any tuning required for my WLM queues?
  • Can I get any benefit if I enable concurrency scaling?

You can answer these questions by either using the Amazon Redshift console or developing scripts using the system catalog.

Monitoring queries

You can monitor your queries on the Amazon Redshift console on the Queries and loads page or on the Query monitoring tab on the Clusters page. While both options are similar for query monitoring, you can quickly get to your queries for all your clusters on the Queries and loads page. You have to select your cluster and period for viewing your queries.

You can view the queries using List view on the Query monitoring tab on the Clusters page. The query monitoring page visually shows the queries in a Gantt chart.

Each bar represents a user query, and the length of the bar represents runtime for a query. The X-axis shows the selected period, and the location of the bar indicates when a query started and ended. The queries include both standard SQL statements such as SELECT, INSERT, and DELETE, and loads such as COPY commands.

Monitoring top queries

By default, the Query monitoring page shows the top 100 longest queries by runtime or duration for the selected time window. You can change the time window to view the top queries for that period. The top queries also include completed queries and running queries. The completed queries are sorted by descending order of query runtime or duration.

Identifying running queries

You can find out your running queries by choosing Running queries from the drop-down menu.

To see the query’s details such as SQL text, runtime details, related rewritten queries, and execution details, choose the query ID.

The Duration column shows the estimated duration and runtime for a query. You can terminate a query by selecting the query and choosing Terminate query. You need to have the redshift:CancelQuerySession action added to your IAM policy to cancel a query.
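
If you prefer SQL over the console, a hedged sketch of the same workflow uses the STV_RECENTS system table to find the process ID of a running query and the CANCEL command to stop it; the PID shown is a placeholder:

-- Find currently running queries and their process IDs
SELECT pid, user_name, starttime, duration, TRIM(query) AS sql_text
FROM stv_recents
WHERE status = 'Running';

-- Cancel the offending query by its process ID
CANCEL 18764;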

Viewing loads

As a data engineer or Redshift administrator, ensuring that your load jobs complete correctly and meet required performance SLAs is a major priority.

You can view all your load jobs by choosing Loads from the drop-down menu on the Query monitoring page. You can then zoom in on the desired time window.

The preceding Gantt chart shows all loads completed successfully. The query status indicates if the load failed or if an administrator terminated it.

Identifying failed queries

You can identify failed queries by choosing Failed or stopped queries from the drop-down menu on the Query monitoring page and then zooming in on the desired time.

The query page shows 50 queries by default, and you have to paginate to view more results. You can change the page size by choosing the settings gear icon.

In the Preferences section, you can customize what fields you want to see on the Queries and loads list. For example, you can see the PID and not the transaction ID. These changes persist across browser sessions.

Monitoring long-running queries

Amazon Redshift categorizes queries if a query or load runs more than 10 minutes. You can filter long-running queries by choosing Long queries from the drop-down menu. Similarly, you can also filter medium and short queries.

Isolating problematic queries

The following section looks at some use cases in which you use the console to diagnose query performance issues.

Query performance issues

For this use case, a user complains that their queries as part of the dashboards are slow, and you want to identify the associated queries. These queries might not be part of the top queries. To isolate these queries, you can either choose Completed queries or All queries from the drop-down menu and specify the time window by choosing Custom.

You can also drill down to view the queries in a specific period, or filter for queries from one particular user by searching their user name.

You can also filter your queries by searching SQL query text.

As with the earlier charts, the size of a bar represents a relative duration of the runtime for a query. In this period, the highlighted query is the slowest. If you mouse over a bar in the Gantt chart, it provides helpful information about the query such as query ID, part of the query text, and runtime. To view details about a specific query, choose Query ID.

Identifying systemic query performance problems

For this use case, many of your users are complaining about longer-than-normal query runtimes. You want to diagnose what is happening in your cluster. You can customize your time window and switch to the graph view, which helps you correlate longer runtimes with what is happening in the cluster. As the following Gantt chart and CPU utilization graph show, many queries were running at that time, and CPU utilization almost reached 100%.

The concurrency scaling feature of Amazon Redshift could have helped maintain consistent performance throughout the workload spike.

High CPU utilization

You can correlate query performance with cluster performance and highlight a given metric, such as CPU utilization, to see which queries were running at that time.

Monitoring workload performance

You can get a detailed view of your workload’s performance by looking at the Workload execution breakdown chart. You can find out how long it took to plan, wait, and execute your workload. You can also view time spent in operations such as INSERT, UPDATE, DELETE, COPY, UNLOAD, or CTAS. The chosen time in the query history is stored when you navigate between pages.

In the preceding screenshot, you can see several waits in the workload breakdown graph. You can take advantage of concurrency scaling to process a burst of queries.

Correlating query throughput with query duration

You can view the trend of the performance of your queries, such as duration or execution time for your long, medium, and short queries, and correlate with the query throughput.

This information can offer insight into how well the cluster serves each query category with its current configuration.

Monitoring workload for your WLM queues

You can correlate query performance with cluster performance and highlight a given metric such as CPU utilization to see which queries were running at that time. You can view the average throughput, average duration, and average queue time by different WLM queues. Insight from this graph might help you tune your queries; for example, by assigning the right priority for your WLM queue or enabling concurrency scaling for your WLM queue.

Diagnosing and optimizing query performance

After you isolate a slow query, you can drill down to the execution details of the query by choosing Query ID. The following screenshot shows multiple query IDs for a query that has been rewritten to multiple queries.

The Query details page shows you the parent query and all rewritten queries.

You can also find out whether any of the rewritten queries ran on a concurrency scaling cluster.

You can view the query plans, execution statistics such as the cost of each step of the plan, and data scanned for the query.

You can also view the cluster metrics at the time the query ran on the cluster. The following screenshot shows the problematic steps for your query plan. Choosing a problematic step reveals in-place recommendations to improve this query.

Implementing Advisor recommendations

Amazon Redshift Advisor provides recommendations that could improve workload performance. Amazon Redshift uses machine learning to look at your workload and provide customized recommendations. Amazon Redshift monitors and offers guidance for improved performance on the following crucial areas:

  • Short query acceleration (SQA) – Checks for query patterns and reports the number of recent queries in which you can reduce latency and the daily queue time for SQA-eligible queries by enabling SQA, thus improving your query performance
  • Sort key for tables – Analyzes the workload for your data warehouse over several days to identify a beneficial sort key for your tables and makes sort key recommendations
  • Distribution key for tables – Analyzes your workload to identify the most appropriate distribution key for tables that can significantly benefit from a key distribution style

The following screenshot shows a recommendation to alter the distribution key for the table.
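Applying such a recommendation is a single DDL statement. The following is a sketch that assumes a hypothetical table sales_fact and recommended columns customer_id and sale_date; substitute the table and columns from your own Advisor recommendation.

-- Switch the table to KEY distribution on the recommended column
ALTER TABLE sales_fact ALTER DISTSTYLE KEY DISTKEY customer_id;

-- Sort key recommendations can be applied in a similar way
ALTER TABLE sales_fact ALTER SORTKEY (sale_date);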

Enabling concurrency scaling

To deliver optimal performance for your users, you can monitor user workloads and take action if you diagnose a problem. You can drill down to the query history for that specific time, and see several queries running at that time.

If you aren’t using concurrency scaling, your queries might be getting queued. You can also see that on the Workload concurrency tab. In the following screenshot, you can see that many queries are queued during that time because you didn’t enable concurrency scaling.

You can monitor all submitted queries and enable concurrency scaling when queued queries are increasing.

View a demo of Query Monitoring to learn more about the feature:

Conclusion

This post showed you the new features in the Amazon Redshift console that allow you to monitor user queries and help you diagnose performance issues in your user workload. The console also allows you to view your top queries by duration, filter failed and long-running queries, and drill down to view related rewritten queries and their execution details, which you can use to tune your queries. Start using the query monitoring features of the new Amazon Redshift console to monitor your user workload today!

 


About the Authors

Debu Panda, a senior product manager at AWS, is an industry leader in analytics, application platform, and database technologies. He has more than 20 years of experience in the IT industry and has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

 

 

Apurva Gupta is a user experience designer at AWS. She specializes in databases, analytics and AI solutions. Previously, she has worked with companies both big and small leading end-to-end design and helping teams set-up design-first product development processes, design systems and accessibility programs. 

 

 

 

Chao Duan is a software development manager at Amazon Redshift, where he leads the development team focusing on enabling self-maintenance and self-tuning with comprehensive monitoring for Redshift. Chao is passionate about building high-availability, high-performance, and cost-effective database to empower customers with data-driven decision making.

 

 

 

Sudhakar Reddy is a full stack software development engineer with Amazon Redshift. He is specialized in building cloud services and applications for Big data, Databases and Analytics. 

 

 

 

 

Zayd Simjee is a software development engineer with Amazon Redshift.

Federate Amazon Redshift access with Microsoft Azure AD single sign-on

Post Syndicated from Harshida Patel original https://aws.amazon.com/blogs/big-data/federate-amazon-redshift-access-with-microsoft-azure-ad-single-sign-on/

Recently, we helped a large enterprise customer who was building their data warehouse on Amazon Redshift, using Azure AD as a corporate directory. Their requirement was to enable data warehouse users to use their corporate credentials to query data in Redshift. Doing so not only provided a better user experience (users can use their corporate credentials and do not have to remember extra passwords), but it also made maintenance easier, because all corporate users are managed in a single place. The solution was to set up Azure AD federated access to Redshift.

You can use federation to centrally manage access to Amazon Redshift. This simplifies administration by allowing you to control user access at a central location and reducing the overhead of creating and maintaining database users. In addition to Active Directory Federation Service (ADFS), PingFederate, and Okta, Amazon Redshift also supports Microsoft Azure Active Directory (Azure AD) federation.

For more information about using ADFS with Amazon Redshift, see Federate Database User Authentication Easily with IAM and Amazon Redshift. For more information about integrating Azure AD, see Setting Up JDBC or ODBC Single Sign-on Authentication with Microsoft Azure AD.

This post illustrates how to set up federation using Azure AD and IAM. Azure AD manages the users and provides federated access to Amazon Redshift using IAM. You do not need to create separate database users in Amazon Redshift with this setup.

Solution overview

The solution to configure the federation between Azure AD and IAM to allow seamless access to Amazon Redshift using an SQL client contains the following components:

  • Azure AD serves as the identity provider for user authentication.
  • AWS serves as the service provider for authorization.
  • Security Assertion Markup Language (SAML 2.0) simplifies the interaction between identity provider and service provider.
  • AWS Security Token Service (AWS STS) provides temporary credentials to connect to Amazon Redshift.

The following diagram illustrates the architecture of the solution.

The solution contains the following steps:

  1. You configure a JDBC or ODBC driver in SQL client to use Azure AD federation.
  2. The client makes an authentication request to the enterprise application in Azure using Azure AD credentials.
  3. The Azure enterprise application queries Azure AD and generates a SAML response, which includes the IAM roles assigned to the user.
  4. The client receives the SAML response.
  5. The SAML assertion goes to the AWS federation endpoint, which invokes the AssumeRoleWithSAML API of AWS STS and generates temporary IAM credentials.
  6. You use the temporary credentials to connect to the Amazon Redshift cluster.

Prerequisites

This blog post assumes that you have the following:

On Azure AD side:

On Amazon Redshift side:

Walkthrough overview

This walkthrough consists of the following three sections:

  1. Setting up the Azure enterprise non-gallery application using single sign-on (SSO) with SAML.
  2. Setting up the IAM provider and roles, which includes the following steps:
    • Creating the SAML identity provider.
    • Creating an IAM role for access to the Amazon Redshift cluster.
    • Creating an IAM provider and an IAM role to use SAML-based federation.
    • Testing the SSO setup.
  3. Configuring the JDBC client to use Azure AD user credentials to log in to the Amazon Redshift cluster. This post uses a JDBC client. However, you can use the same setup to support ODBC clients.

Setting Up Azure Enterprise Application with SAML single sign on

To set up the Azure enterprise application, complete the following steps:

  1. Log in to Azure Portal.
  2. Choose Enterprise applications.
  3. Choose New application.
  4. Choose Non-gallery application.
  5. For Name, enter Redshift.
  6. Choose Add.
  7. Under Manage¸ choose Single sign-on.
  8. In the Basic SAML Configuration section, for Entity ID and Reply URL, enter https://signin.aws.amazon.com/saml.
  9. In the User Attributes & Claims section, choose Edit.
  10. For Additional claims, Add new claim with the following values (if they do not exist):
    • For http://schemas.microsoft.com/ws/2008/06/identity/claims/groups, choose user.groups [SecurityGroup].
    • For https://redshift.amazon.com/SAML/Attributes/RoleSessionName, choose user.userprincipalname.
    • For https://redshift.amazon.com/SAML/Attributes/DbGroups, choose user.assignedroles
    • For https://redshift.amazon.com/SAML/Attributes/DbUser, choose user.userprincipalname.
    • For https://redshift.amazon.com/SAML/Attributes/AutoCreate, enter true. Note that the values of DbUser and DbGroups must be lowercase, begin with a letter, contain only alphanumeric characters, underscore (‘_’), plus sign (‘+’), dot (‘.’), at (‘@’), or hyphen (‘-‘), and be less than 128 characters.
  11. In addition to the claims added in the previous step, add a new claim with the following values:
    • For Manage claim, for Name, enter Role.
    • For Namespace, enter https://aws.amazon.com/SAML/Attributes.
    • For Source, select Attribute.
    • For Source attribute, enter arn:aws:iam::<yourAWSAccount>:role/AzureSSO,arn:aws:iam::<yourAWSAccount>:saml-provider/AzureADProvider. Ensure the Role claim is configured properly, and make sure you replace <yourAWSAccount> with your own AWS account ID. You set up the role AzureSSO and the provider AzureADProvider in AWS Identity and Access Management (IAM) later in this post.
  12. In the Single sign-on section, under SAML Signing Certificate, for Federation Metadata XML, choose Download and save the file locally. Often, the name of the file is the same as the application name; for example, Redshift.xml. You use this file to configure the IAM identity provider in the next section.
  13. On the App registrations page, choose the application Redshift. If you don’t see your application in the list, choose All apps from the drop-down menu and search for it. Register it if it is not registered.
  14. Record the Application (client) ID and Directory (tenant) ID. You use these values in the JDBC connection when you connect to Amazon Redshift.
  15. Under API permissions, choose Add a permission.
  16. Choose Microsoft Graph.
  17. Choose Delegated permissions.
  18. For User, choose User.Read.
  19. Choose Add permission. This allows the Amazon Redshift enterprise application to read the user profile and perform login using SSO.
  20. Under Security, choose Permissions.
  21. Add Users/groups to the application or grant universal admin consent for the entire organization.
  22. Choose Certificates & secrets.
  23. Generate the client secret key.
  24. Record the client secret key to use when you configure the JDBC connection.

You have now set up the enterprise application in Azure. You are ready to set up the IAM provider and role to use this Azure federation.

Setting Up IAM Identity Provider and Roles in AWS

An IAM identity provider is an entity in IAM that describes an external identity provider (IdP) service that supports the SAML 2.0 standard. You use an IAM identity provider when you want to establish trust between AWS and a SAML-compatible IdP such as Azure AD. It is important to name this identity provider AzureADProvider to match the SAML claims you created earlier.

For this walkthrough, you create the following in IAM:

  • IAM SAML identity provider AzureADProvider
  • IAM role AzureSSO
  • IAM policy redshiftAccessPolicy, which you attach to the role AzureSSO to provide permission to access the Amazon Redshift cluster

Creating the IAM SAML identity provider

To create your IAM SAML identity provider, complete the following steps:

  1. Sign in to the AWS Management Console as an admin.
  2. Under Security, Identity, & Compliance, choose IAM.
  3. Choose Identity providers.
  4. Choose Create Provider.
  5. On the Configure Provider page, for Provider Type, choose SAML.
    • For Provider Name, enter AzureADProvider.
    • For Metadata Document, upload the federation metadata XML file (for example, Redshift.xml) that you downloaded earlier.
  6. Choose Next Step.
  7. Choose Create.

Creating the IAM role for access to the Amazon Redshift cluster

You now create a role for SAML-based federation. The name of this role needs to match what you named the SAML claim you created earlier: AzureSSO. Complete the following steps:

  1. On the IAM console, choose Roles.
  2. Choose Create role.
  3. You are directed to the Create role page, where you can choose from several different options. To set up Azure AD authentication, choose SAML 2.0 federation.
  4. Under Choose a SAML 2.0 provider:
    • For SAML provider, choose AzureADProvider from the drop down.
    • Select Allow programmatic and AWS Management Console access.
    • For Attribute, keep the default SAML:aud.
    • For Value, keep the default https://signin.aws.amazon.com/saml.
    • For Condition, you can optionally add conditions on who can use SSO. For this walkthrough, do not add any conditions.
  5. Choose Next: Permissions. You attach the policy that provides permission to Amazon Redshift after you create the policy.
  6. Choose Next: Tags. For Add tags (optional), you can optionally add key-value tags to help better organize, track, or control access for this role. For this post, do not add any tags.
  7. Choose Next: Review.
  8. On the Review page, for Role name, enter AzureSSO.
    • For Role description, you can optionally enter a brief description of what your role does.
    • For Trusted entities, verify the ARN of the provider you specified earlier is correct.
  9. Choose Create role.

Creating the IAM Amazon Redshift access policy

In the following steps, you allow the role AzureSSO to generate an AWS STS token to connect to Amazon Redshift by adding an IAM permissions policy redshiftAccessPolicy to the role.

  1. On the IAM console, choose Policies.
  2. Choose Create policy.
  3. On the Create policy page, choose JSON.
  4. For the policy, enter the following JSON:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "redshift:DescribeClusters"
                ],
                "Resource": [
                    "arn:aws:redshift:<YOUR-REGION>:<AWS-ACCOUNT-NUMBER>:cluster:<YOUR-REDSHIFT-CLUSTER-NAME>"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "redshift:GetClusterCredentials"
                ],
                "Resource": [
                    "arn:aws:redshift:<YOUR-REGION>:<AWS-ACCOUNT-NUMBER>:dbname:<YOUR-REDSHIFT-CLUSTER-NAME>/dev",
                    "arn:aws:redshift:<YOUR-REGION>:<AWS-ACCOUNT-NUMBER>:dbuser:<YOUR-REDSHIFT-CLUSTER-NAME>/${redshift:DbUser}"
                ],
                "Condition": {
                    "StringEquals": {
                        "aws:userid": "<YOUR-ROLE-ID>:${redshift:DbUser}"
                    }
                }
            },
            {
                "Effect": "Allow",
                "Action": [
                    "redshift:CreateClusterUser"
                ],
                "Resource": [
                    "arn:aws:redshift:<YOUR-REGION>:<AWS-ACCOUNT-NUMBER>:dbuser:<YOUR-REDSHIFT-CLUSTER-NAME>/${redshift:DbUser}"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "redshift:JoinGroup"
                ],
                "Resource": [
                    "arn:aws:redshift:<YOUR-REGION>:<AWS-ACCOUNT-NUMBER>:dbgroup:<YOUR-REDSHIFT-CLUSTER-NAME>/<YOUR-DB-GROUP>"
                ]
            }
        ]
    }

    In the preceding code, replace the placeholders with the appropriate values. The following table provides an explanation of each placeholder.

  • <YOUR-REGION> – The Region hosting your solution.
  • <AWS-ACCOUNT-NUMBER> – Your AWS account number.
  • <YOUR-REDSHIFT-CLUSTER-NAME> – Your Amazon Redshift cluster name; for example, cluster-1.
  • <YOUR-ROLE-ID> – The unique identifier of the role AzureSSO that you created earlier. You can run aws iam get-role --role-name AzureSSO and use the RoleId value from the output. For more information, see IAM Identifiers.
  • <YOUR-DB-GROUP> – The database group that the user can join. For this post, use developer_grp (see the note after these steps about creating the group in the cluster).

 

  1. Choose Review policy.
  2. In the Review policy section, for Name, enter the name of your policy; for example, redshiftAccessPolicy.
  3. For Description, you can optionally enter a brief description of what the policy does.
  4. For Summary, review your policy components and make sure to resolve any warnings or errors.
  5. Choose Create policy.
  6. Add this policy to the IAM role AzureSSO. For instructions, see Adding IAM Identity Permissions (Console).
    • Choose Managed IAM Policy.
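Note that the database group referenced in the policy (developer_grp in this post) must already exist in the Amazon Redshift cluster before federated users can join it. The following is a minimal sketch, run as a superuser or the database owner; the granted privilege is only an example, so adjust it to your own access requirements.

-- Create the group referenced by the DbGroups SAML attribute and the IAM policy
CREATE GROUP developer_grp;

-- Example grant so that group members can query existing tables in the public schema
GRANT SELECT ON ALL TABLES IN SCHEMA public TO GROUP developer_grp;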

Testing the SSO setup

You can now test the SSO setup. Complete the following steps.

  1. On the Azure Portal, for your Amazon Redshift application, choose Single sign-on.
  2. Choose Test.
  3. Choose Sign in as current user.

If the setup is correct, this brings you to the AWS Management Console (it might open in a new tab in some browsers). The following screenshot shows that the user test is signed in using the assumed role AzureSSO.

After you verify the SSO setup, you are ready to connect the SQL client to Amazon Redshift using Azure AD federation.

Setting up your JDBC SQL client to use Azure AD federation

You can use any client tool that supports an Amazon Redshift JDBC or ODBC driver to connect to Amazon Redshift using Azure SSO. For this post, use the SQL Workbench/J client to connect to Amazon Redshift using JDBC. For instructions on installing SQL Workbench/J, see Connect to Your Cluster by Using SQL Workbench/J.

To connect to the Amazon Redshift cluster using your Azure AD credentials, complete the following steps. If you are using another client, you can pass all these parameters in the JDBC URL as a URL parameter.

To use Azure AD with JDBC, the Amazon Redshift JDBC driver must be version 1.2.37.1061 or later. To use Azure AD with ODBC, the Amazon Redshift ODBC driver must be version 1.4.10.1000 or later. For this walkthrough, use the driver with AWS SDK.

  1. Start SQLWorkbenchJ.
  2. On the Select Connection Profile page, choose Add Profile Group.
  3. Choose New Connection Profile. This opens a new profile in the profile group you created. You can give it an appropriate name; this post uses Azure.
  4. For Driver, choose your Amazon Redshift driver.
  5. The next step is to construct the JDBC URL and enter it in the URL field. Use the following pattern: jdbc:redshift:iam://<your-cluster-identifier-here>:<your-cluster-region-here>/<your-DB-name-here>. Replace <your-cluster-identifier-here>, <your-cluster-region-here>, and <your-DB-name-here> with the values matching your Amazon Redshift cluster. This post uses cluster-identifier, us-west-1, and dev.
  6. Choose Extended Properties.

Add the following key-value pairs in the Extended Properties dialog box of SQL Workbench/J:

  • plugin_name – To use Azure AD federation, use com.amazon.redshift.plugin.AzureCredentialsProvider.
  • idp_tenant – The tenant name of your company configured on your IdP (Azure). This value can be either the tenant name or the tenant unique ID with hyphens. For this walkthrough, use the tenant ID that you recorded earlier.
  • client_secret – The client secret value that you created earlier.
  • client_id – The client ID (with hyphens) of the Redshift application. Use the client ID that you recorded earlier.

If you completed these steps correctly, you can connect to the Amazon Redshift cluster using your Azure AD credentials.
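As a quick sanity check after connecting, you can confirm that the session is mapped to your federated identity. The following sketch assumes the DbUser and AutoCreate claims configured earlier; the exact user name depends on your Azure user principal name.

-- Should return your Azure AD user principal name
SELECT current_user;

-- Confirms that the federated user was auto-created in the cluster
SELECT usename, usesysid
FROM pg_user
WHERE usename = current_user;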

Summary

In this post, we provided a step-by-step guide to configure and use Azure AD as your IdP to enable federation to an Amazon Redshift cluster.

 


About the Authors

Harshida Patel is a Data Warehouse Specialist Solutions Architect with AWS.

 

 

 

 

Pir Abdul Rasool Qureshi is a Solutions Architect at AWS. He enjoys solving complex customer problems in databases and analytics and delivering successful outcomes. Outside of work, he loves to spend time with his family, watch movies, and play cricket whenever possible.

 

 

New – Amazon Keyspaces (for Apache Cassandra) is Now Generally Available

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-amazon-keyspaces-for-apache-cassandra-is-now-generally-available/

We introduced Amazon Managed Apache Cassandra Service (MCS) in preview at re:Invent last year. In the few months that passed, the service introduced many new features, and it is generally available today with a new name: Amazon Keyspaces (for Apache Cassandra).

Amazon Keyspaces is built on Apache Cassandra, and you can use it as a fully managed, serverless database. Your applications can read and write data from Amazon Keyspaces using your existing Cassandra Query Language (CQL) code, with little or no changes. For each table, you can select the best configuration depending on your use case:

  • With on-demand, you pay based on the actual reads and writes you perform. This is the best option for unpredictable workloads.
  • With provisioned capacity, you can reduce your costs for predictable workloads by configuring capacity settings up front. You can also further optimize costs by enabling auto scaling, which updates your provisioned capacity settings automatically as your traffic changes throughout the day.

Using Amazon Keyspaces
One of the first “serious” applications I built as a kid was an archive for my books. I’d like to rebuild it now as a serverless API, using:

With Amazon Keyspaces, your data is stored in keyspaces and tables. A keyspace gives you a way to group related tables together. In the blog post for the preview, I used the console to configure my data model. Now, I can also use AWS CloudFormation to manage my keyspaces and tables as code. For example I can create a bookstore keyspace and a books table with this CloudFormation template:

AWSTemplateFormatVersion: '2010-09-09'
Description: Amazon Keyspaces for Apache Cassandra example

Resources:

  BookstoreKeyspace:
    Type: AWS::Cassandra::Keyspace
    Properties: 
      KeyspaceName: bookstore

  BooksTable:
    Type: AWS::Cassandra::Table
    Properties: 
      TableName: books
      KeyspaceName: !Ref BookstoreKeyspace
      PartitionKeyColumns: 
        - ColumnName: isbn
          ColumnType: text
      RegularColumns: 
        - ColumnName: title
          ColumnType: text
        - ColumnName: author
          ColumnType: text
        - ColumnName: pages
          ColumnType: int
        - ColumnName: year_of_publication
          ColumnType: int

Outputs:
  BookstoreKeyspaceName:
    Description: "Keyspace name"
    Value: !Ref BookstoreKeyspace # Or !Select [0, !Split ["|", !Ref BooksTable]]
  BooksTableName:
    Description: "Table name"
    Value: !Select [1, !Split ["|", !Ref BooksTable]]

If you don’t specify a name for a keyspace or a table in the template, CloudFormation generates a unique name for you. Note that generated keyspace and table names may contain uppercase characters, which are outside the usual Cassandra conventions, so you need to put those names between double quotes when using Cassandra Query Language (CQL).
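For example, to query a table whose name was generated by CloudFormation, you would quote both the keyspace and the table name in CQL. The identifiers below are hypothetical placeholders for generated names.

-- Double quotes preserve the case of generated names
SELECT * FROM "BookstoreKeyspaceA1B2C3"."BooksTableD4E5F6";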

When the creation of the stack is complete, I see the new bookstore keyspace in the console:

Selecting the books table, I have an overview of its configuration, including the partition key, the clustering columns, and all the columns, and the option to change the capacity mode for the table from on-demand to provisioned:

For authentication and authorization, Amazon Keyspaces supports AWS Identity and Access Management (IAM) identity-based policies, that you can use with IAM users, groups, and roles. Here’s a list of actions, resources, and conditions that you can use in IAM policies with Amazon Keyspaces. You can now also manage access to resources based on tags.

You can use IAM roles using AWS Signature Version 4 Process (SigV4) with this open source authentication plugin for the DataStax Java driver. In this way you can run your applications inside an Amazon Elastic Compute Cloud (EC2) instance, a container managed by Amazon ECS or Amazon Elastic Kubernetes Service, or a Lambda function, and leverage IAM roles for authentication and authorization to Amazon Keyspaces, without the need to manage credentials. Here’s a sample application that you can test on an EC2 instance with an associated IAM role giving access to Amazon Keyspaces.

Going back to my books API, I create all the resources I need, including a keyspace and a table, with the following AWS Serverless Application Model (SAM) template.

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Sample Books API using Cassandra as database

Globals:
  Function:
    Timeout: 30

Resources:

  BookstoreKeyspace:
    Type: AWS::Cassandra::Keyspace

  BooksTable:
    Type: AWS::Cassandra::Table
    Properties: 
      KeyspaceName: !Ref BookstoreKeyspace
      PartitionKeyColumns: 
        - ColumnName: isbn
          ColumnType: text
      RegularColumns: 
        - ColumnName: title
          ColumnType: text
        - ColumnName: author
          ColumnType: text
        - ColumnName: pages
          ColumnType: int
        - ColumnName: year_of_publication
          ColumnType: int

  BooksFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: BooksFunction
      Handler: books.App::handleRequest
      Runtime: java11
      MemorySize: 2048
      Policies:
        - Statement:
          - Effect: Allow
            Action:
            - cassandra:Select
            Resource:
              - !Sub "arn:aws:cassandra:${AWS::Region}:${AWS::AccountId}:/keyspace/system*"
              - !Join
                - ""
                - - !Sub "arn:aws:cassandra:${AWS::Region}:${AWS::AccountId}:/keyspace/${BookstoreKeyspace}/table/"
                  - !Select [1, !Split ["|", !Ref BooksTable]] # !Ref BooksTable returns "Keyspace|Table"
          - Effect: Allow
            Action:
            - cassandra:Modify
            Resource:
              - !Join
                - ""
                - - !Sub "arn:aws:cassandra:${AWS::Region}:${AWS::AccountId}:/keyspace/${BookstoreKeyspace}/table/"
                  - !Select [1, !Split ["|", !Ref BooksTable]] # !Ref BooksTable returns "Keyspace|Table"
      Environment:
        Variables:
          KEYSPACE_TABLE: !Ref BooksTable # !Ref BooksTable returns "Keyspace|Table"
      Events:
        GetAllBooks:
          Type: HttpApi
          Properties:
            Method: GET
            Path: /books
        GetBookByIsbn:
          Type: HttpApi
          Properties:
            Method: GET
            Path: /books/{isbn}
        PostNewBook:
          Type: HttpApi
          Properties:
            Method: POST
            Path: /books

Outputs:
  BookstoreKeyspaceName:
    Description: "Keyspace name"
    Value: !Ref BookstoreKeyspace # Or !Select [0, !Split ["|", !Ref BooksTable]]
  BooksTableName:
    Description: "Table name"
    Value: !Select [1, !Split ["|", !Ref BooksTable]]
  BooksApi:
    Description: "API Gateway HTTP API endpoint URL"
    Value: !Sub "https://${ServerlessHttpApi}.execute-api.${AWS::Region}.amazonaws.com/"
  BooksFunction:
    Description: "Books Lambda Function ARN"
    Value: !GetAtt BooksFunction.Arn
  BooksFunctionIamRole:
    Description: "Implicit IAM Role created for Books function"
    Value: !GetAtt BooksFunctionRole.Arn

In this template I don’t specify the keyspace and table names, and CloudFormation generates unique names automatically. The function IAM policy gives access to read (cassandra:Select) and write (cassandra:Modify) only to the books table. I am using the CloudFormation Fn::Select and Fn::Split intrinsic functions to get the table name. The driver also needs read access to the system* keyspaces.

To use the authentication plugin for the DataStax Java driver that supports IAM roles, I write the Lambda function in Java, using the APIGatewayV2ProxyRequestEvent and APIGatewayV2ProxyResponseEvent classes to communicate with the HTTP API created by the API Gateway.

package books;

import java.net.InetSocketAddress;
import java.security.NoSuchAlgorithmException;
import java.util.Collections;
import java.util.List;
import java.util.HashMap;
import java.util.Map;
import java.util.StringJoiner;
import javax.net.ssl.SSLContext;

import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;

import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.*;

import software.aws.mcs.auth.SigV4AuthProvider;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.LambdaLogger;
import com.amazonaws.services.lambda.runtime.events.APIGatewayV2ProxyRequestEvent;
import com.amazonaws.services.lambda.runtime.events.APIGatewayV2ProxyResponseEvent;

public class App implements RequestHandler<APIGatewayV2ProxyRequestEvent, APIGatewayV2ProxyResponseEvent> {
    
    JSONParser parser = new JSONParser();
    String[] keyspace_table = System.getenv("KEYSPACE_TABLE").split("\\|");
    String keyspace = keyspace_table[0];
    String table = keyspace_table[1];
    CqlSession session = getSession();
    PreparedStatement selectBookByIsbn = session.prepare("select * from \"" + table + "\" where isbn = ?");
    PreparedStatement selectAllBooks = session.prepare("select * from \"" + table + "\"");
    PreparedStatement insertBook = session.prepare("insert into \"" + table + "\" "
    + "(isbn, title, author, pages, year_of_publication)" + "values (?, ?, ?, ?, ?)");
    
    public APIGatewayV2ProxyResponseEvent handleRequest(APIGatewayV2ProxyRequestEvent request, Context context) {
        
        LambdaLogger logger = context.getLogger();
        
        String responseBody;
        int statusCode = 200;
        
        String routeKey = request.getRequestContext().getRouteKey();
        logger.log("routeKey = '" + routeKey + "'");
        
        if (routeKey.equals("GET /books")) {
            ResultSet rs = execute(selectAllBooks.bind());
            StringJoiner jsonString = new StringJoiner(", ", "[ ", " ]");
            for (Row row : rs) {
                String json = row2json(row);
                jsonString.add(json);
            }
            responseBody = jsonString.toString();
        } else if (routeKey.equals("GET /books/{isbn}")) {
            String isbn = request.getPathParameters().get("isbn");
            logger.log("isbn: '" + isbn + "'");
            ResultSet rs = execute(selectBookByIsbn.bind(isbn));
            if (rs.getAvailableWithoutFetching() == 1) {
                responseBody = row2json(rs.one());
            } else {
                statusCode = 404;
                responseBody = "{\"message\": \"not found\"}";
            }
        } else if (routeKey.equals("POST /books")) {
            String body = request.getBody();
            logger.log("Body: '" + body + "'");
            JSONObject requestJsonObject = null;
            if (body != null) {
                try {
                    requestJsonObject = (JSONObject) parser.parse(body);
                } catch (ParseException e) {
                    e.printStackTrace();
                }
                if (requestJsonObject != null) {
                    int i = 0;
                    BoundStatement boundStatement = insertBook.bind()
                    .setString(i++, (String) requestJsonObject.get("isbn"))
                    .setString(i++, (String) requestJsonObject.get("title"))
                    .setString(i++, (String) requestJsonObject.get("author"))
                    .setInt(i++, ((Long) requestJsonObject.get("pages")).intValue())
                    .setInt(i++, ((Long) requestJsonObject.get("year_of_publication")).intValue())
                    .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
                    ResultSet rs = execute(boundStatement);
                    statusCode = 201;
                    responseBody = body;
                } else {
                    statusCode = 400;
                    responseBody = "{\"message\": \"JSON parse error\"}";
                }
            } else {
                statusCode = 400;
                responseBody = "{\"message\": \"body missing\"}";
            }
        } else {
            statusCode = 405;
            responseBody = "{\"message\": \"not implemented\"}";
        }
        
        Map<String, String> headers = new HashMap<>();
        headers.put("Content-Type", "application/json");
        
        APIGatewayV2ProxyResponseEvent response = new APIGatewayV2ProxyResponseEvent();
        response.setStatusCode(statusCode);
        response.setHeaders(headers);
        response.setBody(responseBody);
        
        return response;
    }
    
    private String getStringColumn(Row row, String columnName) {
        return "\"" + columnName + "\": \"" + row.getString(columnName) + "\"";
    }
    
    private String getIntColumn(Row row, String columnName) {
        return "\"" + columnName + "\": " + row.getInt(columnName);
    }
    
    private String row2json(Row row) {
        StringJoiner jsonString = new StringJoiner(", ", "{ ", " }");
        jsonString.add(getStringColumn(row, "isbn"));
        jsonString.add(getStringColumn(row, "title"));
        jsonString.add(getStringColumn(row, "author"));
        jsonString.add(getIntColumn(row, "pages"));
        jsonString.add(getIntColumn(row, "year_of_publication"));
        return jsonString.toString();
    }
    
    private ResultSet execute(BoundStatement bs) {
        final int MAX_RETRIES = 3;
        ResultSet rs = null;
        int retries = 0;

        do {
            try {
                rs = session.execute(bs);
            } catch (Exception e) {
                e.printStackTrace();
                session = getSession(); // New session
            }
        } while (rs == null && retries++ < MAX_RETRIES);
        return rs;
    }
    
    private CqlSession getSession() {
        
        System.setProperty("javax.net.ssl.trustStore", "./cassandra_truststore.jks");
        System.setProperty("javax.net.ssl.trustStorePassword", "amazon");
        
        String region = System.getenv("AWS_REGION");
        String endpoint = "cassandra." + region + ".amazonaws.com";
        
        System.out.println("region: " + region);
        System.out.println("endpoint: " + endpoint);
        System.out.println("keyspace: " + keyspace);
        System.out.println("table: " + table);
        
        SigV4AuthProvider provider = new SigV4AuthProvider(region);
        List<InetSocketAddress> contactPoints = Collections.singletonList(new InetSocketAddress(endpoint, 9142));
        
        CqlSession session;
                
        try {
            session = CqlSession.builder().addContactPoints(contactPoints).withSslContext(SSLContext.getDefault())
            .withLocalDatacenter(region).withAuthProvider(provider).withKeyspace("\"" + keyspace + "\"")
            .build();
        } catch (NoSuchAlgorithmException e) {
            session = null;
            e.printStackTrace();
        }
        
        return session;
    }
}

To connect to Amazon Keyspaces with TLS/SSL using the Java driver, I need to include a trustStore in the JVM arguments. When using the Cassandra Java Client Driver in a Lambda function, I can’t pass parameters to the JVM, so I pass the same options as system properties, and specify the SSL context when creating the CQL session with the  withSslContext(SSLContext.getDefault()) parameter. Note that I also have to configure the pom.xml file, used by Apache Maven, to include the trustStore file as a dependency.

System.setProperty("javax.net.ssl.trustStore", "./cassandra_truststore.jks");
System.setProperty("javax.net.ssl.trustStorePassword", "amazon");

Now, I can use a tool like curl or Postman to test my books API. First, I take the endpoint of the API from the output of the CloudFormation stack. At the beginning there are no books stored in the books table, and if I do an HTTP GET on the resource, I get an empty JSON list. For readability, I am removing all HTTP headers from the output.

$ curl -i https://a1b2c3d4e5.execute-api.eu-west-1.amazonaws.com/books

HTTP/1.1 200 OK
[]

In the code, I am using a PreparedStatement to run a CQL statement to select all rows from the books table. The names of the keyspace and of the table are passed to the Lambda function in an environment variable, as described in the SAM template above.

Let’s use the API to add a book, by doing an HTTP POST on the resource.

$ curl -i -d '{ "isbn": "978-0201896831", "title": "The Art of Computer Programming, Vol. 1: Fundamental Algorithms (3rd Edition)", "author": "Donald E. Knuth", "pages": 672, "year_of_publication": 1997 }' -H "Content-Type: application/json" -X POST https://a1b2c3d4e5.execute-api.eu-west-1.amazonaws.com/books

HTTP/1.1 201 Created
{ "isbn": "978-0201896831", "title": "The Art of Computer Programming, Vol. 1: Fundamental Algorithms (3rd Edition)", "author": "Donald E. Knuth", "pages": 672, "year_of_publication": 1997 }

I can check that the data has been inserted in the table using the CQL Editor in the console, where I select all the rows in the table.

I repeat the previous HTTP GET to get the list of the books, and I see the one I just created.

$ curl -i https://a1b2c3d4e5.execute-api.eu-west-1.amazonaws.com/books

HTTP/1.1 200 OK
[ { "isbn": "978-0201896831", "title": "The Art of Computer Programming, Vol. 1: Fundamental Algorithms (3rd Edition)", "author": "Donald E. Knuth", "pages": 672, "year_of_publication": 1997 } ]

I can get a single book by ISBN, because the isbn column is the primary key of the table and I can use it in the where condition of a select statement.

$ curl -i https://a1b2c3d4e5.execute-api.eu-west-1.amazonaws.com/books/978-0201896831

HTTP/1.1 200 OK
{ "isbn": "978-0201896831", "title": "The Art of Computer Programming, Vol. 1: Fundamental Algorithms (3rd Edition)", "author": "Donald E. Knuth", "pages": 672, "year_of_publication": 1997 }

If there is no book with that ISBN, I return a “not found” message:

$ curl -i https://a1b2c3d4e5.execute-api.eu-west-1.amazonaws.com/books/1234567890

HTTP/1.1 404 Not Found
{"message": "not found"}

It works! We just built a fully serverless API using CQL to read and write data using temporary security credentials, managing the whole infrastructure, including the database table, as code.

Available Now
Amazon Keyspaces (for Apache Cassandra) is ready for your applications; please see this table for regional availability. You can find more information on how to use Keyspaces in the documentation. In this post, I built a new application, but you can get lots of benefits by migrating your current tables to a fully managed environment. For migrating data, you can now use cqlsh as described in this post.

Let me know what you are going to use it for!

Danilo

Achieve finer-grained data security with column-level access control in Amazon Redshift

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/achieve-finer-grained-data-security-with-column-level-access-control-in-amazon-redshift/

Amazon Redshift is the most popular cloud data warehouse because it provides fast insights at a low cost. Customers can confidently run mission-critical workloads, even in highly regulated industries, because Amazon Redshift comes with out-of-the-box security and compliance. The security features, combined with the ability to easily analyze data in place and in open formats, compute and storage elasticity, and ease of use, are what make tens of thousands of customers choose Amazon Redshift.

Many organizations store sensitive data, commonly classified as personally identifiable information (PII) or sensitive personal information (SPI), in Amazon Redshift, and access to this data is restricted to specific personas in the organization. For example, your human resources, finance, sales, data science, and marketing departments may all have the required access privileges to view customer data, whereas only the finance department should have access to sensitive data like PII or payment card industry (PCI) data.

Previously, you could use views or AWS Lake Formation on Amazon Redshift Spectrum to manage such scenarios; however, this adds extra overhead in creating and maintaining views or Amazon Redshift Spectrum. A view-based approach is also difficult to scale and can lead to a lack of security controls. Amazon Redshift column-level access control is a new feature that supports access control at a column level for data in Amazon Redshift. You can use column-level GRANT and REVOKE statements to help meet your security and compliance needs, similar to managing any database object.

This post shows you how to set up Amazon Redshift column-level access control on tables, views, and materialized views.

Use Case

There are two tables that store customer demographic and account balance data. The finance department can see all customer data, while the sales department can only view and update the market segment and account balance data; the rest of the customer demographic data, such as customer name, phone, and nation, is considered PII and should have restricted access. This is a good use case for column-level access control to secure the PII data. Below is a simple entity relationship diagram for the two tables.

Prerequisites

Before trying out the illustration in this blog, note the following prerequisites:

  1. An Amazon Redshift cluster.
  2. A database user with permission to create tables, or a superuser.

Setting up the environment

To setup the environment and implement the use case, complete the following steps:

  1. Connect to your Amazon Redshift cluster using any SQL client of your choice, with a user that has permission to create tables or with a superuser.
  2. Create two tables with the following code:

    CREATE TABLE customer 
    (
      customerid       INT8 NOT NULL,
      customername     VARCHAR(25) NOT NULL,
      phone            CHAR(15) NOT NULL,
      nationid        INT4 NOT NULL,
      marketsegment    CHAR(10) NOT NULL,
      accountbalance   NUMERIC(12,2) NOT NULL
    );
    CREATE TABLE nation 
    (
      nationid    INT4 NOT NULL,
      nationname   CHAR(25) NOT NULL
    );

  3. Populate some sample data into the two tables with the following code:

    INSERT INTO customer VALUES
    (1, 'Customer#000000001', '33-687-542-7601', 3, 'HOUSEHOLD', 2788.52),
    (2, 'Customer#000000002', '13-806-545-9701', 1, 'MACHINERY', 591.98),
    (3, 'Customer#000000003', '13-312-472-8245', 1, 'HOUSEHOLD', 3332.02),
    (4, 'Customer#000000004', '23-127-851-8031', 2, 'MACHINERY', 9255.67),
    (5, 'Customer#000000005', '13-137-193-2709', 1, 'BUILDING', 5679.84)
    ;
    INSERT INTO nation VALUES
    (1, 'UNITED STATES'),
    (2, 'AUSTRALIA'),
    (3, 'UNITED KINGDOM')
    ;

  4. Create a view and a materialized view with the following code:

    CREATE OR REPLACE VIEW customer_vw AS SELECT customername, phone, marketsegment, accountbalance, CASE WHEN accountbalance < 1000 THEN 'low' WHEN accountbalance > 1000 AND accountbalance < 5000 THEN 'middle' ELSE 'high' END AS incomegroup FROM customer;
    
    CREATE MATERIALIZED VIEW customernation_mv AS SELECT customername, phone, nationname, marketsegment, sum(accountbalance) AS accountbalance FROM customer c INNER JOIN nation n ON c.nationid = n.nationid GROUP BY customername, phone, nationname, marketsegment;

  5. The purpose of the view customer_vw is to implement the business rule of customer income group categorization based on the customer dataset.
  6. Analytical dashboards frequently access this dataset by joining and aggregating the tables customer and nation, so the materialized view customernation_mv is created to speed up such queries significantly.
  7. Create users and grant table-level permissions to user finance, which represents finance department users. Note that the users below are created only for illustration purposes. We recommend you use AWS IAM federation to bring in your corporate users without creating them manually in Amazon Redshift. For more information, see https://docs.aws.amazon.com/redshift/latest/mgmt/redshift-iam-authentication-access-control.html#authentication.
    CREATE USER finance WITH password 'Abcd1234!';
    CREATE USER sales WITH password 'Abcd1234!'; 
    GRANT SELECT, UPDATE ON customer TO finance;
    GRANT SELECT ON customer_vw TO finance;
    GRANT SELECT ON customernation_mv TO finance;

  8. Note that user finance has SELECT and UPDATE permission on all columns of the customer table.
  9. Test and validate that user finance is able to view all data from the customer table, the customer_vw view, and the customernation_mv materialized view, and to update data in the customer table.
  10. Enter the following code:
    SET SESSION AUTHORIZATION 'finance';
    SELECT CURRENT_USER;
    SELECT * FROM customer;
    SELECT * FROM customer_vw;
    SELECT * FROM customernation_mv;
    UPDATE customer SET accountbalance = 1000 WHERE marketsegment = 'BUILDING';
    RESET SESSION AUTHORIZATION;

    Note that SQL statement SET SESSION AUTHORIZATION 'finance' is used to impersonate user finance in above code.

    Each select statement should return five rows and the update statement should return one row updated. See the following code:

    dev=# SET SESSION AUTHORIZATION 'finance';
    SET
    dev=> SELECT CURRENT_USER;
     current_user 
    --------------
     finance
    (1 row)
     
    dev=> SELECT * FROM customer;
     customerid |    customername    |      phone      | nationid | marketsegment | accountbalance 
    ------------+--------------------+-----------------+----------+---------------+----------------
              1 | Customer#000000001 | 33-687-542-7601 |        3 | HOUSEHOLD     |        2788.52
              2 | Customer#000000002 | 13-806-545-9701 |        1 | MACHINERY     |         591.98
              3 | Customer#000000003 | 13-312-472-8245 |        1 | HOUSEHOLD     |        3332.02
              4 | Customer#000000004 | 23-127-851-8031 |        2 | MACHINERY     |        9255.67
              5 | Customer#000000005 | 13-137-193-2709 |        1 | BUILDING      |        5679.84
    (5 rows)
     
    dev=> SELECT * FROM customer_vw;
        customername    |      phone      | marketsegment | accountbalance | incomegroup 
    --------------------+-----------------+---------------+----------------+-------------
     Customer#000000001 | 33-687-542-7601 | HOUSEHOLD     |        2788.52 | middle
     Customer#000000002 | 13-806-545-9701 | MACHINERY     |         591.98 | low
     Customer#000000003 | 13-312-472-8245 | HOUSEHOLD     |        3332.02 | middle
     Customer#000000004 | 23-127-851-8031 | MACHINERY     |        9255.67 | high
     Customer#000000005 | 13-137-193-2709 | BUILDING      |        5679.84 | high
    (5 rows)
     
    dev=> SELECT * FROM customernation_mv;
        customername    |      phone      |        nationname         | marketsegment | accountbalance 
    --------------------+-----------------+---------------------------+---------------+----------------
     Customer#000000005 | 13-137-193-2709 | UNITED STATES             | BUILDING      |        5679.84
     Customer#000000004 | 23-127-851-8031 | AUSTRALIA                 | MACHINERY     |        9255.67
     Customer#000000003 | 13-312-472-8245 | UNITED STATES             | HOUSEHOLD     |        3332.02
     Customer#000000002 | 13-806-545-9701 | UNITED STATES             | MACHINERY     |         591.98
     Customer#000000001 | 33-687-542-7601 | UNITED KINGDOM            | HOUSEHOLD     |        2788.52
    (5 rows)
     
    dev=> UPDATE customer SET accountbalance = 1000 WHERE marketsegment = 'BUILDING';
    UPDATE 1
    dev=>  
    dev=> RESET SESSION AUTHORIZATION;
    RESET

You have now successfully setup table level permissions for user finance to view and update all customer data.

Setting up Amazon Redshift column-level access control

Column-level access control can be enabled and disabled by using GRANT and REVOKE statements with the following syntax:

GRANT { { SELECT | UPDATE } ( column_name [, ...] ) [, ...] | ALL [ PRIVILEGES ] ( column_name [,...] ) }
ON { [ TABLE ] table_name [, ...] }
TO { username | GROUP group_name | PUBLIC } [, ...]
 
REVOKE { { SELECT | UPDATE } ( column_name [, ...] ) [, ...] | ALL [ PRIVILEGES ] ( column_name [,...] ) }
ON { [ TABLE ] table_name [, ...] }
FROM { username | GROUP group_name | PUBLIC } [, ...]
[ CASCADE | RESTRICT ]

To set up column-level privileges, complete the following steps:

  1. To determine which users have column-level access control, you can query the PG_ATTRIBUTE_INFO system view. Enter the following code:
    SELECT b.attacl, b.attname, c.relname FROM pg_catalog.pg_attribute_info b JOIN pg_class c ON c.oid=b.attrelid WHERE c.relname in ('customer','customer_vw','customernation_mv') AND b.attacl IS NOT NULL ORDER BY c.relname, b.attname;

  2. The query should return zero records as we have not implemented column-level access control yet.
  3. Grant user sales SELECT permission on columns marketsegment and accountbalance on table customer, view customer_vw and materialized view customernation_mv. We also grant UPDATE permission on column marketsegment and accountbalance on table customer by entering the following code:
    RESET SESSION AUTHORIZATION;
    GRANT SELECT (marketsegment, accountbalance) ON customer TO sales WITH GRANT OPTION;
    GRANT SELECT (marketsegment, accountbalance),UPDATE (marketsegment, accountbalance) ON customer TO sales;
    GRANT SELECT (marketsegment, accountbalance) ON customer_vw TO sales;
    GRANT SELECT (marketsegment, accountbalance) ON customernation_mv TO sales;

  4. The error message “Grant options are not supported for column privileges” should be returned for the first statement. This is because only a table’s owner or a superuser can grant column-level privileges, in order to maintain a simple security model.
  5. Validate that the above permissions have been granted with the following code:
    SELECT b.attacl, b.attname, c.relname FROM pg_catalog.pg_attribute_info b  JOIN pg_class c ON c.oid=b.attrelid WHERE c.relname in ('customer','customer_vw','customernation_mv') AND b.attacl IS NOT NULL ORDER BY c.relname, b.attname;

  6. The query should return six rows. See the following code:
    dev=# SELECT b.attacl, b.attname, c.relname FROM pg_catalog.pg_attribute_info b  JOIN pg_class c ON c.oid=b.attrelid WHERE c.relname in ('customer','customer_vw','customernation_mv') AND b.attacl IS NOT NULL ORDER BY c.relname, b.attname;
          attacl       |    attname     |      relname      
    -------------------+----------------+-------------------
     {sales=rw/fqdemo} | accountbalance | customer
     {sales=rw/fqdemo} | marketsegment  | customer
     {sales=r/fqdemo}  | accountbalance | customer_vw
     {sales=r/fqdemo}  | marketsegment  | customer_vw
     {sales=r/fqdemo}  | accountbalance | customernation_mv
     {sales=r/fqdemo}  | marketsegment  | customernation_mv
    (6 rows)

    The output above shows:
    Users: sales (attacl column)
    Permissions: read/write (attacl column value “rw”)
    On Column: accountbalance, marketsegment (attname column)
    Of table: customer (relname column)
    Granted by: fqdemo (attacl column)

    Users: sales (attacl column)
    Permissions: read (attacl column value “r”)
    On Column: accountbalance, marketsegment (attname column)
    Of table: customer_vw, customernation_mv (relname column)
    Granted by: fqdemo (attacl column)

  7. After you confirm that the column-level access controls are correct, run the following code as user sales to query the table customer, the view customer_vw, and the materialized view customernation_mv:
    SET SESSION AUTHORIZATION 'sales';
    SELECT CURRENT_USER;
    SELECT * FROM customer;
    SELECT * FROM customer_vw;
    SELECT * FROM customernation_mv;

  8. Each select statement should return a permission denied error because the user does not have permissions on all columns of the objects being queried. See the following code:
    dev=# SET SESSION AUTHORIZATION 'sales';
    SET
    dev=> SELECT CURRENT_USER;
     current_user 
    --------------
     sales
    (1 row)
     
    dev=> SELECT * FROM customer;
    ERROR:  permission denied for relation customer
    dev=> SELECT * FROM customer_vw;
    ERROR:  permission denied for relation customer_vw
    dev=> SELECT * FROM customernation_mv;
    ERROR:  permission denied for relation customernation_mv

  9. Query only the columns marketsegment and accountbalance from table customer, view customer_vw and materialized view customernation_mv with the following code:
    SELECT marketsegment, accountbalance FROM customer;
    SELECT marketsegment, accountbalance FROM customer_vw;
    SELECT marketsegment, accountbalance FROM customernation_mv;

  10. Each SELECT statement should return five rows because user sales has permission to query columns marketsegment and accountbalance. See the following code:
    dev=> SELECT marketsegment, accountbalance FROM customer;
     marketsegment | accountbalance 
    ---------------+----------------
     HOUSEHOLD     |        2788.52
     MACHINERY     |         591.98
     HOUSEHOLD     |        3332.02
     MACHINERY     |        9255.67
     BUILDING      |        1000.00
    (5 rows)
     
    dev=> SELECT marketsegment, accountbalance FROM customer_vw;
     marketsegment | accountbalance 
    ---------------+----------------
     HOUSEHOLD     |        2788.52
     MACHINERY     |         591.98
     HOUSEHOLD     |        3332.02
     MACHINERY     |        9255.67
     BUILDING      |        1000.00
    (5 rows)
     
    dev=> SELECT marketsegment, accountbalance FROM customernation_mv;
     marketsegment | accountbalance 
    ---------------+----------------
     MACHINERY     |        9255.67
     BUILDING      |        5679.84
     MACHINERY     |         591.98
     HOUSEHOLD     |        3332.02
     HOUSEHOLD     |        2788.52
    (5 rows)

  11. Update the accountbalance column with the following code:
    UPDATE customer SET accountbalance = 2000 WHERE marketsegment = 'BUILDING';
    SELECT accountbalance FROM customer WHERE marketsegment = 'BUILDING';

  12. The SELECT statement should return one row that shows the value 2000.00. See the following code:
    dev=> UPDATE customer SET accountbalance = 2000 WHERE marketsegment = 'BUILDING';
    UPDATE 1
    dev=> SELECT accountbalance FROM customer WHERE marketsegment = 'BUILDING';
     accountbalance 
    ----------------
            2000.00
    (1 row)

  13. Update the accountbalance column with condition nationid=1 by using the following code:
    UPDATE customer SET accountbalance = 3000 WHERE nationid = 1;

  14. The UPDATE statement should return a permission denied error because user sales does not have column-level privileges on column nationid, which is referenced in the WHERE clause.
  15. Query the record count grouped by nationid with the following code:
    SELECT COUNT(*) FROM customer GROUP BY nationid;

  16. The SELECT statement should return a permission denied error because user sales doesn’t have column-level privileges on column nationid, which is referenced in the GROUP BY clause.
  17. Note that column-level privileges are checked not only for columns in the select list, but also for columns referenced in the WHERE, ORDER BY, GROUP BY, HAVING, and other clauses of a query that require SELECT or UPDATE privileges on a column (see the example after this walkthrough).
  18. Remove column marketsegment from the column-level access control for user sales using the REVOKE command and observe the result. Enter the following code:
    RESET SESSION AUTHORIZATION;
    REVOKE SELECT (marketsegment) ON customer FROM sales;
    SET SESSION AUTHORIZATION 'sales';
    SELECT CURRENT_USER;
    SELECT marketsegment, accountbalance FROM customer;
    SELECT accountbalance FROM customer;

  19. As you can see, user sales is no longer able to view marketsegment from table customer.
    dev=> SELECT marketsegment, accountbalance FROM customer;
    ERROR:  permission denied for relation customer
    dev=> SELECT accountbalance FROM customer;
     accountbalance 
    ----------------
            2788.52
             591.98
            3332.02
            9255.67
            2000.00
    (5 rows)

  20. Enter the following code to query column marketsegment from view customer_vw:
    SELECT marketsegment FROM customer_vw;

  21. The statement should return five rows because user sales still has access to column marketsegment on the view, even though column-level privileges have been revoked from table customer. Views execute with the permissions of the view owner, so the view continues to work as long as its owner still has column-level or table-level privileges on the base tables used by the view. To prevent unauthorized access to the sensitive data, revoke the column-level privileges for user sales from the view as well.
  22. Revoke all permissions for user sales with the following code:
    RESET SESSION AUTHORIZATION;
    SELECT CURRENT_USER;
    REVOKE SELECT ON customernation_mv FROM sales;
    REVOKE SELECT ON customer_vw FROM sales;
    REVOKE SELECT ON customer FROM sales;
    REVOKE UPDATE ON customer FROM sales;

  23. Query the table, view and materialized view again with user sales using the following code:
    SET SESSION AUTHORIZATION 'sales';
    SELECT CURRENT_USER;
    SELECT marketsegment, accountbalance FROM customer;
    SELECT marketsegment, accountbalance FROM customer_vw;
    SELECT marketsegment, accountbalance FROM customernation_mv;

  24. A permission denied error should be returned for each statement, which shows that REVOKE removed all permissions.
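
    The following is a minimal sketch of the clause-level checks noted in step 17. Assuming the column-level grants from step 3 are still in place (that is, before the REVOKE statements in the later steps), the queries below fail even though every column in the select list is permitted, because nationid appears in the ORDER BY and HAVING clauses:

    SET SESSION AUTHORIZATION 'sales';
    -- Fails: nationid in the ORDER BY clause requires SELECT privilege on that column
    SELECT marketsegment, accountbalance FROM customer ORDER BY nationid;
    -- Fails: nationid referenced in the HAVING clause also requires SELECT privilege
    SELECT marketsegment, SUM(accountbalance) FROM customer GROUP BY marketsegment HAVING MAX(nationid) = 1;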

In summary, a simple GRANT statement enables column-level access control on an Amazon Redshift table, view, or materialized view, and a REVOKE statement removes that access. This eliminates the complexity of legacy view-based access control while still achieving fine-grained read and write access control.

Clean up

Once you are done with the above testing, you can remove the objects and users with the following code:

RESET SESSION AUTHORIZATION;
REVOKE SELECT ON customernation_mv FROM finance;
REVOKE SELECT ON customer_vw FROM finance;
REVOKE SELECT ON customer FROM finance;
REVOKE UPDATE ON customer FROM finance;
DROP VIEW customer_vw;
DROP MATERIALIZED VIEW customernation_mv;
DROP TABLE nation;
DROP TABLE customer;
DROP USER IF EXISTS sales;
DROP USER IF EXISTS finance;

Summary

Amazon Redshift is secure by default, and security doesn’t cost extra. It provides authentication (Active Directory, Okta, Ping Federate, and Azure AD) and federation, and comes pre-integrated with AWS IAM and AWS KMS. Since September 2019, it has also supported table-based access control for data in Amazon Redshift and column-level access control for data in Amazon S3 through Amazon Redshift Spectrum. Amazon Redshift now supports access control at the column level for local tables, eliminating the need to implement view-based access control or use another system.

This post showed you how easy it is to set up Amazon Redshift column-level access control. The use case in this post demonstrated how to confirm that you have fine-grained access control on the table, view, and materialized view. You can adopt this feature to support your business needs.

If you have any questions or suggestions, please leave a comment below.

 


About the Authors

BP Yau is a Data Warehouse Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate the Oracle Data Warehouse to Amazon Redshift and built the next generation big data analytics platform using AWS technologies.

 

 

 

Srikanth Sopirala is a Sr. Specialist Solutions Architect focused on Analytics at AWS. He is passionate about helping customers build scalable data and analytics solutions in the cloud.

Speed up your ELT and BI queries with Amazon Redshift materialized views

Post Syndicated from Juan Yu original https://aws.amazon.com/blogs/big-data/speed-up-your-elt-and-bi-queries-with-amazon-redshift-materialized-views/

The Amazon Redshift materialized views feature helps you achieve significantly faster query performance on repeated or predictable workloads, such as dashboard queries from Business Intelligence (BI) tools like Amazon QuickSight. It also speeds up and simplifies extract, load, and transform (ELT) data processing. You can use materialized views to store frequently used precomputations and seamlessly use them to achieve lower latency on subsequent analytical queries.

This post demonstrates how to create a materialized view, refresh it after data ingestion, and speed up your BI workload.

Setting up your sample datasets

This walkthrough uses the Amazon Customer Reviews Dataset. It is a public dataset stored in the us-east-1 Region. You will create the following three tables:

  • product_reviews – Contains customer reviews for a specific product
  • customer – Contains customer profile data
  • customer_address – Contains customer address information

The following diagram shows the relationship of the three tables.

To download the script and set up the tables, choose mv_blog.sql.

Creating and using materialized views

For this use case, your marketing team wants to build a report that shows how many customers per state like your products. You also want to drill down to each product category when needed.

In this first step, you create a regular view. See the following code:

CREATE VIEW v_reviews_byprod_and_state AS
SELECT PR.product_category,
       A.ca_state AS customer_state,
       count(PR.star_rating) AS cnt
FROM product_reviews PR,
     customer C,
     customer_address A
WHERE PR.customer_id = C.c_customer_sk
  AND C.c_current_addr_sk = A.ca_address_sk
  AND PR.marketplace = 'US'
GROUP BY 1,
         2;

The following code is a report to analyze the product review count per state:

SELECT customer_state,
       sum(cnt)
FROM v_reviews_byprod_and_state
GROUP BY 1
ORDER BY 2;

The following code is a report to analyze the product review count per state for specific categories:

SELECT customer_state,
       sum(cnt)
FROM v_reviews_byprod_and_state
WHERE product_category IN ('Home',
                           'Grocery')
GROUP BY 1
ORDER BY 2;

The preceding reports take approximately 4 seconds to run. As you sell more products and get more reviews, this elapsed time gradually gets longer. To speed up those reports, you can create a materialized view to precompute the count of reviews per product category and per state. See the following code:

CREATE MATERIALIZED VIEW mv_reviews_byprod_and_state AS
SELECT PR.product_category,
       A.ca_state AS customer_state,
       count(PR.star_rating) AS cnt
FROM product_reviews PR,
     customer C,
     customer_address A
WHERE PR.customer_id = C.c_customer_sk
  AND C.c_current_addr_sk = A.ca_address_sk
  AND PR.marketplace = 'US'
GROUP BY 1,
         2;

The following code shows the reports that analyze the product reviews against the materialized view:

SELECT customer_state,
       sum(cnt)
FROM mv_reviews_byprod_and_state
GROUP BY 1
ORDER BY 2;

SELECT customer_state,
       sum(cnt)
FROM mv_reviews_byprod_and_state
WHERE product_category IN ('Home',
                           'Grocery')
GROUP BY 1
ORDER BY 2;

The same reports against materialized views take less than 200 milliseconds because the new queries access precomputed joins, filters, grouping, and partial sums instead of the multiple, larger base tables.

Speeding up and simplifying ELT data processing

To achieve similar performance without the use of materialized views, many users use the CREATE TABLE AS (CTAS) command. However, as you update base tables with new data inserts, updates, or deletes, the CTAS tables become stale; you must recreate them to keep them up-to-date with the latest changes from the base tables. Now with Amazon Redshift materialized views, you can overcome this problem by efficiently and incrementally refreshing the materialized views with supported SQL. For example, the following code ingests another 10,000 reviews:

INSERT INTO product_reviews
SELECT   marketplace, 
  cast(customer_id as bigint) customer_id, 
  review_id, 
  product_id, 
  cast(product_parent as bigint) product_parent, 
  product_title, 
  star_rating, 
  helpful_votes, 
  total_votes, 
  vine, 
  verified_purchase, 
  review_headline, 
  review_body, 
  review_date, 
  year,
  product_category
FROM demo.products_reviews
WHERE review_date = '2015-07-01' LIMIT 10000;

Now the materialized view is out-of-date. To refresh the materialized view, enter the following code:

REFRESH MATERIALIZED VIEW mv_reviews_byprod_and_state;

Within 200 milliseconds, the materialized view is up-to-date again. Your report queries have the same consistent, fast performance.

The following screenshot is the query log that shows query performance. The log shows newer statements at the top.

The materialized view refresh is much faster because it’s incremental: Amazon Redshift only uses the new data to update the materialized view instead of recomputing the entire materialized view from the base tables. For more information, see REFRESH MATERIALIZED VIEW.

Materialized views also simplify and make ELT easier and more efficient. Without materialized views, you might create an ELT job and use CTAS to precompute the product analysis data. The ELT job recomputes this data after new data is ingested and stores the data in the precomputed product analysis table to meet the dashboard latency requirement.

In particular, the ELT job drops and recreates the precomputed product analysis table after each ingestion. See the following code:

BEGIN;
    DROP TABLE IF EXISTS latest_product_analysis;
    CREATE TABLE latest_product_analysis as SELECT ...;
END;

With materialized views, you just need to create the materialized view one time and refresh it to keep it up-to-date. To refresh materialized views after ingesting new data, add REFRESH MATERIALIZED VIEW to the ELT data ingestion scripts. Amazon Redshift automatically and incrementally brings the materialized view up-to-date.
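
As a minimal sketch (the staging table name below is a hypothetical placeholder), an ELT ingestion script might then look like the following, with the refresh appended as the final step:

BEGIN;
-- Ingest the newly arrived reviews from a hypothetical staging table
INSERT INTO product_reviews
SELECT * FROM staging.new_product_reviews;
COMMIT;

-- Incrementally bring the dependent materialized view up-to-date
REFRESH MATERIALIZED VIEW mv_reviews_byprod_and_state;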

Achieving faster performance for BI dashboards

You can use materialized views to help your BI team build a dashboard to analyze product trends.

For example, to create a materialized view to join customer and customer_address dimension tables and precompute reviews and ratings, enter the following code:

CREATE MATERIALIZED VIEW mv_product_analysis 
    sortkey(product_category, Customer_State, review_date) 
AS
SELECT PR.product_category,
       A.ca_state AS Customer_State,
       PR.review_date AS Review_Date,
       PR.product_title,
       COUNT(1) AS review_total,
       SUM(PR.star_rating) AS rating
FROM product_reviews PR,
     customer C,
     customer_address A
WHERE PR.customer_id = C.c_customer_sk
  AND C.c_current_addr_sk = A.ca_address_sk
  AND marketplace = 'US'
GROUP BY 1,2,3,4;

You access materialized views the same as you do a regular table. For this walkthrough, choose a materialized view as the source for an Amazon QuickSight dataset, as shown in the following screenshot.

You can preview the data of the materialized view in Amazon QuickSight to understand what information can be used to build the dashboard. The following screenshot shows the sample data of mv_product_analysis.

To track how many reviews customers post over time, use review_date as the X-axis and Sum(review_total) as the Y-axis. The following graph shows this visualization.

The following screenshot shows a complete dashboard “Product trend” that analyzes the top product category, product popularity by state, and more.

Because you are using materialized views, the product trend dashboard loads in seconds and is always up-to-date. You can gain the latest insights, understand customer purchase behavior, and identify business opportunities and optimizations.

You can compare the performance of materialized views with other possible alternatives, such as using regular views and using CTAS. The following graph shows the overall query execution for the product trend dashboard. Materialized views not only improve query performance by more than an order of magnitude compared to using a regular view, but also have low maintenance costs compared to using CTAS, because the incremental refresh time is proportional to the delta of changes in the base tables. In contrast, the CTAS recreate approach needs to process all the data in the base tables.

The following animated gif shows the actual response time for the product trend dashboard built using Amazon QuickSight in direct query mode.

Conclusion

This post showed how to create Amazon Redshift materialized views with one or more base tables to speed up both BI queries and ELT. You can easily build and maintain efficient data processing pipelines and seamlessly extend the low latency query execution benefits of materialized views to data analysis.

 


About the Authors

Juan Yu is a Data Warehouse Specialist Solutions Architect at AWS.

 

 

 

 

Jose Kunnackal John is principal product manager for Amazon QuickSight, AWS’ cloud-native, fully managed BI service. Jose started his career with Motorola, writing software for telecom and first responder systems. Later he was Director of Engineering at Trilibis Mobile, where he built a SaaS mobile web platform using AWS services. Jose is excited by the potential of cloud technologies and looks forward to helping customers with their transition to the cloud.

 

 

 

Serving Billions of Ads in Just 100 ms Using Amazon Elasticache for Redis

Post Syndicated from Rodrigo Asensio original https://aws.amazon.com/blogs/architecture/serving-billions-of-ads-with-amazon-elasticache-for-redis/

This post was co-written with Lucas Ceballos, CTO of Smadex

Introduction

Showing ads may seem to be a simple task, but it’s not. Showing the right ad to the right user is an incredibly complex challenge that involves multiple disciplines such as artificial intelligence, data science, and software engineering. Doing it one million times per second with a 100-ms constraint is even harder.

In the ad-tech business, speed and infrastructure costs are the keys to success. The less the final user waits for an ad, the higher the probability of that user clicking on the ad. Doing that while keeping infrastructure costs under control is crucial for business profitability.

About Smadex

Smadex is the leading mobile-first programmatic advertising platform, specifically built to deliver the best user acquisition performance and complete transparency.

Its state-of-the-art demand-side platform (DSP) technology provides advertisers with the tools they need to achieve their goals and ROI, with measurable results from web forms, post-app-install events, store visits, and sales.

Smadex advertising architecture

What does showing ads look like under the hood? At Smadex, our technology works based on the OpenRTB (Real-Time Bidding) protocol.

RTB is a means by which advertising inventory is bought and sold on a per-impression basis, via programmatic instantaneous auction, which is similar to financial markets.

To show ads, we participate in auctions, deciding in real time which ad to show and how much to bid, trying to optimize the cost of every impression.

High level diagram

  1. The final user browses the publisher’s website or app.
  2. Ad-exchange is called to start a new auction.
  3. Smadex receives the bid request and has to decide which ad to show and how much to offer in just 100 ms (and this is happening one million times per second).
  4. If Smadex won the auction, the ad must be sent and rendered on the publisher’s website or app.
  5. In the end, the user interacts with the ad sending new requests and events to Smadex platform.

Flow of data

As you can see in the previous diagram, showing ads is just one part of the challenge. After the ad is shown, the final user interacts with it in multiple ways, such as clicking it, installing an application, subscribing to a service, etc. This happens during a determined period that we call the “attribution window.” All of those interactions must be tracked and linked to the original bid transaction (using the request_id parameter).

Doing this is complicated: billions of bid transactions must be stored and available so that they can be quickly accessed every time the user interacts with the ad. The longer we store the transactions, the longer we can “wait” for an interaction to take place, and the better for our business and our clients, too.

Detailed diagram

Challenge #1: Cost

The challenge is: What kind of database can store billions of records per day, with at least a 30-day retention capacity (attribution window), support key-value access, and do all of that while spending as little as possible?

The answer is…none! Based on our research, all the available options that met the technical requirements were way out of our budget.

So…how to solve it? This is where creativity and the combination of different AWS services come into play.

We started to analyze the time dispersion of the events trying to find some clues. The interesting thing we spotted was that 90% of what we call “post-bid events” (impression, click, install, etc.) happened within one hour after the auction took place.

That means that we can process 90% of post-bid events by storing just one hour of bids.

Under our current workload, we participate in approximately 3.7 billion auctions per hour, generating 100 million bid records of 600 bytes each on average. This adds up to around 55 gigabytes per hour, a much more manageable amount of data to process.

Instead of thinking about one single database to store all the bid requests, we decided to split bids into two different categories:

  • Hot Bid: A request that took place within the last hour (small amount and frequently accessed)
  • Cold Bid: A request that took place more than one hour ago (huge amount and infrequently accessed)

Amazon ElastiCache for Redis is the best option to store 55 GB of data in memory, which gives us the ability to query in a key-value way with the lowest possible latency.

Hot Bids flow

Hot Bids flow diagram

  1. Every new bid is a hot bid by definition so it’s going to be stored in the hot bids Redis cluster.
  2. At the moment of the user interaction with the ad, the Smadex tracker component receives an HTTPS notification, including the bid request UUID that originated it.
  3. Based on the date of occurrence extracted from the received UUID, the tracker component can determine whether it’s looking for a hot bid or not. If it’s a hot bid, the tracker reads it directly from Redis by performing a key-value lookup query.

It’s been easy so far, but what do we do with the other 29 days and 23 hours we need to store?

Challenge #2: Performance

As we previously mentioned, cold bids are a huge number of infrequently accessed records, with only 10% of post-bid events pointing to them. That sounds like a good use case for an inexpensive and slower data store like Amazon S3.

Thanks to Amazon S3’s low storage prices, combined with the ability to query S3 objects directly using Amazon Athena, we were able to optimize our costs by storing and querying cold bids with a serverless architecture.

Cold Bids Flow

Cold Bids flow diagram

  1. Incoming bids are buffered by Fluentd and flushed to S3 every minute in JSON format. Every single file flushed to S3 contains all the bids processed by a specific EC2 instance during that minute.
  2. An AWS Lambda function is automatically triggered on every new PutObject event from S3. This function transforms the JSON records to Parquet format and saves them back to the S3 bucket, this time into a specific partition folder based on the file creation timestamp.
  3. As seen in the hot bids flow, the tracker component determines whether it’s looking for a hot or a cold bid based on the timestamp extracted from the request UUID. In this case, the cold bid is retrieved by running an Amazon Athena lookup query, leveraging partitions and the Parquet format to reduce the latency and the amount of data scanned as much as possible (see the example query after this list).
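
As a rough sketch only (the table, column, and partition names below are hypothetical, not the actual Smadex schema), such a cold-bid lookup could be an Athena query like the following, where the partition predicate derived from the UUID timestamp limits the data scanned:

SELECT bid_payload
FROM cold_bids
WHERE bid_hour = '2020-05-01-13'  -- partition column: pruning limits the scan to one hour of data
  AND request_id = 'f81d4fae-7dec-11d0-a765-00a0c91e6bf6'
LIMIT 1;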

Conclusion

Thanks to this combined approach using different technologies and a variety of AWS services we were able to extend our attribution window from 30 to 90 days while reducing the infrastructure costs by 45%.

 

 

How to track changes to secrets stored in AWS Secrets Manager using AWS Config and AWS Config Rules

Post Syndicated from Jerry Hayes original https://aws.amazon.com/blogs/security/how-to-track-changes-to-secrets-stored-in-aws-secrets-manager-using-aws-config-and-aws-config-rules/

On April 20th, AWS Config announced support for AWS Secrets Manager, making it easier to track configuration changes to the secrets you manage in AWS Secrets Manager. You can now use AWS Config to track changes to secrets’ metadata, such as the secret description and rotation configuration; relationships to other AWS resources, such as the KMS key used for secret encryption and the Lambda function used for secret rotation; and attributes such as tags associated with the secrets.

You can also leverage two new AWS Managed Config Rules to evaluate if your secrets’ configuration is in compliance with your organization’s security and compliance requirements, identify secrets that don’t conform to these standards, and receive notifications about them via Amazon Simple Notification Service (SNS). Once enabled, these rules will trigger every time a secret’s configuration changes.

  • secretsmanager-rotation-enabled-check: Checks whether or not secrets present in AWS Secrets Manager are configured for rotation. This rule also supports the maximumAllowedRotationFrequency parameter which, if specified, will compare the secret’s configured rotation frequency to the value set in the parameter.
  • secretsmanager-scheduled-rotation-success-check: Checks whether or not secrets present in AWS Secrets Manager configured for rotation have been rotated within their rotation schedule.

In this blog post, I walk you through two ways to use AWS Config rules to determine if your organization’s secrets are being managed in compliance with your security requirements:

  • Example 1: Drive rotation adoption by identifying secrets in a single account that aren’t configured for rotation. This maps to the first managed rule listed above.
  • Example 2: Drive compliance with your security standards across multiple AWS accounts by creating an AWS Config Aggregator, which allows you to collect configuration and compliance data from multiple accounts across multiple regions.

Example 1: Drive rotation adoption by identifying secrets that aren’t configured for rotation in a single account and region

Many organizations require regular secret rotation. Use the new managed rule secretsmanager-rotation-enabled-check to verify whether your secrets are configured for automatic rotation.

  1. From the AWS Config console, navigate to Settings and ensure that Recording is on. Under Resource types to record, turn on recording for all resources by checking the All resources box next to Record all resources supported in this region, as shown in Figure 1 below.

    Figure 1: Enable Recording

    Figure 1: Enable Recording

  2. To set up the rule, go to the Rules page in the AWS Config console and select Add rule, as shown in Figure 2.

    Figure 2: Add Rule

    Figure 2: Add Rule

  3. Search for secretsmanager-rotation-enabled-check in the search bar and select the rule that appears, as shown in Figure 3.

    Figure 3: Search for rule

    Figure 3: Search for rule

  4. In Figure 4, I use the name secretsmanager-rotation-enabled-check for the name of my rule. Trigger type is set to run upon changes to the resource’s configuration. For Scope of changes, you can monitor all applicable resources for this rule type or resources with specific tags. In my example, I am monitoring all secrets where the ENVIRONMENT tag is set to PRODUCTION. And finally, under Rule Parameters, I set maximumAllowedRotationFrequency to 30 days.

    Figure 4: Add managed rule

    Figure 4: Add managed rule

  5. In my example, I specify AWS-PublishSNSNotification as my Remediation action and enter the parameters for AutomationAssumeRole, Message, and TopicArn so that I can receive notifications from an Amazon SNS topic regarding non-compliant secrets, as shown in Figure 5 below. Setting a Remediation action is optional. Once the rule is set up the way you want it, select Save.
    Figure 5: Choose remediation action

    Figure 5: Choose remediation action

    Once you’ve saved the rule, it will evaluate your secrets every time there’s a change in the secret metadata, and you’ll receive an Amazon SNS notification about non-compliant secrets.

  6. In the AWS Config console, view your results by selecting Rules from the menu on the left. In Figure 6, secretsmanager-rotation-enabled-check shows that it has detected 1 noncompliant resource.

    Figure 6: View rule evaluation results

    Figure 6: View rule evaluation results

  7. Select secretsmanager-rotation-enabled-check and it provides a link to the Resource ID of the non-compliant secret, as shown in Figure 7.

    Figure 7: Detailed view of rule with noncompliant secret

    Figure 7: Detailed view of rule with noncompliant secret

Example 2: Drive security compliance across multiple AWS accounts in your AWS Organization by creating an AWS Config Aggregator

Next, I’ll show you how to use the AWS Config Aggregator to review how secrets are configured across all accounts and regions in your AWS Organization so you can see whether they’re in compliance with your organization’s security and compliance requirements. AWS Organizations helps you centrally govern your environment as you grow and scale your workloads on AWS.

NOTE: You must enable AWS Config and the AWS Config managed rules specific to secrets in all accounts and regions that you want to monitor before creating the aggregator. You can use AWS CloudFormation StackSets to enable AWS Config and provision rules across accounts and regions as described here.

  1. In this example, I create the aggregator in my AWS Organization’s master account. From the AWS Config console, select Aggregators from the left menu, then select Add aggregator, as shown in Figure 8.

    Figure 8: Add aggregator

    Figure 8: Add aggregator

  2. Select the check box next to Allow data replication, as shown in Figure 9 below. This provides the permission for your AWS Organization’s master account to access the resource configuration and compliance details for all the accounts and regions in your AWS Organization.

    Figure 9: Allow data replication

    Figure 9: Allow data replication

  3. Provide a name for the aggregator. In Figure 10, I’ve named mine MyOrganizationsSecrets. Select Add my organization, then Choose IAM role. Select Create a Role and enter a role name and then select Choose IAM role. The IAM role allows AWS Config to get the list of accounts in your AWS Organization.
    Figure 10: Enable data replication and configure aggregator

    Figure 10: Enable data replication and configure aggregator

    NOTE: If you do not have an organization configured in AWS Organizations, you can select Add individual account IDs and then either add account IDs manually or update a comma separated list of accounts.

  4. Select Choose IAM role. Ensure Create a role is selected and enter a unique name. In Figure 11, I’ve named my role aws-config-aggregator-role. Select Choose IAM role again to create the role and again to continue.

    Figure 11: Choose IAM role

    Figure 11: Choose IAM role

  5. Select the Regions from which you want to aggregate data. In Figure 12, I’ve selected the two regions in which my AWS Organization uses Secrets Manager.
    Figure 12: Select target regions for aggregation

    Figure 12: Select target regions for aggregation

    Once you’ve selected your regions, click Save.

  6. Select the aggregator you just created to see the Aggregated view. In Figure 13, I select MyOrganizationsSecrets. As noted on the console, an aggregator is an AWS Config resource type that collects AWS Config data from multiple accounts and regions. The data displayed in the dashboard is received from multiple aggregation sources and is refreshed at different intervals, so data might be delayed by a few minutes.

    Figure 13: Select aggregator

    Figure 13: Select aggregator

  7. In the Aggregated view shown in Figure 14 below, you can now see a dashboard view of all resources in your Organization, across all accounts and regions. On the top right, the Config rule compliance status shows that this organization has 11 compliant and 7 non-compliant rules. Below that, the Top 5 non-compliant rules list shows the rule name, the Region, the account number, and the number of non-compliant resources.
    Figure 14: Aggregated view

    Figure 14: Aggregated view

    You can drill down into this data to view all compliant and non-compliant secrets in all your organization’s accounts and regions, and you can work with individual account or secret owners to drive security compliance: ensuring all secrets are configured for rotation, all secrets meet your organization’s standard for rotation frequency, and secrets are rotated successfully.

  8. In Figure 15, I select secretsmanager-rotation-enabled-check for us-east-1 from the Top 5 non-compliant rules.

    Figure 15: Top 5 noncompliant rules

    Figure 15: Top 5 noncompliant rules

  9. The detail view in Figure 16 below shows the 5 non-compliant resources and their corresponding Resource IDs.

    Figure 16: Compliant and non-compliant secrets

    Figure 16: Compliant and non-compliant secrets

Summary

In this post, I showed you how to track and evaluate secret configuration using AWS Config and AWS Config Rules using the AWS Management Console. You can also do this using the AWS Config APIs or the AWS Command Line Interface (CLI).

This enables you to drive secrets management best practices in an individual account or across your AWS Organization. To get started managing secrets, open the Secrets Manager console. To learn more, read How to Store, Distribute, and Rotate Credentials Securely with Secret Manager or refer to the Secrets Manager documentation.

If you have comments about this post, submit them in the Comments section below. If you have questions about anything in this post, start a new thread on the Secrets Manager forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Jerry Hayes

Jerry Hayes is a Solutions Architect Manager on the World Wide Public Sector (WWPS) Solutions Architect (SA) team where he manages a high-performing team of Specialist SAs supporting National Security customers. He holds a Master’s degree from George Washington University and a Bachelor’s degree from Virginia Tech (Go Hokies!). Outside of work, Jerry enjoys spending time with his family, watching football, running, and traveling to new and exciting places.

Enhancing automated database continuous integration with AWS CodeBuild and Amazon RDS Database Snapshot

Post Syndicated from bobyeh original https://aws.amazon.com/blogs/devops/enhancing-automated-database-continuous-integration-with-aws-codebuild-and-amazon-rds-database-snapshot/

In major integration merges, it’s sometimes necessary to verify the changes against existing online data. Inspecting the changes with a cloned database gives us confidence to deploy to the production database. This post demonstrates how to use AWS CodeBuild and an Amazon RDS database snapshot to verify your code revisions in both the application layer and the underlying data layer, ensuring that your existing data works seamlessly with your revised code.

Making code revisions using continuous integration requires running periodic verification to ensure that your new deliverable works functionally and reliably. It’s easy to focus attention solely on the surface level changes made to the application layer. However, it’s important to remember to inspect the changes made to the underlying data layer too.

From the application layer, users modify the data model for different reasons. Any data model definition change in the application layer maps to a schema change in the database. For services backed by a relational database (RDBMS), a user might perform data definition language (DDL) operations directly against a database schema or rely on an object-relational mapping (ORM) library to migrate the schema to fit the application revision. These schema changes (CREATE, DROP, ALTER, TRUNCATE, etc.) can be very critical, especially for services serving real customers.

Performing proper verification and simulation for these changes mitigates the risk of bringing down services. After the changes are applied, fundamental operation testing (CRUD – CREATE, READ, UPDATE, DELETE) against the data models is mandatory; this leads to data manipulation language (DML) operations (INSERT, SELECT, UPDATE, DELETE, etc.). After all the necessary steps, a user can move on to the deployment stage.

About this page

  • Time to read: 6 minutes
  • Time to complete: 30 minutes
  • Cost to complete (estimated): Less than $1 for a 1-GB database snapshot and restored instance
  • Learning level: Advanced (300)
  • Services used: AWS CodeBuild, IAM, RDS

Solution overview

This example uses a buildspec file in CodeBuild. Set up a build project that points to a source control repository containing that buildspec file. The CodeBuild runtime environment restores the database server from an RDS snapshot. As an example, we restore the snapshot to an Amazon Aurora cluster through the AWS Command Line Interface (AWS CLI). After the database is restored, the build process starts to run your integration process, which is mock code in the buildspec definition. After the verification stage, CodeBuild drops the restored database.

 

Architecture diagram showing an overview of how we use CodeBuild to restore a database snapshot to verify and validate the new database schema change.

Prerequisites

The following components are required to implement this example:

Walkthrough

Follow these steps to execute the solution.

Prepare your build specification file

Before you begin, prepare your CodeBuild Build Specification file with following information:

  • db-cluster-identifier-prefix
  • db-snapshot-identifier
  • region-ID
  • account-ID
  • vpc-security-group-id

The db-cluster-identifier-prefix is used to name the temporary database cluster, suffixed with a timestamp. Make sure that this value does not overlap with any other databases. The db-snapshot-identifier points to the snapshot you want to run your application against. The region-ID and account-ID describe the account in which you are running. The vpc-security-group-id indicates the security group used by the CodeBuild environment and the temporary database.

YAML
version: 0.2
phases:
  install:
    runtime-versions:
      python: 3.7
  pre_build:
    commands:
      - pip3 install awscli --upgrade --user
      - export DATE=`date +%Y%m%d%H%M`
      - export DBIDENTIFIER=db-cluster-identifier-prefix-$DATE
      - echo $DBIDENTIFIER
      - aws rds restore-db-cluster-from-snapshot --snapshot-identifier arn:aws:rds:region-ID:account-ID:cluster-snapshot:db-snapshot-identifier --vpc-security-group-ids vpc-security-group-id --db-cluster-identifier $DBIDENTIFIER --engine aurora
      - while [ $(aws rds describe-db-cluster-endpoints --db-cluster-identifier $DBIDENTIFIER | grep -c available) -eq 0 ]; do echo "sleep 60s"; sleep 60; done
      - echo "Temp db ready"
      - export ENDPOINT=$(aws rds describe-db-cluster-endpoints --db-cluster-identifier $DBIDENTIFIER | grep "\"Endpoint\"" | grep -v "\-ro\-" | awk -F '\"' '{print $4}')
      - echo $ENDPOINT
  build:
    commands:
      - echo Build started on `date`
      - echo proceed db connection to $ENDPOINT
      - echo proceed db migrate update, DDL proceed here
      - echo proceed application test, CRUD test run here
  post_build:
    commands:
      - echo Build completed on `date`
      - echo $DBIDENTIFIER
      - aws rds delete-db-cluster --db-cluster-identifier $DBIDENTIFIER --skip-final-snapshot &

 

After you finish editing the file, name it buildspec.yml. Save it in the root directory with which you plan to build, then commit the file into your code repository.

  1. Open the CodeBuild console.
  2. Choose Create build project.
  3. In Project Configuration, enter the name and description for the build project.
  4. In Source, select the source provider for your code repository.
  5. In Environment image, choose Managed image, Ubuntu, and the latest runtime version.
  6. Choose the appropriate service role for your project.
  7. In the Additional configuration menu, select the VPC with your Amazon RDS database snapshots, as shown in the following screenshot, and then select Validate VPC Settings. For more information, see Use CodeBuild with Amazon Virtual Private Cloud.
  8. In Security Groups, select the security group needed for the CodeBuild environment to access your temporary database.
  9. In Build Specifications, select Use a buildspec file.

CodeBuild Project Additional Configuration - VPC

Grant permission for the build project

Follow these steps to grant permission.

  1. Navigate to the AWS Management Console Policies.
  2. Choose Create policy and select the JSON tab. To give CodeBuild access to the Amazon RDS resources in the pre_build stage, you must grant the rds:RestoreDBClusterFromSnapshot and rds:DeleteDBCluster actions. Follow the least-privilege guideline and limit the rds:DeleteDBCluster action to “arn:aws:rds:*:*:cluster:db-cluster-identifier-*”.
  3. Copy the following code and paste it into your policy:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "VisualEditor0",
          "Effect": "Allow",
          "Action": "rds:RestoreDBClusterFromSnapshot",
          "Resource": "*"
        },
        {
          "Sid": "VisualEditor1",
          "Effect": "Allow",
          "Action": "rds:DeleteDBCluster*",
          "Resource": "arn:aws:rds:*:*:cluster:db-cluster-identifier-*"
        }
      ]
    }
  4. Choose Review Policy.
  5. Enter a Name and Description for this policy, then choose Create Policy.
  6. After the policy is ready, attach it to your CodeBuild service role, as shown in the following screenshot.

Attach created policy to IAM role

Use database snapshot restore to launch the build process

  1. Navigate back to CodeBuild and locate the project you just created.
  2. Set an appropriate timeout and make sure that the project points to the correct branch of your repository.
  3. Choose Start Build.
  4. Open the Build Log to view the database cluster restored from your snapshot in the pre_build stage, as shown in the following screenshot.
    CodeBuild Project Build Log - pre_build stage
  5. In the build stage, use $ENDPOINT to point your application to this temporary database, as shown in the following screenshot.
    CodeBuild Project Build Log - build stage
  6. In the post_build stage, delete the cluster, as shown in the following screenshot.
    CodeBuild Project Build Log - post_build stage

Test your database schema change

After you set up this pipeline, you can begin to test your database schema changes within your application code. This example defines several steps in the build specification file to migrate the schema and run the latest application code. In this example, you can verify that all the modifications from the application fit the database.

YAML
build:
  commands:
    - echo Build started on `date`
    - echo proceed db connection to $ENDPOINT
    # run a script to apply your latest schema change
    - echo proceed db migrate update
    # start the latest code, and run your own testing
    - echo proceed application test
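
For illustration only (the table and columns below are hypothetical placeholders, not part of the sample project), the migrate and CRUD test steps above might run SQL such as the following against the restored cluster at $ENDPOINT:

-- Hypothetical schema change to verify against the restored snapshot
ALTER TABLE orders ADD COLUMN delivery_notes VARCHAR(255);

-- Basic CRUD checks against the changed schema
INSERT INTO orders (id, status, delivery_notes) VALUES (1000001, 'NEW', 'leave at door');
SELECT id, status, delivery_notes FROM orders WHERE id = 1000001;
UPDATE orders SET status = 'SHIPPED' WHERE id = 1000001;
DELETE FROM orders WHERE id = 1000001;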

After validation

After validating the database schema change in the preceding steps, choose a deployment strategy for production that aligns with your business goals and acceptance criteria.

Cleaning up

To avoid incurring future charges, delete the resources with the following steps:

  1. Open the CodeBuild console.
  2. Choose the project you created for this test.
  3. Choose Delete build project and enter delete to confirm the deletion.

Conclusion

In this post, you created a mechanism to set up a temporary database and limit access to it within the build runtime. The temporary database stands alone and is isolated. This mechanism can be applied to secure permission control for the database snapshot and to avoid breaking any existing environment. It applies to all available RDS database engines, including Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, and SQL Server. This provides options, without impacting any existing environments, for critical events triggered by major changes in the production database schema, or data format changes required by business decisions.

 

Accelerate Amazon Redshift Federated Query adoption with AWS CloudFormation

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/accelerate-amazon-redshift-federated-query-adoption-with-aws-cloudformation/

Amazon Redshift Federated Query allows you to combine the data from one or more Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL databases with data already in Amazon Redshift. You can also combine such data with data in an Amazon S3 data lake.

This post shows you how to set up Aurora PostgreSQL and Amazon Redshift with a 10 GB TPC-H dataset, and Amazon Redshift Federated Query using AWS CloudFormation. For more information about using Federated Query, see Build a Simplified ETL and Live Data Query Solution using Redshift Federated Query. You can use the environment you set up in this post to experiment with various use cases in the preceding post.

Benefits of using CloudFormation templates

The standard workflow of setting up Amazon Redshift Federated Query involves six steps. For more information, see Querying Data with Federated Query in Amazon Redshift. With a CloudFormation template, you can condense these manual procedures into a few steps listed in a text file. The declarative code in the file captures the intended state of the resources to create and allows you to automate the creation of AWS resources to support Amazon Redshift Federated Query. You can further enhance this template to become the single source of truth for your infrastructure.

A CloudFormation template acts as an accelerator. It helps you automate the deployment of technology and infrastructure in a safe and repeatable manner across multiple Regions and multiple accounts with the least amount of effort and time.

Architecture overview

The following diagram illustrates the solution architecture.

The CloudFormation templates provision the following components in the architecture:

  • VPC
  • Subnets
  • Route tables
  • Internet gateway
  • Amazon Linux Bastion host
  • Secrets
  • Aurora PostgreSQL cluster with TPC-H dataset preloaded
  • Amazon Redshift cluster with TPC-H dataset preloaded
  • Amazon Redshift IAM role with required permissions

Prerequisites

Before you create your resources in AWS CloudFormation, you must complete the following prerequisites:

  • Have an IAM user with sufficient permissions to interact with the AWS Management Console and related AWS services. Your IAM permissions must also include access to create IAM roles and policies via the CloudFormation template.
  • Create an Amazon EC2 key pair in the us-east-1 Region. Make sure that you save the private key; this is the only time you can do so. You use this key pair as an input parameter when you set up the CloudFormation stack.

Setting up the resources with AWS CloudFormation

This post provides a CloudFormation template as a general guide. You can review and customize it to suit your needs. Some of the resources that this stack deploys incur costs when in use.

To create these resources, complete the following steps:

  1. Sign in to the console.
  2. Choose the us-east-1 Region in which to create the stack.
  3. Choose Launch Stack:
  4. Choose Next. This automatically launches AWS CloudFormation in your AWS account with a template. It prompts you to sign in as needed. You can view the CloudFormation template from within the console.
  5. For Stack name, enter a stack name.
  6. For Session, leave as the default.
  7. For ec2KeyPair, choose the key pair you created earlier.
  8. Choose Next.
  9. On the next screen, choose Next.
  10. Review the details on the final screen and select I acknowledge that AWS CloudFormation might create IAM resources.
  11. Choose Create. Stack creation can take up to 45 minutes.
  12. After the stack creation is complete, in the Outputs section, record the value of the key for the following components, which you use in a later step:
  • AuroraClusterEndpoint
  • AuroraSecretArn
  • RedshiftClusterEndpoint
  • RedshiftClusterRoleArn

You are now ready to log in to both the Aurora PostgreSQL and Amazon Redshift cluster and run some basic commands to test them.

Logging in to the clusters using the Amazon Linux Bastion host

The following steps assume that you use a computer with an SSH client to connect to the Bastion host. For more information about connecting using various clients, see Connect to Your Linux Instance.

  1. Move the private key of the EC2 key pair (that you saved previously) to a location on your SSH client, where you are connecting to the Amazon Linux Bastion host.
  2. Change the permission of the private key using the following code, so that it’s not publicly viewable:
    chmod 400 <private key file name; for example, bastion-key.pem>
  3. On the Amazon EC2 console, choose Instances.
  4. Choose the Amazon Linux Bastion host that the CloudFormation stack created.
  5. Choose Connect.
  6. Copy the value for SSHCommand.
  7. On the SSH client, change the directory to the location where you saved the EC2 private key, and paste the SSHCommand value.
  8. On the console, open the Secrets Manager dashboard.
  9. Choose the secret secretAuroraMasterUser-*.
  10. Choose Retrieve secret value.
  11. Record the password under Secret key/value, which you use to log in to the Aurora PostgreSQL cluster.
  12. Choose the secret SecretRedshiftMasterUser.
  13. Choose Retrieve secret value.
  14. Record the password under Secret key/value, which you use to log in to the Amazon Redshift cluster.
  15. Log in to both the Aurora PostgreSQL and Amazon Redshift databases using the PSQL client. The CloudFormation template has already set up the PSQL client binaries on the Amazon Linux Bastion host.
  16. Enter the following code in the command prompt of the Bastion host (substitute <RedshiftClusterEndpoint> with the value from the AWS CloudFormation output):
    psql -h <RedshiftClusterEndpoint> -d dev -p 5439 -U fqdemo
  17. When prompted, enter the database user password you recorded earlier.
  18. Enter the following SQL command:
    select "table" from svv_table_info where schema='public';

    You should see the following eight tables as the output:

    dev=# select "table" from svv_table_info where schema='public';
     table   
    ----------
     orders
     customer
     region
     nation
     supplier
     part
     lineitem
     partsupp
    (8 rows)

  19. Launch another command prompt session of the Bastion host and enter the following code (substitute <AuroraClusterEndpoint> with the value from the AWS CloudFormation output):
    psql -h <AuroraClusterEndpoint> -d dev -p 5432 -U awsuser
  20. When prompted, enter the database user password you recorded earlier.
  21. Enter the following SQL command:
    select tablename from pg_tables where schemaname='public';

    You should see the following eight tables as the output:

    dev=# select tablename from pg_tables where schemaname='public';
     tablename 
    -----------
     region
     nation
     lineitem
     orders
     part
     supplier
     partsupp
     customer
    (8 rows)

Completing Federated Query setup

The final step is to create an external schema to connect to the Aurora PostgreSQL instance. The following example shows the CREATE EXTERNAL SCHEMA statement that you need to run on your Amazon Redshift cluster to complete this step:

CREATE EXTERNAL SCHEMA IF NOT EXISTS pg 
FROM POSTGRES 
DATABASE 'dev' 
SCHEMA 'public' 
URI '<AuroraClusterEndpoint>' 
PORT 5432 
IAM_ROLE '<IAMRole>' 
SECRET_ARN '<SecretARN>'

Use the following parameters:

  • URI – AuroraClusterEndpoint value from the CloudFormation stack outputs. Value is in the format <stackname>-cluster.<randomcharacter>.us-east-1.rds.amazonaws.com
  • IAM_ROLE – RedshiftClusterRoleArn value from the CloudFormation stack outputs. Value is in the format arn:aws:iam::<accountnumber>:role/<stackname>-RedshiftClusterRole-<randomcharacter>
  • SECRET_ARN – AuroraSecretArn value from the CloudFormation stack outputs. Value is in the format arn:aws:secretsmanager:us-east-1:<accountnumber>:secret:secretAuroraMasterUser-<randomcharacter>

Testing Federated Query

Now that you have set up Federated Query, you can start testing the feature using the TPC-H dataset that was preloaded into both Aurora PostgreSQL and Amazon Redshift.

The following query shows the parts and supplier relationship. Tables PARTSUPP and PART are stored in Amazon Redshift, and the SUPPLIER table in the subquery is from Aurora PostgreSQL:

SELECT TOP 10 P_BRAND,
       P_TYPE,
       P_SIZE,
       COUNT(DISTINCT PS_SUPPKEY) AS SUPPLIER_CNT
FROM PARTSUPP,
     PART
WHERE P_PARTKEY = PS_PARTKEY
AND   P_BRAND <> 'Brand#23'
AND   P_TYPE NOT LIKE 'MEDIUM ANODIZED%'
AND   P_SIZE IN (1,32,33,46,7,42,21,40)
AND   PS_SUPPKEY NOT IN (SELECT S_SUPPKEY
                         FROM pg.SUPPLIER
                         WHERE S_COMMENT LIKE '%Customer%Complaints%')
GROUP	BY P_BRAND,
         P_TYPE,
         P_SIZE
ORDER	BY SUPPLIER_CNT DESC,
         P_BRAND,
         P_TYPE,
         P_SIZE;

The following query shows the order priority by combining ORDERS table data from Amazon Redshift and Aurora PostgreSQL. This demonstrates the use case of live data query from an OLTP source federated with historical data on a data warehouse:

SELECT O_ORDERPRIORITY,
       COUNT(*) AS ORDER_COUNT
FROM (SELECT O_ORDERPRIORITY
      FROM ORDERS o
      WHERE O_ORDERDATE < '1997-07-01'
      AND   O_ORDERDATE >= CAST(DATE '1997-07-01' - INTERVAL '3 months' AS DATE)
      UNION ALL
      SELECT O_ORDERPRIORITY
      FROM pg.ORDERS o
      WHERE O_ORDERDATE >= '1997-07-01'
      AND   O_ORDERDATE < CAST(DATE '1997-07-01' + INTERVAL '1 day' AS DATE))
GROUP	BY O_ORDERPRIORITY
ORDER	BY O_ORDERPRIORITY;

You can continue to experiment with the dataset and explore the three main use cases from the post, Build a Simplified ETL and Live Data Query Solution using Redshift Federated Query.

Deleting the CloudFormation stack

When you are finished, delete the CloudFormation stack; some of the AWS resources in this walkthrough incur a cost if you continue to use them. Complete the following steps:

  1. On the AWS CloudFormation console, choose Stacks.
  2. Choose the stack you launched in this walkthrough. The stack must be currently running.
  3. In the stack details pane, choose Delete.
  4. Choose Delete stack.

Summary

This post showed you how to automate the creation of an Aurora PostgreSQL and Amazon Redshift cluster preloaded with the TPC-H dataset, the prerequisites of the new Amazon Redshift Federated Query feature using AWS CloudFormation, and a single manual step to complete the setup. The post also provided some example federated queries using the TPC-H dataset, which you can use to accelerate your learning and adoption of the new features. You can continue to modify the CloudFormation templates from this post to support your business needs.

If you have any questions or suggestions, please leave a comment.

 


About the Authors

BP Yau is a Data Warehouse Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate the Oracle Data Warehouse to Amazon Redshift and built the next generation big data analytics platform using AWS technologies.

 

 

 

Srikanth Sopirala is a Sr. Specialist Solutions Architect focused on Analytics at AWS. He is passionate about helping customers build scalable data and analytics solutions in the cloud.

Build a Simplified ETL and Live Data Query Solution using Redshift Federated Query

Post Syndicated from Tito Mijares original https://aws.amazon.com/blogs/big-data/build-a-simplified-etl-and-live-data-query-solution-using-redshift-federated-query/

You may have heard the saying that the best ETL is no ETL. Amazon Redshift now makes this possible with Federated Query. In its initial release, this feature lets you query data in Amazon Aurora PostgreSQL or Amazon RDS for PostgreSQL using Amazon Redshift external schemas. Federated Query also exposes the metadata from these source databases through system views and driver APIs, which allows business intelligence tools like Tableau and Amazon QuickSight to connect to Amazon Redshift and query data in PostgreSQL without having to make local copies. This enables a new data warehouse pattern, live data query, in which you can seamlessly retrieve data from PostgreSQL databases, or build data into a late binding view, which combines operational PostgreSQL data, analytical Amazon Redshift local data, and historical Amazon Redshift Spectrum data in an Amazon S3 data lake.

Simplified ETL use case

For this ETL use case, you can simplify the familiar upsert pattern with a federated query. You can bypass the need for incremental extracts in Amazon S3 and the subsequent load via COPY by querying the data in place within its source database. This change can be a single line of code that replaces the COPY command with a query to an external table. See the following code:

BEGIN;
CREATE TEMP TABLE staging (LIKE ods.store_sales);
-- replace the following COPY from S3 
COPY staging FROM 's3://yourETLbucket/daily_store_sales/' 
     IAM_ROLE 'arn:aws:iam::<account_id>:role/<s3_reader_role>' DELIMITER '|' COMPUPDATE OFF;
-- with this federated query to load staging data from PostgreSQL source
INSERT INTO staging SELECT * FROM pg.store_sales p
	WHERE p.last_updated_date > (SELECT MAX(last_updated_date) FROM ods.store_sales);
DELETE FROM ods.store_sales USING staging s WHERE ods.store_sales.id = s.id;
INSERT INTO ods.store_sales SELECT * FROM staging;
DROP TABLE staging;
COMMIT;

In the preceding example, the table pg.store_sales resides in PostgreSQL, and you use a federated query to retrieve fresh data to load into a staging table in Amazon Redshift, keeping the actual delete and insert operations unchanged. This pattern is likely the most common application of federated queries.
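
If you run this refresh on a schedule, one option is to wrap the same statements in a stored procedure so that a single CALL performs the upsert. The following is a minimal sketch using the same hypothetical table names as the preceding example; the procedure name ods.refresh_store_sales is not part of the original walkthrough, so adapt it to your own schema and error-handling needs.

CREATE OR REPLACE PROCEDURE ods.refresh_store_sales()
AS $$
BEGIN
  -- Stage only rows changed since the last refresh, pulled through the federated schema
  CREATE TEMP TABLE staging (LIKE ods.store_sales);
  INSERT INTO staging
  SELECT * FROM pg.store_sales p
  WHERE p.last_updated_date > (SELECT MAX(last_updated_date) FROM ods.store_sales);
  -- Apply the upsert locally in Amazon Redshift
  DELETE FROM ods.store_sales USING staging s WHERE ods.store_sales.id = s.id;
  INSERT INTO ods.store_sales SELECT * FROM staging;
  DROP TABLE staging;
END;
$$ LANGUAGE plpgsql;

-- Refresh on demand or from a scheduler:
CALL ods.refresh_store_sales();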

Setting up an external schema

The external schema pg in the preceding example was set up as follows:

CREATE EXTERNAL SCHEMA IF NOT EXISTS pg                                                                         
FROM POSTGRES                                                                                                           
DATABASE 'dev' 
SCHEMA 'retail'                                                                                     
URI 'database-1.cluster-ro-samplecluster.us-east-1.rds.amazonaws.com'                                                    
PORT 5432                                                                                                               
IAM_ROLE 'arn:aws:iam::555566667777:role/myFederatedQueryRDS'                                                           
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:555566667777:secret:MyRDSCredentials-TfzFSB'

If you’re familiar with the CREATE EXTERNAL SCHEMA command from using it in Spectrum, note some new parameter options to enable federated queries.

FROM POSTGRES                                                                                                           
DATABASE 'dev' 
SCHEMA 'retail'

Whereas Amazon Redshift Spectrum references an external data catalog that resides within AWS Glue, Amazon Athena, or Hive, this code points to a Postgres catalog. Also, expect more keywords used with FROM, as Amazon Redshift supports more source databases for federated querying. By default, if you do not specify SCHEMA, it defaults to public.

Within the target database, you identify DATABASE ‘dev’ and SCHEMA ‘retail’, so any queries to the Amazon Redshift table pg.<some_table> get issued to PostgreSQL as a request for retail.<some_table> in the dev database. For Amazon Redshift, query predicates are pushed down and run entirely in PostgreSQL, which reduces the result set returned to Amazon Redshift for subsequent operations. Going further, the query planner derives cardinality estimates for external tables to optimize joins between Amazon Redshift and PostgreSQL. From the preceding example:

URI 'database-1.cluster-ro-samplecluster.us-east-1.rds.amazonaws.com'                                                    
PORT 5432

The URI and PORT parameters that reference both the PostgreSQL endpoint and port are self-explanatory, but there are a few things to consider in your configuration:

  • Use a read replica endpoint in Aurora or Amazon RDS for PostgreSQL to reduce load on the primary instance.
  • Set up your Amazon RDS for PostgreSQL instance, Aurora serverless or provisioned instances, and Amazon Redshift clusters to use the same VPC and subnet groups. That way, you can add the security group for the cluster to the inbound rules of the security group for the Aurora or Amazon RDS for PostgreSQL instance.
  • If both Amazon Redshift and Aurora or Amazon RDS for PostgreSQL are on different VPCs, set up VPC peering. For more information, see What is VPC Peering?

Configuring AWS Secrets Manager for remote database credentials

To retrieve AWS Secrets Manager remote database credentials, our example uses the following code:

IAM_ROLE 'arn:aws:iam::555566667777:role/myFederatedQueryRDS'                                                           
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:555566667777:secret:MyRDSCredentials-TfzFSB'

These two parameters are interrelated because the SECRET_ARN is also embedded in the IAM policy for the role.

If a service like Secrets Manager didn’t exist and you wanted to issue a federated query from Amazon Redshift to PostgreSQL, you would need to supply the database credentials to the CREATE EXTERNAL SCHEMA command via a parameter like CREDENTIALS, which you also use with the COPY command. However, this hardcoded approach doesn’t take into account that the PostgreSQL credentials could expire.

You avoid this problem by keeping PostgreSQL database credentials within Secrets Manager, which provides a centralized service to manage secrets. Because Amazon Redshift retrieves and uses these credentials, they are transient and not stored in any generated code and are discarded after query execution.

Storing a new secret in Secrets Manager takes only a few minutes. To store one, complete the following steps:

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. In the Store a new secret section, complete the following:
  • Supply your PostgreSQL database credentials
  • Name the secret; for example, MyRDSCredentials
  • Configure rotation (you can enable this at a later time)
  • Optionally, copy programmatic code for accessing your secret using your preferred programming languages (which is not needed for this post)
  4. Choose Next.

You can also retrieve the credentials easily.

  1. On the Secrets Manager console, choose your secret.
  2. Choose Retrieve secret value.

The following screenshot shows you the secret value details.

This secret is now an AWS resource referenced via a secret ARN. See the following screenshot.

Setting up an IAM role

You can now pull everything together by embedding the secret ARN into an IAM policy, naming the policy, and attaching it to an IAM role. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AccessSecret",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetResourcePolicy",
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret",
                "secretsmanager:ListSecretVersionIds"
            ],
            "Resource": "arn:aws:secretsmanager:us-east-1:555566667777:secret:MyRDSCredentials-TfzFSB"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetRandomPassword",
                "secretsmanager:ListSecrets"
            ],
            "Resource": "*"
        }
    ]
}

The following screenshot shows the details of the IAM role called myFederatedQueryRDS, which contains the MyRDSSecretPolicy policy. It’s the same role that’s supplied in the IAM_ROLE parameter of the CREATE EXTERNAL SCHEMA DDL.

Finally, attach the same IAM role to your Amazon Redshift cluster.

  1. On the Amazon Redshift console, choose your cluster.
  2. From the Actions drop-down menu, choose Manage IAM roles.
  3. Choose and add the IAM role you just created.

You have now completed the following steps:

  1. Create an IAM policy and role
  2. Store your PostgreSQL database credentials in Secrets Manager
  3. Create an Amazon Redshift external schema definition that uses the secret and IAM role to authenticate with a PostgreSQL endpoint
  4. Apply a mapping from an Amazon Redshift database and schema to a PostgreSQL database and schema so that Amazon Redshift can issue queries to PostgreSQL tables.

You only need to complete this configuration one time.
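
To sanity-check the setup, you can confirm that the external schema is registered and that a simple federated query pushes its filter down to PostgreSQL. The following is a minimal sketch, assuming the standard TPC-H ORDERS columns exist in the mapped retail schema; in the EXPLAIN output, look for a Remote PG Seq Scan step with a Filter line, as in the query plans shown later in this post.

-- Confirm the external schema mapping is in place
SELECT schemaname, databasename
FROM svv_external_schemas
WHERE schemaname = 'pg';

-- Verify predicate pushdown for a simple federated query
EXPLAIN
SELECT o_orderkey, o_totalprice
FROM pg.orders
WHERE o_orderstatus = 'O';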

Querying live operational data

This section explores another use case: querying operational data across multiple source databases. In this use case, a global online retailer has databases deployed by different teams across distinct geographies:

  • Region us-east-1 runs serverless Aurora PostgreSQL.
  • Region us-west-1 runs provisioned Aurora PostgreSQL, which is also configured as a global database with a read replica in us-east-1.
  • Region eu-west-1 runs an Amazon RDS for PostgreSQL instance with a read replica in us-east-1.

Serverless and provisioned Aurora PostgreSQL and Amazon RDS for PostgreSQL are visible in the Amazon RDS console in Region us-east-1. See the following screenshot:

For this use case, assume that you configured the read replicas for Aurora and Amazon RDS to share the same VPC and subnets in us-east-1 with the local serverless Aurora PostgreSQL. Furthermore, you have already created secrets for each of these instances’ credentials, along with an IAM role whose attached policy, MyCombinedRDSSecretPolicy, is more permissive and allows Amazon Redshift to retrieve the value of any Amazon RDS secret within any Region. Be mindful of security in actual production use, however, and explicitly specify the resource ARNs for each secret in separate statements in your IAM policy. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AccessSecret",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetResourcePolicy",
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret",
                "secretsmanager:ListSecretVersionIds"
            ],
            "Resource": "arn:aws:secretsmanager:*:555566667777:secret:*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetRandomPassword",
                "secretsmanager:ListSecrets"
            ],
            "Resource": "*"
        }
    ]
}

External schema DDLs in Amazon Redshift can then reference the combined IAM role and individual secret ARNs. See the following code:

CREATE EXTERNAL SCHEMA IF NOT EXISTS useast
FROM POSTGRES
DATABASE 'dev'
URI 'us-east-1-aurora-pg-serverless.cluster-samplecluster.us-east-1.rds.amazonaws.com'
PORT 5432
IAM_ROLE 'arn:aws:iam::555566667777:role/MyCombinedRDSFederatedQuery'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:555566667777:secret:MyEastUSAuroraServerlessCredentials-dXOlEq'
;

CREATE EXTERNAL SCHEMA IF NOT EXISTS uswest
FROM POSTGRES
DATABASE 'dev'
URI 'global-aurora-pg-west-coast-stores-instance-1.samplecluster.us-east-1.rds.amazonaws.com'
PORT 5432
IAM_ROLE 'arn:aws:iam::555566667777:role/MyCombinedRDSFederatedQuery'
SECRET_ARN 'arn:aws:secretsmanager:us-west-1:555566667777:secret:MyWestUSAuroraGlobalDBCredentials-p3sV9m'
;

CREATE EXTERNAL SCHEMA IF NOT EXISTS europe
FROM POSTGRES
DATABASE 'dev'
URI 'eu-west-1-postgres-read-replica.samplecluster.us-east-1.rds.amazonaws.com'
PORT 5432
IAM_ROLE 'arn:aws:iam::555566667777:role/MyCombinedRDSFederatedQuery'
SECRET_ARN 'arn:aws:secretsmanager:eu-west-1:555566667777:secret:MyEuropeRDSPostgresCredentials-mz2u9L'
;

The following late binding view abstracts the underlying queries to the TPC-H lineitem test data within all three PostgreSQL instances. See the following code:

CREATE VIEW global_lineitem AS
SELECT 'useast' AS region, * from useast.lineitem
UNION ALL
SELECT 'uswest', * from uswest.lineitem
UNION ALL
SELECT 'europe', * from europe.lineitem
WITH NO SCHEMA BINDING
;

With this feature, Amazon Redshift can query live operational data across multiple distributed databases and aggregate the results into a unified view. See the following code:

dev=# SELECT region, extract(month from l_shipdate) as month,
      sum(l_extendedprice * l_quantity) - sum(l_discount) as sales
      FROM global_lineitem
      WHERE l_shipdate >= '1997-01-01'
      AND l_shipdate < '1998-01-01'
      AND month < 4
      GROUP BY 1, 2
      ORDER BY 1, 2
;
 region | month |      sales
--------+-------+------------------
 europe |     1 | 16036160823.3700
 europe |     2 | 15089300790.7200
 europe |     3 | 16579123912.6700
 useast |     1 | 16176034865.7100
 useast |     2 | 14624520114.6700
 useast |     3 | 16645469098.8600
 uswest |     1 | 16800599170.4600
 uswest |     2 | 14547930407.7000
 uswest |     3 | 16595334825.9200
(9 rows)

If you examine Remote PG Seq Scan in the following query plan, you see that predicates are pushed down for execution in all three PostgreSQL databases. Unlike your initial simplified ETL use case, no ETL is performed because data is queried and filtered in place. See the following code:

dev=# EXPLAIN SELECT region, extract(month from l_shipdate) as month,
      sum(l_extendedprice * l_quantity) - sum(l_discount) as sales
FROM global_lineitem
WHERE l_shipdate >= '1997-01-01'
AND l_shipdate < '1998-01-01'
AND month < 4
GROUP BY 1, 2
ORDER BY 1, 2
;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 XN Merge  (cost=1000000060145.67..1000000060146.17 rows=200 width=100)
   Merge Key: derived_col1, derived_col2
   ->  XN Network  (cost=1000000060145.67..1000000060146.17 rows=200 width=100)
         Send to leader
         ->  XN Sort  (cost=1000000060145.67..1000000060146.17 rows=200 width=100)
               Sort Key: derived_col1, derived_col2
               ->  XN HashAggregate  (cost=60136.52..60138.02 rows=200 width=100)
                     ->  XN Subquery Scan global_lineitem  (cost=20037.51..60130.52 rows=600 width=100)
                           ->  XN Append  (cost=20037.51..60124.52 rows=600 width=52)
                                 ->  XN Subquery Scan "*SELECT* 1"  (cost=20037.51..20041.51 rows=200 width=52)
                                       ->  XN HashAggregate  (cost=20037.51..20039.51 rows=200 width=52)
                                             ->  XN PG Query Scan lineitem  (cost=0.00..20020.84 rows=1667 width=52)
                                                   ->  Remote PG Seq Scan useast.lineitem  (cost=0.00..20000.00 rows=1667 width=52)
                                                         Filter: ((l_shipdate < '1998-01-01'::date) AND (l_shipdate >= '1997-01-01'::date) AND ("date_part"('month'::text, l_shipdate) < 4))
                                 ->  XN Subquery Scan "*SELECT* 2"  (cost=20037.51..20041.51 rows=200 width=52)
                                       ->  XN HashAggregate  (cost=20037.51..20039.51 rows=200 width=52)
                                             ->  XN PG Query Scan lineitem  (cost=0.00..20020.84 rows=1667 width=52)
                                                   ->  Remote PG Seq Scan uswest.lineitem  (cost=0.00..20000.00 rows=1667 width=52)
                                                         Filter: ((l_shipdate < '1998-01-01'::date) AND (l_shipdate >= '1997-01-01'::date) AND ("date_part"('month'::text, l_shipdate) < 4))
                                 ->  XN Subquery Scan "*SELECT* 3"  (cost=20037.51..20041.51 rows=200 width=52)
                                       ->  XN HashAggregate  (cost=20037.51..20039.51 rows=200 width=52)
                                             ->  XN PG Query Scan lineitem  (cost=0.00..20020.84 rows=1667 width=52)
                                                   ->  Remote PG Seq Scan europe.lineitem  (cost=0.00..20000.00 rows=1667 width=52)
                                                         Filter: ((l_shipdate < '1998-01-01'::date) AND (l_shipdate >= '1997-01-01'::date) AND ("date_part"('month'::text, l_shipdate) < 4))
(24 rows)

Combining the data lake, data warehouse, and live operational data

In this next use case, you join Amazon Redshift Spectrum historical data with current data in Amazon Redshift and live data in PostgreSQL. You use a 3TB TPC-DS dataset and unload data from 1998 through 2001 from the store_sales table in Amazon Redshift to Amazon S3. The unloaded files are stored in Parquet format with ss_sold_date_sk as partitioning key.

To access this historical data via Amazon Redshift Spectrum, create an external table. See the following code:

CREATE EXTERNAL TABLE spectrum.store_sales_historical
(
  ss_sold_time_sk int ,
  ss_item_sk int ,
  ss_customer_sk int ,
  ss_cdemo_sk int ,
  ss_hdemo_sk int ,
  ss_addr_sk int ,
  ss_store_sk int ,
  ss_promo_sk int ,
  ss_ticket_number bigint,
  ss_quantity int ,
  ss_wholesale_cost numeric(7,2) ,
  ss_list_price numeric(7,2) ,
  ss_sales_price numeric(7,2) ,
  ss_ext_discount_amt numeric(7,2) ,
  ss_ext_sales_price numeric(7,2) ,
  ss_ext_wholesale_cost numeric(7,2) ,
  ss_ext_list_price numeric(7,2) ,
  ss_ext_tax numeric(7,2) ,
  ss_coupon_amt numeric(7,2) ,
  ss_net_paid numeric(7,2) ,
  ss_net_paid_inc_tax numeric(7,2) ,
  ss_net_profit numeric(7,2)
)
PARTITIONED BY (ss_sold_date_sk int)
STORED AS PARQUET
LOCATION 's3://mysamplebucket/historical_store_sales/';   

The external schema spectrum is defined as follows:

CREATE EXTERNAL SCHEMA spectrum
FROM data catalog DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::555566667777:role/mySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

Instead of an Amazon S3 read-only policy, the IAM role mySpectrumRole contains both the AmazonS3FullAccess and AWSGlueConsoleFullAccess policies, because the former allows Amazon Redshift to write to Amazon S3 during the UNLOAD operation. See the following code:

UNLOAD ('SELECT * FROM tpcds.store_sales WHERE ss_sold_date_sk < 2452276')
TO 's3://mysamplebucket/historical_store_sales/'
IAM_ROLE 'arn:aws:iam::555566667777:role/mySpectrumRole'
FORMAT AS PARQUET
PARTITION BY (ss_sold_date_sk) ALLOWOVERWRITE;

To make partitioned data visible, the ALTER TABLE ... ADD PARTITION command needs to specify all partition values. For this use case, 2450816 through 2452275 correspond to dates 1998-01-02 through 2001-12-31, respectively. To generate these DDLs quickly, use the following code:

WITH partitions AS (SELECT * FROM generate_series(2450816, 2452275))
SELECT 'ALTER TABLE spectrum.store_sales_historical ADD PARTITION (ss_sold_date_sk='|| generate_series || ') '
    || 'LOCATION \'s3://mysamplebucket/historical_store_sales/ss_sold_date_sk=' || generate_series || '/\';'
FROM partitions;

You can run the generated ALTER TABLE statements individually or as a batch to make partition data visible. See the following code:

ALTER TABLE spectrum.store_sales_historical 
ADD PARTITION (ss_sold_date_sk=2450816)
LOCATION 's3://mysamplebucket/historical_store_sales/ss_sold_date_sk=2450816/';
-- repeated for all partition values

The three combined sources in the following view consist of historical data in Amazon S3 for 1998 through 2001, current data local to Amazon Redshift for 2002, and live data for two months of 2003 in PostgreSQL. When you create this late binding view, you have to re-order Amazon Redshift Spectrum external table columns because the previous UNLOAD operation specifying ss_sold_date_sk as partition key shifted that column’s order to last. See the following code:

CREATE VIEW store_sales_integrated AS
SELECT * FROM uswest.store_sales_live
UNION ALL
SELECT * FROM tpcds.store_sales_current
UNION ALL
SELECT ss_sold_date_sk, ss_sold_time_sk, ss_item_sk, ss_customer_sk, ss_cdemo_sk, 
       ss_hdemo_sk, ss_addr_sk, ss_store_sk, ss_promo_sk, ss_ticket_number, 
       ss_quantity, ss_wholesale_cost, ss_list_price, ss_sales_price, 
       ss_ext_discount_amt, ss_ext_sales_price, ss_ext_wholesale_cost, 
       ss_ext_list_price, ss_ext_tax, ss_coupon_amt, ss_net_paid, 
       ss_net_paid_inc_tax, ss_net_profit
FROM spectrum.store_sales_historical
WITH NO SCHEMA BINDING;

You can now run a query on the view to aggregate data and join tables across the three sources. See the following code:

dev=# SELECT extract(year from b.d_date), count(a.ss_sold_date_sk)
FROM store_sales_integrated a
JOIN tpcds.date_dim b on (a.ss_sold_date_sk = b.d_date_sk)
GROUP BY 1
ORDER BY 1
;
 date_part |   count
-----------+------------
      1998 | 1632403114
      1999 | 1650163390
      2000 | 1659168880
      2001 | 1641184375
      2002 | 1650209644
      2003 |   17994540
(6 rows)

Time: 77624.926 ms (01:17.625)

The preceding federated query ran on a two-node DC2.8XL cluster and took 1 minute and 17 seconds to join store sales data in Amazon S3, PostgreSQL, and Amazon Redshift with the date dimension table in Amazon Redshift, aggregating and sorting row counts by year. The following is the query plan:

dev=# EXPLAIN SELECT extract(year from b.d_date), count(a.ss_sold_date_sk)
FROM store_sales_integrated a
JOIN tpcds.date_dim b on (a.ss_sold_date_sk = b.d_date_sk)
GROUP BY 1
ORDER BY 1;

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 XN Merge  (cost=1461036320912.29..1461036321094.91 rows=73049 width=8)
   Merge Key: "date_part"('year'::text, b.d_date)
   ->  XN Network  (cost=1461036320912.29..1461036321094.91 rows=73049 width=8)
         Send to leader
         ->  XN Sort  (cost=1461036320912.29..1461036321094.91 rows=73049 width=8)
               Sort Key: "date_part"('year'::text, b.d_date)
               ->  XN HashAggregate  (cost=461036314645.93..461036315011.18 rows=73049 width=8)
                     ->  XN Hash Join DS_DIST_ALL_NONE  (cost=913.11..428113374829.91 rows=6584587963204 width=8)
                           Hash Cond: ("outer".ss_sold_date_sk = "inner".d_date_sk)
                           ->  XN Subquery Scan a  (cost=0.00..263498674836.70 rows=6584587963204 width=4)
                                 ->  XN Append  (cost=0.00..197652795204.66 rows=6584587963204 width=4)
                                       ->  XN Subquery Scan "*SELECT* 1"  (cost=0.00..539836.20 rows=17994540 width=4)
                                             ->  XN PG Query Scan store_sales_live  (cost=0.00..359890.80 rows=17994540 width=4)
                                                   ->  Remote PG Seq Scan uswest.store_sales_live  (cost=0.00..179945.40 rows=17994540 width=4)
                                       ->  XN Subquery Scan "*SELECT* 2"  (cost=0.00..33004193.28 rows=1650209664 width=4)
                                             ->  XN Seq Scan on store_sales_current  (cost=0.00..16502096.64 rows=1650209664 width=4)
                                       ->  XN Subquery Scan "*SELECT* 3"  (cost=0.00..197619251175.18 rows=6582919759000 width=4)
                                             ->  XN Partition Loop  (cost=0.00..131790053585.18 rows=6582919759000 width=4)
                                                   ->  XN Seq Scan PartitionInfo of spectrum.store_sales_historical  (cost=0.00..10.00 rows=1000 width=4)
                                                   ->  XN S3 Query Scan store_sales_historical  (cost=0.00..131658395.18 rows=6582919759 width=0)
                                                         ->  S3 Seq Scan spectrum.store_sales_historical location:"s3://mysamplebucket/historical_store_sales" format:PARQUET (cost=0.00..65829197.59 rows=6582919759 width=0)
                           ->  XN Hash  (cost=730.49..730.49 rows=73049 width=8)
                                 ->  XN Seq Scan on date_dim b  (cost=0.00..730.49 rows=73049 width=8)
(23 rows)

The query plan shows full sequential scans running on the three source tables, with the returned rows totaling roughly 8.2 billion. Because Amazon Redshift Spectrum does not generate statistics for external tables, you manually set the numRows property to the row count for the historical data in Amazon S3. See the following code:

ALTER TABLE spectrum.store_sales_historical SET TABLE PROPERTIES ('numRows' = '6582919759');
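
To confirm the property took effect, you can inspect the table parameters through the system view for external tables. This is an optional check; it assumes the svv_external_tables view and these column names are available in your cluster.

SELECT schemaname, tablename, parameters
FROM svv_external_tables
WHERE schemaname = 'spectrum'
AND   tablename  = 'store_sales_historical';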

You can join with another dimension table local to Amazon Redshift, this time the 30 million row customer table, and filter by column c_birth_country. See the following code:

dev=# SELECT extract(year from b.d_date), count(a.ss_sold_date_sk)
FROM store_sales_integrated a
JOIN tpcds.date_dim b on (a.ss_sold_date_sk = b.d_date_sk)
JOIN tpcds.customer c on (a.ss_customer_sk = c.c_customer_sk)
AND c.c_birth_country = 'UNITED STATES'
GROUP BY 1
ORDER BY 1
;
 date_part |  count
-----------+---------
      1998 | 7299277
      1999 | 7392156
      2000 | 7416905
      2001 | 7347920
      2002 | 7390590
      2003 |   81627
(6 rows)

Time: 77878.586 ms (01:17.879)

The following is the query plan for this query:

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 XN Merge  (cost=1363288861214.20..1363288861396.83 rows=73049 width=8)
   Merge Key: "date_part"('year'::text, b.d_date)
   ->  XN Network  (cost=1363288861214.20..1363288861396.83 rows=73049 width=8)
         Send to leader
         ->  XN Sort  (cost=1363288861214.20..1363288861396.83 rows=73049 width=8)
               Sort Key: "date_part"('year'::text, b.d_date)
               ->  XN HashAggregate  (cost=363288854947.85..363288855313.09 rows=73049 width=8)
                     ->  XN Hash Join DS_DIST_ALL_NONE  (cost=376252.50..363139873158.03 rows=29796357965 width=8)
                           Hash Cond: ("outer".ss_sold_date_sk = "inner".d_date_sk)
                           ->  XN Hash Join DS_BCAST_INNER  (cost=375339.39..362394963295.79 rows=29796357965 width=4)
                                 Hash Cond: ("outer".ss_customer_sk = "inner".c_customer_sk)
                                 ->  XN Subquery Scan a  (cost=0.00..263498674836.70 rows=6584587963204 width=8)
                                       ->  XN Append  (cost=0.00..197652795204.66 rows=6584587963204 width=8)
                                             ->  XN Subquery Scan "*SELECT* 1"  (cost=0.00..539836.20 rows=17994540 width=8)
                                                   ->  XN PG Query Scan store_sales_live  (cost=0.00..359890.80 rows=17994540 width=8)
                                                         ->  Remote PG Seq Scan uswest.store_sales_live  (cost=0.00..179945.40 rows=17994540 width=8)
                                             ->  XN Subquery Scan "*SELECT* 2"  (cost=0.00..33004193.28 rows=1650209664 width=8)
                                                   ->  XN Seq Scan on store_sales_current  (cost=0.00..16502096.64 rows=1650209664 width=8)
                                             ->  XN Subquery Scan "*SELECT* 3"  (cost=0.00..197619251175.18 rows=6582919759000 width=8)
                                                   ->  XN Partition Loop  (cost=0.00..131790053585.18 rows=6582919759000 width=8)
                                                         ->  XN Seq Scan PartitionInfo of spectrum.store_sales_historical  (cost=0.00..10.00 rows=1000 width=4)
                                                         ->  XN S3 Query Scan store_sales_historical  (cost=0.00..131658395.18 rows=6582919759 width=4)
                                                               ->  S3 Seq Scan spectrum.store_sales_historical location:"s3://mysamplebucket/historical_store_sales" format:PARQUET (cost=0.00..65829197.59 rows=6582919759 width=4)
                                 ->  XN Hash  (cost=375000.00..375000.00 rows=135755 width=4)
                                       ->  XN Seq Scan on customer c  (cost=0.00..375000.00 rows=135755 width=4)
                                             Filter: ((c_birth_country)::text = 'UNITED STATES'::text)
                           ->  XN Hash  (cost=730.49..730.49 rows=73049 width=8)
                                 ->  XN Seq Scan on date_dim b  (cost=0.00..730.49 rows=73049 width=8)
(28 rows)

Query performance hardly changed from the previous query. Because the query scans only one column (ss_sold_date_sk), it benefits from Parquet’s columnar structure for the historical data subquery. To put it another way, if the historical data were stored as CSV, all the data would be scanned, which would degrade performance significantly.
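
If you want to see how much Amazon S3 data the Amazon Redshift Spectrum portion of a query actually scanned, you can query the S3 query summary system view. The following is a minimal sketch; it assumes you have access to svl_s3query_summary and that these column names are available in your cluster version.

SELECT query, external_table_name, s3_scanned_rows, s3_scanned_bytes
FROM svl_s3query_summary
ORDER BY query DESC
LIMIT 10;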

Additionally, the TPC-DS model does not store date values in the store_sales fact table; instead, a foreign key references the date_dim table. If you plan to implement something similar and frequently filter by a date column, consider adding that column to the fact table, using it as the sort key, and also defining it as a partition column for the Amazon Redshift Spectrum table. That way, Amazon Redshift can skip blocks more efficiently for local data, prune partitions for the Amazon S3 data, and push filtering criteria down to Amazon Redshift Spectrum, as sketched below.
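
The following sketch illustrates that layout with trimmed-down, hypothetical versions of the tables (the names and column subset are illustrative only): the local fact table carries the sold date and sorts on it, and the matching external table partitions on the same column so both block skipping and partition pruning can apply.

-- Local fact table: carry the date and sort on it so Amazon Redshift can skip blocks
CREATE TABLE tpcds.store_sales_by_date (
  ss_sold_date   date,
  ss_item_sk     int,
  ss_customer_sk int,
  ss_quantity    int,
  ss_sales_price numeric(7,2)
)
DISTKEY (ss_item_sk)
SORTKEY (ss_sold_date);

-- Historical data: partition the external table on the same column for partition pruning
CREATE EXTERNAL TABLE spectrum.store_sales_historical_by_date (
  ss_item_sk     int,
  ss_customer_sk int,
  ss_quantity    int,
  ss_sales_price numeric(7,2)
)
PARTITIONED BY (ss_sold_date date)
STORED AS PARQUET
LOCATION 's3://mysamplebucket/historical_store_sales_by_date/';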

Conclusion

Applications of live data integration in real-world scenarios include data discovery, data preparation for machine learning, operational analytics, IoT telemetry analytics, fraud detection, and compliance and security audits. Whereas Amazon Redshift Spectrum extends the reach of Amazon Redshift into the AWS data lake, Federated Query extends its reach into operational databases and beyond.

For more information about data type differences between these databases, see Data Type Differences Between Amazon Redshift and Supported RDS PostgreSQL or Aurora PostgreSQL Databases. For more information about accessing federated data with Amazon Redshift, see Limitations and Considerations When Accessing Federated Data with Amazon Redshift.

 


About the Authors

Tito Mijares is a Data Warehouse Specialist Solutions Architect at AWS. He helps AWS customers adopt and optimize their use of Amazon Redshift. Outside of AWS, he enjoys playing jazz guitar and working on audio recording and playback projects.

 

 

 

Entong Shen is a Senior Software Development Engineer for AWS Redshift. He has been working on MPP databases for over 8 years and has focused on query optimization, statistics and SQL language features such as stored procedures and federated query. In his spare time, he enjoys listening to music of all genres and working in his succulent garden.

 

 

 

Niranjan Kamat is a software engineer on the Amazon Redshift query processing team. His focus of PhD research was in interactive querying over large databases. In Redshift, he has worked in different query processing areas such as query optimization, analyze command and statistics, and federated querying. In his spare time, he enjoys playing with his three year old daughter, practicing table tennis (was ranked in top 10 in Ohio, USATT rating 2143), and chess.

 

 

Sriram Krishnamurthy is a Software Development Manager for AWS Redshift Query Processing team. He is passionate about Databases and has been working on Semi Structured Data Processing and SQL Compilation & Execution for over 15 years. In his free time, you can find him on the tennis court, often with his two young daughters in tow.

Improved speed and scalability in Amazon Redshift

Post Syndicated from Naresh Chainani original https://aws.amazon.com/blogs/big-data/improved-speed-and-scalability-in-amazon-redshift/

Since Amazon Redshift launched in 2012, its focus has always been on providing you with the best possible performance, at scale, and at the lowest possible cost. For most organizations across the globe, the amount of data going into their data warehouse is growing exponentially, and the number of people who want insights from that data increases daily. Because of this, Amazon Redshift is continually innovating to handle this growing volume of data and demand for insights.

Amazon Redshift delivers fast performance, at scale, for the most demanding workloads. Getting there was not easy, and it takes consistent investment across a variety of technical focus areas to make this happen. This post breaks down what it takes to build the world’s fastest cloud data warehouse.

Redshift’s performance investments are based on a broad spectrum of needs and uses the extensive fleet telemetry data from tens of thousands of customers to guide development work. Your needs may vary based on a variety of factors, including the size of your data warehouse, the number of concurrent users, the skillset of those users, and workload characteristics such as frequency of usage and query latency requirements.

Based on this, Amazon Redshift has performance investments anchored in four key areas:

  • Out-of-the-box performance – Amazon Redshift is more than twice as fast out of the box as it was 6 months ago, and keeps getting faster without any additional manual optimization and tuning
  • Automatic optimizations – Amazon Redshift is self-learning, self-optimizing, and constantly adapts to your actual workload to deliver the best possible performance
  • Scalability – Amazon Redshift can boost throughput to be 35 times greater to support increases in concurrent users and scales linearly for a wide range of workloads
  • Price-performance – Amazon Redshift provides predictable performance at significantly lower total cost than other data warehouse solutions by optimizing resource utilization

Out-of-the-box performance

Amazon Redshift has always enabled you to manually tune workload performance based on statically partitioning memory and specifying the number of concurrent queries. Amazon Redshift also uses advanced machine learning (ML) algorithms to tune configuration settings automatically for optimal data warehouse performance.

These algorithms are extremely effective because they are trained with real-world telemetry data generated from processing over two exabytes of data a day. Additionally, because Amazon Redshift has more customers than any other cloud data warehouse, the telemetry data makes the configuration recommendations from the ML algorithms even more accurate.

In addition, Amazon Redshift Advisor guides optimization by recommending sort keys, distribution keys, and more. However, if you want to override the learning, self-tuning behavior of Amazon Redshift, you still have fine-grained control.

The out-of-the-box performance of Amazon Redshift is continually improving. In November 2019, our Cloud Data Warehouse benchmark[1] showed that the out-of-the-box performance of Amazon Redshift was twice as fast as it was 6 months earlier. This ongoing improvement in performance is the culmination of many technical innovations. This post presents three improvements that have had the most impact.

Improved compression

Amazon Redshift uses AZ64, a novel compression encoding, which delivers high compression ratios and significantly improved query performance. With AZ64, you no longer need to make the trade-off between storage size and performance. AZ64 compresses data, on average, 35% more than the popular, high-performance LZO algorithm, and processes the data 40% faster. AZ64 achieves this by efficiently compressing small groups of data values and uses CPU vector instructions and single instruction, multiple data (SIMD) for parallel processing of AZ64-encoded columns. To benefit from this, you simply select the data type for each column, and Amazon Redshift chooses the compression encoding method. Five months after launch, the AZ64 encoding has become the fourth most popular encoding option in Amazon Redshift with millions of columns.
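
As a small, hypothetical illustration of explicit encodings (the table and column names here are invented for this example), AZ64 applies to numeric, date, and timestamp types, while character columns still use encodings such as ZSTD or LZO; if you omit ENCODE entirely, Amazon Redshift chooses an encoding for you.

CREATE TABLE event_metrics (
  event_id   bigint        ENCODE az64,
  event_ts   timestamp     ENCODE az64,
  metric_val numeric(12,2) ENCODE az64,
  event_name varchar(64)   ENCODE zstd  -- AZ64 does not apply to character types
);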

Efficient large-scale join operations

When complex queries join large tables, massive amounts of data transfers over the network for the join processing on the Amazon Redshift compute nodes. This used to create network bottlenecks that impacted query performance.

Amazon Redshift now uses distributed bloom filters to enable high-performance joins for such workloads. Distributed bloom filters efficiently filter rows at the source that do not match the join relation, which greatly reduces the amount of data transferred over the network. Across the fleet, we see millions of selective bloom filters deployed each day, with some of them filtering more than one billion rows. This removes the network from being the bottleneck and improves overall query performance. Amazon Redshift automatically determines what queries are suitable for this optimization and adapts the optimization at runtime. In addition, Amazon Redshift uses CPU cache-optimized processing to make query execution more efficient for large-scale joins and aggregations. Together, these features improve performance for over 70% of the queries that Amazon Redshift processes daily.

Enhanced query planning

Query planning determines the quickest way to process a given query by evaluating costs associated with various query plans. Costs include many factors, for example, the number of I/O operations required, amount of disk buffer space, time to read from disk, parallelism for the query, and tables statistics such as number of rows and number of distinct values of a column being fresh and relevant.

The Amazon Redshift query planner factors in the capabilities of modern hardware, including the network stack, to take full advantage of the performance that the hardware offers. Its statistic collection process is automated, and it computes statistics by using algorithms like HyperLogLog (HLL), which improves the quality of statistics and therefore enables the cost-based planner to make better choices.
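
Statistics collection is automated, but you can still refresh statistics manually when you want to. The following one-liner is a small example against the tpcds.store_sales table referenced elsewhere in this archive; the PREDICATE COLUMNS option limits the work to columns actually used in filters, joins, and group-by clauses.

-- Manual statistics refresh, scoped to columns used in predicates
ANALYZE tpcds.store_sales PREDICATE COLUMNS;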

Automatic optimizations

Amazon Redshift uses ML techniques and advanced graph algorithms to continuously improve performance. Different workloads have different data access patterns, and Amazon Redshift observes the workload to learn and adapt. It automatically adjusts data layout, distribution keys, and query plans to provide optimal performance for a given workload.

The automatic optimizations in Amazon Redshift make intelligent decisions, such as how to distribute data across nodes, which datasets are frequently queried together and should be co-located, how to sort data, and how many parallel and concurrent queries should run on the system based on the complexity of the query. Amazon Redshift automatically adjusts these configurations to optimize throughput and performance as you use the data warehouse. To make sure these optimizations do not interrupt your workload, Amazon Redshift runs them incrementally and only during the periods when clusters have low utilization.

For maintenance operations, Amazon Redshift reduces the amount of compute resources required by operating only on frequently accessed tables and portions within those tables. Amazon Redshift prioritizes which portions of the table to operate on by analyzing query patterns. When relevant, it provides prescriptive guidance through recommendations in Amazon Redshift Advisor. You can then evaluate and apply those recommendations as needed.

AWS also constantly evaluates and assesses how effective its ML-based self-learning systems are in delivering fast query performance when compared to traditional methods such as expert human tuning.

Amazon Redshift has been adding automatic optimizations for years. The following timeline shows some of the automatic optimizations delivered over the last 12 months. 

For more information about specific optimizations, see the posts about automatic VACUUM DELETE, automatic ANALYZE, distribution key recommendations, sort key recommendations, automatic table sort, and automatic distribution style.

Scalability

Amazon Redshift can boost throughput by more than 35 times to support increases in concurrent users, and scales linearly for simple and mixed workloads.

Scaling to support a growing number of users

As the number of users accessing the data warehouse grows, you should not experience delays in your query responses. Most of the time, your workloads are not predictable and can vary throughout the day. Traditional data warehouses are typically provisioned for peak usage to avoid delays and missed SLAs, and you end up paying for resources that you are not fully using.

Concurrency Scaling in Amazon Redshift allows the data warehouse to handle spikes in workloads while maintaining consistent SLAs by elastically scaling the underlying resources as needed. Amazon Redshift continuously monitors the designated workload. If the queries start to get backlogged because of bursts of user activity, Amazon Redshift automatically adds transient cluster capacity and routes the requests to these new clusters. This transient capacity is available in a few seconds, so your queries continue to be served with low latency. Amazon Redshift removes the additional transient capacity automatically when activity reduces on the cluster.

You can choose if you want to elastically scale for certain workloads and by how much with a simple one-step configuration. Every 24 hours that the Amazon Redshift main cluster is in use, you accrue a 1-hour credit. This makes concurrency scaling free for more than 97% of use cases.
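
To keep an eye on how much of that credit you consume, you can query the usage history for Concurrency Scaling. The following is a minimal sketch; it assumes the svcs_concurrency_scaling_usage system view and these columns are available in your cluster.

SELECT start_time, end_time, queries, usage_in_seconds
FROM svcs_concurrency_scaling_usage
ORDER BY start_time DESC;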

With the ability to automatically add and remove additional capacity, Amazon Redshift data warehouses can improve overall throughput by over 35 times. This post demonstrates how far you can dynamically allocate more compute power to satisfy the demands of concurrent users with the following experiment. First, take a baseline measurement using the Cloud Data Warehouse benchmark and five concurrent users. You can then enable Concurrency Scaling and add more and more users with each iteration. As soon as Amazon Redshift detects queuing, it allocates additional scaling clusters automatically. Ultimately, this experiment ran over 200 concurrent queries on Amazon Redshift and generated more than 35 times greater throughput. This many concurrently executing queries represents a concurrent user population of several thousand. This demonstrates how you can support virtually unlimited concurrent users on your Amazon Redshift data warehouses.

The scaling for concurrent users is also linear. You get consistent increases in performance with every extra dollar spent on the Concurrency Scaling clusters. This helps to keep data warehouse costs predictable as business needs grow. With Concurrency Scaling, AWS can perform benchmark tests with tens of thousands of queries per hour, with hundreds of queries running concurrently and providing linear scale. This represents a real-world workload in enterprises with thousands of concurrent users connecting to the data warehouse.

Scaling while running multiple mixed workloads

As data warehouses grow over time, the number and complexity of the workloads that run on the data warehouse also increase. For example, if you migrate from an on-premises data warehouse to Amazon Redshift, you might first run traditional analytics workloads, and eventually bring more operational and real-time data into the cluster to build new use cases and applications. To scale any data warehouse effectively, you must be able to prioritize and manage multiple types of workloads concurrently. Automatic workload management (WLM) and query priorities are two recent capabilities added to Amazon Redshift that enable you to do just that.

Automatic WLM makes sure that you use cluster resources efficiently, even with dynamic and unpredictable workloads. With automatic WLM, Amazon Redshift uses ML to classify incoming queries based on their memory and processing requirements and routes them to appropriate internal queues to start executing in parallel. Amazon Redshift dynamically adjusts the number of queries to execute in parallel to optimize overall throughput and performance. When queries that require large amounts of resources are in the system (for example, hash joins between large tables), the concurrency is lower. When you submit lighter queries (such as inserts, deletes, or simple aggregations), concurrency is higher. There is a feedback loop to continuously monitor system utilization and regulate admission into the cluster.

However, not all queries may be equally important to you; the performance of one workload or set of users might be more important than others. Query priorities in Amazon Redshift address this. You can give higher-priority workloads preferential treatment, including more resources during busy times, for consistent query performance. Amazon Redshift workload management uses intelligent algorithms to make sure that lower-priority queries continue to make progress and don’t stall.
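
One way to observe priorities in action is to look at the queries currently tracked by workload management. The following sketch is only an illustration; it assumes the query_priority column is exposed by this system view in your cluster version.

SELECT query, service_class, state, queue_time, exec_time, query_priority
FROM stv_wlm_query_state
ORDER BY query;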

You can combine Amazon Redshift Concurrency Scaling and automatic WLM with query priorities to solve complex data warehouse use cases. For example, the following table summarizes an Amazon Redshift configuration that effectively mixes ETL with analytics workloads.

WLM queue                       | Queue priority | Concurrency Scaling | Notes
ETL                             | High           | Off                 | When ETL runs, it gets the highest priority
BI queries                      | Normal         | On                  | When BI workload suddenly increases, Concurrency Scaling adds capacity to maintain user SLAs
One-time or exploratory queries | Low            | Off                 | Cluster offers analytic access for casual users and data scientists when resources are available

For this use case, and many more, you can maintain SLAs, achieve efficiencies with your cluster utilization, and get sufficient flexibility to invest according to business priorities.

Price performance

You can measure price performance by calculating both the total cost of the computing service consumed and the total amount of computing work performed. Maximum performance for minimum cost gives you the best price performance.

As your data warehouses grow, Amazon Redshift gets more efficient. It moderates cost increases and keeps costs predictable, even as your needs grow. This sets Amazon Redshift apart from others in the market that increase in price much more as the number of users grows.

The investments in automatic optimizations, out-of-the-box performance, and scale all contribute to the unbeatable price performance that Amazon Redshift offers. When you compare typical customer quotes and investments, you find that Amazon Redshift costs 50% to 75% less than other cloud data warehouses.

Measuring performance

AWS measures performance, throughput, and price-performance on a nightly basis. AWS also runs larger and more comprehensive benchmarks regularly to make sure the tests extend beyond your current needs. For benchmark results to be useful, they need to be well defined and easily reproducible. AWS uses the Cloud DW benchmark based on current TPC-DS and TPC-H benchmarks without any query or data modifications and compliant with TPC rules and requirements.

It’s important that anyone can reproduce these benchmarks; you can download the benchmark code and scripts from GitHub and the accompanying dataset from a public Amazon S3 bucket.

Summary

Amazon Redshift is self-learning, self-optimizing, and consistently uses telemetry of the actual workload to deliver the best possible performance. Amazon Redshift is more than twice as fast out of the box as it was 6 months ago, and keeps getting faster without any manual optimization and tuning. Amazon Redshift can boost throughput by more than 35 times to support increases in concurrent users and scales linearly for simple and mixed workloads.

In addition to software improvements, AWS continues to build data warehouses on the best hardware available. The new RA3 node type with managed storage features high bandwidth networking and sizable high-performance SSDs as local caches. RA3 nodes use your workload patterns and advanced data management techniques, such as automatic fine-grained data eviction and intelligent data pre-fetching, to deliver the performance of local SSD while scaling storage automatically to Amazon S3. The hardware-based acceleration in preview with AQUA (Advanced Query Accelerator) brings even more dramatic performance improvements and drives costs down with a new distributed and hardware-accelerated cache.

These performance improvements are the cumulative result of years of strategic and sustained product investment and technical innovation across multiple areas such as automatic optimizations, out-of-the-box performance, and scalability. Additionally, price-performance remains a priority so you receive the best value.

Each dataset and workload has unique characteristics, and a proof of concept is the best way to understand how Amazon Redshift performs in your unique situation. When running your own proof of concept, it’s important that you focus on the right metrics: query throughput (number of queries per hour) and price-performance for your workload.

You can make a data-driven decision by requesting assistance with a proof of concept or working with a system integration and consulting partner. It’s also important to consider not only how a data warehouse performs with your current needs, but also its future performance with increasingly complex workloads, datasets, and users.

To stay up-to-date with the latest developments in Amazon Redshift, subscribe to the What’s New in Amazon Redshift RSS feed.

———
[1] The TPC Benchmark, TPC-DS and TPC-H are trademarks of the Transaction Processing Performance Council www.tpc.org


About the Author

Naresh Chainani is a Senior Software Development Manager at Amazon Redshift where he leads Query Processing, Query Performance, Distributed Systems and Workload Management with a strong team. Naresh is passionate about building high-performance databases to enable customers to gain timely insights and make critical business decisions. In his spare time, Naresh enjoys reading and playing tennis.

 

 

Berni Schiefer is a Senior Development Manager for EMEA at Amazon Web Services, leading the Amazon Redshift development team in Berlin, Germany. The Berlin team focuses on Redshift Performance and Scalability, SQL Query Compilation, Spatial Data support, and Redshift Spectrum. Previously, Berni was an IBM Fellow working in the area of Private Cloud, Db2, Db2 Warehouse, BigSQL, with a focus on SQL-based engines, query optimization and performance.

 

 

Neeraja Rentachintala is a seasoned Product Management and GTM leader at Amazon Web Services, bringing over 20 years of experience in Product Vision, Strategy and Leadership roles in industry-leading data products and platforms. During her career, she delivered products in Analytics, Big data, Databases, Data and Application Integration, AI/ML serving Fortune 500 enterprise and ventures including MapR (acquired by HPE), Microsoft, Oracle, Informatica and Expedia.com. Currently Neeraja is a Principal Product Manager with Amazon Web Services building Amazon Redshift – the world’s most popular, highest performance and most scalable cloud data warehouse. Neeraja earned a Bachelor of Technology in Electronics and Communication Engineering from the National Institute of Technology in India and various business program certifications from the University of Washington, MIT Sloan School of Management and Stanford University.

Use MAP for Windows to Simplify your Migration to AWS

Post Syndicated from Fred Wurden original https://aws.amazon.com/blogs/compute/use-map-for-windows-to-simplify-your-migration-to-aws/

There’s no question that organizations today are being disrupted in their industry. In a previous blog post, I shared that such disruption often accelerates organizations’ decisions to move to the cloud. When these organizations migrate to the cloud, Windows workloads are often critical to their business and these workloads require a performant, reliable, and secure cloud infrastructure. Customers tell us that reducing risk, building cloud expertise, and lowering costs are important factors when choosing that infrastructure.

Today, we are announcing the general availability of the Migration Acceleration Program (MAP) for Windows, a comprehensive program that helps you execute large-scale migrations and modernizations of your Windows workloads on AWS. We have millions of customers on AWS, and have spent the last 11 years helping Windows customers successfully move to our cloud. We’ve built a proven methodology, providing you with AWS services, tools, and expertise to help simplify the migration of your Windows workloads to AWS. MAP for Windows provides prescriptive guidance, consulting support from experts, tools, trainings, and service credits to help reduce the risk and cost of migrating to the cloud as you embark on your migration journey.

MAP for Windows also helps you along the pathways to modernize current and legacy versions of Windows Server and SQL Server to cloud native and open source solutions, enabling you to break free from commercial licensing costs. With the strong price-performance of open-source solutions and the proven reliability of AWS, you can innovate quickly while reducing your risk.

With MAP for Windows, you follow a simple three-step migration process:

  1. Assess Your Readiness: The migration readiness assessment helps you identify gaps along the six dimensions of the AWS Cloud Adoption Framework: business, process, people, platform, operations, and security. This assessment helps customers identify capabilities required in the migration. MAP for Windows also includes an Optimization and Licensing Assessment, which provides recommendations on how to optimize your licenses on AWS.
  2. Mobilize Your Resources: The mobilize phase helps you build an operational foundation for your migration, with the goal of fixing the capability gaps identified in the assessment phase. The mobilize phase accelerates your migration decisions by providing clear guidance on migration plans that improve the success of your migration.
  3. Migrate or Modernize Your Workloads: APN Partners and the AWS ProServe team help customers execute the large-scale migration plan developed during the mobilize phase. MAP for Windows also offers financial incentives to help you offset migration costs such as labor, training, and the expense of sometimes running two environments in parallel.

MAP for Windows includes support from AWS Professional Services and AWS Migration Competency Partners, such as Rackspace, 2nd Watch, Accenture, Cloudreach, Enimbos Global Services, Onica, and Slalom. Our MAP for Windows partners have successfully demonstrated completion of multiple large-scale migrations to AWS. They have received the APN Migration Competency Partner and the Microsoft Workloads Competency designations.

Learn about what MAP for Windows can do for you on this page. Learn also about the migration experiences of AWS customers. And contact us to discuss your Windows migration or modernization initiatives and apply to MAP for Windows.

About the Author

Fred Wurden is the GM of Enterprise Engineering (Windows, VMware, Red Hat, SAP, benchmarking) working to make AWS the most customer-centric cloud platform on Earth. Prior to AWS, Fred worked at Microsoft for 17 years and held positions, including: EU/DOJ engineering compliance for Windows and Azure, interoperability principles and partner engagements, and open source engineering. He lives with his wife and a few four-legged friends since his kids are all in college now.

Amazon Lightsail Database Tips and Tricks

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/amazon-lightsail-database-tips-and-tricks/

This post is contributed by Mike Coleman | Developer Advocate for Lightsail | Twitter: @mikegcoleman

Managed Databases on Amazon Lightsail are affordably priced, and incredibly easy to run. Lightsail databases offer a solid foundation on which to build your application.  You can leverage attractive features like one-click high availability, automatic backups, and a choice of database engines to support your Lightsail apps.

While it’s super simple to do an initial deployment on Amazon Lightsail, I often get questions about how to perform some standard management tasks, such as scaling up a database or accessing it with command line tools. I am also asked how to handle a scenario where you need some of the advanced features found in Amazon Relational Database Service (Amazon RDS).

This blog answers these questions and offers general guidance on how to address these issues.

Scale Up Your Database

When I first deploy resources to the cloud, I always choose the least expensive option. Oftentimes, that choice works out and everything runs fine. But sometimes it results in undersized resources, which necessitates a move to resources with more horsepower.

If this happens with your Lightsail databases, it’s straightforward to move your database to a larger size. Additionally, you can check the metrics page in the Amazon Lightsail console to see your database performance, and to determine if you need to upgrade.

Let’s walk through how to size up your database.

Start by creating a snapshot of your instance.

  1. Navigate to the Lightsail home page and click databases
  2. Click on the name of your database
  3. From the horizontal menu, click on Snapshots & restore
  4. Under Manual Snapshot, click + Create snapshot
  5. Give the snapshot a name
  6. Click Create

It takes several minutes for the snapshot creation process to complete. Once the snapshot is available, you can create your new database instance choosing a larger size.

  1. Click the three-dot menu to the right of the snapshot you just created
  2. Choose Create new database
  3. Under Choose your database plan, select either a Standard or High Availability plan. If you’re running a mission-critical application, you definitely want to choose the high availability option. Standard is great for test environments or workloads where your application can withstand downtime in the event of a database failure.
  4. Choose the size for your new database instance
  5. Give your database instance a name
  6. Click Create database

The new database is created after several minutes.
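If you prefer to script this instead of clicking through the console, the same snapshot-and-restore flow can be driven with the AWS CLI. The following is a minimal sketch: the database and snapshot names are hypothetical, and the bundle ID is a placeholder, so list the real options first with aws lightsail get-relational-database-bundles.

# Snapshot the existing database (hypothetical names)
aws lightsail create-relational-database-snapshot \
--relational-database-name my-db \
--relational-database-snapshot-name my-db-snapshot

# Restore the snapshot into a new, larger database
# (replace the bundle ID with one returned by get-relational-database-bundles)
aws lightsail create-relational-database-from-snapshot \
--relational-database-name my-db-large \
--relational-database-snapshot-name my-db-snapshot \
--relational-database-bundle-id medium_1_0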

Lightsail generates a new password when you create a new database from a snapshot. You can either use this newly generated password, or change it. You can change the password using the following steps:

  1. From the Lightsail home page, click Databases, and then click the name of your new database
  2. Scroll down to the Connection details section
  3. If you want to use the auto-generated password, click Show in the password box to display the password
    Otherwise complete steps 4 and 5 to specify a new password.
  4. Under Password, click Change password
  5. Enter a new password and click Save
    It will take a few minutes for the password to update
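If you script your deployments, the same change can be made with the AWS CLI. This is a sketch with a hypothetical database name and a placeholder password:

# Set a new master password on the database created from the snapshot
aws lightsail update-relational-database \
--relational-database-name my-db-large \
--master-user-password '<your-new-password>'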

Now, go into your application. Configure the application to point to the new database using the new endpoint, user name, and password values.

Note: It’s out of the scope for this blog to cover how to configure individual applications. Consult your application documentation to see how to do it for your specific application.

Command Line Access

There may be times when you need to work on your database using command line tools. You can’t log in directly to the instance that runs your Lightsail database, but you can access the database remotely from another Lightsail instance.

You can also make your instance accessible via the public internet, and access it remotely from any internet-connected computer. However, I wouldn’t recommend this from a security perspective.

To access your Lightsail database via the command line, you first need to create a new Lightsail instance. I recommend basing your instance on Lightsail’s LAMP blueprint because the MySQL command line tools are already installed.

To create a new LAMP instance, do the following:

  1. From the Lightsail home page, click Create Instance
  2. Make sure you create the instance in the same Region as your Lightsail database
  3. Under Select a blueprint, choose LAMP (PHP 7)
  4. Since you’re only using this instance to run MySQL command line tools, you can choose the smallest instance size
  5. Give your instance a name
  6. Click Create Instance

It takes a few minutes for your new instance to start up.
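For completeness, here is a hedged AWS CLI sketch of the same step. The blueprint and bundle IDs below are placeholders, so look up current values with aws lightsail get-blueprints and aws lightsail get-bundles, and use an Availability Zone in the same Region as your database.

# Create a small LAMP instance to act as a MySQL client
# (replace the blueprint and bundle IDs with values from get-blueprints/get-bundles)
aws lightsail create-instances \
--instance-names mysql-client \
--availability-zone us-east-2a \
--blueprint-id lamp_8_bitnami \
--bundle-id nano_2_0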

To check that everything is working correctly, use the MySQL command line interface.

Make sure you have the database user name, password, and endpoint. These can be found by clicking on the name of your database under the Connection details section.

  1. Use either your own SSH client or the built-in web client to access the Lightsail instance you just created
  2. On the command line, enter the following command substituting the values for your database
mysql \
--host <lightsail database endpoint> \
--user <lightsail database username> \
--password

For example:

mysql \
--host ls-randomchars.us-east-2.rds.amazonaws.com \
--user dbmasteruser \
--password

Notice that you don’t actually put the password on the command line.

3. When prompted, enter the password (note that the password does not show up as you type it)

4. You should now be at the MySQL command prompt

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 87482
Server version: 5.7.26-log Source distribution

Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

From here, you can use the command line as you normally would.
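For example, you can pipe a quick statement into the client instead of opening an interactive session (the same pattern is used later in this post for the migration):

# Run a quick ad hoc statement without opening an interactive session
echo "SELECT VERSION(), NOW();" | \
mysql \
--host <lightsail database endpoint> \
--user <lightsail database username> \
--password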

Migrating From a Managed Database to Amazon RDS

One of the great things about Lightsail is that it’s easy to get started quickly. It also gives you an easy migration path to more advanced AWS services, should you ever need them. For instance, you might set up your database on Lightsail and then realize that it could benefit from read replicas to handle growing traffic. Fortunately, it’s a straightforward process to migrate your data from Lightsail to RDS.


Deploy an Amazon RDS database

First, make sure you have an RDS database running the same engine in the same Region as your Lightsail database, and in your default Amazon VPC. For example, if your Lightsail database is running MySQL in the Oregon Region, RDS should also be running MySQL in the Oregon Region and in the default VPC. If you’re not sure how to create an RDS database, check out the Amazon RDS documentation.

Make sure to note the username and password for your new database.
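If you’d rather script the RDS side too, the following is a minimal sketch with a hypothetical identifier and a placeholder password; with no subnet group specified, the instance lands in your default VPC. See the RDS documentation for the full set of options.

# Create a small MySQL instance in the default VPC of the target Region
aws rds create-db-instance \
--db-instance-identifier lightsail-migration-target \
--engine mysql \
--db-instance-class db.t3.micro \
--allocated-storage 20 \
--master-username dbmasteruser \
--master-user-password '<choose-a-password>' \
--region us-west-2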

Create a Lightsail Instance

You also need a Lightsail instance with the MySQL command line tools installed. You can set one up by following the instructions in the previous section of this blog.

Enable VPC Peering

To get started, ensure that the Lightsail VPC can communicate with the VPC where your RDS database runs. You do this by enabling VPC peering in Lightsail, and modifying the security group for RDS to allow traffic from the Lightsail VPC.

  1. Return to the Lightsail console home page and click Account in the top-right corner. Choose Account from the pop-out menu.
  2. Click Advanced on the horizontal menu
  3. Under VPC peering, ensure that the Enable VPC peering box is checked for the Region where your database is deployed.
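The equivalent AWS CLI call is a one-liner; run it against the Region where your database lives (us-west-2 below is just the Region used in this post’s examples):

# Enable VPC peering between the Lightsail VPC and your default VPC
aws lightsail peer-vpc --region us-west-2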

Adjust the RDS database security group

The next step is to edit the security group for the RDS instance to allow traffic from the Lightsail subnet.

  1. Return to the RDS console home page
  2. Under Resources, click on DB Instances
  3. Click on the name of the database you want to migrate data into
  4. Under Connectivity and security, click on the security group name

The security group dialog appears. From here you can add an entry for the Lightsail subnet.

  1. Click the Inbound tab near the bottom of the screen
  2. Click the Edit button
  3. Click Add rule in the pop-up box
  4. From the Type drop-down choose MySQL/Aurora
  5. In the source box, enter 172.26.0.0/16 (this is the CIDR address for the Lightsail subnet)
  6. Click Save
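You can also add the rule with the AWS CLI; the security group ID below is a placeholder for your RDS instance’s group:

# Allow MySQL traffic from the Lightsail subnet into the RDS security group
aws ec2 authorize-security-group-ingress \
--group-id sg-0123456789abcdef0 \
--protocol tcp \
--port 3306 \
--cidr 172.26.0.0/16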

Migrate the data from the Lightsail Database to RDS

Now that Lightsail resources can talk with your RDS database, you can do the actual migration.

The initial step is to use mysqldump to export your database information into a file that can be imported into RDS. mysqldump has many options. In this case, you export a database named tasks. Choose the appropriate database for your use case, as well as any other options that make sense.

  1. Use either your own SSH client or the built-in web client to access the Lightsail instance you just created.
  2. Use the following mysqldump command to create a backup of your database to a text file (dump.sql). Substitute the connection values for your Lightsail database. These values are on the details page of your database, under Connection details. Use the database name that is specific to your environment.
mysqldump \
--host <lightsail database endpoint> \
--user <lightsail database username> \
--databases <database name> \
--password \
> dump.sql

For example:

mysqldump \
--host ls-randomchars.us-west-2.rds.amazonaws.com \
--user dbmasteruser \
--databases tasks \
--password \
--set-gtid-purged=OFF \
> dump.sql

Now that you have a database backup, you can import it into your RDS instance. You need the connection details for your RDS database. Use the username and password from when you created the database. You can find the endpoint on the details page of your database under Connectivity and security.


If you are not there already, return to the terminal session for the Lightsail instance that has the MySQL tools installed.

To import the data into the RDS database, you provide the contents of the dump.sql file to the mysql command line. The cat command reads the file, and the | operator (referred to as a pipe) sends its output directly into mysql.

cat dump.sql | \
mysql \
--host <RDS database endpoint> \
--user <RDS user> \
--password

For example:

cat dump.sql | \
mysql \
--host database.randomchars.us-west-2.rds.amazonaws.com \
--user dbmasteruser \
--password

You can also use the mysql command to confirm that the database was created. This is similar to piping in the file in the previous step, except this time we use echo to pipe in the command show databases;.

echo "show databases;" | \
mysql \
--host <RDS database endpoint> \
--user <RDS user> \
--password

For example:

echo "show databases;" | \
mysql \
--host database.randomchars.us-west-2.rds.amazonaws.com \
--user dbmasteruser \
--password

From here, you reconfigure your application to access your new RDS database.

Conclusion

In this post I reviewed some common tasks that you might want to do once you created your Amazon Lightsail database. You learned how to scale up the size of your database, how to access it with command line tools, and how to migrate to RDS.

If you’ve not yet deployed a Managed Database on Lightsail, why not head over to the Lightsail console and create one now? If you need a bit of guidance to get started, we have a workshop at https://lightsailworkshop.com that shows you how to use Lightsail to deploy a two-tier web application with a MySQL database backend. Please feel free to leave comments and questions for future blog posts.

Speeding up Etleap models at AXS with Amazon Redshift materialized views

Post Syndicated from Christian Romming original https://aws.amazon.com/blogs/big-data/speeding-up-etleap-models-at-axs-with-amazon-redshift-materialized-views/

The materialized views feature in Amazon Redshift is now generally available and has been benefiting customers and partners in preview since December 2019. One customer, AXS, is a leading ticketing, data, and marketing solutions provider for live entertainment venues in the US, UK, Europe, and Japan. Etleap, an Amazon Redshift partner, is an extract, transform, load, and transform (ETLT) service built for AWS. AXS uses Etleap to ingest data into Amazon Redshift from a variety of sources, including file servers, Amazon S3, relational databases, and applications. These ingestion pipelines parse, structure, and load data into Amazon Redshift tables with appropriate column types and sort and distribution keys.

Improving dashboard performance with Etleap models

To analyze data, AXS typically runs queries against large tables that originate from multiple sources. One of the ways that AXS uses Amazon Redshift is to power interactive dashboards. To achieve fast dashboard load times, AXS pre-computes partial answers to the queries dashboards use. These partial answers are orders of magnitude smaller in terms of the number of rows than the tables on which they are based. By querying Amazon Redshift tables that hold the pre-computed partial answers, dashboards can load much faster than they would if they queried the base tables directly.

Etleap supports creating and managing such pre-computations through a feature called models. A model consists of a SELECT query and triggers for when it should be updated. An example of a trigger is a change to a base table, that is, a table used by the SELECT statement that defines the model. This way, the model can remain consistent with its base tables.

The following screenshot shows an Etleap model with two base table dependencies.

Etleap represents their models as tables in Amazon Redshift. To create the model table, Etleap wraps the SELECT statement in a CREATE TABLE AS (CTAS) query. When an update is triggered, for example, due to base table inserts, updates, or deletes, Etleap recomputes the model table through the following code:

CREATE TABLE model_temporary AS SELECT …
DROP TABLE model;
ALTER TABLE model_temporary RENAME TO model;

Analyzing CTAS performance as data grows

AXS manages a large number of Etleap models. For one particular model, the CTAS query takes over 6 minutes, on average. This query performs an aggregation on a join of three different tables, including an event table that is constantly ingesting new data and contains over a billion rows. The following graph shows that the CTAS query time increases as the event table increases in number of rows.

There are two key problems with the query taking longer:

  • There’s a longer delay before the updated model is available to analysts
  • The model update consumes more Amazon Redshift cluster resources

To address this, AXS would have to resort to workarounds that are either inconvenient or costly, such as archiving older data from the event table or expanding the Amazon Redshift cluster to increase available resources.

Comparing CTAS to materialized views

Etleap decided to run an experiment to verify that Amazon Redshift’s materialized views feature is an improvement over the CTAS approach for this AXS model. First, they built the materialized view by wrapping the SELECT statement in a CREATE MATERIALIZED VIEW AS query. For updates, instead of recreating the materialized view every time that data in a base table changes, a REFRESH MATERIALIZED VIEW query is sufficient. The expectation was that using materialized views would be significantly faster than the CTAS-based procedure. The following graph compares query times of CTAS to materialized view refresh.

Running REFRESH MATERIALIZED VIEW was 7.9 times faster than the CTAS approach—it took 49 seconds instead of 371 seconds on average at the current scale. Additionally, the update time was roughly proportional to the number of rows that were added to the base table since the last update, rather than the total size of the base table. In this use case, this number is 3.8 million, which corresponds to the approximate number of events ingested per day.

This is great news. The solution solves the previous problems because the delay caused by the model update stays constant as new data comes in, and so do the resources that Amazon Redshift consumes (assuming the growth of the base table is constant). In other words, using materialized views eliminates the need for workarounds, such as archiving or cluster expansion, as the dataset grows. It also simplifies the refresh procedure for model updates by reducing the number of SQL statements from three (CREATE, DROP, and RENAME) to one (REFRESH).
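To make the switch concrete, here is a minimal sketch, not AXS’s actual model: the table, columns, and connection details are hypothetical, and the statements are run from any PostgreSQL-compatible client such as psql.

# Create the materialized view once, then refresh it whenever a trigger fires
psql "host=<redshift-cluster-endpoint> port=5439 dbname=dev user=awsuser" <<'SQL'
CREATE MATERIALIZED VIEW daily_ticket_sales AS
SELECT venue_id,
       DATE_TRUNC('day', sold_at) AS sale_day,
       COUNT(*) AS tickets_sold
FROM events
GROUP BY venue_id, DATE_TRUNC('day', sold_at);

-- Later, after new rows land in the base table:
REFRESH MATERIALIZED VIEW daily_ticket_sales;
SQL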

Achieving fast refresh performance with materialized views

Amazon Redshift can refresh a materialized view efficiently and incrementally. It keeps track of the last transaction in the base tables up to which the materialized view was previously refreshed. During subsequent refreshes, Amazon Redshift processes only the newly inserted, updated, or deleted tuples in the base tables, referred to as a delta, to bring the materialized view up-to-date with its base tables. In other words, Amazon Redshift can incrementally maintain the materialized view by reading only base table deltas, which leads to faster refresh times.

For AXS, Amazon Redshift analyzed their materialized view definitions, which join multiple tables, filters, and aggregates, to figure out how to incrementally maintain their specific materialized view. Each time AXS refreshes the materialized view, Amazon Redshift quickly determines if a refresh is needed, and if so, incrementally maintains the materialized view. As records are ingested into the base table, the materialized view refresh times shown are much faster and grow very slowly because each refresh reads a delta that is small and roughly the same size as the other deltas. In comparison, the refresh times using CTAS are much slower because each refresh reads all the base tables. Moreover, the refresh times using CTAS grow much faster because the amount of data that each refresh reads grows with the ingest rate.

You are in full control of when to refresh your materialized views. For example, AXS refreshes their materialized views based on triggers defined in Etleap. As a result, transactions that are run on base tables do not incur additional cost to maintain dependent materialized views. Decoupling the base tables’ updates from the materialized view’s refresh gives AXS an easy way to insulate their dashboard users and offers them a well-defined snapshot to query, while ingesting new data into base tables. When AXS vets the next batch of base table data via their ETL pipelines, they can refresh their materialized views to offer the next snapshot of dashboard results.

In addition to efficiently maintaining their materialized views, AXS also benefits from the simplicity of Amazon Redshift storing each materialized view as a plain table. Queries on the materialized view perform with the same world-class speed that Amazon Redshift runs any query. You can organize a materialized view like other tables, which means that you can exploit distribution key and sort columns to further improve query performance. Finally, when you need to process many queries at peak times, Amazon Redshift’s concurrency scaling kicks in automatically to elastically scale query processing capacity.
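As a sketch of that last point (again with hypothetical names and connection details), distribution and sort keys can be declared when the materialized view is created, just as on a table:

# Organize the materialized view around the dashboard's most common filter and sort
psql "host=<redshift-cluster-endpoint> port=5439 dbname=dev user=awsuser" <<'SQL'
CREATE MATERIALIZED VIEW daily_ticket_sales_by_venue
DISTKEY (venue_id)
SORTKEY (sale_day)
AS
SELECT venue_id,
       DATE_TRUNC('day', sold_at) AS sale_day,
       COUNT(*) AS tickets_sold
FROM events
GROUP BY venue_id, DATE_TRUNC('day', sold_at);
SQL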

Conclusion

Now that the materialized views feature is generally available, Etleap gives you the option of using materialized views rather than tables when creating models. You can use models more actively as part of your ETLT strategies, and also choose more frequent update schedules for your models, due to the performance benefits of incremental refreshes.

For more information about Amazon Redshift materialized views, see Materialize your Amazon Redshift Views to Speed Up Query Execution and Creating Materialized Views in Amazon Redshift.

 


About the Authors

Christian Romming is the founder and CEO of Etleap. Etleap is a managed ETL solution for AWS that doesn’t require extensive engineering work to set up, maintain, and scale.

Prasad Varakur is a Database, Big Data & Distributed Systems enthusiast, and Product Manager at Amazon Web Services. Prior to this, he developed database and storage engines at SAP/Sybase, Couchbase, Huawei, Novell, EMC, and Veritas. He holds 11 patents in database systems and distributed computing, and his thesis contributed foundational work on Parametric Query Optimization. He holds a Master’s degree in Computer Science from IIT Kanpur.

Vuk Ercegovac is a principal engineer for Redshift at AWS.