All posts by BP Yau

How to get best price performance from your Amazon Redshift Data Sharing deployment

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/how-to-get-best-price-performance-from-your-amazon-redshift-data-sharing-deployment/

Amazon Redshift is a fast, scalable, secure, and fully-managed data warehouse that enables you to analyze all of your data using standard SQL easily and cost-effectively. Amazon Redshift Data Sharing allows customers to securely share live, transactionally consistent data in one Amazon Redshift cluster with another Amazon Redshift cluster across accounts and regions without needing to copy or move data from one cluster to another.

Amazon Redshift Data Sharing was initially launched in March 2021, and support for cross-account data sharing was added in August 2021. Cross-Region support became generally available in February 2022. This provides full flexibility and agility to share data across Amazon Redshift clusters in the same AWS account, different accounts, or different Regions.

Amazon Redshift Data Sharing is used to fundamentally redefine Amazon Redshift deployment architectures into a hub-spoke, data mesh model to better meet performance SLAs, provide workload isolation, perform cross-group analytics, easily onboard new use cases, and most importantly do all of this without the complexity of data movement and data copies. Some of the most common questions asked during data sharing deployment are, “How big should my consumer clusters and producer clusters be?”, and “How do I get the best price performance for workload isolation?”. As workload characteristics like data size, ingestion rate, query pattern, and maintenance activities can impact data sharing performance, a continuous strategy to size both consumer and producer clusters to maximize the performance and minimize cost should be implemented. In this post, we provide a step-by-step approach to help you determine your producer and consumer clusters sizes for the best price performance based on your specific workload.

Generic consumer sizing guidance

The following steps show the generic strategy to size your producer and consumer clusters. You can use it as a starting point and modify it to suit your specific use case.

Size your producer cluster

You should always make sure that you properly size your producer cluster to get the performance that you need to meet your SLA. You can leverage the sizing calculator from the Amazon Redshift console to get a recommendation for the producer cluster based on the size of your data and query characteristics. Look for Help me choose on the console in AWS Regions that support RA3 node types to use this sizing calculator. Note that this is just an initial recommendation to get started, and you should test running your full workload on the initial size cluster and elastic resize the cluster up and down accordingly to get the best price performance.

Size and set up the initial consumer cluster

You should always size your consumer cluster based on your compute needs. One way to get started is to follow the generic cluster sizing guide similar to the producer cluster above.

Set up Amazon Redshift data sharing

Once you have both the producer and consumer clusters set up, configure data sharing from the producer to the consumer. Refer to this post for guidance on how to set up data sharing.
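
The following is a minimal sketch of that setup; the share name, database name, and namespace IDs are placeholders that you substitute with your own values:

-- On the producer cluster: create a data share and add the objects to share
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA public;
ALTER DATASHARE sales_share ADD ALL TABLES IN SCHEMA public;

-- Grant usage to the consumer cluster namespace (use GRANT ... TO ACCOUNT for cross-account sharing)
GRANT USAGE ON DATASHARE sales_share TO NAMESPACE '<consumer-namespace-id>';

-- On the consumer cluster: create a local database that points to the share
CREATE DATABASE sales_shared_db FROM DATASHARE sales_share OF NAMESPACE '<producer-namespace-id>';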

Test consumer only workload on initial consumer cluster

Test consumer only workload on the new initial consumer cluster. This can be done by pointing consumer applications, for example ETL tools, BI applications, and SQL clients, to the new consumer cluster and rerunning the workload to evaluate the performance against your requirements.

Test consumer only workload on different consumer cluster configurations

If the initial size consumer cluster meets or exceeds your workload performance requirements, then you can either continue to use this cluster configuration or you can test on smaller configurations to see if you can further reduce the cost and still get the performance that you need.

On the other hand, if the initial size consumer cluster fails to meet your workload performance requirements, then you can further test larger configurations to get the configuration that meets your SLA.

As a rule of thumb, size up the consumer cluster by 2x the initial cluster configuration incrementally until it meets your workload requirements.

Once you plan out what configuration you want to test, use elastic resize to resize the initial cluster to the target cluster configuration. After elastic resize is completed, perform the same workload test and evaluate the performance against your SLA. Select the configuration that meets your price performance target.

Test producer only workload on different producer cluster configurations

Once you move your consumer workload to the consumer cluster with the optimum price performance, there might be an opportunity to reduce the compute resource on the producer to save on costs.

To achieve this, you can rerun the producer only workload on 1/2x of the original producer size and evaluate the workload performance. Depending on the result, resize the cluster up or down accordingly, and then select the minimum producer configuration that meets your workload performance requirements.

Re-evaluate after a full workload run over time

As Amazon Redshift continues to evolve and release performance and scalability improvements, data sharing performance will keep improving. Furthermore, numerous variables might impact the performance of data sharing queries. The following are just some examples:

  • Ingestion rate and amount of data change
  • Query pattern and characteristic
  • Workload changes
  • Concurrency
  • Maintenance activities, for example vacuum, analyze, and automatic table optimization (ATO)

This is why you must re-evaluate the producer and consumer cluster sizing using the strategy above on occasion, especially after a full workload deployment, to gain the new best price performance from your cluster’s configuration.

Automated sizing solutions

If your environment involves a more complex architecture, for example with multiple tools or applications (BI, ingestion or streaming, ETL, data science), then it might not be feasible to use the manual method from the generic guidance above. Instead, you can leverage solutions in this section to automatically replay the workload from your production cluster on the test consumer and producer clusters to evaluate the performance.

The Simple Replay utility is used as the automated solution to guide you through the process of getting the right producer and consumer cluster sizes for the best price performance.

Simple Replay is a tool for conducting a what-if analysis and evaluating how your workload performs in different scenarios. For example, you can use the tool to benchmark your actual workload on a new instance type like RA3, evaluate a new feature, or assess different cluster configurations. It also includes enhanced support for replaying data ingestion and export pipelines with COPY and UNLOAD statements. To get started and replay your workloads, download the tool from the Amazon Redshift GitHub repository.

Here we walk through the steps to extract your workload logs from the source production cluster and replay them in an isolated environment. This lets you perform a direct comparison between these Amazon Redshift clusters seamlessly and select the cluster configuration that best meets your price performance target.

The following diagram shows the solution architecture.

Architecture for testing Simple Replay

Solution walkthrough

Follow these steps to go through the solution to size your consumer and producer clusters.

Size your production cluster

You should always make sure to properly size your existing production cluster to get the performance that you need to meet your workload requirements. You can leverage the sizing calculator from the Amazon Redshift console to get a recommendation on the production cluster based on the size of your data and query characteristics. Look for Help me choose on the console in AWS Regions that support RA3 node types to use this sizing calculator. Note that this is just an initial recommendation to get started. You should test running your full workload on the initial size cluster and elastic resize the cluster up and down accordingly to get the best price performance.

Identify the workload to be isolated

You might have different workloads running on your original cluster, but the first step is to identify the most critical workload to the business that we want to isolate. This is because we want to make sure that the new architecture can meet your workload requirements. This post is a good reference on a data sharing workload isolation use case that can help you decide which workload can be isolated.

Set up Simple Replay

Once you know your critical workload, enable audit logging on the production cluster where that workload runs to capture query activities and store them in Amazon Simple Storage Service (Amazon S3). Note that it may take up to three hours for the audit logs to be delivered to Amazon S3. Once the audit log is available, proceed to set up Simple Replay and then extract the critical workload from the audit log. Note that start_time and end_time can be used as parameters to filter the critical workload if it runs during certain time periods, for example 9 AM to 11 AM. Otherwise, Simple Replay extracts all of the logged activities.

Baseline workload

Create a baseline cluster with the same configuration as the producer cluster by restoring from the production snapshot. The purpose of starting with the same configuration is to baseline the performance with an isolated environment.

Once the baseline cluster is available, replay the extracted workload in the baseline cluster. The output from this replay will be the baseline used to compare against subsequent replays on different consumer configurations.

Set up initial producer and consumer test clusters

Create a producer cluster with the same production cluster configuration by restoring from the production snapshot. Create a consumer cluster with the recommended initial consumer size from the previous guidance. Then set up data sharing between the producer and the consumer.

Replay workload on initial producer and consumer

Replay the producer only workload on the initial size producer cluster. This can be achieved using the “Exclude” filter parameter to exclude consumer queries, for example the user that runs consumer queries.

Replay the consumer only workload on the initial size consumer cluster. This can be achieved using the “Include” filter parameter to include only the consumer queries, for example the user that runs consumer queries.

Evaluate the performance of these replays against the baseline and workload performance requirements.

Replay consumer workload on different configurations

If the initial size consumer cluster meets or exceeds your workload performance requirements, then you can either use this cluster configuration or you can follow these steps to test on smaller configurations to see if you can further reduce costs and still get the performance that you need.

Compare initial consumer performance results against your workload requirements:

  1. If the result exceeds your workload performance requirements, then you can reduce the size of the consumer cluster incrementally, starting with 1/2x, retry the replay and evaluate the performance, then resize up or down accordingly based on the result until it meets your workload requirements. The purpose is to get a sweet spot where you’re comfortable with the performance requirements and get the lowest price possible.
  2. If the result fails to meet your workload performance requirements, then you can increase the size of the cluster incrementally, starting with 2x the original size, retry the replay and evaluate the performance until it meets your workload performance requirements.

Replay producer workload on different configurations

Once you split your workloads out to consumer clusters, the load on the producer cluster should be reduced and you should evaluate your producer cluster’s workload performance to seek the opportunity to downsize to save on costs.

The steps are similar to the consumer replay. Elastic resize the producer cluster incrementally, starting with 1/2x the original size, replay the producer only workload and evaluate the performance, and then further resize up or down until it meets your workload performance requirements. The purpose is to get a sweet spot where you’re comfortable with the workload performance requirements and get the lowest price possible. Once you have the desired producer cluster configuration, replay the consumer workloads on the consumer cluster again to make sure that the performance wasn’t impacted by the producer cluster configuration changes. Finally, replay both producer and consumer workloads concurrently to make sure that the performance holds in a full workload scenario.

Re-evaluate after a full workload run over time

Similar to the generic guidance, you should re-evaluate the producer and consumer clusters sizing using the previous strategy on occasion, especially after full workload deployment to gain the new best price performance from your cluster’s configuration.

Clean up

Running these sizing tests in your AWS account may have some cost implications because it provisions new Amazon Redshift clusters, which may be charged as on-demand instances if you don’t have Reserved Instances. When you complete your evaluations, we recommend deleting the Amazon Redshift clusters to save on costs. We also recommend pausing your clusters when they’re not in use.

Applying Amazon Redshift and data sharing best practices

Proper sizing of both your producer and consumer clusters will give you a good start to get the best price performance from your Amazon Redshift deployment. However, sizing isn’t the only factor that can maximize your performance. In this case, understanding and following best practices are equally important.

General Amazon Redshift performance tuning best practices are applicable to data sharing deployment. Make sure that your deployment follows these best practices.

There are numerous data sharing-specific best practices that you should follow to make sure that you maximize performance. Refer to this post for more details.

Summary

There is no one-size-fits-all recommendation on producer and consumer cluster sizes. It varies by workloads and your performance SLA. The purpose of this post is to provide you with guidance on how you can evaluate your specific data sharing workload performance to determine both consumer and producer cluster sizes to get the best price performance. Consider testing your workloads on the producer and consumer using Simple Replay before adopting the new architecture in production to get the best price performance.


About the Authors

BP Yau is a Sr Product Manager at AWS. He is passionate about helping customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Sidhanth Muralidhar is a Principal Technical Account Manager at AWS. He works with large enterprise customers who run their workloads on AWS. He is passionate about working with customers and helping them architect workloads for costs, reliability, performance and operational excellence at scale in their cloud journey. He has a keen interest in Data Analytics as well.

Amazon Redshift data sharing best practices and considerations

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/amazon-redshift-data-sharing-best-practices-and-considerations/

Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. Amazon Redshift data sharing allows for a secure and easy way to share live data for reading across Amazon Redshift clusters. It allows an Amazon Redshift producer cluster to share objects with one or more Amazon Redshift consumer clusters for read purposes without having to copy the data. With this approach, workloads isolated to different clusters can share and collaborate frequently on data to drive innovation and offer value-added analytic services to your internal and external stakeholders. You can share data at many levels, including databases, schemas, tables, views, columns, and user-defined SQL functions, to provide fine-grained access controls that can be tailored for different users and businesses that all need access to Amazon Redshift data. The feature itself is simple to use and integrate into existing BI tools.

In this post, we discuss Amazon Redshift data sharing, including some best practices and considerations.

How does Amazon Redshift data sharing work?

  • To achieve best-in-class performance, Amazon Redshift consumer clusters cache and incrementally update block-level metadata (let us refer to this as block metadata) of the objects that are queried from the producer cluster. This works even when the producer cluster is paused.
  • The time taken for caching block metadata depends on the rate of data change on the producer since the respective objects were last queried on the consumer. (As of today, consumer clusters update their metadata cache for an object only on demand, that is, when the object is queried.)
  • If there are frequent DDL operations on the producer, the consumer is forced to re-cache the full block metadata for an object during the next access to maintain consistency, because structure changes on the producer invalidate all the existing metadata cache on the consumers. This is what enables live sharing.
  • Once the consumer has the block metadata in sync with the latest state of an object on the producer, the query runs like any other regular query (a query referring to local objects).

Now that we have the necessary background on data sharing and how it works, let’s look at a few best practices across streams that can help improve workloads while using data sharing.

Security

In this section, we share some best practices for security when using Amazon Redshift data sharing.

Use INCLUDE NEW cautiously

INCLUDE NEW is a very useful setting when adding a schema to a data share (ALTER DATASHARE). If set to TRUE, it automatically adds all objects created in the specified schema in the future to the data share. This might not be ideal in cases where you want fine-grained control over the objects being shared. In those cases, leave the setting at its default of FALSE.
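
For illustration, the following sketch shows both options; the share and schema names are assumptions:

-- Add a schema to the data share; future objects are not added automatically (default behavior)
ALTER DATASHARE sales_share ADD SCHEMA public;

-- Opt in to automatically sharing objects created in the schema later, only if that is what you want
ALTER DATASHARE sales_share SET INCLUDENEW = TRUE FOR SCHEMA public;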

Use views to achieve fine-grained access control

To achieve fine-grained access control for data sharing, you can create late-binding views or materialized views on shared objects on the consumer, and then grant access to these views to users on the consumer cluster instead of giving them full access to the original shared objects. This comes with its own set of considerations, which we explain later in this post.
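
The following sketch illustrates the pattern on the consumer; the shared database, local schema, view, column, and user names are assumptions:

-- On the consumer: expose only selected columns of a shared table through a late-binding view
CREATE VIEW analytics.customer_slim AS
SELECT c_customer_id, c_first_name, c_last_name
FROM sales_shared_db.public.customer
WITH NO SCHEMA BINDING;

-- Grant access to the view instead of the underlying shared objects
GRANT USAGE ON SCHEMA analytics TO bi_user;
GRANT SELECT ON analytics.customer_slim TO bi_user;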

Audit data share usage and changes

Amazon Redshift provides an efficient way to audit all the activity and changes with respect to a data share using system views such as SVL_DATASHARE_CHANGE_LOG, which tracks changes to data shares, and SVL_DATASHARE_USAGE_PRODUCER and SVL_DATASHARE_USAGE_CONSUMER, which track data sharing usage on the producer and consumer, respectively.
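
For example, the following queries (a quick sketch; add filters that match your share names and time window) surface recent data share changes and usage:

-- Changes made to data shares, such as objects added or removed and permissions granted
SELECT * FROM svl_datashare_change_log LIMIT 20;

-- Data sharing query activity as recorded on the producer and on the consumer
SELECT * FROM svl_datashare_usage_producer LIMIT 20;
SELECT * FROM svl_datashare_usage_consumer LIMIT 20;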

Performance

In this section, we discuss best practices related to performance.

Materialized views in data sharing environments

Materialized views (MVs) provide a powerful route to precompute complex aggregations for use cases where high throughput is needed, and you can directly share a materialized view object via data sharing as well.

For materialized views built on tables where there are frequent write operations, it’s ideal to create the materialized view object on the producer itself and share the view. This method gives us the opportunity to centralize the management of the view on the producer cluster itself.

For slowly changing data tables, you can share the table objects directly and build the materialized view on the shared objects directly on the consumer. This method gives us the flexibility of creating a customized view of data on each consumer according to your use case.

This can help optimize the block metadata download and caching times in the data sharing query lifecycle. This also helps in materialized view refreshes because, as of this writing, Amazon Redshift doesn’t support incremental refresh for MVs built on shared objects.
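
As a minimal consumer-side sketch of the slowly changing data pattern described above (the shared database, table, and column names are assumptions):

-- On the consumer: precompute a customized aggregation over a shared, slowly changing table
CREATE MATERIALIZED VIEW item_counts_by_category AS
SELECT i_category, COUNT(*) AS item_count
FROM sales_shared_db.public.item
GROUP BY i_category;

-- Refresh on your own schedule; refreshes of MVs built on shared objects are full, not incremental
REFRESH MATERIALIZED VIEW item_counts_by_category;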

Factors to consider when using cross-Region data sharing

Data sharing is supported even if the producer and consumer are in different Regions. There are a few differences we need to consider while implementing a share across Regions:

  • Consumer data reads are charged at $5/TB for cross-Region data shares, whereas data sharing within the same Region is free. For more information, refer to Managing cost control for cross-Region data sharing.
  • Performance will also vary compared to a same-Region data share because the block metadata exchange and data transfer between cross-Region shared clusters take more time due to network throughput.

Metadata access

There are many system views, such as SVV_ALL_TABLES, SVV_ALL_COLUMNS, and SVV_ALL_SCHEMAS, that help with fetching the list of shared objects a user has access to. Some of these views return all the objects from the database that you’re currently connected to, as well as objects from all the other databases that you have access to on the cluster, including external objects.

We suggest using very restrictive filtering while querying these views because a simple select * will result in an entire catalog read, which isn’t ideal. For example, take the following query:

select * from svv_all_tables;

This query will try to collect metadata for all the shared and local objects, making it very heavy in terms of metadata scans, especially for shared objects.

The following is a better query for achieving a similar result:

SELECT table_name,
       column_name,
       data_type
FROM svv_all_tables
WHERE table_name = <tablename>
  AND schema_name = <schemaname>
  AND database_name = <databasename>
ORDER BY ordinal_position;

This is a good practice to follow for all metadata views and tables; doing so allows seamless integration into several tools. You can also use the SVV_DATASHARE* system views to exclusively see shared object-related information.
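
For example (the share name here is an assumption):

-- List the inbound and outbound data shares known to this cluster
SELECT share_name, share_type FROM svv_datashares;

-- List the objects included in a specific data share
SELECT * FROM svv_datashare_objects WHERE share_name = 'sales_share';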

Producer/consumer dependencies

In this section, we discuss the dependencies between the producer and consumer.

Impact of the consumer on the producer

Queries on the consumer cluster will have no impact in terms of performance or activity on the producer cluster. This is why we can achieve true workload isolation using data sharing.

Encrypted producers and consumers

Data sharing seamlessly integrates even if both the producer and the consumer are encrypted using different AWS Key Management Service (AWS KMS) keys. There are sophisticated, highly secure key exchange protocols to facilitate this so you don’t have to worry about encryption at rest and other compliance dependencies. The only requirement is that both the producer and consumer have a homogeneous encryption configuration.

Data visibility and consistency

A data sharing query on the consumer can’t impact the transaction semantics on the producer. All the queries involving shared objects on the consumer cluster follow read-committed transaction consistency while checking for visible data for that transaction.

Maintenance

If there is a scheduled manual VACUUM operation in use for maintenance activities on the producer cluster on shared objects, you should use VACUUM recluster whenever possible. This is especially important for large objects because it has optimizations in terms of the number of data blocks the utility interacts with, which results in less block metadata churn compared to a full vacuum. This benefits the data sharing workloads by reducing the block metadata sync times.
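
A minimal sketch of that maintenance statement, assuming a large shared table named web_sales in the public schema of the producer:

-- Re-sort only the unsorted region of the table, which keeps block metadata churn low
VACUUM RECLUSTER public.web_sales;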

Add-ons

In this section, we discuss additional add-on features for data sharing in Amazon Redshift.

Real-time data analytics using Amazon Redshift streaming data

Amazon Redshift recently announced the preview for streaming ingestion using Amazon Kinesis Data Streams. This eliminates the need for staging the data and helps achieve low-latency data access. The data generated via streaming on the Amazon Redshift cluster is exposed using a materialized view. You can share this as any other materialized view via a data share and use it to set up low-latency shared data access across clusters in minutes.
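
As a hedged sketch (the share name and materialized view name are assumptions, and the streaming materialized view is assumed to already exist on the producer), sharing it looks like sharing any other object:

-- On the producer: add the streaming materialized view to a data share
ALTER DATASHARE streaming_share ADD TABLE public.mv_clickstream;

-- Grant the consumer namespace access to the share
GRANT USAGE ON DATASHARE streaming_share TO NAMESPACE '<consumer-namespace-id>';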

Amazon Redshift concurrency scaling to improve throughput

Amazon Redshift data sharing queries can use concurrency scaling to improve the overall throughput of the cluster. You can enable concurrency scaling on the consumer cluster for queues where you expect a heavy workload so that throughput is maintained when the cluster is under heavy load.

For more information about concurrency scaling, refer to Data sharing considerations in Amazon Redshift.

Amazon Redshift Serverless

Amazon Redshift Serverless is ready for data sharing out of the box. A serverless endpoint can also act as a producer or a consumer for a provisioned cluster. The following are the supported permutations with Amazon Redshift Serverless:

  • Serverless (producer) and provisioned (consumer)
  • Serverless (producer) and serverless (consumer)
  • Serverless (consumer) and provisioned (producer)

Conclusion

Amazon Redshift data sharing gives you the ability to fan out and scale complex workloads without worrying about workload isolation. However, like any system, not having the right optimization techniques in place could pose complex challenges in the long term as the systems grow in scale. Incorporating the best practices listed in this post presents a way to mitigate potential performance bottlenecks proactively in various areas.

Try data sharing today to unlock the full potential of Amazon Redshift, and please don’t hesitate to reach out to us in case of further questions or clarifications.


About the authors

BP Yau is a Sr Product Manager at AWS. He is passionate about helping customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Sai Teja Boddapati is a Database Engineer based out of Seattle. He works on solving complex database problems to contribute to building the most user friendly data warehouse available. In his spare time, he loves travelling, playing games and watching movies & documentaries.

Accelerate Snowflake to Amazon Redshift migration using AWS Schema Conversion Tool

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/accelerate-snowflake-to-amazon-redshift-migration-using-aws-schema-conversion-tool/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers. Today, tens of thousands of AWS customers—from Fortune 500 companies, startups, and everything in between—use Amazon Redshift to run mission-critical business intelligence (BI) dashboards, analyze real-time streaming data, and run predictive analytics jobs. With the constant increase in generated data, Amazon Redshift customers continue to achieve successes in delivering better service to their end-users, improving their products, and running an efficient and effective business.

Several new features of Amazon Redshift address a wide range of analytics requirements and improve the performance of the cloud data warehouse.

To take advantage of these capabilities and future innovations, you must migrate from your existing data warehouse to Amazon Redshift.

In this post, we show you how to migrate your data warehouse schema from Snowflake to Amazon Redshift using AWS Schema Conversion Tool (AWS SCT). AWS SCT is a service that makes heterogeneous database migrations predictable by automatically converting the source database schema and a majority of the database code objects, including views, stored procedures, and functions, to a format compatible with the target database. Any objects that can’t be automatically converted are clearly marked so that they can be manually converted to complete the migration. AWS SCT can also scan your application source code for embedded SQL statements and convert them as part of a database schema conversion project. During this process, AWS SCT performs cloud-native code optimization by converting legacy data warehouse functions to their equivalent AWS service, thereby helping you modernize the applications at the same time of database migration.

Solution overview

To implement this solution, you complete the following high-level steps:

  1. Configure your AWS SCT application.
  2. Analyze your source Snowflake schema.
  3. Convert your Snowflake schema to an Amazon Redshift schema.
  4. Deploy the schema to Amazon Redshift.

The following diagram illustrates the solution architecture.

Prerequisites

Before starting this walkthrough, you must have the required prerequisites in place.

Set up an AWS SCT project and extract the schema from the source

In this walkthrough, we use the Snowflake sample database TPCDS_SF10TCL as the source of the schema conversion.

To set up the database migration project, complete the following steps:

  1. Launch the AWS SCT application.
  2. On the File menu, choose New project wizard.
  3. Enter the project name and location.
  4. For Source engine, choose Snowflake.
  5. Choose Next.
  6. Provide the database information and credentials for Snowflake and choose Test Connection.
  7. When the connection is successful, choose Next.

For more information, see Using Snowflake as a source for AWS SCT.

  1. Expand SNOWFLAKE_SAMPLE_DATA and Schemas.
  2. Select TPCDS_SF10TCL and choose Next.

AWS SCT analyzes the schema and prepares an assessment report summary, as shown in the following screenshot.

This report summarizes the objects that AWS SCT converts to Amazon Redshift.

  1. Review the report and choose Next.
  2. Provide the database information and credentials for Amazon Redshift and deselect Use AWS Glue.
  3. Choose Test Connection.
  4. When the connection is successful, choose Finish.

For more information about establishing a connection to Amazon Redshift, see Getting the JDBC URL.

Review and apply the schema from Snowflake to Amazon Redshift

To convert the schema from Snowflake objects to Amazon Redshift, complete the following steps:

  1. Expand SNOWFLAKE_SAMPLE_DATA and Schemas.
  2. Choose (right-click) TPCDS_SF10TCL and choose Convert schema.
  3. Choose Collect and continue.

AWS SCT converts the schema and shows the converted schema and objects in the right pane. The converted schema is marked with a red check mark.

AWS SCT takes some actions automatically while converting the schema to Amazon Redshift; objects with such actions are marked with a red warning sign.

  1. Choose the object and then on the top menu, choose Main view and choose Assessment Report view.
  2. Choose the Action items tab.

You can see a list of all issues and actions taken by AWS SCT.

You can evaluate and inspect the individual object DDL by selecting it from the right pane, and you can also edit it as needed. In the following example, we modify the DISTKEY to use inv_item_sk. AWS SCT analyzes the tables and recommends the distribution and sort keys based on statistics. For cases where you’re not sure, you should set them to AUTO. For more information about automatic data distribution and optimization in Amazon Redshift, refer to Automate your Amazon Redshift performance tuning with automatic table optimization.
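
For illustration, a trimmed sketch of what an edited DDL might look like after changing the distribution key (the schema name and the shortened column list are assumptions based on the TPC-DS inventory table):

CREATE TABLE tpcds_sf10tcl.inventory (
    inv_date_sk INTEGER NOT NULL,
    inv_item_sk INTEGER NOT NULL,
    inv_warehouse_sk INTEGER NOT NULL,
    inv_quantity_on_hand INTEGER
)
DISTSTYLE KEY
DISTKEY (inv_item_sk)
SORTKEY AUTO;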

  1. To deploy the objects DDL to Amazon Redshift, select the converted schema in the right pane, right-click, and choose Apply to database.
  2. Optionally, if you want to export the copy of the DDLs generated by AWS SCT and apply them manually, you can select the Amazon Redshift converted schema, right-click, and choose Save as SQL.
  3. Log in to your Amazon Redshift cluster using the Query Editor V2.

For more information about Query Editor V2, refer to Introducing Amazon Redshift Query Editor V2, a Free Web-based Query Authoring Tool for Data Analysts.

  1. To verify that the converted schema objects are deployed in Amazon Redshift, select the desired table, right-click, and choose Show table definition to see the underlying DDL.
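
If you prefer SQL over the console, a quick sketch for the same verification (the schema and table names are assumptions):

-- Confirm the converted tables exist in the target schema
SELECT table_name
FROM svv_all_tables
WHERE schema_name = 'tpcds_sf10tcl'
ORDER BY table_name;

-- Show the full DDL of a single deployed table
SHOW TABLE tpcds_sf10tcl.inventory;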

Clean up

To avoid incurring future charges, complete the following steps:

  1. Delete the Amazon Redshift cluster created for this demonstration.
  2. If you were using an existing cluster, delete the new tables that you created as part of this exercise.
  3. Stop any Amazon Elastic Compute Cloud (Amazon EC2) instances that were created to run the AWS SCT application.

Summary

In this post, we showed how easy it is to convert a Snowflake schema to an Amazon Redshift schema and used AWS SCT for this automatic conversion.

We look forward to hearing from you about your experience. If you have questions or suggestions, please leave a comment.


About the Authors

BP Yau is a Sr Analytics Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Shawn Sachdev is a Sr Analytics Specialist Solutions Architect at AWS. He works with customers and provides guidance to help them innovate and build well-architected and high-performance data warehouses and implement analytics at scale on the AWS platform. Before AWS, he has worked in several Analytics and System Engineering roles. Outside of work, he loves watching sports, and is an avid foodie and a craft beer enthusiast.

Srikanth Sopirala is a Principal Analytics Specialist Solutions Architect at AWS. He is a seasoned leader with over 20 years of experience, who is passionate about helping customers build scalable data and analytics solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys reading, spending time with his family, and road cycling.

Accelerating Amazon Redshift federated query to Amazon Aurora MySQL with AWS CloudFormation

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/accelerating-amazon-redshift-federated-query-to-amazon-aurora-mysql-with-aws-cloudformation/

Amazon Redshift federated query allows you to combine data from one or more Amazon Relational Database Service (Amazon RDS) for MySQL and Amazon Aurora MySQL databases with data already in Amazon Redshift. You can also combine such data with data in an Amazon Simple Storage Service (Amazon S3) data lake.

This post shows you how to set up Aurora MySQL and Amazon Redshift with a TPC-DS dataset so you can take advantage of Amazon Redshift federated query using AWS CloudFormation. You can use the environment you set up in this post to experiment with various use cases in the post Announcing Amazon Redshift federated querying to Amazon Aurora MySQL and Amazon RDS for MySQL.

Benefits of using CloudFormation templates

The standard workflow for setting up Amazon Redshift federated query involves six steps. For more information, see Querying data with federated queries in Amazon Redshift. With a CloudFormation template, you can condense these manual procedures into a few steps listed in a text file. The declarative code in the file captures the intended state of the resources that you want to create and allows you to automate the setup of AWS resources to support Amazon Redshift federated query. You can further enhance this template to become the single source of truth for your infrastructure.

A CloudFormation template acts as an accelerator. It helps you automate the deployment of technology and infrastructure in a safe and repeatable manner across multiple Regions and accounts with the least amount of effort and time.

Architecture overview

The following diagram illustrates the solution architecture.

The CloudFormation template provisions the following components in the architecture:

  • VPC
  • Subnets
  • Route tables
  • Internet gateway
  • Amazon Linux bastion host
  • Secrets
  • Aurora MySQL cluster with TPC-DS dataset preloaded
  • Amazon Redshift cluster with TPC-DS dataset preloaded
  • Amazon Redshift IAM role with required permissions

Prerequisites

Before you create your resources in AWS CloudFormation, you must complete the required prerequisites.

Setting up resources with AWS CloudFormation

This post provides a CloudFormation template as a general guide. You can review and customize it to suit your needs. Some of the resources that this stack deploys incur costs when in use.

To create your resources, complete the following steps:

  1. Sign in to the console.
  2. Choose the us-east-1 Region in which to create the stack.
  3. Choose Launch Stack:
  4. Choose Next.

This automatically launches AWS CloudFormation in your AWS account with a template. It prompts you to sign in as needed. You can view the CloudFormation template from within the console.

  1. For Stack name, enter a stack name.
  2. For Session, leave as the default.
  3. For ec2KeyPair, choose the key pair you created earlier.
  4. Choose Next.

  1. On the next screen, choose Next.
  2. Review the details on the final screen and select I acknowledge that AWS CloudFormation might create IAM resources.
  3. Choose Create.

Stack creation can take up to 45 minutes.

  1. After the stack creation is complete, on the Outputs tab of the stack, record the value of the key for the following components, which you use in a later step:
  • AuroraClusterEndpoint
  • AuroraSecretArn
  • RedshiftClusterEndpoint
  • RedshiftClusterRoleArn

As of this writing, this feature is in public preview. You can create a snapshot of your Amazon Redshift cluster created by the stack and restore the snapshot as a new cluster in the sql_preview maintenance track with the same configuration.

You’re now ready to log in to both the Aurora MySQL and Amazon Redshift clusters and run some basic commands to test them.

Logging in to the clusters using the Amazon Linux bastion host

The following steps assume that you use a computer with an SSH client to connect to the bastion host. For more information about connecting using various clients, see Connect to your Linux instance.

  1. Move the private key of the EC2 key pair (that you saved previously) to a location on your SSH client, where you are connecting to the Amazon Linux bastion host.
  2. Change the permission of the private key using the following code, so that it’s not publicly viewable:
    chmod 400 <private key file name; for example, bastion-key.pem>

  1. On the Amazon EC2 console, choose Instances.
  2. Choose the Amazon Linux bastion host that the CloudFormation stack created.
  3. Choose Connect.
  4. Copy the value for SSHCommand.
  5. On the SSH client, change the directory to the location where you saved the EC2 private key, and enter the SSHCommand value.
  6. On the console, open the AWS Secrets Manager dashboard.
  7. Choose the secret secretAuroraMasterUser-*.
  8. Choose Retrieve secret value.
  9. Record the password under Secret key/value, which you use to log in to the Aurora MySQL cluster.
  10. Choose the secret SecretRedshiftMasterUser.
  11. Choose Retrieve secret value.
  12. Record the password under Secret key/value, which you use to log in to the Amazon Redshift cluster.
  13. Log in to both Aurora MySQL using the MySQL Command-Line Client and Amazon Redshift using query editor.

The CloudFormation template has already set up MySQL Command-Line Client binaries on the Amazon Linux bastion host.

  1. On the Amazon Redshift console, choose Editor.
  2. Choose Query editor.
  3. For Connection, choose Create new connection.
  4. For Cluster, choose the Amazon Redshift cluster.
  5. For Database name, enter your database.
  6. Enter the database user and password recorded earlier.
  7. Choose Connect to database.

  1. Enter the following SQL command:
    select "table" from svv_table_info where schema='public';

You should see 25 tables as the output.

  1. Launch a command prompt session of the bastion host and enter the following code (substitute <AuroraClusterEndpoint> with the value from the AWS CloudFormation output):
    mysql --host=<AuroraClusterEndpoint> --user=awsuser --password=<database user password recorded earlier>

  1. Enter the following SQL command:
    use tpc;
    show tables;
    

You should see the following eight tables as the output:

mysql> use tpc;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+------------------------+
| Tables_in_tpc          |
+------------------------+
| customer               |
| customer_address       |
| household_demographics |
| income_band            |
| item                   |
| promotion              |
| web_page               |
| web_sales              |
+------------------------+
8 rows in set (0.01 sec)

Completing federated query setup

The final step is to create an external schema to connect to the Aurora MySQL instance. The following example code creates an external schema statement that you need to run on your Amazon Redshift cluster to complete this step:

CREATE EXTERNAL SCHEMA IF NOT EXISTS mysqlfq 
FROM MYSQL 
DATABASE 'tpc' 
URI '<AuroraClusterEndpoint>' 
PORT 3306 
IAM_ROLE '<IAMRole>' 
SECRET_ARN '<SecretARN>'

Use the following parameters:

  • URI – The AuroraClusterEndpoint value from the CloudFormation stack outputs. The value is in the format <stackname>-cluster.<randomcharacter>.us-east-1.rds.amazonaws.com.
  • IAM_Role – The RedshiftClusterRoleArn value from the CloudFormation stack outputs. The value is in the format arn:aws:iam::<accountnumber>:role/<stackname>-RedshiftClusterRole-<randomcharacter>.
  • Secret_ARN – The AuroraSecretArn value from the CloudFormation stack outputs. The value is in the format arn:aws:secretsmanager:us-east-1:<accountnumber>:secret:secretAuroraMasterUser-<randomcharacter>.

Federated query test

Now that you have set up federated query, you can start testing the feature using the TPC-DS dataset that was preloaded into both Aurora MySQL and Amazon Redshift.

For example, the following query aggregates the total net sales by product category and class from the web_sales fact table and date and item dimension tables. Tables web_sales and date are stored in Amazon Redshift, and the item table is stored in Aurora MySQL:

select
    sum(ws_net_paid) as total_sum,
    i_category,
    i_class,
    0 as g_category,
    0 as g_class
from
    web_sales, date_dim d1, mysqlfq.item
where
    d1.d_month_seq between 1205 and 1205 + 11
    and d1.d_date_sk = ws_sold_date_sk
    and i_item_sk = ws_item_sk
group by
    i_category, i_class;

You can continue to experiment with the dataset and explore the three main use cases in the post Announcing Amazon Redshift federated querying to Amazon Aurora MySQL and Amazon RDS for MySQL.

Cleaning up

When you’re finished, delete the CloudFormation stack, because some of the AWS resources in this walkthrough incur a cost if you continue to use them. Complete the following steps:

  1. On the AWS CloudFormation console, choose Stacks.
  2. Choose the stack you launched in this walkthrough. The stack must be currently running.
  3. In the stack details pane, choose Delete.
  4. Choose Delete stack.

Summary

This post showed you how to automate the creation of an Aurora MySQL and Amazon Redshift cluster preloaded with the TPC-DS dataset, the prerequisites for the new Amazon Redshift federated query feature using AWS CloudFormation, and a single manual step to complete the setup. It also provided an example federated query using the TPC-DS dataset, which you can use to accelerate your learning and adoption of the new feature. You can continue to modify the CloudFormation templates from this post to support your business needs.

If you have any questions or suggestions, please leave a comment.


About the Authors

BP Yau is an Analytics Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

 

Srikanth Sopirala is a Sr. Specialist Solutions Architect, Analytics at AWS. He is passionate about helping customers build scalable data and analytics solutions in the cloud.

 

 

 

Zhouyi Yang is a Software Development Engineer for Amazon Redshift Query Processing team. He’s passionate about gaining new knowledge about large databases and has worked on SQL language features such as federated query and IAM role privilege control. In his spare time, he enjoys swimming, tennis, and reading.

 

 

Entong Shen is a Senior Software Development Engineer for Amazon Redshift. He has been working on MPP databases for over 8 years and has focused on query optimization, statistics, and SQL language features such as stored procedures and federated query. In his spare time, he enjoys listening to music of all genres and working in his succulent garden.

 

Announcing Amazon Redshift federated querying to Amazon Aurora MySQL and Amazon RDS for MySQL

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/announcing-amazon-redshift-federated-querying-to-amazon-aurora-mysql-and-amazon-rds-for-mysql/

Since we launched Amazon Redshift as a cloud data warehouse service more than seven years ago, tens of thousands of customers have built analytics workloads using it. We’re always listening to your feedback and, in April 2020, we announced general availability for federated querying to Amazon Aurora PostgreSQL and Amazon Relational Database Service (Amazon RDS) for PostgreSQL to enable you to query data across your operational databases, your data warehouse, and your data lake to gain faster and deeper insights not possible otherwise.

Today, we’re launching a new feature of Amazon Redshift federated query to Amazon Aurora MySQL and Amazon RDS for MySQL to help you expand your operational databases in the MySQL family. With this lake house architecture expansion to support more operational data stores, you can query and combine data more easily in real time and store data in open file formats in your Amazon Simple Storage Service (Amazon S3) data lake. Your data can then be more available to other analytics and machine learning (ML) tools, rather than siloed in disparate data stores.

In this post, we share information about how to get started with this new federated query feature to MySQL.

Prerequisites

To try this new feature, create a new Amazon Redshift cluster in the sql_preview maintenance track and an Aurora MySQL instance, and load sample TPC data into both data stores. To make sure the Aurora MySQL DB instance can accept connections from the Amazon Redshift cluster, place both your Amazon Redshift cluster and Aurora MySQL instance in the same Amazon Virtual Private Cloud (Amazon VPC) and subnet group. This way, you can add the security group for the Amazon Redshift cluster to the inbound rules of the security group for the Aurora MySQL DB instance.

If your Amazon Redshift cluster and Aurora MySQL instances are in different VPCs, you can set up VPC peering or other networking to allow Amazon Redshift to make connections to your Aurora MySQL instances. For more information about VPC networking, see Working with a DB instance in a VPC.

Configuring AWS Secrets Manager for remote database credentials

Amazon Redshift needs database credentials to issue a federated query to a MySQL database. AWS Secrets Manager provides a centralized service to manage secrets and can be used to store your MySQL database credentials. Because Amazon Redshift retrieves and uses these credentials, they are transient, not stored in any generated code, and discarded after the query runs.

Storing credentials in Secrets Manager takes only a few minutes. To store a new secret, complete the following steps:

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. For Select secret type, select Credentials for RDS database.
  4. For User name, enter a name.
  5. For Password, enter a password.
  6. For Select the encryption key, choose DefaultEncryptionKey.
  7. For Select which RDS database this secret will access, choose your database.

  1. Optionally, copy programmatic code for accessing your secret using your preferred programming languages (which is not needed for this post).
  2. Choose Next.

After you create the secret, you can retrieve the secret ARN by choosing the secret on the Secrets Manager console. The secret ARN is needed in the subsequent step.

Setting up IAM role

You can now pull everything together by embedding the secret ARN into an AWS Identity and Access Management (IAM) policy, naming the policy, and attaching it to an IAM role. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AccessSecret",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetResourcePolicy",
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret",
                "secretsmanager:ListSecretVersionIds"
            ],
            "Resource": "<SecretARN>"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetRandomPassword",
                "secretsmanager:ListSecrets"
            ],
            "Resource": "*"
        }
    ]
}

Finally, attach the same IAM role to your Amazon Redshift cluster.

  1. On the Amazon Redshift console, choose Clusters.
  2. Choose your cluster.
  3. On the Actions drop-down menu, choose Manage IAM roles.

  1. Choose and add the IAM role you just created.

Setting up external schema

The final step is to create an external schema to connect to your Aurora MySQL instance. The following example code creates the external schema statement that you need to run on your Amazon Redshift cluster to complete this step:

CREATE EXTERNAL SCHEMA IF NOT EXISTS mysqlfq 
FROM MYSQL 
DATABASE 'tpc' 
URI '<AuroraClusterEndpoint>' 
PORT 3306 
IAM_ROLE '<IAMRole>' 
SECRET_ARN '<SecretARN>'

Use the following parameters:

  • URI – Aurora MySQL cluster endpoint
  • IAM_Role – IAM role created from the previous step
  • Secret_ARN – Secret ARN

After you set up the external schema, you’re ready to run some queries to test different use cases.

Querying live operational data

You can now query real-time operational data in your Aurora MySQL instance from Amazon Redshift. Note that isolation level is read committed for MySQL. See the following code:

dev=# select top 10 ws_order_number from mysqlfq.web_sales;
 ws_order_number 
-----------------
        93628990
       157020207
         4338647
        41395871
        58468186
       171095867
        12514566
        74946143
         3418243
        67054239
(10 rows)

Querying mysqlfq.web_sales in Amazon Redshift routes the request to MySQL tpc database and web_sales table. If you examine the query plan, you can see the query runs at the MySQL instance as shown by the step Remote MySQL Seq Scan:

dev=# explain select top 10 ws_order_number from mysqlfq.web_sales;
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 XN Limit  (cost=0.00..0.20 rows=10 width=8)
   ->  XN MySQL Query Scan web_sales  (cost=0.00..6869.28 rows=343464 width=8)
         ->  Remote MySQL Seq Scan mysqlfq.web_sales  (cost=0.00..3434.64 rows=343464 width=8)
(3 rows)

Simplifying ELT and ETL

You can also extract operational data directly from your Aurora MySQL instance and load it into Amazon Redshift. See the following code:

dev=# create table staging_customer as select c_customer_id from mysqlfq.customer where c_customer_id not in (select c_customer_id from customer);
SELECT
dev=# select count(*) from staging_customer;
 count  
--------
 350000
(1 row)

The preceding code uses CTAS to create and load incremental data from your operational MySQL instance into a staging table in Amazon Redshift. You can then perform transformation and merge operations from the staging table to the target table. For more information, see Updating and inserting new data.
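
A minimal sketch of that merge step is shown below. It follows the standard staging-table upsert pattern and assumes the staging table was created with the full column list of the target table (the earlier CTAS pulled only c_customer_id, so adapt the column list to your case):

begin transaction;

-- Remove any target rows that are being replaced by the incoming staging data
delete from customer
using staging_customer
where customer.c_customer_id = staging_customer.c_customer_id;

-- Insert the new and changed rows from the staging table
insert into customer
select * from staging_customer;

end transaction;

-- Drop the staging table when the merge is complete
drop table staging_customer;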

Combining operational data with data from your data warehouse and data lake

You can combine live operational data from your Aurora MySQL instance with data from your Amazon Redshift data warehouse and S3 data lake by creating a late binding view.

To access your S3 data lake historical data via Amazon Redshift Spectrum, create an external table:

create external schema mysqlspectrum
from data catalog
database 'spectrumdb'
iam_role '<IAMRole>'
create external database if not exists;
 
create external table mysqlspectrum.customer 
stored as parquet 
location 's3://<yourS3bucket>/customer/'
as select * from customer where c_customer_sk <= 100000;

You can then run queries on the view to gain insight on data across the three sources:

drop view vwCustomer;
create view vwCustomer as
select c_customer_sk, 'redshift' as source from public.customer where c_customer_sk > 100000
union all
select c_customer_sk, 'mysql' as source from mysqlfq.customer
union all
select c_customer_sk, 's3' as source from mysqlspectrum.customer
with no schema binding;

select * from vwCustomer where c_customer_sk in (1, 149712,29033279);

You should see the following three records as output:

dev=# select * from vwCustomer where c_customer_sk in (1, 149712,29033279);
 c_customer_sk |  source  
---------------+----------
      29033279 | mysql
             1 | s3
        149712 | redshift
(3 rows)

If you examine the query plan, you can see that the predicates are pushed down to your MySQL instance to run:

dev=# explain select * from vwCustomer where c_customer_sk in (1,149712,29033279);
                                                                            QUERY PLAN                                                                            
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 XN Subquery Scan vwcustomer  (cost=0.00..48398.40 rows=6988 width=36)
   ->  XN Append  (cost=0.00..48328.52 rows=6988 width=4)
         ->  XN Subquery Scan "*SELECT* 1"  (cost=0.00..40000.03 rows=3 width=4)
               ->  XN Seq Scan on customer  (cost=0.00..40000.00 rows=3 width=4)
                     Filter: (((c_customer_sk = 1) OR (c_customer_sk = 149712) OR (c_customer_sk = 29033279)) AND (c_customer_sk > 100000))
         ->  XN Subquery Scan "*SELECT* 2"  (cost=0.00..6548.63 rows=5492 width=4)
               ->  XN MySQL Query Scan customer  (cost=0.00..6493.71 rows=5492 width=4)
                     ->  Remote MySQL Seq Scan mysqlfq.customer  (cost=0.00..6438.79 rows=5492 width=4)
                           Filter: ((c_customer_sk = 1) OR (c_customer_sk = 149712) OR (c_customer_sk = 29033279))
         ->  XN Subquery Scan "*SELECT* 3"  (cost=0.00..1779.86 rows=1493 width=4)
               ->  XN S3 Query Scan customer  (cost=0.00..1764.93 rows=1493 width=4)
                     ->  S3 Seq Scan mysqlspectrum.customer location:"s3://<yourS3bucket>/customer" format:PARQUET  (cost=0.00..1750.00 rows=1493 width=4)
                           Filter: ((c_customer_sk = 1) OR (c_customer_sk = 149712) OR (c_customer_sk = 29033279))
(13 rows)
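Beyond point lookups, you can also run aggregations that span all three sources through the same view. For example, the following hypothetical query counts customers per source:

select source, count(*) as customer_count
from vwCustomer
group by source
order by source;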

Available Now

Amazon Redshift federated querying to Aurora MySQL and Amazon RDS for MySQL is now available for public preview with Amazon Redshift release version 1.0.21591 or later. Refer to the AWS Region Table for Amazon Redshift availability, and check the version of your clusters to confirm they meet the minimum release.


About the Authors

BP Yau is an Analytics Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next-generation big data analytics platform using AWS technologies.

 

Zhouyi Yang is a Software Development Engineer for the Amazon Redshift Query Processing team. He’s passionate about gaining new knowledge about large databases and has worked on SQL language features such as federated query and IAM role privilege control. In his spare time, he enjoys swimming, tennis, and reading.

 

 

Entong Shen is a Senior Software Development Engineer for Amazon Redshift. He has been working on MPP databases for over 8 years and has focused on query optimization, statistics, and SQL language features such as stored procedures and federated query. In his spare time, he enjoys listening to music of all genres and working in his succulent garden.

Introducing Amazon Redshift RA3.xlplus nodes with managed storage

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/introducing-amazon-redshift-ra3-xlplus-nodes-with-managed-storage/

Since we launched Amazon Redshift as a cloud data warehouse service more than seven years ago, tens of thousands of customers have built analytics workloads using it. We’re always listening to your feedback and, in December 2019, we announced our third-generation RA3 node type to provide you the ability to scale and pay for compute and storage independently. In this post, I share more about RA3, including a new smaller size node type, and information about how to get started.

RA3 nodes in Amazon Redshift

The new RA3 nodes let you determine how much compute capacity you need to support your workload and then scale the amount of storage based on your needs. The managed storage of Amazon Redshift automatically uses high-performance, SSD-based local storage as its tier-1 cache. The managed storage takes advantage of optimizations such as data block temperature, data block age, and workload patterns to optimize Amazon Redshift performance and manage data placement across tiers of storage automatically. No action or changes to existing workflows are necessary on your part.

The first member of the RA3 family was the RA3.16xlarge, and we subsequently added the RA3.4xlarge to cater to the needs of customers with a large number of workloads.

We’re now adding a new smaller member of the RA3 family, the RA3.xlplus.

This allows Amazon Redshift to deliver up to three times better price performance than other cloud data warehouses. For existing Amazon Redshift customers using DS2 instances, you get up to two times better performance and double the storage at the same cost when you upgrade to RA3. RA3 also includes AQUA (Advanced Query Accelerator) for Amazon Redshift at no additional cost. AQUA is a new distributed and hardware-accelerated cache that enables Amazon Redshift to run up to ten times faster than other cloud data warehouses by automatically boosting certain types of queries. The preview is open to all customers now, and it will be generally available in January 2021.

RA3 nodes with managed storage are a great fit for analytics workloads that require best price per performance, with massive storage capacity and the ability to scale and pay for compute and storage independently. In the past, there was pressure to offload or archive old data to other storage because of fixed storage limits, which made maintaining the operational analytics dataset and querying the larger historical dataset when needed difficult. With Amazon Redshift managed storage, we’re meeting the needs of customers that want to store more data in their data warehouse.

The new RA3.xlplus node provides 4 vCPUs and 32 GiB of RAM, and addresses up to 32 TB of managed storage. A cluster with the RA3.xlplus node type can contain up to 32 of these nodes, for a total managed storage capacity of 1,024 TB (that's 1 petabyte!). With the new smaller RA3.xlplus instance type, it's even easier to get started with Amazon Redshift.

The differences between RA3 nodes are summarized in the following table.

 Node type     vCPU   Memory    Storage quota   I/O           Price (US East (N. Virginia))
 ra3.xlplus    4      32 GiB    32 TB RMS       0.65 GB/sec   $1.086 per hour
 ra3.4xlarge   12     96 GiB    64 TB RMS       2 GB/sec      $3.26 per hour
 ra3.16xlarge  48     384 GiB   64 TB RMS       8 GB/sec      $13.04 per hour

Creating a new Amazon Redshift cluster

You can create a new cluster on the Amazon Redshift console or the AWS Command Line Interface (AWS CLI). For this post, I walk you through using the Amazon Redshift console.

  1. On the Amazon Redshift console, choose Clusters.
  2. Choose Create cluster.
  3. For Choose the size of the cluster, choose I’ll choose.
  4. For Node type, select your preferred node type (for this post, we select xlplus).

The following screenshot shows the Cluster configuration page for the US East (N. Virginia) Region. The price may vary slightly in different Regions.

If you have a DS2 or DC2 instance-based cluster, you can create an RA3 cluster to evaluate the new instance with managed storage. You use a recent snapshot of your Amazon Redshift DS2 or DC2 cluster to create a new cluster based on RA3 instances. We recommend using 2 nodes of RA3.xlplus for every 3 nodes of DS2.xl or 3 nodes of RA3.xlplus for every 8 nodes of DC2.large. For more information about upgrading sizes, see Upgrading to RA3 node types. You can always adjust the compute capacity by adding or removing nodes with elastic resize.

If you’re migrating to Amazon Redshift from on-premises data warehouses, you can size your Amazon Redshift cluster using the sizing widget on the Amazon Redshift console. On the Cluster configuration page, for Choose the size of the cluster, choose Help me choose.

Answer the subsequent questions on total storage size and your data access pattern in order to size your cluster's compute and storage resources.

The sizing widget recommends the cluster configuration in the Calculated configuration summary section.

Conclusion

RA3 instances are now available in 16 AWS Regions. For the latest RA3 node type availability, see RA3 node type availability in AWS Regions.

The price varies from Region to Region, starting at $1.086 per hour per node in US East (N. Virginia). For more information, see Amazon Redshift pricing.


About the Author

BP Yau is a Data Warehouse Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Federated API access to Amazon Redshift using an Amazon Redshift connector for Python

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/federated-api-access-to-amazon-redshift-using-an-amazon-redshift-connector-for-python/

Amazon Redshift is the leading cloud data warehouse, delivering 10 times faster performance at one-tenth the cost of traditional data warehouses by using massively parallel query execution, columnar storage on high-performance disks, and results caching. You can confidently run mission-critical workloads, even in highly regulated industries, because Amazon Redshift comes with out-of-the-box security and compliance.

You can use your corporate identity providers (IdPs), for example Azure AD, Active Directory Federation Services, Okta, or Ping Federate, with Amazon Redshift to provide single sign-on (SSO) to your users so they can use their IdP accounts to log in and access Amazon Redshift. With federation, you can centralize management and governance of authentication and permissions. For more information about the federation workflow using AWS Identity and Access Management (IAM) and an identity provider, see Federate Database User Authentication Easily with IAM and Amazon Redshift. For this post, we use Okta as our IdP.

If you have users who run Python on a computer, such as an Amazon Elastic Compute Cloud (Amazon EC2) instance, you generally use an IAM role attached to the EC2 instance for federated access. However, every user who logs in to the EC2 instance assumes the same IAM role to federate. The Amazon Redshift connector for Python enables IdP authentication for each individual user who logs in to the computer. This post shows you how to use the Amazon Redshift connector for Python and Okta to enable federated SSO into Amazon Redshift and query your data warehouse using a Python script.

Solution overview

To implement this solution, you complete the following high-level steps:

  1. Configure your Okta IdP, including users, groups, and SAML application.
  2. Configure your AWS service provider, including IdP, IAM policy, and IAM role.
  3. Set up your Amazon Redshift connector.
  4. Check your Amazon Redshift users.
  5. Configure the Python script and Amazon Redshift connector to use your enterprise credentials and sign in to Amazon Redshift to run queries.

The process flow for federated authentication includes the following steps:

  1. The user logs in from a Python script using an Amazon Redshift connector.
  2. The IdP authenticates using the corporate user name and password, and returns a SAML assertion.
  3. The client uses AWS SDK to call AWS Security Token Service (AWS STS) to assume a role with SAML.
  4. AWS STS returns temporary AWS credentials.
  5. The client uses the temporary AWS credentials to connect to Amazon Redshift.

The following diagram illustrates this process flow.

Prerequisites

Before starting this walkthrough, you must have the following:

  • An Amazon Redshift cluster. The US East (N. Virginia) Region is preferred because you need to load data from Amazon Simple Storage Service (Amazon S3) in us-east-1.
  • A database user with superuser privileges.
  • Python 3.5 or above.
  • Pip.

Configuring your IdP (Okta)

For instructions on setting up your IdP, see steps 1–4 in Federate Amazon Redshift access with Okta as an identity provider.

For step 2, create a user called [email protected] and add it to the sales group. This is the example user that you use in this post. The following screenshot shows your example users.

When the setup is complete, you should see settings similar to the following screenshots. The following screenshot shows your general SAML settings.

The following screenshot shows your attribute statements.

The following screenshot shows the settings for your application embed link, login page, and access error page.

For instructions on setting up advanced IdP configurations, see steps 8–9 in Federate Amazon Redshift access with Okta as an identity provider.

Configuring your service provider (AWS)

To set up the service provider, complete steps 5–7 from Federate Amazon Redshift access with Okta as an identity provider.

For step 7, the Amazon Redshift connector needs the following additional permission:

{
            "Sid": "DescribeClusters",
            "Effect": "Allow",
            "Action": "redshift:DescribeClusters",
            "Resource": "*"
}

The following code shows the example policy used:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GetClusterCredsStatement",
            "Effect": "Allow",
            "Action": "redshift:GetClusterCredentials",
            "Resource": [
                "arn:aws:redshift:*:*:cluster:<cluster>",
                "arn:aws:redshift:*:*:dbname:<cluster>/<databasename>",
                "arn:aws:redshift:*:*:dbuser:<cluster>/${redshift:DbUser}"
            ]
        },
        {
            "Sid": "DescribeClusters",
            "Effect": "Allow",
            "Action": "redshift:DescribeClusters",
            "Resource": "*"
        },
        {
            "Sid": "CreateClusterUserStatement",
            "Effect": "Allow",
            "Action": "redshift:CreateClusterUser",
            "Resource": [
                "arn:aws:redshift:*:*:dbname:<cluster>/<databasenmae>",
                "arn:aws:redshift:*:*:dbuser:<cluster>/${redshift:DbUser}"
            ]
        },
        {
            "Sid": "RedshiftJoinGroupStatement",
            "Effect": "Allow",
            "Action": "redshift:JoinGroup",
            "Resource": [
                "arn:aws:redshift:*:*:dbgroup:<cluster>/sales"
            ]
        }
    ]
}

When the setup is complete, you should see an IAM role setup similar to the following screenshot.

Setting up your Amazon Redshift connector

The easiest way to get started with redshift_connector is via pip. See the following code:

pip install redshift_connector

After running the command, you see a message indicating redshift_connector was installed successfully:

Successfully installed redshift_connector

Checking Amazon Redshift users

To check your users, connect to your Amazon Redshift cluster using your preferred SQL client as a superuser and enter the following code:

select * from pg_user where usename = '[email protected]';

The query should return zero records, which shows that the corporate user [email protected] doesn’t exist yet in Amazon Redshift. See the following code:

dev=# select * from pg_user where usename = '[email protected]'; 
 usename | usesysid | usecreatedb | usesuper | usecatupd | passwd | valuntil | useconfig 
---------+----------+-------------+----------+-----------+--------+----------+-----------
(0 rows)

Testing the Python script

The next step is to run an example Python script that uses federation to connect to your Amazon Redshift cluster and run some queries. See the following code:

import redshift_connector
import numpy
 
# Connects to Redshift cluster using Okta MFA Browser Plugin
conn = redshift_connector.connect(
    iam=True,
    ssl=True,
    host='<RedshiftClusterEndpoint>',
    port=<RedshiftClusterPortNumber>,
    database='<RedshiftDatabase>',
    db_user='[email protected]',
    cluster_identifier='<RedshiftClusterIdentifier>',
    region='<Region>',
    login_url='<IdPLoginUrl>',
    credentials_provider='BrowserSamlCredentialsProvider',
    user='',
    password=''
)
 
cursor: redshift_connector.Cursor = conn.cursor()
cursor.execute("select current_user")
result: numpy.ndarray = cursor.fetch_numpy_array()
print(result)
 
cursor.execute("create Temp table book(bookname varchar,author‎ varchar)")
cursor.executemany("insert into book (bookname, author‎) values (%s, %s)", [('One Hundred Years of Solitude', 'Gabriel García Márquez'),('A Brief History of Time', 'Stephen Hawking')])
cursor.execute("select * from book")
 
result: numpy.ndarray = cursor.fetch_numpy_array()
print(result)
cursor.execute("drop table book")
conn.commit()
 
conn.close()

Use the following parameters:

  • RedshiftClusterEndpoint – The Amazon Redshift cluster endpoint, without the port or database name
  • RedshiftClusterPortNumber – The Amazon Redshift port number
  • RedshiftDatabase – The Amazon Redshift database name
  • RedshiftClusterIdentifier – The Amazon Redshift cluster identifier
  • Region – The Region where the cluster is
  • IdPLoginUrl – The IdP (Okta) sign-in URL, which you can get from the embed link in the Okta SAML application properties

After you run the Python script, the Okta sign-in page opens. Sign in with [email protected].

Switch back to your Python environment; it should show the current user, followed by the two records from the book table:

$ python3 redshift_connector_okta.py
[['[email protected]']]
[['One Hundred Years of Solitude' 'Gabriel García Márquez']
 ['A Brief History of Time' 'Stephen Hawking']]

[email protected] is returned, showing that the Python script ran under the context of this federated user.

You may now close your browser window.

To check your users, connect to your Amazon Redshift cluster using your preferred SQL client as a superuser and enter the following code:

select * from pg_user where usename = '[email protected]';

The query should return one record, which shows that the database user is automatically created based on the IdP corporate user:

dev=# select * from pg_user where usename = '[email protected]';
      usename      | usesysid | usecreatedb | usesuper | usecatupd |  passwd  | valuntil | useconfig 
-------------------+----------+-------------+----------+-----------+----------+----------+-----------
 [email protected] |      116 | f           | f        | f         | ******** |          | 
(1 row)
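Because the IAM policy allows redshift:JoinGroup on the sales database group (and the Okta SAML attribute statements map the user's group membership), the federated user also joins the sales group at sign-in, so you can manage permissions at the group level rather than per user. The following is a minimal sketch, assuming the sales group should have read access to the tables in the public schema:

grant usage on schema public to group sales;
grant select on all tables in schema public to group sales;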

Summary

Amazon Redshift supports stringent compliance and security requirements with no extra cost, which makes it ideal for highly regulated industries. With federation, you can centralize management and governance of authentication and permissions by managing users and groups within the enterprise IdP and use them to authenticate to Amazon Redshift. SSO enables users to have a seamless user experience while accessing various applications in the organization.

This post showed you how to set up an Amazon Redshift connector and use Okta as your IdP to enable federated SSO to an Amazon Redshift cluster from a Python script. If you have users such as data scientists that heavily rely on Python scripts, you can follow these steps to set up federated SSO for your organization and manage access privileges to your Amazon Redshift cluster.

If you have any questions or suggestions, please leave a comment or contribute to the project.


About the Authors


BP Yau is a Data Warehouse Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate the Oracle Data Warehouse to Amazon Redshift and built the next generation big data analytics platform using AWS technologies.

 


Brooke White is a Software Development Engineer at AWS. She enables customers to get the most out of their data through her work on Amazon Redshift drivers. Prior to AWS, she built ETL pipelines and analytics APIs at a San Francisco Bay Area startup.

 

 


Ilesh Garish is a Software Development Engineer at AWS. His role is to develop connectors for Amazon Redshift. Prior to AWS, he built database drivers for the Oracle RDBMS, TigerLogic XDMS, and OpenAccess SDK. He worked in the database internal technologies at San Francisco Bay Area startups.