Automate Amazon Redshift cluster creation using AWS CloudFormation

Post Syndicated from Sudhir Gupta original https://aws.amazon.com/blogs/big-data/automate-amazon-redshift-cluster-creation-using-aws-cloudformation/

In this post, I explain how to automate the deployment of an Amazon Redshift cluster in an AWS account. AWS best practices for security and high availability drive the cluster’s configuration, and you can create it quickly by using AWS CloudFormation. I walk you through a set of sample CloudFormation templates, which you can customize as per your needs.

Amazon Redshift is a fast, scalable, fully managed, ACID and ANSI SQL-compliant cloud data warehouse service. You can set up and deploy a new data warehouse in minutes, and run queries across petabytes of structured data stored in Amazon Redshift. Amazon Redshift Spectrum extends your data warehousing capability to data lakes built on Amazon S3, allowing you to query exabytes of structured and semi-structured data in its native format without loading it first. Amazon Redshift delivers faster performance than other data warehouse databases by using machine learning, massively parallel query execution, and columnar storage on high-performance disk. You can configure Amazon Redshift to scale up and down in minutes, as well as expand compute power automatically to ensure unlimited concurrency.

As you begin your journey with Amazon Redshift and set up AWS resources based on the recommended best practices of the AWS Well-Architected Framework, you can use the CloudFormation templates provided here. With the modular approach, you can choose to build AWS infrastructure from scratch, or you can deploy Amazon Redshift into an existing virtual private cloud (VPC).

Benefits of using CloudFormation templates

With an AWS CloudFormation template, you can condense hundreds of manual procedures into a few steps listed in a text file. The declarative code in the file captures the intended state of the resources to create, and you can choose to automate the creation of hundreds of AWS resources. This template becomes the single source of truth for your infrastructure.

A CloudFormation template acts as an accelerator. It helps you automate the deployment of technology and infrastructure in a safe and repeatable manner across multiple Regions and multiple accounts with the least amount of effort and time.

Architecture overview

The following architecture diagram and summary describe the solution that this post uses.

Figure 1: Architecture diagram

The sample CloudFormation templates provision the network infrastructure and all the components shown in the architecture diagram.

I broke the CloudFormation templates into the following three stacks:

  1. A CloudFormation template to set up a VPC, subnets, route tables, internet gateway, NAT gateway, Amazon S3 gateway endpoint, and other networking components.
  2. A CloudFormation template to set up an Amazon Linux bastion host in an Auto Scaling group to connect to the Amazon Redshift cluster.
  3. A CloudFormation template to set up an Amazon Redshift cluster, CloudWatch alarms, AWS Glue Data Catalog, and an Amazon Redshift IAM role for Amazon Redshift Spectrum and ETL jobs.

I integrated the stacks using exported output values. Using three different CloudFormation stacks instead of one nested stack gives you additional flexibility. For example, you can choose to deploy the VPC and bastion host CloudFormation stacks one time and Amazon Redshift cluster CloudFormation stack multiple times in an AWS Region.
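
To make the export and import mechanism concrete, here is a minimal sketch in Python that builds the two template fragments involved. The export name pattern (stack name plus a "-VPC" suffix) and the resource name VPC are assumptions made for illustration; ParentVPCStack stands for the parameter in which a dependent stack receives the VPC stack name. Check the actual templates for the names they use.

```python
import json

# Outputs fragment a VPC stack could use to export its VPC ID.
# The "-VPC" suffix and the "VPC" logical ID are illustrative assumptions.
VPC_OUTPUTS = {
    "VPC": {
        "Description": "VPC ID",
        "Value": {"Ref": "VPC"},
        # Export names must be unique within a Region, so the stack name
        # is used as a prefix.
        "Export": {"Name": {"Fn::Sub": "${AWS::StackName}-VPC"}},
    }
}

# Expression a dependent stack (bastion or Amazon Redshift) could use to
# read that export. ParentVPCStack is a parameter holding the VPC stack name.
IMPORT_VPC_ID = {"Fn::ImportValue": {"Fn::Sub": "${ParentVPCStack}-VPC"}}


def export_name(stack_name: str, suffix: str) -> str:
    """Resolve the export name the two fragments above agree on."""
    return f"{stack_name}-{suffix}"


if __name__ == "__main__":
    print(json.dumps(VPC_OUTPUTS, indent=2))
    print(json.dumps(IMPORT_VPC_ID, indent=2))
    print(export_name("rsVPC", "VPC"))
```

Because the dependent stacks only need the parent stack's name, you can point several Amazon Redshift stacks at the same VPC stack.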

Best practices

The architecture built by these CloudFormation templates supports AWS best practices for high availability and security.

The VPC CloudFormation template takes care of the following:

  1. Configures three Availability Zones for high availability and disaster recovery. It geographically distributes the zones within a Region for best insulation and stability in the event of a natural disaster.
  2. Provisions one public subnet and one private subnet for each zone. I recommend using public subnets for external-facing resources and private subnets for internal resources to reduce the risk of data exfiltration.
  3. Creates and associates network ACLs with default rules to the private and public subnets. AWS recommends using network ACLs as firewalls to control inbound and outbound traffic at the subnet level. These network ACLs provide individual controls that you can customize as a second layer of defense.
  4. Creates and associates independent routing tables for each of the private subnets, which you can configure to control the flow of traffic within and outside the VPC. The public subnets share a single routing table because they all use the same internet gateway as the sole route to communicate with the internet.
  5. Creates a NAT gateway in each of the three public subnets for high availability. NAT gateways offer significant advantages over NAT instances in terms of deployment, availability, and maintenance. NAT gateways allow instances in a private subnet to connect to the internet or other AWS services even as they prevent the internet from initiating a connection with those instances.
  6. Creates a VPC endpoint for Amazon S3. Amazon Redshift and other AWS resources running in a private subnet of the VPC can then connect privately to S3 buckets. For example, loading data from S3 and unloading data to S3 both happen over a private, secure, and reliable connection.

The Amazon Linux bastion host CloudFormation template takes care of the following:

  1. Creates an Auto Scaling group spread across the three public subnets set up by the VPC CloudFormation template. The Auto Scaling group keeps the Amazon Linux bastion host available in one of the three Availability Zones.
  2. Sets up an Elastic IP address and associates it with the Amazon Linux bastion host. An Elastic IP address makes it easier to remember and allow IP addresses from on-premises firewalls. If your system terminates an instance and the Auto Scaling group launches a new instance in its place, the existing Elastic IP address re-associates with the new instance automatically. This lets you use the same trusted Elastic IP address at all times.
  3. Sets up an Amazon EC2 security group and associates it with the Amazon Linux bastion host. This allows you to lock down access to the bastion host, allowing inbound traffic only from known CIDR scopes and ports.
  4. Creates an Amazon CloudWatch Logs log group to hold the Amazon Linux bastion host’s shell history logs and sets up a CloudWatch metric to track SSH command counts. This helps with security audits by allowing you to check who accesses the bastion host and when.
  5. Creates a CloudWatch alarm to monitor the CPU on the bastion host and send an Amazon SNS notification when anything triggers the alarm.

The Amazon Redshift cluster template takes care of the following:

  1. Creates an Amazon Redshift cluster subnet group that spans multiple Availability Zones, so that you can create clusters in different zones to minimize the impact of a single zone failing.
  2. Configures database auditing and stores audit logs into an S3 bucket. It also restricts access to the Amazon Redshift logging service and configures lifecycle rules to archive logs older than 14 days to Amazon S3 Glacier.
  3. Creates an IAM role with a policy to grant the minimum permissions required to use Amazon Redshift Spectrum to access S3, CloudWatch Logs, AWS Glue, and Amazon Athena. It then associates this IAM role with Amazon Redshift.
  4. Creates an EC2 security group and associates it with the Amazon Redshift cluster. This allows you to lock down access to the Amazon Redshift cluster to known CIDR scopes and ports.
  5. Creates an Amazon Redshift cluster parameter group with the following configuration and associates it with the Amazon Redshift cluster. These parameters are only a general guide. Review and customize them to suit your needs.

Parameter: enable_user_activity_logging
Value: TRUE
Description: Enables the user activity log. For more information, see Database Audit Logging.

Parameter: require_ssl
Value: TRUE
Description: Requires SSL for connections to the Amazon Redshift cluster.

Parameter: wlm_json_configuration
Value:
    [ {
      "query_group" : [ ],
      "query_group_wild_card" : 0,
      "user_group" : [ ],
      "user_group_wild_card" : 0,
      "concurrency_scaling" : "auto",
      "rules" : [ {
        "rule_name" : "DiskSpilling",
        "predicate" : [ {
          "metric_name" : "query_temp_blocks_to_disk",
          "operator" : ">",
          "value" : 100000
        } ],
        "action" : "log",
        "value" : ""
      }, {
        "rule_name" : "RowJoining",
        "predicate" : [ {
          "metric_name" : "join_row_count",
          "operator" : ">",
          "value" : 1000000000
        } ],
        "action" : "log",
        "value" : ""
      } ],
      "auto_wlm" : true
    }, {
      "short_query_queue" : true
    } ]
Description: Creates a custom workload management (WLM) queue with the following configuration:
    • Auto WLM: Amazon Redshift manages query concurrency and memory allocation automatically, as per workload.
    • Short Query Acceleration (SQA): Amazon Redshift executes short-running queries in a dedicated space so that SQA queries aren’t forced to wait in queues behind longer queries.
    • Concurrency Scaling: Enabled for the queries routed to this WLM queue.
    • Two WLM query monitoring rules (QMR):
        • Log queries when the temporary disk space used to write intermediate results exceeds 100 GB (100,000 blocks of 1 MB).
        • Log queries when the number of rows processed in a join step exceeds one billion.
  You can also create different rules based on your needs and choose different actions (abort, hop, or log).

Parameter: max_concurrency_scaling_clusters
Value: 1 (or what you chose)
Description: Sets the maximum number of concurrency scaling clusters allowed when concurrency scaling is enabled.

Parameter: auto_analyze
Value: TRUE
Description: If true, Amazon Redshift continuously monitors your database and automatically performs analyze operations in the background.

Parameter: statement_timeout
Value: 43200000
Description: Terminates any statement that takes more than the specified number of milliseconds (here, 12 hours). The statement_timeout value is the maximum amount of time a query can run before Amazon Redshift terminates it.

  6. Configures the Amazon Redshift cluster to listen on a non-default Amazon Redshift port, according to security best practices.
  7. Creates the Amazon Redshift cluster in the private subnets according to AWS security best practices. To access the Amazon Redshift cluster, use the Amazon Linux bastion host that the Linux bastion host CloudFormation template sets up.
  8. Creates a cluster with at least two nodes, unless you enter 1 for the NumberOfNodes input parameter. AWS recommends using at least two nodes per cluster for production. For more information, see the Availability and Durability section of the Amazon Redshift FAQ.
  9. Enables encryption at rest for the Amazon Redshift cluster by using the Amazon Redshift managed KMS key or a user-specified KMS key. If you want to use a user-specified KMS key and have not created one yet, create it first. For more information, see Creating KMS Keys.
  10. Sets automated snapshot retention to 35 days for production environments and 8 days for non-production environments. This allows you to restore a production database from a snapshot taken in the last 35 days, or a non-production database from a snapshot taken in the last 8 days.
  11. Takes a final snapshot of the Amazon Redshift database automatically when you delete the Amazon Redshift cluster using the Delete stack option. This prevents data loss from the accidental deletion of your CloudFormation stack.
  12. Creates an AWS Glue Data Catalog as a metadata store for your AWS data lake.
  13. Configures CloudWatch alarms for key CloudWatch metrics, such as PercentageDiskSpaceUsed and CPUUtilization, for the Amazon Redshift cluster, and sends an SNS notification when one of these conditions triggers the alarm.
  14. Provides the option to restore the Amazon Redshift cluster from a previously taken snapshot.
  15. Attaches common tags to the Amazon Redshift clusters and other resources. AWS recommends assigning tags to your cloud infrastructure resources to manage resource access control, cost tracking, automation, and organization.
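
The parameter group values above are easier to review when you unpack them programmatically. The following is a minimal sketch, not part of the templates, that parses the wlm_json_configuration value, lists its query monitoring rules, and converts the two numeric thresholds into friendlier units. The function names are hypothetical; the ParameterName and ParameterValue keys follow the shape of a CloudFormation AWS::Redshift::ClusterParameterGroup parameter entry.

```python
import json

# The wlm_json_configuration value from the parameter list, as a string.
WLM_JSON = """
[ {
  "query_group": [], "query_group_wild_card": 0,
  "user_group": [], "user_group_wild_card": 0,
  "concurrency_scaling": "auto",
  "rules": [
    {"rule_name": "DiskSpilling",
     "predicate": [{"metric_name": "query_temp_blocks_to_disk",
                    "operator": ">", "value": 100000}],
     "action": "log", "value": ""},
    {"rule_name": "RowJoining",
     "predicate": [{"metric_name": "join_row_count",
                    "operator": ">", "value": 1000000000}],
     "action": "log", "value": ""}
  ],
  "auto_wlm": true
}, {
  "short_query_queue": true
} ]
"""


def summarize_wlm(raw: str) -> dict:
    """Collect the switches and monitoring rules from a WLM JSON string."""
    summary = {"auto_wlm": False, "short_query_queue": False,
               "concurrency_scaling": None, "rules": []}
    for queue in json.loads(raw):
        summary["auto_wlm"] = summary["auto_wlm"] or queue.get("auto_wlm", False)
        summary["short_query_queue"] = (
            summary["short_query_queue"] or queue.get("short_query_queue", False))
        if "concurrency_scaling" in queue:
            summary["concurrency_scaling"] = queue["concurrency_scaling"]
        for rule in queue.get("rules", []):
            for pred in rule["predicate"]:
                summary["rules"].append(
                    f'{rule["rule_name"]}: {rule["action"]} when '
                    f'{pred["metric_name"]} {pred["operator"]} {pred["value"]}')
    return summary


def as_parameter(raw: str) -> dict:
    """Shape the WLM JSON as one parameter group entry (compact JSON string)."""
    compact = json.dumps(json.loads(raw), separators=(",", ":"))
    return {"ParameterName": "wlm_json_configuration", "ParameterValue": compact}


# query_temp_blocks_to_disk counts 1 MB blocks, so 100000 blocks is 100 GB.
DISK_SPILL_GB = 100000 * 1 / 1000
# statement_timeout is in milliseconds, so 43200000 ms is 12 hours.
STATEMENT_TIMEOUT_HOURS = 43200000 / (1000 * 60 * 60)

if __name__ == "__main__":
    for line in summarize_wlm(WLM_JSON)["rules"]:
        print(line)
    print(DISK_SPILL_GB, "GB;", STATEMENT_TIMEOUT_HOURS, "hours")
```

Round-tripping the value through json.loads before deploying is a cheap way to catch a malformed WLM configuration early.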

Prerequisites

Before setting up the CloudFormation stacks, note the following prerequisites.

  1. You must have an AWS account and an IAM user with sufficient permissions to interact with the AWS Management Console and the services listed in the preceding Architecture overview section. Your IAM permissions must also include access to create IAM roles and policies created by the AWS CloudFormation template.
  2. The VPC CloudFormation stack requires three Availability Zones to set up the public and private subnets. Make sure to select an AWS Region that has at least three Availability Zones.
  3. Create an EC2 key pair in the EC2 console in the AWS Region where you are planning to set up the CloudFormation stacks. Make sure that you save the private key, as this is the only time you can do this. You use this EC2 key pair as an input parameter during setup for the Amazon Linux bastion host CloudFormation stack.

Set up the resources using AWS CloudFormation

I provide these CloudFormation templates as a general guide. Review and customize them to suit your needs. Some of the resources deployed by these stacks incur costs as long as they remain in use.

Set up the VPC, subnets, and other networking components

This CloudFormation template creates a VPC, subnets, route tables, internet gateway, NAT gateway, Amazon S3 gateway endpoint, and other networking components. Follow the steps below to create these resources in your AWS account.

  1. Log in to the AWS Management Console.
  2. In the top navigation ribbon, choose the AWS Region in which to create the stack. This CloudFormation stack requires three Availability Zones for setting up the public and private subnets, so select an AWS Region that has at least three Availability Zones.
  3. Choose the following Launch Stack button. This button automatically launches the AWS CloudFormation service in your AWS account with a template. It prompts you to sign in as needed. You can view the CloudFormation template from within the console as required.
  4. The CloudFormation stack requires a few parameters, as shown in the following screenshot.
    • Stack name: Enter a meaningful name for the stack, for example, rsVPC.
    • ClassB 2nd Octet: Specify the second octet of the IPv4 CIDR block for the VPC (10.XXX.0.0/16). You can specify any number from 0 to 255. For example, specify 33 to create a VPC with the IPv4 CIDR block 10.33.0.0/16. To learn more about VPC and subnet sizing for IPv4, see VPC and Subnet Sizing for IPv4.

      Figure 2: VPC Stack, in the CloudFormation Console

  5. After entering all the parameter values, choose Next.
  6. On the next screen, enter any required tags, an IAM role, or any advanced options, and then choose Next.
  7. Review the details on the final screen, and choose Create.

Stack creation takes a few minutes. Check the AWS CloudFormation Resources section to see the physical IDs of the various components this stack sets up.
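
To see how the ClassB 2nd Octet parameter turns into address ranges, here is a small sketch using Python's ipaddress module. The VPC CIDR follows the 10.XXX.0.0/16 pattern described above. The split into six /20 subnets (three public, three private) is only an assumption for illustration; the template defines the real subnet sizes.

```python
import ipaddress
from itertools import islice


def vpc_cidr(class_b: int) -> ipaddress.IPv4Network:
    """Build the VPC CIDR block 10.<class_b>.0.0/16 from the second octet."""
    if not 0 <= class_b <= 255:
        raise ValueError("ClassB 2nd Octet must be between 0 and 255")
    return ipaddress.ip_network(f"10.{class_b}.0.0/16")


def example_subnets(class_b: int, count: int = 6, new_prefix: int = 20) -> list:
    """Carve the first `count` subnets out of the VPC range.

    The /20 size and the count of six are illustrative assumptions.
    """
    return list(islice(vpc_cidr(class_b).subnets(new_prefix=new_prefix), count))


if __name__ == "__main__":
    print(vpc_cidr(33))
    for subnet in example_subnets(33):
        print(" ", subnet, "with", subnet.num_addresses, "addresses")
```

Choosing a second octet that does not overlap with your on-premises or peered networks avoids routing conflicts later.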

After this, you must set up the Amazon Linux bastion host, which you use to log in to the Amazon Redshift cluster.

Set up the Amazon Linux bastion host

This CloudFormation template creates an Amazon Linux bastion host in an Auto Scaling group. Follow the steps below to create the bastion host in the VPC.

  1. In the top navigation ribbon, choose the AWS Region in which to create the stack.
  2. Choose the following Launch Stack button. This button automatically launches the AWS CloudFormation service in your AWS account with a template to launch.
  3. The CloudFormation stack requires a few parameters, as shown in the following screenshots.
    • Stack name: Enter a meaningful name for the stack, for example, rsBastion.
    • Parent VPC Stack: Enter the CloudFormation stack name for the VPC stack that you set up in the previous step. Find this value in the CloudFormation console, for example, rsVPC.
    • Allowed Bastion External Access CIDR: Enter the allowed CIDR block in the x.x.x.x/x format for external SSH access to the bastion host.
    • Key Pair Name: Select the key pair name that you set up in the Prerequisites section.
    • Bastion Instance Type: Select the Amazon EC2 instance type for the bastion instance.
    • LogsRetentionInDays: Specify the number of days to retain CloudWatch log events for the bastion host.
    • SNS Notification Email: Enter the email notification list used to configure an SNS topic for sending CloudWatch alarm notifications.
    • Bastion Tenancy: Select the VPC tenancy in which to launch the bastion host.
    • Enable Banner: Select to display a banner when connecting through SSH to the bastion.
    • Bastion Banner: Use Default or provide an S3 location for the file containing the banner text that the host displays upon login.
    • Enable TCP Forwarding: Select whether to enable or disable TCP forwarding. Setting this value to true enables TCP forwarding (SSH tunneling). This can be useful, but it also presents a security risk, so I recommend that you keep the default (disabled) setting unless required.
    • Enable X11 Forwarding: Select whether to enable or disable X11 forwarding. Setting this value to true enables X Windows over SSH. X11 forwarding can be useful, but it is also a security risk, so I recommend that you keep the default (disabled) setting unless required.
    • Custom Bootstrap Script: Optional. Specify a custom bootstrap script S3 location for running during bastion host setup.
    • AMI override: Optional. Specify an AWS Region-specific image for the instance.

      Figure 3: Bastion Stack, in the CloudFormation Console

  4. After entering all the parameter values, choose Next.
  5. On the next screen, enter any required tags, an IAM role, or any advanced options, and then choose Next.
  6. Review the details on the final screen, select I acknowledge that AWS CloudFormation might create IAM resources, and then choose Create.

Stack creation takes a few minutes. Check the AWS CloudFormation Resources section to see the physical IDs of the various components set up by this stack.
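
Before you enter the Allowed Bastion External Access CIDR, it can help to sanity-check the value. The following sketch validates the x.x.x.x/x format with Python's ipaddress module and flags very broad ranges. The function name and the warning thresholds are my own assumptions, not something the template enforces.

```python
import ipaddress


def check_access_cidr(cidr: str) -> list:
    """Validate an IPv4 CIDR block and return a list of warnings.

    Raises ValueError if the value is not a well-formed x.x.x.x/x block.
    """
    if "/" not in cidr:
        raise ValueError("expected x.x.x.x/x, including the prefix length")
    # strict=True rejects blocks with host bits set, such as 203.0.113.5/24.
    network = ipaddress.IPv4Network(cidr, strict=True)
    warnings = []
    if network.prefixlen == 0:
        warnings.append("0.0.0.0/0 opens SSH to the whole internet")
    elif network.prefixlen < 16:
        # Illustrative threshold: anything wider than a /16 is suspicious.
        warnings.append(f"/{network.prefixlen} allows a very broad range")
    return warnings


if __name__ == "__main__":
    # 203.0.113.0/24 is a documentation range, used here as a stand-in.
    for value in ("203.0.113.0/24", "0.0.0.0/0"):
        print(value, check_access_cidr(value))
```

Restricting SSH access to a narrow, known range is what makes the bastion host's fixed Elastic IP address worth allowing through on-premises firewalls.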

You are now ready to set up the Amazon Redshift cluster.

Set up the Amazon Redshift cluster

This CloudFormation template sets up an Amazon Redshift cluster, CloudWatch alarms, AWS Glue Data Catalog, an Amazon Redshift IAM role, and the required configuration. Follow the steps below to create these resources in the VPC:

  1. Choose the AWS Region where you want to create the stack on the top right of the screen.
  2. Choose the following Launch Stack button. This button automatically launches the AWS CloudFormation service in your AWS account with a template.
  3. The CloudFormation stack requires a few parameters, as shown in the following screenshots:
    • Stack name: Enter a meaningful name for the stack, for example, rsdb.
    • Environment: Select the environment stage (Development, Test, Pre-prod, Production) of the Amazon Redshift cluster. If you specify the Production option for this parameter, it sets snapshot retention to 35 days, sets the enable_user_activity_logging parameter to true, and creates CloudWatch alarms for high CPU utilization and high disk space usage. Setting Development, Test, or Pre-prod for this parameter sets snapshot retention to 8 days, sets the enable_user_activity_logging parameter to false, and creates CloudWatch alarms only for high disk space usage.
    • Parent VPC stack: Provide the stack name of the parent VPC stack. Find this value in the CloudFormation console.
    • Parent bastion stack (Optional): Provide the stack name of parent Amazon Linux bastion host stack. Find this value in the CloudFormation console.
    • Node type for Redshift cluster: Enter the type of the node for your Amazon Redshift cluster, for example, dc2.large.
    • Number of nodes in Redshift cluster: Enter the number of compute nodes for the Amazon Redshift cluster, for example, 2.
    • Redshift cluster port: Enter the TCP/IP port for the Amazon Redshift cluster, for example, 8200.
    • Redshift database name: Enter a database name, for example, rsdev01.
    • Redshift master user name: Enter a database master user name, for example, rsadmin.
    • Redshift master user password: Enter a password for the master user. The password must contain 8–64 printable ASCII characters, excluding /, ", ', \, and @. It must contain at least one uppercase letter, one lowercase letter, and one number. For example, Welcome123.
    • Enable Redshift logging to S3: If you choose true for this parameter, the stack enables database audit logging to a newly created S3 bucket.
    • Max. number of concurrent clusters: Enter any number from 1 to 10 for concurrency scaling. To configure more than 10, you must request a limit increase by submitting an Amazon Redshift Limit Increase Form.
    • Encryption at rest: If you choose true for this parameter, the database encrypts your data with a KMS key.
    • KMS key ID: If you leave this empty, then the cluster uses the default Amazon Redshift KMS key to encrypt the Amazon Redshift database. If you enter a user-created KMS key, then the cluster uses your user-defined KMS key to encrypt the Amazon Redshift database.
    • Redshift snapshot identifier: Enter a snapshot identifier only if you want to restore from a snapshot. Leave it blank for a new cluster.
    • AWS account ID of the Redshift snapshot: Enter the AWS account number that created the snapshot. Leave it blank if the snapshot comes from the current AWS account or if you don’t want to restore from a previously taken snapshot.
    • Redshift maintenance window: Enter a maintenance window for your Amazon Redshift cluster. For more information, see Amazon Redshift maintenance window. For example, sat:05:00-sat:05:30.
    • S3 bucket for Redshift IAM role: Enter the existing S3 bucket. The stack automatically creates an IAM role and associates it with the Amazon Redshift cluster with GET and LIST access to this bucket.
    • AWS Glue Data Catalog database name: Leave this field empty if you don’t want to create an AWS Glue Data Catalog. If you do want an associated AWS Glue Data Catalog database, enter a name for it, for example, dev-catalog-01. For a list of the AWS Regions in which AWS Glue is available, check the regional-product-services map.
    • Email address for SNS notification: Enter the email notification list used to configure an SNS topic for sending CloudWatch alarm notifications. SNS sends a subscription confirmation email to the recipient. The recipient must choose the Confirm subscription link in this email to set up notifications.
    • Unique friendly name: This tag designates a unique, friendly name to append as a NAME tag into all AWS resources that this stack manages.
    • Designate business owner’s email: This tag designates the business owner’s email address associated with the given AWS resource. The stack sends outage or maintenance notifications to this address.
    • Functional tier: This tag designates the specific version of the application.
    • Project cost center: This tag designates the cost center associated with the project of the given AWS resource.
    • Confidentiality classifier: This tag designates the confidentiality classification of the data associated with the resource.
    • Compliance classifier: This tag specifies the Compliance level for the AWS resource.


      Figure 4: Amazon Redshift Stack, in the CloudFormation Console

  4. After entering the parameter values, choose Next.
  5. On the next screen, enter any required tags, an IAM role, or any advanced options, and then choose Next.
  6. Review the details on the final screen, select I acknowledge that AWS CloudFormation might create IAM resources, and choose Create.
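
The effect of the Environment parameter can be summarized as a small lookup. This sketch only restates the behavior described for that parameter (snapshot retention, user activity logging, and which alarms exist); the class and function names are hypothetical, and the template implements the logic with its own conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentSettings:
    snapshot_retention_days: int
    user_activity_logging: bool
    alarm_metrics: tuple


def settings_for(environment: str) -> EnvironmentSettings:
    """Return the settings implied by the chosen environment stage."""
    if environment == "Production":
        return EnvironmentSettings(
            snapshot_retention_days=35,
            user_activity_logging=True,
            alarm_metrics=("CPUUtilization", "PercentageDiskSpaceUsed"),
        )
    if environment in ("Development", "Test", "Pre-prod"):
        return EnvironmentSettings(
            snapshot_retention_days=8,
            user_activity_logging=False,
            alarm_metrics=("PercentageDiskSpaceUsed",),
        )
    raise ValueError(f"unknown environment: {environment!r}")


if __name__ == "__main__":
    for stage in ("Development", "Production"):
        print(stage, settings_for(stage))
```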

Stack creation takes a few minutes. Check the AWS CloudFormation Resources section to see the physical IDs of the various components set up by these stacks.
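
A rejected master user password is a common reason for this stack to fail, so it can be worth checking the value up front. The sketch below encodes the password rules as I understand them: 8 to 64 printable ASCII characters, no /, ", ', \, @, or space, and at least one uppercase letter, one lowercase letter, and one number. The exact forbidden set is an assumption based on the Amazon Redshift documentation; the template's own validation pattern is authoritative.

```python
# Characters assumed to be disallowed in a master user password.
FORBIDDEN = set("/\"'\\@ ")


def password_problems(password: str) -> list:
    """Return a list of rule violations; an empty list means it looks valid."""
    problems = []
    if not 8 <= len(password) <= 64:
        problems.append("length must be 8-64 characters")
    if any(not 32 <= ord(ch) <= 126 for ch in password):
        problems.append("only printable ASCII characters are allowed")
    bad = sorted(set(password) & FORBIDDEN)
    if bad:
        problems.append("forbidden characters: " + " ".join(repr(c) for c in bad))
    if not any(ch.isupper() for ch in password):
        problems.append("needs at least one uppercase letter")
    if not any(ch.islower() for ch in password):
        problems.append("needs at least one lowercase letter")
    if not any(ch.isdigit() for ch in password):
        problems.append("needs at least one number")
    return problems


if __name__ == "__main__":
    for candidate in ("Welcome123", "welcome123", "Welc@me123"):
        print(candidate, password_problems(candidate) or "looks valid")
```

Passing these checks only means the format is acceptable; choose a long, random password for anything beyond a test cluster.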

With setup complete, log in to the Amazon Redshift cluster and run some basic commands to test it.

Log in to the Amazon Redshift cluster using the Amazon Linux bastion host

The following instructions assume that you use a Linux computer and use an SSH client to connect to the bastion host. For more information about how to connect using various clients, see Connect to Your Linux Instance.

  1. Move the private key of the EC2 key pair (that you saved in the Prerequisites section) to a location on the SSH client machine from which you connect to the Amazon Linux bastion host.
  2. Change the permissions of the private key using the following command, so that it’s not publicly viewable.
    chmod 400 <private key file name, e.g., bastion-key.pem>
  3. In the CloudFormation console, select the Amazon Linux bastion host stack. Choose Outputs and make a note of the SSHCommand parameter value, which you use to connect over SSH to the Amazon Linux bastion host.
  4. On the SSH client, change the directory to the location where you saved the EC2 private key, and then copy and paste the SSHCommand value from the previous step.
  5. In the CloudFormation console, select the Amazon Redshift cluster stack. Choose Outputs and note the PSQLCommandLine parameter value, which you use to log in to the Amazon Redshift database using the psql client.
  6. The EC2 Auto Scaling launch configuration already set up PostgreSQL binaries on the Amazon Linux bastion host. Copy and paste the PSQLCommandLine value at the command prompt of the bastion host.
    psql -h ClusterEndpointAddress -p AmazonRedshiftClusterPort -U Username -d DatabaseName
    When prompted, enter the database user password.
  7. Run some basic commands, as shown in the following screenshot:
    select current_database();
    select current_user;

    Figure 5: Successful connection to Amazon Redshift
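
If you script the connection, you can assemble the same psql command from the individual values instead of copying the PSQLCommandLine output. This is a minimal sketch; the endpoint shown is a made-up placeholder, and the port, user, and database are the example values used earlier in this post.

```python
import shlex


def psql_command(host: str, port: int, user: str, database: str) -> str:
    """Build a shell-safe psql command line for the cluster endpoint."""
    args = ["psql", "-h", host, "-p", str(port), "-U", user, "-d", database]
    # shlex.join quotes any argument that contains shell-special characters.
    return shlex.join(args)


if __name__ == "__main__":
    # Placeholder endpoint; use the ClusterEndpointAddress of your own stack.
    print(psql_command("rsdb.example.internal", 8200, "rsadmin", "rsdev01"))
```

The password is deliberately left out of the command so that psql prompts for it rather than exposing it in the shell history.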

Next steps

Before you use the Amazon Redshift cluster to set up your application-related database objects, consider creating the following:

  • An application schema
  • A user with full access to create and modify objects in the application schema
  • A user with read/write access to the application schema
  • A user with read-only access to the application schema

Use the master user that you set up with the Amazon Redshift cluster only for administering the Amazon Redshift cluster. To create and modify application-related database objects, use the user with full access to the application schema. Your application should use the read/write user for storing, updating, deleting, and retrieving data. Any reporting or read-only application should use the read-only user. Granting the minimum privileges required to perform operations is a database security best practice.
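
As one possible starting point for this separation of duties, the sketch below generates the SQL statements for an application schema and three users. The schema and user names are hypothetical, the password is a placeholder you must replace, and the grants are one reasonable reading of "full access", "read/write", and "read-only"; adjust them to your own security requirements before running anything as the master user.

```python
import re

_IDENTIFIER = re.compile(r"^[a-z_][a-z0-9_]*$")


def _check(name: str) -> str:
    """Accept only simple lowercase identifiers to keep the SQL safe."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"not a simple SQL identifier: {name!r}")
    return name


def schema_setup_statements(schema: str, owner: str, writer: str, reader: str) -> list:
    """Return SQL for a schema, three users, and least-privilege grants."""
    schema, owner, writer, reader = map(_check, (schema, owner, writer, reader))
    placeholder = "<replace-with-a-strong-password>"
    statements = [f"CREATE SCHEMA {schema};"]
    for user in (owner, writer, reader):
        statements.append(f"CREATE USER {user} PASSWORD '{placeholder}';")
    statements += [
        # Full access: create and modify objects in the schema.
        f"GRANT ALL ON SCHEMA {schema} TO {owner};",
        # Read/write access to existing tables.
        f"GRANT USAGE ON SCHEMA {schema} TO {writer};",
        f"GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA {schema} TO {writer};",
        # Read-only access to existing tables.
        f"GRANT USAGE ON SCHEMA {schema} TO {reader};",
        f"GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO {reader};",
        # Apply the same access to tables the owner creates later.
        f"ALTER DEFAULT PRIVILEGES FOR USER {owner} IN SCHEMA {schema} "
        f"GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO {writer};",
        f"ALTER DEFAULT PRIVILEGES FOR USER {owner} IN SCHEMA {schema} "
        f"GRANT SELECT ON TABLES TO {reader};",
    ]
    return statements


if __name__ == "__main__":
    print("\n".join(schema_setup_statements("app", "app_owner", "app_rw", "app_ro")))
```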

Review AWS CloudTrail, AWS Config, and Amazon GuardDuty and configure them for your AWS account, according to AWS security best practices. Together, these services help you monitor activity in your AWS account; assess, audit, and evaluate the configurations of your AWS resources; monitor malicious or unauthorized behavior; and detect security threats against your resources.

Delete CloudFormation stacks

Some of the AWS resources deployed by the CloudFormation stacks in this post incur a cost as long as you continue to use them.

You can delete a CloudFormation stack to delete all AWS resources created by that stack. To clean up all your stacks, use the CloudFormation console to remove the three stacks in the reverse order of their creation: the Amazon Redshift cluster stack first, then the bastion host stack, and finally the VPC stack.

To delete a stack:

  1. On the Stacks page in the CloudFormation console, select the stack to delete. The stack must be currently running.
  2. In the stack details pane, choose Delete.
  3. Select Delete stack when prompted.

After stack deletion begins, you cannot stop it. The stack proceeds to the DELETE_IN_PROGRESS state. After the stack deletion completes, the stack changes to the DELETE_COMPLETE state. The AWS CloudFormation console does not display stacks in the DELETE_COMPLETE state by default. To display deleted stacks, you must change the stack view filter, as described in Viewing Deleted Stacks on the AWS CloudFormation Console.

If the delete fails, the stack enters the DELETE_FAILED state. For solutions, see Delete Stack Fails.
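
The cleanup logic is simple enough to sketch in a few lines. The stacks must go in reverse order because CloudFormation does not delete a stack while another stack still imports its exported values. The stack names are the examples used in this post, and the function names and the "wait", "done", and "investigate" labels are my own.

```python
def deletion_order(created: list) -> list:
    """Delete stacks in the reverse order of their creation."""
    return list(reversed(created))


def next_action(status: str) -> str:
    """Map a stack status during deletion to what you should do next."""
    actions = {
        "DELETE_IN_PROGRESS": "wait",     # deletion cannot be stopped
        "DELETE_COMPLETE": "done",        # hidden from the default stack view
        "DELETE_FAILED": "investigate",   # see Delete Stack Fails
    }
    return actions.get(status, "unknown")


if __name__ == "__main__":
    for name in deletion_order(["rsVPC", "rsBastion", "rsdb"]):
        print("delete", name)
    print(next_action("DELETE_FAILED"))
```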

Summary

In this post, I showed you how to automate creation of an Amazon Redshift cluster and required AWS infrastructure based on AWS security and high availability best practices using AWS CloudFormation. I hope you find the sample CloudFormation templates helpful and encourage you to modify them to support your business needs.

If you have any comments or questions about this post, I encourage you to use the comments section.

 


About the Author

Sudhir Gupta is a senior partner solutions architect at Amazon Web Services. He works with AWS consulting and technology partners to provide guidance and technical assistance on data warehouse and data lake projects, helping them to improve the value of their solutions when using AWS.