All posts by Manash Deb

Analyze Amazon SES events at scale using Amazon Redshift

Post Syndicated from Manash Deb original https://aws.amazon.com/blogs/big-data/analyze-amazon-ses-events-at-scale-using-amazon-redshift/

Email is one of the most important methods for business communication across many organizations. It’s also one of the primary ways many businesses communicate with their customers. With the ever-increasing need to send emails at scale, monitoring and analyzing them has become a major challenge.

Amazon Simple Email Service (Amazon SES) is a cost-effective, flexible, and scalable email service that enables you to send and receive emails from your applications. You can use Amazon SES for several use cases, such as transactional, marketing, or mass email communications.

An important benefit of Amazon SES is its native integration with other AWS services, such as Amazon CloudWatch and Amazon Redshift, which allows you to monitor and analyze your email sending at scale seamlessly. You can store your email events in Amazon Redshift, which is a widely used, fast, and fully managed cloud data warehouse. You can then analyze these events using SQL to gain business insights such as marketing campaign success, email bounces, complaints, and so on.

In this post, you will learn how to implement an end-to-end solution to automate this email analysis and monitoring process.

Solution overview

The following architecture diagram highlights the end-to-end solution, which you can provision automatically with an AWS CloudFormation template.

In this solution, you publish Amazon SES email events to an Amazon Kinesis Data Firehose delivery stream that publishes data to Amazon Redshift. You then connect to the Amazon Redshift database and use a SQL query tool to analyze Amazon SES email events that meet the given criteria. We use the Amazon Redshift SUPER data type to store the event (JSON data) in Amazon Redshift. The SUPER data type handles semi-structured data, which can have varying table attributes and types.
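For example, because each event is stored as a single SUPER value, you can navigate its attributes directly with dot notation. The following query is a minimal sketch that assumes the events land in a SUPER column named data in the ses table (adjust the column name to your deployment); SES event keys are camelCase, so case-sensitive identifiers are enabled first.

-- Assumes a SUPER column named "data" in the table "ses"; adjust to your schema.
SET enable_case_sensitive_identifier TO true;

SELECT data."eventType"::varchar AS event_type,
       COUNT(*) AS event_count
FROM ses
WHERE data."eventType"::varchar IN ('Bounce', 'Complaint')
GROUP BY 1;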

The alarm system uses the Amazon CloudWatch Logs entries that Kinesis Data Firehose generates when a data load to Amazon Redshift fails. We set up a metric filter that pattern-matches the CloudWatch log events to detect the error condition and trigger a CloudWatch alarm, which in turn sends email notifications using Amazon Simple Notification Service (Amazon SNS).

Prerequisites

As a prerequisite for deploying the solution in this post, you need to set up Amazon SES in your account. For more information, see Getting Started with Amazon Simple Email Service.

Solution resources and features

The architecture built by AWS CloudFormation supports AWS best practices for high availability and security. The CloudFormation template takes care of the following key resources and features:

  • Amazon Redshift cluster – An Amazon Redshift cluster with encryption at rest enabled using an AWS Key Management Service (AWS KMS) customer managed key (CMK). This cluster acts as the destination for Kinesis Data Firehose and stores all the Amazon SES email sending events in the table ses, as shown in the following screenshot.
  • Kinesis Data Firehose configuration – A Kinesis Data Firehose delivery stream that acts as the event destination for all Amazon SES email sending metrics. The delivery stream is set up with Amazon Redshift as the destination. Server-side encryption is enabled using an AWS KMS CMK, and destination error logging has been enabled as per best practices.
  • Amazon SES configuration – A configuration set in Amazon SES that is used to map Kinesis Data Firehose as the event destination to publish email metrics.

To use the configuration set when sending emails, you can specify a default configuration set for your verified identity, or include a reference to the configuration set in the headers of the email.

  • Exploring and analyzing the data – We use Amazon Redshift query editor v2 for exploring and analyzing the data.
  • Alarms and notifications for ingestion failures – A data load error notification system using CloudWatch and Amazon SNS generates email-based notifications in the event of a failure during data load from Kinesis Data Firehose to Amazon Redshift. The setup creates a CloudWatch log metric filter, as shown in the following screenshot.

A CloudWatch alarm based on the metric filter triggers an SNS notification when in alarm state. For more information, see Using Amazon CloudWatch alarms.
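The CloudFormation template creates these resources for you. As a rough AWS CLI sketch of the same setup (the log group name, filter pattern, metric namespace, and SNS topic ARN below are all placeholders to adapt to what the template actually creates), it would look something like the following:

# Placeholders: Kinesis Data Firehose typically logs delivery errors to a log group
# such as /aws/kinesisfirehose/<delivery-stream-name>; adjust names and the pattern.
aws logs put-metric-filter \
  --log-group-name "/aws/kinesisfirehose/ses-events-stream" \
  --filter-name "RedshiftDeliveryErrors" \
  --filter-pattern '"DataDeliveryFailed"' \
  --metric-transformations metricName=SesRedshiftLoadFailures,metricNamespace=SES/Monitoring,metricValue=1

aws cloudwatch put-metric-alarm \
  --alarm-name "ses-redshift-load-failure" \
  --namespace SES/Monitoring \
  --metric-name SesRedshiftLoadFailures \
  --statistic Sum \
  --period 300 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ses-load-failure-topic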

Deploy the CloudFormation template

The provided CloudFormation template automatically creates all the required resources for this solution in your AWS account. For more information, see Getting started with AWS CloudFormation.

  1. Sign in to the AWS Management Console.
  2. Choose Launch Stack to launch AWS CloudFormation in your AWS account:
  3. For Stack name, enter a meaningful name for the stack, for example, ses_events.
  4. Provide the following values for the stack parameters:
    1. ClusterName – The name of the Amazon Redshift cluster.
    2. DatabaseName – The name of the first database to be created when the Amazon Redshift cluster is created.
    3. DeliveryStreamName – The name of the Firehose delivery stream.
    4. MasterUsername – The user name that is associated with the primary user account for the Amazon Redshift cluster.
    5. NodeType – The type of node to be provisioned (default: dc2.large).
    6. NotificationEmailId – The email notification list used to configure an SNS topic for sending CloudWatch alarm and event notifications.
    7. NumberofNodes – The number of compute nodes in the Amazon Redshift cluster. For multi-node clusters, this parameter must be greater than 1.
    8. OnPremisesCIDR – The IP range (CIDR notation) for your existing infrastructure to access the Amazon Redshift cluster.
    9. SESConfigSetName – The name of the Amazon SES configuration set.
    10. SubnetId – The subnet ID where the Amazon Redshift cluster is created.
    11. Vpc – The VPC in which the Amazon Redshift cluster is launched.
  5. Choose Next.
  6. Review all the information and select I acknowledge that AWS CloudFormation might create IAM resources.
  7. Choose Create stack.

You can track the progress of the stack creation on the Events tab. Wait for the stack to complete and show the status CREATE_COMPLETE.

Test the solution

To send a test email, we use the Amazon SES mailbox simulator. Set the configuration-set header to the one created by the CloudFormation template.
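For example, the following AWS CLI call (a sketch; the sender address and configuration set name are placeholders) sends a message to the simulator’s success mailbox with the configuration set applied. When sending raw email instead, you can set the X-SES-CONFIGURATION-SET header to the same value.

# Placeholders: replace the sender and configuration set name with your own values.
aws ses send-email \
  --from "sender@example.com" \
  --destination "ToAddresses=success@simulator.amazonses.com" \
  --message "Subject={Data=Test,Charset=UTF-8},Body={Text={Data=Hello,Charset=UTF-8}}" \
  --configuration-set-name "my-ses-config-set"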

We use Amazon Redshift query editor v2 to query the Amazon Redshift table (created by the CloudFormation template) and confirm that the events have arrived.
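For a quick sanity check, a query like the following (again assuming a SUPER column named data, with case-sensitive identifiers enabled as shown earlier) shows the event count per event type:

SELECT data."eventType"::varchar AS event_type, COUNT(*) AS events
FROM ses
GROUP BY 1
ORDER BY 2 DESC;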

If the data load of the event stream fails from Kinesis Data Firehose to Amazon Redshift, the failure notification system is triggered, and you receive an email notification via Amazon SNS.

Clean up

Some of the AWS resources deployed by the CloudFormation stacks in this post incur a cost as long as you continue to use them.

You can delete the CloudFormation stack to delete all AWS resources created by the stack. To clean up all your stacks, use the AWS CloudFormation console to remove the stacks that you created in reverse order.

  1. On the Stacks page on the AWS CloudFormation console, choose the stack to delete.
  2. In the stack details pane, choose Delete.
  3. Choose Delete stack when prompted.

After stack deletion begins, you can’t stop it. The stack proceeds to the DELETE_IN_PROGRESS state. When the stack deletion is complete, the stack changes to the DELETE_COMPLETE state. The AWS CloudFormation console doesn’t display stacks in the DELETE_COMPLETE state by default. To display deleted stacks, you must change the stack view filter. For more information, see Viewing deleted stacks on the AWS CloudFormation console.

If the delete fails, the stack enters the DELETE_FAILED state. For solutions, see Delete stack fails.

Conclusion

In this post, we walked through the process of setting up Amazon SES and Amazon Redshift to deploy an email reporting service that can scale to support millions of events. We used Amazon Redshift to store semi-structured messages using the SUPER data type in database tables to support varying message sizes and formats. With this solution, you can easily run analytics at scale and analyze your email event data for deliverability-related issues such as bounces or complaints.

Use the CloudFormation template provided to speed up provisioning of the cloud resources required for the solution (Amazon SES, Kinesis Data Firehose, and Amazon Redshift) in your account while following security best practices. Then you can analyze Amazon SES events at scale using Amazon Redshift.


About the Authors

Manash Deb is a Software Development Engineer in the AWS Directory Service team. He has worked on building end-to-end applications with different databases and technologies for over 15 years. He loves learning new technologies and solving, automating, and simplifying customer problems on AWS.

Arnab Ghosh is a Solutions Architect for AWS in North America helping enterprise customers build resilient and cost-efficient architectures. He has over 13 years of experience in architecting, designing, and developing enterprise applications solving complex business problems.

Sanjoy Thanneer is a Sr. Technical Account Manager with AWS based out of New York. He has over 20 years of experience working in the database and analytics domains. He is passionate about helping enterprise customers build scalable, resilient, and cost-efficient applications.

Justin Morris is an Email Deliverability Manager for the Simple Email Service team. With over 10 years of experience in the IT industry, he has developed a natural talent for diagnosing and resolving customer issues and continuously looks for growth opportunities to learn new technologies and services.

Automate building an integrated analytics solution with AWS Analytics Automation Toolkit

Post Syndicated from Manash Deb original https://aws.amazon.com/blogs/big-data/automate-building-an-integrated-analytics-solution-with-aws-analytics-automation-toolkit/

Amazon Redshift is a fast, fully managed, widely popular cloud data warehouse that powers the modern data architecture enabling fast and deep insights or machine learning (ML) predictions using SQL across your data warehouse, data lake, and operational databases. A key differentiating factor of Amazon Redshift is its native integration with other AWS services, which makes it easy to build complete, comprehensive, and enterprise-level analytics applications.

As analytics solutions have moved away from the one-size-fits-all model to choosing the right tool for the right function, architectures have become more optimized and performant while simultaneously becoming more complex. You can use Amazon Redshift for a variety of use cases, along with other AWS services for ingesting, transforming, and visualizing the data.

Manually deploying these services is time-consuming. It also introduces the risk of human error and deviation from best practices.

In this post, we discuss how to automate the process of building an integrated analytics solution by using a simple script.

Solution overview

The framework described in this post uses Infrastructure as Code (IaC) to solve the challenges with manual deployments, by using the AWS Cloud Development Kit (AWS CDK) to automate the provisioning of AWS analytics services. You can indicate the services and resources you want to incorporate in your infrastructure by editing a simple JSON configuration file.

The script then automatically provisions all the required infrastructure components and integrates them according to AWS recommended best practices.

In this post, we go into further detail on the specific steps to build this solution.

Prerequisites

Prior to deploying the AWS CDK stack, complete the following prerequisite steps:

  1. Verify that you’re deploying this solution in a Region that supports AWS CloudShell. For more information, see AWS CloudShell endpoints and quotas.
  2. Have an AWS Identity and Access Management (IAM) user with the following permissions:
    1. AWSCloudShellFullAccess
    2. IAM Full Access
    3. AWSCloudFormationFullAccess
    4. AmazonSSMFullAccess
    5. AmazonRedshiftFullAccess
    6. AmazonS3ReadOnlyAccess
    7. SecretsManagerReadWrite
    8. AmazonEC2FullAccess
    9. Create a custom AWS Database Migration Service (AWS DMS) policy called AmazonDMSRoleCustom with the following permissions:
{
	"Version": "2012-10-17",
	"Statement": [
		{
		"Effect": "Allow",
		"Action": "dms:*",
		"Resource": "*"
		}
	]
}
  3. Optionally, create a key pair that you have access to (see the example command after this list). This is only required if you deploy the AWS Schema Conversion Tool (AWS SCT).
  4. Optionally, if you use resources outside your AWS account, open firewalls and security groups to allow traffic from AWS. This is only applicable for AWS DMS and AWS SCT deployments.
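If you need a new key pair for the AWS SCT instance, you can create one with the AWS CLI as follows (the key pair name is a placeholder):

# Create a key pair and save the private key locally for access to the AWS SCT instance.
aws ec2 create-key-pair --key-name sct-keypair --query 'KeyMaterial' --output text > sct-keypair.pem
chmod 400 sct-keypair.pem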

Prepare the config file

To launch the target infrastructures, download the user-config-template.json file from the GitHub repo.

To prepare the config file, start by entering one of the following values for each key in the top section: CREATE, N/A, or an existing resource ID. These indicate whether you want the component provisioned on your behalf, skipped, or integrated using an existing resource in your account.

For each of the services with the CREATE value, you then edit the appropriate section under it with the specific parameters to use for that service. When you’re done customizing the form, save it as user-config.json.
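The fragment below is purely illustrative of this convention; the key names are not the template’s real keys, so refer to user-config-template.json for the actual structure:

{
	"vpc_id": "vpc-0abc1234example",
	"redshift_cluster": "CREATE",
	"dms_instance": "CREATE",
	"sct_on_ec2": "N/A"
}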

You can see an example of the completed config file under user-config-sample.json in the GitHub repo, which illustrates a config file for the following architecture by newly provisioning all the services, including Amazon Virtual Private Cloud (Amazon VPC), Amazon Redshift, an Amazon Elastic Compute Cloud (Amazon EC2) instance with AWS SCT, and an AWS DMS instance connecting an external source SQL Server on Amazon EC2 to the Amazon Redshift cluster.

Launch the toolkit

This project uses CloudShell, a browser-based shell service, to programmatically initiate the deployment through the AWS Management Console. Prior to opening CloudShell, you need to configure an IAM user, as described in the prerequisites.

  1. On the CloudShell console, clone the Git repository:
    git clone https://github.com/aws-samples/amazon-redshift-infrastructure-automation.git

  2. Run the deployment script:
    ~/amazon-redshift-infrastructure-automation/scripts/deploy.sh

  3. On the Actions menu, choose Upload file and upload user-config.json.
  4. Enter a name for the stack.
  5. Depending on the resources being deployed, you may have to provide additional information, such as the password for an existing database or Amazon Redshift cluster.
  6. Press Enter to initiate the deployment.

Monitor the deployment

After you run the script, you can monitor the deployment of resource stacks through the CloudShell terminal, or through the AWS CloudFormation console, as shown in the following screenshot.

Each stack corresponds to the creation of a resource from the config file. You can see the newly created VPC, Amazon Redshift cluster, EC2 instance running AWS SCT, and AWS DMS instance. To test the success of the deployment, you can test the newly created AWS DMS endpoint connectivity to the source system and the target Amazon Redshift cluster. Select your endpoint and on the Actions menu, choose Test connection.

If both statuses say Success, the AWS DMS workflow is fully integrated.
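You can run the same connectivity test from the AWS CLI if you prefer (the ARNs are placeholders):

# Placeholders: substitute the ARNs of your replication instance and endpoint.
aws dms test-connection \
  --replication-instance-arn arn:aws:dms:us-east-1:123456789012:rep:EXAMPLE \
  --endpoint-arn arn:aws:dms:us-east-1:123456789012:endpoint:EXAMPLE

# Check the result once the test finishes.
aws dms describe-connections \
  --filters Name=endpoint-arn,Values=arn:aws:dms:us-east-1:123456789012:endpoint:EXAMPLE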

Troubleshooting

If the stack launch stalls at any point, visit our GitHub repository for troubleshooting instructions.

Conclusion

In this post, we discussed how you can use the AWS Analytics Infrastructure Automation utility to quickly get started with Amazon Redshift and other AWS services. It helps you provision your entire solution on AWS quickly, without spending time on challenges around integrating the services or scaling your solution.


About the Authors

Manash Deb is a Software Development Engineer in the AWS Directory Service team. He has worked on building end-to-end applications with different databases and technologies for over 15 years. He loves learning new technologies and solving, automating, and simplifying customer problems on AWS.

Samir Kakli is an Analytics Specialist Solutions Architect at AWS. He has worked on building and tuning databases and data warehouse solutions for over 20 years. His focus is architecting end-to-end analytics solutions designed to meet the specific needs of each customer.

Julia Beck is a Specialist Solutions Architect at AWS. She supports customers building analytics proof of concept workloads. Outside of work, she enjoys traveling, cooking, and puzzles.

Compare different node types for your workload using Amazon Redshift

Post Syndicated from Manash Deb original https://aws.amazon.com/blogs/big-data/compare-different-node-types-for-your-workload-using-amazon-redshift/

Amazon Redshift is a fast, fully managed, widely popular cloud data warehouse that offers different node types to meet the flexible workload patterns of our customers. Amazon Redshift RA3 with managed storage is the newest instance type in Amazon Redshift, which allows you to scale and pay for compute and storage independently, and also allows advanced features like AQUA (Advanced Query Accelerator), cross-cluster data sharing, and cross-Availability Zone cluster relocation.

Many customers start their workload on Amazon Redshift with RA3 instances as their default choice, which currently offers three node sizes.

Node size vCPU RAM (GiB) Managed storage quota per node
ra3.xlplus 4 32 32 TB
ra3.4xlarge 12 96 128 TB
ra3.16xlarge 48 384 128 TB

A very common question we get from customers is “Which RA3 instance type and number of nodes should we choose for our workload?” In this post, we answer that question with the following two-step process:

  1. Use the Amazon Redshift sizing calculator on the Amazon Redshift console, which estimates your cluster configuration based on data size and data access frequency.
  2. Use the Amazon Redshift Node Configuration Comparison utility to find the right configuration for your cluster based on your query performance expectations for sequential or concurrently running queries. If you’re already using Amazon Redshift and want to migrate from your existing DC2 or DS2 instances to RA3, you can follow our recommendations on node count and type when upgrading. Before doing that, you can use this utility, which integrates with the Amazon Redshift Simple Replay utility, to replay your past workloads and evaluate performance metrics for different Amazon Redshift configurations. We describe this utility in detail later in this post.

Amazon Redshift sizing calculator

The Amazon Redshift console provides the Help me choose option when you create a new Amazon Redshift cluster. This option allows you to get an estimate of your cluster configuration based on your data size and data access frequency. In this wizard, you need to provide the following information:

  • The compressed or uncompressed size of your total dataset
  • Whether your data is time-based:
    • If your data is time based, how many months of data does the data warehouse contain and how many months are frequently queried on that data
    • If your data is not time based, what percentage of data do your queries access frequently

Based on your response, you receive a recommended node configuration, as shown in the following screenshot.

You can use this recommendation to create your Amazon Redshift cluster and quickly get started with your workload to gain insights from your data in the Amazon Redshift data warehouse.

Amazon Redshift Node Configuration Comparison utility

If you have stringent service level agreements (SLAs) for query performance in your Amazon Redshift cluster or you want to explore different Amazon Redshift configurations based on the price/performance of your workload, you can use the Amazon Redshift Node Configuration Comparison utility. This utility helps evaluate performance of your queries using different Amazon Redshift cluster configurations in parallel and compares the end results to find the best cluster configuration that meets your need.

To perform this comparison, you can use the Amazon Redshift sizing calculator recommendations in the previous step as a benchmark and test other configurations with it in parallel to compare performance. If you’re migrating from DC2 or DS2 node types to RA3, you can use our recommendations on node count and type as a benchmark.

Solution overview

The solution uses AWS Step Functions, AWS Lambda, and AWS Batch to run an end-to-end automated orchestration to find the best Amazon Redshift configuration based on your price/performance requirements. We use an AWS CloudFormation template to deploy and run this solution in your AWS account. Along with other resources, this template creates an Amazon Simple Storage Service (Amazon S3) bucket to store all data and metadata related to this process. The following diagram shows the solution architecture.

You need to create a JSON file to provide the following input configurations for your test:

  • Amazon Redshift cluster configurations
  • DDL and load script (optional)
  • Amazon Redshift snapshot identifier (optional)
  • SQL script to conduct sequential and concurrency test (optional)
  • Amazon Redshift audit log location and simple replay time window (optional)

You need to store this file in an existing S3 bucket and then use the following CloudFormation template to deploy this solution. This solution uses AWS CloudFormation to automatically provision all the required resources in your AWS accounts. For more information, see Getting started with AWS CloudFormation.

Deploying the solution also initiates an iteration of this test by invoking a Step Functions state machine in your AWS account.

Prerequisites

If you’re already running an Amazon Redshift workload in production, you can use this solution to replay your past workload using the Amazon Redshift Simple Replay utility. This helps you conduct what-if analysis on Amazon Redshift and evaluate how your workload performs with different configurations. For example, you can use the tool to benchmark your actual workload on a new instance type like RA3, evaluate a new feature like AQUA, or assess different cluster configurations. It replays your workload with the same time interval between connections, transactions, and queries as the source. It also supports extract, transform, and load (ETL) workloads including COPY and UNLOAD statements. The Amazon Redshift Node Configuration Comparison utility uses Simple Replay to automate all these configuration tests. As a prerequisite to use Simple Replay, you need to enable audit logging and user-activity logging in your Amazon Redshift cluster.
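If audit logging isn’t already on, the following commands show one way to enable it (the cluster, bucket, and parameter group names are placeholders); user-activity logging is turned on through the cluster’s parameter group and takes effect after a reboot.

# Deliver Amazon Redshift audit logs (connection and user logs) to S3.
aws redshift enable-logging \
  --cluster-identifier my-redshift-cluster \
  --bucket-name my-redshift-audit-logs \
  --s3-key-prefix RSLogs/

# Turn on user-activity logging through the cluster's parameter group (reboot required).
aws redshift modify-cluster-parameter-group \
  --parameter-group-name my-custom-param-group \
  --parameters ParameterName=enable_user_activity_logging,ParameterValue=true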

Example use case

As an example, assume you have an existing Amazon Redshift cluster with two nodes of DC2.8XLarge instances. You want to evaluate migrating this cluster to RA3 node types. You used the Amazon Redshift sizing calculator in the previous step, which recommends using four nodes of RA3.4XL, but you also want to evaluate the performance with five nodes of RA3.4XL to meet your future growth demands. For that, you want to run five test queries sequentially as well as in 5 and 10 parallel sessions in all these clusters. You also want to replay the workload in the past hour on these clusters and compare their performance.

You also want to test your workload performance with concurrency scaling enabled on those clusters, which helps support concurrent workloads with consistently fast query performance.

The following table summarizes the five cluster configurations that are evaluated as part of this test.

Node Type Number of Nodes Option
dc2.8xlarge 2 concurrency scaling – off
ra3.4xlarge 4 concurrency scaling – off
ra3.4xlarge 4 concurrency scaling – on
ra3.4xlarge 5 concurrency scaling – off
ra3.4xlarge 5 concurrency scaling – on

To perform this test using the Amazon Redshift Node Configuration Comparison utility, you provide these configurations in a JSON file and store it in an S3 bucket. You then use the provided CloudFormation template to deploy this utility, which performs the end-to-end performance testing in all the clusters in parallel and produces a performance evaluation summary, which can help you decide which configuration works best for you.

JSON file parameters

You need to provide a configuration JSON file to use this solution. The following table explains the input parameters for this JSON file.

JSON Parameter Allowed Values Description
SNAPSHOT_ID N/A, or an Amazon Redshift snapshot identifier Enter the snapshot identifier if you want to create new Amazon Redshift clusters by restoring from a snapshot. The snapshot identifier is mandatory if you're using Simple Replay in this test to replay your past workload. If you're using this solution in a different AWS account, make sure to share your Amazon Redshift cluster snapshot with this account. For more information, see How do I copy an Amazon Redshift cluster to a different AWS account. Enter N/A if not applicable.
SNAPSHOT_ACCOUNT_ID N/A, or an AWS account ID The AWS account ID where the snapshot was created. Enter N/A if not applicable.
PARAMETER_GROUP_CONFIG_S3_PATH N/A, or an Amazon S3 URI If you use a custom parameter group for this test, provide its Amazon S3 URI. You can get this JSON by running the following command in the AWS Command Line Interface (AWS CLI): aws redshift describe-cluster-parameters --parameter-group-name your-custom-param-group --output json. Enter N/A if not applicable.
DDL_AND_COPY_SCRIPT_S3_PATH N/A, or an Amazon S3 URI If you create tables and load data into them before performing the test, provide the script's Amazon S3 URI. Enter N/A if not applicable.
SQL_SCRIPT_S3_PATH N/A, or an Amazon S3 URI If you run performance testing of your queries, provide the Amazon S3 URI of your script containing all your SQL commands. These commands should be delimited by a semicolon (;). Enter N/A if not applicable. We recommend keeping only SELECT statements in this script.
NUMBER_OF_PARALLEL_SESSIONS_LIST N/A, or comma-separated numbers Enter comma-separated numbers to denote the number of parallel sessions in which you want to run the preceding script. Enter N/A if not applicable.
SIMPLE_REPLAY_LOG_LOCATION N/A, or an Amazon S3 URI If you're already running an Amazon Redshift workload and your cluster has audit logging enabled, provide the Amazon S3 URI of your Amazon Redshift audit logging location. If you're using this solution in a different AWS account, copy these logs from your source cluster's audit logging bucket to the S3 bucket in this account. Enter N/A if not applicable.
SIMPLE_REPLAY_EXTRACT_START_TIME N/A, or a time in ISO-8601 format If you're using Simple Replay in this test to replay your past workload, provide the start time of that workload in ISO-8601 format (for example, 2021-01-26T21:41:16+00:00). Enter N/A if not applicable.
SIMPLE_REPLAY_EXTRACT_END_TIME N/A, or a time in ISO-8601 format If you're using Simple Replay in this test to replay your past workload, provide the end time of that workload in ISO-8601 format (for example, 2021-01-26T21:41:16+00:00). Enter N/A if not applicable.
SIMPLE_REPLAY_EXTRACT_OVERWRITE_S3_PATH N/A, or an Amazon S3 URI If you're using Simple Replay and you want to use a custom extract.yaml file, provide its Amazon S3 URI. Enter N/A if not applicable.
SIMPLE_REPLAY_OVERWRITE_S3_PATH N/A, or an Amazon S3 URI If you're using Simple Replay and you want to use a custom replay.yaml file, provide its Amazon S3 URI. Enter N/A if not applicable.
AUTO_PAUSE true, false Enter true if you want to automatically pause all except the first Amazon Redshift cluster created for this test; otherwise, enter false.
CONFIGURATIONS JSON array with parameters NODE_TYPE, NUMBER_OF_NODES, WLM_CONFIG_S3_PATH Enter a JSON array listing the Amazon Redshift cluster configurations for which you want to perform this test. We explain these parameters in the next three rows.
NODE_TYPE ra3.xlplus, ra3.4xlarge, ra3.16xlarge, dc2.large, dc2.8xlarge, ds2.xlarge, ds2.8xlarge The Amazon Redshift cluster node type that you want to run for this test.
NUMBER_OF_NODES A number between 1–128 The number of nodes for your Amazon Redshift cluster.
WLM_CONFIG_S3_PATH N/A, or an Amazon S3 URI If you want to use custom workload management settings for different Amazon Redshift clusters, provide the Amazon S3 URI for that configuration. Enter N/A if not applicable.

The following code is a sample configuration JSON file used to implement this example use case:

{
"SNAPSHOT_ID": "redshift-cluster-manual-snapshot",
"SNAPSHOT_ACCOUNT_ID": "123456789012",

"PARAMETER_GROUP_CONFIG_S3_PATH": "s3://node-config-compare-bucket/pg_config.json",

"DDL_AND_COPY_SCRIPT_S3_PATH": "s3://node-config-compare-bucket/ddl.sql",
"SQL_SCRIPT_S3_PATH":"s3://node-config-compare-bucket/test_queries.sql",
"NUMBER_OF_PARALLEL_SESSIONS_LIST": "1,5,10",

"SIMPLE_REPLAY_LOG_LOCATION":"s3://redshift-logging-xxxxxxxx/RSLogs/",
"SIMPLE_REPLAY_EXTRACT_START_TIME":"2021-08-28T11:15:00+00:00",
"SIMPLE_REPLAY_EXTRACT_END_TIME":"2021-08-28T12:00:00+00:00",

"SIMPLE_REPLAY_EXTRACT_OVERWRITE_S3_PATH":"N/A",
"SIMPLE_REPLAY_OVERWRITE_S3_PATH":"N/A",

"AUTO_PAUSE": true,

"CONFIGURATIONS": [
	{
	"NODE_TYPE": "dc2.8xlarge",
	"NUMBER_OF_NODES": "2",
	"WLM_CONFIG_S3_PATH": "s3://node-config-compare-bucket/source-wlm.json"
	},	
	{
	"NODE_TYPE": "ra3.4xlarge",
	"NUMBER_OF_NODES": "4",
	"WLM_CONFIG_S3_PATH": "s3://node-config-compare-bucket/source-wlm.json"
	},
	{
	"NODE_TYPE": "ra3.4xlarge",
	"NUMBER_OF_NODES": "4",
	"WLM_CONFIG_S3_PATH": "s3://node-config-compare-bucket/wlm-concurrency-scaling.json"
	},
{
	"NODE_TYPE": "ra3.4xlarge",
	"NUMBER_OF_NODES": "5",
	"WLM_CONFIG_S3_PATH": "s3://node-config-compare-bucket/source-wlm.json"
	},
{
	"NODE_TYPE": "ra3.4xlarge",
	"NUMBER_OF_NODES": "5",
	"WLM_CONFIG_S3_PATH": "s3://node-config-compare-bucket/ wlm-concurrency-scaling.json"
	}
]
}

Make sure to use the same S3 bucket to store all your configurations for this test. For example, we use the S3 bucket node-config-compare-bucket to store all configuration scripts in the preceding JSON configuration. After you populate all the parameters in this JSON file, save the file in the same S3 bucket in your AWS account.

Deploy the solution using AWS CloudFormation

After you save the configuration JSON file in an S3 bucket, you can use the provided CloudFormation template to deploy this solution, which also initiates an iteration of this test. This template provisions the required AWS resources except for the Amazon Redshift clusters, which are created in the subsequent step by a Step Functions state machine. The following table lists the required parameters in the template.

CloudFormation Template Parameter Allowed Values Description
ConfigJsonS3Path Amazon S3 URI Enter the Amazon S3 URI where you stored your JSON configuration file from the previous step. The template grants access on this S3 bucket to the appropriate AWS resources created by this solution.
ClusterIdentifierPrefix Prefix of Amazon Redshift cluster identifiers Enter a valid string like rs to be used as the prefix of your Amazon Redshift cluster identifiers created by this solution.
PreExistingS3BucketToGrantRedshiftAccess N/A, or an Amazon S3 bucket name If you're using Simple Replay, provide the Amazon Redshift audit logging bucket name so the template can grant appropriate permissions on it to the AWS resources. You can also add an existing S3 bucket in the same AWS Region that can be accessed by Amazon Redshift. Enter N/A if not applicable.
GrantS3ReadOnlyAccessToRedshift Yes, No If you're using Simple Replay in the same AWS account as the source Amazon Redshift cluster, enter Yes for this parameter, which grants AmazonS3ReadOnlyAccess to the new Amazon Redshift clusters to replay COPY statements within the account. Otherwise, enter No; in that case, you can't replay COPY statements when running in a different AWS account without manually configuring access.
SourceRedshiftClusterKMSKeyARN N/A, AWS KMS Key ARN Provide the AWS Key Management Service (AWS KMS) Key ARN (Amazon Resource Name) if your source Amazon Redshift cluster is encrypted (available on the stack Outputs tab). If using Simple Replay, you need to run extract and replay in the same AWS account if your source cluster is encrypted.
OnPremisesCIDR CIDR notation The IP range (CIDR notation) for your existing infrastructure to access the target and replica clusters from a SQL client. If unsure, enter your corporate desktop’s CIDR address. For instance, if your desktop’s IP address is 10.156.87.45, enter 10.156.87.45/32.
VPC VPC ID An existing Amazon Virtual Private Cloud (Amazon VPC) where you want to deploy the clusters and Amazon Elastic Compute Cloud (Amazon EC2) instances.
SubnetId Subnet ID An existing subnet within the VPC in which you deploy the Amazon Redshift clusters and AWS Batch compute environment.

Performance evaluation

The CloudFormation stack deployed in the previous step runs a Step Functions state machine to orchestrate the end-to-end workflow. Navigate to the Step Functions console to view the state machine deployed by this solution, as shown in the following screenshot.

This state machine creates the Amazon Redshift clusters you defined in the configuration JSON file. Then, it runs the performance evaluation tests on these clusters. If you provided the value true for parameter AUTO_PAUSE in the input JSON file, it also pauses the Amazon Redshift clusters except for the first cluster. At the end, it unloads the performance comparison results to your Amazon S3 bucket.

The following screenshot shows the clusters that were created as part of this state machine run for our use case. The state machine automatically paused all clusters except the first one.

This state machine creates an external schema redshift_config_comparison and creates three external tables (comparison_stats, cluster_config, and pricing) in that schema to read the raw data created by this solution in an S3 bucket. Based on these external tables, it creates the views redshift_config_comparison_results and redshift_config_comparison_raw in the public schema of your Amazon Redshift clusters to compare their performance metrics. These objects can be accessed from any of the Amazon Redshift clusters created by this solution.

REDSHIFT_CONFIG_COMPARISON_RESULTS

This view provides the aggregated comparison summary of your Amazon Redshift clusters.

It provides the raw value and a percentage number for KPIs like total, mean, median, max query times, percentile-75, and percentile-90 to help you compare and find the most performant cluster based on the different cluster configurations. The test_type column in this view indicates if the test type was to replay your past workload using the Simple Replay utility or a concurrency test to run your queries in parallel with different concurrency numbers. We use the following query to view our outcome:

select * from public.redshift_config_comparison_results;

The following table summarizes the performance comparison results:

Test Type Cluster Identifier Total Query Time in Seconds Improved Total Query Time Mean Query Time in Seconds Improved Mean Query Time Median Query Time in Seconds Improved Median Query Time
simple-replay rs-dc2-8xlarge-2 98.75 0% 4.94 0% 5.11 0%
simple-replay rs-ra3-4xlarge-4 94.16 5% 4.71 5% 4.08 25%
simple-replay rs-ra3-4xlarge-4-cs 19.36 410% 0.97 409% 0.68 651%
simple-replay rs-ra3-4xlarge-5 16.21 509% 0.85 481% 0.65 686%
simple-replay rs-ra3-4xlarge-5-cs 14.66 574% 0.73 577% 0.6 752%
concurrency-1 rs-dc2-8xlarge-2 49.6 0% 9.92 0% 10.67 0%
concurrency-1 rs-ra3-4xlarge-4 45.2 10% 9.51 4% 10.3 4%
concurrency-1 rs-ra3-4xlarge-4-cs 45.2 10% 9.51 4% 10.4 3%
concurrency-1 rs-ra3-4xlarge-5 43.93 13% 8.79 13% 7.7 39%
concurrency-1 rs-ra3-4xlarge-5-cs 43.94 13% 8.78 13% 7.78 37%
concurrency-5 rs-dc2-8xlarge-2 24.8 0% 0.99 0% 0.79 8%
concurrency-5 rs-ra3-4xlarge-4 21.24 17% 0.85 16% 0.7 22%
concurrency-5 rs-ra3-4xlarge-4-cs 22.49 10% 0.9 10% 0.85 0%
concurrency-5 rs-ra3-4xlarge-5 19.73 26% 0.79 25% 0.72 19%
concurrency-5 rs-ra3-4xlarge-5-cs 18.73 32% 0.75 32% 0.72 18%
concurrency-10 rs-dc2-8xlarge-2 98.2 0% 1.96 0% 1.63 0%
concurrency-10 rs-ra3-4xlarge-4 85.46 15% 1.71 15% 1.58 3%
concurrency-10 rs-ra3-4xlarge-4-cs 88.09 11% 1.76 11% 1.59 2%
concurrency-10 rs-ra3-4xlarge-5 77.54 27% 1.55 26% 1.36 20%
concurrency-10 rs-ra3-4xlarge-5-cs 74.68 31% 1.49 32% 1.31 24%

Table 1: Summary of Performance Comparison Results

Based on these results, we can conclude that five nodes of RA3.4XLarge with concurrency scaling enabled was the best-performing configuration.

REDSHIFT_CONFIG_COMPARISON_RAW

This view provides the query-level comparison summary of your Amazon Redshift clusters. We use the following query:

select * from public.redshift_config_comparison_raw;

The following table shows the performance comparison results per query:

Query Hash Cluster Identifier Execution Time in Seconds Total Query Time in Seconds Compile Time in Seconds Queue Time in Seconds Username Query
0531f3b54885afb rs-dc2-8xlarge-2 2 7 5 0 awsuser 599
0531f3b54885afb rs-ra3-4xlarge-4 2 5 3 0 awsuser 510
0531f3b54885afb rs-ra3-4xlarge-4-cs 2 2 0 0 awsuser 499
0531f3b54885afb rs-ra3-4xlarge-5 1 6 4 0 awsuser 498
0531f3b54885afb rs-ra3-4xlarge-5-cs 1 1 0 0 awsuser 457
10ef3990f05c9f8 rs-dc2-8xlarge-2 0 0 0 0 awsuser 516
10ef3990f05c9f8 rs-ra3-4xlarge-4 0 0 0 0 awsuser 427
10ef3990f05c9f8 rs-ra3-4xlarge-4-cs 0 0 0 0 awsuser 423
10ef3990f05c9f8 rs-ra3-4xlarge-5 0 0 0 0 awsuser 412
10ef3990f05c9f8 rs-ra3-4xlarge-5-cs 0 0 0 0 awsuser 390

Table 2: Performance Comparison Per Query

Access permissions and security

To deploy this solution, you need administrator access on the AWS accounts where you plan to deploy the AWS CloudFormation resources for this solution.

The CloudFormation template provisions all the required resources using security best practices based on the principle of least privilege and hosts all resources within your account's VPC. Access to the Amazon Redshift clusters is controlled with the CloudFormation template parameter OnPremisesCIDR, which you need to provide to allow on-premises users to connect to the new clusters using their SQL clients on the Amazon Redshift port.

Access permissions for all the resources are controlled using AWS Identity and Access Management (IAM) roles granting appropriate permissions to Amazon Redshift, AWS Lambda, AWS Step Functions, AWS Glue, and AWS Batch. Read and write access privileges are granted to the Amazon Redshift clusters and AWS Batch jobs on the S3 bucket created by the CloudFormation template so that it can read and update data and metadata configurations from that bucket. Read and write access privileges are also granted on the S3 bucket where the user configuration JSON file is uploaded.

Troubleshooting

There might be some rare instances in which failures occur in the state machine running this solution. To troubleshoot, refer to its logs, along with logs from the AWS Batch jobs in Amazon CloudWatch Logs. To view the AWS Batch logs, navigate to the Amazon CloudWatch console and choose Logs in the navigation pane. Find the log group with name <Your CloudFormation Stack Name>/log and choose the latest log streams.

To view the Step Functions logs, navigate to the state machine’s latest run on the Step Functions console and choose CloudWatch Logs for the failed Step Functions step.

After you fix the issue, you can restart the state machine by choosing New execution.

Clean up

Running this template in your AWS account may have some cost implications because it provisions new Amazon Redshift clusters, which may be charged as on-demand instances if you don’t have Reserved Instances. When you complete your evaluations, we recommend deleting the Amazon Redshift clusters to save cost. We also recommend pausing your clusters when not in use. If you don’t plan to run this test in future, you should also delete the CloudFormation stack, which deletes all the resources it created.

Conclusion

In this post, we walked you through the process to find the correct size of your Amazon Redshift cluster based on your performance requirements. You can start with our easy-to-use, out-of-the-box Amazon Redshift sizing calculator, or you can use the Amazon Redshift Node Configuration Comparison tool to evaluate the performance of your workload and find the best cluster configuration that meets your need.


About the Authors

Manash Deb is a Software Development Engineer in the AWS Directory Service team. He has worked on building end-to-end applications with different databases and technologies for over 15 years. He loves learning new technologies and solving, automating, and simplifying customer problems on AWS.

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 13 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with the use of cloud solutions.

Federated authentication to Amazon Redshift using AWS Single Sign-On

Post Syndicated from Manash Deb original https://aws.amazon.com/blogs/big-data/federated-authentication-to-amazon-redshift-using-aws-single-sign-on/

Managing database users through identity federation allows you to manage authentication and authorization procedures centrally. Amazon Redshift, a fast, fully managed cloud data warehouse, provides browser-based plugins for its JDBC/ODBC drivers, which help you easily implement identity federation with multi-factor authentication (MFA) to secure your data warehouse, and also help you automate and enforce data access policies across the organization.

AWS Single Sign-On (AWS SSO) provides tools to federate user access to the AWS environment. AWS SSO integrates with AWS Organizations to manage access to all the AWS accounts under the organization. In our previous post, we explained how you can integrate the Amazon Redshift browser-based Security Assertion Markup Language (SAML) plugin to add SSO and MFA capability with your federation identity provider (IdP). We expand on that in this post to show how you can set up this federated authentication to connect users to Amazon Redshift through AWS SSO integrated with a supported identity source directory of your choice, such as the native AWS SSO identity store, AWS managed, self-managed, or on-premises Microsoft Active Directory (AD), or an external IdP such as Okta, Azure AD, or Ping.

Solution overview

When you connect to Amazon Redshift using a JDBC/ODBC client, you can use the Amazon Redshift browser SAML plugin to launch a custom AWS SSO SAML application, which provides the SAML attributes required to connect to Amazon Redshift, after authenticating the user identity against the identity source directory that you have integrated with AWS SSO. The SAML application uses the user’s identity source credentials to get their user and group attributes, and grants the appropriate Amazon Redshift database access privileges to the user. The following diagram illustrates this workflow.

The following are the high-level steps for this setup:

  1. Enable AWS SSO (linked with your preferred identity source directory) and set up custom SAML applications in AWS SSO with the appropriate user or group association and attribute mappings.
  2. Set up a SAML IdP for AWS SSO and link it to an AWS Identity and Access Management (IAM) role with appropriate permissions to access Amazon Redshift.
  3. Set up the Amazon Redshift cluster and database groups that correspond to your AWS SSO directory groups.
  4. Configure the JDBC/ODBC client to authenticate with your AWS SSO SAML application and obtain federated IAM credentials to access the Amazon Redshift cluster.

Prerequisites

You need the following prerequisites to set up this solution:

Use case

In this example use case, we use AWS SSO integrated with AWS Managed Microsoft AD as the identity source to connect to an Amazon Redshift cluster as users from two different AD groups: BI users and analysts. We create two SAML applications in AWS SSO to map these two groups with their respective users and then connect to Amazon Redshift using SQL Workbench/J client via AWS SSO using their Microsoft AD user credentials.

As a prerequisite step, we have already set up an AWS Managed Microsoft AD directory with sample directory groups and user mappings, and attached it to AWS SSO as the identity source.

The following screenshots show our AD groups and user mappings.

The following screenshot shows our AWS SSO identity source mapping.

Configure AWS SSO

As mentioned in the prerequisites section, you need to enable AWS SSO in your account, and map it with a supported identity source. If AWS SSO isn’t configured in your account, follow the steps in Getting Started.

In this step, you create two custom SAML applications in AWS SSO.

  1. On the AWS SSO console, choose Applications in the navigation pane.
  2. Choose Add a new application.
  3. Choose Add a custom SAML 2.0 application.
  4. For Display name, enter an appropriate name for the SAML application (for this post, because we create two applications, we first enter Redshift-SAML-BI-User).
  5. In the Application metadata section, choose the option to manually enter the metadata values.
  6. For Application ACS URL, enter http://localhost:7890/redshift/.
  7. For Application SAML audience, enter urn:amazon:webservices:redshift.
  8. On the Configuration tab, choose Download to download the AWS SSO SAML metadata file.

We use this file later to create the IdP.

  1. On the Assigned users tab, choose Assign users to add bi_users_group to this application.
  2. On the Attribute mappings tab, add the custom attribute mappings from the following table.
User attribute in the application Maps to this string value or user attribute in AWS SSO Description
Subject ${user:email} User identity
https://aws.amazon.com/SAML/Attributes/RoleSessionName ${user:email} Identification for the user session, which in most cases is the email_id of the user
https://redshift.amazon.com/SAML/Attributes/AutoCreate True If this parameter is set, new users authenticated by the IdP are automatically created in Amazon Redshift
https://aws.amazon.com/SAML/Attributes/Role arn:aws:iam::<yourAWSAccountID>:role/redshift-federation-role,arn:aws:iam::<yourAWSAccountID>:saml-provider/redshift-federation-saml-provider aws_idp_iam_role_arn, aws_identity_provider_arn
https://redshift.amazon.com/SAML/Attributes/DbUser ${user:email} Identification for the user session, which in most cases is the email_id of the user
https://redshift.amazon.com/SAML/Attributes/DbGroups bi_users_group Amazon Redshift database group names for the user, which in most cases is the same as the directory groups the user belongs to

The IAM role and IdP names and ARN strings entered for the https://aws.amazon.com/SAML/Attributes/Role attribute mapping must match the names given while creating those IAM resources for the BI user group during the IAM role setup in the next section.

  1. Choose Dashboard in the navigation pane and choose the User portal URL link to launch the user portal’s login page in a new browser window.
  2. Log in to the AWS SSO user portal using your Microsoft AD user credentials for the BI user.

After logging in, you can see the new SAML application we created.

  1. Choose the application (right-click) and copy the link address.

We need this application login URL in a later step to connect to Amazon Redshift using AWS SSO federation.

  1. Similar to the preceding steps, create another SAML application called Redshift-SAML-Analyst-User and assign the analyst group to this application.
  2. On the application’s Attribute mappings tab, add all the attribute mappings similar to the previous application, but with different mapping values for the Role and DbGroups attributes:
    1. The DbGroups parameter should be mapped to analysts_group.
    2. The Role parameter value entered should match the names of IAM resources created for the analyst user group in the next section.
  1. Log in to the AWS SSO user portal using your Microsoft AD user credentials for the analyst user.
  2. Copy the application link address, which you need in a later step to connect to Amazon Redshift using AWS SSO federation.

Set up IAM roles and SAML IdPs

In this step, you set up two SAML IdPs and two IAM roles with appropriate permissions for the two AD directory groups to access your Amazon Redshift cluster. You need the SAML Metadata XML files downloaded from your AWS SSO SAML applications from the previous section. As outlined in our earlier post, you may follow the step-by-step process to add the IdPs and IAM roles manually, or use the following AWS CloudFormation template.

Next, you need to confirm or replace the role ARN and the IdP ARN values in the two SAML applications’ attribute mappings. Refer to the following screenshot for the location of the two ARN values on the IAM console.

Set up an Amazon Redshift cluster

If you haven’t set up an Amazon Redshift cluster yet, see Getting started with Amazon Redshift for a step-by-step guide to create a new cluster in your AWS account.

If you already have an Amazon Redshift cluster, note the admin user credentials for that cluster and connect to that cluster using a SQL client like SQL Workbench/J and the latest Amazon Redshift JDBC driver.

After logging in to your Amazon Redshift cluster as an admin user, you can set up database objects and appropriate access permissions for them. In the following code, we set up two schemas for analysts and BI users, and then grant access on them to the relevant groups:

CREATE GROUP analysts_group;
CREATE GROUP bi_users_group;

CREATE SCHEMA IF NOT EXISTS analysts_schema;
GRANT USAGE ON SCHEMA analysts_schema TO GROUP analysts_group;
ALTER DEFAULT PRIVILEGES IN SCHEMA analysts_schema GRANT SELECT ON TABLES TO GROUP analysts_group;
GRANT SELECT ON ALL TABLES IN SCHEMA analysts_schema TO GROUP analysts_group;

CREATE SCHEMA IF NOT EXISTS bi_schema;
GRANT USAGE ON SCHEMA bi_schema TO GROUP bi_users_group;
ALTER DEFAULT PRIVILEGES IN SCHEMA bi_schema GRANT SELECT ON TABLES TO GROUP bi_users_group;
GRANT SELECT ON ALL TABLES IN SCHEMA bi_schema TO GROUP bi_users_group;

Connect to Amazon Redshift with AWS SSO federation

In this step, you connect to your Amazon Redshift cluster from your SQL Workbench/J client using AWS SSO federation.

  1. Create a new connection in SQL Workbench/J and choose Amazon Redshift as the driver.
  2. Enter the following code in the URL section of your connection properties (provide your Amazon Redshift cluster endpoint):
    jdbc:redshift:iam://your_cluster_endpoint

  3. Choose Extended Properties and add the following three properties:
    1. login_url – Enter the BI user group’s SAML application’s login URL you copied in an earlier step.
    2. plugin_name – Enter com.amazon.redshift.plugin.BrowserSamlCredentialsProvider.
    3. idp_response_timeout – Enter 60.
  4. Choose OK, and connect to your cluster.

This launches your AWS SSO SAML application’s sign-in page in a browser window. After you successfully authenticate using the BI user’s AD user credentials in the browser, the SQL client connects you to Amazon Redshift as the BI user under the bi_users_group database group. You can verify the user’s database session and group association by running the following SQL:

select * from stv_sessions where starttime > sysdate-1 order by 1 desc;

  1. Similar to the BI user, you can create a new database connection to test an analyst user login. Instead of adding the extended properties in SQL Workbench, you can also use an initialization file to add the JDBC connection properties. For that, create a file rsjdbc.ini on your file system with the following contents (provide the analyst group’s SAML application’s login URL you copied earlier):
[ANALYST]
login_url=https://xxxxxxxxxx.awsapps.com/start/#/saml/default/Redshift-SAML-Analyst-User/ins-xxxxxxxx
plugin_name=com.amazon.redshift.plugin.BrowserSamlCredentialsProvider
idp_response_timeout=60
  1. Enter the following in the URL section of your connection properties (provide your cluster endpoint and file system path for the rsjdbc.ini file you created in the previous step):
jdbc:redshift:iam://your_cluster_endpoint?inifile=/network_path/rsjdbc.ini&IniSection=ANALYST

Your connection profile should look like the following screenshot.

  1. Choose OK to connect.

After you authenticate using the analyst user’s AD user credentials in the browser, you should be logged in to Amazon Redshift as the analyst user mapped to the analysts_group, as shown in the following screenshot.

Conclusion

In this post, we showed how you can use AWS SSO with AWS Managed Microsoft AD to federate access to Amazon Redshift using identity federation. The same setup instructions can also work with any other supported identity source directory of your choice, such as the native AWS SSO identity store, self-managed or on-premises Active Directory, or an external identity provider (IdP) such as Okta, Azure AD, and Ping.


About the Authors

Manash Deb is a Software Development Engineer in the AWS Redshift team. He has worked on building end-to-end data-driven solutions with different database and data warehousing technologies for over 15 years. He loves learning new technologies and solving, automating, and simplifying customer problems with easy-to-use cloud data solutions on AWS.

Manish Vazirani is an Analytics Specialist Solutions Architect at Amazon Web Services.

Rajesh Mohan is an SDE-II at Amazon Studios, where he engineers solutions to build a media supply chain that integrates content vendors with Prime Video. In his free time, he enjoys wandering the streets of New York City, trekking the mountains of California, and binging on food videos while chomping away on delicious food from around the world.

Amazon Redshift identity federation with multi-factor authentication

Post Syndicated from Manash Deb original https://aws.amazon.com/blogs/big-data/amazon-redshift-identity-federation-with-multi-factor-authentication/

Password-based access control alone is not considered secure enough, and many organizations are adopting multi-factor authentication (MFA) and single sign-on (SSO) as a de facto standard to prevent unauthorized access to systems and data. SSO frees up time and resources for both administrators and end users from the painful process of password-based credential management. MFA capability adds an extra layer of protection, improving organizational security and confidence.

Amazon Redshift is a fast, fully managed cloud data warehouse that provides browser-based plugins for its JDBC/ODBC drivers, which help you easily implement SSO with MFA to secure your data warehouse, and also help you automate and enforce data access policies across the organization through directory federation.

You can integrate Amazon Redshift with different identity providers (IdPs) like Microsoft Azure Active Directory, Ping, Okta, OneLogin, and more. For more information, see Using a credentials provider plugin. You may already have pre-existing integrations for federating to AWS using industry standard Security Assertion Markup Language (SAML) with these IdPs. In this post, we explain the steps to integrate the Amazon Redshift browser-based SAML plugin to add SSO and MFA capability with your federation IdP.

Solution overview

With this integration, users get authenticated to Amazon Redshift using the SSO and MFA credentials of the IdP application, which uses SAML tokens to map the IdP user identity (like login name or email address) as the database user in Amazon Redshift. It can also map users’ directory group memberships to corresponding database groups in Amazon Redshift, which allows you to control authorization grants for database objects in Amazon Redshift.

The following diagram illustrates our solution architecture.

High-level steps for SSO-MFA integration

The following diagram illustrates the authentication flow with the browser SAML plugin.

We complete the following high-level steps to set up the SSO and MFA authentication capability to an Amazon Redshift data warehouse using a browser-based SAML plugin with our IdP:

  1. Create a custom SAML 2.0 application with the IdP with the following configurations:
    1. A redirect URI (for example, http://localhost:7890/redshift/).
    2. MFA capability enabled.
    3. Relevant SAML claim attributes.
    4. Appropriate directory groups and users with the IdP.
  2. Add appropriate AWS Identity and Access Management (IAM) permissions:
    1. Add an IdP.
    2. Add appropriate IAM roles for the IdP.
    3. Use IAM policies to add appropriate permissions to the roles to access the Amazon Redshift cluster.
  3. Set up Amazon Redshift with group-level access control:
    1. Connect to Amazon Redshift using superuser credentials.
    2. Set up appropriate database groups in Amazon Redshift.
    3. Grant access permissions appropriate to relevant groups.
  4. Connect to Amazon Redshift with your JDBC/ODBC SQL client:
    1. Configure connection attributes for the IdP.
    2. Enable browser-based MFA.
    3. Connect to Amazon Redshift.

Create a custom SAML 2.0 application with the IdP

The first step of this setup is to create a SAML 2.0 application with your IdP. The various directory groups and users that need to log in to the Amazon Redshift cluster should have access to this SAML application. Provide an appropriate redirect_uri (for example, http://localhost:7890/redshift/) in the SAML application configuration, so that the IdP can seamlessly redirect SAML responses over HTTPS to this URI, which then allows the Amazon Redshift JDBC/ODBC driver to authenticate and authorize the user.

The following screenshot shows a SAML application configuration with PingOne as the IdP (for more details on PingOne Amazon Redshift SSO federation, see Federating single sign-on access to your Amazon Redshift cluster with PingIdentity).

You need to download the metadata XML file from the provider (as shown in the preceding screenshot) and use it in a later step to create a SAML IdP in IAM.

Next, you can enable MFA for this application so that users are authorized to access Amazon Redshift only after they pass two-factor authentication.

The following screenshot shows the MFA configuration settings with PingOne as the IdP.

As part of the IdP application setup, map the following claim attributes so that Amazon Redshift can access them using the SAML response.

  • Role (https://aws.amazon.com/SAML/Attributes/Role) – The IAM role ARN for the IdP and the IdP ARN, separated by a comma. Example: arn:aws:iam::<account>:role/PingOne_Redshift_SSO_Role,arn:aws:iam::<account>:saml-provider/PingOne
  • RoleSessionName (https://aws.amazon.com/SAML/Attributes/RoleSessionName) – Identification for the user session, which in most cases is the email ID of the user. Example: email
  • AutoCreate (https://redshift.amazon.com/SAML/Attributes/AutoCreate) – If this parameter is set, new users authenticated by the IdP are automatically created in Amazon Redshift. Example: "true"
  • DbUser (https://redshift.amazon.com/SAML/Attributes/DbUser) – Identification for the user session, which in most cases is the email ID of the user. Example: email
  • DbGroups (https://redshift.amazon.com/SAML/Attributes/DbGroups) – Amazon Redshift database group names for the user, which in most cases are the same as the directory groups the user belongs to. Example: data_scientist

The following screenshot is an example of these claim attributes set up for PingOne as IdP.

Apart from setting up the SAML application, you also need to set up appropriate directory groups and users with your IdP, which you use to grant users SSO and MFA access to different applications, such as AWS Single Sign-On and Amazon Redshift single sign-on.

The following screenshot shows an example of this user and group setup with PingOne as the IdP.

Add appropriate permissions using IAM

After you complete the configuration with the IdP, the next step is to configure appropriate IAM permissions in your AWS account for identity federation.

The first step is to add an IdP using the SAML metadata XML file downloaded from the IdP’s SAML application you created in the previous step.

After you add the IdP, you need to create an IAM role with that IdP as a trusted entity.

Set the value of the SAML:aud attribute to the same redirect URI defined in your IdP’s SAML application setup (for example, http://localhost:7890/redshift/).
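
If you want to script this IAM setup instead of using the console or the CloudFormation template described later, the following is a minimal boto3 sketch. It registers the IdP from the downloaded metadata XML and creates a role that trusts it, with the SAML:aud condition set to the redirect URI; the metadata file name is a placeholder, and the provider and role names follow the PingOne example in this post.

import json
import boto3

iam = boto3.client("iam")

# Register the IdP in IAM using the metadata XML downloaded from the SAML application.
with open("pingone-metadata.xml") as f:          # placeholder file name
    saml_provider = iam.create_saml_provider(
        SAMLMetadataDocument=f.read(),
        Name="PingOne",                          # provider name from the example in this post
    )

# Trust policy: allow federated users from this IdP to assume the role,
# but only when the SAML audience matches the plugin's redirect URI.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Federated": saml_provider["SAMLProviderArn"]},
        "Action": "sts:AssumeRoleWithSAML",
        "Condition": {"StringEquals": {"SAML:aud": "http://localhost:7890/redshift/"}},
    }],
}

iam.create_role(
    RoleName="PingOne_Redshift_SSO_Role",        # role name from the example in this post
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)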

Create a new IAM policy with the permissions users need to access Amazon Redshift, and attach it to the IAM role you created earlier. See the following sample policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "redshift:CreateClusterUser",
                "redshift:JoinGroup",
                "redshift:GetClusterCredentials",
                "redshift:ListSchemas",
                "redshift:ListTables",
                "redshift:ListDatabases",
                "redshift:ExecuteQuery",
                "redshift:FetchResults",
                "redshift:CancelQuery",
                "redshift:DescribeClusters",
                "redshift:DescribeQuery",
                "redshift:DescribeTable"
            ],
            "Resource": [
                "arn:aws:redshift:<region>:<account>:cluster:<clusterName>",
                "arn:aws:redshift:<region>:<account>:dbuser:<clusterName>/${redshift:DbUser}",
                "arn:aws:redshift:<region>:<account>:dbname:<clusterName>/${redshift:DbName}",
                "arn:aws:redshift:<region>:<account>:dbgroup:<clusterName>/bi_users_group",
                "arn:aws:redshift:<region>:<account>:dbgroup:<clusterName>/etl_users_group",
                "arn:aws:redshift:<region>:<account>:dbgroup:<clusterName>/analysts_group"
            ]
        }
    ]
}
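
For context, the redshift:GetClusterCredentials, CreateClusterUser, and JoinGroup actions in this policy are what allow the driver, after assuming the role with the SAML assertion, to request temporary database credentials for the DbUser and DbGroups supplied in the SAML claims. The following boto3 sketch shows the equivalent call purely for illustration; the cluster, database, and user values are placeholders.

import boto3

redshift = boto3.client("redshift")

# Equivalent of what the JDBC/ODBC driver does after the SAML hand-off:
# exchange IAM permissions for short-lived database credentials.
creds = redshift.get_cluster_credentials(
    ClusterIdentifier="my-redshift-cluster",     # placeholder cluster name
    DbName="dev",                                # placeholder database name
    DbUser="user@example.com",                   # comes from the DbUser SAML claim
    DbGroups=["bi_users_group"],                 # comes from the DbGroups SAML claim
    AutoCreate=True,                             # honors the AutoCreate SAML claim
    DurationSeconds=900,
)
print(creds["DbUser"], creds["Expiration"])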

You can also use an AWS CloudFormation template to automate this IAM setup by uploading the IdP-specific SAML metadata XML file from the SAML application you created.

The following template takes care of creating the IAM resources required for this setup. You need to enter the following parameters to the template:

  • FederationProviderName – Enter a suitable name for the IAM IdP.
  • FederationXmlS3Location – Enter the Amazon Simple Storage Service (Amazon S3) URI where you uploaded the SAML metadata XML file from your IdP’s SAML application.
  • RedshiftClusterEndpoint – Enter the endpoint URI of your Amazon Redshift cluster. You can get this URI via the Amazon Redshift console. If you have multiple Amazon Redshift clusters, you may need to modify this CloudFormation template to add permissions for all the clusters in your account.

Grant group-level access control with Amazon Redshift

If you haven’t set up an Amazon Redshift cluster yet, see Getting started with Amazon Redshift for a step-by-step guide to create a new cluster in your AWS account.

If you already have an Amazon Redshift cluster, note the primary user credentials for that cluster and connect to it using a SQL client like SQL Workbench/J with the latest Amazon Redshift JDBC driver (with the AWS SDK included).

When you’re logged in, you need to set up the appropriate groups in Amazon Redshift. The following example code sets up three database groups for business intelligence (BI) users, analysts, and a cross-user group in Amazon Redshift:

CREATE GROUP bi_users_group;
CREATE GROUP analysts_group;
CREATE GROUP cross_user_group;

You can then set up database objects and appropriate access permissions for them. In the following code, we set up two schemas for analysts and BI users and then grant access on them to the relevant groups:

CREATE SCHEMA IF NOT EXISTS bi_schema;
CREATE SCHEMA IF NOT EXISTS analysts_schema;

GRANT SELECT ON ALL TABLES IN SCHEMA bi_schema TO GROUP bi_users_group, GROUP cross_user_group;
ALTER DEFAULT PRIVILEGES IN SCHEMA bi_schema GRANT SELECT ON TABLES TO GROUP bi_users_group, GROUP cross_user_group;

GRANT SELECT ON ALL TABLES IN SCHEMA analysts_schema TO GROUP analysts_group, GROUP cross_user_group;
ALTER DEFAULT PRIVILEGES IN SCHEMA analysts_schema GRANT SELECT ON TABLES TO GROUP analysts_group, GROUP cross_user_group;

These group-level grants allow federated users to access Amazon Redshift objects based on their associated permissions. As explained earlier in this post, you can map your IdP directory groups to their respective database groups in Amazon Redshift, which allows you to control both authentication and authorization to Amazon Redshift based on the IdP credentials.

However, you may choose to control the authorization part within the database itself instead of relying on IdP directory groups. In this case, you use the IdP only to facilitate system authentication to Amazon Redshift, but for data authorization, you map the users and groups manually using alter group statements, as in the following code:

CREATE USER "exampleuser@example.com" PASSWORD DISABLE;
ALTER GROUP bi_users_group ADD USER "exampleuser@example.com";
ALTER GROUP cross_user_group ADD USER "exampleuser@example.com";

In the preceding example, we create a new user, exampleuser, with its password disabled. This user authenticates with the IdP credentials and therefore doesn’t need a database password. To provide authorization, we add the user to the bi_users_group and cross_user_group groups so that it inherits the permissions granted to those groups and works seamlessly with SSO and MFA federation.

Configure your JDBC/ODBC SQL client to use the browser-based plugin to connect to Amazon Redshift

In this step, you can test Amazon Redshift connectivity through your IdP using a SQL client like SQL Workbench/J.

You need to provide the following configurations in the SQL client.

  • Driver – Amazon Redshift
  • URL – jdbc:redshift:iam://<your-redshift-cluster-endpoint>

Additionally, you need to set up the following extended properties.

  • login_url – The SAML application’s login page
  • plugin_name – com.amazon.redshift.plugin.BrowserSamlCredentialsProvider
  • idp_response_timeout – The number of seconds to allow for SSO authentication to complete before timing out

The following screenshot shows the configuration to connect the SQL Workbench/J client with PingOne as the IdP.

We use the following property values:

  • login_url – https://sso.connect.pingidentity.com/sso/sp/initsso?saasid=<your_saas_id>&idpid=<your_idp_id> (the SAML application’s SSO URL from your IdP)
  • plugin_name – com.amazon.redshift.plugin.BrowserSamlCredentialsProvider
  • idp_response_timeout – 120
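
If you want to verify the same configuration from Python instead of SQL Workbench/J, the following sketch assumes the open-source Amazon Redshift Python connector (redshift_connector), which supports the same browser-based SAML plugin; the endpoint, database, and login_url values are placeholders, and parameter names may vary slightly between driver versions.

import redshift_connector

# Opens a browser window for IdP login (including MFA), then connects with
# temporary IAM-based credentials -- no database password is involved.
# The plugin listens on localhost:7890 by default, matching the redirect URI.
conn = redshift_connector.connect(
    iam=True,
    credentials_provider="BrowserSamlCredentialsProvider",
    login_url="https://sso.connect.pingidentity.com/sso/sp/initsso?saasid=<your_saas_id>&idpid=<your_idp_id>",
    idp_response_timeout=120,
    host="<your-redshift-cluster-endpoint>",
    database="dev",                              # placeholder database name
    cluster_identifier="<your-cluster-identifier>",
)

# Confirm which database user the SAML claims mapped this session to.
cursor = conn.cursor()
cursor.execute("select current_user;")
print(cursor.fetchone())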

When you choose Test or OK, a new web browser window opens that shows the SAML application’s login page.

If this is the first time you’re logging in to PingOne, and haven’t set up MFA before, you can download and pair the PingID mobile app on iOS or Android.

After the PingID app is installed and paired, it pushes a notification to your phone to approve or deny the MFA authorization. When the MFA succeeds, the browser displays a success message on the redirect page.

After the connection is successful, let’s run a SQL query to confirm that the correct user identity was passed and that the correct database group was mapped for this SQL session, based on the user’s directory group. In this case, the user manish was mapped to the bi_users_group directory group in PingOne, so we should see the SQL session reflect the corresponding database group in Amazon Redshift.

We were able to successfully accomplish MFA-based SSO identity federation with PingOne using the browser-based plugin that Amazon Redshift provides.

IdP-specific configurations

As mentioned earlier, the first step of this process is to set up SSO for Amazon Redshift with your IdP. The setup steps may vary depending on the provider; refer to your IdP’s documentation for the provider-specific configuration, and to the related AWS blog posts and videos that walk through these setups in action.

Conclusion

Amazon Redshift makes it easy to integrate identity federation with your existing third-party identity providers, allowing you to centralize user and access management in a single corporate directory. In this post, we showed how the Amazon Redshift browser-based plugin works with popular SAML-based IdPs to provide an additional security layer with MFA authentication. You can also use the instructions in this post to set up various SAML-based IdPs (like Ping, Okta, JumpCloud, and OneLogin) to implement SSO and MFA with Amazon Redshift.


About the Authors

Manash Deb is a Senior Analytics Specialist Solutions Architect at AWS. He has worked on building end-to-end data-driven solutions in different database and data warehousing technologies for over 15 years. He loves learning new technologies and solving, automating, and simplifying customer problems with easy-to-use cloud data solutions on AWS.

Manish Vazirani is an Analytics Specialist Solutions Architect at Amazon Web Services.

Building an event-driven application with AWS Lambda and the Amazon Redshift Data API

Post Syndicated from Manash Deb original https://aws.amazon.com/blogs/big-data/building-an-event-driven-application-with-aws-lambda-and-the-amazon-redshift-data-api/

Event-driven applications, which run in response to events, are becoming popular with many customers. A primary benefit of this architecture is the decoupling of producer and consumer processes, which allows greater flexibility in application design.

An example of an event-driven application is an automated workflow that is triggered by an event and runs a series of transformations in the data warehouse. At the end of the workflow, another event is emitted to notify end-users that the transformations are complete and that they can start analyzing the transformed dataset.

In this post, we explain how you can easily design a similar event-driven application with Amazon Redshift, AWS Lambda, and Amazon EventBridge. In response to a scheduled event defined in EventBridge, this application automatically triggers a Lambda function to run a stored procedure performing extract, load, and transform (ELT) operations in an Amazon Redshift data warehouse, using its out-of-the-box integration with the Amazon Redshift Data API. This stored procedure copies the source data from Amazon Simple Storage Service (Amazon S3) to Amazon Redshift and aggregates the results. When complete, it sends an event to EventBridge, which triggers a Lambda function to send a notification to end-users through Amazon Simple Notification Service (Amazon SNS), informing them that updated data is available in Amazon Redshift.

This event-driven serverless architecture offers greater extensibility and simplicity, making the solution easier to maintain, faster to release new features for, and less sensitive to change. It also simplifies adding other components or third-party products to the application without major changes.

Prerequisites

As a prerequisite for creating the application in this post, you need to set up an Amazon Redshift cluster and associate it with an AWS Identity and Access Management (IAM) role. For more information, see Getting Started with Amazon Redshift.

Solution overview

The following architecture diagram highlights the end-to-end solution, which you can provision automatically with an AWS CloudFormation template.

The workflow includes the following steps:

  1. The EventBridge rule EventBridgeScheduledEventRule is initiated based on a cron schedule.
  2. The rule triggers the Lambda function LambdaRedshiftDataApiETL, with the action run_sql as an input parameter. The Python code for the Lambda function is available in the GitHub repo.
  3. The function performs an asynchronous call to the stored procedure run_elt_process in Amazon Redshift, performing ELT operations using the Amazon Redshift Data API.
  4. The stored procedure uses the Amazon S3 location event-driven-app-with-lambda-redshift/nyc_yellow_taxi_raw/ as the data source for the ELT process. We have pre-populated this with the NYC Yellow Taxi public dataset for the year 2015 to test this solution.
  5. When the stored procedure is complete, the EventBridge rule EventBridgeRedshiftEventRule is triggered automatically to capture the event, based on the source parameter redshift-data from the Amazon Redshift Data API (see the sample rule sketch after this list).
  6. The rule triggers the Lambda function LambdaRedshiftDataApiETL, with the action notify as an input parameter.
  7. The function uses the SNS topic RedshiftNotificationTopicSNS to send an automated email notification to end-users that the ELT process is complete.
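
To make step 5 concrete, the following is a minimal boto3 sketch of how a completion rule like EventBridgeRedshiftEventRule could be defined. It matches events whose source is the Amazon Redshift Data API; the target ARN is a placeholder, and the static input shown here is a simplification of how the CloudFormation template passes the notify action.

import json
import boto3

events = boto3.client("events")

# Rule that fires when the Amazon Redshift Data API reports statement completion.
events.put_rule(
    Name="EventBridgeRedshiftEventRule",
    EventPattern=json.dumps({
        "source": ["aws.redshift-data"],   # events emitted by the Data API (WithEvent=True)
    }),
    State="ENABLED",
)

# Route matching events to the Lambda function that sends the SNS notification.
# (The function's resource policy must also allow events.amazonaws.com to invoke it.)
events.put_targets(
    Rule="EventBridgeRedshiftEventRule",
    Targets=[{
        "Id": "LambdaRedshiftDataApiETL",
        "Arn": "arn:aws:lambda:<region>:<account>:function:LambdaRedshiftDataApiETL",
        "Input": json.dumps({"action": "notify"}),   # simplified static input
    }],
)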

The Amazon Redshift database objects required for this solution are provisioned automatically during CloudFormation stack creation by the Lambda function LambdaSetupRedshiftObjects, which invokes the function LambdaRedshiftDataApiETL to create the following objects in Amazon Redshift:

  • Table nyc_yellow_taxi, which we use to copy the New York taxi dataset from Amazon S3
  • Materialized view nyc_yellow_taxi_volume_analysis, providing an aggregated view of the table nyc_yellow_taxi
  • Stored procedure run_elt_process to take care of data transformations

The Python code for this function is available in the GitHub repo.

We also use the IAM role LambdaRedshiftDataApiETLRole for the Lambda functions LambdaSetupRedshiftObjects and LambdaRedshiftDataApiETL to allow the following permissions:

  • Federate to the Amazon Redshift cluster through the GetClusterCredentials permission, avoiding password credentials
  • Initiate queries in the Amazon Redshift cluster through redshift-data API calls
  • Log to Amazon CloudWatch for troubleshooting purposes
  • Send notifications through Amazon SNS

A sample IAM role for this function is available in the GitHub repo.

Lambda is a key service in this solution because it initiates queries in Amazon Redshift using the redshift-data client. Based on the input parameter action, this function can asynchronously initiate SQL statements in Amazon Redshift, thereby avoiding the chance of timing out on long-running statements. It can also publish custom notifications through Amazon SNS. The function uses the Amazon Redshift Data API temporary credentials functionality, which allows it to communicate with Amazon Redshift using IAM permissions, without any password-based authentication. With the Data API, you also don’t need to configure drivers and connections for your Amazon Redshift cluster, because that’s handled automatically. The following sketch illustrates the idea.
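
The following is a minimal, simplified sketch of such a handler. The environment variable names and event shape are illustrative assumptions; the actual function in the GitHub repo handles more cases.

import os
import boto3

redshift_data = boto3.client("redshift-data")
sns = boto3.client("sns")

def handler(event, context):
    """Run SQL in Amazon Redshift via the Data API, or notify users via SNS."""
    action = event.get("action")  # illustrative event shape; the repo's handler differs

    if action == "run_sql":
        # Asynchronous call: execute_statement returns immediately with a statement Id,
        # so the Lambda function never waits on (or times out behind) a long-running query.
        # WithEvent=True makes the Data API emit an EventBridge event when the SQL finishes.
        response = redshift_data.execute_statement(
            ClusterIdentifier=os.environ["REDSHIFT_CLUSTER_ID"],
            Database=os.environ["REDSHIFT_DATABASE"],
            DbUser=os.environ["REDSHIFT_DB_USER"],  # temporary IAM credentials, no password
            Sql=os.environ.get("SQL_TEXT", "call run_elt_process();"),
            WithEvent=True,
        )
        return {"statementId": response["Id"]}

    if action == "notify":
        sns.publish(
            TopicArn=os.environ["SNS_TOPIC_ARN"],
            Subject="ELT process complete",
            Message="The Amazon Redshift ELT process has finished; updated data is available.",
        )
        return {"notified": True}

    raise ValueError(f"Unsupported action: {action}")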

Deploying the CloudFormation template

When your Amazon Redshift cluster is set up, use the provided CloudFormation template to automatically create all required resources for this solution in your AWS account. For more information, see Getting started with AWS CloudFormation.

The template requires you to provide the following parameters:

  • RedshiftClusterIdentifier – Cluster identifier for your Amazon Redshift cluster.
  • DbUsername – Amazon Redshift database user name that has access to run the SQL script.
  • DatabaseName – Name of the Amazon Redshift database where the SQL script runs.
  • RedshiftIAMRoleARN – ARN of the IAM role associated with the Amazon Redshift cluster.
  • NotificationEmailId – Email to send event notifications through Amazon SNS.
  • ExecutionSchedule – Cron expression to schedule the ELT process through an EventBridge rule.
  • SqlText – SQL text to run as part of the ELT process. Don’t change the default value (call run_elt_process();) if you want to test this solution with the test dataset provided for this post.

The following screenshot shows the stack details on the AWS CloudFormation console.

Testing the pipeline

After setting up the architecture, you have an automated pipeline that triggers based on the schedule you defined in the EventBridge rule’s cron expression. You can view the CloudWatch logs to troubleshoot issues in the Lambda function. The following screenshot shows the logs for our setup.

You can also view the query status on the Amazon Redshift console, which lets you view detailed execution plans for the queries you ran. Although the stored procedure may take around 6 minutes to complete, the Lambda function finishes in seconds because its call to Amazon Redshift is asynchronous: the function completes as soon as it initiates the process, without waiting for the query to finish.
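
If you want to check on a submitted statement yourself, you can poll the Data API from any client with redshift-data permissions. The following is a small sketch, assuming you kept the statement Id returned by execute_statement.

import boto3

redshift_data = boto3.client("redshift-data")

def get_statement_status(statement_id: str) -> str:
    """Return the current status of a Data API statement submitted earlier."""
    response = redshift_data.describe_statement(Id=statement_id)
    # Status is one of SUBMITTED, PICKED, STARTED, FINISHED, ABORTED, or FAILED.
    return response["Status"]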

When this process finishes, you receive an email notification that the ELT process is complete.

You may then view the updated data in your business intelligence tool, like Amazon QuickSight, or query the data directly in the Amazon Redshift query editor (see the following screenshot) to view the most recent data processed by this event-driven architecture.

Conclusion

The Amazon Redshift Data API enables you to interact with Amazon Redshift painlessly and to build event-driven and cloud-native applications. We demonstrated how to build an event-driven application with Amazon Redshift, Lambda, and EventBridge. For more information about the Data API, see Using the Amazon Redshift Data API to interact with Amazon Redshift clusters and Using the Amazon Redshift Data API.


About the Authors

Manash Deb is a Senior Analytics Specialist Solutions Architect. He has worked in different database and data warehousing technologies for more than 15 years.

Debu Panda, a senior product manager at AWS, is an industry leader in analytics, application platform, and database technologies. He has more than 20 years of experience in the IT industry, has published numerous articles on analytics, enterprise Java, and databases, and has presented at multiple conferences. He is the lead author of EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

Fei Peng is a Software Dev Engineer working in the Amazon Redshift team.