Tag Archives: Amazon Redshift

Centralize near-real-time governance through alerts on Amazon Redshift data warehouses for sensitive queries

Post Syndicated from Rajdip Chaudhuri original https://aws.amazon.com/blogs/big-data/centralize-near-real-time-governance-through-alerts-on-amazon-redshift-data-warehouses-for-sensitive-queries/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud that delivers powerful and secure insights on all your data with the best price-performance. With Amazon Redshift, you can analyze your data to derive holistic insights about your business and your customers. In many organizations, one or multiple Amazon Redshift data warehouses run daily for data and analytics purposes. Over time, many Data Definition Language (DDL) or Data Control Language (DCL) queries, such as CREATE, ALTER, DROP, GRANT, or REVOKE SQL queries, are run on the data warehouse. These queries are sensitive in nature because they can drop tables or delete data, causing disruptions or outages. Tracking such user queries as part of the centralized governance of the data warehouse helps stakeholders understand potential risks and take prompt action to mitigate them, following the operational excellence pillar of the AWS Data Analytics Lens. For a robust governance mechanism, it’s therefore crucial to alert or notify the database and security administrators about the sensitive queries that are run on the data warehouse, so that prompt remediation actions can be taken if needed.

To address this, in this post we show you how you can automate near-real-time notifications over a Slack channel when certain queries are run on the data warehouse. We also create a simple governance dashboard using a combination of Amazon DynamoDB, Amazon Athena, and Amazon QuickSight.

Solution overview

An Amazon Redshift data warehouse logs information about connections and user activities taking place in its databases, which helps monitor the data warehouse for security and troubleshooting purposes. These logs can be stored in Amazon Simple Storage Service (Amazon S3) buckets or Amazon CloudWatch. Amazon Redshift writes the following log files, and this solution uses CloudWatch as the destination for the audit logs:

  • Connection log – Logs authentication attempts, connections, and disconnections
  • User log – Logs information about changes to database user definitions
  • User activity log – Logs each query before it’s run on the database

The following diagram illustrates the solution architecture.

Solution Architecture

The solution workflow consists of the following steps:

  1. Audit logging is enabled in each Amazon Redshift data warehouse to capture the user activity log in CloudWatch.
  2. Subscription filters on CloudWatch capture the required DDL and DCL commands by providing filter criteria.
  3. The subscription filter triggers an AWS Lambda function for pattern matching.
  4. The Lambda function processes the event data and sends the notification over a Slack channel using a webhook.
  5. The Lambda function stores the data in a DynamoDB table over which a simple dashboard is built using Athena and QuickSight.

Prerequisites

Before starting the implementation, make sure the following requirements are met:

  • You have an AWS account.
  • The AWS Region used for this post is us-east-1. However, this solution is relevant in any other Region where the necessary AWS services are available.
  • Permissions to create a Slack workspace.

Create and configure an Amazon Redshift cluster

To set up your cluster, complete the following steps:

  1. Create a provisioned Amazon Redshift data warehouse.

For this post, we use three Amazon Redshift data warehouses: demo-cluster-ou1, demo-cluster-ou2, and demo-cluster-ou3. All of them are provisioned clusters; however, the same solution also applies to Amazon Redshift Serverless.

  2. To enable audit logging with CloudWatch as the log delivery destination, open an Amazon Redshift cluster and go to the Properties tab.
  3. On the Edit menu, choose Edit audit logging.

Redshift edit audit logging

  4. Select Turn on under Configure audit logging.
  5. Select CloudWatch for Log export type.
  6. Select all three options for User log, Connection log, and User activity log.
  7. Choose Save changes.

  8. Create a parameter group for the clusters with enable_user_activity_logging set as true for each of the clusters.
  9. Modify the cluster to attach the new parameter group to the Amazon Redshift cluster.

For this post, we create three custom parameter groups: custom-param-grp-1, custom-param-grp-2, and custom-param-grp-3 for the three clusters.

Note that if you enable audit logging but don’t set the associated parameter, the audit logs capture information for only the connection log and user log, not for the user activity log.
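
If you prefer to script this configuration, the following is a minimal boto3 sketch that creates a custom parameter group, enables enable_user_activity_logging, and attaches the parameter group to one of the clusters used in this post. Adjust the names for your environment; a cluster reboot may be required for the parameter change to take effect.

import boto3

redshift = boto3.client('redshift')

# Create a custom parameter group for the cluster
redshift.create_cluster_parameter_group(
    ParameterGroupName='custom-param-grp-1',
    ParameterGroupFamily='redshift-1.0',
    Description='Parameter group with user activity logging enabled'
)

# Enable the user activity log parameter
redshift.modify_cluster_parameter_group(
    ParameterGroupName='custom-param-grp-1',
    Parameters=[{'ParameterName': 'enable_user_activity_logging', 'ParameterValue': 'true'}]
)

# Attach the parameter group to the cluster (a reboot may be needed for the change to apply)
redshift.modify_cluster(
    ClusterIdentifier='demo-cluster-ou1',
    ClusterParameterGroupName='custom-param-grp-1'
)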

  10. On the CloudWatch console, choose Log groups under Logs in the navigation pane.
  11. Search for /aws/redshift/cluster/demo.

This will show all the log groups created for the Amazon Redshift clusters.

Create a DynamoDB audit table

To create your audit table, complete the following steps:

  1. On the DynamoDB console, choose Tables in the navigation pane.
  2. Choose Create table.
  3. For Table name, enter demo_redshift_audit_logs.
  4. For Partition key, enter partKey with the data type as String.

  5. Keep the table settings as default.
  6. Choose Create table. (A boto3 sketch of these steps follows.)
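
If you prefer to create the table programmatically, the following boto3 sketch creates the same table. It uses on-demand capacity for simplicity, which may differ from the console’s default settings.

import boto3

dynamodb = boto3.client('dynamodb')

# Create the audit table with partKey as the partition key
dynamodb.create_table(
    TableName='demo_redshift_audit_logs',
    AttributeDefinitions=[{'AttributeName': 'partKey', 'AttributeType': 'S'}],
    KeySchema=[{'AttributeName': 'partKey', 'KeyType': 'HASH'}],
    BillingMode='PAY_PER_REQUEST'  # on-demand capacity; the console default settings may differ
)

# Wait until the table is ready before writing to it
dynamodb.get_waiter('table_exists').wait(TableName='demo_redshift_audit_logs')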

Create Slack resources

Slack Incoming Webhooks expect a JSON request with a message string corresponding to a "text" key. They also support message customization, such as adding a user name and icon, or overriding the webhook’s default channel. For more information, see Sending messages using Incoming Webhooks on the Slack website.
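
As a quick illustration, the following minimal Python sketch posts a message to an Incoming Webhook. The webhook URL is a placeholder that you replace with the one for your Slack app.

import json
import urllib3

http = urllib3.PoolManager()

# Placeholder webhook URL; replace with the Incoming Webhook URL for your Slack app
webhook_url = 'https://hooks.slack.com/services/xxxxxxx'

message = {'text': 'Test notification from the Amazon Redshift governance solution'}
resp = http.request(
    'POST',
    webhook_url,
    body=json.dumps(message).encode('utf-8'),
    headers={'Content-Type': 'application/json'}
)
print(resp.status)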

The following resources are created for this post:

  • A Slack workspace named demo_rc
  • A channel named #blog-demo in the newly created Slack workspace
  • A new Slack app in the Slack workspace named demo_redshift_ntfn (using the From Scratch option)
  • The Incoming Webhook URL for the app, which is used later in this post to send the notifications

Create an IAM role and policy

In this section, we create an AWS Identity and Access Management (IAM) policy that will be attached to an IAM role. The role is then used to grant a Lambda function access to a DynamoDB table. The policy also includes permissions to allow the Lambda function to write log files to Amazon CloudWatch Logs.

  1. On the IAM console, choose Policies in the navigation pane.
  2. Choose Create policy.
  3. In the Create policy section, choose the JSON tab and enter the following IAM policy, replacing XXXXXXXX with your AWS account ID.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadWriteTable",
            "Effect": "Allow",
            "Action": [
                "dynamodb:BatchGetItem",
                "dynamodb:BatchWriteItem",
                "dynamodb:PutItem",
                "dynamodb:GetItem",
                "dynamodb:Scan",
                "dynamodb:Query",
                "dynamodb:UpdateItem",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:dynamodb:us-east-1:XXXXXXXX:table/demo_redshift_audit_logs",
                "arn:aws:logs:*:XXXXXXXX:log-group:*:log-stream:*"
            ]
        },
        {
            "Sid": "WriteLogStreamsAndGroups",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:CreateLogGroup"
            ],
            "Resource": "arn:aws:logs:*:XXXXXXXX:log-group:*"
        }
    ]
}
  4. Choose Next: Tags, then choose Next: Review.
  5. Provide the policy name demo_post_policy and choose Create policy.

To apply demo_post_policy to a Lambda function, you first have to attach the policy to an IAM role.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose Create role.
  3. Select AWS service and then select Lambda.
  4. Choose Next.

  5. On the Add permissions page, search for demo_post_policy.
  6. Select demo_post_policy from the list of returned search results, then choose Next.

  7. On the Review page, enter demo_post_role for the role name and an appropriate description, then choose Create role. (A scripted sketch of the policy and role setup follows.)
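
If you would rather script the policy and role creation, the following boto3 sketch shows one way to do it. It assumes the policy JSON shown earlier has been saved locally as demo_post_policy.json (a hypothetical file name).

import json
import boto3

iam = boto3.client('iam')

# Create the policy from the JSON document shown earlier (saved locally for this sketch)
with open('demo_post_policy.json') as f:
    policy_document = f.read()
policy = iam.create_policy(PolicyName='demo_post_policy', PolicyDocument=policy_document)

# Trust policy that lets Lambda assume the role
trust_policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Principal': {'Service': 'lambda.amazonaws.com'},
        'Action': 'sts:AssumeRole'
    }]
}
iam.create_role(RoleName='demo_post_role', AssumeRolePolicyDocument=json.dumps(trust_policy))

# Attach the policy to the role
iam.attach_role_policy(RoleName='demo_post_role', PolicyArn=policy['Policy']['Arn'])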

Create a Lambda function

We create a Lambda function with Python 3.9. In the following code, replace the slack_hook parameter with the Slack webhook you copied earlier:

import gzip
import base64
import json
import boto3
import uuid
import re
import urllib3

http = urllib3.PoolManager()
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table("demo_redshift_audit_logs")
slack_hook = "https://hooks.slack.com/services/xxxxxxx"

def exe_wrapper(data):
    cluster_name = (data['logStream'])
    for event in data['logEvents']:
        message = event['message']
        reg = re.match(r"'(?P<ts>\d{4}-\d\d-\d\dT\d\d:\d\d:\d\dZ).*?\bdb=(?P<db>\S*).*?\buser=(?P<user>\S*).*?LOG:\s+(?P<query>.*?);?$", message)
        if reg is not None:
            filter = reg.groupdict()
            ts = filter['ts']
            db = filter['db']
            user = filter['user']
            query = filter['query']
            query_type = ' '.join((query.split(" "))[0 : 2]).upper()
            object = query.split(" ")[2]
            put_dynamodb(ts,cluster_name,db,user,query,query_type,object,message)
            slack_api(cluster_name,db,user,query,query_type,object)
            
def put_dynamodb(timestamp,cluster,database,user,sql,query_type,object,event):
    table.put_item(Item = {
        'partKey': str(uuid.uuid4()),
        'redshiftCluster': cluster,
        'sqlTimestamp' : timestamp,
        'databaseName' : database,
        'userName': user,
        'sqlQuery': sql,
        'queryType' : query_type,
        'objectName': object,
        'rawData': event
        })
        
def slack_api(cluster,database,user,sql,query_type,object):
    payload = {
	'channel': '#blog-demo',
	'username': 'demo_redshift_ntfn',
	'blocks': [{
			'type': 'section',
			'text': {
				'type': 'mrkdwn',
				'text': 'Detected *{}* command\n *Affected Object*: `{}`'.format(query_type, object)
			}
		},
		{
			'type': 'divider'
		},
		{
			'type': 'section',
			'fields': [{
					'type': 'mrkdwn',
					'text': ':desktop_computer: *Cluster Name:*\n{}'.format(cluster)
				},
				{
					'type': 'mrkdwn',
					'text': ':label: *Query Type:*\n{}'.format(query_type)
				},
				{
					'type': 'mrkdwn',
					'text': ':card_index_dividers: *Database Name:*\n{}'.format(database)
				},
				{
					'type': 'mrkdwn',
					'text': ':technologist: *User Name:*\n{}'.format(user)
				}
			]
		},
		{
			'type': 'section',
			'text': {
				'type': 'mrkdwn',
				'text': ':page_facing_up: *SQL Query*\n ```{}```'.format(sql)
			}
		}
	]
	}
    encoded_msg = json.dumps(payload).encode('utf-8')
    resp = http.request('POST',slack_hook, body=encoded_msg)
    print(encoded_msg) 

def lambda_handler(event, context):
    print(f'Logging Event: {event}')
    print(f"Awslog: {event['awslogs']}")
    encoded_zipped_data = event['awslogs']['data']
    print(f'data: {encoded_zipped_data}')
    print(f'type: {type(encoded_zipped_data)}')
    zipped_data = base64.b64decode(encoded_zipped_data)
    data = json.loads(gzip.decompress(zipped_data))
    exe_wrapper(data)

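To test the function without waiting for real audit traffic, you can construct a test event that mimics the CloudWatch Logs subscription payload (gzip-compressed and base64-encoded). The following sketch builds such an event from one of the sample log lines used later in this post; a real payload also carries additional fields such as messageType and logGroup.

import base64
import gzip
import json

# Minimal payload containing only the fields the function reads (logStream and logEvents)
sample_payload = {
    'logStream': 'demo-cluster-ou1',
    'logEvents': [{
        'id': '1',
        'timestamp': 1680408000000,
        'message': "'2023-04-02T04:18:43Z UTC [ db=dev user=awsuser pid=100 userid=100 xid=100 ]' "
                   "LOG: alter table my_table alter column string type varchar(16);"
    }]
}

# CloudWatch Logs delivers the payload gzip-compressed and base64-encoded
data = base64.b64encode(gzip.compress(json.dumps(sample_payload).encode('utf-8'))).decode('utf-8')
test_event = {'awslogs': {'data': data}}
print(json.dumps(test_event))
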
Create your function with the following steps:

  1. On the Lambda console, choose Create function.
  2. Select Author from scratch and for Function name, enter demo_function.
  3. For Runtime, choose Python 3.9.
  4. For Execution role, select Use an existing role and choose demo_post_role as the IAM role.
  5. Choose Create function.

  6. On the Code tab, enter the preceding Lambda function and replace the Slack webhook URL.
  7. Choose Deploy.

Create a CloudWatch subscription filter

We need to create the CloudWatch subscription filter on the useractivitylog log group created by the Amazon Redshift clusters.

  1. On the CloudWatch console, navigate to the log group /aws/redshift/cluster/demo-cluster-ou1/useractivitylog.
  2. On the Subscription filters tab, on the Create menu, choose Create Lambda subscription filter.

  3. Choose demo_function as the Lambda function.
  4. For Log format, choose Other.
  5. Provide the subscription filter pattern as ?create ?alter ?drop ?grant ?revoke.
  6. Provide the filter name as Sensitive Queries demo-cluster-ou1.
  7. Test the filter by selecting the actual log stream. If it contains any queries that match the pattern, you can see some results. For testing, use the following sample entries and choose Test pattern.
'2023-04-02T04:18:43Z UTC [ db=dev user=awsuser pid=100 userid=100 xid=100 ]' LOG: alter table my_table alter column string type varchar(16);
'2023-04-02T04:06:08Z UTC [ db=dev user=awsuser pid=100 userid=100 xid=200 ]' LOG: create user rs_user with password '***';

  8. Choose Start streaming.

  9. Repeat the same steps for /aws/redshift/cluster/demo-cluster-ou2/useractivitylog and /aws/redshift/cluster/demo-cluster-ou3/useractivitylog, giving each subscription filter a unique name.
  10. Complete the preceding steps to create a second subscription filter for each of the Amazon Redshift data warehouses with the filter pattern ?CREATE ?ALTER ?DROP ?GRANT ?REVOKE, so that uppercase SQL commands are also captured. (A boto3 sketch of these steps follows.)
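
The following boto3 sketch shows the equivalent of the preceding console steps for one log group: it grants CloudWatch Logs permission to invoke the function and then creates the subscription filter. Replace XXXXXXXX with your AWS account ID; the Region and resource names follow the ones used in this post.

import boto3

logs = boto3.client('logs')
lambda_client = boto3.client('lambda')

log_group = '/aws/redshift/cluster/demo-cluster-ou1/useractivitylog'
function_arn = 'arn:aws:lambda:us-east-1:XXXXXXXX:function:demo_function'

# Allow CloudWatch Logs to invoke the Lambda function (the console does this automatically)
lambda_client.add_permission(
    FunctionName='demo_function',
    StatementId='cwlogs-demo-cluster-ou1',
    Action='lambda:InvokeFunction',
    Principal='logs.amazonaws.com',
    SourceArn=f'arn:aws:logs:us-east-1:XXXXXXXX:log-group:{log_group}:*'
)

# Create the subscription filter with the lowercase pattern; repeat with the uppercase pattern
logs.put_subscription_filter(
    logGroupName=log_group,
    filterName='Sensitive Queries demo-cluster-ou1',
    filterPattern='?create ?alter ?drop ?grant ?revoke',
    destinationArn=function_arn
)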

Test the solution

In this section, we test the solution on the three Amazon Redshift clusters created in the previous steps, checking that commands matched by the CloudWatch subscription filters produce notifications on the Slack channel and that the corresponding data is ingested into the DynamoDB table. We use the following commands to test the solution; however, the solution isn’t restricted to these commands. You can test other DDL commands that match the filter criteria in your Amazon Redshift cluster.

create schema sales;
create schema marketing;
create table dev.public.demo_test_table_1  (id int, string varchar(10));
create table dev.public.demo_test_table_2  (empid int, empname varchar(100));
alter table dev.public.category alter column catdesc type varchar(65);
drop table dev.public.demo_test_table_1;
drop table dev.public.demo_test_table_2;

In the Slack channel, details of the notifications look like the following screenshot.

To get the results in DynamoDB, complete the following steps:

  1. On the DynamoDB console, choose Explore items under Tables in the navigation pane.
  2. In the Tables pane, select demo_redshift_audit_logs.
  3. Select Scan and Run to get the results in the table.

Athena federation over the DynamoDB table

The Athena DynamoDB connector enables Athena to communicate with DynamoDB so that you can query your tables with SQL. As part of the prerequisites for this, deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more details, refer to Deploying a data source connector or Using the AWS Serverless Application Repository to deploy a data source connector. For this post, we use the Athena console.

  1. On the Athena console, under Administration in the navigation pane, choose Data sources.
  2. Choose Create data source.

  3. Select the data source as Amazon DynamoDB, then choose Next.

  4. For Data source name, enter dynamo_db.
  5. For Lambda function, choose Create Lambda function to open a new window with the Lambda console.

  6. Under Application settings, enter the following information:
    • For Application name, enter AthenaDynamoDBConnector.
    • For SpillBucket, enter the name of an S3 bucket.
    • For AthenaCatalogName, enter dynamo.
    • For DisableSpillEncryption, enter false.
    • For LambdaMemory, enter 3008.
    • For LambdaTimeout, enter 900.
    • For SpillPrefix, enter athena-spill-dynamo.

  7. Select I acknowledge that this app creates custom IAM roles and choose Deploy.
  8. Wait for the function to deploy, then return to the Athena window and choose the refresh icon next to Lambda function.
  9. Select the newly deployed Lambda function and choose Next.

  10. Review the information and choose Create data source.
  11. Navigate back to the query editor, then choose dynamo_db for Data source and default for Database.
  12. Run the following query in the editor to check the sample data:
SELECT partkey,
       redshiftcluster,
       databasename,
       objectname,
       username,
       querytype,
       sqltimestamp,
       sqlquery,
       rawdata
FROM dynamo_db.default.demo_redshift_audit_logs limit 10;
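
You can also run the same federated query programmatically. The following boto3 sketch submits the query through Athena and polls for completion; the S3 output location is a placeholder for a bucket you own.

import time
import boto3

athena = boto3.client('athena')

query = ('SELECT partkey, redshiftcluster, databasename, objectname, username, '
         'querytype, sqltimestamp, sqlquery, rawdata '
         'FROM dynamo_db.default.demo_redshift_audit_logs LIMIT 10')

# Placeholder output location; Athena writes query results to this bucket
response = athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={'OutputLocation': 's3://your-athena-results-bucket/'}
)
query_id = response['QueryExecutionId']

# Poll until the query finishes
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

if state == 'SUCCEEDED':
    rows = athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']
    print(rows)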

Visualize the data in QuickSight

In this section, we create a simple governance dashboard in QuickSight using Athena in direct query mode to query the record set, which is persistently stored in a DynamoDB table.

  1. Sign up for QuickSight on the QuickSight console.
  2. Select Amazon Athena as a resource.
  3. Choose Lambda and select the Lambda function created for DynamoDB federation.

  4. Create a new dataset in QuickSight with Athena as the source.
  5. Provide the data source name as demo_blog.
  6. Choose dynamo_db for Catalog, default for Database, and demo_redshift_audit_logs for Table.
  7. Choose Edit/Preview data.

  8. For the sqlTimestamp column, change the data type from String to Date.

  9. In the dialog box that appears, enter the data format yyyy-MM-dd'T'HH:mm:ssZZ.
  10. Choose Validate and Update.

  11. Choose PUBLISH & VISUALIZE.
  12. Choose Interactive sheet and choose CREATE.

This will take you to the visualization page to create the analysis on QuickSight.

  13. Create a governance dashboard with the appropriate visualization type.

Refer to the Amazon QuickSight learning videos in the QuickSight community for basic to advanced authoring guidance. The following screenshot is a sample visualization created on this data.

Clean up

Clean up your resources with the following steps:

  1. Delete all the Amazon Redshift clusters.
  2. Delete the Lambda function.
  3. Delete the CloudWatch log groups for Amazon Redshift and Lambda.
  4. Delete the Athena data source for DynamoDB.
  5. Delete the DynamoDB table.

Conclusion

Amazon Redshift is a powerful, fully managed data warehouse that can offer significantly increased performance and lower cost in the cloud. In this post, we discussed a pattern to implement a governance mechanism to identify and notify sensitive DDL/DCL queries on an Amazon Redshift data warehouse, and created a quick dashboard to enable the DBA and security team to take timely and prompt action as required. Additionally, you can extend this solution to include DDL commands used for Amazon Redshift data sharing across clusters.

Operational excellence is a critical part of overall data governance when creating a modern data architecture, because it’s a great enabler for driving our customers’ business. Ideally, any data governance implementation is a combination of people, process, and technology that organizations use to ensure the quality and security of their data throughout its lifecycle. Use these instructions to set up an automated notification mechanism that alerts you when sensitive queries are detected, and to create a quick dashboard in QuickSight to track these activities over time.


About the Authors

Rajdip Chaudhuri is a Senior Solutions Architect with Amazon Web Services specializing in data and analytics. He enjoys working with AWS customers and partners on data and analytics requirements. In his spare time, he enjoys soccer and movies.

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Getting started guide for near-real time operational analytics using Amazon Aurora zero-ETL integration with Amazon Redshift

Post Syndicated from Rohit Vashishtha original https://aws.amazon.com/blogs/big-data/getting-started-guide-for-near-real-time-operational-analytics-using-amazon-aurora-zero-etl-integration-with-amazon-redshift/

Amazon Aurora zero-ETL integration with Amazon Redshift was announced at AWS re:Invent 2022 and is now available in public preview for Amazon Aurora MySQL-Compatible Edition 3 (compatible with MySQL 8.0) in regions us-east-1, us-east-2, us-west-2, ap-northeast-1 and eu-west-1. For more details, refer to the What’s New Post.

In this post, we provide step-by-step guidance on how to get started with near-real time operational analytics using this feature.

Challenges

Customers across industries today are looking to increase revenue and customer engagement by implementing near-real time analytics use cases like personalization strategies, fraud detection, inventory monitoring, and many more. There are two broad approaches to analyzing operational data for these use cases:

  • Analyze the data in-place in the operational database (e.g. read replicas, federated query, analytics accelerators)
  • Move the data to a data store optimized for running analytical queries such as a data warehouse

The zero-ETL integration is focused on simplifying the latter approach.

A common pattern for moving data from an operational database to an analytics data warehouse is via extract, transform, and load (ETL), a process of combining data from multiple sources into a large, central repository (data warehouse). ETL pipelines can be expensive to build and complex to manage. With multiple touchpoints, intermittent errors in ETL pipelines can lead to long delays, leaving applications that depend on the data warehouse working with stale or missing data, which in turn leads to missed business opportunities.

For customers that need to run unified analytics across data from multiple operational databases, solutions that analyze data in-place may work great for accelerating queries on a single database, but such systems have a limitation of not being able to aggregate data from multiple operational databases.

Zero-ETL

At AWS, we have been making steady progress towards bringing our zero-ETL vision to life. With Aurora zero-ETL integration with Amazon Redshift, you can bring together the transactional data of Aurora with the analytics capabilities of Amazon Redshift. It minimizes the work of building and managing custom ETL pipelines between Aurora and Amazon Redshift. Data engineers can now replicate data from multiple Aurora database clusters into the same or a new Amazon Redshift instance to derive holistic insights across many applications or partitions. Updates in Aurora are automatically and continuously propagated to Amazon Redshift so the data engineers have the most recent information in near-real time. Additionally, the entire system can be serverless and can dynamically scale up and down based on data volume, so there’s no infrastructure to manage.

When you create an Aurora zero-ETL integration with Amazon Redshift, you continue to pay for Aurora and Amazon Redshift usage with existing pricing (including data transfer). The Aurora zero-ETL integration with Amazon Redshift feature is available at no additional cost.

With Aurora zero-ETL integration with Amazon Redshift, the integration replicates data from the source database into the target data warehouse. The data becomes available in Amazon Redshift within seconds, allowing users to use the analytics features of Amazon Redshift and capabilities like data sharing, workload optimization autonomics, concurrency scaling, machine learning, and many more. You can perform real-time transaction processing on data in Aurora while simultaneously using Amazon Redshift for analytics workloads such as reporting and dashboards.

The following diagram illustrates this architecture.

Solution overview

Let’s consider TICKIT, a fictional website where users buy and sell tickets online for sporting events, shows, and concerts. The transactional data from this website is loaded into an Aurora MySQL 3.03.1 (or higher version) database. The company’s business analysts want to generate metrics to identify ticket movement over time, success rates for sellers, and the best-selling events, venues, and seasons. They would like to get these metrics in near-real time using a zero-ETL integration.

The integration is set up between Amazon Aurora MySQL-Compatible Edition 3.03.1 (source) and Amazon Redshift (destination). The transactional data from the source gets refreshed in near-real time on the destination, which processes analytical queries.

You can use either the provisioned or serverless option for both Amazon Aurora MySQL-Compatible Edition as well as Amazon Redshift. For this illustration, we use a provisioned Aurora database and an Amazon Redshift Serverless data warehouse. For the complete list of public preview considerations, please refer to the feature AWS documentation.

The following diagram illustrates the high-level architecture.

The following are the steps needed to set up zero-ETL integration. For complete getting started guides, refer to the following documentation links for Aurora and Amazon Redshift.

  1. Configure the Aurora MySQL source with a customized DB cluster parameter group.
  2. Configure the Amazon Redshift Serverless destination with the required resource policy for its namespace.
  3. Update the Redshift Serverless workgroup to enable case-sensitive identifiers.
  4. Configure the required permissions.
  5. Create the zero-ETL integration.
  6. Create a database from the integration in Amazon Redshift.

Configure the Aurora MySQL source with a customized DB cluster parameter group

To create an Aurora MySQL database, complete the following steps:

  1. On the Amazon RDS console, create a DB cluster parameter group called zero-etl-custom-pg.

Zero-ETL integrations require specific values for the Aurora DB cluster parameters that control binary logging (binlog). For example, enhanced binlog mode must be turned on (aurora_enhanced_binlog=1).
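
If you prefer to script this step, the following boto3 sketch creates the DB cluster parameter group and applies the binlog settings listed in the following step. The parameter group family aurora-mysql8.0 corresponds to Aurora MySQL version 3; static parameters are applied with pending-reboot.

import boto3

rds = boto3.client('rds')

rds.create_db_cluster_parameter_group(
    DBClusterParameterGroupName='zero-etl-custom-pg',
    DBParameterGroupFamily='aurora-mysql8.0',
    Description='Custom cluster parameter group for zero-ETL integration'
)

binlog_parameters = {
    'binlog_backup': '0',
    'binlog_replication_globaldb': '0',
    'binlog_format': 'ROW',
    'aurora_enhanced_binlog': '1',
    'binlog_row_metadata': 'FULL',
    'binlog_row_image': 'FULL'
}

rds.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName='zero-etl-custom-pg',
    Parameters=[
        {'ParameterName': name, 'ParameterValue': value, 'ApplyMethod': 'pending-reboot'}
        for name, value in binlog_parameters.items()
    ]
)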

  2. Set the following binlog cluster parameter settings:
    1. binlog_backup=0
    2. binlog_replication_globaldb=0
    3. binlog_format=ROW
    4. aurora_enhanced_binlog=1
    5. binlog_row_metadata=FULL
    6. binlog_row_image=FULL
  3. Choose Save changes.
  4. Choose Databases in the navigation pane, then choose Create database.
  5. For Available versions, choose Aurora MySQL 3.03.1 (or higher).
  6. For Templates, select Production.
  7. For DB cluster identifier, enter zero-etl-source-ams.
  8. Under Instance configuration, select Memory optimized classes and choose a suitable instance size (the default is db.r6g.2xlarge).
  9. Under Additional configuration, for DB cluster parameter group, choose the parameter group you created earlier (zero-etl-custom-pg).
  10. Choose Create database.

In a couple of minutes, it should spin up an Aurora MySQL database as the source for zero-ETL integration.

Configure the Redshift Serverless destination

For our use case, create a Redshift Serverless data warehouse by completing the following steps:

  1. On the Amazon Redshift console, choose Serverless dashboard in the navigation pane.
  2. Choose Create preview workgroup.
  3. For Workgroup name, enter zero-etl-target-rs-wg.
  4. For Namespace, select Create a new namespace and enter zero-etl-target-rs-ns.
  5. Navigate to the namespace zero-etl-target-rs-ns and choose the Resource policy tab.
  6. Choose Add authorized principals.
  7. Enter either the Amazon Resource Name (ARN) of the AWS user or role, or the AWS account ID (IAM principals) that are allowed to create integrations in this namespace.

An account ID is stored as an ARN with the root user.

  8. Add an authorized integration source to the namespace and specify the ARN of the Aurora MySQL DB cluster that’s the data source for the zero-ETL integration.
  9. Choose Save changes.

You can get the ARN for the Aurora MySQL source on the Configuration tab as shown in the following screenshot.

Update the Redshift Serverless workgroup to enable case-sensitive identifiers

Use the AWS Command Line Interface (AWS CLI) to run the update-workgroup action:

aws redshift-serverless update-workgroup --workgroup-name zero-etl-target-rs-wg --config-parameters parameterKey=enable_case_sensitive_identifier,parameterValue=true --region us-east-1

You can use AWS CloudShell or another interface like Amazon Elastic Compute Cloud (Amazon EC2) with an AWS user configuration that can update the Redshift Serverless parameter group. The following screenshot illustrates how to run this on CloudShell.

The following screenshot shows how to run the update-workgroup command on Amazon EC2.

Configure required permissions

To create a zero-ETL integration, your user or role must have an attached identity-based policy with the appropriate AWS Identity and Access Management (IAM) permissions. The following sample policy allows the associated principal to perform the following actions:

  • Create zero-ETL integrations for the source Aurora DB cluster.
  • View and delete all zero-ETL integrations.
  • Create inbound integrations into the target data warehouse. This permission is not required if the same account owns the Amazon Redshift data warehouse and this account is an authorized principal for that data warehouse. Also note that Amazon Redshift has a different ARN format for provisioned and serverless:
    • Provisioned cluster – arn:aws:redshift:{region}:{account-id}:namespace:namespace-uuid
    • Serverless – arn:aws:redshift-serverless:{region}:{account-id}:namespace/namespace-uuid

Complete the following steps to configure the permissions:

  1. On the IAM console, choose Policies in the navigation pane.
  2. Choose Create policy.
  3. Create a new policy called rds-integrations using the following JSON:
    {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": [
                "rds:CreateIntegration"
            ],
            "Resource": [
                "arn:aws:rds:{region}:{account-id}:cluster:source-cluster",
                "arn:aws:rds:{region}:{account-id}:integration:*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "rds:DescribeIntegration"
            ],
            "Resource": ["*"]
        },
        {
            "Effect": "Allow",
            "Action": [
                "rds:DeleteIntegration"
            ],
            "Resource": [
                "arn:aws:rds:{region}:{account-id}:integration:*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:CreateInboundIntegration"
            ],
            "Resource": [
                "arn:aws:redshift:{region}:{account-id}:cluster:namespace-uuid"
            ]
        }]
    }

Policy preview:

If you see IAM policy warnings for the RDS policy actions, this is expected because the feature is in public preview. These actions will become part of IAM policies when the feature is generally available. It’s safe to proceed.

  4. Attach the policy you created to your IAM user or role permissions.

Create the zero-ETL integration

To create the zero-ETL integration, complete the following steps:

  1. On the Amazon RDS console, choose Zero-ETL integrations in the navigation pane.
  2. Choose Create zero-ETL integration.
  3. For Integration name, enter a name, for example zero-etl-demo.
  4. For Aurora MySQL source cluster, browse and choose the source cluster zero-etl-source-ams.
  5. Under Destination, for Amazon Redshift data warehouse, choose the Redshift Serverless destination namespace (zero-etl-target-rs-ns).
  6. Choose Create zero-ETL integration.

To specify a target Amazon Redshift data warehouse that’s in another AWS account, you must create a role that allows users in the current account to access resources in the target account. For more information, refer to Providing access to an IAM user in another AWS account that you own.

Create a role in the target account with the following permissions:

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Action":[
            "redshift:DescribeClusters",
            "redshift-serverless:ListNamespaces"
         ],
         "Resource":[
            "*"
         ]
      }
   ]
}

The role must have the following trust policy, which specifies the target account ID. You can do this by creating the role with a trusted entity of type AWS account and specifying the account ID of the other account.

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Principal":{
            "AWS": "arn:aws:iam::{external-account-id}:root"
         },
         "Action":"sts:AssumeRole"
      }
   ]
}

The following screenshot illustrates creating this on the IAM console.

Then, while creating the zero-ETL integration, select the Specify a different account option and provide the destination account ID and the name of the role you created to proceed further.

You can choose the integration to view its details and monitor its progress. It takes a few minutes to change the status from Creating to Active; the time varies depending on the size of the dataset already available in the source.

Create a database from the integration in Amazon Redshift

To create your database, complete the following steps:

  1. On the Redshift Serverless dashboard, navigate to the zero-etl-target-rs-ns namespace.
  2. Choose Query data to open Query Editor v2.
  3. Connect to the preview Redshift Serverless data warehouse by choosing Create connection.
  4. Obtain the integration_id from the svv_integration system table:

    select integration_id from svv_integration; ---- copy this result, use in the next sql

  5. Use the integration_id from the previous step to create a new database from the integration:
    CREATE DATABASE aurora_zeroetl FROM INTEGRATION '<result from above>';

The integration is now complete, and a full snapshot of the source is reflected as is in the destination. Ongoing changes are synced in near-real time.

Analyze the near-real time transactional data

Now we can run analytics on TICKIT’s operational data.

Populate the source TICKIT data

To populate the source data, complete the following steps:

  1. Connect to your Aurora MySQL cluster and create a database/schema for the TICKIT data model, verify that the tables in that schema have a primary key, and initiate the load process:
    mysql -h <amazon_aurora_mysql_writer_endpoint> -u admin -p

You can use the script from the following HTML file to create the sample database demodb (using the tickit.db model) in Amazon Aurora MySQL-Compatible edition.

  2. Run the script to create the tickit.db model tables in the demodb database/schema.
  3. Load data from Amazon Simple Storage Service (Amazon S3), record the finish time for change data capture (CDC) validations at the destination, and observe how active the integration was. (An example load statement follows the notes below.)

The following are common errors associated with loading from Amazon S3:

  • For the current version of the Aurora MySQL cluster, we need to set the aws_default_s3_role parameter in the DB cluster parameter group to the role ARN that has the necessary Amazon S3 access permissions.
  • If you get an error for missing credentials (for example, Error 63985 (HY000): S3 API returned error: Missing Credentials: Cannot instantiate S3 Client), you probably haven’t associated your IAM role to the cluster. In this case, add the intended IAM role to the source Aurora MySQL cluster.
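
For illustration, the following sketch runs an Aurora MySQL LOAD DATA FROM S3 statement from a Python client (assuming the pymysql library is installed). The bucket, file, and table names are placeholders for your own sample data, and the cluster must already have an IAM role with Amazon S3 access associated, as described above.

import pymysql

# Placeholder connection details; use your Aurora MySQL writer endpoint and credentials
connection = pymysql.connect(
    host='<amazon_aurora_mysql_writer_endpoint>',
    user='admin',
    password='<password>',
    database='demodb'
)

# Placeholder S3 location and table; requires aws_default_s3_role (or an associated IAM role)
# with permission to read the bucket
load_statement = """
    LOAD DATA FROM S3 's3://my-tickit-bucket/allusers_pipe.txt'
    INTO TABLE users
    FIELDS TERMINATED BY '|'
    LINES TERMINATED BY '\\n'
"""

with connection.cursor() as cursor:
    cursor.execute(load_statement)
connection.commit()
connection.close()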

Analyze the source TICKIT data in the destination

On the Redshift Serverless dashboard, open Query Editor v2 using the database you created as part of the integration setup. Use the following code to validate the seed or CDC activity:

SELECT * FROM SYS_INTEGRATION_ACTIVITY;

Choose the cluster or workgroup and the database created from the integration on the drop-down menu, and run the tickit.db sample analytic queries.

Monitoring

You can query system views and tables in Amazon Redshift, such as SVV_INTEGRATION and SYS_INTEGRATION_ACTIVITY, to get information about your Aurora zero-ETL integrations with Amazon Redshift.
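
For example, the following boto3 sketch uses the Amazon Redshift Data API to run such a query against the Redshift Serverless workgroup created earlier; the database name dev is an assumption (adjust it for your environment).

import time
import boto3

redshift_data = boto3.client('redshift-data')

# Query the integration activity view in the Redshift Serverless workgroup created earlier
response = redshift_data.execute_statement(
    WorkgroupName='zero-etl-target-rs-wg',
    Database='dev',  # assumed default database; adjust if needed
    Sql='SELECT * FROM sys_integration_activity;'
)

# Wait for the statement to finish, then fetch the result
while redshift_data.describe_statement(Id=response['Id'])['Status'] not in ('FINISHED', 'FAILED', 'ABORTED'):
    time.sleep(1)

result = redshift_data.get_statement_result(Id=response['Id'])
print(result['Records'])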

To view the integration-related metrics published to Amazon CloudWatch, navigate to the Amazon Redshift console, choose Zero-ETL integrations in the navigation pane, and choose an integration link to display its activity metrics.

The metrics available on the Amazon Redshift console are integration metrics and table statistics, with table statistics providing details of each table replicated from Aurora MySQL to Amazon Redshift.

Integration metrics contain the table replication success and failure counts and lag details.

Clean up

When you delete a zero-ETL integration, Aurora removes it from your Aurora cluster. Your transactional data isn’t deleted from Aurora or Amazon Redshift, but Aurora doesn’t send new data to Amazon Redshift.

To delete a zero-ETL integration, complete the following steps:

  1. On the Amazon RDS console, choose Zero-ETL integrations in the navigation pane.
  2. Select the zero-ETL integration that you want to delete and choose Delete.
  3. To confirm the deletion, choose Delete.

Conclusion

In this post, we showed you how to set up Aurora zero-ETL integration from Amazon Aurora MySQL-Compatible Edition to Amazon Redshift. This minimizes the need to maintain complex data pipelines and enables near-real time analytics on transactional and operational data.

To learn more about Aurora zero-ETL integration with Amazon Redshift, visit documentation for Aurora and Amazon Redshift.


About the Authors

Rohit Vashishtha is a Senior Analytics Specialist Solutions Architect at AWS based in Dallas, Texas. He has 17 years of experience architecting, building, leading, and maintaining big data platforms. Rohit helps customers modernize their analytic workloads using the breadth of AWS services and ensures that customers get the best price/performance with utmost security and data governance.

Vijay Karumajji is a Database Solutions Architect with Amazon Web Services. He works with AWS customers to provide guidance and technical assistance on database projects, helping them improve the value of their solutions when using AWS.

BP Yau is a Sr Partner Solutions Architect at AWS. He is passionate about helping customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Jyoti Aggarwal is a Product Manager on the Amazon Redshift team based in Seattle. She has spent the last 10 years working on multiple products in the data warehouse industry.

Adam Levin is a Product Manager on the Amazon Aurora team based in California. He has spent the last 10 years working on various cloud database services.

Enable business users to analyze large datasets in your data lake with Amazon QuickSight

Post Syndicated from Eliad Maimon original https://aws.amazon.com/blogs/big-data/enable-business-users-to-analyze-large-datasets-in-your-data-lake-with-amazon-quicksight/

This blog post is co-written with Ori Nakar from Imperva.

Imperva Cloud WAF protects hundreds of thousands of websites and blocks billions of security events every day. Events and many other security data types are stored in Imperva’s Threat Research Multi-Region data lake.

Imperva harnesses data to improve their business outcomes. To enable this transformation to a data-driven organization, Imperva brings together data from structured, semi-structured, and unstructured sources into a data lake. As part of their solution, they are using Amazon QuickSight to unlock insights from their data.

Imperva’s data lake is based on Amazon Simple Storage Service (Amazon S3), where data is continually loaded. Imperva’s data lake has a few dozen different datasets, on the scale of petabytes. Each day, terabytes of new data are added to the data lake and then transformed, aggregated, partitioned, and compressed.

In this post, we explain how Imperva’s solution enables users across the organization to explore, visualize, and analyze data using Amazon Redshift Serverless, Amazon Athena, and QuickSight.

Challenges and needs

A modern data strategy gives you a comprehensive plan to manage, access, analyze, and act on data. AWS provides the most complete set of services for the entire end-to-end data journey for all workloads, all types of data, and all desired business outcomes. In turn, this makes AWS the best place to unlock value from your data and turn it into insight.

Redshift Serverless is a serverless option of Amazon Redshift that allows you to run and scale analytics without having to provision and manage data warehouse clusters. Redshift Serverless automatically provisions and intelligently scales data warehouse capacity to deliver high performance for all your analytics. You just need to load and query your data, and you only pay for the compute used for the duration of the workloads on a per-second basis. Redshift Serverless is ideal when it’s difficult to predict compute needs such as variable workloads, periodic workloads with idle time, and steady-state workloads with spikes.

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, straightforward to use, and makes it simple for anyone with SQL skills to quickly analyze large-scale datasets in multiple Regions.

QuickSight is a cloud-native business intelligence (BI) service that you can use to visually analyze data and share interactive dashboards with all users in the organization. QuickSight is fully managed and serverless, requires no client downloads for dashboard creation, and has a pay-per-session pricing model that allows you to pay for dashboard consumption. Imperva uses QuickSight to enable users with no technical expertise, from different teams such as marketing, product, sales, and others, to extract insight from the data without the help of data or research teams.

QuickSight offers SPICE, an in-memory, cloud-native data store that allows end-users to interactively explore data. SPICE provides consistently fast query performance and automatically scales for high concurrency. With SPICE, you save time and cost because you don’t need to retrieve data from the data source (whether a database or data warehouse) every time you change an analysis or update a visual, and you offload the burden of concurrent access and analytical complexity from the underlying data source.

In order for QuickSight to consume data from the data lake, some of the data undergoes additional transformations, filters, joins, and aggregations. Imperva cleans their data by filtering incomplete records, reducing the number of records by aggregations, and applying internal logic to curate millions of security incidents out of hundreds of millions of records.

Imperva had the following requirements for their solution:

  • High performance with low query latency to enable interactive dashboards
  • Continuously update and append data to queryable sources from the data lake
  • Data freshness of up to 1 day
  • Low cost
  • Engineering efficiency

The challenge faced by Imperva and many other companies is how to create a big data extract, transform, and load (ETL) pipeline solution that fits these requirements.

In this post, we review two approaches Imperva implemented to address their challenges and meet their requirements. The solutions can be easily implemented while maintaining engineering efficiency, especially with the introduction of Redshift Serverless.

Imperva’s solutions

Imperva needed to have the data lake’s data available through QuickSight continuously. The following solutions were chosen to connect the data lake to QuickSight:

  • QuickSight caching layer, SPICE – Use Athena to query the data into a QuickSight SPICE dataset
  • Redshift Serverless – Copy the data to Redshift Serverless and use it as a data source

Our recommendation is to use a solution based on the use case. Each solution has its own advantages and challenges, which we discuss as part of this post.

The high-level flow is the following:

  • Data is continuously updated from the data lake into either Redshift Serverless or the QuickSight caching layer, SPICE
  • An internal user can create an analysis and publish it as a dashboard for other internal or external users

The following architecture diagram shows the high-level flow.

High-level flow

In the following sections, we discuss the details about the flow and the different solutions, including a comparison between them, which can help you choose the right solution for you.

Solution 1: Query with Athena and import to SPICE

QuickSight provides inherent capabilities to upload data using Athena into SPICE, which is a straightforward approach that meets Imperva’s requirements regarding simple data management. For example, it suits stable data flows without frequent exceptions, which could otherwise require a full SPICE refresh.

You can use Athena to load data into a QuickSight SPICE dataset, and then use the SPICE incremental upload option to load new data to the dataset. A QuickSight dataset will be connected to a table or a view accessible by Athena. A time column (like day or hour) is used for incremental updates. The following options summarize the approaches and their trade-offs:

  • Existing table – Use the built-in option by QuickSight. This is not flexible; the table is imported as is from the data lake.
  • Dedicated view – A view lets you better control the data in your dataset. It allows joining data, aggregation, or choosing a filter like the date you want to start importing data from. Note that QuickSight allows building a dataset based on custom SQL, but this option doesn’t allow incremental updates. The drawback is large Athena resource consumption on a full refresh.
  • Dedicated ETL – Create a dedicated ETL process, which is similar to a view, but unlike the view, it allows reuse of the results in case of a full refresh. If your ETL or view contains grouping or other complex operations, these operations are done only by the ETL process, according to the schedule you define. This is the most flexible option, but it requires ETL development and implementation and additional Amazon S3 storage.

The following architecture diagram details the options for loading data by Athena into SPICE.


The following code provides a SQL example for a view creation. We assume the existence of two tables, my_customers and my_events, with one join column called customer_id. The view is used to do the following:

  • Aggregate the data from daily to weekly, and reduce the number of rows
  • Control the start date of the dataset (in this case, 30 weeks back)
  • Join the data to add more columns (customer_type) and filter it
CREATE VIEW my_dataset AS
SELECT DATE_ADD('day', -DAY_OF_WEEK(day) + 1, day) AS first_day_of_week,
       customer_type, event_type, COUNT(events) AS total_events
FROM my_events INNER JOIN my_customers USING (customer_id)
WHERE customer_type NOT IN ('Reseller')
      AND day BETWEEN DATE_ADD('DAY',-7 * 30 -DAY_OF_WEEK(CURRENT_DATE) + 1, CURRENT_DATE)
      AND DATE_ADD('DAY', -DAY_OF_WEEK(CURRENT_DATE), CURRENT_DATE)
GROUP BY 1, 2, 3

Solution 2: Load data into Redshift Serverless

Redshift Serverless provides full visibility to the data, which can be viewed or edited at any time. For example, if there is a delay in adding data to the data lake or the data isn’t properly added, with Redshift Serverless, you can edit data using SQL statements or retry data loading. Redshift Serverless is a scalable solution that doesn’t have a dataset size limitation.

Redshift Serverless is used as a serving layer for the datasets that are to be used in QuickSight. The pricing model for Redshift Serverless is based on storage utilization and query runtime; idle compute resources have no associated cost. Setup is simple and doesn’t require you to choose node types or an amount of storage. You simply load the data into tables you create and start working.

To create a new dataset, you need to create an Amazon Redshift table and run the following process every time data is added:

  1. Transform the data using an ETL process (optional):
    • Read data from the tables.
    • Transform to the QuickSight dataset schema.
    • Write the data to an S3 bucket and load it to Amazon Redshift.
  2. Delete old data if it exists to avoid duplicate data.
  3. Load the data using the COPY command.

The following architecture diagram details the options to load data into Redshift Serverless with or without an ETL process.


The Amazon Redshift COPY command is simple and fast. For example, to copy daily partition Parquet data, use the following code:

COPY my_table
FROM 's3://my_bucket/my_table/day=2022-01-01'
IAM_ROLE 'my_role' 
FORMAT AS PARQUET

Use the following COPY command to load the output file of the ETL process. Values will be truncated according to Amazon Redshift column size. The column truncation is important because, unlike in the data lake, in Amazon Redshift, the column size must be set. This option prevents COPY failures:

COPY my_table
FROM 's3://my_bucket/my_table/day=2022-01-01'
IAM_ROLE 'my_role' 
FORMAT AS JSON GZIP TRUNCATECOLUMNS

The Amazon Redshift COPY operation provides many benefits and options. It supports multiple formats as well as column mapping, escaping, and more. It also allows more control over data format, object size, and options to tune the COPY operation for improved performance. Unlike data in the data lake, Amazon Redshift has column length specifications. We use TRUNCATECOLUMNS to truncate the data in columns to the appropriate number of characters so that it fits the column specification.

Using this method provides full control over the data. In case of a problem, we can repair parts of the table by deleting old data and loading the data again. It’s also possible to use the QuickSight dataset JOIN option, which is not available in SPICE when using incremental update.
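
For example, a repair of a single daily partition could be scripted with the Amazon Redshift Data API, deleting the affected day and reloading it with the same COPY command shown earlier. The workgroup and database names are placeholders, and the sketch assumes the table carries a day column matching the partition.

import boto3

redshift_data = boto3.client('redshift-data')

# Placeholder workgroup/database names; the table, bucket, and role follow the earlier COPY example
redshift_data.batch_execute_statement(
    WorkgroupName='my-workgroup',
    Database='my_database',
    Sqls=[
        "DELETE FROM my_table WHERE day = '2022-01-01'",  # assumes a day column matching the partition
        "COPY my_table FROM 's3://my_bucket/my_table/day=2022-01-01' "
        "IAM_ROLE 'my_role' FORMAT AS PARQUET"
    ]
)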

An additional benefit of this approach is that the data is available to other clients and services that want to use the same data, such as SQL clients or notebook servers like Apache Zeppelin.

Conclusion

QuickSight allows Imperva to expose business data to various departments within an organization. In the post, we explored approaches for importing data from a data lake to QuickSight, whether continuously or incrementally.

However, it’s important to note that there is no one-size-fits-all solution; the optimal approach will depend on the specific use case. Both options—continuous and incremental updates—are scalable and flexible, with no significant cost differences observed for our dataset and access patterns.

Imperva found incremental refresh to be very useful and uses it for simple data management. For more complex datasets, Imperva has benefitted from the greater scalability and flexibility provided by Redshift Serverless.

In cases where a higher degree of control over the datasets was required, Imperva chose Redshift Serverless so that data issues could be addressed promptly by deleting, updating, or inserting new records as necessary.

With the integration of dashboards, individuals can now access data that was previously inaccessible to them. Moreover, QuickSight has played a crucial role in streamlining our data distribution processes, enabling data accessibility across all departments within the organization.

To learn more, visit Amazon QuickSight.


About the Authors

Eliad Maimon is a Senior Startups Solutions Architect at AWS in Tel-Aviv with over 20 years of experience in architecting, building, and maintaining software products. He creates architectural best practices and collaborates with customers to leverage cloud and innovation, transforming businesses and disrupting markets. Eliad is specializing in machine learning on AWS, with a focus in areas such as generative AI, MLOps, and Amazon SageMaker.

Ori Nakar is a principal cyber-security researcher, a data engineer, and a data scientist at Imperva Threat Research group. Ori has many years of experience as a software engineer and engineering manager, focused on cloud technologies and big data infrastructure.

Accelerate onboarding and seamless integration with ThoughtSpot using Amazon Redshift partner integration

Post Syndicated from Antony Prasad Thevaraj original https://aws.amazon.com/blogs/big-data/accelerate-onboarding-and-seamless-integration-with-thoughtspot-using-amazon-redshift-partner-integration/

Amazon Redshift is a fast, petabyte-scale cloud data warehouse that makes it simple and cost-effective to analyze all of your data using standard SQL. Tens of thousands of customers today rely on Amazon Redshift to analyze exabytes of data and run complex analytical queries, making it the most widely used cloud data warehouse. You can run and scale analytics in seconds on all your data without having to manage your data warehouse infrastructure.

Today, we are excited to announce ThoughtSpot as a new BI partner available through Amazon Redshift partner integration. You can onboard with ThoughtSpot in minutes directly from the Amazon Redshift console and gain faster data-driven insights. Businesses are always looking for ways to derive insights from their data, and this is where modern analytics providers such as ThoughtSpot provide value. With its powerful AI-based search, live visualizations, and developer tools and APIs for sharing embedded analytics, ThoughtSpot democratizes access to data by providing self-service tools for all users.

In this post, you will learn how to integrate seamlessly with ThoughtSpot from the Amazon Redshift console. With the loosely coupled nature of the modern data stack, it’s simple to connect Amazon Redshift with ThoughtSpot. No data movement or replication is required.

ThoughtSpot: Live analytics for your modern data stack

Static dashboards cannot deliver consistent and reliable insights at the speed and global scale that customers demand. They lack the following:

  • Opportunities for collaboration
  • Discovery and reusability
  • Secure remote data and insight access
  • Rapid use case development with single-touch insight provisioning

ThoughtSpot empowers everyone to create, consume, and operationalize data-driven insights. ThoughtSpot consumer-grade search and AI technology delivers true self-service analytics that anyone can use, while its developer-friendly platform ThoughtSpot Everywhere makes it easy to build interactive data apps that integrate with your existing cloud provider.

As organizations increasingly move to the cloud, ThoughtSpot helps them quickly unlock value from their investment. ThoughtSpot’s simple search functionality enables you to easily ask and answer data questions in seconds to unearth impactful insights directly in Amazon Redshift. ThoughtSpot for AWS provides enterprises with more freedom and flexibility by eliminating the need to move data between cloud sources so that businesses can immediately benefit from data-driven decision-making.

ThoughtSpot is an AWS Data and Analytics Competency Partner with the Amazon Redshift Ready product designation. ThoughtSpot is also part of the Powered by Amazon Redshift program.

Integrate ThoughtSpot using Amazon Redshift partner integration

Complete the following steps to integrate ThoughtSpot with Amazon Redshift:

  1. On the Amazon Redshift console, choose Clusters in the navigation pane.
  2. Select your cluster and on the Actions menu, choose Add AWS Partner integration.

Alternatively, you can choose your individual cluster and on its details page, choose Add partner integration.

  3. Select ThoughtSpot as your desired BI partner.
  4. Choose Next.

  5. Choose Add partner.

  6. Log in on the ThoughtSpot landing page.

  7. Select Continue to Setup.

  8. On the Amazon Redshift connection details page, enter your Amazon Redshift database password for Password.
  9. Choose Continue.

To connect to your Amazon Redshift cluster, make sure to enable the Publicly accessible option and allowlist the ThoughtSpot IP addresses in your Amazon Redshift cluster’s security group.

  10. Select your desired tables and choose Update.
  11. If a prompt appears, choose Update again.

After you have successfully integrated with ThoughtSpot, you will see an Active status in the Integrations section on the Amazon Redshift console.

Congratulations! You're now ready to start visualizing data using ThoughtSpot. The following example shows trends in sales growth YTD, current sales trends across regions, and a comparison of product type sales between the current year and the previous year. You can slice and dice the dataset based on the granularity defined by the user.

Partner feedback

“ThoughtSpot is thrilled to expand our long-time cooperation with AWS with the announcement of our Amazon Redshift partner integration. Leading organizations are already extracting value from their data using AI-powered analytics on Amazon Redshift, and today we are making it even more frictionless for Amazon Redshift users to launch ThoughtSpot’s free trial to solve real problems quickly.”

– Kuntal Vahalia, SVP of WW Partners & APAC

Conclusion

In this post, we discussed how Amazon Redshift partner integration provides a fast-onboarding experience and allows you to create valuable business insights by integrating with ThoughtSpot. ThoughtSpot enables you to unlock the value of your modern data stack by empowering your entire organization with live analytics and data search, while Amazon Redshift provides a modern data warehouse experience for you to manage analytics at scale.

If you’re an AWS Partner and would like to integrate your product into the Amazon Redshift console, contact [email protected] for additional information and guidance. This console partner integration functionality is available to new and existing customers at no additional cost. To get started and learn more, see Integrating Amazon Redshift with an AWS Partner.


About the Authors

Antony Prasad Thevaraj is a Sr. Partner Solutions Architect in Data and Analytics at AWS. He has over 12 years of experience as a Big Data Engineer, and has worked on building complex ETL and ELT pipelines for various business units.

Maneesh Sharma is a Senior Database Engineer at AWS with more than a decade of experience designing and implementing large-scale data warehouse and analytics solutions. He collaborates with various Amazon Redshift Partners and customers to drive better integration.

Build an Amazon Redshift data warehouse using an Amazon DynamoDB single-table design

Post Syndicated from Altaf Hussain original https://aws.amazon.com/blogs/big-data/build-an-amazon-redshift-data-warehouse-using-an-amazon-dynamodb-single-table-design/

Amazon DynamoDB is a fully managed NoSQL service that delivers single-digit millisecond performance at any scale. It’s used by thousands of customers for mission-critical workloads. Typical use cases for DynamoDB are an ecommerce application handling a high volume of transactions, or a gaming application that needs to maintain scorecards for players and games. In traditional databases, we would model such applications using a normalized data model (entity-relation diagram). This approach comes with a heavy computational cost in terms of processing and distributing the data across multiple tables while ensuring the system is ACID-compliant at all times, which can negatively impact performance and scalability. If these entities are frequently queried together, it makes sense to store them in a single table in DynamoDB. This is the concept of single-table design. Storing different types of data in a single table allows you to retrieve multiple, heterogeneous item types using a single request. Such requests are relatively straightforward, and usually take the following form:

SELECT * FROM TABLE WHERE Some_Attribute = 'some_value'

In this format, Some_Attribute is a partition key or part of an index.
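For example, the same kind of request can be issued programmatically through the DynamoDB PartiQL API in boto3; the table name, attribute, and value below are placeholders:

import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Retrieve all items whose partition key attribute matches a given value
response = dynamodb.execute_statement(
    Statement='SELECT * FROM "SalesTable" WHERE Some_Attribute = ?',  # placeholder table and attribute
    Parameters=[{"S": "some_value"}],
)
for item in response["Items"]:
    print(item)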

Nonetheless, many of the same customers using DynamoDB would also like to be able to perform aggregations and ad hoc queries against their data to measure important KPIs that are pertinent to their business. Suppose we have a successful ecommerce application handling a high volume of sales transactions in DynamoDB. A typical ask for this data may be to identify sales trends as well as sales growth on a yearly, monthly, or even daily basis. These types of queries require complex aggregations over a large number of records. A key pillar of AWS’s modern data strategy is the use of purpose-built data stores for specific use cases to achieve performance, cost, and scale. Deriving business insights by identifying year-on-year sales growth is an example of an online analytical processing (OLAP) query. These types of queries are suited for a data warehouse.

The goal of a data warehouse is to enable businesses to analyze their data fast, so they can gain valuable insights in a timely manner. Amazon Redshift is a fully managed, scalable cloud data warehouse. Building a performant data warehouse is non-trivial, because the data needs to be highly curated to serve as a reliable and accurate version of the truth.

In this post, we walk through the process of exporting data from a DynamoDB table to Amazon Redshift. We discuss data model design for both NoSQL databases and SQL data warehouses. We begin with a single-table design as an initial state and build a scalable batch extract, load, and transform (ELT) pipeline to restructure the data into a dimensional model for OLAP workloads.

DynamoDB table example

We use an example of a successful ecommerce store allowing registered users to order products from their website. A simple ERD (entity-relation diagram) for this application will have four distinct entities: customers, addresses, orders, and products. For customers, we have information such as their unique user name and email address; for the address entity, we have one or more customer addresses. Orders contain information regarding the order placed, and the products entity provides information about the products placed in an order. As we can see from the following diagram, a customer can place one or more orders, and an order must contain one or more products.

We could store each entity in a separate table in DynamoDB. However, there is no way to retrieve customer details alongside all the orders placed by the customer without making multiple requests to the customer and order tables. This is inefficient from both a cost and performance perspective. A key goal for any efficient application is to retrieve all the required information in a single query request. This ensures fast, consistent performance. So how can we remodel our data to avoid making multiple requests? One option is to use single-table design. Taking advantage of the schema-less nature of DynamoDB, we can store different types of records in a single table in order to handle different access patterns in a single request. We can go further still and store different types of values in the same attribute and use it as a global secondary index (GSI). This is called index overloading.

A typical access pattern we may want to handle in our single table design is to get customer details and all orders placed by the customer.

To accommodate this access pattern, our single-table design looks like the following example.

By restricting the number of addresses associated with a customer, we can store address details as a complex attribute (rather than a separate item) without exceeding the 400 KB item size limit of DynamoDB.

We can add a global secondary index (GSIpk and GSIsk) to capture another access pattern: get order details and all product items placed in an order. We use the following table.

We have used generic attribute names, PK and SK, for our partition key and sort key columns. This is because they hold data from different entities. Furthermore, the values in these columns are prefixed by generic terms such as CUST# and ORD# to help us identify the type of data we have and ensure that the value in PK is unique across all records in the table.
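With this layout, retrieving a customer together with all of their orders becomes a single Query call on the partition key; the following is a minimal sketch, assuming the PK/SK conventions above and a placeholder table name:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb", region_name="us-east-1").Table("SalesTable")  # placeholder

# One request returns the customer item (SK = CUST#...) and all of its order items (SK = ORD#...)
response = table.query(
    KeyConditionExpression=Key("PK").eq("CUST#john_doe")  # placeholder customer key
)
items = response["Items"]
customer_items = [i for i in items if i["SK"].startswith("CUST#")]
order_items = [i for i in items if i["SK"].startswith("ORD#")]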

A well-designed single table will not only reduce the number of requests for an access pattern, but will service many different access patterns. The challenge comes when we need to ask more complex questions of our data, for example, what was the year-on-year quarterly sales growth by product broken down by country?

The case for a data warehouse

A data warehouse is ideally suited to answer OLAP queries. Built on highly curated structured data, it provides the flexibility and speed to run aggregations across an entire dataset to derive insights.

To house our data, we need to define a data model. An optimal design choice is to use a dimensional model. A dimension model consists of fact tables and dimension tables. Fact tables store the numeric information about business measures and foreign keys to the dimension tables. Dimension tables store descriptive information about the business facts to help understand and analyze the data better. From a business perspective, a dimension model with its use of facts and dimensions can present complex business processes in a simple-to-understand manner.

Building a dimensional model

A dimensional model optimizes read performance through efficient joins and filters. Amazon Redshift automatically chooses the best distribution style and sort key based on workload patterns. We build a dimensional model from the single DynamoDB table based on the following star schema.

We have separated each item type into individual tables. We have a single fact table (Orders) containing the business measures price and numberofitems, and foreign keys to the dimension tables. By storing the price of each product in the fact table, we can track price fluctuations in the fact table without continually updating the product dimension. (In a similar vein, the DynamoDB attribute amount is a simple derived measure in our star schema: amount is the summation of product prices per orderid).

By splitting the descriptive content of our single DynamoDB table into multiple Amazon Redshift dimension tables, we can remove redundancy by only holding in each dimension the information pertinent to it. This allows us the flexibility to query the data under different contexts; for example, we may want to know the frequency of customer orders by city or product sales by date. The ability to freely join dimensions and facts when analyzing the data is one of the key benefits of dimensional modeling. It’s also good practice to have a Date dimension to allow us to perform time-based analysis by aggregating the fact by year, month, quarter, and so forth.

This dimensional model will be built in Amazon Redshift. When setting out to build a data warehouse, it’s a common pattern to have a data lake as the source of the data warehouse. The data lake in this context serves a number of important functions:

  • It acts as a central source for multiple applications, not just exclusively for data warehousing purposes. For example, the same dataset could be used to build machine learning (ML) models to identify trends and predict sales.
  • It can store data as is, be it unstructured, semi-structured, or structured. This allows you to explore and analyze the data without committing upfront to what the structure of the data should be.
  • It can be used to offload historical or less-frequently-accessed data, allowing you to manage your compute and storage costs more effectively. In our analytic use case, if we are analyzing quarterly growth rates, we may only need a couple of years’ worth of data; the rest can be unloaded into the data lake.

When querying a data lake, we need to consider user access patterns in order to reduce costs and optimize query performance. This is achieved by partitioning the data. The choice of partition keys will depend on how you query the data. For example, if you query the data by customer or country, then they are good candidates for partition keys; if you query by date, then a date hierarchy can be used to partition the data.

After the data is partitioned, we want to ensure it’s held in the right format for optimal query performance. The recommended choice is to use a columnar format such as Parquet or ORC. Such formats are compressed and store data column-wise, allowing for fast retrieval times, and are parallelizable, allowing for fast load times when moving the data into Amazon Redshift. In our use case, it makes sense to store the data in a data lake with minimal transformation and formatting to enable easy querying and exploration of the dataset. We partition the data by item type (Customer, Order, Product, and so on), and because we want to easily query each entity in order to move the data into our data warehouse, we transform the data into the Parquet format.

Solution overview

The following diagram illustrates the data flow to export data from a DynamoDB table to a data warehouse.

We present a batch ELT solution using AWS Glue for exporting data stored in DynamoDB to an Amazon Simple Storage Service (Amazon S3) data lake and then a data warehouse built in Amazon Redshift. AWS Glue is a fully managed extract, transform, and load (ETL) service that allows you to organize, cleanse, validate, and format data for storage in a data warehouse or data lake.

The solution workflow has the following steps:

  1. Move any existing files from the raw and data lake buckets into corresponding archive buckets to ensure any fresh export from DynamoDB to Amazon S3 isn’t duplicating data.
  2. Begin a new DynamoDB export to the S3 raw layer.
  3. From the raw files, create a data lake partitioned by item type.
  4. Load the data from the data lake to landing tables in Amazon Redshift.
  5. After the data is loaded, we take advantage of the distributed compute capability of Amazon Redshift to transform the data into our dimensional model and populate the data warehouse.

We orchestrate the pipeline using an AWS Step Functions workflow and schedule a daily batch run using Amazon EventBridge.

For simpler DynamoDB table structures, you may consider skipping some of these steps by either loading data directly from DynamoDB to Amazon Redshift or using the Amazon Redshift auto-copy or COPY command to load data from Amazon S3.

Prerequisites

You must have an AWS account with a user who has programmatic access. For setup instructions, refer to AWS security credentials.

Use the AWS CloudFormation template cf_template_ddb-dwh-blog.yaml to launch the following resources:

  • A DynamoDB table with a GSI and point-in-time recovery enabled.
  • An Amazon Redshift cluster (we use two nodes of RA3.4xlarge).
  • Three AWS Glue database catalogs: raw, datalake, and redshift.
  • Five S3 buckets: two for the raw and data lake files, two for their respective archives, and one for the Amazon Athena query results.
  • Two AWS Identity and Access Management (IAM) roles: An AWS Glue role and a Step Functions role with the requisite permissions and access to resources.
  • A JDBC connection to Amazon Redshift.
  • An AWS Lambda function to retrieve the s3-prefix-list-id for your Region. This is required to allow traffic from a VPC to access an AWS service through a gateway VPC endpoint.
  • Download the following files to perform the ELT:
    • The Python script to load sample data into our DynamoDB table: load_dynamodb.py.
    • The AWS Glue Python Spark script to archive the raw and data lake files: archive_job.py.
    • The AWS Glue Spark scripts to extract and load the data from DynamoDB to Amazon Redshift: GlueSparkJobs.zip.
    • The DDL and DML SQL scripts to create the tables and load the data into the data warehouse in Amazon Redshift: SQL Scripts.zip.

Launch the CloudFormation template

AWS CloudFormation allows you to model, provision, and scale your AWS resources by treating infrastructure as code. We use the downloaded CloudFormation template to create a stack (with new resources).

  1. On the AWS CloudFormation console, create a new stack and select Template is ready.
  2. Upload the template file and choose Next.

  3. Enter a name for your stack.
  4. For MasterUserPassword, enter a password.
  5. Optionally, replace the default names for the Amazon Redshift database, DynamoDB table, and MasterUsername (in case these names are already in use).
  6. Review the details and acknowledge that AWS CloudFormation may create IAM resources on your behalf.
  7. Choose Create stack.

Load sample data into a DynamoDB table

To load your sample data into DynamoDB, complete the following steps:

  1. Create an AWS Cloud9 environment with default settings.
  2. Upload the load_dynamodb.py Python script. From the AWS Cloud9 terminal, use the pip install command to install the following packages:
    1. boto3
    2. faker
    3. faker_commerce
    4. numpy
  3. In the Python script, replace all placeholders (capital letters) with the appropriate values and run the following command in the terminal:
python load_dynamodb.py

This command loads the sample data into our single DynamoDB table.
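For reference, the core of such a loader typically generates fake entities and writes them with a batch writer. The following is only a sketch; the actual load_dynamodb.py defines its own item layout, so treat the table name, attribute names, and values here as illustrative:

import uuid

import boto3
from faker import Faker

fake = Faker()
table = boto3.resource("dynamodb", region_name="us-east-1").Table("SalesTable")  # placeholder table name

with table.batch_writer() as batch:
    username = fake.user_name()
    orderid = str(uuid.uuid4())

    # Customer item, with addresses held as a complex attribute
    batch.put_item(Item={
        "PK": f"CUST#{username}",
        "SK": f"CUST#{username}",
        "type": "customer",
        "username": username,
        "email": fake.email(),
        "fullname": fake.name(),
        "addresses": [{
            "buildingnumber": fake.building_number(),
            "postcode": fake.postcode(),
            "country": fake.country(),
        }],
    })

    # Order item for the same customer, sharing the partition key
    batch.put_item(Item={
        "PK": f"CUST#{username}",
        "SK": f"ORD#{orderid}",
        "type": "order",
        "orderid": orderid,
        "orderdate": fake.iso8601(),
        "numberofitems": "2",
    })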

Extract data from DynamoDB

To extract the data from DynamoDB to our S3 data lake, we use the new AWS Glue DynamoDB export connector. Unlike the old connector, the new version uses a snapshot of the DynamoDB table and doesn’t consume read capacity units of your source DynamoDB table. For large DynamoDB tables exceeding 100 GB, the read performance of the new AWS Glue DynamoDB export connector is not only consistent but also significantly faster than the previous version.

To use this new export connector, you need to enable point-in-time recovery (PITR) for the source DynamoDB table in advance. This will take continuous backups of the source table (so be mindful of cost) and ensures that each time the connector invokes an export, the data is fresh. The time it takes to complete an export depends on the size of your table and how uniformly the data is distributed therein. This can range from a few minutes for small tables (up to 10 GiB) to a few hours for larger tables (up to a few terabytes). This is not a concern for our use case because data lakes and data warehouses are typically used to aggregate data at scale and generate daily, weekly, or monthly reports. It’s also worth noting that each export is a full refresh of the data, so in order to build a scalable automated data pipeline, we need to archive the existing files before beginning a fresh export from DynamoDB.

Complete the following steps:

  1. Create an AWS Glue job using the Spark script editor.
  2. Upload the archive_job.py file from GlueSparkJobs.zip.

This job archives the data files into timestamped folders. We run the job concurrently to archive the raw files and the data lake files.

  3. In the Job details section, give the job a name and choose the AWS Glue IAM role created by our CloudFormation template.
  4. Keep all defaults the same and ensure maximum concurrency is set to 2 (under Advanced properties).

Archiving the files provides a backup option in the event of disaster recovery. As such, we can assume that the files will not be accessed frequently and can be kept in the S3 Standard-IA storage class to save up to 40% on costs while still providing rapid access to the files when needed.

This job typically runs before each export of data from DynamoDB. After the datasets have been archived, we’re ready to (re)-export the data from our DynamoDB table.

We can use AWS Glue Studio to visually create the jobs needed to extract the data from DynamoDB and load into our Amazon Redshift data warehouse. We demonstrate how to do this by creating an AWS Glue job (called ddb_export_raw_job) using AWS Glue Studio.

  1. In AWS Glue Studio, create a job and select Visual with a blank canvas.
  2. Choose Amazon DynamoDB as the data source.

  3. Choose our DynamoDB table to export from.
  4. Leave all other options as is and finish setting up the source connection.

We then choose Amazon S3 as our target. In the target properties, we can transform the output to a suitable format, apply compression, and specify the S3 location to store our raw data.

  5. Set the following options:
    1. For Format, choose Parquet.
    2. For Compression type, choose Snappy.
    3. For S3 Target Location, enter the path for RawBucket (located on the Outputs tab of the CloudFormation stack).
    4. For Database, choose the value for GlueRawDatabase (from the CloudFormation stack output).
    5. For Table name, enter an appropriate name.

  6. Because our target data warehouse requires data to be in a flat structure, verify that the configuration option dynamodb.unnestDDBJson is set to True on the Script tab.

  7. On the Job details tab, choose the AWS Glue IAM role generated by the CloudFormation template.
  8. Save and run the job.

Depending on the data volumes being exported, this job may take a few minutes to complete.
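For reference, the script behind a job like ddb_export_raw_job is roughly equivalent to the following sketch; the option names are those documented for the AWS Glue DynamoDB export connector, while the table ARN, bucket, and catalog names are placeholders:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read a snapshot of the DynamoDB table using the export connector (requires PITR)
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.export": "ddb",
        "dynamodb.tableArn": "arn:aws:dynamodb:us-east-1:111122223333:table/SalesTable",  # placeholder
        "dynamodb.unnestDDBJson": True,
        "dynamodb.s3.bucket": "my-export-staging-bucket",  # placeholder staging bucket for the export
        "dynamodb.s3.prefix": "ddb-export/",               # placeholder prefix
    },
)

# Write the flattened records to the raw S3 layer as Snappy-compressed Parquet,
# registering the table in the Data Catalog raw database
sink = glueContext.getSink(
    connection_type="s3",
    path="s3://my-raw-bucket/ddb_export/",  # placeholder: RawBucket from the CloudFormation outputs
    compression="snappy",
    enableUpdateCatalog=True,
)
sink.setCatalogInfo(catalogDatabase="raw", catalogTableName="ddb_export_raw")  # placeholder names
sink.setFormat("glueparquet")
sink.writeFrame(dyf)

job.commit()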

Because we’ll be adding the table to our AWS Glue Data Catalog, we can explore the output using Athena after the job is complete. Athena is a serverless interactive query service that makes it simple to analyze data directly in Amazon S3 using standard SQL.

  1. In the Athena query editor, choose the raw database.

We can see that the attributes of the Address structure have been unnested and added as additional columns to the table.
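For example, a quick exploratory query against the raw table can be run from the Athena console or programmatically; the database, table, column names, and output location below are placeholders:

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Preview a few rows of the flattened raw table
response = athena.start_query_execution(
    QueryString="SELECT pk, sk, type, orderid, orderdate FROM ddb_export_raw LIMIT 10",  # placeholder columns/table
    QueryExecutionContext={"Database": "raw"},                          # placeholder database name
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder results bucket
)
print(response["QueryExecutionId"])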

  1. After we export the data into the raw bucket, create another job (called raw_to_datalake_job) using AWS Glue Studio (select Visual with a blank canvas) to load the data lake partitioned by item type (customer, order, and product).
  2. Set the source as the AWS Glue Data Catalog raw database and table.

  3. In the ApplyMapping transformation, drop the Address struct because we have already unnested these attributes into our flattened raw table.

  4. Set the target as our S3 data lake.

  5. Choose the AWS Glue IAM role in the job details, then save and run the job.

Now that we have our data lake, we’re ready to build our data warehouse.

Build the dimensional model in Amazon Redshift

The CloudFormation template launches a two-node RA3.4xlarge Amazon Redshift cluster. To build the dimensional model, complete the following steps:

  1. In Amazon Redshift Query Editor V2, connect to your database (default: salesdwh) within the cluster using the database user name and password authentication (MasterUserName and MasterUserPassword from the CloudFormation template).
  2. You may be asked to configure your account if this is your first time using Query Editor V2.
  3. Download the SQL scripts SQL Scripts.zip to create the following schemas and tables (run the scripts in numbered sequence).

In the landing schema:

  • address
  • customer
  • order
  • product

In the staging schema:

  • staging.address
  • staging.address_maxkey
  • staging.addresskey
  • staging.customer
  • staging.customer_maxkey
  • staging.customerkey
  • staging.date
  • staging.date_maxkey
  • staging.datekey
  • staging.order
  • staging.order_maxkey
  • staging.orderkey
  • staging.product
  • staging.product_maxkey
  • staging.productkey

In the dwh schema:

  • dwh.address
  • dwh.customer
  • dwh.order
  • dwh.product

We load the data from our data lake to the landing schema as is.

  1. Use the JDBC connector to Amazon Redshift to build an AWS Glue crawler to add the landing schema to our Data Catalog under the ddb_redshift database.

  2. Create an AWS Glue crawler with the JDBC data source.

  3. Select the JDBC connection you created and choose Next.

  4. Choose the IAM role created by the CloudFormation template and choose Next.

  5. Review your settings before creating the crawler.

The crawler adds the four landing tables in our AWS Glue database ddb_redshift.

  1. In AWS Glue Studio, create four AWS Glue jobs to load the landing tables (these scripts are available to download, and you can use the Spark script editor to upload these scripts individually to create the jobs):
    1. land_order_job
    2. land_product_job
    3. land_customer_job
    4. land_address_job

Each job has the structure as shown in the following screenshot.

  2. Filter the S3 source on the partition column type:
    1. For product, filter on type='product'.
    2. For order, filter on type='order'.
    3. For customer and address, filter on type='customer'.

  3. Set the target for the data flow as the corresponding table in the landing schema in Amazon Redshift.
  4. Use the built-in ApplyMapping transformation in our data pipeline to drop columns and, where necessary, convert the data types to match the target columns.

For more information about built-in transforms available in AWS Glue, refer to AWS Glue PySpark transforms reference.

The mappings for our four jobs are as follows:

  • land_order_job:
    mappings=[
    ("pk", "string", "pk", "string"),
    ("orderid", "string", "orderid", "string"),
    ("numberofitems", "string", "numberofitems", "int"),
    ("orderdate", "string", "orderdate", "timestamp"),
    ]

  • land_product_job:
    mappings=[
    ("orderid", "string", "orderid", "string"),
    ("category", "string", "category", "string"),
    ("price", "string", "price", "decimal"),
    ("productname", "string", "productname", "string"),
    ("productid", "string", "productid", "string"),
    ("color", "string", "color", "string"),
    ]

  • land_address_job:
    mappings=[
    ("username", "string", "username", "string"),
    ("email", "string", "email", "string"),
    ("fullname", "string", "fullname", "string"),
    ]

  • land_customer_job:
    mappings=[
    ("username", "string", "username", "string"),
    ("email", "string", "email", "string"),
    ("fullname", "string", "fullname", "string"),
    ]

  5. Choose the AWS Glue IAM role, and under Advanced properties, verify the JDBC connector to Amazon Redshift as a connection.
  6. Save and run each job to load the landing tables in Amazon Redshift (a sketch of one such job follows).
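As an illustration, land_order_job might look roughly like the following sketch, assuming the filter and mapping described above; the Data Catalog database and table names, and the temporary directory, are placeholders:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping, Filter
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the data lake source from the Data Catalog
source = glueContext.create_dynamic_frame.from_catalog(
    database="datalake",          # placeholder Data Catalog database
    table_name="datalake_table",  # placeholder table
)

# Keep only items of type 'order' (the data lake partition column)
orders = Filter.apply(frame=source, f=lambda row: row["type"] == "order")

# Map and cast the columns to match landing.order
mapped = ApplyMapping.apply(
    frame=orders,
    mappings=[
        ("pk", "string", "pk", "string"),
        ("orderid", "string", "orderid", "string"),
        ("numberofitems", "string", "numberofitems", "int"),
        ("orderdate", "string", "orderdate", "timestamp"),
    ],
)

# Write to the Amazon Redshift landing table through the catalog table added by the crawler
glueContext.write_dynamic_frame.from_catalog(
    frame=mapped,
    database="ddb_redshift",
    table_name="salesdwh_landing_order",               # placeholder catalog table name
    redshift_tmp_dir="s3://my-temp-bucket/redshift/",  # placeholder temporary directory
)

job.commit()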

Populate the data warehouse

From the landing schema, we move the data to the staging layer and apply the necessary transformations. Our dimensional model has a single fact table, the orders table, which is the largest table and as such needs a distribution key. The choice of key depends on how the data is queried and the size of the dimension tables being joined to. If you’re unsure of your query patterns, you can leave the distribution keys and sort keys for your tables unspecified. Amazon Redshift automatically assigns the correct distribution and sort keys based on your queries. This has the advantage that if and when query patterns change over time, Amazon Redshift can automatically update the keys to reflect the change in usage.

In the staging schema, we keep track of existing records based on their business key (the unique identifier for the record). We create key tables to generate a numeric identity column for each table based on the business key. These key tables allow us to implement an incremental transformation of the data into our dimensional model.

CREATE TABLE IF NOT EXISTS staging.productkey ( 
    productkey integer identity(1,1), 
    productid character varying(16383), 
    CONSTRAINT products_pkey PRIMARY KEY(productkey));   

When loading the data, we need to keep track of the latest surrogate key value to ensure that new records are assigned the correct increment. We do this using maxkey tables (pre-populated with zero):

CREATE TABLE IF NOT EXISTS staging.product_maxkey ( 
    productmaxkey integer);

INSERT INTO staging.product_maxkey
select 0;    

We use staging tables to store our incremental load, the structure of which will mirror our final target model in the dwh schema:

---staging tables to load data from data lake 
   
CREATE TABLE IF NOT EXISTS staging.product ( 
    productkey integer,
    productname character varying(200), 
    color character varying(50), 
    category character varying(100),
    PRIMARY KEY (productkey));
---dwh tables to load data from staging schema
     
CREATE TABLE IF NOT EXISTS dwh.product ( 
    productkey integer,
    productname character varying(200), 
    color character varying(50), 
    category character varying(100),
    PRIMARY KEY (productkey)); 

Incremental processing in the data warehouse

We load the target data warehouse using stored procedures to perform upserts (deletes and inserts performed in a single transaction):

CREATE OR REPLACE PROCEDURE staging.load_order() LANGUAGE plpgsql AS $$
DECLARE
BEGIN

TRUNCATE TABLE staging.order;

--insert new records to get new ids
insert into staging.orderkey
(
orderid
)
select
c.orderid
from landing.order c
LEFT JOIN staging.orderkey i
ON c.orderid=i.orderid
where i.orderid IS NULL;

--update the max key
update staging.order_maxkey
set ordermaxkey = (select max(orderkey) from staging.orderkey);


insert into staging.order
(
orderkey,
customerkey,
productkey,
addresskey,
datekey,
numberofitems,
price
)
select
xid.orderkey,
cid.customerkey,
pid.productkey,
aid.addresskey,
d.datekey,
o.numberofitems,
p.price
from
landing.order o
join staging.orderkey xid on o.orderid=xid.orderid
join landing.customer c on substring(o.pk,6,length(o.pk))=c.username   ---order table needs username
join staging.customerkey cid on cid.username=c.username
join landing.address a on a.username=c.username
join staging.addresskey aid on aid.pk=a.buildingnumber::varchar||'||'||a.postcode  ---maybe change pk to addressid
join staging.datekey d on d.orderdate=o.orderdate
join landing.product p on p.orderid=o.orderid
join staging.productkey pid on pid.productid=p.productid;

COMMIT;

END;
$$;
CREATE OR REPLACE PROCEDURE dwh.load_order() LANGUAGE plpgsql AS $$
DECLARE
BEGIN

---delete old records 
delete from dwh.order
using staging.order as stage
where dwh.order.orderkey=stage.orderkey;

--insert new and modified
insert into dwh.order
(
orderkey,
customerkey,  
productkey,
addresskey,
price,
datekey  
)
select
orderkey,
customerkey,  
productkey,
addresskey,
price,
datekey
from staging.order;

COMMIT;
END;
$$;

Use Step Functions to orchestrate the data pipeline

So far, we have stepped through each component in our workflow. We now need to stitch them together to build an automated, idempotent data pipeline. A good orchestration tool must manage failures, retries, parallelization, service integrations, and observability, so developers can focus solely on the business logic. Ideally, the workflow we build is also serverless so there is no operational overhead. Step Functions is an ideal choice to automate our data pipeline. It allows us to integrate the ELT components we have built on AWS Glue and Amazon Redshift and conduct some steps in parallel to optimize performance.

  1. On the Step Functions console, create a new state machine.
  2. Select Write your workflow in code.

  3. Enter the stepfunction_workflow.json code into the definition, replacing all placeholders with the appropriate values:
    1. [REDSHIFT-CLUSTER-IDENTIFIER] – Use the value for ClusterName (from the Outputs tab in the CloudFormation stack).
    2. [REDSHIFT-DATABASE] – Use the value for salesdwh (unless changed, this is the default database in the CloudFormation template).

We use the Step Functions IAM role from the CloudFormation template.

This JSON code generates the following pipeline.

Starting from the top, the workflow contains the following steps:

  1. We archive any existing raw and data lake files.
  2. We add two AWS Glue StartJobRun tasks that run sequentially: first to export the data from DynamoDB to our raw bucket, then from the raw bucket to our data lake.
  3. After that, we parallelize the landing of data from Amazon S3 to Amazon Redshift.
  4. We transform and load the data into our data warehouse using the Amazon Redshift Data API. Because this is asynchronous, we need to check the status of the runs before moving down the pipeline (a sketch of this Data API pattern follows the list).
  5. After we move the data load from landing to staging, we truncate the landing tables.
  6. We load the dimensions of our target data warehouse (dwh) first, and finally we load our single fact table with its foreign key dependency on the preceding dimension tables.
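The following sketch shows the equivalent of step 4 outside of Step Functions, using the Redshift Data API from boto3; the cluster identifier, database, and user are placeholders taken from the CloudFormation parameters:

import time

import boto3

client = boto3.client("redshift-data", region_name="us-east-1")

# Call one of the stored procedures and wait for it to finish
resp = client.execute_statement(
    ClusterIdentifier="my-redshift-cluster",  # placeholder: ClusterName from the stack outputs
    Database="salesdwh",
    DbUser="awsuser",                         # placeholder: MasterUserName
    Sql="CALL staging.load_order();",
)
statement_id = resp["Id"]

# The Data API is asynchronous, so poll the statement status before moving on
while True:
    desc = client.describe_statement(Id=statement_id)
    status = desc["Status"]
    if status in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(5)

if status != "FINISHED":
    raise RuntimeError(f"Statement {statement_id} ended with status {status}: {desc.get('Error')}")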

The following figure illustrates a successful run.

After we set up the workflow, we can use EventBridge to schedule a daily midnight run, where the target is a Step Functions StartExecution API calling our state machine. Under the workflow permissions, choose Create a new role for this schedule and optionally rename it.
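If you prefer to create the schedule programmatically rather than on the console, a sketch with boto3 might look like the following; the state machine ARN and the role ARN (which must allow states:StartExecution) are placeholders:

import boto3

events = boto3.client("events", region_name="us-east-1")

# Run the pipeline every day at midnight UTC
events.put_rule(
    Name="daily-ddb-to-redshift-elt",
    ScheduleExpression="cron(0 0 * * ? *)",
    State="ENABLED",
)

events.put_targets(
    Rule="daily-ddb-to-redshift-elt",
    Targets=[
        {
            "Id": "elt-state-machine",
            "Arn": "arn:aws:states:us-east-1:111122223333:stateMachine:ddb-elt",          # placeholder
            "RoleArn": "arn:aws:iam::111122223333:role/EventBridgeInvokeStepFunctions",  # placeholder
        }
    ],
)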

Query the data warehouse

We can verify the data has been successfully loaded into Amazon Redshift with a query.

After we have the data loaded into Amazon Redshift, we’re ready to answer the query asked at the start of this post: what is the year-on-year quarterly sales growth by product and country? The query looks like the following code (depending on your dataset, you may need to select alternative years and quarters):

with sales2021q2
as
(
  select d.year, d.quarter,a.country,p.category,sum(o.price) as revenue2021q2
  from dwh.order o
  join dwh.date d on o.datekey=d.datekey
  join dwh.product p on o.productkey=p.productkey
  join dwh.address a on a.addresskey=o.addresskey
  where d.year=2021 and d.quarter=2
  group by d.year, d.quarter,a.country,p.category
  ),
sales2022q2
as
(
  select d.year, d.quarter,a.country,p.category,sum(o.price) as revenue2022q2
  from dwh.order o
  join dwh.date d on o.datekey=d.datekey
  join dwh.product p on o.productkey=p.productkey
  join dwh.address a on a.addresskey=o.addresskey
  where d.year=2022 and d.quarter=2
  group by d.year, d.quarter,a.country,p.category
  )

select a.country,a.category, ((revenue2022q2 - revenue2021q2)/revenue2021q2)*100 as quarteronquartergrowth
from sales2022q2 a
join sales2021q2 b on a.country=b.country and a.category=b.category
order by a.country,a.category

We can visualize the results in Amazon Redshift Query Editor V2 by toggling the chart option and setting Type as Pie, Values as quarteronquartergrowth, and Labels as category.

Cost considerations

We give a brief outline of the indicative costs associated with the key services covered in our solution based on us-east-1 Region pricing using the AWS Pricing Calculator:

  • DynamoDB – With on-demand settings for 1.5 million items (average size of 355 bytes) and associated write and read capacity plus PITR storage, the cost of DynamoDB is approximately $2 per month.
  • AWS Glue DynamoDB export connector – This connector utilizes the DynamoDB export to Amazon S3 feature. This has no hourly cost—you only pay for the gigabytes of data exported to Amazon S3 ($0.11 per GiB).
  • Amazon S3 – You pay for storing objects in your S3 buckets. The rate you’re charged depends on your objects’ size, how long you stored the objects during the month, and the storage class. In our solution, we used S3 Standard for our data lake and S3 Standard – Infrequent Access for archive. Standard-IA storage is $0.0125 per GB/month; Standard storage is $0.023 per GB/month.
  • AWS Glue Jobs – With AWS Glue, you only pay for the time your ETL job takes to run. There are no resources to manage, no upfront costs, and you are not charged for startup or shutdown time. AWS charges you an hourly rate based on the number of Data Processing Units (DPUs) used to run your ETL job. A single DPU provides 4 vCPU and 16 GB of memory. Every one of our nine Spark jobs uses 10 DPUs and has an average runtime of 3 minutes. This gives an approximate cost of $0.29 per job.
  • Amazon Redshift – We provisioned two RA3.4xlarge nodes for our Amazon Redshift cluster. If run on-demand, each node costs $3.26 per hour. If utilized 24/7, our monthly cost would be approximately $4,759.60. You should evaluate your workload to determine what cost savings can be achieved by using Amazon Redshift Serverless or using Amazon Redshift provisioned reserved instances.
  • Step Functions – You are charged based on the number of state transitions required to run your application. Step Functions counts a state transition as each time a step of your workflow is run. You’re charged for the total number of state transitions across all your state machines, including retries. The Step Functions free tier includes 4,000 free state transitions per month. Thereafter, it’s $0.025 per 1,000 state transitions.

Clean up

Remember to delete any resources created through the CloudFormation stack. You first need to manually empty and delete the S3 buckets. Then you can delete the CloudFormation stack using the AWS CloudFormation console or AWS Command Line Interface (AWS CLI). For instructions, refer to Clean up your “hello, world!” application and related resources.

Summary

In this post, we demonstrated how you can export data from DynamoDB to Amazon S3 and Amazon Redshift to perform advanced analytics. We built an automated data pipeline that you can use to perform a batch ELT process that can be scheduled to run daily, weekly, or monthly and can scale to handle very large workloads.

Please leave your feedback or comments in the comments section.


About the Author

Altaf Hussain is an Analytics Specialist Solutions Architect at AWS. He helps customers around the globe design and optimize their big data and data warehousing solutions.


Appendix

To extract the data from DynamoDB and load it into our Amazon Redshift database, we can use the Spark script editor and upload the files from GlueSparkJobs.zip to create each individual job necessary to perform the extract and load. If you choose to do this, remember to update, where appropriate, the account ID and Region placeholders in the scripts. Also, on the Job details tab under Advanced properties, add the Amazon Redshift connection.

Migrate from Google BigQuery to Amazon Redshift using AWS Glue and Custom Auto Loader Framework

Post Syndicated from Tahir Aziz original https://aws.amazon.com/blogs/big-data/migrate-from-google-bigquery-to-amazon-redshift-using-aws-glue-and-custom-auto-loader-framework/

Amazon Redshift is a widely used, fully managed, petabyte-scale cloud data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytic workloads. Customers are looking for tools that make it easier to migrate from other data warehouses, such as Google BigQuery, to Amazon Redshift to take advantage of the service price-performance, ease of use, security, and reliability.

In this post, we show you how to use AWS native services to accelerate your migration from Google BigQuery to Amazon Redshift. We use AWS Glue, a fully managed, serverless, ETL (extract, transform, and load) service, and the Google BigQuery Connector for AWS Glue (for more information, refer to Migrating data from Google BigQuery to Amazon S3 using AWS Glue custom connectors). We also add automation and flexibility to simplify migration of multiple tables to Amazon Redshift using the Custom Auto Loader Framework.

Solution overview

The solution provides a scalable and managed data migration workflow to migrate data from Google BigQuery to Amazon Simple Storage Service (Amazon S3), and then from Amazon S3 to Amazon Redshift. This pre-built solution scales to load data in parallel using input parameters.

The following architecture diagram shows how the solution works. It starts with setting up the migration configuration to connect to Google BigQuery, then converts the database schemas, and finally migrates the data to Amazon Redshift.


The workflow contains the following steps:

  1. A configuration file is uploaded to an S3 bucket you have chosen for this solution. This JSON file contains the migration metadata, namely the following:
    • A list of Google BigQuery projects and datasets.
    • A list of all tables to be migrated for each project and dataset pair.
  2. An Amazon EventBridge rule triggers an AWS Step Functions state machine to start migrating the tables.
  3. The Step Functions state machine iterates on the tables to be migrated and runs an AWS Glue Python shell job to extract the metadata from Google BigQuery and store it in an Amazon DynamoDB table used for tracking the tables’ migration status.
  4. The state machine iterates on the metadata from this DynamoDB table to run the table migration in parallel, based on the maximum number of migration jobs without incurring limits or quotas on Google BigQuery. It performs the following steps:
    • Runs the AWS Glue migration job for each table in parallel.
    • Tracks the run status in the DynamoDB table.
    • After the tables have been migrated, checks for errors and exits.
  5. The data exported from Google BigQuery is saved to Amazon S3. We use Amazon S3 (even though AWS Glue jobs can write directly to Amazon Redshift tables) for a few specific reasons:
    • We can decouple the data migration and the data load steps.
    • It offers more control on the load steps, with the ability to reload the data or pause the process.
    • It provides fine-grained monitoring of the Amazon Redshift load status.
  6. The Custom Auto Loader Framework automatically creates schemas and tables in the target database and continuously loads data from Amazon S3 to Amazon Redshift.

A few additional points to note:

  • If you have already created the target schema and tables in the Amazon Redshift database, you can configure the Custom Auto Loader Framework to not automatically detect and convert the schema.
  • If you want more control over converting the Google BigQuery schema, you can use the AWS Schema Conversion Tool (AWS SCT). For more information, refer to Migrate Google BigQuery to Amazon Redshift using AWS Schema Conversion tool (SCT).
  • As of this writing, neither the AWS SCT nor the Custom Auto Loader Framework supports the conversion of nested data types (record, array, and struct). Amazon Redshift supports semistructured data using the SUPER data type, so if your table uses such complex data types, you need to create the target tables manually.

To deploy the solution, there are two main steps:

  1. Deploy the solution stack using AWS CloudFormation.
  2. Deploy and configure Custom Auto Loader Framework to load files from Amazon S3 to Amazon Redshift.

Prerequisites

Before getting started, make sure you complete the following prerequisites:

    1. Prepare a configuration file with the migration metadata. In this example, we named the file bq-mig-config.json.
    2. Configure your Google account.
    3. Create an IAM role for AWS Glue (and note down the name of the IAM role).
    4. Subscribe to and activate the Google BigQuery Connector for AWS Glue.

Deploy the solution using AWS CloudFormation

To deploy the solution stack using AWS CloudFormation, complete the following steps:

  1. Choose Launch Stack:

This template provisions the AWS resources in the us-east-1 Region. If you want to deploy to a different Region, download the template bigquery-cft.yaml and launch it manually: on the AWS CloudFormation console, choose Create stack with new resources and upload the template file you downloaded.

The list of provisioned resources is as follows:

    • An EventBridge rule to start the Step Functions state machine on the upload of the configuration file.
    • A Step Functions state machine that runs the migration logic. The following diagram illustrates the state machine.
    • An AWS Glue Python shell job used to extract the metadata from Google BigQuery. The metadata will be stored in a DynamoDB table, with a calculated attribute to prioritize the migration job. By default, the connector creates one partition per 400 MB in the table being read (before filtering). As of this writing, the Google BigQuery Storage API has a maximum quota for parallel read streams, so we set the limit for worker nodes for tables larger than 400 GB. We also calculate the max number of jobs that can run in parallel based on those values.
    • An AWS Glue ETL job used to extract the data from each Google BigQuery table and save it in Amazon S3 in Parquet format.
    • A DynamoDB table (bq_to_s3_tracking) used to store the metadata for each table to be migrated (size of the table, S3 path used to store the migrated data, and the number of workers needed to migrate the table).
    • A DynamoDB table (bq_to_s3_maxstreams) used to store the maximum number of streams per state machine run. This helps us minimize job failures due to limits or quotas. Use the CloudFormation template to customize the name of the DynamoDB table. The prefix for the DynamoDB table is bq_to_s3.
    • The IAM roles needed by the state machine and AWS Glue jobs.
  2. Choose Next.


  3. For Stack name, enter a name.
  4. For Parameters, enter the parameters listed below, then choose Create.
  • InputBucketName (S3 bucket name) – The S3 bucket where the AWS Glue job stores the migrated data. The data will actually be stored in a folder named s3-redshift-loader-source, which is used by the Custom Auto Loader Framework.
  • InputConnectionName (AWS Glue connection name; the default is glue-bq-connector-24) – The name of the AWS Glue connection that is created using the Google BigQuery connector.
  • InputDynamoDBTablePrefix (DynamoDB table name prefix; the default is bq_to_s3) – The prefix that will be used when naming the two DynamoDB tables created by the solution.
  • InputGlueETLJob (AWS Glue ETL job name; the default is bq-migration-ETL) – The name you want to give to the AWS Glue ETL job. The actual script is saved in the S3 path specified in the parameter InputGlueS3Path.
  • InputGlueMetaJob (AWS Glue Python shell job name; the default is bq-get-metadata) – The name you want to give to the AWS Glue Python shell job. The actual script is saved in the S3 path specified in the parameter InputGlueS3Path.
  • InputGlueS3Path (S3 path; the default is s3://aws-glue-scripts-${AWS::Account}-${AWS::Region}/admin/) – The S3 path in which the stack will copy the scripts for the AWS Glue jobs. Remember to replace ${AWS::Account} with the actual AWS account ID and ${AWS::Region} with the Region you plan to use, or provide your own bucket and prefix in a complete path.
  • InputMaxParallelism (number of parallel migration jobs to run; the default is 30) – The maximum number of tables you want to migrate concurrently.
  • InputBQSecret (AWS Secrets Manager secret name) – The name of the AWS Secrets Manager secret in which you stored the Google BigQuery credential.
  • InputBQProjectName (Google BigQuery project name) – The name of your project in Google BigQuery in which you want to store temporary tables; you will need write permissions on the project.
  • StateMachineName (Step Functions state machine name; the default is bq-to-s3-migration-state-machine) – The name of the Step Functions state machine.
  • SourceS3BucketName (S3 bucket name; the default is aws-blogs-artifacts-public) – The S3 bucket where the artifacts for this post are stored. Do not change the default.
Deploy and configure the Custom Auto Loader Framework to load files from Amazon S3 to Amazon Redshift

The Custom Auto Loader Framework utility makes data ingestion to Amazon Redshift simpler and automatically loads data files from Amazon S3 to Amazon Redshift. The files are mapped to the respective tables by simply dropping files into preconfigured locations on Amazon S3. For more details about the architecture and internal workflow, refer to Custom Auto Loader Framework.

To set up the Custom Auto Loader Framework, complete the following steps:

  1. Choose Launch Stack to deploy the CloudFormation stack in the us-east-1 Region:

  2. On the AWS CloudFormation console, choose Next.
  3. Provide the following parameters to help ensure the successful creation of resources. Make sure you have collected these values beforehand.
  • CopyCommandSchedule (for example, cron(0/5 * ? * * *)) – The EventBridge rules KickoffFileProcessingSchedule and QueueRSProcessingSchedule are triggered based on this schedule. The default is 5 minutes.
  • DatabaseName (for example, dev) – The Amazon Redshift database name.
  • DatabaseSchemaName (for example, public) – The Amazon Redshift schema name.
  • DatabaseUserName (for example, demo) – The Amazon Redshift user name who has access to run COPY commands on the Amazon Redshift database and schema.
  • RedshiftClusterIdentifier (for example, democluster) – The Amazon Redshift cluster name.
  • RedshiftIAMRoleARN (for example, arn:aws:iam::7000000000:role/RedshiftDemoRole) – The Amazon Redshift cluster attached role, which has access to the S3 bucket. This role is used in COPY commands.
  • SourceS3Bucket (your bucket name) – The S3 bucket where data is located. Use the same bucket you used to store the migrated data as indicated in the previous stack.
  • CopyCommandOptions (for example, delimiter '|' gzip) – Provide the additional COPY command data format parameters, for example: delimiter '|' dateformat 'auto' TIMEFORMAT 'auto'.
  • InitiateSchemaDetection (Yes) – The setting to dynamically detect the schema prior to file upload.

The following screenshot shows an example of our parameters.


  4. Choose Create.
  5. Monitor the progress of the stack creation and wait until it is complete.
  6. To verify the Custom Auto Loader Framework configuration, log in to the Amazon S3 console and navigate to the S3 bucket you provided as a value to the SourceS3Bucket parameter.

You should see that a new directory called s3-redshift-loader-source has been created.


Test the solution

To test the solution, complete the following steps:

  1. Create the configuration file based on the prerequisites. You can also download the demo file.
  2. To set up the S3 bucket, on the Amazon S3 console, navigate to the folder bq-mig-config in the bucket you provided in the stack.
  3. Upload the config file into it.
  4. To enable EventBridge notifications to the bucket, open the bucket on the console and on the Properties tab, locate Event notifications.
  5. In the Amazon EventBridge section, choose Edit.
  6. Select On, then choose Save changes. (A scripted alternative to steps 3-6 is shown in the sketch after this list.)
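A scripted equivalent of steps 3-6 might look like the following sketch; the bucket name and object key are placeholders, and the contents of bq-mig-config.json come from the prerequisites:

import boto3

s3 = boto3.client("s3", region_name="us-east-1")
bucket = "my-input-bucket"  # placeholder: the InputBucketName you provided to the stack

# Enable Amazon EventBridge notifications for the bucket first, so the upload
# event can trigger the Step Functions state machine
s3.put_bucket_notification_configuration(
    Bucket=bucket,
    NotificationConfiguration={"EventBridgeConfiguration": {}},
)

# Upload the migration configuration file into the bq-mig-config folder (placeholder key)
s3.upload_file("bq-mig-config.json", bucket, "bq-mig-config/bq-mig-config.json")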

  7. On the AWS Step Functions console, monitor the run of the state machine.
  8. Monitor the status of the loads in Amazon Redshift. For instructions, refer to Viewing Current Loads.
  9. Open the Amazon Redshift Query Editor V2 and query your data.

Pricing considerations

You might have egress charges for migrating data out of Google BigQuery into Amazon S3. Review and calculate the cost for moving your data on your Google cloud billing console. As of this writing, AWS Glue 3.0 or later charges $0.44 per DPU-hour, billed per second, with a 1-minute minimum for Spark ETL jobs. For more information, see AWS Glue Pricing. With auto scaling enabled, AWS Glue automatically adds and removes workers from the cluster depending on the parallelism at each stage or microbatch of the job run.

Clean up

To avoid incurring future charges, clean up your resources:

  1. Delete the CloudFormation solution stack.
  2. Delete the CloudFormation Custom Auto Loader Framework stack.

Conclusion

In this post, we demonstrated how to build a scalable and automated data pipeline to migrate your data from Google BigQuery to Amazon Redshift. We also highlighted how the Custom Auto Loader Framework can automate the schema detection, create tables for your S3 files, and continuously load the files into your Amazon Redshift warehouse. With this approach, you can automate the migration of entire projects (even multiple projects at a time) in Google BigQuery to Amazon Redshift. This helps significantly improve data migration times into Amazon Redshift through the automatic parallelization of table migration.

The auto-copy feature in Amazon Redshift (in preview) simplifies automatic data loading from Amazon S3: with a simple SQL command, users can automate data ingestion from Amazon S3 to Amazon Redshift.

For more information about the performance of the Google BigQuery Connector for AWS Glue, refer to Migrate terabytes of data quickly from Google Cloud to Amazon S3 with AWS Glue Connector for Google BigQuery and learn how to migrate a large amount of data (1.9 TB) into Amazon S3 quickly (about 8 minutes).

To learn more about AWS Glue ETL jobs, see Simplify data pipelines with AWS Glue automatic code generation and workflows and Making ETL easier with AWS Glue Studio.


About the Authors

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Ritesh Kumar Sinha is an Analytics Specialist Solutions Architect based out of San Francisco. He has helped customers build scalable data warehousing and big data solutions for over 16 years. He loves to design and build efficient end-to-end solutions on AWS. In his spare time, he loves reading, walking, and doing yoga.

Fabrizio Napolitano is a Principal Specialist Solutions Architect for DB and Analytics. He has worked in the analytics space for the last 20 years, and has recently and quite by surprise become a Hockey Dad after moving to Canada.

Manjula Nagineni is a Senior Solutions Architect with AWS based in New York. She works with major financial service institutions, architecting and modernizing their large-scale applications while adopting AWS Cloud services. She is passionate about designing big data workloads cloud-natively. She has over 20 years of IT experience in software development, analytics, and architecture across multiple domains such as finance, retail, and telecom.

Sohaib Katariwala is an Analytics Specialist Solutions Architect at AWS. He has over 12 years of experience helping organizations derive insights from their data.

AWS Week in Review – AWS Wickr, Amazon Redshift, Generative AI, and More – May 29, 2023

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-week-in-review-aws-wickr-amazon-redshift-generative-ai-and-more-may-29-2023/

This edition of Week in Review marks the end of the month of May. In addition, we just finished all of the in-person AWS Summits in Asia-Pacific and Japan starting from AWS Summit Sydney and AWS Summit Tokyo in April to AWS Summit ASEAN, AWS Summit Seoul, and AWS Summit Mumbai in May.

Thank you to everyone who attended our AWS Summits in APJ, especially the AWS Heroes, AWS Community Builders, and AWS User Group leaders, for your collaboration in supporting activities at AWS Summit events.

Last Week’s Launches
Here are some launches that caught my attention last week:

AWS Wickr is now HIPAA eligible — AWS Wickr is an end-to-end encrypted enterprise messaging and collaboration tool that enables one-to-one and group messaging, voice and video calling, file sharing, screen sharing, and location sharing, without increasing organizational risk. With this announcement, you can now use AWS Wickr for workloads that are within the scope of HIPAA. Visit AWS Wickr to get started.

Amazon Redshift announces support for auto-commit statements in stored procedure — If you’re using stored procedures in Amazon Redshift, you now have enhanced transaction controls that enable you to automatically commit the statements inside the procedure. This new NONATOMIC mode can be used to handle exceptions inside a stored procedure. You can also use the new PL/pgSQL statement RAISE to programmatically raise the exception, which helps prevent disruptions in applications due to an error inside a stored procedure. For more information on using this feature, refer to Managing transactions.

AWS Chatbot supports access to Amazon CloudWatch dashboards and logs insights in chat channels — With this launch, you now can receive Amazon CloudWatch alarm notifications for an incident directly in your chat channel, analyze the diagnostic data from the dashboards, and remediate directly from the chat channel without switching context. Visit the AWS Chatbot page to learn more.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

AWS Open Source Updates
As always, my colleague Ricardo has curated the latest updates for open source news at AWS. Here are some of the highlights:

OpenEMR on AWS Fargate — OpenEMR is a popular Electronic Health and Medical Practice management solution. If you’re looking to deploy OpenEMR on AWS, then this repo will help you to get your OpenEMR up and running on AWS Fargate using Amazon ECS.

Cloud-Radar — If you're working with AWS CloudFormation and looking to perform unit tests, you might want to try Cloud-Radar. You can also perform functional testing with Cloud-Radar, as this tool also acts as a wrapper around Taskcat.

Amazon and Generative AI
Using generative AI to improve extreme multilabel classification — In their research on extreme multilabel classification (XMC), Amazon scientists explored a generative approach, in which a model generates a sequence of labels for input sequences of words. The generative models that incorporated hierarchical clustering consistently outperformed those that didn’t, demonstrating the effectiveness of incorporating hierarchical clustering in improving XMC performance.

Upcoming AWS Events
Don’t miss upcoming AWS-led events happening soon:

Also, let’s learn from our fellow builders and give them support by attending AWS Community Days:

That’s all for this week. Check back next Monday for another Week in Review!

Happy building
— Donnie

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

AWS Week in Review – AWS Documentation Updates, Amazon EventBridge is Faster, and More – May 22, 2023

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-week-in-review-aws-documentation-updates-amazon-eventbridge-is-faster-and-more-may-22-2023/

AWS Data Hero Anahit Pogosova keynote at CloudConf 2023
Here are your AWS updates from the previous 7 days. Last week I was in Turin, Italy for CloudConf, a conference I’ve had the pleasure to participate in for the last 10 years. AWS Hero Anahit Pogosova was also there sharing a few serverless tips in front of a full house. Here’s a picture I took from the last row during her keynote.

On Thursday, May 25, I’ll be at the AWS Community Day in Dublin to celebrate the 10 years of the local AWS User Group. Say hi if you’re there!

Last Week’s Launches
Last week was packed with announcements! Here are the launches that got my attention:

Amazon SageMaker – Geospatial capabilities are now generally available with security updates and more use case samples.

Amazon Detective – Simplify the investigation of AWS Security Findings coming from new sources such as AWS IAM Access Analyzer, Amazon Inspector, and Amazon Macie.

Amazon EventBridge – EventBridge now delivers events up to 80% faster than before, as measured by the time an event is ingested to the first invocation attempt. No change is required on your side.

AWS Control Tower – The service has launched 28 new proactive controls that allow you to block non-compliant resources before they are provisioned for services such as Amazon OpenSearch Service, AWS Auto Scaling, Amazon SageMaker, Amazon API Gateway, and Amazon Relational Database Service (Amazon RDS). Check out the original posts from when proactive controls were launched.

Amazon CloudFront – CloudFront now supports two new cache control directives to help improve performance and availability: stale-while-revalidate (to immediately deliver stale responses to users while it revalidates caches in the background) and stale-if-error (to define how long stale responses should be reused if there’s an error).

Amazon Timestream – Timestream now enables you to export query results to Amazon S3 in a cost-effective and secure manner using the new UNLOAD statement.

AWS Distro for OpenTelemetry – The tail sampling and the group-by-trace processors are now generally available in the AWS Distro for OpenTelemetry (ADOT) collector. For example, with tail sampling, you can define sampling policies such as “ingest 100% of all error cases and 5% of all success cases.”

AWS DataSync – You can now use DataSync to copy data to and from Amazon S3 compatible storage on AWS Snowball Edge Compute Optimized devices.

AWS Device Farm – Device Farm now supports VPC integration for private devices, for example, when an unreleased version of an app is accessing a staging environment and tests are accessing internal packages only accessible via private networking. Read more at Access your private network from real mobile devices using AWS Device Farm.

Amazon Kendra – Amazon Kendra now helps you search across different content repositories with new connectors for Gmail, Adobe Experience Manager Cloud, Adobe Experience Manager On-Premise, Alfresco PaaS, and Alfresco Enterprise. There is also an updated Microsoft SharePoint connector.

Amazon Omics – Omics now offers pre-built bioinformatic workflows, synchronous upload capability, integration with Amazon EventBridge, and support for Graphical Processing Units (GPUs). For more information, check out New capabilities make it easier for healthcare and life science customers to get started, build applications, and scale-up on Amazon Omics.

Amazon Braket – Braket now supports Aria, IonQ’s largest and highest fidelity publicly available quantum computing device to date. To learn more, read Amazon Braket launches IonQ Aria with built-in error mitigation.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
A few more news items and blog posts you might have missed:

AWS Documentation – The AWS Documentation home page has been redesigned. Leave your feedback there to let us know what you think or to suggest future improvements. Last week we also announced that we are retiring the AWS Documentation GitHub repo to focus our resources on directly improving the documentation and the website.

Peloton case study – Peloton embraces Amazon Redshift to unlock the power of data during changing times.

Zoom case study – Learn how Zoom implemented streaming log ingestion and efficient GDPR deletes using Apache Hudi on Amazon EMR.

Nice solution – Introducing an image-to-speech Generative AI application using SageMaker and Hugging Face.

For AWS open-source news and updates, check out the latest newsletter curated by Ricardo to bring you the most recent updates on open-source projects, posts, events, and more.

Upcoming AWS Events
Here are some opportunities to meet and learn:

AWS Data Insights Day (May 24) – A virtual event to discover how to innovate faster and more cost-effectively with data. This event focuses on customer voices, deep-dive sessions, and best practices of Amazon Redshift. You can register here.

AWS Silicon Innovation Day (June 21) – AWS has designed and developed purpose-built silicon specifically for the cloud. Join to learn AWS innovations in custom-designed Amazon EC2 chips built for high performance and scale in the cloud. Register here.

AWS re:Inforce (June 13–14) – You can still register for AWS re:Inforce. This year it is taking place in Anaheim, California.

AWS Global Summits – Sign up for the AWS Summit closest to where you live: Hong Kong (May 23), India (May 25), Amsterdam (June 1), London (June 7), Washington, DC (June 7-8), Toronto (June 14), Madrid (June 15), and Milano (June 22). If you want to meet, I’ll be at the one in London.

AWS Community Days – Join these community-led conferences where event logistics and content are planned, sourced, and delivered by community leaders: Dublin, Ireland (May 25), Shenzhen, China (May 28), Warsaw, Poland (June 1), Chicago, USA (June 15), and Chile (July 1).

That’s all from me for this week. Come back next Monday for another Week in Review!

Danilo

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Peloton embraces Amazon Redshift to unlock the power of data during changing times

Post Syndicated from Phil Goldstein original https://aws.amazon.com/blogs/big-data/peloton-embraces-amazon-redshift-to-unlock-the-power-of-data-during-changing-times/


Credit: Phil Goldstein
Jerry Wang, Peloton’s Director of Data Engineering (left), and Evy Kho, Peloton’s Manager of Subscription Analytics, discuss how the company has benefited from using Amazon Redshift.

New York-based Peloton, which aims to help people around the world reach their fitness goals through its connected fitness equipment and subscription-based classes, saw booming growth in the early stage of the COVID-19 pandemic. In 2020, as gyms shuttered and people looked for ways to stay active from the safety of their homes, the company’s annual revenue soared from $915 million in 2019 to $4 billion in 2021. Meanwhile, the company’s subscribers jumped from around 360,000 in 2019 to 2.76 million at the end of 2022.

As Peloton’s business continued to evolve amid a changing macroeconomic environment, it was essential that it could make smart business decisions quickly, and one of the best ways to do that was to harness insights from the huge amount of data that it had been gathering over recent years.

During that same time, AWS has been focused on helping customers manage their ever-growing volumes of data with tools like Amazon Redshift, the first fully managed, petabyte-scale cloud data warehouse. The service has grown into a multifaceted service used by tens of thousands of customers to process exabytes of data on a daily basis (1 exabyte is equivalent to 119 billion song downloads). With Amazon Redshift, you get access to a modern data architecture that helps you break down internal data silos, share data securely and seamlessly, and support multiple users who don’t have specialized data and analytics skills.

When Jerry Wang, Peloton’s director of data engineering, joined the company in 2019, he needed to make sure the service would handle the company’s massive and growing amounts of data. He also needed to ensure Amazon Redshift could help the company efficiently manage the wide variety of data and the users who would need to access it, and deliver insights on that data at high velocity—all while being cost-effective and secure.

Wang was delighted to see that as Peloton experienced its massive growth and change, AWS continued to release new Amazon Redshift features and associated capabilities that would perfectly suit his company’s needs at just the right time.

“Over the years, I’ve always been in the stage where I hope Redshift can have a new, specific feature,” Wang says, “and then, in a very limited amount of time, AWS releases that kind of feature.”

Peloton’s data volumes soar as the business grows

Man working out with a weight while viewing a Peloton class on his TV in a living room.

Credit: Peloton

As Peloton’s business has evolved, the amount of data it is generating and analyzing has grown exponentially. From 2019 to now, Wang reports the amount of data the company holds has grown by a factor of 20. In fact, a full 95% of the total historical data the company has generated has come in the last 4 years. This growth has been driven both by surges in the number of users on Peloton’s platform and the variety of data the company is collecting.

Peloton collects reams of data on its sales of internet-connected exercise equipment like stationary bikes and treadmills. The company also collects data on customers’ workouts, which it then provides back to them in various reports such as a monthly summary, giving them insights into how often they worked out, their best output, trends in their workouts, the instructor they used the most, how many calories they burned, and more. All of this data helps Peloton make strategic business decisions, refine its operations to become more efficient, adjust its programming, and drive subscriber engagement and growth.

In 2019 and into 2020, as Peloton’s business boomed, the company needed an analytics system that could help it manage an explosion of data, both from users and related to its business. The company embraced Amazon Redshift because of the service’s versatility, ease of use, price-performance at scale, continuous pace of innovation, and ability to handle concurrent queries from dozens of internal data teams.

Wang said that when he joined the company, there were two kinds of users who were performing daily data operations in Peloton’s Amazon Redshift data warehouse. One group performed extract, transform, and load (ETL) operations to take raw data and make it available for analysis. The other was a group of business users who, each morning, would perform queries to generate local data visualizations, creating a surge in demand for capacity on the Amazon Redshift data warehouse. “So, when these two loads ran together, the performance suffered directly,” Wang says.

One of the features Peloton adopted was Amazon Redshift Concurrency Scaling, which provides consistent and fast query performance even across thousands of concurrent users and concurrent queries. This helped solve the problem by automatically adding query processing power in seconds and processing queries without delays. When the workload demand subsided, the extra processing power was automatically removed, so Peloton only had to pay for the time when Concurrency Scaling data warehouses were in use. Wang says Peloton was running about 10 hours of Concurrency Scaling on a consistent daily basis to deal with the congestion, which, he says, “solved my problem at that moment.”

In 2020, as the pandemic inspired hordes to hop on bikes in their living rooms, Wang also upgraded Amazon Redshift with the newly introduced Amazon Redshift RA3 instances with managed storage (RMS). These represented a new generation of compute instances with managed, analytics-optimized storage designed for high-transaction, fast query performance and lower costs.

“The new instance … was a great feature for us,” Wang says. “It solved our concern about moving from terabyte scale to petabyte scale.”

Peloton’s business is driven by a variety of data for a wide range of users

Man watching a female Peloton biking instructor through a touch screen display on his Peloton bike.

Credit: Peloton

Peloton’s business model is driven by a wide variety of large volumes of data. In addition to selling bikes, treadmills, and indoor rowing machines, and expanding its subscription platform to include non-equipment-based workouts, the company has dozens of instructors in five countries, and it licenses music from three major music licensors. In 2022, it began renting bikes as well as selling them. Internally, Peloton employees working in finance, accounting, marketing, supply chain operations, music and content, and more are using data to track subscriber growth, content engagement, and which sales channels are leading to the most net new subscriptions.

“There was a time when we were just a bike company, and now we’re so much more than that,” says Evy Kho, manager of subscription analytics at Peloton.

There is also a much wider range of sales channels for Peloton equipment than just a few years ago. In the past, Peloton customers could only purchase bikes through the Peloton website or secondhand. Now, customers can purchase hardware from third-party sites like Amazon. That introduced “a really interesting data problem” for Peloton, says Kho, as it strives to determine how to link subscription signups back to exercise equipment sales.

In the face of this variability, complexity, and need for instant access to data to inform business decision-makers, Peloton embraced Amazon Redshift Serverless as an early adopter after AWS introduced the feature in late 2021. Redshift Serverless allows companies to quickly run and scale analytics capacity without database managers and data engineers needing to manage data warehouse infrastructure.

Redshift Serverless also has the ability to quickly spin up analytics capacity for different users, or personas, within an organization. This allows different teams across Peloton to perform analytics on the same datasets at the same time to generate insights on their individual parts of the business. It’s “incredibly important in terms of assessing what’s been good for our business,” Kho says.

Wang also says Peloton is considering supporting a specific persona for users who need analytics on financial information governed by securities regulations, and another for users who need to perform analytics on data governed by regulations around personally identifiable information (PII).

Wang points out that Redshift Serverless also allows him to spin up Amazon Redshift data warehouses to handle special usage patterns. For example, ETL loads are often high I/O but require low CPU resources, and are very predictable because Peloton controls the process. However, when internal users want to perform data analytics or machine learning, the company doesn’t have control over the demand for those queries, and the load on Amazon Redshift data warehouses can be variable, with some queries more CPU-intensive than others. Previously, any provisioned data warehouse would have a fixed cost, and it would have to be provisioned to cope with the highest possible workloads even if the utilization rates turned out to be low. Now, for these different scenarios, Wang creates different Amazon Redshift instances to handle that variability without those heavy, fixed costs.

As Peloton’s use of Amazon Redshift has evolved and matured, its costs have gone down, according to Wang. “If you look at Serverless, the amount … that we spend on the Serverless is actually much smaller than we did previously, compared to the Concurrency Scaling cost.”

In a serverless environment, there is no upfront cost to Peloton. “I can set it up as quickly as I can and we pay as we need it,” Wang says. “It scales up when the load goes up. So, it’s a perfect fit.”

Peloton uses Amazon Redshift to get to insights faster

Women running on a Peloton treadmill with a touch screen display

Credit: Peloton

Peloton’s focus on efficiency and sustainable growth has meant that it needs to act more quickly than ever to make sound, data-informed business decisions. Peloton, Wang notes, is long past the stage where all it cared about was growth. “We are a mature company now, so operational efficiency is very important; it’s key to the business,” he says.

When Peloton launches new products, for example, two things typically happen, Wang says. One is that there is a spike in data volumes, both in traffic to its website and the number of sales transactions it’s processing. The second is that the company’s management team will want real-time updates and analysis of how sales are performing.

Redshift Serverless and data sharing let users quickly start performing real-time analytics and build reporting and dashboard applications without any additional engineering required. Wang confirms this benefit, especially in the example of a new product launch, saying it “will scale up by itself without me having to intervene. I don’t need to allocate a budget. I don’t need to change any configurations.”

In the past, when Peloton only offered its fitness equipment through its own website, it was easy to associate fulfillment data on orders with subscriptions. However, as those channels grew and became more complex, Peloton turned to the data sharing capabilities of Amazon Redshift to share data quickly and easily across teams. Peloton’s teams for subscriber analytics, supply chain, accounting, and more need fast access to fulfillment data to ensure they can track it accurately, respond if changes are needed, and determine how fulfillment data aligns with subscriptions and revenue.

“Getting them those results even faster has been incredibly helpful, and is only becoming more important as we have become far more data-driven than I think you could argue we were before,” Kho says.

Amazon Redshift marries data security, governance, and compliance with innovation

Like all customers, Peloton is concerned about data security, governance, and compliance. With security features like dynamic data masking, role-based access control, and row-level security, Amazon Redshift protects customers’ data with granular authorization features and comprehensive identity management.

Customers also are able to easily provide authorizations for the right users or groups. These features are available out of the box, within the standard pricing model.

Wang notes that Amazon Redshift’s security model is based on a traditional database model, which is a well-understood and robust model. “So for us, to provision access on that model is quite straightforward,” Wang says.

At every stage of Peloton’s evolution over the last 4 years, the company has been able to turn to AWS and Amazon Redshift to help it effectively manage that growth and complexity.

“When I started,” Wang says, “I said, OK, I need a temporary boost in capacity. Then came Concurrency Scaling. And then I said, I need cheaper storage, and [RA3] comes along. And then the ultimate challenge [was], I’m no longer satisfied with a monolithic Redshift instance. Serverless solved that issue.”

Join AWS Data Insights Day 2023

If you want to learn how your company can use Amazon Redshift to analyze large volumes of data in an easy-to-use, scalable, cost-effective, and secure way, don’t miss AWS Data Insights Day on May 24, 2023. During the day-long virtual event, learn from AWS leaders, experts, partners, and customers—including Peloton, Gilead, McDonald’s, Global Foundries, Schneider Electric, and Flutter Entertainment—how Amazon Redshift and features like Amazon Redshift ML are helping drive business innovation, optimization, and cost savings, especially in today’s uncertain economic times.

Register Now for a calendar reminder.

To learn more about Amazon Redshift, see Amazon Redshift and Amazon Redshift: Ten years of continuous reinvention.


About the author

Phil Goldstein is a copywriter and editor with AWS product marketing. He has 15 years of technology writing experience, and prior to joining AWS was a senior editor at a content marketing agency and a business journalist covering the wireless industry.

Single sign-on with Amazon Redshift Serverless with Okta using Amazon Redshift Query Editor v2 and third-party SQL clients

Post Syndicated from Maneesh Sharma original https://aws.amazon.com/blogs/big-data/single-sign-on-with-amazon-redshift-serverless-with-okta-using-amazon-redshift-query-editor-v2-and-third-party-sql-clients/

Amazon Redshift Serverless makes it easy to run and scale analytics in seconds without the need to set up and manage data warehouse clusters. With Redshift Serverless, users such as data analysts, developers, business professionals, and data scientists can get insights from data by simply loading and querying data in the data warehouse.

Customers use their preferred SQL clients to analyze their data in Redshift Serverless. They want to use identity provider (IdP) or single sign-on (SSO) credentials to connect to Redshift Serverless to reuse existing credentials and avoid additional user setup and configuration. When you use AWS Identity and Access Management (IAM) or IdP-based credentials to connect to a serverless data warehouse, Amazon Redshift automatically creates a database user for the end-user. You can simplify managing user privileges by using role-based access control. Admins can use a database-role mapping for SSO with the IAM roles that users are assigned to get their database privileges automatically. With this integration, organizations can simplify user management because they no longer need to create users and map them to database roles manually. You can define the mapped database roles as a principal tag for the IdP groups or IAM role, so users who are members of those IdP groups are granted the matching Amazon Redshift database roles automatically.

In this post, we focus on Okta as the IdP and provide step-by-step guidance to integrate Redshift Serverless with Okta using the Amazon Redshift Query Editor V2 and with SQL clients like SQL Workbench/J. You can use this mechanism with other IdPs, such as Azure Active Directory or Ping, with any application or tool that uses the Amazon Redshift JDBC, ODBC, or Python drivers.

Solution overview

The following diagram illustrates the authentication flow of Okta with Redshift Serverless using federated IAM roles and automatic database-role mapping.

The workflow contains the following steps:

  1. Either the user chooses an IdP app in their browser, or the SQL client initiates a user authentication request to the IdP (Okta).
  2. Upon a successful authentication, Okta submits a request to the AWS federation endpoint with a SAML assertion containing the PrincipalTags.
  3. The AWS federation endpoint validates the SAML assertion and invokes the AWS Security Token Service (AWS STS) API AssumeRoleWithSAML. The SAML assertion contains the IdP user and group information that is stored in the RedshiftDbUser and RedshiftDbRoles principal tags, respectively. Temporary IAM credentials are returned to the SQL client or, if using the Query Editor v2, the user’s browser is redirected to the Query Editor v2 console using the temporary IAM credentials.
  4. The temporary IAM credentials are used by the SQL client or Query Editor v2 to call the Redshift Serverless GetCredentials API. The API uses the principal tags to determine the user and database roles that the user belongs to. An associated database user is created if the user is signing in for the first time and is granted the matching database roles automatically. A temporary password is returned to the SQL client.
  5. Using the database user and temporary password, the SQL client or Query Editor v2 connects to Redshift Serverless. Upon login, the user is authorized based on the Amazon Redshift database roles that were assigned in Step 4.

To set up the solution, we complete the following steps:

  1. Set up your Okta application:
    • Create Okta users.
    • Create groups and assign groups to users.
    • Create the Okta SAML application.
    • Collect Okta information.
  2. Set up AWS configuration:
    • Create the IAM IdP.
    • Create the IAM role and policy.
  3. Configure Redshift Serverless role-based access.
  4. Federate to Redshift Serverless using the Query Editor V2.
  5. Configure the SQL client (for this post, we use SQL Workbench/J).
  6. Optionally, implement MFA with SQL Client and Query Editor V2.

Prerequisites

You need the following prerequisites to set up this solution:

Set up Okta application

In this section, we provide the steps to configure your Okta application.

Create Okta users

To create your Okta users, complete the following steps:

  1. Sign in to your Okta organization as a user with administrative privileges.
  2. On the admin console, under Directory in the navigation pane, choose People.
  3. Choose Add person.
  4. For First Name, enter the user’s first name.
  5. For Last Name, enter the user’s last name.
  6. For Username, enter the user’s user name in email format.
  7. Select I will set password and enter a password.
  8. Optionally, deselect User must change password on first login if you don’t want the user to change their password when they first sign in. Choose Save.

Create groups and assign groups to users

To create your groups and assign them to users, complete the following steps:

  1. Sign in to your Okta organization as a user with administrative privileges.
  2. On the admin console, under Directory in the navigation pane, choose Groups.
  3. Choose Add group.
  4. Enter a group name and choose Save.
  5. Choose the recently created group and then choose Assign people.
  6. Choose the plus sign and then choose Done.
  7. Repeat Steps 1–6 to add more groups.

In this post, we create two groups: sales and finance.

Create an Okta SAML application

To create your Okta SAML application, complete the following steps:

  1. Sign in to your Okta organization as a user with administrative privileges.
  2. On the admin console, under Applications in the navigation pane, choose Applications.
  3. Choose Create App Integration.
  4. Select SAML 2.0 as the sign-in method and choose Next.
  5. Enter a name for your app integration (for example, redshift_app) and choose Next.
  6. Enter the following values in the app and leave the rest as is:
    • For Single Sign On URL, enter https://signin.aws.amazon.com/saml.
    • For Audience URI (SP Entity ID), enter urn:amazon:webservices.
    • For Name ID format, enter EmailAddress.
  7. Choose Next.
  8. Choose I’m an Okta customer adding an internal app followed by This is an internal app that we have created.
  9. Choose Finish.
  10. Choose Assignments and then choose Assign.
  11. Choose Assign to groups and then select Assign next to the groups that you want to add.
  12. Choose Done.

Set up Okta advanced configuration

After you create the custom SAML app, complete the following steps:

  1. On the admin console, navigate to General and choose Edit under SAML settings.
  2. Choose Next.
  3. Set Default Relay State to the Query Editor V2 URL, using the format https://<region>.console.aws.amazon.com/sqlworkbench/home. For this post, we use https://us-west-2.console.aws.amazon.com/sqlworkbench/home.
  4. Under Attribute Statements (optional), add the following properties:
    • Provide the IAM role and IdP in comma-separated format using the Role attribute. You’ll create this same IAM role and IdP in a later step when setting up AWS configuration.
    • Set user.login for RoleSessionName. This is used as an identifier for the temporary credentials that are issued when the role is assumed.
    • Set the DB roles using PrincipalTag:RedshiftDbRoles. This uses the Okta groups to fill the principal tags and map them automatically with the Amazon Redshift database roles. Its value must be a colon-separated list in the format role1:role2.
    • Set user.login for PrincipalTag:RedshiftDbUser. This uses the user name in the directory. This is a required tag and defines the database user that is used by Query Editor V2.
    • Set the transitive keys using TransitiveTagKeys. This prevents users from changing the session tags in case of role chaining.

The preceding tags are forwarded to the GetCredentials API to get temporary credentials for your Redshift Serverless instance and map automatically with Amazon Redshift database roles. The following summarizes the configuration of each attribute statement (name, name format, value, and an example):

  • https://aws.amazon.com/SAML/Attributes/Role
    Name format: Unspecified
    Value: arn:aws:iam::<yourAWSAccountID>:role/role-name,arn:aws:iam::<yourAWSAccountID>:saml-provider/provider-name
    Example: arn:aws:iam::112034567890:role/oktarole,arn:aws:iam::112034567890:saml-provider/oktaidp
  • https://aws.amazon.com/SAML/Attributes/RoleSessionName
    Name format: Unspecified
    Value: user.login
    Example: user.login
  • https://aws.amazon.com/SAML/Attributes/PrincipalTag:RedshiftDbRoles
    Name format: Unspecified
    Value: String.join(":", isMemberOfGroupName("group1") ? 'group1' : '', isMemberOfGroupName("group2") ? 'group2' : '')
    Example: String.join(":", isMemberOfGroupName("sales") ? 'sales' : '', isMemberOfGroupName("finance") ? 'finance' : '')
  • https://aws.amazon.com/SAML/Attributes/PrincipalTag:RedshiftDbUser
    Name format: Unspecified
    Value: user.login
    Example: user.login
  • https://aws.amazon.com/SAML/Attributes/TransitiveTagKeys
    Name format: Unspecified
    Value: Arrays.flatten("RedshiftDbUser", "RedshiftDbRoles")
    Example: Arrays.flatten("RedshiftDbUser", "RedshiftDbRoles")
  5. After you add the attribute claims, choose Next followed by Finish.

Your attributes should be in similar format as shown in the following screenshot.

Collect Okta information

To gather your Okta information, complete the following steps:

  1. On the Sign On tab, choose View SAML setup instructions.
  2. Note the Identity Provider Single Sign-on URL; use this URL when connecting with any third-party SQL client such as SQL Workbench/J.
  3. Use the IdP metadata in block 4 and save the metadata file in .xml format (for example, metadata.xml).

Set up AWS configuration

In this section, we provide the steps to configure your IAM resources.

Create the IAM IdP

To create your IAM IdP, complete the following steps:

  1. On the IAM console, under Access management in the navigation pane, choose Identity providers.
  2. Choose Add provider.
  3. For Provider type, select SAML.
  4. For Provider name, enter a name.
  5. Choose Choose file and upload the metadata file (.xml) you downloaded earlier.
  6. Choose Add provider.

Create the IAM Amazon Redshift access policy

To create your IAM policy, complete the following steps:

  1. On the IAM console, choose Policies.
  2. Choose Create policy.
  3. On the Create policy page, choose the JSON tab.
  4. For the policy, enter the JSON in following format:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": "redshift-serverless:GetCredentials",
                "Resource": "<Workgroup ARN>"
            },
            {
                "Sid": "VisualEditor1",
                "Effect": "Allow",
                "Action": "redshift-serverless:ListWorkgroups",
                "Resource": "*"
            }
        ]
    }

The workgroup ARN is available on the Redshift Serverless workgroup configuration page.

The following example policy includes only a single Redshift Serverless workgroup; you can modify the policy to include multiple workgroups in the Resource section:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "redshift-serverless:GetCredentials",
            "Resource": "arn:aws:redshift-serverless:us-west-2:123456789012:workgroup/4a4f12vc-123b-2d99-fd34-a12345a1e87f"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "redshift-serverless:ListWorkgroups",
            "Resource": "*"
        }
    ]
}

  5. Choose Next: Tags.
  6. Choose Next: Review.
  7. In the Review policy section, for Name, enter the name of your policy; for example, OktaRedshiftPolicy.
  8. For Description, you can optionally enter a brief description of what the policy does.
  9. Choose Create policy.

Create the IAM role

To create your IAM role, complete the following steps:

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose Create role.
  3. For Trusted entity type, select SAML 2.0 federation.
  4. For SAML 2.0-based provider, choose the IdP you created earlier.
  5. Select Allow programmatic and AWS Management Console access.
  6. Choose Next.
  7. Choose the policy you created earlier.
  8. Also, add the policy AmazonRedshiftQueryEditorV2ReadSharing.
  9. Choose Next.
  10. In the Review section, for Role Name, enter the name of your role; for example, oktarole.
  11. For Description, you can optionally enter a brief description of what the role does.
  12. Choose Create role.
  13. Navigate to the role that you just created and choose Trust Relationships.
  14. Choose Edit trust policy and choose TagSession under Add actions for STS.

When using session tags, trust policies for all roles connected to the IdP passing tags must have the sts:TagSession permission. For roles without this permission in the trust policy, the AssumeRole operation fails.

  15. Choose Update policy.

The following screenshot shows the role permissions.

The following screenshot shows the trust relationships.

Update the advanced Okta Role Attribute

Complete the following steps:

  1. Switch back to Okta.com.
  2. Navigate to the application which you created earlier.
  3. Navigate to General and choose Edit under SAML settings.
  4. Under Attribute Statements (optional), update the value for the attribute https://aws.amazon.com/SAML/Attributes/Role, using the actual role and identity provider ARN values from the previous steps. For example, arn:aws:iam::123456789012:role/oktarole,arn:aws:iam::123456789012:saml-provider/oktaidp.

Configure Redshift Serverless role-based access

In this step, we create database roles in Amazon Redshift based on the groups that you created in Okta. Make sure the role name matches the Okta group name.

Amazon Redshift roles simplify managing the privileges required for your end-users. In this post, we create two database roles, sales and finance, and grant them access to query tables with sales and finance data, respectively. You can download this sample SQL Notebook and import it into Redshift Query Editor v2 to run all cells in the notebook used in this example. Alternatively, you can copy and enter the SQL into your SQL client.

The following is the syntax to create a role in Redshift Serverless:

create role <IdP groupname>;

For example:

create role sales;
create role finance;

Create the sales and finance database schema:

create schema sales_schema;
create schema finance_schema;

Create the tables:

CREATE TABLE IF NOT EXISTS finance_schema.revenue
(
account INTEGER   ENCODE az64
,customer VARCHAR(20)   ENCODE lzo
,salesamt NUMERIC(18,0)   ENCODE az64
)
DISTSTYLE AUTO
;

insert into finance_schema.revenue values (10001, 'ABC Company', 12000);
insert into finance_schema.revenue values (10002, 'Tech Logistics', 175400);
insert into finance_schema.revenue values (10003, 'XYZ Industry', 24355);
insert into finance_schema.revenue values (10004, 'The tax experts', 186577);

CREATE TABLE IF NOT EXISTS sales_schema.store_sales
(
ID INTEGER   ENCODE az64,
Product varchar(20),
Sales_Amount INTEGER   ENCODE az64
)
DISTSTYLE AUTO
;

Insert into sales_schema.store_sales values (1,'product1',1000);
Insert into sales_schema.store_sales values (2,'product2',2000);
Insert into sales_schema.store_sales values (3,'product3',3000);
Insert into sales_schema.store_sales values (4,'product4',4000);

The following is the syntax to grant permission to the Redshift Serverless role:

GRANT { { SELECT | INSERT | UPDATE | DELETE | DROP | REFERENCES } [,...]| ALL [ PRIVILEGES ] } ON { [ TABLE ] table_name [, ...] | ALL TABLES IN SCHEMA schema_name [, ...] } TO role <IdP groupname>;

Grant the relevant permissions to each role as per your requirements. In the following example, we grant the role sales usage and select permissions on sales_schema, and the role finance usage and select permissions on finance_schema:

grant usage on schema sales_schema to role sales;
grant select on all tables in schema sales_schema to role sales;

grant usage on schema finance_schema to role finance;
grant select on all tables in schema finance_schema to role finance;
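Optionally, you can confirm that the roles were created by querying the SVV_ROLES system view (the output may also include system-defined roles):

select role_name from svv_roles;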

Federate to Redshift Serverless using Query Editor V2

The RedshiftDbRoles principal tag and DBGroups are both mechanisms that can be used to integrate with an IdP. However, federating with the RedshiftDbRoles principal tag has some clear advantages because it provides automatic mapping between IdP groups and Amazon Redshift database roles. Overall, RedshiftDbRoles is more flexible, easier to manage, and more secure, making it the better option for integrating Amazon Redshift with your IdP.

Now you’re ready to connect to Redshift Serverless using the Query Editor V2 and federated login:

  1. Use the SSO URL you collected earlier and log in to your Okta account with your user credentials. For this demo, we log in with user Ethan.
  2. In the Query Editor v2, choose your Redshift Serverless instance (right-click) and choose Create connection.
  3. For Authentication, select Federated user.
  4. For Database, enter the database name you want to connect to.
  5. Choose Create Connection.

User Ethan will be able to access sales_schema tables. If Ethan tries to access the tables in finance_schema, he will get a permission denied error.
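For example, assuming the sample tables created earlier, a user who is only a member of the sales group sees behavior like the following:

select * from sales_schema.store_sales limit 5;   -- succeeds for a member of the sales role
select * from finance_schema.revenue limit 5;     -- fails with a permission denied error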

Configure the SQL client (SQL Workbench/J)

To set up SQL Workbench/J, complete the following steps:

  1. Create a new connection in SQL Workbench/J and choose Redshift Serverless as the driver.
  2. Choose Manage drivers and add all the files from the downloaded AWS JDBC driver pack .zip file (remember to unzip the .zip file).
  3. For Username and Password, enter the values that you set in Okta.
  4. Capture the values for app_id, app_name, and idp_host from the Okta app embed link, which can be found on the General tab of your application.
  5. Set the following extended properties:
    • For app_id, enter the value from app embed link (for example, 0oa8p1o1RptSabT9abd0/avc8k7abc32lL4izh3b8).
    • For app_name, enter the value from app embed link (for example, dev-123456_redshift_app_2).
    • For idp_host, enter the value from app embed link (for example, dev-123456.okta.com).
    • For plugin_name, enter com.amazon.redshift.plugin.OktaCredentialsProvider. The following screenshot shows the SQL Workbench/J extended properties.

  6. Choose OK.
  7. Choose Test from SQL Workbench/J to test the connection.
  8. When the connection is successful, choose OK.
  9. Choose OK to sign in with the users you created.

User Ethan will be able to access the sales_schema tables. If Ethan tries to access the tables in the finance_schema, he will get a permission denied error.

Congratulations! You have federated with Redshift Serverless and Okta with SQL Workbench/J using RedshiftDbRoles.

[Optional] Implement MFA with SQL Client and Query Editor V2

Implementing MFA poses an additional challenge because multi-factor authentication is an asynchronous process between initiating the login (the first factor) and completing the login (the second factor). The SAML response is returned to the appropriate listener in each scenario: the SQL client, or the AWS console in the case of QEV2. Depending on which login options you give your users, you may need an additional Okta application. See below for the different scenarios:

  1. If you are only using QEV2 and not using any other SQL client, you can use MFA with Query Editor V2 with the above application. No changes are required in the custom SAML application that we created above.
  2. If you are not using QEV2 and are only using a third-party SQL client (such as SQL Workbench/J), you need to modify the above custom SAML app as described below.
  3. If you want to use QEV2 and a third-party SQL client with MFA, you need to create an additional custom SAML app as described below.

Prerequisites for MFA

Each identity provider (IdP) has its own steps for enabling and managing MFA for your users. In the case of Okta, see the following guides on how to enable MFA using the Okta Verify application and by defining an authentication policy.

Steps to create/update SAML application which supports MFA for a SQL Client

  1. If creating a second app, follow all the steps described in the section Create an Okta SAML application.
  2. Open the custom SAML app and select General.
  3. Select Edit under SAML settings.
  4. Choose Next in General Settings.
  5. Under General, update the Single sign-on URL to http://localhost:7890/redshift/.
  6. Select Next followed by Finish.

Below is a screenshot from the MFA app after making the above changes:

Configure SQL Client for MFA

To set up SQL Workbench/J, complete the following steps:

  1. Follow all the steps described in the section Configure the SQL client (SQL Workbench/J).
  2. Modify your connection by updating the extended properties:
    • login_url – Use the Single Sign-on URL collected in the section Collect Okta information (for example, https://dev-123456.okta.com/app/dev-123456_redshiftapp_2/abc8p6o5psS6xUhBJ517/sso/saml).
    • plugin_name – com.amazon.redshift.plugin.BrowserSamlCredentialsProvider
  3. Choose OK
  4. Choose OK from SQL Workbench/J. You’re redirected to the browser to sign in with your Okta credentials.
  5. After that, you will be prompted for MFA. Choose either Enter a code or Get a push notification.
  6. Once authentication is successful, you are redirected to a page showing that the connection was successful.
  7. With this connection profile, run the following query to return the federated user name.
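The query itself isn’t reproduced above; a minimal check that returns the logged-in (federated) user in Amazon Redshift is the following:

select current_user;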

Troubleshooting

If your connection didn’t work, consider the following:

  • Enable logging in the driver. For instructions, see Configure logging.
  • Make sure to use the latest Amazon Redshift JDBC driver version.
  • If you’re getting errors while setting up the application on Okta, make sure you have admin access.
  • If you can authenticate via the SQL client but get a permission issue or can’t see objects, grant the relevant permission to the role, as detailed earlier in this post.

Clean up

When you’re done testing the solution, clean up the resources to avoid incurring future charges:

  1. Delete the Redshift Serverless instance by deleting both the workgroup and the namespace.
  2. Delete the IAM roles, IAM IdPs, and IAM policies.

Conclusion

In this post, we provided step-by-step instructions to integrate Redshift Serverless with Okta using the Amazon Redshift Query Editor V2 and SQL Workbench/J with the help of federated IAM roles and automatic database-role mapping. You can use a similar setup with any other SQL client (such as DBeaver or DataGrip) or business intelligence tool (such as Tableau Desktop). We also showed how Okta group membership is mapped automatically with Redshift Serverless roles to use role-based authentication seamlessly.

For more information about Redshift Serverless single sign-on using database roles, see Defining database roles to grant to federated users in Amazon Redshift Serverless.


About the Authors

Maneesh Sharma is a Senior Database Engineer at AWS with more than a decade of experience designing and implementing large-scale data warehouse and analytics solutions. He collaborates with various Amazon Redshift Partners and customers to drive better integration.

Debu Panda is a Senior Manager, Product Management at AWS. He is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world.

Mohamed Shaaban is a Senior Software Engineer in Amazon Redshift and is based in Berlin, Germany. He has over 12 years of experience in software engineering. He is passionate about cloud services and building solutions that delight customers. Outside of work, he is an amateur photographer who loves to explore and capture unique moments.

Rajiv Gupta is Sr. Manager of Analytics Specialist Solutions Architects based out of Irvine, CA. He has 20+ years of experience building and managing teams who build data warehouse and business intelligence solutions.

Amol Mhatre is a Database Engineer in Amazon Redshift and works on Customer & Partner engagements. Prior to Amazon, he has worked on multiple projects involving Database & ERP implementations.

Ning Di is a Software Development Engineer at Amazon Redshift, driven by a genuine passion for exploring all aspects of technology.

Harsha Kesapragada is a Software Development Engineer for Amazon Redshift with a passion to build scalable and secure systems. In the past few years, he has been working on Redshift Datasharing, Security and Redshift Serverless.

Data load made easy and secure in Amazon Redshift using Query Editor V2

Post Syndicated from Raks Khare original https://aws.amazon.com/blogs/big-data/data-load-made-easy-and-secure-in-amazon-redshift-using-query-editor-v2/

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to analyze all your data efficiently and securely. Users such as data analysts, database developers, and data scientists use SQL to analyze their data in Amazon Redshift data warehouses. Amazon Redshift provides a web-based Query Editor V2 in addition to supporting connectivity via ODBC/JDBC or the Amazon Redshift Data API.

Amazon Redshift Query Editor V2 makes it easy to query your data using SQL and gain insights by visualizing your results using charts and graphs with a few clicks. With Query Editor V2, you can collaborate with team members by easily sharing saved queries, results, and analyses in a secure way.

Analysts performing ad hoc analyses in their workspace need to load sample data into Amazon Redshift by creating a table and loading the data from their desktop. They want to join that data with the curated data in their data warehouse. Data engineers and data scientists have test data and want to load it into Amazon Redshift for their machine learning (ML) or analytics use cases.

In this post, we walk through a new feature in Query Editor V2 to easily load data files either from your local desktop or Amazon Simple Storage Service (Amazon S3).

Prerequisites

Complete the following prerequisite steps:

    1. Create an Amazon Redshift provisioned cluster or Serverless endpoint.
    2. Provide access to Query Editor V2 for your end-users. To enable your users to access Query Editor V2 using IAM, as an administrator, you can attach one of the following AWS-managed policies to the AWS Identity and Access Management (IAM) user or role to grant permission:
      • AmazonRedshiftQueryEditorV2FullAccess – Grants full access to the Query Editor V2 operations and resources.
      • AmazonRedshiftQueryEditorV2NoSharing – Grants the ability to work with Query Editor V2 without sharing resources.
      • AmazonRedshiftQueryEditorV2ReadSharing – Grants the ability to work with Query Editor V2 with limited sharing of resources. The granted principal can read the resources shared with its team but can’t update them.
      • AmazonRedshiftQueryEditorV2ReadWriteSharing – Grants the ability to work with Query Editor V2 with sharing of resources. The granted principal can read and update the resources shared with its team.
    3. Provide access to the S3 bucket to load data from a local desktop file.
      • To enable your users to load data from a local desktop using Query Editor V2, as an administrator, you have to specify a common S3 bucket, and the user account must be configured with proper permissions. You can use the following IAM policy as an example to configure your IAM user or role:
        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": [
                        "s3:ListBucket",
                        "s3:GetBucketLocation"
                    ],
                    "Resource": [
                        "arn:aws:s3:::<staging-bucket-name>>"
                    ]
                },
                {
                    "Effect": "Allow",
                    "Action": [
                        "s3:PutObject",
                        "s3:GetObject",
                        "s3:DeleteObject"
                    ],
                    "Resource": [
                        "arn:aws:s3:::<staging-bucket-name>[/<optional-prefix>]/${aws:userid}/*"
                    ]
                }
            ]
        }
        

      • It’s also recommended to have proper separation of data access when loading data files from your local desktop. You can use the following S3 bucket policy as an example to separate data access between users of the staging bucket you configured:
        {
         "Version": "2012-10-17",
            "Statement": [
                {"Sid": "userIdPolicy",
                    "Effect": "Deny",
                    "Principal": "*",
                    "Action": ["s3:PutObject",
                               "s3:GetObject",
                               "s3:DeleteObject"],
                    "NotResource": [
                        "arn:aws:s3:::<staging-bucket-name>[/<optional-prefix>]/${aws:userid}/*"
                    ]
                 }
            ]
        }
        

Configure Query Editor V2 for your AWS account

As an admin, you must first configure Query Editor V2 before providing access to your end-users. On the Amazon Redshift console, choose Query editor v2 in the navigation pane.

If you’re accessing Query Editor v2 for the first time, you must configure your account by providing AWS Key Management Service (AWS KMS) encryption and, optionally, an S3 bucket.

By default, an AWS-owned key is used to encrypt resources. Optionally, you can create a symmetric customer managed key to encrypt Query Editor V2 resources such as saved queries and query results using the AWS KMS console or AWS KMS API operations.

The S3 bucket URI is required when loading data from your local desktop. You can provide the S3 URI of the same bucket that you configured earlier as a prerequisite.

Configure-QEv2

If you have previously configured Query Editor V2 with only AWS KMS encryption, you can choose Account Settings after launching the interface to update the S3 URI to support loading from your local desktop.

Configure-QEv2

Load data from your local desktop

Users such as data analysts, database developers, and data scientists can now load local files up to 5 MB in size into Amazon Redshift tables from Query Editor V2, without using the COPY command. The supported data formats are CSV, JSON, DELIMITER, FIXEDWIDTH, SHAPEFILE, AVRO, PARQUET, and ORC. Complete the following steps:

      1. On the Amazon Redshift console, navigate to Query Editor V2.
      2. Click on Load data.
        load data
      3. Choose Load from local file and Browse to choose a local file. You can download the student_info.csv file to use as an example.
      4. If your file has column headers as the first row, keep the default selection of Ignore header rows as 1 to ignore the first row.
      5. If your file has date columns, choose Data conversion parameters.
        browse and format file
      6. Select Date format, set it to auto and choose Next.
        date format
      7. Choose Load new table to automatically infer the file schema.
      8. Specify the values for Cluster or workgroup, Database, Schema, and Table (for example, Student_info) to load data to.
      9. Choose Create table.
        create-table

A success message appears that the table was created. Now you can load data into the newly created table from a local file.

      10. Choose Load data.
        table created

A message appears that the data load was successful.

      11. Query the Student_info table to see the data.
        query data
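For example, assuming the schema and table name used above (qualify the table with your schema if needed), a quick check looks like the following:

select * from student_info limit 10;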

Load data from Amazon S3

You can easily load data from Amazon S3 into an Amazon Redshift table using Query Editor V2. Complete the following steps:

      1. On the Amazon Redshift console, launch Query Editor V2 and connect to your cluster.
      2. Browse to the database name (for example, dev), the public schema, and expand Tables.
      3. You can automatically infer the schema of an S3 file, similar to the Load from local file option shown above; however, for this demo, we also show you how to load data into an existing table. Run the following create table script to make a sample table (for this example, public.customer):
CREATE TABLE customer ( 
	c_custkey int8 NOT NULL , 
	c_name varchar(25) NOT NULL, 
	c_address varchar(40) NOT NULL, 
	c_nationkey int4 NOT NULL, 
	c_phone char(15) NOT NULL, 
	c_acctbal numeric(12,2) NOT NULL, 
	c_mktsegment char(10) NOT NULL, 
	c_comment varchar(117) NOT NULL, 
PRIMARY Key(C_CUSTKEY) 
) DISTKEY(c_custkey) sortkey(c_custkey);
      4. Choose Load data.
        Create-Table
      5. Choose Load from S3 bucket.
      6. For this post, we load data from the TPCH Sample data GitHub repo, so for the S3 URI, enter s3://redshift-downloads/TPC-H/2.18/10GB/customer.tbl.
      7. For S3 file location, choose us-east-1.
      8. For File format, choose Delimiter.
      9. For Delimiter character, enter |.
        Load from S3
      10. Choose Data conversion parameters, then select Time format and Date format as auto.
      11. Choose Back.

Refer to Data conversion parameters for more details.

Date Time Format

      12. Choose Load operations.
      13. Select Automatic update for compression encodings.
      14. Select Stop loading when maximum number of errors has been exceeded and specify a value (for example, 100).
      15. Select Statistics update and ON, then choose Next.

Refer to Data load operations for more details.

Load Operations

      16. Choose Load existing table.
      17. Specify the Cluster or workgroup, Database, Schema (for example, public), and Table name (for example, customer).
      18. For IAM role, choose a suitable IAM role.
      19. Choose Load data.
        S3 Load Data

Query Editor V2 generates the COPY command and runs it on the Amazon Redshift cluster. The results of the COPY command are displayed in the Result section upon completion.

S3 Load Copy
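The exact command depends on the options you selected; for the settings in this walkthrough, it is roughly equivalent to the following sketch (the IAM role ARN is a placeholder):

COPY public.customer
FROM 's3://redshift-downloads/TPC-H/2.18/10GB/customer.tbl'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<your-redshift-role>'
DELIMITER '|'
DATEFORMAT 'auto'
TIMEFORMAT 'auto'
MAXERROR 100
COMPUPDATE ON
STATUPDATE ON
REGION 'us-east-1';

A quick sanity check afterward is select count(*) from public.customer; to confirm that the rows were loaded.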

Conclusion

In this post, we showed how Amazon Redshift Query Editor V2 has simplified the process to load data into Amazon Redshift from Amazon S3 or your local desktop, thereby accelerating the data analysis. It’s an easy-to-use feature that your teams can start using to load and query datasets. If you have any questions or suggestions, please leave a comment.


About the Authors

Raks Khare is an Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers architect data analytics solutions at scale on the AWS platform.

Tahir Aziz is an Analytics Solution Architect at AWS. He has been building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Erol Murtezaoglu, a Technical Product Manager at AWS, is an inquisitive and enthusiastic thinker with a drive for self-improvement and learning. He has a strong and proven technical background in software development and architecture, balanced with a drive to deliver commercially successful products. Erol highly values the process of understanding customer needs and problems, in order to deliver solutions that exceed expectations.

Sapna Maheshwari is a Sr. Solutions Architect at Amazon Web Services. She has over 18 years of experience in data and analytics. She is passionate about telling stories with data and enjoys creating engaging visuals to unearth actionable insights.

Karthik Ramanathan is a Software Engineer with Amazon Redshift and is based in San Francisco. He brings close to two decades of development experience across the networking, data storage and IoT verticals. When not at work he is also a writer and loves to be in the water.

Albert Harkema is a Software Development Engineer at AWS. He is known for his curiosity and deep-seated desire to understand the inner workings of complex systems. His inquisitive nature drives him to develop software solutions that make life easier for others. Albert’s approach to problem-solving emphasizes efficiency, reliability, and long-term stability, ensuring that his work has a tangible impact. Through his professional experiences, he has discovered the potential of technology to improve everyday life.

Use the Amazon Redshift Data API to interact with Amazon Redshift Serverless

Post Syndicated from Debu Panda original https://aws.amazon.com/blogs/big-data/use-the-amazon-redshift-data-api-to-interact-with-amazon-redshift-serverless/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics. Amazon Redshift Serverless makes it convenient for you to run and scale analytics without having to provision and manage data warehouses. With Redshift Serverless, data analysts, developers, and data scientists can now use Amazon Redshift to get insights from data in seconds by loading data into and querying records from the data warehouse.

As a data engineer or application developer, for some use cases, you want to interact with the Redshift Serverless data warehouse to load or query data with a simple API endpoint without having to manage persistent connections. With the Amazon Redshift Data API, you can interact with Redshift Serverless without having to configure JDBC or ODBC. This makes it easier and more secure to work with Redshift Serverless and opens up new use cases.

This post explains how to use the Data API with Redshift Serverless from the AWS Command Line Interface (AWS CLI) and Python. If you want to use the Data API with Amazon Redshift clusters, refer to Using the Amazon Redshift Data API to interact with Amazon Redshift clusters.

Introducing the Data API

The Data API enables you to seamlessly access data from Redshift Serverless with all types of traditional, cloud-native, and containerized serverless web service-based applications and event-driven applications.

The following diagram illustrates this architecture.

The Data API simplifies data access, ingest, and egress from programming languages and platforms supported by the AWS SDK such as Python, Go, Java, Node.js, PHP, Ruby, and C++.

The Data API simplifies access to Amazon Redshift by eliminating the need for configuring drivers and managing database connections. Instead, you can run SQL commands to Redshift Serverless by simply calling a secured API endpoint provided by the Data API. The Data API takes care of managing database connections and buffering data. The Data API is asynchronous, so you can retrieve your results later. Your query results are stored for 24 hours. The Data API federates AWS Identity and Access Management (IAM) credentials so you can use identity providers like Okta or Azure Active Directory or database credentials stored in Secrets Manager without passing database credentials in API calls.

For customers using AWS Lambda, the Data API provides a secure way to access your database without the additional overhead for Lambda functions to be launched in an Amazon VPC. Integration with the AWS SDK provides a programmatic interface to run SQL statements and retrieve results asynchronously.
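
For example, a minimal Lambda handler using the AWS SDK for Python (Boto3) might look like the following sketch; the workgroup name default, the database dev, and the table demo.green_201601 are assumptions that match the examples later in this post:

import boto3

client = boto3.client("redshift-data")

def lambda_handler(event, context):
    # Submit the SQL asynchronously; no VPC configuration or JDBC driver is needed
    response = client.execute_statement(
        WorkgroupName="default",
        Database="dev",
        Sql="SELECT COUNT(*) FROM demo.green_201601;",
    )
    # Use the statement ID later with describe-statement and get-statement-result
    return {"statementId": response["Id"]}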

Relevant use cases

The Data API is not a replacement for JDBC and ODBC drivers, and is suitable for use cases where you don’t need a persistent connection to a serverless data warehouse. It’s applicable in the following use cases:

  • Accessing Amazon Redshift from custom applications with any programming language supported by the AWS SDK. This enables you to integrate web service-based applications to access data from Amazon Redshift using an API to run SQL statements. For example, you can run SQL from JavaScript.
  • Building a serverless data processing workflow.
  • Designing asynchronous web dashboards because the Data API lets you run long-running queries without having to wait for them to complete.
  • Running your query one time and retrieving the results multiple times without having to run the query again within 24 hours.
  • Building your ETL pipelines with AWS Step Functions, Lambda, and stored procedures.
  • Having simplified access to Amazon Redshift from Amazon SageMaker and Jupyter notebooks.
  • Building event-driven applications with Amazon EventBridge and Lambda.
  • Scheduling SQL scripts to simplify data load, unload, and refresh of materialized views.

The Data API GitHub repository provides examples for different use cases for both Redshift Serverless and provisioned clusters.

Create a Redshift Serverless workgroup

If you haven’t already created a Redshift Serverless data warehouse, or want to create a new one, refer to the Getting Started Guide. This guide walks you through the steps of creating a namespace and workgroup with the default names. Also, make sure that the IAM role you attach to your Redshift Serverless namespace has the AmazonS3ReadOnlyAccess permission. You can use the AWS Management Console to create an IAM role and assign Amazon Simple Storage Service (Amazon S3) privileges (refer to Loading in data from Amazon S3). In this post, we create a table and load data using the COPY command.
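
If you manage the IAM role programmatically, the following Boto3 sketch shows one way to attach the required Amazon S3 permission; the role name is a placeholder and should match the role attached to your namespace:

import boto3

iam = boto3.client("iam")

# Attach the AWS managed AmazonS3ReadOnlyAccess policy to the namespace role
# (the role name below is a placeholder)
iam.attach_role_policy(
    RoleName="my-redshift-serverless-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)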

Prerequisites for using the Data API

You must be authorized to access the Data API. Amazon Redshift provides the RedshiftDataFullAccess managed policy, which offers full access to the Data API. This policy also allows access to Redshift Serverless workgroups, Secrets Manager, and the API operations needed to authenticate and access a Redshift Serverless workgroup by using IAM credentials.

You can also create your own IAM policy that allows access to specific resources by starting with RedshiftDataFullAccess as a template.

The Data API allows you to access your database either using your IAM credentials or secrets stored in Secrets Manager. In this post, we use IAM credentials.

When you federate your IAM credentials to connect with Amazon Redshift, it automatically creates a database user for the IAM user that is being used. It uses the GetCredentials API to get temporary database credentials. If you want to provide specific database privileges to your users with this API, you can use an IAM role with the tag name RedshiftDBRoles with a list of roles separated by colons. For example, if you want to assign database roles such as sales and analyst, you can have a value sales:analyst assigned to RedshiftDBRoles.
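
For illustration, the following Boto3 sketch tags an IAM role so that federated users receive the sales and analyst database roles; the role name is a placeholder, and the tag key follows the naming described above:

import boto3

iam = boto3.client("iam")

# Tag the IAM role so federated users are assigned the sales and analyst database roles
# (the role name is a placeholder; the tag key is the one named in the text above)
iam.tag_role(
    RoleName="my-analyst-role",
    Tags=[{"Key": "RedshiftDBRoles", "Value": "sales:analyst"}],
)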

Use the Data API from the AWS CLI

You can use the Data API from the AWS CLI to interact with the Redshift Serverless workgroup and namespace. For instructions on configuring the AWS CLI, see Setting up the AWS CLI. The Amazon Redshift Serverless CLI (aws redshift-serverless) is a part of the AWS CLI that lets you manage Amazon Redshift workgroups and namespaces, such as creating, deleting, setting usage limits, tagging resources, and more. The Data API provides a command line interface to the AWS CLI (aws redshift-data) that allows you to interact with the databases in Redshift Serverless.

You can invoke help using the following command:

aws redshift-data help

The following table shows you the different commands available with the Data API CLI.

Command Description
list-databases Lists the databases in a workgroup.
list-schemas Lists the schemas in a database. You can filter this by a matching schema pattern.
list-tables Lists the tables in a database. You can filter the tables list by a schema name pattern, a matching table name pattern, or a combination of both.
describe-table Describes the detailed information about a table including column metadata.
execute-statement Runs a SQL statement, which can be SELECT, DML, DDL, COPY, or UNLOAD.
batch-execute-statement Runs multiple SQL statements in a batch as part of a single transaction. The statements can be SELECT, DML, DDL, COPY, or UNLOAD.
cancel-statement Cancels a running query. To be canceled, a query must not be in the FINISHED or FAILED state.
describe-statement Describes the details of a specific SQL statement run. The information includes when the query started, when it finished, the number of rows processed, and the SQL statement.
list-statements Lists the SQL statements in the last 24 hours. By default, only finished statements are shown.
get-statement-result Fetches the temporarily cached result of the query. The response contains the complete result set and the column metadata. You can paginate through a set of records to retrieve the entire result as needed.

If you want to get help on a specific command, run the following command:

aws redshift-data list-tables help

Now we look at how you can use these commands.

List databases

Most organizations use a single database in their Amazon Redshift workgroup. You can use the following command to list the databases in your Serverless endpoint. This operation requires you to connect to a database and therefore requires database credentials.

aws redshift-data list-databases --database dev --workgroup-name default

List schemas

Similar to listing databases, you can list your schemas by using the list-schemas command:

aws redshift-data list-schemas --database dev --workgroup-name default

If you have several schemas that match demo (demo, demo2, demo3, and so on), you can optionally provide a pattern to filter your results matching to that pattern:

aws redshift-data list-schemas --database dev --workgroup-name default --schema-pattern "demo%"

List tables

The Data API provides a simple command, list-tables, to list tables in your database. You might have thousands of tables in a schema; the Data API lets you paginate your result set or filter the table list by providing filter conditions.

You can search across your schema with table-pattern; for example, you can filter the table list by a table name prefix across all your schemas in the database or filter your tables list in a specific schema pattern by using schema-pattern.

The following is a code example that uses both:

aws redshift-data list-tables --database dev --workgroup-name default --schema-pattern "demo%" --table-pattern "orders%"

Run SQL commands

You can run SELECT, DML, DDL, COPY, or UNLOAD commands for Amazon Redshift with the Data API. You can optionally specify the --with-event option if you want to send an event to EventBridge after the query runs; the Data API then sends the event with the queryId and the final run status.

Create a schema

Let’s use the Data API to see how you can create a schema. The following command lets you create a schema in your database. You don’t have to run this SQL if you have pre-created the schema. You use the --sql parameter to specify your SQL command.

aws redshift-data execute-statement --database dev --workgroup-name default \
--sql "CREATE SCHEMA demo;"

The following shows an example output of execute-statement:

{
    "CreatedAt": "2023-04-07T17:14:43.038000+00:00",
    "Database": "dev",
    "DbUser": "IAMR:Admin",
    "Id": "8e4e5af3-9af9-4567-8e70-7849515b3a79",
    "WorkgroupName": "default"
}

We discuss later in this post how you can check the status of a SQL that you ran with execute-statement.

Create a table

You can use the following command to create a table with the CLI:

aws redshift-data execute-statement --database dev --workgroup-name default  \
   --sql "CREATE TABLE demo.green_201601( \
  vendorid                VARCHAR(4), \
  pickup_datetime         TIMESTAMP, \
  dropoff_datetime        TIMESTAMP, \
  store_and_fwd_flag      VARCHAR(1), \
  ratecode                INT, \
  pickup_longitude        FLOAT4, \
  pickup_latitude         FLOAT4, \
  dropoff_longitude       FLOAT4, \
  dropoff_latitude        FLOAT4, \
  passenger_count         INT, \
  trip_distance           FLOAT4, \
  fare_amount             FLOAT4, \
  extra                   FLOAT4, \
  mta_tax                 FLOAT4, \
  tip_amount              FLOAT4, \
  tolls_amount            FLOAT4, \
  ehail_fee               FLOAT4, \
  improvement_surcharge   FLOAT4, \
  total_amount            FLOAT4, \
  payment_type            VARCHAR(4),\
  trip_type               VARCHAR(4));" 

Load sample data

The COPY command lets you load bulk data into your table in Amazon Redshift. You can use the following command to load data into the table we created earlier:

aws redshift-data execute-statement --database dev --workgroup-name default --sql "COPY demo.green_201601 \
FROM 's3://us-west-2.serverless-analytics/NYC-Pub/green/green_tripdata_2016-01' \
IAM_ROLE default \
DATEFORMAT 'auto' \
IGNOREHEADER 1 \
DELIMITER ',' \
IGNOREBLANKLINES \
REGION 'us-west-2';" 

Retrieve data

The following query uses the table we created earlier:

aws redshift-data execute-statement --database dev --workgroup-name default --sql "SELECT ratecode,  \
COUNT(*) FROM demo.green_201601 WHERE \
trip_distance > 5 GROUP BY 1 ORDER BY 1;"

The following shows an example output:

{
    "CreatedAt": "2023-04-07T17:25:16.030000+00:00",
    "Database": "dev",
    "DbUser": "IAMR:Admin",
    "Id": "cae88c08-0bb4-4279-8845-d5a8fefafade",
    "WorkgroupName": "default"
}

You can fetch results using the statement ID that you receive as an output of execute-statement.

Check the status of a statement

You can check the status of your statement by using describe-statement. The output for describe-statement provides additional details such as the PID, the query duration, the number of rows in and the size of the result set, and the query ID given by Amazon Redshift. You have to specify the statement ID that you get when you run the execute-statement command. See the following command:

aws redshift-data describe-statement --id cae88c08-0bb4-4279-8845-d5a8fefafade

The following is an example output:

{
     "CreatedAt": "2023-04-07T17:27:15.937000+00:00",
     "Duration": 2602410468,
     "HasResultSet": true,
     "Id": "cae88c08-0bb4-4279-8845-d5a8fefafade",
     "QueryString": " SELECT ratecode, COUNT(*) FROM 
     demo.green_201601 WHERE
     trip_distance > 5 GROUP BY 1 ORDER BY 1;",
     "RedshiftPid": 1073815670,
     "WorkgroupName": "default",
     "UpdatedAt": "2023-04-07T17:27:18.539000+00:00"
}

The status of a statement can be STARTED, FINISHED, ABORTED, or FAILED.

Run SQL statements with parameters

You can run SQL statements with parameters. The following example uses two named parameters in the SQL that is specified using a name-value pair:

aws redshift-data execute-statement --database dev --workgroup-name default --sql "select sellerid,sum(pricepaid) totalsales from sales where eventid >= :eventid and sellerid > :selrid group by sellerid"  --parameters "[{\"name\": \"selrid\", \"value\": \"100\"},{\"name\": \"eventid\", \"value\": \"100\"}]"

The describe-statement returns QueryParameters along with QueryString.

You can map the name-value pairs in the parameters list to one or more parameters in the SQL text, and the name-value parameters can be in any order. You can’t specify a NULL value or zero-length value as a parameter.
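
If you call the Data API through the AWS SDK for Python (covered later in this post), the same name-value pairs map to the Parameters argument. The following minimal sketch mirrors the CLI example above:

import boto3

client = boto3.client("redshift-data")

# Named parameters (:eventid, :selrid) are supplied as name-value pairs, in any order
response = client.execute_statement(
    WorkgroupName="default",
    Database="dev",
    Sql="select sellerid, sum(pricepaid) totalsales from sales "
        "where eventid >= :eventid and sellerid > :selrid group by sellerid",
    Parameters=[
        {"name": "eventid", "value": "100"},
        {"name": "selrid", "value": "100"},
    ],
)
print(response["Id"])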

Cancel a running statement

If your query is still running, you can use cancel-statement to cancel a SQL query. See the following command:

aws redshift-data cancel-statement --id 39a0de2f-e85e-45ff-a0d7-cd074c348120

Fetch results from your query

You can fetch the query results by using get-statement-result. The query result is stored for 24 hours. See the following command:

aws redshift-data get-statement-result --id 7b61da88-1b11-4ade-956a-21085a29118d

The output of the result contains metadata such as the number of records fetched, column metadata, and a token for pagination.
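
If the result set spans multiple pages, you can follow the pagination token until it is exhausted. The following Boto3 sketch shows one way to do this; the statement ID is whatever execute-statement returned:

import boto3

client = boto3.client("redshift-data")

def fetch_all_records(statement_id):
    """Page through get_statement_result until NextToken is exhausted."""
    records = []
    token = None
    while True:
        kwargs = {"Id": statement_id}
        if token:
            kwargs["NextToken"] = token
        page = client.get_statement_result(**kwargs)
        records.extend(page["Records"])
        token = page.get("NextToken")
        if not token:
            return records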

Run multiple SQL statements

You can run multiple SELECT, DML, DDL, COPY, or UNLOAD commands for Amazon Redshift in a single transaction with the Data API. The batch-execute-statement enables you to create tables and run multiple COPY commands or create temporary tables as part of your reporting system and run queries on that temporary table. See the following code:

aws redshift-data batch-execute-statement --database dev --workgroup-name default \
--sqls "create temporary table mysales \
(firstname, lastname, total_quantity ) as \
SELECT firstname, lastname, total_quantity \
FROM   (SELECT buyerid, sum(qtysold) total_quantity \
        FROM  sales  \
        GROUP BY buyerid \
        ORDER BY total_quantity desc limit 10) Q, users \
WHERE Q.buyerid = userid \ 
ORDER BY Q.total_quantity desc;" "select * from mysales limit 100;"

The describe-statement for a multi-statement query shows the status of all sub-statements:

{
"CreatedAt": "2023-04-10T14:01:11.257000-07:00",
"Duration": 30564173,
"HasResultSet": true,
"Id": "23d99d7f-fd13-4686-92c8-e2c279715c21",
"RedshiftPid": 1073922185,
"RedshiftQueryId": 0,
"ResultRows": -1,
"ResultSize": -1,
"Status": "FINISHED",
"SubStatements": [
{
"CreatedAt": "2023-04-10T14:01:11.357000-07:00",
"Duration": 12779028,
"HasResultSet": false,
"Id": "23d99d7f-fd13-4686-92c8-e2c279715c21:1",
"QueryString": "create temporary table mysales (firstname, lastname,
total_quantity ) as \nSELECT firstname, lastname, total_quantity \nFROM (SELECT
buyerid, sum(qtysold) total_quantity\nFROM sales\nGROUP BY
buyerid\nORDER BY total_quantity desc limit 10) Q, users\nWHERE Q.buyerid =
userid\nORDER BY Q.total_quantity desc;",
"RedshiftQueryId": 0,
"ResultRows": 0,
"ResultSize": 0,
"Status": "FINISHED",
"UpdatedAt": "2023-04-10T14:01:11.807000-07:00"
},
{
"CreatedAt": "2023-04-10T14:01:11.357000-07:00",
"Duration": 17785145,
"HasResultSet": true,
"Id": "23d99d7f-fd13-4686-92c8-e2c279715c21:2",
"QueryString": ""select *\nfrom mysales limit 100;",
"RedshiftQueryId": 0,
"ResultRows": 40,
"ResultSize": 1276,
"Status": "FINISHED",
"UpdatedAt": "2023-04-10T14:01:11.911000-07:00"
}
],
"UpdatedAt": "2023-04-10T14:01:11.970000-07:00",
"WorkgroupName": "default"
}

In the preceding example, we had two SQL statements and therefore the output includes the ID for the SQL statements as 23d99d7f-fd13-4686-92c8-e2c279715c21:1 and 23d99d7f-fd13-4686-92c8-e2c279715c21:2. Each sub-statement of a batch SQL statement has a status, and the status of the batch statement is updated with the status of the last sub-statement. For example, if the last statement has status FAILED, then the status of the batch statement shows as FAILED.

You can fetch query results for each statement separately. In our example, the first statement is a SQL statement to create a temporary table, so there are no results to retrieve for the first statement. You can retrieve the result set for the second statement by providing the statement ID for the sub-statement:

aws redshift-data get-statement-result --id 23d99d7f-fd13-4686-92c8-e2c279715c21:2

Use the Data API with Secrets Manager

The Data API allows you to use database credentials stored in Secrets Manager. You can create a secret of type Other type of secret and then specify the user name and password. Note that you can’t choose the Amazon Redshift cluster secret type because Redshift Serverless is different from a cluster.

Let’s assume that you created a secret key for your credentials as defaultWG. You can use the secret-arn parameter to pass your secret key as follows:

aws redshift-data list-tables --database dev --workgroup-name default --secret-arn defaultWG --region us-west-1
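
The same secret works from the AWS SDK; in the following Boto3 sketch, the secret ARN is a placeholder and should be replaced with the ARN (or name) of the secret that stores your database user name and password:

import boto3

client = boto3.client("redshift-data")

# SecretArn replaces IAM-based authentication; the ARN below is a placeholder
response = client.execute_statement(
    WorkgroupName="default",
    Database="dev",
    SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:defaultWG-AbCdEf",
    Sql="SELECT current_user;",
)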

Export the data

Amazon Redshift allows you to export from database tables to a set of files in an S3 bucket by using the UNLOAD command with a SELECT statement. You can unload data in either text or Parquet format. The following command shows you an example of how to use the data lake export with the Data API:

aws redshift-data execute-statement --database dev --workgroup-name default --sql "unload ('select * from demo.green_201601') to '<your-S3-bucket>' iam_role '<your-iam-role>'; " 

You can use batch-execute-statement if you want to use multiple statements with UNLOAD or combine UNLOAD with other SQL statements.

Use the Data API from the AWS SDK

You can use the Data API in any of the programming languages supported by the AWS SDK. For this post, we use the AWS SDK for Python (Boto3) as an example to illustrate the capabilities of the Data API.

We first import the Boto3 package and establish a session:

import botocore.session as bc
import boto3

def get_client(service, endpoint=None, region="us-west-2"):
    session = bc.get_session()
    s = boto3.Session(botocore_session=session, region_name=region)
    if endpoint:
        return s.client(service, endpoint_url=endpoint)
    return s.client(service)

Get a client object

You can create a client object from the boto3.Session object by specifying the redshift-data service:

rsd = get_client('redshift-data')

If you don’t want to create a session, your client is as simple as the following code:

import boto3
client = boto3.client('redshift-data')

Run a statement

The following example code runs a statement using IAM credentials through the Redshift Serverless workgroup (no Secrets Manager secret is needed). For this post, we use the table we created earlier. You can use DDL, DML, COPY, and UNLOAD in the SQL parameter:

resp = rsd.execute_statement(
    WorkgroupName="default",
    Database="dev",
    Sql="SELECT ratecode, COUNT(*) totalrides FROM demo.green_201601 WHERE trip_distance > 5 GROUP BY 1 ORDER BY 1;"
)

As we discussed earlier, running a query is asynchronous; running a statement returns an ExecuteStatementOutput, which includes the statement ID.

If you want to publish an event to EventBridge when the statement is complete, you can use the additional parameter WithEvent set to true:

resp = rsd.execute_statement(
    Database="dev",
    WorkgroupName="default",
    Sql="SELECT ratecode, COUNT(*) totalrides FROM demo.green_201601 WHERE trip_distance > 5 GROUP BY 1 ORDER BY 1;",
WithEvent=True
)
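
To act on that event, you can create an EventBridge rule that matches Data API status-change events. The following Boto3 sketch is only an illustration: the source and detail-type strings in the pattern are assumptions and should be verified against the EventBridge documentation for the Data API before use.

import json
import boto3

events = boto3.client("events")

# Match Data API completion events; the pattern values below are assumptions
events.put_rule(
    Name="redshift-data-statement-status-change",
    EventPattern=json.dumps({
        "source": ["aws.redshift-data"],
        "detail-type": ["Redshift Data Statement Status Change"],
    }),
)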

Describe a statement

You can use describe_statement to find the status of the query and number of records retrieved:

qid = resp['Id']
desc = rsd.describe_statement(Id=qid)
if desc["Status"] == "FINISHED":
    print(desc["ResultRows"])

Fetch results from your query

You can use get_statement_result to retrieve results for your query if your query is complete:

if desc and desc["ResultRows"]  > 0:
    result = rsd.get_statement_result(Id=qid)

The get_statement_result command returns a JSON object that includes metadata for the result and the actual result set. You might need to process the data to format the result if you want to display it in a user-friendly format.

Fetch and format results

For this post, we demonstrate how to format the results with the Pandas framework. The post_process function processes the metadata and results to populate a DataFrame. The query function runs the SQL and retrieves the result from a database in your Redshift Serverless workgroup. See the following code:

import time

import pandas as pd

def post_process(meta, records):
    columns = [k["name"] for k in meta]
    rows = []
    for r in records:
        tmp = []
        for c in r:
            tmp.append(c[list(c.keys())[0]])
        rows.append(tmp)
    return pd.DataFrame(rows, columns=columns)

def query(sql, workgroup="default", database="dev"):
    resp = rsd.execute_statement(
        Database=database,
        WorkgroupName=workgroup,
        Sql=sql
    )
    qid = resp["Id"]
    print(qid)
    # Poll until the statement reaches a terminal state
    while True:
        desc = rsd.describe_statement(Id=qid)
        if desc["Status"] in ("FINISHED", "FAILED"):
            break
        time.sleep(0.5)
    print(desc["ResultRows"])
    if desc and desc["ResultRows"] > 0:
        result = rsd.get_statement_result(Id=qid)
        rows, meta = result["Records"], result["ColumnMetadata"]
        return post_process(meta, rows)

pf=query("select * from demo.customer_activity limit 100;")
print(pf)

In this post, we demonstrated using the Data API with Python with Redshift Serverless. However, you can use the Data API with other programming languages supported by the AWS SDK. You can read how Roche democratized access to Amazon Redshift data using the Data API with Google Sheets. You can also address this type of use case with Redshift Serverless.

Best practices

We recommend the following best practices when using the Data API:

  • Federate your IAM credentials to the database to connect with Amazon Redshift. Redshift Serverless allows users to get temporary database credentials with GetCredentials. Redshift Serverless scopes the access to the specific IAM user and the database user is automatically created.
  • Use a custom policy to provide fine-grained access to the Data API in the production environment if you don’t want your users to use temporary credentials. You have to use Secrets Manager to manage your credentials in such use cases.
  • Don’t retrieve a large amount of data through your client; instead, use the UNLOAD command to export the query results to Amazon S3 (a sketch follows this list). You’re limited to retrieving only 100 MB of data with the Data API.
  • Don’t forget to retrieve your results within 24 hours; results are stored only for 24 hours.
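
The following Boto3 sketch shows the UNLOAD pattern from the third best practice; the S3 path and IAM role ARN are placeholders:

import boto3

client = boto3.client("redshift-data")

# Export large result sets to Amazon S3 instead of paging them through the Data API
client.execute_statement(
    WorkgroupName="default",
    Database="dev",
    Sql="UNLOAD ('SELECT * FROM demo.green_201601') "
        "TO 's3://my-bucket/exports/green_201601_' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role' "
        "FORMAT AS PARQUET;",
)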

Conclusion

In this post, we introduced how to use the Data API with Redshift Serverless. We also demonstrated how to use the Data API from the AWS CLI and from Python using the AWS SDK. Additionally, we discussed best practices for using the Data API.

To learn more, refer to Using the Amazon Redshift Data API or visit the Data API GitHub repository for code examples.


About the authors

Debu Panda, a Senior Manager of Product Management at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences such as re:Invent, Oracle Open World, and JavaOne. He is the lead author of EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

Fei Peng is a Software Dev Engineer working in the Amazon Redshift team.

Simplify and speed up Apache Spark applications on Amazon Redshift data with Amazon Redshift integration for Apache Spark

Post Syndicated from Gagan Brahmi original https://aws.amazon.com/blogs/big-data/simplify-and-speed-up-apache-spark-applications-on-amazon-redshift-data-with-amazon-redshift-integration-for-apache-spark/

Customers use Amazon Redshift to run their business-critical analytics on petabytes of structured and semi-structured data. Apache Spark is a popular framework that you can use to build applications for use cases such as ETL (extract, transform, and load), interactive analytics, and machine learning (ML). Apache Spark enables you to build applications in a variety of languages, such as Java, Scala, and Python, by accessing the data in your Amazon Redshift data warehouse.

Amazon Redshift integration for Apache Spark helps developers seamlessly build and run Apache Spark applications on Amazon Redshift data. Developers can use AWS analytics and ML services such as Amazon EMR, AWS Glue, and Amazon SageMaker to effortlessly build Apache Spark applications that read from and write to their Amazon Redshift data warehouse. You can do so without compromising on the performance of your applications or transactional consistency of your data.

In this post, we discuss why Amazon Redshift integration for Apache Spark is critical and efficient for analytics and ML. In addition, we discuss use cases that use Amazon Redshift integration with Apache Spark to drive business impact. Finally, we walk you through step-by-step examples of how to use this official AWS connector in an Apache Spark application.

Amazon Redshift integration for Apache Spark

The Amazon Redshift integration for Apache Spark minimizes the cumbersome and often manual process of setting up a spark-redshift connector (community version) and shortens the time needed to prepare for analytics and ML tasks. You only need to specify the connection to your data warehouse, and you can start working with Amazon Redshift data from your Apache Spark-based applications within minutes.

You can use several pushdown capabilities for operations such as sort, aggregate, limit, join, and scalar functions so that only the relevant data is moved from your Amazon Redshift data warehouse to the consuming Apache Spark application. This allows you to improve the performance of your applications. Amazon Redshift admins can easily identify the SQL generated from Spark-based applications. In this post, we show how you can find out the SQL generated by the Apache Spark job.

Moreover, Amazon Redshift integration for Apache Spark uses Parquet file format when staging the data in a temporary directory. Amazon Redshift uses the UNLOAD SQL statement to store this temporary data on Amazon Simple Storage Service (Amazon S3). The Apache Spark application retrieves the results from the temporary directory (stored in Parquet file format), which improves performance.

You can also help make your applications more secure by utilizing AWS Identity and Access Management (IAM) credentials to connect to Amazon Redshift.

Amazon Redshift integration for Apache Spark is built on top of the spark-redshift connector (community version) and enhances it for performance and security, helping you gain up to 10 times faster application performance.

Use cases for Amazon Redshift integration with Apache Spark

For our use case, the leadership of the product-based company wants to know the sales for each product across multiple markets. As sales for the company fluctuate dynamically, it has become a challenge for the leadership to track the sales across multiple markets. However, the overall sales are declining, and the company leadership wants to find out which markets aren’t performing so that they can target these markets for promotion campaigns.

For sales across multiple markets, the product sales data such as orders, transactions, and shipment data is available on Amazon S3 in the data lake. The data engineering team can use Apache Spark with Amazon EMR or AWS Glue to analyze this data in Amazon S3.

The inventory data is available in Amazon Redshift. Similarly, the data engineering team can analyze this data with Apache Spark using Amazon EMR or an AWS Glue job by using the Amazon Redshift integration for Apache Spark to perform aggregations and transformations. The aggregated and transformed dataset can be stored back into Amazon Redshift using the Amazon Redshift integration for Apache Spark.

Using a distributed framework like Apache Spark with the Amazon Redshift integration for Apache Spark can provide the visibility across the data lake and data warehouse to generate sales insights. These insights can be made available to the business stakeholders and line of business users in Amazon Redshift to make informed decisions to run targeted promotions for the low revenue market segments.

Additionally, we can use the Amazon Redshift integration with Apache Spark in the following use cases:

  • An Amazon EMR or AWS Glue customer running Apache Spark jobs wants to transform data and write that into Amazon Redshift as a part of their ETL pipeline
  • An ML customer uses Apache Spark with SageMaker for feature engineering for accessing and transforming data in Amazon Redshift
  • An Amazon EMR, AWS Glue, or SageMaker customer uses Apache Spark for interactive data analysis with data on Amazon Redshift from notebooks

Examples for Amazon Redshift integration for Apache Spark in an Apache Spark application

In this post, we show the steps to connect to Amazon Redshift from Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2), Amazon EMR Serverless, and AWS Glue using a common script. In the following sample code, we generate a report showing the quarterly sales for the year 2008. To do that, we join two Amazon Redshift tables using an Apache Spark DataFrame, run a predicate pushdown, aggregate and sort the data, and write the transformed data back to Amazon Redshift. The script uses PySpark.

The script uses IAM-based authentication for Amazon Redshift. IAM roles used by Amazon EMR and AWS Glue should have the appropriate permissions to authenticate Amazon Redshift, and access to an S3 bucket for temporary data storage.

The following example policy allows the IAM role to call the GetClusterCredentials operation:

{
  "Version": "2012-10-17",
  "Statement": {
    "Effect": "Allow",
    "Action": "redshift:GetClusterCredentials",
    "Resource": "arn:aws:redshift:<aws_region_name>:xxxxxxxxxxxx:dbuser:*/temp_*"
  }
}

The following example policy allows access to an S3 bucket for temporary data storage:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": "arn:aws:s3:::<s3_bucket_name>"
        }
    ]
}

The complete script is as follows:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initiate Apache Spark session
spark = SparkSession \
        .builder \
        .appName("SparkRedshiftConnector") \
        .enableHiveSupport() \
        .getOrCreate()

# Set connection options for Amazon Redshift
jdbc_iam_url = "jdbc:redshift:iam://redshift-spark-connector-1.xxxxxxxxxxx.<aws_region_name>.redshift.amazonaws.com:5439/sample_data_dev"
temp_dir = 's3://<s3_bucket_name>/redshift-temp-dir/'
aws_role = 'arn:aws:iam::xxxxxxxxxxxx:role/redshift-s3'

# Set query group for the query. More details on Amazon Redshift WLM https://docs.aws.amazon.com/redshift/latest/dg/cm-c-executing-queries.html
queryGroup = "emr-redshift"
jdbc_iam_url_withQueryGroup = jdbc_iam_url+'?queryGroup='+queryGroup

# Set User name for the query
userName = 'awsuser'
jdbc_iam_url_withUserName = jdbc_iam_url_withQueryGroup+';user='+userName

# Define the Amazon Redshift context
redshiftOptions = {
    "url": jdbc_iam_url_withUserName,
    "tempdir": temp_dir,
    "aws_iam_role" : aws_role
}

# Create the sales DataFrame from Amazon Redshift table using io.github.spark_redshift_community.spark.redshift class
sales_df = (
    spark.read
        .format("io.github.spark_redshift_community.spark.redshift")
        .options(**redshiftOptions)
        .option("dbtable", "tickit.sales")
        .load()
)

# Create the date Data Frame from Amazon Redshift table
date_df = (
    spark.read
        .format("io.github.spark_redshift_community.spark.redshift")
        .options(**redshiftOptions)
        .option("dbtable", "tickit.date")
        .load()
)

# Assign a Data Frame to the above output which will be written back to Amazon Redshift
output_df= sales_df.join(date_df, sales_df.dateid == date_df.dateid, 'inner').where(
    col("year") == 2008).groupBy("qtr").sum("qtysold").select(
        col("qtr"), col("sum(qtysold)")).sort(["qtr"], ascending=[1]).withColumnRenamed("sum(qtysold)","total_quantity_sold")

# Display the output
output_df.show()

## Let's drop the queryGroup for easy validation of pushdown queries
# Set User name for the query
userName = 'awsuser'
jdbc_iam_url_withUserName = jdbc_iam_url+'?user='+userName

# Define the Amazon Redshift context
redshiftWriteOptions = {
    "url": jdbc_iam_url_withUserName,
    "tempdir": temp_dir,
    "aws_iam_role" : aws_role
}

# Write the Data Frame back to Amazon Redshift
output_df.write \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .mode("overwrite") \
    .options(**redshiftWriteOptions) \
    .option("dbtable", "tickit.test") \
    .save()

If you plan to use the preceding script in your environment, make sure you replace the values for the following variables with the appropriate values for your environment: jdbc_iam_url, temp_dir, and aws_role.

In the next section, we walk through the steps to run this script to aggregate a sample dataset that is made available in Amazon Redshift.

Prerequisites

Before we begin, make sure the following prerequisites are met:

Deploy resources using AWS CloudFormation

Complete the following steps to deploy the CloudFormation stack:

  1. Sign in to the AWS Management Console, then launch the CloudFormation stack.

You can also download the CloudFormation template to create the resources mentioned in this post through infrastructure as code (IaC). Use this template when launching a new CloudFormation stack.

  1. Scroll down to the bottom of the page to select I acknowledge that AWS CloudFormation might create IAM resources under Capabilities, then choose Create stack.

The stack creation process takes 15–20 minutes to complete. The CloudFormation template creates the following resources:

    • An Amazon VPC with the needed subnets, route tables, and NAT gateway
    • An S3 bucket with the name redshift-spark-databucket-xxxxxxx (note that xxxxxxx is a random string to make the bucket name unique)
    • An Amazon Redshift cluster with sample data loaded inside the database dev and the primary user redshiftmasteruser. For the purposes of this blog post, redshiftmasteruser with administrative permissions is used. However, it is recommended to use a user with fine-grained access control in a production environment.
    • An IAM role to be used for Amazon Redshift with the ability to request temporary credentials from the Amazon Redshift cluster’s dev database
    • Amazon EMR Studio with the needed IAM roles
    • Amazon EMR release version 6.9.0 on an EC2 cluster with the needed IAM roles
    • An Amazon EMR Serverless application release version 6.9.0
    • An AWS Glue connection and AWS Glue job version 4.0
    • A Jupyter notebook to run using Amazon EMR Studio using Amazon EMR on an EC2 cluster
    • A PySpark script to run using Amazon EMR Studio and Amazon EMR Serverless
  1. After the stack creation is complete, choose the stack name redshift-spark and navigate to the Outputs tab.

We utilize these output values later in this post.

In the next sections, we show the steps for Amazon Redshift integration for Apache Spark from Amazon EMR on Amazon EC2, Amazon EMR Serverless, and AWS Glue.

Use Amazon Redshift integration with Apache Spark on Amazon EMR on EC2

Starting with Amazon EMR release version 6.9.0, the connector for the Amazon Redshift integration for Apache Spark and the Amazon Redshift JDBC driver are available locally on Amazon EMR. These files are located under the /usr/share/aws/redshift/ directory. In previous versions of Amazon EMR, only the community version of the spark-redshift connector is available.

The following example shows how to connect Amazon Redshift using a PySpark kernel via an Amazon EMR Studio notebook. The CloudFormation stack created Amazon EMR Studio, Amazon EMR on an EC2 cluster, and a Jupyter notebook available to run. To go through this example, complete the following steps:

  1. Download the Jupyter notebook made available in the S3 bucket for you:
    • In the CloudFormation stack outputs, look for the value for EMRStudioNotebook, which should point to the redshift-spark-emr.ipynb notebook available in the S3 bucket.
    • Choose the link or open the link in a new tab by copying the URL for the notebook.
    • After you open the link, download the notebook by choosing Download, which will save the file locally on your computer.
  1. Access Amazon EMR Studio by choosing or copying the link provided in the CloudFormation stack outputs for the key EMRStudioURL.
  2. In the navigation pane, choose Workspaces.
  3. Choose Create Workspace.
  4. Provide a name for the Workspace, for instance redshift-spark.
  5. Expand the Advanced configuration section and select Attach Workspace to an EMR cluster.
  6. Under Attach to an EMR cluster, choose the EMR cluster with the name emrCluster-Redshift-Spark.
  7. Choose Create Workspace.
  8. After the Amazon EMR Studio Workspace is created and in Attached status, you can access the Workspace by choosing the name of the Workspace.

This should open the Workspace in a new tab. Note that if you have a pop-up blocker, you may have to allow the Workspace to open or disable the pop-up blocker.

In the Amazon EMR Studio Workspace, we now upload the Jupyter notebook we downloaded earlier.

  1. Choose Upload to browse your local file system and upload the Jupyter notebook (redshift-spark-emr.ipynb).
  2. Choose (double-click) the redshift-spark-emr.ipynb notebook within the Workspace to open the notebook.

The notebook provides the details of different tasks that it performs. Note that in the section Define the variables to connect to Amazon Redshift cluster, you don’t need to update the values for jdbc_iam_url, temp_dir, and aws_role because these are updated for you by AWS CloudFormation. AWS CloudFormation has also performed the steps mentioned in the Prerequisites section of the notebook.

You can now start running the notebook.

  1. Run the individual cells by selecting them and then choosing Play.

You can also use the key combination of Shift+Enter or Shift+Return. Alternatively, you can run all the cells by choosing Run All Cells on the Run menu.

  1. Find the predicate pushdown operation performed on the Amazon Redshift cluster by the Amazon Redshift integration for Apache Spark.

We can also see the temporary data stored on Amazon S3 in the optimized Parquet format. The output can be seen from running the cell in the section Get the last query executed on Amazon Redshift.

  1. To validate the table created by the job from Amazon EMR on Amazon EC2, navigate to the Amazon Redshift console and choose the cluster redshift-spark-redshift-cluster on the Provisioned clusters dashboard page.
  2. In the cluster details, on the Query data menu, choose Query in query editor v2.
  3. Choose the cluster in the navigation pane and connect to the Amazon Redshift cluster when it requests for authentication.
  4. Select Temporary credentials.
  5. For Database, enter dev.
  6. For User name, enter redshiftmasteruser.
  7. Choose Save.
  8. In the navigation pane, expand the cluster redshift-spark-redshift-cluster, expand the dev database, expand tickit, and expand Tables to list all the tables inside the schema tickit.

You should find the table test_emr.

  1. Choose (right-click) the table test_emr, then choose Select table to query the table.
  2. Choose Run to run the SQL statement.

Use Amazon Redshift integration with Apache Spark on Amazon EMR Serverless

The Amazon EMR release version 6.9.0 and above provides the Amazon Redshift integration for Apache Spark JARs (managed by Amazon Redshift) and Amazon Redshift JDBC JARs locally on Amazon EMR Serverless as well. These files are located under the /usr/share/aws/redshift/ directory. In the following example, we use the Python script made available in the S3 bucket by the CloudFormation stack we created earlier.

  1. In the CloudFormation stack outputs, make a note of the value for EMRServerlessExecutionScript, which is the location of the Python script in the S3 bucket.
  2. Also note the value for EMRServerlessJobExecutionRole, which is the IAM role to be used with running the Amazon EMR Serverless job.
  3. Access Amazon EMR Studio by choosing or copying the link provided in the CloudFormation stack outputs for the key EMRStudioURL.
  4. Choose Applications under Serverless in the navigation pane.

You will find an EMR application created by the CloudFormation stack with the name emr-spark-redshift.

  1. Choose the application name to submit a job.
  2. Choose Submit job.
  3. Under Job details, for Name, enter an identifiable name for the job.
  4. For Runtime role, choose the IAM role that you noted from the CloudFormation stack output earlier.
  5. For Script location, provide the path to the Python script you noted earlier from the CloudFormation stack output.
  6. Expand the Spark properties section and choose Edit in text.
  7. Enter the following value in the text box, which provides the path to the redshift-connector, Amazon Redshift JDBC driver, spark-avro JAR, and minimal-json JAR files:
    --jars /usr/share/aws/redshift/jdbc/RedshiftJDBC.jar,/usr/share/aws/redshift/spark-redshift/lib/spark-redshift.jar,/usr/share/aws/redshift/spark-redshift/lib/spark-avro.jar,/usr/share/aws/redshift/spark-redshift/lib/minimal-json.jar

  8. Choose Submit job.
  9. Wait for the job to complete and the run status to show as Success.
  10. Navigate to the Amazon Redshift query editor to view if the table was created successfully.
  11. Check the pushdown queries run for Amazon Redshift query group emr-serverless-redshift. You can run the following SQL statement against the database dev:
    SELECT query_text FROM SYS_QUERY_HISTORY WHERE query_label = 'emr-serverless-redshift' ORDER BY start_time DESC LIMIT 1

You can see that the pushdown query and return results are stored in Parquet file format on Amazon S3.

Use Amazon Redshift integration with Apache Spark on AWS Glue

Starting with AWS Glue version 4.0 and above, the Apache Spark jobs connecting to Amazon Redshift can use the Amazon Redshift integration for Apache Spark and Amazon Redshift JDBC driver. Existing AWS Glue jobs that already use Amazon Redshift as source or target can be upgraded to AWS Glue 4.0 to take advantage of this new connector. The CloudFormation template provided with this post creates the following AWS Glue resources:

  • AWS Glue connection for Amazon Redshift – The connection to establish connection from AWS Glue to Amazon Redshift using the Amazon Redshift integration for Apache Spark
  • IAM role attached to the AWS Glue job – The IAM role to manage permissions to run the AWS Glue job
  • AWS Glue job – The script for the AWS Glue job performing transformations and aggregations using the Amazon Redshift integration for Apache Spark

The following example uses the AWS Glue connection attached to the AWS Glue job with PySpark and includes the following steps:

  1. On the AWS Glue console, choose Connections in the navigation pane.
  2. Under Connections, choose the AWS Glue connection for Amazon Redshift created by the CloudFormation template.
  3. Verify the connection details.

You can now reuse this connection within a job or across multiple jobs.

  1. On the Connectors page, choose the AWS Glue job created by the CloudFormation stack under Your jobs, or access the AWS Glue job by using the URL provided for the key GlueJob in the CloudFormation stack output.
  2. Access and verify the script for the AWS Glue job.
  3. On the Job details tab, make sure that Glue version is set to Glue 4.0.

This ensures that the job uses the latest redshift-spark connector.

  1. Expand Advanced properties and in the Connections section, verify that the connection created by the CloudFormation stack is attached.
  2. Verify the job parameters added for the AWS Glue job. These values are also available in the output for the CloudFormation stack.
  3. Choose Save and then Run.

You can view the status for the job run on the Run tab.

  1. After the job run completes successfully, you can verify the output of the table test-glue created by the AWS Glue job.
  2. We check the pushdown queries run for Amazon Redshift query group glue-redshift. You can run the following SQL statement against the database dev:
    SELECT query_text FROM SYS_QUERY_HISTORY WHERE query_label = 'glue-redshift' ORDER BY start_time DESC LIMIT 1

Best practices

Keep in mind the following best practices:

  • Consider using the Amazon Redshift integration for Apache Spark from Amazon EMR instead of using the redshift-spark connector (community version) for your new Apache Spark jobs.
  • If you have existing Apache Spark jobs using the redshift-spark connector (community version), consider upgrading them to use the Amazon Redshift integration for Apache Spark.
  • The Amazon Redshift integration for Apache Spark automatically applies predicate and query pushdown to optimize for performance. We recommend using supported functions (autopushdown) in your query. The Amazon Redshift integration for Apache Spark turns the function into a SQL query and runs the query in Amazon Redshift. This optimization results in only the required data being retrieved, so Apache Spark can process less data and have better performance.
    • Consider using aggregate pushdown functions like avg, count, max, min, and sum to retrieve filtered data for data processing.
    • Consider using Boolean pushdown operators like in, isnull, isnotnull, contains, endswith, and startswith to retrieve filtered data for data processing.
    • Consider using logical pushdown operators like and, or, and not (or !) to retrieve filtered data for data processing.
  • It’s recommended to pass an IAM role using the parameter aws_iam_role for the Amazon Redshift authentication from your Apache Spark application on Amazon EMR or AWS Glue. The IAM role should have necessary permissions to retrieve temporary IAM credentials to authenticate to Amazon Redshift as shown in this blog’s “Examples for Amazon Redshift integration for Apache Spark in an Apache Spark application” section.
  • With this feature, you don’t have to maintain your Amazon Redshift user name and password in Secrets Manager or in the Amazon Redshift database.
  • Amazon Redshift uses the UNLOAD SQL statement to store this temporary data on Amazon S3. The Apache Spark application retrieves the results from the temporary directory (stored in Parquet file format). This temporary directory on Amazon S3 is not cleaned up automatically, and therefore could add additional cost. We recommend using Amazon S3 lifecycle policies to define the retention rules for the S3 bucket (see the sketch after this list).
  • It’s recommended to turn on Amazon Redshift audit logging to log the information about connections and user activities in your database.
  • It’s recommended to turn on Amazon Redshift at-rest encryption to encrypt your data as Amazon Redshift writes it in its data centers and decrypt it for you when you access it.
  • It’s recommended to upgrade to AWS Glue v4.0 and above to use the Amazon Redshift integration for Apache Spark, which is available out of the box. Upgrading to this version of AWS Glue will automatically make use of this feature.
  • It’s recommended to upgrade to Amazon EMR v6.9.0 and above to use the Amazon Redshift integration for Apache Spark. You don’t have to manage any drivers or JAR files explicitly.
  • Consider using Amazon EMR Studio notebooks to interact with your Amazon Redshift data in your Apache Spark application.
  • Consider using AWS Glue Studio to create Apache Spark jobs using a visual interface. You can also switch to writing Apache Spark code in either Scala or PySpark within AWS Glue Studio.
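
As mentioned in the best practice about the temporary directory, a lifecycle rule can expire the connector's staging objects. The following Boto3 sketch is one way to configure it; the bucket name, prefix, and retention period are assumptions and should match the tempdir your jobs use:

import boto3

s3 = boto3.client("s3")

# Expire the connector's temporary objects after 7 days
# (bucket name and prefix are placeholders)
s3.put_bucket_lifecycle_configuration(
    Bucket="my-spark-redshift-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-redshift-spark-tempdir",
                "Filter": {"Prefix": "redshift-temp-dir/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            }
        ]
    },
)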

Clean up

Complete the following steps to clean up the resources created as part of the CloudFormation template so that you’re not billed for resources you no longer need:

  1. Stop the Amazon EMR Serverless application:
    • Access Amazon EMR Studio by choosing or copying the link provided in the CloudFormation stack outputs for the key EMRStudioURL.
    • Choose Applications under Serverless in the navigation pane.

You will find an EMR application created by the CloudFormation stack with the name emr-spark-redshift.

    • If the application status shows as Stopped, you can move to the next steps. However, if the application status is Started, choose the application name, then choose Stop application and Stop application again to confirm.
  1. Delete the Amazon EMR Studio Workspace:
    • Access Amazon EMR Studio by choosing or copying the link provided in the CloudFormation stack outputs for the key EMRStudioURL.
    • Choose Workspaces in the navigation pane.
    • Select the Workspace that you created and choose Delete, then choose Delete again to confirm.
  2. Delete the CloudFormation stack:
    • On the AWS CloudFormation console, navigate to the stack you created earlier.
    • Choose the stack name and then choose Delete to remove the stack and delete the resources created as a part of this post.
    • On the confirmation screen, choose Delete stack.

Conclusion

In this post, we explained how you can use the Amazon Redshift integration for Apache Spark to build and deploy applications with Amazon EMR on Amazon EC2, Amazon EMR Serverless, and AWS Glue to automatically apply predicate and query pushdown to optimize the query performance for data in Amazon Redshift. It’s highly recommended to use Amazon Redshift integration for Apache Spark for seamless and secure connection to Amazon Redshift from your Amazon EMR or AWS Glue.

Here is what some of our customers have to say about the Amazon Redshift integration for Apache Spark:

“We empower our engineers to build their data pipelines and applications with Apache Spark using Python and Scala. We wanted a tailored solution that simplified operations and delivered faster and more efficiently for our clients, and that’s what we get with the new Amazon Redshift integration for Apache Spark.”

—Huron Consulting

“GE Aerospace uses AWS analytics and Amazon Redshift to enable critical business insights that drive important business decisions. With the support for auto-copy from Amazon S3, we can build simpler data pipelines to move data from Amazon S3 to Amazon Redshift. This accelerates our data product teams’ ability to access data and deliver insights to end-users. We spend more time adding value through data and less time on integrations.”

—GE Aerospace

“Our focus is on providing self-service access to data for all of our users at Goldman Sachs. Through Legend, our open-source data management and governance platform, we enable users to develop data-centric applications and derive data-driven insights as we collaborate across the financial services industry. With the Amazon Redshift integration for Apache Spark, our data platform team will be able to access Amazon Redshift data with minimal manual steps, allowing for zero-code ETL that will increase our ability to make it easier for engineers to focus on perfecting their workflow as they collect complete and timely information. We expect to see a performance improvement of applications and improved security as our users can now easily access the latest data in Amazon Redshift.”

—Goldman Sachs


About the Authors

Gagan Brahmi is a Senior Specialist Solutions Architect focused on big data analytics and AI/ML platform at Amazon Web Services. Gagan has over 18 years of experience in information technology. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. In his spare time, he spends time with his family and explores new places.

Vivek Gautam is a Data Architect with specialization in data lakes at AWS Professional Services. He works with enterprise customers building data products, analytics platforms, and solutions on AWS. When not building and designing data lakes, Vivek is a food enthusiast who also likes to explore new travel destinations and go on hikes.

Naresh Gautam is a Data Analytics and AI/ML leader at AWS with 20 years of experience, who enjoys helping customers architect highly available, high-performance, and cost-effective data analytics and AI/ML solutions to empower customers with data-driven decision-making. In his free time, he enjoys meditation and cooking.

Beaux Sharifi is a Software Development Engineer within the Amazon Redshift drivers’ team where he leads the development of the Amazon Redshift Integration with Apache Spark connector. He has over 20 years of experience building data-driven platforms across multiple industries. In his spare time, he enjoys spending time with his family and surfing.

Exploring new ETL and ELT capabilities for Amazon Redshift from the AWS Glue Studio visual editor

Post Syndicated from Aniket Jiddigoudar original https://aws.amazon.com/blogs/big-data/exploring-new-etl-and-elt-capabilities-for-amazon-redshift-from-the-aws-glue-studio-visual-editor/

In a modern data architecture, unified analytics enable you to access the data you need, whether it’s stored in a data lake or a data warehouse. In particular, we have observed an increasing number of customers who combine and integrate their data into an Amazon Redshift data warehouse to analyze huge data at scale and run complex queries to achieve their business goals.

One of the most common use cases for data preparation on Amazon Redshift is to ingest and transform data from different data stores into an Amazon Redshift data warehouse. This is commonly achieved via AWS Glue, which is a serverless, scalable data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources. AWS Glue provides an extensible architecture that enables users with different data processing use cases, and works well with Amazon Redshift. At AWS re:Invent 2022, we announced support for the new Amazon Redshift integration with Apache Spark available in AWS Glue 4.0, which provides enhanced ETL (extract, transform, and load) and ELT capabilities with improved performance.

Today, we are pleased to announce new and enhanced visual job authoring capabilities for Amazon Redshift ETL and ELT workflows in the AWS Glue Studio visual editor. The new authoring experience gives you the ability to:

  • Get started faster with Amazon Redshift by directly browsing Amazon Redshift schemas and tables from the AWS Glue Studio visual interface
  • Author jobs flexibly through native Amazon Redshift SQL support as a source, or through custom preactions and postactions
  • Simplify common data loading operations into Amazon Redshift through new support for INSERT, TRUNCATE, DROP, and MERGE commands

With these enhancements, you can use existing transforms and connectors in AWS Glue Studio to quickly create data pipelines for Amazon Redshift. No-code users can complete end-to-end tasks using only the visual interface, SQL users can reuse their existing Amazon Redshift SQL within AWS Glue, and all users can tune their logic with custom actions on the visual editor.

In this post, we explore the new streamlined user interface and dive deeper into how to use these capabilities. To demonstrate these new capabilities, we showcase the following:

  • Passing a custom SQL JOIN statement to Amazon Redshift
  • Using the results to apply an AWS Glue Studio visual transform
  • Performing an APPEND on the results to load them into a destination table

Set up resources with AWS CloudFormation

To demonstrate the AWS Glue Studio visual editor experience with Amazon Redshift, we provide an AWS CloudFormation template for you to set up baseline resources quickly. The template creates the following resources for you:

  • An Amazon VPC, subnets, route tables, an internet gateway, and NAT gateways
  • An Amazon Redshift cluster
  • An AWS Identity and Access Management (IAM) role associated with the Amazon Redshift cluster
  • An IAM role for running the AWS Glue job
  • An Amazon Simple Storage Service (Amazon S3) bucket to be used as a temporary location for Amazon Redshift ETL
  • An AWS Secrets Manager secret that stores the user name and password for the Amazon Redshift cluster

Note that at the time of writing this post, Amazon Redshift MERGE is in preview, and the cluster created is a preview cluster.

To launch the CloudFormation stack, complete the following steps:

  1. On the AWS CloudFormation console, choose Create stack and then choose With new resources (standard).
  2. For Template source, select Upload a template file, and upload the provided template.
  3. Choose Next.
  4. Enter a name for the CloudFormation stack, then choose Next.
  5. Acknowledge that this stack might create IAM resources for you, then choose Submit.
  6. After the CloudFormation stack is successfully created, follow the steps at https://docs.aws.amazon.com/redshift/latest/gsg/rs-gsg-create-sample-db.html to load the sample tickit data into the created Amazon Redshift cluster.

Exploring Amazon Redshift reads

In this section, we go over the new read functionality in the AWS Glue Studio visual editor and demonstrate how we can run a custom SQL statement via the new UI.

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Select Visual with a blank canvas because we're authoring a job from scratch, then choose Create.
  3. In the blank canvas, choose the plus sign to add an Amazon Redshift node of type Source.

When you close the node selector, you should see an Amazon Redshift source node on the canvas along with the data source properties.

You can choose from two methods of accessing your Amazon Redshift data:

  • Direct data connection – This new method allows you to establish a connection to your Amazon Redshift sources without the need to catalog them
  • Glue Data Catalog tables – This method requires you to have already crawled or generated your Amazon Redshift tables in the AWS Glue Data Catalog

For this post, we use the Direct data connection option.

  1. For Redshift access type, select Direct data connection.
  2. For Redshift connection, choose your AWS Glue Connection redshift-demo-blog-connection created in the CloudFormation stack.

Specifying the connection automatically configures all the network related details along with the name of the database you wish to connect to.

The UI then presents a choice on how you’d like to access the data from within your selected Amazon Redshift cluster’s database:

  • Choose a single table – This option lets you select a single schema, and a single table from your database. You can browse through all of your available schemas and tables right from the AWS Glue Studio visual editor itself, which makes choosing your source table much easier.
  • Enter a custom query – If you're looking to perform your ETL on a subset of data from your Amazon Redshift tables, you can author an Amazon Redshift query from the AWS Glue Studio UI. This query will be passed to the connected Amazon Redshift cluster, and the returned query result will be available in downstream transformations in AWS Glue Studio.

For the purposes of this post, we write our own custom query that joins data from the preloaded event table and venue table.

  1. Select Enter a custom query and enter the following query into the query editor:
select venue.venueid from event, venue where event.venueid = venue.venueid and event.starttime between '2008-01-01 14:00:00' and '2008-01-01 15:00:00' and venue.venueseats = 0

The intent of this query is to gather the venueid of locations that have had an event between 2008-01-01 14:00:00 and 2008-01-01 15:00:00 and have had venueseats = 0. If we run a similar query from the Amazon Redshift Query Editor, we can see that there are actually five such venues within that time frame. We then merge this data back into Amazon Redshift so that these rows are removed from the destination table.

  1. Choose Infer schema, which allows the AWS Glue Studio visual editor to understand the schema from the columns returned by your query.

You can see the schema on the Output schema tab.

  1. Under Performance and security, for S3 staging directory, choose the S3 temporary directory location created by the CloudFormation stack (RedshiftS3TempPath).
  2. For IAM role, choose the IAM role specified by RedshiftIamRoleARN in the CloudFormation stack.

Now we’re going to add a transform to drop duplicate rows from our join result. This will ensure that the MERGE operation in the following steps won’t have conflicting keys when performing the operation.

  1. Add a Drop Duplicates node of type Transform to the canvas, then choose it to view the node properties.
  2. On the Transform tab, for Drop duplicates, select Match specific keys.
  3. For Keys to match rows, choose venueid.
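Behind the scenes, the Drop Duplicates visual transform is roughly equivalent to the following PySpark logic. This is a sketch for illustration only (the upstream node name Redshift_source_node and the job's glueContext are assumptions), not the exact script that AWS Glue Studio generates:

from awsglue.dynamicframe import DynamicFrame

# Drop duplicate rows on the venueid key, matching the "Match specific keys" option
deduped_df = Redshift_source_node.toDF().dropDuplicates(["venueid"])

# Convert back to a DynamicFrame so downstream nodes can consume it
DropDuplicates_node = DynamicFrame.fromDF(deduped_df, glueContext, "DropDuplicates_node")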

In this section, we defined the steps to read the output of a custom JOIN query. We then dropped the duplicate records from the returned value. In the next section, we explore the write path on the same job.

Exploring Amazon Redshift writes

Now we go over the enhancements for writing to Amazon Redshift as a destination. This section goes over all the simplified options for writing to Amazon Redshift, but highlights the new Amazon Redshift MERGE capabilities for the purposes of this post.

The MERGE operator offers great flexibility for conditionally merging rows from a source into a destination table. MERGE is powerful because it simplifies operations that traditionally were only achievable by using multiple insert, update, or delete statements separately. Within AWS Glue Studio, particularly with the custom MERGE option, you can define a more complex matching condition to handle finding the records to update.

  1. From the canvas page of the job used in the previous section, select Amazon Redshift to add an Amazon Redshift node of type Target.

When you close the selector, you should see your Amazon Redshift target node added on the AWS Glue Studio canvas, along with its configuration options.

  1. For Redshift access type, select Direct data connection.

Similar to the Amazon Redshift source node, the Direct data connection method allows you to write directly to your Amazon Redshift tables without needing to have them cataloged within the AWS Glue Data Catalog.

  1. For Redshift connection, choose your AWS Glue connection redshift-demo-blog-connection created in the CloudFormation stack.
  2. For Schema, choose public.
  3. For Table, choose the venue table as the destination Amazon Redshift table where we will store the merged data.
  4. Choose MERGE data into target table.

This selection provides the user with two options:

  • Choose keys and simple actions – This is a user-friendly version of the MERGE operation. You simply specify the matching keys, and choose what happens to the rows that match the key (update them or delete them) or don’t have any matches (insert them).
  • Enter custom MERGE statement – This option provides the most flexibility. You can enter your own custom logic for MERGE.

For this post, we use the simple actions method for performing a MERGE operation.

  1. For Handling of data and target table, select MERGE data into target table, and then select Choose keys and simple actions.
  2. For Matching Keys, select venueid.

This field becomes our MERGE condition for matching keys.

  1. For When matched, select Delete record in the table.
  2. For When not matched, select Insert source data as a new row into the table.

With these selections, we've configured the AWS Glue job to run a MERGE statement on Amazon Redshift when inserting our data. Moreover, for this MERGE operation, we use venueid as the key (you can select multiple keys). If there is a key match with the destination table's record, we delete that record. Otherwise, we insert the record into the destination table.
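Conceptually, these selections translate into an Amazon Redshift MERGE statement along the lines of the following sketch, here submitted through the Amazon Redshift Data API with boto3. The staging table name, cluster identifier, database, and secret are assumptions for illustration; the SQL that AWS Glue actually generates may differ:

import boto3

# MERGE semantics chosen in the visual editor: match on venueid,
# delete matched rows, insert unmatched rows.
merge_sql = """
MERGE INTO public.venue
USING public.venue_stage
ON venue.venueid = venue_stage.venueid
WHEN MATCHED THEN DELETE
WHEN NOT MATCHED THEN INSERT (venueid) VALUES (venue_stage.venueid);
"""

redshift_data = boto3.client("redshift-data")
response = redshift_data.execute_statement(
    ClusterIdentifier="redshift-demo-blog-cluster",  # assumed cluster identifier
    Database="dev",                                  # assumed database name
    SecretArn="arn:aws:secretsmanager:us-east-1:111122223333:secret:redshift-demo",  # assumed secret
    Sql=merge_sql,
)
print(response["Id"])  # statement ID you can use to poll for completion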

  1. Navigate to the Job details tab.
  2. For Name, enter a name for the job.
  3. For the IAM Role drop down, select the RedshiftIamRole role that was created via the CloudFormation template.
  4. Choose Save.
  5. Choose Run and wait for the job to finish.

You can track its progress on the Runs tab.

  1. After the run reaches a successful state, navigate back to the Amazon Redshift Query Editor.
  2. Run the same query again to confirm that those rows have been deleted in accordance with our MERGE specifications.

In this section, we configured an Amazon Redshift target node to write a MERGE statement to conditionally update records in our destination Amazon Redshift table. We then saved and ran the AWS Glue job, and saw the effect of the MERGE statement on our destination Amazon Redshift table.

Other available write options

In addition to MERGE, the AWS Glue Studio visual editor’s Amazon Redshift destination node also supports a number of other common operations:

  • APPEND – Appending to your target table performs an insert into the selected table without updating any of the existing records (if there are duplicates, both records will be retained). In cases where you want to update existing rows in addition to adding new rows (often referred to as an UPSERT operation), you can select the Also update existing records in target table option. Note that both APPEND only and UPSERT (APPEND with UPDATE) are simpler subsets of the MERGE functionality discussed earlier.
  • TRUNCATE – The TRUNCATE option clears all the data in the existing table but retains all the existing table schema, followed by an APPEND of all new data to the empty table. This option is often used when the full dataset needs to be refreshed and downstream services or tools depend on the table schema being consistent. For example, every night an Amazon Redshift table needs to be fully updated with the latest customer information that will be consumed by an Amazon QuickSight dashboard. In this case, the ETL developer would choose TRUNCATE to ensure the data is fully refreshed but the table schema is guaranteed not to change.
  • DROP – This option is used when the full dataset needs to be refreshed and the downstream services or tools that depend on the schema or systems can handle possible schema changes without breaking.

How write operations are being handled on the backend

The Amazon Redshift connector supports two parameters called preactions and postactions. These parameters allow you to run SQL statements that will be passed on to the Amazon Redshift data warehouse before and after the actual write operation is carried out by Spark.

On the Script tab on the AWS Glue Studio page, we can see what SQL statements are being run.
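For example, a TRUNCATE load shows up in the generated script as a write call whose preactions clear the target table before Spark loads the new data. The following is a rough sketch of that pattern (the frame name, database, and postaction are assumptions; the exact code AWS Glue Studio generates may differ):

# Write the upstream DynamicFrame to Amazon Redshift, running SQL before and after the load
RedshiftWrite_node = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=DropDuplicates_node,                       # upstream DynamicFrame (assumed name)
    catalog_connection="redshift-demo-blog-connection",
    connection_options={
        "database": "dev",                           # assumed database name
        "dbtable": "public.venue",
        "preactions": "TRUNCATE TABLE public.venue;",   # runs before the Spark write
        "postactions": "ANALYZE public.venue;",         # illustrative postaction
    },
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="RedshiftWrite_node",
)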

Use a custom implementation for writing data into Amazon Redshift

In the event that the provided presets require more customization, or your use case requires more advanced implementations for writing to Amazon Redshift, AWS Glue Studio also allows you to freely select which preactions and postactions can be run when writing to Amazon Redshift.

To show an example, we create an Amazon Redshift datashare as a preaction, then perform the cleaning up of the same datashare as a postaction via AWS Glue Studio.

NOTE: This section is not run as part of the walkthrough in this post and is provided only as an example.

  1. Choose the Amazon Redshift data target node.
  2. On the Data target properties tab, expand the Custom Redshift parameters section.
  3. For the parameters, add the following:
    1. Parameter: preactions  with Value BEGIN; CREATE DATASHARE ds1; END
    2. Parameter: postactions with Value BEGIN; DROP DATASHARE ds1; END

As you can see, we can specify multiple Amazon Redshift statements as a part of both the preactions and postactions parameters. Remember that these statements will override any existing preactions or postactions with your specified actions (as you can see in the generated code on the Script tab).

Cleanup

To avoid additional costs, make sure to delete any unnecessary resources and files:

  • Delete the contents of the S3 temporary bucket
  • If you deployed the sample CloudFormation stack, delete the CloudFormation stack via the AWS CloudFormation console. Make sure to empty the S3 bucket before you delete the bucket.

Conclusion

In this post, we went over the new AWS Glue Studio visual options for performing reads and writes from Amazon Redshift. We also saw the simplicity with which you can browse your Amazon Redshift tables right from the AWS Glue Studio visual editor UI, and how to run your own custom SQL statements against your Amazon Redshift sources. We then explored how to perform simple ETL loading tasks against Amazon Redshift with just a few clicks, and showcased the new Amazon Redshift MERGE statement.

To dive deeper into the new Amazon Redshift integrations for the AWS Glue Studio visual editor, check out Connecting to Redshift in AWS Glue Studio.


About the Authors

Aniket Jiddigoudar is a Big Data Architect on the AWS Glue team. He works with customers to help improve their big data workloads. In his spare time, he enjoys trying out new food, playing video games, and kickboxing.

Sean Ma is a Principal Product Manager on the AWS Glue team. He has an 18+ year track record of innovating and delivering enterprise products that unlock the power of data for users. Outside of work, Sean enjoys scuba diving and college football.

Get maximum value out of your cloud data warehouse with Amazon Redshift

Post Syndicated from Sana Ahmed original https://aws.amazon.com/blogs/big-data/get-maximum-value-out-of-your-cloud-data-warehouse-with-amazon-redshift/

Every day, customers are challenged with how to manage their growing data volumes and operational costs to unlock the value of data for timely insights and innovation, while maintaining consistent performance. Data creation, consumption, and storage are predicted to grow to 175 zettabytes by 2025, as forecast by the 2022 IDC Global DataSphere report.

As data workloads grow, costs to scale and manage data usage with the right governance typically increase as well. So how do organizational leaders drive their business forward with high performance, controlled costs, and high security? With the right analytics approach, this is possible.

In this post, we look at three key challenges that customers face with growing data and how a modern data warehouse and analytics system like Amazon Redshift can meet these challenges across industries and segments.

Building an optimal data system

As data grows at an extraordinary rate, data proliferation across your data stores, data warehouse, and data lakes can become a challenge. Different departments within an organization can place data in a data lake or within their data warehouse depending on the type of data and usage patterns of that department. Teams may place their unstructured data like social media feeds within their Amazon Simple Storage Service (Amazon S3) data lake and historical structured data within their Amazon Redshift data warehouse. Teams need access to both the data lake and the data warehouse to work seamlessly for best insights, requiring an optimal data infrastructure that can scale almost infinitely to accommodate a growing number of concurrent data users without impacting performance—all while keeping costs under control.

A quintessential example of a company managing analytics on billions of data points across the data lake and the warehouse in a mission-critical business environment is Nasdaq, an American stock exchange. Within 2 years of migrating to Amazon Redshift, Nasdaq was managing 30–70 billion records per day, worth over 4 terabytes.

With Amazon Redshift, Nasdaq was able to query their warehouse and use Amazon Redshift Spectrum, a capability to query data quickly in place in their S3 data lakes without loading it. Nasdaq minimized time to insights with the ability to query 15 terabytes of data on Amazon S3 immediately without any extra data loading after writing data to Amazon S3. This performance innovation allows Nasdaq to have a multi-use data lake between teams.

Robert Hunt, Vice President of Software Engineering for Nasdaq, shared, “We have to both load and consume the 30 billion records in a time period between market close and the following morning. Data loading delayed the delivery of our reports. We needed to be able to write or load data into our data storage solution very quickly without interfering with the reading and querying of the data at the same time.”

Nasdaq’s massive data growth meant they needed to evolve their data architecture to keep up. They built their foundation of a new data lake on Amazon S3 so they could deliver analytics using Amazon Redshift as a compute layer. Nasdaq’s peak volume of daily data ingestion reached 113 billion records, and they completed data loading for reporting 5 hours faster while running 32% faster queries.

Enabling newer personas with data warehousing and analytics

Another challenge is enabling newer data users and personas with powerful analytics to meet business goals and perform critical decision-making. Where traditionally it was the data engineer and the database administrator who set up and managed the warehouse, today line of business data analysts, data scientists, and developers are all using the data warehouse to get to near-real-time business decision-making.
These personas who don’t have specialized data management or data engineering skills don’t want to be concerned with managing the capacity of their analytics systems to handle unpredictable or spiky data workloads or wait for IT to optimize for cost and capacity. Customers want to get started with analytics on large amounts of data instantly and scale analytics quickly and cost-effectively without infrastructure management.

Take the case of mobile gaming company Playrix. They were able to use Amazon Redshift Serverless to serve their key stakeholders with dashboards with financial data for quick decision-making.

Igor Ivanov, Technical Director of Playrix, stated, “Amazon Redshift Serverless is great for achieving the on-demand high performance that we need for massive queries.”

Playrix had a two-fold business goal: marketing to its end users (game players) with near-real-time data, while also analyzing their historical data from the past 4–5 years. In seeking a solution, Playrix wanted to avoid disrupting other technical processes while also increasing cost savings. The company migrated to Redshift Serverless and scaled up to handle more complicated analytics on 600 TB from the past 5 years, all without storing two copies of the data or disrupting other analytics jobs. With Redshift Serverless, Playrix achieved a more flexible architecture and saved an overall 20% in costs of its marketing stack, decreasing its cost of customer acquisition.

“With no overhead and infrastructure management,” Ivanov shared, “we now have more time for experimenting, developing solutions, and planning new research.”

Breaking down data silos

Organizations need to easily access and analyze diverse types of structured and unstructured data, including log files, clickstreams, voice, and video. However, these wide-ranging data types are typically stored in silos across multiple data stores. To unlock the true potential of the data, organizations must break down these silos to unify and normalize all types of data and ensure that the right people have access to the right data.

Data unification can get expensive fast, with time and cost spent on building complex, custom extract, transform, load (ETL) pipelines that move or copy data from system to system. If not done right, you can end up with data latency issues, inaccuracies, and potential security and data governance risks. Instead, teams are looking for ways to share transactionally consistent, live, first-party and third-party data with each other or their end customers, without data movement or data copying.

Stripe, a payment processing platform for businesses, is an Amazon Redshift customer and a partner with thousands of end customers who require access to Stripe data for their applications. Stripe built the Stripe Data Pipeline, a solution for Stripe customers to access Stripe datasets within their Amazon Redshift data warehouses, without having to build, maintain, or scale custom ETL jobs. The Stripe Data Pipeline is powered by the data sharing capability of Amazon Redshift. Customers get a single source of truth, with low-latency data access, to speed up financial close and get better insights, analyzing best-performing payment methods, fraud by location, and more. Cutting down data engineering time and effort to access unified data creates new business opportunities from comprehensive insights and saves costs.

A modern data architecture with Amazon Redshift

These stories about harnessing maximum value from siloed data across the organization and applying powerful analytics for business insights in a cost-efficient way are possible because of AWS’s approach to a modern data architecture for their customers. Within this architecture, AWS’s data warehousing solution Amazon Redshift is a fully managed petabyte scale system, deeply integrated with AWS database, analytics, and machine learning (ML) services. Tens of thousands of customers use Amazon Redshift every day to run data warehousing and analytics in the cloud and process exabytes of data for business insights. Customers looking for a highly performing, cost-optimized cloud data warehouse solution choose Amazon Redshift for the following reasons:

  • Its leadership in price-performance
  • The ability to break through data silos for meaningful insights
  • Easy analytics capabilities that cut down data engineering and administrative requirements
  • Security and reliability features that are offered out of the box, at no additional cost

Price-performance, a common benchmark metric for cloud data warehouses, is simply defined as the cost to perform a particular workload. Knowing how much your data warehouse is going to cost and how performance changes as your user base and data processing increase is crucial for planning, budgeting, and decision-making around choosing the best data warehouse.

Amazon Redshift is able to attain the best price-performance for customers (up to five times better than other cloud data warehouses) by optimizing the code for AWS hardware, high-performance and power-efficient compute hardware, new compression and caching algorithms, and autonomics (ML-based optimizations) within the warehouse to abstract the administrative activities away from the user, saving time and improving performance. Flexible pricing options such as pay-as-you-go with Redshift Serverless, separation of storage and compute scaling, and 1–3-year compute reservations with heavy discounts keep prices low.

The native integrations in Amazon Redshift with databases, data lakes, streaming data services, and ML services, employing zero-ETL approaches help you access data in place without data movement and easily ingest data into the warehouse without building complex pipelines. This keeps data engineering costs low and expands analytics for more users.

For example, the integration in Amazon Redshift with Amazon SageMaker allows data analysts to stay within the data warehouse and create, train, and build ML models in SQL with no need for ETL jobs or learning new languages for ML (see Jobcase Scales ML Workflows to Support Billions of Daily Predictions Using Amazon Redshift ML for an example). Every week, over 80 billion predictions happen in the warehouse with Amazon Redshift ML.

Finally, customers don’t have to pay more to secure their critical data assets. Security features offer comprehensive identity management with data encryption, granular access controls at row and column level, and data masking abilities to protect sensitive data and authorizations for the right users or groups. These features are available out of the box, within the standard pricing model.

Conclusion

Overall, customers who choose Amazon Redshift innovate in a new reality where the data warehouse scales up and down automatically as workloads change, and maximizes the value of data for all cornerstones of their business.

Market leaders like Nasdaq are able to ingest billions of data points daily for trading and selling at high volume and velocity, all in time for proper billing and trading the following business day. For customers like Playrix, choosing Redshift Serverless means marketing to customers with comprehensive analytics in near-real time without getting bogged down by maintenance and overhead. For Stripe, it also means taking the complexity and TCO out of ETL, removing silos and unifying data.

Although data will continue to grow at unprecedented amounts, your bottom line doesn’t need to suffer. While organizational leaders face the pressures of solving for cost optimization in all types of economic environments, Amazon Redshift gives market leaders a space to innovate without compromising their data value, performance, and budgets of their cloud data warehouse.

Learn more about maximizing the value of your data with a modern data warehouse like Amazon Redshift. For more information about the price-performance leadership of Amazon Redshift and to review benchmarks against other vendors, see Amazon Redshift continues its price-performance leadership. Additionally, you can optimize costs using a variety of performance and cost levers, including Amazon Redshift’s flexible pricing models, which cover pay-as-you-go pricing for variable workloads, free trials, and reservations for steady state workloads.


About the authors

Sana Ahmed is a Sr. Product Marketing Manager for Amazon Redshift. She is passionate about people, products, and problem-solving with product marketing. As a product marketer, she has taken 50+ products to market and worked at various companies, including Sprinklr, PayPal, and Facebook. Her hobbies include tennis, museum-hopping, and fun conversations with friends and family.

Sunaina AbdulSalah leads product marketing for Amazon Redshift. She focuses on educating customers about the impact of data warehousing and analytics and sharing AWS customer stories. She has a deep background in marketing and GTM functions in the B2B technology and cloud computing domains. Outside of work, she spends time with her family and friends and enjoys traveling.

Implement column-level encryption to protect sensitive data in Amazon Redshift with AWS Glue and AWS Lambda user-defined functions

Post Syndicated from Aaron Chong original https://aws.amazon.com/blogs/big-data/implement-column-level-encryption-to-protect-sensitive-data-in-amazon-redshift-with-aws-glue-and-aws-lambda-user-defined-functions/

Amazon Redshift is a massively parallel processing (MPP), fully managed petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using existing business intelligence tools.

When businesses are modernizing their data warehousing solutions to Amazon Redshift, implementing additional data protection mechanisms for sensitive data, such as personally identifiable information (PII) or protected health information (PHI), is a common requirement, especially for those in highly regulated industries with strict data security and privacy mandates. Amazon Redshift provides role-based access control, row-level security, column-level security, and dynamic data masking, along with other database security features to enable organizations to enforce fine-grained data security.

Security-sensitive applications often require column-level (or field-level) encryption to enforce fine-grained protection of sensitive data on top of the default server-side encryption (namely data encryption at rest). In other words, sensitive data should always be encrypted on disk and remain encrypted in memory, until users with proper permissions request to decrypt the data. Column-level encryption provides an additional layer of security to protect your sensitive data throughout system processing so that only certain users or applications can access it. This encryption ensures that only authorized principals that need the data, and have the required credentials to decrypt it, are able to do so.

In this post, we demonstrate how you can implement your own column-level encryption mechanism in Amazon Redshift using AWS Glue to encrypt sensitive data before loading data into Amazon Redshift, and using AWS Lambda as a user-defined function (UDF) in Amazon Redshift to decrypt the data using standard SQL statements. Lambda UDFs can be written in any of the programming languages supported by Lambda, such as Java, Go, PowerShell, Node.js, C#, Python, Ruby, or a custom runtime. You can use Lambda UDFs in any SQL statement such as SELECT, UPDATE, INSERT, or DELETE, and in any clause of the SQL statements where scalar functions are allowed.

Solution overview

The following diagram describes the solution architecture.

Architecture Diagram

To illustrate how to set up this architecture, we walk you through the following steps:

  1. We upload a sample data file containing synthetic PII data to an Amazon Simple Storage Service (Amazon S3) bucket.
  2. A sample 256-bit data encryption key is generated and securely stored using AWS Secrets Manager.
  3. An AWS Glue job reads the data file from the S3 bucket, retrieves the data encryption key from Secrets Manager, performs data encryption for the PII columns, and loads the processed dataset into an Amazon Redshift table.
  4. We create a Lambda function to reference the same data encryption key from Secrets Manager, and implement data decryption logic for the received payload data.
  5. The Lambda function is registered as a Lambda UDF with a proper AWS Identity and Access Management (IAM) role that the Amazon Redshift cluster is authorized to assume.
  6. We can validate the data decryption functionality by issuing sample queries using Amazon Redshift Query Editor v2.0. You may optionally choose to test it with your own SQL client or business intelligence tools.

Prerequisites

To deploy the solution, make sure to complete the following prerequisites:

  • Have an AWS account. For this post, you configure the required AWS resources using AWS CloudFormation in the us-east-2 Region.
  • Have an IAM user with permissions to manage AWS resources including Amazon S3, AWS Glue, Amazon Redshift, Secrets Manager, Lambda, and AWS Cloud9.

Deploy the solution using AWS CloudFormation

Provision the required AWS resources using a CloudFormation template by completing the following steps:

  1. Sign in to your AWS account.
  2. Choose Launch Stack:
    Launch Button
  3. Navigate to an AWS Region (for example, us-east-2).
  4. For Stack name, enter a name for the stack or leave as default (aws-blog-redshift-column-level-encryption).
  5. For RedshiftMasterUsername, enter a user name for the admin user account of the Amazon Redshift cluster or leave as default (master).
  6. For RedshiftMasterUserPassword, enter a strong password for the admin user account of the Amazon Redshift cluster.
  7. Select I acknowledge that AWS CloudFormation might create IAM resources.
  8. Choose Create stack.
    Create CloudFormation stack

The CloudFormation stack creation process takes around 5–10 minutes to complete.

  1. When the stack creation is complete, on the stack Outputs tab, record the values of the following:
    1. AWSCloud9IDE
    2. AmazonS3BucketForDataUpload
    3. IAMRoleForRedshiftLambdaUDF
    4. LambdaFunctionName

CloudFormation stack output

Upload the sample data file to Amazon S3

To test the column-level encryption capability, you can download the sample synthetic data generated by Mockaroo. The sample dataset contains synthetic PII and sensitive fields such as phone number, email address, and credit card number. In this post, we demonstrate how to encrypt the credit card number field, but you can apply the same method to other PII fields according to your own requirements.

Sample synthetic data

An AWS Cloud9 instance is provisioned for you during the CloudFormation stack setup. You may access the instance from the AWS Cloud9 console, or by visiting the URL obtained from the CloudFormation stack output with the key AWSCloud9IDE.

CloudFormation stack output for AWSCloud9IDE

On the AWS Cloud9 terminal, copy the sample dataset to your S3 bucket by running the following command:

S3_BUCKET=$(aws s3 ls| awk '{print $3}'| grep awsblog-pii-data-input-)
aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-2274/pii-sample-dataset.csv s3://$S3_BUCKET/

Upload sample dataset to S3

Generate a secret and secure it using Secrets Manager

We generate a 256-bit secret to be used as the data encryption key. Complete the following steps:

  1. Create a new file in the AWS Cloud9 environment.
    Create new file in Cloud9
  2. Enter the following code snippet. We use the cryptography package to create a secret, and use the AWS SDK for Python (Boto3) to securely store the secret value with Secrets Manager:
    from cryptography.fernet import Fernet
    import boto3
    import base64
    
    # generate a random 256-bit data encryption key (Fernet returns it URL-safe base64-encoded)
    key = Fernet.generate_key()
    client = boto3.client('secretsmanager')
    
    # store the raw 32-byte key as a binary secret in Secrets Manager
    response = client.create_secret(
        Name='data-encryption-key',
        SecretBinary=base64.urlsafe_b64decode(key)
    )
    
    # print the secret ARN for later reference
    print(response['ARN'])

  3. Save the file with the file name generate_secret.py (or any desired name ending with .py).
    Save file in Cloud9
  4. Install the required packages by running the following pip install command in the terminal:
    pip install --user boto3
    pip install --user cryptography

  5. Run the Python script via the following command to generate the secret:
    python generate_secret.py

    Run Python script

Create a target table in Amazon Redshift

A single-node Amazon Redshift cluster is provisioned for you during the CloudFormation stack setup. To create the target table for storing the dataset with encrypted PII columns, complete the following steps:

  1. On the Amazon Redshift console, navigate to the list of provisioned clusters, and choose your cluster.
    Amazon Redshift console
  2. To connect to the cluster, on the Query data drop-down menu, choose Query in query editor v2.
    Connect with Query Editor v2
  3. If this is the first time you’re using the Amazon Redshift Query Editor V2, accept the default setting by choosing Configure account.
    Configure account
  4. To connect to the cluster, choose the cluster name.
    Connect to Amazon Redshift cluster
  5. For Database, enter demodb.
  6. For User name, enter master.
  7. For Password, enter your password.

You may need to change the user name and password according to your CloudFormation settings.

  1. Choose Create connection.
    Create Amazon Redshift connection
  2. In the query editor, run the following DDL command to create a table named pii_table:
    CREATE TABLE pii_table(
      id BIGINT,
      full_name VARCHAR(50),
      gender VARCHAR(10),
      job_title VARCHAR(50),
      spoken_language VARCHAR(50),
      contact_phone_number VARCHAR(20),
      email_address VARCHAR(50),
      registered_credit_card VARCHAR(50)
    );

We recommend using the smallest possible column size as a best practice, and you may need to modify these table definitions per your specific use case. Creating columns much larger than necessary will have an impact on the size of data tables and affect query performance.

Create Amazon Redshift table

Create the source and destination Data Catalog tables in AWS Glue

The CloudFormation stack provisioned two AWS Glue data crawlers: one for the Amazon S3 data source and one for the Amazon Redshift data source. To run the crawlers, complete the following steps:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
    AWS Glue Crawlers
  2. Select the crawler named glue-s3-crawler, then choose Run crawler to trigger the crawler job.
    Run Amazon S3 crawler job
  3. Select the crawler named glue-redshift-crawler, then choose Run crawler.
    Run Amazon Redshift crawler job

When the crawlers are complete, navigate to the Tables page to verify your results. You should see two tables registered under the demodb database.

AWS Glue database tables

Author an AWS Glue ETL job to perform data encryption

An AWS Glue job is provisioned for you as part of the CloudFormation stack setup, but the extract, transform, and load (ETL) script has not been created. We create and upload the ETL script to the /glue-script folder under the provisioned S3 bucket in order to run the AWS Glue job.

  1. Return to your AWS Cloud9 environment either via the AWS Cloud9 console, or by visiting the URL obtained from the CloudFormation stack output with the key AWSCloud9IDE.
    CloudFormation stack output for AWSCloud9IDE

We use the Miscreant package to implement deterministic encryption using the AES-SIV encryption algorithm, which means that for any given plain text value, the generated encrypted value will always be the same. The benefit of using this encryption approach is to allow for point lookups, equality joins, grouping, and indexing on encrypted columns. However, you should also be aware of the potential security implication when applying deterministic encryption to low-cardinality data, such as gender, boolean values, and status flags.
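You can verify the deterministic property locally: sealing the same plaintext twice with the same key produces the same ciphertext. The following is a minimal sketch using a locally generated key for illustration (not the Secrets Manager key used in this post):

import os
import base64
from miscreant.aes.siv import SIV

key = os.urandom(32)   # 256-bit key, for local illustration only
siv = SIV(key)         # no nonce, so encryption is deterministic

ct1 = siv.seal("4111111111111111".encode())
ct2 = siv.seal("4111111111111111".encode())

print(ct1 == ct2)                             # True: same plaintext always yields the same ciphertext
print(base64.b64encode(ct1).decode('utf-8'))  # the base64 form that gets stored in Amazon Redshift
print(siv.open(ct1).decode('utf-8'))          # decrypts back to the original value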

  1. Create a new file in the AWS Cloud9 environment and enter the following code snippet:
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.dynamicframe import DynamicFrameCollection
    from awsglue.dynamicframe import DynamicFrame
    
    import boto3
    import base64
    from miscreant.aes.siv import SIV
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import StringType
    
    args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir", "SecretName", "InputTable"])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)
    
    # retrieve the data encryption key from Secrets Manager
    secret_name = args["SecretName"]
    
    sm_client = boto3.client('secretsmanager')
    get_secret_value_response = sm_client.get_secret_value(SecretId = secret_name)
    data_encryption_key = get_secret_value_response['SecretBinary']
    siv = SIV(data_encryption_key)  # Without nonce, the encryption becomes deterministic
    
    # define the data encryption function
    def pii_encrypt(value):
        if value is None:
            value = ""
        ciphertext = siv.seal(value.encode())
        return base64.b64encode(ciphertext).decode('utf-8')
    
    # register the data encryption function as Spark SQL UDF   
    udf_pii_encrypt = udf(lambda z: pii_encrypt(z), StringType())
    
    # define the Glue Custom Transform function
    def Encrypt_PII (glueContext, dfc) -> DynamicFrameCollection:
        newdf = dfc.select(list(dfc.keys())[0]).toDF()
        
        # PII fields to be encrypted
        pii_col_list = ["registered_credit_card"]
    
        for pii_col_name in pii_col_list:
            newdf = newdf.withColumn(pii_col_name, udf_pii_encrypt(col(pii_col_name)))
    
        encrypteddyc = DynamicFrame.fromDF(newdf, glueContext, "encrypted_data")
        return (DynamicFrameCollection({"CustomTransform0": encrypteddyc}, glueContext))
    
    # Script generated for node S3 bucket
    S3bucket_node1 = glueContext.create_dynamic_frame.from_catalog(
        database="demodb",
        table_name=args["InputTable"],
        transformation_ctx="S3bucket_node1",
    )
    
    # Script generated for node ApplyMapping
    ApplyMapping_node2 = ApplyMapping.apply(
        frame=S3bucket_node1,
        mappings=[
            ("id", "long", "id", "long"),
            ("full_name", "string", "full_name", "string"),
            ("gender", "string", "gender", "string"),
            ("job_title", "string", "job_title", "string"),
            ("spoken_language", "string", "spoken_language", "string"),
            ("contact_phone_number", "string", "contact_phone_number", "string"),
            ("email_address", "string", "email_address", "string"),
            ("registered_credit_card", "long", "registered_credit_card", "string"),
        ],
        transformation_ctx="ApplyMapping_node2",
    )
    
    # Custom Transform
    Customtransform_node = Encrypt_PII(glueContext, DynamicFrameCollection({"ApplyMapping_node2": ApplyMapping_node2}, glueContext))
    
    # Script generated for node Redshift Cluster
    RedshiftCluster_node3 = glueContext.write_dynamic_frame.from_catalog(
        frame=Customtransform_node,
        database="demodb",
        table_name="demodb_public_pii_table",
        redshift_tmp_dir=args["TempDir"],
        transformation_ctx="RedshiftCluster_node3",
    )
    
    job.commit()

  2. Save the script with the file name pii-data-encryption.py.
    Save file in Cloud9
  3. Copy the script to the desired S3 bucket location by running the following command:
    S3_BUCKET=$(aws s3 ls| awk '{print $3}'| grep awsblog-pii-data-input-)
    aws s3 cp pii-data-encryption.py s3://$S3_BUCKET/glue-script/pii-data-encryption.py

    Upload AWS Glue script to S3

  4. To verify the script is uploaded successfully, navigate to the Jobs page on the AWS Glue console. You should be able to find a job named pii-data-encryption-job.
    AWS Glue console
  5. Choose Run to trigger the AWS Glue job. It will first read the source data from the S3 bucket registered in the AWS Glue Data Catalog, then apply column mappings to transform data into the expected data types, followed by performing PII fields encryption, and finally loading the encrypted data into the target Redshift table. The whole process should be completed within 5 minutes for this sample dataset. You can switch to the Runs tab to monitor the job status.
    Monitor AWS Glue job

Configure a Lambda function to perform data decryption

A Lambda function with the data decryption logic is deployed for you during the CloudFormation stack setup. You can find the function on the Lambda console.

AWS Lambda console

The following is the Python code used in the Lambda function:

import boto3
import os
import json
import base64
import logging
from miscreant.aes.siv import SIV

logger = logging.getLogger()
logger.setLevel(logging.INFO)

secret_name = os.environ['DATA_ENCRYPT_KEY']

sm_client = boto3.client('secretsmanager')
get_secret_value_response = sm_client.get_secret_value(SecretId = secret_name)
data_encryption_key = get_secret_value_response['SecretBinary']

siv = SIV(data_encryption_key)  # Without nonce, the encryption becomes deterministic

# define lambda function logic
def lambda_handler(event, context):
    ret = dict()
    res = []
    for argument in event['arguments']:
        encrypted_value = argument[0]
        try:
            # decrypt the value and decode it back to a plain string
            de_val = siv.open(base64.b64decode(encrypted_value)).decode('utf-8')
        except:
            # fall back to the original value if decryption fails
            de_val = encrypted_value
            logger.warning('Decryption for value failed: ' + str(encrypted_value))
        res.append(json.dumps(de_val))

    ret['success'] = True
    ret['results'] = res

    return json.dumps(ret) # return decrypted results

If you want to deploy the Lambda function on your own, make sure to include the Miscreant package in your deployment package.
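For reference, Amazon Redshift invokes the Lambda UDF with a batched JSON payload and expects one result per input row in the same order. The following sketch shows the general shape of the request and response handled by the function above (the values are illustrative placeholders):

# Shape of the event Amazon Redshift sends to the Lambda UDF: one inner list per row,
# with one element per UDF argument (a single varchar in this example).
sample_event = {
    "arguments": [
        ["<base64-encoded ciphertext 1>"],
        ["<base64-encoded ciphertext 2>"],
    ],
    "num_records": 2,
}

# Shape of the response the handler builds (it is returned as a JSON string via json.dumps).
sample_response = {
    "success": True,
    "results": [
        '"4111111111111111"',   # decrypted values, JSON-encoded strings
        '"5555555555554444"',
    ],
}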

Register a Lambda UDF in Amazon Redshift

You can create Lambda UDFs that use custom functions defined in Lambda as part of your SQL queries. Lambda UDFs are managed in Lambda, and you can control the access privileges to invoke these UDFs in Amazon Redshift.

  1. Navigate back to the Amazon Redshift Query Editor V2 to register the Lambda UDF.
  2. Use the CREATE EXTERNAL FUNCTION command and provide an IAM role that the Amazon Redshift cluster is authorized to assume and make calls to Lambda:
    CREATE OR REPLACE EXTERNAL FUNCTION pii_decrypt (value varchar(max))
    RETURNS varchar STABLE
    LAMBDA '<--Replace-with-your-lambda-function-name-->'
    IAM_ROLE '<--Replace-with-your-redshift-lambda-iam-role-arn-->';

You can find the Lambda name and Amazon Redshift IAM role on the CloudFormation stack Outputs tab:

  • LambdaFunctionName
  • IAMRoleForRedshiftLambdaUDF

CloudFormation stack output
Create External Function in Amazon Redshift

Validate the column-level encryption functionality in Amazon Redshift

By default, permission to run new Lambda UDFs is granted to PUBLIC. To restrict usage of the newly created UDF, revoke the permission from PUBLIC and then grant the privilege to specific users or groups. To learn more about Lambda UDF security and privileges, see Managing Lambda UDF security and privileges.

You must be a superuser or have the sys:secadmin role to run the following SQL statements:

GRANT SELECT ON "demodb"."public"."pii_table" TO PUBLIC;
CREATE USER regular_user WITH PASSWORD '1234Test!';
CREATE USER privileged_user WITH PASSWORD '1234Test!';
REVOKE EXECUTE ON FUNCTION pii_decrypt(varchar) FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pii_decrypt(varchar) TO privileged_user;

First, we run a SELECT statement to verify that our highly sensitive data field, in this case the registered_credit_card column, is now encrypted in the Amazon Redshift table:

SELECT * FROM "demodb"."public"."pii_table";

Select statement

For regular database users who have not been granted the permission to use the Lambda UDF, they will see a permission denied error when they try to use the pii_decrypt() function:

SET SESSION AUTHORIZATION regular_user;
SELECT *, pii_decrypt(registered_credit_card) AS decrypted_credit_card FROM "demodb"."public"."pii_table";

Permission denied

For privileged database users who have been granted the permission to use the Lambda UDF for decrypting the data, they can issue a SQL statement using the pii_decrypt() function:

SET SESSION AUTHORIZATION privileged_user;
SELECT *, pii_decrypt(registered_credit_card) AS decrypted_credit_card FROM "demodb"."public"."pii_table";

The original registered_credit_card values can be successfully retrieved, as shown in the decrypted_credit_card column.

Decrypted results

Cleaning up

To avoid incurring future charges, make sure to clean up all the AWS resources that you created as part of this post.

You can delete the CloudFormation stack on the AWS CloudFormation console or via the AWS Command Line Interface (AWS CLI). The default stack name is aws-blog-redshift-column-level-encryption.

Conclusion

In this post, we demonstrated how to implement a custom column-level encryption solution for Amazon Redshift, which provides an additional layer of protection for sensitive data stored on the cloud data warehouse. The CloudFormation template gives you an easy way to set up the data pipeline, which you can further customize for your specific business scenarios. You can also modify the AWS Glue ETL code to encrypt multiple data fields at the same time, and to use different data encryption keys for different columns for enhanced data security. With this solution, you can limit the occasions where human actors can access sensitive data stored in plain text on the data warehouse.

You can learn more about this solution and the source code by visiting the GitHub repository. To learn more about how to use Amazon Redshift UDFs to solve different business problems, refer to Example uses of user-defined functions (UDFs) and Amazon Redshift UDFs.


About the Author

Aaron ChongAaron Chong is an Enterprise Solutions Architect at Amazon Web Services Hong Kong. He specializes in the data analytics domain, and works with a wide range of customers to build big data analytics platforms, modernize data engineering practices, and advocate AI/ML democratization.

Generic orchestration framework for data warehousing workloads using Amazon Redshift RSQL

Post Syndicated from Akhil B original https://aws.amazon.com/blogs/big-data/generic-orchestration-framework-for-data-warehousing-workloads-using-amazon-redshift-rsql/

Tens of thousands of customers run business-critical workloads on Amazon Redshift, AWS’s fast, petabyte-scale cloud data warehouse delivering the best price-performance. With Amazon Redshift, you can query data across your data warehouse, operational data stores, and data lake using standard SQL. You can also integrate AWS services like Amazon EMR, Amazon Athena, Amazon SageMaker, AWS Glue, AWS Lake Formation, and Amazon Kinesis to take advantage of all of the analytic capabilities in the AWS Cloud.

Amazon Redshift RSQL is a native command-line client for interacting with Amazon Redshift clusters and databases. You can connect to an Amazon Redshift cluster, describe database objects, query data, and view query results in various output formats. You can use Amazon Redshift RSQL to replace existing extract, transform, load (ETL) and automation scripts, such as Teradata BTEQ scripts. You can wrap Amazon Redshift RSQL statements within a shell script to replicate existing functionality in on-premises systems. Amazon Redshift RSQL is available for Linux, Windows, and macOS operating systems.

This post explains how you can create a generic configuration-driven orchestration framework using AWS Step Functions, Amazon Elastic Compute Cloud (Amazon EC2), AWS Lambda, Amazon DynamoDB, and AWS Systems Manager to orchestrate RSQL-based ETL workloads. If you’re migrating from legacy data warehouse workloads to Amazon Redshift, you can use this methodology to orchestrate your data warehousing workloads.

Solution overview

Customers migrating from legacy data warehouses to Amazon Redshift may have a significant investment in proprietary scripts like Basic Teradata Query (BTEQ) scripting for database automation, ETL, or other tasks. You can now use the AWS Schema Conversion Tool (AWS SCT) to automatically convert proprietary scripts like BTEQ scripts to Amazon Redshift RSQL scripts. The converted scripts run on Amazon Redshift with little to no changes. To learn about new options for database scripting, refer to Accelerate your data warehouse migration to Amazon Redshift – Part 4.

During such migrations, you may also want to modernize your current on-premises, third-party orchestration tools with a cloud-native framework to replicate and enhance your current orchestration capability. Orchestrating data warehouse workloads includes scheduling the jobs, checking if the pre-conditions have been met, running the business logic embedded within RSQL, monitoring the status of the jobs, and alerting if there are any failures.

This solution allows on-premises customers to migrate to a cloud-native orchestration framework that uses AWS serverless services such as Step Functions, Lambda, DynamoDB, and Systems Manager to run the Amazon Redshift RSQL jobs deployed on a persistent EC2 instance. You can also deploy the solution for greenfield implementations. In addition to meeting functional requirements, this solution also provides full auditing, logging, and monitoring of all ETL and ELT processes that are run.

To ensure high availability and resilience, you can use multiple EC2 instances that are part of an Auto Scaling group along with Amazon Elastic File System (Amazon EFS) to deploy and run the RSQL jobs. When using Auto Scaling groups, you can install RSQL onto the EC2 instance as part of the bootstrap script. You can also deploy the Amazon Redshift RSQL scripts onto the EC2 instance using AWS CodePipeline and AWS CodeDeploy. For more details, refer to Auto Scaling groups, the Amazon EFS User Guide, and Integrating CodeDeploy with Amazon EC2 Auto Scaling.

The following diagram illustrates the architecture of the orchestration framework.

Architecture Diagram

The key components of the framework are as follows:

  1. Amazon EventBridge is used as the ETL workflow scheduler, and it triggers a Lambda function at a preset schedule.
  2. The function queries a DynamoDB table for the configuration associated with the RSQL job, along with the status of the job, run mode, and restart information for that job.
  3. After receiving the configuration, the function triggers a Step Functions state machine by passing the configuration details.
  4. Step Functions starts running different stages (like configuration iteration, run type check, and more) of the workflow.
  5. Step Functions uses the Systems Manager SendCommand API to trigger the RSQL job and goes into a paused state with TaskToken. The RSQL scripts are persisted on an EC2 instance and are wrapped in a shell script. Systems Manager runs an AWS-RunShellScript SSM document to run the RSQL job on the EC2 instance.
  6. The RSQL job performs ETL and ELT operations on the Amazon Redshift cluster. When it’s complete, it returns a success/failure code and status message back to the calling shell script.
  7. The shell script calls a custom Python module with the success/failure code, status message, and the TaskToken that was received from Step Functions. The Python module logs the RSQL job status in the DynamoDB job audit table, and exports logs to the Amazon CloudWatch log group.
  8. The Python module then performs a SendTaskSuccess or SendTaskFailure API call based on the RSQL job run status. Based on the status of the RSQL job, Step Functions either resumes the flow or stops with failure.
  9. Step Functions logs the workflow status (success or failure) in the DynamoDB workflow audit table.
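The following is a minimal sketch of how such a Python module might report the job status back to Step Functions (steps 7 and 8). The module name, audit table key schema, argument order, and output format are assumptions for illustration, not the exact code in the GitHub repository.

# rsql_callback.py - illustrative sketch only; table name, key schema, and argument
# order are assumptions, not the exact module shipped with the framework.
import datetime
import sys

import boto3

stepfunctions = boto3.client("stepfunctions")
dynamodb = boto3.resource("dynamodb")


def report_job_status(job_name, return_code, status_message, task_token):
    """Record the RSQL job result and resume the paused Step Functions workflow."""
    audit_table = dynamodb.Table("rsql-blog-rsql-job-audit-table")
    audit_table.update_item(
        Key={"job_name": job_name},  # assumed partition key
        UpdateExpression="SET execution_status = :s, execution_end_ts = :t",
        ExpressionAttributeValues={
            ":s": "Completed" if return_code == 0 else "Failed",
            ":t": datetime.datetime.utcnow().isoformat(),
        },
    )
    if return_code == 0:
        stepfunctions.send_task_success(
            taskToken=task_token,
            output=f'{{"status": "{status_message}"}}',
        )
    else:
        stepfunctions.send_task_failure(
            taskToken=task_token, error="RSQLJobFailed", cause=status_message
        )


if __name__ == "__main__":
    # Called by the shell wrapper as:
    # python3 rsql_callback.py <job_name> <return_code> <status_message> <task_token>
    report_job_status(sys.argv[1], int(sys.argv[2]), sys.argv[3], sys.argv[4])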

Prerequisites

You should have the following prerequisites:

Deploy AWS CDK stacks

Complete the following steps to deploy your resources using the AWS CDK:

  1. Clone the GitHub repo:
    git clone https://github.com/aws-samples/amazon-redshift-rsql-orchestration-framework.git

  2. Update the following environment parameters in cdk.json (this file can be found in the infra directory):
    1. ec2_instance_id – The EC2 instance ID on which RSQL jobs are deployed
    2. redshift_secret_id – The name of the Secrets Manager key that stores the Amazon Redshift database credentials
    3. rsql_script_path – The absolute directory path in the EC2 instance where the RSQL jobs are stored
    4. rsql_log_path – The absolute directory path in the EC2 instance used for storing the RSQL job logs
    5. rsql_script_wrapper – The absolute directory path of the RSQL wrapper script (rsql_trigger.sh) on the EC2 instance.

    The following is a sample cdk.json file after being populated with the parameters:

        "environment": {
          "ec2_instance_id" : "i-xxxx",
          "redshift_secret_id" : "blog-secret",
          "rsql_script_path" : "/home/ec2-user/blog_test/rsql_scripts/",
          "rsql_log_path" : "/home/ec2-user/blog_test/logs/",
          "rsql_script_wrapper" : "/home/ec2-user/blog_test/instance_code/rsql_trigger.sh"
        }
    

  3. Deploy the AWS CDK stack with the following code:
    cd amazon-redshift-rsql-orchestration-framework/lambdas/lambda-layer/
    sh zip_lambda_layer.sh
    cd ../../infra/
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
    cdk bootstrap <AWS Account ID>/<AWS Region>
    cdk deploy --all

Let’s look at the resources the AWS CDK stack deploys in more detail.

CloudWatch log group

A CloudWatch log group (/ops/rsql-logs/) is created, which is used to store, monitor, and access log files from EC2 instances and other sources.

The log group is used to store the RSQL job run logs. For each RSQL script, all the stdout and stderr logs are stored as a log stream within this log group.
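The log export itself comes down to two CloudWatch Logs API calls per run. The following is a hedged sketch, assuming one log stream per script run; the stream naming convention is an illustration, not the framework's exact implementation.

# Illustrative only: push a job's stdout/stderr into the /ops/rsql-logs/ log group
# as a per-run log stream. The stream naming convention is an assumption.
import time

import boto3

logs = boto3.client("logs")
LOG_GROUP = "/ops/rsql-logs/"


def export_job_log(script_name, log_text):
    stream_name = f"{script_name}-{int(time.time())}"
    logs.create_log_stream(logGroupName=LOG_GROUP, logStreamName=stream_name)
    logs.put_log_events(
        logGroupName=LOG_GROUP,
        logStreamName=stream_name,
        logEvents=[{"timestamp": int(time.time() * 1000), "message": log_text}],
    )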

DynamoDB configuration table

The DynamoDB configuration table (rsql-blog-rsql-config-table) is the basic building block of this solution. All the RSQL jobs, restart information and run mode (sequential or parallel), and sequence in which the jobs are to be run are stored in this configuration table.

The table has the following structure:

  • workflow_id – The identifier for the RSQL-based ETL workflow.
  • workflow_description – The description for the RSQL-based ETL workflow.
  • workflow_stages – The sequence of stages within a workflow.
  • execution_type – The type of run for RSQL jobs (sequential or parallel).
  • stage_description – The description for the stage.
  • scripts – The list of RSQL scripts to be run. The RSQL scripts must be placed in the location defined in a later step.

The following is an example of an entry in the configuration table. You can see the workflow_id is blog_test_workflow and the description is Test Workflow for Blog.

It has three stages that are triggered in the following order: Schema & Table Creation Stage, Data Insertion Stage 1, and Data Insertion Stage 2. The stage Schema & Table Creation Stage has two RSQL jobs running sequentially, and Data Insertion Stage 1 and Data Insertion Stage 2 each have two jobs running in parallel.

{
	"workflow_id": "blog_test_workflow",
	"workflow_description": "Test Workflow for Blog",
	"workflow_stages": [{
			"execution_flag": "y",
			"execution_type": "sequential",
			"scripts": [
				"rsql_blog_script_1.sh",
				"rsql_blog_script_2.sh"
			],
			"stage_description": "Schema & Table Creation Stage"
		},
		{
			"execution_flag": "y",
			"execution_type": "parallel",
			"scripts": [
				"rsql_blog_script_3.sh",
				"rsql_blog_script_4.sh"
			],
			"stage_description": "Data Insertion Stage 1"
		},
		{
			"execution_flag": "y",
			"execution_type": "parallel",
			"scripts": [
				"rsql_blog_script_5.sh",
				"rsql_blog_script_6.sh"
			],
			"stage_description": "Data Insertion Stage 2"
		}
	]
}
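If you prefer to load this example configuration programmatically rather than through the console, a minimal boto3 sketch might look like the following (the JSON file name is a placeholder for the document shown above):

# Illustrative: load the example workflow configuration into the configuration table.
import json

import boto3

dynamodb = boto3.resource("dynamodb")
config_table = dynamodb.Table("rsql-blog-rsql-config-table")

with open("blog_test_workflow.json") as f:  # the JSON document shown above
    workflow_config = json.load(f)

config_table.put_item(Item=workflow_config)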

DynamoDB audit tables

The audit tables store the run details for each RSQL job within the ETL workflow with a unique identifier for monitoring and reporting purposes. There are two audit tables because one stores the audit information at an RSQL job level and the other stores it at a workflow level.

The job audit table (rsql-blog-rsql-job-audit-table) has the following structure:

  • job_name – The name of the RSQL script
  • workflow_execution_id – The run ID for the workflow
  • execution_start_ts – The start timestamp for the RSQL job
  • execution_end_ts – The end timestamp for the RSQL job
  • execution_status – The run status of the RSQL job (Running, Completed, Failed)
  • instance_id – The EC2 instance ID on which the RSQL job is run
  • ssm_command_id – The Systems Manager command ID used to trigger the RSQL job
  • workflow_id – The workflow_id under which the RSQL job is run

The workflow audit table (rsql-blog-rsql-workflow-audit-table) has the following structure:

  • workflow_execution_id – The run ID for the workflow
  • workflow_id – The identifier for a particular workflow
  • execution_start_ts – The start timestamp for the workflow
  • execution_status – The run status of the workflow or state machine (Running, Completed, Failed)
  • rsql_jobs – The list of RSQL scripts that are a part of the workflow
  • execution_end_ts – The end timestamp for the workflow
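To check the outcome of a given run, you can read this table directly. The following is a small sketch that assumes workflow_execution_id is the table's partition key; adjust the key to match the deployed schema.

# Illustrative: look up the status of a workflow run in the workflow audit table.
import boto3

dynamodb = boto3.resource("dynamodb")
workflow_audit = dynamodb.Table("rsql-blog-rsql-workflow-audit-table")

response = workflow_audit.get_item(Key={"workflow_execution_id": "demo_test_26"})
item = response.get("Item", {})
print(item.get("execution_status"), item.get("rsql_jobs"))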

Lambda functions

The AWS CDK creates the Lambda functions that retrieve the config data from the DynamoDB config table, update the audit details in DynamoDB, trigger the RSQL scripts on the EC2 instance, and iterate through each stage. The following is a list of the functions:

  • rsql-blog-master-iterator-lambda
  • rsql-blog-parallel-load-check-lambda
  • rsql-blog-sequential-iterator-lambda
  • rsql-blog-rsql-invoke-lambda
  • rsql-blog-update-audit-ddb-lambda

Step Functions state machines

This solution implements the Step Functions callback task integration pattern, in which a workflow sends a task token to an external system and pauses until that system returns the token. A simplified AWS CDK sketch of such a callback task follows the list below.

The AWS CDK deploys the following state machines:

  • RSQLParallelStateMachine – The parallel state machine is triggered if the execution_type for a stage in the configuration table is set to parallel. The Lambda function with a callback token is triggered in parallel for each of the RSQL scripts using a Map state.
  • RSQLSequentialStateMachine – The sequential state machine is triggered if the execution_type for a stage in the configuration table is set to sequential. This state machine uses an iterator design pattern to run each RSQL job within the stage as per the sequence mentioned in the configuration.
  • RSQLMasterStatemachine – The primary state machine iterates through each stage and triggers different state machines based on the run mode (sequential or parallel) using a Choice state.
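To illustrate the callback integration pattern, the following AWS CDK (Python) sketch defines a single task that invokes a Lambda function with a task token and waits for SendTaskSuccess or SendTaskFailure. The construct names and input fields are placeholders, and the sketch is intentionally simpler than the deployed state machines.

# Illustrative CDK sketch of a callback (WAIT_FOR_TASK_TOKEN) task.
from aws_cdk import Stack
from aws_cdk import aws_lambda as _lambda
from aws_cdk import aws_stepfunctions as sfn
from aws_cdk import aws_stepfunctions_tasks as tasks
from constructs import Construct


class CallbackSketchStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Reference the existing invoke function by name (placeholder).
        rsql_invoke_lambda = _lambda.Function.from_function_name(
            self, "RsqlInvokeFn", "rsql-blog-rsql-invoke-lambda"
        )

        # Callback task: the workflow pauses here until the EC2-side Python module
        # returns the task token via SendTaskSuccess or SendTaskFailure.
        invoke_rsql_job = tasks.LambdaInvoke(
            self,
            "InvokeRSQLJob",
            lambda_function=rsql_invoke_lambda,
            integration_pattern=sfn.IntegrationPattern.WAIT_FOR_TASK_TOKEN,
            payload=sfn.TaskInput.from_object(
                {
                    "task_token": sfn.JsonPath.task_token,
                    "script": sfn.JsonPath.string_at("$.script"),
                }
            ),
        )

        sfn.StateMachine(self, "RSQLCallbackSketch", definition=invoke_rsql_job)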

Move the RSQL script and instance code

Copy the instance_code and rsql_scripts directories (present in the GitHub repo) to the EC2 instance. Make sure the framework directory within instance_code is copied as well.

The following screenshots show that the instance_code and rsql_scripts directories are copied to the same parent folder on the EC2 instance.

Instance Code Scripts Image
Instance Code EC2 Copy Image
RSQL Script Image
RSQL Script EC2 Copy Image

RSQL script run workflow

To further illustrate the mechanism to run the RSQL scripts, see the following diagram.

RSQL Script Workflow Diagram

The Lambda function, which gets the configuration details from the configuration DynamoDB table, triggers the Step Functions workflow, which performs the following steps:

  1. A Lambda function defined as a workflow step receives the Step Functions TaskToken and configuration details.
  2. The TaskToken and configuration details are passed onto the EC2 instance using the Systems Manager SendCommand API call. After the Lambda function is run, the workflow branch goes into a paused state and waits for a callback token.
  3. The RSQL scripts, which perform ETL and ELT on Amazon Redshift, are run on the EC2 instance. After the scripts are run, each RSQL script passes its completion status and TaskToken to a Python script. This Python script is embedded within the RSQL script.
  4. The Python script updates the RSQL job status (success/failure) in the job audit DynamoDB table. It also exports the RSQL job logs to the CloudWatch log group.
  5. The Python script passes the RSQL job status (success/failure) and the status message back to the Step Functions workflow along with TaskToken using the SendTaskSuccess or SendTaskFailure API call.
  6. Depending on the job status received, Step Functions either resumes the workflow or stops the workflow.

If EC2 auto scaling groups are used, then you can use the Systems Manager SendCommand to ensure resilience and high availability by specifying one or more EC2 instances (that are a part of the auto scaling group). For more information, refer to Run commands at scale.

When multiple EC2 instances are used, set the MaxConcurrency parameter of the SendCommand API call to 1, which makes sure that the RSQL job is triggered on only one EC2 instance at a time. For further details, refer to Using concurrency controls.
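The following boto3 sketch illustrates this pattern; the target tag and script path are placeholders for your own environment.

# Illustrative: trigger the RSQL wrapper on instances in the auto scaling group with
# Run Command, one instance at a time. The target tag and script path are placeholders.
import boto3

ssm = boto3.client("ssm")

response = ssm.send_command(
    Targets=[{"Key": "tag:rsql-runner", "Values": ["true"]}],
    DocumentName="AWS-RunShellScript",
    Parameters={"commands": ["sh /home/ec2-user/blog_test/instance_code/rsql_trigger.sh"]},
    MaxConcurrency="1",
    MaxErrors="0",
)
print(response["Command"]["CommandId"])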

Run the orchestration framework

To run the orchestration framework, complete the following steps:

  1. On the DynamoDB console, navigate to the configuration table and insert the configuration details provided earlier. For instructions on how to insert the example JSON configuration details, refer to Write data to a table using the console or AWS CLI.
    DynamoDB Config Insertion
  2. On the Lambda console, open the rsql-blog-rsql-workflow-trigger-lambda function and choose Test.
    Workflow Trigger Lambda Function
  3. Add the test event similar to the following code and choose Test:
    {
    	"workflow_id": "blog_test_workflow",
    	"workflow_execution_id": "demo_test_26"
    }

    Workflow Trigger Lambda function Payload

  4. On the Step Functions console, navigate to the rsql-master-state-machine state machine to open the details page.
    RSQL Master Step Function
  5. Choose Edit, then choose Workflow Studio New. The following screenshot shows the primary state machine.
    RSQL Master Step Function Flow
  6. Choose Cancel to leave Workflow Studio, then choose Cancel again to leave edit mode. You’re directed back to the details page.
    RSQL Master Step Function Details
  7. On the Executions tab, choose the latest run.
    RSQL Master Step Function Execution
  8. From the Graph view, you can check the status of each state by choosing it. Every state that uses an external resource has a link to it on the Details tab.
    RSQL Master Step Function Execution Graph
  9. The orchestration framework runs the ETL load, which consists of the following sample RSQL scripts:
    • rsql_blog_script_1.sh – Creates a schema rsql_blog within the database
    • rsql_blog_script_2.sh – Creates a table blog_table within the schema created by the previous script
    • rsql_blog_script_3.sh – Inserts one row into blog_table
    • rsql_blog_script_4.sh – Inserts one row into blog_table
    • rsql_blog_script_5.sh – Inserts one row into blog_table
    • rsql_blog_script_6.sh – Inserts one row into blog_table

You need to replace these RSQL scripts with the RSQL scripts developed for your workloads by inserting the relevant configuration details into the configuration DynamoDB table (rsql-blog-rsql-config-table).
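As an alternative to the console test in steps 2 and 3, you can trigger the workflow programmatically. The following is a minimal sketch that invokes the trigger function with the same payload.

# Illustrative: invoke the workflow trigger Lambda function directly with boto3.
import json

import boto3

lambda_client = boto3.client("lambda")

response = lambda_client.invoke(
    FunctionName="rsql-blog-rsql-workflow-trigger-lambda",
    Payload=json.dumps(
        {"workflow_id": "blog_test_workflow", "workflow_execution_id": "demo_test_26"}
    ),
)
print(response["StatusCode"])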

Validation

After you run the framework, you’ll find a schema (called rsql_blog) with one table (called blog_table) created. This table consists of four rows.

RSQL Execution Table

You can check the logs of the RSQL job in the CloudWatch log group (/ops/rsql-logs/) and also the run status of the workflow in the workflow audit DynamoDB table (rsql-blog-rsql-workflow-audit-table).

RSQL Script CloudWatch Logs
RSQL Workflow Audit Record
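If you prefer to validate programmatically, the Redshift Data API can run the same check. In the following sketch, the cluster identifier, database, and secret ARN are placeholders for your own environment.

# Illustrative: count the rows loaded into rsql_blog.blog_table with the Redshift Data API.
import time

import boto3

redshift_data = boto3.client("redshift-data")

run = redshift_data.execute_statement(
    ClusterIdentifier="my-redshift-cluster",  # placeholder
    Database="dev",  # placeholder
    SecretArn="arn:aws:secretsmanager:<region>:<account>:secret:blog-secret",  # placeholder
    Sql="SELECT count(*) FROM rsql_blog.blog_table;",
)
while redshift_data.describe_statement(Id=run["Id"])["Status"] not in ("FINISHED", "FAILED"):
    time.sleep(1)
print(redshift_data.get_statement_result(Id=run["Id"])["Records"])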

Clean up

To avoid ongoing charges for the resources that you created, delete them. AWS CDK deletes all resources except data resources such as DynamoDB tables.

  • First, delete all AWS CDK stacks:
    cdk destroy --all

  • On the DynamoDB console, select the following tables and delete them:
    • rsql-blog-rsql-config-table
    • rsql-blog-rsql-job-audit-table
    • rsql-blog-rsql-workflow-audit-table

Conclusion

You can use Amazon Redshift RSQL, Systems Manager, EC2 instances, and Step Functions to create a modern and cost-effective orchestration framework for ETL workflows. There is no overhead to create and manage separate state machines for each of your ETL workflows. In this post, we demonstrated how to use this configuration-based generic orchestration framework to trigger complex RSQL-based ETL workflows.

You can also trigger an email notification through Amazon Simple Notification Service (Amazon SNS) within the state machine to notify the operations team of the completion status of the ETL process. Further, you can achieve an event-driven ETL orchestration framework by using EventBridge to start the workflow trigger Lambda function.


About the Authors

Akhil is a Data Analytics Consultant at AWS Professional Services. He helps customers design & build scalable data analytics solutions and migrate data pipelines and data warehouses to AWS. In his spare time, he loves travelling, playing games and watching movies.


Ramesh Raghupathy is a Senior Data Architect with WWCO ProServe at AWS. He works with AWS customers to architect, deploy, and migrate to data warehouses and data lakes on the AWS Cloud. While not at work, Ramesh enjoys traveling, spending time with family, and yoga.

Raza Hafeez is a Senior Data Architect within the Shared Delivery Practice of AWS Professional Services. He has over 12 years of professional experience building and optimizing enterprise data warehouses and is passionate about enabling customers to realize the power of their data. He specializes in migrating enterprise data warehouses to AWS Modern Data Architecture.

Dipal Mahajan is a Lead Consultant with Amazon Web Services based out of India, where he guides global customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. He brings extensive experience on Software Development, Architecture and Analytics from industries like finance, telecom, retail and healthcare.

Manage your data warehouse cost allocations with Amazon Redshift Serverless tagging

Post Syndicated from Sandeep Bajwa original https://aws.amazon.com/blogs/big-data/manage-your-data-warehouse-cost-allocations-with-amazon-redshift-serverless-tagging/

Amazon Redshift Serverless makes it simple to run and scale analytics without having to manage your data warehouse infrastructure. Developers, data scientists, and analysts can work across databases, data warehouses, and data lakes to build reporting and dashboarding applications, perform real-time analytics, share and collaborate on data, and even build and train machine learning (ML) models with Redshift Serverless.

Tags allow you to assign metadata to your AWS resources. You can define your own key and value for your resource tag, so that you can easily manage and filter your resources. Tags can also improve transparency and map costs to specific teams, products, or applications. This way, you can raise cost awareness and also make teams and users accountable for their own cost and usage.

You can now use tagging in Redshift Serverless to categorize the following resources based on your grouping needs:

  • Namespace – A collection of database objects and users
  • Workgroup – A collection of compute resources
  • Snapshot – Point-in-time backups of a cluster
  • Recovery point – Recovery points in Redshift Serverless are created every 30 minutes and saved for 24 hours

When using Redshift Serverless, you may have to manage data across many business departments, environments, and billing groups. In doing so, you’re usually faced with one of the following tasks:

  • Cost allocation and financial management – You want to know what you’re spending on AWS for a given project, line of business, or environment
  • Operations support and incident management – You want to send issues to the right teams and users
  • Access control – You want to constrain user access to certain resources
  • Security risk management – You want to group resources based on their level of security or data sensitivity and make sure proper controls are in place

In this post, we focus on tagging Redshift Serverless resources for cost allocation and reporting purposes. Knowing where you have incurred costs at the resource, workload, team, and organization level enhances your ability to budget and manage cost.

Solution overview

Let’s say that your company has two departments: marketing and finance. Each department has multiple cost centers and environments, as illustrated in the following figure. In AWS Cost Explorer, you want to create cost reports for Redshift Serverless by department, environment, and cost center.

We start by creating and applying user-defined tags to Amazon Redshift Serverless workgroups for the respective departments, environments, and cost centers. You can use both the AWS Command Line Interface (AWS CLI) and the Redshift Serverless console to tag serverless resources.
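If you script your tagging, the same operation is available through the Redshift Serverless API. The following boto3 sketch tags a workgroup; the workgroup name and tag values are placeholders.

# Illustrative: tag a Redshift Serverless workgroup with boto3 instead of the console.
import boto3

serverless = boto3.client("redshift-serverless")

workgroup_arn = serverless.get_workgroup(workgroupName="marketing-wg")["workgroup"]["workgroupArn"]

serverless.tag_resource(
    resourceArn=workgroup_arn,
    tags=[
        {"key": "department", "value": "marketing"},
        {"key": "environment", "value": "prod"},
        {"key": "cost-center", "value": "cc-1234"},
    ],
)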

The high-level steps are as follows:

  1. Create tags.
  2. View and edit tags.
  3. Set up cost allocation tags.
  4. Create cost reports.

Create tags

To create tags, complete the following steps:

  1. On the Amazon Redshift console, choose Manage tags in the navigation pane.
    Amazon Redshift console
  2. For Filter by resource type, you can filter by Workgroup, Namespace, Snapshot, and Recovery Point.
  3. Optionally, you can search for resources by an existing tag by entering values for Tag key or Tag value. For this post, we don’t include any tag filters, so we can view all the resources across our account.
    search for resources by an existing tag by entering values
  4. Select your resource from the search results and choose Manage tags to customize the tag key and value parameters.

Here, you can add new tags, remove tags, save changes, and cancel your changes if needed.

  5. Because we want to allocate cost across the various departments, we add a new key called department and a new value called marketing.
  6. Choose Save changes.
    Save Changes
  7. Confirm the changes by choosing Apply changes.
    Apply changes

For more details on tagging, refer to Tagging resources overview.

View and edit tags

If you already have resources such as workgroups (listed on the Workgroup configuration page) or snapshots (listed on the Data backup page), you can create new tags or edit existing tags on the given resource. In the following example, we manage tags on an existing workgroup.

  1. On the Amazon Redshift console, choose Workgroup configuration in the navigation pane.
  2. Select your workgroup and on the Actions menu, choose Manage tags.

Now we can remove existing tags or add new tags. For our use case, let’s assume that the marketing department is no longer using the default workgroup, so we want to remove the current tag.

  3. Choose Remove next to the marketing tag.

We are given the option to choose Undo if needed.

  4. Choose Save changes and then Apply the changes to confirm.

After we apply the tags, we can view the full list of resources. The number of tags applied to each resource is found in the Tags column.
view the full list of resources

Set up cost allocation tags

After you create and apply the user-defined tags to your Redshift Serverless workgroups, it can take up to 24 hours for the tags to appear on your cost allocation tags page for activation. You can activate tags by using the AWS Billing console for cost allocation tracking with the following steps:

  1. On the AWS Billing console, choose Cost allocation tags in the navigation pane.
  2. Under User-defined cost allocation tags, select the tags you created and applied (for this example, cost-center).
  3. Choose Activate.
    Choose Activate

After we activate all the tags we created, we can view the full list by choosing Active on the drop-down menu.
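Activation can also be scripted once the tags appear as user-defined cost allocation tags. The following is a hedged boto3 sketch using the Cost Explorer API; the tag keys are the ones from this example.

# Illustrative: activate cost allocation tags programmatically.
import boto3

ce = boto3.client("ce")

ce.update_cost_allocation_tags_status(
    CostAllocationTagsStatus=[
        {"TagKey": "department", "Status": "Active"},
        {"TagKey": "environment", "Status": "Active"},
        {"TagKey": "cost-center", "Status": "Active"},
    ]
)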

Create cost reports

After you activate the cost allocation tags, they appear on your cost allocation reports in Cost Explorer.

Cost Explorer helps you manage your AWS costs by giving you detailed insights into the line items in your bill. In Cost Explorer, you can visualize daily, monthly, and forecasted spend by combining an array of available filters. Filters allow you to narrow down costs according to AWS service type, linked accounts, and tags.
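The same grouping and filtering is available programmatically. The following sketch groups Amazon Redshift cost by the cost-center tag and filters by the department tag; the dates, tag values, and service dimension value are placeholders.

# Illustrative: daily cost grouped by cost-center and filtered by department.
import boto3

ce = boto3.client("ce")

report = ce.get_cost_and_usage(
    TimePeriod={"Start": "2023-05-01", "End": "2023-05-31"},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    Filter={
        "And": [
            {"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Redshift"]}},  # placeholder value
            {"Tags": {"Key": "department", "Values": ["marketing"]}},
        ]
    },
    GroupBy=[{"Type": "TAG", "Key": "cost-center"}],
)
for group in report["ResultsByTime"][0]["Groups"]:
    print(group["Keys"], group["Metrics"]["UnblendedCost"]["Amount"])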

The following screenshot shows the preconfigured reports in Cost Explorer.

To create custom reports for your cost and usage data, complete the following steps:

  1. On the AWS Cost Management console, choose Reports in the navigation pane.
  2. Choose Create new report.
  3. Select the report type you want to create (for this example, we select Cost and usage).
  4. Choose Create Report.
    Create Report
  5. To view weekly Redshift Serverless cost by cost center, choose the applicable settings in the Report parameters pane. For this post, we group data by the cost-center tag and filter data by the department tag.
  6. Save the report for later use by choosing Save to report library.
  7. Enter a name for your report, then choose Save report.
    Save Report

The following screenshot shows a sample report for daily Redshift Serverless cost by department.


The following screenshot shows an example report of weekly Redshift Serverless cost by environment.


Conclusion

Tagging resources in Amazon Redshift helps you maintain a central place to organize and view resources across the service for billing management. This feature saves you hours of manual work you would spend in grouping your Amazon Redshift resources via a spreadsheet or other manual alternatives.

For more tagging best practices, refer to Tagging AWS resources.


About the Authors

Sandeep Bajwa is a Sr. Analytics Specialist based out of Northern Virginia, specialized in the design and implementation of analytics and data lake solutions.

Michael Yitayew is a Product Manager for Amazon Redshift based out of New York. He works with customers and engineering teams to build new features that enable data engineers and data analysts to more easily load data, manage data warehouse resources, and query their data. He has supported AWS customers for over 3 years in both product marketing and product management roles.

How AWS Payments migrated from Redash to Amazon Redshift Query Editor v2

Post Syndicated from Erol Murtezaoglu original https://aws.amazon.com/blogs/big-data/how-aws-payments-migrated-from-redash-to-amazon-redshift-query-editor-v2/

AWS Payments is part of the AWS Commerce Platform (CP) organization that owns the customer experience of paying AWS invoices. It helps AWS customers manage their payment methods and payment preferences, and helps customers make self-service payments to AWS.

The Machine Learning, Data and Analytics (MLDA) team at AWS Payments enables data-driven decision-making across payments processes and partners by delivering data, business insights, and causal and ML inferences through a scalable set of data, insights, and ML inference services.

In this post, we discuss how to democratize data access to Amazon Redshift using the Amazon Redshift Query Editor V2.

Background

At AWS Payments, we had been using Redash to allow our users to author and run SQL queries against our Amazon Redshift data warehouse. Redash is a web-based SQL client application that can be used to author and run queries, visualize query results with charts, and collaborate with teams.

Over time, we began to notice incompatibilities between Redash’s operations and the needs of our workload.

We had the following requirements in mind when looking for an alternative tool:

  • Authentication and authorization
    • Provide data access without creating a database user and password
    • Allowlist users using permission groups (POSIX/LDAP) for accessing the tool
    • Limit user access to database objects
  • User experience
    • Run SQL queries on the selected database
    • Save a query and rerun it later
    • Write a dynamic SQL query and run the query based on input parameters
    • Export a query result to CSV
    • Search saved queries
    • Share a query with other users as a URL

After an evaluation of alternate services, we chose the Amazon Redshift Query Editor V2.

Amazon Redshift Query Editor V2

The Amazon Redshift Query Editor V2 has the following benefits:

  • It makes data more accessible to data analysts and data scientists with a unified web-based analyst workbench to explore, share, and collaborate on data through a SQL interface
  • It provides a managed service that allows you to focus on exploring your data without managing your infrastructure
  • Users can log in to the Query Editor using single sign-on (SSO)
  • Users can connect to Amazon Redshift using federated access without providing a user name and password
  • It enables you to collaborate with team members by providing the ability to share saved queries securely
  • You can benefit from new features as soon as they get released by the Amazon Redshift Query Editor team
  • You can keep track of changes made to saved queries using the Query History feature
  • You can write parameterized SQL queries, which allows you to reuse a query with different values
  • You can turn on the Chart feature to display a graphic visualization of the current page of results
  • You can use notebooks to organize, annotate, and share multiple SQL queries in a single document
  • You can run multiple queries in parallel by running each query in a separate tab

However, it presented the following challenges:

  • To restrict user access to other AWS services within our AWS account, we attached the AWS Identity and Access Management (IAM) policies (see the appendix at the end of this post) to the SAML IAM role. The policies enforce the following:
    • The user can only access the Query Editor V2 service.
    • The federated user gets assigned to a database group with limited access.
  • The Query Editor V2 currently doesn’t support cross-account Amazon Redshift connections. However, we set up Amazon Redshift data sharing to access the Amazon Redshift cluster from other AWS accounts. For more details, refer to Sharing data across clusters in Amazon Redshift.

Architecture overview

The following diagram illustrates our architecture.
In the following sections, we will walk you through the steps to set up the query editor and migrate Redash queries.

Prerequisites

To implement this solution, you must set up federated access to the Amazon Redshift Query Editor V2 using your identity provider (IdP) services.

You can find more information in the following posts:

Set up Amazon Redshift Query Editor V2

To set up the query editor, complete the following steps:

  1. Create an Amazon Redshift database group with read-only access.
  2. Create an IAM role for accessing Query Editor V2 in an AWS account and attach the required IAM policies based on your use case. For more information, refer to Configuring your AWS account.
  3. Create a trust relationship between your IdP and AWS.
    trust relationship between your IdP and AWS
  4. Add the principal tag sqlworkbench-team to the IAM role to share queries. For more information, refer to Sharing a query.
    Add the principal tag sqlworkbench-team to the IAM role
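The principal tag in step 4 can also be attached with a short script; the role name and team value in the following sketch are placeholders.

# Illustrative: add the sqlworkbench-team principal tag to the SAML IAM role.
import boto3

iam = boto3.client("iam")

iam.tag_role(
    RoleName="payments-qev2-federated-role",  # placeholder
    Tags=[{"Key": "sqlworkbench-team", "Value": "payments-analytics"}],  # placeholder
)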

Migrate Redash queries to Amazon Redshift Query Editor V2

In this section, we walk you through different ways to migrate your Redash queries to the Amazon Redshift Query Editor V2.

Query without parameters

Querying without parameters is straightforward: copy your query from Redash and enter it in the query editor.

  1. In Redash, navigate to the saved query and choose Edit Source.
  2. Copy the source query.
  3. In Amazon Redshift Query Editor V2, enter the query into the editor, choose the Save icon, and give your query a title.

Query with parameters

In Redash, a string between {{ }} is treated as a parameter, but Amazon Redshift Query Editor V2 uses ${ } to identify a parameter. To migrate queries with parameters, follow the same steps but replace {{ with ${ and }} with }.
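If you have many saved queries, a small script can handle the substitution. The following is an illustrative helper, not part of either tool.

# Illustrative: convert Redash-style {{ param }} placeholders to ${param} syntax.
import re


def convert_parameters(redash_sql: str) -> str:
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", r"${\1}", redash_sql)


print(convert_parameters("SELECT * FROM sales WHERE region = {{ region }}"))
# SELECT * FROM sales WHERE region = ${region}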

The following screenshot shows an example query in Redash.

The following screenshot shows the same query in Amazon Redshift Query Editor V2.

Multi-part query to a Query Editor V2 notebook

For a multi-part query, copy the query of each section of a Redash dashboard and add it to a notebook. The notebook in Amazon Redshift Query Editor V2 runs queries successively. You also can add a description for your query.

The following screenshot shows an example query on the Redash dashboard.
The following screenshot shows the query in an Amazon Redshift Query Editor V2 notebook.

Summary

In this post, we demonstrated how we set up Amazon Redshift Query Editor V2 with SSO and Amazon Redshift federated access, and migrated our customers from Redash to Amazon Redshift Query Editor V2. This solution reduced our operational cost of maintaining a third-party application and its infrastructure.

If you have similar use cases and need to provide a web-based tool to your customers to explore data on your Amazon Redshift cluster, consider using Amazon Redshift Query Editor V2.

Appendix: Customer IAM policies

In this section, we provide the code for the IAM policies we attached to the SAML IAM role to restrict user access to other AWS services within our AWS account:

  • query-editor-credentials-policy – In the following code, provide your Region, account, and cluster parameters to grant access to Amazon Redshift to get cluster credentials, create users, and allow users to join groups:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "redshift:GetClusterCredentials",
            "Resource": [
                "arn:aws:redshift:<region>:<account>:cluster:<cluster>",
                "arn:aws:redshift:<region>:<account>:dbname:<cluster>/payments_beta",
                "arn:aws:redshift:<region>:<account>:dbuser:<cluster>/${redshift:DbUser}"
            ],
            "Effect": "Allow"
        },
        {
            "Action": "redshift:JoinGroup",
            "Resource": "arn:aws:redshift:<region>:<account>:dbgroup:<cluster>/payments_ro_users",
            "Effect": "Allow"
        },
        {
            "Action": "redshift:DescribeClusters",
            "Resource": "arn:aws:redshift:<region>:<account>:cluster:<cluster>",
            "Effect": "Allow"
        },
        {
            "Action": "redshift:CreateClusterUser",
            "Resource": "arn:aws:redshift:<region>:<account>:dbuser:<cluster>/${redshift:DbUser}",
            "Effect": "Allow"
        }
    ]
}
  • query-editor-access-policy – See the following code:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "redshift:DescribeClusters",
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "RedshiftPermissions"
        },
        {
            "Condition": {
                "StringEquals": {
                    "secretsmanager:ResourceTag/sqlworkbench-resource-owner": "${aws:userid}"
                }
            },
            "Action": [
                "secretsmanager:CreateSecret",
                "secretsmanager:GetSecretValue",
                "secretsmanager:DeleteSecret",
                "secretsmanager:TagResource"
            ],
            "Resource": "arn:aws:secretsmanager:::sqlworkbench!",
            "Effect": "Allow",
            "Sid": "SecretsManagerPermissions"
        },
        {
            "Condition": {
                "StringEquals": {
                    "aws:CalledViaLast": "sqlworkbench.amazonaws.com"
                }
            },
            "Action": "tag:GetResources",
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "ResourceGroupsTaggingPermissions"
        },
        {
            "Action": [
                "sqlworkbench:CreateFolder",
                "sqlworkbench:PutTab",
                "sqlworkbench:BatchDeleteFolder",
                "sqlworkbench:DeleteTab",
                "sqlworkbench:GenerateSession",
                "sqlworkbench:GetAccountInfo",
                "sqlworkbench:GetAccountSettings",
                "sqlworkbench:GetUserInfo",
                "sqlworkbench:GetUserWorkspaceSettings",
                "sqlworkbench:PutUserWorkspaceSettings",
                "sqlworkbench:ListConnections",
                "sqlworkbench:ListFiles",
                "sqlworkbench:ListTabs",
                "sqlworkbench:UpdateFolder",
                "sqlworkbench:ListRedshiftClusters",
                "sqlworkbench:DriverExecute",
                "sqlworkbench:ListTaggedResources"
            ],
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "AmazonRedshiftQueryEditorV2NonResourceLevelPermissions"
        },
        {
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/sqlworkbench-resource-owner": "${aws:userid}"
                }
            },
            "Action": [
                "sqlworkbench:CreateConnection",
                "sqlworkbench:CreateSavedQuery",
                "sqlworkbench:CreateChart"
            ],
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "AmazonRedshiftQueryEditorV2CreateOwnedResourcePermissions"
        },
        {
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sqlworkbench-resource-owner": "${aws:userid}"
                }
            },
            "Action": [
                "sqlworkbench:DeleteChart",
                "sqlworkbench:DeleteConnection",
                "sqlworkbench:DeleteSavedQuery",
                "sqlworkbench:GetChart",
                "sqlworkbench:GetConnection",
                "sqlworkbench:GetSavedQuery",
                "sqlworkbench:ListSavedQueryVersions",
                "sqlworkbench:UpdateChart",
                "sqlworkbench:UpdateConnection",
                "sqlworkbench:UpdateSavedQuery",
                "sqlworkbench:AssociateConnectionWithTab",
                "sqlworkbench:AssociateQueryWithTab",
                "sqlworkbench:AssociateConnectionWithChart",
                "sqlworkbench:UpdateFileFolder",
                "sqlworkbench:ListTagsForResource"
            ],
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "AmazonRedshiftQueryEditorV2OwnerSpecificPermissions"
        },
        {
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sqlworkbench-resource-owner": "${aws:userid}",
                    "aws:RequestTag/sqlworkbench-resource-owner": "${aws:userid}"
                },
                "ForAllValues:StringEquals": {
                    "aws:TagKeys": "sqlworkbench-resource-owner"
                }
            },
            "Action": "sqlworkbench:TagResource",
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "AmazonRedshiftQueryEditorV2TagOnlyUserIdPermissions"
        },
        {
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sqlworkbench-team": "${aws:PrincipalTag/sqlworkbench-team}"
                }
            },
            "Action": [
                "sqlworkbench:GetChart",
                "sqlworkbench:GetConnection",
                "sqlworkbench:GetSavedQuery",
                "sqlworkbench:ListSavedQueryVersions",
                "sqlworkbench:ListTagsForResource",
                "sqlworkbench:AssociateQueryWithTab"
            ],
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "AmazonRedshiftQueryEditorV2TeamReadAccessPermissions"
        },
        {
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sqlworkbench-resource-owner": "${aws:userid}",
                    "aws:RequestTag/sqlworkbench-team": "${aws:PrincipalTag/sqlworkbench-team}"
                }
            },
            "Action": "sqlworkbench:TagResource",
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "AmazonRedshiftQueryEditorV2TagOnlyTeamPermissions"
        },
        {
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sqlworkbench-resource-owner": "${aws:userid}"
                },
                "ForAllValues:StringEquals": {
                    "aws:TagKeys": "sqlworkbench-team"
                }
            },
            "Action": "sqlworkbench:UntagResource",
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "AmazonRedshiftQueryEditorV2UntagOnlyTeamPermissions"
        }
    ]
}
  • query-editor-notebook-policy – See the following code:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "sqlworkbench:ListNotebooks",
                "sqlworkbench:ListNotebookVersions",
                "sqlworkbench:ListQueryExecutionHistory"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/sqlworkbench-resource-owner": "${aws:userid}"
                }
            },
            "Action": [
                "sqlworkbench:CreateNotebook",
                "sqlworkbench:ImportNotebook",
                "sqlworkbench:DuplicateNotebook"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sqlworkbench-resource-owner": "${aws:userid}"
                }
            },
            "Action": [
                "sqlworkbench:GetNotebook",
                "sqlworkbench:UpdateNotebook",
                "sqlworkbench:DeleteNotebook",
                "sqlworkbench:CreateNotebookCell",
                "sqlworkbench:DeleteNotebookCell",
                "sqlworkbench:UpdateNotebookCellContent",
                "sqlworkbench:UpdateNotebookCellLayout",
                "sqlworkbench:BatchGetNotebookCell",
                "sqlworkbench:AssociateNotebookWithTab",
                "sqlworkbench:ExportNotebook",
                "sqlworkbench:CreateNotebookVersion",
                "sqlworkbench:GetNotebookVersion",
                "sqlworkbench:CreateNotebookFromVersion",
                "sqlworkbench:DeleteNotebookVersion",
                "sqlworkbench:RestoreNotebookVersion"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/sqlworkbench-team": "${aws:PrincipalTag/sqlworkbench-team}"
                }
            },
            "Action": [
                "sqlworkbench:GetNotebook",
                "sqlworkbench:BatchGetNotebookCell",
                "sqlworkbench:AssociateNotebookWithTab"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

About the Authors

Mohammad Nejad

Mohammad Nejad leads the AWS Payments Data Platform team. He has experience leading teams, architecting designs, implementing solutions, and launching products. Currently, his team focuses on building a modern data platform on AWS to provide a complete solution for processing, analyzing, and presenting data.

Erol Murtezaoglu, a Technical Product Manager at AWS, is an inquisitive and enthusiastic thinker with a drive for self-improvement and learning. He has a strong and proven technical background in software development and architecture, balanced with a drive to deliver commercially successful products. Erol highly values the process of understanding customer needs and problems, in order to deliver solutions that exceed expectations.

Mohamed Shaaban is a Senior Software Engineer in Amazon Redshift and is based in Berlin, Germany. He has over 12 years of experience in software engineering. He is passionate about cloud services and building solutions that delight customers. Outside of work, he is an amateur photographer who loves to explore and capture unique moments.

How SafetyCulture scales unpredictable dbt Cloud workloads in a cost-effective manner with Amazon Redshift

Post Syndicated from Anish Moorjani original https://aws.amazon.com/blogs/big-data/how-safetyculture-scales-unpredictable-dbt-cloud-workloads-in-a-cost-effective-manner-with-amazon-redshift/

This post is co-written by Anish Moorjani, Data Engineer at SafetyCulture.

SafetyCulture is a global technology company that puts the power of continuous improvement into everyone’s hands. Its operations platform unlocks the power of observation at scale, giving leaders visibility and workers a voice in driving quality, efficiency, and safety improvements.

Amazon Redshift is a fully managed data warehouse service that tens of thousands of customers use to manage analytics at scale. Together with price-performance, Amazon Redshift enables you to use your data to acquire new insights for your business and customers while keeping costs low.

In this post, we share the solution SafetyCulture used to scale unpredictable dbt Cloud workloads in a cost-effective manner with Amazon Redshift.

Use case

SafetyCulture runs an Amazon Redshift provisioned cluster to support unpredictable and predictable workloads. A source of unpredictable workloads is dbt Cloud, which SafetyCulture uses to manage data transformations in the form of models. Whenever models are created or modified, a dbt Cloud CI job is triggered to test the models by materializing the models in Amazon Redshift. To balance the needs of unpredictable and predictable workloads, SafetyCulture used Amazon Redshift workload management (WLM) to flexibly manage workload priorities.

With plans for further growth in dbt Cloud workloads, SafetyCulture needed a solution that does the following:

  • Caters for unpredictable workloads in a cost-effective manner
  • Separates unpredictable workloads from predictable workloads to scale compute resources independently
  • Continues to allow models to be created and modified based on production data

Solution overview

The solution SafetyCulture used is comprised of Amazon Redshift Serverless and Amazon Redshift Data Sharing, along with the existing Amazon Redshift provisioned cluster.

Amazon Redshift Serverless caters to unpredictable workloads in a cost-effective manner because compute cost is not incurred when there is no workload. You pay only for what you use. In addition, moving unpredictable workloads into a separate Amazon Redshift data warehouse allows each Amazon Redshift data warehouse to scale resources independently.

Amazon Redshift Data Sharing enables data access across Amazon Redshift data warehouses without having to copy or move data. Therefore, when a workload is moved from one Amazon Redshift data warehouse to another, the workload can continue to access data in the initial Amazon Redshift data warehouse.

The following figure shows the solution and workflow steps:

  1. We create a serverless instance to cater for unpredictable workloads. Refer to Managing Amazon Redshift Serverless using the console for setup steps.
  2. We create a datashare called prod_datashare to allow the serverless instance access to data in the provisioned cluster. Refer to Getting started data sharing using the console for setup steps. Database names are identical to allow queries with full path notation database_name.schema_name.object_name to run seamlessly in both data warehouses.
  3. dbt Cloud connects to the serverless instance, and models that are created or modified are tested by being materialized in the default database dev, in either each user's personal schema or a pull request related schema. Instead of dev, you can use a different database designated for testing. Refer to Connect dbt Cloud to Redshift for setup steps.
  4. You can query materialized models in the serverless instance together with materialized models in the provisioned cluster to validate changes. After you validate the changes, you can implement the models from the serverless instance in the provisioned cluster.

Outcome

SafetyCulture carried out the steps to create the serverless instance and datashare, with integration to dbt Cloud, with ease. SafetyCulture also successfully ran its dbt project with all seeds, models, and snapshots materialized into the serverless instance via run commands from the dbt Cloud IDE and dbt Cloud CI jobs.

Regarding performance, SafetyCulture observed dbt Cloud workloads completing on average 60% faster in the serverless instance. Better performance could be attributed to two areas:

  • Amazon Redshift Serverless measures compute capacity using Redshift Processing Units (RPUs). Because it costs the same to run 64 RPUs in 10 minutes and 128 RPUs in 5 minutes, having a higher number of RPUs to complete a workload sooner was preferred.
  • With dbt Cloud workloads isolated on the serverless instance, dbt Cloud was configured with more threads to allow materialization of more models at once.

To determine cost, you can perform an estimation. 128 RPUs provides approximately the same amount of memory that an ra3.4xlarge 21-node provisioned cluster provides. In US East (N. Virginia), the cost of running a serverless instance with 128 RPUs is $48 hourly ($0.375 per RPU hour * 128 RPUs). In the same Region, the cost of running an ra3.4xlarge 21-node provisioned cluster on demand is $68.46 hourly ($3.26 per node hour * 21 nodes). Therefore, an accumulated hour of unpredictable workloads on a serverless instance is 29% more cost-effective than an on-demand provisioned cluster. Calculations in this example should be recalculated when performing future cost estimations because prices may change over time.

Learnings

SafetyCulture had two key learnings to better integrate dbt with Amazon Redshift, which can be helpful for similar implementations.

First, when integrating dbt with an Amazon Redshift datashare, configure INCLUDENEW=True to ease management of database objects in a schema:

ALTER DATASHARE datashare_name SET INCLUDENEW = TRUE FOR SCHEMA schema;

For example, assume the model customers.sql is materialized by dbt as the view customers. Next, customers is added to a datashare. When customers.sql is modified and rematerialized by dbt, dbt creates a new view with a temporary name, drops customers, and renames the new view to customers. Although the new view carries the same name, it’s a new database object that wasn’t added to the datashare. Therefore, customers is no longer found in the datashare.

Configuring INCLUDENEW=True allows new database objects to be automatically added to the datashare. An alternative to configuring INCLUDENEW=True and providing more granular control is the use of dbt post-hook.

Second, when integrating dbt with more than one Amazon Redshift data warehouse, define sources with database to aid dbt in evaluating the right database.

For example, assume a dbt project is used across two dbt Cloud environments to isolate production and test workloads. The dbt Cloud environment for production workloads is configured with the default database prod_db and connects to a provisioned cluster. The dbt Cloud environment for test workloads is configured with the default database dev and connects to a serverless instance. In addition, the provisioned cluster contains the table prod_db.raw_data.sales, which is made available to the serverless instance via a datashare as prod_db.raw_data.sales.

When dbt compiles a model containing the source {{ source('raw_data', 'sales') }}, the source is evaluated as database.raw_data.sales. If database is not defined for sources, dbt sets the database to the configured environment’s default database. Therefore, the dbt Cloud environment connecting to the provisioned cluster evaluates the source as prod_db.raw_data.sales, while the dbt Cloud environment connecting to the serverless instance evaluates the source as dev.raw_data.sales, which is incorrect.

Defining database for sources allows dbt to consistently evaluate the right database across different dbt Cloud environments, because it removes ambiguity.

Conclusion

After testing Amazon Redshift Serverless and Data Sharing, SafetyCulture is satisfied with the result and has started productionalizing the solution.

“The PoC showed the vast potential of Redshift Serverless in our infrastructure,” says Thiago Baldim, Data Engineer Team Lead at SafetyCulture. “We could migrate our pipelines to support Redshift Serverless with simple changes to the standards we were using in our dbt. The outcome provided a clear picture of the potential implementations we could do, decoupling the workload entirely by teams and users and providing the right level of computation power that is fast and reliable.”

Although this post specifically targets unpredictable workloads from dbt Cloud, the solution is also relevant for other unpredictable workloads, including ad hoc queries from dashboards. Start exploring Amazon Redshift Serverless for your unpredictable workloads today.


About the authors

Anish Moorjani is a Data Engineer in the Data and Analytics team at SafetyCulture. He helps SafetyCulture’s analytics infrastructure scale with the exponential increase in the volume and variety of data.

Randy Chng is an Analytics Solutions Architect at Amazon Web Services. He works with customers to accelerate the solution of their key business problems.