Tag Archives: Amazon Redshift

Perform ETL operations using Amazon Redshift RSQL

2022-03-03 Saman Irfan

Post Syndicated from Saman Irfan original https://aws.amazon.com/blogs/big-data/perform-etl-operations-using-amazon-redshift-rsql/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.

There are many ways to interact with Amazon Redshift. You can programmatically access data in your Amazon Redshift cluster by using the Amazon Redshift Data API, or you can use a web-based interface such as Amazon Redshift Query Editor V2 to author SQL queries. You can also interact with Amazon Redshift in interactive or batch mode via Amazon Redshift RSQL.

Previously, you had to use the PostgreSQL psql command line tool to interact with Amazon Redshift for simple use cases such as importing and exporting data to and from Amazon Redshift or submitting SQL in batch or interactive mode, or for advanced use cases such as performing ETL operations without writing complex ETL jobs.

Now you can use Amazon Redshift RSQL, a new command line tool to connect to an Amazon Redshift cluster and perform operations such as describe database objects, query data, view and export query results in various output file formats, run scripts that include both SQL and complex business logic, perform data cleansing and transformation operations using familiar SQL, and write ETL and ELT scripts using enhanced control flow and advanced error handling. Moreover, if you’re migrating from self-managed data warehousing engines such as Teradata to Amazon Redshift and using Teradata BTEQ (Basic Teradata Query) scripting for data automation, ETL or other tasks can now migrate to Amazon Redshift RSQL.

This post explains how to use Amazon Redshift RSQL for ETL, data cleansing and preparation, enhanced control flow, and exception and error handling.

Solution overview

Amazon Redshift RSQL enables you to connect to your Amazon Redshift cluster via a command line. It supports the capabilities of the PostgreSQL psql command line tool with an additional set of Amazon Redshift specific capabilities:

Describe properties or attributes of external tables in an AWS Glue catalog or Apache Hive metastore, external databases in Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL, and tables shared using Amazon Redshift data sharing
Use enhanced control flow commands such as \IF, \ELSEIF, \ELSE, \ENDIF, \GOTO, and \LABEL
Use single sign-on (SSO) authentication using ADFS, PingIdentity, Okta, Azure AD, or other SAML/JWT based identity providers (IdPs), as well as use browser-based SAML IdPs with multi-factor authentication (MFA)

The following are some commonly used commands in Amazon Redshift RSQL. We use these commands in this post to demonstrate different possible use cases using Amazon Redshift RSQL scripts.

\echo – Prints the specified string to the output.
\remark – An extension of the \echo command that has the ability to break the output over separate lines.
\goto – Skips all intervening commands and SQL statements and resume the processing at the specified \LABEL in sequence. The \LABEL must be a forward reference. You can’t jump to a \LABEL that lexically precedes the \GOTO.
\label – Establishes an entry point for running the program as the target for a \GOTO command.
\exit – Stops running Amazon Redshift RSQL. You can also specify an optional exit code.
\q – Logs off database sessions and exits Amazon Redshift RSQL.

Prerequisites

The following are the prerequisites for using Amazon Redshift RSQL and perform the steps in this post:

An AWS account
Linux, Windows, or MacOs operating system (Amazon Redshift RSQL is available for these operating systems)
An Amazon Redshift cluster
SQL knowledge

Additionally, complete the following prerequisite steps:

Install Amazon Redshift RSQL on your local machine. For instructions, refer to Getting started with Amazon Redshift RSQL.
Connect to the Amazon Redshift cluster.

Create the orders and orders_summary tables using the following DDL statements:

create table orders (
  O_ORDERKEY bigint NOT NULL,
  O_CUSTKEY bigint,
  O_ORDERSTATUS varchar(1),
  O_TOTALPRICE decimal(18,4),
  O_ORDERDATE Date,
  O_ORDERPRIORITY varchar(15),
  O_CLERK varchar(15),
  O_SHIPPRIORITY Integer,
  O_COMMENT varchar(79))
distkey (O_ORDERKEY)
sortkey (O_ORDERDATE);

CREATE TABLE orders_summary 
( o_orderkey bigint, 
 o_custkey bigint, 
 o_orderstatus character varying(1),
 o_totalprice integer,
 target_information character varying(14),
 rank character varying(15),
 description character varying(15)
) DISTSTYLE AUTO;

Import data into the Amazon Redshift cluster

There are multiple ways to load data into Amazon Redshift tables, including using Amazon Redshift RSQL. In this section, we review the syntax and an example of the Amazon Redshift RSQL script to load data into an Amazon Redshift table using the COPY command.

We use the following syntax:

COPY <TABLE> from <location> 
iam_role <arn>
region <region>;

We provide the following parameters:

<location> – The location of the source data to be loaded into the target table
<arn> – The AWS Identity and Access Management (IAM) role for accessing the data
<region> – The AWS Region of the source data

In the following example Amazon Redshift RSQL script, we load data from an Amazon Simple Storage Service (Amazon S3) bucket location into the orders table:

\echo 'Job started' 
  copy orders from 's3://redshift-immersionday-labs/data/orders/orders.tbl.'
  iam_role default
  region 'us-west-2' lzop delimiter '|' COMPUPDATE PRESET;
\echo 'Job Ended'
\exit 0

Enhanced control flow

You can use Amazon Redshift RSQL to define programmatic enhanced control flow and conditional blocks in your ETL script. We use the following syntax:

\if <condition> 
  <code_block1>
\else
  <code_block2>
\endif

The syntax includes the following components:

<condition> –The conditional statement
<code_block1> – The code block to run when the condition is met
<code_block2> – The code block to run when the condition is not met

In the following example script, we perform some conditional checks using if, elseif, and else commands based on the count of records from the orders table, and we display some messages based on the record count value:

\echo 'Job started'
Select count(*)  from orders \gset
select :count as count;
\if :count < 76000000 
  \echo 'Orders are less than target'
\elseif :count =76000000
  \echo 'Order met the target'
\else :count > 76000000
  \echo 'Orders exceeded the target'
\endif
\echo 'Job Ended' 
\exit 0

Error handling

You can use Amazon Redshift RSQL to define exception handling blocks in your in ETL script to handle various user-defined and system-generated error scenarios that you might encounter while running the ETL process.

The following are some of the error handling options supported in Amazon Redshift RSQL:

:ACTIVITYCOUNT – This variable is similar to the psql variable ROW_COUNT, which returns the number of rows affected by last SQL statement. However, ACTIVITYCOUNT reports the affected row count for SELECT, COPY, and UNLOAD statements, which ROW_COUNT does not. For SQL statements returning data, ACTIVITYCOUNT specifies the number of rows returned to Amazon Redshift RSQL from the database.
:ERRORCODE – This variable contains the return code for the last submitted SQL statement to the database. Error code 0 specifies that SQL statement completed without any errors.
:ERRORLEVEL – This variable is used to assign severity levels to errors. You can use these severity levels to determine a course of action. The default value is ON.
:MAXERROR – This variable designates a maximum error severity level beyond which Amazon Redshift RSQL halts job processing. If SQL statements in Amazon Redshift RSQL scripts produce an error severity level greater than the designated maxerror value, Amazon Redshift RSQL immediately exits.
:LAST_ERROR_MESSAGE – This variable contains the error message of the most recent failed query.

We use the following syntax:

\if :ERROR <> 0 
  \echo :<ERRORCODE>
  \echo :<LAST_ERROR_MESSAGE>
  \goto <codeblock1>
\else
  \goto Y
\endif

The syntax includes the following information:

<ERRORCODE> –The error code number
<LAST_ERROR_MESSAGE> – The error message of the most recent failed query
<code_block1> – The code block to run when the error condition is met
<code_block2> – The code block to run when the error condition is not met

In the following example script, we create the orders_staging table and copy records into the table from an Amazon S3 location. The script also contains an exception handling section for both the table creation and copy process to handle the possible errors encountered during the process.

\echo `date`
\echo 'Job started'
DROP TABLE IF EXISTS orders_staging;

create table orders_staging (
O_ORDERKEY bigint NOT NULL,
O_CUSTKEY bigint,
O_ORDERSTATUS varchar(1),
O_TOTALPRICE decimal(18,4),
O_ORDERDATE Date,
O_ORDERPRIORITY varchar(15),
O_CLERK varchar(15),
O_SHIPPRIORITY Integer,
O_COMMENT varchar(79))
distkey (O_ORDERKEY)
sortkey (O_ORDERDATE);

\if :ERROR <> 0 
  \echo :ERRORCODE
  \remark :LAST_ERROR_MESSAGE
  \goto QUIT_JOB
\else
  \remark '***Orders_Staging Table Created Successfully****'
  \goto COPY_DATA
\endif

\label COPY_DATA
 copy orders_staging from 's3://redshift-immersionday-  labs/data/orders/orders.tbl.'
 iam_role default
 region 'us-west-2' lzop delimiter '|' COMPUPDATE PRESET;

\if :ERROR <> 0
  \echo :ERRORCODE
  \remark :LAST_ERROR_MESSAGE
  \goto QUIT_JOB
\else 
  \remark '****Data Copied Successfully****'
\endif

\echo `date`
\echo 'Job Ended'
\exit 0 

\label QUIT_JOB
 \echo `date`
 \echo 'Job Failed'
 \exit 1

Data transformation and preparation

You can perform some common data preparation and transformation operations on your dataset using SQL statements in the Amazon Redshift RSQL ETL script. In this section, we demonstrate data transformation and preparation operations such as casting, new data column creation, and splitting an input column into multiple columns.

We use industry standard SQL statements for transforming and preparing data for downstream consumption.

In the following example script, we use a SQL statement to transform the data from the orders_staging table and insert it into the orders_summary table:

\echo `date`
\echo 'Job started'

insert into orders_summary 
(o_orderkey,
o_custkey,
o_orderstatus,
o_totalprice,
target_information,
rank,
description)
select
o_orderkey,
o_custkey,
o_orderstatus,
o_totalprice::int,
case 
when o_totalprice < 200000
then 'Target Missed'
when o_totalprice = 200000
then 'Target Met'
when o_totalprice > 200000
then 'Target Exceeds'
end as "Target_Information",
split_part (o_orderpriority,'-',1) as RANK, 
split_part (o_orderpriority,'-',2) as DESCRIPTION
from orders_staging; 

\if :ERROR <> 0 or :ACTIVITYCOUNT=0
 \echo :ERRORCODE
 \remark :LAST_ERROR_MESSAGE
 \goto QUIT_JOB
\else
 \remark 'Data Inserted into Summary Orders Table'
\endif

\echo `date`
\echo 'Job Ended'
\exit 0 

\label QUIT_JOB
 \echo `date`
 \echo 'Job Failed'
 \exit 1

Export data from an Amazon Redshift cluster and output file formatting options

You can use Amazon Redshift RSQL to extract data from one or multiple Amazon Redshift tables and write to your disk for consumption by downstream applications. Amazon Redshift RSQL uses the \EXPORT option to export the result of query to an output file.

The following are some of the useful output formating options supported in RSQL:

\rset rformat on – This command is required for all the formatting commands to take effect.
\pset format – Formats can include aligned, AsciiDoc, CSV, HTML, LaTeX, LaTeX longtable, troff-ms, unaligned, and wrapped.
\pset border – This option specifies border information for output data. Value 0 means no border, 1 means internal dividing lines, and 2 means table frame.
\rset heading – This command adds the specified heading to the output result.
\rset rtitle – This command adds the specified heading as well as current system date of the client computer.
\rset titledashes on/off – This command specifies whether to print a line of dash characters between the column names and column data returned for the SQL query.
\rset width – This command specifies the target width for each line in a report
\t – This command turns off printing column names as well as result row count at the end of the output (footers).

We use the following syntax:

\export report file=<'FilePath/Filename'>
\rset rformat on
\pset format wrapped
\pset border 2
\rset heading ‘This is Heading’
\rset width 50
\rset titledashes on
<SQL Query>
\export reset

We provide the following information:

<‘FilePath/Filename’> – The file name and path for the output file
<SQL Query> – The SQL query to run

In the following example script, we export the data from the orders_summary table using a SQL query and write it into the orders.txt text file on the local machine:

\echo `date`
\echo 'Job started'

\export report file='/<FilePath>/orders.txt'
\rset rformat on
\pset format wrapped
\rset width 50
\rset titledashes on
select * from orders_summary limit 100;

\export reset
\echo 'Job Ended'
\echo `date`
\exit 0

Automate the Amazon Redshift RSQL ETL script

One of the options to automate Amazon Redshift RSQL scripts to run on a specific schedule is via shell scripting. You can schedule the shell script via a CRON job, a command line utility.

We use the following syntax:

#!/bin/sh
rsql -D awsuser -f <RSQL_SCRIPT> <LOG_FILE>

We provide the following information:

<RSQL_SCRIPT> – The SQL scripts to un
<LOG_FILE> – The output log file

In the following example shell script, we run the Amazon Redshift RSQL script that we created and write the output log in a log file in the local machine. You can schedule the shell script via a CRON job.

#!/bin/sh
SCRIPTS_DIR="<SCRIPTS_DIR>"
LOG_DIR="<LOG_DIR>"

RSQL_SCRIPT="${SCRIPTS_DIR}/<RSQL_SCRIPT>.sql"
LOG_FILE="${LOG_DIR}/test_rsql.log"

touch $LOG_FILE

rsql -D awsuser -f ${RSQL_SCRIPT} > ${LOG_FILE}

Clean up

To avoid incurring future charges, stop the Amazon Redshift cluster created for the purpose of this post.

Conclusion

In this post, we explained how to use Amazon Redshift RSQL to perform ETL operations. We also demonstrated how to implement advanced error handling and enhanced control flow in your Amazon Redshift RSQL ETL script.

If you’re using scripts via the psql command line client on Amazon Redshift, you can operate on Amazon Redshift RSQL with no modification. If you’re migrating your Teradata BTEQ scripts to Amazon Redshift RSQL, you can use the AWS Schema Conversion Tool (AWS SCT) to automatically convert BTEQ scripts to Amazon Redshift RSQL scripts.

To learn more, refer to Amazon Redshift RSQL.

About the Authors

Saman Irfan is a Specialist Solutions Architect at Amazon Web Services. She focuses on helping customers across various industries build scalable and high-performant analytics solutions. Outside of work, she enjoys spending time with her family, watching TV series, and learning new technologies.

Sudipta Bagchi is a Specialist Solutions Architect at Amazon Web Services. He has over 12 years of experience in data and analytics, and helps customers design and build scalable and high-performant analytics solutions. Outside of work, he loves running, traveling, and playing cricket.

ETL orchestration using the Amazon Redshift Data API and AWS Step Functions with AWS SDK integration

2022-03-02 Jason Pedreza

Post Syndicated from Jason Pedreza original https://aws.amazon.com/blogs/big-data/etl-orchestration-using-the-amazon-redshift-data-api-and-aws-step-functions-with-aws-sdk-integration/

Extract, transform, and load (ETL) serverless orchestration architecture applications are becoming popular with many customers. These applications offers greater extensibility and simplicity, making it easier to maintain and simplify ETL pipelines. A primary benefit of this architecture is that we simplify an existing ETL pipeline with AWS Step Functions and directly call the Amazon Redshift Data API from the state machine. As a result, the complexity for the ETL pipeline is reduced.

As a data engineer or an application developer, you may want to interact with Amazon Redshift to load or query data with a simple API endpoint without having to manage persistent connections. The Amazon Redshift Data API allows you to interact with Amazon Redshift without having to configure JDBC or ODBC connections. This feature allows you to orchestrate serverless data processing workflows, design event-driven web applications, and run an ETL pipeline asynchronously to ingest and process data in Amazon Redshift, with the use of Step Functions to orchestrate the entire ETL or ELT workflow.

This post explains how to use Step Functions and the Amazon Redshift Data API to orchestrate the different steps in your ETL or ELT workflow and process data into an Amazon Redshift data warehouse.

AWS Lambda is typically used with Step Functions due to its flexible and scalable compute benefits. An ETL workflow has multiple steps, and the complexity may vary within each step. However, there is an alternative approach with AWS SDK service integrations, a feature of Step Functions. These integrations allow you to call over 200 AWS services’ API actions directly from your state machine. This approach is optimal for steps with relatively low complexity compared to using Lambda because you no longer have to maintain and test function code. Lambda functions have a maximum timeout of 15 minutes; if you need to wait for longer-running processes, Step Functions standard workflows allows a maximum runtime of 1 year.

You can replace steps that include a single process with a direct integration between Step Functions and AWS SDK service integrations without using Lambda. For example, if a step is only used to call a Lambda function that runs a SQL statement in Amazon Redshift, you may remove the Lambda function with a direct integration to the Amazon Redshift Data API’s SDK API action. You can also decouple Lambda functions with multiple actions into multiple steps. An implementation of this is available later in this post.

We created an example use case in the GitHub repo ETL Orchestration using Amazon Redshift Data API and AWS Step Functions that provides an AWS CloudFormation template for setup, SQL scripts, and a state machine definition. The state machine directly reads SQL scripts stored in your Amazon Simple Storage Service (Amazon S3) bucket, runs them in your Amazon Redshift cluster, and performs an ETL workflow. We don’t use Lambda in this use case.

Solution overview

In this scenario, we simplify an existing ETL pipeline that uses Lambda to call the Data API. AWS SDK service integrations with Step Functions allow you to directly call the Data API from the state machine, reducing the complexity in running the ETL pipeline.

The entire workflow performs the following steps:

Set up the required database objects and generate a set of sample data to be processed.
Run two dimension jobs that perform SCD1 and SCD2 dimension load, respectively.
When both jobs have run successfully, the load job for the fact table runs.
The state machine performs a validation to ensure the sales data was loaded successfully.

The following architecture diagram highlights the end-to-end solution:

We run the state machine via the Step Functions console, but you can run this solution in several ways:

Call the StartExecution API action
Use Amazon CloudWatch Events to trigger the state machine
Use Amazon API Gateway to trigger the state machine
Start a nested workflow run from a task state

You can deploy the solution with the provided CloudFormation template, which creates the following resources:

Database objects in the Amazon Redshift cluster:
- Four stored procedures:
  - sp_setup_sales_data_pipeline() – Creates the tables and populates them with sample data
  - sp_load_dim_customer_address() – Runs the SCD1 process on customer_address records
  - sp_load_dim_item() – Runs the SCD2 process on item records
  - sp_load_fact_sales (p_run_date date) – Processes sales from all stores for a given day
- Five Amazon Redshift tables:
  - customer
  - customer_address
  - date_dim
  - item
  - store_sales
The AWS Identity and Access Management (IAM) role StateMachineExecutionRole for Step Functions to allow the following permissions:
- Federate to the Amazon Redshift cluster through getClusterCredentials permission avoiding password credentials
- Run queries in the Amazon Redshift cluster through Data API calls
- List and retrieve objects from Amazon S3
The Step Functions state machine RedshiftETLStepFunction, which contains the steps used to run the ETL workflow of the sample sales data pipeline

Prerequisites

As a prerequisite for deploying the solution, you need to set up an Amazon Redshift cluster and associate it with an IAM role. For more information, see Authorizing Amazon Redshift to access other AWS services on your behalf. If you don’t have a cluster provisioned in your AWS account, refer to Getting started with Amazon Redshift for instructions to set it up.

When the Amazon Redshift cluster is available, perform the following steps:

Download and save the CloudFormation template to a local folder on your computer.
Download and save the following SQL scripts to a local folder on your computer:
1. sp_statements.sql – Contains the stored procedures including DDL and DML operations.
2. validate_sql_statement.sql – Contains two validation queries you can run.
Upload the SQL scripts to your S3 bucket. The bucket name is the designated S3 bucket specified in the ETLScriptS3Path input parameter.
On the AWS CloudFormation console, choose Create stack with new resources and upload the template file you downloaded in the previous step (etl-orchestration-with-stepfunctions-and-redshift-data-api.yaml).
Enter the required parameters and choose Next.
Choose Next until you get to the Review page and select the acknowledgement check box.
Choose Create stack.
Wait until the stack deploys successfully.

When the stack is complete, you can view the outputs, as shown in the following screenshot:

Run the ETL orchestration

After you deploy the CloudFormation template, navigate to the stack detail page. On the Resources tab, choose the link for RedshiftETLStepFunction to be redirected to the Step Functions console.

The RedshiftETLStepFunction state machine runs automatically, as outlined in the following workflow:

read_sp_statement and run_sp_deploy_redshift – Performs the following actions:
1. Retrieves the sp_statements.sql from Amazon S3 to get the stored procedure.
2. Passes the stored procedure to the batch-execute-statement API to run in the Amazon Redshift cluster.
3. Sends back the identifier of the SQL statement to the state machine.
wait_on_sp_deploy_redshift – Waits for at least 5 seconds.
run_sp_deploy_redshift_status_check – Invokes the Data API’s describeStatement to get the status of the API call.
is_run_sp_deploy_complete – Routes the next step of the ETL workflow depending on its status:
1. FINISHED – Stored procedures are created in your Amazon Redshift cluster.
2. FAILED – Go to the sales_data_pipeline_failure step and fail the ETL workflow.
3. All other status – Go back to the wait_on_sp_deploy_redshift step to wait for the SQL statements to finish.
setup_sales_data_pipeline – Performs the following steps:
1. Initiates the setup stored procedure that was previously created in the Amazon Redshift cluster.
2. Sends back the identifier of the SQL statement to the state machine.
wait_on_setup_sales_data_pipeline – Waits for at least 5 seconds.
setup_sales_data_pipeline_status_check – Invokes the Data API’s describeStatement to get the status of the API call.
is_setup_sales_data_pipeline_complete – Routes the next step of the ETL workflow depending on its status:
1. FINISHED – Created two dimension tables (customer_address and item) and one fact table (sales).
2. FAILED – Go to the sales_data_pipeline_failure step and fail the ETL workflow.
3. All other status – Go back to the wait_on_setup_sales_data_pipeline step to wait for the SQL statements to finish.
run_sales_data_pipeline – LoadItemTable and LoadCustomerAddressTable are two parallel workflows that Step Functions runs at the same time. The workflows run the stored procedures that were previously created. The stored procedure loads the data into the item and customer_address tables. All other steps in the parallel sessions follow the same concept as described previously. When both parallel workflows are complete, run_load_fact_sales runs.
run_load_fact_sales – Inserts data into the store_sales table that was created in the initial stored procedure.
Validation – When all the ETL steps are complete, the state machine reads a second SQL file from Amazon S3 (validate_sql_statement.sql) and runs the two SQL statements using the batch_execute_statement method.

The implementation of the ETL workflow is idempotent. If it fails, you can retry the job without any cleanup. For example, it recreates the stg_store_sales table each time, then deletes the target table store_sales with the data for the particular refresh date each time.

The following diagram illustrates the state machine workflow:

In this example, we use the task state resource arn:aws:states:::aws-sdk:redshiftdata:[apiAction] to call the corresponding Data API action. The following table summarizes the Data API actions and their corresponding AWS SDK integration API actions.

Amazon Redshift Data API Actions	AWS SDK Integrations API Actions
BatchExecuteStatement	`batchExecuteStatement`
ExecuteStatement	`executeStatement`
DescribeStatement	`describeStatement`
CancelStatement	`cancelStatement`
GetStatementResult	`getStatementResult`
DescribeTable	`describeTable`
ListDatabases	`listDatabases`
ListSchemas	`listSchemas`
ListStatements	`listStatements`
ListTables	`listTables`

To use AWS SDK integrations, you specify the service name and API call, and, optionally, a service integration pattern. The AWS SDK action is always camel case, and parameter names are Pascal case. For example, you can use the Step Functions action batchExecuteStatement to run multiple SQL statements in a batch as a part of a single transaction on the Data API. The SQL statements can be SELECT, DML, DDL, COPY, and UNLOAD.

Validate the ETL orchestration

The entire ETL workflow takes approximately 1 minute to run. The following screenshot shows that the ETL workflow completed successfully.

When the entire sales data pipeline is complete, you may go through the entire execution event history, as shown in the following screenshot.

Schedule the ETL orchestration

After you validate the sales data pipeline, you may opt to run the data pipeline on a daily schedule. You can accomplish this with Amazon EventBridge.

On the EventBridge console, create a rule to run the RedshiftETLStepFunction state machine daily.
To invoke the RedshiftETLStepFunction state machine on a schedule, choose Schedule and define the appropriate frequency needed to run the sales data pipeline.
Specify the target state machine as RedshiftETLStepFunction and choose Create.

You can confirm the schedule on the rule details page.

Clean up

Clean up the resources created by the CloudFormation template to avoid unnecessary cost to your AWS account. You can delete the CloudFormation stack by selecting the stack on the AWS CloudFormation console and choosing Delete. This action deletes all the resources it provisioned. If you manually updated a template-provisioned resource, you may see some issues during cleanup; you need to clean these up independently.

Limitations

The Data API and Step Functions AWS SDK integration offers a robust mechanism to build highly distributed ETL applications within minimal developer overhead. Consider the following limitations when using the Data API and Step Functions:

Conclusion

In this post, we demonstrated how to build an ETL orchestration using the Amazon Redshift Data API and Step Functions with AWS SDK integration.

To learn more about the Data API, see Using the Amazon Redshift Data API to interact with Amazon Redshift clusters and Using the Amazon Redshift Data API.

About the Authors

Jason Pedreza is an Analytics Specialist Solutions Architect at AWS with over 13 years of data warehousing experience. Prior to AWS, he built data warehouse solutions at Amazon.com. He specializes in Amazon Redshift and helps customers build scalable analytic solutions.

Bipin Pandey is a Data Architect at AWS. He loves to build data lake and analytics platforms for his customers. He is passionate about automating and simplifying customer problems with the use of cloud solutions.

David Zhang is an AWS Solutions Architect who helps customers design robust, scalable, and data-driven solutions across multiple industries. With a background in software development, David is an active leader and contributor to AWS open-source initiatives. He is passionate about solving real-world business problems and continuously strives to work from the customer’s perspective. Feel free to connect with him on LinkedIn.

How to evaluate the benefits of AQUA for your Amazon Redshift workloads

2022-02-19 Dinesh Kumar

Post Syndicated from Dinesh Kumar original https://aws.amazon.com/blogs/big-data/how-to-evaluate-the-benefits-of-aqua-for-your-amazon-redshift-workloads/

Amazon Redshift is the cloud data warehouse of choice for tens of thousands of customers, who use it to analyze exabytes of data to gain business insights. Customers have been asking us for better performance at scale as the volume, variety, velocity, and veracity of their data grows. We have added several features to Amazon Redshift that enable you to get up to three times better price performance with Amazon Redshift than other cloud data warehouses.

In addition, we launched AQUA (Advanced Query Accelerator) for Amazon Redshift to help you cost-effectively run analytics at the new scale of data. AQUA is a distributed and hardware-accelerated cache that enables Amazon Redshift to run an order of magnitude faster than other enterprise cloud data warehouses by automatically boosting certain types of queries. AQUA uses AWS-designed processors with AWS Nitro chips adapted to speed up data encryption and compression, and custom analytics processors, implemented in FPGAs, to accelerate operations such as scans, filtering, and aggregation. AQUA is available with the RA3.16xlarge, RA3.4xlarge, and RA3.xlplus nodes at no additional charge and requires no code changes. You can enable AQUA for your existing Amazon Redshift RA3 clusters or launch a new AQUA-enabled RA3 cluster via the AWS Management Console, API, or AWS Command Line Interface (AWS CLI). To learn more about AQUA, see Working with AQUA (Advanced Query Accelerator).

This post walks you through steps to quantify expected benefits from AQUA for Amazon Redshift for your workloads. We cover the methodology used for testing AQUA and share the scripts, a sample dataset, and queries so you can test AQUA in your own environment. We have published scripts on GitHub (along with a README file), which you need as you follow the steps in this post. Download all scripts to your working directory from where you intend to connect to your Redshift clusters.

Solution overview

At a high-level, the AQUA test process involves the following steps:

Create a test cluster to evaluate AQUA and make sure that AQUA is turned on. For instructions, see Working with AQUA (Advanced Query Accelerator).
Analyze your workload for AQUA and capture eligible queries. To learn more about the types of queries accelerated by AQUA, refer to When does Amazon Redshift use AQUA to run queries?
Run the workload with AQUA activated and deactivated on your test cluster.
Compare performance results.

This approach is a straightforward yet practical way to measure the performance gain you can expect to get for your workload by turning on AQUA.

With the exception of one workload parsing and analyzing script (aqua_capture_query.sh), you can run the rest of the scripts on a test cluster. We recommend using a test cluster to minimize the impact to your production cluster.

Create a test cluster

Create a snapshot from a production cluster that has read-heavy workloads and restore it as test cluster.

Analyze your workload and capture eligible queries

Run aqua_capture_query.sh on the production cluster to determine the queries suitable for acceleration by AQUA. We recommend choosing workloads with SELECT queries that use LIKE or SIMILAR TO functions in the WHERE clause and scan, filter, and aggregate large datasets. aqua_capture_query.sh scans the query history on your production cluster and captures queries that AQUA can accelerate.

The script runtime may exceed several minutes depending on the selected time interval (analyze_starttime and analyze_endtime), size of the cluster, and workload complexity. We suggest restricting the script runtime by limiting the time interval to the lesser of your workload runtime or 3 hours. Alternately, you can evaluate AQUA using the Amazon Reviews sample dataset, which we demonstrate later in this post.

aqua_capture_query.sh saves an output file named aqua_eligible_queries to your working directory and contains a subset of most suitable AQUA-eligible queries.

The following is a sample output by the script:

select count(*) from amazon_reviews where product_title SIMILAR TO '%lap%' group by star_rating ORDER BY star_rating desc;
select count(*) from amazon_reviews where product_title ilike '%e%|%E%' or customer_ID like '3%__%45__3';

If your workload history doesn’t have enough AQUA-eligible queries, the script reports no eligible queries found. If this happens, run the script with different date/time parameters. If you still don’t see any queries, you can try using the sample dataset and queries provided in this post. For more information about AQUA-eligible queries, refer to When does Amazon Redshift use AQUA to run queries?

Run the workload on your test cluster

Run aqua_execute_query.sh on your test cluster. The script runs the captured queries on your cluster repeatedly to get consistent performance by reducing the impact of runtime difference due to environmental factors with AQUA activated and deactivated. The script records the start date/time and end date/time to a file named workload_datetime.txt in your working directory.

Compare performance results

When aqua_execute_query.sh script is complete, run aqua_perf_compare.sh, which generates a CSV file named aqua_benefit in your working directory. The following table summarizes the sample output.

Query ID Amazon Redshift with AQUA turned on	Query ID Amazon Redshift with AQUA turned off	Runtime in seconds with AQUA turned on	Runtime in seconds with AQUA turned off	Speedup (Column C/Column B)
1153194	1153370	2.7	59.4	22.0
1153214	1153456	22.8	104.9	4.6
334629	334631	1.66643	35.8	22.25
334850	334672	1.71297	26.9	15.7
334984	334998	1.85051	26.9	14.5
334830	334793	6.47884	87.2	13.4
334828	334740	6.6974	87.1	13

This table shows the query identifiers and runtime of the queries with AQUA activated and deactivated. You can compare the benefits offered by AQUA by reviewing the speedup column.

Example use case with the Amazon Reviews dataset

To test AQUA with the Amazon Reviews sample dataset, perform the following steps:

Create a two-node RA3.4xlarge cluster by issuing the following command:

aws redshift create-cluster --cluster-identifier test-amazon-reviews --node-type ra3.4xlarge --number-of-nodes 2 --master-username adminuser --master-user-password <xxpasswordxx> --aqua-configuration-status enabled

Create a test database on our Amazon Redshift cluster by issuing the following command:
```
CREATE DATABASE TestDB;
```
Load the table with the Amazon Reviews dataset by running the script load_amazon_sentiments_data.sql.

Run a few AQUA-eligible queries (similar to the following) multiple times with and without AQUA activated using the script execute_test_queries.sh:

select count(*) from amazon_reviews WHERE product_title SIMILAR TO '%lap%' or product_title SIMILAR TO '%hope%' or product_title SIMILAR TO '%nice%' or product_title SIMILAR TO '%soa%';

More evaluation queries are available on the GitHub repo.

Run aqua_perf_compare.sh on the test cluster.

The following table shows that AQUA accelerated the queries 5–22 times faster.

Query ID Amazon Redshift with AQUA turned on	Query ID Amazon Redshift with AQUA turned off	Runtime in seconds with AQUA turned on	Runtime in seconds with AQUA turned off	Speedup (Column C/Column B)
364202	364217	2.33411	51.56207	22.09073
334984	334998	1.85051	26.9	14.5
334830	334793	6.47884	87.2	13.4
334828	334740	6.6974	87.1	13

Summary

This post provides a straightforward yet practical way to measure the performance gain you can expect to get for your workload by turning on AQUA. You can use the scripts provided and test with your own dataset and queries, or use the sample dataset and queries to determine AQUA’s impact.

We continue to invest and launch new capabilities like AQUA for Amazon Redshift to make sure Amazon Redshift continues to improve as your data warehouse needs grow. For AQUA, we continue to add hardware acceleration for more SQL operators, functions, predicates, data types, and file formats so more scans, filters, and aggregations can be pushed down to AQUA. Pushdowns to AQUA remain transparent so Amazon Redshift decides when to push queries down to AQUA to take advantage of hardware acceleration. And when queries don’t get pushed down to AQUA, they continue to run on Amazon Redshift as before.

We invite you to test AQUA for yourself and share the findings.

About the Authors

Dinesh Kumar is a Database Engineer at AWS focusing on Amazon AQUA. He works with customers to build highly scalable data warehouse and high performant database solutions. Outside work, he enjoys gardening and spending time with his family.

Manan Goel is a Product Go-To-Market Leader for AWS Analytics Services including Amazon Redshift & AQUA at AWS. He has more than 25 years of experience and is well versed with databases, data warehousing, business intelligence, and analytics. Manan holds a MBA from Duke University and a BS in Electronics & Communications engineering.

Export JSON data to Amazon S3 using Amazon Redshift UNLOAD

2022-02-16 Dipankar Kushari

Post Syndicated from Dipankar Kushari original https://aws.amazon.com/blogs/big-data/export-json-data-to-amazon-s3-using-amazon-redshift-unload/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL. Amazon Redshift offers up to three times better price performance than any other cloud data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as high-performance business intelligence (BI) reporting, dashboarding applications, data exploration, and real-time analytics.

As the amount of data generated by IoT devices, social media, and cloud applications continues to grow, organizations are looking to easily and cost-effectively analyze this data with minimal time-to-insight. A vast amount of this data is available in semi-structured format and needs additional extract, transform, and load (ETL) processes to make it accessible or to integrate it with structured data for analysis. Amazon Redshift powers the modern data architecture, which enables you to query data across your data warehouse, data lake, and operational databases to gain faster and deeper insights not possible otherwise. With a modern data architecture, you can store data in semi-structured format in your Amazon Simple Storage Service (Amazon S3) data lake and integrate it with structured data on Amazon Redshift. This allows you to make this data available to other analytics and machine learning applications rather than locking it in a silo.

In this post, we discuss the UNLOAD feature in Amazon Redshift and how to export data from an Amazon Redshift cluster to JSON files on an Amazon S3 data lake.

JSON support features in Amazon Redshift

Amazon Redshift features such as COPY, UNLOAD, and Amazon Redshift Spectrum enable you to move and query data between your data warehouse and data lake.

With the UNLOAD command, you can export a query result set in text, JSON, or Apache Parquet file format to Amazon S3. UNLOAD command is also recommended when you need to retrieve large result sets from your data warehouse. Since UNLOAD processes and exports data in parallel from Amazon Redshift’s compute nodes to Amazon S3, this reduces the network overhead and thus time in reading large number of rows. When using the JSON option with UNLOAD, Amazon Redshift unloads to a JSON file with each line containing a JSON object, representing a full record in the query result. In the JSON file, Amazon Redshift types are unloaded as the closest JSON representation. For example, Boolean values are unloaded as true or false, NULL values are unloaded as null, and timestamp values are unloaded as strings. If a default JSON representation doesn’t suit a particular use case, you can modify it by casting to the desired type in the SELECT query of the UNLOAD statement.

Additionally, to create a valid JSON object, the name of each column in the query result must be unique. If the column names in the query result aren’t unique, the JSON UNLOAD process fails. To avoid this, we recommend using proper column aliases so that each column in the query result remains unique while getting unloaded. We illustrate this behavior later in this post.

With the Amazon Redshift SUPER data type, you can store data in JSON format on local Amazon Redshift tables. This way, you can process the data without any network overhead and use Amazon Redshift schema properties to optimally save and query semi structured data locally. In addition to achieving low latency, you can also use the SUPER data type when your query requires strong consistency, predictable query performance, complex query support, and ease of use with evolving schemas and schemaless data. Amazon Redshift supports writing nested JSON when the query result contains SUPER columns.

Updating and maintaining data with constantly evolving schemas can be challenging and adds extra ETL steps to the analytics pipeline. The JSON file format provides support for schema definition, is lightweight, and is widely used as a data transfer mechanism by different services, tools, and technologies.

Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) is a distributed, open-source search and analytics suite used for a broad set of use cases like real-time application monitoring, log analytics, and website search. It uses JSON as the supported file format for data ingestion. The ability to unload data natively in JSON format from Amazon Redshift into the Amazon S3 data lake reduces complexity and additional data processing steps if that data needs to be ingested into Amazon OpenSearch Service for further analysis.

This is one example of how seamless data movement can help you build an integrated data platform with a data lake on Amazon S3, data warehouse on Amazon Redshift and search and log analytics using Amazon OpenSearch Service and any other JSON-oriented downstream analytics solution. For more information about the Lake House approach, see Build a Lake House Architecture on AWS.

Examples of Amazon Redshift JSON UNLOAD

In this post, we show you the following different scenarios:

Example 1 – Unload customer data in JSON format into Amazon S3, partitioning output files into partition folders, following the Apache Hive convention, with customer birth month as the partition key. We make a few changes to the columns in the SELECT statement of the UNLOAD command:
- Convert the c_preferred_cust_flag column from character to Boolean
- Remove leading and trailing spaces from the c_first_name, c_last_name, and c_email_address columns using the Amazon Redshift built-in function btrim
Example 2 – Unload line item data (with SUPER column) in JSON format into Amazon S3 with data not partitioned
Example 3 – Unload line item data (With SUPER column) in JSON format into Amazon S3, partitioning output files into partition folders, following the Apache Hive convention, with customer key as the partition key

For the first example, we used the customer table and data from the TPCDS dataset. For examples involving table with SUPER column, we used the customer_orders_lineitem table and data from the following tutorial.

Example 1: Export customer data

For this example, we used the customer table and data from TPCDS dataset. We created the database schema and customer table, and copied data into it. See the following code:

-- Created a new database
create schema json_unload_demo; 

-- created and populated customer table in the new schema

create table json_unload_demo.customer
(
  c_customer_sk int4 not null ,                 
  c_customer_id char(16) not null ,             
  c_current_cdemo_sk int4 ,   
  c_current_hdemo_sk int4 ,   
  c_current_addr_sk int4 ,    
  c_first_shipto_date_sk int4 ,                 
  c_first_sales_date_sk int4 ,
  c_salutation char(10) ,     
  c_first_name char(20) ,     
  c_last_name char(30) ,      
  c_preferred_cust_flag char(1) ,               
  c_birth_day int4 ,          
  c_birth_month int4 ,        
  c_birth_year int4 ,         
  c_birth_country varchar(20) ,                 
  c_login char(13) ,          
  c_email_address char(50) ,  
  c_last_review_date_sk int4 ,
  primary key (c_customer_sk)
) distkey(c_customer_sk);

copy json_unload_demo.customer from 's3://redshift-downloads/TPC-DS/2.13/3TB/customer/' 
iam_role '<<AWS IAM role attached to your amazon redshift cluster>>' 
gzip delimiter '|' EMPTYASNULL;

You can create a default AWS Identity and Access Management (IAM) role for your Amazon Redshift cluster to copy from and unload to your Amazon S3 location. For more information, see Use the default IAM role in Amazon Redshift to simplify accessing other AWS services.

In this example, we unloaded customer data for all customers with birth year 1992 in JSON format into Amazon S3 without any partitions. We make the following changes to the UNLOAD statement:

Convert the c_preferred_cust_flag column from character to Boolean
Remove leading and trailing spaces from the c_first_name, c_last_name, and c_email_address columns using the btrim function
Set the maximum size of exported files in Amazon S3 to 64 MB

See the following code:

unload ('SELECT c_customer_sk,
    c_customer_id ,
    c_current_cdemo_sk ,
    c_current_hdemo_sk ,
    c_current_addr_sk ,
    c_first_shipto_date_sk ,
    c_first_sales_date_sk ,
    c_salutation ,
    btrim(c_first_name),
    btrim(c_last_name),
    c_birth_day ,
    c_birth_month ,
    c_birth_year ,
    c_birth_country ,
    c_last_review_date_sk,
    DECODE(c_preferred_cust_flag, ''Y'', TRUE, ''N'', FALSE)::boolean as c_preferred_cust_flag_bool,
    c_login, 
    btrim(c_email_address) 
    from customer where c_birth_year = 1992;')
to 's3://<<Your Amazon S3 Bucket>>/non-partitioned/non-super/customer/' 
FORMAT JSON 
partition by (c_birth_month)  include
iam_role '<<AWS IAM role attached to your amazon redshift cluster>>'
MAXFILESIZE 64 MB;

When we ran the UNLOAD command, we encountered an error because the columns that used the btrim function all attempted to be exported as btrim (which is the default behavior of Amazon Redshift when the same function is applied to multiple columns that are selected together). To avoid this error, we need to use a unique column alias for each column where the btrim function was used.

If we select the c_first_name, c_last_name, and c_email_address columns by applying the btrim function and c_preferred_cust_flag, we can convert them from character to Boolean.

We ran the following query in Amazon Redshift Query Editor v2:

SELECT btrim(c_first_name) ,
    btrim(c_last_name),
    btrim(c_email_address) , 
    DECODE(c_preferred_cust_flag, 'Y', TRUE, 'N', FALSE)::boolean c_preferred_cust_flag_bool  
    from customer where c_birth_year = 1992 limit 10;

All three columns that used the btrim function are set as btrim in the output result instead of their respective column name.

An error occurred in UNLOAD because we didn’t use a column alias.

We added column aliases in the following code:

unload ('SELECT c_customer_sk,
    c_customer_id ,
    c_current_cdemo_sk ,
    c_current_hdemo_sk ,
    c_current_addr_sk ,
    c_first_shipto_date_sk ,
    c_first_sales_date_sk ,
    c_salutation ,
    btrim(c_first_name) as c_first_name,
    btrim(c_last_name) as c_last_name,
    c_birth_day ,
    c_birth_month ,
    c_birth_year ,
    c_birth_country ,
    c_last_review_date_sk,
    DECODE(c_preferred_cust_flag, ''Y'', TRUE, ''N'', FALSE)::boolean as c_preferred_cust_flag_bool,
    c_login, 
    btrim(c_email_address) as c_email_addr_trimmed 
    from customer where c_birth_year = 1992;')
to 's3://<<Your Amazon S3 Bucket>>/non-partitioned/non-super/customer/' 
FORMAT JSON 
partition by (c_birth_month)  include
iam_role '<<AWS IAM role attached to your amazon redshift cluster>>'
MAXFILESIZE 64 MB;

After we added column aliases, the UNLOAD command completed successfully and files were exported to the desired location in Amazon S3.

The following screenshot shows data is unloaded in JSON format partitioning output files into partition folders, following the Apache Hive convention, with customer birth month as the partition key into Amazon S3 from the Amazon Redshift customer table.

A query with Amazon S3 Select shows a snippet of data in the JSON file on Amazon S3 that was unloaded.

The column aliases c_first_name, c_last_name, and c_email_addr_trimmed were written into the JSON record as per the SELECT query. Boolean values were saved in c_preferred_cust_flag_bool as well.

Examples 2 and 3: Using the SUPER column

For the next two examples, we used the customer_orders_lineitem table and data. We created the customer_orders_lineitem table and copied data into it with the following code:

-- Created a new table with SUPER column

CREATE TABLE JSON_unload_demo.customer_orders_lineitem
(c_custkey bigint
,c_name varchar
,c_address varchar
,c_nationkey smallint
,c_phone varchar
,c_acctbal decimal(12,2)
,c_mktsegment varchar
,c_comment varchar
,c_orders super
);

-- Loaded data into the new table
COPY json_unload_demo.customer_orders_lineitem 
FROM 's3://redshift-downloads/semistructured/tpch-nested/data/json/customer_orders_lineitem'
IAM_ROLE '<<AWS IAM role attached to your amazon redshift cluster>>'
FORMAT JSON 'auto';

Next, we ran a few queries to explore the customer_orders_lineitem table’s data:

select * from json_unload_demo.customer_orders_lineitem;

select c_orders from json_unload_demo.customer_orders_lineitem;

SELECT attr as attribute_name, val as object_value FROM json_unload_demo.customer_orders_lineitem c, c.c_orders o, UNPIVOT o AS val AT attr;

Example 2: Without partitions

In this example, we unloaded all the rows of the customer_orders_lineitem table in JSON format into Amazon S3 without any partitions:

unload ('select * from json_unload_demo.customer_orders_lineitem;')
to 's3://<<Your Amazon S3 Bucket>>/non-partitioned/super/customer-order-lineitem/'
FORMAT JSON
iam_role '<<AWS IAM role attached to your amazon redshift cluster>>';

After we run the UNLOAD command, the data is available in the desired Amazon S3 location. The following screenshot shows data is unloaded in JSON format without any partitions into Amazon S3 from the Amazon Redshift customer_orders_lineitem table.

A query with Amazon S3 Select shows a snippet of data in the JSON file on Amazon S3 that was unloaded.

Example 3: With partitions

In this example, we unloaded all the rows of the customer_orders_lineitem table in JSON format partitioning output files into partition folders, following the Apache Hive convention, with customer key as the partition key into Amazon S3:

unload ('select * from json_unload_demo.customer_orders_lineitem;')
to 's3://<<Your Amazon S3 Bucket>>/partitioned/super/customer-order-lineitem-1/'
FORMAT JSON
partition by (c_custkey) include
iam_role '<<AWS IAM role attached to your amazon redshift cluster>>';

After we run the UNLOAD command, the data is available in the desired Amazon S3 location. The following screenshot shows data is unloaded in JSON format partitioning output files into partition folders, following the Apache Hive convention, with customer key as the partition key into Amazon S3 from the Amazon Redshift customer_orders_lineitem table.

A query with Amazon S3 Select shows a snippet of data in the JSON file on Amazon S3 that got unloaded.

Conclusion

In this post, we showed how you can use the Amazon Redshift UNLOAD command to unload the result of a query to one or more JSON files into your Amazon S3 location. We also showed how you can partition the data using your choice of partition key while you unload the data. You can use this feature to export data to JSON files into Amazon S3 from your Amazon Redshift cluster or your Amazon Redshift Serverless endpoint to make your data processing simpler and build an integrated data analytics platform.

About the Authors

Dipankar Kushari is a Senior Analytics Solutions Architect with AWS.

Sayali Jojan is a Senior Analytics Solutions Architect with AWS. She has 7 years of experience working with customers to design and build solutions on the AWS Cloud, with a focus on data and analytics.

Cody Cunningham is a Software Development Engineer with AWS, working on data ingestion for Amazon Redshift.

Amazon Redshift at AWS re:Invent 2021 recap

2022-02-10 Sunaina Abdul Salah

Post Syndicated from Sunaina Abdul Salah original https://aws.amazon.com/blogs/big-data/amazon-redshift-at-aws-reinvent-2021-recap/

The annual AWS re:Invent learning conference is an exciting time full of new product and program launches. At the first re:Invent conference in 2012, AWS announced Amazon Redshift. Since then, tens of thousands of customers have started using Amazon Redshift as their preferred cloud data warehouse. At re:Invent 2021, AWS announced several new Amazon Redshift features that bring easy analytics for everyone while continuing to increase performance and help you break through data silos to analyze all the data in your data warehouse. With re:Invent packed with information and new announcements, it’s easy to miss the best of the updates in Amazon Redshift. In this post, we summarize these announcements, along with resources for you to get more details.

AWS takes analytics serverless

Adam Selipsky took the re:Invent keynote stage on November 29, 2021, for the first time as AWS CEO, announcing a string of innovations along the theme of pathfinders. He shared the story about Florence Nightingale, a pathfinder and data geek who had a passion for statistics, who collected and analyzed data on sanitation impact on mortality rates to mobilize the army at the time to approve new hygiene standards and restructure health efforts. The theme set the stage for the AWS modern data strategy, which enables everyone in the organization to find patterns in data and mobilize their business with data with the right tools for the right job. Selipsky’s announcement of serverless options for four AWS Analytics services, including Amazon Redshift, emphasized the growing need to be able to run complex analytics without touching infrastructure or managing capacity for your applications. Watch Adam Selipsky’s keynote announcement of AWS Analytics services with new serverless options (00:52).

Amazon Redshift: Under the hood

This year, VP of AWS Machine Learning Services Swami Sivasubramanian expanded his keynote to include the entire data and machine learning (ML) journey for all workloads and data types. In addition to mentioning the new serverless option announcement for Amazon Redshift in preview, he dove under the hood of Amazon Redshift to explore core innovations since its inception in 2012 that drove the service to become a best-in-class, petabyte-scale data warehouse for tens of thousands of customers. Sivasubramanian touched on features that enable Amazon Redshift to power large workloads with top-notch performance, including RA3 instances, AQUA (Advanced Query Accelerator), materialized views, short query acceleration, and automatic workload management. He also elaborated on how Amazon Redshift ML uses SQL to make ML predictions from your data warehouse. Neeraja Rentachintala then took you through a demo set in a gaming environment to outline the value of Amazon Redshift Serverless and Amazon QuickSight Q.

Reinvent your business for the future with AWS Analytics

Rahul Pathak, VP of AWS Analytics Services, talked in detail about how you can put your data to work and transform your businesses through end-to-end AWS Analytics services that help you modernize, unify, and innovate. He talked about how customers like Zynga, Schneider Electric, Magellan Rx, Jobcase, and Nasdaq are benefitting from Amazon Redshift with innovations in performance as data volumes grow. He elaborated on how Amazon Redshift helps you analyze all your data with AWS service integration, and touched upon the quest to make analytics easy for everyone with the introduction of the new Query Editor v2, Amazon Redshift Serverless, and data sharing capabilities. Watch the session to gain a deeper understanding of AWS Analytics services.

What’s new with Amazon Redshift, featuring Schneider Electric

This session is a must-watch for every existing and new Amazon Redshift customer to get a full understanding of the breadth and depth of features that Amazon Redshift offers along the dimensions of easy analytics for everyone, analyze all your data, and performance at scale. Eugene Kawamoto, Director of Amazon Redshift Product, goes into detail about all the new launches in 2021, tracing the architectural evolution of Amazon Redshift to the new serverless option. He explores how Amazon Redshift integrates with other popular AWS services to help you break through data silos, analyze all your data, and derive value from this data.

Democratizing data for self-service analytics and ML

Access to all your data for fast analytics at scale is foundational for 360-degree projects involving data engineers, database developers, data analysts, data scientists, business intelligence professionals, and the line of business. In this session, Greg Khairallah and Shruti Worlikar, leaders in the AWS Analytics GTM organization, team up with our customer Jobcase, represented by Senior Scientist Clay Martin, to show how easy-to-use ML can help your organization imagine new products or services, transform your customer experiences, streamline your business operations, and improve your decision-making. A secure, integrated platform that’s easy to use and supports nonproprietary data formats can improve collaboration through data sharing and also improve customer responsiveness.

Introducing Amazon Redshift Serverless

Following the announcements in the keynotes, this session takes you through the new serverless option on Amazon Redshift, which enables you to get started in seconds and run data warehousing and analytics workloads at scale without worrying about data warehouse management. In this session, learn from Yan Leshinsky, VP of Amazon Redshift, and Neeraja Rentachintala, Principal Product Manager for Amazon Redshift, on how Amazon Redshift Serverless automatically provisions data warehouse capacity and intelligently scales the underlying resources to deliver consistently high performance and simplified operations for even the most demanding and volatile workloads.

Introduction to AWS Data Exchange for Amazon Redshift

We’ve talked about Amazon Redshift Serverless several times, but there were some exciting announcements from the service leading into re:Invent. We launched AWS Data Exchange for Amazon Redshift, which allows you to combine third-party data found on AWS Data Exchange with your own data from your Amazon Redshift cloud data warehouse, requiring no ETL and accelerating time to value. This provides a powerful enhancement to the strong data-sharing capabilities in Amazon Redshift to share secure, live data across Regions, accounts, and organizations. In this session, product managers Neeraja Rentachintala and Ryan Waldorf walk through the value of this integration and how to access and analyze a provider’s data with data providers, which enables licensing this access to their Amazon Redshift cloud data warehouses. Alex Bohl, Director of Data Innovation from Mathematica, joins them in this session to provide a real-world example.

Additional sessions

In addition to these sessions, the hands-on-workshops and chalk talks were packed with customers looking to learn more about Amazon Redshift’s capabilities in ML with Redshift ML, concurrency scaling, and much more. These sessions were not recorded.

Get started with Amazon Redshift

Learn more about the latest and greatest of what Amazon Redshift offers you, and explore the following resources for more information about new releases:

What’s new in Amazon Redshift – 2021, a year in review
Amazon Redshift Serverless (in preview) – AWS News Blog, What’s new, LinkedIn announcement
Amazon Redshift Query Editor v2 – AWS News Blog, What’s new, Demo
AWS Data Exchange for Amazon Redshift – AWS News Blog, What’s new
Redshift ML – AWS News Blog, What’s new, Case study
Amazon Redshift data sharing – Featuring cross-region and cross-account data sharing

About the Author

Sunaina Abdul Salah leads product marketing for Amazon Redshift.

Build a REST API to enable data consumption from Amazon Redshift

2022-02-07 Jeetesh Srivastva

Post Syndicated from Jeetesh Srivastva original https://aws.amazon.com/blogs/big-data/build-a-rest-api-to-enable-data-consumption-from-amazon-redshift/

API (Application Programming Interface) is a design pattern used to expose a platform or application to another party. APIs enable programs and applications to communicate with platforms and services, and can be designed to use REST (REpresentational State Transfer) as a software architecture style.

APIs in OLTP (online transaction processing) are called frequently (tens to hundreds of times per second), delivering small payloads (output) in the order of a few bytes to kilobytes. However, OLAP (online analytical processing) has the ratio flipped. OLAP APIs have a low call volume but large payload (100 MB to several GBs). This pattern adds new challenges, like asynchronous processing, managing compute capacity, and scaling.

In this post, we walk through setting up an application API using the Amazon Redshift Data API, AWS Lambda, and Amazon API Gateway. The API performs asynchronous processing of user requests, sends user notifications, saves processed data in Amazon Simple Storage Service (Amazon S3), and returns a presigned URL for the user or application to download the dataset over HTTPS. We also provide an AWS CloudFormation template to help set up resources, available on the GitHub repo.

Solution overview

In our use case, Acme sells flowers on its site acmeflowers.com and collects reviews from customers. The website maintains a self-service inventory, allowing different producers to send flowers and other materials to acmeflowers.com when their supplies are running low.

Acme uses Amazon Redshift as their data warehouse. Near-real-time changes and updates to their inventory flow to Amazon Redshift, showing accurate availability of stock. The table PRODUCT_INVENTORY contains updated data. Acme wants to expose inventory information to partners in a cost-effective, secure way for inventory management process. If Acme’s partners are using Amazon Redshift, cross-account data sharing could be a potential option. If partners aren’t using Amazon Redshift, they could use the solution described in this post.

The following diagram illustrates our solution architecture:

The workflow contains the following steps:

The client application sends a request to API Gateway and gets a request ID as a response.
API Gateway calls the request receiver Lambda function.
The request receiver function performs the following actions:
1. Writes the status to an Amazon DynamoDB control table.
2. Writes a request to Amazon Simple Queue Service (Amazon SQS).
A second Lambda function, the request processor, performs following actions:
1. Polls Amazon SQS.
2. Writes the status back to the DynamoDB table.
3. Runs a SQL query on Amazon Redshift.
Amazon Redshift exports the data to an S3 bucket.
A third Lambda function, the poller, checks the status of the results in the DynamoDB table.
The poller function fetches results from Amazon S3.
The poller function sends a presigned URL to download the file from the S3 bucket to the requestor via Amazon Simple Email Service (Amazon SES).
The requestor downloads the file using the URL.

The workflow also contains the following steps to check the status of the request at various stages:

The client application or user sends a request ID to API Gateway that is generated in Step 1.
API Gateway calls the status check Lambda function.
The function reads the status from the DynamoDB control table.
The status is returned to the requestor through API Gateway.

Prerequisites

You need the following prerequisites to deploy the example application:

An AWS account
The AWS SAM CLI
Python 3.9
Node 17.3
An AWS Identity and Access Management (IAM) role with appropriate access
An Amazon Redshift cluster with a database and table

Complete the following prerequisite steps before deploying the sample application:

Run the following DDL on the Amazon Redshift cluster using the query editor to create the schema and table:

create schema rsdataapi;

create table rsdataapi.product_detail(
 sku varchar(20)
,product_id int 
,product_name varchar(50)
,product_description varchar(50)
);

Insert into rsdataapi.product_detail values ('FLOWER12',12345,'Flowers - Rose','Flowers-Rose');
Insert into rsdataapi.product_detail values ('FLOWER13',12346,'Flowers - Jasmine','Flowers-Jasmine');
Insert into rsdataapi.product_detail values ('FLOWER14',12347,'Flowers - Other','Flowers-Other');

Configure AWS Secrets Manager to store the Amazon Redshift credentials.
Configure Amazon SES with an email address or distribution list to send and receive status updates.

Deploy the application

To deploy the application, complete the following steps:

Clone the repository and download the sample source code to your environment where AWS SAM is installed:
```
git clone https://github.com/aws-samples/redshift-application-api
```

Change into the project directory containing the template.yaml file:

cd aws-samples/redshift-application-api/assets
export PATH=$PATH:/usr/local/opt/[email protected]/bin

Change the API .yaml file to update your AWS account number and the Region where you’re deploying this solution:

sed -i ‘’ “s/<input_region>/us-east-1/g” *API.yaml
sed -i ‘’ “s/<input_accountid>/<provide your AWS account id without dashes>/g” *API.yaml

Build the application using AWS SAM:
```
sam build
```
Deploy the application to your account using AWS SAM. Be sure to follow proper Amazon S3 naming conventions, providing globally unique names for S3 buckets:
```
sam deploy -g
```

SAM deploy requires you to provide the following parameters for configuration:

Parameter	Description
RSClusterID	The cluster identifier for your existing Amazon Redshift cluster.
RSDataFetchQ	The query to fetch the data from your Amazon Redshift tables (for example, select * from `rsdataapi.product_detail` where `sku=` the input passed from the API)
RSDataFileS3BucketName	The S3 bucket where the dataset from Amazon S3 is uploaded.
RSDatabaseName	The database on your Amazon Redshift cluster.
RSS3CopyRoleArn	The IAM role for Amazon Redshift that has access to copy files to and from Amazon Redshift to Amazon S3. This role should be associated with your Amazon Redshift cluster.
RSSecret	The Secrets Manager ARN for your Amazon Redshift credentials.
RSUser	The user name to connect to the Amazon Redshift cluster.
RsFileArchiveBucket	The S3 bucket from where the zipped dataset is downloaded. This should be different than your upload bucket.
RsS3CodeRepo	The S3 bucket where the packages or .zip file is stored.
RsSingedURLExpTime	The expiry time in seconds for the presigned URL to download the dataset from Amazon S3.
RsSourceEmailAddress	The email address of the distribution list for which Amazon SES is configured to use as the source for sending completion status.
RsTargetEmailAddress	The email address of the distribution list for which Amazon SES is configured to use as the destination for receiving completion status.
RsStatusTableName	The name of the status table for capturing the status of various stages from start to completion of request.

This template is designed only to show how you can set up an application API using the Amazon Redshift Data API, Lambda, and API Gateway. This setup isn’t intended for production use without modification.

Test the application

You can use Postman or any other application to connect to API Gateway and pass the request to access the dataset from Amazon Redshift. The APIs are authorized via IAM users. Before sending a request, choose your authorization type as AWS SigV4 and enter the values for AccessKey and SecretKey for the IAM user.

The following screenshot shows a sample request.

The following screenshot shows the email response.

The following screenshot shows sample response with the status of a request. You need to pass the request ID and enter all for status history or latest for latest status.

Clean up

When you’re finished testing this solution, remember to clean up all the AWS resources that you created using AWS SAM.

Delete the upload and download S3 buckets via the Amazon S3 console and then run the following on SAM CLI:

sam delete

For more information, see sam delete.

Summary

In this post, we showed you how you can set up an application API that uses the Amazon Redshift Data API, Lambda, and API Gateway. The API performs asynchronous processing of user requests, sends user notifications, saves processed data in Amazon S3, and returns a presigned URL for the user or application to download the dataset over HTTPs.

Give this solution a try and share your experience with us!

About the Authors

Jeetesh Srivastva is a Sr. Manager Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to implement scalable solutions using Amazon Redshift and other AWS Analytic services. He has worked to deliver on-premises and cloud-based analytic solutions for customers in banking and finance and hospitality industry verticals.

Ripunjaya Pattnaik is an Enterprise Solutions Architect at AWS. He enjoys problem-solving with his customers and being their advisor. In his free time, he likes to try new sports, play ping pong, and watch movies.

Use AQUA with Amazon Redshift RA3.xlplus nodes

2022-02-04 Quan Li

Post Syndicated from Quan Li original https://aws.amazon.com/blogs/big-data/use-aqua-with-amazon-redshift-ra3-xlplus-nodes/

Amazon Redshift RA3 is the latest generation node type that allows you to scale compute and storage for your data warehouses independently. The RA3 node family includes RA3.16xlarge, RA3.4xlarge, and RA3.xlplus nodes for large, medium, and small workloads, respectively. RA3.xlplus, the latest member of the RA3 node family, offers one third of the computing power of RA3.4xlarge and costs one third of the price. RA3.xlplus is the smallest node in the RA3 family, but it offers the same advanced functionalities. It has been widely used in environments with light computing demand such as QA, data analytics for small teams, or processing smaller datasets.

In 2021, Amazon Redshift introduced AQUA (Advanced Query Accelerator) for Amazon Redshift to boost performance of analytical queries that scan, filter, and aggregate large datasets. AQUA uses AWS-designed processors with the AWS Nitro chip adapter to speed up data encryption and compression, and custom analytical processors implemented in FPGAs to accelerate applications requiring text search of a very large dataset, such as marketing and personalization.

Customers have asked us to support AQUA for RA3.xlplus, and we recently launched AQUA for RA3.xlplus nodes. In this post, we continue to build on the post AQUA (Advanced Query Accelerator) – A Speed Boost for Your Amazon Redshift Queries and show that with AQUA support, RA3.xlplus provides the same benefit as the existing supported RA3 nodes in the following areas:

Automatically boosting certain types of queries
Reducing the impact on your Amazon Redshift cluster by offloading certain queries that scan, filter, and aggregate large datasets to AQUA

Test environment

To test AQUA for RA3.xlplus, we started by creating an RA3.xlplus cluster with the following details:

Amazon Redshift cluster – 2-node RA3.xlplus
Dataset – 3 TB TPC-DS, 3 TB TPC-H
Query set – Sample queries based on the TPC-H and TPC-DS workload

Sample queries

To test AQUA, we created six text search queries that scan, filter, and aggregate the lineitem table in the TPC-H dataset, which has 18 billion rows with a WHERE clause predicate against the l_comment column.

The following table summarizes our table definition.

table	encoded	diststyle	sortkey1	rows
lineitem	Y	KEY	l_shipdate	18,000,048,306

We randomly generated a query set with queries of various complexity. The queries are designed to measure scan cost, which are an area of focus for AQUA. Each query has a predicate with LIKE and OR. The number of LIKE or OR predicates gets progressively higher to simulate complex workloads.

For example, Query 1 has one OR predicate:

SELECT COUNT(l_orderkey)
FROM lineitem
WHERE (l_comment LIKE '%across%') OR (l_comment LIKE '%brave,%');

In contrast, Query 4 has 50 OR predicates:

SELECT COUNT(l_orderkey)
  FROM lineitem
  WHERE (l_comment LIKE '%outsi%') OR
  (l_comment LIKE '%uthless%') OR
  (l_comment LIKE '%capades%') OR
  (l_comment LIKE '%horses%') OR
  (l_comment LIKE '%ornis%' AND l_comment LIKE '%phins?%') OR
  (l_comment LIKE '%affix%') OR
  (l_comment LIKE '%integrat%') OR
....
  (l_comment LIKE '%ithin%' AND l_comment LIKE '%quiet%') OR
  (l_comment LIKE '%taphs%') OR
  (l_comment LIKE '%dugouts%' AND l_comment LIKE '%ches%') OR
  (l_comment LIKE '%telets%' AND l_comment LIKE '%detect!%') OR
  (l_comment LIKE '%grow%') OR
  (l_comment LIKE '%promise!%') OR
  (l_comment LIKE '%was%') OR
  (l_comment LIKE '%accounts%') OR
  (l_comment LIKE '%idly%' AND l_comment LIKE '%deposits%') OR
  (l_comment LIKE '%integrate!%' AND l_comment LIKE '%depend%') OR
  (l_comment LIKE '%ins%' AND l_comment LIKE '%uses!%') OR
  (l_comment LIKE '%epitaphs!%' AND l_comment LIKE '%breac%') OR
  (l_comment LIKE '%pliers%' AND l_comment LIKE '%phins%') OR
  (l_comment LIKE '%hogs%' AND l_comment LIKE '%sentiments%') OR
  (l_comment LIKE '%ctions%' AND l_comment LIKE '%daringly%') OR
  (l_comment LIKE '%ies%' AND l_comment LIKE '%esias%');

The following table summarizes the complexity of each query.

Query Number	Number of OR	Number of LIKE
Query 1	1	2
Query 2	5	7
Query 3	10	12
Query 4	50	66

Scan performance improvement with AQUA

We ran the four queries sequentially without any other workload on the system. With AQUA, the performance improvements range from approximately 7–13 times faster, as summarized in the following table.

Query Number	Amazon Redshift with AQUA (seconds)	Amazon Redshift Only (seconds)	Improvement
Query 1	78.53	635.89	709.74%
Query 2	92.75	810.04	773.36%
Query 3	130.68	956.83	632.19%
Query 4	137.68	1950.9	1316.98%

AQUA impact on multiple workloads

In this environment, we simulated a multi-user workflow using TPC-DS queries on the Amazon Redshift cluster. We recorded query runtime for three scenarios:

Baseline – We measured the end-to-end runtime running all TPC-DS queries serially on the Amazon Redshift cluster. In this scenario, AQUA was off and no additional workload was run (a single user was on the cluster).
Baseline with additional workload – This was the same as the baseline scenario with an additional workload run in parallel. We simulated a user load by running text scan queries randomly selected from Query 1, Query 2 and Query 3. These queries have relatively short runtimes. We had two variations of this scenario:
- AQUA turned off
- AQUA turned on

From the results, we observed the following:

With AQUA turned on for all workloads, the impact of a text scan query on the baseline runtime was negligible.
Without AQUA, the baseline runtime was impacted by the additional workload created with text scan queries. In our case, overhead was about 31%.

	Baseline	Baseline with additional workload		Improvement with AQUA
		AQUA turned off	AQUA turned on
TPC-DS End-to-End Time	3:43:35	4:54:50	3:44:36	31.27%

Single-node RA3.xlplus support

AQUA also supports the recently released Amazon Redshift single-node RA3.xlplus. In a single-node configuration, the resource is shared among all Amazon Redshift operations, which are traditionally handled separately by a leader node and compute nodes. A single-node configuration is commonly used in a personal or small group environment for data exploration.

We ran the same set of queries as before using Query 1, 2 and Query 3. The results demonstrated that AQUA provides a similar level of accelerations for these queries in a single-node environment.

Query Number	Amazon Redshift with AQUA (seconds)	Amazon Redshift Only (seconds)	Improvement
Query 1	157.91	1,254.03	694.13%
Query 2	193.64	2,037.79	952.36%
Query 3	260.75	2,495.85	857.19%

Summary

In this post, we ran a set of simulated performance tests on the Amazon Redshift RA3.xlplus platform with AQUA. With AQUA on, RA3.xlplus provides the same benefit as earlier supported platforms. It provides a query scan performance boost with AQUA-supported operators, which will expand over time. It can reduce the performance impact of your existing workflow by offloading the scan to AQUA.

We invite you to share your comments and use cases with the Amazon Redshift AQUA team.

For more information about how AQUA accelerates Amazon Redshift, see AQUA (Advanced Query Accelerator) for Amazon Redshift.

For more information about queries accelerated by AQUA, see When does Amazon Redshift use AQUA to run queries?

About the Authors

Quan Li is a Senior Database Engineer at Amazon Redshift. His focus is enabling customers to deliver maximum business value. Quan is passionate about optimizing high-performance analytical databases. During his spare time, he enjoys traveling and experiencing different types of cuisines with his family.

Steffen Rochel is a Sr. Software Development Manager at AWS. He is focused on data analytics acceleration. He has expertise in hardware-software design and operation of large-scale, high-performance distributed systems.

How ENGIE scales their data ingestion pipelines using Amazon MWAA

2022-02-03 Anouar Zaaber

Post Syndicated from Anouar Zaaber original https://aws.amazon.com/blogs/big-data/how-engie-scales-their-data-ingestion-pipelines-using-amazon-mwaa/

ENGIE—one of the largest utility providers in France and a global player in the zero-carbon energy transition—produces, transports, and deals electricity, gas, and energy services. With 160,000 employees worldwide, ENGIE is a decentralized organization and operates 25 business units with a high level of delegation and empowerment. ENGIE’s decentralized global customer base had accumulated lots of data, and it required a smarter, unique approach and solution to align its initiatives and provide data that is ingestible, organizable, governable, sharable, and actionable across its global business units.

In 2018, the company’s business leadership decided to accelerate its digital transformation through data and innovation by becoming a data-driven company. Yves Le Gélard, chief digital officer at ENGIE, explains the company’s purpose: “Sustainability for ENGIE is the alpha and the omega of everything. This is our raison d’être. We help large corporations and the biggest cities on earth in their attempts to transition to zero carbon as quickly as possible because it is actually the number one question for humanity today.”

ENGIE, as with any other big enterprise, is using multiple extract, transform, and load (ETL) tools to ingest data into their data lake on AWS. Nevertheless, they usually have expensive licensing plans. “The company needed a uniform method of collecting and analyzing data to help customers manage their value chains,” says Gregory Wolowiec, the Chief Technology Officer who leads ENGIE’s data program. ENGIE wanted a free-license application, well integrated with multiple technologies and with a continuous integration, continuous delivery (CI/CD) pipeline to more easily scale all their ingestion process.

ENGIE started using Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to solve this issue and started moving various data sources from on-premise applications and ERPs, AWS services like Amazon Redshift, Amazon Relational Database Service (Amazon RDS), Amazon DynamoDB, external services like Salesforce, and other cloud providers to a centralized data lake on top of Amazon Simple Storage Service (Amazon S3).

Amazon MWAA is used in particular to collect and store harmonized operational and corporate data from different on-premises and software as a service (SaaS) data sources into a centralized data lake. The purpose of this data lake is to create a “group performance cockpit” that enables an efficient, data-driven analysis and thoughtful decision-making by the Engie Management board.

In this post, we share how ENGIE created a CI/CD pipeline for an Amazon MWAA project template using an AWS CodeCommit repository and plugged it into AWS CodePipeline to build, test, and package the code and custom plugins. In this use case, we developed a custom plugin to ingest data from Salesforce based on the Airflow Salesforce open-source plugin.

Solution overview

The following diagrams illustrate the solution architecture defining the implemented Amazon MWAA environment and its associated pipelines. It also describes the customer use case for Salesforce data ingestion into Amazon S3.

The following diagram shows the architecture of the deployed Amazon MWAA environment and the implemented pipelines.

The preceding architecture is fully deployed via infrastructure as code (IaC). The implementation includes the following:

Amazon MWAA environment – A customizable Amazon MWAA environment packaged with plugins and requirements and configured in a secure manner.
Provisioning pipeline – The admin team can manage the Amazon MWAA environment using the included CI/CD provisioning pipeline. This pipeline includes a CodeCommit repository plugged into CodePipeline to continuously update the environment and its plugins and requirements.
Project pipeline – This CI/CD pipeline comes with a CodeCommit repository that triggers CodePipeline to continuously build, test and deploy DAGs developed by users. Once deployed, these DAGs are made available in the Amazon MWAA environment.

The following diagram shows the data ingestion workflow, which includes the following steps:

The DAG is triggered by Amazon MWAA manually or based on a schedule.
Amazon MWAA initiates data collection parameters and calculates batches.
Amazon MWAA distributes processing tasks among its workers.
Data is retrieved from Salesforce in batches.
Amazon MWAA assumes an AWS Identity and Access Management (IAM) role with the necessary permissions to store the collected data into the target S3 bucket.

This AWS Cloud Development Kit (AWS CDK) construct is implemented with the following security best practices:

With the principle of least privilege, you grant permissions to only the resources or actions that users need to perform tasks.
S3 buckets are deployed with security compliance rules: encryption, versioning, and blocking public access.
Authentication and authorization management is handled with AWS Single Sign-On (AWS SSO).
Airflow stores connections to external sources in a secure manner either in Airflow’s default secrets backend or an alternative secrets backend such as AWS Secrets Manager or AWS Systems Manager Parameter Store.

For this post, we step through a use case using the data from Salesforce to ingest it into an ENGIE data lake in order to transform it and build business reports.

Prerequisites for deployment

For this walkthrough, the following are prerequisites:

Basic knowledge of the Linux operating system
Access to an AWS account with administrator or power user (or equivalent) IAM role policies attached
Access to a shell environment or optionally with AWS CloudShell

Deploy the solution

To deploy and run the solution, complete the following steps:

Install AWS CDK.
Bootstrap your AWS account.
Define your AWS CDK environment variables.
Deploy the stack.

Install AWS CDK

The described solution is fully deployed with AWS CDK.

AWS CDK is an open-source software development framework to model and provision your cloud application resources using familiar programming languages. If you want to familiarize yourself with AWS CDK, the AWS CDK Workshop is a great place to start.

Install AWS CDK using the following commands:

npm install -g aws-cdk
# To check the installation
cdk --version

Bootstrap your AWS account

First, you need to make sure the environment where you’re planning to deploy the solution to has been bootstrapped. You only need to do this one time per environment where you want to deploy AWS CDK applications. If you’re unsure whether your environment has been bootstrapped already, you can always run the command again:

cdk bootstrap aws://YOUR_ACCOUNT_ID/YOUR_REGION

Define your AWS CDK environment variables

On Linux or MacOS, define your environment variables with the following code:

export CDK_DEFAULT_ACCOUNT=YOUR_ACCOUNT_ID
export CDK_DEFAULT_REGION=YOUR_REGION

On Windows, use the following code:

setx CDK_DEFAULT_ACCOUNT YOUR_ACCOUNT_ID
setx CDK_DEFAULT_REGION YOUR_REGION

Deploy the stack

By default, the stack deploys a basic Amazon MWAA environment with the associated pipelines described previously. It creates a new VPC in order to host the Amazon MWAA resources.

The stack can be customized using the parameters listed in the following table.

To pass a parameter to the construct, you can use the AWS CDK runtime context. If you intend to customize your environment with multiple parameters, we recommend using the cdk.json context file with version control to avoid unexpected changes to your deployments. Throughout our example, we pass only one parameter to the construct. Therefore, for the simplicity of the tutorial, we use the the --context or -c option to the cdk command, as in the following example:

cdk deploy -c paramName=paramValue -c paramName=paramValue ...

Parameter	Description	Default	Valid values
vpcId	VPC ID where the cluster is deployed. If none, creates a new one and needs the parameter `cidr` in that case.	None	VPC ID
cidr	The CIDR for the VPC that is created to host Amazon MWAA resources. Used only if the `vpcId` is not defined.	172.31.0.0/16	IP CIDR
subnetIds	Comma-separated list of subnets IDs where the cluster is deployed. If none, looks for private subnets in the same Availability Zone.	None	Subnet IDs list (coma separated)
envName	Amazon MWAA environment name	`MwaaEnvironment`	String
envTags	Amazon MWAA environment tags	None	See the following JSON example: `'{"Environment":"MyEnv", "Application":"MyApp", "Reason":"Airflow"}'`
environmentClass	Amazon MWAA environment class	mw1.small	mw1.small, mw1.medium, mw1.large
maxWorkers	Amazon MWAA maximum workers	1	int
webserverAccessMode	Amazon MWAA environment access mode (private or public)	`PUBLIC_ONLY`	`PUBLIC_ONLY`, `PRIVATE_ONLY`
secretsBackend	Amazon MWAA environment secrets backend	Airflow	`SecretsManager`

Clone the GitHub repository:

git clone https://github.com/aws-samples/cdk-amazon-mwaa-cicd

Deploy the stack using the following command:

cd mwaairflow && \
pip install . && \
cdk synth && \
cdk deploy -c vpcId=YOUR_VPC_ID

The following screenshot shows the stack deployment:

The following screenshot shows the deployed stack:

Create solution resources

For this walkthrough, you should have the following prerequisites:

An AWS account.
A Salesforce account

If you don’t have a Salesforce account, you can create a SalesForce developer account:

Sign up for a developer account.
Copy the host from the email that you receive.
Log in into your new Salesforce account
Choose the profile icon, then Settings.
Choose Reset my Security Token.
Check your email and copy the security token that you receive.

After you complete these prerequisites, you’re ready to create the following resources:

An S3 bucket for Salesforce output data
An IAM role and IAM policy to write the Salesforce output data on Amazon S3
A Salesforce connection on the Airflow UI to be able to read from Salesforce
An AWS connection on the Airflow UI to be able to write on Amazon S3
An Airflow variable on the Airflow UI to store the name of the target S3 bucket

Create an S3 bucket for Salesforce output data

To create an output S3 bucket, complete the following steps:

On the Amazon S3 console, choose Create bucket.

The Create bucket wizard opens.

For Bucket name, enter a DNS-compliant name for your bucket, such as airflow-blog-post.
For Region, choose the Region where you deployed your Amazon MWAA environment, for example, US East (N. Virginia) us-east-1.
Choose Create bucket.

For more information, see Creating a bucket.

Create an IAM role and IAM policy to write the Salesforce output data on Amazon S3

In this step, we create an IAM policy that allows Amazon MWAA to write on your S3 bucket.

On the IAM console, in the navigation pane, choose Policies.
Choose Create policy.
Choose the JSON tab.

Enter the following JSON policy document, and replace airflow-blog-post with your bucket name:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::airflow-blog-post"]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": ["arn:aws:s3:::airflow-blog-post/*"]
    }
  ]
}

Choose Next: Tags.
Choose Next: Review.
For Name, choose a name for your policy (for example, airflow_data_output_policy).
Choose Create policy.

Let’s attach the IAM policy to a new IAM role that we use in our Airflow connections.

On the IAM console, choose Roles in the navigation pane and then choose Create role.
In the Or select a service to view its use cases section, choose S3.
For Select your use case, choose S3.
Search for the name of the IAM policy that we created in the previous step (airflow_data_output_role) and select the policy.
Choose Next: Tags.
Choose Next: Review.
For Role name, choose a name for your role (airflow_data_output_role).
Review the role and then choose Create role.

You’re redirected to the Roles section.

In the search box, enter the name of the role that you created and choose it.
Copy the role ARN to use later to create the AWS connection on Airflow.

Create a Salesforce connection on the Airflow UI to be able to read from Salesforce

To read data from Salesforce, we need to create a connection using the Airflow user interface.

On the Airflow UI, choose Admin.
Choose Connections, and then the plus sign to create a new connection.
Fill in the fields with the required information.

The following table provides more information about each value.

Field	Mandatory	Description	Values
Conn Id	Yes	Connection ID to define and to be used later in the DAG	For example, `salesforce_connection`
Conn Type	Yes	Connection type	HTTP
Host	Yes	Salesforce host name	`host-dev-ed.my.salesforce.com` or `host.lightning.force.com`. Replace the host with your Salesforce host and don’t add the `http://` as prefix.
Login	Yes	The Salesforce user name. The user must have read access to the salesforce objects.	[email protected]
Password	Yes	The corresponding password for the defined user.	MyPassword123
Port	No	Salesforce instance port. By default, 443.	443
Extra	Yes	Specify the extra parameters (as a JSON dictionary) that can be used in the Salesforce connection. `security_token` is the Salesforce security token for authentication. To get the Salesforce security token in your email, you must reset your security token.	`{"security_token":"AbCdE..."}`

Create an AWS connection in the Airflow UI to be able to write on Amazon S3

An AWS connection is required to upload data into Amazon S3, so we need to create a connection using the Airflow user interface.

On the Airflow UI, choose Admin.
Choose Connections, and then choose the plus sign to create a new connection.
Fill in the fields with the required information.

The following table provides more information about the fields.

Field	Mandatory	Description	Value
Conn Id	Yes	Connection ID to define and to be used later in the DAG	For example, `aws_connection`
Conn Type	Yes	Connection type	Amazon Web Services
Extra	Yes	It is required to specify the Region. You also need to provide the role ARN that we created earlier.	`{ "region":"eu-west-1", "role_arn":"arn:aws:iam::123456789101:role/airflow_data_output_role " }`

Create an Airflow variable on the Airflow UI to store the name of the target S3 bucket

We create a variable to set the name of the target S3 bucket. This variable is used by the DAG. So, we need to create a variable using the Airflow user interface.

On the Airflow UI, choose Admin.
Choose Variables, then choose the plus sign to create a new variable.
For Key, enter bucket_name.
For Val, enter the name of the S3 bucket that you created in a previous step (airflow-blog-post).

Create and deploy a DAG in Amazon MWAA

To be able to ingest data from Salesforce into Amazon S3, we need to create a DAG (Directed Acyclic Graph). To create and deploy the DAG, complete the following steps:

Create a local Python DAG.
Deploy your DAG using the project CI/CD pipeline.
Run your DAG on the Airflow UI.
Display your data in Amazon S3 (with S3 Select).

Create a local Python DAG

The provided SalesForceToS3Operator allows you to ingest data from Salesforce objects to an S3 bucket. Refer to standard Salesforce objects for the full list of objects you can ingest data from with this Airflow operator.

In this use case, we ingest data from the Opportunity Salesforce object. We retrieve the last 6 months’ data in monthly batches and we filter on a specific list of fields.

The DAG provided in the sample in GitHub repository imports the last 6 months of the Opportunity object (one file by month) by filtering the list of retrieved fields.

This operator takes two connections as parameters:

An AWS connection that is used to upload ingested data into Amazon S3.
A Salesforce connection to read data from Salesforce.

The following table provides more information about the parameters.

Parameter	Type	Mandatory	Description
sf_conn_id	string	Yes	Name of the Airflow connection that has the following information: user name password security token
sf_obj	string	Yes	Name of the relevant Salesforce object (Account, Lead, Opportunity)
s3_conn_id	string	Yes	The destination S3 connection ID
s3_bucket	string	Yes	The destination S3 bucket
s3_key	string	Yes	The destination S3 key
sf_fields	string	No	The (optional) list of fields that you want to get from the object (`Id`, `Name`, and so on). If none (the default), then this gets all fields for the object.
fmt	string	No	The (optional) format that the S3 key of the data should be in. Possible values include CSV (default), JSON, and NDJSON.
from_date	date format	No	A specific date-time (optional) formatted input to run queries from for incremental ingestion. Evaluated against the `SystemModStamp` attribute. Not compatible with the query parameter and should be in date-time format (for example, 2021-01-01T00:00:00Z). Default: None
to_date	date format	No	A specific date-time (optional) formatted input to run queries to for incremental ingestion. Evaluated against the `SystemModStamp` attribute. Not compatible with the query parameter and should be in date-time format (for example, 2021-01-01T00:00:00Z). Default: None
query	string	No	A specific query (optional) to run for the given object. This overrides default query creation. Default: None
relationship_object	string	No	Some queries require relationship objects to work, and these are not the same names as the Salesforce object. Specify that relationship object here (optional). Default: None
record_time_added	boolean	No	Set this optional value to true if you want to add a Unix timestamp field to the resulting data that marks when the data was fetched from Salesforce. Default: False
coerce_to_timestamp	boolean	No	Set this optional value to true if you want to convert all fields with dates and datetimes into Unix timestamp (UTC). Default: False

The first step is to import the operator in your DAG:

from operators.salesforce_to_s3_operator import SalesforceToS3Operator

Then define your DAG default ARGs, which you can use for your common task parameters:

# These args will get passed on to each operator
# You can override them on a per-task basis during operator initialization
default_args = {
    'owner': '[email protected]',
    'depends_on_past': False,
    'start_date': days_ago(2),
    'retries': 0,
    'retry_delay': timedelta(minutes=1),
    'sf_conn_id': 'salesforce_connection',
    's3_conn_id': 'aws_connection',
    's3_bucket': 'salesforce-to-s3',
}
...

Finally, you define the tasks to use the operator.

The following examples illustrate some use cases.

Salesforce object full ingestion

This task ingests all the content of the Salesforce object defined in sf_obj. This selects all the object’s available fields and writes them into the defined format in fmt. See the following code:

...
salesforce_to_s3 = SalesforceToS3Operator(
    task_id="Opportunity_to_S3",
    sf_conn_id=default_args["sf_conn_id"],
    sf_obj="Opportunity",
    fmt="ndjson",
    s3_conn_id=default_args["s3_conn_id"],
    s3_bucket=default_args["s3_bucket"],
    s3_key=f"salesforce/raw/dt={s3_prefix}/{table.lower()}.json",
    dag=salesforce_to_s3_dag,
)
...

Salesforce object partial ingestion based on fields

This task ingests specific fields of the Salesforce object defined in sf_obj. The selected fields are defined in the optional sf_fields parameter. See the following code:

...
salesforce_to_s3 = SalesforceToS3Operator(
    task_id="Opportunity_to_S3",
    sf_conn_id=default_args["sf_conn_id"],
    sf_obj="Opportunity",
    sf_fields=["Id","Name","Amount"],
    fmt="ndjson",
    s3_conn_id=default_args["s3_conn_id"],
    s3_bucket=default_args["s3_bucket"],
    s3_key=f"salesforce/raw/dt={s3_prefix}/{table.lower()}.json",
    dag=salesforce_to_s3_dag,
)
...

Salesforce object partial ingestion based on time period

This task ingests all the fields of the Salesforce object defined in sf_obj. The time period can be relative using from_date or to_date parameters or absolute by using both parameters.

The following example illustrates relative ingestion from the defined date:

...
salesforce_to_s3 = SalesforceToS3Operator(
    task_id="Opportunity_to_S3",
    sf_conn_id=default_args["sf_conn_id"],
    sf_obj="Opportunity",
    from_date="YESTERDAY",
    fmt="ndjson",
    s3_conn_id=default_args["s3_conn_id"],
    s3_bucket=default_args["s3_bucket"],
    s3_key=f"salesforce/raw/dt={s3_prefix}/{table.lower()}.json",
    dag=salesforce_to_s3_dag,
)
...

The from_date and to_date parameters support Salesforce date-time format. It can be either a specific date or literal (for example TODAY, LAST_WEEK, LAST_N_DAYS:5). For more information about date formats, see Date Formats and Date Literals.

For the full DAG, refer to the sample in GitHub repository.

This code dynamically generates tasks that run queries to retrieve the data of the Opportunity object in the form of 1-month batches.

The sf_fields parameter allows us to extract only the selected fields from the object.

Save the DAG locally as salesforce_to_s3.py.

Deploy your DAG using the project CI/CD pipeline

As part of the CDK deployment, a CodeCommit repository and CodePipeline pipeline were created in order to continuously build, test, and deploy DAGs into your Amazon MWAA environment.

To deploy the new DAG, the source code should be committed to the CodeCommit repository. This triggers a CodePipeline run that builds, tests, and deploys your new DAG and makes it available in your Amazon MWAA environment.

Sign in to the CodeCommit console in your deployment Region.
Under Source, choose Repositories.

You should see a new repository mwaaproject.

Push your new DAG in the mwaaproject repository under dags. You can either use the CodeCommit console or the Git command line to do so:
1. CodeCommit console:
  1. Choose the project CodeCommit repository name mwaaproject and navigate under dags.
  2. Choose Add file and then Upload file and upload your new DAG.
2. Git command line:
  1. To be able to clone and access your CodeCommit project with the Git command line, make sure Git client is properly configured. Refer to Setting up for AWS CodeCommit.
  2. Clone the repository with the following command after replacing <region> with your project Region:
```
git clone https://git-codecommit.<region>.amazonaws.com/v1/repos/mwaaproject
```
  3. Copy the DAG file under dags and add it with the command:
```
git add dags/salesforce_to_s3.py
```
  4. Commit your new file with a message:
```
git commit -m "add salesforce DAG"
```
  5. Push the local file to the CodeCommit repository:
```
git push
```

The new commit triggers a new pipeline that builds, tests, and deploys the new DAG. You can monitor the pipeline on the CodePipeline console.

On the CodePipeline console, choose Pipeline in the navigation pane.
On the Pipelines page, you should see mwaaproject-pipeline.
Choose the pipeline to display its details.

After checking that the pipeline run is successful, you can verify that the DAG is deployed to the S3 bucket and therefore available on the Amazon MWAA console.

On the Amazon S3 console, look for a bucket starting with mwaairflowstack-mwaaenvstackne and go under dags.

You should see the new DAG.

On the Amazon MWAA console, choose DAGs.

You should be able to see the new DAG.

Run your DAG on the Airflow UI

Go to the Airflow UI and toggle on the DAG.

This triggers your DAG automatically.

Later, you can continue manually triggering it by choosing the run icon.

Choose the DAG and Graph View to see the run of your DAG.

If you have any issue, you can check the logs of the failed tasks from the task instance context menu.

Display your data in Amazon S3 (with S3 Select)

To display your data, complete the following steps:

On the Amazon S3 console, in the Buckets list, choose the name of the bucket that contains the output of the Salesforce data (airflow-blog-post).
In the Objects list, choose the name of the folder that has the object that you copied from Salesforce (opportunity).
Choose the raw folder and the dt folder with the latest timestamp.
Select any file.
On the Actions menu, choose Query with S3 Select.
Choose Run SQL query to preview the data.

Clean up

To avoid incurring future charges, delete the AWS CloudFormation stack and the resources that you deployed as part of this post.

On the AWS CloudFormation console, delete the stack MWAAirflowStack.

To clean up the deployed resources using the AWS Command Line Interface (AWS CLI), you can simply run the following command:

cdk destroy MWAAirflowStack

Make sure you are in the root path of the project when you run the command.

After confirming that you want to destroy the CloudFormation stack, the solution’s resources are deleted from your AWS account.

The following screenshot shows the process of deploying the stack:

The following screenshot confirms the stack is undeployed.

Navigate to the Amazon S3 console and locate the two buckets containing mwaairflowstack-mwaaenvstack and mwaairflowstack-mwaaproj that were created during the deployment.
Select each bucket delete its contents, then delete the bucket.
Delete the IAM role created to write on the S3 buckets.

Conclusion

ENGIE discovered significant value by using Amazon MWAA, enabling its global business units to ingest data in more productive ways. This post presented how ENGIE scaled their data ingestion pipelines using Amazon MWAA. The first part of the post described the architecture components and how to successfully deploy a CI/CD pipeline for an Amazon MWAA project template using a CodeCommit repository and plug it into CodePipeline to build, test, and package the code and custom plugins. The second part walked you through the steps to automate the ingestion process from Salesforce using Airflow with an example. For the Airflow configuration, you used Airflow variables, but you can also use Secrets Manager with Amazon MWAA using the secretsBackend parameter when deploying the stack.

The use case discussed in this post is just one example of how you can use Amazon MWAA to make it easier to set up and operate end-to-end data pipelines in the cloud at scale. For more information about Amazon MWAA, check out the User Guide.

About the Authors

Anouar Zaaber is a Senior Engagement Manager in AWS Professional Services. He leads internal AWS, external partner, and customer teams to deliver AWS cloud services that enable the customers to realize their business outcomes.

Amine El Mallem is a Data/ML Ops Engineer in AWS Professional Services. He works with customers to design, automate, and build solutions on AWS for their business needs.

Armando Segnini is a Data Architect with AWS Professional Services. He spends his time building scalable big data and analytics solutions for AWS Enterprise and Strategic customers. Armando also loves to travel with his family all around the world and take pictures of the places he visits.

Mohamed-Ali Elouaer is a DevOps Consultant with AWS Professional Services. He is part of the AWS ProServe team, helping enterprise customers solve complex problems related to automation, security, and monitoring using AWS services. In his free time, he likes to travel and watch movies.

Julien Grinsztajn is an Architect at ENGIE. He is part of the Digital & IT Consulting ENGIE IT team working on the definition of the architecture for complex projects related to data integration and network security. In his free time, he likes to travel the oceans to meet sharks and other marine creatures.

Build a modern data architecture on AWS with Amazon AppFlow, AWS Lake Formation, and Amazon Redshift: Part 2

2022-02-02 Dr. Yannick Misteli

Post Syndicated from Dr. Yannick Misteli original https://aws.amazon.com/blogs/big-data/part-2-build-a-modern-data-architecture-on-aws-with-amazon-appflow-aws-lake-formation-and-amazon-redshift/

In Part 1 of this post, we provided a solution to build the sourcing, orchestration, and transformation of data from multiple source systems, including Salesforce, SAP, and Oracle, into a managed modern data platform. Roche partnered with AWS Professional Services to build out this fully automated and scalable platform to provide the foundation for their machine learning goals. This post continues the data journey to include the steps undertaken to build an agile and extendable Amazon Redshift data warehouse platform using a DevOps approach.

The modern data platform ingests delta changes from all source data feeds once per night. The orchestration and transformations of the data is undertaken by dbt. dbt enables data analysts and engineers to write data transformation queries in a modular manner without having to maintain the run order manually. It compiles all code into raw SQL queries that run against the Amazon Redshift cluster. It also controls the dependency management within your queries and runs it in the correct order. dbt code is a combination of SQL and Jinja (a templating language); therefore, you can express logic such as if statements, loops, filters, and macros in your queries. dbt also contains automatic data validation job scheduling to measure the data quality of the data loaded. For more information about how to configure a dbt project within an AWS environment, see Automating deployment of Amazon Redshift ETL jobs with AWS CodeBuild, AWS Batch, and DBT.

Amazon Redshift was chosen as the data warehouse because of its ability to seamlessly access data stored in industry standard open formats within Amazon Simple Storage Service (Amazon S3) and rapidly ingest the required datasets into local, fast storage using well-understood SQL commands. Being able to develop extract, load, and transform (ELT) code pipelines in SQL was important for Roche to take advantage of the existing deep SQL skills of their data engineering teams.

A modern ELT platform requires a modern, agile, and highly performant data model. The solution in this post builds a data model using the Data Vault 2.0 standards. Data Vault has several compelling advantages for data-driven organizations:

It removes data silos by storing all your data in reusable source system independent data stores keyed on your business keys.
It’s a key driver for data integration at many levels, from multiple source systems, multiple local markets, multiple companies and affiliates, and more.
It reduces data duplication. Because data is centered around business keys, if more than one system sends the same data, then multiple data copies aren’t needed.
It holds all history from all sources; downstream you can access any data at any point in time.
You can load data without contention or in parallel, and in batch or real time.
The model can adapt to change with minimal impact. New business relationships can be made independently of the existing relationships
The model is well established in the industry and naturally drives templated and reusable code builds.

The following diagram illustrates the high-level overview of the architecture:

Amazon Redshift has several methods for ingesting data from Amazon S3 into the data warehouse cluster. For this modern data platform, we use a combination of the following methods:

We use Amazon Redshift Spectrum to read data directly from Amazon S3. This allows the project to rapidly load, store, and use external datasets. Amazon Redshift allows the creation of external schemas and external tables to facilitate data being accessed using standard SQL statements.
Some feeds are persisted in a staging schema within Amazon Redshift, for example larger data volumes and datasets that are used multiple times in subsequent ELT processing. dbt handles the orchestration and loading of this data in an incremental manner to cater to daily delta changes.

Within Amazon Redshift, the Data Vault 2.0 data model is split into three separate areas:

Raw Data Vault within a schema called raw_dv
Business Data Vault within a schema called business_dv
Multiple Data Marts, each with their own schema

Raw Data Vault

Business keys are central to the success of any Data Vault project, and we created hubs within Amazon Redshift as follows:

CREATE TABLE IF NOT EXISTS raw_dv.h_user
(
 user_pk          VARCHAR(32)   			 
,user_bk          VARCHAR(50)   			 
,load_dts         TIMESTAMP  	 
,load_source_dts  TIMESTAMP  	 
,bookmark_dts     TIMESTAMP  	 
,source_system_cd VARCHAR(10)   				 
) 
DISTSTYLE ALL;

Keep in mind the following:

The business keys from one or more source feeds are written to the reusable _bk column; compound business keys should be concatenated together with a common separator between each element.
The primary key is stored in the _pk column and is a hashed value of the _bk column. In this case, MD5 is the hashing algorithm used.
Load_Dts is the date and time of the insertion of this row.
Hubs hold reference data, which is typically smaller in volume than transactional data, so you should choose a distribution style of ALL for the most performant joining to other tables at runtime.

Because Data Vault is built on a common reusable notation, the dbt code is parameterized for each target. The Roche engineers built a Yaml-driven code framework to parameterize the logic for the build of each target table, enabling rapid build and testing of new feeds. For example, the preceding user hub contains parameters to identify source columns for the business key, source to target mappings, and physicalization choices for the Amazon Redshift target:

name: h_user
    type: hub
    materialized: incremental
    schema: raw_dv
    dist: all
    pk_name: user_pk
    bk:
      name: user_bk
      type: varchar(50)
    sources:
      - name: co_rems_invitee
        schema: re_rems_core
        key:
          - dwh_source_country_cd
          - employee_user_id
        columns:
          - source: "'REMS'"
            alias: source_system_cd
            type: varchar(10)
        load_source_dts: glue_dts
        bookmark_dts: bookmark_dts        
      - name: co_rems_event_users
        schema: re_rems_core
        key:
          - dwh_source_country_cd
          - user_name
        columns:
          - source: "'REMS'"
            alias: source_system_cd
            type: varchar(10)
        load_source_dts: glue_dts
        bookmark_dts: bookmark_dts        
      - name: user
        alias: user_by_id
        schema: roche_salesforce_we_prod
        key:
          - id
        columns:
          - source: "'SFDC_WE'"
            alias: source_system_cd
            type: varchar(10)
        load_source_dts: to_date(appflow_date_str,'YYYYMMDD')
        bookmark_dts: to_date(systemmodstamp,'YYYY-MM-DD HH24.mi.ss')
        where: id > 0 and id <> '' and usertype = 'Standard'
      - name: activity_g__c
        schema: roche_salesforce_we_prod
        key:
          - ownerid
        columns:
          - source: "'SFDC_WE'"
            alias: source_system_cd
            type: varchar(10)
        load_source_dts: to_date(appflow_date_str,'YYYYMMDD')
        bookmark_dts: to_date(systemmodstamp,'YYYY-MM-DD HH24.mi.ss')        
      - name: user_territory_g__c
        schema: roche_salesforce_we_prod
        key:
          - user_ref_g__c
        columns:
          - source: "'SFDC_WE'"
            alias: source_system_cd
            type: varchar(10)
        load_source_dts: to_date(appflow_date_str,'YYYYMMDD')
        bookmark_dts: to_date(systemmodstamp,'YYYY-MM-DD HH24.mi.ss')

On reading the YAML configuration, dbt outputs the following, which is run against the Amazon Redshift cluster:

{# Script generated by dbt model generator #}

{{
	config({
	  "materialized": "incremental",
	  "schema": "raw_dv",
	  "dist": "all",
	  "unique_key": "user_pk",
	  "insert_only": {}
	})
}}

with co_rems_invitee as (

	select
		{{ hash(['dwh_source_country_cd', 'employee_user_id'], 'user_pk') }},
		cast({{ compound_key(['dwh_source_country_cd', 'employee_user_id']) }} as varchar(50)) as user_bk,
		{{ dbt_utils.current_timestamp() }} as load_dts,
		glue_dts as load_source_dts,
		bookmark_dts as bookmark_dts,
		cast('REMS' as varchar(10)) as source_system_cd
	from
		{{ source('re_rems_core', 'co_rems_invitee') }}
	where
		dwh_source_country_cd is not null 
		and employee_user_id is not null

		{% if is_incremental() %}
			and glue_dts > (select coalesce(max(load_source_dts), to_date('20000101', 'yyyymmdd', true)) from {{ this }})
		{% endif %}

), 
co_rems_event_users as (

	select
		{{ hash(['dwh_source_country_cd', 'user_name'], 'user_pk') }},
		cast({{ compound_key(['dwh_source_country_cd', 'user_name']) }} as varchar(50)) as user_bk,
		{{ dbt_utils.current_timestamp() }} as load_dts,
		glue_dts as load_source_dts,
		bookmark_dts as bookmark_dts,
		cast('REMS' as varchar(10)) as source_system_cd
	from
		{{ source('re_rems_core', 'co_rems_event_users') }}
	where
		dwh_source_country_cd is not null 
		and user_name is not null

		{% if is_incremental() %}
			and glue_dts > (select coalesce(max(load_source_dts), to_date('20000101', 'yyyymmdd', true)) from {{ this }})
		{% endif %}

), 
all_sources as (

	select * from co_rems_invitee
	union
	select * from co_rems_event_users

),
unique_key as (

	select
		row_number() over(partition by user_pk order by bookmark_dts desc) as rn,
		user_pk,
		user_bk,
		load_dts,
		load_source_dts,
		bookmark_dts,
		source_system_cd
	from
		all_sources

)
select
	user_pk,
	user_bk,
	load_dts,
	load_source_dts,
	bookmark_dts,
	source_system_cd
from
	unique_key
where
	rn = 1

dbt also has the capability to add reusable macros to allow common tasks to be automated. The following example shows the construction of the business key with appropriate separators (the macro is called compound_key):

{% macro single_key(field) %}
  {# Takes an input field value and returns a trimmed version of it. #}
  NVL(NULLIF(TRIM(CAST({{ field }} AS VARCHAR)), ''), '@@')
{% endmacro %}

{% macro compound_key(field_list,sort=none) %}
  {# Takes an input field list and concatenates it into a single column value.
     NOTE: Depending on the sort parameter [True/False] the input field
     list has to be passed in a correct order if the sort parameter
     is set to False (default option) or the list will be sorted 
     if You will set up the sort parameter value to True #}
  {% if sort %}
    {% set final_field_list = field_list|sort %}
  {%- else -%}
    {%- set final_field_list = field_list -%}
  {%- endif -%}        
  {% for f in final_field_list %}
    {{ single_key(f) }}
    {% if not loop.last %} || '^^' || {% endif %}
  {% endfor %}
{% endmacro %}

{% macro hash(columns=none, alias=none, algorithm=none) %}
    {# Applies a Redshift supported hash function to the input string 
       or list of strings. #}

    {# If single column to hash #}
    {% if columns is string %}
        {% set column_str = single_key(columns) %}
        {{ redshift__hash(column_str, alias, algorithm) }}
    {# Else a list of columns to hash #}
    {% elif columns is iterable %}        
        {% set column_str = compound_key(columns) %}
        {{ redshift__hash(column_str, alias, algorithm) }}
    {% endif %}
   
{% endmacro %}

{% macro redshift__hash(column_str, alias, algorithm) %}
    {# Applies a Redshift supported hash function to the input string. #}

    {# If the algorithm is none the default project configuration for hash function will be used. #}
    {% if algorithm == none or algorithm not in ['MD5', 'SHA', 'SHA1', 'SHA2', 'FNV_HASH'] %}
        {# Using MD5 if the project variable is not defined. #}
        {% set algorithm = var('project_hash_algorithm', 'MD5') %}
    {% endif %}

    {# Select hashing algorithm #}
    {% if algorithm == 'FNV_HASH' %}
        CAST(FNV_HASH({{ column_str }}) AS BIGINT) AS {{ alias }}
    {% elif algorithm == 'MD5' %}
        CAST(MD5({{ column_str }}) AS VARCHAR(32)) AS {{ alias }}
    {% elif algorithm == 'SHA' or algorithm == 'SHA1' %}
        CAST(SHA({{ column_str }}) AS VARCHAR(40)) AS {{ alias }}
    {% elif algorithm == 'SHA2' %}
        CAST(SHA2({{ column_str }}, 256) AS VARCHAR(256)) AS {{ alias }}
    {% endif %}

{% endmacro %}

Historized reference data about each business key is stored in satellites. The primary key of each satellite is a compound key consisting of the _pk column of the parent hub and the Load_Dts. See the following code:

CREATE TABLE IF NOT EXISTS raw_dv.s_user_reine2
(
 user_pk             VARCHAR(32)   			 
,load_dts            TIMESTAMP    	 
,hash_diff           VARCHAR(32)   			 
,load_source_dts     TIMESTAMP  	 
,bookmark_dts        TIMESTAMP    	 
,source_system_cd    VARCHAR(10)				 
,is_deleted          VARCHAR(1)   				 
,invitee_type        VARCHAR(10)   			 
,first_name          VARCHAR(50)   			 
,last_name           VARCHAR(10)   			 
)
DISTSTYLE ALL
SORTKEY AUTO;

CREATE TABLE IF NOT EXISTS raw_dv.s_user_couser
(
 user_pk                VARCHAR(32)   			 
,load_dts               TIMESTAMP  	 
,hash_diff              VARCHAR(32)   			 
,load_source_dts        TIMESTAMP  	 
,bookmark_dts           TIMESTAMP  	 
,source_system_cd       VARCHAR(10)   			 
,name                   VARCHAR(150)   			 
,username               VARCHAR(80)   			 
,firstname              VARCHAR(40)   			 
,lastname               VARCHAR(80)   			 
,alias                  VARCHAR(8)   				 
,community_nickname     VARCHAR(30)   			 
,federation_identifier  VARCHAR(50)   			 
,is_active              VARCHAR(10)   			 
,email                  VARCHAR(130)   			 
,profile_name           VARCHAR(80)   			 
)
DISTSTYLE ALL
SORTKEY AUTO;

Keep in mind the following:

The feed name is saved as part of the satellite name. This allows the loading of reference data from either multiple feeds within the same source system or from multiple source systems.
Satellites are insert only; new reference data is loaded as a new row with an appropriate Load_Dts.
The HASH_DIFF column is a hashed concatenation of all the descriptive columns within the satellite. The dbt code uses it to decide whether reference data has changed and a new row is to be inserted.
Unless the data volumes within a satellite become very large (millions of rows), you should choose a distribution choice of ALL to enable the most performant joins at runtime. For larger volumes of data, choose a distribution style of AUTO to take advantage of Amazon Redshift automatic table optimization, which chooses the most optimum distribution style and sort key based on the downstream usage of these tables.

Transactional data is stored in a combination of link and link satellite tables. These tables hold the business keys that contribute to the transaction being undertaken as well as optional measures describing the transaction.

Previously, we showed the build of the user hub and two of its satellites. In the following link table, the user hub foreign key is one of several hub keys in the compound key:

CREATE TABLE IF NOT EXISTS raw_dv.l_activity_visit
(
 activity_visit_pk         VARCHAR(32)   			 
,activity_pk               VARCHAR(32)   			 
,activity_type_pk          VARCHAR(32)   			
,hco_pk                    VARCHAR(32)   			
,address_pk                VARCHAR(32)   			
,user_pk                   VARCHAR(32)   			
,hcp_pk                    VARCHAR(32)   			
,brand_pk                  VARCHAR(32)   			
,activity_attendee_pk      VARCHAR(32)   			
,activity_discussion_pk    VARCHAR(32)				
,load_dts                  TIMESTAMP  	
,load_source_dts           TIMESTAMP  				
,bookmark_dts              TIMESTAMP  				
,source_system_cd          VARCHAR(10)   				
)
DISTSTYLE KEY
DISTKEY (activity_visit_pk)
SORTKEY (activity_visit_pk);

Keep in mind the following:

The foreign keys back to each hub are a hash value of the business keys, giving a 1:1 join with the _pk column of each hub.
The primary key of this link table is a hash value of all of the hub foreign keys.
The primary key gives direct access to the optional link satellite that holds further historized data about this transaction. The definition of the link satellites is almost identical to satellites; instead of the _pk from the hub being part of the compound key, the _pk of the link is used.
Because data volumes are typically larger for links and link satellites than hubs or satellites, you can again choose AUTO distribution style to let Amazon Redshift choose the optimum physical table distribution choice. If you do choose a distribution style, then choose KEY on the _pk column for both the distribution style and sort key on both the link and any link satellites. This improves downstream query performance by co-locating the datasets on the same slice within the compute nodes and enables MERGE JOINS at run time for optimum performance.

In addition to the dbt code to build all the preceding targets in the Amazon Redshift schemas, the product contains a powerful testing tool that makes assertions on the underlying data contents. The platform continuously tests the results of each data load.

Tests are specified using a YAML file called schema.yml. For example, taking the territory satellite (s_territory), we can see automated testing for conditions, including ensuring the primary key is populated, its parent key is present in the territory hub (h_territory), and the compound key of this satellite is unique:

As shown in the following screenshot, the tests are clearly labeled as PASS or FAILED for quick identification of data quality issues.

Business Data Vault

The Business Data Vault is a vital element of any Data Vault model. This is the place where business rules, KPI calculations, performance denormalizations, and roll-up aggregations take place. Business rules can change over time, but the raw data does not, which is why the contents of the Raw Data Vault should never be modified.

The type of objects created in the Business Data Vault schema include the following:

Type 2 denormalization based on either the latest load date timestamp or a business-supplied effective date timestamp. These objects are ideal as the base for a type 2 dimension view within a data mart.
Latest row filtering based on either the latest load date timestamp or a business-supplied effective date timestamp. These objects are ideal as the base for a type 1 dimension within a data mart.
For hubs with multiple independently loaded satellites, point-in-time (PIT) tables are created with the snapshot date set to one time per day.
Where the data access requirements span multiple links and link satellites, bridge tables are created with the snapshot date set to one time per day.

In the following diagram, we show an example of user reference data from two source systems being loaded into separate satellite targets.

Keep in mind the following:

You should create a separate schema for the Business Data Vault objects
You can build several object types in the Business Data Vault:
- PIT and bridge targets are typically either tables or materialized views can be used for data that incrementally changes due to the auto refresh capabilities
- The type 2 and latest row selections from an underlying satellite are typically views because of the lower data volumes typically found in reference datasets
Because the Raw Data Vault tables are insert only, to determine a timeline of changes, create a view similar to the following:

CREATE OR REPLACE VIEW business_dv.ref_user_type2 AS
SELECT 
  s.user_pk,
  s.load_dts from_dts,
  DATEADD(second,-1,COALESCE(LEAD(s.load_dts) OVER (PARTITION BY s.user_pk ORDER BY s.load_dts),'2200-01-01 00:00:00')) AS to_dts
  FROM raw_dv.s_user_reine2 s
  INNER JOIN raw_dv.h_user h ON h.user_pk = s.user_pk
  WITH NO SCHEMA BINDING;

Data Marts

The work undertaken in the Business Data Vault means that views can be developed within the Data Marts to directly access the data without having to physicalize the results into another schema. These views may apply filters to the Business Vault objects, for example to filter only for data from specific countries, or the views may choose a KPI that has been calculated in the Business Vault that is only useful within this one data mart.

Conclusion

In this post, we detailed how you can use dbt and Amazon Redshift for continuous build and validation of a Data Vault model that stores all data from multiple sources in a source-independent manner while offering flexibility and choice of subsequent business transformations and calculations.

Special thanks go to Roche colleagues Bartlomiej Zalewski, Wojciech Kostka, Michalina Mastalerz, Kamil Piotrowski, Igor Tkaczyk, Andrzej Dziabowski, Joao Antunes, Krzysztof Slowinski, Krzysztof Romanowski, Patryk Szczesnowicz, Jakub Lanski, and Chun Wei Chan for their project delivery and support with this post.

About the Authors

Dr. Yannick Misteli, Roche – Dr. Yannick Misteli is leading cloud platform and ML engineering teams in global product strategy (GPS) at Roche. He is passionate about infrastructure and operationalizing data-driven solutions, and he has broad experience in driving business value creation through data analytics.

Simon Dimaline, AWS – Simon Dimaline has specialised in data warehousing and data modelling for more than 20 years. He currently works for the Data & Analytics team within AWS Professional Services, accelerating customers’ adoption of AWS analytics services.

Matt Noyce, AWS – Matt Noyce is a Senior Cloud Application Architect in Professional Services at Amazon Web Services. He works with customers to architect, design, automate, and build solutions on AWS for their business needs.

Chema Artal Banon, AWS – Chema Artal Banon is a Security Consultant at AWS Professional Services and he works with AWS’s customers to design, build, and optimize their security to drive business. He specializes in helping companies accelerate their journey to the AWS Cloud in the most secure manner possible by helping customers build the confidence and technical capability.

Best practices to optimize your Amazon Redshift and MicroStrategy deployment

2022-02-01 Ranjan Burman

Post Syndicated from Ranjan Burman original https://aws.amazon.com/blogs/big-data/best-practices-to-optimize-your-amazon-redshift-and-microstrategy-deployment/

This is a guest blog post co-written by Amit Nayak at Microstrategy. In their own words, “MicroStrategy is the largest independent publicly traded business intelligence (BI) company, with the leading enterprise analytics platform. Our vision is to enable Intelligence Everywhere. MicroStrategy provides modern analytics on an open, comprehensive enterprise platform used by many of the world’s most admired brands in the Fortune Global 500. Optimized for cloud and on-premises deployments, the platform features HyperIntelligence, a breakthrough technology that overlays actionable enterprise data on popular business applications to help users make smarter, faster decisions.”

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse. It provides a simple and cost-effective way to analyze all your data using your existing BI tools. Amazon Redshift delivers fast query performance by using columnar storage technology to improve I/O efficiency and parallelizing queries across multiple nodes. Amazon Redshift has custom JDBC and ODBC drivers that you can download from the Connect Client tab on the Amazon Redshift console, allowing you to use a wide range of familiar BI tools.

When using your MicroStrategy application with Amazon Redshift, it’s important to understand how to optimize Amazon Redshift to get the best performance to meet your workload SLAs.

In this post, we look at the best practices for optimized deployment of MicroStrategy using Amazon Redshift.

Optimize Amazon Redshift

In this section, we discuss ways to optimize Amazon Redshift.

Amazon Redshift RA3 instances

RA3 nodes with managed storage help you optimize your data warehouse by scaling and paying for the compute capacity and managed storage independently. With RA3 instances, you can choose the number of nodes based on your performance requirements, and only pay for the managed storage that you use. Size your RA3 cluster based on the amount of data you process daily without increasing your storage costs.

For additional details about RA3 features, see Amazon Redshift RA3 instances with managed storage.

Distribution styles

When you load data into a table, Amazon Redshift distributes the rows of the table to each of the compute nodes according to the table’s distribution style. When you run a query, the query optimizer redistributes the rows to the compute nodes as needed to perform any joins and aggregations. The goal in choosing a table distribution style is to minimize the impact of the redistribution step by locating the data where it needs to be before the query is run.

When you create a table, you can designate one of four distribution styles: AUTO, EVEN, KEY, or ALL. If you don’t specify a distribution style, Amazon Redshift uses AUTO distribution. With AUTO distribution, Amazon Redshift assigns an optimal distribution style based on the size of the table data. You can use automatic table optimization to get started with Amazon Redshift easily or optimize production workloads while decreasing the administrative effort required to get the best possible performance.

MicroStrategy, like any SQL application, transparently takes advantage of the distribution style defined on base tables. MicroStrategy recommends following Amazon Redshift recommended best practices when implementing the physical schema of the base tables.

Sort keys

Defining a table with a sort key results in the physical ordering of data in the Amazon Redshift cluster nodes based on the sort type and the columns chosen in the key definition. Sorting enables efficient handling of range-restricted predicates to scan the minimal number of blocks on disk to satisfy a query. A contrived example would be having an orders table with 5 years of data with a SORTKEY on the order_date column. Now suppose a query on the orders table specifies a date range of 1 month on the order_date column. In this case, you can eliminate up to 98% of the disk blocks from the scan. If the data isn’t sorted, more of the disk blocks (possibly all of them) have to be scanned, resulting in the query running longer.

We recommend creating your tables with SORTKEY AUTO. This way, Amazon Redshift uses automatic table optimization to choose the sort key. If Amazon Redshift determines that applying a SORTKEY improves cluster performance, tables are automatically altered within hours from the time the cluster was created, with minimal impact to queries.

We also recommend using the sort key on columns often used in the WHERE clause of the report queries. Keep in mind that SQL functions (such as data transformation functions) applied to sort key columns in queries reduce the effectiveness of the sort key for those queries. Instead, make sure that you apply the functions to the compared values so that the sort key is used. This is commonly found on DATE columns that are used as sort keys.

Amazon Redshift Advisor provides recommendations to help you improve the performance and decrease the operating costs for your Amazon Redshift cluster. The Advisor analyzes your cluster’s workload to identify the most appropriate distribution key and sort key based on the query patterns of your cluster.

Compression

Compression settings can also play a big role when it comes to query performance in Amazon Redshift. Compression conserves storage space and reduces the size of data that is read from storage, which reduces the amount of disk I/O and therefore improves query performance.

By default, Amazon Redshift automatically manages compression encoding for all columns in a table. You can specify the ENCODE AUTO option for the table to enable Amazon Redshift to automatically manage compression encoding for all columns in a table. You can alternatively apply a specific compression type to the columns in a table manually when you create the table, or you can use the COPY command to analyze and apply compression automatically.

We don’t recommend compressing the first column in a compound sort key because it might result in scanning more rows than expected.

Amazon Redshift materialized views

Materialized views can significantly boost query performance for repeated and predictable analytical workloads such as dashboarding, queries from BI tools, and extract, load, and transform (ELT) data processing.

Materialized views are especially useful for queries that are predictable and repeated over and over. Instead of performing resource-intensive queries on large tables, applications can query the pre-computed data stored in the materialized view.

For example, consider the scenario where a set of queries is used to populate a collection of charts for a dashboard. This use case is ideal for a materialized view, because the queries are predictable and repeated over and over again. Whenever a change occurs in the base tables (data is inserted, deleted, or updated), the materialized views can be automatically or manually refreshed to represent the current data.

Amazon Redshift can automatically refresh materialized views with up-to-date data from its base tables when materialized views are created with or altered to have the auto-refresh option. Amazon Redshift auto-refreshes materialized views as soon as possible after base tables changes.

To update the data in a materialized view manually, you can use the REFRESH MATERIALIZED VIEW statement at any time. There are two strategies for refreshing a materialized view:

Incremental refresh – In an incremental refresh, it identifies the changes to the data in the base tables since the last refresh and updates the data in the materialized view
Full refresh – If incremental refresh isn’t possible, Amazon Redshift performs a full refresh, which reruns the underlying SQL statement, replacing all the data in the materialized view

Amazon Redshift automatically chooses the refresh method for a materialized view depending on the SELECT query used to define the materialized view. For information about the limitations for incremental refresh, see Limitations for incremental refresh.

The following are some of the key advantages using materialized views:

You can speed up queries by pre-computing the results of complex queries, including multiple base tables, predicates, joins, and aggregates
You can simplify and accelerate ETL and BI pipelines
Materialized views support Amazon Redshift local, Amazon Redshift Spectrum, and federated queries
Amazon Redshift can use automatic query rewrites of materialized views

For example, let’s consider the sales team wants to build a report that shows
the product sales across different stores. This dashboard query is based out of a 3 TB Cloud DW benchmark dataset based on the TPC-DS benchmark dataset.

In this first step, you create a regular view. See the following code:

create view vw_product_sales
as
select 
	i_brand,
	i_category,
	d_year,
	d_quarter_name,
	s_store_name,
	sum(ss_sales_price) as total_sales_price,
	sum(ss_net_profit) as total_net_profit,
	sum(ss_quantity) as total_quantity
from
store_sales ss, item i, date_dim d, store s
where ss.ss_item_sk=i.i_item_sk
and ss.ss_store_sk = s.s_store_sk
and ss.ss_sold_date_sk=d.d_date_sk
and d_year = 2000
group by i_brand,
	i_category,
	d_year,
	d_quarter_name,
	s_store_name;

The following code is a report to analyze the product sales by category:

SELECT 
	i_category,
	d_year,
	d_quarter_name,
    sum(total_quantity) as total_quantity
FROM vw_product_sales
GROUP BY 
i_category,
	d_year,
	d_quarter_name
ORDER BY 3 desc

The preceding reports take approximately 15 seconds to run. As more products are sold, this elapsed time gradually gets longer. To speed up those reports, you can create a materialized view to precompute the total sales per category. See the following code:

create materialized view mv_product_sales
as
select 
	i_brand,
	i_category,
	d_year,
	d_quarter_name,
	s_store_name,
	sum(ss_sales_price) as total_sales_price,
	sum(ss_net_profit) as total_net_profit,
	sum(ss_quantity) as total_quantity
from 
store_sales ss, item i, date_dim d, store s
where ss.ss_item_sk=i.i_item_sk
and ss.ss_store_sk = s.s_store_sk
and ss.ss_sold_date_sk=d.d_date_sk
and d_year = 2000
group by i_brand,
	i_category,
	d_year,
	d_quarter_name,
	s_store_name;

The following code analyzes the product sales by category against the materialized view:

SELECT 
	i_category,
	d_year,
	d_quarter_name,
    sum(total_quantity) as total_quantity
FROM mv_product_sales
GROUP BY 
i_category,
	d_year,
	d_quarter_name
ORDER BY 3 desc;

The same reports against a materialized view took around 4 seconds because the new queries access precomputed joins, filters, grouping, and partial sums instead of the multiple, larger base tables.

Workload management

Amazon Redshift workload management (WLM) enables you to flexibly manage priorities within workloads so that short, fast-running queries don’t get stuck in queues behind long-running queries. You can use WLM to define multiple query queues and route queries to the appropriate queues at runtime.

You can query WLM in two modes:

Automatic WLM – Amazon Redshift manages the resources required to run queries. Amazon Redshift determines how many queries run concurrently and how much memory is allocated to each dispatched query. Amazon Redshift uses highly trained sophisticated ML algorithms to make these decisions.
Query priority is a feature of automatic WLM that lets you assign priority ranks to different user groups or query groups, to ensure that higher-priority workloads get more resources for consistent query performance, even during busy times. For example, consider a critical dashboard report query that has higher priority than an ETL job. You can assign the priority as highest for the report query and high priority to the ETL query.
No queries are ever starved of resources, and lower priority queries always complete, but may just take longer to complete.
Manual WLM – With manual WLM, you can manage the system performance by modifying the WLM configuration to create separate queues for long-running queries and short-running queries. You can define up to eight queues to separate workloads from each other. Each queue contains a number of query slots, and each queue is associated with a portion of available memory.

You can also use the Amazon Redshift query monitoring rules (QMR) feature to set metrics-based performance boundaries for workload management (WLM) queues, and specify what action to take when a query goes beyond those boundaries. For example, for a queue that’s dedicated to short-running queries, you might create a rule that cancels queries that run for more than 60 seconds. To track poorly designed queries, you might have another rule that logs queries that contain nested loops. You can use predefined rule templates in Amazon Redshift to get started with QMR.

We recommend the following configuration for WLM:

Enable automatic WLM
Enable concurrency scaling to handle an increase in concurrent read queries, with consistent fast query performance
Create QMR rules to track and handle poorly written queries

After you create and configure different WLM queues, you can use a MicroStrategy query label to set the Amazon Redshift query group for queue assignment. This tells Amazon Redshift which WLM queue to send the query to.

You can set the following as a report pre-statement in MicroStrategy:

set query_group to 'mstr_dashboard';

You can use MicroStrategy query labels to identify the MicroStrategy submitted SQL statements within Amazon Redshift system tables.

You can use it with all SQL statement types; therefore, we recommend using it for multi-pass SQL reports. When the label of a query is stored in the system view stl_query, it’s truncated to 15 characters (30 characters are stored in all other system tables). For this reason, you should be cautious when choosing the value for query label.

You can set the following as a report pre-statement:

set query_group to 'MSTR=!o;Project=!p;User=!u;Job=!j;'

This collects information on the server side about variables like project name, report name, user, and more.

To clean up the query group and release resources, use the cleanup post-statement:

reset query_group;

MicroStrategy allows the use of wildcards that are replaced by values retrieved at a report’s run time, as shown in the pre- and post-statements. The following table provides an example of pre- and post-statements.

VLDB Category	VLDB Property Setting	Value Example
Pre/Post Statements	Report Pre-statement	`set query_group to 'MSTR=!o;Project=!p;User=!u;Job=!j;'`
Pre/Post Statements	Cleanup Post-statement	`reset query_group;`

For example, see the following code:

VLDB Property Report Pre Statement = set query_group to 'MSTRReport=!o;'
set query_group to 'MSTRReport=Cost, Price, and Profit per Unit;'

Query prioritization in MicroStrategy

In general, you may have multiple applications submitting queries to Amazon Redshift in addition to MicroStrategy. You can use Amazon Redshift query groups to identify MicroStrategy submitted SQL to Amazon Redshift, along with its assignment to the appropriate Amazon Redshift WLM queue.

The Amazon Redshift query group for a MicroStrategy report is set and reset through the use of the following report-level MicroStrategy VLDB properties.

VLDB Category	VLDB Property Setting	Value Example
Pre/Post Statements	Report Pre-statement	`set query_group to 'MSTR_High=!o;'`
Pre/Post Statements	Cleanup Post-statement	`reset query_group;`

A MicroStrategy report job can submit one or more queries to Amazon Redshift. In such cases, all queries for a MicroStrategy report are labeled with the same query group and therefore are assigned to same queue in Amazon Redshift.

The following is an example implementation of MicroStrategy Amazon Redshift WLM:

High-priority MicroStrategy reports are set with report pre-statement MSTR_HIGH=!o;, medium priority reports with MSTR_MEDIUM=!o;, and low priority reports with MSTR_LOW=!o;.
Amazon Redshift WLM queues are created and associated with corresponding query groups. For example, the MSTR_HIGH_QUEUE queue is associated with the MSTR_HIGH=*; query group (where * is an Amazon Redshift wildcard).

Concurrency scaling

With concurrency scaling, you can configure Amazon Redshift to handle spikes in workloads while maintaining consistent SLAs by elastically scaling the underlying resources as needed. When concurrency scaling is enabled, Amazon Redshift continuously monitors the designated workload. If the queries start to get backlogged because of bursts of user activity, Amazon Redshift automatically adds transient cluster capacity and routes the requests to these new clusters. You manage which queries are sent to the concurrency scaling cluster by configuring the WLM queues. This happens transparently in a matter of seconds, so your queries continue to be served with low latency. In addition, every 24 hours that the Amazon Redshift main cluster is in use, you accrue a 1-hour credit towards using concurrency scaling. This enables 97% of Amazon Redshift customers to benefit from concurrency scaling at no additional charge.

For more details on concurrency scaling pricing, see Amazon Redshift pricing.

Amazon Redshift removes the additional transient capacity automatically when activity reduces on the cluster. You can enable concurrency scaling for the MicroStrategy report queue and in case of heavy load on the cluster, the queries run on a concurrent cluster, thereby improving the overall dashboard performance and maintaining a consistent user experience.

To make concurrency scaling work with MicroStrategy, use derived tables instead of temporary tables, which you can do by setting the VLDB property Intermediate table type to Derived table.

In the following example, we enable concurrency scaling on the Amazon Redshift cluster for the MicroStrategy dashboard queries. We create a user group in Amazon Redshift, and all the dashboard queries are allocated to this user group’s queue. With concurrency scaling in place for the report queries, we can see a significant reduction in query wait time.

For this example, we created one WLM queue to run our dashboard queries with highest priority and another ETL queue with high priority. Concurrency scaling is turned on for the dashboard queue, as shown in the following screenshot.

As part of this test, we ran several queries in parallel on the cluster, some of which are ETL jobs (insert, delete, update, and copy), and some are complex select queries, such as dashboard queries. The following graph illustrates how many queries are waiting in the WLM queues and how concurrency scaling helps to address those queries.

In the preceding graph, several queries are waiting in the WLM queues; concurrency scaling automatically starts in seconds to process queries without any delays, as shown in the following graph.

This example has demonstrated how concurrency scaling helps handle spikes in user workloads by adding transient clusters as needed to provide consistent performance even as the workload grows to hundreds of concurrent queries.

Amazon Redshift federated queries

Customers using MicroStrategy often connect various relational data sources to a single MicroStrategy project for reporting and analysis purposes. For example, you might integrate an operational (OLTP) data source (such as Amazon Aurora PostgreSQL) and data warehouse data to get meaningful insights into your business.

With federated queries in Amazon Redshift, you can query and analyze data across operational databases, data warehouses, and data lakes. The federated query feature allows you to integrate queries from Amazon Redshift on live data in external databases with queries across your Amazon Redshift and Amazon Simple Storage Service (Amazon S3) environments.

Federated queries help incorporate live data as part of your MicroStrategy reporting and analysis, without the need to connect to multiple relational data sources from MicroStrategy.

You can also use federated queries to MySQL.

This simplifies the multi-source reports use case by having the ability to run queries on both operational and analytical data sources, without the need to explicitly connect and import data from different data sources within MicroStrategy.

Redshift Spectrum

The MicroStrategy Amazon Redshift connector includes support for Redshift Spectrum, so you can connect directly to query data in Amazon Redshift and analyze it in conjunction with data in Amazon S3.

With Redshift Spectrum, you can efficiently query and retrieve structured and semi-structured data (such as PARQUET, JSON, and CSV) from files in Amazon S3 without having to load the data into Amazon Redshift tables. It allows customers with large datasets stored in Amazon S3 to query that data from within the Amazon Redshift cluster using Amazon Redshift SQL queries with no data movement—you pay only for the data you scanned. Redshift Spectrum also allows multiple Amazon Redshift clusters to concurrently query the same dataset in Amazon S3 without the need to make copies of the data for each cluster. Based on the demands of the queries, Redshift Spectrum can intelligently scale out to take advantage of massively parallel processing.

Use cases that might benefit from using Redshift Spectrum include:

A large volume of less-frequently accessed data
Heavy scan-intensive and aggregation-intensive queries
Selective queries that can use partition pruning and predicate pushdown, so the output is fairly small

Redshift Spectrum gives you the freedom to store your data where you want, in the format you want, and have it available for processing when you need it.

With Redshift Spectrum, you take advantage of a fast, cost-effective engine that minimizes data processed with dynamic partition pruning. You can further improve query performance by reducing the amount of data scanned. You could do this by partitioning and compressing data and by using a columnar format for storage.

For more details on how to optimize Redshift Spectrum query performance and cost, see Best Practices for Amazon Redshift Spectrum.

Optimize MicroStrategy

In this section, we discuss ways to optimize MicroStrategy.

SQL optimizations

With MicroStrategy 2021, MicroStrategy has delivered support for 70 new advanced customizable functions to enhance usability and capability, especially when compared to previously existing Apply functions. Application architects can customize the functions and make them ready and available for regular users like business analysts to use! For more information on how to use these new customizable functions, visit the MicroStrategy community site.

SQL Global Optimization

This setting can substantially reduce the number of SQL passes generated by MicroStrategy. In MicroStrategy, SQL Global Optimization reduces the total number of SQL passes with the following optimizations:

Eliminates unused SQL passes – For example, a temp table is created but not referenced in a later pass
Reuses redundant SQL passes – For example, the exact same temp table is created multiple times when a single temp table is created
Combines SQL passes where the SELECT list is different – For example, two temp tables that have the same FROM clause, joins, WHERE clause, and GROUP BY SELECT lists are combined into single SELECT statement
Combines SQL passes where the WHERE clause is different – For example, two temp tables that have same the SELECT list, FROM clause, joins, and GROUP BY predicates from the WHERE clause are moved into CASE statements in the SELECT list

The default setting for Amazon Redshift is to enable SQL Global Optimization at its highest level. If your database instance is configured as an earlier version of Amazon Redshift, you may have to enable this setting manually. For more information, see the MicroStrategy System Administration Guide.

Set Operator Optimization

This setting is used to combine multiple subqueries into a single subquery using set operators (such as UNION, INTERSECT, and EXCEPT). The default setting for Amazon Redshift is to enable Set Operator Optimization.

SQL query generation

The MicroStrategy query engine is able to combine multiple passes of SQL that access the same table (typically the main fact table). This can improve performance by eliminating multiple table scans of large tables. For example, this feature significantly reduces the number of SQL passes required to process datasets with custom groups.

Technically, the WHERE clauses of different passes are resolved in CASE statements of a single SELECT clause, which doesn’t contain qualifications in the WHERE clause. Generally, this elimination of WHERE clauses causes a full table scan on a large table.

In some cases (on a report-by-report basis), this approach can be slower than many highly qualified SELECT statements. Because any performance difference between approaches is mostly impacted by the reporting requirement and implementation in the MicroStrategy application, it’s necessary to test both options for each dataset to identify the optimal case.

The default behavior is to merge all passes with different WHERE clauses (level 4). We recommend testing any option for this setting, but most commonly the biggest performance improvements (if any) are observed by switching to the option Level 2: Merge Passes with Different SELECT.

VLDB Category	VLDB Property Setting	Value
Query Optimizations	SQL Global Optimization	Level 2: Merge Passes with Different SELECT

SQL size

As we explained earlier, MicroStrategy tries to submit a single query statement containing the analytics of multiple passes in the derived table syntax. This can lead to sizeable SQL query syntax. It’s possible for such a statement to exceed the capabilities of the driver or database. For this reason, MicroStrategy governs the size of generated queries and throws an error message if this is exceeded. Starting with MicroStrategy 10.9, this value is tuned to current Amazon Redshift capabilities (16 MB). Earlier versions specify a smaller limit that can be modified using the following VLDB setting on the Amazon Redshift DB instance in Developer.

VLDB Category	VLDB Property Setting	Value
Governing	SQL Size/MDX Size	16777216

Subquery type

There are many cases in which the SQL engine generates subqueries (query blocks in the WHERE clause):

Reports that use relationship filters
Reports that use NOT IN set qualification, such as AND NOT
Reports that use attribute qualification with M-M relationships; for example, showing revenue by category and filtering on catalog
Reports that raise the level of a filter; for example, dimensional metric at Region level, but qualify on store
Reports that use non-aggregatable metrics, such as inventory metrics
Reports that use dimensional extensions
Reports that use attribute-to-attribute comparison in the filter

The default setting for subquery type for Amazon Redshift is Where EXISTS(select (col1, col2…)):

create table T00001 (
       year_id NUMERIC(10, 0),
       W000001 DOUBLE PRECISION)
 DISTSTYLE EVEN
 
insert into ZZMD00DistKey(1)
select a12.year_id  year_id,
       sum(a11.tot_sls_dlr)  W000001
from   items2 a11
       join   dates    a12
         on   (a11.cur_trn_dt = a12.cur_trn_dt)
where ((exists (select      r11.store_nbr
       from   items r11
       where r11.class_nbr = 1
        and   r11.store_nbr = a11.store_nbr))
 and a12.year_id>1993)
group by      a12.year_id

Some reports may perform better with the option of using a temporary table and falling back to IN for a correlated subquery. Reports that include a filter with an AND NOT set qualification (such as AND NOT relationship filter) will likely benefit from using temp tables to resolve the subquery. However, such reports will probably benefit more from using the Set Operator Optimization option discussed earlier. The other settings are not likely to be advantageous with Amazon Redshift.

VLDB Category	VLDB Property Setting	Value
Query Optimizations	Subquery Type	Use temporary table, falling back to IN for correlated subquery

Full outer join support

Full outer join support is enabled in the Amazon Redshift object by default. Levels at which you can set this are database instance, report, and template.

For example, the following query shows the use of full outer join with the states_dates and regions tables:

select pa0.region_id W000000,
       pa2.month_id W000001,
       sum(pa1.tot_dollar_sales) Column1
from   states_dates pa1
       full outer join       regions     pa0
         on (pa1.region_id = pa0.region_id)
       cross join    LU_MONTH      pa2
group by      pa0.region_id, pa2.month_id

DISTINCT or GROUP BY option (for no aggregation and no table key)

If no aggregation is needed and the attribute defined on the table isn’t a primary key, this property tells the SQL engine whether to use SELECT DISTINCT, GROUP BY, or neither.

Possible values for this setting include:

Use DISTINCT
No DISTINCT, no GROUP BY
Use GROUP BY

The DISTINCT or GROUP BY option property controls the generation of DISTINCT or GROUP BY in the SELECT SQL statement. The SQL engine doesn’t consider this property if it can make the decision based on its own knowledge. Specifically, the SQL engine ignores this property in the following situations:

If there is aggregation, the SQL engine uses GROUP BY, not DISTINCT
If there is no attribute (only metrics), the SQL engine doesn’t use DISTINCT
If there is COUNT (DISTINCT …) and the database doesn’t support it, the SQL engine performs a SELECT DISTINCT pass and then a COUNT(*) pass
If for certain selected column data types, the database doesn’t allow DISTINCT or GROUP BY, the SQL engine doesn’t do it
If the SELECT level is the same as the table key level and the table’s true key property is selected, the SQL engine doesn’t issue a DISTINCT

When none of the preceding conditions are met, the SQL engine uses the DISTINCT or GROUP BY property.

Use the latest Amazon Redshift drivers

For running MicroStrategy reports using Amazon Redshift, we encourage upgrading when new versions of the Amazon Redshift drivers are available. Running an application on the latest driver provides better performance, bugs recovery, and better security features. To get the latest driver version based on the OS, see Drivers and Connectors.

Conclusion

In this post, we discussed various Amazon Redshift cluster optimizations, data model optimizations, and SQL optimizations within MicroStrategy for optimizing your Amazon Redshift and MicroStrategy deployment.

About the Authors

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 13 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with the use of cloud solutions.

Nita Shah is a Analytics Specialist Solutions Architect at AWS based out of New York. She has been building data warehouse solutions for over 20 years and specializes in Amazon Redshift. She is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.

Bosco Albuquerque is a Sr Partner Solutions Architect at AWS and has over 20 years of experience working with database and analytics products from enterprise database vendors and cloud providers, and has helped large technology companies design data analytics solutions as well as led engineering teams in designing and implementing data analytics platforms and data products.

Amit Nayak is responsible for driving the Gateways roadmap at MicroStrategy, focusing on relational and big data databases, as well as authentication. Amit joined MicroStrategy after completing his master’s in Business Analytics at George Washington University and has maintained an oversight of the company’s gateways portfolio for the 3+ years he has been with the company.

Centralize governance for your data lake using AWS Lake Formation while enabling a modern data architecture with Amazon Redshift Spectrum

2022-01-19 Rajesh Francis

Post Syndicated from Rajesh Francis original https://aws.amazon.com/blogs/big-data/centralize-governance-for-your-data-lake-using-aws-lake-formation-while-enabling-a-modern-data-architecture-with-amazon-redshift-spectrum/

Many customers are modernizing their data architecture using Amazon Redshift to enable access to all their data from a central data location. They are looking for a simpler, scalable, and centralized way to define and enforce access policies on their data lakes on Amazon Simple Storage Service (Amazon S3). They want access policies to allow their data lake consumers to use the analytics service of their choice, to best suit the operations they want to perform on the data. Although the existing method of using Amazon S3 bucket policies to manage access control is an option, when the number of combinations of access levels and users increase, managing bucket level policies may not scale.

AWS Lake Formation allows you to simplify and centralize access management. It allows organizations to manage access control for Amazon S3-based data lakes using familiar concepts of databases, tables, and columns (with more advanced options like row and cell-level security). Lake Formation uses the AWS Glue Data Catalog to provide access control for Amazon S3 data lake with most commonly used AWS analytics services, like Amazon Redshift (via Amazon Redshift Spectrum), Amazon Athena, AWS Glue ETL, and Amazon EMR (for Spark-based notebooks). These services honor the Lake Formation permissions model out of the box, which makes it easy for customers to simplify, standardize, and scale data security management for data lakes.

With Amazon Redshift, you can build a modern data architecture, to seamlessly extend your data warehouse to your data lake and read all data – data in your data warehouse, and data in your data lake – without creating multiple copies of data. Amazon Redshift Spectrum feature enable direct query of your S3 data lake, and many customers are leveraging this to modernize their data platform. You can use Amazon Redshift managed storage for frequently accessed data and move less frequently accessed data to Amazon S3 data lake and securely access it using Redshift Spectrum.

In this post, we discuss how you can use AWS Lake Formation to centralize data governance and data access management while using Amazon Redshift Spectrum to query your data lake. Lake Formation allows you to grant and revoke permissions on databases, tables, and column catalog objects created on top of Amazon S3 data lake. This is easier for customers, as it is similar to managing permissions on relational databases.

In the first post of this two-part series, we focus on resources within the same AWS account. In the second post, we extend the solution across AWS accounts using the Lake Formation data sharing feature.

Solution overview

The following diagram illustrates our solution architecture.

The solution workflow consists of the following steps:

Data stored in an Amazon S3 data lake is crawled using an AWS Glue crawler.
The crawler infers the metadata of data on Amazon S3 and stores it in the form of a database and tables in the AWS Glue Data Catalog.
You register the Amazon S3 bucket as the data lake location with Lake Formation. It’s natively integrated with the Data Catalog.
You use Lake Formation to grant permissions at the database, table, and column level to defined AWS Identity and Access Management (IAM) roles.
You create external schemas within Amazon Redshift to manage access for marketing and finance teams.
You provide access to the marketing and finance groups to their respective external schemas and associate the appropriate IAM roles to be assumed. The admin role and admin group is limited for administration work.
Marketing and finance users now can assume their respective IAM roles and query data using the SQL query editor to their external schemas inside Amazon Redshift.

Lake Formation default security settings

To maintain backward compatibility with AWS Glue, Lake Formation has the following initial security settings:

The super permission is granted to the group IAMAllowedPrincipals on all existing Data Catalog resources.
Settings to use only IAM access control are enabled for new Data Catalog resources.

To change security settings, see Changing the Default Security Settings for Your Data Lake.

Note: Leave the default settings as is until you’re ready to move completely to the Lake Formation permission model. You can update settings at a database level if you want permissions set by Lake Formation to take effect. For more details about upgrades, refer to Upgrading AWS Glue Data Permissions to the AWS Lake Formation Model.

We don’t recommend reverting back from the Lake Formation permission model to an IAM-only permission model. You may also want to first deploy the solution in a new test account.

Prerequisites

To set up this solution, you need basic familiarity with the AWS Management Console, an AWS account, and access to the following AWS services:

AWS CloudFormation
AWS Glue
IAM
Lake Formation
AWS Lambda
Amazon Redshift
Amazon S3
AWS Step Functions
Amazon Virtual Private Cloud (Amazon VPC)

Create the data lake administrator

Data lake administrators are initially the only IAM users or roles that can grant Lake Formation permissions on data locations and Data Catalog resources to any principal.

To set up an IAM user as a data lake administrator, add the provided inline policy to the IAM user or IAM role you use to provision the resources for this blog solution. For more details, refer to Create a Data Lake Administrator.

On the IAM console, choose Users, and choose the IAM user who you want to designate as the data lake administrator.

Choose Add an inline policy on the Permissions tab and add the following policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataLakeSettings",
                "lakeformation:PutDataLakeSettings"
            ],
            "Resource": "*"
        }
    ]
}

Provide a policy name.
Review and save your settings.

Note: If you’re using an existing administrator user/role, you may have this already provisioned.

Note: The CloudFormation template doesn’t work if you skip the below step.

If this is your first time on the Lake Formation console, select Add myself and choose Get started.

You can also add yourself as data lake administrator by going to Administrative roles and tasks under Permissions, select Choose administrators, and adding yourself as an administrator if you missed this in the initial welcome screen.

Provision resources with CloudFormation

In this step, we create the solution resources using a CloudFormation template. The template performs the following actions:

Creates an S3 bucket to copy sample data files and SQL scripts
Registers the S3 data lake location with Lake Formation
Creates IAM roles and policies as needed for the environment
Assigns principals (IAM roles) to handle data lake settings
Creates Lambda and Step Functions resources to load necessary data
Runs AWS Glue crawler jobs to create Data Catalog tables
Configures Lake Formation permissions
Creates an Amazon Redshift cluster
Runs a SQL script to create the database group, database user, and external schemas for the admin, marketing, and finance groups

To create your resources, complete the following steps:

Launch the provided template in AWS Region us-east-1.
Choose Next.
For Stack name, you can keep the default stack name or change it.
For DbPassword, provide a secure password instead of using the default provided.
For InboundTraffic, change the IP address range to your local machine’s IP address in CIDR format instead of using the default.
Choose Next.
Choose Next again until you get to the review page.
Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
Choose Create stack.

The stack takes approximately 10 minutes to deploy successfully. When it’s complete, you can view the outputs on the AWS CloudFormation console.

Update Lake Formation default settings

You also need to update the default settings at the Lake Formation database level. This makes sure that the Lake Formation permissions the CloudFormation template sets up during provisioning can take effect over the default settings.

On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
Choose the database you created with the CloudFormation template.
Choose Edit.
Deselect Use only IAM access control for new tables in the database.
Choose Save.

This action is important because it removes the IAM control model from this database and allows only Lake Formation to take security grant/revoke access to it. This step makes sure other steps in this solution are successful.

Choose Databases in the navigation pane.
Select the same database.
On the Actions menu, choose View permissions.

You can review the permissions enabled for this database.

Select the IAMAllowedPrincipals group and choose Revoke to remove default permission settings for this individual database.

The IAMAllowedPrincipal row no longer appears in the list on the Permissions page.

Similarly, we need to remove the IAMAllowedPrincipal group at the table level. The CloudFormation template created six tables for this database. Let’s see how to use data lake permissions to remove access at the table level.

On the Lake Formation console, choose Data lake permissions in the navigation pane.
Filter by Principal:IAMAllowedPrincipals and Database:<<database name>>.

You can review all the tables we need to update permissions for.

Select each table one by one and choose Revoke.

With these steps, we’ve made sure that the default settings at the Lake Formation account level are still in place, and only manually updated for the database and tables we’re going to work with in this post. When you’re ready to move completely to a Lake Formation permission model, you can update the settings at the account level instead of individually updating them. For more details, see Change the default permission model.

Validate the provisioned resources

The CloudFormation template provisions many resources automatically to create your environment. In this section, we check some of the key resources to understand them better.

Lake Formation resources

On the Lake Formation console, check that a new data lake location is registered with an IAM role on the Data lake locations page.

This is the IAM role any integrated service like Amazon Redshift assumes to access data on the registered Amazon S3 location. This integration happens out of the box when the right roles and policies are applied. For more details, see Requirements for Roles Used to Register Locations.

Check the Administrative roles and tasks page confirm that the logged-in user is added as the data lake administrator and IAMAllowedPrincipals is added as database creator.

Then check the tables that the AWS Glue crawlers created in the Data Catalog database. These tables are logical entities, because the data is in an Amazon S3 location. After you create these objects, you can access them via different services.

Lastly, check permissions set by the template using the Lake Formation permission model on the tables to be accessed by finance and marketing users from Amazon Redshift.

The following screenshot shows that the finance role has access to all columns for the store and item tables, but only the listed columns for the store_sales table.

Similarly, you can review access for the marketing role, which has access to all columns in the customer_activity and store_sales tables.

Amazon S3 resources

The CloudFormation template creates two S3 buckets:

data-lake – Contains the data used for this post
script – Contains the SQL which we use to create Amazon Redshift database objects

Open the script bucket to see the scripts. You can download and open them to view the SQL code used.

The setup_lakeformation_demo.sql script gives you the SQL code to create the external database schema and assign different roles for data governance purposes. The external schema is for AWS Glue Data Catalog-based objects that point to data in the data lake. We then grant access to different database groups and users to manage security for finance and marketing users.

The scripts run in the following order:

sp_create_db_group.sql
sp_create_db_user.sql
setup_lakeformation_demo.sql

Amazon Redshift resources

On the Amazon Redshift console, choose Clusters in the navigation pane and choose the cluster you created with the CloudFormation template. Then choose the Properties tab.

The Cluster permissions section lists three attached roles. The template used the admin role to provision Amazon Redshift database-level objects. The finance role is attached to the finance schema in Amazon Redshift, and the marketing role is attached to the marketing schema.

Each of these roles are given permissions in such a way that they can use the Amazon Redshift query editor to query Data Catalog tables using Redshift Spectrum. For more details, see Using Redshift Spectrum with AWS Lake Formation and Query the Data in the Data Lake Using Amazon Redshift Spectrum.

Query the data

We use Amazon Redshift query editor v2 to query the external schema and Data Catalog tables (external tables). The external schema is already created as part of the CloudFormation template. When the external schema is created using the Data Catalog, the tables in the database are automatically created and are available through Amazon Redshift as external tables.

On the Amazon Redshift console, choose Query editor v2.
Choose Configure account.
Choose the database cluster.
For Database, enter dev.
For User name, enter awsuser.
For Authentication, select Temporary credentials.
Choose Create connection.

When you’re connected and logged in as administrator user, you can see both local and external schemas and tables, as shown in the following screenshot.

Validate role-based Lake formation permissions in Amazon Redshift

Next, we validate how the Lake Formation security settings work for the marketing and finance users.

In the query editor, choose (right-click) the database connection.
Choose Edit connection.
For User name, enter marketing_ro.
Choose Edit connection.
After connected as maketing_ro, choose the dev database under the cluster and navigate to the customer_activity table.
Choose the refresh icon.
Repeat these steps to edit the connection and update the user to finance_ro.
Try again to refresh the dev database.

As expected, this user only has access to the allowed schema and tables.

With this solution, you can segregate different users at the schema level and use Lake Formation to make sure they can only see the tables and columns their role allows.

Column-level security with Lake Formation permissions

Lake Formation also allows you to set which columns a principal can or can’t see within a table. For example, when you select store_sales as the marketing_ro user, you see many columns, like customer_purchase_estimate. However, as the finance_ro user, you don’t see these columns.

Manual access control via the Lake Formation console

In this post, we’ve been working with a CloudFormation template-based environment, which is an automated way to create environment templates and simplify operations.

In this section, we show how you can set up all the configurations through the console, and we use another table as an example to walk you through the steps.

As demonstrated in previous steps, the marketing user in this environment has all column access to the tables customer_activity and store_sales in the external schema retail_datalake_marketing. We change some of that manually to see how it works using the console.

On the Lake Formation console, choose Data lake permissions.
Filter by the principal RedshiftMarketingRole.
Select the principal for the store_sales table and choose Revoke.
Confirm by choosing Revoke again.

A success message appears, and the permission row is no longer listed.

Choose Grant to configure a new permission level for the marketing user on the store_sales table at the column level.
Select IAM users and roles and choose your role.
In the LF-Tags or catalog resources section, select Named data catalog resources.
For Databases, choose your database.
For Tables, choose the store_sales table.
For Table permissions¸ check Select.
In the Data permissions section, select Simple column-based access.
Select Exclude columns.
Choose the columns as shown in the following screenshot.
Choose Grant.

We now query the table from Amazon Redshift again to confirm that the effective changes match the controls placed by Lake Formation. In the following query, we select a column that isn’t authorized:

/* Selecting columns not authorized will result in error. */
select s_country, ss_net_profit from retail_datalake_marketing.store_sales;

As expected, we get an error.

Clean up

Clean up resources created by the CloudFormation template to avoid unnecessary cost to your AWS account. You can delete the CloudFormation stack by selecting the stack on the AWS CloudFormation console and choosing Delete. This action deletes all the resources it provisioned. If you manually updated a template-provisioned resource, you may see some issues during clean-up, and you need to clean these up manually.

Summary

In this post, we showed how you can integrate Lake Formation with Amazon Redshift to seamlessly control access to Amazon S3 data lake. We also demonstrated how to query your data lake using Redshift Spectrum and external tables. This is a powerful mechanism that helps you build a modern data architecture to easily query data on your data lake and data warehouses together. We also saw how you can use CloudFormation templates to automate the resource creation with infrastructure as code. You can use this to simplify your operations, especially when you want replicate the resource setup from development to production landscape during your project cycles.

Finally, we covered how data lake administrators can manually control search on data catalog objects and grant or revoke access at the database, table, and column level. We encourage you to try the steps we outlined in this post and use the CloudFormation template to set up security in Lake Formation to control data lake access from Redshift Spectrum.

In the second post of this series, we focus on how you can take this concept and apply it across accounts using a Lake Formation data-sharing feature in a hub-and-spoke topography.

About the Authors

Vaibhav Agrawal is an Analytics Specialist Solutions Architect at AWS. Throughout his career, he has focused on helping customers design and build well-architected analytics and decision support platforms.

Rajesh Francis is a Senior Analytics Customer Experience Specialist at AWS. He specializes in Amazon Redshift and focuses on helping to drive AWS market and technical strategy for data warehousing and analytics services. Rajesh works closely with large strategic customers to help them adopt our new services and features, develop long-term partnerships, and feed customer requirements back to our product development teams to guide our product roadmap.

Creating a Multi-Region Application with AWS Services – Part 2, Data and Replication

2022-01-12 Joe Chapman

Post Syndicated from Joe Chapman original https://aws.amazon.com/blogs/architecture/creating-a-multi-region-application-with-aws-services-part-2-data-and-replication/

In Part 1 of this blog series, we looked at how to use AWS compute, networking, and security services to create a foundation for a multi-Region application.

Data is at the center of many applications. In this post, Part 2, we will look at AWS data services that offer native features to help get your data where it needs to be.

In Part 3, we’ll look at AWS application management and monitoring services to help you build, monitor, and maintain a multi-Region application.

Considerations with replicating data

Data replication across the AWS network can happen quickly, but we are still limited by the speed of light. For this reason, data consistency must be considered when building a multi-Region application. Generally speaking, the longer a physical distance is, the longer it will take the data to get there.

When building a distributed system, consider the consistency, availability, partition tolerance (CAP) theorem. This theorem states that an application can only pick 2 out of the 3, and tradeoffs should be considered.

Consistency – all clients always have the same view of data
Availability – all clients can always read and write data
Partition Tolerance – the system will continue to work despite physical partitions

Achieving consistency and availability is common for single-Region applications. For example, when an application connects to a single in-Region database. However, this becomes more difficult with multi-Region applications due to the latency added by transferring data over long distances. For this reason, highly distributed systems will typically follow an eventual consistency approach, favoring availability and partition tolerance.

Replicating objects and files

To ensure objects are in multiple Regions, Amazon Simple Storage Service (Amazon S3) can be set up to replicate objects across AWS Regions automatically with one-way or two-way replication. A subset of objects in an S3 bucket can be replicated with S3 replication rules. If low replication lag is critical, S3 Replication Time Control can help meet requirements by replicating 99.99% of objects within 15 minutes, and most within seconds. To monitor the replication status of objects, Amazon S3 events and metrics will track replication and can send an alert if there’s an issue.

Traditionally, each S3 bucket has its own single, Regional endpoint. To simplify connecting to and managing multiple endpoints, S3 Multi-Region Access Points create a single global endpoint spanning multiple S3 buckets in different Regions. When applications connect to this endpoint, it will route over the AWS network using AWS Global Accelerator to the bucket with the lowest latency. Failover routing is also automatically handled if the connectivity or availability to a bucket changes.

For files stored outside of Amazon S3, AWS DataSync simplifies, automates, and accelerates moving file data across Regions and accounts. It supports homogeneous and heterogeneous file migrations across Elastic File System (Amazon EFS), Amazon FSx, AWS Snowcone, and Amazon S3. It can even be used to sync on-premises files stored on NFS, SMB, HDFS, and self-managed object storage to AWS for hybrid architectures.

File and object replication should be expected to be eventually consistent. The rate at which a given dataset can transfer is a function of the amount of data, I/O bandwidth, network bandwidth, and network conditions.

Copying backups

Scheduled backups can be set up with AWS Backup, which automates backups of your data to meet business requirements. Backup plans can automate copying backups to one or more AWS Regions or accounts. A growing number of services are supported, and this can be especially useful for services that don’t offer real-time replication to another Region such as Amazon Elastic Block Store (Amazon EBS) and Amazon Neptune.

Figure 1 shows how these data transfer services can be combined for each resource.

Figure 1. Storage replication services

Spanning non-relational databases across Regions

Amazon DynamoDB global tables provide multi-Region and multi-writer features to help you build global applications at scale. A DynamoDB global table is the only AWS managed offering that allows for multiple active writers in a multi-Region topology (active-active and multi-Region). This allows for applications to read and write in the Region closest to them, with changes automatically replicated to other Regions.

Global reads and fast recovery for Amazon DocumentDB (with MongoDB compatibility) can be achieved with global clusters. These clusters have a primary Region that handles write operations. Dedicated storage-based replication infrastructure enables low-latency global reads with a lag of typically less than one second.

Keeping in-memory caches warm with the same data across Regions can be critical to maintain application performance. Amazon ElastiCache for Redis offers global datastore to create a fully managed, fast, reliable, and secure cross-Region replica for Redis caches and databases. With global datastore, writes occurring in one Region can be read from up to two other cross-Region replica clusters – eliminating the need to write to multiple caches to keep them warm.

Spanning relational databases across Regions

For applications that require a relational data model, Amazon Aurora global database provides for scaling of database reads across Regions in Aurora PostgreSQL-compatible and MySQL-compatible editions. Dedicated replication infrastructure utilizes physical replication to achieve consistently low replication lag that outperforms the built-in logical replication database engines offer, as shown in Figure 2.

Figure 2. SysBench OLTP (write-only) stepped every 600 seconds on R4.16xlarge

With Aurora global database, one primary Region is designated as the writer, and secondary Regions are dedicated to reads. Aurora MySQL supports write forwarding, which forwards write requests from a secondary Region to the primary Region to simplify logic in application code. Failover testing can happen by utilizing managed planned failover, which will change the active write cluster to another Region while keeping the replication topology intact. All databases discussed in this post employ eventual consistency when used across Regions, but Aurora PostgreSQL has an option to set the maximum a replica lag allowed with managed recovery point objective (managed RPO).

Logical replication, which utilizes a database engine’s built-in replication technology, can be set up for Amazon Relational Database Service (Amazon RDS) for MariaDB, MySQL, Oracle, PostgreSQL, and Aurora databases. A cross-Region read replica will receive these changes from the writer in the primary Region. For applications built on RDS for Microsoft SQL Server, cross-Region replication can be achieved by utilizing the AWS Database Migration Service. Cross-Region replicas allow for quicker local reads and can reduce data loss and recovery times in the case of a disaster by being promoted to a standalone instance.

For situations where a longer RPO and recovery time objective (RTO) are acceptable, backups can be copied across Regions. This is true for all of the relational and non-relational databases mentioned in this post, except for ElastiCache for Redis. Amazon Redshift can also automatically do this for your data warehouse. Backup copy times will vary depending on size and change rates.

A purpose-built database strategy offers many benefits, Figure 3 forms a purpose-built global database architecture.

Figure 3. Purpose-built global database architecture

Summary

Data is at the center of almost every application. In this post, we reviewed AWS services that offer cross-Region data replication to get your data where it needs to be quickly. Whether you need faster local reads, an active-active database, or simply need your data durably stored in a second Region, we have a solution for you. In the 3rd and final post of this series, we’ll cover application management and monitoring features.

Ready to get started? We’ve chosen some AWS Solutions, AWS Blogs, and Well-Architected labs to help you!

Define error handling for Amazon Redshift Spectrum data

2022-01-10 Ahmed Shehata

Post Syndicated from Ahmed Shehata original https://aws.amazon.com/blogs/big-data/define-error-handling-for-amazon-redshift-spectrum-data/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift Spectrum allows you to query open format data directly from the Amazon Simple Storage Service (Amazon S3) data lake without having to load the data into Amazon Redshift tables. With Redshift Spectrum, you can query open file formats such as Apache Parquet, ORC, JSON, Avro, and CSV. This feature of Amazon Redshift enables a modern data architecture that allows you to query all your data to obtain more complete insights.

Amazon Redshift has a standard way of handling data errors in Redshift Spectrum. Data file fields containing any special character are set to null. Character fields longer than the defined table column length get truncated by Redshift Spectrum, whereas numeric fields display the maximum number that can fit in the column. With this newly added user-defined data error handling feature in Amazon Redshift Spectrum, you can now customize data validation and error handling.

This feature provides you with specific methods for handling each of the scenarios for invalid characters, surplus characters, and numeric overflow while processing data using Redshift Spectrum. Also, the errors are captured and visible in the newly created dictionary view SVL_SPECTRUM_SCAN_ERROR. You can even cancel the query when a defined threshold of errors has been reached.

Prerequisites

To demonstrate Redshift Spectrum user-defined data handling, we build an external table over a data file with soccer league information and use that to show different data errors and how the new feature offers different options in dealing with those data errors. We need the following prerequisites:

An AWS account
An S3 bucket
An AWS Glue Data Catalog database
An Amazon Redshift cluster with an AWS Identity and Access Management (IAM) role that can read from the S3 bucket
A data file for the demonstration (for this post, our data file contains soccer clubs, their leagues, and their rankings)
A SQL script to create an external schema and external table
A SQL script for the needed DDL for user-defined data handling tests

Solution overview

We use the data file for soccer leagues to define an external table to demonstrate different data errors and the different handling techniques offered by the new feature to deal with those errors. The following screenshot shows an example of the data file.

Note the following in the example:

The club name can typically be longer than 15 characters
The league name can be typically be longer than 20 characters
The club name Barcelôna includes an invalid character
The column nspi includes values that are bigger than SMALLINT range

Then we create an external table (see the following code) to demonstrate the new user-defined handling:

We define the club_name and league_name shorter than they should to demonstrate handling of surplus characters
We define the column league_nspi as SMALLINT to demonstrate handling of numeric overflow
We use the new table property data_cleansing_enabled to enable custom data handling

CREATE EXTERNAL TABLE schema_spectrum_uddh.soccer_league
(
  league_rank smallint,
  prev_rank   smallint,
  club_name   varchar(15),
  league_name varchar(20),
  league_off  decimal(6,2),
  league_def  decimal(6,2),
  league_spi  decimal(6,2),
  league_nspi smallint
)
ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY ',' 
    LINES TERMINATED BY '\n\l'
stored as textfile
LOCATION 's3://uddh-soccer/league/'
table properties ('skip.header.line.count'='1','data_cleansing_enabled'='true');

Invalid character data handling

With the introduction of the new table and column property invalid_char_handling, you can now choose how you deal with invalid characters in your data. The supported values are as follows:

DISABLED – Feature is disabled (no handling).
SET_TO_NULL – Replaces the value with null.
DROP_ROW – Drops the whole row.
FAIL – Fails the query when an invalid UTF-8 value is detected.
REPLACE – Replaces the invalid character with a replacement. With this option, you can use the newly introduced table property replacement_char.

The table property can work over the whole table or just a column level. Additionally, you can define the table property during create time or later by altering the table.

When you disable user-defined handling, Redshift Spectrum by default sets the value to null (similar to SET_TO_NULL):

alter table schema_spectrum_uddh.soccer_league
set table properties ('invalid_char_handling'='DISABLED');

When you change the setting of the handling to DROP_ROW, Redshift Spectrum simply drops the row that has an invalid character:

alter table schema_spectrum_uddh.soccer_league
set table properties ('invalid_char_handling'='DROP_ROW');

When you change the setting of the handling to FAIL, Redshift Spectrum fails and returns an error:

alter table schema_spectrum_uddh.soccer_league
set table properties ('invalid_char_handling'='FAIL');

When you change the setting of the handling to REPLACE and choose a replacement character, Redshift Spectrum replaces the invalid character with the chosen replacement character:

alter table schema_spectrum_uddh.soccer_league
set table properties ('invalid_char_handling'='REPLACE','replacement_char'='?');

Surplus character data handling

As mentioned earlier, we defined the columns club_name and league_name shorter than the actual contents of the corresponding fields in the data file.

With the introduction of the new table property surplus_char_handling, you can choose from multiple options:

DISABLED – Feature is disabled (no handling)
TRUNCATE – Truncates the value to the column size
SET_TO_NULL – Replaces the value with null
DROP_ROW – Drops the whole row
FAIL – Fails the query when a value is too large for the column

When you disable the user-defined handling, Redshift Spectrum defaults to truncating the surplus characters (similar to TRUNCATE):

alter table schema_spectrum_uddh.soccer_league
set table properties ('surplus_char_handling' = 'DISABLED');

When you change the setting of the handling to SET_TO_NULL, Redshift Spectrum simply sets to NULL the column value of any field that is longer than the defined length:

alter table schema_spectrum_uddh.soccer_league
set table properties ('surplus_char_handling' = 'SET_TO_NULL');

When you change the setting of the handling to DROP_ROW, Redshift Spectrum drops the row of any field that is longer than the defined length:

alter table schema_spectrum_uddh.soccer_league
set table properties ('surplus_char_handling' = 'DROP_ROW');

When you change the setting of the handling to FAIL, Redshift Spectrum fails and returns an error:

alter table schema_spectrum_uddh.soccer_league
set table properties ('surplus_char_handling' = 'FAIL');

We need to disable the user-defined data handling for this data error before demonstrating the next type of error:

alter table schema_spectrum_uddh.soccer_league
set table properties ('surplus_char_handling' = 'DISABLED');

Numeric overflow data handling

For this demonstration, we defined league_nspi intentionally SMALLINT (with a range to hold from -32,768 to +32,767) to show the available options for data handling.

With the introduction of the new table property numeric_overflow_handling, you can choose from multiple options:

DISABLED – Feature is disabled (no handling)
SET_TO_NULL – Replaces the value with null
DROP_ROW – Replaces each value in the row with NULL
FAIL – Fails the query when a value is too large for the column

When we look at the source data, we can observe that the top five countries have more points than the SMALLINT field can handle.

When you disable the user-defined handling, Redshift Spectrum defaults to the maximum number the numeric data type can handle, for our case SMALLINT can handle up to 32767:

alter table schema_spectrum_uddh.soccer_league
set table properties ('numeric_overflow_handling' = 'DISABLED');

When you choose SET_TO_NULL, Redshift Spectrum sets to null the column with numeric overflow:

alter table schema_spectrum_uddh.soccer_league
set table properties ('numeric_overflow_handling' = 'SET_TO_NULL');

When you choose DROP_ROW, Redshift Spectrum drops the row containing the column with numeric overflow:

alter table schema_spectrum_uddh.soccer_league
set table properties ('numeric_overflow_handling' = 'DROP_ROW');

When you choose FAIL, Redshift Spectrum fails and returns an error:

alter table schema_spectrum_uddh.soccer_league
set table properties ('numeric_overflow_handling' = 'FAIL');

We need to disable the user-defined data handling for this data error before demonstrating the next type of error:

alter table schema_spectrum_uddh.soccer_league
set table properties ('numeric_overflow_handling' = 'DISABLED');

Stop queries at MAXERROR threshold

You can also choose to stop the query if it reaches a certain threshold in errors by using the newly introduced parameter spectrum_query_maxerror:

Set spectrum_query_maxerror to 7;

The following screenshot shows that the query ran successfully.

However, if you decrease this threshold to a lower number, the query fails because it reached the preset threshold:

Set spectrum_query_maxerror to 6;

Error logging

With the introduction of the new user-defined data handling feature, we also introduced the new view svl_spectrum_scan_error, which allows you to view a useful sample set of the logs of errors. The table contains the query, file, row, column, error code, handling action that was applied, as well as the original value and the modified (resulting) value. See the following code:

SELECT *
FROM svl_spectrum_scan_error
where location = 's3://uddh-soccer/league/spi_global_rankings.csv'

Clean up

To avoid incurring future charges, complete the following steps:

Delete the Amazon Redshift cluster created for this demonstration. If you were using an existing cluster, drop the created external table and external schema.
Delete the S3 bucket.
Delete the AWS Glue Data Catalog database.

Conclusion

In this post, we demonstrated Redshift Spectrum’s newly added feature of user-defined data error handling and showed how this feature provides the flexibility to take a user-defined approach to deal with data exceptions in processing external files. We also demonstrated how the logging enhancements provide transparency on the errors encountered in external data processing without needing to write additional custom code.

We look forward to hearing from you about your experience. If you have questions or suggestions, please leave a comment.

About the Authors

Ahmed Shehata is a Data Warehouse Specialist Solutions Architect with Amazon Web Services, based out of Toronto.

Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift. He is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.

Set up cross-account audit logging for your Amazon Redshift cluster

2022-01-05 Milind Oke

Post Syndicated from Milind Oke original https://aws.amazon.com/blogs/big-data/set-up-cross-account-audit-logging-for-your-amazon-redshift-cluster/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. With Amazon Redshift, you can analyze all your data to derive holistic insights about your business and your customers. One of the best practices of modern application design is to have centralized logging. Troubleshooting application problems is easy when you can correlate all your data together.

When you enable audit logging, Amazon Redshift logs information about connections and user activities in the database. These logs help you monitor the database for security and troubleshooting purposes, a process called database auditing. The logs are stored in Amazon Simple Storage Service (Amazon S3) buckets. These provide convenient access with data security features for users who are responsible for monitoring activities in the database.

If you want to establish a central audit logging account to capture audit logs generated by Amazon Redshift clusters located in separated AWS accounts, you can use the solution in this post to achieve cross-account audit logging for Amazon Redshift. As of this writing, the Amazon Redshift console only lists S3 buckets from the same account (in which the Amazon Redshift cluster is located) while enabling audit logging, so you can’t set up cross-account audit logging using the Amazon Redshift console. In this post, we demonstrate how to configure cross-account audit logging using the AWS Command Line Interface (AWS CLI).

Prerequisites

For this walkthrough, you must have the following prerequisites:

Two AWS accounts: one for analytics and one for centralized logging
A provisioned Amazon Redshift cluster in the analytics AWS account
An S3 bucket in the centralized logging AWS account
Access to the AWS CLI

Overview of solution

As a general security best practice, we recommend making sure that Amazon Redshift audit logs are sent to the correct S3 buckets. The Amazon Redshift service team has introduced additional security controls in the event that the destination S3 bucket resides in a different account from the Amazon Redshift cluster owner account. For more information, see Bucket permissions for Amazon Redshift audit logging.

This post uses the AWS CLI to establish cross-account audit logging for Amazon Redshift, as illustrated in the following architecture diagram.

For this post, we established an Amazon Redshift cluster named redshift-analytics-cluster-01 in the analytics account in Region us-east-2.

We also set up an S3 bucket named redshift-cluster-audit-logging-xxxxxxxxxxxx in the centralized logging account for capturing audit logs in Region us-east-1.

Now you’re ready to complete the following steps to set up the cross-account audit logging:

Create AWS Identity and Access Management (IAM) policies in the analytics AWS account.
Create an IAM user and attach the policies you created.
Create an S3 bucket policy in the centralized logging account to allow Amazon Redshift to write audit logs to the S3 bucket, and allow the IAM user to enable audit logging for the S3 bucket.
Configure the AWS CLI.
Enable audit logging in the centralized logging account.

Create IAM policies in the analytics account

Create two IAM policies in the analytics account that has the Amazon Redshift cluster.

The first policy is the Amazon Redshift access policy (we named the policy redshift-audit-logging-redshift-policy). This policy allows the principal to whom it’s attached to enable, disable, or describe Amazon Redshift logs. It also allows the principal to describe the Amazon Redshift cluster. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "redshift:EnableLogging",
                "redshift:DisableLogging",
                "redshift:DescribeLoggingStatus"
            ],            
"Resource": "arn:aws:redshift:us-east-2:xxxxxxxxxxxx:cluster: redshift-analytics-cluster-01"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "redshift:DescribeClusters",
            "Resource": "*"
        }
    ]
}

The second policy is the Amazon S3 access policy (we named the policy redshift-audit-logging-s3-policy). This policy allows the principal to whom it’s attached to write to the S3 bucket in the centralized logging account. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:PutObjectAcl"
            ],
            "Resource": [
                "arn:aws:s3:::redshift-cluster-audit-logging-xxxxxxxxxxxx",
                "arn:aws:s3:::redshift-cluster-audit-logging-xxxxxxxxxxxx/*"
            ]
        }
    ]
}

Create an IAM user and attach the policies

Create an IAM user (we named it redshift-audit-logging-user) with programmatic access in the analytics account and attach the policies you created to it.

Save the generated AWS secret key and secret access key credentials for this user securely. We use these credentials in the next step.

Create an S3 bucket policy for the S3 bucket in the centralized logging AWS account

Add the following bucket policy to the audit logging S3 bucket redshift-cluster-audit-logging-xxxxxxxxxxxx in the centralized logging account. This policy serves two purposes: it allows Amazon Redshift to write audit logs to the S3 bucket, and it allows the IAM user to enable audit logging for the S3 bucket. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Put bucket policy needed for audit logging",
            "Effect": "Allow",
            "Principal": {
                "Service": "redshift.amazonaws.com"
            },
            "Action": [
                "s3:PutObject",
                "s3:GetBucketAcl"
            ],
            "Resource": [
                "arn:aws:s3:::redshift-cluster-audit-logging-xxxxxxxxxxxx",
                "arn:aws:s3:::redshift-cluster-audit-logging-xxxxxxxxxxxx/*"
            ]
        },
        {
            "Sid": "Put IAM User bucket policy needed for audit logging",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::xxxxxxxxxxxx:user/redshift-audit-logging-user"
            },
            "Action": "s3:PutObject",
            "Resource": [
                "arn:aws:s3:::redshift-cluster-audit-logging-xxxxxxxxxxxx",
                "arn:aws:s3:::redshift-cluster-audit-logging-xxxxxxxxxxxx/*"
            ]
        }
    ]
}

Note that you have to modify the service name redshift.amazonaws.com to look like redshift.region.amazonaws.com if the cluster is in one of the opt-in Regions.

Configure the AWS CLI

As part of this step, you need to install and configure the AWS CLI. After you install the AWS CLI, configure it to use the IAM user credentials that we generated earlier. We perform the next steps based on the permissions attached to the IAM user we created.

Enable audit logging in the centralized logging account

Run the AWS CLI command to enable audit logging for the Amazon Redshift cluster in an S3 bucket in the centralized logging AWS account. In the following code, provide the Amazon Redshift cluster ID, S3 bucket name, and the prefix applied to the log file names:

aws redshift enable-logging --cluster-identifier <ClusterName> --bucket-name <BucketName> --s3-key-prefix <value>

The following screenshot shows that the cross-account Amazon Redshift audit logging is successfully set up.

A test file is also created by AWS to ensure that the log files can be successfully written into the S3 bucket. The following screenshot shows the test file was created successfully in the S3 bucket under the rsauditlog1 prefix.

After some time, we started seeing the audit logs created in the S3 bucket. By default, Amazon Redshift organizes the log files in the S3 bucket using the following bucket and object structure:

AWSLogs/AccountID/ServiceName/Region/Year/Month/Day/AccountID_ServiceName_Region_ClusterName_LogType_Timestamp.gz

Amazon Redshift logs information in the following log files:

Connection log – Logs authentication attempts, connections, and disconnections
User log – Logs information about changes to database user definitions
User activity log – Logs each query before it’s run on the database

The following screenshot shows that log files, such as connection logs and user activity logs, are now being created in the centralized logging account in us-east-1 from the Amazon Redshift cluster in the analytics account in us-east-2.

For more details on analyzing Amazon Redshift audit logs, refer to below mentioned blogs

Clean up

To avoid incurring future charges, you can delete all the resources you created while following the steps in this post.

Conclusion

In this post, we demonstrated how to accomplish cross-account audit logging for an Amazon Redshift cluster in one account to an Amazon S3 bucket in another account. Using this solution, you can establish a central audit logging account to capture audit logs generated by Amazon Redshift clusters located in separated AWS accounts.

Try this solution to achieve cross-account audit logging for Amazon Redshift and leave a comment.

About the Authors

Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift.

Dipankar Kushari is a Sr. Analytics Solutions Architect with AWS.

Pankaj Pattewar is a Cloud Application Architect at Amazon Web Services. He specializes in architecting and building cloud-native applications and enables customers with best practices in their cloud journey.

Sudharshan Veerabatheran is a Cloud Support Engineer based out of Portland.

Federate access to Amazon Redshift using the JDBC browser plugin for Single Sign-on authentication with Microsoft Azure Active Directory

2022-01-04 Ilesh Garish

Post Syndicated from Ilesh Garish original https://aws.amazon.com/blogs/big-data/federate-access-to-amazon-redshift-using-the-jdbc-browser-plugin-for-single-sign-on-authentication-with-microsoft-azure-active-directory/

Since 2020, Amazon Redshift has supported multi-factor authentication (MFA) to any SAML 2.0 compliant identity provider (IdP) in our JDBC and ODBC drivers. You can map the IdP user identity and group memberships in order to control authorization for database objects in Amazon Redshift. This simplifies administration by enabling you to manage user access in a central location, reducing the overhead of creating and maintaining users in the database in conjunction with the IdP.

Recently, we helped a customer who was building their data warehouse on Amazon Redshift and had the requirement of using Microsoft Azure Active Directory (Azure AD) as their corporate IdP with MFA. This post illustrates how to set up federation using Azure AD and AWS Identity and Access Management (IAM). Azure AD manages the users and provides federated access to Amazon Redshift using IAM.

Prerequisites

This post assumes that you have the following:

An Azure AD account
An Amazon Redshift cluster (for instructions, see Step 2: Create a sample Amazon Redshift cluster)
Permissions for administering IAM identities and AWS Management Console access
A database you want to permit access to (for this post, we use the default database dev)

Solution overview

This post consists of the following three sections to implement the solution:

Set up the Azure Enterprise non-gallery application using single sign-on (SSO) with SAML.
Set up the IAM provider and roles, which includes the following steps:
1. Create the SAML identity provider.
2. Create an IAM role for access to the Amazon Redshift cluster.
3. Create an IAM provider and an IAM role to use SAML-based federation.
4. Test the SSO setup.
Configure the JDBC client to use Azure AD user credentials using a browser to log in to the Amazon Redshift cluster. This post uses a JDBC client, but you can use the same setup to support ODBC clients.

Set up an Azure Enterprise application

To set up an Azure Enterprise application to control Amazon Redshift access, complete the following steps:

Log in to Azure Portal and under Services, choose Enterprise applications.
Choose New application.
For Add an application, choose Non-gallery application.
For Name¸ enter Redshift.
Choose Add.
For Identifier (Entity ID), enter a string (it’s not used in the flow by default).
For ReplyURL, enter http://localhost/redshift/.
Choose Add new claim.
Configure your SAML claims as shown in the following table (for more information, see Configure SAML assertions for your IdP).

Claim name	Value
Unique user identifier (name ID)	`user.userprincipalname`
https://aws.amazon.com/SAML/Attributes/Role	`arn:aws:iam::111122223333:role/AzureSSO,arn:aws:iam::111122223333:saml-provider/AzureADProvider`
https://aws.amazon.com/SAML/Attributes/RoleSessionName	`user.userprincipalname`
https://redshift.amazon.com/SAML/Attributes/AutoCreate	`"true"`
https://redshift.amazon.com/SAML/Attributes/DbGroups	`user.assignedroles`
https://redshift.amazon.com/SAML/Attributes/DbUser	`user.userprincipalname`

In the Manage claim section, for Name, enter Role.
For Source attribute, enter your source, which includes your AWS account ID, IAM policy, and IAM provider.
On the Permissions page, add users or groups to your application (alternatively, grant universal admin consent for the entire organization).
Download your federation metadata.

You need the metadata to configure the IAM IdP. Check your IdP for how to download this document, because every IdP handles this differently.

On the App registration page, choose Authentication in the navigation pane.
In the Mobile and desktop applications section, add http://localhost/redshift/.
For Enable the following mobile and desktop flows, choose Yes.
On the Enterprise applications page, choose your application.
In the Set up Single Sign-On with SAML section, choose Edit.
Confirm the reply URL.
On the Users and groups page, add the necessary role or group.

Set up IAM to allow Azure AD users to access Amazon Redshift

In this section, you configure IAM to allow Azure AD users to access Amazon Redshift resources and get temporary credentials.

Sign in to the AWS Management Console as the admin account.
On the IAM console, choose Identity providers in the navigation pane.
Choose Create Provider.
For Provider Type, choose SAML.
For Provider name, enter a name for your provider.
For Metadata Document, choose the file you downloaded or saved from your IdP.
Choose Next Step.
Choose Create.

Now you set up your policy. For detailed instructions, see Federate Database User Authentication Easily with IAM and Amazon Redshift.

On the IAM console, choose Policies in the navigation pane.
Choose Create policy.

You’re directed to the Create policy page, where you can choose the Visual editor tab for step-by-step policy creation or the JSON tab to edit the policy in one step.

For this post, choose the JSON tab.
In the text box, enter the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "redshift:GetClusterCredentials",
            "Resource": [
                "arn:aws:redshift:us-west-1:your-account-number:dbname:cluster-identifier/dev",
                "arn:aws:redshift:us-west-1:your-account-number :dbuser:cluster-identifier/${redshift:DbUser}",
                "arn:aws:redshift:us-west-1:your-account-number :cluster:cluster-identifier"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:userid": "unique role ID:${redshift:DbUser}@companyemail.com"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": "redshift:CreateClusterUser",
            "Resource": "arn:aws:redshift:us-west-1:your-account-number:dbuser:cluster-identifier/${redshift:DbUser}"
        },
        {
            "Effect": "Allow",
            "Action": "redshift:JoinGroup",
            "Resource": "arn:aws:redshift:us-west-1:your-account-number:dbgroup:cluster-identifier/db_group"
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:DescribeClusters",
                "iam:ListRoles"
            ],
            "Resource": "*"
        }
    ]
}

In the preceding code, provide the following information:

The Region of your cluster (for this post, we use us-west-1).
The account number your cluster is on.
The Amazon Redshift cluster you want to grant users permission to (you can also enter * for all clusters under that account).
Your database name (for this post, we use dev; you can also enter * to allow access to all databases).
The unique ID of the IAM role you created (you can get this by running aws iam get-role --role-name Your_Role_Name in the terminal).
Your tenant or company email.
The database group you want to assign users to.
${redshift:DbUser} is replaced with whatever your IdP (Azure) specified for the SAML DbUser field for the user.

The first statement allows users to grab temporary credentials from the cluster if:

It’s on the specified cluster, in the correct account, in the Region specified.
The dbname the user is trying to connect to is dev.
The user trying to connect matches the DbUser specified in Azure.
The user is under the role specified by the unique role ID with the IAM account under your company’s email.

This all depends on your setup with IdP (Azure) configuration. If your employee’s email is [email protected], you need to set ${redshift:DbUser} to the super field that matches to the employee’s username johndoe and set the AWS SAML RoleSessionName field to be the super field that matches the employee’s email [email protected] to make this condition work.

If you set ${redshift:DbUser} to be the employee’s email, remove the @companyemail.com in the example code to match the RoleSessionName.

If you set the RoleSessionId to be just the employee’s username, remove the @companyemail.com in the example code to match the RoleSessionName.

In the Azure setup instructions, ${redshift:DbUser} and RoleSessionName are both set to the employee’s email, so you should remove @companyemail.com in the preceding code if you’re following these instructions closely. This post creates the user’s database username under their email and signs them in to AWS under this email.

The second statement allows users to create a database username under the specified conditions. In the preceding code, it restricts creation to ${redshift:DbUser}.

The third statement specifies what groups the user can join.

The final statement specifies what actions the user can perform on the resources. In the preceding code, users can call DescribeClusters to get cluster information, and IAM ListRoles to check which roles the user can assume. “Resource: “*” applies the preceding actions to any Amazon Redshift cluster the user has access to.

Choose Review policy.
For Name¸ enter a name for your policy.
For Description, enter an optional description.
For Summary, review your policy components and make sure to resolve any warnings or errors.
Choose Create policy.

Lastly, we create the IAM role.

In the navigation pane, choose Roles.
Choose Create role.
For Select type of trusted entity, choose SAML 2.0 federation.
For SAML provider, choose the provider you created (for this post, AzureTest).
For Attribute, leave at the default (SAML:aud).
For Value¸ enter http://localhost/redshift/.
Choose Next: Permissions.
Under Attach permissions policies, search for and select the policy you created.
Under Set permissions boundary¸ you can set advanced controls for user permissions (for this post, we don’t make any changes).
Choose Next: Tags.
Under Add tabs (optional), you can add key-value tags to better organize, track, or control access for this role. For this post, we don’t add any tags.
Choose Next: Review.
For Role name, enter a name for your role.
For Role description, enter an optional description.
For Trusted entities, verify the ARN of the provider you specified earlier is correct.
For Permissions boundary, verify that the settings you specified earlier (if any) are correct.
Choose Create role.

If you haven’t already, you’re now ready to create the Amazon Redshift cluster for the Azure AD users to connect to.

Connect through JDBC and run queries

You can use any application that can take in a JDBC driver to connect using Azure SSO, or even use a language like Java to connect using a script. For this post, we use SQL Workbench/J, which is a common application to connect to JDBC and run queries.

Install SQL Workbench/J if not already installed.
Start SQL Workbench/J.
On the Select Connection Profile page, choose the Add profile group icon.

Adding a folder is optional but helps keep things organized.

Name your folder (for this post, AzureAuth).
Choose the New connection profile icon.

Creating a new profile in your profile group is optional but helps keep things organized.

Name your profile (for this post, Azure).
Choose Manage Drivers.
Choose Amazon Redshift.
Choose the Open folder icon.
Choose the JDBC JAR file.
Choose OK.
On the Select Connection Profile page, for Driver, choose Amazon Redshift (com.amazon.redshift.jdbc.Driver).
For URL, enter the IAM JDBC URL with your cluster identifier, Region, and database name (for example, jdbc:redshift:iam://cluster-identifier:us-west-1/dev).

Alternatively, you can use the format jdbc:redshift:iam://<cluster-dns-here>:<cluster-port>/<your-DB-name-here>.

Choose Extended Properties.
Choose the Create new entry icon.
Enter the following information:
1. Property plugin_name with value com.amazon.redshift.plugin.BrowseAzureCredentialsProvider. This tells the driver what authentication method to choose. This should always be set to com.amazon.redshift.plugin.BrowserAzureCredentialsProvider.
2. Property idp_tenant with the value of your IdP tenant. This is the tenant name of your company configured on your IdP (Azure). This value can either be the tenant name or the tenant unique ID with hyphens (preferred). If you use a tenant name, it could cause uncertainty when setting up the application.
3. Property client_id with the value of your application client ID. This is the client ID with hyphens of the Amazon Redshift application you created when setting up your Azure SSO configurations.
Choose OK.
On the Select Connection Profile page, leave everything else at the default values and choose OK.

The driver opens the default browser with the SSO sign-in page.

After you sign in, you’re redirected to localhost with a success message.

Troubleshooting

If something goes wrong, logging is the first call to start an investigation.

You can add an extended property with the following code:

DSILogLevel=6
LogPath=<any existing directory>

Alternatively, use a connection string:

jdbc:redshift://<cluster_url>:<port>/<db>?DSILogLevel=6&LogPath=<any existing directory>

For an Unauthorized exception, check your authentication in Azure Portal, under Mobile and desktop applications.

For a PKX exception, first try to use the ssl=false extended property. The vanish exception means that the problem is in the SSL certificate between the cluster and client. If so, first try to use the latest driver and check if your cluster version is old. Then run your application with the “-Djavax.net.debug=all” key for JVM. This shows all the TLS traffic. Make sure the certification is there.

For the exception SAML error: Not authorized to perform sts:AssumeRoleWithSAML, you need to edit the IAM role trust relationship.

Change "StringEquals" to "StringLike" : { "saml:aud": "*" }, then save it and try again. Also check that saml:aud and replyURL in Azure are exactly the same. If they’re different, authentication fails and causes the same error.

Summary

Amazon Redshift makes it easy to integrate with third-party identity providers to provide centralized user management. In this post, we showed how to configure the Amazon Redshift browser-based plugin to use multi-factored authentication with Microsoft Azure Active Directory. You can follow these same steps to work with your SAML 2.0 compliant identity provider of choice.

About the Authors

Ilesh Garish is a Software Development Engineer at AWS. His role is to develop connectors for Amazon Redshift. Prior to AWS, he built database drivers for the Oracle RDBMS, TigerLogic XDMS, and OpenAccess SDK. He worked in the database internal technologies at San Francisco Bay Area startups.

Brandon Schur is a Senior Database Engineer at AWS. He focuses on performance tuning for MPP databases, drivers & connectivity, and integrations with AWS services and partners.

Optimize your analytical workloads using the automatic query rewrite feature of Amazon Redshift materialized views

2022-01-04 Harshida Patel

Post Syndicated from Harshida Patel original https://aws.amazon.com/blogs/big-data/optimize-your-analytical-workloads-using-the-automatic-query-rewrite-feature-of-amazon-redshift-materialized-views/

Amazon Redshift materialized views enable you to significantly improve performance of complex queries that are frequently run as part of your extract, load, and transform (ELT), business intelligence (BI), or dashboarding applications. Materialized views precompute and store the result sets of the SQL query in the view definition. Materialized views speed up data access, because the query doesn’t need to rerun the computation each time the query runs, which also reduces the resource consumption.

Amazon Redshift has the ability to automatically rewrite your SQL queries that don’t explicitly reference existing materialized views to use an existing materialized view if it will improve performance. This feature is valuable and, in some cases, the only option for performance optimization. Consider packaged ISV apps or even just reports— users often don’t have access to the SQL to optimize. In some cases, even if they do have access, the code or script is so old that nobody is familiar with it and you don’t know what regressions even a small change might introduce.

In this post, we describe how the automatic query rewrite feature works and some scenarios where you could take advantage of this feature. For information about the materialized view feature itself, refer to Speed up your ELT and BI queries with Amazon Redshift materialized views and Creating materialized views in Amazon Redshift.

All examples in this post are run on an 8 node ra3.4xlarge cluster with the 3 TB TPC-DS cloud benchmark dataset.

Let’s look at three different scenarios where the automatic query rewrite feature could help: optimizing joins between two large tables, optimizing joins for tables that have multiple join paths, and optimizing table scans.

Optimize joins between two large tables

There are many situations where you have two large tables that are joined frequently. In this case, creating a materialized view that joins these two tables could help improve the performance of those queries. Materialized views precompute the join and store the results so subsequent runs only need to retrieve the saved results; no need to run the expensive JOINs each time. With automatic query rewrite, none of the end-user queries have to be modified to refer to the materialized view. When creating the explain plan for the query, Amazon Redshift replaces the join between the two tables with the materialized view.

By default, the automatic query rewrite uses a materialized view only if it’s up to date and reflects all changes from its base tables. This means that the query isn’t rewritten to use the materialized view if the base tables have more recent updates that aren’t yet reflected in the materialized view.

For example, consider the following SQL query. The query joins two tables: store_sales (8,639,936,081 rows) and customer (30,000,000 rows):

SELECT 
cust.c_customer_id 
FROM store_sales sales
INNER JOIN customer cust
ON sales.ss_customer_sk = cust.c_customer_sk
GROUP BY cust.c_customer_id;

The query runs in 545,520 milliseconds; the following is the explain plan for the query:

XN HashAggregate  (cost=9602679386653.98..9602679386653.98 rows=29705556 width=20)
  ->  XN Hash Join DS_BCAST_INNER  (cost=375000.00..9602659714194.54 rows=7868983773 width=20)
        Hash Cond: (""outer"".ss_customer_sk = ""inner"".c_customer_sk)
        ->  XN Seq Scan on store_sales sales  (cost=0.00..86399365.12 rows=8245454518 width=4)
              Filter: (ss_customer_sk IS NOT NULL)
        ->  XN Hash  (cost=300000.00..300000.00 rows=30000000 width=24)
              ->  XN Seq Scan on customer cust  (cost=0.00..300000.00 rows=30000000 width=24)

Let’s create a materialized view that pre-computes the join between the store_sales and customer tables using the following SQL statement:

CREATE MATERIALIZED VIEW cust_store_sales
AS 
SELECT         
  cust.c_customer_id
, cust.c_first_name
, cust.c_last_name
, sales.ss_item_sk
, sales.ss_quantity
, cust.c_current_addr_sk
FROM  store_sales sales
INNER JOIN customer cust
ON sales.ss_customer_sk = cust.c_customer_sk;

Let’s now rerun the original query:

SELECT 
cust.c_customer_id 
FROM store_sales sales
INNER JOIN customer cust
ON sales.ss_customer_sk = cust.c_customer_sk
GROUP BY cust.c_customer_sk;

The query runs much faster (46,493 milliseconds). This is because of the automatic query rewrite feature, which has rewritten the preceding query to use the newly created materialized view instead of joining both tables. The explain plan for this query shows this change:

XN HashAggregate  (cost=103138905.60..103138905.60 rows=29705556 width=20)
  ->  XN Seq Scan on mv_tbl__cust_store_sales__0 derived_table1  (cost=0.00..82511124.48 rows=8251112448 width=20)

The original query run also consumed 1,263 CPU seconds and read 45,013 blocks of data, whereas the query that ran after the creation of the materialized view only consumed 898 CPU seconds and read 29,256 blocks. That is a reduction of 29% in CPU consumption and 35% in blocks read.

The optimizer can also rewrite the following query to use the previously created materialized view, which includes the additional join to the customer_address table:

SELECT
cust.c_customer_id
,addr.ca_state
FROM store_sales sales
INNER JOIN customer cust
ON sales.ss_customer_sk = cust.c_customer_sk
INNER JOIN customer_address addr
ON cust.c_current_addr_sk = addr.ca_address_sk
GROUP BY cust.c_customer_id, addr.ca_state;
     
      XN HashAggregate  (cost=30242919089.37..30242919089.37 rows=1544688912 width=26)
         ->  XN Hash Join DS_BCAST_INNER  (cost=542661.20..30201663527.13 rows=8251112448 width=26)
        Hash Cond: ("outer".c_current_addr_sk = "inner".ca_address_sk)
        ->  XN Seq Scan on mv_tbl__cust_store_sales_1__0 derived_table1  (cost=0.00..82511124.48 rows=8251112448 width=24)
        ->  XN Hash  (cost=150000.00..150000.00 rows=15000000 width=10)
              ->  XN Seq Scan on customer_address addr  (cost=0.00..150000.00 rows=15000000 width=10)

Optimize joins for tables that have multiple join paths

For large tables on Amazon Redshift, the ideal distribution style would be ‘KEY’, with the distribution key being the column that is used most frequently in the JOIN clause. There are situations where some large tables have multiple join paths. 50% of the queries may use a particular column to join to the table, and the other 50% of the queries may use a different column to join to the table. Both types of queries are important and have stringent performance requirements. In this case, you could pick one column as the distribution key for the table and then create a materialized view with the second column as the distribution key. This is possible because materialized views can have their own distribution and sort keys.

Here’s an example to illustrate how this works.

The web_sales table (2,159,968,881 rows) has the distribution key ws_order_number. This helps optimize a majority of the queries (70% of the joins to this table use ws_order_number as the join column). The remaining 30% use the column ws_bill_customer_sk to join to the table, as shown in the following SQL statement. This query took 12,790 milliseconds to run.

SELECT 
  c_customer_id
, c_email_address 
FROM web_sales ws
INNER JOIN customer cs
ON ws.ws_bill_customer_sk=cs.c_customer_sk;

We can create the materialized view to help improve the performance of the remaining 30% of the queries. Note the DISTKEY keyword in the following code. We have defined a new distribution key for the materialized view (ws_bill_customer_sk):

CREATE MATERIALIZED VIEW web_sales_cust_dist
DISTKEY (ws_bill_customer_sk)
AS
SELECT * FROM web_sales;

Rerunning the following query returns rows much faster than before (7,715 milliseconds vs. 12,790 milliseconds):

SELECT 
  c_customer_id
, c_email_address 
FROM web_sales ws
INNER JOIN customer cs
ON ws.ws_bill_customer_sk=cs.c_customer_sk;

Again, the explain plan of the query has changed; it now references the materialized view even though the SQL statement doesn’t explicitly reference the materialized view:

XN Hash Join DS_DIST_NONE  (cost=375000.00..696964927.69 rows=2159968768 width=74)
  Hash Cond: (""outer"".ws_bill_customer_sk = ""inner"".c_customer_sk)
  ->  XN Seq Scan on mv_tbl__web_sales_cust_dist__0 derived_table1  (cost=0.00..21599687.68 rows=2159968768 width=4)
  ->  XN Hash  (cost=300000.00..300000.00 rows=30000000 width=78)
        ->  XN Seq Scan on customer cs  (cost=0.00..300000.00 rows=30000000 width=78)

Optimize table scans

Table scans on Amazon Redshift are made efficient through the use of sort keys. Sort keys determine the order in which the columns are stored in the data blocks. Picking a column that appears frequently in your filtering conditions as a sort key can improve query performance significantly.

Compound sort keys with multiple columns can be defined on your table in case multiple columns are good candidates for sort keys. But in some situations where two or more high cardinality columns are sort key candidates, the compound sort key may not provide adequate performance. In these cases, a materialized view could be created with a different sort key to maintain that data in an alternate sorted order to help cater to a subset of the queries.

In the following example query, the web_sales table uses the column ws_sold_date_sk for the sort key, because this is the column that is used commonly for filtering rows. A smaller set of queries use ws_sales_price for filtering rows. Given that both ws_sold_date_sk and ws_sales_price are high cardinality columns with lots of unique values, a compound sort key with both columns may not be performant for all query patterns.

SELECT *
FROM web_sales 
WHERE ws_sales_price BETWEEN 50 AND 100;

Let’s create the following materialized view and see how it can help improve the performance of the preceding query:

CREATE MATERIALIZED VIEW web_sales_sort_on_price
SORTKEY (ws_sales_price)
AS
SELECT * FROM web_sales;

Running the following query returns rows much faster (5 milliseconds vs. 3,548 milliseconds) because the automatic query rewrite is using the materialized view:

SELECT *
FROM web_sales 
WHERE ws_sales_price BETWEEN 50 AND 100;

The following is the new explain plan:

XN Seq Scan on mv_tbl__web_sales_cust_dist__0 derived_table1  (cost=0.00..32399531.52 rows=10799844 width=260)
  Filter: ((ws_sales_price <= 100.00) AND (ws_sales_price >= 50.00))

Conclusion

Materialized views on Amazon Redshift can be a powerful optimization tool if used appropriately. With automatic query rewrite, you can optimize queries without any impact to end-users or their queries. This allows you to create materialized views after the application has gone live. Some customers plan this as part of their performance-optimization strategy when building new apps. The real value is that you can optimize queries and workloads without needing to modify the source code or scripts, and you can benefit even with a partial match.

About the Authors

Harshida Patel is a Specialist Sr. Solutions Architect, Analytics with AWS.

Jeetesh Srivastva is a Sr. Manager, Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to implement scalable solutions using Amazon Redshift and other AWS Analytic services. He has worked to deliver on-premises and cloud-based analytic solutions for customers in banking and finance and hospitality industry verticals.

Sain Das is an Analytics Specialist Solutions Architect at AWS and helps customers build scalable cloud solutions that help turn data into actionable insights.

Somdeb Bhattacharjee is an Enterprise Solutions Architect at AWS.

Power highly resilient use cases with Amazon Redshift

2021-12-27 Asser Moustafa

Post Syndicated from Asser Moustafa original https://aws.amazon.com/blogs/big-data/power-highly-resilient-use-cases-with-amazon-redshift/

Amazon Redshift is the most popular and fastest cloud data warehouse, offering seamless integration with your data lake and other data sources, up to three times faster performance than any other cloud data warehouse, automated maintenance, separation of storage and compute, and up to 75% lower cost than any other cloud data warehouse. This post explores different architectures and use cases that focus on maximizing data availability, using Amazon Redshift as the core data warehouse platform.

In the modern data-driven organization, many data analytics use cases using Amazon Redshift have increasingly evolved to assume a critical business profile. Those use cases are now required to be highly resilient with little to no downtime. For example, analytical use cases that once relied solely on historical data and produced static forecasts are now expected to continuously weave real-time streaming and operational data into their ever-updating analytical forecasts. Machine learning (ML) use cases that relied on overnight batch jobs to extract customer churn predictions from extremely large datasets are now expected to perform those same customer churn predictions on demand using both historical and intraday datasets.

This post is part one of a series discussing high resiliency and availability with Amazon Redshift. In this post, we discuss a diverse set of popular analytical use cases that have traditionally or perhaps more recently assumed a critical business profile. The goal of this post is to show the art of the possible with high resiliency use cases. For each use case, we provide a brief description and explore the reasons for its critical business profile, and provide a reference architecture for implementing the use case following best practices. In the following section, we include a brief mention of some of the complimentary high resiliency features in Amazon Redshift as they apply for each use case.

In the final section of this post, we expand the scope to discuss high resiliency in a data ecosystem that uses Amazon Redshift. In particular, we discuss the Lake House Architecture in the high resiliency context.

Part two of this series (coming soon) provides a deeper look into the individual high resiliency and availability features of Amazon Redshift.

Now let’s explore some of the most popular use cases that have traditionally required high resiliency or have come to require high resiliency in the modern data-driven organization.

Data analytics as a service

Many analytical use cases focus on extracting value from data collected and produced by an organization to serve the organization’s internal business and operational goals. In many cases, however, the data collected and produced by an organization can itself be packaged and offered as a product to other organizations. More specifically, access to the data collected and produced along with analytical capabilities is typically offered as a paid service to other organizations. This is referred to as data analytics as a service (DaaS).

For example, consider a marketing agency that has amassed demographic information for a geographic location such as population by age, income, and family structure. Such demographic information often serves as a vital input for many organizations’ decisions to identify the ideal location for expansion, match their products with likely buyers, product offerings, and many other business needs. The marketing agency can offer access to this demographic information as a paid service to a multitude of retailers, healthcare providers, resorts, and more.

Some of the most critical aspects of DaaS offerings are ease of management, security, cost-efficiency, workload isolation, and high resiliency and availability. For example, the marketing agency offering the DaaS product needs the ability to easily refresh the demographic data on a regular cadence (ease of management), ensure paying customers are able to access only authorized data (security), minimize data duplication to avoid runaway costs and keep the DaaS competitively priced (cost-efficiency), ensure a consistent performance profile for paying customers (workload isolation), and ensure uninterrupted access to the paid service (high availability).

By housing the data in one or more Amazon Redshift clusters, organizations can use the service’s data sharing capabilities to make such DaaS patterns possible in an easily manageable, secure, cost-efficient, and workload-isolated manner. Paying customers are then able to access the data using the powerful search and aggregation capabilities of Amazon Redshift. The following architecture diagram illustrates a commonly used reference architecture for this scenario.

The following diagram illustrates another reference architecture that provides high resiliency and availability for internal and external consumers of the data.

While an in-depth discussion of the data sharing capabilities in Amazon Redshift is beyond the scope of this post, refer to the following resources for more information:

Implementing multi-tenant patterns in Amazon Redshift using data sharing
Cross-Account Data Sharing for Amazon Redshift
Announcing Amazon Redshift data sharing
Create highly resilient and available architectures with Amazon Redshift (coming soon)

Fresh forecasts

As the power of the modern data ecosystem is unleashed, analytical workloads that traditionally yielded point-in-time reports based solely on historical datasets are evolving to incorporate data in real-time and produce on-demand analysis.

For example, event coordinators that may have had to rely solely on historical datasets to create analytical sales forecasts in business intelligence (BI) dashboards for upcoming events are now able to use Amazon Redshift federated queries to incorporate live ticket sales stored in operational data stores such as Amazon Aurora or Amazon Relational Database Service (Amazon RDS). With federated queries, event coordinators can now have their analytical workloads running on Amazon Redshift query and incorporate operational data such as live ticket sales stored in Aurora on demand so that BI dashboards reflect the most up-to-date ticket sales.

Setting up federated queries is achieved by creating external tables that reference the tables of interest in an RDS instance. The following reference architecture illustrates one straightforward way to achieve federated queries using two different versions of Aurora.

While an in-depth discussion of federated query capabilities in Amazon Redshift is beyond the scope of this post, refer to the following resources for more information:

ML-based predictions

The multitude of ML-based predictive use cases and extensive analytical capabilities offered within the AWS ecosystem has placed ML in an ever-prominent and critical role within data-driven organizations. This could be retailers looking to predict customer churn, healthcare insurers looking to predict the number of claims in the next 30 days, financial services organizations working to detect fraud or managing their market risk and exposure, and more.

Amazon Redshift ML provides seamless integration to Amazon SageMaker for training ML models as often as necessary using data stored in Amazon Redshift. Redshift ML also provides the ability to weave on-demand, ML-based predictions directly into Amazon Redshift analytical workloads. The ease with which ML predictions can now be used in Amazon Redshift has paved the path to analytical workloads or BI dashboards that either use or center around ML-based predictions, and that are relied on heavily by operations teams, business teams, and many other users.

For example, retailers may have traditionally relied on ML models that were trained in a periodic cadence, perhaps weekly or some other lengthy interval, to predict customer churn. A lot can change, however, during those training intervals, rendering the retailer’s ability to predict customer churn less effective. With Redshift ML, retailers are now able to train their ML models using their most recent data within Amazon Redshift and incorporate ML predictions directly in the Amazon Redshift analytical workloads used to power BI dashboards.

The following reference architecture demonstrates the use of Redshift ML functions in various analytical workloads. With ANSI SQL commands, you can use Amazon Redshift data to create and train an ML model (Amazon Redshift uses SageMaker) that is then made accessible through an Amazon Redshift function. That function can then be used in various analytical workloads.

While an in-depth discussion of Redshift ML is beyond the scope of this post, refer to the following resources for more information:

Production data for dev environments

Having access to high-quality test data is one of the most common challenges encountered in the development process. To maintain access to high-quality test data, developers must often overcome hurdles such as high administrative overhead for replicating data, increased costs from data duplication, prolonged downtime, and risk of losing development artifacts when refreshing test environments.

The data sharing feature enables Amazon Redshift development clusters to access high-quality production data directly from an Amazon Redshift production or pre-production cluster in a straightforward, secure, and cost-efficient approach that achieves a highly resilient posture.

For example, you can establish a data share on the Amazon Redshift production cluster that securely exposes only the schemas, tables, or views appropriate for development environments. The Amazon Redshift development cluster can then use that data share to query the high-quality production data directly where it is persisted on Amazon Simple Storage Service (Amazon S3), without impacting the Amazon Redshift production cluster’s compute capacity. Because the development cluster uses its own compute capacity, the production cluster’s high resiliency and availability posture is insulated from long-running experimental or development workloads. Likewise, development workloads are insulated from competing for compute resources on the production cluster.

In addition, querying the high-quality production data via the production cluster’s data share avoids unnecessary data duplication that can lead to higher storage costs. As the production data changes, the development cluster automatically gains access to the latest high-quality production data.

Finally, for development features that require schema changes, developers are free to create custom schemas on the development cluster that are based on the high-quality production data. Because the production data is decoupled from the development cluster, the custom schemas are located only on the development cluster, and the production data is not impacted in any way.

Let’s explore two example reference architectures that you can use for this use case.

Production data for dev environments using current-generation Amazon Redshift instance types

With the native Amazon Redshift data sharing available with the current generation of Amazon Redshift instance types (RA3), we can use a relatively straightforward architecture to enable dev environments with the freshest high-quality production data.

In the following architecture diagram, the production cluster takes on the role of a producer cluster, because it’s the cluster producing the data of interest. The development clusters take on the role of the consumer cluster because they’re the clusters interested in accessing the produced data. Note that the producer and consumer roles are merely labels to clarify the different role of each cluster, and not a formal designation within Amazon Redshift.

Production data for dev environments using previous-generation Amazon Redshift instance types

When we discussed this use case, we relied entirely on the native data sharing capability in Amazon Redshift. However, if you’re using the previous generation Amazon Redshift instance types of dense compute (DC) and dense storage (DS) nodes in your production environments, you should employ a slightly different implementation of this use case, because native Amazon Redshift data sharing is available only for the current generation of Amazon Redshift instance types (RA3).

First, we use a snapshot of the dense compute or dense storage production cluster to restore the production environment to a new RA3 cluster that has the latest production data. Let’s call this cluster the dev-read cluster to emphasize that this cluster is only for read-only purposes and doesn’t exhibit any data modifications. In addition, we can stand up a second RA3 cluster that simply serves as a sandbox for developers with data shares established to the dev-read cluster. Let’s call this cluster the dev-write cluster, because its main purpose is to serve as a read/write sandbox for developers and broader development work.

The following diagram illustrates this setup.

One of the key benefits of having a separate dev-read and dev-write cluster is that the dev-read cluster can be swapped out with a new RA3 cluster containing fresher production data, without wiping out all of the potential development artifacts created by developers (stored procedures for debugging, modified schemas, elevated privileges, and so on). This resiliency is a crucial benefit for many development teams that might otherwise significantly delay refreshing their development data simply because they don’t want to lose their testing and debugging artifacts or broader development settings.

For example, if the development team wants to refresh the production data in the dev-read cluster on the first of every month, then every month you could rename the current dev-read cluster to dev-read-old, and use the latest production snapshot to create a new dev-read RA3 cluster. You also have to reestablish the data share setup between the dev-write and dev-read clusters along with the dev-read cluster swap, but this task can be automated fairly easily and quickly using a number of approaches.

Another key benefit is that the dev-read cluster doesn’t exhibit any load beyond the initial snapshot restoration, so it can be a simple two-node ra3.xlplus cluster to minimize cost, while the dev-write cluster can be more appropriately sized for development workloads. In other words, there is minimal additional cost with this setup vs. using single development cluster.

While an in-depth discussion of Amazon Redshift’s data sharing capabilities is beyond the scope of this post, refer to the following resources for more information:

Sharing Amazon Redshift data securely across Amazon Redshift clusters for workload isolation
Announcing Amazon Redshift data sharing
Create highly resilient and available architectures with Amazon Redshift (coming soon)

Streaming data analytics

With integration between the Amazon Kinesis family of services and Amazon Redshift, you have an easy and reliable way to load streaming data into data lakes as well as analytics services. Amazon Kinesis Data Firehose micro-batches real-time streaming messages and loads those micro-batches into the designated table within Amazon Redshift. With a few clicks on the Kinesis Data Firehose console, you can create a delivery stream that can ingest streaming data from hundreds of sources to multiple destinations, including Amazon Redshift. Should there be any interruptions in publishing streaming messages to Amazon Redshift, Kinesis Data Firehose automatically attempts multiple retries, and you can configure and customize that retry behavior.

You can also configure Amazon Kinesis Data Streams to convert the incoming data to open formats like Apache Parquet and ORC before data is delivered to Amazon Redshift for optimal query performance. You can even dynamically partition your streaming data using well-defined keys like customer_id or transaction_id. Kinesis Data Firehose groups data by these keys and delivers into key-unique S3 prefixes, making it easier for you to perform high-performance, cost-efficient analytics in Amazon S3 using Amazon Redshift and other AWS services.

The following reference architecture shows one of the straightforward approaches to integrating Kinesis Data Firehose and Amazon Redshift.

While an in-depth discussion of Kinesis Data Firehose and integration with Amazon Redshift are beyond the scope of this post, refer to the following resources for more information:

Amazon Kinesis- Setting up a Streaming Data Pipeline

Change data capture

While Amazon Redshift federated query enables Amazon Redshift to directly query data stored in an operational data store such as Aurora, there are also times when it helps for some of that operational data to be entirely replicated to Amazon Redshift for a multitude of other analytical use cases, such as data refinement.

After an initial replication from the operational data store to Amazon Redshift, ongoing change data capture (CDC) replications are required to keep Amazon Redshift updated with subsequent changes that occurred on the operational data store.

With AWS Database Migration Service (AWS DMS), you can automatically replicate changes in an operational data store such as Aurora to Amazon Redshift in a straightforward, cost-efficient, secure, and highly resilient and available approach. As data changes on the operational data store, AWS DMS automatically replicates those changes to the designated table on Amazon Redshift.

The following reference architecture illustrates the straightforward use of AWS DMS to replicate changes in an operational data store such as Amazon Aurora, Oracle, SQL Server, etc. to Amazon Redshift and other destinations such as Amazon S3.

While an in-depth discussion of AWS DMS is beyond the scope of this post, refer to the following resources for more information:

Workload isolation

Sharing data can improve the agility of your organization by encouraging more connections and fostering collaboration, which allows teams to build upon the work of others rather than repeat already existing processes. Amazon Redshift does this by giving you instant, granular, and high-performance access to data across Amazon Redshift clusters without needing you to manually copy or move your data. You have live access to data so your users can see the most up-to-date and consistent information as it’s updated in Amazon Redshift clusters.

Amazon Redshift parallelizes queries across the different nodes of a cluster, but there may be circumstances when you want to allow more concurrent queries than one cluster can provide or provide workload separation. You can use data sharing to isolate your workloads, thereby minimizing the possibility that a deadlock situation in one workload impacts other workloads running on the same cluster.

The traditional approach to high resiliency and availability is to deploy two or more identical, independent, and parallel Amazon Redshift clusters. However, this design requires that all database updates be performed on all Amazon Redshift clusters. This introduces complexity in your overall architecture. In this section, we demonstrate how to use data sharing to design a highly resilient and available architecture with workload isolation.

The following diagram illustrates the high-level architecture for data sharing in Amazon Redshift.

This architecture supports different kinds of business-critical workloads, such as using a central extract, transform, and load (ETL) cluster that shares data with multiple analytic or BI clusters. This approach provides BI workload isolation, so individual BI workloads don’t impact the performance of the ETL workloads and vice-versa. You can scale the individual Amazon Redshift cluster compute resources according to the workload-specific requirements of price and performance.

Amazon Redshift Spectrum is a feature of Amazon Redshift that enables you to run queries against exabytes of unstructured data in Amazon S3, with no loading or ETL required. You can use your producer cluster to process the Amazon S3 data and unload the resulting dataset back to Amazon S3. Then set up as many Amazon Redshift consumer clusters as you need to query your Amazon S3 data lake, thereby providing high resiliency and availability, and limitless concurrency.

Highly available data ecosystem using Amazon Redshift

In this section, we delve a little deeper into the Lake House Architecture, which achieves a wide range of best practices while providing several high resiliency and availability benefits that complement Amazon Redshift.

In the modern data ecosystem, many data-driven organizations have achieved tremendous success employing a Lake House Architecture to process the ever-growing volume, velocity, and variety of data. In addition, the Lake House Architecture has helped those data-driven organizations achieve greater resiliency.

As the following diagram shows, the Lake House Architecture consists of a data lake serving as the single source of truth with different compute layers such as Amazon Redshift sitting atop the data lake (in effect building a house on the lake, hence the term “lake house”).

Organizations can use a data lake to maximize data availability by centrally storing the data in the durable Amazon S3 layer but obtain access from multiple AWS products. Separation of compute and storage offers several resiliency and availability advantages. A data lake provides these same advantages but from a heterogeneous set of services that can all access a common data layer. Using Amazon Redshift with a Lake House Architecture reinforces the lake house’s high resiliency and availability. Furthermore, with the seamless integration of Amazon Redshift with the S3 data lake, you can use Redshift Spectrum to run ANSI SQL queries within Amazon Redshift that directly reference external tables in the S3 data lake, as is often done with cold data (data that is infrequently accessed).

In addition, there are a multitude of straightforward services such as AWS Glue, AWS DMS, and AWS Lambda that you can use to load warm data (data that is frequently accessed) from an S3 data lake to Amazon Redshift for greater performance.

Conclusion

In this post, we explored several analytical use cases that require high resiliency and availability and provided an overview of the Amazon Redshift features that help fulfill those requirements. We also presented a few example reference architectures for those use cases as well as a data ecosystem reference architecture that provides a wide range of benefits and reinforces high resiliency and availability postures.

For further information on high resiliency and availability within Amazon Redshift or implementing the aforementioned use cases, we encourage you to reach out to your AWS Solutions Architect—we look forward to helping.

About the Authors

Asser Moustafa is an Analytics Specialist Solutions Architect at AWS based out of Dallas, TX, USA. He advises customers in the Americas on their Amazon Redshift and data lake architectures and migrations, starting from the POC stage to actual production deployment and maintenance.

Build a modern data architecture on AWS with Amazon AppFlow, AWS Lake Formation, and Amazon Redshift.

2021-12-20 Dr. Yannick Misteli

Post Syndicated from Dr. Yannick Misteli original https://aws.amazon.com/blogs/big-data/build-a-modern-data-architecture-on-aws-with-amazon-appflow-aws-lake-formation-and-amazon-redshift/

This is a guest post written by Dr. Yannick Misteli, lead cloud platform and ML engineering in global product strategy (GPS) at Roche.

Recently the Roche Data Insights (RDI) initiative was launched to achieve our vision using new ways of working and collaboration in order to build shared, interoperable data & insights with federated governance. Furthermore, a simplified & integrated data landscape shall be established in order to empower insights communities. One of the first domains to engage in this program is the Go-to-Market (GTM) area which comprises sales, marketing, medical access and market affairs in Roche. GTM domain enables Roche to understand customers and to ultimately create and deliver valuable services that meet their needs. GTM as a domain extends beyond health care professionals (HCPs) to a larger healthcare ecosystem consisting of patients, communities, health authorities, payers, providers, academia, competitors, so on and so forth. Therefore, Data & Analytics are key in supporting the internal and external stakeholders in their decision-making processes through actionable insights.

Roche GTM built a modern data and machine learning (ML) platform on AWS while utilizing DevOps best practices. The mantra of everything as code (EaC) was key in building a fully automated, scalable data lake and data warehouse on AWS.

In this this post, you learn about how Roche used AWS products and services such as Amazon AppFlow, AWS Lake Formation, and Amazon Redshift to provision and populate their data lake; how they sourced, transformed, and loaded data into the data warehouse; and how they realized best practices in security and access control.

In the following sections, you dive deep into the scalable, secure, and automated modern data platform that Roche has built. We demonstrate how to automate data ingestion, security standards, and utilize DevOps best practices to ease management of your modern data platform on AWS.

Data platform architecture

The following diagram illustrates the data platform architecture.

The architecture contains the following components:

The core infrastructure is deployed from a GitLab Runner container using the AWS Cloud Development Kit (AWS CDK)
Amazon AppFlow flows gather objects from Salesforce and places them into an Amazon Simple Storage Service (Amazon S3) bucket
Processed data from the S3 bucket is also loaded back to Salesforce using Amazon AppFlow
AWS Glue jobs load data from external databases (for example Oracle) and APIs to the S3 bucket
Invocation of DBT ELT jobs using Argo (an open source project) on Amazon Elastic Kubernetes Service (Amazon EKS) are launched to transform and cleanse data in external Amazon Redshift tables
An Amazon EKS cluster running Kubeflow interacts with an Amazon Redshift cluster for data science ML experimentation and use cases
Lake Formation secures the S3 buckets, AWS Glue databases, and tables aligned to specific AWS Identity and Access Management (IAM) roles
AWS Firewall Manager secures the entire AWS Cloud perimeter

Lake Formation security

We use Lake Formation to secure all data as it lands in the data lake. Separating each data lake layer into distinct S3 buckets and prefixes enables fine-grained access control policies that Lake Formation implements. This concept also extends to locking down access to specific rows and columns and applying policies to specific IAM roles and users. Governance and access to data lake resources is difficult to manage, but Lake Formation simplifies this process for administrators.

To secure access to the data lake using Lake Formation, the following steps are automated using the AWS CDK with customized constructs:

Register the S3 data buckets and prefixes, and corresponding AWS Glue databases with Lake Formation.
Add data lake administrators (GitLab runner IAM deployment role and administrator IAM role).
Grant the AWS Glue job IAM roles access to the specific AWS Glue databases.
Grant the AWS Lambda IAM role access to the Amazon AppFlow databases.
Grant the listed IAM roles access to the corresponding tables in the AWS Glue databases.

AWS Glue Data Catalog

The AWS Glue Data Catalog is the centralized registration and access point for all databases and tables that are created in both the data lake and in Amazon Redshift. This provides centralized transparency to all resources along with their schemas and the location of all data that is referenced. This is a critical aspect for any data operations performed within the lake house platform.

Data sourcing and ingestion

Data is sourced and loaded into the data lake through the use of AWS Glue jobs and Amazon AppFlow. The ingested data is made available in the Amazon Redshift data warehouse through Amazon Redshift Spectrum using external schemas and tables. The process of creating the external schemas and linking it to the Data Catalog is outlined later in this post.

Amazon AppFlow Salesforce ingestion

Amazon AppFlow is a fully-managed integration service that allows you to pull data from sources such as Salesforce, SAP, and Zendesk. Roche integrates with Salesforce to load Salesforce objects securely into their data lake without needing to write any custom code. Roche also pushes ML results back to Salesforce using Amazon AppFlow to facilitate the process.

Salesforce objects are first fully loaded into Amazon S3 and then are flipped to a daily incremental load to capture deltas. The data lands in the raw zone bucket in Parquet format using the date as a partition. The Amazon AppFlow flows are created through the use of a YAML configuration file (see the following code). This configuration is consumed by the AWS CDK deployment to create the corresponding flows.

appflow:
  flow_classes:
    salesforce:
      source: salesforce
      destination: s3
      incremental_load: 1
      schedule_expression: "rate(1 day)"
      s3_prefix: na
      connector_profile: roche-salesforce-connector-profile1,roche-salesforce-connector-profile2
      description: appflow flow flow from Salesforce
      environment: all
  - name: Account
    incremental_load: 1
    bookmark_col: appflow_date_str
  - name: CustomSalesforceObject
    pii: 0
    bookmark_col: appflow_date_str
    upsert_field_list: upsertField
    s3_prefix: prefix
    source: s3
    destination: salesforce
    schedule_expression: na
    connector_profile: roche-salesforce-connector-profile

The YAML configuration makes it easy to select whether data should be loaded from an S3 bucket back to Salesforce or from Salesforce to an S3 bucket. This configuration is subsequently read by the AWS CDK app and corresponding stacks to translate into Amazon AppFlow flows.

The following options are specified in the preceding YAML configuration file:

source – The location to pull data from (Amazon S3, Salesforce)
destination – The destination to put data to (Amazon S3, Salesforce)
object_name – The name of the Salesforce object to interact with
incremental_load – A Boolean specifying if the load should be incremental or full (0 means full, 1 means incremental)
schedule_expression – The cron or rate expression to run the flow (na makes it on demand)
s3_prefix – The prefix to push or pull the data from in the S3 bucket
connector_profile – The Amazon AppFlow connector profile name to use when connecting to Salesforce (can be a CSV list)
environment – The environment to deploy this Amazon AppFlow flow to (all means deploy to dev and prod, dev means development environment, prod means production environment)
upsert_field_list – The set of Salesforce object fields (can be a CSV list) to use when performing an upsert operation back to Salesforce (only applicable when loaded data back from an S3 bucket back to Salesforce)
bookmark_col – The name of the column to use in the Data Catalog for registering the daily load date string partition

Register Salesforce objects to the Data Catalog

Complete the following steps to register data loaded into the data lake with the Data Catalog and link it to Amazon Redshift:

Gather Salesforce object fields and corresponding data types.
Create a corresponding AWS Glue database in the Data Catalog.
Run a query against Amazon Redshift to create an external schema that links to the AWS Glue database.
Create tables and partitions in the AWS Glue database and tables.

Data is accessible via the Data Catalog and the Amazon Redshift cluster.

Amazon AppFlow dynamic field gathering

To construct the schema of the loaded Salesforce object in the data lake, you invoke the following Python function. The code utilizes an Amazon AppFlow client from Boto3 to dynamically gather the Salesforce object fields to construct the Salesforce object’s schema.

import boto3

client = boto3.client('appflow')

def get_salesforce_object_fields(object_name: str, connector_profile: str):
    """
    Gathers the Salesforce object and its corresponding fields.

    Parameters:
        salesforce_object_name (str) = the name of the Salesforce object to consume.
        appflow_connector_profile (str) = the name of AppFlow Connector Profile.

    Returns:
        object_schema_list (list) =  a list of the object's fields and datatype (a list of dictionaries).
    """
    print("Gathering Object Fields")

    object_fields = []

    response = client.describe_connector_entity(
        connectorProfileName=connector_profile,
        connectorEntityName=object_name,
        connectorType='Salesforce'
    )

    for obj in response['connectorEntityFields']:
        object_fields.append(
            {'field': obj['identifier'], 'data_type': obj['supportedFieldTypeDetails']['v1']['fieldType']})

    return object_fields

We use the function for both the creation of the Amazon AppFlow flow via the AWS CDK deployment and for creating the corresponding table in the Data Catalog in the appropriate AWS Glue database.

Create an Amazon CloudWatch Events rule, AWS Glue table, and partition

To add new tables (one per Salesforce object loaded into Amazon S3) and partitions into the Data Catalog automatically, you create an Amazon CloudWatch Events rule. This function enables you to query the data in both AWS Glue and Amazon Redshift.

After the Amazon AppFlow flow is complete, it invokes a CloudWatch Events rule and a corresponding Lambda function to either create a new table in AWS Glue or add a new partition with the corresponding date string for the current day. The CloudWatch Events rule looks like the following screenshot.

The invoked Lambda function uses the Amazon SageMaker Data Wrangler Python package to interact with the Data Catalog. Using the preceding function definition, the object fields and their data types are accessible to pass to the following function call:

import awswrangler as wr

def create_external_parquet_table(
    database_name: str, 
    table_name: str, 
    s3_path: str, 
    columns_map: dict, 
    partition_map: dict
):
    """
    Creates a new external table in Parquet format.

    Parameters:
        database_name (str) = the name of the database to create the table in.
        table_name (str) = the name of the table to create.
        s3_path (str) = the S3 path to the data set.
        columns_map (dict) = a dictionary object containing the details of the columns and their data types from appflow_utility.get_salesforce_object_fields
        partition_map (dict) = a map of the paritions for the parquet table as {'column_name': 'column_type'}
    
    Returns:
        table_metadata (dict) = metadata about the table that was created.
    """

    column_type_map = {}

    for field in columns_map:
        column_type_map[field['name']] = field['type']

    return wr.catalog.create_parquet_table(
        database=database_name,
        table=table_name,
        path=s3_path,
        columns_types=column_type_map,
        partitions_types=partition_map,
        description=f"AppFlow ingestion table for {table_name} object"
    )

If the table already exists, the Lambda function creates a new partition to account for the date in which the flow completed (if it doesn’t already exist):

import awswrangler as wr

def create_parquet_table_date_partition(
    database_name: str, 
    table_name: str, 
    s3_path: str, 
    year: str, 
    month: str, 
    day: str
):
    """
    Creates a new partition by the date (YYYY-MM-DD) on an existing parquet table.

    Parameters:
        database_name (str) = the name of the database to create the table in.
        table_name (str) = the name of the table to create.
        s3_path (str) = the S3 path to the data set.
        year(str) = the current year for the partition (YYYY format).
        month (str) = the current month for the partition (MM format).
        day (str) = the current day for the partition (DD format).
    
    Returns:
        table_metadata (dict) = metadata about the table that has a new partition
    """

    date_str = f"{year}{month}{day}"
    
    return wr.catalog.add_parquet_partitions(
        database=database_name,
        table=table_name,
        partitions_values={
            f"{s3_path}/{year}/{month}/{day}": [date_str]
        }
    )
    
def table_exists(
    database_name: str, 
    table_name: str
):
    """
    Checks if a table exists in the Glue catalog.

    Parameters:
        database_name (str) = the name of the Glue Database where the table should be.
        table_name (str) = the name of the table.
    
    Returns:
        exists (bool) = returns True if the table exists and False if it does not exist.
    """

    try:
        wr.catalog.table(database=database_name, table=table_name)
        return True
    except ClientError as e:
        return False

Amazon Redshift external schema query

An AWS Glue database is created for each Amazon AppFlow connector profile that is present in the preceding configuration. The objects that are loaded from Salesforce into Amazon S3 are registered as tables in the Data Catalog under the corresponding database. To link the database in the Data Catalog with an external Amazon Redshift schema, run the following query:

CREATE EXTERNAL SCHEMA ${connector_profile_name}_ext from data catalog
database '${appflow_connector_profile_name}'
iam_role 'arn:aws:iam::${AWS_ACCOUNT_ID}:role/RedshiftSpectrumRole'
region 'eu-west-1';

The specified iam_role value must be an IAM role created ahead of time and must have the appropriate access policies specified to query the Amazon S3 location.

Now, all the tables available in the Data Catalog can be queried using SQL locally in Amazon Redshift Spectrum.

Amazon AppFlow Salesforce destination

Roche trains and invokes ML models using data found in the Amazon Redshift data warehouse. After the ML models are complete, the results are pushed back into Salesforce. Through the use of Amazon AppFlow, we can achieve the data transfer without writing any custom code. The schema of the results must match the schema of the corresponding Salesforce object, and the format of the results must be written in either JSON lines or CSV format in order to be written back into Salesforce.

AWS Glue Jobs

To source on-premises data feeds into the data lake, Roche has built a set of AWS Glue jobs in Python. There are various external sources including databases and APIs that are directly loaded into the raw zone S3 bucket. The AWS Glue jobs are run on a daily basis to load new data. The data that is loaded follows the partitioning scheme of YYYYMMDD format in order to more efficiently store and query datasets. The loaded data is then converted into Parquet format for more efficient querying and storage purposes.

Amazon EKS and KubeFlow

To deploy ML models on Amazon EKS, Roche uses Kubeflow on Amazon EKS. The use of Amazon EKS as the backbone infrastructure makes it easy to build, train, test, and deploy ML models and interact with Amazon Redshift as a data source.

Firewall Manager

As an added layer of security, Roche takes extra precautions through the use of Firewall Manager. This allows Roche to explicitly deny or allow inbound and outbound traffic through the use of stateful and stateless rule sets. This also enables Roche to allow certain outbound access to external websites and deny websites that they don’t want resources inside of their Amazon VPC to have access to. This is critical especially when dealing with any sensitive datasets to ensure that data is secured and has no chance of being moved externally.

CI/CD

All the infrastructure outlined in the architecture diagram was automated and deployed to multiple AWS Regions using a continuous integration and continuous delivery (CI/CD) pipeline with GitLab Runners as the orchestrator. The GitFlow model was used for branching and invoking automated deployments to the Roche AWS accounts.

Infrastructure as code and AWS CDK

Infrastructure as code (IaC) best practices were used to facilitate the creation of all infrastructure. The Roche team uses the Python AWS CDK to deploy, version, and maintain any changes that occur to the infrastructure in their AWS account.

AWS CDK project structure

The top level of the project structure in GitLab includes the following folders (while not limited to just these folders) in order to keep infrastructure and code all in one location.

To facilitate the various resources that are created in the Roche account, the deployment was broken into the following AWS CDK apps, which encompass multiple stacks:

core
data_lake
data_warehouse

The core app contains all the stacks related to account setup and account bootstrapping, such as:

VPC creation
Initial IAM roles and policies
Security guardrails

The data_lake app contains all the stacks related to creating the AWS data lake, such as:

Lake Formation setup and registration
AWS Glue database creation
S3 bucket creation
Amazon AppFlow flow creation
AWS Glue job setup

The data_warehouse app contains all the stacks related to setting up the data warehouse infrastructure, such as:

Amazon Redshift cluster
Load balancer to Amazon Redshift cluster
Logging

The AWS CDK project structure described was chosen to keep the deployment flexible and to logically group together stacks that relied on each other. This flexibility allows for deployments to be broken out by function and deployed only when truly required and needed. This decoupling of different parts of the provisioning maintains flexibility when deploying.

AWS CDK project configuration

Project configurations are flexible and extrapolated away as YAML configuration files. For example, Roche has simplified the process of creating a new Amazon AppFlow flow and can add or remove flows as needed simply by adding a new entry into their YAML configuration. The next time the GitLab runner deployment occurs, it picks up the changes on AWS CDK synthesis to generate a new change set with the new set of resources. This configuration and setup keeps things dynamic and flexible while decoupling configuration from code.

Network architecture

The following diagram illustrates the network architecture.

We can break down the architecture into the following:

All AWS services are deployed in two Availability Zones (except Amazon Redshift)
Only private subnets have access to the on-premises Roche environment
Services are deployed in backend subnets
Perimeter protection using AWS Network Firewall
A network load balancer publishes services to the on premises environment

Network security configurations

Infrastructure, configuration, and security are defined as code in AWS CDK, and Roche uses a CI/CD pipeline to manage and deploy them. Roche has an AWS CDK application to deploy the core services of the project: VPC, VPN connectivity, and AWS security services (AWS Config, Amazon GuardDuty, and AWS Security Hub). The VPC contains four network layers deployed in two Availability Zones, and they have VPC endpoints to access AWS services like Amazon S3, Amazon DynamoDB, and Amazon Simple Queue Service (Amazon SQS). They limit internet access using AWS Network Firewall.

The infrastructure is defined as code and the configuration is segregated. Roche performed the VPC setup by running the CI/CD pipeline to deploy their infrastructure. The configuration is in a specific external file; if Roche wants to change any value of the VPC, they need to simply modify this file and run the pipeline again (without typing any new lines of code). If Roche wants to change any configurations, they don’t want to have to change any code. It makes it simple for Roche to make changes and simply roll them out to their environment, making the changes more transparent and easier to configure. Traceability of the configuration is more transparent and it makes it simpler for approving the changes.

The following code is an example of the VPC configuration:

"test": {
        "vpc": {
            "name": "",
            "cidr_range": "192.168.40.0/21",
            "internet_gateway": True,
            "flow_log_bucket": shared_resources.BUCKET_LOGGING,
            "flow_log_prefix": "vpc-flow-logs/",
        },
        "subnets": {
            "private_subnets": {
                "private": ["192.168.41.0/25", "192.168.41.128/25"],
                "backend": ["192.168.42.0/23", "192.168.44.0/23"],
            },
            "public_subnets": {
                "public": {
                    "nat_gateway": True,
                    "publics_ip": True,
                    "cidr_range": ["192.168.47.64/26", "192.168.47.128/26"],
                }
            },
            "firewall_subnets": {"firewall": ["192.168.47.0/28", "192.168.47.17/28"]},
        },
        ...
         "vpc_endpoints": {
            "subnet_group": "backend",
            "services": [
                "ec2",
                "ssm",
                "ssmmessages",
                "sns",
                "ec2messages",
                "glue",
                "athena",
                "secretsmanager",
                "ecr.dkr",
                "redshift-data",
                "logs",
                "sts",
            ],
            "gateways": ["dynamodb", "s3"],
            "subnet_groups_allowed": ["backend", "private"],
        },
        "route_53_resolvers": {
            "subnet": "private",
        ...

The advantages of this approach are as follows:

No need to modify the AWS CDK constructor and build new code to change VPC configuration
Central point to manage VPC configuration
Traceability of changes and history of the configuration through Git
Redeploy all the infrastructure in a matter of minutes in other Regions or accounts

Operations and alerting

Roche has developed an automated alerting system if any part of the end-to-end architecture encounters any issues, focusing on any issues when loading data from AWS Glue or Amazon AppFlow. All logging is published to CloudWatch by default for debugging purposes.

The operational alerts have been built for the following workflow:

AWS Glue jobs and Amazon AppFlow flows ingest data.
If a job fails, it emits an event to a CloudWatch Events rule.
The rule is triggered and invokes an Lambda function to send failure details to an Amazon Simple Notification Service (Amazon SNS) topic.
The SNS topic has a Lambda subscriber that gets invoked:
1. The Lambda function reads out specific webhook URLs from AWS Secrets Manager.
2. The function fires off an alert to the specific external systems.
The external systems receive the message and the appropriate parties are notified of the issue with details.

The following architecture outlines the alerting mechanisms built for the lake house platform.

Conclusion

The GTM (Go-To-Market) domain has been successful in enabling their business stakeholders, data engineers and data scientists providing a platform that is extendable to many use-cases that Roche faces. It is a key enabler and an accelerator for the GTM organization in Roche. Through a modern data platform, Roche is now able to better understand customers and ultimately create and deliver valuable services that meet their needs. It extends beyond health care professionals (HCPs) to a larger healthcare ecosystem. The platform and infrastructure in this blog help to support and accelerate both internal and external stakeholders in their decision-making processes through actionable insights.

The steps in this post can help you plan to build a similar modern data strategy using AWS managed services to ingest data from sources like Salesforce, automatically create metadata catalogs and share data seamlessly between the data lake and data warehouse, and create alerts in the event of an orchestrated data workflow failure. In part 2 of this post, you learn about how the data warehouse was built using an agile data modeling pattern and how ELT jobs were quickly developed, orchestrated, and configured to perform automated data quality testing.

Special thanks go to the Roche team: Joao Antunes, Krzysztof Slowinski, Krzysztof Romanowski, Bartlomiej Zalewski, Wojciech Kostka, Patryk Szczesnowicz, Igor Tkaczyk, Kamil Piotrowski, Michalina Mastalerz, Jakub Lanski, Chun Wei Chan, Andrzej Dziabowski for their project delivery and support with this post.

About The Authors

A special Thank You goes out to the following people whose expertise made this post possible from AWS:

Thiyagarajan Arumugam – Principal Analytics Specialist Solutions Architect
Taz Sayed – Analytics Tech Leader
Glenith Paletta – Enterprise Service Manager
Mike Murphy – Global Account Manager
Natacha Maheshe – Senior Product Marketing Manager
Derek Young – Senior Product Manager
Jamie Campbell – Amazon AppFlow Product Manager
Kamen Sharlandjiev – Senior Solutions Architect – Amazon AppFlow
Sunil Jethwani – Principal Customer Delivery Architect
Vinay Shukla – Amazon Redshift Principal Product Manager
Nausheen Sayed – Program Manager

What’s new in Amazon Redshift – 2021, a year in review

2021-12-16 Manan Goel

Post Syndicated from Manan Goel original https://aws.amazon.com/blogs/big-data/whats-new-in-amazon-redshift-2021-a-year-in-review/

Amazon Redshift is the cloud data warehouse of choice for tens of thousands of customers who use it to analyze exabytes of data to gain business insights. Customers have asked for more capabilities in Redshift to make it easier, faster, and secure to store, process, and analyze all of their data. We announced Redshift in 2012 as the first cloud data warehouse to remove the complexity around provisioning, managing, and scaling data warehouses. Since then, we have launched capabilities such as Concurrency scaling, Spectrum, and RA3 nodes to help customers analyze all of their data and support growing analytics demands across all users in the organization. We continue to innovate with Redshift on our customers’ behalf and launched more than 50 significant features in 2021. This post covers some of those features, including use cases and benefits.

Working backwards from customer requirements, we are investing in Redshift to bring out new capabilities in three main areas:

Easy analytics for everyone
Analyze all of your data
Performance at any scale

Customers told us that the data warehouse users in their organizations are expanding from administrators, developers, analysts, and data scientists to the Line of Business (LoB) users, so we continue to invest to make Redshift easier to use for everyone. Customers also told us that they want to break free from data silos and access data across their data lakes, databases, and data warehouses and analyze that data with SQL and machine learning (ML). So we continue to invest in letting customers analyze all of their data. And finally, customers told us that they want the best price performance for analytics at any scale from Terabytes to Petabytes of data. So we continue to bring out new capabilities for performance at any scale. Let’s dive into each of these pillars and cover the key capabilities that we launched in 2021.

Amazon Redshift key innovations

Redshift delivers easy analytics for everyone

Easy analytics for everyone requires a simpler getting-started experience, automated manageability, and visual user interfaces that make is easier, simpler, and faster for both technical and non-technical users to quickly get started, operate, and analyze data in a data warehouse. We launched new features such as Redshift Serverless (in preview), Query Editor V2, and automated materialized views (in preview), as well as enhanced the Data API in 2021 to make it easier for customers to run their data warehouses.

Redshift Serverless (in preview) makes it easy to run and scale analytics in seconds without having to provision and manage data warehouse clusters. The serverless option lets all users, including data analysts, developers, business users, and data scientists use Redshift to get insights from data in seconds by simply loading and querying data into the data warehouse. Customers can launch a data warehouse and start analyzing the data with the Redshift Serverless option through just a few clicks in the AWS Management Console. There is no need to choose node types, node count, or other configurations. Customers can take advantage of pre-loaded sample data sets along with sample queries to kick start analytics immediately. They can create databases, schemas, tables, and load their own data from their desktop, Amazon Simple Storage Service (S3), via Amazon Redshift data shares, or restore an existing Amazon Redshift provisioned cluster snapshot. They can also directly query data in open formats, such as Parquet or ORC, in their Amazon S3 data lakes, as well as data in their operational databases, such as Amazon Aurora and Amazon RDS. Customers pay only for what they use, and they can manage their costs with granular cost controls.

Redshift Query Editor V2 is a web-based tool for data analysts, data scientists, and database developers to explore, analyze, and collaborate on data in Redshift data warehouses and data lake. Customers can use Query Editor’s visual interface to create and browse schema and tables, load data, author SQL queries and stored procedures, and visualize query results with charts. They can share and collaborate on queries and analysis, as well a track changes with built in version control. Query Editor V2 also supports SQL Notebooks (in preview), which provides a new Notebook interface that lets users such as data analysts and data scientists author queries, organize multiple SQL queries and annotations on a single document, and collaborate with their team members by sharing Notebooks.

Amazon Redshift Query Editor V2

Customers have long used Amazon Redshift materialized views (MV) for precomputed result sets, based on an SQL query over one or more base tables to improve query performance, particularly for frequently used queries such as those in dashboards and reports. In 2021, we launched Automated Materialized View (AutoMV) in preview to improve the performance of queries (reduce the total execution time) without any user effort by automatically creating and maintaining materialized views. Customers told us that while MVs offer significant performance benefits, analyzing the schema, data, and workload to determine which queries might benefit from having an MV or which MVs are no longer beneficial and should be dropped requires knowledge, time, and effort. AutoMV lets Redshift continually monitor the cluster to identify candidate MVs and evaluates the benefits vs costs. It creates MVs that have high benefit-to-cost ratios, while ensuring existing workloads are not negatively impacted by this process. AutoMV continually monitors the system and will drop MVs that are no longer beneficial. All of these are transparent to users and applications. Applications such as dashboards benefit without any code change thanks to automatic query re-write, which lets existing queries benefit from MVs even when not explicitly referenced. Customers can also set the MVs to autorefresh so that MVs always have up-to-date data for added convenience.

Customers have also asked us to simplify and automate data warehouse maintenance tasks, such as schema or table design, so that they can get optimal performance out of their clusters. Over the past few years, we have invested heavily to automate these maintenance tasks. For example, Automatic Table Optimization (ATO) selects the best sort and distribution keys to determine the optimal physical layout of data to maximize performance. We’ve extended ATO to modify column compression encodings to achieve high performance and reduce storage utilization. We have also introduced various features, such as auto vacuum delete and auto analyze, over the past few years to make sure that customer data warehouses continue to operate at peak performance.

Data API, which launched in 2020, has also seen major enhancements, such as multi-statement query execution, support for parameters to develop reusable code, and availability in more regions in 2021 to make it easier for customers to programmatically access data in Redshift. Data API lets Redshift enable customers to painlessly access data with all types of traditional, cloud-native, and containerized, serverless web services-based applications and event-driven applications. It simplifies data access, ingest, and egress from programming languages and platforms supported by the AWS SDK, such as Python, Go, Java, Node.js, PHP, Ruby, and C++. The Data API eliminates the need for configuring drivers and managing database connections. Instead, customers can run SQL commands to an Amazon Redshift cluster by simply calling a secured API endpoint provided by the Data API. The Data API takes care of managing database connections and buffering data. The Data API is asynchronous, so results can be retrieved later and are stored for 24 hours.

Finally in our easy analytics for everyone pillar, in 2021 we launched the Grafana Redshift Plugin to help customers gain a deeper understanding of their cluster’s performance. Grafana is a popular open-source tool for running analytics and monitoring systems online. The Grafana Redshift Plugin lets customers query system tables and views for the most complete set of operational metrics on their Redshift cluster. The Plugin is available in the Open Source Grafana repository, as well as in our Amazon Managed Grafana service. We also published a default in-depth operational dashboard to take advantage of this feature.

Redshift makes it possible for customers to analyze all of their data

Redshift gives customers the best of both data lakes and purpose-built data stores, such as databases and data warehouses. It enables customers to store any amount of data, at low cost, and in open, standards-based data formats such as parquet and JSON in data lakes, and run SQL queries against it without loading or transformations. Furthermore, it lets customers run complex analytic queries with high performance against terabytes to petabytes of structured and semi-structured data, using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query execution. Redshift lets customers access live data from the transactional databases as part of their business intelligence (BI) and reporting applications to enable operational analytics. Customers can break down data silos by seamlessly querying data in the data lakes, data warehouses, and databases; empower their teams to run analytics and ML using their preferred tool or technique; and manage who has access to data with the proper security and data governance controls. We launched new features in 2021, such as Data Sharing, AWS Data Exchange integration, and Redshift ML, to make it easier for customers to analyze all of their data.

Amazon Redshift data sharing lets customers extend the ease of use, performance, and cost benefits that Amazon Redshift offers in a single cluster to multi-cluster deployments while being able to share data. It enables instant, granular, and fast data access across Amazon Redshift clusters without the need to copy or move data around. Data sharing provides live access to data so that your users always see the most up-to-date and consistent information as it’s updated in the data warehouse. Customers can securely share live data with Amazon Redshift clusters in the same or different AWS accounts within the same region or across regions. Data sharing features several performance enhancements, including result caching and concurrency scaling, which allow customers to support a broader set of analytics applications and meet critical performance SLAs when querying shared data. Customers can use data sharing for use cases such as workload isolation and offer chargeability, as well as provide secure and governed collaboration within and across teams and external parties.

Customers also asked us to help them with internal or external data marketplaces so that they can enable use cases such as data as a service and onboard 3^rd-party data. We launched the public preview of AWS Data Exchange for Amazon Redshift, a new feature that enables customers to find and subscribe to third-party data in AWS Data Exchange that they can query in an Amazon Redshift data warehouse in minutes. Data providers can list and offer products containing Amazon Redshift data sets in the AWS Data Exchange catalog, granting subscribers direct, read-only access to the data stored in Amazon Redshift. This feature empowers customers to quickly query, analyze, and build applications with these third-party data sets. AWS Data Exchange for Amazon Redshift lets customers combine third-party data found on AWS Data Exchange with their own first-party data in their Amazon Redshift cloud data warehouse, with no ETL required. Since customers are directly querying provider data warehouses, they can be certain that they are using the latest data being offered. Additionally, entitlement, billing, and payment management are all automated: access to Amazon Redshift data is granted when a data subscription starts and is removed when it ends, invoices are automatically generated, and payments are automatically collected and disbursed through AWS Marketplace.

Customers also asked for our help to make it easy to train and deploy ML models such as prediction, natural language processing, object detection, and image classification directly on top of the data in purpose-built data stores without having to perform complex data movement or learn new tools. We launched Redshift ML earlier this year to enable customers to create, train, and deploy ML models using familiar SQL commands. Amazon Redshift ML lets customers leverage Amazon SageMaker, a fully managed ML service, without moving their data or learning new skills. Furthermore, Amazon Redshift ML powered by Amazon SageMaker lets customers use SQL statements to create and train ML models from their data in Amazon Redshift, and then use these models for use cases such as churn prediction and fraud risk scoring directly in their queries and reports. Amazon Redshift ML automatically discovers the best model and tunes it based on training data using Amazon SageMaker Autopilot. SageMaker Autopilot chooses between regression, binary, or multi-class classification models. Alternatively, customers can choose a specific model type such as Xtreme Gradient Boosted tree (XGBoost) or multilayer perceptron (MLP), a problem type like regression or classification, and preprocessors or hyperparameters. Amazon Redshift ML uses customer parameters to build, train, and deploy the model in the Amazon Redshift data warehouse. Customers can obtain predictions from these trained models using SQL queries as if they were invoking a user defined function (UDF), and leverage all of the benefits of Amazon Redshift, including massively parallel processing capabilities. Customers can also import their pre-trained SageMaker Autopilot, XGBoost, or MLP models into their Amazon Redshift cluster for local inference. Redshift ML supports both supervised and unsupervised ML for advanced analytics use cases ranging from forecasting to personalization.

Customers want to combine live data from operational databases with the data in Amazon Redshift data warehouse and the data in Amazon S3 data lake environment to get unified analytics views across all of the data in the enterprise. We launched Amazon Redshift federated query to let customers incorporate live data from the transactional databases as part of their BI and reporting applications to enable operational analytics. The intelligent optimizer in Amazon Redshift pushes down and distributes a portion of the computation directly into the remote operational databases to help speed up performance by reducing data moved over the network. Amazon Redshift complements subsequent execution of the query by leveraging its massively parallel processing capabilities for further speed up. Federated query also makes it easier to ingest data into Amazon Redshift by letting customers query operational databases directly, applying transformations on the fly, and loading data into the target tables without requiring complex ETL pipelines. In 2021, we added support for Amazon Aurora MySQL and Amazon RDS for MySQL databases in addition to the existing Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL databases for federated query to enable customers to access more data sources for richer analytics.

Finally in our analyze all your data pillar in 2021, we added data types such as SUPER, GEOGRAPHY, and VARBYTE to enable customers to store semi-structured data natively in the Redshift data warehouse so that they can analyze all of their data at scale and with performance. The SUPER data type lets customers ingest and store JSON and semi-structured data in their Amazon Redshift data warehouses. Amazon Redshift also includes support for PartiQL for SQL-compatible access to relational, semi-structured, and nested data. Using the SUPER data type and PartiQL in Amazon Redshift, customers can perform advanced analytics that combine classic structured SQL data (such as string, numeric, and timestamp) with the semi-structured SUPER data (such as JSON) with superior performance, flexibility, and ease-of-use. The GEOGRAPHY data type builds on Redshift’s support of spatial analytics, opening-up support for many more third-party spatial and GIS applications. Moreover, it adds to the GEOMETRY data type and over 70 spatial functions that are already available in Redshift. The GEOGRAPHY data type is used in queries requiring higher precision results for spatial data with geographic features that can be represented with a spheroid model of the Earth and referenced using latitude and longitude as a spatial coordinate system. VARBYTE is a variable size data type for storing and representing variable-length binary strings.

Redshift delivers performance at any scale

Since we announced Amazon Redshift in 2012, performance at any scale has been a foundational tenet for us to deliver value to tens of thousands of customers who trust us every day to gain business insights from their data. Our customers span all industries and sizes, from startups to Fortune 500 companies, and we work to deliver the best price performance for any use case. Over the years, we have launched features such as dynamically adding cluster capacity when you need it with concurrency scaling, making sure that you use cluster resources efficiently with automatic workload management (WLM), and automatically adjusting data layout, distribution keys, and query plans to provide optimal performance for a given workload. In 2021, we launched capabilities such as AQUA, concurrency scaling for writes, and further enhancements to RA3 nodes to continue to improve Redshift’ price performance.

We introduced the RA3 node types in 2019 as a technology that allows the independent scaling of compute and storage. We also described how customers, including Codeacademy, OpenVault, Yelp, and Nielsen, have taken advantage of Amazon Redshift RA3 nodes with managed storage to scale their cloud data warehouses and reduce costs. RA3 leverages Redshift Managed Storage (RMS) as its durable storage layer which allows near-unlimited storage capacity where data is committed back to Amazon S3. This enabled new capabilities, such as Data Sharing and AQUA, where RMS is used as a shared storage across multiple clusters. RA3 nodes are available in three sizes (16XL, 4XL, and XLPlus) to balance price/performance. In 2021, we launched single node RA3 XLPlus clusters to help customers cost-effectively migrate their smaller data warehouse workloads to RA3s and take advantage of better price performance. We also introduced a self-service DS2 to RA3 RI migration capability that lets RIs be converted at a flat cost between equivalent node types.

AQUA (Advanced Query Accelerator) for Amazon Redshift is a new distributed and hardware-accelerated cache that enables Amazon Redshift to run an order of magnitude faster than other enterprise cloud data warehouses by automatically boosting certain query types. AQUA uses AWS-designed processors with AWS Nitro chips adapted to speed up data encryption and compression, and custom analytics processors, implemented in FPGAs, to accelerate operations such as scans, filtering, and aggregation. AQUA is available with the RA3.16xlarge, RA3.4xlarge, or RA3.xlplus nodes at no additional charge and requires no code changes.

Concurrency Scaling was launched in 2019 to handle spiky and unpredictable read workloads without having to pre-provision any capacity. Redshift offers one hour of free Concurrency Scaling for every 24 hours of usage that your main cluster is running. It also offers cost controls to monitor and limit your usage and associated costs for Concurrency Scaling. In addition to read queries, supporting write queries has been a big ask from customers to support ETL workloads. In 2021, we launched Redshift Concurrency Scaling write queries support in preview with common operations such as INSERT, DELETE, UPDATE, and COPY to handle unpredictable spikes in ETL workloads. If you are currently using Concurrency Scaling, this new capability is automatically enabled in your cluster. You can monitor your Concurrency Scaling usage using the Amazon Redshift Console and get alerts on any usage exceeding your defined limits. You can also create, modify, and delete usage limits programmatically by using the AWS Command Line Interface (CLI) and AWS API.

Finally we continue to ensure that AWS has comprehensive security capabilities to satisfy the most demanding requirements, and Amazon Redshift continues to provides data security out-of-the-box at no extra cost. We introduced new security features in 2021, such as cross-VPC support and default IAM roles, to continue to make Redshift more secure for customer workloads.

Summary

When it comes to making it easier, simpler, and faster for customers to analyze all of their data, velocity matters and we are innovating at a rapid pace to bring new capabilities to Redshift. We continue to make Redshift features available in more AWS regions worldwide to make sure that all customers have access to all capabilities. We have covered the key features above and the complete list is available here. We look forward to how you will use some of these capabilities to continue innovating with data and analytics.

About the Author

Introducing new features for Amazon Redshift COPY: Part 1

2021-12-15 Dipankar Kushari

Post Syndicated from Dipankar Kushari original https://aws.amazon.com/blogs/big-data/part-1-introducing-new-features-for-amazon-redshift-copy/

Loading data is a key process for any analytical system, including Amazon Redshift. Loading very large datasets can take a long time and consume a lot of computing resources. How your data is loaded can also affect query performance. You can use many different methods to load data into Amazon Redshift. One of the fastest and most scalable methods is to use the COPY command. This post dives into some of the recent enhancements made to the COPY command and how to use them effectively.

Overview of the COPY command

A best practice for loading data into Amazon Redshift is to use the COPY command. The COPY command loads data in parallel from Amazon Simple Storage Service (Amazon S3), Amazon EMR, Amazon DynamoDB, or multiple data sources on any remote hosts accessible through a Secure Shell (SSH) connection.

The COPY command reads and loads data in parallel from a file or multiple files in an S3 bucket. You can take maximum advantage of parallel processing by splitting your data into multiple files, in cases where the files are compressed. The COPY command appends the new input data to any existing rows in the target table. The COPY command can load data from Amazon S3 for the file formats AVRO, CSV, JSON, and TXT, and for columnar format files such as ORC and Parquet.

Use COPY with FILLRECORD

In situations when the contiguous fields are missing at the end of some of the records for data files being loaded, COPY reports an error indicating that there is mismatch between the number of fields in the file being loaded and the number of columns in the target table. In some situations, columnar files (such as Parquet) that are produced by applications and ingested into Amazon Redshift via COPY may have additional fields added to the files (and new columns to the target Amazon Redshift table) over time. In such cases, these files may have values absent for certain newly added fields. To load these files, you previously had to either preprocess the files to fill up values in the missing fields before loading the files using the COPY command, or use Amazon Redshift Spectrum to read the files from Amazon S3 and then use INSERT INTO to load data into the Amazon Redshift table.

With the FILLRECORD parameter, you can now load data files with a varying number of fields successfully in the same COPY command, as long as the target table has all columns defined. The FILLRECORD parameter addresses ease of use because you can now directly use the COPY command to load columnar files with varying fields into Amazon Redshift instead of achieving the same result with multiple steps.

With the FILLRECORD parameter, missing columns are loaded as NULLs. For text and CSV formats, if the missing column is a VARCHAR column, zero-length strings are loaded instead of NULLs. To load NULLs to VARCHAR columns from text and CSV, specify the EMPTYASNULL keyword. NULL substitution only works if the column definition allows NULLs.

Use FILLRECORD while loading Parquet data from Amazon S3

In this section, we demonstrate the utility of FILLRECORD by using a Parquet file that has a smaller number of fields populated than the number of columns in the target Amazon Redshift table. First we try to load the file into the table without the FILLRECORD parameter in the COPY command, then we use the FILLRECORD parameter in the COPY command.

For the purpose of this demonstration, we have created the following components:

An Amazon Redshift cluster with a database, public schema, awsuser as admin user, and an AWS Identity and Access Management (IAM) role, used to perform the COPY command to load the file from Amazon S3, attached to the Amazon Redshift cluster. For details on authorizing Amazon Redshift to access other AWS services, refer to Authorizing Amazon Redshift to access other AWS services on your behalf.
An Amazon Redshift table named call_center_parquet.
A Parquet file already uploaded to an S3 bucket from where the file is copied into the Amazon Redshift cluster.

The following code is the definition of the call_center_parquet table:

DROP TABLE IF EXISTS public.call_center_parquet;
CREATE TABLE IF NOT EXISTS public.call_center_parquet
(
	cc_call_center_sk INTEGER NOT NULL ENCODE az64
	,cc_call_center_id varchar(100) NOT NULL  ENCODE lzo
	,cc_rec_start_date VARCHAR(50)   ENCODE lzo
	,cc_rec_end_date VARCHAR(50)   ENCODE lzo
	,cc_closed_date_sk varchar (100)   ENCODE lzo
	,cc_open_date_sk INTEGER   ENCODE az64
	,cc_name VARCHAR(50)   ENCODE lzo
	,cc_class VARCHAR(50)   ENCODE lzo
	,cc_employees INTEGER   ENCODE az64
	,cc_sq_ft INTEGER   ENCODE az64
	,cc_hours VARCHAR(20)   ENCODE lzo
	,cc_manager VARCHAR(40)   ENCODE lzo
	,cc_mkt_id INTEGER   ENCODE az64
	,cc_mkt_class VARCHAR(50)   ENCODE lzo
	,cc_mkt_desc VARCHAR(100)   ENCODE lzo
	,cc_market_manager VARCHAR(40)   ENCODE lzo
	,cc_division INTEGER   ENCODE az64
	,cc_division_name VARCHAR(50)   ENCODE lzo
	,cc_company INTEGER   ENCODE az64
	,cc_company_name VARCHAR(50)   ENCODE lzo
	,cc_street_number INTEGER   ENCODE az64
	,cc_street_name VARCHAR(60)   ENCODE lzo
	,cc_street_type VARCHAR(15)   ENCODE lzo
	,cc_suite_number VARCHAR(10)   ENCODE lzo
	,cc_city VARCHAR(60)   ENCODE lzo
	,cc_county VARCHAR(30)   ENCODE lzo
	,cc_state CHAR(2)   ENCODE lzo
	,cc_zip INTEGER   ENCODE az64
	,cc_country VARCHAR(20)   ENCODE lzo
	,cc_gmt_offset NUMERIC(5,2)   ENCODE az64
	,cc_tax_percentage NUMERIC(5,2)   ENCODE az64
)
DISTSTYLE ALL
;
ALTER TABLE public.call_center_parquet OWNER TO awsuser;

The table has 31 columns.

The Parquet file doesn’t contain any value for the cc_gmt_offset and cc_tax_percentage fields. It has 29 columns. The following screenshot shows the schema definition for the Parquet file located in Amazon S3, which we load into Amazon Redshift.

We ran the COPY command two different ways: with or without the FILLRECORD parameter.

We first tried to load the Parquet file into the call_center_parquet table without the FILLRECORD parameter:

COPY call_center_parquet
FROM 's3://*****************/parquet/part-00000-d9a3ab22-9d7d-439a-b607-2ddc2d39c5b0-c000.snappy.parquet'
iam_role 'arn:aws:iam::**********:role/RedshiftAttachedRole'
FORMAT PARQUET;

It generated an error while performing the copy.

Next, we tried to load the Parquet file into the call_center_parquet table and used the FILLRECORD parameter:

COPY call_center_parquet
FROM 's3://*****************/parquet/part-00000-d9a3ab22-9d7d-439a-b607-2ddc2d39c5b0-c000.snappy.parquet'
iam_role 'arn:aws:iam::**********:role/RedshiftAttachedRole'
FORMAT PARQUET FILLRECORD;

The Parquet data was loaded successfully in the call_center_parquet table, and NULL was entered into the cc_gmt_offset and cc_tax_percentage columns.

Split large text files while copying

The second new feature we discuss in this post is automatically splitting large files to take advantage of the massive parallelism of the Amazon Redshift cluster. A best practice when using the COPY command in Amazon Redshift is to load data using a single COPY command from multiple data files. This loads data in parallel by dividing the workload among the nodes and slices in the Amazon Redshift cluster. When all the data from a single file or small number of large files is loaded, Amazon Redshift is forced to perform a much slower serialized load, because the Amazon Redshift COPY command can’t utilize the parallelism of the Amazon Redshift cluster. You have to write additional preprocessing steps to split the large files into smaller files so that the COPY command loads data in parallel into the Amazon Redshift cluster.

The COPY command now supports automatically splitting a single file into multiple smaller scan ranges. This feature is currently supported only for large uncompressed delimited text files. More file formats and options, such as COPY with CSV keyword, will be added in the near future.

This helps improve performance for COPY queries when loading a small number of large uncompressed delimited text files into your Amazon Redshift cluster. Scan ranges are implemented by splitting the files into 64 MB chunks, which get assigned to each Amazon Redshift slice. This change addresses ease of use because you don’t need to split large uncompressed text files as an additional preprocessing step.

With Amazon Redshift’s ability to split large uncompressed text files, you can see performance improvements for the COPY command with a single large file or a few files with significantly varying relative sizes (for example, one file 5 GB in size and 20 files of a few KBs). Performance improvements for the COPY command are more significant as the file size increases even with keeping the same Amazon Redshift cluster configuration. Based on tests done, we observed a more than 1,500% performance improvement for the COPY command for loading a 6 GB uncompressed text file when the auto splitting feature became available.

There are no changes in the COPY query or keywords to enable this change, and splitting of files is automatically applied for the eligible COPY commands. Splitting isn’t applicable for the COPY query with the keywords CSV, REMOVEQUOTES, ESCAPE, and FIXEDWIDTH.

For the test, we used a single 6 GB uncompressed text file and the following COPY command:

COPY store_sales
FROM 's3://*****************/text/store_sales.txt'
iam_role 'arn:aws:iam::**********:role/RedshiftAttachedRole';

The Amazon Redshift cluster without the auto split option took 102 seconds to copy the file from Amazon S3 to the Amazon Redshift store_sales table. When the auto split option was enabled in the Amazon Redshift cluster (without any other configuration changes), the same 6 GB uncompressed text file took just 6.19 seconds to copy the file from Amazon S3 to the store_sales table.

Summary

In this post, we showed two enhancements to the Amazon Redshift COPY command. First, we showed how you can add the FILLRECORD parameter in the COPY command in order to successfully load data files even when the contiguous fields are missing at the end of some of the records, as long as the target table has all the columns. Secondly, we described how Amazon Redshift auto splits large uncompressed text files into 64 MB chunks before copying the files into the Amazon Redshift cluster to enhance COPY performance. This automatic split of large files allows you to use the COPY command on large uncompressed text files—Amazon Redshift auto splits the file without needing you to add a preprocessing step to the split the large files yourself. Try these features to make your data loading to Amazon Redshift much simpler by removing custom preprocessing steps.

In Part 2 of this series, we will discuss additional new features of Amazon Redshift COPY command and demonstrate how you can take benefits of those to optimize your data loading process.

About the Authors

Dipankar Kushari is a Senior Analytics Solutions Architect with AWS.

Cody Cunningham is a Software Development Engineer with AWS, working on data ingestion for Amazon Redshift.

Joe Yong is a Senior Technical Product Manager on the Amazon Redshift team and a relentless pursuer of making complex database technologies easy and intuitive for the masses. He has worked on database and data management systems for SMP, MPP, and distributed systems. Joe has shipped dozens of features for on-premises and cloud-native databases that serve IoT devices through petabyte-sized cloud data warehouses. Off keyboard, Joe tries to onsight 5.11s, hunt for good eats, and seek a cure for his Australian Labradoodle’s obsession with squeaky tennis balls.

Anshul Purohit is a Software Development Engineer with AWS, working on data ingestion and query processing for Amazon Redshift.