All posts by Milind Oke

Unlock insights on Amazon RDS for MySQL data with zero-ETL integration to Amazon Redshift

Post Syndicated from Milind Oke original https://aws.amazon.com/blogs/big-data/unlock-insights-on-amazon-rds-for-mysql-data-with-zero-etl-integration-to-amazon-redshift/

Amazon Relational Database Service (Amazon RDS) for MySQL zero-ETL integration with Amazon Redshift was announced in preview at AWS re:Invent 2023 for Amazon RDS for MySQL version 8.0.28 or higher. In this post, we provide step-by-step guidance on how to get started with near real-time operational analytics using this feature. This post is a continuation of the zero-ETL series that started with Getting started guide for near-real time operational analytics using Amazon Aurora zero-ETL integration with Amazon Redshift.

Challenges

Customers across industries today are looking to use data to their competitive advantage and increase revenue and customer engagement by implementing near real time analytics use cases like personalization strategies, fraud detection, inventory monitoring, and many more. There are two broad approaches to analyzing operational data for these use cases:

  • Analyze the data in-place in the operational database (such as read replicas, federated query, and analytics accelerators)
  • Move the data to a data store optimized for running use case-specific queries such as a data warehouse

The zero-ETL integration is focused on simplifying the latter approach.

The extract, transform, and load (ETL) process has been a common pattern for moving data from an operational database to an analytics data warehouse. ELT is where the extracted data is loaded as is into the target first and then transformed. ETL and ELT pipelines can be expensive to build and complex to manage. With multiple touchpoints, intermittent errors in ETL and ELT pipelines can lead to long delays, leaving data warehouse applications with stale or missing data, further leading to missed business opportunities.

Alternatively, solutions that analyze data in-place may work great for accelerating queries on a single database, but such solutions aren’t able to aggregate data from multiple operational databases for customers that need to run unified analytics.

Zero-ETL

Unlike the traditional systems where data is siloed in one database and the user has to make a trade-off between unified analysis and performance, data engineers can now replicate data from multiple RDS for MySQL databases into a single Redshift data warehouse to derive holistic insights across many applications or partitions. Updates in transactional databases are automatically and continuously propagated to Amazon Redshift so data engineers have the most recent information in near real time. There is no infrastructure to manage and the integration can automatically scale up and down based on the data volume.

At AWS, we have been making steady progress towards bringing our zero-ETL vision to life. At the time of writing, the supported sources for zero-ETL integrations with Amazon Redshift are Amazon Aurora MySQL-Compatible Edition, Amazon Aurora PostgreSQL-Compatible Edition (preview), Amazon RDS for MySQL (preview), and Amazon DynamoDB (preview).

When you create a zero-ETL integration for Amazon Redshift, you continue to pay for underlying source database and target Redshift database usage. Refer to Zero-ETL integration costs (Preview) for further details.

With zero-ETL integration with Amazon Redshift, the integration replicates data from the source database into the target data warehouse. The data becomes available in Amazon Redshift within seconds, allowing you to use the analytics features of Amazon Redshift and capabilities like data sharing, workload optimization autonomics, concurrency scaling, machine learning, and many more. You can continue with your transaction processing on Amazon RDS or Amazon Aurora while simultaneously using Amazon Redshift for analytics workloads such as reporting and dashboards.

The following diagram illustrates this architecture.

AWS architecture diagram showcasing example zero-ETL architecture

Solution overview

Let’s consider TICKIT, a fictional website where users buy and sell tickets online for sporting events, shows, and concerts. The transactional data from this website is loaded into an Amazon RDS for MySQL 8.0.28 (or higher version) database. The company’s business analysts want to generate metrics to identify ticket movement over time, success rates for sellers, and the best-selling events, venues, and seasons. They would like to get these metrics in near real time using a zero-ETL integration.

The integration is set up between Amazon RDS for MySQL (source) and Amazon Redshift (destination). The transactional data from the source gets refreshed in near real time on the destination, which processes analytical queries.

You can use either the serverless option or an encrypted RA3 cluster for Amazon Redshift. For this post, we use a provisioned RDS database and a Redshift provisioned data warehouse.

The following diagram illustrates the high-level architecture.

High-level zero-ETL architecture for TICKIT data use case

The following are the steps needed to set up zero-ETL integration. These steps can be done automatically by the zero-ETL wizard, but a restart is required if the wizard changes a setting for Amazon RDS or Amazon Redshift. You can also perform these steps manually, if they're not already configured, and do the restarts at your convenience. For the complete getting started guides, refer to Working with Amazon RDS zero-ETL integrations with Amazon Redshift (preview) and Working with zero-ETL integrations.

  1. Configure the RDS for MySQL source with a custom DB parameter group.
  2. Configure the Redshift cluster to enable case-sensitive identifiers.
  3. Configure the required permissions.
  4. Create the zero-ETL integration.
  5. Create a database from the integration in Amazon Redshift.

Configure the RDS for MySQL source with a customized DB parameter group

To create an RDS for MySQL database, complete the following steps:

  1. On the Amazon RDS console, create a DB parameter group called zero-etl-custom-pg.

Zero-ETL integration works by using binary logs (binlogs) generated by MySQL database. To enable binlogs on Amazon RDS for MySQL, a specific set of parameters must be enabled.

  2. Set the following binlog parameter settings in the DB parameter group:
    • binlog_format = ROW
    • binlog_row_image = FULL
    • binlog_checksum = NONE

In addition, make sure that the binlog_row_value_options parameter is not set to PARTIAL_JSON. By default, this parameter is not set.

  3. Choose Databases in the navigation pane, then choose Create database.
  4. For Engine Version, choose MySQL 8.0.28 (or higher).

Selected MySQL Community edition Engine version 8.0.36

  5. For Templates, select Production.
  6. For Availability and durability, select either Multi-AZ DB instance or Single DB instance (Multi-AZ DB clusters are not supported, as of this writing).
  7. For DB instance identifier, enter zero-etl-source-rms.

Selected Production template, Multi-AZ DB instance and DB instance identifier zero-etl-source-rms

  8. Under Instance configuration, select Memory optimized classes and choose the db.r6g.large instance, which is sufficient for the TICKIT use case.

Selected db.r6g.large for DB instance class under Instance configuration

  9. Under Additional configuration, for DB parameter group, choose the parameter group you created earlier (zero-etl-custom-pg).

Selected DB parameter group zero-etl-custom-pg under Additional configuration

  10. Choose Create database.

In a couple of minutes, Amazon RDS should provision the RDS for MySQL database that serves as the source for the zero-ETL integration.

RDS instance status showing as Available
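If you prefer to script this part of the setup, the following AWS CLI sketch creates the DB parameter group and applies the binlog settings described earlier. This is a minimal example assuming the mysql8.0 parameter group family and immediate apply; verify whether your engine version requires a reboot for the changes to take effect.

# Illustrative only: create the parameter group and set the binlog parameters
aws rds create-db-parameter-group \
  --db-parameter-group-name zero-etl-custom-pg \
  --db-parameter-group-family mysql8.0 \
  --description "Custom parameter group for zero-ETL integration"

aws rds modify-db-parameter-group \
  --db-parameter-group-name zero-etl-custom-pg \
  --parameters "ParameterName=binlog_format,ParameterValue=ROW,ApplyMethod=immediate" \
               "ParameterName=binlog_row_image,ParameterValue=full,ApplyMethod=immediate" \
               "ParameterName=binlog_checksum,ParameterValue=NONE,ApplyMethod=immediate"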

Configure the Redshift destination

After you create your source DB cluster, you must create and configure a target data warehouse in Amazon Redshift. The data warehouse must meet the following requirements:

  • Using an RA3 node type (ra3.16xlarge, ra3.4xlarge, or ra3.xlplus) or Amazon Redshift Serverless
  • Encrypted (if using a provisioned cluster)

For our use case, create a Redshift cluster by completing the following steps:

  1. On the Amazon Redshift console, choose Configurations and then choose Workload management.
  2. In the parameter group section, choose Create.
  3. Create a new parameter group named zero-etl-rms.
  4. Choose Edit parameters and change the value of enable_case_sensitive_identifier to True.
  5. Choose Save.

You can also use the AWS Command Line Interface (AWS CLI) command update-workgroup for Redshift Serverless:

aws redshift-serverless update-workgroup --workgroup-name <your-workgroup-name> --config-parameters parameterKey=enable_case_sensitive_identifier,parameterValue=true

Cluster parameter group setup
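For a provisioned cluster, a sketch of the equivalent AWS CLI command, using the parameter group created above, would be:

# Illustrative only: set the parameter on the provisioned cluster parameter group
aws redshift modify-cluster-parameter-group \
  --parameter-group-name zero-etl-rms \
  --parameters ParameterName=enable_case_sensitive_identifier,ParameterValue=true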

  6. Choose Provisioned clusters dashboard.

At the top of your console window, you will see a Try new Amazon Redshift features in preview banner.

  7. Choose Create preview cluster.

Create preview cluster

  8. For Preview track, choose preview_2023.
  9. For Node type, choose one of the supported node types (for this post, we use ra3.xlplus).

Selected ra3.xlplus node type for preview cluster

  10. Under Additional configurations, expand Database configurations.
  11. For Parameter groups, choose zero-etl-rms.
  12. For Encryption, select Use AWS Key Management Service.

Database configuration showing parameter groups and encryption

  13. Choose Create cluster.

The cluster should become Available in a few minutes.

Cluster status showing as Available

  14. Navigate to the namespace zero-etl-target-rs-ns and choose the Resource policy tab.
  15. Choose Add authorized principals.
  16. Enter either the Amazon Resource Name (ARN) of the AWS user or role, or the AWS account ID (IAM principals) that are allowed to create integrations.

An account ID is stored as an ARN with the root user.

Add authorized principals on the Clusters resource policy tab

  17. In the Authorized integration sources section, choose Add authorized integration source to add the ARN of the RDS for MySQL DB instance that's the data source for the zero-ETL integration.

You can find this value by going to the Amazon RDS console and navigating to the Configuration tab of the zero-etl-source-rms DB instance.

Add authorized integration source to the Configuration tab of the zero-etl-source-rms DB instance

Your resource policy should resemble the following screenshot.

Completed resource policy setup

Configure required permissions

To create a zero-ETL integration, your user or role must have an attached identity-based policy with the appropriate AWS Identity and Access Management (IAM) permissions. An AWS account owner can configure required permissions for users or roles who may create zero-ETL integrations. The sample policy allows the associated principal to perform the following actions:

  • Create zero-ETL integrations for the source RDS for MySQL DB instance.
  • View and delete all zero-ETL integrations.
  • Create inbound integrations into the target data warehouse. This permission is not required if the same account owns the Redshift data warehouse and this account is an authorized principal for that data warehouse. Also note that Amazon Redshift has a different ARN format for provisioned and serverless clusters:
    • Provisioned – arn:aws:redshift:{region}:{account-id}:namespace:namespace-uuid
    • Serverless – arn:aws:redshift-serverless:{region}:{account-id}:namespace/namespace-uuid

Complete the following steps to configure the permissions:

  1. On the IAM console, choose Policies in the navigation pane.
  2. Choose Create policy.
  3. Create a new policy called rds-integrations using the following JSON (replace region and account-id with your actual values):
{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "rds:CreateIntegration"
        ],
        "Resource": [
            "arn:aws:rds:{region}:{account-id}:db:source-instancename",
            "arn:aws:rds:{region}:{account-id}:integration:*"
        ]
    },
    {
        "Effect": "Allow",
        "Action": [
            "rds:DescribeIntegration"
        ],
        "Resource": ["*"]
    },
    {
        "Effect": "Allow",
        "Action": [
            "rds:DeleteIntegration"
        ],
        "Resource": [
            "arn:aws:rds:{region}:{account-id}:integration:*"
        ]
    },
    {
        "Effect": "Allow",
        "Action": [
            "redshift:CreateInboundIntegration"
        ],
        "Resource": [
            "arn:aws:redshift:{region}:{account-id}:cluster:namespace-uuid"
        ]
    }]
}
  4. Attach the policy you created to your IAM user or role.

Create the zero-ETL integration

To create the zero-ETL integration, complete the following steps:

  1. On the Amazon RDS console, choose Zero-ETL integrations in the navigation pane.
  2. Choose Create zero-ETL integration.

Create zero-ETL integration on the Amazon RDS console

  3. For Integration identifier, enter a name, for example zero-etl-demo.

Enter the Integration identifier

  4. For Source database, choose Browse RDS databases and choose the source cluster zero-etl-source-rms.
  5. Choose Next.

Browse RDS databases for zero-ETL source

  6. Under Target, for Amazon Redshift data warehouse, choose Browse Redshift data warehouses and choose the Redshift data warehouse (zero-etl-target-rs).
  7. Choose Next.

Browse Redshift data warehouses for zero-ETL integration

  8. Add tags and encryption, if applicable.
  9. Choose Next.
  10. Verify the integration name, source, target, and other settings.
  11. Choose Create zero-ETL integration.

Create zero-ETL integration step 4

You can choose the integration to view the details and monitor its progress. It took about 30 minutes for the status to change from Creating to Active.

Zero-ETL integration details

The time will vary depending on the size of your dataset in the source.
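You can also check the integration status from the AWS CLI (a sketch, assuming the CLI is configured with credentials for the account that owns the source database):

# List zero-ETL integrations and their current status
aws rds describe-integrations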

Create a database from the integration in Amazon Redshift

To create your database from the zero-ETL integration, complete the following steps:

  1. On the Amazon Redshift console, choose Clusters in the navigation pane.
  2. Open the zero-etl-target-rs cluster.
  3. Choose Query data to open the query editor v2.

Query data via the Query Editor v2

  4. Connect to the Redshift data warehouse by choosing Save.

Connect to the Redshift data warehouse

  5. Obtain the integration_id from the svv_integration system table:

select integration_id from svv_integration; -- copy this result, use in the next sql

Query for integration identifier

  6. Use the integration_id from the previous step to create a new database from the integration:

CREATE DATABASE zetl_source FROM INTEGRATION '<result from above>';

Create database from integration

The integration is now complete, and a full snapshot of the source is reflected as is in the destination. Ongoing changes will be synced in near real time.

Analyze the near real time transactional data

Now we can run analytics on TICKIT’s operational data.

Populate the source TICKIT data

To populate the source data, complete the following steps:

  1. Copy the CSV input data files into a local directory. The following is an example command:

aws s3 cp 's3://redshift-blogs/zero-etl-integration/data/tickit' . --recursive

  2. Connect to your RDS for MySQL cluster and create a database or schema for the TICKIT data model, verify that the tables in that schema have a primary key, and initiate the load process:

mysql -h <rds_db_instance_endpoint> -u admin -p --local-infile=1

Connect to your RDS for MySQL cluster and create a database or schema for the TICKIT data model

  3. Use the following CREATE TABLE commands.
  4. Load the data from local files using the LOAD DATA command.

The following is an example. Note that the input CSV file is broken into several files. This command must be run for every file if you would like to load all data. For demo purposes, a partial data load should work as well.

Create users table for demo
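As an illustration only, the commands for the users table might look like the following sketch. The database name, column list, file name, and delimiter are assumptions; adjust them to match the downloaded files.

-- Illustrative only: simplified schema for the TICKIT users table.
-- A primary key is required for zero-ETL replication.
CREATE DATABASE IF NOT EXISTS demodb;
USE demodb;

CREATE TABLE users (
    userid    INT PRIMARY KEY,
    username  VARCHAR(10),
    firstname VARCHAR(30),
    lastname  VARCHAR(30),
    city      VARCHAR(30),
    state     CHAR(2),
    email     VARCHAR(100),
    phone     CHAR(14)
);

-- Load one of the downloaded files; repeat for each file part.
LOAD DATA LOCAL INFILE 'users_part_00.csv'
INTO TABLE users
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';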

Analyze the source TICKIT data in the destination

On the Amazon Redshift console, open the query editor v2 using the database you created as part of the integration setup. Use the following code to validate the seed or CDC activity:

SELECT * FROM SYS_INTEGRATION_ACTIVITY ORDER BY last_commit_timestamp DESC;

Query to validate the seed or CDC activity

You can now apply your business logic for transformations directly on the data that has been replicated to the data warehouse. You can also use performance optimization techniques like creating a Redshift materialized view that joins the replicated tables and other local tables to improve query performance for your analytical queries.

Monitoring

You can query system views and tables in Amazon Redshift, such as SVV_INTEGRATION and SYS_INTEGRATION_ACTIVITY, to get information about your zero-ETL integrations.

To view the integration-related metrics published to Amazon CloudWatch, open the Amazon Redshift console. Choose Zero-ETL integrations in the navigation pane and choose the integration to display activity metrics.

Zero-ETL integration activity metrics

Available metrics on the Amazon Redshift console are integration metrics and table statistics, with table statistics providing details of each table replicated from Amazon RDS for MySQL to Amazon Redshift.

Integration metrics and table statistics

Integration metrics contain table replication success and failure counts and lag details.

Integration metrics showing table replication success and failure counts and lag details.

Manual resyncs

The zero-ETL integration will automatically initiate a resync if a table sync state shows as failed or resync required. But in case the auto resync fails, you can initiate a resync at table-level granularity:

ALTER DATABASE zetl_source INTEGRATION REFRESH TABLES tbl1, tbl2;

A table can enter a failed state for multiple reasons:

  • The primary key was removed from the table. In such cases, you need to re-add the primary key and perform the previously mentioned ALTER command.
  • An invalid value is encountered during replication or a new column is added to the table with an unsupported data type. In such cases, you need to remove the column with the unsupported data type and perform the previously mentioned ALTER command.
  • An internal error, in rare cases, can cause table failure. The ALTER command should fix it.

Clean up

When you delete a zero-ETL integration, your transactional data isn’t deleted from the source RDS or the target Redshift databases, but Amazon RDS doesn’t send any new changes to Amazon Redshift.

To delete a zero-ETL integration, complete the following steps:

  1. On the Amazon RDS console, choose Zero-ETL integrations in the navigation pane.
  2. Select the zero-ETL integration that you want to delete and choose Delete.
  3. To confirm the deletion, choose Delete.

delete a zero-ETL integration

Conclusion

In this post, we showed you how to set up a zero-ETL integration from Amazon RDS for MySQL to Amazon Redshift. This minimizes the need to maintain complex data pipelines and enables near real time analytics on transactional and operational data.

To learn more about Amazon RDS zero-ETL integration with Amazon Redshift, refer to Working with Amazon RDS zero-ETL integrations with Amazon Redshift (preview).


 About the Authors

Milind Oke is a senior Redshift specialist solutions architect who has worked at Amazon Web Services for three years. He is an AWS-certified SA Associate, Security Specialty and Analytics Specialty certification holder, based out of Queens, New York.

Aditya Samant is a relational database industry veteran with over 2 decades of experience working with commercial and open-source databases. He currently works at Amazon Web Services as a Principal Database Specialist Solutions Architect. In his role, he spends time working with customers designing scalable, secure and robust cloud native architectures. Aditya works closely with the service teams and collaborates on designing and delivery of the new features for Amazon’s managed databases.

Stored procedure enhancements in Amazon Redshift

Post Syndicated from Milind Oke original https://aws.amazon.com/blogs/big-data/stored-procedure-enhancements-in-amazon-redshift/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. With Amazon Redshift, you can analyze all your data to derive holistic insights about your business and your customers. It supports stored procedures where prepared SQL code is saved and the code can be reused over and over again.

Stored procedures are commonly used to encapsulate logic for data transformation, data validation, and business-specific logic. By combining multiple SQL steps into a stored procedure, you can create reusable code blocks that can run together as a single transaction or multiple individual transactions. You can also schedule stored procedures to automate data processing on Amazon Redshift. For more information, refer to Bringing your stored procedures to Amazon Redshift.

In the default atomic transaction mode for Redshift stored procedures, a call to a stored procedure creates its own transaction when the call starts, or becomes part of the existing transaction if an explicit transaction is opened before the stored procedure is called. All the statements inside a procedure behave as if they are in a single transaction block that ends when the stored procedure call finishes. A nested call to another procedure is treated like any other SQL statement and operates within the context of the same transaction as the caller. TRUNCATE, COMMIT, and ROLLBACK statements, as well as exception handling blocks that contain arbitrary SQL statements, implicitly close the current transaction and start a new one. This behavior can cause challenges when migrating to Amazon Redshift from other systems like Teradata.

In this post, we discuss the enhancements to Amazon Redshift stored procedures for non-atomic transaction mode. This mode provides enhanced transaction controls that enable you to automatically commit the statements inside the stored procedure.

Non-atomic transaction mode

The new non-atomic transaction mode feature provides three enhancements on stored procedures in Amazon Redshift:

  • Unless the DML or DDL statements are part of an explicit open transaction, each statement in the stored procedure runs in its own implicit transaction, and a new transaction is opened to handle the following statements. If an explicit transaction is opened, then all subsequent statements run and remain uncommitted until an explicit transaction control command (COMMIT or ROLLBACK) is run to end the transaction.
  • Amazon Redshift will not re-raise the exception after the exception handler statements are complete. Therefore, a new RAISE statement without any INFO or EXCEPTION has been provided to re-throw the exception caught by the exception handling block. This RAISE statement without any INFO or EXCEPTION will only be allowed in the exception handling block.
  • Also, the new START TRANSACTION statement begins an explicit transaction inside the non-atomic transaction mode stored procedure. Use the existing transaction control command (COMMIT or ROLLBACK) to end the explicitly started transaction.
    • Amazon Redshift does not support sub-transactions so if there is already an open transaction, then calling this statement again will do nothing, and no error is raised.
    • If an explicit transaction is still open when the nonatomic transaction mode stored procedure call ends, then the explicit transaction remains open until a transaction control command is run in the session.
    • If the session disconnects before running a transaction control command, the whole transaction is automatically rolled back.

Additional restrictions

Some restrictions have also been introduced for Redshift stored procedures:

  • For nesting stored procedure calls, all the procedures must be created in the same transaction mode, no matter if it’s in atomic (default) transaction mode or the new non-atomic transaction mode
  • You can’t nest stored procedures across the two transaction modes (atomic and non-atomic)
  • You can’t set the SECURITY DEFINER option or SET configuration_parameter option for non-atomic transaction mode stored procedures

Impact to cursors

Cursors in non-atomic transaction mode stored procedures will behave differently compared to the default atomic transaction mode:

  • Cursor statements will need an explicit transaction block before beginning the cursor to ensure that each iteration of the cursor loop is not auto-committed.
  • To return a cursor from a non-atomic transaction mode stored procedure, you will need an explicit transaction block before beginning the cursor. Otherwise, the cursor will be closed when the SQL statement inside the loop is automatically committed. A sketch of this pattern follows this list.
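The following is a minimal sketch (not from the original examples) of a non-atomic procedure that wraps a cursor loop in an explicit transaction; the procedure, table, and column names are placeholders.

-- Illustrative only: the explicit transaction keeps the cursor open across
-- iterations instead of each statement auto-committing and closing it.
CREATE OR REPLACE PROCEDURE sp_touch_rows() NONATOMIC AS
$$
DECLARE
    cur CURSOR FOR SELECT id FROM my_table;
    v_id INT;
BEGIN
    START TRANSACTION;
    OPEN cur;
    LOOP
        FETCH cur INTO v_id;
        EXIT WHEN NOT FOUND;
        UPDATE my_table SET touched_at = getdate() WHERE id = v_id;
    END LOOP;
    CLOSE cur;
    COMMIT;   -- end the explicit transaction
END;
$$
LANGUAGE plpgsql;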

Advantages

The following are key advantages of this feature from a user perspective:

  • It provides the capability to lift and shift Teradata stored procedures that were written to run in Teradata session mode. This helps in seamless migrations from traditional data warehouses like Teradata and SQL Server.
  • It enables Amazon Redshift to provide more flexible operations inside of stored procedures when encountering errors and exceptions. Amazon Redshift can now preserve previous action’s state before reaching an exception.

Syntax

The new optional keyword NONATOMIC has been added to the stored procedure definition syntax, as shown in the following code:

CREATE [ OR REPLACE ] PROCEDURE sp_procedure_name
( [ [ argname ] [ argmode ] argtype [, ...] ] )
[ NONATOMIC ]
AS $$
procedure_body
$$ LANGUAGE plpgsql

This optional keyword creates the stored procedure under the non-atomic transaction mode. If you don’t specify the keyword, then the default atomic mode will be the transaction mode when creating the stored procedure.

NONATOMIC means each DML and DDL statement in the procedure will be implicitly committed.

Without non-atomic mode, the procedure will create its own transaction when the call starts or be part of the existing transaction if an explicit transaction is opened before it is called. Every statement within the stored procedure will belong to this one transaction.

Example of NONATOMIC mode

Let’s consider the customer contact table custcontacts, which stores customer primary and secondary contact phone numbers:

CREATE table custcontacts(
custid int4 not null,
primaryphone char(10),
secondaryphone char(10));

We insert three sample customer records with placeholder contact values:

INSERT INTO custcontacts VALUES (101, 'xxxxxxxxxx', 'xxxxxxxxxx');
INSERT INTO custcontacts VALUES (102, 'xxxxxxxxxx', 'xxxxxxxxxx');
INSERT INTO custcontacts VALUES (103, 'xxxxxxxxxx', 'xxxxxxxxxx');

You need to create a stored procedure to update the primary and secondary phone numbers. The requirement is not to roll back updates to the primary contact number if updates to the secondary contact number fail for some reason.

You can achieve this by creating the stored procedure with the NONATOMIC keyword. The NONATOMIC keyword ensures that each statement in the stored procedure runs in its own implicit transaction block. Therefore, if the UPDATE statement for the secondary phone fails, then it won’t roll back the data update made to the primary phone. See the following code:

CREATE PROCEDURE sp_update_custcontacts(cid int4,pphone char(15),sphone char(15)) NONATOMIC AS
$$
BEGIN
UPDATE custcontacts SET primaryphone=pphone WHERE custid=cid;
UPDATE custcontacts SET secondaryphone=sphone WHERE custid=cid;
END;
$$
LANGUAGE plpgsql;

Now let’s call the stored procedure passing the secondary phone number with more than 10 digits, which will fail in the secondaryphone UPDATE statement due to incorrect length:

call sp_update_custcontacts(101,'1234567890','345443345324');

The preceding procedure call will update the primary phone number successfully. The secondary phone number update fails. However, the primaryphone update will not roll back because it ran in its own implicit transaction block due to the NONATOMIC clause in the stored procedure definition.

select * from custcontacts;

 custid | primaryphone | secondaryphone
--------+--------------+----------------
    101 | 1234567890   | xxxxxxxxxx
    102 | xxxxxxxxxx   | xxxxxxxxxx
    103 | xxxxxxxxxx   | xxxxxxxxxx

Exception handling in NONATOMIC mode

Exceptions are handled in stored procedures differently based on the atomic or non-atomic mode:

  • Atomic (default) – Exceptions are always re-raised
  • Non-atomic – Exceptions are handled and you can choose to re-raise or not

Let’s continue with the previous example to illustrate exception handling in non-atomic mode.

Create the following table to log exceptions raised by stored procedures:

CREATE TABLE procedure_log
(log_timestamp timestamp, procedure_name varchar(100), error_message varchar(255));

Now update the sp_update_custcontacts() procedure to handle exceptions. Note that we’re adding an EXCEPTION block in the procedure definition. It inserts a record in the procedure_log table in the event of an exception.

CREATE OR REPLACE PROCEDURE sp_update_custcontacts(cid int4,pphone char(15),sphone char(15)) NONATOMIC AS
$$
BEGIN
UPDATE custcontacts SET primaryphone=pphone WHERE custid=cid;
UPDATE custcontacts SET secondaryphone=sphone WHERE custid=cid;
EXCEPTION
WHEN OTHERS THEN
INSERT INTO procedure_log VALUES (getdate(), 'sp_update_custcontacts', sqlerrm);
END;
$$
LANGUAGE plpgsql;

Now create one more stored procedure, which will call the preceding procedure. It also has an EXCEPTION block and inserts a record in the procedure_log table in the event of an exception.

CREATE PROCEDURE sp_update_customer() NONATOMIC AS
$$
BEGIN
-- Let us assume you have additional statements here to update other fields. For this example, we omitted them for simplification.
-- Nested call to update contacts
call sp_update_custcontacts(101,'1234567890','345443345324');
EXCEPTION
WHEN OTHERS THEN
INSERT INTO procedure_log VALUES (getdate(), 'sp_update_customer', sqlerrm);
END;
$$
LANGUAGE plpgsql;

Let’s call the parent procedure we created:

call sp_update_customer();

This in turn will call the sp_update_custcontacts() procedure. The inner procedure sp_update_custcontacts() will fail because we’re updating the secondary phone with an invalid value. The control will enter the EXCEPTION block of the sp_update_custcontacts() procedure and make an insert into the procedure_log table.

However, it will not re-raise the exception in non-atomic mode. Therefore, the parent procedure sp_update_customer() will not get the exception passed from the sp_update_custcontacts() procedure. The control will not enter the EXCEPTION block of the sp_update_customer() procedure.

If you query the procedure_log table, you will see an entry only for the error handled by the sp_update_custcontacts() procedure:

select * from procedure_log;

Procedure Log Output

Now redefine the sp_update_custcontacts() procedure with the RAISE statement:

CREATE OR REPLACE PROCEDURE sp_update_custcontacts(cid int4,pphone char(15),sphone char(15)) NONATOMIC AS
$$
BEGIN
UPDATE custcontacts SET primaryphone=pphone WHERE custid=cid;
UPDATE custcontacts SET secondaryphone=sphone WHERE custid=cid;
EXCEPTION
WHEN OTHERS THEN
INSERT INTO procedure_log VALUES (getdate(), 'sp_update_custcontacts', sqlerrm);
RAISE;
END;
$$
LANGUAGE plpgsql;

Let’s call the parent stored procedure sp_update_customer() again:

call sp_update_customer();

Now the inner procedure sp_update_custcontacts() will re-raise the exception to the parent procedure sp_update_customer() after handling the exception in its own EXCEPTION block. Then the control will reach the EXCEPTION block in the parent procedure and insert another record into the procedure_log table.

If you query the procedure_log table now, you will see two entries: one by the inner procedure sp_update_custcontacts() and another by the parent procedure sp_update_customer(). This demonstrates that the RAISE statement in the inner procedure re-raised the exception.

select * from procedure_log;

Procedure log output

Explicit START TRANSACTION statement in non-atomic mode

You can issue a START TRANSACTION statement to begin a transaction block inside the stored procedure. It will open a new transaction inside the stored procedure. For examples, refer to Nonatomic mode stored procedure transaction management.
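As a minimal sketch (assumed, not from the referenced documentation), the following procedure reuses the custcontacts example to group the two phone updates into one explicit transaction, so they commit together, while any statements outside the block still auto-commit.

-- Illustrative only: an explicit transaction inside a NONATOMIC procedure
CREATE OR REPLACE PROCEDURE sp_update_both_phones(cid int4, pphone char(15), sphone char(15)) NONATOMIC AS
$$
BEGIN
    START TRANSACTION;
    UPDATE custcontacts SET primaryphone = pphone WHERE custid = cid;
    UPDATE custcontacts SET secondaryphone = sphone WHERE custid = cid;
    COMMIT;   -- both updates become visible together
END;
$$
LANGUAGE plpgsql;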

Conclusion

In this post, we discussed the enhancements to Redshift stored procedures for non-atomic transaction mode, which provides enhanced transaction controls that enable you to automatically commit the statements inside the stored procedure. This mode also enables easier migration to Amazon Redshift from other systems like Teradata. Try out these enhancements and let us know your experience in the comments.


About the Authors

Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift.

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 17 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Kiran Chinta is a Software Development Manager at Amazon Redshift. He leads a strong team in query processing, SQL language, data security, and performance. Kiran is passionate about delivering products that seamlessly integrate with customers’ business applications with the right ease of use and performance. In his spare time, he enjoys reading and playing tennis.

Huichen Liu is a software development engineer on the Amazon Redshift query processing team. She focuses on query optimization, statistics and SQL language features. In her spare time, she enjoys hiking and photography.

Simplify management of database privileges in Amazon Redshift using role-based access control

Post Syndicated from Milind Oke original https://aws.amazon.com/blogs/big-data/simplify-management-of-database-privileges-in-amazon-redshift-using-role-based-access-control/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. With Amazon Redshift, you can analyze all your data to derive holistic insights about your business and your customers. One of the challenges with security is that enterprises don’t want to have a concentration of superuser privileges amongst a handful of users. Instead, enterprises want to design their overarching security posture based on the specific duties performed via roles and assign these elevated privilege roles to different users. By assigning different privileges to different roles and assigning these roles to different users, enterprises can have more granular control of elevated user access.

In this post, we explore the role-based access control (RBAC) features of Amazon Redshift and how you can use roles to simplify managing the privileges required by your end-users. We also cover new system views and functions introduced alongside RBAC.

Overview of RBAC in Amazon Redshift

As a security best practice, it’s recommended to design security by applying the principle of least privileges. In Amazon Redshift, RBAC applies the same principle to users based on their specific work-related role requirements, regardless of the type of database objects involved. This granting of privileges is performed at a role level, without the need to grant permissions for the individual user or user groups. You have four system-defined roles to get started, and can create additional, more granular roles with privileges to run commands that used to require the superuser privilege. With RBAC, you can limit access to certain commands and assign roles to authorized users. And you can assign object-level as well as system-level privileges to roles across Amazon Redshift native objects.

System-defined roles in Amazon Redshift

Amazon Redshift provides four system-defined roles that come with specific privileges. These can’t be altered or customized, but you can create your own roles as required. The system-defined roles use the sys: prefix, and you can’t use this prefix for the roles you create.

The following list summarizes the roles and their privileges.

  • sys:operator – Can access catalog or system tables, and analyze, vacuum, or cancel queries.
  • sys:dba – Can create schemas, create tables, drop schemas, drop tables, truncate tables, create or replace stored procedures, drop procedures, create or replace functions, create or replace external functions, create views, and drop views. Additionally, this role inherits all the privileges from the sys:operator role.
  • sys:superuser – Has the same privileges as the Amazon Redshift superuser.
  • sys:secadmin – Can create users, alter users, drop users, create roles, drop roles, and grant roles. This role can have access to user tables only when the privilege is explicitly granted to the role.

System privileges

Amazon Redshift also adds support for system privileges that can be granted to a role or a user. A system privilege allows admins to grant a limited set of privileges to a user, such as the ability to create and alter users. These system-defined privileges are immutable and can’t be altered, removed, or added to.
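For illustration, the following sketch grants a narrow set of system privileges to a custom role; the role and user names are placeholders, not part of the later examples.

-- Illustrative only: a custom role limited to user administration
CREATE ROLE user_admin;
GRANT CREATE USER TO ROLE user_admin;
GRANT DROP USER TO ROLE user_admin;
GRANT ALTER USER TO ROLE user_admin;

-- some_user (a placeholder) can now manage users, and nothing else
GRANT ROLE user_admin TO some_user;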

Create custom roles for RBAC in Amazon Redshift

To further granularize the system privileges being granted to users to perform specific tasks, you can create custom roles that authorize users to perform those specific tasks within the Amazon Redshift cluster.

RBAC also supports nesting of roles via role hierarchy, and Amazon Redshift propagates privileges with each role authorization. In the following example, granting role R1 to role R2 and then granting role R2 to role R3 authorizes role R3 with all the privileges from the three roles. Therefore, by granting role R3 to a user, the user has all the privileges from roles R1, R2, and R3.

Amazon Redshift doesn’t allow creation of a cyclic role authorization cycle, so role R3 can’t be granted to role R1, as that would be cyclic role authorization.

You can use the Amazon Redshift commands for privileges to create, grant, and revoke roles, along with the admin options for grant and revoke. Only superusers or regular users who have been granted the create role privilege can use those commands.

RBAC example use cases

For this post, we use the industry standard TPC-H dataset to demonstrate our example use cases.

We have three different teams in the organization: Sales, Marketing, and Admin. For this example, we have two schemas, sales and marketing, in the Amazon Redshift database. Each schema has the following tables: nation, orders, part, partsupp, supplier, region, customer, and lineitem.

We have two different database roles, read-only and read/write, for both the Sales team and Marketing team individually. Each role can only perform operations to the objects belonging to the schema to which the role is assigned. For example, a role assigned to the sales schema can only perform operations based on assigned privileges to the sales schema, and can’t perform any operation on the marketing schema.

The read-only role has read-only access to the objects in the respective schema when the privilege is granted to the objects.

The read/write role has read and write (insert, update) access to the objects in the respective schema when the privileges are granted to the objects.

The Sales team has read-only (role name sales_ro) and read/write (role name sales_rw) privileges.

The Marketing team has similar roles: read-only (role name marketing_ro) and read/write (role name marketing_rw).

The Admin team has one role (db_admin), which has privileges to drop or create database roles, truncate tables, and analyze the entire database. The admin role can perform at the database level across both sales and marketing schemas.

Set up for the example use cases

To set up for the example use cases, create a database admin role and attach it to a database administrator. A superuser must perform all these steps.

All the queries for this post are run in the Amazon Redshift native Query Editor v2, but can be run just the same in any query editor, such as SQLWorkbench/J.

  1. Create the admin role (db_admin):
    create role db_admin;

  2. Create a database user named dbadmin:
    create user dbadmin password 'Test12345';

  3. Assign a system-defined role named sys:dba to the db_admin role:
    grant role sys:dba to role db_admin;

This role has the privileges to create schemas, create tables, drop schemas, drop tables, truncate tables, create or replace stored procedures, drop procedures, create or replace functions, create or replace external functions, create views, drop views, access catalog or system tables, analyze, vacuum, and cancel queries.

  4. Assign a system-defined role named sys:secadmin to the db_admin role:
    grant role sys:secadmin to role db_admin;

This role has the privileges to create users, alter users, drop users, create roles, drop roles, and grant roles.

  5. Assign the user dbadmin to the db_admin role:
    grant role db_admin to dbadmin;

From this point forward, we use the dbadmin user credential for performing any of the following steps when no specific user is mentioned.

  6. Create the sales and marketing database schema:
    create schema sales;
    
    create schema marketing;

  7. Create all eight tables (nation, orders, part, partsupp, supplier, region, customer, and lineitem) in the sales and marketing schemas.

You can use the DDL available on the GitHub repo to create and populate the tables.

After the tables are created and populated, let’s move to the example use cases.

Example 1: Data read-only task

Sales analysts may want to get the list of suppliers with minimal cost. For this, the sales analyst only needs read-only access to the tables in the sales schema.

  1. Let’s create the read-only role (sales_ro) in the sales schema:
    create role sales_ro;

  2. Create a database user named salesanalyst:
    create user salesanalyst password 'Test12345';

  3. Grant the sales schema usage and select access to objects of the sales schema to the read-only role:
    grant usage on schema sales to role sales_ro;
    
    grant select on all tables in schema sales to role sales_ro;

  4. Now assign the user to the read-only sales role:
    grant role sales_ro to salesanalyst;

Now the salesanalyst database user can access the sales schema in the Amazon Redshift database using the salesanalyst credentials.

The salesanalyst user can generate a report of least-expensive suppliers using the following query:

set search_path to sales;
SELECT	TOP 100
	S_ACCTBAL,
	S_NAME,
	N_NAME,
	P_PARTKEY,
	P_MFGR,
	S_ADDRESS,
	S_PHONE,
	S_COMMENT
FROM	PART,
	SUPPLIER,
	PARTSUPP,
	NATION,
	REGION
WHERE	P_PARTKEY	= PS_PARTKEY AND
	S_SUPPKEY	= PS_SUPPKEY AND
	P_SIZE		= 34 AND
	P_TYPE		LIKE '%COPPER' AND
	S_NATIONKEY	= N_NATIONKEY AND
	N_REGIONKEY	= R_REGIONKEY AND
	R_NAME		= 'MIDDLE EAST' AND
	PS_SUPPLYCOST	= (	SELECT	MIN(PS_SUPPLYCOST)
				FROM	PARTSUPP,
					SUPPLIER,
					NATION,
					REGION
				WHERE	P_PARTKEY	= PS_PARTKEY AND
					S_SUPPKEY	= PS_SUPPKEY AND
					S_NATIONKEY	= N_NATIONKEY AND
					N_REGIONKEY	= R_REGIONKEY AND
					R_NAME		= 'MIDDLE EAST'
			  )
ORDER	BY	S_ACCTBAL DESC,
		N_NAME,
		S_NAME,
		P_PARTKEY
;

The salesanalyst user can successfully read data from the region table of the sales schema.

select * from sales.region;

In the following example, the salesanalyst user wants to update the comment for Region key 0 and Region name AFRICA in the region table. But the command fails with a permission denied error because they only have select permission on the region table in the sales schema.

update sales.region
set r_comment = 'Comment from Africa'
where r_regionkey = 0;

The salesanalyst user also wants to access objects from the marketing schema, but the command fails with a permission denied error.

select * from marketing.region;

Example 2: Data read/write task

In this example, the sales engineer who is responsible for building the extract, transform, and load (ETL) pipeline for data processing in the sales schema is given read and write access to perform their tasks. For these steps, we use the dbadmin user unless otherwise mentioned.

  1. Let’s create the read/write role (sales_rw) in the sales schema:
    create role sales_rw;

  2. Create a database user named salesengineer:
    create user salesengineer password 'Test12345';

  3. Grant the sales schema usage and select access to objects of the sales schema to the read/write role by assigning the read-only role to it:
    grant role sales_ro to role sales_rw;

  4. Now assign the user salesengineer to the read/write sales role:
    grant role sales_rw to salesengineer;

Now the salesengineer database user can access the sales schema in the Amazon Redshift database using the salesengineer credentials.

The salesengineer user can successfully read data from the region table of the sales schema.

select * from sales.region;

However, they can’t read tables from the marketing schema because the salesengineer user doesn’t have permission.

select * from marketing.region;

The salesengineer user then tries to update the region table in the sales schema but fails to do so.

update sales.region
set r_comment = 'Comment from Africa'
where r_regionkey = 0;

  5. Now grant additional insert, update, and delete privileges to the read/write role:
grant update, insert, delete on all tables in schema sales to role sales_rw;

The salesengineer user then retries to update the region table in the sales schema and is able to do so successfully.

update sales.region
set r_comment = 'Comment from Africa'
where r_regionkey = 0;


When they read the data, it shows that the comment was updated for Region key 0 (for AFRICA) in the region table in the sales schema.

select * from sales.region;

Now salesengineer wants to analyze the region table since it was updated. However, they can’t do so, because this user doesn’t have the necessary privileges and isn’t the owner of the region table in the sales schema.

analyze sales.region;

Finally, the salesengineer user wants to vacuum the region table since it was updated. However, they can’t do so because they don’t have the necessary privileges and aren’t the owner of the region table.

vacuum sales.region;

Example 3: Database administration task

Amazon Redshift automatically sorts data and runs VACUUM DELETE in the background.

Similarly, Amazon Redshift continuously monitors your database and automatically performs analyze operations in the background. In some situations, such as a major one-off data load, the database administrator may want to perform maintenance on objects in the sales and marketing schemas immediately. They access the database using dbadmin credentials to perform these tasks.

The dbadmin database user can access the Amazon Redshift database using their credentials to perform analyze and vacuum of the region table in the sales schema.

analyze sales.region;

Vacuum sales.region;


Now the dbadmin database user accesses the Amazon Redshift database to perform analyze and vacuum of the region table in the marketing schema.

analyze marketing.region;

vacuum marketing.region;


As part of developing the ETL process, the salesengineer user needs to truncate the region table in the sales schema. However, they can’t perform a truncate because they don’t have the necessary privileges, and aren’t the owner of the region table in the sales schema.

truncate sales.region;


The dbadmin database user can access the Amazon Redshift database to provide truncate table privileges to the sales_rw role.

grant truncate table to role sales_rw;

Now the salesengineer can perform a truncate on the region table in the sales schema successfully.

First, they read the data:

select * from sales.region;


Then they perform the truncate:

truncate sales.region;


They read the data again to see the changes:

select * from sales.region;


For the marketing schema, you must perform similar operations for the marketing analyst and marketing engineer. We include the following scripts for your reference. The dbadmin user can use the following SQL commands to create the marketing roles and database users, assign privileges to those roles, and attach the users to the roles.

create role marketing_ro;

create role marketing_rw;

grant usage on schema marketing to role marketing_ro, role marketing_rw;

grant select on all tables in schema marketing to role marketing_ro;

grant role marketing_ro to role marketing_rw;

grant insert, update, delete on all tables in schema marketing to role marketing_rw;

create user marketinganalyst password 'Test12345';

create user marketingengineer password 'Test12345';

grant role marketing_ro to  marketinganalyst;

grant role marketing_rw to  marketingengineer;

System functions for RBAC in Amazon Redshift

Amazon Redshift has introduced two new functions to provide system information about particular user membership and role membership in additional groups or roles: role_is_member_of and user_is_member_of. These functions are available to superusers as well as regular users. Superusers can check all role memberships, whereas regular users can only check membership for roles that they have been granted access to.

role_is_member_of(role_name, granted_role_name)

The role_is_member_of function returns true if the role is a member of another role. Superusers can check all role memberships; regular users can only check roles to which they have access. You receive an error if the provided roles don't exist or the current user doesn't have access to them. The following two role memberships are checked using the salesengineer user credentials:

select role_is_member_of('sales_rw', 'sales_ro');

select role_is_member_of('sales_ro', 'sales_rw');

user_is_member_of(user_name, role_or_group_name)

The user_is_member_of function returns true if the user is a member of the specified role or group. Superusers can check all user memberships; regular users can only check their own membership. You receive an error if the provided identities don’t exist or the current user doesn’t have access to them. The following user membership is checked using the salesengineer user credentials, and fails because salesengineer doesn’t have access to salesanalyst:

select user_is_member_of('salesanalyst', 'sales_ro');


When the same user membership is checked using the superuser credential, it returns a result:

select user_is_member_of('salesanalyst', 'sales_ro');

When salesengineer checks their own user membership, it returns the correct results:

select user_is_member_of('salesengineer', 'sales_ro');

select user_is_member_of('salesengineer', 'marketing_ro');

select user_is_member_of('marketinganalyst', 'sales_ro');

System views for RBAC in Amazon Redshift

Amazon Redshift has added several new views to be able to view the roles, the assignment of roles to users, the role hierarchy, and the privileges for database objects via roles. These views are available to superusers as well as regular users. Superusers can check all role details, whereas regular users can only check details for roles that they have been granted access to.

For example, you can query svv_user_grants to view the list of users that are explicitly granted roles in the cluster, or query svv_role_grants to view a list of roles that are explicitly granted roles in the cluster. For the full list of system views, refer to SVV views.
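For example, the following queries list those grants; unless you are a superuser, the results are limited to the roles you can access.

-- Roles explicitly granted to users
SELECT * FROM svv_user_grants;

-- Roles explicitly granted to other roles
SELECT * FROM svv_role_grants;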

Conclusion

In this post, we demonstrated how you can use role-based access control to further fortify your security posture by granularizing privileged access across users without needing to centralize superuser privileges in your Amazon Redshift cluster. Try out using database roles for your future Amazon Redshift implementations, and feel free to leave a comment about your experience.

In future posts, we will show how these roles also integrate tightly with workload management. You can use them when defining WLM queues, and also while implementing single sign-on via identity federation with Microsoft Active Directory or a standards-based identity provider, such as Okta Universal Directory or Azure AD, and other SAML-based applications.


About the Authors

Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift.

Dipankar Kushari is a Sr. Specialist Solutions Architect, Analytics with AWS.

Harshida Patel is a Specialist Sr. Solutions Architect, Analytics with AWS.

Debu Panda is a Senior Manager, Product Management with AWS. He is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences such as re:Invent, Oracle Open World, and Java One. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

Huiyuan Wang is a software development engineer of Amazon Redshift. She has been working on MPP databases for over 6 years and has focused on query processing, optimization and metadata security.

Set up cross-account audit logging for your Amazon Redshift cluster

Post Syndicated from Milind Oke original https://aws.amazon.com/blogs/big-data/set-up-cross-account-audit-logging-for-your-amazon-redshift-cluster/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. With Amazon Redshift, you can analyze all your data to derive holistic insights about your business and your customers. One of the best practices of modern application design is to have centralized logging. Troubleshooting application problems is easy when you can correlate all your data together.

When you enable audit logging, Amazon Redshift logs information about connections and user activities in the database. These logs help you monitor the database for security and troubleshooting purposes, a process called database auditing. The logs are stored in Amazon Simple Storage Service (Amazon S3) buckets. These provide convenient access with data security features for users who are responsible for monitoring activities in the database.

If you want to establish a central audit logging account to capture audit logs generated by Amazon Redshift clusters located in separated AWS accounts, you can use the solution in this post to achieve cross-account audit logging for Amazon Redshift. As of this writing, the Amazon Redshift console only lists S3 buckets from the same account (in which the Amazon Redshift cluster is located) while enabling audit logging, so you can’t set up cross-account audit logging using the Amazon Redshift console. In this post, we demonstrate how to configure cross-account audit logging using the AWS Command Line Interface (AWS CLI).

Prerequisites

For this walkthrough, you must have the following prerequisites:

  • Two AWS accounts: one for analytics and one for centralized logging
  • A provisioned Amazon Redshift cluster in the analytics AWS account
  • An S3 bucket in the centralized logging AWS account
  • Access to the AWS CLI

Overview of solution

As a general security best practice, we recommend making sure that Amazon Redshift audit logs are sent to the correct S3 buckets. The Amazon Redshift service team has introduced additional security controls in the event that the destination S3 bucket resides in a different account from the Amazon Redshift cluster owner account. For more information, see Bucket permissions for Amazon Redshift audit logging.

This post uses the AWS CLI to establish cross-account audit logging for Amazon Redshift, as illustrated in the following architecture diagram.

For this post, we established an Amazon Redshift cluster named redshift-analytics-cluster-01 in the analytics account in Region us-east-2.

We also set up an S3 bucket named redshift-cluster-audit-logging-xxxxxxxxxxxx in the centralized logging account for capturing audit logs in Region us-east-1.

Now you’re ready to complete the following steps to set up the cross-account audit logging:

  1. Create AWS Identity and Access Management (IAM) policies in the analytics AWS account.
  2. Create an IAM user and attach the policies you created.
  3. Create an S3 bucket policy in the centralized logging account to allow Amazon Redshift to write audit logs to the S3 bucket, and allow the IAM user to enable audit logging for the S3 bucket.
  4. Configure the AWS CLI.
  5. Enable audit logging in the centralized logging account.

Create IAM policies in the analytics account

Create two IAM policies in the analytics account that has the Amazon Redshift cluster.

The first policy is the Amazon Redshift access policy (we named the policy redshift-audit-logging-redshift-policy). This policy allows the principal to whom it's attached to enable or disable audit logging and describe the logging status for the Amazon Redshift cluster. It also allows the principal to describe Amazon Redshift clusters. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "redshift:EnableLogging",
                "redshift:DisableLogging",
                "redshift:DescribeLoggingStatus"
            ],
            "Resource": "arn:aws:redshift:us-east-2:xxxxxxxxxxxx:cluster:redshift-analytics-cluster-01"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "redshift:DescribeClusters",
            "Resource": "*"
        }
    ]
}

The second policy is the Amazon S3 access policy (we named the policy redshift-audit-logging-s3-policy). This policy allows the principal to whom it’s attached to write to the S3 bucket in the centralized logging account. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:PutObjectAcl"
            ],
            "Resource": [
                "arn:aws:s3:::redshift-cluster-audit-logging-xxxxxxxxxxxx",
                "arn:aws:s3:::redshift-cluster-audit-logging-xxxxxxxxxxxx/*"
            ]
        }
    ]
}

Create an IAM user and attach the policies

Create an IAM user (we named it redshift-audit-logging-user) with programmatic access in the analytics account and attach the policies you created to it.

Save the generated access key ID and secret access key for this user securely. We use these credentials in the next step.
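If you prefer to script this step, the following AWS CLI commands are a minimal sketch of creating the user, attaching the two policies, and generating programmatic credentials; the account ID in the policy ARNs is a placeholder for your analytics account ID:

# create the IAM user in the analytics account
aws iam create-user --user-name redshift-audit-logging-user

# attach the two policies created in the previous step
aws iam attach-user-policy --user-name redshift-audit-logging-user \
    --policy-arn arn:aws:iam::xxxxxxxxxxxx:policy/redshift-audit-logging-redshift-policy
aws iam attach-user-policy --user-name redshift-audit-logging-user \
    --policy-arn arn:aws:iam::xxxxxxxxxxxx:policy/redshift-audit-logging-s3-policy

# generate the access key ID and secret access key for programmatic access
aws iam create-access-key --user-name redshift-audit-logging-user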

Create an S3 bucket policy for the S3 bucket in the centralized logging AWS account

Add the following bucket policy to the audit logging S3 bucket redshift-cluster-audit-logging-xxxxxxxxxxxx in the centralized logging account. This policy serves two purposes: it allows Amazon Redshift to write audit logs to the S3 bucket, and it allows the IAM user to enable audit logging for the S3 bucket. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Put bucket policy needed for audit logging",
            "Effect": "Allow",
            "Principal": {
                "Service": "redshift.amazonaws.com"
            },
            "Action": [
                "s3:PutObject",
                "s3:GetBucketAcl"
            ],
            "Resource": [
                "arn:aws:s3:::redshift-cluster-audit-logging-xxxxxxxxxxxx",
                "arn:aws:s3:::redshift-cluster-audit-logging-xxxxxxxxxxxx/*"
            ]
        },
        {
            "Sid": "Put IAM User bucket policy needed for audit logging",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::xxxxxxxxxxxx:user/redshift-audit-logging-user"
            },
            "Action": "s3:PutObject",
            "Resource": [
                "arn:aws:s3:::redshift-cluster-audit-logging-xxxxxxxxxxxx",
                "arn:aws:s3:::redshift-cluster-audit-logging-xxxxxxxxxxxx/*"
            ]
        }
    ]
}

Note that if the cluster is in one of the opt-in Regions, you must change the service name redshift.amazonaws.com to redshift.region.amazonaws.com.
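As an example, assuming you save the preceding bucket policy to a local file named bucket-policy.json (the file name is illustrative), you can apply it from the centralized logging account with the AWS CLI:

# apply the bucket policy to the audit logging bucket in the centralized logging account
aws s3api put-bucket-policy \
    --bucket redshift-cluster-audit-logging-xxxxxxxxxxxx \
    --policy file://bucket-policy.json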

Configure the AWS CLI

As part of this step, install the AWS CLI and configure it to use the credentials of the IAM user you created earlier. The remaining steps run with the permissions attached to that IAM user.
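For example, you can configure a named profile with the IAM user's credentials (the profile name redshift-audit-logging is illustrative):

aws configure --profile redshift-audit-logging
# AWS Access Key ID [None]: <access key ID of redshift-audit-logging-user>
# AWS Secret Access Key [None]: <secret access key>
# Default region name [None]: us-east-2
# Default output format [None]: json

If you use a named profile, append --profile redshift-audit-logging to the AWS CLI commands in the next step.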

Enable audit logging in the centralized logging account

Run the AWS CLI command to enable audit logging for the Amazon Redshift cluster in an S3 bucket in the centralized logging AWS account. In the following code, provide the Amazon Redshift cluster ID, S3 bucket name, and the prefix applied to the log file names:

aws redshift enable-logging --cluster-identifier <ClusterName> --bucket-name <BucketName> --s3-key-prefix <value>
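For example, using the cluster, bucket, and prefix from this walkthrough (rsauditlog1 is the prefix shown in the screenshots that follow), the command might look like the following; you can then confirm the configuration with describe-logging-status:

aws redshift enable-logging \
    --cluster-identifier redshift-analytics-cluster-01 \
    --bucket-name redshift-cluster-audit-logging-xxxxxxxxxxxx \
    --s3-key-prefix rsauditlog1

# confirm that logging is enabled and points to the cross-account bucket
aws redshift describe-logging-status --cluster-identifier redshift-analytics-cluster-01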

The following screenshot shows that the cross-account Amazon Redshift audit logging is successfully set up.

A test file is also created by AWS to ensure that the log files can be successfully written into the S3 bucket. The following screenshot shows the test file was created successfully in the S3 bucket under the rsauditlog1 prefix.

After some time, we started seeing the audit logs created in the S3 bucket. By default, Amazon Redshift organizes the log files in the S3 bucket using the following bucket and object structure:

AWSLogs/AccountID/ServiceName/Region/Year/Month/Day/AccountID_ServiceName_Region_ClusterName_LogType_Timestamp.gz
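As an illustrative example (the account ID, date, and timestamp are placeholders, and the prefix configured earlier is assumed to be prepended to this structure), a connection log from the cluster in this walkthrough would have a key similar to the following:

rsauditlog1/AWSLogs/xxxxxxxxxxxx/redshift/us-east-2/2024/01/15/xxxxxxxxxxxx_redshift_us-east-2_redshift-analytics-cluster-01_connectionlog_2024-01-15T10:00.gz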

Amazon Redshift logs information in the following log files:

  • Connection log – Logs authentication attempts, connections, and disconnections
  • User log – Logs information about changes to database user definitions
  • User activity log – Logs each query before it’s run on the database

The following screenshot shows that log files, such as connection logs and user activity logs, are now being created in the centralized logging account in us-east-1 from the Amazon Redshift cluster in the analytics account in us-east-2.

For more details on analyzing Amazon Redshift audit logs, refer to the following posts:

  • Visualize Amazon Redshift audit logs using Amazon Athena and Amazon QuickSight
  • How do I analyze my audit logs using Amazon Redshift Spectrum?

Clean up

To avoid incurring future charges, you can delete all the resources you created while following the steps in this post.

Conclusion

In this post, we demonstrated how to accomplish cross-account audit logging for an Amazon Redshift cluster in one account to an Amazon S3 bucket in another account. Using this solution, you can establish a central audit logging account to capture audit logs generated by Amazon Redshift clusters located in separate AWS accounts.

Try this solution to achieve cross-account audit logging for Amazon Redshift, and share your feedback in the comments.


About the Authors

Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift.

Dipankar Kushari is a Sr. Analytics Solutions Architect with AWS.

Pankaj Pattewar is a Cloud Application Architect at Amazon Web Services. He specializes in architecting and building cloud-native applications and enables customers with best practices in their cloud journey.

Sudharshan Veerabatheran is a Cloud Support Engineer based out of Portland.

Implement a slowly changing dimension in Amazon Redshift

Post Syndicated from Milind Oke original https://aws.amazon.com/blogs/big-data/implement-a-slowly-changing-dimension-in-amazon-redshift/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. A star schema is a database organization structure optimized for use in a data warehouse. In a star schema, a dimension is a structure that categorizes the facts and measures in order to enable you to answer business questions. The attributes (or columns) of the dimension table provide the business meaning to the measures of the fact table. Rows in a dimension table are identified using a unique identifier like a customer identification key, and the fact table’s rows have a referential key pointing to the dimension table’s primary key. Dimension and fact tables are joined using the dimension table’s primary key and the fact table’s foreign key.

Over time, the attributes of a given row in a dimension table may change. For example, the shipping address for a customer may change. This phenomenon is called a slowly changing dimension (SCD). For historical reporting purposes, it may be necessary to keep a record of the fact that the customer has a change in address. The options for dealing with this are SCD management methodologies referred to as type 0 to type 7. Type 0 is when no changes are allowed to the dimension, for example a date dimension that doesn't change. The most common types are 1, 2, and 3:

  • Type 1 (No history) – The dimension table reflects the latest version; no history is maintained
  • Type 2 (Maintain history) – All changes are recorded and versions are tracked with dates and flags
  • Type 3 (Previous value) – The [latest – 1] value for specific columns is maintained as a separate attribute

Prerequisites

For this walkthrough, you should have the following prerequisites:

Overview of solution

This post walks you through the process of implementing SCDs on an Amazon Redshift cluster. We go through the best practices and anti-patterns. To demonstrate this, we use the customer table from the TPC-DS benchmark dataset. We show how to create a type 2 dimension table by adding slowly changing tracking columns, and we go over the extract, transform, and load (ETL) merge technique, demonstrating the SCD process.

The following figure is the process flow diagram.

The following diagram shows how a regular dimensional table is converted to a type 2 dimension table.

Implement slowly changing dimensions

To get started, we use one of two AWS CloudFormation templates from Amazon Redshift Labs:

In this post, we only show the important SQL statements; the complete SQL code is available in scd2_sample_customer_dim.sql.

The first step to implement SCD for a given dimension table is to create the dimension table with SCD tracking attributes. For example, record effective date, record end date, and active record indicator are typically added to track whether a record is active. These fields are collectively referred to as the SCD fields (shown in the following code) for the remainder of this post.

These SCD fields are added so that when a field is changed, for example, a customer’s address, the existing record in the dimension table is updated to indicate that the record isn’t active and a new record is inserted with an active flag. This way, every change to an SCD field is stored in the table and business users can run queries to see historical performance of a dimension for a given change that is being tracked.

We also introduce the following:

  • Record hash value to easily track whether the customer data fields have changed. In this post, the hash column is computed over the tracked customer fields (address and phone); this single hash column is compared instead of comparing multiple individual columns to determine if the data has changed.
  • Record insert and update timestamps to capture when the actual dimension row was added to the table and updated.

The following code shows the SCD fields added to the dimension table:

drop table if exists customer_dim cascade;
create table customer_dim ( 
customer_dim_id     bigint generated by default as identity(1, 1), 
c_custkey           bigint distkey, 
c_name              character varying(30), 
c_address           character varying(50), 
c_nationkey         integer, 
c_phone             character varying(20), 
c_acctbal           numeric(12, 2), 
c_mktsegment        character varying(10), 
c_comment           character varying(120), 
track_hash          bigint, 
record_start_ts     timestamp without time zone 
                    default '1970-01-01 00:00:00'::timestamp without time zone, 
record_end_ts       timestamp without time zone 
                    default '2999-12-31 00:00:00'::timestamp without time zone, 
record_active_flag  smallint default 1, 
record_upd_ts       timestamp without time zone default current_timestamp, 
record_insert_ts    timestamp without time zone default current_timestamp 
)
diststyle key 
sortkey (c_custkey);

Next, we perform the initial load to the dimension table. Because this is the first time that the dimension records are loaded, the SCD tracking attributes are set to active. For example, record start date is set to a low date, like 1900-01-01, or to a business date value to reflect when a particular change became effective. The record end date is set to a high date, like 2999-12-31, and the active record indicator is set to 1, indicating these rows are active.
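The following is a minimal sketch of such an initial load, assuming the full customer extract is available in the src_customer table that is also used later in this post; the identity column and the record insert/update timestamps are populated by their column defaults:

-- initial load: every source row becomes an active dimension record
insert into customer_dim
    (c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal,
     c_mktsegment, c_comment, track_hash, record_start_ts, record_end_ts,
     record_active_flag)
select
    custkey, name, address, nationkey, phone, acctbal,
    mktsegment, comment,
    FNV_HASH(address, FNV_HASH(phone)) as track_hash,
    '1900-01-01'::timestamp as record_start_ts,
    '2999-12-31'::timestamp as record_end_ts,
    1 as record_active_flag
from
    src_customer;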

After the initial load is complete, we create a staging table to load the incremental changes that come from the source system. This table acts as a temporary holding place for incoming records. To identify whether a change has occurred for a given record, we left outer join the customer staging table to the customer dimension table on the customer primary key (c_custkey). We use a left outer join because we want to flag matching records for the update process and unmatched records for the insert process. Left outer joining the staging table to the customer table projects both matched and unmatched rows: matched rows are treated as updates and unmatched rows are treated as inserts.

In our data warehouse system, let’s assume we have to meet the following criteria:

  • Track changes on the address and phone fields only—type 2 with start and end timestamps
  • Other attributes are required to be kept up to date without creating history records—type 1
  • The source system provides incremental delta change records

If your source system can't provide delta change records and instead provides a full load every time, then the data warehouse needs logic to identify the changed records. For such a workload, we build a uniquely identifying value by applying a built-in Amazon Redshift hash function to all the dimension columns and use it to identify the changed rows.

The customer address and phone are being tracked as slowly changing dimensions. We use FNV_HASH to generate a 64-bit signed integer that accommodates 18.4 quintillion unique values. For smaller dimension tables, we can also use CHECKSUM to generate a 32-bit signed integer that accommodates 4.4 billion unique values.
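As a quick illustration of both options (these queries are examples only, using the src_customer columns from this post), the hash can be computed as follows:

-- 64-bit hash over the tracked columns (address and phone), as used in this post
select custkey, FNV_HASH(address, FNV_HASH(phone)) as track_hash
from src_customer
limit 5;

-- 32-bit alternative for smaller dimensions, hashing a concatenation of the tracked columns
select custkey, CHECKSUM(address || '|' || phone) as track_checksum
from src_customer
limit 5;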

We determine whether a dimension row is a new record by using new_ind, and whether it has changed by comparing the record hash, using track_ind as the change indicator.

Changes are identified by joining the staging table and target table on the primary key. See the following code:

truncate table stg_customer;
insert into stg_customer 
with stg as (
    select
        custkey as stg_custkey, name as stg_name, 
        address as stg_address, nationkey as stg_nationkey, 
        phone as stg_phone, acctbal as stg_acctbal,
        mktsegment as stg_mktsegment, comment as stg_comment, 
        effective_dt as stg_effective_dt,
        FNV_HASH(address,FNV_HASH(phone)) as stg_track_hash
    from
        src_customer
    )
select 
    s.* , 
    case when c.c_custkey is null then 1 else 0 end new_ind,
    case when c.c_custkey is not null 
          and s.stg_track_hash <> track_hash then 1 else 0 end track_ind
 from
    stg s
left join customer_dim c
    on s.stg_custkey = c.c_custkey
;

For rows that aren't matched (completely new records, where new_ind = 1), the rows are inserted into the dimension table with the SCD tracking attributes set for a new record and the active record flag set to 1.

For matched records, two possibilities could happen:

  • SCD type 2 field has changed – For this category, we use a two-step process to retain the previous version of the customer record and also record the latest version of the customer record for type 2 fields in our data warehouse. This satisfies our first business requirement. The steps are as follows:
    • Step 1 – Update the existing record in the target customer dimension table as inactive by setting the record end date to the current timestamp and active record indicator to 0.
    • Step 2 – Insert the new rows from the customer staging table into the customer target table with the record start date set to the current timestamp, record end date set to a high date, and the record active flag set to 1.
  • SCD type 1 field has changed – For this category, the row in the customer target table is updated directly with the latest rows from staging table. While doing so, we don’t update any SCD tracking date fields or flags. With this step, we retain only the latest version of the record for type 1 fields in our data warehouse. This satisfies our second business requirement.

Apply changes to the dimension table with the following code:

-- merge changes to dim customer
begin transaction;

-- close current type 2 active record based on staging data where change indicator is 1
update customer_dim
set record_end_ts = stg_effective_dt - interval '1 second',
    record_active_flag = 0,
    record_upd_ts = current_timestamp 
from stg_customer
where c_custkey = stg_custkey
and record_end_ts = '2999-12-31'
and track_ind = 1;

-- create latest version type 2 active record from staging data
-- this includes Changed + New records
insert into customer_dim
   (c_custkey,c_name,c_address,c_nationkey,c_phone,c_acctbal,
    c_mktsegment,c_comment,track_hash,record_start_ts,record_end_ts, 
    record_active_flag, record_insert_ts, record_upd_ts) 
select
    stg_custkey, stg_name, stg_address, stg_nationkey, stg_phone,
    stg_acctbal, stg_mktsegment, stg_comment, stg_track_hash, 
    stg_effective_dt as record_start_ts, '2999-12-31' as record_end_ts,
    1 as record_active_flag, current_timestamp as record_insert_ts, 
    current_timestamp as record_upd_ts
from
    stg_customer
where
    track_ind = 1 or new_ind = 1;

-- update type 1 current active records for non-tracking attributes
update customer_dim
set c_name = stg_name,
    c_nationkey = stg_nationkey,
    c_acctbal = stg_acctbal,
    c_mktsegment = stg_mktsegment,
    c_comment = stg_comment,
    record_upd_ts = current_timestamp
from
    stg_customer
where
    c_custkey = stg_custkey
and record_end_ts = '2999-12-31'
and track_ind = 0 and new_ind = 0;

-- end merge operation
commit transaction;

Best practices

The Amazon Redshift cloud data warehouse can process a large number of updates efficiently. To achieve this, have a staging table that shares the same table definition as your target dimension table. Then, as shown in the earlier code snippet, you can join the staging and the target dimension tables and perform the update and insert in a transaction block. This operation performs bulk updates and inserts on the target table, yielding good performance.

The Amazon Redshift shared-nothing architecture typically performs at its peak when operations can be run by each node independently with minimal data movement between nodes. Creating the target customer dimension table and the intermediate staging table with matching distribution keys provides the best performance, because all operations can be completed within each node.

Anti-patterns

You could also implement this process by comparing dimension records in a row-by-row fashion using cursors and then updating or inserting a particular row in the target table. Although this method works on smaller tables, for larger tables we recommend the bulk operations method explained in this post.

Clean up

To avoid incurring future charges, you can delete all the resources created by the CloudFormation template by deleting the CloudFormation stack.

Conclusion

In this post, you learned about slowly changing dimensions, implementing SCDs on Amazon Redshift, best practices for running the ETL operations against the target table by using intermediate staging tables, and finally anti-patterns to avoid.

Refer to Amazon Redshift data loading best practices for further materials and additional best practices, and see Updating and inserting new data for instructions to implement updates and inserts.


About the Authors

Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift. He is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.

Bhanu Pittampally is an Analytics Specialist Solutions Architect based out of Dallas. He specializes in building analytical solutions. His background is in data warehouses—architecture, development, and administration. He has been in the data and analytical field for over 13 years.