All posts by Jason Pedreza

Simplify data ingestion from Amazon S3 to Amazon Redshift using auto-copy

Post Syndicated from Jason Pedreza original https://aws.amazon.com/blogs/big-data/simplify-data-ingestion-from-amazon-s3-to-amazon-redshift-using-auto-copy/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze your data using standard SQL and your existing business intelligence (BI) tools. Tens of thousands of customers today rely on Amazon Redshift to analyze exabytes of data and run complex analytical queries, making it the most widely used cloud data warehouse.

Data ingestion is the process of getting data to Amazon Redshift. You can leverage one of the many zero-ETL integration methods to make data available in Amazon Redshift directly. However, if your data is in your Amazon S3 bucket, then you can simply load data from Amazon Simple Storage Service (Amazon S3) to Amazon Redshift using the COPY command. A COPY command is the most efficient way to load a table from S3 because it uses the Amazon Redshift’s massively parallel processing (MPP) architecture to read and load data in parallel.

Amazon Redshift launched auto-copy support to simplify data loading from Amazon S3 into Amazon Redshift. You can now setup continuous file ingestion rules to track your Amazon S3 paths and automatically load new files without the need for additional tools or custom solutions. This also enables end users to have the latest data available in Amazon Redshift shortly after the source data is available.

This post shows you how to build automatic file ingestion pipelines in Amazon Redshift when source files are located on Amazon S3 by using a simple SQL command. In addition, we show you how to enable auto-copy using auto-copy jobs, how to monitor jobs, considerations, and best practices.

Overview of the auto-copy feature in Amazon Redshift

The auto-copy feature in Amazon Redshift leverages the S3 event integration to automatically load data into Amazon Redshift and simplifies automatic data loading from Amazon S3 with a simple SQL command. You can enable Amazon Redshift auto-copy by creating auto-copy jobs. A auto-copy job is a database object that stores, automates, and reuses the COPY statement for newly created files that land in the S3 folder.

The following diagram illustrates this process.

S3 event integration and auto-copy jobs have the following benefits:

  • Users can now load data from Amazon S3 automatically without having to build a pipeline or using an external framework
  • auto-copy jobs offer automatic and incremental data ingestion from an Amazon S3 location without the need to implement a custom solution
  • This functionality comes at no additional cost
  • Existing COPY statements can be converted into auto-copy jobs by appending the JOB CREATE <job_name> parameter
  • It keeps track of loaded files and minimizes data duplication.
  • It can be quickly set up using a simple SQL statement using your choice of JDBC/ODBC clients.
  • It has automatic error handling of bad quality data files.
  • It has a mechanism to load-once for each file. This means that there is no need to generate explicit manifest files.

Prerequisites

To get started with auto-copy, you need the following prerequisites:

  • An AWS account
  • An encrypted Amazon Redshift provisioned cluster or Amazon redshift serverless workgroup
  • An Amazon S3 bucket
  • Add following to the Amazon S3 bucket policy
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Auto-Copy-Policy-01",
                "Effect": "Allow",
                "Principal": {
                    "Service":"redshift.amazonaws.com"
                        
                    
                },
                "Action": [
                    "s3:GetBucketNotification",
                    "s3:PutBucketNotification",
                    "s3:GetBucketLocation"
                ],
                "Resource": "arn:aws:s3:::<<your-s3-bucket-name>>",
                "Condition": {
                    "StringEquals": {
                        "aws:SourceArn": "arn:aws:redshift:<region-name>:<aws-account-id>:integration:*",
                        "aws:SourceAccount": "<aws-account-id>"
                    }
                }
            }
        ]
    }

Set up Amazon S3 event Integration

An Amazon S3 event integration facilitates seamless and automated data ingestion from S3 buckets into an Amazon Redshift data warehouse, streamlining the process of transferring and storing data for analytical purposes

  1. Sign in to the AWS Management Console and Navigate to Amazon Redshift home page. Under Integrations section choose S3 event integrations

  2. Choose Create S3 event integration

  3. Enter Integration name and Description, choose Next

  4. Choose Browse S3 buckets, a dialog box pops up, select the Amazon S3 bucket and choose Continue

  5. Amazon S3 bucket is selected. Choose Next
  6. Choose Browse Redshift data warehouse

  7. Choose the Amazon Redshift data warehouse and choose Continue

  8. Then Amazon Redshift resource policy needs access to S3 event integration. In case of Resource policy error, check Fix it for me and choose Next

  9. Add Tags as required and choose Next

  10. Review changes and choose Create S3 event integration
  11. An S3 event integration is created. Wait until the status of S3 event integration is Active

Set up auto-copy jobs

In this section, we demonstrate how to automate data loading of files from Amazon S3 into Amazon Redshift. With the existing COPY syntax, we add the JOB CREATE parameter to perform a one-time setup for automatic file ingestion. See the following code:

COPY <table-name>
FROM 's3://<s3-object-path>'
[COPY PARAMETERS...]
JOB CREATE <job-name> [AUTO ON | OFF];

Auto ingestion is enabled by default on auto-copy jobs. Files already present at the S3 location will not be visible to the auto-copy job. Only files added after JOB creation are tracked by Amazon Redshift.

Automate ingestion from a single data source

With a auto-copy job, you can automate ingestion from a single data source by creating one job and specifying the path to the S3 objects that contain the data. The S3 object path can reference a set of folders that have the same key prefix.

In this example, we have multiple files that are being loaded on a daily basis containing the sales transactions across all the stores in the US. For this we can create a store_sales folder in the bucket.

The following code creates the store_sales table:

DROP TABLE IF EXISTS public.store_sales;
CREATE TABLE IF NOT EXISTS public.store_sales
(
  ss_sold_date_sk int4,            
  ss_sold_time_sk int4,     
  ss_item_sk int4 not null,      
  ss_customer_sk int4,           
  ss_cdemo_sk int4,              
  ss_hdemo_sk int4,         
  ss_addr_sk int4,               
  ss_store_sk int4,           
  ss_promo_sk int4,           
  ss_ticket_number int8 not null,        
  ss_quantity int4,           
  ss_wholesale_cost numeric(7,2),          
  ss_list_price numeric(7,2),              
  ss_sales_price numeric(7,2),
  ss_ext_discount_amt numeric(7,2),             
  ss_ext_sales_price numeric(7,2),              
  ss_ext_wholesale_cost numeric(7,2),           
  ss_ext_list_price numeric(7,2),               
  ss_ext_tax numeric(7,2),                 
  ss_coupon_amt numeric(7,2), 
  ss_net_paid numeric(7,2),   
  ss_net_paid_inc_tax numeric(7,2),             
  ss_net_profit numeric(7,2),
  primary key (ss_item_sk, ss_ticket_number)
 ) 
DISTKEY (ss_item_sk) 
SORTKEY (ss_sold_date_sk);

Next, we create the auto-copy job to automatically load the gzip-compressed files into the store_sales table:

COPY store_sales
FROM 's3://aws-redshift-s3-auto-copy-demo/store_sales'
IAM_ROLE 'arn:aws:iam::**********:role/Redshift-S3'
gzip delimiter '|' EMPTYASNULL
region 'us-east-1'
JOB CREATE job_store_sales AUTO ON;

Each day’s sales transactions are loaded to their own folder in Amazon S3.

Now upload the files for transaction sold on 2002-12-31. Each folder contains multiple gzip-compressed files.

Since the auto-copy job is already created, it automatically loads the gzip-compressed files located in the S3 object path specified in the COPY command to the store_sales table.

Let’s run a query to get the daily total of sales transactions across all the stores in the US:

SELECT ss_sold_date_sk, count(1)
  FROM store_sales
GROUP BY ss_sold_date_sk;

The output shown comes from the transactions sold on 2002-12-31.

The following day, incremental sales transactions data are loaded to a new folder in the same S3 object path.

As new files arrive to the same S3 object path, the auto-copy job automatically loads the unprocessed files to the store_sales table in an incremental fashion.

All new sales transactions for 2003-01-01 are automatically ingested, which can be verified by running the following query:

SELECT ss_sold_date_sk, count(1)
  FROM store_sales
GROUP BY ss_sold_date_sk;

Automate ingestion from multiple data sources

We can also load an Amazon Redshift table from multiple data sources. When using a pub/sub pattern where multiple S3 buckets populate data to an Amazon Redshift table, you have to maintain multiple data pipelines for each source/target combination. With new parameters in the COPY command, this can be automated to handle data loads efficiently.

In the following example, the Customer_1 folder has Green Cab Company sales data, and the Customer_2 folder has Red Cab Company sales data. We can use the COPY command with the JOB parameter to automate this ingestion process.

The following screenshot shows sample data stored in files. Each folder has similar data but for different customers.

The target for these files in this example is the Amazon Redshift table cab_sales_data.

Define the target table cab_sales_data:

DROP TABLE IF EXISTS cab_sales_data;
CREATE TABLE IF NOT EXISTS cab_sales_data
(
  vendorid                VARCHAR(4),
  pickup_datetime         TIMESTAMP,
  dropoff_datetime        TIMESTAMP,
  store_and_fwd_flag      VARCHAR(1),
  ratecode                INT,
  pickup_longitude        FLOAT4,
  pickup_latitude         FLOAT4,
  dropoff_longitude       FLOAT4,
  dropoff_latitude        FLOAT4,
  passenger_count         INT,
  trip_distance           FLOAT4,
  fare_amount             FLOAT4,
  extra                   FLOAT4,
  mta_tax                 FLOAT4,
  tip_amount              FLOAT4,
  tolls_amount            FLOAT4,
  ehail_fee               FLOAT4,
  improvement_surcharge   FLOAT4,
  total_amount            FLOAT4,
  payment_type            VARCHAR(4),
  trip_type               VARCHAR(4)
)
DISTSTYLE EVEN
SORTKEY (passenger_count,pickup_datetime);

You can define two auto-copy jobs as shown in the following code to handle and monitor the ingestion of sales data belonging to different customers, in our case Customer_1 and Customer_2. These jobs monitor the Customer_1 and Customer_2 folders and load new files that are added here.

COPY public.cab_sales_data
FROM 's3://aws-redshift-s3-auto-copy-demo/Customer_1'
IAM_ROLE 'arn:aws:iam::**********:role/Redshift-S3'
DATEFORMAT 'auto'
IGNOREHEADER 1
DELIMITER ','
IGNOREBLANKLINES
REGION 'us-east-1'
JOB CREATE job_green_cab AUTO ON;

COPY public.cab_sales_data
FROM 's3:// aws-redshift-s3-auto-copy-demo/Customer_2'
IAM_ROLE 'arn:aws:iam::**********:role/Redshift-S3'
DATEFORMAT 'auto'
IGNOREHEADER 1
DELIMITER ','
IGNOREBLANKLINES
REGION 'us-east-1'
JOB CREATE job_red_cab AUTO ON;

After setting up the two jobs, we can upload the relevant files into their respective folders. This will make sure that the data is loaded efficiently as soon as the files arrive. Each customer is assigned its own vendorid, as shown in the following output:

SELECT vendorid,
       sum(passenger_count) as total_passengers 
  FROM cab_sales_data
GROUP BY vendorid;

result 1

Manually run a auto-copy job

There might be scenarios wherein the auto-copy job needs to be paused, meaning it needs to stop looking for new files, for example, to fix a corrupted data pipeline at the data source.

In that case, either use the auto-copy job ALTER command to set AUTO to OFF or create a new auto-copy job with AUTO OFF. Once this is set, auto copy will no longer look for new files.

If necessary, users can manually invoke auto-copy job which will do the work and ingest if new files are found.

auto-copy job RUN <auto-copy job Name>

You can disable “AUTO ON” in the existing auto-copy job using the following command:

auto-copy job ALTER <auto-copy job Name> AUTO OFF

The following table compares the syntax and data duplication between a regular copy statement and the new auto-copy job

. Copy Auto-copy job
Syntax COPY <table-name>
FROM 's3://<s3-object-path>'
[COPY PARAMETERS...]
COPY <table-name>
FROM 's3://<s3-object-path>'
[COPY PARAMETERS...]
JOB CREATE <job-name>;
Data Duplication If it is run multiple times against the same S3 folder, it will load the data again, resulting in data duplication. It will not load the same file twice, preventing data duplication.

Error handling and monitoring for auto-copy jobs

auto-copy jobs continuously monitor the S3 folder specified during job creation and perform ingestion whenever new files are created. New files created under the S3 folder are loaded exactly once to avoid data duplication.

By default, if there are data or format issues with the specific files, the auto-copy job will fail to ingest the files with a load error and log details to the system tables. The auto-copy job will remain AUTO ON with new data files and will continue to ignore previously failed files.

Amazon Redshift provides the following system tables for users to monitor or troubleshoot auto-copy jobs as needed:

  • List auto-copy jobs – Use SYS_COPY_JOB to list the auto-copy jobs stored in the database:
SELECT * 
  FROM sys_copy_job;

  • Get a summary of a auto-copy job – Use the SYS_LOAD_HISTORY view to get the aggregate metrics of a auto-copy job operation by specifying the copy_job_id. It shows the aggregate metrics of the files that have been processed by a auto-copy job.
SELECT *
  FROM sys_load_history
 WHERE copy_job_id = 274978;

  • Get details of a auto-copy job – Use STL_LOAD_COMMITS to get the status and details of each file that was processed by a auto-copy job:
SELECT *
  FROM stl_load_commits
 WHERE copy_job_id = 274978
ORDER BY curtime ASC;

  • Get exception details of a auto-copy job – Use STL_LOAD_ERRORS to get the details of files that failed to ingest from a auto-copy job:
SELECT   query,
    slice,
    starttime , 
    filename,
    line_number,
    colname,
    type,
    err_code,
    err_reason,
    copy_job_id,
    raw_line,
    raw_field_value
  FROM stl_load_errors
 WHERE copy_job_id = 274978;

Auto-copy job best practices

In an auto-copy job, when a new file is detected and ingested (automatically or manually), Amazon Redshift stores the file name and doesn’t run this specific job when a new file is created with the same file name.

The following are the recommended best practices when working with files using the auto-copy job:

  • Use unique file names for each file in a auto-copy job (for example, 2022-10-15-batch-1.csv). However, you can use the same file name as long as it’s from different auto-copy jobs:
    • job_customerA_saless3://redshift-blogs/sales/customerA/2022-10-15-sales.csv
    • job_customerB_saless3://redshift-blogs/sales/customerB/2022-10-15-sales.csv
  • Do not update file contents. Do not overwrite existing files. Changes in existing files will not be reflected to the target table. The auto-copy job doesn’t pick up updated or overwritten files, so make sure they’re renamed as new file names for the auto-copy job to pick up.
  • Run regular COPY statements (not a job) if you need to ingest a file that was already processed by your auto-copy job. (COPY statement without a JOB CREATE syntax doesn’t track loaded files.) For example, this is helpful in scenarios where you don’t have control of the file name and the initial file received failed. The following figure shows a typical workflow in this case.

  • Delete and recreate your auto-copy job if you want to reset file tracking history and start over. You can drop auto-copy job using following command.
    auto-copy job DROP <auto-copy job Name>

auto-copy job considerations

Here are the main things to consider when using auto-copy:

For additional details on other considerations for auto-copy, refer to the AWS documentation.

Customer feedback

GE Aerospace is a global provider of jet engines, components, and systems for commercial and military aircraft. The company has been designing, developing, and manufacturing jet engines since World War I.

“GE Aerospace uses AWS analytics and Amazon Redshift to enable critical business insights that drive important business decisions. With the support for auto-copy from Amazon S3, we can build simpler data pipelines to move data from Amazon S3 to Amazon Redshift. This accelerates our data product teams’ ability to access data and deliver insights to end users. We spend more time adding value through data and less time on integrations.”

Alcuin Weidus Sr Principal Data Architect at GE Aerospace

Conclusion

This post demonstrated how to automate data ingestion from Amazon S3 to Amazon Redshift using the auto-copy feature. This new functionality helps make Amazon Redshift data ingestion easier than ever, and will allow SQL users to get access to the most recent data using a simple SQL command.

Users can begin ingesting data to Redshift from Amazon S3 with simple SQL commands and gain access to the most up-to-date data without the need for third-party tools or custom implementation.


About the authors

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 15+ years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Omama Khurshid is an Acceleration Lab Solutions Architect at Amazon Web Services. She focuses on helping customers across various industries build reliable, scalable, and efficient solutions. Outside of work, she enjoys spending time with her family, watching movies, listening to music, and learning new technologies.

Raza Hafeez is a Senior Product Manager at Amazon Redshift. He has over 13 years of professional experience building and optimizing enterprise data warehouses and is passionate about enabling customers to realize the power of their data. He specializes in migrating enterprise data warehouses to AWS Modern Data Architecture.

Jason Pedreza is an Analytics Specialist Solutions Architect at AWS with data warehousing experience handling petabytes of data. Prior to AWS, he built data warehouse solutions at Amazon.com. He specializes in Amazon Redshift and helps customers build scalable analytic solutions.

Nita Shah is an Analytics Specialist Solutions Architect at AWS based out of New York. She has been building data warehouse solutions for over 20 years and specializes in Amazon Redshift. She is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.

Eren Baydemir, a Technical Product Manager at AWS, has 15 years of experience in building customer-facing products and is currently focusing on data lake and file ingestion topics in the Amazon Redshift team. He was the CEO and co-founder of DataRow, which was acquired by Amazon in 2020.

Eesha Kumar is an Analytics Solutions Architect with AWS. He works with customers to realize the business value of data by helping them build solutions using the AWS platform and tools.

Satish Sathiya is a Senior Product Engineer at Amazon Redshift. He is an avid big data enthusiast who collaborates with customers around the globe to achieve success and meet their data warehousing and data lake architecture needs.

 Hangjian Yuan is a Software Development Engineer at Amazon Redshift. He’s passionate about analytical databases and focuses on delivering cutting-edge streaming experiences for customers.

ETL orchestration using the Amazon Redshift Data API and AWS Step Functions with AWS SDK integration

Post Syndicated from Jason Pedreza original https://aws.amazon.com/blogs/big-data/etl-orchestration-using-the-amazon-redshift-data-api-and-aws-step-functions-with-aws-sdk-integration/

Extract, transform, and load (ETL) serverless orchestration architecture applications are becoming popular with many customers. These applications offers greater extensibility and simplicity, making it easier to maintain and simplify ETL pipelines. A primary benefit of this architecture is that we simplify an existing ETL pipeline with AWS Step Functions and directly call the Amazon Redshift Data API from the state machine. As a result, the complexity for the ETL pipeline is reduced.

As a data engineer or an application developer, you may want to interact with Amazon Redshift to load or query data with a simple API endpoint without having to manage persistent connections. The Amazon Redshift Data API allows you to interact with Amazon Redshift without having to configure JDBC or ODBC connections. This feature allows you to orchestrate serverless data processing workflows, design event-driven web applications, and run an ETL pipeline asynchronously to ingest and process data in Amazon Redshift, with the use of Step Functions to orchestrate the entire ETL or ELT workflow.

This post explains how to use Step Functions and the Amazon Redshift Data API to orchestrate the different steps in your ETL or ELT workflow and process data into an Amazon Redshift data warehouse.

AWS Lambda is typically used with Step Functions due to its flexible and scalable compute benefits. An ETL workflow has multiple steps, and the complexity may vary within each step. However, there is an alternative approach with AWS SDK service integrations, a feature of Step Functions. These integrations allow you to call over 200 AWS services’ API actions directly from your state machine. This approach is optimal for steps with relatively low complexity compared to using Lambda because you no longer have to maintain and test function code. Lambda functions have a maximum timeout of 15 minutes; if you need to wait for longer-running processes, Step Functions standard workflows allows a maximum runtime of 1 year.

You can replace steps that include a single process with a direct integration between Step Functions and AWS SDK service integrations without using Lambda. For example, if a step is only used to call a Lambda function that runs a SQL statement in Amazon Redshift, you may remove the Lambda function with a direct integration to the Amazon Redshift Data API’s SDK API action. You can also decouple Lambda functions with multiple actions into multiple steps. An implementation of this is available later in this post.

We created an example use case in the GitHub repo ETL Orchestration using Amazon Redshift Data API and AWS Step Functions that provides an AWS CloudFormation template for setup, SQL scripts, and a state machine definition. The state machine directly reads SQL scripts stored in your Amazon Simple Storage Service (Amazon S3) bucket, runs them in your Amazon Redshift cluster, and performs an ETL workflow. We don’t use Lambda in this use case.

Solution overview

In this scenario, we simplify an existing ETL pipeline that uses Lambda to call the Data API. AWS SDK service integrations with Step Functions allow you to directly call the Data API from the state machine, reducing the complexity in running the ETL pipeline.

The entire workflow performs the following steps:

  1. Set up the required database objects and generate a set of sample data to be processed.
  2. Run two dimension jobs that perform SCD1 and SCD2 dimension load, respectively.
  3. When both jobs have run successfully, the load job for the fact table runs.
  4. The state machine performs a validation to ensure the sales data was loaded successfully.

The following architecture diagram highlights the end-to-end solution:

We run the state machine via the Step Functions console, but you can run this solution in several ways:

You can deploy the solution with the provided CloudFormation template, which creates the following resources:

  • Database objects in the Amazon Redshift cluster:
    • Four stored procedures:
      • sp_setup_sales_data_pipeline() – Creates the tables and populates them with sample data
      • sp_load_dim_customer_address() – Runs the SCD1 process on customer_address records
      • sp_load_dim_item() – Runs the SCD2 process on item records
      • sp_load_fact_sales (p_run_date date) – Processes sales from all stores for a given day
    • Five Amazon Redshift tables:
      • customer
      • customer_address
      • date_dim
      • item
      • store_sales
  • The AWS Identity and Access Management (IAM) role StateMachineExecutionRole for Step Functions to allow the following permissions:
    • Federate to the Amazon Redshift cluster through getClusterCredentials permission avoiding password credentials
    • Run queries in the Amazon Redshift cluster through Data API calls
    • List and retrieve objects from Amazon S3
  • The Step Functions state machine RedshiftETLStepFunction, which contains the steps used to run the ETL workflow of the sample sales data pipeline

Prerequisites

As a prerequisite for deploying the solution, you need to set up an Amazon Redshift cluster and associate it with an IAM role. For more information, see Authorizing Amazon Redshift to access other AWS services on your behalf. If you don’t have a cluster provisioned in your AWS account, refer to Getting started with Amazon Redshift for instructions to set it up.

When the Amazon Redshift cluster is available, perform the following steps:

  1. Download and save the CloudFormation template to a local folder on your computer.
  2. Download and save the following SQL scripts to a local folder on your computer:
    1. sp_statements.sql – Contains the stored procedures including DDL and DML operations.
    2. validate_sql_statement.sql – Contains two validation queries you can run.
  3. Upload the SQL scripts to your S3 bucket. The bucket name is the designated S3 bucket specified in the ETLScriptS3Path input parameter.
  4. On the AWS CloudFormation console, choose Create stack with new resources and upload the template file you downloaded in the previous step (etl-orchestration-with-stepfunctions-and-redshift-data-api.yaml).
  5. Enter the required parameters and choose Next.
  6. Choose Next until you get to the Review page and select the acknowledgement check box.
  7. Choose Create stack.
  8. Wait until the stack deploys successfully.

When the stack is complete, you can view the outputs, as shown in the following screenshot:

Run the ETL orchestration

After you deploy the CloudFormation template, navigate to the stack detail page. On the Resources tab, choose the link for RedshiftETLStepFunction to be redirected to the Step Functions console.

The RedshiftETLStepFunction state machine runs automatically, as outlined in the following workflow:

  1. read_sp_statement and run_sp_deploy_redshift – Performs the following actions:
    1. Retrieves the sp_statements.sql from Amazon S3 to get the stored procedure.
    2. Passes the stored procedure to the batch-execute-statement API to run in the Amazon Redshift cluster.
    3. Sends back the identifier of the SQL statement to the state machine.
  2. wait_on_sp_deploy_redshift – Waits for at least 5 seconds.
  3. run_sp_deploy_redshift_status_check – Invokes the Data API’s describeStatement to get the status of the API call.
  4. is_run_sp_deploy_complete – Routes the next step of the ETL workflow depending on its status:
    1. FINISHED – Stored procedures are created in your Amazon Redshift cluster.
    2. FAILED – Go to the sales_data_pipeline_failure step and fail the ETL workflow.
    3. All other status – Go back to the wait_on_sp_deploy_redshift step to wait for the SQL statements to finish.
  5. setup_sales_data_pipeline – Performs the following steps:
    1. Initiates the setup stored procedure that was previously created in the Amazon Redshift cluster.
    2. Sends back the identifier of the SQL statement to the state machine.
  6. wait_on_setup_sales_data_pipeline – Waits for at least 5 seconds.
  7. setup_sales_data_pipeline_status_check – Invokes the Data API’s describeStatement to get the status of the API call.
  8. is_setup_sales_data_pipeline_complete – Routes the next step of the ETL workflow depending on its status:
    1. FINISHED – Created two dimension tables (customer_address and item) and one fact table (sales).
    2. FAILED – Go to the sales_data_pipeline_failure step and fail the ETL workflow.
    3. All other status – Go back to the wait_on_setup_sales_data_pipeline step to wait for the SQL statements to finish.
  9. run_sales_data_pipeline LoadItemTable and LoadCustomerAddressTable are two parallel workflows that Step Functions runs at the same time. The workflows run the stored procedures that were previously created. The stored procedure loads the data into the item and customer_address tables. All other steps in the parallel sessions follow the same concept as described previously. When both parallel workflows are complete, run_load_fact_sales runs.
  10. run_load_fact_sales – Inserts data into the store_sales table that was created in the initial stored procedure.
  11. Validation – When all the ETL steps are complete, the state machine reads a second SQL file from Amazon S3 (validate_sql_statement.sql) and runs the two SQL statements using the batch_execute_statement method.

The implementation of the ETL workflow is idempotent. If it fails, you can retry the job without any cleanup. For example, it recreates the stg_store_sales table each time, then deletes the target table store_sales with the data for the particular refresh date each time.

The following diagram illustrates the state machine workflow:

In this example, we use the task state resource arn:aws:states:::aws-sdk:redshiftdata:[apiAction] to call the corresponding Data API action. The following table summarizes the Data API actions and their corresponding AWS SDK integration API actions.

Amazon Redshift Data API Actions AWS SDK Integrations API Actions
BatchExecuteStatement batchExecuteStatement
ExecuteStatement executeStatement
DescribeStatement describeStatement
CancelStatement cancelStatement
GetStatementResult getStatementResult
DescribeTable describeTable
ListDatabases listDatabases
ListSchemas listSchemas
ListStatements listStatements
ListTables listTables

To use AWS SDK integrations, you specify the service name and API call, and, optionally, a service integration pattern. The AWS SDK action is always camel case, and parameter names are Pascal case. For example, you can use the Step Functions action batchExecuteStatement to run multiple SQL statements in a batch as a part of a single transaction on the Data API. The SQL statements can be SELECT, DML, DDL, COPY, and UNLOAD.

Validate the ETL orchestration

The entire ETL workflow takes approximately 1 minute to run. The following screenshot shows that the ETL workflow completed successfully.

When the entire sales data pipeline is complete, you may go through the entire execution event history, as shown in the following screenshot.

Schedule the ETL orchestration

After you validate the sales data pipeline, you may opt to run the data pipeline on a daily schedule. You can accomplish this with Amazon EventBridge.

  1. On the EventBridge console, create a rule to run the RedshiftETLStepFunction state machine daily.
  2. To invoke the RedshiftETLStepFunction state machine on a schedule, choose Schedule and define the appropriate frequency needed to run the sales data pipeline.
  3. Specify the target state machine as RedshiftETLStepFunction and choose Create.

You can confirm the schedule on the rule details page.

Clean up

Clean up the resources created by the CloudFormation template to avoid unnecessary cost to your AWS account. You can delete the CloudFormation stack by selecting the stack on the AWS CloudFormation console and choosing Delete. This action deletes all the resources it provisioned. If you manually updated a template-provisioned resource, you may see some issues during cleanup; you need to clean these up independently.

Limitations

The Data API and Step Functions AWS SDK integration offers a robust mechanism to build highly distributed ETL applications within minimal developer overhead. Consider the following limitations when using the Data API and Step Functions:

Conclusion

In this post, we demonstrated how to build an ETL orchestration using the Amazon Redshift Data API and Step Functions with AWS SDK integration.

To learn more about the Data API, see Using the Amazon Redshift Data API to interact with Amazon Redshift clusters and Using the Amazon Redshift Data API.


About the Authors

Jason Pedreza is an Analytics Specialist Solutions Architect at AWS with over 13 years of data warehousing experience. Prior to AWS, he built data warehouse solutions at Amazon.com. He specializes in Amazon Redshift and helps customers build scalable analytic solutions.

Bipin Pandey is a Data Architect at AWS. He loves to build data lake and analytics platforms for his customers. He is passionate about automating and simplifying customer problems with the use of cloud solutions.

David Zhang is an AWS Solutions Architect who helps customers design robust, scalable, and data-driven solutions across multiple industries. With a background in software development, David is an active leader and contributor to AWS open-source initiatives. He is passionate about solving real-world business problems and continuously strives to work from the customer’s perspective. Feel free to connect with him on LinkedIn.