All posts by Jason Pedreza

Simplify data ingestion from Amazon S3 to Amazon Redshift using auto-copy (preview)

Post Syndicated from Jason Pedreza original https://aws.amazon.com/blogs/big-data/simplify-data-ingestion-from-amazon-s3-to-amazon-redshift-using-auto-copy-preview/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all of your data using standard SQL and your existing business intelligence (BI) tools. Tens of thousands of customers today rely on Amazon Redshift to analyze exabytes of data and run complex analytical queries, making it the most widely used cloud data warehouse.

Data ingestion is the process of getting data from the source system to Amazon Redshift. This can be done by using one of many AWS cloud-based ETL tools like AWS Glue, Amazon EMR, or AWS Step Functions, or you can simply load data from Amazon Simple Storage Service (Amazon S3) to Amazon Redshift using the COPY command. A COPY command is the most efficient way to load a table because it uses the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from a file or multiple files in an S3 bucket.

Now SQL users can easily automate data ingestion from Amazon S3 to Amazon Redshift with a simple SQL command using the Amazon Redshift auto-copy preview feature. COPY statements are triggered and start loading data when Amazon Redshift auto-copy detects new files in the specified Amazon S3 paths. This also ensures end-users have the latest data available in Amazon Redshift shortly after the source data is available.

This post shows you how to build continuous file ingestion pipelines in Amazon Redshift using auto-copy with a simple SQL command when the source files are located in Amazon S3. In addition, we show you how to enable auto-copy using copy jobs, how to monitor jobs, and discuss considerations and best practices.

Overview of the auto-copy feature in Amazon Redshift

The auto-copy feature in Amazon Redshift simplifies automatic data loading from Amazon S3 with a simple SQL command. You can enable Amazon Redshift auto-copy by creating copy jobs. A copy job is a database object that stores, automates, and reuses the COPY statement for newly created files that land in the S3 folder.

The following diagram illustrates this process.

automatic data ingestion from Amazon S3 using copy jobs

Copy jobs have the following benefits:

  • SQL users such as data analysts can now load data from Amazon S3 automatically without having to build a pipeline or use an external framework
  • Copy jobs offer continuous and incremental data ingestion from an Amazon S3 location without the need to implement a custom solution
  • This functionality comes at no additional cost
  • Existing COPY statements can be converted into copy jobs by appending the JOB CREATE <job_name> parameter
  • It keeps track of all loaded files and prevents data duplication
  • It can be easily set up using a simple SQL statement and any JDBC or ODBC client

Prerequisites

To get started with auto-copy preview, you need the following prerequisites:

  • An AWS account
  • An Amazon Redshift cluster with a maintenance track of PREVIEW_AUTOCOPY_2022

The Amazon Redshift auto-copy support from Amazon S3 is available as a preview for provisioned clusters in the following AWS Regions: US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Sydney), Europe (Ireland), and Europe (Stockholm). Please note that for the preview track, restore from snapshot is not supported.

You can refer to the SQL notebook redshift-auto-copy-preview-demo-sql-notebook.ipynb for the SQL statements used in this post.

Set up copy jobs

In this section, we demonstrate how to automate data loading of files from Amazon S3 into Amazon Redshift. With the existing COPY syntax, we add the JOB CREATE parameter to perform a one-time setup for automatic file ingestion. See the following code:

COPY <table-name>
FROM 's3://<s3-object-path>'
[COPY PARAMETERS...]
JOB CREATE <job-name> [AUTO ON | OFF];

Auto ingestion is enabled by default on copy jobs.

Automate ingestion from a single data source

With a copy job, you can automate ingestion from a single data source by creating one job and specifying the path to the S3 objects that contain the data. The S3 object path can reference a set of folders that have the same key prefix.

In this example, multiple files containing the sales transactions across all the stores in the US are loaded on a daily basis. Each day’s sales transactions are loaded to their own folder in Amazon S3.

store_sales

Each folder contains multiple gzip-compressed files.

gzip-compressed files

The following code creates the store_sales table:

DROP TABLE IF EXISTS store_sales;
CREATE TABLE IF NOT EXISTS store_sales
(
  ss_sold_date_sk int4 ,            
  ss_sold_time_sk int4 ,     
  ss_item_sk int4 not null ,      
  ss_customer_sk int4 ,           
  ss_cdemo_sk int4 ,              
  ss_hdemo_sk int4 ,         
  ss_addr_sk int4 ,               
  ss_store_sk int4 ,           
  ss_promo_sk int4 ,           
  ss_ticket_number int8 not null,        
  ss_quantity int4 ,           
  ss_wholesale_cost numeric(7,2) ,          
  ss_list_price numeric(7,2) ,              
  ss_sales_price numeric(7,2) ,
  ss_ext_discount_amt numeric(7,2) ,             
  ss_ext_sales_price numeric(7,2) ,              
  ss_ext_wholesale_cost numeric(7,2) ,           
  ss_ext_list_price numeric(7,2) ,               
  ss_ext_tax numeric(7,2) ,                 
  ss_coupon_amt numeric(7,2) , 
  ss_net_paid numeric(7,2) ,   
  ss_net_paid_inc_tax numeric(7,2) ,             
  ss_net_profit numeric(7,2) ,
  primary key (ss_item_sk, ss_ticket_number)
) DISTKEY (ss_item_sk) 
  SORTKEY(ss_sold_date_sk);

Next, we create the copy job to automatically load the gzip-compressed files into the store_sales table:

COPY store_sales
FROM 's3://redshift-blogs/amazon-Redshift-auto-copy/store_sales'
IAM_ROLE 'arn:aws:iam::**********:role/Redshift-S3'
gzip delimiter '|' EMPTYASNULL
region 'us-east-1'
JOB CREATE job_store_sales AUTO ON;

When the copy job is created, it automatically loads the existing gzip-compressed files located in the S3 object path specified in the command to the store_sales table.

Let’s run a query to get the daily total of sales transactions across all the stores in the US:

SELECT ss_sold_date_sk, count(1)
  FROM store_sales
GROUP BY ss_sold_date_sk;

The output shown comes from the transactions sold on 2002-12-31 and 2003-01-01, respectively.

transactions shown

The following day, incremental sales transactions data are loaded to a new folder in the same S3 object path.

incremental sales transactions

As new files arrive to the same S3 object path, the copy job automatically loads the unprocessed files to the store_sales table in an incremental fashion.

All new sales transactions for 2003-01-02 are automatically ingested, which can be verified by running the following query:

SELECT ss_sold_date_sk, count(1)
  FROM store_sales
GROUP BY ss_sold_date_sk;

query 1

Automate ingestion from multiple data sources

We can also load an Amazon Redshift table from multiple data sources. When using a pub/sub pattern where multiple S3 buckets populate data to an Amazon Redshift table, you have to maintain multiple data pipelines for each source/target combination. With new parameters in the COPY command, this can be automated to handle data loads efficiently.

In the following example, the Customer_1 folder has Green Cab Company sales data, and the Customer_2 folder has Red Cab Company sales data. We can use the COPY command with the JOB parameter to automate this ingestion process.

automate this ingestion process

The following screenshot shows sample data stored in files. Each folder has similar data but for different customers.

sample data stored in files

The target for these files in this example is the Amazon Redshift table cab_sales_data.

Define the target table cab_sales_data:

DROP TABLE IF EXISTS cab_sales_data;
CREATE TABLE IF NOT EXISTS cab_sales_data
(
  vendorid                VARCHAR(4),
  pickup_datetime         TIMESTAMP,
  dropoff_datetime        TIMESTAMP,
  store_and_fwd_flag      VARCHAR(1),
  ratecode                INT,
  pickup_longitude        FLOAT4,
  pickup_latitude         FLOAT4,
  dropoff_longitude       FLOAT4,
  dropoff_latitude        FLOAT4,
  passenger_count         INT,
  trip_distance           FLOAT4,
  fare_amount             FLOAT4,
  extra                   FLOAT4,
  mta_tax                 FLOAT4,
  tip_amount              FLOAT4,
  tolls_amount            FLOAT4,
  ehail_fee               FLOAT4,
  improvement_surcharge   FLOAT4,
  total_amount            FLOAT4,
  payment_type            VARCHAR(4),
  trip_type               VARCHAR(4)
)
DISTSTYLE EVEN
SORTKEY (passenger_count,pickup_datetime);

You can define two copy jobs as shown in the following code to handle and monitor ingestion of sales data belonging to different customers, in our case Customer_1 and Customer_2. These jobs monitor the Customer_1 and Customer_2 folders and load any new files that are added there.

COPY cab_sales_data
FROM 's3://redshift-blogs/amazon-Redshift-auto-copy/Customer_1'
IAM_ROLE 'arn:aws:iam::**********:role/Redshift-S3'
DATEFORMAT 'auto'
IGNOREHEADER 1
DELIMITER ','
IGNOREBLANKLINES
REGION 'us-east-1'
JOB CREATE job_green_cab AUTO ON;

COPY cab_sales_data
FROM 's3://redshift-blogs/amazon-Redshift-auto-copy/Customer_2'
IAM_ROLE 'arn:aws:iam::**********:role/Redshift-S3'
DATEFORMAT 'auto'
IGNOREHEADER 1
DELIMITER ','
IGNOREBLANKLINES
REGION 'us-east-1'
JOB CREATE job_red_cab AUTO ON;

Each customer is assigned its own vendorid, as shown in the following output:

SELECT vendorid,
       sum(passenger_count) as total_passengers 
  FROM cab_sales_data
GROUP BY vendorid;

result 1

Manually run a copy job

There might be scenarios where the copy job needs to be paused, meaning it stops looking for new files, for example, to fix a corrupted data pipeline at the data source.

In that case, either use the COPY JOB ALTER command to set AUTO to OFF or create a new copy job with AUTO OFF. Once this is set, auto-copy no longer looks for new files.

If required, you can manually invoke the copy job, which ingests any new files it finds:

COPY JOB RUN <Copy Job Name>

You can disable “AUTO ON” in the existing copy job using the following command:

COPY JOB ALTER <Copy Job Name> AUTO OFF
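
Putting these commands together, here is a minimal sketch of pausing the job_store_sales job from the earlier example, ingesting on demand, and then resuming automatic ingestion (re-enabling with AUTO ON is an assumption based on the ALTER syntax shown above):

-- Pause automatic ingestion while the upstream issue is fixed
COPY JOB ALTER job_store_sales AUTO OFF;

-- Ingest any new, unprocessed files on demand
COPY JOB RUN job_store_sales;

-- Resume automatic ingestion (assumed to mirror the AUTO OFF syntax)
COPY JOB ALTER job_store_sales AUTO ON;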

The following table compares the syntax and data duplication between a regular copy statement and the new auto-copy job.

COPY
  Syntax:
    COPY <table-name>
    FROM 's3://<s3-object-path>'
    [COPY PARAMETERS...]
  Data duplication: If it is run multiple times against the same S3 folder, it will load the data again, resulting in data duplication.

Auto-copy job (preview)
  Syntax:
    COPY <table-name>
    FROM 's3://<s3-object-path>'
    [COPY PARAMETERS...]
    JOB CREATE <job-name>;
  Data duplication: It will not load the same file twice, preventing data duplication.

During the copy job preview, support for other data formats will be expanded.

Error handling and monitoring for copy jobs

Copy jobs continuously monitor the S3 folder specified during job creation and perform ingestion whenever new files are created. New files created under the S3 folder are loaded exactly once to avoid data duplication.

By default, if there are any data or format issues with specific files, the copy job fails to ingest those files with a load error and logs the details to the system tables. The copy job remains AUTO ON for new data files and continues to ignore previously failed files.

Amazon Redshift provides the following system tables for users to monitor or troubleshoot copy jobs as needed:

  • List copy jobs – Use SYS_COPY_JOB to list all the copy jobs stored in the database:
SELECT * 
  FROM sys_copy_job;

  • Get a summary of a copy job – Use the SYS_LOAD_HISTORY view to get the aggregate metrics of a copy job operation by specifying the copy_job_id. It shows the aggregate metrics of all the files that have been processed by a copy job.
SELECT *
  FROM sys_load_history
 WHERE copy_job_id = 105928;

  • Get details of a copy job – Use STL_LOAD_COMMITS to get the status and details of each file that was processed by a copy job:
SELECT *
  FROM stl_load_commits
 WHERE copy_job_id = 105928
ORDER BY curtime ASC;

  • Get exception details of a copy job – Use STL_LOAD_ERRORS to get the details of files that failed to ingest from a copy job:
SELECT *
  FROM stl_load_errors
 WHERE copy_job_id = 105939;
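
Putting these views together, a common troubleshooting flow is to look up a job’s ID by name and then inspect its most recent file loads. The following sketch assumes SYS_COPY_JOB exposes job_id and job_name columns; verify the column names in your preview cluster:

SELECT job_id
  FROM sys_copy_job
 WHERE job_name = 'job_store_sales';

SELECT filename, status, curtime
  FROM stl_load_commits
 WHERE copy_job_id = <job-id-from-previous-query>
ORDER BY curtime DESC
LIMIT 10;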

Copy job best practices

In a copy job, when a new file is detected and ingested (automatically or manually), Amazon Redshift stores the file name, and the job doesn’t process a file again if a new file is later created with the same file name.

The following are the recommended best practices when working with files using the copy job:

  • Use unique file names for each file in a copy job (for example, 2022-10-15-batch-1.csv). However, you can use the same file name as long as it’s from different copy jobs:
    • job_customerA_sales – s3://redshift-blogs/sales/customerA/2022-10-15-sales.csv
    • job_customerB_sales – s3://redshift-blogs/sales/customerB/2022-10-15-sales.csv
  • Do not update file contents. Do not overwrite existing files. Changes in existing files will not be reflected to the target table. The copy job doesn’t pick up updated or overwritten files, so make sure they’re renamed as new file names for the copy job to pick up.
  • Run regular COPY statements (not a job) if you need to ingest a file that was already processed by your copy job; COPY without a copy job doesn’t track loaded files. For example, this is helpful in scenarios where you don’t have control over the file name and the initial file received failed. The following figure shows a typical workflow in this case, and a sketch of the reload step follows this list.

  • Delete and recreate your copy job if you want to reset file tracking history and start over.
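
As a sketch of the COPY-without-a-job practice above, the following one-off COPY reloads a corrected file that the copy job has already tracked (the file name and folder are hypothetical; the bucket path and IAM role are reused from the earlier store_sales example):

COPY store_sales
FROM 's3://redshift-blogs/amazon-Redshift-auto-copy/store_sales/2003-01-02/corrected-batch-1.csv.gz'
IAM_ROLE 'arn:aws:iam::**********:role/Redshift-S3'
gzip delimiter '|' EMPTYASNULL
region 'us-east-1';

Because a plain COPY doesn’t track loaded files, it ingests the file even though the copy job has already recorded that file name.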

Copy job considerations

During the preview, there are several considerations to keep in mind when using auto-copy. For details on these and other considerations for the auto-copy preview, refer to the AWS documentation.

Customer feedback

GE Aerospace is a global provider of jet engines, components, and systems for commercial and military aircraft. The company has been designing, developing, and manufacturing jet engines since World War I.

“GE Aerospace uses AWS analytics and Amazon Redshift to enable critical business insights that drive important business decisions. With the support for auto-copy from Amazon S3, we can build simpler data pipelines to move data from Amazon S3 to Amazon Redshift. This accelerates our data product teams’ ability to access data and deliver insights to end users. We spend more time adding value through data and less time on integrations.”

Alcuin Weidus, Sr. Principal Data Architect at GE Aerospace

Conclusion

This post demonstrated how to automate data load from Amazon S3 to Amazon Redshift using the auto-copy preview feature. This new functionality helps make Amazon Redshift data ingestion easier than ever, and will allow SQL users to get access to the most recent data using a simple SQL command.

As an analyst or SQL user, you can begin ingesting data into Amazon Redshift from Amazon S3 with simple SQL commands and gain access to the most up-to-date data without the need for third-party tools or custom implementation.


About the authors

Jason Pedreza is an Analytics Specialist Solutions Architect at AWS with data warehousing experience handling petabytes of data. Prior to AWS, he built data warehouse solutions at Amazon.com. He specializes in Amazon Redshift and helps customers build scalable analytic solutions.

Nita Shah is an Analytics Specialist Solutions Architect at AWS based out of New York. She has been building data warehouse solutions for over 20 years and specializes in Amazon Redshift. She is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.

Eren Baydemir, a Technical Product Manager at AWS, has 15 years of experience in building customer-facing products and is currently focusing on data lake and file ingestion topics in the Amazon Redshift team. He was the CEO and co-founder of DataRow, which was acquired by Amazon in 2020.

Eesha Kumar is an Analytics Solutions Architect with AWS. He works with customers to realize the business value of data by helping them build solutions using the AWS platform and tools.

Satish Sathiya is a Senior Product Engineer at Amazon Redshift. He is an avid big data enthusiast who collaborates with customers around the globe to achieve success and meet their data warehousing and data lake architecture needs.

 Hangjian Yuan is a Software Development Engineer at Amazon Redshift. He’s passionate about analytical databases and focuses on delivering cutting-edge streaming experiences for customers.

ETL orchestration using the Amazon Redshift Data API and AWS Step Functions with AWS SDK integration

Post Syndicated from Jason Pedreza original https://aws.amazon.com/blogs/big-data/etl-orchestration-using-the-amazon-redshift-data-api-and-aws-step-functions-with-aws-sdk-integration/

Extract, transform, and load (ETL) serverless orchestration architectures are becoming popular with many customers. These architectures offer greater extensibility and simplicity, making ETL pipelines easier to build and maintain. A primary benefit of this architecture is that we simplify an existing ETL pipeline with AWS Step Functions and directly call the Amazon Redshift Data API from the state machine. As a result, the complexity of the ETL pipeline is reduced.

As a data engineer or an application developer, you may want to interact with Amazon Redshift to load or query data with a simple API endpoint without having to manage persistent connections. The Amazon Redshift Data API allows you to interact with Amazon Redshift without having to configure JDBC or ODBC connections. This feature allows you to orchestrate serverless data processing workflows, design event-driven web applications, and run an ETL pipeline asynchronously to ingest and process data in Amazon Redshift, with the use of Step Functions to orchestrate the entire ETL or ELT workflow.

This post explains how to use Step Functions and the Amazon Redshift Data API to orchestrate the different steps in your ETL or ELT workflow and process data into an Amazon Redshift data warehouse.

AWS Lambda is typically used with Step Functions due to its flexible and scalable compute benefits. An ETL workflow has multiple steps, and the complexity may vary within each step. However, there is an alternative approach with AWS SDK service integrations, a feature of Step Functions. These integrations allow you to call over 200 AWS services’ API actions directly from your state machine. This approach is optimal for steps with relatively low complexity compared to using Lambda because you no longer have to maintain and test function code. Lambda functions have a maximum timeout of 15 minutes; if you need to wait for longer-running processes, Step Functions standard workflows allow a maximum runtime of 1 year.

You can replace steps that include a single process with a direct integration between Step Functions and AWS SDK service integrations without using Lambda. For example, if a step is only used to call a Lambda function that runs a SQL statement in Amazon Redshift, you can replace the Lambda function with a direct integration to the corresponding Amazon Redshift Data API action. You can also decouple Lambda functions with multiple actions into multiple steps. An implementation of this is available later in this post.

We created an example use case in the GitHub repo ETL Orchestration using Amazon Redshift Data API and AWS Step Functions that provides an AWS CloudFormation template for setup, SQL scripts, and a state machine definition. The state machine directly reads SQL scripts stored in your Amazon Simple Storage Service (Amazon S3) bucket, runs them in your Amazon Redshift cluster, and performs an ETL workflow. We don’t use Lambda in this use case.

Solution overview

In this scenario, we simplify an existing ETL pipeline that uses Lambda to call the Data API. AWS SDK service integrations with Step Functions allow you to directly call the Data API from the state machine, reducing the complexity in running the ETL pipeline.

The entire workflow performs the following steps:

  1. Set up the required database objects and generate a set of sample data to be processed.
  2. Run two dimension jobs that perform SCD1 and SCD2 dimension load, respectively.
  3. When both jobs have run successfully, the load job for the fact table runs.
  4. The state machine performs a validation to ensure the sales data was loaded successfully.
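
In SQL terms, the state machine ultimately drives calls like the following through the Data API (a sketch only; the procedure names come from the resources list below, and the run date is illustrative):

CALL sp_setup_sales_data_pipeline();
CALL sp_load_dim_customer_address();   -- SCD1 dimension load
CALL sp_load_dim_item();               -- SCD2 dimension load
CALL sp_load_fact_sales('2022-01-10'); -- fact load for an illustrative run date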

The following architecture diagram highlights the end-to-end solution:

We run the state machine via the Step Functions console, but you can run this solution in several ways.

You can deploy the solution with the provided CloudFormation template, which creates the following resources:

  • Database objects in the Amazon Redshift cluster:
    • Four stored procedures:
      • sp_setup_sales_data_pipeline() – Creates the tables and populates them with sample data
      • sp_load_dim_customer_address() – Runs the SCD1 process on customer_address records (a sketch of this pattern follows the resource list)
      • sp_load_dim_item() – Runs the SCD2 process on item records
      • sp_load_fact_sales (p_run_date date) – Processes sales from all stores for a given day
    • Five Amazon Redshift tables:
      • customer
      • customer_address
      • date_dim
      • item
      • store_sales
  • The AWS Identity and Access Management (IAM) role StateMachineExecutionRole for Step Functions to allow the following permissions:
    • Federate to the Amazon Redshift cluster through the getClusterCredentials permission, avoiding password credentials
    • Run queries in the Amazon Redshift cluster through Data API calls
    • List and retrieve objects from Amazon S3
  • The Step Functions state machine RedshiftETLStepFunction, which contains the steps used to run the ETL workflow of the sample sales data pipeline
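
For orientation, an SCD1 load like sp_load_dim_customer_address overwrites changed dimension attributes in place. The following is a minimal sketch of that pattern under assumed staging table and column names, not the repository’s actual code:

-- Illustrative SCD Type 1 pattern: update changed attributes in place, then
-- insert rows that don't exist yet (staging table and columns are assumed).
UPDATE customer_address
   SET ca_street_name = s.ca_street_name,
       ca_city        = s.ca_city,
       ca_state       = s.ca_state
  FROM stg_customer_address s
 WHERE customer_address.ca_address_id = s.ca_address_id;

INSERT INTO customer_address
SELECT s.*
  FROM stg_customer_address s
  LEFT JOIN customer_address c
    ON c.ca_address_id = s.ca_address_id
 WHERE c.ca_address_id IS NULL;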

Prerequisites

As a prerequisite for deploying the solution, you need to set up an Amazon Redshift cluster and associate it with an IAM role. For more information, see Authorizing Amazon Redshift to access other AWS services on your behalf. If you don’t have a cluster provisioned in your AWS account, refer to Getting started with Amazon Redshift for instructions to set it up.

When the Amazon Redshift cluster is available, perform the following steps:

  1. Download and save the CloudFormation template to a local folder on your computer.
  2. Download and save the following SQL scripts to a local folder on your computer:
    1. sp_statements.sql – Contains the stored procedures including DDL and DML operations.
    2. validate_sql_statement.sql – Contains two validation queries you can run.
  3. Upload the SQL scripts to your S3 bucket. The bucket name is the designated S3 bucket specified in the ETLScriptS3Path input parameter.
  4. On the AWS CloudFormation console, choose Create stack with new resources and upload the template file you downloaded in the previous step (etl-orchestration-with-stepfunctions-and-redshift-data-api.yaml).
  5. Enter the required parameters and choose Next.
  6. Choose Next until you get to the Review page and select the acknowledgement check box.
  7. Choose Create stack.
  8. Wait until the stack deploys successfully.

When the stack is complete, you can view the outputs, as shown in the following screenshot:

Run the ETL orchestration

After you deploy the CloudFormation template, navigate to the stack detail page. On the Resources tab, choose the link for RedshiftETLStepFunction to be redirected to the Step Functions console.

The RedshiftETLStepFunction state machine runs automatically, as outlined in the following workflow:

  1. read_sp_statement and run_sp_deploy_redshift – Perform the following actions:
    1. Retrieves the sp_statements.sql from Amazon S3 to get the stored procedure.
    2. Passes the stored procedure to the batch-execute-statement API to run in the Amazon Redshift cluster.
    3. Sends back the identifier of the SQL statement to the state machine.
  2. wait_on_sp_deploy_redshift – Waits for at least 5 seconds.
  3. run_sp_deploy_redshift_status_check – Invokes the Data API’s describeStatement to get the status of the API call.
  4. is_run_sp_deploy_complete – Routes the next step of the ETL workflow depending on its status:
    1. FINISHED – Stored procedures are created in your Amazon Redshift cluster.
    2. FAILED – Go to the sales_data_pipeline_failure step and fail the ETL workflow.
    3. All other status – Go back to the wait_on_sp_deploy_redshift step to wait for the SQL statements to finish.
  5. setup_sales_data_pipeline – Performs the following steps:
    1. Initiates the setup stored procedure that was previously created in the Amazon Redshift cluster.
    2. Sends back the identifier of the SQL statement to the state machine.
  6. wait_on_setup_sales_data_pipeline – Waits for at least 5 seconds.
  7. setup_sales_data_pipeline_status_check – Invokes the Data API’s describeStatement to get the status of the API call.
  8. is_setup_sales_data_pipeline_complete – Routes the next step of the ETL workflow depending on its status:
    1. FINISHED – Created two dimension tables (customer_address and item) and one fact table (sales).
    2. FAILED – Go to the sales_data_pipeline_failure step and fail the ETL workflow.
    3. All other status – Go back to the wait_on_setup_sales_data_pipeline step to wait for the SQL statements to finish.
  9. run_sales_data_pipeline – LoadItemTable and LoadCustomerAddressTable are two parallel workflows that Step Functions runs at the same time. The workflows run the stored procedures that were previously created. The stored procedures load the data into the item and customer_address tables. All other steps in the parallel sessions follow the same concept as described previously. When both parallel workflows are complete, run_load_fact_sales runs.
  10. run_load_fact_sales – Inserts data into the store_sales table that was created in the initial stored procedure.
  11. Validation – When all the ETL steps are complete, the state machine reads a second SQL file from Amazon S3 (validate_sql_statement.sql) and runs the two SQL statements using the batch_execute_statement method.

The implementation of the ETL workflow is idempotent. If it fails, you can retry the job without any cleanup. For example, it recreates the stg_store_sales staging table on each run, then deletes the rows for the particular refresh date from the target table store_sales before reloading them.
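
A minimal sketch of that idempotent pattern follows; only store_sales and stg_store_sales are taken from the post, and the join column and load step are assumptions for illustration:

-- Recreate the staging table on every run, load it, then replace the target
-- rows for the refresh date (illustrative; not the repository's actual code).
DROP TABLE IF EXISTS stg_store_sales;
CREATE TABLE stg_store_sales (LIKE store_sales);

-- ... load the refresh date's data into stg_store_sales ...

DELETE FROM store_sales
 USING stg_store_sales
 WHERE store_sales.ss_sold_date_sk = stg_store_sales.ss_sold_date_sk;

INSERT INTO store_sales
SELECT * FROM stg_store_sales;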

The following diagram illustrates the state machine workflow:

In this example, we use the task state resource arn:aws:states:::aws-sdk:redshiftdata:[apiAction] to call the corresponding Data API action. The following table summarizes the Data API actions and their corresponding AWS SDK integration API actions.

Amazon Redshift Data API Action    AWS SDK Integration API Action
BatchExecuteStatement              batchExecuteStatement
ExecuteStatement                   executeStatement
DescribeStatement                  describeStatement
CancelStatement                    cancelStatement
GetStatementResult                 getStatementResult
DescribeTable                      describeTable
ListDatabases                      listDatabases
ListSchemas                        listSchemas
ListStatements                     listStatements
ListTables                         listTables

To use AWS SDK integrations, you specify the service name and API call, and, optionally, a service integration pattern. The AWS SDK action is always camel case, and parameter names are Pascal case. For example, you can use the Step Functions action batchExecuteStatement to run multiple SQL statements in a batch as a part of a single transaction on the Data API. The SQL statements can be SELECT, DML, DDL, COPY, and UNLOAD.
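
For illustration, the following is the kind of SQL batch a batchExecuteStatement task might submit as one transaction; the statements are illustrative and are not taken from the repository’s scripts:

-- A batch of DDL, COPY, and DML statements that could be passed to
-- batchExecuteStatement and run as a single transaction (illustrative only).
CREATE TEMP TABLE tmp_daily_sales (LIKE store_sales);

COPY tmp_daily_sales
FROM 's3://<s3-object-path>'
IAM_ROLE '<iam-role-arn>'
FORMAT AS CSV;

INSERT INTO store_sales
SELECT * FROM tmp_daily_sales;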

Validate the ETL orchestration

The entire ETL workflow takes approximately 1 minute to run. The following screenshot shows that the ETL workflow completed successfully.

When the entire sales data pipeline is complete, you may go through the entire execution event history, as shown in the following screenshot.

Schedule the ETL orchestration

After you validate the sales data pipeline, you may opt to run the data pipeline on a daily schedule. You can accomplish this with Amazon EventBridge.

  1. On the EventBridge console, create a rule to run the RedshiftETLStepFunction state machine daily.
  2. To invoke the RedshiftETLStepFunction state machine on a schedule, choose Schedule and define the appropriate frequency needed to run the sales data pipeline.
  3. Specify the target state machine as RedshiftETLStepFunction and choose Create.

You can confirm the schedule on the rule details page.

Clean up

Clean up the resources created by the CloudFormation template to avoid unnecessary cost to your AWS account. You can delete the CloudFormation stack by selecting the stack on the AWS CloudFormation console and choosing Delete. This action deletes all the resources it provisioned. If you manually updated a template-provisioned resource, you may see some issues during cleanup; you need to clean these up independently.

Limitations

The Data API and Step Functions AWS SDK integration offer a robust mechanism to build highly distributed ETL applications with minimal developer overhead. Be aware of the service quotas and limitations of the Data API and Step Functions when designing your workflow; refer to the AWS documentation for details.

Conclusion

In this post, we demonstrated how to build an ETL orchestration using the Amazon Redshift Data API and Step Functions with AWS SDK integration.

To learn more about the Data API, see Using the Amazon Redshift Data API to interact with Amazon Redshift clusters and Using the Amazon Redshift Data API.


About the Authors

Jason Pedreza is an Analytics Specialist Solutions Architect at AWS with over 13 years of data warehousing experience. Prior to AWS, he built data warehouse solutions at Amazon.com. He specializes in Amazon Redshift and helps customers build scalable analytic solutions.

Bipin Pandey is a Data Architect at AWS. He loves to build data lake and analytics platforms for his customers. He is passionate about automating and simplifying customer problems with the use of cloud solutions.

David Zhang is an AWS Solutions Architect who helps customers design robust, scalable, and data-driven solutions across multiple industries. With a background in software development, David is an active leader and contributor to AWS open-source initiatives. He is passionate about solving real-world business problems and continuously strives to work from the customer’s perspective. Feel free to connect with him on LinkedIn.