Automate data archival for Amazon Redshift time series tables

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all of your data using standard SQL. Tens of thousands of customers today rely on Amazon Redshift to analyze exabytes of data and run complex analytical queries, making it the most widely used cloud data warehouse. You can run and scale analytics in seconds on all your data without having to manage your data warehouse infrastructure.

A data retention policy is part of an organization’s overall data management. In a big data world, the volume of data is constantly increasing, which directly affects the cost of storing it in data stores. It’s necessary to keep optimizing the data in your data warehouse for consistent performance, reliability, and cost control. It’s crucial to define how long an organization needs to hold on to specific data, and whether data that is no longer needed should be archived or deleted. The frequency of data archival depends on the relevance of the data with respect to your business or legal needs.

Data archiving is the process of moving data that is no longer actively used in a data warehouse to a separate storage device for long-term retention. Archive data consists of older data that is still important to the organization and may be needed for future reference, as well as data that must be retained for regulatory compliance.

Data purging is the process of freeing up space in the database or deleting obsolete data that isn’t required by the business. The purging process can be based on the data retention policy, which is defined by the data owner or business need.

This post walks you through how to automate data archival and purging of Amazon Redshift time series tables. Time series tables retain data for a certain period of time (days, months, quarters, or years) and need to be purged regularly to maintain a rolling window of data for end-users to analyze.

Solution overview

The following diagram illustrates our solution architecture.

We use two database tables as part of this solution.

The arch_table_metadata database table stores the metadata for all the tables that need to be archived and purged. You add a row to this table for each table that you want to archive and purge. The arch_table_metadata table contains the following columns.

Column Name | Description
id | Database-generated, automatically assigns a unique value to each record.
schema_name | Name of the database schema of the table.
table_name | Name of the table to be archived and purged.
column_name | Name of the date column that is used to identify records to be archived and purged.
s3_uri | Amazon S3 location where the data will be archived.
retention_days | Number of days the data will be retained for the table. Default is 90 days.

The arch_job_log database table stores the run history of the stored procedure, which adds records to this table as it runs. It contains the following columns.

Column Name | Description
job_run_id | Unique numeric value assigned per stored procedure run.
arch_table_metadata_id | Value of the id column from the arch_table_metadata table.
no_of_rows_bfr_delete | Number of rows in the table before purging.
no_of_rows_deleted | Number of rows deleted by the purge operation.
job_start_time | Time in UTC when the stored procedure started.
job_end_time | Time in UTC when the stored procedure ended.
job_status | Status of the stored procedure run: IN-PROGRESS, COMPLETED, or FAILED.
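
Once the solution is running, a query like the following (a simple sketch using only the columns above; the table itself is created in the prerequisites that follow) shows the most recent runs and their status:

select job_run_id, arch_table_metadata_id, no_of_rows_bfr_delete, no_of_rows_deleted,
       job_start_time, job_end_time, job_status
from arch_job_log
order by job_start_time desc
limit 10;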

Prerequisites

For this solution, complete the following prerequisites:

  1. Create an Amazon Redshift provisioned cluster or Amazon Redshift Serverless workgroup.
  2. In Amazon Redshift query editor v2 or a compatible SQL editor of your choice, create the tables arch_table_metadata and arch_job_log. Use the following code for the table DDLs:
    create table arch_table_metadata
    (
    id integer identity(0,1) not null, 
    schema_name varchar(100) not null, 
    table_name varchar(100) not null, 
    column_name varchar(100) not null,
    s3_uri varchar(1000) not null,
    retention_days integer default 90
    );
    
    create table arch_job_log
    (
    job_run_id bigint not null, 
    arch_table_metadata_id  integer not null,
    no_of_rows_bfr_delete bigint,
    no_of_rows_deleted bigint,
    table_arch_start_time timestamp default current_timestamp,
    table_arch_end_time timestamp default current_timestamp,
    job_start_time timestamp default current_timestamp,
    job_end_time timestamp default current_timestamp,
    job_status varchar(20)
    );
    

  3. Create the stored procedure archive_data_sp with the following code snippet. The stored procedure takes the AWS Identity and Access Management (IAM) role ARN as an input argument if you’re not using the default IAM role. If you’re using the default IAM role for your Amazon Redshift cluster, you can pass the input parameter as default. For more information, refer to Creating an IAM role as default in Amazon Redshift.
    CREATE OR REPLACE PROCEDURE archive_data_sp(p_iam_role IN varchar(256))
    AS $$
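    -- p_iam_role: IAM role ARN to use for the UNLOAD to Amazon S3, or the literal
    -- 'default' to use the default IAM role attached to the Amazon Redshift cluster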
    DECLARE
    
    v_command           varchar(500);
    v_sql               varchar(500);
    v_count_sql         text;
    
    v_table_id          int;
    v_schema_name       text;
    v_table_name        text;
    v_column_name       text;
    v_s3_bucket_url     text;
    v_s3_folder_name_prefix     text;
    v_retention_days            int = 0;
    v_no_of_rows_before_delete  int = 0;
    v_no_of_deleted_rows        int = 0;
    v_job_start_time            timestamp;
    v_job_status                int = 1;
    v_job_id                    int = 0;
    
    
    table_meta_data_cur CURSOR FOR
    SELECT id, schema_name, table_name, column_name,s3_uri,retention_days
    FROM arch_table_metadata;
    
    BEGIN
    
        SELECT NVL(MAX(job_run_id),0) + 1 INTO v_job_id FROM arch_job_log;
        RAISE NOTICE '%', v_job_id;
    
        OPEN table_meta_data_cur;
        FETCH table_meta_data_cur INTO v_table_id,v_schema_name, v_table_name, v_column_name, v_s3_bucket_url, v_retention_days;
        WHILE v_table_id IS NOT NULL LOOP
    
            v_count_sql = 'SELECT COUNT(*) AS v_no_of_rows_before_delete FROM ' || v_schema_name || '.' || v_table_name;
            RAISE NOTICE '%', v_count_sql;
            EXECUTE v_count_sql INTO v_no_of_rows_before_delete;
            RAISE NOTICE 'v_no_of_rows_before_delete %', v_no_of_rows_before_delete;
    
            v_job_start_time = GETDATE();
            v_s3_folder_name_prefix = v_schema_name || '.' || v_table_name || '/';
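            -- Build and run the UNLOAD command to archive rows older than retention_days
            -- to Amazon S3 as Parquet files, partitioned by the date column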
            v_sql = 'SELECT * FROM ' || v_schema_name || '.' || v_table_name || ' WHERE ' || v_column_name || ' <= DATEADD(DAY,-' || v_retention_days || ',CURRENT_DATE)';
    
            IF p_iam_role = 'default' THEN
                v_command = 'UNLOAD (''' || v_sql ||  ''') to ''' || v_s3_bucket_url || v_s3_folder_name_prefix || ''' IAM_ROLE default  PARQUET PARTITION BY (' || v_column_name || ') INCLUDE ALLOWOVERWRITE';
            ELSE
                v_command = 'UNLOAD (''' || v_sql ||  ''') to ''' || v_s3_bucket_url || v_s3_folder_name_prefix || ''' IAM_ROLE ''' || p_iam_role || ''' PARQUET PARTITION BY (' || v_column_name || ') INCLUDE ALLOWOVERWRITE';
            END IF;
            RAISE NOTICE '%', v_command;
            EXECUTE v_command;
    
            v_sql := 'DELETE FROM ' || v_schema_name || '.' || v_table_name || ' WHERE ' || v_column_name || ' <= DATEADD(DAY,-' || v_retention_days || ',CURRENT_DATE)';
            RAISE NOTICE '%', v_sql;
            EXECUTE v_sql;
    
            GET DIAGNOSTICS v_no_of_deleted_rows := ROW_COUNT;
            RAISE INFO '# of rows deleted = %', v_no_of_deleted_rows;
    
            v_sql = 'INSERT INTO arch_job_log (job_run_id, arch_table_metadata_id ,no_of_rows_bfr_delete,no_of_rows_deleted,job_start_time,job_end_time,job_status) VALUES ('
                        || v_job_id || ',' || v_table_id || ',' || v_no_of_rows_before_delete || ',' || v_no_of_deleted_rows || ',''' || v_job_start_time || ''',''' || GETDATE() || ''',' || v_job_status || ')';
            RAISE NOTICE '%', v_sql;
            EXECUTE v_sql;
    
            FETCH table_meta_data_cur INTO v_table_id,v_schema_name, v_table_name, v_column_name, v_s3_bucket_url, v_retention_days;
        END LOOP;
        CLOSE table_meta_data_cur;
    
        EXCEPTION
        WHEN OTHERS THEN
            RAISE NOTICE 'Error - % ', SQLERRM;
    END;
    $$ LANGUAGE plpgsql;
    

Archival and purging

For this use case, we use a table called orders, for which we want to archive and purge any records older than 30 days.

Use the following DDL to create the table in the Amazon Redshift cluster:

create table orders (
  O_ORDERKEY bigint NOT NULL,
  O_CUSTKEY bigint,
  O_ORDERSTATUS varchar(1),
  O_TOTALPRICE decimal(18,4),
  O_ORDERDATE Date,
  O_ORDERPRIORITY varchar(15),
  O_CLERK varchar(15),
  O_SHIPPRIORITY Integer,
  O_COMMENT varchar(79))
distkey (O_ORDERKEY)
sortkey (O_ORDERDATE);

The O_ORDERDATE column makes it a time series table, which you can use to retain the rolling data for a certain period.

To load the data into the orders table using the following COPY command, you need to have the default IAM role attached to your Redshift cluster, or replace the default keyword in the COPY command with the ARN of an IAM role attached to the Redshift cluster:

copy orders from 's3://redshift-immersionday-labs/data/orders/orders.tbl.'
iam_role default
region 'us-west-2' lzop delimiter '|' COMPUPDATE PRESET;

When you query the table, you can see that this data is for 1998. To test this solution, you need to manually update some of the data to the current date by running the following SQL statement:

update orders set O_ORDERDATE = current_date where O_ORDERDATE < '1998-08-02';

The table looks like the following screenshot after running the update statement.

Now let’s run the following SQL to get the count of records to be archived and purged:

select count (*) from orders where O_ORDERDATE <= DATEADD(DAY,-30,CURRENT_DATE)

Before running the stored procedure, we need to insert a row into the arch_table_metadata table for the stored procedure to archive and purge records in the orders table. In the following code, provide the Amazon Simple Storage Service (Amazon S3) bucket name where you want to store the archived data:

INSERT INTO arch_table_metadata (schema_name, table_name, column_name, s3_uri, retention_days) VALUES ('public', 'orders', 'O_ORDERDATE', 's3://<your-bucketname>/redshift_data_archival/', 30);

The stored procedure performs the following high-level steps:

  1. Open a cursor to read and loop through the rows in the arch_table_metadata table.
  2. Retrieve the total number of records in the table before purging.
  3. Export and archive the records to be deleted into the Amazon S3 location as specified in the s3_uri column value. Data is partitioned in Amazon S3 based on the column_name field in arch_table_metadata. The stored procedure uses the IAM role passed as input for the UNLOAD operation (see the example following this list).
  4. Run the DELETE command to purge the identified records based on the retention_days column value.
  5. Add a record in arch_job_log with the run details.
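
For illustration, with the metadata row inserted earlier for the orders table and the default IAM role, the UNLOAD statement that the stored procedure builds in step 3 looks roughly like the following (the bucket name is the placeholder you supplied):

UNLOAD ('SELECT * FROM public.orders WHERE O_ORDERDATE <= DATEADD(DAY,-30,CURRENT_DATE)')
TO 's3://<your-bucketname>/redshift_data_archival/public.orders/'
IAM_ROLE default
PARQUET PARTITION BY (O_ORDERDATE) INCLUDE ALLOWOVERWRITE;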

Now let’s run the stored procedure via the CALL statement, passing a role ARN as the input parameter, and then verify that the data was archived and purged correctly:

call archive_data_sp('arn:aws:iam::<your-account-id>:role/RedshiftRole-7OR1UWVPFI5J');

As shown in the following screenshot, the stored procedure ran successfully.

Now let’s validate the table was purged successfully by running the following SQL:

select count (*) from orders where O_ORDERDATE <= DATEADD(DAY,-30,CURRENT_DATE)

We can navigate to the Amazon S3 location to validate the archival process. The following screenshot shows the data has been archived into the Amazon S3 location specified in the arch_table_metadata table.

Now let’s run the following SQL statement to look at the stored procedure run log entry:

select a.* from arch_job_log a, arch_table_metadata b
where a.arch_table_metadata_id = b.id
and b.table_name = 'orders'

The following screenshot shows the query results.

In this example, we demonstrated how you can set up and validate your Amazon Redshift table archival and purging process.

Schedule the stored procedure

Now that you have learned how to set up and validate your Amazon Redshift tables for archival and purging, you can schedule this process. For instructions on how to schedule a SQL statement using either the AWS Management Console or the AWS Command Line Interface (AWS CLI), refer to Scheduling SQL queries on your Amazon Redshift data warehouse.
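
The SQL statement that you register with the scheduler is simply the CALL from the previous section, for example (assuming the default IAM role is attached to your cluster or workgroup):

call archive_data_sp('default');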

Archive data in Amazon S3

As part of this solution, data is archived in an S3 bucket before it’s deleted from the Amazon Redshift table. This helps reduce the storage on the Amazon Redshift cluster and enables you to analyze the data for any ad hoc requests without needing to load it back into the cluster. In the stored procedure, the UNLOAD command exports the data to be purged to Amazon S3, partitioned by the date column, which is used to identify the records to purge. To save costs on Amazon S3 storage, you can manage the storage lifecycle with Amazon S3 Lifecycle configuration.

Analyze the archived data in Amazon S3 using Amazon Redshift Spectrum

With Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semistructured data from files in Amazon S3, and easily analyze the archived data in Amazon S3 without having to load it back into Amazon Redshift tables. You can also run queries that join the archived data in Amazon S3 (cold data) with the frequently accessed data (hot data) on the cluster’s local disk. The following diagram illustrates this process.

Let’s take an example where you want to view the number of orders for the last 2 weeks of December 1998, which is archived in Amazon S3. You need to complete the following steps using Redshift Spectrum:

  1. Create an external schema in Amazon Redshift (a minimal sketch of this step follows the list).
  2. Create a late-binding view to refer to the underlying Amazon S3 files with the following query:
    create view vw_orders_hist as select count(*),o_orderdate
    from <external_schema>.orders
    where o_orderdate between '1998-12-15' and '1998-12-31' group by 2
    with no schema binding;
    

  3. To see a unified view of the orders historical data archived in Amazon S3 and the current data stored in the Amazon Redshift local table, you can use a UNION ALL clause to join the Amazon Redshift orders table and the Redshift Spectrum orders table:
    create view vw_orders_unified as 
    select * from <external_schema>.orders
    union all
    select * from public.orders
    with no schema binding;
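
A minimal sketch of step 1, together with the external table definition it implies, could look like the following. The external schema name archive_ext and the AWS Glue Data Catalog database name archive_db are assumptions for illustration; archive_ext stands in for the <external_schema> placeholder in the views above, and the Amazon S3 location is the one the archival process used for the orders table. Because the stored procedure’s UNLOAD uses PARTITION BY (O_ORDERDATE) INCLUDE, the date column is retained in the Parquet files, so a non-partitioned external table is sufficient; Redshift Spectrum reads the files in the specified location and its subfolders.

-- Hypothetical names for illustration: archive_ext (external schema), archive_db (Glue Data Catalog database)
create external schema if not exists archive_ext
from data catalog
database 'archive_db'
iam_role default
create external database if not exists;

-- External table over the Parquet files archived by the stored procedure for public.orders
create external table archive_ext.orders (
  o_orderkey bigint,
  o_custkey bigint,
  o_orderstatus varchar(1),
  o_totalprice decimal(18,4),
  o_orderdate date,
  o_orderpriority varchar(15),
  o_clerk varchar(15),
  o_shippriority integer,
  o_comment varchar(79))
stored as parquet
location 's3://<your-bucketname>/redshift_data_archival/public.orders/';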

To learn more about the best practices for Redshift Spectrum, refer to Best Practices for Amazon Redshift Spectrum.

Best practices

The following are some best practices to reduce your storage footprint and optimize performance of your workloads:

Conclusion

In this post, we demonstrated the automatic archival and purging of data in Amazon Redshift tables to meet your compliance and business requirements, thereby optimizing your application performance and reducing storage costs. As an administrator, you can start working with application data owners to identify retention policies for Amazon Redshift tables to achieve optimal performance, prevent any storage issues specifically for DS2 and DC2 nodes, and reduce overall storage costs.


About the authors

Nita Shah is an Analytics Specialist Solutions Architect at AWS based out of New York. She has been building data warehouse solutions for over 20 years and specializes in Amazon Redshift. She is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 15 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with the use of cloud solutions.

Prathap Thoguru is an Enterprise Solutions Architect at Amazon Web Services. He has over 15 years of experience in the IT industry and is a 9x AWS certified professional. He helps customers migrate their on-premises workloads to the AWS Cloud.