All posts by Michael Soo

Accelerate your data warehouse migration to Amazon Redshift – Part 6

Post Syndicated from Michael Soo original https://aws.amazon.com/blogs/big-data/part-6-accelerate-your-data-warehouse-migration-to-amazon-redshift/

This is the sixth in a series of posts. We’re excited to share dozens of new features to automate your schema conversion; preserve your investment in existing scripts, reports, and applications; accelerate query performance; and potentially simplify your migrations from legacy data warehouses to Amazon Redshift.

Check out all the previous posts in this series:

Amazon Redshift is the cloud data warehouse of choice for tens of thousands of customers who use it to analyze exabytes of data to gain business insights. With Amazon Redshift, you can query data across your data warehouse, operational data stores, and data lake using standard SQL. You can also integrate other AWS services such as Amazon EMR, Amazon Athena, Amazon SageMaker, AWS Glue, AWS Lake Formation, and Amazon Kinesis to use all the analytic capabilities in the AWS Cloud.

Migrating a data warehouse can be a complex undertaking. Your legacy workload might rely on proprietary features that aren’t directly supported by a modern data warehouse like Amazon Redshift. For example, some data warehouses enforce primary key constraints, making a tradeoff with DML performance. Amazon Redshift lets you define a primary key but uses the constraint for query optimization purposes only. If you use Amazon Redshift, or are migrating to Amazon Redshift, you may need a mechanism to check that primary key constraints are not being violated by extract, transform, and load (ETL) processes.
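
To see why, consider the following minimal sketch (the table and values are hypothetical). Amazon Redshift records the primary key but doesn't enforce it, so both inserts succeed and the duplicate key value persists:

CREATE TABLE demo_pk (
  id INTEGER NOT NULL
, payload VARCHAR(20)
, PRIMARY KEY (id)
);

INSERT INTO demo_pk VALUES (1, 'first');
INSERT INTO demo_pk VALUES (1, 'duplicate');  -- accepted; no error is raised

SELECT COUNT(*) FROM demo_pk WHERE id = 1;    -- returns 2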

In this post, we describe two design patterns that you can use to accomplish this efficiently. We also show you how to use the AWS Schema Conversion Tool (AWS SCT) to automatically apply the design patterns to your SQL code.

We start by defining the semantics to address. Then we describe the design patterns and analyze their performance. We conclude by showing you how AWS SCT can automatically convert your code to enforce primary keys.

Primary keys

A primary key (PK) is a set of attributes such that no two rows can have the same value in the PK. For example, the following Teradata table has a two-attribute primary key (emp_id, div_id). Presumably, employee IDs are unique only within divisions.

CREATE TABLE testschema.emp ( 
  emp_id INTEGER NOT NULL
, name VARCHAR(12) NOT NULL
, div_id INTEGER NOT NULL
, job_title VARCHAR(12)
, salary DECIMAL(8,2)
, birthdate DATE NOT NULL
, CONSTRAINT pk_emp_id PRIMARY KEY (emp_id, div_id)
);

Most databases require that a primary key satisfy two criteria:

  • Uniqueness – The PK values are unique over all rows in the table
  • Not NULL – The PK attributes don’t accept NULL values

In this post, we focus on how to support the preceding primary key semantics. We describe two design patterns that you can use to develop SQL applications that respect primary keys in Amazon Redshift. Our focus is on INSERT-SELECT statements. Customers have told us that INSERT-SELECT operations comprise over 50% of the DML workload against tables with unique constraints. We briefly provide some guidance for other DML statements later in the post.

INSERT-SELECT

In the rest of this post, we dive deep into design patterns for INSERT-SELECT statements. We’re concerned with statements of the following form:

INSERT INTO <target table> SELECT * FROM <staging table>

The schema of the staging table is identical to the target table on a column-by-column basis.

A duplicate PK value can be introduced by two scenarios:

  • The staging table contains duplicates, meaning there are two or more rows in the staging data with the same PK value
  • There is a row x in the staging table and a row y in the target table that share the same PK value

Note that these situations are independent. It can be the case that the staging table contains duplicates, the staging table and target table share a duplicate, or both.

It’s imperative that the staging table doesn’t contain duplicate PK values. To ensure this, you can apply deduplication logic, as described in this post, to the staging table when it’s loaded. Alternatively, if your upstream source can guarantee that duplicates have been eliminated before delivery, you can eliminate this step.
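
For example, one common deduplication approach (a sketch only, assuming a staging table stg with columns pk_col and payload, loaded from a hypothetical raw landing table stg_raw) keeps a single row per key value using ROW_NUMBER:

-- Keep one arbitrary row per primary key value when loading the staging table
INSERT INTO stg (pk_col, payload)
SELECT pk_col, payload
FROM (
  SELECT pk_col
       , payload
       , ROW_NUMBER() OVER (PARTITION BY pk_col ORDER BY payload) AS rn
  FROM stg_raw
) dedup
WHERE rn = 1;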

Join

The first design pattern simply joins the staging and target tables. If any rows are returned, then the staging and target tables share a primary key value.

Suppose we have staging and target tables defined as the following:

CREATE TABLE stg ( 
  pk_col INTEGER 
, payload VARCHAR(100) 
, PRIMARY KEY (pk_col)
); 

CREATE TABLE tgt ( 
  pk_col INTEGER 
, payload VARCHAR(100) 
, PRIMARY KEY (pk_col)
);

We can use the following query to detect any duplicate primary key values:

SELECT count(1) 
FROM stg, tgt 
WHERE tgt.pk_col = stg.pk_col;

If the primary key has multiple columns, then the WHERE condition can be extended:

SELECT count(1)
FROM stg, tgt
WHERE
    tgt.pk_col1 = stg.pk_col1
AND tgt.pk_col2 = stg.pk_col2
AND …
;

There is one complication with this design pattern. If you allow NULL values in the primary key column, then you need to add special code to handle the NULL to NULL matching:

SELECT count(1)
FROM stg, tgt
WHERE
   (tgt.pk_col = stg.pk_col) 
OR (tgt.pk_col IS NULL AND stg.pk_col IS NULL)
;

This is the primary disadvantage of this design pattern—the code can be ugly and unintuitive. Furthermore, if you have a multicolumn primary key, then the code becomes even more complicated.

INTERSECT

The second design pattern that we describe uses the Amazon Redshift INTERSECT operation. INTERSECT is a set-based operation that returns the rows two queries have in common, which makes it a natural fit for checking whether the staging and target tables share any primary key values. You can check out UNION, INTERSECT, and EXCEPT in the Amazon Redshift documentation for more information.

We can determine if the staging and target table have duplicate PK values using the following query:

SELECT COUNT(1)
FROM (
  SELECT pk_col FROM stg
  INTERSECT
  SELECT pk_col FROM tgt
) a
;

If the primary key is composed of more than one column, you can simply modify the subqueries to include the additional columns:

SELECT COUNT(1)
FROM (
  SELECT pk_col1, pk_col2, …, pk_coln FROM stg
  INTERSECT
  SELECT pk_col1, pk_col2, …, pk_coln FROM tgt
) a
;

This pattern’s main advantage is its simplicity. The code is easier to understand and validate than the join design pattern. INTERSECT handles the NULL to NULL matching implicitly so you don’t have to write any special code for NULL values.
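
For example, the following minimal sketch (using the stg and tgt tables defined earlier and a hypothetical procedure name) shows one way to gate the load on the INTERSECT check inside an Amazon Redshift stored procedure:

CREATE OR REPLACE PROCEDURE load_tgt()
AS $BODY$
DECLARE
  dup_count INTEGER;
BEGIN
  -- Count primary key values shared by the staging and target tables
  SELECT COUNT(1) INTO dup_count
  FROM (
    SELECT pk_col FROM stg
    INTERSECT
    SELECT pk_col FROM tgt
  ) a;

  IF dup_count > 0 THEN
    RAISE EXCEPTION 'Primary key check failed: % duplicate value(s) found', dup_count;
  END IF;

  -- Safe to load: no shared primary key values were detected
  INSERT INTO tgt SELECT * FROM stg;
END;
$BODY$
LANGUAGE plpgsql;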

Performance

We tested both design patterns using an Amazon Redshift cluster consisting of 12 ra3.4xlarge nodes. Each node has 12 vCPUs and 96 GB of memory.

We created the staging and target tables with the same distribution and sort keys to minimize data redistribution at query time.

We generated the test data artificially using a custom program. The target dataset contained 1 billion rows of data. We ran 10 trials of both algorithms using staging datasets that ranged from 20–200 million rows, in 20-million-row increments.

In the following graph, the join design pattern is shown as a blue line. The intersect design pattern is shown as an orange line.

The performance of both algorithms is excellent: each detects duplicates in less than 1 second for all trials. The join algorithm outperforms the intersect algorithm, but both remain comfortably sub-second.

So, which algorithm should you choose? If you’re developing a new application on Amazon Redshift, the intersect algorithm is probably the better option. Its inherent NULL matching logic and simple, intuitive code make it well suited to new applications.

Conversely, if you need to squeeze every bit of performance from your application, then the join algorithm is your best option. In this case, you’ll have to trade complexity and perhaps extra effort in code review to gain the extra performance.

Automation

If you’re migrating an existing application to Amazon Redshift, you can use AWS SCT to automatically convert your SQL code.

Let’s see how this works. Suppose you have the following Teradata table. We use it as the target table in an INSERT-SELECT operation.

CREATE MULTISET TABLE testschema.test_pk_tgt (
  pk_col INTEGER NOT NULL
, payload VARCHAR(100) NOT NULL
, PRIMARY KEY (pk_col)
);

The staging table is identical to the target table, with the same columns and data types.

Next, we create a procedure to load the target table from the staging table. The procedure contains a single INSERT-SELECT statement:

REPLACE PROCEDURE testschema.insert_select()
BEGIN
INSERT INTO testschema.test_pk_tgt (pk_col, payload)
SELECT pk_col, payload FROM testschema.test_pk_stg;
END;

Now we use AWS SCT to convert the Teradata stored procedure to Amazon Redshift. First, open Settings, Conversion settings, and ensure that you’ve selected the option Automate Primary key / Unique constraint. If you don’t select this option, AWS SCT won’t add the PK check to the converted code.

Next, choose the stored procedure in the source database tree, right-click, and choose Convert schema.

AWS SCT converts the stored procedure (and the embedded INSERT-SELECT) using the join rewrite pattern. Because AWS SCT generates the conversion for you, it chooses the join pattern to take advantage of its better performance.

And that’s it; it’s that simple. If you’re migrating from Oracle or Teradata, you can use AWS SCT to convert your INSERT-SELECT statements now. We’ll be adding support for additional data warehouse engines soon.

In this post, we focused on INSERT-SELECT statements, but we’re also happy to report that AWS SCT can enforce primary key constraints for INSERT-VALUE and UPDATE statements. AWS SCT injects the appropriate SELECT statement into your code to determine if the INSERT-VALUE or UPDATE will create duplicate primary key values. Download the latest version of AWS SCT and give it a try!
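
The injected SQL isn’t shown here, but conceptually the check looks something like the following sketch for a single-row INSERT against the tgt table defined earlier (the literal values are illustrative only):

-- Count existing target rows that share the key of the row about to be inserted
SELECT COUNT(1) FROM tgt WHERE pk_col = 100;

-- If the count is 0, it's safe to run the original statement
INSERT INTO tgt (pk_col, payload) VALUES (100, 'new row');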

Conclusion

In this post, we showed you how to enforce primary keys in Amazon Redshift. If you’re implementing a new application in Amazon Redshift, you can use the design patterns in this post to enforce the constraints as part of your ETL stream.

Also, if you’re migrating from an Oracle or Teradata database, you can use AWS SCT to automatically convert your SQL to Amazon Redshift. AWS SCT will inject additional code into your SQL stream to enforce your unique key constraints, and thereby insulate your application code from any related changes.

We’re happy to share these updates to help you in your data warehouse migration projects. In the meantime, you can learn more about Amazon Redshift and AWS SCT. Happy migrating!


About the authors

Michael Soo is a Principal Database Engineer with the AWS Database Migration Service team. He builds products and services that help customers migrate their database workloads to the AWS cloud.

Illia Kravtsov is a Database Developer with the AWS Project Delta Migration team. He has 10+ years of experience in data warehouse development with Teradata and other MPP databases.

Accelerate your data warehouse migration to Amazon Redshift – Part 5

Post Syndicated from Michael Soo original https://aws.amazon.com/blogs/big-data/part-5-accelerate-your-data-warehouse-migration-to-amazon-redshift/

This is the fifth in a series of posts. We’re excited to share dozens of new features to automate your schema conversion; preserve your investment in existing scripts, reports, and applications; accelerate query performance; and potentially simplify your migrations from legacy data warehouses to Amazon Redshift.

Check out all the posts in this series:

Amazon Redshift is the leading cloud data warehouse. No other data warehouse makes it as easy to gain new insights from your data. With Amazon Redshift, you can query exabytes of data across your data warehouse, operational data stores, and data lake using standard SQL. You can also integrate other AWS services such as Amazon EMR, Amazon Athena, Amazon SageMaker, AWS Glue, AWS Lake Formation, and Amazon Kinesis to use all the analytic capabilities in the AWS Cloud.

Until now, migrating a data warehouse to AWS has been a complex undertaking, involving a significant amount of manual effort. You need to manually remediate syntax differences, inject code to replace proprietary features, and manually tune the performance of queries and reports on the new platform.

Legacy workloads may rely on non-ANSI, proprietary features that aren’t directly supported by modern databases like Amazon Redshift. For example, many Teradata applications use SET tables, which enforce full row uniqueness—there can’t be two rows in a table that are identical in all of their attribute values.

If you’re an Amazon Redshift user, you may want to implement SET semantics but can’t rely on a native database feature. You can use the design patterns in this post to emulate SET semantics in your SQL code. Alternatively, if you’re migrating a workload to Amazon Redshift, you can use the AWS Schema Conversion Tool (AWS SCT) to automatically apply the design patterns as part of your code conversion.

In this post, we describe the SQL design patterns, analyze their performance, and show how AWS SCT can automate this conversion as part of your data warehouse migration. Let’s start by understanding how SET tables behave in Teradata.

Teradata SET tables

At first glance, a SET table may seem similar to a table that has a primary key defined across all of its columns. However, there are some important semantic differences from traditional primary keys. Consider the following table definition in Teradata:

CREATE SET TABLE testschema.sales_by_month (
  sales_dt DATE
, amount DECIMAL(8,2)
);

We populate the table with four rows of data, as follows:

select * from testschema.sales_by_month order by sales_dt;

*** Query completed. 4 rows found. 2 columns returned. 
*** Total elapsed time was 1 second.

sales_dt amount
-------- ----------
22/01/01 100.00
22/01/02 200.00
22/01/03 300.00
22/01/04 400.00

Notice that we didn’t define a UNIQUE PRIMARY INDEX (similar to a primary key) on the table. Now, when we try to insert a new row into the table that is a duplicate of an existing row, the insert fails:

INSERT INTO testschema.sales_by_month values (20220101, 100);

 *** Failure 2802 Duplicate row error in testschema.sales_by_month.
 Statement# 1, Info =0 
 *** Total elapsed time was 1 second.

Similarly, if we try to update an existing row so that it becomes a duplicate of another row, the update fails:

UPDATE testschema.sales_by_month 
SET sales_dt = 20220101, amount = 100
WHERE sales_dt = 20220104 and amount = 400;

 *** Failure 2802 Duplicate row error in testschema.sales_by_month.
 Statement# 1, Info =0 
 *** Total elapsed time was 1 second.

In other words, simple INSERT-VALUE and UPDATE statements fail if they introduce duplicate rows into a Teradata SET table.

There is a notable exception to this rule. Consider the following staging table, which has the same attributes as the target table:

CREATE MULTISET TABLE testschema.sales_by_month_stg (
  sales_dt DATE
, amount DECIMAL(8,2)
);

The staging table is a MULTISET table and accepts duplicate rows. We load three rows into the staging table. The first row is a duplicate of a row in the target table. The second and third rows are duplicates of each other, but don’t duplicate any of the target rows.

select * from testschema.sales_by_month_stg;

 *** Query completed. 3 rows found. 2 columns returned. 
 *** Total elapsed time was 1 second.

sales_dt amount
-------- ----------
22/01/01 100.00
22/01/05 500.00
22/01/05 500.00

Now we successfully insert the staging data into the target table (which is a SET table):

INSERT INTO testschema.sales_by_month (sales_dt, amount)
SELECT sales_dt, amount FROM testschema.sales_by_month_stg;

 *** Insert completed. One row added. 
 *** Total elapsed time was 1 second.

If we examine the target table, we can see that a single row for (2022-01-05, 500) has been inserted, and the duplicate row for (2022-01-01, 100) has been discarded. Essentially, Teradata silently discards any duplicate rows when it performs an INSERT-SELECT statement. This includes duplicates that are in the staging table and duplicates that are shared between the staging and target tables.

select * from testschema.sales_by_month order by sales_dt;

 *** Query completed. 5 rows found. 2 columns returned. 
 *** Total elapsed time was 1 second.

sales_dt amount
-------- ----------
22/01/01 100.00
22/01/02 200.00
22/01/03 300.00
22/01/04 400.00
22/01/05 500.00

Essentially, SET tables behave differently depending on the type of operation being run. An INSERT-VALUE or UPDATE operation suffers a failure if it introduces a duplicate row into the target. An INSERT-SELECT operation doesn’t suffer a failure if the staging table contains a duplicate row, or a duplicate row is shared between the staging and target tables.

In this post, we don’t go into detail on how to convert INSERT-VALUE or UPDATE statements. These statements typically involve one or a few rows and are less impactful in terms of performance than INSERT-SELECT statements. For INSERT-VALUE or UPDATE statements, you can materialize the row (or rows) being created, and join that set to the target table to check for duplicates.
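
For example, for a single-row INSERT into the migrated sales_by_month table, the materialized row degenerates to a simple filter. The following sketch (with illustrative values) counts identical existing rows; run the INSERT only if the count is 0:

-- Full-row comparison against the target (add IS NULL handling for nullable columns)
SELECT COUNT(1)
FROM testschema.sales_by_month
WHERE sales_dt = DATE '2022-01-06'
  AND amount = 600.00;

-- Proceed only when the count above is 0
INSERT INTO testschema.sales_by_month VALUES (DATE '2022-01-06', 600.00);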

INSERT-SELECT

In the rest of this post, we analyze INSERT-SELECT statements carefully. Customers have told us that INSERT-SELECT operations can comprise up to 78% of the INSERT workload against SET tables. We’re concerned with statements of the following form:

INSERT INTO <target table> SELECT * FROM <staging table>

The schema of the staging table is identical to the target table on a column-by-column basis. As we mentioned earlier, a duplicate row can appear in two different circumstances:

  • The staging table is not set-unique, meaning that there are two or more full row duplicates in the staging data
  • There is a row x in the staging table and an identical row x in the target table

Because Amazon Redshift supports multiset table semantics, it’s possible that the staging table contains duplicates (the first circumstance we listed). Therefore, any automation must address both cases, because either can introduce a duplicate into an Amazon Redshift table.

Based on this analysis, we implemented the following algorithms:

  • MINUS – This implements the full set logic deduplication using SQL MINUS. MINUS works in all cases, including when the staging table isn’t set-unique and when the intersection of the staging table and target table is non-empty. MINUS also has the advantage that NULL values don’t require special comparison logic to overcome NULL to NULL comparisons. MINUS has the following syntax:
    INSERT INTO <target table> (<column list>)
    SELECT <column list> FROM <staging table> 
    MINUS
    SELECT <column list> FROM <target table>;

  • MINUS-MIN-MAX – This is an optimization on MINUS that incorporates a filter to limit the target table scan based on the values in the stage table. The min/max filters allow the query engine to skip large numbers of blocks during table scans. See Working with sort keys for more details. A concrete example follows this list.
    INSERT INTO <target table> (<column list>)
    SELECT <column list> FROM <staging table> 
    MINUS
    SELECT <column list> FROM <target table>
    WHERE <target table>.<sort key> >= (SELECT MIN(<sort key>) FROM <staging table>)
      AND <target table>.<sort key> <= (SELECT MAX(<sort key>) FROM <staging table>);
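
For example, here is a minimal sketch of the MINUS-MIN-MAX pattern applied to the sales_by_month tables from earlier, assuming the migrated Amazon Redshift tables use sales_dt as the sort key:

INSERT INTO testschema.sales_by_month (sales_dt, amount)
SELECT sales_dt, amount FROM testschema.sales_by_month_stg
MINUS
SELECT sales_dt, amount FROM testschema.sales_by_month
WHERE sales_dt >= (SELECT MIN(sales_dt) FROM testschema.sales_by_month_stg)
  AND sales_dt <= (SELECT MAX(sales_dt) FROM testschema.sales_by_month_stg);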

We also considered other algorithms, but we don’t recommend that you use them. For example, you can perform a GROUP BY to eliminate duplicates in the staging table, but this step is unnecessary if you use the MINUS operator. You can also perform a left (or right) outer join to find shared duplicates between the staging and target tables, but then additional logic is needed to account for NULL = NULL conditions.

Performance

We tested the MINUS and MINUS-MIN-MAX algorithms on Amazon Redshift. We ran the algorithms on two Amazon Redshift clusters. The first configuration consisted of 6 x ra3.4xlarge nodes. The second consisted of 12 x ra3.4xlarge nodes. Each node has 12 vCPUs and 96 GB of memory.

We created the stage and target tables with identical sort and distribution keys to minimize data movement. We loaded the same target dataset into both clusters. The target dataset consisted of 1.1 billion rows of data. We then created staging datasets that ranged from 20 million to 200 million rows, in 20 million row increments.

The following graph shows our results.

The test data was artificially generated, and some skew was present in the distribution key values. This manifests as the small deviations from linearity in the performance curves.

However, you can observe the performance increase afforded by the MINUS-MIN-MAX algorithm over the basic MINUS algorithm (compare the orange lines to each other, or the blue lines to each other). If you’re implementing SET tables in Amazon Redshift, we recommend using MINUS-MIN-MAX because it provides a happy convergence of simple, readable code and good performance.

Automation

All Amazon Redshift tables allow duplicate rows; in Teradata terms, they behave like MULTISET tables. If you’re converting a Teradata workload to run on Amazon Redshift, you’ll need to enforce SET semantics outside of the database.

We’re happy to share that AWS SCT will automatically convert your SQL code that operates against SET tables. AWS SCT will rewrite INSERT-SELECT statements that load SET tables to incorporate the rewrite patterns we described above.

Let’s see how this works. Suppose you have the following target table definition in Teradata:

CREATE SET TABLE testschema.test_fact (
  id bigint NOT NULL
, se_sporting_event_id INTEGER NOT NULL
, se_sport_type_name VARCHAR(15) NOT NULL
, se_home_team_id INTEGER NOT NULL
, se_away_team_id INTEGER NOT NULL
, se_location_id INTEGER NOT NULL
, se_start_date_time DATE NOT NULL
, se_sold_out INTEGER DEFAULT 0 NOT NULL
, stype_sport_type_name varchar(15) NOT NULL
, stype_short_name varchar(10) NOT NULL
, stype_long_name varchar(60) NOT NULL
, stype_description varchar(120)
, sd_sport_type_name varchar(15) NOT NULL
, sd_sport_league_short_name varchar(10) NOT NULL
, sd_short_name varchar(10) NOT NULL
, sd_long_name varchar(60)
, sd_description varchar(120)
, sht_id INTEGER NOT NULL
, sht_name varchar(30) NOT NULL
, sht_abbreviated_name varchar(10)
, sht_home_field_id INTEGER 
, sht_sport_type_name varchar(15) NOT NULL
, sht_sport_league_short_name varchar(10) NOT NULL
, sht_sport_division_short_name varchar(10)
, sat_id INTEGER NOT NULL
, sat_name varchar(30) NOT NULL
, sat_abbreviated_name varchar(10)
, sat_home_field_id INTEGER 
, sat_sport_type_name varchar(15) NOT NULL
, sat_sport_league_short_name varchar(10) NOT NULL
, sat_sport_division_short_name varchar(10)
, sl_id INTEGER NOT NULL
, sl_name varchar(60) NOT NULL
, sl_city varchar(60) NOT NULL
, sl_seating_capacity INTEGER
, sl_levels INTEGER
, sl_sections INTEGER
, seat_sport_location_id INTEGER
, seat_seat_level INTEGER
, seat_seat_section VARCHAR(15)
, seat_seat_row VARCHAR(10)
, seat_seat VARCHAR(10)
, seat_seat_type VARCHAR(15)
, pb_id INTEGER NOT NULL
, pb_full_name varchar(60) NOT NULL
, pb_last_name varchar(30)
, pb_first_name varchar(30)
, ps_id INTEGER NOT NULL
, ps_full_name varchar(60) NOT NULL
, ps_last_name varchar(30)
, ps_first_name varchar(30)
)
PRIMARY INDEX(id)
;

The stage table is identical to the target table, except that it’s created as a MULTISET table in Teradata.

Next, we create a procedure to load the fact table from the stage table. The procedure contains a single INSERT-SELECT statement:

REPLACE PROCEDURE testschema.insert_select()  
BEGIN
  INSERT INTO testschema.test_fact 
  SELECT * FROM testschema.test_stg;
END;

Now we use AWS SCT to convert the Teradata stored procedure to Amazon Redshift. First, select the stored procedure in the source database tree, then right-click and choose Convert schema.

AWS SCT converts the stored procedure (and embedded INSERT-SELECT) using the MINUS-MIN-MAX rewrite pattern.

And that’s it! Presently, AWS SCT only performs this rewrite for INSERT-SELECT statements because those statements are heavily used by ETL workloads and have the most impact on performance. Although the example we used was embedded in a stored procedure, you can also use AWS SCT to convert the same statements if they’re in BTEQ scripts, macros, or application programs. Download the latest version of AWS SCT and give it a try!

Conclusion

In this post, we showed how to implement SET table semantics in Amazon Redshift. You can use the described design patterns to develop new applications that require SET semantics. Or, if you’re converting an existing Teradata workload, you can use AWS SCT to automatically convert your INSERT-SELECT statements so that they preserve the SET table semantics.

We’ll be back soon with the next installment in this series. Check back for more information on automating your migrations from Teradata to Amazon Redshift. In the meantime, you can learn more about Amazon Redshift and AWS SCT. Happy migrating!


About the Authors

Michael Soo is a Principal Database Engineer with the AWS Database Migration Service team. He builds products and services that help customers migrate their database workloads to the AWS cloud.

Po Hong, PhD, is a Principal Data Architect of the Modern Data Architecture Global Specialty Practice (GSP), AWS Professional Services.  He is passionate about helping customers to adopt innovative solutions and migrate from large scale MPP data warehouses to the AWS modern data architecture.

Accelerate your data warehouse migration to Amazon Redshift – Part 4

Post Syndicated from Michael Soo original https://aws.amazon.com/blogs/big-data/part-4-accelerate-your-data-warehouse-migration-to-amazon-redshift/

This is the fourth in a series of posts. We’re excited to share dozens of new features to automate your schema conversion; preserve your investment in existing scripts, reports, and applications; accelerate query performance; and potentially reduce your overall cost to migrate to Amazon Redshift.

Check out the previous posts in the series:

Amazon Redshift is the leading cloud data warehouse. No other data warehouse makes it as easy to gain new insights from your data. With Amazon Redshift, you can query exabytes of data across your data warehouse, operational data stores, and data lake using standard SQL. You can also integrate other AWS services like Amazon EMR, Amazon Athena, Amazon SageMaker, AWS Glue, AWS Lake Formation, and Amazon Kinesis to use all the analytic capabilities in the AWS Cloud.

Many customers have asked for help migrating from self-managed data warehouse engines, like Teradata, to Amazon Redshift. In these cases, you typically have terabytes or petabytes of data, a heavy reliance on proprietary features, and thousands of extract, transform, and load (ETL) processes and reports built over a few years (or decades) of use.

Until now, migrating a data warehouse to AWS was complex and involved a significant amount of manual effort. You needed to manually remediate syntax differences, inject code to replace proprietary features, and manually tune the performance of queries and reports on the new platform.

For example, you may have a significant investment in BTEQ (Basic Teradata Query) scripting for database automation, ETL, or other tasks. Previously, you needed to manually recode these scripts as part of the conversion process to Amazon Redshift. Together with supporting infrastructure (job scheduling, job logging, error handling), this was a significant impediment to migration.

Today, we’re happy to share with you a new, purpose-built command line tool called Amazon Redshift RSQL. Some of the key features added in Amazon Redshift RSQL are enhanced flow control syntax and single sign-on support. You can also describe properties or attributes of external tables in an AWS Glue catalog or Apache Hive Metastore, external databases in Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL-Compatible Edition, and tables shared using Amazon Redshift data sharing.

We have also enhanced the AWS Schema Conversion Tool (AWS SCT) to automatically convert BTEQ scripts to Amazon Redshift RSQL scripts. The converted scripts run on Amazon Redshift with little to no changes.

In this post, we describe some of the features of Amazon Redshift RSQL, show example scripts, and demonstrate how to convert BTEQ scripts into Amazon Redshift RSQL scripts.

Amazon Redshift RSQL features

If you currently use Amazon Redshift, you may already be running scripts on Amazon Redshift using the PSQL command line client. These scripts run under Amazon Redshift RSQL with no modification. You can think of Amazon Redshift RSQL as an Amazon Redshift-native version of PSQL.

In addition, we have designed Amazon Redshift RSQL to make it easy to transition BTEQ scripts to the tool. The following are some examples of Amazon Redshift RSQL commands that make this possible. (For full details, see Amazon Redshift RSQL.)

  • \EXIT – This command is an extension of the PSQL \quit command. Like \quit, \EXIT terminates the execution of Amazon Redshift RSQL. In addition, you can specify an optional exit code with \EXIT.
  • \LOGON – This command creates a new connection to a database. \LOGON is an alias for the PSQL \connect command. You can specify connection parameters using positional syntax or as a connection string.
  • \REMARK – This command prints the specified string to the output. \REMARK extends the PSQL \echo command by adding the ability to break the output over multiple lines, using // as a line break.
  • \RUN – This command runs the Amazon Redshift RSQL script contained in the specified file. \RUN extends the PSQL \i command by adding an option to skip any number of lines in the specified file.
  • \OS – This is an alias for the PSQL \! command. \OS runs the operating system command that is passed as a parameter. Control returns to Amazon Redshift RSQL after running the OS command.
  • \LABEL – This is a new command for Amazon Redshift RSQL. \LABEL establishes an entry point for execution, as the target for a \GOTO command.
  • \GOTO – This command is a new command for Amazon Redshift RSQL. It’s used in conjunction with the \LABEL command. \GOTO skips all intervening commands and resumes processing at the specified \LABEL. The \LABEL must be a forward reference. You can’t jump to a \LABEL that lexically precedes the \GOTO.
  • \IF (\ELSEIF, \ELSE, \ENDIF) – This command is an extension of the PSQL \if (\elif, \else, \endif) command. \IF and \ELSEIF support arbitrary Boolean expressions including AND, OR, and NOT conditions. You can use the \GOTO command within a \IF block to control conditional execution.
  • \EXPORT – This command specifies the name of an export file that Amazon Redshift RSQL uses to store database information returned by a subsequent SQL SELECT statement.

We’ve also added some variables to Amazon Redshift RSQL to support converting your BTEQ scripts.

  • :ACTIVITYCOUNT – This variable returns the number of rows affected by the last submitted request. For a data-returning request, this is the number of rows returned to Amazon Redshift RSQL from the database. ACTIVITYCOUNT is similar to the PSQL variable ROW_COUNT; however, ROW_COUNT doesn’t report the affected-row count for SELECT, COPY, or UNLOAD statements.
  • :ERRORCODE – This variable contains the return code for the last submitted request to the database. A zero signifies the request completed without error. The ERRORCODE variable is an alias for the variable SQLSTATE.
  • :ERRORLEVEL – This variable assigns severity levels to errors. Use the severity levels to determine a course of action based on the severity of the errors that Amazon Redshift RSQL encounters.
  • :MAXERROR – This variable designates a maximum error severity level beyond which Amazon Redshift RSQL terminates job processing.

An example Amazon Redshift RSQL script

Let’s look at an example. First, we log in to an Amazon Redshift database using Amazon Redshift RSQL. You specify the connection information on the command line as shown in the following code. The port and database are optional and default to 5439 and dev respectively if not provided.

$ rsql -h testcluster1.<example>.redshift.amazonaws.com -U testuser1 -d myredshift -p 5439
Password for user testuser1: 
DSN-less Connected
DBMS Name: Amazon Redshift
Driver Name: Amazon Redshift ODBC Driver
Driver Version: 1.4.34.1000
Rsql Version: 1.0.1
Redshift Version: 1.0.29551
Type "help" for help.

(testcluster1) testuser1@myredshift=#

If you choose to change the connection from within the client, you can use the \LOGON command:

(testcluster1) testuser1@myredshift=# \logon testschema testuser2 testcluster2.<example>.redshift.amazonaws.com
Password for user testuser2: 
DBMS Name: Amazon Redshift
Driver Name: Amazon Redshift ODBC Driver
Driver Version: 1.4.34.1000
Rsql Version: 1.0.1

Now, let’s run a simple script that runs a SELECT statement, checks for output, then branches depending on whether data was returned or not.

First, we inspect the script by using the \OS command to print the file to the screen:

(testcluster1) testuser1@myredshift=# \os cat activitycount.sql
select * from testschema.employees;

\if :ACTIVITYCOUNT = 0
  \remark '****No data found****'
  \goto LETSQUIT
\else
  \remark '****Data found****'
  \goto LETSDOSOMETHING
\endif

\label LETSQUIT
\remark '****We are quitting****'
\exit 0

\label LETSDOSOMETHING
\remark '****We are doing it****'
\exit 0

The script prints one of two messages depending on whether data is returned by the SELECT statement or not.

Now, let’s run the script using the \RUN command. The SELECT statement returns 11 rows of data. The script prints a “data found” message, and jumps to the LETSDOSOMETHING label.

(testcluster1) testuser1@myredshift=# \run file=activitycount.sql
  id  | name    | manager_id | last_promo_date
 -----+---------+------------+-----------------
 112  | Britney | 201        | 2041-03-30
 101  | Bob     | 100        |
 110  | Mark    | 201        |
 106  | Jeff    | 102        |
 201  | Ana     | 104        |
 104  | Chris   | 103        |
 111  | Phyllis | 103        |
 102  | Renee   | 101        | 2021-01-01
 100  | Caitlin |            | 2021-01-01
 105  | David   | 103        | 2021-01-01
 103  | John    | 101        |
 (11 rows)

****Data found****
\label LETSQUIT ignored
\label LETSDOSOMETHING processed
****We are doing it****

That’s Amazon Redshift RSQL in a nutshell. If you’re developing new scripts for Amazon Redshift, we encourage you to use Amazon Redshift RSQL and take advantage of its additional capabilities. If you have existing PSQL scripts, you can run those scripts using Amazon Redshift RSQL with no changes.

Use AWS SCT to automate your BTEQ conversions

If you’re a Teradata developer or DBA, you’ve probably built a library of BTEQ scripts that you use to perform administrative work, load or transform data, or to generate datasets and reports. If you’re contemplating a migration to Amazon Redshift, you’ll want to preserve the investment you made in creating those scripts.

AWS SCT has long had the ability to convert BTEQ to AWS Glue. Now, you can also use AWS SCT to automatically convert BTEQ scripts to Amazon Redshift RSQL. AWS SCT supports all the new Amazon Redshift RSQL features like conditional execution, escape to the shell, and branching.

Let’s see how it works. We create two Teradata tables, product_stg and product. Then we create a simple ETL script that uses a MERGE statement to update the product table using data from the product_stg table:

CREATE TABLE testschema.product_stg (
  prod_id INTEGER
, description VARCHAR(100) CHARACTER SET LATIN
, category_id INTEGER)
UNIQUE PRIMARY INDEX ( prod_id );

CREATE TABLE testschema.product (
  prod_id INTEGER
, description VARCHAR(100) CHARACTER SET LATIN
, category_id INTEGER)
UNIQUE PRIMARY INDEX ( prod_id );

We embed the MERGE statement inside a BTEQ script. The script tests error conditions and branches accordingly:

.SET WIDTH 100

SELECT COUNT(*) 
FROM testschema.product_stg 
HAVING COUNT(*) > 0;

.IF ACTIVITYCOUNT = 0 then .GOTO NODATA;

MERGE INTO testschema.product tgt 
USING testschema.product_stg stg 
   ON tgt.prod_id = stg.prod_id
WHEN MATCHED THEN UPDATE SET
      description = stg.description
    , category_id = stg.category_id
WHEN NOT MATCHED THEN INSERT VALUES (
  stg.prod_id
, stg.description
, stg.category_id
);

.GOTO ALLDONE;

.LABEL NODATA

.REMARK 'Staging table is empty. Stopping'

.LABEL ALLDONE 

.QUIT;               

Now, let’s use AWS SCT to convert the script to Amazon Redshift RSQL. AWS SCT converts the BTEQ commands to their Amazon Redshift RSQL and Amazon Redshift equivalents. The converted script is as follows:

\rset width 100
SELECT
    COUNT(*)
    FROM testschema.product_stg
    HAVING COUNT(*) > 0;
\if :ACTIVITYCOUNT = 0
    \goto NODATA
\endif
UPDATE testschema.product
SET description = stg.description, category_id = stg.category_id
FROM testschema.product_stg AS stg
JOIN testschema.product AS tgt
    ON tgt.prod_id = stg.prod_id;
INSERT INTO testschema.product
SELECT
    stg.prod_id, stg.description, stg.category_id
    FROM testschema.product_stg AS stg
    WHERE NOT EXISTS (
        SELECT 1
        FROM testschema.product AS tgt
        WHERE tgt.prod_id = stg.prod_id);
\goto ALLDONE
\label NODATA
\remark 'Staging table is empty. Stopping'
\label ALLDONE
\quit :ERRORLEVEL

The following are the main points of interest in the conversion:

  • The BTEQ .SET WIDTH command is converted to the Amazon Redshift RSQL \RSET WIDTH command.
  • The BTEQ ACTIVITYCOUNT variable is converted to the Amazon Redshift RSQL ACTIVITYCOUNT variable.
  • The BTEQ MERGE statement is converted into an UPDATE followed by an INSERT statement. Currently, Amazon Redshift doesn’t support a native MERGE statement.
  • The BTEQ .LABEL and .GOTO statements are translated to their Amazon Redshift RSQL equivalents \LABEL and \GOTO.

Let’s look at the actual process of using AWS SCT to convert a BTEQ script.

After starting AWS SCT, you create a Teradata migration project and navigate to the BTEQ scripts node in the source tree window pane. Right-click and choose Load scripts.

Then select the folder that contains your BTEQ scripts. The folder appears in the source tree. Open it and navigate to the script you want to convert. In our case, the script is contained in the file merge.sql. Right-click on the file, choose Convert script, then choose Convert to RSQL. You can inspect the converted script in the bottom middle pane. When you’re ready to save the script to a file, do that from the target tree on the right side.

If you have many BTEQ scripts, you can convert an entire folder at once by selecting the folder instead of an individual file.

Convert shell scripts

Many applications run BTEQ commands from within shell scripts. For example, you may have a shell script that redirects log output and controls login credentials, as in the following:

bteq <<EOF >> ${LOG} 2>&1

.run file $LOGON;

SELECT COUNT(*) 
FROM testschema.product_stg 
HAVING COUNT(*) > 0;
…

EOF

If you use shell scripts to run BTEQ, we’re happy to share that AWS SCT can help you convert those scripts. AWS SCT supports bash scripts now, and we’ll add additional shell dialects in the future.

The process to convert shell scripts is very similar to BTEQ conversion. You select a folder that contains your scripts by navigating to the Shell node in the source tree and then choosing Load scripts.

After the folder is loaded, you can convert one (or more) scripts by selecting them and choosing Convert script.

As before, the converted script appears in the UI, and you can save it from the target tree on the right side of the page.

Conclusion

We’re happy to share Amazon Redshift RSQL and expect it to be a big hit with customers. If you’re contemplating a migration from Teradata to Amazon Redshift, Amazon Redshift RSQL and AWS SCT can simplify the conversion of your existing Teradata scripts and help preserve your investment in existing reports, applications, and ETL.

All of the features described in this post are available for you to use today. You can download Amazon Redshift RSQL and AWS SCT and give it a try.

We’ll be back soon with the next installment in this series. Check back for more information on automating your migrations from Teradata to Amazon Redshift. In the meantime, you can learn more about Amazon Redshift, Amazon Redshift RSQL, and AWS SCT. Happy migrating!


About the Authors

Michael Soo is a Senior Database Engineer with the AWS Database Migration Service team. He builds products and services that help customers migrate their database workloads to the AWS cloud.

Po Hong, PhD, is a Principal Data Architect of Lake House Global Specialty Practice,
AWS Professional Services. He is passionate about supporting customers to adopt innovative solutions to reduce time to insight. Po is specialized in migrating large scale MPP on-premises data warehouses to the AWS Lake House architecture.

Entong Shen is a Software Development Manager of Amazon Redshift. He has been working on MPP databases for over 9 years and has focused on query optimization, statistics, and migration-related SQL language features such as stored procedures and data types.

Adekunle Adedotun is a Sr. Database Engineer with Amazon Redshift service. He has been working on MPP databases for 6 years with a focus on performance tuning. He also provides guidance to the development team for new and existing service features.

Asia Khytun is a Software Development Manager for the AWS Schema Conversion Tool. She has 10+ years of software development experience in C, C++, and Java.

Illia Kravtsov is a Database Developer with the AWS Project Delta Migration team. He has 10+ years of experience in data warehouse development with Teradata and other MPP databases.

Accelerate your data warehouse migration to Amazon Redshift – Part 3

Post Syndicated from Michael Soo original https://aws.amazon.com/blogs/big-data/part-3-accelerate-your-data-warehouse-migration-to-amazon-redshift/

This is the third post in a multi-part series. We’re excited to share dozens of new features to automate your schema conversion; preserve your investment in existing scripts, reports, and applications; accelerate query performance; and reduce your overall cost to migrate to Amazon Redshift.

Check out the previous posts in the series:

Amazon Redshift is the leading cloud data warehouse. No other data warehouse makes it as easy to gain new insights from your data. With Amazon Redshift, you can query exabytes of data across your data warehouse, operational data stores, and data lake using standard SQL. You can also integrate other services such as Amazon EMR, Amazon Athena, and Amazon SageMaker to use all the analytic capabilities in the AWS Cloud.

Many customers have asked for help migrating from self-managed data warehouse engines, like Teradata, to Amazon Redshift. In these cases, you may have terabytes (or petabytes) of historical data, a heavy reliance on proprietary features, and thousands of extract, transform, and load (ETL) processes and reports built over years (or decades) of use.

Until now, migrating a Teradata data warehouse to AWS was complex and involved a significant amount of manual effort.

Today, we’re happy to share recent enhancements to Amazon Redshift and the AWS Schema Conversion Tool (AWS SCT) that make it easier to automate your Teradata to Amazon Redshift migrations.

In this post, we introduce new automation for merge statements, a native function to support ASCII character conversion, enhanced error checking for string to date conversion, enhanced support for Teradata cursors and identity columns, automation for ANY and SOME predicates, automation for RESET WHEN clauses, automation for two proprietary Teradata functions (TD_NORMALIZE_OVERLAP and TD_UNPIVOT), and automation to support analytic functions (QUANTILE and QUALIFY).

Merge statement

As its name implies, the merge statement takes an input set and merges it into a target table. If an input row already exists in the target table (a row in the target table has the same primary key value), then the target row is updated. If there is no matching target row, the input row is inserted into the table.

Until now, if you used merge statements in your workload, you were forced to manually rewrite the merge statement to run on Amazon Redshift. Now, we’re happy to share that AWS SCT automates this conversion for you. AWS SCT decomposes a merge statement into an update on existing records followed by an insert for new records.

Let’s look at an example. We create two tables in Teradata: a target table, employee, and a delta table, employee_delta, where we stage the input rows:

CREATE TABLE testschema.employee(
  id INTEGER
, name VARCHAR(20)
, manager INTEGER)
UNIQUE PRIMARY INDEX (id)
;

CREATE TABLE testschema.employee_delta (
  id INTEGER
, name VARCHAR(20)
, manager INTEGER)
UNIQUE PRIMARY INDEX(id)
;

Now we create a Teradata merge statement that updates a row if it exists in the target, otherwise it inserts the new row. We embed this merge statement into a macro so we can show you the conversion process later.

REPLACE MACRO testschema.merge_employees AS (
  MERGE INTO testschema.employee tgt
  USING testschema.employee_delta delta
    ON delta.id = tgt.id
  WHEN MATCHED THEN
    UPDATE SET name = delta.name, manager = delta.manager
  WHEN NOT MATCHED THEN
    INSERT VALUES (delta.id, delta.name, delta.manager);
);

Now we use AWS SCT to convert the macro. (See Accelerate your data warehouse migration to Amazon Redshift – Part 1 for details on macro conversion.) AWS SCT creates a stored procedure that contains an update (to implement the WHEN MATCHED condition) and an insert (to implement the WHEN NOT MATCHED condition).

CREATE OR REPLACE PROCEDURE testschema.merge_employees()
AS $BODY$
BEGIN
    UPDATE testschema.employee
    SET name = "delta".name, manager = "delta".manager
    FROM testschema.employee_delta AS delta JOIN testschema.employee AS tgt
        ON "delta".id = tgt.id;
      
    INSERT INTO testschema.employee
    SELECT
      "delta".id
    , "delta".name
    , "delta".manager
    FROM testschema.employee_delta AS delta
    WHERE NOT EXISTS (
      SELECT 1
      FROM testschema.employee AS tgt
      WHERE "delta".id = tgt.id
    );
END;
$BODY$
LANGUAGE plpgsql;

This example showed how to use merge automation for macros, but you can convert merge statements in any application context: stored procedures, BTEQ scripts, Java code, and more. Download the latest version of AWS SCT and try it out.

ASCII() function

The ASCII function takes as input a string and returns the ASCII code, or more precisely, the UNICODE code point, of the first character in the string. Previously, Amazon Redshift supported ASCII as a leader-node only function, which prevented its use with user-defined tables.

We’re happy to share that the ASCII function is now available on Amazon Redshift compute nodes and can be used with user-defined tables. In the following code, we create a table with some string data:

CREATE TABLE testschema.char_table (
  id INTEGER
, char_col  CHAR(10)
, varchar_col VARCHAR(10)
);

INSERT INTO testschema.char_table VALUES (1, 'Hello', 'world');

Now you can use the ASCII function on the string columns:

# SELECT id, char_col, ascii(char_col), varchar_col, ascii(varchar_col) FROM testschema.char_table;

 id |  char_col  | ascii | varchar_col | ascii 
  1 | Hello      |    72 | world       |   119

Lastly, if your application code uses the ASCII function, AWS SCT automatically converts any such function calls to Amazon Redshift.

The ASCII feature is available now—try it out in your own cluster.

TO_DATE() function

The TO_DATE function converts a character string into a DATE value. A quirk of this function is that it can accept a string value that isn’t a valid date and translate it into a valid date.

For example, consider the string 2021-06-31. This isn’t a valid date because the month of June has only 30 days. However, the TO_DATE function accepts this string and returns the “31st” day of June (July 1):

# SELECT to_date('2021-06-31', 'YYYY-MM-DD');
 to_date 
 2021-07-01
(1 row)

Customers have asked for strict input checking for TO_DATE, and we’re happy to share this new capability. Now, you can include a Boolean value in the function call that turns on strict checking:

# SELECT to_date('2021-06-31', 'YYYY-MM-DD', TRUE);
ERROR: date/time field date value out of range: 2021-6-31

You can turn off strict checking explicitly as well:

# SELECT to_date('2021-06-31', 'YYYY-MM-DD', FALSE);
 to_date 
 2021-07-01
(1 row)

Also, the Boolean value is optional. If you don’t include it, strict checking is turned off, and you see the same behavior as before the feature was launched.

You can learn more about the TO_DATE function and try out strict date checking in Amazon Redshift now.

CURSOR result sets

A cursor is a programming language construct that applications use to manipulate a result set one row at a time. Cursors are more relevant for OLTP applications, but some legacy applications built on data warehouses also use them.

Teradata provides a diverse set of cursor configurations. Amazon Redshift supports a more streamlined set of cursor features.

Based on customer feedback, we’ve added automation to support Teradata WITH RETURN cursors. These types of cursors are opened within stored procedures and returned to the caller for processing of the result set. AWS SCT will convert a WITH RETURN cursor to an Amazon Redshift REFCURSOR.

For example, consider the following procedure, which contains a WITH RETURN cursor. The procedure opens the cursor and returns the result to the caller as a DYNAMIC RESULT SET:

REPLACE PROCEDURE testschema.employee_cursor (IN p_mgrid INTEGER) DYNAMIC RESULT SETS 1
BEGIN
   DECLARE result_set CURSOR WITH RETURN ONLY FOR 
     SELECT id, name, manager 
     FROM testschema.employee
     WHERE manager = to_char(p_mgrid); 
   OPEN result_set;
END;

AWS SCT converts the procedure as follows. An additional parameter is added to the procedure signature to pass the REFCURSOR:

CREATE OR REPLACE PROCEDURE testschema.employee_cursor(par_p_mgrid IN INTEGER, dynamic_return_cursor INOUT refcursor)
AS $BODY$
DECLARE
BEGIN
    OPEN dynamic_return_cursor FOR
    SELECT
        id, name, manager
        FROM testschema.employee
        WHERE manager = to_char(par_p_mgrid, '99999');
END;
$BODY$
LANGUAGE plpgsql;

IDENTITY columns

Teradata supports several non-ANSI compliant features for IDENTITY columns. We have enhanced AWS SCT to automatically convert these features to Amazon Redshift, whenever possible.

Specifically, AWS SCT now converts the Teradata START WITH and INCREMENT BY clauses to the Amazon Redshift SEED and STEP clauses, respectively. For example, consider the following Teradata table:

CREATE TABLE testschema.identity_table (
  a2 BIGINT GENERATED ALWAYS AS IDENTITY (
    START WITH 1 
    INCREMENT BY 20
  )
);

The GENERATED ALWAYS clause indicates that the column is always populated automatically—a value can’t be explicitly inserted or updated into the column. The START WITH clause defines the first value to be inserted into the column, and the INCREMENT BY clause defines the next value to insert into the column.

When you convert this table using AWS SCT, the following Amazon Redshift DDL is produced. Notice that the START WITH and INCREMENT BY values are preserved in the target syntax:

CREATE TABLE IF NOT EXISTS testschema.identity_table (
  a2 BIGINT IDENTITY(1, 20)
)
DISTSTYLE KEY
DISTKEY
(a2)
SORTKEY
(a2);

Also, by default, an IDENTITY column in Amazon Redshift only contains auto-generated values, so that the GENERATED ALWAYS property in Teradata is preserved:

# INSERT INTO testschema.identity_table VALUES (100);
ERROR:  cannot set an identity column to a value

IDENTITY columns in Teradata can also be specified as GENERATED BY DEFAULT. In this case, a value can be explicitly defined in an INSERT statement. If no value is specified, the column is filled with an auto-generated value like normal. Before, AWS SCT didn’t support conversion for GENERATED BY DEFAULT columns. Now, we’re happy to share that AWS SCT automatically converts such columns for you.

For example, the following table contains an IDENTITY column that is GENERATED BY DEFAULT:

CREATE TABLE testschema.identity_by_default (
  a1 BIGINT GENERATED BY DEFAULT AS IDENTITY (
     START WITH 1 
     INCREMENT BY 20 
  )
)
PRIMARY INDEX (a1);

The IDENTITY column is converted by AWS SCT as follows. The converted column uses the Amazon Redshift GENERATED BY DEFAULT clause:

CREATE TABLE testschema.identity_by_default (
  a1 BIGINT GENERATED BY DEFAULT AS IDENTITY(1,20) DISTKEY
)
DISTSTYLE KEY
SORTKEY (a1);

There is one additional syntax issue that requires attention. In Teradata, an auto-generated value is inserted when NULL is specified for the column value:

INSERT INTO identity_by_default VALUES (null);

Amazon Redshift uses a different syntax for the same purpose. Here, you include the keyword DEFAULT in the values list to indicate that the column should be auto-generated:

INSERT INTO testschema.identity_by_default VALUES (default);

We’re happy to share that AWS SCT automatically converts the Teradata syntax for INSERT statements like the preceding example. For example, consider the following Teradata macro:

REPLACE MACRO testschema.insert_identity_by_default AS (
  INSERT INTO testschema.identity_by_default VALUES (NULL);
);

AWS SCT removes the NULL and replaces it with DEFAULT:

CREATE OR REPLACE PROCEDURE testschema.insert_identity_by_default() LANGUAGE plpgsql
AS $$
BEGIN
  INSERT INTO testschema.identity_by_default VALUES (DEFAULT);
END;
$$;

IDENTITY column automation is available now in AWS SCT. You can download the latest version and try it out.

ANY and SOME filters with inequality predicates

The ANY and SOME filters determine if a predicate applies to one or more values in a list. For example, in Teradata, you can use <> ANY to find all employees who don’t work for a certain manager:

REPLACE MACRO testschema.not_in_103 AS (
  SELECT *
  FROM testschema.employee 
  WHERE manager <> ANY (103)
;
);

Of course, you can rewrite this query using a simple not equal filter, but you often see queries from third-party SQL generators that follow this pattern.

Amazon Redshift doesn’t support this syntax natively. Before, any queries using this syntax had to be manually converted. Now, we’re happy to share that AWS SCT automatically converts ANY and SOME clauses with inequality predicates. The macro above is converted to a stored procedure as follows.

CREATE OR REPLACE PROCEDURE testschema.not_in_103(macro_out INOUT refcursor)
AS $BODY$
BEGIN
    OPEN macro_out FOR
    SELECT *
    FROM testschema.employee
    WHERE ((manager <> 103));
END;
$BODY$
LANGUAGE plpgsql;

If the values list following the ANY contains two or more values, AWS SCT will convert this to a series of OR conditions, one for each element in the list.
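
For example, with an illustrative two-value list, the rewrite takes roughly this shape:

-- Teradata
SELECT * FROM testschema.employee WHERE manager <> ANY (103, 104);

-- Amazon Redshift equivalent, expressed as a series of OR conditions
SELECT * FROM testschema.employee WHERE ((manager <> 103) OR (manager <> 104));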

ANY/SOME filter conversion is available now in AWS SCT. You can try it out in the latest version of the application.

Analytic functions with RESET WHEN

RESET WHEN is a Teradata feature used in SQL analytical window functions. It’s an extension to the ANSI SQL standard. RESET WHEN determines the partition over which a SQL window function operates based on a specified condition. If the condition evaluates to true, a new dynamic sub-partition is created inside the existing window partition.

For example, the following view uses RESET WHEN to compute a running total by store. The running total accumulates as long as sales increase month over month. If sales drop from one month to the next, the running total resets.

CREATE TABLE testschema.sales (
  store_id INTEGER
, month_no INTEGER
, sales_amount DECIMAL(9,2)
)
;

REPLACE VIEW testschema.running_total (
  store_id
, month_no
, sales_amount
, cume_sales_amount
)
AS
SELECT 
  store_id
, month_no
, sales_amount
, SUM(sales_amount) OVER (
     PARTITION BY store_id 
     ORDER BY month_no
     RESET WHEN sales_amount < SUM(sales_amount) OVER (
       PARTITION BY store_id
       ORDER BY month_no
       ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
     )
     ROWS UNBOUNDED PRECEDING 
  )
FROM testschema.sales;

To demonstrate, we insert some test data into the table:

INSERT INTO testschema.sales VALUES (1001, 1, 35000.00);
INSERT INTO testschema.sales VALUES (1001, 2, 40000.00);
INSERT INTO testschema.sales VALUES (1001, 3, 45000.00);
INSERT INTO testschema.sales VALUES (1001, 4, 25000.00);
INSERT INTO testschema.sales VALUES (1001, 5, 30000.00);
INSERT INTO testschema.sales VALUES (1001, 6, 30000.00);
INSERT INTO testschema.sales VALUES (1001, 7, 50000.00);
INSERT INTO testschema.sales VALUES (1001, 8, 35000.00);
INSERT INTO testschema.sales VALUES (1001, 9, 60000.00);
INSERT INTO testschema.sales VALUES (1001, 10, 80000.00);
INSERT INTO testschema.sales VALUES (1001, 11, 90000.00);
INSERT INTO testschema.sales VALUES (1001, 12, 100000.00);

The sales amounts drop after months 3 and 7. The running total is reset accordingly at months 4 and 8.

SELECT * FROM testschema.running_total;

   store_id     month_no  sales_amount  cume_sales_amount
-----------  -----------  ------------  -----------------
       1001            1      35000.00           35000.00
       1001            2      40000.00           75000.00
       1001            3      45000.00          120000.00
       1001            4      25000.00           25000.00
       1001            5      30000.00           55000.00
       1001            6      30000.00           85000.00
       1001            7      50000.00          135000.00
       1001            8      35000.00           35000.00
       1001            9      60000.00           95000.00
       1001           10      80000.00          175000.00
       1001           11      90000.00          265000.00
       1001           12     100000.00          365000.00

AWS SCT converts the view as follows. The converted code uses a subquery to emulate the RESET WHEN. Essentially, a marker attribute is added to the result that flags a month-over-month sales drop. The flag is then used to determine the longest preceding run of increasing sales to aggregate.

CREATE OR REPLACE VIEW testschema.running_total (
  store_id
, month_no
, sales_amount
, cume_sales_amount) AS
SELECT
  store_id
, month_no
, sales_amount
, sum(sales_amount) OVER
    (PARTITION BY k1, store_id ORDER BY month_no NULLS FIRST
     ROWS UNBOUNDED PRECEDING)
FROM (
  SELECT
   store_id
 , month_no
 , sales_amount
 , SUM(CASE WHEN k = 1 THEN 0 ELSE 1 END) OVER
     (PARTITION BY store_id ORDER BY month_no NULLS FIRST
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS k1
 FROM (
   SELECT
     store_id
   , month_no
   , sales_amount
   , CASE WHEN sales_amount < SUM(sales_amount) OVER 
      (PARTITION BY store_id ORDER BY month_no 
        ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) 
      OR sales_amount IS NULL THEN 0 ELSE 1 END AS k
   FROM testschema.sales
  )
);

We expect that RESET WHEN conversion will be a big hit with customers. You can try it now in AWS SCT.

TD_NORMALIZE_OVERLAP() function

The TD_NORMALIZE_OVERLAP function combines rows that have overlapping PERIOD values. The resulting normalized row contains the earliest starting bound and the latest ending bound from the PERIOD values of all the rows involved.

For example, we create a Teradata table that records employee salaries with the following code. Each row in the table is timestamped with the period that the employee was paid the given salary.

CREATE TABLE testschema.salaries (
  emp_id INTEGER
, salary DECIMAL(8,2)
, from_to PERIOD(DATE)
);

Now we add data for two employees. For emp_id = 1 and salary = 2000, there are two overlapping rows. Similarly, the two rows with emp_id = 2 and salary = 3000 are overlapping.
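For reference, rows like these can be loaded with the Teradata PERIOD constructor; the following INSERT statements are a sketch that reproduces the data shown below:

INSERT INTO testschema.salaries VALUES (1, 1000.00, PERIOD(DATE '2020-01-01', DATE '2020-05-31'));
INSERT INTO testschema.salaries VALUES (1, 2000.00, PERIOD(DATE '2020-06-01', DATE '2021-02-28'));
INSERT INTO testschema.salaries VALUES (1, 2000.00, PERIOD(DATE '2021-01-01', DATE '2021-06-30'));
INSERT INTO testschema.salaries VALUES (2, 3000.00, PERIOD(DATE '2020-01-01', DATE '2020-03-31'));
INSERT INTO testschema.salaries VALUES (2, 3000.00, PERIOD(DATE '2020-02-01', DATE '2020-04-30'));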

SELECT * FROM testschema.salaries ORDER BY emp_id, from_to;

     emp_id      salary  from_to
-----------  ----------  ------------------------
          1     1000.00  ('20/01/01', '20/05/31')
          1     2000.00  ('20/06/01', '21/02/28')
          1     2000.00  ('21/01/01', '21/06/30')
          2     3000.00  ('20/01/01', '20/03/31')
          2     3000.00  ('20/02/01', '20/04/30')

Now we create a view that uses the TD_NORMALIZE_OVERLAP function to normalize the overlapping data:

REPLACE VIEW testschema.normalize_salaries AS 
WITH sub_table(emp_id, salary, from_to) AS (
  SELECT 
    emp_id
  , salary
  , from_to
  FROM testschema.salaries
)
SELECT *
FROM 
  TABLE(TD_SYSFNLIB.TD_NORMALIZE_OVERLAP (NEW VARIANT_TYPE(sub_table.emp_id, sub_table.salary), sub_table.from_to)
    RETURNS (emp_id INTEGER, salary DECIMAL(8,2), from_to PERIOD(DATE))
    HASH BY emp_id
    LOCAL ORDER BY emp_id, salary, from_to
  ) AS DT(emp_id, salary, duration)
;

We can check that the view data is actually normalized:

select * from testschema.normalize_salaries order by emp_id, duration;

     emp_id      salary  duration
-----------  ----------  ------------------------
          1     1000.00  ('20/01/01', '20/05/31')
          1     2000.00  ('20/06/01', '21/06/30')
          2     3000.00  ('20/01/01', '20/04/30')

You can now use AWS SCT to convert any TD_NORMALIZE_OVERLAP statements. We first convert the salaries table to Amazon Redshift (see Accelerate your data warehouse migration to Amazon Redshift – Part 2 for details about period data type automation):

CREATE TABLE testschema.salaries (
  emp_id integer distkey
, salary numeric(8,2) ENCODE az64
, from_to_begin date ENCODE az64
, from_to_end date ENCODE az64    
)                                   
DISTSTYLE KEY                       
SORTKEY (emp_id);

# SELECT * FROM testschema.salaries ORDER BY emp_id, from_to_begin;
 emp_id | salary  | from_to_begin | from_to_end 
      1 | 1000.00 | 2020-01-01    | 2020-05-31
      1 | 2000.00 | 2020-06-01    | 2021-02-28
      1 | 2000.00 | 2021-01-01    | 2021-06-30
      2 | 3000.00 | 2020-01-01    | 2020-03-31
      2 | 3000.00 | 2020-02-01    | 2020-04-30

Now we use AWS SCT to convert the normalize_salaries view. AWS SCT adds a column that marks the start of a new group of rows. It then produces a single row for each group with a normalized timestamp.

CREATE VIEW testschema.normalize_salaries (emp_id, salary, from_to_begin, from_to_end) AS
WITH sub_table AS (
  SELECT
    emp_id
  , salary
  , from_to_begin AS start_date
  , from_to_end AS end_date
  , CASE
      WHEN start_date <= lag(end_date) OVER (PARTITION BY emp_id, salary ORDER BY start_date, end_date) THEN 0 
      ELSE 1
    END AS GroupStartFlag
    FROM testschema.salaries
  )
SELECT
  t2.emp_id
, t2.salary
, min(t2.start_date) AS from_to_begin
, max(t2.end_date) AS from_to_end
FROM (
  SELECT
    emp_id
  , salary
  , start_date
  , end_date
  , sum(GroupStartFlag) OVER (PARTITION BY emp_id, salary ORDER BY start_date ROWS UNBOUNDED PRECEDING) AS GroupID
  FROM 
    sub_table
) AS t2
GROUP BY 
  t2.emp_id
, t2.salary
, t2.GroupID;

We can check that the converted view returns the correctly normalized data:

# SELECT * FROM testschema.normalize_salaries ORDER BY emp_id;
 emp_id | salary  | from_to_begin | from_to_end 
      1 | 1000.00 | 2020-01-01    | 2020-05-31
      1 | 2000.00 | 2020-06-01    | 2021-06-30
      2 | 3000.00 | 2020-01-01    | 2020-04-30

You can try out TD_NORMALIZE_OVERLAP conversion in the latest release of AWS SCT. Download it now.

TD_UNPIVOT() function

The TD_UNPIVOT function transforms columns into rows. Essentially, we use it to take a row of similar metrics over different time periods and create a separate row for each metric.

For example, consider the following Teradata table. The table records customer visits by year and month for small kiosk stores:

CREATE TABLE TESTSCHEMA.kiosk_monthly_visits (
  kiosk_id INTEGER
, year_no INTEGER
, jan_visits INTEGER
, feb_visits INTEGER
, mar_visits INTEGER
, apr_visits INTEGER
, may_visits INTEGER
, jun_visits INTEGER
, jul_visits INTEGER
, aug_visits INTEGER
, sep_visits INTEGER
, oct_visits INTEGER
, nov_visits INTEGER
, dec_visits INTEGER)
PRIMARY INDEX (kiosk_id);

We insert some sample data into the table:

INSERT INTO testschema.kiosk_monthly_visits VALUES (100, 2020, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200);

Next, we create a view that unpivots the table so that the monthly visits appear on separate rows. The single row in the pivoted table creates 12 rows in the unpivoted table, one row per month.

REPLACE VIEW testschema.unpivot_kiosk_monthly_visits (
  kiosk_id
, year_no
, month_name
, month_visits
)
AS
SELECT 
  kiosk_id
, year_no
, month_name (FORMAT 'X(10)')
, month_visits
FROM TD_UNPIVOT (
 ON (SELECT * FROM testschema.kiosk_monthly_visits)
 USING
 VALUE_COLUMNS ('month_visits')
 UNPIVOT_COLUMN('month_name')
 COLUMN_LIST(
   'jan_visits'
 , 'feb_visits'
 , 'mar_visits'
 , 'apr_visits'
 , 'may_visits'
 , 'jun_visits'
 , 'jul_visits'
 , 'aug_visits'
 , 'sep_visits'
 , 'oct_visits'
 , 'nov_visits'
 , 'dec_visits'
 )
 COLUMN_ALIAS_LIST (
   'jan'
 , 'feb'
 , 'mar'
 , 'apr'
 , 'may'
 , 'jun'
 , 'jul'
 , 'aug'
 , 'sep'
 , 'oct'
 , 'nov'
 , 'dec'
 )
) a;

When you select from the view, the monthly visits are unpivoted into 12 separate rows:

SELECT * FROM testschema.unpivot_kiosk_monthly_visits;

   kiosk_id      year_no  month_name  month_visits
-----------  -----------  ----------  ------------
        100         2020  jan                 1100
        100         2020  feb                 1200
        100         2020  mar                 1300
        100         2020  apr                 1400
        100         2020  may                 1500
        100         2020  jun                 1600
        100         2020  jul                 1700
        100         2020  aug                 1800
        100         2020  sep                 1900
        100         2020  oct                 2000
        100         2020  nov                 2100
        100         2020  dec                 2200

Now we use AWS SCT to convert the view into ANSI SQL that can be run on Amazon Redshift. The conversion creates a common table expression (CTE) to place each month in a separate row. It then joins the CTE and the remaining attributes from the original pivoted table.

CREATE OR REPLACE VIEW testschema.unpivot_kiosk_monthly_visits (kiosk_id, year_no, month_name, month_visits) AS
WITH cols
AS (SELECT
    'jan' AS col
UNION ALL
SELECT
    'feb' AS col
UNION ALL
SELECT
    'mar' AS col
UNION ALL
SELECT
    'apr' AS col
UNION ALL
SELECT
    'may' AS col
UNION ALL
SELECT
    'jun' AS col
UNION ALL
SELECT
    'jul' AS col
UNION ALL
SELECT
    'aug' AS col
UNION ALL
SELECT
    'sep' AS col
UNION ALL
SELECT
    'oct' AS col
UNION ALL
SELECT
    'nov' AS col
UNION ALL
SELECT
    'dec' AS col)
SELECT
    t1.kiosk_id, t1.year_no, col AS "month_name",
    CASE col
        WHEN 'jan' THEN "jan_visits"
        WHEN 'feb' THEN "feb_visits"
        WHEN 'mar' THEN "mar_visits"
        WHEN 'apr' THEN "apr_visits"
        WHEN 'may' THEN "may_visits"
        WHEN 'jun' THEN "jun_visits"
        WHEN 'jul' THEN "jul_visits"
        WHEN 'aug' THEN "aug_visits"
        WHEN 'sep' THEN "sep_visits"
        WHEN 'oct' THEN "oct_visits"
        WHEN 'nov' THEN "nov_visits"
        WHEN 'dec' THEN "dec_visits"
        ELSE NULL
    END AS "month_visits"
    FROM testschema.kiosk_monthly_visits AS t1
    CROSS JOIN cols
    WHERE month_visits IS NOT NULL;

You can check that the converted view produces the same result as the Teradata version:

# SELECT * FROM testschema.unpivot_kiosk_monthly_visits;
 kiosk_id | year_no | month_name | month_visits 
      100 |    2020 | oct        |        2000
      100 |    2020 | nov        |        2100
      100 |    2020 | jul        |        1700
      100 |    2020 | feb        |        1200
      100 |    2020 | apr        |        1400
      100 |    2020 | aug        |        1800
      100 |    2020 | sep        |        1900
      100 |    2020 | jan        |        1100
      100 |    2020 | mar        |        1300
      100 |    2020 | may        |        1500
      100 |    2020 | jun        |        1600
      100 |    2020 | dec        |        2200

You can try out the conversion support for TD_UNPIVOT in the latest version of AWS SCT.

QUANTILE function

QUANTILE is a ranking function. It partitions the input set into a specified number of groups, each containing an equal portion of the total population. QUANTILE is a proprietary Teradata extension of the NTILE function found in ANSI SQL.

For example, we can compute the quartiles of the monthly visit data using the following Teradata view:

REPLACE VIEW testschema.monthly_visit_rank AS
SELECT
  kiosk_id
, year_no
, month_name
, month_visits
, QUANTILE(4, month_visits) qtile
FROM
 testschema.unpivot_kiosk_monthly_visits
;

When you select from the view, the QUANTILE function computes the quartile and applies it as an attribute on the output:

SELECT * FROM monthly_visit_rank;

   kiosk_id      year_no  month_name  month_visits        qtile
-----------  -----------  ----------  ------------  -----------
        100         2020  jan                 1100            0
        100         2020  feb                 1200            0
        100         2020  mar                 1300            0
        100         2020  apr                 1400            1
        100         2020  may                 1500            1
        100         2020  jun                 1600            1
        100         2020  jul                 1700            2
        100         2020  aug                 1800            2
        100         2020  sep                 1900            2
        100         2020  oct                 2000            3
        100         2020  nov                 2100            3
        100         2020  dec                 2200            3

Amazon Redshift supports NTILE, an ANSI-compliant ranking function that can implement QUANTILE. We’ve enhanced AWS SCT to automatically convert QUANTILE function calls into equivalent NTILE function calls. Because QUANTILE numbers its groups starting from 0 and NTILE starts from 1, the converted code subtracts 1 from the NTILE result.

For example, when you convert the preceding Teradata view, AWS SCT produces the following Amazon Redshift code:

SELECT 
  unpivot_kiosk_monthly_visits.kiosk_id
, unpivot_kiosk_monthly_visits.year_no
, unpivot_kiosk_monthly_visits.month_name
, unpivot_kiosk_monthly_visits.month_visits
, (ntile(4) OVER (ORDER BY unpivot_kiosk_monthly_visits.month_visits ASC NULLS FIRST) - 1) AS qtile
FROM 
  testschema.unpivot_kiosk_monthly_visits
;

QUANTILE conversion support is available now in AWS SCT.

QUALIFY filter

The QUALIFY clause in Teradata filters rows produced by an analytic function. Let’s look at an example. We use the following table, which contains store revenue by month. Our goal is to find the top five months by revenue:

CREATE TABLE testschema.sales (
  store_id INTEGER
, month_no INTEGER
, sales_amount DECIMAL(9,2))
PRIMARY INDEX (store_id);


SELECT * FROM sales;

   store_id     month_no  sales_amount
-----------  -----------  ------------
       1001            1      35000.00
       1001            2      40000.00
       1001            3      45000.00
       1001            4      25000.00
       1001            5      30000.00
       1001            6      30000.00
       1001            7      50000.00
       1001            8      35000.00
       1001            9      60000.00
       1001           10      80000.00
       1001           11      90000.00
       1001           12     100000.00

The data shows that July, September, October, November, and December were the top five sales months.

We create a view that uses the RANK function to rank each month by sales, then uses the QUALIFY clause to select the top five months:

REPLACE VIEW testschema.top_five_months(
  store_id
, month_no
, sales_amount
, month_rank
) as
SELECT
  store_id
, month_no
, sales_amount
, RANK() OVER (PARTITION BY store_id ORDER BY sales_amount DESC) month_rank
FROM
  testschema.sales
QUALIFY RANK() OVER (PARTITION by store_id ORDER BY sales_amount DESC) <= 5
;

Before, if you used the QUALIFY clause, you had to manually recode your SQL statements. Now, AWS SCT automatically converts QUALIFY into Amazon Redshift-compatible, ANSI-compliant SQL. For example, AWS SCT rewrites the preceding view as follows:

CREATE OR REPLACE VIEW testschema.top_five_months (
  store_id
, month_no
, sales_amount
, month_rank) AS
SELECT
  qualify_subquery.store_id
, qualify_subquery.month_no
, qualify_subquery.sales_amount
, month_rank
FROM (
  SELECT
    store_id
  , month_no
  , sales_amount
  , rank() OVER (PARTITION BY store_id ORDER BY sales_amount DESC NULLS FIRST) AS month_rank
  , rank() OVER (PARTITION BY store_id ORDER BY sales_amount DESC NULLS FIRST) AS qualify_expression_1
  FROM testschema.sales) AS qualify_subquery
  WHERE 
    qualify_expression_1 <= 5;

AWS SCT converts the original query into a subquery and applies the QUALIFY expression as a filter on that subquery. It also adds an extra column to the subquery to carry the filter expression. This isn’t strictly needed, but it keeps the rewrite simple when the analytic function has no column alias to reuse.
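If you’re writing this query by hand, a leaner equivalent (our own sketch, not AWS SCT output) filters on the ranking column directly:

SELECT store_id, month_no, sales_amount, month_rank
FROM (
  SELECT
    store_id
  , month_no
  , sales_amount
  , RANK() OVER (PARTITION BY store_id ORDER BY sales_amount DESC) AS month_rank
  FROM testschema.sales
) AS ranked
WHERE month_rank <= 5;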

You can try QUALIFY conversion in the latest version of AWS SCT.

Summary

We’re happy to share these new features with you. If you’re contemplating a migration to Amazon Redshift, these capabilities can help automate your schema conversion and preserve your investment in existing reports and applications. If you’re looking to get started on a data warehouse migration, you can learn more about Amazon Redshift and AWS SCT from our public documentation.

This post described a few of the dozens of new features we’re introducing to automate your Teradata migrations to Amazon Redshift. We’ll share more in upcoming posts about automation for proprietary Teradata features and other exciting new capabilities.

Check back soon for more information. Until then, you can learn more about Amazon Redshift and the AWS Schema Conversion Tool. Happy migrating!


About the Authors

Michael Soo is a Senior Database Engineer with the AWS Database Migration Service team. He builds products and services that help customers migrate their database workloads to the AWS cloud.

Raza Hafeez is a Data Architect within the Lake House Global Specialty Practice of AWS Professional Services. He has over 10 years of professional experience building and optimizing enterprise data warehouses and is passionate about enabling customers to realize the power of their data. He specializes in migrating enterprise data warehouses to AWS Lake House Architecture.

Po Hong, PhD, is a Principal Data Architect in the Lake House Global Specialty Practice of AWS Professional Services. He is passionate about helping customers adopt innovative solutions that reduce time to insight. Po specializes in migrating large-scale on-premises MPP data warehouses to the AWS Lake House architecture.

Entong Shen is a Software Development Manager of Amazon Redshift. He has been working on MPP databases for over 9 years and has focused on query optimization, statistics, and migration-related SQL language features such as stored procedures and data types.

Sumit Singh is a database engineer with the Database Migration Service team at Amazon Web Services. He works closely with customers and provides technical assistance to migrate their on-premises workloads to the AWS Cloud. He also assists in continuously improving the quality and functionality of AWS data migration products.

Nelly Susanto is a Senior Database Migration Specialist of AWS Database Migration Accelerator. She has over 10 years of technical background focusing on migrating and replicating databases along with data-warehouse workloads. She is passionate about helping customers in their cloud journey.

Accelerate your data warehouse migration to Amazon Redshift – Part 2

Post Syndicated from Michael Soo original https://aws.amazon.com/blogs/big-data/part-2-accelerate-your-data-warehouse-migration-to-amazon-redshift/

This is the second post in a multi-part series. We’re excited to share dozens of new features to automate your schema conversion; preserve your investment in existing scripts, reports, and applications; accelerate query performance; and potentially reduce your overall cost to migrate to Amazon Redshift. Check out the first post Accelerate your data warehouse migration to Amazon Redshift – Part 1 to learn more about automated macro conversion, case-insensitive string comparison, case-sensitive identifiers, and other exciting new features.

Amazon Redshift is the leading cloud data warehouse. No other data warehouse makes it as easy to gain new insights from your data. With Amazon Redshift, you can query exabytes of data across your data warehouse, operational data stores, and data lake using standard SQL. You can also integrate other services like Amazon EMR, Amazon Athena, and Amazon SageMaker to use all the analytic capabilities in the AWS Cloud.

Many customers have asked for help migrating their self-managed data warehouse engines to Amazon Redshift. In these cases, you may have terabytes (or petabytes) of historical data, a heavy reliance on proprietary features, and thousands of extract, transform, and load (ETL) processes and reports built over years (or decades) of use.

Until now, migrating a data warehouse to AWS was complex and involved a significant amount of manual effort.

Today, we’re happy to share additional enhancements to the AWS Schema Conversion Tool (AWS SCT) to automate your migrations to Amazon Redshift. These enhancements reduce the recoding needed for your data tables, and more importantly, the manual work needed for views, stored procedures, scripts, and other application code that use those tables.

In this post, we introduce automation for INTERVAL and PERIOD data types, automatic type casting, binary data support, and some other enhancements that have been requested by customers. We show you how to use AWS SCT to convert objects from a Teradata data warehouse and provide links to relevant documentation so you can continue exploring these new capabilities.

INTERVAL data types

An INTERVAL is an unanchored duration of time, like “1 year” or “2 hours,” that doesn’t have a specific start or end time. In Teradata, INTERVAL data is implemented as 13 distinct data types depending on the granularity of time being represented. The following list summarizes these types by granularity:

Year intervals: INTERVAL YEAR, INTERVAL YEAR TO MONTH
Month intervals: INTERVAL MONTH
Day intervals: INTERVAL DAY, INTERVAL DAY TO HOUR, INTERVAL DAY TO MINUTE, INTERVAL DAY TO SECOND
Hour intervals: INTERVAL HOUR, INTERVAL HOUR TO MINUTE, INTERVAL HOUR TO SECOND
Minute intervals: INTERVAL MINUTE, INTERVAL MINUTE TO SECOND
Second intervals: INTERVAL SECOND

Amazon Redshift doesn’t support INTERVAL data types natively. Previously, if you used INTERVAL types in your data warehouse, you had to develop custom code as part of the database conversion process.

Now, AWS SCT automatically converts INTERVAL data types for you. AWS SCT converts an INTERVAL column into a CHARACTER VARYING column in Amazon Redshift. Then AWS SCT converts your application code that uses the column to emulate the INTERVAL semantics.

For example, consider the following Teradata table, which has a MONTH interval column. The table stores different types of leaves of absence and the allowable duration for each.

CREATE TABLE testschema.loa_durations (
  loa_type_id INTEGER
, loa_name VARCHAR(100) CHARACTER SET LATIN
, loa_duration INTERVAL MONTH(2))
PRIMARY INDEX (loa_type_id);

AWS SCT converts the table to Amazon Redshift as follows. Because Amazon Redshift doesn’t have a native INTERVAL data type, AWS SCT replaces it with a VARCHAR data type.

CREATE TABLE testschema.loa_durations(
  loa_type_id INTEGER
, loa_name VARCHAR(100)
, loa_duration VARCHAR(64)
)
DISTSTYLE KEY
DISTKEY
(
loa_type_id
)
SORTKEY
(
loa_type_id
);

Now, let’s suppose your application code uses the loa_duration column, like the following Teradata view. Here, the INTERVAL MONTH field is added to the current date to compute when a leave of absence ends if it starts today.

REPLACE VIEW testschema.loa_projected_end_date AS
SELECT
  loa_type_id loa_type_id
, loa_name loa_name
, loa_duration
, current_date AS today
, current_date + loa_duration AS end_date
FROM
testschema.loa_durations
;

Because the data is stored as CHARACTER VARYING, AWS SCT injects the proper type CAST into the Amazon Redshift code to interpret the string values as a MONTH interval. It then converts the arithmetic using Amazon Redshift date functions.

CREATE OR REPLACE VIEW testschema.loa_projected_end_date (loa_type_id, loa_name, loa_duration, today, end_date) AS
SELECT
  loa_type_id AS loa_type_id
, loa_name AS loa_name
, loa_duration
, CURRENT_DATE AS today
, dateadd(MONTH, CAST (loa_duration AS INTEGER),CURRENT_DATE)::DATE AS end_date
FROM testschema.loa_durations
;

Also, as a bonus, AWS SCT automatically converts any literal INTERVAL values that you might be using in your code.

For example, consider the following Teradata table. The table contains a DATE column, which records the last date when an employee was promoted.

CREATE TABLE TESTSCHEMA.employees (
  id INTEGER
, name VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC
, manager_id INTEGER
, last_promo_date DATE FORMAT 'YY/MM/DD'
)
UNIQUE PRIMARY INDEX ( id );

Now, suppose the database contains a view that computes the next date an employee is eligible for a promotion. We implement a business rule that employees who have never been promoted are eligible for promotion in 1.5 years. All other employees become eligible 2.5 years after their last promotion. See the following code:

REPLACE VIEW testschema.eligible_for_promo AS
SELECT 
  id
, name
, last_promo_date
, CASE WHEN last_promo_date is NULL THEN current_date + INTERVAL '18' MONTH
       ELSE last_promo_date + INTERVAL '2-06' YEAR TO MONTH
  END eligible_date
FROM employees
;

AWS SCT converts the INTERVAL values used in the CASE statement and translates the date expressions accordingly:

CREATE OR REPLACE VIEW testschema.eligible_for_promo (id, name, last_promo_date, eligible_date) AS
SELECT
  id
, name
, last_promo_date
, CASE
    WHEN last_promo_date IS NULL THEN dateadd(MONTH, 18, CURRENT_DATE)::DATE
    ELSE dateadd(MONTH, 30, last_promo_date)::DATE
  END AS eligible_date
FROM testschema.employees
;

We’re excited about INTERVAL automation in AWS SCT and encourage you to give it a try. For more information about getting started with AWS SCT, see Installing, verifying, and updating AWS SCT.

PERIOD data type

A PERIOD data value represents a duration of time with a specified start and end. For example, the Teradata literal “(2021-01-01 to 2021-01-31)” is a period with a duration of 31 days that starts and ends on the first and last day of January 2021, respectively. PERIOD data types can have three different granularities: DATE, TIME, or TIMESTAMP. The following table provides some examples.

Period type         Example
PERIOD(DATE)        “(2021-01-01 to 2021-01-31)”
PERIOD(TIME)        “(12:00:00 to 13:00:00)”
PERIOD(TIMESTAMP)   “(2021-01-31 00:00:00 to 2021-01-31 23:59:59)”

As with INTERVAL, the PERIOD data type isn’t natively supported by Amazon Redshift. Previously, if you used these data types in your tables, you had to write custom code as part of the database conversion process.

Now, AWS SCT automatically converts PERIOD data types for you. AWS SCT converts a PERIOD column into two DATE (or TIME or TIMESTAMP) columns as appropriate on Amazon Redshift. Then AWS SCT converts your application code that uses the column to emulate the source engine semantics.

For example, consider the following Teradata table:

CREATE SET TABLE testschema.period_table (
  id INTEGER
, period_col PERIOD(timestamp)) 
UNIQUE PRIMARY INDEX (id);

AWS SCT converts the PERIOD(TIMESTAMP) column into two TIMESTAMP columns in Amazon Redshift:

CREATE TABLE IF NOT EXISTS testschema.period_table(
  id INTEGER
, period_col_begin TIMESTAMP
, period_col_end TIMESTAMP
)
DISTSTYLE KEY
DISTKEY
(id)
SORTKEY
(id);

Now, let’s look at a simple example of how you can use AWS SCT to convert your application code. A common operation in Teradata is to extract the starting (or ending) timestamps in a PERIOD value using the BEGIN and END built-in functions:

REPLACE VIEW testschema.period_view_begin_end AS 
SELECT 
  BEGIN(period_col) AS period_start
, END(period_col) AS period_end 
FROM testschema.period_table
;

AWS SCT converts the view to reference the transformed table columns:

CREATE OR REPLACE VIEW testschema.period_view_begin_end (period_start, period_end) AS
SELECT
  period_col_begin AS period_start
, period_col_end AS period_end
FROM testschema.period_table;

We’ll continue to build automation for PERIOD data conversion, so stay tuned for more improvements. In the meantime, you can try out the PERIOD data type conversion features in AWS SCT now. For more information, see Installing, verifying, and updating AWS SCT.

Type casting

Some data warehouse engines, like Teradata, provide an extensive set of rules to cast data values in expressions. These rules permit implicit casts, where the target data type is inferred from the expression, and explicit casts, which typically use a function to perform the type conversion.
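As a quick illustration (a hypothetical snippet, not customer code), the first filter below relies on an implicit cast because the target type is inferred from the comparison, while the second spells the conversion out:

-- Implicit cast: the string literal is converted to match the numeric id column
SELECT * FROM testschema.employees WHERE id = '112';

-- Explicit cast: the conversion is requested with CAST
SELECT * FROM testschema.employees WHERE id = CAST('112' AS INTEGER);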

Previously, you had to manually convert implicit cast operations in your SQL code. Now, we’re happy to share that AWS SCT automatically converts implicit casts as needed. This feature is available now for the following set of high-impact Teradata data types.

Numeric source types and their supported targets:
  CHAR: BIGINT, NUMBER, TIMESTAMP
  VARCHAR: NUMBER, NUMERIC, DEC, CHAR, GEOMETRY
  INTEGER: DATE, DEC
  BIGINT: DATE
  NUMBER: CHARACTER, VARCHAR, DEC
  DECIMAL: DATE, TIMESTAMP, SMALLINT, DOUBLE PRECISION
  FLOAT: DEC

Time source types and their supported targets:
  DATE: BIGINT, INTEGER, DECIMAL, FLOAT, NUMBER, CHARACTER, TIMESTAMP
  INTERVAL: NUMBER, BIGINT, INTEGER

Other source types and their supported targets:
  GEOMETRY: DECIMAL

Let’s look at how to cast numbers to DATE. Many Teradata applications treat numbers and DATE as equivalent values. Internally, Teradata stores DATE values as INTEGER. The rules to convert between an INTEGER and a DATE are well-known and developers have commonly exploited this information to perform date calculations using INTEGER arithmetic.
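For reference, Teradata stores a DATE internally as the integer (year - 1900) * 10000 + month * 100 + day. The following one-line sketch shows the arithmetic for the date used later in this example:

-- March 30, 2041 expressed as a Teradata internal DATE value
SELECT (2041 - 1900) * 10000 + 3 * 100 + 30 AS internal_date_value;  -- returns 1410330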

For example, consider the following Teradata table:

CREATE TABLE testschema.employees (
  id INTEGER
, name VARCHAR(20) CHARACTER SET LATIN
, manager_id INTEGER
, last_promo_date DATE FORMAT 'YY/MM/DD')
UNIQUE PRIMARY INDEX ( id );

We insert a single row of data into the table:

select * from employees;

 *** Query completed. One row found. 4 columns returned. 
 *** Total elapsed time was 1 second.

         id  name                   manager_id  last_promo_date
-----------  --------------------  -----------  ---------------
        112  Britney                       201                ?

We use a macro to update the last_promo_date field for id = 112. The macro accepts a BIGINT parameter to populate the DATE field.

replace macro testschema.set_last_promo_date(emp_id integer, emp_promo_date bigint) AS (
update testschema.employees
set last_promo_date = :emp_promo_date
where id = :emp_id;
);

Now, we run the macro and check the value of the last_promo_date attribute:

exec testschema.set_last_promo_date(112, 1410330);

 *** Update completed. One row changed. 
 *** Total elapsed time was 1 second.


select * from employees;

 *** Query completed. One row found. 4 columns returned. 
 *** Total elapsed time was 1 second.

         id  name                   manager_id  last_promo_date
-----------  --------------------  -----------  ---------------
        112  Britney                       201         41/03/30

You can see the last_promo_date attribute is set to the date March 30, 2041.

Now, let’s use AWS SCT to convert the table and macro to Amazon Redshift. As we saw in Part 1 of this series, AWS SCT converts the Teradata macro into an Amazon Redshift stored procedure:

CREATE TABLE IF NOT EXISTS testschema.employees(
  id INTEGER
, name CHARACTER VARYING(20) 
, manager_id INTEGER
, last_promo_date DATE
)
DISTSTYLE KEY
DISTKEY
(id)
SORTKEY
(id);

CREATE OR REPLACE PROCEDURE testschema.set_last_promo_date(par_emp_id INTEGER, par_emp_promo_date BIGINT)
AS $BODY$
BEGIN
    UPDATE testschema.employees
    SET last_promo_date = TO_DATE((par_emp_promo_date + 19000000), 'YYYYMMDD')
        WHERE id = par_emp_id;
END;
$BODY$
LANGUAGE plpgsql;

Note that 20410330 = 1410330 + 19000000; so adding 19,000,000 to the input returns the correct date value 2041-03-30.

Now, when we run the stored procedure, it updates the last_promo_date as expected:

myredshift=# select * from testschema.employees;
 id  |  name   | manager_id | last_promo_date
 112 | Britney |        201 |
(1 row)

myredshift=# call testschema.set_last_promo_date(112, 1410330);
CALL

myredshift=# select * from testschema.employees;
 id  |  name   | manager_id | last_promo_date
 112 | Britney |        201 | 2041-03-30
(1 row)

Automatic data type casting is available in AWS SCT now. You can download the latest version and try it out.

BLOB data

Amazon Redshift doesn’t have native support for BLOB columns, which you use to store large binary objects like text or images.

Previously, if you were migrating a table with a BLOB column, you had to manually move the BLOB values to file storage, like Amazon Simple Storage Service (Amazon S3), then add a reference to the S3 file in the table. Using Amazon S3 as the storage target for binary objects is a best practice because these objects are large and typically have low analytic value.

We’re happy to share that AWS SCT now automates this process for you. AWS SCT replaces the BLOB column with a CHARACTER VARYING column on the target table. Then, when you use the AWS SCT data extractors to migrate your data, the extractors upload the BLOB value to Amazon S3 and insert a reference to the BLOB into the target table.

For example, let’s create a table in Teradata and populate it with some data:

CREATE SET TABLE TESTSCHEMA.blob_table (
  id INTEGER
, blob_col BLOB(10485760))
PRIMARY INDEX ( id );

select * from blob_table;

 *** Query completed. 2 rows found. 2 columns returned. 
 *** Total elapsed time was 1 second.

         id blob_col
----------- ---------------------------------------------------------------
          1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
          2 BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Now, we convert the table with AWS SCT and build it on Amazon Redshift:

myredshift=# \d testschema.blob_table;
                    Table "testschema.blob_table"
Column  | Type                     | Collation | Nullable | Default 
id      | integer                  |           |          | 
blob_col | character varying(1300) |           |          |

Then we use the AWS SCT data extractors to migrate the table data from Teradata to Amazon Redshift.

When we look at the table in Amazon Redshift, you can see the paths to the S3 files that contain the BLOB values:

myredshift=# select * from testschema.blob_table;
(2 rows)

 id |                                                               blob_col                                                               
  2 | s3://<bucket name>/data/c12f53330dd3427a845a77f143d4a1a1/dbdee8e0485c481dad601fd6170fbfb4_lobs/2/308b6f0a902941e793212058570cdda5.dat
  1 | s3://<bucket name>/data/c12f53330dd3427a845a77f143d4a1a1/dbdee8e0485c481dad601fd6170fbfb4_lobs/2/a7686067af5549479b52d81e83c3871e.dat

And on Amazon S3, you can see the actual data files. There are two, one for each BLOB value:

$ aws s3 ls s3://<bucket name>/data/c12f53330dd3427a845a77f143d4a1a1/dbdee8e0485c481dad601fd6170fbfb4_lobs/2/
2021-05-13 23:59:47         23 522fee54fda5472fbae790f43e36cba1.dat
2021-05-13 23:59:47         24 5de6c53831f741629476e2c2cbc6b226.dat

BLOB support is available now in AWS SCT and the AWS SCT data extractors. Download the latest version of the application and try it out today.

Multi-byte CHARACTER conversion

Teradata supports multibyte characters in CHARACTER data columns, which are fixed-length fields. Amazon Redshift supports multibyte characters in CHARACTER VARYING fields but not in fixed-length CHARACTER columns.

Previously, if you had fixed-length CHARACTER columns, you had to determine if they contained multibyte character data, and increase the target column size as appropriate.

AWS SCT now bridges this gap for you. If your Teradata tables contain CHARACTER columns with multibyte characters, AWS SCT automatically converts these columns to Amazon Redshift CHARACTER VARYING fields and sets the column sizes accordingly. Consider the following example, which contains four columns: a LATIN column that contains only single-byte characters, and UNICODE, GRAPHIC, and KANJISJIS columns that can contain multibyte characters:

create table testschema.char_table (
  latin_col char(70) character set latin
, unicode_col char(70) character set unicode
, graphic_col char(70) character set graphic
, kanjisjis_col char(70) character set kanjisjis
);

AWS SCT translates the LATIN column to a fixed-length CHARACTER column. The multibyte columns are converted to CHARACTER VARYING and upsized to three bytes per character:

CREATE TABLE IF NOT EXISTS testschema.char_table (
  latin_col CHARACTER(70)
, unicode_col CHARACTER VARYING(210)
, graphic_col CHARACTER VARYING(210)
, kanjisjis_col CHARACTER VARYING(210)
)
DISTSTYLE KEY
DISTKEY
(latin_col)
SORTKEY
(latin_col);

Automatic conversion for multibyte CHARACTER columns is available in AWS SCT now.

GEOMETRY data type size

Amazon Redshift has long supported geospatial data with a GEOMETRY data type and associated spatial functions.

Previously, Amazon Redshift restricted the maximum size of a GEOMETRY column to 64 KB, which constrained some customers with large objects. Now, we’re happy to share that the maximum size of GEOMETRY objects has been increased to just under 1 MB (specifically, 1,048,447 bytes).

For example, consider the following Teradata table:

create table geometry_table (
 id INTEGER
, geometry_col1 ST_GEOMETRY 
, geometry_col2 ST_GEOMETRY(1000)
, geometry_col3 ST_GEOMETRY(1048447) 
, geometry_col4 ST_GEOMETRY(10484470)
, geometry_col5 ST_GEOMETRY INLINE LENGTH 1000
)
;

You can use AWS SCT to convert it to Amazon Redshift. The converted table definition is as follows. A size specification isn’t needed on the converted columns because Amazon Redshift implicitly sets the column size.

CREATE TABLE IF NOT EXISTS testschema.geometry_table(
id INTEGER,
geometry_col1 GEOMETRY,
geometry_col2 GEOMETRY,
geometry_col3 GEOMETRY,
geometry_col4 GEOMETRY,
geometry_col5 GEOMETRY
)
DISTSTYLE KEY
DISTKEY
(
id
)
SORTKEY
(
id
);
ALTER TABLE testschema.geometry_table ALTER DISTSTYLE AUTO;
ALTER TABLE testschema.geometry_table ALTER SORTKEY AUTO;

Large GEOMETRY columns are available in Amazon Redshift now. For more information, see Querying spatial data in Amazon Redshift.

Conclusion

We’re happy to share these new features with you. If you’re contemplating a migration to Amazon Redshift, these capabilities can help automate your schema conversion and preserve your investment in existing reports, applications, and ETL, as well as accelerate your query performance.

This post described a few of the dozens of new features we have recently introduced to automate your data warehouse migrations to Amazon Redshift. We will share more in upcoming posts. You’ll hear about additional SQL automation, a purpose-built scripting language for Amazon Redshift with BTEQ compatibility, and automated support for proprietary SQL features.

Check back soon for more information. Until then, you can learn more about Amazon Redshift and the AWS Schema Conversion Tool on the AWS website. Happy migrating!


About the Author

Michael Soo is a database engineer with the AWS DMS and AWS SCT team at Amazon Web Services.