Migrating IBM Netezza to Amazon Redshift using the AWS Schema Conversion Tool

Post syndicated from Mattia Berlusconi; original post: https://aws.amazon.com/blogs/big-data/migrating-ibm-netezza-to-amazon-redshift-with-the-aws-sct/

The post How to migrate a large data warehouse from IBM Netezza to Amazon Redshift with no downtime described a high-level strategy for moving from an on-premises Netezza data warehouse to Amazon Redshift. In this post, we explain how a large European enterprise customer implemented a Netezza migration strategy spanning multiple environments, using the AWS Schema Conversion Tool (AWS SCT) to accelerate schema and data migration. We also walk you through validating that the schema and data content were migrated as expected and follow Amazon Redshift best practices.

Solution overview

It’s important to build a migration plan unique to your organization’s processes and non-functional requirements. The following plan is a real-world use case from a large European enterprise customer. It details the environments migrated to and the tasks, tools, and scripts used to complete the work:

  1. Assess migration tasks
    1. Understand the scope of the migration
    2. Record objects to be migrated into a migration runbook
  2. Set up the migration environment
    1. Install AWS SCT
    2. Configure AWS SCT for Netezza source environments
  3. Migrate to the development environment
    1. Create users, groups, and schema
    2. Convert schema
    3. Migrate data
    4. Validate data
    5. Transform ETL, UDF, and procedures
  4. Migrate to other pre-production environments
    1. Create users, groups, and schema
    2. Convert schema
    3. Migrate data
    4. Validate data
    5. Transform ETL, UDF, and procedures
  5. Migrate to the production environment
    1. Create users, groups, and schema
    2. Convert schema
    3. Migrate data
    4. Validate data
    5. Transform ETL, UDF, and procedures
    6. Business validation (including optional dual-running)
    7. Cut over

Assessing migration tasks

To plan and keep track of the migration tasks, you should produce a tracker of all the Netezza databases, tables, and views in scope. This information forms a migration runbook that is updated during the migration to document the progress of data migration from Netezza to Amazon Redshift. For each table identified, record the number of rows and size in GB.

Some Netezza source systems contain two Netezza data warehouses, for example one for ETL loading throughout the day and one for end-user reporting. Make sure it’s clear which data warehouses are in scope for the migration.

Setting up the migration environment

The migration strategy uses the AWS SCT to accelerate schema object conversion and migrate the data from the Netezza database to the Amazon Redshift cluster. The following diagram illustrates this architecture.

The migration should ensure the following:

  • The AWS SCT is installed within the AWS account onto an Amazon Elastic Compute Cloud (Amazon EC2) instance to facilitate migration operations, orchestrate the AWS SCT data extraction agents, and provide access via a user-friendly console.
  • The AWS SCT data extraction agents are installed and run as close to the Netezza data warehouse as possible. AWS strongly recommends installing them on premises within the same subnet as the Netezza data warehouse.

During the transfer of data from the on-premises data center to the AWS account, you can use either a direct connection or offline storage. AWS Snowball is a petabyte-scale offline solution for moving large amounts of data into AWS when sufficient bandwidth over a direct connection isn’t available. AWS Direct Connect is a cloud service that makes it easy to establish a dedicated network connection from your premises to AWS. You can establish private connectivity between your AWS account and your data center, office, or co-location environment by using Direct Connect, which in many cases can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than internet-based connections. Using Direct Connect also adds flexibility in case extract jobs need to be re-run.

Configuring AWS SCT for the Netezza source environment

The AWS SCT is installed on an EC2 instance running Microsoft Windows 10 with administrator privileges. Choosing Microsoft Windows as the operating system allows your users to graphically control the creation of projects, modify profiles, start and view the progress of the conversions, and view the output of the migration assessment reports.

Because you don’t perform the data migration directly on the AWS SCT console, a general purpose EC2 instance with 4 vCPU, 16 GB memory, 100 GB storage, and moderate network bandwidth is sufficient.

You should configure several AWS SCT data extraction agents to match the amount of data to be concurrently transferred and the number of Netezza connections available. You can install the data extraction agents on on-premises VM instances running Linux with root administration privileges. Each instance should have 8 vCPU, 32 GB of memory, and up to 10 Gbps of network capacity. For disk storage, we used 1 TB of 500 IOPS provisioned SSD, because intermediate results are stored on disk.

It’s preferable that the on-premises instances are located as close as possible to the Netezza data warehouse, ideally only a single network hop away. This is important because each data extraction agent stages the data it extracts for each table on the instance’s file system. Also, a relatively powerful CPU is chosen for each agent because compressing the extracted data is processor intensive.

As stated earlier, the number of agents should be proportionate to the amount of concurrent data streams being transferred and the number of Netezza connections available for the transfer. A rule of thumb is to have one data extraction agent for each TB of compressed Netezza data to be migrated in parallel. For optimum performance, it’s recommended that each agent is installed on a single VM instance.

You should work with the DBA team to make as many Netezza concurrent connections available to the data extraction agents as possible. Allocating all the available connections gives the extraction the full power of the source database, but if you need to run workloads in parallel with the data extracts, a smaller allocation (for example, 21) can suffice. This is a trade-off between the resources made available and the time required to migrate the data.

For this use case, we allocated seven extraction agents, because the largest project phase extracted 6 TB of Netezza data. The DBA team configured 21 Netezza concurrent connections, so each agent was configured with three parallel data extraction processes (known as threads; see the following configuration file).

Two parameters on the data extraction agents can impact the length of time it takes for the data to migrate from Netezza to the agents: the number of connections and the number of threads.

Tuning is required for each data extraction agent to maximize throughput during the data migration phase. Tuning is achieved by modifying the file /usr/share/aws/sct-extractor/conf/settings.properties, which must be updated on each agent. See the following code:

# Number of connections in the pool per agent
extractor.source.connection.pool.size=5

# Number of threads per agent
extractor.extracting.thread.pool.size=3

The preceding code has the following features:

  • extractor.source.connection.pool.size defines the number of connections the agent opens against the Netezza data warehouse.
  • extractor.extracting.thread.pool.size defines the number of parallel jobs the agent can spawn concurrently. The sum of this parameter for all the agents should be smaller than the maximum concurrent connections configured from Netezza.
  • It’s an AWS recommendation to set extractor.source.connection.pool.size to 1.5 times the value of extractor.extracting.thread.pool.size. This is because, while a task is running, the AWS SCT may need additional connections to retrieve metadata from Netezza to create additional tasks or for other operations, such as collecting table statistics.

Migrating to the development environment

The first task to undertake is data model schema transformation. It consists of transforming the Netezza schema objects into Amazon Redshift-compliant syntax and deploying them into the Amazon Redshift development environment. Before migrating the Netezza tables and views, you must create the schemas, groups, and users.

Creating schemas, users, and groups

If you don’t follow this step, all the objects are created in the Amazon Redshift public schema, which isn’t recommended. The following best practices aren’t specific to Netezza migration, but you can use them as a checklist during this step:

  • Create schemas to logically separate views and tables.
  • Groups are easier to maintain than many users because you can grant permissions to groups, and you can add and remove users from groups. Also, groups can direct all traffic from all users in the group to a specific Amazon Redshift WLM queue (which can control priorities as well as QMR limits).
  • Grant permissions at the schema level to allow selected groups to access the schema. This is independent of the permissions for the objects within the schema.
  • Finally, assign users to groups (an example of these statements follows this list).
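
The following statements sketch these steps for a hypothetical reporting schema and group; the schema, group, and user names and the password are placeholders, so adapt them to your own naming conventions. (Routing a group to a specific WLM queue is configured in the cluster’s WLM settings, not in SQL.)

-- Create a schema to logically separate the migrated tables and views
CREATE SCHEMA sales_dm;

-- Create a group and a user, and add the user to the group
CREATE GROUP reporting_ro;
CREATE USER report_user PASSWORD '<StrongPassword1>' IN GROUP reporting_ro;

-- Grant access at the schema level, then on the objects within the schema
-- (run the object-level grant after the tables have been created)
GRANT USAGE ON SCHEMA sales_dm TO GROUP reporting_ro;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_dm TO GROUP reporting_ro;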

Transforming the schema

The AWS SCT analyzes the Netezza data model schema, converts the syntax into Amazon Redshift-compliant DDL statements, and applies the target schema to the Amazon Redshift cluster. The AWS SCT accelerates this phase by making sure Amazon Redshift best practices are taken into account during the transformation.

Within Amazon Redshift, column-level encoding makes sure that the most performant level of compression is applied to every data block of storage for the tables. It’s recommended that the latest ZSTD encoding is applied to all varchar, char, Boolean, and geometry columns, and the AZ64 encoding is applied to all other columns, including integers and decimals.

To improve zone map performance, don’t encode the first column of a sort key (set to raw encoding).
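
As an illustration, the following hypothetical table applies these recommendations: ZSTD on the character columns, AZ64 on the numeric columns, and raw encoding on the leading sort key column (the table and column names are invented for this example):

CREATE TABLE sales_dm.store_sales
(
       sale_date     date          ENCODE raw,   -- first column of the sort key stays raw
       store_id      integer       ENCODE az64,
       sale_amount   decimal(18,2) ENCODE az64,
       currency_code char(3)       ENCODE zstd,
       sale_comment  varchar(256)  ENCODE zstd
)
DISTKEY (store_id)
SORTKEY (sale_date);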

Netezza supports both character-length and byte-length semantics.

If character-length semantics (the default) is selected, the length is specified in characters, and a value can consume more bytes than the declared length indicates. For example, if a varchar column’s length is set to 100, it allows multi-byte characters of 1–4 bytes each, up to a maximum of 400 bytes.

If byte-length semantics is selected, the length is specified in bytes, and the column can hold only that number of bytes. For example, a varchar column of length 100 can store up to 100 bytes, whether the characters are single-byte or multi-byte.

As of this writing, Amazon Redshift doesn’t support character-length semantics, which can lead to String length exceeds DDL length errors while loading the data into Amazon Redshift tables. The simplest solution is to multiply the length of such attributes by 4. A more efficient solution is to determine the maximum length of each varchar column in bytes in Netezza, add a 20% buffer to that maximum length, and set the result as the maximum length of the Amazon Redshift varchar column.

If the Netezza column’s maximum length in bytes is already less than the Amazon Redshift column length in bytes, you don’t need to increase the column length in Amazon Redshift. The following query gets the column datatype in Netezza:

SELECT
	DATABASE,
	NAME,
	TYPE,
	ATTNAME,
	FORMAT_TYPE
FROM
	_V_relation_column
WHERE
	DATABASE = 'YOUR_DBASE'
	AND FORMAT_TYPE LIKE '%CHAR%'
	AND name IN ('tablename')
ORDER BY
	NAME;

The following script generates a query to get the maximum amount of bytes actually used for each varchar column:

SELECT
	'Select max(octet_length(' || ATTNAME || ')) from ' || DATABASE || '.' || name || ';'
FROM
	_v_relation_column
WHERE
	DATABASE = 'YOUR_DBASE'
	AND FORMAT_TYPE LIKE '%CHAR%'
	AND name IN ('TABLE')
ORDER BY
	NAME;

During data migration, you can use the following query to identify the reason for the load failure:

SELECT
	DISTINCT(ti."table"),
	ti."schema",
	starttime,
	err.tbl,
	err.colname,
	err.type,
	err.col_length,
	err.err_reason
FROM
	stl_load_errors err,
	svv_table_info ti
WHERE
	starttime > 'YYYY-MM-DD'
	AND ti.table_id = err.tbl
	AND err.err_reason = 'String length exceeds DDL length'
ORDER BY
	starttime DESC;

If the error reason is String length exceeds DDL length, you need to increase the length of the affected column.

As recommended earlier, take the maximum column length in bytes from Netezza, add a 20% buffer, and set the result as the maximum length in Amazon Redshift.

The following is an example of the output of the generator script shown earlier; run it against Netezza to find the maximum number of bytes actually used:

SELECT
	max(octet_length(column_a))
FROM
	schema_a.table_a;

The following code applies the corresponding change in Amazon Redshift:

ALTER TABLE schema_a.table_a 
ALTER COLUMN column_a TYPE varchar(60);

AWS SCT uses statistics from the source database with user-specified optimization strategies to determine the appropriate distribution key and sort key strategies for the target schema. These optimization strategies require collecting statistics from the source database in order to activate the most relevant optimization rule for each table.

It’s recommended to do the following:

  • Choose the current Netezza key distribution style as a good starting point for an Amazon Redshift table’s key distribution strategy. Once the table is in Amazon Redshift and running representative workloads, you can optimize the distribution choice if needed.
  • Set the Amazon Redshift distribution style to auto for all Netezza tables with random distribution. This makes sure that Amazon Redshift automatically chooses the most performant distribution style depending on the number of rows in the table. (Example statements for adjusting distribution follow this list.)
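
If a distribution choice needs adjusting after the initial conversion, it can be changed in place without recreating the table. The following statements are a sketch; the schema, table, and column names are placeholders:

-- Keep the existing Netezza distribution column as the starting distribution key
ALTER TABLE <schema_name>.<table_name> ALTER DISTKEY <column_name>;

-- For tables that used random distribution in Netezza, let Amazon Redshift decide
ALTER TABLE <schema_name>.<table_name> ALTER DISTSTYLE AUTO;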

Migrating the data

You use the AWS SCT to migrate the data from the source Netezza data warehouse to the Amazon Redshift cluster. The AWS SCT migrates data with a three-phase approach:

  • Extract – Extracts data from Netezza and stores it into the file system of on-premises AWS SCT data extraction agents
  • Upload – Uploads data from the agents to Amazon Simple Storage Service (Amazon S3)
  • Copy – Loads the data from Amazon S3 into Amazon Redshift via the COPY command

For any migration, especially ones with large volumes of data or many objects to migrate, it’s important to plan and migrate the tables in smaller tasks. This is where tracking the runs and progress via the migration runbook from the assessment phase is important.

Segment the source tables based on their size. The following choices were successful for a 60 TB Netezza migration:

  • One AWS SCT task for all tables less than 5 GB
  • One AWS SCT task for all tables 5–15 GB
  • Multiple AWS SCT tasks for tables 15–50 GB, with a few tables per task
  • One AWS SCT task for each table bigger than 50 GB

You should refine the task configuration according to the available migration windows. This approach balances the following considerations:

  • A task is an atomic process: if it succeeds, all the tables it manages are migrated successfully
  • If a task fails, it might be more convenient to rerun the entire task from scratch rather than check the status and consistency of each individual table
  • Task size is therefore a trade-off between these two opposing considerations

To manage the substitution of special characters during these phases, set the following parameters (an illustrative COPY command reflecting these settings follows this list):

  • For NULL values as a string, enter ~~~~. By default, this option is not selected. Numeric and date type nulls are by default extracted as '\N' and loaded into Amazon Redshift as nulls.
    • If the option is deselected, or if it is selected and the value is left blank, AWS SCT extracts char and varchar type nulls as '\N' and the COPY command has the NULL AS '\N' parameter set. This causes issues during COPY operations when the data contains the value 'N' in any column.
    • If the option is selected and the value is ~~~~, AWS SCT extracts char and varchar type nulls as ~~~~ and the COPY command has the NULL AS '~~~~' parameter set. Using junk characters (such as ~~~~) means char and varchar null values are extracted as ~~~~, and the COPY command replaces ~~~~ with NULL on load. This way, char and varchar null values are extracted and loaded correctly, and the COPY operation isn’t affected when the data contains the value 'N' in any column.
    • If the option is selected and the value is left as '' (an empty string), AWS SCT extracts char and varchar type nulls as '' and the COPY command has the NULL AS '' parameter set. NULL AS '' is equivalent to EMPTYASNULL.
  • Deselect Use blank as null value. If BLANKSASNULL is set (which is the default setting), it replaces whitespace-only values (' ') with NULL for char and varchar datatypes, and if the column is NOT NULL, inserting NULL fails. Deselecting BLANKSASNULL loads the data as it is in the source.
  • Deselect Use empty as null value. If EMPTYASNULL is set (which is the default setting), it replaces empty data (two delimiters in succession with no characters between the delimiters) with NULL for char and varchar datatypes. This is not needed.
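
With these settings, the load behaves like the following sketch of a COPY command; the actual statement is generated by the AWS SCT, and the bucket, role, and object names here are placeholders:

COPY <schema_name>.<table_name>
FROM 's3://<bucket_name>/<task_name>/<table_name>/'
IAM_ROLE 'arn:aws:iam::<account_id>:role/<redshift_load_role>'
DELIMITER '|'
GZIP
NULL AS '~~~~';
-- No BLANKSASNULL or EMPTYASNULL options, so blank and empty strings load as they are in the source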

The following screenshot shows our configuration for the AWS SCT tasks.

To keep track of the tasks and record them accurately in the migration runbook, on the AWS S3 settings tab, set the folder name to be the same as the task name. Using a consistent naming convention allows easier tracking of progress in the runbook, and is useful during troubleshooting for any issues encountered.

For each subject area in scope, the extraction can run during the day, sharing connections and threads with other processes. For the initial data load, however, it’s recommended to schedule the tasks during the evening, the weekend, or another agreed window, with as many Netezza resources as possible allocated to the migration.

Breaking the migration down into smaller tasks allows you to log the progress in the migration runbook and run individual tasks to completion during the allocated migration window.

It’s recommended to migrate a small sample table first to test the parameter settings. The following sample table contains specific examples of edge cases that can provide quick feedback as to the suitability of the parameter settings:

create table <schema>.test_dummy
(
       idrow integer,
       field1 integer,
       field2 character varying (50)
);
 
insert into <schema>.test_dummy (idrow, field1, field2) values (1, null, null);
insert into <schema>.test_dummy (idrow, field1, field2) values (2, null, '');
insert into <schema>.test_dummy (idrow, field1, field2) values (3, null, '    ');
insert into <schema>.test_dummy (idrow, field1, field2) values (4, 34, '  Test4   ');
insert into <schema>.test_dummy (idrow, field1, field2) values (5, 15, '');
insert into <schema>.test_dummy (idrow, field1, field2) values (6, 25, '   ');
insert into <schema>.test_dummy (idrow, field1, field2) values (7, 655, 'Test7');
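
After loading this table, a quick verification query such as the following (a suggested check, run in Amazon Redshift) shows whether nulls, empty strings, and whitespace-only values arrived as expected:

SELECT idrow,
       field1,
       field2,
       field2 IS NULL AS field2_is_null,
       octet_length(field2) AS field2_bytes
FROM <schema>.test_dummy
ORDER BY idrow;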

When migrating large Netezza tables, data is migrated on a table-by-table basis using multiple data extraction agents. You should split large tables (for example, tables with more than 20 million rows or greater than 50 GB) into partitions using the AWS SCT virtual partitions functionality. Using virtual partitioning is a recommended best practice for data warehouse migrations using the AWS SCT extractors.

Virtual partitions decrease the migration timeline of a table by parallelizing the extraction of a configurable amount of subsections. You can migrate partitions in parallel, and extract failure is limited to a single partition instead of the entire table.

The AWS SCT creates a subtask for each table partition. Then, when the migration is running, AWS SCT assigns the subtask to an available data extractor to run. The AWS SCT orchestrates which subtask runs on which extractor, thereby keeping all extractors as busy as possible throughout the migration.

To use virtual partitioning, you should identify an attribute that you can use to evenly split the table. It’s important that the virtual partitions are well balanced in order to exploit the benefit of the parallelism. The AWS SCT defines these partitions virtually at extraction time; they aren’t related to how data is stored in the source data warehouse.

AWS SCT provides three types of virtual partitioning: list, range, and auto split. For more information, see Use virtual partitioning in the AWS Schema Conversion Tool.

When using list partitioning, for very big tables (over 100 GB), the Netezza data slice IDs are an option for the partition key.
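
To enumerate candidate values for such a list partition, you can query the table’s hidden datasliceid column in Netezza (a suggested approach; datasliceid identifies the data slice that stores each row):

SELECT DISTINCT datasliceid
FROM <schema_name>.<table_name>
ORDER BY datasliceid;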

Migrating to other pre-production environments

After the data migration has successfully been proven in the development environment, you may choose to migrate to other pre-production environments. Apply the same steps and validation checks, including:

  • Validate that the schema deployment matches the development environment.
  • Validate the data migration has completed successfully, and that no data load errors are logged into the STL_LOAD_ERRORS table. The typical reasons for errors at this stage include schema mismatch, different input file formats, or insufficient varchar length for the input data.
  • Validate the ETL deployment is loading the data as expected.

Migrating to the production environment

Migration to the production environment follows the same processes as the non-production environments, with the addition of the following steps:

  • Undertake the task of business validation with your stakeholders to measure the accuracy of the migration in meeting the program goals:
    1. Undertake a period of dual-running the ETL deployment with production data being dual-loaded into the Netezza data warehouse and the production Amazon Redshift cluster.
    2. Compare the results sets from the Netezza data warehouse and the production Amazon Redshift cluster (the data validation scripts in the following section support this task).
    3. Update the migration runbook for each source table to record the number of records migrated, which validation checks have been run, and any discrepancies found during the checks.
    4. Run reports and dashboards against the Netezza data warehouse and the production Amazon Redshift cluster and ensure the results match.
    5. Obtain sign-off upon successful completion of these business validation tests.
  • After you successfully complete the dual-running of both ETL and reporting deployments, the source of truth is transferred from the Netezza data warehouse to the production Amazon Redshift cluster by decommissioning the Netezza ETL deployment and the Netezza data warehouse, and re-pointing all reporting and dashboard connections to the Amazon Redshift cluster.
  • When the Amazon Redshift cluster is live, monitor the cluster and ensure data model best practices are being followed.

Validating the data

After you migrate the data model schema and data contents to Amazon Redshift, you should perform data-validation tests to measure the migration’s success. The scripts included in this section cover checks commonly undertaken during migration engagements. All these scripts must be run by a superuser account.

Amazon Redshift utilities

The Amazon Redshift Utilities GitHub repo contains a set of utilities to accelerate troubleshooting or analysis on Amazon Redshift. Such utilities consist of queries, views, and scripts. These scripts aren’t deployed by default into Amazon Redshift clusters. The recommendation is to deploy the views into an admin schema.
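
A minimal sketch of that deployment, assuming you name the schema admin and use the v_generate_tbl_ddl view from the repository, looks like the following:

-- Create a dedicated schema to hold the utility views
CREATE SCHEMA IF NOT EXISTS admin;

-- Deploy the view definitions from the repository into the schema,
-- then query them like any other view
SELECT * FROM admin.v_generate_tbl_ddl WHERE schemaname = '<schema_name>';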

Comparing source vs. target table and view counts

For Netezza, enter the following code:

SET CATALOG <database_name>;

SELECT
	'<schema_name>' ,
	sum(CASE OBJTYPE WHEN 'TABLE' THEN 1 ELSE 0 END) AS table_count ,
	sum(CASE OBJTYPE WHEN 'VIEW' THEN 1 ELSE 0 END) AS view_count
FROM
	_v_objects
WHERE
	OBJTYPE IN ('TABLE', 'VIEW')
	AND OBJNAME IN ('<table_name>')
GROUP BY
	SCHEMA;

For Amazon Redshift, enter the following code:

SELECT
	trim(pg_namespace.nspname) AS schema_name ,
	sum(CASE pg_class.relkind WHEN 'r' THEN 1 ELSE 0 END) AS table_count ,
	sum(CASE pg_class.relkind WHEN 'v' THEN 1 ELSE 0 END) AS view_count
FROM
	pg_namespace,
	pg_class
WHERE
	pg_class.relnamespace = pg_namespace.oid
	AND pg_class.relkind IN ('r', 'v')
	AND schema_name IN ('<schema_name>')
	AND pg_class.relname IN ('<table_name>')
GROUP BY
	schema_name
ORDER BY
	1;

Comparing source vs. target table constraints

For Netezza, enter the following code:

SET CATALOG <database_name>;

SELECT
	nc.database,
	nc.Schema_Name,
	nc.Table_Name,
	ISNULL(pk_count, 0) pk_count,
	ISNULL(fk_count, 0) fk_count,
	ISNULL(uk_count, 0) uk_count,
	ISNULL(ck_count, 0) ck_count,
	ISNULL(nn_t_count, 0) nn_count,
	ISNULL(pk_count, 0)+ ISNULL(fk_count, 0)+ ISNULL(uk_count, 0)+ ISNULL(ck_count, 0)+ ISNULL(nn_t_count, 0) Total_Count
FROM
	(
	SELECT
		database, Schema_Name, Table_Name, Table_Id, sum(CASE Constraint_Type WHEN 'p' THEN 1 ELSE 0 END) AS pk_count, sum(CASE Constraint_Type WHEN 'f' THEN 1 ELSE 0 END) AS fk_count, sum(CASE Constraint_Type WHEN 'u' THEN 1 ELSE 0 END) AS uk_count, sum(CASE Constraint_Type WHEN 'c' THEN 1 ELSE 0 END) AS ck_count
	FROM
		(
		SELECT
			DISTINCT database, SCHEMA Schema_Name, relation Table_Name, contype Constraint_Type, constraintname, objid Table_Id
		FROM
			_v_relation_keydata) in1
	GROUP BY
		database, Schema_Name, Table_Name, Table_Id) oc
RIGHT OUTER JOIN (
	SELECT
		database, SCHEMA Schema_Name, name Table_Name, objid Table_Id, sum(CASE attnotnull WHEN TRUE THEN 1 ELSE 0 END) AS nn_t_count, sum(CASE attnotnull WHEN FALSE THEN 1 ELSE 0 END) AS nn_f_count, count(attnotnull) nn_total_count
	FROM
		_v_relation_column
	WHERE
		TYPE = 'TABLE'
		AND attnum>0
	GROUP BY
		database, Schema_Name, Table_Name, Table_Id) nc ON
	(oc.Table_Id = nc.Table_Id)
WHERE
	nc.database = '<database_name>'
	AND nc.table_name IN ('<table_name>')
ORDER BY
	1, 2;

For Amazon Redshift, enter the following code:

SELECT
	Schema_Name,
	Table_Name,
	pk_count,
	fk_count,
	uk_count,
	ck_count,
	ISNULL(nn_count, 0) nn_count,
	pk_count + fk_count + uk_count + ck_count + ISNULL(nn_count, 0) Total_Count
FROM
	(
	SELECT
		Schema_Name, Table_Name, Table_Id, sum(CASE Constraint_Type WHEN 'p' THEN 1 ELSE 0 END) AS pk_count, sum(CASE Constraint_Type WHEN 'f' THEN 1 ELSE 0 END) AS fk_count, sum(CASE Constraint_Type WHEN 'u' THEN 1 ELSE 0 END) AS uk_count, sum(CASE Constraint_Type WHEN 'c' THEN 1 ELSE 0 END) AS ck_count
	FROM
		(
		SELECT
			trim(pg_namespace.nspname) Schema_Name, trim(pg_class.relname) Table_Name, trim(pg_constraint.conname) Constraint_Name, pg_constraint.contype Constraint_Type, pg_class.oid Table_Id
		FROM
			pg_namespace
		INNER JOIN pg_class ON
			pg_namespace.oid = pg_class.relnamespace
		LEFT OUTER JOIN pg_constraint ON
			pg_constraint.conrelid = pg_class.oid
		WHERE
			schema_name NOT IN ('pg_catalog', 'pg_toast', 'information_schema')
			AND schema_name NOT LIKE '%_ext'
			AND pg_class.relkind = 'r')
	GROUP BY
		Schema_Name, Table_Name, Table_Id) oc
LEFT OUTER JOIN (
	SELECT
		attrelid Table_Id, count(attnotnull) nn_count
	FROM
		pg_attribute
	WHERE
		attnotnull = TRUE
		AND attnum>0
	GROUP BY
		Table_Id) nc ON
	(oc.Table_Id = nc.Table_Id)
WHERE
	schema_name IN ('<schema_name>')
	AND table_name IN ('<table_name>')
ORDER BY
	1, 2;

Generating missing constraints from Netezza

Run the following SQL statements in Netezza to generate the DDL statements to add any missing constraints in Amazon Redshift:

-- Generate Primary Key Constraints DDL
SET CATALOG <database_name>;

SELECT
	'ALTER TABLE <schema_name>.' || relation || ' ADD CONSTRAINT ' || constraintname || ' PRIMARY KEY (' || attname || ')'
FROM
	_V_RELATION_KEYDATA
WHERE
	DATABASE = '<database_name>'
	AND relation IN ('<table_name>')
	AND contype = 'p';

-- Generate Unique Key Constraints DDL
SET CATALOG <database_name>;

SELECT
	'ALTER TABLE <schema_name>.' || relation || ' ADD CONSTRAINT ' || constraintname || ' UNIQUE (' || attname || ')'
FROM
	_V_RELATION_KEYDATA
WHERE
	DATABASE = '<database_name>'
	AND relation IN ('<table_name>')
	AND contype = 'u';

-- Generate Foreign Key Constraints DDL  
SET CATALOG <database_name>;

SELECT
	'ALTER TABLE <schema_name>.' || relation || ' ADD CONSTRAINT ' || constraintname || ' FOREIGN KEY (' || attname || ') REFERENCES <schema_name>.' || pkrelation || '(' || pkattname || ')' refconstrname
FROM
	_V_RELATION_KEYDATA
WHERE
	DATABASE = '<database_name>'
	AND relation IN ('<table_name>')
	AND contype = 'f';

Run the generated script against the Amazon Redshift database.
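
A generated statement looks like the following (hypothetical table, constraint, and column names). Note that in Amazon Redshift, primary key, unique, and foreign key constraints are informational only and aren’t enforced, but they can help the query planner:

ALTER TABLE <schema_name>.customer ADD CONSTRAINT customer_pk PRIMARY KEY (customer_id);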

Identifying tables with insufficient varchar column length

For Netezza, enter the following code:

-- Generate SQL for Varchar Column Length in Bytes   
SET CATALOG <database_name>;

SELECT
	'SELECT ''' || database || ''' database_name,''' || SCHEMA || ''' schema_name,''' || name || ''' table_name, ''' || attname || ''' column_name, ''' || format_type || ''' data_type, ''' || attcolleng || ''' data_type_length_char, ' || 'MAX(OCTET_LENGTH(' || attname || ')) max_bytes FROM ' || SCHEMA || '.' || name || ' UNION ALL'
FROM
	_v_relation_column
WHERE
	TYPE = 'TABLE'
	AND format_type LIKE 'CHARACTER VARYING%'
	AND database = '<database_name>'
	AND name IN ('<table_name>');

For Amazon Redshift, enter the following code:

-- Varchar Column Length in Bytes 
SELECT
	trim(pg_namespace.nspname) Schema_Name,
	trim(pg_class.relname) Table_Name,
	trim(pg_attribute.attname) Column_Name,
	trim(pg_type.typname) Data_Type,
	pg_attribute.atttypmod-4 Data_Type_Length_Bytes
FROM
	pg_attribute
JOIN pg_type ON
	pg_type.oid = pg_attribute.atttypid
JOIN pg_class ON
	pg_class.oid = pg_attribute.attrelid
JOIN pg_namespace ON
	pg_namespace.oid = pg_class.relnamespace
WHERE
	trim(pg_type.typname) LIKE 'varchar%'
	AND Data_Type_Length_Bytes <> 1
	AND Schema_Name IN ('<schema_name>')
	AND Table_Name IN ('<table_name>')
ORDER BY
	1, 2, 3;

Comparing source vs. target row count

Remove the final UNION ALL from the output of the following two scripts before running the generated SQL.

For Netezza, enter the following:

SET CATALOG <database_name>;

SELECT
	'SELECT ''' || database || ''' database_name,''' || SCHEMA || ''' schema_name,''' || tablename || ''' table_name,COUNT(*) count_of_rows from ' || SCHEMA || '.' || tablename || ' UNION ALL'
FROM
	_v_table
WHERE
	OBJTYPE = 'TABLE'
	AND database = '<database_name>'
	AND tablename IN ('<table_name>');

For Amazon Redshift, enter the following code:

SELECT
	'SELECT ''' || schema_name || ''' schema_name,''' || table_name || ''' table_name,COUNT(*) count_of_rows from ' || schema_name || '.' || table_name || ' UNION ALL'
FROM
	(
	SELECT
		trim(pg_namespace.nspname) schema_name, trim(pg_class.relname) table_name
	FROM
		pg_namespace
	INNER JOIN pg_class ON
		pg_namespace.oid = pg_class.relnamespace
	WHERE
		pg_class.relkind = 'r'
		AND schema_name IN ('<schema_name>')
		AND table_name IN ('<table_name>')
	ORDER BY
		1, 2 );
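
The generated statements, after removing the final UNION ALL, look like the following (hypothetical schema and table names):

SELECT 'schema_a' schema_name, 'table_a' table_name, COUNT(*) count_of_rows FROM schema_a.table_a UNION ALL
SELECT 'schema_a' schema_name, 'table_b' table_name, COUNT(*) count_of_rows FROM schema_a.table_b;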

Comparing source vs. target columns

For Netezza, enter the following code:

SET CATALOG <database_name>;

SELECT
	database,
	SCHEMA Schema_Name,
	name Table_Name,
	attnum,
	attname Column_Name,
	Format_Type Data_Type
FROM
	_v_relation_column
WHERE
	TYPE = 'TABLE'
	AND attnum>0
	AND database = '<database_name>'
	AND name IN ('<table_name>')
ORDER BY
	name,
	attnum;

For Amazon Redshift, enter the following code:

SET search_path TO <schema_name1>,<schema_name2>;

SELECT
	trim(pgn.nspname) AS Schema_Name,
	trim(pgc.relname) AS Table_Name,
	det.attnum,
	det.attname AS Column_Name,
	def.type Data_Type
FROM
	pg_class AS pgc
JOIN pg_namespace AS pgn ON
	pgn.oid = pgc.relnamespace
LEFT OUTER JOIN (
	SELECT
		attrelid, attname, attnum
	FROM
		pg_attribute
	WHERE
		attnum>0) AS det ON
	det.attrelid = pgc.oid
LEFT OUTER JOIN pg_table_def def ON
	(def.schemaname = pgn.nspname
	AND def.tablename = pgc.relname
	AND def."column" = det.attname)
WHERE
	Schema_Name IN ('<schema_name>')
	AND Table_Name IN ('<table_name>')
ORDER BY
	1, 2, 3;

Comparing source vs. target distribution key

For Netezza, enter the following code:

SET CATALOG <database_name>;

SELECT
	t.DATABASE,
	t.SCHEMA,
	t.tablename,
	td.attname AS Dist_Key
FROM
	_v_table t
LEFT OUTER JOIN _v_table_dist_map td ON
	t.tablename = td.tablename
	AND (td.distseqno IS NULL
	OR td.distseqno = 1)
WHERE
	t.database = '<database_name>'
	AND t.tablename IN ('<table_name>')
ORDER BY
	1,
	3;

For Amazon Redshift, enter the following code:

SELECT
	trim(pgn.nspname) AS Schema_Name,
	trim(pgc.relname) AS Table_Name,
	decode(pgc.reldiststyle, 0, 'even', 1, 'key', 8, 'all') AS dist_style,
	det.dist_key AS dist_key
FROM
	pg_class AS pgc
JOIN pg_namespace AS pgn ON
	pgn.oid = pgc.relnamespace
JOIN (
	SELECT
		attrelid, min(CASE attisdistkey WHEN 't' THEN attname ELSE NULL END) AS dist_key
	FROM
		pg_attribute
	GROUP BY
		1) AS det ON
	det.attrelid = pgc.oid
WHERE
	Schema_Name IN ('<schema_name>')
	AND Table_Name IN ('<table_name>')
ORDER BY
	1, 2;

Verifying if any invalid UTF-8 characters were replaced

For Amazon Redshift, enter the following code:

-- Validate if any invalid characters are replaced with '?' during COPY

SELECT userid,
       slice,
       tbl,
       starttime,
       session,
       query,
       filename,
       line_number,
       colname,
       raw_line 
FROM   stl_replacements;

Identifying COPY errors

For Amazon Redshift, enter the following code:

SELECT
	ti."schema",
	ti."table",
	starttime,
	err.tbl,
	err.colname,
	err.type,
	err.col_length,
	err.position,
	err.err_reason,
	err.filename,
	err.raw_line,
	err.raw_field_value
FROM
	stl_load_errors err,
	svv_table_info ti
WHERE
	starttime > '<YYYY-MM-DD>'
	AND ti.table_id = err.tbl
	AND ti."table" = '<Table_Name>'
ORDER BY
	1, 2, 3 DESC;

Additional data validation checks

In addition to checking the row count for each table, you should perform tests on data quality to guarantee production data readiness:

  • During this activity, run tailored queries and validate them against the Amazon Redshift tables and views. The recommendation is to run such checks against records that include NULL values as well as strings with trailing whitespace.
  • Compute and compare statistics (min, max, average, sum, checksums) on numeric attributes against their Netezza equivalents (a sketch of such a query follows this list).
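
The following is a sketch of such a comparison for one numeric column; run the equivalent query on Netezza and on Amazon Redshift and compare the results (the schema, table, and column names are placeholders):

SELECT count(*)              AS row_count,
       min(<numeric_column>) AS min_value,
       max(<numeric_column>) AS max_value,
       avg(<numeric_column>) AS avg_value,
       sum(<numeric_column>) AS sum_value
FROM <schema_name>.<table_name>;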

Conclusion

In this post, we detailed a project migration plan to migrate from Netezza to Amazon Redshift. We included examples of sizing the AWS SCT data extraction agents depending on the volume of data to migrate and the resources made available for the transfer. Validation of successful schema and data migration is vital, and we included several scripts to validate that the data model and data content meet expectations.

Special thanks go to AWS colleagues Arturo Bayo, Boopathi P, and Sunil Vora for their project delivery and support with this post.


About the Authors

Mattia Berlusconi is a Data & Analytics consultant with AWS Professional Services, supporting enterprises in adopting innovative solutions for organizing and exploiting data to achieve their business objectives. He specializes in building data platforms and managing database migrations.


Simon Dimaline has specialised in data warehousing and data modelling for more than 20 years. He currently works for the Data & Analytics practice within AWS Professional Services accelerating customers’ adoption of AWS analytics services.