All posts by Suresh Patnam

Migrate a large data warehouse from Greenplum to Amazon Redshift using AWS SCT – Part 2

2022-07-15 Suresh Patnam

Post Syndicated from Suresh Patnam original https://aws.amazon.com/blogs/big-data/part-2-migrate-a-large-data-warehouse-from-greenplum-to-amazon-redshift-using-aws-sct/

In this second post of a multi-part series, we share best practices for choosing the optimal Amazon Redshift cluster, data architecture, converting stored procedures, compatible functions and queries widely used for SQL conversions, and recommendations for optimizing the length of data types for table columns. You can check out the first post of this series for guidance on planning, running, and validation of a large-scale data warehouse migration from Greenplum to Amazon Redshift using AWS Schema Conversion Tool (AWS SCT).

Choose your optimal Amazon Redshift cluster

Amazon Redshift has two types of clusters: provisioned and serverless. For provisioned clusters, you need to set up the same with required compute resources. Amazon Redshift Serverless can run high-performance analytics in the cloud at any scale. For more information, refer to Introducing Amazon Redshift Serverless – Run Analytics At Any Scale Without Having to Manage Data Warehouse Infrastructure.

An Amazon Redshift cluster consists of nodes. Each cluster has a leader node and one or more compute nodes. The leader node receives queries from client applications, parses the queries, and develops query run plans. The leader node then coordinates the parallel run of these plans with the compute nodes and aggregates the intermediate results from these nodes. It then returns the results to the client applications.

When determining your type of cluster, consider the following:

Estimate the size of the input data compressed, vCPU, and performance. As of this writing, we recommend the Amazon Redshift RA3 instance with managed storage, which scales compute and storage independently for fast query performance.
Amazon Redshift provides an automated “Help me choose” cluster based on the size of your data.
A main advantage of a cloud Amazon Redshift data warehouse is that you’re no longer stuck with hardware and commodities like old guard data warehouses. For faster innovation, you have the option to try different cluster options and choose the optimized one in terms of performance and cost.
At the time of development or pilot, you can usually start with a smaller number of nodes. As you move to production, you can adjust the number of nodes based on your usage pattern. When right-sizing your clusters, we recommend choosing the reserved instance type to cut down the cost even further. The public-facing utility Simple Replay can help you determine performance against different cluster types and sizes by replaying the customer workload. For provisioned clusters, if you’re planning to use the recommended RA3 instance, you can compare different node types to determine the right instance type.
Based on your workload pattern, Amazon Redshift supports resize, pause and stop, and concurrency scaling of the cluster. Amazon Redshift workload management (WLM) enables effective and flexible management of memory and query concurrency.

Create data extraction tasks with AWS SCT

With AWS SCT extraction agents, you can migrate your source tables in parallel. These extraction agents authenticate using a valid user on the data source, allowing you to adjust the resources available for that user during the extraction. AWS SCT agents process the data locally and upload it to Amazon Simple Storage Service (Amazon S3) through the network (via AWS Direct Connect). We recommend having a consistent network bandwidth between your Greenplum machine where the AWS SCT agent is installed and your AWS Region.

If you have tables around 20 million rows or 1 TB in size, you can use the virtual partitioning feature on AWS SCT to extract data from those tables. This creates several sub-tasks and parallelizes the data extraction process for this table. Therefore, we recommend creating two groups of tasks for each schema that you migrate: one for small tables and one for large tables using virtual partitions.

For more information, refer to Creating, running, and monitoring an AWS SCT data extraction task.

Data architecture

To simplify and modernize your data architecture, consider the following:

Establish accountability and authority to enforce enterprise data standards and policies.
Formalize the data and analytics operating model between enterprise and business units and functions.
Simplify the data technology ecosystem through rationalization and modernization of data assets and tools or technology.
Develop organizational constructs that facilitate more robust integration of the business and delivery teams, and build data-oriented products and solutions to address the business problems and opportunities throughout the lifecycle.
Back up the data periodically so that if something is wrong, you have the ability to replay.
During planning, design, execution, and throughout implementation and maintenance, ensure data quality management is added to achieve the desired outcome.
Simple is the key to an easy, fast, intuitive, and low-cost solution. Simple scales much better than complex. Simple makes it possible to think big (Invent and Simplify is another Amazon leadership principle). Simplify the legacy process by migrating only the necessary data used in tables and schemas. For example, if you’re performing truncate and load for incremental data, identify a watermark and only process incremental data.
You may have use cases that requiring record-level inserts, updates, and deletes for privacy regulations and simplified pipelines; simplified file management and near-real-time data access; or simplified change data capture (CDC) data pipeline development. We recommend using purposeful tools based on your use case. AWS offers the options to use Apache HUDI with Amazon EMR and AWS Glue.

Migrate stored procedures

In this section, we share best practices for stored procedure migration from Greenplum to Amazon Redshift. Data processing pipelines with complex business logic often use stored procedures to perform the data transformation. We advise using big data processing like AWS Glue or Amazon EMR to modernize your extract, transform, and load (ETL) jobs. For more information, check out Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift. For time-sensitive migration to cloud-native data warehouses like Amazon Redshift, redesigning and developing the entire pipeline in a cloud-native ETL tool might be time-consuming. Therefore, migrating the stored procedures from Greenplum to Amazon Redshift stored procedures can be the right choice.

For a successful migration, make sure to follow Amazon Redshift stored procedure best practices:

Specify the schema name while creating a stored procedure. This helps facilitate schema-level security and you can enforce grants or revoke access control.
To prevent naming conflicts, we recommend naming procedures using the prefix sp_. Amazon Redshift reserves the sp_ prefix exclusively for stored procedures. By prefixing your procedure names with sp_, you ensure that your procedure name won’t conflict with any existing or future Amazon Redshift procedure names.
Qualify your database objects with the schema name in the stored procedure.
Follow the minimal required access rule and revoke unwanted access. For similar implementation, make sure the stored procedure run permission is not open to ALL.
The SECURITY attribute controls a procedure’s privileges to access database objects. When you create a stored procedure, you can set the SECURITY attribute to either DEFINER or INVOKER. If you specify SECURITY INVOKER, the procedure uses the privileges of the user invoking the procedure. If you specify SECURITY DEFINER, the procedure uses the privileges of the owner of the procedure. INVOKER is the default. For more information, refer to Security and privileges for stored procedures.
Managing transactions when it comes to stored procedures are important. For more information, refer to Managing transactions.
TRUNCATE issues a commit implicitly inside a stored procedure. It interferes with the transaction block by committing the current transaction and creating a new one. Exercise caution while using TRUNCATE to ensure it never breaks the atomicity of the transaction. This also applies for COMMIT and ROLLBACK.
Adhere to cursor constraints and understand performance considerations while using cursor. You should use set-based SQL logic and temporary tables while processing large datasets.
Avoid hardcoding in stored procedures. Use dynamic SQL to construct SQL queries dynamically at runtime. Ensure appropriate logging and error handling of the dynamic SQL.
For exception handling, you can write RAISE statements as part of the stored procedure code. For example, you can raise an exception with a custom message or insert a record into a logging table. For unhandled exceptions like WHEN OTHERS, use built-in functions like SQLERRM or SQLSTATE to pass it on to the calling application or program. As of this writing, Amazon Redshift limits calling a stored procedure from the exception block.

Sequences

You can use IDENTITY columns, system timestamps, or epoch time as an option to ensure uniqueness. The IDENTITY column or a timestamp-based solution might have sparse values, so if you need a continuous number sequence, you need to use dedicated number tables. You can also use of the RANK() or ROW_NUMBER() window function over the entire set. Alternatively, get the high-water mark from the existing ID column from the table and increment the values while inserting records.

Character datatype length

Greenplum char and varchar data type length is specified in terms of character length, including multi-byte ones. Amazon Redshift character types are defined in terms of bytes. For table columns using multi-byte character sets in Greenplum, the converted table column in Amazon Redshift should allocate adequate storage to the actual byte size of the source data.

An easy workaround is to set the Amazon Redshift character column length to four times larger than the corresponding Greenplum column length.

A best practice is to use the smallest possible column size. Amazon Redshift doesn’t allocate storage space according to the length of the attribute; it allocates storage according to the real length of the stored string. However, at runtime, while processing queries, Amazon Redshift allocates memory according to the length of the attribute. Therefore, not setting a default size of four times greater helps from a performance perspective.

An efficient solution is to analyze production datasets and determine the maximum byte size length of the Greenplum character columns. Add a 20% buffer to support future incremental growth on the table.

To arrive at the actual byte size length of an existing column, run the Greenplum data structure character utility from the AWS Samples GitHub repo.

Numeric precision and scale

The Amazon Redshift numeric data type has a limit to store up to maximum precision of 38, whereas in a Greenplum database, you can define a numeric column without any defined length.

Analyze your production datasets and determine numeric overflow candidates using the Greenplum data structure numeric utility from the AWS Samples GitHub repo. For numeric data, you have options to tackle this based on your use case. For numbers with a decimal part, you have the option to round the data based on the data type without any data loss in the whole number part. For future reference, you can a keep copy of the column in VARCHAR or store in an S3 data lake. If you see an extremely small percentage of an outlier of overflow data, clean up the source data for quality data migration.

SQL queries and functions

While converting SQL scripts or stored procedures to Amazon Redshift, if you encounter unsupported functions, database objects, or code blocks for which you might have to rewrite the query, create user-defined functions (UDFs), or redesign. You can create a custom scalar UDF using either a SQL SELECT clause or a Python program. The new function is stored in the database and is available for any user with sufficient privileges to run. You run a custom scalar UDF in much the same way as you run existing Amazon Redshift functions to match any functionality of legacy databases. The following are some examples of alternate query statements and ways to achieve specific aggregations that might be required during a code rewrite.

AGE

The Greenplum function AGE () returns an interval subtracting from the current date. You could accomplish the same using a subset of MONTHS_BETWEEN(), ADD_MONTH(), DATEDIFF(), and TRUNC() functions based on your use case.

The following example Amazon Redshift query calculates the gap between the date 2001-04-10 and 1957-06-13 in terms of year, month, and days. You can apply this to any date column in a table.

select
	trunc(trunc(months_between('2001-04-10'::date, '1957-06-13'::date))/ 12) years,
	mod(trunc(months_between('2001-04-10'::date, '1957-06-13'::date))::int4, 12) months,
	'2001-04-10'::date -add_months('1957-06-13'::date,
	trunc(months_between('2001-04-10'::date, '1957-06-13'::date))::int4) days;

COUNT

If you have a use case to get distinct aggregation in the Count() window function, you could accomplish the same using a combination of the Dense_Rank () and Max() window functions.

The following example Amazon Redshift query calculates the distinct item count for a given date of sale:

select
	sale_date,
	item,
	cost,
	max(densernk) over (partition by sale_date order by item rows between unbounded preceding and unbounded following) as distinct_itemcount
from
	(
	select
		*,dense_rank() over (partition by sale_date order by item) as densernk
	from
		testaggr)
order by
	sale_date,
	item,
cost;

ORDER BY

Amazon Redshift aggregate window functions with an ORDER BY clause require a mandatory frame.

The following example Amazon Redshift query creates a cumulative sum of cost by sale date and orders the results by item within the partition:

select
	*,
    sum(cost) over (partition by sale_date
order by
	item rows between unbounded preceding and unbounded following) as total_cost_by_date
from
	testaggr
order by
	sale_date,
	item,
	cost;

STRING_AGG

In Greenplum, STRING_AGG() is an aggregate function, which is used to concatenate a list of strings. In Amazon Redshift, use the LISTAGG() function.

The following example Amazon Redshift query returns a semicolon-separated list of email addresses for each department:

select
	dept,
	listagg(email_address,';') 
within group (order by dept) as email_list
from
	employee_contact
group by
	dept
order by
	dept;

ARRAY_AGG

In Greenplum, ARRAY_AGG() is an aggregate function that takes a set of values as input and returns an array. In Amazon Redshift, use a combination of the LISTAGG() and SPLIT_TO_ARRAY() functions. The SPLIT_TO_ARRAY() function returns a SUPER datatype.

The following example Amazon Redshift query returns an array of email addresses for each department:

select
	dept,
	SPLIT_TO_ARRAY(email_list,
	';') email_array
from
	(
	select
		dept,
		listagg(email_address,
		';') 
within group (
		order by dept) as email_list
	from
		employee_contact
	group by
		dept
	order by
		dept);

To retrieve array elements from a SUPER expression, you can use the SUBARRAY() function:

 select
	SUBARRAY( email_array,
	0,
	1 ) first_element,
	SUBARRAY( email_array,
	1,
	1) second_element,
	SUBARRAY( email_array,
	0 ) all_element
from
	testarray
where
	dept = 'HR';

UNNEST

In Greenplum, you can use the UNNEST function to split an array and convert the array elements into a set of rows. In Amazon Redshift, you can use PartiQL syntax to iterate over SUPER arrays. For more information, refer to Querying semistructured data.

create temp table unnesttest as
select
	json_parse('{"scalar_array": [0,1,2,3,4,5.5,6,7.9,8,9]}') as data;

select
	element
from
	unnesttest as un,
	un.data.scalar_array as element at index;

WHERE

You can’t use a window function in the WHERE clause of a query in Amazon Redshift. Instead, construct the query using the WITH clause and then refer the calculated column in the WHERE clause.

The following example Amazon Redshift query returns the sale date, item, and cost from a table for the sales dates where the total sale is more than 100:

with aggrcost as (
select
	sale_date ,
	item,
	cost,
	sum(cost) over (partition by sale_date) as total_sale
from
	testaggr )
select
	*
from
	aggrcost
where
	total_sale > 100;

Refer to the following table for additional Greenplum date/time functions along with the Amazon Redshift equivalent to accelerate you code migration.

.	Description	Greenplum	Amazon Redshift
1	The `now()` function return the start time of the current transaction	`now ()`	`sysdate`
2	`clock_timestamp()` returns the start timestamp of the current statement within a transaction block	`clock_timestamp ()`	`to_date(getdate(),'yyyy-mm-dd') + substring(timeofday(),12,15)::timetz`
3	`transaction_timestamp ()` returns the start timestamp of the current transaction	`transaction_timestamp ()`	`to_date(getdate(),'yyyy-mm-dd') + substring(timeofday(),12,15)::timetz`
4	Interval – This function adds x years and y months to the `date_time_column` and returns a timestamp type	`date_time_column + interval ‘ x years y months’`	`add_months(date_time_column, x*12 + y)`
5	Get total number of seconds between two-time stamp fields	`date_part('day', end_ts - start_ts) * 24 * 60 * 60+ date_part('hours', end_ts - start_ts) * 60 * 60+ date_part('minutes', end_ts - start_ts) * 60+ date_part('seconds', end_ts - start_ts)`	`datediff('seconds', start_ts, end_ts)`
6	Get total number of minutes between two-time stamp fields	`date_part('day', end_ts - start_ts) * 24 * 60 + date_part('hours', end_ts - start_ts) * 60 + date_part('minutes', end_ts - start_ts)`	`datediff('minutes', start_ts, end_ts)`
7	Extract date part literal from difference of two-time stamp fields	`date_part('hour', end_ts - start_ts)`	`extract(hour from (date_time_column_2 - date_time_column_1))`
8	Function to return the ISO day of the week	`date_part('isodow', date_time_column)`	`TO_CHAR(date_time_column, 'ID')`
9	Function to return ISO year from date time field	`extract (isoyear from date_time_column)`	`TO_CHAR(date_time_column, ‘IYYY’)`
10	Convert epoch seconds to equivalent datetime	`to_timestamp(epoch seconds)`	`TIMESTAMP 'epoch' + Number_of_seconds * interval '1 second'`

Amazon Redshift utility for troubleshooting or running diagnostics for the cluster

The Amazon Redshift Utilities GitHub repo contains a set of utilities to accelerate troubleshooting or analysis on Amazon Redshift. Such utilities consist of queries, views, and scripts. They are not deployed by default onto Amazon Redshift clusters. The best practice is to deploy the needed views into the admin schema.

Conclusion

In this post, we covered prescriptive guidance around data types, functions, and stored procedures to accelerate the migration process from Greenplum to Amazon Redshift. Although this post describes modernizing and moving to a cloud warehouse, you should be augmenting this transformation process towards a full-fledged modern data architecture. The AWS Cloud enables you to be more data-driven by supporting multiple use cases. For a modern data architecture, you should use purposeful data stores like Amazon S3, Amazon Redshift, Amazon Timestream, and others based on your use case.

About the Authors

Suresh Patnam is a Principal Solutions Architect at AWS. He is passionate about helping businesses of all sizes transforming into fast-moving digital organizations focusing on big data, data lakes, and AI/ML. Suresh holds a MBA degree from Duke University- Fuqua School of Business and MS in CIS from Missouri State University. In his spare time, Suresh enjoys playing tennis and spending time with his family.

Arunabha Datta is a Sr. Data Architect at AWS Professional Services. He collaborates with customers and partners to architect and implement modern data architecture using AWS Analytics services. In his spare time, Arunabha enjoys photography and spending time with his family.

Migrate a large data warehouse from Greenplum to Amazon Redshift using AWS SCT – Part 1

2022-07-15 Suresh Patnam

Post Syndicated from Suresh Patnam original https://aws.amazon.com/blogs/big-data/part-1-migrate-a-large-data-warehouse-from-greenplum-to-amazon-redshift-using-aws-sct/

A data warehouse collects and consolidates data from various sources within your organization. It’s used as a centralized data repository for analytics and business intelligence.

When working with on-premises legacy data warehouses, scaling the size of your data warehouse or improving performance can mean purchasing new hardware or adding more powerful hardware. This is often expensive and time-consuming. Running your own on-premises data warehouse also requires hiring database managers, administrators to deal with outages, upgrades, and data access requests. As companies become more data-driven, reliable access to centralized data is increasingly important. As a result, there is a strong demand for data warehouses that are fast, accessible, and able to scale elastically with business needs. Cloud data warehouses like Amazon Redshift address these needs while eliminating the cost and risk of purchasing new hardware.

This multi-part series explains how to migrate an on-premises Greenplum data warehouse to Amazon Redshift using AWS Schema Conversion Tool (AWS SCT). In this first post, we describe how to plan, run, and validate the large-scale data warehouse migration. It covers the solution overview, migration assessment, and guidance on technical and business validation. In the second post, we share best practices for choosing the optimal Amazon Redshift cluster, data architecture, converting stored procedures, compatible functions and queries widely used for SQL conversions, and recommendations for optimizing the length of data types for table columns.

Solution overview

Amazon Redshift is an industry-leading cloud data warehouse. Amazon Redshift uses Structured Query Language (SQL) to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes using AWS-designed hardware and machine learning to deliver the best price-performance at any scale.

AWS SCT makes heterogeneous database migrations predictable by automatically converting the source database schema and most of the database code objects, SQL scripts, views, stored procedures, and functions to a format compatible with the target database. AWS SCT helps you modernize your applications simultaneously during database migration. When schema conversion is complete, AWS SCT can help migrate data from various data warehouses to Amazon Redshift using data extraction agents.

The following diagram illustrates our architecture for migrating data from Greenplum to Amazon Redshift using AWS SCT data extraction agents.

Perform a migration assessment

The initial data migration is the first milestone of the project. The main requirements for this phase are to minimize the impact on the data source and transfer the data as fast as possible. To do this, AWS offers several options, depending on the size of the database, network performance (AWS Direct Connect or AWS Snowball), and whether the migration is heterogeneous or not (AWS Database Migration Service (AWS DMS) or AWS SCT).

AWS provides a portfolio of cloud data migration services to provide the right solution for any data migration project. The level of connectivity is a significant factor in data migration, and AWS has offerings that can address your hybrid cloud storage, online data transfer, and offline data transfer needs.

Additionally, the AWS Snow Family makes it simple to get your data into and out of AWS via offline methods. Based on the size of the data, you can use AWS Snowmobile or AWS Snowball if you have petabytes to exabytes of data. To decide which transfer method is better for your use case, refer to Performance for AWS Snowball.

Perform schema conversion with AWS SCT

To convert your schema using AWS SCT, you must start a new AWS SCT project and connect your databases. Complete the following steps:

Install AWS SCT.
Open and initiate a new project.
For Source database engine, choose Greenplum.
For Target database engine, choose Amazon Redshift.
Choose OK.
Open your project and choose Connect to Greenplum.
Enter the Greenplum database information.
Choose Test connection.
Choose OK after a successful connection test.
Choose OK to complete the connection.
Repeat similar steps to establish a connection to your Amazon Redshift cluster.

By default, AWS SCT uses AWS Glue as the extract, transform, and load (ETL) solution for the migration. Before you continue, you must disable this setting.
On the Settings menu, choose Project settings.
Deselect Use AWS Glue.
Choose OK.
In the left pane, choose your schema (right-click) and choose Convert schema.
When asked to replace objects, choose Yes.
When asked to load statistics, choose Continue.

By the end of this step, all Greenplum objects should be migrated to Amazon Redshift syntax. Some objects may be shown in red, meaning that AWS SCT couldn’t fully migrate these objects. You can view an assessment summary of the migration for more information.
On the View menu, choose Assessment report view.

In the bottom pane, you can see Greenplum DDL and Amazon Redshift DDL of the selected objects side by side for comparison.
Choose the schema with a red icon, which indicates that it needs manual conversion.You’re presented with specific actions regarding the tables, constraints, or views that can’t be migrated to Amazon Redshift. You must investigate these issues and fix the errors manually with the required changes. Some examples are binary data in BLOB format, which AWS SCT automatically converts to character varying data type, but this may be highlighted as an issue. Additionally, some vendor-supplied procedures and functions couldn’t be converted, so AWS SCT can error out.

As a final step, you can validate that the tables exist in Amazon Redshift.

Connect using the Amazon Redshift query editor v2 or another third-party tool or utility of your choice and check for all the tables with the following code:

select t.table_schema, t.table_name,i.tbl_rows
from svv_tables t left join svv_table_info i 
on t.table_schema = i.schema and t.table_name = i.table
where t.table_type = 'BASE TABLE' and t.table_schema='SCHEMA NAME'
order by 1,2;

Migrate the data

To start your data migration using AWS SCT data extraction agents, complete the following steps:

Configure the AWS SCT extractor properties file with corresponding Greenplum properties:

port=8192
vendor=GREENPLUM
driver.jars="C:\DMS Workshop\JDBC\postgresql-42.2.9.jar"
redshift.driver.jars=C:\DMS Workshop\JDBC\RedshiftJDBC42-no-awssdk-1.2.43.1067.jar
working.folder=C:/Users/developer
ssl.option=ON
ssl.require.client.authentication=ON
#extractor.start.fetch.size=20000
#extractor.out.file.size=10485760
ssl.truststore.path=C:/DMS Workshop/sctagent/truststore
ssl.keystore.path=C:/DMS Workshop/sctagent/keystore
ssl.keystore.pwd=ZcUnMLbWYe1c0Dp/m9jIe8IMGNtyUOv+AnKVsV9eZI+SuJcWI9Fd99Lskd7FFA==
ssl.truststore.pwd=w23mFvm+SO46eR4dJ0Ly7n+XcvgUhhqrJNMPUyWdhFFKBxRU64/V2uioOul/aBcs8tMs=

Now you configure the AWS SCT extractor to perform a one-time data move. You can use multiple extractors when dealing with a large volume of data.

To register the extractor, on the View menu, choose Data migration view.
Choose Register.
Enter the information for your new agent.
Test the connection and choose Register.

Now you create a task for the extractor to extract data into the tables created on Amazon Redshift.
Under your schema in the left pane, choose Tables (right-click) and choose Create Local task.
For Task name, enter a name.
Test the connection and choose OK.
Choose Create.
Run your task and monitor its progress.

You can choose each task to get a detailed breakdown of its activity. Make sure to examine errors during the extract, upload, and copy process.

You can monitor the status of the tasks, the percentage completed, and the tables that were loaded successfully. You must also verify the count of records loaded into the Amazon Redshift database.

Technical validation

After the initial extracted data is loaded to Amazon Redshift, you must perform data validation tests in parallel. The goal at this stage is to validate production workloads, comparing Greenplum and Amazon Redshift outputs from the same inputs.

Typical activities covered during this phase include the following:

Count the number of objects and rows on each table.
Compare the same random subset of data in both Greenplum and Amazon Redshift for all migrated tables, validating that the data is exactly the same row by row.
Check for incorrect column encodings.
Identify skewed table data.
Annotate queries not benefiting from sort keys.
Identify inappropriate join cardinality.
Identify with tables with large VARCHAR columns.
Confirm that processes don’t crash when connected with the target environment.
Validate daily batch jobs (job duration, number of rows processed). To find the right techniques to perform most of those activities, refer to Top 10 Performance Tuning Techniques for Amazon Redshift
Set up Amazon Redshift automated alerts with Amazon Redshift Advisor.

Business validation

After you successfully migrate the data and validate the data movement, the last remaining task is to involve the data warehouse users in the validation process. These users from different business units across the company access the data warehouse using various tools and methods: JDBC/ODBC clients, Python scripts, custom applications, and more. It’s central to the migration to make sure that every end-user has verified and adapted this process to work seamlessly with Amazon Redshift before the final cutover.

This phase can consist of several tasks:

Adapt business users’ tools, applications, and scripts to connect to Amazon Redshift endpoints.
Modify users’ data load and dump procedures, replacing data movement to and from shared storage via ODBC/JDBC with COPY and UNLOAD operations from and to Amazon Simple Storage Service (Amazon S3).
Modify any incompatible queries, taking into account any implementation nuances between Amazon Redshift and PostgreSQL.
Run business processes against Greenplum and Amazon Redshift, and compare results and runtimes. Make sure to notify any issue or unexpected result to the team in charge of the migration, so the case can be analyzed in detail.
Tune query performance, taking into account table distribution and sort keys, and make extensive use of the EXPLAIN command in order to understand how Amazon Redshift plans and runs queries. For advanced table design concepts, refer to Amazon Redshift Engineering’s Advanced Table Design Playbook: Preamble, Prerequisites, and Prioritization.

This business validation phase is key so all end-users are aligned and ready for the final cutover. Following Amazon Redshift best practices enables end-users to fully take advantage of the capabilities of their new data warehouse. After you perform all the migration validation tasks, connect and test every ETL job, business process, external system, and user tool against Amazon Redshift, you can disconnect every process from the old data warehouse, which you can now safely power off and decommission.

Conclusion

In this post, we provided detailed steps to migrate from Greenplum to Amazon Redshift using AWS SCT. Although this post describes modernizing and moving to a cloud warehouse, you should be augmenting this transformation process towards a full-fledged modern data architecture. The AWS Cloud enables you to be more data-driven by supporting multiple use cases. For a modern data architecture, you should use purposeful data stores like Amazon S3, Amazon Redshift, Amazon Timestream, and other data stores based on your use case.

Check out the second post in this series, where we cover prescriptive guidance around data types, functions, and stored procedures.

About the Authors

Arunabha Datta is a Sr. Data Architect at Amazon Web Services (AWS). He collaborates with customers and partners to architect and implement modern data architecture using AWS Analytics services. In his spare time, Arunabha enjoys photography and spending time with his family.

Noise

All posts by Suresh Patnam

Migrate a large data warehouse from Greenplum to Amazon Redshift using AWS SCT – Part 2

Choose your optimal Amazon Redshift cluster

Create data extraction tasks with AWS SCT

Data architecture

Migrate stored procedures

Sequences

Character datatype length

Numeric precision and scale

SQL queries and functions

AGE

COUNT

ORDER BY

STRING_AGG

ARRAY_AGG

UNNEST

WHERE

Amazon Redshift utility for troubleshooting or running diagnostics for the cluster

Conclusion

About the Authors

Migrate a large data warehouse from Greenplum to Amazon Redshift using AWS SCT – Part 1

Solution overview

Perform a migration assessment

Perform schema conversion with AWS SCT

Migrate the data

Technical validation

Business validation

Conclusion

About the Authors

The collective thoughts of the interwebz