Tag Archives: AWS Schema Conversion Tool

Migrate Microsoft Azure Synapse Analytics to Amazon Redshift using AWS SCT

Post Syndicated from Ahmed Shehata original https://aws.amazon.com/blogs/big-data/migrate-microsoft-azure-synapse-analytics-to-amazon-redshift-using-aws-sct/

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that provides the flexibility to use provisioned or serverless compute for your analytical workloads. With Amazon Redshift Serverless and Query Editor v2, you can load and query large datasets in just a few clicks and pay only for what you use. The decoupled compute and storage architecture of Amazon Redshift enables you to build highly scalable, resilient, and cost-effective workloads. Many customers migrate their data warehousing workloads to Amazon Redshift and benefit from the rich capabilities it offers, such as the following:

  • Amazon Redshift seamlessly integrates with broader data, analytics, and AI or machine learning (ML) services on AWS, enabling you to choose the right tool for the right job. Modern analytics is much wider than SQL-based data warehousing. With Amazon Redshift, you can build lake house architectures and perform any kind of analytics, such as interactive analytics, operational analytics, big data processing, visual data preparation, predictive analytics, machine learning, and more.
  • You don’t need to worry about workloads such as ETL (extract, transform, and load), dashboards, ad-hoc queries, and so on interfering with each other. You can isolate workloads using data sharing, while using the same underlying datasets.
  • When users run many queries at peak times, compute seamlessly scales within seconds to provide consistent performance at high concurrency. You get 1 hour of free concurrency scaling capacity for 24 hours of usage. This free credit meets the concurrency demand of 97% of the Amazon Redshift customer base.
  • Amazon Redshift is straightforward to use with self-tuning and self-optimizing capabilities. You can get faster insights without spending valuable time managing your data warehouse.
  • Fault tolerance is built in. All data written to Amazon Redshift is automatically and continuously replicated to Amazon Simple Storage Service (Amazon S3). Any hardware failures are automatically replaced.
  • Amazon Redshift is simple to interact with. You can access data with traditional, cloud-native, containerized, serverless web services or event-driven applications. You can also use your favorite business intelligence (BI) and SQL tools to access, analyze, and visualize data in Amazon Redshift.
  • Amazon Redshift ML makes it straightforward for data scientists to create, train, and deploy ML models using familiar SQL. You can also run predictions using SQL.
  • Amazon Redshift provides comprehensive data security at no extra cost. You can set up end-to-end data encryption, configure firewall rules, define granular row-level and column-level security controls on sensitive data, and more.

In this post, we show how to migrate a data warehouse from Microsoft Azure Synapse to Redshift Serverless using AWS Schema Conversion Tool (AWS SCT) and AWS SCT data extraction agents. AWS SCT makes heterogeneous database migrations predictable by automatically converting the source database code and storage objects to a format compatible with the target database. Any objects that can’t be automatically converted are clearly marked so that they can be manually converted to complete the migration. AWS SCT can also scan your application code for embedded SQL statements and convert them.

Solution overview

AWS SCT uses a service account to connect to your Azure Synapse Analytics. First, we create a Redshift database into which Azure Synapse data will be migrated. Next, we create an S3 bucket. Then, we use AWS SCT to convert Azure Synapse schemas and apply them to Amazon Redshift. Finally, to migrate data, we use AWS SCT data extraction agents, which extract data from Azure Synapse, upload it into an S3 bucket, and copy it to Amazon Redshift.

The following diagram illustrates our solution architecture.

This walkthrough covers the following steps:

  1. Create a Redshift Serverless data warehouse.
  2. Create the S3 bucket and folder.
  3. Convert and apply the Azure Synapse schema to Amazon Redshift using AWS SCT:
    1. Connect to the Azure Synapse source.
    2. Connect to the Amazon Redshift target.
    3. Convert the Azure Synapse schema to a Redshift database.
    4. Analyze the assessment report and address the action items.
    5. Apply the converted schema to the target Redshift database.
  4. Migrate data from Azure Synapse to Amazon Redshift using AWS SCT data extraction agents:
    1. Generate trust and key stores (this step is optional).
    2. Install and configure the data extraction agent.
    3. Start the data extraction agent.
    4. Register the data extraction agent.
    5. Add virtual partitions for large tables (this step is optional).
    6. Create a local data migration task.
    7. Start the local data migration task.
  5. View data in Amazon Redshift.

Prerequisites

Before starting this walkthrough, you must have the following prerequisites:

Create a Redshift Serverless data warehouse

In this step, we create a Redshift Serverless data warehouse with a workgroup and namespace. A workgroup is a collection of compute resources and a namespace is a collection of database objects and users. To isolate workloads and manage different resources in Redshift Serverless, you can create namespaces and workgroups and manage storage and compute resources separately.

Follow these steps to create a Redshift Serverless data warehouse with a workgroup and namespace:

  1. On the Amazon Redshift console, choose the AWS Region that you want to use.
  2. In the navigation pane, choose Redshift Serverless.
  3. Choose Create workgroup.

  1. For Workgroup name, enter a name that describes the compute resources.

  1. Verify that the VPC is the same VPC as the EC2 instance that runs AWS SCT.
  2. Choose Next.
  3. For Namespace, enter a name that describes your dataset.
  4. In the Database name and password section, select Customize admin user credentials.
  5. For Admin user name, enter a user name of your choice (for example, awsuser).
  6. For Admin user password, enter a password of your choice (for example, MyRedShiftPW2022).

  1. Choose Next.

Note that data in the Redshift Serverless namespace is encrypted by default.

  1. In the Review and Create section, choose Create.

Now you create an AWS Identity and Access Management (IAM) role and set it as the default on your namespace. Note that there can only be one default IAM role.

  1. On the Redshift Serverless Dashboard, in the Namespaces / Workgroups section, choose the namespace you just created.
  2. On the Security and encryption tab, in the Permissions section, choose Manage IAM roles.
  3. On the Manage IAM roles drop-down menu, choose Create IAM role.
  4. In the Specify an Amazon S3 bucket for the IAM role to access section, choose one of the following methods:
    1. Choose No additional Amazon S3 bucket to allow the created IAM role to access only the S3 buckets with names containing the word redshift.
    2. Choose Any Amazon S3 bucket to allow the created IAM role to access all S3 buckets.
    3. Choose Specific Amazon S3 buckets to specify one or more S3 buckets for the created IAM role to access. Then choose one or more S3 buckets from the table.
  5. Choose Create IAM role as default.
  6. Capture the endpoint for the Redshift Serverless workgroup you just created.
  7. On the Redshift Serverless Dashboard, in the Namespaces / Workgroups section, choose the workgroup you just created.
  8. In the General information section, copy the endpoint.

Create the S3 bucket and folder

During the data migration process, AWS SCT uses Amazon S3 as a staging area for the extracted data. Follow these steps to create an S3 bucket:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. For Bucket name, enter a unique DNS-compliant name for your bucket (for example, uniquename-as-rs).

For more information about bucket names, refer to Bucket naming rules.

  1. For AWS Region, choose the Region in which you created the Redshift Serverless workgroup.
  2. Choose Create bucket.

  1. Choose Buckets in the navigation pane and navigate to the S3 bucket you just created (uniquename-as-rs).
  2. Choose Create folder.
  3. For Folder name, enter incoming.
  4. Choose Create folder.

Convert and apply the Azure Synapse schema to Amazon Redshift using AWS SCT

To convert the Azure Synapse schema to Amazon Redshift format, we use AWS SCT. Start by logging in to the EC2 instance that you created previously and launch AWS SCT.

Connect to the Azure Synapse source

Complete the following steps to connect to the Azure Synapse source:

  1. On the File menu, choose Create New Project.
  2. Choose a location to store your project files and data.
  3. Provide a meaningful but memorable name for your project (for example, Azure Synapse to Amazon Redshift).
  4. To connect to the Azure Synapse source data warehouse, choose Add source.
  5. Choose Azure Synapse and choose Next.
  6. For Connection name, enter a name (for example, olap-azure-synapse).

AWS SCT displays this name in the object tree in the left pane.

  1. For Server name, enter your Azure Synapse server name.
  2. For SQL pool, enter your Azure Synapse pool name.
  3. Enter a user name and password.
  4. Choose Test connection to verify that AWS SCT can connect to your source Azure Synapse project.
  5. When the connection is successfully validated, choose OK and Connect.

Connect to the Amazon Redshift target

Follow these steps to connect to Amazon Redshift:

  1. In AWS SCT, choose Add target.
  2. Choose Amazon Redshift, then choose Next.
  3. For Connection name, enter a name to describe the Amazon Redshift connection.

AWS SCT displays this name in the object tree in the right pane.

  1. For Server name, enter the Redshift Serverless workgroup endpoint you captured earlier.
  2. For Server port, enter 5439.
  3. For Database, enter dev.
  4. For User name, enter the user name you chose when creating the Redshift Serverless workgroup.
  5. For Password, enter the password you chose when creating the Redshift Serverless workgroup.
  6. Deselect Use AWS Glue.
  7. Choose Test connection to verify that AWS SCT can connect to your target Redshift workgroup.
  8. When the test is successful, choose OK.
  9. Choose Connect to connect to the Amazon Redshift target.

Alternatively, you can use connection values that are stored in AWS Secrets Manager.

Convert the Azure Synapse schema to a Redshift data warehouse

After you create the source and target connections, you will see the source Azure Synapse object tree in the left pane and the target Amazon Redshift object tree in the right pane. We then create mapping rules to describe the source-target pair for the Azure Synapse to Amazon Redshift migration.

Follow these steps to convert the Azure Synapse dataset to Amazon Redshift format:

  1. In the left pane, choose (right-click) the schema you want to convert.
  2. Choose Convert schema.
  3. In the dialog box, choose Yes.

When the conversion is complete, you will see a new schema created in the Amazon Redshift pane (right pane) with the same name as your Azure Synapse schema.

The sample schema we used has three tables; you can see these objects in Amazon Redshift format in the right pane. AWS SCT converts all the Azure Synapse code and data objects to Amazon Redshift format. You can also use AWS SCT to convert external SQL scripts, application code, or additional files with embedded SQL.

Analyze the assessment report and address the action items

AWS SCT creates an assessment report to assess the migration complexity. AWS SCT can convert the majority of code and database objects, but some objects may require manual conversion. AWS SCT highlights these objects in blue in the conversion statistics diagram and creates action items with a complexity attached to them.

To view the assessment report, switch from Main view to Assessment Report view as shown in the following screenshot.

The Summary tab shows objects that were converted automatically and objects that were not converted automatically. Green represents automatically converted objects or objects with simple action items. Blue represents medium and complex action items that require manual intervention.

The Action items tab shows the recommended actions for each conversion issue. If you choose an action item from the list, AWS SCT highlights the object that the action item applies to.

The report also contains recommendations for how to manually convert the schema item. For example, after the assessment runs, detailed reports for the database and schema show you the effort required to design and implement the recommendations for converting action items. For more information about deciding how to handle manual conversions, see Handling manual conversions in AWS SCT. AWS SCT completes some actions automatically while converting the schema to Amazon Redshift; objects with such actions are marked with a red warning sign.

You can evaluate and inspect the individual object DDL by selecting it in the right pane, and you can also edit it as needed. In the following example, AWS SCT modifies the ID column data type from decimal(3,0) in Azure Synapse to the smallint data type in Amazon Redshift.
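
For illustration, the following sketch shows the kind of change you might see when you inspect the DDL. The table and column names here are hypothetical and only demonstrate the decimal(3,0)-to-smallint mapping; your converted DDL will reflect your own objects:

-- Source table in Azure Synapse (hypothetical example)
CREATE TABLE dbo.tbl_currency
(
    ID            decimal(3,0) NOT NULL,
    currency_code char(3),
    currency_name varchar(64)
);

-- Converted table as AWS SCT might render it for Amazon Redshift
CREATE TABLE dbo.tbl_currency
(
    id            SMALLINT NOT NULL,
    currency_code CHAR(3),
    currency_name VARCHAR(64)
)
DISTSTYLE AUTO;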

Apply the converted schema to the target Redshift data warehouse

To apply the converted schema to Amazon Redshift, select the converted schema in the right pane, right-click, and choose Apply to database.

Migrate data from Azure Synapse to Amazon Redshift using AWS SCT data extraction agents

AWS SCT extraction agents extract data from your source database and migrate it to the AWS Cloud. In this section, we configure AWS SCT extraction agents to extract data from Azure Synapse and migrate to Amazon Redshift. For this post, we install the AWS SCT extraction agent on the same Windows instance that has AWS SCT installed. For better performance, we recommend that you use a separate Linux instance to install extraction agents if possible. For very large datasets, AWS SCT supports the use of multiple data extraction agents running on several instances to maximize throughput and increase the speed of data migration.

Generate trust and key stores (optional)

You can use Secure Socket Layer (SSL) encrypted communication with AWS SCT data extractors. When you use SSL, all data passed between the applications remains private and integral. To use SSL communication, you need to generate trust and key stores using AWS SCT. You can skip this step if you don’t want to use SSL. We recommend using SSL for production workloads.

Follow these steps to generate trust and key stores:

  1. In AWS SCT, choose Settings, Global settings, and Security.
  2. Choose Generate trust and key store.

  1. Enter a name and password for the trust and key stores.
  2. Enter a location to store them.
  3. Choose Generate, then choose OK.

Install and configure the data extraction agent

In the installation package for AWS SCT, you can find a subfolder called agents (\aws-schema-conversion-tool-1.0.latest.zip\agents). Locate and install the executable file with a name like aws-schema-conversion-tool-extractor-xxxxxxxx.msi.

In the installation process, follow these steps to configure AWS SCT Data Extractor:

  1. For Service port, enter the port number the agent listens on. It is 8192 by default.
  2. For Working folder, enter the path where the AWS SCT data extraction agent will store the extracted data.

The working folder can be on a different computer from the agent, and a single working folder can be shared by multiple agents on different computers.

  1. For Enter Redshift JDBC driver file or files, enter the location where you downloaded the Redshift JDBC drivers.
  2. For Add the Amazon Redshift driver, enter YES.
  3. For Enable SSL communication, enter yes. Enter No here if you don’t want to use SSL.
  4. Choose Next.

  1. For Trust store path, enter the storage location you specified when creating the trust and key store.
  2. For Trust store password, enter the password for the trust store.
  3. For Enable client SSL authentication, enter yes.
  4. For Key store path, enter the storage location you specified when creating the trust and key store.
  5. For Key store password, enter the password for the key store.
  6. Choose Next.

Start the data extraction agent

Use the following procedure to start extraction agents. Repeat this procedure on each computer that has an extraction agent installed.

Extraction agents act as listeners. When you start an agent with this procedure, the agent starts listening for instructions. You send the agents instructions to extract data from your data warehouse in a later section.

To start the extraction agent, navigate to the AWS SCT Data Extractor Agent directory. For example, in Microsoft Windows, use C:\Program Files\AWS SCT Data Extractor Agent\StartAgent.bat.

On the computer that has the extraction agent installed, run the start command for your operating system from a command prompt or terminal window. To stop an agent, run the same command but replace start with stop. To restart an agent, run the RestartAgent.bat file.

Note that you should have administrator access to run those commands.

Register the data extraction agent

Follow these steps to register the data extraction agent:

  1. In AWS SCT, change the view to Data Migration view and choose Register.
  2. Select Redshift data agent, then choose OK.

  1. For Description, enter a name to identify the agent.
  2. For Host name, if you installed the extraction agent on the same workstation as AWS SCT, enter 0.0.0.0 to indicate local host. Otherwise, enter the host name of the machine on which the AWS SCT extraction agent is installed. It is recommended to install extraction agents on Linux for better performance.
  3. For Port, enter the number you used for the listening port (default 8192) when installing the AWS SCT extraction agent.
  4. Select Use SSL to encrypt AWS SCT connection to Data Extraction Agent.

  1. If you’re using SSL, navigate to the SSL tab.
  2. For Trust store, choose the trust store you created earlier.
  3. For Key store, choose the key store you created earlier.
  4. Choose Test connection.
  5. After the connection is validated successfully, choose OK and Register.

Create a local data migration task

To migrate data from Azure Synapse Analytics to Amazon Redshift, you create, run, and monitor the local migration task from AWS SCT. This step uses the data extraction agent to migrate data by creating a task.

Follow these steps to create a local data migration task:

  1. In AWS SCT, under the schema name in the left pane, choose (right-click) the table you want to migrate (for this post, we use the table tbl_currency).
  2. Choose Create Local task.

  1. Choose from the following migration modes:
    1. Extract the source data and store it on a local PC or virtual machine where the agent runs.
    2. Extract the data and upload it to an S3 bucket.
    3. Extract the data, upload it to Amazon S3, and copy it into Amazon Redshift. (We choose this option for this post.)

  1. On the Advanced tab, provide the extraction and copy settings.

  1. On the Source server tab, make sure you are using the current connection properties.

  1. On the Amazon S3 settings tab, for Amazon S3 bucket folder, provide the bucket and folder names of the S3 bucket you created earlier.

The AWS SCT data extraction agent uploads the data to that S3 bucket and folder before copying it to Amazon Redshift.

  1. Choose Test Task.

  1. When the task is successfully validated, choose OK, then choose Create.

Start the local data migration task

To start the task, choose Start or Restart on the Tasks tab.

First, the data extraction agent extracts data from Azure Synapse. Then the agent uploads data to Amazon S3 and launches a copy command to move the data to Amazon Redshift.
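
The COPY statement that the agent issues is similar in spirit to the following sketch. The schema, table, manifest file, and IAM role shown here are placeholders, not the exact statement AWS SCT generates:

COPY your_schema.tbl_currency
    FROM 's3://uniquename-as-rs/incoming/tbl_currency.manifest' MANIFEST
    IAM_ROLE 'arn:aws:iam::111122223333:role/your-redshift-role'
    GZIP
    DELIMITER '|';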

At this point, AWS SCT has successfully migrated data from the source Azure Synapse table to the Redshift table.

View data in Amazon Redshift

After the data migration task is complete, you can connect to Amazon Redshift and validate the data. Complete the following steps:

  1. On the Amazon Redshift console, navigate to the Query Editor v2.
  2. Open the Redshift Serverless workgroup you created.
  3. Choose Query data.

  1. For Database, enter a name for your database.
  2. For Authentication, select Federated user.
  3. Choose Create connection.

  1. Open a new editor by choosing the plus sign.
  2. In the editor, write a query that selects from the schema and table or view you want to verify.

You can explore the data, run ad-hoc queries, and make visualizations, charts, and views.
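
For example, the following queries spot-check the migrated table from earlier in this post; substitute the schema that AWS SCT created (it has the same name as your Azure Synapse schema):

SELECT COUNT(*) FROM your_schema.tbl_currency;

SELECT *
    FROM your_schema.tbl_currency
    ORDER BY 1
    LIMIT 10;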

The following screenshot is the view of the source Azure Synapse dataset we used in this post.

Clean up

Follow the steps in this section to clean up any AWS resources you created as part of this post.

Terminate the EC2 instance

Follow these steps to terminate the EC2 instance:

  1. On the Amazon EC2 console, in the navigation pane, choose Instances.
  2. Select the instance you created.
  3. Choose Instance state, then choose Terminate instance.
  4. Choose Terminate when prompted for confirmation.

Delete the Redshift Serverless workgroup and namespace

Follow these steps to delete the Redshift Serverless workgroup and namespace:

  1. On the Redshift Serverless Dashboard, in the Namespaces / Workgroups section, choose the workgroup you created.
  2. On the Actions menu, choose Delete workgroup.
  3. Select Delete the associated namespace.
  4. Deselect Create final snapshot.
  5. Enter delete in the confirmation text box and choose Delete.

Delete the S3 bucket

Follow these steps to delete the S3 bucket:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose the bucket you created.
  3. Choose Delete.
  4. To confirm deletion, enter the name of the bucket.
  5. Choose Delete bucket.

Conclusion

Migrating a data warehouse can be a challenging, complex, and yet rewarding project. AWS SCT reduces the complexity of data warehouse migrations. This post discussed how a data migration task extracts, downloads, and migrates data from Azure Synapse to Amazon Redshift. The solution we presented performs a one-time migration of database objects and data. Data changes made in Azure Synapse while the migration is in progress won’t be reflected in Amazon Redshift, so put the ETL jobs that load Azure Synapse on hold during the migration, or rerun them against Amazon Redshift after the migration completes. Consider using the best practices for AWS SCT.

To get started, download and install AWS SCT, sign in to the AWS Management Console, check out Redshift Serverless, and start migrating!


About the Authors

Ahmed Shehata is a Senior Analytics Specialist Solutions Architect at AWS based in Toronto. He has more than two decades of experience helping customers modernize their data platforms. Ahmed is passionate about helping customers build efficient, performant, and scalable analytic solutions.

Jagadish Kumar is a Senior Analytics Specialist Solutions Architect at AWS focused on Amazon Redshift. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS.

Anusha Challa is a Senior Analytics Specialist Solutions Architect at AWS focused on Amazon Redshift. She has helped many customers build large-scale data warehouse solutions in the cloud and on premises. Anusha is passionate about data analytics and data science and enabling customers to achieve success with their large-scale data projects.

Accelerate your data warehouse migration to Amazon Redshift – Part 7

Post Syndicated from Mykhailo Kondak original https://aws.amazon.com/blogs/big-data/accelerate-your-data-warehouse-migration-to-amazon-redshift-part-7/

Tens of thousands of customers use Amazon Redshift to gain business insights from their data. With Amazon Redshift, you can use standard SQL to query data across your data warehouse, operational data stores, and data lake. You can also integrate other AWS services such as Amazon EMR, Amazon Athena, Amazon SageMaker, AWS Glue, AWS Lake Formation, and Amazon Kinesis to use all the analytic capabilities in AWS.

Migrating a data warehouse can be complex. You have to migrate terabytes or petabytes of data from your legacy system while not disrupting your production workload. You also need to ensure that the new target data warehouse is consistent with upstream data changes so that business reporting can continue uninterrupted when you cut over to the new platform.

Previously, there were two main strategies to maintain data consistency after the initial bulk load during a migration to Amazon Redshift. You could identify the changed rows, perhaps using a filter on update timestamps, or you could modify your extract, transform, and load (ETL) process to write to both the source and target databases. Both of these options require manual effort to implement and increase the cost and risk of the migration project.

AWS Schema Conversion Tool (AWS SCT) can help you with the initial bulk load from Azure Synapse Analytics, BigQuery, Greenplum Database, IBM Netezza, Microsoft SQL Server, Oracle, Snowflake, Teradata, and Vertica. Now, we’re happy to share that AWS SCT has automated maintaining data consistency for you. If you’re migrating from an IBM Netezza data warehouse to Amazon Redshift, the AWS SCT data extractors will automatically capture changes from the source and apply them on the target. You configure a change data capture (CDC) migration task in AWS SCT, and it extracts the relevant data changes from IBM Netezza and applies them in a transactionally consistent order on Amazon Redshift. You only need to configure the required resources on IBM Netezza and start the data migration. The source database remains fully operational during the migration and replication.

In this post, we describe at a high-level how CDC tasks work in AWS SCT. Then we deep dive into an example of how to configure, start, and manage a CDC migration task. We look briefly at performance and how you can tune a CDC migration, and then conclude with some information about how you can get started on your own migration.


Solution overview

The following diagram shows the data migration and replication workflow with AWS SCT.

In the first step, your AWS SCT data extraction agent completes the full load of your source data to Amazon Redshift. Then the AWS SCT data extraction agent uses a history database in Netezza. The history database captures information about user activity such as queries, query plans, table access, column access, session creation, and failed authentication requests. The data extraction agent extracts information about transactions that you run in your source Netezza database and replicates them to your target Redshift database.

You can start ongoing replication automatically after you complete the full load. Alternatively, you can start CDC at a later time or on a schedule. For more information, refer to Configuring ongoing data replication.

At a high level, the ongoing replication flow is as follows.

At the start of the replication, the data extraction agent captures the last transaction identifier in the history table. The data extraction agent stores this value in the max_createxid variable. To capture the transaction ID, the agent runs the following query:

SELECT max(XID) AS max_createxid 
    FROM <DB_NAME>.<SCHEMA_NAME>."$hist_plan_prolog_n";

If this transaction ID value is different from the CDC start point, then the agent identifies the delta to replicate. This delta includes all transactions for the selected tables that happened after full load or after the previous replication. The data extraction agent selects the updated data from your source table.

From this updated data, AWS SCT creates two temporary tables. The first table includes all rows that you deleted from your source database and the old data of the rows that you updated. The second table includes all rows that you inserted and the new data of the rows that you updated. AWS SCT then uses these tables in JOIN clauses to replicate the changes to your target Redshift database.

Next, AWS SCT copies these tables to your Amazon Simple Storage Service (Amazon S3) bucket and uses this data to update your target Redshift cluster.

After updating your target database, AWS SCT deletes these temporary tables. Next, your data extraction agent sets the value of the CDC start point equal to the captured transaction ID (max_createxid). During the next data replication run, your agent will determine the delta to replicate using this updated CDC start point.

All changes that happen to your source database during the replication run will be captured in the next replication run. Make sure that you repeat the replication steps until the delta is equal to zero for each table that you included in the migration scope. At this point, you can cut over to your new databases.
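
A simple way to confirm that the delta has drained is to compare per-table row counts on both sides after the final replication run. The object names in this sketch are placeholders:

-- On the source Netezza database
SELECT COUNT(*) FROM YOUR_DB.YOUR_SCHEMA.YOUR_TABLE;

-- On the target Amazon Redshift database
SELECT COUNT(*) FROM your_schema.your_table;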

Configure your source database

In your source Netezza database, create a history database and configure the history logging. Next, grant read permissions for all tables in the history database to the user that you use in the AWS SCT project. This user has the minimal permissions that are required to convert your source database schemas to Amazon Redshift.
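
The grants might look like the following sketch. The history database (HISTDB), schema (DBE), history table suffix, and user (sct_user) are placeholders; grant SELECT on each history table that AWS SCT reads, using the object names from your own history configuration:

-- Run while connected to the history database
GRANT SELECT ON DBE."$hist_plan_prolog_1" TO sct_user;
GRANT SELECT ON DBE."$hist_column_access_1" TO sct_user;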

Configure AWS SCT

Before you start data migration and replication with AWS SCT, make sure that you download the Netezza and Amazon Redshift drivers. Take note of the path to the folder where you saved these files. You will specify this path in the AWS SCT and data extraction agent settings.

To make sure the data extraction agents work properly, install the latest version of Amazon Corretto 11.

To configure AWS SCT, complete the following steps:

  1. After you create an AWS SCT project, connect to your source and target databases and set up the mapping rules. A mapping rule describes a source-target pair that defines the migration target for your source database schema.
  2. Convert your database schemas and apply them to Amazon Redshift if you haven’t done this yet. Make sure that the target tables exist in your Redshift database before you start data migration.
  3. Now, install the data extraction agent. The AWS SCT installer includes the installation files for data extraction agents in the agents folder. Configure your data extraction agents by adding the listening port number and the path to the source and target database drivers. For the listening port, you can proceed with the default value. For database drivers, enter the path that you noted before.

The following diagram shows how the AWS SCT data extraction agents work.

After you install the data extraction agent, register it in AWS SCT.

  1. Open the data migration view in AWS SCT and choose Register.
  2. Enter the name of your agent, the host name, and the port that you configured in the previous step. For the host name, you can use 0.0.0.0 (localhost) if you run the agent on the same machine where AWS SCT is installed.

Create and run a CDC task

Now you can create and manage your data migration and replication tasks. To do so, complete the following steps:

  1. Select the tables in your source database to migrate, open the context (right-click) menu, and choose Create local task.
  2. Choose your data migration mode (for this post, choose Extract, upload and copy to replicate data changes from your source database):
    1. Extract only – Extract your data and save it to your local working folders.
    2. Extract and upload – Extract your data and upload it to Amazon S3.
    3. Extract, upload and copy – Extract your data, upload it to Amazon S3, and copy it into your Redshift data warehouse.
  3. Choose your encryption type. Make sure that you configure encryption for safe and secure data migrations.
  4. Select Enable CDC.

  1. After this, you can switch to the CDC settings tab.
  2. For CDC mode, you can choose from the following options:
    1. Migrate existing data and replicate ongoing changes – Migrate all existing source data and then start the replication. This is the default option.
    2. Replicate data changes only – Start data replication immediately.

Sometimes you don’t need to migrate all existing source data. For example, if you have already migrated your data, you can start the data replication by choosing Replicate data changes only.

  1. If you choose Replicate data changes only, you can also set the Last CDC point to configure the replication to start from this point. If it is not set, AWS SCT data extraction agents replicate all changes that occur after your replication task is started.

If your replication task failed, you can restart the replication from the point of failure. You can find the identifier of the last migrated CDC point on the CDC processing details tab in AWS SCT; set Last CDC point to this value and start the task again. This allows the AWS SCT data extraction agents to replicate all changes in your source tables to your target database without gaps.

  1. You can also configure when you want to schedule the CDC runs to begin.

If you select Immediately, the first replication runs immediately after your agent completes the full load. Alternatively, you can specify the time and date when you want to start the replication.

  1. Also, you can schedule when to run the replication again. You can enter the interval, in days, hours, or minutes, at which to repeat the replication runs. Set these values depending on the intensity of data changes in your source database.
  2. Finally, you can set the end date when AWS SCT will stop running the replication.

  1. On the Amazon S3 settings tab, you can connect your AWS SCT data extraction agent with your Amazon S3 bucket.

You don’t need to do this step if you have configured the AWS service profile in the global application settings.

  1. After you have configured all settings, choose Create to create a CDC task.

  1. Start this task in the AWS SCT user interface.

The following screenshots show examples of the AWS SCT user interface after you start tasks.

You can run multiple CDC tasks in parallel at the same time. For example, you can include different sets of source tables in each task to replicate the changes to different target Redshift clusters. AWS SCT handles these replication tasks and distributes resources correctly to minimize the replication time.

Data replication limitations

There are a few limitations in AWS SCT data replication:

  • Changes in your source database don’t trigger the replication run because AWS SCT isn’t able to automate these runs (as of this writing). You can instead run the data replication tasks on a predefined schedule.
  • AWS SCT doesn’t replicate TRUNCATE and DDL statements. If you change the structure of your source table or truncate it, then you must run the same statements in your target database, as shown in the sketch after this list. You should make these changes manually because AWS SCT isn’t aware of structure updates.
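
As a sketch, using the demo table from the end-to-end example later in this post, a source-side DDL change has to be mirrored manually on the target before the next replication run:

-- Run on the source Netezza table
ALTER TABLE NZ_DATA4EXTRACTOR.CDC_DEMO ADD COLUMN NOTE CHARACTER(16);

-- Apply the equivalent change manually on the target Redshift table
ALTER TABLE nz_data4extractor_nz_data4extractor.cdc_demo ADD COLUMN note CHARACTER VARYING(48);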

End-to-end example

Now that you know how to create a local replication task in AWS SCT, we deep dive and show how AWS SCT performs the extract and load processes.

  1. First, we run the following code to check that we correctly configured our source Netezza database. To use this code example, substitute the name of your own history database.
SELECT COUNT(*) FROM HISTDB.DBE."$hist_column_access_1";

If you configured your database correctly, then the output of this command includes a value that is different from zero. In our case, the result is as follows:

 COUNT |
-------+
2106717|

  1. Now we create a table on Netezza to use in the example. The table has three columns and a primary key.
DROP TABLE NZ_DATA4EXTRACTOR.CDC_DEMO IF EXISTS; 
CREATE TABLE NZ_DATA4EXTRACTOR.CDC_DEMO 
(
    ID INTEGER NOT NULL, 
    TS TIMESTAMP DEFAULT CURRENT_TIMESTAMP, 
    CMNT CHARACTER(16)
)
DISTRIBUTE ON RANDOM; 

ALTER TABLE NZ_DATA4EXTRACTOR.CDC_DEMO 
    ADD CONSTRAINT CDC_DEMO_PK PRIMARY KEY (ID); 

SELECT * 
    FROM NZ_DATA4EXTRACTOR.CDC_DEMO 
    ORDER BY ID;
  1. The SELECT statement returns an empty table:
ID|TS|CMNT|
--+--+----+
  1. Before we start the replication, we run the following query on Netezza to get the latest transaction identifier in the history table:
SELECT MAX(XID) AS XID 
    FROM HISTDB.DBE."$hist_plan_prolog_1";

For our test table, the script prints the last transaction identifier, which is 2691798:

XID    |
-------+
2691798|
  1. To make sure that our table doesn’t include new transactions, AWS SCT runs the following script. If you want to run this script manually, replace 2691798 with the last transaction identifier in your history table.
SELECT MAX(CREATEXID) AS CREATEXID
    FROM NZ_DATA4EXTRACTOR.CDC_DEMO 
    WHERE CREATEXID > 2691798;

As expected, the script doesn’t return any values.

CREATEXID|
---------+
         |

CREATEXID and DELETEXID are hidden system columns that exist in every Netezza table. CREATEXID identifies the transaction ID that created the row, and DELETEXID identifies the transaction ID that deleted the row. AWS SCT uses them to find changes in the source data.
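
You can see these hidden columns by naming them explicitly in a query. The following sketch runs against the demo table, with deleted rows made visible:

SET show_deleted_records = true;

SELECT createxid, deletexid, id, ts, cmnt
    FROM NZ_DATA4EXTRACTOR.CDC_DEMO
    ORDER BY id;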

Now we’re ready to start the replication.

  1. We assume you’ve used AWS SCT to convert the example table and build it on the target Amazon Redshift. AWS SCT runs the following statement on Amazon Redshift:
DROP TABLE IF EXISTS nz_data4extractor_nz_data4extractor.cdc_demo;

CREATE TABLE nz_data4extractor_nz_data4extractor.cdc_demo
(
    id INTEGER ENCODE AZ64 NOT NULL,
    ts TIMESTAMP WITHOUT TIME ZONE ENCODE AZ64 DEFAULT SYSDATE::TIMESTAMP,
    cmnt CHARACTER VARYING(48) ENCODE LZO
)
DISTSTYLE AUTO;

ALTER TABLE nz_data4extractor_nz_data4extractor.cdc_demo
    ADD CONSTRAINT cdc_demo_pk PRIMARY KEY (id);
  1. AWS SCT also creates a staging table that holds replication changes until they can be applied on the actual target table:
CREATE TABLE IF NOT EXISTS "nz_data4extractor_nz_data4extractor"."_cdc_unit"
    (LIKE "nz_data4extractor_nz_data4extractor"."cdc_demo" INCLUDING  DEFAULTS);
ALTER TABLE "nz_data4extractor_nz_data4extractor"."_cdc_unit" 
    ADD COLUMN deletexid_ BIGINT;
  1. AWS SCT runs the following query to capture all changes that happened after the last transaction identifier:
SET show_deleted_records = true;

SELECT 
    ID, 
    TS, 
    CMNT
FROM NZ_DATA4EXTRACTOR.CDC_DEMO
WHERE CREATEXID <= 2691798
    AND (DELETEXID = 0 OR DELETEXID > 2691798)

This script returns an empty table:

ID|TS|CMNT|
--+--+----+
  1. Now, we change data on Netezza and see how it gets replicated to Amazon Redshift:
INSERT INTO NZ_DATA4EXTRACTOR.CDC_DEMO (ID, CMNT) VALUES (1, 'One');
INSERT INTO NZ_DATA4EXTRACTOR.CDC_DEMO (ID, CMNT) VALUES (2, 'Two');
INSERT INTO NZ_DATA4EXTRACTOR.CDC_DEMO (ID, CMNT) VALUES (3, 'Three');
INSERT INTO NZ_DATA4EXTRACTOR.CDC_DEMO (ID, CMNT) VALUES (4, 'Four');
INSERT INTO NZ_DATA4EXTRACTOR.CDC_DEMO (ID, CMNT) VALUES (5, 'Five');

SELECT * FROM NZ_DATA4EXTRACTOR.CDC_DEMO
    ORDER BY ID;

The preceding script returns the following result:

ID|TS                     |CMNT            |
--+-----------------------+----------------+
 1|2023-03-07 14:05:11.000|One             |
 2|2023-03-07 14:05:11.000|Two             |
 3|2023-03-07 14:05:11.000|Three           |
 4|2023-03-07 14:05:11.000|Four            |
 5|2023-03-07 14:05:11.000|Five            |
  1. AWS SCT checks for data changes starting from the last transaction ID using the following query:
SET show_deleted_records = true;

SELECT MAX(CREATEXID) AS CREATEXID
    FROM NZ_DATA4EXTRACTOR.CDC_DEMO
    WHERE createxid > 2691798

The script returns a result that is different from zero:

CREATEXID|
---------+
  2691824|

Because the new transaction ID is greater than the last transaction ID, the source table contains new data to be replicated.

  1. AWS SCT runs the following query to extract the changes. The application detects all rows that were inserted, deleted, or updated within the scope of transactions with IDs in range from 2691798 + 1 to 2691824.
SELECT
    ID,
    TS,
    CMNT,
    deletexid_
FROM (
    SELECT
        createxid,
        rowid,
        deletexid,
        2691798 AS min_CDC_trx,
        2691824 AS max_CDC_trx,
        CASE WHEN deletexid > max_CDC_trx
            THEN 0
            ELSE deletexid
            END AS deletexid_,
        MIN(createxid) OVER (PARTITION BY rowid) AS min_trx,
        COUNT(1) OVER (PARTITION BY rowid) AS rowid_cnt,
        ID,
        TS,
        CMNT
    FROM NZ_DATA4EXTRACTOR.CDC_DEMO AS t
    WHERE deletexid <> 1
        AND (CREATEXID > min_CDC_trx OR deletexid_ > min_CDC_trx) -- Prior run max trx
        AND CREATEXID <= max_CDC_trx -- Current run max trx
    ) AS r
WHERE (min_trx = createxid OR deletexid_ = 0)
    AND NOT (
        CREATEXID > min_CDC_trx 
        AND deletexid <= max_CDC_trx 
        AND rowid_cnt = 1 
        AND deletexid > 0
        )

The extracted data is as follows:

ID|TS                     |CMNT            |DELETEXID_|
--+-----------------------+----------------+----------+
 1|2023-03-07 14:05:11.000|One             |         0|
 2|2023-03-07 14:05:11.000|Two             |         0|
 3|2023-03-07 14:05:11.000|Three           |         0|
 4|2023-03-07 14:05:11.000|Four            |         0|
 5|2023-03-07 14:05:11.000|Five            |         0|
  1. Next, AWS SCT compresses the data and uploads it to Amazon S3. Then AWS SCT runs the following command to copy the data into the staging table on Amazon Redshift:
TRUNCATE TABLE "nz_data4extractor_nz_data4extractor"."_cdc_unit"; 

COPY "nz_data4extractor_nz_data4extractor"."_cdc_unit" 
    ("id", "ts", "cmnt", "deletexid_")
    FROM 's3://bucket/folder/unit_1.manifest' MANIFEST
    CREDENTIALS '...'
    REGION '...'
    REMOVEQUOTES
    IGNOREHEADER 1
    GZIP
    DELIMITER '|';
  1. From the staging table, AWS SCT applies the changes to the actual target table. For this iteration, we insert new rows into the Redshift table:
INSERT INTO "nz_data4extractor_nz_data4extractor"."cdc_demo"("id", "ts", "cmnt")
SELECT 
        "id", 
        "ts", 
        "cmnt" 
    FROM "nz_data4extractor_nz_data4extractor"."_cdc_unit" t2
    WHERE t2.deletexid_ = 0;
  1. Let’s run another script that not only inserts, but also deletes and updates data in the source table:
INSERT INTO NZ_DATA4EXTRACTOR.CDC_DEMO (ID, CMNT) VALUES (6, 'Six');
INSERT INTO NZ_DATA4EXTRACTOR.CDC_DEMO (ID, CMNT) VALUES (7, 'Seven');
INSERT INTO NZ_DATA4EXTRACTOR.CDC_DEMO (ID, CMNT) VALUES (8, 'Eight');

DELETE FROM NZ_DATA4EXTRACTOR.CDC_DEMO WHERE ID = 1;
DELETE FROM NZ_DATA4EXTRACTOR.CDC_DEMO WHERE ID = 7;

UPDATE NZ_DATA4EXTRACTOR.CDC_DEMO SET CMNT = 'Updated Two' WHERE ID = 2;
UPDATE NZ_DATA4EXTRACTOR.CDC_DEMO SET CMNT = 'Updated Four' WHERE ID = 4;
UPDATE NZ_DATA4EXTRACTOR.CDC_DEMO SET CMNT = 'Replaced Four' WHERE ID = 4;
UPDATE NZ_DATA4EXTRACTOR.CDC_DEMO SET CMNT = 'Yet Again Four' WHERE ID = 4;
UPDATE NZ_DATA4EXTRACTOR.CDC_DEMO SET CMNT = 'Updated Five' WHERE ID = 5;
UPDATE NZ_DATA4EXTRACTOR.CDC_DEMO SET CMNT = 'Updated Eight' WHERE ID = 8;

DELETE FROM NZ_DATA4EXTRACTOR.CDC_DEMO WHERE ID = 5;

SELECT * FROM NZ_DATA4EXTRACTOR.CDC_DEMO
    ORDER BY ID;

The Netezza table contains the following rows:

ID|TS                     |CMNT            |
--+-----------------------+----------------+
 2|2023-03-07 14:05:11.000|Updated Two     |
 3|2023-03-07 14:05:11.000|Three           |
 4|2023-03-07 14:05:11.000|Yet Again Four  |
 6|2023-03-07 14:07:09.000|Six             |
 8|2023-03-07 14:07:10.000|Updated Eight   |
  1. AWS SCT detects the changes as before using the new transaction ID:
SET show_deleted_records = true;

SELECT MAX(CREATEXID) AS CREATEXID
    FROM NZ_DATA4EXTRACTOR.CDC_DEMO
    WHERE createxid > 2691824
CREATEXID|
---------+
  2691872|
SELECT
    ID,
    TS,
    CMNT,
    deletexid_
FROM (
    SELECT
        createxid,
        rowid,
        deletexid,
        2691824 AS min_CDC_trx,
        2691872 AS max_CDC_trx,
        CASE WHEN deletexid > max_CDC_trx
            THEN 0
            ELSE deletexid
            END AS deletexid_,
        MIN(createxid) OVER (PARTITION BY rowid) AS min_trx,
        COUNT(1) OVER (PARTITION BY rowid) AS rowid_cnt,
        ID,
        TS,
        CMNT
    FROM NZ_DATA4EXTRACTOR.CDC_DEMO AS t
    WHERE deletexid <> 1
        AND (CREATEXID > min_CDC_trx OR deletexid_ > min_CDC_trx) -- Prior run max trx
        AND CREATEXID <= max_CDC_trx -- Current run max trx
    ) AS r
WHERE (min_trx = createxid OR deletexid_ = 0)
    AND NOT (
        CREATEXID > min_CDC_trx 
        AND deletexid <= max_CDC_trx 
        AND rowid_cnt = 1 
        AND deletexid > 0
        )

The extracted changes appear as follows:

ID|TS                     |CMNT          |DELETEXID_|
--+-----------------------+--------------+----------+
 1|2023-03-07 14:05:11.000|One           |   2691856|
 2|2023-03-07 14:05:11.000|Two           |   2691860|
 2|2023-03-07 14:05:11.000|Updated Two   |         0|
 4|2023-03-07 14:05:11.000|Four          |   2691862|
 4|2023-03-07 14:05:11.000|Yet Again Four|         0|
 5|2023-03-07 14:05:11.000|Five          |   2691868|
 6|2023-03-07 14:07:09.000|Six           |         0|
 8|2023-03-07 14:07:10.000|Eight         |   2691870|
 8|2023-03-07 14:07:10.000|Updated Eight |         0|

Notice that we inserted a new row with ID 7 and then deleted this row. Therefore, we can ignore the row with ID 7 in our delta.

Also, we made several updates of the row with ID 4. In our delta, we include the original and the most recent versions of the row. We ignore all intermediate versions in our delta.

We updated the row with ID 5 and then deleted this row. We don’t include the updated row in our delta.

This way, AWS SCT optimizes the migrated data, reducing the migration time and the network traffic.

  1. Now, as before, AWS SCT compresses, uploads to Amazon S3, and copies the data into the staging Redshift table:
TRUNCATE TABLE "nz_data4extractor_nz_data4extractor"."_cdc_unit";

COPY "nz_data4extractor_nz_data4extractor"."_cdc_unit" 
    ("id", "ts", "cmnt", "deletexid_")
    FROM 's3://bucket/folder/unit_2.manifest' MANIFEST
    CREDENTIALS '...'
    REGION '...'
    REMOVEQUOTES
    IGNOREHEADER 1
    GZIP
    DELIMITER '|';  
  1. Then, AWS SCT applies the changes to the target table. AWS SCT removes the deleted rows, removes the old version of updated rows, and then inserts new rows and the most recent version of any updated rows:
DELETE 
    FROM  "nz_data4extractor_nz_data4extractor"."cdc_demo"
    USING "nz_data4extractor_nz_data4extractor"."_cdc_unit" t2
    WHERE "nz_data4extractor_nz_data4extractor"."cdc_demo"."id" = t2."id"
        AND COALESCE(CAST("nz_data4extractor_nz_data4extractor"."cdc_demo"."ts" AS VARCHAR),'?#-*') = COALESCE(CAST(t2."ts" AS VARCHAR),'?#-*')
        AND COALESCE(CAST("nz_data4extractor_nz_data4extractor"."cdc_demo"."cmnt" AS VARCHAR),'?#-*') = COALESCE(CAST(t2."cmnt" AS VARCHAR),'?#-*')
        AND t2.deletexid_ > 0;
  
INSERT INTO "nz_data4extractor_nz_data4extractor"."cdc_demo"("id", "ts", "cmnt")
SELECT 
    "id", 
    "ts", 
    "cmnt"
FROM "nz_data4extractor_nz_data4extractor"."_cdc_unit" t2
WHERE t2.deletexid_ = 0;
  1. You can compare the data on the source and target to verify that AWS SCT captured all changes correctly:
SELECT * FROM NZ_DATA4EXTRACTOR.CDC_DEMO
    ORDER BY ID;
ID|TS                     |CMNT            |
--+-----------------------+----------------+
 2|2023-03-07 14:05:11.000|Updated Two     |
 3|2023-03-07 14:05:11.000|Three           |
 4|2023-03-07 14:05:11.000|Yet Again Four  |
 6|2023-03-07 14:07:09.000|Six             |
 8|2023-03-07 14:07:10.000|Updated Eight   |

The data on Amazon Redshift matches exactly:

id|ts                        |cmnt
 2|2023-03-07 14:05:11.000000|Updated Two
 3|2023-03-07 14:05:11.000000|Three
 4|2023-03-07 14:05:11.000000|Yet Again Four
 6|2023-03-07 14:07:09.000000|Six
 8|2023-03-07 14:07:10.000000|Updated Eight

In the previous examples, we showed how to run full load and CDC tasks. You can also create a CDC migration task without the full load. The process is the same—you provide AWS SCT with the transaction ID to start the replication from.

The CDC process does not have a significant impact on the source side. AWS SCT runs only SELECT statements there, using transaction IDs as the boundaries in the WHERE clause. The performance impact of these statements is always smaller than the impact of the DML statements generated by customer applications. For machines where AWS SCT data extraction agents are running, the CDC-related workload is always smaller than the full load workload because the volume of transferred data is smaller.

On the target side, the CDC process can generate considerable additional workload for Amazon Redshift. The reason is that this process issues INSERT and DELETE statements, and these can add overhead on massively parallel processing (MPP) systems such as Amazon Redshift. Refer to Top 10 performance tuning techniques for Amazon Redshift to find best practices and tips on how to boost the performance of your Redshift cluster.
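
One mitigation, sketched here for the demo table, is to periodically reclaim space and refresh statistics on tables that receive frequent CDC deletes and inserts. Amazon Redshift also runs automatic vacuum and analyze in the background, so treat this as an optional manual tune-up:

VACUUM DELETE ONLY nz_data4extractor_nz_data4extractor.cdc_demo;
ANALYZE nz_data4extractor_nz_data4extractor.cdc_demo;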

Conclusion

In this post, we showed how to configure ongoing data replication for Netezza database migration to Amazon Redshift. You can use the described approach to automate data migration and replication from your IBM Netezza database to Amazon Redshift. Or, if you’re considering a migration of your existing Netezza workloads to Amazon Redshift, you can use AWS SCT to automatically convert your database schemas and migrate data. Download the latest version of AWS SCT and give it a try!

We’re happy to share these updates to help you in your data warehouse migration projects. In the meantime, you can learn more about Amazon Redshift and AWS SCT. Happy migrating!


About the Authors

Mykhailo Kondak is a Database Engineer in the AWS Database Migration Service team at AWS. He uses his experience with different database technologies to help Amazon customers move their on-premises data warehouses and big data workloads to the AWS Cloud. In his spare time, he plays soccer.

Illia Kravtsov is a Database Engineer on the AWS Database Migration Service team. He has over 10 years of experience in data warehouse development with Teradata and other massively parallel processing (MPP) databases.

Michael Soo is a Principal Database Engineer in the AWS Database Migration Service. He builds products and services that help customers migrate their database workloads to the AWS Cloud.

Migrate Google BigQuery to Amazon Redshift using AWS Schema Conversion tool (SCT)

Post Syndicated from Jagadish Kumar original https://aws.amazon.com/blogs/big-data/migrate-google-bigquery-to-amazon-redshift-using-aws-schema-conversion-tool-sct/

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that provides the flexibility to use provisioned or serverless compute for your analytical workloads. Using Amazon Redshift Serverless and Query Editor v2, you can load and query large datasets in just a few clicks and pay only for what you use. The decoupled compute and storage architecture of Amazon Redshift enables you to build highly scalable, resilient, and cost-effective workloads. Many customers migrate their data warehousing workloads to Amazon Redshift and benefit from the rich capabilities it offers. The following are just some of the notable capabilities:

  • Amazon Redshift seamlessly integrates with broader analytics services on AWS. This enables you to choose the right tool for the right job. Modern analytics is much wider than SQL-based data warehousing. Amazon Redshift lets you build lake house architectures and then perform any kind of analytics, such as interactive analytics, operational analytics, big data processing, visual data preparation, predictive analytics, machine learning (ML), and more.
  • You don’t need to worry about workloads, such as ETL, dashboards, ad-hoc queries, and so on, interfering with each other. You can isolate workloads using data sharing, while using the same underlying datasets.
  • When users run many queries at peak times, compute seamlessly scales within seconds to provide consistent performance at high concurrency. You get one hour of free concurrency scaling capacity for 24 hours of usage. This free credit meets the concurrency demand of 97% of the Amazon Redshift customer base.
  • Amazon Redshift is easy-to-use with self-tuning and self-optimizing capabilities. You can get faster insights without spending valuable time managing your data warehouse.
  • Fault tolerance is built in. All of the data written to Amazon Redshift is automatically and continuously replicated to Amazon Simple Storage Service (Amazon S3). Any hardware failures are automatically replaced.
  • Amazon Redshift is simple to interact with. You can access data with traditional, cloud-native, containerized, and serverless web services-based or event-driven applications and so on.
  • Redshift ML makes it easy for data scientists to create, train, and deploy ML models using familiar SQL. They can also run predictions using SQL.
  • Amazon Redshift provides comprehensive data security at no extra cost. You can set up end-to-end data encryption, configure firewall rules, define granular row and column level security controls on sensitive data, and so on.
  • Amazon Redshift integrates seamlessly with other AWS services and third-party tools. You can move, transform, load, and query large datasets quickly and reliably.

In this post, we provide a walkthrough for migrating a data warehouse from Google BigQuery to Amazon Redshift using AWS Schema Conversion Tool (AWS SCT) and AWS SCT data extraction agents. AWS SCT makes heterogeneous database migrations predictable by automatically converting the majority of the database code and storage objects to a format that is compatible with the target database. Any objects that can’t be automatically converted are clearly marked so that they can be manually converted to complete the migration. Furthermore, AWS SCT can scan your application code for embedded SQL statements and convert them.

Solution overview

AWS SCT uses a service account to connect to your BigQuery project. First, we create an Amazon Redshift database into which BigQuery data is migrated. Next, we create an S3 bucket. Then, we use AWS SCT to convert BigQuery schemas and apply them to Amazon Redshift. Finally, to migrate data, we use AWS SCT data extraction agents, which extract data from BigQuery, upload it into the S3 bucket, and then copy to Amazon Redshift.

Prerequisites

Before starting this walkthrough, you must have the following prerequisites:

  1. A workstation with AWS SCT, Amazon Corretto 11, and Amazon Redshift drivers.
    1. You can use an Amazon Elastic Compute Cloud (Amazon EC2) instance or your local desktop as a workstation. In this walkthrough, we’re using an Amazon EC2 Windows instance. To create it, use this guide.
    2. To download and install AWS SCT on the EC2 instance that you previously created, use this guide.
    3. Download the Amazon Redshift JDBC driver from this location.
    4. Download and install Amazon Corretto 11.
  2. A GCP service account that AWS SCT can use to connect to your source BigQuery project.
    1. Grant BigQuery Admin and Storage Admin roles to the service account.
    2. Copy the service account key file, which was created in the Google Cloud console, to the EC2 instance that has AWS SCT.
    3. Create a Cloud Storage bucket in GCP to store your source data during migration.

This walkthrough covers the following steps:

  • Create an Amazon Redshift Serverless Workgroup and Namespace
  • Create the AWS S3 Bucket and Folder
  • Convert and apply BigQuery Schema to Amazon Redshift using AWS SCT
    • Connecting to the Google BigQuery Source
    • Connect to the Amazon Redshift Target
    • Convert the BigQuery schema to Amazon Redshift format
    • Analyze the assessment report and address the action items
    • Apply converted schema to target Amazon Redshift
  • Migrate data using AWS SCT data extraction agents
    • Generating Trust and Key Stores (Optional)
    • Install and start data extraction agent
    • Register data extraction agent
    • Add virtual partitions for large tables (Optional)
    • Create a local migration task
    • Start the Local Data Migration Task
  • View Data in Amazon Redshift

Create an Amazon Redshift Serverless Workgroup and Namespace

In this step, we create an Amazon Redshift Serverless workgroup and namespace. A workgroup is a collection of compute resources, and a namespace is a collection of database objects and users. To isolate workloads and manage different resources in Amazon Redshift Serverless, you can create namespaces and workgroups and manage storage and compute resources separately.

Follow these steps to create the Amazon Redshift Serverless workgroup and namespace:

  • Navigate to the Amazon Redshift console.
  • In the upper right, choose the AWS Region that you want to use.
  • Expand the Amazon Redshift pane on the left and choose Redshift Serverless.
  • Choose Create Workgroup.
  • For Workgroup name, enter a name that describes the compute resources.
  • Verify that the VPC is the same VPC as the EC2 instance with AWS SCT.
  • Choose Next.

  • For Namespace name, enter a name that describes your dataset.
  • In the Database name and password section, select the Customize admin user credentials checkbox.
    • For Admin user name, enter a username of your choice, for example awsuser.
    • For Admin user password, enter a password of your choice, for example MyRedShiftPW2022.
  • Choose Next. Note that data in Amazon Redshift Serverless namespace is encrypted by default.
  • In the Review and Create page, choose Create.
  • Create an AWS Identity and Access Management (IAM) role and set it as the default on your namespace, as described in the following steps. Note that there can be only one default IAM role.
    • Navigate to the Amazon Redshift Serverless Dashboard.
    • Under Namespaces / Workgroups, choose the namespace that you just created.
    • Navigate to Security and encryption.
    • Under Permissions, choose Manage IAM roles.
    • Choose the Manage IAM roles drop-down, and then choose Create IAM role.
    • Under Specify an Amazon S3 bucket for the IAM role to access, choose one of the following methods:
      • Choose No additional Amazon S3 bucket to allow the created IAM role to access only the S3 buckets with a name starting with redshift.
      • Choose Any Amazon S3 bucket to allow the created IAM role to access all of the S3 buckets.
      • Choose Specific Amazon S3 buckets to specify one or more S3 buckets for the created IAM role to access. Then choose one or more S3 buckets from the table.
    • Choose Create IAM role as default. Amazon Redshift automatically creates and sets the IAM role as default.
  • Capture the Endpoint for the Amazon Redshift Serverless workgroup that you just created.

Create the S3 bucket and folder

During the data migration process, AWS SCT uses Amazon S3 as a staging area for the extracted data. Follow these steps to create the S3 bucket:

  • Navigate to the Amazon S3 console
  • Choose Create bucket. The Create bucket wizard opens.
  • For Bucket name, enter a unique DNS-compliant name for your bucket (e.g., uniquename-bq-rs). See rules for bucket naming when choosing a name.
  • For AWS Region, choose the region in which you created the Amazon Redshift Serverless workgroup.
  • Choose Create bucket.
  • In the Amazon S3 console, navigate to the S3 bucket that you just created (e.g., uniquename-bq-rs).
  • Choose “Create folder” to create a new folder.
  • For Folder name, enter incoming and choose Create Folder.

Convert and apply BigQuery Schema to Amazon Redshift using AWS SCT

To convert BigQuery schema to the Amazon Redshift format, we use AWS SCT. Start by logging in to the EC2 instance that we created previously, and then launch AWS SCT.

Follow these steps using AWS SCT:

Connect to the BigQuery Source

  • From the File Menu choose Create New Project.
  • Choose a location to store your project files and data.
  • Provide a meaningful but memorable name for your project, such as BigQuery to Amazon Redshift.
  • To connect to the BigQuery source data warehouse, choose Add source from the main menu.
  • Choose BigQuery and choose Next. The Add source dialog box appears.
  • For Connection name, enter a name to describe the BigQuery connection. AWS SCT displays this name in the tree in the left panel.
  • For Key path, provide the path of the service account key file that was previously created in the Google Cloud console.
  • Choose Test Connection to verify that AWS SCT can connect to your source BigQuery project.
  • Once the connection is successfully validated, choose Connect.

Connect to the Amazon Redshift Target

Follow these steps to connect to Amazon Redshift:

  • In AWS SCT, choose Add Target from the main menu.
  • Choose Amazon Redshift, then choose Next. The Add Target dialog box appears.
  • For Connection name, enter a name to describe the Amazon Redshift connection. AWS SCT displays this name in the tree in the right panel.
  • For Server name, enter the Amazon Redshift Serverless workgroup endpoint captured previously.
  • For Server port, enter 5439.
  • For Database, enter dev.
  • For User name, enter the username chosen when creating the Amazon Redshift Serverless workgroup.
  • For Password, enter the password chosen when creating the Amazon Redshift Serverless workgroup.
  • Uncheck the “Use AWS Glue” box.
  • Choose Test Connection to verify that AWS SCT can connect to your target Amazon Redshift workgroup.
  • Choose Connect to connect to the Amazon Redshift target.

Alternatively, you can use connection values that are stored in AWS Secrets Manager.

Convert the BigQuery schema to Amazon Redshift

After the source and target connections are successfully made, you see the source BigQuery object tree on the left pane and target Amazon Redshift object tree on the right pane.

Follow these steps to convert BigQuery schema to the Amazon Redshift format:

  • On the left pane, right-click on the schema that you want to convert.
  • Choose Convert Schema.
  • A dialog box appears with the question The objects might already exist in the target database. Replace? Choose Yes.

Once the conversion is complete, you see a new schema created on the Amazon Redshift pane (right pane) with the same name as your BigQuery schema.

The sample schema that we used has 16 tables, 3 views, and 3 procedures. You can see these objects in the Amazon Redshift format in the right pane. AWS SCT converts all of the BigQuery code and data objects to the Amazon Redshift format. Furthermore, you can use AWS SCT to convert external SQL scripts, application code, or additional files with embedded SQL.

Analyze the assessment report and address the action items

AWS SCT creates an assessment report to assess the migration complexity. AWS SCT can convert the majority of code and database objects. However, some of the objects may require manual conversion. AWS SCT highlights these objects in blue in the conversion statistics diagram and creates action items with a complexity attached to them.

To view the assessment report, switch from the Main view to the Assessment Report view.

The Summary tab shows objects that were converted automatically and objects that weren't. Green represents items that were converted automatically or that have simple action items. Blue represents medium and complex action items that require manual intervention.

The Action Items tab shows the recommended actions for each conversion issue. If you select an action item from the list, AWS SCT highlights the object to which the action item applies.

The report also contains recommendations for how to manually convert the schema item. For example, after the assessment runs, detailed reports for the database or schema show you the effort required to design and implement the recommendations for converting the action items. For more information about deciding how to handle manual conversions, see Handling manual conversions in AWS SCT. AWS SCT takes some actions automatically while converting the schema to Amazon Redshift; objects with these actions are marked with a red warning sign.

You can evaluate and inspect the individual object DDL by selecting it in the right pane, and you can also edit it as needed. In the following example, AWS SCT converts the RECORD and JSON data type columns in the BigQuery table ncaaf_referee_data to the SUPER data type in Amazon Redshift. The partition key in the ncaaf_referee_data table is converted to the distribution key and sort key in Amazon Redshift.
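
As a rough illustration only (the column names here are hypothetical, not taken from the sample dataset), a converted table of this shape might look like the following in Amazon Redshift, with the former RECORD and JSON columns landing as SUPER and the former partition column serving as the distribution key and sort key:

-- Hypothetical sketch of a converted table; column names are illustrative.
CREATE TABLE ncaaf_referee_data (
    game_date    DATE,
    referee_info SUPER,   -- was a RECORD column in BigQuery
    penalty_json SUPER    -- was a JSON column in BigQuery
)
DISTKEY (game_date)
SORTKEY (game_date);

-- SUPER columns can be queried with PartiQL-style dot notation, for example:
SELECT r.referee_info.name FROM ncaaf_referee_data r;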

Apply converted schema to target Amazon Redshift

To apply the converted schema to Amazon Redshift, select the converted schema in the right pane, right-click, and then choose Apply to database.

Migrate data from BigQuery to Amazon Redshift using AWS SCT data extraction agents

AWS SCT extraction agents extract data from your source database and migrate it to the AWS Cloud. In this walkthrough, we show how to configure AWS SCT extraction agents to extract data from BigQuery and migrate to Amazon Redshift.

First, install AWS SCT extraction agent on the same Windows instance that has AWS SCT installed. For better performance, we recommend that you use a separate Linux instance to install extraction agents if possible. For big datasets, you can use several data extraction agents to increase the data migration speed.

Generating trust and key stores (optional)

You can use Secure Socket Layer (SSL) encrypted communication with AWS SCT data extractors. When you use SSL, all of the data passed between the applications remains private and integral. To use SSL communication, you must generate trust and key stores using AWS SCT. You can skip this step if you don’t want to use SSL. We recommend using SSL for production workloads.

Follow these steps to generate trust and key stores:

  1. In AWS SCT, navigate to Settings → Global Settings → Security.
  2. Choose Generate trust and key store.
  3. Enter the name and password for trust and key stores and choose a location where you would like to store them.
  4. Choose Generate.

Install and configure Data Extraction Agent

The installation package for AWS SCT contains the agents sub-folder (\aws-schema-conversion-tool-1.0.latest.zip\agents). Locate and install the executable file with a name like aws-schema-conversion-tool-extractor-xxxxxxxx.msi.

In the installation process, follow these steps to configure AWS SCT Data Extractor:

  1. For Listening port, enter the port number on which the agent listens. It is 8192 by default.
  2. For Add a source vendor, enter no, as you don’t need drivers to connect to BigQuery.
  3. For Add the Amazon Redshift driver, enter YES.
  4. For Enter Redshift JDBC driver file or files, enter the location where you downloaded Amazon Redshift JDBC drivers.
  5. For Working folder, enter the path where the AWS SCT data extraction agent will store the extracted data. The working folder can be on a different computer from the agent, and a single working folder can be shared by multiple agents on different computers.
  6. For Enable SSL communication, enter YES. Enter no if you don't want to use SSL.
  7. For Key store, enter the storage location chosen when creating the trust and key store.
  8. For Key store password, enter the password for the key store.
  9. For Enable client SSL authentication, enter yes.
  10. For Trust store, enter the storage location chosen when creating the trust and key store.
  11. For Trust store password, enter the password for the trust store.
*************************************************
*                                               *
*     AWS SCT Data Extractor Configuration      *
*              Version 2.0.1.666                *
*                                               *
*************************************************
User name: Administrator
User home: C:\Windows\system32\config\systemprofile
*************************************************
Listening port [8192]: 8192
Add a source vendor [YES/no]: no
No one source data warehouse vendors configured. AWS SCT Data Extractor cannot process data extraction requests.
Add the Amazon Redshift driver [YES/no]: YES
Enter Redshift JDBC driver file or files: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject\redshift-jdbc42-2.1.0.9.jar
Working folder [C:\Windows\system32\config\systemprofile]: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject
Enable SSL communication [YES/no]: YES
Setting up a secure environment at "C:\Windows\system32\config\systemprofile". This process will take a few seconds...
Key store [ ]: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject\TrustAndKeyStores\BQToRedshiftKeyStore
Key store password:
Re-enter the key store password:
Enable client SSL authentication [YES/no]: YES
Trust store [ ]: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject\TrustAndKeyStores\BQToRedshiftTrustStore
Trust store password:
Re-enter the trust store password:

Starting Data Extraction Agent(s)

Use the following procedure to start extraction agents. Repeat this procedure on each computer that has an extraction agent installed.

Extraction agents act as listeners. When you start an agent with this procedure, the agent starts listening for instructions. You send the agents instructions to extract data from your data warehouse in a later section.

To start the extraction agent, navigate to the AWS SCT Data Extractor Agent directory. For example, in Microsoft Windows, double-click C:\Program Files\AWS SCT Data Extractor Agent\StartAgent.bat.

  • On the computer that has the extraction agent installed, from a command prompt or terminal window, run the command listed for your operating system.
  • To check the status of the agent, run the same command but replace start with status.
  • To stop an agent, run the same command but replace start with stop.
  • To restart an agent, run the RestartAgent.bat file.

Register the Data Extraction Agent

Follow these steps to register the Data Extraction Agent:

  1. In AWS SCT, change the view to Data Migration view (other) and choose + Register.
  2. In the connection tab:
    1. For Description, enter a name to identify the Data Extraction Agent.
    2. For Host name, if you installed the Data Extraction Agent on the same workstation as AWS SCT, enter 0.0.0.0 to indicate local host. Otherwise, enter the host name of the machine on which the AWS SCT Data Extraction Agent is installed. It’s recommended to install the Data Extraction Agents on Linux for better performance.
    3. For Port, enter the number entered for the Listening Port when installing the AWS SCT Data Extraction Agent.
    4. Select the checkbox to use SSL (if using SSL) to encrypt the AWS SCT connection to the Data Extraction Agent.
  3. If you’re using SSL, then in the SSL Tab:
    1. For Trust store, choose the trust store name created when generating the trust and key stores.
    2. For Key Store, choose the key store name created when generating the trust and key stores.
  4. Choose Test Connection.
  5. Once the connection is validated successfully, choose Register.

Add virtual partitions for large tables (optional)

You can use AWS SCT to create virtual partitions to optimize migration performance. When virtual partitions are created, AWS SCT extracts the data for the partitions in parallel. We recommend creating virtual partitions for large tables.

Follow these steps to create virtual partitions:

  1. Deselect all objects on the source database view in AWS SCT.
  2. Choose the table for which you would like to add virtual partitioning.
  3. Right-click on the table, and choose Add Virtual Partitioning.
  4. You can use List, Range, or Auto Split partitions. To learn more about virtual partitioning, refer to Use virtual partitioning in AWS SCT. In this example, we use Auto split partitioning, which generates range partitions automatically. You specify the start value, end value, and partition size, and AWS SCT derives the partitions for you. For a demonstration, on the Lineorder table:
    1. For Start Value, enter 1000000.
    2. For End Value, enter 3000000.
    3. For Interval, enter 1000000 to indicate partition size.
    4. Choose Ok.

You can see the partitions automatically generated under the Virtual Partitions tab. In this example, AWS SCT automatically created the following five partitions for the field:

    1. <1000000
    2. >=1000000 and <=2000000
    3. >2000000 and <=3000000
    4. >3000000
    5. IS NULL

Create a local migration task

To migrate data from BigQuery to Amazon Redshift, create, run, and monitor the local migration task from AWS SCT. This step uses the data extraction agent to migrate data by creating a task.

Follow these steps to create a local migration task:

  1. In AWS SCT, under the schema name in the left pane, right-click on Standard tables.
  2. Choose Create Local task.
  3. There are three migration modes from which you can choose:
    1. Extract the source data and store it on a local PC or virtual machine (VM) where the agent runs.
    2. Extract the data and upload it to an S3 bucket.
    3. Extract, upload, and copy, which extracts the data to an S3 bucket and then copies it to Amazon Redshift. Choose this option for this walkthrough.
  4. In the Advanced tab, for Google CS bucket folder, enter the Google Cloud Storage bucket and folder that you created earlier in the Google Cloud console. AWS SCT stores the extracted data in this location.
  5. In the Amazon S3 Settings tab, for Amazon S3 bucket folder, provide the bucket and folder names of the S3 bucket that you created earlier. The AWS SCT data extraction agent uploads the data into the S3 bucket/folder before copying to Amazon Redshift.
  6. Choose Test Task.
  7. Once the task is successfully validated, choose Create.

Start the Local Data Migration Task

To start the task, choose the Start button in the Tasks tab.

  • First, the Data Extraction Agent extracts data from BigQuery into the GCP storage bucket.
  • Then, the agent uploads data to Amazon S3 and launches a copy command to move the data to Amazon Redshift.
  • At this point, AWS SCT has successfully migrated data from the source BigQuery table to the Amazon Redshift table.

View data in Amazon Redshift

After the data migration task executes successfully, you can connect to Amazon Redshift and validate the data.

Follow these steps to validate the data in Amazon Redshift:

  1. Navigate to the Amazon Redshift Query Editor v2.
  2. Double-click on the Amazon Redshift Serverless workgroup name that you created.
  3. Choose the Federated User option under Authentication.
  4. Choose Create Connection.
  5. Create a new editor by choosing the + icon.
  6. In the editor, write a query that selects from the schema and the table or view that you want to verify, for example the row-count and preview queries sketched following this list. Explore the data, run ad-hoc queries, and build visualizations, charts, and views.
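
For example, the following simple checks (a sketch; replace the schema and table names with the ones AWS SCT created for your migration) confirm that rows arrived and let you spot-check a sample:

-- Row-count check for a migrated table (schema and table names are illustrative):
SELECT COUNT(*) FROM my_bigquery_schema.ncaaf_referee_data;

-- Preview a few rows to spot-check the migrated data:
SELECT * FROM my_bigquery_schema.ncaaf_referee_data LIMIT 10;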

The following is a side-by-side comparison between the source BigQuery and target Amazon Redshift for the sports dataset that we used in this walkthrough.

Clean up the AWS resources that you created for this exercise

Follow these steps to terminate the EC2 instance:

  1. Navigate to the Amazon EC2 console.
  2. In the navigation pane, choose Instances.
  3. Select the check-box for the EC2 instance that you created.
  4. Choose Instance state, and then Terminate instance.
  5. Choose Terminate when prompted for confirmation.

Follow these steps to delete the Amazon Redshift Serverless workgroup and namespace:

  1. Navigate to Amazon Redshift Serverless Dashboard.
  2. Under Namespaces / Workgroups, choose the workgroup that you created.
  3. Under Actions, choose Delete workgroup.
  4. Select the checkbox Delete the associated namespace.
  5. Uncheck Create final snapshot.
  6. Enter delete in the delete confirmation text box and choose Delete.

Follow these steps to delete the S3 bucket

  1. Navigate to Amazon S3 console.
  2. Choose the bucket that you created.
  3. Choose Delete.
  4. To confirm deletion, enter the name of the bucket in the text input field.
  5. Choose Delete bucket.

Conclusion

Migrating a data warehouse can be a challenging, complex, and yet rewarding project. AWS SCT reduces the complexity of data warehouse migrations. Following this walkthrough, you can understand how a data migration task extracts data from BigQuery, uploads it to Amazon S3, and then copies it into Amazon Redshift. The solution that we presented in this post performs a one-time migration of database objects and data. Data changes made in BigQuery while the migration is in progress won't be reflected in Amazon Redshift. While the migration is in progress, put your ETL jobs that load BigQuery on hold, or replay them against Amazon Redshift after the migration. Consider using the best practices for AWS SCT.

AWS SCT has some limitations when using BigQuery as a source. For example, AWS SCT can’t convert sub queries in analytic functions, geography functions, statistical aggregate functions, and so on. Find the full list of limitations in the AWS SCT user guide. We plan to address these limitations in future releases. Despite these limitations, you can use AWS SCT to automatically convert most of your BigQuery code and storage objects.

Download and install AWS SCT, sign in to the AWS Management Console, check out Amazon Redshift Serverless, and start migrating!


About the authors

Cedrick Hoodye is a Solutions Architect with a focus on database migrations using the AWS Database Migration Service (AWS DMS) and the AWS Schema Conversion Tool (AWS SCT) at AWS. He works on challenges related to database migrations. He works closely with EdTech, Energy, and ISV business sector customers to help them realize the true potential of the DMS service. He has helped migrate hundreds of databases into the AWS Cloud using DMS and SCT.

Amit Arora is a Solutions Architect with a focus on Database and Analytics at AWS. He works with our Financial Technology and Global Energy customers and AWS certified partners to provide technical assistance and design customer solutions on cloud migration projects, helping customers migrate and modernize their existing databases to the AWS Cloud.

Jagadish Kumar is an Analytics Specialist Solution Architect at AWS focused on Amazon Redshift. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS.

Anusha Challa is a Senior Analytics Specialist Solutions Architect at AWS focused on Amazon Redshift. She has helped many customers build large-scale data warehouse solutions in the cloud and on premises. Anusha is passionate about data analytics and data science and enabling customers to achieve success with their large-scale data projects.

New – A Fully Managed Schema Conversion in AWS Database Migration Service

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/new-a-fully-managed-schema-conversion-in-aws-database-migration-service/

Since we launched AWS Database Migration Service (AWS DMS) in 2016, customers have securely migrated more than 800,000 databases to AWS with minimal downtime. AWS DMS supports migration between 20+ database and analytics engines, such as Oracle to Amazon Aurora MySQL, MySQL to Amazon Relational Database Service (Amazon RDS) for MySQL, Microsoft SQL Server to Amazon Aurora PostgreSQL, MongoDB to Amazon DocumentDB, Oracle to Amazon Redshift, and to and from Amazon Simple Storage Service (Amazon S3).

Specifically, the AWS Schema Conversion Tool (AWS SCT) makes heterogeneous database and data warehouse migrations predictable and can automatically convert the source schema and a majority of the database code objects, including views, stored procedures, and functions, to a format compatible with the target engine. For example, it supports the conversion of Oracle PL/SQL and SQL Server T-SQL code to equivalent code in the Amazon Aurora MySQL dialect of SQL or the equivalent PL/pgSQL code in PostgreSQL. You can download the AWS SCT for your platform, including Windows or Linux (Fedora and Ubuntu).

Today we announce fully managed AWS DMS Schema Conversion, which streamlines database migrations by making schema assessment and conversion available inside AWS DMS. With DMS Schema Conversion, you can now plan, assess, convert, and migrate under one central DMS service. You can access the features of DMS Schema Conversion in the AWS Management Console without downloading and running AWS SCT.

AWS DMS Schema Conversion automatically converts your source database schemas, and a majority of the database code objects to a format compatible with the target database. This includes tables, views, stored procedures, functions, data types, synonyms, and so on, similar to AWS SCT. Any objects that cannot be automatically converted are clearly marked as action items with prescriptive instructions on how to migrate to AWS manually.

In this launch, DMS Schema Conversion supports the following databases as sources for migration projects:

  • Microsoft SQL Server version 2008 R2 and higher
  • Oracle version 10.2 and later, 11g and up to 12.2, 18c, and 19c

DMS Schema Conversion supports the following databases as targets for migration projects:

  • Amazon RDS for MySQL version 8.x
  • Amazon RDS for PostgreSQL version 14.x

Setting Up AWS DMS Schema Conversion
To get started with DMS Schema Conversion, if this is your first time using AWS DMS, complete the setup tasks: create a virtual private cloud (VPC) using the Amazon VPC service, and set up your source and target databases. To learn more, see Prerequisites for AWS Database Migration Service in the AWS documentation.

In the AWS DMS console, you can see new menus to set up Instance profiles, add Data providers, and create Migration projects.

Before you create your migration project, set up an instance profile by choosing Instance profiles in the left pane. An instance profile specifies network and security settings for your DMS Schema Conversion instances. You can create multiple instance profiles and select an instance profile to use for each migration project.

Choose Create instance profile and specify your default VPC or a new VPC, Amazon Simple Storage Service (Amazon S3) bucket to store your schema conversion metadata, and additional settings such as AWS Key Management Service (AWS KMS) keys.

You can create the simplest network configuration with a single VPC configuration. If your source or target data providers are in different VPCs, you can create your instance profile in one of the VPCs, and then link these two VPCs by using VPC peering.

Next, you can add data providers that store the data store type and location information about your source and target databases by choosing Data providers in the left pane. For each database, you can create a single data provider and use it in multiple migration projects.

Your data provider can be a fully managed Amazon RDS instance or a self-managed engine running either on-premises or on an Amazon Elastic Compute Cloud (Amazon EC2) instance.

Choose Create data provider to create a new data provider. You can set the type of the database location manually, such as database engine, domain name or IP address, port number, database name, and so on, for your data provider. Here, I have selected an RDS database instance.

After you create a data provider, make sure that you add database connection credentials in AWS Secrets Manager. DMS Schema Conversion uses this information to connect to a database.

Converting your database schema with AWS DMS Schema Conversion
Now, you can create a migration project for DMS Schema Conversion by choosing Migration projects in the left pane. A migration project describes your source and target data providers, your instance profile, and migration rules. You can also create multiple migration projects for different source and target data providers.

Choose Create migration project and select your instance profile and source and target data providers for DMS Schema Conversion.

After creating your migration project, you can use the project to create assessment reports and convert your database schema. Choose your migration project from the list, then choose the Schema conversion tab and choose Launch schema conversion.

Migration projects in DMS Schema Conversion are always serverless. This means that AWS DMS automatically provisions the cloud resources for your migration projects, so you don’t need to manage schema conversion instances.

Of course, the first launch of DMS Schema Conversion requires starting a schema conversion instance, which can take 10–15 minutes. This process also reads the metadata from the source and target databases. After a successful first launch, you can access DMS Schema Conversion faster.

An important part of DMS Schema Conversion is that it generates a database migration assessment report that summarizes all of the schema conversion tasks. It also details the action items for schema objects that can't be converted to the DB engine of your target database instance. You can view the report in the AWS DMS console or export it as a comma-separated value (.csv) file.

To create your assessment report, choose the source database schema or schema items that you want to assess. After you select the checkboxes, choose Assess in the Actions menu in the source database pane. This report will be archived with .csv files in your S3 bucket. To change the S3 bucket, edit the schema conversion settings in your instance profile.

Then, you can apply the converted code to your target database or save it as a SQL script. To apply the converted code, choose Convert in the source data provider pane and then Apply changes in the target data provider pane.

Once the schema has been converted successfully, you can move on to the database migration phase using AWS DMS. To learn more, see Getting started with AWS Database Migration Service in the AWS documentation.

Now Available
AWS DMS Schema Conversion is now available in the US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), and Europe (Stockholm) Regions, and you can start using it today.

To learn more, see the AWS DMS Schema Conversion User Guide, give it a try, and please send feedback to AWS re:Post for AWS DMS or through your usual AWS support contacts.

Channy

Accelerate your data warehouse migration to Amazon Redshift – Part 6

Post Syndicated from Michael Soo original https://aws.amazon.com/blogs/big-data/part-6-accelerate-your-data-warehouse-migration-to-amazon-redshift/

This is the sixth in a series of posts. We’re excited to share dozens of new features to automate your schema conversion; preserve your investment in existing scripts, reports, and applications; accelerate query performance; and potentially simplify your migrations from legacy data warehouses to Amazon Redshift.

Check out the previous posts in this series for more background.

Amazon Redshift is the cloud data warehouse of choice for tens of thousands of customers who use it to analyze exabytes of data to gain business insights. With Amazon Redshift, you can query data across your data warehouse, operational data stores, and data lake using standard SQL. You can also integrate other AWS services such as Amazon EMR, Amazon Athena, Amazon SageMaker, AWS Glue, AWS Lake Formation, and Amazon Kinesis to use all the analytic capabilities in the AWS Cloud.

Migrating a data warehouse can be a complex undertaking. Your legacy workload might rely on proprietary features that aren't directly supported by a modern data warehouse like Amazon Redshift. For example, some data warehouses enforce primary key constraints, making a tradeoff with DML performance. Amazon Redshift lets you define a primary key but uses the constraint for query optimization purposes only. If you use Amazon Redshift, or are migrating to Amazon Redshift, you may need a mechanism to check that primary key constraints aren't being violated by extract, transform, and load (ETL) processes.
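
To make that behavior concrete, the following minimal sketch (the table and values are ours, purely for illustration) shows that Amazon Redshift accepts duplicate values in a declared primary key column without raising an error:

-- Amazon Redshift records the primary key for the optimizer but doesn't enforce it.
CREATE TABLE pk_demo (
  pk_col  INTEGER,
  payload VARCHAR(100),
  PRIMARY KEY (pk_col)
);

INSERT INTO pk_demo VALUES (1, 'first row');
INSERT INTO pk_demo VALUES (1, 'duplicate pk value');  -- succeeds, no error

-- Both rows are stored, so duplicates must be detected by your own checks:
SELECT pk_col, COUNT(*) FROM pk_demo GROUP BY pk_col HAVING COUNT(*) > 1;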

In this post, we describe two design patterns that you can use to accomplish this efficiently. We also show you how to use the AWS Schema Conversion Tool (AWS SCT) to automatically apply the design patterns to your SQL code.

We start by defining the semantics to address. Then we describe the design patterns and analyze their performance. We conclude by showing you how AWS SCT can automatically convert your code to enforce primary keys.

Primary keys

A primary key (PK) is a set of attributes such that no two rows can have the same value in the PK. For example, the following Teradata table has a two-attribute primary key (emp_id, div_id). Presumably, employee IDs are unique only within divisions.

CREATE TABLE testschema.emp ( 
  emp_id INTEGER NOT NULL
, name VARCHAR(12) NOT NULL
, div_id INTEGER NOT NULL
, job_title VARCHAR(12)
, salary DECIMAL(8,2)
, birthdate DATE NOT NULL
, CONSTRAINT pk_emp_id PRIMARY KEY (emp_id, div_id)
);

Most databases require that a primary key satisfy two criteria:

  • Uniqueness – The PK values are unique over all rows in the table
  • Not NULL – The PK attributes don’t accept NULL values

In this post, we focus on how to support the preceding primary key semantics. We describe two design patterns that you can use to develop SQL applications that respect primary keys in Amazon Redshift. Our focus is on INSERT-SELECT statements. Customers have told us that INSERT-SELECT operations comprise over 50% of the DML workload against tables with unique constraints. We briefly provide some guidance for other DML statements later in the post.

INSERT-SELECT

In the rest of this post, we dive deep into design patterns for INSERT-SELECT statements. We’re concerned with statements of the following form:

INSERT INTO <target table> SELECT * FROM <staging table>

The schema of the staging table is identical to the target table on a column-by-column basis.

A duplicate PK value can be introduced by two scenarios:

  • The staging table contains duplicates, meaning there are two or more rows in the staging data with the same PK value
  • There is a row x in the staging table and a row y in the target table that share the same PK value

Note that these situations are independent. It can be the case that the staging table contains duplicates, the staging table and target table share a duplicate, or both.

It’s imperative that the staging table doesn’t contain duplicate PK values. To ensure this, you can apply deduplication logic, as described in this post, to the staging table when it’s loaded. Alternatively, if your upstream source can guarantee that duplicates have been eliminated before delivery, you can eliminate this step.
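
If you need to build that deduplication step yourself, one possible approach (a sketch that assumes the stg table and columns defined in the next section, and that any surviving row is acceptable) is to keep a single row per primary key value with ROW_NUMBER():

-- Keep one row per primary key value; adjust the ORDER BY to your own tie-breaking rule.
CREATE TABLE stg_dedup AS
SELECT pk_col, payload
FROM (
  SELECT pk_col,
         payload,
         ROW_NUMBER() OVER (PARTITION BY pk_col ORDER BY payload) AS rn
  FROM stg
) t
WHERE rn = 1;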

Join

The first design pattern simply joins the staging and target tables. If any rows are returned, then the staging and target tables share a primary key value.

Suppose we have staging and target tables defined as the following:

CREATE TABLE stg ( 
  pk_col INTEGER 
, payload VARCHAR(100) 
, PRIMARY KEY (pk_col)
); 

CREATE TABLE tgt ( 
  pk_col INTEGER 
, payload VARCHAR(100) 
, PRIMARY KEY (pk_col)
);

We can use the following query to detect any duplicate primary key values:

SELECT count(1) 
FROM stg, tgt 
WHERE tgt.pk_col = stg.pk_col;

If the primary key has multiple columns, then the WHERE condition can be extended:

SELECT count(1)
FROM stg, tgt
WHERE
    tgt.pk_col1 = stg.pk_col1
AND tgt.pk_col2 = stg.pk_col2
AND …
;

There is one complication with this design pattern. If you allow NULL values in the primary key column, then you need to add special code to handle the NULL to NULL matching:

SELECT count(1)
FROM stg, tgt
WHERE
   (tgt.pk_col = stg.pk_col) 
OR (tgt.pk_col IS NULL AND stg.pk_col IS NULL)
;

This is the primary disadvantage of this design pattern—the code can be ugly and unintuitive. Furthermore, if you have a multicolumn primary key, then the code becomes even more complicated.

INTERSECT

The second design pattern that we describe uses the Amazon Redshift INTERSECT operation. INTERSECT is a set-based operation that determines if two queries have any rows in common. You can check out UNION, INTERSECT, and EXCEPT in the Amazon Redshift documentation for more information.

We can determine if the staging and target table have duplicate PK values using the following query:

SELECT COUNT(1)
FROM (
  SELECT pk_col FROM stg
  INTERSECT
  SELECT pk_col FROM tgt
) a
;

If the primary key is composed of more than one column, you can simply modify the subqueries to include the additional columns:

SELECT COUNT(1)
FROM (
  SELECT pk_col1, pk_col2, …, pk_coln FROM stg
  INTERSECT
  SELECT pk_col1, pk_col2, …, pk_coln FROM tgt
) a
;

This pattern’s main advantage is its simplicity. The code is easier to understand and validate than the join design pattern. INTERSECT handles the NULL to NULL matching implicitly so you don’t have to write any special code for NULL values.

Performance

We tested both design patterns using an Amazon Redshift cluster consisting of 12 ra3.4xlarge nodes. Each node had 12 vCPUs and 96 GB of memory.

We created the staging and target tables with the same distribution and sort keys to minimize data redistribution at query time.

We generated the test data artificially using a custom program. The target dataset contained 1 billion rows of data. We ran 10 trials of both algorithms using staging datasets that ranged from 20–200 million rows, in 20-million-row increments.

In the following graph, the join design pattern is shown as a blue line. The intersect design pattern is shown as an orange line.

You can observe that the performance of both algorithms is excellent. Each is able to detect duplicates in less than 1 second for all trials. The join algorithm outperforms the intersect algorithm, but both have excellent performance.

So, which algorithm should you choose? If you're developing a new application on Amazon Redshift, the intersect algorithm is probably the best choice. The inherent NULL matching logic and simple, intuitive code make this the best choice for new applications.

Conversely, if you need to squeeze every bit of performance from your application, then the join algorithm is your best option. In this case, you’ll have to trade complexity and perhaps extra effort in code review to gain the extra performance.

Automation

If you’re migrating an existing application to Amazon Redshift, you can use AWS SCT to automatically convert your SQL code.

Let’s see how this works. Suppose you have the following Teradata table. We use it as the target table in an INSERT-SELECT operation.

CREATE MULTISET TABLE testschema.test_pk_tgt (
  pk_col INTEGER NOT NULL
, payload VARCHAR(100) NOT NULL
, PRIMARY KEY (pk_col)
);

The staging table is identical to the target table, with the same columns and data types.

Next, we create a procedure to load the target table from the staging table. The procedure contains a single INSERT-SELECT statement:

REPLACE PROCEDURE testschema.insert_select()
BEGIN
INSERT INTO testschema.test_pk_tgt (pk_col, payload)
SELECT pk_col, payload FROM testschema.test_pk_stg;
END;

Now we use AWS SCT to convert the Teradata stored procedure to Amazon Redshift. First, open Settings, Conversion settings, and ensure that you’ve selected the option Automate Primary key / Unique constraint. If you don’t select this option, AWS SCT won’t add the PK check to the converted code.

Next, choose the stored procedure in the source database tree, right-click, and choose Convert schema.

AWS SCT converts the stored procedure (and the embedded INSERT-SELECT) using the join rewrite pattern. Because AWS SCT generates the code for you, it uses the join pattern to take advantage of its better performance.

And that’s it, it’s that simple. If you’re migrating from Oracle or Teradata, you can use AWS SCT to convert your INSERT-SELECT statements now. We’ll be adding support for additional data warehouse engines soon.

In this post, we focused on INSERT-SELECT statements, but we’re also happy to report that AWS SCT can enforce primary key constraints for INSERT-VALUE and UPDATE statements. AWS SCT injects the appropriate SELECT statement into your code to determine if the INSERT-VALUE or UPDATE will create duplicate primary key values. Download the latest version of AWS SCT and give it a try!
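
AWS SCT generates the guard code for you, but as a rough, hand-written illustration (not AWS SCT output), the check implied for a single-row INSERT ... VALUES against the tgt table used earlier amounts to something like the following:

-- Count existing rows that share the primary key value about to be inserted.
SELECT COUNT(1)
FROM tgt
WHERE tgt.pk_col = 42;   -- 42 stands in for the value being inserted

-- If the count is 0, the application proceeds with:
-- INSERT INTO tgt (pk_col, payload) VALUES (42, 'new row');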

Conclusion

In this post, we showed you how to enforce primary keys in Amazon Redshift. If you’re implementing a new application in Amazon Redshift, you can use the design patterns in this post to enforce the constraints as part of your ETL stream.

Also, if you’re migrating from an Oracle or Teradata database, you can use AWS SCT to automatically convert your SQL to Amazon Redshift. AWS SCT will inject additional code into your SQL stream to enforce your unique key constraints, and thereby insulate your application code from any related changes.

We’re happy to share these updates to help you in your data warehouse migration projects. In the meantime, you can learn more about Amazon Redshift and AWS SCT. Happy migrating!


About the authors

Michael Soo is a Principal Database Engineer with the AWS Database Migration Service team. He builds products and services that help customers migrate their database workloads to the AWS cloud.

Illia Kravtsov is a Database Developer with the AWS Project Delta Migration team. He has 10+ years experience in data warehouse development with Teradata and other MPP databases.

Migrate a large data warehouse from Greenplum to Amazon Redshift using AWS SCT – Part 2

Post Syndicated from Suresh Patnam original https://aws.amazon.com/blogs/big-data/part-2-migrate-a-large-data-warehouse-from-greenplum-to-amazon-redshift-using-aws-sct/

In this second post of a multi-part series, we share best practices for choosing the optimal Amazon Redshift cluster, data architecture, converting stored procedures, compatible functions and queries widely used for SQL conversions, and recommendations for optimizing the length of data types for table columns. You can check out the first post of this series for guidance on planning, running, and validation of a large-scale data warehouse migration from Greenplum to Amazon Redshift using AWS Schema Conversion Tool (AWS SCT).

Choose your optimal Amazon Redshift cluster

Amazon Redshift has two types of clusters: provisioned and serverless. For provisioned clusters, you need to set up the cluster with the required compute resources. Amazon Redshift Serverless can run high-performance analytics in the cloud at any scale. For more information, refer to Introducing Amazon Redshift Serverless – Run Analytics At Any Scale Without Having to Manage Data Warehouse Infrastructure.

An Amazon Redshift cluster consists of nodes. Each cluster has a leader node and one or more compute nodes. The leader node receives queries from client applications, parses the queries, and develops query run plans. The leader node then coordinates the parallel run of these plans with the compute nodes and aggregates the intermediate results from these nodes. It then returns the results to the client applications.

When determining your type of cluster, consider the following:

  • Estimate the size of the input data compressed, vCPU, and performance. As of this writing, we recommend the Amazon Redshift RA3 instance with managed storage, which scales compute and storage independently for fast query performance.
  • Amazon Redshift provides an automated Help me choose option that recommends a cluster configuration based on the size of your data.
  • A main advantage of a cloud Amazon Redshift data warehouse is that you’re no longer stuck with hardware and commodities like old guard data warehouses. For faster innovation, you have the option to try different cluster options and choose the optimized one in terms of performance and cost.
  • At the time of development or pilot, you can usually start with a smaller number of nodes. As you move to production, you can adjust the number of nodes based on your usage pattern. When right-sizing your clusters, we recommend choosing the reserved instance type to cut down the cost even further. The public-facing utility Simple Replay can help you determine performance against different cluster types and sizes by replaying the customer workload. For provisioned clusters, if you’re planning to use the recommended RA3 instance, you can compare different node types to determine the right instance type.
  • Based on your workload pattern, Amazon Redshift supports resize, pause and stop, and concurrency scaling of the cluster. Amazon Redshift workload management (WLM) enables effective and flexible management of memory and query concurrency.

Create data extraction tasks with AWS SCT

With AWS SCT extraction agents, you can migrate your source tables in parallel. These extraction agents authenticate using a valid user on the data source, allowing you to adjust the resources available for that user during the extraction. AWS SCT agents process the data locally and upload it to Amazon Simple Storage Service (Amazon S3) through the network (via AWS Direct Connect). We recommend having a consistent network bandwidth between your Greenplum machine where the AWS SCT agent is installed and your AWS Region.

If you have tables around 20 million rows or 1 TB in size, you can use the virtual partitioning feature on AWS SCT to extract data from those tables. This creates several sub-tasks and parallelizes the data extraction process for this table. Therefore, we recommend creating two groups of tasks for each schema that you migrate: one for small tables and one for large tables using virtual partitions.

For more information, refer to Creating, running, and monitoring an AWS SCT data extraction task.

Data architecture

To simplify and modernize your data architecture, consider the following:

  • Establish accountability and authority to enforce enterprise data standards and policies.
  • Formalize the data and analytics operating model between enterprise and business units and functions.
  • Simplify the data technology ecosystem through rationalization and modernization of data assets and tools or technology.
  • Develop organizational constructs that facilitate more robust integration of the business and delivery teams, and build data-oriented products and solutions to address the business problems and opportunities throughout the lifecycle.
  • Back up the data periodically so that if something is wrong, you have the ability to replay.
  • During planning, design, execution, and throughout implementation and maintenance, ensure data quality management is added to achieve the desired outcome.
  • Simple is the key to an easy, fast, intuitive, and low-cost solution. Simple scales much better than complex. Simple makes it possible to think big (Invent and Simplify is another Amazon leadership principle). Simplify the legacy process by migrating only the necessary data used in tables and schemas. For example, if you’re performing truncate and load for incremental data, identify a watermark and only process incremental data.
  • You may have use cases that require record-level inserts, updates, and deletes for privacy regulations and simplified pipelines; simplified file management and near-real-time data access; or simplified change data capture (CDC) data pipeline development. We recommend using purpose-built tools based on your use case. AWS offers the option to use Apache Hudi with Amazon EMR and AWS Glue.

Migrate stored procedures

In this section, we share best practices for stored procedure migration from Greenplum to Amazon Redshift. Data processing pipelines with complex business logic often use stored procedures to perform the data transformation. We advise using big data processing like AWS Glue or Amazon EMR to modernize your extract, transform, and load (ETL) jobs. For more information, check out Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift. For time-sensitive migration to cloud-native data warehouses like Amazon Redshift, redesigning and developing the entire pipeline in a cloud-native ETL tool might be time-consuming. Therefore, migrating the stored procedures from Greenplum to Amazon Redshift stored procedures can be the right choice.

For a successful migration, make sure to follow Amazon Redshift stored procedure best practices:

  • Specify the schema name while creating a stored procedure. This helps facilitate schema-level security and you can enforce grants or revoke access control.
  • To prevent naming conflicts, we recommend naming procedures using the prefix sp_. Amazon Redshift reserves the sp_ prefix exclusively for stored procedures. By prefixing your procedure names with sp_, you ensure that your procedure name won’t conflict with any existing or future Amazon Redshift procedure names.
  • Qualify your database objects with the schema name in the stored procedure.
  • Follow the minimal required access rule and revoke unwanted access. For similar implementation, make sure the stored procedure run permission is not open to ALL.
  • The SECURITY attribute controls a procedure’s privileges to access database objects. When you create a stored procedure, you can set the SECURITY attribute to either DEFINER or INVOKER. If you specify SECURITY INVOKER, the procedure uses the privileges of the user invoking the procedure. If you specify SECURITY DEFINER, the procedure uses the privileges of the owner of the procedure. INVOKER is the default. For more information, refer to Security and privileges for stored procedures.
  • Managing transactions is important when it comes to stored procedures. For more information, refer to Managing transactions.
  • TRUNCATE issues a commit implicitly inside a stored procedure. It interferes with the transaction block by committing the current transaction and creating a new one. Exercise caution while using TRUNCATE to ensure it never breaks the atomicity of the transaction. This also applies for COMMIT and ROLLBACK.
  • Adhere to cursor constraints and understand the performance considerations of using cursors. You should use set-based SQL logic and temporary tables while processing large datasets.
  • Avoid hardcoding in stored procedures. Use dynamic SQL to construct SQL queries dynamically at runtime. Ensure appropriate logging and error handling of the dynamic SQL.
  • For exception handling, you can write RAISE statements as part of the stored procedure code. For example, you can raise an exception with a custom message or insert a record into a logging table. For unhandled exceptions like WHEN OTHERS, use built-in functions like SQLERRM or SQLSTATE to pass the error on to the calling application or program. As of this writing, Amazon Redshift limits calling a stored procedure from the exception block. The minimal sketch following this list brings several of these conventions together.
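
The following is a minimal sketch, not a production procedure, that illustrates a schema-qualified name with the sp_ prefix, schema-qualified objects in the body, an explicit SECURITY setting, and a RAISE with SQLERRM in the exception block. The etl schema and the table and column names are illustrative only.

CREATE OR REPLACE PROCEDURE etl.sp_load_daily_sales(p_load_date DATE)
AS $$
BEGIN
  -- Qualify objects with the schema name inside the procedure body.
  INSERT INTO etl.daily_sales (sale_date, item, cost)
  SELECT sale_date, item, cost
  FROM etl.stg_daily_sales
  WHERE sale_date = p_load_date;
EXCEPTION
  WHEN OTHERS THEN
    -- Surface the error text to the caller.
    RAISE EXCEPTION 'sp_load_daily_sales failed: %', SQLERRM;
END;
$$ LANGUAGE plpgsql
SECURITY INVOKER;

You would then run the procedure with CALL etl.sp_load_daily_sales('2022-01-01'::DATE); and, following the minimal access rule above, grant EXECUTE only to the roles that need it.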

Sequences

You can use IDENTITY columns, system timestamps, or epoch time to ensure uniqueness. IDENTITY columns and timestamp-based solutions might produce sparse values, so if you need a continuous number sequence, you need to use dedicated number tables. You can also use the RANK() or ROW_NUMBER() window function over the entire set. Alternatively, get the high-water mark from the existing ID column of the table and increment the values while inserting records, as in the sketch that follows.
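
The following sketch (hypothetical table and column names) shows both the IDENTITY option and the high-water mark option:

-- IDENTITY option: values are unique but not guaranteed to be contiguous.
CREATE TABLE orders (
  order_id BIGINT IDENTITY(1,1),
  customer VARCHAR(50)
);

-- High-water mark option: continue numbering from the current maximum
-- when inserting new rows from a staging table.
INSERT INTO orders_target (order_id, customer)
SELECT (SELECT COALESCE(MAX(order_id), 0) FROM orders_target)
       + ROW_NUMBER() OVER (ORDER BY customer),
       customer
FROM orders_staging;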

Character datatype length

Greenplum char and varchar data type length is specified in terms of character length, including multi-byte ones. Amazon Redshift character types are defined in terms of bytes. For table columns using multi-byte character sets in Greenplum, the converted table column in Amazon Redshift should allocate adequate storage to the actual byte size of the source data.

An easy workaround is to set the Amazon Redshift character column length to four times larger than the corresponding Greenplum column length.

A best practice is to use the smallest possible column size. Amazon Redshift doesn't allocate storage space according to the length of the attribute; it allocates storage according to the real length of the stored string. However, at runtime, while processing queries, Amazon Redshift allocates memory according to the length of the attribute. Therefore, avoiding a blanket default of four times the source length helps from a performance perspective.

An efficient solution is to analyze production datasets and determine the maximum byte size length of the Greenplum character columns. Add a 20% buffer to support future incremental growth on the table.

To arrive at the actual byte size length of an existing column, run the Greenplum data structure character utility from the AWS Samples GitHub repo.
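
As an illustration (hypothetical table and column names), the two sizing approaches look like this for a Greenplum column defined as VARCHAR(100) in characters:

-- Safe initial conversion: allocate 4 bytes per source character.
CREATE TABLE customer_v1 (
  customer_name VARCHAR(400)   -- 100 characters x 4 bytes
);

-- Preferred: size from the observed maximum byte length (say 120 bytes)
-- plus a roughly 20% growth buffer.
CREATE TABLE customer_v2 (
  customer_name VARCHAR(144)
);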

Numeric precision and scale

The Amazon Redshift numeric data type can store up to a maximum precision of 38, whereas in a Greenplum database, you can define a numeric column without any defined length.

Analyze your production datasets and determine numeric overflow candidates using the Greenplum data structure numeric utility from the AWS Samples GitHub repo. For numeric data, you have options to tackle this based on your use case. For numbers with a decimal part, you have the option to round the data based on the data type without any data loss in the whole number part. For future reference, you can keep a copy of the column in VARCHAR or store it in an S3 data lake. If only an extremely small percentage of the data overflows, clean up the source data for a quality migration.
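
The following sketch (hypothetical names, assuming the staging column was landed as VARCHAR and its whole-number part fits within the target precision) rounds the decimal part into a bounded NUMERIC column while keeping the original text for reference:

CREATE TABLE measurements (
  reading_rounded NUMERIC(38, 4),
  reading_raw     VARCHAR(80)
);

-- Round to 4 decimal places; the intermediate cast assumes at most 28 integer digits.
INSERT INTO measurements
SELECT ROUND(reading_value::NUMERIC(38, 10), 4),
       reading_value
FROM stg_measurements;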

SQL queries and functions

While converting SQL scripts or stored procedures to Amazon Redshift, you might encounter unsupported functions, database objects, or code blocks, for which you might have to rewrite the query, create user-defined functions (UDFs), or redesign the logic. You can create a custom scalar UDF using either a SQL SELECT clause or a Python program. The new function is stored in the database and is available for any user with sufficient privileges to run. You run a custom scalar UDF in much the same way as you run existing Amazon Redshift functions to match any functionality of legacy databases. The following are some examples of alternate query statements and ways to achieve specific aggregations that might be required during a code rewrite.

AGE

The Greenplum AGE() function returns an interval representing the difference between the current date and a given date. You can accomplish the same using a combination of the MONTHS_BETWEEN(), ADD_MONTHS(), DATEDIFF(), and TRUNC() functions, based on your use case.

The following example Amazon Redshift query calculates the gap between the date 2001-04-10 and 1957-06-13 in terms of year, month, and days. You can apply this to any date column in a table.

select
	trunc(trunc(months_between('2001-04-10'::date, '1957-06-13'::date))/ 12) years,
	mod(trunc(months_between('2001-04-10'::date, '1957-06-13'::date))::int4, 12) months,
	'2001-04-10'::date -add_months('1957-06-13'::date,
	trunc(months_between('2001-04-10'::date, '1957-06-13'::date))::int4) days;

COUNT

If you have a use case that needs a distinct count in the COUNT() window function, you can accomplish the same using a combination of the DENSE_RANK() and MAX() window functions.

The following example Amazon Redshift query calculates the distinct item count for a given date of sale:

select
	sale_date,
	item,
	cost,
	max(densernk) over (partition by sale_date order by item rows between unbounded preceding and unbounded following) as distinct_itemcount
from
	(
	select
		*,dense_rank() over (partition by sale_date order by item) as densernk
	from
		testaggr)
order by
	sale_date,
	item,
cost;

ORDER BY

Amazon Redshift aggregate window functions with an ORDER BY clause require a mandatory frame.

The following example Amazon Redshift query creates a cumulative sum of cost by sale date and orders the results by item within the partition:

select
	*,
    sum(cost) over (partition by sale_date
order by
	item rows between unbounded preceding and unbounded following) as total_cost_by_date
from
	testaggr
order by
	sale_date,
	item,
	cost;

STRING_AGG

In Greenplum, STRING_AGG() is an aggregate function, which is used to concatenate a list of strings. In Amazon Redshift, use the LISTAGG() function.

The following example Amazon Redshift query returns a semicolon-separated list of email addresses for each department:

select
	dept,
	listagg(email_address,';') 
within group (order by dept) as email_list
from
	employee_contact
group by
	dept
order by
	dept;

ARRAY_AGG

In Greenplum, ARRAY_AGG() is an aggregate function that takes a set of values as input and returns an array. In Amazon Redshift, use a combination of the LISTAGG() and SPLIT_TO_ARRAY() functions. The SPLIT_TO_ARRAY() function returns a SUPER datatype.

The following example Amazon Redshift query returns an array of email addresses for each department:

select
	dept,
	SPLIT_TO_ARRAY(email_list,
	';') email_array
from
	(
	select
		dept,
		listagg(email_address,
		';') 
within group (
		order by dept) as email_list
	from
		employee_contact
	group by
		dept
	order by
		dept);

To retrieve array elements from a SUPER expression, you can use the SUBARRAY() function:

 select
	SUBARRAY( email_array,
	0,
	1 ) first_element,
	SUBARRAY( email_array,
	1,
	1) second_element,
	SUBARRAY( email_array,
	0 ) all_element
from
	testarray
where
	dept = 'HR';

UNNEST

In Greenplum, you can use the UNNEST function to split an array and convert the array elements into a set of rows. In Amazon Redshift, you can use PartiQL syntax to iterate over SUPER arrays. For more information, refer to Querying semistructured data.

create temp table unnesttest as
select
	json_parse('{"scalar_array": [0,1,2,3,4,5.5,6,7.9,8,9]}') as data;

select
	element
from
	unnesttest as un,
	un.data.scalar_array as element at index;

WHERE

You can’t use a window function in the WHERE clause of a query in Amazon Redshift. Instead, construct the query using a WITH clause and then refer to the calculated column in the WHERE clause.

The following example Amazon Redshift query returns the sale date, item, and cost from a table for the sales dates where the total sale is more than 100:

with aggrcost as (
select
	sale_date ,
	item,
	cost,
	sum(cost) over (partition by sale_date) as total_sale
from
	testaggr )
select
	*
from
	aggrcost
where
	total_sale > 100;

Refer to the following table for additional Greenplum date/time functions along with their Amazon Redshift equivalents to accelerate your code migration.

| # | Description | Greenplum | Amazon Redshift |
|---|---|---|---|
| 1 | Return the start time of the current transaction | now() | sysdate |
| 2 | Return the current timestamp, which can change within a statement | clock_timestamp() | to_date(getdate(),'yyyy-mm-dd') + substring(timeofday(),12,15)::timetz |
| 3 | Return the start timestamp of the current transaction | transaction_timestamp() | to_date(getdate(),'yyyy-mm-dd') + substring(timeofday(),12,15)::timetz |
| 4 | Add x years and y months to date_time_column and return a timestamp | date_time_column + interval 'x years y months' | add_months(date_time_column, x*12 + y) |
| 5 | Get the total number of seconds between two timestamp fields | date_part('day', end_ts - start_ts) * 24 * 60 * 60 + date_part('hours', end_ts - start_ts) * 60 * 60 + date_part('minutes', end_ts - start_ts) * 60 + date_part('seconds', end_ts - start_ts) | datediff('seconds', start_ts, end_ts) |
| 6 | Get the total number of minutes between two timestamp fields | date_part('day', end_ts - start_ts) * 24 * 60 + date_part('hours', end_ts - start_ts) * 60 + date_part('minutes', end_ts - start_ts) | datediff('minutes', start_ts, end_ts) |
| 7 | Extract a date part from the difference of two timestamp fields | date_part('hour', end_ts - start_ts) | extract(hour from (date_time_column_2 - date_time_column_1)) |
| 8 | Return the ISO day of the week | date_part('isodow', date_time_column) | TO_CHAR(date_time_column, 'ID') |
| 9 | Return the ISO year from a date/time field | extract(isoyear from date_time_column) | TO_CHAR(date_time_column, 'IYYY') |
| 10 | Convert epoch seconds to the equivalent timestamp | to_timestamp(epoch_seconds) | TIMESTAMP 'epoch' + number_of_seconds * interval '1 second' |
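
As a quick sanity check of a few rows from this table, the following sketch uses arbitrary literal values in place of table columns (the dates and epoch number are only examples):

select
	datediff('seconds', '2023-01-01 00:00:00'::timestamp, '2023-01-01 01:30:15'::timestamp) as total_seconds, -- row 5
	to_char('2023-01-02'::date, 'ID') as iso_day_of_week, -- row 8
	timestamp 'epoch' + 1700000000 * interval '1 second' as epoch_to_timestamp; -- row 10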

Amazon Redshift utility for troubleshooting or running diagnostics for the cluster

The Amazon Redshift Utilities GitHub repo contains a set of utilities to accelerate troubleshooting or analysis on Amazon Redshift. Such utilities consist of queries, views, and scripts. They are not deployed by default onto Amazon Redshift clusters. The best practice is to deploy the needed views into the admin schema.
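
A minimal sketch of that pattern, assuming a dedicated admin schema and a DBA group that already exists in your cluster (the group name is illustrative):

-- Create a dedicated schema to hold diagnostic views copied from the repo.
create schema if not exists admin;
-- Restrict usage to the operations team.
grant usage on schema admin to group dba_group;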

Conclusion

In this post, we covered prescriptive guidance around data types, functions, and stored procedures to accelerate the migration process from Greenplum to Amazon Redshift. Although this post describes modernizing and moving to a cloud warehouse, you should treat this transformation as a step toward a full-fledged modern data architecture. The AWS Cloud enables you to be more data-driven by supporting multiple use cases. For a modern data architecture, you should use purpose-built data stores like Amazon S3, Amazon Redshift, Amazon Timestream, and others based on your use case.


About the Authors

Suresh Patnam is a Principal Solutions Architect at AWS. He is passionate about helping businesses of all sizes transform into fast-moving digital organizations focusing on big data, data lakes, and AI/ML. Suresh holds an MBA from Duke University's Fuqua School of Business and an MS in CIS from Missouri State University. In his spare time, Suresh enjoys playing tennis and spending time with his family.

Arunabha Datta is a Sr. Data Architect at AWS Professional Services. He collaborates with customers and partners to architect and implement modern data architecture using AWS Analytics services. In his spare time, Arunabha enjoys photography and spending time with his family.

Migrate a large data warehouse from Greenplum to Amazon Redshift using AWS SCT – Part 1

Post Syndicated from Suresh Patnam original https://aws.amazon.com/blogs/big-data/part-1-migrate-a-large-data-warehouse-from-greenplum-to-amazon-redshift-using-aws-sct/

A data warehouse collects and consolidates data from various sources within your organization. It’s used as a centralized data repository for analytics and business intelligence.

When working with on-premises legacy data warehouses, scaling the size of your data warehouse or improving performance can mean purchasing new hardware or adding more powerful hardware. This is often expensive and time-consuming. Running your own on-premises data warehouse also requires hiring database managers and administrators to deal with outages, upgrades, and data access requests. As companies become more data-driven, reliable access to centralized data is increasingly important. As a result, there is a strong demand for data warehouses that are fast, accessible, and able to scale elastically with business needs. Cloud data warehouses like Amazon Redshift address these needs while eliminating the cost and risk of purchasing new hardware.

This multi-part series explains how to migrate an on-premises Greenplum data warehouse to Amazon Redshift using AWS Schema Conversion Tool (AWS SCT). In this first post, we describe how to plan, run, and validate the large-scale data warehouse migration. It covers the solution overview, migration assessment, and guidance on technical and business validation. In the second post, we share best practices for choosing the optimal Amazon Redshift cluster, data architecture, converting stored procedures, compatible functions and queries widely used for SQL conversions, and recommendations for optimizing the length of data types for table columns.

Solution overview

Amazon Redshift is an industry-leading cloud data warehouse. Amazon Redshift uses Structured Query Language (SQL) to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes using AWS-designed hardware and machine learning to deliver the best price-performance at any scale.

AWS SCT makes heterogeneous database migrations predictable by automatically converting the source database schema and most of the database code objects, SQL scripts, views, stored procedures, and functions to a format compatible with the target database. AWS SCT helps you modernize your applications simultaneously during database migration. When schema conversion is complete, AWS SCT can help migrate data from various data warehouses to Amazon Redshift using data extraction agents.

The following diagram illustrates our architecture for migrating data from Greenplum to Amazon Redshift using AWS SCT data extraction agents.

Perform a migration assessment

The initial data migration is the first milestone of the project. The main requirements for this phase are to minimize the impact on the data source and transfer the data as fast as possible. To do this, AWS offers several options, depending on the size of the database, network performance (AWS Direct Connect or AWS Snowball), and whether the migration is heterogeneous or not (AWS Database Migration Service (AWS DMS) or AWS SCT).

AWS provides a portfolio of cloud data migration services to provide the right solution for any data migration project. The level of connectivity is a significant factor in data migration, and AWS has offerings that can address your hybrid cloud storage, online data transfer, and offline data transfer needs.

Additionally, the AWS Snow Family makes it simple to get your data into and out of AWS via offline methods. Based on the size of the data, you can use AWS Snowmobile or AWS Snowball if you have petabytes to exabytes of data. To decide which transfer method is better for your use case, refer to Performance for AWS Snowball.

Perform schema conversion with AWS SCT

To convert your schema using AWS SCT, you must start a new AWS SCT project and connect your databases. Complete the following steps:

  1. Install AWS SCT.
  2. Open and initiate a new project.
  3. For Source database engine, choose Greenplum.
  4. For Target database engine, choose Amazon Redshift.
  5. Choose OK.
  6. Open your project and choose Connect to Greenplum.
  7. Enter the Greenplum database information.
  8. Choose Test connection.
  9. Choose OK after a successful connection test.
  10. Choose OK to complete the connection.
  11. Repeat similar steps to establish a connection to your Amazon Redshift cluster.

    By default, AWS SCT uses AWS Glue as the extract, transform, and load (ETL) solution for the migration. Before you continue, you must disable this setting.
  12. On the Settings menu, choose Project settings.
  13. Deselect Use AWS Glue.
  14. Choose OK.
  15. In the left pane, choose your schema (right-click) and choose Convert schema.
  16. When asked to replace objects, choose Yes.
  17. When asked to load statistics, choose Continue.

    By the end of this step, all Greenplum objects should be migrated to Amazon Redshift syntax. Some objects may be shown in red, meaning that AWS SCT couldn’t fully migrate these objects. You can view an assessment summary of the migration for more information.
  18. On the View menu, choose Assessment report view.

    In the bottom pane, you can see Greenplum DDL and Amazon Redshift DDL of the selected objects side by side for comparison.
  19. Choose the schema with a red icon, which indicates that it needs manual conversion. You’re presented with specific actions regarding the tables, constraints, or views that can’t be migrated to Amazon Redshift. You must investigate these issues and fix the errors manually with the required changes. For example, binary data in BLOB format is automatically converted to a character varying data type, but this may be highlighted as an issue. Additionally, some vendor-supplied procedures and functions can’t be converted, so AWS SCT flags them as errors.

    As a final step, you can validate that the tables exist in Amazon Redshift.
  20. Connect using the Amazon Redshift query editor v2 or another third-party tool or utility of your choice and check for all the tables with the following code:
    select t.table_schema, t.table_name, i.tbl_rows
    from svv_tables t left join svv_table_info i
    on t.table_schema = i."schema" and t.table_name = i."table"
    where t.table_type = 'BASE TABLE' and t.table_schema = 'SCHEMA NAME'
    order by 1,2;

Migrate the data

To start your data migration using AWS SCT data extraction agents, complete the following steps:

  1. Configure the AWS SCT extractor properties file with corresponding Greenplum properties:
    port=8192
    vendor=GREENPLUM
    driver.jars="C:\DMS Workshop\JDBC\postgresql-42.2.9.jar"
    redshift.driver.jars=C:\DMS Workshop\JDBC\RedshiftJDBC42-no-awssdk-1.2.43.1067.jar
    working.folder=C:/Users/developer
    ssl.option=ON
    ssl.require.client.authentication=ON
    #extractor.start.fetch.size=20000
    #extractor.out.file.size=10485760
    ssl.truststore.path=C:/DMS Workshop/sctagent/truststore
    ssl.keystore.path=C:/DMS Workshop/sctagent/keystore
    ssl.keystore.pwd=ZcUnMLbWYe1c0Dp/m9jIe8IMGNtyUOv+AnKVsV9eZI+SuJcWI9Fd99Lskd7FFA==
    ssl.truststore.pwd=w23mFvm+SO46eR4dJ0Ly7n+XcvgUhhqrJNMPUyWdhFFKBxRU64/V2uioOul/aBcs8tMs=

    Now you configure the AWS SCT extractor to perform a one-time data move. You can use multiple extractors when dealing with a large volume of data.

  2. To register the extractor, on the View menu, choose Data migration view.
  3. Choose Register.
  4. Enter the information for your new agent.
  5. Test the connection and choose Register.

    Now you create a task for the extractor to extract data into the tables created on Amazon Redshift.
  6. Under your schema in the left pane, choose Tables (right-click) and choose Create Local task.
  7. For Task name, enter a name.
  8. Test the connection and choose OK.
  9. Choose Create.
  10. Run your task and monitor its progress.

You can choose each task to get a detailed breakdown of its activity. Make sure to examine errors during the extract, upload, and copy process.

You can monitor the status of the tasks, the percentage completed, and the tables that were loaded successfully. You must also verify the count of records loaded into the Amazon Redshift database.
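
As a quick check, you can compare exact row counts on a few representative tables against counts captured from the Greenplum source before the extract; the table names below are illustrative.

-- Exact row counts for selected migrated tables.
select 'employee_contact' as table_name, count(*) as row_count from employee_contact
union all
select 'testaggr', count(*) from testaggr;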

Technical validation

After the initial extracted data is loaded to Amazon Redshift, you must perform data validation tests in parallel. The goal at this stage is to validate production workloads, comparing Greenplum and Amazon Redshift outputs from the same inputs.

Typical activities covered during this phase include the following:

  • Count the number of objects and rows on each table.
  • Compare the same random subset of data in both Greenplum and Amazon Redshift for all migrated tables, validating that the data is exactly the same row by row.
  • Check for incorrect column encodings.
  • Identify skewed table data.
  • Annotate queries not benefiting from sort keys.
  • Identify inappropriate join cardinality.
  • Identify tables with large VARCHAR columns (a query sketch covering this check, along with the encoding and skew checks, follows this list).
  • Confirm that processes don’t crash when connected with the target environment.
  • Validate daily batch jobs (job duration, number of rows processed). To find the right techniques to perform most of these activities, refer to Top 10 Performance Tuning Techniques for Amazon Redshift.
  • Set up Amazon Redshift automated alerts with Amazon Redshift Advisor.
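
The following query sketch covers several of these checks (column encodings, skew, row counts, and large VARCHAR columns) from a single Amazon Redshift system view; replace 'SCHEMA NAME' with your migrated schema.

-- Review encoding, skew, estimated row counts, and widest VARCHAR per table.
select "schema", "table", tbl_rows, encoded, skew_rows, max_varchar
from svv_table_info
where "schema" = 'SCHEMA NAME'
order by skew_rows desc nulls last;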

Business validation

After you successfully migrate the data and validate the data movement, the last remaining task is to involve the data warehouse users in the validation process. These users from different business units across the company access the data warehouse using various tools and methods: JDBC/ODBC clients, Python scripts, custom applications, and more. It’s central to the migration that every end-user verifies and adapts their processes to work seamlessly with Amazon Redshift before the final cutover.

This phase can consist of several tasks, which vary depending on the tools and processes each business unit uses.

This business validation phase is key so all end-users are aligned and ready for the final cutover. Following Amazon Redshift best practices enables end-users to fully take advantage of the capabilities of their new data warehouse. After you perform all the migration validation tasks, connect and test every ETL job, business process, external system, and user tool against Amazon Redshift, you can disconnect every process from the old data warehouse, which you can now safely power off and decommission.

Conclusion

In this post, we provided detailed steps to migrate from Greenplum to Amazon Redshift using AWS SCT. Although this post describes modernizing and moving to a cloud warehouse, you should treat this transformation as a step toward a full-fledged modern data architecture. The AWS Cloud enables you to be more data-driven by supporting multiple use cases. For a modern data architecture, you should use purpose-built data stores like Amazon S3, Amazon Redshift, Amazon Timestream, and other data stores based on your use case.

Check out the second post in this series, where we cover prescriptive guidance around data types, functions, and stored procedures.


About the Authors

Suresh Patnam is a Principal Solutions Architect at AWS. He is passionate about helping businesses of all sizes transform into fast-moving digital organizations focusing on big data, data lakes, and AI/ML. Suresh holds an MBA from Duke University's Fuqua School of Business and an MS in CIS from Missouri State University. In his spare time, Suresh enjoys playing tennis and spending time with his family.

Arunabha Datta is a Sr. Data Architect at Amazon Web Services (AWS). He collaborates with customers and partners to architect and implement modern data architecture using AWS Analytics services. In his spare time, Arunabha enjoys photography and spending time with his family.

Modernization pathways for a legacy .NET Framework monolithic application on AWS

Post Syndicated from Ramakant Joshi original https://aws.amazon.com/blogs/architecture/modernization-pathways-for-a-legacy-net-framework-monolithic-application-on-aws/

Organizations aim to deliver optimal technological solutions based on their customers’ needs. Although they may be at any stage in their cloud adoption journey, businesses often end up managing and building monolithic applications. However, this approach comes with challenges. The internal structure of a monolithic application makes it difficult for developers to maintain code. This creates a steep learning curve for new developers and increases costs. Monoliths require multiple teams to coordinate a single large release, which increases the collaboration and knowledge transfer burden. As a business grows, a monolithic application may struggle to meet the demands of an expanding user base. To address these concerns, customers should evaluate their readiness to modernize their applications in the AWS Cloud to meet their business and technical needs.

We will discuss an approach to modernizing a monolithic three-tier application (MVC pattern): a web tier, an application tier using a .NET Framework, and a data tier with a Microsoft SQL (MSSQL) Server relational database. There are three main modernization pathways for .NET applications: rehosting, replatforming, and refactoring. We recommend following this decision matrix to assess and decide on your migration path, based on your specific requirements. For this blog, we will focus on a replatform and refactor strategy to design loosely coupled microservices, packaged as lightweight containers, and backed by a purpose-built database.

Your modernization journey

The outcomes of your organization’s approach to modernization give you the ability to scale optimally with your customers’ demands. Let’s dive into a guided approach that achieves your goals of a modern architecture, and at the same time addresses scalability, ease of maintenance, rapid deployment cycles, and cost optimization.

This involves four steps:

  1. Break down the monolith
  2. Containerize your application
  3. Refactor to .NET 6
  4. Migrate to a purpose-built, lower-cost database engine.

1. Break down the monolith

Migration to the Amazon Web Services (AWS) Cloud has many advantages. These can include increased speed to market and business agility, new revenue opportunities, and cost savings. To take full advantage, you should continuously modernize your organization’s applications by refactoring your monolithic applications into microservices.

Decomposing a monolithic application into microservices presents technical challenges that require a solid understanding of the existing code base and context of the business domains. Several patterns are useful to incrementally transform a monolithic application into microservices and other distributed designs. However, the process of refactoring the code base is manual, risky, and time consuming.

To help developers accelerate the transformation, AWS introduced AWS Microservice Extractor for .NET. It helps you break down and refactor a monolithic application into smaller code projects. Read how AWS Microservice Extractor for .NET helped our partner, Kloia, accelerate the modernization journey of their customers and decompose a monolith.

The next modernization pathway is to containerize your application.

2. Containerize

Why should you move to containers? Containers offer a way to help you build, test, deploy, and redeploy applications on multiple environments. Specifically, Docker Containers provide you with a reliable way to gather your application components and package them together into one build artifact. This is important because modern applications are often composed of a variety of pieces besides code, such as dependencies, binaries, or system libraries. Moving legacy .NET Framework applications to containers helps to optimize operating system utilization and achieve runtime consistency.

To accelerate this process, containerize these applications to Windows containers with AWS App2Container (A2C). A2C is a command line tool for modernizing .NET and Java applications into containerized applications. A2C analyzes and builds an inventory of all applications running in virtual machines, on-premises, or in the cloud. Select the application that you want to containerize and A2C packages the application artifact and identified dependencies into container images. Here is a step-by-step article and self-paced workshop to get you started using A2C.
Once your app is containerized, you can choose to self-manage by using Amazon EC2 to host Docker with Windows containers. You can also use Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS). These are fully managed container orchestration services that free you to focus on building and managing applications instead of your underlying infrastructure. Read Amazon ECS vs Amazon EKS: making sense of AWS container services.

In the next section, we’ll discuss two primary aspects of optimizing costs in our modernization scenario:

  1. Licensing costs of running workloads on Windows servers.
  2. SQL Server licensing cost.

3. Refactor to .NET 6

To address Windows licensing costs, consider moving to a Linux environment by adopting .NET Core and using the Dockerfile for a Linux Container. Customers such as GoDataFeed benefit by porting .NET Framework applications to more recent .NET 6 and running them on AWS. The .NET team has significantly improved performance with .NET 6, including a 30–40% socket performance improvement on Linux. They have added ARM64-specific optimizations in the .NET libraries, which enable customers to run on AWS Graviton.

You may also choose to switch to a serverless option using AWS Lambda (which supports .NET 6 runtime), or run your containers on ECS with Fargate, a serverless, pay-as-you-go compute engine. AWS Fargate powered by AWS Graviton2 processors can reduce cost by up to 20%, and increase performance by up to 40% versus x86 Intel-based instances. If you need full control over an application’s underlying virtual machine (VM), operating system, storage, and patching, run .NET 6 applications on Amazon EC2 Linux instances. These are powered by the latest-generation Intel and AMD processors.

To help customers port their application to .NET 6 faster, AWS added .NET 6 support to Porting Assistant for .NET. Porting Assistant is an analysis tool that scans .NET Framework (3.5+) applications to generate a target .NET Core or .NET 6 compatibility assessment. This helps you to prioritize applications for porting based on effort required. It identifies incompatible APIs and packages from your .NET Framework applications, and finds known replacements. You can refer to a demo video that explains this process.

4. Migrate from SQL Server to a lower-cost database engine

AWS advocates that you build use case-driven, highly scalable, distributed applications suited to your specific needs. From a database perspective, AWS offers 15+ purpose-built engines to support diverse data models. Furthermore, microservices architectures employ loose coupling, so each individual microservice can independently store and retrieve information from its own data store. By deploying the database-per-service pattern, you can choose the most optimal data stores (relational or non-relational databases) for your application and business requirements.

For the purpose of this blog, we will focus on a relational database alternate for SQL Server. To address the SQL Server licensing costs, customers can consider a move to an open-source relational database engine. Amazon Relational Database Service (Amazon RDS) supports MySQL, MariaDB, and PostgreSQL. We will focus on PostgreSQL with a well-defined migration path. Amazon RDS supports two types of Postgres databases: Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL-Compatible Edition. To help you choose, read Is Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL a better choice for me?

Once you’ve decided on the Amazon RDS flavor, the next question would be “what’s the right migration strategy for me?” Consider the following:

  1. Convert your schema
  2. Migrate the data
  3. Refactor your application

Schema conversion

AWS Schema Conversion Tool (SCT) is a free tool that can help you convert your existing database from one engine to another. AWS SCT supports a number of source databases, including Microsoft SQL Server, Oracle, and MySQL. You can choose from target database engines such as Amazon Aurora PostgreSQL-Compatible Edition, or choose to set up a data lake using Amazon S3. AWS SCT provides a graphical user interface that directly connects to the source and target databases to fetch the current schema objects. When connected, you can generate a database migration assessment report to get a high-level summary of the conversion effort and action items.

Data migration

When the schema migration is complete, you can move your data from the source database to the target database. Depending on your application availability requirements, you can run a straightforward extraction job that performs a one-time copy of the source data into the new database. Or, you can use a tool that copies the current data and continues to replicate all changes until you are ready to cut over to the new database. One such tool is AWS Database Migration Service (AWS DMS) that helps you migrate relational databases, data warehouses, NoSQL databases, and other types of data stores.

With AWS DMS, you can perform one-time migrations, and you can replicate ongoing changes to keep sources and targets in sync. When the source and target databases are in sync, you can take your database offline and move your operations to the target database. Read Microsoft SQL Server To Amazon Aurora with PostgreSQL Compatibility for a playbook or use this self-guided workshop to migrate to a PostgreSQL compatible database using SCT and DMS.

Application refactoring

Each database engine has its differences and nuances, and moving to a new database engine such as PostgreSQL from MSSQL Server will require code refactoring. After the initial database migration is completed, manually rewriting application code, switching out database drivers, and verifying that the application behavior hasn’t changed requires significant effort. This involves potential risk of errors when making extensive changes to the application code.

AWS built Babelfish for Aurora PostgreSQL to simplify migrating applications from SQL Server to Amazon Aurora PostgreSQL-Compatible Edition. Babelfish for Aurora PostgreSQL is a new capability for Amazon Aurora PostgreSQL-Compatible Edition that enables Aurora to understand commands from applications written for Microsoft SQL Server. With Babelfish, Aurora PostgreSQL now understands T-SQL, Microsoft SQL Server’s proprietary SQL dialect. It supports the same communications protocol, so your apps that were originally written for SQL Server can now work with Aurora. Read about how to migrate from SQL Server to Babelfish for Aurora PostgreSQL. Make sure you run the Babelfish Compass tool to determine whether the application contains any SQL features not currently supported by Babelfish.

Figure 1 shows the before and after state for your application based on the modernization path described in this blog. The application tier consists of microservices running on Amazon ECS Fargate clusters (or AWS Lambda functions), and the data tier runs on Amazon Aurora (PostgreSQL flavor).

Figure 1. A modernized microservices-based rearchitecture

Summary

In this post, we showed a migration path for a monolithic .NET Framework application to a modern microservices-based stack on AWS. We discussed AWS tools to break the monolith into microservices, and containerize the application. We also discussed cost optimization strategies by moving to Linux-based systems, and using open-source database engines. If you’d like to know more about modernization strategies, read this prescriptive guide.

Accelerate Snowflake to Amazon Redshift migration using AWS Schema Conversion Tool

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/accelerate-snowflake-to-amazon-redshift-migration-using-aws-schema-conversion-tool/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers. Today, tens of thousands of AWS customers—from Fortune 500 companies, startups, and everything in between—use Amazon Redshift to run mission-critical business intelligence (BI) dashboards, analyze real-time streaming data, and run predictive analytics jobs. With the constant increase in generated data, Amazon Redshift customers continue to achieve successes in delivering better service to their end-users, improving their products, and running an efficient and effective business.

Several new features of Amazon Redshift address a wide range of analytics requirements and improve the performance of the cloud data warehouse.

To take advantage of these capabilities and future innovations, you must migrate from your existing data warehouse to Amazon Redshift.

In this post, we show you how to migrate your data warehouse schema from Snowflake to Amazon Redshift using AWS Schema Conversion Tool (AWS SCT). AWS SCT is a service that makes heterogeneous database migrations predictable by automatically converting the source database schema and a majority of the database code objects, including views, stored procedures, and functions, to a format compatible with the target database. Any objects that can’t be automatically converted are clearly marked so that they can be manually converted to complete the migration. AWS SCT can also scan your application source code for embedded SQL statements and convert them as part of a database schema conversion project. During this process, AWS SCT performs cloud-native code optimization by converting legacy data warehouse functions to their equivalent AWS service, thereby helping you modernize the applications at the same time of database migration.

Solution overview

To implement this solution, you complete the following high-level steps:

  1. Configure your AWS SCT application.
  2. Analyze your source Snowflake schema.
  3. Convert your Snowflake schema to an Amazon Redshift schema.
  4. Deploy the schema to Amazon Redshift.

The following diagram illustrates the solution architecture.

Prerequisites

Before starting this walkthrough, make sure you have the required prerequisites in place, including access to the source Snowflake database, a target Amazon Redshift cluster, and the AWS SCT application.

Set up an AWS SCT project and extract the schema from the source

In this walkthrough, we use the Snowflake sample database TPCDS_SF10TCL as the source of the schema conversion.

To set up the database migration project, complete the following steps:

  1. Launch the AWS SCT application.
  2. On the File menu, choose New project wizard.
  3. Enter the project name and location.
  4. For Source engine, choose Snowflake.
  5. Choose Next.
  6. Provide the database information and credentials for Snowflake and choose Test Connection.
  7. When the connection is successful, choose Next.

For more information, see Using Snowflake as a source for AWS SCT.

  1. Expand SNOWFLAKE_SAMPLE_DATA and Schemas.
  2. Select TPCDS_SF10TCL and choose Next.

AWS SCT analyzes the schema and prepares an assessment report summary, as shown in the following screenshot.

This report summarizes the objects that AWS SCT converts to Amazon Redshift.

  1. Review the report and choose Next.
  2. Provide the database information and credentials for Amazon Redshift and deselect Use AWS Glue.
  3. Choose Test Connection.
  4. When the connection is successful, choose Finish.

For more information about establishing a connection to Amazon Redshift, see Getting the JDBC URL.

Review and apply the schema from Snowflake to Amazon Redshift

To convert the schema from Snowflake objects to Amazon Redshift, complete the following steps:

  1. Expand SNOWFLAKE_SAMPLE_DATA and Schemas.
  2. Choose (right-click) TPCDS_SF10TCL and choose Convert schema.
  3. Choose Collect and continue.

AWS SCT converts the schema and shows the converted schema and objects in the right pane. The converted schema is marked with a red check mark.

AWS SCT takes some actions automatically while converting the schema to Amazon Redshift; objects with such actions are marked with a red warning sign.

  1. Choose the object and then on the top menu, choose Main view and choose Assessment Report view.
  2. Choose the Action items tab.

You can see a list of all issues and actions taken by AWS SCT.

You can evaluate and inspect the individual object DDL by selecting it from the right pane, and you can also edit it as needed. In the following example, we modify the DISTKEY to use inv_item_sk. AWS SCT analyzes the tables and recommends the distribution and sort keys based on the statistics. For cases where you’re not sure, you should set them to AUTO. For more information about automatic data distribution and optimization in Amazon Redshift, refer to Automate your Amazon Redshift performance tuning with automatic table optimization.
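
If you prefer to adjust keys after the schema has been deployed rather than editing the generated DDL, you can alter them in place. The following statements are a sketch with illustrative TPC-DS-style table and column names.

-- Change the distribution key, switch a table to automatic distribution, and set a sort key.
alter table inventory alter distkey inv_item_sk;
alter table web_sales alter diststyle auto;
alter table store_sales alter sortkey (ss_sold_date_sk);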

  1. To deploy the objects DDL to Amazon Redshift, select the converted schema in the right pane, right-click, and choose Apply to database.
  2. Optionally, if you want to export the copy of the DDLs generated by AWS SCT and apply them manually, you can select the Amazon Redshift converted schema, right-click, and choose Save as SQL.
  3. Log in to your Amazon Redshift cluster using the Query Editor V2.

For more information about Query Editor V2, refer to Introducing Amazon Redshift Query Editor V2, a Free Web-based Query Authoring Tool for Data Analysts.

  1. To verify that the converted schema objects are deployed in Amazon Redshift, select the desired table, right-click, and choose Show table definition to see the underlying DDL.
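
    Alternatively, you can retrieve the deployed DDL with SQL; the schema and table names in this sketch are illustrative.

    -- Returns the CREATE TABLE statement for the deployed table.
    show table tpcds_sf10tcl.store_sales;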

Clean up

To avoid incurring future charges, complete the following steps:

  1. Delete the Amazon Redshift cluster created for this demonstration.
  2. If you were using an existing cluster, delete the new tables that you created as part of this exercise.
  3. Stop any Amazon Elastic Compute Cloud (Amazon EC2) instances that were created to run the AWS SCT application.

Summary

In this post, we showed how easy it is to convert a Snowflake schema to an Amazon Redshift schema, using AWS SCT for the automatic conversion.

We look forward to hearing from you about your experience. If you have questions or suggestions, please leave a comment.


About the Authors

BP Yau is a Sr Analytics Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Shawn Sachdev is a Sr Analytics Specialist Solutions Architect at AWS. He works with customers and provides guidance to help them innovate and build well-architected and high-performance data warehouses and implement analytics at scale on the AWS platform. Before AWS, he has worked in several Analytics and System Engineering roles. Outside of work, he loves watching sports, and is an avid foodie and a craft beer enthusiast.

Srikanth Sopirala is a Principal Analytics Specialist Solutions Architect at AWS. He is a seasoned leader with over 20 years of experience, who is passionate about helping customers build scalable data and analytics solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys reading, spending time with his family, and road cycling.

Selecting the right database and database migration plan for your workloads

Post Syndicated from Nikhil Anand original https://aws.amazon.com/blogs/architecture/selecting-the-right-database-and-database-migration-plan-for-your-workloads/

There has been a tectonic shift in the approach to hosting enterprise workloads. Companies are rapidly moving from on-premises data centers to cloud-based services. The driving factor has been the ability to innovate faster on the cloud.

Your transition to cloud can be straightforward, but it does go beyond the usual ‘lift-and-shift’ approach. To start with, you’ll need a cloud provider that provides a data store and computing facility. But you’ll be able to grow your business with a provider that has purpose-built resources and a platform that supports innovation. Companies are using higher-level services such as fully managed databases and serverless compute offered by Amazon Web Services (AWS) to get the most out of their cloud adoption journey.

In this blog, we will focus on the database tier of enterprise applications. We will evaluate the reasons for moving to managed or purpose-built databases. Then we’ll discuss in more detail the options you have and how to implement the right database migration plan for your workloads.

Why move to a managed database?

Managed databases, or purpose-built databases, are managed services by AWS that free you to focus on other business and technical priorities. Different types of databases fill certain needs, and so the database tier in an application should never be a one-size-fits-all approach. Based on the kind of application you are developing, you should research managed options that suit your business and enterprise use cases.

| Database Type | Use Cases | AWS Service |
|---|---|---|
| Relational | Traditional applications, enterprise resource planning (ERP), customer relationship management (CRM), ecommerce | Amazon Aurora, Amazon RDS, Amazon Redshift |
| Key-Value | High-traffic web applications, ecommerce systems, gaming applications | Amazon DynamoDB |
| In-Memory | Caching, session management, gaming leaderboards, geospatial applications | Amazon ElastiCache, Amazon MemoryDB for Redis |
| Document | Content management, catalogs, user profiles | Amazon DocumentDB (with MongoDB compatibility) |
| Wide Column | High-scale industrial apps for equipment maintenance, fleet management, and route optimization | Amazon Keyspaces |
| Graph | Fraud detection, social networking, recommendation engines | Amazon Neptune |
| Time Series | Internet of Things (IoT) applications, DevOps, industrial telemetry | Amazon Timestream |
| Ledger | Systems of record, supply chain, registrations, banking transactions | Amazon QLDB |

Table 1. Managed databases by AWS

Managed database features

Manageability. The top priority and most valuable asset that you own as a company is your data. While data remains your key asset, database management is not the optimal use of your team's time. Managed services have built-in, reliable tooling for multiple aspects of database management, which can help your database operate at the highest level.

Availability and disaster recovery. Most managed databases at AWS are highly available by default. For example, Amazon Aurora is designed to offer 99.99% availability, replicating six copies of your data across three Availability Zones (Figure 1). It backs up your data continuously to Amazon S3. It recovers transparently from physical storage failures; instance failover typically takes less than 30 seconds.

Figure 1. Replication across three Availability Zones with Amazon Aurora DB cluster

With managed databases, you get multiple options to create a highly available and fault tolerant database. Alternatively, if you choose to self-host a database elsewhere you will have to formulate your own disaster recovery (DR) strategy. This takes meticulous DR planning and relies heavily on a constant monitoring solution.

Elasticity and agility. The cloud offers elasticity for scaling your database. You can scale in minutes, spinning additional servers up and down and adjusting storage as needed. This gives you flexibility in capacity planning, and you can always reassess your database tier to see whether it is over- or under-provisioned.

Self-managed databases on AWS

If I do not need a managed database, should I still use Amazon EC2 to host my database?

Here are some cases when you may find it beneficial to host your database on Amazon EC2 instances:

  • You need full control over the database and access to its underlying operating system, database installation, and configuration.
  • You are ready to plan for the high availability of your database, and prepare your disaster recovery plan.
  • You want to administer your database, including backups and recovery. You must perform tasks such as patching the operating system and the database, tuning the operating system and database parameters, managing security, and configuring high availability or replication.
  • You want to use features that aren’t currently supported by managed services.
  • You need a specific database version that isn’t supported by AWS managed database service.
  • Your database size and performance needs exceed the limits of the managed service.
  • You want to avoid automatic software patches that might not be compliant with your applications.
  • You want to achieve higher IOPS and storage capacity than the current limits of the managed services.

Can I customize my underlying OS and database environment?

Amazon RDS Custom is a managed database service for applications that require customization of the underlying operating system and database environment. With Amazon RDS Custom, you have access to the underlying EC2 instance that hosts the DB engine. You can access the EC2 instance via secure shell (AWS Systems Manager and SSH) and perform the customizations to suit your application needs.

Choosing which migration plan to implement

The first step in your cloud database migration journey is identifying which database that you want to migrate to. For relational databases, determine the migration strategy. In the majority of the database migrations, you can choose to rehost, replatform, or refactor.

Refer to the AWS Prescriptive Guidance for choosing a migration strategy for relational databases.

The next is determining which database migration plan best serves your needs. AWS provides a number of options to help correctly handle your data migration with minimal downtime. Here are the database migration plans that you can consider using for your cloud database adoption journey:

1. AWS Database Migration Service (AWS DMS): AWS DMS is a self-service option for migrating databases. You can use AWS DMS to migrate between homogeneous database types, such as going from one MySQL instance to a new one. You can also use AWS DMS between heterogeneous database types, such as moving from a commercial database like Oracle to a cloud-native relational database like Aurora. Read tutorials about migrating sample data: AWS Database Migration Service Step-by-Step Walkthroughs.

AWS DMS offers minimal downtime, supports widely used databases, supports on-going replication, is offered as a low-cost service, and is highly reliable. If you are looking for an end-to-end service for database migration, consider AWS DMS.

2. AWS Database Migration Service (AWS DMS) with AWS Schema Conversion Tool (AWS SCT): If you are migrating between heterogeneous databases, use the AWS Schema Conversion Tool (AWS SCT). It converts the source database schema and the database code objects (like views, stored procedures, and functions) to a format compatible with the target database (Figure 2).

Figure 2. Supported conversions with AWS Schema Conversion Tool

For heterogeneous migrations, consider using AWS DMS with AWS SCT.

3. Database Freedom Program: If you are new to cloud computing or if your database migration plan must be evaluated and validated by industry experts, then try the Database Freedom Program. Database Freedom is a unique program designed to assist customers in migrating to AWS databases and analytics services. The program provides technical advice, migration support, and financial assistance.

You can use the Contact Us page for the Database Freedom program to get in touch with experts that can assist you in your cloud adoption journey.

4. AWS Professional Services and Partners: You may have in-house database expertise, but need end-to-end implementation assistance across different application tiers. Get help from the Professional Services of AWS or connect with the certified network of AWS Partners. AWS Database Migration Service Delivery Partners help customers use AWS DMS to migrate databases to AWS securely, while minimizing application downtime and following best practices.

Conclusion

Migrating to the cloud is a journey that is ever-evolving. To remain focused on your innovations, leverage the managed services of AWS for your migration journey.

I hope this post helps you consider using a managed database service when available, and effectively evaluate and choose the right database migration plan for your enterprise. For any queries, feel free to get in touch with us. We will be happy to help you in your cloud journey.

Happy migrating!

What to consider when migrating data warehouse to Amazon Redshift

Post Syndicated from Lewis Tang original https://aws.amazon.com/blogs/big-data/what-to-consider-when-migrating-data-warehouse-to-amazon-redshift/

Customers are migrating data warehouses to Amazon Redshift because it’s fast, scalable, and cost-effective. However, data warehouse migration projects can be complex and challenging. In this post, I help you understand the common drivers of data warehouse migration, migration strategies, and what tools and services are available to assist with your migration project.

Let’s first discuss the big data landscape, the meaning of a modern data architecture, and what you need to consider for your data warehouse migration project when building a modern data architecture.

Business opportunities

Data is changing the way we work, live, and play. All of this behavior change and the movement to the cloud has resulted in a data explosion over the past 20 years. The proliferation of the Internet of Things and smartphones has accelerated the amount of data that is generated every day. Business models have shifted, and so have the needs of the people running these businesses. We have moved from talking about terabytes of data just a few years ago to now petabytes and exabytes of data. By putting data to work efficiently and building deep business insights from the data collected, businesses in different industries and of various sizes can achieve a wide range of business outcomes. These can be broadly categorized into the following core business outcomes:

  • Improving operational efficiency – By making sense of the data collected from various operational processes, businesses can improve customer experience, increase production efficiency, and increase sales and marketing agility
  • Making more informed decisions – Through developing more meaningful insights by bringing together full picture of data across an organization, businesses can make more informed decisions
  • Accelerating innovation – Combining internal and external data sources enable a variety of AI and machine learning (ML) use cases that help businesses automate processes and unlock business opportunities that were either impossible to do or too difficult to do before

Business challenges

Exponential data growth has also presented business challenges.

First of all, businesses need to access all data across the organization, and data may be distributed in silos. It comes from a variety of sources, in a wide range of data types and in large volume and velocity. Some data may be stored as structured data in relational databases. Other data may be stored as semi-structured data in object stores, such as media files and the clickstream data that is constantly streaming from mobile devices.

Secondly, to build insights from data, businesses need to dive deep into the data by conducting analytics. These analytics activities generally involve dozens or even hundreds of data analysts who need to access the system simultaneously. Having a performant system that is scalable to meet the query demand is often a challenge. It gets more complex when businesses need to share the analyzed data with their customers.

Last but not least, businesses need a cost-effective solution to address data silos, performance, scalability, security, and compliance challenges. Being able to visualize and predict cost is necessary for a business to measure the cost-effectiveness of its solution.

To solve these challenges, businesses need a future-proof modern data architecture and a robust, efficient analytics system.

Modern data architecture

A modern data architecture enables organizations to store any amount of data in open formats, break down disconnected data silos, empower users to run analytics or ML using their preferred tool or technique, and manage who has access to specific pieces of data with the proper security and data governance controls.

The AWS data lake architecture is a modern data architecture that enables you to store data in a data lake and use a ring of purpose-built data services around the lake, as shown in the following figure. This allows you to make decisions with speed and agility, at scale, and cost-effectively. For more details, refer to Modern Data Architecture on AWS.

Modern data warehouse

Amazon Redshift is a fully managed, scalable, modern data warehouse that accelerates time to insights with fast, easy, and secure analytics at scale. With Amazon Redshift, you can analyze all your data and get performance at any scale with low and predictable costs.

Amazon Redshift offers the following benefits:

  • Analyze all your data – With Amazon Redshift, you can easily analyze all your data across your data warehouse and data lake with consistent security and governance policies. We call this the modern data architecture. With Amazon Redshift Spectrum, you can query data in your data lake with no need for loading or other data preparation. And with data lake export, you can save the results of an Amazon Redshift query back into the lake. This means you can take advantage of real-time analytics and ML/AI use cases without re-architecture, because Amazon Redshift is fully integrated with your data lake. With new capabilities like data sharing, you can easily share data across Amazon Redshift clusters both internally and externally, so everyone has a live and consistent view of the data (see the data sharing sketch after this list). Amazon Redshift ML makes it easy to do more with your data: you can create, train, and deploy ML models using familiar SQL commands directly in Amazon Redshift data warehouses.
  • Fast performance at any scale – Amazon Redshift is a self-tuning and self-learning system that allows you to get the best performance for your workloads without the undifferentiated heavy lifting of tuning your data warehouse with tasks such as defining sort keys and distribution keys, and new capabilities like materialized views, auto-refresh, and auto-query rewrite. Amazon Redshift scales to deliver consistently fast results from gigabytes to petabytes of data, and from a few users to thousands. As your user base scales to thousands of concurrent users, the concurrency scaling capability automatically deploys the necessary compute resources to manage the additional load. Amazon Redshift RA3 instances with managed storage separate compute and storage, so you can scale each independently and only pay for the storage you need. AQUA (Advanced Query Accelerator) for Amazon Redshift is a new distributed and hardware-accelerated cache that automatically boosts certain types of queries.
  • Easy analytics for everyone – Amazon Redshift is a fully managed data warehouse that abstracts away the burden of detailed infrastructure management or performance optimization. You can focus on getting to insights, rather than performing maintenance tasks like provisioning infrastructure, creating backups, setting up the layout of data, and other tasks. You can operate data in open formats, use familiar SQL commands, and take advantage of query visualizations available through the new Query Editor v2. You can also access data from any application through a secure data API without configuring software drivers or managing database connections. Amazon Redshift is compatible with business intelligence (BI) tools, opening up the power and integration of Amazon Redshift to business users who operate from within the BI tool.
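
As a rough illustration of the data sharing capability mentioned above, the following sketch shares a schema from a producer cluster and queries it from a consumer cluster; all object names and namespace IDs are placeholders, not values from this post.

-- On the producer cluster: create a datashare and add a schema and its tables.
create datashare sales_share;
alter datashare sales_share add schema sales;
alter datashare sales_share add all tables in schema sales;
grant usage on datashare sales_share to namespace '<consumer-namespace-guid>';

-- On the consumer cluster: create a database from the datashare and query it.
create database sales_db from datashare sales_share of namespace '<producer-namespace-guid>';
select count(*) from sales_db.sales.orders;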

A modern data architecture with a data lake architecture and modern data warehouse with Amazon Redshift helps businesses in all different sizes address big data challenges, make sense of a large amount of data, and drive business outcomes. You can start the journey of building a modern data architecture by migrating your data warehouse to Amazon Redshift.

Migration considerations

Data warehouse migration presents a challenge in terms of project complexity and poses a risk in terms of resources, time, and cost. To reduce the complexity of data warehouse migration, it’s essential to choose a right migration strategy based on your existing data warehouse landscape and the amount of transformation required to migrate to Amazon Redshift. The following are the key factors that can influence your migration strategy decision:

  • Size – The total size of the source data warehouse to be migrated is determined by the objects, tables, and databases that are included in the migration. A good understanding of the data sources and data domains required for moving to Amazon Redshift leads to an optimal sizing of the migration project.
  • Data transfer – Data warehouse migration involves data transfer between the source data warehouse servers and AWS. You can either transfer data over a network interconnection between the source location and AWS such as AWS Direct Connect or transfer data offline via the tools or services such as the AWS Snow Family.
  • Data change rate – How often do data updates or changes occur in your data warehouse? Your existing data warehouse data change rate determines the update intervals required to keep the source data warehouse and the target Amazon Redshift in sync. A source data warehouse with a high data change rate requires the service switching from the source to Amazon Redshift to complete within an update interval, which leads to a shorter migration cutover window.
  • Data transformation – Moving your existing data warehouse to Amazon Redshift is a heterogenous migration involving data transformation such as data mapping and schema change. The complexity of data transformation determines the processing time required for an iteration of migration.
  • Migration and ETL tools – The selection of migration and extract, transform, and load (ETL) tools can impact the migration project. For example, the efforts required for deployment and setup of these tools can vary. We look closer at AWS tools and services shortly.

After you have factored in all these considerations, you can pick a migration strategy option for your Amazon Redshift migration project.

Migration strategies

You can choose from three migration strategies: one-step migration, two-step migration, or wave-based migration.

One-step migration is a good option for databases that don't require continuous operation, such as continuous replication to keep ongoing data changes in sync between the source and destination. You can extract existing databases as comma-separated values (CSV) files or a columnar format like Parquet, then use AWS Snow Family services such as AWS Snowball to deliver the datasets to Amazon Simple Storage Service (Amazon S3) for loading into Amazon Redshift. You then test the destination Amazon Redshift database for data consistency with the source. After all validations have passed, the database is switched over to AWS.
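
If the extracted files land in Amazon S3 as CSV or Parquet, the final load into Amazon Redshift is typically a COPY command. The following is a minimal sketch; the bucket, prefix, table, and IAM role names are placeholders rather than values from this post:

-- Load gzip-compressed CSV extracts from Amazon S3 into an existing Amazon Redshift table.
-- Bucket, prefix, table, and IAM role are illustrative placeholders.
COPY testschema.sales_fact
FROM 's3://my-migration-bucket/extracts/sales_fact/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV
IGNOREHEADER 1
GZIP;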

Two-step migration is commonly used for databases of any size that require continuous operation. During the migration, the source databases continue to receive data changes, and continuous replication keeps those changes in sync between the source and Amazon Redshift. The breakdown of the two-step migration strategy is as follows:

  • Initial data migration – The data is extracted from the source database, preferably during non-peak usage to minimize the impact. The data is then migrated to Amazon Redshift by following the one-step migration approach described previously.
  • Changed data migration – Data that changed in the source database after the initial data migration is propagated to the destination before switchover. This step synchronizes the source and destination databases. After all the changed data is migrated, you can validate the data in the destination database (see the example validation query following this list) and perform the necessary tests. If all tests pass, you then switch over to the Amazon Redshift data warehouse.
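
A simple consistency check you can run on both the source data warehouse and Amazon Redshift during validation is to compare row counts and a recent-change watermark per table. This is only a sketch; the table and column names are illustrative:

-- Run the same query against the source data warehouse and Amazon Redshift,
-- then compare the results (table and column names are placeholders).
SELECT 'sales_fact'    AS table_name
     , COUNT(*)        AS row_count
     , MAX(updated_at) AS latest_change
FROM testschema.sales_fact;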

Wave-based migration is suitable for large-scale data warehouse migration projects. The principle of wave-based migration is to divide a complex migration project into multiple logical and systematic waves, which can significantly reduce complexity and risk. You start with a workload that covers a good number of data sources and subject areas with medium complexity, then add more data sources and subject areas in each subsequent wave. With this strategy, you run both the source data warehouse and the Amazon Redshift production environment in parallel for a certain amount of time before you can fully retire the source data warehouse. See Develop an application migration methodology to modernize your data warehouse with Amazon Redshift for details on how to identify and group data sources and analytics applications to migrate from the source data warehouse to Amazon Redshift using the wave-based migration approach.

To guide your migration strategy decision, refer to the following table to map the consideration factors with a preferred migration strategy.

Consideration | One-Step Migration | Two-Step Migration | Wave-Based Migration
The number of subject areas in migration scope | Small | Medium to Large | Medium to Large
Data transfer volume | Small to Large | Small to Large | Small to Large
Data change rate during migration | None | Minimal to Frequent | Minimal to Frequent
Data transformation complexity | Any | Any | Any
Migration change window for switching from source to target | Hours | Seconds | Seconds
Migration project duration | Weeks | Weeks to Months | Months

Migration process

In this section, we review the three high-level steps of the migration process. The two-step migration strategy and wave-based migration strategy involve all three migration steps, although the wave-based strategy repeats them over a number of iterations. Because one-step migration is only a good fit for databases that don't require continuous operation, it involves only Steps 1 and 2 of the migration process.

Step 1: Convert schema and subject area

In this step, you make the source data warehouse schema compatible with the Amazon Redshift schema by converting it using schema conversion tools such as AWS Schema Conversion Tool (AWS SCT) or other tools from AWS Partners. In some situations, you may also need custom code to conduct complex schema conversions. We dive deeper into AWS SCT and migration best practices in a later section.

Step 2: Initial data extraction and load

In this step, you complete the initial data extraction and load the source data into Amazon Redshift for the first time. You can use the AWS SCT data extractors to extract data from the source data warehouse and load it to Amazon S3 if your data size and data transfer requirements allow you to transfer the data over the network. Alternatively, if there are limitations such as network capacity, you can load the data to an AWS Snowball device, from which it is loaded to Amazon S3. When the data from the source data warehouse is available in Amazon S3, it's loaded into Amazon Redshift. In situations where the source data warehouse's native tools do a better job of unloading and loading data than the AWS SCT data extractors, you may choose to use those native tools for this step.

Step 3: Delta and incremental load

In this step, you use AWS SCT, and sometimes the source data warehouse's native tools, to capture and load delta or incremental changes from the source to Amazon Redshift. This is often referred to as change data capture (CDC). CDC is a process that captures changes made in a database and ensures that those changes are replicated to a destination such as a data warehouse.
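
How the captured changes are applied depends on your tooling, but on the Amazon Redshift side a delta load is commonly applied from a staging table using a delete-and-insert (upsert) pattern. The following is a minimal sketch that assumes a hypothetical staging table populated by your CDC process; the table and key names are placeholders:

-- Apply captured changes from a staging table to the target table.
-- Table and key names are illustrative; your CDC tooling loads the staging table.
BEGIN;

DELETE FROM testschema.sales_fact
USING testschema.sales_fact_staging s
WHERE testschema.sales_fact.sale_id = s.sale_id;

INSERT INTO testschema.sales_fact
SELECT * FROM testschema.sales_fact_staging;

COMMIT;

-- Clear the staging table for the next batch (TRUNCATE commits implicitly in Amazon Redshift).
TRUNCATE testschema.sales_fact_staging;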

You should now have enough information to start developing a migration plan for your data warehouse. In the following sections, we dive deeper into the AWS services that can help you migrate your data warehouse to Amazon Redshift, and the best practices for using these services to accelerate the successful delivery of your migration project.

Data warehouse migration services

Data warehouse migration involves a set of services and tools to support the migration process. You begin by creating a database migration assessment report and then converting the source data schema to be compatible with Amazon Redshift using AWS SCT. To move data, you can use the AWS SCT data extraction tool, which integrates with AWS Database Migration Service (AWS DMS) to create and manage AWS DMS tasks and orchestrate data migration.

To transfer source data over the interconnected network between the source and AWS, you can use AWS Storage Gateway, Amazon Kinesis Data Firehose, Direct Connect, AWS Transfer Family services, Amazon S3 Transfer Acceleration, and AWS DataSync. For data warehouse migration involving a large volume of data, or if there are constraints with the interconnected network capacity, you can transfer data using the AWS Snow Family of services. With this approach, you can copy the data to the device, send it back to AWS, and have the data copied to Amazon Redshift via Amazon S3.

AWS SCT is an essential service to accelerate your data warehouse migration to Amazon Redshift. Let’s dive deeper into it.

Migrating using AWS SCT

AWS SCT automates much of the process of converting your data warehouse schema to an Amazon Redshift database schema. Because the source and target database engines can have many different features and capabilities, AWS SCT attempts to create an equivalent schema in your target database wherever possible. If no direct conversion is possible, AWS SCT creates a database migration assessment report to help you convert your schema. The database migration assessment report provides important information about the conversion of the schema from your source database to your target database. The report summarizes all the schema conversion tasks and details the action items for schema objects that can't be converted to the DB engine of your target database. The report also includes estimates of the effort required to write the equivalent code in your target database for the objects that can't be converted automatically.

Storage optimization is the heart of a data warehouse conversion. When using your Amazon Redshift database as a source and a test Amazon Redshift database as the target, AWS SCT recommends sort keys and distribution keys to optimize your database.

With AWS SCT, you can convert the following data warehouse schemas to Amazon Redshift:

  • Amazon Redshift
  • Azure Synapse Analytics (version 10)
  • Greenplum Database (version 4.3 and later)
  • Microsoft SQL Server (version 2008 and later)
  • Netezza (version 7.0.3 and later)
  • Oracle (version 10.2 and later)
  • Snowflake (version 3)
  • Teradata (version 13 and later)
  • Vertica (version 7.2 and later)

At AWS, we continue to release new features and enhancements to improve our product. For the latest supported conversions, visit the AWS SCT User Guide.

Migrating data using AWS SCT data extraction tool

You can use the AWS SCT data extraction tool to extract data from your on-premises data warehouse and migrate it to Amazon Redshift. The agent extracts your data and uploads it to either Amazon S3 or, for large-scale migrations, an AWS Snowball Edge device. You can then use AWS SCT to copy the data to Amazon Redshift. Amazon S3 is an object storage and retrieval service. To store an object in Amazon S3, you upload the file you want to store to an S3 bucket. When you upload a file, you can set permissions on the object and on its metadata.

In large-scale migrations that involve uploading data to an AWS Snowball Edge device, you can use wizard-based workflows in AWS SCT to automate the process, in which the data extraction tool orchestrates AWS DMS to perform the actual migration.

Considerations for Amazon Redshift migration tools

To improve and accelerate your data warehouse migration to Amazon Redshift, consider the following tips and best practices. This list is not exhaustive. Make sure you have a good understanding of your data warehouse profile and determine which best practices you can apply to your migration project.

  • Use AWS SCT to create a migration assessment report and scope migration effort.
  • Automate migration with AWS SCT where possible. The experience from our customers shows that AWS SCT can automatically create the majority of DDL and SQL scripts.
  • When automated schema conversion is not possible, use custom scripting for the code conversion.
  • Install AWS SCT data extractor agents as close as possible to the data source to improve data migration performance and reliability.
  • To improve data migration performance, properly size the Amazon Elastic Compute Cloud (Amazon EC2) instances or equivalent virtual machines that the data extractor agents are installed on.
  • Configure multiple data extractor agents to run multiple tasks in parallel to improve data migration performance by maximizing the usage of the allocated network bandwidth.
  • Adjust AWS SCT memory configuration to improve schema conversion performance.
  • Use Amazon S3 to store the large objects such as images, PDFs, and other binary data from your existing data warehouse.
  • To migrate large tables, use virtual partitioning and create sub-tasks to improve data migration performance.
  • Understand the use cases of AWS services such as Direct Connect, the AWS Transfer Family, and the AWS Snow Family. Select the right service or tool to meet your data migration requirements.
  • Understand AWS service quotas and make informed migration design decisions.

Summary

Data is growing in volume and complexity faster than ever. However, only a fraction of this invaluable asset is available for analysis. Traditional on-premises data warehouses have rigid architectures that don’t scale for modern big data analytics use cases. These traditional data warehouses are expensive to set up and operate, and require large upfront investments in both software and hardware.

In this post, we discussed Amazon Redshift as a fully managed, scalable, modern data warehouse that can help you analyze all your data and achieve performance at any scale with low and predictable cost. To migrate your data warehouse to Amazon Redshift, you need to consider a range of factors, such as the total size of the data warehouse, data change rate, and data transformation complexity, before picking a suitable migration strategy and process to reduce the complexity and cost of your migration project. With AWS services such as AWS SCT and AWS DMS, and by adopting the tips and best practices for these services, you can automate migration tasks, scale the migration, accelerate the delivery of your data warehouse migration project, and delight your customers.


About the Author

Lewis Tang is a Senior Solutions Architect at Amazon Web Services based in Sydney, Australia. Lewis provides partners with guidance across a broad range of AWS services and helps partners accelerate their AWS practice growth.

Migrating to an Amazon Redshift Cloud Data Warehouse from Microsoft APS

Post Syndicated from Sudarshan Roy original https://aws.amazon.com/blogs/architecture/migrating-to-an-amazon-redshift-cloud-data-warehouse-from-microsoft-aps/

Before cloud data warehouses (CDWs), many organizations used hyper-converged infrastructure (HCI) for data analytics. HCIs pack storage, compute, networking, and management capabilities into a single “box” that you can plug into your data centers. However, because of its legacy architecture, an HCI is limited in how far it can scale storage and compute while continuing to perform well and remain cost-effective. Using an HCI can impact your business’s agility because you need to plan in advance, follow traditional purchase models, and maintain unused capacity and its associated costs. Additionally, HCIs are often proprietary and do not offer the same portability, customization, and integration options as open-standards-based systems. Because of their proprietary nature, migrating HCIs to a CDW can present technical hurdles, which can impact your ability to realize the full potential of your data.

One of these hurdles involves AWS Schema Conversion Tool (AWS SCT). AWS SCT is used to migrate data warehouses, and it supports several conversions. However, migrating Microsoft’s Analytics Platform System (APS) SQL Server Parallel Data Warehouse (PDW) platform using only AWS SCT results in connection errors because Microsoft APS lacks server-side cursor support. In this blog post, we show you three approaches that pair AWS SCT with other AWS services to migrate the Microsoft APS SQL Server PDW HCI platform to Amazon Redshift. These solutions help you overcome the elasticity, scalability, and agility constraints associated with proprietary HCI analytics platforms and future-proof your analytics investment.

AWS Schema Conversion Tool

Though using AWS SCT on its own results in server-side cursor errors, you can pair it with other AWS services to migrate your data warehouse to AWS. AWS SCT converts source database schema and code objects, including views, stored procedures, and functions, to be compatible with a target database. It highlights objects that require manual intervention. You can also scan your application source code for embedded SQL statements as part of the database schema conversion project. During this process, AWS SCT optimizes cloud-native code by converting legacy Oracle and SQL Server functions to their AWS equivalents, which helps you modernize applications at the same time. Once conversion is complete, AWS SCT can also migrate data.

Figure 1 shows a standard AWS SCT implementation architecture.


Figure 1. AWS SCT migration approach

The next section shows you how to pair AWS SCT with other AWS services to migrate a Microsoft APS PDW to an Amazon Redshift CDW. We provide a base approach and two extensions to use for data warehouses with larger datasets and longer release outage windows.

Migration approach using SQL Server on Amazon EC2

The base approach uses Amazon Elastic Compute Cloud (Amazon EC2) to host a SQL Server in a symmetric multi-processing (SMP) architecture, which is supported by AWS SCT, as opposed to Microsoft APS PDW’s massively parallel processing (MPP) architecture. By changing the warehouse’s architecture from MPP to SMP and using AWS SCT, you avoid the server-side cursor support errors.

Here’s how you’ll set up the base approach (Figure 2):

  1. Set up the SMP SQL Server on Amazon EC2 and AWS SCT in your AWS account.
  2. Set up Microsoft tools, including SQL Server Data Tools (SSDT), remote table copy, and SQL Server Integration Services (SSIS).
  3. Use the Application Diagnostic Utility (ADU) and SSDT to connect and extract table lists, indexes, table definitions, view definitions, and stored procedures.
  4. Generate data definition language (DDL) scripts using the outputs from Step 3.
  5. Apply these DDL scripts to the SMP SQL Server on Amazon EC2.
  6. Run AWS SCT against the SMP SQL database to begin migrating schema and data to Amazon Redshift.
  7. Extract data using remote table copy from source, which copies data into the SMP SQL Server.
  8. Load this data into Amazon Redshift using AWS SCT or AWS Database Migration Service (AWS DMS).
  9. Use SSIS to load delta data from source to the SMP SQL Server on Amazon EC2.

Figure 2. Base approach using SMP SQL Server on Amazon EC2

Extending the base approach

The base approach overcomes server-side issues you would have during a direct migration. However, many organizations host terabytes (TB) of data. To migrate such a large dataset, you’ll need to adjust your approach.

The following sections extend the base approach. They still use the base approach to convert the schema and procedures, but the dataset is handled via separate processes.

Extension 1: AWS Snowball Edge

Note: AWS Snowball Edge is a Region-specific service. Verify that the service is available in your Region before planning your migration. See Regional Table to verify availability.

Snowball Edge lets you transfer large datasets to the cloud at faster-than-network speeds. Each Snowball Edge device can hold up to 100 TB and uses 256-bit encryption and an industry-standard Trusted Platform Module to ensure security and full chain-of-custody for your data. Furthermore, higher volumes can be transferred by clustering 5–10 devices for increased durability and storage.

Extension 1 enhances the base approach to allow you to transfer large datasets (Figure 3) while simultaneously setting up an SMP SQL Server on Amazon EC2 for delta transfers. Here’s how you’ll set it up:

  1. Once Snowball Edge is enabled in the on-premises environment, it allows data transfer via network file system (NFS) endpoints. The device can then be used with standard Microsoft tools like SSIS, remote table copy, ADU, and SSDT.
  2. While the device is being shipped back to an AWS facility, you’ll set up an SMP SQL Server database on Amazon EC2 to replicate the base approach.
  3. After your data is converted, you’ll apply a converted schema to Amazon Redshift.
  4. Once the Snowball Edge arrives at the AWS facility, data is transferred to the SMP SQL Server database.
  5. You’ll subsequently run schema conversions and initial and delta loads per the base approach.

Figure 3. Solution extension that uses Snowball Edge for large datasets

Note: Where sequence numbers overlap in the diagram, those steps can potentially be run in parallel.

Extension 1 transfers initial load and later applies delta load. This adds time to the project because of longer cutover release schedules. Additionally, you’ll need to plan for multiple separate outages, Snowball lead times, and release management timelines.

Note that not all analytics systems are classified as business-critical systems, so they can withstand a longer outage, typically 1-2 days. This gives you an opportunity to use AWS DataSync as an additional extension to complete initial and delta load in a single release window.

Extension 2: AWS DataSync

DataSync speeds up data transfer between on-premises environments and AWS. It uses a purpose-built network protocol and a parallel, multi-threaded architecture to accelerate your transfers.

Figure 4 shows the solution extension, which works as follows:

  1. Create the SMP SQL Server on Amazon EC2 and the DDL, as shown in the base approach.
  2. Deploy DataSync agent(s) in your on-premises environment.
  3. Provision and mount an NFS volume on the source analytics platform and the DataSync agent(s).
  4. Define a DataSync transfer task after the agents are registered.
  5. Extract the initial load from the source onto the NFS mount; DataSync uploads it to Amazon Simple Storage Service (Amazon S3).
  6. Load the data extracts into the SMP SQL Server on the Amazon EC2 instance (created in the base approach).
  7. Run delta loads per the base approach, or continue using this solution extension for delta loads.

Figure 4. Solution extension that uses DataSync for large datasets

Note: Where sequence numbers overlap in the diagram, those steps can potentially be run in parallel.

Transfer rates for DataSync depend on the amount of data, I/O, and available network bandwidth. A single DataSync agent can fully utilize a 10 gigabit per second (Gbps) AWS Direct Connect link to copy data from on premises to AWS; at that rate (roughly 4.5 TB per hour), a 100 TB initial load would take on the order of 22–23 hours before accounting for I/O or protocol overhead. Depending on the size of the initial load, calculate your transfer windows before finalizing the release schedule.

Conclusion

The approach and extensions described in this blog post provide mechanisms to migrate your Microsoft APS workloads to an Amazon Redshift CDW. They enable elasticity, scalability, and agility for your workload and future-proof your analytics investment.


Accelerate your data warehouse migration to Amazon Redshift – Part 2

Post Syndicated from Michael Soo original https://aws.amazon.com/blogs/big-data/part-2-accelerate-your-data-warehouse-migration-to-amazon-redshift/

This is the second post in a multi-part series. We’re excited to share dozens of new features to automate your schema conversion; preserve your investment in existing scripts, reports, and applications; accelerate query performance; and potentially reduce your overall cost to migrate to Amazon Redshift. Check out the first post, Accelerate your data warehouse migration to Amazon Redshift – Part 1, to learn more about automated macro conversion, case-insensitive string comparison, case-sensitive identifiers, and other exciting new features.

Amazon Redshift is the leading cloud data warehouse. No other data warehouse makes it as easy to gain new insights from your data. With Amazon Redshift, you can query exabytes of data across your data warehouse, operational data stores, and data lake using standard SQL. You can also integrate other services like Amazon EMR, Amazon Athena, and Amazon SageMaker to use all the analytic capabilities in the AWS Cloud.

Many customers have asked for help migrating their self-managed data warehouse engines to Amazon Redshift. In these cases, you may have terabytes (or petabytes) of historical data, a heavy reliance on proprietary features, and thousands of extract, transform, and load (ETL) processes and reports built over years (or decades) of use.

Until now, migrating a data warehouse to AWS was complex and involved a significant amount of manual effort.

Today, we’re happy to share additional enhancements to the AWS Schema Conversion Tool (AWS SCT) to automate your migrations to Amazon Redshift. These enhancements reduce the recoding needed for your data tables, and more importantly, the manual work needed for views, stored procedures, scripts, and other application code that use those tables.

In this post, we introduce automation for INTERVAL and PERIOD data types, automatic type casting, binary data support, and some other enhancements that have been requested by customers. We show you how to use AWS SCT to convert objects from a Teradata data warehouse and provide links to relevant documentation so you can continue exploring these new capabilities.

INTERVAL data types

An INTERVAL is an unanchored duration of time, like “1 year” or “2 hours,” that doesn’t have a specific start or end time. In Teradata, INTERVAL data is implemented as 13 distinct data types depending on the granularity of time being represented. The following table summarizes these types.

Year intervals   | INTERVAL YEAR, INTERVAL YEAR TO MONTH
Month intervals  | INTERVAL MONTH
Day intervals    | INTERVAL DAY, INTERVAL DAY TO HOUR, INTERVAL DAY TO MINUTE, INTERVAL DAY TO SECOND
Hour intervals   | INTERVAL HOUR, INTERVAL HOUR TO MINUTE, INTERVAL HOUR TO SECOND
Minute intervals | INTERVAL MINUTE, INTERVAL MINUTE TO SECOND
Second intervals | INTERVAL SECOND

Amazon Redshift doesn’t support INTERVAL data types natively. Previously, if you used INTERVAL types in your data warehouse, you had to develop custom code as part of the database conversion process.

Now, AWS SCT automatically converts INTERVAL data types for you. AWS SCT converts an INTERVAL column into a CHARACTER VARYING column in Amazon Redshift. Then AWS SCT converts your application code that uses the column to emulate the INTERVAL semantics.

For example, consider the following Teradata table, which has a MONTH interval column. This table stores different types of leave of absence and the allowable duration for each.

CREATE TABLE testschema.loa_durations (
  loa_type_id INTEGER
, loa_name VARCHAR(100) CHARACTER SET LATIN
, loa_duration INTERVAL MONTH(2))
PRIMARY INDEX (loa_type_id);

AWS SCT converts the table to Amazon Redshift as follows. Because Amazon Redshift doesn’t have a native INTERVAL data type, AWS SCT replaces it with a VARCHAR data type.

CREATE TABLE testschema.loa_durations(
  loa_type_id INTEGER
, loa_name VARCHAR(100)
, loa_duration VARCHAR(64)
)
DISTSTYLE KEY
DISTKEY
(
loa_type_id
)
SORTKEY
(
loa_type_id
);

Now, let’s suppose your application code uses the loa_duration column, like the following Teradata view. Here, the INTERVAL MONTH field is added to the current date to compute when a leave of absence ends if it starts today.

REPLACE VIEW testschema.loa_projected_end_date AS
SELECT
  loa_type_id loa_type_id
, loa_name loa_name
, loa_duration
, current_date AS today
, current_date + loa_duration AS end_date
FROM
testschema.loa_durations
;

Because the data is stored as CHARACTER VARYING, AWS SCT injects the proper type CAST into the Amazon Redshift code to interpret the string values as a MONTH interval. It then converts the arithmetic using Amazon Redshift date functions.

CREATE OR REPLACE VIEW testschema.loa_projected_end_date (loa_type_id, loa_name, loa_duration, today, end_date) AS
SELECT
  loa_type_id AS loa_type_id
, loa_name AS loa_name
, loa_duration
, CURRENT_DATE AS today
, dateadd(MONTH, CAST (loa_duration AS INTEGER),CURRENT_DATE)::DATE AS end_date
FROM testschema.loa_durations
;

Also, as a bonus, AWS SCT automatically converts any literal INTERVAL values that you might be using in your code.

For example, consider the following Teradata table. The table contains a DATE column, which records the last date when an employee was promoted.

CREATE TABLE TESTSCHEMA.employees (
  id INTEGER
, name VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC
, manager_id INTEGER
, last_promo_date DATE FORMAT 'YY/MM/DD'
)
UNIQUE PRIMARY INDEX ( id );

Now, suppose the database contains a view that computes the next date an employee is eligible for a promotion. We implement a business rule that employees who have never been promoted are eligible for promotion in 1.5 years. All other employees become eligible 2.5 years after their last promotion. See the following code:

REPLACE VIEW testschema.eligible_for_promo AS
SELECT 
  id
, name
, last_promo_date
, CASE WHEN last_promo_date is NULL THEN current_date + INTERVAL '18' MONTH
       ELSE last_promo_date + INTERVAL '2-06' YEAR TO MONTH
  END eligible_date
FROM employees
;

AWS SCT converts the INTERVAL values used in the CASE statement and translates the date expressions accordingly:

CREATE OR REPLACE VIEW testschema.eligible_for_promo (id, name, last_promo_date, eligible_date) AS
SELECT
  id
, name
, last_promo_date
, CASE
    WHEN last_promo_date IS NULL THEN dateadd(MONTH, 18, CURRENT_DATE)::DATE
    ELSE dateadd(MONTH, 30, last_promo_date)::DATE
  END AS eligible_date
FROM testschema.employees
;

We’re excited about INTERVAL automation in AWS SCT and encourage you to give it a try. For more information about getting started with AWS SCT, see Installing, verifying, and updating AWS SCT.

PERIOD data type

A PERIOD data value represents a duration of time with a specified start and end. For example, the Teradata literal “(2021-01-01 to 2021-01-31)” is a period with a duration of 31 days that starts and ends on the first and last day of January 2021, respectively. PERIOD data types can have three different granularities: DATE, TIME, or TIMESTAMP. The following table provides some examples.

Period Type | Example
PERIOD(DATE) | (2021-01-01 to 2021-01-31)
PERIOD(TIME) | (12:00:00 to 13:00:00)
PERIOD(TIMESTAMP) | (2021-01-31 00:00:00 to 2021-01-31 23:59:59)

As with INTERVAL, the PERIOD data type isn’t natively supported by Amazon Redshift. Previously, if you used these data types in your tables, you had to write custom code as part of the database conversion process.

Now, AWS SCT automatically converts PERIOD data types for you. AWS SCT converts a PERIOD column into two DATE (or TIME or TIMESTAMP) columns as appropriate on Amazon Redshift. Then AWS SCT converts your application code that uses the column to emulate the source engine semantics.

For example, consider the following Teradata table:

CREATE SET TABLE testschema.period_table (
  id INTEGER
, period_col PERIOD(timestamp)) 
UNIQUE PRIMARY INDEX (id);

AWS SCT converts the PERIOD(TIMESTAMP) column into two TIMESTAMP columns in Amazon Redshift:

CREATE TABLE IF NOT EXISTS testschema.period_table(
  id INTEGER
, period_col_begin TIMESTAMP
, period_col_end TIMESTAMP
)
DISTSTYLE KEY
DISTKEY
(id)
SORTKEY
(id);

Now, let’s look at a simple example of how you can use AWS SCT to convert your application code. A common operation in Teradata is to extract the starting (or ending) timestamps in a PERIOD value using the BEGIN and END built-in functions:

REPLACE VIEW testschema.period_view_begin_end AS 
SELECT 
  BEGIN(period_col) AS period_start
, END(period_col) AS period_end 
FROM testschema.period_table
;

AWS SCT converts the view to reference the transformed table columns:

CREATE OR REPLACE VIEW testschema.period_view_begin_end (period_start, period_end) AS
SELECT
  period_col_begin AS period_start
, period_col_end AS period_end
FROM testschema.period_table;

We’ll continue to build automation for PERIOD data conversion, so stay tuned for more improvements. In the meantime, you can try out the PERIOD data type conversion features in AWS SCT now. For more information, see Installing, verifying, and updating AWS SCT.

Type casting

Some data warehouse engines, like Teradata, provide an extensive set of rules to cast data values in expressions. These rules permit implicit casts, where the target data type is inferred from the expression, and explicit casts, which typically use a function to perform the type conversion.

Previously, you had to manually convert implicit cast operations in your SQL code. Now, we’re happy to share that AWS SCT automatically converts implicit casts as needed. This feature is available now for the following set of high-impact Teradata data types.

Category | Source data type | Target data types
Numeric | CHAR | BIGINT, NUMBER, TIMESTAMP
Numeric | VARCHAR | NUMBER, NUMERIC, DEC, CHAR, GEOMETRY
Numeric | INTEGER | DATE, DEC
Numeric | BIGINT | DATE
Numeric | NUMBER | CHARACTER, VARCHAR, DEC
Numeric | DECIMAL | DATE, TIMESTAMP, SMALLINT, DOUBLE PRECISION
Numeric | FLOAT | DEC
Time | DATE | BIGINT, INTEGER, DECIMAL, FLOAT, NUMBER, CHARACTER, TIMESTAMP
Time | INTERVAL | NUMBER, BIGINT, INTEGER
Other | GEOMETRY | DECIMAL

Let’s look at how to cast numbers to DATE. Many Teradata applications treat numbers and DATE as equivalent values. Internally, Teradata stores DATE values as INTEGER. The rules to convert between an INTEGER and a DATE are well-known and developers have commonly exploited this information to perform date calculations using INTEGER arithmetic.
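For reference, the internal integer form of a Teradata DATE is (year - 1900) * 10,000 + month * 100 + day, so the value 1410330 used later in this example corresponds to March 30, 2041. A quick query confirms the arithmetic:

-- Teradata encodes a DATE internally as (year - 1900) * 10000 + month * 100 + day.
-- For March 30, 2041:
SELECT (2041 - 1900) * 10000 + 3 * 100 + 30 AS teradata_date_int;  -- returns 1410330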

For example, consider the following Teradata table:

CREATE TABLE testschema.employees (
  id INTEGER
, name VARCHAR(20) CHARACTER SET LATIN
, manager_id INTEGER
, last_promo_date DATE FORMAT 'YY/MM/DD')
UNIQUE PRIMARY INDEX ( id );

We insert a single row of data into the table:

select * from employees;

 *** Query completed. One row found. 4 columns returned. 
 *** Total elapsed time was 1 second.

         id  name                   manager_id  last_promo_date
-----------  --------------------  -----------  ---------------
        112  Britney                       201                ?

We use a macro to update the last_promo_date field for id = 112. The macro accepts a BIGINT parameter to populate the DATE field.

replace macro testschema.set_last_promo_date(emp_id integer, emp_promo_date bigint) AS (
update testschema.employees
set last_promo_date = :emp_promo_date
where id = :emp_id;
);

Now, we run the macro and check the value of the last_promo_date attribute:

exec testschema.set_last_promo_date(112, 1410330);

 *** Update completed. One row changed. 
 *** Total elapsed time was 1 second.


select * from employees;

 *** Query completed. One row found. 4 columns returned. 
 *** Total elapsed time was 1 second.

         id  name                   manager_id  last_promo_date
-----------  --------------------  -----------  ---------------
        112  Britney                       201         41/03/30

You can see the last_promo_date attribute is set to the date March 30, 2041.

Now, let’s use AWS SCT to convert the table and macro to Amazon Redshift. As we saw in Part 1 of this series, AWS SCT converts the Teradata macro into an Amazon Redshift stored procedure:

CREATE TABLE IF NOT EXISTS testschema.employees(
  id INTEGER
, name CHARACTER VARYING(20) 
, manager_id INTEGER
, last_promo_date DATE
)
DISTSTYLE KEY
DISTKEY
(id)
SORTKEY
(id);

CREATE OR REPLACE PROCEDURE testschema.set_last_promo_date(par_emp_id INTEGER, par_emp_promo_date BIGINT)
AS $BODY$
BEGIN
    UPDATE testschema.employees
    SET last_promo_date = TO_DATE((par_emp_promo_date + 19000000), 'YYYYMMDD')
        WHERE id = par_emp_id;
END;
$BODY$
LANGUAGE plpgsql;

Note that 20410330 = 1410330 + 19000000; so adding 19,000,000 to the input returns the correct date value 2041-03-30.

Now, when we run the stored procedure, it updates the last_promo_date as expected:

myredshift=# select * from testschema.employees;
 id  |  name   | manager_id | last_promo_date
 112 | Britney |        201 |
(1 row)

myredshift=# call testschema.set_last_promo_date(112, 1410330);
CALL

myredshift=# select * from testschema.employees;
 id  |  name   | manager_id | last_promo_date
 112 | Britney |        201 | 2041-03-30
(1 row)

Automatic data type casting is available in AWS SCT now. You can download the latest version and try it out.

BLOB data

Amazon Redshift doesn’t have native support for BLOB columns, which you use to store large binary objects like text or images.

Previously, if you were migrating a table with a BLOB column, you had to manually move the BLOB values to file storage, like Amazon Simple Storage Service (Amazon S3), then add a reference to the S3 file in the table. Using Amazon S3 as the storage target for binary objects is a best practice because these objects are large and typically have low analytic value.

We’re happy to share that AWS SCT now automates this process for you. AWS SCT replaces the BLOB column with a CHARACTER VARYING column on the target table. Then, when you use the AWS SCT data extractors to migrate your data, the extractors upload the BLOB value to Amazon S3 and insert a reference to the BLOB into the target table.

For example, let’s create a table in Teradata and populate it with some data:

CREATE SET TABLE TESTSCHEMA.blob_table (
  id INTEGER
, blob_col BLOB(10485760))
PRIMARY INDEX ( id );

select * from blob_table;

 *** Query completed. 2 rows found. 2 columns returned. 
 *** Total elapsed time was 1 second.

         id blob_col
----------- ---------------------------------------------------------------
          1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
          2 BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Now, we convert the table with AWS SCT and build it on Amazon Redshift:

myredshift=# \d testschema.blob_table;
                    Table "testschema.blob_table"
Column  | Type                     | Collation | Nullable | Default 
id       | integer                  |           |          |
blob_col | character varying(1300) |           |          |

Then we use the AWS SCT data extractors to migrate the table data from Teradata to Amazon Redshift.

When we look at the table in Amazon Redshift, you can see the paths to the S3 files that contain the BLOB values:

myredshift=# select * from testschema.blob_table;
(2 rows)

 id |                                                               blob_col                                                               
  2 | s3://<bucket name>/data/c12f53330dd3427a845a77f143d4a1a1/dbdee8e0485c481dad601fd6170fbfb4_lobs/2/308b6f0a902941e793212058570cdda5.dat
  1 | s3://<bucket name>/data/c12f53330dd3427a845a77f143d4a1a1/dbdee8e0485c481dad601fd6170fbfb4_lobs/2/a7686067af5549479b52d81e83c3871e.dat

And on Amazon S3, you can see the actual data files. There are two, one for each BLOB value:

$ aws s3 ls s3://<bucket name>/data/c12f53330dd3427a845a77f143d4a1a1/dbdee8e0485c481dad601fd6170fbfb4_lobs/2/
2021-05-13 23:59:47         23 522fee54fda5472fbae790f43e36cba1.dat
2021-05-13 23:59:47         24 5de6c53831f741629476e2c2cbc6b226.dat

BLOB support is available now in AWS SCT and the AWS SCT data extractors. Download the latest version of the application and try it out today.

Multi-byte CHARACTER conversion

Teradata supports multibyte characters in CHARACTER data columns, which are fixed-length fields. Amazon Redshift supports multibyte characters in CHARACTER VARYING fields but not in fixed-length CHARACTER columns.

Previously, if you had fixed-length CHARACTER columns, you had to determine if they contained multibyte character data, and increase the target column size as appropriate.

AWS SCT now bridges this gap for you. If your Teradata tables contain CHARACTER columns with multibyte characters, AWS SCT automatically converts these columns to Amazon Redshift CHARACTER VARYING fields and sets the column sizes accordingly. Consider the following example, which contains four columns: a LATIN column that contains only single-byte characters, and UNICODE, GRAPHIC, and KANJISJIS columns that can contain multibyte characters:

create table testschema.char_table (
  latin_col char(70) character set latin
, unicode_col char(70) character set unicode
, graphic_col char(70) character set graphic
, kanjisjis_col char(70) character set kanjisjis
);

AWS SCT translates the LATIN column to a fixed-length CHARACTER column. The multibyte columns are upsized and converted to CHARACTER VARYING:

CREATE TABLE IF NOT EXISTS testschema.char_table (
  latin_col CHARACTER(70)
, unicode_col CHARACTER VARYING(210)
, graphic_col CHARACTER VARYING(210)
, kanjisjis_col CHARACTER VARYING(210)
)
DISTSTYLE KEY
DISTKEY
(latin_col)
SORTKEY
(latin_col);

Automatic conversion for multibyte CHARACTER columns is available in AWS SCT now.

GEOMETRY data type size

Amazon Redshift has long supported geospatial data with a GEOMETRY data type and associated spatial functions.

Previously, Amazon Redshift restricted the maximum size of a GEOMETRY column to 64 KB, which constrained some customers with large objects. Now, we’re happy to share that the maximum size of GEOMETRY objects has been increased to just under 1 MB (specifically, 1,048,447 bytes).

For example, consider the following Teradata table:

create table geometry_table (
 id INTEGER
, geometry_col1 ST_GEOMETRY 
, geometry_col2 ST_GEOMETRY(1000)
, geometry_col3 ST_GEOMETRY(1048447) 
, geometry_col4 ST_GEOMETRY(10484470)
, geometry_col5 ST_GEOMETRY INLINE LENGTH 1000
)
;

You can use AWS SCT to convert it to Amazon Redshift. The converted table definition is as follows. A size specification isn’t needed on the converted columns because Amazon Redshift implicitly sets the column size.

CREATE TABLE IF NOT EXISTS testschema.geometry_table(
id INTEGER,
geometry_col1 GEOMETRY,
geometry_col2 GEOMETRY,
geometry_col3 GEOMETRY,
geometry_col4 GEOMETRY,
geometry_col5 GEOMETRY
)
DISTSTYLE KEY
DISTKEY
(
id
)
SORTKEY
(
id
);
ALTER TABLE testschema.geometry_table ALTER DISTSTYLE AUTO;
ALTER TABLE testschema.geometry_table ALTER SORTKEY AUTO;

Large GEOMETRY columns are available in Amazon Redshift now. For more information, see Querying spatial data in Amazon Redshift.

Conclusion

We’re happy to share these new features with you. If you’re contemplating a migration to Amazon Redshift, these capabilities can help automate your schema conversion and preserve your investment in existing reports, applications, and ETL, as well as accelerate your query performance.

This post described a few of the dozens of new features we have recently introduced to automate your data warehouse migrations to Amazon Redshift. We will share more in upcoming posts. You’ll hear about additional SQL automation, a purpose-built scripting language for Amazon Redshift with BTEQ compatibility, and automated support for proprietary SQL features.

Check back soon for more information. Until then, you can learn more about Amazon Redshift and the AWS Schema Conversion Tool on the AWS website. Happy migrating!


About the Author

Michael Soo is a database engineer with the AWS DMS and AWS SCT team at Amazon Web Services.