All posts by Ahmed Shehata

Migrate Microsoft Azure Synapse Analytics to Amazon Redshift using AWS SCT

Post Syndicated from Ahmed Shehata original https://aws.amazon.com/blogs/big-data/migrate-microsoft-azure-synapse-analytics-to-amazon-redshift-using-aws-sct/

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that provides the flexibility to use provisioned or serverless compute for your analytical workloads. With Amazon Redshift Serverless and Query Editor v2, you can load and query large datasets in just a few clicks and pay only for what you use. The decoupled compute and storage architecture of Amazon Redshift enables you to build highly scalable, resilient, and cost-effective workloads. Many customers migrate their data warehousing workloads to Amazon Redshift and benefit from the rich capabilities it offers, such as the following:

  • Amazon Redshift seamlessly integrates with broader data, analytics, and AI or machine learning (ML) services on AWS, enabling you to choose the right tool for the right job. Modern analytics is much wider than SQL-based data warehousing. With Amazon Redshift, you can build lake house architectures and perform any kind of analytics, such as interactive analytics, operational analytics, big data processing, visual data preparation, predictive analytics, machine learning, and more.
  • You don’t need to worry about workloads such as ETL (extract, transform, and load), dashboards, ad-hoc queries, and so on interfering with each other. You can isolate workloads using data sharing, while using the same underlying datasets.
  • When users run many queries at peak times, compute seamlessly scales within seconds to provide consistent performance at high concurrency. You get 1 hour of free concurrency scaling credit for every 24 hours that your main data warehouse is in use. This free credit meets the concurrency demand of 97% of the Amazon Redshift customer base.
  • Amazon Redshift is straightforward to use with self-tuning and self-optimizing capabilities. You can get faster insights without spending valuable time managing your data warehouse.
  • Fault tolerance is built in. All data written to Amazon Redshift is automatically and continuously replicated to Amazon Simple Storage Service (Amazon S3). Any hardware failures are automatically replaced.
  • Amazon Redshift is simple to interact with. You can access data with traditional, cloud-native, containerized, serverless web services or event-driven applications. You can also use your favorite business intelligence (BI) and SQL tools to access, analyze, and visualize data in Amazon Redshift.
  • Amazon Redshift ML makes it straightforward for data scientists to create, train, and deploy ML models using familiar SQL. You can also run predictions using SQL.
  • Amazon Redshift provides comprehensive data security at no extra cost. You can set up end-to-end data encryption, configure firewall rules, define granular row-level and column-level security controls on sensitive data, and more.

In this post, we show how to migrate a data warehouse from Microsoft Azure Synapse to Redshift Serverless using AWS Schema Conversion Tool (AWS SCT) and AWS SCT data extraction agents. AWS SCT makes heterogeneous database migrations predictable by automatically converting the source database code and storage objects to a format compatible with the target database. Any objects that can’t be automatically converted are clearly marked so that they can be manually converted to complete the migration. AWS SCT can also scan your application code for embedded SQL statements and convert them.

Solution overview

AWS SCT uses a service account to connect to your Azure Synapse Analytics. First, we create a Redshift database into which Azure Synapse data will be migrated. Next, we create an S3 bucket. Then, we use AWS SCT to convert Azure Synapse schemas and apply them to Amazon Redshift. Finally, to migrate data, we use AWS SCT data extraction agents, which extract data from Azure Synapse, upload it into an S3 bucket, and copy it to Amazon Redshift.

The following diagram illustrates our solution architecture.

This walkthrough covers the following steps:

  1. Create a Redshift Serverless data warehouse.
  2. Create the S3 bucket and folder.
  3. Convert and apply the Azure Synapse schema to Amazon Redshift using AWS SCT:
    1. Connect to the Azure Synapse source.
    2. Connect to the Amazon Redshift target.
    3. Convert the Azure Synapse schema to a Redshift database.
    4. Analyze the assessment report and address the action items.
    5. Apply the converted schema to the target Redshift database.
  4. Migrate data from Azure Synapse to Amazon Redshift using AWS SCT data extraction agents:
    1. Generate trust and key stores (this step is optional).
    2. Install and configure the data extraction agent.
    3. Start the data extraction agent.
    4. Register the data extraction agent.
    5. Add virtual partitions for large tables (this step is optional).
    6. Create a local data migration task.
    7. Start the local data migration task.
  5. View data in Amazon Redshift.

Prerequisites

Before starting this walkthrough, you must have the following prerequisites:

  • A source Azure Synapse Analytics dedicated SQL pool and a service account that AWS SCT can use to connect to it
  • An EC2 instance (this post uses a Windows instance) with AWS SCT installed
  • The Amazon Redshift JDBC driver downloaded to that instance, for use when configuring the AWS SCT data extraction agent

Create a Redshift Serverless data warehouse

In this step, we create a Redshift Serverless data warehouse with a workgroup and namespace. A workgroup is a collection of compute resources and a namespace is a collection of database objects and users. To isolate workloads and manage different resources in Redshift Serverless, you can create namespaces and workgroups and manage storage and compute resources separately.

Follow these steps to create a Redshift Serverless data warehouse with a workgroup and namespace:

  1. On the Amazon Redshift console, choose the AWS Region that you want to use.
  2. In the navigation pane, choose Redshift Serverless.
  3. Choose Create workgroup.

  4. For Workgroup name, enter a name that describes the compute resources.

  5. Verify that the VPC is the same VPC as the EC2 instance with AWS SCT.
  6. Choose Next.
  7. For Namespace, enter a name that describes your dataset.
  8. In the Database name and password section, select Customize admin user credentials.
  9. For Admin user name, enter a user name of your choice (for example, awsuser).
  10. For Admin user password, enter a password of your choice (for example, MyRedShiftPW2022).

  11. Choose Next.

Note that data in the Redshift Serverless namespace is encrypted by default.

  12. In the Review and Create section, choose Create.

Now you create an AWS Identity and Access Management (IAM) role and set it as the default on your namespace. Note that there can only be one default IAM role.

  1. On the Redshift Serverless Dashboard, in the Namespaces / Workgroups section, choose the namespace you just created.
  2. On the Security and encryption tab, in the Permissions section, choose Manage IAM roles.
  3. Choose Manage IAM roles and choose Create IAM role.
  4. In the Specify an Amazon S3 bucket for the IAM role to access section, choose one of the following methods:
    1. Choose No additional Amazon S3 bucket to allow the created IAM role to access only the S3 buckets with names containing the word redshift.
    2. Choose Any Amazon S3 bucket to allow the created IAM role to access all S3 buckets.
    3. Choose Specific Amazon S3 buckets to specify one or more S3 buckets for the created IAM role to access. Then choose one or more S3 buckets from the table.
  5. Choose Create IAM role as default.
  6. Capture the endpoint for the Redshift Serverless workgroup you just created.
  7. On the Redshift Serverless Dashboard, in the Namespaces / Workgroups section, choose the workgroup you just created.
  8. In the General information section, copy the endpoint.

Create the S3 bucket and folder

During the data migration process, AWS SCT uses Amazon S3 as a staging area for the extracted data. Follow these steps to create an S3 bucket:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. For Bucket name, enter a unique DNS-compliant name for your bucket (for example, uniquename-as-rs).

For more information about bucket names, refer to Bucket naming rules.

  4. For AWS Region, choose the Region in which you created the Redshift Serverless workgroup.
  5. Choose Create bucket.

  6. Choose Buckets in the navigation pane and navigate to the S3 bucket you just created (uniquename-as-rs).
  7. Choose Create folder.
  8. For Folder name, enter incoming.
  9. Choose Create folder.

Convert and apply the Azure Synapse schema to Amazon Redshift using AWS SCT

To convert the Azure Synapse schema to Amazon Redshift format, we use AWS SCT. Start by logging in to the EC2 instance that you created previously and launch AWS SCT.

Connect to the Azure Synapse source

Complete the following steps to connect to the Azure Synapse source:

  1. On the File menu, choose Create New Project.
  2. Choose a location to store your project files and data.
  3. Provide a meaningful but memorable name for your project (for example, Azure Synapse to Amazon Redshift).
  4. To connect to the Azure Synapse source data warehouse, choose Add source.
  5. Choose Azure Synapse and choose Next.
  6. For Connection name, enter a name (for example, olap-azure-synapse).

AWS SCT displays this name in the object tree in the left pane.

  7. For Server name, enter your Azure Synapse server name.
  8. For SQL pool, enter your Azure Synapse pool name.
  9. Enter a user name and password.
  10. Choose Test connection to verify that AWS SCT can connect to your source Azure Synapse project.
  11. When the connection is successfully validated, choose OK and Connect.

Connect to the Amazon Redshift target

Follow these steps to connect to Amazon Redshift:

  1. In AWS SCT, choose Add target.
  2. Choose Amazon Redshift, then choose Next.
  3. For Connection name, enter a name to describe the Amazon Redshift connection.

AWS SCT displays this name in the object tree in the right pane.

  4. For Server name, enter the Redshift Serverless workgroup endpoint you captured earlier.
  5. For Server port, enter 5439.
  6. For Database, enter dev.
  7. For User name, enter the user name you chose when creating the Redshift Serverless workgroup.
  8. For Password, enter the password you chose when creating the Redshift Serverless workgroup.
  9. Deselect Use AWS Glue.
  10. Choose Test connection to verify that AWS SCT can connect to your target Redshift workgroup.
  11. When the test is successful, choose OK.
  12. Choose Connect to connect to the Amazon Redshift target.

Alternatively, you can use connection values that are stored in AWS Secrets Manager.

Convert the Azure Synapse schema to a Redshift data warehouse

After you create the source and target connections, you will see the source Azure Synapse object tree in the left pane and the target Amazon Redshift object tree in the right pane. We then create mapping rules to describe the source-target pair for the Azure Synapse to Amazon Redshift migration.

Follow these steps to convert the Azure Synapse dataset to Amazon Redshift format:

  1. In the left pane, choose (right-click) the schema you want to convert.
  2. Choose Convert schema.
  3. In the dialog box, choose Yes.

When the conversion is complete, you will see a new schema created in the Amazon Redshift pane (right pane) with the same name as your Azure Synapse schema.

The sample schema we used has three tables; you can see these objects in Amazon Redshift format in the right pane. AWS SCT converts all the Azure Synapse code and data objects to Amazon Redshift format. You can also use AWS SCT to convert external SQL scripts, application code, or additional files with embedded SQL.

Analyze the assessment report and address the action items

AWS SCT creates an assessment report to assess the migration complexity. AWS SCT can convert the majority of code and database objects, but some objects may require manual conversion. AWS SCT highlights these objects in blue in the conversion statistics diagram and creates action items with a complexity attached to them.

To view the assessment report, switch from Main view to Assessment Report view as shown in the following screenshot.

The Summary tab shows objects that were converted automatically and objects that were not converted automatically. Green represents automatically converted objects or objects with simple action items. Blue represents medium and complex action items that require manual intervention.

The Action items tab shows the recommended actions for each conversion issue. If you choose an action item from the list, AWS SCT highlights the object that the action item applies to.

The report also contains recommendations for how to manually convert the schema item. For example, after the assessment runs, detailed reports for the database and schema show you the effort required to design and implement the recommendations for converting action items. For more information about deciding how to handle manual conversions, see Handling manual conversions in AWS SCT. AWS SCT completes some actions automatically while converting the schema to Amazon Redshift; objects with such actions are marked with a red warning sign.

You can evaluate and inspect the individual object DDL by selecting it in the right pane, and you can also edit it as needed. In the following example, AWS SCT modifies the ID column data type from decimal(3,0) in Azure Synapse to the smallint data type in Amazon Redshift.
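To illustrate the kind of change involved, the following sketch contrasts a simplified source definition with its converted counterpart. The table name, the columns other than the ID column, and any distribution settings are hypothetical; AWS SCT generates the actual DDL from your source schema.

-- Azure Synapse (source): simplified, hypothetical definition
CREATE TABLE dbo.tbl_currency
(
  id            decimal(3,0),
  currency_code varchar(10),
  currency_name varchar(100)
);

-- Amazon Redshift (target): the converted definition changes the id column to smallint
CREATE TABLE dbo.tbl_currency
(
  id            smallint,
  currency_code varchar(10),
  currency_name varchar(100)
);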

Apply the converted schema to the target Redshift data warehouse

To apply the converted schema to Amazon Redshift, select the converted schema in the right pane, right-click, and choose Apply to database.

Migrate data from Azure Synapse to Amazon Redshift using AWS SCT data extraction agents

AWS SCT extraction agents extract data from your source database and migrate it to the AWS Cloud. In this section, we configure AWS SCT extraction agents to extract data from Azure Synapse and migrate to Amazon Redshift. For this post, we install the AWS SCT extraction agent on the same Windows instance that has AWS SCT installed. For better performance, we recommend that you use a separate Linux instance to install extraction agents if possible. For very large datasets, AWS SCT supports the use of multiple data extraction agents running on several instances to maximize throughput and increase the speed of data migration.

Generate trust and key stores (optional)

You can use Secure Socket Layer (SSL) encrypted communication with AWS SCT data extractors. When you use SSL, all data passed between the applications remains private and integral. To use SSL communication, you need to generate trust and key stores using AWS SCT. You can skip this step if you don’t want to use SSL. We recommend using SSL for production workloads.

Follow these steps to generate trust and key stores:

  1. In AWS SCT, choose Settings, Global settings, and Security.
  2. Choose Generate trust and key store.

  3. Enter a name and password for the trust and key stores.
  4. Enter a location to store them.
  5. Choose Generate, then choose OK.

Install and configure the data extraction agent

In the installation package for AWS SCT, you can find a subfolder called agents (\aws-schema-conversion-tool-1.0.latest.zip\agents). Locate and install the executable file with a name like aws-schema-conversion-tool-extractor-xxxxxxxx.msi.

In the installation process, follow these steps to configure AWS SCT Data Extractor:

  1. For Service port, enter the port number the agent listens on. It is 8192 by default.
  2. For Working folder, enter the path where the AWS SCT data extraction agent will store the extracted data.

The working folder can be on a different computer from the agent, and a single working folder can be shared by multiple agents on different computers.

  3. For Enter Redshift JDBC driver file or files, enter the location where you downloaded the Redshift JDBC drivers.
  4. For Add the Amazon Redshift driver, enter YES.
  5. For Enable SSL communication, enter yes. Enter No here if you don't want to use SSL.
  6. Choose Next.

  7. For Trust store path, enter the storage location you specified when creating the trust and key store.
  8. For Trust store password, enter the password for the trust store.
  9. For Enable client SSL authentication, enter yes.
  10. For Key store path, enter the storage location you specified when creating the trust and key store.
  11. For Key store password, enter the password for the key store.
  12. Choose Next.

Start the data extraction agent

Use the following procedure to start extraction agents. Repeat this procedure on each computer that has an extraction agent installed.

Extraction agents act as listeners. When you start an agent with this procedure, the agent starts listening for instructions. You send the agents instructions to extract data from your data warehouse in a later section.

To start the extraction agent, navigate to the AWS SCT Data Extractor Agent directory. For example, in Microsoft Windows, use C:\Program Files\AWS SCT Data Extractor Agent\StartAgent.bat.

On the computer that has the extraction agent installed, run the start command for your operating system from a command prompt or terminal window. To stop an agent, run the equivalent stop command (for example, StopAgent.bat on Windows); to restart an agent, run RestartAgent.bat.

Note that you should have administrator access to run those commands.

Register the data extraction agent

Follow these steps to register the data extraction agent:

  1. In AWS SCT, change the view to Data Migration view and choose Register.
  2. Select Redshift data agent, then choose OK.

  3. For Description, enter a name to identify the agent.
  4. For Host name, if you installed the extraction agent on the same workstation as AWS SCT, enter 0.0.0.0 to indicate local host. Otherwise, enter the host name of the machine on which the AWS SCT extraction agent is installed. It is recommended to install extraction agents on Linux for better performance.
  5. For Port, enter the number you used for the listening port (default 8192) when installing the AWS SCT extraction agent.
  6. Select Use SSL to encrypt AWS SCT connection to Data Extraction Agent.

  7. If you're using SSL, navigate to the SSL tab.
  8. For Trust store, choose the trust store you created earlier.
  9. For Key store, choose the key store you created earlier.
  10. Choose Test connection.
  11. After the connection is validated successfully, choose OK and Register.

Create a local data migration task

To migrate data from Azure Synapse Analytics to Amazon Redshift, you create, run, and monitor the local migration task from AWS SCT. This step uses the data extraction agent to migrate data by creating a task.

Follow these steps to create a local data migration task:

  1. In AWS SCT, under the schema name in the left pane, choose (right-click) the table you want to migrate (for this post, we use the table tbl_currency).
  2. Choose Create Local task.

  3. Choose from the following migration modes:
    1. Extract the source data and store it on a local PC or virtual machine where the agent runs.
    2. Extract the data and upload it to an S3 bucket.
    3. Extract the data, upload it to Amazon S3, and copy it into Amazon Redshift. (We choose this option for this post.)

  4. On the Advanced tab, provide the extraction and copy settings.

  5. On the Source server tab, make sure you are using the current connection properties.

  6. On the Amazon S3 settings tab, for Amazon S3 bucket folder, provide the bucket and folder names of the S3 bucket you created earlier.

The AWS SCT data extraction agent uploads the data to that S3 bucket and folder before copying it to Amazon Redshift.

  7. Choose Test Task.

  8. When the task is successfully validated, choose OK, then choose Create.

Start the local data migration task

To start the task, choose Start or Restart on the Tasks tab.

First, the data extraction agent extracts data from Azure Synapse. Then the agent uploads data to Amazon S3 and launches a copy command to move the data to Amazon Redshift.
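The copy step is conceptually similar to the following COPY statement. This is only a sketch: AWS SCT generates the actual command, object keys, and format options itself, the schema name here is a placeholder, and the bucket and folder are the ones created earlier in this post. IAM_ROLE default works because we set a default IAM role on the Redshift Serverless namespace.

COPY dbo.tbl_currency
FROM 's3://uniquename-as-rs/incoming/'
IAM_ROLE default
FORMAT AS CSV;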

At this point, AWS SCT has successfully migrated data from the source Azure Synapse table to the Redshift table.

View data in Amazon Redshift

After the data migration task is complete, you can connect to Amazon Redshift and validate the data. Complete the following steps:

  1. On the Amazon Redshift console, navigate to the Query Editor v2.
  2. Open the Redshift Serverless workgroup you created.
  3. Choose Query data.

  4. For Database, enter the name of your database (for example, dev).
  5. For Authentication, select Federated user.
  6. Choose Create connection.

  7. Open a new editor by choosing the plus sign.
  8. In the editor, write a query to select from the schema name and table or view name you want to verify.
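For example, a quick validation of the table migrated earlier could look like the following; replace the schema name (a placeholder here) with the schema AWS SCT created from your Azure Synapse source.

-- Confirm the row count matches the source table
SELECT COUNT(*) FROM dbo.tbl_currency;

-- Inspect a sample of the migrated rows
SELECT *
FROM dbo.tbl_currency
LIMIT 10;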

You can explore the data, run ad-hoc queries, and make visualizations, charts, and views.

The following screenshot is the view of the source Azure Synapse dataset we used in this post.

Clean up

Follow the steps in this section to clean up any AWS resources you created as part of this post.

Terminate the EC2 instance

Follow these steps to terminate the EC2 instance:

  1. On the Amazon EC2 console, in the navigation pane, choose Instances.
  2. Select the instance you created.
  3. Choose Instance state, then choose Terminate instance.
  4. Choose Terminate when prompted for confirmation.

Delete the Redshift Serverless workgroup and namespace

Follow these steps to delete the Redshift Serverless workgroup and namespace:

  1. On the Redshift Serverless Dashboard, in the Namespaces / Workgroups section, choose the workgroup you created.
  2. On the Actions menu, choose Delete workgroup.
  3. Select Delete the associated namespace.
  4. Deselect Create final snapshot.
  5. Enter delete in the confirmation text box and choose Delete.

Delete the S3 bucket

Follow these steps to delete the S3 bucket:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose the bucket you created.
  3. Choose Delete.
  4. To confirm deletion, enter the name of the bucket.
  5. Choose Delete bucket.

Conclusion

Migrating a data warehouse can be a challenging, complex, and yet rewarding project. AWS SCT reduces the complexity of data warehouse migrations. This post discussed how a data migration task extracts, downloads, and migrates data from Azure Synapse to Amazon Redshift. The solution we presented performs a one-time migration of database objects and data; data changes made in Azure Synapse while the migration is in progress won't be reflected in Amazon Redshift. While the data migration is in progress, put the ETL jobs that load Azure Synapse on hold, or rerun them against Amazon Redshift after the migration completes. Also consider following the best practices for AWS SCT.

To get started, download and install AWS SCT, sign in to the AWS Management Console, check out Redshift Serverless, and start migrating!


About the Authors

Ahmed Shehata is a Senior Analytics Specialist Solutions Architect at AWS based in Toronto. He has more than two decades of experience helping customers modernize their data platforms. Ahmed is passionate about helping customers build efficient, performant, and scalable analytics solutions.

Jagadish Kumar is a Senior Analytics Specialist Solutions Architect at AWS focused on Amazon Redshift. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS.

Anusha Challa is a Senior Analytics Specialist Solutions Architect at AWS focused on Amazon Redshift. She has helped many customers build large-scale data warehouse solutions in the cloud and on premises. Anusha is passionate about data analytics and data science and about helping customers achieve success with their large-scale data projects.

Easy analytics and cost-optimization with Amazon Redshift Serverless

Post Syndicated from Ahmed Shehata original https://aws.amazon.com/blogs/big-data/easy-analytics-and-cost-optimization-with-amazon-redshift-serverless/

Amazon Redshift Serverless makes it easy to run and scale analytics in seconds without the need to set up and manage data warehouse clusters. With Redshift Serverless, users such as data analysts, developers, business professionals, and data scientists can get insights simply by loading and querying data in the data warehouse.

With Redshift Serverless, you can benefit from the following features:

  • Access and analyze data without the need to set up, tune, and manage Amazon Redshift clusters
  • Use Amazon Redshift’s SQL capabilities, industry-leading performance, and data lake integration to seamlessly query data across a data warehouse, data lake, and databases
  • Deliver consistently high performance and simplified operations for even the most demanding and volatile workloads with intelligent and automatic scaling, without under-provisioning or over-provisioning the compute resources
  • Pay for the compute only when the data warehouse is in use

In this post, we discuss four different use cases of Redshift Serverless:

  • Easy analytics – A startup company needs to create a new data warehouse and reports for marketing analytics. They have very limited IT resources, and need to get started quickly and easily with minimal infrastructure or administrative overhead.
  • Self-service analytics – An existing Amazon Redshift customer has a provisioned Amazon Redshift cluster that is right-sized for their current workload. A new team needs quick self-service access to the Amazon Redshift data to create forecasting and predictive models for the business.
  • Optimize workload performance – An existing Amazon Redshift customer is looking to optimize the performance of their variable reporting workloads during peak time.
  • Cost-optimization of sporadic workloads – An existing customer is looking to optimize the cost of their Amazon Redshift producer cluster with sporadic batch ingestion workloads.

Easy analytics

In our first use case, a startup company with limited resources needs to create a new data warehouse and reports for marketing analytics. The customer doesn’t have any IT administrators, and their staff is comprised of data analysts, a data scientist, and business analysts. They want to create new marketing analytics quickly and easily, to determine the ROI and effectiveness of their marketing efforts. Given their limited resources, they want minimal infrastructure and administrative overhead.

In this case, they can use Redshift Serverless to satisfy their needs. They can create a new Redshift Serverless endpoint in a few minutes and quickly load their initial few terabytes of marketing data into it. Their data analysts, data scientists, and business analysts can start querying and analyzing the data with ease and derive business insights quickly without worrying about infrastructure, tuning, and administrative tasks.

Getting started with Redshift Serverless is easy and quick. On the Get started with Amazon Redshift Serverless page, you can select the Use default settings option, which will create a default namespace and workgroup with the default settings, as shown in the following screenshots.

With just a single click, you can create a new Redshift Serverless endpoint in minutes with data encryption enabled, and a default AWS Identity and Access Management (IAM) role, VPC, and security group attached. You can also use the Customize settings option to override these settings, if desired.

When the Redshift Serverless endpoint is available, choose Query data to launch the Amazon Redshift Query Editor v2.

Query Editor v2 makes it easy to create database objects, load data, analyze and visualize data, and share and collaborate with your teams.

The following screenshot illustrates creating new database tables using the UI.

The following screenshot demonstrates loading data from Amazon Simple Storage Service (Amazon S3) using the UI.
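Behind the scenes, the load wizard generates standard SQL (a COPY command). The following is a minimal sketch of equivalent statements; the table definition, bucket path, and use of the namespace's default IAM role are illustrative assumptions rather than values from this walkthrough.

-- Illustrative target table for a marketing dataset
CREATE TABLE marketing_events (
  event_id   bigint,
  channel    varchar(50),
  spend      decimal(12,2),
  event_date date
);

-- Load files staged in Amazon S3 (placeholder bucket and prefix)
COPY marketing_events
FROM 's3://your-bucket/marketing/'
IAM_ROLE default
FORMAT AS CSV
IGNOREHEADER 1;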

The following screenshot shows an example of analyzing and visualizing data.

Refer to the video Get Started with Amazon Redshift Serverless to learn how to set up a new Redshift Serverless endpoint and start analyzing your data in minutes.

Self-service analytics

In another use case, a customer is currently using an Amazon Redshift provisioned cluster that is right-sized for their current workloads. A new data science team wants quick access to the Amazon Redshift cluster data for a new workload that will build predictive models for forecasting. The new team members don’t know yet how long they will need access and how complex their queries will be.

Adding the new data science group to the current cluster presented the following challenges:

  • The additional compute capacity needs of the new team are unknown and hard to estimate
  • Because the current cluster resources are optimally utilized, they need to ensure workload isolation to support the needs of the new team without impacting existing workloads
  • A chargeback or cost allocation model is desired for the various teams consuming data

To address these issues, they decide to let the data science team create their own new Redshift Serverless instance and grant them data share access to the data they need from the existing Amazon Redshift provisioned cluster. The following diagram illustrates the new architecture.

The following steps need to be performed to implement this architecture:

  1. The data science team can create a new Redshift Serverless endpoint, as described in the previous use case.
  2. Enable data sharing between the Amazon Redshift provisioned cluster (producer) and the data science Redshift Serverless endpoint (consumer) using these high-level steps:
    1. Create a new data share.
    2. Add a schema to the data share.
    3. Add objects you want to share to the data share.
    4. Grant usage on this data share to the Redshift Serverless consumer namespace, using the Redshift Serverless endpoint’s namespace ID.
    5. Note that the Redshift Serverless endpoint is encrypted by default; the provisioned Redshift producer cluster also needs to be encrypted for data sharing to work between them.

The following screenshot shows sample SQL commands to enable data sharing on the Amazon Redshift provisioned producer cluster.
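In general, the commands take the following shape; the data share, schema, table, and consumer namespace ID are placeholders to adapt to your environment.

-- On the provisioned producer cluster
CREATE DATASHARE forecasting_share;

ALTER DATASHARE forecasting_share ADD SCHEMA sales;
ALTER DATASHARE forecasting_share ADD TABLE sales.orders;

-- Grant usage to the Redshift Serverless consumer namespace (placeholder namespace ID)
GRANT USAGE ON DATASHARE forecasting_share
TO NAMESPACE 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee';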

On the Amazon Redshift Serverless consumer, create a database from the data share and then query the shared objects.
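On the consumer side, the corresponding sketch (again with placeholder names and the producer's namespace ID) looks like the following.

-- On the Redshift Serverless consumer
CREATE DATABASE forecasting_db
FROM DATASHARE forecasting_share
OF NAMESPACE 'ffffffff-1111-2222-3333-444444444444';

-- Query the shared objects using three-part notation
SELECT COUNT(*) FROM forecasting_db.sales.orders;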

For more details about configuring Amazon Redshift data sharing, refer to Sharing Amazon Redshift data securely across Amazon Redshift clusters for workload isolation.

With this architecture, we can resolve the three challenges mentioned earlier:

  • Redshift Serverless allows the data science team to create a new Amazon Redshift database without worrying about capacity needs, and set up data sharing with the Amazon Redshift provisioned producer cluster within 30 minutes. This tackles the first challenge.
  • Amazon Redshift data sharing allows you to share live, transactionally consistent data across provisioned and Serverless Redshift databases, and data sharing can even happen when the producer is paused. The new workload is isolated and runs on its own compute resources, without impacting the performance of the Amazon Redshift provisioned producer cluster. This addresses the second challenge.
  • Redshift Serverless isolates the cost of the new workload to the new team and enables an easy chargeback model. This tackles the third challenge.

Optimized workload performance

For our third use case, an Amazon Redshift customer using an Amazon Redshift provisioned cluster is looking for performance optimization during peak times for their workload. They need a solution to manage dynamic workloads without over-provisioning or under-provisioning resources and build a scalable architecture.

An analysis of the workload on the cluster shows that the cluster has two different workloads:

  • The first workload is streaming ingestion, which runs steadily during the day.
  • The second workload is reporting, which runs on an ad hoc basis during the day with some scheduled jobs during the night. It was noted that the reporting jobs run anywhere between 8–12 hours daily.

The provisioned cluster was sized as 12 nodes of ra3.4xlarge to handle both workloads running in parallel.

To optimize these workloads, the following architecture was proposed and implemented:

  • Configure an Amazon Redshift provisioned cluster with just 4 nodes of ra3.4xlarge to handle the streaming ingestion workload only. The following screenshots illustrate how to do this on the Amazon Redshift console, via an elastic resize operation that reduces the existing Amazon Redshift provisioned cluster from 12 nodes to 4.
  • Create a new Redshift Serverless endpoint with 128 RPUs (Redshift Processing Units) to be used by the reporting workload, in lieu of 8 nodes of ra3.4xlarge. For more details about setting up Redshift Serverless, refer to the first use case regarding easy analytics.
  • Enable data sharing between the Amazon Redshift provisioned cluster as the producer and Redshift Serverless as the consumer using the serverless namespace ID, similar to how it was configured earlier in the self-service analytics use case. For more information about how to configure Amazon Redshift data sharing, refer to Sharing Amazon Redshift data securely across Amazon Redshift clusters for workload isolation.

The following diagram compares the current architecture and the new architecture using Redshift Serverless.

After completing this setup, the customer ran the streaming ingestion workload on the Amazon Redshift provisioned instance (producer) and reporting workloads on Redshift Serverless (consumer) based on the recommended architecture. The following improvements were observed:

  • The streaming ingestion workload performed the same as it did on the former 12-node Amazon Redshift provisioned cluster.
  • Reporting users saw a performance improvement of 30% by using Redshift Serverless, which scaled compute resources dynamically within seconds as additional ad hoc users ran reports and queries, without impacting the streaming ingestion workload.
  • This architecture pattern is expandable to add more consumers like data scientists, by setting up another Redshift Serverless instance as a new consumer.

Cost-optimization

In our final use case, a customer is using an Amazon Redshift provisioned cluster as a producer to ingest data from different sources. The data is then shared with other Amazon Redshift provisioned consumer clusters for data science modeling and reporting purposes.

Their current Amazon Redshift provisioned producer cluster has 8 nodes of ra3.4xlarge and is located in the us-east-1 Region. The data delivery from the different data sources is scattered between midnight to 8:00 AM, and the data ingestion jobs take around 3 hours to run in total every day. The customer is currently on the on-demand cost model and has scheduled daily jobs to pause and resume the cluster to minimize costs. The cluster resumes every day at midnight and pauses at 8:00 AM, with a total runtime of 8 hours a day.

The current annual cost of this cluster is 365 days * 8 hours * 8 nodes * $3.26 (node cost per hour) = $76,153.6 per year.

To optimize the cost of this workload, the following architecture was proposed and implemented:

  • Set up a new Redshift Serverless endpoint with 64 RPU as the base configuration to be utilized by the data ingestion producer team. For more information about setting up Redshift Serverless, refer to the first use case regarding easy analytics.
  • Restore the latest snapshot from the existing Amazon Redshift provisioned producer cluster into Redshift Serverless by choosing the Restore to serverless namespace option, as shown in the following screenshot.
  • Enable data sharing between Redshift Serverless as the producer and the Amazon Redshift provisioned cluster as the consumer, similar to how it was configured earlier in the self-service analytics use case.

The following diagram compares the current architecture to the new architecture.

By moving to Redshift Serverless, the customer realized the following benefits:

  • Cost savings – With Redshift Serverless, the customer pays for compute only when the data warehouse is in use. In this scenario, the customer observed a savings of up to 65% on their annual costs by using Redshift Serverless as the producer, while still getting better performance on their workloads. The Redshift Serverless annual cost in this case equals 365 days * 3 hours * 64 RPUs * $0.375 (RPU cost per hour) = $26,280, compared to $76,153.6 for their former provisioned producer cluster. Also, the Redshift Serverless 64 RPU baseline configuration offers the customer more compute resources than their former 8 nodes of ra3.4xlarge cluster, resulting in better performance overall.
  • Less administration overhead – Because the customer doesn’t need to worry about pausing and resuming their Amazon Redshift cluster any more, the administration of their data warehouse is simplified by moving their producer Amazon Redshift cluster to Redshift Serverless.

Conclusion

In this post, we discussed four different use cases that demonstrate the benefits of Amazon Redshift Serverless: easy analytics, ease of use, superior performance, and the cost savings that can be realized from the pay-per-use pricing model.

Amazon Redshift provides flexibility and choice in data warehousing. Provisioned Amazon Redshift clusters are a great choice for customers who need a customized environment with more granular controls, whereas Redshift Serverless lets you start new data warehousing workloads in minutes with dynamic automatic scaling, no infrastructure management, and a pay-per-use pricing model.

We encourage you to start using Amazon Redshift Serverless today and enjoy the many benefits it offers.


About the Authors

Ahmed Shehata is a Data Warehouse Specialist Solutions Architect with Amazon Web Services, based out of Toronto.

Manish Vazirani is an Analytics Platform Specialist at AWS. He is part of the Data-Driven Everything (D2E) program, where he helps customers become more data-driven.

Rohit Bansal is an Analytics Specialist Solutions Architect at AWS. He has nearly two decades of experience helping customers modernize their data platforms. He is passionate about helping customers build scalable, cost-effective data and analytics solutions in the cloud. In his spare time, he enjoys spending time with his family, travel, and road cycling.

Define error handling for Amazon Redshift Spectrum data

Post Syndicated from Ahmed Shehata original https://aws.amazon.com/blogs/big-data/define-error-handling-for-amazon-redshift-spectrum-data/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift Spectrum allows you to query open format data directly from the Amazon Simple Storage Service (Amazon S3) data lake without having to load the data into Amazon Redshift tables. With Redshift Spectrum, you can query open file formats such as Apache Parquet, ORC, JSON, Avro, and CSV. This feature of Amazon Redshift enables a modern data architecture that allows you to query all your data to obtain more complete insights.

Amazon Redshift has a standard way of handling data errors in Redshift Spectrum. Data file fields containing any invalid character are set to null. Character fields longer than the defined table column length get truncated by Redshift Spectrum, whereas numeric fields display the maximum number that can fit in the column. With the newly added user-defined data error handling feature in Amazon Redshift Spectrum, you can now customize data validation and error handling.

This feature provides you with specific methods for handling invalid characters, surplus characters, and numeric overflow while processing data using Redshift Spectrum. In addition, the errors are captured and visible in the new system view SVL_SPECTRUM_SCAN_ERROR. You can even stop a query when a defined error threshold has been reached.

Prerequisites

To demonstrate Redshift Spectrum user-defined data handling, we build an external table over a data file with soccer league information and use it to show different data errors and how the new feature offers different options for dealing with them. We need the following prerequisites:

  • An Amazon Redshift cluster (you can also use an existing cluster)
  • An S3 bucket to host the sample data file (this post uses s3://uddh-soccer/league/)
  • An AWS Glue Data Catalog database and a corresponding external schema in Amazon Redshift (this post uses schema_spectrum_uddh)

Solution overview

We use the data file for soccer leagues to define an external table to demonstrate different data errors and the different handling techniques offered by the new feature to deal with those errors. The following screenshot shows an example of the data file.

Note the following in the example:

  • The club name can typically be longer than 15 characters
  • The league name can typically be longer than 20 characters
  • The club name Barcelôna includes an invalid character
  • The column nspi includes values that are bigger than the SMALLINT range

Then we create an external table (see the following code) to demonstrate the new user-defined handling:

  • We define the club_name and league_name columns shorter than they should be to demonstrate handling of surplus characters
  • We define the column league_nspi as SMALLINT to demonstrate handling of numeric overflow
  • We use the new table property data_cleansing_enabled to enable custom data handling
CREATE EXTERNAL TABLE schema_spectrum_uddh.soccer_league
(
  league_rank smallint,
  prev_rank   smallint,
  club_name   varchar(15),
  league_name varchar(20),
  league_off  decimal(6,2),
  league_def  decimal(6,2),
  league_spi  decimal(6,2),
  league_nspi smallint
)
ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY ',' 
    LINES TERMINATED BY '\n\l'
stored as textfile
LOCATION 's3://uddh-soccer/league/'
table properties ('skip.header.line.count'='1','data_cleansing_enabled'='true');
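With the external table in place, a simple query such as the following can be used to observe how each handling option described in the next sections affects the results:

SELECT league_rank, club_name, league_name, league_nspi
FROM schema_spectrum_uddh.soccer_league
ORDER BY league_rank;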

Invalid character data handling

With the introduction of the new table and column property invalid_char_handling, you can now choose how you deal with invalid characters in your data. The supported values are as follows:

  • DISABLED – Feature is disabled (no handling).
  • SET_TO_NULL – Replaces the value with null.
  • DROP_ROW – Drops the whole row.
  • FAIL – Fails the query when an invalid UTF-8 value is detected.
  • REPLACE – Replaces the invalid character with a replacement. With this option, you can use the newly introduced table property replacement_char.

You can apply the property to the whole table or to individual columns. Additionally, you can define the property when you create the table or later by altering the table.

When you disable user-defined handling, Redshift Spectrum by default sets the value to null (similar to SET_TO_NULL):

alter table schema_spectrum_uddh.soccer_league
set table properties ('invalid_char_handling'='DISABLED');

When you change the setting of the handling to DROP_ROW, Redshift Spectrum simply drops the row that has an invalid character:

alter table schema_spectrum_uddh.soccer_league
set table properties ('invalid_char_handling'='DROP_ROW');

When you change the setting of the handling to FAIL, Redshift Spectrum fails and returns an error:

alter table schema_spectrum_uddh.soccer_league
set table properties ('invalid_char_handling'='FAIL');

When you change the setting of the handling to REPLACE and choose a replacement character, Redshift Spectrum replaces the invalid character with the chosen replacement character:

alter table schema_spectrum_uddh.soccer_league
set table properties ('invalid_char_handling'='REPLACE','replacement_char'='?');

Surplus character data handling

As mentioned earlier, we defined the columns club_name and league_name shorter than the actual contents of the corresponding fields in the data file.

With the introduction of the new table property surplus_char_handling, you can choose from multiple options:

  • DISABLED – Feature is disabled (no handling)
  • TRUNCATE – Truncates the value to the column size
  • SET_TO_NULL – Replaces the value with null
  • DROP_ROW – Drops the whole row
  • FAIL – Fails the query when a value is too large for the column

When you disable the user-defined handling, Redshift Spectrum defaults to truncating the surplus characters (similar to TRUNCATE):

alter table schema_spectrum_uddh.soccer_league
set table properties ('surplus_char_handling' = 'DISABLED');

When you change the setting of the handling to SET_TO_NULL, Redshift Spectrum sets the column value to null for any field that is longer than the defined length:

alter table schema_spectrum_uddh.soccer_league
set table properties ('surplus_char_handling' = 'SET_TO_NULL');


When you change the setting of the handling to DROP_ROW, Redshift Spectrum drops the row of any field that is longer than the defined length:

alter table schema_spectrum_uddh.soccer_league
set table properties ('surplus_char_handling' = 'DROP_ROW');

When you change the setting of the handling to FAIL, Redshift Spectrum fails and returns an error:

alter table schema_spectrum_uddh.soccer_league
set table properties ('surplus_char_handling' = 'FAIL');

We need to disable the user-defined data handling for this data error before demonstrating the next type of error:

alter table schema_spectrum_uddh.soccer_league
set table properties ('surplus_char_handling' = 'DISABLED');

Numeric overflow data handling

For this demonstration, we defined league_nspi intentionally SMALLINT (with a range to hold from -32,768 to +32,767) to show the available options for data handling.

With the introduction of the new table property numeric_overflow_handling, you can choose from multiple options:

  • DISABLED – Feature is disabled (no handling)
  • SET_TO_NULL – Replaces the value with null
  • DROP_ROW – Replaces each value in the row with NULL
  • FAIL – Fails the query when a value is too large for the column

When we look at the source data, we can observe that the top five countries have more points than the SMALLINT field can handle.

When you disable the user-defined handling, Redshift Spectrum defaults to the maximum number the numeric data type can handle; in our case, SMALLINT can hold values up to 32,767:

alter table schema_spectrum_uddh.soccer_league
set table properties ('numeric_overflow_handling' = 'DISABLED');

When you choose SET_TO_NULL, Redshift Spectrum sets to null the column with numeric overflow:

alter table schema_spectrum_uddh.soccer_league
set table properties ('numeric_overflow_handling' = 'SET_TO_NULL');

When you choose DROP_ROW, Redshift Spectrum drops the row containing the column with numeric overflow:

alter table schema_spectrum_uddh.soccer_league
set table properties ('numeric_overflow_handling' = 'DROP_ROW');

When you choose FAIL, Redshift Spectrum fails and returns an error:

alter table schema_spectrum_uddh.soccer_league
set table properties ('numeric_overflow_handling' = 'FAIL');

We need to disable the user-defined data handling for this data error before demonstrating the next type of error:

alter table schema_spectrum_uddh.soccer_league
set table properties ('numeric_overflow_handling' = 'DISABLED');

Stop queries at MAXERROR threshold

You can also choose to stop the query if it reaches a certain threshold in errors by using the newly introduced parameter spectrum_query_maxerror:

Set spectrum_query_maxerror to 7;

The following screenshot shows that the query ran successfully.

However, if you decrease this threshold to a lower number, the query fails because it reached the preset threshold:

Set spectrum_query_maxerror to 6;

Error logging

With the introduction of the new user-defined data handling feature, we also introduced the new view svl_spectrum_scan_error, which gives you a useful sample of the error logs. The view contains the query, file, row, column, error code, the handling action that was applied, and the original and modified (resulting) values. See the following code:

SELECT *
FROM svl_spectrum_scan_error
where location = 's3://uddh-soccer/league/spi_global_rankings.csv'

Clean up

To avoid incurring future charges, complete the following steps:

  1. Delete the Amazon Redshift cluster created for this demonstration. If you were using an existing cluster, drop the created external table and external schema.
  2. Delete the S3 bucket.
  3. Delete the AWS Glue Data Catalog database.

Conclusion

In this post, we demonstrated Redshift Spectrum’s newly added feature of user-defined data error handling and showed how this feature provides the flexibility to take a user-defined approach to deal with data exceptions in processing external files. We also demonstrated how the logging enhancements provide transparency on the errors encountered in external data processing without needing to write additional custom code.

We look forward to hearing from you about your experience. If you have questions or suggestions, please leave a comment.


About the Authors

Ahmed Shehata is a Data Warehouse Specialist Solutions Architect with Amazon Web Services, based out of Toronto.

Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift. He is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.