Tag Archives: AWS Database Migration Service

Revenue NSW modernises analytics with AWS, enabling unified and scalable data management, processing, and access

Post Syndicated from Saeed Barghi original https://aws.amazon.com/blogs/big-data/revenue-nsw-modernises-analytics-with-aws-enabling-unified-and-scalable-data-management-processing-and-access/

Revenue NSW, in Australia, is New South Wales (NSW) state’s principal revenue management agency and aspires to be the world’s most innovative and customer-centric revenue agency. Revenue NSW exists to administer grants, resolve fines, and collect revenue to fund essential state services for the over 8 million people of NSW in a fair, efficient, and timely manner.

Analytics at Revenue NSW plays a key role in enabling the organization’s goals and purpose by delivering reliable, secure, and authoritative insights. These insights are key to:

  • Understanding customer attributes to enable empathetic and informed actions
  • Supporting policy development
  • Assisting in the sequencing of millions of decisions
  • Maintaining compliance and education
  • Fostering transparency by providing open data and insights directly to the public

The challenge

Revenue NSW Analytics consumes data from a multitude of operational databases and real-time interfaces and through internally generated reports and files received from external data partners such as other government departments and agencies. The varying technologies, formats, and complexities of these data sources created friction and inefficiencies in data transformation, consolidation, and analysis in an environment that is often time-critical. In addition, these analytics systems were previously hosted on dedicated hardware on-premises that was nearing end-of-life and wasn’t easy to scale efficiently. To address these challenges, Revenue NSW Analytics used their partnership with AWS to build a strategic, unified, scalable, frictionless and modern data environment to help them standardize data transformation and consolidation pipelines from the hundreds of data sources. Additionally, the modern data environment must provide a single source of truth and enable secure and seamless access to the data through a unified SQL interface regardless of the data’s original format or technology.

After understanding other offerings, Revenue NSW Analytics decided on a proof of concept (PoC) using Amazon Web Services (AWS) cloud-based services, including Amazon Redshift. The key goals of the PoC were to assess the completeness of the solution, its performance, and the potential change in total cost of ownership compared to their on-premises setup.

Amazon Redshift, with its integration options, columnar storage, and massively parallel processing (MPP) architecture, offered the desired end-state solution. Tests demonstrated a typical speed increase between 5- and 50-fold in query execution, with many results 100 times faster than the existing on-premises solution. Amazon Redshift also performed significantly better compared with other cloud-based solutions, offering up to 6 times better performance. The success of the initial PoC led Revenue NSW Analytics to further collaborate with AWS, working towards developing a prototype that incorporated Amazon Redshift alongside various data ingestion patterns.

The solution

To simplify data ingestion from the operational databases—which run on different database engines including Oracle, PostgreSQL, and Microsoft SQL—Revenue NSW Analytics used AWS Database Migration Service (AWS DMS) to perform a bulk initial load, followed by capturing ongoing changes from these databases into Amazon Redshift in near real time.

For data from Salesforce’s real-time API, Revenue NSW Analytics used Amazon AppFlow to automate the continuous pulling and ingesting of data into Amazon Redshift.

The hundreds of structured and semi-structured data files were handled using AWS Glue. These files are regularly uploaded to Amazon Simple Storage Service (Amazon S3), triggering the relevant AWS Glue extract, transform, and load (ETL) jobs in an event-based architecture to transfer the data into Amazon Redshift.

To facilitate repeatability and enable iteration, Revenue NSW Analytics used infrastructure-as-code (IaC) and continuous integration and delivery (CI/CD) pipelines to deploy the different components of the solution.

The following is a high-level architecture demonstrating how these different components and services fit together.

Along with standardization and unified access, the success criteria of the new data environment included the ease of transition, consolidation of processes to the new standardised pipelines, scalability, language uniformity, and availability. The combination of supporting standard SQL, AWS DMS, and Amazon AppFlow low-code capabilities, and supporting Python in AWS Glue, a popular programming language, played a crucial role in facilitating the successful transition and adoption of the cloud-based data environment.

Other success factors of this environment include the ability to work within current budgets, and the extendibility and modularity of the solution. As shown in the preceding high-level architecture, the solution runs on multiple building blocks that are decoupled, modular, and either serverless—like AWS Glue—or managed services that support seamless scalable configurations that don’t require rebuilds. This allowed Revenue NSW Analytics to start small with each use case, expand and grow as required, and pay only for what they need.

Moreover, with the new cloud-based data environment, Revenue NSW Analytics can access to up-to-date data in near real time, which is essential to fulfilling critical use cases such as information requests and assisting with compliance case identification. The automated data ingestion pipelines removed much of the boilerplate and heavy lifting, allowing Revenue NSW teams to work more efficiently and focus on the differentiators of their business, and in some cases, shorten workflow times from months to weeks or days.

Another significant factor contributing to the project’s success is the people at the heart of Revenue NSW Analytics. The teams allocated to own and deliver this platform are cross-functional, with adjoining responsibilities and skills, and were prepared through multiple in-person and online training sessions. The teams were empowered to trial individual services to deliver new use cases and iterate on the solution to learn from successes and innovate progressively. This approach, together with support Revenue NSW received from AWS specialist solution architects, helped to minimize the risk of knowledge gaps that often arise when separate teams are responsible for building and operating a system.

The hard work of the Analytics team, the investment of Revenue NSW Analytics leadership in its people, and the continuous support from AWS can truly be seen throughout the delivery of the data environment, resulting in the achievement of the intended outcomes.

Conclusion and call to action

Since going live with their cloud-based data environment on AWS, Revenue NSW has onboarded dozens of analysts who can get more done in less time. This is a result of establishing a single source of truth from different data sources in Amazon Redshift, so that analysts and data consumers don’t need to shop around to find the data that they need to complete their tasks. This new data environment also provides Revenue NSW with the ability to create improved conditions for:

  • Increasing agility by exposing reusable, trusted data services for people and AI
  • Empowering operational systems with services best provided by analytical approaches
  • Decommissioning heritage, costly infrastructure and data practices.

Successful delivery of the cloud-based data environment on AWS has led to further collaboration between AWS and Revenue NSW. This includes exploring the adoption of AI and machine learning (AI/ML) and generative AI to further improve the delivery of services for the people of NSW.

To learn more about customer success stories like this or how to get started with building a data environment on AWS, contact your AWS account team. You can read about similar customers by browsing Customer Success Stories on our website.


About the authors

Saeed BarghiSaeed Barghi is a Sr. Specialist Solutions Architect at Amazon Web Services (AWS) specializing in architecting enterprise data platforms and AI solutions. Based in Melbourne, Australia, Saeed works with public sector customers in Australia and New Zealand and helps his customers build fit-for-purpose and future-proof data platforms and AI solutions.

Miroslaw (Mick) Mioduszewski is the Director of Analytics at Revenue NSW Department of Customer service in NSW. He held multiple C-level roles in private and public companies as well as government, e.g. COO and CIO, as well as serving as company director. Mick holds computer science and business degrees, is a fellow of the Australian Institute of Company Directors and an industry fellow at the University of technology, Sydney.

Moha Alsouli is a Public Sector Solutions Architect at Amazon Web Services (AWS) in Sydney. He is dedicated to supporting state and local government customers deliver citizen services, through solution design, reviews, optimisation, and architecture guidance. Moha is also specialising in generative artificial intelligence (AI) on AWS.

AWS Weekly Roundup: Cloud Club Captain Applications, Formula 1®, Amazon Nova Prompt Engineering, and more (Feb 24, 2025)

Post Syndicated from Elizabeth Fuentes original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-cloud-club-captain-applications-formula-1-amazon-nova-prompt-engineering-and-more-feb-24-2025/

AWS Developer Day 2025, held on February 20th, showcased how to integrate responsible generative AI into development workflows. The event featured keynotes from AWS leaders including Srini Iragavarapu, Director Generative AI Applications and Developer Experiences, Jeff Barr, Vice President of AWS Evangelism, David Nalley, Director Open Source Marketing of AWS, along with AWS Heroes and technical community members. Watch the full event recording on Developer Day 2025.

Cloud Club

Applications are now open through March 6th for the 2025 AWS Cloud Clubs Captains program. AWS Cloud Clubs are student-led groups for post-secondary and independent students, 18 years old and over. Find a club near you on our Meetup page.

Last week’s launches
Here are some launches that got my attention:

Amplify Hosting announces support for IAM roles for server-side rendered (SSR) applications  AWS Amplify Hosting now supports AWS Identity and Access Management (IAM) roles for SSR applications, enabling secure access to AWS services without managing credentials manually. Learn more in the IAM Compute Roles for Server-Side Rendering with AWS Amplify Hosting blog.

AWS WAF enhances Data Protection and logging experience  AWS WAF expands its Data Protection capabilities allowing sensitive data in logs to be replaced with cryptographic hashes (e.g. ‘ade099751d2ea9f3393f0f’) or a predefined static string (‘REDACTED’) before logs are sent to WAF Sample Logs, Amazon Security Lake, Amazon CloudWatch, or other logging destinations.

Announcing AWS DMS Serverless comprehensive premigration assessments AWS Database Migration Service Serverless (AWS DMS Serverless) now supports premigration assessments for replications to identify potential issues before database migrations begin. The tool analyzes source and target databases, providing recommendations for optimal DMS settings and best practices.

Amazon ECS increases the CPU limit for ECS tasks to 192 vCPUs – Amazon Elastic Container Service (Amazon ECS) now supports CPU limits of up to 192 vCPU for ECS tasks deployed on Amazon Elastic Compute Cloud (Amazon EC2) instances, an increase from the previous 10 vCPU limit. This enhancement allows customers to more effectively manage resource allocation on larger Amazon EC2 instances.

AWS Network Firewall introduces automated domain lists and insightsAWS Network Firewall now provides automated domain lists and insights by analyzing 30 days of HTTP/S traffic. This helps create and maintain allow-list policies more efficiently, at no extra cost.

AWS announces Backup Payment Methods for invoices AWS now enables you to set up backup payment methods that automatically activate if primary payment fails. This helps prevent service interruptions and reduces manual intervention for invoice payments.

Get updated with all the announcements of AWS announcements on the What’s New with AWS? page.

Other AWS news
Here are additional noteworthy items:

AWS Partner Network: Essential training resources for ISV partners To help scale solutions effectively, AWS provides essential training resources for Software Vendors (ISVs) partners in four key areas: AWS Marketplace fundamentals, Foundational Technical Review (FTR), APN Customer Engagement (ACE) program and co-selling, and Partner funding opportunities.

How Formula 1® uses generative AI to accelerate race-day issue resolution Formula 1® (F1) uses Amazon Bedrock to speed up race-day issue resolution, reducing troubleshooting time from weeks to minutes through a chatbot that analyzes root causes and suggests fixes.

How Formula 1® uses generative AI to accelerate race-day issue resolution

Reducing hallucinations in LLM agents with a verified semantic cache using Amazon Bedrock Knowledge Bases This blog introduces a solution using Amazon Bedrock Knowledge Bases and Amazon Bedrock Agents to reduce Large language models (LLMs) hallucinations by implementing a verified semantic cache that checks queries against curated answers before generating new responses, improving accuracy and response times.

Reducing hallucinations in LLM agents with a verified semantic cache using Amazon Bedrock Knowledge Bases

Orchestrate an intelligent document processing workflow using tools in Amazon Bedrock This blog demonstrates an intelligent document processing workflow using Amazon Bedrock tools that combines Anthropic’s Claude 3 Haiku for orchestration and Anthropic’s Claude 3.5 Sonnet (v2) for analysis to handle structured, semi-structured, and unstructured healthcare documents efficiently.

From community.aws
Here are my personal favorites posts from community.aws:

Tracing Amazon Bedrock Agents Learn how to track and analyze Amazon Bedrock Agents workflows using AWS X-Ray for better observability, by Randy D.

Testing Amazon ECS Network Resilience with AWS FISThis article demonstrates how to test network resilience in Amazon ECS using AWS FIS with guidance from Amazon Q Developer, by Sunil Govindankutty

Stop Using Default Arguments in AWS Lambda Functions Discover why your AWS Lambda costs might be spiralling out of control due to a common Python programming practice, by Stuart Clark.

Amazon Nova Prompt Engineering on AWS: A Field Guide by Brooke A field guide for using Amazon Nova models, covering prompt engineering patterns and best practices on AWS, by Brooke Jamieson.

Amazon Nova Prompt Engineering on AWS: A Field Guide by Brooke

Creating Deployment Configurations for EKS with Amazon Q Amazon Q Developer helps create EKS deployments by providing templates and best practices for Kubernetes configs, by Ricardo Tasso.

Processing WhatsApp Multimedia with Amazon Bedrock Agents: Images, Video, and DocumentsI invite you to read my latest blog, which explains how to create a WhatsApp AI assistant using Amazon Bedrock and Amazon Nova models to process multimedia content such as images, videos, documents, and audio.

Processing WhatsApp Multimedia with Amazon Bedrock Agents: Images, Video, and Documents

Upcoming AWS events
Check your calendars and sign up for these upcoming AWS events:

AWS GenAI Lofts – GenAI Lofts offer collaborative spaces and immersive experiences for startups and developers. You can join in-person GenAI Loft San Francisco events such as Hands-on with Agentic Graph RAG Workshop (February 25), Unstructured Data Meetup SF (February 26 – 27) and AI Tinkerers – San Francisco – February 2025 Demos + Science Fair (February 27 – 28). GenAI Loft Berlin has events and workshops on February 24 to March 7 that you can’t miss!

AWS Community Days – Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world: Milan, Italy (April 2), Bay Area – Security Edition (April 4), Timișoara, Romania (April 10), and Prague, Czeh Republic (April 29).

AWS Innovate: Generative AI + Data – Join a free online conference focusing on generative AI and data innovations. Available in multiple geographic regions: APJC and EMEA (March 6), North America (March 13), Greater China Region (March 14), and Latin America (April 8).

AWS Summits – Join free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Register in your nearest city: Paris (April 9), Amsterdam (April 16), London (April 30), and Poland (May 5).

AWS re:Inforce – AWS re:Inforce (June 16–18) in Philadelphia, PA our annual learning event devoted to all things AWS cloud security. Registration opens in March, and be ready to join more than 5,000 security builders and leaders.

Create your AWS Builder ID and reserve your alias. Builder ID is a universal login credential that gives you access–beyond the AWS Management Console–to AWS tools and resources, including over 600 free training courses, community features, and developer tools such as Amazon Q Developer.

You can browse all upcoming in-person and virtual events.

That’s all for this week. Stay tuned for next week’s Weekly Roundup!

Eli

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

AWS Database Migration Service now automates time-intensive schema conversion tasks using generative AI

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/aws-data-migration-service-improves-database-schema-conversion-with-generative-ai/

Starting today, AWS Database Migration Service Schema Conversion (AWS DMS SC) introduces a new capability to improve the database schema conversion experience by automatically converting up to 90 percent of schema objects from commercial databases to PostgreSQL migrations.

AWS DMS is a cloud service that makes it possible to migrate relational databases, data warehouses, NoSQL databases, and other types of data stores. You can use AWS DMS to migrate your data into the Amazon Web Services (AWS) Cloud or between combinations of cloud and on-premises setups.

Today, more than 1 million databases have been migrated using AWS Database Migration Service. AWS DMS helps you migrate your data from one database system to another. And, when migrating between different database engines, AWS DMS SC helps to convert the source database schema and procedures to the target database system.

However, although AWS DMS SC automates many steps in these migrations, certain complex database code elements still require manual intervention, which can extend migration timelines and add cost. This is particularly the case with proprietary system functions or procedures, and data type conversions, which don’t always have direct equivalents in PostgreSQL.

The new generative AI capability in AWS DMS SC is designed to address these challenges by automating some of the most time-intensive schema conversion tasks. Using large language models (LLMs) hosted on Amazon Bedrock, the new capability expands the existing conversion capabilities. It converts code snippets in the source database that were otherwise not supported by traditional rule-based techniques, including complex procedures and functions.

Generative AI–assisted code conversion helps to reduce migration costs and accelerate project timelines. Because AWS DMS SC automates more of the schema conversion process, you can focus on higher value tasks such as refining and optimizing your applications post-migration rather than manually resolving conversion gaps. Our beta customers have already experienced success with these AI-powered features in AWS DMS SC, achieving cost savings and faster migrations.

Let’s find out how it works
To demonstrate the ease of using this new generative AI capability, I’ll walk through the schema conversion process in AWS DMS SC. AWS DMS SC simplifies database migration by automatically converting my source database’s structure, including tables, views, stored procedures, functions, and more, to a format compatible with my target database. Any objects that can’t be automatically converted are flagged for manual attention.

I start with a self-managed commercial database running on Amazon Elastic Compute Cloud (Amazon EC2). I use the AWS Management Console to define the instance profile and the data providers. This is where I configure the replication instance network details, the database engine and its endpoint, the secret where the database password is securely stored, and more. I also create a migration project. These steps aren’t new, and you can refer to Accelerate your database migration journey using AWS DMS Schema Conversion in the AWS Database Blog to learn about the details.

After my project is created, I select it, and on the Schema conversion tab, I choose Launch schema conversion. It takes a couple of minutes to launch the conversion tool the first time.

DMS : Launch migration project

AWS DMS SC with generative AI is an opt-in capability. I first activate the option. On the Settings tab, I turn on Enable Generative AI feature for conversion.DMS : enable GenAI feature

Before diving into the details of the conversion, I would like to get an overall assessment of the migration complexity. I select the schema I want to migrate. Then I select Assess in the menu.

DMS : Assess schema

After a few minutes, a high-level Summary is available. The Action items tab has more details. I choose Export results and choose PDF to receive a report to share with my colleagues. The report is generated and available from an S3 bucket.

The summary screen shows the percentage of Database storage objects and Database code objects that can be converted by the rule-based method. That’s 100% and 57% in this example. Let’s see how the generative AI-based conversion will change that.

DMS : Assess schema summary

The PDF contains an executive summary, various statistics about the number of objects to be migrated, the feasibility of conversion with generative AI, and the complexity of the migration.

DMS : Assess schema PDF page 1 DMS : Assess schema PDF page 2

By reading the report, I learn there is no blocker detected to migrate the stored procedures. I select the stored procedure I want to migrate (PRC_AIML_DEMO6). Then, I select the Actions menu on the source database (the left one) and choose Convert.

After a minute or two, I can read the original procedure code in the left pane and the proposed migrated version on the right panel.

The summary screen has been updated. Now, it shows that 100 percent of the code can be converted automatically.

DNS : view proposed modifications

I can edit the code and make changes as required. When I’m comfortable with the proposed new version, I select the Actions menu on the target database side (the right one) and choose Apply changes.

DMS : Apply changes

With this new generative AI capability, AWS DMS SC can automatically convert up to 90 percent of schema objects from commercial databases to PostgreSQL.

To support your compliance requirements, this capability is initially turned off, and you can enable it as needed. If you choose to use the generative AI features in AWS DMS SC, it will flexibly decide between traditional rule-based methods and generative AI based on the complexity of the objects being converted. Customers with strict policies against generative AI can continue to rely solely on the rule-based approach, with any unconverted or partially converted objects requiring manual adjustments.

Availability and pricing
This new capability is available today in the following AWS Regions: US East (Ohio, N. Virginia), US West (Oregon), and Europe (Frankfurt).

AWS DMS Schema Conversion with generative AI provides you with a faster migration pathway and helps you accelerate your transition to AWS.

To get started, visit the AWS DMS Schema Conversion documentation and learn how this generative AI capability can simplify your next database migration.

— seb

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

Post Syndicated from Shaheer Mansoor original https://aws.amazon.com/blogs/big-data/modernize-your-legacy-databases-with-aws-data-lakes-part-2-build-a-data-lake-using-aws-dms-data-on-apache-iceberg/

This is part two of a three-part series where we show how to build a data lake on AWS using a modern data architecture. This post shows how to load data from a legacy database (SQL Server) into a transactional data lake (Apache Iceberg) using AWS Glue. We show how to build data pipelines using AWS Glue jobs, optimize them for both cost and performance, and implement schema evolution to automate manual tasks. To review the first part of the series, where we load SQL Server data into Amazon Simple Storage Service (Amazon S3) using AWS Database Migration Service (AWS DMS), see Modernize your legacy databases with AWS data lakes, Part 1: Migrate SQL Server using AWS DMS.

Solution overview

In this post, we go over the process of building a data lake, providing the rationale behind the different decisions, and share best practices when building such a solution.

The following diagram illustrates the different layers of the data lake.

Overall Architecture

To load data into the data lake, AWS Step Functions can define a workflow, Amazon Simple Queue Service (Amazon SQS) can track the order of incoming files, and AWS Glue jobs and the Data Catalog can be used create the data lake silver layer. AWS DMS produces files and writes these files to the bronze bucket (as we explained in Part 1).

We can turn on Amazon S3 notifications and push the new arriving file names to an SQS first-in-first-out (FIFO) queue. A Step Functions state machine can consume messages from this queue to process the files in the order they arrive.

For processing the files, we need to create two types of AWS Glue jobs:

  • Full load – This job loads the entire table data dump into an Iceberg table. Data types from the source are mapped to an Iceberg data type. After the data is loaded, the job updates the Data Catalog with the table schemas.
  • CDC – This job loads the change data capture (CDC) files into the respective Iceberg tables. The AWS Glue job implements the schema evolution feature of Iceberg to handle schema changes such as addition or deletion of columns.

As in Part 1, the AWS DMS jobs will place the full load and CDC data from the source database (SQL Server) in the raw S3 bucket. Now we process this data using AWS Glue and save it to the silver bucket in Iceberg format. AWS Glue has a plugin for Iceberg; for details, see Using the Iceberg framework in AWS Glue.

Along with moving data from the bronze to the silver bucket, we also create and update the Data Catalog for further processing the data for the gold bucket.

The following diagram illustrates how the full load and CDC jobs are defined inside the Step Functions workflow.

Step Functions for loading data into the lake

In this post, we discuss the AWS Glue jobs for defining the workflow. We recommend using AWS Step Functions Workflow Studio, and setting up Amazon S3 event notifications and an SNS FIFO queue to receive the filename as messages.

Prerequisites

To follow the solution, you need the following prerequisites set up as well as certain access rights and AWS Identity and Access Management (IAM) privileges:

  • An IAM role to run Glue jobs
  • IAM privileges to create AWS DMS resources (this role was created in Part 1 of this series; you can use the same role here)
  • The AWS DMS job from Part 1 working and producing files for the source database on Amazon S3.

Create an AWS Glue connection for the source database

We need to create a connection between AWS Glue and the source SQL Server database so the AWS Glue job can query the source for the latest schema while loading the data files. To create the connection, follow these steps:

  1. On the AWS Glue console, choose Connections in the navigation pane.
  2. Choose Create custom connector.
  3. Give the connection a name and choose JDBC as the connection type.
  4. In the JDBC URL section, enter the following string and replace the name of your source database endpoint and database that was set up in Part 1: jdbc:sqlserver://{Your RDS End Point Name}:1433/{Your Database Name}.
  5. Select Require SSL connection, then choose Create connector.

Clue Connections

Create and configure the full load AWS Glue job

Complete the following steps to create the full load job:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Choose Script editor and select Spark.
  3. Choose Start fresh and select Create script.
  4. Enter a name for the full load job and choose the IAM role (mentioned in the prerequisites) for running the job.
  5. Finish creating the job.
  6. On the Job details tab, expand Advanced properties.
  7. In the Connections section, add the connection you created.
  8. Under Job parameters, pass the following arguments to the job:
    1. target_s3_bucket – The silver S3 bucket name.
    2. source_s3_bucket – The raw S3 bucket name.
    3. secret_id – The ID of the AWS Secrets Manager secret for the source database credentials.
    4. dbname – The source database name.
    5. datalake-formats – This sets the data format to iceberg.

Glue Job Parameters

The full load AWS Glue job starts after the AWS DMS task reaches 100%. The job loops over the files located in the raw S3 bucket and processes them one at time. For each file, the job infers the table name from the file name and gets the source table schema, including column names and primary keys.

If the table has one or more primary keys, the job creates an equivalent Iceberg table. If the job has no primary key, the file is not processed. In our use case, all the tables have primary keys, so we enforce this check. Depending on your data, you might need to handle this scenario differently.

You can use the following code to process the full load files. To start the job, choose Run.

import sys, boto3, json
import boto3
import json
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import SparkSession

#Get the arguments passed to the script
args = getResolvedOptions(sys.argv, ['JOB_NAME',
                           'target_s3_bucket',
                           'secret_id',
                           'source_s3_bucket'])
dbname = "AdventureWorks"
schema = "HumanResources"

#Initialize parameters
target_s3_bucket = args['target_s3_bucket']
source_s3_bucket = args['source_s3_bucket']
secret_id = args['secret_id']
unprocessed_tables = []
drop_column_list = ['db', 'table_name', 'schema_name', 'Op', 'last_update_time']  # DMS added columns

#Helper Function: Get Credentials from Secrets Manager
def get_db_credentials(secret_id):
    secretsmanager = boto3.client('secretsmanager')
    response = secretsmanager.get_secret_value(SecretId=secret_id)
    secrets = json.loads(response['SecretString'])
    return secrets['host'], int(secrets['port']), secrets['username'], secrets['password']

#Helper Function: Load Iceberg table with Primary key(s)
def load_table(full_load_data_df, dbname, table_name):

    try:
        full_load_data_df = full_load_data_df.drop(*drop_column_list)
        full_load_data_df.createOrReplaceTempView('full_data')

        query = """
        CREATE TABLE IF NOT EXISTS glue_catalog.{0}.{1}
        USING iceberg
        LOCATION "s3://{2}/{0}/{1}"
        AS SELECT * FROM full_data
        """.format(dbname, table_name, target_s3_bucket)
        spark.sql(query)
        
        #Update Table property to accept Schema Changes
        spark.sql("""ALTER TABLE glue_catalog.{0}.{1} SET TBLPROPERTIES (
                      'write.spark.accept-any-schema'='true'
                    )""".format(dbname, table_name))
        
    except Exception as ex:
        print(ex)
        failed_table = {"table_name": table_name, "Reason": ex}
        unprocessed_tables.append(failed_table)
        
def get_table_key(host, port, username, password, dbname):
    
    jdbc_url = "jdbc:sqlserver://{0}:{1};databaseName={2}".format(host, port, dbname)
    
    connectionProperties = {
      "user" : username,
      "password" : password
    }
    
    spark.read.jdbc(url=jdbc_url, table='INFORMATION_SCHEMA.TABLE_CONSTRAINTS', properties=connectionProperties).createOrReplaceTempView("TABLE_CONSTRAINTS")
    spark.read.jdbc(url=jdbc_url, table='INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE', properties=connectionProperties).createOrReplaceTempView("CONSTRAINT_COLUMN_USAGE")
    df_table_pkeys = spark.sql("select c.TABLE_NAME, C.COLUMN_NAME as primary_key FROM TABLE_CONSTRAINTS T JOIN CONSTRAINT_COLUMN_USAGE C ON C.CONSTRAINT_NAME=T.CONSTRAINT_NAME WHERE T.CONSTRAINT_TYPE='PRIMARY KEY'")
    return df_table_pkeys


#Setup Spark configuration for reading and writing Iceberg tables
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://{0}".format(dbname))
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)


#Initialize MSSQL credentials
host, port, username, password = get_db_credentials(secret_id)

#Initialize primary keys for all tables
df_table_pkeys = get_table_key(host, port, username, password, dbname)

#Read Full load csv files from s3
s3 = boto3.client('s3')
full_load_tables = s3.list_objects_v2(Bucket=source_s3_bucket, Prefix="raw/{0}/{1}".format(args['dbname'], args['schema']))

#Loop over files
for item in full_load_tables['Contents']:
    pkey_list = []
    table_name = item["Key"].split("/")[3].lower()
    print("Table name {0}".format(table_name))
    current_table_df = df_table_pkeys.where(df_table_pkeys.TABLE_NAME == table_name)

    # Only Process tables with at least 1 Primary key
    if not current_table_df.isEmpty():
        for i in current_table_df.collect():
            pkey_list.append(i["primary_key"])
    else:
        failed_table = {"table_name": table_name, "Reason": "No primary key"}
        unprocessed_tables.append(failed_table)
        # ToDo Handle these cases

    full_data_path = "s3://{0}/{1}".format(source_s3_bucket, item['Key'])
    full_load_data_df = (spark
                        .read
                        .option("header", True)
                        .option("inferSchema", True)
                        .option("recursiveFileLookup", "true")
                        .csv(full_data_path)
                        )

    primary_key = ",".join(pkey_list)

    if table_name not in unprocessed_tables:
        load_table(full_load_data_df, dbname, table_name)

When the job is complete, it creates the database and tables in the Data Catalog, as shown in the following screenshot.

Data lake silver layer data

Create and configure the CDC AWS Glue job

The CDC AWS Glue job is created similar to the full load job. As with the full load AWS Glue job, you need to use the source database connection and pass the job parameters with one additional parameter, cdc_file, which contains the location of the CDC file to be processed. Because a CDC file can contain data for multiple tables, the job loops over the tables in a file and loads the table metadata from the source table ( RDS column names).

If the CDC operation is DELETE, the job deletes the records from the Iceberg table. If the CDC operation is INSERT or UPDATE, the job merges the data into the Iceberg table.

You can use the following code to process the CDC files. To start the job, choose Run

import sys
import boto3
import json
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import SparkSession

# Get the arguments passed to the script
args = getResolvedOptions(sys.argv, ['JOB_NAME',
                           'target_s3_bucket',
                           'secret_id',
                           'source_s3_bucket',
                           'cdc_file'])
dbname = "AdventureWorks"
schema = "HumanResources"
target_s3_bucket = args['target_s3_bucket']
source_s3_bucket = args['source_s3_bucket']
secret_id = args['secret_id']
cdc_file = args['cdc_file']
unprocessed_tables = []
drop_column_list = ['db', 'table_name', 'schema_name', 'Op', 'last_update_time']  # DMS added columns
source_s3_cdc_file_key = "raw/AdventureWorks/cdc/" + cdc_file



# Helper Function: Get Credentials from Secrets Manager
def get_db_credentials(secret_id):
    secretsmanager = boto3.client('secretsmanager')
    response = secretsmanager.get_secret_value(SecretId=secret_id)
    secrets = json.loads(response['SecretString'])
    return secrets['host'], int(secrets['port']), secrets['username'], secrets['password']

# Helper Function: Column names from RDS
def get_table_colums(table, host, port, username, password, dbname):

    jdbc_url = "jdbc:sqlserver://{0}:{1};databaseName={2}".format(host, port, dbname)
    
    connectionProperties = {
      "user" : username,
      "password" : password
    }
    
    spark.read.jdbc(url=jdbc_url, table='INFORMATION_SCHEMA.COLUMNS', properties= connectionProperties).createOrReplaceTempView("TABLE_COLUMNS")
    columns = list((row.COLUMN_NAME) for (index, row) in spark.sql("select TABLE_NAME, TABLE_CATALOG, COLUMN_NAME from TABLE_COLUMNS where TABLE_NAME = '{0}' and TABLE_CATALOG = '{1}'".format(table, dbname)).select("COLUMN_NAME").toPandas().iterrows())
    return columns

# Helper Function: Get Colum names and datatypes from RDS
def get_table_colum_datatypes(table, host, port, username, password, dbname):

    jdbc_url = "jdbc:sqlserver://{0}:{1};databaseName={2}".format(host, port, dbname)
    
    connectionProperties = {
      "user" : username,
      "password" : password
    }
    
    spark.read.jdbc(url=jdbc_url, table='INFORMATION_SCHEMA.COLUMNS', properties= connectionProperties).createOrReplaceTempView("TABLE_COLUMNS")
    return spark.sql("select TABLE_NAME, COLUMN_NAME, DATA_TYPE from TABLE_COLUMNS WHERE TABLE_NAME ='{0}'".format(table))

# Helper Function: Setup the primary key condition
def get_iceberg_table_condition(database, tablename):
    
    jdbc_url = "jdbc:sqlserver://{0}:{1};databaseName={2}".format(host, port, database)
    
    connectionProperties = {
      "user" : username,
      "password" : password
    }
    
    spark.read.jdbc(url=jdbc_url, table='INFORMATION_SCHEMA.TABLE_CONSTRAINTS', properties=connectionProperties).createOrReplaceTempView("TABLE_CONSTRAINTS")
    spark.read.jdbc(url=jdbc_url, table='INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE', properties=connectionProperties).createOrReplaceTempView("CONSTRAINT_COLUMN_USAGE")
    
    condition = ''
    
    for key in spark.sql("select C.COLUMN_NAME FROM TABLE_CONSTRAINTS T JOIN CONSTRAINT_COLUMN_USAGE C ON C.CONSTRAINT_NAME=T.CONSTRAINT_NAME WHERE T.CONSTRAINT_TYPE='PRIMARY KEY' AND c.TABLE_NAME = '{0}'".format(table)).collect():
        condition += "target.{0} = source.{0} and".format(key.COLUMN_NAME)
    return condition[:-4]

    
# Read incoming data from Amazon S3
def read_cdc_S3(source_s3_bucket, source_s3_cdc_file_key):
    
    inputDf = (spark
                    .read
                    .option("header", False)
                    .option("inferSchema", True)
                    .option("recursiveFileLookup", "true")
                    .csv("s3://" + source_s3_bucket + "/" + source_s3_cdc_file_key)
                    )
    return inputDf

# Setup Spark configuration for reading and writing Iceberg tables
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://{0}".format(target_s3_bucket))
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)

#Initialize MSSQL credentials
host, port, username, password = get_db_credentials(secret_id)

#Read the cdc file 
cdc_df = read_cdc_S3(source_s3_bucket, source_s3_cdc_file_key)

tables = cdc_df.toPandas()._c1.unique().tolist()

#Loop over tables in the cdc file
for table in tables:
    #Create dataframes for delets and for inserts and updates
    table_df_deletes = cdc_df.where((cdc_df._c1 == table) & (cdc_df._c0 == "D")).drop(cdc_df.columns[0], cdc_df.columns[1], cdc_df.columns[2], cdc_df.columns[3])
    table_df_upserts = cdc_df.where((cdc_df._c1 == table) & ((cdc_df._c0 == "I") | (cdc_df._c0 == "U"))).drop(cdc_df.columns[0], cdc_df.columns[1], cdc_df.columns[2], cdc_df.columns[3])
    
    #Update column names for the dataframes
    columns = get_table_colums(table, host, port, username, password, dbname) 
    selectExpr = [] 

    for column in columns: 
        selectExpr.append(cdc_df.where((cdc_df._c1 == table)).drop(cdc_df.columns[0], cdc_df.columns[1], cdc_df.columns[2], cdc_df.columns[3]).columns[columns.index(column)] + " as " + column)

    table_df_deletes = table_df_deletes.selectExpr(selectExpr) 
    table_df_upserts = table_df_upserts.selectExpr(selectExpr)
    
    #Process Deletes
    if table_df_deletes.count() > 0:
        
        print("Delete Triggered")
        table_df_deletes.createOrReplaceTempView('deleted_rows')
        
        sql_string = """MERGE INTO glue_catalog.{0}.{1} target
                        USING (SELECT * FROM deleted_rows) source
                        ON {2}
                        WHEN MATCHED 
                        THEN DELETE""".format(database, table.lower(), get_iceberg_table_condition(database, table.lower()))
        spark.sql(sql_string)
    
    if table_df_upserts.count() > 0:
        print("Upsert triggered")

        #Upsert Records when there are Schema Changes
        if len(table_df_upserts.columns) != len(columns):

            #Handle column deletes
            if len(table_df_upserts.columns) < len(columns):

                drop_columns = list(set(columns) - set(table_df_upserts.columns))

                for drop_column in drop_columns:
                    sql_string = """
                                    ALTER TABLE glue_catalog.{0}.{1}
                                    DROP COLUMN {2}""".format(dbname.lower(), table.lower(), drop_column)
                    spark.sql(sql_string)

            #Handle column additions
            elif len(table_df_upserts.columns) > len(columns):

                column_datatype_df = get_table_colum_datatypes(table, host, port, username, password, dbname)
                add_columns = list(set(table_df_upserts.columns) - set(columns))

                for add_column in add_columns:

                    #Set Iceberg data type
                    data_type = list((row.DATA_TYPE) for (index, row) in column_datatype_df.filter("COLUMN_NAME='{0}'".format(add_column)).select("DATA_TYPE").toPandas().iterrows())[0]

                    # Convert MSSQL Datatypes to Iceberg supported datatypes
                    if data_type.lower() in ["varchar", "char"]:
                        data_type = "string"

                    if data_type.lower() in ["bigint"]:
                        data_type = "long"

                    if data_type.lower() in ["array"]:
                        data_type = "list"

                    sql_string = """
                                    ALTER TABLE glue_catalog.{0}.{1}
                                    ADD COLUMN {2} {3}""".format(dbname.lower(), table.lower(), add_column, data_type)
                    spark.sql(sql_string)
                    
            #Create statement to update columns
            update_table_column_list = ""
            insert_column_list = ""
            columns = get_table_colums(table, host, port, username, password, dbname)             

            for column in columns:

                update_table_column_list+="""target.{0}=source.{0},""".format(column)
                insert_column_list+="""source.{0},""".format(column)

            table_df_upserts.createOrReplaceTempView('updated_rows')

            sql_string = """MERGE INTO glue_catalog.{0}.{1} target
                            USING (SELECT * FROM updated_rows) source
                            ON {2}
                            WHEN MATCHED 
                            THEN UPDATE SET {3} 
                            WHEN NOT MATCHED THEN INSERT ({4}) VALUES ({5})""".format(dbname.lower(), 
                                                                                      table.lower(), 
                                                                                      get_iceberg_table_condition(dbname.lower(), table.lower()), 
                                                                                      update_table_column_list.rstrip(","), 
                                                                                      ",".join(columns), 
                                                                                      insert_column_list.rstrip(","))

            spark.sql(sql_string)

    
print("CDC job complete")

The Iceberg MERGE INTO syntax can handle cases where a new column is added. For more details on this feature, see the Iceberg MERGE INTO syntax documentation. If the CDC job needs to process many tables in the CDC file, the job can be multi-threaded to process the file in parallel.

 

Configure EventBridge notifications, SQS queue, and Step Functions state machine

You can use EventBridge notifications to send notifications to EventBridge when certain events occur on S3 buckets, such as when new objects are created and deleted. For this post, we’re interested in the events when new CDC files from AWS DMS arrive in the bronze S3 bucket. You can create event notifications for new objects and insert the file names into an SQS queue. A Lambda function within Step Functions would consume from the queue, extract the file name, start a CDC Glue job, and pass the file name as a parameter to the job.

AWS DMS CDC files contain database insert, update, and delete statements. We need to process these in order, so we use an SQS FIFO queue, which preserves the order of messages in which they arrive. You can also configure Amazon SQS to set a time to live (TTL); this parameter defines how long a message stays in the queue before it expires.

Another important parameter to consider when configuring an SQS queue is the message visibility timeout value. While a message is being processed, it disappears from the queue to make sure that the message isn’t consumed by multiple consumers (AWS Glue jobs in our case). If the message is consumed successfully, it should be deleted from the queue before the visibility timeout. However, if the visibility timeout expires and the message isn’t deleted, the message reappears in the queue. In our solution, this timeout must be greater than the time it takes for the CDC job to process a file.

Lastly, we recommend using Step Functions to define a workflow for handling the full load and CDC files. Step Functions has built-in integrations to other AWS services like Amazon SQS, AWS Glue, and Lambda, which makes it a good candidate for this use case.

The Step Functions state machine starts with checking the status of the AWS DMS task. The AWS DMS tasks can be queried to check the status of the full load, and we check the value of the parameter FullLoadProgressPercent. When this value gets to 100%, we can start processing the full load files. After the AWS Glue job processes the full load files, we start polling the SQS queue to check the size of the queue. If the queue size is greater than 0, this means new CDC files have arrived and we can start the AWS Glue CDC job to process these files. The AWS Glue jobs processes the CDC files and deletes the messages from the queue. When the queue size reaches 0, the AWS Glue job exits and we loop in the Step Functions workflow to check the SQS queue size.

Because the Step Functions state machine is supposed to run indefinitely, it’s good to keep in mind that there will be service limits you need to adhere to. Namely, the maximum runtime, which is 1 year, and maximum run history size, i.e., state transitions or events for a state machine which is 25,000. We recommend adding an additional step at the end to check if either of these conditions are being met to stop the current state machine run and start a new one.

The following diagram illustrates how you can use Step Functions state machine history size to monitor and start a new Step Functions state machine run.

Step Functions Workflow

Configure the pipeline

The pipeline needs to be configured to address cost, performance, and resilience goals. You might want a pipeline that can load fresh data into the data lake and make it available quickly, and you might also want to optimize costs by loading large chunks of data into the data lake. At the same time, you should make the pipeline resilient and be able to recover in case of failures. In this section, we cover the different parameters and recommended settings to achieve these goals.

Step Functions is designed to process incoming AWS DMS CDC files by running AWS Glue jobs. AWS Glue jobs can take a couple of minutes to boot up, and when they’re running, it’s efficient to process large chunks of data. You can configure AWS DMS to write CSV files to Amazon S3 by configuring the following AWS DMS task parameters:

  • CdcMaxBatchInterval – Defines the maximum time limit AWS DMS will wait before writing a batch to Amazon S3
  • CdcMinFileSize – Defines the minimum file size AWS DMS will write to Amazon S3

Whichever condition is met first will invoke the write operation. If you want to prioritize data freshness, you should have a short CdcMaxBatchInterval value (10 seconds) and a small CdcMinFileSize value (1–5 MB). This will result in many small CSV files being written to Amazon S3 and will invoke a lot of AWS Glue jobs to process the data, making the extract, transform, and load (ETL) process faster. If you want to optimize costs, you should have a moderate CdcMaxBatchInterval (minutes) and a large CdcMinFileSize value (100–500 MB). In this scenario, we start a few AWS Glue jobs that will process large chunks of data, making the ETL flow more efficient. In a real-world use case, the required values for these parameters might fall somewhere that’s a good compromise between throughput and cost. You can configure these parameters when creating a target endpoint using the AWS DMS console, or by using the create-endpoint command in the AWS Command Line Interface (AWS CLI).

For the full list of parameters, see Using Amazon S3 as a target for AWS Database Migration Service.

Choosing the right AWS Glue worker types for the full load and CDC jobs is also crucial for performance and cost optimization. The AWS Glue (Spark) workers range from G1X to G8X, which have an increasing number of data processing units (DPUs). Full load files are usually much larger in size compared to CDC files, and therefore it’s more cost- and performance-effective to select a larger worker. For CDC files, it would be more cost-effective to select a smaller worker because files sizes are smaller.

You should design the Step Functions state machine in such a way that if anything fails, the pipeline can be redeployed after repair and resume processing from where it left off. One important parameter here is TTL for the messages in the SQS queue. This parameter defines how long a message stays in the queue before expiring. In case of failures, we want this parameter to be long enough for us to deploy a fix. Amazon SQS has a maximum of 14 days for a message’s TTL. We recommend setting this to a large enough value to minimize messages being expired in case of pipeline failures.

Clean up

Complete the following steps to clean up the resources you created in this post:

  1. Delete the AWS Glue jobs:
    1. On the AWS Glue console, choose ETL jobs in the navigation pane.
    2. Select the full load and CDC jobs and on the Actions menu, choose Delete.
    3. Choose Delete to confirm.
  2. Delete the Iceberg tables:
    1. On the AWS Glue console, under Data Catalog in the navigation pane, choose Databases.
    2. Choose the database in which the Iceberg tables reside.
    3. Select the tables to delete, choose Delete, and confirm the deletion.
  3. Delete the S3 bucket:
    1. On the Amazon S3 console, choose Buckets in the navigation pane.
    2. Choose the silver bucket and empty the files in the bucket.
    3. Delete the bucket.

Conclusion

In this post, we showed how to use AWS Glue jobs to load AWS DMS files into a transactional data lake framework such as Iceberg. In our setup, AWS Glue provided highly scalable and simple-to-maintain ETL jobs. Furthermore, we share a proposed solution using Step Functions to create an ETL pipeline workflow, with Amazon S3 notifications and an SQS queue to capture newly arriving files. We shared how to design this system to be resilient towards failures and to automate one of the most time-consuming tasks in maintaining a data lake: schema evolution.

In Part 3, we will share how to process the data lake to create data marts.


About the Authors

Shaheer Mansoor is a Senior Machine Learning Engineer at AWS, where he specializes in developing cutting-edge machine learning platforms. His expertise lies in creating scalable infrastructure to support advanced AI solutions. His focus areas are MLOps, feature stores, data lakes, model hosting, and generative AI.

Anoop Kumar K M is a Data Architect at AWS with focus in the data and analytics area. He helps customers in building scalable data platforms and in their enterprise data strategy. His areas of interest are data platforms, data analytics, security, file systems and operating systems. Anoop loves to travel and enjoys reading books in the crime fiction and financial domains.

Sreenivas Nettem is a Lead Database Consultant at AWS Professional Services. He has experience working with Microsoft technologies with a specialization in SQL Server. He works closely with customers to help migrate and modernize their databases to AWS.

Unlock scalability, cost-efficiency, and faster insights with large-scale data migration to Amazon Redshift

Post Syndicated from Chanpreet Singh original https://aws.amazon.com/blogs/big-data/unlock-scalability-cost-efficiency-and-faster-insights-with-large-scale-data-migration-to-amazon-redshift/

Large-scale data warehouse migration to the cloud is a complex and challenging endeavor that many organizations undertake to modernize their data infrastructure, enhance data management capabilities, and unlock new business opportunities. As data volumes continue to grow exponentially, traditional data warehousing solutions may struggle to keep up with the increasing demands for scalability, performance, and advanced analytics.

Migrating to Amazon Redshift offers organizations the potential for improved price-performance, enhanced data processing, faster query response times, and better integration with technologies such as machine learning (ML) and artificial intelligence (AI). However, you might face significant challenges when planning for a large-scale data warehouse migration. These challenges can range from ensuring data quality and integrity during the migration process to addressing technical complexities related to data transformation, schema mapping, performance, and compatibility issues between the source and target data warehouses. Additionally, organizations must carefully consider factors such as cost implications, security and compliance requirements, change management processes, and the potential disruption to existing business operations during the migration. Effective planning, thorough risk assessment, and a well-designed migration strategy are crucial to mitigating these challenges and implementing a successful transition to the new data warehouse environment on Amazon Redshift.

In this post, we discuss best practices for assessing, planning, and implementing a large-scale data warehouse migration into Amazon Redshift.

Success criteria for large-scale migration

The following diagram illustrates a scalable migration pattern for an extract, load, and transform (ELT) scenario using Amazon Redshift data sharing patterns.

The following diagram illustrates a scalable migration pattern for extract, transform, and load (ETL) scenario.

Migration pattern extract, transform, and load (ETL) scenarios

Success criteria alignment by all stakeholders (producers, consumers, operators, auditors) is key for successful transition to a new Amazon Redshift modern data architecture. The success criteria are the key performance indicators (KPIs) for each component of the data workflow. This includes the ETL processes that capture source data, the functional refinement and creation of data products, the aggregation for business metrics, and the consumption from analytics, business intelligence (BI), and ML.

KPIs make sure you can track and audit optimal implementation, achieve consumer satisfaction and trust, and minimize disruptions during the final transition. They measure workload trends, cost usage, data flow throughput, consumer data rendering, and real-life performance. This makes sure the new data platform can meet current and future business goals.

Migration from a large-scale mission-critical monolithic legacy data warehouse (such as Oracle, Netezza, Teradata, or Greenplum) is typically planned and implemented over 6–16 months, depending on the complexity of the existing implementation. The monolithic data warehouse environments that have been built over the last 30 years contain proprietary business logic and multiple data design patterns, including an operation data store, star or Snowflake schema, dimension and facts, data warehouses and data marts, online transaction processing (OLTP) real-time dashboards, and online analytic processing (OLAP) cubes with multi-dimensional analytics. The data warehouse is highly business critical with minimal allowable downtime. If your data warehouse platform has gone through multiple enhancements over the years, your operational service levels documentation may not be current with the latest operational metrics and desired SLAs for each tenant (such as business unit, data domain, or organization group).

As part of the success criteria for operational service levels, you need to document the expected service levels for the new Amazon Redshift data warehouse environment. This includes the expected response time limits for dashboard queries or analytical queries, elapsed runtime for daily ETL jobs, desired elapsed time for data sharing with consumers, total number of tenants with concurrency of loads and reports, and mission-critical reports for executives or factory operations.

As part of your modern data architecture transition strategy, the migration goal of a new Amazon Redshift based platform is to use the scalability, performance, cost-optimization, and additional lake house capabilities of Amazon Redshift, resulting in improving the existing data consumption experience. Depending on your enterprise’s culture and goals, your migration pattern of a legacy multi-tenant data platform to Amazon Redshift could use one of the following strategies:

A majority of organizations opt for the organic strategy (lift and shift) when migrating their large data platforms to Amazon Redshift. This approach uses AWS migration tools such as the AWS Schema Conversion Tool (AWS SCT) or the managed service version DMS Schema Conversion to rapidly meet goals around data center exit, cloud adoption, reducing legacy licensing costs, and replacing legacy platforms.

By establishing clear success criteria and monitoring KPIs, you can implement a smooth migration to Amazon Redshift that meets performance and operational goals. Thoughtful planning and optimization are crucial, including optimizing your Amazon Redshift configuration and workload management, addressing concurrency needs, implementing scalability, tuning performance for large result sets, minimizing schema locking, and optimizing join strategies. This will enable right-sizing the Redshift data warehouse to meet workload demands cost-effectively. Thorough testing and performance optimization will facilitate a smooth transition with minimal disruption to end-users, fostering exceptional user experiences and satisfaction. A successful migration can be accomplished through proactive planning, continuous monitoring, and performance fine-tuning, thereby aligning with and delivering on business objectives.

Migration involves the following phases, which we delve into in the subsequent sections:

  • Assessment
    • Discovery of workload and integrations
    • Dependency analysis
    • Effort estimation
    • Team sizing
    • Strategic wave planning
  • Functional and performance
    • Code conversion
    • Data validation
  • Measure and benchmark KPIs
    • Platform-level KPIs
    • Tenant-level KPIs
    • Consumer-level KPIs
    • Sample SQL
  • Monitoring Amazon Redshift performance and continual optimization
    • Identify top offending queries
    • Optimization strategies

To achieve a successful Amazon Redshift migration, it’s important to address these infrastructure, security, and deployment considerations simultaneously, thereby implementing a smooth and secure transition.

Assessment

In this section, we discuss the steps you can take in the assessment phase.

Discovery of workload and integrations

Conducting discovery and assessment for migrating a large on-premises data warehouse to Amazon Redshift is a critical step in the migration process. This phase helps identify potential challenges, assess the complexity of the migration, and gather the necessary information to plan and implement the migration effectively. You can use the following steps:

  • Data profiling and assessment – This involves analyzing the schema, data types, table sizes, and dependencies. Special attention should be given to complex data types such as arrays, JSON, or custom data types and custom user-defined functions (UDFs), because they may require specific handling during the migration process. Additionally, it’s essential to assess the volume of data and daily incremental data to be migrated, and estimate the required storage capacity in Amazon Redshift. Furthermore, analyzing the existing workload patterns, queries, and performance characteristics provides valuable insights into the resource requirements needed to optimize the performance of the migrated data warehouse in Amazon Redshift.
  • Code and query assessment – It’s crucial to assess the compatibility of existing SQL code, including queries, stored procedures, and functions. The AWS SCT can help identify any unsupported features, syntax, or functions that need to be rewritten or replaced to achieve a seamless integration with Amazon Redshift. Additionally, it’s essential to evaluate the complexity of the existing processes and determine if they require redesigning or optimization to align with Amazon Redshift best practices.
  • Performance and scalability assessment – This includes identifying performance bottlenecks, concurrency issues, or resource constraints that may be hindering optimal performance. This analysis helps determine the need for performance tuning or workload management techniques that may be required to achieve optimal performance and scalability in the Amazon Redshift environment.
  • Application integrations and mapping – Embarking on a data warehouse migration to a new platform necessitates a comprehensive understanding of the existing technology stack and business processes intertwined with the legacy data warehouse. Consider the following:
    • Meticulously document all ETL processes, BI tools, and scheduling mechanisms employed in conjunction with the current data warehouse. This includes commercial tools, custom scripts, and any APIs or connectors interfacing with source systems.
    • Take note of any custom code, frameworks, or mechanisms utilized in the legacy data warehouse for tasks such as managing slowly changing dimensions (SCDs), generating surrogate keys, implementing business logic, and other specialized functionalities. These components may require redevelopment or adaptation to operate seamlessly on the new platform.
    • Identify all upstream and downstream applications, as well as business processes that rely on the data warehouse. Map out their specific dependencies on database objects, tables, views, and other components. Trace the flow of data from its origins in the source systems, through the data warehouse, and ultimately to its consumption by reporting, analytics, and other downstream processes.
  • Security and access control assessment – This includes reviewing the existing security model, including user roles, permissions, access controls, data retention policies, and any compliance requirements and industry regulations that need to be adhered to.

Dependency analysis

Understanding dependencies between objects is crucial for a successful migration. You can use system catalog views and custom queries on your on-premises data warehouses to create a comprehensive object dependency report. This report shows how tables, views, and stored procedures rely on each other. This also involves analyzing indirect dependencies (for example, a view built on top of another view, which in turn uses a set of tables), and having a complete understanding of data usage patterns.

Effort estimation

The discovery phase serves as your compass for estimating the migration effort. You can translate those insights into a clear roadmap as follows:

  • Object classification and complexity assessment – Based on the discovery findings, categorize objects (tables, views, stored procedures, and so on) based on their complexity. Simple tables with minimal dependencies will require less effort to migrate than intricate views or stored procedures with complex logic.
  • Migration tools – Use the AWS SCT to estimate the base migration effort per object type. The AWS SCT can automate schema conversion, data type mapping, and function conversion, reducing manual effort.
  • Additional considerations – Factor in additional tasks beyond schema conversion. This may include data cleansing, schema optimization for Amazon Redshift performance, unit testing of migrated objects, and migration script development for complex procedures. The discovery phase sheds light on potential schema complexities, allowing you to accurately estimate the effort required for these tasks.

Team sizing

With a clear picture of the effort estimate, you can now size the team for the migration.

Person-months calculation

Divide the total estimated effort by the desired project duration to determine the total person-months required. This provides a high-level understanding of the team size needed.

For example, for a ELT migration project from an on-premises data warehouse to Amazon Redshift to be completed within 6 months, we estimate the team requirements based on the number of schemas or tenants (for example, 30), number of database tables (for example, 5,000), average migration estimate for a schema (for example, 4 weeks based on complexity of stored procedures, tables and views, platform-specific routines, and materialized views), and number of business functions (for example, 2,000 segmented by simple, medium, and complex patterns). We can determine the following are needed:

  • Migration time period (65% migration/35% for validation & transition) = 0.8* 6 months = 5 months or 22 weeks
  • Dedicated teams = Number of tenants / (migration time period) / (average migration period for a tenant) = 30/5/1 = 6 teams
  • Migration team structure:
    • One to three data developers with stored procedure conversion expertise per team, performing over 25 conversions per week
    • One data validation engineer per team, testing over 50 objects per week
    • One to two data visualization experts per team, confirming consumer downstream applications are accurate and performant
  • A common shared DBA team with performance tuning expertise responding to standardization and challenges
  • A platform architecture team (3–5 individuals) focused on platform design, service levels, availability, operational standards, cost, observability, scalability, performance, and design pattern issue resolutions

Team composition expertise

Based on the skillsets required for various migration tasks, we assemble a team with the right expertise. Platform architects define a well-architected platform. Data engineers are crucial for schema conversion and data transformation, and DBAs can handle cluster configuration and workload monitoring. An engagement or project management team makes sure the project runs smoothly, on time, and within budget.

For example, for an ETL migration project from Informatica/Greenplum to a target Redshift lakehouse with an Amazon Simple Storage Service (Amazon S3) data lake to be completed within 12 months, we estimate the team requirements based on the number of schemas and tenants (for example, 50 schemas), number of database tables (for example, 10,000), average migration estimate for a schema (6 weeks based on complexity of database objects), and number of business functions (for example, 5,000 segmented by simple, medium, and complex patterns). We can determine the following are needed:

  • An open data format ingestion architecture processing the source dataset and refining the data in the S3 data lake. This requires a dedicated team of 3–7 members building a serverless data lake for all data sources. Ingestion migration implementation is segmented by tenants and type of ingestion patterns, such as internal database change data capture (CDC); data streaming, clickstream, and Internet of Things (IoT); public dataset capture; partner data transfer; and file ingestion patterns.
  • The migration team composition is tailored to the needs of a project wave. Depending on each migration wave and what is being done in the wave (development, testing, or performance tuning), the right people will be engaged. When the wave is complete, the people from that wave will move to another wave.
  • A loading team builds a producer-consumer architecture in Amazon Redshift to process concurrent near real-time publishing of data. This requires a dedicated team of 3–7 members building and publishing refined datasets in Amazon Redshift.
  • A shared DBA group of 3–5 individuals helping with schema standardization, migration challenges, and performance optimization outside the automated conversion.
  • Data transformation experts to convert database stored functions in the producer or consumer.
  • A migration sprint plan for 10 months with 2 sprint weeks with multiple waves to release tenants to the new architecture.
  • A validation team to confirm a reliable and complete migration.
  • One to two data visualization experts per team, confirming that consumer downstream applications are accurate and performant.
  • A platform architecture team (3–5 individuals) focused on platform design, service levels, availability, operational standards, cost, observability, scalability, performance, and design pattern issue resolutions.

Strategic wave planning

Migration waves can be determined as follows:

  • Dependency-based wave delineation – Objects can be grouped into migration waves based on their dependency relationships. Objects with no or minimal dependencies will be prioritized for earlier waves, whereas those with complex dependencies will be migrated in subsequent waves. This provides a smooth and sequential migration process.
  • Logical schema and business area alignment – You can further revise migration waves by considering logical schema and business areas. This allows you to migrate related data objects together, minimizing disruption to specific business functions.

Functional and performance

In this section, we discuss the steps for refactoring the legacy SQL codebase to leverage Redshift SQL best practices, build validation routines to ensure accuracy and completeness during the transition to Redshift, capturing KPIs to ensure similar or better service levels for consumption tools/downstream applications, and incorporating performance hooks and procedures for scalable and performant Redshift Platform.

Code conversion

We recommend using the AWS SCT as the first step in the code conversion journey. The AWS SCT is a powerful tool that can streamline the database schema and code migrations to Amazon Redshift. With its intuitive interface and automated conversion capabilities, the AWS SCT can significantly reduce the manual effort required during the migration process. Refer to Converting data warehouse schemas to Amazon Redshift using AWS SCT for instructions to convert your database schema, including tables, views, functions, and stored procedures, to Amazon Redshift format. For an Oracle source, you can also use the managed service version DMS Schema Conversion.

When the conversion is complete, the AWS SCT generates a detailed conversion report. This report highlights any potential issues, incompatibilities, or areas requiring manual intervention. Although the AWS SCT automates a significant portion of the conversion process, manual review and modifications are often necessary to address various complexities and optimizations.

Some common cases where manual review and modifications are typically required include:

  • Incompatible data types – The AWS SCT may not always handle custom or non-standard data types, requiring manual intervention to map them to compatible Amazon Redshift data types.
  • Database-specific SQL extensions or proprietary functions – If the source database uses SQL extensions or proprietary functions specific to the database vendor (for example, STRING_AGG() or ARRAY_UPPER functions, or custom UDFs for PostgreSQL), these may need to be manually rewritten or replaced with equivalent Amazon Redshift functions or UDFs. The AWS SCT extension pack is an add-on module that emulates functions present in a source database that are required when converting objects to the target database.
  • Performance optimization – Although the AWS SCT can convert the schema and code, manual optimization is often necessary to take advantage of the features and capabilities of Amazon Redshift. This may include adjusting distribution and sort keys, converting row-by-row operations to set-based operations, optimizing query plans, and other performance tuning techniques specific to Amazon Redshift.
  • Stored procedures and code conversion – The AWS SCT offers comprehensive capabilities to seamlessly migrate stored procedures and other code objects across platforms. Although its automated conversion process efficiently handles the majority of cases, certain intricate scenarios may necessitate manual intervention due to the complexity of the code and utilization of database-specific features or extensions. To achieve optimal compatibility and accuracy, it’s advisable to undertake testing and validation procedures during the migration process.

After you address the issues identified during the manual review process, it’s crucial to thoroughly test the converted stored procedures, as well as other database objects and code, such as views, functions, and SQL extensions, in a non-production Redshift cluster before deploying them in the production environment. This exercise is mostly undertaken by QA teams. This phase also involves conducting holistic performance testing (individual queries, batch loads, consumption reports and dashboards in BI tools, data mining applications, ML algorithms, and other relevant use cases) in addition to functional testing to make sure the converted code meets the required performance expectations. The performance tests should simulate production-like workloads and data volumes to validate the performance under realistic conditions.

Data validation

When migrating data from an on-premises data warehouse to a Redshift cluster on AWS, data validation is a crucial step to confirm the integrity and accuracy of the migrated data. There are several approaches you can consider:

  • Custom scripts – Use scripting languages like Python, SQL, or Bash to develop custom data validation scripts tailored to your specific data validation requirements. These scripts can connect to both the source and target databases, extract data, perform comparisons, and generate reports.
  • Open source tools – Use open source data validation tools like Amazon Deequ or Great Expectations. These tools provide frameworks and utilities for defining data quality rules, validating data, and generating reports.
  • AWS native or commercial tools – Use AWS native tools such as AWS Glue Data Quality or commercial data validation tools like Collibra Data Quality. These tools often provide comprehensive features, user-friendly interfaces, and dedicated support.

The following are different types of validation checks to consider:

  • Structural comparisons – Compare the list of columns and data types of columns between the source and target (Amazon Redshift). Any mismatches should be flagged.
  • Row count validation – Compare the row counts of each core table in the source data warehouse with the corresponding table in the target Redshift cluster. This is the most basic validation step to make sure no data has been lost or duplicated during the migration process.
  • Column-level validation – Validate individual columns by comparing column-level statistics (min, max, count, sum, average) for each column between the source and target databases. This can help identify any discrepancies in data values or data types.

You can also consider the following validation strategies:

  • Data profiling – Perform data profiling on the source and target databases to understand the data characteristics, identify outliers, and detect potential data quality issues. For example, you can use the data profiling capabilities of AWS Glue Data Quality or the Amazon Deequ
  • Reconciliation reports – Produce detailed validation reports that highlight errors, mismatches, and data quality issues. Consider generating reports in various formats (CSV, JSON, HTML) for straightforward consumption and integration with monitoring tools.
  • Automate the validation process – Integrate the validation logic into your data migration or ETL pipelines using scheduling tools or workflow orchestrators like Apache Airflow or AWS Step Functions.

Lastly, keep in mind the following considerations for collaboration and communication:

  • Stakeholder involvement – Involve relevant stakeholders, such as business analysts, data owners, and subject matter experts, throughout the validation process to make sure business requirements and data quality expectations are met.
  • Reporting and sign-off – Establish a clear reporting and sign-off process for the validation results, involving all relevant stakeholders and decision-makers.

Measure and benchmark KPIs

For multi-tenant Amazon Redshift implementation, KPIs are segmented at the platform level, tenant level, and consumption tools level. KPIs evaluate the operational metrics, cost metrics, and end-user response time metrics. In this section, we discuss the KPIs needed for achieving a successful transition.

Platform-level KPIs

As new tenants are gradually migrated to the platform, it’s imperative to monitor the current state of Amazon Redshift platform-level KPIs. The current KPI’s state will help the platform team make the necessary scalability modifications (add nodes, add consumer clusters, add producer clusters, or increase concurrency scaling clusters). Amazon Redshift query monitoring rules (QMR) also help govern the overall state of data platform, providing optimal performance for all tenants by managing outlier workloads.

The following table summarizes the relevant platform-level KPIs.

Component KPI Service Level and Success Criteria
ETL Ingestion data volume Daily or hourly peak volume in GBps, number of objects, number of threads.
Ingestion threads Peak hourly ingestion threads (COPY or INSERT), number of dependencies, KPI segmented by tenants and domains.
Stored procedure volume Peak hourly stored procedure invocations segmented by tenants and domains.
Concurrent load Peak concurrent load supported by the producer cluster; distribution of ingestion pattern across multiple producer clusters using data sharing.
Data sharing dependency Data sharing between producer clusters (objects refreshed, locks per hour, waits per hour).
Workload Number of queries Peak hour query volume supported by cluster segmented by short (less than 10 seconds), medium (less than 60 seconds), long (less than 5 minutes), very long (less than 30 minutes), and outlier (more than 30 minutes); segmented by tenant, domain, or sub-domain.
Number of queries per queue Peak hour query volume supported by priority automatic WLM queue segmented by short (less than 10 seconds), medium (less than 60 seconds), long (less than 5 minutes), very long (less than 30 minutes), and outlier (more than 30 minutes); segmented by tenant, business group, domain, or sub-domain.
Runtime pattern Total runtime per hour; max, median, and average run pattern; segmented by service class across clusters.
Wait time patterns Total wait time per hour; max, median, and average wait pattern for queries waiting.
Performance Leader node usage Service level for leader node (recommended less than 80%).
Compute node CPU usage Service level for compute node (recommended less than 90%).
Disk I/O usage per node Service level for disk I/O per node.
QMR rules Number of outlier queries stopped by QMR (large scan, large spilling disk, large runtime); logging thresholds for potential large queries running more than 5 minutes.
History of WLM queries Historical trend of queries stored in historical archive table for all instances of queries in STL_WLM_QUERY; trend analysis over 30 days, 60 days, and 90 days to fine-tune the workload across clusters.
Cost Total cost per month of Amazon Redshift platform Service level for mix of instances (reserved, on-demand, serverless), cost of Concurrency Scaling, cost of Amazon Redshift Spectrum usage. Use AWS tools like AWS Cost Explorer or daily cost usage report to capture monthly costs for each component.
Daily Concurrency Scaling usage Service limits to monitor cost for concurrency scaling; invoke for outlier activity on spikes.
Daily Amazon Redshift Spectrum usage Service limits to monitor cost for using Amazon Redshift Spectrum; invoke for outlier activity.
Redshift Managed Storage usage cost Track usage of Redshift Managed Storage, monitoring wastage on temporary, archival, and old data assets.
Localization Remote or on-premises tools Service level for rendering large datasets to remote destinations.
Data transfer to remote tools Data transfer to BI tools or workstations outside the Redshift cluster VPC; separation of datasets to Amazon S3 using the unload feature, avoiding bottlenecks at leader node.

Tenant-level KPIs

Tenant-level KPIs help capture current performance levels from the legacy system and document expected service levels for the data flow from the source capture to end-user consumption. The captured legacy KPIs assist in providing the best target modern Amazon Redshift platform (a single Redshift data warehouse, a lake house with Amazon Redshift Spectrum, and data sharing with the producer and consumer clusters). Cost usage tracking at the tenant level helps you spread the cost of a shared platform across tenants.

The following table summarizes the relevant tenant-level KPIs.

Component KPI Service Level and Success Criteria
Cost Compute usage by tenant Track usage by tenant, business group, or domain; capture query volume by business unit associating Redshift user identity to internal business unit; data observability by consumer usage for data products helping with cost attribution.
ETL Orchestration SLA Service level for daily data availability.
Runtime Service level for data loading and transformation.
Data ingestion volume Peak expected volume for service level guarantee.
Query consumption Response time Response time SLA for query patterns (dashboards, SQL analytics, ML analytics, BI tool caching).
Concurrency Peak query consumers for tenant.
Query volume Peak hourly volume service levels and daily query volumes.
Individual query response for critical data consumption Service level and success criteria for critical workloads.

Consumer-level KPIs

A multi-tenant modern data platform can set service levels for a variety of consumer tools. The service levels provide guidance to end-users of the capability of the new deployment.

The following table summarizes the relevant consumer-level KPIs.

Consumer KPI Service Level and Success Criteria
BI tools Large data extraction Service level for unloading data for caching or query rendering a large result dataset.
Dashboards Response time Service level for data refresh.
SQL query tools Response time Service level for response time by query type.
Concurrency Service level for concurrent query access by all consumers.
One-time analytics Response time Service level for large data unloads or aggregation.
ML analytics Response time Service level for large data unloads or aggregation.

Sample SQL

The post includes sample SQL to capture daily KPI metrics. The following example KPI dashboard trends assist in capturing historic workload patterns, identifying deviations in workload, and providing guidance on the platform workload capacity to meet the current workload and anticipated growth patterns.

The following figure shows a daily query volume snapshot (queries per day and queued queries per day, which waited a minimum of 5 seconds).

Figure shows a daily query volume snapshot (queries per day and queued queries per day, which waited a minimum of 5 seconds)

The following figure shows a daily usage KPI. It monitors percentage waits and median wait for waiting queries (identifies the minimal threshold for wait to compute waiting queries and median of all wait times to infer deviation patterns).

Figure shows a daily usage KPI. It monitors percentage waits and median wait for waiting queries (identifies the minimal threshold for wait to compute waiting queries and median of all wait times to infer deviation patterns)

The following figure illustrates concurrency usage (monitors concurrency compute usage for Concurrency Scaling clusters).

The following figure illustrates concurrency usage (monitors concurrency compute usage for Concurrency Scaling clusters)

The following figure shows a 30-day pattern (computes volume in terms of total runtime and total wait time).

The following figure shows a 30-day pattern (computes volume in terms of total runtime and total wait time)

Monitoring Redshift performance and continual optimization

Amazon Redshift uses automatic table optimization (ATO) to choose the right distribution style, sort keys, and encoding when you create a table with AUTO options. Therefore, it’s a good practice to take advantage of the AUTO feature and create tables with DISTSTYLE AUTO, SORTKEY AUTO, and ENCODING AUTO. When tables are created with AUTO options, Amazon Redshift initially creates tables with optimal keys for the best first-time query performance possible using information such as the primary key and data types. In addition, Amazon Redshift analyzes the data volume and query usage patterns to evolve the distribution strategy and sort keys to optimize performance over time. Finally, Amazon Redshift performs table maintenance activities on your tables that reduce fragmentation and make sure statistics are up to date.

During a large, phased migration, it’s important to monitor and measure Amazon Redshift performance against target KPIs at each phase and implement continual optimization. As new workloads are onboarded at each phase of the migration, it’s recommended to perform regular Redshift cluster reviews and analyze query pattern and performance. Cluster reviews can be done by engaging the Amazon Redshift specialist team through AWS Enterprise support or your AWS account team. The goal of a cluster review includes the following:

  • Use cases – Review the application use cases and determine if the design is suitable to solve for those use cases.
  • End-to-end architecture – Assess the current data pipeline architecture (ingestion, transformation, and consumption). For example, determine if too many small inserts are occurring and review their ETL pipeline. Determine if integration with other AWS services can be useful, such as AWS Lake Formation, Amazon Athena, Redshift Spectrum, or Amazon Redshift federation with PostgreSQL and MySQL.
  • Data model design – Review the data model and table design and provide recommendations for sort and distribution keys, keeping in mind best practices.
  • Performance – Review cluster performance metrics. Identify bottlenecks or irregularities and suggest recommendations. Dive deep into specific long-running queries to identify solutions specific to the customer’s workload.
  • Cost optimization – Provide recommendations to reduce costs where possible.
  • New features – Stay up to date with the new features in Amazon Redshift and identify where they can be used to meet these goals.

New workloads can introduce query patterns that could impact performance and miss target SLAs. A number of factors can affect query performance. In the following sections, we discuss aspects impacting query speed and optimizations for improving Redshift cluster performance.

Identify top offending queries

A compute node is partitioned into slices. More nodes means more processors and more slices, which enables you to redistribute the data as needed across the slices. However, more nodes also means greater expense, so you will need to find the balance of cost and performance that is appropriate for your system. For more information on Redshift cluster architecture, see Data warehouse system architecture. Each node type offers different sizes and limits to help you scale your cluster appropriately. The node size determines the storage capacity, memory, CPU, and price of each node in the cluster. For more information on node types, see Amazon Redshift pricing.

Redshift Test Drive is an open source tool that lets you evaluate which different data warehouse configuration options are best suited for your workload. We created Redshift Test Drive from Simple Replay and Amazon Redshift Node Configuration Comparison (see Compare different node types for your workload using Amazon Redshift for more details) to provide a single entry point for finding the best Amazon Redshift configuration for your workload. Redshift Test Drive also provides additional features such as a self-hosted analysis UI and the ability to replicate external objects that a Redshift workload may interact with. With Amazon Redshift Serverless, you can start with a base Redshift Processing Unit (RPU), and Redshift Serverless automatically scales based on your workload needs.

Optimization strategies

If you choose to fine-tune manually, the following are key concepts and considerations:

  • Data distribution – Amazon Redshift stores table data on the compute nodes according to a table’s distribution style. When you run a query, the query optimizer redistributes the data to the compute nodes as needed to perform any joins and aggregations. Choosing the right distribution style for a table helps minimize the impact of the redistribution step by locating the data where it needs to be before the joins are performed. For more information, see Working with data distribution styles.
  • Data sort order – Amazon Redshift stores table data on disk in sorted order according to a table’s sort keys. The query optimizer and query processor use the information about where the data is located to reduce the number of blocks that need to be scanned and thereby improve query speed. For more information, see Working with sort keys.
  • Dataset size – A higher volume of data in the cluster can slow query performance for queries, because more rows need to be scanned and redistributed. You can mitigate this effect by regular vacuuming and archiving of data, and by using a predicate (a condition in the WHERE clause) to restrict the query dataset.
  • Concurrent operations – Amazon Redshift offers a powerful feature called automatic workload management (WLM) with query priorities, which enhances query throughput and overall system performance. By intelligently managing multiple concurrent operations and allocating resources dynamically, automatic WLM makes sure high-priority queries receive the necessary resources promptly, while lower-priority queries are processed efficiently without compromising system stability. This advanced queuing mechanism allows Amazon Redshift to optimize resource utilization, minimizing potential bottlenecks and maximizing query throughput, ultimately delivering a seamless and responsive experience for users running multiple operations simultaneously.
  • Query structure – How your query is written will affect its performance. As much as possible, write queries to process and return as little data as will meet your needs. For more information, see Amazon Redshift best practices for designing queries.
  • Queries with a long return time – Queries with a long return time can impact the processing of other queries and overall performance of the cluster. It’s critical to identify and optimize them. You can optimize these queries by either moving clients to the same network or using the UNLOAD feature of Amazon Redshift, and then configure the client to read the output from Amazon S3. To identify percentile and top running queries, you can download the sample SQL notebook system queries. You can import this in Query Editor V2.0.

Conclusion

In this post, we discussed best practices for assessing, planning, and implementing a large-scale data warehouse migration into Amazon Redshift.

The assessment phase of a data migration project is critical for implementing a successful migration. It involves a comprehensive analysis of the existing workload, integrations, and dependencies to accurately estimate the effort required and determine the appropriate team size. Strategic wave planning is crucial for prioritizing and scheduling the migration tasks effectively. Establishing KPIs and benchmarking them helps measure progress and identify areas for improvement. Code conversion and data validation processes validate the integrity of the migrated data and applications. Monitoring Amazon Redshift performance, identifying and optimizing top offending queries, and conducting regular cluster reviews are essential for maintaining optimal performance and addressing any potential issues promptly.

By addressing these key aspects, organizations can seamlessly migrate their data workloads to Amazon Redshift while minimizing disruptions and maximizing the benefits of Amazon Redshift.

We hope this post provides you with valuable guidance. We welcome any thoughts or questions in the comments section.


About the authors

Chanpreet Singh is a Senior Lead Consultant at AWS, specializing in Data Analytics and AI/ML. He has over 17 years of industry experience and is passionate about helping customers build scalable data warehouses and big data solutions. In his spare time, Chanpreet loves to explore nature, read, and enjoy with his family.

Harshida Patel is a Analytics Specialist Principal Solutions Architect, with AWS.

Raza Hafeez is a Senior Product Manager at Amazon Redshift. He has over 13 years of professional experience building and optimizing enterprise data warehouses and is passionate about enabling customers to realize the power of their data. He specializes in migrating enterprise data warehouses to AWS Modern Data Architecture.

Ram Bhandarkar is a Principal Data Architect at AWS based out of Northern Virginia. He helps customers with planning future Enterprise Data Strategy and assists them with transition to Modern Data Architecture platform on AWS. He has worked with building and migrating databases, data warehouses and data lake solutions for over 25 years.

Vijay Bagur is a Sr. Technical Account Manager. He works with enterprise customers to modernize and cost optimize workloads, improve security posture, and helps them build reliable and secure applications on the AWS platform. Outside of work, he loves spending time with his family, biking and traveling.

Let’s Architect! Migrating to the cloud with AWS

Post Syndicated from Federica Ciuffo original https://aws.amazon.com/blogs/architecture/lets-architect-migrating-to-the-cloud-with-aws/

In today’s digital world, businesses are increasingly turning to the cloud for its scalability, agility, and cost-effectiveness. Migrating your data center to the cloud can be a daunting task, but with the right approach and tools, it can be a successful journey. This Let’s Architect! blog post will guide you through the process of migrating to the cloud with AWS, leveraging the proven AWS Cloud Adoption Framework (AWS CAF) and exploring valuable resources to help you navigate each step.

AWS Cloud Adoption Framework

The AWS Cloud Adoption Framework (CAF) provides a comprehensive approach to planning, designing, and deploying your cloud migration. This robust framework outlines a four-phase methodology that guides you through every stage of the process, from strategy and planning to ongoing management and optimization. Here’s a closer look at the four phases of the AWS CAF:

  • Envision: Identify business transformation opportunities that align with your strategic goals and demonstrate how the cloud will accelerate your business outcomes.
  • Align: Assess your organization’s cloud readiness by identifying capability gaps across six key perspectives (Business, People, Governance, Platform, Security, and Operations). Address these gaps by developing strategies, ensuring stakeholder alignment, and implementing relevant change management activities.
  • Launch: Select impactful pilot initiatives and deploy them in production. These pilots should showcase the value proposition of the cloud and provide valuable insights for further refinement.
  • Scale: Focus on expanding production pilots and business value to desired scale and ensuring that the business benefits associated with your cloud investments are realized and sustained.
The AWS CAF recommends four iterative and incremental cloud transformation phases

Figure 1. The AWS CAF recommends four iterative and incremental cloud transformation phases

Take me to this whitepaper!

Large-scale migration and modernization

Migrating a large-scale data center to the cloud requires careful planning and execution. This video session focuses on valuable lessons learned from the thousands of enterprises who have migrated and modernized their on-premises workloads with AWS. Dive deep on technical lessons learned, mental models used, how to set up teams to modernize as they migrate, and how to engage with AWS Professional Services and AWS Partners for success. Finally, you will get insights on the latest AWS migration and modernization tools.

Migrating to AWS Cloud unlocked major benefits for Live Nation, including a 58% cost saving

Figure 2. Migrating to AWS Cloud unlocked major benefits for Live Nation, including a 58% cost saving

Take me to this video!

Dive deep into different AWS DMS migration options

At the heart of any successful data migration lies a robust database migration strategy. AWS Database Migration Service (AWS DMS) empowers you with a comprehensive suite of tools to seamlessly move and replicate your data. This session explains the various options offered by AWS DMS, including logical replication, managed native methods for export, import, and replication, and bulk extract and load functionalities. Through these options, you’ll gain a thorough understanding of how to migrate and replicate your data, along with the distinct advantages of each approach. The session also explores performance considerations to ensure optimal migration efficiency. Finally, you will learn how modern capabilities like serverless technologies, auto scaling, and schema conversion can simplify migrations.

AWS DMS Schema Conversion converts your existing database schemas and a majority of the database code objects to a format compatible with the target database

FIgure 3. AWS DMS Schema Conversion converts your existing database schemas and a majority of the database code objects to a format compatible with the target database

Take me to this video!

Application Migration with AWS

Migrating and modernizing your applications is a crucial aspect of your cloud adoption strategy. The Application Migration with AWS workshop series provides hands-on experience with planning and executing application migrations. You’ll learn practical techniques like database replatforming, application rehosting, and containerization to make your move to the cloud smooth and efficient.

As part of this lab, you will perform a database migration with AWS DMS

Figure 4. As part of this lab, you will perform a database migration with AWS DMS

Take me to this workshop!

But the journey doesn’t end there. As your applications scale in the cloud, managing that growth becomes key. This is where infrastructure as code (IaC) comes in, and AWS CDK takes IaC a step further by allowing you to write infrastructure code in familiar programming languages you already know. This streamlines your migration by leveraging your existing coding knowledge. We recommend this AWS CDK workshop to get started with CDK for infrastructure automation.

See you next time!

Thanks for reading! With this post, we provided resources to help you navigate your cloud migration journey with confidence and success. In the next blog, we will talk about Well-Architected best practices!

To revisit any of our previous posts or explore the entire series, visit the Let’s Architect! page.

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

Post Syndicated from Ravi Itha original https://aws.amazon.com/blogs/big-data/simplify-operational-data-processing-in-data-lakes-using-aws-glue-and-apache-hudi/

The Analytics specialty practice of AWS Professional Services (AWS ProServe) helps customers across the globe with modern data architecture implementations on the AWS Cloud. A modern data architecture is an evolutionary architecture pattern designed to integrate a data lake, data warehouse, and purpose-built stores with a unified governance model. It focuses on defining standards and patterns to integrate data producers and consumers and move data between data lakes and purpose-built data stores securely and efficiently. Out of the many data producer systems that feed data to a data lake, operational databases are most prevalent, where operational data is stored, transformed, analyzed, and finally used to enhance business operations of an organization. With the emergence of open storage formats such as Apache Hudi and its native support from AWS Glue for Apache Spark, many AWS customers have started adding transactional and incremental data processing capabilities to their data lakes.

AWS has invested in native service integration with Apache Hudi and published technical contents to enable you to use Apache Hudi with AWS Glue (for example, refer to Introducing native support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark, Part 1: Getting Started). In AWS ProServe-led customer engagements, the use cases we work on usually come with technical complexity and scalability requirements. In this post, we discuss a common use case in relation to operational data processing and the solution we built using Apache Hudi and AWS Glue.

Use case overview

AnyCompany Travel and Hospitality wanted to build a data processing framework to seamlessly ingest and process data coming from operational databases (used by reservation and booking systems) in a data lake before applying machine learning (ML) techniques to provide a personalized experience to its users. Due to the sheer volume of direct and indirect sales channels the company has, its booking and promotions data are organized in hundreds of operational databases with thousands of tables. Of those tables, some are larger (such as in terms of record volume) than others, and some are updated more frequently than others. In the data lake, the data to be organized in the following storage zones:

  1. Source-aligned datasets – These have an identical structure to their counterparts at the source
  2. Aggregated datasets – These datasets are created based on one or more source-aligned datasets
  3. Consumer-aligned datasets – These are derived from a combination of source-aligned, aggregated, and reference datasets enriched with relevant business and transformation logics, usually fed as inputs to ML pipelines or any consumer applications

The following are the data ingestion and processing requirements:

  1. Replicate data from operational databases to the data lake, including insert, update, and delete operations.
  2. Keep the source-aligned datasets up to date (typically within the range of 10 minutes to a day) in relation to their counterparts in the operational databases, ensuring analytics pipelines refresh consumer-aligned datasets for downstream ML pipelines in a timely fashion. Moreover, the framework should consume compute resources as optimally as possible per the size of the operational tables.
  3. To minimize DevOps and operational overhead, the company wanted to templatize the source code wherever possible. For example, to create source-aligned datasets in the data lake for 3,000 operational tables, the company didn’t want to deploy 3,000 separate data processing jobs. The smaller the number of jobs and scripts, the better.
  4. The company wanted the ability to continue processing operational data in the secondary Region in the rare event of primary Region failure.

As you can guess, the Apache Hudi framework can solve the first requirement. Therefore, we will put our emphasis on the other requirements. We begin with a Data lake reference architecture followed by an overview of operational data processing framework. By showing you our open-source solution on GitHub, we delve into framework components and walk through their design and implementation aspects. Finally, by testing the framework, we summarize how it meets the aforementioned requirements.

Data lake reference architecture

Let’s begin with a big picture: a data lake solves a variety of analytics and ML use cases dealing with internal and external data producers and consumers. The following diagram represents a generic data lake architecture. To ingest data from operational databases to an Amazon Simple Storage Service (Amazon S3) staging bucket of the data lake, either AWS Database Migration Service (AWS DMS) or any AWS partner solution from AWS Marketplace that has support for change data capture (CDC) can fulfill the requirement. AWS Glue is used to create source-aligned and consumer-aligned datasets and separate AWS Glue jobs to do feature engineering part of ML engineering and operations. Amazon Athena is used for interactive querying and AWS Lake Formation is used for access controls.

Data Lake Reference Architecture

Operational data processing framework

The operational data processing (ODP) framework contains three components: File Manager, File Processor, and Configuration Manager. Each component runs independently to solve a portion of the operational data processing use case. We have open-sourced this framework on GitHub—you can clone the code repo and inspect it while we walk you through the design and implementation of the framework components. The source code is organized in three folders, one for each component, and if you customize and adopt this framework for your use case, we recommend promoting these folders as separate code repositories in your version control system. Consider using the following repository names:

  1. aws-glue-hudi-odp-framework-file-manager
  2. aws-glue-hudi-odp-framework-file-processor
  3. aws-glue-hudi-odp-framework-config-manager

With this modular approach, you can independently deploy the components to your data lake environment by following your preferred CI/CD processes. As illustrated in the preceding diagram, these components are deployed in conjunction with a CDC solution.

Component 1: File Manager

File Manager detects files emitted by a CDC process such as AWS DMS and tracks them in an Amazon DynamoDB table. As shown in the following diagram, it consists of an Amazon EventBridge event rule, an Amazon Simple Queue Service (Amazon SQS) queue, an AWS Lambda function, and a DynamoDB table. The EventBridge rule uses Amazon S3 Event Notifications to detect the arrival of CDC files in the S3 bucket. The event rule forwards the object event notifications to the SQS queue as messages. The File Manager Lambda function consumes those messages, parses the metadata, and inserts the metadata to the DynamoDB table odpf_file_tracker. These records will then be processed by File Processor, which we discuss in the next section.

ODPF Component: File Manager

Component 2: File Processor

File Processor is the workhorse of the ODP framework. It processes files from the S3 staging bucket, creates source-aligned datasets in the raw S3 bucket, and adds or updates metadata for the datasets (AWS Glue tables) in the AWS Glue Data Catalog.

We use the following terminology when discussing File Processor:

  1. Refresh cadence – This represents the data ingestion frequency (for example, 10 minutes). It usually goes with AWS Glue worker type (one of G.1X, G.2X, G.4X, G.8X, G.025X, and so on) and batch size.
  2. Table configuration – This includes the Hudi configuration (primary key, partition key, pre-combined key, and table type (Copy on Write or Merge on Read)), table data storage mode (historical or current snapshot), S3 bucket used to store source-aligned datasets, AWS Glue database name, AWS Glue table name, and refresh cadence.
  3. Batch size – This numeric value is used to split tables into smaller batches and process their respective CDC files in parallel. For example, a configuration of 50 tables with a 10-minute refresh cadence and a batch size of 5 results in a total of 10 AWS Glue job runs, each processing CDC files for 5 tables.
  4. Table data storage mode – There are two options:
    • Historical – This table in the data lake stores historical updates to records (always append).
    • Current snapshot – This table in the data lake stores latest versioned records (upserts) with the ability to use Hudi time travel for historical updates.
  5. File processing state machine – It processes CDC files that belong to tables that share a common refresh cadence.
  6. EventBridge rule association with the file processing state machine – We use a dedicated EventBridge rule for each refresh cadence with the file processing state machine as target.
  7. File processing AWS Glue job – This is a configuration-driven AWS Glue extract, transform, and load (ETL) job that processes CDC files for one or more tables.

File Processor is implemented as a state machine using AWS Step Functions. Let’s use an example to understand this. The following diagram illustrates running File Processor state machine with a configuration that includes 18 operational tables, a refresh cadence of 10 minutes, a batch size of 5, and an AWS Glue worker type of G.1X.

ODP framework component: File Processor

The workflow includes the following steps:

  1. The EventBridge rule triggers the File Processor state machine every 10 minutes.
  2. Being the first state in the state machine, the Batch Manager Lambda function reads configurations from DynamoDB tables.
  3. The Lambda function creates four batches: three of them will be mapped to five operational tables each, and the fourth one is mapped to three operational tables. Then it feeds the batches to the Step Functions Map state.
  4. For each item in the Map state, the File Processor Trigger Lambda function will be invoked, which in turn runs the File Processor AWS Glue job.
  5. Each AWS Glue job performs the following actions:
    • Checks the status of an operational table and acquires a lock when it is not processed by any other job. The odpf_file_processing_tracker DynamoDB table is used for this purpose. When a lock is acquired, it inserts a record in the DynamoDB table with the status updating_table for the first time; otherwise, it updates the record.
    • Processes the CDC files for the given operational table from the S3 staging bucket and creates a source-aligned dataset in the S3 raw bucket. It also updates technical metadata in the AWS Glue Data Catalog.
    • Updates the status of the operational table to completed in the odpf_file_processing_tracker table. In case of processing errors, it updates the status to refresh_error and logs the stack trace.
    • It also inserts this record into the odpf_file_processing_tracker_history DynamoDB table along with additional details such as insert, update, and delete row counts.
    • Moves the records that belong to successfully processed CDC files from odpf_file_tracker to the odpf_file_tracker_history table with file_ingestion_status set to raw_file_processed.
    • Moves to the next operational table in the given batch.
    • Note: a failure to process CDC files for one of the operational tables of a given batch does not impact the processing of other operational tables.

Component 3: Configuration Manager

Configuration Manager is used to insert configuration details to the odpf_batch_config and odpf_raw_table_config tables. To keep this post concise, we provide two architecture patterns in the code repo and leave the implementation details to you.

Solution overview

Let’s test the ODP framework by replicating data from 18 operational tables to a data lake and creating source-aligned datasets with 10-minute refresh cadence. We use Amazon Relational Database Service (Amazon RDS) for MySQL to set up an operational database with 18 tables, upload the New York City Taxi – Yellow Trip Data dataset, set up AWS DMS to replicate data to Amazon S3, process the files using the framework, and finally validate the data using Amazon Athena.

Create S3 buckets

For instructions on creating an S3 bucket, refer to Creating a bucket. For this post, we create the following buckets:

  1. odpf-demo-staging-EXAMPLE-BUCKET – You will use this to migrate operational data using AWS DMS
  2. odpf-demo-raw-EXAMPLE-BUCKET – You will use this to store source-aligned datasets
  3. odpf-demo-code-artifacts-EXAMPLE-BUCKET – You will use this to store code artifacts

Deploy File Manager and File Processor

Deploy File Manager and File Processor by following instructions from this README and this README, respectively.

Set up Amazon RDS for MySQL

Complete the following steps to set up Amazon RDS for MySQL as the operational data source:

  1. Provision Amazon RDS for MySQL. For instructions, refer to Create and Connect to a MySQL Database with Amazon RDS.
  2. Connect to the database instance using MySQL Workbench or DBeaver.
  3. Create a database (schema) by running the SQL command CREATE DATABASE taxi_trips;.
  4. Create 18 tables by running the SQL commands in the ops_table_sample_ddl.sql script.

Populate data to the operational data source

Complete the following steps to populate data to the operational data source:

  1. To download the New York City Taxi – Yellow Trip Data dataset for January 2021 (Parquet file), navigate to NYC TLC Trip Record Data, expand 2021, and choose Yellow Taxi Trip records. A file called yellow_tripdata_2021-01.parquet will be downloaded to your computer.
  2. On the Amazon S3 console, open the bucket odpf-demo-staging-EXAMPLE-BUCKET and create a folder called nyc_yellow_trip_data.
  3. Upload the yellow_tripdata_2021-01.parquet file to the folder.
  4. Navigate to the bucket odpf-demo-code-artifacts-EXAMPLE-BUCKET and create a folder called glue_scripts.
  5. Download the file load_nyc_taxi_data_to_rds_mysql.py from the GitHub repo and upload it to the folder.
  6. Create an AWS Identity and Access Management (IAM) policy called load_nyc_taxi_data_to_rds_mysql_s3_policy. For instructions, refer to Creating policies using the JSON editor. Use the odpf_setup_test_data_glue_job_s3_policy.json policy definition.
  7. Create an IAM role called load_nyc_taxi_data_to_rds_mysql_glue_role. Attach the policy created in the previous step.
  8. On the AWS Glue console, create a connection for Amazon RDS for MySQL. For instructions, refer to Adding a JDBC connection using your own JDBC drivers and Setting up a VPC to connect to Amazon RDS data stores over JDBC for AWS Glue. Name the connection as odpf_demo_rds_connection.
  9. In the navigation pane of the AWS Glue console, choose Glue ETL jobs, Python Shell script editor, and Upload and edit an existing script under Options.
  10. Choose the file load_nyc_taxi_data_to_rds_mysql.py and choose Create.
  11. Complete the following steps to create your job:
    • Provide a name for the job, such as load_nyc_taxi_data_to_rds_mysql.
    • For IAM role, choose load_nyc_taxi_data_to_rds_mysql_glue_role.
    • Set Data processing units to 1/16 DPU.
    • Under Advanced properties, Connections, select the connection you created earlier.
    • Under Job parameters, add the following parameters:
      • input_sample_data_path = s3://odpf-demo-staging-EXAMPLE-BUCKET/nyc_yellow_trip_data/yellow_tripdata_2021-01.parquet
      • schema_name = taxi_trips
      • table_name = table_1
      • rds_connection_name = odpf_demo_rds_connection
    • Choose Save.
  12. On the Actions menu, run the job.
  13. Go back to your MySQL Workbench or DBeaver and validate the record count by running the SQL command select count(1) row_count from taxi_trips.table_1. You will get an output of 1369769.
  14. Populate the remaining 17 tables by running the SQL commands from the populate_17_ops_tables_rds_mysql.sql script.
  15. Get the row count from the 18 tables by running the SQL commands from the ops_data_validation_query_rds_mysql.sql script. The following screenshot shows the output.
    Record volumes (for 18 Tables) in Operational Database

Configure DynamoDB tables

Complete the following steps to configure the DynamoDB tables:

  1. Download file load_ops_table_configs_to_ddb.py from the GitHub repo and upload it to the folder glue_scripts in the S3 bucket odpf-demo-code-artifacts-EXAMPLE-BUCKET.
  2. Create an IAM policy called load_ops_table_configs_to_ddb_ddb_policy. Use the odpf_setup_test_data_glue_job_ddb_policy.json policy definition.
  3. Create an IAM role called load_ops_table_configs_to_ddb_glue_role. Attach the policy created in the previous step.
  4. On the AWS Glue console, choose Glue ETL jobs, Python Shell script editor, and Upload and edit an existing script under Options.
  5. Choose the file load_ops_table_configs_to_ddb.py and choose Create.
  6. Complete the following steps to create a job:
    • Provide a name, such as load_ops_table_configs_to_ddb.
    • For IAM role, choose load_ops_table_configs_to_ddb_glue_role.
    • Set Data processing units to 1/16 DPU.
    • Under Job parameters, add the following parameters
      • batch_config_ddb_table_name = odpf_batch_config
      • raw_table_config_ddb_table_name = odpf_demo_taxi_trips_raw
      • aws_region = e.g., us-west-1
    • Choose Save.
  7. On the Actions menu, run the job.
  8. On the DynamoDB console, get the item count from the tables. You will find 1 item in the odpf_batch_config table and 18 items in the odpf_demo_taxi_trips_raw table.

Set up a database in AWS Glue

Complete the following steps to create a database:

  1. On the AWS Glue console, under Data catalog in the navigation pane, choose Databases.
  2. Create a database called odpf_demo_taxi_trips_raw.

Set up AWS DMS for CDC

Complete the following steps to set up AWS DMS for CDC:

  1. Create an AWS DMS replication instance. For Instance class, choose dms.t3.medium.
  2. Create a source endpoint for Amazon RDS for MySQL.
  3. Create target endpoint for Amazon S3. To configure the S3 endpoint settings, use the JSON definition from dms_s3_endpoint_setting.json.
  4. Create an AWS DMS task.
    • Use the source and target endpoints created in the previous steps.
    • To create AWS DMS task mapping rules, use the JSON definition from dms_task_mapping_rules.json.
    • Under Migration task startup configuration, select Automatically on create.
  5. When the AWS DMS task starts running, you will see a task summary similar to the following screenshot.
    DMS Task Summary
  6. In the Table statistics section, you will see an output similar to the following screenshot. Here, the Full load rows and Total rows columns are important metrics whose counts should match with the record volumes of the 18 tables in the operational data source.
    DMS Task Statistics
  7. As a result of successful full load completion, you will find Parquet files in the S3 staging bucket—one Parquet file per table in a dedicated folder, similar to the following screenshot. Similarly, you will find 17 such folders in the bucket.
    DMS Output in S3 Staging Bucket for Table 1

File Manager output

The File Manager Lambda function consumes messages from the SQS queue, extracts metadata for the CDC files, and inserts one item per file to the odpf_file_tracker DynamoDB table. When you check the items, you will find 18 items with file_ingestion_status set to raw_file_landed, as shown in the following screenshot.

CDC Files in File Tracker DynamoDB Table

File Processor output

  1. On the subsequent tenth minute (since the activation of the EventBridge rule), the event rule triggers the File Processor state machine. On the Step Functions console, you will notice that the state machine is invoked, as shown in the following screenshot.
    File Processor State Machine Run Summary
  2. As shown in the following screenshot, the Batch Generator Lambda function creates four batches and constructs a Map state for parallel running of the File Processor Trigger Lambda function.
    File Processor State Machine Run Details
  3. Then, the File Processor Trigger Lambda function runs the File Processor Glue Job, as shown in the following screenshot.
    File Processor Glue Job Parallel Runs
  4. Then, you will notice that the File Processor Glue Job runs create source-aligned datasets in Hudi format in the S3 raw bucket. For Table 1, you will see an output similar to the following screenshot. There will be 17 such folders in the S3 raw bucket.
    Data in S3 raw bucket
  5. Finally, in AWS Glue Data Catalog, you will notice 18 tables created in the odpf_demo_taxi_trips_raw database, similar to the following screenshot.
    Tables in Glue Database

Data validation

Complete the following steps to validate the data:

  1. On the Amazon Athena console, open the query editor, and select a workgroup or create a new workgroup.
  2. Choose AwsDataCatalog for Data source and odpf_demo_taxi_trips_raw for Database.
  3. Run the raw_data_validation_query_athena.sql SQL query. You will get an output similar to the following screenshot.
    Raw Data Validation via Amazon Athena

Validation summary: The counts in Amazon Athena match with the counts of the operational tables and it proves that the ODP framework has processed all the files and records successfully. This concludes the demo. To test additional scenarios, refer to Extended Testing in the code repo.

Outcomes

Let’s review how the ODP framework addressed the aforementioned requirements.

  1. As discussed earlier in this post, by logically grouping tables by refresh cadence and associating them to EventBridge rules, we ensured that the source-aligned tables are refreshed by the File Processor AWS Glue jobs. With the AWS Glue worker type configuration setting, we selected the appropriate compute resources while running the AWS Glue jobs (the instances of the AWS Glue job).
  2. By applying table-specific configurations (from odpf_batch_config and odpf_raw_table_config) dynamically, we were able to use one AWS Glue job to process CDC files for 18 tables.
  3. You can use this framework to support a variety of data migration use cases that require quicker data migration from on-premises storage systems to data lakes or analytics platforms on AWS. You can reuse File Manager as is and customize File Processor to work with other storage frameworks such as Apache Iceberg, Delta Lake, and purpose-built data stores such as Amazon Aurora and Amazon Redshift.
  4. To understand how the ODP framework met the company’s disaster recovery (DR) design criterion, we first need to understand the DR architecture strategy at a high level. The DR architecture strategy has the following aspects:
    • One AWS account and two AWS Regions are used for primary and secondary environments.
    • The data lake infrastructure in the secondary Region is kept in sync with the one in the primary Region.
    • Data is stored in S3 buckets, metadata data is stored in the AWS Glue Data Catalog, and access controls in Lake Formation are replicated from the primary to secondary Region.
    • The data lake source and target systems have their respective DR environments.
    • CI/CD tooling (version control, CI server, and so on) are to be made highly available.
    • The DevOps team needs to be able to deploy CI/CD pipelines of analytics frameworks (such as this ODP framework) to either the primary or secondary Region.
    • As you can imagine, disaster recovery on AWS is a vast subject, so we keep our discussion to the last design aspect.

By designing the ODP framework with three components and externalizing operational table configurations to DynamoDB global tables, the company was able to deploy the framework components to the secondary Region (in the rare event of a single-Region failure) and continue to process CDC files from the point it last processed in the primary Region. Because the CDC file tracking and processing audit data is replicated to the DynamoDB replica tables in the secondary Region, the File Manager microservice and File Processor can seamlessly run.

Clean up

When you’re finished testing this framework, you can delete the provisioned AWS resources to avoid any further charges.

Conclusion

In this post, we took a real-world operational data processing use case and presented you the framework we developed at AWS ProServe. We hope this post and the operational data processing framework using AWS Glue and Apache Hudi will expedite your journey in integrating operational databases into your modern data platforms built on AWS.


About the authors

Ravi-IthaRavi Itha is a Principal Consultant at AWS Professional Services with specialization in data and analytics and generalist background in application development. Ravi helps customers with enterprise data strategy initiatives across insurance, airlines, pharmaceutical, and financial services industries. In his 6-year tenure at Amazon, Ravi has helped the AWS builder community by publishing approximately 15 open-source solutions (accessible via GitHub handle), four blogs, and reference architectures. Outside of work, he is passionate about reading India Knowledge Systems and practicing Yoga Asanas.

srinivas-kandiSrinivas Kandi is a Data Architect at AWS Professional Services. He leads customer engagements related to data lakes, analytics, and data warehouse modernizations. He enjoys reading history and civilizations.

Create an Apache Hudi-based near-real-time transactional data lake using AWS DMS, Amazon Kinesis, AWS Glue streaming ETL, and data visualization using Amazon QuickSight

Post Syndicated from Raj Ramasubbu original https://aws.amazon.com/blogs/big-data/create-an-apache-hudi-based-near-real-time-transactional-data-lake-using-aws-dms-amazon-kinesis-aws-glue-streaming-etl-and-data-visualization-using-amazon-quicksight/

With the rapid growth of technology, more and more data volume is coming in many different formats—structured, semi-structured, and unstructured. Data analytics on operational data at near-real time is becoming a common need. Due to the exponential growth of data volume, it has become common practice to replace read replicas with data lakes to have better scalability and performance. In most real-world use cases, it’s important to replicate the data from the relational database source to the target in real time. Change data capture (CDC) is one of the most common design patterns to capture the changes made in the source database and reflect them to other data stores.

We recently announced support for streaming extract, transform, and load (ETL) jobs in AWS Glue version 4.0, a new version of AWS Glue that accelerates data integration workloads in AWS. AWS Glue streaming ETL jobs continuously consume data from streaming sources, clean and transform the data in-flight, and make it available for analysis in seconds. AWS also offers a broad selection of services to support your needs. A database replication service such as AWS Database Migration Service (AWS DMS) can replicate the data from your source systems to Amazon Simple Storage Service (Amazon S3), which commonly hosts the storage layer of the data lake. Although it’s straightforward to apply updates on a relational database management system (RDBMS) that backs an online source application, it’s difficult to apply this CDC process on your data lakes. Apache Hudi, an open-source data management framework used to simplify incremental data processing and data pipeline development, is a good option to solve this problem.

This post demonstrates how to apply CDC changes from Amazon Relational Database Service (Amazon RDS) or other relational databases to an S3 data lake, with flexibility to denormalize, transform, and enrich the data in near-real time.

Solution overview

We use an AWS DMS task to capture near-real-time changes in the source RDS instance, and use Amazon Kinesis Data Streams as a destination of the AWS DMS task CDC replication. An AWS Glue streaming job reads and enriches changed records from Kinesis Data Streams and performs an upsert into the S3 data lake in Apache Hudi format. Then we can query the data with Amazon Athena visualize it in Amazon QuickSight. AWS Glue natively supports continuous write operations for streaming data to Apache Hudi-based tables.

The following diagram illustrates the architecture used for this post, which is deployed through an AWS CloudFormation template.

Prerequisites

Before you get started, make sure you have the following prerequisites:

Source data overview

To illustrate our use case, we assume a data analyst persona who is interested in analyzing near-real-time data for sport events using the table ticket_activity. An example of this table is shown in the following screenshot.

Apache Hudi connector for AWS Glue

For this post, we use AWS Glue 4.0, which already has native support for the Hudi framework. Hudi, an open-source data lake framework, simplifies incremental data processing in data lakes built on Amazon S3. It enables capabilities including time travel queries, ACID (Atomicity, Consistency, Isolation, Durability) transactions, streaming ingestion, CDC, upserts, and deletes.

Set up resources with AWS CloudFormation

This post includes a CloudFormation template for a quick setup. You can review and customize it to suit your needs.

The CloudFormation template generates the following resources:

  • An RDS database instance (source).
  • An AWS DMS replication instance, used to replicate the data from the source table to Kinesis Data Streams.
  • A Kinesis data stream.
  • Four AWS Glue Python shell jobs:
    • rds-ingest-rds-setup-<CloudFormation Stack name> – creates one source table called ticket_activity on Amazon RDS.
    • rds-ingest-data-initial-<CloudFormation Stack name> – Sample data is automatically generated at random by the Faker library and loaded to the ticket_activity table.
    • rds-ingest-data-incremental-<CloudFormation Stack name> – Ingests new ticket activity data into the source table ticket_activity continuously. This job simulates customer activity.
    • rds-upsert-data-<CloudFormation Stack name> – Upserts specific records in the source table ticket_activity. This job simulates administrator activity.
  • AWS Identity and Access Management (IAM) users and policies.
  • An Amazon VPC, a public subnet, two private subnets, internet gateway, NAT gateway, and route tables.
    • We use private subnets for the RDS database instance and AWS DMS replication instance.
    • We use the NAT gateway to have reachability to pypi.org to use the MySQL connector for Python from the AWS Glue Python shell jobs. It also provides reachability to Kinesis Data Streams and an Amazon S3 API endpoint

To set up these resources, you must have the following prerequisites:

The following diagram illustrates the architecture of our provisioned resources.

To launch the CloudFormation stack, complete the following steps:

  1. Sign in to the AWS CloudFormation console.
  2. Choose Launch Stack
  3. Choose Next.
  4. For S3BucketName, enter the name of your new S3 bucket.
  5. For VPCCIDR, enter a CIDR IP address range that doesn’t conflict with your existing networks.
  6. For PublicSubnetCIDR, enter the CIDR IP address range within the CIDR you gave for VPCCIDR.
  7. For PrivateSubnetACIDR and PrivateSubnetBCIDR, enter the CIDR IP address range within the CIDR you gave for VPCCIDR.
  8. For SubnetAzA and SubnetAzB, choose the subnets you want to use.
  9. For DatabaseUserName, enter your database user name.
  10. For DatabaseUserPassword, enter your database user password.
  11. Choose Next.
  12. On the next page, choose Next.
  13. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  14. Choose Create stack.

Stack creation can take about 20 minutes.

Set up an initial source table

The AWS Glue job rds-ingest-rds-setup-<CloudFormation stack name> creates a source table called event on the RDS database instance. To set up the initial source table in Amazon RDS, complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose rds-ingest-rds-setup-<CloudFormation stack name> to open the job.
  3. Choose Run.
  4. Navigate to the Runs tab and wait for Run status to show as SUCCEEDED.

This job will only create the one table, ticket_activity, in the MySQL instance (DDL). See the following code:

CREATE TABLE ticket_activity (
ticketactivity_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
sport_type VARCHAR(256) NOT NULL,
start_date DATETIME NOT NULL,
location VARCHAR(256) NOT NULL,
seat_level VARCHAR(256) NOT NULL,
seat_location VARCHAR(256) NOT NULL,
ticket_price INT NOT NULL,
customer_name VARCHAR(256) NOT NULL,
email_address VARCHAR(256) NOT NULL,
created_at DATETIME NOT NULL,
updated_at DATETIME NOT NULL )

Ingest new records

In this section, we detail the steps to ingest new records. Implement following steps to star the execution of the jobs.

Start data ingestion to Kinesis Data Streams using AWS DMS

To start data ingestion from Amazon RDS to Kinesis Data Streams, complete the following steps:

  1. On the AWS DMS console, choose Database migration tasks in the navigation pane.
  2. Select the task rds-to-kinesis-<CloudFormation stack name>.
  3. On the Actions menu, choose Restart/Resume.
  4. Wait for the status to show as Load complete and Replication ongoing.

The AWS DMS replication task ingests data from Amazon RDS to Kinesis Data Streams continuously.

Start data ingestion to Amazon S3

Next, to start data ingestion from Kinesis Data Streams to Amazon S3, complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose streaming-cdc-kinesis2hudi-<CloudFormation stack name> to open the job.
  3. Choose Run.

Do not stop this job; you can check the run status on the Runs tab and wait for it to show as Running.

Start the data load to the source table on Amazon RDS

To start data ingestion to the source table on Amazon RDS, complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose rds-ingest-data-initial-<CloudFormation stack name> to open the job.
  3. Choose Run.
  4. Navigate to the Runs tab and wait for Run status to show as SUCCEEDED.

Validate the ingested data

After about 2 minutes from starting the job, the data should be ingested into the Amazon S3. To validate the ingested data in the Athena, complete the following steps:

  1. On the Athena console, complete the following steps if you’re running an Athena query for the first time:
    • On the Settings tab, choose Manage.
    • Specify the stage directory and the S3 path where Athena saves the query results.
    • Choose Save.

  1. On the Editor tab, run the following query against the table to check the data:
SELECT * FROM "database_<account_number>_hudi_cdc_demo"."ticket_activity" limit 10;

Note that AWS Cloud Formation will create the database with the account number as database_<your-account-number>_hudi_cdc_demo.

Update existing records

Before you update the existing records, note down the ticketactivity_id value of a record from the ticket_activity table. Run the following SQL using Athena. For this post, we use ticketactivity_id = 46 as an example:

SELECT * FROM "database_<account_number>_hudi_cdc_demo"."ticket_activity" limit 10;

To simulate a real-time use case, update the data in the source table ticket_activity on the RDS database instance to see that the updated records are replicated to Amazon S3. Complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose rds-ingest-data-incremental-<CloudFormation stack name> to open the job.
  3. Choose Run.
  4. Choose the Runs tab and wait for Run status to show as SUCCEEDED.

To upsert the records in the source table, complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose the job rds-upsert-data-<CloudFormation stack name>.
  3. On the Job details tab, under Advanced properties, for Job parameters, update the following parameters:
    • For Key, enter --ticketactivity_id.
    • For Value, replace 1 with one of the ticket IDs you noted above (for this post, 46).

  1. Choose Save.
  2. Choose Run and wait for the Run status to show as SUCCEEDED.

This AWS Glue Python shell job simulates a customer activity to buy a ticket. It updates a record in the source table ticket_activity on the RDS database instance using the ticket ID passed in the job argument --ticketactivity_id. It will update ticket_price=500 and updated_at with the current timestamp.

To validate the ingested data in Amazon s3, run the same query from Athena and check the ticket_activity value you noted earlier to observe the ticket_price and updated_at fields:

SELECT * FROM "database_<account_number>_hudi_cdc_demo"."ticket_activity" where ticketactivity_id = 46 ;

Visualize the data in QuickSight

After you have the output file generated by the AWS Glue streaming job in the S3 bucket, you can use QuickSight to visualize the Hudi data files. QuickSight is a scalable, serverless, embeddable, ML-powered business intelligence (BI) service built for the cloud. QuickSight lets you easily create and publish interactive BI dashboards that include ML-powered insights. QuickSight dashboards can be accessed from any device and seamlessly embedded into your applications, portals, and websites.

Build a QuickSight dashboard

To build a QuickSight dashboard, complete the following steps:

  1. Open the QuickSight console.

You’re presented with the QuickSight welcome page. If you haven’t signed up for QuickSight, you may have to complete the signup wizard. For more information, refer to Signing up for an Amazon QuickSight subscription.

After you have signed up, QuickSight presents a “Welcome wizard.” You can view the short tutorial, or you can close it.

  1. On the QuickSight console, choose your user name and choose Manage QuickSight.
  2. Choose Security & permissions, then choose Manage.
  3. Select Amazon S3 and select the buckets that you created earlier with AWS CloudFormation.
  4. Select Amazon Athena.
  5. Choose Save.
  6. If you changed your Region during the first step of this process, change it back to the Region that you used earlier during the AWS Glue jobs.

Create a dataset

Now that you have QuickSight up and running, you can create your dataset. Complete the following steps:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.
  3. Choose Athena.
  4. For Data source name, enter a name (for example, hudi-blog).
  5. Choose Validate.
  6. After the validation is successful, choose Create data source.
  7. For Database, choose database_<your-account-number>_hudi_cdc_demo.
  8. For Tables, select ticket_activity.
  9. Choose Select.
  10. Choose Visualize.
  11. Choose hour and then ticket_activity_id to get the count of ticket_activity_id by hour.

Clean up

To clean up your resources, complete the following steps:

  1. Stop the AWS DMS replication task rds-to-kinesis-<CloudFormation stack name>.
  2. Navigate to the RDS database and choose Modify.
  3. Deselect Enable deletion protection, then choose Continue.
  4. Stop the AWS Glue streaming job streaming-cdc-kinesis2redshift-<CloudFormation stack name>.
  5. Delete the CloudFormation stack.
  6. On the QuickSight dashboard, choose your user name, then choose Manage QuickSight.
  7. Choose Account settings, then choose Delete account.
  8. Choose Delete account to confirm.
  9. Enter confirm and choose Delete account.

Conclusion

In this post, we demonstrated how you can stream data—not only new records, but also updated records from relational databases—to Amazon S3 using an AWS Glue streaming job to create an Apache Hudi-based near-real-time transactional data lake. With this approach, you can easily achieve upsert use cases on Amazon S3. We also showcased how to visualize the Apache Hudi table using QuickSight and Athena. As a next step, refer to the Apache Hudi performance tuning guide for a high-volume dataset. To learn more about authoring dashboards in QuickSight, check out the QuickSight Author Workshop.


About the Authors

Raj Ramasubbu is a Sr. Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life science, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.

Rahul Sonawane is a Principal Analytics Solutions Architect at AWS with AI/ML and Analytics as his area of specialty.

Sundeep Kumar is a Sr. Data Architect, Data Lake at AWS, helping customers build data lake and analytics platform and solutions. When not building and designing data lakes, Sundeep enjoys listening music and playing guitar.

Migrate your existing SQL-based ETL workload to an AWS serverless ETL infrastructure using AWS Glue

Post Syndicated from Mitesh Patel original https://aws.amazon.com/blogs/big-data/migrate-your-existing-sql-based-etl-workload-to-an-aws-serverless-etl-infrastructure-using-aws-glue/

Data has become an integral part of most companies, and the complexity of data processing is increasing rapidly with the exponential growth in the amount and variety of data. Data engineering teams are faced with the following challenges:

  • Manipulating data to make it consumable by business users
  • Building and improving extract, transform, and load (ETL) pipelines
  • Scaling their ETL infrastructure

Many customers migrating data to the cloud are looking for ways to modernize by using native AWS services to further scale and efficiently handle ETL tasks. In the early stages of their cloud journey, customers may need guidance on modernizing their ETL workload with minimal effort and time. Customers often use many SQL scripts to select and transform the data in relational databases hosted either in an on-premises environment or on AWS and use custom workflows to manage their ETL.

AWS Glue is a serverless data integration and ETL service with the ability to scale on demand. In this post, we show how you can migrate your existing SQL-based ETL workload to AWS Glue using Spark SQL, which minimizes the refactoring effort.

Solution overview

The following diagram describes the high-level architecture for our solution. This solution decouples the ETL and analytics workloads from our transactional data source Amazon Aurora, and uses Amazon Redshift as the data warehouse solution to build a data mart. In this solution, we employ AWS Database Migration Service (AWS DMS) for both full load and continuous replication of changes from Aurora. AWS DMS enables us to capture deltas, including deletes from the source database, through the use of Change Data Capture (CDC) configuration. CDC in DMS enables us to capture deltas without writing code and without missing any changes, which is critical for the integrity of the data. Please refer CDC support in DMS to extend the solutions for ongoing CDC.

The workflow includes the following steps:

  1. AWS Database Migration Service (AWS DMS) connects to the Aurora data source.
  2. AWS DMS replicates data from Aurora and migrates to the target destination Amazon Simple Storage Service (Amazon S3) bucket.
  3. AWS Glue crawlers automatically infer schema information of the S3 data and integrate into the AWS Glue Data Catalog.
  4. AWS Glue jobs run ETL code to transform and load the data to Amazon Redshift.

For this post, we use the TPCH dataset for sample transactional data. The components of TPCH consist of eight tables. The relationships between columns in these tables are illustrated in the following diagram.

We use Amazon Redshift as the data warehouse to implement the data mart solution. The data mart fact and dimension tables are created in the Amazon Redshift database. The following diagram illustrates the relationships between the fact (ORDER) and dimension tables (DATE, PARTS, and REGION).

Set up the environment

To get started, we set up the environment using AWS CloudFormation. Complete the following steps:

  1. Sign in to the AWS Management Console with your AWS Identity and Access Management (IAM) user name and password.
  2. Choose Launch Stack and open the page on a new tab:
  3. Choose Next.
  4. For Stack name, enter a name.
  5. In the Parameters section, enter the required parameters.
  6. Choose Next.

  1. On the Configure stack options page, leave all values as default and choose Next.
  2. On the Review stack page, select the check boxes to acknowledge the creation of IAM resources.
  3. Choose Submit.

Wait for the stack creation to complete. You can examine various events from the stack creation process on the Events tab. When the stack creation is complete, you will see the status CREATE_COMPLETE. The stack takes approximately 25–30 minutes to complete.

This template configures the following resources:

  • The Aurora MySQL instance sales-db.
  • The AWS DMS task dmsreplicationtask-* for full load of data and replicating changes from Aurora (source) to Amazon S3 (destination).
  • AWS Glue crawlers s3-crawler and redshift_crawler.
  • The AWS Glue database salesdb.
  • AWS Glue jobs insert_region_dim_tbl, insert_parts_dim_tbl, and insert_date_dim_tbl. We use these jobs for the use cases covered in this post. We create the insert_orders_fact_tbl AWS Glue job manually using AWS Glue Visual Studio.
  • The Redshift cluster blog_cluster with database sales and fact and dimension tables.
  • An S3 bucket to store the output of the AWS Glue job runs.
  • IAM roles and policies with appropriate permissions.

Replicate data from Aurora to Amazon S3

Now let’s look at the steps to replicate data from Aurora to Amazon S3 using AWS DMS:

  1. On the AWS DMS console, choose Database migration tasks in the navigation pane.
  2. Select the task dmsreplicationtask-* and on the Action menu, choose Restart/Resume.

This will start the replication task to replicate the data from Aurora to the S3 bucket. Wait for the task status to change to Full Load Complete. The data from the Aurora tables is now copied to the S3 bucket under a new folder, sales.

Create AWS Glue Data Catalog tables

Now let’s create AWS Glue Data Catalog tables for the S3 data and Amazon Redshift tables:

  1. On the AWS Glue console, under Data Catalog in the navigation pane, choose Connections.
  2. Select RedshiftConnection and on the Actions menu, choose Edit.
  3. Choose Save changes.
  4. Select the connection again and on the Actions menu, choose Test connection.
  5. For IAM role¸ choose GlueBlogRole.
  6. Choose Confirm.

Testing the connection can take approximately 1 minute. You will see the message “Successfully connected to the data store with connection blog-redshift-connection.” If you have trouble connecting successfully, refer to Troubleshooting connection issues in AWS Glue.

  1. Under Data Catalog in the navigation pane, choose Crawlers.
  2. Select s3_crawler and choose Run.

This will generate eight tables in the AWS Glue Data Catalog. To view the tables created, in the navigation pane, choose Databases under Data Catalog, then choose salesdb.

  1. Repeat the steps to run redshift_crawler and generate four additional tables.

If the crawler fails, refer to Error: Running crawler failed.

Create SQL-based AWS Glue jobs

Now let’s look at how the SQL statements are used to create ETL jobs using AWS Glue. AWS Glue runs your ETL jobs in an Apache Spark serverless environment. AWS Glue runs these jobs on virtual resources that it provisions and manages in its own service account. AWS Glue Studio is a graphical interface that makes it simple to create, run, and monitor ETL jobs in AWS Glue. You can use AWS Glue Studio to create jobs that extract structured or semi-structured data from a data source, perform a transformation of that data, and save the result set in a data target.

Let’s go through the steps of creating an AWS Glue job for loading the orders fact table using AWS Glue Studio.

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose Create job.
  3. Select Visual with a blank canvas, then choose Create.

  1. Navigate to the Job details tab.
  2. For Name, enter insert_orders_fact_tbl.
  3. For IAM Role, choose GlueBlogRole.
  4. For Job bookmark, choose Enable.
  5. Leave all other parameters as default and choose Save.

  1. Navigate to the Visual tab.
  2. Choose the plus sign.
  3. Under Add nodes, enter Glue in the search bar and choose AWS Glue Data Catalog (Source) to add the Data Catalog as the source.

  1. In the right pane, on the Data source properties – Data Catalog tab, choose salesdb for Database and customer for Table.

  1. On the Node properties tab, for Name, enter Customers.

  1. Repeat these steps for the Orders and LineItem tables.

This concludes creating data sources on the AWS Glue job canvas. Next, we add transformations by combining data from these different tables.

Transform the data

Complete the following steps to add data transformations:

  1. On the AWS Glue job canvas, choose the plus sign.
  2. Under Transforms, choose SQL Query.
  3. On the Transform tab, for Node parents, select all the three data sources.
  4. On the Transform tab, under SQL query, enter the following query:
SELECT orders.o_orderkey        AS ORDERKEY,
orders.o_orderdate       AS ORDERDATE,
lineitem.l_linenumber    AS LINENUMBER,
lineitem.l_partkey       AS PARTKEY,
lineitem.l_receiptdate   AS RECEIPTDATE,
lineitem.l_quantity      AS QUANTITY,
lineitem.l_extendedprice AS EXTENDEDPRICE,
orders.o_custkey         AS CUSTKEY,
customer.c_nationkey     AS NATIONKEY,
CURRENT_TIMESTAMP        AS UPDATEDATE
FROM   orders orders,
lineitem lineitem,
customer customer
WHERE  orders.o_orderkey = lineitem.l_orderkey
AND orders.o_custkey = customer.c_custkey
  1. Update the SQL aliases values as shown in the following screenshot.

  1. On the Data preview tab, choose Start data preview session.
  2. When prompted, choose GlueBlogRole for IAM role and choose Confirm.

The data preview process will take a minute to complete.

  1. On the Output schema tab, choose Use data preview schema.

You will see the output schema similar to the following screenshot.

Now that we have previewed the data, we change a few data types.

  1. On the AWS Glue job canvas, choose the plus sign.
  2. Under Transforms, choose Change Schema.
  3. Select the node.
  4. On the Transform tab, update the Data type values as shown in the following screenshot.

Now let’s add the target node.

  1. Choose the Change Schema node and choose the plus sign.
  2. In the search bar, enter target.
  3. Choose Amazon Redshift as the target.

  1. Choose the Amazon Redshift node, and on the Data target properties – Amazon Redshift tab, for Redshift access type, select Direct data connection.
  2. Choose RedshiftConnection for Redshift Connection, public for Schema, and order_table for Table.
  3. Select Merge data into target table under Handling of data and target table.
  4. Choose orderkey for Matching keys.

  1. Choose Save.

AWS Glue Studio automatically generates the Spark code for you. You can view it on the Script tab. If you would like to do any out-of-the-box transformations, you can modify the Spark code. The AWS Glue job uses the Apache SparkSQL query for SQL query transformation. To find the available SparkSQL transformations, refer to the Spark SQL documentation.

  1. Choose Run to run the job.

As part of the CloudFormation stack, three other jobs are created to load the dimension tables.

  1. Navigate back to the Jobs page on the AWS Glue console, select the job insert_parts_dim_tbl, and choose Run.

This job uses the following SQL to populate the parts dimension table:

SELECT part.p_partkey,
part.p_type,
part.p_brand
FROM   part part
  1. Select the job insert_region_dim_tbl and choose Run.

This job uses the following SQL to populate the region dimension table:

SELECT nation.n_nationkey,
nation.n_name,
region.r_name
FROM   nation,
region
WHERE  nation.n_regionkey = region.r_regionkey
  1. Select the job insert_date_dim_tbl and choose Run.

This job uses the following SQL to populate the date dimension table:

SELECT DISTINCT( l_receiptdate )        AS DATEKEY,
Dayofweek(l_receiptdate) AS DAYOFWEEK,
Month(l_receiptdate)     AS MONTH,
Year(l_receiptdate)      AS YEAR,
Day(l_receiptdate)       AS DATE
FROM   lineitem lineitem

You can view the status of the running jobs by navigating to the Job run monitoring section on the Jobs page. Wait for all the jobs to complete. These jobs will load the data into the facts and dimension tables in Amazon Redshift.

To help optimize the resources and cost, you can use the AWS Glue Auto Scaling feature.

Verify the Amazon Redshift data load

To verify the data load, complete the following steps:

  1. On the Amazon Redshift console, select the cluster blog-cluster and on the Query Data menu, choose Query in query editor 2.
  2. For Authentication, select Temporary credentials.
  3. For Database, enter sales.
  4. For User name, enter admin.
  5. Choose Save.

  1. Run the following commands in the query editor to verify that the data is loaded into the Amazon Redshift tables:
SELECT *
FROM   sales.PUBLIC.order_table;

SELECT *
FROM   sales.PUBLIC.date_table;

SELECT *
FROM   sales.PUBLIC.parts_table;

SELECT *
FROM   sales.PUBLIC.region_table;

The following screenshot shows the results from one of the SELECT queries.

Now for the CDC, update the quantity of a line item for order number 1 in Aurora database using the below query. (To connect to your Aurora cluster use Cloud9 or any SQL client tools like MySQL command-line client).

UPDATE lineitem SET l_quantity = 100 WHERE l_orderkey = 1 AND l_linenumber = 4;

DMS will replicate the changes into the S3 bucket as shown in the below screenshot.

Re-running the Glue job insert_orders_fact_tbl will update the changes to the ORDER fact table as shown in the below screenshot

Clean up

To avoid incurring future charges, delete the resources created for the solution:

  1. On the Amazon S3 console, select the S3 bucket created as part of the CloudFormation stack, then choose Empty.
  2. On the AWS CloudFormation console, select the stack that you created initially and choose Delete to delete all the resources created by the stack.

Conclusion

In this post, we showed how you can migrate existing SQL-based ETL to an AWS serverless ETL infrastructure using AWS Glue jobs. We used AWS DMS to migrate data from Aurora to an S3 bucket, then SQL-based AWS Glue jobs to move the data to fact and dimension tables in Amazon Redshift.

This solution demonstrates a one-time data load from Aurora to Amazon Redshift using AWS Glue jobs. You can extend this solution for moving the data on a scheduled basis by orchestrating and scheduling jobs using AWS Glue workflows. To learn more about the capabilities of AWS Glue, refer to AWS Glue.


About the Authors

Mitesh Patel is a Principal Solutions Architect at AWS with specialization in data analytics and machine learning. He is passionate about helping customers building scalable, secure and cost effective cloud native solutions in AWS to drive the business growth. He lives in DC Metro area with his wife and two kids.

Sumitha AP is a Sr. Solutions Architect at AWS. She works with customers and help them attain their business objectives by  designing secure, scalable, reliable, and cost-effective solutions in the AWS Cloud. She has a focus on data and analytics and provides guidance on building analytics solutions on AWS.

Deepti Venuturumilli is a Sr. Solutions Architect in AWS. She works with commercial segment customers and AWS partners to accelerate customers’ business outcomes by providing expertise in AWS services and modernize their workloads. She focuses on data analytics workloads and setting up modern data strategy on AWS.

Deepthi Paruchuri is an AWS Solutions Architect based in NYC. She works closely with customers to build cloud adoption strategy and solve their business needs by designing secure, scalable, and cost-effective solutions in the AWS cloud.

Reduce archive cost with serverless data archiving

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/reduce-archive-cost-with-serverless-data-archiving/

For regulatory reasons, decommissioning core business systems in financial services and insurance (FSI) markets requires data to remain accessible years after the application is retired. Traditionally, FSI companies either outsourced data archiving to third-party service providers, which maintained application replicas, or purchased vendor software to query and visualize archival data.

In this blog post, we present a more cost-efficient option with serverless data archiving on Amazon Web Services (AWS). In our experience, you can build your own cloud-native solution on Amazon Simple Storage Service (Amazon S3) at one-fifth of the price of third-party alternatives. If you are retiring legacy core business systems, consider serverless data archiving for cost-savings while keeping regulatory compliance.

Serverless data archiving and retrieval

Modern archiving solutions follow the principles of modern applications:

  • Serverless-first development, to reduce management overhead.
  • Cloud-native, to leverage native capabilities of AWS services, such as backup or disaster recovery, to avoid custom build.
  • Consumption-based pricing, since data archival is consumed irregularly.
  • Speed of delivery, as both implementation and archive operations need to be performed quickly to fulfill regulatory compliance.
  • Flexible data retention policies can be enforced in an automated manner.

AWS Storage and Analytics services offer the necessary building blocks for a modern serverless archiving and retrieval solution.

Data archiving can be implemented on top of Amazon S3) and AWS Glue.

  1. Amazon S3 storage tiers enable different data retention policies and retrieval service level agreements (SLAs). You can migrate data to Amazon S3 using AWS Database Migration Service; otherwise, consider another data transfer service, such as AWS DataSync or AWS Snowball.
  2. AWS Glue crawlers automatically infer both database and table schemas from your data in Amazon S3 and store the associated metadata in the AWS Glue Data Catalog.
  3. Amazon CloudWatch monitors the execution of AWS Glue crawlers and notifies of failures.

Figure 1 provides an overview of the solution.

Serverless data archiving and retrieval

Figure 1. Serverless data archiving and retrieval

Once the archival data is catalogued, Amazon Athena can be used for serverless data query operations using standard SQL.

  1. Amazon API Gateway receives the data retrieval requests and eases integration with other systems via REST, HTTPS, or WebSocket.
  2. AWS Lambda reads parametrization data/templates from Amazon S3 in order to construct the SQL queries. Alternatively, query templates can be stored as key-value entries in a NoSQL store, such as Amazon DynamoDB.
  3. Lambda functions trigger Athena with the constructed SQL query.
  4. Athena uses the AWS Glue Data Catalog to retrieve table metadata for the Amazon S3 (archival) data and to return the SQL query results.

How we built serverless data archiving

An early build-or-buy assessment compared vendor products with a custom-built solution using Amazon S3, AWS Glue, and a user frontend for data retrieval and visualization.

The total cost of ownership over a 10-year period for one insurance core system (Policy Admin System) was $0.25M to build and run the custom solution on AWS compared with >$1.1M for third-party alternatives. The implementation cost advantage of the custom-built solution was due to development efficiencies using AWS services. The lower run cost resulted from a decreased frequency of archival usage and paying only for what you use.

The data archiving solution was implemented with AWS services (Figure 2):

  1. Amazon S3 is used to persist archival data in Parquet format (optimized for analytics and compressed to reduce storage space) that is loaded from the legacy insurance core system. The archival data source was AS400/DB2 and moved with Informatica Cloud to Amazon S3.
  2. AWS Glue crawlers infer the database schema from objects in Amazon S3 and create tables in AWS Glue for the decommissioned application data.
  3. Lambda functions (Python) remove data records based on retention policies configured for each domain, such as customers, policies, claims, and receipts. A daily job (Control-M) initiates the retention process.
Exemplary implementation of serverless data archiving and retrieval for insurance core system

Figure 2. Exemplary implementation of serverless data archiving and retrieval for insurance core system

Retrieval operations are formulated and executed via Python functions in Lambda. The following AWS resources implement the retrieval logic:

  1. Athena is used to run SQL queries over the AWS Glue tables for the decommissioned application.
  2. Lambda functions (Python) build and execute queries for data retrieval. The functions render HMTL snippets using Jinja templating engine and Athena query results, returning the selected template filled with the requested archive data. Using Jinja as templating engine improved the speed of delivery and reduced the heavy lifting of frontend and backend changes when modeling retrieval operations by ~30% due to the decoupling between application layers. As a result, engineers only need to build an Athena query with the linked Jinja template.
  3. Amazon S3 stores templating configuration and queries (JSON files) used for query parametrization.
  4. Amazon API Gateway serves as single point of entry for API calls.

The user frontend for data retrieval and visualization is implemented as web application using React JavaScript library (with static content on Amazon S3) and Amazon CloudFront used for web content delivery.

The archiving solution enabled 80 use cases with 60 queries and reduced storage from three terabytes on source to only 35 gigabytes on Amazon S3. The success of the implementation depended on the following key factors:

  • Appropriate sponsorship from business across all areas (claims, actuarial, compliance, etc.)
  • Definition of SLAs for responding to courts, regulators, etc.
  • Minimum viable and mandatory approach
  • Prototype visualizations early on (fail fast)

Conclusion

Traditionally, FSI companies relied on vendor products for data archiving. In this post, we explored how to build a scalable solution on Amazon S3 and discussed key implementation considerations. We have demonstrated that AWS services enable FSI companies to build a serverless archiving solution while reaching and keeping regulatory compliance at a lower cost.

Learn more about some of the AWS services covered in this blog:

New – AWS DMS Serverless: Automatically Provisions and Scales Capacity for Migration and Data Replication

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/new-aws-dms-serverless-automatically-provisions-and-scales-capacity-for-migration-and-data-replication/

With the vast amount of data being created today, organizations are moving to the cloud to take advantage of the security, reliability, and performance of fully managed database services. To facilitate database and analytics migrations, you can use AWS Database Migration Service (AWS DMS). First launched in 2016, AWS DMS offers a simple migration process that automates database migration projects, saving time, resources, and money.

Although you can start AWS DMS migration with a few clicks through the console, you still need to do research and planning to determine the required capacity before migrating. It can be challenging to know how to properly scale capacity ahead of time, especially when simultaneously migrating many workloads or continuously replicating data. On top of that, you also need to continually monitor usage and manually scale capacity to ensure optimal performance.

Introducing AWS DMS Serverless
Today, I’m excited to tell you about AWS DMS Serverless, a new serverless option in AWS DMS that automatically sets up, scales, and manages migration resources to make your database migrations easier and more cost-effective.

Here’s a quick preview on how AWS DMS Serverless works:

AWS DMS Serverless removes the guesswork of figuring out required compute resources and handling the operational burden needed to ensure a high-performance, uninterrupted migration. It performs automatic capacity provisioning, scaling, and capacity optimization of migrations, allowing you to quickly begin migrations with minimal oversight.

At launch, AWS DMS Serverless supports Microsoft SQL Server, PostgreSQL, MySQL, and Oracle as data sources. As for data targets, AWS DMS Serverless supports a wide range of databases and analytics services, from Amazon Aurora, Amazon Relational Database Service (Amazon RDS), Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon DynamoDB, and more. AWS DMS Serverless continues to add support for new data sources and targets. Visit Supported Engine Versions to stay updated.

With a variety of sources and targets supported by AWS DMS Serverless, many scenarios become possible. You can use AWS DMS Serverless to migrate databases and help to build modern data strategies by synchronizing ongoing data replications into data lakes (e.g., Amazon S3) or data warehouses (e.g., Amazon Redshift) from multiple, perhaps disparate data sources.

How AWS DMS Serverless Works
Let me show you how you can get started with AWS DMS Serverless. In this post, I migrate my data from a source database running on PostgreSQL to a target MySQL database running on Amazon RDS. The following screenshot shows my source database with dummy data:

As for the target, I’ve set up a MySQL database running in Amazon RDS. The following screenshot shows my target database:

Getting starting with AWS DMS Serverless is similar to how AWS DMS works today. AWS DMS Serverless requires me to complete the setup tasks such as creating a virtual private cloud (VPC) to defining source and target endpoints. If this is your first time working with AWS DMS, you can learn more by visiting Prerequisites for AWS Database Migration Service.

To connect to a data store, AWS DMS needs endpoints for both source and target data stores. An endpoint provides all necessary information including connection, data store type, and location to my data stores. The following image shows an endpoint I’ve created for my target database:

When I have finished setting up the endpoints, I can begin to create a replication by selecting the Create replication button on the Serverless replications page. Replication is a new concept introduced in AWS DMS Serverless to abstract instances and tasks that we normally have in standard AWS DMS. Additionally, the capacity resources are managed independently for each replication.

On the Create replication page, I need to define some configurations. This starts with defining Name, then specifying Source database endpoint and Target database endpoint. If you don’t find your endpoints, make sure you’re selecting database engines supported by AWS DMS Serverless.

After that, I need to specify the Replication type. There are three types of replication available in AWS DMS Serverless:

  • Full load — If I need to migrate all existing data in source database
  • Change data capture (CDC) — If I have to replicate data changes from source to target database.
  • Full load and change data capture (CDC) — If I need to migrate existing data and replicate data changes from source to target database.

In this example, I chose Full load and change data capture (CDC) because I need to migrate existing data and continuously update the target database for ongoing changes on the source database.

In the Settings section, I can also enable logging with Amazon CloudWatch, which makes it easier for me to monitor replication progress over time.

As with standard AWS DMS, in AWS DMS Serverless, I can also configure Selection rules in Table mappings to define filters that I need to replicate from table columns in the source data store.

I can also use Transformation rules if I need to rename a schema or table or add a prefix or suffix to a schema or table.

In the Capacity section, I can set the range for required capacity to perform replication by defining the minimum and maximum DCU (DMS capacity units). The minimum DCU setting is optional because AWS DMS Serverless determines the minimum DCU based on an assessment of the replication workload. During replication process, AWS DMS uses this range to scale up and down based on CPU utilization, connections, and available memory.

Setting the maximum capacity allows you to manage costs by making sure that AWS DMS Serverless never consumes more resources than you have budgeted for. When you define the maximum DCU, make sure that you choose a reasonable capacity so that AWS DMS Serverless can handle large bursts of data transaction volumes. If traffic volume decreases, AWS DMS Serverless scales capacity down again, and you only pay for what you need. For cases in which you want to change the minimum and maximum DCU settings, you have to stop the replication process first, make the changes, and run the replication again.

When I’m finished with configuring replication, I select Create replication.

When my replication is created, I can view more details of my replication and start the process by selecting Start.

After my replication runs for around 40 minutes, I can monitor replication progress in the Monitoring tab. AWS DMS Serverless also has a CloudWatch metric called Capacity utilization, which indicates the use of capacity to run replication according to the range defined as minimum and maximum DCU. The following screenshot shows the capacity scales up in the CloudWatch metrics chart.

When the replication finishes its process, I see the capacity starting to decrease. This indicates that in addition to AWS DMS Serverless successfully scaling up to the required capacity, it can also scale down within the range I have defined.

Finally, all I need to do is verify whether my data has been successfully replicated into the target data store. I need to connect to the target, run a select query, and check if all data has been successfully replicated from the source.

Now Available
AWS DMS Serverless is now available in all commercial regions where standard AWS DMS is available, and you can start using it today. For more information about benefits, use cases, how to get started, and pricing details, refer to AWS DMS Serverless.

Happy migrating!
Donnie

AWS Week in Review – March 20, 2023

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-week-in-review-march-20-2023/

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

A new week starts, and Spring is almost here! If you’re curious about AWS news from the previous seven days, I got you covered.

Last Week’s Launches
Here are the launches that got my attention last week:

Picture of an S3 bucket and AWS CEO Adam Selipsky.Amazon S3 – Last week there was AWS Pi Day 2023 celebrating 17 years of innovation since Amazon S3 was introduced on March 14, 2006. For the occasion, the team released many new capabilities:

Amazon Linux 2023 – Our new Linux-based operating system is now generally available. Sébastien’s post is full of tips and info.

Application Auto Scaling – Now can use arithmetic operations and mathematical functions to customize the metrics used with Target Tracking policies. You can use it to scale based on your own application-specific metrics. Read how it works with Amazon ECS services.

AWS Data Exchange for Amazon S3 is now generally available – You can now share and find data files directly from S3 buckets, without the need to create or manage copies of the data.

Amazon Neptune – Now offers a graph summary API to help understand important metadata about property graphs (PG) and resource description framework (RDF) graphs. Neptune added support for Slow Query Logs to help identify queries that need performance tuning.

Amazon OpenSearch Service – The team introduced security analytics that provides new threat monitoring, detection, and alerting features. The service now supports OpenSearch version 2.5 that adds several new features such as support for Point in Time Search and improvements to observability and geospatial functionality.

AWS Lake Formation and Apache Hive on Amazon EMR – Introduced fine-grained access controls that allow data administrators to define and enforce fine-grained table and column level security for customers accessing data via Apache Hive running on Amazon EMR.

Amazon EC2 M1 Mac Instances – You can now update guest environments to a specific or the latest macOS version without having to tear down and recreate the existing macOS environments.

AWS Chatbot – Now Integrates With Microsoft Teams to simplify the way you troubleshoot and operate your AWS resources.

Amazon GuardDuty RDS Protection for Amazon Aurora – Now generally available to help profile and monitor access activity to Aurora databases in your AWS account without impacting database performance

AWS Database Migration Service – Now supports validation to ensure that data is migrated accurately to S3 and can now generate an AWS Glue Data Catalog when migrating to S3.

AWS Backup – You can now back up and restore virtual machines running on VMware vSphere 8 and with multiple vNICs.

Amazon Kendra – There are new connectors to index documents and search for information across these new content: Confluence Server, Confluence Cloud, Microsoft SharePoint OnPrem, Microsoft SharePoint Cloud. This post shows how to use the Amazon Kendra connector for Microsoft Teams.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
A few more blog posts you might have missed:

Example of a geospatial query.Women founders Q&A – We’re talking to six women founders and leaders about how they’re making impacts in their communities, industries, and beyond.

What you missed at that 2023 IMAGINE: Nonprofit conference – Where hundreds of nonprofit leaders, technologists, and innovators gathered to learn and share how AWS can drive a positive impact for people and the planet.

Monitoring load balancers using Amazon CloudWatch anomaly detection alarms – The metrics emitted by load balancers provide crucial and unique insight into service health, service performance, and end-to-end network performance.

Extend geospatial queries in Amazon Athena with user-defined functions (UDFs) and AWS Lambda – Using a solution based on Uber’s Hexagonal Hierarchical Spatial Index (H3) to divide the globe into equally-sized hexagons.

How cities can use transport data to reduce pollution and increase safety – A guest post by Rikesh Shah, outgoing head of open innovation at Transport for London.

For AWS open-source news and updates, here’s the latest newsletter curated by Ricardo to bring you the most recent updates on open-source projects, posts, events, and more.

Upcoming AWS Events
Here are some opportunities to meet:

AWS Public Sector Day 2023 (March 21, London, UK) – An event dedicated to helping public sector organizations use technology to achieve more with less through the current challenging conditions.

Women in Tech at Skills Center Arlington (March 23, VA, USA) – Let’s celebrate the history and legacy of women in tech.

The AWS Summits season is warming up! You can sign up here to know when registration opens in your area.

That’s all from me for this week. Come back next Monday for another Week in Review!

Danilo

New – A Fully Managed Schema Conversion in AWS Database Migration Service

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/new-a-fully-managed-schema-conversion-in-aws-database-migration-service/

Since we launched AWS Database Migration Service (AWS DMS) in 2016, customers have securely migrated more than 800,000 databases to AWS with minimal downtime. AWS DMS supports migration between 20+ database and analytics engines, such as Oracle to Amazon Aurora MySQL, MySQL to Amazon Relational Database (Amazon RDS) MySQL, Microsoft SQL Server to Amazon Aurora PostgreSQL, MongoDB to Amazon DocumentDB, Oracle to Amazon Redshift, and to and from Amazon Simple Storage Service (Amazon S3).

Specifically, the AWS Schema Conversion Tool (AWS SCT) makes heterogeneous database and data warehouse migrations predictable and can automatically convert the source schema and a majority of the database code objects, including views, stored procedures, and functions, to a format compatible with the target engine. For example, it supports the conversion of Oracle PL/SQL and SQL Server T-SQL code to equivalent code in the Amazon Aurora MySQL dialect of SQL or the equivalent PL/pgSQL code in PostgreSQL. You can download the AWS SCT for your platform, including Windows or Linux (Fedora and Ubuntu).

Today we announce fully managed AWS DMS Schema Conversion, which streamlines database migrations by making schema assessment and conversion available inside AWS DMS. With DMS Schema Conversion, you can now plan, assess, convert and migrate under one central DMS service. You can access features of DMS Schema Conversion in the AWS Management Console without downloading and executing AWS SCT.

AWS DMS Schema Conversion automatically converts your source database schemas, and a majority of the database code objects to a format compatible with the target database. This includes tables, views, stored procedures, functions, data types, synonyms, and so on, similar to AWS SCT. Any objects that cannot be automatically converted are clearly marked as action items with prescriptive instructions on how to migrate to AWS manually.

In this launch, DMS Schema Conversion supports the following databases as sources for migration projects:

  • Microsoft SQL Server version 2008 R2 and higher
  • Oracle version 10.2 and later, 11g and up to 12.2, 18c, and 19c

DMS Schema Conversion supports the following databases as targets for migration projects:

  • Amazon RDS for MySQL version 8.x
  • Amazon RDS for PostgreSQL version 14.x

Setting Up AWS DMS Schema Conversion
To get started with DMS Schema Conversion, and if it is your first time using AWS DMS, complete the setup tasks to create a virtual private cloud (VPC) using the Amazon VPC service, source, and target database. To learn more, see Prerequisites for AWS Database Migration Service in the AWS documentation.

In the AWS DMS console, you can see new menus to set up Instance profiles, add Data providers, and create Migration projects.

Before you create your migration project, set up an instance profile by choosing Instance profiles in the left pane. An instance profile specifies network and security settings for your DMS Schema Conversion instances. You can create multiple instance profiles and select an instance profile to use for each migration project.

Choose Create instance profile and specify your default VPC or a new VPC, Amazon Simple Storage Service (Amazon S3) bucket to store your schema conversion metadata, and additional settings such as AWS Key Management Service (AWS KMS) keys.

You can create the simplest network configuration with a single VPC configuration. If your source or target data providers are in different VPCs, you can create your instance profile in one of the VPCs, and then link these two VPCs by using VPC peering.

Next, you can add data providers that store the data store type and location information about your source and target databases by choosing Data providers in the left pane. For each database, you can create a single data provider and use it in multiple migration projects.

Your data provider can be a fully managed Amazon RDS instance or a self-managed engine running either on-premises or on an Amazon Elastic Compute Cloud (Amazon EC2) instance.

Choose Create data provider to create a new data provider. You can set the type of the database location manually, such as database engine, domain name or IP address, port number, database name, and so on, for your data provider. Here, I have selected an RDS database instance.

After you create a data provider, make sure that you add database connection credentials in AWS Secrets Manager. DMS Schema Conversion uses this information to connect to a database.

Converting your database schema with AWS DMS Schema Conversion
Now, you can create a migration project for DMS Schema Conversion by choosing Migration projects in the left pane. A migration project describes your source and target data providers, your instance profile, and migration rules. You can also create multiple migration projects for different source and target data providers.

Choose Create migration project and select your instance profile and source and target data providers for DMS Schema Conversion.

After creating your migration project, you can use the project to create assessment reports and convert your database schema. Choose your migration project from the list, then choose the Schema conversion tab and click Launch schema conversion.

Migration projects in DMS Schema Conversion are always serverless. This means that AWS DMS automatically provisions the cloud resources for your migration projects, so you don’t need to manage schema conversion instances.

Of course, the first launch of DMS Schema Conversion requires starting a schema conversion instance, which can take up to 10–15 minutes. This process also reads the metadata from the source and target databases. After a successful first launch, you can access DMS Schema Conversion faster.

An important part of DMS Schema Conversion is that it generates a database migration assessment report that summarizes all of the schema conversion tasks. It also details the action items for schema that cannot be converted to the DB engine of your target database instance. You can view the report in the AWS DMS console or export it as a comma-separated value (.csv) file.

To create your assessment report, choose the source database schema or schema items that you want to assess. After you select the checkboxes, choose Assess in the Actions menu in the source database pane. This report will be archived with .csv files in your S3 bucket. To change the S3 bucket, edit the schema conversion settings in your instance profile.

Then, you can apply the converted code to your target database or save it as a SQL script. To apply converted code, choose Convert in the pane of Source data provider and then Apply changes in the pane of Target data provider.

Once the schema has been converted successfully, you can move on to the database migration phase using AWS DMS. To learn more, see Getting started with AWS Database Migration Service in the AWS documentation.

Now Available
AWS DMS Schema Conversion is now available in the US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), and Europe (Stockholm) Regions, and you can start using it today.

To learn more, see the AWS DMS Schema Conversion User Guide, give it a try, and please send feedback to AWS re:Post for AWS DMS or through your usual AWS support contacts.

Channy

Convert Oracle XML BLOB data to JSON using Amazon EMR and load to Amazon Redshift

Post Syndicated from Abhilash Nagilla original https://aws.amazon.com/blogs/big-data/convert-oracle-xml-blob-data-to-json-using-amazon-emr-and-load-to-amazon-redshift/

In legacy relational database management systems, data is stored in several complex data types, such XML, JSON, BLOB, or CLOB. This data might contain valuable information that is often difficult to transform into insights, so you might be looking for ways to load and use this data in a modern cloud data warehouse such as Amazon Redshift. One such example is migrating data from a legacy Oracle database with XML BLOB fields to Amazon Redshift, by performing preprocessing and conversion of XML to JSON using Amazon EMR. In this post, we describe a solution architecture for this use case, and show you how to implement the code to handle the XML conversion.

Solution overview

The first step in any data migration project is to capture and ingest the data from the source database. For this task, we use AWS Database Migration Service (AWS DMS), a service that helps you migrate databases to AWS quickly and securely. In this example, we use AWS DMS to extract data from an Oracle database with XML BLOB fields and stage the same data in Amazon Simple Storage Service (Amazon S3) in Apache Parquet format. Amazon S3 is an object storage service offering industry-leading scalability, data availability, security, and performance, and is the storage of choice for setting up data lakes on AWS.

After the data is ingested into an S3 staging bucket, we used Amazon EMR to run a Spark job to perform the conversion of XML fields to JSON fields, and the results are loaded in a curated S3 bucket. Amazon EMR runtime for Apache Spark can be over three times faster than clusters without EMR runtime, and has 100% API compatibility with standard Apache Spark. This improved performance means your workloads run faster and it saves you compute costs, without making any changes to your application.

Finally, transformed and curated data is loaded into Amazon Redshift tables using the COPY command. The Amazon Redshift table structure should match the number of columns and the column data types in the source file. Because we stored the data as a Parquet file, we specify the SERIALIZETOJSON option in the COPY command. This allows us to load complex types, such as structure and array, in a column defined as SUPER data type in the table.

The following architecture diagram shows the end-to-end workflow.

In detail, AWS DMS migrates data from the source database tables into Amazon S3, in Parquet format. Apache Spark on Amazon EMR reads the raw data, transforms the XML data type into JSON, and saves the data to the curated S3 bucket. In our code, we used an open-source library, called spark-xml, to parse and query the XML data.

In the rest of this post, we assume that the AWS DMS tasks have already run and created the source Parquet files in the S3 staging bucket. If you want to set up AWS DMS to read from an Oracle database with LOB fields, refer to Effectively migrating LOB data to Amazon S3 from Amazon RDS for Oracle with AWS DMS or watch the video Migrate Oracle to S3 Data lake via AWS DMS.

Prerequisites

If you want to follow along with the examples in this post using your AWS account, we provide an AWS CloudFormation template you can launch by choosing Launch Stack:

BDB-2063-launch-cloudformation-stack

Provide a stack name and leave the default settings for everything else. Wait for the stack to display Create Complete (this should only take a few minutes) before moving on to the other sections.

The template creates the following resources:

  • A virtual private cloud (VPC) with two private subnets that have routes to an Amazon S3 VPC endpoint
  • The S3 bucket {stackname}-s3bucket-{xxx}, which contains the following folders:
    • libs – Contains the JAR file to add to the notebook
    • notebooks – Contains the notebook to interactively test the code
    • data – Contains the sample data
  • An Amazon Redshift cluster, in one of the two private subnets, with a database named rs_xml_db and a schema named rs_xml
  • A secret (rs_xml_db) in AWS Secrets Manager
  • An EMR cluster

The CloudFormation template shared in this post is purely for demonstration purposes only. Please conduct your own security review and incorporate best practices prior to any production deployment using artifacts from the post.

Finally, some basic knowledge of Python and Spark DataFrames can help you review the transformation code, but isn’t mandatory to complete the example.

Understanding the sample data

In this post, we use college students’ course and subjects sample data that we created. In the source system, data consists of flat structure fields, like course_id and course_name, and an XML field that includes all the course material and subjects involved in the respective course. The following screenshot is an example of the source data, which is staged in an S3 bucket as a prerequisite step.

We can observe that the column study_material_info is an XML type field and contains nested XML tags in it. Let’s see how to convert this nested XML field to JSON in the subsequent steps.

Run a Spark job in Amazon EMR to transform the XML fields in the raw data to JSON

In this step, we use an Amazon EMR notebook, which is a managed environment to create and open Jupyter Notebook and JupyterLab interfaces. It enables you to interactively analyze and visualize data, collaborate with peers, and build applications using Apache Spark on EMR clusters. To open the notebook, follow these steps:

  1. On the Amazon S3 console, navigate to the bucket you created as a prerequisite step.
  2. Download the file in the notebooks folder.
  3. On the Amazon EMR console, choose Notebooks in the navigation pane.
  4. Choose Create notebook.
  5. For Notebook name, enter a name.
  6. For Cluster, select Choose an existing cluster.
  7. Select the cluster you created as a prerequisite.
  8. For Security Groups, choose BDB1909-EMR-LIVY-SG and BDB1909-EMR-Notebook-SG
  9. For AWS Service Role, choose the role bdb1909-emrNotebookRole-{xxx}.
  10. For Notebook location, specify the S3 path in the notebooks folder (s3://{stackname}-s3bucket-xxx}/notebooks/).
  11. Choose Create notebook.
  12. When the notebook is created, choose Open in JupyterLab.
  13. Upload the file you downloaded earlier.
  14. Open the new notebook.

    The notebook should look as shown in the following screenshot, and it contains a script written in Scala.
  15. Run the first two cells to configure Apache Spark with the open-source spark-xml library and import the needed modules.The spark-xml package allows reading XML files in local or distributed file systems as Spark DataFrames. Although primarily used to convert (portions of) large XML documents into a DataFrame, spark-xml can also parse XML in a string-valued column in an existing DataFrame with the from_xml function, in order to add it as a new column with parsed results as a struct.
  16. To do so, in the third cell, we load the data from the Parquet file generated by AWS DMS into a DataFrame, then we extract the attribute that contains the XML code (STUDY_MATERIAL_INFO) and map it to a string variable name payloadSchema.
  17. We can now use the payloadSchema in the from_xml function to convert the field STUDY_MATERIAL_INFO into a struct data type and added it as a column named course_material in a new DataFrame parsed.
  18. Finally, we can drop the original field and write the parsed DataFrame to our curated zone in Amazon S3.

Due to the structure differences between DataFrame and XML, there are some conversion rules from XML data to DataFrame and from DataFrame to XML data. More details and documentation are available XML Data Source for Apache Spark.

When we convert from XML to DataFrame, attributes are converted as fields with the heading prefix attributePrefix (underscore (_) is the default). For example, see the following code:

  <book category="undergraduate">
    <title lang="en">Introduction to Biology</title>
    <author>Demo Author 1</author>
    <year>2005</year>
    <price>30.00</price>
  </book>

It produces the following schema:

root
 |-- category: string (nullable = true)
 |-- title: struct (nullable = true)
 |    |-- _VALUE: string (nullable = true)
 |    |-- _lang: string (nullable = true)
 |-- author: string (nullable = true)
 |-- year: string (nullable = true)
 |-- price: string (nullable = true)

Next, we have a value in an element that has no child elements but attributes. The value is put in a separate field, valueTag. See the following code:

<title lang="en">Introduction to Biology</title>

It produces the following schema, and the tag lang is converted into the _lang field inside the DataFrame:

|-- title: struct (nullable = true)
 |    |-- _VALUE: string (nullable = true)
 |    |-- _lang: string (nullable = true)

Copy curated data into Amazon Redshift and query tables seamlessly

Because our semi-structured nested dataset is already written in the S3 bucket as Apache Parquet formatted files, we can use the COPY command with the SERIALIZETOJSON option to ingest data into Amazon Redshift. The Amazon Redshift table structure should match the metadata of the Parquet files. Amazon Redshift can replace any Parquet columns, including structure and array types, with SUPER data columns.

The following code demonstrates CREATE TABLE example to create a staging table.

create table rs_xml_db.public.stg_edw_course_catalog 
(
course_id bigint,
course_name character varying(5000),
course_material super
);

The following code uses the COPY example to load from Parquet format:

COPY rs_xml_db.public.stg_edw_course_catalog FROM 's3://<<your Amazon S3 Bucket for curated data>>/data/target/<<your output parquet file>>' 
IAM_ROLE '<<your IAM role>>' 
FORMAT PARQUET SERIALIZETOJSON; 

By using semistructured data support in Amazon Redshift, you can ingest and store semistructured data in your Amazon Redshift data warehouses. With the SUPER data type and PartiQL language, Amazon Redshift expands the data warehouse capability to integrate with both SQL and NoSQL data sources. The SUPER data type only supports up to 1 MB of data for an individual SUPER field or object. Note, the JSON object may be stored in a SUPER data type, but reading this data using JSON functions currently has a VARCHAR (65535 byte) limit. See Limitations for more details.

The following example shows how nested JSON can be easily accessed using SELECT statements:

SELECT DISTINCT bk._category
	,bk.author
	,bk.price
	,bk.year
	,bk.title._lang
FROM rs_xml_db.public.stg_edw_course_catalog main
INNER JOIN main.course_material.book bk ON true;

The following screenshot shows our results.

Clean up

To avoid incurring future charges, first delete the notebook and the related files on Amazon S3 bucket as explained in this EMR documentation page then the CloudFormation stack.

Conclusion

This post demonstrated how to use AWS services like AWS DMS, Amazon S3, Amazon EMR, and Amazon Redshift to seamlessly work with complex data types like XML and perform historical migrations when building a cloud data lake house on AWS. We encourage you to try this solution and take advantage of all the benefits of these purpose-built services.

If you have questions or suggestions, please leave a comment.


About the authors

Abhilash Nagilla is a Sr. Specialist Solutions Architect at AWS, helping public sector customers on their cloud journey with a focus on AWS analytics services. Outside of work, Abhilash enjoys learning new technologies, watching movies, and visiting new places.

Avinash Makey is a Specialist Solutions Architect at AWS. He helps customers with data and analytics solutions in AWS. Outside of work he plays cricket, tennis and volleyball in free time.

Fabrizio Napolitano is a Senior Specialist SA for DB and Analytics. He has worked in the analytics space for the last 20 years, and has recently and quite by surprise become a Hockey Dad after moving to Canada.

Stream change data to Amazon Kinesis Data Streams with AWS DMS

Post Syndicated from Sukhomoy Basak original https://aws.amazon.com/blogs/big-data/stream-change-data-to-amazon-kinesis-data-streams-with-aws-dms/

In this post, we discuss how to use AWS Database Migration Service (AWS DMS) native change data capture (CDC) capabilities to stream changes into Amazon Kinesis Data Streams.

AWS DMS is a cloud service that makes it easy to migrate relational databases, data warehouses, NoSQL databases, and other types of data stores. You can use AWS DMS to migrate your data into the AWS Cloud or between combinations of cloud and on-premises setups. AWS DMS also helps you replicate ongoing changes to keep sources and targets in sync.

CDC refers to the process of identifying and capturing changes made to data in a database and then delivering those changes in real time to a downstream system. Capturing every change from transactions in a source database and moving them to the target in real time keeps the systems synchronized, and helps with real-time analytics use cases and zero-downtime database migrations.

Kinesis Data Streams is a fully managed streaming data service. You can continuously add various types of data such as clickstreams, application logs, and social media to a Kinesis stream from hundreds of thousands of sources. Within seconds, the data will be available for your Kinesis applications to read and process from the stream.

AWS DMS can do both replication and migration. Kinesis Data Streams is most valuable in the replication use case because it lets you react to replicated data changes in other integrated AWS systems.

This post is an update to the post Use the AWS Database Migration Service to Stream Change Data to Amazon Kinesis Data Streams. This new post includes steps required to configure AWS DMS and Kinesis Data Streams for a CDC use case. With Kinesis Data Streams as a target for AWS DMS, we make it easier for you to stream, analyze, and store CDC data. AWS DMS uses best practices to automatically collect changes from a data store and stream them to Kinesis Data Streams.

With the addition of Kinesis Data Streams as a target, we’re helping customers build data lakes and perform real-time processing on change data from your data stores. You can use AWS DMS in your data integration pipelines to replicate data in near-real time directly into Kinesis Data Streams. With this approach, you can build a decoupled and eventually consistent view of your database without having to build applications on top of a database, which is expensive. You can refer to the AWS whitepaper AWS Cloud Data Ingestion Patterns and Practices for more details on data ingestion patters.

AWS DMS sources for real-time change data

The following diagram illustrates that AWS DMS can use many of the most popular database engines as a source for data replication to a Kinesis Data Streams target. The database source can be a self-managed engine running on an Amazon Elastic Compute Cloud (Amazon EC2) instance or an on-premises database, or it can be on Amazon Relational Database Service (Amazon RDS), Amazon Aurora, or Amazon DocumentDB (with MongoDB availability).

Kinesis Data Streams can collect, process, and store data streams at any scale in real time and write to AWS Glue, which is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. You can use Amazon EMR for big data processing, Amazon Kinesis Data Analytics to process and analyze streaming data , Amazon Kinesis Data Firehose to run ETL (extract, transform, and load) jobs on streaming data, and AWS Lambda as a serverless compute for further processing, transformation, and delivery of data for consumption.

You can store the data in a data warehouse like Amazon Redshift, which is a cloud-scale data warehouse, and in an Amazon Simple Storage Service (Amazon S3) data lake for consumption. You can use Kinesis Data Firehose to capture the data streams and load the data into S3 buckets for further analytics.

Once the data is available in Kinesis Data Streams targets (as shown in the following diagram), you can visualize it using Amazon QuickSight; run ad hoc queries using Amazon Athena; access, process, and analyze it using an Amazon SageMaker notebook instance; and efficiently query and retrieve structured and semi-structured data from files in Amazon S3 without having to load the data into Amazon Redshift tables using Amazon Redshift Spectrum.

Solution overview

In this post, we describe how to use AWS DMS to load data from a database to Kinesis Data Streams in real time. We use a SQL Server database as example, but other databases like Oracle, Microsoft Azure SQL, PostgreSQL, MySQL, SAP ASE, MongoDB, Amazon DocumentDB, and IBM DB2 also support this configuration.

You can use AWS DMS to capture data changes on the database and then send this data to Kinesis Data Streams. After the streams are ingested in Kinesis Data Streams, they can be consumed by different services like Lambda, Kinesis Data Analytics, Kinesis Data Firehose, and custom consumers using the Kinesis Client Library (KCL) or the AWS SDK.

The following are some use cases that can use AWS DMS and Kinesis Data Streams:

  • Triggering real-time event-driven applications – This use case integrates Lambda and Amazon Simple Notification Service (Amazon SNS).
  • Simplifying and decoupling applications – For example, moving from monolith to microservices. This solution integrates Lambda and Amazon API Gateway.
  • Cache invalidation, and updating or rebuilding indexes – Integrates Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) and Amazon DynamoDB.
  • Data integration across multiple heterogeneous systems – This solution sends data to DynamoDB or another data store.
  • Aggregating data and pushing it to downstream system – This solution uses Kinesis Data Analytics to analyze and integrate different sources and load the results in another data store.

To facilitate the understanding of the integration between AWS DMS, Kinesis Data Streams, and Kinesis Data Firehose, we have defined a business case that you can solve. In this use case, you are the data engineer of an energy company. This company uses Amazon Relational Database Service (Amazon RDS) to store their end customer information, billing information, and also electric meter and gas usage data. Amazon RDS is their core transaction data store.

You run a batch job weekly to collect all the transactional data and send it to the data lake for reporting, forecasting, and even sending billing information to customers. You also have a trigger-based system to send emails and SMS periodically to the customer about their electricity usage and monthly billing information.

Because the company has millions of customers, processing massive amounts of data every day and sending emails or SMS was slowing down the core transactional system. Additionally, running weekly batch jobs for analytics wasn’t giving accurate and latest results for the forecasting you want to do on customer gas and electricity usage. Initially, your team was considering rebuilding the entire platform and avoiding all those issues, but the core application is complex in design, and running in production for many years and rebuilding the entire platform will take years and cost millions.

So, you took a new approach. Instead of running batch jobs on the core transactional database, you started capturing data changes with AWS DMS and sending that data to Kinesis Data Streams. Then you use Lambda to listen to a particular data stream and generate emails or SMS using Amazon SNS to send to the customer (for example, sending monthly billing information or notifying when their electricity or gas usage is higher than normal). You also use Kinesis Data Firehose to send all transaction data to the data lake, so your company can run forecasting immediately and accurately.

The following diagram illustrates the architecture.

In the following steps, you configure your database to replicate changes to Kinesis Data Streams, using AWS DMS. Additionally, you configure Kinesis Data Firehose to load data from Kinesis Data Streams to Amazon S3.

It’s simple to set up Kinesis Data Streams as a change data target in AWS DMS and start streaming data. For more information, see Using Amazon Kinesis Data Streams as a target for AWS Database Migration Service.

To get started, you first create a Kinesis data stream in Kinesis Data Streams, then an AWS Identity and Access Management (IAM) role with minimal access as described in Prerequisites for using a Kinesis data stream as a target for AWS Database Migration Service. After you define your IAM policy and role, you set up your source and target endpoints and replication instance in AWS DMS. Your source is the database that you want to move data from, and the target is the database that you’re moving data to. In our case, the source database is a SQL Server database on Amazon RDS, and the target is the Kinesis data stream. The replication instance processes the migration tasks and requires access to the source and target endpoints inside your VPC.

A Kinesis delivery stream (created in Kinesis Data Firehose) is used to load the records from the database to the data lake hosted on Amazon S3. Kinesis Data Firehose can load data also to Amazon Redshift, Amazon OpenSearch Service, an HTTP endpoint, Datadog, Dynatrace, LogicMonitor, MongoDB Cloud, New Relic, Splunk, and Sumo Logic.

Configure the source database

For testing purposes, we use the database democustomer, which is hosted on a SQL Server on Amazon RDS. Use the following command and script to create the database and table, and insert 10 records:

create database democustomer

use democustomer

create table invoices (
	invoice_id INT,
	customer_id INT,
	billing_date DATE,
	due_date DATE,
	balance INT,
	monthly_kwh_use INT,
	total_amount_due VARCHAR(50)
);
insert into invoices (invoice_id, customer_id, billing_date, due_date, balance, monthly_kwh_use, total_amount_due) values (1, 1219578, '4/15/2022', '4/30/2022', 25, 6, 28);
insert into invoices (invoice_id, customer_id, billing_date, due_date, balance, monthly_kwh_use, total_amount_due) values (2, 1365142, '4/15/2022', '4/28/2022', null, 41, 20.5);
insert into invoices (invoice_id, customer_id, billing_date, due_date, balance, monthly_kwh_use, total_amount_due) values (3, 1368834, '4/15/2022', '5/5/2022', null, 31, 15.5);
insert into invoices (invoice_id, customer_id, billing_date, due_date, balance, monthly_kwh_use, total_amount_due) values (4, 1226431, '4/15/2022', '4/28/2022', null, 47, 23.5);
insert into invoices (invoice_id, customer_id, billing_date, due_date, balance, monthly_kwh_use, total_amount_due) values (5, 1499194, '4/15/2022', '5/1/2022', null, 39, 19.5);
insert into invoices (invoice_id, customer_id, billing_date, due_date, balance, monthly_kwh_use, total_amount_due) values (6, 1221240, '4/15/2022', '5/2/2022', null, 38, 19);
insert into invoices (invoice_id, customer_id, billing_date, due_date, balance, monthly_kwh_use, total_amount_due) values (7, 1235442, '4/15/2022', '4/27/2022', null, 50, 25);
insert into invoices (invoice_id, customer_id, billing_date, due_date, balance, monthly_kwh_use, total_amount_due) values (8, 1306894, '4/15/2022', '5/2/2022', null, 16, 8);
insert into invoices (invoice_id, customer_id, billing_date, due_date, balance, monthly_kwh_use, total_amount_due) values (9, 1343570, '4/15/2022', '5/3/2022', null, 39, 19.5);
insert into invoices (invoice_id, customer_id, billing_date, due_date, balance, monthly_kwh_use, total_amount_due) values (10, 1465198, '4/15/2022', '5/4/2022', null, 47, 23.5);

To capture the new records added to the table, enable MS-CDC (Microsoft Change Data Capture) using the following commands at the database level (replace SchemaName and TableName). This is required if ongoing replication is configured on the task migration in AWS DMS.

EXEC msdb.dbo.rds_cdc_enable_db 'democustomer';
GO
EXECUTE sys.sp_cdc_enable_table @source_schema = N'SchemaName', @source_name =N'TableName', @role_name = NULL;
GO
EXEC sys.sp_cdc_change_job @job_type = 'capture' ,@pollinginterval = 3599;
GO

You can use ongoing replication (CDC) for a self-managed SQL Server database on premises or on Amazon Elastic Compute Cloud (Amazon EC2), or a cloud database such as Amazon RDS or an Azure SQL managed instance. SQL Server must be configured for full backups, and you must perform a backup before beginning to replicate data.

For more information, see Using a Microsoft SQL Server database as a source for AWS DMS.

Configure the Kinesis data stream

Next, we configure our Kinesis data stream. For full instructions, see Creating a Stream via the AWS Management Console. Complete the following steps:

  1. On the Kinesis Data Streams console, choose Create data stream.
  2. For Data stream name¸ enter a name.
  3. For Capacity mode, select On-demand.When you choose on-demand capacity mode, Kinesis Data Streams instantly accommodates your workloads as they ramp up or down. For more information, refer to Choosing the Data Stream Capacity Mode.
  4. Choose Create data stream.
  5. When the data stream is active, copy the ARN.

Configure the IAM policy and role

Next, you configure your IAM policy and role.

  1. On the IAM console, choose Policies in the navigation pane.
  2. Choose Create policy.
  3. Select JSON and use the following policy as a template, replacing the data stream ARN:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "kinesis:PutRecord",
                    "kinesis:PutRecords",
                    "kinesis:DescribeStream"
                ],
                "Resource": "<streamArn>"
            }
        ]
    }

  4. In the navigation pane, choose Roles.
  5. Choose Create role.
  6. Select AWS DMS, then choose Next: Permissions.
  7. Select the policy you created.
  8. Assign a role name and then choose Create role.

Configure the Kinesis delivery stream

We use a Kinesis delivery stream to load the information from the Kinesis data stream to Amazon S3. To configure the delivery stream, complete the following steps:

  1. On the Kinesis console, choose Delivery streams.
  2. Choose Create delivery stream.
  3. For Source, choose Amazon Kinesis Data Streams.
  4. For Destination, choose Amazon S3.
  5. For Kinesis data stream, enter the ARN of the data stream.
  6. For Delivery stream name, enter a name.
  7. Leave the transform and convert options at their defaults.
  8. Provide the destination bucket and specify the bucket prefixes for the events and errors.
  9. Under Buffer hints, compression and encryption, change the buffer size to 1 MB and buffer interval to 60 seconds.
  10. Leave the other configurations at their defaults.

Configure AWS DMS

We use an AWS DMS instance to connect to the SQL Server database and then replicate the table and future transactions to a Kinesis data stream. In this section, we create a replication instance, source endpoint, target endpoint, and migration task. For more information about endpoints, refer to Creating source and target endpoints.

  1. Create a replication instance in a VPC with connectivity to the SQL Server database and associate a security group with enough permissions to access to the database.
  2. On the AWS DMS console, choose Endpoints in the navigation pane.
  3. Choose Create endpoint.
  4. Select Source endpoint.
  5. For Endpoint identifier, enter a label for the endpoint.
  6. For Source engine, choose Microsoft SQL Server.
  7. For Access to endpoint database, select Provide access information manually.
  8. Enter the endpoint database information.
  9. Test the connectivity to the source endpoint.
    Now we create the target endpoint.
  10. On the AWS DMS console, choose Endpoints in the navigation pane.
  11. Choose Create endpoint.
  12. Select Target endpoint.
  13. For Endpoint identifier, enter a label for the endpoint.
  14. For Target engine, choose Amazon Kinesis.
  15. Provide the AWS DMS service role ARN and the data stream ARN.
  16. Test the connectivity to the target endpoint.

    The final step is to create a database migration task. This task replicates the existing data from the SQL Server table to the data stream and replicates the ongoing changes. For more information, see Creating a task.
  17. On the AWS DMS console, choose Database migration tasks.
  18. Choose Create task.
  19. For Task identifier, enter a name for your task.
  20. For Replication instance, choose your instance.
  21. Choose the source and target database endpoints you created.
  22. For Migration type, choose Migrate existing data and replicate ongoing changes.
  23. In Task settings, use the default settings.
  24. In Table mappings, add a new selection rule and specify the schema and table name of the SQL Server database. In this case, our schema name is dbo and the table name is invoices.
  25. For Action, choose Include.

When the task is ready, the migration starts.

After the data has been loaded, the table statistics are updated and you can see the 10 records created initially.

As the Kinesis delivery stream reads the data from Kinesis Data Streams and loads it in Amazon S3, the records are available in the bucket you defined previously.

To check that AWS DMS ongoing replication and CDC are working, use this script to add 1,000 records to the table.

You can see 1,000 inserts on the Table statistics tab for the database migration task.

After about 1 minute, you can see the records in the S3 bucket.

At this point the replication has been activated, and a Lambda function can start consuming the data streams to send emails SMS to the customers through Amazon SNS. More information, refer to Using AWS Lambda with Amazon Kinesis.

Conclusion

With Kinesis Data Streams as an AWS DMS target, you now have a powerful way to stream change data from a database directly into a Kinesis data stream. You can use this method to stream change data from any sources supported by AWS DMS to perform real-time data processing. Happy streaming!

If you have any questions or suggestions, please leave a comment.


About the Authors

Luis Eduardo Torres is a Solutions Architect at AWS based in Bogotá, Colombia. He helps companies to build their business using the AWS cloud platform. He has a great interest in Analytics and has been leading the Analytics track of AWS Podcast in Spanish.

Sukhomoy Basak is a Solutions Architect at Amazon Web Services, with a passion for Data and Analytics solutions. Sukhomoy works with enterprise customers to help them architect, build, and scale applications to achieve their business outcomes.

Modernization pathways for a legacy .NET Framework monolithic application on AWS

Post Syndicated from Ramakant Joshi original https://aws.amazon.com/blogs/architecture/modernization-pathways-for-a-legacy-net-framework-monolithic-application-on-aws/

Organizations aim to deliver optimal technological solutions based on their customers’ needs. Although they may be at any stage in their cloud adoption journey, businesses often end up managing and building monolithic applications. However, there are many challenges to this solution. The internal structure of a monolithic application makes it difficult for developers to maintain code. This creates a steep learning curve for new developers and increases costs. Monoliths require multiple teams to coordinate a single large release, which increases the collaboration and knowledge transfer burden. As a business grows, a monolithic application may struggle to meet the demands of an expanding user base. To address these concerns, customers should evaluate their readiness to modernize their applications in the AWS Cloud to meet their business and technical needs.

We will discuss an approach to modernizing a monolithic three-tier application (MVC pattern): a web tier, an application tier using a .NET Framework, and a data tier with a Microsoft SQL (MSSQL) Server relational database. There are three main modernization pathways for .NET applications: rehosting, replatforming, and refactoring. We recommend following this decision matrix to assess and decide on your migration path, based on your specific requirements. For this blog, we will focus on a replatform and refactor strategy to design loosely coupled microservices, packaged as lightweight containers, and backed by a purpose-built database.

Your modernization journey

The outcomes of your organization’s approach to modernization gives you the ability to scale optimally with your customers’ demands. Let’s dive into a guided approach that achieves your goals of a modern architecture, and at the same time addresses scalability, ease of maintenance, rapid deployment cycles, and cost optimization.

This involves four steps:

  1. Break down the monolith
  2. Containerize your application
  3. Refactor to .NET 6
  4. Migrate to a purpose-built, lower-cost database engine.

1. Break down the monolith

Migration to the Amazon Web Services (AWS) Cloud has many advantages. These can include increased speed to market and business agility, new revenue opportunities, and cost savings. To take full advantage, you should continuously modernize your organization’s applications by refactoring your monolithic applications into microservices.

Decomposing a monolithic application into microservices presents technical challenges that require a solid understanding of the existing code base and context of the business domains. Several patterns are useful to incrementally transform a monolithic application into microservices and other distributed designs. However, the process of refactoring the code base is manual, risky, and time consuming.

To help developers accelerate the transformation, AWS introduced AWS Microservice Extractor for .NET. This helps breakdown architecting and refactoring applications into smaller code projects. Read how AWS Microservice Extractor for .NET helped our partner, Kloia, accelerate the modernization journey of their customers and decompose a monolith.

The next modernization pathway is to containerize your application.

2. Containerize

Why should you move to containers? Containers offer a way to help you build, test, deploy, and redeploy applications on multiple environments. Specifically, Docker Containers provide you with a reliable way to gather your application components and package them together into one build artifact. This is important because modern applications are often composed of a variety of pieces besides code, such as dependencies, binaries, or system libraries. Moving legacy .NET Framework applications to containers helps to optimize operating system utilization and achieve runtime consistency.

To accelerate this process, containerize these applications to Windows containers with AWS App2Container (A2C). A2C is a command line tool for modernizing .NET and java applications into containerized applications. A2C analyzes and builds an inventory of all applications running in virtual machines, on-premises, or in the cloud. Select the application that you want to containerize and A2C packages the application artifact and identified dependencies into container images. Here is a step-by-step article and self-paced workshop to get you started using A2C.
Once your app is containerized, you can choose to self-manage by using Amazon EC2 to host Docker with Windows containers. You can also use Amazon Elastic Container Service (ECS) or Amazon Elastic Kubernetes Service (Amazon EKS). These are fully managed container orchestration services that frees you to focus on building and managing applications instead of your underlying infrastructure. Read Amazon ECS vs Amazon EKS: making sense of AWS container services.

In the next section, we’ll discuss two primary aspects to optimizing costs in our modernization scenario:

  1. Licensing costs of running workloads on Windows servers.
  2. SQL Server licensing cost.

3. Refactor to .NET 6

To address Windows licensing costs, consider moving to a Linux environment by adopting .NET Core and using the Dockerfile for a Linux Container. Customers such as GoDataFeed benefit by porting .NET Framework applications to more recent .NET 6 and running them on AWS. The .NET team has significantly improved performance with .NET 6, including a 30–40% socket performance improvement on Linux. They have added ARM64-specific optimizations in the .NET libraries, which enable customers to run on AWS Graviton.

You may also choose to switch to a serverless option using AWS Lambda (which supports .NET 6 runtime), or run your containers on ECS with Fargate, a serverless, pay-as-you-go compute engine. AWS Fargate powered by AWS Graviton2 processors can reduce cost by up to 20%, and increase performance by up to 40% versus x86 Intel-based instances. If you need full control over an application’s underlying virtual machine (VM), operating system, storage, and patching, run .NET 6 applications on Amazon EC2 Linux instances. These are powered by the latest-generation Intel and AMD processors.

To help customers port their application to .NET 6 faster, AWS added .NET 6 support to Porting Assistant for .NET. Porting Assistant is an analysis tool that scans .NET Framework (3.5+) applications to generate a target .NET Core or .NET 6 compatibility assessment. This helps you to prioritize applications for porting based on effort required. It identifies incompatible APIs and packages from your .NET Framework applications, and finds known replacements. You can refer to a demo video that explains this process.

4. Migrate from SQL Server to a lower-cost database engine

AWS advocates that you build use case-driven, highly scalable, distributed applications suited to your specific needs. From a database perspective, AWS offers 15+ purpose-built engines to support diverse data models. Furthermore, microservices architectures employ loose coupling, so each individual microservice can independently store and retrieve information from its own data store. By deploying the database-per-service pattern, you can choose the most optimal data stores (relational or non-relational databases) for your application and business requirements.

For the purpose of this blog, we will focus on a relational database alternate for SQL Server. To address the SQL Server licensing costs, customers can consider a move to an open-source relational database engine. Amazon Relational Database Service (Amazon RDS) supports MySQL, MariaDB, and PostgreSQL. We will focus on PostgreSQL with a well-defined migration path. Amazon RDS supports two types of Postgres databases: Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL-Compatible Edition. To help you choose, read Is Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL a better choice for me?

Once you’ve decided on the Amazon RDS flavor, the next question would be “what’s the right migration strategy for me?” Consider the following:

  1. Convert your schema
  2. Migrate the data
  3. Refactor your application

Schema conversion

AWS Schema Conversion Tool (SCT) is a free tool that can help you convert your existing database from one engine to another. AWS SCT supports a number of source databases, including Microsoft SQL Server, Oracle, and MySQL. You can choose from target database engines such as Amazon Aurora PostgreSQL-Compatible Edition, or choose to set up a data lake using Amazon S3. AWS SCT provides a graphical user interface that directly connects to the source and target databases to fetch the current schema objects. When connected, you can generate a database migration assessment report to get a high-level summary of the conversion effort and action items.

Data migration

When the schema migration is complete, you can move your data from the source database to the target database. Depending on your application availability requirements, you can run a straightforward extraction job that performs a one-time copy of the source data into the new database. Or, you can use a tool that copies the current data and continues to replicate all changes until you are ready to cut over to the new database. One such tool is AWS Database Migration Service (AWS DMS) that helps you migrate relational databases, data warehouses, NoSQL databases, and other types of data stores.

With AWS DMS, you can perform one-time migrations, and you can replicate ongoing changes to keep sources and targets in sync. When the source and target databases are in sync, you can take your database offline and move your operations to the target database. Read Microsoft SQL Server To Amazon Aurora with PostgreSQL Compatibility for a playbook or use this self-guided workshop to migrate to a PostgreSQL compatible database using SCT and DMS.

Application refactoring

Each database engine has its differences and nuances, and moving to a new database engine such as PostgreSQL from MSSQL Server will require code refactoring. After the initial database migration is completed, manually rewriting application code, switching out database drivers, and verifying that the application behavior hasn’t changed requires significant effort. This involves potential risk of errors when making extensive changes to the application code.

AWS built Babelfish for Aurora PostgreSQL to simplify migrating applications from SQL Server to Amazon Aurora PostgreSQL-Compatible Edition. Babelfish for Aurora PostgreSQL is a new capability for Amazon Aurora PostgreSQL-Compatible Edition that enables Aurora to understand commands from applications written for Microsoft SQL Server. With Babelfish, Aurora PostgreSQL now understands T-SQL, Microsoft SQL Server’s proprietary SQL dialect. It supports the same communications protocol, so your apps that were originally written for SQL Server can now work with Aurora. Read about how to migrate from SQL Server to Babelfish for Aurora PostgreSQL. Make sure you run the Babelfish Compass tool to determine whether the application contains any SQL features not currently supported by Babelfish.

Figure 1 shows the before and after state for your application based on the modernization path described in this blog. The application tier consists of microservices running on Amazon ECS Fargate clusters (or AWS Lambda functions), and the data tier runs on Amazon Aurora (PostgreSQL flavor).

Figure 1. A modernized microservices-based rearchitecture

Figure 1. A modernized microservices-based rearchitecture

Summary

In this post, we showed a migration path for a monolithic .NET Framework application to a modern microservices-based stack on AWS. We discussed AWS tools to break the monolith into microservices, and containerize the application. We also discussed cost optimization strategies by moving to Linux-based systems, and using open-source database engines. If you’d like to know more about modernization strategies, read this prescriptive guide.

Selecting the right database and database migration plan for your workloads

Post Syndicated from Nikhil Anand original https://aws.amazon.com/blogs/architecture/selecting-the-right-database-and-database-migration-plan-for-your-workloads/

There has been a tectonic shift in the approach to hosting enterprise workloads. Companies are rapidly moving from on-premises data centers to cloud-based services. The driving factor has been the ability to innovate faster on the cloud.

Your transition to cloud can be straightforward, but it does go beyond the usual ‘lift-and-shift’ approach. To start with, you’ll need a cloud provider that provides a data store and computing facility. But you’ll be able to grow your business with a provider that has purpose-built resources and a platform that supports innovation. Companies are using higher-level services such as fully managed databases and serverless compute offered by Amazon Web Services (AWS) to get the most out of their cloud adoption journey.

In this blog, we will focus on the database tier of enterprise applications. We will evaluate the reasons for moving to managed or purpose-built databases. Then we’ll discuss in more detail the options you have and how to implement the right database migration plan for your workloads.

Why move to a managed database?

Managed databases, or purpose-built databases, are managed services by AWS that free you to focus on other business and technical priorities. Different types of databases fill certain needs, and so the database tier in an application should never be a one-size-fits-all approach. Based on the kind of application you are developing, you should research managed options that suit your business and enterprise use cases.

Database Type Use Cases AWS Service
Relational Traditional applications, enterprise resource planning (ERP), customer relationship management (CRM), ecommerce
  • Amazon Aurora
  • Amazon RDS
  • Amazon Redshift
Key-Value High-traffic web applications, ecommerce systems, gaming applications
  • Amazon DynamoDB
In-Memory Caching, session management, gaming leaderboards, geospatial applications
  • Amazon ElastiCache
  • Amazon MemoryDB for Redis
Document Content management, catalogs, user profiles
  • Amazon DocumentDB (with MongoDB compatibility)
Wide Column High-scale industrial apps for equipment maintenance, fleet management, and route optimization
  • Amazon Keyspaces
Graph Fraud detection, social networking, recommendation engines
  • Amazon Neptune
Time Series Internet of Things (IoT) applications, DevOps, industrial telemetry
  • Amazon TimeStream
Ledger Systems of record, supply chain, registrations, banking transactions
  • Amazon QLDB

Table 1. Managed databases by AWS

Managed database features

Manageability. The top priority and most valuable asset that you own as a company is your data. While data remains your key asset, spending time on database management is not the optimum use of your time. Managed services have built-in reliable tooling for multiple aspects of database management, which can help your database operate at the highest level.

Availability and disaster recovery. Most managed databases at AWS are highly available by default. For example, Amazon Aurora is designed to offer 99.99% availability, replicating six copies of your data across three Availability Zones (Figure 1). It backs up your data continuously to Amazon S3. It recovers transparently from physical storage failures; instance failover typically takes less than 30 seconds.

Replication across three Availability Zones with Amazon Aurora DB cluster

Figure 1. Replication across three Availability Zones with Amazon Aurora DB cluster

With managed databases, you get multiple options to create a highly available and fault tolerant database. Alternatively, if you choose to self-host a database elsewhere you will have to formulate your own disaster recovery (DR) strategy. This takes meticulous DR planning and relies heavily on a constant monitoring solution.

Elasticity and agility: Cloud offers elasticity for scaling your database. You can scale in minutes and spin up and down additional servers and storage size as needed. It offers you flexibility with capacity planning. You can always reassess your database tier to see if it is over or under provisioned.

Self-managed databases on AWS

If I do not need a managed database, should I still use Amazon EC2 to host my database?

Here are some cases when you may find it beneficial to host your database on Amazon EC2 instances:

  • You need full control over the database and access to its underlying operating system, database installation, and configuration.
  • You are ready to plan for the high availability of your database, and prepare your disaster recovery plan.
  • You want to administer your database, including backups and recovery. You must perform tasks such as patching the operating system and the database, tuning the operating system and database parameters, managing security, and configuring high availability or replication.
  • You want to use features that aren’t currently supported by managed services.
  • You need a specific database version that isn’t supported by AWS managed database service.
  • Your database size and performance needs exceed the limits of the managed service.
  • You want to avoid automatic software patches that might not be compliant with your applications.
  • You want to achieve higher IOPS and storage capacity than the current limits of the managed services.

Can I customize my underlying OS and database environment?

Amazon RDS Custom is a managed database service for applications that require customization of the underlying operating system and database environment. With Amazon RDS Custom, you have access to the underlying EC2 instance that hosts the DB engine. You can access the EC2 instance via secure shell (AWS Systems Manager and SSH) and perform the customizations to suit your application needs.

Choosing which migration plan to implement

The first step in your cloud database migration journey is identifying which database that you want to migrate to. For relational databases, determine the migration strategy. In the majority of the database migrations, you can choose to rehost, replatform, or refactor.

Refer the AWS Prescriptive Guidance for choosing a migration strategy for relational databases.

The next is determining which database migration plan best serves your needs. AWS provides a number of options to help correctly handle your data migration with minimal downtime. Here are the database migration plans that you can consider using for your cloud database adoption journey:

1. AWS-offered AWS Database Migration Service (AWS DMS): AWS Database Migration Service (AWS DMS) is a self-service option for migrating databases. You can use AWS DMS to migrate between homogeneous database types, such as going from one MySQL instance to a new one. You can also use AWS DMS between heterogeneous database types, such as moving from a commercial database like Oracle to a cloud-native relational database like Aurora. Read tutorials about migrating sample data: AWS Database Migration Service Step-by-Step Walkthroughs.

AWS DMS offers minimal downtime, supports widely used databases, supports on-going replication, is offered as a low-cost service, and is highly reliable. If you are looking for an end-to-end service for database migration, consider AWS DMS.

2. AWS Database Migration Service (DMS) with AWS Schema Conversion Tool (SCT): If you are migrating between heterogenous databases, use the AWS Schema Conversion Tool (SCT). It converts the source database schema and the database code objects (like views, stored procedures, and functions) to a format compatible with the target database (Figure 2).

Supported conversions with AWS Schema Conversion Tool

Figure 2. Supported conversions with AWS Schema Conversion Tool

For heterogeneous migrations, consider using AWS DMS with AWS SCT.

3. Database Freedom Program: If you are new to cloud computing or if your database migration plan must be evaluated and validated by industry experts, then try the Database Freedom Program. Database Freedom is a unique program designed to assist customers in migrating to AWS databases and analytics services. They will provide technical advice, migration support, and financial assistance.

You can use the Contact Us page for the Database Freedom program to get in touch with experts that can assist you in your cloud adoption journey.

4. AWS Professional Services and Partners: You may have in-house database expertise, but need end-to-end implementation assistance across different application tiers. Get help from the Professional Services of AWS or connect with the certified network of AWS Partners. AWS Database Migration Service Delivery Partners help customers use AWS DMS to migrate databases to AWS securely, while minimizing application downtime and following best practices.

Conclusion

Migrating to the cloud is a journey that is ever-evolving. To remain focused on your innovations, leverage the managed services of AWS for your migration journey.

I hope this post helps you consider using a managed database service when available, and effectively evaluate and choose the right database migration plan for your enterprise. For any queries, feel free to get in touch with us. We will be happy to help you in your cloud journey.

Happy migrating!

Additional Reading

Insights for CTOs: Part 3 – Growing your business with modern data capabilities

Post Syndicated from Syed Jaffry original https://aws.amazon.com/blogs/architecture/insights-for-ctos-part-3-growing-your-business-with-modern-data-capabilities/

This post was co-wrtiten with Jonathan Hwang, head of Foundation Data Analytics at Zendesk.


In my role as a Senior Solutions Architect, I have spoken to chief technology officers (CTOs) and executive leadership of large enterprises like big banks, software as a service (SaaS) businesses, mid-sized enterprises, and startups.

In this 6-part series, I share insights gained from various CTOs and engineering leaders during their cloud adoption journeys at their respective organizations. I have taken these lessons and summarized architecture best practices to help you build and operate applications successfully in the cloud. This series also covers building and operating cloud applications, security, cloud financial management, modern data and artificial intelligence (AI), cloud operating models, and strategies for cloud migration.

In Part 3, I’ve collaborated with the head of Foundation Analytics at Zendesk, Jonathan Hwang, to show how Zendesk incrementally scaled their data and analytics capabilities to effectively use the insights they collect from customer interactions. Read how Zendesk built a modern data architecture using Amazon Simple Storage Service (Amazon S3) for storage, Apache Hudi for row-level data processing, and AWS Lake Formation for fine-grained access control.

Why Zendesk needed to build and scale their data platform

Zendesk is a customer service platform that connects over 100,000 brands with hundreds of millions of customers via telephone, chat, email, messaging, social channels, communities, review sites, and help centers. They use data from these channels to make informed business decisions and create new and updated products.

In 2014, Zendesk’s data team built the first version of their big data platform in their own data center using Apache Hadoop for incubating their machine learning (ML) initiative. With that, they launched Answer Bot and Zendesk Benchmark report. These products were so successful they soon overwhelmed the limited compute resources available in the data center. By the end of 2017, it was clear Zendesk needed to move to the cloud to modernize and scale their data capabilities.

Incrementally modernizing data capabilities

Zendesk built and scaled their workload to use data lakes on AWS, but soon encountered new architecture challenges:

  • The General Data Protection Regulation (GDPR) “right to be forgotten” rule made it difficult and costly to maintain data lakes, because deleting a small piece of data required reprocessing large datasets.
  • Security and governance was harder to manage when data lake scaled to a larger number of users.

The following sections show you how Zendesk is addressing GDPR rules by evolving from plain Apache Parquet files on Amazon S3 to Hudi datasets on Amazon S3 to enable row level inserts/updates/deletes. To address security and governance, Zendesk is migrating to AWS Lake Formation centralized security for fine-grained access control at scale.

Zendesk’s data platform

Figure 1 shows Zendesk’s current data platform. It consists of three data pipelines: “Data Hub,” “Data Lake,” and “Self Service.”

Zendesk data pipelines

Figure 1. Zendesk data pipelines

Data Lake pipelines

The Data Lake and Data Hub pipelines cover the entire lifecycle of the data from ingestion to consumption.

The Data Lake pipelines consolidate the data from Zendesk’s highly distributed databases into a data lake for analysis.

Zendesk uses Amazon Database Migration Service (AWS DMS) for change data capture (CDC) from over 1,800 Amazon Aurora MySQL databases in eight AWS Regions. It detects transaction changes and applies them to the data lake using Amazon EMR and Hudi.

Zendesk ticket data consists of over 10 billion events and petabytes of data. The data lake files in Amazon S3 are transformed and stored in Apache Hudi format and registered on the AWS Glue catalog to be available as data lake tables for analytics querying and consumption via Amazon Athena.

Data Hub pipelines

The Data Hub pipelines focus on real-time events and streaming analytics use cases with Apache Kafka. Any application at Zendesk can publish events to a global Kafka message bus. Apache Flink ingests these events into Amazon S3.

The Data Hub provides high-quality business data that is highly available and scalable.

Self-managed pipeline

The self-managed pipelines empower product engineering teams to use the data lake for those use cases that don’t fit into our standard integration patterns. All internal Zendesk product engineering teams can use standard tools such as Amazon EMR, Amazon S3, Athena, and AWS Glue to publish their own analytics dataset and share them with other teams.

A notable example of this is Zendesk’s fraud detection engineering team. They publish their fraud detection data and findings through our self-manage data lake platform and use Amazon QuickSight for visualization.

You need fine-grained security and compliance

Data lakes can accelerate growth through faster decision making and product innovation. However, they can also bring new security and compliance challenges:

  • Visibility and auditability. Who has access to what data? What level of access do people have and how/when and who is accessing it?
  • Fine-grained access control. How do you define and enforce least privilege access to subsets of data at scale without creating bottlenecks or key person/team dependencies?

Lake Formation helps address these concerns by auditing data access and offering row- and column-level security and a delegated access control model to create data stewards for self-managed security and governance.

Zendesk used Lake Formation to build a fine-grained access control model that uses row-level security. It detects personally identifiable information (PII) while scaling the data lake for self-managed consumption.

Some Zendesk customers opt out of having their data included in ML or market research. Zendesk uses Lake Formation to apply row-level security to filter out records associated with a list of customer accounts who have opted out of queries. They also help data lake users understand which data lake tables contain PII by automatically detecting and tagging columns in the data catalog using AWS Glue’s PII detection algorithm.

The value of real-time data processing

When you process and consume data closer to the time of its creation, you can make faster decisions. Streaming analytics design patterns, implemented using services like Amazon Managed Streaming for Apache Kafka (Amazon MSK) or Amazon Kinesis, create an enterprise event bus to exchange data between heterogeneous applications in near real time.

For example, it is common to use streaming to augment the traditional database CDC ingestion into the data lake with additional streaming ingestion of application events. CDC is a common data ingestion pattern, but the information can be too low level. This requires application context to be reconstructed in the data lake and business logic to be duplicated in two places, inside the application and in the data lake processing layer. This creates a risk of semantic misrepresentation of the application context.

Zendesk faced this challenge with their CDC data lake ingestion from their Aurora clusters. They created an enterprise event bus built with Apache Kafka to augment their CDC with higher-level application domain events to be exchanged directly between heterogeneous applications.

Zendesk’s streaming architecture

A CDC database ticket table schema can sometimes contain unnecessary and complex attributes that are application specific and do not capture the domain model of the ticket. This makes it hard for downstream consumers to understand and use the data. A ticket domain object may span several database tables when modeled in third normal form, which makes querying for analysts difficult downstream. This is also a brittle integration method because downstream data consumers can easily be impacted when the application logic changes, which makes it hard to derive a common data view.

To move towards event-based communication between microservices, Zendesk created the Platform Data Architecture (PDA) project, which uses a standard object model to represent a higher level, semantic view of their application data. Standard objects are domain objects designed for cross-domain communication and do not suffer from the lower level fragmented scope of database CDC. Ultimately, Zendesk aims to transition their data architecture from a collection of isolated products and data silos into a cohesive unified data platform.

An application view of Zendesk’s streaming architecture

Figure 2. An application view of Zendesk’s streaming architecture

Figure 3 shows how all Zendesk products and users integrate through common standard objects and standard events within the Data Hub. Applications publish and consume standard objects and events to/from the event bus.

For example, a complete ticket standard object will be published to the message bus whenever it is created, updated, or changed. On the consumption side, these events get used by product teams to enable platform capabilities such as search, data export, analytics, and reporting dashboards.

Summary

As Zendesk’s business grew, their data lake evolved from simple Parquet files on Amazon S3 to a modern Hudi-based incrementally updateable data lake. Now, their original coarse-grained IAM security policies use fine-grained access control with Lake Formation.

We have repeatedly seen this incremental architecture evolution achieve success because it reduces the business risk associated with the change and provides sufficient time for your team to learn and evaluate cloud operations and managed services.

Looking for more architecture content? AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

Other posts in this series

Migration updates announced at re:Invent 2021

Post Syndicated from Angélica Ortega original https://aws.amazon.com/blogs/architecture/migration-updates-announced-at-reinvent-2021/

re:Invent is a yearly event that offers learning and networking opportunities for the global cloud computing community. 2021 marks the launch of several new features in different areas of cloud services and migration.

In this blog, we’ll cover some of the most important recent announcements.

AWS Mainframe Modernization (Preview)

Mainframe modernization has become a necessity for many companies. One of the main drivers fueling this requirement is the need for agility, as the market constantly demands new functionalities. The mainframe platform, due to its complex dependencies, long procurement cycles, and escalating costs, makes it impossible for companies to innovate at the needed pace.

Mainframe modernization can be a complex undertaking. To assist you, we have launched a comprehensive platform, called AWS Mainframe Modernization, that enables two popular migration patterns: replatforming, and automated refactoring.

AWS Mainframe Modernization flow

Figure 1. AWS Mainframe Modernization flow

AWS Migration and Modernization Competency

Application modernization is becoming an important migration strategy, especially for strategic business applications. It brings many benefits: software licensing and operation cost optimization, better performance, agility, resilience, and more. Selecting a partner with the required expertise can help reduce the time and risk for these kinds of projects. In the next section, you’ll find a summary of the experience required by a partner to get the AWS Migration and Modernization Competency. More information can be found at AWS Migration Competency Partners.

AWS Application Migration Service (AWS MGN)

AWS MGN is recommended as the primary migration service for lift and shift migrations. Customers currently using AWS Server Migration Service are encouraged to switch to it for future migrations.

Starting in November 2021, AWS MGN supports agentless replication from VMWare vCenter versions 6.7 and 7.0 to the AWS Cloud. This new feature is intended for users who want to rehost their applications to AWS, but cannot install the AWS Replication Agent on individual servers due to company policies or technical restrictions.

AWS Elastic Disaster Recovery

Two of the pillars of the Well-Architected Framework are Operational Excellence and Reliability. Both are directly concerned with the capability of a service to recover and work efficiently. AWS Elastic Disaster Recovery is a new service to help you to minimize downtime and data loss with fast, reliable, and recovery of on-premises and cloud-based applications. It uses storage, compute, point-in-time recovery, and cost-optimization.

AWS Resilience Hub

AWS Resilience Hub is a service designed to help customers define, measure, and manage the resilience of their applications in the cloud. This service helps you define RTO (Recovery Time Objective) and RPO (Recovery Point Objective) and evaluates the configuration to meet the requirements defined. Aligned with the AWS Well-Architected Framework, this service can recover applications deployed with AWS CloudFormation, and integrates with AWS Fault Injection Simulator, AWS Systems Manager, or Amazon CloudWatch.

AWS Migration Hub Strategy Recommendations

One of the critical tasks in a migration is determining the right strategy. AWS Migration Hub can help you build a migration and modernization strategy for applications running on-premises or in AWS. AWS Migration Hub Strategy Recommendations were announced on October 2021. It’s designed to be the starting point for your cloud journey. It helps you to assess the appropriate strategy to transform your portfolios to use the full benefits of cloud services.

AWS Migration Hub Refactor Spaces (Preview)

Refactoring is the migration strategy that requires the biggest effort, but it permits you to take full advantage of cloud-native features to improve agility, performance, and scalability. AWS Migration Hub Refactor Spaces is the starting point for incremental application refactoring to microservices in AWS. It will help you reduce the undifferentiated heavy lifting of building and operating your AWS infrastructure for incremental refactoring.

AWS Database Migration Service

AWS Database Migration Service (AWS DMS) is a service that helps you migrate databases to AWS quickly and securely.

AWS DMS Fleet Advisor is a new free feature of AWS DMS that enables you to quickly build a database and analytics migration plan, by automating the discovery and analysis of your fleet. AWS DMS Fleet Advisor is intended for users looking to migrate a large number of database and analytic servers to AWS.

AWS Microservice Extractor for .NET is a new free tool and simplifies the process of re-architecting applications into smaller code projects. Modernize and transform your .NET applications with an assistive tool that analyzes source code and runtime metrics. It creates a visual representation of your application and its dependencies.

This tool visualizes your applications source code, helps with code refactoring, and assists in extraction of the code base into separate code projects.  Teams can then develop, build, and operate independently to improve agility, uptime, and scalability.

AWS Migration Evaluator

AWS Migration Evaluator (ME) is a migration assessment service that helps you create a directional business case for AWS Cloud planning and migration. Building a business case for the cloud can be a time-consuming process on your own. With Migration Evaluator, organizations can accelerate their evaluation and decision-making for migration to AWS. During 2021, there were some existing improvements to mention:

  • Quick Insights. This new capability of Migration Evaluator, provides customers with a one-page summary of their projected AWS costs, based on measured on-premises provisioning and utilization.
  • Enhanced Microsoft SQL Discovery. This is a new feature of the Migration Evaluator Collector, which assists you by including your SQL Server environment in their migration assessment.
  • Agentless Collection for Dependency Mapping. The ME Collector now enables agentless network traffic collection to be sent to the customer’s AWS Migration Hub account.

AWS Amplify Studio

This is a visual development environment that offers frontend developers new features to accelerate UI development with minimal coding, while integrating with Amplify. Read Introducing AWS Amplify Studio.

Conclusion

Migration is a crucial process for many enterprises as they move from on-premises systems to the cloud. It helps accelerate your cloud journey, and offers additional tools and methodologies created by AWS. AWS has created and is continually improving services and features to optimize the migration process and help you reach your business goals faster.

Related information

How the Georgia Data Analytics Center built a cloud analytics solution from scratch with the AWS Data Lab

Post Syndicated from Kanti Chalasani original https://aws.amazon.com/blogs/big-data/how-the-georgia-data-analytics-center-built-a-cloud-analytics-solution-from-scratch-with-the-aws-data-lab/

This is a guest post by Kanti Chalasani, Division Director at Georgia Data Analytics Center (GDAC). GDAC is housed within the Georgia Office of Planning and Budget to facilitate governed data sharing between various state agencies and departments.

The Office of Planning and Budget (OPB) established the Georgia Data Analytics Center (GDAC) with the intent to provide data accountability and transparency in Georgia. GDAC strives to support the state’s government agencies, academic institutions, researchers, and taxpayers with their data needs. Georgia’s modern data analytics center will help to securely harvest, integrate, anonymize, and aggregate data.

In this post, we share how GDAC created an analytics platform from scratch using AWS services and how GDAC collaborated with the AWS Data Lab to accelerate this project from design to build in record time. The pre-planning sessions, technical immersions, pre-build sessions, and post-build sessions helped us focus on our objectives and tangible deliverables. We built a prototype with a modern data architecture and quickly ingested additional data into the data lake and the data warehouse. The purpose-built data and analytics services allowed us to quickly ingest additional data and deliver data analytics dashboards. It was extremely rewarding to officially release the GDAC public website within only 4 months.

A combination of clear direction from OPB executive stakeholders, input from the knowledgeable and driven AWS team, and the GDAC team’s drive and commitment to learning played a huge role in this success story. GDAC’s partner agencies helped tremendously through timely data delivery, data validation, and review.

We had a two-tiered engagement with the AWS Data Lab. In the first tier, we participated in a Design Lab to discuss our near-to-long-term requirements and create a best-fit architecture. We discussed the pros and cons of various services that can help us meet those requirements. We also had meaningful engagement with AWS subject matter experts from various AWS services to dive deeper into the best practices.

The Design Lab was followed by a Build Lab, where we took a smaller cross section of the bigger architecture and implemented a prototype in 4 days. During the Build Lab, we worked in GDAC AWS accounts, using GDAC data and GDAC resources. This not only helped us build the prototype, but also helped us gain hands-on experience in building it. This experience also helped us better maintain the product after we went live. We were able to continually build on this hands-on experience and share the knowledge with other agencies in Georgia.

Our Design and Build Lab experiences are detailed below.

Step 1: Design Lab

We wanted to stand up a platform that can meet the data and analytics needs for the Georgia Data Analytics Center (GDAC) and potentially serve as a gold standard for other government agencies in Georgia. Our objective with the AWS Data Design Lab was to come up with an architecture that meets initial data needs and provides ample scope for future expansion, as our user base and data volume increased. We wanted each component of the architecture to scale independently, with tighter controls on data access. Our objective was to enable easy exploration of data with faster response times using Tableau data analytics as well as build data capital for Georgia. This would allow us to empower our policymakers to make data-driven decisions in a timely manner and allow State agencies to share data and definitions within and across agencies through data governance. We also stressed on data security, classification, obfuscation, auditing, monitoring, logging, and compliance needs. We wanted to use purpose-built tools meant for specialized objectives.

Over the course of the 2-day Design Lab, we defined our overall architecture and picked a scaled-down version to explore. The following diagram illustrates the architecture of our prototype.

The architecture contains the following key components:

  • Amazon Simple Storage Service (Amazon S3) for raw data landing and curated data staging.
  • AWS Glue for extract, transform, and load (ETL) jobs to move data from the Amazon S3 landing zone to Amazon S3 curated zone in optimal format and layout. We used an AWS Glue crawler to update the AWS Glue Data Catalog.
  • AWS Step Functions for AWS Glue job orchestration.
  • Amazon Athena as a powerful tool for a quick and extensive SQL data analysis and to build a logical layer on the landing zone.
  • Amazon Redshift to create a federated data warehouse with conformed dimensions and star schemas for consumption by Tableau data analytics.

Step 2: Pre-Build Lab

We started with planning sessions to build foundational components of our infrastructure: AWS accounts, Amazon Elastic Compute Cloud (Amazon EC2) instances, an Amazon Redshift cluster, a virtual private cloud (VPC), route tables, security groups, encryption keys, access rules, internet gateways, a bastion host, and more. Additionally, we set up AWS Identity and Access Management (IAM) roles and policies, AWS Glue connections, dev endpoints, and notebooks. Files were ingested via secure FTP, or from a database to Amazon S3 using AWS Command Line Interface (AWS CLI). We crawled Amazon S3 via AWS Glue crawlers to build Data Catalog schemas and tables for quick SQL access in Athena.

The GDAC team participated in Immersion Days for training in AWS Glue, AWS Lake Formation, and Amazon Redshift in preparation for the Build Lab.

We defined the following as the success criteria for the Build Lab:

  • Create ETL pipelines from source (Amazon S3 raw) to target (Amazon Redshift). These ETL pipelines should create and load dimensions and facts in Amazon Redshift.
  • Have a mechanism to test the accuracy of the data loaded through our pipelines.
  • Set up Amazon Redshift in a private subnet of a VPC, with appropriate users and roles identified.
  • Connect from AWS Glue to Amazon S3 to Amazon Redshift without going over the internet.
  • Set up row-level filtering in Amazon Redshift based on user login.
  • Data pipelines orchestration using Step Functions.
  • Build and publish Tableau analytics with connections to our star schema in Amazon Redshift.
  • Automate the deployment using AWS CloudFormation.
  • Set up column-level security for the data in Amazon S3 using Lake Formation. This allows for differential access to data based on user roles to users using both Athena and Amazon Redshift Spectrum.

Step 3: Four-day Build Lab

Following a series of implementation sessions with our architect, we formed the GDAC data lake and organized downstream data pulls for the data warehouse with governed data access. Data was ingested in the raw data landing lake and then curated into a staging lake, where data was compressed and partitioned in Parquet format.

It was empowering for us to build PySpark Extract Transform Loads (ETL) AWS Glue jobs with our meticulous AWS Data Lab architect. We built reusable glue jobs for the data ingestion and curation using the code snippets provided. The days were rigorous and long, but we were thrilled to see our centralized data repository come into fruition so rapidly. Cataloging data and using Athena queries proved to be a fast and cost-effective way for data exploration and data wrangling.

The serverless orchestration with Step Functions allowed us to put AWS Glue jobs into a simple readable data workflow. We spent time designing for performance and partitioning data to minimize cost and increase efficiency.

Database access from Tableau and SQL Workbench/J were set up for my team. Our excitement only grew as we began building data analytics and dashboards using our dimensional data models.

Step 4: Post-Build Lab

During our post-Build Lab session, we closed several loose ends and built additional AWS Glue jobs for initial and historic loads and append vs. overwrite strategies. These strategies were picked based on the nature of the data in various tables. We returned for a second Build Lab to work on building data migration tasks from Oracle Database via VPC peering, file processing using AWS Glue DataBrew, and AWS CloudFormation for automated AWS Glue job generation. If you have a team of 4–8 builders looking for a fast and easy foundation for a complete data analytics system, I would highly recommend the AWS Data Lab.

Conclusion

All in all, with a very small team we were able to set up a sustainable framework on AWS infrastructure with elastic scaling to handle future capacity without compromising quality. With this framework in place, we are moving rapidly with new data feeds. This would not have been possible without the assistance of the AWS Data Lab team throughout the project lifecycle. With this quick win, we decided to move forward and build AWS Control Tower with multiple accounts in our landing zone. We brought in professionals to help set up infrastructure and data compliance guardrails and security policies. We are thrilled to continually improve our cloud infrastructure, services and data engineering processes. This strong initial foundation has paved the pathway to endless data projects in Georgia.


About the Author

Kanti Chalasani serves as the Division Director for the Georgia Data Analytics Center (GDAC) at the Office of Planning and Budget (OPB). Kanti is responsible for GDAC’s data management, analytics, security, compliance, and governance activities. She strives to work with state agencies to improve data sharing, data literacy, and data quality through this modern data engineering platform. With over 26 years of experience in IT management, hands-on data warehousing, and analytics experience, she thrives for excellence.

Vishal Pathak is an AWS Data Lab Solutions Architect. Vishal works with customers on their use cases, architects solutions to solve their business problems, and helps them build scalable prototypes. Prior to his journey with AWS, Vishal helped customers implement BI, data warehousing, and data lake projects in the US and Australia.