Tag Archives: Amazon Redshift

AWS Week In Review – July 18, 2022

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/aws-week-in-review-july-18-2022/

Last week, AWS Summit New York was held in person at the Javits Center with thousands of attendees and over 100 sponsors and partners. During the keynote, Martin Beeby, AWS Principal Developer Advocate, talked about how innovations in cloud infrastructure enable customers to adapt to challenges and seize new opportunities. It included Liz Fong-Jones’s great story of migrating to AWS Graviton at Honeycomb and Elliott Cordo’s story of improving pharmacy experiences at Capsule using AWS analytics and machine learning services.

Watch the full keynote video!

A Recap of AWS Summit NY Announcements
During the keynote, we announced the general availability of some new services:

Amazon Redshift Serverless – This serverless option lets you analyze data at any scale without having to manage data warehouse infrastructure. You can now create multiple serverless endpoints per AWS account and Region using namespaces and workgroups, and benefit from reduced serverless compute costs compared to the preview. To learn more, check out Danilo’s blog post, this demo video, and the latest episode of The Official AWS Podcast. We also introduced row-level security (RLS), which implements fine-grained access control to the rows in tables, and automated materialized views to lower query latency for repeatable workloads.
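
As a quick illustration, RLS is configured with SQL policy statements; the following is a minimal sketch (the table, column, role, and policy names are hypothetical) that restricts a role to rows where region = 'US':

create rls policy policy_us_rows
with (region varchar(20))
using (region = 'US');

attach rls policy policy_us_rows on sales to role analyst;
alter table sales row level security on;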

AWS Cloud WAN – This new network service makes it easy to build and operate wide area networks (WANs) that connect your data centers and branch offices, as well as multiple VPCs in multiple AWS Regions. To learn more, read Seb’s blog post.

Amazon DevOps Guru’s Log Anomaly Detection and Recommendations – This new feature identifies anomalies such as increased latency, error rates, and resource constraints within your app and then sends alerts with a description and actionable recommendations for remediation. To learn more, see Donnie’s blog post, his first as a News Blog writer.

Last Week’s Launches
Here are some other launches that caught my attention last week:

AWS AppConfig, a feature of AWS Systems Manager, makes it easy for customers to quickly and safely configure, validate, and deploy feature flags and application configuration. Now, we have announced AWS AppConfig Extensions, a new capability that allows customers to enhance and extend the capabilities of feature flags and dynamic runtime configuration data.

Available extensions at launch include AppConfig Notification extensions that push messages about configuration updates to Amazon EventBridge, Amazon SNS, or Amazon SQS, and a Jira extension that tracks feature flag changes in AppConfig as Atlassian Jira issues. To get started, read Announcing AWS AppConfig Extensions and AppConfig Extensions.

Amazon VPC Flow Logs for Transit Gateway is a new capability that allows customers to gain deeper visibility and insights into network traffic on AWS Transit Gateway. With this feature, Transit Gateway can export detailed information, such as source/destination IPs, ports, protocols, traffic counters, timestamps, and various metadata for all network flows traversing the Transit Gateway. To learn more, read Introducing VPC Flow Logs for AWS Transit Gateway and Logging network traffic using Transit Gateway Flow Logs.

AWS Lambda Powertools for TypeScript is an open-source developer library that can help you incorporate Well-Architected Serverless best practices focusing on three observability features: distributed tracing (Tracer), structured logging (Logger), and asynchronous business and application metrics (Metrics). Powertools is also available in the Python and Java programming languages. To learn more, see the blog post Simplifying serverless best practices with AWS Lambda Powertools for TypeScript. You can submit feedback, ideas, and issues directly on our GitHub project.

AWS re:Post is a vibrant Q&A community that helps you become even more successful on AWS. You can now add a profile picture or avatar to your account and add inline images such as diagrams or screenshots to support your questions or answers. Add your profile picture and start using inline images today!

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Here are some news items, blog posts, and video series for you to check out:

In July 2021, we notified users about the end of support for Internet Explorer 11, which takes effect on July 31, 2022. The browser will no longer be supported in the AWS Management Console or in web-based services such as Amazon QuickSight, Amazon Chime, Amazon Honeycode, and some other AWS websites. After that date, we can no longer guarantee that features and webpages will function properly on IE 11. For more information, please visit AWS Supported Browsers.

In fall 2021, we began offering a free multi-factor authentication (MFA) security key to AWS account owners in the United States. Now eligible customers can order the free MFA security key through the ordering portal in the AWS Management Console. At this time, only U.S.-based AWS account root users who have spent more than $100 each month over the past 3 months are eligible to place an order. For more information, see our Free MFA Security Key page.

Amazon’s Machine Learning University expands with MLU Explains, a public website containing visual essays that incorporate fun animations and scrollytelling to explain machine learning concepts in an accessible manner. The following animation teaches the concepts of data splitting in machine learning using an example model that attempts to determine whether animals are cats or dogs. To learn more, read the Amazon Science blog post.

This is My Architecture is a video series that showcases innovative architectural solutions on the AWS Cloud by customers and partners. In June and July, over 15 episodes were added, including ones featuring GoDaddy, Riot Games, and Hudl. Each episode examines the most interesting and technically creative elements of each cloud architecture.

Upcoming AWS Events in August
Check your calendars and sign up for these AWS events:

AWS Summit – Registration is open for upcoming in-person AWS Summits that might be close to you in August: Sao Paulo (August 3–4), Anaheim (August 18), Taiwan (August 10–11), Chicago (August 28), and Canberra (August 31).

AWS Innovate – Data Edition – On August 23, learn how a modern data strategy can support your present and future use cases, including steps to build an end-to-end data solution to store and access, analyze and visualize, and even predict.

AWS Innovate – For Every Application Edition – On August 25, learn about a wide selection of AWS solutions across compute, storage, networking, hybrid, and edge infrastructure to help you scale application resources seamlessly and optimally.

Although these two Innovate events will be held in Asia Pacific and Japan time zones, you can view on-demand videos for two months following your registration.

If you’re interested in learning modern development practices live in New York City, I recommend joining AWS Solutions Day on August 10. I love that its advanced topics focus on building new web apps with Java, JavaScript, TypeScript, and GraphQL.

If you’re interested in learning AWS fundamentals and preparing for AWS Certifications, there are several virtual events in August, such as AWS Cloud Practitioner Essentials Day, AWS Technical Essentials Day, and Exam Readiness for AWS Certifications.

That’s all for this week. Check back next Monday for another Week in Review!

— Channy

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Migrate a large data warehouse from Greenplum to Amazon Redshift using AWS SCT – Part 2

Post Syndicated from Suresh Patnam original https://aws.amazon.com/blogs/big-data/part-2-migrate-a-large-data-warehouse-from-greenplum-to-amazon-redshift-using-aws-sct/

In this second post of a multi-part series, we share best practices for choosing the optimal Amazon Redshift cluster and data architecture, converting stored procedures, using compatible functions and queries widely needed for SQL conversions, and optimizing the length of data types for table columns. You can check out the first post of this series for guidance on planning, running, and validating a large-scale data warehouse migration from Greenplum to Amazon Redshift using AWS Schema Conversion Tool (AWS SCT).

Choose your optimal Amazon Redshift cluster

Amazon Redshift has two types of clusters: provisioned and serverless. For provisioned clusters, you need to set them up with the required compute resources. Amazon Redshift Serverless can run high-performance analytics in the cloud at any scale. For more information, refer to Introducing Amazon Redshift Serverless – Run Analytics At Any Scale Without Having to Manage Data Warehouse Infrastructure.

An Amazon Redshift cluster consists of nodes. Each cluster has a leader node and one or more compute nodes. The leader node receives queries from client applications, parses the queries, and develops query run plans. The leader node then coordinates the parallel run of these plans with the compute nodes and aggregates the intermediate results from these nodes. It then returns the results to the client applications.

When determining your type of cluster, consider the following:

  • Estimate the compressed size of your input data, the vCPUs required, and the expected performance. As of this writing, we recommend the Amazon Redshift RA3 instance with managed storage, which scales compute and storage independently for fast query performance.
  • Amazon Redshift provides an automated “Help me choose” option that recommends a cluster configuration based on the size of your data.
  • A main advantage of a cloud Amazon Redshift data warehouse is that you’re no longer locked into hardware purchases the way you are with old guard data warehouses. For faster innovation, you have the option to try different cluster configurations and choose the one that is optimal in terms of performance and cost.
  • At the time of development or pilot, you can usually start with a smaller number of nodes. As you move to production, you can adjust the number of nodes based on your usage pattern. When right-sizing your clusters, we recommend choosing reserved instances to cut down the cost even further. The public-facing utility Simple Replay can help you determine performance against different cluster types and sizes by replaying the customer workload. For provisioned clusters, if you’re planning to use the recommended RA3 instances, you can compare different node types to determine the right instance type.
  • Based on your workload pattern, Amazon Redshift supports resize, pause and stop, and concurrency scaling of the cluster. Amazon Redshift workload management (WLM) enables effective and flexible management of memory and query concurrency.

Create data extraction tasks with AWS SCT

With AWS SCT extraction agents, you can migrate your source tables in parallel. These extraction agents authenticate using a valid user on the data source, allowing you to adjust the resources available for that user during the extraction. AWS SCT agents process the data locally and upload it to Amazon Simple Storage Service (Amazon S3) through the network (via AWS Direct Connect). We recommend having a consistent network bandwidth between your Greenplum machine where the AWS SCT agent is installed and your AWS Region.

If you have tables with around 20 million rows or 1 TB in size, you can use the virtual partitioning feature of AWS SCT to extract data from those tables. This creates several sub-tasks and parallelizes the data extraction process for the table. Therefore, we recommend creating two groups of tasks for each schema that you migrate: one for small tables and one for large tables using virtual partitions.

For more information, refer to Creating, running, and monitoring an AWS SCT data extraction task.

Data architecture

To simplify and modernize your data architecture, consider the following:

  • Establish accountability and authority to enforce enterprise data standards and policies.
  • Formalize the data and analytics operating model between enterprise and business units and functions.
  • Simplify the data technology ecosystem through rationalization and modernization of data assets and tools or technology.
  • Develop organizational constructs that facilitate more robust integration of the business and delivery teams, and build data-oriented products and solutions to address the business problems and opportunities throughout the lifecycle.
  • Back up the data periodically so that if something goes wrong, you have the ability to replay it.
  • Ensure data quality management is included during planning, design, execution, and throughout implementation and maintenance to achieve the desired outcome.
  • Simple is the key to an easy, fast, intuitive, and low-cost solution. Simple scales much better than complex. Simple makes it possible to think big (Invent and Simplify is another Amazon leadership principle). Simplify the legacy process by migrating only the necessary data used in tables and schemas. For example, if you’re performing truncate and load for incremental data, identify a watermark and only process incremental data (see the sketch after this list).
  • You may have use cases that require record-level inserts, updates, and deletes for privacy regulations and simplified pipelines; simplified file management and near-real-time data access; or simplified change data capture (CDC) data pipeline development. We recommend using purpose-built tools based on your use case. AWS offers the options to use Apache Hudi with Amazon EMR and AWS Glue.
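
As a minimal sketch of the watermark approach mentioned above (the table and column names are hypothetical), an incremental load can replace a full truncate-and-load:

-- load only rows newer than the current high-water mark in the target
insert into orders_target
select s.*
from orders_staging s
where s.updated_at > (
	select coalesce(max(updated_at), '1900-01-01'::timestamp)
	from orders_target
);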

Migrate stored procedures

In this section, we share best practices for stored procedure migration from Greenplum to Amazon Redshift. Data processing pipelines with complex business logic often use stored procedures to perform the data transformation. We advise using big data processing services like AWS Glue or Amazon EMR to modernize your extract, transform, and load (ETL) jobs. For more information, check out Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift. For a time-sensitive migration to a cloud-native data warehouse like Amazon Redshift, redesigning and developing the entire pipeline in a cloud-native ETL tool might be time-consuming. Therefore, migrating the stored procedures from Greenplum to Amazon Redshift stored procedures can be the right choice.

For a successful migration, make sure to follow Amazon Redshift stored procedure best practices (a minimal example follows this list):

  • Specify the schema name when creating a stored procedure. This facilitates schema-level security, and you can manage access control with grants and revokes.
  • To prevent naming conflicts, we recommend naming procedures using the prefix sp_. Amazon Redshift reserves the sp_ prefix exclusively for stored procedures. By prefixing your procedure names with sp_, you ensure that your procedure name won’t conflict with any existing or future Amazon Redshift procedure names.
  • Qualify your database objects with the schema name in the stored procedure.
  • Follow the principle of least privilege and revoke unwanted access. Similarly, make sure the stored procedure run permission is not open to ALL.
  • The SECURITY attribute controls a procedure’s privileges to access database objects. When you create a stored procedure, you can set the SECURITY attribute to either DEFINER or INVOKER. If you specify SECURITY INVOKER, the procedure uses the privileges of the user invoking the procedure. If you specify SECURITY DEFINER, the procedure uses the privileges of the owner of the procedure. INVOKER is the default. For more information, refer to Security and privileges for stored procedures.
  • Managing transactions is important when it comes to stored procedures. For more information, refer to Managing transactions.
  • TRUNCATE issues a commit implicitly inside a stored procedure. It interferes with the transaction block by committing the current transaction and creating a new one. Exercise caution while using TRUNCATE to ensure it never breaks the atomicity of the transaction. This also applies to COMMIT and ROLLBACK.
  • Adhere to cursor constraints and understand performance considerations while using cursors. You should use set-based SQL logic and temporary tables when processing large datasets.
  • Avoid hardcoding in stored procedures. Use dynamic SQL to construct SQL queries dynamically at runtime. Ensure appropriate logging and error handling of the dynamic SQL.
  • For exception handling, you can write RAISE statements as part of the stored procedure code. For example, you can raise an exception with a custom message or insert a record into a logging table. For unhandled exceptions like WHEN OTHERS, use built-in functions like SQLERRM or SQLSTATE to pass it on to the calling application or program. As of this writing, Amazon Redshift limits calling a stored procedure from the exception block.
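
The following is a minimal sketch applying several of these practices (a schema-qualified name with the sp_ prefix, an explicit SECURITY attribute, and exception handling); the schema, table, and procedure names are hypothetical:

create or replace procedure sales.sp_load_daily_sales(p_sale_date date)
as $$
begin
	-- qualify database objects with the schema name
	insert into sales.daily_sales
	select sale_date, item, sum(cost)
	from sales.transactions
	where sale_date = p_sale_date
	group by sale_date, item;
exception
	when others then
		-- pass the error on to the calling application
		raise exception 'sp_load_daily_sales failed: %', sqlerrm;
end;
$$ language plpgsql
security invoker;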

Sequences

You can use IDENTITY columns, system timestamps, or epoch time as an option to ensure uniqueness. The IDENTITY column or a timestamp-based solution might have sparse values, so if you need a continuous number sequence, you need to use dedicated number tables. You can also use the RANK() or ROW_NUMBER() window function over the entire set. Alternatively, get the high-water mark from the existing ID column of the table and increment the values while inserting records.
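
The following sketch shows two of these options, an IDENTITY column and incrementing from the high-water mark with ROW_NUMBER(); the table and column names are hypothetical:

-- option 1: an IDENTITY column generates surrogate keys
create table orders (
	order_id bigint identity(1, 1),
	order_ts timestamp
);

-- option 2: increment from the existing high-water mark (target table without IDENTITY)
insert into orders_target (order_id, order_ts)
select (select coalesce(max(order_id), 0) from orders_target)
	+ row_number() over (order by order_ts),
	order_ts
from orders_staging;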

Character datatype length

Greenplum char and varchar data type length is specified in terms of character length, including multi-byte ones. Amazon Redshift character types are defined in terms of bytes. For table columns using multi-byte character sets in Greenplum, the converted table column in Amazon Redshift should allocate adequate storage to the actual byte size of the source data.

An easy workaround is to set the Amazon Redshift character column length to four times the corresponding Greenplum column length.

A best practice is to use the smallest possible column size. Amazon Redshift doesn’t allocate storage space according to the length of the attribute; it allocates storage according to the real length of the stored string. However, at runtime, while processing queries, Amazon Redshift allocates memory according to the length of the attribute. Therefore, avoiding a blanket four-times-larger default helps from a performance perspective.

An efficient solution is to analyze production datasets and determine the maximum byte size length of the Greenplum character columns. Add a 20% buffer to support future incremental growth on the table.

To arrive at the actual byte size length of an existing column, run the Greenplum data structure character utility from the AWS Samples GitHub repo.
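
For a quick manual check on a single column, Greenplum’s PostgreSQL lineage means you can also measure byte lengths directly with octet_length(); the following sketch (with hypothetical table and column names) applies the 20% buffer described above:

-- run on Greenplum: widest value in bytes, plus a 20% growth buffer
select max(octet_length(customer_name)) as max_bytes,
	ceil(max(octet_length(customer_name)) * 1.2) as suggested_redshift_length
from customers;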

Numeric precision and scale

The Amazon Redshift numeric data type supports a maximum precision of 38, whereas in a Greenplum database, you can define a numeric column without any defined length.

Analyze your production datasets and determine numeric overflow candidates using the Greenplum data structure numeric utility from the AWS Samples GitHub repo. For numeric data, you have options to tackle this based on your use case. For numbers with a decimal part, you have the option to round the data based on the data type without any data loss in the whole number part. For future reference, you can keep a copy of the column in VARCHAR or store it in an S3 data lake. If only an extremely small percentage of the data overflows, clean up the source data for quality data migration.
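
For example, rounding the fractional part into a bounded DECIMAL while keeping a lossless VARCHAR copy might look like the following sketch (the table and column names are hypothetical):

insert into analytics.measurements (id, reading, reading_raw)
select
	id,
	cast(round(reading, 10) as decimal(38, 10)), -- whole number part preserved
	reading::varchar -- lossless copy kept for future reference
from staging_measurements;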

SQL queries and functions

While converting SQL scripts or stored procedures to Amazon Redshift, you might encounter unsupported functions, database objects, or code blocks for which you have to rewrite the query, create user-defined functions (UDFs), or redesign the logic. You can create a custom scalar UDF using either a SQL SELECT clause or a Python program. The new function is stored in the database and is available to any user with sufficient privileges to run it. You run a custom scalar UDF in much the same way as you run existing Amazon Redshift functions, which lets you match the functionality of legacy databases. The following are some examples of alternate query statements and ways to achieve specific aggregations that might be required during a code rewrite.
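
For instance, a legacy string function with no direct Amazon Redshift equivalent can be recreated as a Python scalar UDF; the following is a minimal sketch with a hypothetical function name:

create or replace function f_titlecase(s varchar)
returns varchar
stable
as $$
	return s.title() if s else s
$$ language plpythonu;

-- call it like any built-in function
select f_titlecase('amazon redshift');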

AGE

The Greenplum function AGE() returns an interval by subtracting from the current date. You can accomplish the same using a combination of the MONTHS_BETWEEN(), ADD_MONTHS(), DATEDIFF(), and TRUNC() functions based on your use case.

The following example Amazon Redshift query calculates the gap between the dates 2001-04-10 and 1957-06-13 in terms of years, months, and days. You can apply this to any date column in a table.

select
	trunc(trunc(months_between('2001-04-10'::date, '1957-06-13'::date)) / 12) years,
	mod(trunc(months_between('2001-04-10'::date, '1957-06-13'::date))::int4, 12) months,
	'2001-04-10'::date - add_months('1957-06-13'::date,
		trunc(months_between('2001-04-10'::date, '1957-06-13'::date))::int4) days;

COUNT

If you have a use case to get a distinct aggregation in the COUNT() window function, you can accomplish the same using a combination of the DENSE_RANK() and MAX() window functions.

The following example Amazon Redshift query calculates the distinct item count for a given date of sale:

select
	sale_date,
	item,
	cost,
	max(densernk) over (partition by sale_date order by item rows between unbounded preceding and unbounded following) as distinct_itemcount
from
	(
	select
		*,dense_rank() over (partition by sale_date order by item) as densernk
	from
		testaggr)
order by
	sale_date,
	item,
	cost;

ORDER BY

Amazon Redshift aggregate window functions with an ORDER BY clause require a mandatory frame clause.

The following example Amazon Redshift query calculates the total cost by sale date, specifying the mandatory frame along with the ORDER BY clause in the window:

select
	*,
    sum(cost) over (partition by sale_date
order by
	item rows between unbounded preceding and unbounded following) as total_cost_by_date
from
	testaggr
order by
	sale_date,
	item,
	cost;

STRING_AGG

In Greenplum, STRING_AGG() is an aggregate function used to concatenate a list of strings. In Amazon Redshift, use the LISTAGG() function instead.

The following example Amazon Redshift query returns a semicolon-separated list of email addresses for each department:

select
	dept,
	listagg(email_address, ';')
		within group (order by email_address) as email_list
from
	employee_contact
group by
	dept
order by
	dept;

ARRAY_AGG

In Greenplum, ARRAY_AGG() is an aggregate function that takes a set of values as input and returns an array. In Amazon Redshift, use a combination of the LISTAGG() and SPLIT_TO_ARRAY() functions. The SPLIT_TO_ARRAY() function returns a SUPER datatype.

The following example Amazon Redshift query returns an array of email addresses for each department:

select
	dept,
	SPLIT_TO_ARRAY(email_list, ';') as email_array
from
	(
	select
		dept,
		listagg(email_address, ';')
			within group (order by email_address) as email_list
	from
		employee_contact
	group by
		dept
	order by
		dept);

To retrieve array elements from a SUPER expression, you can use the SUBARRAY() function:

select
	SUBARRAY(email_array, 0, 1) first_element,
	SUBARRAY(email_array, 1, 1) second_element,
	SUBARRAY(email_array, 0) all_element
from
	testarray
where
	dept = 'HR';

UNNEST

In Greenplum, you can use the UNNEST function to split an array and convert the array elements into a set of rows. In Amazon Redshift, you can use PartiQL syntax to iterate over SUPER arrays. For more information, refer to Querying semistructured data. The following example creates a test table with a SUPER column and iterates over its array elements:

create temp table unnesttest as
select
	json_parse('{"scalar_array": [0,1,2,3,4,5.5,6,7.9,8,9]}') as data;

select
	element
from
	unnesttest as un,
	un.data.scalar_array as element at index;

WHERE

You can’t use a window function in the WHERE clause of a query in Amazon Redshift. Instead, construct the query using the WITH clause and then refer to the calculated column in the WHERE clause.

The following example Amazon Redshift query returns the sale date, item, and cost from a table for the sales dates where the total sale is more than 100:

with aggrcost as (
select
	sale_date ,
	item,
	cost,
	sum(cost) over (partition by sale_date) as total_sale
from
	testaggr )
select
	*
from
	aggrcost
where
	total_sale > 100;

Refer to the following table for additional Greenplum date/time functions along with their Amazon Redshift equivalents to accelerate your code migration.

| # | Description | Greenplum | Amazon Redshift |
|---|---|---|---|
| 1 | Return the start time of the current transaction | now() | sysdate |
| 2 | Return the start timestamp of the current statement within a transaction block | clock_timestamp() | to_date(getdate(),'yyyy-mm-dd') + substring(timeofday(),12,15)::timetz |
| 3 | Return the start timestamp of the current transaction | transaction_timestamp() | to_date(getdate(),'yyyy-mm-dd') + substring(timeofday(),12,15)::timetz |
| 4 | Add x years and y months to date_time_column and return a timestamp | date_time_column + interval 'x years y months' | add_months(date_time_column, x*12 + y) |
| 5 | Get the total number of seconds between two timestamp fields | date_part('day', end_ts - start_ts) * 24 * 60 * 60 + date_part('hours', end_ts - start_ts) * 60 * 60 + date_part('minutes', end_ts - start_ts) * 60 + date_part('seconds', end_ts - start_ts) | datediff('seconds', start_ts, end_ts) |
| 6 | Get the total number of minutes between two timestamp fields | date_part('day', end_ts - start_ts) * 24 * 60 + date_part('hours', end_ts - start_ts) * 60 + date_part('minutes', end_ts - start_ts) | datediff('minutes', start_ts, end_ts) |
| 7 | Extract a date part literal from the difference of two timestamp fields | date_part('hour', end_ts - start_ts) | extract(hour from (date_time_column_2 - date_time_column_1)) |
| 8 | Return the ISO day of the week | date_part('isodow', date_time_column) | TO_CHAR(date_time_column, 'ID') |
| 9 | Return the ISO year from a date/time field | extract(isoyear from date_time_column) | TO_CHAR(date_time_column, 'IYYY') |
| 10 | Convert epoch seconds to the equivalent datetime | to_timestamp(epoch_seconds) | TIMESTAMP 'epoch' + number_of_seconds * interval '1 second' |

Amazon Redshift utility for troubleshooting or running diagnostics for the cluster

The Amazon Redshift Utilities GitHub repo contains a set of utilities (queries, views, and scripts) to accelerate troubleshooting or analysis on Amazon Redshift. They are not deployed by default onto Amazon Redshift clusters; the best practice is to deploy the needed views into an admin schema.
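
For example, you can create the admin schema once and then run the view definitions from the repo in it; the following sketch references the repo’s v_generate_tbl_ddl view:

create schema if not exists admin;
-- then run the needed view definitions from the repo against this schema,
-- for example admin/v_generate_tbl_ddl, which reconstructs the DDL of existing tables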

Conclusion

In this post, we covered prescriptive guidance around data types, functions, and stored procedures to accelerate the migration process from Greenplum to Amazon Redshift. Although this post describes modernizing and moving to a cloud warehouse, you should extend this transformation process toward a full-fledged modern data architecture. The AWS Cloud enables you to be more data-driven by supporting multiple use cases. For a modern data architecture, you should use purpose-built data stores like Amazon S3, Amazon Redshift, Amazon Timestream, and others based on your use case.


About the Authors

Suresh Patnam is a Principal Solutions Architect at AWS. He is passionate about helping businesses of all sizes transform into fast-moving digital organizations with a focus on big data, data lakes, and AI/ML. Suresh holds an MBA from Duke University’s Fuqua School of Business and an MS in CIS from Missouri State University. In his spare time, Suresh enjoys playing tennis and spending time with his family.

Arunabha Datta is a Sr. Data Architect at AWS Professional Services. He collaborates with customers and partners to architect and implement modern data architecture using AWS Analytics services. In his spare time, Arunabha enjoys photography and spending time with his family.

Migrate a large data warehouse from Greenplum to Amazon Redshift using AWS SCT – Part 1

Post Syndicated from Suresh Patnam original https://aws.amazon.com/blogs/big-data/part-1-migrate-a-large-data-warehouse-from-greenplum-to-amazon-redshift-using-aws-sct/

A data warehouse collects and consolidates data from various sources within your organization. It’s used as a centralized data repository for analytics and business intelligence.

When working with on-premises legacy data warehouses, scaling the size of your data warehouse or improving performance can mean purchasing new hardware or adding more powerful hardware. This is often expensive and time-consuming. Running your own on-premises data warehouse also requires hiring database managers and administrators to deal with outages, upgrades, and data access requests. As companies become more data-driven, reliable access to centralized data is increasingly important. As a result, there is a strong demand for data warehouses that are fast, accessible, and able to scale elastically with business needs. Cloud data warehouses like Amazon Redshift address these needs while eliminating the cost and risk of purchasing new hardware.

This multi-part series explains how to migrate an on-premises Greenplum data warehouse to Amazon Redshift using AWS Schema Conversion Tool (AWS SCT). In this first post, we describe how to plan, run, and validate the large-scale data warehouse migration. It covers the solution overview, migration assessment, and guidance on technical and business validation. In the second post, we share best practices for choosing the optimal Amazon Redshift cluster, data architecture, converting stored procedures, compatible functions and queries widely used for SQL conversions, and recommendations for optimizing the length of data types for table columns.

Solution overview

Amazon Redshift is an industry-leading cloud data warehouse. Amazon Redshift uses Structured Query Language (SQL) to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes using AWS-designed hardware and machine learning to deliver the best price-performance at any scale.

AWS SCT makes heterogeneous database migrations predictable by automatically converting the source database schema and most of the database code objects, SQL scripts, views, stored procedures, and functions to a format compatible with the target database. AWS SCT helps you modernize your applications simultaneously during database migration. When schema conversion is complete, AWS SCT can help migrate data from various data warehouses to Amazon Redshift using data extraction agents.

The following diagram illustrates our architecture for migrating data from Greenplum to Amazon Redshift using AWS SCT data extraction agents.

Perform a migration assessment

The initial data migration is the first milestone of the project. The main requirements for this phase are to minimize the impact on the data source and transfer the data as fast as possible. To do this, AWS offers several options, depending on the size of the database, network performance (AWS Direct Connect or AWS Snowball), and whether the migration is heterogeneous or not (AWS Database Migration Service (AWS DMS) or AWS SCT).

AWS provides a portfolio of cloud data migration services to provide the right solution for any data migration project. The level of connectivity is a significant factor in data migration, and AWS has offerings that can address your hybrid cloud storage, online data transfer, and offline data transfer needs.

Additionally, the AWS Snow Family makes it simple to get your data into and out of AWS via offline methods. Based on the size of the data, you can use AWS Snowmobile or AWS Snowball if you have petabytes to exabytes of data. To decide which transfer method is better for your use case, refer to Performance for AWS Snowball.

Perform schema conversion with AWS SCT

To convert your schema using AWS SCT, you must start a new AWS SCT project and connect your databases. Complete the following steps:

  1. Install AWS SCT.
  2. Open and initiate a new project.
  3. For Source database engine, choose Greenplum.
  4. For Target database engine, choose Amazon Redshift.
  5. Choose OK.
  6. Open your project and choose Connect to Greenplum.
  7. Enter the Greenplum database information.
  8. Choose Test connection.
  9. Choose OK after a successful connection test.
  10. Choose OK to complete the connection.
  11. Repeat similar steps to establish a connection to your Amazon Redshift cluster.

    By default, AWS SCT uses AWS Glue as the extract, transform, and load (ETL) solution for the migration. Before you continue, you must disable this setting.
  12. On the Settings menu, choose Project settings.
  13. Deselect Use AWS Glue.
  14. Choose OK.
  15. In the left pane, choose your schema (right-click) and choose Convert schema.
  16. When asked to replace objects, choose Yes.
  17. When asked to load statistics, choose Continue.

    By the end of this step, all Greenplum objects should be migrated to Amazon Redshift syntax. Some objects may be shown in red, meaning that AWS SCT couldn’t fully migrate these objects. You can view an assessment summary of the migration for more information.
  18. On the View menu, choose Assessment report view.

    In the bottom pane, you can see Greenplum DDL and Amazon Redshift DDL of the selected objects side by side for comparison.
  19. Choose the schema with a red icon, which indicates that it needs manual conversion. You’re presented with specific actions regarding the tables, constraints, or views that can’t be migrated to Amazon Redshift. You must investigate these issues and fix the errors manually with the required changes. Some examples are binary data in BLOB format, which AWS SCT automatically converts to the character varying data type but may highlight as an issue, and vendor-supplied procedures and functions that can’t be converted, for which AWS SCT reports an error.

    As a final step, you can validate that the tables exist in Amazon Redshift.
  20. Connect using the Amazon Redshift query editor v2 or another third-party tool or utility of your choice and check for all the tables with the following code:
    select t.table_schema, t.table_name,i.tbl_rows
    from svv_tables t left join svv_table_info i 
    on t.table_schema = i.schema and t.table_name = i.table
    where t.table_type = 'BASE TABLE' and t.table_schema='SCHEMA NAME'
    order by 1,2;

Migrate the data

To start your data migration using AWS SCT data extraction agents, complete the following steps:

  1. Configure the AWS SCT extractor properties file with corresponding Greenplum properties:
    port=8192
    vendor=GREENPLUM
    driver.jars="C:\DMS Workshop\JDBC\postgresql-42.2.9.jar"
    redshift.driver.jars=C:\DMS Workshop\JDBC\RedshiftJDBC42-no-awssdk-1.2.43.1067.jar
    working.folder=C:/Users/developer
    ssl.option=ON
    ssl.require.client.authentication=ON
    #extractor.start.fetch.size=20000
    #extractor.out.file.size=10485760
    ssl.truststore.path=C:/DMS Workshop/sctagent/truststore
    ssl.keystore.path=C:/DMS Workshop/sctagent/keystore
    ssl.keystore.pwd=ZcUnMLbWYe1c0Dp/m9jIe8IMGNtyUOv+AnKVsV9eZI+SuJcWI9Fd99Lskd7FFA==
    ssl.truststore.pwd=w23mFvm+SO46eR4dJ0Ly7n+XcvgUhhqrJNMPUyWdhFFKBxRU64/V2uioOul/aBcs8tMs=

    Now you configure the AWS SCT extractor to perform a one-time data move. You can use multiple extractors when dealing with a large volume of data.

  2. To register the extractor, on the View menu, choose Data migration view.
  3. Choose Register.
  4. Enter the information for your new agent.
  5. Test the connection and choose Register.

    Now you create a task for the extractor to extract data into the tables created on Amazon Redshift.
  6. Under your schema in the left pane, choose Tables (right-click) and choose Create Local task.
  7. For Task name, enter a name.
  8. Test the connection and choose OK.
  9. Choose Create.
  10. Run your task and monitor its progress.

You can choose each task to get a detailed breakdown of its activity. Make sure to examine errors during the extract, upload, and copy process.

You can monitor the status of the tasks, the percentage completed, and the tables that were loaded successfully. You must also verify the count of records loaded into the Amazon Redshift database.

Technical validation

After the initial extracted data is loaded to Amazon Redshift, you must perform data validation tests in parallel. The goal at this stage is to validate production workloads, comparing Greenplum and Amazon Redshift outputs from the same inputs.

Typical activities covered during this phase include the following:

  • Count the number of objects and rows on each table (see the example query after this list).
  • Compare the same random subset of data in both Greenplum and Amazon Redshift for all migrated tables, validating that the data is exactly the same row by row.
  • Check for incorrect column encodings.
  • Identify skewed table data.
  • Annotate queries not benefiting from sort keys.
  • Identify inappropriate join cardinality.
  • Identify tables with large VARCHAR columns.
  • Confirm that processes don’t crash when connected with the target environment.
  • Validate daily batch jobs (job duration, number of rows processed). To find the right techniques to perform most of those activities, refer to Top 10 Performance Tuning Techniques for Amazon Redshift.
  • Set up Amazon Redshift automated alerts with Amazon Redshift Advisor.
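
For example, a quick row-count check on the Amazon Redshift side can be driven from svv_table_info; compare the output against equivalent counts collected from Greenplum:

select "schema", "table", tbl_rows
from svv_table_info
order by "schema", "table";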

Business validation

After you successfully migrate the data and validate the data movement, the last remaining task is to involve the data warehouse users in the validation process. These users from different business units across the company access the data warehouse using various tools and methods: JDBC/ODBC clients, Python scripts, custom applications, and more. It’s central to the migration to make sure that every end-user has verified and adapted this process to work seamlessly with Amazon Redshift before the final cutover.

This business validation phase is key so all end-users are aligned and ready for the final cutover. Following Amazon Redshift best practices enables end-users to take full advantage of the capabilities of their new data warehouse. After you perform all the migration validation tasks and connect and test every ETL job, business process, external system, and user tool against Amazon Redshift, you can disconnect every process from the old data warehouse, which you can now safely power off and decommission.

Conclusion

In this post, we provided detailed steps to migrate from Greenplum to Amazon Redshift using AWS SCT. Although this post describes modernizing and moving to a cloud warehouse, you should extend this transformation process toward a full-fledged modern data architecture. The AWS Cloud enables you to be more data-driven by supporting multiple use cases. For a modern data architecture, you should use purpose-built data stores like Amazon S3, Amazon Redshift, Amazon Timestream, and other data stores based on your use case.

Check out the second post in this series, where we cover prescriptive guidance around data types, functions, and stored procedures.


About the Authors

Suresh Patnam is a Principal Solutions Architect at AWS. He is passionate about helping businesses of all sizes transform into fast-moving digital organizations with a focus on big data, data lakes, and AI/ML. Suresh holds an MBA from Duke University’s Fuqua School of Business and an MS in CIS from Missouri State University. In his spare time, Suresh enjoys playing tennis and spending time with his family.

Arunabha Datta is a Sr. Data Architect at Amazon Web Services (AWS). He collaborates with customers and partners to architect and implement modern data architecture using AWS Analytics services. In his spare time, Arunabha enjoys photography and spending time with his family.

Accelerate resize and encryption of Amazon Redshift clusters with Faster Classic Resize

Post Syndicated from Sumeet Joshi original https://aws.amazon.com/blogs/big-data/accelerate-resize-and-encryption-of-amazon-redshift-clusters-with-faster-classic-resize/

Amazon Redshift has improved the performance of the classic resize feature and increased the flexibility of the cluster snapshot restore operation. You can use the classic resize operation to resize a cluster when you need to change the instance type or transition to a configuration that can’t be supported by elastic resize. Previously, this could take the cluster offline for many hours during the resize, but now the cluster can typically be available to process queries in minutes. Clusters can also be resized when restoring from a snapshot, although some restrictions may apply in those cases.

You can now also restore an encrypted cluster from an unencrypted snapshot or change the encryption key. Amazon Redshift uses AWS Key Management Service (AWS KMS) as an encryption option to provide an additional layer of data protection by securing your data from unauthorized access to the underlying storage. Now you can encrypt an unencrypted cluster with a KMS key faster by simply specifying a KMS key ID when modifying the cluster. You can also restore an AWS KMS-encrypted cluster from an unencrypted snapshot. You can access the feature via the AWS Management Console, SDK, or AWS Command Line Interface (AWS CLI). Please note that these features only apply to the clusters or target clusters with the RA3 node type.

In this post, we show you how the updated classic resize option works and also how it significantly improves the amount of time it takes to resize or encrypt your cluster with this enhancement. We also walk through the steps to resize your Amazon Redshift cluster using Faster Classic Resize.

Existing resize options

We’ve worked closely with our customers to learn how their needs evolve as their data scales or as their security and compliance requirements change. To address and meet your ever-growing demands, you often have to resize your Amazon Redshift cluster and choose an optimal instance type that delivers the best price/performance. As of this writing, there are three ways you can resize your clusters: elastic resize, classic resize, and the snapshot, restore, and resize method.

Among the three options, elastic resize is the fastest available resize mechanism because it works based on slice remapping instead of a full data copy. Classic resize is used primarily when the desired cluster resize is outside the slice ranges allowed by elastic resize, or when the encryption status needs to change. Let’s briefly discuss these scenarios before describing how the new migration process helps.

Current limitations

The current resize options have a few limitations of note.

  • Configuration changes – Elastic resize supports the following RA3 configuration changes by design. When you need to choose a target cluster outside the ranges mentioned in the following table, you should choose classic resize.
| Node Type | Growth Limit | Reduction Limit |
|---|---|---|
| ra3.16xlarge | 4x (from 4 to 16 nodes, for example) | To one-quarter of the number (from 16 to 4 nodes, for example) |
| ra3.4xlarge | 4x | To one-quarter of the number |
| ra3.xlplus | 2x (from 4 to 8 nodes, for example) | To one-quarter of the number |

Also, elastic resize can’t be performed if the current cluster is a single-node cluster or isn’t running on an EC2-VPC platform. These scenarios also drive customers to choose classic resize.

  • Encryption changes – You may need to encrypt your Amazon Redshift cluster based on security, compliance, and data consumption requirements. Currently, in order to modify encryption on an Amazon Redshift cluster, we use classic resize technology, which internally performs a deep copy operation of the entire dataset and rewrites the dataset with the desired encryption state. To avoid any changes during the deep copy operation, the source cluster is placed in read-only mode during the entire operation, which can take a few hours to days depending on the dataset size. Or, you might be locked out altogether if the data warehouse is down for a resize. As a result, the administrators or application owners can’t support Service Level Agreements (SLAs) that they have set with their business stakeholders.

Switching to the Faster Classic Resize approach can help speed up the migration process when turning on encryption. Encryption has been one of the requirements for cross-account, cross-Region data sharing, which isn’t available on unencrypted clusters, and for integrations with AWS Data Exchange for Amazon Redshift. Additionally, Amazon Redshift Serverless is encrypted by default, so to create a data share from a provisioned cluster to Redshift Serverless, the provisioned cluster should be encrypted as well. This is one more compelling requirement for Faster Classic Resize.

Faster Classic Resize

Faster Classic Resize works like elastic resize but performs the same functions as classic resize, thereby offering the best of both approaches. Unlike classic resize, which involves extracting tuples from the source cluster and inserting them on the target cluster, the Faster Classic Resize operation doesn’t extract tuples. Instead, it starts from a snapshot, and the data blocks are copied over to the target cluster.

The new Faster Classic Resize operation involves two stages:

  • Stage 1 (Critical path) – This first stage consists of migrating the data from the source cluster to the target cluster, during which the source cluster is in read-only mode. Typically, this stage is of very short duration. The cluster is then made available for reads and writes.
  • Stage 2 (Off critical path) – The second stage involves redistributing the data as per the previous data distribution style. This process runs in the background off the critical path of migration from the source to target cluster. The duration of this stage is dependent on the volume to distribute, cluster workload, and so on.

Let’s see how Faster Classic Resize works with configuration changes, encryption changes, and restoring an unencrypted snapshot into an encrypted cluster.

Prerequisites

Complete the following prerequisite steps:

  1. Take a snapshot from the current cluster or use an existing snapshot.
  2. Provide the AWS Identity and Access Management (IAM) role credentials that are required to run the AWS CLI. For more information, refer to Using identity-based policies (IAM policies) for Amazon Redshift.
  3. For encryption changes, create a KMS key if none exists. For instructions, refer to Creating keys.

Configuration changes

As of today, you can initiate Faster Classic Resize through the AWS CLI when the required cluster configuration change is beyond what is allowed by elastic resize. The feature will be added to the console in a future release. As of this writing, you can use it to change your cluster configuration from DC2, DS2, and RA3 node types to any RA3 node type. However, changing from RA3 to DC2 or DS2 isn’t supported yet.

We did a benchmark on Faster Classic Resize with different cluster combinations and volumes. The following table summarizes the results comparing critical paths in classic resize and Faster Classic Resize.

| Volume | Source Cluster | Target Cluster | Classic Resize Duration (min) | Faster Classic Resize Stage 1 Duration (min) | % Faster |
|---|---|---|---|---|---|
| 10 TB | ra3.4xlarge – 6 nodes | ra3.16xlarge – 8 nodes | 78 | 11 | 86% |
| 10 TB | ra3.16xlarge – 8 nodes | ra3.4xlarge – 2 nodes | 738 | 11 | 99% |
| 10 TB | dc2.8xlarge – 6 nodes | ra3.4xlarge – 2 nodes | 706 | 8 | 99% |
| 3 TB | ra3.4xlarge – 2 nodes | ra3.16xlarge – 4 nodes | 53 | 11 | 79% |
| 3 TB | ra3.16xlarge – 4 nodes | ra3.4xlarge – 2 nodes | 244 | 7 | 97% |
| 3 TB | dc2.8xlarge – 6 nodes | ra3.4xlarge – 2 nodes | 251 | 7 | 97% |

The Faster Classic Resize option consistently completed in significantly less time and made the cluster available for read and write operations in a short time. Classic resize took a longer time in all cases and kept the cluster in read-only mode, making it unavailable for writes. Also, the classic resize duration is comparatively longer when the target cluster configuration is smaller than the original cluster configuration.

Perform Faster Classic Resize

As of this writing, the feature isn’t available on the console. You can use either of the following two methods to resize your cluster using Faster Classic Resize via the AWS CLI.

  • Modify cluster method – Resize an existing cluster without changing the endpoint

The following are the steps involved:

    1. Take a snapshot of the current cluster prior to performing the resize operation.
    2. Determine the target cluster configuration and run the following command from the AWS CLI:
      aws redshift modify-cluster --region <CLUSTER REGION>
      --endpoint-url https://redshift.<CLUSTER REGION>.amazonaws.com/
      --cluster-identifier <CLUSTER NAME>
      --cluster-type multi-node
      --node-type <TARGET INSTANCE TYPE>
      --number-of-nodes <TARGET NUMBER OF NODES>

      For example:

      aws redshift modify-cluster --region us-east-1
      --endpoint-url https://redshift.us-east-1.amazonaws.com/
      --cluster-identifier my-cluster-identifier
      --cluster-type multi-node
      --node-type  ra3.16xlarge
      --number-of-nodes 12

  • Snapshot restore method – Restore an existing snapshot to the new cluster with the new cluster endpoint

The following are the steps involved:

    1. Identify the snapshot for restore and a unique name for the new cluster.
    2. Determine the target cluster configuration and run the following command from the AWS CLI:
      aws redshift restore-from-cluster-snapshot --region <CLUSTER REGION>
      --endpoint-url https://redshift.<CLUSTER REGION>.amazonaws.com/
      --snapshot-identifier <SNAPSHOT ID> 
      --cluster-identifier <CLUSTER NAME>
      --node-type <TARGET INSTANCE TYPE>
      --number-of-nodes <NUMBER>

      For example:

      aws redshift restore-from-cluster-snapshot --region us-east-1
      --endpoint-url https://redshift.us-east-1.amazonaws.com/
      --snapshot-identifier rs:sales-cluster-2022-05-26-16-19-36
      --cluster-identifier my-new-cluster-identifier
      --node-type ra3.16xlarge
      --number-of-nodes 12

Monitor the resize process

Monitor the progress through the cluster management console. You can also check the events generated by the resize process. The resize completion status is logged in events along with the duration it took for the resize. The following screenshot shows an example.
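
You can also retrieve these events with the AWS CLI; the following sketch uses an illustrative cluster identifier and lists events from the last 2 days (2,880 minutes):

aws redshift describe-events --region us-east-1
--source-type cluster
--source-identifier my-cluster-identifier
--duration 2880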

It’s important to note that you may observe longer query times during the second stage of Faster Classic Resize. During the first stage, the data for tables with the dist-key distribution style is transferred as dist-even. Background processes then run behind the scenes to redistribute that data to the original dist-key distribution style (stage 2). You can monitor the progress of these background processes by querying the stv_xrestore_alter_queue_state table. Tables with ALL, AUTO, or EVEN distribution styles don’t require redistribution post-resize, so they aren’t logged in the stv_xrestore_alter_queue_state table. The counts you observe in this table are for the tables with a distribution style of Key before the resize operation.

See the following example query:

select db_id, status, count(*) from stv_xrestore_alter_queue_state group by 1,2 order by 3 desc;

The following table shows that for 60 tables data redistribution is finished, for 323 tables data redistribution is pending, and for 1 table data redistribution is in progress.

We ran tests to assess time to complete the redistribution. For 10 TB of data, it took approximately 5 hours and 30 minutes on an idle cluster. For 3 TB, it took approximately 2 hours and 30 minutes on an idle cluster. The following is a summary of tests performed on larger volumes:

  • A snapshot with 100 TB where 70% of blocks needs redistribution would take 10–40 hours
  • A snapshot with 450 TB where 70% of blocks needs redistribution would take 2–8 days
  • A snapshot with 1600 TB where 70% of blocks needs redistribution would take 7–27 days

The actual time to complete redistribution is largely dependent on data volume, cluster idle cycles, target cluster size, data skewness, and more. Therefore, we recommend performing Faster Classic Resize when there is enough of an idle window (such as weekends) for the cluster to perform redistribution.

Encryption changes

You can encrypt your Amazon Redshift cluster from the console (the modify cluster method) or using the AWS CLI using the snapshot restore method. Amazon Redshift performs the encryption change using Faster Classic Resize. The operation only takes a few minutes to complete and your cluster is available for both read and write. With Faster Classic Resize, you can change an unencrypted cluster to an encrypted cluster or change the encryption key using the snapshot restore method.

For this post, we show how you can change the encryption using the Amazon Redshift console. To test the timings, we created multiple Amazon Redshift clusters using TPC-DS data. The Faster Classic Resize option consistently completed in significantly less time and made clusters available for read and write operations faster. Classic resize took a longer time in all cases and kept the cluster in read-only mode. The following table contains the summary of the results.

| Data Volume | Cluster | Encryption (Classic Resize) Duration (min) | Encryption (Faster Classic Resize) Duration (min) | % Faster |
|---|---|---|---|---|
| 10 TB | ra3.4xlarge – 2 nodes | 580 | 11 | 98% |
| 10 TB | ra3.xlplus – 2 nodes | 680 | 16 | 98% |
| 3 TB | ra3.4xlarge – 2 nodes | 527 | 9 | 98% |
| 3 TB | ra3.xlplus – 2 nodes | 570 | 10 | 98% |

Now, let’s perform the encryption change from an unencrypted cluster to an encrypted cluster using the console. Complete the following steps:

  1. On the Amazon Redshift console, navigate to your cluster.
  2. On the Properties tab, on the Edit drop-down menu, choose Edit encryption.
  3. For Encryption, select Use AWS Key Management Service (AWS KMS).
  4. For AWS KMS, select Default Redshift key.
  5. Choose Save changes.

You can monitor the progress of your encryption change on the Events tab. As shown in the following screenshot, the entire process to change the encryption took approximately 11 minutes.

Restore an unencrypted snapshot to an encrypted cluster

As of today, the Faster Classic Resize option to restore an unencrypted snapshot into an encrypted cluster or to change the encryption key is available only through the AWS CLI. When triggered, the restored cluster operates in read/write mode immediately. The encryption state change for restored blocks that are unencrypted operates in the background, and newly ingested blocks continue to be encrypted.

Restore the snapshot using the following command into a new cluster. (Replace the indicated parameter values; --encrypted and --kms-key-id are required).

aws redshift restore-from-cluster-snapshot 
--cluster-identifier <CLUSTER NAME>
--snapshot-identifier <SNAPSHOT ID> 
--region <AWS REGION> 
--encrypted
--kms-key-id <KMS KEY ID>
--cluster-subnet-group-name <SUBNET GROUP>

When to use which resize option

The following flow chart provides guidance on which resize option is recommended when changing your cluster encryption status or resizing to a new cluster configuration.

Summary

In this post, we talked about the improved performance of Amazon Redshift’s classic resize feature and how Faster Classic Resize significantly improves your ability to scale your Amazon Redshift clusters using the classic resize method. We also talked about when to use different resize operations based on your requirements. We demonstrated how it works from the console (for encryption changes) and using the AWS CLI. We also showed the results of our benchmark test and how it significantly improves the migration time for configuration changes and encryption changes for your Amazon Redshift cluster.

To learn more about resizing your clusters, refer to Resizing clusters in Amazon Redshift. If you have any feedback or questions, please leave them in the comments.


About the authors

Sumeet Joshi is an Analytics Specialist Solutions Architect based out of New York. He specializes in building large-scale data warehousing solutions. He has over 17 years of experience in the data warehousing and analytical space.

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 16 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Krishna Chaitanya Gudipati is a Senior Software Development Engineer at Amazon Redshift. He has been working on distributed systems for over 14 years and is passionate about building scalable and performant systems. In his spare time, he enjoys reading and exploring new places.

Yanzhu Ji is a Product Manager on the Amazon Redshift team. She worked on the Amazon Redshift team as a Software Engineer before becoming a Product Manager. She has a rich experience of how the customer-facing Amazon Redshift features are built from planning to launching, and always treats customers’ requirements as first priority. In her personal life, Yanzhu likes painting, photography, and playing tennis.

Optimize your Amazon Redshift query performance with automated materialized views

Post Syndicated from Adam Gatt original https://aws.amazon.com/blogs/big-data/optimize-your-amazon-redshift-query-performance-with-automated-materialized-views/

Amazon Redshift is a fast, fully managed cloud data warehouse that makes it cost-effective to analyze your data using standard SQL and business intelligence tools. Amazon Redshift allows you to analyze structured and semi-structured data and seamlessly query data lakes and operational databases, using AWS-designed hardware and automated machine learning (ML)-based tuning to deliver top-tier price-performance at scale.

Although Amazon Redshift provides excellent price performance out of the box, it offers additional optimizations that can improve this performance and allow you to achieve even faster query response times from your data warehouse.

For example, you can physically tune tables in a data model to minimize the amount of data scanned and distributed within a cluster, which speeds up operations such as table joins and range-bound scans. Amazon Redshift now automates this tuning with the automatic table optimization (ATO) feature.

Another optimization for reducing query runtime is to precompute query results in the form of a materialized view. Materialized views store precomputed query results that future similar queries can use. This improves query performance because many computation steps can be skipped and the precomputed results returned directly. Unlike a simple cache, many materialized views can be incrementally refreshed when DML changes are applied on the underlying (base) tables and can be used by other similar queries, not just the query used to create the materialized view.

Amazon Redshift introduced materialized views in March 2020. In June 2020, support for external tables was added. With these releases, you could use materialized views on both local and external tables to deliver low-latency performance by using precomputed views in your queries. However, this approach required you to be aware of what materialized views were available on the cluster, and if they were up to date.

In November 2020, materialized view automatic refresh and query rewrite features were added. With materialized view-aware automatic rewriting, data analysts get the benefit of materialized views for their queries and dashboards without having to query the materialized view directly. The analyst may not even be aware the materialized views exist. The auto rewrite feature enables this by rewriting queries to use materialized views without the query needing to explicitly reference them. In addition, auto refresh keeps materialized views up to date when base table data is changed, and there are available cluster resources for the materialized view maintenance.

However, materialized views still have to be manually created, monitored, and maintained by data engineers or DBAs. To reduce this overhead, Amazon Redshift has introduced the Automated Materialized View (AutoMV) feature, which goes one step further and automatically creates materialized views for queries with common recurring joins and aggregations.

This post explains what materialized views are, how manual materialized views work and the benefits they provide, and what’s required to build and maintain manual materialized views to achieve performance improvements and optimization. Then we explain how this is greatly simplified with the new automated materialized view feature.

Manually create materialized views

A materialized view is a database object that stores precomputed query results in a materialized (persisted) dataset. Similar queries can use the precomputed results from the materialized view and skip the expensive tasks of reading the underlying tables and performing joins and aggregates, thereby improving the query performance.

For example, you can improve the performance of a dashboard by materializing the results of its queries into a materialized view or multiple materialized views. When the dashboard is opened or refreshed, it can use the precomputed results from the materialized view instead of rereading the base tables and reprocessing the queries. By creating a materialized view once and querying it multiple times, redundant processing can be avoided, improving query performance and freeing up resources for other processing on the database.

To demonstrate this, we use the following query, which returns daily sales totals. It joins two tables and aggregates at the day level.

SET enable_result_cache_for_session TO OFF;

SELECT o.o_orderdate AS order_date
      ,SUM(l.l_extendedprice) AS ext_price_total
FROM orders o
INNER JOIN lineitem l
   ON o.o_orderkey = l.l_orderkey
WHERE o.o_orderdate >= '1997-01-01'
AND   o.o_orderdate < '1998-01-01'
GROUP BY o.o_orderdate
ORDER BY 1;

At the top of the query, we set enable_result_cache_for_session to OFF. This setting disables the results cache, so we can see the full processing runtime each time we run the query. Unlike a materialized view, the results cache is a simple cache that stores the results of a single query in memory. It can't be used by other similar queries, is not updated when the base tables are modified, and, because it isn't persisted, can be aged out of memory by more frequently used queries.

When we run this query on a 10-node ra3.4xlarge cluster with the TPC-H 3 TB dataset, it returns in approximately 20 seconds. If we need to run this query or similar queries more than once, we can create a materialized view with the CREATE MATERIALIZED VIEW command and query the materialized view object directly, which has the same structure as a table:

CREATE MATERIALIZED VIEW mv_daily_sales
AS
SELECT o.o_orderdate AS order_date
      ,SUM(l.l_extendedprice) AS ext_price_total
FROM orders o
INNER JOIN lineitem l
   ON o.o_orderkey = l.l_orderkey
WHERE o.o_orderdate >= '1997-01-01'
AND   o.o_orderdate < '1998-01-01'
GROUP BY o.o_orderdate;

SELECT order_date
      ,ext_price_total
FROM   mv_daily_sales
ORDER BY 1;

Because the join and aggregations have been precomputed, it runs in approximately 900 milliseconds, a performance improvement of 96%.

As we have just shown, you can query the materialized view directly; however, Amazon Redshift can automatically rewrite a query to use one or more materialized views. The query rewrite feature transparently rewrites the query as it’s being run to retrieve precomputed results from a materialized view. This process is automatically triggered on eligible and up-to-date materialized views, if the query contains the same base tables and joins, and has similar aggregations as the materialized view.

For example, if we rerun the sales query, because it’s eligible for rewriting, it’s automatically rewritten to use the mv_daily_sales materialized view. We start with the original query:

SELECT o.o_orderdate AS order_date
      ,SUM(l.l_extendedprice) AS ext_price_total
FROM orders o
INNER JOIN lineitem l
   ON o.o_orderkey = l.l_orderkey
WHERE o.o_orderdate >= '1997-01-01'
AND   o.o_orderdate < '1998-01-01'
GROUP BY o.o_orderdate
ORDER BY 1;

Internally, the query is rewritten to the following SQL and run. This process is completely transparent to the user.

SELECT order_date
      ,ext_price_total
FROM   mv_daily_sales
ORDER BY 1;

The rewriting can be confirmed by looking at the query’s explain plan:

EXPLAIN SELECT o.o_orderdate AS order_date
      ,SUM(l.l_extendedprice) AS ext_price_total
FROM orders o
INNER JOIN lineitem l
   ON o.o_orderkey = l.l_orderkey
WHERE o.o_orderdate >= '1997-01-01'
AND   o.o_orderdate < '1998-01-01'
GROUP BY o.o_orderdate;

+------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                      |
+------------------------------------------------------------------------------------------------+
|XN HashAggregate  (cost=5.47..5.97 rows=200 width=31)                                           |
|  ->  XN Seq Scan on mv_tbl__mv_daily_sales__0 derived_table1  (cost=0.00..3.65 rows=365 width=31)|
+------------------------------------------------------------------------------------------------+

The plan shows the query has been rewritten and has retrieved the results from the mv_daily_sales materialized view, not the query’s base tables: orders and lineitem.

Other queries that use the same base tables and level of aggregation, or a level of aggregation derived from the materialized view’s level, are also rewritten. For example:

EXPLAIN SELECT date_trunc('month', o.o_orderdate) AS order_month
      ,SUM(l.l_extendedprice) AS ext_price_total
FROM orders o
INNER JOIN lineitem l
   ON o.o_orderkey = l.l_orderkey
WHERE o.o_orderdate >= '1997-01-01'
AND   o.o_orderdate < '1998-01-01'
GROUP BY order_month;

+------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                      |
+------------------------------------------------------------------------------------------------+
|XN HashAggregate  (cost=7.30..10.04 rows=365 width=19)                                          |
|  ->  XN Seq Scan on mv_tbl__mv_daily_sales__0 derived_table1  (cost=0.00..5.47 rows=365 width=19)|
+------------------------------------------------------------------------------------------------+

If data in the orders or lineitem table changes, mv_daily_sales becomes stale; this means the materialized view isn’t reflecting the state of its base tables. If we update a row in lineitem and check the stv_mv_info system table, we can see the is_stale flag is set to t (true):

UPDATE lineitem
SET l_extendedprice = 5000
WHERE l_orderkey = 2362252519
AND l_linenumber = 1;

SELECT name
      ,is_stale
FROM stv_mv_info
WHERE name = 'mv_daily_sales';

+--------------+--------+
|name          |is_stale|
+--------------+--------+
|mv_daily_sales|t       |
+--------------+--------+

We can now manually refresh the materialized view using the REFRESH MATERIALIZED VIEW statement:

REFRESH MATERIALIZED VIEW mv_daily_sales;

SELECT name
      ,is_stale
FROM stv_mv_info
WHERE name = 'mv_daily_sales';

+--------------+--------+
|name          |is_stale|
+--------------+--------+
|mv_daily_sales|f       |
+--------------+--------+

There are two types of materialized view refresh: full and incremental. A full refresh reruns the underlying SQL statement and rebuilds the whole materialized view. An incremental refresh only updates specific rows affected by the source data change. To see if a materialized view is eligible for incremental refreshes, view the state column in the stv_mv_info system table. A state of 0 indicates the materialized view will be fully refreshed, and a state of 1 indicates the materialized view will be incrementally refreshed.

SELECT name
      ,state
FROM stv_mv_info
WHERE name = 'mv_daily_sales';

+--------------+--------+
|name          |state   |
+--------------+--------+
|mv_daily_sales|       1|
+--------------+--------+

You can schedule manual refreshes on the Amazon Redshift console if you need to refresh a materialized view at fixed periods, such as once per hour. For more information, refer to Scheduling a query on the Amazon Redshift console.

As well as the ability to do a manual refresh, Amazon Redshift can also automatically refresh materialized views. The auto refresh feature intelligently determines when to refresh the materialized view, and if you have multiple materialized views, which order to refresh them in. Amazon Redshift considers the benefit of refreshing a materialized view (how often the materialized view is used, what performance gain the materialized view provides) and the cost (resources required for the refresh, current system load, available system resources).

This intelligent refreshing has a number of benefits. Because not all materialized views are equally important, deciding when and in which order to refresh materialized views on a large system is a complex task for a DBA to solve. Also, the DBA needs to consider other workloads running on the system, and try to ensure the latency of critical workloads is not increased by the effect of refreshing materialized views. The auto refresh feature helps remove the need for a DBA to do these difficult and time-consuming tasks.

You can set a materialized view to be automatically refreshed in the CREATE MATERIALIZED VIEW statement with the AUTO REFRESH YES parameter:

CREATE MATERIALIZED VIEW mv_daily_sales
AUTO REFRESH YES
AS
SELECT ...

Now when the source data of the materialized view changes, the materialized view is automatically refreshed. We can view the status of the refresh in the svl_mv_refresh_status system table. For example:

UPDATE lineitem
SET l_extendedprice = 6000
WHERE l_orderkey = 2362252519
AND l_linenumber = 1;

SELECT mv_name
      ,starttime
      ,endtime
      ,status
      ,refresh_type
FROM svl_mv_refresh_status
WHERE mv_name = 'mv_daily_sales';

+--------------+--------------------------+--------------------------+---------------------------------------------+------------+
|mv_name       |starttime                 |endtime                   |status                                       |refresh_type|
+--------------+--------------------------+--------------------------+---------------------------------------------+------------+
|mv_daily_sales|2022-05-06 14:07:24.857074|2022-05-06 14:07:33.342346|Refresh successfully updated MV incrementally|Auto        |
+--------------+--------------------------+--------------------------+---------------------------------------------+------------+

To remove a materialized view, we use the DROP MATERIALIZED VIEW command:

DROP MATERIALIZED VIEW mv_daily_sales;

Now that you’ve seen what materialized views are, their benefits, and how they are created, used, and removed, let’s discuss the drawbacks. Designing and implementing a set of materialized views to help improve overall query performance on a database requires a skilled resource to perform several involved and time-consuming tasks:

  • Analyzing queries run on the system
  • Identifying which queries are run regularly and provide business benefit
  • Prioritizing the identified queries
  • Determining if the performance improvement is worth creating a materialized view and storing the dataset
  • Physically creating and refreshing the materialized views
  • Monitoring the usage of the materialized views
  • Dropping materialized views that are rarely or never used or can’t be refreshed due to the structure of base tables changing

Significant skill, effort, and time is required to design and create materialized views that provide an overall benefit. Also, ongoing monitoring is needed to identify poorly designed or underutilized materialized views that are occupying resources without providing gains.

Amazon Redshift now has a feature to automate this process, Automated Materialized Views (AutoMVs). We explain how AutoMVs work and how to use them on your cluster in the following sections.

Automatically create materialized views

When the AutoMV feature is enabled on an Amazon Redshift cluster (it’s enabled by default), Amazon Redshift monitors recently run queries and identifies any that could have their performance improved by a materialized view. Expensive parts of the query, such as aggregates and joins that can be persisted into materialized views and reused by future queries, are then extracted from the main query and any subqueries. The extracted query parts are then rewritten into create materialized view statements (candidate materialized views) and stored for further processing.

The candidate materialized views are not just one-to-one copies of queries; extra processing is applied to create generalized materialized views that can be used by queries similar to the original query. In the following example, the result set is limited by the filters o_orderpriority = '1-URGENT' and l_shipmode ='AIR'. Therefore, a materialized view built from this result set could only serve queries selecting that limited range of data.

SELECT o.o_orderdate
      ,SUM(l.l_extendedprice)
FROM orders o
INNER JOIN lineitem l
   ON o.o_orderkey = l.l_orderkey
WHERE o.o_orderpriority = '1-URGENT'
AND   l.l_shipmode ='AIR'
GROUP BY o.o_orderdate;

Amazon Redshift uses many techniques to create generalized materialized views; one of these techniques is called predicate elevation. To apply predicate elevation to this query, the filtered columns o_orderpriority and l_shipmode are moved into the GROUP BY clause, thereby storing the full range of data in the materialized view, which allows similar queries to use the same materialized view. This approach is driven by dashboard-like workloads that often issue identical queries with different filter predicates.

SELECT o.o_orderdate
      ,o.o_orderpriority
      ,l.l_shipmode
      ,SUM(l.l_extendedprice)
FROM orders o
INNER JOIN lineitem l
   ON o.o_orderkey = l.l_orderkey
GROUP BY o.o_orderdate
        ,o.o_orderpriority
        ,l.l_shipmode;

In the next processing step, ML algorithms are applied to calculate which of the candidate materialized views provides the best performance benefit and system-wide performance optimization. The algorithms follow similar logic to the auto refresh feature mentioned previously. For each candidate materialized view, Amazon Redshift calculates a benefit, which corresponds to the expected performance improvement should the materialized view be materialized and used in the workload. In addition, it calculates a cost corresponding to the system resources required to create and maintain the candidate. Existing manual materialized views are also considered; an AutoMV will not be created if a manual materialized view already exists that covers the same scope, and manual materialized views have auto refresh priority over AutoMVs.

The list of materialized views is then sorted in order of overall cost-benefit, taking into consideration workload management (WLM) query priorities, with materialized views related to queries on a higher priority queue ordered before materialized views related to queries on a lower priority queue. After the list of materialized views has been fully sorted, they’re automatically created and populated in the background in the prioritized order.

The created AutoMVs are then monitored by a background process that checks their activity, such as how often they have been queried and refreshed. If the process determines that an AutoMV is not being used or refreshed, for example due to the base table’s structure changing, it is dropped.

Example

To demonstrate this process in action, we use the following query taken from the 3 TB Cloud DW Benchmark, a performance testing benchmark derived from TPC-H. You can load the benchmark data into your cluster and follow along with the example.

SET enable_result_cache_for_session TO OFF;

SELECT /* TPC-H Q12 */
       l_shipmode
     , SUM(CASE
              WHEN o_orderpriority = '1-URGENT'
                 OR o_orderpriority = '2-HIGH'
                 THEN 1
              ELSE 0
   END) AS high_line_count
     , SUM(CASE
              WHEN o_orderpriority <> '1-URGENT'
                 AND o_orderpriority <> '2-HIGH'
                 THEN 1
              ELSE 0
   END) AS low_line_count
FROM orders
   , lineitem
WHERE o_orderkey = l_orderkey
AND l_shipmode IN ('MAIL', 'SHIP')
AND l_commitdate < l_receiptdate
AND l_shipdate < l_commitdate
AND l_receiptdate >= DATE '1994-01-01'
AND l_receiptdate < DATEADD(YEAR, 1, CAST('1994-01-01' AS DATE))
GROUP BY l_shipmode
ORDER BY l_shipmode;

We run the query three times and then wait for 30 minutes. On a 10-node ra3.4xlarge cluster, the query runs in approximately 8 seconds.

During the 30 minutes, Amazon Redshift assesses the benefit of materializing candidate AutoMVs. It computes a sorted list of candidate materialized views and creates the most beneficial ones with incremental refresh, auto refresh, and query rewrite enabled. When the query or similar queries run, they’re automatically and transparently rewritten to use one or more of the created AutoMVs.

From then on, when data in the base tables is modified and the AutoMV becomes stale, an incremental refresh automatically runs, inserting, updating, and deleting rows in the AutoMV to bring its data to the latest state.

Rerunning the query shows that it runs in approximately 800 milliseconds, a performance improvement of 90%. We can confirm the query is using the AutoMV by checking the explain plan:

EXPLAIN SELECT /* TPC-H Q12 */
       l_shipmode
     , SUM(CASE
              WHEN o_orderpriority = '1-URGENT'
                 OR o_orderpriority = '2-HIGH'
                 THEN 1
              ELSE 0
   END) AS high_line_count
     , SUM(CASE
              WHEN o_orderpriority <> '1-URGENT'
                 AND o_orderpriority <> '2-HIGH'
                 THEN 1
              ELSE 0
   END) AS low_line_count
FROM orders
   , lineitem
WHERE o_orderkey = l_orderkey
AND l_shipmode IN ('MAIL', 'SHIP')
AND l_commitdate < l_receiptdate
AND l_shipdate < l_commitdate
AND l_receiptdate >= DATE '1994-01-01'
AND l_receiptdate < DATEADD(YEAR, 1, CAST('1994-01-01' AS DATE))
GROUP BY l_shipmode
ORDER BY l_shipmode;

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                                           |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|XN Merge  (cost=1000000000354.23..1000000000354.23 rows=1 width=30)                                                                                                  |
|  Merge Key: derived_table1.grvar_1                                                                                                                                  |
|  ->  XN Network  (cost=1000000000354.23..1000000000354.23 rows=1 width=30)                                                                                          |
|        Send to leader                                                                                                                                               |
|        ->  XN Sort  (cost=1000000000354.23..1000000000354.23 rows=1 width=30)                                                                                       |
|              Sort Key: derived_table1.grvar_1                                                                                                                       |
|              ->  XN HashAggregate  (cost=354.21..354.22 rows=1 width=30)                                                                                            |
|                    ->  XN Seq Scan on mv_tbl__auto_mv_2000__0 derived_table1  (cost=0.00..349.12 rows=679 width=30)                                                 |
|                          Filter: ((grvar_2 < '1995-01-01'::date) AND (grvar_2 >= '1994-01-01'::date) AND ((grvar_1 = 'SHIP'::bpchar) OR (grvar_1 = 'MAIL'::bpchar)))|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+

To demonstrate how AutoMVs can also improve the performance of similar queries, we change some of the filters on the original query. In the following example, we change the filter on l_shipmode from IN ('MAIL', 'SHIP') to IN ('TRUCK', 'RAIL', 'AIR'), and change the filter on l_receiptdate to the first 6 months of the previous year. The query runs in approximately 900 milliseconds and, looking at the explain plan, we confirm it’s using the AutoMV:

EXPLAIN SELECT /* TPC-H Q12 modified */
       l_shipmode
     , SUM(CASE
              WHEN o_orderpriority = '1-URGENT'
                 OR o_orderpriority = '2-HIGH'
                 THEN 1
              ELSE 0
   END) AS high_line_count
     , SUM(CASE
              WHEN o_orderpriority <> '1-URGENT'
                 AND o_orderpriority <> '2-HIGH'
                 THEN 1
              ELSE 0
   END) AS low_line_count
FROM orders
   , lineitem
WHERE o_orderkey = l_orderkey
AND l_shipmode IN ('TRUCK', 'RAIL', 'AIR')
AND l_commitdate < l_receiptdate
AND l_shipdate < l_commitdate
AND l_receiptdate >= DATE '1993-01-01'
AND l_receiptdate < DATE '1993-07-01'
GROUP BY l_shipmode
ORDER BY l_shipmode;

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                                                                         |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|XN Merge  (cost=1000000000396.30..1000000000396.31 rows=1 width=30)                                                                                                                                |
|  Merge Key: derived_table1.grvar_1                                                                                                                                                                |
|  ->  XN Network  (cost=1000000000396.30..1000000000396.31 rows=1 width=30)                                                                                                                        |
|        Send to leader                                                                                                                                                                             |
|        ->  XN Sort  (cost=1000000000396.30..1000000000396.31 rows=1 width=30)                                                                                                                     |
|              Sort Key: derived_table1.grvar_1                                                                                                                                                     |
|              ->  XN HashAggregate  (cost=396.29..396.29 rows=1 width=30)                                                                                                                          |
|                    ->  XN Seq Scan on mv_tbl__auto_mv_2000__0 derived_table1  (cost=0.00..392.76 rows=470 width=30)                                                                               |
|                          Filter: ((grvar_2 < '1993-07-01'::date) AND (grvar_2 >= '1993-01-01'::date) AND ((grvar_1 = 'AIR'::bpchar) OR (grvar_1 = 'RAIL'::bpchar) OR (grvar_1 = 'TRUCK'::bpchar)))|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The AutoMV feature is transparent to users and is fully system managed. Therefore, unlike manual materialized views, AutoMVs are not visible to users and can’t be queried directly. They also don’t appear in any system tables like stv_mv_info or svl_mv_refresh_status.

Finally, if the AutoMV hasn’t been used for some time by the workload, it’s automatically dropped and the storage released. When we rerun the query after this, the runtime returns to the original 8 seconds because the query is now using the base tables. This can be confirmed by examining the explain plan.

This example illustrates that the AutoMV feature reduces the effort and time required to create and maintain materialized views.

Performance tests and results

To see how well AutoMVs work in practice, we ran tests using the 1 TB and 3 TB versions of the Cloud DW benchmark derived from TPC-H. This test consists of a power run script with 22 queries that is run three times with the results cache off. The tests were run with two different clusters: 4-node ra3.4xlarge and 2-node ra3.16xlarge with a concurrency of 1 and 5.

The Cloud DW benchmark is derived from the TPC-H benchmark. It isn’t comparable to published TPC-H results, because the results of our tests don’t fully comply with the specification.

The following table shows our results.

Suite | Scale | Cluster | Concurrency | Number of Queries | Elapsed Secs (AutoMV Off) | Elapsed Secs (AutoMV On) | % Improvement
TPC-H | 1 TB | 4-node ra3.4xlarge | 1 | 66 | 1046 | 913 | 13%
TPC-H | 1 TB | 4-node ra3.4xlarge | 5 | 330 | 3592 | 3191 | 11%
TPC-H | 3 TB | 2-node ra3.16xlarge | 1 | 66 | 1707 | 1510 | 12%
TPC-H | 3 TB | 2-node ra3.16xlarge | 5 | 330 | 6971 | 5650 | 19%

The AutoMV feature improved query performance by up to 19% without any manual intervention.

Summary

In this post, we first presented manual materialized views, their various features, and how to take advantage of them. We then looked into the effort and time required to design, create, and maintain materialized views to provide performance improvements in a data warehouse.

Next, we discussed how AutoMVs help overcome these challenges and seamlessly provide performance improvements for SQL queries and dashboards. We went deeper into the details of how AutoMVs work and discussed how ML algorithms determine which materialized views to create based on the predicted performance improvement and overall benefit they will provide compared to the cost required to create and maintain them. Then we covered some of the internal processing logic such as how predicate elevation creates generalized materialized views that can be used by a range of queries, not just the original query that triggered the materialized view creation.

Finally, we showed the results of a performance test on an industry benchmark where the AutoMV feature improved performance by up to 19%.

As we have demonstrated, automated materialized views provide performance improvements to a data warehouse without requiring any manual effort or specialized expertise. They transparently work in the background, optimizing your workload performance and automatically adapting when your workloads change.

Automated materialized views are enabled by default. We encourage you to monitor any performance improvements they have on your current clusters. If you’re new to Amazon Redshift, try the Getting Started tutorial and use the free trial to create and provision your first cluster and experiment with the feature.


About the Authors

Adam Gatt is a Senior Specialist Solution Architect for Analytics at AWS. He has over 20 years of experience in data and data warehousing and helps customers build robust, scalable and high-performance analytics solutions in the cloud.

Rahul Chaturvedi is an Analytics Specialist Solutions Architect at AWS. Prior to this role, he was a Data Engineer at Amazon Advertising and Prime Video, where he helped build petabyte-scale data lakes for self-serve analytics.

Achieve fine-grained data security with row-level access control in Amazon Redshift

Post Syndicated from Harshida Patel original https://aws.amazon.com/blogs/big-data/achieve-fine-grained-data-security-with-row-level-access-control-in-amazon-redshift/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. With Amazon Redshift, you can analyze all your data to derive holistic insights about your business and your customers. One of the challenges with security is that enterprises want to provide fine-grained access control at the row level for sensitive data. You can do this by creating views or using different databases and schemas for different users. However, this approach isn’t scalable and becomes complex to maintain over time. Customers have asked us to simplify the process of securing their data by providing the ability to control granular access.

Row-level security (RLS) in Amazon Redshift is built on the foundation of role-based access control (RBAC). RLS allows you to control which users or roles can access specific records of data within tables, based on security policies that are defined at the database object level. This new RLS capability in Amazon Redshift enables you to dynamically filter existing rows of data in a table. This is in addition to column-level access control, where you can grant users permissions to a subset of columns. Now you can combine column-level access control with RLS policies to further restrict access to particular rows of visible columns.

In this post, we explore the row-level security features of Amazon Redshift and how you can use roles to simplify managing the privileges required by your end users.

Customer feedback

TrustLogix is a Norwest Venture Partners-backed cloud security startup in the data security governance space. TrustLogix delivers powerful monitoring, observability, audit, and fine-grained data entitlement capabilities that empower Amazon Redshift clients to implement data-centric security for their digital transformation initiatives.

“We’re excited about this new and deeper level of integration with Amazon Redshift. Our joint customers in security-forward and highly regulated sectors including financial services, healthcare, and pharmaceutical need to have incredibly fine-grained control over which users are allowed to access what data, and under which specific contexts. The new row-level security capabilities will allow our customers to precisely dictate data access controls based on their business entitlements while abstracting them away from the technical complexities. The new Amazon Redshift RLS capability will enable our joint customers to model policies at the business level, deploy and enforce them via a security-as-code model, ensuring secure and consistent access to their sensitive data.”

-Ganesh Kirti, founder and CEO of TrustLogix.

Overview of row-level security in Amazon Redshift

Row-level security allows you to restrict some records to certain users or roles, depending on the content of those records. With RLS, you can define policies to enforce fine-grained row-level access control. When creating RLS policies, you can specify expressions that control whether Amazon Redshift returns any existing rows in a table in a query. With RLS policies limiting access, you don't have to add or externalize additional conditions in your queries. You can attach multiple policies to a table, and a single policy can be attached to multiple tables, making the relationship many-to-many. Once attached, an RLS policy applies to a relation and a set of users or roles for SELECT, UPDATE, and DELETE operations. All attached RLS policies must together evaluate to true for a record to be returned by a query. The RBAC built-in role, security admin (secadmin), is responsible for managing the policies.

The following diagram illustrates the workflow.

With RLS, you can do the following:

  • Restrict row access based on roles – The security admin creates policies that define whether a role can access specific records of data within a table.
  • Combine multiple policies per user or role – Multiple policies can be defined per user or role, and all policies are applied with AND syntax.
  • Enhance granular access control – RLS is built on role-based access control and can work alongside column-level access control.
  • No access if no policy applied – All data access is blocked when there is no applicable policy on an RLS-protected table.
  • Enable row-level and column-level security on the table – In the following example, the user house is part of the role staff. When house queries the table, only one record pertaining to house is returned; the rest of the records are filtered as per the RLS policy. The sensitive column is also restricted, so users from the role staff can’t see this column. User cuddy is part of the role manager. When cuddy queries the employees table, all records and columns are returned.

Row-level security relevant use cases

With row-level security, many use cases for fine-grained access controls become possible. The following are just some of the many application use cases:

  • A global company with data analysts across different countries or regions can enforce restriction of data access to analysts based on geo location due to data compliance requirements.
  • A sales department can create a policy that allows them to restrict the access to sales performance information specific to a particular salesperson or region.
  • A payroll department can create an RLS policy to restrict access to look at an individual’s payroll, but managers need payroll information on their direct reports. Managers don’t need to know the details of payroll information for other departments.
  • A hospital can create an RLS policy that allows doctors and nurses to view data rows for their patients only.
  • A bank can create a policy to restrict access to financial data rows based on an employee’s business division or role in the company.
  • A multi-tenant application can create a policy to enforce a logical separation of each tenant’s data rows from every other tenant’s rows.

In the following example use cases, we illustrate enforcing an RLS policy on a fictitious healthcare setup. We demonstrate RLS on the medicine_data table and patients table, based on a policy established for managers, doctors, and departments. We also cover using a custom session variable context to set an RLS policy for the multi-tenant table customer.

To download the script and set up the tables, choose rls_createtable.sql.

Example 1: Read and write access

To grant read and write access, complete the following steps:

  1. Define four RLS policies using the secadmin role:
    1. all_can_see – No restrictions to be imposed
    2. hide_confidential – Restricts records for non-confidential rows
    3. only_doctors_can_see – Restricts records such that only doctors can see data
    4. see_only_own_department – Restricts records to only see data for own department
      CREATE RLS POLICY all_can_see
      USING ( true );
      
      CREATE RLS POLICY hide_confidential
      WITH ( confidential BOOLEAN )
      USING ( confidential = false )
      ;
      
      -- Note: the employees table is used as a lookup in this policy
      
      CREATE RLS POLICY only_doctors_can_see
      USING (
          true = (
                  SELECT employee_is_doctor
                  FROM employees
                  WHERE employee_username = current_user
                  )
          )
      ;
      
      GRANT SELECT ON employees
      TO RLS POLICY only_doctors_can_see;
      
      CREATE RLS POLICY see_only_own_department
      WITH ( patient_dept_id INTEGER )
      USING (
          patient_dept_id IN (
                              SELECT department_id
                              FROM employees_departments
                              WHERE employee_username = current_user
                              )
          )
      ;
      
      GRANT SELECT ON employees_departments 
      TO RLS POLICY see_only_own_department;

  2. Create three roles for STAFF, MANAGER, and EXTERNAL:
    CREATE ROLE staff;
    CREATE ROLE manager;
    CREATE ROLE external;

  3. Now we define column-level access control for the roles and columns that are implementing the RLS policy:
    1. The MANAGER role can access all columns in the patients and medicine_data tables, including the confidential column that the RLS policies are defined on:
      --- manager can see full table patients and medicine data
      GRANT SELECT ON employees, employees_departments, patients, medicine_data TO ROLE manager, ROLE external;

    2. The STAFF role can access all columns except the confidential column:
      --- staff can see limited columns from medicine data
      GRANT SELECT (medicine_name, medicine_price) ON medicine_data 
      TO ROLE staff;
      
      --- staff can see limited columns from patients
      GRANT SELECT (patient_dept_id, patient_name, patient_birthday, patient_medicine, diagnosis) ON patients TO ROLE staff;

  4. Attach RLS policies to the roles we created:
    --- manager can see all medicine data
    ATTACH RLS POLICY all_can_see
    ON medicine_data
    TO ROLE manager;
    
    --- manager can see all patient data
    ATTACH RLS POLICY all_can_see
    ON patients
    TO ROLE manager;
    
    --- staff cannot see confidential medicine data
    ATTACH RLS POLICY hide_confidential
    ON medicine_data
    TO ROLE staff;
    
    --- staff cannot see confidential patient data
    ATTACH RLS POLICY hide_confidential
    ON patients
    TO ROLE staff;
    
    --- only doctors can see patient data
    ATTACH RLS POLICY only_doctors_can_see 
    ON patients
    TO PUBLIC;
    
    --- regular staff (doctors) can see data for patients in their department only
    ATTACH RLS POLICY see_only_own_department 
    ON patients
    TO ROLE staff;

  5. Enable RLS security on objects:
    ALTER TABLE medicine_data ROW LEVEL SECURITY on;
    ALTER TABLE patients ROW LEVEL SECURITY on;

  6. Create the users and grant them roles:
    CREATE USER house PASSWORD DISABLE;
    CREATE USER cuddy PASSWORD DISABLE;
    CREATE USER external PASSWORD DISABLE;
    
    GRANT ROLE staff TO house;
    GRANT ROLE manager TO cuddy;
    GRANT ROLE external TO external;

We can see RLS in action with a SELECT query:

--- As Cuddy, who is a doctor and a manager
SET SESSION AUTHORIZATION 'cuddy';

SELECT * FROM medicine_data;
--- policies applied: all_can_see

SELECT * FROM patients;
--- policies applied: all_can_see, only_doctors_can_see

As a super user and secadmin, you can query the svv_rls_applied_policy to audit and monitor the policies applied. We discuss system views for auditing and monitoring more later in this post.
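
For example, after running the queries above, a superuser or the secadmin can run a minimal audit query like the following (selecting everything avoids assuming specific column names):

-- requires superuser or secadmin privileges
SELECT * FROM svv_rls_applied_policy;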

--- As House, who is a doctor but not a manager - he is staff in department id 1

SET SESSION AUTHORIZATION 'house';

SELECT * FROM medicine_data;
--- column level access control applied 

SELECT current_user, medicine_name, medicine_price FROM medicine_data;
--- CLS + RLS policy = hide_confidential

SELECT * FROM patients;
--- column level access control applied

SELECT current_user, patient_dept_id, patient_name, patient_birthday, patient_medicine, diagnosis FROM patients;
--- CLS + RLS policies = hide_confidential, only_doctors_can_see, see_only_own_department

--- As External, who has no permission granted
SET SESSION AUTHORIZATION 'external';

SELECT * FROM medicine_data;
--- RLS policy applied: none - so no access

SELECT * FROM patients;
--- policies applied: none - so no access

With the UPDATE command, the user house can update only the patients records permitted by the RLS policies for department 1:

SET SESSION AUTHORIZATION 'house';
UPDATE patients
SET diagnosis = 'house updated diagnosis';

select current_user, * from patients; --house should only be able to query department 1 non-confidential records

To test DELETE, as the user house, let's delete records from the patients table. Only the two non-confidential records in patient_dept_id 1 should be deleted, as per the RLS policy:

SET SESSION AUTHORIZATION 'house';
delete  from patients;

Because both the records that house has access to are deleted from patients, selecting from the table will return no records.

When we switch to the user cuddy, who is a manager and a doctor, we have access to confidential records and can see three records:

SET SESSION AUTHORIZATION 'cuddy';
SELECT current_user, * from patients;

As a security admin, you can detach a policy from a table, user, or role. In this example, we detach the policy hide_confidential from the table patients from role staff:

DETACH RLS POLICY hide_confidential ON patients FROM ROLE staff;

When the user house queries the patients table, they should now have access to confidential records:

SET SESSION AUTHORIZATION 'house';

SELECT * from patients;

Using the security admin role, you can drop the policy hide_confidential:

DROP RLS POLICY IF EXISTS hide_confidential;

Because the hide_confidential RLS policy is still attached to the medicine_data table, you get a dependency error.

To remove this policy from all the tables, users, and roles, you can use the cascade option:

DROP RLS POLICY IF EXISTS hide_confidential cascade;

When user house queries the medicine_data table, no records are returned, because the medicine_data table has RLS on and no RLS policy is attached to the role staff for this table.

SET SESSION AUTHORIZATION 'house';
SELECT * from MEDICINE_DATA;

Let’s turn off row-level security on the table medicine_data using the security admin role:

ALTER TABLE MEDICINE_DATA ROW LEVEL SECURITY OFF;
SET SESSION AUTHORIZATION 'house';

SELECT * FROM MEDICINE_DATA;

Example 2: Session context variables

Some applications require connection pooling, and you can use application-based user authentication instead of separate database users for each user. The session context variables feature in Amazon Redshift enables you to pass the application user ID to the database for applying role-based security.

Amazon Redshift now allows you to set a customized session context variable using set_config, and an RLS policy can reference that variable to provide this granular access.

In this example, we illustrate the use case when you have the common table customer, where you’re getting data from several customers. The table has a column with c_customer_id to distinguish data for respective customers.

  1. Create the external user and grant the external role:
    CREATE USER external_user PASSWORD 'Testemp1';
    grant role EXTERNAL to external_user;

  2. Grant SELECT on the customer table to role external:
    grant usage on schema report to role EXTERNAL;
    GRANT select ON TABLE report.customer TO ROLE EXTERNAL;

  3. Turn on row-level security for the report.customer table:
    ALTER TABLE report.customer row level security on;

  4. Create a row-level security policy using the session context variable app.customer_id to enforce the policy to filter records for c_customer_id:
    CREATE RLS POLICY see_only_own_customer_rows
    WITH ( c_customer_id char(16) )
    USING ( c_customer_id = current_setting('app.customer_id', FALSE));
    ATTACH RLS POLICY see_only_own_customer_rows ON report.customer TO ROLE EXTERNAL;

Now we can observe RLS in action. When you query the customer table with the session context set to customer ID AAAAAAAAJNGEGCBA, the row-level policy is enforced and returns only the one customer row that matches the session variable value:

SET SESSION AUTHORIZATION 'external_user';

select set_config('app.customer_id', 'AAAAAAAAJNGEGCBA', FALSE);
select * from report.customer limit 10;
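
Switching the session variable switches which rows are visible. A short sketch, using a second, hypothetical customer ID:

-- 'BBBBBBBBXXXXXXXX' is a hypothetical customer ID for illustration
select set_config('app.customer_id', 'BBBBBBBBXXXXXXXX', FALSE);
-- only rows with c_customer_id = 'BBBBBBBBXXXXXXXX' are returned (if any exist)
select * from report.customer limit 10;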

Auditing and monitoring RLS policies

Amazon Redshift has added several new system views for monitoring row-level policies. The following table lists each system view, the users and roles that can access it, and its function.

System View | Users | Function
SVV_RLS_POLICY | sys:secadmin | View a list of all row-level security policies created
SVV_RLS_RELATION | sys:secadmin | View a list of all RLS-protected relations
SVV_RLS_ATTACHED_POLICY | sys:secadmin | View a list of all relations and users that have one or more row-level security policies attached on the currently connected database
SVV_RLS_APPLIED_POLICY | Superuser, sys:operator, or any user with the system permission ACCESS SYSTEM TABLE | Trace the application of RLS policies on queries that reference RLS-protected relations
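
For example, as the secadmin role you can list every policy created so far with a simple select over SVV_RLS_POLICY:

-- lists all RLS policies on the cluster
SELECT * FROM svv_rls_policy;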

Conclusion

In this post, we demonstrated how you can simplify the management of row-level security for fine-grained access control of your sensitive data, building on the foundation of role-based access control. For more information about RLS best practices, refer to the Amazon Redshift security overview. Try out RLS for your future Amazon Redshift implementations, and feel free to leave a comment about your use cases and experience.

Amazon Redshift Spectrum supports row-level, column-level, and cell-level access control for data stored in Amazon Simple Storage Service (Amazon S3) and managed by AWS Lake Formation. In a future post, we will show how you can implement row-level security for Redshift Spectrum tables using Lake Formation.


About the authors

Harshida Patel is a Specialist Sr. Solutions Architect, Analytics, with AWS.

Milind Oke is a Senior Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over two decades and specializes in Amazon Redshift.

Abhilash Nagilla is a Specialist Solutions Architect, Analytics, with AWS.

Yanzhu Ji is a Product Manager on the Amazon Redshift team. She worked on the Amazon Redshift team as a Software Engineer before becoming a Product Manager. She has rich experience of how the customer-facing Amazon Redshift features are built from planning to launching, and always treats customers’ requirements as first priority. In her personal life, Yanzhu likes painting, photography, and playing tennis.

Kiran Chinta is a Software Development Manager at Amazon Redshift. He leads a strong team in query processing, SQL language, data security, and performance. Kiran is passionate about delivering products that seamlessly integrate with customers’ business applications with the right ease of use and performance. In his spare time, he enjoys reading and playing tennis.

Debu Panda is a Senior Manager, Product Management, with AWS. He is an industry leader in analytics, application platforms, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases, and has presented at multiple conferences such as AWS re:Invent, Oracle OpenWorld, and JavaOne. He is the lead author of EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

Amazon Redshift Serverless – Now Generally Available with New Capabilities

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/amazon-redshift-serverless-now-generally-available-with-new-capabilities/

Last year at re:Invent, we introduced the preview of Amazon Redshift Serverless, a serverless option of Amazon Redshift that lets you analyze data at any scale without having to manage data warehouse infrastructure. You just need to load and query your data, and you pay only for what you use. This allows more companies to build a modern data strategy, especially for use cases where analytics workloads are not running 24-7 and the data warehouse is not active all the time. It is also applicable to companies where the use of data expands within the organization and users in new departments want to run analytics without having to take ownership of data warehouse infrastructure.

Today, I am happy to share that Amazon Redshift Serverless is generally available and that we added many new capabilities. We are also reducing Amazon Redshift Serverless compute costs compared to the preview.

You can now create multiple serverless endpoints per AWS account and Region using namespaces and workgroups:

  • A namespace is a collection of database objects and users, such as database name and password, permissions, and encryption configuration. This is where your data is managed and where you can see how much storage is used.
  • A workgroup is a collection of compute resources, including network and security settings. Each workgroup has a serverless endpoint to which you can connect your applications. When configuring a workgroup, you can set up private or publicly accessible endpoints.

Each namespace can have only one workgroup associated with it. Conversely, each workgroup can be associated with only one namespace. You can have a namespace without any workgroup associated with it, for example, to use it only for sharing data with other namespaces in the same or another AWS account or Region.
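
If you prefer scripting over the console, the AWS CLI exposes the same two concepts. The following is a minimal sketch, assuming the hypothetical names my-namespace and my-workgroup (base capacity is measured in Redshift processing units, discussed later in this post):

# Create the namespace (storage, users, encryption settings)
aws redshift-serverless create-namespace \
    --namespace-name my-namespace

# Create the workgroup (compute) and associate it with the namespace
aws redshift-serverless create-workgroup \
    --workgroup-name my-workgroup \
    --namespace-name my-namespace \
    --base-capacity 32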

In your workgroup configuration, you can now use query monitoring rules to help keep your costs under control. Also, Amazon Redshift Serverless now scales data warehouse capacity more intelligently to deliver fast performance for demanding and unpredictable workloads.

Let’s see how this works with a quick demo. Then, I’ll show you what you can do with namespaces and workgroups.

Using Amazon Redshift Serverless
In the Amazon Redshift console, I select Redshift serverless in the navigation pane. To get started, I choose Use default settings to configure a namespace and a workgroup with the most common options. For example, I’ll be able to connect using my default VPC and default security group.

With the default settings, the only option left to configure is Permissions. Here, I can specify how Amazon Redshift can interact with other services such as S3, Amazon CloudWatch Logs, Amazon SageMaker, and AWS Glue. To load data later, I give Amazon Redshift access to an S3 bucket. I choose Manage IAM roles and then Create IAM role.

When creating the IAM role, I select the option to give access to specific S3 buckets and pick an S3 bucket in the same AWS Region. Then, I choose Create IAM role as default to complete the creation of the role and to automatically use it as the default role for the namespace.

I choose Save configuration, and after a few minutes the database is ready for use. In the Serverless dashboard, I choose Query data to open the Amazon Redshift query editor v2. There, I follow the instructions in the Amazon Redshift Database Developer Guide to load a sample database. If you want to do a quick test, a few sample databases (including the one I am using here) are already available in the sample_data_dev database. Note also that loading data into Amazon Redshift is not required for running queries: I can use data from an S3 data lake in my queries by creating an external schema and an external table.
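
For reference, here is what such an external schema and table could look like. This is a hedged sketch, assuming a hypothetical AWS Glue Data Catalog database named my_datalake_db and Parquet files under the placeholder S3 prefix:

-- Register a Glue Data Catalog database as an external schema
-- (my_datalake and my_datalake_db are hypothetical names)
CREATE EXTERNAL SCHEMA my_datalake
FROM DATA CATALOG
DATABASE 'my_datalake_db'
IAM_ROLE default;

-- Define an external table over Parquet files in Amazon S3
CREATE EXTERNAL TABLE my_datalake.page_views (
    view_time TIMESTAMP,
    user_id   INTEGER,
    page_url  VARCHAR(256)
)
STORED AS PARQUET
LOCATION 's3://MYBUCKET/page_views/';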

The sample database consists of seven tables and tracks sales activity for a fictional “TICKIT” website, where users buy and sell tickets for sporting events, shows, and concerts.

Sample database tables relations

To configure the database schema, I run a few SQL commands to create the users, venue, category, date, event, listing, and sales tables.

Then, I download the tickitdb.zip file that contains the sample data for the database tables. I unzip and load the files to a tickit folder in the same S3 bucket I used when configuring the IAM role.

Now, I can use the COPY command to load the data from the S3 bucket into my database. For example, to load data into the users table:

copy users from 's3://MYBUCKET/tickit/allusers_pipe.txt' iam_role default;

The file containing the data for the sales table uses tab-separated values:

copy sales from 's3://MYBUCKET/tickit/sales_tab.txt' iam_role default delimiter '\t' timeformat 'MM/DD/YYYY HH:MI:SS';
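
If a COPY command fails or loads fewer rows than expected, the STL_LOAD_ERRORS system table is the first place to look; for example:

-- shows the most recent load errors, including the offending file and reason
select starttime, filename, line_number, err_reason
from stl_load_errors
order by starttime desc
limit 5;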

After I load data in all tables, I start running some queries. For example, the following query joins five tables to find the top five sellers for events based in California (note that the sample data is for the year 2008):

select sellerid, username, (firstname ||' '|| lastname) as sellername, venuestate, sum(qtysold)
from sales, date, users, event, venue
where sales.sellerid = users.userid
and sales.dateid = date.dateid
and sales.eventid = event.eventid
and event.venueid = venue.venueid
and year = 2008
and venuestate = 'CA'
group by sellerid, username, sellername, venuestate
order by 5 desc
limit 5;

Now that my database is ready, let’s see what I can do by configuring Amazon Redshift Serverless namespaces and workgroups.

Using and Configuring Namespaces
Namespaces are collections of database objects and data, together with their security configurations. In the navigation pane of the Amazon Redshift console, I choose Namespace configuration. In the list, I choose the default namespace that I just created.

In the Data backup tab, I can create or restore a snapshot or restore data from one of the recovery points that are automatically created every 30 minutes and kept for 24 hours. That can be useful to recover data in case of accidental writes or deletes.

Console screenshot.

In the Security and encryption tab, I can update permissions and encryption settings, including the AWS Key Management Service (AWS KMS) key used to encrypt and decrypt my resources. In this tab, I can also enable audit logging and export the user, connection, and user activity logs.

Console screenshot.

In the Datashares tab, I can create a datashare to share data with other namespaces and AWS accounts in the same or different Regions. In this tab, I can also create a database from a share I receive from other namespaces or AWS accounts, and I can see the subscriptions for datashares managed by AWS Data Exchange.

Console screenshot.

When I create a datashare, I can select which objects to include. For example, here I want to share only the date and event tables because they don’t contain sensitive data.

Console screenshot.
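
The SQL equivalent is a short sketch like the following (the datashare name and the consumer namespace ID are placeholders):

CREATE DATASHARE tickit_share;
ALTER DATASHARE tickit_share ADD SCHEMA public;
ALTER DATASHARE tickit_share ADD TABLE public.date;
ALTER DATASHARE tickit_share ADD TABLE public.event;
GRANT USAGE ON DATASHARE tickit_share TO NAMESPACE 'consumer-namespace-id';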

Using and Configuring Workgroups
Workgroups are collections of compute resources and their network and security settings. They provide the serverless endpoint for the namespace they are configured for. In the navigation pane of the Amazon Redshift console, I choose Workgroup configuration. In the list, I choose the default workgroup that I just created.

In the Data access tab, I can update the network and security settings (for example, change the VPC, the subnets, or the security group) or make the endpoint publicly accessible. In this tab, I can also enable Enhanced VPC routing to route network traffic between my serverless database and the data repositories I use (for example, the S3 buckets used to load or unload data) through a VPC instead of the internet. To access serverless endpoints that are in another VPC or subnet, I can create a VPC endpoint managed by Amazon Redshift.

Console screenshot.

In the Limits tab, I can configure the base capacity (expressed in Redshift processing units, or RPUs) used to process my queries. Amazon Redshift Serverless scales the capacity to deal with a higher number of users. Here I also have the option to increase the base capacity to speed up my queries or decrease it to reduce costs.

In this tab, I can also set Usage limits to configure daily, weekly, and monthly thresholds to keep my costs predictable. For example, I configured a daily limit of 200 RPU-hours, and a monthly limit of 2,000 RPU-hours for my compute resources. To control the data-transfer costs for cross-Region datashares, I configured a daily limit of 3 TB and a weekly limit of 10 TB. Finally, to limit the resources used by each query, I use Query limits to time out queries running for more than 60 seconds.

Console screenshot.

Availability and Pricing
Amazon Redshift Serverless is generally available today in the US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), Europe (London), and Europe (Stockholm) AWS Regions.

You can connect to a workgroup endpoint using your favorite client tools via JDBC/ODBC or with the Amazon Redshift query editor v2, a web-based SQL client application available on the Amazon Redshift console. When using web services-based applications (such as AWS Lambda functions or Amazon SageMaker notebooks), you can access your database and perform queries using the built-in Amazon Redshift Data API.

With Amazon Redshift Serverless, you pay only for the compute capacity your database consumes when active. The compute capacity scales up or down automatically based on your workload and shuts down during periods of inactivity to save time and costs. Your data is stored in managed storage, and you pay a GB-month rate.

To give you improved price performance and the flexibility to use Amazon Redshift Serverless for an even broader set of use cases, we are lowering the price from $0.5 to $0.375 per RPU-hour for the US East (N. Virginia) Region. Similarly, we are lowering the price in other Regions by an average of 25 percent from the preview price. For more information, see the Amazon Redshift pricing page.
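
As a worked example at the new rate, the 2,000 RPU-hour monthly limit configured earlier would cap compute charges at 2,000 × $0.375 = $750 per month in that Region, excluding managed storage.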

To help you get practice with your own use cases, we are also providing $300 in AWS credits for 90 days to try Amazon Redshift Serverless. These credits are used to cover your costs for compute, storage, and snapshot usage of Amazon Redshift Serverless only.

Get insights from your data in seconds with Amazon Redshift Serverless.

Danilo

Accelerate machine learning with AWS Data Exchange and Amazon Redshift ML

Post Syndicated from Yadgiri Pottabhathini original https://aws.amazon.com/blogs/big-data/accelerate-machine-learning-with-aws-data-exchange-and-amazon-redshift-ml/

Amazon Redshift ML makes it easy for SQL users to create, train, and deploy ML models using familiar SQL commands. Redshift ML allows you to use your data in Amazon Redshift with Amazon SageMaker, a fully managed ML service, without requiring you to become an expert in ML.

AWS Data Exchange makes it easy to find, subscribe to, and use third-party data in the cloud. AWS Data Exchange for Amazon Redshift enables you to access and query tables in Amazon Redshift without the need to extract, transform, and load (ETL) files.

As a subscriber, you can browse the AWS Data Exchange catalog, find data products relevant to your business with data stored in Amazon Redshift, and subscribe to the data from the providers without any further processing or need for an ETL process.

If the provider data is not already available in Amazon Redshift, many providers will add the data to Amazon Redshift upon request.

In this post, we show you the process of subscribing to datasets through AWS Data Exchange without ETL, running ML algorithms on an Amazon Redshift cluster, and performing inference locally in production.

Solution overview

The use case for the solution in this post is to predict ticket sales for worldwide events based on historical ticket sales data using a regression model. The data or ETL engineer can build the data pipeline by subscribing to the Worldwide Event Attendance product on AWS Data Exchange without ETL. You can then create the ML model in Redshift ML using the time series ticket sales data and predict future ticket sales.

To implement this solution, you complete the following high-level steps:

  1. Subscribe to datasets using AWS Data Exchange for Amazon Redshift.
  2. Connect to the datashare in Amazon Redshift.
  3. Create the ML model using the SQL notebook feature of the Amazon Redshift query editor V2.

The following diagram illustrates the solution architecture.

Prerequisites

Before starting this walkthrough, you must complete the following prerequisites:

  1. Make sure you have an existing Amazon Redshift cluster with the RA3 node type. If not, you can create a provisioned Amazon Redshift cluster.
  2. Make sure the Amazon Redshift cluster is encrypted, because the data provider's cluster is encrypted and Amazon Redshift data sharing requires homogeneous encryption configurations. For more details on homogeneous encryption, refer to Data sharing considerations in Amazon Redshift.
  3. Create an AWS Identity and Access Management (IAM) role with access to SageMaker and Amazon Simple Storage Service (Amazon S3) and attach it to the Amazon Redshift cluster. Refer to Cluster setup for using Amazon Redshift ML for more details.
  4. Create an S3 bucket for storing the training data and model output.

Please note that some of the AWS resources in this walkthrough incur charges. Remember to delete the resources when you're finished.

Subscribe to an AWS Data Exchange product with Amazon Redshift data

To subscribe to an AWS Data Exchange public dataset, complete the following steps:

  1. On the AWS Data Exchange console, choose Explore available data products.
  2. In the navigation pane, under Data available through, select Amazon Redshift to filter products with Amazon Redshift data.
  3. Choose Worldwide Event Attendance (Test Product).
  4. Choose Continue to subscribe.
  5. Confirm the catalog is subscribed by checking that it’s listed on the Subscriptions page.

Predict tickets sold using Redshift ML

To set up prediction using Redshift ML, complete the following steps:

  1. On the Amazon Redshift console, choose Datashares in the navigation pane.
  2. On the Subscriptions tab, confirm that the AWS Data Exchange datashare is available.
  3. In the navigation pane, choose Query editor v2.
  4. Connect to your Amazon Redshift cluster in the navigation pane.

Amazon Redshift provides a feature to create notebooks and run your queries; the notebook feature is currently in preview and available in all Regions. For the remaining part of this tutorial, we run queries in the notebook and create comments in the markdown cell for each step.

  1. Choose the plus sign and choose Notebook.
  2. Choose Add markdown.
  3. Enter Show available data share in the cell.
  4. Choose Add SQL.
  5. Enter and run the following command to see the available datashares for the cluster:
    SHOW datashares;

You should be able to see worldwide_event_test_data, as shown in the following screenshot.

  1. Note the producer_namespace and producer_account values from the output, which we use in the next step.
  2. Choose Add markdown and enter Create database from datashare with producer_namespace and producer_account.
  3. Choose Add SQL and enter the following code to create a database to access the datashare. Use the producer_namespace and producer_account values you copied earlier.
    CREATE DATABASE ml_blog_db FROM DATASHARE worldwide_event_test_data OF ACCOUNT 'producer_account' NAMESPACE 'producer_namespace';

  4. Choose Add markdown and enter Create new table to consolidate features.
  5. Choose Add SQL and enter the following code to create a new table called event consisting of the event sales by date and assign a running serial number to split into training and validation datasets:
    CREATE TABLE event AS
    	SELECT eventname, qtysold, saletime, day, week, month, qtr, year, holiday, ROW_NUMBER() OVER (ORDER BY RANDOM()) r
    	FROM "ml_blog_db"."public"."sales" s
    	INNER JOIN "ml_blog_db"."public"."event" e
    	ON s.eventid = e.eventid
    	INNER JOIN "ml_blog_db"."public"."date" d
    	ON s.dateid = d.dateid;

Event name, quantity sold, sale time, day, week, month, quarter, year, and holiday are columns in the dataset from AWS Data Exchange that are used as features in the ML model creation.

  1. Choose Add markdown and enter Split the dataset into training dataset and validation dataset.
  2. Choose Add SQL and enter the following code to split the dataset into training and validation datasets:
    CREATE TABLE training_data AS 
    	SELECT eventname, qtysold, saletime, day, week, month, qtr, year, holiday
    	FROM event
    	WHERE r >
    	(SELECT COUNT(1) * 0.2 FROM event);
    
    	CREATE TABLE validation_data AS 
    	SELECT eventname, qtysold, saletime, day, week, month, qtr, year, holiday
    	FROM event
    	WHERE r <=
    	(SELECT COUNT(1) * 0.2 FROM event);

  3. Choose Add markdown and enter Create ML model.
  4. Choose Add SQL and enter the following command to create the model. Replace the your_s3_bucket parameter with your bucket name.
    CREATE MODEL predict_ticket_sold
    	FROM training_data
    	TARGET qtysold
    	FUNCTION predict_ticket_sold
    	IAM_ROLE default
    	PROBLEM_TYPE regression
    	OBJECTIVE 'mse'
    	SETTINGS (s3_bucket 'your_s3_bucket',
    	s3_garbage_collect off,
    	max_runtime 5000);

Note: It can take up to two hours to create and train the model.
The following screenshot shows the example output from adding our markdown and SQL.

  1. Choose Add markdown and enter Show model creation status. Continue to the next step once the Model State has changed to Ready.
  2. Choose Add SQL and enter the following command to get the status of the model creation:
    SHOW MODEL predict_ticket_sold;

Move to the next step after the Model State has changed to READY.

  1. Choose Add markdown and enter Run the inference for eventname Jason Mraz.
  2. When the model is ready, you can use the SQL function to apply the ML model to your data. The following is sample SQL code to predict the tickets sold for a particular event using the predict_ticket_sold function created in the previous step:
    SELECT eventname,
    	predict_ticket_sold(
    	eventname, saletime, day, week, month, qtr, year, holiday ) AS predicted_qty_sold,
    	day, week, month
    	FROM event
    	WHERE eventname = 'Jason Mraz';

The following is the output received by applying the ML function predict_ticket_sold on the original dataset. The output of the ML function is captured in the field predicted_qty_sold, which is the predicted ticket sold quantity.
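
As a quick sanity check (a sketch, not part of the original walkthrough), you can also compare predicted and actual quantities side by side:

SELECT eventname,
       SUM(qtysold) AS actual_qty_sold,
       SUM(predict_ticket_sold(eventname, saletime, day, week, month, qtr, year, holiday)) AS predicted_qty_sold
FROM event
GROUP BY eventname
ORDER BY actual_qty_sold DESC
LIMIT 10;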

Share notebooks

To share the notebooks, complete the following steps:

  1. Create an IAM role with the managed policy AmazonRedshiftQueryEditorV2FullAccess attached to the role.
  2. Add a principal tag to the role with the tag name sqlworkbench-team.
  3. Set the value of this tag to the principal (user, group, or role) you’re granting access to.
  4. After you configure these permissions, navigate to the Amazon Redshift console and choose Query editor v2 in the navigation pane. If you haven’t used the query editor v2 before, please configure your account to use query editor v2.
  5. Choose Notebooks in the left pane and navigate to My notebooks.
  6. Right-click on the notebook you want to share and choose Share with my team.
  7. You can confirm that the notebook is shared by choosing Shared to my team and checking that the notebook is listed.

Summary

In this post, we showed you how to build an end-to-end pipeline by subscribing to a public dataset through AWS Data Exchange, simplifying data integration and processing, and then running prediction using Redshift ML on the data.

We look forward to hearing from you about your experience. If you have questions or suggestions, please leave a comment.


About the Authors

Yadgiri Pottabhathini is a Sr. Analytics Specialist Solutions Architect. His role is to assist customers in their cloud data warehouse journey and help them evaluate and align their data analytics business objectives with Amazon Redshift capabilities.

Ekta Ahuja is an Analytics Specialist Solutions Architect at AWS. She is passionate about helping customers build scalable and robust data and analytics solutions. Before AWS, she worked in several different data engineering and analytics roles. Outside of work, she enjoys baking, traveling, and board games.

BP Yau is a Sr Product Manager at AWS. He is passionate about helping customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Srikanth Sopirala is a Principal Analytics Specialist Solutions Architect at AWS. He is a seasoned leader with over 20 years of experience, who is passionate about helping customers build scalable data and analytics solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys reading, spending time with his family, and road cycling.

Data warehouse and business intelligence technology consolidation using AWS

Post Syndicated from Bappaditya Datta original https://aws.amazon.com/blogs/architecture/data-warehouse-and-business-intelligence-technology-consolidation-using-aws/

Organizations have been using data warehouse and business intelligence (DWBI) workloads to support business decision making for many years. These workloads are brought to the Amazon Web Services (AWS) platform to take advantage of the benefits of the AWS Cloud. However, because these workloads are built using multiple vendors' tools and technologies, customers face the burden of administrative overhead.

This post provides architectural guidance to consolidate multiple DWBI technologies onto AWS managed services to help reduce administrative overhead, bring operational ease, and improve business efficiency. Two scenarios are explored:

  1. Upstream transactional databases are already on AWS
  2. Upstream transactional databases are present in an on-premises data center

Challenges faced by an organization

Organizations are engaged in managing multiple DWBI technologies due to acquisitions, mergers, and the lift-and-shift of workloads. These workloads use extract, transform, and load (ETL) tools to read relational data from upstream transactional databases, process it, and store it in a data warehouse. Thereafter, these workloads use business intelligence tools to generate valuable insight and present it to users in form of reports and dashboards.

These DWBI technologies are generally installed and maintained on their own servers. As Figure 1 demonstrates, this not only increases administrative overhead for the organization but also creates challenges in maintaining the team's overall knowledge.

DWBI workload with multiple tools

Figure 1. DWBI workload with multiple tools

Therefore, organizations are looking to consolidate technology usage and continue supporting important business functions.

Scenario 1

As we know, the three major functions of a DWBI workstream are:

  • ETL data using a tool
  • Store/manage the data in a data warehouse
  • Generate information from the data using business intelligence

Each of these functions can be performed efficiently using an AWS service. For example, AWS Glue can be used for ETL, Amazon Redshift for data warehouse, and Amazon QuickSight for business intelligence.

With the use of these AWS services, organizations will be able to consolidate their DWBI technology usage. They will also be able to adapt to these services quickly, as their engineering teams can apply their existing DWBI knowledge, for example, using SQL knowledge in AWS Glue jobs with SparkSQL, in Amazon Redshift queries, and in Amazon QuickSight dashboards.

Figure 2 demonstrates the architecture of Figure 1 redesigned using AWS services. In this architecture, ETL functions are consolidated in AWS Glue. An AWS Glue crawler is used to auto-catalogue the source and target table metadata; then, AWS Glue ETL jobs use these catalogues to read data from the source and write to the target (data warehouse). AWS Glue jobs also apply necessary transformations (such as join, filter, and aggregate) to the data before writing. Additionally, an AWS Glue trigger is used to schedule the job executions. Alternatively, Amazon Managed Workflows for Apache Airflow can be used to schedule jobs.

Consolidated workload with source on AWS

Figure 2. Consolidated workload with source on AWS

Similarly, the data warehousing function is consolidated with Amazon Redshift, which is used to store and organize enriched data and to enforce appropriate data access control for both workloads and users.

Lastly, business intelligence functions are consolidated using Amazon QuickSight. It is used to create the dashboards that source data from Amazon Redshift and apply complex business logic to produce the charts and graphs needed for business insights. It is also used to implement the necessary access restrictions to dashboards and data.

Scenario 2

In situations where the source databases are in an on-premises data center, the overall solution is similar to Scenario 1, with an additional step to move the data continually from the on-premises database to an Amazon Simple Storage Service (Amazon S3) bucket. The data movement can be efficiently handled by AWS Database Migration Service (AWS DMS).

To make the source database accessible to AWS DMS, a connection needs to be established between the AWS Cloud and the on-premises network. Based on performance and throughput needs, the organization can choose either AWS Direct Connect or AWS Site-to-Site VPN to securely move the data. For the purposes of this discussion, we are considering AWS Direct Connect.

In Figure 3, an AWS DMS task is used to perform a full load followed by change data capture to continuously move the data to an S3 bucket. In this scenario, AWS Glue is used to catalogue and read the data from the S3 bucket. The remaining portion of the dataflow is the same as the one described in Scenario 1.

Consolidated workload with source at datacenter

Figure 3. Consolidated workload with source at datacenter

Scaling

Both of the updated architectures provide necessary scaling:

  • The auto scaling feature can be used to scale AWS Glue ETL job resources up or down
  • Concurrency scaling feature can be used to support virtually unlimited concurrent users and queries in Amazon Redshift
  • Amazon QuickSight resources (web server, Amazon QuickSight engine, and SPICE) are auto scaled by design

Security, monitoring, and auditing

The updated architectures also provide the necessary security by using access control, data encryption at rest and in transit, monitoring, and auditing.

Additionally, both Amazon Redshift and Amazon QuickSight provide their own authentication and access controls, so a user can be either a local user or a federated one. With these authentication options, an organization can control access to data in Amazon Redshift as well as access to dashboards in Amazon QuickSight.
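
In Amazon Redshift, for example, this boils down to standard SQL grants (the user, password, and schema names below are hypothetical):

CREATE USER bi_analyst PASSWORD 'Str0ngPassw0rd';
GRANT USAGE ON SCHEMA sales_mart TO bi_analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_mart TO bi_analyst;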

Conclusion

In this blog post, we discussed how AWS Glue, Amazon Redshift, and Amazon QuickSight can be used to consolidate DWBI technologies. We also discussed how this architecture can help an organization build a scalable, secure workload with auto scaling, access control, log monitoring, and activity auditing.

Ready to get started?

Use Amazon CloudWatch as a destination for Amazon Redshift audit logs

Post Syndicated from Nita Shah original https://aws.amazon.com/blogs/big-data/using-aws-cloudwatch-as-destination-for-amazon-redshift-audit-logs/

Amazon Redshift is a fast, scalable, secure, and fully-managed cloud data warehouse that makes it simple and cost-effective to analyze all of your data using standard SQL. Amazon Redshift has comprehensive security capabilities to satisfy the most demanding requirements. To help you to monitor the database for security and troubleshooting purposes, Amazon Redshift logs information about connections and user activities in your database. This process is called database auditing.

Amazon Redshift Audit Logging is good for troubleshooting, monitoring, and security purposes, making it possible to determine suspicious queries by checking the connection and user logs to see who is connecting to the database. It gives information such as the IP address of the user's computer, the type of authentication used by the user, and the timestamp of the request. Audit logs make it easy to identify who modified the data. Amazon Redshift logs all of the SQL operations, including connection attempts, queries, and changes to your data warehouse. These logs can be accessed via SQL queries against system tables, saved to a secure Amazon Simple Storage Service (Amazon S3) location, or exported to Amazon CloudWatch. You can view your Amazon Redshift cluster's operational metrics on the Amazon Redshift console, use CloudWatch, and query Amazon Redshift system tables directly from your cluster.

This post will walk you through the process of configuring CloudWatch as an audit log destination. It will also show that the latency of log delivery to either Amazon S3 or CloudWatch is reduced to just a few minutes using enhanced Amazon Redshift audit logging. You can enable audit logging to Amazon CloudWatch via the AWS Management Console, the AWS Command Line Interface (AWS CLI), or the Amazon Redshift API.

Solution overview

Amazon Redshift logs information to two locations: system tables and log files.

  1. System tables: Amazon Redshift logs data to system tables automatically, and history data is available for two to five days based on log usage and available disk space. To extend the log data retention period in system tables, use the Amazon Redshift system object persistence utility from AWS Labs on GitHub. Analyzing logs through system tables requires Amazon Redshift database access and compute resources.
  2. Log files: Audit logging to CloudWatch or to Amazon S3 is an optional process. When you turn on logging on your cluster, you can choose to export audit logs to Amazon CloudWatch or Amazon S3. Once enabled, audit logging captures data from that point onward, and each logging update is a continuation of the previous one. Access to audit log files doesn’t require access to the Amazon Redshift database, and reviewing logs stored in Amazon S3 doesn’t require database computing resources. Audit log files are stored indefinitely in CloudWatch Logs or Amazon S3 by default.

Amazon Redshift logs information in the following log files:

  • Connection log – Provides information to monitor users connecting to the database and related connection information, such as the user's IP address.
  • User log – Logs information about changes to database user definitions.
  • User activity log – Tracks information about the types of queries that both the users and the system perform in the database. It's useful primarily for troubleshooting purposes.
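
The same connection information can also be queried directly from the system tables mentioned above; for example, the following sketch pulls recent authentication events from STL_CONNECTION_LOG:

SELECT recordtime, event, remotehost, username, dbname
FROM stl_connection_log
WHERE event = 'authenticated'
ORDER BY recordtime DESC
LIMIT 20;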

Benefits of enhanced audit logging

For a better customer experience, the existing architecture of the audit logging solution has been improved to make audit logging more consistent across AWS services. This new enhancement will reduce log export latency from hours to minutes with a fine grain of access control. Enhanced audit logging improves the robustness of the existing delivery mechanism, thus reducing the risk of data loss. Enhanced audit logging will let you export logs either to Amazon S3 or to CloudWatch.

The following section will show you how to configure audit logging using CloudWatch and its benefits.

Setting up CloudWatch as a log destination

Using CloudWatch to view logs is a recommended alternative to storing log files in Amazon S3. It's simple to configure and it may suit your monitoring requirements, especially if you already use it to monitor other services and applications.

To set up CloudWatch as your log destination, complete the following steps:

  1. On the Amazon Redshift console, choose Clusters in the navigation pane.
    This page lists the clusters in your account in the current Region. A subset of properties of each cluster is also displayed.
  2. Choose the cluster where you want to configure CloudWatch logs.

  3. On the Properties tab, choose to edit audit logging.
  4. Turn on Configure audit logging, and choose CloudWatch as the log export type.
  5. Choose Save changes.

Analyzing audit log in near real-time

To run SQL commands, we use the Amazon Redshift query editor v2, a web-based tool that you can use to explore, analyze, share, and collaborate on data stored in Amazon Redshift. However, you can use any client tool of your choice to run SQL queries.

Now we’ll run some simple SQLs and analyze the logs in CloudWatch in near real-time.

  1. Run test SQL statements to create and drop a user (for example, the statements shown after this list).
  2. On the AWS Management Console, choose CloudWatch under Services, and then select Log groups in the navigation pane.
  3. Select the userlog log group to see the user logs created in near real-time in CloudWatch for the test user that we just created and dropped.
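
The test statements from step 1 can be as simple as the following (the user name and password are arbitrary):

CREATE USER audit_test_user PASSWORD 'TestPassw0rd';
DROP USER audit_test_user;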

Benefits of using CloudWatch as a log destination

  • It’s easy to configure, as it doesn’t require you to modify bucket policies.
  • It’s easy to view logs and search through logs for specific errors, patterns, fields, etc.
  • You can have a centralized log solution across all AWS services.
  • No need to build a custom solution such as AWS Lambda or Amazon Athena to analyze the logs.
  • Logs will appear in near real-time.
  • It has improved log latency from hours to just minutes.
  • By default, log groups are encrypted in CloudWatch and you also have the option to use your own custom key.
  • Fine-grained configuration of which log types to export based on your specific auditing requirements.
  • It lets you export log groups’ logs to Amazon S3 if needed.

Setting up Amazon S3 as a log destination

Although using CloudWatch as a log destination is the recommended approach, you also have the option to use Amazon S3 as a log destination. When the log destination is set to an Amazon S3 location, enhanced audit logging checks for new logs every 15 minutes and exports them to Amazon S3. You can configure audit logging on Amazon S3 as a log destination from the console or through the AWS CLI.

Once you save the changes, the bucket policy is set as follows, using the Amazon Redshift service principal.

For additional details please refer to Amazon Redshift audit logging.

For enabling logging through the AWS CLI, refer to db-auditing-cli-api.

Cost

Exporting logs to Amazon S3 can be more cost-efficient. However, considering all of the benefits that CloudWatch provides for search, real-time access to data, and building dashboards from search results, it can better suit those who perform frequent log analysis.

Best practices

Amazon Redshift uses the AWS security frameworks to implement industry-leading security in the areas of authentication, access control, auditing, logging, compliance, data protection, and network security. For more information, refer to Security in Amazon Redshift.

Audit logging to CloudWatch or to Amazon S3 is an optional process, but to have the complete picture of your Amazon Redshift usage, we always recommend enabling audit logging, particularly in cases where there are compliance requirements.

Log data is stored indefinitely in CloudWatch Logs or Amazon S3 by default. This may incur high, unexpected costs. We recommend that you configure how long to store log data in a log group or Amazon S3 to balance costs with compliance retention requirements. Apply the right compression to reduce the log file size.

Conclusion

This post demonstrated how to get near real-time Amazon Redshift logs using CloudWatch as a log destination using enhanced audit logging. This new functionality helps make Amazon Redshift Audit logging easier than ever, without the need to implement a custom solution to analyze logs. We also demonstrated how the new enhanced audit logging reduces log latency significantly on Amazon S3 with fine-grained access control compared to the previous version of audit logging.

Unauthorized access is a serious problem for most systems. As an administrator, you can start exporting logs to prevent any future occurrence of things such as system failures, outages, corruption of information, and other security risks.


About the Authors

Nita Shah is an Analytics Specialist Solutions Architect at AWS based out of New York. She has been building data warehouse solutions for over 20 years and specializes in Amazon Redshift. She is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.

Evgenii Rublev is a Software Development Engineer on the Amazon Redshift team. He has worked on building end-to-end applications for over 10 years. He is passionate about innovations in building high-availability and high-performance applications to drive a better customer experience. Outside of work, Evgenii enjoys spending time with his family, traveling, and reading books.

Yanzhu Ji is a Product Manager on the Amazon Redshift team. She worked on the Amazon Redshift team as a Software Engineer before becoming a Product Manager, has rich experience in how customer-facing Amazon Redshift features are built from planning to launch, and always treats customers’ requirements as the first priority. In her personal life, Yanzhu likes painting, photography, and playing tennis.

Ryan Liddle is a Software Development Engineer on the Amazon Redshift team. His current focus is on delivering new features and behind the scenes improvements to best service Amazon Redshift customers. On the weekend he enjoys reading, exploring new running trails and discovering local restaurants.

Migrate from Snowflake to Amazon Redshift using AWS Glue Python shell

Post Syndicated from Raks Khare original https://aws.amazon.com/blogs/big-data/migrate-from-snowflake-to-amazon-redshift-using-aws-glue-python-shell/

As the most widely used cloud data warehouse, Amazon Redshift makes it simple and cost-effective to analyze your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to analyze exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics without having to manage the data warehouse infrastructure. It natively integrates with other AWS services, facilitating the process of building enterprise-grade analytics applications in a manner that is not only cost-effective, but also avoids point solutions.

We are continuously innovating and releasing new features of Amazon Redshift, enabling the implementation of a wide range of data use cases and meeting requirements with performance and scale. For example, Amazon Redshift Serverless allows you to run and scale analytics workloads without having to provision and manage data warehouse clusters. Other features that help power analytics at scale with Amazon Redshift include automatic concurrency scaling for read and write queries, automatic workload management (WLM) for concurrency scaling, automatic table optimization, the new RA3 instances with managed storage to scale cloud data warehouses and reduce costs, cross-Region data sharing, data exchange, and the SUPER data type to store semi-structured data or documents as values. For the latest feature releases for Amazon Redshift, see Amazon Redshift What’s New. In addition to improving performance and scale, you can also gain up to three times better price performance with Amazon Redshift than other cloud data warehouses.
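
As one illustration of the SUPER data type mentioned above, you can land JSON documents in a table and query them with PartiQL; the table and payload here are a minimal sketch:

CREATE TABLE raw_events (payload SUPER);

INSERT INTO raw_events
VALUES (JSON_PARSE('{"eventid": 101, "name": "concert", "tags": ["music", "live"]}'));

SELECT payload.eventid, payload.tags[0]
FROM raw_events;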

To take advantage of the performance, security, and scale of Amazon Redshift, customers are looking to migrate their data from their existing cloud warehouse in a way that is both cost optimized and performant. This post describes how to migrate a large volume of data from Snowflake to Amazon Redshift using AWS Glue Python shell in a manner that meets both these goals.

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning (ML), and application development. AWS Glue provides all the capabilities needed for data integration, allowing you to analyze your data in minutes instead of weeks or months. AWS Glue supports the ability to use a Python shell job to run Python scripts as a shell, enabling you to author ETL processes in a familiar language. In addition, AWS Glue allows you to manage ETL jobs using AWS Glue workflows, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), and AWS Step Functions, automating and facilitating the orchestration of ETL steps.

Solution overview

The following architecture shows how an AWS Glue Python shell job migrates the data from Snowflake to Amazon Redshift in this solution.

Architecture

The solution comprises two stages:

  • Extract – The first part of the solution extracts data from Snowflake into an Amazon Simple Storage Service (Amazon S3) data lake
  • Load – The second part of the solution reads the data from the same S3 bucket and loads it into Amazon Redshift

For both stages, we connect the AWS Glue Python shell jobs to Snowflake and Amazon Redshift using database connectors for Python. The first AWS Glue Python shell job reads a SQL file from an S3 bucket to run the relevant COPY commands on the Snowflake database using Snowflake compute capacity and parallelism to migrate the data to Amazon S3. When this is complete, the second AWS Glue Python shell job reads another SQL file, and runs the corresponding COPY commands on the Amazon Redshift database using Redshift compute capacity and parallelism to load the data from the same S3 bucket.

Both jobs are orchestrated using AWS Glue workflows, as shown in the following screenshot. The workflow pushes data processing logic down to the respective data warehouses by running COPY commands on the databases themselves, minimizing the processing capacity required by AWS Glue to just the resources needed to run the Python scripts. The COPY commands load data in parallel both to and from Amazon S3, providing one of the fastest and most scalable mechanisms to transfer data from Snowflake to Amazon Redshift.

Because all heavy lifting around data processing is pushed down to the data warehouses, this solution is designed to provide a cost-optimized and highly performant mechanism to migrate a large volume of data from Snowflake to Amazon Redshift with ease.

Glue Workflow

The entire solution is packaged in an AWS CloudFormation template for simplicity of deployment and automatic provisioning of most of the required resources and permissions.

The high-level steps to implement the solution are as follows:

  1. Generate the Snowflake SQL file.
  2. Deploy the CloudFormation template to provision the required resources and permissions.
  3. Provide Snowflake access to the newly created S3 bucket.
  4. Run the AWS Glue workflow to migrate the data.

Prerequisites

Before you get started, you can optionally build the latest version of the Snowflake Connector for Python package locally and generate the wheel (.whl) package. For instructions, refer to How to build.

If you don’t provide the latest version of the package, the CloudFormation template uses a pre-built .whl file that may not be on the most current version of Snowflake Connector for Python.

By default, the CloudFormation template migrates data from all tables in the TPCH_SF1 schema of the SNOWFLAKE_SAMPLE_DATA database, which is a sample dataset provided by Snowflake when an account is created. The following stored procedure is used to dynamically generate the Snowflake COPY commands required to migrate the dataset to Amazon S3. It accepts the database name, schema name, and stage name as the parameters.

CREATE OR REPLACE PROCEDURE generate_copy(db_name VARCHAR, schema_name VARCHAR, stage_name VARCHAR)
   returns varchar not null
   language javascript
   as
   $$
var return_value = "";
var sql_query = "select table_catalog, table_schema, lower(table_name) as table_name from " + DB_NAME + ".information_schema.tables where table_schema = '" + SCHEMA_NAME + "'" ;
   var sql_statement = snowflake.createStatement(
          {
          sqlText: sql_query
          }
       );
/* Creates result set */
var result_scan = sql_statement.execute();
while (result_scan.next())  {
       return_value += "\n";
       return_value += "COPY INTO @"
       return_value += STAGE_NAME
       return_value += "/"
       return_value += result_scan.getColumnValue(3);
       return_value += "/"
       return_value += "\n";
       return_value += "FROM ";
       return_value += result_scan.getColumnValue(1);
       return_value += "." + result_scan.getColumnValue(2);
       return_value += "." + result_scan.getColumnValue(3);
       return_value += "\n";
       return_value += "FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' COMPRESSION = GZIP)";
       return_value += "\n";
       return_value += "OVERWRITE = TRUE;"
       return_value += "\n";
       }
return return_value;
$$
;
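
With the procedure in place, a call like the following generates the COPY commands for the sample dataset (assuming the external stage, created in a later step, is named unload_to_s3):

CALL generate_copy('SNOWFLAKE_SAMPLE_DATA', 'TPCH_SF1', 'unload_to_s3');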

Deploy the required resources and permissions using AWS CloudFormation

You can use the provided CloudFormation template to deploy this solution. This template automatically provisions an Amazon Redshift cluster with your desired configuration in a private subnet, maintaining a high standard of security.

  1. Sign in to the AWS Management Console, preferably as an admin user.
  2. Select your desired Region, preferably the same Region where your Snowflake instance is provisioned.
  3. Choose Launch Stack:
  4. Choose Next.
  5. For Stack name, enter a meaningful name for the stack, for example, blog-resources.

The Parameters section is divided into two subsections: Source Snowflake Infrastructure and Target Redshift Configuration.

  1. For Snowflake Unload SQL Script, keep the default, which is the S3 location (URI) of a SQL file that migrates the sample data in the TPCH_SF1 schema of the SNOWFLAKE_SAMPLE_DATA database.
  2. For Data S3 Bucket, enter a prefix for the name of the S3 bucket that is automatically provisioned to stage the Snowflake data, for example, sf-migrated-data.
  3. For Snowflake Driver, if applicable, enter the S3 location (URI) of the .whl package built earlier as a prerequisite. By default, it uses a pre-built .whl file.
  4. For Snowflake Account Name, enter your Snowflake account name.

You can use the following query in Snowflake to return your Snowflake account name:

SELECT CURRENT_ACCOUNT();
  1. For Snowflake Username, enter your user name to connect to the Snowflake account.
  2. For Snowflake Password, enter the password for the preceding user.
  3. For Snowflake Warehouse Name, enter the warehouse name for running the SQL queries.

Make sure the aforementioned user has access to the warehouse.

  1. For Snowflake Database Name, enter the database name. The default is SNOWFLAKE_SAMPLE_DATA.
  2. For Snowflake Schema Name, enter the schema name. The default is TPCH_SF1.

CFN Param Snowflake

  1. For VPC CIDR Block, enter the desired CIDR block for the VPC of the Redshift cluster. The default is 10.0.0.0/16.
  2. For Subnet 1 CIDR Block, enter the CIDR block of the first subnet. The default is 10.0.0.0/24.
  3. For Subnet 2 CIDR Block, enter the CIDR block of the second subnet. The default is 10.0.1.0/24.
  4. For Redshift Load SQL Script, keep the default, which is the S3 location (URI) of a SQL file that migrates the sample data from Amazon S3 to Redshift.

The following database view in Redshift is used to dynamically generate Redshift COPY commands required to migrate the dataset from Amazon S3. It accepts the schema name as the filter criteria.

CREATE OR REPLACE VIEW v_generate_copy
AS
SELECT
    schemaname ,
    tablename  ,
    seq        ,
    ddl
FROM
    (
        SELECT
            table_id   ,
            schemaname ,
            tablename  ,
            seq        ,
            ddl
        FROM
            (
                --COPY TABLE
                SELECT
                    c.oid::bigint  as table_id   ,
                    n.nspname      AS schemaname ,
                    c.relname      AS tablename  ,
                    0              AS seq        ,
                    'COPY ' + n.nspname + '.' + c.relname + ' FROM ' AS ddl
                FROM
                    pg_namespace AS n
                INNER JOIN
                    pg_class AS c
                ON
                    n.oid = c.relnamespace
                WHERE
                    c.relkind = 'r'
                --COPY TABLE continued                
                UNION                
                SELECT
                    c.oid::bigint as table_id   ,
                    n.nspname     AS schemaname ,
                    c.relname     AS tablename  ,
                    2             AS seq        ,
                    '''${' + '2}' + c.relname + '/'' iam_role ''${' + '1}'' gzip delimiter ''|'' EMPTYASNULL REGION ''us-east-1''' AS ddl
                FROM
                    pg_namespace AS n
                INNER JOIN
                    pg_class AS c
                ON
                    n.oid = c.relnamespace
                WHERE
                    c.relkind = 'r'
                --END SEMICOLON                
                UNION                
                SELECT
                    c.oid::bigint as table_id  ,
                    n.nspname     AS schemaname,
                    c.relname     AS tablename ,
                    600000005     AS seq       ,
                    ';'           AS ddl
                FROM
                    pg_namespace AS n
                INNER JOIN
                    pg_class AS c
                ON
                    n.oid = c.relnamespace
                WHERE
                    c.relkind = 'r' 
             )
        ORDER BY
            table_id  ,
            schemaname,
            tablename ,
            seq 
    );

SELECT ddl
FROM v_generate_copy
WHERE schemaname = 'tpch_sf1';
  1. For Redshift Database Name, enter your desired database name, for example, dev.
  2. For Number of Redshift Nodes, enter the desired compute nodes, for example, 2.
  3. For Redshift Node Type, choose the desired node type, for example, ra3.4xlarge.
  4. For Redshift Password, enter your desired password with the following constraints: it must be 8–64 characters in length, and contain at least one uppercase letter, one lowercase letter, and one number.
  5. For Redshift Port, enter the Amazon Redshift port number to connect to. The default port is 5439.

CFN Param Redshift 1 CFN Param Redshift 2

  1. Choose Next.
  2. Review and choose Create stack.

It takes around 5 minutes for the template to finish creating all resources and permissions. Most of the resources have the prefix of the stack name you specified for easy identification of the resources later. For more details on the deployed resources, see the appendix at the end of this post.

Create an IAM role and external Amazon S3 stage for Snowflake access to the data S3 bucket

In order for Snowflake to access the TargetDataS3Bucket created earlier by the CloudFormation template, you must create an AWS Identity and Access Management (IAM) role and an external Amazon S3 stage for Snowflake access to the S3 bucket. For instructions, refer to Configuring Secure Access to Amazon S3.

When you create an external stage in Snowflake, use the value for TargetDataS3Bucket on the Outputs tab of your deployed CloudFormation stack for the Amazon S3 URL of your stage.

CF Output

Make sure to name the external stage unload_to_s3 if you’re migrating the sample data using the default scripts provided in the CloudFormation template.

Convert Snowflake tables to Amazon Redshift

You can simply run the following DDL statements to create TPCH_SF1 schema objects in Amazon Redshift. You can also use AWS Schema Conversion Tool (AWS SCT) to convert Snowflake custom objects to Amazon Redshift. For instructions on converting your schema, refer to Accelerate Snowflake to Amazon Redshift migration using AWS Schema Conversion Tool.

CREATE SCHEMA TPCH_SF1;
SET SEARCH_PATH to TPCH_SF1;
CREATE TABLE customer (
  c_custkey int8 not null ,
  c_name varchar(25) not null,
  c_address varchar(40) not null,
  c_nationkey int4 not null,
  c_phone char(15) not null,
  c_acctbal numeric(12,2) not null,
  c_mktsegment char(10) not null,
  c_comment varchar(117) not null,
  Primary Key(C_CUSTKEY)
) ;

CREATE TABLE lineitem (
  l_orderkey int8 not null ,
  l_partkey int8 not null,
  l_suppkey int4 not null,
  l_linenumber int4 not null,
  l_quantity numeric(12,2) not null,
  l_extendedprice numeric(12,2) not null,
  l_discount numeric(12,2) not null,
  l_tax numeric(12,2) not null,
  l_returnflag char(1) not null,
  l_linestatus char(1) not null,
  l_shipdate date not null ,
  l_commitdate date not null,
  l_receiptdate date not null,
  l_shipinstruct char(25) not null,
  l_shipmode char(10) not null,
  l_comment varchar(44) not null,
  Primary Key(L_ORDERKEY, L_LINENUMBER)
)  ;

CREATE TABLE nation (
  n_nationkey int4 not null,
  n_name char(25) not null ,
  n_regionkey int4 not null,
  n_comment varchar(152) not null,
  Primary Key(N_NATIONKEY)                                
) ;

CREATE TABLE orders (
  o_orderkey int8 not null,
  o_custkey int8 not null,
  o_orderstatus char(1) not null,
  o_totalprice numeric(12,2) not null,
  o_orderdate date not null,
  o_orderpriority char(15) not null,
  o_clerk char(15) not null,
  o_shippriority int4 not null,
  o_comment varchar(79) not null,
  Primary Key(O_ORDERKEY)
) ;

CREATE TABLE part (
  p_partkey int8 not null ,
  p_name varchar(55) not null,
  p_mfgr char(25) not null,
  p_brand char(10) not null,
  p_type varchar(25) not null,
  p_size int4 not null,
  p_container char(10) not null,
  p_retailprice numeric(12,2) not null,
  p_comment varchar(23) not null,
  PRIMARY KEY (P_PARTKEY)
) ;

CREATE TABLE partsupp (
  ps_partkey int8 not null,
  ps_suppkey int4 not null,
  ps_availqty int4 not null,
  ps_supplycost numeric(12,2) not null,
  ps_comment varchar(199) not null,
  Primary Key(PS_PARTKEY, PS_SUPPKEY)
) ;

CREATE TABLE region (
  r_regionkey int4 not null,
  r_name char(25) not null ,
  r_comment varchar(152) not null,
  Primary Key(R_REGIONKEY)                             
) ;

CREATE TABLE supplier (
  s_suppkey int4 not null,
  s_name char(25) not null,
  s_address varchar(40) not null,
  s_nationkey int4 not null,
  s_phone char(15) not null,
  s_acctbal numeric(12,2) not null,
  s_comment varchar(101) not null,
  Primary Key(S_SUPPKEY)
);

Run an AWS Glue workflow for data migration

When you’re ready to start the data migration, complete the following steps:

  1. On the AWS Glue console, choose Workflows in the navigation pane.
  2. Select the workflow to run (<stack name>snowflake-to-redshift-migration).
  3. On the Actions menu, choose Run.
    Glue Workflow Run
  4. To check the status of the workflow, choose the workflow and on the History tab, select the Run ID and choose View run details.
    Glue Workflow Status
  5. When the workflow is complete, navigate to the Amazon Redshift console and launch the Amazon Redshift query editor v2 to verify the successful migration of data.
  6. Run the following query in Amazon Redshift to get row counts of all tables migrated from Snowflake to Amazon Redshift. Make sure to adjust the table_schema value accordingly if you’re not migrating the sample data.
SELECT tab.table_schema,
       tab.table_name,
       nvl(tinf.tbl_rows,0) tbl_rows,
       nvl(tinf.size,0) size
FROM svv_tables tab
LEFT JOIN svv_table_info tinf 
          on tab.table_schema = tinf.schema 
          and tab.table_name = tinf."table"
WHERE tab.table_type = 'BASE TABLE'
      and tab.table_schema in ('tpch_sf1')
ORDER BY tbl_rows;

Redshift Editor

  1. Run the following query in Snowflake to compare and validate the data:
USE DATABASE snowflake_sample_data;
SELECT  TABLE_CATALOG,
        TABLE_SCHEMA,
        TABLE_NAME,
        ROW_COUNT,
        BYTES AS SIZE,
        COMMENT
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'TPCH_SF1'
ORDER BY ROW_COUNT;

Snowflake Editor

Clean up

To avoid incurring future charges, delete the resources you created as part of the CloudFormation stack by navigating to the AWS CloudFormation console, selecting the stack blog-resources, and choosing Delete.

Conclusion

In this post, we discussed how to perform an efficient, fast, and cost-effective migration from Snowflake to Amazon Redshift. Migrations from one data warehouse environment to another can typically be very time-consuming and resource-intensive; this solution uses the power of cloud-based compute by pushing down the processing to the respective warehouses. Orchestrating this migration with the AWS Glue Python shell provides additional cost optimization.

With this solution, you can facilitate your migration from Snowflake to Amazon Redshift. If you’re interested in further exploring the potential of using Amazon Redshift, please reach out to your AWS Account Team for a proof of concept.

Appendix: Resources deployed by AWS CloudFormation

The CloudFormation stack deploys the following resources in your AWS account:

  • Networking resources – Amazon Virtual Private Cloud (Amazon VPC), subnets, ACL, and security group.
  • Amazon S3 bucket – This is referenced as TargetDataS3Bucket on the Outputs tab of the CloudFormation stack. This bucket holds the data being migrated from Snowflake to Amazon Redshift.
  • AWS Secrets Manager secrets – Two secrets in AWS Secrets Manager store credentials for Snowflake and Amazon Redshift.
  • VPC endpoints – Two VPC endpoints are deployed to establish private connections from VPC resources like AWS Glue to services that run outside of the VPC, such as Secrets Manager and Amazon S3.
  • IAM roles – IAM roles for AWS Glue, Lambda, and Amazon Redshift. If the CloudFormation template is to be deployed in a production environment, you need to adjust the IAM policies so they’re not as permissive as presented in this post (which were set for simplicity and demonstration). Particularly, AWS Glue and Amazon Redshift don’t require all the actions granted in the *FullAccess policies, which would be considered overly permissive.
  • Amazon Redshift cluster – An Amazon Redshift cluster is created in a private subnet, which isn’t publicly accessible.
  • AWS Glue connection – The connection for Amazon Redshift makes sure that the AWS Glue job runs within the same VPC as Amazon Redshift. This also ensures that AWS Glue can access the Amazon Redshift cluster in a private subnet.
  • AWS Glue jobs – Two AWS Glue Python shell jobs are created:
    • <stack name>-glue-snowflake-unload – The first job runs the SQL scripts in Snowflake to copy data from the source database to Amazon S3. The Python script is available in S3. The Snowflake job accepts two parameters:
      • SQLSCRIPT – The Amazon S3 location of the SQL script to run in Snowflake to migrate data to Amazon S3. This is referenced as the Snowflake Unload SQL Script parameter in the input section of the CloudFormation template.
      • SECRET – The Secrets Manager ARN that stores Snowflake connection details.
    • <stack name>-glue-redshift-load – The second job runs another SQL script in Amazon Redshift to copy data from Amazon S3 to the target Amazon Redshift database. The Python script link is available in S3. The Amazon Redshift job accepts three parameters:
      • SQLSCRIPT – The Amazon S3 location of the SQL script to run in Amazon Redshift to migrate data from Amazon S3. If you provide custom SQL script to migrate the Snowflake data to Amazon S3 (as mentioned in the prerequisites), the file location is referenced as LoadFileLocation on the Outputs tab of the CloudFormation stack.
      • SECRET – The Secrets Manager ARN that stores Amazon Redshift connection details.
      • PARAMS – This includes any additional parameters required for the SQL script, including the Amazon Redshift IAM role used in the COPY commands and the S3 bucket staging the Snowflake data. Multiple parameter values can be provided separated by a comma.
  • AWS Glue workflow – The orchestration of Snowflake and Amazon Redshift AWS Glue Python shell jobs is managed via an AWS Glue workflow. The workflow <stack name>snowflake-to-redshift-migration runs later for actual migration of data.

About the Authors

Raks Khare is an Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers architect data analytics solutions at scale on the AWS platform.

Julia Beck is an Analytics Specialist Solutions Architect at AWS. She supports customers in validating analytics solutions by architecting proof of concept workloads designed to meet their specific needs.

Create cross-account, custom Amazon Managed Grafana dashboards for Amazon Redshift

Post Syndicated from Tahir Aziz original https://aws.amazon.com/blogs/big-data/create-cross-account-custom-amazon-managed-grafana-dashboards-for-amazon-redshift/

Amazon Managed Grafana recently announced a new data source plugin for Amazon Redshift, enabling you to query, visualize, and alert on your Amazon Redshift data from Amazon Managed Grafana workspaces. With the new Amazon Redshift data source, you can now create dashboards and alerts in your Amazon Managed Grafana workspaces to analyze your structured and semi-structured data across data warehouses, operational databases, and data lakes. The Amazon Redshift plugin also comes with default out-of-the-box dashboards that make it simple to get started monitoring the health and performance of your Amazon Redshift clusters.

In this post, we present a step-by-step tutorial on using the Amazon Redshift data source plugin to visualize metrics from your Amazon Redshift clusters hosted in different AWS accounts using AWS Single Sign-On (AWS SSO), as well as how to create custom dashboards visualizing data from Amazon Redshift system tables in Amazon Managed Grafana.

Solution overview

Let’s look at the AWS services that we use in our tutorial:

Amazon Managed Grafana is a fully managed service for open-source Grafana developed in collaboration with Grafana Labs. Grafana is a popular open-source analytics platform that enables you to query, visualize, alert on, and understand your operational metrics. You can create, explore, and share observability dashboards with your team, and spend less time managing your Grafana infrastructure and more time improving the health, performance, and availability of your applications. Amazon Managed Grafana natively integrates with AWS services (like Amazon Redshift) so you can securely add, query, visualize, and analyze operational and performance data across multiple accounts and Regions for the underlying AWS service.

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers. Today, tens of thousands of AWS customers from Fortune 500 companies, startups, and everything in between use Amazon Redshift to run mission-critical business intelligence (BI) dashboards, analyze real-time streaming data, and run predictive analytics jobs. With the constant increase in generated data, Amazon Redshift customers continue to achieve successes in delivering better service to their end-users, improving their products, and running an efficient and effective business.

AWS SSO is where you create or connect your workforce identities in AWS and manage access centrally across your AWS organization. You can choose to manage access just to your AWS accounts or cloud applications. You can create user identities directly in AWS SSO, or you can bring them from your Microsoft Active Directory or a standards-based identity provider, such as Okta Universal Directory or Azure AD. With AWS SSO, you get a unified administration experience to define, customize, and assign fine-grained access. Your workforce users get a user portal to access all their assigned AWS accounts, Amazon Elastic Compute Cloud (Amazon EC2) Windows instances, or cloud applications. AWS SSO can be flexibly configured to run alongside or replace AWS account access management via AWS Identity and Access Management (IAM).

The following diagram illustrates the solution architecture.

The solution includes the following components:

  • Captured metrics from the Amazon Redshift clusters in the development and production AWS accounts.
  • Amazon Managed Grafana, with the Amazon Redshift data source plugin added to it. Amazon Managed Grafana communicates with the Amazon Redshift cluster via the Amazon Redshift Data API (see the sketch after this list).
  • The Grafana web UI, with the Amazon Redshift dashboard using the Amazon Redshift cluster as the data source. The web UI communicates with Amazon Managed Grafana via an HTTP API.
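
Under the hood, queries from the plugin are submitted through the Data API rather than over a persistent database connection. The following is a minimal boto3 sketch of the kind of call involved; the cluster identifier and database name are placeholders, and the database user matches the one configured later in this post.

    import time

    import boto3

    client = boto3.client("redshift-data", region_name="us-east-1")

    # Submit a query using temporary credentials, as the data source plugin does.
    stmt = client.execute_statement(
        ClusterIdentifier="production-cluster",  # placeholder
        Database="dev",                          # placeholder
        DbUser="redshift_data_api_user",
        Sql="SELECT COUNT(*) FROM stv_sessions;",
    )

    # Poll until the statement finishes, then fetch the result set.
    while client.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
        time.sleep(1)
    print(client.get_statement_result(Id=stmt["Id"])["Records"])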

We walk you through the following steps in this post:

  1. Create a user in AWS SSO for Amazon Managed Grafana workspace access.
  2. Configure an Amazon Managed Grafana workspace.
  3. Set up two Amazon Redshift clusters as the data sources in Grafana.
  4. Import the Amazon Redshift dashboard supplied with the data source.
  5. Create a custom Amazon Redshift dashboard to visualize metrics from the Amazon Redshift clusters.

Prerequisites

To follow along with this post, you should have the following prerequisites:

Set up AWS SSO

In this section, we set up AWS SSO and register users.

In addition to AWS SSO integration, Amazon Managed Grafana also supports direct SAML integration with SAML 2.0 identity providers.

  1. If you don’t have AWS SSO enabled, open the AWS SSO console and choose Enable AWS SSO.
  2. After AWS SSO is enabled, choose Users in the navigation pane.
  3. Choose Add user.
  4. Enter the user details and choose Next: Groups.
  5. Choose Add user.

Set up your Amazon Managed Grafana workspace

In this section, we demonstrate how to set up a Grafana workspace using Amazon Managed Grafana. We set up authentication using AWS SSO, register data sources, and add administrative users for the workspace.

  1. On the Amazon Managed Grafana console, choose Create workspace.
  2. For Workspace name, enter a suitable name.
  3. Choose Next.
  4. For Authentication access, select AWS Single Sign-On.
  5. For Permission type, select Service managed.
  6. Choose Next.
  7. Select Current account.
  8. For Data sources, select Amazon Redshift.
  9. Choose Next.
  10. Review the details and choose Create workspace.

    Now we assign a user to the workspace.
  11. On the Workspaces page, choose the workspace you created.
  12. Note the IAM role attached to your workspace.
  13. Choose Assign new user or group.
  14. Select the user to assign to the workspace.
  15. Choose Assign users and groups.

    For the purposes of this post, we need an admin user.
  16. To change the permissions of the user you just assigned, select the user name and choose Make admin.

For the cross-account setup, we use two Amazon Redshift clusters: production and development. In the next section, we configure IAM roles in both the production and development accounts so that the Grafana in the production account is able to connect to the Amazon Redshift clusters in the production account as well as in the development account.

Configure an IAM role for the development account

In this section, we set up the IAM role in the AWS account hosting the development environment. This role is assumed by the Amazon Managed Grafana service from the production AWS account to establish the connection between Amazon Managed Grafana and Amazon Redshift cluster in the development account.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose Create role.
  3. Select Custom trust policy.
  4. Use the following policy code (update the account number for your production account and the Grafana service role attached to the workspace):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::<production-account-number>:role/service-role/AmazonGrafanaServiceRole-xxxxxxxxxx",
                    "Service": "grafana.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

  5. Choose Next.
  6. Attach the managed IAM policy AmazonGrafanaRedshiftAccess to this role. For instructions, refer to Modifying a role permissions policy (console).
  7. Provide a role name, description, and tags (optional), and create the role.

Configure an IAM role for the production account

Next, we configure the IAM role created by the Amazon Managed Grafana service in order to establish the connection between Amazon Managed Grafana and the Amazon Redshift cluster in the production account.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Search for the AmazonGrafanaServiceRole-xxxxxxx role attached to your Grafana workspace.
  3. Create an inline IAM policy and attach it to this role with the following policy code:
    {
    	"Version": "2012-10-17",
    	"Statement": [{
    		"Sid": "VisualEditor0",
    		"Effect": "Allow",
    		"Action": [
    			"sts:AssumeRole"
    		],
    		"Resource":"arn:aws:iam::<dev-account-number>:role/<DevAccountRoleName>"
    	}]
    }

  4. Provide a policy name and create the policy.

Import the default dashboard

In this section, we connect to the Amazon Redshift clusters in the production and development accounts from the Amazon Managed Grafana console and import the default dashboard.

  1. On the Amazon Managed Grafana console, choose Workspaces in the navigation pane.
  2. Choose the workspace you just created (authenticate and sign in if needed).
  3. In the navigation pane, choose Settings and on the Configuration menu, choose Data sources.
  4. Choose Add data source.
  5. Search for and choose Amazon Redshift.
  6. On the Settings tab, for Authentication provider, choose Workspace IAM role.
  7. For Default Region, choose us-east-1.
  8. Under Redshift Details, choose Temporary credentials.
  9. Enter the cluster identifier and database name for your Amazon Redshift cluster in the production account.
  10. For Database user, enter redshift_data_api_user.
  11. Choose Save & test.
    When the connection is successfully established, a message appears that the data source is working. You can now move on to the next step.
  12. Repeat these steps to add another data source to connect to the Amazon Redshift cluster in the development account.
  13. On the Settings tab, for Authentication provider, choose Workspace IAM role.
  14. Enter the ARN of the IAM role you created earlier (arn:aws:iam::dev-account-number:role/cross-account-role-name) as the role for the workspace to assume.
  15. For Default Region, choose us-east-1.
  16. Under Redshift Details, choose Temporary credentials.
  17. Enter the cluster identifier and database name for your Amazon Redshift cluster in the development account.
  18. For Database user, enter redshift_data_api_user.
  19. Choose Save & test.
    When the connection is successfully established, a message appears that the data source is working.
  20. On the Dashboards tab, choose Import next to Amazon Redshift.

On the dashboard page, you can switch the data source between your production and development clusters using a drop-down menu.

The default Amazon Redshift dashboard, as shown in the following screenshot, makes it easy to monitor the overall health of the cluster by showing different cluster metrics, like total storage capacity used, storage utilization per node, open and closed connections, WLM mode, AQUA status, and more.

Additionally, the default dashboard displays several table-level metrics such as size of the tables, total number of rows, unsorted rows percentage, and more, in the Schema Insights section.

Add a custom dashboard for Amazon Redshift

The Amazon Redshift data source plugin allows you to query and visualize Amazon Redshift data metrics from within Amazon Managed Grafana. It’s preconfigured with general metrics. To add a custom metric from the Amazon Redshift cluster, complete the following steps:

  1. On the Amazon Managed Grafana console, choose All workspaces in the navigation pane.
  2. Choose the Grafana workspace URL for the workspace you want to modify.
  3. Choose Sign in with AWS SSO and provide your credentials.
  4. On the Amazon Managed Grafana workspace page, choose the plus sign and on the Create menu, choose Dashboard.
  5. Choose Add a new panel.
  6. Add the following custom SQL to get the data from the Amazon Redshift cluster:
    SELECT p.usename,
           COUNT(*) AS num_query,
           SUM(DATEDIFF('second', starttime, endtime)) AS total_execution_seconds
    FROM stl_query s
    INNER JOIN pg_user p ON s.userid = p.usesysid
    WHERE starttime BETWEEN $__timeFrom() AND $__timeTo()
      AND s.userid > 1
    GROUP BY 1

    For this post, we use the default settings, but you can control and link the time range using the $__timeFrom() and $__timeTo() macros; they're bound to the time range control of your dashboard. For more information and details about the supported expressions, see Query Redshift data.

  7. To inspect the data, choose Query inspector to test the custom query outcome.
    Amazon Managed Grafana supports a variety of visualizations. For this post, we create a bar chart.
  8. On the Visualizations tab in the right pane, choose Bar chart.
  9. Enter a title and description for the custom chart, and leave all other properties as default.
    For more information about supported properties, see Visualizations.
  10. Choose Save.
  11. In the pop-up window, enter a dashboard name and choose Save.

    A new dashboard is created with a custom metric.
  12. To add more metrics, choose the Add panel icon, choose Add a new panel, and repeat the previous steps.

Clean up

To avoid incurring future charges, complete the following steps:

  1. Delete the Amazon Managed Grafana workspace.
  2. If you created a new Amazon Redshift cluster for this demonstration, delete the cluster.

Conclusion

In this post, we demonstrated how to use AWS SSO and Amazon Managed Grafana to create an operational view to monitor the health and performance of Amazon Redshift clusters. We also showed how to extend the default dashboard by adding custom, insightful visualizations to your Grafana workspace.

We look forward to hearing from you about your experience. If you have questions or suggestions, please leave a comment.


About the Authors

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Shawn Sachdev is a Sr. Analytics Specialist Solutions Architect at AWS. He works with customers and provides guidance to help them innovate and build well-architected and high-performance data warehouses and implement analytics at scale on the AWS platform. Before AWS, he worked in several analytics and system engineering roles. Outside of work, he loves watching sports, and is an avid foodie and craft beer enthusiast.

Ekta Ahuja is an Analytics Specialist Solutions Architect at AWS. She is passionate about helping customers build scalable and robust data and analytics solutions. Before AWS, she worked in several different data engineering and analytics roles. Outside of work, she enjoys baking, traveling, and board games.

Resize Amazon Redshift from DC2 to RA3 with minimal or no downtime

Post Syndicated from Soujanya Konka original https://aws.amazon.com/blogs/big-data/resize-amazon-redshift-from-dc2-to-ra3-with-minimal-or-no-downtime/

Amazon Redshift is a popular cloud data warehouse that allows you to process exabytes of data across your data warehouse, operational database, and data lake using standard SQL. Amazon Redshift offers different node types like DC2 (dense compute) and RA3, which you can use for your different workloads and use cases. For more information about the benefits of migrating from DS2 to RA3, refer to Scale your cloud data warehouse and reduce costs with the new Amazon Redshift RA3 nodes with managed storage and Amazon Redshift Benchmarking: Comparison of RA3 vs. DS2 Instance Types.

Many customers use DC2 nodes for their compute-intensive workloads. As your workload grows, it's natural to separate compute from storage so each can be right-sized to your needs. RA3 nodes with managed storage enable you to optimize your data warehouse by scaling and paying for compute and managed storage independently. Amazon Redshift managed storage uses large, high-performance SSDs in each RA3 node for fast local storage and Amazon S3 for longer-term durable storage. If the data in a node grows beyond the size of the large local SSDs, Amazon Redshift managed storage automatically offloads that data to Amazon S3. RA3 nodes keep track of the frequency of access for each data block and cache the hottest blocks. If the blocks aren't cached, the large networking bandwidth and precise storing techniques return the data in under a second. Also, if you're looking for features like cross-cluster data sharing and cross-Availability Zone cluster relocation, these are a few more reasons for migrating to RA3. Many customers on DC2 have benefitted from migrating to RA3 to serve their growing performance requirements and business use cases.

As a first step of the migration, we always recommend determining the correct load of your system and the number of RA3 nodes that will meet your workload and give you the best cost-performance benefit. For this evaluation, you can use the Simple Replay tool to conduct a what-if analysis and evaluate how your workload performs in different scenarios. For example, you can use the tool to benchmark your actual workload on a new instance type like RA3, evaluate a new feature, or assess different cluster configurations. To choose the right cluster type, you can compare different node types for your workload and choose the right configuration of RA3 with the Simple Replay utility.

Once you know the cluster type and node count, the next question is how to migrate your current workload to RA3 with minimal downtime and without disrupting your current workload. In this post, we describe an approach to do this.

Resizing an Amazon Redshift cluster

There are three ways to resize or migrate an Amazon Redshift cluster from DC2 to RA3:

  • Elastic resize – If it's available as an option, use elastic resize to change the node type, number of nodes, or both (a minimal sketch of invoking it programmatically follows this list). Note that when you only change the number of nodes, the queries are temporarily paused and connections are kept open. An elastic resize typically takes 10–15 minutes. During a resize operation, the cluster is read-only.
  • Classic resize – Use classic resize to change the node type, number of nodes, or both. Choose this option when you’re resizing to a configuration that isn’t available through elastic resize. A resize operation can take 2 hours or more, or last up to several days depending on your data size. During the resize operation, the source cluster is read-only.
  • Snapshot, restore, and resize – To keep your cluster available during a classic resize, make a copy of the existing cluster, then resize the new cluster. If data is written to the source cluster after a snapshot is taken, the data must be manually copied over after the migration is complete.
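
The following is a minimal boto3 sketch of requesting an elastic resize; the cluster identifier and target configuration are placeholders, and elastic resize succeeds only if the target configuration is available for your cluster.

    import boto3

    redshift = boto3.client("redshift", region_name="us-east-1")

    # Request an elastic resize (Classic=False) to a new node type and count.
    redshift.resize_cluster(
        ClusterIdentifier="my-dc2-cluster",  # placeholder
        NodeType="ra3.4xlarge",
        NumberOfNodes=4,
        Classic=False,
    )

    # Track progress; the cluster stays read-only while the resize runs.
    print(redshift.describe_resize(ClusterIdentifier="my-dc2-cluster")["Status"])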

Checkpoints for resize

When a cluster is resized using elastic resize with the same node type, the operation doesn't create a new cluster, so it completes quickly. In other cases, one or more of the following challenges can delay the resize:

  • Data volumes – The time required to complete a classic resize or a snapshot and restore operation might vary, depending on factors like the workload on the source cluster, the number and volume of tables being transformed, how evenly data is distributed across the compute nodes and slices, and the node configuration in the source and target clusters.
  • Snapshots – Automated snapshots are automatically deleted when their retention period expires, when you disable automated snapshots, or when you delete a cluster. If you want to keep an automated snapshot, you can copy it to a manual snapshot. You can take a manual snapshot of the cluster before the migration, which is used for resize operations, but it won't include data written to the source cluster after the snapshot was captured.
  • Cluster unavailable during resize – It’s critical to know roughly how long the resize will take. To do so, you can try creating a cluster from the snapshot in a test account. However, this only gives a ballpark idea because resize times can vary, especially if you intend to query your cluster during the resize. If the cluster is live almost all the time with minimal or zero non-business hours, a resize can be a challenge because the cluster can’t upsert live data and serve read requests on this data during this window.
  • Cluster endpoint retention – Elastic resize and cluster resize allow you to change the node type, number of nodes, or both, but the endpoint is retained. With snapshot resize, a new cluster endpoint is created, which may require a change in your application to replace the endpoint.
  • Reconciliation – Validate the target cluster data against the source to make sure the migration completed without data loss and to ensure data quality. Reconciliation at the table level isn't sufficient; you also need to ensure records have been copied from the source. You can run a matching record count check followed by data validation using checksums for accuracy of data.

Solution overview

The steps to prepare for migration are as follows:

  1. Take a snapshot of the existing production Amazon Redshift cluster running on DC2.
  2. Create another Amazon Simple Storage Service (Amazon S3) bucket, where AWS Glue writes the curated data in parallel.
  3. Use the snapshot to create an RA3 cluster (a minimal sketch of steps 1 and 3 follows this list).
  4. Configure AWS Database Migration Service (AWS DMS) to load data from the new S3 bucket to the RA3 cluster.
  5. After you confirm that the data is synced between the two clusters (DC and RA3) and all other downstream applications, stop the DC cluster and change the endpoint of your dependent downstream application to the newly created RA3 cluster.
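
The following is a minimal boto3 sketch of steps 1 and 3, taking a manual snapshot of the DC2 cluster and restoring it to an RA3 cluster; all identifiers and the target configuration are placeholders.

    import boto3

    redshift = boto3.client("redshift", region_name="us-east-1")

    # Step 1: take a manual snapshot of the production DC2 cluster.
    redshift.create_cluster_snapshot(
        SnapshotIdentifier="dc2-migration-snapshot",  # placeholder
        ClusterIdentifier="prod-dc2-cluster",         # placeholder
    )
    redshift.get_waiter("snapshot_available").wait(SnapshotIdentifier="dc2-migration-snapshot")

    # Step 3: restore the snapshot into a new RA3 cluster.
    redshift.restore_from_cluster_snapshot(
        ClusterIdentifier="prod-ra3-cluster",         # placeholder
        SnapshotIdentifier="dc2-migration-snapshot",
        NodeType="ra3.4xlarge",                       # target RA3 configuration
        NumberOfNodes=4,
    )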

Following is the current architecture depicting a live workload.

In this solution, data comes from three source systems and is written into a raw S3 bucket:

  • Change data capture (CDC) from an RDS instance via AWS DMS (1 in the preceding diagram)
  • Events captured via an external API (2)
  • CSV files from an external source copied to the raw bucket (3)

These sources don’t have a pattern or an interval of pushing new data.

Every few minutes, the ingested data is picked up by an S3 event trigger to run an AWS Glue workflow (4 in the preceding diagram). It provides an orchestration layer to manage and run jobs and crawlers. This workflow includes a crawler (5) that updates the metadata schema and partitions of the dataset to the AWS Glue Data Catalog. Then the crawler triggers an AWS Glue job that writes the curated data to the S3 curated bucket. From there, another AWS Glue job uploads data into Amazon Redshift (6).

In this scenario, if your workload is critical and you can’t afford a long downtime, then you need to plan your migration accordingly.

Dual write and transient data curation pipeline

As a first step of the migration, you need a parallel data curation pipeline alongside the existing AWS Glue job, which writes the data into the curated S3 bucket. Create another S3 bucket, name it migrated-curated-bucket, and modify the AWS Glue transform job, or replicate the transform job, so that data is also written to the new reserve S3 bucket in parallel.

In this scenario, live data ingestion occurs every 30 minutes. When an iteration of the extract, transform, and load (ETL) job is complete, this triggers a manual snapshot of the Amazon Redshift cluster. After the snapshot is captured, a new Amazon Redshift cluster is created using that snapshot. Cluster creation time can vary depending on the snapshot volume.

If snapshot creation takes more than 30 minutes, then the ETL job should be stopped, and resume after the snapshot creation is complete. For example, if the ETL job is triggered at 8:00 AM and finishes at 8:10 AM, then snapshot creation starts at 8:10 AM. If it finishes by 8:30 AM (the next ETL job will run at 8:30 AM as per the half-hour interval), then the ETL process continues according to the schedule. Otherwise, the job stops, and resumes after the snapshot completion.

Now we use the snapshot to launch a new RA3 Amazon Redshift cluster. The process doesn't pause the existing ETL pipeline; rather, it starts writing curated data in parallel to the reserve S3 bucket. The following diagram illustrates this updated workflow.

At this point, the existing cluster is still live and continues to process the live workload. Even if creation of the Amazon Redshift cluster takes time (owing to the huge volume of data), you should still be covered. The curated data in the S3 bucket acts as a staging reserve, and this data should be loaded into the RA3 cluster after the cluster is launched.

Backfill the new RA3 cluster with missing data

After the RA3 cluster has been launched, you need to play back the captured live data from the reserve S3 bucket to the newly created cluster. Playback covers only the period from the snapshot capture to the current timestamp. With this process, you bring the RA3 cluster in sync with the existing live DC2 cluster.

You need to configure an AWS DMS migration task with the reserve S3 bucket as the source endpoint and the newly created RA3 cluster as the target endpoint, as in the sketch below.
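
The following is a minimal boto3 sketch of creating such a task, assuming the S3 source endpoint, Amazon Redshift target endpoint, and replication instance already exist; all ARNs are placeholders.

    import json

    import boto3

    dms = boto3.client("dms", region_name="us-east-1")

    # Replicate everything; narrow the selection rules as needed.
    table_mappings = {
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }

    dms.create_replication_task(
        ReplicationTaskIdentifier="reserve-bucket-to-ra3",
        SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:reserve-s3",   # placeholder
        TargetEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:ra3-target",   # placeholder
        ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:migration",    # placeholder
        MigrationType="full-load-and-cdc",
        TableMappings=json.dumps(table_mappings),
    )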

AWS DMS captures ongoing changes from the source data store and applies them to the target. This process is called ongoing replication or change data capture (CDC). AWS DMS uses this process when replicating ongoing changes from a source data store, collecting changes from the database logs using the database engine's native API. The following diagram illustrates this workflow.

Reconciliation and cutover

Data reconciliation is the process of verification of data between source and target. In this process, target data is compared with source data to ensure that the data is transferred completely without any alterations. To ensure reliability in the pipeline and the data processed, you should create an end-to-end reconciliation report. This report verifies the percentage of matching tables, columns, and data records. It also identifies missing records, missing values, incorrect values, badly formatted values, and duplicated records.

You can define the reconciliation process to check whether both clusters are running in sync. For that, you can create simple Python scripts or shell scripts to query the source and target clusters, fetch the results, and compare them, as in the sketch below.
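
The following is a minimal sketch of such a check using the Amazon Redshift Data API via boto3; the cluster identifiers, database, user, and table list are placeholders, and a production report would also validate checksums, not just counts.

    import time

    import boto3

    client = boto3.client("redshift-data", region_name="us-east-1")

    def count_rows(cluster_id, table):
        # Run SELECT COUNT(*) on one cluster and return the count.
        stmt = client.execute_statement(
            ClusterIdentifier=cluster_id,
            Database="dev",   # placeholder
            DbUser="admin",   # placeholder
            Sql=f"SELECT COUNT(*) FROM {table};",
        )
        while client.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
            time.sleep(1)
        return client.get_statement_result(Id=stmt["Id"])["Records"][0][0]["longValue"]

    # Compare row counts table by table between the DC2 and RA3 clusters.
    for table in ["public.orders", "public.events"]:   # placeholder table list
        src = count_rows("prod-dc2-cluster", table)    # placeholder cluster IDs
        tgt = count_rows("prod-ra3-cluster", table)
        print(f"{table}: source={src} target={tgt} {'OK' if src == tgt else 'MISMATCH'}")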

Cutover is the final step of migration, and involves switching the existing cluster with the newly launched cluster. At this point, the clusters are running in parallel. Next, you validate that the downstream data consumption flows are up to date. Verify the reconciliation metrics from the DC2 and RA3 clusters such that table updates are in sync.

You can keep the dual write while you switch from the migration data pipeline. If you discover any issues after cutting over, you can switch back to the old data pipeline, which remains the source of truth until cutover. In this case, cutover involves updating the DC2 cluster endpoint to the new RA3 cluster endpoint in the application. Make sure to identify a relatively quiet window during the day to update the endpoint. To keep the same endpoint for your applications and users, you can rename the new RA3 cluster with the same name as the original DC2 cluster. To rename the cluster, modify it on the Amazon Redshift console or use the ModifyCluster API operation. For more information, see Renaming clusters or the ModifyCluster API operation in the Amazon Redshift API Reference.
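
The rename itself is a pair of ModifyCluster calls. The following is a minimal boto3 sketch, assuming the old DC2 cluster is renamed first so its name becomes free; all identifiers are placeholders.

    import boto3

    redshift = boto3.client("redshift", region_name="us-east-1")

    # Free up the original name by renaming the old DC2 cluster.
    redshift.modify_cluster(
        ClusterIdentifier="prod-cluster",             # placeholder original name
        NewClusterIdentifier="prod-cluster-dc2-old",
    )

    # Once the first rename completes, give the RA3 cluster the original name
    # so application endpoints keep working unchanged.
    redshift.modify_cluster(
        ClusterIdentifier="prod-ra3-cluster",         # placeholder
        NewClusterIdentifier="prod-cluster",
    )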

Up to this point, AWS DMS continues to update the RA3 cluster. After you cut over to RA3, the DC2 cluster is no longer live, and you can stop the AWS DMS replication task to RA3. Retain the last snapshot, then delete the reserve S3 bucket and the AWS DMS resources used for the RA3 load.

Conclusion

In this post, we presented an approach to migrate an existing Amazon Redshift cluster with minimal or no data loss, which also allows the cluster to serve both read and write operations during the resize window. Elastic resize is a quick way to resize your cluster while maintaining the same number of slices in the target cluster. Slice mapping reduces the time required to resize a cluster. If you choose a resize configuration that isn't available through elastic resize, you can choose classic resize or perform a snapshot, restore, and resize.

To learn more about what’s new with RA3 instances, refer to Amazon Redshift RA3 instances with managed storage. Amazon Redshift delivers better price performance and at the same time helps you keep your costs predictable. Amazon Redshift Serverless automatically provisions and scales the data warehouse capacity to deliver high performance for demanding and unpredictable workloads, and you pay only for the resources you use. This provides greater flexibility to choose either or both based on custom requirements. After you’ve made your choice, try the hands-on labs on Amazon Redshift.


About the Authors

Soujanya Konka is a Solutions Architect and Analytics specialist at AWS, focused on helping customers build their ideas on the cloud. She has expertise in the design and implementation of business information systems and data warehousing solutions. Before joining AWS, Soujanya had stints with companies such as HSBC and Cognizant.

Dipayan Sarkar is a Specialist Solutions Architect for Analytics at AWS, where he helps customers modernize their data platform using AWS analytics services. He works with customers to design and build analytics solutions that enable businesses to make data-driven decisions.

How GE Proficy Manufacturing Data Cloud replatformed to improve TCO, data SLA, and performance

Post Syndicated from Jyothin Madari original https://aws.amazon.com/blogs/big-data/how-ge-proficy-manufacturing-data-cloud-replatformed-to-improve-tco-data-sla-and-performance/

This post is co-authored by Jyothin Madari, Madhusudhan Muppagowni, and Ayush Srivastava from GE.

GE Proficy Manufacturing Data Cloud (MDC), part of GE Digital's Manufacturing Execution Systems (MES) suite of solutions, allows GE Digital's customers to quickly and easily increase the value they derive from the MES by reliably bringing enterprise-wide manufacturing data into the cloud and transforming it into a structured dataset for advanced analytics and deeper insights into their manufacturing processes.

In this post, we share how MDC modernized the hybrid cloud strategy by replatforming. This solution improved scalability, their data availability Service Level Agreement (SLA), and performance.

Challenge

MDC v1 was built on industrial use case-optimized Predix services such as Predix Columnar Store (Cassandra) and Predix Insights (Amazon EMR). MDC evolved in both features and the underlying platform over the past year with a goal to improve TCO, data SLA, and performance. MDC's customer base grew, and the number of customer sites increased to over 100 in the past couple of years. The increased number of sites needed more compute and storage capacity. This increased infrastructure and operational costs significantly, while increasing data latency and reducing the freshness of data delivered from the cloud.

How we started

MDC evaluated several vendors for their storage and compute capabilities using various measurements: security, performance, scalability, ease of management and operation, reduction of overall cost and increase in ROI, partnership, and migration help (technology assistance). The MDC team saw opportunities to improve the product by using native AWS services such as Amazon Redshift, AWS Glue, and Amazon Managed Workflows for Apache Airflow (Amazon MWAA), which made the product more performant and scalable while reducing operation costs and making it future-ready for advanced analytics and new customer use cases.

The GE Digital team, comprised of domain experts, developers, and QA, worked shoulder to shoulder with the AWS ProServe team, comprised of Solution Architects, Data Architects, and Big Data Experts, in determining the key architectural changes required and solutions to implementation challenges.

Overview of solution

The following diagram illustrates the high-level architecture of the solution.

This is a broad overview, and the specifics of networking and security between components are out of scope for this post.

The solution includes the following main steps and components:

  1. CDC and log collector – Compressed CSV data is collected from over 100 manufacturing data sources (Proficy Plant Applications) and sunk into an Amazon Simple Storage Service (Amazon S3) bucket.
  2. S3 raw bucket – Our data lands in Amazon S3 without any transformation, but appropriately partitioned (tenant, site, date, and so on) for the ease of future processing.
  3. AWS Lambda – When a file lands in the S3 raw bucket, it triggers an S3 event notification, which invokes AWS Lambda. Lambda extracts metadata (bucket name, key name, date, and so on) from the event and saves it in Amazon DynamoDB (a minimal handler sketch follows this list).
  4. AWS Glue – Our goal is now to take CSV files, with varying schemas, and convert them into Apache Parquet format. An AWS Glue extract, transform, and load (ETL) job reads a list of files to be processed from the DynamoDB table and fetches them from the S3 raw bucket. We have preconfigured unified AVRO schemas in the AWS Glue Schema Registry for schema conversion. Converted data lands in the S3 raw Parquet bucket.
  5. S3 raw Parquet bucket – Data in this bucket is still raw and unmodified; only the format was changed. This intermediary storage is required due to schema and column order mismatch in CSV files.
  6. Amazon Redshift – The majority of transformations and data enrichment happens in this step. Amazon Redshift Spectrum consumes data from the S3 raw Parquet bucket and external PostgreSQL dimension tables (through a federated query). Transformations are performed via stored procedures, where we encapsulate logic for data transformation, data validation, and business-specific logic. The Amazon Redshift cluster is configured with concurrency scaling, auto workload management (WLM) with caching, and the latest RA3 instance types.
  7. MDC API – These custom-built, web-based, REST API microservices talk on the backend with Amazon Redshift and expose data to external users, business intelligence (BI) tools, and partners.
  8. Amazon Redshift data export and archival – On a scheduled basis, Amazon Redshift exports (UNLOAD command) contextualized and business-defined aggregated data. Exports are landed in the S3 bucket as Apache Parquet files.
  9. S3 Parquet export bucket – This bucket stores the exported data (hundreds of TBs) used by external users who need to run extensive, heavy analytics and AI or machine learning (ML) with various tools (such as Amazon EMR, Amazon Athena, Apache Spark, and Dremio).
  10. End-users – External users consume data from the API. The main use case here is reporting and visual analytics.
  11. Amazon MWAA – The orchestrator of the solution, Amazon MWAA is used for scheduling Amazon Redshift stored procedures, AWS Glue ETL jobs, and Amazon Redshift exports at regular intervals with error handling and retries built in.
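
The following is a minimal sketch, under assumed names, of the kind of Lambda handler described in step 3; the DynamoDB table name and attribute layout are placeholders.

    import urllib.parse

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("raw-file-metadata")   # placeholder table name

    def handler(event, context):
        # Record metadata for each file landing in the S3 raw bucket.
        for record in event["Records"]:
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            table.put_item(Item={
                "key_name": key,                               # placeholder partition key
                "bucket_name": record["s3"]["bucket"]["name"],
                "event_time": record["eventTime"],
                "size_bytes": record["s3"]["object"]["size"],
            })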

Bringing it all together

MDC replaced both Predix Columnar Store (Cassandra) and Predix Insights (Amazon EMR) with Amazon Redshift for both storage of the MDC data models and compute (ELT). Amazon MWAA is used to schedule the workloads that do the bulk of the ELT. Lambda, AWS Glue, and DynamoDB are used to normalize the schema differences between sites. It was important not to disrupt MDC customers while replatforming. To achieve this, MDC used a phased approach to migrate the data models to Amazon Redshift. They used federated queries to query existing PostgreSQL for dimensional data, which facilitated having some of the data models in Amazon Redshift, while the others were in Cassandra with no interruption to MDC customers. Redshift Spectrum facilitated querying the raw data in Amazon S3 directly both for ETL and data validation.

75% of the MDC team along with the AWS ProServe team and AWS Solution Architects collaborated with the GE Digital Security Team and Platform Team to implement the architecture with AWS native services. It took approximately 9 months to implement, secure, and performance tune the architecture and migrate data models in three phases. Each phase has gone through a GE Digital internal security review. Amazon Redshift Auto WLM, short query acceleration, and tuning the sort keys to optimize querying patterns improved the Proficy MDC API performance. Because the unload of the data from Amazon Redshift was fast, Proficy MDC is now able to export the data much more frequently to our end customers.

Conclusion

With replatforming, Proficy MDC was able to improve ETL performance by approximately 75%. Data latency and freshness improved by approximately 87%. The solution reduced the TCO of the platform by approximately 50%. Proficy MDC was also able to reduce infrastructure and operational costs. Improved performance and reduced latency have allowed us to speed up the next steps in our journey to modernize the enterprise data architecture and hybrid cloud data platform.


About the Authors

Jyothin Madari leads the Manufacturing Data Cloud (MDC) engineering team; part of the manufacturing suite of products at GE Digital. He has 18 years of experience, 4 of which is with GE Digital. Most recently he has been working on data migration projects with an aim to reduce costs and improve performance. He is an AWS Certified Cloud Practitioner, a keen learner and loves solving interesting problems. Connect with him on LinkedIn.

Madhusudhan (Madhu) Muppagowni is a Technical Architect and Principal Software Developer based in Silicon Valley, Bay Area, California.  He is passionate about Software Development and Architecture. He thrives on producing Well-Architected and Secure SaaS Products, Data Pipelines that can make a real impact.  He loves outdoors and an avid hiker and backpacker. Connect with him on LinkedIn.

Ayush Srivastava is a Senior Staff Engineer and Technical Anchor based in Hyderabad, India. He is passionate about Software Development and Architecture. He has Demonstrated track record of successfully technical anchoring small to large Secure SaaS Products, Data Pipelines from start to finish. He loves exploring different places and he says “I’m in love with cities I have never been to and people I have never met.” Connect with him on LinkedIn.

Karen Grygoryan is Data Architect with AWS ProServe. Connect with him on LinkedIn.

Gnanasekaran Kailasam is a Data Architect at AWS. He has worked with building data warehouses and big data solutions for over 16 years. He loves to learn new technologies and solving, automating, and simplifying customer problems with easy-to-use cloud data solutions on AWS. Connect with him on LinkedIn.

Supercharging Dream11’s Data Highway with Amazon Redshift RA3 clusters

Post Syndicated from Dhanraj Gaikwad original https://aws.amazon.com/blogs/big-data/supercharging-dream11s-data-highway-with-amazon-redshift-ra3-clusters/

This is a guest post by Dhanraj Gaikwad, Principal Engineer on Dream11 Data Engineering team.

Dream11 is the world’s largest fantasy sports platform, with over 120 million users playing fantasy cricket, football, kabaddi, basketball, hockey, volleyball, handball, rugby, futsal, American football, and baseball. Dream11 is the flagship brand of Dream Sports, India’s leading Sports Technology company, and has partnerships with several national and international sports bodies and cricketers.

In this post, we look at how we supercharged our data highway, the backbone of our major analytics pipeline, by migrating our Amazon Redshift clusters to RA3 nodes. We also look at why we were excited about this migration, the challenges we faced during the migration and how we overcame them, as well as the benefits accrued from the migration.

Background

The Dream11 Data Engineering team runs the analytics pipelines (what we call our Data Highway) across Dream Sports. In near-real time, we analyze various aspects that directly impact the end-user experience, which can have a profound business impact for Dream11.

Initially, we were analyzing upwards of terabytes of data per day with Amazon Redshift clusters that ran mainly on dc2.8xlarge nodes. However, due to a rapid increase in our user participation over the last few years, we observed that our data volumes increased multi-fold. Because we were using dc2.8xlarge clusters, this meant adding more nodes of dc2.8xlarge instance types to the Amazon Redshift clusters. Not only was this increasing our costs, it also meant that we were adding additional compute power when what we really needed was more storage. Because we anticipated significant growth during the Indian Premier League (IPL) 2021, we actively explored various options using our AWS Enterprise Support team. Additionally, we were expecting more data volume over the next few years.

The solution

After discussions with AWS experts and the Amazon Redshift product team, we at Dream11 were recommended the most viable option of migrating our Amazon Redshift clusters from dc2.8xlarge to the newer RA3 nodes. The most obvious reason for this was the decoupling of storage from compute. As a result, we could use fewer nodes and move our storage to Amazon Redshift managed storage. This allowed us to respond to data volume growth in the coming years as well as reduce our costs.

To start off, we conducted a few elementary tests using an Amazon Redshift RA3 test cluster. After we were convinced that this wouldn’t require many changes in our Amazon Redshift queries, we decided to carry out a complete head-to-head performance test between the two clusters.

Validating the solution

Because the user traffic on the Dream11 app tends to spike during big ticket tournaments like the IPL, we wanted to ensure that the RA3 clusters could handle the same traffic that we usually experience during our peak. The AWS Enterprise Support team suggested using the Simple Replay tool, an open-source tool released by AWS that you can use to record and replay the queries from one Amazon Redshift cluster to another. This tool allows you to capture queries on a source Amazon Redshift cluster, and then replay the same queries on a destination Amazon Redshift cluster (or clusters). We decided to use this tool to capture our performance test queries on the existing dc2.8xlarge clusters and replay them on a test Amazon Redshift cluster composed of RA3 nodes. During this time of our experimentation, the newer version of the automated AWS CloudFormation-based toolset (now on GitHub) was not available.

Challenges faced

The first challenge came up when using the Simple Replay tool because there was no easy way to compare the performance of like-to-like queries on the two types of clusters. Although Amazon Redshift provides various statistics using meta-tables about individual queries and their performance, the Simple Replay tool adds additional comments in each Amazon Redshift query on the target cluster to make it easier to know if these queries were run by the Simple Replay tool. In addition, the Simple Replay tool drops comments from the queries on the source cluster.

Comparing each query's performance with the Amazon Redshift performance test suite would mean writing additional scripts for easy performance comparison. An alternative would have been to modify the Simple Replay tool code, because it's open source on GitHub. However, with IPL 2021 beginning in just a few days, we had to explore another option urgently.

After further discussions with the AWS Enterprise Support team, we decided to use two test clusters: one with the old dc2.8xlarge nodes, and another with the newer RA3 nodes. The idea was to use the Simple Replay tool to run the captured queries from our original cluster on both test clusters. This meant that the queries would be identical on both test clusters, making it easier to compare. Although this meant running an additional test cluster for a few days, we went ahead with this option. As a side note, the newer automated AWS CloudFormation-based toolset does exactly the same in an automated way.

After we were convinced that most of our Amazon Redshift queries performed satisfactorily, we noticed that certain queries were performing slower on the RA3-based cluster than on the dc2.8xlarge cluster. We narrowed down the problem to SQL queries with full table scans. We rectified it by following proper data modeling practices in the ETL workflow. Then we were ready to migrate to the newer RA3 nodes.

The migration to RA3

The migration from the old cluster to the new cluster was smoother than we thought. We used the elastic resize approach, which meant we only had a few minutes of Amazon Redshift downtime. We completed the migration successfully with a sufficient buffer timeline for more tests. Additional tests indicated that the new cluster performed how we wanted it to.

The trial by fire

The new cluster performed satisfactorily during our peak performance loads in the IPL as well as the following ICC T20 Cricket World Cup. We’re excited that the new RA3 node-based Amazon Redshift cluster can support our data volume growth needs without needing to increase the number of instance nodes.

We migrated from dc2 to RA3 in April 2021. The data volume has grown by 50% since then. If we had continued with dc2 instances, the cluster cost would have increased by 50%. However, because of the migration to RA3 instances, even with an increase in data volume by 50% since April 2021, the cluster cost has increased by 0.7%, which is attributed to an increase in storage cost.

Conclusion

Migrating to the newer RA3-based Amazon Redshift cluster helped us decouple our computing needs from our storage needs, and now we’re prepared for our expected data volume growth for the next few years. Moreover, we don’t need to add compute nodes if we only need storage, which is expected to bring down our costs in the long run. We did need to fine-tune some of our queries on the newer cluster. With the Simple Replay tool, we could do a direct comparison between the older and the newer cluster. You can also use the newer automated AWS CloudFormation-based toolset if you want to follow a similar approach.

We highly recommend RA3 instances. They give you the flexibility to size your RA3 cluster based on the amount of data stored without increasing your compute costs.


About the Authors

Dhanraj Gaikwad is a Principal Data Engineer at Dream11. Dhanraj has more than 15 years of experience in the field of data and analytics. In his current role, Dhanraj is responsible for building the data platform for Dream Sports and is specialized in data warehousing, including data modeling, building data pipelines, and query optimizations. He is passionate about solving large-scale data problems and taking unique approaches to deal with them.

Sanket Raut is a Principal Technical Account Manager at AWS based in Vasai, India. Sanket has more than 16 years of industry experience, including roles in cloud architecture, systems engineering, and software design. He currently focuses on enabling large startups to streamline their cloud operations and optimize their cloud spend. His area of interest is in serverless technologies.

AWS Week In Review – May 23, 2022

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/aws-week-in-review-may-27-2022/

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

This is the right place to quickly learn about recent AWS news from last week, in just about five minutes or less. This week, I have collected a couple of news items that might be of interest to you, the IT professionals, developers, system administrators, or any type of builders that have their hands on the AWS console, the CLI, or that are writing code.

Last Week’s Launches
The launches that caught my attention last week are the following:

EC2 now supports NitroTPM and SecureBoot – A Trusted Platform Module is often a discrete chip in a computer where you can store secrets and release them to the operating system only when the system is in a known good state. You typically use TPM modules to store operating-system-level volume encryption keys, such as the ones used by BitLocker on Windows or LUKS. NitroTPM is a virtual TPM module, available on selected instance families, that allows workloads depending on TPM functionality to run on EC2 instances.

Amazon EC2 Auto Scaling now backfills predictive scaling forecasts so you can quickly validate forecast accuracy. Predictive scaling is a capability of Auto Scaling that allows you to scale your fleet in and out based on observed usage patterns. It uses AI/ML to predict when your fleet needs more or less capacity, allowing you to scale in advance of a scaling event and have the fleet prepared at peak times. The new backfill feature shows you how predictive scaling would have scaled your fleet during the last 14 days. This allows you to quickly decide whether the predictive scaling policy is accurate for your applications by comparing the demand and capacity forecasts against actual demand immediately after you create a predictive scaling policy.

AWS Backup adds support for two new managed file systems, Amazon FSx for OpenZFS and Amazon FSx for NetApp ONTAP. These additions help you meet your centralized data protection and regulatory compliance needs. You can now use AWS Backup's policy-based capabilities to centrally protect Amazon FSx for NetApp ONTAP or Amazon FSx for OpenZFS, along with the other AWS services for storage, database, and compute that AWS Backup supports.

AWS App Mesh now supports IPv6 – AWS App Mesh is a service mesh that provides application-level networking to make it easy for your services to communicate with each other across multiple types of compute infrastructure. The new support for IPv6 allows you to support workloads running in IPv6 networks and to invoke App Mesh APIs over IPv6. This helps you meet IPv6 compliance requirements, and removes the need for complex networking configuration to handle address translation between IPv4 and IPv6.

Amazon Chime SDK now supports video background replacement and blur on iOS and Android. When you want to integrate audio and video call capabilities in your mobile applications, the Chime SDK is the easiest way to get started. It provides an easy-to-use API that uses the scalable and robust Amazon Chime backend to power your communications. For example, Slack is using Chime as the backend for the communications in their apps. The Chime SDK client libraries for iOS and Android now include video background replacement and blur, which developers can use to reduce visual distractions and help increase visual privacy for mobile users on iOS and Android.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Some other updates and news that you may have missed:

Amazon Redshift: Ten years of continuous reinvention. This is an Amazon Redshift research paper that will be presented at a leading international forum for database researchers. The authors reflect on how far the first petabyte-scale cloud data warehouse has advanced since it was announced ten years ago.

Improve Your Security at the Edge with AWS IoT Services is a new blog post on the IoT channel. We understand the risks associated with operating at the edge and that you need additional capabilities to ensure that your data is protected. AWS IoT services can help you with end-to-end data protection, device security, and device identification to create the foundation of an expanded information security model and confidently operate at the edge.

AWS Open Source News and Updates – Ricardo Sueiras, my colleague from the AWS Developer Relations team, runs this newsletter. It brings you all the latest open-source projects, posts, and more. Read edition #113 here.

Upcoming AWS Events
CDK Day, on May 26, is a one-day, fully virtual event dedicated to the AWS Cloud Development Kit. With four versions of the CDK released (AWS, Terraform, CDK8s, and Projen), we thought the CDK deserves its own full-fledged conference. We will take one day and showcase the brightest and best of CDK from across the whole product family. Let's talk serverless, Kubernetes, and multi-cloud all on the same day! CDK Day will take place on May 26, 2022 and will be fully virtual, live-streamed to our YouTube channel. Book your ticket now, it's free.

The AWS Summit season is mostly over in Europe, but there are upcoming Summits in North America and the Asia Pacific Regions. Here are some virtual and in-person Summits that might be close to you:

More to come in July, August, and September.

You can register for re:MARS to get fresh ideas on topics such as machine learning, automation, robotics, and space. The conference will be in person in Las Vegas, June 21–24.

That’s all for this week. Check back next Monday for another Week in Review!

— seb

Use Amazon Redshift RA3 with managed storage in your modern data architecture

Post Syndicated from Bhanu Pittampally original https://aws.amazon.com/blogs/big-data/use-amazon-redshift-ra3-with-managed-storage-in-your-modern-data-architecture/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers.

Over the years, Amazon Redshift has evolved a lot to meet our customer demands. Its journey started as a standalone data warehousing appliance that provided a low-cost, high-performance, cloud-based data warehouse. Support for Amazon Redshift Spectrum was later added to extend your data warehouse to data lakes, and the concurrency scaling feature was added to support burst activity and scale your data warehouse to support thousands of queries concurrently. In its latest offering, Amazon Redshift runs on a third-generation architecture where the storage and compute layers are decoupled and scaled independently of each other. This latest generation powers several modern data architecture patterns our customers are actively embracing to build flexible and scalable analytics platforms.

When spinning up a new instance of Amazon Redshift, you get to choose either Amazon Redshift Serverless, for when you need a data warehouse that can scale seamlessly and automatically as your demand evolves unpredictably, or you can choose an Amazon Redshift provisioned cluster for steady-state workloads and greater control over your Amazon Redshift cluster’s configuration.

An Amazon Redshift provisioned cluster is a collection of computing resources called nodes, which are organized into a group called a cluster. Each cluster runs the Amazon Redshift engine and contains one or more databases. Creating an Amazon Redshift cluster is the first step in your process of building an Amazon Redshift data warehouse. While launching a provisioned cluster, one option that you specify is the node type. The node type determines the CPU, RAM, storage capacity, and storage drive type for each node.

In this post, we cover the current generation node RA3 architecture, different RA3 node types, important capabilities that are available only on RA3 node types, and how you can upgrade your current Amazon Redshift node types to RA3.

Amazon Redshift RA3 nodes

RA3 nodes with managed storage enable you to optimize your data warehouse by scaling and paying for compute and managed storage independently. RA3 is the latest generation of node types for Amazon Redshift. With RA3, you choose the number of nodes based on your performance requirements and only pay for the managed storage that you use. The RA3 architecture gives you the ability to size your cluster based on the amount of data you process daily or the amount of data that you want to store in your warehouse; there is no need to account for both storage and processing needs together.

Other node types that we previously offered include the following:

  • Dense compute – DC2 nodes enable you to have compute-intensive data warehouses with local SSD storage included. You choose the number of nodes you need based on data size and performance requirements.
  • Dense storage (deprecated) – DS2 nodes enable you to create large data warehouses using hard disk drives (HDDs). If you’re using the DS2 node type, we strongly recommend that you upgrade to RA3 to get twice as much storage and improved performance for the same on-demand cost.

When you choose an RA3 node size and the number of nodes, you can provision compute independent of storage. RA3 nodes are built on the AWS Nitro System and feature high bandwidth networking and large high-performance SSDs as local caches. RA3 nodes use your workload patterns and advanced data management techniques to deliver the performance of local SSD while scaling storage automatically to Amazon Simple Storage Service (Amazon S3).

RA3 node types come in three different sizes to accommodate your analytical workloads. You can quickly start experimenting with the RA3 node type by creating a single-node ra3.xlplus cluster and exploring the various features that are available (a minimal provisioning sketch follows the table). If you're running a medium-sized data warehouse, you can size your cluster with ra3.4xlarge nodes. For large data warehouses, you can start with ra3.16xlarge. The following table gives more information about RA3 node types and their specifications as of this writing.

Node Type       vCPU   RAM (GiB)   Default Slices Per Node   Managed Storage Quota Per Node   Node Range with Create Cluster   Total Managed Storage Capacity
ra3.xlplus      4      32          2                         32 TB                            1–16                             1024 TB
ra3.4xlarge     12     96          4                         128 TB                           2–32                             8192 TB
ra3.16xlarge    48     384         16                        128 TB                           2–128                            16384 TB

Amazon Redshift with managed storage

Amazon Redshift with a managed storage architecture (RMS) still boasts the same resiliency and industry-leading hardware. With managed storage, Amazon Redshift uses intelligent data prefetching and data eviction based on the temperature of your data to determine where blocks are stored. The most frequently used blocks (hot data) are cached locally on SSD, and infrequently used blocks (cold data) are stored on the RMS layer backed by Amazon S3. The following diagram depicts the leader node, compute nodes, and Amazon Redshift managed storage.

In the following sections, we discuss the capabilities that Amazon Redshift RA3 with managed storage can provide.

Independently scale compute and storage

As an organization grows, its data grows too, often reaching petabytes, and the amount of data you ingest into your Amazon Redshift data warehouse grows with it. You may be looking for ways to cost-effectively analyze all your data while retaining control over choosing the right compute or storage resource at the right time. For cost-conscious customers, the RA3 platform provides the option to scale and pay for compute and storage resources separately.

With RA3 instances with managed storage, you choose the number of nodes based on your performance requirements and pay only for the managed storage that you use. This gives you the flexibility to size your RA3 cluster based on the amount of data you process daily without increasing your storage costs: you pay per hour for compute, and you can scale your data warehouse storage capacity without adding any compute resources, paying only for the storage you use.

Another benefit of RMS is that Amazon Redshift manages which data should be stored locally for fastest access, and data that is slightly colder is still kept within fast-access reach.

Advanced hardware

RA3 instances use high-bandwidth networking built on the AWS Nitro System to further reduce the time taken for data to be offloaded to and retrieved from Amazon S3. Managed storage uses high-performance SSDs for your hot data and Amazon S3 for your cold data, providing ease of use, cost-effective storage, and fast query performance.

Additional security options

Amazon Redshift managed VPC endpoints enable you to set up a private connection to securely access your Amazon Redshift cluster within your virtual private cloud (VPC) from client applications in another VPC within the same AWS account, another AWS account, or a subnet without using public IPs and without requiring the traffic to traverse across the internet.

The following scenarios describe common reasons to allow access to a cluster using an Amazon Redshift managed VPC endpoint:

  • AWS account A wants to allow a VPC in AWS account B to have access to a cluster
  • AWS account A wants to allow a VPC that is also in AWS account A to have access to a cluster
  • AWS account A wants to allow a different subnet in the cluster’s VPC within AWS account A to have access to a cluster

For information about access options to another VPC, refer to Enable private access to Amazon Redshift from your client applications in another VPC.

Further optimize your workload

In this section, we discuss two ways to further optimize your workload.

AQUA

AQUA (Advanced Query Accelerator) is a new distributed and hardware-accelerated cache that enables Amazon Redshift to run up to 10 times faster than other enterprise cloud data warehouses by automatically boosting certain types of queries. AQUA is available with the ra3.16xlarge, ra3.4xlarge, or ra3.xlplus nodes at no additional charge and with no code changes.

AQUA is an analytics query accelerator for Amazon Redshift that uses custom-designed hardware to speed up queries that scan large datasets. It automatically boosts performance for queries that require extensive scans, filters, and aggregations over subsets of your data.

For more information about using AQUA, refer to How to evaluate the benefits of AQUA for your Amazon Redshift workloads.

Concurrency scaling for write operations

With RA3 nodes, you can take advantage of concurrency scaling for write operations, such as extract, transform, and load (ETL) statements. Concurrency scaling for write operations is especially useful when you want to maintain consistent response times when your cluster receives a large number of requests. It improves throughput for write operations contending for resources on the main cluster.

Concurrency scaling supports COPY, INSERT, DELETE, and UPDATE statements. In some cases, you might follow DDL statements, such as CREATE, with write statements in the same commit block. In these cases, the write statements are not sent to the concurrency scaling cluster.

When you accrue credit for concurrency scaling, this credit accrual applies to both read and write operations.
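
To see when concurrency scaling was active and how much usage accrued against your credits, you can query the system views Amazon Redshift provides. A minimal sketch, assuming the SVCS_CONCURRENCY_SCALING_USAGE view is available on your cluster version:

-- Review concurrency scaling usage periods, most recent first
select *
from svcs_concurrency_scaling_usage
order by start_time desc;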

Increased agility to scale compute resources

Elastic resize allows you to scale your Amazon Redshift cluster up and down in minutes to get the performance you need, when you need it. However, there are limits on the nodes that you can add to a cluster. With some RA3 node types, you can increase the number of nodes up to four times the existing count. All RA3 node types support a decrease in the number of nodes to a quarter of the existing count. The following table lists growth and reduction limits for each RA3 node type.

Node Type Growth Limit Reduction Limit
ra3.xlplus 2 times (from 4 to 8 nodes, for example) To a quarter of the number
ra3.4xlarge 4 times (from 4 to 16 nodes, for example) To a quarter of the number (from 16 to 4 nodes, for example)
ra3.16xlarge 4 times (from 4 to 16 nodes, for example) To a quarter of the number (from 16 to 4 nodes, for example)

RA3 node types also restore from snapshots faster because storage and compute are separated.

Improved resiliency

Amazon Redshift employs extensive fault detection and auto remediation techniques in order to maximize the availability of a cluster. With the RA3 architecture, you can enable cluster relocation, which provides additional resiliency by relocating a cluster across Availability Zones without losing any data (RPO is zero) or having to change your client applications. The cluster's endpoint remains the same after the relocation occurs, so applications can continue operating without modification. If the existing cluster fails, a new cluster is created on demand in another Availability Zone, which avoids the cost of a standby replica cluster.

Accelerate data democratization

In this section, we share two techniques to accelerate data democratization.

Data sharing

Data sharing provides instant, granular, and high-performance access to data without copies or data movement. You can query live data continually across consumers on different RA3 clusters in the same AWS account, in a different AWS account, or in a different AWS Region. Data is shared securely, enabling governed collaboration. You can grant access at different levels of granularity, including schemas, databases, tables, views, and user-defined functions.

This opens up various new use cases where you may have one ETL cluster that is producing data and have multiple consumers such as ad-hoc querying, dashboarding, and data science clusters to view the same data. This also enables bi-directional collaboration where groups such as marketing and finance can share data with one another. Queries accessing shared data use the compute resources of the consumer Amazon Redshift cluster and don’t impact the performance of the producer cluster.
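
To illustrate the flow, the following is a minimal sketch of sharing a table from a producer cluster and querying it from a consumer cluster; the datashare, schema, and table names are illustrative, and the namespace GUIDs are those of your own clusters:

-- On the producer cluster
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA sales;
ALTER DATASHARE sales_share ADD TABLE sales.daily_revenue;
GRANT USAGE ON DATASHARE sales_share TO NAMESPACE '<consumer-namespace-guid>';

-- On the consumer cluster
CREATE DATABASE sales_share_db FROM DATASHARE sales_share OF NAMESPACE '<producer-namespace-guid>';
SELECT * FROM sales_share_db.sales.daily_revenue LIMIT 10;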

For more information about data sharing, refer to Sharing Amazon Redshift data securely across Amazon Redshift clusters for workload isolation.

AWS Data Exchange for Amazon Redshift

AWS Data Exchange for Amazon Redshift enables you to find and subscribe to third-party data in AWS Data Exchange that you can query in an Amazon Redshift data warehouse in minutes. You can also license your data in Amazon Redshift through AWS Data Exchange. Access is automatically granted when a customer subscribes to your data and is automatically revoked when their subscription ends. Invoices are automatically generated, and payments are automatically collected and disbursed through AWS. This feature empowers you to quickly query, analyze, and build applications with third-party data.

For details on how to publish a data product and subscribe to a data product using AWS Data Exchange for Amazon Redshift, refer to New – AWS Data Exchange for Amazon Redshift.

Cross-database queries for Amazon Redshift

Amazon Redshift supports the ability to query across databases in a Redshift cluster. With cross-database queries, you can seamlessly query data from any database in the cluster, regardless of which database you are connected to. Cross-database queries can eliminate data copies and simplify your data organization to support multiple business groups on the same cluster.

Cross-database queries help, for example, when data is organized across multiple databases in a Redshift cluster to support multi-tenant configurations: different business groups and teams that own and manage datasets in their own database in the same data warehouse need to collaborate with other groups. You might want to perform common ETL staging and processing while your raw data is spread across multiple databases. Organizing data in multiple Redshift databases is also a common scenario when migrating from traditional data warehouse systems. With cross-database queries, you can access data from any database on the Redshift cluster without having to connect to that specific database, and you can join datasets from multiple databases in a single query.
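
As a quick sketch (the database, schema, and table names are illustrative), three-part notation lets you reference objects in another database on the same cluster:

-- Connected to the dev database; join a local table with one in the marketing database
SELECT o.order_id, o.order_total, c.segment
FROM public.orders o
JOIN marketing.public.customers c
  ON o.customer_id = c.customer_id;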

You can read more about cross-database queries here.

Upgrade to RA3

You can upgrade to RA3 instances within minutes no matter the size of your current Amazon Redshift clusters. Simply take a snapshot of your cluster and restore it to a new RA3 cluster. For more information, refer to Upgrading to RA3 node types.

You can also simplify your migration efforts with Amazon Redshift Simple Replay. For more information, refer to Simplify Amazon Redshift RA3 migration evaluation with Simple Replay utility.

Summary

In this post, we talked about the RA3 node types, the benefits of Amazon Redshift managed storage, and the additional capabilities you get by using Amazon Redshift RA3 with managed storage. Migrating to RA3 node types isn't a complicated effort, so you can get started today.


About the Authors

Bhanu Pittampally is an Analytics Specialist Solutions Architect based out of Dallas. He specializes in building analytic solutions. His background is in data warehouses—architecture, development, and administration. He has been in the data and analytics field for over 13 years.

Jason Pedreza is an Analytics Specialist Solutions Architect at AWS with data warehousing experience handling petabytes of data. Prior to AWS, he built data warehouse solutions at Amazon.com. He specializes in Amazon Redshift and helps customers build scalable analytic solutions.

Ingest Stripe data in a fast and reliable way using Stripe Data Pipeline for Amazon Redshift

Post Syndicated from Jessica Ho original https://aws.amazon.com/blogs/big-data/ingest-stripe-data-in-a-fast-and-reliable-way-using-stripe-data-pipeline-for-amazon-redshift/

Enterprises typically host a myriad of business applications for varying data needs. As companies grow, so does the demand for insights from a complete set of business data. Having data from various applications that store data in disparate silos can delay the decision-making process. However, building and maintaining an API integration or a third-party extract, transform, and load (ETL) pipeline to move data into a destination data store can be time-consuming and expensive.

Today we’re delighted to introduce Stripe Data Pipeline for Amazon Redshift to help you access your Stripe data and extract insight securely and easily from Amazon Redshift. This data, including billing, issuing, and payment records, can be shared in a consistent and automated fashion. You can integrate your Stripe data with data from other sources in your Amazon Redshift clusters to create a single source of truth.

In this post, we discuss the benefits of Stripe Data Pipeline and some of its use cases.

Solution overview

Amazon Redshift is a fast, fully managed, petabyte-scale cloud data warehousing service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence (BI) tools. It’s optimized for datasets ranging from a few hundred gigabytes to petabytes or more. This columnar data warehouse provides provisioned as well as serverless deployment options and uses an industry-standard SQL interface to analyze structured and semi-structured data with fast query performance.

Stripe Data Pipeline is powered by Amazon Redshift’s latest RA3 instances, which provide cross-account data sharing capability. RA3 takes a performant, cost-effective approach to address rapidly growing data volume by decoupling data processing from managed storage. You can then scale compute and storage independently and only pay for what you use. Data sharing provides read access directly to data stored across Amazon Redshift clusters without data movement. This capability removes the complexity and delays that are often associated with managing large distributed datasets across multiple accounts.

The solution provides the following core features and benefits:

  • Scalable and managed data pipeline – You don't need to build, maintain, and scale custom ETL jobs. You can set up Stripe Data Pipeline in minutes, and it scales automatically to handle increased business activity and data volume.
  • Up-to-date financial data – You automatically receive and refresh a complete set of your Stripe data and reports in Amazon Redshift on a low-latency schedule. Stripe Data Pipeline is built into Stripe and always provides accurate data.
  • Security and compliance – Data is shared directly from Stripe with your Amazon Redshift cluster, and confidentiality of the data is protected in transit and at rest. Amazon Redshift offers comprehensive security controls and monitoring via native integration with AWS CloudTrail and Amazon CloudWatch (for more information, see Logging Amazon Redshift API calls with AWS CloudTrail and Monitoring Amazon Redshift using CloudWatch metrics, respectively). You can define and audit who has access to what and ensure the compliance requirements are met.
  • Extensibility – Once the data is accessible in AWS, you benefit from the breadth of native integrations Amazon Redshift supports. You can join datasets from other data stores in operational databases, build reports and dashboards with BI tools, or identify patterns and generate prediction using Amazon Redshift ML.

The following architecture diagram provides a quick overview of how data sharing works and how other AWS services can be used together. We dive deeper into different use cases in the following sections.

Accept datashares from Stripe

You can configure the solution in a few steps with no code necessary.

Once Stripe creates a datashare from the producer cluster and authorizes your AWS account, you can view this datashare on your Amazon Redshift console. You then associate it with specific clusters, or all clusters, in your AWS account as the consumer. Clusters are identified by their namespaces, which are globally unique identifiers. Next, you create a database from the datashare in order to start querying data.
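
If you prefer SQL over the console for these steps, the flow looks similar to the following sketch; the datashare name, account ID, and namespace GUID are placeholders for the values involved in your share:

-- List datashares that are available to your cluster
SHOW DATASHARES;

-- Create a local database from the Stripe datashare
CREATE DATABASE stripe FROM DATASHARE <stripe_datashare_name>
OF ACCOUNT '<stripe-account-id>' NAMESPACE '<producer-namespace-guid>';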

Query data from the consumer Amazon Redshift cluster

You can now access your Stripe data and schema directly from Amazon Redshift’s web-based query editor. This direct connection enables teams to pull accurate analysis of various functions of the business. For example:

  • Finance – “How does my cash flow change based on seasonality?”
  • Sales – “How many customers do we have in the US?”
  • Product – “How many active users do we have on each subscription plan?”
  • Sales operations – “Which customers haven’t paid their invoices?”

The following screenshot shows an example in which the query editor displays the number of charges blocked per Stripe’s connected account.
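
The exact tables depend on the schema Stripe shares with you, but a query along the following lines produces that kind of result (the table and column names here are illustrative, not Stripe's actual schema):

-- Count blocked charges per connected account (illustrative schema)
SELECT connected_account_id,
       COUNT(*) AS blocked_charges
FROM stripe.charges
WHERE outcome_type = 'blocked'
GROUP BY connected_account_id
ORDER BY blocked_charges DESC;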

Use federated queries

The modern data architecture of Amazon Redshift enables you to store data in purpose-built data stores based on specific use cases, and allows querying external databases on Amazon Relational Database Service (Amazon RDS) or datasets in an Amazon Simple Storage Service (Amazon S3) data lake without moving these datasets to Amazon Redshift clusters. You can dive deeper by incorporating data from Amazon RDS, or from an S3 data lake through Amazon Redshift Spectrum. This capability provides a native integration without requiring additional ETL jobs.

The following syntax allows you to create an external schema from an Amazon Aurora MySQL-Compatible Edition database to an Amazon Redshift cluster. Amazon Redshift assumes an AWS Identity and Access Management (IAM) role and uses AWS Secrets Manager to access external data stores. For more information and examples with other supported data stores, refer to Querying data with federated queries in Amazon Redshift.

CREATE EXTERNAL SCHEMA auroram
FROM MYSQL
DATABASE 'example_database' SCHEMA 'example_schema' -- schema is optional
URI 'hostname'
IAM_ROLE 'iam_role_arn'
SECRET_ARN 'aws_secrets_manager_arn';

Coming back to Stripe Data Pipeline, you can now combine the data from an Aurora table and create further analysis. For example, you can correlate customer acquisition trends with sales campaigns by region, so you can gauge campaign effectiveness and make adjustments to marketing strategies.
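
For instance, a sketch of such an analysis might join the federated auroram schema created above with Stripe data; the campaign table and the Stripe column names are illustrative:

-- Correlate campaign reach with new Stripe customers by region (illustrative)
SELECT m.region,
       m.campaign_name,
       COUNT(DISTINCT s.customer_id) AS new_customers
FROM auroram.campaigns m
JOIN stripe.customers s
  ON s.signup_region = m.region
  AND s.created >= m.start_date
GROUP BY m.region, m.campaign_name;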

Create visualizations and dashboards

Now that your complete set of business data is accessible from Amazon Redshift, you can start to explore the data and create visualizations. Amazon QuickSight is a serverless BI service that allows you to easily connect to a data source, create analyses, publish dashboards, and share between teams. QuickSight seamlessly integrates with AWS services such as Amazon Redshift, Amazon S3, and many more.

The following screenshot illustrates how straightforward it is to connect an Amazon Redshift instance to QuickSight as a new data source.

The following screenshot is of a sample QuickSight dashboard pulling data from Amazon Redshift.

Key considerations

When using Stripe Data Pipeline, consider the following:

  • Instance type – This solution is available for all RA3 node types. If you run an existing DS2 or DC2 cluster, there are multiple options to migrate to RA3, including elastic resize, snapshot and restore, and classic resize. For more information, including an upgrade sizing reference between different node types, refer to Upgrading to RA3 node types.
  • RI migration – If you have Amazon Redshift Reserved Instances (RIs), you can use the RI migration feature to migrate the DS2 RI clusters to equivalent RA3 RI clusters as part of a cross-instance resize or cross-instance snapshot restore operation. The RA3 RI covering the new cluster will be the same cost and on the same calendar terms as the original DS2 RI for supported configurations.
  • Encryption – The consumer cluster must be encrypted as part of the enhanced security control for cross-account sharing. You can enable encryption at cluster creation time, or modify an unencrypted cluster with either AWS Key Management Service (AWS KMS) or AWS CloudHSM.
  • Federated queries – This capability works with external DB instances, including Amazon RDS for PostgreSQL, Amazon Aurora PostgreSQL-Compatible Edition, Amazon RDS for MySQL, and Aurora MySQL-Compatible Edition. You should also ensure that you have an Amazon Redshift cluster with a cluster maintenance version that supports federated queries.

Conclusion

In this post, we introduced Stripe Data Pipeline for Amazon Redshift and discussed options to further integrate with AWS services. Stripe Data Pipeline removes the need to build custom API integration or adopt a third-party ETL pipeline, making data accessible with a few clicks and with no code required. Businesses can automatically receive up-to-date data from Stripe in their data warehouse on AWS, reduce data silos, and extract deep insights to address business needs.

Check out Stripe Data Pipeline for more information about the solution and how to get started.


About the Authors

Jessica Ho is a Sr. Partner Solutions Architect at AWS supporting ISV partners who build business applications. She is passionate about creating differentiated solutions that promote cloud adoption. Outside of work, she enjoys spoiling her garden into a mini jungle.

Alexander Mahabir is a Sr. Partner Solutions Architect at AWS based in the D.C. metropolitan area. Alex has over 16 years of experience building cloud and on-premises solutions for small, medium, and large enterprises. Alex currently works with ISV partners in the Digital Customer Experience segment.

Use a linear learner algorithm in Amazon Redshift ML to solve regression and classification problems

Post Syndicated from Phil Bates original https://aws.amazon.com/blogs/big-data/use-a-linear-learner-algorithm-in-amazon-redshift-ml-to-solve-regression-and-classification-problems/

Amazon Redshift is the fastest, most widely used, fully managed, petabyte-scale cloud data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads. Amazon Redshift ML, powered by Amazon SageMaker, makes it easy for SQL users such as data analysts, data scientists, and database developers to create, train, and deploy machine learning (ML) models using familiar SQL commands and then use these models to make predictions on new data for use cases such as churn prediction, customer lifetime value prediction, and product recommendations. Redshift ML makes the model available as a SQL function within the Amazon Redshift data warehouse so you can easily use it in queries and reports. Customers across all verticals are using Redshift ML to derive better insights from their data. For example, Jobcase uses Redshift ML to recommend job searches at scale. Magellan RX Management uses Redshift ML to predict drug therapeutic use conditions.

Amazon Redshift supports supervised learning, including regression, binary classification, and multi-class classification, as well as unsupervised learning using K-Means. You can optionally specify XGBoost, MLP, and now linear learner model types, which are supervised learning algorithms used for solving either classification or regression problems. The linear learner algorithm provides a significant increase in speed over traditional hyperparameter optimization techniques. Amazon Redshift also supports bring-your-own-model to invoke remote SageMaker endpoints.

In this post, we show you how to use Redshift ML to solve regression and classification problems using the SageMaker linear learner algorithm, which explores different training objectives and chooses the best solution from a validation set.

Solution overview

We first solve a linear regression problem, followed by a multi-class classification problem.

The following table shows some common use cases and algorithms used.

Use Case Algorithm / Problem Type
Customer churn prediction Classification
Predict if a sales lead will close Classification
Fraud detection Classification
Price and revenue prediction Linear regression
Customer lifetime value prediction Linear regression

To use the linear learner algorithm, you need to provide inputs or columns representing dimensional values and also the label or target, which is the value you’re trying to predict. The linear learner algorithm trains many models in parallel, and automatically determines the most optimized model.

Prerequisites

To get started, we need an Amazon Redshift cluster or an Amazon Redshift Serverless endpoint and an AWS Identity and Access Management (IAM) role attached that provides access to SageMaker and permissions to an Amazon Simple Storage Service (Amazon S3) bucket.

For an introduction to Redshift ML and instructions on setting it up, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.

To create a simple cluster with a default IAM role, see Use the default IAM role in Amazon Redshift to simplify accessing other AWS services.

Use case 1: Linear regression

In this use case, we analyze the Abalone dataset and determine the relationship between an abalone's physical measurements and its age. Traditionally, the age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope, which is a time-consuming task. We want to predict the age using the physical measurements, which are easier to obtain. The age of an abalone is (number of rings + 1.5) years.

Prepare the data

Load the Abalone dataset into Amazon Redshift using the following SQL. You can use the Amazon Redshift query editor v2 or your preferred SQL tool to run these commands.

To create the table, use the following commands:

create table abalone_dataset
( 
id INT IDENTITY(1,1),
Sex     CHAR(1),
Length float,
Diameter float,
Height float,
Whole  float,
Shucked  float,
Viscera  float,
Shell float,
Rings  integer   
); 

To load data into Amazon Redshift, use the following COPY command:

COPY abalone_dataset 
from 's3://redshift-ml-multiclass/abalone.csv' 
REGION 'us-east-1'
IAM_ROLE default
CSV IGNOREHEADER 1
NULL AS 'NULL'; 
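
A quick row count confirms the load; the full UCI Abalone dataset contains 4,177 rows:

select count(*) from abalone_dataset;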

To train the model, we use 80% of the data in the abalone table, and then test the model's accuracy by checking whether it correctly predicts the rings label attribute on the remaining 20%. Run the following commands to create the training and validation tables:

create table abalone_training as
  select * from abalone_dataset where mod(id,10) < 8 ; 

create table abalone_validation as
  select * from abalone_dataset where mod(id,10) >= 8; 

Create a model in Redshift ML

To create a model in Amazon Redshift, use the following command:

CREATE MODEL model_abalone_ring_prediction
FROM (
 SELECT   Sex ,
Length ,
Diameter ,
Height ,
Whole  ,
Shucked  ,
Viscera  ,
Shell,
Rings  as target_label
FROM abalone_training
) TARGET target_label
FUNCTION f_abalone_ring_prediction
IAM_ROLE default
MODEL_TYPE LINEAR_LEARNER 
PROBLEM_TYPE REGRESSION
OBJECTIVE 'MSE'
SETTINGS (
  S3_BUCKET '<your-s3-bucket>',
  MAX_RUNTIME 15000
  ); 

We define the following parameters in the CREATE MODEL statement:

  • Problem type – We use the linear learner problem type, which is newly added to extend upon typical linear models by training many models in parallel, in a computationally efficient manner.
  • Objective – We specified MSE (mean square error) as our objective, which is a common metric for evaluation of regression problems.
  • Max runtime – This parameter denotes how long the model training can run. Specifying a larger value may help create a better tuned model. The default value for this parameter is 5400 (90 minutes). For this example, we set it to 15000.

The preceding statement takes a few seconds to complete. It initiates an Amazon SageMaker Autopilot process in the background to automatically build, train, and tune the best ML model for the input data. It then uses Amazon SageMaker Neo to deploy that model locally in the Amazon Redshift cluster or Amazon Redshift Serverless as a user-defined function (UDF). You can use the SHOW MODEL command in Amazon Redshift to track the progress of your model creation, which should be in the READY state within the max_runtime parameter you defined while creating the model.

To check the status of the model, use the following command:

show model model_abalone_ring_prediction;

The following is the tabular outcome for the preceding command after model training was done. It took approximately 120 minutes to train the model.

Key Value
Model Name model_abalone_ring_prediction
Schema Name public
Owner awsuser
Creation Time Tue, 10.05.2022 19:42:33
Model State READY
validation:mse 4.082088
Estimated Cost 5.423719
. .
TRAINING DATA: .
Query SELECT SEX , LENGTH , DIAMETER , HEIGHT , WHOLE , SHUCKED , VISCERA , SHELL, RINGS AS TARGET_LABEL
. FROM ABALONE_TRAINING
Target Column TARGET_LABEL
. .
PARAMETERS: .
Model Type linear_learner
Problem Type Regression
Objective MSE
AutoML Job Name redshiftml-20220510194233380173
Function Name f_abalone_ring_prediction
Function Parameters sex length diameter height whole shucked viscera shell
Function Parameter Types bpchar float8 float8 float8 float8 float8 float8 float8
IAM Role default-aws-iam-role
S3 Bucket poc-generic-bkt
Max Runtime 15000

Model validation

We notice from the preceding table that the MSE for the training data is 4.08. Now let's run the prediction query and validate the accuracy of the model on the validation dataset:

select
ROUND(AVG(POWER(( tgt_label - predicted ),2)),2) mse
, ROUND(SQRT(AVG(POWER(( tgt_label - predicted ),2))),2) rmse
from 
(
SELECT   Sex ,
Length ,
Diameter ,
Height ,
Whole  ,
Shucked  ,
Viscera  ,
Shell,
Rings as tgt_label,
f_abalone_ring_prediction(Sex ,Length ,Diameter ,Height ,Whole  ,Shucked  ,Viscera  ,Shell) as predicted,
case when tgt_label = predicted then 1
      else 0 end as match,
  case when tgt_label  <> predicted then 1
    else 0 end as nonmatch
FROM abalone_validation
) t1;

The following is the outcome from the query:

mse rmse
5.08 2.25

The MSE value from the preceding query results indicates that our model's predictions are reasonably close to the actual values in our validation dataset.
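
With the model in the READY state, you can also invoke the prediction function directly for quick spot checks. The following sketch passes in the measurements of the first record of the Abalone dataset (its actual ring count is 15):

select f_abalone_ring_prediction('M', 0.455, 0.365, 0.095, 0.514, 0.2245, 0.101, 0.15) as predicted_rings;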

Notice that Redshift ML identified the right combination of features to come up with a usable prediction model. We can further check each attribute's contribution and weight in model decision-making using the following command:

select explain_model ('model_abalone_ring_prediction');

The following is the outcome, where each attribute weightage is representative of its role in model decision-making:

{"explanations":
    {"kernel_shap":
      {"label0":
        {"expected_value":10.06938362121582,
          "global_shap_values":
          {
            "diameter":0.6897614190439705,
            "height":0.38391156156643987,
            "length":0.29646334630067408,
            "sex":0.5516722137411109,
            "shell":1.5679368167990147,
            "shucked":2.222549468867254,
            "viscera":0.2879883139361144,
            "whole":0.8085603201751219
            }
        }
       }
      }
      ,"version":"1.0"
    }

Use case 2: Multi-class classification

For this use case, we use the Covertype dataset (copyright Jock A. Blackard and Colorado State University), which contains information collected by the US Geological Survey and the US Forest Service about wilderness areas in northern Colorado. This has been downloaded to an S3 bucket to make it simple to create the model. You may want to download the dataset description. This dataset contains various measurements such as elevation, distance to waters and roadways, as well as the wilderness area designation and the soil type. Our ML task is to create a model to predict the cover type for a given area.

Prepare the data

To prepare the data for this model, you need to create and populate the table public.covertype_data in Amazon Redshift using the Covertype dataset. You may use the following SQL in Amazon Redshift query editor v2 or your preferred SQL tool:

CREATE TABLE public.covertype_data (
    elevation bigint ENCODE az64,
    aspect bigint ENCODE az64,
    slope bigint ENCODE az64,
    horizontal_distance_to_hydrology bigint ENCODE az64,
    vertical_distance_to_hydrology bigint ENCODE az64,
    horizontal_distance_to_roadways bigint ENCODE az64,
    hillshade_9am bigint ENCODE az64,
    hillshade_noon bigint ENCODE az64,
    hillshade_3pm bigint ENCODE az64,
    horizontal_distance_to_fire_points bigint ENCODE az64,
    wilderness_area1 bigint ENCODE az64,
    wilderness_area2 bigint ENCODE az64,
    wilderness_area3 bigint ENCODE az64,
    wilderness_area4 bigint ENCODE az64,
    soil_type1 bigint ENCODE az64,
    soil_type2 bigint ENCODE az64,
    soil_type3 bigint ENCODE az64,
    soil_type4 bigint ENCODE az64,
    soil_type5 bigint ENCODE az64,
    soil_type6 bigint ENCODE az64,
    soil_type7 bigint ENCODE az64,
    soil_type8 bigint ENCODE az64,
    soil_type9 bigint ENCODE az64,
    soil_type10 bigint ENCODE az64,
    soil_type11 bigint ENCODE az64,
    soil_type12 bigint ENCODE az64,
    soil_type13 bigint ENCODE az64,
    soil_type14 bigint ENCODE az64,
    soil_type15 bigint ENCODE az64,
    soil_type16 bigint ENCODE az64,
    soil_type17 bigint ENCODE az64,
    soil_type18 bigint ENCODE az64,
    soil_type19 bigint ENCODE az64,
    soil_type20 bigint ENCODE az64,
    soil_type21 bigint ENCODE az64,
    soil_type22 bigint ENCODE az64,
    soil_type23 bigint ENCODE az64,
    soil_type24 bigint ENCODE az64,
    soil_type25 bigint ENCODE az64,
    soil_type26 bigint ENCODE az64,
    soil_type27 bigint ENCODE az64,
    soil_type28 bigint ENCODE az64,
    soil_type29 bigint ENCODE az64,
    soil_type30 bigint ENCODE az64,
    soil_type31 bigint ENCODE az64,
    soil_type32 bigint ENCODE az64,
    soil_type33 bigint ENCODE az64,
    soil_type34 bigint ENCODE az64,
    soil_type35 bigint ENCODE az64,
    soil_type36 bigint ENCODE az64,
    soil_type37 bigint ENCODE az64,
    soil_type38 bigint ENCODE az64,
    soil_type39 bigint ENCODE az64,
    soil_type40 bigint ENCODE az64,
    cover_type bigint ENCODE az64 
)
DISTSTYLE AUTO;	


Copy public.covertype_data
From 's3://redshift-ml-multiclass/covtype.data.gz'
iam_role default 
gzip
delimiter ','
region 'us-east-1';

Now that our dataset is loaded, we run the following SQL statements to split the data into three sets for training (80%), validation (10%), and prediction (10%). Note that Redshift ML Autopilot automatically splits the data into training and validation, but by splitting it here, you’re able to verify the accuracy of your model.

To prepare the dataset, assign random values to split the data:

create table  public.covertype_data_prep  as
select a.*,
cast (random() * 100 as int) as data_group_id
from public.covertype_data a;

Use the following code for the training set:

Create table public.covertype_training as
Select * from public.covertype_data_prep
Where data_group_id < 80;

Use the following code for the validation set:

Create table public.covertype_validation as
Select * from public.covertype_data_prep
Where data_group_id between 80 and 89;

Use the following code for the test set:

Create table public.covertype_test as
Select * from public.covertype_data_prep
Where data_group_id > 89; 
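
Before creating the model, you can optionally verify that the three tables hold roughly 80%, 10%, and 10% of the rows:

select
  (select count(*) from public.covertype_training)   as training_rows,
  (select count(*) from public.covertype_validation) as validation_rows,
  (select count(*) from public.covertype_test)       as test_rows;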

Now that we have our datasets, it’s time to create the model.

Create a model in Redshift ML using linear learner

Run the following SQL command to create your model—note our target is cover_type and we use all the inputs from our training set:

CREATE  MODEL forest_cover_type_model
FROM (select Elevation, 
   Aspect,  
   Slope,  
   Horizontal_distance_to_hydrology,  
   Vertical_distance_to_hydrology,   
   Horizontal_distance_to_roadways,  
   HIllshade_9am,  
   Hillshade_noon,  
   Hillshade_3pm , 
   Horizontal_Distance_To_Fire_Points,  
   Wilderness_Area1,  
   Wilderness_Area2,   
   Wilderness_Area3,  
   Wilderness_Area4,  
   soil_type1,
   Soil_Type2,
   Soil_Type3, 
   Soil_Type4, 
   Soil_Type5,  
   Soil_Type6, 
   Soil_Type7,  
   Soil_Type8, 
   Soil_Type9, 
   Soil_Type10 , 
   Soil_Type11,  
   Soil_Type12 , 
   Soil_Type13 ,
   Soil_Type14, 
   Soil_Type15, 
   Soil_Type16,  
   Soil_Type17, 
   Soil_Type18,  
   Soil_Type19,  
   Soil_Type20,  
   Soil_Type21,  
   Soil_Type22,  
   Soil_Type23,  
   Soil_Type24, 
   Soil_Type25,  
   Soil_Type26, 
   Soil_Type27,  
   Soil_Type28,  
   Soil_Type29,  
   Soil_Type30,  
   Soil_Type31, 
   Soil_Type32, 
   Soil_Type33,  
   Soil_Type34, 
   Soil_Type36, 
   Soil_Type37,  
   Soil_Type38,  
   Soil_Type39, 
   Soil_Type40,
  Cover_type from public.covertype_training)
TARGET cover_type
FUNCTION predict_cover_type
IAM_ROLE default
MODEL_TYPE LINEAR_LEARNER              
PROBLEM_TYPE MULTICLASS_CLASSIFICATION 
 OBJECTIVE 'Accuracy'
SETTINGS (
  S3_BUCKET '<your-amazon-s3-bucket-name>',
  S3_GARBAGE_COLLECT OFF,
  MAX_RUNTIME 9600
);

You can use the SHOW MODEL command to view the status of the model.
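
For example, run the following command and look for the READY state and the reported accuracy:

show model forest_cover_type_model;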

You can see that the model has an accuracy score of .730279 and is in the READY state. Now let’s run a query to do some validation of our own.

Model validation

Run the following SQL query against the validation table, using the function created by our model:

select 
 cast(sum(t1.match)as decimal(7,2)) as predicted_matches
,cast(sum(t1.nonmatch) as decimal(7,2)) as predicted_non_matches
,cast(sum(t1.match + t1.nonmatch) as decimal(7,2))  as total_predictions
,predicted_matches / total_predictions as pct_accuracy
from 
(select 
Elevation, 
   Aspect,  
   Slope,  
   Horizontal_distance_to_hydrology,  
   Vertical_distance_to_hydrology,   
   Horizontal_distance_to_roadways,  
   HIllshade_9am,  
   Hillshade_noon,  
   Hillshade_3pm , 
   Horizontal_Distance_To_Fire_Points,  
   Wilderness_Area1,  
   Wilderness_Area2,   
   Wilderness_Area3,  
   Wilderness_Area4,  
   soil_type1,
   Soil_Type2,
   Soil_Type3, 
   Soil_Type4, 
   Soil_Type5,  
   Soil_Type6, 
   Soil_Type7,  
   Soil_Type8, 
   Soil_Type9, 
   Soil_Type10 , 
   Soil_Type11,  
   Soil_Type12 , 
   Soil_Type13 ,
   Soil_Type14, 
   Soil_Type15, 
   Soil_Type16,  
   Soil_Type17, 
   Soil_Type18,  
   Soil_Type19,  
   Soil_Type20,  
   Soil_Type21,  
   Soil_Type22,  
   Soil_Type23,  
   Soil_Type24, 
   Soil_Type25,  
   Soil_Type26, 
   Soil_Type27,  
   Soil_Type28,  
   Soil_Type29,  
   Soil_Type30,  
   Soil_Type31, 
   Soil_Type32, 
   Soil_Type33,  
   Soil_Type34, 
   Soil_Type36, 
   Soil_Type37,  
   Soil_Type38,  
   Soil_Type39, 
   Soil_Type40,
   Cover_type as actual_cover_type,
   predict_cover_type( Elevation, 
   Aspect,  
   Slope,  
   Horizontal_distance_to_hydrology,  
   Vertical_distance_to_hydrology,   
   Horizontal_distance_to_roadways,  
   HIllshade_9am,  
   Hillshade_noon,  
   Hillshade_3pm , 
   Horizontal_Distance_To_Fire_Points,  
   Wilderness_Area1,  
   Wilderness_Area2,   
   Wilderness_Area3,  
   Wilderness_Area4,  
   soil_type1,
   Soil_Type2,
   Soil_Type3, 
   Soil_Type4, 
   Soil_Type5,  
   Soil_Type6, 
   Soil_Type7,  
   Soil_Type8, 
   Soil_Type9, 
   Soil_Type10 , 
   Soil_Type11,  
   Soil_Type12 , 
   Soil_Type13 ,
   Soil_Type14, 
   Soil_Type15, 
   Soil_Type16,  
   Soil_Type17, 
   Soil_Type18,  
   Soil_Type19,  
   Soil_Type20,  
   Soil_Type21,  
   Soil_Type22,  
   Soil_Type23,  
   Soil_Type24, 
   Soil_Type25,  
   Soil_Type26, 
   Soil_Type27,  
   Soil_Type28,  
   Soil_Type29,  
   Soil_Type30,  
   Soil_Type31, 
   Soil_Type32, 
   Soil_Type33,  
   Soil_Type34, 
   Soil_Type36, 
   Soil_Type37,  
   Soil_Type38,  
   Soil_Type39, 
   Soil_Type40) as predicted_cover_type,
case when actual_cover_type = predicted_cover_type then 1
      else 0 end as match,
  case when actual_cover_type <>  predicted_cover_type then 1
    else 0 end as nonmatch

from public.covertype_validation
) t1;

You can see that our accuracy is very close to our score from the SHOW MODEL output.

Run a prediction query

Let’s run a prediction query in Amazon Redshift ML using our function against our test dataset to see the most common class of cover type for the Neota Wilderness Area. We can identify this area by checking wilderness_area2 for a value of 1.

The dataset includes the following wilderness areas:

  1. Rawah Wilderness Area
  2. Neota Wilderness Area
  3. Comanche Peak Wilderness Area
  4. Cache la Poudre Wilderness Area

The cover types are in seven different classes:

  1. Spruce/Fir
  2. Lodgepole Pine
  3. Ponderosa Pine
  4. Cottonwood/Willow
  5. Aspen
  6. Douglas Fir
  7. Krummholz

There are also 40 different soil type definitions, which you can see in the dataset description, with a value of 0 or 1 to note if it’s applicable for a particular row. The following are a few example soil types:

  1. Cathedral family – Rock outcrop complex, extremely stony
  2. Vanet-Ratake families – Rocky outcrop complex, very stony
  3. Haploborolis family – Rock outcrop complex, rubbly
  4. Ratake family – Rock outcrop complex, rubbly
  5. Vanet family – Rock outcrop complex, rubbly
  6. Vanet-Wetmore families – Rock outcrop complex, stony

The following query returns the count of each predicted cover type in the Neota Wilderness Area:

select t1.predicted_cover_type, count(*)
from 
(     
select 
   Elevation, 
   Aspect,  
   Slope,  
   Horizontal_distance_to_hydrology,  
   Vertical_distance_to_hydrology,   
   Horizontal_distance_to_roadways,  
   HIllshade_9am,  
   Hillshade_noon,  
   Hillshade_3pm , 
   Horizontal_Distance_To_Fire_Points,  
   Wilderness_Area1,  
   Wilderness_Area2,   
   Wilderness_Area3,  
   Wilderness_Area4,  
   soil_type1,
   Soil_Type2,
   Soil_Type3, 
   Soil_Type4, 
   Soil_Type5,  
   Soil_Type6, 
   Soil_Type7,  
   Soil_Type8, 
   Soil_Type9, 
   Soil_Type10 , 
   Soil_Type11,  
   Soil_Type12 , 
   Soil_Type13 ,
   Soil_Type14, 
   Soil_Type15, 
   Soil_Type16,  
   Soil_Type17, 
   Soil_Type18,  
   Soil_Type19,  
   Soil_Type20,  
   Soil_Type21,  
   Soil_Type22,  
   Soil_Type23,  
   Soil_Type24, 
   Soil_Type25,  
   Soil_Type26, 
   Soil_Type27,  
   Soil_Type28,  
   Soil_Type29,  
   Soil_Type30,  
   Soil_Type31, 
   Soil_Type32, 
   Soil_Type33,  
   Soil_Type34, 
   Soil_Type36, 
   Soil_Type37,  
   Soil_Type38,  
   Soil_Type39, 
   Soil_Type40,
   predict_cover_type( Elevation, 
   Aspect,  
   Slope,  
   Horizontal_distance_to_hydrology,  
   Vertical_distance_to_hydrology,   
   Horizontal_distance_to_roadways,  
   HIllshade_9am,  
   Hillshade_noon,  
   Hillshade_3pm , 
   Horizontal_Distance_To_Fire_Points,  
   Wilderness_Area1,  
   Wilderness_Area2,   
   Wilderness_Area3,  
   Wilderness_Area4,  
   soil_type1,
   Soil_Type2,
   Soil_Type3, 
   Soil_Type4, 
   Soil_Type5,  
   Soil_Type6, 
   Soil_Type7,  
   Soil_Type8, 
   Soil_Type9, 
   Soil_Type10,
   Soil_Type11,  
   Soil_Type12, 
   Soil_Type13,
   Soil_Type14, 
   Soil_Type15, 
   Soil_Type16,  
   Soil_Type17, 
   Soil_Type18,  
   Soil_Type19,  
   Soil_Type20,  
   Soil_Type21,  
   Soil_Type22,  
   Soil_Type23,  
   Soil_Type24, 
   Soil_Type25,  
   Soil_Type26, 
   Soil_Type27,  
   Soil_Type28,  
   Soil_Type29,  
   Soil_Type30,  
   Soil_Type31, 
   Soil_Type32, 
   Soil_Type33,  
   Soil_Type34, 
   Soil_Type36, 
   Soil_Type37,  
   Soil_Type38,  
   Soil_Type39, 
   Soil_Type40) as predicted_cover_type 

from public.covertype_test
where wilderness_area2 = 1)
t1
group by 1;

Our model predicts that the majority of the cover in this area is Spruce/Fir.

You can experiment with various combinations, such as determining which soil types are most likely to occur in a predicted cover type.

Conclusion

Redshift ML makes it easy for users of all levels to create, train, and tune models using a SQL interface. In this post, we walked you through how to use the linear learner algorithm to create regression and multi-class classification models. You can then use those models to make predictions using simple SQL commands and gain valuable insights.

To learn more about Redshift ML, visit Amazon Redshift ML.


About the Authors

Phil Bates is a Senior Analytics Specialist Solutions Architect at AWS with over 25 years of data warehouse experience.

Sohaib Katariwala is an Analytics Specialist Solutions Architect at AWS. He has 12+ years of experience helping organizations derive insights from their data.

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Jiayuan Chen is a Senior Software Development Engineer at AWS. He is passionate about designing and building data-intensive applications, and has been working in the areas of data lake, query engine, ingestion, and analytics. He keeps up with latest technologies and innovates things that spark joy.

Debu Panda is a Senior Manager, Product Management at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences such as re:Invent, Oracle Open World, and Java One. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

Federate single sign-on access to Amazon Redshift query editor v2 with Okta

Post Syndicated from Sumeet Joshi original https://aws.amazon.com/blogs/big-data/federate-single-sign-on-access-to-amazon-redshift-query-editor-v2-with-okta/

Amazon Redshift query editor v2 is a web-based SQL client application that you can use to author and run queries on your Amazon Redshift data warehouse. You can visualize query results with charts and collaborate by sharing queries with members of your team. You can use query editor v2 to create databases, schemas, tables, and load data from Amazon Simple Storage Service (Amazon S3) using the COPY command or by using a wizard. You can browse multiple databases and run queries on your Amazon Redshift data warehouse or data lake, or run federated queries to operational databases such as Amazon Aurora. Because it’s a managed SQL editor in your browser and it’s integrated with your single sign-on (SSO) provider, Amazon Redshift query editor v2 reduces the number of steps to the first query so you gain insights faster.

Amazon Redshift query editor v2 integration with your identity provider (IdP) automatically redirects the user’s browser to the query editor v2 console instead of the Amazon Redshift console. This enables your users to easily access Amazon Redshift clusters through query editor v2 using federated credentials without managing database users and passwords.

In this post, we focus on Okta as the IdP and illustrate how to set up your Okta application and AWS Identity and Access Management (IAM) permissions. We also demonstrate how you can limit the access for your users to use only query editor v2 without giving them access to perform any admin functions on the AWS Management Console.

Solution overview

The high-level steps in this post are as follows:

  1. Set up Okta, which contains your users organized into logical groups, and the AWS Account Federation application.
  2. Set up two IAM roles: one that establishes a trust relationship between your IdP and AWS, and a second role that Okta uses to access Amazon Redshift.
  3. Complete Okta advanced configuration:
    1. Finalize the Okta configuration by inputting the roles that you just created.
    2. Define a default relay state to direct users to Amazon Redshift query editor v2 after successful SAML authentication.
    3. Configure the SAML PrincipalTagAttribute. This element allows you to pass attributes as session tags in the SAML assertion. For more information about session tags, see Passing session tags in AWS STS.
  4. Set up Amazon Redshift database groups:
    1. Create groups within the Amazon Redshift database to match the Okta groups.
    2. Authorize these groups to access certain schemas and tables.
    3. Access Amazon Redshift query editor v2 using your enterprise credentials and query your Amazon Redshift database.
  5. Sign in to your Okta account and access the application assigned to you. The application directs you to Amazon Redshift query editor v2 using federated login.
  6. Access and query your Amazon Redshift database.

Prerequisites

This post assumes that you have the following prerequisites:

Set up Okta

First, we set up the Okta application and create users and groups. Complete the following steps:

  1. Log in to your Okta admin console using the URL https://<prefix>-admin.okta.com/admin/dashboard, where <prefix> is specific to your account and was created at account setup.
  2. On the admin console, choose Admin.
  3. Under Directory in the navigation pane, choose People.
  4. To add users, choose Add person.
    The following screenshot shows the users that we created.
  5. To add groups into Okta, choose Groups in the navigation pane, then choose Add group.
    The following screenshot shows two groups that we created. We added Jane to analyst_users and Mike to bi_users.
  6. Under Applications in the navigation pane, choose Applications and choose Browse App Catalog.
  7. Search for AWS Account Federation and choose Add.
  8. After you add the application, choose AWS Account Federation.
  9. Leave the values in General Settings at their default and choose Next.
  10. Under Sign-On Options, select SAML 2.0.
  11. Choose the Identity Provider metadata link to download the metadata file in .xml format.

Configure IAM roles

Next, you set up an IAM role that establishes a trust relationship between the IdP and AWS. You also create an IAM role that Okta uses to access Amazon Redshift query editor v2.

  1. On the IAM console, under Access management in the navigation pane, choose Identity providers.
  2. Choose Add provider.
  3. For Provider type, select SAML.
  4. For Provider name, enter a name.
  5. Choose Choose file and upload the metadata file you downloaded.
  6. Choose Add provider.

    Now you create the IAM SAML 2.0 federation role.
  7. On the IAM console, choose Roles in the navigation pane.
  8. Choose Create role.
  9. For Trusted entity type, select SAML 2.0 federation.
  10. For SAML 2.0-based provider, choose the IdP you created in the previous step.
  11. Select Allow programmatic and AWS Management Console access.
  12. Choose Next and then choose Create Policy.

Amazon Redshift query editor v2 provides multiple managed policies to access the query editor. For a list of managed policies, refer to Configuring your AWS account. The managed policy enables you to limit the access for your users to use only query editor v2 without giving them access to perform any admin functions on the console. For this post, we use the AmazonRedshiftQueryEditorV2ReadSharing managed policy and create a custom policy.

In the following code, provide your Region, account, and cluster parameters to grant access to Amazon Redshift to get cluster credentials, create users, and allow users to join groups:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RedshiftClusterPermissions",
            "Effect": "Allow",
            "Action": [
                "redshift:GetClusterCredentials",
                "redshift:CreateClusterUser",
                "redshift:JoinGroup"
            ],
            "Resource": [
                "arn:aws:redshift:<region>:<account>:cluster:<cluster>,                 
                "arn:aws:redshift:<region>:<account>:dbuser:<cluster>/${aws:PrincipalTag/RedshiftDbUser}",
                "arn:aws:redshift:<region>:<account>:dbgroup:<cluster>/analyst_users",
                "arn:aws:redshift:<region>:<account>:dbgroup:<cluster>/bi_users",
                "arn:aws:redshift:<region>:<account>:dbname:<cluster>/${redshift:DBName}"
            ]
        }
    ]
}

The group membership lasts only for the duration of the user session. Additionally, there is no CreateGroup permission because groups need to be manually created and granted DB privileges.

  1. Attach the policy you created to the role.
    The following screenshot shows the summary page for the role.
  2. Modify the trust relationships for your role and add sts:TagSession permission.
    When using session tags, trust policies for all roles connected to the IdP passing tags must have the sts:TagSession permission. For roles without this permission in the trust policy, the AssumeRole operation fails.
  3. Choose Update policy.

Set up Okta advanced configuration

In this section, you finalize the Okta configuration by adding the IAM roles that you just created. You set up SAML PrincipalTag attributes such as RedshiftDbUser and RedshiftDbGroups, which are passed in the SAML assertion to federate the access to Amazon Redshift query editor v2. You also define a default relay state, which is the URL that users are directed to after a successful authentication through SAML.

  1. In your Okta account, open the AWS Account Federation app.
  2. On the Sign On tab, set Default Relay State to the query editor URL, using the format https://<region>.console.aws.amazon.com/sqlworkbench/home. For this post, we use https://eu-west-1.console.aws.amazon.com/sqlworkbench/home.
  3. Choose Attributes and set up the following attribute mappings:
    1. Set the DB user using PrincipalTag:RedshiftDbUser. This uses the user name in the directory. This is a required tag and defines the database user that is used by query editor v2.
    2. Set the DB groups using PrincipalTag:RedshiftDbGroups. This uses the Okta groups to fill the principal tags. Its value must be a colon-separated list.
    3. Set the transitive keys using TransitiveTagKeys. This prevents users from changing the session tags in case of role chaining.

These tags are forwarded to the redshift:GetClusterCredentials API to get credentials for your cluster. The following table summarizes their attribute configuration.

Name Name Format Value
https://aws.amazon.com/SAML/Attributes/PrincipalTag:RedshiftDbUser Unspecified user.username
https://aws.amazon.com/SAML/Attributes/PrincipalTag:RedshiftDbGroups Unspecified String.join(":", isMemberOfGroupName("analyst_users") ? 'analyst_users' : '', isMemberOfGroupName("bi_users") ? 'bi_users' : '')
https://aws.amazon.com/SAML/Attributes/TransitiveTagKeys Unspecified Arrays.flatten("RedshiftDbUser", "RedshiftDbGroups")

  1. Under Advanced Sign-on Settings, select Use Group Mapping.
  2. Enter the IdP and IAM role ARNs, which are globally unique, and make sure that Okta is directed to your AWS account.
  3. Authorize users to use the AWS Account Federation application by selecting their respective groups or individual user accounts. In this example, we authorized users by group.

Set up Amazon Redshift database groups

Next, you set up groups in the Amazon Redshift database to match the Okta groups. You also authorize these groups to access certain schemas and tables.

  1. Log in to your Amazon Redshift cluster with an admin account.
  2. Create groups that match the IdP group names, and grant the appropriate permissions to tables and schemas:
CREATE GROUP analyst_users;
CREATE GROUP bi_users;

ALTER DEFAULT PRIVILEGES IN SCHEMA sales_analysis
GRANT SELECT on TABLES to GROUP analyst_users;
GRANT USAGE on SCHEMA sales_analysis to GROUP analyst_users;
GRANT SELECT on ALL TABLES in SCHEMA sales_analysis to GROUP analyst_users;

ALTER DEFAULT PRIVILEGES IN SCHEMA sales_bi
GRANT SELECT on TABLES to GROUP bi_users;
GRANT USAGE on SCHEMA sales_bi to GROUP bi_users;
GRANT SELECT on ALL TABLES in SCHEMA sales_bi to GROUP bi_users;

In Okta, you created the user Jane and assigned Jane to the analyst_users group.

In the Amazon Redshift database, you created two database groups: analyst_users and bi_users.

When user Jane logs in via federated authentication to Amazon Redshift using query editor v2, the user is created if it doesn’t already exist and the analyst_users database group is assumed. The user Jane can query tables only in sales_analysis schema.

Because user Jane isn’t part of the bi_users group, when they try to access the sales_bi schema, they get a permission denied error.

The following diagram illustrates this configuration.

Access Amazon Redshift query editor v2

Now you’re ready to connect to your Amazon Redshift cluster with Amazon Redshift query editor v2 using federated login. Log in to your Okta account with your user credentials, and under My Apps, choose Amazon Redshift Query Editor V2.

You’re redirected to the Amazon Redshift query editor v2 URL that you specified as the default relay state.

Connect to an Amazon Redshift database and run queries

Now let’s set up the connection to your Amazon Redshift cluster.

  1. In the query editor, choose your cluster (right-click) and choose Create connection.
  2. For Database, enter a name.
  3. For Authentication, select Federated user.
    The user name is prepopulated with your federated login information.
  4. Choose Create connection.

When you’re connected to your Amazon Redshift database, you can verify the connection details, as shown in the following screenshot. Notice the session-level group association as per the group assignment in your Okta application configuration. In this case, user Jane is assigned to the analyst_users group.

This user has access to SELECT all tables in the sales_analysis schema and no access to the sales_bi schema. You can run the following statements to test your access.
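
For example (the sales_bi table name here is illustrative):

-- Succeeds: the analyst_users group has SELECT on this schema
select * from sales_analysis.store_sales_us limit 10;

-- Fails with a permission denied error: Jane is not in bi_users
select * from sales_bi.store_sales_summary limit 10;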

The following screenshot shows the results from a query to the sales_analysis.store_sales_us table.

When user Jane tries to access the tables in the sales_bi schema, they get a permission denied error.

Summary

In this post, we demonstrated how to federate SSO access to Amazon Redshift query editor v2 using Okta as your IdP. We showed how to set up Okta, set different PrincipalTag attributes for query editor v2, and pass group memberships defined in your Okta IdP to your Amazon Redshift cluster. We showed how to log in to Amazon Redshift query editor v2 using federated login and validate the configuration by running a few queries. This solution allows you to control access to Amazon Redshift database objects, and your users can easily access Amazon Redshift clusters through query editor v2 using federated credentials without managing database users and passwords.

If you have any feedback or questions, please leave them in the comments.


About the Authors

Sumeet Joshi is an Analytics Specialist Solutions Architect based out of New York. He specializes in building large-scale data warehousing solutions. He has over 16 years of experience in the data warehousing and analytical space.

Bhanu Pittampally is an Analytics Specialist Solutions Architect based out of Dallas. He specializes in building analytical solutions. His background is in data and analytics for over 14 years. His LinkedIn profile can be found here.

Erol Murtezaoglu, a Technical Product Manager at AWS, is an inquisitive and enthusiastic thinker with a drive for self-improvement and learning. He has a strong and proven technical background in software development and architecture, balanced with a drive to deliver commercially successful products. Erol highly values the process of understanding customer needs and problems, in order to deliver solutions that exceed expectations.

Yanis Telaoumaten is a Software Development Engineer at AWS. His passions are building reliable software and creating tools to allow other engineers to work more efficiently. In past years, he has worked on the identity, security, and reliability of Amazon Redshift services.