Tag Archives: Amazon Redshift

Amazon RDS for MySQL zero-ETL integration with Amazon Redshift, now generally available, enables near real-time analytics

Post Syndicated from Matheus Guimaraes original https://aws.amazon.com/blogs/aws/amazon-rds-for-mysql-zero-etl-integration-with-amazon-redshift-now-generally-available-enables-near-real-time-analytics/

Zero-ETL integrations help unify your data across applications and data sources for holistic insights and help break down data silos. They provide a fully managed, no-code, near real-time solution for making petabytes of transactional data available in Amazon Redshift within seconds of data being written into Amazon Relational Database Service (Amazon RDS) for MySQL. This eliminates the need to create your own ETL jobs, simplifying data ingestion, reducing your operational overhead, and potentially lowering your overall data processing costs. Last year, we announced the general availability of zero-ETL integration with Amazon Redshift for Amazon Aurora MySQL-Compatible Edition, as well as the availability in preview of integrations for Aurora PostgreSQL-Compatible Edition, Amazon DynamoDB, and RDS for MySQL.

I am happy to announce that Amazon RDS for MySQL zero-ETL integration with Amazon Redshift is now generally available. This release also includes new features such as data filtering, support for multiple integrations, and the ability to configure zero-ETL integrations in your AWS CloudFormation templates.

In this post, I’ll show how you can get started with data filtering and how to consolidate your data across multiple databases and data warehouses. For a step-by-step walkthrough of setting up zero-ETL integrations, see this blog post, which describes how to set one up for Aurora MySQL-Compatible Edition; the experience is very similar.

Data filtering
Most companies, no matter the size, can benefit from adding filtering to their ETL jobs. A typical use case is to reduce data processing and storage costs by selecting only the subset of data needed to replicate from their production databases. Another is to exclude personally identifiable information (PII) from a report’s dataset. For example, a business in healthcare might want to exclude sensitive patient information when replicating data to build aggregate reports analyzing recent patient cases. Similarly, an e-commerce store may want to make customer spending patterns available to their marketing department, but exclude any identifying information. Conversely, there are certain cases when you might not want to use filtering, such as when making data available to fraud detection teams that need all the data in near real time to make inferences. These are just a few examples, so I encourage you to experiment and discover different use cases that might apply to your organization.

There are two ways to enable filtering in your zero-ETL integrations: when you first create the integration or by modifying an existing integration. Either way, you will find this option on the “Source” step of the zero-ETL creation wizard.

Interface for adding data filtering expressions to include or exclude databases or tables.

You apply filters by entering filter expressions that either include or exclude databases or tables from the dataset, in the format database*.table*. You can add multiple expressions, and they are evaluated in order from left to right.
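As a quick illustration (the database and table names here are hypothetical), an integration could be configured with one include expression and one exclude expression such as:

Include: demo_db*.orders*
Exclude: demo_db*.customer_private*

Together, these would replicate tables whose names start with orders from databases starting with demo_db, while keeping tables whose names start with customer_private out of the replicated dataset.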

If you’re modifying an existing integration, the new filtering rules take effect once you confirm your changes, and Amazon Redshift will drop tables that are no longer part of the filter.

If you want to dive deeper, I recommend reading this blog post, which goes in depth into how to set up data filters for Amazon Aurora zero-ETL integrations, since the steps and concepts are very similar.

Create multiple zero-ETL integrations from a single database
You can now also configure integrations from a single RDS for MySQL database to up to five Amazon Redshift data warehouses. The only requirement is that you must wait for the first integration to finish setting up successfully before adding others.

This allows you to share transactional data with different teams while providing them ownership over their own data warehouses for their specific use cases. For example, you can use this in conjunction with data filtering to fan out different sets of data to development, staging, and production Amazon Redshift clusters from the same Amazon RDS production database.

Another interesting scenario where this could be really useful is consolidating Amazon Redshift clusters by using zero-ETL to replicate to different warehouses. You could also use Amazon Redshift materialized views to explore your data, power your Amazon QuickSight dashboards, share data, run training jobs in Amazon SageMaker, and more.

Conclusion
RDS for MySQL zero-ETL integration with Amazon Redshift allows you to replicate data for near real-time analytics without needing to build and manage complex data pipelines. It is generally available today, with the ability to add filter expressions to include or exclude databases and tables from the replicated datasets. You can now also set up multiple integrations from the same source RDS for MySQL database to different Amazon Redshift warehouses, or create integrations from different sources to consolidate data into one data warehouse.

This zero-ETL integration is available for RDS for MySQL versions 8.0.32 and later, Amazon Redshift Serverless, and Amazon Redshift RA3 instance types in supported AWS Regions.

In addition to using the AWS Management Console, you can also set up a zero-ETL integration via the AWS Command Line Interface (AWS CLI) and by using an AWS SDK such as boto3, the official AWS SDK for Python.

See the documentation to learn more about working with zero-ETL integrations.

Matheus Guimaraes

Harness Zero Copy data sharing from Salesforce Data Cloud to Amazon Redshift for Unified Analytics – Part 2

Post Syndicated from Rajkumar Irudayaraj original https://aws.amazon.com/blogs/big-data/harness-zero-copy-data-sharing-from-salesforce-data-cloud-to-amazon-redshift-for-unified-analytics-part-2/

In the era of digital transformation and data-driven decision making, organizations must rapidly harness insights from their data to deliver exceptional customer experiences and gain competitive advantage. Salesforce and Amazon have collaborated to help customers unlock value from unified data and accelerate time to insights with bidirectional Zero Copy data sharing between Salesforce Data Cloud and Amazon Redshift.

In Part 1 of this series, we discussed how to configure data sharing between Salesforce Data Cloud and customers’ AWS accounts in the same AWS Region. In this post, we discuss the architecture and implementation details of cross-Region data sharing between Salesforce Data Cloud and customers’ AWS accounts.

Solution overview

Salesforce Data Cloud provides a point-and-click experience to share data with a customer’s AWS account. On the AWS Lake Formation console, you can accept the datashare, create the resource link, mount Salesforce Data Cloud objects as data catalog views, and grant permissions to query the live and unified data in Amazon Redshift. Cross-Region data sharing between Salesforce Data Cloud and a customer’s AWS accounts is supported for two deployment scenarios: Amazon Redshift Serverless and Redshift provisioned clusters (RA3).

Cross-Region data sharing with Redshift Serverless

The following architecture diagram depicts the steps for setting up a cross-Region datashare between a Data Cloud instance in US-WEST-2 and Redshift Serverless in US-EAST-1.

Cross-Region data sharing setup consists of the following steps:

  1. The Data Cloud admin identifies the objects to be shared and creates a Data Share in the Data Cloud instance provisioned in US-WEST-2.
  2. The Data Cloud admin links the Data Share with the Amazon Redshift Data Share target. This creates an AWS Glue Data Catalog view and a cross-account Lake Formation resource share using the AWS Resource Access Manager (RAM) with the customer’s AWS account in US-WEST-2.
  3. The customer’s Lake Formation admin accepts the datashare invitation in US-WEST-2 from the Lake Formation console and grants default (select and describe) permissions to an AWS Identity and Access Management (IAM) principal.
  4. The Lake Formation admin switches to US-EAST-1 and creates a resource link pointing to the shared database in the US-WEST-2 Region.
  5. The IAM principal can log in to the Amazon Redshift query editor in US-EAST-1 and create an external schema referencing the datashare resource link. The data can then be queried through these external tables.

Cross-Region data sharing with a Redshift provisioned cluster

Cross-Region data sharing across Salesforce Data Cloud and a Redshift provisioned cluster requires additional steps on top of the Serverless setup. Based on the Amazon Redshift Spectrum considerations, the provisioned cluster and the Amazon Simple Storage Service (Amazon S3) bucket must be in the same Region for Redshift external tables. The following architecture depicts a design pattern and steps to share data with Redshift provisioned clusters.

Steps 1–5 in the setup remain the same across Redshift Serverless and provisioned cluster cross-Region sharing. Encryption must be enabled on both Redshift Serverless and the provisioned cluster. Listed below are the additional steps:

  1. Create a table from the datashare data with the CREATE TABLE AS SELECT command.
  2. Create a datashare in Redshift Serverless and grant access to the Redshift provisioned cluster.
  3. Create a database in the Redshift provisioned cluster and grant access to the target IAM principals. The datashare is then ready for query.

With this solution, the new table needs to be refreshed periodically to get the latest data from the shared Data Cloud objects.

Considerations when using data sharing in Amazon Redshift

For a comprehensive list of considerations and limitations of data sharing, refer to Considerations when using data sharing in Amazon Redshift. Some of the important ones for Zero Copy data sharing include:

  • Data sharing is supported for all provisioned RA3 instance types (ra3.16xlarge, ra3.4xlarge, and ra3.xlplus) and Redshift Serverless. It isn’t supported for clusters with DC and DS node types.
  • For cross-account and cross-Region data sharing, both the producer and consumer clusters and serverless namespaces must be encrypted. However, they don’t need to share the same encryption key.
  • Data Catalog multi-engine views are generally available in commercial Regions where Lake Formation, the Data Catalog, Amazon Redshift, and Amazon Athena are available.
  • Cross-Region sharing is available in all Lake Formation supported AWS Regions.

Prerequisites

The prerequisites, which must be in place before proceeding with the setup, remain the same across same-Region and cross-Region data sharing.

Configure cross-Region data sharing

The steps to create a datashare, create a datashare target, link the datashare target to the datashare, and accept the datashare in Lake Formation remain the same across same-Region and cross-Region data sharing. Refer to Part 1 of this series to complete the setup.

Cross-Region data sharing with Redshift Serverless

If you’re using Redshift Serverless, complete the following steps:

  1. On the Lake Formation console, choose Databases in the navigation pane.
  2. Choose Create database.
  3. Under Database details, select Resource link.
  4. For Resource link name, enter a name for the resource link.
  5. For Shared database’s region, choose the Data Catalog view source Region.
  6. The Shared database and Shared database’s owner ID fields are populated automatically from the database metadata.
  7. Choose Create to complete the setup.

The resource link appears on the Databases page on the Lake Formation console, as shown in the following screenshot.

  1. Launch Redshift Query Editor v2 for the Redshift Serverless workgroup. The cross-Region data share tables are auto-mounted and appear under awsdatacatalog. To query them, create an external schema by running the following command, specifying the resource link as the Data Catalog database, the Redshift Serverless Region, and the AWS account ID.
    CREATE external SCHEMA cross_region_data_share --<<SCHEMA_NAME>>
    FROM DATA CATALOG DATABASE 'cross-region-data-share' --<<RESOURCE_LINK_NAME>>
    REGION 'us-east-1' --<TARGET_REGION>
    IAM_ROLE 'SESSION' CATALOG_ID '<<aws_account_id>>'; --<<REDSHIFT AWS ACCOUNT ID>>

  2. Refresh the schemas to view the external schema created in the dev database
  3. Run the show tables command to check the shared objects under the external database:
    SHOW TABLES FROM SCHEMA dev.cross_region_data_share --<<schema name>>

  4. Query the datashare as shown in the following screenshot.
    SELECT * FROM dev.cross_region_data_share.churn_modellingcsv_tableaus3_dlm; --<<change schema name & table name>>

Cross-Region data sharing with Redshift provisioned cluster

This section is a continuation of the previous section with additional steps needed for data sharing to work when the consumer is a provisioned Redshift cluster. Refer to Sharing data in Amazon Redshift and Sharing datashares for a deeper understanding of concepts and the implementation steps.

  1. Create a new schema and table in the Redshift Serverless in the consumer Region:
    CREATE SCHEMA customer360_data_share;
    CREATE TABLE customer360_data_share.customer_churn as
    SELECT * from dev.cross_region_data_share.churn_modellingcsv_tableaus3__dlm;

  2. Get the namespace for the Redshift Serverless (producer) and Redshift provisioned cluster (consumer) by running the following query in each cluster:
    select current_namespace

  3. Create a datashare in the Redshift Serverless (producer) and grant usage to the Redshift provisioned cluster (consumer). Set the datashare, schema, and table names to the appropriate values, and set the namespace to the consumer namespace.
    CREATE DATASHARE customer360_redshift_data_share;
    ALTER DATASHARE customer360_redshift_data_share ADD SCHEMA customer360_data_share;
    ALTER DATASHARE customer360_redshift_data_share ADD TABLE customer360_data_share.customer_churn; 
    GRANT USAGE ON DATASHARE customer360_redshift_data_share 
    TO NAMESPACE '5709a006-6ac3-4a0c-a609-d740640d3080'; --<<Data Share Consumer Namespace>>

  4. Log in as a superuser in the Redshift provisioned cluster, create a database from the datashare, and grant permissions. Refer to managing permissions for Amazon Redshift datashare for detailed guidance.

The datashare is now ready for query.

You can periodically refresh the table you created to get the latest data from the data cloud based on your business requirement.

Conclusion

Zero Copy data sharing between Salesforce Data Cloud and Amazon Redshift represents a significant advancement in how organizations can use their customer 360 data. By eliminating the need for data movement, this approach offers real-time insights, reduced costs, and enhanced security. As businesses continue to prioritize data-driven decision-making, Zero Copy data sharing will play a crucial role in unlocking the full potential of customer data across platforms.

This integration empowers organizations to break down data silos, accelerate analytics, and drive more agile, customer-centric strategies.


About the Authors

Rajkumar Irudayaraj is a Senior Product Director at Salesforce with over 20 years of experience in data platforms and services, with a passion for delivering data-powered experiences to customers.

Sriram Sethuraman is a Senior Manager in Salesforce Data Cloud product management. He has been building products for over 9 years using big data technologies. In his current role at Salesforce, Sriram works on Zero Copy integration with major data lake partners and helps customers deliver value with their data strategies.

Jason Berkowitz is a Senior Product Manager with AWS Lake Formation. He comes from a background in machine learning and data lake architectures. He helps customers become data-driven.

Ravi Bhattiprolu is a Senior Partner Solutions Architect at AWS. Ravi works with strategic ISV partners, Salesforce and Tableau, to deliver innovative and well-architected products and solutions that help joint customers achieve their business and technical objectives.

Avijit Goswami is a Principal Solutions Architect at AWS specialized in data and analytics. He supports AWS strategic customers in building high-performing, secure, and scalable data lake solutions on AWS using AWS managed services and open source solutions. Outside of his work, Avijit likes to travel, hike, watch sports, and listen to music.

Ife Stewart is a Principal Solutions Architect in the Strategic ISV segment at AWS. She has been engaged with Salesforce Data Cloud over the last 2 years to help build integrated customer experiences across Salesforce and AWS. Ife has over 10 years of experience in technology. She is an advocate for diversity and inclusion in the technology field.

Michael Chess is a Technical Product Manager at AWS Lake Formation. He focuses on improving data permissions across the data lake. He is passionate about enabling customers to build and optimize their data lakes to meet stringent security requirements.

Mike Patterson is a Senior Customer Solutions Manager in the Strategic ISV segment at AWS. He has partnered with Salesforce Data Cloud to align business objectives with innovative AWS solutions to achieve impactful customer experiences. In his spare time, he enjoys spending time with his family, sports, and outdoor activities.

How HPE Aruba Supply Chain optimized cost and performance by migrating to an AWS modern data architecture

Post Syndicated from Hardeep Randhawa original https://aws.amazon.com/blogs/big-data/how-hpe-aruba-supply-chain-optimized-cost-and-performance-by-migrating-to-an-aws-modern-data-architecture/

This blog post is co-written with Hardeep Randhawa and Abhay Kumar from HPE.

HPE Aruba Networking, formerly known as Aruba Networks, is a Santa Clara, California-based security and networking subsidiary of Hewlett Packard Enterprise (HPE). HPE Aruba Networking is the industry leader in wired, wireless, and network security solutions. Hewlett-Packard acquired Aruba Networks in 2015, making it a wireless networking subsidiary with a wide range of next-generation network access solutions.

Aruba offers networking hardware like access points, switches, routers, software, security devices, and Internet of Things (IoT) products. Their large inventory requires extensive supply chain management to source parts, make products, and distribute them globally. This complex process involves suppliers, logistics, quality control, and delivery.

This post describes how HPE Aruba automated their Supply Chain management pipeline, and re-architected and deployed their data solution by adopting a modern data architecture on AWS.

Challenges with the on-premises solution

As the demand surged with time, it was imperative that Aruba build a sophisticated and powerful supply chain solution that could help them scale operations, enhance visibility, improve predictability, elevate customer experience, and drive sustainability. To achieve their vision of a modern, scalable, resilient, secure, and cost-efficient architecture, they chose AWS as their trusted partner due to the range of low-cost, scalable, and reliable cloud services they offer.

Through a commitment to cutting-edge technologies and a relentless pursuit of quality, HPE Aruba designed this next-generation solution as a cloud-based cross-functional supply chain workflow and analytics tool. The application supports custom workflows to allow demand and supply planning teams to collaborate, plan, source, and fulfill customer orders, then track fulfillment metrics via persona-based operational and management reports and dashboards. This also includes building an industry standard integrated data repository as a single source of truth, operational reporting through real time metrics, data quality monitoring, 24/7 helpdesk, and revenue forecasting through financial projections and supply availability projections. Overall, this new solution has empowered HPE teams with persona-based access to 10 full-scale business intelligence (BI) dashboards and over 350 report views across demand and supply planning, inventory and order management, SKU dashboards, deal management, case management, backlog views, and big deal trackers.

Overview of the solution

This post describes how HPE Aruba automated their supply chain management pipeline, starting from data migration from varied data sources into a centralized Amazon Simple Storage Service (Amazon S3) based storage to building their data warehouse on Amazon Redshift with the publication layer built on a third-party BI tool and user interface using ReactJS.

The following diagram illustrates the solution architecture.


In the following sections, we go through the key components in the diagram in more detail:

  1. Source systems
  2. Data migration
  3. Regional distribution
  4. Orchestration
  5. File processing
  6. Data quality checks
  7. Archiving processed files
  8. Copying to Amazon Redshift
  9. Running stored procedures
  10. UI integration
  11. Code Deployment
  12. Security & Encryption
  13. Data Consumption
  14. Final Steps

1. Source systems

Aruba’s source repository includes data from three different operating regions in AMER, EMEA, and APJ, along with one worldwide (WW) data pipeline from varied sources like SAP S/4 HANA, Salesforce, Enterprise Data Warehouse (EDW), Enterprise Analytics Platform (EAP), SharePoint, and more. The data sources include 150+ files, including 10-15 mandatory files per region, ingested in various formats like xlsx, csv, and dat. Aruba’s data governance guidelines required that they use a single centralized tool that could securely and cost-effectively review all source files with multiple formats, sizes, and ingestion times for compliance before exporting them out of the HPE environment. To achieve this, Aruba first copied the respective files to a centralized on-premises staging layer.

2. Data migration

Aruba chose AWS Transfer Family for SFTP for secure and efficient file transfers from an on-premises staging layer to an Amazon S3 based landing zone. AWS Transfer Family seamlessly integrates with other AWS services, automates transfer, and makes sure data is protected with encryption and access controls. To prevent duplication issues and maintain data integrity, Aruba customized these data transfer jobs to make sure previous transfers are complete before copying the next set of files.

3. Regional distribution

On average, Aruba transfers approximately 100 files, with total size ranging from 1.5–2 GB into the landing zone daily. The data volume increases each Monday with the weekly file loads and at the beginning of each month with the monthly file loads. These files follow the same naming pattern, with a daily system-generated timestamp appended to each file name. Each file arrives as a pair with a tail metadata file in CSV format containing the size and name of the file. This metadata file is later used to read source file names during processing into the staging layer.

The source data contains files from three different operating Regions and one worldwide pipeline that needs to be processed per local time zones. Therefore, separating the files and running a distinct pipeline for each was necessary to decouple and enhance failure tolerance. To achieve this, Aruba used Amazon S3 Event Notifications. With each file uploaded to Amazon S3, an Amazon S3 PUT event invokes an AWS Lambda function that distributes the source and the metadata files Region-wise and loads them into the respective Regional landing zone S3 bucket. To map the file with the respective Region, this Lambda function uses Region-to-file mapping stored in a configuration table in Amazon Aurora PostgreSQL-Compatible Edition.
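The post doesn’t show the mapping itself; as a rough sketch (all names here are hypothetical), the Aurora PostgreSQL configuration table and the lookup the Lambda function performs for each incoming file might look like the following:

-- Hypothetical Region-to-file mapping table in Aurora PostgreSQL
CREATE TABLE file_region_mapping (
    file_pattern  VARCHAR(200) PRIMARY KEY,  -- SQL LIKE pattern, for example 'sap_orders%'
    region_code   VARCHAR(10)  NOT NULL,     -- AMER, EMEA, APJ, or WW
    target_bucket VARCHAR(200) NOT NULL      -- Regional landing zone S3 bucket
);

-- Lookup performed for an incoming file name
SELECT region_code, target_bucket
FROM file_region_mapping
WHERE 'sap_orders_20240305.csv' LIKE file_pattern;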

4. Orchestration

The next requirement was to set up orchestration for the data pipeline to seamlessly implement the required logic on the source files to extract meaningful data. Aruba chose AWS Step Functions for orchestrating and automating their extract, transform, and load (ETL) processes to run on a fixed schedule. In addition, they use AWS Glue jobs for orchestrating validation jobs and moving data through the data warehouse.

They used Step Functions with Lambda and AWS Glue for automated orchestration to minimize the cloud solution deployment timeline by reusing the on-premises code base, where possible. The prior on-premises data pipeline was orchestrated using Python scripts. Therefore, integrating the existing scripts with Lambda within Step Functions and AWS Glue helped accelerate their deployment timeline on AWS.

5. File processing

With each pipeline running at 5:00 AM local time, the data is further validated, processed, and then moved to the processing zone folder in the same S3 bucket. Unsuccessful file validation results in the source files being moved to the reject zone S3 bucket directory. The following file validations are run by the Lambda functions invoked by the Step Functions workflow:

  • The Lambda function validates if the tail file is available with the corresponding source data file. When each complete file pair lands in the Regional landing zone, the Step Functions workflow considers the source file transfer as complete.
  • By reading the metadata file, the file validation function validates that the names and sizes of the files that land in the Regional landing zone S3 bucket match with the files on the HPE on-premises server.

6. Data quality checks

When the files land in the processing zone, the Step Functions workflow invokes another Lambda function that converts the raw files to CSV format followed by stringent data quality checks. The final validated CSV files are loaded into the temp raw zone S3 folder.

The data quality (DQ) checks are managed using DQ configurations stored in Aurora PostgreSQL tables. Some examples of DQ checks include duplicate data check, null value check, and date format check. The DQ processing is managed through AWS Glue jobs, which are invoked by Lambda functions from within the Step Functions workflow. A number of data processing logics are also integrated in the DQ flow, such as the following:

  • Flag-based deduplication – For specific files, when a flag managed in the Aurora configuration table is enabled, the process removes duplicate records before processing the data (see the sketch after this list)
  • Preset values replacing nulls – Similarly, NULL values in the source data are replaced with a preset value of 1 or 0, based on the value set in the configuration table
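Neither check’s implementation is shown in the post; as a simplified sketch (table, column, and key names are hypothetical), the flag lookup and a deduplication step could be expressed in SQL like this:

-- Hypothetical DQ configuration lookup in Aurora PostgreSQL
SELECT dedup_enabled, null_replacement_value
FROM dq_config
WHERE source_file = 'sap_orders';

-- If dedup_enabled is true, keep the latest row per business key before loading
CREATE TEMP TABLE orders_deduped AS
SELECT *
FROM (
    SELECT o.*,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY file_timestamp DESC) AS rn
    FROM orders_raw o
) ranked
WHERE rn = 1;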

7. Archiving processed files

When the CSV conversion is complete, the original raw files in the processing zone S3 folder are archived for 6 months in the archive zone S3 bucket folder. After 6 months, the files on AWS are deleted, with the original raw files retained in the HPE source system.

8. Copying to Amazon Redshift

When the data quality checks and data processing are complete, the data is loaded from the S3 temp raw zone into the curated zone on a Redshift provisioned cluster, using the COPY command.
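A minimal sketch of such a load (the table, S3 prefix, and IAM role below are placeholders, not Aruba’s actual names):

COPY curated.orders
FROM 's3://example-bucket/temp-raw/orders/'
IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-copy-role'
FORMAT AS CSV
IGNOREHEADER 1
TIMEFORMAT 'auto';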

9. Running stored procedures

From the curated zone, AWS Glue jobs orchestrate the Redshift stored procedures that load the data from the curated zone into the Redshift publish zone. The Redshift publish zone is a different set of tables in the same Redshift provisioned cluster. The Redshift stored procedures process and load the data into fact and dimension tables in a star schema.
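The stored procedure code isn’t included in the post; a simplified sketch of a publish-zone load (all object names are hypothetical) could look like the following:

CREATE OR REPLACE PROCEDURE publish.sp_load_fact_orders()
AS $$
BEGIN
    -- Full refresh for simplicity; an incremental MERGE is another option
    TRUNCATE TABLE publish.fact_orders;

    INSERT INTO publish.fact_orders (order_key, product_key, region_key, order_date, quantity, amount)
    SELECT o.order_id, p.product_key, r.region_key, o.order_date, o.quantity, o.amount
    FROM curated.orders o
    JOIN publish.dim_product p ON p.product_id = o.product_id
    JOIN publish.dim_region r ON r.region_code = o.region_code;
END;
$$ LANGUAGE plpgsql;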

10. UI integration

Amazon OpenSearch Service is also integrated with the flow for publishing mass notifications to the end-users through the user interface (UI). The users can also send messages and post updates via the UI with the OpenSearch Service integration.

11. Code Deployment

Aruba uses AWS CodeCommit and AWS CodePipeline to deploy and manage a bi-monthly code release cycle, the frequency for which can be increased on-demand as per deployment needs. The release happens across four environments – Development, Testing, UAT and Production – deployed through DevOps discipline, thus enabling shorter turnaround time to ever-changing user requirements and upstream data source changes.

12. Security & Encryption

User access to the Aruba SC360 portal is managed via SSO with MFA authentication and data security managed via direct integration of the AWS solution with HPE IT’s unified access management API. All the data pipelines between HPE on-premises sources and S3 are encrypted for enhanced security.

13. Data Consumption

The Aruba SC360 application provides a ‘Private Space’ feature to other BI/Analytics teams within HPE to run and manage their own data ingestion pipelines. This has been built using the Amazon Redshift data sharing feature, which has enabled Aruba to securely share access to live data in their Amazon Redshift cluster without manually moving or copying the data. As a result, HPE internal teams can build their own data workloads on core Aruba SC360 data while maintaining data security and code isolation.

14. Final Steps

The data is finally fetched into the publication layer, which consists of a ReactJS-based user interface accessing the data in the Amazon Redshift publish zone using Spring Boot REST APIs. Along with data from the Redshift data warehouse, notifications updated in the OpenSearch Service tables are also fetched and loaded into the UI. Amazon Aurora PostgreSQL is used to maintain the configuration values for populating the UI. To build BI dashboards, Aruba opted to continue using their existing third-party BI tool due to its familiarity among internal teams.

Conclusion

In this post, we showed you how HPE Aruba Supply Chain successfully re-architected and deployed their data solution by adopting a modern data architecture on AWS.

The new solution has helped Aruba integrate data from multiple sources, along with optimizing their cost, performance, and scalability. This has also allowed the Aruba Supply Chain leadership to receive in-depth and timely insights for better decision-making, thereby elevating the customer experience.

To learn more about the AWS services used to build modern data solutions on AWS, refer to the AWS public documentation and stay up to date through the AWS Big Data Blog.


About the authors

Hardeep Randhawa is a Senior Manager – Big Data & Analytics, Solution Architecture at HPE, recognized for stewarding enterprise-scale programs and deployments. He has led a recent Big Data EAP (Enterprise Analytics Platform) build with one of the largest global SAP HANA/S4 implementations at HPE.

Abhay Kumar is a Lead Data Engineer in Aruba Supply Chain Analytics and manages the Cloud Infrastructure for the Application at HPE. With 11+ years of experience in IT industry domains like banking and supply chain, Abhay has a strong background in Cloud Technologies, Data Analytics, Data Management, and Big Data systems. In his spare time, he likes reading, exploring new places, and watching movies.

Ritesh Chaman is a Senior Technical Account Manager at Amazon Web Services. With 14 years of experience in the IT industry, Ritesh has a strong background in Data Analytics, Data Management, Big Data systems and Machine Learning. In his spare time, he loves cooking, watching sci-fi movies, and playing sports.

Sushmita Barthakur is a Senior Solutions Architect at Amazon Web Services, supporting Enterprise customers architect their workloads on AWS. With a strong background in Data Analytics and Data Management, she has extensive experience helping customers architect and build Business Intelligence and Analytics Solutions, both on-premises and the cloud. Sushmita is based out of Tampa, FL and enjoys traveling, reading and playing tennis.

Evaluating sample Amazon Redshift data sharing architecture using Redshift Test Drive and advanced SQL analysis

Post Syndicated from Ayan Majumder original https://aws.amazon.com/blogs/big-data/evaluating-sample-amazon-redshift-data-sharing-architecture-using-redshift-test-drive-and-advanced-sql-analysis/

With the launch of Amazon Redshift Serverless and the various provisioned instance deployment options, customers are looking for tools that help them determine the most optimal data warehouse configuration to support their Amazon Redshift workloads.

Amazon Redshift is a widely used, fully managed, petabyte-scale data warehouse service. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads.

Redshift Test Drive is a tool hosted in a GitHub repository that lets customers evaluate which data warehouse configuration options are best suited for their workload. The Test Drive Workload Replicator utility consists of scripts that can be used to extract the workload queries from your source warehouse audit logs and replay them on a target warehouse you launched. The Test Drive Configuration Comparison utility automates this process by deploying target Amazon Redshift warehouses and orchestrating the replay of the source workload through a combination of AWS CloudFormation and AWS Step Functions.

Both utilities unload the performance metrics from the replay of the source workload on the target configuration(s) to Amazon Simple Storage Service (Amazon S3), which serves as the storage location for the metrics. Although the Replay Analysis UI and the Configuration Comparison utility can provide a preliminary performance comparison, many customers want to dig deeper by analyzing the raw data themselves.

This walkthrough illustrates an example workload replayed on both a single Amazon Redshift data warehouse and a data sharing architecture using the Workload Replicator utility; the output is then used to evaluate the performance of the workload.

Use case overview

For the sample use case, we assumed we have an existing 2 x ra3.4xlarge provisioned data warehouse that currently runs extract, transform, and load (ETL), ad hoc, and business intelligence (BI) queries. We’re interested in breaking these workloads apart using data sharing into a 32 base Redshift Processing Unit (RPU) Serverless producer running ETL and a 64 base RPU Serverless consumer running the BI workload. We used Workload Replicator to replay the workload on a replica baseline of the source and target data sharing configuration as specified in the tutorial. The following image shows the process flow.

Generating and accessing Test Drive metrics

The results of Amazon Redshift Test Drive can be accessed using an external schema for analysis of a replay. Refer to the Workload Replicator README and the Configuration Comparison README for more detailed instructions to execute a replay using the respective tool.

The external schema for analysis is automatically created with the Configuration Comparison utility, in which case you can proceed directly to the SQL analysis in the Deploy the QEv2 SQL Notebook and analyze workload section. If you use Workload Replicator, however, the external schema is not created automatically, and therefore needs to be configured as a prerequisite to the SQL analysis. We demonstrate in the following walkthrough how the external schema can be set up, using sample analysis of the Data Sharing use case.

Executing Test Drive Workload Replicator for data sharing

To execute Workload Replicator, use Amazon Elastic Compute Cloud (Amazon EC2) to run the automation scripts used to extract the workload from the source.

Configure Amazon Redshift Data Warehouse

  1. Create a snapshot following the guidance in the Amazon Redshift Management Guide.
  2. Enable audit logging following the guidance in the Amazon Redshift Management Guide.
  3. Enable user activity logging on the source cluster following the guidance in the Amazon Redshift Management Guide.

Enabling logging requires a change of the parameter group. Audit logging needs to be enabled prior to the workload that will be replayed because this is where the connections and SQL queries of the workload will be extracted from.

  1. Launch the baseline replica from the snapshot by restoring a 2 node ra3.4xlarge provisioned cluster from the snapshot.
  2. Launch the producer warehouse by restoring the snapshot to a 32 RPU serverless namespace.
  3. The consumer should not contain the schema and tables that will be shared from the producer. You can launch the 64 RPU Serverless consumer either from the snapshot and then drop the relevant objects, or you can create a new 64 RPU Serverless consumer warehouse and recreate consumer users.
  4. Create a datashare from the producer to the consumer and add the relevant objects.

Data share objects can be read using two mechanisms: using three-part notation (database.schema.table), or by creating an external schema pointing to a shared schema and querying that using two-part notation (external_schema.table). Because we want to seamlessly run the source workload, which uses two-part notation on the local objects, this post demonstrates the latter approach. For each schema shared from the producer, run the following command on the consumer:

CREATE EXTERNAL SCHEMA schema_name
FROM REDSHIFT DATABASE 'datashare_database_name' SCHEMA 'schema_name';

Make sure to use the same schema name as the source for the external schema. Also, if any queries are run on the public schema, drop the local public schema first before creating the external equivalent.

  1. Grant usage on the external schema to any relevant users, as shown in the following example.
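A minimal form of that grant (bi_user is a hypothetical consumer user):

GRANT USAGE ON SCHEMA schema_name TO bi_user;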

Configure Redshift Test Drive Workload Replicator

  1. Create an S3 bucket to store the artifacts required by the utility (such as the metrics, extracted workload, and output data from running UNLOAD commands).
  2. Launch the following three types of EC2 instances using the recommended configuration of m5.8xlarge, 32GB of SSD storage, and Amazon Linux AMI:
    1. Baseline instance
    2. Target-producer instance
    3. Target-consumer instance

Make sure you can connect to the EC2 instance to run the utility.

  1. For each instance, install the required libraries by completing the following steps from the GitHub repository:
    a. 2.i
    b. 2.ii (if an ODBC driver should be used—the default is the Amazon Redshift Python driver)
    c. 2.iii
    d. 2.iv
    e. 2.v
  2. Create an AWS Identity and Access Management (IAM) role for the EC2 instances to access the Amazon Redshift warehouses, to read from the S3 audit logging bucket, and with both read and write access to the new S3 bucket created for storing replay artifacts.
  3. If you are going to run COPY and UNLOAD commands, create an IAM role with access to the S3 buckets required, and attach it to the Amazon Redshift warehouses that will execute the load and unload.

In this example, the IAM role is attached to the baseline replica and producer warehouses because these will be executing the ETL processes. The utility will update UNLOAD commands to unload data to a bucket you define, which as a best practice should be the bucket created for S3 artifacts. Write permissions need to be granted to the Amazon Redshift warehouse for this location.

Run Redshift Test Drive Workload Replicator

  1. Run aws configure on the EC2 instances and populate the default Region with the Region the utility is being executed in.
  2. Extract only needs to be run once, so connect to the baseline EC2 instance and run vi config/extract.yaml to open the extract.yaml file and configure the extraction details (press i to begin editing, press Escape to leave edit mode, and type :wq! to save and quit vi). For more details on the parameters, see Configure parameters.

The following code is an example of a configured extract that unloads the logs for a half hour window to the Test Drive artifacts bucket and updates COPY commands to run with the POC Amazon Redshift role.

Configuration Extract File

  1. Run make extract to extract the workload. When completed, make note of the folder created at the path specified for the workload_location parameter in the extract (s3://testdriveartifacts/myworkload/Extraction_xxxx-xx-xxTxx:xx:xx.xxxxxx+00:00).
  4. On the same baseline EC2 instance that will run the full workload on the source replica, run vi config/replay.yaml and configure the details with the workload location copied in step 3 and the baseline warehouse endpoint. (See Configure parameters to run an extract job for additional details on the parameters. The values after the analysis_iam_role parameter can be left as the default.)

The following code is an example for the beginning of a replay configuration for the source replica.

Config Replay File

  1. On the EC2 instance that will run the target-producer workload, run vi config/replay.yaml. Configure the details with the workload location copied in step 3, the producer warehouse endpoint, and other configuration as in step 4. To replay only the producer workload, add the appropriate users to include or exclude in the filters parameter.

The following code is an example of the filters used to exclude the BI workload from the producer.

Producer Configuration

  1. On the EC2 instance that will run the target-consumer workload, run vi config/replay.yaml and configure the details with the workload location copied in step 3, the consumer warehouse endpoint, and appropriate filters as in step 5. The same users that were excluded in the producer workload replay should be included in the consumer workload replay.

The following is an example of the filters used to only run the BI workload from the consumer.

Consumer Configuration

  1. Run make replay on the baseline instance, target-producer instance, and target-consumer instance simultaneously to run the workload on the target warehouses.

Analyze the Workload Replicator output

  1. Create the folder structure in the S3 bucket that was created in the previous step.
'{comparison_stats_s3_path}/{what_if_timestamp}/{cluster_identifier}/'

For comparison_stats_s3_path, enter the S3 bucket and path name. For what_if_timestamp, enter the replay start time. For cluster_identifier, enter the target cluster name for easy identification.

The following screenshot shows an example of the resulting folder structure in Amazon S3.

Configuration of S3

  1. Use the following script to unload system table data for each cluster to the corresponding Amazon S3 target path created previously. Run the script in each warehouse (baseline, producer, and consumer) using QEv2.
UNLOAD ($$
SELECT a.*,Trim(u.usename) as username FROM sys_query_history a , pg_user u
WHERE a.user_id = u.usesysid
and a.start_time > to_timestamp('{what_if_timestamp}','YYYY-MM-DD-HH24-MI-SS')
$$) TO '{comparison_stats_s3_path}/{what_if_timestamp}/{cluster_identifier}/'
FORMAT AS PARQUET PARALLEL OFF ALLOWOVERWRITE 
IAM_ROLE '{redshift_iam_role}';

For what_if_timestamp, enter the replay start time. For comparison_stats_s3_path, enter the S3 bucket and path name. For cluster_identifier, enter the target cluster name for easy identification. For redshift_iam_role, enter the Amazon Resource Name (ARN) of the Redshift IAM role for the target cluster.

Unload File

  1. Create an external schema in Amazon Redshift with the name comparison_stats.
CREATE EXTERNAL SCHEMA comparison_stats from DATA CATALOG
DATABASE 'redshift_config_comparison'
IAM_ROLE '{redshift-iam-role}'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
  1. Create an external table in Amazon Redshift with the name redshift_config_comparison_aggregate based on the Amazon S3 file location.
CREATE EXTERNAL TABLE comparison_stats.redshift_config_comparison_aggregate
(user_id int,
query_id bigint,
query_label VARCHAR,
transaction_id bigint,
session_id int,
database_name VARCHAR,
query_type VARCHAR,
status VARCHAR,
result_cache_hit boolean,
start_time timestamp,
end_time timestamp,
elapsed_time bigint,
queue_time bigint,
execution_time bigint,
error_message VARCHAR,
returned_rows bigint,
returned_bytes bigint,
query_text VARCHAR,
redshift_version VARCHAR,
usage_limit VARCHAR,
compute_type VARCHAR,
compile_time bigint,
planning_time bigint,
lock_wait_time bigint,
username VARCHAR)
PARTITIONED BY (cluster_identifier VARCHAR)
STORED AS PARQUET
LOCATION '{comparison_stats_s3_path}/{what_if_timestamp}';
  1. After creating a partitioned table, alter the table using the following statement to register partitions to the external catalog.

When you add a partition, you define the location of the subfolder on Amazon S3 that contains the partition data. Run that statement for each cluster identifier.

ALTER TABLE comparison_stats.redshift_config_comparison_aggregate
ADD PARTITION (cluster_identifier='{cluster_identifier}')
LOCATION '{comparison_stats_s3_path}/{what_if_timestamp}/{cluster_identifier}/';

Example:

ALTER TABLE comparison_stats.redshift_config_comparison_aggregate
ADD PARTITION (cluster_identifier='baseline-ra3-4xlarge-2')
LOCATION 's3://workload-replicator-redshift/2024-03-05-21-00-00/baseline-ra3-4xlarge-2/';
ALTER TABLE comparison_stats.redshift_config_comparison_aggregate
ADD PARTITION (cluster_identifier='producer-serverless32RPU')
LOCATION 's3://workload-replicator-redshift/2024-03-05-21-00-00/producer-serverless32RPU/';
ALTER TABLE comparison_stats.redshift_config_comparison_aggregate
ADD PARTITION (cluster_identifier='consumer-serverless64RPU')
LOCATION 's3://workload-replicator-redshift/2024-03-05-21-00-00/consumer-serverless64RPU/';

Deploy the QEv2 SQL Notebook and analyze workload

In this section, we analyze the queries that were replayed in both the baseline and target clusters. We analyze the workload based on the common queries that are executed in the baseline and target clusters.

  1. Download the analysis notebook from Amazon S3.
  2. Import the notebook into the baseline Redshift clusters using QEv2. For guidance, refer to the Authoring and running notebooks.
  3. Create the stored procedure common_queries_sp in the same database that was used to create the external schema.
  4. The stored procedure will create a view called common_queries by querying the external table redshift_config_comparison_aggregate that was created in previous steps.

The view will identify the queries common to both the baseline and target clusters as mentioned in the notebook.

  1. Execute the stored procedure by passing the cluster identifiers for the baseline and target clusters as parameters to the stored procedure.

For this post, we passed the baseline and producer cluster identifier as the parameters. Passing the cluster identifiers as parameters will retrieve the data only for those specific clusters.

Once the common_queries view is created, you can perform further analysis using subsequent queries that are available in the notebook. If you have more than one target cluster, you can follow the same analysis process for each one. For this post, we have two target clusters: producer and consumer. We first performed the analysis between the baseline and producer clusters, then repeated the same process to analyze the data for the baseline versus consumer clusters.

To analyze our workload, we will use the sys_query_history view. We frequently use several columns from this view, including the following:

  • elapsed_time: The end-to-end time of the query run
  • execution_time: The time the query spent running. In the case of a SELECT query, this also includes the return time.
  • compile_time: The time the query spent compiling

For more information on sys_query_history, refer to SYS_QUERY_HISTORY in the Amazon Redshift Database Developer Guide. The following list describes the analysis queries (an example query follows the list).

  1. Overall workload by user – Count of common queries between baseline and target clusters, based on user
  2. Overall workload by query type – Count of common queries between baseline and target clusters, based on query type
  3. Overall workload comparison (in seconds) – Compares the overall workload between the baseline and target clusters by analyzing the execution time, compile time, and elapsed time
  4. Percentile workload comparison – The percentage of queries that perform at or below a given runtime (for example, p50_s having the value of 5 seconds means 50% of queries in that workload were 5 seconds or faster)
  5. Number of improved/degraded/unchanged queries – The number of queries that degraded, stayed the same, or improved when comparing the elapsed time between the baseline and target clusters
  6. Degree of query-level performance change (proportion) – The degree of change of each query from the baseline to the target, relative to the baseline performance
  7. Comparison by query type (in seconds) – Compares the elapsed time of different query types, such as SELECT, INSERT, and COPY commands, between the baseline cluster and target cluster
  8. Top 10 slowest running queries (in seconds) – The top 10 slowest queries between the baseline and target cluster, comparing the elapsed time of both clusters
  9. Top 10 improved queries (in seconds) – The top 10 queries with the most improved elapsed time when comparing the baseline cluster to the target cluster
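As an illustration of the style of these comparisons (this sketch is not taken from the notebook), an overall workload comparison can also be computed directly against the external table; sys_query_history reports times in microseconds, so the sums are converted to seconds:

SELECT cluster_identifier,
       SUM(elapsed_time)   / 1000000.0 AS total_elapsed_s,
       SUM(execution_time) / 1000000.0 AS total_execution_s,
       SUM(compile_time)   / 1000000.0 AS total_compile_s
FROM comparison_stats.redshift_config_comparison_aggregate
WHERE result_cache_hit = false
GROUP BY cluster_identifier
ORDER BY cluster_identifier;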

Sample Results analysis

In our example, the overall workload improvement for workload isolation architecture using data sharing for ETL workload between baseline and producer is 858 seconds (baseline_elapsed_time - target_elapsed_time) for the sample TPC data, as shown in the following screenshots.

Baseline vs Producer Analysis

Baseline vs Producer Analysis Graph

The overall workload improvement for workload isolation architecture using data sharing for BI workload between baseline and consumer is 1148 seconds (baseline_elapsed_time - target_elapsed_time) for sample TPC data, as shown in the following screenshots.

Baseline vs Consumer Analysis

Baseline vs Consumer Analysis Graph

Cleanup

Complete the following steps to clean up your resources:

  1. Delete the Redshift provisioned replica cluster and the two Redshift serverless endpoints (32 RPU and 64 RPU)
  2. Delete the S3 bucket used to store the artifacts
  3. Delete the baseline, target-producer, and target-consumer EC2 instances
  4. Delete the IAM role created for the EC2 instances to access Redshift clusters and S3 buckets
  5. Delete the IAM roles created for Amazon Redshift warehouses to access S3 buckets for COPY and UNLOAD commands

Conclusion

In this post, we walked you through the process of testing workload isolation architecture using Amazon Redshift Data Sharing and Test Drive utility. We demonstrated how you can use SQL for advanced price performance analysis and compare different workloads on different target Redshift cluster configurations. We encourage you to evaluate your Amazon Redshift data sharing architecture using the Redshift Test Drive tool. Use the provided SQL script to analyze the price-performance of your Amazon Redshift cluster.


About the Authors

Ayan Majumder is an Analytics Specialist Solutions Architect at AWS. His expertise lies in designing robust, scalable, and efficient cloud solutions for customers. Beyond his professional life, he derives joy from traveling, photography, and outdoor activities.

Ekta Ahuja is an Amazon Redshift Specialist Solutions Architect at AWS. She is passionate about helping customers build scalable and robust data and analytics solutions. Before AWS, she worked in several different data engineering and analytics roles. Outside of work, she enjoys landscape photography, traveling, and board games.

Julia Beck is an Analytics Specialist Solutions Architect at AWS. She is passionate about supporting customers in validating and optimizing analytics solutions by architecting proof of concept workloads designed to meet their specific needs.

Amazon Redshift data ingestion options

Post Syndicated from Steve Phillips original https://aws.amazon.com/blogs/big-data/amazon-redshift-data-ingestion-options/

Amazon Redshift, a fully managed data warehousing service, offers a variety of options for ingesting data from diverse sources into its high-performance, scalable environment. Whether your data resides in operational databases, data lakes, on-premises systems, Amazon Elastic Compute Cloud (Amazon EC2), or other AWS services, Amazon Redshift provides multiple ingestion methods to meet your specific needs.

This post explores each option (as illustrated in the following figure), determines which are suitable for different use cases, and discusses how and why to select a specific Amazon Redshift tool or feature for data ingestion.

A box indicating Amazon Redshift in the center of the image with boxes from right to left for Amazon RDS MySQL and PostgreSQL, Amazon Aurora MySQL and PostgreSQL, Amazon EMR, AWS Glue, Amazon S3 bucket, Amazon Managed Streaming for Apache Kafka and Amazon Kinesis. Each box has an arrow pointing to Amazon Redshift. Each arrow has the following labels: Amazon RDS & Amazon Aurora: zero-ETL and federated queries; AWS Glue and Amazon EMR: Spark connector; Amazon S3 bucket: COPY command; Amazon Managed Streaming for Apache Kafka and Amazon Kinesis: Redshift streaming ingestion. Amazon Data Firehose has an arrow pointing to Amazon S3 bucket indicating the data flow direction.

Amazon Redshift COPY command

The Redshift COPY command, a simple low-code data ingestion tool, loads data into Amazon Redshift from Amazon S3, DynamoDB, Amazon EMR, and remote hosts over SSH. It’s a fast and efficient way to load large datasets into Amazon Redshift. It uses massively parallel processing (MPP) architecture in Amazon Redshift to read and load large amounts of data in parallel from files or data from supported data sources. This allows you to utilize parallel processing by splitting data into multiple files, especially when the files are compressed.

Recommended use cases for the COPY command include loading large datasets and data from supported data sources. COPY automatically splits large uncompressed delimited text files into smaller scan ranges to utilize the parallelism of Amazon Redshift provisioned clusters and serverless workgroups. Auto-copy extends the COPY command by adding jobs that automatically ingest data as new files arrive.
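As a simple illustration (the table, bucket, and IAM role here are placeholders), a COPY from Parquet files in Amazon S3 looks like the following:

COPY sales
FROM 's3://example-bucket/sales/2024/'
IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-load-role'
FORMAT AS PARQUET;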

COPY command advantages:

  • Performance – Efficiently loads large datasets from Amazon S3 or other sources in parallel with optimized throughput
  • Simplicity – Straightforward and user-friendly, requiring minimal setup
  • Cost-optimized – Uses Amazon Redshift MPP at a lower cost by reducing data transfer time
  • Flexibility – Supports file formats such as CSV, JSON, Parquet, ORC, and AVRO

Amazon Redshift federated queries

Amazon Redshift federated queries allow you to incorporate live data from Amazon RDS or Aurora operational databases as part of business intelligence (BI) and reporting applications.

Federated queries are useful for use cases where organizations want to combine data from their operational systems with data stored in Amazon Redshift. Federated queries allow querying data across Amazon RDS for MySQL and PostgreSQL data sources without the need for extract, transform, and load (ETL) pipelines. If storing operational data in a data warehouse is a requirement, synchronization of tables between operational data stores and Amazon Redshift tables is supported. In scenarios where data transformation is required, you can use Redshift stored procedures to modify data in Redshift tables.
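For example, a federated query setup starts with an external schema that points to the operational database (the endpoint, secret, and IAM role below are placeholders), after which its tables can be joined with local Amazon Redshift tables:

CREATE EXTERNAL SCHEMA postgres_federated
FROM POSTGRES
DATABASE 'ordersdb' SCHEMA 'public'
URI 'example-instance.abc123xyz.us-east-1.rds.amazonaws.com' PORT 5432
IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-federated-role'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:111122223333:secret:rds-credentials-AbCdEf';

SELECT o.order_id, o.status, c.customer_segment
FROM postgres_federated.orders o
JOIN dim_customer c ON c.customer_id = o.customer_id;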

Federated queries key features:

  • Real-time access – Enables querying of live data across discrete sources, such as Amazon RDS and Aurora, without the need to move the data
  • Unified data view – Provides a single view of data across multiple databases, simplifying data analysis and reporting
  • Cost savings – Eliminates the need for ETL processes to move data into Amazon Redshift, saving on storage and compute costs
  • Flexibility – Supports Amazon RDS and Aurora data sources, offering flexibility in accessing and analyzing distributed data

Amazon Redshift Zero-ETL integration

Aurora zero-ETL integration with Amazon Redshift allows access to operational data from Amazon Aurora MySQL-Compatible Edition (and, in preview, Amazon Aurora PostgreSQL-Compatible Edition and Amazon RDS for MySQL) and from DynamoDB in Amazon Redshift, without the need for ETL, in near real time. You can use zero-ETL to simplify ingestion pipelines for performing change data capture (CDC) from an Aurora database to Amazon Redshift. Built on the integration of the Amazon Redshift and Aurora storage layers, zero-ETL boasts simple setup, data filtering, automated observability, auto-recovery, and integration with either Amazon Redshift provisioned clusters or Amazon Redshift Serverless workgroups.
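On the Amazon Redshift side, after the integration is created from the source database, you create a destination database from it and then query the replicated tables like any other Redshift tables. A sketch for a MySQL source (the integration ID is a placeholder; PostgreSQL sources additionally specify the source database name):

CREATE DATABASE zeroetl_destination
FROM INTEGRATION 'a1b2c3d4-5678-90ab-cdef-example11111';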

Zero-ETL integration benefits:

  • Seamless integration – Automatically integrates and synchronizes data between operational databases and Amazon Redshift without the need for custom ETL processes
  • Near real-time insights – Provides near real-time data updates, so the most current data is available for analysis
  • Ease of use – Simplifies data architecture by eliminating the need for separate ETL tools and processes
  • Efficiency – Minimizes data latency and provides data consistency across systems, enhancing overall data accuracy and reliability
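After an integration is created, you can verify it from the target data warehouse by querying the system views (the same views used in the walkthrough later in this post). A minimal sketch:

-- List zero-ETL integrations visible from this data warehouse and their current state
SELECT * FROM svv_integration;

-- Review recent replication activity for each integration
SELECT * FROM sys_integration_activity;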

Amazon Redshift integration for Apache Spark

The Amazon Redshift integration for Apache Spark, automatically included through Amazon EMR or AWS Glue, provides performance and security optimizations when compared to the community-provided connector. The integration enhances and simplifies security with AWS Identity and Access Management (IAM) authentication support. AWS Glue 4.0 provides a visual ETL tool for authoring jobs to read from and write to Amazon Redshift, using the Redshift Spark connector for connectivity. This simplifies the process of building ETL pipelines to Amazon Redshift. The Spark connector allows use of Spark applications to process and transform data before loading into Amazon Redshift. The integration minimizes the manual process of setting up a Spark connector and shortens the time needed to prepare for analytics and machine learning (ML) tasks. It allows you to specify the connection to a data warehouse and start working with Amazon Redshift data from your Apache Spark-based applications within minutes.

The integration provides pushdown capabilities for sort, aggregate, limit, join, and scalar function operations to optimize performance by moving only the relevant data from Amazon Redshift to the consuming Apache Spark application. Spark jobs are suitable for data processing pipelines and when you need to use Spark’s advanced data transformation capabilities.

With the Amazon Redshift integration for Apache Spark, you can simplify the building of ETL pipelines with data transformation requirements. It offers the following benefits:

  • High performance – Uses the distributed computing power of Apache Spark for large-scale data processing and analysis
  • Scalability – Effortlessly scales to handle massive datasets by distributing computation across multiple nodes
  • Flexibility – Supports a wide range of data sources and formats, providing versatility in data processing tasks
  • Interoperability – Seamlessly integrates with Amazon Redshift for efficient data transfer and queries

Amazon Redshift streaming ingestion

The key benefit of Amazon Redshift streaming ingestion is the ability to ingest hundreds of megabytes of data per second directly from streaming sources into Amazon Redshift with very low latency, supporting real-time analytics and insights. Supporting streams from Kinesis Data Streams, Amazon MSK, and Data Firehose, streaming ingestion requires no data staging, supports flexible schemas, and is configured with SQL. Streaming ingestion powers real-time dashboards and operational analytics by directly ingesting data into Amazon Redshift materialized views.
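As an illustration of the SQL-only setup, the following minimal sketch ingests from a Kinesis data stream; the schema name, IAM role, stream name, and payload contents are placeholders (the MSK variant is shown in the use case later in this post):

CREATE EXTERNAL SCHEMA kinesis_schema
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRoleForKinesis';

CREATE MATERIALIZED VIEW clickstream_view
AUTO REFRESH YES
AS SELECT
    approximate_arrival_timestamp,
    partition_key,
    shard_id,
    sequence_number,
    JSON_PARSE(kinesis_data) AS payload
FROM kinesis_schema."my-clickstream-stream";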

Amazon Redshift streaming ingestion unlocks near real-time streaming analytics with:

  • Low latency – Ingests streaming data in near real time, making streaming ingestion ideal for time-sensitive applications such as Internet of Things (IoT), financial transactions, and clickstream analysis
  • Scalability – Manages high throughput and large volumes of streaming data from sources such as Kinesis Data Streams, Amazon MSK, and Data Firehose
  • Integration – Integrates with other AWS services to build end-to-end streaming data pipelines
  • Continuous updates – Keeps data in Amazon Redshift continuously updated with the latest information from the data streams

Amazon Redshift ingestion use cases and examples

In this section, we discuss the details of different Amazon Redshift ingestion use cases and provide examples.

Redshift COPY use case: Application log data ingestion and analysis

Ingesting application log data stored in Amazon S3 is a common use case for the Redshift COPY command. Data engineers in an organization need to analyze application log data to gain insights into user behavior, identify potential issues, and optimize a platform’s performance. To achieve this, data engineers ingest log data in parallel from multiple files stored in S3 buckets into Redshift tables. This parallelization uses the Amazon Redshift MPP architecture, allowing for faster data ingestion compared to other ingestion methods.

The following code is an example of the COPY command loading data from a set of CSV files in an S3 bucket into a Redshift table:

COPY myschema.mytable
FROM 's3://my-bucket/data/files/'
IAM_ROLE 'arn:aws:iam::1234567891011:role/MyRedshiftRole'
FORMAT AS CSV;

This code uses the following parameters:

  • mytable is the target Redshift table for data load
  • 's3://my-bucket/data/files/' is the S3 path where the CSV files are located
  • IAM_ROLE specifies the IAM role required to access the S3 bucket
  • FORMAT AS CSV specifies that the data files are in CSV format

In addition to Amazon S3, the COPY command loads data from other sources, such as DynamoDB, Amazon EMR, remote hosts through SSH, or other Redshift databases. The COPY command provides options to specify data formats, delimiters, compression, and other parameters to handle different data sources and formats.
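For example, the following minimal sketch loads gzip-compressed JSON files; the table, S3 path, and IAM role are placeholders:

COPY myschema.events
FROM 's3://my-bucket/data/events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS JSON 'auto'
GZIP
TIMEFORMAT 'auto'
REGION 'us-east-1';

FORMAT AS JSON 'auto' maps JSON attributes to column names automatically, GZIP tells COPY that the files are compressed, and REGION is only needed when the bucket is in a different AWS Region from the data warehouse.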

To get started with the COPY command, see Using the COPY command to load from Amazon S3.

Federated queries use case: Integrated reporting and analytics for a retail company

For this use case, a retail company has an operational database running on Amazon RDS for PostgreSQL, which stores real-time sales transactions, inventory levels, and customer information data. Additionally, a data warehouse runs on Amazon Redshift storing historical data for reporting and analytics purposes. To create an integrated reporting solution that combines real-time operational data with historical data in the data warehouse, without the need for multi-step ETL processes, complete the following steps:

  1. Set up network connectivity. Make sure your Redshift cluster and RDS for PostgreSQL instance are in the same virtual private cloud (VPC) or have network connectivity established through VPC peering, AWS PrivateLink, or AWS Transit Gateway.
  2. Create a secret and IAM role for federated queries:
    1. In AWS Secrets Manager, create a new secret to store the credentials (user name and password) for your Amazon RDS for PostgreSQL instance.
    2. Create an IAM role with permissions to access the Secrets Manager secret and the Amazon RDS for PostgreSQL instance.
    3. Associate the IAM role with your Amazon Redshift cluster.
  3. Create an external schema in Amazon Redshift:
    1. Connect to your Redshift cluster using a SQL client or the query editor v2 on the Amazon Redshift console.
    2. Create an external schema that references your Amazon RDS for PostgreSQL instance:
CREATE EXTERNAL SCHEMA postgres_schema
FROM POSTGRES
DATABASE 'mydatabase'
SCHEMA 'public'
URI 'endpoint-for-your-rds-instance.aws-region.rds.amazonaws.com:5432'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRoleForRDS'
SECRET_ARN 'arn:aws:secretsmanager:aws-region:123456789012:secret:my-rds-secret-abc123';
  4. Query tables in your Amazon RDS for PostgreSQL instance directly from Amazon Redshift using federated queries:
SELECT
    r.order_id,
    r.order_date,
    r.customer_name,
    r.total_amount,
    h.product_name,
    h.category
FROM
    postgres_schema.orders r
    JOIN redshift_schema.product_history h ON r.product_id = h.product_id
WHERE
    r.order_date >= '2024-01-01';
  5. Create views or materialized views in Amazon Redshift that combine the operational data from federated queries with the historical data in Amazon Redshift for reporting purposes:
CREATE MATERIALIZED VIEW sales_report AS
SELECT
    r.order_id,
    r.order_date,
    r.customer_name,
    r.total_amount,
    h.product_name,
    h.category,
    h.historical_sales
FROM
    (
        SELECT
            order_id,
            order_date,
            customer_name,
            total_amount,
            product_id
        FROM
            postgres_schema.orders
    ) r
    JOIN redshift_schema.product_history h ON r.product_id = h.product_id;

With this implementation, federated queries in Amazon Redshift integrate real-time operational data from Amazon RDS for PostgreSQL instances with historical data in a Redshift data warehouse. This approach eliminates the need for multi-step ETL processes and enables you to create comprehensive reports and analytics that combine data from multiple sources.

To get started with Amazon Redshift federated query ingestion, see Querying data with federated queries in Amazon Redshift.

Zero-ETL integration use case: Near real-time analytics for an ecommerce application

Suppose an ecommerce application built on Aurora MySQL-Compatible manages online orders, customer data, and product catalogs. To perform near real-time analytics with data filtering on transactional data to gain insights into customer behavior, sales trends, and inventory management without the overhead of building and maintaining multi-step ETL pipelines, you can use zero-ETL integrations for Amazon Redshift. Complete the following steps:

  1. Set up an Aurora MySQL cluster (it must be running Aurora MySQL version 3.05, which is compatible with MySQL 8.0.32, or higher):
    1. Create an Aurora MySQL cluster in your desired AWS Region.
    2. Configure the cluster settings, such as the instance type, storage, and backup options.
  2. Create a zero-ETL integration with Amazon Redshift:
    1. On the Amazon RDS console, navigate to the Zero-ETL integrations page.
    2. Choose Create integration and select your Aurora MySQL cluster as the source.
    3. Choose an existing Redshift cluster or create a new cluster as the target.
    4. Provide a name for the integration and review the settings.
    5. Choose Create integration to initiate the zero-ETL integration process.
  3. Verify the integration status:
    1. After the integration is created, monitor the status on the Amazon RDS console or by querying the SVV_INTEGRATION and SYS_INTEGRATION_ACTIVITY system views in Amazon Redshift.
    2. Wait for the integration to reach the Active state, indicating that data is being replicated from Aurora to Amazon Redshift.
  4. Create analytics views:
    1. Connect to your Redshift cluster using a SQL client or the query editor v2 on the Amazon Redshift console.
    2. Create views or materialized views that combine and transform the replicated data from Aurora for your analytics use cases:
CREATE MATERIALIZED VIEW orders_summary AS
SELECT
    o.order_id,
    o.customer_id,
    SUM(oi.quantity * oi.price) AS total_revenue,
    MAX(o.order_date) AS latest_order_date
FROM
    aurora_schema.orders o
    JOIN aurora_schema.order_items oi ON o.order_id = oi.order_id
GROUP BY
    o.order_id,
    o.customer_id;
  5. Query the views or materialized views in Amazon Redshift to perform near real-time analytics on the transactional data from your Aurora MySQL cluster:
SELECT
	customer_id,
	SUM(total_revenue) AS total_customer_revenue,
	MAX(latest_order_date) AS most_recent_order
FROM
	orders_summary
GROUP BY
	customer_id
ORDER BY
	total_customer_revenue DESC;

This implementation achieves near real-time analytics for an ecommerce application’s transactional data using the zero-ETL integration between Aurora MySQL-Compatible and Amazon Redshift. The data automatically replicates from Aurora to Amazon Redshift, eliminating the need for multi-step ETL pipelines and supporting insights from the latest data quickly.

To get started with Amazon Redshift zero-ETL integrations, see Working with zero-ETL integrations. To learn more about Aurora zero-ETL integrations with Amazon Redshift, see Amazon Aurora zero-ETL integrations with Amazon Redshift.

Integration for Apache Spark use case: Gaming player events written to Amazon S3

Consider a large volume of gaming player events stored in Amazon S3. The events require data transformation, cleansing, and preprocessing to extract insights, generate reports, or build ML models. In this case, you can use the scalability and processing power of Amazon EMR to perform the required data changes using Apache Spark. After it’s processed, the transformed data must be loaded into Amazon Redshift for further analysis, reporting, and integration with BI tools.

In this scenario, you can use the Amazon Redshift integration for Apache Spark to perform the necessary data transformations and load the processed data into Amazon Redshift. The following implementation example assumes gaming player events in Parquet format are stored in Amazon S3 (s3://<bucket_name>/player_events/).

  1. Launch an Amazon EMR (emr-6.9.0) cluster with Apache Spark (Spark 3.3.0) with Amazon Redshift integration with Apache Spark support.
  2. Configure the necessary IAM role for accessing Amazon S3 and Amazon Redshift.
  3. Add security group rules to Amazon Redshift to allow access to the provisioned cluster or serverless workgroup.
  4. Create a Spark job that sets up a connection to Amazon Redshift, reads data from Amazon S3, performs transformations, and writes resulting data to Amazon Redshift. See the following code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

def main():

    # Create a SparkSession
    spark = SparkSession.builder \
        .appName("RedshiftSparkJob") \
        .getOrCreate()

    # Set Amazon Redshift connection properties
    redshift_jdbc_url = "jdbc:redshift://<redshift-endpoint>:<port>/<database>"
    redshift_table = "<schema>.<table_name>"
    temp_s3_bucket = "s3://<bucket_name>/temp/"
    iam_role_arn = "<iam_role_arn>"

    # Read data from Amazon S3
    s3_data = spark.read.format("parquet") \
        .load("s3://<bucket_name>/player_events/")

    # Perform transformations
    transformed_data = s3_data.withColumn("transformed_column", lit("transformed_value"))

    # Write the transformed data to Amazon Redshift
    transformed_data.write \
        .format("io.github.spark_redshift_community.spark.redshift") \
        .option("url", redshift_jdbc_url) \
        .option("dbtable", redshift_table) \
        .option("tempdir", temp_s3_bucket) \
        .option("aws_iam_role", iam_role_arn) \
        .mode("overwrite") \
        .save()

if __name__ == "__main__":
    main()

In this example, you first import the necessary modules and create a SparkSession. Set the connection properties for Amazon Redshift, including the endpoint, port, database, schema, table name, temporary S3 bucket path, and the IAM role ARN for authentication. Read data from Amazon S3 in Parquet format using the spark.read.format("parquet").load() method. Perform a transformation on the Amazon S3 data by adding a new column transformed_column with a constant value using the withColumn method and the lit function. Write the transformed data to Amazon Redshift using the write method and the io.github.spark_redshift_community.spark.redshift format. Set the necessary options for the Redshift connection URL, table name, temporary S3 bucket path, and IAM role ARN. Use the mode("overwrite") option to overwrite the existing data in the Amazon Redshift table with the transformed data.

To get started with Amazon Redshift integration for Apache Spark, see Amazon Redshift integration for Apache Spark. For more examples of using the Amazon Redshift for Apache Spark connector, see New – Amazon Redshift Integration with Apache Spark.

Streaming ingestion use case: IoT telemetry near real-time analysis

Imagine a fleet of IoT devices (sensors and industrial equipment) that generate a continuous stream of telemetry data such as temperature readings, pressure measurements, or operational metrics. Ingesting this data in real time to perform analytics to monitor the devices, detect anomalies, and make data-driven decisions requires a streaming solution integrated with a Redshift data warehouse.

In this example, we use Amazon MSK as the streaming source for IoT telemetry data.

  1. Create an external schema in Amazon Redshift:
    1. Connect to an Amazon Redshift cluster using a SQL client or the query editor v2 on the Amazon Redshift console.
    2. Create an external schema that references the MSK cluster:
CREATE EXTERNAL SCHEMA kafka_schema
FROM MSK
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRoleForMSK'
AUTHENTICATION iam
CLUSTER_ARN '<msk-cluster-arn>';
  2. Create a materialized view in Amazon Redshift:
    1. Define a materialized view that maps the Kafka topic data to Amazon Redshift table columns.
    2. Use JSON_PARSE to convert the streaming message payload to the Amazon Redshift SUPER type.
    3. Set the materialized view to auto refresh.
CREATE MATERIALIZED VIEW iot_telemetry_view
AUTO REFRESH YES
AS SELECT
    kafka_partition,
    kafka_offset,
    kafka_timestamp_type,
    kafka_timestamp,
    JSON_PARSE(kafka_value) AS payload
FROM kafka_schema."iot-telemetry-topic";
  3. Query the iot_telemetry_view materialized view to access the real-time IoT telemetry data ingested from the Kafka topic. The materialized view will automatically refresh as new data arrives in the Kafka topic.
SELECT
    kafka_timestamp,
    payload.device_id,
    payload.temperature,
    payload.pressure
FROM iot_telemetry_view;

With this implementation, you can achieve near real-time analytics on IoT device telemetry data using Amazon Redshift streaming ingestion. As telemetry data is received by an MSK topic, Amazon Redshift automatically ingests and reflects the data in a materialized view, supporting query and analysis of the data in near real time.
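Because the payload is stored as SUPER, you can also filter on individual attributes for simple anomaly checks. The following is a minimal sketch that assumes the telemetry payload contains device_id and temperature fields:

SELECT
    kafka_timestamp,
    payload.device_id,
    payload.temperature
FROM iot_telemetry_view
WHERE (payload.temperature)::FLOAT > 80.0
ORDER BY kafka_timestamp DESC;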

To get started with Amazon Redshift streaming ingestion, see Streaming ingestion to a materialized view. To learn more about streaming and customer use cases, see Amazon Redshift Streaming Ingestion.

Conclusion

This post detailed the options available for Amazon Redshift data ingestion. The choice of data ingestion method depends on factors such as the size and structure of the data, the need for real-time access or transformations, the data sources, existing infrastructure, ease of use, and user skill sets. Zero-ETL integrations and federated queries are suitable for simple data ingestion tasks or for joining data between operational databases and Amazon Redshift analytics data. Large-scale data ingestion with transformation and orchestration benefits from the Amazon Redshift integration for Apache Spark with Amazon EMR and AWS Glue. Bulk loading of data into Amazon Redshift, regardless of dataset size, fits well with the capabilities of the Redshift COPY command. Streaming sources such as Kinesis Data Streams, Amazon MSK, and Data Firehose are ideal candidates for Amazon Redshift streaming ingestion.

Evaluate the features and guidance provided for your data ingestion workloads and let us know your feedback in the comments.


About the Authors

Steve Phillips is a senior technical account manager at AWS in the North America region. Steve has worked with games customers for eight years and currently focuses on data warehouse architectural design, data lakes, data ingestion pipelines, and cloud distributed architectures.

Sudipta Bagchi is a Sr. Specialist Solutions Architect at Amazon Web Services. He has over 14 years of experience in data and analytics, and helps customers design and build scalable and high-performant analytics solutions. Outside of work, he loves running, traveling, and playing cricket.

Use the AWS CDK with the Data Solutions Framework to provision and manage Amazon Redshift Serverless

Post Syndicated from Jan Michael Go Tan original https://aws.amazon.com/blogs/big-data/use-the-aws-cdk-with-the-data-solutions-framework-to-provision-and-manage-amazon-redshift-serverless/

In February 2024, we announced the release of the Data Solutions Framework (DSF), an opinionated open source framework for building data solutions on AWS. DSF is built using the AWS Cloud Development Kit (AWS CDK) to package infrastructure components into L3 AWS CDK constructs on top of AWS services. L3 constructs are implementations of common technical patterns and create multiple resources that are configured to work with each other.

In this post, we demonstrate how to use the AWS CDK and DSF to create a multi-data warehouse platform based on Amazon Redshift Serverless. DSF simplifies the provisioning of Redshift Serverless, initialization and cataloging of data, and data sharing between different data warehouse deployments. Using a programmatic approach with the AWS CDK and DSF allows you to apply GitOps principles to your analytics workloads and realize the following benefits:

  • You can deploy using continuous integration and delivery (CI/CD) pipelines, including the definitions of Redshift objects (databases, tables, shares, and so on)
  • You can roll out changes consistently across multiple environments
  • You can bootstrap data warehouses (table creation, ingestion of data, and so on) using code and use version control to simplify the setup of testing environments
  • You can test changes before deployment using AWS CDK built-in testing capabilities

In addition, DSF’s Redshift Serverless L3 constructs provide a number of built-in capabilities that can accelerate development while helping you follow best practices. For example:

  • Running extract, transform, and load (ETL) jobs to and from Amazon Redshift is more straightforward because an AWS Glue connection resource is automatically created and configured. This means data engineers don’t have to configure this resource and can use it right away with their AWS Glue ETL jobs.
  • Similarly, with discovery of data inside Amazon Redshift, DSF provides a convenient method to configure an AWS Glue crawler to populate the AWS Glue Data Catalog for ease of discovery as well as ease of referencing tables when creating ETL jobs. The configured AWS Glue crawler uses an AWS Identity and Access Management (IAM) role that follows least privilege.
  • Sharing data between Redshift data warehouses is a common approach to improve collaboration between lines of business without duplicating data. DSF provides convenient methods for the end-to-end flow for both data producer and consumer.

Solution overview

The solution demonstrates a common pattern where a data warehouse is used as a serving layer for business intelligence (BI) workloads on top of data lake data. The source data is stored in Amazon Simple Storage Service (Amazon S3) buckets, then ingested into a Redshift producer data warehouse to create materialized views and aggregate data, and finally shared with a Redshift consumer running BI queries from the end-users. The following diagram illustrates the high-level architecture.

Solution Overview

In the post, we use Python for the example code. DSF also supports TypeScript.

Prerequisites

Because we’re using the AWS CDK, complete the steps in Getting Started with the AWS CDK before you implement the solution.

Initialize the project and provision a Redshift Serverless namespace and workgroup

Let’s start with initializing the project and including DSF as a dependency. You can run this code in your local terminal, or you can use AWS Cloud9:

mkdir dsf-redshift-blog && cd dsf-redshift-blog
cdk init --language python

Open the project folder in your IDE and complete the following steps:

  1. Open the app.py file.
  2. In this file, make sure to uncomment the first env line. This configures the AWS CDK environment depending on the AWS profile used during the deployment.
  3. Add a configuration flag in the cdk.context.json file at the root of the project (if it doesn’t exist, create the file):
    {  
        "@data-solutions-framework-on-aws/removeDataOnDestroy": true 
    }

Setting the @data-solutions-framework-on-aws/removeDataOnDestroy configuration flag to true makes sure resources that have the removal_policy parameter set to RemovalPolicy.DESTROY are destroyed when the AWS CDK stack is deleted. This is a guardrail DSF uses to prevent accidentally deleting data.

Now that the project is configured, you can start adding resources to the stack.

  1. Navigate to the dsf_redshift_blog folder and open the dsf_redshift_blog_stack.py file.

This is where we configure the resources to be deployed.

  1. To get started building the end-to-end demo, add the following import statements at the top of the file, which allows you to start defining the resources from both the AWS CDK core library as well as DSF:
    from aws_cdk import (
        RemovalPolicy,
        Stack
    )
    
    from aws_cdk.aws_s3 import Bucket
    from aws_cdk.aws_iam import Role, ServicePrincipal
    from constructs import Construct
    from cdklabs import aws_data_solutions_framework as dsf

We use several DSF-specific constructs to build the demo:

  • DataLakeStorage – This creates three S3 buckets, named Bronze, Silver, and Gold, to represent the different data layers.
  • S3DataCopy – This manages the copying of data from one bucket to another bucket.
  • RedshiftServerlessNamespace – This creates a Redshift Serverless namespace where database objects and users are stored.
  • RedshiftServerlessWorkgroup – This creates a Redshift Serverless workgroup that contains compute- and network-related configurations for the data warehouse. This is also the entry point for several convenient functionalities that DSF provides, such as cataloging of Redshift tables, running SQL statements as part of the AWS CDK (such as creating tables, data ingestion, merging of tables, and more), and sharing datasets across different Redshift clusters without moving data.
  1. Now that you have imported the libraries, create a set of S3 buckets following the medallion architecture best practices with bronze, silver, and gold data layers.

The high-level definitions of each layer are as follows:

  • Bronze represents raw data; this is where data from various source systems lands. No schema is needed.
  • Silver is cleaned and potentially augmented data. The schema is enforced in this layer.
  • Gold is data that’s further refined and aggregated to serve a specific business need.

Using the DataLakeStorage construct, you can create these three S3 buckets with the following best practices:

  • Encryption at rest through AWS Key Management Service (AWS KMS) is turned on
  • SSL is enforced
  • The use of S3 bucket keys is turned on
  • There’s a default S3 lifecycle rule defined to delete incomplete multipart uploads after 1 day
    data_lake = dsf.storage.DataLakeStorage(self,
        'DataLake',
        removal_policy=RemovalPolicy.DESTROY)

  1. After you create the S3 buckets, copy over the data using the S3DataCopy construct. For this demo, we land the data in the Silver bucket because it's already cleaned:
    source_bucket = Bucket.from_bucket_name(self, 
        'SourceBucket', 
        bucket_name='redshift-immersionday-labs')
    
    data_copy = dsf.utils.S3DataCopy(self,
        'SourceData', 
        source_bucket=source_bucket, 
        source_bucket_prefix='data/amazon-reviews/', 
        source_bucket_region='us-west-2', 
        target_bucket=data_lake.silver_bucket, 
        target_bucket_prefix='silver/amazon-reviews/')

  2. In order for Amazon Redshift to ingest the data in Amazon S3, it needs an IAM role with the right permissions. This role will be associated with the Redshift Serverless namespace that you create next.
    lake_role = Role(self, 
        'LakeRole', 
        assumed_by=ServicePrincipal('redshift.amazonaws.com'))
    
    data_lake.silver_bucket.grant_read(lake_role)

  3. To provision Redshift Serverless, configure two resources: a namespace and a workgroup. DSF provides L3 constructs for both:
    1. RedshiftServerlessNamespace
    2. RedshiftServerlessWorkgroup

    Both constructs follow security best practices, including:

    • The default virtual private cloud (VPC) uses private subnets (with public access disabled).
    • Data is encrypted at rest through AWS KMS with automatic key rotation.
    • Admin credentials are stored in AWS Secrets Manager with automatic rotation managed by Amazon Redshift.
    • A default AWS Glue connection is automatically created using private connectivity. This can be used by AWS Glue crawlers as well as AWS Glue ETL jobs to connect to Amazon Redshift.

    The RedshiftServerlessWorkgroup construct is the main entry point for other capabilities, such as integration with the AWS Glue Data Catalog, Redshift Data API, and Data Sharing API.

    1. In the following example, use the defaults provided by the construct and associate the IAM role that you created earlier to give Amazon Redshift access to the data lake for data ingestion:
      namespace = dsf.consumption.RedshiftServerlessNamespace(self, 
          'Namespace', 
          db_name='defaultdb', 
          name='producer', 
          removal_policy=RemovalPolicy.DESTROY, 
          default_iam_role=lake_role)
      
      workgroup = dsf.consumption.RedshiftServerlessWorkgroup(self, 
          'Workgroup', 
          name='producer', 
          namespace=namespace, 
          removal_policy=RemovalPolicy.DESTROY)

Create tables and ingest data

To create a table, you can use the runCustomSQL method in the RedshiftServerlessWorkgroup construct. This method allows you to run arbitrary SQL statements when the resource is being created (such as create table or create materialized view) and when it’s being deleted (such as drop table or drop materialized view).

Add the following code after the RedshiftServerlessWorkgroup instantiation:

create_amazon_reviews_table = workgroup.run_custom_sql('CreateAmazonReviewsTable', 
    database_name='defaultdb', 
    sql='CREATE TABLE amazon_reviews (marketplace character varying(16383) ENCODE lzo, customer_id character varying(16383) ENCODE lzo, review_id character varying(16383) ENCODE lzo, product_id character varying(16383) ENCODE lzo, product_parent character varying(16383) ENCODE lzo, product_title character varying(16383) ENCODE lzo, star_rating integer ENCODE az64, helpful_votes integer ENCODE az64, total_votes integer ENCODE az64, vine character varying(16383) ENCODE lzo, verified_purchase character varying(16383) ENCODE lzo, review_headline character varying(max) ENCODE lzo, review_body character varying(max) ENCODE lzo, review_date date ENCODE az64, year integer ENCODE az64) DISTSTYLE AUTO;', 
    delete_sql='drop table amazon_reviews')

load_amazon_reviews_data = workgroup.ingest_data('amazon_reviews_ingest_data', 
    'defaultdb', 
    'amazon_reviews', 
    data_lake.silver_bucket, 
    'silver/amazon-reviews/', 
    'FORMAT parquet')

load_amazon_reviews_data.node.add_dependency(create_amazon_reviews_table)
load_amazon_reviews_data.node.add_dependency(data_copy)

Given the asynchronous nature of some of the resource creation, we also enforce dependencies between some resources; otherwise, the AWS CDK would try to create them in parallel to accelerate the deployment. The preceding dependency statements establish the following:

  • Before you load the data, the S3 data copy must be complete, so the data exists in the source location used by the ingestion
  • Before you load the data, the target table must exist in the Redshift namespace

Bootstrapping example (materialized views)

The workgroup.run_custom_sql() method provides flexibility in how you can bootstrap your Redshift data warehouse using the AWS CDK. For example, you can create a materialized view to improve the queries’ performance by pre-aggregating data from the Amazon reviews:

materialized_view = workgroup.run_custom_sql('MvProductAnalysis',
    database_name='defaultdb',
    sql=f'''CREATE MATERIALIZED VIEW mv_product_analysis AS SELECT review_date, product_title, COUNT(1) AS review_total, SUM(star_rating) AS rating FROM amazon_reviews WHERE marketplace = 'US' GROUP BY 1,2;''',
    delete_sql='drop materialized view mv_product_analysis')

materialized_view.node.add_dependency(load_amazon_reviews_data)

Catalog tables in Amazon Redshift

The deployment of RedshiftServerlessWorkgroup automatically creates an AWS Glue connection resource that can be used by AWS Glue crawlers and AWS Glue ETL jobs. This is directly exposed from the workgroup construct through the glue_connection property. Using this connection, the workgroup construct exposes a convenient method to catalog the tables inside the associated Redshift Serverless namespace. The following is an example:

workgroup.catalog_tables('DefaultDBCatalog', 'mv_product_analysis')

This single line of code creates a database in the Data Catalog named mv_product_analysis and the associated crawler with the IAM role and network configuration already configured. By default, it crawls all the tables inside the public schema in the default database indicated when the Redshift Serverless namespace was created. To override this, the third parameter in the catalogTables method allows you to define a pattern on what to crawl (see the JDBC data store in the include path).

You can run the crawler using the AWS Glue console or invoke it using the SDK, AWS Command Line Interface (AWS CLI), or AWS CDK using AwsCustomResource.

Data sharing

DSF supports Redshift data sharing for both sides (producers and consumers) as well as same account and cross-account scenarios. Let’s create another Redshift Serverless namespace and workgroup to demonstrate the interaction:

namespace2 = dsf.consumption.RedshiftServerlessNamespace(self, 
    "Namespace2", 
    db_name="defaultdb", 
    name="consumer", 
    default_iam_role=lake_role, 
    removal_policy=RemovalPolicy.DESTROY)

workgroup2 = dsf.consumption.RedshiftServerlessWorkgroup(self, 
    "Workgroup2", 
    name="consumer", 
    namespace=namespace2, 
    removal_policy=RemovalPolicy.DESTROY)

For producers

For producers, complete the following steps:

  1. Create the new share and populate the share with the schema or tables:
    data_share = workgroup.create_share('DataSharing', 
        'defaultdb', 
        'defaultdbshare', 
        'public', ['mv_product_analysis'])
    
    data_share.new_share_custom_resource.node.add_dependency(materialized_view)
  2. Create access grants:
    • To grant to a cluster in the same account:
      share_grant = workgroup.grant_access_to_share("GrantToSameAccount", 
          data_share, 
          namespace2.namespace_id)
      
      share_grant.resource.node.add_dependency(data_share.new_share_custom_resource)
      share_grant.resource.node.add_dependency(namespace2)
    • To grant to a different account:
      workgroup.grant_access_to_share('GrantToDifferentAccount', 
          data_share, 
          None, 
          '<ACCOUNT_ID_OF_CONSUMER>', 
          True)

The last parameter in the grant_access_to_share method allows you to automatically authorize cross-account access on the data share. Omitting this parameter defaults to no authorization, which means a Redshift administrator needs to authorize the cross-account share using the AWS CLI, SDK, or Amazon Redshift console.

For consumers

For the same account share, to create the database from the share, use the following code:

create_db_from_share = workgroup2.create_database_from_share("CreateDatabaseFromShare", 
    "marketing", 
    data_share.data_share_name, 
    data_share.producer_namespace)

create_db_from_share.resource.node.add_dependency(share_grant.resource)
create_db_from_share.resource.node.add_dependency(workgroup2)

For cross-account grants, the syntax is similar, but you need to indicate the producer account ID:

consumer_workgroup.create_database_from_share('CreateCrossAccountDatabaseFromShare', 
    'tpcds', 
    <PRODUCER_SHARE_NAME>, 
    <PRODUCER_NAMESPACE_ID>, 
    <PRODUCER_ACCOUNT_ID>)

To see the full working example, follow the instructions in the accompanying GitHub repository.

Deploy the resources using the AWS CDK

To deploy the resources, run the following code:

cdk deploy

You can review the resources created, as shown in the following screenshot.

Confirm the changes for the deployment to start. Wait a few minutes for the project to be deployed; you can keep track of the deployment using the AWS CLI or the AWS CloudFormation console.

When the deployment is complete, you should see two Redshift workgroups (one producer and one consumer).

Using Amazon Redshift Query Editor v2, you can log in to the producer Redshift workgroup using Secrets Manager, as shown in the following screenshot.

Producer QEV2 Login

After you log in, you can see the tables and views that you created using DSF in the defaultdb database.

QEv2 Tables

Log in to the consumer Redshift workgroup to see the shared dataset from the producer Redshift workgroup under the marketing database.

Clean up

You can run cdk destroy in your local terminal to delete the stack. Because you marked the constructs with a RemovalPolicy.DESTROY and configured DSF to remove data on destroy, running cdk destroy or deleting the stack from the AWS CloudFormation console will clean up the provisioned resources.

Conclusion

In this post, we demonstrated how to use the AWS CDK along with the DSF to manage Redshift Serverless as code. Codifying the deployment of resources helps provide consistency across multiple environments. Aside from infrastructure, DSF also provides capabilities to bootstrap (table creation, ingestion of data, and more) Amazon Redshift and manage objects, all from the AWS CDK. This means that changes can be version controlled, reviewed, and even unit tested.

In addition to Redshift Serverless, DSF supports other AWS services, such as Amazon Athena, Amazon EMR, and many more. Our roadmap is publicly available, and we look forward to your feature requests, contributions, and feedback.

You can get started using DSF by following our quick start guide.


About the authors


Jan Michael Go Tan is a Principal Solutions Architect for Amazon Web Services. He helps customers design scalable and innovative solutions with the AWS Cloud.
Vincent Gromakowski is an Analytics Specialist Solutions Architect at AWS where he enjoys solving customers' analytics, NoSQL, and streaming challenges. He has strong expertise in distributed data processing engines and resource orchestration platforms.

Integrate Tableau and Microsoft Entra ID with Amazon Redshift using AWS IAM Identity Center

Post Syndicated from Debu Panda original https://aws.amazon.com/blogs/big-data/integrate-tableau-and-microsoft-entra-id-with-amazon-redshift-using-aws-iam-identity-center/

This post is co-written with Sid Wray, Jade Koskela, and Ravi Bhattiprolu from SalesForce.

Amazon Redshift and Tableau empower data analysis. Amazon Redshift is a cloud data warehouse that processes complex queries at scale and with speed. Its advanced query optimization serves results to Tableau. Tableau’s extensive capabilities and enterprise connectivity help analysts efficiently prepare, explore, and share data insights company-wide.

Customers can integrate Amazon Redshift with Tableau using single sign-on (SSO) capabilities enabled by AWS IAM Identity Center integration with trusted identity propagation. You can use this capability to implement authentication with a third-party identity provider (IdP) and authorization with Amazon Redshift. It positions Amazon Redshift as an AWS managed application, allowing you to take full advantage of the trusted identity propagation feature.

Amazon Web Services (AWS) collaborated with Tableau to enable SSO support for accessing Amazon Redshift from Tableau. Both Tableau Desktop 2023.3.9 and Tableau Server 2023.3.9 releases support trusted identity propagation with IAM Identity Center. This SSO integration is available for Tableau Desktop, Tableau Server, and Tableau Prep.

This blog post provides a step-by-step guide to integrating IAM Identity Center with Microsoft Entra ID as the IdP and configuring Amazon Redshift as an AWS managed application. Additionally, you’ll learn how to set up the Amazon Redshift driver in Tableau, enabling SSO directly within Tableau Desktop.

Solution overview

The following diagram illustrates the architecture of the Tableau SSO integration with Amazon Redshift, IAM Identity Center, and Microsoft Entra ID.

Figure 1: Solution overview for Tableau integration with Amazon Redshift using IAM Identity Center and Microsoft Entra ID

The solution depicted in Figure 1 includes the following steps:

  1. The user configures Tableau to access Amazon Redshift using IAM Identity Center.
  2. On a user sign-in attempt, Tableau initiates a browser-based OAuth flow and redirects the user to the Microsoft Entra ID sign-in page to enter the sign-in credentials.
  3. After successful authentication, Microsoft Entra ID issues authentication tokens (ID and access token) to Tableau.
  4. The Amazon Redshift driver then makes a call to the Amazon Redshift-enabled Identity Center application and forwards the access token.
  5. Amazon Redshift passes the token to IAM Identity Center for validation.
  6. IAM Identity Center first validates the token using the OpenID Connect (OIDC) discovery connection to the trusted token issuer (TTI) and returns an IAM Identity Center generated access token for the same user. In Figure 1, the TTI is the Microsoft Entra ID server.
  7. Amazon Redshift then uses the access token to obtain the user and group membership information from Identity Center.
  8. The Tableau user will be able to connect with Amazon Redshift and access data based on the user and group membership returned from IAM Identity Center.

Prerequisites

Before you begin implementing the solution, you must have the following in place:

Walkthrough

In this walkthrough, you will use the following steps to build the solution:

  1. Set up the Microsoft Entra ID OIDC application
  2. Collect Microsoft Entra ID information
  3. Set up a trusted token issuer in IAM Identity Center
  4. Set up client connections and trusted token issuers
  5. Set up the Tableau OAuth config files for Microsoft Entra ID
  6. Install the Tableau OAuth config file for Tableau Desktop
  7. Set up the Tableau OAuth config file for Tableau Server or Tableau Cloud
  8. Federate to Amazon Redshift from Tableau Desktop
  9. Federate to Amazon Redshift from Tableau Server

Set up the Microsoft Entra ID OIDC application

To create your Microsoft Entra application and service principal, follow these steps:

  1. Sign in to the Microsoft Entra admin center as Cloud Application Administrator (at the least).
  2. Browse to App registrations under Manage, and choose New registration.
  3. Enter a name for the application. For example, Tableau-OIDC-App.
  4. Select a supported account type, which determines who can use the application. For this example, select the first option in the list.
  5. Under Redirect URI, select Web for the type of application you want to create. Enter the URI where the access token is sent to. In this example, you’re using localhost, so enter http://localhost:55556/Callback and http://localhost/auth/add_oauth_token.
  6. Choose Register.
  7. In the navigation pane, choose Certificates & secrets.
  8. Choose New client secret.
  9. Enter a Description and select an expiration for the secret or specify a custom lifetime. For this example, keep the Microsoft recommended default expiration value of 6 months. Choose Add.
  10. Copy the secret value.
    Note: It will only be presented one time; after that you cannot read it.
  11. In the navigation pane, under Manage, choose Expose an API.
  12. If you’re setting up for the first time, you can see Set to the right of Application ID URI.
  13. Choose Set, and then choose Save.
  14. After the application ID URI is set up, choose Add a scope.
  15. For Scope name, enter a name. For example, redshift_login.
  16. For Admin consent display name, enter a display name. For example, redshift_login.
  17. For Admin consent description, enter a description of the scope.
  18. Choose Add scope.

For more information about setting up the Microsoft Entra app, see Register a Microsoft Entra app and create a service principal.

Collect Microsoft Entra ID information

To configure your IdP with IAM Identity Center and Amazon Redshift, collect the following parameters from Microsoft Entra ID. If you don’t have these parameters, contact your Microsoft Entra ID admin.

  1. Tenant ID, Client ID, and Audience value: To get these values:
    1. Sign in to the Azure portal with your Microsoft account.
    2. Under Manage, choose App registrations.
    3. Choose the application that you created in previous sections.
    4. On the left panel, choose Overview; a new page will appear containing the Essentials section. You can find the Tenant ID, Client ID, and Audience value (Application ID URI) as shown in the following figure:

      Figure 2: Overview section of OIDC application

  1. Scope: To find your scope value:
    1. In the navigation pane of the OIDC application, under Manage, choose Expose an API.
    2. You will find the value under Scopes as shown in the following figure:

      Figure 3: Application scope

Set up a trusted token issuer in IAM Identity Center

At this point, you have finished configurations in the Entra ID console; now you’re ready to add Entra ID as a TTI. You will start by adding a TTI so you can exchange tokens. In this step, you will create a TTI in the centralized management account. To create a TTI, follow these steps:

  1. Open the AWS Management Console and navigate to IAM Identity Center, and then to the Settings page.
  2. Select the Authentication tab and under Trusted token issuers, choose Create trusted token issuer.
  3. On the Set up an external IdP to issue trusted tokens page, under Trusted token issuer details, do the following:
    1. For Issuer URL, enter the OIDC discovery URL of the external IdP that will issue tokens for trusted identity propagation. The URL would be: https://sts.windows.net/<tenantid>/. To find your Microsoft Entra tenant ID, see Collect Microsoft Entra ID information.
    2. For Trusted token issuer name, enter a name to identify this TTI in IAM Identity Center and in the application console.
    3. Under Map attributes, do the following:
      1. For Identity provider attribute, select an attribute from the list to map to an attribute in the Identity Center identity store. You can choose Email, Object Identifier, Subject, and Other. This example uses Other where we’re specifying the upn (user principal name) as the Identity provider attribute to map with Email from the IAM identity Center attribute.
      2. For IAM Identity Center attribute, select the corresponding attribute for the attribute mapping.
    4. Under Tags (optional), choose Add new tag, specify a value for Key, and optionally for Value. For information about tags, see Tagging AWS IAM Identity Center resources.

Figure 4 that follows shows the set up for TTI.

Figure 4: Create a trusted token issuer

  4. Choose Create trusted token issuer.

Set up client connections and trusted token issuers

A third-party application (such as Tableau) that isn’t managed by AWS exchanges the external token (JSON Web Token (JWT) for an IAM Identity Center token before calling AWS services.

The JWT must contain a subject (sub) claim, an audience (aud) claim, an issuer (iss), a user attribute claim, and a JWT ID (JTI) claim. The audience is a value that represents the AWS service that the application will use, and the audience claim value must match the value that’s configured in the Redshift application that exchanges the token.

In this section, you will specify the audience claim in the Redshift application, which you will get from Microsoft Entra ID. You will configure the Redshift application in the member account where the Redshift cluster or serverless instance is.

  1. Select IAM Identity Center connection from Amazon Redshift console menu.

Figure 5: Redshift IAM Identity Center connection

  2. Select the Amazon Redshift application that you created as part of the prerequisites.
  3. Select the Client connections tab and choose Edit.
  4. Choose Yes under Configure client connections that use third-party IdPs.
  5. Select the checkbox for Trusted token issuer that you created in the previous section.
  6. Enter the aud claim value under Configure selected trusted token issuers. For example, api://1230a234-b456-7890-99c9-a12345bcc123. To get the audience value, see Collect Microsoft Entra ID information.
  7. Choose Save.

Figure 6: Adding an audience claim for the TTI

Your IAM Identity Center, Amazon Redshift, and Microsoft Entra ID configuration is complete. Next, you need to configure Tableau.

Set up the Tableau OAuth config files for Microsoft Entra ID

To integrate Tableau with Amazon Redshift using IAM Identity Center, you need to use a custom XML. In this step, you use the following XML and replace the values starting with the $ sign and highlighted in bold. The rest of the values can be kept as they are, or you can modify them based on your use case. For detailed information on each of the elements in the XML file, see the Tableau documentation on GitHub.

Note: The XML file will be used for all the Tableau products including Tableau Desktop, Server, and Cloud. You can use the following XML or you can refer to Tableau's GitHub.

<?xml version="1.0" encoding="utf-8"?>
<pluginOAuthConfig>
<dbclass>redshift</dbclass>
<!-- For configs embedded in the connector package, don't prefix with "custom_". For external configs, always prefix with "custom_". -->
<oauthConfigId>custom_redshift_azure</oauthConfigId>
<clientIdDesktop>$copy_client_id_from_azure_oidc_app</clientIdDesktop>
<clientSecretDesktop>$copy_client_secret_from_azure_oidc_app</clientSecretDesktop>
<redirectUrisDesktop>http://localhost:55556/Callback</redirectUrisDesktop>
<redirectUrisDesktop>http://localhost:55557/Callback</redirectUrisDesktop>
<redirectUrisDesktop>http://localhost:55558/Callback</redirectUrisDesktop>
<redirectUrisDesktop>http://localhost:55559/Callback</redirectUrisDesktop>
<!-- For multitenant apps use the common endpoint, for single tenant apps use the directory specific endpoint. -->
<authUri>https://login.microsoftonline.com/$azure_tenant_id/oauth2/v2.0/authorize</authUri>
<tokenUri>https://login.microsoftonline.com/$azure_tenant_id/oauth2/v2.0/token</tokenUri>
<scopes>openid</scopes>
<scopes>offline_access</scopes>
<scopes>email</scopes>
<!-- An example with a custom API, which was required at the time of writing for integration with AWS IAM IDC. -->
<scopes>$scope_from_azure_oidc_app</scopes>
<capabilities>
<entry>
<key>OAUTH_CAP_REQUIRES_PROMPT_SELECT_ACCOUNT</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_REQUIRE_PKCE</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_PKCE_REQUIRES_CODE_CHALLENGE_METHOD</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_SUPPORTS_STATE</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_CLIENT_SECRET_IN_URL_QUERY_PARAM</key>
<value>false</value>
</entry>
<entry>
<key>OAUTH_CAP_SUPPORTS_GET_USERINFO_FROM_ID_TOKEN</key>
<value>true</value>
</entry>
<!-- Depending on the Azure application, dynamic ports may not be allowed. Enable this if not allowed. -->
<entry>
<key>OAUTH_CAP_FIXED_PORT_IN_CALLBACK_URL</key>
<value>true</value>
</entry>
</capabilities>
<accessTokenResponseMaps>
<entry>
<key>ACCESSTOKEN</key>
<value>access_token</value>
</entry>
<entry>
<key>REFRESHTOKEN</key>
<value>refresh_token</value>
</entry>
<entry>
<key>access-token-issue-time</key>
<value>issued_at</value>
</entry>
<entry>
<key>id-token</key>
<value>id_token</value>
</entry>
<entry>
<key>username</key>
<value>email</value>
</entry>
<entry>
<key>access-token-expires-in</key>
<value>expires_in</value>
</entry>
</accessTokenResponseMaps>
</pluginOAuthConfig>

The following is an example XML file:

<?xml version="1.0" encoding="utf-8"?>
<pluginOAuthConfig>
<dbclass>redshift</dbclass>
<!-- For configs embedded in the connector package, don't prefix with "custom_". For external configs, always prefix with "custom_". -->
<oauthConfigId>custom_redshift_azure</oauthConfigId>
<clientIdDesktop>1230a234-b456-7890-99c9-a12345bcc123</clientIdDesktop>
<clientSecretDesktop>RdQbc~1234559xFX~c65737wOwjsdfdsg123bg2</clientSecretDesktop>
<redirectUrisDesktop>http://localhost:55556/Callback</redirectUrisDesktop>
<redirectUrisDesktop>http://localhost:55557/Callback</redirectUrisDesktop>
<redirectUrisDesktop>http://localhost:55558/Callback</redirectUrisDesktop>
<redirectUrisDesktop>http://localhost:55559/Callback</redirectUrisDesktop>
<!-- For multitenant apps use the common endpoint, for single tenant apps use the directory specific endpoint. -->
<authUri>https://login.microsoftonline.com/e12a1ab3-1234-12ab-12b3-1a5012221d12/oauth2/v2.0/authorize</authUri>
<tokenUri>https://login.microsoftonline.com/e12a1ab3-1234-12ab-12b3-1a5012221d12/oauth2/v2.0/token</tokenUri>
<scopes>openid</scopes>
<scopes>offline_access</scopes>
<scopes>email</scopes>
<!-- An example with a custom API, which was required at the time of writing for integration with AWS IAM IDC. -->
<scopes>api://1230a234-b456-7890-99c9-a12345bcc123/redshift_login</scopes>
<capabilities>
<entry>
<key>OAUTH_CAP_REQUIRES_PROMPT_SELECT_ACCOUNT</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_REQUIRE_PKCE</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_PKCE_REQUIRES_CODE_CHALLENGE_METHOD</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_SUPPORTS_STATE</key>
<value>true</value>
</entry>
<entry>
<key>OAUTH_CAP_CLIENT_SECRET_IN_URL_QUERY_PARAM</key>
<value>false</value>
</entry>
<entry>
<key>OAUTH_CAP_SUPPORTS_GET_USERINFO_FROM_ID_TOKEN</key>
<value>true</value>
</entry>
<!-- Depending on the Azure application, dynamic ports may not be allowed. Enable this if not allowed. -->
<entry>
<key>OAUTH_CAP_FIXED_PORT_IN_CALLBACK_URL</key>
<value>true</value>
</entry>
</capabilities>
<accessTokenResponseMaps>
<entry>
<key>ACCESSTOKEN</key>
<value>access_token</value>
</entry>
<entry>
<key>REFRESHTOKEN</key>
<value>refresh_token</value>
</entry>
<entry>
<key>access-token-issue-time</key>
<value>issued_at</value>
</entry>
<entry>
<key>id-token</key>
<value>id_token</value>
</entry>
<entry>
<key>username</key>
<value>email</value>
</entry>
<entry>
<key>access-token-expires-in</key>
<value>expires_in</value>
</entry>
</accessTokenResponseMaps>
</pluginOAuthConfig>

Install the Tableau OAuth config file for Tableau Desktop

After the configuration XML file is created, it must be copied to a location to be used by Amazon Redshift Connector from Tableau Desktop. Save the file from the previous step as .xml and save it under Documents\My Tableau Repository\OAuthConfigs.

Note: Currently, this integration isn't supported on macOS because the Redshift ODBC 2.x driver isn't available for macOS yet. It will be supported soon.

Set up the Tableau OAuth config file for Tableau Server or Tableau Cloud

To integrate with Amazon Redshift using IAM Identity Center authentication, you must install the Tableau OAuth config file in Tableau Server or Tableau Cloud.

  1. Sign in to the Tableau Server or Tableau Cloud using admin credentials.
  2. Navigate to Settings.
  3. Go to OAuth Clients Registry and select Add OAuth Client.
  4. Choose following settings:
    1. Connection Type: Amazon Redshift
    2. OAuth Provider: Custom_IdP
    3. Client Id: Enter your IdP client ID value
    4. Client Secret: Enter your client secret value
    5. Redirect URL: Enter http://localhost/auth/add_oauth_token. This example uses localhost for testing in a local environment. You should use the full hostname with https.
    6. Choose OAuth Config File. Select the XML file that you configured in the previous section.
    7. Select Add OAuth Client and choose Save.

Figure 7: Create an OAuth connection in Tableau Server or Cloud

Federate to Amazon Redshift from Tableau Desktop

Now you’re ready to connect to Amazon Redshift from Tableau as an Entra ID federated user. In this step, you create a Tableau Desktop report and publish it to Tableau Server.

  1. Open Tableau Desktop.
  2. Select Amazon Redshift Connector and enter the following values:
    1. Server: Enter the name of the server that hosts the database and the name of the database you want to connect to.
    2. Port: Enter 5439.
    3. Database: Enter your database name. This example uses dev.
    4. Authentication: Select OAuth.
    5. Federation Type: Select Identity Center.
    6. Identity Center Namespace: You can leave this value blank.
    7. OAuth Provider: This value should automatically be pulled from your configured XML. It will be the value from the element oauthConfigId.
    8. Select Require SSL.
    9. Choose Sign in.

      Figure 8: Tableau Desktop OAuth connection

  3. Enter your IdP credentials in the browser pop-up window.

    Figure 9: Microsoft Entra sign in page

  4. When authentication is successful, you will see the message shown in Figure 10 that follows.

    Figure 10: Successful authentication using Tableau

Congratulations! You’re signed in using the IAM Identity Center integration with Amazon Redshift. Now you’re ready to explore and analyze your data using Tableau Desktop.

Figure 11: Successful connection using Tableau Desktop

After signing in, you can create your own Tableau Report on the desktop version and publish it to your Tableau Server. For this example, we created and published a report named SalesReport.

Federate to Amazon Redshift from Tableau Server

After you have published the report from Tableau Desktop to Tableau Server, sign in as a non-admin user and view the published report (SalesReport in this example) using IAM Identity Center authentication.

  1. Sign in to the Tableau Server site as a non-admin user.
  2. Navigate to Explore and go to the folder where your published report is stored.
  3. Select the report and choose Sign In.

    Figure 12: User audit in sys_query_history

  4. To authenticate, enter your non-admin Microsoft Entra ID (Azure) credentials in the browser pop-up.

    Figure 13: Tableau Server sign-in

  5. After your authentication is successful, you can access the report.

    Figure 14: Tableau report

Verify user identity from Amazon Redshift

As an optional step, you can audit the federated IAM Identity Center user from Amazon Redshift.

Figure 15 is a screenshot from the Amazon Redshift system table (sys_query_history) showing that user Ethan from Microsoft Entra ID is accessing the sales report.

select distinct sys.user_id, pg.usename as username, trim(sys.query_text) as query_text, sys.start_time
from sys_query_history sys
join pg_user_info pg
  on sys.user_id = pg.usesysid
where sys.query_id = <query_id>
  and pg.usesysid = <federateduser_id>
  and sys.query_type = 'SELECT'
order by sys.start_time desc;

Figure 15: User audit in sys_query_history

Clean up

Complete the following steps to clean up your resources:

  1. Delete the IdP applications that you created to integrate with IAM Identity Center.
  2. Delete the IAM Identity Center configuration.
  3. Delete the Amazon Redshift application and the Amazon Redshift provisioned cluster or serverless instance that you created for testing.
  4. Delete the AWS Identity and Access Management (IAM) role and IAM policy that you created as part of the prerequisites for IAM Identity Center and Amazon Redshift integration.
  5. Delete the permission set from IAM Identity Center that you created for Amazon Redshift Query Editor V2 in the management account.

Conclusion

This post explored a streamlined approach to access management for data analytics by using Tableau’s support for OIDC for SSO. The solution facilitates federated user authentication, where user identities from an external IdP are trusted and propagated to Amazon Redshift. You learned how to configure Tableau Desktop and Tableau Server to seamlessly integrate with Amazon Redshift using IAM Identity Center for SSO. By harnessing this integration between a third-party IdP and IAM Identity Center, users can securely access Amazon Redshift data sources within Tableau without managing separate database credentials.

The following are key resources to learn more about Amazon Redshift integration with IAM Identity Center:


About the Authors

Debu Panda is a Senior Manager, Product Management at AWS. He is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world.

Sid Wray is a Senior Product Manager at Salesforce based in the Pacific Northwest with nearly 20 years of experience in Digital Advertising, Data Analytics, Connectivity Integration and Identity and Access Management. He currently focuses on supporting ISV partners for Salesforce Data Cloud.

Adiascar Cisneros is a Tableau Senior Product Manager based in Atlanta, GA. He focuses on the integration of the Tableau Platform with AWS services to amplify the value users get from our products and accelerate their journey to valuable, actionable insights. His background includes analytics, infrastructure, network security, and migrations.

Jade Koskela is a Principal Software Engineer at Salesforce. He has over a decade of experience building Tableau with a focus on areas including data connectivity, authentication, and identity federation.

Harshida Patel is a Principal Solutions Architect, Analytics with AWS.

Maneesh Sharma is a Senior Database Engineer at AWS with more than a decade of experience designing and implementing large-scale data warehouse and analytics solutions. He collaborates with various Amazon Redshift Partners and customers to drive better integration.

Ravi Bhattiprolu is a Senior Partner Solutions Architect at AWS. He collaborates with strategic independent software vendor (ISV) partners like Salesforce and Tableau to design and deliver innovative, well-architected cloud products, integrations, and solutions to help joint AWS customers achieve their business goals.

Harness Zero Copy data sharing from Salesforce Data Cloud to Amazon Redshift for Unified Analytics – Part 1

Post Syndicated from Rajkumar Irudayaraj original https://aws.amazon.com/blogs/big-data/harness-zero-copy-data-sharing-from-salesforce-data-cloud-to-amazon-redshift-for-unified-analytics-part-1/

This post is co-authored by Rajkumar Irudayaraj, Sr. Director of Product, Salesforce Data Cloud.

In today’s ever-evolving business landscape, organizations must harness and act on data to fuel analytics, generate insights, and make informed decisions to deliver exceptional customer experiences. Salesforce and Amazon have collaborated to help customers unlock value from unified data and accelerate time to insights with bidirectional Zero Copy data sharing between Salesforce Data Cloud and Amazon Redshift.

In a previous post, we showed how Zero Copy data federation empowers businesses to access Amazon Redshift data within the Salesforce Data Cloud to enrich customer 360 data with operational data. This two-part series explores how analytics teams can access customer 360 data from Salesforce Data Cloud within Amazon Redshift to generate insights on unified data without the overhead of extract, transform, and load (ETL) pipelines. In this post, we cover data sharing between Salesforce Data Cloud and customers’ AWS accounts in the same AWS Region. Part 2 covers cross-Region data sharing between Salesforce Data Cloud and customers’ AWS accounts.

What is Salesforce Data Cloud?

Salesforce Data Cloud is a data platform that unifies all of your company’s data into Salesforce’s Einstein 1 Platform, giving every team a 360-degree view of the customer to drive automation, create analytics, personalize engagement, and power trusted artificial intelligence (AI). Salesforce Data Cloud creates a holistic customer view by turning volumes of disconnected data into a unified customer profile that’s straightforward to access and understand. This unified view helps your sales, service, and marketing teams build personalized customer experiences, invoke data-driven actions and workflows, and safely drive AI across all Salesforce applications.

What is Amazon Redshift?

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence (BI) tools. It’s optimized for datasets ranging from a few hundred gigabytes to petabytes and delivers better price-performance compared to other data warehousing solutions. With a fully managed, AI-powered, massively parallel processing (MPP) architecture, Amazon Redshift makes business decision-making quick and cost-effective. Amazon Redshift Spectrum enables querying structured and semi-structured data in Amazon Simple Storage Service (Amazon S3) without having to load the data into Redshift tables. Redshift Spectrum integration with AWS Lake Formation enables querying auto-mounted AWS Glue Data Catalog tables with AWS Identity and Access Management (IAM) credentials and harnessing Lake Formation for permission grants and access control policies on Data Catalog views. Salesforce Data Cloud Data sharing with Amazon Redshift leverages AWS Glue Data Catalog support for multi-engine views and Redshift Spectrum integration with Lake Formation.

What is Zero Copy data sharing?

Zero Copy data sharing enables Amazon Redshift customers to query customer 360 data stored in Salesforce Data Cloud without the need for traditional ETL to move or copy the data. Instead, you simply connect and use the data in place, unlocking its value immediately with on demand access to the most recent data. Data sharing is supported with both Amazon Redshift Serverless and provisioned RA3 clusters. Data can be shared with a Redshift Serverless or provisioned cluster in the same Region or with a Redshift Serverless cluster in a different Region. To get an overview of Salesforce Zero Copy integration with Amazon Redshift, please refer to this Salesforce Blog.

Solution overview

Salesforce Data Cloud provides a point-and-click experience to share data with a customer’s AWS account. On the Lake Formation console, you can accept the data share, create the resource link, mount Salesforce Data Cloud objects as data catalog views, and grant permissions to query the live and unified data in Amazon Redshift.

The following diagram depicts the end-to-end process involved for sharing Salesforce Data Cloud data with Amazon Redshift in the same Region using a Zero Copy architecture. This architecture follows the pattern documented in Cross-account data sharing best practices and considerations.

The data share setup consists of the following high-level steps:

  1. The Salesforce Data Cloud admin creates the data share target with the target account for the data share.
  2. The Salesforce Data Cloud admin selects the data cloud objects to be shared with Amazon Redshift and creates a data share.
  3. The Salesforce Data Cloud admin links the data share to the data share target, which invokes the following operations to create a cross-account resource share:
    1. Create a Data Catalog view for the Salesforce Data Cloud Apache Iceberg tables by invoking the Catalog API.
    2. Use Lake Formation sharing to create a cross-account Data Catalog share.
  4. In the customer AWS account, the Lake Formation admin logs in to the Lake Formation console to accept the resource share, create a resource link, and grant access permissions to the Redshift role.
  5. The data analyst launches the Amazon Redshift Query Editor with the appropriate role to query the data share and join with native Redshift tables.

Prerequisites

The following are the prerequisites to enable data sharing:

  • A Salesforce Data Cloud account.
  • An AWS account with AWS Glue and Lake Formation enabled.
  • Either a Redshift Serverless workgroup or a Redshift provisioned cluster with RA3 instance types (ra3.16xlarge, ra3.4xlarge, ra3.xlplus). Data sharing is not supported for other provisioned instance types like DC2 or DS2. The data warehouse must be set up before accessing the data share. If you don’t have an existing provisioned Redshift RA3 cluster, we recommend using a Redshift Serverless namespace for ease of operations and maintenance.
  • The Amazon Redshift service must be running in the same Region where the Salesforce Data Cloud is running.
  • AWS admin roles for Lake Formation and Amazon Redshift:

Create the data share target

Complete the following steps to create the data share target:

  1. In Salesforce Data Cloud, choose App Launcher and choose Data Share Targets.
  2. Choose New and choose Amazon Redshift, then choose Next.
  3. Enter the details for Label, API Name, and Account for the data share target.
  4. Choose Save.

After you save these settings, the S3 Tenant Folder value is populated.

  5. Choose the S3 Tenant Folder link and copy the verification token.

If you’re not signed in to the AWS Management Console, you’ll be redirected to the login page.

  6. Enter the verification token and choose Save.

The data share target turns to active status.

Create a data share

Complete the following steps to create a data share:

  1. Navigate to the Data Share tab in your Salesforce org.
  2. Choose App Launcher and choose Data Shares.

Alternatively, you can navigate to the Data Share tab from your org’s home page.

  3. Choose New, then choose Next.
  4. Provide a label, name, data space, and description, then choose Next.
  5. Select the objects to be included in the share and choose Save.

Link the data share target to the data share

To link the data share target to the data share, complete the following steps:

  1. On the data share record home page, choose Link/Unlink Data Share Target.
  2. Select the data share target you want to link to the data share and choose Save.

The data share must be active before you can accept the resource share on the Lake Formation console.

Accept the data share in Lake Formation

This section provides the detailed steps for accepting the data share invite and configuration steps to mount the data share with Amazon Redshift.

  1. After the data share is successfully linked to the data share target, navigate to the Lake Formation console.

The data share invitation banner is displayed.

  2. Choose Accept and create.

The Accept and create page shows a resource link and provides the option to set up IAM permissions.

  3. In the Principals section, choose the IAM users and roles to grant the default permissions (describe and select) for the data share resource link.
  4. Choose Create.

The resource link created in the previous step appears next to the AWS Glue database resource share on the Lake Formation console.

Query the data share from Redshift Serverless

Launch the query editor for Redshift Serverless and log in as a federated user with the role that has describe and select permissions for the resource link.

The data share tables are auto-mounted, appear under awsdatacatalog, and can be queried as shown in the following screenshot.
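If you want to verify the share with SQL rather than the screenshot, a query like the following minimal sketch works; the database (resource link) and table names are hypothetical, so use the names that appear under awsdatacatalog in your environment:

-- Three-part notation: awsdatacatalog.<resource link database>.<shared table>
select count(*) as shared_row_count
from awsdatacatalog.salesforce_dc_share.unified_individual;  -- hypothetical names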

Query the data share from the Redshift provisioned cluster

To query the data share from the Redshift provisioned cluster, log in to the provisioned cluster as the superuser.

On an editor tab, run the following SQL statement to grant an IAM user access to the Data Catalog:

GRANT USAGE ON DATABASE awsdatacatalog to "IAM:myIAMUser"

IAM:myIAMUser is the IAM user that you want to grant the usage privilege on the Data Catalog to. Alternatively, you can grant the usage privilege to IAMR:myIAMRole for an IAM role. For more details, refer to Querying the AWS Glue Data Catalog.
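For example, the role-based grant mentioned above looks like the following (myIAMRole is a placeholder):

GRANT USAGE ON DATABASE awsdatacatalog to "IAMR:myIAMRole"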

Log in as the user with the role from the previous step using temporary credentials.

You should be able to expand awsdatacatalog and query the data share tables as shown in the following screenshot.

Conclusion

Zero Copy data sharing between Salesforce Data Cloud and Amazon Redshift represents a significant advancement in how organizations can use their customer 360 data. By eliminating the need for data movement, this approach offers real-time insights, reduced costs, and enhanced security. As businesses continue to prioritize data-driven decision-making, Zero Copy data sharing will play a crucial role in unlocking the full potential of customer data across platforms.

This integration empowers organizations to break down data silos, accelerate analytics, and drive more agile customer-centric strategies. To learn more, refer to the following resources:


About the Authors

Rajkumar Irudayaraj is a Senior Product Director at Salesforce with over 20 years of experience in data platforms and services, with a passion for delivering data-powered experiences to customers.

Jason Berkowitz is a Senior Product Manager with AWS Lake Formation. He comes from a background in machine learning and data lake architectures. He helps customers become data-driven.

Ravi Bhattiprolu is a Senior Partner Solutions Architect at AWS. Ravi works with strategic ISV partners, Salesforce and Tableau, to deliver innovative and well-architected products & solutions that help joint customers achieve their business and technical objectives.

Avijit Goswami is a Principal Solutions Architect at AWS specialized in data and analytics. He supports AWS strategic customers in building high-performing, secure, and scalable data lake solutions on AWS using AWS managed services and open source solutions. Outside of his work, Avijit likes to travel, hike, watch sports, and listen to music.

Ife Stewart is a Principal Solutions Architect in the Strategic ISV segment at AWS. She has been engaged with Salesforce Data Cloud over the last 2 years to help build integrated customer experiences across Salesforce and AWS. Ife has over 10 years of experience in technology. She is an advocate for diversity and inclusion in the technology field.

Michael Chess is a Technical Product Manager at AWS Lake Formation. He focuses on improving data permissions across the data lake. He is passionate about ensuring customers can build and optimize their data lakes to meet stringent security requirements.

Mike Patterson is a Senior Customer Solutions Manager in the Strategic ISV segment at AWS. He has partnered with Salesforce Data Cloud to align business objectives with innovative AWS solutions to achieve impactful customer experiences. In his spare time, he enjoys spending time with his family, sports, and outdoor activities.

How Kaplan, Inc. implemented modern data pipelines using Amazon MWAA and Amazon AppFlow with Amazon Redshift as a data warehouse

Post Syndicated from Jimy Matthews original https://aws.amazon.com/blogs/big-data/how-kaplan-inc-implemented-modern-data-pipelines-using-amazon-mwaa-and-amazon-appflow-with-amazon-redshift-as-a-data-warehouse/

This post is co-written with Hemant Aggarwal and Naveen Kambhoji from Kaplan.

Kaplan, Inc. provides individuals, educational institutions, and businesses with a broad array of services, supporting our students and partners to meet their diverse and evolving needs throughout their educational and professional journeys. Our Kaplan culture empowers people to achieve their goals. Committed to fostering a learning culture, Kaplan is changing the face of education.

Kaplan data engineers empower data analytics using Amazon Redshift and Tableau. The infrastructure provides an analytics experience to hundreds of in-house analysts, data scientists, and student-facing frontend specialists. The data engineering team is on a mission to modernize its data integration platform to be agile, adaptive, and straightforward to use. To achieve this, they chose the AWS Cloud and its services. There are various types of pipelines that need to be migrated from the existing integration platform to the AWS Cloud, and the pipelines have different types of sources like Oracle, Microsoft SQL Server, MongoDB, Amazon DocumentDB (with MongoDB compatibility), APIs, software as a service (SaaS) applications, and Google Sheets. In terms of scale, at the time of writing over 250 objects are being pulled from three different Salesforce instances.

In this post, we discuss how the Kaplan data engineering team implemented data integration from the Salesforce application to Amazon Redshift. The solution uses Amazon Simple Storage Service as a data lake, Amazon Redshift as a data warehouse, Amazon Managed Workflows for Apache Airflow (Amazon MWAA) as an orchestrator, and Tableau as the presentation layer.

Solution overview

The high-level data flow starts with the source data stored in Amazon S3 and then integrated into Amazon Redshift using various AWS services. The following diagram illustrates this architecture.

Amazon MWAA is our main tool for data pipeline orchestration and is integrated with other tools for data migration. While searching for a tool to migrate data from a SaaS application like Salesforce to Amazon Redshift, we came across Amazon AppFlow. After some research, we found Amazon AppFlow to be well-suited for our requirement to pull data from Salesforce. Amazon AppFlow provides the ability to directly migrate data from Salesforce to Amazon Redshift. However, in our architecture, we chose to separate the data ingestion and storage processes for the following reasons:

  • We needed to store data in Amazon S3 (data lake) as an archive and a centralized location for our data infrastructure.
  • From a future perspective, there might be scenarios where we need to transform the data before storing it in Amazon Redshift. By storing the data in Amazon S3 as an intermediate step, we can integrate transformation logic as a separate module without impacting the overall data flow significantly.
  • Apache Airflow is the central point in our data infrastructure, and other pipelines are being built using various tools like AWS Glue. Amazon AppFlow is one part of our overall infrastructure, and we wanted to maintain a consistent approach across different data sources and targets.

To accommodate these requirements, we divided the pipeline into two parts:

  • Migrate data from Salesforce to Amazon S3 using Amazon AppFlow
  • Load data from Amazon S3 to Amazon Redshift using Amazon MWAA

This approach allows us to take advantage of the strengths of each service while maintaining flexibility and scalability in our data infrastructure. Amazon AppFlow can handle the first part of the pipeline without the need for any other tool, because Amazon AppFlow provides functionalities like creating a connection to source and target, scheduling the data flow, and creating filters, and we can choose the type of flow (incremental and full load). With this, we were able to migrate the data from Salesforce to an S3 bucket. Afterwards, we created a DAG in Amazon MWAA that runs an Amazon Redshift COPY command on the data stored in Amazon S3 and moves the data into Amazon Redshift.
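For illustration, the COPY step issued by the DAG might look like the following sketch; the table, S3 prefix, IAM role, and file format are hypothetical placeholders and will differ in the actual Kaplan pipeline:

-- Load the files that Amazon AppFlow wrote to the S3 data lake into Amazon Redshift.
COPY analytics.sf_account                                         -- hypothetical target table
FROM 's3://example-data-lake/salesforce/account/2024/06/01/'      -- hypothetical AppFlow output prefix
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole' -- hypothetical role with read access to the bucket
FORMAT AS JSON 'auto'
TIMEFORMAT 'auto';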

We faced the following challenges with this approach:

  • To load data incrementally, we had to manually change the filter dates in the Amazon AppFlow flows, which isn’t elegant. We wanted to automate that date filter change.
  • The two parts of the pipeline were not in sync because there was no way to know whether the first part of the pipeline had completed so that the second part could start. We wanted to automate these steps as well.

Implementing the solution

To automate and resolve the aforementioned challenges, we used Amazon MWAA. We created a DAG that acts as the control center for Amazon AppFlow. We developed an Airflow operator that can perform various Amazon AppFlow functions using Amazon AppFlow APIs like creating, updating, deleting, and starting flows, and this operator is used in the DAG. Amazon AppFlow stores the connection data in an AWS Secrets Manager managed secret with the prefix appflow. The cost of storing the secret is included with the charge for Amazon AppFlow. With this, we were able to run the complete data flow using a single DAG.

The complete data flow consists of the following steps:

  1. Create the flow in the Amazon AppFlow using a DAG.
  2. Update the flow with the new filter dates using the DAG.
  3. After updating the flow, the DAG starts the flow.
  4. The DAG waits for the flow to complete by repeatedly checking the flow’s status.
  5. A success status indicates that the data has been migrated from Salesforce to Amazon S3.
  6. After the data flow is complete, the DAG calls the COPY command to copy data from Amazon S3 to Amazon Redshift.

This approach helped us resolve the aforementioned issues, and the data pipelines have become more robust, simple to understand, straightforward to use with no manual intervention, and less prone to error because we are controlling everything from a single point (Amazon MWAA). Amazon AppFlow, Amazon S3, and Amazon Redshift are all configured to use encryption to protect the data. We also performed logging and monitoring, and implemented auditing mechanisms to track the data flow and access using AWS CloudTrail and Amazon CloudWatch. The following figure shows a high-level diagram of the final approach we took.

Conclusion

In this post, we shared how Kaplan’s data engineering team successfully implemented a robust and automated data integration pipeline from Salesforce to Amazon Redshift, using AWS services like Amazon AppFlow, Amazon S3, Amazon Redshift, and Amazon MWAA. By creating a custom Airflow operator to control Amazon AppFlow functionalities, we orchestrated the entire data flow seamlessly within a single DAG. This approach has not only resolved the challenges of incremental data loading and synchronization between different pipeline stages, but has also made the data pipelines more resilient, straightforward to maintain, and less error-prone. We reduced the time for creating a pipeline for a new object from an existing instance and a new pipeline for a new source by 50%. This also helped remove the complexity of using a delta column to get the incremental data, which also helped reduce the cost per table by 80–90% compared to a full load of objects every time.

With this modern data integration platform in place, Kaplan is well-positioned to provide its analysts, data scientists, and student-facing teams with timely and reliable data, empowering them to drive informed decisions and foster a culture of learning and growth.

Try out Airflow with Amazon MWAA and other enhancements to improve your data orchestration pipelines.

For additional details and code examples of Amazon MWAA, refer to the Amazon MWAA User Guide and the Amazon MWAA examples GitHub repo.


About the Authors

Hemant Aggarwal is a senior Data Engineer at Kaplan India Pvt Ltd, helping in developing and managing ETL pipelines leveraging AWS and process/strategy development for the team.

Naveen Kambhoji is a Senior Manager at Kaplan Inc. He works with Data Engineers at Kaplan for building data lakes using AWS Services. He is the facilitator for the entire migration process. His passion is building scalable distributed systems for efficiently managing data on the cloud. Outside work, he enjoys travelling with his family and exploring new places.

Jimy Matthews is an AWS Solutions Architect, with expertise in AI/ML tech. Jimy is based out of Boston and works with enterprise customers as they transform their business by adopting the cloud and helps them build efficient and sustainable solutions. He is passionate about his family, cars, and mixed martial arts.

Optimize your workloads with Amazon Redshift Serverless AI-driven scaling and optimization

Post Syndicated from Satesh Sonti original https://aws.amazon.com/blogs/big-data/optimize-your-workloads-with-amazon-redshift-serverless-ai-driven-scaling-and-optimization/

The current scaling approach of Amazon Redshift Serverless increases your compute capacity based on the query queue time and scales down when the queuing reduces on the data warehouse. However, you might need to automatically scale compute resources based on factors like query complexity and data volume to meet price-performance targets, irrespective of query queuing. To address this requirement, Redshift Serverless launched the artificial intelligence (AI)-driven scaling and optimization feature, which scales the compute not only based on the queuing, but also factoring data volume and query complexity.

In this post, we describe how Redshift Serverless utilizes the new AI-driven scaling and optimization capabilities to address common use cases. This post also includes example SQLs, which you can run on your own Redshift Serverless data warehouse to experience the benefits of this feature.

Solution overview

The AI-powered scaling and optimization feature in Redshift Serverless provides a user-friendly visual slider to set your desired balance between price and performance. By moving the slider, you can choose between optimized for cost, balanced performance and cost, or optimized for performance. Based on where you position the slider, Amazon Redshift automatically adds or removes compute resources and performs other AI-driven optimizations, such as automatic materialized views and automatic table design optimization, to meet your selected price-performance target.

Price Performance Slider

The slider offers the following options:

  • Optimized for cost – Prioritizes cost savings. Redshift scales up compute capacity only when doing so doesn’t incur additional charges, and it will also attempt to scale down compute for lower cost, even at the expense of longer runtimes.
  • Balanced – Offers balance between performance and cost. Redshift scales for performance with a moderate cost increase.
  • Optimized for performance – Prioritizes performance. Redshift scales aggressively for maximum performance, potentially incurring higher costs.

In the following sections, we illustrate how the AI-driven scaling and optimization feature can intelligently predict your workload compute needs and scale proactively for three scenarios:

  • Use case 1 – A long-running complex query. Compute scales based on query complexity.
  • Use case 2 – A sudden spike in ingestion volume (a three-fold increase, from 720 million to 2.1 billion). Compute scales based on data volume.
  • Use case 3 – A data lake query scanning large datasets (TBs). Compute scales based on the expected data to be scanned from the data lake. The expected data scan is predicted by machine learning (ML) models based on prior historical run statistics.

With the existing queue-based auto scaling mechanism, none of these use cases would increase compute capacity automatically unless queuing is detected on the data warehouse.

Prerequisites

To follow along, complete the following prerequisites:

  1. Create a Redshift Serverless workgroup in preview mode. For instructions, see Creating a preview workgroup.
  2. While creating the preview group, choose Performance and Cost Controls and Price-performance target, and adjust the slider to Optimized for performance. For more information, refer to Amazon Redshift adds new AI capabilities, including Amazon Q, to boost efficiency and productivity.
  3. Set up an AWS Identity and Access Management (IAM) role as the default IAM role. Refer to Managing IAM roles created for a cluster using the console for instructions.
  4. We use TPC-DS 1TB Cloud Data Warehouse Benchmark data to demonstrate this feature. Run the SQL statements to create tables and load the TPC-DS 1TB data.
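For reference, loading one of the TPC-DS tables might look like the following sketch; it assumes the 1TB files follow the same public bucket layout and pipe-delimited gzip format as the 3TB path used later in this post, so verify the exact prefix in the benchmark instructions before running it:

copy store_sales from 's3://redshift-downloads/TPC-DS/2.13/1TB/store_sales/'
iam_role default gzip delimiter '|' EMPTYASNULL region 'us-east-1';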

Use case 1: Scale compute based on query complexity

The following query analyzes product sales across multiple channels such as websites, wholesale, and retail stores. This complex query typically takes about 25 minutes to run with the default 128 RPUs. Let’s run this workload on the preview workgroup created as part of prerequisites.

When a query is run for the first time, the AI scaling system may make a suboptimal decision regarding resource allocation or scaling as the system is still learning the query and data characteristics. However, the system learns from this experience, and when the same query is run again, it can make a more optimal scaling decision. Therefore, if the query didn’t scale during the first run, it is recommended to rerun the query. You can monitor the RPU capacity used on the Redshift Serverless console or by querying the SYS_SERVERLESS_USAGE system view.

The results cache is turned off in the following queries to avoid fetching results from the cache.

SET enable_result_cache_for_session TO off;
with /* TPC-DS demo query */
ws as (
    select d_year as ws_sold_year, ws_item_sk, ws_bill_customer_sk ws_customer_sk,
           sum(ws_quantity) ws_qty, sum(ws_wholesale_cost) ws_wc, sum(ws_sales_price) ws_sp
    from web_sales
    left join web_returns on wr_order_number = ws_order_number and ws_item_sk = wr_item_sk
    join date_dim on ws_sold_date_sk = d_date_sk
    where wr_order_number is null
    group by d_year, ws_item_sk, ws_bill_customer_sk
),
cs as (
    select d_year as cs_sold_year, cs_item_sk, cs_bill_customer_sk cs_customer_sk,
           sum(cs_quantity) cs_qty, sum(cs_wholesale_cost) cs_wc, sum(cs_sales_price) cs_sp
    from catalog_sales
    left join catalog_returns on cr_order_number = cs_order_number and cs_item_sk = cr_item_sk
    join date_dim on cs_sold_date_sk = d_date_sk
    where cr_order_number is null
    group by d_year, cs_item_sk, cs_bill_customer_sk
),
ss as (
    select d_year as ss_sold_year, ss_item_sk, ss_customer_sk,
           sum(ss_quantity) ss_qty, sum(ss_wholesale_cost) ss_wc, sum(ss_sales_price) ss_sp
    from store_sales
    left join store_returns on sr_ticket_number = ss_ticket_number and ss_item_sk = sr_item_sk
    join date_dim on ss_sold_date_sk = d_date_sk
    where sr_ticket_number is null
    group by d_year, ss_item_sk, ss_customer_sk
)
select ss_customer_sk,
       round(ss_qty / (coalesce(ws_qty + cs_qty, 1)), 2) ratio,
       ss_qty store_qty, ss_wc store_wholesale_cost, ss_sp store_sales_price,
       coalesce(ws_qty, 0) + coalesce(cs_qty, 0) other_chan_qty,
       coalesce(ws_wc, 0) + coalesce(cs_wc, 0) other_chan_wholesale_cost,
       coalesce(ws_sp, 0) + coalesce(cs_sp, 0) other_chan_sales_price
from ss
left join ws on (ws_sold_year = ss_sold_year and ws_item_sk = ss_item_sk and ws_customer_sk = ss_customer_sk)
left join cs on (cs_sold_year = ss_sold_year and cs_item_sk = cs_item_sk and cs_customer_sk = ss_customer_sk)
where coalesce(ws_qty, 0) > 0
  and coalesce(cs_qty, 0) > 0
order by ss_customer_sk, ss_qty desc, ss_wc desc, ss_sp desc,
         other_chan_qty, other_chan_wholesale_cost, other_chan_sales_price,
         round(ss_qty / (coalesce(ws_qty + cs_qty, 1)), 2);

When the query is complete, run the following SQL to capture the start and end times of the query, which will be used in the next query:

select query_id,query_text,start_time,end_time, elapsed_time/1000000.0 duration_in_seconds
from sys_query_history
where query_text like '%TPC-DS demo query%'
and query_text not like '%sys_query_history%'
order by start_time desc

Let’s assess the compute scaled during the preceding start_time and end_time period. Replace start_time and end_time in the following query with the output of the preceding query:

select * from sys_serverless_usage
where end_time >= 'start_time'
and end_time <= DATEADD(minute,1,'end_time')
order by end_time asc

-- Example
--select * from sys_serverless_usage
--where end_time >= '2024-06-03 00:17:12.322353'
--and end_time <= DATEADD(minute,1,'2024-06-03 00:19:11.553218')
--order by end_time asc

The following screenshot shows an example output.

Use Case 1 output

You can notice the increase in compute over the duration of this query. This demonstrates how Redshift Serverless scales based on query complexity.

Use case 2: Scale compute based on data volume

Let’s consider the web_sales ingestion job. For this example, your daily ingestion job processes 720 million records and completes in an average of 2 minutes. This is what you ingested in the prerequisite steps.

Due to an event such as month-end processing, your volume increases threefold and your ingestion job now needs to process 2.1 billion records. With the existing scaling approach, this would increase your ingestion job runtime unless the queue time is long enough to invoke additional compute resources. But with AI-driven scaling in performance optimized mode, Amazon Redshift automatically scales compute to complete your ingestion job within the usual runtime. This helps protect your ingestion SLAs.

Run the following job to ingest 2.1 billion records into the web_sales table:

copy web_sales from 's3://redshift-downloads/TPC-DS/2.13/3TB/web_sales/' iam_role default gzip delimiter '|' EMPTYASNULL region 'us-east-1';

Run the following query to compare the duration of ingesting 2.1 billion records and 720 million records. Both ingestion jobs completed in approximately the same amount of time, despite the three-fold increase in volume.

select query_id,table_name,data_source,loaded_rows,duration/1000000.0 duration_in_seconds , start_time,end_time
from sys_load_history
where
table_name='web_sales'
order by start_time desc

Run the following query with the start times and end times from the previous output:

select * from sys_serverless_usage
where end_time >= 'start_time'
and end_time <= DATEADD(minute,1,'end_time')
order by end_time asc

The following is an example output. You can notice the increase in compute capacity for the ingestion job that processes 2.1 billion records. This illustrates how Redshift Serverless scaled based on data volume.

Use Case 2 Output

Use case 3: Scale data lake queries

In this use case, you create external tables pointing to TPC-DS 3TB data in an Amazon Simple Storage Service (Amazon S3) location. Then you run a query that scans a large volume of data to demonstrate how Redshift Serverless can automatically scale compute capacity as needed.

In the following SQL, provide the ARN of the default IAM role you attached in the prerequisites:

-- Create external schema
create external schema ext_tpcds_3t
from data catalog
database ext_tpcds_db
iam_role '<ARN of the default IAM role attached>'
create external database if not exists;

Create external tables by running DDL statements in the following SQL file. You should see seven external tables in the query editor under the ext_tpcds_3t schema, as shown in the following screenshot.

External Tables
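If you don’t have the DDL file at hand, the definition of one of the seven external tables might look like the following sketch; it assumes the 3TB files are pipe-delimited, gzip-compressed text under the same public prefix used by the earlier COPY example, so adjust the format and location if your source differs:

-- Sketch of one external table; repeat the pattern for the remaining TPC-DS tables.
create external table ext_tpcds_3t.date_dim (
    d_date_sk            integer,
    d_date_id            char(16),
    d_date               varchar(10),   -- declared as varchar in this sketch for simplicity
    d_month_seq          integer,
    d_week_seq           integer,
    d_quarter_seq        integer,
    d_year               integer,
    d_dow                integer,
    d_moy                integer,
    d_dom                integer,
    d_qoy                integer,
    d_fy_year            integer,
    d_fy_quarter_seq     integer,
    d_fy_week_seq        integer,
    d_day_name           char(9),
    d_quarter_name       char(6),
    d_holiday            char(1),
    d_weekend            char(1),
    d_following_holiday  char(1),
    d_first_dom          integer,
    d_last_dom           integer,
    d_same_day_ly        integer,
    d_same_day_lq        integer,
    d_current_day        char(1),
    d_current_week       char(1),
    d_current_month      char(1),
    d_current_quarter    char(1),
    d_current_year       char(1)
)
row format delimited fields terminated by '|'
stored as textfile
location 's3://redshift-downloads/TPC-DS/2.13/3TB/date_dim/';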

Run the following query using external tables. As mentioned in the first use case, if the query didn’t scale during the first run, it is recommended to rerun the query, because the system will have learned from the previous experience and can potentially provide better scaling and performance for the subsequent run.

The results cache is turned off in the following queries to avoid fetching results from the cache.

SET enable_result_cache_for_session TO off;

with /* TPC-DS demo data lake query */
ws as (
    select d_year as ws_sold_year, ws_item_sk, ws_bill_customer_sk ws_customer_sk,
           sum(ws_quantity) ws_qty, sum(ws_wholesale_cost) ws_wc, sum(ws_sales_price) ws_sp
    from ext_tpcds_3t.web_sales
    left join ext_tpcds_3t.web_returns on wr_order_number = ws_order_number and ws_item_sk = wr_item_sk
    join ext_tpcds_3t.date_dim on ws_sold_date_sk = d_date_sk
    where wr_order_number is null
    group by d_year, ws_item_sk, ws_bill_customer_sk
),
cs as (
    select d_year as cs_sold_year, cs_item_sk, cs_bill_customer_sk cs_customer_sk,
           sum(cs_quantity) cs_qty, sum(cs_wholesale_cost) cs_wc, sum(cs_sales_price) cs_sp
    from ext_tpcds_3t.catalog_sales
    left join ext_tpcds_3t.catalog_returns on cr_order_number = cs_order_number and cs_item_sk = cr_item_sk
    join ext_tpcds_3t.date_dim on cs_sold_date_sk = d_date_sk
    where cr_order_number is null
    group by d_year, cs_item_sk, cs_bill_customer_sk
),
ss as (
    select d_year as ss_sold_year, ss_item_sk, ss_customer_sk,
           sum(ss_quantity) ss_qty, sum(ss_wholesale_cost) ss_wc, sum(ss_sales_price) ss_sp
    from ext_tpcds_3t.store_sales
    left join ext_tpcds_3t.store_returns on sr_ticket_number = ss_ticket_number and ss_item_sk = sr_item_sk
    join ext_tpcds_3t.date_dim on ss_sold_date_sk = d_date_sk
    where sr_ticket_number is null
    group by d_year, ss_item_sk, ss_customer_sk
)
select ss_customer_sk,
       round(ss_qty / (coalesce(ws_qty + cs_qty, 1)), 2) ratio,
       ss_qty store_qty, ss_wc store_wholesale_cost, ss_sp store_sales_price,
       coalesce(ws_qty, 0) + coalesce(cs_qty, 0) other_chan_qty,
       coalesce(ws_wc, 0) + coalesce(cs_wc, 0) other_chan_wholesale_cost,
       coalesce(ws_sp, 0) + coalesce(cs_sp, 0) other_chan_sales_price
from ss
left join ws on (ws_sold_year = ss_sold_year and ws_item_sk = ss_item_sk and ws_customer_sk = ss_customer_sk)
left join cs on (cs_sold_year = ss_sold_year and cs_item_sk = cs_item_sk and cs_customer_sk = ss_customer_sk)
where coalesce(ws_qty, 0) > 0
  and coalesce(cs_qty, 0) > 0
order by ss_customer_sk, ss_qty desc, ss_wc desc, ss_sp desc,
         other_chan_qty, other_chan_wholesale_cost, other_chan_sales_price,
         round(ss_qty / (coalesce(ws_qty + cs_qty, 1)), 2);

Review the total elapsed time of the query. You need the start_time and end_time from the results to feed into the next query.

select query_id,query_text,start_time,end_time, elapsed_time/1000000.0 duration_in_seconds
from sys_query_history
where query_text like '%TPC-DS demo data lake query%'
and query_text not like '%sys_query_history%'
order by start_time desc

Run the following query to see how compute scaled during the preceding start_time and end_time period. Replace start_time and end_time in the following query from the output of the preceding query:

select * from sys_serverless_usage
where end_time >= 'start_time'
and end_time <= DATEADD(minute,1,'end_time')
order by end_time asc

The following screenshot shows an example output.

Use Case 3 Output

The increased compute capacity for this data lake query shows that Redshift Serverless can scale to match the data being scanned. This demonstrates how Redshift Serverless can dynamically allocate resources based on query needs.

Considerations when choosing your price-performance target

You can use the price-performance slider to choose your desired price-performance target for your workload. The AI-driven scaling and optimizations provide holistic optimizations using the following models:

  • Query prediction models – These determine the actual resource needs (memory, CPU consumption, and so on) for each individual query
  • Scaling prediction models – These predict how the query would behave on different capacity sizes

Let’s consider a query that takes 7 minutes and costs $7. The following figure shows the query runtimes and cost with no scaling.

Scaling Type Example

A given query might scale in a few different ways, as shown below. Based on the price-performance target you chose on the slider, AI-driven scaling predicts how the query trades off performance and cost, and scales it accordingly.

Scaling Types

The slider options yield the following results:

  • Optimized for cost – When you choose Optimized for cost, the warehouse scales up only if it can do so at no additional cost, or at a lower cost, to the user. In the preceding example, the superlinear scaling approach demonstrates this behavior. Scaling will only occur if it can be done in a cost-effective manner according to the scaling model predictions. If the scaling models predict that cost-optimized scaling isn’t possible for the given workload, then the warehouse won’t scale.
  • Balanced – With the Balanced option, the system scales in favor of performance with a limited cost increase. In the preceding example, the linear scaling approach demonstrates this behavior.
  • Optimized for performance – With the Optimized for performance option, the system scales in favor of performance even though the costs are higher and non-linear. In the preceding example, the sublinear scaling approach demonstrates this behavior. The closer the slider position is to the Optimized for performance position, the more sublinear scaling is permitted. (See the illustrative numbers after this list.)
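As a rough, hypothetical illustration of these trade-offs (assuming charges are proportional to RPU-hours and the 7-minute, $7 query above is scaled to twice its base RPUs):

  • Superlinear scaling – runtime drops to 3 minutes, so cost ≈ $7 × 2 × (3/7) = $6; the query is faster and cheaper, which is why Optimized for cost permits it.
  • Linear scaling – runtime drops to 3.5 minutes, so cost ≈ $7 × 2 × (3.5/7) = $7; the query is faster at roughly the same cost, typical of Balanced.
  • Sublinear scaling – runtime drops to 4.5 minutes, so cost ≈ $7 × 2 × (4.5/7) = $9; the query is faster but more expensive, which only Optimized for performance allows.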

The following are additional points to note:

  • The price-performance slider options are dynamic and can be changed at any time. However, the impact of these changes is not realized immediately; it takes effect as the system learns how to better scale the current workload and any additional workloads.
  • The price-performance slider works together with the Max capacity and Max RPU-hours settings. Max capacity limits the maximum RPUs the data warehouse is allowed to scale to, and Max RPU-hours limits the maximum RPU hours it is allowed to consume. These controls are always honored and enforced regardless of the settings on the price-performance target slider.
  • The AI-driven scaling and optimization feature dynamically adjusts compute resources to optimize query runtime speed while adhering to your price-performance requirements. It considers factors such as query queueing, concurrency, volume, and complexity. The system can either run queries on a compute resource with lower concurrent queries or spin up additional compute resources to avoid queueing. The goal is to provide the best price-performance balance based on your choices.

Monitoring

You can monitor the RPU scaling in the following ways:

  • Review the RPU capacity used graph on the Amazon Redshift console.
  • Monitor the ComputeCapacity metric under AWS/Redshift-Serverless and Workgroup in Amazon CloudWatch.
  • Query the SYS_QUERY_HISTORY view, providing the specific query ID or query text to identify the time period. Use this time period to query the SYS_SERVERLESS_USAGE system view to find the compute_capacity field, which shows the RPUs scaled during the query runtime (see the example query after this list).
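The following is a minimal sketch of that approach; it assumes the query you’re investigating contains the text 'TPC-DS demo query', so replace the LIKE pattern with text from your own query:

-- Find the time window of a query, then check the RPUs used during that window.
with q as (
    select start_time, end_time
    from sys_query_history
    where query_text like '%TPC-DS demo query%'
    order by start_time desc
    limit 1
)
select u.end_time, u.compute_capacity
from sys_serverless_usage u, q
where u.end_time between q.start_time and dateadd(minute, 1, q.end_time)
order by u.end_time;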

Refer to Configure monitoring, limits, and alarms in Amazon Redshift Serverless to keep costs predictable for the step-by-step instructions on using these approaches.

Clean up

Complete the following steps to delete the resources you created to avoid unexpected costs:

  1. Delete the Redshift Serverless workgroup.
  2. Delete the Redshift Serverless associated namespace.

Conclusion

In this post, we discussed how to optimize your workloads to scale based on the changes in data volume and query complexity. We demonstrated an approach to implement more responsive, proactive scaling with the AI-driven scaling feature in Redshift Serverless. Try this feature in your environment, conduct a proof of concept on your specific workloads, and share your feedback with us.


About the Authors

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 19 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Ashish Agrawal is a Principal Product Manager with Amazon Redshift, building cloud-based data warehouses and analytics cloud services. Ashish has over 25 years of experience in IT. Ashish has expertise in data warehouses, data lakes, and platform as a service. Ashish has been a speaker at worldwide technical conferences.

Davide Pagano is a Software Development Manager with Amazon Redshift based out of Palo Alto, specialized in building cloud-based data warehouses and analytics cloud services solutions. He has over 10 years of experience with databases, including 6 years focused on Amazon Redshift.

Seamless integration of data lake and data warehouse using Amazon Redshift Spectrum and Amazon DataZone

Post Syndicated from Lakshmi Nair original https://aws.amazon.com/blogs/big-data/seamless-integration-of-data-lake-and-data-warehouse-using-amazon-redshift-spectrum-and-amazon-datazone/

Unlocking the true value of data often gets impeded by siloed information. Traditional data management—wherein each business unit ingests raw data in separate data lakes or warehouses—hinders visibility and cross-functional analysis. A data mesh framework empowers business units with data ownership and facilitates seamless sharing.

However, integrating datasets from different business units can present several challenges. Each business unit exposes data assets with varying formats and granularity levels, and applies different data validation checks. Unifying these necessitates additional data processing, requiring each business unit to provision and maintain a separate data warehouse. This burdens business units focused solely on consuming the curated data for analysis and not concerned with data management tasks, cleansing, or comprehensive data processing.

In this post, we explore a robust architecture pattern of a data sharing mechanism by bridging the gap between data lake and data warehouse using Amazon DataZone and Amazon Redshift.

Solution overview

Amazon DataZone is a data management service that makes it straightforward for business units to catalog, discover, share, and govern their data assets. Business units can curate and expose their readily available domain-specific data products through Amazon DataZone, providing discoverability and controlled access.

Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. Thousands of customers use Amazon Redshift data sharing to enable instant, granular, and fast data access across Amazon Redshift provisioned clusters and serverless workgroups. This allows you to scale your read and write workloads to thousands of concurrent users without having to move or copy the data. Amazon DataZone natively supports data sharing for Amazon Redshift data assets. With Amazon Redshift Spectrum, you can query the data in your Amazon Simple Storage Service (Amazon S3) data lake using a central AWS Glue metastore from your Redshift data warehouse. This capability extends your petabyte-scale Redshift data warehouse to unbounded data storage limits, which allows you to scale to exabytes of data cost-effectively.

The following figure shows a typical distributed and collaborative architectural pattern implemented using Amazon DataZone. Business units can simply share data and collaborate by publishing and subscribing to the data assets.

Hub-and-spoke data sharing pattern with Amazon DataZone

The Central IT team (Spoke N) subscribes the data from individual business units and consumes this data using Redshift Spectrum. The Central IT team applies standardization and performs the tasks on the subscribed data such as schema alignment, data validation checks, collating the data, and enrichment by adding additional context or derived attributes to the final data asset. This processed unified data can then persist as a new data asset in Amazon Redshift managed storage to meet the SLA requirements of the business units. The new processed data asset produced by the Central IT team is then published back to Amazon DataZone. With Amazon DataZone, individual business units can discover and directly consume these new data assets, gaining insights to a holistic view of the data (360-degree insights) across the organization.

The Central IT team manages a unified Redshift data warehouse, handling all data integration, processing, and maintenance. Business units access clean, standardized data. To consume the data, they can choose between a provisioned Redshift cluster for consistent high-volume needs or Amazon Redshift Serverless for variable, on-demand analysis. This model enables the units to focus on insights, with costs aligned to actual consumption. This allows the business units to derive value from data without the burden of data management tasks.

This streamlined architecture approach offers several advantages:

  • Single source of truth – The Central IT team acts as the custodian of the combined and curated data from all business units, thereby providing a unified and consistent dataset. The Central IT team implements data governance practices, providing data quality, security, and compliance with established policies. A centralized data warehouse for processing is often more cost-efficient, and its scalability allows organizations to dynamically adjust their storage needs. Similarly, individual business units produce their own domain-specific data. There are no duplicate data products created by business units or the Central IT team.
  • Eliminating dependency on business units – Redshift Spectrum uses a metadata layer to directly query the data residing in S3 data lakes, eliminating the need for data copying or relying on individual business units to initiate the copy jobs. This significantly reduces the risk of errors associated with data transfer or movement and data copies.
  • Eliminating stale data – Avoiding duplication of data also eliminates the risk of stale data existing in multiple locations.
  • Incremental loading – Because the Central IT team can directly query the data on the data lakes using Redshift Spectrum, they have the flexibility to query only the relevant columns needed for the unified analysis and aggregations. This can be done using mechanisms to detect the incremental data from the data lakes and process only the new or updated data, further optimizing resource utilization.
  • Federated governance – Amazon DataZone facilitates centralized governance policies, providing consistent data access and security across all business units. Sharing and access controls remain confined within Amazon DataZone.
  • Enhanced cost appropriation and efficiency – This method confines the cost overhead of processing and integrating the data with the Central IT team. Individual business units can provision the Redshift Serverless data warehouse to solely consume the data. This way, each unit can clearly demarcate the consumption costs and impose limits. Additionally, the Central IT team can choose to apply chargeback mechanisms to each of these units.

In this post, we use a simplified use case, as shown in the following figure, to bridge the gap between data lakes and data warehouses using Redshift Spectrum and Amazon DataZone.

custom blueprints and spectrum

The underwriting business unit curates the data asset using AWS Glue and publishes the data asset Policies in Amazon DataZone. The Central IT team subscribes to the data asset from the underwriting business unit. 

We focus on how the Central IT team consumes the subscribed data lake asset from business units using Redshift Spectrum and creates a new unified data asset.

Prerequisites

The following prerequisites must be in place:

  • AWS accounts – You should have active AWS accounts before you proceed. If you don’t have one, refer to How do I create and activate a new AWS account? In this post, we use three AWS accounts. If you’re new to Amazon DataZone, refer to Getting started.
  • A Redshift data warehouse – You can create a provisioned cluster following the instructions in Create a sample Amazon Redshift cluster, or provision a serverless workgroup following the instructions in Get started with Amazon Redshift Serverless data warehouses.
  • Amazon Data Zone resources – You need a domain for Amazon DataZone, an Amazon DataZone project, and a new Amazon DataZone environment (with a custom AWS service blueprint).
  • Data lake asset – The data lake asset Policies from the business units was already onboarded to Amazon DataZone and subscribed by the Central IT team. To understand how to associate multiple accounts and consume the subscribed assets using Amazon Athena, refer to Working with associated accounts to publish and consume data.
  • Central IT environment – The Central IT team has created an environment called env_central_team and uses an existing AWS Identity and Access Management (IAM) role called custom_role, which grants Amazon DataZone access to AWS services and resources, such as Athena, AWS Glue, and Amazon Redshift, in this environment. To add all the subscribed data assets to a common AWS Glue database, the Central IT team configures a subscription target and uses central_db as the AWS Glue database.
  • IAM role – Make sure that the IAM role that you want to enable in the Amazon DataZone environment has necessary permissions to your AWS services and resources. The following example policy provides sufficient AWS Lake Formation and AWS Glue permissions to access Redshift Spectrum:
{
	"Version": "2012-10-17",
	"Statement": [{
		"Effect": "Allow",
		"Action": [
			"lakeformation:GetDataAccess",
			"glue:GetTable",
			"glue:GetTables",
			"glue:SearchTables",
			"glue:GetDatabase",
			"glue:GetDatabases",
			"glue:GetPartition",
			"glue:GetPartitions"
		],
		"Resource": "*"
	}]
}

As shown in the following screenshot, the Central IT team has subscribed to the data Policies. The data asset is added to the env_central_team environment. Amazon DataZone will assume the custom_role to help federate the environment user (central_user) to the action link in Athena. The subscribed asset Policies is added to the central_db database. This asset is then queried and consumed using Athena.

The goal of the Central IT team is to consume the subscribed data lake asset Policies with Redshift Spectrum. This data is further processed and curated into the central data warehouse using the Amazon Redshift Query Editor v2 and stored as a single source of truth in Amazon Redshift managed storage. In the following sections, we illustrate how to consume the subscribed data lake asset Policies from Redshift Spectrum without copying the data.

Automatically mount access grants to the Amazon DataZone environment role

Amazon Redshift automatically mounts the AWS Glue Data Catalog in the Central IT team account as a database and allows users to query the data lake tables with three-part notation. This is available by default with the Admin role.

To grant the required access to the mounted Data Catalog tables for the environment role (custom_role), complete the following steps:

  1. Log in to the Amazon Redshift Query Editor v2 using the Amazon DataZone deep link.
  2. In the Query Editor v2, choose your Redshift Serverless endpoint and choose Edit Connection.
  3. For Authentication, select Federated user.
  4. For Database, enter the database you want to connect to.
  5. Get the current user IAM role as illustrated in the following screenshot.

Figure: Getting the current user IAM role in Redshift Query Editor v2

  6. Connect to the Redshift Query Editor v2 using the database user name and password authentication method. For example, connect to the dev database using the admin user name and password. Grant usage on the awsdatacatalog database to the environment user role custom_role (replace the value of current_user with the value you copied):
GRANT USAGE ON DATABASE awsdatacatalog to "IAMR:current_user"

Figure: Granting usage on the awsdatacatalog database

Query using Redshift Spectrum

Using the federated user authentication method, log in to Amazon Redshift. The Central IT team will be able to query the subscribed data asset Policies (table: policy) that was automatically mounted under awsdatacatalog.

Figure: Querying the subscribed asset with Redshift Spectrum
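For example, a query like the following reads the subscribed asset through the mounted awsdatacatalog database using three-part notation. This is a minimal sketch that assumes the asset is exposed as the policy table under the central_db database described earlier:

SELECT *
FROM awsdatacatalog.central_db.policy
LIMIT 10;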

Aggregate tables and unify products

The Central IT team applies the necessary checks and standardization to aggregate and unify the data assets from all business units, bringing them at the same granularity. As shown in the following screenshot, both the Policies and Claims data assets are combined to form a unified aggregate data asset called agg_fraudulent_claims.

Figure: Creating the unified data product
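A minimal sketch of such an aggregation is shown below. The target schema (central_dwh), the Claims table name, and all column and join-key names are hypothetical placeholders, because the actual asset schemas aren't shown in this post:

CREATE TABLE central_dwh.agg_fraudulent_claims AS
SELECT p.policy_id,
       p.policy_type,
       c.claim_id,
       c.claim_amount
FROM awsdatacatalog.central_db.policy p
JOIN awsdatacatalog.central_db.claims c
  ON c.policy_id = p.policy_id
WHERE c.is_fraudulent = 'Y';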

These unified data assets are then published back to the Amazon DataZone central hub for business units to consume them.

Figure: Unified data asset published to Amazon DataZone

The Central IT team also unloads the data assets to Amazon S3 so that each business unit has the flexibility to use either a Redshift Serverless data warehouse or Athena to consume the data. Each business unit can now isolate and cap the consumption costs on their individual data warehouses.
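An unload of the unified asset might look like the following sketch; the table name reuses the hypothetical aggregate from the earlier sketch, and the S3 path and IAM role ARN are placeholders for a role with write access to the bucket:

UNLOAD ('SELECT * FROM central_dwh.agg_fraudulent_claims')
TO 's3://example-central-data-bucket/unified/agg_fraudulent_claims_'
IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-unload-role'
FORMAT AS PARQUET;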

Because the intention of the Central IT team was to consume data lake assets within a data warehouse, the recommended solution would be to use custom AWS service blueprints and deploy them as part of one environment. In this case, we created one environment (env_central_team) to consume the asset using Athena or Amazon Redshift. This accelerates the development of the data sharing process because the same environment role is used to manage the permissions across multiple analytical engines.

Clean up

To clean up your resources, complete the following steps:

  1. Delete any S3 buckets you created.
  2. On the Amazon DataZone console, delete the projects used in this post. This will delete most project-related objects like data assets and environments.
  3. Delete the Amazon DataZone domain.
  4. On the Lake Formation console, delete the Lake Formation admins registered by Amazon DataZone along with the tables and databases created by Amazon DataZone.
  5. If you used a provisioned Redshift cluster, delete the cluster. If you used Redshift Serverless, delete any tables created as part of this post.

Conclusion

In this post, we explored a pattern of seamless data sharing with data lakes and data warehouses with Amazon DataZone and Redshift Spectrum. We discussed the challenges associated with traditional data management approaches, data silos, and the burden of maintaining individual data warehouses for business units.

In order to curb operating and maintenance costs, we proposed a solution that uses Amazon DataZone as a central hub for data discovery and access control, where business units can readily share their domain-specific data. To consolidate and unify the data from these business units and provide a 360-degree insight, the Central IT team uses Redshift Spectrum to directly query and analyze the data residing in their respective data lakes. This eliminates the need for creating separate data copy jobs and duplication of data residing in multiple places.

The team also takes on the responsibility of bringing all the data assets to the same granularity and processing them into a unified data asset. These combined data products can then be shared through Amazon DataZone with the business units. Business units can then focus solely on consuming the unified data assets, rather than on processing data that isn't specific to their domain. This way, processing costs can be controlled and tightly monitored across all business units. The Central IT team can also implement chargeback mechanisms based on each business unit's consumption of the unified products.

To learn more about Amazon DataZone and how to get started, refer to Getting started. Check out the YouTube playlist for some of the latest demos of Amazon DataZone and more information about the capabilities available.


About the Authors

Lakshmi Nair is a Senior Analytics Specialist Solutions Architect at AWS. She specializes in designing advanced analytics systems across industries. She focuses on crafting cloud-based data platforms, enabling real-time streaming, big data processing, and robust data governance.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building analytics and data mesh solutions on AWS and sharing them with the community.

Implement data quality checks on Amazon Redshift data assets and integrate with Amazon DataZone

Post Syndicated from Lakshmi Nair original https://aws.amazon.com/blogs/big-data/implement-data-quality-checks-on-amazon-redshift-data-assets-and-integrate-with-amazon-datazone/

Data quality is crucial in data pipelines because it directly impacts the validity of the business insights derived from the data. Today, many organizations use AWS Glue Data Quality to define and enforce data quality rules on their data at rest and in transit. However, one of the most pressing challenges faced by organizations is providing users with visibility into the health and reliability of their data assets. This is particularly crucial in the context of business data catalogs using Amazon DataZone, where users rely on the trustworthiness of the data for informed decision-making. As the data gets updated and refreshed, there is a risk of quality degradation due to upstream processes.

Amazon DataZone is a data management service designed to streamline data discovery, data cataloging, data sharing, and governance. It allows your organization to have a single secure data hub where everyone in the organization can find, access, and collaborate on data across AWS, on premises, and even third-party sources. It simplifies the data access for analysts, engineers, and business users, allowing them to discover, use, and share data seamlessly. Data producers (data owners) can add context and control access through predefined approvals, providing secure and governed data sharing. The following diagram illustrates the Amazon DataZone high-level architecture. To learn more about the core components of Amazon DataZone, refer to Amazon DataZone terminology and concepts.

Figure: Amazon DataZone high-level architecture

To address the issue of data quality, Amazon DataZone now integrates directly with AWS Glue Data Quality, allowing you to visualize data quality scores for AWS Glue Data Catalog assets directly within the Amazon DataZone web portal. You can access the insights about data quality scores on various key performance indicators (KPIs) such as data completeness, uniqueness, and accuracy.

By providing a comprehensive view of the data quality validation rules applied on the data asset, you can make informed decisions about the suitability of the specific data assets for their intended use. Amazon DataZone also integrates historical trends of the data quality runs of the asset, giving full visibility and indicating if the quality of the asset improved or degraded over time. With the Amazon DataZone APIs, data owners can integrate data quality rules from third-party systems into a specific data asset. The following screenshot shows an example of data quality insights embedded in the Amazon DataZone business catalog. To learn more, see Amazon DataZone now integrates with AWS Glue Data Quality and external data quality solutions.

In this post, we show how to capture the data quality metrics for data assets produced in Amazon Redshift.

Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. Amazon DataZone natively supports data sharing for Amazon Redshift data assets.

With Amazon DataZone, the data owner can directly import the technical metadata of Redshift database tables and views into the Amazon DataZone project's inventory. As these data assets get imported into Amazon DataZone, they bypass the AWS Glue Data Catalog, creating a gap in data quality integration. This post proposes a solution to enrich the Amazon Redshift data asset with data quality scores and KPI metrics.

Solution overview

The proposed solution uses AWS Glue Studio to create a visual extract, transform, and load (ETL) pipeline for data quality validation and a custom visual transform to post the data quality results to Amazon DataZone. The following screenshot illustrates this pipeline.

Figure: AWS Glue ETL pipeline

The pipeline starts by establishing a connection directly to Amazon Redshift and then applies necessary data quality rules defined in AWS Glue based on the organization’s business needs. After applying the rules, the pipeline validates the data against those rules. The outcome of the rules is then pushed to Amazon DataZone using a custom visual transform that implements Amazon DataZone APIs.

The custom visual transform in the data pipeline makes the complex logic of Python code reusable so that data engineers can encapsulate this module in their own data pipelines to post the data quality results. The transform can be used independently of the source data being analyzed.

Each business unit can use this solution by retaining complete autonomy in defining and applying their own data quality rules tailored to their specific domain. These rules maintain the accuracy and integrity of their data. The prebuilt custom transform acts as a central component for each of these business units, where they can reuse this module in their domain-specific pipelines, thereby simplifying the integration. To post the domain-specific data quality results using a custom visual transform, each business unit can simply reuse the code libraries and configure parameters such as Amazon DataZone domain, role to assume, and name of the table and schema in Amazon DataZone where the data quality results need to be posted.

In the following sections, we walk through the steps to post the AWS Glue Data Quality score and results for your Redshift table to Amazon DataZone.

Prerequisites

To follow along, you should have the following:

The solution uses a custom visual transform to post the data quality scores from AWS Glue Studio. For more information, refer to Create your own reusable visual transforms for AWS Glue Studio.

A custom visual transform lets you define, reuse, and share business-specific ETL logic with your teams. Each business unit can apply their own data quality checks relevant to their domain and reuse the custom visual transform to push the data quality result to Amazon DataZone and integrate the data quality metrics with their data assets. This eliminates the risk of inconsistencies that might arise when writing similar logic in different code bases and helps achieve a faster development cycle and improved efficiency.

For the custom transform to work, you need to upload two files to an Amazon Simple Storage Service (Amazon S3) bucket in the same AWS account where you intend to run AWS Glue. Download the following files:

Copy these downloaded files to your AWS Glue assets S3 bucket in the folder transforms (s3://aws-glue-assets-<account id>-<region>/transforms). By default, AWS Glue Studio will read all JSON files from the transforms folder in the same S3 bucket.

Figure: Custom transform files in the AWS Glue assets S3 bucket

In the following sections, we walk you through the steps of building an ETL pipeline for data quality validation using AWS Glue Studio.

Create a new AWS Glue visual ETL job

You can use AWS Glue for Spark to read from and write to tables in Redshift databases. AWS Glue provides built-in support for Amazon Redshift. On the AWS Glue console, choose Author and edit ETL jobs to create a new visual ETL job.

Establish an Amazon Redshift connection

In the job pane, choose Amazon Redshift as the source. For Redshift connection, choose the connection created as a prerequisite, then specify the relevant schema and table on which the data quality checks need to be applied.

Figure: Configuring the Amazon Redshift source

Apply data quality rules and validation checks on the source

The next step is to add the Evaluate Data Quality node to your visual job editor. This node allows you to define and apply domain-specific data quality rules relevant to your data. After the rules are defined, you can choose to output the data quality results. The outcomes of these rules can be stored in an Amazon S3 location. You can additionally choose to publish the data quality results to Amazon CloudWatch and set alert notifications based on the thresholds.
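Rules in the Evaluate Data Quality node are written in Data Quality Definition Language (DQDL). The following is a minimal example ruleset; the column names (customer_id, order_status) are placeholders, so substitute columns from your own Redshift table:

Rules = [
    IsComplete "customer_id",
    Uniqueness "customer_id" > 0.99,
    ColumnValues "order_status" in ["PENDING", "SHIPPED", "DELIVERED"],
    RowCount > 0
]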

Preview data quality results

Choosing the data quality results automatically adds the new node ruleOutcomes. The preview of the data quality results from the ruleOutcomes node is illustrated in the following screenshot. The node outputs the data quality results, including the outcomes of each rule and its failure reason.

Figure: Previewing the data quality results

Post the data quality results to Amazon DataZone

The output of the ruleOutcomes node is then passed to the custom visual transform. After both files are uploaded, the AWS Glue Studio visual editor automatically lists the transform as mentioned in post_dq_results_to_datazone.json (in this case, Datazone DQ Result Sink) among the other transforms. Additionally, AWS Glue Studio will parse the JSON definition file to display the transform metadata such as name, description, and list of parameters. In this case, it lists parameters such as the role to assume, domain ID of the Amazon DataZone domain, and table and schema name of the data asset.

Fill in the parameters:

  • Role to assume is optional and can be left empty; it’s only needed when your AWS Glue job runs in an associated account
  • For Domain ID, the ID for your Amazon DataZone domain can be found in the Amazon DataZone portal by choosing the user profile name

Figure: Finding the domain ID in the Amazon DataZone portal

  • Table name and Schema name are the same ones you used when creating the Redshift source transform
  • Data quality ruleset name is the name you want to give to the ruleset in Amazon DataZone; you could have multiple rulesets for the same table
  • Max results is the maximum number of Amazon DataZone assets you want the script to return in case multiple matches are available for the same table and schema name

Edit the job details and, under the job parameters, add the following key-value pair to import a version of Boto3 that contains the latest Amazon DataZone APIs:

Key: --additional-python-modules

Value: boto3>=1.34.105

Finally, save and run the job.
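If you prefer to run the job from the command line rather than the console, a sketch like the following also works; the job name is a placeholder for whatever you named your visual ETL job:

aws glue start-job-run \
  --job-name redshift-dq-to-datazone \
  --arguments '{"--additional-python-modules":"boto3>=1.34.105"}'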

Figure: Posting the data quality results to Amazon DataZone

The implementation logic for inserting the data quality values into Amazon DataZone is described in the post Amazon DataZone now integrates with AWS Glue Data Quality and external data quality solutions. In the post_dq_results_to_datazone.py script, we only adapted the code to extract the metadata from the AWS Glue Evaluate Data Quality transform results, and added methods to find the right Amazon DataZone asset based on the table information. You can review the code in the script if you're curious.
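As a rough illustration only (not the actual script), the core of such a transform comes down to calling the Amazon DataZone time-series API with the rule outcomes. The sketch below assumes Boto3 1.34.105 or later, assumes you already know the target asset ID, and treats the form type identifier and content layout as assumptions that you should adjust to your domain:

import json
from datetime import datetime, timezone

import boto3


def post_dq_results(domain_id, asset_id, ruleset_name, rule_outcomes):
    """Post AWS Glue Data Quality rule outcomes to an Amazon DataZone asset.

    rule_outcomes is assumed to be a list of dicts such as
    {"Rule": "IsComplete \"customer_id\"", "Outcome": "Passed"}.
    """
    datazone = boto3.client("datazone")
    passed = sum(1 for r in rule_outcomes if r["Outcome"] == "Passed")
    content = {
        "evaluationsCount": len(rule_outcomes),
        "passingPercentage": 100.0 * passed / max(len(rule_outcomes), 1),
        "evaluations": [
            {"description": r["Rule"], "status": r["Outcome"].upper()}
            for r in rule_outcomes
        ],
    }
    # The form type identifier and the content field names above are assumptions
    # based on the DataZone data quality integration; adjust them to your domain.
    datazone.post_time_series_data_points(
        domainIdentifier=domain_id,
        entityIdentifier=asset_id,
        entityType="ASSET",
        forms=[
            {
                "formName": ruleset_name,
                "typeIdentifier": "amazon.datazone.DataQualityResultFormType",
                "timestamp": datetime.now(timezone.utc),
                "content": json.dumps(content),
            }
        ],
    )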

After the AWS Glue ETL job run is complete, you can navigate to the Amazon DataZone console and confirm that the data quality information is now displayed on the relevant asset page.

Conclusion

In this post, we demonstrated how you can use the power of AWS Glue Data Quality and Amazon DataZone to implement comprehensive data quality monitoring on your Amazon Redshift data assets. By integrating these two services, you can provide data consumers with valuable insights into the quality and reliability of the data, fostering trust and enabling self-service data discovery and more informed decision-making across your organization.

If you’re looking to enhance the data quality of your Amazon Redshift environment and improve data-driven decision-making, we encourage you to explore the integration of AWS Glue Data Quality and Amazon DataZone, and the new preview for OpenLineage-compatible data lineage visualization in Amazon DataZone. For more information and detailed implementation guidance, refer to the following resources:


About the Authors

Fabrizio Napolitano is a Principal Specialist Solutions Architect for DB and Analytics. He has worked in the analytics space for the last 20 years, and has recently and quite by surprise become a Hockey Dad after moving to Canada.

Lakshmi Nair is a Senior Analytics Specialist Solutions Architect at AWS. She specializes in designing advanced analytics systems across industries. She focuses on crafting cloud-based data platforms, enabling real-time streaming, big data processing, and robust data governance.

Varsha Velagapudi is a Senior Technical Product Manager with Amazon DataZone at AWS. She focuses on improving data discovery and curation required for data analytics. She is passionate about simplifying customers’ AI/ML and analytics journey to help them succeed in their day-to-day tasks. Outside of work, she enjoys nature and outdoor activities, reading, and traveling.

Automate Amazon Redshift Advisor recommendations with email alerts using an API

Post Syndicated from Ranjan Burman original https://aws.amazon.com/blogs/big-data/automate-amazon-redshift-advisor-recommendations-with-email-alerts-using-an-api/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that enables you to analyze your data at scale. Amazon Redshift now allows you to programmatically access Amazon Redshift Advisor recommendations through an API, enabling you to integrate recommendations about how to improve your provisioned cluster performance into your own applications.

Amazon Redshift Advisor offers recommendations about optimizing your Redshift cluster performance and helps you save on operating costs. Advisor develops its customized recommendations by analyzing performance and usage metrics for your cluster and displays recommendations that should have a significant impact on performance and operations. Now, with the ability to programmatically access these recommendations through the ListRecommendations API, you can make recommendations available to implement on-demand or automatically through your own internal applications and tools without the need to access the Amazon Redshift console.

In this post, we show you how to use the ListRecommendations API to set up email notifications for Advisor recommendations on your Redshift cluster. These recommendations, such as identifying tables that should be vacuumed to sort the data or finding table columns that are candidates for compression, can help improve performance and save costs.

How to access Redshift Advisor recommendations

To access Advisor recommendations on the Amazon Redshift console, choose Advisor in the navigation pane. You can expand each recommendation to see more details, and sort and group recommendations.

You can also use the ListRecommendations API to automate receiving the Advisor recommendations and implement them programmatically. The API returns a list of recommended actions that can be parsed and implemented. The API and SDKs also enable you to set up workflows that use Advisor programmatically for automated optimizations. Running periodic checks of Advisor on a cron schedule and applying the resulting changes can help you keep your Redshift clusters optimized without manual intervention.

You can also use the list-recommendations command in the AWS Command Line Interface (AWS CLI) to invoke the Advisor recommendations from the command line and automate the workflow through scripts.
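For example, a call like the following returns the recommendations for a single provisioned cluster as JSON that you can parse in your scripts; the cluster identifier and Region are placeholders:

aws redshift list-recommendations \
  --cluster-identifier my-redshift-cluster \
  --region us-east-1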

Solution overview

The following diagram illustrates the solution architecture.

The solution workflow consists of the following steps:

  1. An Amazon EventBridge schedule invokes an AWS Lambda function to retrieve Advisor recommendations.
  2. Advisor generates recommendations that are accessible through an API.
  3. Optionally, this solution stores the recommendations in an Amazon Simple Storage Service (Amazon S3) bucket.
  4. Amazon Simple Notification Service (Amazon SNS) automatically sends notifications to end-users.

Prerequisites

To deploy this solution, you should have the following:

Deploy the solution

Complete the following steps to deploy the solution:

  1. Choose Launch Stack.
  2. For Stack name, enter a name for the stack, for example, blog-redshift-advisor-recommendations.
  3. For SnsTopicArn, enter the SNS topic Amazon Resource Name (ARN) for receiving the email alerts.
  4. For ClusterIdentifier, enter your Redshift cluster name if you want to receive Advisor notifications for a particular cluster. If you leave it blank, you will receive notifications for all Redshift provisioned clusters in your account.
  5. For S3Bucket, enter the S3 bucket name to store the detailed Advisor recommendations in a JSON file. If you leave it blank, this step will be skipped.
  6. For ScheduleExpression, enter the frequency in cron format to receive Advisor recommendation alerts. For this post, we want to receive alerts every Sunday at 14:00 UTC, so we enter cron(0 14 ? * SUN *).

Make sure to provide the correct cron time expression when deploying the CloudFormation stack to avoid any failures.

  7. Keep all options as default under Configure stack options and choose Next.
  8. Review the settings, select the acknowledge check box, and create the stack.

If the CloudFormation stack fails for any reason, refer to Troubleshooting CloudFormation.

After the CloudFormation template is deployed, it will create the following resources:

Workflow details

Let’s take a closer look at the Lambda function and the complete workflow.

The input values provided for SnsTopicArn, ClusterIdentifier, and S3Bucket during the CloudFormation stack creation are set as environment variables in the Lambda function. If the ClusterIdentifier parameter is None, the function invokes the ListRecommendations API to generate Advisor recommendations for all the clusters in the account (in the same AWS Region). Otherwise, it passes the ClusterIdentifier value and generates Advisor recommendations only for the given cluster. If the input parameter S3Bucket is provided, the solution creates a folder named RedshiftAdvisorRecommendations and writes the Advisor recommendations file in JSON format within it. If a value for S3Bucket isn't provided, this step is skipped.

Next, the function summarizes the recommendations for each provisioned cluster (all clusters in the account or a single cluster, depending on your settings), categorized by their impact on performance and cost as HIGH, MEDIUM, or LOW. An SNS notification email with the summarized recommendations is sent to the subscribers.

SQL commands are included as part of Advisor's recommended actions. RecommendedActionType-SQL summarizes the number of actions that can be applied using SQL commands.

If there are no recommendations available for any cluster, an SNS notification email is sent stating that there are no Advisor recommendations.
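The core of that Lambda function can be sketched roughly as follows. This is not the deployed code: the environment variable names are placeholders, and the ImpactRanking field name reflects our reading of the ListRecommendations response shape, so verify it against the API reference:

import json
import os
from collections import Counter

import boto3

redshift = boto3.client("redshift")
sns = boto3.client("sns")


def lambda_handler(event, context):
    # If CLUSTER_IDENTIFIER is unset, fetch recommendations for all clusters.
    cluster_id = os.environ.get("CLUSTER_IDENTIFIER")
    kwargs = {"ClusterIdentifier": cluster_id} if cluster_id else {}
    recommendations = redshift.list_recommendations(**kwargs).get("Recommendations", [])

    if not recommendations:
        message = "No Amazon Redshift Advisor recommendations are available."
    else:
        # Summarize by impact category (HIGH / MEDIUM / LOW).
        counts = Counter(r.get("ImpactRanking", "UNKNOWN") for r in recommendations)
        message = json.dumps(
            {"total": len(recommendations), "by_impact": dict(counts)}, indent=2
        )

    sns.publish(
        TopicArn=os.environ["SNS_TOPIC_ARN"],
        Subject="Amazon Redshift Advisor recommendations summary",
        Message=message,
    )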

An EventBridge rule is created to invoke the Lambda function based on the frequency you provided in the stack parameters. By default, it’s scheduled to run weekly each Sunday at 14:00 UTC.

The following is a screenshot of a sample SNS notification email.

Clean up

We recommend deleting the CloudFormation stack if you aren’t going to continue using the solution. This will avoid incurring any additional costs from the resources created as part of the solution.

Conclusion

In this post, we discussed how Redshift Advisor offers you specific recommendations to improve the performance of and decrease the operating costs for your Redshift cluster. We also showed you how to programmatically access these recommendations through an API and implement them on-demand or automatically using your own internal tools without having access to the Amazon Redshift console.

By integrating these recommendations into your workflows, you can make informed decisions and implement best practices to optimize the performance and costs of your Redshift clusters, ultimately enhancing the overall efficiency and productivity of your data processing operations.

We encourage you to try out this automated solution to access Advisor recommendations programmatically. If you have any feedback or questions, please leave them in the comments.


About the authors

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 16 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with cloud solutions.

Nita Shah is a Senior Analytics Specialist Solutions Architect at AWS based out of New York. She has been building data warehouse solutions for over 20 years and specializes in Amazon Redshift. She is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.


Vamsi Bhadriraju is a Data Architect at AWS. He works closely with enterprise customers to build data lakes and analytical applications on the AWS Cloud.

Sumant Nemmani is a Senior Technical Product Manager at AWS. He is focused on helping customers of Amazon Redshift benefit from features that use machine learning and intelligent mechanisms to enable the service to self-tune and optimize itself, ensuring Redshift remains price-performant as they scale their usage.

AWS Weekly Roundup: Mithra, Amazon Titan Image Generator v2, AWS GenAI Lofts, and more (August 12, 2024)

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-mithra-amazon-titan-image-generator-v2-aws-genai-lofts-and-more-august-12-2024/

When Dr. Swami Sivasubramanian, VP of AI and Data, was an intern at Amazon in 2005, Dr. Werner Vogels, CTO of Amazon, was his first manager. Nineteen years later, the two shared a stage at the VivaTech Conference to reflect on Amazon’s history of innovation—from pioneering the pay-as-you-go model with Amazon Web Services (AWS) to transforming customer experiences using “good old-fashioned AI”—as well as what really keeps them up at night in the age of generative artificial intelligence (generative AI).

Asked if competitors ever kept him up at night, Dr. Werner insisted that listening to customer needs—such as guardrails, security, and privacy—and building products based on those needs is what drives success at Amazon. Dr. Swami said he viewed Amazon SageMaker and Amazon Bedrock as prime examples of successful products that have emerged as a result of this customer-first approach. “If you end up chasing your competitors, you are going to end up building what they are building,” he added. “If you actually listen to your customers, you are actually going to lead the way in innovation.” To learn four more lessons on customer-obsessed innovation, visit our AWS Careers blog.

For example, for customer-obsessed security, we build and use Mithra, a powerful neural network model to detect and respond to cyber threats. It analyzes up to 200 trillion internet domain requests daily from the AWS global network, identifying an average of 182,000 new malicious domains with remarkable accuracy. Mithra is just one example of how AWS uses global scale, advanced artificial intelligence and machine learning (AI/ML) technology, and constant innovation to lead the way in cloud security, making the internet safer for everyone. To learn more, read the blog post by Amazon Chief Information Security Officer CJ Moses, How AWS tracks the cloud's biggest security threats and helps shut them down.

Last week’s launches
Here are some launches that got my attention:

Amazon Titan Image Generator v2 in Amazon Bedrock – With the new Amazon Titan Image Generator v2 model, you can guide image creation using a text prompt and reference images, control the color palette of generated images, remove backgrounds, and customize the model to maintain brand style and subject consistency. To learn more, visit my blog post, Amazon Titan Image Generator v2 is now available in Amazon Bedrock.

Regional expansion of Anthropic’s Claude models in Amazon Bedrock – The Claude 3.5 Sonnet, Anthropic’s latest high-performance AI model, is now available in US West (Oregon), Europe (Frankfurt), Asia Pacific (Tokyo), and Asia Pacific (Singapore) Regions in Amazon Bedrock. The Claude 3 Haiku, Anthropic’s compact and affordable AI model, is now available in Asia Pacific (Tokyo) and Asia Pacific (Singapore) Regions in Amazon Bedrock.

Private IPv6 addressing for VPCs and subnets – You can now assign private IPv6 addresses to VPCs and subnets using Amazon VPC IP Address Manager (IPAM). Within IPAM, you can configure private IPv6 addresses in a private scope, provision Unique Local IPv6 Unicast Addresses (ULA) and Global Unicast Addresses (GUA), and use them to create VPCs and subnets for private access. To learn more, see Understanding IPv6 addressing on AWS and designing a scalable addressing plan and the VPC documentation.

Up to 30 GiB/s of read throughput in Amazon EFS – We are increasing the read throughput to 30 GiB/s, extending the simple, fully elastic, and provisioning-free experience of Amazon EFS to support throughput-intensive AI and ML workloads for model training, inference, financial analytics, and genomic data analysis.

Large language models (LLMs) in Amazon Redshift ML – You can use pre-trained publicly available LLMs in Amazon SageMaker JumpStart as part of Amazon Redshift ML. For example, you can use LLMs to summarize feedback, perform entity extraction, and conduct sentiment analysis on data in your Amazon Redshift table, so you can bring the power of generative AI to your data warehouse.

Data products in Amazon DataZone – You can create data products in Amazon DataZone, which enable the grouping of data assets into well-defined, self-contained packages tailored for specific business use cases. For example, a marketing analysis data product can bundle various data assets such as marketing campaign data, pipeline data, and customer data. To learn more, visit this AWS Big Data blog post.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS news
Here are some additional news items that you might find interesting:

AWS Goodies by Jeff Barr – Want to discover more exciting news about AWS? Jeff Barr is always in catch-up mode, doing his best to share all of the interesting things that he finds or that are shared with him. You can find his goodies once a week. Follow his LinkedIn page.

AWS and Multicloud – You might have missed a great article about the existing capabilities AWS has and the continued enhancements we’ve made in multicloud environments. In the post, Jeff covers the AWS approach to multicloud, provides you with some real-world examples, and reviews some of the newest multicloud and hybrid capabilities found across the lineup of AWS services.

Code transformation in Amazon Q Developer – At Amazon, we asked a small team to use Amazon Q Developer Agent for code transformation to migrate more than 30,000 production applications from older Java versions to Java 17. By using Amazon Q Developer to automate these upgrades, the team saved over 4,500 developer-years of effort compared to doing the upgrades manually, and the company achieved $260 million in annual savings by moving to the latest Java version.

Contributing to AWS CDK – AWS Cloud Development Kit (AWS CDK) is an open source software development framework to model and provision your cloud application resources using familiar programming languages. Contributing to AWS CDK not only helps you deepen your knowledge of AWS services but also allows you to give back to the community and improve a tool you rely on.

Upcoming AWS events
Check your calendars and sign up for these AWS events:

AWS re:Invent 2024 – Dive into the first-round session catalog. Explore all the different learning opportunities at AWS re:Invent this year and start building your agenda today. You’ll find sessions for all interests and learning styles.

AWS Innovate Migrate, Modernize, Build – Learn about proven strategies and practical steps for effectively migrating workloads to the AWS Cloud, modernizing applications, and building cloud-native and AI-enabled solutions. Don’t miss this opportunity to learn with the experts and unlock the full potential of AWS. Register now for Asia Pacific, Korea, and Japan (September 26).

AWS Summits – The 2024 AWS Summit season is almost wrapping up! Join free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Register in your nearest city: São Paulo (August 15), Jakarta (September 5), and Toronto (September 11).

AWS Community Days – Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world: New Zealand (August 15), Colombia (August 24), New York (August 28), Belfast (September 6), and Bay Area (September 13).

AWS GenAI Lofts – Meet AWS AI experts and attend talks, workshops, fireside chats, and Q&As with industry leaders. All lofts are free and are carefully curated to offer something for everyone to help you accelerate your journey with AI. There are lofts scheduled in San Francisco (August 14–September 27), São Paulo (September 2–November 20), London (September 30–October 25), Paris (October 8–November 25), and Seoul (November).

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

Channy

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Migrate Amazon Redshift from DC2 to RA3 to accommodate increasing data volumes and analytics demands

Post Syndicated from Valdiney Gomes original https://aws.amazon.com/blogs/big-data/migrate-amazon-redshift-from-dc2-to-ra3-to-accommodate-increasing-data-volumes-and-analytics-demands/

This is a guest post by Valdiney Gomes, Hélio Leal, Flávia Lima, and Fernando Saga from Dafiti.

As businesses strive to make informed decisions, the amount of data being generated and required for analysis is growing exponentially. This trend is no exception for Dafiti, an ecommerce company that recognizes the importance of using data to drive strategic decision-making processes. With the ever-increasing volume of data available, Dafiti faces the challenge of effectively managing and extracting valuable insights from this vast pool of information to gain a competitive edge and make data-driven decisions that align with company business objectives.

Amazon Redshift is widely used for Dafiti’s data analytics, supporting approximately 100,000 daily queries from over 400 users across three countries. These queries include both extract, transform, and load (ETL) and extract, load, and transform (ELT) processes and one-time analytics. Dafiti’s data infrastructure relies heavily on ETL and ELT processes, with approximately 2,500 unique processes run daily. These processes retrieve data from around 90 different data sources, resulting in updating roughly 2,000 tables in the data warehouse and 3,000 external tables in Parquet format, accessed through Amazon Redshift Spectrum and a data lake on Amazon Simple Storage Service (Amazon S3).

The growing need for storage space to maintain data from over 90 sources and the functionality available on the new Amazon Redshift node types, including managed storage, data sharing, and zero-ETL integrations, led us to migrate from DC2 to RA3 nodes.

In this post, we share how we handled the migration process and provide further impressions of our experience.

Amazon Redshift at Dafiti

Amazon Redshift is a fully managed data warehouse service, and was adopted by Dafiti in 2017. Since then, we've had the opportunity to follow many innovations and have gone through three different node types. We started with 115 dc2.large nodes; with the launch of Redshift Spectrum and the migration of our cold data to the data lake, we considerably improved our architecture and migrated to four dc2.8xlarge nodes. RA3 introduced many features, allowing us to scale and pay for computing and storage independently. This is what brought us to the current moment, where we have eight ra3.4xlarge nodes in the production environment and a single-node ra3.xlplus cluster for development.

Given our scenario, where we have many data sources and a lot of new data being generated every moment, we came across a problem: the 10 TB we had available in our cluster was insufficient for our needs. Although most of our data is currently in the data lake, more storage space was needed in the data warehouse. This was solved by RA3, which scales compute and storage independently. Also, with zero-ETL, we simplified our data pipelines, ingesting tons of data in near real time from our Amazon Relational Database Service (Amazon RDS) instances, while data sharing enables a data mesh approach.

Migration process to RA3

Our first step towards migration was to understand how the new cluster should be sized; for this, AWS provides a recommendation table.

Given the configuration of our cluster, consisting of four dc2.8xlarge nodes, the recommendation was to switch to ra3.4xlarge.

At this point, one concern we had was the reduction in vCPUs and memory. With DC2, our four nodes provided a total of 128 vCPUs and 976 GiB; with RA3, even with eight nodes, these values were reduced to 96 vCPUs and 768 GiB. However, performance improved, with workloads processing 40% faster in general.

AWS offers Redshift Test Drive to validate whether the configuration chosen for Amazon Redshift is ideal for your workload before migrating the production environment. At Dafiti, given the particularities of our workload, which gives us some flexibility to make changes to specific windows without affecting the business, it wasn’t necessary to use Redshift Test Drive.

We carried out the migration as follows:

  1. We created a new cluster with eight ra3.4xlarge nodes from the snapshot of our four-node dc2.8xlarge cluster. This process took around 10 minutes to create the new cluster with 8.75 TB of data.
  2. We turned off our internal ETL and ELT orchestrator, to prevent our data from being updated during the migration period.
  3. We changed the DNS pointing to the new cluster in a transparent way for our users. At this point, only one-time queries and those made by Amazon QuickSight reached the new cluster.
  4. After the read query validation stage was complete and we were satisfied with the performance, we reconnected our orchestrator so that the data transformation queries could be run in the new cluster.
  5. We removed the DC2 cluster and completed the migration.

The following diagram illustrates the migration architecture.

Figure: Migration architecture
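The cluster creation in step 1 can also be done with the AWS CLI. The following is a sketch, with placeholder identifiers and the node type and count taken from the sizing recommendation discussed earlier:

aws redshift restore-from-cluster-snapshot \
  --cluster-identifier analytics-ra3 \
  --snapshot-identifier dc2-final-snapshot \
  --node-type ra3.4xlarge \
  --number-of-nodes 8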

During the migration, we defined some checkpoints at which a rollback would be performed if something unwanted happened. The first checkpoint was in Step 3, where the reduction in performance in user queries would lead to a rollback. The second checkpoint was in Step 4, if the ETL and ELT processes presented errors or there was a loss of performance compared to the metrics collected from the processes run in DC2. In both cases, the rollback would simply occur by changing the DNS to point to DC2 again, because it would still be possible to rebuild all processes within the defined maintenance window.

Results

The RA3 family introduced many features, allowed scaling, and enabled us to pay for compute and storage independently, which changed the game at Dafiti. Before, we had a cluster that performed as expected, but limited us in terms of storage, requiring daily maintenance to maintain control of disk space.

The RA3 nodes performed better, and workloads ran 40% faster in general. This represents a significant decrease in the delivery time of our critical data analytics processes.

This improvement became even more pronounced in the days following the migration, due to the ability in Amazon Redshift to optimize caching, statistics, and apply performance recommendations. Additionally, Amazon Redshift is able to provide recommendations for optimizing our cluster based on our workload demands through Amazon Redshift Advisor recommendations, and offers automatic table optimization, which played a key role in achieving a seamless transition.

Moreover, the storage capacity leap from 10 TB to multiple PB solved Dafiti’s primary challenge of accommodating growing data volumes. This substantial increase in storage capabilities, combined with the unexpected performance enhancements, demonstrated that the migration to RA3 nodes was a successful strategic decision that addressed Dafiti’s evolving data infrastructure requirements.

Data sharing has been used since the moment of migration to share data between the production and development environments, but the natural evolution is to enable the data mesh at Dafiti through this feature. The limitation we had was the need to activate case sensitivity, which is a prerequisite for data sharing and which forced us to adjust some processes that broke as a result. But that was nothing compared to the benefits we're seeing from migrating to RA3.
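Case sensitivity in Amazon Redshift is controlled by the enable_case_sensitive_identifier configuration parameter. The following is a minimal sketch of turning it on at the session level and as a user default (the user name is a placeholder; on provisioned clusters it can also be set in the parameter group):

-- Current session only
SET enable_case_sensitive_identifier TO true;

-- Default for a specific user
ALTER USER etl_user SET enable_case_sensitive_identifier TO true;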

Conclusion

In this post, we discussed how Dafiti handled migrating to Redshift RA3 nodes, and the benefits of this migration.

Do you want to know more about what we’re doing in the data area at Dafiti? Check out the following resources:

 The content and opinions in this post are those of Dafiti’s authors and AWS is not responsible for the content or accuracy of this post.


About the Authors

Valdiney Gomes is Data Engineering Coordinator at Dafiti. He worked for many years in software engineering, migrated to data engineering, and currently leads an amazing team responsible for the data platform for Dafiti in Latin America.

Hélio Leal is a Data Engineering Specialist at Dafiti, responsible for maintaining and evolving the entire data platform at Dafiti using AWS solutions.

Flávia Lima is a Data Engineer at Dafiti, responsible for sustaining the data platform and providing data from many sources to internal customers.

Fernando Saga is a data engineer at Dafiti, responsible for maintaining Dafiti’s data platform using AWS solutions.

Query AWS Glue Data Catalog views using Amazon Athena and Amazon Redshift

Post Syndicated from Pathik Shah original https://aws.amazon.com/blogs/big-data/query-aws-glue-data-catalog-views-using-amazon-athena-and-amazon-redshift/

Today’s data lakes are expanding across lines of business operating in diverse landscapes and using various engines to process and analyze data. Traditionally, SQL views have been used to define and share filtered data sets that meet the requirements of these lines of business for easier consumption. However, with customers using different processing engines in their data lakes, each with its own version of views, they’re creating separate views per engine, adding to maintenance overhead. Furthermore, accessing these engine-defined views requires customers to have elevated access levels, granting them access to both the SQL view itself and the underlying databases and tables referenced in the view’s SQL definition. This approach impedes granting consistent access to a subset of data using SQL views, hampering productivity and increasing management overhead.

Glue Data Catalog views are a new feature of the AWS Glue Data Catalog that customers can use to create a common view schema and a single metadata container that holds view definitions in different dialects, which can be used across engines such as Amazon Redshift and Amazon Athena. By defining a single view object that can be queried from multiple engines, Data Catalog views enable customers to manage permissions on a single view schema consistently using AWS Lake Formation. A view can be shared across different AWS accounts as well. To query these views, users need access only to the view object and don't need access to the referenced databases and tables in the view definition. Further, all requests against Data Catalog views, such as requests for access credentials on underlying resources, are logged as AWS CloudTrail management events for auditing purposes.

In this blog post, we will show how you can define and query a Data Catalog view on top of open source table formats such as Iceberg across Athena and Amazon Redshift. We will also show you the configurations needed to restrict access to the underlying database and tables. To follow along, we have provided an AWS CloudFormation template.

Use case

An Example Corp has two business units: Sales and Marketing. The Sales business unit owns customer datasets, including customer details and customer addresses. The Marketing business unit wants to conduct a targeted marketing campaign based on a preferred customer list and has requested data from the Sales business unit. The Sales business unit’s data steward (AWS Identity and Access Management (IAM) role: product_owner_role), who owns the customer and customer address datasets, plans to create and share non-sensitive details of preferred customers with the Marketing unit’s data analyst (business_analyst_role) for their campaign use case. The Marketing team analyst plans to use Athena for interactive analysis for the marketing campaign and later, use Amazon Redshift to generate the campaign report.

In this solution, we demonstrate how you can use Data Catalog views to share a subset of customer details stored in Iceberg format filtered by the preferred flag. This view can be seamlessly queried using Athena and Amazon Redshift Spectrum, with data access centrally managed through AWS Lake Formation.

Prerequisites

For the solution in this blog post, you need the following:

  • An AWS account. If you don’t have an account, you can create one.
  • You have created a data lake administrator role. Take note of this role's Amazon Resource Name (ARN) to use later. For simplicity's sake, this post uses the IAM Admin role as the data lake admin and Redshift admin, but make sure that in your environment you follow the principle of least privilege.
  • Under Data Catalog settings, have the default settings in place. Both of the following options should be selected:
    • Use only IAM access control for new databases
    • Use only IAM access control for new tables in new databases

Get started

To follow the steps in this post, sign in to the AWS Management Console as the IAM Admin and deploy the following CloudFormation stack to create the necessary resources:

  1. Choose to deploy the CloudFormation template.
  2. Provide an IAM role that you have already configured as a Lake Formation administrator.
  3. Complete the steps to deploy the template. Leave all settings as default.
  4. Select I acknowledge that AWS CloudFormation might create IAM resources, then choose Submit.

The CloudFormation stack creates the following resources. Make a note of these values—you will use them later.

  • Amazon Simple Storage Service (Amazon S3) buckets that store the table data and Athena query result
  • IAM roles: product_owner_role and business_analyst_role
  • Virtual private cloud (VPC) with the required network configuration, which will be used for compute
  • AWS Glue database: customerdb, which contains the customer and customer_address tables in Iceberg format
  • Glue database: customerviewdb, which will contain the Data Catalog views
  • Redshift Serverless cluster

The CloudFormation stack also registers the data lake bucket with Lake Formation in Lake Formation access mode. You can verify this by navigating to the Lake Formation console and selecting Data lake locations under Administration.

Solution overview

The following figure shows the architecture of the solution.

As a requirement to create a Data Catalog view, the data lake S3 locations for the tables (customer and customer_address) need to be registered with Lake Formation and granted full permission to product_owner_role.

The Sales product owner: product_owner_role is also granted permission to create views under customerviewdb using Lake Formation.

After the Glue Data Catalog view (customer_view) is created on the customer dataset with the required subset of customer information, the view is shared with the Marketing analyst (business_analyst_role), who can then query the preferred customers' non-sensitive information as defined by the view, without having access to the underlying customer tables.

  1. Enable Lake Formation permission mode on the customerdb database and its tables.
  2. Grant the database (customerdb) and tables (customer and customer_address) full permission to product_owner_role using Lake Formation.
  3. Enable Lake Formation permission mode on the database (customerviewdb) where the multiple dialect Data Catalog view will be created.
  4. Grant full database permission to product_owner_role using Lake Formation.
  5. Create Data Catalog views as product_owner_role using Athena and Amazon Redshift to add engine dialects.
  6. Share the database and Data Catalog views read permission to business_analyst_role using Lake Formation.
  7. Query the Data Catalog view using business_analyst_role from Athena and Amazon Redshift engine.

With the prerequisites in place and an understanding of the overall solution, you’re ready to set up the solution.

Set up Lake Formation permissions for product_owner_role

Sign in to the Lake Formation console as a data lake administrator. For the examples in this post, we use the IAM Admin role, Admin, as the data lake admin.

Enable Lake Formation permission mode on customerdb and its tables

  1. In the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
  2. Choose customerdb and choose Edit.
  3. Under Default permissions for newly created tables, clear Use only IAM access control for new tables in this database.
  4. Choose Save.
  5. Under Data Catalog in the navigation pane, choose Databases.
  6. Select customerdb and under Action, select View
  7. Select the IAMAllowedPrincipal from the list and choose Revoke.
  8. Repeat the same for all tables under the database customerdb.

Grant the product_owner_role access to customerdb and its tables

Grant product_owner_role all permissions to the customerdb database.

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
  2. Choose Grant.
  3. Under Principals, select IAM users and roles.
  4. Select product_owner_role.
  5. Under LF-Tags or catalog resources, select Named Data Catalog resources and select customerdb for Databases.
  6. Select SUPER for Database permissions.
  7. Choose Grant to apply the permissions.
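The same grant can also be scripted with the AWS CLI; in the following sketch, the account ID is a placeholder, and the ALL permission corresponds to SUPER in the console:

aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:role/product_owner_role \
  --permissions "ALL" \
  --resource '{"Database": {"Name": "customerdb"}}'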

Grant product_owner_role all permissions to the customer and customer_address tables.

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permission
  2. Choose Grant.
  3. Under Principals, select IAM users and roles.
  4. Choose the product_owner_role.
  5. Under LF-Tags or catalog resources, choose Named Data Catalog resources and select customerdb for Databases and customer and customer_address for Tables.
  6. Choose SUPER for Table permissions.
  7. Choose Grant to apply the permissions.

Enable Lake Formation permission mode

Enable Lake Formation permission mode on the database where the Data Catalog view will be created.

  1. In the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
  2. Select customerviewdb and choose Edit.
  3. Under Default permissions for newly created tables, clear Use only IAM access control for new tables in this database.
  4. Choose Save.
  5. Choose Databases from Data Catalog in the navigation pane.
  6. Select customerviewdb and under Action select View.
  7. Select the IAMAllowedPrincipal from the list and choose Revoke.

Grant the product_owner_role access to customerviewdb using Lake Formation mode

Grant product_owner_role all permissions to the customerviewdb database.

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
  2. Choose Grant
  3. Under Principals, select IAM users and roles.
  4. Choose product_owner_role
  5. Under LF-Tags or catalog resources, choose Named Data Catalog resources and select customerviewdb for Databases.
  6. Select SUPER for Database permissions.
  7. Choose Grant to apply the permissions.

Create Glue Data Catalog views as product_owner_role

Now that you have Lake Formation permissions set on the databases and tables, you will use the product_owner_role to create Data Catalog views using Athena and Amazon Redshift. This will also add the engine dialects for Athena and Amazon Redshift.

Add the Athena dialect

  1. In the AWS console, either sign in using product_owner_role or, if you’re already signed in as an Admin, switch to product_owner_role.
  2. Launch the Athena query editor and select the workgroup athena_glueview in the upper right of the console. You will create a view that combines data from the customer and customer_address tables, specifically for customers who are marked as preferred. The tables include personal information about the customer, such as their name, date of birth, country of birth, and email address.
  3. Run the following in the query editor to create the customer_view view under the customerviewdb database.
    create protected multi dialect view customerviewdb.customer_view
    security definer
    as
    select c_customer_id, c_first_name, c_last_name, c_birth_day, c_birth_month,
    c_birth_year, c_birth_country, c_email_address,
    ca_country,ca_zip
    from customerdb.customer, customerdb.customer_address
    where c_current_addr_sk = ca_address_sk and c_preferred_cust_flag='Y';

  4. Run the following query to preview the view you just created.
    select * from customerviewdb.customer_view limit 10;

  5. Run following query to find the top three birth years with the highest customer counts from the customer_view view and display the birth year and corresponding customer count for each.
    select c_birth_year,
    	count(*) as count
    from "customerviewdb"."customer_view"
    group by c_birth_year
    order by count desc
    limit 3

Output:

  1. To validate that the view is created, go to the navigation pane and choose Views under Data catalog on the Lake Formation console
  2. Select customer_view and go to the SQL definition section to validate the Athena engine dialect.

When you created the view in Athena, it added the dialect for Athena engine. Next, to support the use case described earlier, the marketing campaign report needs to be generated using Amazon Redshift. For this, you need to add the Redshift dialect to the view so you can query it using Amazon Redshift as an engine.

Add the Amazon Redshift dialect

  1. Sign in to the AWS console as an Admin, navigate to the Amazon Redshift console, and sign in to Redshift Query Editor v2.
  2. Connect to the Serverless cluster as Admin (federated user) and run the following statements to grant permission on the Glue automount database (awsdatacatalog) access to product_owner_role and business_analyst_role.
    create user  "IAMR:product_owner_role" password disable;
    create user  "IAMR:business_analyst_role" password disable;
    
    grant usage on database awsdatacatalog to "IAMR:product_owner_role";
    grant usage on database awsdatacatalog to "IAMR:business_analyst_role";

  3. Sign in to the Amazon Redshift console as product_owner_role and open Query Editor v2 as a federated user. You will use the following ALTER VIEW query to add the Amazon Redshift engine dialect to the view created previously using Athena.
  4. Run the following in the query editor:
    alter external view awsdatacatalog.customerviewdb.customer_view AS
    select c_customer_id, c_first_name, c_last_name, c_birth_day, c_birth_month,
    c_birth_year, c_birth_country, c_email_address,
    ca_country, ca_zip
    from awsdatacatalog.customerdb.customer, awsdatacatalog.customerdb.customer_address
    where c_current_addr_sk = ca_address_sk and c_preferred_cust_flag='Y'

  5. Run the following query to preview the view.
    select * from awsdatacatalog.customerviewdb.customer_view limit 10;

  6. Run the same query that you ran in Athena to find the top three birth years with the highest customer counts from the customer_view view and display the birth year and corresponding customer count for each.
    select c_birth_year,
    	count(*) as count
    from awsdatacatalog.customerviewdb.customer_view
    group by c_birth_year
    order by count desc
    limit 3

By querying the same view and running the same query in Redshift, you obtained the same result set as you observed in Athena.

Validate the dialects added

Now that you have added all the dialects, navigate to the Lake Formation console to see how the dialects are stored.

  1. On the Lake Formation console, under Data catalog in the navigation pane, choose Views.
  2. Select customer_view and go to the SQL definitions section to validate that the Athena and Amazon Redshift dialects have been added.

Alternatively, you can create the view using Amazon Redshift to add the Redshift dialect first, and then update it in Athena to add the Athena dialect.

Next, you will see how the business_analyst_role can query the view without having access to query the underlying tables and the Amazon S3 location where the data exists.

Set up Lake Formation permissions for business_analyst_role

Sign in to the Lake Formation console as the data lake administrator (for this blog, we use the IAM admin role, Admin, as the data lake admin).

Grant business_analyst_role access to the database and view using Lake Formation

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
  2. Choose Grant.
  3. Under Principals, select IAM users and roles.
  4. Select business_analyst_role.
  5. Under LF-Tags or catalog resources, select Named Data Catalog resources and select customerviewdb for Databases.
  6. Select DESCRIBE for Database permissions.
  7. Choose Grant to apply the permissions.

Grant the business_analyst_role SELECT and DESCRIBE permissions to customer_view

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
  2. Choose Grant.
  3. Under Principals, select IAM users and roles.
  4. Select business_analyst_role.
  5. Under LF-Tags or catalog resources, choose Named Data Catalog resources and select customerviewdb for Databases and customer_view for Views.
  6. Choose SELECT and DESCRIBE for View permissions.
  7. Choose Grant to apply the permissions.

Query the Data Catalog views using business_analyst_role

Now that you have set up the solution, test it by querying the data using Athena and Amazon Redshift.

Using Athena

  1. Sign in to the Athena console as business_analyst_role.
  2. Launch query editor and select the workgroup athena_glueview. Select database customerviewdb from the dropdown on the left and you should be able to see the view created previously using product_owner_role. Also, notice that no tables are shown because business_analyst_role doesn’t have access granted for the base tables.
  3. Run the following in the query editor to query the view.
    select * from customerviewdb.customer_view limit 10

As the query results show, business_analyst_role can query the view without having access to the underlying tables.

  4. Next, query the customer table on which the view is created. It should return an error.
    SELECT * FROM customerdb.customer limit 10

Using Amazon Redshift

  1. Navigate to the Amazon Redshift console and sign in to Amazon Redshift query editor v2. Connect to the Serverless cluster as business_analyst_role (federated user).
  2. Select customerviewdb on the left side of the console. You should see the view customer_view. Also, note that you cannot see the tables from which the view is created. Run the following in the query editor to query the view.
    SELECT * FROM "awsdatacatalog"."customerviewdb"."customer_view";

The business analyst user can run the analysis on the Data Catalog view without needing access to the underlying databases and tables from which the view is created.

Glue Data Catalog views offer solutions for various data access and governance scenarios. Organizations can use this feature to define granular access controls on sensitive data—such as personally identifiable information (PII) or financial records—to help them comply with data privacy regulations. Additionally, you can use Data Catalog views to implement row-level, column-level, or even cell-level filtering based on the specific privileges assigned to different user roles or personas, allowing for fine-grained data access control. Furthermore, Data Catalog views can be used in data mesh patterns, enabling secure, domain-specific data sharing across the organization for self-service analytics, while allowing users to use preferred analytics engines like Athena or Amazon Redshift on the same views for governance and consistent data access.

Clean up

To avoid incurring future charges, delete the CloudFormation stack. For instructions, see Deleting a stack on the AWS CloudFormation console. Ensure that the following resources created for this blog post are removed:

  • S3 buckets
  • IAM roles
  • VPC with network components
  • Data Catalog database, tables and views
  • Amazon Redshift Serverless cluster
  • Athena workgroup

Conclusion

In this post, we demonstrated how to use AWS Glue Data Catalog views across multiple engines such as Athena and Redshift. You can share Data Catalog views so that different personas can query them. For more information about this new feature, see Using AWS Glue Data Catalog views.


About the Authors

Pathik Shah is a Sr. Analytics Architect on Amazon Athena. He joined AWS in 2015 and has been focusing in the big data analytics space since then, helping customers build scalable and robust solutions using AWS analytics services.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Paul Villena is a Senior Analytics Solutions Architect in AWS with expertise in building modern data and analytics solutions to drive business value. He works with customers to help them harness the power of the cloud. His areas of interests are infrastructure as code, serverless technologies, and coding in Python.

Derek Liu is a Senior Solutions Architect based out of Vancouver, BC. He enjoys helping customers solve big data challenges through AWS analytic services.

Enrich, standardize, and translate streaming data in Amazon Redshift with generative AI

Post Syndicated from Anusha Challa original https://aws.amazon.com/blogs/big-data/enrich-standardize-and-translate-streaming-data-in-amazon-redshift-with-generative-ai/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it straightforward and cost-effective to analyze your data. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.

Amazon Redshift ML is a feature of Amazon Redshift that enables you to build, train, and deploy machine learning (ML) models directly within the Redshift environment. Now, you can use pretrained publicly available large language models (LLMs) in Amazon SageMaker JumpStart as part of Redshift ML, allowing you to bring the power of LLMs to analytics. You can use pretrained publicly available LLMs from leading providers such as Meta, AI21 Labs, LightOn, Hugging Face, Amazon Alexa, and Cohere as part of your Redshift ML workflows. By integrating with LLMs, Redshift ML can support a wide variety of natural language processing (NLP) use cases on your analytical data, such as text summarization, sentiment analysis, named entity recognition, text generation, language translation, data standardization, data enrichment, and more. Through this feature, the power of generative artificial intelligence (AI) and LLMs is made available to you as simple SQL functions that you can apply on your datasets. The integration is designed to be simple to use and flexible to configure, allowing you to take advantage of the capabilities of advanced ML models within your Redshift data warehouse environment.

In this post, we demonstrate how Amazon Redshift can act as the data foundation for your generative AI use cases by enriching, standardizing, cleansing, and translating streaming data using natural language prompts and the power of generative AI. In today’s data-driven world, organizations often ingest real-time data streams from various sources, such as Internet of Things (IoT) devices, social media platforms, and transactional systems. However, this streaming data can be inconsistent, missing values, and be in non-standard formats, presenting significant challenges for downstream analysis and decision-making processes. By harnessing the power of generative AI, you can seamlessly enrich and standardize streaming data after ingesting it into Amazon Redshift, resulting in high-quality, consistent, and valuable insights. Generative AI models can derive new features from your data and enhance decision-making. This enriched and standardized data can then facilitate accurate real-time analysis, improved decision-making, and enhanced operational efficiency across various industries, including ecommerce, finance, healthcare, and manufacturing. For this use case, we use the Meta Llama-3-8B-Instruct LLM to demonstrate how to integrate it with Amazon Redshift to streamline the process of data enrichment, standardization, and cleansing.

Solution overview

The following diagram demonstrates how to use Redshift ML capabilities to integrate with LLMs to enrich, standardize, and cleanse streaming data. The process starts with raw streaming data coming from Amazon Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka (Amazon MSK), which is materialized in Amazon Redshift as raw data. User-defined functions (UDFs) are then applied to the raw data, which invoke an LLM deployed on SageMaker JumpStart to enrich and standardize the data. The enhanced, cleansed data is then stored back in Amazon Redshift, ready for accurate real-time analysis, improved decision-making, and enhanced operational efficiency.

To deploy this solution, we complete the following steps:

  1. Choose an LLM for the use case and deploy it using foundation models (FMs) in SageMaker JumpStart.
  2. Use Redshift ML to create a model referencing the SageMaker JumpStart LLM endpoint.
  3. Create a materialized view to load the raw streaming data.
  4. Call the model function with prompts to transform the data and view results.

Example data

The following code shows an example of raw order data from the stream:

Record1: {
    "orderID":"101",
    "email":" john. roe @example.com",
    "phone":"+44-1234567890",
    "address":"123 Elm Street, London",
    "comment": "please cancel if items are out of stock"
}
Record2: {
    "orderID":"102",
    "email":" jane.s mith @example.com",
    "phone":"(123)456-7890",
    "address":"123 Main St, Chicago, 12345",
    "comment": "Include a gift receipt"
}
Record3: {
    "orderID":"103",
    "email":"max.muller @example.com",
    "phone":"+498912345678",
    "address":"Musterstrabe, Bayern 00000",
    "comment": "Bitte nutzen Sie den Expressversand"
}
Record4: {
    "orderID":"104",
    "email":" julia @example.com",
    "phone":"(111) 4567890",
    "address":"000 main st, los angeles, 11111",
    "comment": "Entregar a la puerta"
}
Record5: {
    "orderID":"105",
    "email":" roberto @example.com",
    "phone":"+33 3 44 21 83 43",
    "address":"000 Jean Allemane, paris, 00000",
    "comment": "veuillez ajouter un emballage cadeau"
}

The raw data has inconsistent formatting for email and phone numbers, the address is incomplete and doesn’t have a country, and comments are in various languages. To address the challenges with the raw data, we can implement a comprehensive data transformation process using Redshift ML integrated with an LLM in an ETL workflow. This approach can help standardize the data, cleanse it, and enrich it to meet the desired output format.

The following table shows an example of enriched address data.

orderid Address Country (Identified using LLM)
101 123 Elm Street, London United Kingdom
102 123 Main St, Chicago, 12345 USA
103 Musterstrabe, Bayern 00000 Germany
104 000 main st, los angeles, 11111 USA
105 000 Jean Allemane, paris, 00000 France

The following table shows an example of standardized email and phone data.

orderid email cleansed_email (Using LLM) Phone Standardized Phone (Using LLM)
101 john. roe @example.com john.roe@example.com +44-1234567890 +44 1234567890
102 jane.s mith @example.com jane.smith@example.com (123)456-7890 +1 1234567890
103 max.muller @example.com max.muller@example.com 498912345678 +49 8912345678
104 julia @example.com julia@example.com (111) 4567890 +1 1114567890
105 roberto @example.com roberto@example.com +33 3 44 21 83 43 +33 344218343

The following table shows an example of translated and enriched comment data.

orderid Comment english_comment (Translated using LLM) comment_language (Identified by LLM)
101 please cancel if items are out of stock please cancel if items are out of st English
102 Include a gift receipt Include a gift receipt English
103 Bitte nutzen Sie den Expressversand Please use express shipping German
104 Entregar a la puerta Leave at door step Spanish
105 veuillez ajouter un emballage cadeau Please add a gift wrap French

Prerequisites

Before you implement the steps in the walkthrough, make sure you have the following prerequisites:

Choose an LLM and deploy it using SageMaker JumpStart

Complete the following steps to deploy your LLM:

  1. On the SageMaker JumpStart console, choose Foundation models in the navigation pane.
  2. Search for your FM (for this post, Meta-Llama-3-8B-Instruct) and choose View model.
  3. On the Model details page, review the End User License Agreement (EULA) and choose Open notebook in Studio to start using the notebook in Amazon SageMaker Studio.
  4. In the Select domain and user profile pop-up, choose a profile, then choose Open Studio.
  5. When the notebook opens, in the Set up notebook environment pop-up, choose t3.medium or another instance type recommended in the notebook, then choose Select.
  6. Modify the notebook cell that has accept_eula = False to accept_eula = True.
  7. Select and run the first five cells (see the highlighted sections in the following screenshot) using the run icon.
  8. After you run the fifth cell, choose Endpoints under Deployments in the navigation pane, where you can see the endpoint created.
  9. Copy the endpoint name and wait until the endpoint status is In Service.

It can take 30–45 minutes for the endpoint to be available.

Use Redshift ML to create a model referencing the SageMaker JumpStart LLM endpoint

In this step, you create a model using Redshift ML and the bring your own model (BYOM) capability. After the model is created, you can use the output function to make remote inference to the LLM model. To create a model in Amazon Redshift for the LLM endpoint you created previously, complete the following steps:

  1. Log in to the Redshift endpoint using the Amazon Redshift Query Editor V2.
  2. Make sure you have the following AWS Identity and Access Management (IAM) policy added to the default IAM role. Replace <endpointname> with the SageMaker JumpStart endpoint name you captured earlier:
    {
      "Statement": [
          {
              "Action": "sagemaker:InvokeEndpoint",
              "Effect": "Allow",
              "Resource": "arn:aws:sagemaker:<region>:<AccountNumber>:endpoint/<endpointname>",
              "Principal": "*"
          }
      ]
    }

  3. In the query editor, run the following SQL statement to create a model in Amazon Redshift. Replace <endpointname> with the endpoint name you captured earlier. Note that the input and return data type for the model is the SUPER data type.
    CREATE MODEL meta_llama_3_8b_instruct
    FUNCTION meta_llama_3_8b_instruct(super)
    RETURNS SUPER
    SAGEMAKER '<endpointname>'
    IAM_ROLE default;

Create a materialized view to load raw streaming data

Use the following SQL to create a materialized view for the data that is being streamed through the customer-orders stream. The materialized view is set to auto refresh and will be refreshed as data keeps arriving in the stream.

CREATE EXTERNAL SCHEMA kinesis_streams FROM KINESIS
IAM_ROLE default;

CREATE MATERIALIZED VIEW mv_customer_orders AUTO REFRESH YES AS
    SELECT 
    refresh_time,
    approximate_arrival_timestamp,
    partition_key,
    shard_id,
    sequence_number,
    --json_parse(from_varbyte(kinesis_data, 'utf-8')) as rawdata,
    json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'orderID',true)::character(36) as orderID,
    json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'email',true)::character(36) as email,
    json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'phone',true)::character(36) as phone,
    json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'address',true)::character(36) as address,
    json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'comment',true)::character(36) as comment
    FROM kinesis_streams."customer-orders";

After you run these SQL statements, the materialized view mv_customer_orders will be created and continuously updated as new data arrives in the customer-orders Kinesis data stream.
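If you want to confirm that records are arriving before moving on, the following optional check is a minimal sketch (not part of the original walkthrough); it assumes the customer-orders stream already contains data and that your user can refresh and query the view.

-- Optional: force an immediate refresh and preview the most recently ingested records.
REFRESH MATERIALIZED VIEW mv_customer_orders;

SELECT orderid, email, phone, address, comment
FROM mv_customer_orders
ORDER BY approximate_arrival_timestamp DESC
LIMIT 5;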

Call the model function with prompts to transform data and view results

Now you can call the Redshift ML LLM model function with prompts to transform the raw data and view the results. The input payload is a JSON with prompt and model parameters as attributes:

  • Prompt – The prompt is the input text or instruction provided to the generative AI model to generate new content. The prompt acts as a guiding signal that the model uses to produce relevant and coherent output. Each model has unique prompt engineering guidance. Refer to the Meta Llama 3 Instruct model card for its prompt formats and guidance.
  • Model parameters – The model parameters determine the behavior and output of the model. With model parameters, you can control the randomness, number of tokens generated, where the model should stop, and more.

In the Invoke endpoint section of the SageMaker Studio notebook, you can find the model parameters and example payloads.
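As a minimal sketch of how these parameters ride along with the prompt, the following call passes max_new_tokens, temperature, and top_p in addition to the stop sequence used later in this post. The parameter names follow the payload format shown in the notebook for this deployment; the values are illustrative assumptions, not tuned recommendations.

-- Illustrative only: adjust the parameter values for your workload.
select meta_llama_3_8b_instruct(json_parse('{"inputs":"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nReply with the single word OK<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n","parameters":{"max_new_tokens":64,"temperature":0.1,"top_p":0.9,"stop":"<|eot_id|>"}}')) as sample_response;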


The following SQL statement calls the Redshift ML LLM model function with prompts to standardize phone number and email data, identify the country from the address, and translate comments into English and identify the original comment’s language. The output of the SQL is stored in the table enhanced_raw_data_customer_orders.

create table enhanced_raw_data_customer_orders as
select phone,email,comment, address
  ,meta_llama_3_8b_instruct(json_parse('{"inputs":"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nConvert this phone number into a standard format: '||phone||'\n\nA standard phone number had plus sign followed by CountryCode followed by a space and the rest of the phone number without any spaces or dashes. \n\nExamples: +1 1234567890, +91 1234567890\n\nReturn only the standardized phone number, nothing else<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n","parameters":{"stop": "<|eot_id|>"}}')) as standardized_phone
  ,meta_llama_3_8b_instruct(json_parse('{"inputs":"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nConvert this email into a standard format:'||email||'\n\nA standard email ID does not have spaces and is lower case. Return only the standardized email and nothing else<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n","parameters":{"stop": "<|eot_id|>"}}')) as standardized_email
  ,meta_llama_3_8b_instruct(json_parse('{"inputs":"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nIdentify which country is this address in:'||address||'\n\nReturn only the country and nothing else<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n","parameters":{"stop": "<|eot_id|>"}}')) as country
  ,meta_llama_3_8b_instruct(json_parse('{"inputs":"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nTranslate this statement to english if it is not in english:'||comment||'\n\nReturn the english comment and nothing else. Output only english<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n","parameters":{"stop": "<|eot_id|>"}}')) as translated_comment
  ,meta_llama_3_8b_instruct(json_parse('{"inputs":"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nIdentify which language this statement is in:'||comment||'\n\nReturn only the language and nothing else<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n","parameters":{"stop": "<|eot_id|>"}}')) as orig_comment_language
  from mv_customer_orders;

Query the enhanced_raw_data_customer_orders table to view the data. The output of LLM is in JSON format with the result in the generated_text attribute. It’s stored in the SUPER data type and can be queried using PartiQL:

select 
    phone as raw_phone
    , standardized_phone.generated_text :: varchar as standardized_phone 
    , email as raw_email
    , standardized_email.generated_text :: varchar as standardized_email
    , address as raw_address
    , country.generated_text :: varchar as country
    , comment as raw_comment
    , translated_comment.generated_text :: varchar as translated_comment
    , orig_comment_language.generated_text :: varchar as orig_comment_language
from enhanced_raw_data_customer_orders;

The following screenshot shows our output.

Clean up

To avoid incurring future charges, delete the resources you created:

  1. Delete the LLM endpoint in SageMaker JumpStart by running the cell in the Clean up section in the Jupyter notebook.
  2. Delete the Kinesis data stream.
  3. Delete the Redshift Serverless workgroup or Redshift cluster.
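If you prefer to keep your workgroup or cluster and only remove the Amazon Redshift objects created during this walkthrough, the following is a sketch of the corresponding SQL; the object names match the ones created earlier in this post.

-- Drop the objects created in this walkthrough (dependent objects first).
DROP TABLE IF EXISTS enhanced_raw_data_customer_orders;
DROP MATERIALIZED VIEW mv_customer_orders;
DROP SCHEMA kinesis_streams;
DROP MODEL meta_llama_3_8b_instruct;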

Conclusion

In this post, we showed you how to enrich, standardize, and translate streaming data in Amazon Redshift with generative AI and LLMs. Specifically, we demonstrated the integration of the Meta Llama 3 8B Instruct LLM, available through SageMaker JumpStart, with Redshift ML. Although we used the Meta Llama 3 model as an example, you can use a variety of other pre-trained LLM models available in SageMaker JumpStart as part of your Redshift ML workflows. This integration allows you to explore a wide range of NLP use cases, such as data enrichment, content summarization, knowledge graph development, and more. The ability to seamlessly integrate advanced LLMs into your Redshift environment significantly broadens the analytical capabilities of Redshift ML. This empowers data analysts and developers to incorporate ML into their data warehouse workflows with streamlined processes driven by familiar SQL commands.

We encourage you to explore the full potential of this integration and experiment with implementing various use cases that integrate the power of generative AI and LLMs with Amazon Redshift. The combination of the scalability and performance of Amazon Redshift, along with the advanced natural language processing capabilities of LLMs, can unlock new possibilities for data-driven insights and decision-making.


About the authors

Anusha Challa is a Senior Analytics Specialist Solutions Architect focused on Amazon Redshift. She has helped many customers build large-scale data warehouse solutions in the cloud and on premises. She is passionate about data analytics and data science.

Set up cross-account AWS Glue Data Catalog access using AWS Lake Formation and AWS IAM Identity Center with Amazon Redshift and Amazon QuickSight

Post Syndicated from Poulomi Dasgupta original https://aws.amazon.com/blogs/big-data/set-up-cross-account-aws-glue-data-catalog-access-using-aws-lake-formation-and-aws-iam-identity-center-with-amazon-redshift-and-amazon-quicksight/

Most organizations manage their workforce identity centrally in external identity providers (IdPs) and are comprised of multiple business units that produce their own datasets and manage the lifecycle spread across multiple AWS accounts. These business units have varying landscapes, where a data lake is managed by Amazon Simple Storage Service (Amazon S3) and analytics workloads are run on Amazon Redshift, a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data.

Business units that create data products like to share them with others, without copying the data, to promote analysis to derive insights. Also, they want tighter control on user access and the ability to audit access to their data products. To address this, enterprises usually catalog the datasets in the AWS Glue Data Catalog for data discovery and use AWS Lake Formation for fine-grained access control to adhere to the compliance and operating security model for their business units. Given the diverse range of services, fine-grained data sharing, and personas involved, these enterprises often want a streamlined experience for enterprise user identities when accessing their data using AWS Analytics services.

AWS IAM Identity Center enables centralized management of workforce user access to AWS accounts and applications using a local identity store or by connecting corporate directories using IdPs. Amazon Redshift and AWS Lake Formation are integrated with the new trusted identity propagation capability in IAM Identity Center, allowing you to use third-party IdPs such as Microsoft Entra ID (Azure AD), Okta, Ping, and OneLogin.

With trusted identity propagation, Lake Formation enables data administrators to directly provide fine-grained access to corporate users and groups, and simplifies the traceability of end-to-end data access across supported AWS services. Because access is managed based on a user’s corporate identity, end-users don’t need to use database local user credentials or assume an AWS Identity and Access Management (IAM) role to access data. Furthermore, this enables effective user permissions based on collective group membership and supports group hierarchy.

In this post, we cover how to enable trusted identity propagation with AWS IAM Identity Center, Amazon Redshift, and AWS Lake Formation residing on separate AWS accounts and set up cross-account sharing of an S3 data lake for enterprise identities using AWS Lake Formation to enable analytics using Amazon Redshift. Then we use Amazon QuickSight to build insights using Redshift tables as our data source.

Solution overview

This post covers a use case where an organization centrally manages corporate users within their IdP and where the users belong to multiple business units. Their goal is to enable centralized user authentication through IAM Identity Center in the management account, while keeping the business unit that analyzes data using a Redshift cluster and the business unit that produces data cataloged using the Data Catalog in separate member accounts. This allows them to maintain a single authentication mechanism through IAM Identity Center within an organization while retaining access control, resource, and cost separation through the use of separate AWS accounts per business units and enabling cross-account data sharing using Lake Formation.

For this solution, AWS Organizations is enabled in the central management account and IAM Identity Center is configured for managing workforce identities. The organization has two member accounts: one account that manages the S3 data lake using the Data Catalog, and another account that runs analytical workloads on Amazon Redshift and QuickSight, with all the services enabled with trusted identity propagation. Amazon Redshift will access cross-account AWS Glue resources through QuickSight in member account 1, using IAM Identity Center users and groups set up in the central management account. In member account 2, permissions on the AWS Glue resources are managed using Lake Formation and are shared with member account 1 using Lake Formation data sharing.

The following diagram illustrates the solution architecture.

The solution consists of the following:

  • In the centralized management account, we create a permission set and create account assignments for Redshift_Member_Account. We integrate users and groups from the IdP with IAM Identity Center.
  • Member account 1 (Redshift_Member_Account) is where the Redshift cluster and application exist.
  • Member account 2 (Glue_Member_Account) is where metadata is cataloged in the Data Catalog and Lake Formation is enabled with IAM Identity Center integration.
  • We assign permissions to two IAM Identity Center groups to access the Data Catalog resources:
    • awssso-sales – We apply column-level filtering for this group so that users belonging to this group will be able to select two columns and read all rows.
    • awssso-finance – We apply row-level filtering using data filters for this group so that users belonging to this group will be able to select all columns and see rows after row-level filtering is applied.
  • We apply different permissions for three IAM Identity Center users:
    • User Ethan, part of awssso-sales – Ethan will be able to select two columns and read all rows.
    • User Frank, part of awssso-finance – Frank will be able to select all columns and see rows after row-level filtering is applied.
    • User Brian, part of awssso-sales and awssso-finance – Brian inherits permissions defined for both groups.
  • We set up QuickSight in the same account where Amazon Redshift exists, enabling authentication using IAM Identity Center.

Prerequisites

You should have the following prerequisites already set up:

Member account 2 configuration

Sign in to the Lake Formation console as the data lake administrator. To learn more about setting up permissions for a data lake administrator, see Create a data lake administrator.

In this section, we walk through the steps to set up Lake Formation, enable Lake Formation permissions, and grant database and table permissions to IAM Identity Center groups.

Set up Lake Formation

Complete the steps in this section to set up Lake Formation.

Create AWS Glue resources

You can use an existing AWS Glue database that has a few tables. For this post, we use a database called customerdb and a table called reviews whose data is stored in the S3 bucket lf-datalake-<account-id>-<region>.

Register the S3 bucket location

Complete the following steps to register the S3 bucket location:

  • On the Lake Formation console, in the navigation pane, under Administration, choose Data lake locations.
  • Choose Register location.
  • For Amazon S3 location, enter the S3 bucket location that contains table data.
  • For IAM role, provide a user-defined IAM role. For instructions to create a user-defined IAM role, refer to Requirements for roles used to register locations.
  • For Permission mode, select Lake Formation.
  • Choose Register location.

Set cross-account version

Complete the following steps to set your cross-account version:

  • Sign in to the Lake Formation console as the data lake admin.
  • In the navigation pane, under Administration, choose Data Catalog settings.
  • Under Cross-account version settings, keep the latest version (Version 4) as the current cross-account version.
  • Choose Save.

Add permissions required for cross-account access

If the AWS Glue Data Catalog resource policy is already enabled in the account, you can either remove the policy or add the following permissions to it, which are required for cross-account grants. The provided policy enables AWS Resource Access Manager (AWS RAM) to share a resource policy while cross-account grants are made using Lake Formation. For more information, refer to Prerequisites. If your policy is blank under Catalog settings, skip to the following step.

  • Sign in to the AWS Glue console as an IAM admin.
  • In the navigation pane, under Data Catalog, choose Catalog settings.
  • Under Permissions, add the following policy, and provide the account ID where your AWS Glue resources exist:
{ "Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "ram.amazonaws.com"
},
"Action": "glue:ShareResource",
"Resource": [
"arn:aws:glue:us-east-1:<account-id>:table/*/*",
"arn:aws:glue:us-east-1:<account-id>:database/*",
"arn:aws:glue:us-east-1:<account-id>:catalog"
]
}
]
}
  • Choose Save.

For more information, see Granting cross-account access.

Enable IAM Identity Center integration for Lake Formation

To integrate IAM Identity Center with your Lake Formation organization instance of IAM Identity Center, refer to Connecting Lake Formation with IAM Identity Center.

To enable cross-account sharing for IAM Identity Center users and groups, add the target recipient accounts to your Lake Formation IAM Identity Center integration under the AWS account and organization IDs.

  • Sign in to the Lake Formation console as a data lake admin.
  • In the navigation pane, under Administration, choose IAM Identity Center integration.
  • Under AWS account and organization IDs, choose Add.
  • Enter your target accounts.
  • Choose Add.

Enable Lake Formation permissions for databases

For Data Catalog databases that contain tables that you might share, you can stop new tables from having the default grant of Super to IAMAllowedPrincipals. Complete the following steps:

  • Sign in to the Lake Formation console as a data lake admin.
  • In the navigation pane, under Data Catalog, choose Databases.
  • Select the database customerdb.
  • Choose Actions, then choose Edit.
  • Under Default permissions for newly created tables, deselect Use only IAM access control for new tables in this database.
  • Choose Save.

For Data Catalog databases, remove IAMAllowedPrincipals.

  • Under Data Catalog in the navigation pane, choose Databases.
  • Select the database customerdb.
  • Choose Actions, then choose View.
  • Select IAMAllowedPrincipals and choose Revoke.

Repeat the same steps for tables under the customerdb database.

Grant database permissions to IAM Identity Center groups

Complete the following steps to grant database permissions to your IAM Identity Center groups:

  • On the Lake Formation console, under Data Catalog, choose Databases.
  • Select the database customerdb.
  • Choose Actions, then choose Grant.
  • Select IAM Identity Center.
  • Choose Add and select Get Started.
  • Search for and select your IAM Identity Center group names and choose Assign.

  • Select Named Data Catalog resources.
  • Under Databases, choose customerdb.
  • Under Database permissions, select Describe for Database permissions.
  • Choose Grant.

Grant table permissions to IAM Identity Center groups

In this section, we grant different permissions to our two IAM Identity Center groups.

Column filter

We first add permissions to the group awssso-sales. This group will have access to the customerdb database and be able to select only two columns and read all rows.

  • On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
  • Select the database customerdb.
  • Choose Actions, then choose Grant.
  • Select IAM Identity Center.
  • Choose Add and select Get Started.
  • Search for and select awssso-sales and choose Assign.

  • Select Named Data Catalog resources.
  • Under Databases, choose customerdb.
  • Under Tables, choose reviews.
  • Under Table permissions, select Select for Table permissions.
  • Select Column-based access.
  • Select Include columns and choose product_title and star_rating.
  • Choose Grant.

Row filter

Next, we grant permissions to awssso-finance. This group will have access to customerdb and be able to select all columns and apply filters on rows.

We need to first create a data filter by performing the following steps:

  • On the Lake Formation console, choose Data filters under Data Catalog.
  • Choose Create data filter.
  • For Data filter name, provide a name.
  • For Target database, choose customerdb.
  • For Target table, choose reviews.
  • For Column-level access, select Access to all columns.
  • For Row-level access, choose Filter rows and apply your filter. In this example, we are filtering reviews with star_rating as 5.
  • Choose Create data filter.

  • Under Data Catalog in the navigation pane, choose Databases.
  • Select the database customerdb.
  • Choose Actions, then choose Grant.
  • Select IAM Identity Center.
  • Choose Add and select Get Started.
  • Search for and select awssso-finance and choose Assign.
  • Select Named Data Catalog resources.
  • Under Databases, choose customerdb.
  • Under Tables, choose reviews.
  • Under Data filters, choose the High_Rating data filter you created earlier.
  • Under Data Filter permissions, select Select.
  • Choose Grant.

Member account 1 configuration

In this section, we walk through the steps to add Amazon Redshift Spectrum table access in member account 1, where the Redshift cluster and application exist.

Accept the invite from AWS RAM

You should have received an AWS Resource Access Manager (AWS RAM) invite from member account 2 when you added member account 1 under the IAM Identity Center integration in Lake Formation in member account 2.

  • Navigate to AWS Resource Access Manager (AWS RAM) from the console.
  • Under Shared with me, choose Resource shares.
  • Select the resource name and choose Accept resource share.

Make sure that you have followed this entire blog to establish the Amazon Redshift integration with IAM Identity Center before proceeding with the next steps.

Set up Redshift Spectrum table access for the IAM Identity Center group

Complete the following steps to set up Redshift Spectrum table access:

  1. Sign in to the Amazon Redshift console using the admin role.
  2. Navigate to Query Editor v2.
  3. Choose the options menu (three dots) next to the cluster and choose Create connection.
  4. Connect as the admin user and run the following commands to make the shared resource link data in the S3 data lake available to the IAM Identity Center groups (use the account ID where the Data Catalog exists):
create external schema if not exists <schema_name> from DATA CATALOG database '<glue_catalog_name>' catalog_id '<accountid>';
grant usage on schema <schema_name> to role "<role_name>";

For example:

create external schema if not exists cross_account_glue_schema from DATA CATALOG database 'customerdb' catalog_id '932880906720';
grant usage on schema cross_account_glue_schema to role "awsidc:awssso-sales";
grant usage on schema cross_account_glue_schema to role "awsidc:awssso-finance";
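As an optional check (not part of the original steps), you can confirm that the external schema was created and points at the shared Glue database; this sketch assumes your user can query the SVV_EXTERNAL_SCHEMAS system view.

-- Confirm the cross-account external schema maps to the shared Glue database.
select schemaname, databasename
from svv_external_schemas
where schemaname = 'cross_account_glue_schema';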

Validate Redshift Spectrum access as an IAM Identity Center user

Complete the following steps to validate access:

  • On the Amazon Redshift console, navigate to Query Editor v2.
  • Choose the options menu (three dots) next to the cluster and choose Create connection.
  • Select IAM Identity Center.
  • Enter your Okta user name and password in the browser pop-up.

  • When you’re connected as a federated user, run the following SQL commands to query the cross_account_glue_schema data lake table.
select * from "dev"."cross_account_glue_schema"."reviews";

The following screenshot shows that user Ethan, who is part of the awssso-sales group, has access to two columns and all rows from the Data Catalog.

The following screenshot shows that user Frank, who is part of the awssso-finance group, has access to all columns for records that have star_rating as 5.

The following screenshot shows that user Brian, who is part of awssso-sales and awssso-finance, has access to all columns for records that have star_rating as 5 and access to only two columns (other columns are returned NULL) for records with star_rating other than 5.

Subscribe to QuickSight with IAM Identity Center

In this post, we set up QuickSight in the same account where the Redshift cluster exists. You can use the same or a different member account for QuickSight setup. To subscribe to QuickSight, complete the following steps:

  • Sign in to your AWS account and open the QuickSight console.
  • Choose Sign up for QuickSight.

  • Enter a notification email address for the QuickSight account owner or group. This email address will receive service and usage notifications.
  • Select the identity option that you want to subscribe with. For this post, we select Use AWS IAM Identity Center.
  • Enter a QuickSight account name.
  • Choose Configure.

  • Next, assign groups in IAM Identity Center to roles in QuickSight (admin, author, and reader). This step enables your users to access the QuickSight application. In this post, we choose awssso-sales and awssso-finance for the Admin group.
  • Specify an IAM role to control QuickSight access to your AWS resources. In this post, we select Use QuickSight managed role (default).
  • For this post, we deselect Add Paginated Reports.
  • Review the choices that you made, then choose Finish.

Enable trusted identity propagation in QuickSight

Trusted identity propagation authenticates the end-user in Amazon Redshift when they access QuickSight assets that use a trusted identity propagation enabled data source. When an author creates a data source with trusted identity propagation, the identity of the data source consumers in QuickSight is propagated and logged in AWS CloudTrail. This allows database administrators to centrally manage data security in Amazon Redshift and automatically apply data security rules to data consumers in QuickSight.

To configure QuickSight to connect to Amazon Redshift data sources with trusted identity propagation, configure Amazon Redshift OAuth scopes to your QuickSight account:

aws quicksight update-identity-propagation-config --aws-account-id "AWSACCOUNTID" --service "REDSHIFT" --authorized-targets "IAM Identity Center managed application ARN"

For example:

aws quicksight update-identity-propagation-config --aws-account-id "1234123123" --service "REDSHIFT" --authorized-targets "arn:aws:sso::XXXXXXXXXXXX:application/ssoins-XXXXXXXXXXXX/apl-XXXXXXXXXXXX"

After you have added the scope, the following command lists all OAuth scopes that are currently on a QuickSight account:

aws quicksight list-identity-propagation-configs --aws-account-id "AWSACCOUNTID"

The following code is the example with output:

aws quicksight list-identity-propagation-configs --aws-account-id "1234123123"
{
    "Status": 200,
    "Services": [
        {
            "Service": "REDSHIFT",
            "AuthorizedTargets": [
                "arn:aws:sso::1004123000:application/ssoins-1234f1234bb1f123/apl-12a1234e2e391234"
            ]
        }
    ],
    "RequestId": "116ec1b0-1533-4ed2-b5a6-d7577e073b35"
}

For more information, refer to Authorizing connections from Amazon QuickSight to Amazon Redshift clusters.

For QuickSight to connect to a Redshift instance, you must add an appropriate IP address range in the Redshift security group for the specific AWS Region. For more information, see AWS Regions, websites, IP address ranges, and endpoints.

Test your IAM Identity Center and Amazon Redshift integration with QuickSight

Now you’re ready to connect to Amazon Redshift using QuickSight.

  • In the management account, open the IAM Identity Center console and copy the AWS access portal URL from the dashboard.
  • Sign out from the management account and enter the AWS access portal URL in a new browser window.
  • In the pop-up window, enter your IdP credentials.
  • On the Applications tab, select the QuickSight app.
  • After you federate to QuickSight, choose Datasets.
  • Select New Dataset and then choose Redshift (Auto Discovered).
  • Enter your data source details. Make sure to select Single sign-on for Authentication method.
  • Choose Create data source.

Congratulations! You’re signed in using IAM Identity Center integration with Amazon Redshift and are ready to explore and analyze your data using QuickSight.

The following screenshot from QuickSight shows that user Ethan, who is part of the awssso-sales group, has access to two columns and all rows from the Data Catalog.

The following screenshot from QuickSight shows that user Frank, who is part of the awssso-finance group, has access to all columns for records that have star_rating as 5.

The following screenshot from QuickSight shows that user Brian, who is part of awssso-sales and awssso-finance, has access to all columns for records that have star_rating as 5 and access to only two columns (other columns are returned NULL) for records with star_rating other than 5.

Clean up

Complete the following steps to clean up your resources:

  • Delete the data from the S3 bucket.
  • Delete the Data Catalog objects that you created as part of this post.
  • Delete the Lake Formation resources and QuickSight account.
  • If you created a new Redshift cluster for testing this solution, delete the cluster.

Conclusion

In this post, we established cross-account access to enable centralized user authentication through IAM Identity Center in the management account, while keeping the Amazon Redshift and AWS Glue resources isolated by business unit in separate member accounts. We used Query Editor V2 for querying the data from Amazon Redshift. Then we showed how to build user-facing dashboards by integrating with QuickSight. Refer to Integrate Tableau and Okta with Amazon Redshift using AWS IAM Identity Center to learn about integrating Tableau and Okta with Amazon Redshift using IAM Identity Center.

Learn more about IAM Identity Center with Amazon Redshift, QuickSight, and Lake Formation. Leave your questions and feedback in the comments section.


About the Authors

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Maneesh Sharma is a Senior Database Engineer at AWS with more than a decade of experience designing and implementing large-scale data warehouse and analytics solutions. He collaborates with various Amazon Redshift Partners and customers to drive better integration.

Poulomi Dasgupta is a Senior Analytics Solutions Architect with AWS. She is passionate about helping customers build cloud-based analytics solutions to solve their business problems. Outside of work, she likes travelling and spending time with her family.

Unlock scalability, cost-efficiency, and faster insights with large-scale data migration to Amazon Redshift

Post Syndicated from Chanpreet Singh original https://aws.amazon.com/blogs/big-data/unlock-scalability-cost-efficiency-and-faster-insights-with-large-scale-data-migration-to-amazon-redshift/

Large-scale data warehouse migration to the cloud is a complex and challenging endeavor that many organizations undertake to modernize their data infrastructure, enhance data management capabilities, and unlock new business opportunities. As data volumes continue to grow exponentially, traditional data warehousing solutions may struggle to keep up with the increasing demands for scalability, performance, and advanced analytics.

Migrating to Amazon Redshift offers organizations the potential for improved price-performance, enhanced data processing, faster query response times, and better integration with technologies such as machine learning (ML) and artificial intelligence (AI). However, you might face significant challenges when planning for a large-scale data warehouse migration. These challenges can range from ensuring data quality and integrity during the migration process to addressing technical complexities related to data transformation, schema mapping, performance, and compatibility issues between the source and target data warehouses. Additionally, organizations must carefully consider factors such as cost implications, security and compliance requirements, change management processes, and the potential disruption to existing business operations during the migration. Effective planning, thorough risk assessment, and a well-designed migration strategy are crucial to mitigating these challenges and implementing a successful transition to the new data warehouse environment on Amazon Redshift.

In this post, we discuss best practices for assessing, planning, and implementing a large-scale data warehouse migration into Amazon Redshift.

Success criteria for large-scale migration

The following diagram illustrates a scalable migration pattern for an extract, load, and transform (ELT) scenario using Amazon Redshift data sharing patterns.

The following diagram illustrates a scalable migration pattern for extract, transform, and load (ETL) scenario.

Migration pattern extract, transform, and load (ETL) scenarios

Success criteria alignment by all stakeholders (producers, consumers, operators, auditors) is key for successful transition to a new Amazon Redshift modern data architecture. The success criteria are the key performance indicators (KPIs) for each component of the data workflow. This includes the ETL processes that capture source data, the functional refinement and creation of data products, the aggregation for business metrics, and the consumption from analytics, business intelligence (BI), and ML.

KPIs make sure you can track and audit optimal implementation, achieve consumer satisfaction and trust, and minimize disruptions during the final transition. They measure workload trends, cost usage, data flow throughput, consumer data rendering, and real-life performance. This makes sure the new data platform can meet current and future business goals.

Migration from a large-scale mission-critical monolithic legacy data warehouse (such as Oracle, Netezza, Teradata, or Greenplum) is typically planned and implemented over 6–16 months, depending on the complexity of the existing implementation. The monolithic data warehouse environments that have been built over the last 30 years contain proprietary business logic and multiple data design patterns, including an operational data store, star or snowflake schemas, dimensions and facts, data warehouses and data marts, online transaction processing (OLTP) real-time dashboards, and online analytic processing (OLAP) cubes with multi-dimensional analytics. The data warehouse is highly business critical with minimal allowable downtime. If your data warehouse platform has gone through multiple enhancements over the years, your operational service levels documentation may not be current with the latest operational metrics and desired SLAs for each tenant (such as business unit, data domain, or organization group).

As part of the success criteria for operational service levels, you need to document the expected service levels for the new Amazon Redshift data warehouse environment. This includes the expected response time limits for dashboard queries or analytical queries, elapsed runtime for daily ETL jobs, desired elapsed time for data sharing with consumers, total number of tenants with concurrency of loads and reports, and mission-critical reports for executives or factory operations.
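To make these service levels measurable after cutover, you can baseline query elapsed times from Amazon Redshift system tables. The following is a minimal sketch that assumes dashboard queries are identifiable by a comment tag in the SQL text (the tag is an assumption, not something defined in this post); adapt the filter and the system view (for example, SYS_QUERY_HISTORY on Redshift Serverless) to your environment.

-- Illustrative baseline: daily query counts and elapsed-time statistics for
-- queries tagged with a /* dashboard */ comment. Note that STL system tables
-- retain only a few days of history.
select trunc(starttime) as run_date,
       count(*) as query_count,
       avg(datediff(second, starttime, endtime)) as avg_elapsed_seconds,
       max(datediff(second, starttime, endtime)) as max_elapsed_seconds
from stl_query
where querytxt ilike '%/* dashboard */%'
group by 1
order by 1;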

As part of your modern data architecture transition strategy, the migration goal of a new Amazon Redshift based platform is to use the scalability, performance, cost-optimization, and additional lake house capabilities of Amazon Redshift, resulting in improving the existing data consumption experience. Depending on your enterprise’s culture and goals, your migration pattern of a legacy multi-tenant data platform to Amazon Redshift could use one of the following strategies:

A majority of organizations opt for the organic strategy (lift and shift) when migrating their large data platforms to Amazon Redshift. This approach uses AWS migration tools such as the AWS Schema Conversion Tool (AWS SCT) or the managed service version DMS Schema Conversion to rapidly meet goals around data center exit, cloud adoption, reducing legacy licensing costs, and replacing legacy platforms.

By establishing clear success criteria and monitoring KPIs, you can implement a smooth migration to Amazon Redshift that meets performance and operational goals. Thoughtful planning and optimization are crucial, including optimizing your Amazon Redshift configuration and workload management, addressing concurrency needs, implementing scalability, tuning performance for large result sets, minimizing schema locking, and optimizing join strategies. This will enable right-sizing the Redshift data warehouse to meet workload demands cost-effectively. Thorough testing and performance optimization will facilitate a smooth transition with minimal disruption to end-users, fostering exceptional user experiences and satisfaction. A successful migration can be accomplished through proactive planning, continuous monitoring, and performance fine-tuning, thereby aligning with and delivering on business objectives.

Migration involves the following phases, which we delve into in the subsequent sections:

  • Assessment
    • Discovery of workload and integrations
    • Dependency analysis
    • Effort estimation
    • Team sizing
    • Strategic wave planning
  • Functional and performance
    • Code conversion
    • Data validation
  • Measure and benchmark KPIs
    • Platform-level KPIs
    • Tenant-level KPIs
    • Consumer-level KPIs
    • Sample SQL
  • Monitoring Amazon Redshift performance and continual optimization
    • Identify top offending queries
    • Optimization strategies

To achieve a successful Amazon Redshift migration, it’s important to address these infrastructure, security, and deployment considerations simultaneously, thereby implementing a smooth and secure transition.

Assessment

In this section, we discuss the steps you can take in the assessment phase.

Discovery of workload and integrations

Conducting discovery and assessment for migrating a large on-premises data warehouse to Amazon Redshift is a critical step in the migration process. This phase helps identify potential challenges, assess the complexity of the migration, and gather the necessary information to plan and implement the migration effectively. You can use the following steps:

  • Data profiling and assessment – This involves analyzing the schema, data types, table sizes, and dependencies. Special attention should be given to complex data types such as arrays, JSON, or custom data types and custom user-defined functions (UDFs), because they may require specific handling during the migration process. Additionally, it’s essential to assess the volume of data and daily incremental data to be migrated, and estimate the required storage capacity in Amazon Redshift. Furthermore, analyzing the existing workload patterns, queries, and performance characteristics provides valuable insights into the resource requirements needed to optimize the performance of the migrated data warehouse in Amazon Redshift.
  • Code and query assessment – It’s crucial to assess the compatibility of existing SQL code, including queries, stored procedures, and functions. The AWS SCT can help identify any unsupported features, syntax, or functions that need to be rewritten or replaced to achieve a seamless integration with Amazon Redshift. Additionally, it’s essential to evaluate the complexity of the existing processes and determine if they require redesigning or optimization to align with Amazon Redshift best practices.
  • Performance and scalability assessment – This includes identifying performance bottlenecks, concurrency issues, or resource constraints that may be hindering optimal performance. This analysis helps determine the need for performance tuning or workload management techniques that may be required to achieve optimal performance and scalability in the Amazon Redshift environment.
  • Application integrations and mapping – Embarking on a data warehouse migration to a new platform necessitates a comprehensive understanding of the existing technology stack and business processes intertwined with the legacy data warehouse. Consider the following:
    • Meticulously document all ETL processes, BI tools, and scheduling mechanisms employed in conjunction with the current data warehouse. This includes commercial tools, custom scripts, and any APIs or connectors interfacing with source systems.
    • Take note of any custom code, frameworks, or mechanisms utilized in the legacy data warehouse for tasks such as managing slowly changing dimensions (SCDs), generating surrogate keys, implementing business logic, and other specialized functionalities. These components may require redevelopment or adaptation to operate seamlessly on the new platform.
    • Identify all upstream and downstream applications, as well as business processes that rely on the data warehouse. Map out their specific dependencies on database objects, tables, views, and other components. Trace the flow of data from its origins in the source systems, through the data warehouse, and ultimately to its consumption by reporting, analytics, and other downstream processes.
  • Security and access control assessment – This includes reviewing the existing security model, including user roles, permissions, access controls, data retention policies, and any compliance requirements and industry regulations that need to be adhered to.

Dependency analysis

Understanding dependencies between objects is crucial for a successful migration. You can use system catalog views and custom queries on your on-premises data warehouses to create a comprehensive object dependency report. This report shows how tables, views, and stored procedures rely on each other. This also involves analyzing indirect dependencies (for example, a view built on top of another view, which in turn uses a set of tables), and having a complete understanding of data usage patterns.
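
For example, on a PostgreSQL-compatible source such as Greenplum, a minimal sketch of a view-to-table dependency query might look like the following (it assumes the standard information_schema views; other engines expose similar catalogs under different names):

-- Minimal sketch: list each view and the tables it depends on (PostgreSQL-compatible source)
SELECT vtu.view_schema,
       vtu.view_name,
       vtu.table_schema AS depends_on_schema,
       vtu.table_name   AS depends_on_table
FROM information_schema.view_table_usage vtu
WHERE vtu.view_schema NOT IN ('pg_catalog', 'information_schema')
ORDER BY vtu.view_schema, vtu.view_name;

Running similar queries against stored procedure and function catalogs, and chaining the results, gives you the indirect dependencies as well.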

Effort estimation

The discovery phase serves as your compass for estimating the migration effort. You can translate those insights into a clear roadmap as follows:

  • Object classification and complexity assessment – Based on the discovery findings, categorize objects (tables, views, stored procedures, and so on) based on their complexity. Simple tables with minimal dependencies will require less effort to migrate than intricate views or stored procedures with complex logic.
  • Migration tools – Use the AWS SCT to estimate the base migration effort per object type. The AWS SCT can automate schema conversion, data type mapping, and function conversion, reducing manual effort.
  • Additional considerations – Factor in additional tasks beyond schema conversion. This may include data cleansing, schema optimization for Amazon Redshift performance, unit testing of migrated objects, and migration script development for complex procedures. The discovery phase sheds light on potential schema complexities, allowing you to accurately estimate the effort required for these tasks.

Team sizing

With a clear picture of the effort estimate, you can now size the team for the migration.

Person-months calculation

Divide the total estimated effort by the desired project duration to determine the total person-months required. This provides a high-level understanding of the team size needed.

For example, consider an ELT migration project from an on-premises data warehouse to Amazon Redshift to be completed within 6 months. We estimate the team requirements based on the number of schemas or tenants (for example, 30), the number of database tables (for example, 5,000), the average migration estimate for a schema (for example, 4 weeks, based on the complexity of stored procedures, tables and views, platform-specific routines, and materialized views), and the number of business functions (for example, 2,000, segmented by simple, medium, and complex patterns). We can determine that the following are needed:

  • Migration time period (80% migration / 20% for validation and transition) = 0.8 × 6 months ≈ 5 months, or 22 weeks
  • Dedicated teams = (number of tenants × average migration period per tenant) / migration time period = (30 × 1 month) / 5 months = 6 teams
  • Migration team structure:
    • One to three data developers with stored procedure conversion expertise per team, performing over 25 conversions per week
    • One data validation engineer per team, testing over 50 objects per week
    • One to two data visualization experts per team, confirming consumer downstream applications are accurate and performant
  • A common shared DBA team with performance tuning expertise, responding to standardization needs and migration challenges
  • A platform architecture team (3–5 individuals) focused on platform design, service levels, availability, operational standards, cost, observability, scalability, performance, and design pattern issue resolutions

Team composition expertise

Based on the skillsets required for various migration tasks, we assemble a team with the right expertise. Platform architects define a well-architected platform. Data engineers are crucial for schema conversion and data transformation, and DBAs can handle cluster configuration and workload monitoring. An engagement or project management team makes sure the project runs smoothly, on time, and within budget.

For example, consider an ETL migration project from Informatica/Greenplum to a target Redshift lake house with an Amazon Simple Storage Service (Amazon S3) data lake to be completed within 12 months. We estimate the team requirements based on the number of schemas and tenants (for example, 50 schemas), the number of database tables (for example, 10,000), the average migration estimate for a schema (6 weeks, based on the complexity of database objects), and the number of business functions (for example, 5,000, segmented by simple, medium, and complex patterns). We can determine that the following are needed:

  • An open data format ingestion architecture processing the source dataset and refining the data in the S3 data lake. This requires a dedicated team of 3–7 members building a serverless data lake for all data sources. Ingestion migration implementation is segmented by tenants and type of ingestion patterns, such as internal database change data capture (CDC); data streaming, clickstream, and Internet of Things (IoT); public dataset capture; partner data transfer; and file ingestion patterns.
  • The migration team composition is tailored to the needs of a project wave. Depending on each migration wave and what is being done in the wave (development, testing, or performance tuning), the right people will be engaged. When the wave is complete, the people from that wave will move to another wave.
  • A loading team builds a producer-consumer architecture in Amazon Redshift to process concurrent near real-time publishing of data. This requires a dedicated team of 3–7 members building and publishing refined datasets in Amazon Redshift.
  • A shared DBA group of 3–5 individuals helping with schema standardization, migration challenges, and performance optimization outside the automated conversion.
  • Data transformation experts to convert database stored functions in the producer or consumer.
  • A migration sprint plan spanning 10 months, using 2-week sprints with multiple waves to release tenants to the new architecture.
  • A validation team to confirm a reliable and complete migration.
  • One to two data visualization experts per team, confirming that consumer downstream applications are accurate and performant.
  • A platform architecture team (3–5 individuals) focused on platform design, service levels, availability, operational standards, cost, observability, scalability, performance, and design pattern issue resolutions.

Strategic wave planning

Migration waves can be determined as follows:

  • Dependency-based wave delineation – Objects can be grouped into migration waves based on their dependency relationships. Objects with no or minimal dependencies will be prioritized for earlier waves, whereas those with complex dependencies will be migrated in subsequent waves. This provides a smooth and sequential migration process.
  • Logical schema and business area alignment – You can further revise migration waves by considering logical schema and business areas. This allows you to migrate related data objects together, minimizing disruption to specific business functions.

Functional and performance

In this section, we discuss the steps for refactoring the legacy SQL codebase to follow Redshift SQL best practices, building validation routines to ensure accuracy and completeness during the transition to Amazon Redshift, capturing KPIs to ensure similar or better service levels for consumption tools and downstream applications, and incorporating performance hooks and procedures for a scalable and performant Redshift platform.

Code conversion

We recommend using the AWS SCT as the first step in the code conversion journey. The AWS SCT is a powerful tool that can streamline the database schema and code migrations to Amazon Redshift. With its intuitive interface and automated conversion capabilities, the AWS SCT can significantly reduce the manual effort required during the migration process. Refer to Converting data warehouse schemas to Amazon Redshift using AWS SCT for instructions to convert your database schema, including tables, views, functions, and stored procedures, to Amazon Redshift format. For an Oracle source, you can also use the managed service version DMS Schema Conversion.

When the conversion is complete, the AWS SCT generates a detailed conversion report. This report highlights any potential issues, incompatibilities, or areas requiring manual intervention. Although the AWS SCT automates a significant portion of the conversion process, manual review and modifications are often necessary to address various complexities and optimizations.

Some common cases where manual review and modifications are typically required include:

  • Incompatible data types – The AWS SCT may not always handle custom or non-standard data types, requiring manual intervention to map them to compatible Amazon Redshift data types.
  • Database-specific SQL extensions or proprietary functions – If the source database uses SQL extensions or proprietary functions specific to the database vendor (for example, STRING_AGG() or ARRAY_UPPER functions, or custom UDFs for PostgreSQL), these may need to be manually rewritten or replaced with equivalent Amazon Redshift functions or UDFs. The AWS SCT extension pack is an add-on module that emulates functions present in a source database that are required when converting objects to the target database.
  • Performance optimization – Although the AWS SCT can convert the schema and code, manual optimization is often necessary to take advantage of the features and capabilities of Amazon Redshift. This may include adjusting distribution and sort keys, converting row-by-row operations to set-based operations, optimizing query plans, and other performance tuning techniques specific to Amazon Redshift.
  • Stored procedures and code conversion – The AWS SCT offers comprehensive capabilities to seamlessly migrate stored procedures and other code objects across platforms. Although its automated conversion process efficiently handles the majority of cases, certain intricate scenarios may necessitate manual intervention due to the complexity of the code and utilization of database-specific features or extensions. To achieve optimal compatibility and accuracy, it’s advisable to undertake testing and validation procedures during the migration process.

After you address the issues identified during the manual review process, it’s crucial to thoroughly test the converted stored procedures, as well as other database objects and code, such as views, functions, and SQL extensions, in a non-production Redshift cluster before deploying them in the production environment. This exercise is mostly undertaken by QA teams. This phase also involves conducting holistic performance testing (individual queries, batch loads, consumption reports and dashboards in BI tools, data mining applications, ML algorithms, and other relevant use cases) in addition to functional testing to make sure the converted code meets the required performance expectations. The performance tests should simulate production-like workloads and data volumes to validate the performance under realistic conditions.

Data validation

When migrating data from an on-premises data warehouse to a Redshift cluster on AWS, data validation is a crucial step to confirm the integrity and accuracy of the migrated data. There are several approaches you can consider:

  • Custom scripts – Use scripting languages like Python, SQL, or Bash to develop custom data validation scripts tailored to your specific data validation requirements. These scripts can connect to both the source and target databases, extract data, perform comparisons, and generate reports.
  • Open source tools – Use open source data validation tools like Amazon Deequ or Great Expectations. These tools provide frameworks and utilities for defining data quality rules, validating data, and generating reports.
  • AWS native or commercial tools – Use AWS native tools such as AWS Glue Data Quality or commercial data validation tools like Collibra Data Quality. These tools often provide comprehensive features, user-friendly interfaces, and dedicated support.

The following are different types of validation checks to consider:

  • Structural comparisons – Compare the list of columns and data types of columns between the source and target (Amazon Redshift). Any mismatches should be flagged.
  • Row count validation – Compare the row counts of each core table in the source data warehouse with the corresponding table in the target Redshift cluster. This is the most basic validation step to make sure no data has been lost or duplicated during the migration process.
  • Column-level validation – Validate individual columns by comparing column-level statistics (min, max, count, sum, average) for each column between the source and target databases. This can help identify any discrepancies in data values or data types.
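
As an illustration of these checks, a minimal sketch of a validation query combining row counts and column-level statistics might look like the following (sales_orders and its columns are hypothetical names; run the same query against the source warehouse and against Amazon Redshift, then compare the two result sets):

-- Hypothetical sketch: row count plus column-level statistics for one table
SELECT 'sales_orders'              AS table_name,
       COUNT(*)                    AS row_count,
       MIN(order_date)             AS min_order_date,
       MAX(order_date)             AS max_order_date,
       SUM(order_amount)           AS total_order_amount,
       ROUND(AVG(order_amount), 2) AS avg_order_amount
FROM sales_orders;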

You can also consider the following validation strategies:

  • Data profiling – Perform data profiling on the source and target databases to understand the data characteristics, identify outliers, and detect potential data quality issues. For example, you can use the data profiling capabilities of AWS Glue Data Quality or the Amazon Deequ library.
  • Reconciliation reports – Produce detailed validation reports that highlight errors, mismatches, and data quality issues. Consider generating reports in various formats (CSV, JSON, HTML) for straightforward consumption and integration with monitoring tools.
  • Automate the validation process – Integrate the validation logic into your data migration or ETL pipelines using scheduling tools or workflow orchestrators like Apache Airflow or AWS Step Functions.

Lastly, keep in mind the following considerations for collaboration and communication:

  • Stakeholder involvement – Involve relevant stakeholders, such as business analysts, data owners, and subject matter experts, throughout the validation process to make sure business requirements and data quality expectations are met.
  • Reporting and sign-off – Establish a clear reporting and sign-off process for the validation results, involving all relevant stakeholders and decision-makers.

Measure and benchmark KPIs

For multi-tenant Amazon Redshift implementation, KPIs are segmented at the platform level, tenant level, and consumption tools level. KPIs evaluate the operational metrics, cost metrics, and end-user response time metrics. In this section, we discuss the KPIs needed for achieving a successful transition.

Platform-level KPIs

As new tenants are gradually migrated to the platform, it’s imperative to monitor the current state of Amazon Redshift platform-level KPIs. The current state of these KPIs helps the platform team make the necessary scalability modifications (add nodes, add consumer clusters, add producer clusters, or increase concurrency scaling clusters). Amazon Redshift query monitoring rules (QMR) also help govern the overall state of the data platform, providing optimal performance for all tenants by managing outlier workloads.

The following table summarizes the relevant platform-level KPIs.

Component | KPI | Service Level and Success Criteria
ETL | Ingestion data volume | Daily or hourly peak volume in GBps, number of objects, number of threads.
ETL | Ingestion threads | Peak hourly ingestion threads (COPY or INSERT), number of dependencies, KPI segmented by tenants and domains.
ETL | Stored procedure volume | Peak hourly stored procedure invocations segmented by tenants and domains.
ETL | Concurrent load | Peak concurrent load supported by the producer cluster; distribution of ingestion pattern across multiple producer clusters using data sharing.
ETL | Data sharing dependency | Data sharing between producer clusters (objects refreshed, locks per hour, waits per hour).
Workload | Number of queries | Peak hour query volume supported by cluster segmented by short (less than 10 seconds), medium (less than 60 seconds), long (less than 5 minutes), very long (less than 30 minutes), and outlier (more than 30 minutes); segmented by tenant, domain, or sub-domain.
Workload | Number of queries per queue | Peak hour query volume supported by priority automatic WLM queue segmented by short (less than 10 seconds), medium (less than 60 seconds), long (less than 5 minutes), very long (less than 30 minutes), and outlier (more than 30 minutes); segmented by tenant, business group, domain, or sub-domain.
Workload | Runtime pattern | Total runtime per hour; max, median, and average run pattern; segmented by service class across clusters.
Workload | Wait time patterns | Total wait time per hour; max, median, and average wait pattern for queries waiting.
Performance | Leader node usage | Service level for leader node (recommended less than 80%).
Performance | Compute node CPU usage | Service level for compute node (recommended less than 90%).
Performance | Disk I/O usage per node | Service level for disk I/O per node.
Performance | QMR rules | Number of outlier queries stopped by QMR (large scan, large spilling disk, large runtime); logging thresholds for potential large queries running more than 5 minutes.
Performance | History of WLM queries | Historical trend of queries stored in historical archive table for all instances of queries in STL_WLM_QUERY; trend analysis over 30 days, 60 days, and 90 days to fine-tune the workload across clusters.
Cost | Total cost per month of Amazon Redshift platform | Service level for mix of instances (reserved, on-demand, serverless), cost of Concurrency Scaling, cost of Amazon Redshift Spectrum usage. Use AWS tools like AWS Cost Explorer or daily cost usage report to capture monthly costs for each component.
Cost | Daily Concurrency Scaling usage | Service limits to monitor cost for concurrency scaling; invoke for outlier activity on spikes.
Cost | Daily Amazon Redshift Spectrum usage | Service limits to monitor cost for using Amazon Redshift Spectrum; invoke for outlier activity.
Cost | Redshift Managed Storage usage cost | Track usage of Redshift Managed Storage, monitoring wastage on temporary, archival, and old data assets.
Localization | Remote or on-premises tools | Service level for rendering large datasets to remote destinations.
Localization | Data transfer to remote tools | Data transfer to BI tools or workstations outside the Redshift cluster VPC; separation of datasets to Amazon S3 using the unload feature, avoiding bottlenecks at leader node.
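
The “History of WLM queries” KPI above assumes that STL_WLM_QUERY rows are copied into a longer-lived archive table, because STL system tables retain only a few days of history. A minimal sketch of such an archive load follows (wlm_query_history is an illustrative table name; schedule the insert so it runs before the system table’s retention window expires):

-- One-time setup: create an empty archive table with the same structure as the system table
CREATE TABLE wlm_query_history AS
SELECT * FROM stl_wlm_query WHERE 1 = 0;

-- Scheduled load: append only rows newer than what has already been archived
INSERT INTO wlm_query_history
SELECT w.*
FROM stl_wlm_query w
WHERE w.service_class_start_time >
      (SELECT COALESCE(MAX(service_class_start_time), '1900-01-01'::timestamp)
       FROM wlm_query_history);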

Tenant-level KPIs

Tenant-level KPIs help capture current performance levels from the legacy system and document expected service levels for the data flow from the source capture to end-user consumption. The captured legacy KPIs assist in providing the best target modern Amazon Redshift platform (a single Redshift data warehouse, a lake house with Amazon Redshift Spectrum, and data sharing with the producer and consumer clusters). Cost usage tracking at the tenant level helps you spread the cost of a shared platform across tenants.

The following table summarizes the relevant tenant-level KPIs.

Component | KPI | Service Level and Success Criteria
Cost | Compute usage by tenant | Track usage by tenant, business group, or domain; capture query volume by business unit associating Redshift user identity to internal business unit; data observability by consumer usage for data products helping with cost attribution.
ETL | Orchestration SLA | Service level for daily data availability.
ETL | Runtime | Service level for data loading and transformation.
ETL | Data ingestion volume | Peak expected volume for service level guarantee.
Query consumption | Response time | Response time SLA for query patterns (dashboards, SQL analytics, ML analytics, BI tool caching).
Query consumption | Concurrency | Peak query consumers for tenant.
Query consumption | Query volume | Peak hourly volume service levels and daily query volumes.
Query consumption | Individual query response for critical data consumption | Service level and success criteria for critical workloads.

Consumer-level KPIs

A multi-tenant modern data platform can set service levels for a variety of consumer tools. The service levels give end-users guidance on the capabilities of the new deployment.

The following table summarizes the relevant consumer-level KPIs.

Consumer | KPI | Service Level and Success Criteria
BI tools | Large data extraction | Service level for unloading data for caching or query rendering a large result dataset.
Dashboards | Response time | Service level for data refresh.
SQL query tools | Response time | Service level for response time by query type.
SQL query tools | Concurrency | Service level for concurrent query access by all consumers.
One-time analytics | Response time | Service level for large data unloads or aggregation.
ML analytics | Response time | Service level for large data unloads or aggregation.

Sample SQL

The post includes sample SQL to capture daily KPI metrics. The following example KPI dashboard trends assist in capturing historic workload patterns, identifying deviations in workload, and providing guidance on the platform workload capacity to meet the current workload and anticipated growth patterns.
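
For reference, a minimal sketch of such a daily KPI query might look like the following (it assumes the STL_WLM_QUERY system table, whose time columns are reported in microseconds; adjust the service class filter and queue-wait threshold to match your WLM configuration):

-- Minimal sketch: daily query volume and queries that waited more than 5 seconds
SELECT TRUNC(service_class_start_time)                                 AS query_day,
       COUNT(*)                                                        AS total_queries,
       SUM(CASE WHEN total_queue_time > 5 * 1000000 THEN 1 ELSE 0 END) AS queued_over_5s,
       ROUND(AVG(total_exec_time) / 1000000.0, 2)                      AS avg_exec_seconds,
       ROUND(MAX(total_exec_time) / 1000000.0, 2)                      AS max_exec_seconds
FROM stl_wlm_query
WHERE service_class >= 6   -- user-defined queues; adjust for your WLM setup
GROUP BY TRUNC(service_class_start_time)
ORDER BY query_day;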

The following figure shows a daily query volume snapshot (queries per day and queued queries per day, which waited a minimum of 5 seconds).

The following figure shows a daily usage KPI. It monitors percentage waits and median wait for waiting queries (identifies the minimal threshold for wait to compute waiting queries and median of all wait times to infer deviation patterns).

The following figure illustrates concurrency usage (monitors concurrency compute usage for Concurrency Scaling clusters).

The following figure shows a 30-day pattern (computes volume in terms of total runtime and total wait time).

Monitoring Amazon Redshift performance and continual optimization

Amazon Redshift uses automatic table optimization (ATO) to choose the right distribution style, sort keys, and encoding when you create a table with AUTO options. Therefore, it’s a good practice to take advantage of the AUTO feature and create tables with DISTSTYLE AUTO, SORTKEY AUTO, and ENCODING AUTO. When tables are created with AUTO options, Amazon Redshift initially creates tables with optimal keys for the best first-time query performance possible using information such as the primary key and data types. In addition, Amazon Redshift analyzes the data volume and query usage patterns to evolve the distribution strategy and sort keys to optimize performance over time. Finally, Amazon Redshift performs table maintenance activities on your tables that reduce fragmentation and make sure statistics are up to date.
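
For example, a minimal sketch of a table that delegates these choices to ATO might look like the following (sales_fact and its columns are illustrative names; compression encoding defaults to ENCODE AUTO when no column encodings are specified):

-- Let automatic table optimization manage distribution, sort keys, and encoding
CREATE TABLE sales_fact (
    sale_id     BIGINT,
    sale_date   DATE,
    customer_id INTEGER,
    amount      DECIMAL(12,2)
)
DISTSTYLE AUTO
SORTKEY AUTO;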

During a large, phased migration, it’s important to monitor and measure Amazon Redshift performance against target KPIs at each phase and implement continual optimization. As new workloads are onboarded at each phase of the migration, it’s recommended to perform regular Redshift cluster reviews and analyze query pattern and performance. Cluster reviews can be done by engaging the Amazon Redshift specialist team through AWS Enterprise support or your AWS account team. The goal of a cluster review includes the following:

  • Use cases – Review the application use cases and determine if the design is suitable to solve for those use cases.
  • End-to-end architecture – Assess the current data pipeline architecture (ingestion, transformation, and consumption). For example, determine if too many small inserts are occurring and review their ETL pipeline. Determine if integration with other AWS services can be useful, such as AWS Lake Formation, Amazon Athena, Redshift Spectrum, or Amazon Redshift federation with PostgreSQL and MySQL.
  • Data model design – Review the data model and table design and provide recommendations for sort and distribution keys, keeping in mind best practices.
  • Performance – Review cluster performance metrics. Identify bottlenecks or irregularities and suggest recommendations. Dive deep into specific long-running queries to identify solutions specific to the customer’s workload.
  • Cost optimization – Provide recommendations to reduce costs where possible.
  • New features – Stay up to date with the new features in Amazon Redshift and identify where they can be used to meet these goals.

New workloads can introduce query patterns that could impact performance and miss target SLAs. A number of factors can affect query performance. In the following sections, we discuss aspects impacting query speed and optimizations for improving Redshift cluster performance.

Identify top offending queries

A compute node is partitioned into slices. More nodes means more processors and more slices, which enables you to redistribute the data as needed across the slices. However, more nodes also means greater expense, so you will need to find the balance of cost and performance that is appropriate for your system. For more information on Redshift cluster architecture, see Data warehouse system architecture. Each node type offers different sizes and limits to help you scale your cluster appropriately. The node size determines the storage capacity, memory, CPU, and price of each node in the cluster. For more information on node types, see Amazon Redshift pricing.

Redshift Test Drive is an open source tool that lets you evaluate which data warehouse configuration options are best suited for your workload. We created Redshift Test Drive from Simple Replay and Amazon Redshift Node Configuration Comparison (see Compare different node types for your workload using Amazon Redshift for more details) to provide a single entry point for finding the best Amazon Redshift configuration for your workload. Redshift Test Drive also provides additional features such as a self-hosted analysis UI and the ability to replicate external objects that a Redshift workload may interact with. With Amazon Redshift Serverless, you can start with a base Redshift Processing Unit (RPU) capacity, and Redshift Serverless automatically scales based on your workload needs.
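
In addition to these sizing tools, you can query the Redshift monitoring views directly to surface the top offending queries. The following is a minimal sketch (it assumes the SYS_QUERY_HISTORY monitoring view and microsecond-granularity time columns; adjust the time window and filters for your workload):

-- Minimal sketch: top 10 longest-running user SELECT queries over the last 7 days
SELECT query_id,
       user_id,
       ROUND(elapsed_time / 1000000.0, 1) AS elapsed_seconds,
       ROUND(queue_time / 1000000.0, 1)   AS queue_seconds,
       LEFT(query_text, 80)               AS query_snippet
FROM sys_query_history
WHERE start_time >= DATEADD(day, -7, GETDATE())
  AND query_type = 'SELECT'
ORDER BY elapsed_time DESC
LIMIT 10;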

Optimization strategies

If you choose to fine-tune manually, the following are key concepts and considerations:

  • Data distribution – Amazon Redshift stores table data on the compute nodes according to a table’s distribution style. When you run a query, the query optimizer redistributes the data to the compute nodes as needed to perform any joins and aggregations. Choosing the right distribution style for a table helps minimize the impact of the redistribution step by locating the data where it needs to be before the joins are performed. For more information, see Working with data distribution styles. (A brief DDL sketch of explicit distribution and sort keys follows this list.)
  • Data sort order – Amazon Redshift stores table data on disk in sorted order according to a table’s sort keys. The query optimizer and query processor use the information about where the data is located to reduce the number of blocks that need to be scanned and thereby improve query speed. For more information, see Working with sort keys.
  • Dataset size – A higher volume of data in the cluster can slow query performance, because more rows need to be scanned and redistributed. You can mitigate this effect by regularly vacuuming and archiving data, and by using a predicate (a condition in the WHERE clause) to restrict the query dataset.
  • Concurrent operations – Amazon Redshift offers a powerful feature called automatic workload management (WLM) with query priorities, which enhances query throughput and overall system performance. By intelligently managing multiple concurrent operations and allocating resources dynamically, automatic WLM makes sure high-priority queries receive the necessary resources promptly, while lower-priority queries are processed efficiently without compromising system stability. This advanced queuing mechanism allows Amazon Redshift to optimize resource utilization, minimizing potential bottlenecks and maximizing query throughput, ultimately delivering a seamless and responsive experience for users running multiple operations simultaneously.
  • Query structure – How your query is written will affect its performance. As much as possible, write queries to process and return as little data as will meet your needs. For more information, see Amazon Redshift best practices for designing queries.
  • Queries with a long return time – Queries with a long return time can impact the processing of other queries and the overall performance of the cluster, so it’s critical to identify and optimize them. You can optimize these queries by either moving clients to the same network or using the UNLOAD feature of Amazon Redshift and configuring the client to read the output from Amazon S3. To identify percentile and top running queries, you can download the sample SQL notebook system queries and import it into Query Editor V2.0.
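
To illustrate the data distribution and sort order considerations above, the following is a minimal DDL sketch of explicitly tuned keys (orders, customer_id, and order_date are illustrative names, chosen under the assumption that the table is frequently joined on customer_id and filtered by date):

-- Co-locate rows on the common join column and sort by the common filter column
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id INTEGER,
    order_date  DATE,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (order_date);

Whether KEY distribution is appropriate depends on your join patterns and data skew; when in doubt, DISTSTYLE AUTO remains a safe default.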

Conclusion

In this post, we discussed best practices for assessing, planning, and implementing a large-scale data warehouse migration into Amazon Redshift.

The assessment phase of a data migration project is critical for implementing a successful migration. It involves a comprehensive analysis of the existing workload, integrations, and dependencies to accurately estimate the effort required and determine the appropriate team size. Strategic wave planning is crucial for prioritizing and scheduling the migration tasks effectively. Establishing KPIs and benchmarking them helps measure progress and identify areas for improvement. Code conversion and data validation processes validate the integrity of the migrated data and applications. Monitoring Amazon Redshift performance, identifying and optimizing top offending queries, and conducting regular cluster reviews are essential for maintaining optimal performance and addressing any potential issues promptly.

By addressing these key aspects, organizations can seamlessly migrate their data workloads to Amazon Redshift while minimizing disruptions and maximizing its benefits.

We hope this post provides you with valuable guidance. We welcome any thoughts or questions in the comments section.


About the authors

Chanpreet Singh is a Senior Lead Consultant at AWS, specializing in Data Analytics and AI/ML. He has over 17 years of industry experience and is passionate about helping customers build scalable data warehouses and big data solutions. In his spare time, Chanpreet loves to explore nature, read, and spend time with his family.

Harshida Patel is an Analytics Specialist Principal Solutions Architect with AWS.

Raza Hafeez is a Senior Product Manager at Amazon Redshift. He has over 13 years of professional experience building and optimizing enterprise data warehouses and is passionate about enabling customers to realize the power of their data. He specializes in migrating enterprise data warehouses to AWS Modern Data Architecture.

Ram Bhandarkar is a Principal Data Architect at AWS based out of Northern Virginia. He helps customers with planning future Enterprise Data Strategy and assists them with the transition to a Modern Data Architecture platform on AWS. He has worked on building and migrating databases, data warehouses, and data lake solutions for over 25 years.

Vijay Bagur is a Sr. Technical Account Manager. He works with enterprise customers to modernize and cost optimize workloads, improve security posture, and helps them build reliable and secure applications on the AWS platform. Outside of work, he loves spending time with his family, biking and traveling.

Get started with the new Amazon DataZone enhancements for Amazon Redshift

Post Syndicated from Carmen Manzulli original https://aws.amazon.com/blogs/big-data/get-started-with-the-new-amazon-datazone-enhancements-for-amazon-redshift/

In today’s data-driven landscape, organizations are seeking ways to streamline their data management processes and unlock the full potential of their data assets, while controlling access and enforcing governance. That’s why we introduced Amazon DataZone.

Amazon DataZone is a powerful data management service that empowers data engineers, data scientists, product managers, analysts, and business users to seamlessly catalog, discover, analyze, and govern data across organizational boundaries, AWS accounts, data lakes, and data warehouses.

On March 21, 2024, Amazon DataZone introduced several exciting enhancements to its Amazon Redshift integration that simplify the process of publishing and subscribing to data warehouse assets like tables and views, while enabling Amazon Redshift customers to take advantage of the data management and governance capabilities of Amazon DataZone.

These updates improve the experience for both data users and administrators.

Data producers and consumers can now quickly create data warehouse environments using preconfigured credentials and connection parameters provided by their Amazon DataZone administrators.

Additionally, these enhancements grant administrators greater control over who can access and use the resources within their AWS accounts and Redshift clusters, and for what purpose.

As an administrator, you can now create parameter sets on top of DefaultDataWarehouseBlueprint by providing parameters such as cluster, database, and an AWS secret. You can use these parameter sets to create environment profiles and authorize Amazon DataZone projects to use these environment profiles for creating environments.

In turn, data producers and data consumers can now select an environment profile to create environments without having to provide the parameters themselves, saving time and reducing the risk of issues.

In this post, we explain how you can use these enhancements to the Amazon Redshift integration to publish your Redshift tables to the Amazon DataZone data catalog, and enable users across the organization to discover and access them in a self-service fashion. We present a sample end-to-end customer workflow that covers the core functionalities of Amazon DataZone, and include a step-by-step guide of how you can implement this workflow.

The same workflow is available as a video demonstration on the Amazon DataZone official YouTube channel.

Solution overview

To get started with the new Amazon Redshift integration enhancements, consider the following scenario:

  • A sales team acts as the data producer, owning and publishing product sales data (a single table in a Redshift cluster called catalog_sales)
  • A marketing team acts as the data consumer, needing access to the sales data in order to analyze it and build product adoption campaigns

At a high level, the steps we walk you through in the following sections include tasks for the Amazon DataZone administrator, Sales team, and Marketing team.

Prerequisites

For the workflow described in this post, we assume a single AWS account, a single AWS Region, and a single AWS Identity and Access Management (IAM) user, who will act as Amazon DataZone administrator, Sales team (producer), and Marketing team (consumer).

To follow along, you need an AWS account. If you don’t have an account, you can create one.

In addition, you must have the following resources configured in your account:

  • An Amazon DataZone domain with admin, sales, and marketing projects
  • A Redshift namespace and workgroup

If you don’t have these resources already configured, you can create them by deploying an AWS CloudFormation stack:

  1. Choose Launch Stack to deploy the provided CloudFormation template.
  2. For AdminUserPassword, enter a password, and take note of this password to use in later steps.
  3. Leave the remaining settings as default.
  4. Select I acknowledge that AWS CloudFormation might create IAM resources, then choose Submit.
  5. When the stack deployment is complete, on the Amazon DataZone console, choose View domains in the navigation pane to see the newly created Amazon DataZone domain.
  6. On the Amazon Redshift Serverless console, in the navigation pane, choose Workgroup configuration and see the newly created resource.

You should be logged in using the same role that you used to deploy the CloudFormation stack and verify that you’re in the same Region.

As a final prerequisite, you need to create a catalog_sales table in the default Redshift database (dev).

  1. On the Amazon Redshift Serverless console, select your workgroup and choose Query data to open the Amazon Redshift query editor.
  2. In the query editor, choose your workgroup and select Database user name and password as the type of connection, then provide your admin database user name and password.
  3. Use the following query to create the catalog_sales table, which the Sales team will publish in the workflow:
    CREATE TABLE catalog_sales AS 
    SELECT 146776932 AS order_number, 23 AS quantity, 23.4 AS wholesale_cost, 45.0 as list_price, 43.0 as sales_price, 2.0 as discount, 12 as ship_mode_sk,13 as warehouse_sk, 23 as item_sk, 34 as catalog_page_sk, 232 as ship_customer_sk, 4556 as bill_customer_sk
    UNION ALL SELECT 46776931, 24, 24.4, 46, 44, 1, 14, 15, 24, 35, 222, 4551
    UNION ALL SELECT 46777394, 42, 43.4, 60, 50, 10, 30, 20, 27, 43, 241, 4565
    UNION ALL SELECT 46777831, 33, 40.4, 51, 46, 15, 16, 26, 33, 40, 234, 4563
    UNION ALL SELECT 46779160, 29, 26.4, 50, 61, 8, 31, 15, 36, 40, 242, 4562
    UNION ALL SELECT 46778595, 43, 28.4, 49, 47, 7, 28, 22, 27, 43, 224, 4555
    UNION ALL SELECT 46779482, 34, 33.4, 64, 44, 10, 17, 27, 43, 52, 222, 4556
    UNION ALL SELECT 46779650, 39, 37.4, 51, 62, 13, 31, 25, 31, 52, 224, 4551
    UNION ALL SELECT 46780524, 33, 40.4, 60, 53, 18, 32, 31, 31, 39, 232, 4563
    UNION ALL SELECT 46780634, 39, 35.4, 46, 44, 16, 33, 19, 31, 52, 242, 4557
    UNION ALL SELECT 46781887, 24, 30.4, 54, 62, 13, 18, 29, 24, 52, 223, 4561

Now you’re ready to get started with the new Amazon Redshift integration enhancements.

Amazon DataZone administrator tasks

As the Amazon DataZone administrator, you perform the following tasks:

  1. Configure the DefaultDataWarehouseBlueprint.
    • Authorize the Amazon DataZone admin project to use the blueprint to create environment profiles.
    • Create a parameter set on top of DefaultDataWarehouseBlueprint by providing parameters such as cluster, database, and AWS secret.
  2. Set up environment profiles for the Sales and Marketing teams.

Configure the DefaultDataWarehouseBlueprint

Amazon DataZone blueprints define which AWS tools and services are provisioned for use within an Amazon DataZone environment. Enabling the data warehouse blueprint allows data consumers and data producers to use Amazon Redshift and the query editor for sharing, accessing, and consuming data.

  1. On the Amazon DataZone console, choose View domains in the navigation pane.
  2. Choose your Amazon DataZone domain.
  3. Choose Default Data Warehouse.

If you used the CloudFormation template, the blueprint is already enabled.

Part of the new Amazon Redshift experience involves the Managing projects and Parameter sets tabs. The Managing projects tab lists the projects that are allowed to create environment profiles using the data warehouse blueprint. By default, this is set to all projects. For our purpose, let’s grant only the admin project.

  1. On the Managing projects tab, choose Edit.

  1. Select Restrict to only managing projects and choose the AdminPRJ project.
  2. Choose Save changes.

With this enhancement, the administrator can control which projects can use the default blueprints in their account to create environment profiles.

The Parameter sets tab lists parameters that you can create on top of DefaultDataWarehouseBlueprint by providing parameters such as Redshift cluster or Redshift Serverless workgroup name, database name, and the credentials that allow Amazon DataZone to connect to your cluster or workgroup. You can also create AWS secrets on the Amazon DataZone console. Before these enhancements, AWS secrets had to be managed separately using AWS Secrets Manager, making sure to include the proper tags (key-value) for Amazon Redshift Serverless.

For our scenario, we need to create a parameter set to connect a Redshift Serverless workgroup containing sales data.

  1. On the Parameter sets tab, choose Create parameter set.
  2. Enter a name and optional description for the parameter set.
  3. Choose the Region containing the resource you want to connect to (for example, our workgroup is in us-east-1).
  4. In the Environment parameters section, select Amazon Redshift Serverless.

If you already have an AWS secret with credentials to your Redshift Serverless workgroup, you can provide the existing AWS secret ARN. In this case, the secret must be tagged with the following (key-value): AmazonDataZoneDomain: <Amazon DataZone domain ID>.

  1. Because we don’t have an existing AWS secret, we create a new one by choosing Create new AWS Secret.
  2. In the pop-up, enter a secret name and your Amazon Redshift credentials, then choose Create new AWS Secret.

Amazon DataZone creates a new secret using Secrets Manager and makes sure the secret is tagged with the domain in which you’re creating the parameter set.

  1. Enter the Redshift Serverless workgroup name and database name to complete the parameters list. If you used the provided CloudFormation template, use sales-workgroup for the workgroup name and dev for the database name.
  2. Choose Create parameter set.

You can see the parameter set created for your Redshift environment and the blueprint enabled with a single managing project configured.

Set up environment profiles for the Sales and Marketing teams

Environment profiles are predefined templates that encapsulate technical details required to create an environment, such as the AWS account, Region, and resources and tools to be added to projects. The next Amazon DataZone administrator task consists of setting up environment profiles, based on the default enabled blueprint, for the Sales and Marketing teams.

This task will be performed from the admin project in the Amazon DataZone data portal, so let’s follow the data portal URL and start creating an environment profile for the Sales team to publish their data.

  1. On the details page of your Amazon DataZone domain, in the Summary section, choose the link for your data portal URL.

When you open the data portal for the first time, you’re prompted to create a project. If you used the provided CloudFormation template, the projects are already created.

  1. Choose the AdminPRJ project.
  2. On the Environments page, choose Create environment profile.
  3. Enter a name (for example, SalesEnvProfile) and optional description (for example, Sales DWH Environment Profile) for the new environment profile.
  4. For Owner, choose AdminPRJ.
  5. For Blueprint, select the DefaultDataWarehouse blueprint (you’ll only see blueprints where the admin project is listed as a managing project).
  6. Choose the current enabled account and the parameter set you previously created.

Then you will see each prepopulated value for Redshift Serverless. Under Authorized projects, you can pick the authorized projects allowed to use this environment profile to create an environment. By default, this is set to All projects.

  1. Select Authorized projects only.
  2. Choose Add projects and choose the SalesPRJ project.
  3. Configure the publishing permissions for this environment profile. Because the Sales team is our data producer, we select Publish from any schema.
  4. Choose Create environment profile.

Next, you create a second environment profile for the Marketing team to consume data. To do this, you repeat similar steps made for the Sales team.

  1. Choose the AdminPRJ project.
  2. On the Environments page, choose Create environment profile.
  3. Enter a name (for example, MarketingEnvProfile) and optional description (for example, Marketing DWH Environment Profile).
  4. For Owner, choose AdminPRJ.
  5. For Blueprint, select the DefaultDataWarehouse blueprint.
  6. Select the parameter set you created earlier.
  7. This time, keep All projects as the default (alternatively, you could select Authorized projects only and add MarketingPRJ).
  8. Configure the publishing permissions for this environment profile. Because the Marketing team is our data consumer, we select Don’t allow publishing.
  9. Choose Create environment profile.

With these two environment profiles in place, the Sales and Marketing teams can start working on their projects on their own to create their proper environments (resources and tools) with fewer configurations and less risk to incur errors, and publish and consume data securely and efficiently within these environments.

To recap, the new enhancements offer the following features:

  • When creating an environment profile, you can choose to provide your own Amazon Redshift parameters or use one of the parameter sets from the blueprint configuration. If you choose to use the parameter set created in the blueprint configuration, the AWS secret only requires the AmazonDataZoneDomain tag (the AmazonDataZoneProject tag is only required if you choose to provide your own parameter sets in the environment profile).
  • In the environment profile, you can specify a list of authorized projects, so that only authorized projects can use this environment profile to create data warehouse environments.
  • You can also specify which data authorized projects are allowed to publish. You can choose one of the following options: Publish from any schema, Publish from the default environment schema, and Don’t allow publishing.

These enhancements grant administrators more control over Amazon DataZone resources and projects and facilitate the common activities of all roles involved.

Sales team tasks

As a data producer, the Sales team performs the following tasks:

  1. Create a sales environment.
  2. Create a data source.
  3. Publish sales data to the Amazon DataZone data catalog.

Create a sales environment

Now that you have an environment profile, you need to create an environment in order to work with data and analytics tools in this project.

  1. Choose the SalesPRJ project.
  2. On the Environments page, choose Create environment.
  3. Enter a name (for example, SalesDwhEnv) and optional description (for example, Environment DWH for Sales) for the new environment.
  4. For Environment profile, choose SalesEnvProfile.

Data producers can now select an environment profile to create environments, without the need to provide their own Amazon Redshift parameters. The AWS secret, Region, workgroup, and database are ported over to the environment from the environment profile, streamlining and simplifying the experience for Amazon DataZone users.

  1. Review your data warehouse parameters to confirm everything is correct.
  2. Choose Create environment.

The environment will be automatically provisioned by Amazon DataZone with the preconfigured credentials and connection parameters, allowing the Sales team to publish Amazon Redshift tables seamlessly.

Create a data source

Now, let’s create a new data source for our sales data.

  1. Choose the SalesPRJ project.
  2. On the Data page, choose Create data source.
  3. Enter a name (for example, SalesDataSource) and optional description.
  4. For Data source type, select Amazon Redshift.
  5. For Environment, choose SalesDwhEnv.
  6. For Redshift credentials, you can use the same credentials you provided during environment creation, because you’re still using the same Redshift Serverless workgroup.
  7. Under Data Selection, enter the schema name where your data is located (for example, public) and then specify a table selection criterion (for example, *).

Here, the * indicates that this data source will bring into Amazon DataZone all the technical metadata from the database tables of your schema (in this case, a single table called catalog_sales).

  1. Choose Next.

On the next page, automated metadata generation is enabled. This means that Amazon DataZone will automatically generate the business names of the table and columns for that asset. 

  1. Leave the settings as default and choose Next.
  2. For Run preference, select when to run the data source. Amazon DataZone can automatically publish these assets to the data catalog, but let’s select Run on demand so we can curate the metadata before publishing.
  3. Choose Next.
  4. Review all settings and choose Create data source.
  5. After the data source has been created, you can manually pull technical metadata from the Redshift Serverless workgroup by choosing Run.

When the data source has finished running, you can see the catalog_sales asset correctly added to the inventory.

Publish sales data to the Amazon DataZone data catalog

Open the catalog_sales asset to see details of the new asset (business metadata, technical metadata, and so on).

In a real-world scenario, this pre-publishing phase is when you can enrich the asset with more business context and information, such as a readme, glossaries, or metadata forms. For example, you can accept some of the automatically generated metadata recommendations and rename the asset or its columns to make them more readable, descriptive, and easier to search and understand for a business user.

For this post, simply choose Publish asset to complete the Sales team tasks.

Marketing team tasks

Let’s switch to the Marketing team and subscribe to the catalog_sales asset published by the Sales team. As a consumer team, the Marketing team will complete the following tasks:

  1. Create a marketing environment.
  2. Discover and subscribe to sales data.
  3. Query the data in Amazon Redshift.

Create a marketing environment

To subscribe and access Amazon DataZone assets, the Marketing team needs to create an environment.

  1. Choose the MarketingPRJ project.
  2. On the Environments page, choose Create environment.
  3. Enter a name (for example, MarketingDwhEnv) and optional description (for example, Environment DWH for Marketing).
  4. For Environment profile, choose MarketingEnvProfile.

As with data producers, data consumers can also benefit from a preconfigured profile (created and managed by the administrator) to speed up the environment creation process and reduce the risk of errors.

  1. Review your data warehouse parameters to confirm everything is correct.
  2. Choose Create environment.

Discover and subscribe to sales data

Now that we have a consumer environment, let’s search the catalog_sales table in the Amazon DataZone data catalog.

  1. Enter sales in the search bar.
  2. Choose the catalog_sales table.
  3. Choose Subscribe.
  4. In the pop-up window, choose your marketing consumer project, provide a reason for the subscription request, and choose Subscribe.

When you get a subscription request as a data producer, Amazon DataZone will notify you through a task in the sales producer project. Because you’re acting as both subscriber and publisher here, you will see a notification.

  1. Choose the notification, which will open the subscription request.

You can see details including which project has requested access, who is the requestor, and why access is needed.

  1. To approve, enter a message for approval and choose Approve.

Now that the subscription has been approved, let’s go back to MarketingPRJ. On the Subscribed data page, catalog_sales is listed as an approved asset, but access hasn’t been granted yet. If you choose the asset, you can see that Amazon DataZone is working on the backend to automatically grant access. When it’s complete, you’ll see the subscription as granted and the message “Asset added to 1 environment.”

Query data in Amazon Redshift

Now that the marketing project has access to the sales data, we can use the Amazon Redshift Query Editor V2 to analyze the sales data.

  1. Under MarketingPRJ, go to the Environments page and select the marketing environment.
  2. Under the analytics tools, choose Query data with Amazon Redshift, which redirects you to the query editor within the environment of the project.
  3. To connect to Amazon Redshift, choose your workgroup and select Federated user as the connection type.

When you’re connected, you will see the catalog_sales table under the public schema.

  1. To make sure that you have access to this table, run the following query:
SELECT * FROM catalog_sales LIMIT 10

As a consumer, you’re now able to explore data and create reports, or you can aggregate data and create new assets to publish in Amazon DataZone, becoming a producer of a new data product to share with other users and departments.

Clean up

To clean up your resources, complete the following steps:

  1. On the Amazon DataZone console, delete the projects used in this post. This will delete most project-related objects like data assets and environments.
  2. Clean up all Amazon Redshift resources (workgroup and namespace) to avoid incurring additional charges.

Conclusion

In this post, we demonstrated how you can get started with the new Amazon Redshift integration in Amazon DataZone. We showed how to streamline the experience for data producers and consumers and how to grant administrators control over data resources.

Embrace these enhancements and unlock the full potential of Amazon DataZone and Amazon Redshift for your data management needs.

Resources

For more information, refer to the following resources:


About the author

Carmen is a Solutions Architect at AWS, based in Milan (Italy). She is a data lover who enjoys helping companies adopt cloud technologies, especially for data analytics and data governance. Outside of work, she is a creative person who loves being in contact with nature and sometimes practicing adrenaline-fueled activities.