All posts by Lakshmi Nair

Navigating architectural choices for a lakehouse using Amazon SageMaker

Post Syndicated from Lakshmi Nair original https://aws.amazon.com/blogs/big-data/navigating-architectural-choices-for-a-lakehouse-using-amazon-sagemaker/

Organizations today are using data more than ever to drive decision-making and innovation. Because they work with petabytes of information, they have traditionally gravitated towards two distinct paradigms—data lakes and data warehouses. While each paradigm excels at specific use cases, they often create unintended barriers between the data assets. 

Data lakes are often built on object storage such as Amazon Simple Storage Service (Amazon S3), which provide flexibility by supporting diverse data formats and schema-on-read capabilities. This enables multi-engine access where various processing frameworks (such as Apache Spark, Trino, and Presto) can query the same data. On the other hand, data warehouses (such as Amazon Redshift) excel in areas such as ACID (atomicity, consistency, isolation and durability) compliance, performance optimization, and straightforward deployment, making them suitable for structured and complex queries. As data volumes grow and analytics needs become more complex, organizations seek to bridge these silos and use the strengths of both paradigms. This is where the concept of lakehouse architecture is applied, offering a unified approach to data management and analytics. 

Over time, several distinct lakehouse approaches have emerged. In this post, we show you how to evaluate and choose the right lakehouse pattern for your needs.

The data lake centric lakehouse approach begins with the scalability, cost-effectiveness, and flexibility of a traditional data lake built on object storage. The goal is to add a layer of transactional capabilities and data management traditionally found in databases, primarily through open table formats (such as Apache Hudi, Delta Lake, or Apache Iceberg). While open table formats have made significant strides by introducing ACID guarantees for single-table operations in data lakes, implementing multi-table transactions with complex referential integrity constraints and joins remains challenging. The fundamental nature of querying petabytes of files on object storage, often through distributed query engines, can result in slow interactive queries at high concurrency when compared to a highly optimized, indexed, and materialized data warehouse. Open table formats introduce compaction and indexing, but the full suite of intelligent storage optimizations found in highly mature, proprietary data warehouses is still evolving in data lake-centric architecture. 

The data warehouse centric lakehouse approach offers robust analytical capabilities but has significant interoperability challenges. Though data warehouses provide JAVA Database Connectivity (JDBC) and Open Database Connectivity (ODBC) drivers for external access, the underlying data remains in proprietary formats, making it difficult for external tools or services to directly access it without complex extract, transform, and load (ETL) or API layers. This can lead to data duplication and latency. A data warehouse architecture might support reading open table formats, but its ability to write to them or participate in their transactional layers can be limited. This restricts true interoperability and can create shadow data silos. 

On AWS, you can build a modern, open lakehouse architecture to achieve unified access to both data warehouses and data lakes. By using this approach, you can build sophisticated analytics, machine learning (ML), and generative AI applications while maintaining a single source of truth for their data. You don’t have to choose between a data lake or data warehouse. You can use existing investments and preserve the strengths of both paradigms while eliminating their respective weaknesses. The lakehouse architecture on AWS embraces open table formats such as Apache Hudi, Delta Lake, and Apache Iceberg.

You can accelerate your lakehouse journey with the next generation of Amazon SageMaker, which delivers an integrated experience for analytics and AI with unified access to data. SageMaker is built on an open lakehouse architecture that is fully compatible with Apache Iceberg. By extending support for Apache Iceberg REST APIs, SageMaker significantly adds interoperability and accessibility across various Apache Iceberg-compatible query engines and tools. At the core of this architecture is a metadata management layer built on AWS Glue Data Catalog and AWS Lake Formation, which provide unified governance and centralized access control.

Foundations of the Amazon SageMaker lakehouse architecture

The lakehouse architecture of Amazon SageMaker has four main components that work together to create a unified data platform. 

  • Flexible storage to adapt to the workload patterns and requirements
  • Technical catalog that serves as a single source of truth for all metadata
  • Integrated permission management with fine-grained access control across all data assets
  • Open access framework built on Apache Iceberg REST APIs for universal compatibility

Catalogs and permissions

When building an open lakehouse, the catalog—your central repository of metadata—is a critical component for data discovery and governance. There are two types of catalogs in the lakehouse architecture of Amazon SageMaker: managed catalogs and federated catalogs.

You can use an AWS Glue crawler to automatically discover and register this metadata in Data Catalog. Data Catalog stores the schema and table metadata of your data assets, effectively turning files into logical tables. After your data is cataloged, the next challenge is controlling who can access it. While you could use complex S3 bucket policies for every folder, this approach is difficult to manage and scale. Lake Formation provides a centralized database-style permissions model on the Data Catalog, giving you the flexibility to grant or revoke fine-grained access at row, column, and cell levels for individual users or roles. 

Open access with Apache Iceberg REST APIs

The lakehouse architecture described in the preceding section and shown in the following figure also uses the AWS Glue Iceberg REST catalog through the service endpoint, which provides OSS compatibility, enabling increased interoperability for managing Iceberg table metadata across Spark and other open source analytics engines. You can choose the appropriate API based on table format and use case requirements.

The lakehouse architecture of Amazon SageMaker

In this post, we explore various lakehouse architecture patterns, focusing on how to optimally use data lake and data warehouse to create robust, scalable, and performance-driven data solutions. 

Bringing data into your lakehouse on AWS

When building a lakehouse architecture, you can choose from three distinct patterns to access and integrate your data, each offering unique advantages for different use cases.

  • Traditional ETL is the classic method of extracting data, transforming it and loading it into your lakehouse. 

When to use it:

    • You need complex transformations and require highly curated and optimized data sets for downstream applications for better performance
    • You need to perform historical data migrations
    • You need data quality enforcement and standardization at scale
    • You need highly governed curated data in a lakehouse

  • Zero-ETL is a modern architectural pattern where data automatically and continuously replicates from a source system to lakehouse with minimal or no manual intervention or custom code. Behind the scenes, the pattern uses change data capture (CDC) to automatically stream all new inserts, updates, and deletes from the source to the target. This architectural pattern is effective when the source system maintains a high degree of data cleanliness and structure, minimizing the need for heavy pre-load transformations, or when data refinement and aggregation can occur at the target end within lakehouse. Zero-ETL replicates data with minimal delay, and the transformation logic is performed on the target end closer to where the insights are generated by shifting it to a more efficient, post-load phase. 

When to use it:

    • You need to reduce operational complexity and gain flexible control over data replication for both near real-time and batch use cases.
    • You need limited customization. While zero-ETL implies minimal work, some light transformations might still be required on the replicated data.
    • You need to minimize the need for specialized ETL expertise.
    • You need to maintain data freshness without processing delays and reduce risk of data inconsistencies. Zero-ETL facilitates faster time-to-insight.

zero-etl architecture

  • Data federation (no-movement approach) is a method that enables querying and combining data from multiple disparate sources without physically moving or copying it into a single centralized location. This query-in-place approach allows the query engine to connect directly to the external source systems, delegate and execute queries, and combine results on the fly for presentation to the user. The effectiveness of this architecture pattern depends on three key factors: network latency between systems, source system performance capabilities, and the query engine’s ability to push down predicates to optimize query execution. This no-movement approach can significantly reduce data duplication and storage costs while providing real-time access to source data.

When to use it:

    • You need to query the source system directly to use operational analytics.
    • You don’t want to duplicate data to save on storage space and associated costs within your Lakehouse.
    • You’re willing to trade some query performance and governance for immediate data availability and one-time analysis of live data.
    • You don’t need to frequently query the data.

Understanding the storage layer of your lakehouse on AWS

Now that you’ve seen different ways to get data into a lakehouse, the next question is where to store the data. As shown in the following figure, you can architect a modern open lakehouse on AWS by storing the data in a data lake (Amazon S3 or Amazon S3 Tables) or data warehouse (Redshift Managed Storage), so you can optimize for both flexibility and performance based on your specific workload requirements.

A modern lakehouse isn’t a single storage technology but a strategic combination of them. The decision of where and how to store your data impacts everything from the speed of your dashboards to the efficiency of your ML models. You must consider not only the initial cost of storage but also the long-term costs of data retrieval, the latency required by your users, and the governance necessary to maintain a single source of truth. In this section, we delve into architectural patterns for the data lake and the data warehouse and provide a clear framework for when to use each storage pattern. While they have historically been seen as competing architectures, the modern and open lakehouse approach uses both to create a single, powerful data platform.

General purpose S3

A general purpose S3 bucket in Amazon S3 is the standard, foundational bucket type used for storing objects. It provides flexibility so that you can store your data in its native format without a rigid upfront schema. Because of the ability of an S3 bucket to decouple storage from compute, you can store the data in a highly scalable location, while a variety of query engines can access and process it independently. This means that you can choose the right tool for the job without having to move or duplicate the data. You can store petabytes of data without ever having to provision or manage storage capacity, and its tiered storage classes provide significant cost savings by automatically moving less-frequently accessed data to more affordable storage.

The existing Data Catalog functions as a managed catalog. It’s identified by the AWS account number, which means there is no migration needed for existing Data Catalogs; they’re already available in the lakehouse and become the default catalog for the new data, as shown in the following figure.

A foundational data lake on general purpose S3 is highly efficient for append-only workloads. However, its file-based nature lacks the transactional guarantees of a traditional database. This is where you can use the support of open-source transactional table formats such as Apache Hudi, Delta Lake, and Apache Iceberg. With these table formats, you can implement multi-version concurrency control, allowing multiple readers and writers to operate simultaneously without conflicts. They provide snapshot isolation, so that readers see consistent views of data even during write operations. A typical medallion architecture pattern with Apache Iceberg is depicted in the following figure. When building a lakehouse on AWS with Apache Iceberg, customers can choose between two primary approaches for storing their data on Amazon S3: General purpose S3 buckets with self-managed Iceberg or using the fully managed S3 Tables. Each path has distinct advantages, and the right choice depends on your specific needs for control, performance, and operational overhead. 

General purpose S3 with Self-managed Iceberg

Using general purpose S3 buckets with self-managed Iceberg is a traditional approach where you store both data and Iceberg metadata files in standard S3 buckets. With this option, you maintain full control but are responsible for managing the complete Iceberg table lifecycle, including essential maintenance tasks such as compaction and garbage collection.

When to use it:

  • Maximum control: This approach provides complete control over the entire data life cycle. You can fine-tune every aspect of table maintenance, such as defining your own compaction schedules and strategies, which can be crucial for specific high-performance workloads or to optimize costs.
  • Flexibility and customization: It is ideal for organizations with strong in-house data engineering expertise that need to integrate with a wider range of open-source tools and custom scripts. You can use Amazon EMR or Apache Spark to manage the table operations. 
  • Lower upfront costs: You pay only for Amazon S3 storage, API requests, and the compute resources you use for maintenance. This can be more cost-effective for smaller or less-frequent workloads where continuous, automated optimization isn’t necessary.

Note: The query performance depends entirely on your optimization strategy. Without continuous, scheduled jobs for compaction, performance can degrade over time as data gets fragmented. You must monitor these jobs to ensure efficient querying.

S3 Tables

S3 Tables provides S3 storage that’s optimized for analytic workloads and provides Apache Iceberg compatibility to store tabular data at scale. You can integrate S3 table buckets and tables with Data Catalog and register the catalog as a Lake Formation data location from the Lake Formation console or using service APIs, as shown in the following figure. This catalog will be registered and mounted as a federated lakehouse catalog.

When to use it:

  • Simplified operations: S3 Tables automatically handles table maintenance tasks such as compaction, snapshot management and orphan file cleanup in the background. This automation eliminates the need to build and manage custom maintenance jobs, significantly reducing your operational overhead.
  • Automated optimization: S3 Tables provides built-in automatic optimizations that improve query performance. These optimizations include background processes such as file compaction to address the small files problem and data layout optimizations specific to tabular data. However, this automation trades flexibility for convenience. Because you can’t control the timing or method of compaction operations, workloads with specific performance requirements might experience varying query performance. 
  • Focus on data usage: S3 Tables reduces the engineering overhead and shifts the focus to data consumption, data governance and value creation. 
  • Simplified entry to open table formats: It’s suitable for teams who are new to the concept of Apache Iceberg but want to use transactional capabilities on data lake. 
  • No external catalog: Suitable for smaller teams who don’t want to manage an external catalog.

Redshift managed storage

While the data lake serves as the central source of truth for all your data, it’s not the most suitable data store for every job. For the most demanding business intelligence and reporting workloads, the data lake’s open and flexible nature can introduce performance unpredictability. To help ensure the desired performance, consider transitioning a curated subset of your data from the data lake to a data warehouse for the following reasons:

  • High concurrency BI and reporting: When hundreds of business users are concurrently running complex queries on live dashboards, a data warehouse is specifically optimized to handle these workloads with predictable, sub-second query latency.
  • Predictable performance SLAs:– For critical business processes that require data to be delivered at a guaranteed speed, such as financial reporting or end-of-day sales analysis, a data warehouse provides consistent performance. 
  • Complex SQL workloads: While data lakes are powerful, they can struggle with highly complex queries involving numerous joins and massive aggregations. A data warehouse is purpose-built to run these relational workloads efficiently.

The lakehouse architecture on AWS supports Redshift Managed Storage (RMS), a storage option provided by Amazon Redshift, a fully managed, petabyte-scale data warehouse service in the cloud. RMS storage supports the automatic table optimization offered in Amazon Redshift such as built-in query optimizations for data warehousing workloads, automated materialized views, and AI-driven optimizations and scaling for frequently running workloads.

Federated RMS catalog: Onboard existing Amazon Redshift data warehouses to lakehouse

Implementing a federated catalog with existing Amazon Redshift data warehouses creates a metadata-only integration that requires no data movement. This approach lets you extend your established Amazon Redshift investments into a modern open lakehouse framework while maintaining compatibility with existing workflows. Amazon Redshift uses a hierarchical data organization structure: 

  • Cluster level: Starts with a namespace 
  • Database level: Contains multiple databases 
  • Schema level: Organizes tables within databases

When you register your existing Amazon Redshift provisioned or serverless namespaces as a federated catalog in Data Catalog, this hierarchy maps directly into the lakehouse metadata layer. The lakehouse implementation on AWS supports multiple catalogs using a dynamic hierarchy to organize and map the underlying storage metadata.

After you register a namespace, the federated catalog automatically mounts across all Amazon Redshift data warehouses in your AWS Region and account. During this process, Amazon Redshift internally creates external databases that correspond to data shares. This mechanism remains completely abstracted from end users. By using federated catalogs, you can create and use immediate visibility and accessibility across your data ecosystem. Permissions on the federated catalogs can be managed by Lake Formation for both same account and cross account access. 

The real capability of federated catalogs emerges when accessing Amazon Redshift-managed storage from external AWS engines such as Amazon Athena, Amazon EMR, or open source Spark. Because Amazon Redshift uses proprietary block-based storage that only Amazon Redshift engines can read natively, AWS automatically provisions a service-managed Amazon Redshift Serverless instance in the background. This service-managed instance acts as a translation layer between external engines and Amazon Redshift managed storage. AWS establishes automatic data shares between your registered federated catalog and the service-managed Amazon Redshift Serverless instance to enable secure, efficient data access. AWS also creates a service-managed Amazon S3 bucket in the background for data transfer.

 When an external engine such as Athena submits queries against Amazon Redshift federated catalog, Lake Formation handles the credential vending by providing the temporary credentials to the requesting service. The query executes through the service-managed Amazon Redshift Serverless, which accesses data through automatically established data shares, processes results, offloads them to a service-managed Amazon S3 staging area, and then returns results to the original requesting engine.

To track the compute cost of the federated catalog of existing Amazon Redshift warehouse, use the following tag.

aws:redshift-serverless:LakehouseManagedWorkgroup value: "True"

To activate the AWS generated cost allocation tags for billing insight, follow the activation instructions. You can also view the computational cost of the resources in AWS Billing.

When to use it:

  • Existing Amazon Redshift investments: Federated catalogs are designed for organizations with existing Amazon Redshift deployments who want to use their data across multiple services without migration.
  • Cross-service data sharing:– Implement so teams can share existing data in an Amazon Redshift data warehouse across different warehouses and centralize their permissions.
  • Enterprise integration requirements: This approach is suitable for organizations that need to integrate with established data governance. It also maintains compatibility with current workflows while adding lakehouse capabilities.
  • Infrastructure control and pricing:– You can retain full control over compute capacity for their existing warehouses for predictable workloads. You can optimize compute capacity, choose between on-demand and reserved capacity pricing, and fine-tune performance parameters. This provides cost predictability and performance control for consistent workloads.

When implementing lakehouse architecture with multiple catalog types, selecting the appropriate query engine is crucial for both performance and cost optimization. This post focuses on the storage foundation of lakehouse, however for critical workloads involving extensive Amazon Redshift data operations, consider executing queries within Amazon Redshift or using Spark when possible. Complex joins spanning multiple Amazon Redshift tables through external engines might result in higher compute costs if the engines don’t support full predicate push-down. 

Other use-cases

Build a multi-warehouse architecture

Amazon Redshift supports data sharing, which you can use to share live data between source and target Amazon Redshift clusters. By using data sharing, you can share live data without creating copies or moving data, enabling uses cases such as workload isolation (hub and spoke architecture) and cross group collaboration (data mesh architecture). Without a lakehouse architecture, you must create an explicit data share between source and target Amazon Redshift clusters. While managing these data shares in small deployments is relatively straightforward, it becomes complex in data mesh architectures.

The lakehouse architecture addresses this challenge so customers can publish their existing Amazon Redshift warehouses as federated catalogs. These federated catalogs are automatically mounted and made available as external databases in other consumer Amazon Redshift warehouses within the same account and Region. By using this approach, you can maintain a single copy of data and use multiple data warehouses to query it, eliminating the need to create and manage multiple data shares and scale with workload isolation. The permission management becomes centralized through Lake Formation, streamlining governance across the entire multi-warehouse environment.

Near real-time analytics on petabytes of transactional data with no pipeline management:

Zero-ETL integrations seamlessly replicate transactional data from OLTP data sources to Amazon Redshift, general purpose S3 (with self-managed Iceberg) or S3 Tables. This approach eliminates the need to maintain complex ETL pipelines, reducing the number of moving parts in your data architecture and potential points of failure. Business users can analyze fresh operational data immediately rather than working with stale data from the last ETL run. 

See Aurora zero-ETL integrations for a list of OLTP data sources that can be replicated to an existing Amazon Redshift warehouse.

See Zero-ETL integrations for information about other supported data sources that can be replicated to an existing Amazon Redshift warehouse, general purpose S3 with self-managed Iceberg, and S3 Tables.

Conclusion

A lakehouse architecture isn’t about choosing between a data lake and a data warehouse. Instead, it’s an approach to interoperability where both frameworks coexist and serve different purposes within a unified data architecture. By understanding fundamental storage patterns, implementing effective catalog strategies, and using native storage capabilities, you can build scalable, high-performance data architectures that support both your current analytics needs and future innovation. For more information, see The lakehouse architecture of Amazon SageMaker

 


About the authors

Lakshmi Nair

Lakshmi Nair

Lakshmi is a Senior Analytics Specialist Solutions Architect at AWS. She specializes in designing advanced analytics systems across industries. She focuses on crafting cloud-based data platforms, enabling real-time streaming, big data processing, and robust data governance.

Saman Irfan

Saman Irfan

Saman is a Senior Specialist Solutions Architect at Amazon Web Services, based in Berlin, Germany. Saman is passionate about helping organizations modernize their data architectures and unlock the full potential of their data to drive innovation and business transformation. Outside of work, she enjoys spending time with her family, watching TV series, and staying updated with the latest advancements in technology.

Access your existing data and resources through Amazon SageMaker Unified Studio, Part 1: AWS Glue Data Catalog and Amazon Redshift

Post Syndicated from Lakshmi Nair original https://aws.amazon.com/blogs/big-data/access-your-existing-data-and-resources-through-amazon-sagemaker-unified-studio-part-1-aws-glue-data-catalog-and-amazon-redshift/

Amazon SageMaker Unified Studio provides a unified environment for data, analytics, machine learning (ML), and AI workloads. Part of the next generation of Amazon SageMaker, SageMaker Unified Studio allows you to discover your data and put it to work using familiar AWS tools to complete end-to-end development workflows, including data analysis, data processing, model training, generative AI app development, and more, in a single governed environment. You can create or join projects to collaborate with your teams, share AI and analytics artifacts securely, and discover and use your data stored in different data sources through Amazon SageMaker Lakehouse.

This series of posts demonstrates how you can onboard and access existing AWS data sources using SageMaker Unified Studio. This post focuses on onboarding existing AWS Glue Data Catalog tables and database tables available in Amazon Redshift. Part 2 discusses using Amazon Simple Storage Service (Amazon S3), Amazon Relational Database Service (Amazon RDS), Amazon DynamoDB, and Amazon EMR.

Access your existing data and resources through Amazon SageMaker Unified Studio

This series primarily focuses on the UI experience. If you prefer script-based automation, refer to Bringing existing resources into Amazon SageMaker Unified Studio.

Access management with SageMaker Unified Studio

The SageMaker Unified Studio authorization model is a hierarchical access control list (ACL) based on the resource type such as a domain or a project. For example, at the domain level, a user might have a domain owner designation and at the project level, the user can be an owner or contributor. You can configure these profiles at AWS Identity and Access Management (IAM) user, single sign-on (SSO) user, and SSO group level.

Each project has a project role. When the user interacts with resources within SageMaker Unified Studio, it generates IAM session credentials based on the user’s effective profile in the specific project context, and then users can use tools such as Amazon Athena or Amazon Redshift to query the relevant data. The project owner can add or remove project members for their project, create publishing agreements with a domain, and publish assets to a domain.

SageMaker Unified Studio can be accessed by IAM users or SSO authenticated users, and IAM roles can interact with the SageMaker Unified Studio through its APIs.

Solution overview

AWS Lake Formation enables you to define fine-grained access control on the Data Catalog, where you can configure access at database, table, row, or column level or define permissions with tags. When setting up Lake Formation, you can configure it with hybrid access mode, where you get flexibility to selectively enable Lake Formation permissions for specific databases and tables, and continue using IAM permissions for others. SageMaker Unified Studio supports Lake Formation hybrid mode.

When you create a project in SageMaker Unified Studio, an AWS Glue database is added by default as part of the project. Assets published into that database don’t need any additional permissions, but if you want to publish or subscribe assets from an existing AWS Glue database, then you need to provide explicit permissions to SageMaker Unified Studio to be able to access the database and tables. For more details, see Configure Lake Formation permissions for Amazon SageMaker Unified Studio.

Let’s understand how we can access existing datasets through SageMaker Unified Studio.

Prerequisites

To run the instruction, you must complete the following prerequisites:

  • An AWS account
  • A SageMaker Unified Studio domain
  • A SageMaker Unified Studio project with All capabilities project profile

In the SageMaker Unified Studio, select the project and navigate to the Project overview page. Copy the Project role ARN as highlighted in the screenshot. This project role will be used further in the post to provide permissions on existing datasets and resources.

Use existing AWS Glue tables

This section has following prerequisites:

One extra prerequisite step is to revoke IAMAllowedPrincipals group permission on both database and table to enforce Lake Formation permission for access. For detailed instruction see Revoking permission using the Lake Formation console.

To access existing Data Catalog tables in SageMaker Unified Studio, complete the following steps:

  1. On the Lake Formation console using the data lake administrator, choose Data lake locations in the navigation pane and choose Register location.
  2. Enter the S3 prefix for Amazon S3 path.
  3. For IAM role, choose your Lake Formation data access IAM role, which is not a service linked role.
  4. Select Lake Formation for Permission mode and choose Register location.

  1. On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
  2. Select the existing Data Catalog database.
  3. From the Actions menu, choose Grant to grant permissions to the project role.

  1. For IAM users and roles, choose the project role.
  2. Select Named Data Catalog resources, and for Catalogs, choose the default catalog.
  3. For Databases, choose your existing Data Catalog database.

  1. For Database permissions, select Describe and choose Grant.

The next step is to grant the permission on the tables to the project role.

  1. On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
  2. Select the existing Data Catalog database.
  3. From the Actions menu, choose Grant to grant permissions to the project role.
  4. For IAM users and roles, choose the project role.
  5. Select Named Data Catalog resources, and for Catalogs, choose the default catalog.
  6. For Databases, choose your Data Catalog database.
  7. For Tables, select the tables that you need to provide permission to the project role.

  1. For Table permissions, select Select and Describe.
  2. For Grantable permissions, select Select and Describe.
  3. Choose Grant.

You should revoke any existing permissions of IAMAllowedPrincipals on the databases and tables within Lake Formation.

Now let’s verify that we can access the existing AWS Glue table from the SageMaker Unified Studio Query Editor.

  1. In SageMaker Unified Studio, navigate to your project.
  2. On the project page, under Lakehouse, choose Data.
  3. Next to the Data Catalog table, choose the options menu (three dots), and choose Query with Athena.

SageMaker Unified Studio provides a unified JupyterLab experience across different languages, including SQL, PySpark, and Scala Spark. It also supports unified access across different compute runtimes such as Amazon Redshift and Athena for SQL, Amazon EMR Serverless, Amazon EMR on EC2, and AWS Glue for Spark. To access the data through the unified JupyterLab experience, complete the following steps:

  1. On the SageMaker Unified Studio project page, on the top menu, choose Build, and under IDE & APPLICATIONS, choose
  2. Wait for the space to be ready.
  3. Choose the plus sign and for Notebook, choose Python 3.
  4. In the notebook, switch the connection type to PySpark, choose spark.fineGrained, and query the existing Data Catalog table:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df_sql = spark.sql("""
select * from retaildb.salesorders
""" )

df_sql.show()

Use existing Redshift clusters

This section has following prerequisites:

  • A Redshift cluster

To bring in existing Redshift clusters, follow these steps:

  1. To use your provisioned Redshift cluster or a Redshift Serverless workgroup, add either of the following tags (key/value) to the resource:
    1. Add AmazonDataZoneProject: <projectID> if you want to allow only a specific SageMaker Unified Studio project to access the Amazon Redshift resource. Replace <projectID> with the ID of the project created in SageMaker Unified Studio.
    2. Add for-use-with-all-datazone-projects: true if you want to allow all SageMaker Unified Studio projects to access the Amazon Redshift resource.

  1. To add the compute connection in SageMaker Unified Studio, you can authenticate the cluster using either the user name and password of the database, IAM credentials, or AWS Secrets Manager. To provide the authentication using Secrets Manager, add either of the following tags. This will enable the existing secret to appear on the dropdown menu, while defining the connection in SageMaker Unified Studio.
    1. AmazonDataZoneProject: <projectID>
    2. for-use-with-all-datazone-projects: true

In the following screenshot, you can see the tag configuration section within Secrets Manager settings for Redshift Serverless compute. To understand how to create a secret for a database in a Redshift cluster using Secrets Manager, refer to Managing Amazon Redshift admin passwords using AWS Secrets Manager.

  1. After the tags are applied, log in to SageMaker Unified Studio and choose the project.
  2. Go to the Compute section of your project, and on the Data warehouse tab, choose Add compute.
  3. Select Connect to existing compute resources.
  4. Choose the compute type: Amazon Redshift Provisioned cluster or Amazon Redshift Serverless.
  5. Configure the parameters by selecting the existing compute and authentication and choose Add compute.

The detailed walkthrough process is illustrated in the following screenshot.

Use Redshift tables with existing compute

This section has following prerequisites:

  • A Redshift table

In this section, we illustrate steps to create a federated connection for an existing Amazon Redshift data source. You can register an existing Redshift provisioned cluster as well as Redshift Serverless with the Data Catalog using SageMaker Unified Studio. This creates a federated multi-level catalog and provides the ability to centrally manage permissions to the data with fine-grained access control using Lake Formation. By mounting Amazon Redshift data in the Data Catalog, you can query it using your preferred tools such as Athena or AWS Glue extract, transform, and load (ETL) without having to copy or move the data.

Create an Amazon Redshift managed VPC endpoint for Amazon Redshift

Amazon Redshift managed virtual private cloud (VPC) endpoints use AWS PrivateLink to allow one VPC to privately access resources in another VPC as if they were local to the same VPC. With an Amazon Redshift managed VPC endpoint, you can connect to your private Redshift cluster with the RA3 instance type or Redshift Serverless within your VPC.

In this section, we explain how to create an Amazon Redshift managed VPC endpoint for both Redshift Serverless and an Amazon Redshift provisioned cluster in a single account. The managed VPC endpoint needs to be created only if your Redshift provisioned or Redshift Serverless cluster is in a different VPC than the SageMaker Unified Studio domain VPC.

If the SageMaker Unified Studio domain account is in a different account, allow the additional AWS accounts to create cluster endpoints. For steps to authorize your Amazon Redshift provisioned or Redshift Serverless cluster to deploy endpoints in additional accounts and grant access to the cross-account VPC, refer to Granting access to a VPC.

Redshift Serverless

For Redshift Serverless, follow these instructions.

The common practice is to allow port 5439 (Amazon Redshift connectivity port) to the security group or CIDR range in which your consumption workloads run.

  1. In the security group associated with the Redshift cluster, add an inbound rule with Type as Redshift, Protocol as TCP, Port range as 5439 (Amazon Redshift connectivity port), and Source as the CIDR range in which your consumption workloads run.

  1. On the Amazon Redshift console of the workgroup, go to Redshift-managed VPC endpoints.
  2. Choose Create endpoint.
  3. In the Endpoint settings section, choose the VPC, associated private subnet, and security group created for the SageMaker Unified Studio domain account to deploy the endpoint against.

The following screenshot shows the Amazon Redshift managed VPC endpoint created for Redshift Serverless.

Redshift provisioned

For Amazon Redshift provisioned, follow these instructions:

  1. To implement an Amazon Redshift managed VPC endpoint for a provisioned cluster, you need to enable cluster relocation and create subnet groups. In the cluster subnet group, choose the VPC and subnets of the SageMaker Unified Studio domain account.
  2. On the Amazon Redshift console, choose Configurations in the navigation pane.
  3. Provide the endpoint details, then choose Create endpoint.

Create a federated connection for Amazon Redshift

Complete the following steps to create a federated catalog in the Data Catalog to query the data using various preferred analytics tools such as Athena, visual ETL in SageMaker Unified Studio, Amazon EMR, and more:

  1. On the SageMaker Unified Studio console, choose your project.
  2. Choose Data in the navigation pane.
  3. In the data explorer, choose the plus sign to add a data source.
  4. Under Add a data source, choose Add connection, then choose Amazon Redshift.
  5. Enter the following parameters in the connection details, and choose Add data.
    1. Name: Enter the connection name.
    2. Host: Enter the Amazon Redshift managed VPC endpoint.
    3. Port: Enter the port number (Amazon Redshift uses 5439 as the default port).
    4. Database: Enter the database name.
    5. Authentication: Choose either the database user name and password credentials or Secrets Manager.

After the connection is established, you will see that the federated catalog is created, as shown in the following screenshot. This catalog uses the AWS Glue connection to connect to Amazon Redshift. The databases, tables, and views are automatically cataloged in the Data Catalog and registered with Lake Formation.

With Athena, data analysts can run federated SQL queries to scan data from multiple data sources in-place without creating complex data pipelines or data replication.

Use existing Data Catalog tables and Amazon Redshift assets in the SageMaker Unified Studio business data catalog

You can use the SageMaker Unified Studio business data catalog to catalog the data across your organization with business context. To use Amazon SageMaker Catalog, you must bring your existing data assets into the inventory of your project. Follow the instructions in this section to bring your existing Data Catalogs and Amazon Redshift assets into the project inventory.

Add an existing Data Catalog to the project inventory

To enrich the asset with business context and share your assets outside your own project, you must first bring the metadata to SageMaker Catalog. To import the metadata of the assets into the project’s inventory, you need to create a data source in the project catalog.

  1. In SageMaker Unified Studio, navigate to the Project catalog page within the project.
  2. Choose Data sources.
  3. Choose CREATE DATA SOURCE.
  4. For Name, provide the name of the data source.
  5. Choose AWS Glue (Lakehouse) for Data source type.
  6. For Data selection, choose the Database name and choose Next.
  7. Keep the rest as default and choose CREATE.
  8. Choose RUN to import the metadata.

After the data source successfully completes its run, metadata of all the data assets gets added to the project’s inventory.

Add existing Redshift tables and views to the project inventory

Create a data source to bring in the existing Redshift tables and views to add to the project’s inventory:

  1. In SageMaker Unified Studio, navigate to the Project catalog within the project.
  2. Choose Data sources.
  3. Choose CREATE DATA SOURCE.
  4. For Name, provide the name of the data source.
  5. Choose Amazon Redshift for Data source type.
  6. For Connection, choose the name of the Redshift connection.
  7. For Database name, choose dev and for Schema, enter public.
  8. Keep the rest as default and choose CREATE.
  9. Choose RUN to import the metadata.

After the data source successfully completes its run, metadata of all the data assets gets added to the project’s inventory.

Conclusion

This post explained how you can access existing data and resources available in the Data Catalog and Amazon Redshift using SageMaker Unified Studio. SageMaker Unified Studio provides an integrated environment for analytics and AI. Being able to access existing datasets available in your AWS account helps reduce operational overhead because users of your organization can access a common interface, collaborate, and share datasets. It also brings in efficiency for administrators because they can manage permissions for domains and projects in a common place.

In the next post, we will demonstrate how you can onboard and access other existing data sources such as Amazon S3, Amazon RDS, DynamoDB, and Amazon EMR.


About the Authors

Lakshmi Nair is a Senior Analytics Specialist Solutions Architect at AWS. She specializes in designing advanced analytics systems across industries. She focuses on crafting cloud-based data platforms, enabling real-time streaming, big data processing, and robust data governance. She can be reached via LinkedIn.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is also the author of the book Serverless ETL and Analytics with AWS Glue. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Sakti Mishra is a Principal Data and AI Solutions Architect at AWS, where he helps customers modernize their data architecture and define end-to end-data strategies, including data security, accessibility, governance, and more. He is also the author of Simplify Big Data Analytics with Amazon EMR and AWS Certified Data Engineer Study Guide. Outside of work, Sakti enjoys learning new technologies, watching movies, and visiting places with family. He can be reached via LinkedIn.

Daiyan Alamgir is a Principal Frontend Engineer on the Amazon SageMaker Unified Studio team based in New York.

Vipin Mohan is a Principal Product Manager at AWS, leading the launch of generative AI capabilities in Amazon SageMaker Unified Studio. He is committed to shaping impactful products by working backward from customer insights, championing user-focused solutions, and delivering scalable results.

Chanu Damarla is a Principal Product Manager on the Amazon SageMaker Unified Studio team. He works with customers around the globe to translate business and technical requirements into products that delight customers and enable them to be more productive with their data, analytics, and AI.

Access your existing data and resources through Amazon SageMaker Unified Studio, Part 2: Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon EMR

Post Syndicated from Lakshmi Nair original https://aws.amazon.com/blogs/big-data/access-your-existing-data-and-resources-through-amazon-sagemaker-unified-studio-part-2-amazon-s3-amazon-rds-amazon-dynamodb-and-amazon-emr/

Organizations often face the challenge of managing and analyzing data spread across multiple storage systems and databases while providing secure, efficient access for their data science teams. Amazon SageMaker Unified Studio addresses this challenge by providing a unified analytics and AI development environment where data scientists can access, analyze, and use data from various sources within a single, governed workspace, allowing teams to use their existing data infrastructure while taking advantage of advanced analytics and AI capabilities. SageMaker Unified Studio is part of the next generation of Amazon SageMaker, the center for all your data, analytics, and AI.

In Part 1 of this series, we explored how to access AWS Glue Data Catalog tables and Amazon Redshift resources through SageMaker Unified Studio. Continuing our journey, this post discusses integrating additional vital data sources such as Amazon Simple Storage Service (Amazon S3) buckets, Amazon Relational Database Service (Amazon RDS), Amazon DynamoDB, and Amazon EMR clusters. We demonstrate how to configure the necessary permissions, establish connections, and effectively use these resources within SageMaker Unified Studio. Whether you’re working with object storage, relational databases, NoSQL databases, or big data processing, this post can help you seamlessly incorporate your existing data infrastructure into your SageMaker Unified Studio workflows.

Access your existing data and resources through Amazon SageMaker Unified Studio

Solution overview

SageMaker Unified Studio seamlessly works with your existing data and resources through relevant permissions and network settings.

Let’s understand how we can access existing datasets across S3, RDS, DynamoDB, and EMR through SageMaker Unified Studio.

Prerequisites

To run the instruction, you must complete the following prerequisites:

  • An AWS account
  • A SageMaker Unified Studio domain
  • A SageMaker Unified Studio project with All capabilities project profile

In SageMaker Unified Studio, select the project and navigate to the Project overview page. Copy the Project role ARN as highlighted in the screenshot. This project role will be used further in the post to provide permissions on existing datasets and resources.

Use existing S3 buckets

This section has following prerequisites:

  • An S3 bucket

To use an existing S3 bucket in SageMaker Unified Studio, configure an S3 bucket policy that allows the appropriate actions for the project AWS Identity and Access Management (IAM) role.

The following is an example bucket policy. Replace <aws_accountid> with the AWS account ID where the domain resides, <s3_bucket> with the name of the S3 bucket that you intend to query in SageMaker Unified Studio, and <datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy> with the project role in SageMaker Unified Studio:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            },
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::<s3_bucket>",
            "Condition": {
                "ArnEquals": {
                    "aws:PrincipalArn": "arn:aws:iam::<aws_accountid>:role/<datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy>"
                }
            }
        },
        {
            "Sid": "Statement2",
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            },
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::<s3_bucket>/*",
            "Condition": {
                "ArnEquals": {
                    "aws:PrincipalArn": "arn:aws:iam::<aws_accountid>:role/<datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy>"
                }
            }
        }
    ]
}

After you configure the policy, log in to SageMaker Unified Studio and open the project.

Query the data using the JupyterLab IDE to perform analysis, as shown in the following screenshot.

Although the project role has been given appropriate permissions to access the S3 bucket in SageMaker Unified Studio, you will not able to list the contents of the bucket and show the S3 path in the data explorer section within SageMaker Unified Studio.

Use existing RDS DB instances

This section has following prerequisites:

  • A VPC and a private subnet
  • A RDS DB instance on the private subnet in the VPC

SageMaker Unified Studio uses the virtual private cloud (VPC) and subnets that are specified in the domain creation. If you have the data source like an RDS DB instance in a separate VPC, you can configure network reachability between the domain VPC and the data source VPC using VPC peering, AWS Transit Gateway, or a resource VPC endpoint, or alternatively you can create a new domain using the data source VPC.

Add a PostgreSQL connection

Complete the following steps to configure that reachability using VPC peering with Amazon Virtual Private Cloud (Amazon VPC):

  1. On the Amazon VPC console, choose Your VPCs, and make a note of the VPC ID of your VPC named SageMakerUnifiedStudioVPC.
  2. Choose Peering connections, and choose Create peering connection.
  3. Under Select another VPC to peer with, for VPC ID (Requester), choose the VPC ID noted earlier.
  4. Under Select another VPC to peer with, for VPC ID (Accepter), choose the VPC where the target RDS DB instance is located.
  5. Review your settings and choose Create peering connection.
  6. On the Peering connections page, select your peering connection.
  7. Under Actions, choose Accept request.
  8. Review the settings and choose Accept request.

Now you have configured the VPC peering connection. The next step is to configure the network route from the SageMaker Unified Studio VPC to the Amazon RDS VPC.

  1. On the Amazon VPC console, choose Route tables in the navigation pane.
  2. Choose the route table that is used in the private subnets of SageMakerUnifiedStudioVPC.
  3. Choose Edit routes.
  4. Choose Add route.
  5. For Destination, choose the VPC CIDR of the VPC where the RDS DB instance is located.
  6. For Target, choose Peering Connection, and choose the peering connection you created earlier.
  7. Choose Save changes.

Now you have configured the route table from the SageMaker Unified Studio VPC to the Amazon RDS VPC. The next step is to configure the opposite route.

  1. On the Amazon VPC console, choose Route tables in the navigation pane.
  2. Choose the route table that is used in the private subnets of the RDS DB instance.
  3. Choose Edit routes.
  4. Choose Add route.
  5. For Destination, choose the VPC CIDR of SageMakerUnifiedStudioVPC.
  6. For Target, choose Peering Connection, and choose the peering connection you created earlier.
  7. Choose Save changes.

Now you configure your RDS security group to accept traffic coming from SageMaker Unified Studio.

  1. On the Amazon RDS console, navigate to your RDS DB instance, and choose VPC security groups.
  2. Select your security group, and choose Inbound rules.
  3. Choose Edit inbound rules.
  4. Choose Add rule.
  5. For Type, choose Custom TPC.
  6. For Port range, enter your RDS port number.
  7. For Source, enter the VPC CIDR of SageMakerUnifiedStudioVPC.

Now you have network reachability required to use the existing RDS DB instance. The next step is to create a connection pointing to that RDS DB instance in SageMaker Unified Studio.

  1. Sign in to SageMaker Unified Studio and open your project.
  2. In your project, in the navigation pane, choose Data.
  3. Choose the plus sign, and for Add data source, choose Add connection.
  4. Select PostgreSQL.
  5. For Data source name, enter postgresql_source.
  6. For Host, enter the host name of your Aurora PostgreSQL database cluster.
  7. For Port, enter the port number of your Aurora PostgreSQL database cluster (by default, it’s 5432).
  8. For Database, enter your database name.
  9. For Authentication, select Username and password, and enter your user name and password.
  10. Choose Add data source.

You will need to wait for several minutes to complete this step.

Use a visual ETL flow to ingest data to Amazon RDS

In a visual extract, transform, and load (ETL) flow, you can use PostgreSQL as source and target. You can create a PostgreSQL target, and for Name, choose postgresql_source to ingest data into Amazon RDS.

  1. Choose the plus sign, and under Data sources, choose Amazon S3.
  2. Choose Amazon S3 for the source node, and enter following values:
    1. S3 URI: s3://aws-blogs-artifacts-public/artifacts/BDB-4798/data/venue.csv
    2. Format: CSV
    3. Sep: ,
    4. Multiline: Enabled
    5. Header: Disabled
    6. Leave the rest as default.
  3. Wait for the data preview to be available.
  4. Choose the plus sign to the right of Amazon S3 Under Transforms, choose Rename Columns.
  5. Choose the Rename Columns node, and choose Add new rename pair.
  6. For Current name and New name, enter following pairs:
    1. _c0: venueid
    2. _c1: venuename
    3. _c2: venuecity
    4. _c3: venuestate
    5. _c4: venueseats
  7. Choose the plus sign to the right of Rename Columns
  8. Under Targets, choose PostgreSQL, and enter following values:
    1. Name: postgresql_source
    2. Schema: public
    3. Table: venue

  1. Choose Save to project. You can optionally change the name and add a description.
  2. Choose Run. Optionally, you can change the compute parameters.

Wait for completion. Then the data has been successfully ingested.

Run an Athena query to explore the table on Amazon RDS

After you create a table on Amazon RDS, you can explore the table through a data explorer in SageMaker Unified Studio:

  1. On SageMaker Unified Studio, choose Data.
  2. Under Lakehouse, choose postgresql_source, public, and venue.
  3. On the options menu (three dots), choose Query with Athena.

You get records from the RDS table venue.

Use existing DynamoDB tables

This section has following prerequisites:

  • A DynamoDB table

To access existing DynamoDB tables, configure a resource-based policy that allows the appropriate actions for the project role:

  1. On the DynamoDB console, choose Tables in the navigation pane.
  2. Select your table.
  3. Choose the Permissions tab and choose Create table policy.

The following example policy allows connecting to DynamoDB tables as a federated source. Replace <aws_region> with your AWS Region, <aws_account_id> with the AWS account ID where DynamoDB is deployed, <dynamodb_table> with the DynamoDB table that you intend to query from SageMaker Unified Studio, and <datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy> with the project role in SageMaker Unified Studio:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": [
                "dynamodb:Query",
                "dynamodb:Scan",
                "dynamodb:DescribeTable",
                "dynamodb:PartiQLSelect",
                "dynamodb:BatchWriteItem"
            ],
            "Resource": "arn:aws:dynamodb:<aws_region>:<aws_accountid>:table/<dynamodb_table>",
            "Condition": {
                "ArnEquals": {
                    "aws:PrincipalArn": "arn:aws:iam::<aws_accountid>:role/<datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy>"
                }
            }
        }
    ]
}

After the policies are incorporated on the DynamoDB table, create an Amazon SageMaker Lakehouse connection within SageMaker Unified Studio:

  1. Choose Data in the navigation pane.
  2. In the data explorer, choose the plus sign to add a data source.
  3. Select Add connection and choose Next.
  4. Select Amazon DynamoDB and choose Next.
  5. For Name, enter a name, then choose Add data.

The following screenshot shows the detailed steps to create a federated DynamoDB connection in SageMaker Unified Studio. After the connection is established, you can query the data from the DynamoDB table with using the Athena query editor.

You can also use existing DynamoDB tables as part of the ETL process. In the following screenshot, we demonstrate this using a visual ETL flow.

Use existing EMR clusters

This section has following prerequisites:

  • An EMR on EC2 cluster

SageMaker Unified Studio enables you to create new compute or add existing compute resources to a project for submitting jobs. You can add existing Amazon EMR on EC2 clusters or add existing Amazon EMR Serverless applications to submit data analytics jobs. To add a new EMR Serverless application, an administrator must enable a blueprint for the project.

To add an existing EMR on EC2 cluster, complete the following steps:

  1. In SageMaker Unified Studio, navigate to the project for which you plan to add compute, then choose Compute in the navigation pane.
  2. Choose the Data processing
  3. To add an existing EMR on EC2 cluster, choose Add compute.
  4. Choose Connect to existing compute resources and choose Next.

  1. To specify the compute resources to choose from, choose EMR on EC2 cluster.

  1. The Add Compute dialog box requires you to have the correct permissions to access the EMR on EC2 cluster. You can choose Copy project information to copy the data; the admin will need to grant the data worker access. Send the information to your admin.
  2. After the account administrator has granted the data worker access, you can specify the Amazon Resource Names (ARNs) associated with the cluster. You must fill in the Access role ARN, EMR on EC2 cluster ARN, Instance profile role ARN, and Name
  3. After you configure these settings, choose Add compute.

Your EMR on EC2 instance will be added to your project.

After you have added a cluster to a project, you will be able to see the cluster on the Data processing tab of the Compute page. You can then view the cluster details by choosing the specific cluster.

In addition to adding existing compute resources, you have the option to create new compute resources, which allows you to create both EMR on EC2 cluster and EMR Serverless applications.

Conclusion

SageMaker Unified Studio enables you to integrate with multiple data sources, providing data scientists and analysts with a powerful, unified environment for their AI and analytics workflows. As demonstrated throughout this two-part series, you can seamlessly connect to and use data from the Data Catalog, Amazon Redshift, Amazon S3, Amazon RDS, DynamoDB, and Amazon EMR—while maintaining proper security controls and permissions. This flexibility alleviates the need for complex data movement operations and allows teams to focus on extracting insights from their data rather than managing infrastructure. By following the approaches outlined in these posts, organizations can maximize their existing data investments while taking advantage of the advanced capabilities of SageMaker Unified Studio for their data science and analytics needs.


About the Authors

Lakshmi Nair is a Senior Analytics Specialist Solutions Architect at AWS. She specializes in designing advanced analytics systems across industries. She focuses on crafting cloud-based data platforms, enabling real-time streaming, big data processing, and robust data governance. She can be reached via LinkedIn.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is also the author of the book Serverless ETL and Analytics with AWS Glue. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Sakti Mishra is a Principal Data and AI Solutions Architect at AWS, where he helps customers modernize their data architecture and define end-to end-data strategies, including data security, accessibility, governance, and more. He is also the author of Simplify Big Data Analytics with Amazon EMR and AWS Certified Data Engineer Study Guide. Outside of work, Sakti enjoys learning new technologies, watching movies, and visiting places with family. He can be reached via LinkedIn.

Daiyan Alamgir is a Principal Frontend Engineer on the Amazon SageMaker Unified Studio team based in New York.

Vipin Mohan is a Principal Product Manager at AWS, leading the launch of generative AI capabilities in Amazon SageMaker Unified Studio. He is committed to shaping impactful products by working backward from customer insights, championing user-focused solutions, and delivering scalable results.

Chanu Damarla is a Principal Product Manager on the Amazon SageMaker Unified Studio team. He works with customers around the globe to translate business and technical requirements into products that delight customers and enable them to be more productive with their data, analytics, and AI.

Connect, share, and query where your data sits using Amazon SageMaker Unified Studio

Post Syndicated from Lakshmi Nair original https://aws.amazon.com/blogs/big-data/connect-share-and-query-where-your-data-sits-using-amazon-sagemaker-unified-studio/

The ability for organizations to quickly analyze data across multiple sources is crucial for maintaining a competitive advantage. Imagine a scenario where the retail analytics team is trying to answer a simple question: Among customers who purchased summer jackets last season, which customers are likely to be interested in the new spring collection?

While the question is straightforward, getting the answer requires piecing together data across multiple data sources such as customer profiles stored in Amazon Simple Storage Service (Amazon S3) from customer relationship management (CRM) systems, historical purchase transactions in an Amazon Redshift data warehouse, and current product catalog information in Amazon DynamoDB. Traditionally, answering this question would involve multiple data exports, complex extract, transform, and load (ETL) processes, and careful data synchronization across systems.

In this blog post, we will demonstrate how business units can use Amazon SageMaker Unified Studio to discover, subscribe to, and analyze these distributed data assets. Through this unified query capability, you can create comprehensive insights into customer transaction patterns and purchase behavior for active products without the traditional barriers of data silos or the need to copy data between systems.

SageMaker Unified Studio provides a unified experience for using data, analytics, and AI capabilities. You can use familiar AWS services for model development, generative AI, data processing, and analytics—all within a single, governed environment. To strike a fine balance of democratizing data and AI access while maintaining strict compliance and regulatory standards, Amazon SageMaker Data and AI Governance is built into SageMaker Unified Studio. With Amazon SageMaker Catalog, teams can collaborate through projects, discover, and access approved data and models using semantic search with generative AI-created metadata, or you can use natural language to ask Amazon Q to find your data. Within SageMaker Unified Studio, organizations can implement a single, centralized permission model with fine-grained access controls, facilitating seamless data and AI asset sharing through streamlined publishing and subscription workflows. Teams can also query the data directly from sources such as Amazon S3 and Amazon Redshift, through Amazon SageMaker Lakehouse.

SageMaker Lakehouse streamlines connecting to, cataloging, and managing permissions on data from multiple sources. Built on AWS Glue Data Catalog and AWS Lake Formation, it organizes data through catalogs that can be accessed through an open, Apache Iceberg REST API to help ensure secure access to data with consistent, fine-grained access controls. SageMaker Lakehouse organizes data access through two types of catalogs: federated catalogs and managed catalogs (shown in the following figure). A catalog is a logical container that organizes objects from a data store, such as schemas, tables, views, or materialized views such as from Amazon Redshift. You can also create nested catalogs to mirror the hierarchical structure of your data sources within SageMaker Lakehouse.

  • Federated catalogs: Through SageMaker Unified Studio, you can create connections to external data sources such as Amazon DynamoDB. See Data connections in Amazon SageMaker Lakehouse for all the supported external data sources. These connections are stored in the AWS Glue Data Catalog (Data Catalog) and registered with Lake Formation, allowing you to create a federated catalog for each available data source.
  • Managed catalogs: A managed catalog refers to the data that resides on Amazon S3 or Redshift Managed Storage (RMS).

The existing Data Catalog becomes the Default catalog (identified by the AWS account number) and is readily available in SageMaker Lakehouse.

If the business units don’t have a data warehouse but need the benefits of one—such as a query result cache and query rewrite optimizations—then, they can create an RMS managed catalog in SageMaker Unified Studio. This is a SageMaker Lakehouse managed catalog backed by RMS storage. The table metadata is managed by Data Catalog. When you create an RMS managed catalog, it deploys an Amazon Redshift managed serverless workgroup. Users can write data to managed RMS tables using Iceberg APIs, Amazon Redshift, or Zero-ETL ingestion from supported data sources.

Functional working model

In SageMaker Unified Studio, the infrastructure team will enable the blueprints and configure the project profiles for tools and technologies to the respective business units to build and monitor their pipelines. They will also onboard the teams to SageMaker Unified Studio, enabling them to build the data products in a single integrated, governed environment. To enforce standardization within the organization, the central governance team can also create hierarchical representations of business units through domain units and dictate certain actions that these teams can perform under a domain unit. Global policies such as data dictionaries (business glossaries), data classification tags, and additional information with metadata forms can be created by the governance team to ensure standardization and consistency within the organization.

Individual business units will use these project profiles based on their needs to process the data using the authorized tool of their choice and create data products. Business units can enjoy the full flexibility to process and consume the data without worrying about the maintenance of the underlying infrastructure. Depending on the nature of the workloads, business units can choose a storage solution that best fits their use case. You can use SageMaker Lakehouse to unify the data across different data sources.

To share the data outside the business unit, the teams will publish the metadata of their data to a SageMaker catalog and make it discoverable and accessible to other business units. Amazon SageMaker Catalog serves as a central repository hub to store both technical and business catalog information of the data product. To establish trust between the data producers and data consumers, SageMaker Catalog also integrates the data quality metrics and data lineage events to track and drive transparency in data pipelines. While sharing the data, data producers of these business units can apply fine grained access control permissions at row and column level to these assets during subscription approval workflows. SageMaker Unified Studio automatically grants subscription access to the subscribed data assets after the subscription request is approved by the data producer. As shown in the following figure, the data sharing capability highlights that the data remains at its origin with the data producer, while consumers from other business units can consume and analyze it using their own compute resources. This approach eliminates any data duplication or data movement.

Solution overview

In this post, we explore two scenarios for sharing data between different teams (retail, marketing, and data analysts). The solution in this post gives you the implementation for a single account use case.

Scenario 1

The retail team needs to create a comprehensive view of customer behavior to optimize their spring collection launch. Their data landscape is diverse:

  • Customer profiles stored in Amazon S3 (default Data Catalog)
  • Historical purchase transactions stored in RMS (SageMaker Lakehouse managed RMS catalog)
  • Inventory information of the product in DynamoDB. (federated catalog)

The team needs to share this unified view with their regional data analysts while maintaining strict data governance protocols. Data analysts discover the data and subscribe to the data. We will also walk through the publishing and subscription workflow as part of the data sharing process. To get a unified view of the customer sales transactions for active products, the data analysts will use Amazon Athena.

Here are the high level steps of the solution implementation as shown in the preceding diagram:

  1. In this post, we take an example of two teams who participate in the collaboration. The retail team has created a project retailsales-sql-project and the data analysts team has created a project dataanalyst-sql-project within SageMaker Unified Studio.
  2. The retail team creates and stores their data in various sources:
    1. customer data in Amazon S3 (contains customer data)
    2. inventory data in a DynamoDB table (contains product catalog information)
    3. store_sales_lakehouse in SageMaker Lakehouse managed RMS (contains purchase history)
  3. The retail team publishes the assets to the project catalog to make them discoverable to other domain members within the organization.
  4. The data analysts team discovers the data and subscribes to the data assets.
  5. An incoming request is sent to the retail team, who then approves the subscription request. After the subscription is approved, data analysts use Athena to create a unified query from all the subscribed data assets to get insights into the data.

In this scenario, we will review how SageMaker Catalog manages the subscription grants to Data Catalog assets (both federated and managed).

For this scenario, we assume that the retail team doesn’t have their own data warehouse and they want to create and manage Amazon Redshift tables using Data Catalog.

Scenario 2

The marketing team needs access to transaction data for campaign optimization. They have campaign performance data stored in an Amazon Redshift data warehouse. However, to have improved campaign ROI and better resource allocation, they need data from the retail team to understand actual customer purchase behavior. To improve the campaign ROI, they need answers to crucial questions such as:

  • What is the true conversion rate across different customer segments?
  • Which customers should be targeted for upcoming promotions?
  • How do seasonal buying patterns affect campaign success?

Here the retail team shares the purchase history data store_sales to the marketing team. In this scenario, shown in the preceding figure, we assume that the retail team has their own data warehouse and uses Amazon Redshift to store the purchase history data.

The high level steps of the solution implementation for this scenario are:

  1. The marketing team has created the project marketing-sql-project within SageMaker Unified Studio.
  2. The retail team has store_sales in Amazon Redshift data warehouse (contains purchase history)
  3. The retail team has published the assets to the project catalog
  4. The marketing team discovers the data and subscribes to the data assets.
  5. An incoming request is sent to the retail team, who then approves the subscription request. After the subscription is approved, the marketing team uses Amazon Redshift to consume the purchase history and identify high-value customer segments.

In this scenario, we will review the process of how SageMaker Catalog grants access to managed Amazon Redshift assets.

Prerequisites

To follow the step by step guide, you must complete the following prerequisites:

Note that the default SQL analytics project profile provides you with a RedshiftServerless blueprint. However, in this post, we want to showcase the data sharing capabilities of different types of SageMaker Lakehouse catalogs (managed and federated).

For the simplicity, we chose the SQL analytics project profile. However, you can also test this by using the Custom project profile by selecting specific blueprints such as LakehouseCatalog and LakeHouseDatabase for scenarios where the business unit doesn’t have their own data warehouse.

Solution walkthrough (Scenario 1)

The first step focuses on preparing the data for each data source for unified access.

Data preparation

In this section, you will create the following data sets:

  • customer data in Amazon S3 (default Data Catalog)
  • inventory data in a DynamoDB table (federated catalog)
  • store_sales_lakehouse in SageMaker Lakehouse managed RMS (managed catalog)
  1. Sign in to SageMaker Unified Studio as a member of the retail team and select the project retailsales-sql-project.
  2. On the top menu, choose Build, and under DATA ANALYSIS & INTEGRATION, select Query Editor.

  1. Select the following options:
    1. Under CONNECTIONS, select Athena (Lakehouse).
    2. Under CATALOGS, select AwsDataCatalog.
    3. Under DATABASES, select glue_db_<environmentid> or the customer glue database name you provided during project creation.
    4. After the options are selected, choose Choose.

When users select a project profile within SageMaker Unified Studio, the system automatically triggers the relevant AWS CloudFormation stack (DataZone-Env-<environmentid>) and deploys the necessary infrastructure resources in the form of environments. Environments are the actual data infrastructure behind a project.

  1. Run the following SQL:
CREATE TABLE customer AS
SELECT 13251813 cust_id,'Joyce Deaton'   cust_name,'Greece'   cust_country, '[email protected]'   cust_email
UNION
SELECT 1581546  ,'Daniel Dow'  ,'India'  , '[email protected]'  
UNION
SELECT 1581536  ,'Marie Lange'  ,'Canada'  , '[email protected]'  
UNION
SELECT 1827661  ,'Wesley Harris'  ,'Rome'  , '[email protected]'  
UNION
SELECT 1581536  ,'Alexander Salyer'  ,'Germany'  , '[email protected]'  
UNION
SELECT 3581536  ,'Jerry Tracy'  ,'Swiss'  , '[email protected]' 
  1. After the SQL is executed, you will find that the customer table has been created in the Lakehouse section under Lakehouse/AwsDataCatalog/glue_db_<environmentid>.

  1. The product catalog is stored in DynamoDB. You can create a new table named inventory in DynamoDB with partition key prod_id through AWS CloudShell with the following command:
aws dynamodb create-table \
    --table-name inventory\
    --attribute-definitions \
AttributeName=prod_id,AttributeType=N \
    --key-schema \
AttributeName=prod_id,KeyType=HASH \
    --provisioned-throughput \
ReadCapacityUnits=5,WriteCapacityUnits=5 \
    --table-class STANDARD
  1. Populate the DynamoDB table using the following commands:
aws dynamodb put-item --table-name inventory --item '{"prod_id": {"N": "1"}, "prod_name": {"S": "Widget A"},"active": {"S": "Y"}}' 

aws dynamodb put-item --table-name inventory --item '{"prod_id": {"N": "2"}, "prod_name": {"S": "Gadget B"},"active": {"S": "Y"}}'

aws dynamodb put-item --table-name inventory --item '{"prod_id": {"N": "3"}, "prod_name": {"S": "Item C"},"active": {"S": "N"}}' 
  1. To use the DynamoDB table in SageMaker Unified Studio, you need to configure a resource-based policy that allows the appropriate actions for the project role.
    1. To create the resource-based policy, navigate to the DynamoDB console and choose Tables from the navigation pane.
    2. Select the Permissions table and choose Create table policy.

  1. The following is an example policy that allows connecting to DynamoDB tables as a federated source. Replace the <aws_region> with the Region you are working on, <aws_account_id> with the AWS Account ID where DynamoDB is deployed, <dynamodb_table> with the DynamoDB table (in this case inventory) that you intend to query from Amazon SageMaker Unified Studio and <datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy> with the Project role Amazon Resource Name (ARN) in SageMaker Unified Studio portal. You can get the project role ARN by navigating to the project in SageMaker Unified Studio and then to Project overview.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": [
                "dynamodb:Query",
                "dynamodb:Scan",
                "dynamodb:DescribeTable",
                "dynamodb:PartiQLSelect",
                "dynamodb:BatchWriteItem"
            ],
            "Resource": "arn:aws:dynamodb:<aws_region>:<aws_accountid>:table/<dynamodb_table>",
            "Condition": {
                "ArnEquals": {
                    "aws:PrincipalArn": "arn:aws:iam::<aws_accountid>:role/<datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy>"
                }
            }
        }
    ]
}

After the policies are incorporated on the DynamoDB table, create an SageMaker Lakehouse connection within SageMaker Unified Studio. As shown in the example, dynamodb-connection-catalogs is created.

  1. After the connection is successfully established, you will see the DynamoDB table inventory under Lakehouse.

The next step is to create a managed catalog for RMS objects using SageMaker Lakehouse.

  1. Choose Data in the navigation pane.
  2. In the data explorer, choose the plus icon to add a data source.
  3. Select Create Lakehouse catalog.
  4. Choose Next.

  1. Enter the name of the catalog. The catalog name provided in the example is redshift-lakehouse-connection-catalogs. Choose Add data.

  1. After the connection is created, you will see the catalog under Lakehouse.

  1. This creates a managed Amazon Redshift Serverless workgroup in your AWS account. You will see a new database dev@<redshift-catalog-name> in the managed Amazon Redshift Serverless workgroup.
    1. On the top menu, choose Build, and under DATA ANALYSIS & INTEGRATION, select Query Editor.
    2. Select Redshift (Lakehouse) from CONNECTIONSdev@<redshift-catalog-name> from DATABASES and public from SCHEMAS

  1. Run the following SQL in order. The SQL creates the store_sales_lakehouse table in the dev database in the public schema. The retail team inserts data into the store_sales_lakehouse table.
CREATE TABLE public.store_sales_lakehouse (
    sale_id INTEGER IDENTITY(1,1) PRIMARY KEY,
    cust_id INTEGER NOT NULL,
    sale_date DATE NOT NULL,
    sale_amount DECIMAL(10, 2) NOT NULL,
    prod_id INTEGER  NOT NULL,
    last_purchase_date DATE
);
INSERT INTO public.store_sales_lakehouse (cust_id, sale_date, sale_amount, prod_id, last_purchase_date)
VALUES
(13251813, '2023-01-15', 150.00, 1, '2023-01-15'),
(29033279, '2023-01-20', 200.00, 4, '2023-01-20'),
(12755125, '2023-02-01', 75.50, 3, '2023-02-01'),
(26009249, '2023-02-10', 300.00, 2, '2023-02-10'),
(3270685, '2023-02-15', 125.00, 2, '2023-02-15'),
(6520539, '2023-03-01', 100.00, 2, '2023-03-01'),
(10251183, '2023-03-10', 250.00, 1, '2023-03-10'),
(10251283, '2023-03-15', 180.00, 1, '2023-03-15'),
(10251383, '2023-04-01', 90.00, 2, '2023-04-01'),
(10251483, '2023-04-10', 220.00, 3, '2023-04-10'),
(10251583, '2023-04-15', 175.00, 3, '2023-04-15'),
(10251683, '2023-05-01', 130.00, 1, '2023-05-01'),
(10251783, '2023-05-10', 280.00, 1, '2023-05-10'),
(10251883, '2023-05-15', 195.00, 4, '2023-05-15'),
(10251983, '2023-06-01', 110.00, 2, '2023-06-01'),
(10251083, '2023-06-10', 270.00, 1, '2023-06-10'),
(10252783, '2023-06-15', 185.00, 2, '2023-06-15'),
(10253783, '2023-07-01', 95.00, 3, '2023-07-01'),
(10254783, '2023-07-10', 240.00, 1, '2023-07-10'),
(10255783, '2023-07-15', 160.00, 3, '2023-07-15');
  1. On successful creation of the table, you should now be able to query the data. Select the table store_sales_lakehouse and select Query with Redshift.

Import assets to the project catalog from various data sources

To share your assets outside your own project to other business units, you must first bring your metadata to SageMaker Catalog. To import the assets into the project’s inventory, you need to create a data source in the project catalog. In this section, we show you how to import the technical metadata from AWS Glue data catalogs. Here, you will import data assets from various sources that you have created as part of your data preparation.

  1. Sign in to SageMaker Unified Studio as a member of the retail team. Select the project retailsales-sql-project, under Project catalog. Choose Data sources and import the assets by choosing Run.

  1. To import the federated catalog, create a new data source and choose Run. This will import the metadata of the inventory data from DynamoDB table.

  1. After successful run of all the data sources, choose Assets under Project catalog in the navigation plane. You will find all the assets in the Inventory of Project catalog.

Publish the assets

To make the assets discoverable to the data analysts team, the retail team must publish their assets.

  1. In the project retailsales-sql-project, choose Project catalog and select Assets.
  2. Select each asset in the INVENTORY tab, enrich the asset with the automated metadata generation and PUBLISH ASSET.

Discover the assets

SageMaker Catalog within SageMaker Unified Studio enables efficient data asset discovery and access management. The data analysts team signs in to SageMaker Unified Studio and selects the project dataanalyst-sql-project. The data analysts team then locates the desired assets in SageMaker Catalog and initiates the subscription request.

In this section, members of dataanalyst-sql-project browse the catalog and find the assets. There are multiple ways to find the desired assets.

  • Sign in to SageMaker Unified Studio as a member of the data analysts team. Choose Discover in the top navigation bar and select Catalog. Find the desired asset by browsing or entering the name of the asset into the search bar.
  • Search for the asset through a conversational interface using Amazon Q.
  • Use the faceted filter search by selecting the desired project in the BROWSE CATALOG.

The data analysts team selects the project retailsales-sql-project.

Subscribe to the assets

The data analysts team submits a subscription request with an appropriate justification for each of these assets.

  1. For each asset, choose SUBSCRIBE.
  2. Select dataanalyst-sql-project in Project.
  3. Provide the Reason for request as “need this data for analysis”.

Note that during the subscription process, the requester sees a message that the asset access control and fulfillment will be Managed. This means that SageMaker Unified Studio automatically manages subscription access grants and permissions for these assets.

Subscription approval workflow

To approve the subscription request, you must be a member of the retail team and select the project that has published the asset.

  1. Sign in to SageMaker Unified Studio as a member of the retail team and select the project retailsales-sql-project.
  2. In the navigation pane, choose Project catalog and then select Subscription requests.
  3. In INCOMING REQUESTS, choose the REQUESTED tab and select View request for each asset to see detailed information of the subscription request.

  • REQUEST DETAILS provides information about the subscribing project, the requestor, and the justification to access the asset.
  • RESPONSE DETAILS provides an option to approve the subscription with full access to the data (Full access) or restricted access to the data (Approve with row or column filters). With restricted access to data, the subscription approval workflow process offers granular access control for sensitive data through row-level filtering and column-level filtering. Using row filters, approvers can restrict access to specific records based on defined criteria. Using column filters, approvers can control access to specific columns within the data sets. This allows excluding sensitive fields while sharing the relevant data. Approvers can implement these filters during the approval process, helping to ensure that the data access aligns with the organization’s security requirements and compliance policies. For this post, select Full access in the RESPONSE DETAILS
  • (Optional) Decision comment is where you can add a comment about accepting or rejecting the subscription request.
  • Choose APPROVE.

  1. Repeat the subscription approval workflow process for all the requested assets.
  2. After all the subscription requests are approved, choose the APPROVED tab to view all the approved assets.

Subscription fulfillment methods

After subscription approval, a fulfillment process manages access to the assets. SageMaker Unified Studio provides fulfillment methods for managed assets and unmanaged assets.

  • Managed assets: SageMaker Unified Studio automatically manages the fulfillment and permissions for assets such as AWS Glue tables and Amazon Redshift tables and views.
  • Unmanaged assets: For unmanaged assets, permissions are handled externally. SageMaker Unified Studio publishes standard events for actions such as approvals through Amazon EventBridge, enabling integration with other AWS services or third-party solutions for custom integrations.

In this scenario 1, because the assets are Data Catalogs, SageMaker Unified Studio grants and manages access to these managed assets on your behalf through Lake Formation. See the SageMaker Unified Studio subscription workflow for updates on sharing options.

Analyze the data

The data analysts team uses the subscribed data assets from varied sources to get unified insights.

  1. As a data analyst, sign in to SageMaker Unified Studio and select the project dataanalyst-sql-project. In the navigation pane, choose Project catalog and select Assets.
  2. Choose the SUBSCRIBED tab to find all the subscribed assets from the retailsales-sql-project.
  3. The status under each asset is Asset accessible. This indicates that the subscription grants are fulfilled and the data analysts team can now consume the assets with the compute of their choice.

Query using Athena (subscription grants fulfilled using Lake Formation)

As a member of the data analysts team, create a unified view to get purchase history with customer information for active products.

  1. In the dataanalyst-sql-project project, go to Build and select Query Editor.
  2. Use the following sample query to get the required information. Replace glue_db_<environmentid> with your subscribed glue database.
select * from "redshift-lakehouse-connection-catalogs/dev"."public"."store_sales_lakehouse" sales 
 left  join "awsdatacatalog"."glue_db_<environmentid>"."customer" customer
 on sales.cust_id=customer.cust_id
 inner  join "dynamodb-connection-catalogs"."default"."inventory" inventory
 on sales.prod_id = inventory.prod_id
 where inventory.active ='Y'

Solution walk-through (Scenario 2)

In this scenario, we assume that the retail team stores the purchase history data in their Amazon Redshift data warehouse. Because you’re using the default SQL analytics project profile to create the project, you will use a Redshift Serverless compute (project.redshift). The purchase history data is shared with the marketing team for enhanced campaign performance.

  1. Sign in to SageMaker Unified Studio as a member of the retail team and select the project retailsales-sql-project.
  2. On the top menu, choose Build, and under DATA ANALYSIS & INTEGRATION, select Query Editor
  3. Select the following options:
    • Under CONNECTIONS, select Redshift(Lakehouse).
    • Under CATALOGS, select dev.
    • Under DATABASES, select public.
  4. Run the following SQL:
CREATE TABLE public.store_sales (
sale_id INTEGER IDENTITY(1,1) PRIMARY KEY,
cust_id INTEGER NOT NULL,
sale_date DATE NOT NULL,
sale_amount DECIMAL(10, 2) NOT NULL,
prod_id INTEGER  NOT NULL,
last_purchase_date DATE
);
INSERT INTO public.store_sales (cust_id, sale_date, sale_amount, prod_id, last_purchase_date)
VALUES
(13251813, '2023-01-15', 150.00, 1, '2023-01-15'),
(29033279, '2023-01-20', 200.00, 4, '2023-01-20'),
(12755125, '2023-02-01', 75.50, 3, '2023-02-01'),
(26009249, '2023-02-10', 300.00, 2, '2023-02-10'),
(3270685, '2023-02-15', 125.00, 2, '2023-02-15'),
(6520539, '2023-03-01', 100.00, 2, '2023-03-01'),
(10251183, '2023-03-10', 250.00, 1, '2023-03-10'),
(10251283, '2023-03-15', 180.00, 1, '2023-03-15'),
(10251383, '2023-04-01', 90.00, 2, '2023-04-01'),
(10251483, '2023-04-10', 220.00, 3, '2023-04-10'),
(10251583, '2023-04-15', 175.00, 3, '2023-04-15'),
(10251683, '2023-05-01', 130.00, 1, '2023-05-01'),
(10251783, '2023-05-10', 280.00, 1, '2023-05-10'),
(10251883, '2023-05-15', 195.00, 4, '2023-05-15'),
(10251983, '2023-06-01', 110.00, 2, '2023-06-01'),
(10251083, '2023-06-10', 270.00, 1, '2023-06-10'),
(10252783, '2023-06-15', 185.00, 2, '2023-06-15'),
(10253783, '2023-07-01', 95.00, 3, '2023-07-01'),
(10254783, '2023-07-10', 240.00, 1, '2023-07-10'),
(10255783, '2023-07-15', 160.00, 3, '2023-07-15');

5. On successful execution of the query, you will see store_sales under Redshift in the navigation pane.

Import the asset to the project catalog inventory

To share your assets outside your own project to other marketing business units, you must first share your metadata to SageMaker Catalog. To import the assets into the project’s inventory, you need to run the data source in the project catalog.

In the project retailsales-sql-project, under Project catalog, select Data sources and import the asset store-sales. Select the highlighted data source and choose Run as shown in the screenshot.

Publish the asset

To make the assets discoverable to the marketing team, the retail team must publish their asset.

  1. Go to the navigation pane and choose Project catalog, and then select Assets.
  2. Select store-sales in the INVENTORY tab, enrich the asset with the automated metadata generation and PUBLISH ASSET as illustrated in the screenshot.

Discover and subscribe the asset

The marketing team discovers and subscribes to the store-sales asset.

  1. Sign in to SageMaker Unified Studio as a member of the marketing team and select marketing-sql-project.
  2. Navigate to the Discover menu in the top navigation bar and choose Catalog. Find the desired asset by browsing or entering the name of the asset into the search bar.
  3. Select the asset and choose SUBSCRIBE.
  4. Enter a justification in Reason for request and choose REQUEST.

Subscription approval workflow

The retail team gets an incoming request in their project to approve the subscription request.

  1. Sign in to the SageMaker Unified Studio and select the project retailsales-sql-project as a member of the retail team. Under Project catalog, select Subscription requests.
  2. In the INCOMING REQUESTS, under the REQUESTED tab, select View request for store-sales.

  1. You will see detailed information for the subscription request.
  2. Select Full access in the RESPONSE DETAILS and choose APPROVE.

Analyze the data

Sign in to SageMaker Unified Studio as a member of the marketing team and select marketing-sql-project.

  1. In the Project catalog, select Assets and choose the SUBSCRIBED tab to find all the subscribed assets from the retailsales-sql-project.
  2. Notice the status under the asset marked as Asset accessible. This indicates that the subscription grants are fulfilled and the marketing team can now consume the asset with the compute of their choice.

Query using Amazon Redshift (subscription grants fulfilled using native Amazon Redshift data sharing)

To query the shared data with Amazon Redshift compute, select Build and then Query Editor. Select the following options

  1. Under CONNECTIONS, select Redshift(Lakehouse).
  2. Under CATALOGS, select dev.
  3. Under DATABASES, select project.
select * from "dev"."project"."store_sales" sales  

When a subscription to an Amazon Redshift table or view is approved, SageMaker Unified Studio automatically adds the subscribed asset to the consumer’s Amazon Redshift Serverless workgroup for the project. Notice the subscribed asset is shared under the folder project. In the Redshift navigation pane, you can also see the datashare created between the source and the target cluster. In this case, because the data is shared in the same account but between different clusters, SageMaker Unified Studio creates a view in the target database and permissions are granted on the view. See Grant access to managed Amazon Redshift assets in Amazon SageMaker Unified Studio for information about data sharing options within Amazon Redshift.

Clean up

Make sure you remove the SageMaker Unified Studio resources to avoid any unexpected costs. Start by deleting the connections, catalogs, underlying data sources, projects, databases, and domain that you created for this post. For additional details, see the Amazon SageMaker Unified Studio Administrator Guide.

Conclusion

In this post, we explored two distinct approaches to data sharing and analytics.

Business units without an existing data warehouse can use a SageMaker Lakehouse managed RMS catalog. In the first scenario, we showcased subscription fulfillment of AWS Glue Data Catalogs using AWS Lake Formation for federated and managed catalogs. The data analysts team was able to connect and subscribe to the data shared by the retail team that resided in Amazon S3, Amazon Redshift, and other data sources such as DynamoDB through SageMaker Lakehouse.

In the second scenario, we demonstrated the native data-sharing capabilities of Amazon Redshift. In this scenario, we assume that the retail team has sales transactions stored in an Amazon Redshift data warehouse. Using the data sharing feature of Amazon Redshift, the asset was shared to the marketing team using Amazon SageMaker Unified Studio.

Both approaches enable unified querying across varied data sources with teams able to efficiently discover, publish, and subscribe to data assets while maintaining strict access controls through Amazon SageMaker Data and AI Governance. Subscription fulfillment is automated, reducing the administrative overhead. Using the query-in-place approach eliminates data redundancy and maintains data consistency while allowing unified analysis across data sources through a single integrated experience.

To learn more, see the Amazon SageMaker Unified Studio Administrator Guide and the following resources:


About the authors

Lakshmi Nair is a Senior Analytics Specialist Solutions Architect at AWS. She specializes in designing advanced analytics systems across industries. She focuses on crafting cloud-based data platforms, enabling real-time streaming, big data processing, and robust data governance. She can be reached through LinkedIn

Ramkumar Nottath is a Principal Solutions Architect at AWS focusing on Analytics services. He enjoys working with various customers to help them build scalable, reliable big data and analytics solutions. His interests extend to various technologies such as analytics, data warehousing, streaming, data governance, and machine learning. He loves spending time with his family and friends. 

Foundational blocks of Amazon SageMaker Unified Studio: An admin’s guide to implement unified access to all your data, analytics, and AI

Post Syndicated from Lakshmi Nair original https://aws.amazon.com/blogs/big-data/foundational-blocks-of-amazon-sagemaker-unified-studio-an-admins-guide-to-implement-unified-access-to-all-your-data-analytics-and-ai/

Amazon SageMaker Unified Studio (preview) provides a unified experience for using data, analytics, and AI capabilities. You can use familiar AWS services for model development, generative AI, data processing, and analytics—all within a single, governed environment. Users can now build, deploy, and execute end-to-end workflows from a single interface. SageMaker Unified Studio is built on the foundations of Amazon DataZone, where it uses domains to categorize and structure the data assets, while offering project-based collaboration features that allow teams to securely share artifacts and work together across various compute services. This experience allows multiple personas to seamlessly collaborate, while operating under appropriate access controls and governance policies.

In this post, we focus on the admin persona and deep dive into the foundational building blocks while implementing the self-service access to all your data.

Conceptual framework

SageMaker Unified Studio offers an integrated development experience organized into three distinct planes, each serving different personas and purposes within the development lifecycle. This architecture enables seamless collaboration while maintaining clear boundaries of responsibility.

As shown in the following figure, each plane represents a distinct layer of functionality that works in harmony with the others to create a complete data and machine learning (ML) solution.

foundational planes

The planes are as follows:

  • Infrastructure plane – The infrastructure plane forms the foundation of SageMaker Unified Studio. Here administrators and domain owners of the organization provision the underlying infrastructure and define rules for users of the data factory plane to deploy the compute resources for data and ML operations in self-service mode. They can also decide to onboard existing resources or pre-create them. They can set up access controls and permissions to enforce and allocate resources to different teams and projects. This layer makes sure that all necessary computational resources are available and properly governed for downstream computation.
  • Data factory plane – The data factory plane functions like a sophisticated vending machine for compute resources, where data scientists and ML engineers can select and utilize preconfigured compute resources or deploy new ones. The data product developers, data engineers, and data scientists can create collaboration spaces and build data products by consuming infrastructure resources, with all the underlying complexity abstracted away.
  • Product experience plane – At the outermost layer, the product experience plane serves as a discovery and collaboration hub where business units (data producers and data consumers) can explore available data products from the asset catalog. This plane drives users to engage in data-driven conversations with knowledge and insights shared across the organization. Through the product experience plane, data product owners can use automated workflows to capture data lineage and data quality metrics and oversee access controls. They can track how their data products are being used and continuously improve the value proposition of their data assets.

In this post, we focus on the infrastructure plane deployment steps from an administrator’s perspective, outlining key responsibilities and actions required and how to configure and organize your assets under specific business units and teams and authorize policies during the initial setup phase.

Roles and responsibilities of the domain owner (admin) for the infrastructure plane

As shown in the following figure, the infrastructure plane revolves around three pivotal operational paradigms: onboard, organize, and authorize.

The details of the three essential functions in the foundational layer are as follows:

  • Onboard – The domain owner establishes a foundational environment by creating a domain, which represents an organization entity for you to connect together your assets, users, resources, and code repository configs. They can onboard the users who have authorization to access the self-serve unified studio. The self-serve unified studio is a browser-based web application where you can analyze, discover, catalog, govern, and share data in self-serve manner. The admin can enable the necessary blueprints and create project profiles to set up the underlying data infrastructure. In a multi-account (Mesh) scenario, the admin can also onboard the business units by associating the AWS accounts.
  • Organize – Here the domain owner creates hierarchies to organize and isolate projects within individual business units. The method of creating hierarchical representation of business units or team-level organization is through domain units. This makes sure that each business unit takes ownership of their assets. The admin can also delegate ownership within these business units.
  • Authorize – The admin or owners of individual business units or line of business (domain unit owners) can manage user policies—project-specific policies that dictate certain actions these principals can perform under a domain unit.

Now that we have discussed the core functions, let’s delve into the workflow that brings these concepts together.

Process workflow (infrastructure plane)

In the following figure, we break down the roles and responsibilities of domain owners to unit administrators through a sequence of operations, providing infrastructure deployment and management.

process workflow

The workflow consists of the following steps:

  1. The root domain owner (admin) creates a SageMaker Unified Studio domain from the console. After the domain is created, you get a SageMaker Unified Studio URL—a browser-based web application that can authenticate you with your AWS Identity and Access Management (IAM) user credentials or with credentials from your identity provider (IdP) through AWS IAM Identity Center or with your SAML credentials.
  2. As part of the onboarding process, the admin onboards single sign-on (SSO) users, SSO groups, and IAM users who are authorized to log in to SageMaker Unified Studio. IAM roles can be onboarded on the domain as well, but can be used for programmatic access only. During the quick setup deployment of the domain, default project profile templates are created. A project profile is a collection of blueprints that holds configurations of AWS tools and services. You can create following project profiles:
    1. Generative AI application development – Provides you with the tooling capabilities to build generative AI applications using Amazon Bedrock foundation models (FMs) and tools.
    2. SQL analytics – Provides you with a SQL editor to query the data in Amazon SageMaker Lakehouse, Amazon Redshift, and Amazon Athena.
    3. Data analytics and AI-ML model development – Provides you tools to build and orchestrate ML and generative AI models powered by AWS Glue, Athena, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), Amazon SageMaker AI, and SageMaker Lakehouse.
    4. Custom project profile – Provides capabilities to build custom templates that can bundle multiple blueprints with varied tooling capabilities to suit your business needs.

Admins can also authorize project profile templates to specific users and groups, enforcing the capability to control resource deployment based on user personas. By default, all users are authorized to use default project profiles. However, this can be changed by the admin to limit the access of certain project profiles to certain users and groups.

The quick setup also establishes a default Git connection to AWS CodeCommit for users to manage their code repository. However, you also have the option to create and enable new Git connections to GitHub, GitHub Enterprise Server, GitLab, and GitLab self-managed. The Free Tier release of Amazon Q is enabled by default to all users of SageMaker Unified Studio domain. Amazon Q Developer Pro can be configured if IAM Identity Center is configured for users of the domain.

Finally, as part of the initial setup, the admin provides access to Amazon Bedrock serverless models.

In a multi-account scenario, the central admin associates AWS accounts, and the associated account admins accept the association and enable the blueprints for the project profiles that the central admin would create. Refer to the appendix at the end of this post for more details.

  1. To organize the data assets within the organization, the admin logs in to the SageMaker Unified Studio URL and creates domain units aligned with the business divisions.
  2. Each domain unit receives delegated ownership, enabling autonomous management of assets within their designated scope. This domain-based isolation provides clear boundaries while allowing unit owners to independently govern their assets and enforce relevant policies.

Steps 3 and 4 are optional as part of the quick deployment setup. Users can directly log in to SageMaker Unified Studio to build data products for their business use case if domain units are not part of immediate requirement. If no domain units are created, all users and groups fall back under the root domain level and authorization policies are applied on the root domain.

Behind the scenes

While users interact with a streamlined project creation interface in SageMaker Unified Studio, a sophisticated orchestration of components operates beneath the surface. This abstraction allows the admin to deploy infrastructure through simple selections while the system handles resource provisioning automatically. Let’s examine the underlying process behind the scenes, as illustrated in the following figure.

conceptual diagram of blueprints

This workflow consists of the following steps:

  1. Administrators enable the blueprints containing the AWS CloudFormation templates that have information on how to create and set up the underlying data infrastructure. These blueprints are automatically enabled during the quick setup deployment.
  2. Project profiles bundle these blueprint configurations into templates. These templates determine which infrastructure components deploy when a project is created.
  3. When users select a project profile within SageMaker Unified Studio, the system automatically triggers the relevant CloudFormation stack and deploys the necessary infrastructure resources in the form of environments. Environments are the actual data infrastructure behind a project.

In a multi-account scenario, the associated account admin enables the blueprints. However, the project profile creation happens at the root domain account. The project profile template will include the associated account details and the linked blueprints from the associated account. Refer to the appendix at the end of this post for more details.

Now that we have understood the functional building blocks of SageMaker Unified Studio, let’s proceed with the deployment walkthrough. We will create a domain using the quick setup deployment for single account. Refer to the appendix for multi-account deployment steps.

Prerequisites

You will need to complete the following prerequisites before you can follow the instructions in the next section:

  1. Sign up for an AWS account.
  2. Create a user with administrative access.
  3. Enable IAM Identity Center in the same AWS Region you want to create your SageMaker Unified Studio domain. Confirm in which Region SageMaker Unified Studio is currently available. Set up your IdP and synchronize identities and groups with IAM Identity Center. For more information, refer to IAM Identity Center Identity source tutorials.
  4. To use Amazon Bedrock FMs, grant access to base models.

Set up domain

Complete the following steps to create a new SageMaker Unified Studio domain:

  1. Sign in to the SageMaker console in the Region in which IAM Identity Center is enabled.
  2. Choose Create a Unified Studio domain.

create domain

  1. Select the Quick setup (recommended for exploration).
  2. Choose Create VPC (you can also use your own VPC but to simplify the cleanup, we opted to use a new VPC).

create vpc

This will open a new tab to deploy the CloudFormation stack to create the VPC and the necessary private and public subnets.

  1. For Stack name, enter a unique name to the stack (if the default name already exists).
  2. Keep the parameter for useVpcEndpoints as false.
  3. Choose Create stack.

create stack

  1. After the stack is created, go to the domain creation page and refresh the page, as shown in the following screenshot.

refresh

  1. For Name, enter a unique name for the domain.
  2. Keep the default selections for Domain Execution role, Domain Service role, Provisioning role, and Manage Access role.
  3. The configuration automatically selects the VPC and private subnets.

domain roles

service roles

  1. Keep the default selection for Model provisioning role and Model consumption role.
  2. Choose Continue.

prov roles

  1. Provide the email address of the SSO user that exists in IAM Identity Center.

The SSO user selected here is used as the administrator in SageMaker Unified Studio. If the account doesn’t have IAM Identity Center set up, then it will create an IAM Identity Center account instance, so long as the account is permitted to do so. An SSO or IAM user is required so that a user is able to log in to the studio after the domain is created.

  1. Choose Create domain.

create IdC

  1. After the domain is created, a dialog box pops up. You can close dialog box to set up authorization policies and onboard users.

dialog box

On the domain detail page, the Amazon SageMaker Unified Studio URL is listed. You can authenticate with your IAM user credentials or with credentials from your IdP through IAM Identity Center or with your SAML credentials. To authorize users to log in to the URL, the administrator must onboard the users to the domain. We see this as part of the next steps.

Unified Studio URL

Onboard users and associated accounts

Complete the following steps:

  1. To onboard users, go to the User management tab and choose Add.
  2. On the Add menu, choose either Add SSO users and groups or Add IAM users.

You can also add IAM roles for the purpose of managing the domain programmatically. However, you can’t use IAM roles to log in to the SageMaker Unified Studio URL. After you add the users, they will appear with the status Assigned. The status changes to Activated only when the user logs in to the SageMaker Unified Studio URL.

onboard users

  1. If you want to onboard multiple AWS accounts to your domain account, go to the Account associations tab and choose Request association.

This enables domain users to publish and consume data from these AWS accounts.

associate accounts

For a multi-account setup, by sending an association request to another AWS account, you share the root domain with the other AWS account with AWS Resource Access Manger (AWS RAM). The associated admin domain owner accepts the invitation. To access the compute resources of the associated accounts from SageMaker Unified Studio, the associated domain owner must enable the necessary blueprints. Refer to the appendix to understand the cross-account deployment steps.

Project profiles and authorizing users

For the quick setup deployment, when you navigate to the Blueprints tab, you will notice all the blueprints are automatically enabled. Also, on the Project profiles tab, you will find default project profiles are available to the user.project profiles

Leave the rest of the tabs with the default options.

Create a custom project profile and authorize users (optional)

In the following example, we show the steps to create a custom project profile by bundling selected blueprints. We also show the steps to authorize only restricted users to use this project profile template. This example creates a custom project profile with selective blueprints. This enables the user to create a data lake environment with AWS Glue database and Athena workgroup to query the data. The user can also create an Amazon MWAA environment for orchestration. You can also change or override the configuration parameters of the blueprint by using the Tooling configurations option within the project profile.

Because SageMaker Unified Studio is in preview mode, the naming conventions of some visual elements might appear different in the current version.

When you create a project profile, you can add blueprint deployment settings in two modes: on create and on demand. On create mode allows you to deploy the blueprint deployment settings as soon as the project is created. On demand mode allows you to deploy the blueprint deployment settings when users need it.

Create a project, create domain units, and delegate ownership (optional)

In the following example, the administrator logs in to SageMaker Unified Studio and creates the retail domain unit. The admin also delegates ownership to the retail business user. The retail business user logs in to SageMaker Unified Studio and creates a project with the authorized project profile template.

With these configurations in place, you have successfully completed the initial infrastructure plane deployment from an administrative perspective.

Authorization of blueprints (optional)

By default, all domain users have authorization to create projects with the enabled blueprints across domain units. If you want to restrict the usage of the blueprint within a specific domain unit (in this case, the retail domain unit, as shown in the following screenshot), you need to revoke the existing permissions and authorize the specific domain units. By limiting the use of blueprints to a particular domain unit, users can only create projects using the blueprint within that domain unit. To apply authorization settings to child domain units, enable the Cascade to all child domain units option.

blueprints authorization

Clean up

Make sure you remove the SageMaker Unified Studio resources to mitigate any unexpected costs. This involves a few steps:

  1. If you had multiple projects and subscribed to assets, unsubscribe to all assets.
  2. Note the names of all AWS Glue databases and Athena workgroups created by your projects.
  3. Delete any connections you created in the data explorer that you don’t want to keep.
  4. Note the project IDs.
  5. Delete the projects. If you encounter any errors, check the AWS CloudFormation console and find the failed stack. Fix the error that failed the stack deletion and delete the projects.
  6. Note down the domain ID.
  7. Delete the domain.
  8. Delete the S3 bucket named amazon-datazone-AWSACCOUNTID-AWSREGION-DOMAINID.
  9. Delete the AWS Glue databases and Athena workgroups you noted earlier.
  10. Delete the CloudFormation stack for the VPC (if you followed that step in the setup).

If you have additional resources that haven’t been deleted, you can also use tags to identify and delete specific resources.

Conclusion

In this post, we discussed the foundational building blocks of SageMaker Unified Studio and how, by abstracting complex technical implementations behind user-friendly interfaces, organizations can maintain standardized governance while enabling efficient resource management across business units. This approach provides consistency in infrastructure deployment while providing the flexibility needed for diverse business requirements.

To learn more, refer to the Amazon SageMaker Unified Studio Administrator Guide and the following resources:

Appendix: Multi-account administration

This section illustrates the cross-account association. After the account invitation is accepted by the associated account owner, follow the instructions as shown in the following example to understand how to enable the blueprints. After the blueprints are enabled in the associate accounts, the root domain account can create project profile templates with the parameters of the associated account, including its linked blueprints. The example then demonstrates how the retail domain unit user can deploy compute resources and create data using the resources from the associated account.


About the Authors

Lakshmi Nair is a Senior Analytics Specialist Solutions Architect at AWS. She specializes in designing advanced analytics systems across industries. She focuses on crafting cloud-based data platforms, enabling real-time streaming, big data processing, and robust data governance. She can be reached via LinkedIn.

Fabrizio Napolitano is a Principal Specialist Solutions Architect for DB and Analytics. He has worked in the analytics space for the last 20 years, and has recently and quite by surprise become a Hockey Dad after moving to Canada.

Seamless integration of data lake and data warehouse using Amazon Redshift Spectrum and Amazon DataZone

Post Syndicated from Lakshmi Nair original https://aws.amazon.com/blogs/big-data/seamless-integration-of-data-lake-and-data-warehouse-using-amazon-redshift-spectrum-and-amazon-datazone/

Unlocking the true value of data often gets impeded by siloed information. Traditional data management—wherein each business unit ingests raw data in separate data lakes or warehouses—hinders visibility and cross-functional analysis. A data mesh framework empowers business units with data ownership and facilitates seamless sharing.

However, integrating datasets from different business units can present several challenges. Each business unit exposes data assets with varying formats and granularity levels, and applies different data validation checks. Unifying these necessitates additional data processing, requiring each business unit to provision and maintain a separate data warehouse. This burdens business units focused solely on consuming the curated data for analysis and not concerned with data management tasks, cleansing, or comprehensive data processing.

In this post, we explore a robust architecture pattern of a data sharing mechanism by bridging the gap between data lake and data warehouse using Amazon DataZone and Amazon Redshift.

Solution overview

Amazon DataZone is a data management service that makes it straightforward for business units to catalog, discover, share, and govern their data assets. Business units can curate and expose their readily available domain-specific data products through Amazon DataZone, providing discoverability and controlled access.

Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. Thousands of customers use Amazon Redshift data sharing to enable instant, granular, and fast data access across Amazon Redshift provisioned clusters and serverless workgroups. This allows you to scale your read and write workloads to thousands of concurrent users without having to move or copy the data. Amazon DataZone natively supports data sharing for Amazon Redshift data assets. With Amazon Redshift Spectrum, you can query the data in your Amazon Simple Storage Service (Amazon S3) data lake using a central AWS Glue metastore from your Redshift data warehouse. This capability extends your petabyte-scale Redshift data warehouse to unbounded data storage limits, which allows you to scale to exabytes of data cost-effectively.

The following figure shows a typical distributed and collaborative architectural pattern implemented using Amazon DataZone. Business units can simply share data and collaborate by publishing and subscribing to the data assets.

hubandspoke

The Central IT team (Spoke N) subscribes the data from individual business units and consumes this data using Redshift Spectrum. The Central IT team applies standardization and performs the tasks on the subscribed data such as schema alignment, data validation checks, collating the data, and enrichment by adding additional context or derived attributes to the final data asset. This processed unified data can then persist as a new data asset in Amazon Redshift managed storage to meet the SLA requirements of the business units. The new processed data asset produced by the Central IT team is then published back to Amazon DataZone. With Amazon DataZone, individual business units can discover and directly consume these new data assets, gaining insights to a holistic view of the data (360-degree insights) across the organization.

The Central IT team manages a unified Redshift data warehouse, handling all data integration, processing, and maintenance. Business units access clean, standardized data. To consume the data, they can choose between a provisioned Redshift cluster for consistent high-volume needs or Amazon Redshift Serverless for variable, on-demand analysis. This model enables the units to focus on insights, with costs aligned to actual consumption. This allows the business units to derive value from data without the burden of data management tasks.

This streamlined architecture approach offers several advantages:

  • Single source of truth – The Central IT team acts as the custodian of the combined and curated data from all business units, thereby providing a unified and consistent dataset. The Central IT team implements data governance practices, providing data quality, security, and compliance with established policies. A centralized data warehouse for processing is often more cost-efficient, and its scalability allows organizations to dynamically adjust their storage needs. Similarly, individual business units produce their own domain-specific data. There are no duplicate data products created by business units or the Central IT team.
  • Eliminating dependency on business units – Redshift Spectrum uses a metadata layer to directly query the data residing in S3 data lakes, eliminating the need for data copying or relying on individual business units to initiate the copy jobs. This significantly reduces the risk of errors associated with data transfer or movement and data copies.
  • Eliminating stale data – Avoiding duplication of data also eliminates the risk of stale data existing in multiple locations.
  • Incremental loading – Because the Central IT team can directly query the data on the data lakes using Redshift Spectrum, they have the flexibility to query only the relevant columns needed for the unified analysis and aggregations. This can be done using mechanisms to detect the incremental data from the data lakes and process only the new or updated data, further optimizing resource utilization.
  • Federated governance – Amazon DataZone facilitates centralized governance policies, providing consistent data access and security across all business units. Sharing and access controls remain confined within Amazon DataZone.
  • Enhanced cost appropriation and efficiency – This method confines the cost overhead of processing and integrating the data with the Central IT team. Individual business units can provision the Redshift Serverless data warehouse to solely consume the data. This way, each unit can clearly demarcate the consumption costs and impose limits. Additionally, the Central IT team can choose to apply chargeback mechanisms to each of these units.

In this post, we use a simplified use case, as shown in the following figure, to bridge the gap between data lakes and data warehouses using Redshift Spectrum and Amazon DataZone.

custom blueprints and spectrum

The underwriting business unit curates the data asset using AWS Glue and publishes the data asset Policies in Amazon DataZone. The Central IT team subscribes to the data asset from the underwriting business unit. 

We focus on how the Central IT team consumes the subscribed data lake asset from business units using Redshift Spectrum and creates a new unified data asset.

Prerequisites

The following prerequisites must be in place:

  • AWS accounts – You should have active AWS accounts before you proceed. If you don’t have one, refer to How do I create and activate a new AWS account? In this post, we use three AWS accounts. If you’re new to Amazon DataZone, refer to Getting started.
  • A Redshift data warehouse – You can create a provisioned cluster following the instructions in Create a sample Amazon Redshift cluster, or provision a serverless workgroup following the instructions in Get started with Amazon Redshift Serverless data warehouses.
  • Amazon Data Zone resources – You need a domain for Amazon DataZone, an Amazon DataZone project, and a new Amazon DataZone environment (with a custom AWS service blueprint).
  • Data lake asset – The data lake asset Policies from the business units was already onboarded to Amazon DataZone and subscribed by the Central IT team. To understand how to associate multiple accounts and consume the subscribed assets using Amazon Athena, refer to Working with associated accounts to publish and consume data.
  • Central IT environment – The Central IT team has created an environment called env_central_team and uses an existing AWS Identity and Access Management (IAM) role called custom_role, which grants Amazon DataZone access to AWS services and resources, such as Athena, AWS Glue, and Amazon Redshift, in this environment. To add all the subscribed data assets to a common AWS Glue database, the Central IT team configures a subscription target and uses central_db as the AWS Glue database.
  • IAM role – Make sure that the IAM role that you want to enable in the Amazon DataZone environment has necessary permissions to your AWS services and resources. The following example policy provides sufficient AWS Lake Formation and AWS Glue permissions to access Redshift Spectrum:
{
	"Version": "2012-10-17",
	"Statement": [{
		"Effect": "Allow",
		"Action": [
			"lakeformation:GetDataAccess",
			"glue:GetTable",
			"glue:GetTables",
			"glue:SearchTables",
			"glue:GetDatabase",
			"glue:GetDatabases",
			"glue:GetPartition",
			"glue:GetPartitions"
		],
		"Resource": "*"
	}]
}

As shown in the following screenshot, the Central IT team has subscribed to the data Policies. The data asset is added to the env_central_team environment. Amazon DataZone will assume the custom_role to help federate the environment user (central_user) to the action link in Athena. The subscribed asset Policies is added to the central_db database. This asset is then queried and consumed using Athena.

The goal of the Central IT team is to consume the subscribed data lake asset Policies with Redshift Spectrum. This data is further processed and curated into the central data warehouse using the Amazon Redshift Query Editor v2 and stored as a single source of truth in Amazon Redshift managed storage. In the following sections, we illustrate how to consume the subscribed data lake asset Policies from Redshift Spectrum without copying the data.

Automatically mount access grants to the Amazon DataZone environment role

Amazon Redshift automatically mounts the AWS Glue Data Catalog in the Central IT Team account as a database and allows it to query the data lake tables with three-part notation. This is available by default with the Admin role.

To grant the required access to the mounted Data Catalog tables for the environment role (custom_role), complete the following steps:

  1. Log in to the Amazon Redshift Query Editor v2 using the Amazon DataZone deep link.
  2. In the Query Editor v2, choose your Redshift Serverless endpoint and choose Edit Connection.
  3. For Authentication, select Federated user.
  4. For Database, enter the database you want to connect to.
  5. Get the current user IAM role as illustrated in the following screenshot.

getcurrentUser from Redshift QEv2

  1. Connect to Redshift Query Editor v2 using the database user name and password authentication method. For example, connect to dev database using the admin user name and password. Grant usage on the awsdatacatalog database to the environment user role custom_role (replace the value of current_user with the value you copied):
GRANT USAGE ON DATABASE awsdatacatalog to "IAMR:current_user"

grantpermissions to awsdatacatalog

Query using Redshift Spectrum

Using the federated user authentication method, log in to Amazon Redshift. The Central IT team will be able to query the subscribed data asset Policies (table: policy) that was automatically mounted under awsdatacatalog.

query with spectrum

Aggregate tables and unify products

The Central IT team applies the necessary checks and standardization to aggregate and unify the data assets from all business units, bringing them at the same granularity. As shown in the following screenshot, both the Policies and Claims data assets are combined to form a unified aggregate data asset called agg_fraudulent_claims.

creatingunified product

These unified data assets are then published back to the Amazon DataZone central hub for business units to consume them.

unified asset published

The Central IT team also unloads the data assets to Amazon S3 so that each business unit has the flexibility to use either a Redshift Serverless data warehouse or Athena to consume the data. Each business unit can now isolate and put limits to the consumption costs on their individual data warehouses.

Because the intention of the Central IT team was to consume data lake assets within a data warehouse, the recommended solution would be to use custom AWS service blueprints and deploy them as part of one environment. In this case, we created one environment (env_central_team) to consume the asset using Athena or Amazon Redshift. This accelerates the development of the data sharing process because the same environment role is used to manage the permissions across multiple analytical engines.

Clean up

To clean up your resources, complete the following steps:

  1. Delete any S3 buckets you created.
  2. On the Amazon DataZone console, delete the projects used in this post. This will delete most project-related objects like data assets and environments.
  3. Delete the Amazon DataZone domain.
  4. On the Lake Formation console, delete the Lake Formation admins registered by Amazon DataZone along with the tables and databases created by Amazon DataZone.
  5. If you used a provisioned Redshift cluster, delete the cluster. If you used Redshift Serverless, delete any tables created as part of this post.

Conclusion

In this post, we explored a pattern of seamless data sharing with data lakes and data warehouses with Amazon DataZone and Redshift Spectrum. We discussed the challenges associated with traditional data management approaches, data silos, and the burden of maintaining individual data warehouses for business units.

In order to curb operating and maintenance costs, we proposed a solution that uses Amazon DataZone as a central hub for data discovery and access control, where business units can readily share their domain-specific data. To consolidate and unify the data from these business units and provide a 360-degree insight, the Central IT team uses Redshift Spectrum to directly query and analyze the data residing in their respective data lakes. This eliminates the need for creating separate data copy jobs and duplication of data residing in multiple places.

The team also takes on the responsibility of bringing all the data assets to the same granularity and process a unified data asset. These combined data products can then be shared through Amazon DataZone to these business units. Business units can only focus on consuming the unified data assets that aren’t specific to their domain. This way, the processing costs can be controlled and tightly monitored across all business units. The Central IT team can also implement chargeback mechanisms based on the consumption of the unified products for each business unit.

To learn more about Amazon DataZone and how to get started, refer to Getting started. Check out the YouTube playlist for some of the latest demos of Amazon DataZone and more information about the capabilities available.


About the Authors

Lakshmi Nair is a Senior Analytics Specialist Solutions Architect at AWS. She specializes in designing advanced analytics systems across industries. She focuses on crafting cloud-based data platforms, enabling real-time streaming, big data processing, and robust data governance.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building analytics and data mesh solutions on AWS and sharing them with the community.

Implement data quality checks on Amazon Redshift data assets and integrate with Amazon DataZone

Post Syndicated from Lakshmi Nair original https://aws.amazon.com/blogs/big-data/implement-data-quality-checks-on-amazon-redshift-data-assets-and-integrate-with-amazon-datazone/

Data quality is crucial in data pipelines because it directly impacts the validity of the business insights derived from the data. Today, many organizations use AWS Glue Data Quality to define and enforce data quality rules on their data at rest and in transit. However, one of the most pressing challenges faced by organizations is providing users with visibility into the health and reliability of their data assets. This is particularly crucial in the context of business data catalogs using Amazon DataZone, where users rely on the trustworthiness of the data for informed decision-making. As the data gets updated and refreshed, there is a risk of quality degradation due to upstream processes.

Amazon DataZone is a data management service designed to streamline data discovery, data cataloging, data sharing, and governance. It allows your organization to have a single secure data hub where everyone in the organization can find, access, and collaborate on data across AWS, on premises, and even third-party sources. It simplifies the data access for analysts, engineers, and business users, allowing them to discover, use, and share data seamlessly. Data producers (data owners) can add context and control access through predefined approvals, providing secure and governed data sharing. The following diagram illustrates the Amazon DataZone high-level architecture. To learn more about the core components of Amazon DataZone, refer to Amazon DataZone terminology and concepts.

DataZone High Level Architecture

To address the issue of data quality, Amazon DataZone now integrates directly with AWS Glue Data Quality, allowing you to visualize data quality scores for AWS Glue Data Catalog assets directly within the Amazon DataZone web portal. You can access the insights about data quality scores on various key performance indicators (KPIs) such as data completeness, uniqueness, and accuracy.

By providing a comprehensive view of the data quality validation rules applied on the data asset, you can make informed decisions about the suitability of the specific data assets for their intended use. Amazon DataZone also integrates historical trends of the data quality runs of the asset, giving full visibility and indicating if the quality of the asset improved or degraded over time. With the Amazon DataZone APIs, data owners can integrate data quality rules from third-party systems into a specific data asset. The following screenshot shows an example of data quality insights embedded in the Amazon DataZone business catalog. To learn more, see Amazon DataZone now integrates with AWS Glue Data Quality and external data quality solutions.

In this post, we show how to capture the data quality metrics for data assets produced in Amazon Redshift.

Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. Amazon DataZone natively supports data sharing for Amazon Redshift data assets.

With Amazon DataZone, the data owner can directly import the technical metadata of a Redshift database table and views to the Amazon DataZone project’s inventory. As these data assets gets imported into Amazon DataZone, it bypasses the AWS Glue Data Catalog, creating a gap in data quality integration. This post proposes a solution to enrich the Amazon Redshift data asset with data quality scores and KPI metrics.

Solution overview

The proposed solution uses AWS Glue Studio to create a visual extract, transform, and load (ETL) pipeline for data quality validation and a custom visual transform to post the data quality results to Amazon DataZone. The following screenshot illustrates this pipeline.

Glue ETL pipeline

The pipeline starts by establishing a connection directly to Amazon Redshift and then applies necessary data quality rules defined in AWS Glue based on the organization’s business needs. After applying the rules, the pipeline validates the data against those rules. The outcome of the rules is then pushed to Amazon DataZone using a custom visual transform that implements Amazon DataZone APIs.

The custom visual transform in the data pipeline makes the complex logic of Python code reusable so that data engineers can encapsulate this module in their own data pipelines to post the data quality results. The transform can be used independently of the source data being analyzed.

Each business unit can use this solution by retaining complete autonomy in defining and applying their own data quality rules tailored to their specific domain. These rules maintain the accuracy and integrity of their data. The prebuilt custom transform acts as a central component for each of these business units, where they can reuse this module in their domain-specific pipelines, thereby simplifying the integration. To post the domain-specific data quality results using a custom visual transform, each business unit can simply reuse the code libraries and configure parameters such as Amazon DataZone domain, role to assume, and name of the table and schema in Amazon DataZone where the data quality results need to be posted.

In the following sections, we walk through the steps to post the AWS Glue Data Quality score and results for your Redshift table to Amazon DataZone.

Prerequisites

To follow along, you should have the following:

The solution uses a custom visual transform to post the data quality scores from AWS Glue Studio. For more information, refer to Create your own reusable visual transforms for AWS Glue Studio.

A custom visual transform lets you define, reuse, and share business-specific ETL logic with your teams. Each business unit can apply their own data quality checks relevant to their domain and reuse the custom visual transform to push the data quality result to Amazon DataZone and integrate the data quality metrics with their data assets. This eliminates the risk of inconsistencies that might arise when writing similar logic in different code bases and helps achieve a faster development cycle and improved efficiency.

For the custom transform to work, you need to upload two files to an Amazon Simple Storage Service (Amazon S3) bucket in the same AWS account where you intend to run AWS Glue. Download the following files:

Copy these downloaded files to your AWS Glue assets S3 bucket in the folder transforms (s3://aws-glue-assets<account id>-<region>/transforms). By default, AWS Glue Studio will read all JSON files from the transforms folder in the same S3 bucket.

customtransform files

In the following sections, we walk you through the steps of building an ETL pipeline for data quality validation using AWS Glue Studio.

Create a new AWS Glue visual ETL job

You can use AWS Glue for Spark to read from and write to tables in Redshift databases. AWS Glue provides built-in support for Amazon Redshift. On the AWS Glue console, choose Author and edit ETL jobs to create a new visual ETL job.

Establish an Amazon Redshift connection

In the job pane, choose Amazon Redshift as the source. For Redshift connection, choose the connection created as prerequisite, then specify the relevant schema and table on which the data quality checks need to be applied.

dqrulesonredshift

Apply data quality rules and validation checks on the source

The next step is to add the Evaluate Data Quality node to your visual job editor. This node allows you to define and apply domain-specific data quality rules relevant to your data. After the rules are defined, you can choose to output the data quality results. The outcomes of these rules can be stored in an Amazon S3 location. You can additionally choose to publish the data quality results to Amazon CloudWatch and set alert notifications based on the thresholds.

Preview data quality results

Choosing the data quality results automatically adds the new node ruleOutcomes. The preview of the data quality results from the ruleOutcomes node is illustrated in the following screenshot. The node outputs the data quality results, including the outcomes of each rule and its failure reason.

previewdqresults

Post the data quality results to Amazon DataZone

The output of the ruleOutcomes node is then passed to the custom visual transform. After both files are uploaded, the AWS Glue Studio visual editor automatically lists the transform as mentioned in post_dq_results_to_datazone.json (in this case, Datazone DQ Result Sink) among the other transforms. Additionally, AWS Glue Studio will parse the JSON definition file to display the transform metadata such as name, description, and list of parameters. In this case, it lists parameters such as the role to assume, domain ID of the Amazon DataZone domain, and table and schema name of the data asset.

Fill in the parameters:

  • Role to assume is optional and can be left empty; it’s only needed when your AWS Glue job runs in an associated account
  • For Domain ID, the ID for your Amazon DataZone domain can be found in the Amazon DataZone portal by choosing the user profile name

datazone page

  • Table name and Schema name are the same ones you used when creating the Redshift source transform
  • Data quality ruleset name is the name you want to give to the ruleset in Amazon DataZone; you could have multiple rulesets for the same table
  • Max results is the maximum number of Amazon DataZone assets you want the script to return in case multiple matches are available for the same table and schema name

Edit the job details and in the job parameters, add the following key-value pair to import the right version of Boto3 containing the latest Amazon DataZone APIs:

--additional-python-modules

boto3>=1.34.105

Finally, save and run the job.

dqrules post datazone

The implementation logic of inserting the data quality values in Amazon DataZone is mentioned in the post Amazon DataZone now integrates with AWS Glue Data Quality and external data quality solutions . In the post_dq_results_to_datazone.py script, we only adapted the code to extract the metadata from the AWS Glue Evaluate Data Quality transform results, and added methods to find the right DataZone asset based on the table information. You can review the code in the script if you are curious.

After the AWS Glue ETL job run is complete, you can navigate to the Amazon DataZone console and confirm that the data quality information is now displayed on the relevant asset page.

Conclusion

In this post, we demonstrated how you can use the power of AWS Glue Data Quality and Amazon DataZone to implement comprehensive data quality monitoring on your Amazon Redshift data assets. By integrating these two services, you can provide data consumers with valuable insights into the quality and reliability of the data, fostering trust and enabling self-service data discovery and more informed decision-making across your organization.

If you’re looking to enhance the data quality of your Amazon Redshift environment and improve data-driven decision-making, we encourage you to explore the integration of AWS Glue Data Quality and Amazon DataZone, and the new preview for OpenLineage-compatible data lineage visualization in Amazon DataZone. For more information and detailed implementation guidance, refer to the following resources:


About the Authors

Fabrizio Napolitano is a Principal Specialist Solutions Architect for DB and Analytics. He has worked in the analytics space for the last 20 years, and has recently and quite by surprise become a Hockey Dad after moving to Canada.

Lakshmi Nair is a Senior Analytics Specialist Solutions Architect at AWS. She specializes in designing advanced analytics systems across industries. She focuses on crafting cloud-based data platforms, enabling real-time streaming, big data processing, and robust data governance.

Varsha Velagapudi is a Senior Technical Product Manager with Amazon DataZone at AWS. She focuses on improving data discovery and curation required for data analytics. She is passionate about simplifying customers’ AI/ML and analytics journey to help them succeed in their day-to-day tasks. Outside of work, she enjoys nature and outdoor activities, reading, and traveling.