Tag Archives: Amazon SageMaker Data & AI Governance

Enforce business glossary classification rules in Amazon SageMaker Catalog

2025-11-20 Ramesh H Singh

Post Syndicated from Ramesh H Singh original https://aws.amazon.com/blogs/big-data/enforce-business-glossary-classification-rules-in-amazon-sagemaker-catalog/

Organizations are scaling their data catalogs faster than ever. Maintaining consistent metadata standards across teams remains a challenge. Business glossaries define the language of the enterprise—terms like Customer Profile, Transaction, or Confidential Data—but assets are often published without these classifications, leading to inconsistent metadata and poor discoverability.

To address this, Amazon SageMaker Catalog now supports metadata enforcement rules for glossary terms classification (tagging) at the asset level. With this capability, administrators can require that assets include specific business terms or classifications. Data producers must apply required glossary terms or classifications before an asset can be published. This enforces metadata consistency across the catalog and makes sure assets carry the business context needed for effective discovery and governance.

This capability builds on existing metadata rule features for enforcing required metadata fields during asset publishing. The new addition extends those rules to cover glossary term validation, strengthening the link between business language and technical data assets.

In this post, we show how to enforce business glossary classification rules in SageMaker Catalog.

Why metadata enforcement matters

A common governance challenge is the lack of standardized tagging and classification for assets entering enterprise catalogs. Without enforcement, data producers might publish assets missing required business terms (such as data sensitivity level or product domain), resulting in inconsistent metadata that confuses business users, unreliable search and filtering results, and manual cleanup and downstream compliance risks.

By automatically validating metadata at publish time, SageMaker Catalog validates metadata when assets are published. This offers the following key benefits:

Assets are classified with approved business terms before publication
Validation supports compliance with internal glossary and classification standards
Consistent tagging enhances search accuracy and reduces noise
Incomplete or incorrectly tagged assets don’t reach consumers

How metadata enforcement works

On the Amazon SageMaker Unified Studio console, administrators navigate to Catalog, Governance, Rules and create metadata rules targeting the asset publishing workflow. Rules can specify required glossary terms or classification fields (for example, Business Unit, PII Category, or Data Sensitivity). Rules can apply organization-wide or within specific domains or projects.

When a producer attempts to publish an asset, SageMaker Catalog checks that the asset includes the required glossary terms or classifications. If any required metadata is missing, the publish action fails with a clear error message. After the metadata is added, the asset can be published successfully.

Enforced tagging makes sure published assets can be searched and filtered using consistent business terminology, improving catalog usability for analysts and business users.

Solution overview

For this post, we explore a financial services use case. Our example a financial services company defines a rule requiring all datasets published from the project to have ‘Finance’ glossary associated:

A data producer attempting to publish a new dataset without this tag receives a validation error
After applying the correct classification, the dataset publishes successfully
Analysts can now filter the catalog to find only Finance datasets or join assets consistently tagged with the same glossary term

In the following sections, we walk through the steps to configure this solution. We create a rule that all assets published from a specific project should have a business unit tag called Finance.

Prerequisites

To test this solution, you should have a SageMaker Unified Studio domain set up with a domain owner or domain unit owner privileges. You should also have an existing project to publish assets and catalog assets. For instructions to create these assets, see the Getting started guide.

In this example, we created a project named financial_analysis and a test table. For instructions to create a table, see Get started with Amazon S3 Tables in Amazon SageMaker Unified Studio. To ingest the sample data to SageMaker Catalog and generate business metadata, see Create an Amazon SageMaker Unified Studio data source for Amazon Redshift in the project catalog.

Create glossary and add terms

Complete the following steps to create a new glossary and add terms:

In SageMaker Unified Studio, on the Discover menu, choose Glossaries.
Choose Create glossary.
Provide details for your glossary, including name, owning project, and optional description.
For Glossary restriction, turn on Enabled.
Choose Create.
Create the term Finance in the Business Unit Details glossary.

Create rule to enforce glossary terms

Complete the following steps to create a rule to define glossary terms:

On the Govern menu, choose Domain units.
On the Rules tab, choose Add.
Add a publishing rule for the Finance project to have the Finance tag for all assets published to the catalog.
Choose Add rule.

The following screenshot shows the configuration details for your new rule.

Publish asset with enforced rules

Complete the following steps to publish your asset with the enforced rules:

On the financial_analysis project page, go to your asset.
In the Glossary terms section, choose Add terms.

If you choose Publish without adding the needed term, you get an error stating the Finance term should be assigned.
Choose Finance to add the required term.
Choose Publish asset.

The following screenshot shows the published asset and the required terms in the glossary.

Conclusion

With metadata enforcement rules for glossary terms, SageMaker Catalog brings stronger control and consistency to how organizations publish and manage their data assets. By requiring approved business classifications before publication, teams can make sure assets adhere to enterprise metadata standards, improving governance, discoverability, and trust in shared catalogs. This capability helps organizations scale their catalog governance without adding manual overhead—embedding compliance and quality directly into the publishing workflow.

Metadata enforcement rules for glossary terms are available in AWS Regions where SageMaker Catalog operates. Get started with this capability, refer to the user guide.

About the Authors

Enhanced data discovery in Amazon SageMaker Catalog with custom metadata forms and rich text documentation

2025-11-20 Ramesh H Singh

Post Syndicated from Ramesh H Singh original https://aws.amazon.com/blogs/big-data/enhanced-data-discovery-in-amazon-sagemaker-catalog-with-custom-metadata-forms-and-rich-text-documentation/

Amazon SageMaker Catalog now supports custom metadata forms and rich text descriptions at the column level, extending existing curation capabilities for business names, descriptions, and glossary term classifications.

With these new features, data stewards can define and capture business-specific metadata directly in individual columns, and authors can use markdown-enabled rich text to provide detailed documentation and business context. Both form fields and formatted descriptions are indexed in real time, making them immediately discoverable through catalog search.

Column-level context is essential for understanding and trusting data. This release helps organizations improve data discoverability, collaboration, and governance by letting metadata stewards document columns using structured and formatted information that aligns with internal standards.

In this post, we show how to enhance data discovery in SageMaker Catalog with custom metadata forms and rich text documentation at the schema level.

Key capabilities

SageMaker Catalog now offers the following key capabilities:

Custom metadata forms – Data stewards can now use custom metadata forms to capture organization-specific metadata fields for columns such as Business Owner, Regulatory Classification, Units of Measure, or Approved Use Case. Each field is stored as a key-value pair and indexed for search, enabling business-level queries like “find columns where sensitivity = confidential.”
Rich text (markdown) descriptions – Each column supports a markdown-enabled description field. Authors can format text with headings, bullet lists, and hyperlinks to add deeper business or operational context—for example, logic definitions, sample values, or data lineage references.
Real-time indexing for search – Custom form values and rich text content are indexed as soon as they are saved. Users can search using a metadata value, keyword, or glossary term across columns.

Solution overview

For this post, we explore a financial services use case. Our example financial services organization defines a column metadata form that includes several fields, as illustrated in the following table.

Field	Example Value
Approved Use Case	Financial revenue modeling
Business Owner	Finance Office
Domain	RF

For a dataset column named revenue, the author adds the following markdown description:

# Business Revenue

- Use for Financial Modeling
- Use only for batch use cases

When analysts search for Domain = RF, this column appears in results with complete business context.

In the following sections, we demonstrate how to use to use metadata forms for columns and add rich text descriptions that is searchable.

Prerequisites

To test this solution, you should have an Amazon SageMaker Unified Studio domain set up with a domain owner or domain unit owner privileges. You should also have an existing project to publish assets and catalog assets. For instructions to create these assets, see the Getting started guide.

In this example, we created a project named financial_analysis and a test table. To create a similar table, see Get started with Amazon S3 Tables in Amazon SageMaker Unified Studio. To ingest the sample data to SageMaker Catalog and generate business metadata, see Create an Amazon SageMaker Unified Studio data source for Amazon Redshift in the project catalog.

Create new metadata form

Complete the following steps to create a new metadata form:

In SageMaker Unified Studio, go to your project.
Under Project catalog in the navigation pane, choose Metadata entities.
Choose Create metadata form.
Provide an optional display name, a technical name, and an optional description, then choose Create metadata form.
Define the form fields. In this example, we add the fields Domain, Business Owner, and Approved Use Case.
For Requirement Options, select the configuration for each field. For our use case, we select Always required.
Choose Create field.
Turn on Enabled so the form is visible and can be used for assets.

Attach metadata form to column

Complete the following steps to attach the metadata form to a column:

Under Project catalog in the navigation pane, choose Assets.
Search for and select your asset (for this example, we use the asset business_finance).
On the Schema tab, choose View/Edit next to the revenue field.
Choose Add metadata form.
Choose the form you created and choose Add.
Add details for the metadata form fields

Add additional context as formatted text

Next, we enter a rich text description for each column using the markdown editor, including headings, bullet lists, links, and sample values. Complete the following steps:

Choose Edit next to README for the revenue field where you added the metadata form.
Enter details and choose Save.
Choose Preview to view the formatted README at the column level.

Publish and verify search

Now you’re ready to publish the asset. The metadata form values and markdown descriptions become part of the catalog record and are indexed for search. You can also see the history of revisions on the History tab. Other project users can see the metadata form and rich text description for the published assets and subscribe to the data asset. You can create more data products with these assets, and they will also have the column metadata form and README.

In the catalog search UI, data users can now filter on custom form fields (for example, “Domain = RF”) or search in natural language for text that matches the column description.

Best practices

Consider the following best practices when using this feature:

Define metadata forms aligned with your business vocabulary (domains, owners, sensitivity levels) proactively before publishing assets at scale.
Make column descriptions actionable—include business definitions, value ranges, logic, update cadence, and dependencies.
Verify the catalog indexing is timely; publish changes proactively so search results reflect new metadata.
Use governance controls. You can combine column-level metadata with existing asset-level templates and approval workflows to enforce publishing standards.
Monitor search usage and metadata completeness; target high-value datasets for complete column-level documentation first.
Do not store confidential or sensitive information in your metadata forms.

Conclusion

With column-level metadata forms and rich text descriptions, SageMaker Catalog helps organizations deliver higher-quality metadata, stronger governance, and better data discovery. These features make it straightforward for teams to capture complete business context and for analysts to quickly locate and understand the data they need.

Custom metadata forms and rich text descriptions at the column level are now available in AWS Regions where SageMaker is supported.

To learn more about SageMaker, see the Amazon SageMaker User Guide. Get started with this capability, refer to the user guide.

About the Authors

Cross-account lakehouse governance with Amazon S3 Tables and SageMaker Catalog

2025-11-19 Sneha Rao

Post Syndicated from Sneha Rao original https://aws.amazon.com/blogs/big-data/cross-account-lakehouse-governance-with-amazon-s3-tables-and-sagemaker-catalog/

Organizations increasingly face challenges when analyzing data stored across multiple AWS accounts and storage formats. Data teams often need to query both traditional Amazon Simple Storage Service (Amazon S3) objects and Apache Iceberg tables, leading to costly data duplication, potential inconsistencies, and complex permission management across accounts.

To address these challenges, you can combine Amazon S3 Tables, which provides native Apache Iceberg support within S3, with Amazon SageMaker Catalog for unified data governance. This solution supports secure cross-account data access without duplicating datasets or compromising security controls.

In this post, we walk you through a practical solution for secure, efficient cross-account data sharing and analysis. You’ll learn how to set up cross-account access to S3 Tables using federated catalogs in Amazon SageMaker, perform unified queries across accounts with Amazon Athena in Amazon SageMaker Unified Studio, and implement fine-grained access controls at the column level using AWS Lake Formation.

This post helps you establish proper governance and security controls for S3 Tables in a multi-account environment, enabling secure and efficient cross-account data access.

Solution overview

We walk you through implementing a three-account lakehouse governance architecture where you can securely share data. As shown in the following diagram, Account A serves as your data producer with S3 Tables, Account B acts as your central governance hub with SageMaker Catalog, and Account C represents your data consumers. We’ll demonstrate step-by-step how to configure cross-account access and implement governance controls so consumers can discover and query data from both S3 tables and traditional S3 buckets.

Prerequisite and Set up

In this post, we focus on how to do the cross account set up and how to onboard S3 Tables. All three accounts are in the same AWS Region. To implement this solution, you will need three individual accounts (A, B, C). The setup in the accounts should look like the following:

Account A (Producer): Create an Amazon S3 Table on the account.
Account B (Central governance and producer): This is another account where you have data in Amazon S3 buckets catalog via Glue Catalog. You would onboard these into domain portal.
- S3 cataloged in Glue Data Catalog.
- Create a SageMaker Domain in Account B.
Account C (Consumer account): Identify an account where you have consumers query data using Athena to follow along.

The following are the high-level implementation steps for this solution:

Step 1: Configure cross-account association for governance.
Step 2: Create three Project Profiles in Account B pointing to tables in Account A, B, and C.
Step 3: Create three Projects.
Step 4: Set up permissions for Projects in AWS Lake Formation.
Step 5: In Account B, create Datasource to connect S3 Table from Account A and Glue Catalog Tables from Account B.
Step 6: Publish and Subscribe to asset.
Step 7: Query S3 table (Account A) and S3 (Account B) data together in SQL editor (Account C).

Step 1

A. Configure cross-account association for governance

In this section, we associate Account A and C in the Governance account B.

Open the SageMaker Unified Studio console in Account B.
Navigate to Domains, select your domain, then choose the Account associations tab.
Choose Request association and enter the Account IDs for Account A and Account C.
Submit the association request and verify the accounts appear with “Requested” status.

B. Enable Blueprints for your domain in Accounts A, B, and C

The LakeHouseDatabase blueprint enables SageMaker Unified Studio to securely manage, query, and share data from S3, Redshift, and other sources using open standards—so in this step, you enable it in Accounts A, B, and C to support unified data access and collaboration.

In Account A, in the SageMaker console, navigate to your domain and select the Blueprints tab.
Select the LakeHouseDatabase blueprint and choose Enable.
Keeping the Permissions and resources section at the default settings, choose Enable Blueprint.
Back on the blueprints screen, select the Tooling blueprint and choose Enable.
Keeping the Permissions and resources section at the default settings, configure the Networking section with the desired VPC and subnet configurations.
Choose Enable Blueprint.
Repeat Step1.B and enable the same blueprints in Account B to make S3 data publishable and Account C so consumers can query the data using Athena.

Step 2: Create Project Profiles in Account B

Use the documentation to create three project profiles in Account B using the ‘LakeHouseDatabase’ Blueprint, with each profile configured for Accounts A, B, and C respectively. For this post, we use the following naming convention:

datalake-project-profile-s3tables (for Account A)
datalake-project-profile (for Account B)
datalake-project-profile-consumer (for Account C)

Step 3: Create three Projects for accounts A, B, and C

Using the documentation, create one Project in each account. For this post, we use the following naming convention:
- ‘producer-s3tables’ – This is configured for Account A
- ‘producer-s3’ – This is configured for Account B
- ‘consumer’ – This is configured for Account C
After creating the Project, locate and make note of the Project role ARN listed under Project details on the project overview page.

Step 4: Set up permissions for Projects in AWS Lake Formation

In Account A, onboard the S3 table in SageMaker Lakehouse and grant permissions to the project role:

In the AWS Lake Formation console, choose Permissions, choose Data permissions, and then choose Grant.
Choose Principals, select IAM users and roles, then select the role generated by the project producer-s3tables in Step 3.
In LF-Tags or catalog resources, choose Named data catalog resources, select the S3 table catalog from the Catalogs list.
In Catalog permissions, configure the Catalog permissions and grantable permissions. Choose Grant to apply the following permissions.

In Account A, we repeat these steps for grant permissions to the database:

In the AWS Lake Formation console, choose Permissions, choose Data permissions, and then choose Grant.
Choose Principals, select IAM users and roles, then select the role generated by the project producer-s3tables in Step 3.
In LF-Tags or catalog resources, choose Named data catalog resources, choose both the S3 table catalog and database from their respective dropdown lists.
Configure database permissions and grantable permissions. Choose Grant to apply the following permissions.

In Account A, repeat these steps for grant permissions to the table in the database:

In the AWS Lake Formation console, choose Permissions, choose Data permissions, and then choose Grant.
Choose Principals, select IAM users and roles, then select the role generated by the project producer-s3tables in Step 3.
In LF-Tags or catalog resources, choose Named data catalog resources, choose both the S3 table catalog, database, and S3 table from their respective dropdown lists.
Configure table permissions and grantable permissions. Choose Grant to apply the following permissions.

Repeat Step 4 in Accounts B to onboard S3 to SageMaker Lakehouse and grant the necessary permissions to the role created by your project for Account B.

Step 5: Create Datasource and onboard S3 Table from Account A and Glue Catalog Tables from Account B

To enable unified access and cross-account analytics with data lineage tracking, you’ll connect your SageMaker Unified Studio project to S3 tables from both accounts:

Navigate to your project in SageMaker Unified Studio, select Data sources under the Project catalog section and choose Create data source.
Enter a name, description, and select AWS Glue as the Data source type. Under Data selection, specify the S3 table catalog name.
In this post, we will keep the Publishing setting and Metadata settings as the default configuration.
Choose the run preference as Run on demand to manually initiate data source runs.
Configure any optional connection settings, such as importing data lineage or setting up data quality options. Review your configuration and create the data source.
Once created, run the data source to import the Glue assets into your project’s inventory.
Add asset filter to restrict consumer access, On the Asset filters tab, choose Add asset filter.
Select Column as the filter type, choose the columns for consumer access, and create the asset filter.
Select the assets created and choose Publish assets to the SageMaker Unified Studio catalog to make them discoverable by other users.
Use the documentation to add Glue catalog as data source for S3.

Step 6: Subscribe to the asset from Consumer account in Account C

In Account C, enable the consumer teams to discover, request, and subscribe to those assets for secure, governed data sharing and collaboration across projects.

In SageMaker Unified Studio, select the consumer project.
Use the Discover menu (top navigation) and go to Catalog.
Browse or search for the published asset (S3 tables from Account A).
Select the desired asset (S3 tables from Account A) and choose Subscribe.
In the subscription pop-up:
1. Choose the target project for asset access.
2. Provide a short justification for the access request.
Submit the subscription request.
Repeat step 6 to enable the consumer (Account C) teams to discover assets in Account B.

Approve or reject a subscription request

In Account A, open the SageMaker Unified Studio portal.
Under Project catalog, Subscription requests, Incoming requests tab locate and view the subscription request.
Review the requester and justification.
Choose the option to approve with row and column filters. For this post, we use the filter that we created earlier.
Repeat step 6 to enable the consumer (Account C) teams to discover assets in Account B.

Step 7: Analyze S3 table and S3 data together in query editor

Account C (consumer) now has full access to the customer data in S3 from Account B, and the daily_sales_by_customer data in S3 tables from Account A with restricted columns. Both datasets contain a common column Customer_id.

To generate combined insights, assets from Account A and Account B can be queried and joined on Customer_id.

In SageMaker Unified Studio (consumer project in Account C), go to the Build section and select Query Editor.

Run the following SQL query to join the assets from Account B and Account A on the common column Customer_id, enabling unified cross-account analytics.

SELECT
    c.c_last_name,
    c.c_first_name,
    d.*
FROM "awsdatacatalog"."glue_db_cqmfkub9co3rqh"."customer" c
JOIN "awsdatacatalog"."glue_db_cqmfkub9co3rqh"."daily_sales_by_customer" d
    ON c.c_customer_id = d.customer_id
LIMIT 10;

This approach allows combining filtered, governed data from multiple accounts into a single query for comprehensive insights.

Clean up

To avoid ongoing charges, clean up the resources created during this walkthrough. Complete these steps in the specified order to facilitate proper resource deletion. You might need to add respective delete permissions for databases, table buckets, and tables if your IAM user or role doesn’t already have them.

Delete any created IAM roles or policies.
Delete all the projects you created in the SageMaker Unified Studio domain.
Delete the SageMaker Unified Studio domain you created.

Conclusion

In this post, we explored how Amazon SageMaker Catalog integrates with S3 Tables to provide comprehensive data governance in cross-account environments. We demonstrated how data publishers can onboard S3 Tables to SageMaker Lakehouse while data consumers can efficiently search, request access, and leverage approved datasets for analytics and AI development.

The integration between SageMaker Catalog, S3 Tables, and AWS AWS Lake Formation creates a unified governance framework that eliminates data silos while maintaining robust security controls. Through automated subscription workflows and fine-grained access permissions, organizations can implement self-service data access without compromising compliance or data quality.

About the authors

Accelerate data governance with custom subscription workflows in Amazon SageMaker

2025-10-24 Nira Jaiswal

Post Syndicated from Nira Jaiswal original https://aws.amazon.com/blogs/big-data/accelerate-data-governance-with-custom-subscription-workflows-in-amazon-sagemaker/

Amazon SageMaker provides a single data and AI development environment to discover and build with your data. This unified platform integrates functionality from existing AWS Analytics and Artificial Intelligence and Machine Learning (AI/ML) services, including Amazon EMR, AWS Glue, Amazon Athena, Amazon Redshift, and Amazon Bedrock.

Organizations need to efficiently manage data assets while maintaining governance controls in their data marketplaces. Although manual approval workflows remain important for sensitive datasets and production systems, there’s an increasing need for automated approval processes with less sensitive datasets. In this post, we show you how to automate subscription request approvals within SageMaker, accelerating data access for data consumers.

Prerequisites

For this walkthrough, you must have the following prerequisites:

An AWS account – If you don’t have an account, you can create one. The account should have permission to do the following:
- Create and manage SageMaker domains
- Create and manage IAM roles
- Create and invoke Lambda functions
SageMaker domain – For instructions to create a domain, refer to Create an Amazon SageMaker Unified Studio domain – quick setup.
A demo project – Create a demo project in your SageMaker domain. For instructions, see Create a project. For this example, we choose All capabilities in the project profile section.
SageMaker domain ID, project ID, and project role ARN – These will be used in later steps to provide permissions for existing datasets and resources, and automatic subscription approval code. To retrieve this information, go to the Project details tab on the project details page on the SageMaker console.
AWS CLI installed – You must have the AWS Command Line Interface (AWS CLI) version 2.11 or later.
Python installed – You must have Python version 3.8 or later.
IAM permissions – Sign in as the user with administrative access

Lambda permissions – Configure the appropriate IAM permissions for the Lambda execution role. The following code is a sample role used for testing this solution. Before implementing this IAM policy in your environment, provide the values for your specific AWS Region and account ID. Adjust them based on the principle of least privilege. To learn more about creating Lambda execution roles, refer to Defining Lambda function permissions with an execution role.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "datazone:ListSubscriptionRequests",
                "datazone:AcceptSubscriptionRequest",
                "datazone:GetSubscriptionRequestDetails",
                "datazone:GetDomain",
                "datazone:ListProjects"
            ],
            "Resource": "<<Domain-ARN>>"
        },
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": "<<Domain-ARN>>",
            "Condition": {
                "StringEquals": {
                    "aws:PrincipalArn": "<<Lambda ARN>>"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": "sns:Publish",
            "Resource": "<<SNS-ARN>>"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/*",
                "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/*:*"
            ]
        }
    ]
}

Solution overview

Understanding the subscription and approval workflow in Amazon SageMaker is important before diving deep into custom workflow solution. After an asset is published to the SageMaker catalog, data consumers can discover assets. When a data consumer discovers assets in SageMaker catalog, they request access to the asset, by submitting a subscription request with business justification and intended use case. The request enters a pending state and notifies the data producer or asset owner for review. The data producer evaluates the request based on governance policies, consumer credentials, and business context. The data producer can accept, reject, or request additional information from the data consumer. Upon acceptance, SageMaker triggers the AcceptSubscriptionRequest event and begins automated access provisioning. After a subscription is accepted, a subscription fulfilment process gets kicked off to facilitate access to the asset, for the data producer. SageMaker integrates deeply with AWS Lake Formation to manage fine-grained permissions. When a subscription is approved, SageMaker automatically calls Lake Formation APIs to grant specific database, table, and column-level permissions to the subscriber’s IAM role. Lake Formation acts as the central permission engine, translating subscription approvals into actual data access rights without manual intervention. The system provisions and updates resource-based policies on data sources. Once the provisioning completes, the data consumer can immediately access subscribed data through query engines like Athena, Redshift, or EMR, with Lake Formation enforcing permissions at query time.

By default, subscription requests to a published asset require manual approval by a data owner. However, Amazon SageMaker supports automatic approval of subscription requests at asset level: when publishing a data asset, you can choose to not require subscription approval. In this case, all incoming subscription requests to that asset are automatically approved. Let’s first outline the step-by-step process for disabling automatic approval at the asset level.

Configure automatic approval at asset level:

To configure automatic approval, data producers can follow the steps below.

Log in to SageMaker Unified Studio portal as data producer. Navigate to Assets and select the target asset
Choose Assets → Pick the asset, which you would like to configure for automatic approval.
On the asset details page, locate Edit Subscription settings in the right pane.
Choose Edit next to Subscription Required
1. Select Not Required in the dialogue box
2. Confirm your selection

Customize SageMaker’s subscription workflow:

While manual approval workflow remains essential for production environments and sensitive data handling, organizations seek to streamline and automate approvals for lower-risk environments and non-sensitive datasets. To achieve this project-level automation, we can enhance SageMaker’s native approval workflow through a custom event-driven solution. This solution leverages AWS’s serverless architecture, combining using AWS Lambda, Amazon EventBridge rules, and Amazon Simple Notification Service (Amazon SNS) to create an automated approval workflow. This customization allows organizations to maintain governance while reducing administrative overhead and accelerating the development cycle in non-critical environments. The event-driven approach ensures real-time processing of approval requests, maintains audit trails, and can be configured to apply different approval rules based on project characteristics and data sensitivity levels.

The custom workflow consists of the following steps:

The data consumer submits a subscription request for a published data asset.
SageMaker detects the request and generates a subscription event, which is automatically sent to EventBridge.
EventBridge triggers the designated Lambda function.
The Lambda function sends an AcceptSubscriptionRequest API call to SageMaker.
The function also sends a notification through Amazon SNS.
AWS Lake Formation processes the approved subscription and updates the relevant access control lists (ACLs) and permission sets.
Lake Formation grants access permissions to the data consumer’s project AWS Identity and Access Management (IAM) role.
The data consumer now has authorized access to the requested data asset and can begin working with the subscribed data.

The following diagram illustrates the high-level architecture of the solution.

Key benefits

This solution uses AWS Lambda and Amazon EventBridge to automate SageMaker subscription requests approvals, delivering the following benefits for organizations and end-users:

Scalability – Automatically handles high volumes of subscription requests
Cost-efficiency – Pay-as-you-go approach with no idle resource costs
Minimal maintenance – Serverless components require no infrastructure management
Flexible triggering – Supports event-driven, scheduled, and manual invocation modes
Audit compliance – Comprehensive logging and traceability through AWS CloudTrail

Step-by-step procedure

This section outlines the detailed process for implementing a custom subscription request approval workflow in Amazon SageMaker

Create Lambda function

Complete the following steps to create your Lambda function:

On the Lambda console, choose Functions in the navigation pane.
Choose Create function.
Select Author from scratch.
For Function name, enter a name for the function.
For Runtime, choose your runtime (for this post, we use Python version 3.9 or later).
Choose Create function.
On the Lambda function page, choose the Configuration tab and then choose Permissions.
Note the execution role to use when configuring the SageMaker project.

Create SNS topic

For this solution, we create SNS topic. Complete the following steps to create the SNS topic for automatic approvals:

On the Amazon SNS console, choose Topics in the navigation pane.
Choose Create topic.
For Type, select Standard.
For Name, enter a name for the topic.
Choose Create topic.
On the SNS topic details page, note the SNS topic Amazon Resource Name (ARN) to use later in the Lambda function.
On Subscription tab, choose Create Subscription.
For Protocol, choose Email.
For Endpoint, enter email address of Data consumers.

Create EventBridge rule

Complete the following steps to create an EventBridge rule to capture subscription request events:

On the EventBridge console, choose Rules in the navigation pane.
Choose Create rule.
For Name, enter a name for the rule.
For Rule type, select Rule with event pattern.
This option enables the automatic subscription approval workflow to be triggered when a subscription request is initiated. Alternatively, you can select Schedule to schedule the rule to trigger on a regular basis. Refer to Creating a rule that runs on a schedule in Amazon EventBridge to learn more.
Choose Next.
For Event source, select AWS events or EventBridge partner events.
For Creation method, select Use pattern form
For Event source, select AWS services
For AWS service, select DataZone.
For Event type, select Subscription Request Created.
Configure your target to route events to both the Lambda function and SNS topic.
Choose Next.
For this post, skip configuring tags and choose Next.
Review the settings and choose Create rule.

Configure automation workflow

Complete the following steps to configure the automation workflow:

On the Lambda console, go to the function you created.
Configure the EventBridge rule to trigger the Lambda function
Configure the destination as SNS topic for event notification.

Configure code in Lambda function

Complete the following steps to configure your Lambda function:

On the Lambda console, go to the function you created.

Add the following code to your function. Provide the domain ID, project ID, and SNS topic ARN that you noted earlier.

import boto3
import json
import logging
import os
from botocore.exceptions import ClientError

# Configure logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    """Lambda function to auto-approve subscription requests in Amazon SageMaker"""
    try:
        # Initialize clients
        datazone_client = boto3.client('datazone')
        sns_client = boto3.client('sns')
        
        # Get configuration from environment variables or use hardcoded values
        domain_id = os.environ.get('DOMAIN_ID', '<domain_id>')
        project_id = os.environ.get('PROJECT_ID', '<project_id>')
        sns_topic_arn = os.environ.get('SNS_TOPIC_ARN', '<sns_topic_arn>')
        
        # Get pending subscription requests
        pending_requests = get_pending_requests(datazone_client, domain_id, project_id)
        
        if not pending_requests:
            logger.info("No pending subscription requests found")
            return
        
        # Process requests
        for request in pending_requests:
            approve_request(datazone_client, sns_client, domain_id, request, sns_topic_arn)
            
    except Exception as e:
        logger.error(f"Error: {str(e)}")

def get_pending_requests(client, domain_id, project_id):
    """Get all pending subscription requests"""
    requests = []
    next_token = None
    
    try:
        while True:
            params = {
                'domainIdentifier': domain_id,
                'status': 'PENDING',
                'approverProjectId': project_id
            }
            
            if next_token:
                params['nextToken'] = next_token
            
            response = client.list_subscription_requests(**params)
            
            if 'items' in response:
                requests.extend(response['items'])
            
            next_token = response.get('nextToken')
            if not next_token:
                break
                
        logger.info(f"Found {len(requests)} pending requests")
        return requests
        
    except ClientError as e:
        logger.error(f"Error listing requests: {e}")
        return []

def approve_request(datazone_client, sns_client, domain_id, request, sns_topic_arn):
    """Approve a subscription request and send notification"""
    request_id = request.get('id')
    if not request_id:
        return
        
    try:
        # Approve the request
        datazone_client.accept_subscription_request(
            domainIdentifier=domain_id,
            identifier=request_id,
            decisionComment="Subscription request is auto-approved by Lambda"
        )
        
        # Send notification
        asset_name = request.get('assetName', 'Unknown asset')
        
        message = f"Your subscription request has been auto-approved by Lambda. You can now access this asset."
        
        sns_client.publish(
            TopicArn=sns_topic_arn,
            Subject=f"Subscription Request is auto-approved by Lambda",
            Message=message
        )
        
        logger.info(f"Approved request {request_id} for {asset_name}")
        
    except Exception as e:
        logger.error(f"Error processing request {request_id}: {e}")

Choose Test to test the Lambda function code. To learn more about testing Lambda code, refer to Testing Lambda functions in the console.
Choose Deploy to deploy the code.

Configure Lambda and project execution roles in SageMaker

Complete the following steps:

In SageMaker Unified Studio, go to your publishing project.
Choose Members in the navigation pane.
Choose Add members.
Add the Lambda execution role and project execution roles as Contributor.

Test the solution

Complete the following steps to test the solution:

In SageMaker Unified Studio, navigate to the data catalog and choose Subscribe on the configured asset to initiate a subscription request.
Choose Subscription requests in the navigation pane to view the outgoing requests and choose the Approved tab to verify automatic approval.
Choose View subscription to confirm the approver appears as the Lambda execution role with “Auto-approved by Lambda” as the reason.
On the CloudTrail console, choose Event history to view the event you created and review the automated approval audit trail.

Clean up

To avoid incurring future charges, clean up the resources you created during this walkthrough. The following steps use the AWS Management Console, but you can also use the AWS CLI.

Delete the SageMaker domain. To use the AWS CLI, run the following commands:

aws sagemaker delete-project --project-name <project-name>
aws datazone delete-domain –identifier <domain_identifier>

Delete the SNS topics. To use the AWS CLI, run the following command:
```
aws sns delete-topic --topic-arn <topic-arn>
```
Delete the Lambda function. To use the AWS CLI, run the following command:
```
aws lambda delete-function --function-name <Lambda function name>
```

Conclusion

Combining an event-driven architecture with SageMaker creates an automated, cost-effective solution for data governance challenges. This serverless approach automatically handles data access requests while maintaining compliance, so organizations can scale efficiently as their data grows. The solution discussed in this post can help data teams access insights faster with minimal operational costs, making it an excellent choice for businesses that need quick, compliant data access while keeping their systems lean and efficient.

To learn more, visit the Amazon SageMaker Unified Studio page.

About the authors

Automate email notifications for governance teams working with Amazon SageMaker Catalog

2025-10-21 Himanshu Sahni

Post Syndicated from Himanshu Sahni original https://aws.amazon.com/blogs/big-data/automate-email-notifications-for-governance-teams-working-with-amazon-sagemaker-catalog/

Amazon SageMaker Catalog simplifies the discovery, governance, and collaboration for data and AI across Data Lakehouse, AI models, and applications. With Amazon SageMaker Catalog, you can securely discover and access approved data and models using semantic search with generative AI–created metadata or could just ask Amazon Q Developer with natural language to find their data.

Large enterprise customers have multiple lines of businesses who produce and consume data using a central SageMaker Data Catalog. Many customers have a central data governance team that is responsible for creating, publishing, and maintaining data governance standards and best practices across the firm. As the customer’s data platform scales, it becomes challenging for the central governance team to maintain the standards across all data producers and consumers. Because of this, many governance teams need to monitor user activity in Amazon SageMaker Catalog to ensure data assets are published according to established organizational governance standards and best practices. In this scenario, there is a need for automation where the central governance teams can be notified when critical events happen in Amazon SageMaker Catalog.

In this post, we show you how to create custom notifications for events occurring in SageMaker Catalog using Amazon EventBridge, AWS Lambda, and Amazon Simple Notification Service (Amazon SNS). You can expand this solution to automatically integrate SageMaker Catalog with in-house enterprise workflow tools like ServiceNow and Helix.

Solution overview

The following solution architecture shows how SageMaker Catalog integrates with other AWS services like AWS IAM Identity Center, Amazon EventBridge, Amazon SQS, AWS Lambda, and Amazon SNS to generate automated notifications to capture critical events in the enterprise catalog.

A SageMaker Catalog user logs into Amazon SageMaker Unified Studio using IAM Identity center. This could be a data scientist, machine learning engineer, or analyst looking for published data sets in the firm. AWS IAM Identity center ensures that only authorized personnel can access the cataloged assets and ML resources.
User performs an activity within SageMaker Catalog. Example user creates a new project or user searches for a data asset and creates a subscription request to access the asset.
User events from SageMaker Catalog are captured in Amazon EventBridge. Amazon EventBridge is a fully managed, serverless event bus service designed to help you build scalable, event-driven applications across AWS, SaaS, and custom applications. Amazon EventBridge provides the ability to filter events and allow users to take action on specific events.The following example event pattern in EventBridge filters DataZone create project events.
```
{
  "source": [
    "aws.datazone"
  ],
  "detail": {
    "eventSource": [
      "datazone.amazonaws.com"
    ],
    "eventName": [
      "CreateProject"
    ]
  }
}
```
Amazon EventBridge sends the filtered events to Amazon SQS. Routing events to an SQS queue improves reliability and durability. Amazon SQS acts as a buffer between Amazon EventBridge and AWS Lambda, decoupling event producers from consumers. This allows your Lambda functions to process messages at their own pace, preventing overload during traffic spikes or when downstream resources are temporarily slow or unavailable. Amazon SQS provides durable, persistent storage for events. If Lambda service is unavailable or throttled, messages remain in the queue until they can be successfully processed, reducing the risk of data loss. There is a Dead Letter Queue (DLQ) attached to the main SQS queue. Attaching a DLQ to SQS ensures that any messages that can’t be processed after multiple attempts are safely captured for inspection and troubleshooting, preventing them from blocking or endlessly circulating in the main queue.
AWS Lambda function reads the messages from SQS queue. Lambda function formats the notification based on your needs.
AWS Lambda publishes the message to Amazon SNS. End users and Central Governance team can subscribe to the SNS topic to receive email alerts when an event happens in SageMaker catalog.
Amazon CloudWatch integrates with AWS Lambda to monitor performance, logs events, and can trigger alarms if anything goes awry, ensuring your workflows run smoothly.

Prerequisites

You need to setup the following prerequisite resources:

An AWS account with a configured Amazon Amazon Virtual Private Cloud (Amazon VPC) and base network.
An existing SageMaker Unified Studio domain (follow instructions on Setting up Amazon SageMaker Unified Studio).
Grant Lambda Access in SageMaker Unified Studio (required for Publishing the assets)
- Add the Lambda execution role as an IAM role in SageMaker Unified Studio.
- Assign the Lambda execution role to your project within the SageMaker Unified Studio portal.

This configuration ensures that Lambda function has the required authorization to access Data Zone resources and successfully publish assets from your SageMaker Unified Studio projects.

Code Deployment

Review the instructions on our GitHub repository to deploy the framework in your AWS account using AWS CDK. The CDK provisions an event-driven notification architecture for Amazon SageMaker Unified Studio, focusing on project creation and asset publishing events.

Core AWS Resources Deployed – The following are the core AWS resourced deployed:

EventBridge Rules
- DataZoneCreateProjectRule: Captures DataZone project creation events (CreateProject).
- DataZonePublishAssetRule: Captures DataZone asset publishing events (CreateListingChangeSet with PUBLISH action for ASSET entity type).
SQS Queue
- DataZoneEventQueue: Buffers DataZone events from EventBridge before processing.
- Queue Policy: Allows EventBridge to send messages to the SQS queue.
Lambda Function
- ProjectNotificationLambda: Processes messages from the SQS queue, retrieves event details from DataZone, and sends notifications to an SNS topic.
  - IAM Role: Grants permissions to access SQS, SNS, CloudWatch Logs, and DataZone services.
  - Event Source Mapping: Triggers the Lambda function for each SQS message.
SNS Topic
- LambdaSNSTopic: Receives notifications from the Lambda function.
  - Email Subscriptions: Two email endpoints are subscribed to receive notifications.
- Add your email ID to the SNS topic. You’ll receive an email to request for subscription, click on ‘Confirm Subscription’
Permissions
- Amazon EventBridge sends events to SQS (requiring SQS permissions), Lambda poll reads messages from Amazon SQS (requiring Lambda role in SQS permissions), and Lambda publishes to Amazon SNS (requiring SNS permissions).
- IAM Policies: Lambda execution role has necessary permissions for SQS, SNS, logging, and Data Zone operations.

Outputs Provided (CloudFormation Output)

Amazon SNS Topic ARN: For notification publishing.
Amazon SQS Queue ARN: For event buffering.
AWS Lambda Function ARN: For event processing.
Amazon EventBridge Rule ARNs: For both asset publishing and project creation events.

Project Creation Notification

Execute the following steps to login to SageMaker Unified Studio and create a project.

Login to SageMaker Unified Studio Console. This takes you to Amazon SageMaker Unified Studio domain login screen (SSO and IAM sign-in options).
Choose Create Project on SageMaker Unified Studio login page.
Choose a project name of your choice, such as ‘My_Demo_Project’. In Project profile, select ‘All-Capabilities’.
Choose Continue. Keep everything as default.
Choose Continue. On next page, create on ‘Create project’.
Project creation final screen
Email Notification. Once project creation is successful, you should see an email notification sent by the above deployed automation.

Asset Publish Notification

To publish a sample asset in SageMaker Unified Studio.

Lambda Permissions
After the CDK Stack creates the Lambda execution role ‘DatazoneStack-LambdaExecutionRole’, use the following procedure to integrate this role into your SageMaker Studio project. This integration enables Lambda functions to interact with DataZone API in SageMaker Unified Studio project.
1. Login to SageMaker Unified studio using SSO, click on Members, Add members.
2. Find the role ‘DatazoneStack-LambdaExecutionRole’ and add as a ‘Contributor’
  
  The LambdaExecutionRole (<cf-stack-name>-LambdaExecutionRole) has been added as a member to a project in SageMaker Unified Studio.
Create Asset
1. In your project ‘My_Demo_Project’, click on Data. Choose the plus sign to add a data set.
2. Upload your CSV file using the sample ‘Product_v6.csv’ found in the checkout folder of the ‘sample-sagemaker-unified-studio-governance-notifications’ GitHub repository.
3. Use table type as S3/external table.
4. Review and confirm that the column/attribute names in the uploaded CSV file.
5. Check the Glue database(glue_db_<unique_id>) to confirm that the table has been created and properly imported
Publish Asset
1. Select the asset, choose Actions and Publish to Catalog.
2. View the published asset below.
3. In the Project Catalog’s Assets section, locate the highlighted entry and verify the published table’s name
4. Choose the asset name to display additional details and properties about the table/asset.
Email Alerts
1. Once the asset is published to SageMaker Unified studio, you’ll receive an email alert sent with details of the published asset. Central governance teams can use this alert to review the published asset to ensure it aligns with the enterprise standards.
  
  Email alerts are sent to notify users when assets have been published

Cleanup

To clean up your resources, complete the following steps:

cdk destroy --profile <PIPELINE-PROFILE>

Conclusion

In this post, you learned how to build an automated notification system for Amazon SageMaker Unified Studio using AWS services. Specifically, we covered:

How to set up event-driven notifications from Amazon SageMaker Unified Studio leveraging Amazon EventBridge, AWS Lambda, and Amazon SNS
The step-by-step process of deploying the solution using AWS CDK
Practical examples of monitoring critical events like project creation and asset publishing
How to integrate AWS Lambda permissions with SageMaker Unified Studio for secure operations
Best practices for implementing governance controls through automated notifications

Amazon SageMaker Catalog helps governance teams stay informed of catalog activities in real-time, enabling them to maintain organizational standards as their Data and ML platforms scale. The architecture is flexible and can be extended to integrate with enterprise workflow tools like ServiceNow or to monitor additional event types based on your organization’s needs.

We look forward to hearing how you adapt this solution for your organization’s governance needs. Fork the CDK code from our repository and share your implementation experience in the comments below

About the Authors

Visualize data lineage using Amazon SageMaker Catalog for Amazon EMR, AWS Glue, and Amazon Redshift

2025-10-13 Shubham Purwar

Post Syndicated from Shubham Purwar original https://aws.amazon.com/blogs/big-data/visualize-data-lineage-using-amazon-sagemaker-catalog-for-amazon-emr-aws-glue-and-amazon-redshift/

Amazon SageMaker offers a comprehensive hub that integrates data, analytics, and AI capabilities, providing a unified experience for users to access and work with their data. Through Amazon SageMaker Unified Studio, a single and unified environment, you can use a wide range of tools and features to support your data and AI development needs, including data processing, SQL analytics, model development, training, inference, and generative AI development. This offering is further enhanced by the integration of Amazon Q and Amazon SageMaker Catalog, which provide an embedded generative AI and governance experience, helping users work efficiently and effectively across the entire data and AI lifecycle, from data preparation to model deployment and monitoring.

With the SageMaker Catalog data lineage feature, you can visually track and understand the flow of your data across different systems and teams, gaining a complete picture of your data assets and how they’re connected. As an OpenLineage-compatible feature, it helps you trace data origins, track transformations, and view cross-organizational data consumption, giving you insights into cataloged assets, subscribers, and external activities. By capturing lineage events from OpenLineage-enabled systems or through APIs, you can gain a deeper understanding of your data’s journey, including activities within SageMaker Catalog and beyond, ultimately driving better data governance, quality, and collaboration across your organization.

Additionally, the SageMaker Catalog data lineage feature versions each event, so you can track changes, visualize historical lineage, and compare transformations over time. This provides valuable insights into data evolution, facilitating troubleshooting, auditing, and data integrity by showing exactly how data assets have evolved, and generates trust in data.

In this post, we discuss the visualization of data lineage in SageMaker Catalog and how capture lineage from different AWS analytics services such as AWS Glue, Amazon Redshift, and Amazon EMR Serverless automatically, and visualize it with SageMaker Unified Studio.

Solution overview

The generation of data lineage in SageMaker Catalog operates through an automated system that captures metadata and relationships between different data artifacts for AWS Glue, Amazon EMR, and Amazon Redshift. When data moves through various AWS services, SageMaker automatically tracks these movements, transformations, and dependencies, creating a detailed map of the data’s journey. This tracking includes information about data sources, transformations, processing steps, and final outputs, providing a complete audit trail of data movement and transformation.

The implementation of data lineage in SageMaker Catalog offers several key benefits:

Compliance and audit support – Organizations can demonstrate compliance with regulatory requirements by showing complete data provenance and transformation history
Impact analysis – Teams can assess the potential impact of changes to data sources or transformations by understanding dependencies and relationships in the data pipeline
Troubleshooting and debugging – When issues arise, the lineage system helps identify the root cause by showing the complete path of data transformation and processing
Data quality management – By tracking transformations and dependencies, organizations can better maintain data quality and understand how data quality issues might propagate through their systems

Lineage capture is automated using several tools in SageMaker Unified Studio. To learn more, refer to Data lineage support matrix.

In the following sections, we show you how to configure your resources and implement the solution. For this post, we create the solution resources in the us-west-2 AWS Region using an AWS CloudFormation template.

Prerequisites

Before getting started, make sure you have the following:

An active AWS account with billing enabled.
An AWS Identity and Access Management (IAM) user with administrator access (AdministratorAccess policy) or specific permissions to create and manage resources such as a virtual private cloud (VPC), subnet, security group, IAM roles, NAT gateway, internet gateway, SageMaker Unified Studio, and Amazon Simple Storage Service (Amazon S3) buckets.
An S3 bucket (for this post, datazone-{account_id}).
Sufficient VPC capacity in your chosen Region.
AWS IAM Identity Center set up. For instructions, refer to Enable IAM Identity Center and Add users to your Identity Center directory.

Configure SageMaker Unified Studio with AWS CloudFormation

The vpc-analytics-lineage-sus.yaml stack creates a VPC, subnet, security group, IAM roles, NAT gateway, internet gateway, Amazon Elastic Compute Cloud (Amazon EC2) client, S3 buckets, SageMaker Unified Studio domain, and SageMaker Unified Studio project. To create the solution resources, complete the following steps:

Launch the stack vpc-analytics-lineage-sus using the CloudFormation template:

Provide the parameter values as listed in the following table.

Parameters	Sample value
DatazoneS3Bucket	s3://datazone-{account_id}/
DomainName	dz-studio
EnvironmentName	sm-unifiedstudio
PrivateSubnet1CIDR	10.192.20.0/24
PrivateSubnet2CIDR	10.192.21.0/24
PrivateSubnet3CIDR	10.192.22.0/24
ProjectName	sidproject
PublicSubnet1CIDR	10.192.10.0/24
PublicSubnet2CIDR	10.192.11.0/24
PublicSubnet3CIDR	10.192.12.0/24
UsersList	analyst
VpcCIDR	10.192.0.0/16

The stack creation process can take approximately 20 minutes to complete. You can check the Outputs tab for the stack after the stack is created.

Next, we prepare source data, setup the AWS Glue ETL Job, Amazon EMR Serverless Spark Job and Amazon Redshift Job to generate the lineage and capture lineage from Amazon SageMaker Unified Studio

Prepare data

The following is example data from our CSV files:

attendance.csv

EmployeeID,Date,ShiftStart,ShiftEnd,Absent,OvertimeHours
E1000,2024-01-01,2024-01-01 08:00:00,2024-01-01 16:22:00,False,3
E1001,2024-01-08,2024-01-08 08:00:00,2024-01-08 16:38:00,False,2
E1002,2024-01-23,2024-01-23 08:00:00,2024-01-23 16:24:00,False,3
E1003,2024-01-09,2024-01-09 10:00:00,2024-01-09 18:31:00,False,0
E1004,2024-01-15,2024-01-15 09:00:00,2024-01-15 17:48:00,False,1

employees.csv

EmployeeID,Name,Department,Role,HireDate,Salary,PerformanceRating,Shift,Location
E1000,Employee_0,Quality Control,Operator,2021-08-08,33002.0,1,Night,Plant C
E1001,Employee_1,Maintenance,Supervisor,2015-12-31,69813.76,5,Evening,Plant B
E1002,Employee_2,Production,Technician,2015-06-18,46753.32,1,Evening,Plant A
E1003,Employee_3,Admin,Supervisor,2020-10-13,52853.4,5,Night,Plant A
E1004,Employee_4,Quality Control,Manager,2023-09-21,55645.27,5,Evening,Plant A

Upload the sample data from attendance.csv and employees.csv to the S3 bucket specified in the previous CloudFormation stack (s3://datazone-{account_id}/csv/).

Ingest employee data in Amazon Relational Database Dervice (Amazon RDS) for MySQL table

On the CloudFormation console, open the stack vpc-analytics-lineage-sus and collect the Amazon RDS for MySQL database endpoint to use in the following commands to create a default employeedb database.

Connect to Amazon EC2 instance with mysql package installation

Run the following command to connect to the database

>MySQL -u admin -h database-1.cuqd06l5efvw.us-west-2.rds.amazonaws.com -p

Run the following command to create an employee table

Use employeedb;

CREATE TABLE employee (
  EmployeeID longtext,
  Name longtext,
  Department longtext,
  Role longtext,
  HireDate longtext,
  Salary longtext,
  PerformanceRating longtext,
  Shift longtext,
  Location longtext
);

Running the following command to insert rows.

INSERT INTO employee (EmployeeID, Name, Department, Role, HireDate, Salary, PerformanceRating, Shift, Location) VALUES ('E1000', 'Employee_0', 'Quality Control', 'Operator', '2021-08-08', 33002.00, 1, 'Night', 'Plant C'), ('E1001', 'Employee_1', 'Maintenance', 'Supervisor', '2015-12-31', 69813.76, 5, 'Evening', 'Plant B'), ('E1002', 'Employee_2', 'Production', 'Technician', '2015-06-18', 46753.32, 1, 'Evening', 'Plant A'), ('E1003', 'Employee_3', 'Admin', 'Supervisor', '2020-10-13', 52853.40, 5, 'Night', 'Plant A'), ('E1004', 'Employee_4', 'Quality Control', 'Manager', '2023-09-21', 55645.27, 5, 'Evening', 'Plant A');

Capture lineage from AWS Glue ETL job and notebook

To demonstrate the lineage, we set up an AWS Glue extract, transform, and load (ETL) job to read the employee data from an Amazon RDS for MySQL table and the employee attendance data from Amazon S3, and join both datasets. Finally, we write the data to Amazon S3 and create the attendance_with_emp1 table in the AWS Glue Data Catalog.

Create and configure AWS Glue job for lineage generation

Complete the following steps to create your AWS Glue ETL job:

On the AWS Glue console, create a new ETL job with AWS Glue version 5.0.
Enable Generate lineage events and provide the domain ID (retrieve from the CloudFormation template output for DataZoneDomainid; it will have the format dzd_xxxxxxxx)

Use the following code snippet in the AWS Glue ETL job script. Provide the S3 bucket (bucketname-{account_id}) used in the preceding CloudFormation stack.

from pyspark.sql import SparkSession
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark import SparkContext
from pyspark.sql import SparkSession
import sys
import logging


spark = SparkSession.builder.appName("lineageglue").enableHiveSupport().getOrCreate()
 
connection_details = glueContext.extract_jdbc_conf(connection_name="connectionname")

employee_df = spark.read.format("jdbc").option("url", "jdbc:MySQL://dbhost:3306/database_name").option("dbtable", "employee").option("user", connection_details['user']).option("password", connection_details['password']).load()

s3_paths = {
'absent_data': 's3://bucketname-{account_id}/csv/attendance.csv'
}
absent_df = spark.read.csv(s3_paths['absent_data'], header=True, inferSchema=True)

joined_df = employee_df.join(absent_df, on="EmployeeID", how="inner")

joined_df.write.mode("overwrite").format("parquet").option("path", "s3://datazone-{account_id}/attendanceparquet/").saveAsTable("gluedbname.tablename")

Choose Run to start the job.
On the Runs tab, confirm the job ran without failure.
After the job has executed successfully, navigate to the SageMaker Unified Studio domain.
Choose Project and under Overview, choose Data Sources.
Select the Data Catalog source (accountid-AwsDataCatalog-glue_db_suffix-default-datasource).
On the Actions dropdown menu, choose Edit.
Under Connection, enable Import data lineage.
In the Data Selection section, under Table Selection Criteria, provide a table name or use * to generate lineage.
Update the data source and choose Run to create an asset called attendance_with_emp1 in SageMaker Catalog.
Navigate to Assets, choose the attendance_with_emp1 asset, and navigate to the LINEAGE section.

The following lineage diagram shows an AWS Glue job that integrates data from two sources: employee information stored in Amazon RDS for MySQL and employee absence records stored in Amazon S3. The AWS Glue job combines these datasets through a join operation, then creates a table in the Data Catalog and registers it as an asset in SageMaker Catalog, making the unified data available for further analysis or machine learning purposes.

Create and configure AWS Glue notebook for lineage generation

Complete the following steps to create the AWS Glue notebook:

On the AWS Glue console, choose Author using an interactive code notebook.
Under Options, choose Start fresh and choose Create notebook.
In the notebook, use the following code to generate lineage.
In the following code, we add the required Spark configuration to generate lineage and then read CSV data from Amazon S3 and write in Parquet format to the Data Catalog table. The Spark configuration includes the following parameters:
- spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener – Registers the OpenLineage listener to capture Spark job execution events and metadata for lineage tracking
- spark.openlineage.transport.type=amazon_datazone_api – Specifies Amazon DataZone as the destination service where the lineage data will be sent and stored
- spark.openlineage.transport.domainId=dzd_xxxxxxx – Defines the unique identifier of your Amazon DataZone domain where the lineage data will be associated
- spark.glue.accountId={account_id} – Specifies the AWS account ID where the AWS Glue job is running for proper resource identification and access
- spark.openlineage.facets.custom_environment_variables – Lists the specific environment variables to capture in the lineage data for context about the AWS and AWS Glue environment
- spark.glue.JOB_NAME=lineagenotebook – Sets a unique identifier name for the AWS Glue job that will appear in lineage tracking and logs
See the following code:
```
%%configure —name project.spark -f
{
"—conf":"spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener \
--conf spark.openlineage.transport.type=amazon_datazone_api \
--conf spark.openlineage.transport.domainId=dzd_xxxxxxxx \
--conf spark.glue.accountId={account_id} \
--conf spark.openlineage.facets.custom_environment_variables=[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;] \
--conf spark.glue.JOB_NAME=lineagenotebook"
}

from pyspark.sql import SparkSession
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark import SparkContext
from pyspark.sql import SparkSession
import sys
import logging


spark = SparkSession.builder.appName("lineagegluenotebook").enableHiveSupport().getOrCreate()

s3_paths = {
'absent_data': 's3://datazone-{account_id}/csv/attendance.csv'
}
absent_df = spark.read.csv(s3_paths['absent_data'], header=True, inferSchema=True)

absent_df.write.mode("overwrite").format("parquet").option("path", "s3://datazone-{account_id}/attendanceparquet2/").saveAsTable("gluedbname.tablename")
```
After the notebook has executed successfully, navigate to the SageMaker Unified Studio domain.
Choose Project and under Overview, choose Data Sources.
Choose the Data Catalog source ({account_id}-AwsDataCatalog-glue_db_suffix-default-datasource).
Choose Run to create the asset attendance_with_empnote in SageMaker Catalog.
Navigate to Assets, choose the attendance_with_empnote asset, and navigate to the LINEAGE section.

The following lineage diagram shows an AWS Glue job that reads data from the employee absence records stored in Amazon S3. The AWS Glue job transform CSV data into Parquet format, then creates a table in the Data Catalog and registers it as an asset in SageMaker Catalog.

Capture lineage from Amazon Redshift

To demonstrate the lineage, we are creating an employee table and an attendance table and join both datasets. Finally, we create a new table called employeewithabsent in Amazon Redshift. Complete the following steps to create and configure lineage for Amazon Redshift tables:

In SageMaker Unified Studio, open your domain.
Under Compute, choose Data warehouse.
Open project.redshift and copy the endpoint name (redshift-serverless-workgroup-xxxxxxx).
On the Amazon Redshift console, open the Query Editor v2, and connect to the Redshift Serverless workgroup with a secret. Use the AWS Secrets Manager option and choose the secret redshift-serverless-namespace-xxxxxxxx.

Use the following code to create tables in Amazon Redshift and load data from Amazon S3 using the COPY command. Make sure the IAM role has GetObject permission on the S3 files attendance.csv and employees.csv.

Create Redshift table absent

CREATE TABLE public.absent (
    employeeid character varying(65535),
    date date,
    shiftstart timestamp without time zone ,
    shiftend timestamp without time zone,
    absent boolean,
    overtimehours integer
);

Load data into absent table.

COPY absent
FROM 's3://datazone-{account_id}/csv/attendance.csv' 
IAM_ROLE 'arn:aws:iam::accountid:role/RedshiftAdmin'
csv
IGNOREHEADER 1;

Create Redshift table employee

CREATE TABLE public.employee (
    employeeid character varying(65535),
    name character varying(65535),
    department character varying(65535),
    role character varying(65535),
    hiredate date,
    salary double precision,
    performancerating integer,
    shift character varying(65535),
    location character varying(65535)
);

Load data into employee table.

COPY employee
FROM 's3://datazone-{account_id}/csv/employees.csv' 
IAM_ROLE 'arn:aws:iam::account-id:role/RedshiftAdmin'
csv
IGNOREHEADER 1;

After the tables are created and the data is loaded, perform the join between the tables and create a new table with a CTAS query:

CREATE TABLE public.employeewithabsent AS
SELECT 
  e.*,
  a.absent,
  a.overtimehours
FROM public.employee e
INNER JOIN public.absent a
ON e.EmployeeID = a.EmployeeID;

Navigate to the SageMaker Unified Studio domain.
Choose Project and under Overview, choose Data Sources.
Select the Amazon Redshift source (RedshiftServerless-default-redshift-datasource).
On the Actions dropdown menu, choose Edit.
Under Connection, Enable Import data lineage.
In the Data Selection section, under Table Selection Criteria, provide a table name or use * to generate lineage.
Update the data source and choose Run to create an asset called employeewithabsent in SageMaker Catalog.
Navigate to Assets, choose the employeewithabsent asset, and navigate to the LINEAGE section.

The following lineage diagram shows joining two redshift tables and creating a new redshift table and registers it as an asset in SageMaker Catalog.

Capture lineage from EMR Serverless job

To demonstrate the lineage, we read employee data from an RDS for MySQL table and an attendance dataset from Amazon Redshift, and join both datasets. Finally, we write the data to Amazon S3 and create the attendance_with_employee table in the Data Catalog. Complete the following steps:

On the Amazon EMR console, choose EMR Serverless in the navigation pane.
To create or manage EMR Serverless applications, you need the EMR Studio UI.
1. If you already have an EMR Studio in the Region where you want to create an application, choose Manage applications to navigate to your EMR Studio, or select the EMR Studio that you want to use.
2. If you don’t have an EMR Studio in the Region where you want to create an application, choose Get started and then choose Create and launch Studio. EMR Serverless creates an EMR Studio for you so you can create and manage applications.
In the Create studio UI that opens in a new tab, enter the name, type, and release version for your application.
Choose Create application.
Create an EMR Spark serverless application with the following configuration:
1. For Type, choose Spark.
2. For Release version, choose emr-7.8.0.
3. For Architecture, choose x86_64.
4. For Application setup options, select Use custom settings.
5. For Interactive endpoint, enable the endpoint for EMR Studio.
6. For Application configuration, use the following configuration:
```
[{
    "Classification": "iceberg-defaults",
    "Properties": {
        "iceberg.enabled": "true"
    }
}]
```
Choose Create and Start application.

After application has started, submit the Spark application to generate lineage events. Copy the following script and upload it to the S3 bucket (s3://datazone-{account_id}/script/). Upload the MySQL-connector-java JAR file to the S3 bucket (s3://datazone-{account_id}/jars/) to read the data from MySQL.

from pyspark.sql import SparkSession
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark import SparkContext
from pyspark.sql import SparkSession
import sys
import logging


spark = SparkSession.builder.appName("lineageglue").enableHiveSupport().getOrCreate()

employee_df = spark.read.format("jdbc").option("driver","com.MySQL.cj.jdbc.Driver").option("url", "jdbc:MySQL://dbhostname:3306/databasename").option("dbtable", "employee").option("user", "admin").option("password", "xxxxxxx").load()

absent_df = spark.read.format("jdbc").option("url", "jdbc:redshift://redshiftserverlessendpoint:5439/dev").option("dbtable", "public.absent").option("user", "admin").option("password", "xxxxxxxxxx").load()

joined_df = employee_df.join(absent_df, on="EmployeeID", how="inner")

joined_df.write.mode("overwrite").format("parquet").option("path", "s3://datazone-{account_id}/emrparquetnew/").saveAsTable("gluedname.tablename")

After you upload the script, use the following command to submit the Spark application. Change the following parameters according to your environment details:

application-id: Provide the Spark application ID you generated.
execution-role-arn: Provide the EMR execution role.
entryPoint: Provide the Spark script S3 path.
domainID: Provide the domain ID (from the CloudFormation template output for DataZoneDomainid: dzd_xxxxxxxx).

accountID: Provide your AWS account ID.

aws emr-serverless start-job-run --application-id 00frv81tsqe0ok0l --execution-role-arn arn:aws:iam::{account_id}:role/service-role/AmazonEMR-ExecutionRole-1717662744320 --name "Spark-Lineage" --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://datazone-{account_id}/script/emrspark2.py",
            "sparkSubmitParameters": "--conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.driver.cores=1 --conf spark.driver.memory=4g --conf spark.executor.instances=2 --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory --conf spark.jars=/usr/share/aws/datazone-openlineage-spark/lib/DataZoneOpenLineageSpark-1.0.jar,s3://datazone-{account_id}/jars/MySQL-connector-java-8.0.20.jar --conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener --conf spark.openlineage.transport.type=amazon_datazone_api --conf spark.openlineage.transport.domainId=dzd_xxxxxxxx --conf spark.glue.accountId={account_id}"
        }
    }'

After the job has executed successfully, navigate to the SageMaker Unified Studio domain.
Choose Project and under Overview, choose Data Sources.
Select the Data Catalog source ({account_id}-AwsDataCatalog-glue_db_xxxxxxxxxx-default-datasource).
On the Actions dropdown menu, choose Edit.
Under Connection, enable Import data lineage.
In the Data Selection section, under Table Selection Criteria, provide a table name or use * to generate lineage.
Update the data source and choose Run to create an asset called attendancewithempnew in SageMaker Catalog.
Navigate to Assets, choose the attendancewithempnew asset, and navigate to the LINEAGE section.

The following lineage diagram shows an AWS Glue job that integrates employee information stored in Amazon RDS for MySQL and employee absence records stored in Amazon Redshift. The AWS Glue job combines these datasets through a join operation, then creates a table in the Data Catalog and registers it as an asset in SageMaker Catalog.

Clean up

To clean up your resources, complete the following steps:

On the AWS Glue console, delete the AWS Glue job.
On the Amazon EMR console, delete the EMR Serverless Spark application and EMR Studio.
On the AWS CloudFormation console, delete the CloudFormation stack vpc-analytics-lineage-sus.

Conclusion

In this post, we showed how data lineage in SageMaker Catalog helps you track and understand the complete lifecycle of your data across various AWS analytics services. This comprehensive tracking system provides visibility into how data flows through different processing stages, transformations, and analytical workflows, making it an essential tool for data governance, compliance, and operational efficiency.

Try out these lineage visualization methods for your own use cases, and share your questions and feedback in the comments section.

About the Authors

Introducing restricted classification terms for governed classification in Amazon SageMaker Catalog

2025-09-08 Ramesh H Singh

Post Syndicated from Ramesh H Singh original https://aws.amazon.com/blogs/big-data/introducing-restricted-classification-terms-for-governed-classification-in-amazon-sagemaker-catalog/

Security and compliance concerns are key considerations when customers across industries rely on Amazon SageMaker Catalog. Customers use SageMaker Catalog to organize, discover, and govern data and machine learning (ML) assets. A common request from domain administrators is the ability to enforce governance controls on certain metadata terms that carry compliance or policy significance. Examples include terms used to classify assets with sensitive data (such as PHI in healthcare or PCI in financial services) or terms used to trigger automatic access grants based on regulatory or organizational policies.

AWS announced restricted classification terms in SageMaker Catalog. This new capability allows domain administrators to define governance-controlled terms and enforce which teams and users are authorized to apply them. Restricted classification terms are designed to allow organizations to set standards for consistent classification of sensitive data, help prevent misuse of regulatory tags, and enable downstream workflows such as automatic access grants across the enterprise.

Restricted classification (glossary) terms

Customers have told us that the flexibility of applying glossary terms in SageMaker Catalog has been valuable for collaboration and scale. At the same time, many enterprises—especially in regulated industries—wanted an additional layer of control for certain classifications. For example, terms like PHI (Protected Health Information) in healthcare or PCI (payment card industry) in financial services should only be applied by authorized personnel, because they carry compliance and policy significance. Customers also asked for a way to enforce these governance policies without adding operational overhead. As catalogs grow to thousands of assets, forms, and columns, validating tens of thousands of terms can create performance and compliance challenges. A solution was needed to combine the openness of cataloging with governance precision for sensitive use cases.With this launch, SageMaker Catalog introduces a restricted classification terms section on each asset:

Business glossary terms (existing): Open tagging, no restrictions.
Restricted glossary terms (new): Only authorized users or groups can apply terms. Unauthorized users can view and filter assets based on these terms but not assign them.

Customer spotlight

As a large-scale organization with diverse data needs, the Business Data Technologies (BDT) team at Amazon manages thousands of assets across business units. Making sure these assets are consistently classified and governed is critical to maintaining compliance and enabling secure data sharing at scale. With restricted classification terms in SageMaker Catalog, the BDT team can now enforce which groups are authorized to apply terms, such as policy-driven classifications for merchants or payment data, while keeping discovery seamless for users.

“Restricted classification terms are instrumental in helping us scale data onboarding and governance across Amazon. By enforcing who can apply policy-related terms in the Amazon SageMaker Catalog, we’re able to accelerate consolidation of data assets across business units without compromising compliance. This facilitates consistent classification, prevents misuse, and allows us to automate downstream access grants—enabling our builders to innovate quickly while maintaining the highest standards of governance.”

– Gerry Moses, Senior Principal Technologist, Business Data Technologies, Amazon

Key benefits

With the introduction of restricted classification terms, customers gain stronger governance controls without losing the flexibility of open cataloging. This capability is designed to provide customers with the following key benefits:

Governance enforcement – Sensitive terms such as PHI or PCI can only be applied by approved users or groups, supporting compliance with organizational and regulatory policies.
Consistency at scale – Helps prevent misclassification across thousands of assets, maintaining a single source of truth for governed terms across domains and projects.
Automatic access workflows – Restricted terms can trigger downstream policies, such as auto-granting access to regulated projects or routing assets to compliance-approved environments.

Sample use case

A pharmaceutical company uses SageMaker Catalog to manage clinical trial data. They define a glossary called Regulated Data Categories with restricted terms like PHI and Genomic Data. Only compliance-approved data stewards are authorized to apply these terms to assets. When applied, the term PHI can automatically trigger policies that restrict access only to approved research groups or environments with HIPAA compliance enabled. This makes sure clinical datasets containing PHI to be consistently tagged and subject to the right access policies, while still discoverable for approved researchers.

A retail bank manages transaction and credit data in its domain catalog. They create a glossary called Data Sensitivity Levels with restricted terms like PCI and Credit Bureau Data. When an authorized risk officer classifies an asset with PCI, SageMaker Catalog can automatically grant access only to members of the bank’s Payments Compliance project. Other users, such as analysts in marketing, can see the classification exists but cannot apply or override it. This approach helps prevent accidental misuse of sensitive financial terms while automating secure access grants aligned with regulatory requirements.

Solution overview

In this section, we will walk through how to create and apply restricted classification terms.

Prerequisites

To follow this post, you should have an Amazon SageMaker Unified Studio domain set up with a domain owner or domain unit owner privileges. You should also have existing projects or permissions to create new projects and business glossaries. For instructions to create them, see the Getting started guide. In this post, we created a project named Clinical Study Trials.

Create a restricted business glossary

In this step, a compliance officer creates a new glossary called Regulated Data Categories and marks it as restricted. Usage grants are given to the Clinical Data Stewardship project.

Log in to your Amazon SageMaker Unified Studio (off-console) portal. Select the project, navigate to Business Glossaries tab and choose Create Glossary.
Enter a name and description for the glossary. Select Restrict this glossary for governed term use and choose Add projects.
Select the projects that should have permissions to tag governed terms to assets. Choose Add policy grant.
Choose Create to create the restricted business glossary.
The Regulated Data Categories business glossary is created and ready to populate.

Add restricted business glossary terms

In this step you will add two terms: PHI and Genomic Data to the glossary.

Choose Create term.
Enter a Name and Description. Turn on Enabled and choose Create term.
Follow the same steps to add the second term and both terms should be available in the glossary.

Apply restricted glossary terms to classify assets

In this step, a data steward will publish a new asset and apply the restricted terms.

Go to the Data Steward project and navigate to the asset where Restricted Terms should be tagged and choose Add terms.
From Regulated Data Categories select PHI and Genomic Data and choose Add terms.
Restricted terms are attached to the asset.

If a project that doesn’t have grants to use restricted term tries to attach restricted terms, you would receive the error Unable to apply restricted terms.

Search and discovery

Data consumers can search for assets and filter by restricted terms filters on the left filters tab (for example, PHI or PCI) to discover governed assets.

Cleanup

If you decide that you no longer need any of the assets first unpublish assets, deleted terms, delete business glossary, delete assets and delete the new projects.

Conclusion

As customers expand their use of SageMaker Catalog, the need for governance becomes clear. From our work with customers in healthcare, life sciences, and financial services, we learned that organizations value the flexibility of open cataloging but need precise controls for terms that carry compliance or policy weight.

Restricted classification terms are designed to bring the best of both worlds: Flexibility for builders to continue tagging and discovering assets, and governance precision to help ensure that sensitive classifications are applied consistently. This capability lays the foundation for future enhancements such as column-level governance and deeper integration with enterprise data governance services. By balancing openness with control, SageMaker Catalog continues to help customers organize, govern, and scale their data and ML assets with confidence.

To learn more and get started, visit the Amazon SageMaker Catalog documentation.

About the authors

Develop and deploy a generative AI application using Amazon SageMaker Unified Studio

2025-08-04 Amit Maindola

Post Syndicated from Amit Maindola original https://aws.amazon.com/blogs/big-data/develop-and-deploy-a-generative-ai-application-using-amazon-sagemaker-unified-studio/

Picture this: You’re a financial analyst starting your Monday morning with a steaming cup of coffee, ready to review your investment portfolio. But instead of manually scouring dozens of news websites, financial reports, and industry analyses, you simply ask your AI assistant: “What global events happened over the weekend that might impact my technology stock holdings?” Within seconds, you receive a comprehensive analysis of relevant news, sentiment scores, and potential investment implications—all powered by a sophisticated generative AI application you built yourself.

This scenario isn’t science fiction; it’s the reality that modern financial professionals can create today. In an era where information moves at the speed of light and industry conditions can shift dramatically overnight, staying informed isn’t just an advantage—it’s essential for survival in competitive financial landscapes. The challenge lies in processing the overwhelming volume of global information that could impact investments while distinguishing reliable insights from noise.

Amazon SageMaker – Develop and scale AI use cases with the broadest set of tools

Luckily for us, technology is making this more straightforward. The next generation of Amazon SageMaker with Amazon SageMaker Unified Studio is a single data and AI development environment where you can find and access the data in your organization and act on it using the best tools across different use cases. SageMaker Unified Studio brings together the functionality and tools from existing AWS analytics and artificial intelligence and machine learning (AI/ML) services, including Amazon EMR , AWS Glue, Amazon Athena, Amazon Redshift , Amazon Bedrock, and Amazon SageMaker AI. From within SageMaker Unified Studio, you can ﬁnd, access, and query data and AI assets across your organization, then work together in projects to securely build and share analytics and AI artifacts, including data, models, and generative AI applications.

With SageMaker Unified Studio, you can efficiently build generative AI applications in a trusted and secure environment using Amazon Bedrock. You can choose from a selection of high-performing foundation models (FMs) and advanced customization capabilities like Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, Amazon Bedrock Agents, and Amazon Bedrock Flows. You can rapidly tailor and deploy generative AI applications and share with the built-in catalog for discovery.

What makes SageMaker Unified Studio particularly powerful for organizations is its integration with Amazon Bedrock Flows to build generative AI workflows, which is changing how organizations think about AI application development.

Amazon Bedrock Flows for generative AI application development

With Amazon Bedrock Flows, you can build and execute complex generative AI workflows without writing code, using an intuitive visual interface that democratizes AI development. This capability is transformative for organizations where speed, accuracy, and adaptability are paramount. It offers the following benefits:

Visual workflow development – Users can design AI applications by dragging and dropping components onto a canvas, making AI logic transparent and modifiable
Business logic flexibility – The service supports complex business logic through conditional branching, multi-path decision trees, and dynamic routing
Democratizing AI development – Business experts can directly contribute to AI application development without requiring extensive technical expertise
Seamless integration – Amazon Bedrock Flows integrates with FMs, knowledge bases, guardrails, and other AWS services
Reduced development complexity – The service handles infrastructure management and scaling through serverless execution and SDK APIs

Solution overview

In this post, we explore a financial use case, in which we want to stay on top of latest global events and determine our investment or financial exposure based on this. We can use a SageMaker Unified Studio flow application to pull in latest news summaries, derive sentiment based on news summary, and determine their effects on my investments. The following diagram illustrates this use case.

In the following sections, we show how to create a new project and build a flow application using a generative AI profile in SageMaker Unified Studio.

Prerequisites

For this walkthrough, you must have the following prerequisites:

An AWS account – If you don’t have an account, you can create one.
A SageMaker Unified Studio domain – For instructions, refer to Create an Amazon SageMaker Unified Studio domain – quick setup, and add a single sign-on (SSO) user for logging and set two-factor authentication. Finally, on the Connections tab, you can create a connection to a code repository.

A demo project – Create a demo project in your SageMaker Unified Studio domain. For instructions, see Create a project. For this example, we choose All capabilities in the project profile section, which includes the generative AI project profile enabled.

Create new project and build a flow application in SageMaker Unified Studio

In this section, we create a new a flow application that uses an Amazon Bedrock knowledge base to provide information about your personal portfolio. Complete the following steps:

In SageMaker Unified Studio, open the project you created as a prerequisite and choose Build and then Flow.

Drag Knowledge Base from Nodes to the design panel to add a knowledge base that will include the user’s investment portfolio and news articles and other information like earnings call transcripts, financial analyst reports, and so on.

Choose the Knowledge Base node and configure the knowledge base as follows:
Add a name for your knowledge base name (for example, portfolio…).
Choose the model (for example, Claude 3.5 Haiku).

Choose Create new Knowledge Base.
Enter a name for the knowledge base.
Select Project data source.
For Select a data source, choose the Amazon Simple Storage Service (Amazon S3) bucket location where you uploaded your data.
Choose Create.

The knowledge base creation process takes a few minutes to complete.

When the knowledge base is ready, choose Save to save it to the flow.

Choose My components, and on the options menu (three vertical dots), choose Sync to sync the knowledge base.

Make sure the S3 bucket has all the data (user portfolio data and latest news information data) before syncing the knowledge base.

We don’t provide any financial or news information data as part of this post. Upload current events or news data and investment portfolio data from your own data sources.

Test the flow application

After the knowledge base sync is complete, you can return to the flow application and ask questions. Using SageMaker Unified Studio flows, a financial analyst can provide a more personalized and customized financial outlook to their customers using rich internal financial information on their customer’s investment portfolio and latest publicly available current events and news information. The following are some example questions that you can ask to test the knowledge base:

Check if Tesla or Apple is in any of user's investment portfolio

Please check latest news information to provide information if Tesla has positive, negative or neutral outlook in the near future

Flow-based applications offer a visual approach to creating complex AI workflows. By chaining different nodes, each optimized for specific functions, you can create sophisticated solutions that are more reliable, maintainable, and efficient than single-prompt approaches. These flows allow for conditional logic and branching paths, mimicking human decision-making processes and enabling more nuanced responses based on context and intermediate results.

Clean up

To avoid ongoing charges in your AWS account, delete the resources you created during this tutorial:

Delete the project.
Delete the domain created as part of the prerequisites.

Conclusion

In this post, we demonstrated how to use Amazon Bedrock Flows in SageMaker Unified Studio to build a sophisticated generative AI application for financial analysis and investment decision-making without extensive coding knowledge. With this integration, you can create sophisticated financial analysis workflows through an intuitive visual interface, where you can process industry data, analyze news sentiment, and assess investment implications in real time. The solution integrates seamlessly with AWS services and FMs while providing essential features like automatic scaling, compliance controls, and audit capabilities. The implementation process involves setting up a SageMaker Unified Studio domain, configuring knowledge bases with portfolio and news data, and creating visual workflows that can analyze complex financial information. This democratized approach to AI development allows both technical and business teams to collaborate effectively, significantly reducing development time while maintaining the sophisticated capabilities needed for modern financial analysis.

To get started, explore the SageMaker Unified Studio documentation, set up a project in your AWS environment, and discover how this solution can transform your organization’s data analytics capabilities.

About the authors

Amit Maindola is a Senior Data Architect focused on data engineering, analytics, and AI/ML at Amazon Web Services. He helps customers in their digital transformation journey and enables them to build highly scalable, robust, and secure cloud-based analytical solutions on AWS to gain timely insights and make critical business decisions.

Arghya Banerjee is a Sr. Solutions Architect at AWS in the San Francisco Bay Area, focused on helping customers adopt and use the AWS Cloud. He is focused on big data, data lakes, streaming and batch analytics services, and generative AI technologies.

Melody Yang is a Principal Analytics Architect for Amazon EMR at AWS. She is an experienced analytics leader working with AWS customers to provide best practice guidance and technical advice in order to assist their success in data transformation. Her areas of interests are open-source frameworks and automation, data engineering and DataOps.

Gaurav Parekh is a Solutions Architect at AWS, specializing in generative AI and data analytics, with extensive experience building production AI systems on AWS.

Automate data lineage in Amazon SageMaker using AWS Glue Crawlers supported data sources

2025-07-30 Mohit Dawar

Post Syndicated from Mohit Dawar original https://aws.amazon.com/blogs/big-data/automate-data-lineage-in-amazon-sagemaker-using-aws-glue-crawlers-supported-data-sources/

The next generation of Amazon SageMaker is the center for all your data, analytics, and AI. Bringing together widely adopted Amazon Web Services (AWS) machine learning (ML) and analytics capabilities, it delivers an integrated experience for analytics and AI with unified access to all your data. From Amazon SageMaker Unified Studio, a single data and AI development environment, you can access your data and use a suite of powerful tools for data processing, SQL analytics, model development, training and inference, and generative AI development.

With data lineage, now part of Amazon SageMaker Catalog, you can centralize lineage metadata of your data assets in a single place. You can track the flow of data over time, determining a clear understanding of where it originated, how it has changed, and its usage across the business. By providing this level of transparency, data lineage helps data consumers gain trust that the data is correct and compliant for their use cases. With data lineage captured at the table, column, and job level, data producers can conduct impact analysis of changes in their data pipelines and respond to data issues when needed, for example, when a column in the resulting dataset is missing the quality required by the business.

Data lineage is a powerful tool that can transform how organizations understand and manage their data flows. In this post, we explore its real-world impact through the lens of an ecommerce company striving to boost their bottom line.

To illustrate this practical application, we walk you through how you can use the prebuilt integration between SageMaker Catalog and AWS Glue crawlers to automatically capture lineage for data assets stored in Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB. Using this workflow, you can capture lineage automatically from additional data sources using AWS Glue crawlers. Refer to the Data lineage support matrix in the SageMaker Unified Studio User Guide for supported sources. We also use SageMaker Unified Studio to navigate these data assets and learn about their origin, transformations, and dependencies, thanks to the lineage metadata captured using the AWS Glue crawlers.

Key features of the SageMaker Catalog lineage graph

In SageMaker Unified Studio, you can explore and discover data assets of your organization suited for your use case. As you dive into these data assets, you can learn more about its business context, schema, quality, and lineage. When you decide to work with a subset of these assets, you can subscribe to them in a self-service fashion and start working with them. For more detail, visit Data discovery, subscription, and consumption in the SageMaker Unified Studio User Guide.

SageMaker Studio provides a visual lineage graph that shows how a data asset has evolved from its source through transformations to its final state. This helps data scientists, engineers, and analysts answer key questions such as:

Where did this data come from?
What transformations has it gone through?
Which downstream assets will be impacted by a change?

With this level of visibility, teams can perform faster impact analysis, find the root cause of data quality issues, and ensure models are built on trusted data. It also supports better collaboration so users can confidently use and share data across the organization. The following screenshot shows how SageMaker Unified Studio visualizes data lineage, making it straightforward to trace data flow and understand dependencies.

Column-level lineage – You can expand column-level lineage when available in dataset nodes. This automatically shows relationships with upstream or downstream dataset nodes if source column information is available.
Column search – If the dataset has more than 10 columns, the node presents pagination to navigate to columns not initially presented. To quickly view a particular column, you can search on the dataset node that lists only the searched column.
Details pane – Each lineage node captures and displays the following details:
- Every dataset node has three tabs: LINEAGE, SCHEMA, and HISTORY. The HISTORY tab lists the different versions of lineage event captured for that node.
- The job node has a details pane to display job details with the tabs Job info and History. The details pane also captures queries or expressions run as part of the job.
View dataset nodes only – If you want to filter out the job nodes, you can choose the open view control icon in the graph viewer and toggle the display dataset nodes only, which will remove all the job nodes from the graph and let you navigate only the dataset nodes.
Version tabs – All lineage nodes in Amazon DataZone data lineage will have versioning, captured as history, based on lineage events captured. You can view lineage at a selected timestamp that opens a new tab on the lineage page to help compare or contrast between the different timestamps.

You can try some of these features as you explore the data assets of this post. To learn more on data lineage in SageMaker, we encourage you to dive deep into the Data lineage in Amazon SageMaker Unified Studio.

Solution overview

Imagine a scenario where an ecommerce company aims to optimize conversion rates and enhance customer experience by gaining deeper insights into the customer journey. They need to connect the dots between user interactions and actual purchases, but with data scattered across multiple sources, where do they begin? This is where data lineage becomes invaluable. To perform their analysis, they need data from two primary sources:

Clickstream data stored in Amazon S3 (in JSON or Parquet format)
Transactional order data stored as items in Amazon DynamoDB

To make these datasets discoverable across the business, you need to:

Create a project in SageMaker Unified Studio that will be used to source and manage the datasets
Enable data lineage capture in the SageMaker Unified Studio project
Set up the resources for this use case, which includes an AWS Glue data source (set up in SageMaker Unified Studio) and AWS Glue crawler (set up in AWS Glue)
Run the AWS Glue crawler to catalog the datasets in AWS Glue Data Catalog
Source the metadata of the data assets into the SageMaker Catalog by running the data source
Use SageMaker Unified Studio to navigate through the lineage of the data assets and visualize their origin
Understand how schema evolution is captured in the data asset’s lineage

Prerequisites

To complete the steps on this post, you need an SageMaker Unified Studio domain already deployed in your AWS account. To get started quickly in a testing environment, we suggest creating your SageMaker domain using the quick setup option as explained in Create an Amazon SageMaker Unified Studio domain – quick setup.

Solution steps

To capture data lineage for AWS Glue tables managed with AWS Glue crawlers using SageMaker Unified Studio, complete the steps in the following sections.

Set up a SageMaker project with SQL capability

In SageMaker Unified Studio, a project profile defines an uber template for projects in your Amazon SageMaker unified domain. By setting up a project with the right tooling (project profile), you will provision resources you can use to work with data, which might include cataloging it in SageMaker, transforming it into new data assets, analyzing it to drive business value, or even use it for ML or AI applications.

To demonstrate data lineage effectively, we use SageMaker SQL analytics project profile for a streamlined setup. Although this profile offers comprehensive data analytics capabilities, we focus specifically on two key components:

AWS Glue database – A lakehouse for storing and managing technical metadata
Data source job – Automatically collects and tracks metadata into SageMaker Catalog

We’ve chosen this profile to bypass complex manual configurations so we can focus on the core concepts of data lineage.

To create a new project in your SageMaker domain using the SQL analytics project profile, follow the steps detailed in SQL analytics project profile. Keep all default configurations when creating the project.

After creating your project in SageMaker Studio, you’ll unlock powerful data lineage capabilities that make tracking and understanding your data flows intuitive. Through the data sourcing feature, you can easily monitor how data moves from source to the AWS Glue database. This visibility becomes particularly valuable when debugging data issues—you can quickly trace data back to its source, understand how changes impact downstream processes, and identify affected analyses or reports. Next, populate the AWS Glue database with sample data to observe these features in action and demonstrate how they can streamline your data operations.

For further guidance on how to access the details of the new SageMaker project, refer to Get project details. After you access the data source details, in the Database name field, take note of the AWS Glue database name associated to the SageMaker project.

Enable data lineage capture in the SageMaker project’s data source

To enable lineage capture, follow these steps:

Expand the Actions menu, then choose Edit data source.
Go to the connections and select Import data lineage to configure lineage capture from the source, as shown in the following screenshot.
Make other changes to the data source fields as desired, then choose Save.

Enabling lineage will make sure the data source job will capture lineage in the next run.

Deploy resources for the use case

Follow these steps:

To deploy the resources required for this post, download the AWS CloudFormation template amazon-datazone-examples in the AWS Samples GitHub repository. Deploy it in your AWS account.

For further guidance on how to deploy a CloudFormation stack, refer to Create a stack from the CloudFormation console. You need to provide a Stack name and the name of the AWS GlueDatabaseName associated to the project of your SageMaker domain, as shown in the following screenshot.

Choose Next.

The template will deploy the following resources:

A S3 bucket with a sample file of clickstream data. The bucket name and location of the file will follow the path pattern s3://ecomm-analytics-<ACCOUNT_ID>-<REGION>/clickstream/<YYYY>/<MM>/<DD>/data.json. The file will contain a sample record with the following structure:

{
    "session_id": "abc123",
    "user_id": "u789",
    "event_type": "product_view",
    "product_id": "prod456",
    "timestamp": "2025-06-04T09:23:12Z"
}

A DynamoDB table with a sample item of order data (transactions). The table will be named OrderTransactionTable. The sample item will have the following structure:

{
    "order_id": "ord789",
    "user_id": "u789",
    "product_id": "prod456",
    "order_total": 79.99,
    "order_timestamp": "2025-06-04T09:27:10Z"
}

An AWS Glue crawler configured to crawl the S3 bucket and DynamoDB table deployed as part of the stack and store the metadata in the AWS Glue database associated to the SageMaker project. You can access the crawler’s details in the AWS console, as shown in the following screenshot.

Two AWS Identity and Access Management (IAM) roles for loading the source data into Amazon S3 and DynamoDB and to use when running the AWS Glue crawler.

Run the AWS Glue crawler

The AWS Glue crawler deployed in the previous step will allow you to capture metadata from the two data sources, Amazon S3 and DynamoDB, and store it in AWS Glue Data Catalog, specifically in the database associated to the SageMaker project. After the metadata is stored, it will be accessible to SageMaker.

Before running the crawler, you need to provide AWS Lake Formation permissions to the IAM role that the AWS Glue crawler will use to interact with your data source and target AWS Glue database. The following command will grant the permissions needed for the crawler to store metadata into the AWS Glue database of the SageMaker project.

To invoke this command, we recommend using AWS CloudShell on the AWS console as explained in AWS CloudShell Concepts. Update the <REGION>, <ACCOUNT_ID> and <GLUE_DATABASE_NAME> placeholders with the right values for your AWS Region, AWS account ID, and name of the AWS Glue database associated to the SageMaker project.

aws lakeformation grant-permissions \
  --region  \
  --principal DataLakePrincipalIdentifier=arn:aws:iam:::role/glue-crawler-role \ 
  --permissions CREATE_TABLE \
  --resource '{ "Database": { "Name": "" } }'

Next, run the AWS Glue Crawler on the AWS console. After the crawler successfully finishes, two new tables, clickstream and ordertransactiontable, will be created in the AWS Glue database associated to the SageMaker project. Refer to Viewing crawler results and details to learn more about AWS Glue crawler results.

Source metadata from the AWS Glue database into SageMaker

To source metadata from data assets in the AWS Glue database, including their lineage, into SageMaker, use the data source that was deployed as part of the SageMaker project creation.

To run the data source, go to the data source details page.
Choose Run. (Data sources can be scheduled to run as well, however, for this demonstration we trigger a manual run).

After the data source run is complete, metadata from both data assets in the AWS Glue database will be imported into the SageMaker domain as the project’s inventory assets. You can find the details of the data source run from within SageMaker Unified Studio, which include:

The data assets from the AWS Glue database that were ingested into SageMaker.
The status of the data lineage import for each data asset, which includes an event ID for traceability. This lineage event ID can be used to debug inconsistencies in the resulting lineage graph. You can use the GetLineageEvent API to retrieve the raw payload of the lineage event.

Visualizing the data lineage graph of the data assets in SageMaker Unified Studio

With SageMaker Unified Studio, you have a single place to manage and discover data assets. When accessing a data asset published in the SageMaker central catalog or in your project’s own inventory, you can dive into the asset’s metadata, which includes its schema, business description, custom metadata forms, quality, lineage, and more. To visualize the lineage graph of each data asset of this post, follow these steps:

In SageMaker Studio, navigate to the Assets section of the SageMaker project details page and choose INVENTORY
Select the asset that you want to explore. You can also access the asset directly from the data source run by selecting the asset name.
To view the lineage graph of the data asset up to its origin, shown in the following screenshots, choose the LINEAGE tab.
- For clickstream table (Sourced from S3)

- For order transactions table (Sourced from DynamoDB)

With lineage, you can now confirm that the data originated from sources such as Amazon S3 and Amazon DynamoDB and understand how it has been transformed along the way. Because of this end-to-end visibility, you can trust the data, make informed decisions, and provide compliance with confidence. The lineage graph captures essential metadata that forms the foundation of lineage tracking.

This includes table schemas, column definitions and their data types.
Column-level lineage becomes particularly powerful in this context. Imagine your clickstream’s AWS Glue table powers an Amazon QuickSight dashboard analyzing customer purchase patterns and notice discrepancies in your revenue reports. With column lineage, you can instantly trace the source of those columns.
This granular visibility not only accelerates debugging but also proves invaluable during schema changes, as we show in the following section by changing the source schema.
The crawler details such as crawlerRunId (present in the source identifier of the lineage node) and crawler start and end times can be used to debug which crawler runs updated the table.

Understanding your data asset’s schema evolution through lineage in SageMaker Unified Studio

Imagine the order transactions source in DynamoDB was updated with new information. Because this source powers an Amazon QuickSight report for the customer using the AWS Glue database table, it’s important for consumers to know what changes in the data pipeline updated the report.

Edit the DynamoDB table item with additional columns to learn how lineage graph can be used to view historical updates:

{
    "order_id": "ord789",
    "user_id": "u789",
    "product_id": "prod456",
    "order_total": 79.99,
    "order_timestamp": "2025-06-04T09:27:10Z",
	"customerSegment": "new-customer",
    "conversionSource": "primeDayEmailCampaign"
}

Enter the OrderTransactionsCrawler Glue crawler again on the AWS console. After completion, you’ll notice that it updated the ordertransactiontable AWS Glue table, as shown in the following screenshot.

Run again the data source associated to the project in SageMaker Unified Studio to import the latest metadata into the SageMaker Catalog. After completion, you’ll notice the data source updated the ordertransactiontable data asset in the SageMaker Catalog, as shown in the following screenshot.

This section explores how lineage can be useful to track the updates.

Navigate to the ordertransactiontable data asset in SageMaker Catalog by selecting it from the data source run and choose the LINEAGE tab, as shown in the following screenshot.

Notice how the new columns are available in the lineage graph. A new crawler run ID is present as the source identifier of the crawler lineage node. The history tab shows multiple crawler runs. You can navigate to check the state of the system during the first run.

Cleanup

After you’re done, we recommend to cleaning up the resources created for this post to avoid unintended charges:

Delete the inventory assets that were cataloged in the SageMaker project’s inventory, as explained in Delete an Amazon SageMaker Unified Studio asset.
Delete the SageMaker project that was created as part of this post, as explained in Delete a project.
Delete the CloudFormation stack that was deployed as part of this post, as explained in Delete a stack from the CloudFormation console.
The S3 bucket created as part of the CloudFormation stack will remain after its deletion because it contains a data file in it. Empty and delete the bucket, as explained in Deleting a general purpose bucket.

Conclusion

In this post, you were able to explore the data lineage capabilities of Amazon SageMaker, specifically when working with AWS Glue crawlers. You learned how you can set up an AWS Glue crawler to infer metadata from data assets in multiple sources such as Amazon S3 and DynamoDB and store it the AWS Glue Data Catalog. You also imported this metadata, including data lineage, into Amazon SageMaker through the data source capability of a SageMaker project. Finally, you explored the resulting lineage graph of data assets in SageMaker Unified Studio and saw some of the functionalities available to understand the origin path of them, understand how columns are transformed, and what impact looks like when performing changes to any step of the pipeline.We encourage you to now test the capabilities you explored in this post with your own data. By following the pattern presented in this post, many customers have been able to achieve governance of their data lake and lakehouse platforms on top of Amazon SageMaker with data lineage and more.

About the authors

Mohit Dawar is a Senior Software Engineer at Amazon Web Services (AWS) working on Amazon DataZone. Over the past 3 years, he has led efforts around the core metadata catalog, generative AI–powered metadata curation, and lineage visualization. He enjoys working on large-scale distributed systems, experimenting with AI to improve user experience, and building tools that make data governance feel effortless. Connect with him on LinkedIn: Mohit Dawar.

Jose Romero is a Senior Solutions Architect for Startups at Amazon Web Services (AWS) based in Austin, TX, US. He is passionate about helping customers architect modern platforms at scale for data, AI, and ML. As a former senior architect in AWS Professional Services, he enjoys building and sharing solutions for common complex problems so that customers can accelerate their cloud journey and adopt best practices. Connect with him on LinkedIn: Jose Romero.

AWS Weekly Roundup: SQS fair queues, CloudWatch generative AI observability, and more (July 28, 2025)

2025-07-28 Micah Walter

Post Syndicated from Micah Walter original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-sqs-fair-queues-cloudwatch-generative-ai-observability-and-more-july-28-2025/

To be honest, I’m still recovering from the AWS Summit in New York, doing my best to level up on launches like Amazon Bedrock AgentCore (Preview) and Amazon Simple Storage Service (S3) Vectors. There’s a lot of new stuff to learn!

Meanwhile, it’s been an exciting week for AWS builders focused on reliability and observability. The standout announcement has to be Amazon SQS fair queues, which tackles one of the most persistent challenges in multi-tenant architectures: the “noisy neighbor” problem. If you’ve ever dealt with one tenant’s message processing overwhelming shared infrastructure and affecting other tenants, you’ll appreciate how this feature enables more balanced message distribution across your applications.

On the AI front, we’re also seeing AWS continue to enhance our observability capabilities with the preview launch of Amazon CloudWatch generative AI observability. This brings AI-powered insights directly into your monitoring workflows, helping you understand infrastructure and application performance patterns in new ways. And for those managing Amazon Connect environments, the addition of AWS CloudFormation for message template attachments makes it easier to programmatically deploy and manage email campaign assets across different environments.

Last week’s launches

Amazon SQS Fair Queues — AWS launched Amazon SQS fair queues to help mitigate the “noisy neighbor” problem in multi-tenant systems, enabling more balanced message processing and improved application resilience across shared infrastructure.
Amazon CloudWatch Generative AI Observability (Preview) — AWS launched a preview of Amazon CloudWatch generative AI observability, enabling users to gain AI-powered insights into their cloud infrastructure and application performance through advanced monitoring and analysis capabilities.
Amazon Connect CloudFormation Support for Message Template Attachments —AWS has expanded the capabilities of Amazon Connect by introducing support for AWS CloudFormation for Outbound Campaign message template attachments, enabling customers to programmatically manage and deploy email campaign attachments across different environments.
Amazon Connect Forecast Editing — Amazon Connect introduces a new forecast editing UI that allows contact center planners to quickly adjust forecasts by percentage or exact values across specific date ranges, queues, and channels for more responsive workforce planning.
Bloom Filters for Amazon ElastiCache — Amazon ElastiCache now supports Bloom filters in version 8.1 for Valkey, offering a space-efficient way to quickly check if an item is in a set with over 98% memory efficiency compared to traditional sets.
Amazon EC2 Skip OS Shutdown Option — AWS has introduced a new option for Amazon EC2 that allows customers to skip the graceful operating system shutdown when stopping or terminating instances, enabling faster application recovery and instance state transitions.
AWS HealthOmics Git Repository Integration — AWS HealthOmics now supports direct Git repository integration for workflow creation, allowing researchers to seamlessly pull workflow definitions from GitHub, GitLab, and Bitbucket repositories while enabling version control and reproducibility.
AWS Organizations Tag Policies Wildcard Support — AWS Organizations now supports a wildcard statement (ALL_SUPPORTED) in Tag Policies, allowing users to apply tagging rules to all supported resource types for a given AWS service in a single line, simplifying policy creation and reducing complexity.

Blogs of note

Beyond IAM Access Keys: Modern Authentication Approaches — AWS recommends moving beyond traditional IAM access keys to more secure authentication methods, reducing risks of credential exposure and unauthorized access by leveraging modern, more robust approaches to identity management.

Upcoming AWS events

AWS re:Invent 2025 (December 1-5, 2025, Las Vegas) — AWS’s flagship annual conference offering collaborative innovation through peer-to-peer learning, expert-led discussions, and invaluable networking opportunities.

AWS Summits — Join free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Register in your nearest city: Mexico City (August 6) and Jakarta (August 7).

AWS Community Days — Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world: Singapore (August 2), Australia (August 15), Adria (September 5), Baltic (September 10), and Aotearoa (September 18).

Introducing MCP Server for Apache Spark History Server for AI-powered debugging and optimization

2025-07-23 Manabu McCloskey

Post Syndicated from Manabu McCloskey original https://aws.amazon.com/blogs/big-data/introducing-mcp-server-for-apache-spark-history-server-for-ai-powered-debugging-and-optimization/

Organizations running Apache Spark workloads, whether on Amazon EMR, AWS Glue, Amazon Elastic Kubernetes Service (Amazon EKS), or self-managed clusters, invest countless engineering hours in performance troubleshooting and optimization. When a critical extract, transform, and load (ETL) pipeline fails or runs slower than expected, engineers end up spending hours navigating through multiple interfaces such as logs or Spark UI, correlating metrics across different systems and manually analyzing execution patterns to identify root causes. Although Spark History Server provides detailed telemetry data, including job execution timelines, stage-level metrics, and resource consumption patterns, accessing and interpreting this wealth of information requires deep expertise in Spark internals and navigating through multiple interconnected web interface tabs.

Today, we’re announcing the open source release of Spark History Server MCP, a specialized Model Context Protocol (MCP) server that transforms this workflow by enabling AI assistants to access and analyze your existing Spark History Server data through natural language interactions. This project, developed collaboratively by AWS open source and Amazon SageMaker Data Processing, turns complex debugging sessions into conversational interactions that deliver faster, more accurate insights without requiring changes to your current Spark infrastructure. You can use this MCP server with your self-managed or AWS managed Spark History Servers to analyze Spark applications running in the cloud or on-premises deployments.

Understanding Spark observability challenge

Apache Spark has become the standard for large-scale data processing, powering critical ETL pipelines, real-time analytics, and machine learning (ML) workloads across thousands of organizations. Building and maintaining Spark applications is, however, still an iterative process, where developers spend significant time testing, optimizing, and troubleshooting their code. Spark application developers focused on data engineering and data integration use cases often encounter significant operational challenges due to a few different reasons:

Complex connectivity and configuration options to a variety of resources with Spark – Although this makes it a popular data processing platform, it often makes it challenging to find the root cause of inefficiencies or failures when Spark configurations aren’t optimally or correctly configured.
Spark’s in-memory processing model and distributed partitioning of datasets across its workers – Although good for parallelism, this often makes it difficult for users to identify inefficiencies. This results in slow application execution or root cause of failures caused by resource exhaustion issues such as out of memory and disk exceptions.
Lazy evaluation of Spark transformations – Although lazy evaluation optimizes performance, it makes it challenging to accurately and quickly identify the application code and logic that caused the failure from the distributed logs and metrics emitted from different executors.

Spark History Server

Spark History Server provides a centralized web interface for monitoring completed Spark applications, serving comprehensive telemetry data including job execution timelines, stage-level metrics, task distribution, executor resource consumption, and SQL query execution plans. Although Spark History Server assists developers for performance debugging, code optimization, and capacity planning, it still has challenges:

Time-intensive manual workflows – Engineers spend hours navigating through the Spark History Server UI, switching between multiple tabs to correlate metrics across jobs, stages, and executors. Engineers must constantly switch between the Spark UI, cluster monitoring tools, code repositories, and documentation to piece together a complete picture of application performance, which often takes days.
Expertise bottlenecks – Effective Spark debugging requires deep understanding of execution plans, memory management, and shuffle operations. This specialized knowledge creates dependencies on senior engineers and limits team productivity.
Reactive problem-solving – Teams typically discover performance issues only after they impact production systems. Manual monitoring approaches don’t scale to proactively identify degradation patterns across hundreds of daily Spark jobs.

How MCP transforms Spark observability

The Model Context Protocol provides a standardized interface for AI agents to access domain-specific data sources. Unlike general-purpose AI assistants operating with limited context, MCP-enabled agents can access technical information about specific systems and provide insights based on actual operational data rather than generic recommendations.With the help of Spark History Server accessible through MCP, instead of manually gathering performance metrics from multiple sources and correlating them to understand application behavior, engineers can engage with AI agents that have direct access to all Spark execution data. These agents can analyze execution patterns, identify performance bottlenecks, and provide optimization recommendations based on actual job characteristics rather than general best practices.

Introduction to Spark History Server MCP

The Spark History Server MCP is a specialized bridge between AI agents and your existing Spark History Server infrastructure. It connects to one or more Spark History Server instances and exposes their data through standardized tools that AI agents can use to retrieve application metrics, job execution details, and performance data.

Importantly, the MCP server functions purely as a data access layer, enabling AI agents such as Amazon Q Developer CLI, Claude desktop, Strands Agents, LlamaIndex, and LangGraph to access and reason about your Spark data. The following diagram shows this flow.

The Spark History Server MCP directly addresses these operational challenges by enabling AI agents to access Spark performance data programmatically. This transforms the debugging experience from manual UI navigation to conversational analysis. Instead of hours in the UI, ask, “Why did job spark-abcd fail?” and receive root cause analysis of the failure. This allows users to use AI agents for expert-level performance analysis and optimization recommendations, without requiring deep Spark expertise.

The MCP server provides comprehensive access to Spark telemetry across multiple granularity levels. Application-level tools retrieve execution summaries, resource utilization patterns, and success rates across job runs. Job and stage analysis tools provide execution timelines, stage dependencies, and task distribution patterns for identifying critical path bottlenecks. Task-level tools expose executor resource consumption patterns and individual operation timings for detailed optimization analysis. SQL-specific tools provide query execution plans, join strategies, and shuffle operation details for analytical workload optimization. You can review the complete set of tools available in the MCP server in the project README.

How to use the MCP server

The MCP is an open standard that enables secure connections between AI applications and data sources. This MCP server implementation supports both Streamable HTTP and STDIO protocols for maximum flexibility.

The MCP server runs as a local service within your infrastructure either on Amazon Elastic Compute Cloud (Amazon EC2) or Amazon EKS, connecting directly to your Spark History Server instances. You maintain complete control over data access, authentication, security, and scalability.

All the tools are available with streamable HTTP and STDIO protocol:

Streamable HTTP – Full advanced tools for LlamaIndex, LangGraph, and programmatic integrations
STDIO mode – Core functionality of Amazon Q CLI and Claude Desktop

For deployment, it supports multiple Spark History Server instances and provides deployments with AWS Glue, Amazon EMR, and Kubernetes.

Quick local setup

To set up Spark History MCP server locally, execute the following commands in your terminal:

git clone 
cd spark-history-server-mcp

# Install Task (if not already installed)
brew install go-task # macOS, see  for others

# Setup and start testing
task install            # Install dependencies
task start-spark-bg     # Start Spark History Server with sample data
task start-mcp-bg       # Start MCP Server
task start-inspector-bg # Start MCP Inspector

# Opens  for interactive testing
# When done, run task stop-all

For comprehensive configuration examples and integration guides, refer to the project README.

Integration with AWS managed services

The Spark History Server MCP integrates seamlessly with AWS managed services, offering enhanced debugging capabilities for Amazon EMR and AWS Glue workloads. This integration adapts to various Spark History Server deployments available across these AWS managed services while providing a consistent, conversational debugging experience:

AWS Glue – Users can use the Spark History Server MCP integration with self-managed Spark History Server on an EC2 instance or launch locally using Docker container. Setting up the integration is straightforward. Follow the step-by-step instructions in the README to configure the MCP server with your preferred Spark History Server deployment. Using this integration, AWS Glue users can analyze AWS Glue ETL job performance regardless of their Spark History Server deployment approach.
Amazon EMR – Integration with Amazon EMR uses the service-managed Persistent UI feature for EMR on Amazon EC2. The MCP server requires only an EMR cluster Amazon Resource Name (ARN) to discover the available Persistent UI on the EMR cluster or automatically configure a new one for cases its missing with token-based authentication. This eliminates the need for manually configuring Spark History Server setup while providing secure access to detailed execution data from EMR Spark applications. Using this integration, data engineers can ask questions about their Spark workloads, such as “Can you get job bottle neck for spark-<emr-applicationId>? ” The MCP responds with detailed analysis of execution patterns, resource utilization differences, and targeted optimization recommendations, so teams can fine-tune their Spark applications for optimal performance across AWS services.

For comprehensive configuration examples and integration details, refer to the AWS Integration Guides.

Looking ahead: The future of AI-assisted Spark optimization

This open-source release establishes the foundation for enhanced AI-powered Spark capabilities. This project establishes the foundation for deeper integration with AWS Glue and Amazon EMR to simplify the debugging and optimization experience for customers using these Spark environments. The Spark History Server MCP is open source under the Apache 2.0 license. We welcome contributions including new tool extensions, integrations, documentation improvements, and deployment experiences.

Get started today

Transform your Spark monitoring and optimization workflow today by providing AI agents with intelligent access to your performance data.

Explore the GitHub repository
Review the comprehensive README for setup and integration instructions
Join discussions and submit issues for enhancements
Contribute new features and deployment patterns

Acknowledgment: A special thanks to everyone who contributed to the development and open-sourcing of the Apache Spark history server MCP: Vaibhav Naik, Akira Ajisaka, Rich Bowen, Savio Dsouza.

About the authors

Manabu McCloskey is a Solutions Architect at Amazon Web Services. He focuses on contributing to open source application delivery tooling and works with AWS strategic customers to design and implement enterprise solutions using AWS resources and open source technologies. His interests include Kubernetes, GitOps, Serverless, and Souls Series.

Vara Bonthu is a Principal Open Source Specialist SA leading Data on EKS and AI on EKS at AWS, driving open source initiatives and helping AWS customers to diverse organizations. He specializes in open source technologies, data analytics, AI/ML, and Kubernetes, with extensive experience in development, DevOps, and architecture. Vara focuses on building highly scalable data and AI/ML solutions on Kubernetes, enabling customers to maximize cutting-edge technology for their data-driven initiatives

Andrew Kim is a Software Development Engineer at AWS Glue, with a deep passion for distributed systems architecture and AI-driven solutions, specializing in intelligent data integration workflows and cutting-edge feature development on Apache Spark. Andrew focuses on re-inventing and simplifying solutions to complex technical problems, and he enjoys creating web apps and producing music in his free time.

Shubham Mehta is a Senior Product Manager at AWS Analytics. He leads generative AI feature development across services such as AWS Glue, Amazon EMR, and Amazon MWAA, using AI/ML to simplify and enhance the experience of data practitioners building data applications on AWS.

Kartik Panjabi is a Software Development Manager on the AWS Glue team. His team builds generative AI features for the Data Integration and distributed system for data integration.

Mohit Saxena is a Senior Software Development Manager on the AWS Data Processing Team (AWS Glue and Amazon EMR). His team focuses on building distributed systems to enable customers with new AI/ML-driven capabilities to efficiently transform petabytes of data across data lakes on Amazon S3, databases and data warehouses on the cloud.

Unifying data insights with Amazon QuickSight and Amazon SageMaker

2025-07-18 Ramon Lopez

Post Syndicated from Ramon Lopez original https://aws.amazon.com/blogs/big-data/unifying-data-insights-with-amazon-quicksight-and-amazon-sagemaker/

Amazon SageMaker has announced an integration with Amazon QuickSight, bringing together data in SageMaker seamlessly with QuickSight capabilities like interactive dashboards, pixel perfect reports and generative business intelligence (BI)—all in a governed and automated manner. With this integration users can go from exploring data in SageMaker to visualizing it in QuickSight with a single click.

“The integration between Amazon SageMaker and Amazon QuickSight will help us streamline how our teams move from data exploration to insights. Our analysts can go from data discovery to building and sharing dashboards through a unified, governed experience. Dashboards are no longer siloed, one-off reports. They’re cataloged, discoverable assets that others can find and access. This has made insight delivery faster, more consistent, and far easier to scale across the business.”

– Lingam Chockalingam, Chief Data Architect, Maryland Department of Human Services – MD THINK

About QuickSight

QuickSight is a cloud-powered BI service that revolutionizes data analysis and visualization. It seamlessly integrates data from various sources, including AWS services, third-party applications, and software as a service (SaaS) platforms into a single, intuitive dashboard. As a fully managed service, QuickSight offers enterprise-grade security, global accessibility, and scalability without the hassle of infrastructure management. Amazon Q in QuickSight transforms access to data insights for the entire organization using generative AI. Using Amazon Q, business analysts can generate dashboards and reports using natural language prompts. With Amazon Q, business users can ask and answer questions of data using data Q&A, get natural language executive summaries of data to see trends and insights, and use the powerful new agentic data analysis experience of scenarios to discover patterns and outliers in data and perform what-if analysis.

About SageMaker

Amazon SageMaker Unified Studio provides a unified, end-to-end experience consisting of data, analytics, and AI capabilities. You can use familiar AWS services for model development, generative AI, data processing, and analytics—all within a single, governed environment. Users can now build, deploy, and execute end-to-end workflows from a single interface. SageMaker is built on the foundations of Amazon DataZone, where it uses domains to categorize and structure the data assets, while offering project-based collaboration features that teams can use to securely share artifacts and work together across various compute services. This experience allows multiple personas to seamlessly collaborate, while operating under appropriate access controls and governance policies.

Dashboard and insight workflows simplified

Today administrators can configure SageMaker projects with QuickSight to streamline the flow of building insights from your data lake. After being set up, the integration automatically creates a restricted folders that provides a governed context to share assets and data sources, pre-configured with secure connections to data lake tables. This serves as the foundation for any project member securely building and sharing insights. When exploring data in your project the integration allows for one-click access to building a dashboard from any table. Behind the scenes, SageMaker creates a QuickSight dataset in the project’s restricted folder that’s accessible only to members within the project. Not only do dashboards you build in QuickSight stay within this folder, they’re also automatically added as assets to your SageMaker project. There, you can add custom metadata, publish to the SageMaker Catalog and share with users or groups in your corporate directory for broader access—all within SageMaker Unified Studio. This keeps your dashboards organized, discoverable, shareable, and governed, making cross-team collaboration and asset reuse straightforward.

Configure SageMaker and QuickSight

To get started with SageMaker and QuickSight integration, you enable the QuickSight blueprint and create project profiles in the AWS Management Console.

Note that both your SageMaker Uniﬁed Studio domain and QuickSight account must be integrated with AWS IAM Identity Center using the same Identity Center instance. Additionally, your QuickSight account must exist in the same AWS account.

Go to the SageMaker console and choose Domain in the navigation pane.
Select the Blueprints tab.
To enable the QuickSight Blueprint, select it from the list, then choose Enable.
On the Enable QuickSight page:
1. 1. 1. For Provisioning role, select your provisioning role.
    2. For QuickSight VPC manager role, select the AmazonSageMakerQuickSightVPC role.
Choose Enable blueprint.
A confirmation message will appear after the blueprint is successfully enabled.
Go back to the Domains page and select the Project profiles tab and then select the SQL analytics project profile.
Choose Add blueprint deployment settings.
Configure the blueprint deployment settings as follows:
- Blueprint deployment settings name: Enter a name for your settings. For this post, we used QuickSight-BDS.
- Blueprint: Select the QuickSight blueprint from the list.
- Other parameters: Adjust these based on your use case. For this post, we kept the default values.
Scroll down and choose Add blueprint deployment settings to save your configuration.
You’ll receive a confirmation message, and you’ll see that the QuickSight Blueprint deployment setting (QuickSight-BDS) has been added to the list.

Create a SageMaker project with QuickSight enabled:

After the QuickSight integration has been set up by the administrator, data consumers such as analysts and data scientists can begin using it in the SageMaker portal by creating a new project.

Go to the SageMaker portal.
Choose Select a project, then, choose Create project.
On the Create project page:
1. Project name: Enter the name of your project. For this post, we’re using KPI-Analysis.
2. Project profile: Select the SQL Analytics project profile.
3. Choose Continue.
Leave the remaining parameters set to their default values and choose Continue.
Review the information displayed, then choose Create project.
You’ll be redirected to the Creating new project page. Wait for the process to complete.
After the project creation process is complete, you’ll be taken to the Project overview page.

Create a data asset to build the analysis

For this post, you’ll use the transactions.csv file, which contains financial transaction data from various departments.
Choose Build in the top-right menu.
Then select Query Editor from the dropdown.
Choose the plus (+) icon
Select Create table, then choose Next.
On the Set table properties page:
1. Upload file: Upload the transactions.csv file.
2. Table type: Select S3/external table.
3. Leave the remaining parameters at the default values.
4. Choose Next.
On the Preview schema page, verify that the schema matches the expected structure, then choose Create table.
The Transactions table has now been successfully created.

Create a dashboard using QuickSight

Choose the KPI-Analysis project, then choose Data.
On the Data page: Select the Transactions table, choose Actions, then select Open in QuickSight.
This step redirects you to the QuickSight UI, specifically to the transactions dataset page.
Choose USE IN ANALYSIS to begin exploring the data.
Choose a folder to save your new analysis—for this post, we selected the Assets folder.
Choose Add to save the analysis.
On the New sheet page, leave all parameters at the default values, then choose CREATE.
You’ll now be taken to the Analysis page. In this example, you analyze credit card spending at gas stations, focusing on identifying the most popular fuel type among your cardholders. The goal is to use this insight to design targeted promotions.
Under Visuals, select Pie chart.
Under GROUP/COLOR, select fuel_type.
Under Value, select amount[Sum].
You will see that credit card holders of AWSome-Bank prefer the Premium fuel type.
Publish this new dashboard to the enterprise data catalog. To do that, choose PUBLISH located in the top right corner.
On the Publish Dashboard page:
1. Enter a name for the dashboard. For this post, we’re using gas_consumption_analysis.
2. Leave the remaining parameters set to their default values.
3. Choose PUBLISH DASHBOARD.

Documenting and publishing a QuickSight asset

After the dashboard is created, it’s automatically added to the SageMaker project. From there, analysts or BI engineers can enrich it with business metadata, make it discoverable across the organization, and share it with other users or groups in their corporate directory.

Go back to the Amazon SageMaker portal
Select the Assets tab.
On the Inventory tab, select the gas_consumption_analysis asset.
This will take you to the main asset page, where you can add business metadata, view the lineage diagram, and review the asset history.
For this post, you will only add a README section.
Choose CREATE README to get started.
Add a description for the asset. For this POST, we used the following:

Overview
This Amazon QuickSight dashboard provides insights into the fuel type preferences of a bank’s credit card holders. It helps business stakeholders and analysts understand customer behavior at fuel stations, supporting data-driven marketing strategies and product personalization.
Purpose
The goal of this dashboard is to:
Analyze which fuel types (for example, Regular, Premium, Diesel, Electric) are most frequently purchased using the bank’s credit cards.
Identify customer segments (for example, age groups, locations, income brackets) that prefer specific fuel types.
Understand transaction patterns such as frequency, average spend per fuel type, and purchase timeframes.

Choose SAVE README to save the description.
On this page, you can also add glossary terms and metadata forms to provide additional business context to the asset. For this post, leave these fields empty.
Now you’re ready to publish the QuickSight asset to the enterprise data catalog. To do this, choose PUBLISH ASSET.
A confirmation prompt will appear. Choose PUBLISH ASSET again to complete the publishing process.

Search for a QuickSight asset

For this post, we created a second project called Marketing, but you can use any other project within your domain or even reuse the one created in the earlier steps.
Navigate to the SageMaker home page.
In the catalog search field, enter gas to find the published asset.
Select the relevant result for the published asset from the search results.
This will take you to the asset’s main page, where you can view the metadata added by the producer.

Sharing a QuickSight asset

You can share the QuickSight dashboard with users and groups in your organization directly from within SageMaker.

Go back to the KPI-Analysis project.
Choose the Data tab.
Then, select Assets from the Project catalog.
Go to the PUBLISHED tab, then select the gas_consumption_analysis asset.
Choose Actions, then select Share.
You can share the asset with individual SSO users or with groups. For this post, we selected an SSO group named quicksight-users, but you can choose any user or group you have previously created.
Choose Share.
A confirmation message will appear after the asset has been successfully shared.

Clean up

When you’re done with these exercises, complete the following steps to delete your resources to avoid incurring costs:

Delete the QuickSight assets that you created.
1. If QuickSight is enabled solely for testing, make sure to cancel the QuickSight account.
Delete the project created in SageMaker.
1. If SageMaker is enabled solely for testing, make sure to cancel the SageMaker account.

Conclusion

This post walked through the complete process of integrating Amazon QuickSight with Amazon SageMaker Unified Studio, demonstrating how teams can move from raw data to published dashboards in a secure and governed environment. By combining the advanced analytics capabilities of QuickSight with the collaborative project-based structure of SageMaker, organizations can accelerate insight delivery while maintaining clear control over data access and governance.

The integration simplifies creating datasets directly from Amazon Athena or Amazon Redshift tables, enrich them with business metadata, and publish dashboards to the SageMaker Catalog. When published, these dashboards can be shared with users or groups across the organization, making insights both discoverable and actionable.

With the added power of Amazon Q in QuickSight and generative BI, users can ask questions in plain English and receive real-time visualizations and insights. This makes data exploration intuitive and inclusive, empowering more users to make informed decisions. Combined with the unified analytics and AI environment of SageMaker Unified Studio, this solution supports secure, scalable, and collaborative data-driven innovation.

About the authors

Ramon Lopez is a Principal Solutions Architect for Amazon QuickSight. With many years of experience building BI solutions and a background in accounting, he loves working with customers, creating solutions, and making world-class services. When not working, he prefers to be outdoors in the ocean or up on a mountain.

Leonardo Gomez is a Principal Analytics Specialist Solutions Architect at AWS. He has over a decade of experience in data management, helping customers around the globe address their business and technical needs. Connect with him on LinkedIn.

Unifying metadata governance across Amazon SageMaker and Collibra

2025-07-16 Vasiliki Nikolopoulou

Post Syndicated from Vasiliki Nikolopoulou original https://aws.amazon.com/blogs/big-data/unifying-metadata-governance-across-amazon-sagemaker-and-collibra/

This post was co-written with Vasiliki Nikolopoulou from Collibra.

Managing metadata across tools and teams is a growing challenge for organizations building modern data and AI platforms. As data volumes grow and generative AI becomes more central to business strategy, teams need a consistent way to define, discover, and govern their datasets, features, and models.

Collibra is a widely adopted data intelligence platform that helps organizations centralize governance workflows, define business glossaries, and enforce policies across data assets. Teams use Collibra to curate business context, classify sensitive data, and manage access to information in line with compliance requirements.

Amazon SageMaker Catalog, part of the next generation of Amazon SageMaker, provides a unified environment where users can register, search, and govern AI and data assets. It allows organizations to organize datasets, trained models, features, and pipelines and apply metadata such as business terms, classifications, ownership, and usage context. Amazon SageMaker Catalog is designed to support collaboration across roles, including data scientists, engineers, and business stakeholders.

As organizations scale their data and AI initiatives, ensuring consistency and trust in metadata becomes increasingly important. Teams need a unified way to manage glossary terms, asset descriptions, classifications, and access governance across platforms. Without this consistency, it becomes difficult to enforce standards, support compliance, and drive collaboration across teams building and consuming data.

To address this challenge, Amazon Web Services (AWS) and Collibra have built a new integrated solution that demonstrates the integration between the Collibra Platform and the next generation of Amazon SageMaker. Developed collaboratively by both companies, the solution is based on the APIs of both products and is designed to help customers explore what’s possible through hands-on testing. It provides a practical example of how metadata synchronization between Collibra and SageMaker can be accomplished in real-world scenarios. With this integration, you can align business and technical metadata across both platforms, so you can extend your governance workflows to AI and analytics assets managed in Amazon SageMaker.

This solution allows metadata to remain consistent across both platforms, regardless of where it was created. It helps reduce duplication, improve metadata quality, and ensure that business context travels with data and AI assets throughout their lifecycle. The integration supports metadata synchronization, glossary term mapping, and access approval workflows using native APIs and automation.

In this post, we take a closer look at the integration, describe the use cases it enables, walk through the architecture, and show how to implement the solution in your environment.

Solution overview

The integration between Amazon SageMaker Catalog and Collibra offers automated, bidirectional metadata synchronization and access governance across both platforms. It’s built using the built-in APIs of Amazon SageMaker and Collibra Data Governance Center (DGC) to provide a scalable and configurable mechanism for metadata exchange. The solution consists of two main capabilities: metadata synchronization and access subscription workflow integration. The following diagram illustrates the solution architecture.

Metadata synchronization

Many organizations manage business and technical metadata across multiple systems. Without synchronization, glossary terms, asset descriptions, and classifications can become inconsistent, leading to duplicated work and misalignment across teams.

This integration allows metadata to flow between Amazon SageMaker Catalog and Collibra, regardless of where it was created. Key elements such as glossary terms, their hierarchy, associated descriptions, and relationships to assets like datasets or columns are automatically synchronized between platforms.

The solution supports:

Bidirectional synchronization of glossary terms and descriptions
Preservation of glossary structure, including parent-child relationships
Association of terms with data assets such as datasets, tables, and columns
Synchronization of additional business metadata, such as classifications and data categories
Alignment of technical descriptions for datasets and columns between systems

By keeping metadata consistent, the integration reduces manual work, avoids duplication, and provides users in both platforms with the same trusted context.

Subscription and approval flow

Organizations that rely on Collibra for access governance can now extend those workflows to assets cataloged in Amazon SageMaker. After metadata is synchronized, users can discover and request access to datasets directly from within Collibra, using familiar approval processes.

This integration connects Collibra’s workflow engine with the access control mechanism offered by Amazon SageMaker. When an asset is registered in Amazon SageMaker and shared into Collibra, users can initiate a subscription request in Collibra. When it’s approved, access is granted using Amazon the SageMaker built-in access management, which supports multiple AWS services such as AWS Glue and Amazon Redshift.

Key capabilities include:

Discovery and access request initiation from Collibra or Amazon SageMaker
Centralized review and approval processes managed within Collibra
Access provisioning using the Amazon SageMaker grant mechanism
Consistent metadata and asset context available throughout the request lifecycle

This flow helps streamline the experience for both business and technical users while keeping access to governed data traceable, auditable, and aligned with organizational policies.

Prerequisites

To perform the solution, you need the following prerequisites:

An Amazon Simple Storage Service (Amazon S3) bucket. For information on using buckets, refer to Getting started with Amazon S3.
An AWS Secrets Manager secret containing Collibra configuration and credentials
A Collibra Edge site
An Amazon SageMaker domain
Two Amazon SageMaker projects (one consumer and one producer)
Amazon Redshift or AWS Glue databases with tables

Walkthrough

The next section provides a walkthrough that shows how the integration works from start to finish. It highlights how a user discovers a data asset, submits a subscription request, and how that request is reviewed, approved, and fulfilled. Throughout the process, metadata and governance policies remain aligned between Collibra and Amazon SageMaker Catalog. This example helps illustrate what the integration enables and how it fits into a typical data access workflow.

Setup on the Collibra environment

To enable this solution, some initial setup is needed in the Collibra environment. This involves configuring the key components that users will need to discover, request, and manage access to data. The following steps outline the basic setup required to support the overall experience.

Operating Model changes and import workflows in Collibra

The operating model of the Collibra instance needs two new asset types and attribute types as well as two new relations and statuses for the scripts and workflows to work properly. These new asset types are recommended because Amazon SageMaker introduces its own concepts and architecture, such as domains and projects. Using the same names in Collibra makes it easier for users to understand and navigate both systems consistently. In the following diagram, the new asset types are shown with dotted lines along with the corresponding new relations, attributes, and statuses.

In addition to AWS projects, the implementation requires synchronization of AWS users beyond the standard capabilities. This is necessary because in AWS, a user can’t subscribe to an asset directly as an individual. They can only do so as a member of a project. As a result, when a user subscribes to an asset, they must specify which project they’re subscribing through. To support this behavior, membership to projects information for AWS users needs to be maintained and synchronized within Collibra. AWS project to user mapping needs to be maintained in Collibra, which is accessed by administrative users. The metadata information about AWS user membership to projects can be kept in a Collibra environment or community, which isn’t accessible to anyone except authorized users. Steps for implementation of Collibra operating model changes:

Go to Settings, then Operating model, and add two new asset types, AWS Project and AWS User.
In Settings, navigate to Attribute types and add the new attribute types. The new attribute types are: Project id assigned to the AWS Project asset type, Membership to Project assigned to the AWS User, AWS Project id, Consuming Project and Consuming Project Id to be assigned to the existing asset type Data Usage. Refer to the documentation for more details on how to add new attribute types and how to assign them to asset types
In Settings, go to Relation types and add the Asset to be used relation between asset types data usage and data asset. Refer to the documentation for guidance on how to add a new relation to a pair of asset types.
In Settings, go to Statuses and add the new statuses, which are Access granted and Pending, to be assigned to the asset type data usage.
Go back to the Operating model and, for each new asset type, add the newly created relations, attributes, and statuses. Don’t skip this step. If it isn’t completed, the new configurations will won’t take effect.
Create the following domains:
1. AWS Users – This is a business asset domain where the metadata for AWS user memberships will be stored. Users and their memberships are automatically imported into Collibra through the solution. An example is shown in the screenshot.
2. AWS Projects – This is also a business asset domain where AWS projects and their metadata will be automatically imported. The following screenshot shows an example of such a domain. The AWS projects, along with their published assets, are brought into Collibra through the solution.
3. AWS Subscription Requests – This is a domain of type data usage registry. It will hold all new AWS subscription requests along with their context, such as the consuming project and the subscribed data asset. The status of each request is especially important because it drives the integration workflow that users can use to track the current state of their request.

Workflows installation

This solution includes two workflows: one for managing subscription request approvals and another for notifying users when access is granted.

The first workflow handles the full subscription process. It begins by prompting the user to select the consuming project because only projects the user is a member of are eligible for subscriptions. After it’s selected, a new subscription request asset is created in Collibra with a timestamp, the consuming project details, and a status set to Pending.

An approval task is then assigned to the business steward of the requested data asset. If the steward approves the request, the status changes to Approved. This triggers a notification to the requester and signals the AWS solution to pick up the request and grant access. When access is granted, the status is updated to Access granted.

If the steward rejects the request, the status is changed to Rejected and the requester is notified. No further action is taken in that case.

The second workflow notifies the requester that the access was granted. It’s triggered by the functions in AWS when the subscription grant is completed. The steps to deploy the two workflows are as follows:

Go to Settings, then select Workflows followed by Definitions, as shown in the following screenshot.

Choose Upload a file, as shown in the following screenshot. Then, upload both workflow files from the GitHub directory where all the files are provided. In that GitHub directory, there is a directory with the workflow files called Workflows.

After the workflows are uploaded, complete the following steps for each one, as shown in the following screenshot:
1. Enable the workflow by choosing Play. When enabled, the button will display a Pause icon.
2. Under Rules, set it to apply to Assets, then choose Add Rules and choose Asset: Table. You can also use Data Asset for a broader scope, but in this case, published assets in AWS are tables.
3. Clear This workflow can only run once at the same time on a specific resource. This provides that multiple users can request subscriptions to the same asset simultaneously.

The workflows are now uploaded, enabled, and ready for use.

Add responsibilities

We need to assign business stewards to the ingested AWS assets so that when the workflows are triggered, there is a designated user responsible for approving subscription requests. In this version of the solution, it’s assumed that each asset has only one Business Steward.

To add a Business Steward, follow these steps:

In the domain or community where the AWS data assets have been ingested using the Edge integration, choose Responsibilities. Then choose Add, as shown in the following screenshot

Choose Business Steward from the Role dropdown list, as shown in the following screenshot. From the Users or groups dropdown list, choose the user who will be responsible for approving subscription requests for these assets. This solution allows only one business steward per asset. You can assign a business steward at the community level, and this way this role will be inherited to all assets under this community.

Choose Add, as shown in the following screenshot. This will assign the selected user to the Business Steward role for the specified asset, domain, or community of assets.

Setup on the AWS environment

Now that the configuration on the Collibra side is complete, set up the Amazon SageMaker domain that is used for this walkthrough. We provide the following assets to help users set up this solution

An AWS CloudFormation template in YAML format, called template.yaml
Instructions to generate a lambda zip file that contains all the scripts that the Cloud Formation will run, called lambda_build.zip
Instructions to create a secret using AWS Secrets Manager that will store Collibra credentials.

Create the CloudFormation stack

To support this solution, provision a set of AWS resources that facilitate communication between environments and automate key tasks. In this section, we show how to deploy the foundational infrastructure using AWS CloudFormation, which simplifies resource provisioning and provides consistency across environments.

On the AWS Management Console, navigate to CloudFormation and choose Create stack, then choose With new resources (standard), as shown in the following screenshot.

Choose the provided CloudFormation template and choose Next.

Enter a name for the stack and complete all required parameters below:

CollibraAwsProjectAttributeTypeId – The attribute type ID for AWS projects in Collibra.
CollibraAwsProjectDomainId – The domain ID for AWS projects in Collibra.
CollibraAwsProjectToAssetRelationTypeId – The relation type ID between AWS projects and assets in Collibra.
CollibraAwsProjectTypeId – The type ID for AWS projects in Collibra.
CollibraAwsUserDomainId – The domain ID for AWS users in Collibra.
CollibraAwsUserProjectAttributeTypeId – The attribute type ID for AWS user projects in Collibra.
CollibraAwsUserTypeId – The type ID for AWS users in Collibra.
CollibraConfigSecretsName – The name of the AWS Secrets Manager secret containing Collibra configuration and credentials.
SMUSProducerProjectId – The project ID in SMUS that contains the data assets to be shared (producer side).
SMUSConsumerProjectId – The project ID in SMUS where shared data assets will be accessed (consumer side).
SMUSDomainId – The unique identifier for the SageMaker Unified Studio (SMUS) domain.
CollibraSubscriptionRequestCreationWorkflowId – The unique identifier for the Collibra workflow that creates subscription requests in Collibra.
CollibraSubscriptionRequestApprovalWorkflowId – The unique identifier for the Collibra workflow that approves subscription requests in Collibra.
LambdaCodeS3Bucket – The S3 bucket containing the Lambda function deployment package.
LambdaCodeS3Key – The S3 key (path and filename) of the Lambda function deployment package within the specified bucket.

Select the acknowledgement checkbox, then choose Next, as shown in the following screenshot.

Choose Submit to start the stack deployment. When the process is complete, the stack status will update to CREATE_COMPLETE.

Configure consumer and producer projects

For this post, only two projects are used: one serving as the producer and one as the consumer. Future versions of the solution are planned to support all projects.

On the AWS Management Console, go to the SMUS Domain detail page. Under the Users section, choose Add, then select Add IAM users.

From the dropdown, select the SMUSCollibraIntegrationAdminRole created by the CloudFormation template, then choose Add user(s), as shown in the following screenshot.

Open the Unified Studio portal for this domain and navigate to the Producer Project. Go to the Members tab and choose Add members.
Search for SMUSCollibraIntegrationAdminRole and select it from the results.

Set the role to Owner, then choose Add members.

Repeat the same steps for the Consumer Project. After adding the member, the configuration should look like the example in the following screenshot.

Make sure the producer project has the necessary authorization to create glossary terms in the domain unit it belongs to. For more information, refer to Domain units and authorization policies in Amazon SageMaker Unified Studio in the Amazon SageMaker Unified Studio documentation.

Synchronization of metadata

Metadata synchronization between Collibra and SageMaker Catalog happens on two distinct levels, each serving a specific purpose.The first level focuses on technical metadata. Collibra connects to services such as Amazon Redshift and AWS Glue using JDBC and other supported connection methods. Through these connections, it ingests schema details including tables, columns, and data types. This helps technical teams maintain visibility into the structure of the datasets available in SageMaker Catalog.The second level, which is the focus of this solution, handles business metadata synchronization. Using Collibra APIs, SageMaker Catalog retrieves business glossary terms, column descriptions, asset definitions, and the relationships among them. Additionally, Collibra ingests information about SageMaker projects, the assets published within them, and project membership details. This supports approval workflows and helps manage subscriptions based on project-level access. The following diagram illustrates how these two levels of metadata synchronization work together to bridge technical and business perspectives across both platforms.

For the technical metadata ingestion from AWS to Collibra, follow these steps:

Within the Collibra Edge site, create a new connection for each type of AWS data store you want to ingest metadata from. For detailed instructions, refer to the About Edge and Collibra Cloud site connections in the Collibra Documentation.
1. Depending on the type of connection, especially if it’s JDBC, you might need to add a capability such as JDBC catalog ingestion. Refer to the official documentation for more details.
2. So the integration works correctly, name all your AWS connections in Edge with “AWS” at the start of the name. The integration script relies on this naming convention to accurately identify assets that originate from AWS.
In Collibra, go to Catalog, select your connection, configure the rules for your schemas (such as which tables to include or exclude), and run the synchronization. You can also schedule the synchronization to run automatically at intervals defined in the user interface.
When metadata ingestion is complete, go to Catalog, then Data Sources. You can optionally filter by a specific AWS source or keep the default view to view all sources. From there, you can review the schemas, tables, and other metadata imported from AWS, as shown in the following diagram.

These data assets are imported using the JDBC connections that are available from Collibra Edge. The AWS solution we present here, in addition to these data assets, will import AWS projects and will link them to the assets ingested here that are published in these projects.

Technical and business stewardship in Collibra

Collibra provides business glossaries to define business context. These glossaries can also include a hierarchy or taxonomy of business terms based on their interdependence. The following is an example of a glossary used for this post.

An Order includes components such as Order Date, Order ID, and others. In Collibra, Business and Technical Stewards are responsible for linking Business Terms to the columns and tables ingested from AWS, as shown in the following diagram. For detailed guidance on how to perform stewardship activities, refer to the official Collibra documentation.

The entire business glossary with its one-level hierarchy is imported into AWS SageMaker Unified Studio automatically with this solution. Some business terms are also linked to data categories that are associated with data privacy, regulatory policies, and standards. In the example in the following screenshot, customer ID is connected to a data category. This connection between business terms and data categories links the associated data to relevant policies and standards. As a result, a table or column connected to a business term that is linked to a data category will also inherit the associated policy or standard.

The business term customer ID is linked to the data category personally identifiable information (PII). With this relation, all columns or tables that are linked to this business term automatically inherit the PII data category, and therefore the policies linked and associated with it.

The metadata is imported into AWS SageMaker Unified Studio at the asset and schema levels.

All the business metadata described previously is synchronized with AWS using this solution. Descriptions, data categories, tags, business terms are all imported into AWS and linked to respective assets. In the README, the data category is associated with one of the columns and the business term associated with a table or dataset.From Collibra we import into AWS the following:

Business terms and their hierarchies and descriptions
The link of the business terms to the technical assets
Data category of business terms inherited in the technical assets imported in the README section of the technical asset
Tags and descriptions of technical data assets

Not only is the business term imported into AWS SageMaker Unified Studio, its taxonomy is imported exactly as it is in Collibra. The following screenshot shows an example where order is imported to have under it the business terms order ID, quantity, and so on.

Subscription to published assets

For the subscription process, the same workflows and series of tasks occur whether the request is initiated from AWS or from Collibra. An overview of these tasks and the end-to-end flow from both platforms is shown in the following diagrams:

This diagram outlines the subscription request flow when initiated from Collibra. A user searches for a business term, locates the related asset, and submits a subscription request. The system creates a corresponding request asset in Collibra. The user then selects the destination project for the data. An approval workflow is triggered, notifying the designated business steward. If the request is approved, SageMaker Catalog automatically provisions access and updates the request status to Access Granted. The user receives a final notification confirming access. This process provides controlled, transparent data sharing across platforms.

The following diagram illustrates the end-to-end subscription flow when the data user initiates the process from within SageMaker Studio. The user begins by searching for data using a business term and selecting the relevant asset. After choosing the appropriate table, they request access, which triggers the creation of a subscription request asset in Collibra. The user then selects a destination project based on their memberships. Collibra sends an approval request to the designated business steward, who reviews and either approves or rejects it. If approved, SageMaker Catalog automatically provisions the subscription and notifies the requester. The subscription request status is then updated to Access Granted, completing the workflow.

For this post, the process is described starting from Collibra, although it functions the same way if initiated from AWS. In this example, a data consumer is searching for data related to AWS orders using the Collibra interface.

In Amazon SageMaker Unified Studio, the data consumer is a member of the Orders and Products project. At this stage, the user has no active subscriptions or access to data assets. The following screenshot is included to illustrate the state before the integration takes effect.

In Collibra, navigate to the Search area and enter a business-friendly term describing what the user is looking for. In this example, enter order.

In the Data Marketplace, filters such as Business Terms can be applied to narrow the results by asset type, as shown in the following screenshot. This approach helps users focus on relevant assets by starting from clear business context, which is especially useful when dealing with many similarly named tables or columns.

In the example shown in the following screenshot, the business term Order is selected, and the Diagram view is opened to display its full logical lineage. The diagram shows that the term is linked to the aws_orders table. Selecting the table in the diagram reveals its metadata details, which appear on the right side of the page. From there, users can navigate directly to the table.

In the aws_orders table asset, access can be requested by initiating an AWS subscription request. From the asset view, selecting Actions reveals the list of available workflows. In this example, the Creation of a new subscription workflow is selected to start the approval process.

The user must select the AWS project to use as the consuming project for the subscription. A list of all projects the user is a member of is displayed to facilitate the selection. After choosing the appropriate project, choose Send to submit the request.

After it’s submitted, the workflow is triggered, and a task is assigned to the business steward of the asset for which the subscription is requested. A new subscription request is also created in the AWS Subscription Requests domain with a status of Pending, and it’s automatically linked to the requested asset.

The new subscription request is also reflected in the lineage of the data asset, as shown in the following screenshot.

The business steward assigned to the asset receives an approval notification.
1. Choose Tasks button in the top right corner.
2. Locate the most recent task titled Accept or Reject, which is associated with the aws_orders asset.

The business steward opens the task and chooses either Approve or Reject, depending on the request. In this example, Approve is selected. The task is then marked as complete.

After the business steward approves the subscription request, the corresponding Subscription Request asset is automatically updated to the status Approved.

The requester is notified that the subscription request has been approved. To acknowledge, the requester choose Tasks, locates the approval notification, and chooses Done to confirm receipt, as shown in the following screenshot.

After a subscription request is approved, the integration solution automatically process the request by creating and granting the corresponding subscription in AWS using the asset’s metadata. The user can then confirm the new subscription is reflected in Amazon SageMaker, as shown in the following screenshot.

After the subscription is granted, the status of the Subscription Request is updated to Access Granted.

The requester now receives a new task, which is a notification confirming that the subscription request has been granted. Choose the Send button to acknowledge and complete the task.

In the AWS Subscription Requests domain, all requests and their status are visible. In addition to Approved and Access Granted statuses, Rejected requests are also listed. If a request is rejected by the approver, its status changes to Rejected and no subscription is created in AWS.

Synchronization Interval

The solution keeps Collibra and Amazon SageMaker Catalog in sync through regular updates. Core elements including business metadata of Collibra, user profiles, project information & published assets of Amazon SageMaker Catalog, and subscription requests originating in Collibra are synchronized every 5 minutes. However, when subscription requests are created in Amazon SageMaker Catalog, they are instantly synchronized to Collibra.

Cleanup

To avoid incurring unnecessary costs after testing or exploring the solution, delete the provisioned resources. Follow these steps:

Remove the CloudFormation stack – Go to the AWS CloudFormation console, select the stack you created for this solution, and choose Delete. This will automatically remove the associated AWS resources provisioned by the stack.
Clean up Collibra configurations – In the Collibra environment, remove test domains, projects, or workflows created for this solution to ensure a clean slate for future experiments.
Revoke access tokens or credentials – If you used API credentials or access tokens for integration, ensure they’re revoked or deleted if no longer needed.

Performing these steps ensures your environments stay clean and you avoid unintended resource usage.

Conclusion

The solution connecting Amazon SageMaker Catalog and Collibra gives organizations a simple way to unify metadata and streamline access workflows. It helps reduce duplication, improve governance, and build trust in data for both analytics and AI.We demonstrated how to synchronize metadata and manage access requests using APIs, enabling a shared view of data across teams.Learn more by exploring:

We welcome your feedback as you explore what’s possible with this solution.

About the authors

Vasiliki Nikolopoulou is a Principal Integrations Architect at Collibra, where she is working for the past 11 years. Her extensive career includes roles such as Director, Enterprise Architect at AXA Insurance US, Principal Sales Engineer at Oracle, and Certified Senior IT Professional in technical sales at IBM for over 15 years. She holds numerous technical certifications. Connect with her on LinkedIn.

Divij Bhatia is a Software Development Engineer at AWS. He is passionate about building resilient and scalable cloud-native solutions that solve real-world problems for customers. His free time often takes him outdoors, traveling and shooting landscapes. Connect with him on LinkedIn.

Develop and monitor a Spark application using existing data in Amazon S3 with Amazon SageMaker Unified Studio

2025-07-09 Amit Maindola

Post Syndicated from Amit Maindola original https://aws.amazon.com/blogs/big-data/develop-and-monitor-a-spark-application-using-existing-data-in-amazon-s3-with-amazon-sagemaker-unified-studio/

Organizations face significant challenges managing their big data analytics workloads. Data teams struggle with fragmented development environments, complex resource management, inconsistent monitoring, and cumbersome manual scheduling processes. These issues lead to lengthy development cycles, inefficient resource utilization, reactive troubleshooting, and difficult-to-maintain data pipelines.These challenges are especially critical for enterprises processing terabytes of data daily for business intelligence (BI), reporting, and machine learning (ML). Such organizations need unified solutions that streamline their entire analytics workflow.

The next generation of Amazon SageMaker with Amazon EMR in Amazon SageMaker Unified Studio addresses these pain points through an integrated development environment (IDE) where data workers can develop, test, and refine Spark applications in one consistent environment. Amazon EMR Serverless alleviates cluster management overhead by dynamically allocating resources based on workload requirements, and built-in monitoring tools help teams quickly identify performance bottlenecks. Integration with Apache Airflow through Amazon Managed Workflows for Apache Airflow (Amazon MWAA) provides robust scheduling capabilities, and the pay-only-for-resources-used model delivers significant cost savings.

In this post, we demonstrate how to develop and monitor a Spark application using existing data in Amazon Simple Storage Service (Amazon S3) using SageMaker Unified Studio.

Solution overview

This solution uses SageMaker Unified Studio to execute and oversee a Spark application, highlighting its integrated capabilities. We cover the following key steps:

Create an EMR Serverless compute environment for interactive applications using SageMaker Unified Studio.
Create and configure a Spark application.
Use TPC-DS data to build and run the Spark application using a Jupyter notebook in SageMaker Unified Studio.
Monitor application performance and schedule recurring runs with Amazon MWAA integrated.
Analyze results in SageMaker Unified Studio to optimize workflows.

Prerequisites

For this walkthrough, you must have the following prerequisites:

An AWS account – If you don’t have an account, you can create one.
A SageMaker Unified Studio domain – For instructions, refer to Create an Amazon SageMaker Unified Studio domain – quick setup.
A demo project – Create a demo project in your SageMaker Unified Studio domain. For instructions, see Create a project. For this example, we choose All capabilities in the project profile section.

Add EMR Serverless as compute

Complete the following steps to create an EMR Serverless compute environment to build your Spark application:

In SageMaker Unified Studio, open the project you created as a prerequisite and choose Compute.
Choose Data processing, then choose Add compute.
Choose Create new compute resources, then choose Next.

Choose EMR Serverless, then choose Next.

For Compute name, enter a name.
For Release label, choose emr-7.5.0.
For Permission mode, choose Compatibility.
Choose Add compute.

It takes a few minutes to spin up the EMR Serverless application. After it’s created, you can view the compute in SageMaker Unified Studio.

The preceding steps demonstrate how you can set up an Amazon EMR Serverless application in SageMaker Unified Studio to run interactive PySpark workloads. In subsequent steps, we build and monitor Spark applications in an interactive JupyterLab workspace.

Develop, monitor, and debug a Spark application in a Jupyter notebook within SageMaker Unified Studio

In this section, we build a Spark application using the TPC-DS dataset within SageMaker Unified Studio. With Amazon SageMaker Data Processing, you can focus on transforming and analyzing your data without managing compute capacity or open source applications, saving you time and reducing costs. SageMaker Data Processing provides a unified developer experience from Amazon EMR, AWS Glue, Amazon Redshift, Amazon Athena, and Amazon MWAA in a single notebook and query interface. You can automatically provision your capacity on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) or EMR Serverless. Scaling rules manage changes to your compute demand to optimize performance and runtimes. Integration with Amazon MWAA simplifies workflow orchestration by alleviating infrastructure management needs. For this post, we use EMR Serverless to read and query the TPC-DS dataset within a notebook and run it using Amazon MWAA.

Complete the following steps:

Upon completion of the previous steps and prerequisites, navigate to SageMaker Studio and open your project.
Choose Build and then JupyterLab.

The notebook takes about 30 seconds to initialize and connect to the space.

Under Notebook, choose Python 3 (ipykernel).
In the first cell, next to Local Python, choose the dropdown menu and choose PySpark.
Choose the dropdown menu next to Project.Spark and choose EMR-S Compute.
Run the following code to develop your Spark application. This example reads a 3 TB TPC-DS dataset in Parquet format from a publicly accessible S3 bucket:

spark.read.parquet("s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/store/").createOrReplaceTempView("store")

After the Spark session starts and execution logs start to populate, you can explore the Spark UI and driver logs to further debug and troubleshoot Spark progra The following screenshot shows an example of the Spark UI. The following screenshot shows an example of the driver logs. The following screenshot shows the Executors tab, which provides access to the driver and executor logs.

Use the following code to read some more TPC-DS datasets. You can create temporary views and use the Spark UI to see the files being read. Refer to the appendix at the end of this for details on using the TPC-DS dataset within your buckets.

spark.read.parquet("s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/item/").createOrReplaceTempView("item")
spark.read.parquet("s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/store_sales/").createOrReplaceTempView("store_sales")
spark.read.parquet("s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/date_dim/").createOrReplaceTempView("date_dim")
spark.read.parquet("s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/customer/").createOrReplaceTempView("customer")
spark.read.parquet("s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/catalog_sales/").createOrReplaceTempView("catalog_sales")
spark.read.parquet("s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/web_sales/").createOrReplaceTempView("web_sales")

In each cell of your notebook, you can expand Spark Job Progress to view the stages of the job submitted to EMR Serverless for a specific cell. You can see the time taken to complete each stage. In addition, if a failure occurs, you can examine the logs, making troubleshooting a seamless experience.

Because the files are partitioned based on date key column, you can observe that Spark runs parallel tasks for reads.

Next, get the count across the date time keys on data that is partitioned based on the time key using the following code:

select count(1), ss_sold_date_sk from store_sales group by ss_sold_date_sk order by ss_sold_date_sk

Monitor jobs in the Spark UI

On the Jobs tab of the Spark UI, you can see a list of complete or actively running jobs, with the following details:

The action that triggered the job
The time it took (for this example, 41 seconds, but timing will vary)
The number of stages (2) and tasks (3,428); these are for reference and specific to this specific example

You can choose the job to view more details, particularly around the stages. Our job has two stages; a new stage is created whenever there is a shuffle. We have one stage for the initial reading of each dataset, and one for the aggregation. In the following example, we run some TPC-DS SQL statements that are used for performance and benchmarks:

 with frequent_ss_items as
 (select substr(i_item_desc,1,30) itemdesc,i_item_sk item_sk,d_date solddate,count(*) cnt
  from store_sales, date_dim, item
  where ss_sold_date_sk = d_date_sk
    and ss_item_sk = i_item_sk
    and d_year in (2000, 2000+1, 2000+2,2000+3)
  group by substr(i_item_desc,1,30),i_item_sk,d_date
  having count(*) >4),
 max_store_sales as
 (select max(csales) tpcds_cmax
  from (select c_customer_sk,sum(ss_quantity*ss_sales_price) csales
        from store_sales, customer, date_dim
        where ss_customer_sk = c_customer_sk
         and ss_sold_date_sk = d_date_sk
         and d_year in (2000, 2000+1, 2000+2,2000+3)
        group by c_customer_sk) x),
 best_ss_customer as
 (select c_customer_sk,sum(ss_quantity*ss_sales_price) ssales
  from store_sales, customer
  where ss_customer_sk = c_customer_sk
  group by c_customer_sk
  having sum(ss_quantity*ss_sales_price) > (95/100.0) *
    (select * from max_store_sales))
 select sum(sales)
 from (select cs_quantity*cs_list_price sales
       from catalog_sales, date_dim
       where d_year = 2000
         and d_moy = 2
         and cs_sold_date_sk = d_date_sk
         and cs_item_sk in (select item_sk from frequent_ss_items)
         and cs_bill_customer_sk in (select c_customer_sk from best_ss_customer)
      union all
      (select ws_quantity*ws_list_price sales
       from web_sales, date_dim
       where d_year = 2000
         and d_moy = 2
         and ws_sold_date_sk = d_date_sk
         and ws_item_sk in (select item_sk from frequent_ss_items)
         and ws_bill_customer_sk in (select c_customer_sk from best_ss_customer))) x

You can monitor your Spark job in SageMaker Unified Studio using two methods. Jupyter notebooks provide basic monitoring, showing real-time job status and execution progress. For more detailed analysis, use the Spark UI. You can examine specific stages, tasks, and execution plans. The Spark UI is particularly useful for troubleshooting performance issues and optimizing queries. You can track estimated stages, running tasks, and task timing details. This comprehensive view helps you understand resource utilization and track job progress in depth.

In this section, we explained how you can EMR Serverless compute in SageMaker Unified Studio to build an interactive Spark application. Through the Spark UI, the interactive application provides fine-grained task-level status, I/O, and shuffle details, as well as links to corresponding logs of the task for this stage directly from your notebook, enabling a seamless troubleshooting experience.

Clean up

To avoid ongoing charges in your AWS account, delete the resources you created during this tutorial:

Delete the connection.
Delete the EMR job.
Delete the EMR output S3 buckets.
Delete the Amazon MWAA resources, such as workflows and environments.

Conclusion

In this post, we demonstrated how the next generation of SageMaker, combined with EMR Serverless, provides a powerful solution for developing, monitoring, and scheduling Spark applications using data in Amazon S3. The integrated experience significantly reduces complexity by offering a unified development environment, automatic resource management, and comprehensive monitoring capabilities through Spark UI, while maintaining cost-efficiency through a pay-as-you-go model. For businesses, this means faster time-to-insight, improved team collaboration, and reduced operational overhead, so data teams can focus on analytics rather than infrastructure management.

To get started, explore the Amazon SageMaker Unified Studio User Guide, set up a project in your AWS environment, and discover how this solution can transform your organization’s data analytics capabilities.

Appendix

In the following sections, we discuss how to run a workload on a schedule and provide details about the TPC-DS dataset for building the Spark application using EMR Serverless.

Run a workload on a schedule

In this section, we deploy a JupyterLab notebook and create a workflow using Amazon MWAA. You can use workflows to orchestrate notebooks, querybooks, and more in your project repositories. With workflows, you can define a collection of tasks organized as a directed acyclic graph (DAG) that can run on a user-defined schedule.Complete the following steps:

In SageMaker Unified Studio, choose Build, and under Orchestration, choose Workflows.

Choose Create Workflow in Editor.

You will be redirected to the JupyterLab notebook with a new DAG called untitled.py created under the /src/workflows/dag folder.

We rename this notebook to tpcds_data_queries.py.
You can reuse the existing template with the following updates:
1. Update line 17 with the schedule you want your code to run.
2. Update line 26 with your NOTEBOOK_PATH. This should be in src/<notebook_name>.ipynb. Note the name of the automatically generated dag_id; you can name it based on your requirements.

Choose File and Save notebook.

To test, you can trigger a manual run of your workload.

In SageMaker Unified Studio, choose Build, and under Orchestration, choose Workflows.
Choose your workflow, then choose Run.

You can monitor the success of your job on the Runs tab.

To debug your notebook job by accessing the Spark UI within your Airflow job console, you must use EMR Serverless Airflow Operators to submit your job. The link is available on the Details tab of your query.

This option has the following key limitations: it’s not available for Amazon EMR on EC2, and SageMaker notebook job operators don’t work.

You can configure the operator to generate one-time links to the application UIs and Spark stdout logs by passing enable_application_ui_links=True as a parameter. After the job starts running, these links are available on the Details tab of the relevant task. If enable_application_ui_links=False, then the links will be present but grayed out.

Make sure you have the emr-serverless:GetDashboardForJobRun AWS Identity and Access Management (IAM) permissions to generate the dashboard link.

Open the Airflow UI for your job. The Spark UI and history server dashboard options are visible on the Details tab, as shown in the following screenshot.

The following screenshot shows the Jobs tab of the Spark UI.

Use the TPC-DS dataset to build the Spark application using EMR Serverless

To use the TPC-DS dataset to run the Spark application against a dataset in an S3 bucket, you need to copy the TPC-DS dataset into your S3 bucket:

Create a new S3 bucket in your test account if needed. In the following code, replace $YOUR_S3_BUCKET with your S3 bucket name. We suggest you export YOUR_S3_BUCKET as an environment variable:

<Your bucket name>

Copy the TPC-DS source data as input to your S3 bucket. If it’s not exported as an environment variable, replace $YOUR_S3_BUCKET with your S3 bucket name:

aws s3 sync s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/ s3://$YOUR_S3_BUCKET/blog/BLOG_TPCDS-TEST-3T-partitioned/

About the Authors

Abhilash is a senior specialist solutions architect at Amazon Web Services (AWS), helping public sector customers on their cloud journey with a focus on AWS Data and AI services. Outside of work, Abhilash enjoys learning new technologies, watching movies, and visiting new places.

Embracing event driven architecture to enhance resilience of data solutions built on Amazon SageMaker

2025-06-05 Dhrubajyoti Mukherjee

Post Syndicated from Dhrubajyoti Mukherjee original https://aws.amazon.com/blogs/big-data/embracing-event-driven-architecture-to-enhance-resilience-of-data-solutions-built-on-amazon-sagemaker/

Amazon Web Services (AWS) customers value business continuity while building modern data governance solutions. A resilient data solution helps maximize business continuity by minimizing solution downtime and making sure that critical information remains accessible to users. This post provides guidance on how you can use event driven architecture to enhance the resiliency of data solutions built on the next generation of Amazon SageMaker, a unified platform for data, analytics, and AI. SageMaker is a managed service with high availability and durability. If customers want to build a backup and recovery system on their end, we show you how to do this in this blog. It provides three design principles to improve the data solution resiliency of your organization. In addition, it contains guidance to formulate a robust disaster recovery strategy based on event driven architecture. It contains code samples to back up the system metadata of your data solution built on SageMaker, enabling disaster recovery.

The AWS Well-Architected Framework defines resilience as the ability of a system to recover from infrastructure or service disruptions. You can enhance the resiliency of your data solution by adopting three design principles that are highlighted in this post and by establishing a robust disaster recovery strategy. Recovery point objective (RPO) and recovery time objective (RTO) are industry standard metrics to measure the resilience of a system. RPO indicates how much data loss your organization can accept in case of solution failure. RTO refers to the time for the solution to recover after failure. You can measure these metrics in seconds, minutes, hours, or days. The next section discusses how you can align your data solution resiliency strategy to meet the needs of your organization.

Formulating a strategy to enhance data solution resilience

To develop a robust resiliency strategy for your data solution built on SageMaker, start with how users interact with the data solution. The user interaction influences the data solution architecture, the degree of automation, and determines your resiliency strategy. Here are a few aspects you might consider while designing the resiliency of your data solution.

Data solution architecture – The data solution of your organization might follow a centralized, decentralized, or hybrid architecture. This architecture pattern reflects the distribution of responsibilities of the data solution based on the data strategy of your organization. This shift in responsibilities is reflected in the structure of the teams that perform activities in the Amazon DataZone data portal, SageMaker Unified Studio portal, AWS Management Console, and underlying infrastructure. Examples of such activities include configuring and running the data sources, publishing data assets in the data catalog, subscribing to data assets, and assigning members to projects.
User persona – The user persona, their data, and cloud maturity influence their preferences for interacting with the data solution. The users of a data governance solution fall into two categories: business users and technical users. Business users of your organization might include data owners, data stewards, and data analysts. They might find the Amazon DataZone data portal and SageMaker Unified Studio portal more convenient for tasks such as approving or rejecting subscription requests and performing one-time queries. Technical users such as data solution administrators, data engineers, and data scientists might opt for automation when making system changes. Examples of such activities include publishing data assets, managing glossary and metadata forms in the Amazon DataZone data portal or in SageMaker Unified Studio portal. A robust resiliency strategy accounts for tasks performed by both user groups.
Empowerment of self-service – The data strategy of your organization determines autonomy granted to the users. Increased user autonomy demands a high level of abstraction of the cloud infrastructure powering the data solution. SageMaker empowers self-service by enabling users to perform regular data management activities in the Amazon DataZone data portal and in the SageMaker Unified Studio portal. The level of self-service maturity of the data solution depends on the data strategy and user maturity of your organization. At an early stage, you might limit the self-service features to the use cases for onboarding the data solution. As the data solution scales, consider increasing the self-service capabilities. See Data Mesh Strategy Framework to learn about the different phases of a data mesh-based data solution.

Adopt the following design principles to enhance the resiliency of your data solution:

Choose serverless services – Use serverless AWS services to build your data solution. Serverless services scale automatically with increasing system load, provide fault isolation, and have built-in high-availability. Serverless services minimize the need for infrastructure management, reducing the need to design resiliency into the infrastructure. SageMaker seamlessly integrates with several serverless services such Amazon Simple Storage Service (Amazon S3), AWS Glue, AWS Lake Formation, and Amazon Athena.
Document system metadata – Document the system metadata of your data solution using infrastructure-as-code (IaC) and automation. Consider how users interact with the data solution. If the users prefer to perform certain activities through the Amazon DataZone data portal and SageMaker Unified Studio portal, implement automation to capture and store the metadata that’s relevant for disaster recovery. Use Amazon Relational Database Service (Amazon RDS) and Amazon DynamoDB to store the system metadata of your data solution.
Monitor system health – Implement a monitoring and alerting solution for your data solution so that you can respond to service interruptions and initiate the recovery process. Make sure that system activities are logged so that you can troubleshoot the system interruption. Amazon CloudWatch helps you monitor AWS resources and the applications you run on AWS in real time.

The next section presents disaster recovery strategies to recover your data solution built on SageMaker.

Disaster recovery strategies

Disaster recovery focuses on one-time recovery objectives in response to natural disasters, large-scale technical failures, or human threats such as attack or error. Disaster recovery is a crucial part of your business continuity plan. As shown in the following figure, AWS offers the following options for disaster recovery: Backup and restore, pilot light, warm standby, and multi-site active/active.

The business continuity requirements and cost of recovery should guide your organization’s disaster recovery strategy. As a general guideline, the recovery cost of your data solution increases with reduced RPO and RTO requirements. The next section provides architecture patterns to implement a robust backup and recovery solution for a data solution built on SageMaker.

Solution overview

This section provides event-driven architecture patterns following the backup and restore approach to enhance resiliency of your data solution. This active/passive strategy-based solution stores the system metadata in a DynamoDB table. You can use the system metadata to restore your data solution. The following architecture patterns provide regional resilience. You can simplify the architecture of this solution to restore data in a single AWS Region.

Pattern 1: Point-in-time backup

The point-in-time backup captures and stores system metadata of a data solution built on SageMaker when a user or an automation performs an action. In this pattern, a user activity or an automation initiates an event that captures the system metadata. This pattern is suited for low RPO requirements, ranging from seconds to minutes. The following architecture diagram shows the solution for the point-in-time backup process.

Architecture point-in-time-backup

The steps comprise the following.

User or automation performs an activity on an Amazon DataZone domain or Amazon Unified Studio domain.
This activity creates a new event in AWS CloudTrail.
The CloudTrail event is sent to Amazon EventBridge. Alternatively, you can use Amazon DataZone as the event source for the EventBridge rule.
AWS Lambda transforms and stores this event in a DynamoDB global table where the Amazon DataZone domain is hosted.
The information is replicated into the replica DynamoDB table in a secondary Region. The replica DynamoDB table can be used to restore the data solution based on SageMaker in the secondary Region.

Pattern 2: Scheduled backup

The scheduled backup captures and stores system metadata of a data solution built on SageMaker at regular intervals. In this pattern, an event is initiated based on a defined time schedule. This pattern is suited for RPO requirements in the order of hours. The following architecture diagram displays the solution for point-in-time backup process.

The steps comprise the following.

EventBridge triggers an event at regular interval and sends this event to AWS Step Functions.
The Step Functions state machine contains multiple Lambda functions. These Lambda functions get the system metadata from either a SageMaker Unified Studio domain or an Amazon DataZone domain.
The system metadata is stored in an DynamoDB global table in the primary Region where the Amazon DataZone domain is hosted.
The information is replicated into the replica DynamoDB table in a secondary Region. The data solution can be restored in the secondary Region using the replica DynamoDB table.

The next section provides step by step instructions to deploy a code sample that implements the scheduled backup pattern. This code sample stores asset information of a data solution built on a SageMaker Unified Studio domain and an Amazon DataZone domain in an DynamoDB global table. The data in the DynamoDB table is encrypted at rest using a customer managed key stored in AWS Key Management Service (AWS KMS). A multi-Region replica key encrypts the data in the secondary Region. The asset uses the data lake blueprint that contains the definition for launching and configuring a set of services (AWS Glue, Lake Formation, and Athena) to publish and use data lake assets in the business data catalog. The code sample uses the AWS Cloud Development Kit (AWS CDK) to deploy the cloud infrastructure.

Prerequisites

An active AWS account.
AWS administrator credentials for the central governance account in your development environment
AWS Command Line Interface (AWS CLI) installed to manage your AWS services from the command line (recommended)
Node.js and Node Package Manager (npm) installed to manage AWS CDK applications
AWS CDK Toolkit installed globally in your development environment by using npm, to synthesize and deploy AWS CDK applications

npm install -g aws-cdk

TypeScript installed in your development environment or installed globally by using npm compiler:

npm install -g typescript

Docker installed in your development environment (recommended)
An integrated development environment (IDE) or text editor with support for Python and TypeScript (recommended)

Walkthrough for data solutions built on a SageMaker Unified Studio domain

This section provides step by step instructions to deploy a code sample that implements the scheduled backup pattern for data solutions built on a SageMaker Unfied Studio domain.

Set up SageMaker Unified Studio

Sign into the IAM console. Create an IAM role that trusts Lambda with the following policy.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "datazone:Search",
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "dynamodb:PutItem"
            ],
            "Resource": "arn:aws:dynamodb:<AWS_REGION>:<AWS_ACCOUNT>:table/*"
        },
        {
            "Sid": "VisualEditor2",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:Encrypt",
                "kms:GenerateDataKey",
                "kms:ReEncrypt*",
                "kms:DescribeKey"
            ],
            "Resource": "arn:aws:kms:<AWS_REGION>:<AWS_ACCOUNT>:key/<KMS_KEY_ID>"
        },
        {
            "Sid": "VisualEditor3",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:<AWS_REGION>:<AWS_ACCOUNT>:log-group:*:log-stream:*",
                "arn:aws:logs:<AWS_REGION>:<AWS_ACCOUNT>:log-group:*"
            ]
        }
    ]
}

Note down the Amazon Resource Name (ARN) of the Lambda role. Navigate to SageMaker and choose Create a Unified Studio domain.
Select Quick setup and expand the Quick setup settings section. Enter a domain name, for example, CORP-DEV-SMUS. Select the Virtual private cloud (VPC) and Subnets. Choose Continue.
Enter the email address of the SageMaker Unified Studio user in the Create IAM Identity Center user section. Choose Create domain.
After the domain is created, choose Open unified studio in the top right corner.
Sign in to SageMaker Unified Studio using the single sign-on (SSO) credentials of your user. Choose Create project at the top right corner. Enter a project name and description, choose Continue twice, and choose Create project. Wait unti project creation is complete.
After the project is created, go into the project by selecting the project name. Select Query Editor from the Build drop-down menu on the top left. Paste the following create table as select (CTAS) query script in the query editor window and run it to create a new table named mkt_sls_table as described in Produce data for publishing. The script creates a table with sample marketing and sales data.

CREATE TABLE mkt_sls_table AS
SELECT 146776932 AS ord_num, 23 AS sales_qty_sld, 23.4 AS wholesale_cost, 45.0 as lst_pr, 43.0 as sell_pr, 2.0 as disnt, 12 as ship_mode,13 as warehouse_id, 23 as item_id, 34 as ctlg_page, 232 as ship_cust_id, 4556 as bill_cust_id
UNION ALL SELECT 46776931, 24, 24.4, 46, 44, 1, 14, 15, 24, 35, 222, 4551
UNION ALL SELECT 46777394, 42, 43.4, 60, 50, 10, 30, 20, 27, 43, 241, 4565
UNION ALL SELECT 46777831, 33, 40.4, 51, 46, 15, 16, 26, 33, 40, 234, 4563
UNION ALL SELECT 46779160, 29, 26.4, 50, 61, 8, 31, 15, 36, 40, 242, 4562
UNION ALL SELECT 46778595, 43, 28.4, 49, 47, 7, 28, 22, 27, 43, 224, 4555
UNION ALL SELECT 46779482, 34, 33.4, 64, 44, 10, 17, 27, 43, 52, 222, 4556
UNION ALL SELECT 46779650, 39, 37.4, 51, 62, 13, 31, 25, 31, 52, 224, 4551
UNION ALL SELECT 46780524, 33, 40.4, 60, 53, 18, 32, 31, 31, 39, 232, 4563
UNION ALL SELECT 46780634, 39, 35.4, 46, 44, 16, 33, 19, 31, 52, 242, 4557
UNION ALL SELECT 46781887, 24, 30.4, 54, 62, 13, 18, 29, 24, 52, 223, 4561

Navigate to Data sources from the Project. Choose Run in the Actions section next to the project.default_lakehouse connection. Wait until the run is complete.
Navigate to Assets in the left side bar. Select the mkt_sls_table in the Inventory section and review the metadata that was generated. Choose Accept All if you’re satisfied with the metadata.
Choose Publish Asset to publish the mkt_sls_table table to the business data catalog, making it discoverable and understandable across your organization.
Choose Members in the navigation pane. Choose Add members and select the IAM role you created in Step 1. Add the role as a Contributor in the project.

Deployment steps

After setting up SageMaker Unified Studio, use the AWS CDK stack provided on GitHub to deploy the solution to back up the asset metadata that is created in the previous section.

Clone the repository from GitHub to your preferred integrated development environment (IDE) using the following commands.

git clone https://github.com/aws-samples/sample-event-driven-resilience-data-solutions-sagemaker.git
cd sample-event-driven-resilience-data-solutions-sagemaker

Export AWS credentials and the primary Region to your development environment for the IAM role with administrative permissions, use the following format

export AWS_REGION=
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
export AWS_SESSION_TOKEN=

In a production environment, use AWS Secrets Manager or AWS Systems Manager Parameter Store to manage credentials. Automate the deployment process using a continuous integration and delivery (CI/CD) pipeline.

Bootstrap the AWS account in the primary and secondary Regions by using AWS CDK and running the following command.

cdk bootstrap aws://<AWS_ACCOUNT_ID>/<AWS_REGION>
cdk bootstrap aws://<AWS_ACCOUNT_ID>/<AWS_SECONDARY_REGION>
cd unified-studio

Modify the following parameters in the config/Config.ts file.

SMUS_APPLICATION_NAME – Name of the application.
SMUS_SECONDARY_REGION – Secondary AWS region for backup.
SMUS_BACKUP_INTERVAL_MINUTES – Minutes before each backup interval. 
SMUS_STAGE_NAME – Name of the stage. 
SMUS_DOMAIN_ID – Domain identifier of the Amazon SageMaker Unified Studio. 
SMUS_PROJECT_ID – Project identifier of the Amazon SageMaker Unified Studio. 
SMUS_ASSETS_REGISTRAR_ROLE_ARN – ARN of the AWS Lambda role created in step 1 of the preceding section.

Install the dependencies by running the following command:

npm install

Synthesize the CloudFormation template by running the following command.

cdk synth

Deploy the solution by running the following command.

cdk deploy –all

After the deployment is complete, sign in to your AWS account and navigate to the CloudFormation console to verify that the infrastructure deployed.

When deployment is complete, wait for the duration of DZ_BACKUP_INTERVAL_MINUTES. Navigate to the <DZ_APPLICATION_NAME >AssetsInfo DynamoDB table. Retrieve the data from the DynamoDB table. The following screenshot shows the data in the Items returned section. Verify the same data in the secondary Region. Screenshot smus-dynamodb

Clean up

Use the following steps to clean up the resources deployed.

Empty the S3 buckets that were created as part of this deployment.
In your local development environment (Linux or macOS):
Navigate to the unified-studio directory of your repository.
Export the AWS credentials for the IAM role that you used to create the AWS CDK stack.
To destroy the cloud resources, run the following command:

cdk destroy --all

Go to the SageMaker Unified Studio and delete the published data assets that were created in the project.
Use the console to delete the SageMaker Unified Studio domain.

Walkthrough for data solutions built on an Amazon DataZone domain

This section provides step by step instructions to deploy a code sample that implements the scheduled backup pattern for data solutions built on an Amazon DataZone domain.

Deployment steps

After completing the prerequisites, use the AWS CDK stack provided on GitHub to deploy the solution to backup system metadata of the data solution built on Amazon DataZone domain

Clone the repository from GitHub to your preferred IDE using the following commands.

git clone https://github.com/aws-samples/sample-event-driven-resilience-data-solutions-sagemaker.git
cd event-driven-resilience-sagemaker

Export AWS credentials and the primary Region information to your development environment for the AWS Identity and Access Management (IAM) role with administrative permissions, use the following format:

export AWS_REGION=
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
export AWS_SESSION_TOKEN=

In a production environment, use Secrets Manager or Systems Manager Parameter Store to manage credentials. Automate the deployment process using a CI/CD pipeline.

Bootstrap the AWS account in the primary and secondary Regions by using AWS CDK and running the following command:

cdk bootstrap aws://<AWS_ACCOUNT_ID>/<AWS_REGION>
cdk bootstrap aws://<AWS_ACCOUNT_ID>/<AWS_SECONDARY_REGION>
cd datazone

From the console for IAM, note the Amazon Resource Name (ARN) of the CDK execution role. Update the trust relationship of the IAM role so that Lambda can assume the role.

Modify the following parameters in the config/Config.ts file.

DZ_APPLICATION_NAME – Name of the application.
DZ_SECONDARY_REGION – Secondary Region for backup.
DZ_BACKUP_INTERVAL_MINUTES – Minutes before each backup interval.
DZ_STAGE_NAME – Name of the stage (dev, qa, or prod).
DZ_DOMAIN_NAME – Name of the Amazon DataZone domain
DZ_DOMAIN_DESCRIPTION – Description of the Amazon DataZone domain
DZ_DOMAIN_TAG – Tag of the Amazon DataZone domain
DZ_PROJECT_NAME – Name of the Amazon DataZone project
DZ_PROJECT_DESCRIPTION – Description of the Amazon DataZone project
CDK_EXEC_ROLE_ARN – ARN of the CDK execution role
DZ_ADMIN_ROLE_ARN – ARN of the administrator role

Install the dependencies by running the following command:

npm install

Synthesize the AWS CloudFormation template by running the following command:

cdk synth

Deploy the solution by running the following command:

cdk deploy --all

After the deployment is complete, sign in to your AWS account and navigate to the CloudFormation console to verify that the infrastructure deployed.

Document system metadata

This section provides instructions to create an asset and demonstrates how you can retrive the metadata of the asset. Perform the following steps to retrieve the systems metadata.

Sign in to the Amazon DataZone data portal from the console. Select the project and choose Query data at the upper right.

Screenshot datazone-open-query

Choose Open Athena and make sure that <DZ_PROJECT_NAME>_DataLakeEnvironment is selected in the Amazon DataZone environment dropdown at the upper right and that on the left, and that <DZ_PROJECT_NAME>_datalakeenvironment_pub_db is selected as the Database.
Create a new AWS Glue table for publishing to Amazon DataZone. Paste the following create table as select (CTAS) query script in the Query window and run it to create a new table named mkt_sls_table as described in Produce data for publishing. The script creates a table with sample marketing and sales data.

CREATE TABLE mkt_sls_table AS
SELECT 146776932 AS ord_num, 23 AS sales_qty_sld, 23.4 AS wholesale_cost, 45.0 as lst_pr, 43.0 as sell_pr, 2.0 as disnt, 12 as ship_mode,13 as warehouse_id, 23 as item_id, 34 as ctlg_page, 232 as ship_cust_id, 4556 as bill_cust_id
UNION ALL SELECT 46776931, 24, 24.4, 46, 44, 1, 14, 15, 24, 35, 222, 4551
UNION ALL SELECT 46777394, 42, 43.4, 60, 50, 10, 30, 20, 27, 43, 241, 4565
UNION ALL SELECT 46777831, 33, 40.4, 51, 46, 15, 16, 26, 33, 40, 234, 4563
UNION ALL SELECT 46779160, 29, 26.4, 50, 61, 8, 31, 15, 36, 40, 242, 4562
UNION ALL SELECT 46778595, 43, 28.4, 49, 47, 7, 28, 22, 27, 43, 224, 4555
UNION ALL SELECT 46779482, 34, 33.4, 64, 44, 10, 17, 27, 43, 52, 222, 4556
UNION ALL SELECT 46779650, 39, 37.4, 51, 62, 13, 31, 25, 31, 52, 224, 4551
UNION ALL SELECT 46780524, 33, 40.4, 60, 53, 18, 32, 31, 31, 39, 232, 4563
UNION ALL SELECT 46780634, 39, 35.4, 46, 44, 16, 33, 19, 31, 52, 242, 4557
UNION ALL SELECT 46781887, 24, 30.4, 54, 62, 13, 18, 29, 24, 52, 223, 4561

Go to the Tables and Views section and verify that the mkt_sls_table table was successfully created.
In the Amazon DataZone Data Portal, go to Data sources, select the <DZ_PROJECT_NAME>-DataLakeEnvironment-default-datasource, and choose Run. The mkt_sls_table will be listed in the inventory and available to publish.
Select the mkt_sls_table table and review the metadata that was generated. Choose Accept All if you’re satisfied with the metadata.
Choose Publish Asset and the mkt_sls_table table will be published to the business data catalog, making it discoverable and understandable across your organization.
After the table is published, wait for the duration of DZ_BACKUP_INTERVAL_MINUTES. Navigate to the <DZ_APPLICATION_NAME >AssetsInfo DynamoDB table and retrieve the data from the table. The following screenshot shows the data in the Items returned section. Verify the same data in the secondary Region.

Clean up

Use the following steps to clean up the resources deployed.

Empty the Amazon Simple Storage Service (Amazon S3) buckets that were created as part of this deployment.
Go to the Amazon DataZone domain portal and delete the published data assets that were created in the Amazon DataZone project.
In your local development environment (Linux or macOS):

Navigate to the datazone directory of your repository.
Export the AWS credentials for the IAM role that you used to create the AWS CDK stack.
To destroy the cloud resources, run the following command:

cdk destroy --all

Conclusion

This post explores how to build a resilient data governance solution on Amazon SageMaker. Resilient design principles and a robust disaster recovery strategy are central to the business continuity of AWS customers. The code samples included in this post implement a backup process of the data solution at regular time interval. They store the Amazon SageMaker asset information in Amazon DynamoDB Global tables. You can extend the backup solution by identifying the system metadata that is relevant for the data solution of your organization and by using Amazon SageMaker APIs to capture and store the metadata. The DynamoDB Global table replicates the changes in the DynamoDB table in the primary region to the secondary region in an asynchronous manner. Consider Implementing an additional layer of resiliency by using AWS Backup to back up the DynamoDB table at regular interval. In the next post, we show how you can use the system metadata to restore your data solution in the secondary region.

Adopt the resiliency features offered by Amazon DataZone and Amazon SageMaker Unified Studio. Use AWS Resilience Hub to assess the resilience of your data solution. AWS Resilience Hub helps you to define your resilience goals, assess your resilience posture against those goals, and implement recommendations for improvement based on the AWS Well-Architected Framework.

To build a data mesh based data solution using Amazon DataZone domain, see our GitHub repository. This open source project provides a step-by-step blueprint for constructing a data mesh architecture using the powerful capabilities of Amazon SageMaker, AWS Cloud Development Kit (AWS CDK), and AWS CloudFormation.

About the authors

BDB-4558-Dhruba Dhrubajyoti Mukherjee is a Cloud Infrastructure Architect with a strong focus on data strategy, data governance, and artificial intelligence at Amazon Web Services (AWS). He uses his deep expertise to provide guidance to global enterprise customers across industries, helping them build scalable and secure cloud solutions that drive meaningful business outcomes. Dhrubajyoti is passionate about creating innovative, customer-centric solutions that enable digital transformation, business agility, and performance improvement. Outside of work, Dhrubajyoti enjoys spending quality time with his family and exploring nature through his love of hiking mountains.

Connect, share, and query where your data sits using Amazon SageMaker Unified Studio

2025-03-21 Lakshmi Nair

Post Syndicated from Lakshmi Nair original https://aws.amazon.com/blogs/big-data/connect-share-and-query-where-your-data-sits-using-amazon-sagemaker-unified-studio/

The ability for organizations to quickly analyze data across multiple sources is crucial for maintaining a competitive advantage. Imagine a scenario where the retail analytics team is trying to answer a simple question: Among customers who purchased summer jackets last season, which customers are likely to be interested in the new spring collection?

While the question is straightforward, getting the answer requires piecing together data across multiple data sources such as customer profiles stored in Amazon Simple Storage Service (Amazon S3) from customer relationship management (CRM) systems, historical purchase transactions in an Amazon Redshift data warehouse, and current product catalog information in Amazon DynamoDB. Traditionally, answering this question would involve multiple data exports, complex extract, transform, and load (ETL) processes, and careful data synchronization across systems.

In this blog post, we will demonstrate how business units can use Amazon SageMaker Unified Studio to discover, subscribe to, and analyze these distributed data assets. Through this unified query capability, you can create comprehensive insights into customer transaction patterns and purchase behavior for active products without the traditional barriers of data silos or the need to copy data between systems.

SageMaker Unified Studio provides a unified experience for using data, analytics, and AI capabilities. You can use familiar AWS services for model development, generative AI, data processing, and analytics—all within a single, governed environment. To strike a fine balance of democratizing data and AI access while maintaining strict compliance and regulatory standards, Amazon SageMaker Data and AI Governance is built into SageMaker Unified Studio. With Amazon SageMaker Catalog, teams can collaborate through projects, discover, and access approved data and models using semantic search with generative AI-created metadata, or you can use natural language to ask Amazon Q to find your data. Within SageMaker Unified Studio, organizations can implement a single, centralized permission model with fine-grained access controls, facilitating seamless data and AI asset sharing through streamlined publishing and subscription workflows. Teams can also query the data directly from sources such as Amazon S3 and Amazon Redshift, through Amazon SageMaker Lakehouse.

SageMaker Lakehouse streamlines connecting to, cataloging, and managing permissions on data from multiple sources. Built on AWS Glue Data Catalog and AWS Lake Formation, it organizes data through catalogs that can be accessed through an open, Apache Iceberg REST API to help ensure secure access to data with consistent, fine-grained access controls. SageMaker Lakehouse organizes data access through two types of catalogs: federated catalogs and managed catalogs (shown in the following figure). A catalog is a logical container that organizes objects from a data store, such as schemas, tables, views, or materialized views such as from Amazon Redshift. You can also create nested catalogs to mirror the hierarchical structure of your data sources within SageMaker Lakehouse.

Federated catalogs: Through SageMaker Unified Studio, you can create connections to external data sources such as Amazon DynamoDB. See Data connections in Amazon SageMaker Lakehouse for all the supported external data sources. These connections are stored in the AWS Glue Data Catalog (Data Catalog) and registered with Lake Formation, allowing you to create a federated catalog for each available data source.
Managed catalogs: A managed catalog refers to the data that resides on Amazon S3 or Redshift Managed Storage (RMS).

The existing Data Catalog becomes the Default catalog (identified by the AWS account number) and is readily available in SageMaker Lakehouse.

If the business units don’t have a data warehouse but need the benefits of one—such as a query result cache and query rewrite optimizations—then, they can create an RMS managed catalog in SageMaker Unified Studio. This is a SageMaker Lakehouse managed catalog backed by RMS storage. The table metadata is managed by Data Catalog. When you create an RMS managed catalog, it deploys an Amazon Redshift managed serverless workgroup. Users can write data to managed RMS tables using Iceberg APIs, Amazon Redshift, or Zero-ETL ingestion from supported data sources.

Functional working model

In SageMaker Unified Studio, the infrastructure team will enable the blueprints and configure the project profiles for tools and technologies to the respective business units to build and monitor their pipelines. They will also onboard the teams to SageMaker Unified Studio, enabling them to build the data products in a single integrated, governed environment. To enforce standardization within the organization, the central governance team can also create hierarchical representations of business units through domain units and dictate certain actions that these teams can perform under a domain unit. Global policies such as data dictionaries (business glossaries), data classification tags, and additional information with metadata forms can be created by the governance team to ensure standardization and consistency within the organization.

Individual business units will use these project profiles based on their needs to process the data using the authorized tool of their choice and create data products. Business units can enjoy the full flexibility to process and consume the data without worrying about the maintenance of the underlying infrastructure. Depending on the nature of the workloads, business units can choose a storage solution that best fits their use case. You can use SageMaker Lakehouse to unify the data across different data sources.

To share the data outside the business unit, the teams will publish the metadata of their data to a SageMaker catalog and make it discoverable and accessible to other business units. Amazon SageMaker Catalog serves as a central repository hub to store both technical and business catalog information of the data product. To establish trust between the data producers and data consumers, SageMaker Catalog also integrates the data quality metrics and data lineage events to track and drive transparency in data pipelines. While sharing the data, data producers of these business units can apply fine grained access control permissions at row and column level to these assets during subscription approval workflows. SageMaker Unified Studio automatically grants subscription access to the subscribed data assets after the subscription request is approved by the data producer. As shown in the following figure, the data sharing capability highlights that the data remains at its origin with the data producer, while consumers from other business units can consume and analyze it using their own compute resources. This approach eliminates any data duplication or data movement.

Solution overview

In this post, we explore two scenarios for sharing data between different teams (retail, marketing, and data analysts). The solution in this post gives you the implementation for a single account use case.

Scenario 1

The retail team needs to create a comprehensive view of customer behavior to optimize their spring collection launch. Their data landscape is diverse:

Customer profiles stored in Amazon S3 (default Data Catalog)
Historical purchase transactions stored in RMS (SageMaker Lakehouse managed RMS catalog)
Inventory information of the product in DynamoDB. (federated catalog)

The team needs to share this unified view with their regional data analysts while maintaining strict data governance protocols. Data analysts discover the data and subscribe to the data. We will also walk through the publishing and subscription workflow as part of the data sharing process. To get a unified view of the customer sales transactions for active products, the data analysts will use Amazon Athena.

Here are the high level steps of the solution implementation as shown in the preceding diagram:

In this post, we take an example of two teams who participate in the collaboration. The retail team has created a project retailsales-sql-project and the data analysts team has created a project dataanalyst-sql-project within SageMaker Unified Studio.
The retail team creates and stores their data in various sources:
1. customer data in Amazon S3 (contains customer data)
2. inventory data in a DynamoDB table (contains product catalog information)
3. store_sales_lakehouse in SageMaker Lakehouse managed RMS (contains purchase history)
The retail team publishes the assets to the project catalog to make them discoverable to other domain members within the organization.
The data analysts team discovers the data and subscribes to the data assets.
An incoming request is sent to the retail team, who then approves the subscription request. After the subscription is approved, data analysts use Athena to create a unified query from all the subscribed data assets to get insights into the data.

In this scenario, we will review how SageMaker Catalog manages the subscription grants to Data Catalog assets (both federated and managed).

For this scenario, we assume that the retail team doesn’t have their own data warehouse and they want to create and manage Amazon Redshift tables using Data Catalog.

Scenario 2

The marketing team needs access to transaction data for campaign optimization. They have campaign performance data stored in an Amazon Redshift data warehouse. However, to have improved campaign ROI and better resource allocation, they need data from the retail team to understand actual customer purchase behavior. To improve the campaign ROI, they need answers to crucial questions such as:

What is the true conversion rate across different customer segments?
Which customers should be targeted for upcoming promotions?
How do seasonal buying patterns affect campaign success?

Here the retail team shares the purchase history data store_sales to the marketing team. In this scenario, shown in the preceding figure, we assume that the retail team has their own data warehouse and uses Amazon Redshift to store the purchase history data.

The high level steps of the solution implementation for this scenario are:

The marketing team has created the project marketing-sql-project within SageMaker Unified Studio.
The retail team has store_sales in Amazon Redshift data warehouse (contains purchase history)
The retail team has published the assets to the project catalog
The marketing team discovers the data and subscribes to the data assets.
An incoming request is sent to the retail team, who then approves the subscription request. After the subscription is approved, the marketing team uses Amazon Redshift to consume the purchase history and identify high-value customer segments.

In this scenario, we will review the process of how SageMaker Catalog grants access to managed Amazon Redshift assets.

Prerequisites

To follow the step by step guide, you must complete the following prerequisites:

Sign up for an AWS account.
Create a user with administrative access.
Enable AWS IAM Identity Center in the AWS Region where you want to create your SageMaker Unified Studio domain. Make sure that you are using a Region that SageMaker Unified Studio is available in. Set up your IdP and synchronize identities and groups with IAM Identity Center. For more information, refer to IAM Identity Center Identity source tutorials.
Create a SageMaker Unified Studio domain and three projects using the SQL analytics project profile. See Create a new project to create a project. For this post, you will create the following projects: retailsales-sql-project, marketing-sql-project and dataanalyst-sql-project.

Note that the default SQL analytics project profile provides you with a RedshiftServerless blueprint. However, in this post, we want to showcase the data sharing capabilities of different types of SageMaker Lakehouse catalogs (managed and federated).

For the simplicity, we chose the SQL analytics project profile. However, you can also test this by using the Custom project profile by selecting specific blueprints such as LakehouseCatalog and LakeHouseDatabase for scenarios where the business unit doesn’t have their own data warehouse.

Solution walkthrough (Scenario 1)

The first step focuses on preparing the data for each data source for unified access.

Data preparation

In this section, you will create the following data sets:

customer data in Amazon S3 (default Data Catalog)
inventory data in a DynamoDB table (federated catalog)
store_sales_lakehouse in SageMaker Lakehouse managed RMS (managed catalog)

Sign in to SageMaker Unified Studio as a member of the retail team and select the project retailsales-sql-project.
On the top menu, choose Build, and under DATA ANALYSIS & INTEGRATION, select Query Editor.

Select the following options:
1. Under CONNECTIONS, select Athena (Lakehouse).
2. Under CATALOGS, select AwsDataCatalog.
3. Under DATABASES, select glue_db_<environmentid> or the customer glue database name you provided during project creation.
4. After the options are selected, choose Choose.

When users select a project profile within SageMaker Unified Studio, the system automatically triggers the relevant AWS CloudFormation stack (DataZone-Env-<environmentid>) and deploys the necessary infrastructure resources in the form of environments. Environments are the actual data infrastructure behind a project.

Run the following SQL:

CREATE TABLE customer AS
SELECT 13251813 cust_id,'Joyce Deaton'   cust_name,'Greece'   cust_country, '[email protected]'   cust_email
UNION
SELECT 1581546  ,'Daniel Dow'  ,'India'  , '[email protected]'  
UNION
SELECT 1581536  ,'Marie Lange'  ,'Canada'  , '[email protected]'  
UNION
SELECT 1827661  ,'Wesley Harris'  ,'Rome'  , '[email protected]'  
UNION
SELECT 1581536  ,'Alexander Salyer'  ,'Germany'  , '[email protected]'  
UNION
SELECT 3581536  ,'Jerry Tracy'  ,'Swiss'  , '[email protected]'

After the SQL is executed, you will find that the customer table has been created in the Lakehouse section under Lakehouse/AwsDataCatalog/glue_db_<environmentid>.

The product catalog is stored in DynamoDB. You can create a new table named inventory in DynamoDB with partition key prod_id through AWS CloudShell with the following command:

aws dynamodb create-table \
    --table-name inventory\
    --attribute-definitions \
AttributeName=prod_id,AttributeType=N \
    --key-schema \
AttributeName=prod_id,KeyType=HASH \
    --provisioned-throughput \
ReadCapacityUnits=5,WriteCapacityUnits=5 \
    --table-class STANDARD

Populate the DynamoDB table using the following commands:

aws dynamodb put-item --table-name inventory --item '{"prod_id": {"N": "1"}, "prod_name": {"S": "Widget A"},"active": {"S": "Y"}}' 

aws dynamodb put-item --table-name inventory --item '{"prod_id": {"N": "2"}, "prod_name": {"S": "Gadget B"},"active": {"S": "Y"}}'

aws dynamodb put-item --table-name inventory --item '{"prod_id": {"N": "3"}, "prod_name": {"S": "Item C"},"active": {"S": "N"}}'

To use the DynamoDB table in SageMaker Unified Studio, you need to configure a resource-based policy that allows the appropriate actions for the project role.
1. To create the resource-based policy, navigate to the DynamoDB console and choose Tables from the navigation pane.
2. Select the Permissions table and choose Create table policy.

The following is an example policy that allows connecting to DynamoDB tables as a federated source. Replace the <aws_region> with the Region you are working on, <aws_account_id> with the AWS Account ID where DynamoDB is deployed, <dynamodb_table> with the DynamoDB table (in this case inventory) that you intend to query from Amazon SageMaker Unified Studio and <datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy> with the Project role Amazon Resource Name (ARN) in SageMaker Unified Studio portal. You can get the project role ARN by navigating to the project in SageMaker Unified Studio and then to Project overview.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": [
                "dynamodb:Query",
                "dynamodb:Scan",
                "dynamodb:DescribeTable",
                "dynamodb:PartiQLSelect",
                "dynamodb:BatchWriteItem"
            ],
            "Resource": "arn:aws:dynamodb:<aws_region>:<aws_accountid>:table/<dynamodb_table>",
            "Condition": {
                "ArnEquals": {
                    "aws:PrincipalArn": "arn:aws:iam::<aws_accountid>:role/<datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy>"
                }
            }
        }
    ]
}

After the policies are incorporated on the DynamoDB table, create an SageMaker Lakehouse connection within SageMaker Unified Studio. As shown in the example, dynamodb-connection-catalogs is created.

After the connection is successfully established, you will see the DynamoDB table inventory under Lakehouse.

The next step is to create a managed catalog for RMS objects using SageMaker Lakehouse.

Choose Data in the navigation pane.
In the data explorer, choose the plus icon to add a data source.
Select Create Lakehouse catalog.
Choose Next.

Enter the name of the catalog. The catalog name provided in the example is redshift-lakehouse-connection-catalogs. Choose Add data.

After the connection is created, you will see the catalog under Lakehouse.

This creates a managed Amazon Redshift Serverless workgroup in your AWS account. You will see a new database dev@<redshift-catalog-name> in the managed Amazon Redshift Serverless workgroup.
1. On the top menu, choose Build, and under DATA ANALYSIS & INTEGRATION, select Query Editor.
2. Select Redshift (Lakehouse) from CONNECTIONS, dev@<redshift-catalog-name> from DATABASES and public from SCHEMAS

Run the following SQL in order. The SQL creates the store_sales_lakehouse table in the dev database in the public schema. The retail team inserts data into the store_sales_lakehouse table.

CREATE TABLE public.store_sales_lakehouse (
    sale_id INTEGER IDENTITY(1,1) PRIMARY KEY,
    cust_id INTEGER NOT NULL,
    sale_date DATE NOT NULL,
    sale_amount DECIMAL(10, 2) NOT NULL,
    prod_id INTEGER  NOT NULL,
    last_purchase_date DATE
);

INSERT INTO public.store_sales_lakehouse (cust_id, sale_date, sale_amount, prod_id, last_purchase_date)
VALUES
(13251813, '2023-01-15', 150.00, 1, '2023-01-15'),
(29033279, '2023-01-20', 200.00, 4, '2023-01-20'),
(12755125, '2023-02-01', 75.50, 3, '2023-02-01'),
(26009249, '2023-02-10', 300.00, 2, '2023-02-10'),
(3270685, '2023-02-15', 125.00, 2, '2023-02-15'),
(6520539, '2023-03-01', 100.00, 2, '2023-03-01'),
(10251183, '2023-03-10', 250.00, 1, '2023-03-10'),
(10251283, '2023-03-15', 180.00, 1, '2023-03-15'),
(10251383, '2023-04-01', 90.00, 2, '2023-04-01'),
(10251483, '2023-04-10', 220.00, 3, '2023-04-10'),
(10251583, '2023-04-15', 175.00, 3, '2023-04-15'),
(10251683, '2023-05-01', 130.00, 1, '2023-05-01'),
(10251783, '2023-05-10', 280.00, 1, '2023-05-10'),
(10251883, '2023-05-15', 195.00, 4, '2023-05-15'),
(10251983, '2023-06-01', 110.00, 2, '2023-06-01'),
(10251083, '2023-06-10', 270.00, 1, '2023-06-10'),
(10252783, '2023-06-15', 185.00, 2, '2023-06-15'),
(10253783, '2023-07-01', 95.00, 3, '2023-07-01'),
(10254783, '2023-07-10', 240.00, 1, '2023-07-10'),
(10255783, '2023-07-15', 160.00, 3, '2023-07-15');

On successful creation of the table, you should now be able to query the data. Select the table store_sales_lakehouse and select Query with Redshift.

Import assets to the project catalog from various data sources

To share your assets outside your own project to other business units, you must first bring your metadata to SageMaker Catalog. To import the assets into the project’s inventory, you need to create a data source in the project catalog. In this section, we show you how to import the technical metadata from AWS Glue data catalogs. Here, you will import data assets from various sources that you have created as part of your data preparation.

Sign in to SageMaker Unified Studio as a member of the retail team. Select the project retailsales-sql-project, under Project catalog. Choose Data sources and import the assets by choosing Run.

To import the federated catalog, create a new data source and choose Run. This will import the metadata of the inventory data from DynamoDB table.

After successful run of all the data sources, choose Assets under Project catalog in the navigation plane. You will find all the assets in the Inventory of Project catalog.

Publish the assets

To make the assets discoverable to the data analysts team, the retail team must publish their assets.

In the project retailsales-sql-project, choose Project catalog and select Assets.
Select each asset in the INVENTORY tab, enrich the asset with the automated metadata generation and PUBLISH ASSET.

Discover the assets

SageMaker Catalog within SageMaker Unified Studio enables efficient data asset discovery and access management. The data analysts team signs in to SageMaker Unified Studio and selects the project dataanalyst-sql-project. The data analysts team then locates the desired assets in SageMaker Catalog and initiates the subscription request.

In this section, members of dataanalyst-sql-project browse the catalog and find the assets. There are multiple ways to find the desired assets.

Sign in to SageMaker Unified Studio as a member of the data analysts team. Choose Discover in the top navigation bar and select Catalog. Find the desired asset by browsing or entering the name of the asset into the search bar.
Search for the asset through a conversational interface using Amazon Q.
Use the faceted filter search by selecting the desired project in the BROWSE CATALOG.

The data analysts team selects the project retailsales-sql-project.

Subscribe to the assets

The data analysts team submits a subscription request with an appropriate justification for each of these assets.

For each asset, choose SUBSCRIBE.
Select dataanalyst-sql-project in Project.
Provide the Reason for request as “need this data for analysis”.

Note that during the subscription process, the requester sees a message that the asset access control and fulfillment will be Managed. This means that SageMaker Unified Studio automatically manages subscription access grants and permissions for these assets.

Subscription approval workflow

To approve the subscription request, you must be a member of the retail team and select the project that has published the asset.

Sign in to SageMaker Unified Studio as a member of the retail team and select the project retailsales-sql-project.
In the navigation pane, choose Project catalog and then select Subscription requests.
In INCOMING REQUESTS, choose the REQUESTED tab and select View request for each asset to see detailed information of the subscription request.

REQUEST DETAILS provides information about the subscribing project, the requestor, and the justification to access the asset.
RESPONSE DETAILS provides an option to approve the subscription with full access to the data (Full access) or restricted access to the data (Approve with row or column filters). With restricted access to data, the subscription approval workflow process offers granular access control for sensitive data through row-level filtering and column-level filtering. Using row filters, approvers can restrict access to specific records based on defined criteria. Using column filters, approvers can control access to specific columns within the data sets. This allows excluding sensitive fields while sharing the relevant data. Approvers can implement these filters during the approval process, helping to ensure that the data access aligns with the organization’s security requirements and compliance policies. For this post, select Full access in the RESPONSE DETAILS
(Optional) Decision comment is where you can add a comment about accepting or rejecting the subscription request.
Choose APPROVE.

Repeat the subscription approval workflow process for all the requested assets.
After all the subscription requests are approved, choose the APPROVED tab to view all the approved assets.

Subscription fulfillment methods

After subscription approval, a fulfillment process manages access to the assets. SageMaker Unified Studio provides fulfillment methods for managed assets and unmanaged assets.

Managed assets: SageMaker Unified Studio automatically manages the fulfillment and permissions for assets such as AWS Glue tables and Amazon Redshift tables and views.
Unmanaged assets: For unmanaged assets, permissions are handled externally. SageMaker Unified Studio publishes standard events for actions such as approvals through Amazon EventBridge, enabling integration with other AWS services or third-party solutions for custom integrations.

In this scenario 1, because the assets are Data Catalogs, SageMaker Unified Studio grants and manages access to these managed assets on your behalf through Lake Formation. See the SageMaker Unified Studio subscription workflow for updates on sharing options.

Analyze the data

The data analysts team uses the subscribed data assets from varied sources to get unified insights.

As a data analyst, sign in to SageMaker Unified Studio and select the project dataanalyst-sql-project. In the navigation pane, choose Project catalog and select Assets.
Choose the SUBSCRIBED tab to find all the subscribed assets from the retailsales-sql-project.
The status under each asset is Asset accessible. This indicates that the subscription grants are fulfilled and the data analysts team can now consume the assets with the compute of their choice.

Query using Athena (subscription grants fulfilled using Lake Formation)

As a member of the data analysts team, create a unified view to get purchase history with customer information for active products.

In the dataanalyst-sql-project project, go to Build and select Query Editor.
Use the following sample query to get the required information. Replace glue_db_<environmentid> with your subscribed glue database.

select * from "redshift-lakehouse-connection-catalogs/dev"."public"."store_sales_lakehouse" sales 
 left  join "awsdatacatalog"."glue_db_<environmentid>"."customer" customer
 on sales.cust_id=customer.cust_id
 inner  join "dynamodb-connection-catalogs"."default"."inventory" inventory
 on sales.prod_id = inventory.prod_id
 where inventory.active ='Y'

Solution walk-through (Scenario 2)

In this scenario, we assume that the retail team stores the purchase history data in their Amazon Redshift data warehouse. Because you’re using the default SQL analytics project profile to create the project, you will use a Redshift Serverless compute (project.redshift). The purchase history data is shared with the marketing team for enhanced campaign performance.

Sign in to SageMaker Unified Studio as a member of the retail team and select the project retailsales-sql-project.
On the top menu, choose Build, and under DATA ANALYSIS & INTEGRATION, select Query Editor
Select the following options:
- Under CONNECTIONS, select Redshift(Lakehouse).
- Under CATALOGS, select dev.
- Under DATABASES, select public.
Run the following SQL:

CREATE TABLE public.store_sales (
sale_id INTEGER IDENTITY(1,1) PRIMARY KEY,
cust_id INTEGER NOT NULL,
sale_date DATE NOT NULL,
sale_amount DECIMAL(10, 2) NOT NULL,
prod_id INTEGER  NOT NULL,
last_purchase_date DATE
);

INSERT INTO public.store_sales (cust_id, sale_date, sale_amount, prod_id, last_purchase_date)
VALUES
(13251813, '2023-01-15', 150.00, 1, '2023-01-15'),
(29033279, '2023-01-20', 200.00, 4, '2023-01-20'),
(12755125, '2023-02-01', 75.50, 3, '2023-02-01'),
(26009249, '2023-02-10', 300.00, 2, '2023-02-10'),
(3270685, '2023-02-15', 125.00, 2, '2023-02-15'),
(6520539, '2023-03-01', 100.00, 2, '2023-03-01'),
(10251183, '2023-03-10', 250.00, 1, '2023-03-10'),
(10251283, '2023-03-15', 180.00, 1, '2023-03-15'),
(10251383, '2023-04-01', 90.00, 2, '2023-04-01'),
(10251483, '2023-04-10', 220.00, 3, '2023-04-10'),
(10251583, '2023-04-15', 175.00, 3, '2023-04-15'),
(10251683, '2023-05-01', 130.00, 1, '2023-05-01'),
(10251783, '2023-05-10', 280.00, 1, '2023-05-10'),
(10251883, '2023-05-15', 195.00, 4, '2023-05-15'),
(10251983, '2023-06-01', 110.00, 2, '2023-06-01'),
(10251083, '2023-06-10', 270.00, 1, '2023-06-10'),
(10252783, '2023-06-15', 185.00, 2, '2023-06-15'),
(10253783, '2023-07-01', 95.00, 3, '2023-07-01'),
(10254783, '2023-07-10', 240.00, 1, '2023-07-10'),
(10255783, '2023-07-15', 160.00, 3, '2023-07-15');

5. On successful execution of the query, you will see store_sales under Redshift in the navigation pane.

Import the asset to the project catalog inventory

To share your assets outside your own project to other marketing business units, you must first share your metadata to SageMaker Catalog. To import the assets into the project’s inventory, you need to run the data source in the project catalog.

In the project retailsales-sql-project, under Project catalog, select Data sources and import the asset store-sales. Select the highlighted data source and choose Run as shown in the screenshot.

Publish the asset

To make the assets discoverable to the marketing team, the retail team must publish their asset.

Go to the navigation pane and choose Project catalog, and then select Assets.
Select store-sales in the INVENTORY tab, enrich the asset with the automated metadata generation and PUBLISH ASSET as illustrated in the screenshot.

Discover and subscribe the asset

The marketing team discovers and subscribes to the store-sales asset.

Sign in to SageMaker Unified Studio as a member of the marketing team and select marketing-sql-project.
Navigate to the Discover menu in the top navigation bar and choose Catalog. Find the desired asset by browsing or entering the name of the asset into the search bar.
Select the asset and choose SUBSCRIBE.
Enter a justification in Reason for request and choose REQUEST.

Subscription approval workflow

The retail team gets an incoming request in their project to approve the subscription request.

Sign in to the SageMaker Unified Studio and select the project retailsales-sql-project as a member of the retail team. Under Project catalog, select Subscription requests.
In the INCOMING REQUESTS, under the REQUESTED tab, select View request for store-sales.

You will see detailed information for the subscription request.
Select Full access in the RESPONSE DETAILS and choose APPROVE.

Analyze the data

In the Project catalog, select Assets and choose the SUBSCRIBED tab to find all the subscribed assets from the retailsales-sql-project.
Notice the status under the asset marked as Asset accessible. This indicates that the subscription grants are fulfilled and the marketing team can now consume the asset with the compute of their choice.

Query using Amazon Redshift (subscription grants fulfilled using native Amazon Redshift data sharing)

To query the shared data with Amazon Redshift compute, select Build and then Query Editor. Select the following options

Under CONNECTIONS, select Redshift(Lakehouse).
Under CATALOGS, select dev.
Under DATABASES, select project.

select * from "dev"."project"."store_sales" sales

When a subscription to an Amazon Redshift table or view is approved, SageMaker Unified Studio automatically adds the subscribed asset to the consumer’s Amazon Redshift Serverless workgroup for the project. Notice the subscribed asset is shared under the folder project. In the Redshift navigation pane, you can also see the datashare created between the source and the target cluster. In this case, because the data is shared in the same account but between different clusters, SageMaker Unified Studio creates a view in the target database and permissions are granted on the view. See Grant access to managed Amazon Redshift assets in Amazon SageMaker Unified Studio for information about data sharing options within Amazon Redshift.

Clean up

Make sure you remove the SageMaker Unified Studio resources to avoid any unexpected costs. Start by deleting the connections, catalogs, underlying data sources, projects, databases, and domain that you created for this post. For additional details, see the Amazon SageMaker Unified Studio Administrator Guide.

Conclusion

In this post, we explored two distinct approaches to data sharing and analytics.

Business units without an existing data warehouse can use a SageMaker Lakehouse managed RMS catalog. In the first scenario, we showcased subscription fulfillment of AWS Glue Data Catalogs using AWS Lake Formation for federated and managed catalogs. The data analysts team was able to connect and subscribe to the data shared by the retail team that resided in Amazon S3, Amazon Redshift, and other data sources such as DynamoDB through SageMaker Lakehouse.

In the second scenario, we demonstrated the native data-sharing capabilities of Amazon Redshift. In this scenario, we assume that the retail team has sales transactions stored in an Amazon Redshift data warehouse. Using the data sharing feature of Amazon Redshift, the asset was shared to the marketing team using Amazon SageMaker Unified Studio.

Both approaches enable unified querying across varied data sources with teams able to efficiently discover, publish, and subscribe to data assets while maintaining strict access controls through Amazon SageMaker Data and AI Governance. Subscription fulfillment is automated, reducing the administrative overhead. Using the query-in-place approach eliminates data redundancy and maintains data consistency while allowing unified analysis across data sources through a single integrated experience.

To learn more, see the Amazon SageMaker Unified Studio Administrator Guide and the following resources:

About the authors

Lakshmi Nair is a Senior Analytics Specialist Solutions Architect at AWS. She specializes in designing advanced analytics systems across industries. She focuses on crafting cloud-based data platforms, enabling real-time streaming, big data processing, and robust data governance. She can be reached through LinkedIn

Ramkumar Nottath is a Principal Solutions Architect at AWS focusing on Analytics services. He enjoys working with various customers to help them build scalable, reliable big data and analytics solutions. His interests extend to various technologies such as analytics, data warehousing, streaming, data governance, and machine learning. He loves spending time with his family and friends.