Tag Archives: Amazon SageMaker Unified Studio

Use account-agnostic, reusable project profiles in Amazon SageMaker to streamline governance

2025-09-03 Ramesh H Singh

Post Syndicated from Ramesh H Singh original https://aws.amazon.com/blogs/big-data/use-account-agnostic-reusable-project-profiles-in-amazon-sagemaker-to-streamline-governance/

Amazon SageMaker now supports account-agnostic project profiles, so you can create reusable project templates across multiple AWS accounts and organizational units. In this post, we demonstrate how account-agnostic project profiles can help you simplify and streamline the management of SageMaker project creation while maintaining security and governance features. We walk through the technical steps to configure account-agnostic, reusable project profiles, helping you maximize the flexibility of your SageMaker deployments.

New feature: Account-agnostic project profiles

Previously, SageMaker provided the ability to create project profiles, which required selecting an AWS account and AWS Region at the time of profile creation. This feature provides you the flexibility to insert the AWS account and Region dynamically when creating projects.

SageMaker now supports generic, account-agnostic project profiles (templates) in SageMaker domains, so domain administrators can define project configurations one time and reuse them across multiple AWS accounts and Regions.

Project profiles are no longer tied to a specific AWS account or Region. Instead, platform teams can reference an account pool—a new domain entity that enables dynamic account and Region selection at the time of project creation, based on custom enterprise authorization policies or user-specific logic. This decoupling of profile definitions from static deployment settings is designed to simplify governance, reduce duplication, and accelerate onboarding across large-scale data and machine learning (ML) environments.

Account-agnostic project profiles offer the following key benefits:

Project creators benefit from a more flexible experience – During project creation, project creators can select from a personalized list of authorized AWS accounts and Regions, powered by custom resolution strategies or predefined account pools.
The feature streamlines project profile governance – This model is intended to enable organizations operating across many different accounts to scale efficiently across those accounts, while preserving organization’s centralized control and permission boundaries.

Customer spotlight

As a large data-driven organization, Bayer AG looks to harness the power of data, analytics, and ML to help researchers and engineers accelerate pharmaceutical innovation. With the ability to create account agnostic templates and reusable templates in SageMaker, the research teams at Bayer can innovate faster without platform and engineering overhead.

“At Bayer, we use Amazon SageMaker Unified Studio as a unified, governed workspace that brings together data from multiple AWS accounts—enabling our users to run analytics, build pipelines, and train models as part of their day-to-day work. With the new capability to create account-agnostic templates, our platform team can publish reusable templates once, and teams can select the right authorized AWS account at project creation—without relying on platform hand-offs. This will support faster onboarding, improved agility, and consistent governance as we scale ML across our global operations.”

— Avinash Reddy Erupaka, Principal Engineering Lead, Drug Innovation Platform, Bayer

Solution overview

For our example use case, a leading pharmaceutical company has implemented SageMaker to manage their enterprise-wide data governance initiatives. The organization faces the complex challenge of managing thousands of AWS accounts across their global operations.

To streamline this process, their platform administrator needs to develop a system of reusable project profiles that map to specific account pools, organized according to the company’s organizational structure. For instance, they’ve created a specialized Corporate HR project profile tailored to meet the Corporate HR team’s specific requirements, as well as a comprehensive Data Engineer project profile designed for data engineering teams operating across North America, Asia-Pacific, and European Regions. This strategic approach helps data engineers efficiently create new projects using these preconfigured profiles while selecting from pre-authorized account and Region combinations. This structure strikes an optimal balance between operational flexibility and enhanced security and governance features.

In the following sections, we provide a detailed, step-by-step implementation guide for this solution.

Prerequisites

For this walkthrough, you must have the following prerequisites:

An AWS account – If you don’t have an account, you can create one. The account should have permission to do the following:
- Create and manage SageMaker domains
- Create and manage AWS Identity and Access Management (IAM) roles
- Create and invoke AWS Lambda functions (optional)
SageMaker domain – For instructions, refer to Create a domain – quick setup.
AWS CLI installed – The AWS Command Line Interface (AWS CLI) version 2.11 or later.
Python installed – Python 3.8 or later (if using custom Lambda handlers).
IAM permissions – The following IAM permissions are required:
- sagemaker:CreateProject
- sagemaker:CreateProjectProfile
- datazone:CreateAccountPool

Platform administrator tasks

The platform administrator is responsible for two key setup tasks: creating account pools and establishing project profiles associated with these pools. This section provides the steps to accomplish both crucial processes.

Create account pools

There are two ways to create account pools:

For static account sources, provide a list of accounts and Regions
For dynamic account sources, use a custom Lambda handler to authorize account and Region pair information

As of this writing, the creation, update, and deletion of account pools are only supported in the AWS CLI.

For creating account pools, use the create-account-pool command and provide the resources. We used the following commands to create account pools for our example use case. Replace the relevant values with your own resources, such as domain identifier, account, and Region.

First, create the account pool hr-accountpool with a single AWS account. In the following command, the parameter MANUAL refers to the mechanism by which an account is chosen from the pool at project creation time. Because the platform admin is manually choosing the accounts, the resolution strategy is set to MANUAL.

aws datazone create-account-pool --domain-identifier dzd_5yxxxxxxxxxxxx --name hr-accountpool --resolution-strategy MANUAL --account-source '{"accounts": [{"awsAccountId": "633xxxxxxxxx", "supportedRegions": ["us-east-1"], "awsAccountName": "HRaccount"}]}'

Next, create the account pool namer-data-engg-pool with multiple AWS accounts. Use the same code to create account pools for the EMEA and APAC Regions:

aws datazone create-account-pool --domain-identifier dzd_5yxxxxxxxxxxxx --name namer-data-engg-pool --resolution-strategy MANUAL --account-source '{"accounts": [{"awsAccountId": "633xxxxxxxxx", "supportedRegions": ["us-east-1"], "awsAccountName": "usaccount1"}, {"awsAccountId": "635xxxxxxxxx ", "supportedRegions": ["us-east-1"], "awsAccountName": "usaccount2"}]}'

You will use these account pools in subsequent steps to create project profiles.

To verify account pool creation, use the following command:

aws datazone list-account-pools --domain-identifier <domain-id>

If you have an external permissioning system, you can use the following custom Lambda command to create your account pool that will dynamically resolve during project creation:

aws datazone create-account-pool --domain-identifier dzd_cdy9yy904sxxxx --name custom- accountpool --resolution-strategy MANUAL --account-source '{"customAccountPoolHandler": {"lambdaFunctionArn": "<<Lambda ARN>>","lambdaExecutionRoleArn": "<<Lambda execution role>>"}}'

Create project profiles and account pool assignments

In this step, we establish project profiles and connect them to authorized account pools. There are three possible scenarios for setting up project profiles.

Scenario 1: Project profile associated with a single account pool

This is the simplest configuration, where one project profile is mapped to a single account pool. In the following steps, we create a project profile for the Corporate HR team and tie it to the HR account pool:

On the SageMaker console, choose Domains in the navigation pane.
On the Project profiles tab, choose Create.
Enter a name and description for your profile.
Choose an appropriate project profile template that aligns with your project’s needs.
Select Choose account and region during project creation.
Select Choose account pool(s) and choose the account pool you created for the HR team.
Leave the remaining settings as default and choose Create project profile.
On the project details page, choose Enable to activate your profile.
Choose Enable in the confirmation pop-up to proceed.

You will see a success message confirming that the Corporate HR profile has been created and linked to one account pool.

On the Project profiles tab, you should now see your newly created Corporate HR profile listed among the available project profiles.

To explore further, navigate to the Corporate HR project profile and choose the Blueprints tab to see a list of available blueprints. Choose a blueprint to view its details.

On the blueprint details page, the blueprint shows as deployable to the single account pool you associated with this project profile.

Scenario 2: Project profile associated with multiple account pools

In this example, we create a project profile for a global Data Engineering team, connecting it to three Regional account pools: NAMER (North America), APAC (Asia Pacific), and EMEA (Europe, Middle East, and Africa). Complete the following steps:

On the SageMaker console, choose Domains in the navigation pane.
On the Project profiles tab, choose Create.
Enter a name and description for your profile.
Choose an appropriate project profile template that aligns with your project’s needs.
Select Choose account and region during project creation.
Select Choose account pool(s) and choose all three Regional pools:
1. NAMER Data Engineering team
2. EMEA Data Engineering team
3. APAC Data Engineering team
Leave the remaining settings as default and choose Create project profile.
On the project details page, choose Enable to activate your profile.
Choose Enable in the confirmation pop-up to proceed.

You will see a success message confirming the Data Engineer profile creation. The profile will show connections to all three Regional account pools.

You can find your new profile listed on the Project profiles tab.

Navigate to your project profile and choose the Blueprints tab to see a list of available blueprints. Choose a blueprint to view its details.

On the blueprint details page, the blueprint shows as deployable to the three account pools you associated with this project profile.

Scenario 3: Project profile with all associated accounts

In this scenario, we create a project profile linked to all the associated accounts for this domain. Complete the following steps:

On the SageMaker console, choose Domains in the navigation pane.
On the Project profiles tab, choose Create.
Enter a name and description for your profile.
Choose an appropriate project profile template that aligns with your project’s needs.
Select Choose account and region during project creation.
Select All associated accounts.
Leave the remaining settings as default and choose Create project profile.

You can find your new profile listed on the Project profiles tab.

Project owner tasks

Now that the administrator has created project profiles for the account pools, project owners can log in to SageMaker to create projects for their account pools. In this section, we demonstrate the procedure to create a project using an account-agnostic project profile with a single account pool. You can use the same procedure to create projects using an account-agnostic project profile with multiple account pools.

For this scenario, Sarah from HR will create a project for the HR team, using the Corporate HR team profile that is associated with the HR account pool.

On the SageMaker portal, choose Create project.
Enter a name and optional description.
Choose the Corporate HR project profile.
Choose Continue.
For Account and AWS Region, choose the HR account.
Choose Continue.
Review the information and choose Create project.

You can view the successfully created project.

Clean up

To clean up resources, complete the following steps:

Delete the projects using the AWS CLI:

aws sagemaker delete-project --project-name <project-name>

Delete the account pools:

aws datazone delete-account-pool --domain-identifier <domain-id> --name <pool-name>

Conclusion

In this post, we discussed how account-agnostic project profiles can help organizations simplify and streamline the management of SageMaker project creation while maintaining enhanced security and governance features. To learn more about account-agnostic project profiles in SageMaker, refer to Account pools in Amazon SageMaker Unified Studio, and demo: account-agnostic project profile in Amazon SageMaker.

About the Authors

Zero-ETL: How AWS is tackling data integration challenges

2025-08-26 Nikki Rouda

Post Syndicated from Nikki Rouda original https://aws.amazon.com/blogs/big-data/zero-etl-how-aws-is-tackling-data-integration-challenges/

In this blog post, we show you how Amazon Web Services (AWS) is simplifying data integration with zero-ETL while realizing performance benefits and cost optimizations. As organizations gather data for analytics and AI, they are increasingly finding themselves caught in a complex web of extract, transform, and load (ETL) pipelines—the traditional backbone of data integration. While these pipelines still serve their purpose, they’ve also become a costly bottleneck, consuming valuable staff time and resources that could be better spent on innovation. Now, zero-ETL integrations are simplifying how businesses handle data integration. Zero-ETL can eliminate the need for complex data pipelines while still maintaining seamless data flow between your operational databases and analytics environments, including data warehouses, data lakes, and the combination of these into lakehouses.

Thousands of AWS customers have used zero-ETL to process petabytes of data with thousands of integrations. AWS customers are using integrations with services such as Amazon Aurora, Amazon Relational Database Service (Amazon RDS), Amazon Redshift, Amazon DynamoDB, and Amazon SageMaker, along with multiple third-party software as a service (SaaS) applications. These zero-ETL integrations are transforming data integration from a technical burden into a strategic advantage, so that businesses can focus on deriving actionable insights from their data.

The evolution of data integration

Traditionally, organizations have relied on ETL processes to move data between operational databases and analytics systems. This approach, while functional, presents several key challenges that can hinder an organization’s ability to derive timely insights from their data.

Building and maintaining ETL pipelines requires significant engineering resources, often diverting talent from core business initiatives. These pipelines need constant attention, updates, and optimization, creating an ongoing operational burden. As data volumes grow, updates happen faster, and schemas evolve, the complexity of these pipelines increases exponentially.

Pipeline failures can cause delays in data availability, impacting decision-making processes. When a pipeline breaks, it can take hours or even days to diagnose and fix the issue, during which time critical business decisions might be made with outdated information. This lag between data creation and availability for analysis can be a significant competitive disadvantage in fast-moving industries.

Complex transformations introduce potential points of failure, increasing the risk of data inconsistencies. Each transformation step is an opportunity for errors to creep in, whether through bugs in the transformation logic or unexpected edge cases in the data. Making sure of data quality and consistency across these transformations requires rigorous testing and validation processes.

Furthermore, as organizations add new data sources, the operational overhead of managing multiple pipelines increases exponentially. Each new source typically requires its own pipeline, complete with custom logic for extraction, transformation, and loading. This proliferation of pipelines can quickly become unwieldy, making it difficult to maintain a coherent data strategy across the organization.

How zero-ETL makes data accessible for analytics

AWS zero-ETL integrations provide automated, fully managed data replication from both AWS services and third-party applications to AWS data warehouses, data lakes, and lakehouses without requiring custom pipeline development. This innovative approach offers numerous benefits across several key areas, fundamentally changing how organizations approach data integration.

Simplified data architecture

Zero-ETL integrations offer low-code or no-code setup, which means that organizations can quickly establish data access and flows without specialized expertise. This democratization of data integration means that teams across the organization can set up and manage their own data integration, reducing bottlenecks and accelerating time-to-insight.

Zero-ETL integrations automatically handle data definition languages (DDLs), schema changes, and data type mapping, so that data in your analytics store is correct and complete. This data is immediately available for business consumption, helping to ensure consistency between source and target systems. This automatic mapping significantly reduces the risk of errors that can occur with manual mapping processes, helping to ensure that data types and structures are correctly translated between systems.

Built-in monitoring and error handling capabilities provide visibility into the replication process and help maintain data integrity. Administrators can set up alerts for specific conditions, such as replication lag or failed transfers, allowing for proactive management of the data integration process.

Zero-ETL integrations automatically handle full load and ongoing changes through change data capture (CDC) for quick access to the latest data. Organizations can use this dual capability to migrate existing data while also making sure that new data is continuously replicated, providing a seamless transition to the new integration model.

Near real-time analytics

With zero-ETL integrations, data is typically available in the target system within seconds or minutes of updates in the source system. This near real-time capability supports even high-volume transactional workloads, enabling timely insights for fast-moving businesses. For example, an ecommerce company can analyze purchase patterns almost immediately, enabling real-time inventory management and personalized recommendations.

The solution maintains consistent performance at scale, accommodating growing data volumes without degradation. As businesses grow and data volumes increase, the zero-ETL integration scales automatically, keeping performance consistent even as the demands on the system increase.

Built-in fault tolerance and recovery mechanisms help ensure high availability and data consistency. If an issue occurs during replication, manual or automatic retries of failed operations help resume from the last successful point, minimizing data loss and helping to ensure consistency between source and target systems.

Reduced operational burden

By eliminating the need for custom pipeline maintenance, zero-ETL integrations free up valuable engineering resources. Data engineers can focus on higher-value tasks such as data modeling, advanced analytics, and machine learning, rather than spending time on routine pipeline maintenance.

There is no additional infrastructure to manage, reducing complexity and cost. The zero-ETL integration runs on AWS-managed infrastructure, eliminating the need for customers to provision and manage servers, storage, or networking components for data integration.

The system automatically handles schema changes, adapting to evolving data structures without manual intervention. When a new column is added to a source table, for example, the zero-ETL integration will automatically detect this change and update the target schema accordingly, helping to ensure that the data remains in sync without any manual effort.

Native integration with AWS security controls helps ensure that data remains protected throughout the replication process. This includes support for encryption at rest and in transit, and integration with AWS Key Management Service (AWS KMS) for compliance with various regulatory standards.

Customer success with Zero-ETL

Since launch, zero-ETL integrations have seen rapid customer adoption. The versatility and benefits of zero-ETL integrations are demonstrated through diverse customer implementations across industries.

Yossi Shlomo, Director of Payment Systems Architecture at MassPay, a leading global payment solutions provider, stated, “Zero-ETL has been transformative for teams at MassPay. By using Amazon Aurora MySQL-Compatible Edition zero-ETL integration with Amazon Redshift, we’ve streamlined data flow from our core payment systems into analytics environments used for fraud detection, compliance case management, and business insights. This shift reduced latency by >90% and gives our teams near-instant access to critical data to optimize processes and decisions.” Because of this dramatic improvement in data freshness and availability, MassPay can make more timely and informed decisions, improving their service to customers and their competitive position in the market.

Available AWS service Integrations

AWS currently offers zero-ETL integrations designed to seamlessly connect popular AWS database services with Amazon Redshift, a fully managed data warehouse service. These include Amazon Aurora MySQL-Compatible, Amazon Aurora PostgreSQL-Compatible Edition, Amazon RDS for MySQL, and Amazon DynamoDB. This means that organizations can use the strengths of each service—the transactional capabilities of Aurora and Amazon RDS, the flexibility of DynamoDB, and the analytical power of Amazon Redshift—while minimizing the complexity of data movement between these systems.

Third-party integration support

Zero-ETL integrations have expanded beyond AWS services to support a wide range of third-party data too. AWS has zero-ETL integrations with sources including SAP OData, Salesforce, Salesforce Marketing Cloud Account Engagement, ServiceNow, Zendesk, and Zoho CRM, plus Facebook Ads and Instagram Ads. Targets include Amazon Redshift and a lakehouse with Amazon SageMaker.

Recent updates include:

Traditional relational databases from various vendors can also link to a lakehouse through zero-ETL integrations. This comprehensive support means that organizations can consolidate data from virtually any source into their AWS analytics environment without building custom integration pipelines. By using zero-ETL to break down data silos—even between multiple vendors’ solutions—and simplifying the data integration process, organizations can focus on deriving insights rather than managing complex data movements.

Additional integrations are in development to support more AWS services and data sources, further expanding the ecosystem. AWS is committed to continually expanding the range of zero-ETL integrations, responding to customer needs and evolving data landscapes.

Advanced features and capabilities of AWS zero-ETL

AWS zero-ETL capabilities include several sophisticated features that set them apart from other clouds. For example, by using the refresh interval control, you can customize how frequently data is synchronized, helping to ensure that analytics are based on data that is as current as necessary for each use case. Meanwhile, History Mode maintains historical versions of data, enabling trend analysis, insightful dashboards, and meeting audit requirements. You can also create type 2 slowly changing dimensions (SCD 2) tables in Amazon Redshift.

You can use the data filtering capabilities to selectively replicate specific objects and data subsets, optimizing storage use and focusing on the most relevant data. Comprehensive logging and monitoring features provide visibility into data movement and system health, so that administrators can quickly identify and address any issues.

You can also combine two primary integration approaches. Zero-ETL provides full data replication (movement) for comprehensive analytics in a central repository, complementing federation allows querying data in place when real-time access to source data is critical. You can use this flexibility to tailor your data integration strategy to your organization’s specific needs and use cases.

Getting started with zero-ETL

To begin using zero-ETL integrations, you should first identify your source database and target analytics service. This involves assessing your current data architecture and determining which data flows would benefit most from a zero-ETL approach.

Next, you need to configure the necessary permissions and networking requirements. This typically involves setting up either an AWS Identity and Access Management (IAM) identity or single sign-on using AWS IAM Identity Center and making sure that the source and target services can communicate securely.

As shown in the following image, after the prerequisites are in place, creating the integration is a click-through experience within the AWS Management Console. The intuitive interface guides you through the process, prompting you to specify source and target details, select tables for replication, and configure any additional options.

Salesforce objects for zero-ETL

After setup, you can monitor replication status and performance to help ensure optimal operation. AWS provides detailed metrics and logs to help you track the health and performance of your zero-ETL integrations.

For detailed setup instructions, visit the AWS documentation for zero-ETL integrations, which provides step-by-step guides for each supported integration.

What’s ahead for zero-ETL

AWS has an active roadmap for support of additional AWS services and data sources, expanding the reach of zero-ETL integrations so that more customers can benefit from simplified data integration across a broader range of use cases.

Zero-ETL integrations represent a fundamental shift in how organizations approach data integration. Without the complexity of ETL pipelines, customers can focus on deriving value from their data rather than managing infrastructure. This approach aligns with the AWS commitment to simplifying cloud operations and empowering customers to innovate faster.

To learn more about zero-ETL integrations and how they can benefit your organization, see the following topics:

For Aurora zero-ETL integrations, see Benefits, Key concepts, Limitations, Quotas, and Supported Regions of zero-ETL integrations
For Amazon RDS zero-ETL integrations, see Benefits, Key concepts, Limitations, Quotas, and Supported Regions of zero-ETL
For DynamoDB zero-ETL integrations, see DynamoDB zero-ETL integration with Amazon Redshift
For zero-ETL integrations with applications, see Zero-ETL integrations

Get started today and discover how you can streamline your data operations and unlock the full potential of your data with AWS zero-ETL integrations.

Nikki Rouda works in product marketing at AWS. He has many years experience across a wide range of IT infrastructure, storage, networking, security, IoT, analytics, and modern applications.

Guide to adopting Amazon SageMaker Unified Studio from ATPCO’s Journey

2025-08-18 Mitesh Patel

Post Syndicated from Mitesh Patel original https://aws.amazon.com/blogs/big-data/guide-to-adopting-amazon-sagemaker-unified-studio-from-atpcos-journey/

This blog post is co-written with Raj Samineni from ATPCO.

Launched at AWS re:Invent 2024, the next generation of Amazon SageMaker is expediting innovation for organizations such as ATPCO through a unified data management and tooling experience for analytics and AI use cases. This comprehensive service provides both technical and business users with Amazon SageMaker Unified Studio, a single data and AI development environment to discover the data and put it to work using familiar AWS tools. SageMaker Unified Studio offers a single governed environment to complete end-to-end development workflows, including data analysis, data processing, model training, generative AI application building, and more. It simplifies the creation of analytics and AI applications, fast-tracking the journey from raw data to actionable insights through its integrated data and tooling environment.

ATPCO is the backbone of modern airline retailing, helping airlines and third-party channels deliver the right offers to customers at the right time. ATPCO’s vision is to be the platform driving innovation in airline retailing while remaining a trusted partner to the airline ecosystem. ATPCO aims to support data-driven decision-making by making high-quality data discoverable by every business unit, with the appropriate governance on who can access what, and required tooling to support their needs. ATPCO addressed data governance challenges using Amazon DataZone. SageMaker Unified Studio, built on the same architecture as Amazon DataZone, offers additional capabilities, so users can complete various tasks such as building data pipelines using AWS Glue and Amazon EMR, or conducting analyses using Amazon Athena and Amazon Redshift query editor across diverse datasets, all within a single, unified environment.

In this post, we walk you through the challenges ATPCO addresses for their business using SageMaker Unified Studio. We start with the admin flow, a one-time setup process that lays the foundation for non-admin users in preparation for a company-wide rollout. When onboarding users from different business units to SageMaker Unified Studio, it’s crucial to make sure they have immediate access to their data sources such as Amazon Simple Storage Service (Amazon S3), AWS Glue Data Catalog, and Redshift tables as well as tools like Amazon EMR, AWS Glue, and Amazon Redshift that they already use. This helps users become productive swiftly and use the full potential of SageMaker Unified Studio. Next, we walk you through the developer flow, detailing how non-admin users can use SageMaker Unified Studio to access their data and act on it using their choice of tools.

“SageMaker Unified Studio has transformed how our teams access and collaborate on data. It’s the first time business and technical users can work together in a single, intuitive environment—no more tool switching or fragmented workflows.”
–Rajesh Samineni, Director of Data Engineering at ATPCO

ATPCO’s challenges

The implementation of SageMaker Unified Studio at ATPCO has been instrumental in addressing several critical challenges and unlocking new use cases across various business units within the organization. By building on the foundation laid by Amazon DataZone, ATPCO is helping users self-serve insights and fostering a culture of shared understanding and reusability of data assets, leading to more informed decision-making and a robust data culture.

SageMaker Unified Studio helped address the following challenges:

Data silos and discoverability – Analysts often struggled to locate the right data sources, verify data freshness, and maintain consistent definitions across different departments. By offering a single entry point for searching and subscribing to curated datasets, SageMaker Unified Studio minimizes these barriers. Integrated tools for data exploration, querying, and visualization, along with contextual metadata and lineage, builds trust in the data, making it straightforward for users to find and use the information they need.
Manual data handling – Teams relied heavily on manual exports and custom reports to gather insights, leading to inefficiencies and delays in decision-making. SageMaker Unified Studio helps users across departments, including product, sales, operations, and analytics, self-serve insights without manual intervention. This accelerates the decision-making process and helps teams focus on strategic initiatives rather than data collection.

Solution overview

The following diagram illustrates ATPCO’s architecture for SageMaker Unified Studio.

ATPCO-Solution-SMUS-AdminFlow-1

The following sections walk you through the steps that ATPCO went through to prepare the SageMaker Unified Studio environment for use by different personas in engineering and business units.

Prerequisites

If you’re new to SageMaker Unified Studio, you should first become familiar with concepts such as domains, domain units, projects, project profiles, blueprints, lakehouses, and catalogs before continuing with this post. For a company-wide rollout of SageMaker Unified Studio, it’s important to understand the foundation setup required as an admin user. For more information about the role of a SageMaker Unified Studio admin user and steps required to set up a SageMaker Unified Studio domain,refer to Foundational blocks of Amazon SageMaker Unified Studio: An admin’s guide to implement unified access to all your data, analytics, and AI. As an admin user, start with domain units and projects based on the need of different business units for the data and tooling.

Create domain units and set up projects with required tools

As an admin or root domain owner, you begin with the design of domain units and projects to organize different teams and users to their respective domain units. When non-admin users log in to the SageMaker Unified Studio portal, they should have seamless access to necessary AWS resources. These resources include the required tools and data sources to perform their job. Providing users access to these resources is critical for the successful adoption and utilization of SageMaker Unified Studio in your organization. ATPCO created separate domain units for engineering teams and non-engineering business units, as shown in the preceding architecture diagram. It only shows few examples. In reality, they have more domain units to meet their business needs, which we discuss in the following sections.

Data engineering domain

This domain unit has the Operational Metrics project, managed by the data engineering team, which supports a key backbone of visibility across the organization: understanding how ATPCO’s products perform in real time. Data engineers bring together signals from infrastructure, application logs, API monitoring, and internal systems to build aggregated, curated datasets that track latency, availability, adoption, and reliability. These operational metrics are published using SageMaker Unified Studio for consumption by other domains. Rather than fielding one-off requests or maintaining bespoke dashboards for different stakeholders, the engineering team now:

Builds reusable data assets that can be subscribed to one time and reused by many
Creates unified views of system health that are automatically updated and versioned
Supports other teams such as Product, Sales, and analysts with quick access to performance indicators in a format aligned with their needs

SageMaker Unified Studio becomes the center for operational intelligence, reducing duplication and making sure data engineers can focus on scale and automation rather than ticket-based support.

Analyst domain

The Data Exploration project in this domain unit serves the entire ATPCO community. Its purpose is to make available datasets regardless of their owning domain easily discoverable and ready for analysis. Previously, analysts struggled with locating the right data source, verifying its freshness, or aligning on consistent definitions. With SageMaker Unified Studio, those barriers are removed. The project provides:

A single entry point where users can search and subscribe to curated datasets
Integrated tools for exploration, query, and visualization
Contextual metadata and lineage to build trust in the data

Users in product, strategy, operations, or analytics can self-serve insights without waiting on manual exports or custom reports.

Sales domain

The Customer Profile project in this domain unit helps the Sales team understand which customers are actively engaging with ATPCO’s products, how they are using them, and where there might be opportunities to strengthen relationships. By using SageMaker Unified Studio, Sales team members can access the following:

Customer data sourced from CRM systems, including interaction history, product adoption, and support engagement
Operational metrics from the Data Engineering team, revealing which features are being used, how often, and whether the customer is experiencing reliability issues

With this combined insight, the Sales team can accomplish the following:

Identify high-value accounts for follow-up based on recent usage
Detect drop-off in engagement or technical issues before a customer raises a concern
Tailor outreach and proposals using objective data, not assumptions

All of this happens within SageMaker Unified Studio, reducing the time spent on manual data gathering and enabling more strategic, proactive customer engagement.

Onboard data sources to domain units and projects

Now that domain units and projects are created for different business units, the next step is to onboard existing Amazon S3 data sources, Data Catalog tables, and database tables available in Amazon Redshift. After logging in, users have access to the required data and tools. This required the ATPCO team to build the inventory to see which team has access to what data sources and what level of permissions are needed. For example, the Data Engineering team needs access to raw, processed and curated S3 buckets for building data processing jobs. They must also read and write to the Data Catalog, and prepare and write curated and aggregated data to the Redshift tables. The following sections guide you through configuring these various data sources within SageMaker Unified Studio, making sure users can access the data sources to continue their work in SageMaker Unified Studio.

Configure existing Amazon S3 data sources into SageMaker Unified Studio

To use an existing S3 bucket in SageMaker Unified Studio, configure an S3 bucket policy that allows the appropriate actions for the project AWS Identity and Access Management (IAM) role.

The Data Engineering team that owns the data processing pipeline must grant access to raw, processed, and curated S3 buckets to the data engineering project role. To learn more about using existing S3 buckets, refer to Access your existing data and resources through Amazon SageMaker Unified Studio, Part 2: Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon EMR.

Configure an existing Data Catalog into SageMaker Unified Studio

The next generation of SageMaker is built on a lakehouse architecture, which streamlines cataloging and managing permissions on data from multiple sources. Built on the Data Catalog and AWS Lake Formation, it organizes data through catalogs that can be accessed through an open, Apache Iceberg REST API to help enforce secure access to data with consistent, fine-grained access controls. SageMaker Lakehouse organizes data access through two types of catalogs: federated catalogs and managed catalogs (shown in the following figure). A catalog is a logical container that organizes objects from a data store, such as schemas, tables, views, or materialized views from Amazon Redshift. The following diagram illustrates this architecture.

ATPCO-Solution-SMUS-Catalog-2

ATPCO built a data lake on Amazon S3 using the Data Catalog and implemented data governance and fine-grained access control using Lake Formation. When developer users log in to SageMaker Unified Studio, they need access to the Data Catalog tables owned by their respective team. Existing Data Catalog databases are made available in SageMaker Lakehouse as a federated catalog because they’re created outside of SageMaker Lakehouse and not managed by it.

To access an existing Data Catalog, you must provide explicit permissions to SageMaker Unified Studio to be able to access the Data Catalog databases and tables. For more details, see Configure Lake Formation permissions for Amazon SageMaker Unified Studio. To onboard Data Catalog tables to SageMaker Lakehouse in SageMaker Unified Studio, the Lake Formation admin must grant access to specific Data Catalog database tables to the SageMaker Unified Studio project role. For more details, refer to Access your existing data and resources through Amazon SageMaker Unified Studio, Part 1: AWS Glue Data Catalog and Amazon Redshift. The Lake Formation permission model is the prerequisite to grant access to SageMaker Unified Studio. If Lake Formation is not the permission model for the Data Catalog, then you must register the S3 path and delegate the permission model to Lake Formation before it can be granted to the SageMaker Unified Studio project role. After you complete these steps, users of the project can access the Data Catalog database and are granted tables under the AwsDataCatalog namespace, and your tables will be visible in the Data Explorer (see the following screenshot). Your data is now ready for tagging, searching, enrichment, and data analysis.

Configure Redshift data into SageMaker Unified Studio

ATPCO relies on Amazon Redshift as their enterprise data warehouse and stores their aggregated data for insights and dashboarding. Users can combine the data from Amazon Redshift and SageMaker Lakehouse for unified data analysis in SageMaker Unified Studio without leaving SageMaker Unified Studio. For more information about how to add existing Redshift data sources, refer to Access your existing data and resources through Amazon SageMaker Unified Studio, Part 1: AWS Glue Data Catalog and Amazon Redshift.

After it’s connected, the Amazon Redshift compute engine becomes visible in the Data Explorer of your project. Project users can perform the following actions:

Write and run SQL queries directly against Amazon Redshift
Explore Redshift schemas and tables
Use Redshift tables to define SageMaker Unified Studio data sources
Combine Redshift data with metadata tagging, glossary linking, and publishing

ATPCO-Solution-SMUS-Compute-4

This doesn’t require copying or duplicating data. You’re using the data exactly where it lives in your Redshift cluster while benefiting from the collaborative features of SageMaker Unified Studio. Adding compute makes the data within the warehouse available to query inside the SageMaker Unified Studio query editor.

ATPCO-Solution-SMUS-DataExplore-5

Onboard users to their respective domain units and projects

Now that as an admin you have created the environments for different business units, your next step is to add domain owner users to the respective domain units. First, you must add domain and project owners’ users for them to get access to the SageMaker Unified Studio domain portal.

ATPCO-Solution-SMUS-Domain-6

Domain units make it possible to organize your assets and other domain entities under specific business units and teams. Domain unit owners can create policies such as membership, domain, and project creation.

ATPCO-Solution-SMUS-Owner-7

Domain unit owners can add one of the members as owner of the project so that when the owner user logs in, they can add other users of their team as an owner or contributor to the project. This helps other users get access to the projects when they login to SageMaker Unified Studio.

ATPCO-Solution-SMUS-members-8

Use the SageMaker Unified Studio environment

After the admin completes the required setup for different business units and onboardsproject members, users can log in to the portal and start using the preconfigured SageMaker Unified Studio environment. Users have access to respective data sources and tools as shown in the following developer flow diagram.

ATPCO-Solution-SMUS-DeveloperFlow-9

At ATPCO, developers must often combine data from various sources to perform extract, transform, and load (ETL) processes efficiently. In this section, we demonstrate how developers can benefit from the SageMaker unified lakehouse environment by seamlessly integrating data from both Amazon Redshift and the Data Catalog. Using PySpark within SageMaker Unified Studio notebooks, we read transactional data from Amazon Redshift and enrich it with metadata stored in AWS Glue backed S3 tables such as warehouse or product attributes. This integrated view supports complex transformations and aggregations across disparate sources without needing to move or duplicate data. By using native connectors and Spark’s distributed processing, users can join, filter, and analyze multi-source datasets efficiently and write the results back to Amazon Redshift for downstream analytics or dashboarding, all within a single, interactive lakehouse interface.

The following code snippet sets up a Spark session to directly query Amazon Redshift managed storage tables using the lakehouse architecture. It registers an AWS Glue backed Iceberg catalog (rmscatalog) that points to a specific Redshift lakehouse catalog and database, allowing Spark to read from and write to Redshift Iceberg tables. By enabling Iceberg extensions and linking the catalog to AWS Glue and Lake Formation, this setup provides seamless, scalable access to Amazon Redshift managed data using standard Spark SQL.

from pyspark.sql import SparkSession
from pyspark.sql.functions import count, avg, round as _round, col
catalog_name = "rmscatalog"
#Change <your_account_id> with your AWS account ID
rms_catalog_id = "<your_account_id>:rms-catalog-demo/dev"
#Change with your AWS region
aws_region="<region>"
spark = SparkSession.builder.appName('rms_demo') \
.config(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') \
.config(f'spark.sql.catalog.{catalog_name}.type', 'glue') \
.config(f'spark.sql.catalog.{catalog_name}.glue.id', rms_catalog_id) \
.config(f'spark.sql.catalog.{catalog_name}.client.region', aws_region) \
.config('spark.sql.extensions','org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions').getOrCreate()

ATPCO-Solution-SMUS-Code-10

=== Check for the tables and load them into dataframes
SHOW TABLES IN rmscatalog.salesdb

ATPCO-Solution-SMUS-Code-11

city_info_df = spark.table("rmscatalog.salesdb.city_info") 
carrier_info_df = spark.table("rmscatalog.salesdb.carrier_info")

The following step sets the active AWS Glue database to shopping_data and retrieves metadata for the shopping_data_catalog table using DESCRIBE EXTENDED. It filters for key properties like Provider, Location, and Table Properties to understand the table’s storage and configuration. Finally, it loads the entire table into a Spark DataFrame (shopping_data_df) for downstream processing.

# === Use Glue Catalog and Load Shopping Data ===
spark.sql("USE shopping_data")
# describing the glue table properties
desc_df = spark.sql("DESCRIBE EXTENDED shopping_data_catalog")
desc_df.filter("col_name IN ('Provider', 'Location', 'Table Properties')") \
.selectExpr("col_name AS Property", "data_type AS Value") \
.show(truncate=True)
shopping_data_df = spark.sql("SELECT * FROM shopping_data_catalog")

ATPCO-Solution-SMUS-Code-13

The following code shows how you can seamlessly combine and aggregate two disparate data sources, Amazon Redshift and the Data Catalog, within SageMaker Unified Studio. Using PySpark, we perform transformations and derive meaningful summaries across the unified view. This facilitates streamlined analysis and reporting without the need for complex data movement or duplication.

# == Join and Aggregate Data ===
shopping_with_cities_df = shopping_data_df \
.join(city_info_df.alias("origin_city"), shopping_data_df.origincitycode == col("origin_city.citycode"), "left") \
.join(city_info_df.alias("dest_city"), shopping_data_df.destinationcitycode == col("dest_city.citycode"), "left")
shopping_full_df = shopping_with_cities_df \
.join(carrier_info_df, col("validatingcarrier") == col("carrier_code"), "left")
result_df = shopping_full_df.groupBy("origin_city.region", "alliance") \
.agg(
count("*").alias("total_trips"),
_round(avg("totalamount"), 2).alias("avg_amount")
) \
.orderBy("total_trips", ascending=False)
result_df.show(10, truncate=False)

ATPCO-Solution-SMUS-Code-14

After the job runs, it writes the transformed dataset directly into a Data Catalog table that is Iceberg-compatible. This integration makes sure the data is stored in Amazon S3 with ACID transaction support, and also registered and tracked in the Data Catalog for unified governance, schema discovery, and downstream query access. The Iceberg table format organizes the data into Parquet files under a data/ directory and maintains rich versioned metadata in a metadata/ folder, supporting features like schema evolution, time travel, and partition pruning. This design facilitates scalable, reliable, and SQL-compatible analytics on modern data lakes.

ATPCO-Solution-SMUS-Code-15

ATPCO-Solution-SMUS-Data-File-16

The table becomes immediately available for querying through the Athena query editor, providing interactive access to fresh, transactional data without additional ingestion steps or manual registration.This approach streamlines the end-to-end data flow, from processing in Spark to interactive querying in Athena within the modern SageMaker Lakehouse environment.

ATPCO-Solution-SMUS-Query-Data-16

Conclusion

This post walked you through the steps to prepare a SageMaker Unified Studio environment for a company-wide rollout, using APTCO’s journey as an example. We covered the domain design and admin flow, which is a one-time setup to prepare the SageMaker Unified Studio environment for different teams in the organization who requires different levels of access to the data and tools. After the admin flow, we demonstrated the developer flow and how to use tools like a Jupyter notebook and SQL editor to use the data across different sources such as Amazon S3, the Data Catalog, and Redshift assets to perform a unified analysis.

Try out this solution and get started with SageMaker Unified Studio and modernize with the next generation of SageMaker. To learn more about SageMaker Unified Studio and how to get started, refer to the Amazon SageMaker Unified Studio Administrator Guide, and the latest AWS Big Data Blog posts.

About the authors

Mitesh Patel is a Principal Solutions Architect at AWS. His passion is helping customers harness the power of Analytics, Machine Learning, AI & GenAI to drive business growth. He engages with customers to create innovative solutions on AWS.

Nikki Rouda works in product marketing at AWS. He has many years experience across a wide range of IT infrastructure, storage, networking, security, IoT, analytics, and modern applications.

Raj Samineni is the Director of Data Engineering at ATPCO, leading the creation of advanced cloud-based data platforms. His work ensures robust, scalable solutions that support the airline industry’s strategic transformational objectives. By leveraging machine learning and AI, Raj drives innovation and data culture, positioning ATPCO at the forefront of technological advancement.

Saurabh Rawat is a Solution Architect at AWS with 13 years of experience working with enterprise data systems. He has designed and delivered large-scale, cloud-native solutions for customers across industries, with a focus on data engineering, analytics, and well-architected architectures. Over his career, he has helped organizations modernize their data platforms, optimize for performance, and cost, and adopt best practices for scalability and security. Outside of work, he is a passionate musician and enjoys playing with his band.

Integrate scientific data management and analytics with the next generation of Amazon SageMaker, Part 1

2025-08-05 Nadeem Bulsara

Post Syndicated from Nadeem Bulsara original https://aws.amazon.com/blogs/big-data/integrate-scientific-data-management-and-analytics-with-the-next-generation-of-amazon-sagemaker-part-1/

Our customers tell us that scientists are increasingly spending more time managing data-related challenges than focusing on science. The primary reason for this challenge is that scientific data comes in many types and is siloed across systems, groups, and stages, and scientists struggle to efficiently discover, access, share, and analyze datasets across silos. This fragmentation creates lengthy cycles full of manual interventions, leading to inefficiencies. Mapping data sources and negotiating access across silos can take 4–6 weeks, integrating datasets can extend to months, and fully connecting data from source to tooling can take years, if ever achieved. These data challenges reduce lab productivity and slow down scientific innovation, which decrease drug and product pipeline throughput, and ultimately delay time-to-market. The solution lies in breaking down data silos by creating digital environments that help scientists efficiently connect disparate datasets and analytical tools, so they can conduct iterative hypothesis and product testing without technology friction.

Part 1 of this series shows an example project in drug target identification where two groups of scientists need to collaborate as they integrate no-code knowledge searching, scientific data management, and sophisticated analytics. In this example, a computational biology team begins by mining the scientific literature on a knowledge search GUI. Next, they navigate to a data catalog to find and access relevant datasets, which they share with the data scientist team to run analytics with sophisticated tools (see the following figure). Although the end-to-end journey illustrates the benefits to a target identification example, the underlying data challenges and technology solution apply to any life sciences use case requiring the integration of data management and analytics. Details of the implementation and technical solution will be discussed in Part 2 of the series.

A flow diagram with a dark background starting with Scientific data. It shows people with stock images as example personas that use the data to derive insights.

Example use case

A computational biologist has been tasked with identifying a target for Non-Alcoholic Fatty Liver Disease (NAFLD). A typical question from the biologist might be “Can I find genes associated with NAFLD and do we have a patient cohort with variants in those genes?” The solution we designed for this use case involves three simple steps:

Search the scientific literature through a no-code interface to identify genomic variants associated with NAFLD.
Search an internal data catalog with natural language:
- Find datasets of interest, such as multi-omics and clinical data for patients associated with NAFLD.
- Request access to the relevant datasets.
Share relevant datasets with a data scientist collaborator for deeper analysis.

In designing this solution, we focused on the following features:

Providing no-code scientists with point-and-click and natural-language interfaces
Reducing silos with data findability, governance automation, and seamless collaboration
Providing technical personas with the sophisticated tools and environments they prefer

Solution overview

This solution uses the next generation of Amazon SageMaker, including Amazon SageMaker Unified Studio, an integrated data and AI development environment. SageMaker Unified Studio offers capabilities for data processing, SQL analytics, model development, and generative AI application development, built on existing AWS services. The next generation of SageMaker also includes Amazon SageMaker Catalog, which is built on Amazon DataZone, a data management service designed to streamline data discovery, data cataloging, data sharing, and governance. Your organization can have a single secure data hub where everyone in the organization can find, access, and collaborate on data across AWS, on premises, and even third-party sources.

SageMaker Catalog supports certain system asset types, such as tables from Amazon Redshift, tables from AWS Glue, and object collections from Amazon Simple Storage Service (Amazon S3). It also offers the ability to support custom asset types, which gives users flexibility to catalog data that can’t be categorized as a system asset type. For asset type S3ObjectCollectionType, see Implement a custom subscription workflow for unmanaged Amazon S3 assets published with Amazon DataZone. SageMaker Catalog also offers the ability to support custom asset types, which gives users flexibility to catalog data that can’t be categorized as a system asset type. For this example use case, we used AWS HealthOmics variant stores to store and allow querying of genomic variant data. This example lists HealthOmics variant stores as a custom asset type within the catalog. Details of the implementation and technical solution for access management will be discussed in Part 2 of the series.

In the example use case, a computational biologist, in order to identify a target for NAFLD, relies heavily on diverse datasets from multiple sources (genomic sequences, gene expression data, clinical records, and more). This data comes from both internal sources (first-party) and external partners or public databases (third-party). Multiple teams are responsible for collecting and processing this data before making it available to computational biologists, researchers, data scientists, and bioinformaticians within the organization.

In this solution, users (data engineers, data scientists, bioinformaticians, computational biologists) log in to a project-based environment from SageMaker Unified Studio with a preconfigured authentication method. A typical workflow involves the following steps:

Data stewards as authorized members of projects publish data assets into the SageMaker catalog.
Data consumers as authorized members of projects seeking to analyze data for their scientific needs find and discover available data assets of interest from the SageMaker catalog.
Data consumers request to subscribe to the relevant discovered data assets.
Data producers review and decide to approve or reject the subscription request.
Data consumers access and analyze the data using preconfigured tools from SageMaker Unified Studio.

The following diagram illustrates the solution architecture and workflow.

architecture diagram

In the following sections, we explore each step of the workflow in more detail.

Step 1: Data producers publish data assets

As shown in the preceding workflow diagram, data producers can use SageMaker Catalog to publish their datasets as data assets or data products with appropriate business (such as source, license, vendor, study identifier), scientific (such as disease name, cohort information, data modality, assay type), or technical (file types, data formats, file sizes) metadata. In our example use case, the data producers publish clinical data as AWS Glue tables and genomic variant data as a table within the HealthOmics variant store. Additionally, data producers can use AI-based recommendations to automatically populate descriptors, making it straightforward for consumers to find and understand its use.

Step 2: Data consumers find relevant datasets

Data consumers, such as data scientists and bioinformaticians, can log in to SageMaker Unified Studio and navigate to SageMaker Catalog to search for the appropriate data assets and products, such as “NAFLD Variants” or “NAFLD Clinical.” They can also find data assets or products using metadata filters such as study identifiers or disease names to discover the possible datasets associated with a study or disease.

Step 3: Data consumers subscribe to required data assets or products

After the data consumers see a data asset or data product of interest (for example, the clinical and genomics data for NAFLD), they can subscribe to them. Data consumers can also optionally include a comment in the subscription request to add more context to the request. This initiates the subscription workflow based on the asset type.

Step 4: Data producers review and approve the subscription request

Data producers get notified of subscription requests and review if access should be granted and approve accordingly. The response can optionally include a comment for reasoning and traceability. In addition, data producers can limit access to certain rows and columns to protect controlled data.

Step 5: Data consumers access the subscribed data assets or products

Upon approval from the data producer, the data consumer gets access to those data assets and can use them in the appropriate environments configured within their project. For example, data scientists can open a workspace with a JupyterLab notebook already available within SageMaker Unified Studio. Subsequently, the data scientist can start analyzing the tabular clinical and variant data that was just approved for access.

Conclusion

The next generation of SageMaker transforms how scientists work with data by creating an integrated data and analytics environment. In this unified environment, data producers are empowered to publish datasets with rich metadata. Data consumers are able to use the catalog within SageMaker Unified Studio to search for their required datasets, either using free text or using metadata and business glossary filters. Data consumers can subscribe to data securely, tap into powerful search capabilities using free text or metadata filters, and access essential analysis tools (Amazon Athena, JupyterLab IDE, Amazon EMR) directly. The result is a unified digital workspace that reduces communication bottlenecks, speeds up scientific cycles, and removes technical barriers. Scientists can now focus on what matters most—testing hypotheses and products, and scaling scientific innovation to production—within a unified, powerful platform. This streamlined approach accelerates data-driven science, enabling research institutions, pharmaceutical companies, and clinical laboratories to innovate more efficiently. For example, data scientists can launch a space with a JupyterLab notebook preinstalled.

Consider using the next generation of SageMaker to increase productivity within your organization. Contact your account representatives or an AWS Representative to learn how we can help accelerate your projects and your business.

About the authors

Nadeem Bulsara is a Principal Solutions Architect at AWS specializing in Genomics and Life Sciences. He brings his 13+ years of Bioinformatics, Software Engineering, and Cloud Development skills as well as experience in research and clinical genomics and multi-omics to help Healthcare and Life Sciences organizations globally. He is motivated by the industry’s mission to enable people to have a long and healthy life.

Chaitanya Vejendla is a Senior Solutions Architect specialized in DataLake & Analytics primarily working for Healthcare and Life Sciences industry division at AWS. Chaitanya is responsible for helping life sciences organizations and healthcare companies in developing modern data strategies, deploy data governance and analytical applications, electronic medical records, devices, and AI/ML-based applications, while educating customers about how to build secure, scalable, and cost-effective AWS solutions. His expertise spans across data analytics, data governance, AI, ML, big data, and healthcare-related technologies.

Dr. Mileidy Giraldo has over 20 years of experience bridging bioinformatics, research, and industry technology strategy. She specializes in making technology accessible for organizations in the life sciences sector. In her current role as WW Lead for Life Sciences Strategy and Lab of the Future at AWS, she helps biotechs, biopharma, and diagnostics organizations design Data & AI-driven initiatives that modernize labs and help scientists unlock the full value of their data.

Chris Clark is a Senior Solutions Architect focused on helping Life Science customers leverage AWS technology to advance their operational capabilities. With 20+ years of hands-on experience in life sciences manufacturing and supply chain, he combines deep industry knowledge with his AWS expertise to guide his customers. When he’s not working to solve customer challenges, he enjoys cycling and building and repairing things in his workshop.

Nick Furr is a Specialist Solutions Architect at AWS, supporting Data & Analytics for Healthcare and Life Sciences. He helps providers, payers, and life sciences organizations build secure, scalable data platforms to drive innovation and improve outcomes. His work focuses on modernizing data strategies through cloud analytics, governed data processing, and machine learning for use cases like clinical research and population health.

Subrat Das is a Principal Solutions Architect for Global Healthcare and Life Sciences accounts at AWS. He is passionate about modernizing and architecting complex customers workloads. When he’s not working on technology solutions, he enjoys long hikes and traveling around the world.

Develop and deploy a generative AI application using Amazon SageMaker Unified Studio

2025-08-04 Amit Maindola

Post Syndicated from Amit Maindola original https://aws.amazon.com/blogs/big-data/develop-and-deploy-a-generative-ai-application-using-amazon-sagemaker-unified-studio/

Picture this: You’re a financial analyst starting your Monday morning with a steaming cup of coffee, ready to review your investment portfolio. But instead of manually scouring dozens of news websites, financial reports, and industry analyses, you simply ask your AI assistant: “What global events happened over the weekend that might impact my technology stock holdings?” Within seconds, you receive a comprehensive analysis of relevant news, sentiment scores, and potential investment implications—all powered by a sophisticated generative AI application you built yourself.

This scenario isn’t science fiction; it’s the reality that modern financial professionals can create today. In an era where information moves at the speed of light and industry conditions can shift dramatically overnight, staying informed isn’t just an advantage—it’s essential for survival in competitive financial landscapes. The challenge lies in processing the overwhelming volume of global information that could impact investments while distinguishing reliable insights from noise.

Amazon SageMaker – Develop and scale AI use cases with the broadest set of tools

Luckily for us, technology is making this more straightforward. The next generation of Amazon SageMaker with Amazon SageMaker Unified Studio is a single data and AI development environment where you can find and access the data in your organization and act on it using the best tools across different use cases. SageMaker Unified Studio brings together the functionality and tools from existing AWS analytics and artificial intelligence and machine learning (AI/ML) services, including Amazon EMR , AWS Glue, Amazon Athena, Amazon Redshift , Amazon Bedrock, and Amazon SageMaker AI. From within SageMaker Unified Studio, you can ﬁnd, access, and query data and AI assets across your organization, then work together in projects to securely build and share analytics and AI artifacts, including data, models, and generative AI applications.

With SageMaker Unified Studio, you can efficiently build generative AI applications in a trusted and secure environment using Amazon Bedrock. You can choose from a selection of high-performing foundation models (FMs) and advanced customization capabilities like Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, Amazon Bedrock Agents, and Amazon Bedrock Flows. You can rapidly tailor and deploy generative AI applications and share with the built-in catalog for discovery.

What makes SageMaker Unified Studio particularly powerful for organizations is its integration with Amazon Bedrock Flows to build generative AI workflows, which is changing how organizations think about AI application development.

Amazon Bedrock Flows for generative AI application development

With Amazon Bedrock Flows, you can build and execute complex generative AI workflows without writing code, using an intuitive visual interface that democratizes AI development. This capability is transformative for organizations where speed, accuracy, and adaptability are paramount. It offers the following benefits:

Visual workflow development – Users can design AI applications by dragging and dropping components onto a canvas, making AI logic transparent and modifiable
Business logic flexibility – The service supports complex business logic through conditional branching, multi-path decision trees, and dynamic routing
Democratizing AI development – Business experts can directly contribute to AI application development without requiring extensive technical expertise
Seamless integration – Amazon Bedrock Flows integrates with FMs, knowledge bases, guardrails, and other AWS services
Reduced development complexity – The service handles infrastructure management and scaling through serverless execution and SDK APIs

Solution overview

In this post, we explore a financial use case, in which we want to stay on top of latest global events and determine our investment or financial exposure based on this. We can use a SageMaker Unified Studio flow application to pull in latest news summaries, derive sentiment based on news summary, and determine their effects on my investments. The following diagram illustrates this use case.

In the following sections, we show how to create a new project and build a flow application using a generative AI profile in SageMaker Unified Studio.

Prerequisites

For this walkthrough, you must have the following prerequisites:

An AWS account – If you don’t have an account, you can create one.
A SageMaker Unified Studio domain – For instructions, refer to Create an Amazon SageMaker Unified Studio domain – quick setup, and add a single sign-on (SSO) user for logging and set two-factor authentication. Finally, on the Connections tab, you can create a connection to a code repository.

A demo project – Create a demo project in your SageMaker Unified Studio domain. For instructions, see Create a project. For this example, we choose All capabilities in the project profile section, which includes the generative AI project profile enabled.

Create new project and build a flow application in SageMaker Unified Studio

In this section, we create a new a flow application that uses an Amazon Bedrock knowledge base to provide information about your personal portfolio. Complete the following steps:

In SageMaker Unified Studio, open the project you created as a prerequisite and choose Build and then Flow.

Drag Knowledge Base from Nodes to the design panel to add a knowledge base that will include the user’s investment portfolio and news articles and other information like earnings call transcripts, financial analyst reports, and so on.

Choose the Knowledge Base node and configure the knowledge base as follows:
Add a name for your knowledge base name (for example, portfolio…).
Choose the model (for example, Claude 3.5 Haiku).

Choose Create new Knowledge Base.
Enter a name for the knowledge base.
Select Project data source.
For Select a data source, choose the Amazon Simple Storage Service (Amazon S3) bucket location where you uploaded your data.
Choose Create.

The knowledge base creation process takes a few minutes to complete.

When the knowledge base is ready, choose Save to save it to the flow.

Choose My components, and on the options menu (three vertical dots), choose Sync to sync the knowledge base.

Make sure the S3 bucket has all the data (user portfolio data and latest news information data) before syncing the knowledge base.

We don’t provide any financial or news information data as part of this post. Upload current events or news data and investment portfolio data from your own data sources.

Test the flow application

After the knowledge base sync is complete, you can return to the flow application and ask questions. Using SageMaker Unified Studio flows, a financial analyst can provide a more personalized and customized financial outlook to their customers using rich internal financial information on their customer’s investment portfolio and latest publicly available current events and news information. The following are some example questions that you can ask to test the knowledge base:

Check if Tesla or Apple is in any of user's investment portfolio

Please check latest news information to provide information if Tesla has positive, negative or neutral outlook in the near future

Flow-based applications offer a visual approach to creating complex AI workflows. By chaining different nodes, each optimized for specific functions, you can create sophisticated solutions that are more reliable, maintainable, and efficient than single-prompt approaches. These flows allow for conditional logic and branching paths, mimicking human decision-making processes and enabling more nuanced responses based on context and intermediate results.

Clean up

To avoid ongoing charges in your AWS account, delete the resources you created during this tutorial:

Delete the project.
Delete the domain created as part of the prerequisites.

Conclusion

In this post, we demonstrated how to use Amazon Bedrock Flows in SageMaker Unified Studio to build a sophisticated generative AI application for financial analysis and investment decision-making without extensive coding knowledge. With this integration, you can create sophisticated financial analysis workflows through an intuitive visual interface, where you can process industry data, analyze news sentiment, and assess investment implications in real time. The solution integrates seamlessly with AWS services and FMs while providing essential features like automatic scaling, compliance controls, and audit capabilities. The implementation process involves setting up a SageMaker Unified Studio domain, configuring knowledge bases with portfolio and news data, and creating visual workflows that can analyze complex financial information. This democratized approach to AI development allows both technical and business teams to collaborate effectively, significantly reducing development time while maintaining the sophisticated capabilities needed for modern financial analysis.

To get started, explore the SageMaker Unified Studio documentation, set up a project in your AWS environment, and discover how this solution can transform your organization’s data analytics capabilities.

About the authors

Amit Maindola is a Senior Data Architect focused on data engineering, analytics, and AI/ML at Amazon Web Services. He helps customers in their digital transformation journey and enables them to build highly scalable, robust, and secure cloud-based analytical solutions on AWS to gain timely insights and make critical business decisions.

Arghya Banerjee is a Sr. Solutions Architect at AWS in the San Francisco Bay Area, focused on helping customers adopt and use the AWS Cloud. He is focused on big data, data lakes, streaming and batch analytics services, and generative AI technologies.

Melody Yang is a Principal Analytics Architect for Amazon EMR at AWS. She is an experienced analytics leader working with AWS customers to provide best practice guidance and technical advice in order to assist their success in data transformation. Her areas of interests are open-source frameworks and automation, data engineering and DataOps.

Gaurav Parekh is a Solutions Architect at AWS, specializing in generative AI and data analytics, with extensive experience building production AI systems on AWS.

Automate data lineage in Amazon SageMaker using AWS Glue Crawlers supported data sources

2025-07-30 Mohit Dawar

Post Syndicated from Mohit Dawar original https://aws.amazon.com/blogs/big-data/automate-data-lineage-in-amazon-sagemaker-using-aws-glue-crawlers-supported-data-sources/

The next generation of Amazon SageMaker is the center for all your data, analytics, and AI. Bringing together widely adopted Amazon Web Services (AWS) machine learning (ML) and analytics capabilities, it delivers an integrated experience for analytics and AI with unified access to all your data. From Amazon SageMaker Unified Studio, a single data and AI development environment, you can access your data and use a suite of powerful tools for data processing, SQL analytics, model development, training and inference, and generative AI development.

With data lineage, now part of Amazon SageMaker Catalog, you can centralize lineage metadata of your data assets in a single place. You can track the flow of data over time, determining a clear understanding of where it originated, how it has changed, and its usage across the business. By providing this level of transparency, data lineage helps data consumers gain trust that the data is correct and compliant for their use cases. With data lineage captured at the table, column, and job level, data producers can conduct impact analysis of changes in their data pipelines and respond to data issues when needed, for example, when a column in the resulting dataset is missing the quality required by the business.

Data lineage is a powerful tool that can transform how organizations understand and manage their data flows. In this post, we explore its real-world impact through the lens of an ecommerce company striving to boost their bottom line.

To illustrate this practical application, we walk you through how you can use the prebuilt integration between SageMaker Catalog and AWS Glue crawlers to automatically capture lineage for data assets stored in Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB. Using this workflow, you can capture lineage automatically from additional data sources using AWS Glue crawlers. Refer to the Data lineage support matrix in the SageMaker Unified Studio User Guide for supported sources. We also use SageMaker Unified Studio to navigate these data assets and learn about their origin, transformations, and dependencies, thanks to the lineage metadata captured using the AWS Glue crawlers.

Key features of the SageMaker Catalog lineage graph

In SageMaker Unified Studio, you can explore and discover data assets of your organization suited for your use case. As you dive into these data assets, you can learn more about its business context, schema, quality, and lineage. When you decide to work with a subset of these assets, you can subscribe to them in a self-service fashion and start working with them. For more detail, visit Data discovery, subscription, and consumption in the SageMaker Unified Studio User Guide.

SageMaker Studio provides a visual lineage graph that shows how a data asset has evolved from its source through transformations to its final state. This helps data scientists, engineers, and analysts answer key questions such as:

Where did this data come from?
What transformations has it gone through?
Which downstream assets will be impacted by a change?

With this level of visibility, teams can perform faster impact analysis, find the root cause of data quality issues, and ensure models are built on trusted data. It also supports better collaboration so users can confidently use and share data across the organization. The following screenshot shows how SageMaker Unified Studio visualizes data lineage, making it straightforward to trace data flow and understand dependencies.

Column-level lineage – You can expand column-level lineage when available in dataset nodes. This automatically shows relationships with upstream or downstream dataset nodes if source column information is available.
Column search – If the dataset has more than 10 columns, the node presents pagination to navigate to columns not initially presented. To quickly view a particular column, you can search on the dataset node that lists only the searched column.
Details pane – Each lineage node captures and displays the following details:
- Every dataset node has three tabs: LINEAGE, SCHEMA, and HISTORY. The HISTORY tab lists the different versions of lineage event captured for that node.
- The job node has a details pane to display job details with the tabs Job info and History. The details pane also captures queries or expressions run as part of the job.
View dataset nodes only – If you want to filter out the job nodes, you can choose the open view control icon in the graph viewer and toggle the display dataset nodes only, which will remove all the job nodes from the graph and let you navigate only the dataset nodes.
Version tabs – All lineage nodes in Amazon DataZone data lineage will have versioning, captured as history, based on lineage events captured. You can view lineage at a selected timestamp that opens a new tab on the lineage page to help compare or contrast between the different timestamps.

You can try some of these features as you explore the data assets of this post. To learn more on data lineage in SageMaker, we encourage you to dive deep into the Data lineage in Amazon SageMaker Unified Studio.

Solution overview

Imagine a scenario where an ecommerce company aims to optimize conversion rates and enhance customer experience by gaining deeper insights into the customer journey. They need to connect the dots between user interactions and actual purchases, but with data scattered across multiple sources, where do they begin? This is where data lineage becomes invaluable. To perform their analysis, they need data from two primary sources:

Clickstream data stored in Amazon S3 (in JSON or Parquet format)
Transactional order data stored as items in Amazon DynamoDB

To make these datasets discoverable across the business, you need to:

Create a project in SageMaker Unified Studio that will be used to source and manage the datasets
Enable data lineage capture in the SageMaker Unified Studio project
Set up the resources for this use case, which includes an AWS Glue data source (set up in SageMaker Unified Studio) and AWS Glue crawler (set up in AWS Glue)
Run the AWS Glue crawler to catalog the datasets in AWS Glue Data Catalog
Source the metadata of the data assets into the SageMaker Catalog by running the data source
Use SageMaker Unified Studio to navigate through the lineage of the data assets and visualize their origin
Understand how schema evolution is captured in the data asset’s lineage

Prerequisites

To complete the steps on this post, you need an SageMaker Unified Studio domain already deployed in your AWS account. To get started quickly in a testing environment, we suggest creating your SageMaker domain using the quick setup option as explained in Create an Amazon SageMaker Unified Studio domain – quick setup.

Solution steps

To capture data lineage for AWS Glue tables managed with AWS Glue crawlers using SageMaker Unified Studio, complete the steps in the following sections.

Set up a SageMaker project with SQL capability

In SageMaker Unified Studio, a project profile defines an uber template for projects in your Amazon SageMaker unified domain. By setting up a project with the right tooling (project profile), you will provision resources you can use to work with data, which might include cataloging it in SageMaker, transforming it into new data assets, analyzing it to drive business value, or even use it for ML or AI applications.

To demonstrate data lineage effectively, we use SageMaker SQL analytics project profile for a streamlined setup. Although this profile offers comprehensive data analytics capabilities, we focus specifically on two key components:

AWS Glue database – A lakehouse for storing and managing technical metadata
Data source job – Automatically collects and tracks metadata into SageMaker Catalog

We’ve chosen this profile to bypass complex manual configurations so we can focus on the core concepts of data lineage.

To create a new project in your SageMaker domain using the SQL analytics project profile, follow the steps detailed in SQL analytics project profile. Keep all default configurations when creating the project.

After creating your project in SageMaker Studio, you’ll unlock powerful data lineage capabilities that make tracking and understanding your data flows intuitive. Through the data sourcing feature, you can easily monitor how data moves from source to the AWS Glue database. This visibility becomes particularly valuable when debugging data issues—you can quickly trace data back to its source, understand how changes impact downstream processes, and identify affected analyses or reports. Next, populate the AWS Glue database with sample data to observe these features in action and demonstrate how they can streamline your data operations.

For further guidance on how to access the details of the new SageMaker project, refer to Get project details. After you access the data source details, in the Database name field, take note of the AWS Glue database name associated to the SageMaker project.

Enable data lineage capture in the SageMaker project’s data source

To enable lineage capture, follow these steps:

Expand the Actions menu, then choose Edit data source.
Go to the connections and select Import data lineage to configure lineage capture from the source, as shown in the following screenshot.
Make other changes to the data source fields as desired, then choose Save.

Enabling lineage will make sure the data source job will capture lineage in the next run.

Deploy resources for the use case

Follow these steps:

To deploy the resources required for this post, download the AWS CloudFormation template amazon-datazone-examples in the AWS Samples GitHub repository. Deploy it in your AWS account.

For further guidance on how to deploy a CloudFormation stack, refer to Create a stack from the CloudFormation console. You need to provide a Stack name and the name of the AWS GlueDatabaseName associated to the project of your SageMaker domain, as shown in the following screenshot.

Choose Next.

The template will deploy the following resources:

A S3 bucket with a sample file of clickstream data. The bucket name and location of the file will follow the path pattern s3://ecomm-analytics-<ACCOUNT_ID>-<REGION>/clickstream/<YYYY>/<MM>/<DD>/data.json. The file will contain a sample record with the following structure:

{
    "session_id": "abc123",
    "user_id": "u789",
    "event_type": "product_view",
    "product_id": "prod456",
    "timestamp": "2025-06-04T09:23:12Z"
}

A DynamoDB table with a sample item of order data (transactions). The table will be named OrderTransactionTable. The sample item will have the following structure:

{
    "order_id": "ord789",
    "user_id": "u789",
    "product_id": "prod456",
    "order_total": 79.99,
    "order_timestamp": "2025-06-04T09:27:10Z"
}

An AWS Glue crawler configured to crawl the S3 bucket and DynamoDB table deployed as part of the stack and store the metadata in the AWS Glue database associated to the SageMaker project. You can access the crawler’s details in the AWS console, as shown in the following screenshot.

Two AWS Identity and Access Management (IAM) roles for loading the source data into Amazon S3 and DynamoDB and to use when running the AWS Glue crawler.

Run the AWS Glue crawler

The AWS Glue crawler deployed in the previous step will allow you to capture metadata from the two data sources, Amazon S3 and DynamoDB, and store it in AWS Glue Data Catalog, specifically in the database associated to the SageMaker project. After the metadata is stored, it will be accessible to SageMaker.

Before running the crawler, you need to provide AWS Lake Formation permissions to the IAM role that the AWS Glue crawler will use to interact with your data source and target AWS Glue database. The following command will grant the permissions needed for the crawler to store metadata into the AWS Glue database of the SageMaker project.

To invoke this command, we recommend using AWS CloudShell on the AWS console as explained in AWS CloudShell Concepts. Update the <REGION>, <ACCOUNT_ID> and <GLUE_DATABASE_NAME> placeholders with the right values for your AWS Region, AWS account ID, and name of the AWS Glue database associated to the SageMaker project.

aws lakeformation grant-permissions \
  --region  \
  --principal DataLakePrincipalIdentifier=arn:aws:iam:::role/glue-crawler-role \ 
  --permissions CREATE_TABLE \
  --resource '{ "Database": { "Name": "" } }'

Next, run the AWS Glue Crawler on the AWS console. After the crawler successfully finishes, two new tables, clickstream and ordertransactiontable, will be created in the AWS Glue database associated to the SageMaker project. Refer to Viewing crawler results and details to learn more about AWS Glue crawler results.

Source metadata from the AWS Glue database into SageMaker

To source metadata from data assets in the AWS Glue database, including their lineage, into SageMaker, use the data source that was deployed as part of the SageMaker project creation.

To run the data source, go to the data source details page.
Choose Run. (Data sources can be scheduled to run as well, however, for this demonstration we trigger a manual run).

After the data source run is complete, metadata from both data assets in the AWS Glue database will be imported into the SageMaker domain as the project’s inventory assets. You can find the details of the data source run from within SageMaker Unified Studio, which include:

The data assets from the AWS Glue database that were ingested into SageMaker.
The status of the data lineage import for each data asset, which includes an event ID for traceability. This lineage event ID can be used to debug inconsistencies in the resulting lineage graph. You can use the GetLineageEvent API to retrieve the raw payload of the lineage event.

Visualizing the data lineage graph of the data assets in SageMaker Unified Studio

With SageMaker Unified Studio, you have a single place to manage and discover data assets. When accessing a data asset published in the SageMaker central catalog or in your project’s own inventory, you can dive into the asset’s metadata, which includes its schema, business description, custom metadata forms, quality, lineage, and more. To visualize the lineage graph of each data asset of this post, follow these steps:

In SageMaker Studio, navigate to the Assets section of the SageMaker project details page and choose INVENTORY
Select the asset that you want to explore. You can also access the asset directly from the data source run by selecting the asset name.
To view the lineage graph of the data asset up to its origin, shown in the following screenshots, choose the LINEAGE tab.
- For clickstream table (Sourced from S3)

- For order transactions table (Sourced from DynamoDB)

With lineage, you can now confirm that the data originated from sources such as Amazon S3 and Amazon DynamoDB and understand how it has been transformed along the way. Because of this end-to-end visibility, you can trust the data, make informed decisions, and provide compliance with confidence. The lineage graph captures essential metadata that forms the foundation of lineage tracking.

This includes table schemas, column definitions and their data types.
Column-level lineage becomes particularly powerful in this context. Imagine your clickstream’s AWS Glue table powers an Amazon QuickSight dashboard analyzing customer purchase patterns and notice discrepancies in your revenue reports. With column lineage, you can instantly trace the source of those columns.
This granular visibility not only accelerates debugging but also proves invaluable during schema changes, as we show in the following section by changing the source schema.
The crawler details such as crawlerRunId (present in the source identifier of the lineage node) and crawler start and end times can be used to debug which crawler runs updated the table.

Understanding your data asset’s schema evolution through lineage in SageMaker Unified Studio

Imagine the order transactions source in DynamoDB was updated with new information. Because this source powers an Amazon QuickSight report for the customer using the AWS Glue database table, it’s important for consumers to know what changes in the data pipeline updated the report.

Edit the DynamoDB table item with additional columns to learn how lineage graph can be used to view historical updates:

{
    "order_id": "ord789",
    "user_id": "u789",
    "product_id": "prod456",
    "order_total": 79.99,
    "order_timestamp": "2025-06-04T09:27:10Z",
	"customerSegment": "new-customer",
    "conversionSource": "primeDayEmailCampaign"
}

Enter the OrderTransactionsCrawler Glue crawler again on the AWS console. After completion, you’ll notice that it updated the ordertransactiontable AWS Glue table, as shown in the following screenshot.

Run again the data source associated to the project in SageMaker Unified Studio to import the latest metadata into the SageMaker Catalog. After completion, you’ll notice the data source updated the ordertransactiontable data asset in the SageMaker Catalog, as shown in the following screenshot.

This section explores how lineage can be useful to track the updates.

Navigate to the ordertransactiontable data asset in SageMaker Catalog by selecting it from the data source run and choose the LINEAGE tab, as shown in the following screenshot.

Notice how the new columns are available in the lineage graph. A new crawler run ID is present as the source identifier of the crawler lineage node. The history tab shows multiple crawler runs. You can navigate to check the state of the system during the first run.

Cleanup

After you’re done, we recommend to cleaning up the resources created for this post to avoid unintended charges:

Delete the inventory assets that were cataloged in the SageMaker project’s inventory, as explained in Delete an Amazon SageMaker Unified Studio asset.
Delete the SageMaker project that was created as part of this post, as explained in Delete a project.
Delete the CloudFormation stack that was deployed as part of this post, as explained in Delete a stack from the CloudFormation console.
The S3 bucket created as part of the CloudFormation stack will remain after its deletion because it contains a data file in it. Empty and delete the bucket, as explained in Deleting a general purpose bucket.

Conclusion

In this post, you were able to explore the data lineage capabilities of Amazon SageMaker, specifically when working with AWS Glue crawlers. You learned how you can set up an AWS Glue crawler to infer metadata from data assets in multiple sources such as Amazon S3 and DynamoDB and store it the AWS Glue Data Catalog. You also imported this metadata, including data lineage, into Amazon SageMaker through the data source capability of a SageMaker project. Finally, you explored the resulting lineage graph of data assets in SageMaker Unified Studio and saw some of the functionalities available to understand the origin path of them, understand how columns are transformed, and what impact looks like when performing changes to any step of the pipeline.We encourage you to now test the capabilities you explored in this post with your own data. By following the pattern presented in this post, many customers have been able to achieve governance of their data lake and lakehouse platforms on top of Amazon SageMaker with data lineage and more.

About the authors

Mohit Dawar is a Senior Software Engineer at Amazon Web Services (AWS) working on Amazon DataZone. Over the past 3 years, he has led efforts around the core metadata catalog, generative AI–powered metadata curation, and lineage visualization. He enjoys working on large-scale distributed systems, experimenting with AI to improve user experience, and building tools that make data governance feel effortless. Connect with him on LinkedIn: Mohit Dawar.

Jose Romero is a Senior Solutions Architect for Startups at Amazon Web Services (AWS) based in Austin, TX, US. He is passionate about helping customers architect modern platforms at scale for data, AI, and ML. As a former senior architect in AWS Professional Services, he enjoys building and sharing solutions for common complex problems so that customers can accelerate their cloud journey and adopt best practices. Connect with him on LinkedIn: Jose Romero.

Secure generative SQL with Amazon Q

2025-07-25 Gregory Knowles

Post Syndicated from Gregory Knowles original https://aws.amazon.com/blogs/big-data/secure-generative-sql-with-amazon-q/

Amazon Q generative SQL brings generative AI capabilities to help speed up deriving insights from your Amazon Redshift data warehouses and AWS Glue Data Catalog, generating SQL for Amazon Redshift or Amazon Athena. With Amazon Q, you get SQL commands generated with your context. This means you can focus on deriving insights faster, rather than having to first learn potentially complex schemas. Without generative SQL, your data analysts might have to frequently switch between different types of SQL, which can further slow analysis down. Amazon Q generative SQL can help by generating SQL statements from natural language and speeding up development. This can help onboard analysts faster and improve analyst productivity. The generative SQL experience is available through Amazon SageMaker Unified Studio and Amazon Redshift Query Editor v2.

To scale the use of generative SQL in production scenarios, you need to consider how relevant and accurate SQL is generated. In doing so, it’s important to understand what data is used and how your information is protected. Amazon Q generative SQL is designed to keep your data secure and private. Your queries, data, and database schemas are not used to train generative AI foundation models (FMs). For more information, see Considerations when interacting with Amazon Q generative SQL.

In the post Write queries faster with Amazon Q generative SQL for Amazon Redshift, we provided general advice around getting started with generative SQL. In this post, we discuss the design and security controls in place when using generative SQL and its use in both SageMaker Unified Studio and Amazon Redshift Query Editor v2.

Solution overview

Generating relevant SQL requires context from your data warehouse or data catalog schemas. Your analysts can ask free text or natural language questions in the Amazon Q chat window and have SQL statements returned that reference your tables and columns. It’s important that the generated SQL is consistent with your schema so that it can find the most relevant fields to answer questions and generate queries that accurately reference data. In SageMaker Unified Studio or Amazon Redshift Query Editor v2, when the Amazon Q chat window is open, database metadata that is viewable under the connection context is made available to Amazon Q for SQL generation. This means that only the schema information that the connecting user can access is used. Tables or database objects the user doesn’t have access to are excluded.

When a user submits questions in the Amazon Q chat window, a search algorithm is used to find the most relevant context from the available database schema metadata information. This context is combined with the user’s question and used as a prompt to a large language model (LLM) to generate a SQL statement. The supporting information is cached so that your data source doesn’t need to be queried every time a user initiates SQL generation. Instead, data source metadata will be periodically refreshed if it remains in use, or you can trigger a manual refresh. If the data is not being used, Amazon Q will automatically delete it. Where applicable, the information used to support SQL generation is encrypted with an AWS Key Management Service (AWS KMS) customer managed KMS key where one has been specified in the SageMaker Unified Studio or Amazon Redshift Query Editor v2 settings. Otherwise, an AWS managed key is used. Your information is encrypted in transit and at rest.

The following diagram shows the process flow for SQL generation when using SageMaker Unified Studio or the Amazon Redshift Query Editor and using Amazon Redshift or Data Catalog source data.

Process diagram for SQL generation

The Amazon Q generative SQL process can be summarized as the following steps:

A user interacts with the Amazon Q chat pane through SageMaker Unified Studio or the Amazon Redshift Query Editor.
The SQL chat frontend sends the prompt along with the connection configuration to Amazon Q.
Amazon Q uses the connection context to retrieve information that will support SQL generation if this data is not already available.
Amazon Q encrypts the retrieved information under the appropriate AWS managed or customer managed KMS key. The information is subsequently decrypted on retrieval.
The information is stored along with custom context information, if this has been provided.
Relevant context from the combined information is selected and added to the user’s questions and sent to an LLM to generate a SQL statement, which is returned to the user.
The user can decide whether to run the statement and can provide feedback on usefulness and accuracy.

Additional context to enhance SQL generation

You can provide further context to supplement the database schema information, which can help improve the accuracy and relevancy of the generated SQL.

One option is to provide custom context. Custom context gives the option to specify instructions and extra information, such as descriptions of tables and columns. These descriptions can then be used to help the selection of relevant tables and attributes when generating SQL statements. This is particularly relevant when your schema uses more obscure naming that might not directly relate to business terms or uses non-standard abbreviations. For example, consider a table called sls_r1_2024. With custom context, you can add a table description specifying that, for example, the table includes sales information across stores in the US region for the calendar year 2024. This information can help the LLM generate SQL referencing the correct tables. The same approach can be applied to columns within the table. Your custom context is encrypted using a customer managed KMS key if one has been specified (during Amazon Redshift Query Editor account creation or SageMaker Unified Studio project creation) or an AWS managed key otherwise.

You can also introduce constraints using custom context. For example, you can explicitly include or exclude specific schemas, tables, or columns from SQL query generation. Similarly, specific topics can also be disallowed, such as not generating SQL statements to support financial reporting. For more details about the information that can be supplied, refer to Custom context.

Another option is to grant SQL query history access to the user establishing the connection. This information is then also made available to enhance SQL generation and to provide the LLM with examples of relevant queries. Be aware that granting wider SQL query history access to the connecting user, and therefore also the generative SQL workflow, allows viewing of queries over tables or objects the user might not have access to. Furthermore, string literals might be present in historic statements that might contain sensitive information. To help mitigate this risk, you could instead use the CuratedQueries section of custom context to provide predefined question and answer examples, without exposing all user queries.

Generated statement response

Before a SQL statement is returned to the user, Amazon Q tries to detect syntax issues. This step helps improve the likelihood that only valid SQL syntax is returned. Amazon Q will use the available information for the user to return statements that align with user permissions, to reduce scenarios where users can’t run generated statements. For example, if you have given access to SQL query history information, then the SQL generation step might produce a query statement referencing a table that the user asking the question doesn’t have access to. Amazon Q minimizes the occurrence of this scenario by assessing if the generated SQL aligns with user permissions and updating the statement if not. User permissions are not bypassed through the use of Amazon Q generative SQL. If a statement was returned referencing a table the user doesn’t have access to, the authorization applied to the user will enforce access control when the statement is executed.

Statements generated by Amazon Q that could potentially change your database, such as DML or DDL statements, are returned with a warning. The warning highlights to the user that running the statement could potentially modify the database. Again, these statements are only executable if the user has the required permissions.

Prerequisites

Amazon Q generative SQL works with your Redshift data warehouses and Data Catalog tables. To get started, you should have data available in either or both of these environments. To use Amazon Q generative SQL with your AWS Glue tables, you need a SageMaker Unified Studio domain. Within your domain, you can use the Amazon Q chat integration to ask questions of your data and have SQL generated. This also works for Amazon Redshift data sources available in the domain. You can use Amazon Q generative SQL without a SageMaker Unified Studio domain using the Amazon Redshift Query Editor. Access to the editor enables Amazon Q chat integration against your Amazon Redshift data sources.

Enable Amazon Q generative SQL

You can control access to generative SQL at the account-Region level in the Amazon Redshift Query Editor or at the SageMaker Unified Studio domain level. To enable this feature, an account admin must explicitly turn on Amazon Q generative SQL. By default, the feature is not accessible to your users. Administrators that have permission for the sqlworkbench:UpdateAccountQSqlSettings AWS Identity and Access Management (IAM) action can turn the Amazon Q generation SQL feature on or off through the admin window, as illustrated in the following sections. When turned off, this will restrict users from opening the Amazon Q chat pane and help prevent interaction with generative SQL.

Enable Amazon Q in your SageMaker domain

To enable Amazon Q in your SageMaker domain, you can navigate to the Amazon Q tab on the domain settings page and choose to enable the service. For more information, see Amazon Q in Amazon SageMaker Unified Studio.

Enable Amazon Q in SageMaker Unified Studio domain

Enable Amazon Q in Amazon Redshift

To enable Amazon Q generative SQL from the Amazon Redshift Query Editor, access the Amazon Q generative SQL settings. This requires the administrator to have the sqlworkbench:UpdateAccountQSqlSettings permission in their IAM policy. For more information, see Updating generative SQL settings as an administrator.

Enabling Amazon Q generative SQL from Redshift query editor

With generative SQL enabled at the account-Region level, you can restrict access to specific users with IAM controls. IAM administrators can build IAM policies that allow or deny access to the action sqlworkbench:GetQSqlRecommendations. For more information, refer to Actions, resources, and condition keys for AWS SQL Workbench. Policies can then be associated with IAM users or roles to control access to SQL generation at a more granular level. An appropriately scoped service control policy (SCP) can be used to limit access to SQL generation to specific accounts within your organization if required.

The following is an example policy denying access to use SQL generation:

{
"Version": "2012-10-17",
    "Statement": [
        {
"Sid": "DenyAccessToAmazonQGenerativeSql",
            "Effect": "Deny",
            "Action": [
                "sqlworkbench:GetQSqlRecommendations"
            ],
            "Resource": "*",
        }
    ]
}

Cross-Region inference

Amazon Q Developer uses cross-Region inference to distribute traffic across different AWS Regions, which provides increased throughput and resilience during high demand periods, improved performance, and access to the latest Amazon Q Developer capabilities.

When a request is made from an Amazon Q Developer profile, it is kept within the Regions in the same geography as the original data. Although this doesn’t change where the data is stored, the requests and output results might move across Regions during the inference process. Data is encrypted when transmitted across Amazon’s network. For more information on cross-Region inference, see Cross-region processing in Amazon Q Developer.

Monitoring

To monitor which IAM users or roles are interacting with generative SQL, you can use AWS CloudTrail. CloudTrail monitors API calls and logs which identities have performed particular actions. When a user first asks a question, a CloudTrail event is emitted called IngestQSqlMetadata. This is a result of Amazon Q starting the metadata ingest process. Ingestion is an asynchronous operation, so there might be a series of GetQSqlMetadataStatus events. This is due to the workflow checking the ingestion process status.

After the workflow has completed successfully, each question sees a GetQSqlRecommendation event. This is the result of users submitting questions and triggering generation of SQL statements. The following is an example CloudTrail event for GetQSqlRecommendation. In this example, Amazon Q emits detailed CloudTrail events highlighting the warehouse being queried, IAM principal calling Amazon Q, and the entire response structure from Amazon Q in responseElements:

{
    "eventVersion": "1.09",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "AROA123456789EXAMPLE:demouser",
        "arn": "arn:aws:sts::111122223333:assumed-role/DemoUser",
        "accountId": "111122223333",
        "accessKeyId": "ASIAIOSFODNN7EXAMPLE",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "AROA123456789EXAMPLE",
                "arn": "arn:aws:iam::111122223333:role/DemoUser",
                "accountId": "111122223333",
                "userName": "DemoUser"
            },
            "attributes": {
                "creationDate": "2025-01-17T05:31:01Z",
                "mfaAuthenticated": "false"
            }
        }
    },
    "eventTime": "2025-01-17T05:34:51Z",
    "eventSource": "sqlworkbench.amazonaws.com",
    "eventName": "GetQSqlRecommendation",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "122.171.17.139",
    "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:133.0) Gecko/20100101 Firefox/133.0",
    "requestParameters": {
        "dbConfig": {
            "database": "sample_data_dev"
        },
        "databaseConfiguration": {
            "redshiftConfig": {
                "clusterIdentifier": "redshift-cluster-1",
                "database": "sample_data_dev"
            }
        },
        "prompt": "HIDDEN_DUE_TO_SECURITY_REASONS",
        "clientToken": "HIDDEN_DUE_TO_SECURITY_REASONS",
        "logConfig": {},
        "sqlworkbenchConnectionArn": "arn:aws:sqlworkbench:us-east-1:111122223333:connection/47ahg61-ce0b-4646-831b-a140ea4055ae"
    },
    "responseElements": {
        "data": {
            "extractionErrors": false,
            "guardRails": {
                "isDml": false
            },
            "sqlStatement": "HIDDEN_DUE_TO_SECURITY_REASONS",
            "syntaxErrors": "HIDDEN_DUE_TO_SECURITY_REASONS"
        },
        "logSessionId": "623318ad-dbcc-4f69-ae08-f85d1b63a70f",
        "questionId": "623318ad-dbcc-4f69-ae08-f85d1b63a70f",
        "originalQuestionId": "623318ad-dbcc-4f69-ae08-ae08asd1a"
    },
    "requestID": "623318ad-dbcc-4f69-ae08-f85d1b63a70f",
    "eventID": "ac2c1932-49b1-41b3-a1af-20fa4461cf7d",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "111122223333",
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.3",
        "cipherSuite": "TLS_AES_128_GCM_SHA256",
        "clientProvidedHostHeader": "qsql.sqlworkbench.us-east-1.amazonaws.com"
    },
    "sessionCredentialFromConsole": "true"
}

Conclusion

In this post, we discussed the Amazon Q generative SQL workflow. We highlighted the process around using your schema context alongside metadata such as historic SQL queries and custom context. Using this metadata allows the generation of relevant SQL that helps accelerate your analyst’s productivity. Although it’s important to assist analysts, it’s also imperative to make sure data remains secure and protected. To support this, generative SQL uses only the data the connected user has access to. This helps prevent exposure to information beyond their authorization.When you’re looking to increase the relevance of generated SQL through sharing additional query history, it’s important to consider the trade-off of exposing additional information to the user. Deciding your approach here should take into account the domain context of the data and the possible exposure of metadata the user doesn’t have access to, or potentially sensitive information that might appear in query strings. Keeping these considerations in mind can help you achieve the appropriate security posture for your workloads.

To get started with Amazon Q generative SQL, see Write queries faster with Amazon Q generative SQL for Amazon Redshift and Interacting with Amazon Q generative SQL.

About the authors

Gregory Knowles is a data and AI specialist solution architect at AWS, focusing on the UK public sector. With extensive experience in cloud-based architectures, Greg guides public sector customers in implementing modern data solutions. His expertise spans governance, analytics, and AI/ML. Greg’s passion lies in accelerating transformation and innovation to improve productivity and outcomes. He has successfully led projects that moved data systems into the cloud, adopted new data architectures, and implemented AI at scale in production.

Abhinav Tripathy is a Software Engineer and Security Guardian at AWS, where he develops Amazon Q generative SQL by combining machine learning, databases, and web systems. Abhinav is passionate about building scalable web systems from scratch that solve real customer challenges. Outside of work, he enjoys traveling, watching soccer, and playing badminton.

Erol Murtezaoglu is a Technical Product Manager at AWS, is an inquisitive and enthusiastic thinker with a drive for self-improvement and learning. He has a strong and proven technical background in software development and architecture, balanced with a drive to deliver commercially successful products. Erol highly values the process of understanding customer needs and problems, in order to deliver solutions that exceed expectations.

AWS AI League: Learn, innovate, and compete in our new ultimate AI showdown

2025-07-17 Elizabeth Fuentes

Post Syndicated from Elizabeth Fuentes original https://aws.amazon.com/blogs/aws/aws-ai-league-learn-innovate-and-compete-in-our-new-ultimate-ai-showdown/

Since 2018, AWS DeepRacer has engaged over 560,000 builders worldwide, demonstrating that developers learn and grow through competitive experiences. Today, we’re excited to expand into the generative AI era with AWS Artificial Intelligence (AI) League.

This is a unique competitive experience – your chance to dive deep into generative AI regardless of your skill level, compete with peers, and build solutions that solve actual business problems through an engaging, competitive experience.

With AWS AI League, your organization hosts private tournaments where teams collaborate and compete to solve real-world business use cases using practical AI skills. Participants craft effective prompts and fine-tune models while building powerful generative AI solutions relevant for their business. Throughout the competition, participants’ solutions are evaluated against reference standards on a real-time leaderboard that tracks performance based on accuracy and latency.

The AWS AI League experience starts with a 2-hour hands-on workshop led by AWS experts. This is followed by self-paced experimentation, culminating in a gameshow-style grand finale where participants showcase their generative AI creations addressing business challenges. Organizations can set up their own AWS AI League within half a day. The scalable design supports 500 to 5,000 employees while maintaining the same efficient timeline.

Supported by up to $2 million in AWS credits and a $25,000 championship prize pool at AWS re:Invent 2025, the program provides a unique opportunity to solve real business challenges.

AWS AI League transforms how organizations develop generative AI capabilities
AWS AI League transforms how organizations develop generative AI capabilities by combining hands-on skills development, domain expertise, and gamification. This approach makes AI learning accessible and engaging for all skill levels. Teams collaborate through industry-specific challenges that mirror real organizational needs, with each challenge providing reference datasets and evaluation standards that reflect actual business requirements.

Customizable industry-specific challenges – Tailor competitions to your specific business context. Healthcare teams work on patient discharge summaries, financial services focus on fraud detection, and media companies develop content creation solutions.
Integrated AWS AI stack experience – Participants gain hands-on experience with AWS AI and ML tools, including Amazon SageMaker AI, Amazon Bedrock, and Amazon Nova, accessible from Amazon SageMaker Unified Studio. Teams work through a secure, cost-controlled environment within their organization’s AWS account.
Real-time performance tracking – The leaderboard evaluates submissions against established benchmarks and reference standards throughout the competition, providing immediate feedback on accuracy and speed so teams can iterate and improve their solutions. During the final round, this scoring includes expert evaluation where domain experts and a live audience participate in real-time voting to determine which AI solutions best solve real business challenges.

AWS AI League offers two foundational competition tracks:
- Prompt Sage – The Ultimate Prompt Battle – Race to craft the perfect AI prompts that unlock breakthrough solutions. whether you detect financial fraud or streamlining healthcare workflows, every word counts as they climb the leaderboard using zero-shot learning and chain-of-thought reasoning.
- Tune Whiz – The Model Mastery Showdown – Generic AI models meet their match as you sculpt them into industry-specific powerhouses. Armed with your domain expertise and specialized questions, competitors fine-tune models that speak your business language fluently. Victory goes to who achieve the perfect balance of blazing performance, lightning efficiency, and cost optimization.

As Generative AI continues to evolve, AWS AI League will regularly introduce new challenges and formats in addition to these tracks.

Get started today
Ready to get started? Organizations can host private competitions by applying through the AWS AI League page. Individual developers can join public competitions at AWS Summits and AWS re:Invent.

PS: Writing a blog post at AWS is always a team effort, even when you see only one name under the post title. In this case, I want to thank Natasya Idries, for her generous help with technical guidance, and expertise, which made this overview possible and comprehensive.

— Eli

Streamline the path from data to insights with new Amazon SageMaker Catalog capabilities

2025-07-16 Donnie Prakoso

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/streamline-the-path-from-data-to-insights-with-new-amazon-sagemaker-capabilities/

Modern organizations manage data across multiple disconnected systems—structured databases, unstructured files, and separate visualization tools—creating barriers that slow analytics workflows and limit insight generation. Separate visualization platforms often create barriers that prevent teams from extracting comprehensive business insights.

These disconnected workflows prevent your organizations from maximizing your data investments, creating delays in decision making and missed opportunities for comprehensive analysis that combines multiple data types.

Starting today, you can use three new capabilities in Amazon SageMaker to accelerate your path from raw data to actionable insights:

Amazon QuickSight integration – Launch Amazon QuickSight directly from Amazon SageMaker Unified Studio to build dashboards using your project data, then publish them to the Amazon SageMaker Catalog for broader discovery and sharing across your organization.
Amazon SageMaker adds support for Amazon S3 general purpose buckets and Amazon S3 Access Grants in SageMaker Catalog– Make data stored in Amazon S3 general purpose buckets easier for teams to ﬁnd, access, and collaborate on all types of data including unstructured data, while maintaining ﬁne-grained access control using Amazon S3 Access Grants.
Automatic data onboarding from your lakehouse – Automatic onboarding of existing AWS Glue Data Catalog (GDC) datasets from the lakehouse architecture into SageMaker Catalog, without manual setup.

These new SageMaker capabilities address the complete data lifecycle within a unified and governed experience. You get automatic onboarding of existing structured data from your lakehouse, seamless cataloging of unstructured data content in Amazon S3, and streamlined visualization through QuickSight—all with consistent governance and access controls.

Let’s take a closer look at each capability.

Amazon SageMaker and Amazon QuickSight Integration
With this integration, you can build dashboards in Amazon QuickSight using data from your Amazon SageMaker projects. When you launch QuickSight from Amazon SageMaker Unified Studio, Amazon SageMaker automatically creates the QuickSight dataset and organizes it in a secured folder accessible only to project members.

Furthermore, the dashboards you build stay within this folder and automatically appear as assets in your SageMaker project, where you can publish them to the SageMaker Catalog and share them with users or groups in your corporate directory. This keeps your dashboards organized, discoverable, and governed within SageMaker Unified Studio.

To use this integration, both your Amazon SageMaker Unified Studio domain and QuickSight account must be integrated with AWS IAM Identity Center using the same IAM Identity Center instance. Additionally, your QuickSight account must exist in the same AWS account where you want to enable the QuickSight blueprint. You can learn more about the prerequisites on Documentation page.

After these prerequisites are met, you can enable the blueprint for Amazon QuickSight by navigating to the Amazon SageMaker console and choosing the Blueprints tab. Then find Amazon QuickSight and follow the instructions.

You also need to configure your SQL analytics project profile to include Amazon QuickSight in Add blueprint deployment settings.

To learn more on onboarding setup, refer to the Documentation page.

Then, when you create a new project, you need to use the SQL analytics profile.

With your project created, you can start building visualizations with QuickSight. You can navigate to the Data tab, select the table or view to visualize, and choose Open in QuickSight under Actions.

This will redirect you to the Amazon QuickSight transactions dataset page and you can choose USE IN ANALYSIS to begin exploring the data.

When you create a project with the QuickSight blueprint, SageMaker Unified Studio automatically provisions a restricted QuickSight folder per project where SageMaker scopes all new assets—analyses, datasets, and dashboards. The integration maintains real-time folder permission sync, keeping QuickSight folder access permissions aligned with project membership.

Amazon Simple Storage Service (S3) general purpose buckets integration
Starting today, SageMaker adds support for S3 general purpose buckets in SageMaker Catalog to increase discoverability and allows granular permissions through S3 Access Grants, enabling users to govern data, including sharing and managing permissions. Data consumers, such as data scientists, engineers, and business analysts, can now discover and access S3 assets through SageMaker Catalog. This expansion also enables data producers to govern security controls on any S3 data asset through a single interface.

To use this integration, you need appropriate S3 general purpose bucket permissions, and your SageMaker Unified Studio projects must have access to the S3 buckets containing your data. Learn more about prerequisites on Amazon S3 data in Amazon SageMaker Unified Studio Documentation page.

You can add a connection to an existing S3 bucket.

When it’s connected, you can browse accessible folders and create discoverable assets by choosing on the bucket or a folder and selecting Publish to Catalog.

This action creates a SageMaker Catalog asset of type “S3 Object Collection” and opens an asset details page where users can augment business context to improve search and discoverability. Once published, data consumers can discover and subscribe to these cataloged assets. When data consumers subscribe to “S3 Object Collection” assets, SageMaker Catalog automatically grants access using S3 Access Grants upon approval, enabling cross-team collaboration while ensuring only the right users have the right access.

When you have access, now you can process your unstructured data in Amazon SageMaker Jupyter notebook. Following screenshot is an example to process image in medical use case.

If you have structured data, you can query your data using Amazon Athena or process using Spark in notebooks.

With this access granted through S3 Access Grants, you can seamlessly incorporate S3 data into my workflows—analyzing it in notebooks, combining it with structured data in the lakehouse and Amazon Redshift for comprehensive analytics. You can access unstructured data such as documents, images in JupyterLab notebooks to train ML models, or generate queryable insights.

Automatic data onboarding from your lakehouse
This integration automatically onboards all your lakehouse datasets into SageMaker Catalog. The key benefit for you is to bring AWS Glue Data Catalog (GDC) datasets into SageMaker Catalog, eliminating manual setup for cataloging, sharing, and governing them centrally.

This integration requires an existing lakehouse setup with Data Catalog containing your structured datasets.

When you set up a SageMaker domain, SageMaker Catalog automatically ingests metadata from all lakehouse databases and tables. This means you can immediately explore and use these datasets from within SageMaker Unified Studio without any configuration.

The integration helps you to start managing, governing, and consuming these assets from within SageMaker Unified Studio, applying the same governance policies and access controls you can use for other data types while unifying technical and business metadata.

Additional things to know
Here are a couple of things to note:

Availability – These integrations are available in all commercial AWS Regions where Amazon SageMaker is supported.
Pricing – Standard SageMaker Unified Studio, QuickSight, and Amazon S3 pricing applies. No additional charges for the integrations themselves.
Documentation – You can find complete setup guides in the SageMaker Unified Studio Documentation.

Get started with these new integrations through the Amazon SageMaker Unified Studio console.

Happy building!
— Donnie

Introducing Amazon S3 Vectors: First cloud storage with native vector support at scale (preview)

2025-07-16 Channy Yun (윤석찬)

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/introducing-amazon-s3-vectors-first-cloud-storage-with-native-vector-support-at-scale/

Today, we’re announcing the preview of Amazon S3 Vectors, a purpose-built durable vector storage solution that can reduce the total cost of uploading, storing, and querying vectors by up to 90 percent. Amazon S3 Vectors is the first cloud object store with native support to store large vector datasets and provide subsecond query performance that makes it affordable for businesses to store AI-ready data at massive scale.

Vector search is an emerging technique used in generative AI applications to find similar data points to given data by comparing their vector representations using distance or similarity metrics. Vectors are numerical representation of unstructured data created from embedding models. You generate vectors using embedding models for fields inside your document and store vectors into S3 Vectors to search semantically.

S3 Vectors introduces vector buckets, a new bucket type with a dedicated set of APIs to store, access, and query vector data without provisioning any infrastructure. When you create an S3 vector bucket, you organize your vector data within vector indexes, making it simple for running similarity search queries against your dataset. Each vector bucket can have up to 10,000 vector indexes, and each vector index can hold tens of millions of vectors.

After creating a vector index, when adding vector data to the index, you can also attach metadata as key-value pairs to each vector to filter future queries based on a set of conditions, for example, dates, categories, or user preferences. As you write, update, and delete vectors over time, S3 Vectors automatically optimizes the vector data to achieve the best possible price-performance for vector storage, even as the datasets scale and evolve.

S3 Vectors is also natively integrated with Amazon Bedrock Knowledge Bases, including within Amazon SageMaker Unified Studio, for building cost-effective Retrieval-Augmented Generation (RAG) applications. Through its integration with Amazon OpenSearch Service, you can lower storage costs by keeping infrequent queried vectors in S3 Vectors and then quickly move them to OpenSearch as demands increase or to support real-time, low-latency search operations.

With S3 Vectors, you can now economically store the vector embeddings that represent massive amounts of unstructured data such as images, videos, documents, and audio files, enabling scalable generative AI applications including semantic and similarity search, RAG, and build agent memory. You can also build applications to support a wide range of industry use cases including personalized recommendations, automated content analysis, and intelligent document processing without the complexity and cost of managing vector databases.

S3 Vectors in action
To create a vector bucket, choose Vector buckets in the left navigation pane in the Amazon S3 console and then choose Create vector bucket.

Enter a vector bucket name and choose the encryption type. If you don’t specify an encryption type, Amazon S3 applies server-side encryption with Amazon S3 managed keys (SSE-S3) as the base level of encryption for new vectors. You can also choose server-side encryption with AWS Key Management Service (AWS KMS) keys (SSE-KMS). To learn more about managing your vector bucket, visit S3 Vector buckets in the Amazon S3 User Guide.

Now, you can create a vector index to store and query your vector data within your created vector bucket.

Enter a vector index name and the dimensionality of the vectors to be inserted in the index. All vectors added to this index must have exactly the same number of values.

For Distance metric, you can choose either Cosine or Euclidean. When creating vector embeddings, select your embedding model’s recommended distance metric for more accurate results.

Choose Create vector index and then you can insert, list, and query vectors.

To insert your vector embeddings to a vector index, you can use the AWS Command Line Interface (AWS CLI), AWS SDKs, or Amazon S3 REST API. To generate vector embeddings for your unstructured data, you can use embedding models offered by Amazon Bedrock.

If you’re using the latest AWS Python SDKs, you can generate vector embeddings for your text using Amazon Bedrock using following code example:

# Generate and print an embedding with Amazon Titan Text Embeddings V2.
import boto3 
import json 

# Create a Bedrock Runtime client in the AWS Region of your choice. 
bedrock= boto3.client("bedrock-runtime", region_name="us-west-2") 

The text strings to convert to embeddings.
texts = [
"Star Wars: A farm boy joins rebels to fight an evil empire in space", 
"Jurassic Park: Scientists create dinosaurs in a theme park that goes wrong",
"Finding Nemo: A father fish searches the ocean to find his lost son"]

embeddings=[]
#Generate vector embeddings for the input texts
for text in texts:
        body = json.dumps({
            "inputText": text
        })    
        # Call Bedrock's embedding API
        response = bedrock.invoke_model(
        modelId='amazon.titan-embed-text-v2:0',  # Titan embedding model 
        body=body)   
        # Parse response
        response_body = json.loads(response['body'].read())
        embedding = response_body['embedding']
        embeddings.append(embedding)

Now, you can insert vector embeddings into the vector index and query vectors in your vector index using the query embedding:

# Create S3Vectors client
s3vectors_client = boto3.client('s3vectors', region_name='us-west-2')

# Insert vector embedding
s3vectors.put_vectors( vectorBucketName="channy-vector-bucket",
  indexName="channy-vector-index", 
  vectors=[
{"key": "v1", "data": {"float32": embeddings[0]}, "metadata": {"id": "key1", "source_text": texts[0], "genre":"scifi"}},
{"key": "v2", "data": {"float32": embeddings[1]}, "metadata": {"id": "key2", "source_text": texts[1], "genre":"scifi"}},
{"key": "v3", "data": {"float32": embeddings[2]}, "metadata": {"id": "key3", "source_text":  texts[2], "genre":"family"}}
],
)

#Create an embedding for your query input text
# The text to convert to an embedding.
input_text = "List the movies about adventures in space"

# Create the JSON request for the model.
request = json.dumps({"inputText": input_text})

# Invoke the model with the request and the model ID, e.g., Titan Text Embeddings V2. 
response = bedrock.invoke_model(modelId="amazon.titan-embed-text-v2:0", body=request)

# Decode the model's native response body.
model_response = json.loads(response["body"].read())

# Extract and print the generated embedding and the input text token count.
embedding = model_response["embedding"]

# Performa a similarity query. You can also optionally use a filter in your query
query = s3vectors.query_vectors( vectorBucketName="channy-vector-bucket",
  indexName="channy-vector-index",
  queryVector={"float32":embedding},
  topK=3, 
  filter={"genre":"scifi"},
  returnDistance=True,
  returnMetadata=True
  )
results = query["vectors"]
print(results)

To learn more about inserting vectors into a vector index, or listing, querying, and deleting vectors, visit S3 vector buckets and S3 vector indexes in the Amazon S3 User Guide. Additionally, with the S3 Vectors embed command line interface (CLI), you can create vector embeddings for your data using Amazon Bedrock and store and query them in an S3 vector index using single commands. For more information, see the S3 Vectors Embed CLI GitHub repository.

Integrate S3 Vectors with other AWS services
S3 Vectors integrates with other AWS services such as Amazon Bedrock, Amazon SageMaker, and Amazon OpenSearch Service to enhance your vector processing capabilities and provide comprehensive solutions for AI workloads.

Create Amazon Bedrock Knowledge Bases with S3 Vectors
You can use S3 Vectors in Amazon Bedrock Knowledge Bases to simplify and reduce the cost of vector storage for RAG applications. When creating a knowledge base in the Amazon Bedrock console, you can choose the S3 vector bucket as your vector store option.

In Step 3, you can choose the Vector store creation method either to create an S3 vector bucket and vector index or choose the existing S3 vector bucket and vector index that you’ve previously created.

For detailed step-by-step instructions, visit Create a knowledge base by connecting to a data source in Amazon Bedrock Knowledge Bases in the Amazon Bedrock User Guide.

Using Amazon SageMaker Unified Studio
You can create and manage knowledge bases with S3 Vectors in Amazon SageMaker Unified Studio when you build your generative AI applications through Amazon Bedrock. SageMaker Unified Studio is available in the next generation of Amazon SageMaker and provides a unified development environment for data and AI, including building and texting generative AI applications that use Amazon Bedrock knowledge bases.

You can choose your knowledge bases using the S3 Vectors created through Amazon Bedrock when you build generative AI applications. To learn more, visit Add a data source to your Amazon Bedrock app in the Amazon SageMaker Unified Studio User Guide.

Export S3 vector data to Amazon OpenSearch Service
You can balance cost and performance by adopting a tiered strategy that stores long-term vector data cost-effectively in Amazon S3 while exporting high priority vectors to OpenSearch for real-time query performance.

This flexibility means your organizations can access OpenSearch’s high performance (high QPS, low latency) for critical, real-time applications, such as product recommendations or fraud detection, while keeping less time-sensitive data in S3 Vectors.

To export your vector index, choose Advanced search export, then choose Export to OpenSearch in the Amazon S3 console.

Then, you will be brought to the Amazon OpenSearch Service Integration console with a template for S3 vector index export to OpenSearch vector engine. Choose Export with pre-selected S3 vector source and a service access role.

It will start the steps to create a new OpenSearch Serverless collection and migrate data from your S3 vector index into an OpenSearch knn index.

Choose the Import history in the left navigation pane. You can see the new import job that was created to make a copy of vector data from your S3 vector index into the OpenSearch Serverless collection.

Once the status changes to Complete, you can connect to the new OpenSearch serverless collection and query your new OpenSearch knn index.

To learn more, visit Creating and managing Amazon OpenSearch Serverless collections in the Amazon OpenSearch Service Developer Guide.

Now available
Amazon S3 Vectors, and its integrations with Amazon Bedrock, Amazon OpenSearch Service, and Amazon SageMaker are now in preview in the US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Frankfurt), and Asia Pacific (Sydney) Regions.

Give S3 Vectors a try in the Amazon S3 console today and send feedback to AWS re:Post for Amazon S3 or through your usual AWS Support contacts.

— Channy

Introducing Jobs in Amazon SageMaker

2025-07-15 Chiho Sugimoto

Post Syndicated from Chiho Sugimoto original https://aws.amazon.com/blogs/big-data/introducing-jobs-in-amazon-sagemaker/

Processing large volumes of data efficiently is critical for businesses, and so data engineers, data scientists, and business analysts need reliable and scalable ways to run data processing workloads. The next generation of Amazon SageMaker is the center for all your data, analytics, and AI. Amazon SageMaker Unified Studio is a single data and AI development environment where you can find and access all of the data in your organization and act on it using the best tools across any use case.

We’re excited to announce a new data processing job experience for Amazon SageMaker. Jobs are a common concept widely used in existing AWS services such as Amazon EMR and AWS Glue. With this launch, you can now build jobs in SageMaker to process large volumes of data. Jobs can be built using your preferred tool. For example, you can create jobs from extract, transform, and load (ETL) scripts coded in the Unified Studio code editor, code interactively in a Unified Studio Notebooks, or create jobs visually using the Unified Studio Visual ETL editor. After being created, data processing jobs can be set to run on demand, scheduled using the built in scheduler, or orchestrated with SageMaker workflows. You can monitor the status of your data processing jobs and view run history showing status, logs, and performance metrics. When jobs encounter failures, you can use generative AI troubleshooting to automatically analyze errors and receive detailed recommendations to resolve issues quickly. Together, you can use these capabilities to author, manage, operate, and monitor data processing workloads across your organization. The new experience provides an experience that’s consistent with other AWS analytics services such as AWS Glue.

This post demonstrates how the new jobs experience works in SageMaker Unified Studio.

Prerequisites

To get started, you must have the following prerequisites in place:

An AWS account
A SageMaker Unified Studio domain
A SageMaker Unified Studio project with an Data analytics and AI-ML model development project profile

Example use case

A global apparel ecommerce retailer processes thousands of customer reviews daily across multiple marketplaces. They need to transform their raw review data into actionable insights to improve their product offerings and customer experience. Using SageMaker Unified Studio visual ETL editor, we’ll demonstrate how to transform raw review data into structured analytical datasets that enable market-specific performance analysis and product quality monitoring.

Create and run a visual job

In this section, you’ll create a Visual ETL Job that processes the review data from a Parquet file in Amazon Simple Storage Service Amazon S3. The job transforms the data using SQL queries and saves the results back to S3 buckets. Complete the following steps to create a Visual ETL Job:

On the SageMaker Unified Studio console, on the top menu, choose Build.
Under DATA ANALYSIS & INTEGRATION, choose Data processing jobs.
Choose Create Visual ETL Job.

You’ll be directed to the Visual ETL editor, where you can create ETL jobs. You can use this editor to design data transformation pipelines by connecting source nodes, transformation nodes, and target nodes.

On the top left, choose the plus (+) icon in the circle. Under Data sources, select Amazon S3.
Select the Amazon S3 source node and enter the following values:
1. S3 URI: s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Apparel/
2. Format: Parquet
Select Update node.
Choose the plus (+) icon in the circle to the right of the Amazon S3 source node. Under Transforms, select SQL query.
Enter the following query statement and select Update node.

SELECT
    marketplace,
    star_rating,
    DATE_FORMAT(review_date, 'yyyy-MM-dd') as review_date,
    COUNT(*) as review_count,
    AVG(CAST(helpful_votes as DOUBLE) / NULLIF(total_votes, 0)) as helpfulness_ratio,
    COUNT(CASE WHEN insight = 'Y' THEN 1 END) as insight_count
FROM {myDataSource}
GROUP BY
    marketplace,
    star_rating,
    DATE_FORMAT(review_date, 'yyyy-MM-dd')

Choose the plus (+) icon to the right of the SQL Query node. Under Data target, select Amazon S3.
Select the Amazon S3 target node and enter the following values:
1. S3 URI: Choose the Amazon S3 location from the project overview page and add the suffix “/output/rating_analysis/”. For example, s3://<bucket-name>/<domainId>/<projectId>/output/rating_analysis/
2. Format: Parquet
3. Compression: Snappy
4. Partition keys: review_date
5. Mode: Append
Select Update node.

Next, add another SQL query node connected to the same Amazon S3 data source. This node performs a SQL query transformations and outputs the results to a separate S3 location.

On the top left, choose the plus (+) icon in the circle. Under Transforms, select SQL query, and connect the Amazon S3 source node.
Enter the following query statement and select Update node.

SELECT 
    marketplace,
    product_id,
    product_title,
    COUNT(*) as review_count,
    AVG(star_rating) as avg_rating,
    SUM(helpful_votes) as total_helpful_votes,
    COUNT(DISTINCT customer_id) as unique_reviewers,
    COUNT(CASE WHEN insight = 'Y' THEN 1 END) as insight_count
FROM {myDataSource}
GROUP BY 
    marketplace,
    product_id,
    product_title

Choose the plus (+) icon to the right of the SQL Query node. Under Data target, select Amazon S3.
Select the Amazon S3 target node and enter the following values:
1. S3 URI: Choose the Amazon S3 location from the project overview page and add suffix “/output/product_analysis/”. For example, s3://<bucket-name>/<domainId>/<projectId>/output/product_analysis/
2. Format: Parquet
3. Compression: Snappy
4. Partition keys: marketplace
5. Mode: Append
Select Update node.

At this point, your end-to-end visual job should look like the following image. The next step is to save this job to the project and run the job.

On the top right, choose Save to project to save the draft job. You can optionally change the name and add a description.
Choose Save.
On the top right, choose Run.

This will start running your Visual ETL job. You can monitor the list of job runs by selecting View runs in the top middle of the screen.

Create and run a code based job

In addition to creating jobs through the Visual ETL Editor, you can create jobs using a code-based approach by specifying Python script or Notebook files. When you specify a Notebook file, it automatically converts to a Python script to create the job. Here, you’ll create a notebook in JupyterLab within SageMaker Unified Studio, save it to the project repository, and then create a code-based job from that notebook. First, create a Notebook.

On the SageMaker Unified Studio console, on the top menu, choose Build.
Under IDE & APPLICATIONS, select JupyterLab.
Select Python 3 under Notebook.

For the first cell, select Local Python, python, enter following code:

%%configure -n project.spark.compatibility
{
    "number_of_workers": 10,
    "session_type": "etl",
    "glue_version": "5.0",
    "worker_type": "G.1X",
    "idle_timeout": 10,
    "timeout": 1200
}

For the second cell, select PySpark, project.spark.compatibility, enter following code. This performs the same processing as the Visual ETL job you created above. Replace the S3 bucket and folder names for output_path.

import sys
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession.builder.getOrCreate()

# Create Spark session
sc = SparkContext.getOrCreate()
spark = SparkSession.builder.getOrCreate()

# Configure paths
input_path = "s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Apparel/"
output_path = "s3://<bucket-name>/<domainId>/<projectId>/code-job-output/results"


# Read data from S3
df = spark.read.format("parquet").load(input_path)
df.createOrReplaceTempView("reviews")

# Transform 1: Rating Analysis
rating_analysis = spark.sql("""
    SELECT 
        marketplace,
        star_rating,
        DATE_FORMAT(review_date, 'yyyy-MM-dd') as review_date,
        COUNT(*) as review_count,
        AVG(CAST(helpful_votes as DOUBLE) / NULLIF(total_votes, 0)) as helpfulness_ratio,
        COUNT(CASE WHEN insight = 'Y' THEN 1 END) as insight_count
    FROM reviews
    GROUP BY 
        marketplace,
        star_rating,
        DATE_FORMAT(review_date, 'yyyy-MM-dd')
""")

# Transform 2: Product Analysis
product_analysis = spark.sql("""
    SELECT 
        marketplace,
        product_id,
        product_title,
        COUNT(*) as review_count,
        AVG(star_rating) as avg_rating,
        SUM(helpful_votes) as total_helpful_votes,
        COUNT(DISTINCT customer_id) as unique_reviewers,
        COUNT(CASE WHEN insight = 'Y' THEN 1 END) as insight_count
    FROM reviews
    GROUP BY 
        marketplace,
        product_id,
        product_title
    HAVING 
        COUNT(*) >= 5
""")

# Write results to S3
rating_analysis.write.format("parquet") \
    .option("compression", "snappy") \
    .partitionBy("review_date") \
    .mode("append") \
    .save(f"{output_path}/rating_analysis")

product_analysis.write.format("parquet") \
    .option("compression", "snappy") \
    .partitionBy("marketplace") \
    .mode("append") \
    .save(f"{output_path}/product_analysis")

Choose the File icon to save the notebook file. Enter the name of your notebook.

Save the notebook to the project’s repository.

Choose the Git icon in the left navigation. This opens a panel where you can view the commit history and perform Git operations.
Choose the plus (+) icon next to the files you want to commit.
Enter a brief summary of the commit in the Summary text entry field. Optionally, enter a longer description of the commit in the Description text entry field.
Choose Commit.
Choose the Push committed changes icon to do a git push.

Create the Code-based Job from the Notebook file in the project repository.

On the SageMaker Unified Studio console, on the top menu, choose Build.
Under DATA ANALYSIS & INTEGRATION, choose Data processing jobs.
Choose Create job from files.
Choose Choose project files and choose Browse files.
Select the Notebook file you created and choose Select.

Here, the Python script automatically converted from your notebook file will be displayed. Review the content.

Choose Next.
For Job name, enter the name of your job.
Choose Submit to create your job.
Choose the job you created.
Choose Run job.

Convert existing Visual ETL flows to jobs

You can convert an existing visual ETL flow to a job by saving your existing Visual ETL flow to the project repository. Use the following steps to create a job from your existing visual ETL flow:

On the SageMaker Unified Studio console, on the top menu, choose Build.
Under DATA ANALYSIS & INTEGRATION, select Visual ETL editor.
Select the existing Visual ETL flow.
On the top right, choose Save to project to save the draft flow. You can optionally change the name and add a description.
Choose Save.

View jobs

You can view the list of jobs in your project on the Data processing jobs page. Jobs can be filtered by mode (Visual ETL or Code).

Monitor job runs

On each job’s detail page, you can view a list of job runs in the Job runs tab. You can filter activities by job run ID, status, start time, and end time. The Job runs list shows basic attributes such as duration, resources consumed, and instance type, along with log group names and various job parameters. You can list, compare, and explore job runs history based on various attributes.

On the individual job run details page, you can view job properties and output logs from the run. When a job fails because of an error, you can see the error message at the top of the page and examine detailed error information in the output logs.

Intelligent troubleshooting with generative AI: When jobs fail, you can take advantage of generative AI troubleshooting to resolve issues quickly. SageMaker Unified Studio’s AI-powered troubleshooting automatically analyzes job metadata, Spark event logs, error stack traces, and runtime metrics to identify root causes and provide actionable solutions. It handles both simple scenarios like missing S3 buckets, and complex performance issues such as out-of-memory exceptions. The analysis explains not just what failed, but why it failed and how to fix it, reducing troubleshooting time from hours or days to minutes.

To start the analysis, choosing Troubleshoot with AI at the top right. The troubleshooting analysis provides Root Cause Analysis identifying the specific issue, Analysis Insights explaining the error context and failure patterns, and Recommendations with step-by-step remediation actions. This expert-level analysis makes complex Spark debugging accessible to all team members, regardless of their Spark expertise.

Clean up

To avoid incurring future charges, delete the resources you created during this walkthrough:

Delete Visual ETL flows in Visual ETL editor.
Delete Data processing jobs, including Visual ETL and Code-based jobs.
Delete Output files in the S3 bucket.

Conclusion

In this post, we explored the new job experience in Amazon SageMaker Unified Studio, which brings a familiar and consistent experience for data processing and data integration tasks. This new capability streamlines your workflows by providing enhanced visibility, cost management, and seamless migration paths from AWS Glue.With the ability to create both visual and code-based jobs, monitor job runs, and set up scheduling, the new jobs experience helps you build and manage data processing and data integration tasks efficiently. Whether you’re a data engineer working on ETL processes or a data scientist preparing datasets for machine learning, the job experience in SageMaker Unified Studio provides the tools you need in a unified environment.Start exploring the new job experience today to simplify your data processing workflows and make the most of your data in Amazon SageMaker Unified Studio.

About the authors

Chiho Sugimoto is a Cloud Support Engineer on the AWS Big Data Support team. She is passionate about helping customers build data lakes using ETL workloads. She loves planetary science and enjoys studying the asteroid Ryugu on weekends.

Noritaka Sekiyama is a Principal Big Data Architect at the AWS Analytics product team. He’s responsible for designing new features in AWS products, building software artifacts, and providing architecture guidance to customers. In his spare time, he enjoys cycling on his road bike.

Matt Su is a Senior Product Manager on the AWS Glue team. He enjoys helping customers uncover insights and make better decisions using their data with AWS Analytics services. In his spare time, he enjoys skiing and gardening.

Orchestrate data processing jobs, querybooks, and notebooks using visual workflow experience in Amazon SageMaker

2025-07-15 Naohisa Takahashi

Post Syndicated from Naohisa Takahashi original https://aws.amazon.com/blogs/big-data/orchestrate-data-processing-jobs-querybooks-and-notebooks-using-visual-workflow-experience-in-amazon-sagemaker/

Automation of data processing and data integration tasks and queries is essential for data engineers and analysts to maintain up-to-date data pipelines and reports. Amazon SageMaker Unified Studio is a single data and AI development environment where you can find and access the data in your organization and act on it using the ideal tools for your use case. SageMaker Unified Studio offers multiple ways to integrate with data through the Visual ETL, Query Editor, and JupyterLab builders. SageMaker is natively integrated with Apache Airflow and Amazon Managed Workflows for Apache Airflow (Amazon MWAA), and is used to automate the workflow orchestration for jobs, querybooks, and notebooks with a Python-based DAG definition.

Today, we are excited to launch a new visual workflows builder in SageMaker Unified Studio. With the new visual workflow experience, you don’t need to code the Python DAGs manually. Instead, you can visually define the orchestration workflow in SageMaker Unified Studio, and the visual definition is automatically converted to a Python DAG definition that is supported in Airflow. This post demonstrates the new visual workflow experience in SageMaker Unified Studio.

Example use case

In this post, a fictional ecommerce company sells many different products, like books, toys, and jewelry. Customers can leave reviews and star ratings for each product so other customers can make informed decisions about what they should buy. We use a sample synthetic review dataset for demonstration purposes, which includes different products and customer reviews.In this example, we demonstrate the new visual workflow experience with a data processing job, SQL querybook, and notebook. We also identify the top 10 customers who have contributed the most helpful votes per category.The following diagram illustrates the solution architecture.

In the following sections, we show how to configure a series of components using data processing jobs, querybooks, and notebooks with SageMaker Unified Studio visual workflows. You can use sample data to extract information from the specific category, update partition metadata, and display query results in the notebook using Python code.

Prerequisites

To get started, you must have the following prerequisites:

An AWS account
A SageMaker Unified Studio domain. To use the sample data provided in this blog post, your domain should be in us-east-1 region.
A SageMaker Unified Studio project with the Data analytics and AI-ML model development project profile
A workflow environment

Create a data processing job

The first step is to create a data processing job to run visual transformations to identify top contributing customers per category. Complete the following steps to create a data processing job:

On the top menu, under Build, choose Visual ETL flow.
Choose the plus sign, and under Data sources, choose Amazon S3.
Choose the Amazon S3 source node and enter the following values:
1. S3 URI: s3://aws-bigdata-blog/generated_synthetic_reviews/data/
2. Format: Parquet
Choose Update node.
Choose the plus sign, and under Transform, choose Filter.
Choose the Filter node and enter the following values:
1. Filter Type: Global AND
2. Key: product_category
3. Operation: ==
4. Value: Books
Choose Update node.
Choose the plus sign, and under Data targets, choose Amazon S3.
Choose the S3 node and enter the following values:
1. S3 URI: Use the Amazon S3 location from the project overview page and add the suffix /data/books_synthetic_reviews/ (for example, /dzd_al0ii4pi2sqv68/awi0lzjswu0yhc/dev/data/books_synthetic_reviews/)
2. Format: Parquet
3. Compression: Snappy
4. Partition keys: marketplace
5. Mode: Overwrite
6. Update Catalog: True
7. Database: Choose your database
8. Table: books_synthetic_review
9. Include header: False
Choose Update node.

At this point, you should have an end-to-end visual flow. Now you can publish it.

Choose Save to project to save the draft flow.
Change Job name to filter-books-synthetic-review, then choose Update.

The data processing job has been successfully created.

Create a querybook

Complete the following steps to create a querybook to run a SQL query against the source table to recognize partitions:

Choose the plus sign next to the querybook tab to open new querybook.
Enter the following query and choose Save to project. The query MSCK REPAIR TABLE is prepared for recognizing partitions in the table. We don’t run this querybook yet because the querybook is designed to be triggered by a workflow.

MSCK REPAIR TABLE `books_synthetic_review`;

For Querybook title, enter QueryBook-synthetic-review-<timestamp>, then choose Save changes.

The querybook to recognize new partitions has been successfully created.

Create a notebook

Next, we create notebook to generate output and visualize the results. Complete following steps:

On the top menu, under Build, choose JupyterLab.
Choose File, New, and Notebook to create a new notebook.
Enter the following code snippets into notebook cells and save them (provide your AWS account ID, AWS Region, and S3 bucket):

import sys
!{sys.executable} -m pip install PyAthena
from sagemaker_studio import Project
from pyathena import connect
import pandas as pd

project = Project()
s3_path = f'{project.s3.root}/sys/athena/'
region = project.connection().physical_endpoints[0].aws_region
database = project.connection().catalog().databases[0].name

conn = connect(s3_staging_dir=s3_path, region_name=region)

print("Top 10 most helpful commented customer, Books category")
df = pd.read_sql(f"""
select customer_id, sum(helpful_votes) helpful_votes_sum from {database}.books_synthetic_review group by customer_id order by sum(helpful_votes) desc limit 10;
""", conn)
df

Choose File, Save Notebook.

Rename the file name, and choose Rename and Save.
Choose the Git sidebar and choose the plus sign next to the file name.

Enter the commit message and choose COMMIT.
Choose Push to Remote.

Create a workflow

Complete the following steps to create a workflow:

On the top menu, under Build, choose Workflows.
Choose Create new workflow.

Choose the plus sign, then choose Data processing job.

Choose the Data processing job node, then choose Browse jobs.
Select filter-books-synthetic-review and choose Select.

Choose the plus sign, then choose Querybook.
Choose the Querybook node, then choose Browse files.
Select QueryBook-synthetic-review-<timestamp>.sqlnb and choose Select.
Choose the plus sign, then choose Notebook.
Choose the Notebook node, then choose Browse files.
Select synthetics-review-result.ipynb and choose Select.

At this point, you should have an end-to-end visual workflow. Now you can publish it.

Choose Save to project to save the draft flow.
Change Workflow name to synthetic-review-workflow and choose Save to project.

Run the workflow

To run your workflow, complete following steps:

Choose Run on the workflow details page.

Choose View runs to see the running workflow.

When the run is complete, you can check the notebook task result by choosing the run ID (manual__<timestamp>), then choose the notebook task ID (notebook-task-xxxx).

You can find the IDs of the top 10 customers who have contributed the most helpful votes in the notebook output.

Clean up

To avoid incurring future charges, clean up the resources you created during this walkthrough:

On the workflows page, select your workflow, and under Actions, choose Delete workflow.

On the Visual ETL flows page, select filter-books-synthetics-review, and under Actions, choose Delete flow.
In Query Editor, enter and run the following SQL to drop table:

DROP TABLE `books_synthetic_review`;

In JupyterLab, in the File Browser sidebar, choose (right-click) each notebook (synthetics-review-result.ipynb and QueryBook-synthetic-review-<timestamp>.sqlnb) and choose Delete.
Commit with git and then push to the remote repository.

Conclusion

The new visual workflow editor in SageMaker Unified Studio can help you orchestrate your data integration tasks visually without requiring deep expertise in Airflow. Through the visual interface, data engineers and analysts can focus on their core tasks instead of spending time on manual workflow Python DAG code implementation.Visual workflows offer several advantages, including an intuitive visual interface for workflow design and automatic conversion of visual workflows to Python DAG definitions. The integration with Airflow and Amazon MWAA further enhances the utility, and improved monitoring capabilities provide greater visibility into workflow runs. These features contribute to reduced development time in workflow creation. Visual workflows make workflow automation easy for a variety of use cases, such as data engineers orchestrating complex ETL pipelines or analysts maintaining regular reports.We encourage you to explore visual workflows in SageMaker Unified Studio, and discover how they can streamline your data processing and analytics workflows. For more information about SageMaker Unified Studio and its features, see AWS documentation.

About the authors

Naohisa Takahashi is a Senior Cloud Support Engineer on the AWS Support Engineering team. He supports customers resolve technical issues and launch systems. In his spare time, he plays board games with his friends.

Noritaka Sekiyama is a Principal Big Data Architect with AWS Analytics services. He’s responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Iris Tian is a UX designer on the Amazon SageMaker Unified Studio team. She designs intuitive, end-to-end experiences that simplify and streamline workflows across data processing and orchestration. In her spare time, she enjoys snowboarding and visiting museums.

Regan Baum is a Senior Software Development Engineer on the Amazon SageMaker Unified Studio team. She designs, implements, and maintains features that enable customers to manage their workflows in SageMaker Unified Studio. Outside of work, she enjoys hiking and running.

Yuhang Huang is a Software Development Manager on the Amazon SageMaker Unified Studio team. He leads the engineering team to design, build, and operate scheduling and orchestration capabilities in SageMaker Unified Studio. In his free time, he enjoys playing tennis.

Gal Heyne is a Senior Technical Product Manager for AWS Analytics services with a strong focus on AI/ML and data engineering. She is passionate about developing a deep understanding of customers’ business needs and collaborating with engineers to design simple-to-use data products.

Develop and monitor a Spark application using existing data in Amazon S3 with Amazon SageMaker Unified Studio

2025-07-09 Amit Maindola

Post Syndicated from Amit Maindola original https://aws.amazon.com/blogs/big-data/develop-and-monitor-a-spark-application-using-existing-data-in-amazon-s3-with-amazon-sagemaker-unified-studio/

Organizations face significant challenges managing their big data analytics workloads. Data teams struggle with fragmented development environments, complex resource management, inconsistent monitoring, and cumbersome manual scheduling processes. These issues lead to lengthy development cycles, inefficient resource utilization, reactive troubleshooting, and difficult-to-maintain data pipelines.These challenges are especially critical for enterprises processing terabytes of data daily for business intelligence (BI), reporting, and machine learning (ML). Such organizations need unified solutions that streamline their entire analytics workflow.

The next generation of Amazon SageMaker with Amazon EMR in Amazon SageMaker Unified Studio addresses these pain points through an integrated development environment (IDE) where data workers can develop, test, and refine Spark applications in one consistent environment. Amazon EMR Serverless alleviates cluster management overhead by dynamically allocating resources based on workload requirements, and built-in monitoring tools help teams quickly identify performance bottlenecks. Integration with Apache Airflow through Amazon Managed Workflows for Apache Airflow (Amazon MWAA) provides robust scheduling capabilities, and the pay-only-for-resources-used model delivers significant cost savings.

In this post, we demonstrate how to develop and monitor a Spark application using existing data in Amazon Simple Storage Service (Amazon S3) using SageMaker Unified Studio.

Solution overview

This solution uses SageMaker Unified Studio to execute and oversee a Spark application, highlighting its integrated capabilities. We cover the following key steps:

Create an EMR Serverless compute environment for interactive applications using SageMaker Unified Studio.
Create and configure a Spark application.
Use TPC-DS data to build and run the Spark application using a Jupyter notebook in SageMaker Unified Studio.
Monitor application performance and schedule recurring runs with Amazon MWAA integrated.
Analyze results in SageMaker Unified Studio to optimize workflows.

Prerequisites

For this walkthrough, you must have the following prerequisites:

An AWS account – If you don’t have an account, you can create one.
A SageMaker Unified Studio domain – For instructions, refer to Create an Amazon SageMaker Unified Studio domain – quick setup.
A demo project – Create a demo project in your SageMaker Unified Studio domain. For instructions, see Create a project. For this example, we choose All capabilities in the project profile section.

Add EMR Serverless as compute

Complete the following steps to create an EMR Serverless compute environment to build your Spark application:

In SageMaker Unified Studio, open the project you created as a prerequisite and choose Compute.
Choose Data processing, then choose Add compute.
Choose Create new compute resources, then choose Next.

Choose EMR Serverless, then choose Next.

For Compute name, enter a name.
For Release label, choose emr-7.5.0.
For Permission mode, choose Compatibility.
Choose Add compute.

It takes a few minutes to spin up the EMR Serverless application. After it’s created, you can view the compute in SageMaker Unified Studio.

The preceding steps demonstrate how you can set up an Amazon EMR Serverless application in SageMaker Unified Studio to run interactive PySpark workloads. In subsequent steps, we build and monitor Spark applications in an interactive JupyterLab workspace.

Develop, monitor, and debug a Spark application in a Jupyter notebook within SageMaker Unified Studio

In this section, we build a Spark application using the TPC-DS dataset within SageMaker Unified Studio. With Amazon SageMaker Data Processing, you can focus on transforming and analyzing your data without managing compute capacity or open source applications, saving you time and reducing costs. SageMaker Data Processing provides a unified developer experience from Amazon EMR, AWS Glue, Amazon Redshift, Amazon Athena, and Amazon MWAA in a single notebook and query interface. You can automatically provision your capacity on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) or EMR Serverless. Scaling rules manage changes to your compute demand to optimize performance and runtimes. Integration with Amazon MWAA simplifies workflow orchestration by alleviating infrastructure management needs. For this post, we use EMR Serverless to read and query the TPC-DS dataset within a notebook and run it using Amazon MWAA.

Complete the following steps:

Upon completion of the previous steps and prerequisites, navigate to SageMaker Studio and open your project.
Choose Build and then JupyterLab.

The notebook takes about 30 seconds to initialize and connect to the space.

Under Notebook, choose Python 3 (ipykernel).
In the first cell, next to Local Python, choose the dropdown menu and choose PySpark.
Choose the dropdown menu next to Project.Spark and choose EMR-S Compute.
Run the following code to develop your Spark application. This example reads a 3 TB TPC-DS dataset in Parquet format from a publicly accessible S3 bucket:

spark.read.parquet("s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/store/").createOrReplaceTempView("store")

After the Spark session starts and execution logs start to populate, you can explore the Spark UI and driver logs to further debug and troubleshoot Spark progra The following screenshot shows an example of the Spark UI. The following screenshot shows an example of the driver logs. The following screenshot shows the Executors tab, which provides access to the driver and executor logs.

Use the following code to read some more TPC-DS datasets. You can create temporary views and use the Spark UI to see the files being read. Refer to the appendix at the end of this for details on using the TPC-DS dataset within your buckets.

spark.read.parquet("s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/item/").createOrReplaceTempView("item")
spark.read.parquet("s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/store_sales/").createOrReplaceTempView("store_sales")
spark.read.parquet("s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/date_dim/").createOrReplaceTempView("date_dim")
spark.read.parquet("s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/customer/").createOrReplaceTempView("customer")
spark.read.parquet("s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/catalog_sales/").createOrReplaceTempView("catalog_sales")
spark.read.parquet("s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/web_sales/").createOrReplaceTempView("web_sales")

In each cell of your notebook, you can expand Spark Job Progress to view the stages of the job submitted to EMR Serverless for a specific cell. You can see the time taken to complete each stage. In addition, if a failure occurs, you can examine the logs, making troubleshooting a seamless experience.

Because the files are partitioned based on date key column, you can observe that Spark runs parallel tasks for reads.

Next, get the count across the date time keys on data that is partitioned based on the time key using the following code:

select count(1), ss_sold_date_sk from store_sales group by ss_sold_date_sk order by ss_sold_date_sk

Monitor jobs in the Spark UI

On the Jobs tab of the Spark UI, you can see a list of complete or actively running jobs, with the following details:

The action that triggered the job
The time it took (for this example, 41 seconds, but timing will vary)
The number of stages (2) and tasks (3,428); these are for reference and specific to this specific example

You can choose the job to view more details, particularly around the stages. Our job has two stages; a new stage is created whenever there is a shuffle. We have one stage for the initial reading of each dataset, and one for the aggregation. In the following example, we run some TPC-DS SQL statements that are used for performance and benchmarks:

 with frequent_ss_items as
 (select substr(i_item_desc,1,30) itemdesc,i_item_sk item_sk,d_date solddate,count(*) cnt
  from store_sales, date_dim, item
  where ss_sold_date_sk = d_date_sk
    and ss_item_sk = i_item_sk
    and d_year in (2000, 2000+1, 2000+2,2000+3)
  group by substr(i_item_desc,1,30),i_item_sk,d_date
  having count(*) >4),
 max_store_sales as
 (select max(csales) tpcds_cmax
  from (select c_customer_sk,sum(ss_quantity*ss_sales_price) csales
        from store_sales, customer, date_dim
        where ss_customer_sk = c_customer_sk
         and ss_sold_date_sk = d_date_sk
         and d_year in (2000, 2000+1, 2000+2,2000+3)
        group by c_customer_sk) x),
 best_ss_customer as
 (select c_customer_sk,sum(ss_quantity*ss_sales_price) ssales
  from store_sales, customer
  where ss_customer_sk = c_customer_sk
  group by c_customer_sk
  having sum(ss_quantity*ss_sales_price) > (95/100.0) *
    (select * from max_store_sales))
 select sum(sales)
 from (select cs_quantity*cs_list_price sales
       from catalog_sales, date_dim
       where d_year = 2000
         and d_moy = 2
         and cs_sold_date_sk = d_date_sk
         and cs_item_sk in (select item_sk from frequent_ss_items)
         and cs_bill_customer_sk in (select c_customer_sk from best_ss_customer)
      union all
      (select ws_quantity*ws_list_price sales
       from web_sales, date_dim
       where d_year = 2000
         and d_moy = 2
         and ws_sold_date_sk = d_date_sk
         and ws_item_sk in (select item_sk from frequent_ss_items)
         and ws_bill_customer_sk in (select c_customer_sk from best_ss_customer))) x

You can monitor your Spark job in SageMaker Unified Studio using two methods. Jupyter notebooks provide basic monitoring, showing real-time job status and execution progress. For more detailed analysis, use the Spark UI. You can examine specific stages, tasks, and execution plans. The Spark UI is particularly useful for troubleshooting performance issues and optimizing queries. You can track estimated stages, running tasks, and task timing details. This comprehensive view helps you understand resource utilization and track job progress in depth.

In this section, we explained how you can EMR Serverless compute in SageMaker Unified Studio to build an interactive Spark application. Through the Spark UI, the interactive application provides fine-grained task-level status, I/O, and shuffle details, as well as links to corresponding logs of the task for this stage directly from your notebook, enabling a seamless troubleshooting experience.

Clean up

To avoid ongoing charges in your AWS account, delete the resources you created during this tutorial:

Delete the connection.
Delete the EMR job.
Delete the EMR output S3 buckets.
Delete the Amazon MWAA resources, such as workflows and environments.

Conclusion

In this post, we demonstrated how the next generation of SageMaker, combined with EMR Serverless, provides a powerful solution for developing, monitoring, and scheduling Spark applications using data in Amazon S3. The integrated experience significantly reduces complexity by offering a unified development environment, automatic resource management, and comprehensive monitoring capabilities through Spark UI, while maintaining cost-efficiency through a pay-as-you-go model. For businesses, this means faster time-to-insight, improved team collaboration, and reduced operational overhead, so data teams can focus on analytics rather than infrastructure management.

To get started, explore the Amazon SageMaker Unified Studio User Guide, set up a project in your AWS environment, and discover how this solution can transform your organization’s data analytics capabilities.

Appendix

In the following sections, we discuss how to run a workload on a schedule and provide details about the TPC-DS dataset for building the Spark application using EMR Serverless.

Run a workload on a schedule

In this section, we deploy a JupyterLab notebook and create a workflow using Amazon MWAA. You can use workflows to orchestrate notebooks, querybooks, and more in your project repositories. With workflows, you can define a collection of tasks organized as a directed acyclic graph (DAG) that can run on a user-defined schedule.Complete the following steps:

In SageMaker Unified Studio, choose Build, and under Orchestration, choose Workflows.

Choose Create Workflow in Editor.

You will be redirected to the JupyterLab notebook with a new DAG called untitled.py created under the /src/workflows/dag folder.

We rename this notebook to tpcds_data_queries.py.
You can reuse the existing template with the following updates:
1. Update line 17 with the schedule you want your code to run.
2. Update line 26 with your NOTEBOOK_PATH. This should be in src/<notebook_name>.ipynb. Note the name of the automatically generated dag_id; you can name it based on your requirements.

Choose File and Save notebook.

To test, you can trigger a manual run of your workload.

In SageMaker Unified Studio, choose Build, and under Orchestration, choose Workflows.
Choose your workflow, then choose Run.

You can monitor the success of your job on the Runs tab.

To debug your notebook job by accessing the Spark UI within your Airflow job console, you must use EMR Serverless Airflow Operators to submit your job. The link is available on the Details tab of your query.

This option has the following key limitations: it’s not available for Amazon EMR on EC2, and SageMaker notebook job operators don’t work.

You can configure the operator to generate one-time links to the application UIs and Spark stdout logs by passing enable_application_ui_links=True as a parameter. After the job starts running, these links are available on the Details tab of the relevant task. If enable_application_ui_links=False, then the links will be present but grayed out.

Make sure you have the emr-serverless:GetDashboardForJobRun AWS Identity and Access Management (IAM) permissions to generate the dashboard link.

Open the Airflow UI for your job. The Spark UI and history server dashboard options are visible on the Details tab, as shown in the following screenshot.

The following screenshot shows the Jobs tab of the Spark UI.

Use the TPC-DS dataset to build the Spark application using EMR Serverless

To use the TPC-DS dataset to run the Spark application against a dataset in an S3 bucket, you need to copy the TPC-DS dataset into your S3 bucket:

Create a new S3 bucket in your test account if needed. In the following code, replace $YOUR_S3_BUCKET with your S3 bucket name. We suggest you export YOUR_S3_BUCKET as an environment variable:

<Your bucket name>

Copy the TPC-DS source data as input to your S3 bucket. If it’s not exported as an environment variable, replace $YOUR_S3_BUCKET with your S3 bucket name:

aws s3 sync s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/ s3://$YOUR_S3_BUCKET/blog/BLOG_TPCDS-TEST-3T-partitioned/

About the Authors

Abhilash is a senior specialist solutions architect at Amazon Web Services (AWS), helping public sector customers on their cloud journey with a focus on AWS Data and AI services. Outside of work, Abhilash enjoys learning new technologies, watching movies, and visiting new places.

Perform per-project cost allocation in Amazon SageMaker Unified Studio

2025-07-09 Enrique Salgado Hernández

Post Syndicated from Enrique Salgado Hernández original https://aws.amazon.com/blogs/big-data/perform-per-project-cost-allocation-in-amazon-sagemaker-unified-studio/

Amazon SageMaker Unified Studio is a single data and AI development environment where you can find and access your data and act on it using AWS resources for SQL analytics, data processing, model development, and generative AI application development.

SageMaker Unified Studio is part of the next generation of Amazon SageMaker. SageMaker brings together AWS artificial intelligence and machine learning (AI/ML) and analytics capabilities and delivers an integrated experience for analytics and AI with unified access to data.

With SageMaker Unified Studio, you can create domains and projects, providing a single interface to build, deploy, execute, and monitor end-to-end workflows. This approach helps drive collaboration across teams and facilitates agile development.

SageMaker Unified Studio implements resource tagging when AWS resources are provisioned. You can use these tags to track and allocate costs for the various resources created as part of the domains and projects within SageMaker Unified Studio.

This post demonstrates how to perform cost allocation using these resource tags, so finance analysts and business analysts can implement and follow Financial Operations (FinOps) best practices to control and track cloud infrastructure costs.

Solution overview

The following diagram illustrates how tagging works within SageMaker domains.

High level diagram that illustrates SageMaker Unified Studio entities (domains, projects and environments) are organized and how tags are applied to each of them

Before reviewing the implementation details, let’s explore several key SageMaker concepts: domain, project, project profile, and environment blueprint. For more information, refer to the SageMaker Unified Studio Administrator Guide.

Domain – A domain is an organizing entity created by an administrator. Administrators assign users to domains to enable collaboration using similar tools, assets, and resources. A domain can represent a business organization or a business unit containing people who collaborate and share resources. After creating a domain, administrators share the URL with users to access the portal.
Projects – Projects exist within each domain. A project provides a boundary where users can collaborate on a business use case. Users can create and share data, computing, and other resources within projects.
Project profile – When you create a project, you must select a project profile. A project profile is a template that governs infrastructure for the project, simplifying project creation with preconfigured settings and resources ready for use.
Environment blueprints – Environment blueprints are reusable templates for creating environments. They define settings for resource deployment and provide information for provisioning. Each blueprint uses an AWS CloudFormation template to create resources in a repeatable and scalable manner.

For effective cost tracking and allocation, make sure your SageMaker resources have proper tags. You can configure these as cost allocation tags to group and filter across AWS Billing and Cost Management tools (such as AWS Cost Explorer and AWS Data Exports).

As of this writing, SageMaker domains support tagging at the blueprint, domain, project, and environment level. When you create projects or add resources within an existing project, the following tags are automatically added to resources through CloudFormation resource tags, configured for each blueprint stack:

AmazonDataZoneBlueprint – Type of blueprint corresponding to this blueprint’s CloudFormation template (for example, Tooling)
AmazonDataZoneDomain – Amazon DataZone domain associated with this CloudFormation template
AmazonDataZoneEnvironment – Amazon DataZone environment ID associated with this CloudFormation template
AmazonDataZoneProject – Amazon DataZone project associated with this CloudFormation template

To track costs in SageMaker Unified Studio, you will perform the following steps:

Create a SageMaker domain and project.
Configure cost and billing settings by enabling cost allocation tags.
(Optional) Generate costs for your project.
Track costs using Cost Explorer and Data Exports.

Prerequisites

This post requires the following configurations in your AWS account:

AWS IAM Identity Center enabled in your organization management account (preferred) or in the member account where you will use SageMaker Unified Studio. For instructions on enabling IAM Identity Center, refer to Enable IAM Identity Center.
Cost Explorer enabled in your organization management account (preferred) or in the member account where you will use SageMaker Unified Studio. For configuration steps, refer to Enabling Cost Explorer.

Either legacy AWS Cost and Usage Reports (AWS CUR) with Amazon Athena integration or Data Exports configured and integrated with Athena for queries. For setup instructions, refer to creating Data Exports.

Create a SageMaker Unified Studio domain and project

Complete the following steps to set up your domain and project:

Create a SageMaker Unified Studio domain using the Quick setup option (recommended for new users) or manual setup.

After domain creation, you will be redirected to the domain overview page.

Choose Open Unified Studio.
On the SageMaker Unified Studio console, choose Create project.
For Project profile, choose SQL analytics, then choose Continue.

SageMaker Unified Studio create project wokflow (configuration page)

Choose Continue to keep the default blueprint parameters.
Review the configuration summary, then choose Create project.

SageMaker Unified Studio create project wokflow (confirmation page)

After the project is created, you will be redirected to the project overview page. Record the project ID and domain ID.

Project details page showing various details such as project id, project name and project IAM role ARN

Cost and billing configuration

As mentioned earlier, to track costs in SageMaker Unified Studio, you must configure cost allocation tags. Refer to Organizing and tracking costs using AWS cost allocation tags for more information about this feature.

Complete the following steps:

On the AWS Billing and Cost Management console, under Cost organization in the navigation pane, choose Cost allocation tags.
Select the following tags and choose Activate:
1. AmazonDataZoneDomain
2. AmazonDataZoneProject
3. AmazonDataZoneEnvironment
4. AmazonDataZoneBlueprint

The AmazonDataZoneProject and AmazonDataZoneDomain tags correspond to the project and domain ID values you recorded earlier.

AWS cost allocation tags interface showing the AWS tags that are currently configured as cost allocation tags

Cost allocation tags configuration doesn’t apply retroactively. If you want to monitor costs associated with these tags in the AWS Billing and Cost Management tools before the activation date, you must request a cost allocation tag backfill. The backfill operation can take several hours to complete.

Generate costs for the project

This section explains how to generate costs associated with the underlying data backend (Amazon Redshift in this case) to examine them using AWS billing tools. You can skip this section if you’re tracking costs on an active project.

To generate costs, we use the table structure used in the Redshift Immersion Labs. Refer to Create Tables for more details.

To run queries in SageMaker Unified Studio, follow these steps:

In your project, choose New and then Query.

Image that shows the query button within the SageMaker Unified Studio project overview page allowing users to open the query editor tool

Use the Amazon Redshift Serverless compute configured for the project to generate the costs:
1. Choose the Redshift (Lakehouse) connection.
2. Choose the dev database.
3. Choose the project schema.
4. Choose Choose.

Image that shows the conection selector available in SageMaker Unified Studio. In this case Redshift LakeHouse connection is selected with dev database and project schema selected underneath

Copy and execute the SQL statements provided in the following GitHub repo into the SageMaker Unified Studio query editor to create, load, and validate data on the tables.

View of the Query editor within the SageMaker Unified Studio portal. Image contains two SQL queries (create tables and COPY data operation)

After running these steps, you will have generated some Amazon Redshift costs that will be present for further analysis in AWS Billing and Cost Management tools. However, these tools (Cost Explorer and Data Exports) are refreshed least one time every 24 hours, so you might need to wait up to 24 hours before proceeding to the next section.

Tracking costs in AWS Billing and Cost Management tools

With the cost allocation tags enabled, you can use AWS Billing and Cost Management tools to analyze and track costs, including Cost Explorer and Data Exports. For more information about using these tools, refer to the AWS Billing and Cost Management User Guide.

Check costs in Cost Explorer

You can check your SageMaker Unified Studio costs using Cost Explorer. With this tool, you can view and analyze your costs and usage through an interface with pre-built filters and aggregation capabilities for various metrics. For more information, refer to the Analyzing your costs and usage with AWS Cost Explorer.

To access Cost Explorer, complete the following steps:

On the AWS Management Console, choose your account name in the top right corner and choose Billing Dashboard, or search for “Cost Explorer” in the console search bar.
On the Billing Dashboard, choose Cost Explorer in the navigation pane.
For first-time users, choose Launch Cost Explorer to enable the service.

AWS can take up to 24 hours to prepare your cost data.

To view overall costs per project, configure the following report parameters:
1. For Date Range, enter your range.
2. For Granularity, choose Monthly.
3. For Dimension, choose Tag.
4. For Tag, enter your tag (AmazonDataZoneProject).

Image that shows how to group by a particular dimension (tag) in cost explorer

The following screenshot shows a sample report.

AWS cost explorer report showing costs by SageMaker Unified Studio project

To view different service costs for a specific project, update the following parameters:
1. For Dimension, choose Service.
2. For Tag¸ choose AmazonDataZoneProject and choose the value of the project you want to inspect (in this case, 4z9d694nbsnyqx).

Image that illustrates how to filter by a specific dimension (tag) and value in cost explorer

The results should look similar to the following screenshot.

AWS cost explorer report showing service costs for a particular SageMaker Unified Studio project

Check costs using Data Exports

With Data Exports, you can query your cost and usage in AWS with the maximum flexibility degree compared to other tools such as Cost Explorer. It provides a comprehensive set of measures and dimensions that you can include in the export to create a personalized report. This report is then delivered to Amazon Simple Storage Service (Amazon S3) so you can configure it with Athena, so it can be queried using SQL or business intelligence (BI) tools such as Amazon QuickSight.

This post assumes you have already configured a data export and you have it integrated with Athena (refer to Processing data exports for more information). For instructions on setting up CUR and Athena integration, refer to Creating reports.

Check costs by project

Use the following query to check costs by project:

SELECT product_servicecode,
    product_product_family,
    resource_tags[ 'user_amazon_data_zone_project' ] as user_amazon_data_zone_project,
    round(sum(line_item_unblended_cost), 2) costs,
    line_item_line_item_description 
FROM "data_exports"."data_exportdata"
where resource_tags [ 'user_amazon_data_zone_project' ] != ''
group by product_product_family,
    product_servicecode,
    resource_tags[ 'user_amazon_data_zone_project' ],
    line_item_line_item_description
order by round(sum(line_item_unblended_cost), 2) DESC;

Results will look similar to the following screenshot on the Athena console.

Athena SQL query results when querying cost and usage data from data exports

The preceding query shows your costs grouped by:

Project (using tags)
Service
Product family, which corresponds to the subtype for a given product usage charge (for example, ML Instance for SageMaker, or Managed Storage for Amazon Redshift)

Check costs for individual projects

To check costs for a specific SageMaker Unified Studio project (for example, the sample project 4z9d694nbsnyqx created during this walkthrough), you can use the following query:

SELECT product_servicecode,
    product_product_family,
    resource_tags[ 'user_amazon_data_zone_project' ] as user_amazon_data_zone_project,
    round(sum(line_item_unblended_cost), 2) costs,
    line_item_line_item_description 
FROM "data_exports"."data_exportdata"
where resource_tags [ 'user_amazon_data_zone_project' ] != ''
and resource_tags [ 'user_amazon_data_zone_project' ] = <provide the project id here>
group by product_product_family,
    product_servicecode,
    resource_tags[ 'user_amazon_data_zone_project' ],
    line_item_line_item_description
order by round(sum(line_item_unblended_cost), 2) DESC;

Monitor costs with Data Exports and QuickSight

If you enabled Athena to work with Data Exports, you can also configure QuickSight to query this data source. With QuickSight, you can create interactive dashboards to track SageMaker costs in SageMaker Unified Studio at scale.

Configure access and permissions

To create CUR dashboards in QuickSight, first complete the following steps:

Subscribe to QuickSight and have an author user account. For instructions on subscribing to QuickSight, refer to Signing up for an Amazon QuickSight subscription.
Enable access to Athena and your CUR S3 bucket in the Security & permissions section of the QuickSight administration console. You need QuickSight administrator permissions to access this console.

Image shows QuickSight administration console where administrators can edit the AWS services (Athena in this case) that QuickSight is allowed to access

If you’re using AWS Lake Formation, make sure your QuickSight user is authorized to query the CUR database and table. For more information about granting access in Lake Formation, refer to Granting permissions on Data Catalog resources.

Create a QuickSight dataset

The next step is to create a dataset in QuickSight using a SQL query. For instructions on creating a dataset with SQL, refer to Using SQL to customize data. Use the following SQL expression:

SELECT product_servicecode,
    product_product_family,
    resource_tags[ 'user_amazon_data_zone_environment' ] as user_amazon_data_zone_environment,
    resource_tags[ 'user_amazon_data_zone_project' ] as user_amazon_data_zone_project,
    resource_tags[ 'user_amazon_data_zone_domain' ] as user_amazon_data_zone_domain,
    line_item_unblended_cost,
    line_item_usage_start_date,
    line_item_line_item_description
FROM "data_exports"."data_exportdata"
where resource_tags [ 'user_amazon_data_zone_environment' ] != '' or resource_tags [ 'user_amazon_data_zone_project' ] != ''

Image of QuickSight dataset preparation page. Shows a SQL query that is used to extract data from the data exports previously configured.

The preceding query includes only cost and usage data that’s tagged with either user_amazon_data_zone_environment or user_amazon_data_zone_project to focus on SageMaker associated costs. To include other AWS costs, you must modify these filters.

Create QuickSight dashboards

Using the authoring capabilities of QuickSight, you can create interactive dashboards where business stakeholders can explore and track costs associated with SageMaker Unified Studio projects. You can use these dashboards to review relevant cost metrics at a glance that are derived from the Data Exports dimensions and metrics included in your dataset, as shown in the following screenshot. For more information about adding visuals to analyses, refer to Adding visuals to Amazon QuickSight analyses.

Example of a QuickSight dashboard consuming data exports cost and usage data. Dashboard contains multiple visuals that illustrate SageMaker Unified Studio costs by project and service

The preceding example shows a dashboard built using QuickSight connected to a Data Exports dataset. The dashboard contains the following visuals:

KPI visual showing the current monthly costs for SageMaker Unified Studio along with the month over month (MoM) variation and history
Autonarrative visual analyzing SageMaker Unified Studio costs (highest) by month
Vertical stacked bar chart showing SageMaker Unified Studio costs by month (grouped by project)
Donut chart showing SageMaker Unified Studio cost by service
Heat map visual correlating costs by project ID and service

Using this approach (QuickSight and Data Exports), you can create highly customizable dashboards to explore and monitor your SageMaker Unified Studio costs. Furthermore, you can create automated reports using the QuickSight reporting feature to send these by email to the relevant stakeholders.

Clean up

Delete the resources you created as part of this post when you’re done with them to avoid monthly charges. This includes SageMaker resources, created Data Export reports and the QuickSight subscription (in case it was created to visualize costs).

Delete SageMaker resources
1. Log in to the SageMaker domain using an admin role.
2. Delete the project you created.
3. Delete the SageMaker domain.
Delete Data Exports reports
1. On the AWS Billing console, in the navigation pane, choose Cost & Usage Reports.
2. Select the report you want to delete.
3. Choose Delete.
4. Confirm the deletion by choosing Delete report.

For more information about managing Data Exports, refer to Deleting exports.

Unsubscribe from QuickSight
1. On the QuickSight console, choose your profile name in the top right corner.
2. Choose Manage QuickSight.
3. Choose Account settings.
4. At the bottom of the page, choose Delete your QuickSight account.
5. Review the information about data deletion.
6. Enter delete to confirm.
7. Choose Delete.

IMPORTANT NOTE: Before unsubscribing, make sure you backed up any dashboards or analyses you want to keep. After deletion, you can’t recover your QuickSight assets. For more information about managing your QuickSight subscription, refer to Deleting your Amazon QuickSight subscription and closing the account.

Conclusion

Managing costs on a unified platform like SageMaker can seem challenging because it aggregates many tools and services with different cost models. In this post, we showed how to use AWS Billing and Cost Management tools to aggregate and categorize costs across the various services used within SageMaker. With this approach, you can monitor and track respective service costs, either in aggregate or focusing on a particular project.

Start taking control of your analytics and ML costs today. With AWS Billing and Cost Management tools with SageMaker, you can:

Track and monitor your service costs
Break down expenses by project or service
Implement efficient back charging mechanisms to the different business units or organizations using SageMaker within your organization

For further reading, refer to Analyzing your costs and usage with AWS Cost Explorer and Processing Data Exports (using Athena).

About the authors

Enrique Salgado Hernández is a Senior Specialist Solutions Architect at AWS with more than 10 years of experience working in the cloud. He specializes in designing and implementing large-scale analytics architectures across various industry sectors. He is passionate about working with customers to solve their problems by supporting them during their cloud journey.

Angel Conde Manjon is a Senior EMEA Data & AI PSA, based in Madrid. He previously worked on research related to data analytics and AI in diverse European research projects. In his current role, Angel helps partners develop businesses centered on data and AI.

Capture data lineage from dbt, Apache Airflow, and Apache Spark with Amazon SageMaker

2025-06-24 Jose Romero

Post Syndicated from Jose Romero original https://aws.amazon.com/blogs/big-data/capture-data-lineage-from-dbt-apache-airflow-and-apache-spark-with-amazon-sagemaker/

The next generation of Amazon SageMaker is the center for your data, analytics, and AI. SageMaker brings together AWS artificial intelligence and machine learning (AI/ML) and analytics capabilities and delivers an integrated experience for analytics and AI with unified access to data. From Amazon SageMaker Unified Studio, a single interface, you can access your data and use a suite of powerful tools for data processing, SQL analytics, model development, training and inference, as well as generative AI development. This unified experience is assisted by Amazon Q and Amazon SageMaker Catalog (powered by Amazon DataZone), which delivers an embedded generative AI and governance experience at every step.

With data lineage, now part of SageMaker Catalog, domain administrators and data producers can centralize lineage metadata of their data assets in a single place. You can track the flow of data over time, giving you a clear understanding of where it originated, how it has changed, and its ultimate use across the business. By providing this level of transparency around the origin of data, data lineage helps data consumers gain trust that the data is correct for their use case. Because data lineage is captured at the table, column, and job level, data producers can also conduct impact analysis and respond to data issues when needed.

Capture of data lineage in SageMaker starts after connections and data sources are configured and lineage events are generated when data is transformed in AWS Glue or Amazon Redshift. This capability is also fully compatible with OpenLineage, so you can further augment data lineage capture to other data processing tools. This post walks you through how to use the OpenLineage-compatible API of SageMaker or Amazon DataZone to push data lineage events programmatically from tools supporting the OpenLineage standard like dbt, Apache Airflow, and Apache Spark.

Solution overview

Many third-party and open source tools that are used today to orchestrate and run data pipelines, like dbt, Airflow, and Spark, have active support of the OpenLineage standard to provide interoperability across environments. With this capability, you only need to include and configure the right library to your environment, to be able to stream lineage events from jobs running on this tool directly to their corresponding output logs or to a target HTTP endpoint that you specify.

With the target HTTP endpoint option, you can introduce a pattern to post lineage events from these tools into SageMaker or Amazon DataZone to further help you centralize governance of your data assets and processes in a single place. This pattern takes the form of a proxy, and its simplified architecture is illustrated in the following figure.

The way that the proxy for OpenLineage works is simple:

Amazon API Gateway exposes an HTTP endpoint and path. Jobs running with the OpenLineage package on top of the supported data processing tools can be set up with the HTTP transport option pointing to this endpoint and path. If connectivity allows, lineage events will be streamed into this endpoint as the job runs.
An Amazon Simple Queue Service (Amazon SQS) queue buffers the events as they arrive. By storing them in a queue, you have the option to implement strategies for retries and errors when needed. For cases where event order is required, we recommend the use of first-in-first-out (FIFO) queues; however, SageMaker and Amazon DataZone are able to map incoming OpenLineage events, even if they are out of order.
An AWS Lambda function retrieves events from the queue in batches. For every event in a batch, the function can perform transformations when needed and post the resulting event to the target SageMaker or Amazon DataZone domain.
Even though it’s not shown in the architecture, AWS Identity and Access Management (IAM) and Amazon CloudWatch are key capabilities that allow secure interaction between resources with minimum permissions and logging for troubleshooting and observability.

The AWS sample OpenLineage HTTP Proxy for Amazon SageMaker Governance and Amazon DataZone provides a working implementation of this simplified architecture that you can test and customize as needed. To deploy in a test environment, follow the steps as described in the repository. We use an AWS CloudFormation template to deploy solution resources.

After you have deployed the OpenLineage HTTP Proxy solution, you can use it to post lineage events from data processing tools like dbt, Airflow, and Spark into a SageMaker or Amazon DataZone domain, as shown in the following examples.

Set up the OpenLineage package for Spark in AWS Glue 4.0

AWS Glue added built-in support for OpenLineage with AWS Glue 5.0 (to learn more, see Introducing AWS Glue 5.0 for Apache Spark). For jobs that are still running on AWS Glue 4.0, you still can stream OpenLineage events into SageMaker or Amazon DataZone by using the OpenLineage HTTP Proxy solution. This serves as an example that can be applied to other platforms running Spark like Amazon EMR, third-party solutions, or self-managed clusters.

To properly add OpenLineage capabilities to an AWS Glue 4.0 job and configure it to stream lineage events into the OpenLineage HTTP Proxy solution, complete the following steps:

Download the official OpenLineage package for Spark. For our example, we used the JAR package version 2.12 release 1.9.1.
Store the JAR file in an Amazon Simple Storage Service (Amazon S3) bucket that can be accessed by your AWS Glue job.
On the AWS Glue console, open your job.
Under Libraries, for Dependent JARs path, enter the path of the JAR package stored in your S3 bucket.

In the Job parameters section, add the following parameters:
1. Enable the OpenLineage package:
  1. Key: --user-jars-first
  2. Value: true
2. Configure how the OpenLineage package will be used to stream lineage events. Replace <OPENLIMEAGE_PROXY_ENDPOINT_URL> and <OPENLIMEAGE_PROXY_ENDPOINT_PATH> with the corresponding values of the OpenLineage HTTP Proxy solution. These values can be found as outputs of the deployed CloudFormation stack. Replace <ACCOUNT_ID> with your AWS account ID.
  1. Key: --conf
  2. Value:
```
spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener 
--conf spark.openlineage.transport.type=http 
--conf spark.openlineage.transport.url=<OPENLIMEAGE_PROXY_ENDPOINT_URL>
--conf spark.openlineage.transport.endpoint=/<OPENLIMEAGE_PROXY_ENDPOINT_PATH>
--conf spark.openlineage.facets.custom_environment_variables=[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;] 
--conf spark.glue.accountId=<ACCOUNT_ID>
```

With this setup, the AWS Glue 4.0 job will use the HTTP transport option of the OpenLineage package to stream lineage events into the OpenLineage proxy, which will post events to the SageMaker or Amazon DataZone domain.

Run the AWS Glue 4.0 job.

The job’s resulting datasets should be sourced into SageMaker or Amazon DataZone so that OpenLineage events are mapped to them. As you explore the sourced dataset in SageMaker Unified Studio, you can observe its origin path as described by the OpenLineage events streamed through the OpenLineage proxy.

When working with Amazon DataZone, you will get the same result.

The origin path in this example is extensive and maps the resulting dataset down to its origin, in this case, a couple of tables hosted in a relational database and transformed through a data pipeline with two AWS Glue 4.0 (Spark) jobs.

Set up the OpenLineage package for dbt

dbt has rapidly become a popular framework to build data pipelines on top of data processing and data warehouse tools like Amazon Redshift, Amazon EMR, and AWS Glue, as well as other traditional and third-party solutions. This framework supports OpenLineage as a way to standardize generation of lineage events and integrate with the growing data governance ecosystem.dbt deployments might vary per environment, which is why we don’t dive into the specifics in this post. However, to simply configure your dbt project to leverage the OpenLineage HTTP Proxy solution, complete the following steps:

Install the OpenLineage package for dbt. You can learn more in the OpenLineage documentation.
In the root folder of your dbt project, create an openlineage.yml file where you can specify the transport configuration. Replace <OPENLIMEAGE_PROXY_ENDPOINT_URL> and <OPENLIMEAGE_PROXY_ENDPOINT_PATH> with the values of the OpenLineage HTTP Proxy solution. These values can be found as outputs of the deployed CloudFormation stack.

transport:
  type: http
  url: <OPENLIMEAGE_PROXY_ENDPOINT_URL>
  endpoint: <OPENLIMEAGE_PROXY_ENDPOINT_PATH>
  timeout: 5

Run your dbt pipeline. As explained in the OpenLineage documentation, instead of running the standard dbt run command, you run the dbt-ol run command. The latter command is just a wrapper on top of the standard dbt run command so that lineage events are captured and streamed as configured.

The job’s resulting datasets should be sourced into SageMaker or Amazon DataZone so that OpenLineage events are mapped to them. As you explore the sourced dataset in SageMaker Unified Studio, you can observe its lineage path as described by the OpenLineage events streamed through the OpenLineage proxy.

When working with Amazon DataZone, you will get the same result.

In this example, the dbt project is running on top of Amazon Redshift, which is a common use case among customers. Amazon Redshift is integrated for automatic lineage capture with SageMaker and Amazon DataZone, but such capabilities weren’t used as part of this example to illustrate how you can still integrate OpenLineage events from dbt using the pattern implemented in the OpenLineage HTTP Proxy solution.The dbt pipeline is made by two stages running sequentially, which are illustrated in the origin path as the nodes with the dbt type.

Set up the OpenLineage package for Airflow

Airflow is a well-positioned tool to orchestrate data pipelines at any scale. AWS provides Amazon Managed Workflows for Apache Airflow (Amazon MWAA) as a managed alternative for customers that want to reduce management and accelerate the development of their data strategy with Airflow in a cost-effective way. Airflow also supports OpenLineage, so you can centralize lineage with tools like SageMaker and Amazon DataZone.

The following steps are specific for Amazon MWAA, but they can be extrapolated to other forms of deployment of Airflow:

Install the OpenLineage package for Airflow. You can learn more in the OpenLineage documentation. For versions 2.7 and later, it’s recommended to use the native Airflow OpenLineage package (apache-airflow-providers-openlineage), which is the case for this example.
To install the package, add it to the requirements.txt file that you are storing in Amazon S3 and that you are pointing to when provisioning your Amazon MWAA environment. To learn more, refer to Managing Python dependencies in requirements.txt.
As you install the OpenLineage package or afterwards, you can configure it to send lineage events to the OpenLineage proxy:
1. When filling the form to create a new Amazon MWAA environment or edit an existing one, in the Airflow configuration options section, add the following. Replace <OPENLIMEAGE_PROXY_ENDPOINT_URL> and <OPENLIMEAGE_PROXY_ENDPOINT_PATH> with the values of the OpenLineage HTTP Proxy solution. These values can be found as outputs of the deployed CloudFormation stack:
  1. Configuration option: openlineage.transport
  2. Custom value: {"type": "http", "url": "<OPENLIMEAGE_PROXY_ENDPOINT_URL>", "endpoint": "<OPENLIMEAGE_PROXY_ENDPOINT_PATH>"}

Run your pipeline.

The Airflow tasks will automatically use the transport configuration to stream lineage events into the OpenLineage proxy as they run. The task’s resulting datasets should be sourced into SageMaker or Amazon DataZone so that OpenLineage events are mapped to them.As you explore the sourced dataset in SageMaker Unified Studio, you can observe its origin path as described by the OpenLineage events streamed through the OpenLineage proxy.

When working with Amazon DataZone, you will get the same result.

In this example, the Amazon MWAA Directed Acyclic Graph (DAG) is operating on top of Amazon Redshift, similar to the dbt example before. However, it’s still not using the native integration for automatic data capture between Amazon Redshift and SageMaker or Amazon DataZone. This way, we can illustrate how you can still integrate OpenLineage events from Airflow using the pattern implemented in the OpenLineage HTTP Proxy solution.The Airflow DAG is made by a single task that outputs the resulting dataset by using datasets that were created as part of the dbt pipeline in the previous example. This is illustrated in the origin path, where it includes nodes with the dbt type and a node with AIRFLOW type. With this final example, note how SageMaker and Amazon DataZone map all datasets and jobs to reflect the reality of your data pipelines.

Additional considerations when implementing the OpenLineage proxy pattern

The OpenLineage proxy pattern implemented in the sample OpenLineage HTTP Proxy solution and presented in this post has shown to be a practical approach to integrate a growing set of data processing tools into a centralized data governance strategy on top of SageMaker. We encourage you to dive into it and use it in your test environments to learn how it can be best used for your specific setup.If interested in taking this pattern to production, we suggest you first review it thoroughly and customize it to your particular needs. The following are some items worth reviewing as you evaluate this pattern implementation:

The solution used in the examples of this post uses a public API endpoint with no authentication or authorization mechanism. For a production workload, we recommend limiting access to the endpoint to a minimum so only authorized resources are able to stream messages into it. To learn more, refer to Control and manage access to HTTP APIs in API Gateway.
The logic implemented in the Lambda function is intended to be customized depending on your use case. You might need to implement transformation logic, depending on how OpenLineage events are created by the tool you are using. As a reference, for the case of the Amazon MWAA example of this post, some minor transformations were required on the name and namespace fields of the inputs and outputs elements of the event for full compatibility with the format expected for Amazon Redshift datasets as described in the dataset naming conventions of OpenLineage. You might also need to change how the function logs execution details or include retry/error logic and more.
The SQS queue used in the OpenLineage HTTP Proxy solution is standard, which implies that events aren’t delivered in order. If this is a requirement, you could use FIFO queues instead.

For cases where you want to post OpenLineage events directly into SageMaker or Amazon DataZone, without using the proxy pattern explained in this post, a custom transport is now available as an extension of the OpenLineage project version 1.33.0. Leverage this feature in cases where you don’t need additional controls on your OpenLineage event stream, for example, if you don’t need any custom transformation logic.

Summary

In this post, we showed how to use the OpenLineage-compatible APIs of SageMaker to capture data lineage from any tool supporting this standard, by following an architectural pattern introduced as the OpenLineage proxy. We presented some examples of how you can set up tools like dbt, Airflow, and Spark to stream lineage events to the OpenLineage proxy, which subsequently posts them to a SageMaker or Amazon DataZone domain. Finally, we introduced a working implementation of this pattern that you can test and discussed some considerations when implementing this same pattern to production.

The SageMaker compatibility with OpenLineage can help simplify governance of your data assets and increase trust in your data. This capability is one of the many features that are now available to build a comprehensive governance strategy powered by data lineage, data quality, business metadata, data discovery, access automation, and more. By bundling data governance capabilities with the growing set of tools available for data and AI development, you can derive value from your data faster and get closer to consolidating a data-driven culture. Try out this solution and get started with SageMaker to join the growing set of customers that are modernizing their data platform.

About the authors

Jose Romero is a Senior Solutions Architect for Startups at AWS, based in Austin, Texas. He is passionate about helping customers architect modern platforms at scale for data, AI, and ML. As a former senior architect in AWS Professional Services, he enjoys building and sharing solutions for common complex problems so that customers can accelerate their cloud journey and adopt best practices. Connect with him on LinkedIn.

Priya Tiruthani is a Senior Technical Product Manager with Amazon SageMaker Catalog (Amazon DataZone) at AWS. She focuses on building products and their capabilities in data analytics and governance. She is passionate about building innovative products to address and simplify customers’ challenges in their end-to-end data journey. Outside of work, she enjoys being outdoors to hike and capture nature’s beauty. Connect with her on LinkedIn.

Reduce time to access your transactional data for analytical processing using the power of Amazon SageMaker Lakehouse and zero-ETL

2025-06-16 Avijit Goswami

Post Syndicated from Avijit Goswami original https://aws.amazon.com/blogs/big-data/reduce-time-to-access-your-transactional-data-for-analytical-processing-using-the-power-of-amazon-sagemaker-lakehouse-and-zero-etl/

As the lines between analytics and AI continue to blur, organizations find themselves dealing with converging workloads and data needs. Historical analytics data is now being used to train machine learning models and power generative AI applications. This shift requires shorter time to value and tighter collaboration among data analysts, data scientists, machine learning (ML) engineers, and application developers. However, the reality of scattered data across various systems—from data lakes to data warehouses and applications—makes it difficult to access and use data efficiently. Moreover, organizations attempting to consolidate disparate data sources into a data lakehouse have historically relied on extract, transform, and load (ETL) processes, which have become a significant bottleneck in their data analytics and machine learning initiatives. Traditional ETL processes are often complex, requiring significant time and resources to build and maintain. As data volumes grow, so do the costs associated with ETL, leading to delayed insights and increased operational overhead. Many organizations find themselves struggling to efficiently onboard transactional data into their data lakes and warehouses, hindering their ability to derive timely insights and make data-driven decisions. In this post, we address these challenges with a two-pronged approach:

Unified data management: Using Amazon SageMaker Lakehouse to get unified access to all your data across multiple sources for analytics and AI initiatives with a single copy of data, regardless of how and where the data is stored. SageMaker Lakehouse is powered by AWS Glue Data Catalog and AWS Lake Formation and brings together your existing data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses with integrated access controls. In addition, you can ingest data from operational databases and enterprise applications to the lakehouse in near real-time using zero-ETL which is a set of fully-managed integrations by AWS that eliminates or minimizes the need to build ETL data pipelines.
Unified development experience: Using Amazon SageMaker Unified Studio to discover your data and put it to work using familiar AWS tools for complete development workflows, including model development, generative AI application development, data processing, and SQL analytics, in a single governed environment.

In this post, we demonstrate how you can bring transactional data from AWS OLTP data stores like Amazon Relational Database Service (Amazon RDS) and Amazon Aurora flowing into Redshift using zero-ETL integrations to SageMaker Lakehouse Federated Catalog (Bring your own Amazon Redshift into SageMaker Lakehouse). With this integration, you can now seamlessly onboard the changed data from OLTP systems to a unified lakehouse and expose the same to analytical applications for consumptions using Apache Iceberg APIs from new SageMaker Unified Studio. Through this integrated environment, data analysts, data scientists, and ML engineers can use SageMaker Unified Studio to perform advanced SQL analytics on the transactional data.

Architecture patterns for a unified data management and unified development experience

In this architecture pattern, we show you how to use zero-ETL integrations to seamlessly replicate transactional data from Amazon Aurora MySQL-Compatible Edition, an operational database, into the Redshift Managed Storage layer. This zero-ETL approach eliminates the need for complex data extraction, transformation, and loading processes, enabling near real-time access to operational data for analytics. The transferred data is then cataloged using a federated catalog in the SageMaker Lakehouse Catalog and exposed through the Iceberg Rest Catalog API, facilitating comprehensive data analysis by consumer applications.

You then use SageMaker Unified Studio, to perform advanced analytics on the transactional data bridging the gap between operational databases and advanced analytics capabilities.

Prerequisites

Make sure that you have the following prerequisites:

An AWS account with permission to create AWS Identity and Access Management (IAM) roles and policies
Administrator access to one or more of the following sources: Aurora MySQL-Compatible and Amazon Redshift
An AWS Identity and Access Management (IAM) user with an access key and secret key to configure the AWS Command Line Interface (AWS CLI). We recommend that you use temporary credentials using AWS IAM Identity Center as a security best practice for accessing your AWS accounts because these credentials are generated every time you sign in. See Getting IAM Identity Center user credentials for the AWS CLI or AWS SDKs steps to configure IAM Identity Center authentication.

Deployment steps

In this section, we share steps for deploying resources needed for Zero-ETL integration using AWS CloudFormation.

Setup resources with CloudFormation

This post provides a CloudFormation template as a general guide. You can review and customize it to suit your needs. Some of the resources that this stack deploys incur costs when in use. The CloudFormation template provisions the following components:

An Aurora MySQL provisioned cluster (source).
An Amazon Redshift Serverless data warehouse (target).
Zero-ETL integration between the source (Aurora MySQL) and target (Amazon Redshift Serverless). See Aurora zero-ETL integrations with Amazon Redshift for more information.

Create your resources

To create resources using AWS Cloudformation, follow these steps:

Sign in to the AWS Management Console.
Select the us-east-1 AWS Region in which to create the stack.
Open the AWS CloudFormation
Choose Launch Stack
Choose Next.
This automatically launches CloudFormation in your AWS account with a template. It prompts you to sign in as needed. You can view the CloudFormation template from within the console.
For Stack name, enter a stack name, for example UnifiedLHBlogpost.
Keep the default values for the rest of the Parameters and choose Next.
On the next screen, choose Next.
Review the details on the final screen and select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Submit.

Stack creation can take up to 30 minutes.

After the stack creation is complete, go to the Outputs tab of the stack and record the values of the keys for the following components, which you will use in a later step:
- NamespaceName
- PortNumber
- RDSPassword
- RDSUsername
- RedshiftClusterSecurityGroupName
- RedshiftPassword
- RedshiftUsername
- VPC
- Workgroupname
- ZeroETLServicesRoleNameArn

Implementation steps

To implement this solution, follow these steps:

Setting up zero-ETL integration

A zero-ETL integration is already created as a part of CloudFormation template provided. Use the following steps from the Zero-ETL integration post to complete setting up the integration.:

Bring Amazon Redshift metadata to the SageMaker Lakehouse catalog

Now that transactional data from Aurora MySQL is replicating into Redshift tables through zero-ETL integration, you next bring the data into SageMaker Lakehouse, so that operational data can co-exist and be accessed and governed together with other data sources in the data lake. You do this by registering an existing Amazon Redshift Serverless namespace that has Zero-ETL tables as a federated catalog in SageMaker Lakehouse.

Before starting the next steps, you need to configure data lake administrators in AWS Lake Formation.

Go to the Lake Formation console and in the navigation pane, choose Administration roles and then choose Tasks under Administration. Under Data lake administrators, choose Add.
In the Add administrators page, under Access type, select Data Lake administrator.
Under IAM users and roles, select Admin. Choose Confirm.

Add AWS Lake Formation Administrators

On the Add administrators page, for Access type, select Read-only administrators. Under IAM users and roles, select AWSServiceRoleForRedshift and choose Confirm. This step enables Amazon Redshift to discover and access catalog objects in AWS Glue Data Catalog.

Add AWS Lake Formation Administrators 2

With the data lake administrators configured, you’re ready to bring your existing Amazon Redshift metadata to SageMaker Lakehouse catalog:

From the Amazon Redshift Serverless console, choose Namespace configuration in the navigation pane.
Under Actions, choose Register with AWS Glue Data Catalog. You can find more details on registering a federated Amazon Redshift catalog in Registering namespaces to the AWS Glue Data Catalog.

Choose Register. This will register the namespace to AWS Glue Data Catalog

After registration is complete, the Namespace register status will change to Registered to AWS Glue Data Catalog.
Navigate to the Lake Formation console and choose Catalogs New under Data Catalog in the navigation pane. Here you can see a pending catalog invitation is available for the Amazon Redshift namespace registered in Data Catalog.

Select the pending invitation and choose Approve and create catalog. For more information, see Creating Amazon Redshift federated catalogs.

Enter the Name, Description, and IAM role (created by the CloudFormation template). Choose Next.

Grant permissions using a principal that is eligible to provide all permissions (an admin user).
- Select IAM users and rules and choose Admin.
- Under Catalog permissions, select Super user to grant super user permissions.

Assigning super user permissions grants the user unrestricted permissions to the resources (databases, tables, views) within this catalog. Follow the principal of least privilege to grant users only the permissions required to perform a task wherever applicable as a security best practice.

As final step, review all settings and choose Create Catalog

After the catalog is created, you will see two objects under Catalogs. dev refers to the local dev database inside Amazon Redshift, and aurora_zeroetl_integration is the database created for Aurora to Amazon Redshift ZeroETL tables

Fine-grained access control

To set up fine-grained access control, follow these steps:

To grant permission to individual objects, choose Action and then select Grant.

On the Principals page, grant access to individual objects or more than one object to different principals under the federated catalog.

Access lakehouse data using SageMaker Unified Studio

SageMaker Unified Studio provides an integrated experience outside the console to use all your data for analytics and AI applications. In this post, we show you how to use the new experience through the Amazon SageMaker management console to create a SageMaker platform domain using the quick setup method. To do this, you set up IAM Identity Center, a SageMaker Unified Studio domain, and then access data through SageMaker Unified Studio.

Set up IAM Identity Center

Before creating the domain, makes sure that your data admins and data workers are ready to use the Unified Studio experience by enabling IAM Identity Center for single sign-on following the steps in Setting up Amazon SageMaker Unified Studio. You can use Identity Center to set up single sign-on for individual accounts and for accounts managed through AWS Organizations. Add users or groups to the IAM instance as appropriate. The following screenshot shows an example email sent to a user through which they can activate their account in IAM Identity Center.

Set up SageMaker Unified domain

Follow steps in Create a Amazon SageMaker Unified Studio domain – quick setup to set up a SageMaker Unified Studio domain. You need to choose the VPC that was created by the CloudFormation stack earlier.

The quick setup method also has a Create VPC option that sets up a new VPC, subnets, NAT Gateway, VPC endpoints, and so on, and is meant for testing purposes. There are charges associated with this, so delete the domain after testing.

If you see the No models accessible, you can use the Grant model access button to grant access to Amazon Bedrock serverless models for use in SageMaker Unified Studio, for AI/ML use-cases

Fill in the sections for Domain Name. For example, MyOLTPDomain. In the VPC section, select the VPC that was provisioned by the CloudFormation stack, for example UnifiedLHBlogpost-VPC. Select subnets and choose Continue.

In the IAM Identity Center User section, look up the newly created user from (for example, Data User1) and add them to the domain. Choose Create Domain. You should see the new domain along with a link to open Unified Studio.

Access data using SageMaker Unified Studio

To access and analyze your data in SageMaker Unified Studio, follow these steps:

1. Select the URL for SageMaker Unified Studio. Choose Sign in with SSO and sign in using the IAM user, for example datauser1, and you will be prompted to select a multi-factor authentication (MFA) method.
2. Select Authenticator App and proceed with next steps. For more information about SSO setup, see Managing users in Amazon SageMaker Unified Studio.After you have signed in to the Unified Studio domain, you need to set up a new project. For this illustration, we created a new sample project called MyOLTPDataProject using the project profile for SQL Analytics as shown here.A project profile is a template for a project that defines what blueprints are applied to the project, including underlying AWS compute and data resources. Wait for the new project to be set up, and when status is Active, open the project in Unified Studio.By default, the project will have access to the default Data Catalog (AWSDataCatalog). For the federated redshift catalog redshift-consumer-catalog to be visible, you need to grant permissions to the project IAM role using Lake Formation. For this example, using the Lake Formation console, we have granted below access to the demodb database that is part of the Zero-ETL catalog to the Unified Studio project IAM role. Follow steps in Adding existing databases and catalogs using AWS Lake Formation permissions.In your SageMaker Unified Studio Project’s Data section, connect to the Lakehouse Federated catalog that you created and registered earlier (for example redshift-zetl-auroramysql-catalog/aurora_zeroetl_integration). Select the objects that you want to query and execute them using the Redshift Query Editor integrated with SageMaker Unified Studio.If you select Redshift, you will be transferred to the Query editor where you can execute the SQL and see the results as shown in the following figure.

With this integration of Amazon Redshift metadata with SageMaker Lakehouse federated catalog, you have access to your existing Redshift data warehouse objects in your organizations centralized catalog managed by SageMaker Lakehouse catalog and join the existing Redshift data seamlessly with the data stored in your Amazon S3 data lake. This solution helps you avoid unnecessary ETL processes to copy data between the data lake and the data warehouse and minimize data redundancy.

You can further integrate more data sources serving transactional workloads such as Amazon DynamoDB and enterprise applications such as Salesforce and ServiceNow. The architecture shared in this post for accelerated analytical processing using Zero-ETL and SageMaker Lakehouse can be further expanded by adding Zero-ETL integrations for DynamoDB using DynamoDB zero-ETL integration with Amazon SageMaker Lakehouse and for enterprise applications by following the instructions in Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

Clean up

When you’re finished, delete the CloudFormation stack to avoid incurring costs for some of the AWS resources used in this walkthrough incur a cost. Complete the following steps:

On the CloudFormation console, choose Stacks.
Choose the stack you launched in this walkthrough. The stack must be currently running.
In the stack details pane, choose Delete.
Choose Delete stack.
On the Sagemaker console, choose Domains and delete the domain created for testing.

Summary

In this post, you’ve learned how to bring data from operational databases and applications into your lake house in near real-time through Zero-ETL integrations. You’ve also learned about a unified development experience to create a project and bring in the operational data to the lakehouse, which is accessible through SageMaker Unified Studio, and query the data using integration with Amazon Redshift Query Editor. You can use the following resources in addition to this post to quickly start your journey to make your transactional data available for analytical processing.

About the authors

Avijit Goswami is a Principal Data Solutions Architect at AWS specialized in data and analytics. He supports AWS strategic customers in building high-performing, secure, and scalable data lake solutions on AWS using AWS managed services and open-source solutions. Outside of his work, Avijit likes to travel, hike in the San Francisco Bay Area trails, watch sports, and listen to music.

Saman Irfan is a Senior Specialist Solutions Architect focusing on Data Analytics at Amazon Web Services. She focuses on helping customers across various industries build scalable and high-performant analytics solutions. Outside of work, she enjoys spending time with her family, watching TV series, and learning new technologies.

Sudarshan Narasimhan is a Principal Solutions Architect at AWS specialized in data, analytics and databases. With over 19 years of experience in Data roles, he is currently helping AWS Partners & customers build modern data architectures. As a specialist & trusted advisor he helps partners build & GTM with scalable, secure and high performing data solutions on AWS. In his spare time, he enjoys spending time with his family, travelling, avidly consuming podcasts and being heartbroken about Man United’s current state.

Simplify real-time analytics with zero-ETL from Amazon DynamoDB to Amazon SageMaker Lakehouse

2025-06-06 Narayani Ambashta

Post Syndicated from Narayani Ambashta original https://aws.amazon.com/blogs/big-data/simplify-real-time-analytics-with-zero-etl-from-amazon-dynamodb-to-amazon-sagemaker-lakehouse/

At AWS re:Invent 2024, we introduced a no code zero-ETL integration between Amazon DynamoDB and Amazon SageMaker Lakehouse, simplifying how organizations handle data analytics and AI workflows. This integration alleviates the traditional challenges of building and maintaining complex extract, transform, and load (ETL) pipelines for transforming NoSQL data into analytics-ready formats, which previously required significant time and resources while introducing potential system vulnerabilities. Organizations can now seamlessly combine the strength of DynamoDB in handling rapid, concurrent transactions with immediate analytical processing through the zero-ETL integration. For example, an ecommerce platform storing user session data and cart information in DynamoDB can now analyze this data in near real time without building custom pipelines. Gaming companies using DynamoDB for player data can instantly analyze user behavior as events occur, enabling real-time insights into game balance and player engagement patterns.

The zero-ETL capability uses built-in change data capture (CDC) to automatically synchronize data updates and schema changes between DynamoDB and SageMaker Lakehouse tables. By using Apache Iceberg format, the integration provides reliable performance with ACID transaction support and efficient large-scale data handling. Data scientists can train ML models on fresh data and data analysts can generate reports using current information, with typical synchronization latency in minutes rather than hours.

In this post, we share how to set up this zero-ETL integration from DynamoDB to your SageMaker Lakehouse environment.

Solution overview

We use a SageMaker Lakehouse catalog, AWS Lake Formation, Amazon Athena, AWS Glue, and Amazon SageMaker Unified Studio for this integration. The following is the reference data flow diagram for the zero-ETL integration.

ref architecture

The workflow consists of the following components:

The recently launched zero-ETL integration capability within the AWS Glue console enables direct integration between DynamoDB and SageMaker Lakehouse, storing data in Iceberg format. This streamlined approach opens up new possibilities for data teams by creating a large-scale open and secure data ecosystem without traditional ETL processing overhead.
When building a SageMaker Lakehouse architecture, you can use an Amazon Simple Storage Service (Amazon S3) based managed catalog as your zero-ETL target, providing seamless data integration without transformation overhead. This approach creates a robust foundation for your SageMaker Lakehouse implementation while maintaining the cost-effectiveness and scalability inherent to Amazon S3 storage, enabling efficient analytics and machine learning workflows.
Organizations can use a Redshift Managed Storage (RMS) based managed catalog when they need high-performance SQL analytics and multi-table transactions. This approach uses RMS for storage while maintaining data in the Iceberg format, providing an optimal balance of performance and flexibility.
After you establish your Lakehouse infrastructure, you can access it through diverse analytics engines, including AWS services like Athena, Amazon Redshift, AWS Glue, and Amazon EMR as independent services. For a more streamlined experience, SageMaker Unified Studio offers centralized analytics management, where you can query your data from a single unified interface.

Prerequisites

In this section, we walk through the steps to set up your solution resources and confirm your permission settings.

Create a SageMaker Unified Studio domain, project, and IAM role

Before you begin, you need an AWS Identity and Access Management (IAM) role for enabling the zero-ETL integration. In this post, we use SageMaker Unified Studio, which offers a unified data platform experience. It automatically manages required Lake Formation permissions on data and catalogs for you.

You have to first create a SageMaker Unified Studio domain, an administrative entity that controls user access, permissions, and resources for teams working within the SageMaker Unified Studio environment. Note down the SageMaker Unified Studio URL after you create the domain. You will be using this URL later to log in to the SageMaker Unified Studio portal and query our data across multiple engines.

Then, you create a SageMaker Unified Studio project, an integrated development environment (IDE) that provides a unified experience for data processing, analytics, and AI development. As part of project creation, an IAM role is automatically generated. This role will be used when you access SageMaker Unified Studio later. For more details on how to create a SageMaker Unified Studio project and domain, refer to An integrated experience for all your data and AI with Amazon SageMaker Unified Studio.

Prepare a sample dataset within DynamoDB

To implement this solution, you need a DynamoDB table that can either be used from your existing resources, or created using the sample data file that you can import from an S3 bucket. For this post, we guide you through importing sample data from an S3 bucket into a new DynamoDB table, providing a practical foundation for the concepts discussed.

To create a sample table in DynamoDB, complete the following steps:

Download the fictitious ecommerce_customer_behavior.csv dataset. This dataset captures customer behavior and interactions on an ecommerce platform.
On the Amazon S3 console, open the S3 bucket used by the SageMaker Unified Studio project.
Upload the CSV file you downloaded.

Select the uploaded file to view its details page.

Copy the value for S3 URI and make a note of it; you will use this path for the subsequent DynamoDB table creation step.

Create a Dynamo DB table

Complete the following steps to create a DynamoDB table from a file from Amazon S3, using the import from Amazon S3 functionality. Then you can enable the settings on the DynamoDB table required to enable zero-ETL integration.

On the DynamoDB console, select Imports from S3 in the navigation pane.
Select Import from S3.

Enter the S3 URI from previous step for Source S3 URL, select CSV for Import file format, and select Next.

Provide the table name as ecommerce_customer_behavior, the partition key as customer_id, and the sort key as product_id, then select Next.

Use the default table settings, then select Next to review the details.

Review the settings and select Import.

It will take a few minutes for the import status to change from Importing to Completed.

When the import is complete, you should be able to see the table created on the Tables page.

Select the ecommerce_customer_behavior table and select Edit PTIR.

Select Turn on point in time recovery and select Save changes.

This is required for setting up zero-ETL using DynamoDB as source.
On the Backups tab, you should see the status for PITR as On.

Additionally, you need to use a table policy to enable access for zero-ETL integration. On the Permissions tab, and copy the following code under Resource-based policy for table:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "TablePolicy01",
            "Effect": "Allow",
            "Principal": {
                "Service": "glue.amazonaws.com"
            },
            "Action": [
                "dynamodb:ExportTableToPointInTime",
                "dynamodb:DescribeExport",
                "dynamodb:DescribeTable"
            ],
            "Resource": "*"
        }
    ]
}

This policy uses all the resources, which shouldn’t be used in production workload. To deploy this setup in production, restrict it to only specific zero-ETL integration resources by adding a condition to the resource-based policy.

Now that you have used the Amazon S3 import method to load a CSV file to create a DynamoDB table, you can enable zero-ETL integration on the table.

Validate permission settings

To validate if the catalog permission setting is appropriate, complete the following steps:

On the AWS Glue console, select Databases in the navigation pane.

Check for the database salesmarketing_XXX.

Select Catalog settings in the navigation pane, and save the permissions.

The following code is an example of permissions for catalog settings:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<Account-id>:root"
            },
            "Action": "glue:CreateInboundIntegration",
            "Resource": "arn:aws:glue:<region>:<Account-id>:database/salesmarketing_XXX"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "glue.amazonaws.com"
            },
            "Action": "glue:AuthorizeInboundIntegration",
            "Resource": "arn:aws:glue:<region>:<Account-id>:database/salesmarketing_XXX"
        }
    ]
}

Now you’re ready to create your zero-ETL integration.

Create a zero-ETL integration

Complete the following steps to create a zero-ETL integration:

On the AWS Glue console, select Zero-ETL integrations in the navigation pane.

Select “Create zero-ETL integration” to create a new configuration.

Select Amazon DynamoDB as the source type.

Under Source details, select ecommerce_customer_behavior for DynamoDB table.

Under Target details, provide the following information:
1. For AWS account, select Use the current account.
2. For Data warehouse or catalog, enter the account ID of your default catalog.
3. For Target database, enter salesmarketing_XXX.
4. For Target IAM role, enter datazone_usr_role_XXX.

Under Output settings, select Unnest all fields and Use primary keys from DynamoDB tables, leave Configure target table name as the default value (ecommerce_customer_behavior), then select Next.

Enter zetl-ecommerce-customer-behavior for Name under Integration details, then select Next.

Select Create and launch integration to launch the integration.

The status should be Creating after the integration is successfully initiated.
The status will change to Active in approximately a minute.

Verify that the SageMaker Lakehouse table exists. This process might take up to 15 minutes to complete, because the default refresh interval from DynamoDB is set to 15 minutes.

Validate the SageMaker Lakehouse table

You can now query your SageMaker Lakehouse table, created through zero-ETL integration, using various query engines. Complete the following steps to verify you can you see the table in SageMaker Unified Studio:

Select your project to view its details page.

Select Data in the navigation pane.

Verify that you can see the Iceberg table in the SageMaker Lakehouse catalog.

Query with Athena

In this section, we show how to use Athena to query the SageMaker Lakehouse table from SageMaker Unified Studio. On the project page, locate the ecommerce_customer_behavior table in the catalog, and on the options menu (three dots), select Query with Athena.

This creates a SELECT query against the SageMaker Lakehouse table in a new window, and you should see the query results as shown in the following screenshot.

Query with Amazon Redshift

You can also query the SageMaker Lakehouse table from SageMaker Unified Studio using Amazon Redshift. Complete the following steps:

Select the connection on the top right.
Select Redshift (Lakehouse) from the list of connections.
Select the awsdatacatalog database.
Select the salesmarketing schema.
Select Choose button.

The results will be shown in the Amazon Redshift Query Editor.

Query with Amazon EMR Serverless

You can query the Lakehouse table using Amazon EMR Serverless, which uses Apache Spark’s processing capabilities. Complete the following steps:

On the project page, select Compute in the navigation pane.
Select Add compute on the Data processing tab to create an EMR Serverless compute associated to the project.

You can create new compute resources or connect to existing resources. For this example, select Create new compute resources.

Select EMR Serverless.

Enter a compute name (for example, Sales-Marketing), select the most recent release of EMR Serverless, and select Add compute.

It will take some time to create the compute.

You should see the status as Started for the compute. Now it’s ready to be used as your compute option for querying through a Jupyter notebook.

Select the Build menu and select JupyterLab.

It will take some time to set up the workspace for running JupyterLab.

After the Jupyter Lab space is set up, you should see a page similar to the following screenshot.

Select the new folder icon to create a new folder.

Name the folder lakehouse_zetl_lab.

Navigate to the folder you just created and create a notebook under this folder.

Select the notebook Python3 (ipykernel) on the Launcher tab, and rename the notebook to query_lakehouse_table.

You can observe that the notebook is showing local Python as default language and compute. The two drop down menus show the connection type and compute for the selected connection type, just above the first cell within the Jupyter notebook.

Select PySpark as the connection, and select the EMR Serverless application as compute.

Enter the following sample code to query the table using Spark SQL:

import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# Set the current database
spark.catalog.setCurrentDatabase("salesmarketing_XXX")

# Execute SQL query and store results in DataFrame
df = spark.sql("select * from ecommerce_customer_behavior limit 10")

# Display the results
df.show()

You can see the Spark DataFrame results.

Clean up

To avoid incurring future charges, delete the SageMaker domain, DynamoDB table, AWS Glue resources, and other objects created from this post.

Conclusion

This post demonstrated how you can establish a zero-ETL connection from DynamoDB to SageMaker Lakehouse, making your data available in Iceberg format without building custom data pipelines. We showed how you can analyze this DynamoDB data through various compute engines within SageMaker Unified Studio. This streamlined approach alleviates traditional data movement complexities, and enables more efficient data analysis workflows directly from your DynamoDB tables.

Try out this solution for your own use case, and share your feedback in the comments.

About the authors

Narayani Ambashta is an Analytics Specialist Solutions Architect at AWS, focusing on the automotive and manufacturing sector, where she guides strategic customers in developing modern data and AI strategies. With over 15 years of cross-industry experience, she specializes in big data architecture, real-time analytics, and AI/ML technologies, helping organizations implement modern data architectures. Her expertise spans across lakehouse, generative AI, and IoT platforms, enabling customers to drive digital transformation initiatives. When not architecting modern solutions, she enjoys staying active through sports and yoga.

Raj Ramasubbu is a Senior Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with AWS. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life sciences, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.

Yadgiri Pottabhathini is a Senior Analytics Specialist Solutions Architect in the media and entertainment sector. He specializes in assisting enterprise customers with their data and analytics cloud transformation initiatives, while providing guidance on accelerating their Generative AI adoption through the development of data foundations and modern data strategies that leverage open-source frameworks and technologies.

Junpei Ozono is a Sr. Go-to-market (GTM) Data & AI solutions architect at AWS in Japan. He drives technical market creation for data and AI solutions while collaborating with global teams to develop scalable GTM motions. He guides organizations in designing and implementing innovative data-driven architectures powered by AWS services, helping customers accelerate their cloud transformation journey through modern data and AI solutions. His expertise spans across modern data architectures including Data Mesh, Data Lakehouse, and Generative AI, enabling customers to build scalable and innovative solutions on AWS.

Embracing event driven architecture to enhance resilience of data solutions built on Amazon SageMaker

2025-06-05 Dhrubajyoti Mukherjee

Post Syndicated from Dhrubajyoti Mukherjee original https://aws.amazon.com/blogs/big-data/embracing-event-driven-architecture-to-enhance-resilience-of-data-solutions-built-on-amazon-sagemaker/

Amazon Web Services (AWS) customers value business continuity while building modern data governance solutions. A resilient data solution helps maximize business continuity by minimizing solution downtime and making sure that critical information remains accessible to users. This post provides guidance on how you can use event driven architecture to enhance the resiliency of data solutions built on the next generation of Amazon SageMaker, a unified platform for data, analytics, and AI. SageMaker is a managed service with high availability and durability. If customers want to build a backup and recovery system on their end, we show you how to do this in this blog. It provides three design principles to improve the data solution resiliency of your organization. In addition, it contains guidance to formulate a robust disaster recovery strategy based on event driven architecture. It contains code samples to back up the system metadata of your data solution built on SageMaker, enabling disaster recovery.

The AWS Well-Architected Framework defines resilience as the ability of a system to recover from infrastructure or service disruptions. You can enhance the resiliency of your data solution by adopting three design principles that are highlighted in this post and by establishing a robust disaster recovery strategy. Recovery point objective (RPO) and recovery time objective (RTO) are industry standard metrics to measure the resilience of a system. RPO indicates how much data loss your organization can accept in case of solution failure. RTO refers to the time for the solution to recover after failure. You can measure these metrics in seconds, minutes, hours, or days. The next section discusses how you can align your data solution resiliency strategy to meet the needs of your organization.

Formulating a strategy to enhance data solution resilience

To develop a robust resiliency strategy for your data solution built on SageMaker, start with how users interact with the data solution. The user interaction influences the data solution architecture, the degree of automation, and determines your resiliency strategy. Here are a few aspects you might consider while designing the resiliency of your data solution.

Data solution architecture – The data solution of your organization might follow a centralized, decentralized, or hybrid architecture. This architecture pattern reflects the distribution of responsibilities of the data solution based on the data strategy of your organization. This shift in responsibilities is reflected in the structure of the teams that perform activities in the Amazon DataZone data portal, SageMaker Unified Studio portal, AWS Management Console, and underlying infrastructure. Examples of such activities include configuring and running the data sources, publishing data assets in the data catalog, subscribing to data assets, and assigning members to projects.
User persona – The user persona, their data, and cloud maturity influence their preferences for interacting with the data solution. The users of a data governance solution fall into two categories: business users and technical users. Business users of your organization might include data owners, data stewards, and data analysts. They might find the Amazon DataZone data portal and SageMaker Unified Studio portal more convenient for tasks such as approving or rejecting subscription requests and performing one-time queries. Technical users such as data solution administrators, data engineers, and data scientists might opt for automation when making system changes. Examples of such activities include publishing data assets, managing glossary and metadata forms in the Amazon DataZone data portal or in SageMaker Unified Studio portal. A robust resiliency strategy accounts for tasks performed by both user groups.
Empowerment of self-service – The data strategy of your organization determines autonomy granted to the users. Increased user autonomy demands a high level of abstraction of the cloud infrastructure powering the data solution. SageMaker empowers self-service by enabling users to perform regular data management activities in the Amazon DataZone data portal and in the SageMaker Unified Studio portal. The level of self-service maturity of the data solution depends on the data strategy and user maturity of your organization. At an early stage, you might limit the self-service features to the use cases for onboarding the data solution. As the data solution scales, consider increasing the self-service capabilities. See Data Mesh Strategy Framework to learn about the different phases of a data mesh-based data solution.

Adopt the following design principles to enhance the resiliency of your data solution:

Choose serverless services – Use serverless AWS services to build your data solution. Serverless services scale automatically with increasing system load, provide fault isolation, and have built-in high-availability. Serverless services minimize the need for infrastructure management, reducing the need to design resiliency into the infrastructure. SageMaker seamlessly integrates with several serverless services such Amazon Simple Storage Service (Amazon S3), AWS Glue, AWS Lake Formation, and Amazon Athena.
Document system metadata – Document the system metadata of your data solution using infrastructure-as-code (IaC) and automation. Consider how users interact with the data solution. If the users prefer to perform certain activities through the Amazon DataZone data portal and SageMaker Unified Studio portal, implement automation to capture and store the metadata that’s relevant for disaster recovery. Use Amazon Relational Database Service (Amazon RDS) and Amazon DynamoDB to store the system metadata of your data solution.
Monitor system health – Implement a monitoring and alerting solution for your data solution so that you can respond to service interruptions and initiate the recovery process. Make sure that system activities are logged so that you can troubleshoot the system interruption. Amazon CloudWatch helps you monitor AWS resources and the applications you run on AWS in real time.

The next section presents disaster recovery strategies to recover your data solution built on SageMaker.

Disaster recovery strategies

Disaster recovery focuses on one-time recovery objectives in response to natural disasters, large-scale technical failures, or human threats such as attack or error. Disaster recovery is a crucial part of your business continuity plan. As shown in the following figure, AWS offers the following options for disaster recovery: Backup and restore, pilot light, warm standby, and multi-site active/active.

The business continuity requirements and cost of recovery should guide your organization’s disaster recovery strategy. As a general guideline, the recovery cost of your data solution increases with reduced RPO and RTO requirements. The next section provides architecture patterns to implement a robust backup and recovery solution for a data solution built on SageMaker.

Solution overview

This section provides event-driven architecture patterns following the backup and restore approach to enhance resiliency of your data solution. This active/passive strategy-based solution stores the system metadata in a DynamoDB table. You can use the system metadata to restore your data solution. The following architecture patterns provide regional resilience. You can simplify the architecture of this solution to restore data in a single AWS Region.

Pattern 1: Point-in-time backup

The point-in-time backup captures and stores system metadata of a data solution built on SageMaker when a user or an automation performs an action. In this pattern, a user activity or an automation initiates an event that captures the system metadata. This pattern is suited for low RPO requirements, ranging from seconds to minutes. The following architecture diagram shows the solution for the point-in-time backup process.

Architecture point-in-time-backup

The steps comprise the following.

User or automation performs an activity on an Amazon DataZone domain or Amazon Unified Studio domain.
This activity creates a new event in AWS CloudTrail.
The CloudTrail event is sent to Amazon EventBridge. Alternatively, you can use Amazon DataZone as the event source for the EventBridge rule.
AWS Lambda transforms and stores this event in a DynamoDB global table where the Amazon DataZone domain is hosted.
The information is replicated into the replica DynamoDB table in a secondary Region. The replica DynamoDB table can be used to restore the data solution based on SageMaker in the secondary Region.

Pattern 2: Scheduled backup

The scheduled backup captures and stores system metadata of a data solution built on SageMaker at regular intervals. In this pattern, an event is initiated based on a defined time schedule. This pattern is suited for RPO requirements in the order of hours. The following architecture diagram displays the solution for point-in-time backup process.

The steps comprise the following.

EventBridge triggers an event at regular interval and sends this event to AWS Step Functions.
The Step Functions state machine contains multiple Lambda functions. These Lambda functions get the system metadata from either a SageMaker Unified Studio domain or an Amazon DataZone domain.
The system metadata is stored in an DynamoDB global table in the primary Region where the Amazon DataZone domain is hosted.
The information is replicated into the replica DynamoDB table in a secondary Region. The data solution can be restored in the secondary Region using the replica DynamoDB table.

The next section provides step by step instructions to deploy a code sample that implements the scheduled backup pattern. This code sample stores asset information of a data solution built on a SageMaker Unified Studio domain and an Amazon DataZone domain in an DynamoDB global table. The data in the DynamoDB table is encrypted at rest using a customer managed key stored in AWS Key Management Service (AWS KMS). A multi-Region replica key encrypts the data in the secondary Region. The asset uses the data lake blueprint that contains the definition for launching and configuring a set of services (AWS Glue, Lake Formation, and Athena) to publish and use data lake assets in the business data catalog. The code sample uses the AWS Cloud Development Kit (AWS CDK) to deploy the cloud infrastructure.

Prerequisites

An active AWS account.
AWS administrator credentials for the central governance account in your development environment
AWS Command Line Interface (AWS CLI) installed to manage your AWS services from the command line (recommended)
Node.js and Node Package Manager (npm) installed to manage AWS CDK applications
AWS CDK Toolkit installed globally in your development environment by using npm, to synthesize and deploy AWS CDK applications

npm install -g aws-cdk

TypeScript installed in your development environment or installed globally by using npm compiler:

npm install -g typescript

Docker installed in your development environment (recommended)
An integrated development environment (IDE) or text editor with support for Python and TypeScript (recommended)

Walkthrough for data solutions built on a SageMaker Unified Studio domain

This section provides step by step instructions to deploy a code sample that implements the scheduled backup pattern for data solutions built on a SageMaker Unfied Studio domain.

Set up SageMaker Unified Studio

Sign into the IAM console. Create an IAM role that trusts Lambda with the following policy.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "datazone:Search",
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "dynamodb:PutItem"
            ],
            "Resource": "arn:aws:dynamodb:<AWS_REGION>:<AWS_ACCOUNT>:table/*"
        },
        {
            "Sid": "VisualEditor2",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:Encrypt",
                "kms:GenerateDataKey",
                "kms:ReEncrypt*",
                "kms:DescribeKey"
            ],
            "Resource": "arn:aws:kms:<AWS_REGION>:<AWS_ACCOUNT>:key/<KMS_KEY_ID>"
        },
        {
            "Sid": "VisualEditor3",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:<AWS_REGION>:<AWS_ACCOUNT>:log-group:*:log-stream:*",
                "arn:aws:logs:<AWS_REGION>:<AWS_ACCOUNT>:log-group:*"
            ]
        }
    ]
}

Note down the Amazon Resource Name (ARN) of the Lambda role. Navigate to SageMaker and choose Create a Unified Studio domain.
Select Quick setup and expand the Quick setup settings section. Enter a domain name, for example, CORP-DEV-SMUS. Select the Virtual private cloud (VPC) and Subnets. Choose Continue.
Enter the email address of the SageMaker Unified Studio user in the Create IAM Identity Center user section. Choose Create domain.
After the domain is created, choose Open unified studio in the top right corner.
Sign in to SageMaker Unified Studio using the single sign-on (SSO) credentials of your user. Choose Create project at the top right corner. Enter a project name and description, choose Continue twice, and choose Create project. Wait unti project creation is complete.
After the project is created, go into the project by selecting the project name. Select Query Editor from the Build drop-down menu on the top left. Paste the following create table as select (CTAS) query script in the query editor window and run it to create a new table named mkt_sls_table as described in Produce data for publishing. The script creates a table with sample marketing and sales data.

CREATE TABLE mkt_sls_table AS
SELECT 146776932 AS ord_num, 23 AS sales_qty_sld, 23.4 AS wholesale_cost, 45.0 as lst_pr, 43.0 as sell_pr, 2.0 as disnt, 12 as ship_mode,13 as warehouse_id, 23 as item_id, 34 as ctlg_page, 232 as ship_cust_id, 4556 as bill_cust_id
UNION ALL SELECT 46776931, 24, 24.4, 46, 44, 1, 14, 15, 24, 35, 222, 4551
UNION ALL SELECT 46777394, 42, 43.4, 60, 50, 10, 30, 20, 27, 43, 241, 4565
UNION ALL SELECT 46777831, 33, 40.4, 51, 46, 15, 16, 26, 33, 40, 234, 4563
UNION ALL SELECT 46779160, 29, 26.4, 50, 61, 8, 31, 15, 36, 40, 242, 4562
UNION ALL SELECT 46778595, 43, 28.4, 49, 47, 7, 28, 22, 27, 43, 224, 4555
UNION ALL SELECT 46779482, 34, 33.4, 64, 44, 10, 17, 27, 43, 52, 222, 4556
UNION ALL SELECT 46779650, 39, 37.4, 51, 62, 13, 31, 25, 31, 52, 224, 4551
UNION ALL SELECT 46780524, 33, 40.4, 60, 53, 18, 32, 31, 31, 39, 232, 4563
UNION ALL SELECT 46780634, 39, 35.4, 46, 44, 16, 33, 19, 31, 52, 242, 4557
UNION ALL SELECT 46781887, 24, 30.4, 54, 62, 13, 18, 29, 24, 52, 223, 4561

Navigate to Data sources from the Project. Choose Run in the Actions section next to the project.default_lakehouse connection. Wait until the run is complete.
Navigate to Assets in the left side bar. Select the mkt_sls_table in the Inventory section and review the metadata that was generated. Choose Accept All if you’re satisfied with the metadata.
Choose Publish Asset to publish the mkt_sls_table table to the business data catalog, making it discoverable and understandable across your organization.
Choose Members in the navigation pane. Choose Add members and select the IAM role you created in Step 1. Add the role as a Contributor in the project.

Deployment steps

After setting up SageMaker Unified Studio, use the AWS CDK stack provided on GitHub to deploy the solution to back up the asset metadata that is created in the previous section.

Clone the repository from GitHub to your preferred integrated development environment (IDE) using the following commands.

git clone https://github.com/aws-samples/sample-event-driven-resilience-data-solutions-sagemaker.git
cd sample-event-driven-resilience-data-solutions-sagemaker

Export AWS credentials and the primary Region to your development environment for the IAM role with administrative permissions, use the following format

export AWS_REGION=
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
export AWS_SESSION_TOKEN=

In a production environment, use AWS Secrets Manager or AWS Systems Manager Parameter Store to manage credentials. Automate the deployment process using a continuous integration and delivery (CI/CD) pipeline.

Bootstrap the AWS account in the primary and secondary Regions by using AWS CDK and running the following command.

cdk bootstrap aws://<AWS_ACCOUNT_ID>/<AWS_REGION>
cdk bootstrap aws://<AWS_ACCOUNT_ID>/<AWS_SECONDARY_REGION>
cd unified-studio

Modify the following parameters in the config/Config.ts file.

SMUS_APPLICATION_NAME – Name of the application.
SMUS_SECONDARY_REGION – Secondary AWS region for backup.
SMUS_BACKUP_INTERVAL_MINUTES – Minutes before each backup interval. 
SMUS_STAGE_NAME – Name of the stage. 
SMUS_DOMAIN_ID – Domain identifier of the Amazon SageMaker Unified Studio. 
SMUS_PROJECT_ID – Project identifier of the Amazon SageMaker Unified Studio. 
SMUS_ASSETS_REGISTRAR_ROLE_ARN – ARN of the AWS Lambda role created in step 1 of the preceding section.

Install the dependencies by running the following command:

npm install

Synthesize the CloudFormation template by running the following command.

cdk synth

Deploy the solution by running the following command.

cdk deploy –all

After the deployment is complete, sign in to your AWS account and navigate to the CloudFormation console to verify that the infrastructure deployed.

When deployment is complete, wait for the duration of DZ_BACKUP_INTERVAL_MINUTES. Navigate to the <DZ_APPLICATION_NAME >AssetsInfo DynamoDB table. Retrieve the data from the DynamoDB table. The following screenshot shows the data in the Items returned section. Verify the same data in the secondary Region. Screenshot smus-dynamodb

Clean up

Use the following steps to clean up the resources deployed.

Empty the S3 buckets that were created as part of this deployment.
In your local development environment (Linux or macOS):
Navigate to the unified-studio directory of your repository.
Export the AWS credentials for the IAM role that you used to create the AWS CDK stack.
To destroy the cloud resources, run the following command:

cdk destroy --all

Go to the SageMaker Unified Studio and delete the published data assets that were created in the project.
Use the console to delete the SageMaker Unified Studio domain.

Walkthrough for data solutions built on an Amazon DataZone domain

This section provides step by step instructions to deploy a code sample that implements the scheduled backup pattern for data solutions built on an Amazon DataZone domain.

Deployment steps

After completing the prerequisites, use the AWS CDK stack provided on GitHub to deploy the solution to backup system metadata of the data solution built on Amazon DataZone domain

Clone the repository from GitHub to your preferred IDE using the following commands.

git clone https://github.com/aws-samples/sample-event-driven-resilience-data-solutions-sagemaker.git
cd event-driven-resilience-sagemaker

Export AWS credentials and the primary Region information to your development environment for the AWS Identity and Access Management (IAM) role with administrative permissions, use the following format:

export AWS_REGION=
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
export AWS_SESSION_TOKEN=

In a production environment, use Secrets Manager or Systems Manager Parameter Store to manage credentials. Automate the deployment process using a CI/CD pipeline.

Bootstrap the AWS account in the primary and secondary Regions by using AWS CDK and running the following command:

cdk bootstrap aws://<AWS_ACCOUNT_ID>/<AWS_REGION>
cdk bootstrap aws://<AWS_ACCOUNT_ID>/<AWS_SECONDARY_REGION>
cd datazone

From the console for IAM, note the Amazon Resource Name (ARN) of the CDK execution role. Update the trust relationship of the IAM role so that Lambda can assume the role.

Modify the following parameters in the config/Config.ts file.

DZ_APPLICATION_NAME – Name of the application.
DZ_SECONDARY_REGION – Secondary Region for backup.
DZ_BACKUP_INTERVAL_MINUTES – Minutes before each backup interval.
DZ_STAGE_NAME – Name of the stage (dev, qa, or prod).
DZ_DOMAIN_NAME – Name of the Amazon DataZone domain
DZ_DOMAIN_DESCRIPTION – Description of the Amazon DataZone domain
DZ_DOMAIN_TAG – Tag of the Amazon DataZone domain
DZ_PROJECT_NAME – Name of the Amazon DataZone project
DZ_PROJECT_DESCRIPTION – Description of the Amazon DataZone project
CDK_EXEC_ROLE_ARN – ARN of the CDK execution role
DZ_ADMIN_ROLE_ARN – ARN of the administrator role

Install the dependencies by running the following command:

npm install

Synthesize the AWS CloudFormation template by running the following command:

cdk synth

Deploy the solution by running the following command:

cdk deploy --all

After the deployment is complete, sign in to your AWS account and navigate to the CloudFormation console to verify that the infrastructure deployed.

Document system metadata

This section provides instructions to create an asset and demonstrates how you can retrive the metadata of the asset. Perform the following steps to retrieve the systems metadata.

Sign in to the Amazon DataZone data portal from the console. Select the project and choose Query data at the upper right.

Screenshot datazone-open-query

Choose Open Athena and make sure that <DZ_PROJECT_NAME>_DataLakeEnvironment is selected in the Amazon DataZone environment dropdown at the upper right and that on the left, and that <DZ_PROJECT_NAME>_datalakeenvironment_pub_db is selected as the Database.
Create a new AWS Glue table for publishing to Amazon DataZone. Paste the following create table as select (CTAS) query script in the Query window and run it to create a new table named mkt_sls_table as described in Produce data for publishing. The script creates a table with sample marketing and sales data.

CREATE TABLE mkt_sls_table AS
SELECT 146776932 AS ord_num, 23 AS sales_qty_sld, 23.4 AS wholesale_cost, 45.0 as lst_pr, 43.0 as sell_pr, 2.0 as disnt, 12 as ship_mode,13 as warehouse_id, 23 as item_id, 34 as ctlg_page, 232 as ship_cust_id, 4556 as bill_cust_id
UNION ALL SELECT 46776931, 24, 24.4, 46, 44, 1, 14, 15, 24, 35, 222, 4551
UNION ALL SELECT 46777394, 42, 43.4, 60, 50, 10, 30, 20, 27, 43, 241, 4565
UNION ALL SELECT 46777831, 33, 40.4, 51, 46, 15, 16, 26, 33, 40, 234, 4563
UNION ALL SELECT 46779160, 29, 26.4, 50, 61, 8, 31, 15, 36, 40, 242, 4562
UNION ALL SELECT 46778595, 43, 28.4, 49, 47, 7, 28, 22, 27, 43, 224, 4555
UNION ALL SELECT 46779482, 34, 33.4, 64, 44, 10, 17, 27, 43, 52, 222, 4556
UNION ALL SELECT 46779650, 39, 37.4, 51, 62, 13, 31, 25, 31, 52, 224, 4551
UNION ALL SELECT 46780524, 33, 40.4, 60, 53, 18, 32, 31, 31, 39, 232, 4563
UNION ALL SELECT 46780634, 39, 35.4, 46, 44, 16, 33, 19, 31, 52, 242, 4557
UNION ALL SELECT 46781887, 24, 30.4, 54, 62, 13, 18, 29, 24, 52, 223, 4561

Go to the Tables and Views section and verify that the mkt_sls_table table was successfully created.
In the Amazon DataZone Data Portal, go to Data sources, select the <DZ_PROJECT_NAME>-DataLakeEnvironment-default-datasource, and choose Run. The mkt_sls_table will be listed in the inventory and available to publish.
Select the mkt_sls_table table and review the metadata that was generated. Choose Accept All if you’re satisfied with the metadata.
Choose Publish Asset and the mkt_sls_table table will be published to the business data catalog, making it discoverable and understandable across your organization.
After the table is published, wait for the duration of DZ_BACKUP_INTERVAL_MINUTES. Navigate to the <DZ_APPLICATION_NAME >AssetsInfo DynamoDB table and retrieve the data from the table. The following screenshot shows the data in the Items returned section. Verify the same data in the secondary Region.

Clean up

Use the following steps to clean up the resources deployed.

Empty the Amazon Simple Storage Service (Amazon S3) buckets that were created as part of this deployment.
Go to the Amazon DataZone domain portal and delete the published data assets that were created in the Amazon DataZone project.
In your local development environment (Linux or macOS):

Navigate to the datazone directory of your repository.
Export the AWS credentials for the IAM role that you used to create the AWS CDK stack.
To destroy the cloud resources, run the following command:

cdk destroy --all

Conclusion

This post explores how to build a resilient data governance solution on Amazon SageMaker. Resilient design principles and a robust disaster recovery strategy are central to the business continuity of AWS customers. The code samples included in this post implement a backup process of the data solution at regular time interval. They store the Amazon SageMaker asset information in Amazon DynamoDB Global tables. You can extend the backup solution by identifying the system metadata that is relevant for the data solution of your organization and by using Amazon SageMaker APIs to capture and store the metadata. The DynamoDB Global table replicates the changes in the DynamoDB table in the primary region to the secondary region in an asynchronous manner. Consider Implementing an additional layer of resiliency by using AWS Backup to back up the DynamoDB table at regular interval. In the next post, we show how you can use the system metadata to restore your data solution in the secondary region.

Adopt the resiliency features offered by Amazon DataZone and Amazon SageMaker Unified Studio. Use AWS Resilience Hub to assess the resilience of your data solution. AWS Resilience Hub helps you to define your resilience goals, assess your resilience posture against those goals, and implement recommendations for improvement based on the AWS Well-Architected Framework.

To build a data mesh based data solution using Amazon DataZone domain, see our GitHub repository. This open source project provides a step-by-step blueprint for constructing a data mesh architecture using the powerful capabilities of Amazon SageMaker, AWS Cloud Development Kit (AWS CDK), and AWS CloudFormation.

About the authors

BDB-4558-Dhruba Dhrubajyoti Mukherjee is a Cloud Infrastructure Architect with a strong focus on data strategy, data governance, and artificial intelligence at Amazon Web Services (AWS). He uses his deep expertise to provide guidance to global enterprise customers across industries, helping them build scalable and secure cloud solutions that drive meaningful business outcomes. Dhrubajyoti is passionate about creating innovative, customer-centric solutions that enable digital transformation, business agility, and performance improvement. Outside of work, Dhrubajyoti enjoys spending quality time with his family and exploring nature through his love of hiking mountains.

Unify streaming and analytical data with Amazon Data Firehose and Amazon SageMaker Lakehouse

2025-05-27 Kalyan Janaki

Post Syndicated from Kalyan Janaki original https://aws.amazon.com/blogs/big-data/unify-streaming-and-analytical-data-with-amazon-data-firehose-and-amazon-sagemaker-lakehouse/

Organizations are increasingly required to derive real-time insights from their data while maintaining the ability to perform analytics. This dual requirement presents a significant challenge: how to effectively bridge the gap between streaming data and analytical workloads without creating complex, hard-to-maintain data pipelines. In this post, we demonstrate how to simplify this process using Amazon Data Firehose (Firehose) to deliver streaming data directly to Apache Iceberg tables in Amazon SageMaker Lakehouse, creating a streamlined pipeline that reduces complexity and maintenance overhead.

Streaming data empowers AI and machine learning (ML) models to learn and adapt in real time, which is crucial for applications that require immediate insights or dynamic responses to changing conditions. This creates new opportunities for business agility and innovation. Key use cases include predicting equipment failures based on sensor data, monitoring supply chain processes in real time, and enabling AI applications to respond dynamically to changing conditions. Real-time streaming data helps customers make quick decisions, fundamentally changing how businesses compete in real-time markets.

Amazon Data Firehose seamlessly acquires, transforms, and delivers data streams to lakehouses, data lakes, data warehouses, and analytics services, with automatic scaling and delivery within seconds. For analytical workloads, a lakehouse architecture has emerged as an effective solution, combining the best elements of data lakes and data warehouses. Apache Iceberg, an open table format, enables this transformation by providing transactional guarantees, schema evolution, and efficient metadata handling that were previously only available in traditional data warehouses. SageMaker Lakehouse unifies your data across Amazon Simple Storage Service (Amazon S3) data lakes, Amazon Redshift data warehouses, and other sources, and gives you the flexibility to access your data in-place with Iceberg-compatible tools and engines. By using SageMaker Lakehouse, organizations can harness the power of Iceberg while benefiting from the scalability and flexibility of a cloud-based solution. This integration removes the traditional barriers between data storage and ML processes, so data workers can work directly with Iceberg tables in their preferred tools and notebooks.

In this post, we show you how to create Iceberg tables in Amazon SageMaker Unified Studio and stream data to these tables using Firehose. With this integration, data engineers, analysts, and data scientists can seamlessly collaborate and build end-to-end analytics and ML workflows using SageMaker Unified Studio, removing traditional silos and accelerating the journey from data ingestion to production ML models.

Solution overview

The following diagram illustrates the architecture of how Firehose can deliver real-time data to SageMaker Lakehouse.

This post includes an AWS CloudFormation template to set up supporting resources so Firehose can deliver streaming data to Iceberg tables. You can review and customize it to suit your needs. The template performs the following operations:

Creates an AWS Identity and Access Management (IAM) role with permissions needed for Firehose to write to an S3 bucket
Creates resources for the Amazon Kinesis Data Generator to send sample streaming data to Firehose
Grants AWS Lake Formation permissions to the Firehose IAM role for Iceberg tables created in SageMaker Unified Studio
Creates an S3 bucket to backup records failed to deliver

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account – If you don’t have an account, you can create one.
A SageMaker Unified Studio domain – For instructions, refer to Create an Amazon SageMaker Unified Studio domain – quick setup.
A demo project – Create a demo project in your SageMaker Unified Studio domain. For instructions, see Create a project. For this example, we choose All capabilities in the project profile section and use streaming_datalake as the AWS Glue database name.

After you create the prerequisites, verify you can log in to SageMaker Unified Studio and the project is created successfully. Every project created in SageMaker Unified Studio gets a project location and project IAM role, as highlighted in the following screenshot.

Create an Iceberg table

For this solution, we use Amazon Athena as the engine for our query editor. Complete the following steps to create your Iceberg table:

In SageMaker Unified Studio, on the Build menu, choose Query Editor.

Choose Athena as the engine for query editor and choose the AWS Glue database created for the project.

Use the following SQL statement to create the Iceberg table. Make sure to provide your project AWS Glue database and project Amazon S3 location (can be found on the project overview page):

CREATE TABLE firehose_events (
type struct<device: string, event: string, action: string>,
customer_id string,
event_timestamp timestamp,
region string)
LOCATION '<PROJECT_S3_LOCATION>/iceberg/events'
TBLPROPERTIES (
'table_type'='iceberg',
'write_compression'='zstd'
);

Deploy the supporting resources

The next step is to deploy the required resources into your AWS environment by using a CloudFormation template. Complete the following steps:

Choose Launch Stack.
Choose Next.
Leave the stack name as firehose-lakehouse.
Provide the user name and password that you want to use for accessing the Amazon Kinesis Data Generator application.
For DatabaseName, enter the AWS Glue database name.
For ProjectBucketName, enter the project bucket name (located on the SageMaker Unified Studio project details page).
For TableName, enter the table name created in SageMaker Unified Studio.
Choose Next.

Select I acknowledge that AWS CloudFormation might create IAM resources and choose Next.

Complete the stack.

Create a Firehose stream

Complete the following steps to create a Firehose stream to deliver data to Amazon S3:

On the Firehose console, choose Create Firehose stream.

For Source, choose Direct PUT.
For Destination, choose Apache Iceberg Tables.

This example chooses Direct PUT as the source, but you can apply the same steps for other Firehose sources, such as Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK).

For Firehose stream name, enter firehose-iceberg-events.

Collect the database name and table name from the SageMaker Unified Studio project to use in the next step.

In the Destination settings section, enable Inline parsing for routing information and provide the database name and table name from the previous step.

Make sure you enclose the database and table names in double quotes if you want to deliver data to a single database and table. Amazon Data Firehose can also route records to different tables based on the content of the record. For more information, refer to Route incoming records to different Iceberg tables.

Under Buffer hints, reduce the buffer size to 1 MiB and the buffer interval to 60 seconds. You can fine-tune these settings based on your use case latency needs.

In the Backup settings section, enter the S3 bucket created by the CloudFormation template (s3://firehose-demo-iceberg-<account_id>-<region>) and the error output prefix (error/events-1/).

In the Advanced settings section, enable Amazon CloudWatch error logging to troubleshoot any failures, and in for Existing IAM roles, choose the role that starts with Firehose-Iceberg-Stack-FirehoseIamRole-*, created by the CloudFormation template.
Choose Create Firehose stream.

Generate streaming data

Use the Amazon Kinesis Data Generator to publish data records into your Firehose stream:

On the AWS CloudFormation console, choose Stacks in the navigation pane and open your stack.
Select the nested stack for the generator, and go to the Outputs tab.
Choose the Amazon Kinesis Data Generator URL.

Enter the credentials that you defined when deploying the CloudFormation stack.

Choose the AWS Region where you deployed the CloudFormation stack and choose your Firehose stream.
For the template, replace the default values with the following code:

{
"type": {
"device": "{{random.arrayElement(["mobile", "desktop", "tablet"])}}",
"event": "{{random.arrayElement(["firehose_events_1", "firehose_events_2"])}}",
"action": "update"
},
"customer_id": "{{random.number({ "min": 1, "max": 1500})}}",
"event_timestamp": "{{date.now("YYYY-MM-DDTHH:mm:ss.SSS")}}",
"region": "{{random.arrayElement(["pdx", "nyc"])}}"
}

Before sending data, choose Test template to see an example payload.
Choose Send data.

You can monitor the progress of the data stream.

Query the table in SageMaker Unified Studio

Now that Firehose is delivering data to SageMaker Lakehouse, you can perform analytics on that data in SageMaker Unified Studio using different AWS analytics services.

Clean up

It’s generally a good practice to clean up the resources created as part of this post to avoid additional cost. Complete the following steps:

On the AWS CloudFormation console, choose Stacks in the navigation pane.
Select the stack firehose-lakehouse* and on the Actions menu, choose Delete Stack.
In SageMaker Unified Studio, delete the domain created for this post.

Conclusion

Streaming data allows models to make predictions or decisions based on the latest information, which is crucial for time-sensitive applications. By incorporating real-time data, models can make more accurate predictions and decisions. Streaming data can help organizations avoid the costs associated with storing and processing large datasets, because it focuses on the most relevant information. Amazon Data Firehose makes it straightforward to bring real-time streaming data to data lakes in Iceberg format and unifying it with other data assets in SageMaker Lakehouse, making streaming data accessible by various analytics and AI services in SageMaker Unified Studio to deliver real-time insights. Try out the solution for your own use case, and share your feedback and questions in the comments.

About the Authors

Kalyan Janaki is Senior Big Data & Analytics Specialist with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS.

Phaneendra Vuliyaragoli is a Product Management Lead for Amazon Data Firehose at AWS. In this role, Phaneendra leads the product and go-to-market strategy for Amazon Data Firehose.

Maria Ho is a Product Marketing Manager for Streaming and Messaging services at AWS. She works with services including Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Managed Service for Apache Flink, Amazon Data Firehose, Amazon Kinesis Data Streams, Amazon MQ, Amazon Simple Queue Service (Amazon SQS), and Amazon Simple Notification Services (Amazon SNS).

Access Amazon Redshift Managed Storage tables through Apache Spark on AWS Glue and Amazon EMR using Amazon SageMaker Lakehouse

2025-05-15 Noritaka Sekiyama

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/access-amazon-redshift-managed-storage-tables-through-apache-spark-on-aws-glue-and-amazon-emr-using-amazon-sagemaker-lakehouse/

Data environments in data-driven organizations are changing to meet the growing demands for analytics, including business intelligence (BI) dashboarding, one-time querying, data science, machine learning (ML), and generative AI. These organizations have a huge demand for lakehouse solutions that combine the best of data warehouses and data lakes to simplify data management with easy access to all data from their preferred engines.

Amazon SageMaker Lakehouse unifies all your data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and artificial intelligence and machine learning (AI/ML) applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data in place with all Apache Iceberg compatible tools and engines. It secures your data in the lakehouse by defining fine-grained permissions, which are consistently applied across all analytics and ML tools and engines. You can bring data from operational databases and applications into your lakehouse in near real time through zero-ETL integrations. It accesses and queries data in-place with federated query capabilities across third-party data sources through Amazon Athena.

With SageMaker Lakehouse, you can access tables stored in Amazon Redshift managed storage (RMS) through Iceberg APIs, using the Iceberg REST catalog backed by AWS Glue Data Catalog. This expands your data integration workload across data lakes and data warehouses, enabling seamless access to diverse data sources.

Amazon SageMaker Unified Studio, Amazon EMR 7.5.0 and higher, and AWS Glue 5.0 natively support SageMaker Lakehouse. This post describes how to integrate data on RMS tables through Apache Spark using SageMaker Unified Studio, Amazon EMR 7.5.0 and higher, and AWS Glue 5.0.

How to access RMS tables through Apache Spark on AWS Glue and Amazon EMR

With SageMaker Lakehouse, RMS tables are accessible through the Apache Iceberg REST catalog. Open source engines such as Apache Spark are compatible with Apache Iceberg, and they can interact with RMS tables by conﬁguring this Iceberg REST catalog. You can learn more in Connecting to the Data Catalog using AWS Glue Iceberg REST extension endpoint.

Note that the Iceberg REST extensions endpoint is used when you access RMS tables. This endpoint is accessible through the Apache Iceberg AWS Glue Data Catalog extensions, which comes preinstalled on AWS Glue 5.0 and Amazon EMR 7.5.0 or higher. The extension library enables access to RMS tables using the Amazon Redshift connector for Apache Spark.

To access RMS backed catalog databases from Spark, each RMS database requires its own Spark session catalog configuration. Here are the required Spark configurations:

Spark config key	Value
`spark.sql.catalog.{catalog_name}`	`org.apache.iceberg.spark.SparkCatalog`
`spark.sql.catalog.{catalog_name}.type`	`glue`
`spark.sql.catalog.{catalog_name}.glue.id`	`{account_id}:{rms_catalog_name}/{database_name}`
`spark.sql.catalog.{catalog_name}.client.region`	`{aws_region}`
`spark.sql.extensions`	`org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions`

Configuration parameters:

{catalog_name}: Your chosen name for referencing the RMS catalog database in your application code
{rms_catalog_name}: The RMS catalog name as shown in the AWS Lake Formation catalogs section
{database_name}: The RMS database name
{aws_region}: The AWS Region where the RMS catalog is located

For a deeper understanding of how the Amazon Redshift hierarchy (databases, schemas, and tables) is mapped to the AWS Glue multilevel catalogs, you can refer to the Bringing Amazon Redshift data into the AWS Glue Data Catalog documentation.

In the following section, we demonstrate how to access RMS tables through Apache Spark using SageMaker Unified Studio JupyterLab notebooks with the AWS Glue 5.0 runtime and Amazon EMR Serverless.

Although we can bring existing Amazon Redshift tables into the AWS Glue Data catalog by creating a Lakehouse Redshift catalog from an existing Redshift namespace and provide access to a SageMaker Unified Studio project, in the following example, you’ll create a managed Amazon Redshift Lakehouse catalog directly from SageMaker Unified Studio and work with that.

Prerequisites

To follow these instructions, you must have the following prerequisites:

An AWS account
An SageMaker Unified Studio domain

Create a SageMaker Unified Studio project

Complete the following steps to create a SageMaker Unified Studio project:

Sign in to SageMaker Unified Studio.
Choose Select a project on the top menu and choose Create project.
For Project name, enter demo.
For Project profile, choose All capabilities.
Choose Continue.

Leave the default values and choose Continue.
Review the configurations and choose Create project.

You need to wait for the project to be created. Project creation can take about 5 minutes. When the project status changes to Active, select the project name to access the project’s home page.

Make note of the Project role ARN because you’ll need it for next steps.

You’ve successfully created the project and noted the project role ARN. The next step is to configure a Lakehouse catalog for your RMS.

Configure a Lakehouse catalog for your RMS

Complete the following steps to configure a Lakehouse catalog for your RMS:

In the navigation pane, choose Data.
Choose the + (plus) sign.
Select Create Lakehouse catalog to create a new catalog and choose Next.

For Lakehouse catalog name, enter rms-catalog-demo.
Choose Add catalog.

Wait for the catalog to be created.

In SageMaker Unified Studio, choose Data in the left navigation pane, then select the three vertical dots next to Redshift (Lakehouse) and choose Refresh to make sure the Amazon Redshift compute is active.

Create a new table in the RMS Lakehouse catalog:

In SageMaker Unified Studio, on the top menu, under Build, choose Query Editor.
On the top right, choose Select data source.
For CONNECTIONS, choose Redshift (Lakehouse).
For DATABASES, choose dev@rms-catalog-demo.
For SCHEMAS, choose public.
Choose Choose.

In the query cell, enter and execute the following query to create a new schema:

create schema "dev@rms-catalog-demo".salesdb

In a new cell, enter and execute the following query to create a new table:

create table salesdb.store_sales (ss_sold_timestamp timestamp, ss_item text, ss_sales_price float);

In a new cell, enter and execute the following query to populate the table with sample data:

insert into salesdb.store_sales values ('2024-12-01T09:00:00Z', 'Product 1', 100.0),
('2024-12-01T11:00:00Z', 'Product 2', 500.0),
('2024-12-01T15:00:00Z', 'Product 3', 20.0),
('2024-12-01T17:00:00Z', 'Product 4', 1000.0),
('2024-12-01T18:00:00Z', 'Product 5', 30.0),
('2024-12-02T10:00:00Z', 'Product 6', 5000.0),
('2024-12-02T16:00:00Z', 'Product 7', 5.0);

In a new cell, enter and run the following query to verify the table contents:

select * from salesdb.store_sales;

(Optional) Create an Amazon EMR Serverless application

IMPORTANT: This section is only required if you plan to test also using Amazon EMR Serverless. If you intend to use AWS Glue exclusively, you can skip this section entirely.

Navigate to the project page. In the left navigation pane, select Compute, then select the Data processing Choose Add compute.

Choose Create new compute resources, then choose Next.

Select EMR Serverless.

Specify emr_serverless_application as Compute name, select Compatibility as Permission mode, and choose Add compute.

Monitor the deployment progress. Wait for the Amazon EMR Serverless application to complete its deployment. This process can take a minute.

Access Amazon Redshift Managed Storage tables through Apache Spark

In this section, we demonstrate how to query tables stored in RMS using a SageMaker Unified Studio notebook.

In the navigation pane, choose Data
Under Lakehouse, select the down arrow next to rms-catalog-demo
Under dev, select the down arrow next salesdb, choose store_sales, and choose the three dots

SageMaker Lakehouse oﬀers multiple analysis options: Query with Athena, Query with Redshift, and Open in Jupyter Lab notebook.

Choose Open in Jupyter Lab notebook
On the Launcher tab, choose Python 3 (ipykernel)

In SageMaker Unified Studio JupyterLab, you can specify different compute types for each notebook cell. Although this example demonstrates using AWS Glue compute (project.spark.compatibility), the same code can be executed using Amazon EMR Serverless by selecting the appropriate compute in the cell settings. The following table shows the connection type and compute values to specify when running PySpark code or Spark SQL code with different engines:

Compute option	Pyspark code		Spark SQL
Compute option	Connection type	Compute	Connection type	Compute
AWS Glue	Pyspark	`project.spark.compatibility`	SQL	`project.spark.compatibility`
Amazon EMR Serverless	Pyspark	`emr-s.emr_serverless_application`	SQL	`emr-s.emr_serverless_application`

In the notebook cell’s top left corner, set Connection Type to PySpark and select spark.compatibility (AWS Glue 5.0) as Compute
Execute the following code to initialize the SparkSession and configure rmscatalog as the session catalog for accessing the dev database under the rms-catalog-demo RMS catalog:

from pyspark.sql import SparkSession

catalog_name = "rmscatalog"
#Change <your_account_id> with your AWS account ID
rms_catalog_id = "<your_account_id>:rms-catalog-demo/dev"

#Change with your AWS region
aws_region="us-east-2"

spark = SparkSession.builder.appName('rms_demo') \
    .config(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') \
    .config(f'spark.sql.catalog.{catalog_name}.type', 'glue') \
    .config(f'spark.sql.catalog.{catalog_name}.glue.id', rms_catalog_id) \
    .config(f'spark.sql.catalog.{catalog_name}.client.region', aws_region) \
    .config('spark.sql.extensions','org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .getOrCreate()

Create a new cell and switch the connection type from PySpark to SQL to execute Spark SQL commands directly
Enter the following SQL statement to view all tables under salesdb (RMS schema) within rmscatalog:

SHOW TABLES IN rmscatalog.salesdb

In a new SQL cell, enter the following DESCRIBE EXTENDED statement to view detailed information about the store_sales table in the salesdb schema:

DESCRIBE EXTENDED rmscatalog.salesdb.store_sales

In the output, you’ll observe that the Provider is set to iceberg. This indicates that the table is recognized as an Iceberg table, despite being stored in Amazon Redshift managed storage.

In a new SQL cell, enter the following SELECT statement to view the content of the table

SELECT * FROM rmscatalog.salesdb.store_sales

Throughout this example, we demonstrated how to create a table in Amazon Redshift Serverless and seamlessly query it as an Iceberg table using Apache Spark within a SageMaker Unified Studio notebook.

Clean up

To avoid incurring future charges, clean up all created resources:

Delete the created SageMaker Unified Studio project. This step will automatically delete Amazon EMR compute (for example, the Amazon EMR Serverless application) that was provisioned from the project:
1. Inside SageMaker Studio, navigate to the demo project’s Project overview section.
2. Choose Actions, then select Delete project.
3. Type confirm and choose Delete project.

Delete the created Lakehouse catalog:
1. Navigate to the AWS Lake Formation page in the Catalogs section.
2. Select the rms-catalog-demo catalog, choose Actions, then select Delete.
3. In the confirmation window type rms-catalog-demo and then choose Drop.

Conclusion

In this post, we demonstrated how to use Apache Spark to interact with Amazon Redshift Managed Storage tables through Amazon SageMaker Lakehouse using the Iceberg REST catalog. This integration provides a unified view of your data across Amazon S3 data lakes and Amazon Redshift data warehouses, so you can build powerful analytics and AI/ML applications while maintaining a single copy of your data.

For additional workloads and implementations, visit Simplify data access for your enterprise using Amazon SageMaker Lakehouse.

About the Authors

Noritaka Sekiyama is a Principal Big Data Architect with Amazon Web Services (AWS) Analytics services. He’s responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Stefano Sandonà is a Senior Big Data Specialist Solution Architect at Amazon Web Services (AWS). Passionate about data, distributed systems, and security, he helps customers worldwide architect high-performance, efficient, and secure data solutions.

Derek Liu is a Senior Solutions Architect based out of Vancouver, BC. He enjoys helping customers solve big data challenges through Amazon Web Services (AWS) analytic services.

Raj Ramasubbu is a Senior Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services (AWS). He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life science, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.

Angel Conde Manjon is a Sr. EMEA Data & AI PSA, based in Madrid. He has previously worked on research related to data analytics and AI in diverse European research projects. In his current role, Angel helps partners develop businesses centered on data and AI.

Appendix: Sample script for Lake Formation FGAC enabled Spark cluster

If you want to access RMS tables from Lake Formation FGAC enabled Spark cluster on AWS Glue or Amazon EMR, refer to the following code example:

from pyspark.sql import SparkSession

catalog_name = "rmscatalog"
rms_catalog_name = "123456789012:rms-catalog-demo/dev"
account_id = "123456789012"
region = "us-east-2"

spark = SparkSession.builder.appName('rms_demo') \
.config('spark.sql.defaultCatalog', catalog_name) \
.config(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') \
.config(f'spark.sql.catalog.{catalog_name}.type', 'glue') \
.config(f'spark.sql.catalog.{catalog_name}.glue.id', rms_catalog_name) \
.config(f'spark.sql.catalog.{catalog_name}.client.region', region) \
.config(f'spark.sql.catalog.{catalog_name}.glue.account-id', account_id) \
.config(f'spark.sql.catalog.{catalog_name}.glue.catalog-arn',f'arn:aws:glue:{region}:{rms_catalog_name}') \
.config('spark.sql.extensions','org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
.getOrCreate()