Tag Archives: Amazon SageMaker Unified Studio

Unified scheduling for visual ETL flows and query books in Amazon SageMaker Unified Studio

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/unified-scheduling-for-visual-etl-flows-and-query-books-in-amazon-sagemaker-unified-studio/

Data engineers and analysts often need to automate their data processing workflows and queries to maintain up-to-date data pipelines and reports. Amazon SageMaker Unified Studio provides a unified environment for data, analytics, machine learning (ML), and AI workloads. Amazon SageMaker Unified Studio provides powerful tools for visual extract, transform, and load (ETL) flows and query books. Until today, scheduling these workflows has required additional setup and infrastructure.

Today, we’re excited to introduce a new unified scheduling feature that simplifies this process. SageMaker Unified Studio allows you to create ETL flows using a visual interface and write SQL analytics queries using query books. This new unified scheduling feature allows you to schedule your visual ETL flows and query books directly from SageMaker Unified Studio within the same interface, eliminating the need for visiting other consoles or complex configurations. Using Amazon EventBridge Scheduler, this feature provides a seamless and easy-to-use scheduling experience.

In this post, we walk through how to schedule your visual ETL flows and query books with just a few clicks, explore the underlying architecture, and demonstrate how this feature can streamline your data workflow automation.

Feature overview

SageMaker Unified Studio unified scheduling is built on top of EventBridge Scheduler and Amazon SageMaker Training. When you configure a new schedule from SageMaker Unified Studio, a new EventBridge schedule is automatically created in your AWS account. The EventBridge schedule is configured with the SageMaker CreateTrainingJob API. The SageMaker Training job runs visual ETL flows or query books.

The following diagram illustrates how it works.

Prerequisites

To run the instruction, you must have the following prerequisites:

  • An AWS account
  • A SageMaker Unified Studio domain
  • A SageMaker Unified Studio project with a All capabilities profile. This profile includes Tooling blueprint in which Scheduling is enabled by default. If scheduling is disabled, you may need to update your project’s profile.

Schedule a visual ETL flow

Complete the following steps to configure a schedule on a visual ETL flow:

  1. On the SageMaker Unified Studio console, on the top menu, choose Build.
  2. Under DATA ANALYSIS & INTEGRATION, choose Visual ETL flows.
  3. For Select or create project to continue, select your project, and choose Continue.
  4. Choose your visual ETL flow. If you don’t have any visual ETL flows, refer to Author visual ETL flows on Amazon SageMaker Unified Studio to create a new visual ETL flow.
  5. Choose the Schedule icon.
  6. For Schedule name, enter a unique name (for example, everyday).
  7. For Schedule Type, select Recurring.
  8. For Value, enter 1.
  9. For Unit, choose days.
  10. For Timezone, choose your time zone.
  11. Choose Create schedule.

You have successfully configured the schedule. Because Start date and time is not given, the visual ETL flow is triggered immediately and then it is triggered once a day after that.

Edit the schedule

You can view the configured schedules with the following steps:

  1. On the SageMaker Unified Studio console, navigate to Visual ETL flows for your project.
  2. Choose the Schedules tab.
  3. Choose Edit schedule under Actions.
  4. Edit with your preferences, then choose Save.

Pause or resume the schedule

If you want to pause the schedule, complete the following steps:

  1. Choose Pause schedule under Actions.

On the same Schedule tab, Status of the schedule will be updated to Paused.

  1. To resume the schedule, choose Activate schedule.

Delete the schedule

To delete the schedule, complete the following steps:

  1. Choose Delete schedule under Actions.
  2. Choose Delete schedule in the dialog.

On the same Schedule tab, you can verify that the deleted schedule disappears.

Schedule a query book flow

Complete the following steps to configure a schedule on a query book:

  1. On the SageMaker Unified Studio console, on the top menu, choose Build.
  2. Under DATA ANALYSIS & INTEGRATION, choose Query Editor.
  3. On the data explorer, under Lakehouse, choose AwsDataCatalog.
  4. Navigate to the table venue_event_agg. This table is created in the previous section.
  5. On the options menu (three dots), choose Query with Athena.
  6. On the Actions menu, choose Save to project.
  7. Choose Save changes.
  8. On the Actions menu, choose Create schedule.
  9. For Schedule Type, choose Recurring.
  10. For Value, enter 1.
  11. For Unit, choose days.
  12. For Timezone, choose your time zone.
  13. Choose Create schedule.

You have successfully configured the schedule. Because Start date and time was not set, the query book is triggered immediately and then it is triggered once a day after that. You can optionally configure start and end times if you want to limit your schedule to run in a specific date range.

To view the configured schedules, in the navigation pane, choose Scheduled queries.

You can view the list of scheduled queries and edit, pause, resume, or delete them, as shown in the previous section.

Clean up

To avoid incurring future charges, clean up the resources you created during this walkthrough:

  1. On the Schedule tab of Visual ETL flows, select the everyday schedule, and choose Delete schedule under Actions. The related EventBridge schedule is automatically deleted as well.
  2. On the SageMaker AI console, choose Training jobs under Training, and delete all the SageMaker training jobs that start with everyday-.
  3. (Optional) To delete the visual ETL flow, on the Flows tab of Visual ETL flows, select your visual ETL flow, and choose Delete flow under Actions.

Conclusion

The new unified scheduling experience in SageMaker Unified Studio simplifies workflow automation. With unified scheduling, you can seamlessly orchestrate your visual ETL flows and query books in one centralized location.

Whether you’re running daily data transformations, weekly analytical queries, or monthly reporting workflows, the unified scheduling experience provides a straightforward path to automation. This capability enables data teams to focus more on deriving insights from their data and less on managing infrastructure and scheduling configurations.

We encourage you to try out this new experience and share your feedback with us. For more information about SageMaker Unified Studio and its capabilities, visit our documentation or explore our other blog posts about visual ETL flows and query books.


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect for AWS Analytics services with a strong focus on data engineering. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Daniel Obi is a Frontend Engineer on the Amazon SageMaker Unified Studio team. He is dedicated to building intuitive and effective solutions that enhance user experience and technical functionality. Outside of his professional work, he enjoys watching and playing basketball.

Vasudevan Venkataramanan is a Senior Software Engineer on the Amazon SageMaker Unified Studio team. He is responsible for technical direction of scheduling and orchestration within SageMaker Unified Studio. Outside of his professional work, he enjoys spending time with his kid, and playing pickleball and cricket.

Yuhang Huang is a Software Development Manager on the Amazon SageMaker Unified Studio team. He leads the engineering team to design, build, and operate scheduling and orchestration capabilities in SageMaker Unified Studio. In his free time, he enjoys playing tennis.

Gal HeyneGal Heyne is a Senior Technical Product Manager for AWS Analytics services with a strong focus on AI/ML and data engineering. She is passionate about developing a deep understanding of customers’ business needs and collaborating with engineers to design simple-to-use data products.

Access your existing data and resources through Amazon SageMaker Unified Studio, Part 1: AWS Glue Data Catalog and Amazon Redshift

Post Syndicated from Lakshmi Nair original https://aws.amazon.com/blogs/big-data/access-your-existing-data-and-resources-through-amazon-sagemaker-unified-studio-part-1-aws-glue-data-catalog-and-amazon-redshift/

Amazon SageMaker Unified Studio provides a unified environment for data, analytics, machine learning (ML), and AI workloads. Part of the next generation of Amazon SageMaker, SageMaker Unified Studio allows you to discover your data and put it to work using familiar AWS tools to complete end-to-end development workflows, including data analysis, data processing, model training, generative AI app development, and more, in a single governed environment. You can create or join projects to collaborate with your teams, share AI and analytics artifacts securely, and discover and use your data stored in different data sources through Amazon SageMaker Lakehouse.

This series of posts demonstrates how you can onboard and access existing AWS data sources using SageMaker Unified Studio. This post focuses on onboarding existing AWS Glue Data Catalog tables and database tables available in Amazon Redshift. Part 2 discusses using Amazon Simple Storage Service (Amazon S3), Amazon Relational Database Service (Amazon RDS), Amazon DynamoDB, and Amazon EMR.

Access your existing data and resources through Amazon SageMaker Unified Studio

This series primarily focuses on the UI experience. If you prefer script-based automation, refer to Bringing existing resources into Amazon SageMaker Unified Studio.

Access management with SageMaker Unified Studio

The SageMaker Unified Studio authorization model is a hierarchical access control list (ACL) based on the resource type such as a domain or a project. For example, at the domain level, a user might have a domain owner designation and at the project level, the user can be an owner or contributor. You can configure these profiles at AWS Identity and Access Management (IAM) user, single sign-on (SSO) user, and SSO group level.

Each project has a project role. When the user interacts with resources within SageMaker Unified Studio, it generates IAM session credentials based on the user’s effective profile in the specific project context, and then users can use tools such as Amazon Athena or Amazon Redshift to query the relevant data. The project owner can add or remove project members for their project, create publishing agreements with a domain, and publish assets to a domain.

SageMaker Unified Studio can be accessed by IAM users or SSO authenticated users, and IAM roles can interact with the SageMaker Unified Studio through its APIs.

Solution overview

AWS Lake Formation enables you to define fine-grained access control on the Data Catalog, where you can configure access at database, table, row, or column level or define permissions with tags. When setting up Lake Formation, you can configure it with hybrid access mode, where you get flexibility to selectively enable Lake Formation permissions for specific databases and tables, and continue using IAM permissions for others. SageMaker Unified Studio supports Lake Formation hybrid mode.

When you create a project in SageMaker Unified Studio, an AWS Glue database is added by default as part of the project. Assets published into that database don’t need any additional permissions, but if you want to publish or subscribe assets from an existing AWS Glue database, then you need to provide explicit permissions to SageMaker Unified Studio to be able to access the database and tables. For more details, see Configure Lake Formation permissions for Amazon SageMaker Unified Studio.

Let’s understand how we can access existing datasets through SageMaker Unified Studio.

Prerequisites

To run the instruction, you must complete the following prerequisites:

  • An AWS account
  • A SageMaker Unified Studio domain
  • A SageMaker Unified Studio project with All capabilities project profile

In the SageMaker Unified Studio, select the project and navigate to the Project overview page. Copy the Project role ARN as highlighted in the screenshot. This project role will be used further in the post to provide permissions on existing datasets and resources.

Use existing AWS Glue tables

This section has following prerequisites:

One extra prerequisite step is to revoke IAMAllowedPrincipals group permission on both database and table to enforce Lake Formation permission for access. For detailed instruction see Revoking permission using the Lake Formation console.

To access existing Data Catalog tables in SageMaker Unified Studio, complete the following steps:

  1. On the Lake Formation console using the data lake administrator, choose Data lake locations in the navigation pane and choose Register location.
  2. Enter the S3 prefix for Amazon S3 path.
  3. For IAM role, choose your Lake Formation data access IAM role, which is not a service linked role.
  4. Select Lake Formation for Permission mode and choose Register location.

  1. On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
  2. Select the existing Data Catalog database.
  3. From the Actions menu, choose Grant to grant permissions to the project role.

  1. For IAM users and roles, choose the project role.
  2. Select Named Data Catalog resources, and for Catalogs, choose the default catalog.
  3. For Databases, choose your existing Data Catalog database.

  1. For Database permissions, select Describe and choose Grant.

The next step is to grant the permission on the tables to the project role.

  1. On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
  2. Select the existing Data Catalog database.
  3. From the Actions menu, choose Grant to grant permissions to the project role.
  4. For IAM users and roles, choose the project role.
  5. Select Named Data Catalog resources, and for Catalogs, choose the default catalog.
  6. For Databases, choose your Data Catalog database.
  7. For Tables, select the tables that you need to provide permission to the project role.

  1. For Table permissions, select Select and Describe.
  2. For Grantable permissions, select Select and Describe.
  3. Choose Grant.

You should revoke any existing permissions of IAMAllowedPrincipals on the databases and tables within Lake Formation.

Now let’s verify that we can access the existing AWS Glue table from the SageMaker Unified Studio Query Editor.

  1. In SageMaker Unified Studio, navigate to your project.
  2. On the project page, under Lakehouse, choose Data.
  3. Next to the Data Catalog table, choose the options menu (three dots), and choose Query with Athena.

SageMaker Unified Studio provides a unified JupyterLab experience across different languages, including SQL, PySpark, and Scala Spark. It also supports unified access across different compute runtimes such as Amazon Redshift and Athena for SQL, Amazon EMR Serverless, Amazon EMR on EC2, and AWS Glue for Spark. To access the data through the unified JupyterLab experience, complete the following steps:

  1. On the SageMaker Unified Studio project page, on the top menu, choose Build, and under IDE & APPLICATIONS, choose
  2. Wait for the space to be ready.
  3. Choose the plus sign and for Notebook, choose Python 3.
  4. In the notebook, switch the connection type to PySpark, choose spark.fineGrained, and query the existing Data Catalog table:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df_sql = spark.sql("""
select * from retaildb.salesorders
""" )

df_sql.show()

Use existing Redshift clusters

This section has following prerequisites:

  • A Redshift cluster

To bring in existing Redshift clusters, follow these steps:

  1. To use your provisioned Redshift cluster or a Redshift Serverless workgroup, add either of the following tags (key/value) to the resource:
    1. Add AmazonDataZoneProject: <projectID> if you want to allow only a specific SageMaker Unified Studio project to access the Amazon Redshift resource. Replace <projectID> with the ID of the project created in SageMaker Unified Studio.
    2. Add for-use-with-all-datazone-projects: true if you want to allow all SageMaker Unified Studio projects to access the Amazon Redshift resource.

  1. To add the compute connection in SageMaker Unified Studio, you can authenticate the cluster using either the user name and password of the database, IAM credentials, or AWS Secrets Manager. To provide the authentication using Secrets Manager, add either of the following tags. This will enable the existing secret to appear on the dropdown menu, while defining the connection in SageMaker Unified Studio.
    1. AmazonDataZoneProject: <projectID>
    2. for-use-with-all-datazone-projects: true

In the following screenshot, you can see the tag configuration section within Secrets Manager settings for Redshift Serverless compute. To understand how to create a secret for a database in a Redshift cluster using Secrets Manager, refer to Managing Amazon Redshift admin passwords using AWS Secrets Manager.

  1. After the tags are applied, log in to SageMaker Unified Studio and choose the project.
  2. Go to the Compute section of your project, and on the Data warehouse tab, choose Add compute.
  3. Select Connect to existing compute resources.
  4. Choose the compute type: Amazon Redshift Provisioned cluster or Amazon Redshift Serverless.
  5. Configure the parameters by selecting the existing compute and authentication and choose Add compute.

The detailed walkthrough process is illustrated in the following screenshot.

Use Redshift tables with existing compute

This section has following prerequisites:

  • A Redshift table

In this section, we illustrate steps to create a federated connection for an existing Amazon Redshift data source. You can register an existing Redshift provisioned cluster as well as Redshift Serverless with the Data Catalog using SageMaker Unified Studio. This creates a federated multi-level catalog and provides the ability to centrally manage permissions to the data with fine-grained access control using Lake Formation. By mounting Amazon Redshift data in the Data Catalog, you can query it using your preferred tools such as Athena or AWS Glue extract, transform, and load (ETL) without having to copy or move the data.

Create an Amazon Redshift managed VPC endpoint for Amazon Redshift

Amazon Redshift managed virtual private cloud (VPC) endpoints use AWS PrivateLink to allow one VPC to privately access resources in another VPC as if they were local to the same VPC. With an Amazon Redshift managed VPC endpoint, you can connect to your private Redshift cluster with the RA3 instance type or Redshift Serverless within your VPC.

In this section, we explain how to create an Amazon Redshift managed VPC endpoint for both Redshift Serverless and an Amazon Redshift provisioned cluster in a single account. The managed VPC endpoint needs to be created only if your Redshift provisioned or Redshift Serverless cluster is in a different VPC than the SageMaker Unified Studio domain VPC.

If the SageMaker Unified Studio domain account is in a different account, allow the additional AWS accounts to create cluster endpoints. For steps to authorize your Amazon Redshift provisioned or Redshift Serverless cluster to deploy endpoints in additional accounts and grant access to the cross-account VPC, refer to Granting access to a VPC.

Redshift Serverless

For Redshift Serverless, follow these instructions.

The common practice is to allow port 5439 (Amazon Redshift connectivity port) to the security group or CIDR range in which your consumption workloads run.

  1. In the security group associated with the Redshift cluster, add an inbound rule with Type as Redshift, Protocol as TCP, Port range as 5439 (Amazon Redshift connectivity port), and Source as the CIDR range in which your consumption workloads run.

  1. On the Amazon Redshift console of the workgroup, go to Redshift-managed VPC endpoints.
  2. Choose Create endpoint.
  3. In the Endpoint settings section, choose the VPC, associated private subnet, and security group created for the SageMaker Unified Studio domain account to deploy the endpoint against.

The following screenshot shows the Amazon Redshift managed VPC endpoint created for Redshift Serverless.

Redshift provisioned

For Amazon Redshift provisioned, follow these instructions:

  1. To implement an Amazon Redshift managed VPC endpoint for a provisioned cluster, you need to enable cluster relocation and create subnet groups. In the cluster subnet group, choose the VPC and subnets of the SageMaker Unified Studio domain account.
  2. On the Amazon Redshift console, choose Configurations in the navigation pane.
  3. Provide the endpoint details, then choose Create endpoint.

Create a federated connection for Amazon Redshift

Complete the following steps to create a federated catalog in the Data Catalog to query the data using various preferred analytics tools such as Athena, visual ETL in SageMaker Unified Studio, Amazon EMR, and more:

  1. On the SageMaker Unified Studio console, choose your project.
  2. Choose Data in the navigation pane.
  3. In the data explorer, choose the plus sign to add a data source.
  4. Under Add a data source, choose Add connection, then choose Amazon Redshift.
  5. Enter the following parameters in the connection details, and choose Add data.
    1. Name: Enter the connection name.
    2. Host: Enter the Amazon Redshift managed VPC endpoint.
    3. Port: Enter the port number (Amazon Redshift uses 5439 as the default port).
    4. Database: Enter the database name.
    5. Authentication: Choose either the database user name and password credentials or Secrets Manager.

After the connection is established, you will see that the federated catalog is created, as shown in the following screenshot. This catalog uses the AWS Glue connection to connect to Amazon Redshift. The databases, tables, and views are automatically cataloged in the Data Catalog and registered with Lake Formation.

With Athena, data analysts can run federated SQL queries to scan data from multiple data sources in-place without creating complex data pipelines or data replication.

Use existing Data Catalog tables and Amazon Redshift assets in the SageMaker Unified Studio business data catalog

You can use the SageMaker Unified Studio business data catalog to catalog the data across your organization with business context. To use Amazon SageMaker Catalog, you must bring your existing data assets into the inventory of your project. Follow the instructions in this section to bring your existing Data Catalogs and Amazon Redshift assets into the project inventory.

Add an existing Data Catalog to the project inventory

To enrich the asset with business context and share your assets outside your own project, you must first bring the metadata to SageMaker Catalog. To import the metadata of the assets into the project’s inventory, you need to create a data source in the project catalog.

  1. In SageMaker Unified Studio, navigate to the Project catalog page within the project.
  2. Choose Data sources.
  3. Choose CREATE DATA SOURCE.
  4. For Name, provide the name of the data source.
  5. Choose AWS Glue (Lakehouse) for Data source type.
  6. For Data selection, choose the Database name and choose Next.
  7. Keep the rest as default and choose CREATE.
  8. Choose RUN to import the metadata.

After the data source successfully completes its run, metadata of all the data assets gets added to the project’s inventory.

Add existing Redshift tables and views to the project inventory

Create a data source to bring in the existing Redshift tables and views to add to the project’s inventory:

  1. In SageMaker Unified Studio, navigate to the Project catalog within the project.
  2. Choose Data sources.
  3. Choose CREATE DATA SOURCE.
  4. For Name, provide the name of the data source.
  5. Choose Amazon Redshift for Data source type.
  6. For Connection, choose the name of the Redshift connection.
  7. For Database name, choose dev and for Schema, enter public.
  8. Keep the rest as default and choose CREATE.
  9. Choose RUN to import the metadata.

After the data source successfully completes its run, metadata of all the data assets gets added to the project’s inventory.

Conclusion

This post explained how you can access existing data and resources available in the Data Catalog and Amazon Redshift using SageMaker Unified Studio. SageMaker Unified Studio provides an integrated environment for analytics and AI. Being able to access existing datasets available in your AWS account helps reduce operational overhead because users of your organization can access a common interface, collaborate, and share datasets. It also brings in efficiency for administrators because they can manage permissions for domains and projects in a common place.

In the next post, we will demonstrate how you can onboard and access other existing data sources such as Amazon S3, Amazon RDS, DynamoDB, and Amazon EMR.


About the Authors

Lakshmi Nair is a Senior Analytics Specialist Solutions Architect at AWS. She specializes in designing advanced analytics systems across industries. She focuses on crafting cloud-based data platforms, enabling real-time streaming, big data processing, and robust data governance. She can be reached via LinkedIn.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is also the author of the book Serverless ETL and Analytics with AWS Glue. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Sakti Mishra is a Principal Data and AI Solutions Architect at AWS, where he helps customers modernize their data architecture and define end-to end-data strategies, including data security, accessibility, governance, and more. He is also the author of Simplify Big Data Analytics with Amazon EMR and AWS Certified Data Engineer Study Guide. Outside of work, Sakti enjoys learning new technologies, watching movies, and visiting places with family. He can be reached via LinkedIn.

Daiyan Alamgir is a Principal Frontend Engineer on the Amazon SageMaker Unified Studio team based in New York.

Vipin Mohan is a Principal Product Manager at AWS, leading the launch of generative AI capabilities in Amazon SageMaker Unified Studio. He is committed to shaping impactful products by working backward from customer insights, championing user-focused solutions, and delivering scalable results.

Chanu Damarla is a Principal Product Manager on the Amazon SageMaker Unified Studio team. He works with customers around the globe to translate business and technical requirements into products that delight customers and enable them to be more productive with their data, analytics, and AI.

Access your existing data and resources through Amazon SageMaker Unified Studio, Part 2: Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon EMR

Post Syndicated from Lakshmi Nair original https://aws.amazon.com/blogs/big-data/access-your-existing-data-and-resources-through-amazon-sagemaker-unified-studio-part-2-amazon-s3-amazon-rds-amazon-dynamodb-and-amazon-emr/

Organizations often face the challenge of managing and analyzing data spread across multiple storage systems and databases while providing secure, efficient access for their data science teams. Amazon SageMaker Unified Studio addresses this challenge by providing a unified analytics and AI development environment where data scientists can access, analyze, and use data from various sources within a single, governed workspace, allowing teams to use their existing data infrastructure while taking advantage of advanced analytics and AI capabilities. SageMaker Unified Studio is part of the next generation of Amazon SageMaker, the center for all your data, analytics, and AI.

In Part 1 of this series, we explored how to access AWS Glue Data Catalog tables and Amazon Redshift resources through SageMaker Unified Studio. Continuing our journey, this post discusses integrating additional vital data sources such as Amazon Simple Storage Service (Amazon S3) buckets, Amazon Relational Database Service (Amazon RDS), Amazon DynamoDB, and Amazon EMR clusters. We demonstrate how to configure the necessary permissions, establish connections, and effectively use these resources within SageMaker Unified Studio. Whether you’re working with object storage, relational databases, NoSQL databases, or big data processing, this post can help you seamlessly incorporate your existing data infrastructure into your SageMaker Unified Studio workflows.

Access your existing data and resources through Amazon SageMaker Unified Studio

Solution overview

SageMaker Unified Studio seamlessly works with your existing data and resources through relevant permissions and network settings.

Let’s understand how we can access existing datasets across S3, RDS, DynamoDB, and EMR through SageMaker Unified Studio.

Prerequisites

To run the instruction, you must complete the following prerequisites:

  • An AWS account
  • A SageMaker Unified Studio domain
  • A SageMaker Unified Studio project with All capabilities project profile

In SageMaker Unified Studio, select the project and navigate to the Project overview page. Copy the Project role ARN as highlighted in the screenshot. This project role will be used further in the post to provide permissions on existing datasets and resources.

Use existing S3 buckets

This section has following prerequisites:

  • An S3 bucket

To use an existing S3 bucket in SageMaker Unified Studio, configure an S3 bucket policy that allows the appropriate actions for the project AWS Identity and Access Management (IAM) role.

The following is an example bucket policy. Replace <aws_accountid> with the AWS account ID where the domain resides, <s3_bucket> with the name of the S3 bucket that you intend to query in SageMaker Unified Studio, and <datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy> with the project role in SageMaker Unified Studio:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            },
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::<s3_bucket>",
            "Condition": {
                "ArnEquals": {
                    "aws:PrincipalArn": "arn:aws:iam::<aws_accountid>:role/<datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy>"
                }
            }
        },
        {
            "Sid": "Statement2",
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            },
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::<s3_bucket>/*",
            "Condition": {
                "ArnEquals": {
                    "aws:PrincipalArn": "arn:aws:iam::<aws_accountid>:role/<datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy>"
                }
            }
        }
    ]
}

After you configure the policy, log in to SageMaker Unified Studio and open the project.

Query the data using the JupyterLab IDE to perform analysis, as shown in the following screenshot.

Although the project role has been given appropriate permissions to access the S3 bucket in SageMaker Unified Studio, you will not able to list the contents of the bucket and show the S3 path in the data explorer section within SageMaker Unified Studio.

Use existing RDS DB instances

This section has following prerequisites:

  • A VPC and a private subnet
  • A RDS DB instance on the private subnet in the VPC

SageMaker Unified Studio uses the virtual private cloud (VPC) and subnets that are specified in the domain creation. If you have the data source like an RDS DB instance in a separate VPC, you can configure network reachability between the domain VPC and the data source VPC using VPC peering, AWS Transit Gateway, or a resource VPC endpoint, or alternatively you can create a new domain using the data source VPC.

Add a PostgreSQL connection

Complete the following steps to configure that reachability using VPC peering with Amazon Virtual Private Cloud (Amazon VPC):

  1. On the Amazon VPC console, choose Your VPCs, and make a note of the VPC ID of your VPC named SageMakerUnifiedStudioVPC.
  2. Choose Peering connections, and choose Create peering connection.
  3. Under Select another VPC to peer with, for VPC ID (Requester), choose the VPC ID noted earlier.
  4. Under Select another VPC to peer with, for VPC ID (Accepter), choose the VPC where the target RDS DB instance is located.
  5. Review your settings and choose Create peering connection.
  6. On the Peering connections page, select your peering connection.
  7. Under Actions, choose Accept request.
  8. Review the settings and choose Accept request.

Now you have configured the VPC peering connection. The next step is to configure the network route from the SageMaker Unified Studio VPC to the Amazon RDS VPC.

  1. On the Amazon VPC console, choose Route tables in the navigation pane.
  2. Choose the route table that is used in the private subnets of SageMakerUnifiedStudioVPC.
  3. Choose Edit routes.
  4. Choose Add route.
  5. For Destination, choose the VPC CIDR of the VPC where the RDS DB instance is located.
  6. For Target, choose Peering Connection, and choose the peering connection you created earlier.
  7. Choose Save changes.

Now you have configured the route table from the SageMaker Unified Studio VPC to the Amazon RDS VPC. The next step is to configure the opposite route.

  1. On the Amazon VPC console, choose Route tables in the navigation pane.
  2. Choose the route table that is used in the private subnets of the RDS DB instance.
  3. Choose Edit routes.
  4. Choose Add route.
  5. For Destination, choose the VPC CIDR of SageMakerUnifiedStudioVPC.
  6. For Target, choose Peering Connection, and choose the peering connection you created earlier.
  7. Choose Save changes.

Now you configure your RDS security group to accept traffic coming from SageMaker Unified Studio.

  1. On the Amazon RDS console, navigate to your RDS DB instance, and choose VPC security groups.
  2. Select your security group, and choose Inbound rules.
  3. Choose Edit inbound rules.
  4. Choose Add rule.
  5. For Type, choose Custom TPC.
  6. For Port range, enter your RDS port number.
  7. For Source, enter the VPC CIDR of SageMakerUnifiedStudioVPC.

Now you have network reachability required to use the existing RDS DB instance. The next step is to create a connection pointing to that RDS DB instance in SageMaker Unified Studio.

  1. Sign in to SageMaker Unified Studio and open your project.
  2. In your project, in the navigation pane, choose Data.
  3. Choose the plus sign, and for Add data source, choose Add connection.
  4. Select PostgreSQL.
  5. For Data source name, enter postgresql_source.
  6. For Host, enter the host name of your Aurora PostgreSQL database cluster.
  7. For Port, enter the port number of your Aurora PostgreSQL database cluster (by default, it’s 5432).
  8. For Database, enter your database name.
  9. For Authentication, select Username and password, and enter your user name and password.
  10. Choose Add data source.

You will need to wait for several minutes to complete this step.

Use a visual ETL flow to ingest data to Amazon RDS

In a visual extract, transform, and load (ETL) flow, you can use PostgreSQL as source and target. You can create a PostgreSQL target, and for Name, choose postgresql_source to ingest data into Amazon RDS.

  1. Choose the plus sign, and under Data sources, choose Amazon S3.
  2. Choose Amazon S3 for the source node, and enter following values:
    1. S3 URI: s3://aws-blogs-artifacts-public/artifacts/BDB-4798/data/venue.csv
    2. Format: CSV
    3. Sep: ,
    4. Multiline: Enabled
    5. Header: Disabled
    6. Leave the rest as default.
  3. Wait for the data preview to be available.
  4. Choose the plus sign to the right of Amazon S3 Under Transforms, choose Rename Columns.
  5. Choose the Rename Columns node, and choose Add new rename pair.
  6. For Current name and New name, enter following pairs:
    1. _c0: venueid
    2. _c1: venuename
    3. _c2: venuecity
    4. _c3: venuestate
    5. _c4: venueseats
  7. Choose the plus sign to the right of Rename Columns
  8. Under Targets, choose PostgreSQL, and enter following values:
    1. Name: postgresql_source
    2. Schema: public
    3. Table: venue

  1. Choose Save to project. You can optionally change the name and add a description.
  2. Choose Run. Optionally, you can change the compute parameters.

Wait for completion. Then the data has been successfully ingested.

Run an Athena query to explore the table on Amazon RDS

After you create a table on Amazon RDS, you can explore the table through a data explorer in SageMaker Unified Studio:

  1. On SageMaker Unified Studio, choose Data.
  2. Under Lakehouse, choose postgresql_source, public, and venue.
  3. On the options menu (three dots), choose Query with Athena.

You get records from the RDS table venue.

Use existing DynamoDB tables

This section has following prerequisites:

  • A DynamoDB table

To access existing DynamoDB tables, configure a resource-based policy that allows the appropriate actions for the project role:

  1. On the DynamoDB console, choose Tables in the navigation pane.
  2. Select your table.
  3. Choose the Permissions tab and choose Create table policy.

The following example policy allows connecting to DynamoDB tables as a federated source. Replace <aws_region> with your AWS Region, <aws_account_id> with the AWS account ID where DynamoDB is deployed, <dynamodb_table> with the DynamoDB table that you intend to query from SageMaker Unified Studio, and <datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy> with the project role in SageMaker Unified Studio:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": [
                "dynamodb:Query",
                "dynamodb:Scan",
                "dynamodb:DescribeTable",
                "dynamodb:PartiQLSelect",
                "dynamodb:BatchWriteItem"
            ],
            "Resource": "arn:aws:dynamodb:<aws_region>:<aws_accountid>:table/<dynamodb_table>",
            "Condition": {
                "ArnEquals": {
                    "aws:PrincipalArn": "arn:aws:iam::<aws_accountid>:role/<datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy>"
                }
            }
        }
    ]
}

After the policies are incorporated on the DynamoDB table, create an Amazon SageMaker Lakehouse connection within SageMaker Unified Studio:

  1. Choose Data in the navigation pane.
  2. In the data explorer, choose the plus sign to add a data source.
  3. Select Add connection and choose Next.
  4. Select Amazon DynamoDB and choose Next.
  5. For Name, enter a name, then choose Add data.

The following screenshot shows the detailed steps to create a federated DynamoDB connection in SageMaker Unified Studio. After the connection is established, you can query the data from the DynamoDB table with using the Athena query editor.

You can also use existing DynamoDB tables as part of the ETL process. In the following screenshot, we demonstrate this using a visual ETL flow.

Use existing EMR clusters

This section has following prerequisites:

  • An EMR on EC2 cluster

SageMaker Unified Studio enables you to create new compute or add existing compute resources to a project for submitting jobs. You can add existing Amazon EMR on EC2 clusters or add existing Amazon EMR Serverless applications to submit data analytics jobs. To add a new EMR Serverless application, an administrator must enable a blueprint for the project.

To add an existing EMR on EC2 cluster, complete the following steps:

  1. In SageMaker Unified Studio, navigate to the project for which you plan to add compute, then choose Compute in the navigation pane.
  2. Choose the Data processing
  3. To add an existing EMR on EC2 cluster, choose Add compute.
  4. Choose Connect to existing compute resources and choose Next.

  1. To specify the compute resources to choose from, choose EMR on EC2 cluster.

  1. The Add Compute dialog box requires you to have the correct permissions to access the EMR on EC2 cluster. You can choose Copy project information to copy the data; the admin will need to grant the data worker access. Send the information to your admin.
  2. After the account administrator has granted the data worker access, you can specify the Amazon Resource Names (ARNs) associated with the cluster. You must fill in the Access role ARN, EMR on EC2 cluster ARN, Instance profile role ARN, and Name
  3. After you configure these settings, choose Add compute.

Your EMR on EC2 instance will be added to your project.

After you have added a cluster to a project, you will be able to see the cluster on the Data processing tab of the Compute page. You can then view the cluster details by choosing the specific cluster.

In addition to adding existing compute resources, you have the option to create new compute resources, which allows you to create both EMR on EC2 cluster and EMR Serverless applications.

Conclusion

SageMaker Unified Studio enables you to integrate with multiple data sources, providing data scientists and analysts with a powerful, unified environment for their AI and analytics workflows. As demonstrated throughout this two-part series, you can seamlessly connect to and use data from the Data Catalog, Amazon Redshift, Amazon S3, Amazon RDS, DynamoDB, and Amazon EMR—while maintaining proper security controls and permissions. This flexibility alleviates the need for complex data movement operations and allows teams to focus on extracting insights from their data rather than managing infrastructure. By following the approaches outlined in these posts, organizations can maximize their existing data investments while taking advantage of the advanced capabilities of SageMaker Unified Studio for their data science and analytics needs.


About the Authors

Lakshmi Nair is a Senior Analytics Specialist Solutions Architect at AWS. She specializes in designing advanced analytics systems across industries. She focuses on crafting cloud-based data platforms, enabling real-time streaming, big data processing, and robust data governance. She can be reached via LinkedIn.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is also the author of the book Serverless ETL and Analytics with AWS Glue. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Sakti Mishra is a Principal Data and AI Solutions Architect at AWS, where he helps customers modernize their data architecture and define end-to end-data strategies, including data security, accessibility, governance, and more. He is also the author of Simplify Big Data Analytics with Amazon EMR and AWS Certified Data Engineer Study Guide. Outside of work, Sakti enjoys learning new technologies, watching movies, and visiting places with family. He can be reached via LinkedIn.

Daiyan Alamgir is a Principal Frontend Engineer on the Amazon SageMaker Unified Studio team based in New York.

Vipin Mohan is a Principal Product Manager at AWS, leading the launch of generative AI capabilities in Amazon SageMaker Unified Studio. He is committed to shaping impactful products by working backward from customer insights, championing user-focused solutions, and delivering scalable results.

Chanu Damarla is a Principal Product Manager on the Amazon SageMaker Unified Studio team. He works with customers around the globe to translate business and technical requirements into products that delight customers and enable them to be more productive with their data, analytics, and AI.

Streamline data discovery with precise technical identifier search in Amazon SageMaker Unified Studio

Post Syndicated from Ramesh H Singh original https://aws.amazon.com/blogs/big-data/streamline-data-discovery-with-precise-technical-identifier-search-in-amazon-sagemaker-unified-studio/

We’re excited to introduce a new enhancement to the search experience in Amazon SageMaker Catalog, part of the next generation of Amazon SageMaker—exact match search using technical identifiers. With this capability, you can now perform highly targeted searches for assets such as column names, table names, database names, and Amazon Redshift schema names by enclosing search terms in a qualifier such as double quotes (" "). This yields results with exact precision, dramatically improving the speed and accuracy of data discovery.

In this post, we demonstrate how to streamline data discovery with precise technical identifier search in Amazon SageMaker Unified Studio.

Solving real-world discovery challenges

In large, enterprise-scale environments, discovering the right dataset often hinges on pinpointing specific technical identifiers. Users frequently search for exact terms like "customer_id" or "sales_summary_2023" – but conventional keyword and semantic searches often return related results, instead of the exact match.

With the new qualified search capability, entering "customer_id" will surface only those assets whose technical name matches exactly—eliminating noise, saving time, and improving confidence in discovery. Whether you’re a data analyst seeking a specific metric or a data steward validating metadata compliance, this update delivers a more precise, governed, and intuitive search experience.

Built for complex, high-scale catalogs

This feature builds on existing keyword and semantic search capabilities in SageMaker Unified Studio and adds an important layer of control for customers managing complex data catalogs with intricate naming conventions. By reducing time spent filtering partial matches and improving the relevance of results, this enhancement streamlines workflows and helps maintain metadata quality across domains.

One such customer is NatWest, a global banking leader operating across thousands of assets:

“In our complex data ecosystem, discovering the right assets quickly is paramount. In a data-driven banking environment, the new exact and partial match search capabilities in SageMaker Unified Studio have been transformative. By enabling precise discovery of critical attributes like loan IDs and party IDs across thousands of data assets, we’ve dramatically accelerated insight generation while strengthening our metadata governance. This feature cuts through complexity, reduces search time, minimizes errors, and fosters unprecedented collaboration across our data engineering, analytics, and business teams.”

— Manish Mittal, Data Marketplace Engineering Lead, NatWest

Key benefits

With this new capability, SageMaker Catalog users can:

  • Quickly locate precise data assets – Search using known technical names—like "customer_id" or "revenue_code" – to immediately surface the right datasets without sifting through irrelevant results.
  • Reduce false positives and ambiguous matches – Alleviate confusion caused by keyword or semantic searches that return loosely matched results, improving trust in the search experience.
  • Accelerate productivity across data roles – Analysts, stewards, and engineers can find what they need faster—reducing delays in reporting, validation, and development cycles.
  • Strengthen governance and compliance – Surface and validate critical naming conventions and metadata standards (for example, columns prefixed with "pii_" or "audit_" will return all column names starting with pii or audit) to support policy enforcement and audit readiness.

Example use cases

This feature can help the following roles in different use cases:

  • Data analysts – A business analyst preparing a margin analysis report searches for "profit_margin" to locate the exact field across multiple sales datasets. This reduces time-to-insight and makes sure the right metric is used in reporting.
  • Data stewards – A governance lead searches for terms like "audit_log" or "classified_pii" to confirm that all required classifications and logging conventions are in place. This helps enforce data handling policies and validate catalog health.
  • Data engineers – A platform engineer performs a search for "temp_" or "backup_" to identify and clean up unused or legacy assets created during extract, transform, and load (ETL) workflows. This supports data hygiene and infrastructure cost optimization.

Solution demo

To demonstrate the exact match filter solution, we have ingested an individual asset loaded from the TPC-DS tables and also created data product bundling of assets.

The following screenshot shows an example of the data product.

The following screenshot shows an example of the individual assets.

Next, the data analyst wants to search all assets that have customer login details. The customer login is stored as the "c_login" field in the assets.

With the technical identifier feature, the data analyst directly searches the catalog with the identifier "c_login" to get the required results, as shown in the following screenshot.

The data analyst can verify that the login information is present in the returned result.

Conclusion

The addition of precise technical identifier search in SageMaker Unified Studio reinforces a step toward enhancing data discovery and usability in complex data ecosystems. By providing search capabilities based on technical identifiers, this feature addresses the needs of diverse stakeholders, enabling them to efficiently locate the assets they require.

As data continues to grow in scale and complexity, SageMaker Unified Studio remains committed to delivering features that simplify data management, improve productivity, and enable organizations to unlock actionable insights. Start using this enhanced search capability today and experience the difference it brings to your data discovery journey.

Refer to the product documentation to learn more about how to set up metadata rules for subscription and publishing workflows.


About the Authors

Ramesh H Singh is a Senior Product Manager Technical (External Services) at AWS in Seattle, Washington, currently with the Amazon SageMaker team. He is passionate about building high-performance ML/AI and analytics products that enable enterprise customers to achieve their critical goals using cutting-edge technology. Connect with him on LinkedIn.

Pradeep Misra PicPradeep Misra is a Principal Analytics Solutions Architect at AWS. He works across Amazon to architect and design modern distributed analytics and AI/ML platform solutions. He is passionate about solving customer challenges using data, analytics, and AI/ML. Outside of work, Pradeep likes exploring new places, trying new cuisines, and playing board games with his family. He also likes doing science experiments, building LEGOs and watching anime with his daughters.

Rajat Mathur is a Software Development Manager at AWS, leading the Amazon DataZone and SageMaker Unified Studio engineering teams. His team designs, builds, and operates services which make it faster and easier for customers to catalog, discover, share, and govern data. With deep expertise in building distributed data systems at scale, Rajat plays a key role in advancing AWS’s data analytics and AI/ML capabilities.

Jie Lan is a Software Engineer at AWS based in New York, where he works on the Amazon SageMaker team. He is passionate about developing cutting-edge solutions in the big data and AI space, helping customers leverage cloud technology to solve complex problems.

Connect, share, and query where your data sits using Amazon SageMaker Unified Studio

Post Syndicated from Lakshmi Nair original https://aws.amazon.com/blogs/big-data/connect-share-and-query-where-your-data-sits-using-amazon-sagemaker-unified-studio/

The ability for organizations to quickly analyze data across multiple sources is crucial for maintaining a competitive advantage. Imagine a scenario where the retail analytics team is trying to answer a simple question: Among customers who purchased summer jackets last season, which customers are likely to be interested in the new spring collection?

While the question is straightforward, getting the answer requires piecing together data across multiple data sources such as customer profiles stored in Amazon Simple Storage Service (Amazon S3) from customer relationship management (CRM) systems, historical purchase transactions in an Amazon Redshift data warehouse, and current product catalog information in Amazon DynamoDB. Traditionally, answering this question would involve multiple data exports, complex extract, transform, and load (ETL) processes, and careful data synchronization across systems.

In this blog post, we will demonstrate how business units can use Amazon SageMaker Unified Studio to discover, subscribe to, and analyze these distributed data assets. Through this unified query capability, you can create comprehensive insights into customer transaction patterns and purchase behavior for active products without the traditional barriers of data silos or the need to copy data between systems.

SageMaker Unified Studio provides a unified experience for using data, analytics, and AI capabilities. You can use familiar AWS services for model development, generative AI, data processing, and analytics—all within a single, governed environment. To strike a fine balance of democratizing data and AI access while maintaining strict compliance and regulatory standards, Amazon SageMaker Data and AI Governance is built into SageMaker Unified Studio. With Amazon SageMaker Catalog, teams can collaborate through projects, discover, and access approved data and models using semantic search with generative AI-created metadata, or you can use natural language to ask Amazon Q to find your data. Within SageMaker Unified Studio, organizations can implement a single, centralized permission model with fine-grained access controls, facilitating seamless data and AI asset sharing through streamlined publishing and subscription workflows. Teams can also query the data directly from sources such as Amazon S3 and Amazon Redshift, through Amazon SageMaker Lakehouse.

SageMaker Lakehouse streamlines connecting to, cataloging, and managing permissions on data from multiple sources. Built on AWS Glue Data Catalog and AWS Lake Formation, it organizes data through catalogs that can be accessed through an open, Apache Iceberg REST API to help ensure secure access to data with consistent, fine-grained access controls. SageMaker Lakehouse organizes data access through two types of catalogs: federated catalogs and managed catalogs (shown in the following figure). A catalog is a logical container that organizes objects from a data store, such as schemas, tables, views, or materialized views such as from Amazon Redshift. You can also create nested catalogs to mirror the hierarchical structure of your data sources within SageMaker Lakehouse.

  • Federated catalogs: Through SageMaker Unified Studio, you can create connections to external data sources such as Amazon DynamoDB. See Data connections in Amazon SageMaker Lakehouse for all the supported external data sources. These connections are stored in the AWS Glue Data Catalog (Data Catalog) and registered with Lake Formation, allowing you to create a federated catalog for each available data source.
  • Managed catalogs: A managed catalog refers to the data that resides on Amazon S3 or Redshift Managed Storage (RMS).

The existing Data Catalog becomes the Default catalog (identified by the AWS account number) and is readily available in SageMaker Lakehouse.

If the business units don’t have a data warehouse but need the benefits of one—such as a query result cache and query rewrite optimizations—then, they can create an RMS managed catalog in SageMaker Unified Studio. This is a SageMaker Lakehouse managed catalog backed by RMS storage. The table metadata is managed by Data Catalog. When you create an RMS managed catalog, it deploys an Amazon Redshift managed serverless workgroup. Users can write data to managed RMS tables using Iceberg APIs, Amazon Redshift, or Zero-ETL ingestion from supported data sources.

Functional working model

In SageMaker Unified Studio, the infrastructure team will enable the blueprints and configure the project profiles for tools and technologies to the respective business units to build and monitor their pipelines. They will also onboard the teams to SageMaker Unified Studio, enabling them to build the data products in a single integrated, governed environment. To enforce standardization within the organization, the central governance team can also create hierarchical representations of business units through domain units and dictate certain actions that these teams can perform under a domain unit. Global policies such as data dictionaries (business glossaries), data classification tags, and additional information with metadata forms can be created by the governance team to ensure standardization and consistency within the organization.

Individual business units will use these project profiles based on their needs to process the data using the authorized tool of their choice and create data products. Business units can enjoy the full flexibility to process and consume the data without worrying about the maintenance of the underlying infrastructure. Depending on the nature of the workloads, business units can choose a storage solution that best fits their use case. You can use SageMaker Lakehouse to unify the data across different data sources.

To share the data outside the business unit, the teams will publish the metadata of their data to a SageMaker catalog and make it discoverable and accessible to other business units. Amazon SageMaker Catalog serves as a central repository hub to store both technical and business catalog information of the data product. To establish trust between the data producers and data consumers, SageMaker Catalog also integrates the data quality metrics and data lineage events to track and drive transparency in data pipelines. While sharing the data, data producers of these business units can apply fine grained access control permissions at row and column level to these assets during subscription approval workflows. SageMaker Unified Studio automatically grants subscription access to the subscribed data assets after the subscription request is approved by the data producer. As shown in the following figure, the data sharing capability highlights that the data remains at its origin with the data producer, while consumers from other business units can consume and analyze it using their own compute resources. This approach eliminates any data duplication or data movement.

Solution overview

In this post, we explore two scenarios for sharing data between different teams (retail, marketing, and data analysts). The solution in this post gives you the implementation for a single account use case.

Scenario 1

The retail team needs to create a comprehensive view of customer behavior to optimize their spring collection launch. Their data landscape is diverse:

  • Customer profiles stored in Amazon S3 (default Data Catalog)
  • Historical purchase transactions stored in RMS (SageMaker Lakehouse managed RMS catalog)
  • Inventory information of the product in DynamoDB. (federated catalog)

The team needs to share this unified view with their regional data analysts while maintaining strict data governance protocols. Data analysts discover the data and subscribe to the data. We will also walk through the publishing and subscription workflow as part of the data sharing process. To get a unified view of the customer sales transactions for active products, the data analysts will use Amazon Athena.

Here are the high level steps of the solution implementation as shown in the preceding diagram:

  1. In this post, we take an example of two teams who participate in the collaboration. The retail team has created a project retailsales-sql-project and the data analysts team has created a project dataanalyst-sql-project within SageMaker Unified Studio.
  2. The retail team creates and stores their data in various sources:
    1. customer data in Amazon S3 (contains customer data)
    2. inventory data in a DynamoDB table (contains product catalog information)
    3. store_sales_lakehouse in SageMaker Lakehouse managed RMS (contains purchase history)
  3. The retail team publishes the assets to the project catalog to make them discoverable to other domain members within the organization.
  4. The data analysts team discovers the data and subscribes to the data assets.
  5. An incoming request is sent to the retail team, who then approves the subscription request. After the subscription is approved, data analysts use Athena to create a unified query from all the subscribed data assets to get insights into the data.

In this scenario, we will review how SageMaker Catalog manages the subscription grants to Data Catalog assets (both federated and managed).

For this scenario, we assume that the retail team doesn’t have their own data warehouse and they want to create and manage Amazon Redshift tables using Data Catalog.

Scenario 2

The marketing team needs access to transaction data for campaign optimization. They have campaign performance data stored in an Amazon Redshift data warehouse. However, to have improved campaign ROI and better resource allocation, they need data from the retail team to understand actual customer purchase behavior. To improve the campaign ROI, they need answers to crucial questions such as:

  • What is the true conversion rate across different customer segments?
  • Which customers should be targeted for upcoming promotions?
  • How do seasonal buying patterns affect campaign success?

Here the retail team shares the purchase history data store_sales to the marketing team. In this scenario, shown in the preceding figure, we assume that the retail team has their own data warehouse and uses Amazon Redshift to store the purchase history data.

The high level steps of the solution implementation for this scenario are:

  1. The marketing team has created the project marketing-sql-project within SageMaker Unified Studio.
  2. The retail team has store_sales in Amazon Redshift data warehouse (contains purchase history)
  3. The retail team has published the assets to the project catalog
  4. The marketing team discovers the data and subscribes to the data assets.
  5. An incoming request is sent to the retail team, who then approves the subscription request. After the subscription is approved, the marketing team uses Amazon Redshift to consume the purchase history and identify high-value customer segments.

In this scenario, we will review the process of how SageMaker Catalog grants access to managed Amazon Redshift assets.

Prerequisites

To follow the step by step guide, you must complete the following prerequisites:

Note that the default SQL analytics project profile provides you with a RedshiftServerless blueprint. However, in this post, we want to showcase the data sharing capabilities of different types of SageMaker Lakehouse catalogs (managed and federated).

For the simplicity, we chose the SQL analytics project profile. However, you can also test this by using the Custom project profile by selecting specific blueprints such as LakehouseCatalog and LakeHouseDatabase for scenarios where the business unit doesn’t have their own data warehouse.

Solution walkthrough (Scenario 1)

The first step focuses on preparing the data for each data source for unified access.

Data preparation

In this section, you will create the following data sets:

  • customer data in Amazon S3 (default Data Catalog)
  • inventory data in a DynamoDB table (federated catalog)
  • store_sales_lakehouse in SageMaker Lakehouse managed RMS (managed catalog)
  1. Sign in to SageMaker Unified Studio as a member of the retail team and select the project retailsales-sql-project.
  2. On the top menu, choose Build, and under DATA ANALYSIS & INTEGRATION, select Query Editor.

  1. Select the following options:
    1. Under CONNECTIONS, select Athena (Lakehouse).
    2. Under CATALOGS, select AwsDataCatalog.
    3. Under DATABASES, select glue_db_<environmentid> or the customer glue database name you provided during project creation.
    4. After the options are selected, choose Choose.

When users select a project profile within SageMaker Unified Studio, the system automatically triggers the relevant AWS CloudFormation stack (DataZone-Env-<environmentid>) and deploys the necessary infrastructure resources in the form of environments. Environments are the actual data infrastructure behind a project.

  1. Run the following SQL:
CREATE TABLE customer AS
SELECT 13251813 cust_id,'Joyce Deaton'   cust_name,'Greece'   cust_country, '[email protected]'   cust_email
UNION
SELECT 1581546  ,'Daniel Dow'  ,'India'  , '[email protected]'  
UNION
SELECT 1581536  ,'Marie Lange'  ,'Canada'  , '[email protected]'  
UNION
SELECT 1827661  ,'Wesley Harris'  ,'Rome'  , '[email protected]'  
UNION
SELECT 1581536  ,'Alexander Salyer'  ,'Germany'  , '[email protected]'  
UNION
SELECT 3581536  ,'Jerry Tracy'  ,'Swiss'  , '[email protected]' 
  1. After the SQL is executed, you will find that the customer table has been created in the Lakehouse section under Lakehouse/AwsDataCatalog/glue_db_<environmentid>.

  1. The product catalog is stored in DynamoDB. You can create a new table named inventory in DynamoDB with partition key prod_id through AWS CloudShell with the following command:
aws dynamodb create-table \
    --table-name inventory\
    --attribute-definitions \
AttributeName=prod_id,AttributeType=N \
    --key-schema \
AttributeName=prod_id,KeyType=HASH \
    --provisioned-throughput \
ReadCapacityUnits=5,WriteCapacityUnits=5 \
    --table-class STANDARD
  1. Populate the DynamoDB table using the following commands:
aws dynamodb put-item --table-name inventory --item '{"prod_id": {"N": "1"}, "prod_name": {"S": "Widget A"},"active": {"S": "Y"}}' 

aws dynamodb put-item --table-name inventory --item '{"prod_id": {"N": "2"}, "prod_name": {"S": "Gadget B"},"active": {"S": "Y"}}'

aws dynamodb put-item --table-name inventory --item '{"prod_id": {"N": "3"}, "prod_name": {"S": "Item C"},"active": {"S": "N"}}' 
  1. To use the DynamoDB table in SageMaker Unified Studio, you need to configure a resource-based policy that allows the appropriate actions for the project role.
    1. To create the resource-based policy, navigate to the DynamoDB console and choose Tables from the navigation pane.
    2. Select the Permissions table and choose Create table policy.

  1. The following is an example policy that allows connecting to DynamoDB tables as a federated source. Replace the <aws_region> with the Region you are working on, <aws_account_id> with the AWS Account ID where DynamoDB is deployed, <dynamodb_table> with the DynamoDB table (in this case inventory) that you intend to query from Amazon SageMaker Unified Studio and <datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy> with the Project role Amazon Resource Name (ARN) in SageMaker Unified Studio portal. You can get the project role ARN by navigating to the project in SageMaker Unified Studio and then to Project overview.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": [
                "dynamodb:Query",
                "dynamodb:Scan",
                "dynamodb:DescribeTable",
                "dynamodb:PartiQLSelect",
                "dynamodb:BatchWriteItem"
            ],
            "Resource": "arn:aws:dynamodb:<aws_region>:<aws_accountid>:table/<dynamodb_table>",
            "Condition": {
                "ArnEquals": {
                    "aws:PrincipalArn": "arn:aws:iam::<aws_accountid>:role/<datazone_usr_role_xxxxxxxxxxxxxx_yyyyyyyyyyyyyy>"
                }
            }
        }
    ]
}

After the policies are incorporated on the DynamoDB table, create an SageMaker Lakehouse connection within SageMaker Unified Studio. As shown in the example, dynamodb-connection-catalogs is created.

  1. After the connection is successfully established, you will see the DynamoDB table inventory under Lakehouse.

The next step is to create a managed catalog for RMS objects using SageMaker Lakehouse.

  1. Choose Data in the navigation pane.
  2. In the data explorer, choose the plus icon to add a data source.
  3. Select Create Lakehouse catalog.
  4. Choose Next.

  1. Enter the name of the catalog. The catalog name provided in the example is redshift-lakehouse-connection-catalogs. Choose Add data.

  1. After the connection is created, you will see the catalog under Lakehouse.

  1. This creates a managed Amazon Redshift Serverless workgroup in your AWS account. You will see a new database dev@<redshift-catalog-name> in the managed Amazon Redshift Serverless workgroup.
    1. On the top menu, choose Build, and under DATA ANALYSIS & INTEGRATION, select Query Editor.
    2. Select Redshift (Lakehouse) from CONNECTIONSdev@<redshift-catalog-name> from DATABASES and public from SCHEMAS

  1. Run the following SQL in order. The SQL creates the store_sales_lakehouse table in the dev database in the public schema. The retail team inserts data into the store_sales_lakehouse table.
CREATE TABLE public.store_sales_lakehouse (
    sale_id INTEGER IDENTITY(1,1) PRIMARY KEY,
    cust_id INTEGER NOT NULL,
    sale_date DATE NOT NULL,
    sale_amount DECIMAL(10, 2) NOT NULL,
    prod_id INTEGER  NOT NULL,
    last_purchase_date DATE
);
INSERT INTO public.store_sales_lakehouse (cust_id, sale_date, sale_amount, prod_id, last_purchase_date)
VALUES
(13251813, '2023-01-15', 150.00, 1, '2023-01-15'),
(29033279, '2023-01-20', 200.00, 4, '2023-01-20'),
(12755125, '2023-02-01', 75.50, 3, '2023-02-01'),
(26009249, '2023-02-10', 300.00, 2, '2023-02-10'),
(3270685, '2023-02-15', 125.00, 2, '2023-02-15'),
(6520539, '2023-03-01', 100.00, 2, '2023-03-01'),
(10251183, '2023-03-10', 250.00, 1, '2023-03-10'),
(10251283, '2023-03-15', 180.00, 1, '2023-03-15'),
(10251383, '2023-04-01', 90.00, 2, '2023-04-01'),
(10251483, '2023-04-10', 220.00, 3, '2023-04-10'),
(10251583, '2023-04-15', 175.00, 3, '2023-04-15'),
(10251683, '2023-05-01', 130.00, 1, '2023-05-01'),
(10251783, '2023-05-10', 280.00, 1, '2023-05-10'),
(10251883, '2023-05-15', 195.00, 4, '2023-05-15'),
(10251983, '2023-06-01', 110.00, 2, '2023-06-01'),
(10251083, '2023-06-10', 270.00, 1, '2023-06-10'),
(10252783, '2023-06-15', 185.00, 2, '2023-06-15'),
(10253783, '2023-07-01', 95.00, 3, '2023-07-01'),
(10254783, '2023-07-10', 240.00, 1, '2023-07-10'),
(10255783, '2023-07-15', 160.00, 3, '2023-07-15');
  1. On successful creation of the table, you should now be able to query the data. Select the table store_sales_lakehouse and select Query with Redshift.

Import assets to the project catalog from various data sources

To share your assets outside your own project to other business units, you must first bring your metadata to SageMaker Catalog. To import the assets into the project’s inventory, you need to create a data source in the project catalog. In this section, we show you how to import the technical metadata from AWS Glue data catalogs. Here, you will import data assets from various sources that you have created as part of your data preparation.

  1. Sign in to SageMaker Unified Studio as a member of the retail team. Select the project retailsales-sql-project, under Project catalog. Choose Data sources and import the assets by choosing Run.

  1. To import the federated catalog, create a new data source and choose Run. This will import the metadata of the inventory data from DynamoDB table.

  1. After successful run of all the data sources, choose Assets under Project catalog in the navigation plane. You will find all the assets in the Inventory of Project catalog.

Publish the assets

To make the assets discoverable to the data analysts team, the retail team must publish their assets.

  1. In the project retailsales-sql-project, choose Project catalog and select Assets.
  2. Select each asset in the INVENTORY tab, enrich the asset with the automated metadata generation and PUBLISH ASSET.

Discover the assets

SageMaker Catalog within SageMaker Unified Studio enables efficient data asset discovery and access management. The data analysts team signs in to SageMaker Unified Studio and selects the project dataanalyst-sql-project. The data analysts team then locates the desired assets in SageMaker Catalog and initiates the subscription request.

In this section, members of dataanalyst-sql-project browse the catalog and find the assets. There are multiple ways to find the desired assets.

  • Sign in to SageMaker Unified Studio as a member of the data analysts team. Choose Discover in the top navigation bar and select Catalog. Find the desired asset by browsing or entering the name of the asset into the search bar.
  • Search for the asset through a conversational interface using Amazon Q.
  • Use the faceted filter search by selecting the desired project in the BROWSE CATALOG.

The data analysts team selects the project retailsales-sql-project.

Subscribe to the assets

The data analysts team submits a subscription request with an appropriate justification for each of these assets.

  1. For each asset, choose SUBSCRIBE.
  2. Select dataanalyst-sql-project in Project.
  3. Provide the Reason for request as “need this data for analysis”.

Note that during the subscription process, the requester sees a message that the asset access control and fulfillment will be Managed. This means that SageMaker Unified Studio automatically manages subscription access grants and permissions for these assets.

Subscription approval workflow

To approve the subscription request, you must be a member of the retail team and select the project that has published the asset.

  1. Sign in to SageMaker Unified Studio as a member of the retail team and select the project retailsales-sql-project.
  2. In the navigation pane, choose Project catalog and then select Subscription requests.
  3. In INCOMING REQUESTS, choose the REQUESTED tab and select View request for each asset to see detailed information of the subscription request.

  • REQUEST DETAILS provides information about the subscribing project, the requestor, and the justification to access the asset.
  • RESPONSE DETAILS provides an option to approve the subscription with full access to the data (Full access) or restricted access to the data (Approve with row or column filters). With restricted access to data, the subscription approval workflow process offers granular access control for sensitive data through row-level filtering and column-level filtering. Using row filters, approvers can restrict access to specific records based on defined criteria. Using column filters, approvers can control access to specific columns within the data sets. This allows excluding sensitive fields while sharing the relevant data. Approvers can implement these filters during the approval process, helping to ensure that the data access aligns with the organization’s security requirements and compliance policies. For this post, select Full access in the RESPONSE DETAILS
  • (Optional) Decision comment is where you can add a comment about accepting or rejecting the subscription request.
  • Choose APPROVE.

  1. Repeat the subscription approval workflow process for all the requested assets.
  2. After all the subscription requests are approved, choose the APPROVED tab to view all the approved assets.

Subscription fulfillment methods

After subscription approval, a fulfillment process manages access to the assets. SageMaker Unified Studio provides fulfillment methods for managed assets and unmanaged assets.

  • Managed assets: SageMaker Unified Studio automatically manages the fulfillment and permissions for assets such as AWS Glue tables and Amazon Redshift tables and views.
  • Unmanaged assets: For unmanaged assets, permissions are handled externally. SageMaker Unified Studio publishes standard events for actions such as approvals through Amazon EventBridge, enabling integration with other AWS services or third-party solutions for custom integrations.

In this scenario 1, because the assets are Data Catalogs, SageMaker Unified Studio grants and manages access to these managed assets on your behalf through Lake Formation. See the SageMaker Unified Studio subscription workflow for updates on sharing options.

Analyze the data

The data analysts team uses the subscribed data assets from varied sources to get unified insights.

  1. As a data analyst, sign in to SageMaker Unified Studio and select the project dataanalyst-sql-project. In the navigation pane, choose Project catalog and select Assets.
  2. Choose the SUBSCRIBED tab to find all the subscribed assets from the retailsales-sql-project.
  3. The status under each asset is Asset accessible. This indicates that the subscription grants are fulfilled and the data analysts team can now consume the assets with the compute of their choice.

Query using Athena (subscription grants fulfilled using Lake Formation)

As a member of the data analysts team, create a unified view to get purchase history with customer information for active products.

  1. In the dataanalyst-sql-project project, go to Build and select Query Editor.
  2. Use the following sample query to get the required information. Replace glue_db_<environmentid> with your subscribed glue database.
select * from "redshift-lakehouse-connection-catalogs/dev"."public"."store_sales_lakehouse" sales 
 left  join "awsdatacatalog"."glue_db_<environmentid>"."customer" customer
 on sales.cust_id=customer.cust_id
 inner  join "dynamodb-connection-catalogs"."default"."inventory" inventory
 on sales.prod_id = inventory.prod_id
 where inventory.active ='Y'

Solution walk-through (Scenario 2)

In this scenario, we assume that the retail team stores the purchase history data in their Amazon Redshift data warehouse. Because you’re using the default SQL analytics project profile to create the project, you will use a Redshift Serverless compute (project.redshift). The purchase history data is shared with the marketing team for enhanced campaign performance.

  1. Sign in to SageMaker Unified Studio as a member of the retail team and select the project retailsales-sql-project.
  2. On the top menu, choose Build, and under DATA ANALYSIS & INTEGRATION, select Query Editor
  3. Select the following options:
    • Under CONNECTIONS, select Redshift(Lakehouse).
    • Under CATALOGS, select dev.
    • Under DATABASES, select public.
  4. Run the following SQL:
CREATE TABLE public.store_sales (
sale_id INTEGER IDENTITY(1,1) PRIMARY KEY,
cust_id INTEGER NOT NULL,
sale_date DATE NOT NULL,
sale_amount DECIMAL(10, 2) NOT NULL,
prod_id INTEGER  NOT NULL,
last_purchase_date DATE
);
INSERT INTO public.store_sales (cust_id, sale_date, sale_amount, prod_id, last_purchase_date)
VALUES
(13251813, '2023-01-15', 150.00, 1, '2023-01-15'),
(29033279, '2023-01-20', 200.00, 4, '2023-01-20'),
(12755125, '2023-02-01', 75.50, 3, '2023-02-01'),
(26009249, '2023-02-10', 300.00, 2, '2023-02-10'),
(3270685, '2023-02-15', 125.00, 2, '2023-02-15'),
(6520539, '2023-03-01', 100.00, 2, '2023-03-01'),
(10251183, '2023-03-10', 250.00, 1, '2023-03-10'),
(10251283, '2023-03-15', 180.00, 1, '2023-03-15'),
(10251383, '2023-04-01', 90.00, 2, '2023-04-01'),
(10251483, '2023-04-10', 220.00, 3, '2023-04-10'),
(10251583, '2023-04-15', 175.00, 3, '2023-04-15'),
(10251683, '2023-05-01', 130.00, 1, '2023-05-01'),
(10251783, '2023-05-10', 280.00, 1, '2023-05-10'),
(10251883, '2023-05-15', 195.00, 4, '2023-05-15'),
(10251983, '2023-06-01', 110.00, 2, '2023-06-01'),
(10251083, '2023-06-10', 270.00, 1, '2023-06-10'),
(10252783, '2023-06-15', 185.00, 2, '2023-06-15'),
(10253783, '2023-07-01', 95.00, 3, '2023-07-01'),
(10254783, '2023-07-10', 240.00, 1, '2023-07-10'),
(10255783, '2023-07-15', 160.00, 3, '2023-07-15');

5. On successful execution of the query, you will see store_sales under Redshift in the navigation pane.

Import the asset to the project catalog inventory

To share your assets outside your own project to other marketing business units, you must first share your metadata to SageMaker Catalog. To import the assets into the project’s inventory, you need to run the data source in the project catalog.

In the project retailsales-sql-project, under Project catalog, select Data sources and import the asset store-sales. Select the highlighted data source and choose Run as shown in the screenshot.

Publish the asset

To make the assets discoverable to the marketing team, the retail team must publish their asset.

  1. Go to the navigation pane and choose Project catalog, and then select Assets.
  2. Select store-sales in the INVENTORY tab, enrich the asset with the automated metadata generation and PUBLISH ASSET as illustrated in the screenshot.

Discover and subscribe the asset

The marketing team discovers and subscribes to the store-sales asset.

  1. Sign in to SageMaker Unified Studio as a member of the marketing team and select marketing-sql-project.
  2. Navigate to the Discover menu in the top navigation bar and choose Catalog. Find the desired asset by browsing or entering the name of the asset into the search bar.
  3. Select the asset and choose SUBSCRIBE.
  4. Enter a justification in Reason for request and choose REQUEST.

Subscription approval workflow

The retail team gets an incoming request in their project to approve the subscription request.

  1. Sign in to the SageMaker Unified Studio and select the project retailsales-sql-project as a member of the retail team. Under Project catalog, select Subscription requests.
  2. In the INCOMING REQUESTS, under the REQUESTED tab, select View request for store-sales.

  1. You will see detailed information for the subscription request.
  2. Select Full access in the RESPONSE DETAILS and choose APPROVE.

Analyze the data

Sign in to SageMaker Unified Studio as a member of the marketing team and select marketing-sql-project.

  1. In the Project catalog, select Assets and choose the SUBSCRIBED tab to find all the subscribed assets from the retailsales-sql-project.
  2. Notice the status under the asset marked as Asset accessible. This indicates that the subscription grants are fulfilled and the marketing team can now consume the asset with the compute of their choice.

Query using Amazon Redshift (subscription grants fulfilled using native Amazon Redshift data sharing)

To query the shared data with Amazon Redshift compute, select Build and then Query Editor. Select the following options

  1. Under CONNECTIONS, select Redshift(Lakehouse).
  2. Under CATALOGS, select dev.
  3. Under DATABASES, select project.
select * from "dev"."project"."store_sales" sales  

When a subscription to an Amazon Redshift table or view is approved, SageMaker Unified Studio automatically adds the subscribed asset to the consumer’s Amazon Redshift Serverless workgroup for the project. Notice the subscribed asset is shared under the folder project. In the Redshift navigation pane, you can also see the datashare created between the source and the target cluster. In this case, because the data is shared in the same account but between different clusters, SageMaker Unified Studio creates a view in the target database and permissions are granted on the view. See Grant access to managed Amazon Redshift assets in Amazon SageMaker Unified Studio for information about data sharing options within Amazon Redshift.

Clean up

Make sure you remove the SageMaker Unified Studio resources to avoid any unexpected costs. Start by deleting the connections, catalogs, underlying data sources, projects, databases, and domain that you created for this post. For additional details, see the Amazon SageMaker Unified Studio Administrator Guide.

Conclusion

In this post, we explored two distinct approaches to data sharing and analytics.

Business units without an existing data warehouse can use a SageMaker Lakehouse managed RMS catalog. In the first scenario, we showcased subscription fulfillment of AWS Glue Data Catalogs using AWS Lake Formation for federated and managed catalogs. The data analysts team was able to connect and subscribe to the data shared by the retail team that resided in Amazon S3, Amazon Redshift, and other data sources such as DynamoDB through SageMaker Lakehouse.

In the second scenario, we demonstrated the native data-sharing capabilities of Amazon Redshift. In this scenario, we assume that the retail team has sales transactions stored in an Amazon Redshift data warehouse. Using the data sharing feature of Amazon Redshift, the asset was shared to the marketing team using Amazon SageMaker Unified Studio.

Both approaches enable unified querying across varied data sources with teams able to efficiently discover, publish, and subscribe to data assets while maintaining strict access controls through Amazon SageMaker Data and AI Governance. Subscription fulfillment is automated, reducing the administrative overhead. Using the query-in-place approach eliminates data redundancy and maintains data consistency while allowing unified analysis across data sources through a single integrated experience.

To learn more, see the Amazon SageMaker Unified Studio Administrator Guide and the following resources:


About the authors

Lakshmi Nair is a Senior Analytics Specialist Solutions Architect at AWS. She specializes in designing advanced analytics systems across industries. She focuses on crafting cloud-based data platforms, enabling real-time streaming, big data processing, and robust data governance. She can be reached through LinkedIn

Ramkumar Nottath is a Principal Solutions Architect at AWS focusing on Analytics services. He enjoys working with various customers to help them build scalable, reliable big data and analytics solutions. His interests extend to various technologies such as analytics, data warehousing, streaming, data governance, and machine learning. He loves spending time with his family and friends. 

AWS Weekly Roundup: AWS Pi Day, Amazon Bedrock multi-agent collaboration, Amazon SageMaker Unified Studio, Amazon S3 Tables, and more

Post Syndicated from Prasad Rao original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-aws-pi-day-amazon-bedrock-multi-agent-collaboration-amazon-sagemaker-unified-studio-amazon-s3-tables-and-more/

Thanks to everyone who joined us for the fifth annual AWS Pi Day on March 14. Since its inception in 2021, commemorating the Amazon Simple Storage Service (Amazon S3) 15th anniversary, AWS Pi Day has grown into a flagship event highlighting the transformative power of cloud technologies in data management, analytics, and AI.

This year’s virtual event featured in-depth discussions with Amazon Web Services (AWS) product teams showcasing our continued innovation in helping customers build robust data foundations for analytics and AI workloads.

Missed the live event? You can still access all content on-demand at the event page. Whether you’re developing data lakehouses, training AI models, creating generative AI applications, or optimizing analytics workloads, the shared insights will help you maximize the value of your data.

Last week’s launches
Here are some launches that got my attention during the previous week.

Amazon Bedrock now supports multi-agent collaboration – With the availability of multi-agent collaboration in Amazon Bedrock, you can create networks of specialized agents that communicate and coordinate under the guidance of a supervisor agent. You can build, deploy, and manage networks of AI agents that work together to execute complex, multi-step workflows efficiently.

Availability of fully managed DeepSeek-R1 model in Amazon Bedrock – AWS is the first cloud service provider (CSP) to deliver DeepSeek-R1 as a fully managed, generally available model. Use the capabilities of DeepSeek-R1 for your generative AI applications with a single API through this fully managed service in Amazon Bedrock.

Amazon SageMaker Unified Studio is now generally available – You can now use Amazon SageMaker Unified Studio as your single data and AI development environment, where you can find and access all of your organization’s data and work using the best tools for your specific needs. With the new simplified permissions management, you can easily bring your existing AWS resources into the unified studio. You’ll be able to find, access, and query your organization’s data and AI assets while collaborating with your team to securely build and share your analytics and AI artifacts—from data and models to generative AI applications.

Amazon Bedrock’s capabilities now generally available within Amazon SageMaker Unified Studio – SageMaker Unified Studio brings selected capabilities from Amazon Bedrock into SageMaker. You can now rapidly prototype, customize, and share generative AI applications using foundation models (FMs) and advanced features such as Amazon Bedrock Knowledge BasesAmazon Bedrock GuardrailsAmazon Bedrock Agents, and Amazon Bedrock Flows to create tailored solutions aligned with your requirements and responsible AI guidelines all within SageMaker.

Amazon S3 Tables integration with Amazon SageMaker Lakehouse is now generally availableAmazon S3 Tables now seamlessly integrate with Amazon SageMaker Lakehouse, making it easy for you to query and join S3 Tables with data in S3 data lakes, Amazon Redshift data warehouses, and third-party data sources. S3 Tables deliver the first cloud object store with built-in Apache Iceberg support.

Amazon S3 Tables now support create and query table operations directly from the S3 console using Amazon Athena – Amazon S3 Tables adds create and query table support in the S3 console. With this new feature, you can now create a table, populate it with data, and query it directly from the S3 console using Amazon Athena, making it easier to get started and analyze data in S3 table buckets.

Amazon S3 reduces pricing for S3 object tagging by 35% – Amazon S3 reduces pricing for S3 object tagging by 35% in all AWS Regions to $0.0065 per 10,000 tags per month. Object tags are key-value pairs applied to S3 objects that can be created, updated, or deleted at any time during the lifetime of the object.

Serverless Land Patterns available in Visual Studio CodeServerless Land‘s extensive application pattern library is now available directly into the Visual Studio Code (VS Code) IDE, making it easier for developers to build serverless applications. This integration eliminates the need to switch between your development environment and external resources when building serverless architectures by enabling you to browse, search, and implement pre-built serverless patterns directly in VS Code IDE.

Amplify Hosting Announces Skew Protection SupportAWS Amplify Hosting now offers Skew Protection, a feature that guarantees version consistency across your deployments. This feature ensures frontend requests are always routed to the correct server backend version—eliminating version skew and making deployments more reliable.

Amazon Route 53 Traffic Flow introduces a new visual editor to improve DNS policy editingAmazon Route 53 Traffic Flow now offers an enhanced user interface for improved DNS traffic policy editing. With this release, you can more easily understand and change the way traffic is routed between users and endpoints using the new features of the visual editor.

From community.aws
Here are some of my favorite posts from community.aws. Create your AWS Builder ID to start sharing your tips and connect with fellow builders. Your Builder ID is a universal login credential that gives you access, beyond the AWS Management Console, to AWS tools and resources, including over 600 free training courses, community features, and developer tools such as Amazon Q Developer.

Seamless SQL Server Recovery on EC2 with AWS Systems Manager (Greg Vinton) – This guide explains how to use the AWSEC2-RestoreSqlServerDatabaseWithVss automation runbook to restore a Microsoft SQL Server database on an Amazon Elastic Compute Cloud (Amazon EC2) instance.

Secure Deployment Strategies in Amazon EKS with Azure DevOps (Abhishek Nanda) – Build and Deploy containerized applications on Amazon Elastic Kubernetes Service (Amazon EKS) using Azure DevOps.

Connect Your Favorite LLM Client to Bedrock (Qinjie Zhang) – It’s common to use desktop applications like MSTY, Chatbox AI, LM Studio to simplify the use of Large Language Models (LLM) models. This blog provides a step-by-step guide on how you can connect your favorite local LLM clients to Amazon Bedrock.

From PHP to Python with the help of Amazon Q Developer (Ricardo Sueiras) – In this blog post, Ricardo showcases how to use Amazon Q Developer CLI to refactor code from one programming language to another.

Upcoming AWS events
Check your calendars and sign up for these upcoming AWS events:

AWS Community Days – Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world: Milan, Italy (April 2), Bay Area – Security Edition (April 4), Timișoara, Romania (April 10), and Prague, Czech Republic (April 29).

AWS Innovate: Generative AI + Data – Join a free online conference focusing on generative AI and data innovations in Latin America on April 8.

AWS Summits – The AWS Summit season is coming along! Join free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Register in your nearest city: Paris (April 9), Amsterdam (April 16), London (April 30), and Poland (May 5).

AWS re:Inforce (June 16–18) – Our annual learning event devoted to all things AWS Cloud security in Philadelphia, PA. Registration opens in March, so be ready to join more than 5,000 security builders and leaders.

AWS DevDays are free, technical events where developers can learn about some of the hottest topics in cloud computing. DevDays offer hands-on workshops, technical sessions, live demos, and networking with AWS technical experts and your peers. Register to access AWS DevDays sessions on demand.

That’s all for this week. Check back next Monday for another Weekly Roundup!

Prasad

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!


How is the News Blog doing? Take this 1 minute survey!

(This survey is hosted by an external company. AWS handles your information as described in the AWS Privacy Notice. AWS will own the data gathered via this survey and will not share the information collected with survey respondents.)

Accelerate analytics and AI innovation with the next generation of Amazon SageMaker

Post Syndicated from G2 Krishnamoorthy original https://aws.amazon.com/blogs/big-data/accelerate-analytics-and-ai-innovation-with-the-next-generation-of-amazon-sagemaker/

At AWS re:Invent 2024, we announced the next generation of Amazon SageMaker, the center for all your data, analytics, and AI. Amazon SageMaker brings together widely adopted AWS machine learning (ML) and analytics capabilities and addresses the challenges of harnessing organizational data for analytics and AI through unified access to tools and data with governance built in. It enables teams to securely find, prepare, and collaborate on data assets and build analytics and AI applications through a single experience, accelerating the path from data to value.

At the core of the next generation of Amazon SageMaker is Amazon SageMaker Unified Studio, a single data and AI development environment where you can find and access your organization’s data and act on it using the best tool for the job across virtually any use case. We are excited to announce the general availability of SageMaker Unified Studio.

In this post, we explore the benefits of SageMaker Unified Studio and how to get started.

Benefits of SageMaker Unified Studio

SageMaker Unified Studio brings together the functionality and tools from existing AWS Analytics and AI/ML services, including Amazon EMR, AWS Glue, Amazon Athena, Amazon Redshift, Amazon Bedrock, and Amazon SageMaker AI. From within the unified studio, you can discover data and AI assets from across your organization, then work together in projects to securely build and share analytics and AI artifacts, including data, models, and generative AI applications. Governance features including fine-grained access control are built into SageMaker Unified Studio using Amazon SageMaker Catalog to help you meet enterprise security requirements across your entire data estate.

Unified access to your data is provided by Amazon SageMaker Lakehouse, a unified, open, and secure data lakehouse built on Apache Iceberg open standards. Whether your data is stored in Amazon Simple Storage Service (Amazon S3) data lakes, Redshift data warehouses, or third-party and federated data sources, you can access it from one place and use it with Iceberg-compatible engines and tools. In addition, SageMaker Lakehouse now integrates with Amazon S3 Tables, the first cloud object store with native Apache Iceberg support, so you can use SageMaker Lakehouse to create, query, and process S3 Tables efficiently using various analytics engines in SageMaker Unified Studio as well as Iceberg-compatible engines like Apache Spark and PyIceberg.

Capabilities from Amazon Bedrock are now generally available in SageMaker Unified Studio, allowing you to rapidly prototype, customize, and share generative AI applications in a governed environment. Users have an intuitive interface to access high-performing foundation models (FMs) in Amazon Bedrock, including the Amazon Nova model series, and the ability to create Agents, Flows, Knowledge Bases, and Guardrails with a few clicks.

Amazon Q Developer, the most capable generative AI assistant for software development, can be used within SageMaker Unified Studio to streamline tasks across the data and AI development lifecycle, including code authoring, SQL generation, data discovery, and troubleshooting.

A new integrated way of working

The general availability of SageMaker Unified Studio represents another meaningful step in our journey to offer our customers a streamlined way to work with their data, whether for analytics or AI. Many of our customers have told us that you are building data-driven applications to guide business decisions, improve agility, and drive innovation, but that these applications are complex to build because they require collaboration across teams and the integration of data and tools. Not only is it time consuming for users to learn multiple development experiences, but because data, code, and other development artifacts are stored separately, it is challenging for users to understand how they interact with each other and to use them cohesively. Configuring and governing access is also a cumbersome manual process. To overcome these hurdles, many organizations are building bespoke integrations between services, tools, and homegrown access management systems. However, what you need is the flexibility to adopt the best services for your use case while empowering your data teams with a unified development experience.

“When we build data-driven applications for our customers, we want a unified platform where the technologies work together in an integrated way. Amazon SageMaker Unified Studio streamlines our solution delivery processes through comprehensive analytics capabilities, a unified studio experience, and a lakehouse that integrates data management across data warehouses and data lakes. Amazon SageMaker Unified Studio reduces the time-to-value for our customers’ data projects by up to 40%, helping us with our mission to accelerate our customers’ digital transformation journey.”

—Akihiro Suzue, Head of Solutions Sector, NTT DATA; Yuji Shono, Senior Manager, Apps & Data Technology Department, NTT DATA; Yuki Saito, Manager, Digital Success Solutions Division, NTT DATA

Millions of organizations trust AWS and utilize our comprehensive set of purpose-built analytics, AI/ML, and generative AI capabilities to power data-driven applications without compromising on performance, scale, or cost. Our goal for the next generation of Amazon SageMaker, including SageMaker Unified Studio, is to make data and AI workers more productive by providing access to all your data and tools in a single development environment.

Building from a single data and AI development environment

Let’s explore a common business challenge: increasing revenue through better lead generation. Consider an organization implementing an intelligent digital assistant on their website to engage with customers—a process that traditionally requires multiple tools and data sources. With SageMaker Unified Studio, this entire process can now be carried out within a single data and AI development environment.

First, the data team uses the generative AI playground within SageMaker Unified Studio to quickly evaluate and select the best model for their customer interactions. They then create a project to house the tools and resources necessary for their use case and use Amazon Bedrock within the project to build and deploy a sophisticated virtual assistant that quickly begins qualifying leads through their website.

To identify the most promising opportunities, the team develops a segmentation strategy. The data engineer asks Amazon Q Developer to identify datasets that contain lead data and uses zero-ETL integrations to bring the data into SageMaker Lakehouse. The data analyst then discovers it and creates a comprehensive view of their market. They use the SQL query editor to build out marketing segments, which they then write back to SageMaker Lakehouse, where they are available to other team members.

Finally, the data scientist accesses the same dataset, which they use to train and deploy an automated lead scoring model using tools available from SageMaker AI. During the model development phase, they use Amazon Q Developer’s inline code authoring and troubleshooting capabilities to efficiently write error free-code in their JupyterLab notebook. The final model provides sales teams with the highest-value opportunities, which they can visualize in a business intelligence dashboard and take action on immediately.

Reducing time-to-value in a unified environment

What is remarkable about this example is that entire process happens in one integrated environment. Without SageMaker Unified Studio, the team would have had to work with multiple data sources, tools, and services, spending time learning multiple development environments, creating resources shares, and manually configuring access controls. The data engineer and data analyst would have worked in various data warehouses, data lakes, and analytics tools, the data scientist would have worked in an ML studio and notebook environment, and the application builder in a generative AI tool. Now, they’re able to build and collaborate with their data and tools available in one experience, dramatically reducing time-to-value.

That’s why we’re so excited about the next generation of Amazon SageMaker and the general availability of SageMaker Unified Studio. We believe that by putting everything you need for analytics and AI in one place, you can solve complex end-to-end problems more efficiently and get to innovative outcomes faster than ever before.

Getting started with SageMaker Unified Studio

To learn more, check out the following resources:


About the authors

G2 Krishnamoorthy is VP of Analytics, leading AWS data lake services, data integration, Amazon OpenSearch Service, and Amazon QuickSight. Prior to his current role, G2 built and ran the Analytics and ML Platform at Facebook/Meta, and built various parts of the SQL Server database, Azure Analytics, and Azure ML at Microsoft.

Rahul Pathak is VP of Relational Database Engines, leading Amazon Aurora, Amazon Redshift, and Amazon QLDB. Prior to his current role, he was VP of Analytics at AWS, where he worked across the entire AWS database portfolio. He has co-founded two companies, one focused on digital media analytics and the other on IP-geolocation.

Collaborate and build faster with Amazon SageMaker Unified Studio, now generally available

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/collaborate-and-build-faster-with-amazon-sagemaker-unified-studio-now-generally-available/

Today, we’re announcing the general availability of Amazon SageMaker Unified Studio, a single data and AI development environment where you can find and access all of the data in your organization and act on it using the best tool for the job across virtually any use case. Introduced as preview during AWS re:Invent 2024, my colleague, Antje, summarized it as:

SageMaker Unified Studio (preview) is a single data and AI development environment. It brings together functionality and tools from the range of standalone “studios,” query editors, and visual tools that we have today in Amazon AthenaAmazon EMRAWS GlueAmazon RedshiftAmazon Managed Workflows for Apache Airflow (Amazon MWAA), and the existing SageMaker Studio.

Here’s a video to see Amazon SageMaker Unified Studio in action:

SageMaker Unified Studio breaks down silos in data and tools, giving data engineers, data scientists, data analysts, ML developers and other data practitioners a single development experience. This saves development time and simplifies access control management so data practitioners can focus on what really matters to them—building data products and AI applications.

This post focuses on several important announcements that we’re excited to share:

To get started, go to the Amazon SageMaker console and create a SageMaker Unified Studio domain. To learn more, visit Create an Amazon SageMaker Unified Studio domain in the AWS documentation.

New capabilities for Amazon Bedrock in SageMaker Unified Studio
The capabilities of Amazon Bedrock within Amazon SageMaker Unified Studio offer a governed collaborative environment for developers to rapidly create and customize generative AI applications. This intuitive interface caters to developers of all skill levels, providing seamless access to the high-performance FMs offered in Amazon Bedrock and advanced customization tools for collaborative development of tailored generative AI applications.

Since the preview launch, several new FMs have become available in Amazon Bedrock and are fully integrated with SageMaker Unified Studio, including Anthropic’s Claude 3.7 Sonnet and DeepSeek-R1. These models can be used for building generative AI apps and chatting in the playground in SageMaker Unified Studio.

Here’s how you can choose Anthropic’s Claude 3.7 Sonnet on the model selection in your project.

You can also source data or documents from S3 folders within your project and select specific FMs when creating knowledge bases. 

During preview, we introduced Amazon Bedrock Guardrails to help you implement safeguards for your Amazon Bedrock application based on your use cases and responsible AI policies. Now, Amazon Bedrock Guardrails is extended to Amazon Bedrock Flows with this general availability release.

Additionally, we have streamlined generative AI setup for associated accounts with a new user management interface in SageMaker Unified Studio, making it straightforward for domain administrators to grant associated account admins access to model governance projects. This enhancement eliminates the need for command line operations, streamlining the process of configuring generative AI capabilities across multiple AWS accounts.

These new features eliminate barriers between data, tools, and builders in the generative AI development process. You and your team will gain a unified development experience by incorporating the powerful generative AI capabilities of Amazon Bedrock — all within the same workspace.

Amazon Q Developer is now generally available in SageMaker Unified Studio
Amazon Q Developer is now generally available in Amazon SageMaker Unified Studio, providing data professionals with generative AI–powered assistance across the entire data and AI development lifecycle.

Amazon Q Developer integrates with the full suite of AWS analytics and AI/ML tools and services within SageMaker Unified Studio, including data processing, SQL analytics, machine learning model development, and generative AI application development, to accelerate collaboration and help teams build data and AI products faster. To get started, you can select Amazon Q Developer icon.

For new users of SageMaker Unified Studio, Amazon Q Developer serves as an invaluable onboarding assistant. It can explain core concepts such as domains and projects, provide guidance on setting up environments, and answer your questions.

Amazon Q Developer helps you discover and understand data using powerful natural language interactions with SageMaker Catalog. What makes this implementation particularly powerful is how Amazon Q Developer combines broad knowledge of AWS analytics and AI/ML services with the user’s context to provide personalized guidance.

You can chat about your data assets through a conversational interface, asking questions such as “Show all payment related datasets” without needing to navigate complex metadata structures.

Amazon Q Developer offers SQL query generation through its integration with the built-in query editor available in SageMaker Unified Studio. Data professionals of varying skill levels can now express their analytical needs in natural language, receiving properly formatted SQL queries in return.

For example, you can ask, “Analyze payment method preferences by age group and region” and Amazon Q Developer will generate the appropriate SQL with proper joins across multiple tables.

Additionally, Amazon Q Developer is also available to assist with troubleshooting and generating real-time code suggestions in SageMaker Unified Studio Jupyter notebooks, as well as building ETL jobs.

Now available

  • Availability — Amazon SageMaker Unified Studio is now available in the following AWS Regions: US East (N. Virginia, Ohio), US West (Oregon), Asia Pacific (Seoul, Singapore, Sydney, Tokyo), Canada (Central), Europe (Frankfurt, Ireland, London), South America (São Paulo). Learn more about the availability of these capabilities on supported Region documentation page.
  • Amazon Q Developer subscription — The free tier of Amazon Q Developer is available by default in SageMaker Unified Studio, requiring no additional setup or configuration. If you already have Amazon Q Developer Pro Tier subscriptions, you can use those enhanced capabilities within the SageMaker Unified Studio environment. For more information, visit the documentation page.
  • Amazon Bedrock capabilities — To learn more about the capabilities of Amazon Bedrock in Amazon SageMaker Unified Studio, refer to this documentation page

Start building with Amazon SageMaker Unified Studio today. For more information, visit the Amazon SageMaker Unified Studio page.

Happy building!

Donnie Prakoso

— How is the News Blog doing? Take this 1 minute survey! (This survey is hosted by an external company. AWS handles your information as described in the AWS Privacy Notice. AWS will own the data gathered via this survey and will not share the information collected with survey respondents.)

Amazon S3 Tables integration with Amazon SageMaker Lakehouse is now generally available

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/amazon-s3-tables-integration-with-amazon-sagemaker-lakehouse-is-now-generally-available/

At re:Invent 2024, we launched Amazon S3 Tables, the first cloud object store with built-in Apache Iceberg support to streamline storing tabular data at scale, and Amazon SageMaker Lakehouse to simplify analytics and AI with a unified, open, and secure data lakehouse. We also previewed S3 Tables integration with Amazon Web Services (AWS) analytics services for you to stream, query, and visualize S3 Tables data using Amazon Athena, Amazon Data Firehose, Amazon EMR, AWS Glue, Amazon Redshift, and Amazon QuickSight.

Our customers wanted to simplify the management and optimization of their Apache Iceberg storage, which led to the development of S3 Tables. They were simultaneously working to break down data silos that impede analytics collaboration and insight generation using the SageMaker Lakehouse. When paired with S3 Tables and SageMaker Lakehouse in addition to built-in integration with AWS analytics services, they can gain a comprehensive platform unifying access to multiple data sources enabling both analytics and machine learning (ML) workflows.

Today, we’re announcing the general availability of Amazon S3 Tables integration with Amazon SageMaker Lakehouse to provide unified S3 Tables data access across various analytics engines and tools. You can access SageMaker Lakehouse from Amazon SageMaker Unified Studio, a single data and AI development environment that brings together functionality and tools from AWS analytics and AI/ML services. All S3 tables data integrated with SageMaker Lakehouse can be queried from SageMaker Unified Studio and engines such as Amazon Athena, Amazon EMR, Amazon Redshift, and Apache Iceberg-compatible engines like Apache Spark or PyIceberg.

With this integration, you can simplify building secure analytic workflows where you can read and write to S3 Tables and join with data in Amazon Redshift data warehouses and third-party and federated data sources, such as Amazon DynamoDB or PostgreSQL.

You can also centrally set up and manage fine-grained access permissions on the data in S3 Tables along with other data in the SageMaker Lakehouse and consistently apply them across all analytics and query engines.

S3 Tables integration with SageMaker Lakehouse in action
To get started, go to the Amazon S3 console and choose Table buckets from the navigation pane and select Enable integration to access table buckets from AWS analytics services.

Now you can create your table bucket to integrate with SageMaker Lakehouse. To learn more, visit Getting started with S3 Tables in the AWS documentation.

1. Create a table with Amazon Athena in the Amazon S3 console
You can create a table, populate it with data, and query it directly from the Amazon S3 console using Amazon Athena with just a few steps. Select a table bucket and select Create table with Athena, or you can select an existing table and select Query table with Athena.

2. Create tables with Athena

When you want to create a table with Athena, you should first specify a namespace for your table. The namespace in an S3 table bucket is equivalent to a database in AWS Glue, and you use the table namespace as the database in your Athena queries.

Choose a namespace and select Create table with Athena. It goes to the Query editor in the Athena console. You can create a table in your S3 table bucket or query data in the table.

2. Query with Athena

2. Query with SageMaker Lakehouse in the SageMaker Unified Studio
Now you can access unified data across S3 data lakes, Redshift data warehouses, third-party and federated data sources in SageMaker Lakehouse directly from SageMaker Unified Studio.

To get started, go to the SageMaker console and create a SageMaker Unified Studio domain and project using a sample project profile: Data Analytics and AI-ML model development. To learn more, visit Create an Amazon SageMaker Unified Studio domain in the AWS documentation.

After the project is created, navigate to the project overview and scroll down to project details to note down the project role Amazon Resource Name (ARN).

3. Project details in SageMaker Unified Studio

Go to the AWS Lake Formation console and grant permissions for AWS Identity and Access Management (IAM) users and roles. In the in the Principals section, select the <project role ARN> noted in the previous paragraph. Choose Named Data Catalog resources in the LF-Tags or catalog resources section and select the table bucket name you created for Catalogs. To learn more, visit Overview of Lake Formation permissions in the AWS documentation.

4. Grant permissions in Lake Formation console

When you return to SageMaker Unified Studio, you can see your table bucket project under Lakehouse in the Data menu in the left navigation pane of project page. When you choose Actions, you can select how to query your table bucket data in Amazon Athena, Amazon Redshift, or JupyterLab Notebook.

5. S3 Tables in Unified Studio

When you choose Query with Athena, it automatically goes to Query Editor to run data query language (DQL) and data manipulation language (DML) queries on S3 tables using Athena.

Here is a sample query using Athena:

select * from "s3tablecatalog/s3tables-integblog-bucket”.”proddb"."customer" limit 10;

6. Athena query in Unified Studio

To query with Amazon Redshift, you should set up Amazon Redshift Serverless compute resources for data query analysis. And then you choose Query with Redshift and run SQL in the Query Editor. If you want to use JupyterLab Notebook, you should create a new JupyterLab space in Amazon EMR Serverless.

3. Join data from other sources with S3 Tables data
With S3 Tables data now available in SageMaker Lakehouse, you can join it with data from data warehouses, online transaction processing (OLTP) sources like relational or non-relational database, Iceberg tables, and other third party sources to gain more comprehensive and deeper insights.

For example, you can add connections to data sources such as Amazon DocumentDB, Amazon DynamoDB, Amazon Redshift, PostgreSQL, MySQL, Google BigQuery, or Snowflake and combine data using SQL without extract, transform, and load (ETL) scripts.

Now you can run the SQL query in the Query editor to join the data in the S3 Tables with the data in the DynamoDB.

Here is a sample query to join between Athena and DynamoDB:

select * from "s3tablescatalog/s3tables-integblog-bucket"."blogdb"."customer", 
              "dynamodb1"."default"."customer_ddb" where cust_id=pid limit 10;

To learn more about this integration, visit Amazon S3 Tables integration with Amazon SageMaker Lakehouse in the AWS documentation.

Now available
S3 Tables integration with SageMaker Lakehouse is now generally available in all AWS Regions where S3 Tables are available. To learn more, visit the S3 Tables product page and the SageMaker Lakehouse page.

Give S3 Tables a try in the SageMaker Unified Studio today and send feedback to AWS re:Post for Amazon S3 and AWS re:Post for Amazon SageMaker or through your usual AWS Support contacts.

In the annual celebration of the launch of Amazon S3, we will introduce more awesome launches for Amazon S3 and Amazon SageMaker. To learn more, join the AWS Pi Day event on March 14.

Channy

How is the News Blog doing? Take this 1 minute survey!

(This survey is hosted by an external company. AWS handles your information as described in the AWS Privacy Notice. AWS will own the data gathered via this survey and will not share the information collected with survey respondents.)

Amazon Web Services named a Leader in the 2024 Gartner Magic Quadrant for Data Integration Tools

Post Syndicated from William Vambenepe original https://aws.amazon.com/blogs/big-data/amazon-web-services-named-a-leader-in-the-2024-gartner-magic-quadrant-for-data-integration-tools/

Amazon Web Services (AWS) has been recognized as a Leader in the 2024 Gartner Magic Quadrant for Data Integration Tools. We were positioned in the Challengers Quadrant in 2023.

This recognition, we feel, reflects our ongoing commitment to innovation and excellence in data integration, demonstrating our continued progress in providing comprehensive data management solutions.

The Gartner Magic Quadrant evaluates 20 data integration tool vendors based on two axes—Ability to Execute and Completeness of Vision. This evaluation, we feel, critically examines vendors’ capabilities to address key service needs, including data engineering, operational data integration, modern data architecture delivery, and enabling less-technical data integration across various deployment models.

Discover, prepare, and integrate all your data at any scale

AWS Glue is a fully managed, serverless data integration service that simplifies data preparation and transformation across diverse data sources. With its comprehensive suite of tools, AWS Glue allows users to build and manage data pipelines efficiently, without requiring extensive infrastructure management expertise.

Given the diverse data integration needs of customers, AWS offers a robust data integration system through multiple services including Amazon EMR, Amazon Athena, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), Amazon Managed Streaming for Apache Kafka (MSK), Amazon Kinesis, and others. Many thousands of customers across various industries are using these services to transform, operationalize, and manage their data across data lakes and data warehouses.

We have embarked on a journey to unify the broad range of AWS data processing, analytics, and AI capabilities, starting with the announcement of Amazon SageMaker Unified Studio at re:Invent 2024. This includes the data integration capabilities mentioned above, with support for both structured and unstructured data. With an integrated experience for data workers, SageMaker Unified Studio provides an environment where users can collaborate and build faster. It supports model development, generative AI, data processing, and SQL analytics, all accelerated by Amazon Q Developer—our most capable generative AI assistant for software development. The Unified Studio provides unified access to all data sources, whether stored in data lakes, data warehouses, or third-party and federated sources, with robust governance and enterprise-grade security built-in.

Review the Gartner Magic Quadrant

Access a complimentary copy of the full report to see why Gartner positioned AWS as a Leader, and dive deep into the strengths and cautions of AWS. We believe our recognition as a Leader in the Gartner Magic Quadrant is a testament to delivering innovations for our customers.

MQ

Gartner does not endorse any vendor, product or service depicted in its research publications and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

GARTNER is a registered trademark and service mark of Gartner and Magic Quadrant is a registered trademark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and are used herein with permission. All rights reserved. This graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document. The Gartner document is available upon request from here.


About the authors

William Vambenepe is Director of Product Management at AWS, where he leads the Product Management, Solutions Engineering, and UX Design team for data processing services (Amazon EMR, AWS Glue, Athena, Amazon MWAA), SageMaker Unified Studio, and SageMaker Catalog. Prior to AWS, William worked at Google (6 years building and growing the Data and Analytics product portfolio for Google Cloud, and 5 years as Product Management Director for Google Search). He had previously held software engineering leadership roles at Oracle and HP. William holds an Engineering degree from Ecole Centrale Paris, a graduate Diploma in Computer Science from Cambridge University, and a Masters in Engineering Management from Stanford University.

Santosh Chandrachood has been with AWS for over 8 years and helped build, launch, and scale a variety of AWS services. Currently, Santosh is Director and service leader for Data Processing, managing Amazon EMR, Athena, AWS Glue, and Managed Workflows for Apache Airflow (Amazon MWAA). Santosh also led AWS Data Integration as the General Manager. Before joining AWS, Santosh lead engineering teams in networking, storage, and data infrastructure areas.

Foundational blocks of Amazon SageMaker Unified Studio: An admin’s guide to implement unified access to all your data, analytics, and AI

Post Syndicated from Lakshmi Nair original https://aws.amazon.com/blogs/big-data/foundational-blocks-of-amazon-sagemaker-unified-studio-an-admins-guide-to-implement-unified-access-to-all-your-data-analytics-and-ai/

Amazon SageMaker Unified Studio (preview) provides a unified experience for using data, analytics, and AI capabilities. You can use familiar AWS services for model development, generative AI, data processing, and analytics—all within a single, governed environment. Users can now build, deploy, and execute end-to-end workflows from a single interface. SageMaker Unified Studio is built on the foundations of Amazon DataZone, where it uses domains to categorize and structure the data assets, while offering project-based collaboration features that allow teams to securely share artifacts and work together across various compute services. This experience allows multiple personas to seamlessly collaborate, while operating under appropriate access controls and governance policies.

In this post, we focus on the admin persona and deep dive into the foundational building blocks while implementing the self-service access to all your data.

Conceptual framework

SageMaker Unified Studio offers an integrated development experience organized into three distinct planes, each serving different personas and purposes within the development lifecycle. This architecture enables seamless collaboration while maintaining clear boundaries of responsibility.

As shown in the following figure, each plane represents a distinct layer of functionality that works in harmony with the others to create a complete data and machine learning (ML) solution.

foundational planes

The planes are as follows:

  • Infrastructure plane – The infrastructure plane forms the foundation of SageMaker Unified Studio. Here administrators and domain owners of the organization provision the underlying infrastructure and define rules for users of the data factory plane to deploy the compute resources for data and ML operations in self-service mode. They can also decide to onboard existing resources or pre-create them. They can set up access controls and permissions to enforce and allocate resources to different teams and projects. This layer makes sure that all necessary computational resources are available and properly governed for downstream computation.
  • Data factory plane – The data factory plane functions like a sophisticated vending machine for compute resources, where data scientists and ML engineers can select and utilize preconfigured compute resources or deploy new ones. The data product developers, data engineers, and data scientists can create collaboration spaces and build data products by consuming infrastructure resources, with all the underlying complexity abstracted away.
  • Product experience plane – At the outermost layer, the product experience plane serves as a discovery and collaboration hub where business units (data producers and data consumers) can explore available data products from the asset catalog. This plane drives users to engage in data-driven conversations with knowledge and insights shared across the organization. Through the product experience plane, data product owners can use automated workflows to capture data lineage and data quality metrics and oversee access controls. They can track how their data products are being used and continuously improve the value proposition of their data assets.

In this post, we focus on the infrastructure plane deployment steps from an administrator’s perspective, outlining key responsibilities and actions required and how to configure and organize your assets under specific business units and teams and authorize policies during the initial setup phase.

Roles and responsibilities of the domain owner (admin) for the infrastructure plane

As shown in the following figure, the infrastructure plane revolves around three pivotal operational paradigms: onboard, organize, and authorize.

The details of the three essential functions in the foundational layer are as follows:

  • Onboard – The domain owner establishes a foundational environment by creating a domain, which represents an organization entity for you to connect together your assets, users, resources, and code repository configs. They can onboard the users who have authorization to access the self-serve unified studio. The self-serve unified studio is a browser-based web application where you can analyze, discover, catalog, govern, and share data in self-serve manner. The admin can enable the necessary blueprints and create project profiles to set up the underlying data infrastructure. In a multi-account (Mesh) scenario, the admin can also onboard the business units by associating the AWS accounts.
  • Organize – Here the domain owner creates hierarchies to organize and isolate projects within individual business units. The method of creating hierarchical representation of business units or team-level organization is through domain units. This makes sure that each business unit takes ownership of their assets. The admin can also delegate ownership within these business units.
  • Authorize – The admin or owners of individual business units or line of business (domain unit owners) can manage user policies—project-specific policies that dictate certain actions these principals can perform under a domain unit.

Now that we have discussed the core functions, let’s delve into the workflow that brings these concepts together.

Process workflow (infrastructure plane)

In the following figure, we break down the roles and responsibilities of domain owners to unit administrators through a sequence of operations, providing infrastructure deployment and management.

process workflow

The workflow consists of the following steps:

  1. The root domain owner (admin) creates a SageMaker Unified Studio domain from the console. After the domain is created, you get a SageMaker Unified Studio URL—a browser-based web application that can authenticate you with your AWS Identity and Access Management (IAM) user credentials or with credentials from your identity provider (IdP) through AWS IAM Identity Center or with your SAML credentials.
  2. As part of the onboarding process, the admin onboards single sign-on (SSO) users, SSO groups, and IAM users who are authorized to log in to SageMaker Unified Studio. IAM roles can be onboarded on the domain as well, but can be used for programmatic access only. During the quick setup deployment of the domain, default project profile templates are created. A project profile is a collection of blueprints that holds configurations of AWS tools and services. You can create following project profiles:
    1. Generative AI application development – Provides you with the tooling capabilities to build generative AI applications using Amazon Bedrock foundation models (FMs) and tools.
    2. SQL analytics – Provides you with a SQL editor to query the data in Amazon SageMaker Lakehouse, Amazon Redshift, and Amazon Athena.
    3. Data analytics and AI-ML model development – Provides you tools to build and orchestrate ML and generative AI models powered by AWS Glue, Athena, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), Amazon SageMaker AI, and SageMaker Lakehouse.
    4. Custom project profile – Provides capabilities to build custom templates that can bundle multiple blueprints with varied tooling capabilities to suit your business needs.

Admins can also authorize project profile templates to specific users and groups, enforcing the capability to control resource deployment based on user personas. By default, all users are authorized to use default project profiles. However, this can be changed by the admin to limit the access of certain project profiles to certain users and groups.

The quick setup also establishes a default Git connection to AWS CodeCommit for users to manage their code repository. However, you also have the option to create and enable new Git connections to GitHub, GitHub Enterprise Server, GitLab, and GitLab self-managed. The Free Tier release of Amazon Q is enabled by default to all users of SageMaker Unified Studio domain. Amazon Q Developer Pro can be configured if IAM Identity Center is configured for users of the domain.

Finally, as part of the initial setup, the admin provides access to Amazon Bedrock serverless models.

In a multi-account scenario, the central admin associates AWS accounts, and the associated account admins accept the association and enable the blueprints for the project profiles that the central admin would create. Refer to the appendix at the end of this post for more details.

  1. To organize the data assets within the organization, the admin logs in to the SageMaker Unified Studio URL and creates domain units aligned with the business divisions.
  2. Each domain unit receives delegated ownership, enabling autonomous management of assets within their designated scope. This domain-based isolation provides clear boundaries while allowing unit owners to independently govern their assets and enforce relevant policies.

Steps 3 and 4 are optional as part of the quick deployment setup. Users can directly log in to SageMaker Unified Studio to build data products for their business use case if domain units are not part of immediate requirement. If no domain units are created, all users and groups fall back under the root domain level and authorization policies are applied on the root domain.

Behind the scenes

While users interact with a streamlined project creation interface in SageMaker Unified Studio, a sophisticated orchestration of components operates beneath the surface. This abstraction allows the admin to deploy infrastructure through simple selections while the system handles resource provisioning automatically. Let’s examine the underlying process behind the scenes, as illustrated in the following figure.

conceptual diagram of blueprints

This workflow consists of the following steps:

  1. Administrators enable the blueprints containing the AWS CloudFormation templates that have information on how to create and set up the underlying data infrastructure. These blueprints are automatically enabled during the quick setup deployment.
  2. Project profiles bundle these blueprint configurations into templates. These templates determine which infrastructure components deploy when a project is created.
  3. When users select a project profile within SageMaker Unified Studio, the system automatically triggers the relevant CloudFormation stack and deploys the necessary infrastructure resources in the form of environments. Environments are the actual data infrastructure behind a project.

In a multi-account scenario, the associated account admin enables the blueprints. However, the project profile creation happens at the root domain account. The project profile template will include the associated account details and the linked blueprints from the associated account. Refer to the appendix at the end of this post for more details.

Now that we have understood the functional building blocks of SageMaker Unified Studio, let’s proceed with the deployment walkthrough. We will create a domain using the quick setup deployment for single account. Refer to the appendix for multi-account deployment steps.

Prerequisites

You will need to complete the following prerequisites before you can follow the instructions in the next section:

  1. Sign up for an AWS account.
  2. Create a user with administrative access.
  3. Enable IAM Identity Center in the same AWS Region you want to create your SageMaker Unified Studio domain. Confirm in which Region SageMaker Unified Studio is currently available. Set up your IdP and synchronize identities and groups with IAM Identity Center. For more information, refer to IAM Identity Center Identity source tutorials.
  4. To use Amazon Bedrock FMs, grant access to base models.

Set up domain

Complete the following steps to create a new SageMaker Unified Studio domain:

  1. Sign in to the SageMaker console in the Region in which IAM Identity Center is enabled.
  2. Choose Create a Unified Studio domain.

create domain

  1. Select the Quick setup (recommended for exploration).
  2. Choose Create VPC (you can also use your own VPC but to simplify the cleanup, we opted to use a new VPC).

create vpc

This will open a new tab to deploy the CloudFormation stack to create the VPC and the necessary private and public subnets.

  1. For Stack name, enter a unique name to the stack (if the default name already exists).
  2. Keep the parameter for useVpcEndpoints as false.
  3. Choose Create stack.

create stack

  1. After the stack is created, go to the domain creation page and refresh the page, as shown in the following screenshot.

refresh

  1. For Name, enter a unique name for the domain.
  2. Keep the default selections for Domain Execution role, Domain Service role, Provisioning role, and Manage Access role.
  3. The configuration automatically selects the VPC and private subnets.

domain roles

service roles

  1. Keep the default selection for Model provisioning role and Model consumption role.
  2. Choose Continue.

prov roles

  1. Provide the email address of the SSO user that exists in IAM Identity Center.

The SSO user selected here is used as the administrator in SageMaker Unified Studio. If the account doesn’t have IAM Identity Center set up, then it will create an IAM Identity Center account instance, so long as the account is permitted to do so. An SSO or IAM user is required so that a user is able to log in to the studio after the domain is created.

  1. Choose Create domain.

create IdC

  1. After the domain is created, a dialog box pops up. You can close dialog box to set up authorization policies and onboard users.

dialog box

On the domain detail page, the Amazon SageMaker Unified Studio URL is listed. You can authenticate with your IAM user credentials or with credentials from your IdP through IAM Identity Center or with your SAML credentials. To authorize users to log in to the URL, the administrator must onboard the users to the domain. We see this as part of the next steps.

Unified Studio URL

Onboard users and associated accounts

Complete the following steps:

  1. To onboard users, go to the User management tab and choose Add.
  2. On the Add menu, choose either Add SSO users and groups or Add IAM users.

You can also add IAM roles for the purpose of managing the domain programmatically. However, you can’t use IAM roles to log in to the SageMaker Unified Studio URL. After you add the users, they will appear with the status Assigned. The status changes to Activated only when the user logs in to the SageMaker Unified Studio URL.

onboard users

  1. If you want to onboard multiple AWS accounts to your domain account, go to the Account associations tab and choose Request association.

This enables domain users to publish and consume data from these AWS accounts.

associate accounts

For a multi-account setup, by sending an association request to another AWS account, you share the root domain with the other AWS account with AWS Resource Access Manger (AWS RAM). The associated admin domain owner accepts the invitation. To access the compute resources of the associated accounts from SageMaker Unified Studio, the associated domain owner must enable the necessary blueprints. Refer to the appendix to understand the cross-account deployment steps.

Project profiles and authorizing users

For the quick setup deployment, when you navigate to the Blueprints tab, you will notice all the blueprints are automatically enabled. Also, on the Project profiles tab, you will find default project profiles are available to the user.project profiles

Leave the rest of the tabs with the default options.

Create a custom project profile and authorize users (optional)

In the following example, we show the steps to create a custom project profile by bundling selected blueprints. We also show the steps to authorize only restricted users to use this project profile template. This example creates a custom project profile with selective blueprints. This enables the user to create a data lake environment with AWS Glue database and Athena workgroup to query the data. The user can also create an Amazon MWAA environment for orchestration. You can also change or override the configuration parameters of the blueprint by using the Tooling configurations option within the project profile.

Because SageMaker Unified Studio is in preview mode, the naming conventions of some visual elements might appear different in the current version.

When you create a project profile, you can add blueprint deployment settings in two modes: on create and on demand. On create mode allows you to deploy the blueprint deployment settings as soon as the project is created. On demand mode allows you to deploy the blueprint deployment settings when users need it.

Create a project, create domain units, and delegate ownership (optional)

In the following example, the administrator logs in to SageMaker Unified Studio and creates the retail domain unit. The admin also delegates ownership to the retail business user. The retail business user logs in to SageMaker Unified Studio and creates a project with the authorized project profile template.

With these configurations in place, you have successfully completed the initial infrastructure plane deployment from an administrative perspective.

Authorization of blueprints (optional)

By default, all domain users have authorization to create projects with the enabled blueprints across domain units. If you want to restrict the usage of the blueprint within a specific domain unit (in this case, the retail domain unit, as shown in the following screenshot), you need to revoke the existing permissions and authorize the specific domain units. By limiting the use of blueprints to a particular domain unit, users can only create projects using the blueprint within that domain unit. To apply authorization settings to child domain units, enable the Cascade to all child domain units option.

blueprints authorization

Clean up

Make sure you remove the SageMaker Unified Studio resources to mitigate any unexpected costs. This involves a few steps:

  1. If you had multiple projects and subscribed to assets, unsubscribe to all assets.
  2. Note the names of all AWS Glue databases and Athena workgroups created by your projects.
  3. Delete any connections you created in the data explorer that you don’t want to keep.
  4. Note the project IDs.
  5. Delete the projects. If you encounter any errors, check the AWS CloudFormation console and find the failed stack. Fix the error that failed the stack deletion and delete the projects.
  6. Note down the domain ID.
  7. Delete the domain.
  8. Delete the S3 bucket named amazon-datazone-AWSACCOUNTID-AWSREGION-DOMAINID.
  9. Delete the AWS Glue databases and Athena workgroups you noted earlier.
  10. Delete the CloudFormation stack for the VPC (if you followed that step in the setup).

If you have additional resources that haven’t been deleted, you can also use tags to identify and delete specific resources.

Conclusion

In this post, we discussed the foundational building blocks of SageMaker Unified Studio and how, by abstracting complex technical implementations behind user-friendly interfaces, organizations can maintain standardized governance while enabling efficient resource management across business units. This approach provides consistency in infrastructure deployment while providing the flexibility needed for diverse business requirements.

To learn more, refer to the Amazon SageMaker Unified Studio Administrator Guide and the following resources:

Appendix: Multi-account administration

This section illustrates the cross-account association. After the account invitation is accepted by the associated account owner, follow the instructions as shown in the following example to understand how to enable the blueprints. After the blueprints are enabled in the associate accounts, the root domain account can create project profile templates with the parameters of the associated account, including its linked blueprints. The example then demonstrates how the retail domain unit user can deploy compute resources and create data using the resources from the associated account.


About the Authors

Lakshmi Nair is a Senior Analytics Specialist Solutions Architect at AWS. She specializes in designing advanced analytics systems across industries. She focuses on crafting cloud-based data platforms, enabling real-time streaming, big data processing, and robust data governance. She can be reached via LinkedIn.

Fabrizio Napolitano is a Principal Specialist Solutions Architect for DB and Analytics. He has worked in the analytics space for the last 20 years, and has recently and quite by surprise become a Hockey Dad after moving to Canada.