Author visual ETL flows on Amazon SageMaker Unified Studio (preview)

2024-12-04 Praveen Kumar

Post Syndicated from Praveen Kumar original https://aws.amazon.com/blogs/big-data/author-visual-etl-flows-on-amazon-sagemaker-unified-studio/

Amazon SageMaker Unified Studio (preview) provides an integrated data and AI development environment within Amazon SageMaker. From the Unified Studio, you can collaborate and build faster using familiar AWS tools for model development, generative AI, data processing, and SQL analytics. This experience includes visual ETL, a new visual interface that makes it simple for data engineers to author, run, and monitor extract, transform, load (ETL) data integration flows. You can use a simple visual interface to compose flows that move and transform data and run them on serverless compute. Additionally, you can choose to author your visual flows in English using generative AI prompts powered by Amazon Q. Visual ETL also automatically converts your visual flow directed acyclic graph (DAG) into Spark native scripts so you can continue authoring in a notebook, enabling a quick-start experience for developers who prefer to author using code.

This post shows how you can build a low-code and no-code (LCNC) visual ETL flow that enables seamless data ingestion and transformation across multiple data sources. We demonstrate how to:

Connect to diverse data sources
Perform table joins
Apply custom filters
Export aggregated data to Amazon Simple Storage Service (Amazon S3)

Additionally, we explore how generative AI can enhance your LCNC visual ETL development process, creating an intuitive and powerful workflow that streamlines the entire development experience.

Use case walkthrough

In this example, we use Amazon SageMaker Unified Studio to develop a visual ETL flow. This pipeline reads data from an Amazon S3 based file location, performs transformations on the data, and subsequently writes the transformed data back into an Amazon S3 based AWS Glue Data Catalog table. We use allevents_pipe and venue_pipe files from the TICKIT dataset to demonstrate this capability.

The TICKIT dataset records sales activities on the fictional TICKIT website, where users can purchase and sell tickets online for different types of events such as sports games, shows, and concerts. Analysts can use this dataset to track how ticket sales change over time, evaluate the performance of sellers, and determine the most successful events, venues, and seasons in terms of ticket sales.

The process involves merging the allevents_pipe and venue_pipe files from the TICKIT dataset. Next, the merged data is filtered to include only a specific geographic region. The data is then aggregated to calculate the number of events by venue name. In the end, the transformed output data is saved to Amazon S3, and a new AWS Glue Data Catalog table is created.

The following diagram illustrates the architecture:

Prerequisites

To follow this walkthrough, you must have the following prerequisites:

An AWS account
A SageMaker Unified Studio domain
A SageMaker Unified Studio project with Data analytics and machine learning project profile

Build a visual ETL flow

Complete the following steps to build a new visual ETL flow with the sample dataset:

On the SageMaker Unified Studio console, on the top menu, choose Build.
Under DATA ANALYSIS & INTEGRATION, choose Visual ETL flows, as shown in the following screenshot.

Select your project and choose Continue.

Choose Create visual ETL flow.

This time, manually define the ETL flow.

On the top left, choose the + icon in the circle. Under Data sources, choose Amazon S3, as shown in the following screenshot. Place the icon on the canvas.

Choose the Amazon S3 source node and enter the following values:

- S3 URI: s3://aws-blogs-artifacts-public/artifacts/BDB-4798/data/venue.csv
- Format: CSV
- Delimiter: ,
- Multiline: Enabled
- Header: Disabled

Leave the rest as default.

Wait for the data preview to be available at the bottom of the screen.

Choose the + icon in the circle to the right of the Amazon S3 node. Under Transforms, choose Rename Columns.

Choose the Rename Columns node and choose Add new rename pair. For Current name and New name, enter the following pairs:
- _c0: venueid
- _c1: venuename
- _c2: venuecity
- _c3: venuestate
- _c4: venueseats
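
Because Header is disabled on the source node, the columns arrive with positional names (_c0, _c1, and so on), which is why the rename pairs above are needed. The following is a minimal sketch of that idea in plain Python, not the script that visual ETL generates. The sample rows and the helper names read_headerless_csv and rename_columns are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample rows in the shape of venue.csv (no header row).
SAMPLE_CSV = "1,Example Arena,Washington,DC,20000\n2,Sample Hall,Boston,MA,0\n"

# Rename pairs from the walkthrough: positional name -> descriptive name.
RENAME_PAIRS = {
    "_c0": "venueid",
    "_c1": "venuename",
    "_c2": "venuecity",
    "_c3": "venuestate",
    "_c4": "venueseats",
}


def read_headerless_csv(text, delimiter=","):
    """Read CSV text that has no header row, naming columns _c0, _c1, ..."""
    rows = []
    for record in csv.reader(io.StringIO(text), delimiter=delimiter):
        rows.append({f"_c{i}": value for i, value in enumerate(record)})
    return rows


def rename_columns(rows, pairs):
    """Apply rename pairs; columns without a pair keep their current name."""
    return [
        {pairs.get(name, name): value for name, value in row.items()}
        for row in rows
    ]


venues = rename_columns(read_headerless_csv(SAMPLE_CSV), RENAME_PAIRS)
print(venues[0])
```

The same mapping style applies to the events source later in the walkthrough, with its own six rename pairs.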

Choose the + icon to the right of Rename Columns node. Under Transforms, choose Filter.
Choose Add new filter condition.
For Key, choose venuestate. For Operation, choose ==. For Value, enter DC, as shown in the following screenshot.

Repeat steps 5 and 6 to add the Amazon S3 source node for table events.

- S3 URI: s3://aws-blogs-artifacts-public/artifacts/BDB-4798/data/events.csv
- Format: CSV
- Sep: ,
- Multiline: Enabled
- Header: Disabled

Leave the rest as default.

Repeat steps 7 and 8 for the Amazon S3 source node. On the Rename Columns node, choose Add new rename pair. For Current name and New name, enter the following pairs:
- _c0: eventid
- _c1: e_venueid
- _c2: catid
- _c3: dateid
- _c4: eventname
- _c5: starttime

Choose the + icon to the right of the Rename Columns node. Under Transforms, choose Join.
Drag the + icon at the right of the Filter node and drop it at the left of the Join node.
For Join type, choose Inner. For Left data source, choose e_venueid. For Right data source, choose venueid.

Choose the + icon to the right of the Join node. Under Transforms, choose SQL Query.
Enter the following query statement:

select 
  venuename,
  count(distinct eventid) as eventid_count 
from {myDataSource} 
group by venuename
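
To sanity-check the filter, join, and aggregation logic outside the visual editor, the same steps can be approximated with Python's built-in sqlite3 module. This is only a sketch under stated assumptions: the rows are made up, SQLite stands in for the Spark engine, and the CTE named joined is a hypothetical stand-in for the {myDataSource} placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE venue (venueid INTEGER, venuename TEXT, venuecity TEXT,
                    venuestate TEXT, venueseats INTEGER);
CREATE TABLE event (eventid INTEGER, e_venueid INTEGER, catid INTEGER,
                    dateid INTEGER, eventname TEXT, starttime TEXT);
""")

# Made-up rows: two events at a DC venue, one event at a venue elsewhere.
conn.executemany("INSERT INTO venue VALUES (?, ?, ?, ?, ?)", [
    (1, "Example Arena", "Washington", "DC", 20000),
    (2, "Sample Hall", "Boston", "MA", 0),
])
conn.executemany("INSERT INTO event VALUES (?, ?, ?, ?, ?, ?)", [
    (10, 1, 9, 1900, "Show A", "2008-01-01 19:00:00"),
    (11, 1, 9, 1901, "Show B", "2008-01-02 19:00:00"),
    (12, 2, 9, 1902, "Show C", "2008-01-03 19:00:00"),
])

# Filter venues to DC, inner join events on e_venueid = venueid,
# then count distinct events per venue name.
query = """
WITH joined AS (
    SELECT e.*, v.*
    FROM event AS e
    INNER JOIN (SELECT * FROM venue WHERE venuestate = 'DC') AS v
        ON e.e_venueid = v.venueid
)
SELECT venuename, COUNT(DISTINCT eventid) AS eventid_count
FROM joined
GROUP BY venuename
"""
result = conn.execute(query).fetchall()
print(result)
```

Only the DC venue survives the filter, so the aggregation returns a single row for it.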

Choose the + icon to the right of the SQL Query node. Under Data target, choose Amazon S3.
Choose the Amazon S3 target node and enter the following values:
- S3 URI: <choose s3 location from project overview page and add suffix “/output/venue_event/”> (for example, s3://<bucket-name>/dzd_bd693kieeb65yf/52d3z1nutb42w7/dev/output/venue_event/)
- Format: Parquet
- Compression: Snappy
- Mode: Overwrite
- Update catalog: True
- Database: Choose your database
- Table: venue_event_agg

At this point, you should see the complete end-to-end visual flow. Now you can publish it.
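
The finished flow is a directed acyclic graph: each node can run only after its upstream nodes have produced data. As a rough illustration, the following sketch models the nodes built in this walkthrough and derives one valid execution order with Python's standard graphlib module. The node labels are hypothetical names chosen for this example, and the sketch does not represent how visual ETL converts the flow into a Spark script.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes it depends on (its upstream inputs).
flow = {
    "s3_source_venue": set(),
    "rename_venue": {"s3_source_venue"},
    "filter_dc": {"rename_venue"},
    "s3_source_event": set(),
    "rename_event": {"s3_source_event"},
    "join": {"rename_event", "filter_dc"},
    "sql_query": {"join"},
    "s3_target": {"sql_query"},
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(flow).static_order())
print(order)
```

TopologicalSorter raises CycleError if the graph contains a cycle, which is one way to express the "acyclic" requirement in code.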

On the top right, choose Save to project to save the draft flow. You can optionally change the name and add a description. Choose Save to project, as shown in the following screenshot.

The visual ETL flow has been successfully saved.

Run flow

This section shows you how to run the visual ETL flow you authored.

On the top right, choose Run.

At the bottom of the screen, the run status is shown. The status transitions from Starting to Running, and then from Running to Finished.

Wait for the run to be Finished.

Query using Amazon Athena

The output data has been written to the target S3 bucket. This section shows you how to query the output table.

On the top left menu, under DATA ANALYSIS & INTEGRATION, choose Query Editor.

On the data explorer, under Lakehouse, choose AwsDataCatalog. Navigate to the table venue_event_agg.
From the three dots icon, choose Query with Athena.

Four records will be returned, as shown in the following screenshot. This indicates you succeeded in querying the output table written by the visual ETL flow.

Use generative AI to generate a visual ETL flow

The preceding steps built the flow manually, node by node, on the visual canvas. Alternatively, SageMaker Unified Studio can automate the flow authoring steps by using generative AI powered by Amazon Q.

On the top left menu, choose Visual ETL flows.
Choose Create visual ETL flow.
Enter the following text and choose Submit.

Create a flow to connect 2 Glue catalog tables venue and event in database glue_db, join on event id , filter on venue state with condition as venuestate=='DC' and write output to a S3 location

This creates the following boilerplate flow that you can edit to quickly author the visual ETL flow.

The generated flow keeps the context of the prompt at the node level.

Clean Up

To avoid incurring future charges, clean up the resources you created during this walkthrough:

From the SQL querybook, enter the following SQL to drop the table:

drop table venue_event_agg

To delete the flow, under Actions, choose Delete flow.

Conclusion

This post demonstrated how you can use Amazon SageMaker Unified Studio to build a low-code no-code (LCNC) visual ETL flow. This allows for seamless data ingestion and transformation across multiple data sources.

To learn more, refer to our documentation and the AWS News Blog.

About the Authors

Praveen Kumar is an Analytics Solutions Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-based services. His areas of interest are serverless technology, data governance, and data-driven AI applications.

Noritaka Sekiyama is a Principal Big Data Architect with AWS Analytics services. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Alexandra Tello is a Senior Front End Engineer with the AWS Analytics services in New York City. She is a passionate advocate for usability and accessibility. In her free time, she’s an espresso enthusiast and enjoys building mechanical keyboards.

Ranu Shah is a Software Development Manager with AWS Analytics services. She loves building data analytics features for customers. Outside work, she enjoys reading books or listening to music.

Gal Heyne is a Technical Product Manager for AWS Analytics services with a strong focus on AI/ML and data engineering. She is passionate about developing a deep understanding of customers’ business needs and collaborating with engineers to design simple-to-use data products.

Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

2024-12-04 Shovan Kanjilal

Post Syndicated from Shovan Kanjilal original https://aws.amazon.com/blogs/big-data/simplify-data-integration-with-aws-glue-and-zero-etl-to-amazon-sagemaker-lakehouse/

With the growing emphasis on data, organizations are constantly seeking more efficient and agile ways to integrate their data, especially from a wide variety of applications. Traditional extract, transform, and load (ETL) processes have long been a staple of data integration because of their flexibility, but for common use cases such as replication and ingestion, they often prove time-consuming, complex, and less adaptable to the fast-changing demands of modern data architectures.

In addition, as organizations rely on an increasingly diverse array of digital systems, data fragmentation has become a significant challenge. Valuable information is often scattered across multiple repositories, including databases, applications, and other platforms. To harness the full potential of their data, businesses must enable seamless access and consolidation from these varied sources. However, this task is complicated by the unique characteristics of modern systems, such as differing API protocols, implementations, and rate limits. To address these challenges and accelerate innovation, AWS Glue has recently expanded its third-party application support by introducing native connectors for 19 applications.

To utilize these new application connectors for well-defined use cases such as replication and ingestion, AWS Glue is also launching zero-ETL integration support from external applications. With this new functionality, customers can create up-to-date replicas of their data from applications such as Salesforce, ServiceNow, and Zendesk in an Amazon SageMaker Lakehouse and Amazon Redshift.

Amazon SageMaker Lakehouse unifies all your data across Amazon S3 data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and AI/ML applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data in-place with all Apache Iceberg compatible tools and engines. By directly integrating with Lakehouse, all the data is automatically cataloged and can be secured through fine-grained permissions in Lake Formation.

What is zero-ETL?

Zero-ETL is a set of fully managed integrations by AWS that minimizes the need to build ETL data pipelines. It makes data available in Amazon SageMaker Lakehouse and Amazon Redshift from multiple operational, transactional, and enterprise sources. Extract, transform, and load (ETL) is the process of combining, cleaning, and normalizing data from different sources to prepare it for analytics, artificial intelligence (AI), and machine learning (ML) workloads. You don’t need to maintain complex ETL pipelines. We take care of the ETL for you by automating the creation and management of data replication.

What’s the difference between zero-ETL and Glue ETL?

AWS Glue now offers multiple ways for you to build data integration pipelines, depending on your integration needs.

Zero-ETL provides service-managed replication. It’s designed for scenarios where customers need a fully managed, efficient way to replicate data from one source to AWS with minimal configuration. Zero-ETL handles the entire replication process, including schema discovery and evolution, without requiring customers to write or manage any custom logic. This approach is ideal for creating up-to-date replicas of source data in near-real-time, with AWS managing the underlying infrastructure and replication process.
Glue ETL offers customer-managed data ingestion. It’s the preferred choice when customers need more control and customization over the data integration process or require complex transformations. With Glue ETL, customers can write custom transformation logic, combine data from multiple sources, apply data quality rules, add calculated fields, and perform advanced data cleansing or aggregation. This flexibility makes Glue ETL suitable for scenarios where data must be transformed or enriched before analysis.

It’s worth mentioning that source connections are reusable between Glue ETL and zero-ETL, so you can easily support both patterns. After you create a connection once, you can use the same connection across various AWS Glue components, including Glue ETL, Glue Visual ETL, and zero-ETL. For example, you might start by creating a connection and a zero-ETL integration, but decide later to use the same connection to create a custom Glue ETL pipeline.

This post explores how zero-ETL capabilities, combined with the new AWS Glue application connectors, are transforming the way businesses integrate and analyze their data from popular platforms such as ServiceNow, Salesforce, Zendesk, SAP, and others.

Use case

Consider a large company that relies heavily on data-driven insights to optimize its customer support processes. The company stores vast amounts of transactional data in ServiceNow. To gain a comprehensive understanding of their business and make informed decisions, the company needs to integrate and analyze data from ServiceNow seamlessly, identifying and addressing problems and root causes, managing service level agreements and compliance, and proactively planning for incident prevention.

The company is looking for an efficient, scalable, and cost-effective solution for collecting and ingesting data from ServiceNow. The solution must provide continuous near real-time replication, automated availability of new data attributes, robust monitoring capabilities to track data load statistics, and a reliable data lake foundation that supports data versioning. This allows data analysts, data engineers, and data scientists to quickly explore ingested data and develop data products that meet the needs of business teams.

Solution overview

The following architecture diagram illustrates an efficient and scalable solution for collecting and ingesting replicated data from ServiceNow with zero-ETL integration. In this example we use ServiceNow as a source, but this can be done with any supported source such as Salesforce, Zendesk, SAP, or others. The AWS Glue managed connectors act as a bridge between ServiceNow and the target Amazon SageMaker Lakehouse, enabling seamless, near real-time data flow without the need for custom ETL and scheduling.

The following are the key components and steps in the integration process:

Zero-ETL extracts and loads the data into Amazon S3, a highly scalable object storage service. The data is also registered in the Glue Data Catalog, a metadata repository. Additionally, it keeps the information synchronized by capturing changes that occur in ServiceNow and maintains data consistency by automatically performing schema evolution.
Amazon CloudWatch, a monitoring and observability service, collects logs and metrics from the data integration process.
Amazon EventBridge, a serverless event bus service, triggers a downstream process that allows you to build event-driven architecture as soon as your new data arrives in your target. Through EventBridge, customers can build on top of zero-ETL for a diverse set of use cases such as:
- Trigger Glue ETL to perform transformations and aggregations on the data to create specific analysis.
- Trigger a Directed Acyclic Graph (DAG) in Amazon Managed Workflows for Apache Airflow (Amazon MWAA).
- Trigger a state machine in AWS Step Functions.
- Notify the status of replications and their details to downstream applications.
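
The event-driven pattern described above can be pictured as a small dispatcher that inspects an incoming event and picks a downstream action. The sketch below is purely illustrative: although EventBridge events do carry a detail object, the status and table fields inside it, the status values, and all handler names are hypothetical and are not taken from the zero-ETL or EventBridge documentation.

```python
# Hypothetical handlers standing in for real downstream actions.
def start_transformation(event):
    """Placeholder for starting a downstream job, such as a Glue ETL run."""
    return f"transform:{event['detail']['table']}"


def notify_failure(event):
    """Placeholder for notifying a downstream application."""
    return f"notify:{event['detail']['table']}"


def ignore(event):
    """Fallback for events this sketch does not act on."""
    return "ignored"


# Assumed status values; the real event payload may differ.
HANDLERS = {"SUCCEEDED": start_transformation, "FAILED": notify_failure}


def route_event(event):
    """Choose a handler based on the (hypothetical) detail.status field."""
    status = event.get("detail", {}).get("status")
    return HANDLERS.get(status, ignore)(event)


sample_event = {"detail": {"status": "SUCCEEDED", "table": "incident"}}
outcome = route_event(sample_event)
print(outcome)
```

In practice, the routing would more likely live in an EventBridge rule's event pattern than in application code; the function form is used here only to keep the example self-contained.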

Prerequisites

Complete the following prerequisites before setting up the solution:

Create a bucket in Amazon S3 called zero-etl-demo-<your AWS Account Number>-<AWS Region> (for example, zero-etl-demo-012345678901-us-east-1). The bucket will be used to store the data ingested by zero-ETL in Apache Iceberg, which is an open table format (OTF) supporting ACID transactions (atomicity, consistency, isolation, and durability), seamless schema evolution, and data versioning using time travel.
Create an AWS Glue database <your database name>, such as zero_etl_demo_db and associate the S3 bucket zero-etl-demo-<your AWS Account Number>-<AWS Region> as a location of the database. The database will be used to store the metadata related to the data integrations performed by zero-ETL.
Update AWS Glue Data Catalog settings using the following IAM policy for fine-grained access control of the data catalog for zero-ETL.
Create an AWS Identity and Access Management (IAM) role named zero_etl_demo_role. The IAM role will be used by zero-ETL to access the Glue connector to read from ServiceNow and write the data into the target. Optionally, you can create two separate IAM roles (one associated with your source data and another associated with your target).
Make sure you have a ServiceNow instance named ServiceNowInstance, a user named ServiceNowUser, and a password ServiceNowPassword with the required permissions to read from ServiceNow. The instance name, user, and password are used in the AWS Glue connection to authenticate within ServiceNow using the BASIC authentication type. Optionally, you can choose OAUTH2 if your ServiceNow supports it.
Create the secret zero_etl_demo_secret in AWS Secrets Manager to store ServiceNow credentials.
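
The bucket naming convention in the first prerequisite is easy to get wrong by hand, so a small helper can build and check the name. This is a minimal sketch: the function name zero_etl_bucket_name is hypothetical, and the validation covers only the basic length and character rules for general-purpose S3 bucket names, not every naming restriction.

```python
import re


def zero_etl_bucket_name(account_id: str, region: str) -> str:
    """Build zero-etl-demo-<account>-<region> and run basic sanity checks."""
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("AWS account IDs are 12 digits")
    name = f"zero-etl-demo-{account_id}-{region}"
    # Basic rule only: 3-63 characters drawn from lowercase letters, digits,
    # dots, and hyphens, starting and ending with a letter or digit.
    if not re.fullmatch(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]", name):
        raise ValueError(f"not a valid S3 bucket name: {name}")
    return name


bucket = zero_etl_bucket_name("012345678901", "us-east-1")
print(bucket)
```

Passing the account ID as a string preserves leading zeros, which an integer would silently drop.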

Build and verify the zero-ETL integration

Complete the following steps to create and validate zero-ETL integration:

Step 1: Set up a connector

Zero-ETL integration, when used with AWS Glue natively supported applications connectors, provides a straightforward way to bring third-party data into an Amazon S3 transactional data lake or Amazon Redshift. Use the following steps to create a ServiceNow data connection:

Open the AWS Glue console.
In the navigation pane, under Data catalog, choose Connections.
Choose Create Connection.
In the Create Connection pane, enter ServiceNow in Data Sources.
Choose ServiceNow.
Choose Next.
For Instance Name, enter ServiceNowInstance (created as part of the prerequisites).
For IAM service role, choose the zero_etl_demo_role (created as part of the prerequisites).
For Authentication Type, choose the authentication type that you’re using for ServiceNow. In this example, we have chosen OAUTH2, which requires setting up Application Registries in ServiceNow.
For AWS Secret, choose the secret zero_etl_demo_secret (created as part of the prerequisites).
Choose Next.
In the Connection Properties section, for Name, enter zero_etl_demo_conn.
Choose Next.
Choose Create connection.
There will be a popup from ServiceNow after you choose Create connection. Choose Allow.

Step 2: Set up Zero-ETL integration

After creating the data connection to ServiceNow, use the following steps to create the zero-ETL integration:

Open the AWS Glue console.
In the navigation pane, under Data catalog, choose Zero-ETL integrations.
Choose Create zero-ETL integration.
In the Create integration pane, enter ServiceNow in Data Sources.
Choose ServiceNow.
Choose Next.
For ServiceNow connection, choose the data connection created in Step 1 (zero_etl_demo_conn).
For Source IAM role, choose the zero_etl_demo_role (from the prerequisites).
For ServiceNow objects, choose the objects you want to perform the ingestion managed by zero-ETL integration. For this post, choose problem and incident objects.
For Namespace or Database, choose <your database name>. In this example, we use the zero_etl_demo_db (from the prerequisites).
For Target IAM role, choose the zero_etl_demo_role (from the prerequisites).
Choose Next.
For Security and data encryption, you can choose either AWS Managed KMS Key or choose a customer KMS key managed by AWS Key Management Service. For this post, choose Use AWS managed KMS key.
In the Integration details section, for Name, enter zero-etl-demo-integration.
Choose Next.
Review the details and choose Create and launch integration.
The newly created integration will show as Active in about a minute.

Step 3: Verify the initial SEED load

The SEED load refers to the initial loading of the tables that you want to ingest into an Amazon SageMaker Lakehouse using zero-ETL integration. The status and statistics of the SEED load are published to CloudWatch, and the data ingested by zero-ETL integration can be accessed in AWS using a set of services such as Amazon SageMaker Unified Studio, Amazon QuickSight, and others. Use the following steps to access zero-ETL integration logs and query the data:

Open the AWS Glue console.
In the navigation pane, choose Zero-ETL integrations.
In the Zero-ETL integrations section, choose zero-etl-demo-integration.
In the Activity summary (all time) section, choose CloudWatch logs.
Check CloudWatch log events for the SEED Load. For each table ingested by the zero-ETL integration, two groups of logs are created: status and statistics. Highlighted in the following screenshot in IngestionTableStatistics are the statistics. The insertCount represents how many rows were extracted and loaded by zero-ETL integration. For the SEED load, you will always see only insertCount because it’s the initial load. In addition, in IngestionCompleted you will find information about the Zero-ETL integration such as status, load type, and message.

To validate the SEED load, query the data using Amazon SageMaker Unified Studio.

Access Amazon SageMaker Unified Studio for your specific domain through the AWS Management Console.
Open the Amazon SageMaker Unified Studio URL.
Sign in with SSO or AWS IAM user.
Select your project.
Go to Data from the left menu, expand the Lakehouse AWSDataCatalog, expand your database, and select the incident table. Click the ⋮ icon and select Query with Athena.

For Query, enter the following statement:

SELECT count(*) AS incidents_count
FROM "zero_etl_demo_db"."incident"

Choose Run.
Let’s check an existing incident in ServiceNow. This is the incident whose description you will update in ServiceNow to validate change data capture (CDC). In the query editor pane, for Query, enter the following statement:
```
SELECT number
, short_description
, description
FROM "zero_etl_demo_db"."incident"
WHERE number = 'INC0000003' -- update to your Incident number
```
Choose Run.

Step 4: Validate CDC

CDC is a technique used to identify and process only the data that has changed in a source system since the last extraction. Instead of reloading an entire dataset, CDC captures and transfers only the new, updated, or deleted records into the target system, making data processing more efficient and reducing load times. The status and statistics of the CDC load are published to CloudWatch. For this post, you will use Amazon SageMaker Unified Studio to query the ingested data. Use the following steps to access zero-ETL integration logs and query the data ingested. For the next step in this example, you will select an incident and perform an update in ServiceNow, changing the short_description and description of the incident.
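
To make the idea concrete, the following sketch applies a batch of change records to an in-memory copy of a table and tallies the same kinds of counters that the post refers to (insertCount, updateCount, deleteCount). It is a conceptual model only: the change record format with op and row fields and the function name apply_changes are assumptions for this example, not the format zero-ETL uses internally.

```python
def apply_changes(target, changes, key="number"):
    """Apply change records to `target`, a dict of rows indexed by `key`.

    A "delete" removes the row; any other op is treated as an upsert
    (update when the key already exists, insert otherwise).
    """
    counts = {"insertCount": 0, "updateCount": 0, "deleteCount": 0}
    for change in changes:
        row = change["row"]
        row_key = row[key]
        if change["op"] == "delete":
            if target.pop(row_key, None) is not None:
                counts["deleteCount"] += 1
        elif row_key in target:
            # Merge changed attributes into the existing row.
            target[row_key] = {**target[row_key], **row}
            counts["updateCount"] += 1
        else:
            target[row_key] = dict(row)
            counts["insertCount"] += 1
    return counts


# Made-up rows mirroring the scenario: one incident edited, one deleted.
incidents = {
    "INC0000003": {"number": "INC0000003", "short_description": "old text"},
    "INC0000004": {"number": "INC0000004", "short_description": "to be removed"},
}
changes = [
    {"op": "update", "row": {"number": "INC0000003", "short_description": "new text"}},
    {"op": "delete", "row": {"number": "INC0000004"}},
]
stats = apply_changes(incidents, changes)
print(stats)
```

Only the two changed rows are touched; every other row in the target would be left as it was, which is the efficiency gain over a full reload.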

To demonstrate a CDC event, this post edits one incident and deletes one incident in ServiceNow.
Open the AWS Glue console.
In the navigation pane, under Data catalog, choose Zero-ETL integrations.
In the Zero-ETL integrations section, choose zero-etl-demo-integration.
In the Activity summary (all time) section, choose CloudWatch logs.
Zero-ETL integration replicates the changes to the Amazon S3 transactional data lake every 60 minutes by default. Check CloudWatch log events for the CDC load. Shown in the following figure in IngestionTableStatistics, review updateCount and deleteCount for each specific object managed by zero-ETL integration. It’s applying the updates and deletes that happened in ServiceNow to the transactional data lake.

To validate the CDC load, query the data using Amazon SageMaker Unified Studio.

You can go back to Amazon SageMaker Unified Studio.

For Query, enter the following statement:

SELECT count(*) AS incidents_count
FROM "zero_etl_demo_db"."incident"

For Query, enter the following statement to retrieve the updated incident and compare it with the results you recorded before CDC:

SELECT number
    , short_description
    , description
FROM "zero_etl_demo_db"."incident"
WHERE number = 'INC0000003' -- update to your Incident number

Choose Run and confirm that one record was updated in short_description and description attributes.

By following these steps, you can effectively set up, build, and verify a zero-ETL job using the new AWS Glue application connector for ServiceNow. This process demonstrates the simplicity and efficiency of the zero-ETL approach in integrating application data into your AWS environment.

Apache Iceberg Time Travel: Enhancing data versioning in zero-ETL

One of the benefits of using Apache Iceberg in zero-ETL integration is the ability to perform time travel. This feature allows you to access and query historical versions of your data. With Iceberg time travel, you can roll back to previous data states, compare data across different points in time, or recover from accidental data changes. In the context of zero-ETL integrations, this capability becomes particularly valuable when dealing with rapidly changing application data.

To demonstrate this feature, let’s consider a scenario where you’re analyzing ServiceNow incident data ingested through zero-ETL integration using Amazon SageMaker Unified Studio. Here’s an example query that showcases Iceberg time travel:

-- Query incident data as of a particular timestamp before CDC
SELECT number,
    short_description,
    description
FROM "zero_etl_demo_db"."incident"
FOR TIMESTAMP AS OF TIMESTAMP '2024-11-06 05:10:00 UTC'
-- update this timestamp value to before your CDC update
WHERE number = 'INC0000003'; -- update to your Incident number

-- Compare with current data (run as a separate query)
SELECT number,
    short_description,
    description
FROM "zero_etl_demo_db"."incident"
WHERE number = 'INC0000003'; -- update to your Incident number

In this example:

The first query uses the FOR TIMESTAMP AS OF clause for time travel queries on Iceberg tables. It retrieves incident data as it existed before CDC update for the specific incident number INC0000003.
The second query fetches the current state of the data for the same incident number.

This capability allows you to track the evolution of incidents, identify trends in resolution times, or recover information that may have been inadvertently altered.
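
Conceptually, a time travel query resolves a timestamp to the most recent table snapshot committed at or before that time. The sketch below is a toy model of that lookup, not Iceberg's actual implementation: it stores a full copy of the rows per snapshot for simplicity, and the class and method names are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timezone


class VersionedTable:
    """Toy model of snapshot-based versioning for illustration only."""

    def __init__(self):
        self._timestamps = []  # commit times, in ascending order
        self._snapshots = []   # table state captured at each commit

    def commit(self, committed_at, rows):
        if self._timestamps and committed_at <= self._timestamps[-1]:
            raise ValueError("commits must be in increasing time order")
        self._timestamps.append(committed_at)
        self._snapshots.append(dict(rows))

    def as_of(self, timestamp):
        """Return the latest snapshot committed at or before `timestamp`."""
        index = bisect_right(self._timestamps, timestamp)
        if index == 0:
            raise LookupError("no snapshot exists at or before that time")
        return self._snapshots[index - 1]

    def current(self):
        return self._snapshots[-1]


table = VersionedTable()
table.commit(datetime(2024, 11, 6, 5, 0, tzinfo=timezone.utc),
             {"INC0000003": "original description"})
table.commit(datetime(2024, 11, 6, 6, 0, tzinfo=timezone.utc),
             {"INC0000003": "updated description"})

before_cdc = table.as_of(datetime(2024, 11, 6, 5, 10, tzinfo=timezone.utc))
print(before_cdc, table.current())
```

The as_of call plays the role of the FOR TIMESTAMP AS OF clause, and current corresponds to the plain query without it.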

Clean up

To avoid incurring future charges, clean up the resources used in this post from your AWS account by completing the following steps:

Delete zero-ETL integration zero-etl-demo-integration.
Delete content from the S3 bucket zero-etl-demo-<your AWS Account Number>-<AWS Region>.
Delete the Data Catalog database zero_etl_demo_db.
Delete the Data Catalog connection zero_etl_demo_conn.
Delete the AWS Secrets Manager secret zero_etl_demo_secret.

Conclusion

As the pace of business continues to accelerate, the ability to quickly and efficiently integrate data from various applications and enterprise platforms has become a critical competitive advantage. By adopting a zero-ETL integration powered by AWS Glue and its new set of managed connectors, your organization can unlock the full potential of its data across multiple platforms faster and stay ahead of the curve.

To learn more about how Amazon SageMaker Lakehouse can help your organization streamline its data integration efforts, visit Amazon SageMaker Lakehouse.

Get started with zero-ETL on AWS by creating a free account today!

About the authors

Shovan Kanjilal is a Senior Analytics and Machine Learning Architect with Amazon Web Services. He is passionate about helping customers build scalable, secure and high-performance data solutions in the cloud.

Vivek Pinyani is a Data Architect at AWS Professional Services with expertise in Big Data technologies. He focuses on helping customers build robust and performant Data Analytics solutions and Data Lake migrations. In his free time, he loves to spend time with his family and enjoys playing cricket and running.

Kartikay Khator is a Solutions Architect within Global Life Sciences at AWS, where he dedicates his efforts to developing innovative and scalable solutions that cater to the evolving needs of customers. His expertise lies in harnessing the capabilities of AWS analytics services. Extending beyond his professional pursuits, he finds joy and fulfillment in the world of running and hiking. Having already completed multiple marathons, he is currently preparing for his next marathon challenge.

Caio Sgaraboto Montovani is a Sr. Specialist Solutions Architect, Data Lake and AI/ML within AWS Professional Services, developing scalable solutions according to customer needs. His vast experience has helped customers in different industries such as life sciences and healthcare, retail, banking, and aviation build solutions in data analytics, machine learning, and generative AI. He is passionate about rock and roll and cooking and loves to spend time with his family.

Kamen Sharlandjiev is a Sr. Big Data and ETL Solutions Architect, Amazon MWAA and AWS Glue ETL expert. He’s on a mission to make life easier for customers who are facing complex data integration and orchestration challenges. His secret weapon? Fully managed AWS services that can get the job done with minimal effort. Follow Kamen on LinkedIn to keep up to date with the latest Amazon MWAA and AWS Glue features and news!

Catalog and govern Amazon Athena federated queries with Amazon SageMaker Lakehouse

2024-12-04 Sandeep Adwankar

Post Syndicated from Sandeep Adwankar original https://aws.amazon.com/blogs/big-data/catalog-and-govern-amazon-athena-federated-queries-with-amazon-sagemaker-lakehouse/

Yesterday, we announced Amazon SageMaker Unified Studio (Preview), an integrated experience for all your data and AI, and Amazon SageMaker Lakehouse, which unifies data from Amazon Simple Storage Service (Amazon S3) to third-party sources such as Snowflake. We’re excited by how SageMaker Lakehouse helps break down data silos, but we also know customers don’t want to compromise on data governance or introduce security and compliance risks as they expand data access.

With this new capability, data analysts can now securely access and query data stored outside S3 data lakes, including Amazon Redshift data warehouses and Amazon DynamoDB databases, all through a single, unified experience. Administrators can now apply access controls at different levels of granularity to ensure sensitive data remains protected while expanding data access. This allows organizations to accelerate data initiatives while maintaining security and compliance, leading to faster, data-driven decision-making.

In this post, we show how to connect to, govern, and run federated queries on data stored in Redshift, DynamoDB (Preview), and Snowflake (Preview). To query our data, we use Athena, which is seamlessly integrated with SageMaker Unified Studio. We use SageMaker Lakehouse to present data to end-users as federated catalogs, a new type of catalog object. Finally, we demonstrate how to use column-level security permissions in AWS Lake Formation to give analysts access to the data they need while restricting access to sensitive information.

Background

As data volumes grow, organizations often employ specialized storage systems to achieve optimal performance and cost-efficiency with different use cases. However, this approach can result in data silos, and makes it challenging to gain insights from data for several reasons. First, end-users often have to set up connections to data sources on their own. This is challenging because of configuration details that vary by source and technical connectivity properties they may not have access to. Second, data sources often have their own built-in access controls, which fragments data governance. Lastly, copying data from one storage system to another for the purposes of analysis adds cost and creates duplication risks.

SageMaker Lakehouse streamlines connecting to, cataloging, and managing permissions on data from multiple sources. It integrates with SageMaker Unified Studio, Athena, and other popular tools to give flexibility to end-users to work with data from their preferred tools.

As you create connections to data, SageMaker Lakehouse creates the underlying catalogs, databases, and tables, and integrates these resources with Lake Formation. Administrators can then define and centrally manage fine-grained access controls on these resources, without having to learn different access management concepts for each data source.

With the right access permissions in place, data discovery and analytics workflows are streamlined. Data analysts no longer need to connect to data sources on their own, saving time and frustration from setting up connectors with configurations that vary by source. Instead, analysts can simply run SQL queries on federated data catalogs, seamlessly accessing diverse data for various needs, which accelerates insights and enhances productivity.

Solution overview

This post presents a solution where a company is using multiple data sources containing customer data. Analysts want to query this data for analytics and AI and machine learning (ML) workloads. However, regulations require personally identifiable information (PII) data to be secured. The following diagram illustrates the solution architecture.

In our use case, an administrator is responsible for data governance and has administrator-level access to the data sources, including Redshift, DynamoDB, and Snowflake. Existing regulations require administrators to safeguard sensitive PII data, such as customer mobile phone numbers, which is stored in multiple places. At the same time, business stakeholders in data analyst job functions need access to these databases because they contain valuable data for gaining insight into business health.

We use an administrator account to create connections to Redshift, DynamoDB, and Snowflake, register these as catalogs in SageMaker Lakehouse, and then set up fine-grained access controls using Lake Formation. When setup is complete, we use a data analyst account to query the data with Athena and confirm that the analyst role cannot access data it is not entitled to.

Prerequisites

Make sure you have the following prerequisites:

An AWS account with permission to create IAM roles and IAM policies
An AWS Identity and Access Management (IAM) user with an access key and secret key to configure the AWS Command Line Interface (AWS CLI)
Administrator access to SageMaker Lakehouse and the following roles:
- Administrator role
- Data analyst role
A SageMaker Unified Studio domain and two projects using the SQL Analytics profile. To learn more, refer to the Amazon SageMaker Unified Studio Administrator Guide.
- An Admin project will be used to create connections
- A Data Analyst project will be used to analyze data and will include both administrator and analysts as members. Take note of the IAM role in the Data Analyst project from the Project Overview page. This IAM role will be referenced when granting access later on.
Administrator access to one or more of the following data sources, set up as shown in Appendix A and Appendix B:
- Redshift
- DynamoDB
- Snowflake

Set up federated catalogs

The first step is to set up federated catalogs for our data sources using an administrator account. The section below walks you through the end-to-end process with DynamoDB and demonstrates how to query the data when setup is complete. When you are done setting up and exploring the DynamoDB data, repeat these steps for Redshift and Snowflake.

On the SageMaker Unified Studio console, open your project.
Choose Data in the navigation pane.
In the data explorer, choose the plus icon to add a data source.
Under Add a data source, choose Add connection, then choose Amazon DynamoDB.
Enter your connection details, and choose Add data source.

Next, SageMaker Unified Studio connects to your data source, registers the data source as a federated catalog with SageMaker Lakehouse, and displays it in your data explorer.

To explore and query your data, click any SageMaker Lakehouse catalog to view its contents. Use the data explorer to drill down to a table and use the Actions menu to select Query with Athena.

This brings you to the query editor where your sample query is executed. Here, try different SQL statements to better understand your data and to gain familiarity with query development features in SageMaker Unified Studio. To learn more, see SQL analytics in the Amazon SageMaker Unified Studio User Guide.
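If you prefer to script the same kind of sample query outside the query editor, the following is a minimal sketch in Python. It assumes the boto3 Athena client and a workgroup that already has a query result location configured. The catalog name `dynamodb_catalog`, the workgroup argument, and the helper function names are hypothetical placeholders; `default` and `customer_ddb` follow the names used elsewhere in this post.

```python
def quote_ident(name: str) -> str:
    """Quote an identifier for Athena SQL, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_sample_query(catalog: str, database: str, table: str, limit: int = 10) -> str:
    """Build a fully qualified SELECT against a table in a federated catalog."""
    qualified = ".".join(quote_ident(part) for part in (catalog, database, table))
    return f"SELECT * FROM {qualified} LIMIT {int(limit)}"


def run_query(sql: str, workgroup: str) -> str:
    """Submit the query to Athena and return the query execution ID.

    Assumes the workgroup defines where query results are written.
    """
    import boto3  # imported lazily so the helpers above work without AWS access

    athena = boto3.client("athena")
    response = athena.start_query_execution(QueryString=sql, WorkGroup=workgroup)
    return response["QueryExecutionId"]


# Hypothetical catalog name: replace it with the name shown in your data explorer.
sql = build_sample_query("dynamodb_catalog", "default", "customer_ddb")
print(sql)
```

Building the statement separately from submitting it lets you review the generated SQL before it runs against a live data source.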

Similarly, you can set up data source connections for Redshift and Snowflake and query the data. Refer to Appendix B, which contains screenshots capturing the details needed to create the connection and data catalog for Redshift and Snowflake sources.

Set up fine-grained access permissions on federated catalogs

Our next step is to set up access permissions on our federated catalogs. As mentioned in the prerequisites, you have already set up an IAM role with data analyst permissions and a SageMaker Unified Studio data analyst project. We grant permissions to the data analyst role and to the data analyst project role so that the access controls you specify are enforced when the data is queried. The following steps show how to set up permissions on a Redshift federated catalog, but the steps are the same for each data source.

Navigate to Lake Formation in the AWS Management Console as an administrator.
In the Lake Formation console, under Data Catalog in the navigation pane, choose Catalogs. Here, you will see the federated catalogs that were set up previously in SageMaker Unified Studio.
Choose the federated catalog that you wish to set up permissions for. Here, you can see details for the catalog and any associated databases and tables, and manage permissions.
From the Actions menu, choose Grant to grant permissions to the data analyst role and the SageMaker Unified Studio data analyst project role.
In Catalogs, choose the federated catalog name for the source you wish to grant permissions on.
In Databases, choose your Redshift schema, Snowflake schema, or default for DynamoDB.
In Database permissions, select Describe.
Choose Grant.
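The console steps above can also be expressed programmatically. The following is a minimal sketch in Python that builds the request parameters for the Lake Formation GrantPermissions API (`grant_permissions` in boto3) with Describe permission on a database, one request per principal. The role ARNs, the account ID, the database name `public`, and the helper function names are hypothetical, and the catalog identifier format for a federated catalog is an assumption: copy the identifier shown in the Lake Formation console for your catalog.

```python
def build_database_describe_grants(principal_arns, catalog_id, database):
    """Build one GrantPermissions request (DESCRIBE on a database) per principal."""
    return [
        {
            "Principal": {"DataLakePrincipalIdentifier": arn},
            "Resource": {"Database": {"CatalogId": catalog_id, "Name": database}},
            "Permissions": ["DESCRIBE"],
        }
        for arn in principal_arns
    ]


def apply_grants(grants):
    """Send each request to Lake Formation (requires administrator credentials)."""
    import boto3  # imported lazily so the builder above works without AWS access

    lakeformation = boto3.client("lakeformation")
    for grant in grants:
        lakeformation.grant_permissions(**grant)


# Hypothetical principals: the data analyst role and the data analyst project role.
principals = [
    "arn:aws:iam::111122223333:role/DataAnalystRole",
    "arn:aws:iam::111122223333:role/DataAnalystProjectRole",
]
# Assumed identifier format for a federated catalog; verify it in the console.
grants = build_database_describe_grants(principals, "111122223333:redshift_catalog", "public")
for grant in grants:
    print(grant["Principal"]["DataLakePrincipalIdentifier"], grant["Permissions"])
```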

The next step is to grant permissions on the tables to the data analyst role and the SageMaker Unified Studio data analyst project role. For this solution, assume you wish to restrict access to a sensitive column containing the mobile phone number for each customer.

In the Actions menu, choose Grant.
In Catalogs, choose your federated catalog.
In Databases, choose your Redshift schema, Snowflake schema, or default for DynamoDB.
In Tables, choose your tables.
In Table permissions, choose Select.
In Data permissions, choose Column-based access.
In Choose permission filter, choose Include columns.
In Select columns, choose one or more columns.
Choose Grant.
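As an illustration of the Include columns filter, the following Python sketch derives the list of columns to include by leaving out the sensitive ones, then builds the parameters for a Lake Formation GrantPermissions request with Select permission on those columns only. The table and column names come from the sample data in Appendix A. The role ARN, the account ID, the schema name `public`, the catalog identifier format, and the helper function names are assumptions made for this example.

```python
def included_columns(all_columns, sensitive_columns):
    """Return the columns to include, preserving order and dropping sensitive ones."""
    hidden = set(sensitive_columns)
    return [column for column in all_columns if column not in hidden]


def build_column_grant(principal_arn, catalog_id, database, table, columns):
    """Build a GrantPermissions request for SELECT on specific columns of a table."""
    if not columns:
        raise ValueError("at least one column must be included")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "CatalogId": catalog_id,
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


# Columns of the sample Redshift table from Appendix A; "mobile" is the sensitive one.
columns = included_columns(["cust_id", "mobile", "zipcode"], ["mobile"])
grant = build_column_grant(
    "arn:aws:iam::111122223333:role/DataAnalystRole",  # hypothetical role ARN
    "111122223333:redshift_catalog",  # assumed identifier format; verify in the console
    "public",  # hypothetical Redshift schema name
    "customer_rs",
    columns,
)
print(grant["Resource"]["TableWithColumns"]["ColumnNames"])
# To apply: boto3.client("lakeformation").grant_permissions(**grant)
```

Computing the include list from the full column list keeps the sensitive column out by construction, which is easier to review than a hand-typed list.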

You have successfully set up fine-grained access permissions on your Redshift federated catalog. Repeat these steps to add permissions on your DynamoDB and Snowflake federated catalogs.

Validate fine-grained access permissions on federated catalogs

Now that you have set up federated catalogs with fine-grained access permissions, it’s time to run queries to confirm access permissions are working as expected.

First, access SageMaker Unified Studio using the data analyst role and navigate to your project. Select Query Editor from the Build menu, and choose the DynamoDB catalog in the data explorer. Next, drill down to a table and choose Query with Athena to run a sample query. The permissions are working as expected if the query result does not include the mobile phone number column that was visible before.

Next, query the Redshift data source and note how the mobile phone number is not included in the query result.

Lastly, query the Snowflake data source and, like the previous examples, note how the result does not include the mobile phone number column.
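The same check can be automated. The following Python sketch inspects the column metadata of an Athena GetQueryResults response and reports any restricted column that is still visible. The sample response is a hand-written fragment shaped like that API's output, not real query output, and the helper function names are hypothetical.

```python
def result_column_names(query_results: dict) -> list:
    """Extract column names from an Athena GetQueryResults response."""
    column_info = query_results["ResultSet"]["ResultSetMetadata"]["ColumnInfo"]
    return [column["Name"] for column in column_info]


def leaked_columns(query_results: dict, restricted: list) -> list:
    """Return the restricted columns that appear in the query result, if any."""
    visible = {name.lower() for name in result_column_names(query_results)}
    return sorted(column for column in restricted if column.lower() in visible)


# Hand-written fragment shaped like a GetQueryResults response (illustrative only).
sample_response = {
    "ResultSet": {
        "ResultSetMetadata": {
            "ColumnInfo": [
                {"Name": "cust_id", "Type": "integer"},
                {"Name": "zipcode", "Type": "integer"},
            ]
        }
    }
}

leaks = leaked_columns(sample_response, ["mobile"])
print("restricted columns visible:", leaks)
```

An empty list means the column-level filter is being enforced for the role that ran the query.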

In this example, we demonstrated how to set up a basic column-level filter to restrict access to sensitive data. However, SageMaker Lakehouse supports a broad range of fine-grained access control scenarios beyond column filters that allow you to meet complex security and compliance requirements across diverse data sources. To learn more, see Managing Permissions.

Clean up

Make sure you remove the SageMaker Lakehouse resources to mitigate any unexpected costs. Start by deleting the connections, catalogs, underlying data sources, projects, and domain that you created for this blog. For additional details, refer to the Amazon SageMaker Unified Studio Administrator Guide.

Conclusion

In this blog post, we utilized fine-grained access controls with federated queries in Athena. We demonstrated how this feature allows flexibility in choosing the right data storage solutions for your needs while securely expanding access to data. We showed how to create federated catalogs and set up access policies with Lake Formation, and then queried data with Athena where we saw permissions enforced on different sources. This approach unified data access controls and streamlined data discovery, saving end-users valuable time. To learn more about federated queries in Athena and the data sources that support fine-grained access controls today, see Register your connection as a Glue Data Catalog in the Athena User Guide.

We encourage you to try fine-grained access controls on federated queries today in SageMaker Unified Studio, and to share your feedback with us. To learn more, see Getting started in the Amazon SageMaker Unified Studio User Guide.

Appendix A: Set up data sources

In this section, we provide the steps to set up your data sources.

Redshift

You can create a new table customer_rs in your current database with columns cust_id, mobile, and zipcode and populate it with sample data using the following SQL command:

CREATE TABLE "customer_rs" AS
SELECT 6 AS "cust_id", 66666666 AS "mobile", 6000 AS "zipcode"
UNION ALL SELECT 7, 77777777, 7000
UNION ALL SELECT 8, 88888888, 8000
UNION ALL SELECT 9, 99999999, 9000
UNION ALL SELECT 10, 11112222, 1100;

DynamoDB

You can create a new table in DynamoDB with the partition key cust_id and the sort key zipcode through AWS CloudShell with the following command:

aws dynamodb create-table \
    --table-name customer_ddb \
    --attribute-definitions \
        AttributeName=cust_id,AttributeType=N \
        AttributeName=zipcode,AttributeType=N \
    --key-schema \
        AttributeName=cust_id,KeyType=HASH \
        AttributeName=zipcode,KeyType=RANGE \
    --provisioned-throughput \
        ReadCapacityUnits=5,WriteCapacityUnits=5 \
    --table-class STANDARD

You can populate the DynamoDB table with the following commands:

aws dynamodb put-item \
    --table-name customer_ddb \
    --item \
        '{"cust_id": {"N": "11"}, "zipcode": {"N": "2000"}, "mobile": {"N": "11113333"}}'

aws dynamodb put-item \
    --table-name customer_ddb \
    --item \
        '{"cust_id": {"N": "12"}, "zipcode": {"N": "2000"}, "mobile": {"N": "22224444"}}'

aws dynamodb put-item \
    --table-name customer_ddb \
    --item \
        '{"cust_id": {"N": "13"}, "zipcode": {"N": "3000"}, "mobile": {"N": "33335555"}}'

aws dynamodb put-item \
    --table-name customer_ddb \
    --item \
        '{"cust_id": {"N": "14"}, "zipcode": {"N": "4000"}, "mobile": {"N": "55556666"}}'
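If you want to load more sample rows, typing each item by hand is error-prone. The following Python sketch generates the put-item commands from a list of rows, encoding each value in the DynamoDB attribute-value format where numbers are passed as strings under the `N` key. The helper function names are hypothetical; the table name and row values match the sample data above.

```python
import json
import shlex


def to_ddb_item(cust_id: int, zipcode: int, mobile: int) -> dict:
    """Encode one customer row in DynamoDB attribute-value format (numeric type N)."""
    return {
        "cust_id": {"N": str(cust_id)},
        "zipcode": {"N": str(zipcode)},
        "mobile": {"N": str(mobile)},
    }


def put_item_command(table: str, item: dict) -> str:
    """Render an AWS CLI put-item command with the item JSON safely shell-quoted."""
    return "aws dynamodb put-item --table-name {} --item {}".format(
        shlex.quote(table), shlex.quote(json.dumps(item))
    )


# (cust_id, zipcode, mobile) rows matching the sample data in this appendix.
rows = [
    (11, 2000, 11113333),
    (12, 2000, 22224444),
    (13, 3000, 33335555),
    (14, 4000, 55556666),
]
for row in rows:
    print(put_item_command("customer_ddb", to_ddb_item(*row)))
```

Using `shlex.quote` wraps the JSON in plain single quotes, which avoids the shell errors caused by typographic quotes copied from a web page.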

Snowflake

You can create your database, schema, and tables in Snowflake with the following SQL queries:

USE DATABASE tasty_bytes_sample_data;
CREATE SCHEMA "sf_schema";

CREATE TABLE "customer_sf" AS
SELECT 1 AS "cust_id", 11111111 AS "mobile", 1000 AS "zipcode"
UNION ALL SELECT 2, 22222222, 2000
UNION ALL SELECT 3, 33333333, 3000
UNION ALL SELECT 4, 44444444, 4000
UNION ALL SELECT 5, 55555555, 5000
UNION ALL SELECT 21, 12341234, 1234;

Appendix B: Connection Properties for Redshift and Snowflake

Redshift Connection Properties:

Snowflake Connection Properties:

About the Authors

Sandeep Adwankar is a Senior Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Praveen Kumar is a Principal Analytics Solution Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-centered services. His areas of interests are serverless technology, modern cloud data warehouses, streaming, and generative AI applications.

Stuti Deshpande is a Big Data Specialist Solutions Architect at AWS. She works with customers around the globe, providing them strategic and architectural guidance on implementing analytics solutions using AWS. She has extensive experience in big data, ETL, and analytics. In her free time, Stuti likes to travel, learn new dance forms, and enjoy quality time with family and friends.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Scott Rigney is a Senior Technical Product Manager with AWS and has expertise in analytics, data science, and machine learning. He is passionate about building software products that enable enterprises to make data-driven decisions and drive innovation.

The next generation of Amazon SageMaker: The center for all your data, analytics, and AI

2024-12-04 G2 Krishnamoorthy

Post Syndicated from G2 Krishnamoorthy original https://aws.amazon.com/blogs/big-data/the-next-generation-of-amazon-sagemaker-the-center-for-all-your-data-analytics-and-ai/

This week on the keynote stages at AWS re:Invent 2024, you heard Matt Garman, CEO, AWS, and Swami Sivasubramanian, VP of AI and Data, AWS, speak about the next generation of Amazon SageMaker, the center for all of your data, analytics, and AI.

The relationship between analytics and AI is rapidly evolving. Our customers are telling us that they are seeing their analytics and AI workloads increasingly converge around a lot of the same data, and this is changing how they are using analytics tools with their data. They aren’t using analytics and AI tools in isolation. They’re taking data they’ve historically used for analytics or business reporting and putting it to work in machine learning (ML) models and AI-powered applications.

We want to make it streamlined for our customers to work with their data, whether for analytics or AI, help them get to AI-ready data faster, and improve productivity of all data and AI workers. The next generation of SageMaker is set to do just that.

Introducing the next generation of SageMaker

The rise of generative AI is changing how data and AI teams work together. For example, when a retail data analyst creates customer segmentation reports, those same datasets are now being used by AI teams to train recommendation engines. Or customer service teams analyzing call logs to track common issues are now using that data to train AI chatbots to handle routine inquiries. Our customers tell us that they need tools that help data and AI teams collaborate seamlessly, but they face real challenges: data is siloed and scattered across systems, they have to build and maintain complex data pipelines, and teams struggle to access and use data efficiently due to inconsistent access controls. Customers also need to make sure that their data practices remain secure, reliable, and compliant with regulations. They need data that’s not just accessible, but also trustworthy and properly governed to keep up with growing business demands and AI opportunities.

The next generation of SageMaker, an integrated experience for data, analytics, and AI, addresses these challenges and more. SageMaker brings together widely adopted AWS ML and analytics capabilities—virtually all of the components you need for data exploration, preparation, and integration; petabyte-scale big data processing; fast SQL analytics; model development and training; governance; and generative AI development. SageMaker helps you work faster and smarter with your data and build powerful analytics and AI solutions that are deeply rooted in your unique data assets, giving you an edge over the competition.

Unified tools: Collaborate and build faster with one data and AI development environment

The rapid evolution of data and AI roles demands a revolution in the services and tools that power your work, driving a need for collaboration and teamwork across your entire organization. Amazon SageMaker Unified Studio (Preview) solves this challenge by providing an integrated authoring experience to use all your data and tools for analytics and AI. Collaborate and build faster using familiar AWS tools for model development, generative AI, data processing, and SQL analytics with Amazon Q Developer, the most capable generative AI assistant for software development, helping you along the way. All your favorite functionality and tools, like standalone studios, query editors, and visual tools, are now available in one place, helping you discover and prepare data with ease, author queries or code, and get to insights faster.

SageMaker also comes with built-in generative AI powered by Amazon Q Developer that guides you along the way of your data and AI journey, transforming complex tasks into intuitive conversations. Ask questions in plain English to find the right datasets, automatically generate SQL queries, or create data pipelines without writing code. This isn’t just about making data management effortless—it’s about using AI to make your data work harder for you, unlocking insights that might otherwise remain hidden, and enabling everyone in your organization to work with data confidently, regardless of their technical expertise.

SageMaker still includes all the existing ML and AI capabilities you’ve come to know and love for data wrangling, human-in-the-loop data labeling with Amazon SageMaker Ground Truth, experiments, MLOps, Amazon SageMaker HyperPod managed distributed training, and more. Moving forward, we’ll refer to this set of AI/ML capabilities as SageMaker AI, and we’ll continue to innovate and expand on them to make sure the new SageMaker remains the premier center for building, training, and deploying AI models. With improved access and collaboration, you’ll be able to create and securely share analytics and AI artifacts and bring data and AI products to market faster.

Unified data: Reduce data silos with an open lakehouse to unify all your data

We see organizations embarking on digital transformations and needing to quickly adapt to ever-evolving customer demands. In doing so, a unified view across all their data is required—one that breaks down data silos and simplifies data usage for teams, without sacrificing the depth and breadth of capabilities that make AWS tools unbelievably valuable. This balance between unification and maintaining advanced capabilities is key to supporting our customers’ ongoing innovation and adaptability in a rapidly changing technological landscape.

Amazon SageMaker Lakehouse, now generally available, unifies all your data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and AI/ML applications on a single copy of data. This innovation drives an important change: you’ll no longer have to copy or move data between data lakes and data warehouses. SageMaker Lakehouse enables seamless data access directly in the new SageMaker Unified Studio and provides the flexibility to access and query your data with all Apache Iceberg-compatible tools on a single copy of analytics data. With this launch, you can query data regardless of where it is stored with support for a wide range of use cases, including analytics, ad-hoc querying, data science, machine learning, and generative AI. You’ll get a single unified view of all your data for your data and AI workers, regardless of where the data sits, breaking down your data silos. We’ve simplified data architectures, saving you time and costs on unnecessary data movement, data duplication, and custom solutions.

Additionally, we are advancing towards a zero-ETL future by expanding integrations that make data from multiple operational, transactional, and application sources available in SageMaker Lakehouse and Amazon Redshift. Zero-ETL integrations simplify data movement and ingestion, enabling increased agility, reduced costs, and minimized operational overhead while providing near real-time insights for AI and ML initiatives. All the existing Amazon Redshift zero-ETL integrations are seamlessly available within SageMaker—you can move transactional data from databases like Amazon Aurora, Amazon Relational Database Service (Amazon RDS), and Amazon DynamoDB into Amazon Redshift without performance impact and ingest high-volume real-time data from Amazon Kinesis and Amazon Managed Streaming for Apache Kafka (Amazon MSK) with native streaming services integrations. We announced SageMaker Lakehouse and Amazon Redshift support for zero-ETL integrations from eight applications, including Salesforce, Zendesk, ServiceNow, Zoho CRM, Salesforce Pardot, SAP, Facebook Ads, and Instagram Ads. This new capability streamlines data replication and ingestion into a unified process, minimizing the need for custom data replication pipelines. With automatic pipeline maintenance, the solution minimizes the complexity of building in-house connectors, reduces implementation and operational costs, and accelerates insights by unifying data from diverse applications.

“We have spent the last 18 months working with AWS to transform our data foundation to use best-in-class solutions that are cost-effective as well. With advancements like SageMaker Unified Studio and SageMaker Lakehouse, we expect to accelerate our velocity of delivery through seamless access to data and services, thus enabling our engineers, analysts, and scientists to surface insights that provide material value to our business.”

– Lee Slezak, SVP of Data and Analytics, Lennar

Unified governance: Meet your enterprise security needs with built-in data and AI governance

When it comes to data and AI governance, discipline equals freedom. The right governance practices can enable your teams to move faster. Data teams struggle to find a unified approach that enables effortless discovery, understanding, and assurance of data quality and security across various sources. Our customers tell us that the fragmented nature of permissions and access controls, managed separately within individual data sources and tools, leads to inconsistent implementation and potential security risks.

SageMaker simplifies the discovery, governance, and collaboration for data and AI across your lakehouse, AI models, and applications. With Amazon SageMaker Catalog, built on Amazon DataZone, you can define and enforce access policies consistently using a single permission model with fine-grained access controls. This unified catalog enables engineers, data scientists, and analysts to securely discover and access approved data and models using semantic search with generative AI-created metadata. Collaboration is seamless, with straightforward publishing and subscribing workflows, fostering a more connected and efficient work environment.

Having confidence in your data is key. SageMaker Catalog provides comprehensive data quality capabilities, including data profiling, data quality recommendations, monitoring of data quality rules, and alerts. By combining rule-based and ML approaches, we help you reconcile entities and deliver high-quality data, giving you the tools to make confident business decisions. You’ll have trust in your data, with real-time visibility of data quality and data and ML lineage, allowing you to resolve hard-to-find quality challenges.

Beyond discovery and collaboration, SageMaker takes AI governance to the next level by providing robust safeguards and tools to develop responsible AI policies. This holistic approach not only streamlines operations, but also builds and maintains trust throughout the organization, setting a new standard for responsible and efficient AI development and deployment.

Innovate faster with the convergence of data, analytics and AI

The next generation of SageMaker delivers an integrated experience to access, govern, and act on all your data by bringing together widely adopted AWS data, analytics, and AI capabilities. Collaborate and build faster from a unified studio using familiar AWS tools for model development, generative AI, data processing, and SQL analytics, with Amazon Q Developer assisting you along the way. Access all your data, whether it’s stored in data lakes, data warehouses, or third-party or federated data sources. And move with confidence and trust with built-in governance to address enterprise security needs. The tools to transform your business are here. We’re excited to see what you’ll build next!

To learn more, check out the following AWS News blog announcements:

About the authors

G2 Krishnamoorthy is VP of Analytics, leading AWS data lake services, data integration, Amazon OpenSearch Service, and Amazon QuickSight. Prior to his current role, G2 built and ran the Analytics and ML Platform at Facebook/Meta, and built various parts of the SQL Server database, Azure Analytics, and Azure ML at Microsoft.

Rahul Pathak is VP of Relational Database Engines, leading Amazon Aurora, Amazon Redshift, and Amazon QLDB. Prior to his current role, he was VP of Analytics at AWS, where he worked across the entire AWS database portfolio. He has co-founded two companies, one focused on digital media analytics and the other on IP-geolocation.

How ANZ Institutional Division built a federated data platform to enable their domain teams to build data products to support business outcomes

2024-12-04 Leo Ramsamy

Post Syndicated from Leo Ramsamy original https://aws.amazon.com/blogs/big-data/how-anz-institutional-division-built-a-federated-data-platform-to-enable-their-domain-teams-to-build-data-products-to-support-business-outcomes/

In today’s rapidly evolving financial landscape, data is the bedrock of innovation, enhancing customer and employee experiences and securing a competitive edge. Recognizing this paradigm shift, ANZ Institutional Division has embarked on a transformative journey to redefine its approach to data management, utilization, and extracting significant business value from data insights.

Like many large financial institutions, ANZ Institutional Division operated with siloed data practices and centralized data management teams. As time went on, the limitations of this approach became apparent due to rising data complexity, larger volumes, and the growing demand for swift, business-driven insights. Consequently, the bank encountered several challenges and needed to take the following actions:

Create business insights from untapped data potential, estimated to be approximately $150 million in the Institutional Division alone
Improve operational efficiency by removing manual data handling, the use of spreadsheets, and duplicate data entries
Increase agility by making data expertise more readily available, thereby improving time to market and overall customer experience
Address data quality
Standardize tooling and remove the Shadow IT culture, driving scalability, reducing risk, and minimizing overall operational inefficiencies

These challenges are not unique to ANZ Institutional Division. Globally, financial institutions have been experiencing similar issues, prompting a widespread reassessment of traditional data management approaches.

One major trend, embraced by many financial institutions, has been the adoption of the data mesh architecture and the shift towards treating data as a product. This paradigm, pioneered by thought leaders like Zhamak Dehghani, introduces a decentralized approach to data management that aligns closely with modern organizational structures and agile methodologies.

Some notable global examples of leading companies embracing and implementing this trend are JPMorgan Chase, Capital One, and Saxo Bank.

Inspired by these global trends and driven by its own unique challenges, ANZ’s Institutional Division decided to pivot from viewing data as a byproduct of projects to treating it as a valuable product in its own right.

This shift promises several business benefits:

Empowered domain expertise – By decentralizing data ownership to domain-based teams, ANZ can use the deep business knowledge within each unit to create more relevant and valuable data products
Increased agility – Domain teams can now respond more quickly to business needs, creating and iterating on data products without relying on a centralized bottleneck
Improved data quality – With domain experts overseeing their own data, there’s a greater likelihood of catching and correcting quality issues at the source
Scalability – The federated approach allows for greater scalability, enabling ANZ to handle increasing data volumes and complexity more effectively
Innovation catalyst – By democratizing data access and empowering teams to create data products, ANZ is fostering a culture of innovation and data-driven decision-making across the organization

This transition is not just about technology; it represents a fundamental shift in how ANZ views and values its data assets. By treating data as a product, the bank is positioned to not only overcome current challenges, but to unlock new opportunities for growth, customer service, and competitive advantage.

This post explores how the shift to a data product mindset is being implemented, the challenges faced, and the early wins that are shaping the future of data management in the Institutional Division.

ANZ’s federated data strategy

In response to the challenges, ANZ Group formulated a data strategy that focuses on empowering employees to securely use data to improve the sustainability and financial well-being of their customers. At its core are the following pillars:

Introducing new ways of working that focus on generating customer value first
New technology platforms and tooling that allow the bank to collect, share, archive, and dispose of data in a secure and controlled way
Achieving consistency in how data is produced and consumed across the entire bank through data products and better-connected systems
Supporting the bank’s risk and regulatory obligations by providing a secure and resilient data platform that provides fine-grained, controlled access to quality data products

ANZ has made the strategic decision to adopt an architectural and operational model aligned with the data mesh paradigm, which revolves around four key principles: domain ownership, data as a product, a self-serve data platform, and federated computational governance.

Domain ownership recognizes that the teams generating the data have the deepest understanding of it and are therefore best suited to manage, govern, and share it effectively. This principle makes sure data accountability remains close to the source, fostering higher data quality and relevance.

Treating data as a product instils a product-centric mindset, emphasizing that data must be secure, discoverable, understandable, interoperable, reusable, and managed throughout its lifecycle. This principle makes sure data consumers, both internal and external, derive consistent value from well-designed data products.

A self-serve data platform empowers domains to create, discover, and consume data products independently. It abstracts technical complexities and provides user-friendly tools, enabling a scalable, repeatable, and automated approach to producing high-quality data products.

Under the federated mesh architecture, each divisional mesh functions as a node within the broader enterprise data mesh, maintaining a degree of autonomy in managing its data products. To effectively coordinate these autonomous nodes and facilitate seamless integration, enterprise-wide standards, such as those related to data governance, interoperability, and security, are essential to maintain alignment and consistency across all nodes and the domains and teams within them.

With this approach, each node in ANZ maintains its divisional alignment and adherence to data risk and governance standards and policies to manage local data products and data assets. This enables global discoverability and collaboration without centralizing ownership or operations.

As a result, governance resides with the data products themselves, making sure standards and policies, such as access control, data quality, and compliance, are enforced where the data lives. In this regard, the enterprise data product catalog acts as a federated portal, facilitating cross-domain access and interoperability while maintaining alignment with governance principles. This model balances node or domain-level autonomy with enterprise-level oversight, creating a scalable and consistent framework across ANZ.

Within the ANZ enterprise data mesh strategy, aligning data mesh nodes with the ANZ Group’s divisional structure provides optimal alignment between data mesh principles and organizational structure, as shown in the following diagram.

Central to the success of this strategy is its support for each division’s autonomy and freedom to choose its own domain structure, closely aligned to its business needs. Divisions decide how many domains to have within their node; some may have one, others many. These nodes can implement analytical platforms like data lakehouses, data warehouses, or data marts, all united by producing data products. Nodes and domains serve business needs and are not technology mandated.

Under the federated computational governance model, the ANZ Group strategy defines guardrails that treat a node as a logical data container suitable for the following:

Ingestion and metadata management
Creating source-aligned data products complying with ANZ’s Data Product Specification (DPS)
Integrating source-aligned data products from other nodes
Producing consumer-aligned data products for specific business purposes
Publishing conforming data products to ANZ’s Data Product Catalog (DPC)

Following on from this strategy, the domain structure is organized to give autonomy to the various functional teams while preserving the core values of data mesh. The following diagram depicts an example of a possible structure.

For instance, Domain A will have the flexibility to create data products that can be published to the divisional catalog, while also maintaining the autonomy to develop data products that are exclusively accessible to teams within the domain. These products will not be available to others until they are deemed ready for broader enterprise use.

This strategy supports each division’s autonomy to implement their own data catalogs and decide which data products to publish to the group-level catalog. This flexibility extends to divisional domains, which can choose which data products to publish to the divisional catalog or keep visible only to domain consumers.

Institutional Data & AI Platform architecture

The Institutional Division has implemented a self-service data platform to enable the domain teams to build and manage data products autonomously. The Institutional Data & AI platform adopts a federated approach to data while centralizing the metadata to facilitate simpler discovery and sharing of data products. The following diagram illustrates the building blocks of the Institutional Data & AI Platform.

The building blocks are as follows:

Foundational Data & AI Platform capabilities – A dedicated data platform team provides domain-agnostic tools, systems, and capabilities to enable autonomous data product development across domains. This self-serve infrastructure allows domain teams to manage the full data lifecycle without relying on a centralized data team. Key capabilities include data storage, data onboarding and transformation, and data utilities that facilitate data sharing with interoperability between domains. These capabilities abstract the technical complexities associated with data management infrastructure, allowing domain experts to focus on creating valuable data products rather than infrastructure management.
Domain-owned data assets – The domain-oriented data ownership approach distributes responsibility for data across the business units within the Institutional Division. Domain teams are responsible for developing, deploying, and managing their own analytical data products alongside operational data services. Data contracts authored by data product owners automate data product creation and provide a standard to access data products. By treating the data as a product, the outcome is a reusable asset that outlives a project and meets the needs of the enterprise consumer. Consumer feedback and demand drives creation and maintenance of the data product.
Division-level metadata management and data governance – A centrally hosted service provides domain teams with the capability to publish their data products along with relevant metadata, like business definitions and lineage. Some of the key features implemented are:
1. Metadata management that centralizes metadata and presents it within the context of data products, such as data quality scores and data product lineage.
2. A data portal for consumers to discover data products and access associated metadata.
3. Subscription workflows that simplify access management to the data products.
4. Computational governance that enforces divisional and enterprise data policies and standards, such as data classification and business data models for aligning terminology.

The following diagram is a high-level example of the technical architecture approach towards the Institutional Data & AI Platform. The solution uses a building block approach, on a cloud-centered platform comprised of AWS services, with partner solutions and open standards like OpenLineage and Apache Iceberg.

Let’s look at the key services that enable the federated platform to operate at scale:

Data storage and processing:
- Apache Iceberg on Amazon Simple Storage Service (Amazon S3) offers an optimized way to store data assets and products and promotes interoperability across other services
- Amazon Redshift allows domain teams to create and manage fit-for-purpose data marts
- AWS Lambda and AWS Glue are used for data onboarding and processing, and data utilities created in Python and PySpark promote reusability and quality across the data processing pipelines
- dbt simplifies data transformation rules and allows sub-domain data analysts to build modeling logic as SQL statements
- Amazon Managed Workflows for Apache Airflow (Amazon MWAA) enables efficient management of workflows and data pipeline orchestration using out-of-the-box integrations with AWS services
Metadata management and data governance:
- To maintain data reliability and accuracy, a robust data quality framework built on Soda Core automates data quality checks defined in a data contract
- Amazon DataZone enables data product cataloging, discovery, metadata management, and implementing computational governance
- OpenLineage simplifies harvesting and collection of data and process-level lineage, which are then published to Amazon DataZone
- AWS Lake Formation, combined with AWS Glue Data Catalog, provides data governance and access management to data products that reside within sub-domains
Analytics:
- Tableau provides sub-domains with data visualization and business intelligence capabilities
Observability and security:
- Observability needs of the platform are built into all the processes using monitoring, with logging functionality provided by Amazon CloudWatch and AWS CloudTrail
- AWS Secrets Manager makes sure secrets are stored and made available for data pipelines to access services in a secure manner
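
To make the idea of quality checks driven by a data contract more concrete, here is a minimal Python sketch. It is not ANZ's implementation and does not use the Soda Core API; the DataContract layout, the check_contract helper, and the sample product name are all hypothetical, chosen only to show how a contract can carry the rules that a pipeline then enforces.

```python
from dataclasses import dataclass, field


@dataclass
class DataContract:
    """Hypothetical, simplified data contract for one data product."""
    product: str
    owner_domain: str
    required_columns: list[str]
    # Maximum tolerated fraction of null values per column.
    max_null_fraction: dict[str, float] = field(default_factory=dict)


def check_contract(contract, rows):
    """Return a list of violations; an empty list means the batch passes."""
    if not rows:
        return ["no rows supplied"]
    violations = []
    # Every row must carry every required column.
    for col in contract.required_columns:
        if any(col not in row for row in rows):
            violations.append(f"missing column: {col}")
    # Null fraction per column must stay within the contract's limit.
    for col, limit in contract.max_null_fraction.items():
        nulls = sum(1 for row in rows if row.get(col) is None)
        fraction = nulls / len(rows)
        if fraction > limit:
            violations.append(
                f"{col}: null fraction {fraction:.2f} exceeds {limit:.2f}"
            )
    return violations


contract = DataContract(
    product="trade_settlements",   # hypothetical data product
    owner_domain="markets",        # hypothetical owning domain
    required_columns=["trade_id", "amount"],
    max_null_fraction={"amount": 0.0},
)
good = [{"trade_id": 1, "amount": 10.0}, {"trade_id": 2, "amount": 5.5}]
bad = [{"trade_id": 1, "amount": None}, {"trade_id": 2}]

print(check_contract(contract, good))
print(check_contract(contract, bad))
```

Keeping the rules in the contract rather than in pipeline code means the data product owner, not the platform team, decides what "good" looks like for each product.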

The technical implementation actualizes the data product strategy at ANZ Institutional Division. Amazon DataZone plays an essential role in facilitating data product management for the domain teams. The service addresses several critical aspects of the Institutional Division’s data product strategy, including:

Data cataloging and metadata management – Amazon DataZone provides comprehensive data cataloging and metadata management capabilities
Data governance and compliance – Effective data governance is essential for scaling data products
Self-service capabilities – Amazon DataZone empowers domain teams with self-service capabilities, enabling them to create, manage, and deploy data products independently
Integration and interoperability – One of the challenges in scaling data products is providing seamless integration across various data sources and systems
Collaboration and sharing – Amazon DataZone provides a platform for sharing data and metadata across teams and domains

Institutional Division’s delivery model to achieve scale

The Institutional Division has successfully used the federated architecture, and key to this delivery model is the implementation of Foundational Data & AI Platform capabilities that serve all domains within the division. This model promotes self-service and accelerates the delivery of subsequent initiatives by using the capabilities built for previous use cases.

To evaluate the success of the delivery model, ANZ has implemented key metrics, such as cost transparency and domain adoption, to guide the data mesh governance team in refining the delivery approach. For instance, one enhancement involves integrating cross-functional squads to support data literacy.

The following considerations are key to scaling the Institutional Division operating model:

Data as a product approach – Use techniques like event storming and domain-driven design to capture business events and their meanings.
Education and enablement – Conduct learning interventions to upskill teams on understanding and using the data as a product approach.
Iterative data platform delivery – Work backward from business initiative to iteratively deliver self-service data platform infrastructure capabilities.
Managing demand efficiently – Implement a feedback mechanism to manage demand on data products. Track and manage data debt using standard data contract specifications. Most importantly, adopt governance and standards to make sure data products are built and maintained with a long-term perspective, minimizing technical debt.

“The Institutional Data & Analytics Platform (IDAP) has allowed the Institutional team to establish a base foundation to allow various teams to aggregate and consume the wealth of data across the division. This self-service platform enables business leaders to both create and consume reusable data products, unlocking value across this division. It’s also an excellent proof point for our broader data mesh architecture, allowing us to connect this divisional data to broader enterprise data stores—further positioning us to put the customer at the center of everything we do.”

– Tim Hogarth, CTO ANZ

“AWS believes that democratizing data, while not compromising on security and fine-grained access, is a key component of any future-proof, scalable data platform, so we are pleased to be enabling ANZ bank’s IDAP metadata management and data governance capabilities through Amazon DataZone. This allows the diverse business functions at ANZ the autonomy to self-serve on their data needs with built-in governance.”

– Shikha Verma, Head of Product, Amazon DataZone

Conclusion

ANZ’s journey towards a data product approach has improved how the organization manages data, reduced data silos, and positioned it to become a data-driven, customer-centric organization. By combining federated platform practices and adopting AWS services and open standards, ANZ Institutional Division is achieving its objectives in decentralization with a scalable data platform that enables its domain teams to make informed decisions, drive innovation, and maintain a competitive edge.

Special thanks: This implementation success is a result of close collaboration between ANZ Institutional Division, AWS ProServe, and the AWS account team. We want to thank ANZ Institutional Executives and the Leadership Team for the strong sponsorship and direction.

About the Authors

Leo Ramsamy is a Platform Architect specializing in data and analytics for ANZ’s Institutional division. He focuses on modern data practices, including Data Mesh architecture, data governance, quality management, and observability. His work aligns data strategies with business goals, improving accessibility and enabling better decision-making across ANZ.

Srinivasan Kuppusamy is a Senior Cloud Architect – Data at AWS ProServe, where he helps customers solve their business problems using the power of AWS Cloud technology. His areas of interests are data and analytics, data governance, and AI/ML.

Rada Stanic is a Chief Technologist at Amazon Web Services, where she helps ANZ customers across different segments solve their business problems using AWS Cloud technologies. Her special areas of interest are data analytics, machine learning/AI, and application modernization.

Accelerate foundation model training and fine-tuning with new Amazon SageMaker HyperPod recipes

2024-12-04 Channy Yun (윤석찬)

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/accelerate-foundation-model-training-and-fine-tuning-with-new-amazon-sagemaker-hyperpod-recipes/

Today, we’re announcing the general availability of Amazon SageMaker HyperPod recipes to help data scientists and developers of all skill sets to get started training and fine-tuning foundation models (FMs) in minutes with state-of-the-art performance. They can now access optimized recipes for training and fine-tuning popular publicly available FMs such as Llama 3.1 405B, Llama 3.2 90B, or Mixtral 8x22B.

At AWS re:Invent 2023, we introduced SageMaker HyperPod to reduce time to train FMs by up to 40 percent and scale across more than a thousand compute resources in parallel with preconfigured distributed training libraries. With SageMaker HyperPod, you can find the required accelerated compute resources for training, create optimal training plans, and run training workloads across different blocks of capacity based on the availability of compute resources.

SageMaker HyperPod recipes include a training stack tested by AWS, removing tedious work experimenting with different model configurations, eliminating weeks of iterative evaluation and testing. The recipes automate several critical steps, such as loading training datasets, applying distributed training techniques, automating checkpoints for faster recovery from faults, and managing the end-to-end training loop.

With a simple recipe change, you can seamlessly switch between GPU- or Trainium-based instances to further optimize training performance and reduce costs. You can easily run workloads in production on SageMaker HyperPod or SageMaker training jobs.

SageMaker HyperPod recipes in action
To get started, visit the SageMaker HyperPod recipes GitHub repository to browse training recipes for popular publicly available FMs.

You only need to edit straightforward recipe parameters to specify an instance type and the location of your dataset in the cluster configuration, then run the recipe with a single-line command to achieve state-of-the-art performance.

After cloning the repository, edit the recipe config.yaml file to specify the model and cluster type.

$ git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
$ cd sagemaker-hyperpod-recipes
$ pip3 install -r requirements.txt
$ cd ./recipes_collection
$ vim config.yaml

The recipes support SageMaker HyperPod with Slurm, SageMaker HyperPod with Amazon Elastic Kubernetes Service (Amazon EKS), and SageMaker training jobs. For example, you can set a cluster type (Slurm orchestrator), a model name (Meta Llama 3.1 405B language model), an instance type (ml.p5.48xlarge), and your data locations, such as where to store the training data, results, and logs.

defaults:
- cluster: slurm # support: slurm / k8s / sm_jobs
- recipes: fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora # name of model to be trained
debug: False # set to True to debug the launcher configuration
instance_type: ml.p5.48xlarge # or other supported cluster instances
base_results_dir: # Location(s) to store the results, checkpoints, logs etc.

You can optionally adjust model-specific training parameters in this YAML file, which outlines the optimal configuration, including the number of accelerator devices, instance type, training precision, parallelization and sharding techniques, the optimizer, and logging to monitor experiments through TensorBoard.

run:
  name: llama-405b
  results_dir: ${base_results_dir}/${.name}
  time_limit: "6-00:00:00"
restore_from_path: null
trainer:
  devices: 8
  num_nodes: 2
  accelerator: gpu
  precision: bf16
  max_steps: 50
  log_every_n_steps: 10
  ...
exp_manager:
  exp_dir: # location for TensorBoard logging
  name: helloworld 
  create_tensorboard_logger: True
  create_checkpoint_callback: True
  checkpoint_callback_params:
    ...
  auto_checkpoint: True # for automated checkpointing
use_smp: True 
distributed_backend: smddp # optimized collectives
# Start training from pretrained model
model:
  model_type: llama_v3
  train_batch_size: 4
  tensor_model_parallel_degree: 1
  expert_model_parallel_degree: 1
  # other model-specific params
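
As a quick sanity check on the trainer settings above, the total number of accelerators a job requests is devices multiplied by num_nodes. The following Python sketch copies the values from the recipe into plain dictionaries; the world_size helper and the divisibility check are illustrative assumptions about a reasonable pre-flight check, not part of the recipes launcher.

```python
# Values copied from the trainer and model sections of the recipe above.
trainer = {"devices": 8, "num_nodes": 2}
model = {"tensor_model_parallel_degree": 1, "expert_model_parallel_degree": 1}


def world_size(trainer_cfg):
    """Total accelerators requested: devices per node times number of nodes."""
    return trainer_cfg["devices"] * trainer_cfg["num_nodes"]


total = world_size(trainer)

# Assumed pre-flight check: the parallel degrees should divide the
# accelerator count evenly, otherwise some devices could not be assigned.
parallel = (
    model["tensor_model_parallel_degree"]
    * model["expert_model_parallel_degree"]
)
assert total % parallel == 0, "parallel degrees must divide the world size"

print(total)  # 16
```

With two ml.p5.48xlarge nodes of eight GPUs each, the job above runs on 16 GPUs.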

To run this recipe in SageMaker HyperPod with Slurm, you must prepare the SageMaker HyperPod cluster following the cluster setup instruction.

Then, connect to the SageMaker HyperPod head node, access the Slurm controller, and copy the edited recipe. Next, you run a helper file to generate a Slurm submission script for the job that you can use for a dry run to inspect the content before starting the training job.

$ python3 main.py --config-path recipes_collection --config-name=config

After training completion, the trained model is automatically saved to your assigned data location.

To run this recipe on SageMaker HyperPod with Amazon EKS, clone the recipe from the GitHub repository, install the requirements, and edit the recipe (cluster: k8s) on your laptop. Then, create a link between your laptop and the running EKS cluster, and use the HyperPod Command Line Interface (CLI) to run the recipe.

$ hyperpod start-job --recipe fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora \
--persistent-volume-claims fsx-claim:data \
--override-parameters \
'{
  "recipes.run.name": "hf-llama3-405b-seq8k-gpu-qlora",
  "recipes.exp_manager.exp_dir": "/data/<your_exp_dir>",
  "cluster": "k8s",
  "cluster_type": "k8s",
  "container": "658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121",
  "recipes.model.data.train_dir": "<your_train_data_dir>",
  "recipes.model.data.val_dir": "<your_val_data_dir>"
}'
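
The override parameters are a JSON object whose keys are dotted paths into the recipe. If you keep your overrides as a nested dictionary, one way to produce that object is to flatten it and serialize it with a JSON library, which also avoids hand-editing slips such as stray commas. The sketch below assumes this approach; the flatten helper is hypothetical and the placeholder values are the ones from the command above.

```python
import json
import shlex


def flatten(nested, prefix=""):
    """Turn {"run": {"name": "x"}} into {"<prefix>.run.name": "x"}."""
    flat = {}
    for key, value in nested.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


# Nested overrides; values are the placeholders used in the command above.
recipe_overrides = {
    "run": {"name": "hf-llama3-405b-seq8k-gpu-qlora"},
    "exp_manager": {"exp_dir": "/data/<your_exp_dir>"},
    "model": {
        "data": {
            "train_dir": "<your_train_data_dir>",
            "val_dir": "<your_val_data_dir>",
        }
    },
}

params = flatten(recipe_overrides, prefix="recipes")
# Top-level keys that are not part of the recipe itself.
params["cluster"] = "k8s"
params["cluster_type"] = "k8s"

payload = json.dumps(params, indent=2)
# shlex.quote wraps the JSON so it can be pasted into a shell command.
print("--override-parameters " + shlex.quote(payload))
```

The printed argument can be appended to the hyperpod start-job command in place of the hand-written JSON.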

You can also run a recipe on SageMaker training jobs using the SageMaker Python SDK. The following example runs a PyTorch training script on SageMaker training jobs and overrides parts of the training recipe.

...
recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "exp_dir": "",
        "explicit_log_dir": "/opt/ml/output/tensorboard",
        "checkpoint_dir": "/opt/ml/checkpoints",
    },   
    "model": {
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/val",
        },
    },
}
pytorch_estimator = PyTorch(
    output_path=<output_path>,
    base_job_name="llama-recipe",
    role=<role>,
    instance_type="ml.p5.48xlarge",  # SageMaker instance types carry the ml. prefix
    training_recipe="fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora",
    recipe_overrides=recipe_overrides,
    sagemaker_session=sagemaker_session,
    tensorboard_output_config=tensorboard_output_config,
)
...

As training progresses, the model checkpoints are stored on Amazon Simple Storage Service (Amazon S3) with the fully automated checkpointing capability, enabling faster recovery from training faults and instance restarts.

Now available
Amazon SageMaker HyperPod recipes are now available in the SageMaker HyperPod recipes GitHub repository. To learn more, visit the SageMaker HyperPod product page and the Amazon SageMaker AI Developer Guide.

Give SageMaker HyperPod recipes a try and send feedback to AWS re:Post for SageMaker or through your usual AWS Support contacts.

— Channy

AWS Education Equity Initiative: Applying generative AI to educate the next wave of innovators

2024-12-04 Jeff Barr

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-education-equity-initiative-applying-generative-ai-to-educate-the-next-wave-of-innovators/

Building on the work that we and our partners have been doing for many years, Amazon is committing up to $100 million in cloud technology and technical resources to help existing, dedicated learning organizations reach more learners by creating new and innovative digital learning solutions, all as part of the AWS Education Equity Initiative.

The Work So Far
AWS and Amazon have a long-standing commitment to learning and education. Here’s a sampling of what we have already done:

AWS AI & ML Scholarship Program – This program has awarded $28 million in scholarships to approximately 6000 students.

Machine Learning University – MLU offers a free program helping community colleges and Historically Black Colleges and Universities (HBCUs) teach data management, artificial intelligence, and machine learning concepts. The program is designed to address opportunity gaps by supporting students who are historically underserved and underrepresented in technology disciplines.

Amazon Future Engineer – Since 2021, up to $46 million in scholarships has been awarded to 1150 students through this program. In the past year, more than 2.1 million students received over 17 million hours of STEM education, literacy, and career exploration courses through this and other Amazon philanthropic education programs in the United States. I was able to speak at one such session last year, and it was an amazing experience.

Free Cloud Training – In late 2020 we set a goal of helping 29 million people grow their tech skills with free cloud computing training by 2025. We worked hard and met that target a year ahead of time!

There’s More To Do
Despite all of this work and progress, there’s still more to be done. The future is definitely not evenly distributed: over half a billion students cannot be reached by digital learning today.

We believe that Generative AI can amplify the good work that socially-minded edtech organizations, non-profits, and governments are already doing. Our goal is to empower them to build new and innovative digital learning systems that can amplify their work and allow them to reach a bigger audience.

With the launch of the AWS Education Equity Initiative, we want to help pave the way for the next generation of technology pioneers as they build powerful tools, train foundation models at scale, and create AI-powered teaching assistants.

We are committing up to $100 million in cloud technology and comprehensive technical advising over the next five years. The awardees will have access to the portfolio of AWS services and technical expertise so that they can build and scale learning management systems, mobile apps, chatbots, and other digital learning tools. As part of the application process, applicants will be asked to demonstrate how their proposed solution will benefit students from underserved and underrepresented communities.

As I mentioned earlier, our partners are already doing a lot of great work in this area. For example:

Code.org has already used AWS to scale their free computer science curriculum to millions of students in more than 100 countries. With this initiative, they will expand their use of Amazon Bedrock to provide an automated assessment of student projects, freeing up educator time that can be used for individual instruction and tailored learning.

Rocket Learning focuses on early childhood education in India. They will use Amazon Q in QuickSight to enhance learning outcomes for more than three million children.

I’m super excited about this initiative and look forward to seeing how it will help to create and educate the next generation of technology pioneers!

— Jeff;

Solve complex problems with new scenario analysis capability in Amazon Q in QuickSight

2024-12-04 Veliswa Boya

Post Syndicated from Veliswa Boya original https://aws.amazon.com/blogs/aws/solve-complex-problems-with-new-scenario-analysis-capability-in-amazon-q-in-quicksight/

Today, we announced a new capability of Amazon Q in QuickSight that helps users perform scenario analyses to find answers to complex problems quickly. This AI-assisted data analysis experience helps business users find answers to complex problems by guiding them step-by-step through in-depth data analysis—suggesting analytical approaches, automatically analyzing data, and summarizing findings with suggested actions—using natural language prompts. This new capability eliminates hours of tedious and error-prone manual work traditionally required to perform analyses using spreadsheets or other alternatives. In fact, Amazon Q in QuickSight enables business users to perform complex scenario analysis up to 10x faster than spreadsheets. This capability expands upon existing data Q&A capabilities of Amazon QuickSight so business professionals can start their analysis by simply asking a question.

How it works
Business users are often faced with complex questions that have traditionally required specialized training and days or weeks of time analyzing data in spreadsheets or other tools to address. For example, let’s say you’re a franchisee with multiple locations to manage. You might use this new capability in Amazon Q in QuickSight to ask, “How can I help our new Chicago store perform as well as the flagship store in New York?” Using an agentic approach, Amazon Q would then suggest analytical approaches needed to address the underlying business goal, automatically analyze data, and present results complete with visualizations and suggested actions. You can conduct this multistep analysis in an expansive analysis canvas, giving you the flexibility to make changes, explore multiple analysis paths simultaneously, and adapt to situations over time.

This new analysis experience is part of Amazon QuickSight, meaning it can read from QuickSight dashboards, which connect to sources such as Amazon Athena, Amazon Aurora, Amazon Redshift, Amazon Simple Storage Service (Amazon S3), and Amazon OpenSearch Service. Specifically, this new experience is part of Amazon Q in QuickSight, which allows it to seamlessly integrate with other generative business intelligence (BI) capabilities such as data Q&A. You can also upload either a .csv or a single-table, single-sheet .xlsx file to incorporate into your analysis.

Here’s a visual walkthrough of this new analysis experience in Amazon Q in QuickSight.

I’m planning a customer event, and I’ve received an Excel spreadsheet of all who’ve registered to attend the event. I want to learn more about the attendees, so I analyze the spreadsheet and ask a few questions. I start by describing what I want to explore.

I upload the spreadsheet to start my analysis. Firstly, I want to understand how many people have registered for the event.

To design an agenda that’s suitable for the audience, I want to understand the various roles that will be attending. I select the + icon to add a new block and ask a question that follows the thread from the previous block.

I can continue to ask more questions. However, there are suggested questions for analyzing my data even further, and I now select one of these suggested questions. I want to increase marketing efforts at companies that don’t currently have a lot of attendees; in this case, companies with fewer than two attendees.

Amazon Q executes the required analysis and keeps me updated on the progress. Step 1 of the process identifies companies that have fewer than two attendees and lists them.
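
Amazon Q performs this step for you, but for intuition, identifying these companies amounts to a group-and-filter over the registration list. The following Python sketch illustrates that idea only; the sample registrations and the companies_below helper are hypothetical and say nothing about how Amazon Q implements the analysis.

```python
from collections import Counter

# Hypothetical registration rows: (attendee, company).
registrations = [
    ("Ana", "Acme"),
    ("Ben", "Acme"),
    ("Chi", "Acme"),
    ("Dee", "Globex"),
    ("Eli", "Initech"),
    ("Fay", "Initech"),
]


def companies_below(rows, threshold):
    """Companies whose number of registered attendees is below the threshold."""
    counts = Counter(company for _, company in rows)
    return sorted(c for c, n in counts.items() if n < threshold)


print(companies_below(registrations, 2))  # ['Globex']
```

Because the threshold is a parameter, the same function can answer the question for a different cutoff without any other change.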

Step 2 gives an estimate of how many more attendees I might get from each company if marketing efforts are increased.

In Step 3 I can see the potential increase in total attendees (including the percentage increase) in line with the increase in marketing efforts.

Lastly, Step 4 goes even further to highlight companies I should prioritize for these increased marketing efforts.

To increase the potential number of attendees even more, I want to change the analysis to identify companies with fewer than three attendees instead of two. I choose the AI sparkle icon in the upper right to launch a modal that I then use to provide more context and make specific changes to the previous result.

This change results in new projections, and I can choose to use them for my marketing efforts or keep the previous projections.

Now available
Amazon Q in QuickSight Pro users can use this new capability in preview in the following AWS Regions at launch: US East (N. Virginia) and US West (Oregon). Get started with a free 30-day trial of QuickSight today. To learn more, visit the Amazon QuickSight User Guide. You can submit your questions to AWS re:Post for Amazon QuickSight, or through your usual AWS Support contacts.

– Veliswa.

Use Amazon Q Developer to build ML models in Amazon SageMaker Canvas

2024-12-04 Elizabeth Fuentes

Post Syndicated from Elizabeth Fuentes original https://aws.amazon.com/blogs/aws/use-amazon-q-developer-to-build-ml-models-in-amazon-sagemaker-canvas/

As a data scientist, I’ve experienced firsthand the challenges of making machine learning (ML) accessible to business analysts, marketing analysts, data analysts, and data engineers who are experts in their domains without ML experience. That’s why I’m particularly excited about today’s Amazon Web Services (AWS) announcement that Amazon Q Developer is now available in Amazon SageMaker Canvas. What catches my attention is how Amazon Q Developer helps connect ML expertise with business needs, making ML more accessible across organizations.

Amazon Q Developer helps domain experts build accurate, production-quality ML models through natural language interactions, even if they don’t have ML expertise. Amazon Q Developer guides these users by breaking down their business problems and analyzing their data to recommend step-by-step guidance for building custom ML models. It transforms users’ data to remove anomalies, and builds and evaluates custom ML models to recommend the best one, while providing users control and visibility into every step of the guided ML workflow. This empowers organizations to innovate faster with reduced time to market. It also reduces their reliance on ML experts so their specialists can focus on more complex technical challenges.

For example, a marketing analyst can state, “I want to predict home sales prices using home characteristics and past sales data”, and Amazon Q Developer will translate this into a set of ML steps, analyzing relevant customer data, building multiple models, and recommending the best approach.

Let’s see it in action
To start using Amazon Q Developer, I follow the Getting started with using Amazon SageMaker Canvas guide to launch the Canvas application. In this demo, I use natural language instructions to create a model to predict house prices for marketing and finance teams. From the SageMaker Canvas page, I select Amazon Q and then choose Start a new conversation.

In the new conversation I write:

I am an analyst and need to predict house prices for my marketing and finance teams.

Next, Amazon Q Developer explains the problem and recommends the appropriate ML model type. It also outlines the solution requirements, including the necessary dataset characteristics. Amazon Q Developer then asks whether I want to upload my dataset or choose a target column. I choose to upload my dataset.

In the next step, Amazon Q Developer lists the dataset requirements, which include relevant information about houses, current house prices, and the target variable for the regression model. It then recommends next steps, including I want to upload my dataset, Select an existing dataset, Create a new dataset, or I want to choose a target column. For this demo, I’ll use the canvas-sample-housing.csv sample dataset as my existing dataset.

After selecting and loading the dataset, Amazon Q Developer analyzes it and suggests median_house_value as the target column for the regression model. I accept by selecting I would like to predict the “median_house_value” column. Moving on to the next step, Amazon Q Developer details which dataset features (such as “location”, “housing_median_age”, and “total_rooms”) it will use to predict the median_house_value.

Before moving forward with model training, I ask about the data quality, because without good data we can’t build a reliable model. Amazon Q Developer responds with quality insights for my entire dataset.

I can ask specific questions about individual features and their distributions to better understand the data quality.

To my surprise, through the previous question, I discovered that the “households” column has a wide variation between extreme values, which could affect the model’s prediction accuracy. Therefore, I ask Amazon Q Developer to fix this outlier problem.

After the transformation is done, I can ask what steps Amazon Q Developer followed to make this change. Behind the scenes, Amazon Q Developer applies advanced data preparation steps using SageMaker Canvas data preparation capabilities. I can review these steps to visualize and replicate the process that produces the final, prepared dataset for training the model.

After reviewing the data preparation steps, I select Launch my training job.

After the training job is launched, I can see its progress in the conversation, and the datasets created.

As a data scientist, I particularly appreciate that, with Amazon Q Developer, I can see detailed metrics such as the confusion matrix and precision-recall scores for classification models and root mean square error (RMSE) for regression models. These are crucial elements I always look for when evaluating model performance and making data-driven decisions, and it’s refreshing to see them presented in a way that’s accessible to nontechnical users to build trust and enable proper governance while maintaining the depth that technical teams need.

You can access these metrics by selecting the new model from My Models or from the Amazon Q conversation menu:

Overview – This tab shows the Column impact analysis. In this case, median_income emerges as the primary factor influencing my model.
Scoring – This tab provides model accuracy insights, including RMSE metrics.
Advanced metrics – This tab displays the detailed Metrics table, Residuals and Error density for in-depth model evaluation.
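For readers who want to sanity-check the RMSE value reported on the Scoring tab, the metric itself is simple to reproduce. The following is a minimal sketch of the standard RMSE formula, not the code SageMaker Canvas runs; the house prices are made-up illustration values.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up median house values and predictions, for illustration only.
actual = [200_000.0, 150_000.0, 300_000.0]
predicted = [210_000.0, 140_000.0, 330_000.0]
error = rmse(actual, predicted)
print(f"RMSE: {error:.2f}")
```

Because RMSE is expressed in the same unit as the target column, it can be read directly as a typical prediction error in dollars.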


After reviewing these metrics and validating the model’s performance, I can move to the final stages of the ML workflow:

Predictions – I can test my model using the Predictions tab to validate its real-world performance.
Deployment – I can create an endpoint deployment to make my model available for production use.

This simplifies the deployment process, a step that traditionally requires significant DevOps knowledge, into a straightforward operation that business analysts can handle confidently.


Things to know
Amazon Q Developer democratizes ML across organizations:

Empowering all skill levels with ML – Amazon Q Developer is now available in SageMaker Canvas, helping business analysts, marketing analysts, and data professionals who don’t have ML experience create solutions for business problems through a guided ML workflow. From data analysis and model selection to deployment, users can solve business problems using natural language, reducing dependence on ML experts such as data scientists and enabling organizations to innovate faster with reduced time to market.

Streamlining the ML workflow – With Amazon Q Developer available in SageMaker Canvas, users can prepare data, and build, analyze, and deploy ML models through a guided, transparent workflow. Amazon Q Developer provides advanced data preparation and AutoML capabilities that democratize ML, and allows non-ML experts to produce highly-accurate ML models.

Providing full visibility into the ML workflow – Amazon Q Developer provides full transparency by generating the underlying code and technical artifacts such as data transformation steps, model explainability, and accuracy measures. This allows cross-functional teams, including ML experts, to review, validate, and update the models as needed, facilitating collaboration in a secure environment.

Availability – Amazon Q Developer is now in preview release in Amazon SageMaker Canvas.

Pricing – Amazon Q Developer is now available in SageMaker Canvas at no additional cost to both Amazon Q Developer Pro Tier and Amazon Q Developer Free tier users. However, standard charges apply for resources such as SageMaker Canvas workspace instances and any resources used for building or deploying models. For detailed pricing information, visit the Amazon SageMaker Canvas Pricing page.

To learn more about getting started, visit the Amazon Q Developer product web page.

— Eli

Amazon Bedrock Guardrails now supports multimodal toxicity detection with image support (preview)

2024-12-04 Antje Barth

Post Syndicated from Antje Barth original https://aws.amazon.com/blogs/aws/amazon-bedrock-guardrails-now-supports-multimodal-toxicity-detection-with-image-support/

Today, we’re announcing the preview of multimodal toxicity detection with image support in Amazon Bedrock Guardrails. This new capability detects and filters out undesirable image content in addition to text, helping you improve user experiences and manage model outputs in your generative AI applications.

Amazon Bedrock Guardrails helps you implement safeguards for generative AI applications by filtering undesirable content, redacting personally identifiable information (PII), and enhancing content safety and privacy. You can configure policies for denied topics, content filters, word filters, PII redaction, contextual grounding checks, and Automated Reasoning checks (preview), to tailor safeguards to your specific use cases and responsible AI policies.

With this launch, you can now use the existing content filter policy in Amazon Bedrock Guardrails to detect and block harmful image content across categories such as hate, insults, sexual, and violence. You can configure thresholds from low to high to match your application’s needs.

This new image support works with all foundation models (FMs) in Amazon Bedrock that support image data, as well as any custom fine-tuned models you bring. It provides a consistent layer of protection across text and image modalities, making it easier to build responsible AI applications.

Tero Hottinen, VP, Head of Strategic Partnerships at KONE, envisions the following use case:

In its ongoing evaluation, KONE recognizes the potential of Amazon Bedrock Guardrails as a key component in protecting gen AI applications, particularly for relevance and contextual grounding checks, as well as the multimodal safeguards. The company envisions integrating product design diagrams and manuals into its applications, with Amazon Bedrock Guardrails playing a crucial role in enabling more accurate diagnosis and analysis of multimodal content.

Here’s how it works.

Multimodal toxicity detection in action
To get started, create a guardrail in the AWS Management Console and configure the content filters for either text or image data or both. You can also use AWS SDKs to integrate this capability into your applications.

Create guardrail
On the console, navigate to Amazon Bedrock and select Guardrails. From there, you can create a new guardrail and use the existing content filters to detect and block image data in addition to text data. The categories for Hate, Insults, Sexual, and Violence under Configure content filters can be configured for either text or image content or both. The Misconduct and Prompt attacks categories can be configured for text content only.
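The same configuration can be expressed in code. The following is a minimal sketch that builds a content filter policy following the text-only rule described above. The field names (filtersConfig, inputStrength, outputStrength, inputModalities, outputModalities) are assumed to match the CreateGuardrail API, and setting the prompt-attack output strength to NONE is also an assumption. The helper names are hypothetical.

```python
# Categories that, per the post, accept image content in addition to text.
IMAGE_CAPABLE = {"HATE", "INSULTS", "SEXUAL", "VIOLENCE"}
# Categories that can be configured for text content only.
TEXT_ONLY = {"MISCONDUCT", "PROMPT_ATTACK"}


def content_filter(category, strength="HIGH", images=True):
    """Build one filter entry; IMAGE is added only where it is supported."""
    if category not in IMAGE_CAPABLE | TEXT_ONLY:
        raise ValueError(f"unknown category: {category}")
    modalities = ["TEXT"]
    if images and category in IMAGE_CAPABLE:
        modalities.append("IMAGE")
    return {
        "type": category,
        "inputStrength": strength,
        # Assumption: prompt-attack filtering applies to inputs only.
        "outputStrength": "NONE" if category == "PROMPT_ATTACK" else strength,
        "inputModalities": modalities,
        "outputModalities": modalities,
    }


def build_content_policy(strength="HIGH"):
    """Return a policy dict covering all six content filter categories."""
    categories = ["HATE", "INSULTS", "SEXUAL", "VIOLENCE", "MISCONDUCT", "PROMPT_ATTACK"]
    return {"filtersConfig": [content_filter(c, strength) for c in categories]}


policy = build_content_policy("MEDIUM")
```

With the AWS SDK for Python, a dict like this would typically be passed as the contentPolicyConfig argument of create_guardrail on a bedrock client, alongside the guardrail name and blocked-message settings.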

After you’ve selected and configured the content filters you want to use, you can save the guardrail and start using it to build safe and responsible generative AI applications.

To test the new guardrail in the console, select the guardrail and choose Test. You have two options: test the guardrail by choosing and invoking a model, or test the guardrail without invoking a model by using the Amazon Bedrock Guardrails independent ApplyGuardrail API.

With the ApplyGuardrail API, you can validate content at any point in your application flow before processing or serving results to the user. You can also use the API to evaluate inputs and outputs for any self-managed (custom), or third-party FMs, regardless of the underlying infrastructure. For example, you could use the API to evaluate a Meta Llama 3.2 model hosted on Amazon SageMaker or a Mistral NeMo model running on your laptop.
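As a rough illustration of calling the API from code, the following sketch assembles an ApplyGuardrail request with text and image content and reads which content filters blocked it. The request and response shapes are assumed to match the bedrock-runtime apply_guardrail operation. The guardrail ID, the image bytes, and the FakeRuntimeClient class with its canned response are hypothetical stand-ins so the example runs offline.

```python
def build_apply_guardrail_request(guardrail_id, version, text,
                                  image_bytes=None, image_format="png",
                                  source="INPUT"):
    """Assemble keyword arguments for an ApplyGuardrail call."""
    if source not in ("INPUT", "OUTPUT"):
        raise ValueError("source must be 'INPUT' or 'OUTPUT'")
    content = [{"text": {"text": text}}]
    if image_bytes is not None:
        content.append(
            {"image": {"format": image_format, "source": {"bytes": image_bytes}}}
        )
    return {
        "guardrailIdentifier": guardrail_id,
        "guardrailVersion": version,
        "source": source,
        "content": content,
    }


def blocked_filters(response):
    """Return (type, confidence) pairs for content filters that blocked."""
    hits = []
    for assessment in response.get("assessments", []):
        for item in assessment.get("contentPolicy", {}).get("filters", []):
            if item.get("action") == "BLOCKED":
                hits.append((item.get("type"), item.get("confidence")))
    return hits


class FakeRuntimeClient:
    """Offline stand-in for a bedrock-runtime client (canned response)."""

    def apply_guardrail(self, **kwargs):
        return {
            "action": "GUARDRAIL_INTERVENED",
            "assessments": [
                {"contentPolicy": {"filters": [
                    {"type": "INSULTS", "confidence": "HIGH", "action": "BLOCKED"}
                ]}}
            ],
        }


# Placeholder guardrail ID and image bytes, for illustration only.
request = build_apply_guardrail_request(
    "example-guardrail-id", "DRAFT", "Describe this image.",
    image_bytes=b"placeholder-image-bytes",
)
response = FakeRuntimeClient().apply_guardrail(**request)
print(response["action"], blocked_filters(response))
```

In a real application, the fake client would be replaced by a bedrock-runtime client from the AWS SDK, and the same request could be sent with source set to OUTPUT to validate a model response.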

Test guardrail by choosing and invoking a model
Select a model that supports image inputs or outputs, for example, Anthropic’s Claude 3.5 Sonnet. Verify that the prompt and response filters are enabled for image content. Next, provide a prompt, upload an image file, and choose Run.

In my example, Amazon Bedrock Guardrails intervened. Choose View trace for more details.

The guardrail trace provides a record of how safety measures were applied during an interaction. It shows whether Amazon Bedrock Guardrails intervened or not and what assessments were made on both input (prompt) and output (model response). In my example, the content filters blocked the input prompt because they detected insults in the image with high confidence.

Test guardrail without invoking a model
In the console, choose Use Guardrails independent API to test the guardrail without invoking a model. Choose whether you want to validate an input prompt or an example of a model-generated output. Then, repeat the steps from before. Verify that the prompt and response filters are enabled for image content, provide the content to validate, and choose Run.

I reused the same image and input prompt for my demo, and Amazon Bedrock Guardrails intervened again. Choose View trace again for more details.

Join the preview
Multimodal toxicity detection with image support is available today in preview in Amazon Bedrock Guardrails in the US East (N. Virginia, Ohio), US West (Oregon), Asia Pacific (Mumbai, Seoul, Singapore, Tokyo), Europe (Frankfurt, Ireland, London), and AWS GovCloud (US-West) AWS Regions. To learn more, visit Amazon Bedrock Guardrails.

Give the multimodal toxicity detection content filter a try today in the Amazon Bedrock console and let us know what you think! Send feedback to AWS re:Post for Amazon Bedrock or through your usual AWS Support contacts.

— Antje

New Amazon Bedrock capabilities enhance data processing and retrieval

2024-12-04 Danilo Poccia

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-amazon-bedrock-capabilities-enhance-data-processing-and-retrieval/

Today, Amazon Bedrock introduces four enhancements that streamline how you can analyze data with generative AI:

Amazon Bedrock Data Automation (preview) – A fully managed capability of Amazon Bedrock that streamlines the generation of valuable insights from unstructured, multimodal content such as documents, images, audio, and videos. With Amazon Bedrock Data Automation, you can build automated intelligent document processing (IDP), media analysis, and Retrieval-Augmented Generation (RAG) workflows quickly and cost-effectively. Insights include video summaries of key moments, detection of inappropriate image content, automated analysis of complex documents, and much more. You can customize outputs to tailor insights into your specific business needs. Amazon Bedrock Data Automation can be used as a standalone feature or as a parser when setting up a knowledge base for RAG workflows.

Amazon Bedrock Knowledge Bases now processes multimodal data – To help build applications that process both text and visual elements in documents and images, you can configure a knowledge base to parse documents using either Amazon Bedrock Data Automation or a foundation model (FM) as the parser. Multimodal data processing can improve the accuracy and relevancy of the responses you get from a knowledge base that includes information embedded in both images and text.

Amazon Bedrock Knowledge Bases now supports GraphRAG (preview) – We now offer one of the first fully-managed GraphRAG capabilities. GraphRAG enhances generative AI applications by providing more accurate and comprehensive responses to end users by using RAG techniques combined with graphs.

Amazon Bedrock Knowledge Bases now supports structured data retrieval – This capability extends a knowledge base to support natural language querying of data warehouses and data lakes so that applications can access business intelligence (BI) through conversational interfaces and improve the accuracy of the responses by including critical enterprise data. Amazon Bedrock Knowledge Bases provides one of the first fully-managed out-of-the-box RAG solutions that can natively query structured data from where it resides. This capability helps break data silos across data sources and accelerates building generative AI applications from over a month to just a few days.

These new capabilities make it easier to build comprehensive AI applications that can process, understand, and retrieve information from structured and unstructured data sources. For example, a car insurance company can use Amazon Bedrock Data Automation to automate their claims adjudication workflow to reduce the time taken to process automobile claims, improving the productivity of their claims department.

Similarly, a media company can analyze TV shows and extract insights needed for smart advertisement placement such as scene summaries, industry standard advertising taxonomies (IAB), and company logos. A media production company can generate scene-by-scene summaries and capture key moments in their video assets. A financial services company can process complex financial documents containing charts and tables and use GraphRAG to understand relationships between different financial entities. All these companies can use structured data retrieval to query their data warehouse while retrieving information from their knowledge base.

Let’s take a closer look at these features.

Introducing Amazon Bedrock Data Automation
Amazon Bedrock Data Automation is a capability of Amazon Bedrock that simplifies the process of extracting valuable insights from multimodal, unstructured content, such as documents, images, videos, and audio files.

Amazon Bedrock Data Automation provides a unified, API-driven experience that developers can use to process multimodal content through a single interface, eliminating the need to manage and orchestrate multiple AI models and services. With built-in safeguards, such as visual grounding and confidence scores, Amazon Bedrock Data Automation helps promote the accuracy and trustworthiness of the extracted insights, making it easier to integrate into enterprise workflows.

Amazon Bedrock Data Automation supports 4 modalities (documents, images, video, and audio). When used in an application, all modalities use the same asynchronous inference API, and results are written to an Amazon Simple Storage Service (Amazon S3) bucket.

For each modality, you can configure the output based on your processing needs and generate two types of outputs:

Standard output – With standard output, you get predefined default insights that are relevant to the input data type. Examples include semantic representation of documents, summaries of videos by scene, audio transcripts and more. You can configure which insights you want to extract with just a few steps.

Custom output – With custom output, you have the flexibility to define and specify your extraction needs using artifacts called “blueprints” to generate insights tailored to your business needs. You can also transform the generated output into a specific format or schema that is compatible with your downstream systems such as databases or other applications.

Standard output can be used with all formats (audio, documents, images, and videos). During the preview, custom output can only be used with documents and images.

Both standard and custom output configurations can be saved in a project to reference in the Amazon Bedrock Data Automation inference API. A project can be configured to generate both standard output and custom output for each processed file.

Let’s look at an example of processing a document for both standard and custom outputs.

Using Amazon Bedrock Data Automation
On the Amazon Bedrock console, I choose Data Automation in the navigation pane. Here, I can review how this capability works with a few sample use cases.

Then, I choose Demo in the Data Automation section of the navigation pane. I can try this capability using one of the provided sample documents or by uploading my own. For example, let’s say I am working on an application that needs to process birth certificates.

I start by uploading a birth certificate to see the standard output results. The first time I upload a document, I’m asked to confirm to create an S3 bucket to store the assets. When I look at the standard output, I can tailor the result with a few quick settings.

I choose the Custom output tab. The document is recognized by one of the sample blueprints and information is extracted across multiple fields.

Most of the data for my application is there but I need a few customizations. For example, the date the birth certificate was issued (JUNE 10, 2022) is in a different format than the other dates in the document. I also need the state that issued the certificate and a couple of flags that tell me if the child last name matches the one from the mother or the father.

Most of the fields in the previous blueprint use the Explicit extraction type. That means they’re extracted as they are from the document.

If I want a date in a specific format, I can create a new field using the Inferred extraction type and add instructions on how to format the result starting from the content of the document. Inferred extractions can be used to perform transformations, such as date or Social Security number (SSN) format, or validations, for example, to check if a person is over 21 based on today’s date.
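When testing an inferred field, it can help to compute the expected value locally and compare it with what the blueprint returns. The following sketch reproduces the two examples above in plain Python. It is a cross-check only, not how Amazon Bedrock Data Automation performs the transformation, and it treats "over 21" as 21 or older, which is an interpretation.

```python
from datetime import date, datetime


def to_mmddyyyy(raw):
    """Convert a date such as 'JUNE 10, 2022' to MM/DD/YYYY."""
    return datetime.strptime(raw.strip(), "%B %d, %Y").strftime("%m/%d/%Y")


def is_at_least_21(birth, today):
    """Return True if a person born on `birth` is 21 or older on `today`."""
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    age = today.year - birth.year - (0 if had_birthday else 1)
    return age >= 21


print(to_mmddyyyy("JUNE 10, 2022"))
print(is_at_least_21(date(2000, 6, 10), date(2021, 6, 10)))
```

The month-name parsing relies on an English locale, which is the default for Python's C locale.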

Sample blueprints cannot be edited. I choose Duplicate blueprint to create a new blueprint that I can edit and then Add field from the Fields drop down.

I add four fields with extraction type Inferred and these instructions:

The date the birth certificate was issued in MM/DD/YYYY format
The state that issued the birth certificate
Is ChildLastName equal to FatherLastName
Is ChildLastName equal to MotherLastName

The first two fields are strings and the last two booleans.
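For teams that manage blueprints as code, the four fields above could be captured as a schema fragment. The following is only a sketch: the field names are hypothetical, and the inferenceType and instruction keys are an assumption about how a blueprint is represented as JSON, so check the blueprint exported from the console before relying on them.

```python
import json


def inferred_field(field_type, instruction):
    """Describe one field that uses the Inferred extraction type."""
    return {
        "type": field_type,
        "inferenceType": "inferred",  # assumed key name
        "instruction": instruction,   # assumed key name
    }


# Hypothetical field names; the instructions are the ones used in the post.
custom_fields = {
    "IssueDateFormatted": inferred_field(
        "string", "The date the birth certificate was issued in MM/DD/YYYY format"),
    "IssuingState": inferred_field(
        "string", "The state that issued the birth certificate"),
    "ChildLastNameMatchesFather": inferred_field(
        "boolean", "Is ChildLastName equal to FatherLastName"),
    "ChildLastNameMatchesMother": inferred_field(
        "boolean", "Is ChildLastName equal to MotherLastName"),
}

blueprint_fragment = {"type": "object", "properties": custom_fields}
print(json.dumps(blueprint_fragment, indent=2))
```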

After I create the new fields, I can apply the new blueprint to the document I previously uploaded.

I choose Get result and look for the new fields in the results. I see the date formatted as I need, the two flags, and the state.

Now that I have created this custom blueprint tailored to the needs of my application, I can add it to a project. I can associate multiple blueprints with a project for the different document types I want to process, such as a blueprint for passports, a blueprint for birth certificates, a blueprint for invoices, and so on. When processing documents, Amazon Bedrock Data Automation matches each document to a blueprint within the project to extract relevant information.

I can also create a new blueprint from scratch. In that case, I can start with a prompt where I declare any fields I expect to find in the uploaded document and perform normalizations or validations.

Amazon Bedrock Data Automation can also process audio and video files. For example, here’s the standard output when uploading a video from a keynote presentation by Swami Sivasubramanian, VP, AI and Data at AWS.

It takes a few minutes to get the output. The results include a summarization of the overall video, a summary scene by scene, and the text that appears during the video. From here, I can toggle the options to have a full audio transcript, content moderation, or Interactive Advertising Bureau (IAB) taxonomy.

I can also use Amazon Bedrock Data Automation as a parser when creating a knowledge base to extract insights from visually rich documents and images, for retrieval and response generation. Let’s see that in the next section.

Using multimodal data processing in Amazon Bedrock Knowledge Bases
Multimodal data processing support enables applications to understand both text and visual elements in documents.

With multimodal data processing, applications can use a knowledge base to:

Retrieve answers from visual elements in addition to existing support of text.
Generate responses based on the context that includes both text and visual data.
Provide source attribution that references visual elements from the original documents.

When creating a knowledge base in the Amazon Bedrock console, I now have the option to select Amazon Bedrock Data Automation as Parsing strategy.

When I select Amazon Bedrock Data Automation as parser, Amazon Bedrock Data Automation handles the extraction, transformation, and generation of insights from visually rich content, while Amazon Bedrock Knowledge Bases manages ingestion, retrieval, model response generation, and source attribution.

Alternatively, I can use the existing Foundation models as a parser option. With this option, there’s now support for Anthropic’s Claude 3.5 Sonnet as parser, and I can use the default prompt or modify it to suit a specific use case.

In the next step, I specify the Multimodal storage destination on Amazon S3 that will be used by Amazon Bedrock Knowledge Bases to store images extracted from my documents in the knowledge base data source. These images can be retrieved based on a user query, used to generate the response, and cited in the response.

When using the knowledge base, the information extracted by Amazon Bedrock Data Automation or FMs as parser is used to retrieve information about visual elements, understand charts and diagrams, and provide responses that reference both textual and visual content.

Using GraphRAG in Amazon Bedrock Knowledge Bases
Extracting insights from scattered data sources presents significant challenges for RAG applications, requiring multi-step reasoning across these data sources to generate relevant responses. For example, a customer might ask a generative AI-powered travel application to identify family-friendly beach destinations with direct flights from their home location that also offer good seafood restaurants. This requires a connected workflow to identify suitable beaches that other families have enjoyed, match these to flight routes, and select highly-rated local restaurants. A traditional RAG system may struggle to synthesize all these pieces into a cohesive recommendation because the information lives in disparate sources and is not interlinked.

Knowledge graphs can address this challenge by modeling complex relationships between entities in a structured way. However, building and integrating graphs into an application requires significant expertise and effort.

Amazon Bedrock Knowledge Bases now offers one of the first fully managed GraphRAG capabilities that enhances generative AI applications by providing more accurate and comprehensive responses to end users by using RAG techniques combined with graphs.

When creating a knowledge base, I can now enable GraphRAG in just a few steps by choosing Amazon Neptune Analytics as database, automatically generating vector and graph representations of the underlying data, entities and their relationships, and reducing development effort from several weeks to just a few hours.

I start the creation of a new knowledge base. In the Vector database section, when creating a new vector store, I select Amazon Neptune Analytics (GraphRAG). If I don’t want to create a new graph, I can provide an existing vector store and select a Neptune Analytics graph from the list. GraphRAG uses Anthropic’s Claude 3 Haiku to automatically build graphs for a knowledge base.

After I complete the creation of the knowledge base, Amazon Bedrock automatically builds a graph, linking related concepts and documents. When retrieving information from the knowledge base, GraphRAG traverses these relationships to provide more comprehensive and accurate responses.

Using structured data retrieval in Amazon Bedrock Knowledge Bases
Structured data retrieval allows natural language querying of databases and data warehouses. For example, a business analyst might ask, “What were our top-selling products last quarter?” and the system automatically generates and runs the appropriate SQL query for a data warehouse stored in an Amazon Redshift database.

When creating a knowledge base, I now have the option to use a structured data store.

I enter a name and description for the knowledge base. In Data source details, I use Amazon Redshift as Query engine. I create a new AWS Identity and Access Management (IAM) service role to manage the knowledge base resources and choose Next.

I choose Redshift serverless in Connection options and the Workgroup to use. Amazon Redshift provisioned clusters are also supported. I use the previously created IAM role for Authentication. Storage metadata can be managed with AWS Glue Data Catalog or directly within an Amazon Redshift database. I select a database from the list.

In the configuration of the knowledge base, I can define the maximum duration for a query and include or exclude access to tables or columns. To improve the accuracy of query generation from natural language, I can optionally add a description for tables and columns and a list of curated queries that provides practical examples of how to translate a question into a SQL query for my database. I choose Next, review the settings, and complete the creation of the knowledge base.

After a few minutes, the knowledge base is ready. Once synced, Amazon Bedrock Knowledge Bases handles generating, running, and formatting the result of the query, making it easy to build natural language interfaces to structured data. When invoking a knowledge base using structured data, I can ask to only generate SQL, retrieve data, or summarize the data in natural language.
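The "only generate SQL" option can be sketched as a request builder plus a small response helper. The shapes below are assumed to follow the GenerateQuery operation of the bedrock-agent-runtime API (queryGenerationInput, transformationConfiguration, and a queries list in the response). The knowledge base ARN is a hypothetical placeholder, and nothing is sent over the network.

```python
def build_generate_query_request(knowledge_base_arn, question):
    """Assemble keyword arguments for a text-to-SQL GenerateQuery call."""
    return {
        "queryGenerationInput": {"type": "TEXT", "text": question},
        "transformationConfiguration": {
            "mode": "TEXT_TO_SQL",
            "textToSqlConfiguration": {
                "type": "KNOWLEDGE_BASE",
                "knowledgeBaseConfiguration": {
                    "knowledgeBaseArn": knowledge_base_arn
                },
            },
        },
    }


def extract_sql(response):
    """Collect the SQL strings from a GenerateQuery-style response."""
    return [q["sql"] for q in response.get("queries", []) if "sql" in q]


# Hypothetical ARN, for illustration only.
KB_ARN = "arn:aws:bedrock:us-east-1:111122223333:knowledge-base/EXAMPLEKBID"
request = build_generate_query_request(
    KB_ARN, "What were our top-selling products last quarter?"
)
```

Reviewing the generated SQL before running it is a reasonable safeguard while tuning table descriptions and curated queries.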

Things to know
These new capabilities are available today in the following AWS Regions:

Amazon Bedrock Data Automation is available in preview in US West (Oregon).
Multimodal data processing support in Amazon Bedrock Knowledge Bases using Amazon Bedrock Data Automation as parser is available in preview in US West (Oregon). FM as a parser is available in all Regions where Amazon Bedrock Knowledge Bases is offered.
GraphRAG in Amazon Bedrock Knowledge Bases is available in preview in all commercial Regions where Amazon Bedrock Knowledge Bases and Amazon Neptune Analytics are offered.
Structured data retrieval is available in Amazon Bedrock Knowledge Bases in all commercial Regions where Amazon Bedrock Knowledge Bases is offered.

As usual with Amazon Bedrock, pricing is based on usage:

Amazon Bedrock Data Automation charges per image, per page for documents, and per minute for audio or video.
Multimodal data processing in Amazon Bedrock Knowledge Bases is charged based on the use of either Amazon Bedrock Data Automation or the FM as parser.
There is no additional cost for using GraphRAG in Amazon Bedrock Knowledge Bases but you pay for using Amazon Neptune Analytics as the vector store. For more information, visit Amazon Neptune pricing.
There is an additional cost when using structured data retrieval in Amazon Bedrock Knowledge Bases.

For detailed pricing information, see Amazon Bedrock pricing.

Each capability can be used independently or in combination. Together, they make it easier and faster to build applications that use AI to process data. To get started, visit the Amazon Bedrock console. To learn more, you can access the Amazon Bedrock documentation and send feedback to AWS re:Post for Amazon Bedrock. You can find deep-dive technical content and discover how our Builder communities are using Amazon Bedrock at community.aws. Let us know what you build with these new capabilities!

— Danilo

Fedora moves towards Forgejo (Fedora Magazine)

2024-12-04 jzb

Post Syndicated from jzb original https://lwn.net/Articles/1000751/

Fedora Project Leader Matthew Miller reports
that the project’s search to replace Pagure as its git forge is
almost complete, with the Fedora Council strongly in favor of Forgejo:

The Council, currently, has a clear preference for Forgejo. This is a
big decision and we don’t want it to feel rushed. Therefore, we’re
opening this up one last time to everyone’s comments. After two weeks,
we’ll take our formal vote — and then get on with the work!

LWN looked at
Forgejo in February.

Reduce costs and latency with Amazon Bedrock Intelligent Prompt Routing and prompt caching (preview)

2024-12-04 Danilo Poccia

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/reduce-costs-and-latency-with-amazon-bedrock-intelligent-prompt-routing-and-prompt-caching-preview/

Today, Amazon Bedrock has introduced in preview two capabilities that help reduce costs and latency for generative AI applications:

Amazon Bedrock Intelligent Prompt Routing – When invoking a model, you can now use a combination of foundation models (FMs) from the same model family to help optimize for quality and cost. For example, with Anthropic’s Claude model family, Amazon Bedrock can intelligently route requests between Claude 3.5 Sonnet and Claude 3 Haiku depending on the complexity of the prompt. Similarly, Amazon Bedrock can route requests between Meta Llama 3.1 70B and 8B. The prompt router predicts which model will provide the best performance for each request while optimizing the quality of response and cost. This is particularly useful for applications such as customer service assistants, where uncomplicated queries can be handled by smaller, faster, and more cost-effective models, and complex queries are routed to more capable models. Intelligent Prompt Routing can reduce costs by up to 30 percent without compromising on accuracy.

Amazon Bedrock now supports prompt caching – You can now cache frequently used context in prompts across multiple model invocations. This is especially valuable for applications that repeatedly use the same context, such as document Q&A systems where users ask multiple questions about the same document or coding assistants that need to maintain context about code files. The cached context remains available for up to 5 minutes after each access. Prompt caching in Amazon Bedrock can reduce costs by up to 90% and latency by up to 85% for supported models.

These features make it easier to reduce latency and balance performance with cost efficiency. Let’s look at how you can use them in your applications.

Using Amazon Bedrock Intelligent Prompt Routing in the console
Amazon Bedrock Intelligent Prompt Routing uses advanced prompt matching and model understanding techniques to predict the performance of each model for every request, optimizing for quality of responses and cost. During the preview, you can use the default prompt routers for Anthropic’s Claude and Meta Llama model families.

Intelligent prompt routing can be accessed through the AWS Management Console, the AWS Command Line Interface (AWS CLI), and the AWS SDKs. In the Amazon Bedrock console, I choose Prompt routers in the Foundation models section of the navigation pane.

I choose the Anthropic Prompt Router default router to get more information.

From the configuration of the prompt router, I see that it’s routing requests between Claude 3.5 Sonnet and Claude 3 Haiku using cross-Region inference profiles. The routing criteria define the quality difference between the response of the largest model and the smallest model for each prompt, as predicted by the router’s internal model at runtime. The fallback model, used when none of the chosen models meet the desired performance criteria, is Anthropic’s Claude 3.5 Sonnet.

I choose Open in Playground to chat using the prompt router and enter this prompt:

Alice has N brothers and she also has M sisters. How many sisters does Alice’s brothers have?

The result is quickly provided. I choose the new Router metrics icon on the right to see which model was selected by the prompt router. In this case, because the question is rather complex, Anthropic’s Claude 3.5 Sonnet was used.

Now I ask a straightforward question to the same prompt router:

Describe the purpose of a 'hello world' program in one line.

This time, Anthropic’s Claude 3 Haiku has been selected by the prompt router.

I select the Meta Prompt Router to check its configuration. It’s using the cross-Region inference profiles for Llama 3.1 70B and 8B with the 70B model as fallback.

Prompt routers are integrated with other Amazon Bedrock capabilities, such as Amazon Bedrock Knowledge Bases and Amazon Bedrock Agents, or when performing evaluations. For example, here I create a model evaluation to help me compare, for my use case, a prompt router to another model or prompt router.

To use a prompt router in an application, I need to set the prompt router Amazon Resource Name (ARN) as the model ID in the Amazon Bedrock API. Let’s see how this works with the AWS CLI and an AWS SDK.

Using Amazon Bedrock Intelligent Prompt Routing with the AWS CLI
The Amazon Bedrock API has been extended to handle prompt routers. For example, I can list the existing prompt routers in an AWS Region using ListPromptRouters:

aws bedrock list-prompt-routers

In output, I receive a summary of the existing prompt routers, similar to what I saw in the console.

Here’s the full output of the previous command:

{
    "promptRouterSummaries": [
        {
            "promptRouterName": "Anthropic Prompt Router",
            "routingCriteria": {
                "responseQualityDifference": 0.26
            },
            "description": "Routes requests among models in the Claude family",
            "createdAt": "2024-11-20T00:00:00+00:00",
            "updatedAt": "2024-11-20T00:00:00+00:00",
            "promptRouterArn": "arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/anthropic.claude:1",
            "models": [
                {
                    "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.anthropic.claude-3-haiku-20240307-v1:0"
                },
                {
                    "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.anthropic.claude-3-5-sonnet-20240620-v1:0"
                }
            ],
            "fallbackModel": {
                "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.anthropic.claude-3-5-sonnet-20240620-v1:0"
            },
            "status": "AVAILABLE",
            "type": "default"
        },
        {
            "promptRouterName": "Meta Prompt Router",
            "routingCriteria": {
                "responseQualityDifference": 0.0
            },
            "description": "Routes requests among models in the LLaMA family",
            "createdAt": "2024-11-20T00:00:00+00:00",
            "updatedAt": "2024-11-20T00:00:00+00:00",
            "promptRouterArn": "arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/meta.llama:1",
            "models": [
                {
                    "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-8b-instruct-v1:0"
                },
                {
                    "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-70b-instruct-v1:0"
                }
            ],
            "fallbackModel": {
                "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-70b-instruct-v1:0"
            },
            "status": "AVAILABLE",
            "type": "default"
        }
    ]
}
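If you process this output in a script, a small helper can reduce each summary to the fields that matter for routing. The following is a minimal sketch rather than part of the AWS tooling: the names profile_name and summarize_routers are hypothetical, and the code works on an abbreviated copy of the JSON shown above, so it makes no API call.

```python
import json

# Abbreviated copy of the ListPromptRouters output shown above.
SAMPLE_OUTPUT = json.loads("""
{
  "promptRouterSummaries": [
    {
      "promptRouterName": "Anthropic Prompt Router",
      "routingCriteria": {"responseQualityDifference": 0.26},
      "promptRouterArn": "arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/anthropic.claude:1",
      "models": [
        {"modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.anthropic.claude-3-haiku-20240307-v1:0"},
        {"modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.anthropic.claude-3-5-sonnet-20240620-v1:0"}
      ],
      "fallbackModel": {
        "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.anthropic.claude-3-5-sonnet-20240620-v1:0"
      },
      "status": "AVAILABLE",
      "type": "default"
    }
  ]
}
""")


def profile_name(arn):
    """Return the part of an ARN after the last '/', here the inference profile ID."""
    return arn.rsplit("/", 1)[-1]


def summarize_routers(output):
    """Map each router name to its ARN, candidate models, and fallback model."""
    summary = {}
    for router in output["promptRouterSummaries"]:
        summary[router["promptRouterName"]] = {
            "arn": router["promptRouterArn"],
            "models": [profile_name(m["modelArn"]) for m in router["models"]],
            "fallback": profile_name(router["fallbackModel"]["modelArn"]),
        }
    return summary


if __name__ == "__main__":
    print(json.dumps(summarize_routers(SAMPLE_OUTPUT), indent=2))
```

The promptRouterArn value in each summary is what you pass as the model ID when invoking a model through the router.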

I can get information about a specific prompt router using GetPromptRouter with a prompt router ARN. For example, for the Meta Llama model family:

aws bedrock get-prompt-router --prompt-router-arn arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/meta.llama:1

{
    "promptRouterName": "Meta Prompt Router",
    "routingCriteria": {
        "responseQualityDifference": 0.0
    },
    "description": "Routes requests among models in the LLaMA family",
    "createdAt": "2024-11-20T00:00:00+00:00",
    "updatedAt": "2024-11-20T00:00:00+00:00",
    "promptRouterArn": "arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/meta.llama:1",
    "models": [
        {
            "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-8b-instruct-v1:0"
        },
        {
            "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-70b-instruct-v1:0"
        }
    ],
    "fallbackModel": {
        "modelArn": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-70b-instruct-v1:0"
    },
    "status": "AVAILABLE",
    "type": "default"
}

To use a prompt router with Amazon Bedrock, I set the prompt router ARN as the model ID when making API calls. For example, here I use the Anthropic Prompt Router with the AWS CLI and the Amazon Bedrock Converse API:

aws bedrock-runtime converse \
    --model-id arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/anthropic.claude:1 \
    --messages '[{ "role": "user", "content": [ { "text": "Alice has N brothers and she also has M sisters. How many sisters does Alice’s brothers have?" } ] }]'

In output, invocations using a prompt router include a new trace section that tells which model was actually used. In this case, it’s Anthropic’s Claude 3.5 Sonnet:

{
    "output": {
        "message": {
            "role": "assistant",
            "content": [
                {
                    "text": "To solve this problem, let's think it through step-by-step:\n\n1) First, we need to understand the relationships:\n   - Alice has N brothers\n   - Alice has M sisters\n\n2) Now, we need to consider who Alice's brothers' sisters are:\n   - Alice herself is a sister to all her brothers\n   - All of Alice's sisters are also sisters to Alice's brothers\n\n3) So, the total number of sisters that Alice's brothers have is:\n   - The number of Alice's sisters (M)\n   - Plus Alice herself (+1)\n\n4) Therefore, the answer can be expressed as: M + 1\n\nThus, Alice's brothers have M + 1 sisters."
                }
            ]
        }
    },
    . . .
    "trace": {
        "promptRouter": {
            "invokedModelId": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.anthropic.claude-3-5-sonnet-20240620-v1:0"
        }
    }
}
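To keep track of which model the router picks over time, you can read the trace section from each response. This is a minimal sketch that works on response dictionaries shaped like the output above. The helper names invoked_model and tally_routing are hypothetical, and the sample responses are trimmed stand-ins, not real API output.

```python
from collections import Counter


def invoked_model(response):
    """Return the ARN of the model chosen by the prompt router, or None without a trace."""
    return response.get("trace", {}).get("promptRouter", {}).get("invokedModelId")


def tally_routing(responses):
    """Count how many responses were served by each model (keyed by inference profile ID)."""
    counts = Counter()
    for response in responses:
        arn = invoked_model(response)
        if arn is not None:
            counts[arn.rsplit("/", 1)[-1]] += 1
    return counts


# Trimmed stand-ins for Converse API responses (only the trace section matters here).
SONNET = "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.anthropic.claude-3-5-sonnet-20240620-v1:0"
HAIKU = "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.anthropic.claude-3-haiku-20240307-v1:0"
sample_responses = [
    {"trace": {"promptRouter": {"invokedModelId": SONNET}}},
    {"trace": {"promptRouter": {"invokedModelId": HAIKU}}},
    {"trace": {"promptRouter": {"invokedModelId": HAIKU}}},
]

if __name__ == "__main__":
    for model, count in tally_routing(sample_responses).items():
        print(f"{model}: {count}")
```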

Using Amazon Bedrock Intelligent Prompt Routing with an AWS SDK
Using an AWS SDK with a prompt router is similar to the previous command line experience. When invoking a model, I set the model ID to the prompt router ARN. For example, in this Python code I’m using the Meta Llama router with the ConverseStream API:

import json
import boto3

bedrock_runtime = boto3.client(
    "bedrock-runtime",
    region_name="us-east-1",
)

MODEL_ID = "arn:aws:bedrock:us-east-1:123412341234:default-prompt-router/meta.llama:1"

user_message = "Describe the purpose of a 'hello world' program in one line."
messages = [
    {
        "role": "user",
        "content": [{"text": user_message}],
    }
]

streaming_response = bedrock_runtime.converse_stream(
    modelId=MODEL_ID,
    messages=messages,
)

for chunk in streaming_response["stream"]:
    if "contentBlockDelta" in chunk:
        text = chunk["contentBlockDelta"]["delta"]["text"]
        print(text, end="")
    if "messageStop" in chunk:
        print()
    if "metadata" in chunk:
        if "trace" in chunk["metadata"]:
            print(json.dumps(chunk['metadata']['trace'], indent=2))

This script prints the response text and the content of the trace in response metadata. For this uncomplicated request, the faster and more affordable model has been selected by the prompt router:

A "Hello World" program is a simple, introductory program that serves as a basic example to demonstrate the fundamental syntax and functionality of a programming language, typically used to verify that a development environment is set up correctly.
{
  "promptRouter": {
    "invokedModelId": "arn:aws:bedrock:us-east-1:123412341234:inference-profile/us.meta.llama3-1-8b-instruct-v1:0"
  }
}

Using prompt caching with an AWS SDK
You can use prompt caching with the Amazon Bedrock Converse API. When you tag content for caching and send it to the model for the first time, the model processes the input and saves the intermediate results in a cache. For subsequent requests containing the same content, the model loads the preprocessed results from the cache, significantly reducing both costs and latency.

You can implement prompt caching in your applications with a few steps:

Identify the portions of your prompts that are frequently reused.
Tag these sections for caching in the list of messages using the new cachePoint block.
Monitor cache usage and latency improvements in the response metadata usage section.

Here’s an example of implementing prompt caching when working with documents.

First, I download three decision guides in PDF format from the AWS website. These guides help choose the AWS services that fit your use case.

Then, I use a Python script to ask three questions about the documents. In the code, I create a converse() function to handle the conversation with the model. The first time I call the function, I include a list of documents and a flag to add a cachePoint block.

import json

import boto3

MODEL_ID = "us.anthropic.claude-3-5-sonnet-20241022-v2:0"
AWS_REGION = "us-west-2"

bedrock_runtime = boto3.client(
    "bedrock-runtime",
    region_name=AWS_REGION,
)

DOCS = [
    "bedrock-or-sagemaker.pdf",
    "generative-ai-on-aws-how-to-choose.pdf",
    "machine-learning-on-aws-how-to-choose.pdf",
]

messages = []


def converse(new_message, docs=(), cache=False):
    """Send new_message (plus optional documents) to the model, then print the reply and usage."""

    if len(messages) == 0 or messages[-1]["role"] != "user":
        messages.append({"role": "user", "content": []})

    for doc in docs:
        print(f"Adding document: {doc}")
        # Split "name.pdf" into the document name and its format.
        name, doc_format = doc.rsplit('.', maxsplit=1)
        with open(doc, "rb") as f:
            doc_bytes = f.read()
        messages[-1]["content"].append({
            "document": {
                "name": name,
                "format": doc_format,
                "source": {"bytes": doc_bytes},
            }
        })

    messages[-1]["content"].append({"text": new_message})

    if cache:
        messages[-1]["content"].append({"cachePoint": {"type": "default"}})

    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=messages,
    )

    output_message = response["output"]["message"]
    response_text = output_message["content"][0]["text"]

    print("Response text:")
    print(response_text)

    print("Usage:")
    print(json.dumps(response["usage"], indent=2))

    messages.append(output_message)


converse("Compare AWS Trainium and AWS Inferentia in 20 words or less.", docs=DOCS, cache=True)
converse("Compare Amazon Textract and Amazon Transcribe in 20 words or less.")
converse("Compare Amazon Q Business and Amazon Q Developer in 20 words or less.")

For each invocation, the script prints the response and the usage counters.

Adding document: bedrock-or-sagemaker.pdf
Adding document: generative-ai-on-aws-how-to-choose.pdf
Adding document: machine-learning-on-aws-how-to-choose.pdf
Response text:
AWS Trainium is optimized for machine learning training, while AWS Inferentia is designed for low-cost, high-performance machine learning inference.
Usage:
{
  "inputTokens": 4,
  "outputTokens": 34,
  "totalTokens": 29879,
  "cacheReadInputTokenCount": 0,
  "cacheWriteInputTokenCount": 29841
}
Response text:
Amazon Textract extracts text and data from documents, while Amazon Transcribe converts speech to text from audio or video files.
Usage:
{
  "inputTokens": 59,
  "outputTokens": 30,
  "totalTokens": 29930,
  "cacheReadInputTokenCount": 29841,
  "cacheWriteInputTokenCount": 0
}
Response text:
Amazon Q Business answers questions using enterprise data, while Amazon Q Developer assists with building and operating AWS applications and services.
Usage:
{
  "inputTokens": 108,
  "outputTokens": 26,
  "totalTokens": 29975,
  "cacheReadInputTokenCount": 29841,
  "cacheWriteInputTokenCount": 0
}

The usage section of the response contains two new counters: cacheReadInputTokenCount and cacheWriteInputTokenCount. The total number of tokens for an invocation is the sum of the input and output tokens plus the tokens read and written into the cache.
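As a quick check of that arithmetic, the sketch below recomputes the total for the three usage sections printed above. The function name total_tokens is hypothetical. The counters are copied from the output of the script.

```python
def total_tokens(usage):
    """Sum input, output, cache-read, and cache-write tokens for one invocation."""
    return (
        usage["inputTokens"]
        + usage["outputTokens"]
        + usage.get("cacheReadInputTokenCount", 0)
        + usage.get("cacheWriteInputTokenCount", 0)
    )


# Usage sections copied from the three invocations above.
USAGES = [
    {"inputTokens": 4, "outputTokens": 34, "totalTokens": 29879,
     "cacheReadInputTokenCount": 0, "cacheWriteInputTokenCount": 29841},
    {"inputTokens": 59, "outputTokens": 30, "totalTokens": 29930,
     "cacheReadInputTokenCount": 29841, "cacheWriteInputTokenCount": 0},
    {"inputTokens": 108, "outputTokens": 26, "totalTokens": 29975,
     "cacheReadInputTokenCount": 29841, "cacheWriteInputTokenCount": 0},
]

if __name__ == "__main__":
    for usage in USAGES:
        print(total_tokens(usage), usage["totalTokens"])
```

Note that inputTokens counts only the tokens that were neither read from nor written to the cache, which is why it is so small in these invocations.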

Each invocation processes a list of messages. The messages in the first invocation contain the documents, the first question, and the cache point. Because the messages preceding the cache point aren’t currently in the cache, they’re written to cache. According to the usage counters, 29,841 tokens have been written into the cache.

"cacheWriteInputTokenCount": 29841

For the next invocations, the previous response and the new question are appended to the list of messages. The messages before the cachePoint are unchanged, so they are found in the cache.

As expected, we can tell from the usage counters that the same number of tokens previously written is now read from the cache.

"cacheReadInputTokenCount": 29841

In my tests, the next invocations take 55 percent less time to complete compared to the first one. Depending on your use case (for example, with more cached content), prompt caching can improve latency up to 85 percent.

Depending on the model, you can set more than one cache point in a list of messages. To find the right cache points for your use case, try different configurations and look at the effect on the reported usage.

Things to know
Amazon Bedrock Intelligent Prompt Routing is available in preview today in US East (N. Virginia) and US West (Oregon) AWS Regions. During the preview, you can use the default prompt routers, and there is no additional cost for using a prompt router. You pay the cost of the selected model. You can use prompt routers with other Amazon Bedrock capabilities such as performing evaluations, using knowledge bases, and configuring agents.

Because the internal model used by the prompt routers needs to understand the complexity of a prompt, intelligent prompt routing currently only supports English language prompts.

Amazon Bedrock support for prompt caching is available in preview in US West (Oregon) for Anthropic’s Claude 3.5 Sonnet V2 and Claude 3.5 Haiku. Prompt caching is also available in US East (N. Virginia) for Amazon Nova Micro, Amazon Nova Lite, and Amazon Nova Pro.

With prompt caching, cache reads receive a 90 percent discount compared to noncached input tokens. There are no additional infrastructure charges for cache storage. When using Anthropic models, you pay an additional cost for tokens written in the cache. There are no additional costs for cache writes with Amazon Nova models. For more information, see Amazon Bedrock pricing.
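To get a feel for the effect of the discount, here is a rough estimate of input cost with and without caching. It is only a sketch under stated assumptions: the price of 0.003 per 1,000 input tokens is a made-up placeholder, cache_write_multiplier is a hypothetical parameter (1.0 means no extra charge for cache writes, and the actual surcharge for Anthropic models is on the pricing page), and only the 90 percent read discount comes from the text above.

```python
def input_cost(usage, price_per_1k_tokens, cache_read_discount=0.90,
               cache_write_multiplier=1.0):
    """Estimate the input-side cost of one invocation from its usage counters."""
    price = price_per_1k_tokens / 1000
    regular = usage["inputTokens"] * price
    reads = usage.get("cacheReadInputTokenCount", 0) * price * (1 - cache_read_discount)
    writes = usage.get("cacheWriteInputTokenCount", 0) * price * cache_write_multiplier
    return regular + reads + writes


def uncached_cost(usage, price_per_1k_tokens):
    """Estimate the input-side cost if every token were billed at the regular rate."""
    tokens = (
        usage["inputTokens"]
        + usage.get("cacheReadInputTokenCount", 0)
        + usage.get("cacheWriteInputTokenCount", 0)
    )
    return tokens * price_per_1k_tokens / 1000


# Counters from the second invocation in the prompt caching example.
second_call = {"inputTokens": 59, "cacheReadInputTokenCount": 29841,
               "cacheWriteInputTokenCount": 0}
PRICE = 0.003  # hypothetical price per 1,000 input tokens

if __name__ == "__main__":
    cached = input_cost(second_call, PRICE)
    baseline = uncached_cost(second_call, PRICE)
    print(f"input cost saving: {1 - cached / baseline:.1%}")
```

Because almost all input tokens in that invocation are cache reads, the estimated saving on input cost is close to the 90 percent discount. Output tokens are billed as usual and are not included here.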

When using prompt caching, content is cached for up to 5 minutes, with each cache hit resetting this countdown. Prompt caching has been implemented to transparently support cross-Region inference. In this way, your applications can get the cost optimization and latency benefit of prompt caching with the flexibility of cross-Region inference.
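The resetting countdown can be pictured with a small simulation. This is a toy model of the behavior described above, not how Amazon Bedrock implements its cache: the SlidingCache class is hypothetical, times are plain seconds, and it always keeps content for the full 5 minutes even though the service only promises "up to" that.

```python
CACHE_TTL_SECONDS = 5 * 60  # content is cached for up to 5 minutes


class SlidingCache:
    """Toy cache where every access restarts the expiry countdown."""

    def __init__(self, ttl=CACHE_TTL_SECONDS):
        self.ttl = ttl
        self.expires_at = {}

    def access(self, key, now):
        """Return True on a cache hit; a miss writes the entry. Both restart the countdown."""
        hit = key in self.expires_at and now < self.expires_at[key]
        self.expires_at[key] = now + self.ttl
        return hit


if __name__ == "__main__":
    cache = SlidingCache()
    for second in (0, 240, 500, 900):
        result = "hit" if cache.access("docs", second) else "miss (write)"
        print(f"t={second}s: {result}")
```

In this model, requests that arrive less than 5 minutes apart keep the cached prefix alive indefinitely, while a longer gap leads to a new cache write.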

These new capabilities make it easier to build cost-effective and high-performing generative AI applications. By intelligently routing requests and caching frequently used content, you can significantly reduce your costs while maintaining and even improving application performance.

To learn more and start using these new capabilities today, visit the Amazon Bedrock documentation and send feedback to AWS re:Post for Amazon Bedrock. You can find deep-dive technical content and discover how our Builder communities are using Amazon Bedrock at community.aws.

— Danilo

Amazon Bedrock Marketplace: Access over 100 foundation models in one place

2024-12-04 Danilo Poccia

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/amazon-bedrock-marketplace-access-over-100-foundation-models-in-one-place/

Today, we’re introducing Amazon Bedrock Marketplace, a new capability that gives you access to over 100 popular, emerging, and specialized foundation models (FMs) through Amazon Bedrock. With this launch, you can now discover, test, and deploy new models from enterprise providers such as IBM and Nvidia, specialized models such as Upstage’s Solar Pro for Korean language processing and EvolutionaryScale’s ESM3 for protein research, alongside Amazon Bedrock general-purpose FMs from providers such as Anthropic and Meta.

Models deployed with Amazon Bedrock Marketplace can be accessed through the same standard APIs as the serverless models and, for models that are compatible with the Converse API, can be used with tools such as Amazon Bedrock Agents and Amazon Bedrock Knowledge Bases.

As generative AI continues to reshape how organizations work, the need for specialized models optimized for specific domains, languages, or tasks is growing. However, finding and evaluating these models can be challenging and costly. You need to discover them across different services, build abstractions to use them in your applications, and create complex security and governance layers. Amazon Bedrock Marketplace addresses these challenges by providing a single interface to access both specialized and general-purpose FMs.

Using Amazon Bedrock Marketplace
To get started, in the Amazon Bedrock console, I choose Model catalog in the Foundation models section of the navigation pane. Here, I can search for models that help me with a specific use case or language. The results of the search include both serverless models and models available in Amazon Bedrock Marketplace. I can filter results by provider, modality (such as text, image, or audio), or task (such as classification or text summarization).

In the catalog, there are models from organizations like Arcee AI, which builds context-adapted small language models (SLMs), and Widn.AI, which provides multilingual models.

For example, I am interested in the IBM Granite models and search for models from IBM Data and AI.

I select Granite 3.0 2B Instruct, a language model designed for enterprise applications. Choosing the model opens the model detail page where I can see more information from the model provider such as highlights about the model, pricing, and usage including sample API calls.

This specific model requires a subscription, and I choose View subscription options.

From the subscription dialog, I review pricing and legal notes. In Pricing details, I see the software price set by the provider. For this model, there are no additional costs on top of the deployed infrastructure. The Amazon SageMaker infrastructure cost is charged separately and can be seen in Amazon SageMaker pricing.

To proceed with this model, I choose Subscribe.

After the subscription has been completed, which usually takes a few minutes, I can deploy the model. For Deployment details, I use the default settings and the recommended instance type.

I expand the optional Advanced settings. Here, I can choose to deploy in a virtual private cloud (VPC) or specify the AWS Identity and Access Management (IAM) service role used by the deployment. Amazon Bedrock Marketplace automatically creates a service role to access Amazon Simple Storage Service (Amazon S3) buckets where the model weights are stored, but I can choose to use an existing role.

I keep the default values and complete the deployment.

After a few minutes, the deployment is In Service and can be reviewed in the Marketplace deployments page from the navigation pane.

There, I can choose an endpoint to view details and edit the configuration such as the number of instances. To test the deployment, I choose Open in playground and ask for some poetry.

I can also select the model from the Chat/text page of the Playground using the new Marketplace category where the deployed endpoints are listed.

In a similar way, I can use the model with other tools such as Amazon Bedrock Agents, Amazon Bedrock Knowledge Bases, Amazon Bedrock Prompt Management, Amazon Bedrock Guardrails, and model evaluations, by choosing Select Model and selecting the Marketplace model endpoint.

The model I used here is text-to-text, but I can use Amazon Bedrock Marketplace to deploy models with different modalities. For example, after I deploy Stability AI Stable Diffusion 3.5 Large, I can run a quick test in the Amazon Bedrock Image playground.

The models I deployed are now available through the Amazon Bedrock InvokeModel API. When a model is deployed, I can use it with the AWS Command Line Interface (AWS CLI) and any AWS SDK using the endpoint Amazon Resource Name (ARN) as the model ID.

For chat-tuned text-to-text models, I can also use the Amazon Bedrock Converse API, which abstracts model differences and enables model switching with a single parameter change.
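As an illustration, the sketch below prepares a Converse API request that uses an endpoint ARN as the model ID. The endpoint ARN, the helper name build_converse_request, and the maxTokens value are placeholders chosen for this example. The actual call is left as a comment because it needs AWS credentials and a deployed endpoint.

```python
def build_converse_request(endpoint_arn, prompt, max_tokens=256):
    """Build keyword arguments for the Converse API with the endpoint ARN as model ID."""
    return {
        "modelId": endpoint_arn,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": max_tokens},
    }


# Hypothetical endpoint ARN; copy the real one from the Marketplace deployments page.
ENDPOINT_ARN = "arn:aws:sagemaker:us-east-1:123412341234:endpoint/my-marketplace-endpoint"

request = build_converse_request(ENDPOINT_ARN, "Write a short poem about the cloud.")

# To send the request (requires boto3, credentials, and a deployed endpoint):
#   import boto3
#   client = boto3.client("bedrock-runtime", region_name="us-east-1")
#   response = client.converse(**request)
#   print(response["output"]["message"]["content"][0]["text"])
```

Switching to another model then comes down to passing a different value for endpoint_arn.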

Things to know
Amazon Bedrock Marketplace is available in the following AWS Regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Paris), and South America (São Paulo).

With Amazon Bedrock Marketplace, you pay a software fee to the third-party model provider (which can be zero, as in the previous example) and a hosting fee based on the type and number of instances you choose for your model endpoints.

Start browsing the new models using the Model catalog in the Amazon Bedrock console, visit the Amazon Bedrock Marketplace documentation, and send feedback to AWS re:Post for Amazon Bedrock. You can find deep-dive technical content and discover how our Builder communities are using Amazon Bedrock at community.aws.

— Danilo

Meet your training timelines and budgets with new Amazon SageMaker HyperPod flexible training plans

2024-12-04 Channy Yun (윤석찬)

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/meet-your-training-timelines-and-budgets-with-new-amazon-sagemaker-hyperpod-flexible-training-plans/

Today, we’re announcing the general availability of Amazon SageMaker HyperPod flexible training plans to help data scientists train large foundation models (FMs) within their timelines and budgets and save them weeks of effort in managing the training process based on compute availability.

At AWS re:Invent 2023, we introduced SageMaker HyperPod to reduce the time to train FMs by up to 40 percent and scale across thousands of compute resources in parallel with preconfigured distributed training libraries and built-in resiliency. Most generative AI model development tasks need accelerated compute resources in parallel. Our customers struggle to find timely access to compute resources to complete their training within their timeline and budget constraints.

With today’s announcement, you can find the accelerated compute resources required for training, create optimal training plans, and run training workloads across different blocks of capacity based on the availability of the compute resources. Within a few steps, you can specify your training completion date, budget, and compute resource requirements, create optimal training plans, and run fully managed training jobs without manual intervention.

SageMaker HyperPod training plans in action
To get started, go to the Amazon SageMaker AI console, choose Training plans in the left navigation pane, and choose Create training plan.

For example, choose your preferred training date and duration (10 days), instance type and count (16 ml.p5.48xlarge) for the SageMaker HyperPod cluster, and choose Find training plan.

SageMaker HyperPod suggests a training plan that is split into two five-day segments. This includes the total upfront price for the plan.

If you accept this training plan, add your training details in the next step and choose Create your plan.

After you create a training plan, it appears in the list of training plans, and you have to pay upfront for the plan within 12 hours. In this example, one plan is in the Active state and has already started, with all the instances being used. The second plan is Scheduled to start later, but you can already submit jobs that start automatically when the plan begins.

In the Active state, the compute resources are available in SageMaker HyperPod, resume automatically after pauses in availability, and terminate at the end of the plan. In this example, the first segment is currently running and another segment is queued to run after it.

This is similar to the Managed Spot training in SageMaker AI, where SageMaker AI takes care of instance interruptions and continues the training with no manual intervention. To learn more, visit the SageMaker HyperPod training plans in the Amazon SageMaker AI Developer Guide.

Now available
Amazon SageMaker HyperPod training plans are now available in the US East (N. Virginia), US East (Ohio), and US West (Oregon) AWS Regions and support ml.p4d.48xlarge, ml.p5.48xlarge, ml.p5e.48xlarge, ml.p5en.48xlarge, and ml.trn2.48xlarge instances. Trn2 and P5en instances are available only in the US East (Ohio) Region. To learn more, visit the SageMaker HyperPod product page and SageMaker AI pricing page.

Give HyperPod training plans a try in the Amazon SageMaker AI console and send feedback to AWS re:Post for SageMaker AI or through your usual AWS Support contacts.

— Channy

Maximize accelerator utilization for model development with new Amazon SageMaker HyperPod task governance

2024-12-04 Channy Yun (윤석찬)

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/maximize-accelerator-utilization-for-model-development-with-new-amazon-sagemaker-hyperpod-task-governance/

Today, we’re announcing the general availability of Amazon SageMaker HyperPod task governance, an innovation to easily and centrally manage and maximize GPU and Trainium utilization across generative AI model development tasks, such as training, fine-tuning, and inference.

Customers tell us that they’re rapidly increasing investment in generative AI projects, but they face challenges in efficiently allocating limited compute resources. The lack of dynamic, centralized governance for resource allocation leads to inefficiencies, with some projects underutilizing resources while others stall. This situation burdens administrators with constant replanning, causes delays for data scientists and developers, and results in untimely delivery of AI innovations and cost overruns due to inefficient use of resources.

With SageMaker HyperPod task governance, you can accelerate time to market for AI innovations while avoiding cost overruns due to underutilized compute resources. With a few steps, administrators can set up quotas governing compute resource allocation based on project budgets and task priorities. Data scientists or developers can create tasks such as model training, fine-tuning, or evaluation, which SageMaker HyperPod automatically schedules and executes within allocated quotas.

SageMaker HyperPod task governance manages resources, automatically freeing up compute from lower-priority tasks when high-priority tasks need immediate attention. It does this by pausing low-priority training tasks, saving checkpoints, and resuming them later when resources become available. Additionally, idle compute within a team’s quota can be automatically used to accelerate another team’s waiting tasks.

Data scientists and developers can continuously monitor their task queues, view pending tasks, and adjust priorities as needed. Administrators can also monitor and audit scheduled tasks and compute resource usage across teams and projects and, as a result, they can adjust allocations to optimize costs and improve resource availability across the organization. This approach promotes timely completion of critical projects while maximizing resource efficiency.

Getting started with SageMaker HyperPod task governance
Task governance is available for Amazon EKS clusters in HyperPod. Find Cluster Management under HyperPod Clusters in the Amazon SageMaker AI console for provisioning and managing clusters. As an administrator, you can streamline the operation and scaling of HyperPod clusters through this console.

When you choose a HyperPod cluster, you can see the new Dashboard, Tasks, and Policies tabs in the cluster detail page.

1. New dashboard
In the new dashboard, you can see an overview of cluster utilization, team-based metrics, and task-based metrics.

First, you can view both point-in-time and trend-based metrics for critical compute resources, including GPU, vCPU, and memory utilization, across all instance groups.

Next, you can gain comprehensive insights into team-specific resource management, focusing on GPU utilization versus compute allocation across teams. You can use customizable filters for teams and cluster instance groups to analyze metrics such as allocated GPUs/CPUs for tasks, borrowed GPUs/CPUs, and GPU/CPU utilization.

You can also assess task performance and resource allocation efficiency using metrics such as counts of running, pending, and preempted tasks, as well as average task runtime and wait time. To gain comprehensive observability into your SageMaker HyperPod cluster resources and software components, you can integrate with Amazon CloudWatch Container Insights or Amazon Managed Grafana.

2. Create and manage a cluster policy
To enable task prioritization and fair-share resource allocation, you can configure a cluster policy that prioritizes critical workloads and distributes idle compute across teams defined in compute allocations.

To configure priority classes and fair sharing of borrowed compute in cluster settings, choose Edit in the Cluster policy section.

You can define how tasks waiting in the queue are admitted: First-come-first-serve (the default) or Task ranking. When you choose task ranking, tasks waiting in the queue are admitted in the priority order defined in this cluster policy. Tasks of the same priority class are executed on a first-come-first-serve basis.
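The task ranking rule described above can be sketched in a few lines of Python. This is only an illustration of "priority order first, arrival order within a class", not how the HyperPod scheduler is implemented; the class names other than fine-tuning-priority and all of the weights are hypothetical.

```python
# Hypothetical priority classes and weights; a higher weight is admitted first.
PRIORITY_WEIGHTS = {
    "inference-priority": 100,
    "fine-tuning-priority": 50,
    "experiment-priority": 10,
}


def admission_order(tasks, weights=PRIORITY_WEIGHTS):
    """Return task names in the order they would be admitted.

    `tasks` is a list of (name, priority_class) tuples in arrival order.
    Python's sort is stable, so tasks that share a priority class keep
    their arrival order (first-come-first-serve within a class).
    """
    ranked = sorted(tasks, key=lambda task: -weights[task[1]])
    return [name for name, _ in ranked]


queue = [
    ("nightly-eval", "experiment-priority"),
    ("smpv2-llama2", "fine-tuning-priority"),
    ("ablation-7", "experiment-priority"),
    ("prod-endpoint", "inference-priority"),
]
print(admission_order(queue))
```

With first-come-first-serve admission, the same queue would simply be processed in list order.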

You can also configure how idle compute is allocated across teams: First-come-first-serve or Fair-share (the default). The fair-share setting enables teams to borrow idle compute based on their assigned weights, which are configured in relative compute allocations. This enables every team to get a fair share of idle compute to accelerate their waiting tasks.
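As a rough mental model, weight-based fair sharing can be thought of as a proportional split of whatever capacity is idle. The sketch below assumes a simple proportional rule and made-up team names; the actual allocation logic also has to account for borrow limits, instance types, and which teams have waiting tasks.

```python
def fair_share(idle_units, weights):
    """Split idle compute units across teams in proportion to their weights.

    Assumption: a plain proportional split. `weights` maps a (hypothetical)
    team name to its fair-share weight.
    """
    total_weight = sum(weights.values())
    return {team: idle_units * w / total_weight for team, w in weights.items()}


# 12 idle GPUs shared between two teams with weights 1 and 2.
print(fair_share(12, {"ml-engineers": 1, "researchers": 2}))
```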

In the Compute allocation section of the Policies page, you can create and edit compute allocations to distribute compute resources among teams, enable settings that allow teams to lend and borrow idle compute, configure preemption of their own low-priority tasks, and assign fair-share weights to teams.

In the Team section, set a team name; a corresponding Kubernetes namespace is created for your data science and machine learning (ML) teams to use. You can set a fair-share weight for a more equitable distribution of unused capacity across your teams and enable the preemption option based on task priority, allowing higher-priority tasks to preempt lower-priority ones.

In the Compute section, you can add and allocate instance type quotas to teams. Additionally, you can allocate quotas for instance types not yet available in the cluster, allowing for future expansion.

You can enable teams to share idle compute resources by allowing them to lend their unused capacity to other teams. This borrowing model is reciprocal: teams can only borrow idle compute if they are also willing to share their own unused resources with others. You can also specify the borrow limit that enables teams to borrow compute resources over their allocated quota.

3. Run your training task in SageMaker HyperPod cluster
As a data scientist, you can submit a training job and use the quota allocated for your team, using the HyperPod Command Line Interface (CLI) command. With the HyperPod CLI, you can start a job and specify the corresponding namespace that has the allocation.

$ hyperpod start-job --name smpv2-llama2 --namespace hyperpod-ns-ml-engineers
Successfully created job smpv2-llama2
$ hyperpod list-jobs --all-namespaces
{
 "jobs": [
  {
   "Name": "smpv2-llama2",
   "Namespace": "hyperpod-ns-ml-engineers",
   "CreationTime": "2024-09-26T07:13:06Z",
   "State": "Running",
   "Priority": "fine-tuning-priority"
  },
  ...
 ]
}

In the Tasks tab, you can see all tasks in your cluster. Each task has a different priority and capacity need according to its policy. If you run another task with a higher priority, the existing task is suspended so that the higher-priority task can run first.

OK, now let’s check out a demo video showing what happens when a high-priority training task is added while running a low-priority task.

To learn more, visit SageMaker HyperPod task governance in the Amazon SageMaker AI Developer Guide.

Now available
Amazon SageMaker HyperPod task governance is now available in the US East (N. Virginia), US East (Ohio), and US West (Oregon) AWS Regions. You can use HyperPod task governance without additional cost. To learn more, visit the SageMaker HyperPod product page.

Give HyperPod task governance a try in the Amazon SageMaker AI console and send feedback to AWS re:Post for SageMaker or through your usual AWS Support contacts.

— Channy

P.S. Special thanks to Nisha Nadkarni, a senior generative AI specialist solutions architect at AWS, for her contribution in creating a HyperPod testing environment.

Walleij: New ARM32 Security Features in v6.10

2024-12-04 corbet

Post Syndicated from corbet original https://lwn.net/Articles/1000727/

Linus Walleij writes
about a pair of security features for 32-bit Arm systems; these landed
in 6.10, but, he says, have now stabilized to the point that distributors
may want to enable them.

PAN is an abbreviation for the somewhat grammatically incorrect
Privileged Access Never. […]

For modern ARM32 systems with large memories configured to use LPAE
nothing like PAN was available: this version of the MMU simply did
not implement a PAN option.

As of the patch originally developed by Catalin Marinas, we deploy
a scheme that will use the fact that LPAE has two separate
translation table base registers (TTBR:s): one for userspace
(TTBR0) and one for kernelspace (TTBR1).

Masatoshi Ohno & Jérôme Guth | Turning Fear into Excitement | Talks at Google

2024-12-04 Talks at Google

Post Syndicated from Talks at Google original https://www.youtube.com/watch?v=VyxL__v7Ny4

[$] The return of RWF_UNCACHED

2024-12-04 corbet

Post Syndicated from corbet original https://lwn.net/Articles/998783/

Linux offers two broad ways of performing I/O to files. Buffered I/O,
which is the usual way of accessing a file, stores a copy of the
transferred data in the kernel’s page cache to speed future accesses.
Direct I/O, instead, moves data directly between the storage device and a
user-space buffer, avoiding the page cache. Both modes have their
advantages and disadvantages. In 2019, Jens Axboe proposed an uncached buffered mode to get some
of the advantages of both, but that effort stalled at the time. Now, uncached buffered
I/O is back with some impressive performance results behind it.

Black Basta Ransomware Campaign Drops Zbot, DarkGate, and Custom Malware

2024-12-04 Tyler McGraw

Post Syndicated from Tyler McGraw original https://blog.rapid7.com/2024/12/04/black-basta-ransomware-campaign-drops-zbot-darkgate-and-custom-malware/

Executive Summary

Beginning in early October, Rapid7 has observed a resurgence of activity related to the ongoing social engineering campaign being conducted by Black Basta ransomware operators. Rapid7 initially reported the discovery of the novel social engineering campaign in May 2024, followed by an update in August 2024, when the operators updated their tactics and malware payloads and began sending lures via Microsoft Teams. Now, the procedures followed by the threat actors in the early stages of the social engineering attacks have been refined again, with new malware payloads, improved delivery, and increased defense evasion.

Overview

The social engineering attacks are still initiated in a similar manner. Users within the target environment will be email bombed by the threat actor, which is often achieved by signing up the user’s email to numerous mailing lists simultaneously. After the email bomb, the threat actor will reach out to the impacted users. Rapid7 has observed the initial contact still occurs primarily through usage of Microsoft Teams, by which the threat actor, as an external user, will attempt to call or message the impacted user to offer assistance. The account domains in use include both Azure/Entra tenant subdomains (e.g., username[@]tenantsubdomain[.]onmicrosoft[.]com) and custom domains (e.g., username[@]cofincafe[.]com).

In many cases, Rapid7 has observed that the threat actor will pretend to be a member of the target organization’s help desk or support team, or otherwise present themselves as IT staff. Below are examples of Microsoft Teams display names that Rapid7 has observed in use by operators. The display names may or may not be padded with whitespace characters. Rapid7 has also observed threat actors use a first and last name, as the chat display name and/or account username, to impersonate an IT staff member within the targeted organization.

Operator Chat Display Name
Help Desk
HELP DESK
Help Desk Manager
Technical Support
Administracion

If the user interacts with the lure, either by answering the call or messaging back, the threat actor will attempt to get the user to install or execute a remote monitoring and management (RMM) tool, including, but not limited to, QuickAssist, AnyDesk, TeamViewer, Level, or ScreenConnect. Rapid7 has also observed attempts to leverage the OpenSSH client, a native Windows utility, to establish a reverse shell. In at least one instance, the threat actor shared a QR code with the targeted user. The purpose of the QR code is unconfirmed but appears to be an attempt to bypass MFA after stealing a user’s credentials. The URL embedded within the QR code adheres to the following format: hxxps://<company_name>[.]qr-<letter><number>[.]com.
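Defenders who want to search proxy or DNS logs for hosts that follow this format could start from a pattern like the one below. It is a sketch only: the report gives the shape <company_name>[.]qr-<letter><number>[.]com but not the exact character classes, so a single letter followed by one or more digits is an assumption, as is the helper name.

```python
import re

# Assumed character classes: one letter, then one or more digits.
QR_LURE_HOST = re.compile(
    r"^(?P<company>[a-z0-9-]+)\.qr-(?P<letter>[a-z])(?P<number>\d+)\.com$"
)


def match_qr_lure(host):
    """Return the parsed parts of a host that fits the reported format, else None.

    Accepts both defanged ("[.]") and plain hostnames.
    """
    normalized = host.replace("[.]", ".").strip().lower()
    match = QR_LURE_HOST.match(normalized)
    return match.groupdict() if match else None


print(match_qr_lure("examplecorp[.]qr-a1[.]com"))
print(match_qr_lure("examplecorp.com"))
```

A match is only a lead for review, since legitimate domains could fit the same shape.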

In a majority of cases, Rapid7 has observed that the operator, after gaining access to the user’s asset via RMM tool, will then attempt to download and execute additional malware payloads. In one case handled by Rapid7, the operator requested more time — potentially to hand off the access to another member of the group.

The payload delivery methods vary per case, but have included external compromised SharePoint instances, common file sharing websites, servers rented through hosting providers, or even direct upload to the compromised asset in the case of RMM tool remote control. In one case, the operator used the group’s custom credential harvester to dump the user’s credentials, the results for which were subsequently uploaded to a file sharing site — publicly exposing the stolen credentials. SharePoint has been used to distribute copies of AnyDesk portable, likely to circumvent security measures that would prevent the user from downloading it directly from anydesk[.]com. Such attempts have been blocked by web proxy in previous cases.

The overall goal following initial access appears to be the same: to quickly enumerate the environment and dump the user’s credentials. When possible, operators will also still attempt to steal any available VPN configuration files. With the user’s credentials, organization VPN information, and potential MFA bypass, it may be possible for them to authenticate directly to the target environment.

Rapid7 has observed usage of the same credential harvesting executable, previously reported as AntiSpam.exe, though it is now delivered in the form of a DLL and most commonly executed via rundll32.exe. Whereas before it was an unobfuscated .NET executable, the program is now commonly contained within a compiled 64-bit DLL loader. Rapid7 has analyzed at least one sample that has also been obfuscated using the group’s custom packer. The newest versions of the credential harvester now save output to the file 123.txt in the user’s %TEMP% directory, an update from the previous qwertyuio.txt file, though versions of the DLL distributed earlier in the campaign would still output to the previous file.

The credential harvester is most commonly followed by the execution of a loader such as Zbot (a.k.a. Zloader) or DarkGate. This can then serve as a gateway to the execution of subsequent payloads in memory, facilitate data theft, or otherwise perform malicious actions. Rapid7 has also observed operators distributing alternate payload archives containing Cobalt Strike beacon loaders and a pair of Java payloads containing a user credential harvester variant and a custom multi-threaded beacon by which to remotely execute PowerShell commands. In some cases, operators have sent the user a short command, via Teams, which will then begin an infection chain after execution by the targeted user.

Rapid7 continues to observe inconsistent usage of the group’s custom packer to deliver various malware payloads, including their custom credential harvester. A YARA rule is now publicly available that can be used to detect the packer. For example, this packer was used to deliver several obfuscated versions of Black Basta ransomware, obtained via open source intelligence, which directly links operators to the ongoing social engineering campaign.

At the time of writing, the threat actors behind the campaign continue to update both their strategy for gaining initial access and the tools subsequently used. For example, around the time the most recent campaign activity began, Rapid7 observed the delivery of a timestamped and versioned payload archive, 171024_V1US.zip (2024-10-17, version 1, US), which, when compared to a more recently delivered archive, 171124_V15.zip (2024-11-17, version 15), highlights the rapid iteration being undertaken. Many of the payloads being delivered follow a similar pattern as previous activity and often consist of a legitimate file where an export or function entry point has been overwritten to jump to malicious code, and the result is signed with a likely stolen code signing certificate.
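The naming convention of these archives can be parsed mechanically, which is handy when sorting recovered payloads by build. The sketch below assumes the DDMMYY_V<version><optional region>.zip layout implied by the two filenames above; the function name is hypothetical.

```python
import re
from datetime import date, datetime

# DDMMYY, "_V", version digits, optional uppercase suffix such as "US".
ARCHIVE_NAME = re.compile(r"^(?P<stamp>\d{6})_V(?P<version>\d+)(?P<suffix>[A-Z]*)\.zip$")


def parse_archive_name(name):
    """Return (build_date, version, suffix) or None if the name does not fit."""
    match = ARCHIVE_NAME.match(name)
    if match is None:
        return None
    build_date = datetime.strptime(match["stamp"], "%d%m%y").date()
    return build_date, int(match["version"]), match["suffix"]


print(parse_archive_name("171024_V1US.zip"))
print(parse_archive_name("171124_V15.zip"))
```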

Intrusions related to the campaign should be taken seriously — the intent goes beyond typical phishing activity. Past campaign activity has led to the deployment of Black Basta ransomware. While Rapid7 has handled a high volume of incidents related to the current social engineering campaign across a variety of customer environments, to date, every case has been contained before the operator was able to move laterally beyond the targeted user’s asset.

Technical Analysis

Initial Access

Each attack is preceded by the targeted user receiving an often overwhelming number of emails. An operator will then attempt to contact the user via Microsoft Teams, either via messaging or calling, by which they will pretend to offer assistance. Operators will attempt to impersonate the organization’s help desk, such as using the names of existing staff members.

During this social engineering stage, operators often need to troubleshoot with the user to establish remote control of the user’s asset. Based on the environment, for example, RMM tool downloads or execution may be blocked (often some, but not all) or QuickAssist may be disabled, causing the operator to cycle through their options for establishing a foothold. One of the most common first steps after gaining either the confidence of the user, or remote access, is to execute a custom credential harvester.

Credential Harvesting

The credential harvester used by operators, for example SafeStore.dll (SHA256: 3B7E06F1CCAA207DC331AFD6F91E284FEC4B826C3C427DFFD0432FDC48D55176), is an updated version of the previously analyzed program AntiSpam.exe. The DLL variant of the credential harvester is executed by a command like the following example:

rundll32.exe SafeStore.dll,epaas_request_clone

The module will quickly execute three enumeration commands to gather system information — systeminfo, route print, ipconfig /all — and then prompt the user for their password. The user’s credentials are appended onto a new line of the text file 123.txt with each attempt, after the enumeration command output, regardless of whether the credentials are correct. If the user enters the wrong password, they will be prompted to try again. The output for the enumeration commands and the user’s credentials were saved to the file qwertyuio.txt in older versions of the harvester, but are now saved to 123.txt, within the user’s %TEMP% directory. The enumeration commands within the updated version are executed via successive calls to CreateProcessA.
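The rapid sequence of systeminfo, route print, and ipconfig /all is itself a usable hunting signal. The following sketch flags a parent process that launches all three within a short window. The event format, the 10-second window, and the function name are assumptions for illustration and do not describe any Rapid7 detection rule.

```python
from datetime import datetime, timedelta

ENUM_COMMANDS = ("systeminfo", "route print", "ipconfig /all")


def is_enumeration_burst(events, window=timedelta(seconds=10)):
    """Return True if all three commands appear within `window` of each other.

    `events` is an iterable of (timestamp, command_line) tuples for child
    processes of a single parent. Simplification: only the first sighting of
    each command is considered.
    """
    first_seen = {}
    for timestamp, command_line in events:
        lowered = command_line.lower()
        for command in ENUM_COMMANDS:
            if command in lowered and command not in first_seen:
                first_seen[command] = timestamp
    if len(first_seen) < len(ENUM_COMMANDS):
        return False
    return max(first_seen.values()) - min(first_seen.values()) <= window


start = datetime(2024, 12, 4, 12, 0, 0)
burst = [
    (start, "cmd.exe /c systeminfo"),
    (start + timedelta(seconds=1), "cmd.exe /c route print"),
    (start + timedelta(seconds=2), "cmd.exe /c ipconfig /all"),
]
print(is_enumeration_burst(burst))
```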

Based on analysis of one credential harvester sample, EventCloud.dll, the program was present in shellcode form. The shellcode is decrypted from the Cursor Group 880 resource embedded within the executable, using the XOR key 5A 3C 77 6E 33 30 4D 38 4F 38 40 78 41 58 51 30 42 5F 3F 67 71 00, and then injected locally. The following strings which were extracted from the shellcode show the output file and list dynamically loaded libraries:

Credential Harvester Strings	–	–	–	–
cmd.exe /c	%s%s	%s%s%s%s	123.txt	ooki
Update	filter kb_outl	Need credentials to update…	Username:	Password:
ntdll.dll	Gdi32.dll	user32.dll	msvcrt.dll	ucrtbase.dll
Comctl32.dll	Advapi32.dll	kernel32.dll	–	–

The Java variant of the credential harvester, identity.jar, provides a similar prompt to the user, though when a password is entered it is appended, without the username, to a .txt file with a random 10-letter alphabetic name in the current working directory. The cancel button on the prompt, shown below, is not functional and the prompt is drawn on top of other windows, meaning that it will not close until the user has entered their password correctly.

Malware Payloads

Following execution of a credential harvester, an operator will typically infect the asset with Zbot or DarkGate. One of the Zbot samples delivered after initial access, SyncSuite.exe (SHA256: DB34E255AA4D9F4E54461571469B9DD53E49FEED3D238B6CFB49082DE0AFB1E4), contains similar functionality and strings to other Zbot/Zloader samples previously reported by Zscaler. However, in addition to previously observed strings, the sample also contains encrypted strings for an embedded command help menu, error messages, and more. Rapid7 observed the embedded malware version was 2.9.4.0.

Upon execution, the malware will copy itself to a random folder within the %APPDATA% directory. However, if the file does not have its original filename, the process will immediately exit. The malware also contains the functionality to establish persistence either via a Run key at HKCU\Software\Microsoft\Windows\CurrentVersion\Run or a scheduled task named after the executable, which executes the malware copy in %APPDATA% whenever the user logs on. After collecting the hostname, username, and the installation date from the InstallDate value contained within the registry key HKLM\Software\Microsoft\Windows NT\CurrentVersion, this data is concatenated (delimited by underscore characters) and encrypted, along with other config information. It is then stored within the user’s registry inside a random key created at HKCU\Software\Microsoft\. The analyzed sample will also load a fresh copy of ntdll.dll to avoid hooking, which is then used to perform calls to NTAPI functions. SyncSuite.exe ultimately injects itself into a suspended instance of msedge.exe, created using NtCreateUserProcess and executed via ResumeThread, a technique known as Process Hollowing.

All of the strings used by the malware are stored encrypted within the .rdata section along with the configuration. The strings are decrypted using an obfuscated loop that is ultimately a simple XOR operation with the hard coded key 16 EB D5 3E AA E6 51 09 14 D3 DF 18 AD D6 1B BD BE, which is also stored in the .rdata section. The configuration is decrypted using an RC4 key, F3 F9 F7 FB FA F3 F7 F7 FF F5 F2 F3 FA FD FE F2 for this sample. The decrypted configuration for SyncSuite.exe can be seen below, with empty rows removed. The configuration contains a different public RSA key and botnet ID than the one previously shared by ThreatLabz, indicating that the campaign is being run by a different affiliate. All decrypted strings from SyncSuite.exe can be seen in the Zbot Strings section following other Indicators of Compromise.
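For analysts reproducing the string decryption, a repeating-key XOR helper is usually the starting point. The sketch below assumes that the "simple XOR operation" cycles through the key byte by byte from offset zero, which the analysis above does not state explicitly; the actual loop may index the key differently, so the helper may need adjusting against real data.

```python
from itertools import cycle

# Hard-coded string key reported for SyncSuite.exe (17 bytes).
ZBOT_STRING_KEY = bytes.fromhex("16 EB D5 3E AA E6 51 09 14 D3 DF 18 AD D6 1B BD BE")


def xor_bytes(data, key):
    """XOR `data` with `key`, repeating the key as needed (assumed scheme)."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


# XOR is its own inverse, so applying the helper twice restores the input.
sample = b"example string"
assert xor_bytes(xor_bytes(sample, ZBOT_STRING_KEY), ZBOT_STRING_KEY) == sample
```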

Rapid7 has also observed the delivery of DarkGate malware following initial access. One payload archive contained both a DarkGate infection initiation script, test.vbs, and an executable copy of the DarkGate malware itself, SafeFilter.exe (SHA256: EF28A572CDA7319047FBC918D60F71C124A038CD18A02000C7AB413677C5C161), though this copy is packed using the group’s custom packer. The final payload containing the DarkGate malware, after several layers of decrypting and loading, contains the version string 7.0.6. If the folder c:\debugg exists on the system when the malware is executed, it will display the version number via MessageBoxA. The configuration for this sample can be seen below along with hard coded commands. Notably, the campaign ID for the sample appears to be drk2.

The configuration is decrypted with the key ckcilIcconnh within a customized XOR loop near the beginning of execution to reveal CRLF-delimited options. However, due to the implementation of the decryption loop, the keyspace after the first byte is effectively reduced to that of a single byte (0-255). For this sample, this makes the XOR key for the majority of the config 0x60, allowing the encrypted data to be trivially brute-forced.
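Because the effective key is a single byte, recovering it takes at most 256 attempts. The sketch below tries every key and keeps candidates that decode to printable, CRLF-delimited text. It is meant for the portion of the config after the first byte, and the printable-text heuristic and function name are assumptions rather than part of the original analysis.

```python
def brute_force_single_byte_xor(blob, marker=b"\r\n"):
    """Return [(key, plaintext)] for keys that yield printable, CRLF-delimited text."""
    candidates = []
    for key in range(256):
        plain = bytes(b ^ key for b in blob)
        printable = all(32 <= c < 127 or c in (10, 13) for c in plain)
        if printable and marker in plain:
            candidates.append((key, plain))
    return candidates


# Self-contained demo: encrypt two options from the config table with 0x60,
# then recover the key without knowing it.
demo_plain = b"1=Yes\r\n15=80\r\n"
demo_blob = bytes(b ^ 0x60 for b in demo_plain)
for key, plain in brute_force_single_byte_xor(demo_blob):
    print(hex(key), plain)
```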

SafeFilter.exe DarkGate Config

Key-Value Pair	Description
0=179.60.149[.]194|	C2 domains or IP addresses, delimited with ‘|’ characters
8=No	If enabled and the file `C:\ProgramData\hedfdfd\Autoit3.exe` does not exist, call `MessageBoxTimeoutA` using keys 11 and 12 and a timeout of 1770ms.
11=Error	Used by key 8 as a message box title.
12=PyKtS5Q	The string `Error`, base64 encoded with the custom alphabet `zLAxuU0kQKf3sWE7ePRO2imyg9GSpVoYC6rhlX48ZHnvjJDBNFtMd1I5acwbqT+=`. Used by key 8 as a message box caption.
13=6	Unknown
14=Yes	Unknown
15=80	C2 communication port.
1=Yes	Enables infection.
32=Yes	If enabled, attempt bypass of detected security products. For example, enables calls to `RtlAdjustPrivilege` and `NtRaiseHardError` to cause a crash if `hdkcgae` is not present in `C:\temp\` and a Kaspersky product has been detected.
3=No	If disabled, do an anti-vm display check.
4=No	If enabled, compare system drive size to key 18. If below, exit.
18=100	Minimum drive size in GB.
6=No	If enabled and key 3 is disabled, check the display for known virtual machine display strings using `EnumDisplayDevicesA`. If matched, exit. Failed to match properly when tested.
7=No	If enabled, compare system RAM to key 19. If below, exit.
19=4096	Minimum RAM size in MB.
5=No	If enabled, check the registry key `ProcessorNameString` at `HKLM\HARDWARE\DESCRIPTION\System\CentralProcessor\0` for `xeon`. If found, exit.
21=No	Unknown
22	Not present in the config for this sample, but is still checked for in the code. If enabled, set the variant string to `DLL`, otherwise `?`.
23=Yes	If enabled, set the variant string to `AU3` for Autoit3 payloads.
31=No	If enabled, set the variant string to `AHK` for AutoHotKey payloads.
25=drk2	Campaign ID
26=No	Unknown
27=rsFxMyDX	Decryption key, also used to bound/find payloads stored within other files.
28=No	Unknown
29=2	Unknown
35=No	Unknown
tabla=IsUiPQ4&atzM5N=0($"3]TGfyK8JYwvO61SAF{ndrDuol29*RkmqCpgxeX[EH,V)}7jbZBc.WLh	Unknown
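The custom base64 alphabet listed for key 12 can be checked by translating it back to the standard alphabet before decoding. The sketch assumes the custom alphabet maps position by position onto the standard one (A-Z, a-z, 0-9, +, /) and that encoded values are stored without padding, which is consistent with the 12=PyKtS5Q entry above.

```python
import base64

STANDARD_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
DARKGATE_ALPHABET = "zLAxuU0kQKf3sWE7ePRO2imyg9GSpVoYC6rhlX48ZHnvjJDBNFtMd1I5acwbqT+="

_TO_STANDARD = str.maketrans(DARKGATE_ALPHABET, STANDARD_ALPHABET)


def decode_custom_base64(text):
    """Decode a string encoded with the custom alphabet (assumed unpadded)."""
    translated = text.translate(_TO_STANDARD)
    translated += "=" * (-len(translated) % 4)  # restore standard padding
    return base64.b64decode(translated)


print(decode_custom_base64("PyKtS5Q"))  # value of config key 12
```

Note that `=` is an ordinary symbol in this alphabet (index 63), so it must be translated rather than treated as padding.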

DarkGate Hard-coded Commands
/c cd /d "C:\Users\User\AppData\Roaming<browser_dir>" && move <browser_name> <browser_name><random_alphabet_string>
/c cd /d "C:\Users\User\AppData\Local" && move <browser_name> <browser_name><random_alphabet_string>
/c cmdkey /delete:
/c cmdkey /list > c:\temp\cred.txt
/c del /q /f /s C:\Users\User\AppData\Roaming\Mozilla\firefox*
/c ping 127.0.0.1 & del /q /f /s c:\temp & del /q /f /s C:\ProgramData\hedfdfd\ & rmdir /s /q C:\ProgramData\hedfdfd\
/c shutdown -f -r -t 0
/c shutdown -f -s -t 0
/c wmic ComputerSystem get domain > C:\ProgramData\hedfdfd\fcadaab

During execution, DarkGate will hash certain strings and use the result to create or check files at the directories C:\ProgramData\hedfdfd\ (mainfolder) and C:\temp\. The hashing algorithm uses a randomized key generated at runtime, so the hashes across infections will be different. Commonly used strings and their resultant hash, for the analysis environment, are shown below.

Path String	DarkGate Custom Hash
mainfolder	hedfdfd
logsfolder	fhhcfhh
settings	dhkbbfc
domain	fcadaab
mutex0	hfgdced
mutex1	cekchde
au3	dgfeabe
c.txt	adfcbdd
cc.txt	dehgaba
script	daaadeh
fs.txt	hdkcgae

DarkGate may also change its behavior if a known security product is detected. This is achieved by using CreateToolhelp32Snapshot and related functions to loop through running processes which are compared to a hard-coded list. The malware will also check for known installation directories using GetFileAttributesA. If a security product is found, a flag will be set which may alter the execution path. Only the following products had associated flags:

DarkGate “Supported” Security Products	–	–	–	–
Windows Defender	Sophos	Quick Heal	MalwareBytes	Panda Security
Norton/Symantec	ESET/Nod32	Kaspersky	Avast	SentinelOne
Bitdefender	–	–	–	–

At the end of the first execution of the DarkGate payload, it will then attempt to inject itself into a host process. First, DarkGate will select the injection target by searching a list of hard coded directories for any executable that contains the string updatecore.exe, subdirectories included. The path C:\Program Files (x86)\Microsoft\EdgeUpdate\ is searched first, with the fallback being C:\Program Files (x86)\Microsoft\EdgeUpdate\MicrosoftEdgeUpdate.exe. If a matching Edge executable is not found, the path C:\Program Files (x86)\Google\Update\ is then searched. If that also fails, the malware will attempt to use C:\Windows\Microsoft.NET\Framework\v4.0.30319\msbuild.exe.

After successfully choosing the injection target, DarkGate will then inject itself into the target process using shellcode, terminating the original instance of the final DarkGate payload after executing the shellcode. When creating an instance of the target process to inject, DarkGate will also attempt to spoof the parent process ID (PPID) of the injection target by enumerating running processes for accessibility using OpenProcess and then randomly selecting one from an assembled list. The PPID of the target is then updated using UpdateProcThreadAttribute prior to creation with CreateProcessA.

Execution of the injected process is coordinated by checking for the presence of two file-based mutexes within C:\ProgramData\hedfdfd\ (mainfolder). Each instance of the DarkGate malware checks both of the file-based mutexes. Each mutex is checked via a call to CreateFileA using an exclusive share mode flag (0) and a creation disposition of CREATE_ALWAYS, which means that the call will fail if the mutex is already in use by another DarkGate instance. As a result of having two mutexes, hfgdced and cekchde, DarkGate will typically run within two injected process instances at the same time, so if one process is terminated, the remaining instance will spawn another. If a DarkGate instance is spawned and the calls to open both file-based mutexes fail, indicating two existing DarkGate instances, the new instance will terminate. This technique is rarely used by malware developers and highlights the sophistication of DarkGate malware.

DarkGate will unconditionally log keystrokes as well as clipboard data that is under 1024 bytes. The logged data is stored encrypted at C:\ProgramData\hedfdfd\fhhcfhh (mainfolder\logsfolder) within files named <date>.log. The logged data may be sent directly to the C2 address contained within the config. A thread is also created to persist on infected systems by creating the Run key daaadeh (script) at HKCU\Software\Microsoft\Windows\CurrentVersion\Run. The Run key will point to the copies of Autoit3.exe and the compiled AU3 script payload dgfeabe.a3x (au3) created at C:\ProgramData\hedfdfd (mainfolder), with the former executing the latter every time the user logs on. When the AU3 script is executed, DarkGate reinfects the system. However, the thread continuously monitors the text within the infected user’s active window, sleeping 1500ms between checks, and will delete the registry key if a blacklisted application is detected. This list includes popular analysis tools such as Process Hacker, Process Monitor, Task Manager, and even the Windows Registry Editor.
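Since the folder and value names are derived from a per-infection hash, hunting on the names shown here is unlikely to generalize. A more durable heuristic is the shape of the Run value itself: an AutoIt interpreter launching a compiled .a3x script. The sketch below applies that check to Run values that have already been exported to a dictionary. The input format and function name are hypothetical, and legitimate AutoIt deployments would also match.

```python
def suspicious_run_entries(run_values):
    """Return names of Run values whose command launches an .a3x script via AutoIt.

    `run_values` maps a Run value name to its command line, for example as
    exported from HKCU\\Software\\Microsoft\\Windows\\CurrentVersion\\Run.
    """
    hits = []
    for name, command in run_values.items():
        lowered = command.lower()
        if "autoit3.exe" in lowered and ".a3x" in lowered:
            hits.append(name)
    return hits


exported = {
    "daaadeh": r'"C:\ProgramData\hedfdfd\Autoit3.exe" C:\ProgramData\hedfdfd\dgfeabe.a3x',
    "OneDrive": r'"C:\Program Files\Microsoft OneDrive\OneDrive.exe" /background',
}
print(suspicious_run_entries(exported))
```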

The DarkGate sample executed by SafeFilter.exe contains 78 remote commands, some of which can be seen below with their intended function. Every loop, the malware will re-send the text of the active window, user idle time, and whether or not the malware instance has admin rights, before checking for a command.

Command ID	Function
1000	Sleep for a randomized amount of time.
1004	Use MessageBoxA to display the message `test msg`.
1044,1045,1046	Click the user’s mouse at specified screen coordinates using `SetCursorPos` and successive calls to `mouse_event`. 1044 for double left-click. 1045 for single left click. 1046 for single right click.
1049	Create a remote shell via `powershell.exe`.
1059	Terminate process by PID.
1061	Inject DarkGate shellcode into a specified process or an Edge/Chrome process if none is selected. The shellcode is then executed via `ResumeThread`.
1062,1063,1064	Inject DarkGate shellcode into a specified process or `cmd.exe` if none is selected. The shellcode is then executed via `CreateRemoteThread`.
1066	Remove infection files by using `cmd.exe` to delete the staging directories `C:\ProgramData\hedfdfd` and `c:\temp\`.
1071	Steal `sitemanager.xml` and `recentservers.xml` from `%APPDATA%\FileZilla\` if present.
1079	If admin, delete stored credentials found using cmdkey.
1080	Rename browser directories for Firefox, Chrome, and Brave if present after terminating the related browser executable. Attempt to steal Opera cookies if present, after terminating the process.
1081	Use NTAPI calls `RtlAdjustPrivilege` and `NtRaiseHardError` to crash the system.
1083	Use the `shutdown` command to turn the system off.
1084	Use the `shutdown` command to restart the system.
1089	If 1=Yes in config, reinfect system with AU3 payloads.
1093	Create a remote shell via `cmd.exe`.
1097	Infect system with AU3 variant. Creates the files `script.a3x` and `Autoit3.exe` in `c:\temp` and then executes `script.a3x` via `Autoit3.exe` using `CreateProcessA`.
1104	Infect system with AHK variant. Creates the files `script.ahk`, `test.txt`, and `AutoHotkey.exe` in `c:\temp` and then executes `script.ahk` via `AutoHotkey.exe` using `CreateProcessA`.
1108	Infect system with DLL variant. Creates the files `libcurl.dll`, `test.txt`, and `GUP.exe` in `c:\temp` and then executes `GUP.exe` via `CreateProcessA`.
1111	Create the files `ransom.txt` and `decrypter.exe` in `c:\temp`. Terminate `decrypter.exe` if already running and then execute `decrypter.exe` using `CreateProcessA`. Likely ransomware deployment method.

DarkGate Remote Command Related Strings	–	–	–	–
U_Binder	U_BotUpdate	U_Constantes	U_FTPRecovery	U_FileManager
U_FileManagerMisc	U_GetScreens	U_HVNC	U_HVNC_7
U_HWID	U_InfoRecovery	U_InjectOnFly	U_Keylogger	U_LNKStartup
U_MemExecute	U_MemExecuteMisc	U_RemoteScreen	U_SysApi	U_SysNtReadWrite
U_miniclipboard	u_AntiAntiStartup	u_Antis	u_AudioRecord	u_CustomBase64
u_ExtraMisc	u_HollowInstall	u_InjectEP	u_InvokeBSOD	u_RDPRecovery
u_Ransomware	u_ReadCookies	u_ReverseShell	u_RootkitMutex	u_Settings
u_SettingsPad	u_ShellcodeEP	u_UnlockCookies	u_loadpe	hxxps://ipinfo[.]io/ip

Mitigation Guidance

Rapid7 recommends taking the following precautions to limit exposure to these types of attacks:

Restrict the ability for external users to contact users via Microsoft Teams to the greatest extent possible. This can be done, for example, by blocking all external domains or creating a white/black list. Microsoft Teams will allow all external requests by default. For more information, see this reference.
Standardize remote management tools within the environment. For unapproved tools, block known hashes and domains to prevent usage. Hash blocking can be done, for example, via Windows AppLocker or an endpoint protection solution.
Provide user awareness training regarding the social engineering campaign. Familiarize users with official help desk and support procedures to enable them to spot and report suspicious requests.
Standardize VPN access. Traffic from known low cost VPN solutions should be blocked at a firewall level if there is no business use case.

Rapid7 Customers

InsightIDR, Managed Detection and Response, and Managed Threat Complete customers have existing detection coverage through Rapid7’s expansive library of detection rules. Rapid7 recommends installing the Insight agent on all applicable hosts to ensure visibility into suspicious processes and proper detection coverage. Below is a non-exhaustive list of detections that are deployed and will alert on behavior related to this activity:

Detections
Suspicious Chat Request – Potential Social Engineering Attempt
Initial Access – Potential Social Engineering Session Initiated Following Chat Request
Suspicious Conversation – Potential Social Engineering Message Interaction
Attacker Technique – Process Executed Using Nt Object Path
Suspicious Process – Enumeration Burst via ShellExecute
Attacker Technique – Renamed Kaspersky Dump Writer
Ransomware – Possible Black Basta Related Binary Execution
Credential Access – Steal or Forge Kerberos tickets
Suspicious Process – Diskshadow (Windows Server) Delete Shadow Copies
Non-Approved Application – Remote Management and Monitoring (RMM) Tools

MITRE ATT&CK Techniques

Tactic	Technique	Procedure
Resource Development	T1587.001: Develop Capabilities: Malware	The threat actor is actively developing new malware to distribute.
Impact	T1498: Network Denial of Service	The threat actor overwhelms email protection solutions with spam.
Initial Access	T1566.004: Phishing: Spearphishing Voice	The threat actor calls impacted users and pretends to be a member of their organization’s IT team to gain remote access.
Defense Evasion	T1140: Deobfuscate/Decode Files or Information	The threat actor encrypts some zip archive payloads with a password.
Defense Evasion	T1055.002: Process Injection: Portable Executable Injection	Multiple payloads executed by the threat actor utilize local PE injection.
Defense Evasion	T1620: Reflective Code Loading	Multiple payloads executed by the threat actor load and execute shellcode.
Credential Access	T1649: Steal or Forge Authentication Certificates	The threat actor has distributed numerous signed malware payloads.
Credential Access	T1056.001: Input Capture: Keylogging	The threat actor runs an executable that harvests the user’s credentials.
Credential Access	T1558.003: Steal or Forge Kerberos Tickets: Kerberoasting	The threat actor has performed Kerberoasting after gaining initial access.
Discovery	T1033: System Owner/User Discovery	The threat actor enumerates asset and user information within the environment after gaining access.
Command and Control	T1572: Protocol Tunneling	The threat actor has attempted to use SSH reverse tunnels.
Command and Control	T1219: Remote Access Software	The threat actor has used QuickAssist, AnyDesk, ScreenConnect, TeamViewer, Level, and more, to facilitate remote access.

Indicators of Compromise

All indicators of compromise are available at the Rapid7 Labs GitHub repository.
