Tag Archives: Technical How-to

Build enterprise-scale log ingestion pipelines with Amazon OpenSearch Service

2025-08-21 Akhil B

Post Syndicated from Akhil B original https://aws.amazon.com/blogs/big-data/build-enterprise-scale-log-ingestion-pipelines-with-amazon-opensearch-service/

Organizations of all sizes generate massive volumes of logs across their applications, infrastructure, and security systems to gain operational insights, troubleshoot issues, and maintain regulatory compliance. However, implementing log analytic solutions presents significant challenges, including complex data ingestion pipelines and the need to balance cost and performance while scaling to handle petabytes of data.

Amazon OpenSearch Service addresses these challenges by providing high-performance search and analytics capabilities, making it straightforward to deploy and manage OpenSearch clusters in the AWS Cloud without the infrastructure management overhead. A well-designed log analytics solution can help support proactive management in a variety of use cases, including debugging production issues, monitoring application performance, or meeting compliance requirements.

In this post, we share field-tested patterns for log ingestion that have helped organizations successfully implement logging at scale, while maintaining optimal performance and managing costs effectively.

Solution overview

Organizations can choose from several data ingestion architectures, such as:

Log shippers like FluentBit or Fluentd to send logs directly
Amazon Data Firehose for serverless data delivery
Custom solutions using AWS Lambda
Amazon OpenSearch Ingestion for managed data ingestion

Irrespective of the chosen pattern, a scalable log ingestion architecture should comprise the following logical layers:

Collect layer – This is the initial stage where logs are gathered from various sources, including application logs, system logs, and more.
Buffer layer – This layer acts as a temporary storage layer to handle spikes in log volume and prevents data loss during downstream processing issues. This layer also maintains system stability during high load.
Process layer – This layer transforms the unstructured logs into structured formats while adding relevant metadata and contextual information needed for effective analysis.
Store layer – This layer is the final destination for processed logs (OpenSearch in this case), which supports various access patterns, including querying, historical analysis, and data visualization.

OpenSearch Ingestion offers a purpose-built, fully managed experience that simplifies the data ingestion process. In this post, we focus on using OpenSearch Ingestion to load logs from Amazon Simple Storage Service (Amazon S3) into an OpenSearch Service domain, a common and efficient pattern for log analytics.

OpenSearch Ingestion is a fully managed, serverless data ingestion service that streamlines the process of loading data into OpenSearch Service domains or Amazon OpenSearch Serverless collections. It’s powered by Data Prepper, an open source data collector that filters, enriches, transforms, normalizes, and aggregates data for downstream analysis and visualization.

OpenSearch Ingestion uses pipelines as a mechanism that consists of the following major components:

Source – The input component of a pipeline. It defines the mechanism through which a pipeline consumes records.
Buffer – A persistent, disk-based buffer that stores data across multiple Availability Zones to enhance durability. OpenSearch Ingestion dynamically allocates OCUs for buffering, which increases pricing as you may need additional OCUs to maintain ingestion throughput.
Processors – The intermediate processing units that can filter, transform, and enrich records into a desired format before publishing them to the sink. The processor is an optional component of a pipeline.
Sink – The output component of a pipeline. It defines one or more destinations to which a pipeline publishes records. A sink can also be another pipeline, so you can chain multiple pipelines together.

Because of its serverless nature, OpenSearch Ingestion automatically scales to accommodate varying workload demands, alleviating the need for manual infrastructure management while providing built-in monitoring capabilities. Users can focus on their data processing logic rather than spending time on operational complexities, making it an efficient solution for managing data pipelines in OpenSearch environments.

The following diagram illustrates the architecture of the log ingestion pipeline.

Let’s walk through how this solution processes Apache logs from ingestion to visualization:

The source application generates Apache logs that need to be analyzed and stores them in an S3 bucket, which acts as the central storage location for incoming log data. When a new log file is uploaded to the S3 bucket (ObjectCreate event), Amazon S3 automatically triggers an event notification that is configured to send messages to a designated Amazon Simple Queue Service (Amazon SQS) queue.
The SQS queue reliably manages and tracks the notifications of new files uploaded to Amazon S3, making sure the file event is delivered to the OpenSearch Ingestion pipeline. A dead-letter queue (DLQ) is configured to capture failed event processing.
The OpenSearch Ingestion pipeline monitors the SQS queue, retrieving messages that contain information about newly uploaded log files. When a message is received, the pipeline reads the corresponding log file from Amazon S3 for processing.
After the log file is retrieved, the OpenSearch Ingestion pipeline parses the content, and uses the OpenSearch Bulk API to efficiently ingest the processed log data into the OpenSearch Service domain, where it becomes available for search and analysis.
The ingested data can be visualized and analyzed through OpenSearch Dashboards, which provides a user-friendly interface for creating custom visualizations, dashboards, and performing real-time analysis of the log data with features like search, filtering, and aggregations.

In the following sections, we guide you through the steps to ingest application log files from Amazon S3 into OpenSearch Service using OpenSearch Ingestion. Additionally, we demonstrate how to visualize the ingested data using OpenSearch Dashboards.

Prerequisites

This post assumes you have the following:

An AWS account
The AWS Command Line Interface (AWS CLI) installed
The AWS CDK Toolkit installed
Python 3 installed

Deploy the solution

The solution uses a Python AWS Cloud Development Kit (AWS CDK) project to deploy an OpenSearch Service domain and associated components. This project demonstrates event-based data ingestion into the OpenSearch Service domain in a no code approach using OpenSearch Ingestion pipelines.

The deployment is automated using the AWS CDK and comprises the following steps:

Clone the GitHub repo.

git clone [email protected]:aws-samples/sample-log-ingestion-pipeline-for-amazon-opensearch-service.git

Create a virtual environment and install the Python dependencies:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Update the following environment variables in cdk.json:
1. domain_name: The OpenSearch domain to be created in your AWS account.
2. user_name: The user name for the internal primary user to be created within the OpenSearch domain.
3. user_password: The password for the internal primary user.

This deployment creates a public-facing OpenSearch domain but is secured through fine-grained access control (FGAC). For production workloads, consider deploying within a virtual private cloud (VPC) with additional security measures. For more information, see Security in Amazon OpenSearch Service.

Bootstrap the AWS CDK stack and initiate the deployment. Provide your AWS account number and the AWS Region where you want deploy the solution:

cdk bootstrap <Account ID>/<region>
cdk deploy --all

The process takes about 30–45 minutes to complete.

Verify the solution resources

When the previous steps are complete, you can check for the created resources.

You can confirm the existence of the stacks on the AWS CloudFormation console. As shown in the following screenshot, the CloudFormation stacks have been created and deployed by cdk bootstrap and cdk deploy.

On the OpenSearch Service console, under Managed clusters in the navigation pane, choose Domains. You can confirm the domain created.

On the OpenSearch Service console, under Ingestion in the navigation pane, choose Pipelines. You can see the pipeline apache-log-pipeline created.

Configure security options

To configure your security roles, complete the following steps:

On the AWS CloudFormation console, open the stack CdkIngestionStack, and on the Outputs tab, copy the Amazon Resource Name (ARN) of osi-pipeline-role.

Open the OpenSearch Service console in the deployed Region within your AWS account and choose the domain you created.
Choose the link for OpenSearch Dashboards URL.
In the login prompt, enter the user credentials that were provided in cdk.json.

After a successful login, the OpenSearch Dashboards console will be displayed.

If you’re prompted to select a tenant, select the Global tenant.
In the Security options, navigate to the Roles section and choose the all_access role.
On the all_access role page, navigate to mapped_users and choose Manage.
Choose Add another backend role under Backend roles and enter the IAM role ARN you copied.
Confirm by choosing Map.

Create an index template

The next step is to create an index template. Complete the following steps:

On the Dev Tools console, copy the contents of the file index_template.txt within the opensearch_object directory.
Enter the code in the Dev Tools console.

This index template defines the mapping and settings for our OpenSearch index.

Choose the play icon to submit the request and create a template.

In the Dashboard Management section, choose Saved Objects and choose Import.
Choose Import and choose the apache_access_log_dashboard.ndjson file within the opensearch_object directory.
Choose Check for existing objects.
Choose Automatically overwrite conflicts and choose Import.

Ingest data

Now you can proceed with the data ingestion.

On the Amazon S3 console, open the S3 bucket opensearch-logging-blog-<Account ID>.
Upload the data file apache_access_log.gz (within the apache_log_data directory). The file can be uploaded in any prefix.

For this solution, we use Apache access logs as our example data source. Although this pipeline is configured for Apache log format, it can be modified to support other log types by adjusting the pipeline configuration. See Overview of Amazon OpenSearch Ingestion for details about configuring different log formats.

After a few minutes, navigate to the Discover tab in OpenSearch Dashboards, where you can find that the data is ingested.
Confirm that the apache* index pattern is selected.

5. On the Dashboards tab, choose Apache Log Dashboard.

The dashboard will be populated by the data and visuals should be displayed.

Operational best practices

When designing your log analytics platform on OpenSearch Service, make sure you follow the recommended operational best practices for cluster configuration, data management, performance, monitoring, and cost optimization. For detailed guidance, refer to Operational best practices for Amazon OpenSearch Service.

Clean up

To avoid ongoing charges for the resources that you created, delete them by completing the following steps:

On the Amazon S3 console, open the bucket opensearch-logging-blog-<Account ID> and choose Empty.
Follow the prompts to delete the contents of the bucket.
Delete the AWS CDK stacks using the following command:

cdk destroy --all --force

Conclusion

As organizations continue to generate increasing volumes of log data, having a well-architected logging solution becomes crucial for maintaining operational visibility and meeting compliance requirements.

Implementing a robust logging infrastructure requires careful planning. In this post, we explored a field-tested approach in building a scalable, efficient, and cost-effective logging solution using OpenSearch Ingestion.

This solution serves as a starting point that can be customized based on specific organizational needs while maintaining the core principles of scalability, reliability, and cost-effectiveness.

Remember that logging infrastructure is not a “set-and-forget” system. Regular monitoring, periodic reviews of storage patterns, and adjustments to index management policies will help make sure your logging solution continues to serve your organization’s evolving needs effectively.

To dive deeper into OpenSearch Ingestion implementation, explore our comprehensive Amazon OpenSearch Service Workshops, which include hands-on labs and reference architectures. For additional insights, see Build a serverless log analytics pipeline using Amazon OpenSearch Ingestion with managed Amazon OpenSearch Service. You can also visit our Migration Hub if you’re ready to migrate legacy or self-managed workloads to OpenSearch Service.

About the authors

Akhil B is a Data Analytics Consultant at AWS Professional Services, specializing in cloud-based data solutions. He partners with customers to design and implement scalable data analytics platforms, helping organizations transform their traditional data infrastructure into modern, cloud-based solutions on AWS. His expertise helps organizations optimize their data ecosystems and maximize business value through modern analytics capabilities.

Ramya Bhat is a Data Analytics Consultant at AWS, specializing in the design and implementation of cloud-based data platforms. She builds enterprise-grade solutions across search, data warehousing, and ETL that enable organizations to modernize data ecosystems and derive insights through scalable analytics. She has delivered customer engagements across healthcare, insurance, fintech, and media sectors.

Chanpreet Singh is a Senior Consultant at AWS, specializing in the Data and AI/ML space. He has over 18 years of industry experience and is passionate about helping customers design, prototype, and scale Big Data and Generative AI applications using AWS native and open-source tech stacks. In his spare time, Chanpreet loves to explore nature, read, and spend time with his family.

A Complete Guide to Resource Sharing for AWS End User Messaging

2025-08-20 Brett Ezell

Post Syndicated from Brett Ezell original https://aws.amazon.com/blogs/messaging-and-targeting/a-complete-guide-to-resource-sharing-for-aws-end-user-messaging/

Introduction

Do you need to send SMS across multiple AWS accounts? Or have you ever wanted to use the same specific 10DLC phone number or branded Sender ID across those accounts? Perhaps your development team needs to test an application in a sandbox account using a production-ready number, or you’re migrating a workload to a new account and need to ensure your customer communications aren’t disrupted. Centralizing your messaging resources across accounts improves efficiency and branding, while lowering the risk in compliance gaps..

In this step-by-step guide, we will show how to solve this challenge by sharing your AWS End User Messaging resources across multiple AWS accounts using AWS Resource Access Manager (AWS RAM). By creating a single sharing account for your messaging resources—like phone numbers, Sender IDs, and opt-out lists—and securely sharing them with your other “consuming” accounts, you can build a more efficient, secure, and scalable communication platform.

Common Use Cases for Resource Sharing

Important: resource sharing with AWS RAM is a regional feature. You can only share resources with accounts within the same AWS Region where those resources are located.

Centralizing and sharing resources is a powerful pattern that addresses several common customer needs:

Testing in a Sandbox Environment: Allows development teams to test applications using production-ready phone numbers or Sender IDs in an isolated sandbox account, without giving them access to production configurations.
Simplified Registration and Onboarding: Share an existing pre-registered 10DLC number or Sender ID with a new account that has not yet completed its own registration process, enabling it to start sending messages more quickly.
Seamless Account Transitions: When migrating an application or workload to a new AWS account, you can share the existing origination identities. This makes certain that your phone numbers and Sender IDs remain consistent during the transition, preventing any disruption to your customer-facing communications.

This guide will walk you through the step-by-step process of sharing your AWS End User Messaging resources.

Shareable AWS End User Messaging Resources

You can share the following AWS End User Messaging resources using AWS RAM:

Phone Numbers: Share your dedicated short codes, 10DLCs, long codes, and toll-free numbers. This allows different accounts to send messages using a centralized pool of numbers.
Sender IDs: Share alphanumeric sender IDs to maintain consistent branding in one-way SMS messages across your accounts.
Opt-out Lists: Centralize your opt-out management to ensure regulatory compliance. When a user opts out of messaging from one account, they are opted out across all accounts using that shared list. This is especially powerful when used with pools, as you can associate a pool with a specific opt-out list, ensuring all numbers in that pool adhere to the same primary list. As a best practice, you should create and share a dedicated opt-out list rather than relying on the default list for each account.
Pools: Share your pools of phone numbers and sender IDs to manage origination identities at scale. Pools provide benefits like automatic failover and apply settings like opt-out lists or two-way SMS configurations to the entire pool.
- Important: for a shared Opt-out list or pool to be functional, all of its member resources (the phone numbers and/or Sender IDs within it) must also be included in the same AWS RAM resource share.

Understanding AWS RAM Fundamentals

Before sharing your End User Messaging resources, it’s essential to understand the core concepts of AWS RAM.

Resource Share: This is the central component in AWS RAM. A resource share consists of three elements:
- The resources to be shared (such as phone numbers, or opt-out lists).
- The principals (AWS accounts, OUs, or an entire organization) with whom you are sharing.
- The managed permissions that define what actions the principals can perform on the shared resources.

Important: The supported resources of AWS End User Messaging are shareable with AWS accounts, Organizations, and OUs, but not with individual AWS Identity and Access Management (IAM) roles or users. This restriction ensures that resource sharing remains at the account level, maintaining clear boundaries and simplifying access management for your End User Messaging infrastructure.

Sharing Account vs. Consuming Account:
- The sharing account (or owner account) is the AWS account that owns the resources and creates the resource share.
- When a principal (such as an AWS account) is granted access to a resource share, it becomes a consuming account. It can use the shared resources according to the permissions granted and pays for its own usage of those resources, not for the resources themselves. For example: The consuming account pays for the volume of SMS sent by a shared number but the sharing account pays for any fees associated with owning that actual number.
AWS Organizations Integration: While you can share resources with individual AWS accounts, the most powerful way to use AWS RAM is in conjunction with AWS Organizations. This service allows you to centrally manage and govern multiple AWS accounts under a single umbrella. When you enable sharing within your organization, you can share resources with all accounts in the organization, or with specific Organizational Units (OUs), seamlessly and without needing to send and accept individual invitations. This sharing is only possible between accounts that reside in the same AWS Region.
Managed Permissions: AWS RAM uses managed permissions to control access.
- AWS managed permissions are predefined permission sets created and maintained by AWS for common use cases. For AWS End User Messaging, the key permission is AWSRAMDefaultPermissionSmsVoice, which allows consumers to use the resources for sending messages but not for deleting or modifying them.
- Customer managed permissions can be created for more granular control over shared resources.
Resource-Based Policies: Behind the scenes, AWS RAM works by creating and managing resource-based policies for you. These policies are what actually grant the consuming accounts access to the shared resources.

To better illustrate these sharing models, the following diagrams show how a Sharing Account can share its AWS End User Messaging resources using different strategies:

Diagram 1: Direct Account-to-Account Sharing:

Diagram 2: Sharing with an Entire AWS Organization:

Diagram 3: Sharing with a Specific Organizational Unit (OU):

Prerequisites and Setup

For the following walkthrough, we will demonstrate how to configure the setup for Diagram 1: Direct Account-to-Account Sharing. However, the steps for managing and using the resource share are similar for all three scenarios. Before you begin, ensure your environment is set up correctly.

Note for AWS Organizations Users: When your account is managed by AWS Organizations, you can take advantage of that to share resources more easily. With or without Organizations, a user can share with individual accounts. However, if your account is in an organization, then you can share with individual accounts, or with all accounts in the organization or in an OU without having to enumerate each account.

If you plan to share resources using AWS Organizations (as shown in Diagram 2 or Diagram 3), you must complete the following prerequisite steps from your organization’s management account before creating a resource share:

1. Enable all features in your organization:

aws organizations enable-all-features

2. Enable resource sharing with AWS RAM: This creates the necessary service-linked role.

aws ram enable-sharing-with-aws-organization

1. Required IAM Permissions

The IAM user or role performing these actions needs permissions for both AWS RAM and AWS End User Messaging. The following policy grants the necessary permissions to manage resource shares.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RAMResourceShareManagement",
            "Effect": "Allow",
            "Action": [
                "ram:UpdateResourceShare",
                "ram:DeleteResourceShare",
                "ram:AssociateResourceShare",
                "ram:DisassociateResourceShare"
            ],
            "Resource": "arn:aws:ram:*:*:resource-share/*"
        },
        {
            "Sid": "DiscoveryAndCreationPermissions",
            "Effect": "Allow",
            "Action": [
                "ram:CreateResourceShare",
                "ram:GetResourceShares",
                "ram:ListResources",
                "organizations:ListAccounts",
                "organizations:DescribeOrganization",
                "pinpoint-sms-voice-v2:DescribePhoneNumbers",
                "pinpoint-sms-voice-v2:DescribeSenderIds",
                "pinpoint-sms-voice-v2:DescribeOptOutLists",
                "pinpoint-sms-voice-v2:DescribePools"
            ],
            "Resource": "*"
        }
    ]
}

Note on Least Privilege: This policy follows the security best practice of granting least privilege. The first statement scopes modification permissions to only AWS RAM resource shares. The second statement grants permissions for discovery actions (like Describe* and List*) and the ram:CreateResourceShare action, which require "Resource": "*" as they do not operate on a specific, pre-existing resource.

2. Regionality Requirement

Important Reminder: resource sharing with AWS RAM is a regional feature. You can only share resources with accounts within the same AWS Region where those resources are located.

For example, a resource in us-east-1 can only be shared with other accounts in us-east-1, regardless of where those accounts operate other resources. Ensure that the resources you intend to share and the accounts that you anticipate sharing with are each considering the same Region for this process.

Creating and Managing Resource Shares (Sharing Account Actions)

This section provides a step-by-step guide to sharing your resources using the AWS CLI. We will walk through creating a resource share, associating and disassociating resources, and checking the status of your shares.

Step 1: Create an Empty Resource Share

First, create the resource share. Think of this as an empty container. You will associate principals (the consuming accounts) and resources (the phone numbers, etc.) with this share.

In the command below, we will create a share named EUM-Shared-Resources for an external account.

# Create a resource share and grant default End User Messaging permissions # Replace 123456789012 with the consuming account's ID
aws ram create-resource-share \
    --name "EUM-Shared-Resources" \
    --principals "123456789012" \
    --permission-arns "arn:aws:ram::aws:permission/AWSRAMDefaultPermissionSmsVoice" \
    --allow-external-principals \
    --region us-east-1

--principals: Specify one or more AWS account IDs.
--allow-external-principals: This flag is required when sharing with accounts that are not part of your AWS Organization.

Expected Response: A successful command returns a JSON object describing the new resource share. Note that allowExternalPrincipals is now true.

{
    "resourceShare": {
        "resourceShareArn": "arn:aws:ram:us-east-1:111122223333:resource-share/a1b2c3d4-5678-90ab-cdef-example11111",
        "name": "EUM-Shared-Resources",
        "owningAccountId": "111122223333",
        "allowExternalPrincipals": false,
        "status": "ACTIVE",
        "tags": [],
        "featureSet": "STANDARD"
    }
}

For the following sections and when specifying resource ARNs, ensure you’re using the correct format for AWS End User Messaging resources:

Phone numbers: arn:aws:sms-voice:region:account-id:phone-number/phonenumber-id
Sender IDs: arn:aws:sms-voice:region:account-id:sender-id/senderid
Opt-out lists: arn:aws:sms-voice:region:account-id:opt-out-list/optoutlist-id
Pools: arn:aws:sms-voice:region:account-id:pool/pool-id

Replace ‘region‘, ‘account-id‘, and the specific resource IDs with your actual values.

Step 2: Associate Resources with the Share

Now that you have your “container,” you can add resources to it. The associate-resource-share command links one or more of your End User Messaging resources to the share you just created, making them available to the principals.

# Define the ARN of the resource share from the previous step
RESOURCE_SHARE_ARN="arn:aws:ram:us-east-1:111122223333:resource-share/a1b2c3d4-5678-90ab-cdef-111111111111"

# Associate a phone number and a pool with the share # Replace the resource-arns with your actual resource ARNs
aws ram associate-resource-share \
    --resource-share-arn "$RESOURCE_SHARE_ARN" \
    --resource-arns \
        "arn:aws:sms-voice:us-east-1:111122223333:phone-number/phonenumber-a1b2c3d4" \
        "arn:aws:sms-voice:us-east-1:111122223333:pool/pool-b2c3d4e5" \
    --region us-east-1

Expected Response: A successful association returns a JSON object confirming the association and showing its status. The status will initially be ASSOCIATING and will transition to ASSOCIATED once complete.

Note: The association process is asynchronous. We’ll show you how to verify the completion status in the next step using the get-resource-shares and list-resources commands. It’s important to confirm the status has changed to ASSOCIATED before attempting to use the shared resources.

Step 3: Verify the Status and contents of the Share

Before making changes, it’s good practice to verify what’s in the share. Use get-resource-shares to check the status and list-resources to see the contents. This process helps ensure that all intended resources are properly associated and accessible to the principals you’ve designated.

# Verify the association status is ASSOCIATED
aws ram get-resource-shares \
    --resource-owner SELF \
    --name "EUM-Shared-Resources" \
    --association-status ASSOCIATED \
    --region us-east-1

Expected Response: If the command returns no results, wait a few moments and try again. The association process is typically quick but can sometimes take up to a few minutes.

{
    "resourceShares": [
        {
            "resourceShareArn": "arn:aws:ram:us-east-1:111122223333:resource-share/12345678-abcd-1234-efgh-111122223333",
            "name": "EUM-Shared-Resources",
            "owningAccountId": "111122223333",
            "allowExternalPrincipals": true,
            "status": "ACTIVE",
            "creationTime": "2023-07-01T12:00:00.000Z",
            "lastUpdatedTime": "2023-07-01T12:00:00.000Z",
            "featureSet": "STANDARD"
        }
    ]
}

Review the output carefully to ensure all intended resources are listed. If any resources are missing, you may need to reassociate them using the associate-resource-share command.

Expected Response (list-resources): This command will return a list of JSON objects, each representing a resource in the share.

# List the ARNs of all resources currently in the share
aws ram list-resources \
    --resource-owner SELF \
    --resource-share-arns "$RESOURCE_SHARE_ARN" \
    --region us-east-1

Review the output carefully to ensure all intended resources are listed. If any resources are missing, you may need to reassociate them using the associate-resource-share command.

# List the ARNs of all resources currently in the share
aws ram list-resources \
    --resource-owner SELF \
    --resource-share-arns "$RESOURCE_SHARE_ARN" \
    --region us-east-1

Expected Response (list-resources): This command will return a list of JSON objects, each representing a resource in the share.

{
    "resources": [
        {
            "arn": "arn:aws:sms-voice:us-east-1:111122223333:phone-number/phonenumber-a1b2c3d4",
            "type": "sms-voice:PhoneNumber",
            "resourceShareArn": "arn:aws:ram:us-east-1:111122223333:resource-share/a1b2c3d4-5678-90ab-cdef-example11111",
            "status": "AVAILABLE"
        },
        {
            "arn": "arn:aws:sms-voice:us-east-1:111122223333:pool/pool-b2c3d4e5",
            "type": "sms-voice:Pool",
            "resourceShareArn": "arn:aws:ram:us-east-1:111122223333:resource-share/a1b2c3d4-5678-90ab-cdef-example11111",
            "status": "AVAILABLE"
        }
    ]
}

Step 4: Disassociate Specific Resources from the Share

To stop sharing a specific resource, you use the disassociate-resource-share command. You must provide the ARN of the resource you wish to remove. This gives you granular control, allowing you to remove one resource while continuing to share others.

# Disassociate only the phone number from the share
aws ram disassociate-resource-share \
    --resource-share-arn "$RESOURCE_SHARE_ARN" \
    --resource-arns "arn:aws:sms-voice:us-east-1:111122223333:phone-number/phonenumber-a1b2c3d4" \
    --region us-east-1

Expected Response: The response will be nearly identical to the associate response, confirming the disassociation request. The status will be DISASSOCIATING.

{
    "resourceShareAssociations": [
        {
            "resourceShareArn": "arn:aws:ram:us-east-1:111122223333:resource-share/a1b2c3d4-5678-90ab-cdef-example11111",
            "associatedEntity": "arn:aws:sms-voice:us-east-1:111122223333:phone-number/phonenumber-a1b2c3d4",
            "associationType": "RESOURCE",
            "status": "DISASSOCIATING",
            "external": false
        }
    ]
}

How to Use Shared Resources

Once resources are shared, users in the consuming accounts can discover and use them for sending messages.

Step 1: Discovering Shared Resources

From a consuming account, you can list resources that have been shared with you by using the --filters parameter in the describe-* commands.

Note: Shared resources are discoverable via the AWS CLI and SDKs but will not appear in the AWS Management Console of the consuming account. This is expected behavior, as the resources are owned by the sharing account.

# List phone numbers shared with your account
aws pinpoint-sms-voice-v2 describe-phone-numbers \
    --filters Name=shared-with-me,Values=true \
    --region us-east-1
# List sender IDs shared with your account
aws pinpoint-sms-voice-v2 describe-sender-ids \
--filters Name=shared-with-me,Values=true \
--region us-east-1

# List pools shared with your account
aws pinpoint-sms-voice-v2 describe-pools \
--filters Name=shared-with-me,Values=true \
--region us-east-1

# List shared opt-out lists with region specification
aws pinpoint-sms-voice-v2 describe-opt-out-lists \
--filters Name=shared-with-me,Values=true \
--region us-east-1

Expected Response: The command returns a JSON object listing the shared resources, including their ARNs, which you will need for sending messages.

{
    "PhoneNumbers": [
        {
            "PhoneNumberArn": "arn:aws:sms-voice:us-east-1:111122223333:phone-number/phonenumber-a1b2c3d4",
            "PhoneNumberId": "phonenumber-a1b2c3d4",
            "PhoneNumber": "+12065550100",
            "Status": "ACTIVE",
            "MessageType": "TRANSACTIONAL",
            "TwoWayEnabled": true,
            "CreatedTimestamp": "2023-10-26T14:34:56.123Z"
        }
    ]
}

Step 2: Sending Messages with Shared Resources

Important: When using shared resources, consuming accounts must specify the full ARN of the shared resource in API calls. This differs from resource owners, who can use either the resource ID, ARN, or the number directly. You can specify the ARN of an individual phone number or a pool as the origination-identity.

# Send an SMS using a shared Phone Number ARN (consuming account MUST use ARN)
aws pinpoint-sms-voice-v2 send-text-message \
    --destination-phone-number "+12065550199" \
    --origination-identity "arn:aws:sms-voice:us-east-1:111122223333:phone-number/phonenumber-a1b2c3d4" \
    --message-body "Hello from a shared number!" \
    --region us-east-1

# Send an SMS using a shared Pool ARN (consuming account MUST use ARN)
aws pinpoint-sms-voice-v2 send-text-message \
    --destination-phone-number "+12065550199" \
    --origination-identity "arn:aws:sms-voice:us-east-1:111122223333:pool/pool-b2c3d4e5" \
    --message-body "Hello from a shared pool!" \
    --region us-east-1

Expected Response: A successful send-text-message call will return a MessageId, which confirms that the service has accepted the message for delivery.

{
    "MessageId": "a1b2c3d4-5678-90ab-cdef-example22222"
}

Message Delivery Reporting:

Once a message is sent, understanding its delivery status is crucial for ensuring your communications are effective. AWS End User Messaging provides several mechanisms for tracking message delivery, giving you a multi-layered approach to reporting.

Delivery Receipts (DLRs):

For traditional, carrier-provided Delivery Receipts (DLRs), which can sometimes take up to 72 hours to be returned, you must configure an event destination. This is the most common method for confirming that a message has reached the recipient’s handset, and is achieved through a Configuration Set.

For shared resources:

The configuration set must be created and managed in the sharing account.
The consuming account must then reference the ARN of the configuration set when sending messages.

# Example for consuming account
aws pinpoint-sms-voice-v2 send-text-message 
    --destination-phone-number "+12065550199" 
    --origination-identity "arn:aws:sms-voice:us-east-1:111122223333:phone-number/phonenumber-a1b2c3d4" 
    --message-body "Hello from a shared number!" 
    --configuration-set-name "arn:aws:sms-voice:us-east-1:111122223333:configuration-set/MyConfigSet" 
    --region us-east-1

For a detailed walkthrough, see our companion blog post, How to Send SMS Using Configuration Sets with AWS End User Messaging.

Message Feedback:

For more immediate, application-driven insights, you can use the Message Feedback feature. This allows you to programmatically mark messages as “delivered” based on a user’s action, such as using a one-time password (OTP) or clicking a link in the message. This provides a real-time confirmation loop that is independent of carrier DLRs.

Amazon CloudWatch:

To monitor these events at scale, you can stream them to Amazon CloudWatch Logs to track key performance indicators like the number of messages sent and delivered, and to set up alerts based on your specific business needs.

To set up comprehensive reporting:

Configure an event destination for DLRs and detailed status events.
Set up CloudWatch dashboards and alerts for ongoing monitoring.

This multi-layered approach provides both immediate feedback and long-term delivery insights, allowing you to optimize your messaging strategy and quickly identify potential delivery issues.

Troubleshooting Common Issues

Permission Denied Errors: If a consuming account cannot access a shared resource, verify that the consuming account’s IAM policies include the necessary permissions. Here’s an example policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "pinpoint-sms-voice-v2:SendTextMessage",
                "pinpoint-sms-voice-v2:SendVoiceMessage",
                "pinpoint-sms-voice-v2:DescribePhoneNumbers",
                "pinpoint-sms-voice-v2:DescribeSenderIds",
                "pinpoint-sms-voice-v2:DescribeOptOutLists",
                "pinpoint-sms-voice-v2:DescribePools"
            ],
            "Resource": "*"
        }
    ]
}

Resource Not Visible: Remember that shared resources do not appear in the consuming account’s AWS Management Console. If the describe-* commands with the shared-with-me filter return no results, ensure the resource share status is ACTIVE in the sharing account.
- If sharing via AWS Organizations, confirm the consuming account is correctly placed in the specified OU. You can find more information on managing OUs in the AWS Organizations User Guide.
CLI Command Fails: If a command fails with a “not found” or “invalid parameter” error, it is often due to an incorrect ARN. Double-check that the ARNs for resources, principals, and the resource share itself are correct. A Permission Denied error, on the other hand, points to an IAM policy issue..

Best Practices and Considerations

Security: Always follow the principle of least privilege. Use AWS managed permissions like AWSRAMDefaultPermissionSmsVoice where possible and create customer-managed permissions only for specific, granular requirements.
Cost: The sharing account is billed for provisioning the resources (e.g., the monthly cost of a phone number). Consuming accounts are billed for their usage of those shared resources (e.g., the cost per message sent). There are no additional costs for using AWS RAM.
Throughput and Quotas: Resource throughput quotas (e.g., messages per second) are shared along with the resource. High volume sending from multiple consuming accounts using the same shared number or pool, could collectively hit the service quota, which may result in throttling. Plan your usage accordingly or request quota increases if necessary.

Conclusion

This guide has equipped you to centralize your AWS End User Messaging resources using AWS Resource Access Manager. By implementing this strategy, you can directly address the common challenges of a multi-account environment: maintaining consistent branding with shared Sender IDs, ensuring comprehensive compliance with centralized opt-out lists, and reducing operational overhead by managing resources in one place.

We have walked through the entire lifecycle, from the initial prerequisites in AWS Organizations and IAM, to the step-by-step CLI commands for creating shares, associating resources, and enabling consuming accounts to use them. By applying these techniques and keeping the best practices for security and throughput in mind, you are now able to build a more efficient, secure, and scalable communication platform across your entire AWS ecosystem.

Under the hood: how AWS Lambda SnapStart optimizes function startup latency

2025-08-19 Ayush Kulkarni

Post Syndicated from Ayush Kulkarni original https://aws.amazon.com/blogs/compute/under-the-hood-how-aws-lambda-snapstart-optimizes-function-startup-latency/

When building applications using AWS Lambda, optimizing function startup is an important step to improve performance for latency sensitive applications. The largest contributor to startup latency (often referred to as cold start time) is the time that Lambda spends initializing your function code. Lambda SnapStart is a feature available for Java, Python, and .NET runtimes that helps reduce variable cold start latency from several seconds (or higher) to as low as sub-second. SnapStart typically needs zero or minimal changes to your application code and makes it easier to build highly responsive and scalable applications without implementing complex performance optimizations. This post explains how SnapStart works under the hood and provides recommendations to improve application performance when using SnapStart.

If your function already initializes within hundreds of milliseconds, then AWS recommends using Lambda Provisioned Concurrency to achieve double-digit millisecond startup latency.

What is a cold-start?

Lambda runs your function code in an isolated, secure execution environment that uses Firecracker microVM technology. When you first invoke a Lambda function, Lambda creates a new execution environment for the function to run in. Lambda downloads your function code, starts the language runtime, and runs your function initialization code, which is code outside the handler. This initialization process (INIT) is called a cold start. Then, Lambda runs your function handler code to invoke the function. A Lambda execution environment only handles a single invoke request at a time. The following figure shows the lifecycle of a typical invocation request.

Figure 1. Function invocation lifecycle without SnapStart

After the function finishes running, Lambda doesn’t stop the execution environment right away. When your function receives another invocation request, Lambda attempts to route the request to the idle but already running execution environment. As the INIT process has already run for this execution environment, this invoke is called a warm start. When more traffic arrives than Lambda has available idle execution environments, Lambda initializes new execution environments to serve the additional requests, performing the cold start initialization process again.

The last step of the cold start, initializing function code, typically takes the longest. This depends on the startup tasks that you execute in your code and the programming language runtime or framework you use. For languages such as Java and .NET, startup latency is impacted by just-in-time compilation of static code in loaded classes. For Python, it can be impacted if your executed code contains numerous or large modules. Other startup tasks, such as downloading machine learning (ML) models, can also take several seconds to complete, which adds to your function’s initialization latency. SnapStart is designed to optimize this last step of the cold start process and achieves this in three stages.

Stage 1: Snapshotting your Lambda function

When using SnapStart, the Lambda execution environment lifecycle changes. When you enable SnapStart for a particular function, publishing a new function version triggers the snapshotting process. The process runs the function initialization phase and takes an immutable, encrypted Firecracker microVM snapshot of the memory and disk state of the initialized execution environment, caching and chunking the snapshot for reuse. Code paths that are not executed during initialization, such as classes loaded on-demand through dependency injection, are not included in your function’s snapshot. To improve snapshot efficiency, proactively execute code paths during the initialization phase, or use runtime hooks to run code before Lambda creates a snapshot.

Snapshot creation can take a few minutes, during which your function version remains in the PENDING state, becoming ACTIVE when the snapshot is ready.

When you subsequently invoke your function, Lambda restores new execution environments from this snapshot. This optimization makes the invocation time faster and more predictable, because creating new a execution environment no longer requires an initialization.

The following figure shows the lifecycle of a SnapStart configured function.

Diagram illustrating how AWS Lambda SnapStart works. The top section shows the 'Publish Version' phase, where the function is initialized ahead of time by creating the execution environment, downloading the code, starting the runtime, and initializing the function code. At the end of this phase, a microVM snapshot is created. The bottom section shows the 'Request Lifecycle' using SnapStart: each new execution environment resumes from the pre-initialized microVM snapshot and immediately invokes the Lambda handler. This allows multiple environments to start faster by skipping initialization steps.

Figure 2. Function invocation lifecycle with SnapStart

After Lambda creates a snapshot, it periodically regenerates it to apply security patches, runtime updates, and software upgrades. Your invocation requests continue to work throughout the regeneration process.

Stage 2: Storing snapshots for low-latency retrieval at Lambda scale

Lambda operates at a high scale, processing tens of trillions of invocation requests every month. To efficiently manage and retrieve snapshots at this volume of traffic, Lambda uses storage and caching components. These consist of three layers: Amazon S3 for durable storage, a dedicated distributed cache, and a local cache on Lambda worker nodes.

Lambda stores function snapshots in Amazon S3, dividing them into 512 KB chunks to optimize retrieval latency. Retrieval latency from Amazon S3 can take up to hundreds of milliseconds for each 512 KB chunk. Therefore, Lambda uses a two-layer cache to speed-up snapshot retrieval.

When you enable SnapStart, during the optimization process, Lambda stores snapshot chunks in a layer two (L2) cache. This layer is a dedicated distributed cache instance fleet purpose-built by Lambda. Lambda stores a separate copy of each snapshot per AWS Availability Zone (AZ). To balance performance with costs, Lambda may not proactively cache unused snapshot chunks, instead caching them after they are first accessed. Chunks remain cached in the L2 fleet as long as your function version is active. The snapshot restore performance from the L2 layer is typically single digit milliseconds for a 512 KB chunk.

Lambda also maintains a layer one (L1) cache located on Lambda worker nodes, the Amazon Elastic Compute Cloud (Amazon EC2) instances handling function invocations. This layer is available locally, thus it provides the fastest performance, typically 1 millisecond for a 512 KB chunk. Functions with more frequent invocations are more likely to have their snapshot chunks cached in this layer. Functions with fewer invocations are automatically evicted from this cache, because it is bound by the worker instance disk capacity. When a snapshot chunk is not available in the L1 cache, Lambda retrieves the chunk from the L2 cache layer.

Figure 3. SnapStart tiered cache

Stage 3: Resuming execution from restored snapshots

Resuming execution from snapshots with low latency is the final SnapStart stage. This involves loading the retrieved snapshot chunks into your function execution environment. Typically, only a subset of the retrieved snapshot is needed to serve an invocation. Storing snapshots as chunks lets Lambda optimize the resume process by proactively loading only the necessary subset of chunks. To achieve this, Lambda tracks and records the snapshot chunks that the function accesses during each function invocation, as shown in the following figure.

Figure 4. Initial invocation, record chunk access pattern

After the first function invocation, Lambda refers to this recorded chunk access data for subsequent invokes, as shown in the following figure. Lambda proactively retrieves and loads this “working set” of chunks before they are needed for execution. This significantly speeds up cold-start latency. If every invoke executes the same code path, then all necessary chunks are tracked after the first invoke. If your Lambda function includes a method that is conditionally invoked once every five cold starts, then Lambda adds the corresponding chunks representing this method to the chunk access metadata after five cold starts.

Figure 5. Subsequent invocation, load chunks in order of access

Understanding SnapStart function performance

The speed of restoring a snapshot depends on its contents, size, and the caching tier used. As a result, SnapStart performance can vary across individual functions.

Function performance improves with more invocations

Frequently invoked functions are more likely to have their snapshots cached in the L1 layer, which provides the fastest retrieval latency. Infrequently accessed portions of snapshots for functions with sporadic invokes are less likely to be present in the L1 layer, resulting in slower retrieval latency from the L2 and S3 cache layers. Chunk access data for functions with more invocations is also more likely to be “complete”, which speeds up snapshot restore latency.

Pre-load code paths to optimize snapshot restore latency

To maximize the benefits of SnapStart, preload dependencies, initialize resources, and perform heavy computation tasks that contribute to startup latency in your initialization code instead of in the function handler. Code paths not executed during your function’s INIT phase, such as application classes loaded on-demand through dependency injection, are not included in your function’s snapshot. You can further improve SnapStart effectiveness by proactively executing these code paths during function initialization. You can also run code using runtime hooks and invoking your handler during the initialization phase before creating the snapshot. To achieve this, refer to the documentation and posts for Spring Boot and .NET applications to implement the performance tuning.

Performance differs depending on function size

SnapStart performance depends on how quickly Lambda can retrieve and load cached snapshots into your function execution environment. Larger function sizes increase the size of snapshots, and thus the number of chunks, which causes performance to differ for functions of varying sizes.

Not all functions benefit from SnapStart

SnapStart is designed to improve startup latency when function initialization takes several seconds, due to language-specific factors or because of initializing and loading software dependencies and frameworks. If your functions initialize within hundreds of milliseconds, you are unlikely to experience a significant performance improvement with SnapStart. For these scenarios, we recommend Provisioned Concurrency, which pre-initializes execution environments, delivering double-digit millisecond latency.

Conclusion

AWS Lambda SnapStart can deliver as low as sub-second startup performance for Java, .NET, and Python functions with long initialization times. This post explores how the Lambda lifecycle changes with SnapStart and how Lambda efficiently stores and loads snapshots to improve start up performance. SnapStart helps developers build highly responsive and scalable applications without provisioning resources or implementing complex performance optimizations.

To learn more about SnapStart, refer to the documentation and launch posts for Java, and Python and .NET. For performance tuning, refer to the SnapStart best practices section for your preferred language runtime. This post outlines approaches to pre-load code paths to further optimize startup latency. Find more information and sample applications built using SnapStart on Serverlessland.com.

Improve Amazon EMR HBase availability and tail latency using generational ZGC

2025-08-19 Vishal Chaudhary

Post Syndicated from Vishal Chaudhary original https://aws.amazon.com/blogs/big-data/improve-amazon-emr-hbase-availability-and-tail-latency-using-generational-zgc/

At Amazon EMR, we constantly listen to our customers’ challenges with running large-scale Amazon EMR HBase deployments. One consistent pain point that kept emerging is unpredictable application behavior due to garbage collection (GC) pauses on HBase. Customers running critical workloads on HBase were experiencing occasional latency spikes due to varying GC pauses, particularly impacting when they occurred during peak business hours.

To reduce this unpredictable impact to business-critical applications running on HBase, we turn to Oracle’s Z Garbage Collector (ZGC), specifically it’s generational support introduced in JDK 21. Generational ZGC delivers consistent sub-millisecond pause times that dramatically reduce tail latency.

In this post, we examine how unpredictable GC pauses affect business-critical workloads, benefits of enabling generational ZGC in HBase. We also cover additional GC tuning techniques to improve the application throughput and reduce tail latency. Amazon EMR 7.10.0 introduces new configuration parameters that allow you to seamlessly configure and tune the garbage collector for HBase RegionServers.

By incorporating generational collection into ZGC’s ultra-low pause architecture, it efficiently handles both short-lived and long-lived objects, making it exceptionally well-suited to HBase’s workload characteristics:

Handling mixed object lifetimes – HBase operations create a mix of short-lived objects (such as temporary buffers for read/write operations) and long-lived objects (such as cached data blocks and metadata). Generational ZGC can efficiently manage both, reducing overall GC frequency and impact.
Adapting to workload patterns – As workload patterns change throughout the day — for instance, from write-heavy ingestion to read-heavy analytics — generational ZGC adapts its collection strategy, maintaining optimal performance.
Scaling with heap size – As data volumes grow and HBase clusters require larger heaps, generational ZGC maintains it’s sub-millisecond pause times, providing consistent performance even as you scale up.

Understanding the impact of GC pauses on HBase

When running HBase RegionServers, the JVM heap can accumulate a large number of objects, both short-lived (temporary objects created during operations) and long-lived (cached data, metadata). Traditional garbage collectors like Garbage-First Garbage Collector (G1 GC) need to pause application threads during certain phases of garbage collection, particularly during “stop-the-world” (STW) events. GC pauses can have several impacts on HBase :

Latency spikes – GC pauses introduce latency spikes, often impacting tail latencies (p99.9 and p99.99) of the application which can lead to timeout for client requests and inconsistent response times..
Application availability – All application threads are halted during STW events and it negatively impacts overall application availability.
RegionServer failures – If GC pauses exceed the configured ZooKeeper session timeout, they might lead to RegionServer failures.

HBase RegionServer reports whenever there is an unusually long GC pause time using the JvmPauseMonitor. The following log entry shows an example of GC pauses reported by HBase RegionServer. During YCSB benchmarking, G1 GC exhibited 75 such pauses over a 7-hour period, whereas generational ZGC showed no long pauses under identical workload and testing conditions.

INFO  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2839ms
INFO  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 3021ms

G1 GC pauses are proportional to the pressure on the heap and the object allocation patterns. As a result, the pauses might get worse if the heap is under too much load, whereas generational ZGC maintains it’s pause times goals even under high pressure.

Pause time and availability (uptime) comparison: Generational ZGC vs. G1GC in Amazon EMR HBase

Our testing revealed significant differences in GC pause time between the generational ZGC and G1 GC for HBase on Amazon EMR 7.10. We used 1 m5.4xlarge (primary), 5 m5.4xlarge (core) nodes cluster settings and ran multiple iterations of 1-billion rows YCSB workloads to compare the GC pauses and uptime percentage. Based on our test cluster, we observed a GC pause time improvement from over 1 minute, 24 seconds, to under 1 seconds for over an hour-long execution, improving the application uptime from 98.08% to 99.99%.

We conducted extensive performance testing comparing G1 GC and generational ZGC on HBase clusters running on Amazon EMR, using the default heap settings automatically configured based on Amazon Elastic Compute Cloud (Amazon EC2) instance type. The following image shows the comparison in both GC pause time and uptime percentage at a peak load of 3,00,000 requests per second (data sampled over 1 hour).

The following figures show the breakdown of the 1-hour runtime in 10-minute intervals. The left vertical axis measures the uptime, the right vertical axis measures the GC pause time, and the horizontal axis shows the interval. The generational ZGC maintained consistent uptime and pause time in milliseconds, and G1 GC demonstrated inconsistent and decreased uptime, pause times in seconds.

Tail latency comparison: Generational ZGC vs. G1GC in Amazon EMR HBase

One of the most compelling advantages of generational ZGC over G1 GC is its predictable garbage collection behavior and the impact on application tail latency. G1 GC’s collection triggers are non-deterministic, meaning pause times can vary significantly and occur at unpredictable intervals. These unexpected pauses, though generally manageable, can create latency spikes that particularly affect the slowest percentile of operations. In contrast, generational ZGC maintains consistent, sub-millisecond pause times throughout its operation. This predictability proves crucial for applications requiring stable performance, especially at the highest percentiles of latency (99.9th and 99.99th percentiles). Our YCSB benchmark testing reveals the real-world impact of these different approaches. The following graph illustrates tail latency distribution between G1 GC and generational ZGC over a 2-hour sampling period :

Improvements to BucketCache

BucketCache is an off-heap cache in HBase that is used to cache the frequently accessed data blocks and minimize disk I/O. Bucket cache and heap memory works in conjunction and might increase the contention on the heap depending on the workload. Generational ZGC maintains it’s pause time goals even with a terabyte-sized bucket cache. We benchmarked multiple HBase clusters with varying bucket cache sizes and 32 GB RegionServer heap. The following figures show the peak pause times observed over a 1-hour sampling period, comparing G1 GC and generational ZGC performance.

Enabling this feature and additional fine-tuning parameters

To enable this feature, follow the configurations mentioned in the Performance Considerations. In the following sections, we discuss additional fine-tuning parameters to tailor the configuration for your specific use case.

Fixed JVM heap

Batch processing jobs and short-lived applications benefit from dynamic allocation’s ability to adapt to varying input sizes and processing demands when multiple applications co-exist on the same cluster and run with resource constraints. The memory footprint can expand during peak processing and contract when the workload diminishes. However, for production HBase deployments without any co-existing applications in the same fixed heap allocation offers stable, reliable performance.

Dynamic heap allocation is when the JVM flexibly grows and shrinks its memory usage between minimum (-Xms) and maximum (-Xmx) limits based on application needs, returning unused memory to the operating system. However, this flexibility comes at the cost of performance overhead and memory fragmentation. Dynamic allocation seemed flexible, but it created constant disruptions. The JVM was always negotiating with the operating system for memory, leading to performance overhead and fragmentation. On the other hand, fixed heap allocation pre-allocates a constant amount of memory for the JVM at startup and maintains it throughout runtime, providing better performance by reducing memory negotiation overhead with the operating system. To enable this feature, use the following configuration: :

[
    {
        "Classification": "hbase",
        "Properties": {
            "hbase.regionserver.fixed.heap.enabled": "true"
        }
    }
]

Enable pre-touch

Applications with large heaps can experience more significant pauses when the JVM needs to allocate and fault in new memory pages. Pre-touch (-XX:+AlwaysPreTouch) instructs the JVM to physically touch and commit all memory pages during heap initialization, rather than waiting until they’re first accessed during runtime. This early commitment reduces the latency of on-demand page faults and memory mappings that occur when pages are first accessed, resulting in more predictable performance especially during heavy load situations. By pre-touching memory pages at startup, you trade a slightly longer JVM startup time for more consistent runtime performance. To enable pre-touch for your HBase cluster, use the following configuration :

[
    {
        "Classification": "hbase-env",
        "Properties": {},
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {
                    "JAVA_HOME": "/usr/lib/jvm/jre-21",
                    "HBASE_REGIONSERVER_GC_OPTS": "\"-XX:+UseZGC -XX:+ZGenerational -XX:+AlwaysPreTouch\""
                }
            }
        ]
    }
]

Increasing memory mappings for large heaps

Depending on the workload and scale, you might need to increase the Java heap size to accommodate large data in memory. When using the generational ZGC with a large heap setup, it’s critical to also increase the operating system’s memory mapping limit (vm.max_map_count).

When a ZGC-enabled application starts, the JVM proactively checks the system’s vm.max_map_count value. If the limit is too low to support the configured heap, it will issue the following warning :

[warning] The system limit on number of memory mappings per process might be too low for the given
[warning] max Java heap size (131072M). Please adjust /proc/sys/vm/max_map_count to allow for at
[warning] least 235929 mappings (current limit is 65530). Continuing execution with the current
[warning] limit could lead to a premature OutOfMemoryError being thrown, due to failure to map memory.

To increase the memory mappings, use the following configuration and adjust the count value in the command based on the heap size of the application.

echo "vm.max_map_count = 262144" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

sudo systemctl restart hbase-regionserver

Conclusion

The introduction of generational ZGC and fixed heap allocation for HBase on Amazon EMR marks a significant leap forward in the predictable performance and tail latency reduction. By addressing the long-standing challenges of GC pauses and memory management, these features unlock new levels of efficiency and stability for Amazon EMR HBase deployments. Although the performance improvements vary depending on workload characteristics, you can expect to see significant enhancements in your Amazon EMR HBase clusters’ responsiveness and stability. As data volumes continue to grow and low-latency requirements become increasingly stringent, features like generational ZGC and fixed heap allocation become indispensable. We encourage HBase users on Amazon EMR to enable these features and experience the benefits firsthand. As always, we recommend testing in a staging environment that mirrors your production workload to fully understand the impact and optimize configurations for your specific use case.

Stay tuned for more innovations as we continue to push the boundaries of what’s possible with HBase on Amazon EMR.

About the authors

Vishal Chaudhary is a Software Development Engineer at Amazon EMR. His expertise is in Amazon EMR, HBase and Hive Query Engine. His dedication towards solving distributed system problems is helping Amazon EMR to achieve higher performance improvements.

Ramesh Kandasamy is an Engineering Manager at Amazon EMR. He is a long tenured Amazonian dedicated to solve distributed systems problems.

Achieve low-latency data processing with Amazon EMR on AWS Local Zones

2025-08-18 Gagan Brahmi

Post Syndicated from Gagan Brahmi original https://aws.amazon.com/blogs/big-data/achieve-low-latency-data-processing-with-amazon-emr-on-aws-local-zones/

Enterprises today require both single-digit millisecond latency data processing and data residency compliance for their applications. By deploying Amazon EMR on AWS Local Zones, organizations can achieve single-digit millisecond latency data processing for applications while maintaining data residency compliance. This post demonstrates how to use AWS Local Zones to deploy EMR clusters closer to your users, enabling millisecond-level response times. We use a Secure Web Gateway as an example and implement Amazon EMR with Apache Flink on AWS Local Zones to process network traffic with ultra-low latency. We also go through the process of creating an EMR cluster on AWS Local Zones, highlighting performance optimizations and architecture considerations specific to edge deployments. This approach uses AWS Local Zones to bring Amazon EMR’s data processing capabilities closer to your users and data sources – ideal for security applications or any other latency-sensitive workloads.

Solution overview

The following diagram illustrates the solution architecture.

Sample Architecture for Secure Web Gateway on AWS LocalZones

The solution consists of several key components:

AWS Local Zones deployment – Positioned close to corporate offices to minimize latency
Network traffic interception – Using AWS Transit Gateway and virtual private cloud (VPC) endpoints
Request queuing and rules streaming – Using Apache Kafka on Amazon Elastic Kubernetes Service (Amazon EKS) to queue the incoming and outgoing network requests as well as stream rules as they are updated by the security administrator
EMR cluster – Running Flink for real-time stream processing and functions to combine rules
Policy management system – For defining and updating security rules
Logging – Using Amazon Simple Storage Service (Amazon S3) for visibility, compliance, and data analytics

In this scenario, the Secure Web Gateway is designed to inspect and make decisions on network traffic within single-digit milliseconds. The workflow consists of the following steps:

The corporate office uses AWS Direct Connect to connect to AWS Local Zones.
The security administrator defines the rules from a rules interface running on Kubernetes pods on Amazon EKS. As the rules are added or modified, they are sent to the swg_rules Kafka topic running on Amazon EKS. These rules are stored and processed by Flink running on the EMR cluster.
A corporate user requests for a software as a service (SaaS) application from the corporate office. The request is routed through Direct Connect to the Local Zone.
The Secure Web Gateway proxy service running on Kubernetes pods on Amazon EKS receives the access request, which is sent to the swg_requests Kafka topic.
Flink running on EMR evaluates and consumes the messages from the swg_requests Kafka topic and determines the routing decision, which is sent back to the swg_decisions Kafka topic.
The Secure Web Gateway proxy service consumes the swg_decisions topic and routes the traffic to the SaaS application, if the access request is allowed. If the request is denied, the proxy responds back to the users with the reason or violations details, if any.

Due to the real-time nature of the solution, the security administrator can add, modify, or remove the rules through the swg_rules topic as Flink constantly consumes and evaluates this topic.In the following sections, we discuss the key components of the solution in more detail.

AWS Local Zones: The foundation

AWS Local Zones provide low-latency extensions of AWS Regions positioned near large population and industry centers. For our Secure Web Gateway use case, deploying in a Local Zones offers several advantages:

Proximity to corporate offices – Reducing round-trip latency for traffic inspection. AWS Local Zones is designed to provide applications with low latency aiming for single-digit millisecond performance.
AWS-native security controls – Using AWS security features.
Consistent connectivity – Reliable connection between corporate networks and AWS resources.

The Local Zone hosts our EMR cluster and networking components, making sure traffic inspection through the Secure Web Gateway happens with single-digit millisecond latency. For scenarios where traffic inspection doesn’t require single-digit millisecond latency, deploying hosting the solution on EMR cluster in a Region should work fine.

Amazon EMR with Apache Flink: The decision engine

The core intelligence of our Secure Web Gateway solution is powered by Amazon EMR running Flink for real-time stream processing. With Amazon EMR running on Flink, we take advantage of the optimized real-time stream processing capability offered by Flink. EMR running in AWS Local Zones helps users perform complex data processing closer to their data centers or corporate locations without worrying any potential latency introduced for moving the data to other Regions. In this particular solution, we use Flink’s stateful processing, which allows for maintaining the session context across multiple network requests/packets. The solution also provides a dynamic rules engine that is combined with the real-time stream of requests for network access.

Architectural component choice considerations

Amazon EMR offers several deployment options for different kinds of workloads and use cases, including Amazon EMR on EKS. AWS also provides Amazon Managed Service for Apache Flink, a fully managed service that simplifies the process of building and managing Flink applications. As of this writing, both the EMR on EKS deployment option and Amazon Managed Service for Apache Flink are not available in AWS Local Zones.

Prerequisites

Before proceeding with this deployment, ensure you have:

AWS account with AWS IAM permissions for Amazon VPC, EMR, and Local Zones management
Basic familiarity with the AWS Management Console

Deploy Amazon EMR on a Local Zone

To deploy Amazon EMR on a Local Zone, you first need to enable the Local Zone for the AWS account. For instructions, refer to Step 1 and Step 2 in Getting started with AWS Local Zones.

After you have enabled a Local Zone and created a Local Zone subnet, create your EMR cluster. For instructions, refer to Step 1: Configure data resources and launch an Amazon EMR cluster. You can follow the instructions provided for the AWS Management Console. Make sure you select the appropriate Amazon EMR release version (5.28.0 or later for Local Zone support). Select the applications you need, which in this case is Hadoop and Flink.

A crucial step to launching an EMR cluster in a Local Zone is selecting the Local Zone network configuration. Choose the VPC that contains your Local Zone subnet, and choose the subnet that you created in the Local Zone.

Review all other configurations and settings for your cluster and make any final adjustments as needed, then choose Create cluster to launch your EMR cluster in the Local Zone.

Performance and scaling considerations

The Local Zone EMR deployment can be scaled based on traffic patterns. You can manually scale the EMR cluster horizontally by adding more worker nodes during peak traffics to provide low-latency performance, after you have increased the number of users that access the Secure Web Gateway. Alternatively, you can set up a scheduled action to scale the EMR cluster at predetermined times based on known workload patterns. You can also perform vertical scaling by using Amazon Elastic Compute Cloud (Amazon EC2) instance types with more compute capacity. Consider using the manual resize option for EMR clusters to modify the cluster size based on workload requirements.

Another important performance consideration is to optimize Flink checkpointing for fault tolerance. To learn more, see Optimizing job restart times for task recovery and scaling operations.

Security considerations

Although this architecture prioritizes low-latency performance, implementing proper security controls is essential for production deployments. The solution handles sensitive corporate network traffic that requires protection through encryption, access controls, and monitoring. For comprehensive security guidance specific to EMR deployments, refer to Security in Amazon EMR. Consider the following key areas:

Data protection – Enable encryption at rest and in transit using Amazon EMR security configurations, including Amazon S3 encryption and TLS certificates for inter-node communication
Access control – Implement AWS Identity and Access Management (IAM) roles with least privilege for Amazon EMR service roles, EC2 instance profiles, and runtime roles to isolate job access
Network security – Deploy EMR clusters in private subnets with security groups following least privilege, and enable the Amazon EMR block public access feature

Benefits of Amazon EMR

Using Amazon EMR on AWS Local Zones in this architecture offers several key benefits:

Low latency – Providing the compute in AWS Local Zones close to corporate offices helps you achieve low-latency processing.
Real-time inspection – Flink’s streaming capabilities unlocks the ability to process real-time inspection for network requests.
Complex policy application – With Flink on Amazon EMR, you can build a complex policy application that, for instance, can detect sophisticated access patterns across multiple events and time windows that would be impossible with traditional rule-based systems.
Scalability – Amazon EMR provides the flexibility to automatically scale the cluster with a custom policy. Moreover, Amazon EMR release 6.15.0 and higher supports Flink autoscaler, which automatically scales the individual Flink job vertexes based on the job metrics.
Compliance – Logging all the events to a durable storage like Amazon S3 helps users improve their security and audit posture.

Clean up

To avoid incurring unnecessary charges, clean up the resources you created during this walkthrough. Follow these steps in order:

Step 1: Terminate the EMR cluster

Open the Amazon EMR console
Select your EMR cluster from the list
Choose Terminate
Confirm the termination when prompted
Wait for the cluster status to change to “TERMINATED”

Step 2: Clean up VPC resources

In the Amazon VPC console, delete the Local Zone subnet you created
If you created a custom VPC specifically for this demo, delete any associated:
- Route tables
- Internet gateways
- Security groups (other than default)
- The VPC itself

Step 3: Disable the Local Zone (optional)

In the EC2 console, go to Zones under “Settings”
Find your enabled Local Zone
Choose Manage and disable the zone if you no longer need it for other workloads

Step 4: Review additional resources Check for and clean up any other resources you may have created:

S3 buckets used for logging or EMR storage
CloudWatch log groups
Any custom IAM roles or policies created specifically for this architecture

Conclusion

This implementation of Amazon EMR on AWS Local Zones demonstrates how enterprises can bring powerful data processing capabilities to the edge while maintaining single-digit millisecond latency. By showcasing a Secure Web Gateway application, we have illustrated just one of many possible use cases where performance-sensitive workloads can benefit from this architecture.As the edge computing landscape evolves, we anticipate organizations will increasingly use this pattern for additional use cases, including:

Real-time fraud detection for financial transactions requiring immediate decision-making
Connected vehicle applications where processing telemetry data with minimal latency is critical
Internet of Things (IoT) sensor analytics that require immediate insights from operational technology environments
Augmented reality experiences where processing must happen close to end-users

We encourage you to evaluate your latency-sensitive workloads and consider how AWS Local Zones with Amazon EMR might help you implement architectures previously perceived highly challenging. Start small with a proof of concept like the one outlined here, measure the performance gains, and expand to production use cases with confidence. Implementing a Secure Web Gateway in AWS Local Zones with Amazon EMR and Flink offers enterprises a powerful solution for securing corporate traffic. By using the proximity of Local Zones and the real-time processing capabilities of Flink, organizations can implement sophisticated security policies without the latency penalties traditionally associated with traffic inspection.

About the authors

Gagan Brahmi is a Specialist Senior Solutions Architect at Amazon Web Services (AWS), specializing in Data Analytics and AI/ML solutions. With over 20 years in information technology, he helps customers architect scalable, high-performance analytics platforms using distributed data processing, real-time streaming technologies, and machine learning services on AWS. When not designing cloud solutions, Gagan enjoys exploring new places with his family.

Arun Shanmugam is a Senior Analytics Solutions Architect at AWS, with a focus on building modern data architecture. He has been successfully delivering scalable data analytics solutions for customers across diverse industries. Outside of work, Arun is an avid outdoor enthusiast who actively engages in CrossFit, road biking, and cricket.

George Oakes is a Senior Hybrid Solutions Architect at AWS, with a focus on edge, on-premise, and low latency architectures. He has been successfully delivering scalable hybrid AWS solutions for customers across diverse industries. Outside of work, George is an avid outdoor enthusiast who enjoys hiking and visiting parks and UNESCO sites around.

Export JMX metrics from Kafka connectors in Amazon Managed Streaming for Apache Kafka Connect with a custom plugin

2025-08-15 Jaydev Nath

Post Syndicated from Jaydev Nath original https://aws.amazon.com/blogs/big-data/export-jmx-metrics-from-kafka-connectors-in-amazon-managed-streaming-for-apache-kafka-connect-with-a-custom-plugin/

Organizations use streaming applications to process and analyze data in real time and adopt the Amazon MSK Connect feature of Amazon Managed Streaming for Apache Kafka (Amazon MSK) to run fully managed Kafka Connect workloads on AWS. Message brokers like Apache Kafka allow applications to handle large volumes and diverse types of data efficiently and enable timely decision-making and instant insights. It’s crucial to monitor the performance and health of each component to help ensure the seamless operation of data streaming pipelines.

Amazon MSK is a fully managed service that simplifies the deployment and operation of Apache Kafka clusters on AWS. It simplifies building and running applications that use Apache Kafka to process streaming data. Amazon MSK Connect simplifies the deployment, monitoring, and automatic scaling of connectors that transfer data between Apache Kafka clusters and external systems such as databases, file systems, and search indices. Amazon MSK Connect is fully compatible with Kafka Connect and supports Amazon MSK, Apache Kafka, and Apache Kafka compatible clusters. Amazon MSK Connect uses a custom plugin as the container for connector implementation logic.

Custom MSK connect plugins use Java Management Extensions (JMX) to expose runtime metrics. While Amazon MSK Connect sends a set of connect metrics to Amazon CloudWatch, it currently does not support exporting the JMX metrics emitted by the connector plugins natively. These metrics can be exported by modifying the custom connect plugin code directly, but it requires maintenance overhead because the plugin code needs to be modified every time it’s updated. In this post, we demonstrate an optimal approach by extending a custom connect plugin with additional modules to export JMX metrics and publish them to CloudWatch as custom metrics. These additional JMX metrics emitted by the custom connectors provide rich insights into their performance and health of the connectors. In this post, we demonstrate how you can export the JMX metrics for Debezium connector when used with MSK Connect.

Understanding JMX

Before we dive deep into exporting JMX metrics, let’s understand how JMX works. JMX is a technology that you can use to monitor and manage Java applications. Key components involved in JMX monitoring are:

Managed beans (MBeans) are Java objects that represent the metrics of the Java application being monitored. They contain the actual data points of the resources being monitored.
JMX server creates and registers the MBeans with the PlatformMBeanServer. The Java application that is being monitored acts as the JMX server and exposes the MBeans.
MBeanServer or JMX registry is the central registry that keeps track of all the registered MBeans in the JMX server. It is the access point for all the MBeans within the Java virtual machine (JVM).
JMXConnectorServer acts as a bridge between the JMX client and the JMX server and enables remote access to the exposed MBeans. JMXConnectorServerFactory creates and manages the JMXConnectorServer. It allows for the customization of the server’s properties and uses the JMXServiceURL to define the endpoint where the JMX client can connect to the JMX server.
JMXServiceURL provides the necessary information such as the protocol, host, and port for the client to connect to the JMX server and access the desired MBeans.
JMX client is an external application or tool that connect to the JMX server to access and monitor the exposed metrics.

JMX monitoring involves the steps shown in the following figure:

JMX architecture diagram showing connection flow from client to server with MBeans

JMX monitoring steps include:

The Java application acting as the JMX server creates and configures MBeans for the desired metrics.
JMX server registers the MBeans with the JMX registry.
JMXConnectorServerFactory creates the JMXConnectorServer that defines the JMXServiceURL that provides the entry point details for the JMX client.
JMXClient connects to the JMX registry in the JMX server using the JMXServiceURL and the JMXConnectorServer.
The JMX server handles client requests, interacting with the JMX registry to retrieve the MBean data.

Solution overview

This method of wrapping supported Kafka connectors with custom code that exposes connector-specific operational metrics enables teams to get better insights by correlating various connector metrics with cloud-centered metrics in monitoring systems such as Amazon CloudWatch. This approach enables consistent monitoring across different components of the change data capture (CDC) pipeline, ultimately feeding metrics into unified dashboards while respecting each connector’s architectural philosophy. The consolidated metrics can be delivered to CloudWatch or the monitoring tool of your choice including partner specific application performance management (APM) tools such as Datadog, New Relic, and so on.

We have the working implementation of this same approach with two popular connectors: Debezium source connector and MongoDB Sink Connector. You can find the Github sample and ready to use plugins built for each in the repository. Review the README file for this custom implementation for more details.

For example, our custom implementation for the MongoDB Sink Connector adds a metrics export layer that calculates critical performance indicators such as latest-kafka-time-difference-ms – which measures the latency between Kafka message timestamps and connector processing time by subtracting the connector’s current clock time from the last received record’s timestamp. This custom wrapper around the MongoDB Sink Connector enables exporting relevant JMX metrics and publishing them as custom metrics to CloudWatch. We’ve open sourced this solution on GitHub, along with a ready-to-use plugin and detailed configuration guidance in the README.

CDC is the process of identifying and capturing changes made in a database and delivering those changes in real time to a downstream system. Debezium is an open source distributed platform built on top of Apache Kafka that provides CDC functionality. It provides a set of connectors to track and stream changes from databases to Kafka.

In the next section, we dive deep into the implementation details of how to export JMX metrics from Debezium MySQL Connector deployed as a custom plugin in Amazon MSK Connect. The connector plugin takes care of creating and configuring the MBeans and registering them with the JMX registry.

The following diagram shows the workflow of using Debezium MySQL Connector as a custom plugin in Amazon MSK Connect for CDC from an Amazon Aurora MySQL-Compatible Edition data source.

Data flow diagram illustrating custom Amazon MSK Connect plugin integrating Aurora, Kafka, and CloudWatch metrics

MySQL binary log (binlog) is enabled in Amazon Aurora for MySQL to record all the operations in the order in which they are committed to the database.
The Debezium connector plugin component of the MSK Connect custom plugin continuously monitors the MySQL database, captures the row-level changes by reading the MySQL bin logs, and streams them as change events to Kafka topics in Amazon MSK.
We’ll build a custom module to enable JMX monitoring on the Debezium connector. This module will act as a JMX client to retrieve the JMX metrics from the connector and publish them as custom metrics to CloudWatch.

The Debezium connector provides three types of metrics in addition to the built-in support for default Kafka and Kafka Connect JMX metrics.

Snapshot metrics provide information about connector operation while performing a snapshot.
Streaming metrics provide information about connector operation when the connector is reading the binlog.
Schema history metrics provide information about the status of the connector’s schema history.

In this solution, we export the MilliSecondsBehindSource streaming metrics emitted by the Debezium MySQL connector. This metric provides the number of milliseconds that the connector is lagging behind the change events in the database.

Prerequisites

Following are the prerequisites you need:

Access to the AWS account where you want to set up this solution.
You have set up the source database and MSK cluster by following this setup instructions in the MSK Connect workshop.

Create a custom plugin

Creating a custom plugin for Amazon MSK Connect for the solution involves the following steps:

Create a custom module: Create a new Maven module or project that will contain your custom code to:
1. Enable JMX monitoring in the connector application by starting the JMX server.
2. Create a Remote Method Invocation (RMI) registry to enable the access to the JMX metrics to the clients.
3. Create a JMX metrics exporter to query the JMX metrics by connecting to the JMX server and push the metrics to CloudWatch as custom metrics.
4. Schedule to run the JMX metrics exporter at a configured interval.
Package and deploy the custom module as an MSK Connect custom plugin.
Create a connector using the custom plugin to capture CDC from the source, stream it and validate the metrics in Amazon CloudWatch.

This custom module extends the connector functionality to export the JMX metrics without requiring any changes in the underlying connector implementation. This helps ensure that upgrading the custom plugin requires only upgrading the plugin version in the pom.xml of the custom module.

Let’s deep dive and understand the implementation of each step mentioned above.

1. Create a custom module

Create a new Maven project with dependencies on Debezium MySQL Connector to enable JMX monitoring, Kafka Connect API for configuration, and CloudWatch AWS SDK to push the metrics to CloudWatch.
Set up a JMX connector server to enable JMX monitoring: To enable JMX monitoring, the JMX server needs to be started at the time of initializing the connector. This is usually done by setting the environment variables with JMX options as described in Monitoring Debezium. In the case of an Amazon MSK Connect custom plugin, JMX monitoring is enabled programmatically at the time of connector plugin initialization. To achieve this:

Extend the MySqlConnector class and override the start which is the connector’s entry point to execute custom code.

public class DebeziumMySqlMetricsConnector extends MySqlConnector{
@Override
	public void start(Map<String, String> props) {

In the start method of the custom connector class (DebeziumMySqlMetricsConnector) that we are creating, set the following parameters to allow customization of the JMX Server properties by retrieving connector configuration from a config file.

connect.jmx.port – The port number on which the RMI registry needs to be created. JMXConnectorServer would listen to the incoming connections on this port.

database.server.name – Name of the database that is the source for the CDC.

It also retrieves the CloudWatch configuration related properties that will be used while pushing the JMX metrics to CloudWatch.

cloudwatch.namespace.name – CloudWatch NameSpace to which the metrics need to be pushed as custom metrics

cloudwatch.region – CloudWatch Region where the custom namespace is created in your AWS account

connectJMXPort = Integer.parseInt(props.getOrDefault(CONNECT_JMX_PORT_KEY, String.valueOf(DEFAULT_JMX_PORT)));
databaseServerName = props.getOrDefault(DATABASE_SERVER_NAME_KEY, "");
cwNameSpace = props.getOrDefault(CW_NAMESPACE_KEY, DEFAULT_CW_NAMESPACE);
cwRegion = props.getOrDefault(CW_REGION_KEY, null);

Create an RMI registry on the specified port (connectJMXPort). This registry is used by the JMXConnectorServer to store the RMI objects corresponding to the MBeans in the JMX registry. This allows the JMX clients to look up and access the MBeans on the PlatformMBeanServer.

LocateRegistry.createRegistry(connectJMXPort);

Retrieve the PlatformMBeanServer and construct the JMXServiceURL which is in the format service:jmx:rmi://localhost/jndi/rmi://localhost:<<jmx.port>>/jmxrmi. Create a new JMXConnectorServer instance using the JMXConnectorServerFactory and the JMXServiceURL and start the JMXConnectorServer instance.

MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
String jmxServiceURL = String.format(JMX_URL_TEMPLATE, connectJMXPort);
JMXServiceURL url = new JMXServiceURL(jmxServiceURL);
JMXConnectorServer svr = JMXConnectorServerFactory.newJMXConnectorServer(url, null, mbs);
svr.start();

Implement JMX metrics exporter: Create a JMX client to connect to the JMX server, query the MilliSecondBehindSource metric from the JMX server, convert it into the required format, and export it to CloudWatch.

Connect to the JMX Server using the JMXConnectorFactory and JMXServiceURL

JMXServiceURL jmxUrl = new JMXServiceURL(String.format(JMX_URL_TEMPLATE,DebeziumMySqlMetricsConnector.getConnectJMXPort()));
JMXConnector jmxConnector = JMXConnectorFactory.connect(jmxUrl, null);
jmxConnector.connect();

Query the MBean object that holds the corresponding metric, for example, MilliSecondsBehindSource, and retrieve the metric value using sample code provided in msk-connect-custom-plugin-jmx. (you can choose one or more metrics).
Schedule the execution of your JMX metrics exporter at regular intervals.

getScheduler().schedule(new JMXMetricsExporter(), SCHEDULER_INITIAL_DELAY, SCHEDULER_PERIOD);

Export metrics to CloudWatch: Implement the logic to push relevant JMX metrics to CloudWatch. You can use the AWS SDK for Java to interact with the CloudWatch PutMetricData API or use the CloudWatch Logs subscription filter to ingest the metrics from a dedicated Kafka topic.

Dimension dimension = Dimension.builder()
.name("DBServerName")
.value(DebeziumMySqlMetricsConnector.getDatabaseServerName())
.build();
MetricDatum datum = MetricDatum.builder()
	     .metricName("MilliSecondsBehindSource")
	     .unit(StandardUnit.NONE)
	     .value(Double.valueOf(msBehindSource))
	     .timestamp(instant)
	     .dimensions(dimension).build();
PutMetricDataRequest request = PutMetricDataRequest.builder()
	  .namespace(DebeziumMySqlMetricsConnector.getCWNameSpace())
	  .metricData(datum).build();
cw.putMetricData(request);

For more information, see the sample implementation for the custom module in aws-samples in GitHub. This sample also provides custom plugins packaged with two different versions of Debezium MySQL connector (debezium-connector-mysql-2.5.2.Final-plugin and debezium-connector-mysql-2.7.3.Final-plugin) and the following steps would explain the steps to build a custom plugin using your custom code.

2. Package the custom module and Debezium MySQL connector as a custom plugin

Build and package the Maven project with the custom code as a JAR file and include the JAR file in the debezium-connector-mysql-2.5.2.Final-plugin folder downloaded from maven repo. Package the updated debezium-connector-mysql-2.5.2.Final-plugin as a ZIP file (Amazon MSK Connect accepts custom plugins in ZIP or JAR format). Alternatively, you can use the prebuiltcustom-debezium-mysql-connector-plugin.zip available in GitHub.

Choose the Debezium connector version (2.5 or 2.7) that fits your requirement.

When you have to upgrade to a new version of the Debezium MySQL connector, you can update the version of the dependency and build the custom module and deploy it. By doing this, you can maintain the custom plugin without modifying the original connector code. The GitHub samples provide ready-to-use plugins for two Debezium connector versions. However, you can follow the same approach to upgrade to the latest connector version as well.

Create a custom plugin in Amazon MSK

If you have set up your AWS resources by following the Getting Started lab, open Amazon S3 console and locate the bucket msk-lab-${ACCOUNT_ID}-plugins-bucket/debezium .
Upload the custom plugin created in the previous section custom-debezium-mysql-connector-plugin.zip to msk-lab-${ACCOUNT_ID}-plugins-bucket/debezium, as shown in the following figure.

msk-lab-s3-plugin-bucket

Switch to the Amazon MSK console and choose Custom plugins in the navigation pane. Choose Create custom plugin and, browse the S3 bucket that you created above and select the custom plugin ZIP file you just uploaded.

custom-connector-plugin-s3-object

Enter custom-debezium-mysql-connector-plugin for the plugin name. Optionally, enter a description and choose Create Custom Plugin.

msk-connect-create-custom-plugin-console

After a few seconds you should see the plugin is created and the status is Active.
Customize the worker configuration for the connector by following the instructions in the Customize worker configuration lab.

3. Create an Amazon MSK connector

The next step is to create an MSK connector.

From the MSK section choose Connectors, then choose Create connector. Choose custom-debezium-mysql-connector-plugin from the list of Custom plugins, then choose Next.

msk-plugin-create

Enter custom-debezium-mysql-connector in the Name textbox, and a description for the connector.

connector-properties-console-in-MSK-connect

Select the MSKCluster-msk-connect-lab from the listed MSK clusters. From the Authentication dropdown, select IAM.
Copy the following configuration and paste it in the connector configuration textbox.

Replace the <Your Aurora MySQL database endpoint>, <Your Database Password>, <Your MSK Bootstrap Server Address>, and <Your CloudWatch Region>placeholders with the corresponding details for the resources in your account.
Review the topic.prefix, database.user, topic.prefix, database.server.id, database.server.name, database.port, database.include.listparameters in the configuration. These parameters are configured with the values used in the workshop. Update them with the details corresponding to your configuration if you have customized it in your account.
Note that the connector.classparameter is updated with the qualified name of the subclass of MySqlConnector class that you created in the custom module.
The connect.jmx.portparameter specifies the default port to start the JMX server. You can configure this to any available port.

connector.class=com.amazonaws.msk.debezium.mysql.connect.DebeziumMySqlMetricsConnector tasks.max=1
include.schema.changes=true
topic.prefix=salesdb
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
database.user=master
database.server.id=123456
database.server.name=salesdb
database.port=3306
key.converter.schemas.enable=false
database.hostname=<Your Aurora MySQL database endpoint>
database.password=<Your Database Password>
value.converter.schemas.enable=false
database.include.list=salesdb
schema.history.internal.kafka.topic=internal.dbhistory.salesdb
schema.history.internal.kafka.bootstrap.servers=<Your MSK Bootstrap Server Address>
schema.history.internal.producer.sasl.mechanism=AWS_MSK_IAM
schema.history.internal.consumer.sasl.mechanism=AWS_MSK_IAM
schema.history.internal.producer.sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
schema.history.internal.consumer.sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
schema.history.internal.producer.sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
schema.history.internal.consumer.sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
schema.history.internal.consumer.security.protocol=SASL_SSL
schema.history.internal.producer.security.protocol=SASL_SSL
connect.jmx.port=7098
cloudwatch.namespace.name=MSK_Connect
cloudwatch.region=<Your CloudWatch Region>

connector-properties-configuration-settings

5. Follow the remaining instructions from the Create MSK Connector lab and create the connector. Verify that the connector status changes to Running.

Debezium MySQL custom connector version (2.7.3) provides additional flexibility to configure optional properties that can be added to your MSK connector configuration and selectively include and exclude metrics to emit to CloudWatch. The following are the example configuration properties that can be used with version 2.7.3 :

cloudwatch.debezium.streaming.metrics.include – A comma-separated list of streaming metrics type that must be exported to CloudWatch as custom metrics.
cloudwatch.debezium.streaming.metrics.exclude – Specify a comma-separated list of streaming metrics types to exclude from being sent to CloudWatch as custom metrics.
Similarly include and exclude properties for snapshot metrics type are cloudwatch.debezium.snapshot.metrics.include and cloudwatch.debezium.snapshot.metrics.exclude
Include and exclude properties for schema history metrics type are cloudwatch.debezium.schema.history.metrics.include and cloudwatch.debezium.schema.history.metrics.exclude

The following is a sample configuration excerpt.

  "cloudwatch.debezium.streaming.metrics.include": "LastTransactionId, TotalNumberOfEventsSeen, MilliSecondsBehindSource,CapturedTables",
  "cloudwatch.debezium.streaming.metrics.exclude": "LastTransactionId",
  "cloudwatch.debezium.schema.history.metrics.exclude": "MilliSecondsSinceLastAppliedChange",

Review the GitHub README file for more details on the use of these properties with MSK connector configurations.

Verify the replication in the Kafka cluster and CloudWatch metrics

Follow the instructions in the Verify the replication in the Kafka cluster lab to set up a client and make changes to the source DB and verify that the changes are captured and sent to Kafka topics by the connector.

To verify that the connector has published the JMX metrics to CloudWatch, go to the CloudWatch console and choose Metrics in the navigation pane, then choose All Metrics. Under Custom namespace, you can see MSK_Connect with database name as the dimension. Select the database name to view the metrics.

Amazon CloudWatch interface with time series graph and MSK Connect metric details

Select the MilliSecondBehindSource metric with statistic as Average in the Graphed Metric to plot the graph. You can verify that the MilliSecondBehindSource metric value is greater than zero whenever any operation is being performed on the source database and returns to zero during the idle time.

Amazon CloudWatch console showing custom metric visualization with detailed controls and timeline analysis

Clean up

Delete the resources that you created such as the Aurora DB, Amazon MSK Cluster and connectors by following the instructions at Cleanup in the Amazon MSK Connect lab if you have been following along to set up the solution on your account.

Conclusion

In this post, we showed you how to extend the Debezium MySQL connector plugin with an additional module to export the JMX metrics to CloudWatch as custom metrics. As a next step, you can create a CloudWatch alarm to monitor the metrics and take remediation actions when the alarm is triggered. In addition to exporting the JMX metrics to CloudWatch, you can export these metrics to third-party applications such as Prometheus or DataDog using CloudWatch Metric Streams. You can follow a similar approach to export the JMX metrics of other connectors from MSK Connect. You can learn more about creating your own connectors by visiting the Connector Developer Guide and how to deploy them as custom plugins in the MSK Connect documentation.

About the authors

Jaydev Nath is a Solutions Architect at AWS, where he works with ISV customers to build secure, scalable, reliable, and cost-efficient cloud solutions. He brings strong expertise in building SaaS architecture on AWS with a focus on Generative AI and data analytics technologies to help deliver practical, valuable business outcomes for customers.

David John Chakram is a Principal Solutions Architect at AWS. He specializes in building data platforms and architecting seamless data ecosystems. With a profound passion for databases, data analytics, and machine learning, he excels at transforming complex data challenges into innovative solutions and driving businesses forward with data-driven insights.

Sharmila Shanmugam is a Solutions Architect at Amazon Web Services. She is passionate about solving the customers’ business challenges with technology and automation and reduce the operational overhead. In her current role, she helps customers across industries in their digital transformation journey and build secure, scalable, performant and optimized workloads on AWS.

Effectively building AI agents on AWS Serverless

2025-08-15 Anton Aleksandrov

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/compute/effectively-building-ai-agents-on-aws-serverless/

Imagine an AI assistant that doesn’t just respond to prompts – it reasons through goals, acts, and integrates with real-time systems. This is the promise of agentic AI.

According to Gartner, by 2028 over 33% of enterprise applications will embed agentic capabilities – up from less than 1% today. While early generative AI efforts focused on GPUs and model training, agentic systems shift the focus to CPUs, orchestration, and integration with live data – the places where organizations are starting to see real return on investment (ROI).

In this post, you’ll learn how to build and run serverless AI agents on AWS using services such as Amazon Bedrock AgentCore (preview as of this post publication), AWS Lambda, and Amazon Elastic Container Service (Amazon ECS), which provide scalable compute foundations for agentic workloads. You’ll also explore architectural patterns, state management, identity, observability, and tool usage to support production-ready deployments.

Overview

Early AI assistants were stateless and reactive – each prompt processed in isolation, with no memory of prior interactions or awareness of broader context. Gradually, AI assistants became more capable by injecting system prompts, preserving conversation history, and incorporating enterprise knowledge using Retrieval-Augmented Generation (RAG), as illustrated in the following diagram.

Despite these improvements, traditional AI assistants still lacked true autonomy. They couldn’t reason through multi-step goals, make decisions on their own, or adjust workflows dynamically based on outcomes. As a result, they worked well for simpler Q&A or predefined workflows, but struggled with dynamic, more complex, real-world tasks that require planning, using external tools, and making decisions along the way.

Agentic AI systems shift from passive content generation to autonomous, goal-driven behavior. Powered by Large Language Models (LLMs) and enhanced with memory, planning, and tool use, these systems can break down complex tasks into smaller steps, reason through each step, and take real-time actions, such as calling APIs, executing tools, or interacting with live data. By referencing the LLM within a control cycle that manages context, memory, and decision-making, these systems can choose the right tools, adapt workflows, and integrate deeply into enterprise environments, with use cases ranging from travel booking and financial analysis to DevOps automation and code debugging. This is referred to as an agentic loop. In this system, the agent relies on the LLM’s reasoning output to execute tools, capture tool results, and feed these results to the LLM as updated context (as shown in the following diagram). This happens in a loop until LLM instructs the agent to return the final output to the caller.

While agentic loop is a lightweight approach to structuring these systems, other control flow paradigms, such as graph, swarm, and workflows, are also available in open-source frameworks like LangGraph.

Introducing Strands Agents SDK

Strands Agents SDK is a code-first framework to build production-ready AI agents with minimal boilerplate. It utilizes the above-mentioned agentic loop system and abstracts common challenges like memory management, tool integration, and multi-step reasoning in a lightweight, modular Python framework. Strands SDK handles state, tool orchestration, and multi-step reasoning so agents can remember past conversations, call external APIs, enforce business rules, and adapt to changing inputs. This allows you to focus on the application’s business logic.

Because agents built with Strands SDK are essentially Python apps, they’re portable and can run across different compute options, such as Bedrock AgentCore Runtime, Lambda functions, ECS tasks, or even locally. This makes Strands Agents SDK a powerful foundation for building scalable and goal-driven AI systems. The following sections assume you’re running your AI agents built with Strands Agents SDK on Lambda functions.

Building your first serverless AI agent

Imagine you’re building an AI-powered corporate travel assistant on AWS, and you have the following technical requirements:

Define the system prompts, memory, and model you want to use
Integrate tools for API calls, business logic, and knowledge bases
Ensure authentication and observability

Strands SDK handles heavy lifting, so you can focus on building smart, responsive agents with minimal overhead. The following code snippet creates a simple agent, according to your configuration.

from strands import Agent

agent = Agent(
    system_prompt=
      """You're a travel assistant that helps 
         employees book business trips 
         according to policy.""",
    model=my_model,
    tools=[get_policies, get_hotels, get_cars, book_travel]
)

response = agent("Book me a flight to NYC next Monday.")

That’s it. Your agent now has a personality, memory, and ability to use external tools. The Agent class in the Strands SDK abstracts agentic logic, such as maintaining conversation history, handling LLM interactions, orchestrating tools and external knowledge sources, and running the full agentic loop.

Session state management

Session state management is critical for agentic workflows. It allows agents to track goals across interactions – enabling coherent conversations, retaining context, and providing personalized experiences. Without state management, each prompt is handled in isolation, making it impossible for the agent to reference prior context or track ongoing tasks. In cloud environments, where applications need to be stateless and scalable, the solution is to externalize session state to persistent storage, such as Amazon Simple Storage Service (Amazon S3). This allows any agent instance to reconstruct the conversation history on demand, delivering a seamless, stateful user experience while keeping the agentic app itself stateless for scalability and resilience.

AI agents built with Strands store conversation history in the agent.messages property (see documentation). To support stateless compute environments, you can externalize the agent state, persisting it after each interaction and restoring it before the next. This preserves continuity across invocations while keeping your agent instances stateless. In user-aware agentic applications, you want to persist state for each user, typically associated with the user’s unique ID. The following example illustrates how you can do it with the built-in S3SessionManager class when running your agent in a stateless environment such as a Lambda function:

    session_manager = S3SessionManager(
        session_id=f"session_for_user_{user.id}",
        bucket=SESSION_STORE_BUCKET_NAME,
        prefix="agent_sessions"
    )

    agent = Agent(
        session_manager=session_manager
    )

When using Bedrock AgentCore, use the fully managed, serverless AgentCore Memory primitive to manage sessions and long-term memory. It provides relevant context to models while helping agents learn from past interactions. You can make Strands’ session manager work with AgentCore Memory similar to S3SessionManager.

Authentication and authorization

For enterprise AI agents to operate safely, they must know who the user is and what they are allowed to do. This goes beyond basic identity validation – AI agents often act on behalf of users, so they might need to enforce role-based access controls, support audit, and comply with corporate policies.

AWS services like Amazon Cognito, Amazon Identity and Access Management (IAM), and Amazon API Gateway provide a solid foundation for authentication and authorization. For example, you can use Cognito to authenticate users through user pools or federated identity providers, combined with API Gateway and Lambda authorizer to validate user access permissions before forwarding requests to the agent, as shown in the preceding diagram. IAM policies define what the agent is allowed to do. After the user is both authenticated and authorized, the agent can extract the identity context, for example, from a JSON Web Token (JWT), to personalize prompts, enforce rules, or dynamically restrict actions.

The following code snippet illustrates retrieving user’s identity from the Authorization header and passing it to an agent:

def handler(event: dict, ctx):
    user_id = extract_user_id(event["headers"]["Authorization"])
    user_prompt: dict = json.loads(event["body"])["prompt"]
    agent_response = agent.prompt(user_id, user_prompt)
  
    return {
        "statusCode": 200,
        "body": json.dumps({"text": agent_response.text})
    }

The identity context can become a part of the agent’s execution loop. An agent might check the user’s department before booking travel or restrict access to sensitive tools unless the user has the appropriate permissions. By integrating authentication early, you not only enhance security, but also unlock rich personalization and audit capabilities that make agents enterprise-ready from day one.

When using Bedrock AgentCore, the AgentCore Identity primitive allows your AI agents to securely access AWS services and third-party tools either on behalf of users or as themselves with pre-authorized user consent. It provides managed OAuth 2.0 supported providers for both inbound and outbound authentication. During the preview phase, AgentCore Identity supports identity providers like Amazon Cognito, Auth0 by Okta, Microsoft Entra ID, GitHub, Google, Salesforce, and Slack. Refer to the samples for implementation details.

Building portable Strands agents on AWS

Strands Agents SDK is compute-agnostic. The agents you build are standard Python applications, which can run on any compute type.

For portability and maintainability, separate your agent’s business logic from the interface layer. By doing this, you can reuse the same core agent code across environments, whether invoked through API Gateway and Lambda functions, accessed through Application Load Balancer and Amazon ECS, running on AgentCore Runtime, or even executed locally during development, as shown in the following figure.

The following code snippets illustrate this technique.

Lambda handler code:

def handler(event: dict, ctx):
     user_id = extract_user_id(event)
     user_prompt = json.loads(event["body"])["prompt"]
     agent_response = call_agent(user_id, user_prompt)
     return {
          "statusCode":200,
          "body": json.dumps({
               "text": agent_response.mesage
          })
     }

AgentCore code:

@app.entrypoint
def invoke(payload):
     user_id = extract_user_id(payload)
     user_prompt = payload.get("prompt")
     agent_response = call_agent(user_id, user_prompt)
     return {"result": agent_response.message)

HTTP Handler code:

@app.post("/prompt")
async def prompt(request: Request, prompt_request: PromptRequest):
    user_id=extract_user_id(request)
    user_prompt = prompt_request.prompt
    agent_response = call_agent(user_id, user_prompt)
    return {"text": agent_response.message)

For local testing:

if __name__ == "__main__":
     user_id="local-testing-user"
     user_prompt="book me a trip to NYC"
     agent_response = call_agent(user_id, user_prompt)
     return agent_response.message

Agent code:

def call_agent(user_id, user_prompt):
     agent = Agent(
          system_prompt="You’re a travel agent…",
          model=my_model,
          session_manager = my_session_manager,    
      )
     agent_response = agent(user_prompt)
     return agent_response

Extending agent functionality with tools

A key strength of agentic systems is their ability to invoke tools that perform actions or retrieve real-time data, enabling agents to interact with the outside world, not just generate text. The Strands Agents SDK includes built-in tools and allows you to define your own custom tools, as either in-process Python functions or external tools accessible over HTTP using the Model Context Protocol (MCP). These tools can fetch data, call APIs, or trigger workflows, and can be registered for the agent to reason over and use during execution.

The following snippet illustrates creating an in-process tool. See the documentation for more examples.

from strands import tool 

@tool
def get_weather(city: str) -> str:
    weather = call_weather_api(city)
    return f"The current weather in {city} is {weather}"

Integrating with remote MCP servers

Model Context Protocol (MCP) is an open standard that decouples agents from tools using a client-server model. Instead of embedding tool logic directly into the agent, your agent becomes an MCP client that connects to one or more MCP servers – each exposing tools, resources, and reusable prompts.

Running remote MCP servers is especially valuable when tools span multiple business domains or are provided by third-party vendors, just like how microservices separate responsibilities across teams and systems. This separation allows each domain team to manage their own tools independently while exposing a consistent, standardized interface to agents. It also enables reuse, versioning, and centralized governance without tightly coupling logic into the agent itself. By decoupling tools from agents, MCP unlocks composability, scalability, and long-term ecosystem growth.

The following snippet illustrates configuring an MCP client to connect to a remote MCP Server, retrieving the list of tools, and integrating those tools with an agent.

mcp_client = MCPClient(lambda: streamablehttp_client(
    url=mcp_endpoint,
    headers={"Authorization": f"Bearer {token}"},
))

with mcp_client:
  tools = mcp_client.list_tools_sync()
  agent = Agent(tools=tools)

When using Bedrock AgentCore, you can operate MCP at scale through AgentCore Gateway. It provides an easy and secure way for developers to build, deploy, discover, and connect to remote tools like above at scale. With AgentCore Gateway, developers can convert APIs, Lambda functions, and existing services into Model Context Protocol (MCP)-compatible tools and make them available to agents through Gateway endpoints with just a few lines of code.

Monitoring and observability

Observability is essential when running AI agents. Beyond traditional metrics such as uptime and latency, agentic systems introduce new telemetry dimensions, such as LLM latency, token consumption, and tracing reasoning cycles. These new metrics are essential for understanding both the performance and cost of your agentic systems.

When deploying agents using AWS services such as Bedrock AgentCore, Lambda, or ECS, you inherit the built-in observability capabilities, such as seamless integration with Amazon CloudWatch for metrics, logs, and distributed tracing. This simplifies tracking invocation counts, errors, request duration, and concurrency, as shown in the following figure – essential for operating reliable and scalable agentic applications.

In addition, the Strands Agents SDK provides built-in agent observability features. It uses OpenTelemetry (OTEL) to automatically trace each agent interaction, including spans for LLM calls, tool usage, and context updates. It also exports detailed metrics such as token counts, tool execution times, and decision cycle durations. These metrics can be sent to any OTEL-compatible backend, giving you deep, real-time visibility into how your agents reason, act, and adapt. The following snippet shows built-in token usage metrics:

{
  "accumulated_usage": {
    "inputTokens": 1539,
    "outputTokens": 122,
    "totalTokens": 1661
  },
  "average_cycle_time": 0.881234884262085,
  "total_cycles": 2,
  "total_duration": 1.881234884262085,
  ... redacted ...
}

Learn more about observability and evaluation of Strands agents from this sample code.

When using Bedrock AgentCore, the AgentCore Observability primitive helps you to log and capture metrics and traceability from other AgentCore primitives like runtime, memory, and gateway, as described in this tutorial.

Security considerations

You should build secure communication and access controls layers deploying AI agents that integrate with remote MCP servers. All client-server interactions should be encrypted using TLS, ideally with mutual TLS for bidirectional authentication. Access to tools should be validated through authorization checks with fine-grained permissions to enforce least privilege access. Deploying MCP servers behind an API Gateway provides additional security layers like DDoS protection, WAF, and centralized authentication. Use API Gateway logging capabilities to capture caller identity and execution outcomes. Using trusted, versioned MCP repositories helps protect against supply chain attacks and ensures consistent tool governance across teams. Protocols such as MCP are evolving rapidly, you should always use the most recent versions to minimize potential security vulnerabilities risk.

In addition, you should leverage security best practices described in the AWS Well-Architected Framework Security Pillar, such as enforcing strict IAM role scoping, integrating with identity providers for user context, encrypting all data in transit and at rest, and using VPC endpoints and PrivateLink to limit network exposure. To protect against prompt injection attacks, sanitize inputs, and ensure you maintain comprehensive audit logs for compliance and governance.

Sample project

Follow instructions in this GitHub repo to deploy a sample project implementing the practices described in this post using the AWS Serverless compute. The repo includes a travel agent implemented with Strands Agents SDK and a remote MCP server, both running as Lambda functions.

Conclusion

Agentic AI moves beyond simple prompt-response interactions to enable dynamic, goal-driven workflows. In this post, you learned how to build scalable, production-ready agents on AWS using the Strands Agents SDK and serverless services such as Lambda and Amazon ECS.

By externalizing state, integrating authentication, and adding observability, agents can operate securely and at scale. With support for in-process and remote tools through the MCP, you can cleanly separate responsibilities and build composable, enterprise-ready systems. You can combine these patterns to deliver intelligent, adaptable AI agents that fit naturally into modern cloud and event-driven architectures.

Useful resources

To learn more about Serverless architectures see Serverless Land.

Cluster manager communication simplified with Remote Publication

2025-08-14 Himshikha Gupta

Post Syndicated from Himshikha Gupta original https://aws.amazon.com/blogs/big-data/cluster-manager-communication-simplified-with-remote-publication/

Amazon OpenSearch Service has taken a significant leap forward in scalability and performance with the introduction of support for 1,000-node OpenSearch Service domains capable of handling 500,000 shards with OpenSearch Service version 2.17. This breakthrough is made possible by multiple features, including Remote Publication, which introduces an innovative cluster state publication mechanism that enhances scalability, availability, and durability. It uses the remote cluster state feature as the base. This feature provides durability and makes sure metadata is not lost even when the majority of the cluster manager nodes fail permanently. By using a remote store for cluster state publication, OpenSearch Service can now support clusters with a higher number of nodes and shards.

The cluster state is an internal data structure that contains cluster information. The elected cluster manager node manages this state. It’s distributed to follower nodes through the transport layer and stored locally on each node. A follower node can be a data node, a coordinator node or a non-elected cluster manager node. However, as the cluster grows, publishing the cluster state over the transport layer becomes challenging. The increasing size of the cluster state consumes more network bandwidth and blocks transport threads during publication. This can impact scalability and availability. This post explains cluster state publication, Remote Publication, and their benefits in improving durability, scalability, and availability.

How did cluster state publication work before Remote Publication?

The elected cluster manager node is responsible for maintaining and distributing the latest OpenSearch cluster state to all the follower nodes. The cluster state updates when you create indexes and update mappings, or when internal actions like shard relocations occur. Distribution of the updates follows a two-phase process: publish and commit. In the publish phase, the cluster manager sends the updated state to the follower nodes and saves a copy locally. After a majority (more than half) of the eligible cluster manager nodes acknowledge this update, the commit phase begins, where the follower nodes are instructed to apply the new state.

To optimize performance, the elected cluster manager sends only the changes since the last update, referred to as the diff state, reducing data transfer. However, if a folllower node is out of sync or new to the cluster, it might reject the diff state. In such cases, the cluster manager sends the full cluster state to those follower nodes.

The following diagram depicts the cluster state publication flow.

Sequence of steps between the cluster manager node and a follower node demonstrating the cluster state publication over transport layer

The workflow consists of the following steps:

The user invokes an admin API such as create index.
The elected cluster manager node computes the cluster state for the admin API request.
The elected cluster manager node sends the cluster state publish request to follower nodes.
The follower nodes respond with an acknowledgement to the publish request.
The elected cluster manager node persists the cluster state to the disk.
The elected cluster manager node sends the commit request to follower nodes.
The follower nodes respond with an acknowledgement to the commit request.

We’ve observed stable cluster operations with this publication flow up to 200 nodes or 75,000 shards. However, as the cluster state grows in size with more indexes, shards, and nodes, it starts consuming high network bandwidth and blocking transport threads for a longer duration during publication. Additionally, it becomes CPU and memory intensive for the elected cluster manager to transmit to the follower nodes, often impacting publication latency. The increased latency can lead to a high pending task count on the elected cluster manager. This can cause request timeouts, or in severe cases, cluster manager failure, creating a cluster outage.

Using a remote store for cluster state publication improved availability and scalability

With Remote Publication, cluster state updates are transmitted through an Amazon Simple Storage Service (Amazon S3) bucket as the remote store, rather than transmitting the state over the transport layer. When the elected cluster manager updates the cluster state, it uploads the new state to Amazon S3 in addition to persisting on disk. The cluster manager uploads a manifest file, which keeps track of the entities and which entities changed from their previous state. Similarly, follower nodes download the manifest from Amazon S3 and can decide if it needs the full state or only changed entities. This has two benefits: reduced cluster manager resource usage and faster transport thread availability.

Creating new domains or upgrading from existing OpenSearch Service versions to 2.17 or above, or applying the service patch to an existing 2.17 or above domain, enables Remote Publication by default, This provides seamless migration with the remote state. This is enabled by default for SLA clusters, with or without remote-backed storage. Let’s dive into some details of this design and understand how it works internally.

How is the remote store modeled for scalability?

Having scalable and efficient Amazon S3 storage is essential for Remote Publication to work seamlessly. The cluster state has multiple entities, which get updated at different frequencies. For example, cluster node data only changes if a new node joins the cluster or an old node leaves the cluster, which usually happens during blue/green deployments or node replacements. However, shard allocation can change multiple times a day based on index creations, rollovers, or internal service triggered relocations. The storage schema needs to be able to handle these entities in a way that a change in one entity doesn’t impact another entity. A manifest file keeps track of the entities. Each cluster state entity has its own separate file, like one for templates, one for cluster settings, one for cluster nodes, and so on. For entities that scale with the number of indexes, like index metadata and index shard allocation, per-index files are created to make sure changes in an index can be uploaded and downloaded independently. The manifest file keeps track of paths to these individual entity files. The following code shows a sample manifest file. It contains the details of the granular cluster state entities’ files uploaded to Amazon S3 along with some basic metadata.

{
    "term": 5,
    "version": 10,
    "cluster_uuid": "dsgYj10Nkso7",
    "state_uuid": "dlu34Dh2Hiq",
    "node_id": "7rsyg5FbSeSt",
    "node_version": "3000099",
    "committed": true,
    "indices": [{
        "index_name": "index1",
        "uploaded_filename": "index1-s3-key"
    }, {
        "index_name": "index2",
        "uploaded_filename": "index2-s3-key"
    }],
    "indices_routing": [{
        "index_name": "index1",
        "uploaded_filename": "index1-routing-s3-key"
    }, {
        "index_name": "index2",
        "uploaded_filename": "index2-routing-s3-key"
    }],
    "uploaded_settings_metadata": {
        "uploaded_filename": "settings-s3-key"
    },
    "diff_manifest": {
        "from_state_uuid": "aRiq3oEip",
        "to_state_uuid": "dlu34Dh2Hiq",
        "metadata_diff": {
            "settings_metadata_diff": true,
            "indices_diff": {
                "upserts": ["index1"],
                "deletes": ["index2"]
            }
        },
        "routing_table_diff": {
            "upserts": ["index1"],
            "deletes": ["index2"],
            "diff": "indices-routing-diff-s3-key"
        }
    }
}

In addition to keeping track of cluster state components, the manifest file also keeps track of what entities changed compared to the last state, which is the diff manifest. In the preceding code, diff manifest has a section for metadata diff and routing table diff. This signifies that between these two versions of the cluster state, these entities have changed.

We also keep a separate shard diff file specifically for shard allocation. Because multiple shards for different indexes can be relocated in a single cluster state update, having this shard diff file further reduces the number of files to download.

This configuration provides the following benefits:

Separate files help prevent bloating a single document
Per-index files reduces the number of updates and effectively reduces the network bandwidth usage, because most updates affect only a few indexes
Having a diff tracker makes downloads on nodes efficient because only limited data needs to be downloaded

To support the scale and high request rate to Amazon S3, we use Amazon S3 pre-partitioning, so we can scale proportionally with the number of clusters and indexes. For managing storage size, an asynchronous scheduler is added, which cleans up stale files and keeps only the last 10 recently updated documents. After a cluster is deleted, a domain sweeper job removes the files for that cluster after a few days.

Remote Publication overview

Now that you understand how cluster state is persisted in Amazon S3, let’s see how it is used during the publication workflow. When a cluster state update occurs, the elected cluster manager uploads changed entities to Amazon S3 in parallel, with the number of concurrent uploads determined by a fixed thread pool. It then updates and uploads a manifest file with diff details and file paths.

During the publish phase, the elected cluster manager sends the manifest path, term, and version to follower nodes using a new remote transport action. When the elected cluster manager changes, the newly elected cluster manager increments the term which signifies the number of times a new cluster manager election has occurred. The elected cluster manager increments the cluster state version when the cluster state is updated. You can use these two components to identify cluster state progression and make sure nodes operate with the same understanding of the cluster’s configuration. The follower nodes download the manifest, determine if they need a full state or just the diff, and then download the required files from Amazon S3 in parallel. After the new cluster state is computed, follower nodes acknowledge the elected cluster manager.

In the commit phase, the elected cluster manager updates the manifest, marking it as committed, and instructs follower nodes to commit the new cluster state. This process provides efficient distribution of cluster state updates, especially in large clusters, by minimizing direct data transfer between nodes and using Amazon S3 for storage and retrieval. The following diagram depicts the Remote Publication flow when an index creation triggers a cluster state update.

Sequence of steps between the cluster manager node, the follower nodes, and a remote store such as Amazon S3 depicting the remote cluster state publication

The workflow consists of the following steps:

The user invokes an admin API such as create index.
The elected cluster manager node uploads the index metadata and routing table files in parallel to the configured remote store.
The elected cluster manager node uploads the manifest file containing the details of the metadata files to the remote store.
The elected cluster manager sends the remote manifest file path to the follower nodes.
The follower node downloads the manifest file from the remote store.
The follower nodes download the index metadata and routing table files from the remote store in parallel.

Failure detection in publication

Remote Publication brings in a significant change to how publication works and how the cluster state is managed. Issues in file creation, publication, or downloading and creating cluster state on follower nodes can have a potential impact on the cluster. To make sure the new flow works as expected, a checksum validation is added to the publication flow. On the elected cluster manager, after creating a new cluster state, a checksum is created for individual entities and the overall cluster state and added to the manifest. On follower nodes, after the cluster state is created after download, a checksum is created again and matched against the checksum from the manifest. A mismatch in checksums means the cluster state on the follower node is different from that on the elected cluster manager. In the default mode, the service only logs which entity is failing the checksum match and lets the cluster state persist. For further debugging, checksum match supports different modes, where it can download the complete state and find the diff between two states in trace mode, or fail the publication request in failure mode.

Recovery from failures

With remote state, quorum loss is recovered by using the cluster state from the remote store. Without remote state, the cluster manager might lose metadata, leading to data loss for your cluster. However, the cluster manager can now use the last persisted state to help prevent metadata loss in the cluster. The following diagram illustrates the states of a cluster before a quorum loss, during a quorum loss, and after the quorum loss recovery happens using a remote store.

The states of a cluster before a quorum loss, during a quorum loss, and after the quorum loss recovery happens using remote store

Benefits

In this section, we discuss some of the solution benefits.

Scalability and availability

Remote Publication significantly reduces the CPU, memory, and network overhead for the elected cluster manager when transmitting the state to the follower nodes. Additionally, transport threads responsible for sending publish requests to follower nodes are made available more quickly, because the remote publish request size is smaller. The publication request size remains consistent irrespective of the cluster state size, giving consistent publication performance. This enhancement enables OpenSearch Service to support larger clusters of up to 1,000 nodes and a higher number of shards per node, without overwhelming the elected cluster manager. With reduced load on the cluster manager, its availability improves, so it can more efficiently serve admin API requests.

Durability

With the cluster state being persisted to Amazon S3, we get Amazon S3 durability. Clusters suffering quorum loss due to replacement of cluster manager nodes can hydrate with the remote cluster state and recover from quorum loss. Because Amazon S3 has the last committed cluster state, there is no data loss on recovery.

Cluster state publication performance

We tested the elected cluster manager performance in a 1,000-node domain containing 500,000 shards. We compared two versions: the new Remote Publication system vs. the older cluster state publication system. Both clusters were operated with the same workload for a few hours. The following are some key observations:

Cluster state publication time reduced from an average of 13 seconds to 4 seconds, which is a three-fold improvement
Network out reduced from an average of 4 GB to 3 GB
Elected cluster manager resource utilization showed significant improvement, with JVM dropping from an average of 40% to 20% and CPU dropping from 50% to 40%

We tested on a 100-node cluster as well and saw performance improvements with the increase in the size of the cluster state. With 50,000 shards, the uncompressed cluster state size increased to 600 MB. The following observations were made during cluster state update when compared to a cluster without Remote Publication:

Max network out traffic reduced from 11.3 GB to 5.7 GB (approximately 50%)
Average elected cluster manager JVM usage reduced from 54% to 35%
Average elected cluster manager CPU reduced from 33% to 20%

Contributing to open source

OpenSearch is an open source, community-driven software. You can find code for the Remote Publication feature in the project’s GitHub repository. Some of the notable GitHub pull requests have been added inline to the preceding text. You can find the RFCs for remote state and remote state publication in the project’s GitHub repository. A more comprehensive list of pull requests is attached in the meta issues for remote state, remote publication, and remote routing table.

Looking ahead

The new Remote Publication architecture enables teams to build additional features and optimizations using the remote store:

Faster recovery after failures – With the new architecture, we have the last successful cluster state in Amazon S3, which can be downloaded on the new cluster manager. At the time of writing, only cluster metadata gets restored on recovery and then the elected cluster manager tries to build shard allocation by contacting the data nodes. This takes up a lot of CPU and memory for both the cluster manager and data nodes, in addition to the time taken to collate the data to build the allocation table. With the last successful shard allocation available in Amazon S3, the elected cluster manager can download the data, build the allocation table locally, and then update the cluster state to the follower nodes, making recovery faster and less resource-intensive.
Lazy loading – The cluster state entities can be loaded as needed instead of all at once. This approach reduces the average memory usage on a follower node and is expected to speed up cluster state publication.
Node-specific metadata – At present, every follower node downloads and loads the entire cluster state. However, we can optimize this by modifying the logic so that a data node only downloads the index metadata and routing table for the indexes it contains.
Optimize cluster state downloads – There is an opportunity to optimize the downloading of cluster state entities. We are exploring compression and serialization techniques to minimize the amount of data transmitted.
Restoring to an older state – The service keeps the cluster state for the last 10 updates. This can be used to restore the cluster to a previous state in case the state gets corrupted.

Conclusion

Remote Publication makes cluster state publication faster and more robust, significantly improving cluster scalability, reliability, and recovery capabilities, potentially reducing customer incidents and operational overhead. This change in architecture enables further improvements in elected cluster manager performance and making domains more durable, especially for larger domains where cluster manager operations become heavy as the number of indexes and nodes increase. We encourage you to upgrade to the latest version to take advantage of these improvements and share your experience with our community.

About the authors

Himshikha Gupta is a Senior Engineer with Amazon OpenSearch Service. She is excited about scaling challenges with distributed systems. She is an active contributor to OpenSearch, focused on shard management and cluster scalability

Sooraj Sinha is a software engineer at Amazon, specializing in Amazon OpenSearch Service since 2021. He has worked on multiple core components of OpenSearch, including indexing, cluster management, and cross-cluster replication. His contributions have focused on improving the availability, performance, and durability of OpenSearch.

Enhance Amazon EMR observability with automated incident mitigation using Amazon Bedrock and Amazon Managed Grafana

2025-08-14 Yu-Ting Su

Post Syndicated from Yu-Ting Su original https://aws.amazon.com/blogs/big-data/enhance-amazon-emr-observability-with-automated-incident-mitigation-using-amazon-bedrock-and-amazon-managed-grafana/

Maintaining high availability and quick incident response for Amazon EMR clusters is important in data analytics environments. In this post, we show you how to build an automated observability system that combines Amazon Managed Grafana with Amazon Bedrock to detect and remediate EMR cluster issues. We demonstrate how to integrate real-time monitoring with AI-powered remediation suggestions, combining Amazon Managed Grafana for visualization, Amazon Bedrock for intelligent response recommendations, and AWS Systems Manager for automated remediation actions on Amazon Web Services (AWS).

Solution overview

This solution helps you improve EMR cluster observability through a comprehensive four-layer architecture—comprising monitoring, notification, remediation, and knowledge management—to provide the following features:

Real-time monitoring of EMR clusters using Amazon Managed Service for Prometheus and Amazon Managed Grafana
Automated first-aid remediation through Systems Manager
AI-powered incident response suggestions using Amazon Bedrock
Integration with the AWS Premium Support knowledge base
Historical incident data archival and analysis

The implementation of this architecture delivers the following key benefit:

Reduced Mean time to resolution (MTTR)
Proactive incident prevention
Automated first-response actions
Knowledge base enrichment through machine learning

The following diagram illustrates the solution architecture.

The architecture comprises the following core components:

Monitoring layer – The monitoring layer uses Amazon Managed Service for Prometheus and Amazon CloudWatch to capture real-time metrics from EMR clusters. Amazon Managed Grafana serves as the visualization layer, offering comprehensive dashboards for Apache YARN, HDFS, Apache HBase, and Apache Hudi performance monitoring. Advanced alerting mechanisms trigger notifications based on predefined query results.
Notification layer – To provide timely and reliable alert delivery, the notification layer uses Amazon Simple Notification Service (Amazon SNS) for distribution and Amazon Simple Queue Service (Amazon SQS) for message queuing. This architecture prevents message delays and provides a robust trigger mechanism for AWS Lambda functions.
Remediation layer – The remediation layer enables automatic issue resolution through:
- Lambda functions for orchestration
- Systems Manager for script execution
- Amazon Bedrock (amazon.nova-lite-v1:0) for generating intelligent response recommendations
Knowledge management layer – To maintain an up-to-date knowledge base, the solution:
- Archives technical articles in Amazon Simple Storage Service (Amazon S3)
- Implements periodic data synchronization using Amazon EventBridge
- Integrates with an Amazon Bedrock knowledge base for enhanced insights

We provide an AWS CloudFormation template to deploy the solution resources.

Prerequisites

Before starting this walkthrough, make sure you have access to the following AWS resources and configurations:

An AWS account
Access to the US East (N. Virginia) AWS Region
- Add access to Amazon Bedrock foundation models (amazon.nova-lite-v1:0)
Amazon EMR version 6.15.0 (used in this demo)
Archived technical or troubleshooting articles
AWS IAM Identity Center enabled with at least one role that can become a Grafana administrator
(Optional) AWS Premium Support with a business support plan or higher for enhanced troubleshooting capabilities

Throughout this walkthrough, we provide detailed instructions to set up and configure these prerequisites if you haven’t already done so.

Configure resources using AWS CloudFormation

Complete the following steps to configure your resources:

Launch the CloudFormation stack:

Provide emrobservability as the stack name.
Select a virtual private cloud (VPC) and assign a public subnet.
For EMRClusterName, enter a name for your cluster (default: emrObservability).
Enter an existing Amazon S3 location as the Apache HBase root directory location (for example, s3://mybucket/my/hbase/rootdir/).
For MasterInstanceType and CoreInstanceType, enter your instance types (default: m5.xlarge for both).
For CoreInstanceCount, enter your instance count (default: 2).
For SSHIPRange, use CheckIp and enter your IP (for example, 10.1.10/32).
Choose the release label (default: 6.15.0).
For KeyName, enter a key name to SSH to Amazon Elastic Compute Cloud (Amazon EC2) instances.
For LatestAmiId, enter your AMI (default: /aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2).
For KBS3Bucket, enter a name for your S3 bucket (for example, mykbbucket).
For SubscriptionEndpoint, enter an email address to receive notifications and responses (for example, [email protected]).

Accept subscription confirmation

Accept the subscription confirmation sent to the email address you specified in the CloudFormation stack parameters. The following screenshot shows an example of the email you receive.

Prepare the knowledge base

Complete the following steps to populate the S3 bucket with archived technical articles and cases:

On the Lambda console, choose Functions in the navigation pane.
Choose the function CustomFunctionCopyKCArticlesToS3Bucket.

Manually invoke the function by choosing Test on the Test tab.

Verify successful execution by checking the CloudWatch logs.

Repeat the process for the Lambda function CustomFunctionCopyCasesToS3Bucket.

Confirm the S3 bucket has been populated with archived technical articles and cases.

Sync data to the Amazon Bedrock knowledge base

Complete the following steps to sync the data to your knowledge base:

On the Lambda console, choose Functions in the navigation pane.
Choose the function KBDataSourceSync.

Manually invoke the function by choosing Test on the Test tab.

This task might take 10–15 minutes to complete.

Verify successful execution by checking the CloudWatch logs.

Configure your Amazon Managed Grafana workspace

Complete the following steps to configure your Amazon Managed Grafana workspace:

On the Amazon Managed Grafana console, choose Workspaces in the navigation pane.
Open your workspace.
Choose Assign new user or group.

Select your IAM Identity Center role and choose Assign users and groups.

On the Admin dropdown menu, choose Make admin.

Enable Grafana alerting, then choose Save changes.

Wait 10 minutes for the workspace to become active.
When it’s active, sign in to the Grafana workspace. (For more information, refer to Connect to your workspace.)

Configure data sources

Add and configure the following data sources:

For Service, choose CloudWatch, then select your Region and add CloudWatch as a data source.

Choose Amazon Managed Service for Prometheus as a second data source and select your Region.

Validate CloudWatch connectivity:
1. Run test queries (for example, Namespace: AWS/EC2, Metric name: CPUUtilization, Statistic: Maximum).
2. Verify CloudWatch metric retrieval.

Validate Amazon Managed Service for Prometheus connectivity:
1. Run test queries (for example, Metric: hadoop_hbase_numregionservers, Label filters: cluster_id = <Amazon EMR cluster ID>).
2. Verify Prometheus metric retrieval.

Confirm SNS notification channels

Complete the following steps to confirm your SNS notification is set up:

On the Amazon SNS console, choose Topics in the navigation pane.
Locate and note the ARNs for -LambdaFunctionTopic and -QALambdaFunctionTopic.

Choose Contact points under Alerting.

Create the first contact point:
1. For Name, enter SNS_SSM.
2. For Integration, choose AWS SNS.
3. For Topic, enter the ARN for LambdaFunctionTopic.
4. For Auth Provider, choose Workspace IAM role.
5. For Alert Message format, choose JSON.

Create the second contact point:
1. For Name, enter SNS_QA.
2. For Integration, choose AWS SNS.
3. For Topic, enter the ARN for QALambdaFunctionTopic.
4. For Auth Provider, choose Workspace IAM role.
5. For Alert Message format, choose JSON.

Create alert rules

Complete the following steps to set up two critical alert rules:

Choose Alert rules under Alerting.

Set up alerting if the Apache HBase region server status is abnormal:
1. For Alert name, enter HBase region server down.
2. For Data source, choose Amazon Managed Service for Prometheus.
3. For Metric, choose hadoop_hbase_numregionservers.
4. For Threshold, configure to alert if the region server count is less than 2 for 3 minutes.
5. For Evaluation interval, set to 1 minute.
6. For Contact point, choose SNS_SSM.

Create a second alert for if Amazon EC2 CPU utilization is abnormal:
1. For Alert name, enter EC2 CPU utilization too high.
2. For Data source, choose Amazon CloudWatch.
3. For Namespace, choose AWS/EC2.
4. For Metric name, choose CPUUtilization
5. For Statistic, choose Maximum.
6. For Threshold, configure to alert if CPU utilization is more than 95% for 3 minutes.
7. For Evaluation interval, configure to 1 minute.
8. For Contact point, choose SNS_QA.

On the alert rule creation page, scroll to 5. Add annotations and for Summary, add a clear description of the alert, for example, CPU utilization on EC2 instance is too high.

Apache HBase region server incident test

To confirm the system is working as expected, complete the following Apache HBase region server incident test:

SSH into an EMR core instance.
Stop the Apache HBase region server using systemctl:

 # Stop HBase region server service 
 sudo systemctl stop hbase-regionserver.service

Verify the service status:

 # Check the current state of HBase region server service 
 sudo systemctl status hbase-regionserver.service

Observe Amazon Managed Grafana alert progression:
1. Monitor alert status changes.
2. Verify SNS message generation.
3. Confirm SQS message queuing.
4. Track the Lambda function triggered for remediation.

CPU utilization stress test

Complete the following CPU utilization stress test:

SSH into the EMR primary instance.
Install stress testing tools:

 sudo amazon-linux-extras install epel -y
 sudo yum install stress -y

Verify the installation:

 stress --version

Generate high CPU load using the stress command and the following command structure:

 sudo stress [options]

For our Amazon EMR test, use the following command:

 # For m5.xlarge instances (4 vCPUs) sudo stress --cpu 4

-c 4 in the command creates 4 CPU-bound processes (one for each vCPU).The following are instance type vCPUs for your reference:

m5.xlarge: 4 vCPUs
m5.2xlarge: 8 vCPUs
m5.4xlarge: 16 vCPUs

Monitor system response:
1. Observe Amazon Managed Grafana alert status changes.
2. Verify Amazon Bedrock recommendation generation.
3. Check SNS email notification delivery.

Best practices and considerations

Monitoring infrastructure requires precise alert prioritization and threshold configuration. Alert aggregation techniques prevent notification overload by consolidating event streams and reducing redundant alerts. Operational teams must maintain dashboards through consistent updates and metric integration, providing real-time visibility into system performance and health.

Security implementations focus on least-privilege AWS Identity and Access Management (IAM) roles, restricting access to critical resources and minimizing potential breach vectors. Data protection strategies involve encryption protocols for information at rest and in transit, using AES-256 standards. Automated security audit processes scan automation scripts, identifying potential vulnerabilities through code analysis and runtime inspection.

Performance optimization in serverless architectures uses Lambda extensions to cache knowledge base content, reducing latency and improving response times. Retry mechanisms for API calls implement exponential backoff strategies, mitigating transient network exceptions and enhancing system resilience. Execution time monitoring of Lambda functions enables detection of anomalies through statistical analysis, providing insights into potential system-wide incidents or performance degradations.

Clean up

To avoid incurring future charges, delete the resources by deleting the parent stack on the AWS CloudFormation console.

Conclusion

This solution provides a robust framework for automated EMR cluster monitoring and incident response. By combining real-time monitoring with AI-powered remediation suggestions and automated execution, organizations can significantly reduce MTTR for common Amazon EMR issues while building a knowledge base for future incident response.

Try out this solution for your own use case, and leave your feedback in the comments section.

About the authors

Yu-ting Su, Sr. Hadoop System Engineer, AWS Support Engineering. Yu-Ting is a Sr. Hadoop Systems Engineer at Amazon Web Services (AWS). Her expertise is in Amazon EMR and Amazon OpenSearch Service. She’s passionate about distributing computation and helping people to bring their ideas to life.

Deploy LLMs on Amazon EKS using vLLM Deep Learning Containers

2025-08-14 Vishal Naik

Post Syndicated from Vishal Naik original https://aws.amazon.com/blogs/architecture/deploy-llms-on-amazon-eks-using-vllm-deep-learning-containers/

Organizations face significant challenges when deploying large language models (LLMs) efficiently at scale. Key challenges include optimizing GPU resource utilization, managing network infrastructure, and providing efficient access to model weights.When running distributed inference workloads, organizations often encounter complexity in orchestrating model operations across multiple nodes. Common challenges include effectively distributing model components across available GPUs, coordinating seamless communication between processing units, and maintaining consistent performance with low latency and high throughput.

vLLM is an open source library for fast LLM inference and serving. The vLLM AWS Deep Learning Containers (DLCs) are optimized for customers deploying vLLMs on Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), and Amazon Elastic Kubernetes Service (Amazon EKS), and are provided at no additional charge. These containers package a preconfigured, pre-tested environment that functions seamlessly out of the box, includes the necessary dependencies such as drivers and libraries for running vLLMs efficiently, and offers built-in support for Elastic Fabric Adapter (EFA) for high-performance multi-node inference workloads. You don’t have to build the inference environment from scratch anymore. Instead, you can install the vLLM DLC and it will automatically set up and configure the environment, and you can start deploying the inference workloads at scale.

In this post, we demonstrate how to deploy the DeepSeek-R1-Distill-Qwen-32B model using AWS DLCs for vLLMs on Amazon EKS, showcasing how these purpose-built containers simplify deployment of this powerful open source inference engine. This solution can help you solve the complex infrastructure challenges of deploying LLMs while maintaining performance and cost-efficiency.

AWS DLCs

AWS DLCs provide generative AI practitioners with optimized Docker environments to train and deploy generative AI models in their pipelines and workflows across Amazon EC2, Amazon EKS, and Amazon ECS. AWS DLCs are targeted for self-managed machine learning (ML) customers who prefer to build and maintain their AI/ML environments on their own, want instance-level control over their infrastructure, and manage their own training and inference workloads. DLCs are available as Docker images for training and inference, and also with PyTorch and TensorFlow.DLCs are kept current with the latest version of frameworks and drivers, are tested for compatibility and security, and are offered at no additional cost. They are also quickly customizable by following our recipe guides. Using AWS DLCs as a building block for generative AI environments reduces the burden on operations and infrastructure teams, lowers TCO for AI/ML infrastructure, accelerates the development of generative AI products, and helps the generative AI teams focus on the value-added work of deriving generative AI-powered insights from the organization’s data.

Solution overview

The following diagram shows the interaction between Amazon EKS, GPU-enabled EC2 instances with EFA networking, and Amazon FSx for Lustre storage. Client requests flow through the Application Load Balancer (ALB) to the vLLM server pods running on EKS nodes, which access model weights stored on FSx for Lustre. This architecture provides a scalable, high-performance solution for serving LLM inference workloads with optimal cost-efficiency.

The following diagram illustrates the DLC stack on AWS. The stack demonstrates a comprehensive architecture from EC2 instance foundation through container runtime, essential GPU drivers, and ML frameworks like PyTorch. The layered diagram shows how CUDA, NCCL, and other critical components integrate to support high-performance deep learning workloads.

The vLLM DLCs are specifically optimized for high-performance inference, with built-in support for tensor parallelism and pipeline parallelism across multiple GPUs and nodes. This optimization enables efficient scaling of large models like DeepSeek-R1-Distill-Qwen-32B, which would otherwise be challenging to deploy and manage. The containers also include optimized CUDA configurations and EFA drivers, facilitating maximum throughput for distributed inference workloads. This solution uses the following AWS services and components:

AWS DLCs for vLLMs – Pre-configured, optimized Docker images that simplify deployment and maximize performance
EKS cluster – Provides the Kubernetes control plane for orchestrating containers
P4d.24xlarge instances – EC2 P4d instances with 8 NVIDIA A100 GPUs each, configured in a managed node group
Elastic Fabric Adapter – Network interface that enables high-performance computing applications to scale efficiently
FSx for Lustre – High-performance file system for storing model weights
LeaderWorkerSet pattern – Custom Kubernetes resource for deploying vLLM in a distributed configuration
AWS Load Balancer Controller – Manages the ALB for external access

By combining these components, we create an inference system that delivers low-latency, high-throughput LLM serving capabilities with minimal operational overhead.

Prerequisites

Before getting started, make sure you have the following prerequisites:

An AWS account with access to EC2 P4 instances (you might need to request a quota increase)
Access to a terminal that has the following tools installed:
- AWS CLI version 2.11.0 or later
- eksctl version 0.150.0 or later
- kubectl version 1.27 or later
- Helm version 3.12.0 or later
An AWS CLI profile (vllm-profile) configured with an AWS Identity and Access Management (IAM) role or user that has the following permissions:
- Create, manage, and delete EKS clusters and node groups (see Create a Kubernetes cluster on the AWS Cloud for more details)
- Create, manage, and delete EC2 resources, including virtual private clouds (VPCs), subnets, security groups, and internet gateways (see Identity-based policies for Amazon EC2 for more details)
- Create and manage IAM roles (see Identity-based policies and resource-based policies for more details)
- Create, update, and delete AWS CloudFormation stacks
- Create, delete, and describe FSx file systems (see Identity and access management for Amazon FSx for Lustre for more details)
- Create and manage Elastic Load Balancers

This solution can be deployed in AWS Regions where Amazon EKS, P4d instances, and FSx for Lustre are available. This guide uses the us-west-2 Region. The complete deployment process takes approximately 60–90 minutes.

Clone our GitHub repository containing the necessary configuration files:

# Clone the repository
git clone https://github.com/aws-samples/sample-aws-deep-learning-containers.git
cd vllm-samples/deepseek/eks

Create an EKS cluster

First, we create an EKS cluster in the us-west-2 Region using the provided configuration file. This sets up the Kubernetes control plane that will orchestrate our containers. The cluster is configured with a VPC, subnets, and security groups optimized for running GPU workloads.

# Update the region in eks-cluster.yaml if needed
sed -i "s|region: us-east-1|region: us-west-2|g" eks-cluster.yaml

# Create the EKS cluster
eksctl create cluster -f eks-cluster.yaml --profile vllm-profile

This will take approximately 15–20 minutes to complete. During this time, eksctl creates a CloudFormation stack that provisions the necessary resources for your EKS cluster, as shown in the following screenshot.

You can validate the cluster creation with the following code:

# Verify cluster creation
eksctl get cluster --profile vllm-profile
Expected output:
NAME            REGION          EKSCTL CREATED
vllm-cluster    us-west-2       True

You can also see the cluster created on the Amazon EKS console.

Create a node group with EFA support

Next, we create a managed node group with P4d.24xlarge instances that have EFA enabled. These instances are equipped with 8 NVIDIA A100 GPUs each, providing substantial computational power for LLM inference. When deploying EFA-enabled instances like p4d.24xlarge for high-performance ML workloads, you must place them in private subnets to facilitate secure, optimized networking. By dynamically identifying and using a private subnet’s Availability Zone in your node group configuration, you can maintain proper network isolation while supporting the high-throughput, low-latency communication essential for distributed training and inference with LLMs. We identify the Availability Zone using the following code:

# Get the VPC ID from the EKS cluster
VPC_ID=$(aws --profile vllm-profile eks describe-cluster --name vllm-cluster \
  --query "cluster.resourcesVpcConfig.vpcId" --output text)

# Find the one of private subnet's availability zone
PRIVATE_AZ=$(aws --profile vllm-profile ec2 describe-subnets \
  --filters "Name=vpc-id,Values=$VPC_ID" "Name=map-public-ip-on-launch,Values=false" \
  --query "Subnets[0].AvailabilityZone" --output text)
echo "Selected private subnet AZ: $PRIVATE_AZ"

# update the nodegroup_az section with the private AZ value
sed -i "s|availabilityZones: \[nodegroup_az\]|availabilityZones: \[\"$PRIVATE_AZ\"\]|g" large-model-nodegroup.yaml

# Verify the change
grep "availabilityZones" large-model-nodegroup.yaml

# Create the node group with EFA support
eksctl create nodegroup -f large-model-nodegroup.yaml --profile vllm-profile

This will take approximately 10–15 minutes to complete. The EFA configuration is particularly important for multi-node deployments, because it enables high-throughput, low-latency networking between nodes. This is crucial for distributed inference workloads where communication between GPUs on different nodes can become a bottleneck. After the node group is created, configure kubectl to connect to the cluster:

# Configure kubectl to connect to the cluster
aws eks update-kubeconfig --name vllm-cluster --region us-west-2 --profile vllm-profile

Verify that the nodes are ready:

# Check node status
kubectl get nodes

The following is an example of the expected output:

NAME                                            STATUS   ROLES    AGE     VERSION
ip-192-168-xx-xx.us-west-2.compute.internal     Ready    <none>   5m      v1.31.7-eks-xxxx
ip-192-168-yy-yy.us-west-2.compute.internal     Ready    <none>   5m      v1.31.7-eks-xxxx

You can also see the node group created on the Amazon EKS console.

Check NVIDIA device pods

Because we’re using an Amazon EKS optimized AMI with GPU support (ami-0ad09867389dc17a1), the NVIDIA device plugin is already included in the cluster, so there’s no need to install it separately. Verify that the NVIDIA device plugin is running:

# Check NVIDIA device plugin pods
kubectl get pods -n kube-system | grep nvidia

The following is an example of the expected output:

nvidia-device-plugin-daemonset-xxxxx 1/1 Running 0 3m48s 
nvidia-device-plugin-daemonset-yyyyy 1/1 Running 0 3m48s

Verify that GPUs are available in the cluster:

# Check available GPUs
kubectl get nodes -o json | jq '.items[].status.capacity."nvidia.com/gpu"'

The following is our expected output:

"8"
"8"

Create an FSx for Lustre file system

For optimal performance, we create an FSx for Lustre file system to store our model weights. FSx for Lustre provides high-throughput, low-latency access to data, which is essential for loading large model weights efficiently. We use the following code:

# Create a security group for FSx Lustre
FSX_SG_ID=$(aws --profile vllm-profile ec2 create-security-group --group-name fsx-lustre-sg \
  --description "Security group for FSx Lustre" \
  --vpc-id $(aws --profile vllm-profile eks describe-cluster --name vllm-cluster \
  --query "cluster.resourcesVpcConfig.vpcId" --output text) \
  --query "GroupId" --output text)

echo "Created security group: $FSX_SG_ID"

# Add inbound rules for FSx Lustre
aws --profile vllm-profile ec2 authorize-security-group-ingress --group-id $FSX_SG_ID \
  --protocol tcp --port 988-1023 \
  --source-group $(aws --profile vllm-profile eks describe-cluster --name vllm-cluster \
  --query "cluster.resourcesVpcConfig.clusterSecurityGroupId" --output text)

aws --profile vllm-profile ec2 authorize-security-group-ingress --group-id $FSX_SG_ID \
     --protocol tcp --port 988-1023 \
     --source-group $FSX_SG_ID

# Create the FSx Lustre filesystem
SUBNET_ID=$(aws --profile vllm-profile eks describe-cluster --name vllm-cluster \
  --query "cluster.resourcesVpcConfig.subnetIds[0]" --output text)

echo "Using subnet: $SUBNET_ID"

FSX_ID=$(aws --profile vllm-profile fsx create-file-system --file-system-type LUSTRE \
  --storage-capacity 1200 --subnet-ids $SUBNET_ID \
  --security-group-ids $FSX_SG_ID --lustre-configuration DeploymentType=SCRATCH_2 \
  --tags Key=Name,Value=vllm-model-storage \
  --query "FileSystem.FileSystemId" --output text)

echo "Created FSx filesystem: $FSX_ID"

# Wait for the filesystem to be available (typically takes 5-10 minutes)
echo "Waiting for filesystem to become available..."
aws --profile vllm-profile fsx describe-file-systems --file-system-id $FSX_ID \
  --query "FileSystems[0].Lifecycle" --output text

# You can run the above command periodically until it returns "AVAILABLE"
# Example: watch -n 30 "aws --profile vllm-profile fsx describe-file-systems --file-system-id $FSX_ID --query FileSystems[0].Lifecycle --output text"

# Get the DNS name and mount name
FSX_DNS=$(aws --profile vllm-profile fsx describe-file-systems --file-system-id $FSX_ID \
  --query "FileSystems[0].DNSName" --output text)

FSX_MOUNT=$(aws --profile vllm-profile fsx describe-file-systems --file-system-id $FSX_ID \
  --query "FileSystems[0].LustreConfiguration.MountName" --output text)

echo "FSx DNS: $FSX_DNS"
echo "FSx Mount Name: $FSX_MOUNT"

The file system is configured with 1.2 TB of storage capacity, SCRATCH_2 deployment type for high performance, and security groups that allow access from our EKS nodes. You can also check the FSx for Lustre file system on the FSx for Lustre console.

Install the AWS FSx CSI Driver

To mount the FSx for Lustre file system in our Kubernetes pods, we install the AWS FSx CSI Driver. This driver enables Kubernetes to dynamically provision and mount FSx for Lustre volumes.

# Add the AWS FSx CSI Driver Helm repository
helm repo add aws-fsx-csi-driver https://kubernetes-sigs.github.io/aws-fsx-csi-driver/
helm repo update

# Install the AWS FSx CSI Driver
helm install aws-fsx-csi-driver aws-fsx-csi-driver/aws-fsx-csi-driver --namespace kube-system

Verify that the AWS FSx CSI Driver is running:

# Check AWS FSx CSI Driver pods
kubectl get pods -n kube-system | grep fsx

The following is an example of the expected output:

fsx-csi-controller-xxxx     4/4     Running   0          24s
fsx-csi-controller-yyyy     4/4     Running   0          24s
fsx-csi-node-xxxx              3/3     Running   0          24s
fsx-csi-node-yyyy              3/3     Running   0          24s

Create Kubernetes resources for FSx for Lustre

We create the necessary Kubernetes resources to use our FSx for Lustre file system:

# Update the storage class with your subnet and security group IDs
sed -i "s|<subnet-id>|$SUBNET_ID|g" fsx-storage-class.yaml
sed -i "s|<sg-id>|$FSX_SG_ID|g" fsx-storage-class.yaml

# Update the PV with your FSx Lustre details
sed -i "s|<fs-id>|$FSX_ID|g" fsx-lustre-pv.yaml
sed -i "s|<fs-id>.fsx.us-west-2.amazonaws.com|$FSX_DNS|g" fsx-lustre-pv.yaml
sed -i "s|<mount-name>|$FSX_MOUNT|g" fsx-lustre-pv.yaml

# Apply the Kubernetes resources
kubectl apply -f fsx-storage-class.yaml
kubectl apply -f fsx-lustre-pv.yaml
kubectl apply -f fsx-lustre-pvc.yaml

Verify that the resources were created successfully:

# Check storage class
kubectl get sc fsx-sc

# Check persistent volume
kubectl get pv fsx-lustre-pv

# Check persistent volume claim
kubectl get pvc fsx-lustre-pvc

The following is an example of the expected output:

NAME     PROVISIONER      RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
fsx-sc   fsx.csi.aws.com   Retain          Immediate           false                  1m

NAME             CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                  STORAGECLASS   REASON   AGE
fsx-lustre-pv   1200Gi     RWX            Retain           Bound    default/fsx-lustre-pvc  fsx-sc                  1m

NAME             STATUS   VOLUME           CAPACITY   ACCESS MODES   STORAGECLASS   AGE
fsx-lustre-pvc   Bound    fsx-lustre-pv   1200Gi     RWX            fsx-sc         1m

These resources include:

A StorageClass that defines how to provision FSx for Lustre volumes
A PersistentVolume that represents our existing FSx for Lustre file system
A PersistentVolumeClaim that our pods will use to mount the file system

Install the AWS Load Balancer Controller

To expose our vLLM service to the outside world, we install the AWS Load Balancer Controller. This controller manages ALBs for our Kubernetes services and ingresses. Refer to Install AWS Load Balancer Controller with Helm for addition details.

# Download the IAM policy document
curl -o iam-policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/main/docs/install/iam_policy.json

# Create the IAM policy
aws --profile vllm-profile iam create-policy --policy-name AWSLoadBalancerControllerIAMPolicy --policy-document file://iam-policy.json

# Create an IAM OIDC provider for the cluster
eksctl utils associate-iam-oidc-provider --profile vllm-profile --region=us-west-2 --cluster=vllm-cluster --approve

# Create an IAM service account for the AWS Load Balancer Controller
ACCOUNT_ID=$(aws --profile vllm-profile sts get-caller-identity --query "Account" --output text)
eksctl create iamserviceaccount \
  --profile vllm-profile \
  --cluster=vllm-cluster \
  --namespace=kube-system \
  --name=aws-load-balancer-controller \
  --attach-policy-arn=arn:aws:iam::$ACCOUNT_ID:policy/AWSLoadBalancerControllerIAMPolicy \
  --override-existing-serviceaccounts \
  --approve

# Install the AWS Load Balancer Controller using Helm
helm repo add eks https://aws.github.io/eks-charts
helm repo update

# Install the CRDs
kubectl apply -f https://raw.githubusercontent.com/aws/eks-charts/master/stable/aws-load-balancer-controller/crds/crds.yaml

# Install the controller
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=vllm-cluster \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller

Verify that the AWS Load Balancer Controller is running:

# Check AWS Load Balancer Controller pods
kubectl get pods -n kube-system | grep aws-load-balancer-controller
# Install the LeaderWorkerSet controller
   helm install lws oci://registry.k8s.io/lws/charts/lws \
     --version=0.6.1 \
     --namespace lws-system \
     --create-namespace \
     --wait --timeout 300s

Configure security groups for the ALB

We create a dedicated security group for the ALB and configure it to allow inbound traffic on port 80 from our client IP addresses. We also configure the node security group to allow traffic from the ALB security group to the vLLM service port.

# Create security group for the ALB
USER_IP=$(curl -s https://checkip.amazonaws.com)

VPC_ID=$(aws --profile vllm-profile eks describe-cluster --name vllm-cluster \
  --query "cluster.resourcesVpcConfig.vpcId" --output text)

ALB_SG=$(aws --profile vllm-profile ec2 create-security-group \
  --group-name vllm-alb-sg \
  --description "Security group for vLLM ALB" \
  --vpc-id $VPC_ID \
  --query "GroupId" --output text)

echo "ALB security group: $ALB_SG"

# Allow inbound traffic on port 80 from your IP
aws --profile vllm-profile ec2 authorize-security-group-ingress \
  --group-id $ALB_SG \
  --protocol tcp \
  --port 80 \
  --cidr ${USER_IP}/32

# Get the node group security group ID
NODE_INSTANCE_ID=$(aws --profile vllm-profile ec2 describe-instances \
  --filters "Name=tag:eks:nodegroup-name,Values=vllm-p4d-nodes-efa" \
  --query "Reservations[0].Instances[0].InstanceId" --output text)

NODE_SG=$(aws --profile vllm-profile ec2 describe-instances \
  --instance-ids $NODE_INSTANCE_ID \
  --query "Reservations[0].Instances[0].SecurityGroups[0].GroupId" --output text)

echo "Node security group: $NODE_SG"

# Allow traffic from ALB security group to node security group on port 8000 (vLLM service port)
aws --profile vllm-profile ec2 authorize-security-group-ingress \
  --group-id $NODE_SG \
  --protocol tcp \
  --port 8000 \
  --source-group $ALB_SG

# Update the security group in the ingress file
sed -i "s|<sg-id>|$ALB_SG|g" vllm-deepseek-32b-lws-ingress.yaml

Verify that the security groups were created and configured correctly:

# Verify ALB security group
aws --profile vllm-profile ec2 describe-security-groups --group-ids $ALB_SG --query "SecurityGroups[0].IpPermissions"
The following is the expected output for the ALB security group:
[
    {
        "FromPort": 80,
        "IpProtocol": "tcp",
        "IpRanges": [
            {
                "CidrIp": "USER_IP/32"
            }
        ],
        "ToPort": 80
    }
]
# Verify node security group rules
aws --profile vllm-profile ec2 describe-security-groups --group-ids $NODE_SG --query "SecurityGroups[0].IpPermissions"

Deploy the vLLM server

Finally, we deploy the vLLM server using the LeaderWorkerSet pattern. The AWS DLCs provide an optimized environment that minimizes the complexity typically associated with deploying LLMs.The vLLM DLCs come preconfigured with the following features:

Optimized CUDA libraries for maximum GPU utilization
EFA drivers and configurations for high-speed node-to-node communication
Ray framework setup for distributed computing
Performance-tuned vLLM installation with support for tensor and pipeline parallelism

This prepackaged solution dramatically reduces deployment time, the need for complex environment setup, dependency management, and performance tuning that would otherwise require specialized expertise.

# Deploy the vLLM server
# First, verify that the AWS Load Balancer Controller is running
kubectl get pods -n kube-system | grep aws-load-balancer-controller

# Wait until the controller is in Running state
# If it's not running, check the logs:
# kubectl logs -n kube-system deployment/aws-load-balancer-controller

# Apply the LeaderWorkerSet
kubectl apply -f vllm-deepseek-32b-lws.yaml

The deployment will start immediately, but the pod might remain in ContainerCreating state for several minutes (5–15 minutes) while it pulls the large GPU-enabled container image. After the container starts, it will take additional time (10–15 minutes) to download and load the DeepSeek model.You can monitor the progress with the following code:

# Monitor pod status
kubectl get pods

# Check pod logs
kubectl logs -f <pod-name>
Here is the out put of one of the pods
Kubectl logs -f vllm-deepseek-32b-lws-0

The following is the expected output when pods are running:

NAME                      READY   STATUS    RESTARTS   AGE 
vllm-deepseek-32b-lws-0  1/1     Running   0          10m
vllm-deepseek-32b-lws-0-1  1/1     Running   0          10m

We also deploy an ingress resource that configures the ALB to route traffic to our vLLM service:

# Apply the ingress (only after the controller is running)
kubectl apply -f vllm-deepseek-32b-lws-ingress.yaml

You can check the status of the ingress with the following code:

# Check ingress status
kubectl get ingress

The following is an example of the expected output:

NAME                       CLASS   HOSTS   ADDRESS                                                                  PORTS   AGE
vllm-deepseek-32b-lws-ingress  alb     *       k8s-default-vllmdeep-xxxxxxxx-xxxxxxxxxx.us-west-2.elb.amazonaws.com     80      5m

Test the deployment

When the deployment is complete, we can test our vLLM server. It provides the following API endpoints:

/v1/completions – For text completions
/v1/chat/completions – For chat completions
/v1/embeddings – For generating embeddings
/v1/models – For listing available models

# Test the vLLM server
# Get the ALB endpoint
export VLLM_ENDPOINT=$(kubectl get ingress vllm-deepseek-32b-lws-ingress -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
echo "vLLM endpoint: $VLLM_ENDPOINT"

# Test the completions API
curl -X POST http://$VLLM_ENDPOINT/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
      "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
      "prompt": "Hello, how are you?",
      "max_tokens": 100,
      "temperature": 0.7
  }'

The following is an example of the expected output:

{
  "id": "cmpl-xxxxxxxxxxxxxxxxxxxxxxxx",
  "object": "text_completion",
  "created": 1717000000,
  "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
  "choices": [
    {
      "index": 0,
      "text": " I'm doing well, thank you for asking! How about you? Is there anything I can help you with today?",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 105,
    "completion_tokens": 100
  }
}

You can also test the chat completions API:

# Test the chat completions API
curl -X POST http://$VLLM_ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
      "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
      "messages": [{"role": "user", "content": "What are the benefits of using FSx Lustre with EKS?"}],
      "max_tokens": 100,
      "temperature": 0.7
  }'

If you encounter errors, check the logs of the vLLM pods:

# Troubleshooting
kubectl logs -f <pod-name>

Performance considerations

In this section, we discuss different performance considerations.

Elastic Fabric Adapter

EFA provides significant performance benefits for distributed inference workloads:

Reduced latency – Lower and more consistent latency for communication between GPUs across nodes
Higher throughput – Higher throughput for data transfer between nodes
Improved scaling – Better scaling efficiency across multiple nodes
Better performance – Significantly improved performance for distributed inference workloads

FSx for Lustre integration

Using FSx for Lustre for model storage provides several benefits:

Persistent storage – Model weights are stored on the FSx for Lustre file system and persist across pod restarts
Faster loading – After the initial download, model loading is much faster
Shared storage – Multiple pods can access the same model weights
High performance – FSx for Lustre provides high-throughput, low-latency access to the model weights

Application Load Balancer

Using the AWS Load Balancer Controller with ALB provides several advantages:

Path-based routing – ALB supports routing traffic to different services based on the URL path
SSL/TLS termination – ALB can handle SSL/TLS termination, reducing the load on your pods
Authentication – ALB supports authentication through Amazon Cognito or OIDC
AWS WAF – ALB can be integrated with AWS WAF for additional security
Access logs – ALB can log the requests to an Amazon Simple Storage Service (Amazon S3) bucket for auditing and analysis

Clean up

To avoid incurring additional charges, clean up the resources created in this post. Run the provided ./cleanup.sh script to clean the Kubernetes resources (ingress, LeaderworkerSet, PersistentVolumeClaim, PersistentVolume, AWS Load Balancer Controller, and storage class), IAM resources, the FSX for Lustre file system, and the EKS cluster:

chmod +x cleanup.sh
./cleanup.sh

For more detailed cleanup instructions, including troubleshooting CloudFormation stack deletion failures, refer to the README.md file in the GitHub repository.

Conclusion

In this post, we demonstrated how to deploy the DeepSeek-R1-Distill-Qwen-32B model on Amazon EKS using vLLMs, with GPU support, EFA, and FSx for Lustre integration. This architecture provides a scalable, high-performance system for serving LLM inference workloads.AWS Deep Learning Containers for vLLM provide a streamlined, optimized environment that simplifies LLM deployment by minimizing the complexity of environment configuration, dependency management, and performance tuning. By using these preconfigured containers, organizations can reduce deployment timelines and focus on deriving value from their LLM applications.By combining AWS DLCs with Amazon EKS, P4d instances with NVIDIA A100 GPUs, EFA, and FSx for Lustre, you can achieve optimal performance for LLM inference while maintaining the flexibility and scalability of Kubernetes.This solution helps organizations:

Deploy LLMs efficiently at scale
Optimize GPU resource utilization with container orchestration
Improve networking performance between nodes with EFA
Accelerate model loading with high-performance storage
Provide a scalable, high performance inference API

The complete code and configuration files for this deployment are available in our GitHub repository. We encourage you to try it out and adapt it to your specific use case.

About the authors

Build data pipelines with dbt in Amazon Redshift using Amazon MWAA and Cosmos

2025-08-13 Cindy Li

Post Syndicated from Cindy Li original https://aws.amazon.com/blogs/big-data/build-data-pipelines-with-dbt-in-amazon-redshift-using-amazon-mwaa-and-cosmos/

Effective collaboration and scalability are essential for building efficient data pipelines. However, data modeling teams often face challenges with complex extract, transform, and load (ETL) tools, requiring programming expertise and a deep understanding of infrastructure. This complexity can lead to operational inefficiencies and challenges in maintaining data quality at scale.

dbt addresses these challenges by providing a simpler approach where data teams can build robust data models using SQL, a language they’re already familiar with. When integrated with modern development practices, dbt projects can use version control for collaboration, incorporate testing for data quality, and utilize reusable components through macros. dbt also automatically manages dependencies, making sure data transformations execute in the correct sequence.

In this post, we explore a streamlined, configuration-driven approach to orchestrate dbt Core jobs using Amazon Managed Workflows for Apache Airflow (Amazon MWAA) and Cosmos, an open source package. These jobs run transformations on Amazon Redshift, a fully managed data warehouse that enables fast, scalable analytics using standard SQL. With this setup, teams can collaborate effectively while maintaining data quality, operational efficiency, and observability. Key steps covered include:

Creating a sample dbt project
Enabling auditing within the dbt project to capture runtime metrics for each model
Creating a GitHub Actions workflow to automate deployments
Setting up Amazon Simple Notification Service (Amazon SNS) to proactively alert on failures

These enhancements enable model-level auditing, automated deployments, and real-time failure alerts. By the end of this post, you will have a practical and scalable framework for running dbt Core jobs with Cosmos on Amazon MWAA, so your team can ship reliable data workflows faster.

Solution overview

The following diagram illustrates the solution architecture.

The workflow contains the following steps:

Analytics engineers manage their dbt project in their version control tool. In this post, we use GitHub as an example.
We configure an Apache Airflow Directed Acyclic Graph (DAG) to use the Cosmos library to create an Airflow task group that contains all the dbt models as part of the dbt project.
We use a GitHub Actions workflow to sync the dbt project files and the DAG to an Amazon Simple Storage Service (Amazon S3) bucket.
During the DAG run, dbt converts the models, tests, and macros to Amazon Redshift SQL statements, which run directly on the Redshift cluster.
If a task in the DAG fails, the DAG invokes an AWS Lambda function to send out a notification using Amazon SNS.

Prerequisites

You must have the following prerequisites:

A Redshift provisioned cluster or a Redshift serverless workgroup. For creation instructions, refer to the Amazon Redshift Management Guide.
An S3 bucket to store dbt project files and DAGs.
An Amazon MWAA environment. For creation instructions, refer to Create an Amazon MWAA Environment.
Network connectivity from Amazon MWAA to Amazon Redshift. Deploy Amazon MWAA and your Redshift cluster in the same virtual private cloud (VPC). Add an Amazon MWAA security group as a source for the Redshift security group inbound rule.
An AWS Identity and Access Management (IAM) role to connect GitHub Actions to actions in AWS. For creation instructions, refer to Use IAM roles to connect GitHub Actions to actions in AWS and Security best practices in IAM.

Create a dbt project

A dbt project is structured to facilitate modular, scalable, and maintainable data transformations. The following code is a sample dbt project structure that this post will follow:

MY_SAMPLE_DBT_PROJECT
├── .github
│   └── workflows
│       └── publish_assets.yml
└── src
    ├── dags
    │   └── dbt_sample_dag.py
    └── my_sample_dbt_project
        ├── macros
        ├── models
        └── dbt_project.yml

dbt uses the following YAML files:

dbt_project.yml – Serves as the main configuration for your project. Objects in this project will inherit settings defined here unless overridden at the model level. For example:

# Name your project! Project names should contain only lowercase characters
# and underscores. 
name: 'my_sample_dbt_project'
version: '1.0.0'

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. 
model-paths: ["models"]
macro-paths: ["macros"]

# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models
# In this example config, we tell dbt to build models in the example/
# directory as views. These settings can be overridden in the individual model
# files using the `{{ config(...) }}` macro.
models:
  my_sample_dbt_project:
    # Config indicated by + and applies to files under models/example/
    example:
      +materialized: view
      
on-run-end:
# add run results to audit table 
  - "{{ log_audit_table(results) }}"

sources.yml – Defines the external data sources that your dbt models will reference. For example:

sources:
  - name: sample_source
    database: sample_database
    schema: sample_schema
    tables:
      - name: sample_table

schema.yml – Outlines the schema of your models and data quality tests. In the following example, we have defined two columns, full_name for the model model1 and sales_id for model2. We have declared them as the primary key and defined data quality tests to check if the two columns are unique and not null.

version: 2

models:
  - name: model1
    config: 
      contract: {enforced: true}

    columns:
      - name: full_name
        data_type: varchar(100)
        constraints:
          - type: primary_key
        tests:
          - unique
          - not_null

  - name: model2
    config: 
      contract: {enforced: true}

    columns:
      - name: sales_id
        data_type: varchar(100)
        constraints:
          - type: primary_key
        tests:
          - unique
          - not_null

Enable auditing within dbt project

Enabling auditing within your dbt project is crucial for facilitating transparency, traceability, and operational oversight across your data pipeline. You can capture run metrics at the model level for each execution in an audit table. By capturing detailed run metrics such as load identifier, runtime, and number of rows affected, teams can systematically monitor the health and performance of each load, quickly identify issues, and trace changes back to specific runs.

The audit table consists of the following attributes:

load_id – An identifier for each model run executed as part of the load
database_name – The name of the database within which data is being loaded
schema_name – The name of the schema within which data is being loaded
name – The name of the object within which data is being loaded
resource_type – The type of object to which data is being loaded
execution_time – The time duration taken for each dbt model to complete execution as part of each load
rows_affected – The number of rows affected in the dbt model as part of the load

Complete the following steps to enable auditing within your dbt project:

Navigate to the models directory (src/my_sample_dbt_project/models) and create the audit_table.sql model file:

{%- set run_date = "CURRENT_DATE" -%}
{{
    config(
        materialized='incremental',
        incremental_strategy='append',
        tags=["audit"]
    )
}}

with empty_table as (
    select
        'test_load_id'::varchar(200) as load_id,
        'test_invocation_id'::varchar(200) as invocation_id,
        'test_database_name'::varchar(200) as database_name,
        'test_schema_name'::varchar(200) as schema_name,
        'test_model_name'::varchar(200) as name,
        'test_resource_type'::varchar(200) as resource_type,
        'test_status'::varchar(200) as status,
        cast('12122012' as float) as execution_time,
        cast('100' as int) as rows_affected,
        {{run_date}} as model_execution_date
)

select * from empty_table
-- This is a filter so we will never actually insert these values
where 1 = 0

Navigate to the macros directory (src/my_sample_dbt_project/macros) and create the parse_dbt_results.sql macro file:

{% macro parse_dbt_results(results) %}
    -- Create a list of parsed results
    {%- set parsed_results = [] %}
    -- Flatten results and add to list
    {% for run_result in results %}
        -- Convert the run result object to a simple dictionary
        {% set run_result_dict = run_result.to_dict() %}
        -- Get the underlying dbt graph node that was executed
        {% set node = run_result_dict.get('node') %}
        {% set rows_affected = run_result_dict.get(
        'adapter_response', {}).get('rows_affected', 0) %}
        {%- if not rows_affected -%}
            {% set rows_affected = 0 %}
        {%- endif -%}
        {% set parsed_result_dict = {
                'load_id': invocation_id ~ '.' ~ node.get('unique_id'),
                'invocation_id': invocation_id,
                'database_name': node.get('database'),
                'schema_name': node.get('schema'),
                'name': node.get('name'),
                'resource_type': node.get('resource_type'),
                'status': run_result_dict.get('status'),
                'execution_time': run_result_dict.get('execution_time'),
                'rows_affected': rows_affected
                }%}
        {% do parsed_results.append(parsed_result_dict) %}
    {% endfor %}
    {{ return(parsed_results) }}
{% endmacro %}

Navigate to the macros directory (src/my_sample_dbt_project/macros) and create the log_audit_table.sql macro file:

{% macro log_audit_table(results) %}
    -- depends_on: {{ ref('audit_table') }}
    {%- if execute -%}
        {{ print("Running log_audit_table Macro") }}
        {%- set run_date = "CURRENT_DATE" -%}
        {%- set parsed_results = parse_dbt_results(results) -%}
        {%- if parsed_results | length  > 0 -%}
            {% set allowed_columns = ['load_id', 'invocation_id', 'database_name', 
            'schema_name', 'name', 'resource_type', 'status', 'execution_time', 
            'rows_affected', 'model_execution_date'] -%}
            {% set insert_dbt_results_query -%}
                insert into {{ ref('audit_table') }}
                    (
                        load_id,
                        invocation_id,
                        database_name,
                        schema_name,
                        name,
                        resource_type,
                        status,
                        execution_time,
                        rows_affected,
                        model_execution_date
                ) values
                    {%- for parsed_result_dict in parsed_results -%}
                        (
                            {%- for column, value in parsed_result_dict.items() %}
                                {% if column not in allowed_columns %}
                                    {{ exceptions.raise_compiler_error("Invalid
                                     column") }}
                                {% endif %}
                                {% set sanitized_value = value | replace("'", "''") %}
                                '{{ sanitized_value }}'
                                {%- if not loop.last %}, {% endif %}
                            {%- endfor -%}
                        )
                        {%- if not loop.last %}, {% endif %}
                    {%- endfor -%}
            {%- endset -%}
            {%- do run_query(insert_dbt_results_query) -%}
        {%- endif -%}
    {%- endif -%}
    {{ return ('') }}
{% endmacro %}

Append the following lines to the dbt_project.yml file:

on-run-end:
  - "{{ log_audit_table(results) }}"

Create a GitHub Actions workflow

This step is optional. If you prefer, you can skip it and instead upload your files directly to your S3 bucket.

The following GitHub Actions workflow automates the deployment of dbt project files and DAG file to Amazon S3. Replace the placeholders {s3_bucket_name}, {account_id}, {role_name}, and {region} with your S3 bucket name, account ID, IAM role name, and AWS Region in the workflow file.

To enhance security, it’s recommended to use OpenID Connect (OIDC) for authentication with IAM roles in GitHub Actions instead of relying on long-lived access keys.

name: Sync dbt Project with S3

on:
  workflow_dispatch:
  push:
    branches: [ main ]
    paths:
      - "src/**"

permissions:
  id-token: write   # This is required for requesting the JWT
  contents: read    # This is required for actions/checkout
  pull-requests: write

jobs:
  sync-dev:
    runs-on: ubuntu-latest
    environment: dev
    defaults:
      run:
        shell: bash
    steps:
      - uses: actions/checkout@v4
      - name: Assume AWS IAM Role
        uses: aws-actions/[email protected]
        with:
          aws-region: {region}
          role-to-assume: arn:aws:iam::{account_id}:role/{role_name}
          role-session-name: my_sample_dbt_project_${{ github.run_id }}
          role-duration-seconds: 3600 # 1 hour

      - run: aws sts get-caller-identity

      - name: Sync dbt Model files
        id: dbt_project_files
        working-directory: src/my_sample_dbt_project
        run: aws s3 sync . s3://{s3_bucket_name}/dags/dbt/my_sample_dbt_project 
        --delete
        continue-on-error: false

      - name: Sync DAG files
        id: dag_file
        working-directory: src/dags
        run: aws s3 sync . s3://{s3_bucket_name}/dags

GitHub has the following security requirements:

Branch protection rules – Before proceeding with the GitHub Actions workflow, make sure branch protection rules are in place. These rules enforce required status checks before merging code into protected branches (such as main).
Code review guidelines – Implement code review processes to make sure changes undergo review. This can include requiring at least one approving review before code is merged into the protected branch.
Incorporate security scanning tools – This can help detect vulnerabilities in your repository.

Make sure you are also adhering to dbt-specific security best practices:

Pay attention to dbt macros with variables and validate their inputs.
When adding new packages to your dbt project, evaluate their security, compatibility, and maintenance status to make sure they don’t introduce vulnerabilities or conflicts into your project.
Review dynamically generated SQL to safeguard against issues like SQL injection.

Update the Amazon MWAA instance

Complete the following steps to update the Amazon MWAA instance:

Install the Cosmos library on Amazon MWAA by adding astronomer-cosmos in the requirements.txt file. Make sure to check for version compatibility for Amazon MWAA and the Cosmos library.
Add the following entries in your startup.sh script:
1. In the following code, DBT_VENV_PATH specifies the location where the Python virtual environment for dbt will be created. DBT_PROJECT_PATH points to the location of your dbt project inside Amazon MWAA.
```
#!/bin/sh
export DBT_VENV_PATH="${AIRFLOW_HOME}/dbt_venv"
export DBT_PROJECT_PATH="${AIRFLOW_HOME}/dags/dbt"
```
2. The following code creates a Python virtual environment at the path ${DBT_VENV_PATH} and installs the dbt-redshift adapter to run dbt transformations on Amazon Redshift:
```
python3 -m venv "${DBT_VENV_PATH}"
${DBT_VENV_PATH}/bin/pip install dbt-redshift
```

Create a dbt user in Amazon Redshift and store credentials

To create dbt models in Amazon Redshift, you must set up a native Redshift user with the necessary permissions to access source tables and create new tables. It is essential to create separate database users with minimal permissions to follow the principle of least privilege. The dbt user should not be granted admin privileges, instead, it should only have access to the specific schemas required for its tasks.

Complete the following steps:

Open the Amazon Redshift console and connect as an admin (for more details, refer to Connecting to an Amazon Redshift database).
Run the following command in the query editor v2 to create a native user, and note down the values for dbt_user_name and password_value:
```
create user {dbt_user_name} password 'sha256|{password_value}';
```
Run the following commands in the query editor v2 to grant permissions to the native user:
1. Connect to the database where you want to source tables from and run the following commands:
```
grant usage on schema {schema_name} to {dbt_user_name};
grant select on all tables in schema {schema_name} to {dbt_user_name};
```
2. To allow the user to create tables within a schema, run the following command:
```
grant create on schema {schema_name} to {dbt_user_name};
```
Optionally, create a secret in AWS Secrets Manager and store the values for dbt_user_name and password_value from the previous step as plaintext:

{
    "username":"dbt_user_name",
    "password":"password_value"
}

Creating a Secrets Manager entry is optional, but recommended for securely storing your credentials instead of hardcoding them. To learn more, refer to AWS Secrets Manager best practices.

Create a Redshift connection in Amazon MWAA

We create one Redshift connection in Amazon MWAA for each Redshift database, making sure that each data pipeline (DAG) can only access one database. This approach provides distinct access controls for each pipeline, helping prevent unauthorized access to data. Complete the following steps:

Log in to the Amazon MWAA UI.
On the Admin menu, choose Connections.
Choose Add a new record.
For Connection Id, enter a name for this connection.
For Connection Type, choose Amazon Redshift.
For Host, enter the endpoint of the Redshift cluster without the port and database name (for example, redshift-cluster-1.xxxxxx.us-east-1.redshift.amazonaws.com).
For Database, enter the database of the Redshift cluster.
For Port, enter the port of the Redshift cluster.

Set up an SNS notification

Setting up SNS notifications is optional, but they can be a useful enhancement to receive alerts on failures. Complete the following steps:

Create an SNS topic.
Create a subscription to the SNS topic.
Create a Lambda function with the Python runtime.
Modify the function code in your Lambda function, and replace {topic_arn} with your SNS topic Amazon Resource Name (ARN):

import json

sns_client = boto3.client('sns')

def lambda_handler(event, context):
     try:
        # Extract DAG name from event
        failed_dag = event['dag_name']
        
        # Send notification 
        sns_client.publish(
            TopicArn={topic_arn}, 
            Subject="Data modelling dags - WARNING", 
            Message=json.dumps({'default': json.dumps(f"Data modelling DAG - 
            {failed_dag} has failed, please inform the data modelling team")}),
            MessageStructure='json'
        )
        
    except KeyError as e:
        # Handle missing 'dag_name' in the event
        logger.error(f"KeyError: invalid payload - dag_name not present")

Configure a DAG

The following sample DAG orchestrates a dbt workflow for processing and auditing data models in Amazon Redshift. It retrieves credentials from Secrets Manager, runs dbt tasks in a virtual environment, and sends an SNS notification if a failure occurs. The workflow consists of the following steps:

It starts with the audit_dbt_task task group, which creates the audit model.
The transform_data task group executes the other dbt models, excluding the audit-tagged one. Inside the transform_data group, there are two dbt models, model1 and model2, and each is followed by a corresponding test task that runs data quality tests defined in the schema.yml file.
To properly detect and handle failures, the DAG includes a dbt_check Python task that runs a custom function, check_dbt_failures. This is important because when using DbtTaskGroup, individual model-level failures inside the group don’t automatically propagate to the task group level. As a result, downstream tasks (such as the Lambda operator sns_notification_for_failure) configured with trigger_rule='one_failed' will not be triggered unless a failure is explicitly raised.

The check_dbt_failures function addresses this by inspecting the results of each dbt model and test, and raising an AirflowException if a failure is found. When an AirflowException is raised, the sns_notification_for_failure task is triggered.

If a failure occurs, the sns_notification_for_failure task invokes a Lambda function to send an SNS notification. If no failures are detected, this task is skipped.

The following diagram illustrates this workflow.

Configure DAG variables

To customize this DAG for your environment, configure the following variables:

project_name – Make sure the project_name matches the S3 prefix of your dbt project
secret_name – Provide the name of the secret that stores dbt user credentials
target_database and target_schema – Update these variables to reflect where you want to land your dbt models in Amazon Redshift
redshift_connection_id – Set this to match the connection configured in Amazon MWAA for this Redshift database
sns_lambda_function_name – Provide the Lambda function name to send SNS notifications
dag_name – Provide the DAG name that will be passed to the SNS notification Lambda function

import os
import json
import boto3
from airflow import DAG
from cosmos import (
    DbtTaskGroup, ProfileConfig, ProjectConfig,
    ExecutionConfig, RenderConfig
)
from cosmos.constants import ExecutionMode, LoadMode
from cosmos.profiles import RedshiftUserPasswordProfileMapping
from pendulum import datetime
from airflow.operators.python_operator import PythonOperator
from airflow.providers.amazon.aws.operators.lambda_function import (
    LambdaInvokeFunctionOperator
)
from airflow.exceptions import AirflowException

# project name - should match the s3 prefix of your dbt project
project_name = "my_sample_dbt_project"
# name of the secret that stores dbt user credentials 
secret_name = "dbt_user_credentials_secret"
# target database to land dbt models
target_database = "sample_database"
# target schema to land dbt models
target_schema = "sample_schema"
# Redshift connection name from MWAA
redshift_connection_id = "my_sample_dbt_project_connection"
# sns lambda function name
sns_lambda_function_name = "sns_notification"
# dag name - this will be passed to SNS for notification
payload = json.dumps({
            "dag_name": "my_sample_dbt_project_dag"
        })

Incorporate DAG components

After setting the variables, you can now incorporate the following components to complete the DAG.

Secrets Manager

The DAG retrieves dbt user credentials from Secrets Manager:

sm_client = boto3.client('secretsmanager')

def get_secret(secret_name):
    try:
        get_secret_value_response = sm_client.get_secret_value(SecretId=secret_name)
        return json.loads(get_secret_value_response["SecretString"])
    except Exception as e:
        raise

secret_value = get_secret(secret_name)
username = secret_value["username"]
password = secret_value["password"]

Redshift connection configuration

It uses RedshiftUserPasswordProfileMapping to authenticate:

profile_config = ProfileConfig(
    profile_name="redshift",
    target_name=target_database,
    profile_mapping=RedshiftUserPasswordProfileMapping(
        conn_id=redshift_connection_id,
        profile_args={"schema": target_schema,
                      "user": username, "password": password}
    ),
)

dbt execution setup

This code contains the following variables:

dbt executable path – Uses a virtual environment
dbt project path – Is located in the environment variable DBT_PROJECT_PATH under your project

execution_config = ExecutionConfig(
    dbt_executable_path=f"{os.environ['DBT_VENV_PATH']}/bin/dbt",
    execution_mode=ExecutionMode.VIRTUALENV,
)

project_config = ProjectConfig(
    dbt_project_path=f"{os.environ['DBT_PROJECT_PATH']}/{project_name}",
)

Tasks and execution flow

This step includes the following components:

Audit dbt task group (audit_dbt_task) – Runs the dbt model tagged with audit
dbt task group (transform_data) – Runs the dbt models tagged with operations, excluding the audit model

In dbt, tags are labels that you can assign to models, tests, seeds, and other dbt resources to organize and selectively run subsets of your dbt project. In your render_config, you have exclude=["tag:audit"]. This means dbt will exclude models that have the tag audit, because the audit model runs separately.

Failure check (dbt_check) – Checks for dbt model failures, raises an AirflowException if upstream dbt tasks fail
SNS notification on failure (sns_notification_for_failure) – Invokes a Lambda function to send an SNS notification upon a dbt task failure (for example, a dbt model in the task group)

def check_dbt_failures(**kwargs):
    if kwargs['ti'].state == 'failed':
        raise AirflowException('Failure in dbt task group')

with DAG(
    dag_id="my_sample_dbt_project_dag",
    start_date=datetime(2025, 4, 2),
    schedule_interval="@daily",
    catchup=False,
    tags=["dbt"]
):

    audit_dbt_task = DbtTaskGroup(
        group_id="audit_dbt_task",
        execution_config=execution_config,
        profile_config=profile_config,
        project_config=project_config,
        operator_args={
            "install_deps": True,
        },
        render_config= RenderConfig(
            select=["tag:audit"],
            load_method=LoadMode.DBT_LS
        )
    )

    transform_data = DbtTaskGroup(
        group_id="transform_data",
        execution_config=execution_config,
        profile_config=profile_config,
        project_config=project_config,
        operator_args={
            "install_deps": True,
            # install necessary dependencies before running dbt command
        },
        render_config= RenderConfig(
            exclude=["tag:audit"],
            load_method=LoadMode.DBT_LS
        )
    )

    dbt_check = PythonOperator(
        task_id='dbt_check', 
        python_callable=check_dbt_failures,
        provide_context=True,
    )

    sns_notification_for_failure = LambdaInvokeFunctionOperator(
        task_id="sns_notification_for_failure",
        function_name=sns_lambda_function_name,
        payload=payload,
        trigger_rule='one_failed'
    )

    audit_dbt_task >> transform_data >> dbt_check >> sns_notification_for_failure

The sample dbt orchestrates a dbt workflow in Amazon Redshift, starting with an audit task and followed by a task group that processes data models. It includes a failure handling mechanism that checks for failures and raises an exception to trigger an SNS notification using Lambda if a failure occurs. If no failures are detected, the SNS notification task is skipped.

Clean up

If you no longer need the resources you created, delete them to avoid additional charges. This includes the following:

Amazon MWAA environment
S3 bucket
IAM role
Redshift cluster or serverless workgroup
Secrets Manager secret
SNS topic
Lambda function

Conclusion

By integrating dbt with Amazon Redshift and orchestrating workflows using Amazon MWAA and the Cosmos library, you can simplify data transformation workflows while maintaining robust engineering practices. The sample dbt project structure, combined with automated deployments through GitHub Actions and proactive monitoring using Amazon SNS, provides a foundation for building reliable data pipelines. The addition of audit logging facilitates transparency across your transformations, so teams can maintain high data quality standards.

You can use this solution as a starting point for your own dbt implementation on Amazon MWAA. The approach we outlined emphasizes SQL-based transformations while incorporating essential operational capabilities like deployment automation and failure alerting. Get started by adapting the configuration to your environment, and build upon these practices as your data needs evolve.

For more resources, refer to Manage data transformations with dbt in Amazon Redshift and Redshift setup.

About the authors

Cindy Li is an Associate Cloud Architect at AWS Professional Services, specialising in Data Analytics. Cindy works with customers to design and implement scalable data analytics solutions on AWS. When Cindy is not diving into tech, you can find her out on walks with her playful toy poodle Mocha.

Joao Palma is a Senior Data Architect at Amazon Web Services, where he partners with enterprise customers to design and implement comprehensive data platform solutions. He specializes in helping organizations transform their data into strategic business assets and enabling data-driven decision making.

Harshana Nanayakkara is a Delivery Consultant at AWS Professional Services, where he helps customers tackle complex business challenges using AWS Cloud technology. He specializes in data and analytics, data governance, and AI/ML implementations.

The Amazon SageMaker lakehouse architecture now automates optimization configuration of Apache Iceberg tables on Amazon S3

2025-08-09 Tomohiro Tanaka

Post Syndicated from Tomohiro Tanaka original https://aws.amazon.com/blogs/big-data/the-amazon-sagemaker-lakehouse-architecture-now-automates-optimization-configuration-of-apache-iceberg-tables-on-amazon-s3/

As organizations increasingly adopt Apache Iceberg tables for their data lake architectures on Amazon Web Services (AWS), maintaining these tables becomes crucial for long-term success. Without proper maintenance, Iceberg tables can face several challenges: degraded query performance, unnecessary retention of old data that should be removed, and a decline in storage cost efficiency. These issues can significantly impact the effectiveness and economics of your data lake. Regular table maintenance operations help ensure your Iceberg tables remain high performing, compliant with data retention policies, and cost-effective for production workloads. To help you manage your Iceberg tables at scale, AWS Glue automated those Iceberg table maintenance operations: compaction with sort and z-order and snapshots expiration and orphan data management. After the launch of the feature, many customers have enabled automated table optimization through AWS Glue Data Catalog to reduce operational burden.

The Amazon SageMaker lakehouse architecture now automates optimization of Iceberg tables stored in Amazon S3 with catalog-level configuration, optimizing storage in your Iceberg tables and improving query performance. Previously, optimizing Iceberg tables in AWS Glue Data Catalog required updating configurations for each table individually. Now, you can enable automatic optimization for new Iceberg tables with one-time Data Catalog configuration. Once enabled, for any new table or updated table, Data Catalog continuously optimizes tables by compacting small files, removing snapshots, and unreferenced files that are no longer needed.

This post demonstrates an end-to-end flow to enable catalog level table optimization setting.

Prerequisites

The following prerequisites are required to use the new catalog-level table optimizations:

An active AWS account.
A data lake administrator to configure the table optimizations at the catalog level. To create the data lake administrator, refer to Set up AWS Lake Formation.
An AWS Identity and Access Management (IAM) role for the table optimizations to access Iceberg tables. For the instructions, refer to Catalog level table optimization prerequisites.

Enable table optimizations at the catalog level

The data lake administrator can enable the catalog-level table optimization on the AWS Lake Formation console. Complete the following steps:

On the AWS Lake Formation console, choose Catalogs in the navigation pane.
Select the catalog to be enabled with catalog-level table optimizations.
Choose Table optimizations tab, and choose Edit in Table optimizations, as shown in the following screenshot.

In Optimization options, select Compaction, Snapshot retention, and Orphan file deletion, as shown in the following screenshot.

enable-optimizations

Select an IAM role. Refer to Table optimization prerequisites for permissions.
Choose Grant required permissions.
Choose I acknowledge that expired data will be deleted as part of the optimizers.

After you enable the table optimizations at the catalog level, the configuration is displayed on the AWS Lake Formation console, as shown in the following screenshot.

optimizations-configuration

When you select an Iceberg table registered in the catalog, you can confirm that the table optimizations configuration is inherited from the table view because Configuration source shows catalog, as shown in the following screenshot.

catalog-level-optimizations

The table optimizations history is displayed on the table view. The following result shows one of the compaction runs by the table optimizations.

binpack-compaction-result

The catalog-level table optimizations for all databases and Iceberg tables are now enabled.

Customize setting of table optimizations at both the catalog and table-level

Although the catalog-level optimization applies common settings across all databases and Iceberg tables in your catalog, you might want to apply different strategies for specific Iceberg tables. You can use AWS Glue Data Catalog to enable both catalog-level and table-level optimizations based on specific table characteristics and access patterns. For example, in addition to configuring the catalog-level compaction with the bin-pack strategy for general-purpose Iceberg tables, you can apply the sort strategy at the table-level to tables with frequent range queries on timestamp columns.

This section shows configuring catalog-level and table-specific optimizations through a practical scenario. Imagine a real-time analytics table with frequent write operations that generates more orphan files due to constant metadata updates. Users also run selective queries filtering specific columns, which makes sort-order strategy preferable. Complete the following steps:

Select another Iceberg table in the same catalog as before to configure the table-level optimizations on the AWS Lake Formation console. At this point, the catalog-level table optimizations are configured for this table.
Choose Edit in Optimization configuration, as shown in the following screenshot.

In Optimization options, choose Compaction, Snapshot retention, and Orphan file deletion.
In Optimization configuration, choose Customize settings.
Select the same IAM role.
In Compaction configuration, select Sort, as shown in the following screenshot. Also configure 80 files to Minimum input files, which is a threshold of the number of files to trigger the compaction. To configure Sort, a sort order needs to be defined in your Iceberg table. You can define the sort order with Spark SQL such as ALTER TABLE db.tbl WRITE ORDERED BY <columns>.

sort-config

In Snapshot retention configuration and Snapshot deletion run rate, select Specify a custom value in hours. Then, configure 12 hours to the interval between two deletion job runs, as shown in the following screenshot.

snapshot-retention

In Orphan file deletion configuration, configure 1 day to Files under the provided Table Location with a creation time older than this number of days will be deleted if they are no longer referenced by the Apache Iceberg Table metadata.

orphan-deletion

Choose Grant required permissions.
Choose I acknowledge that expired data will be deleted as part of the optimizers.
Choose Save.
The Table optimization tab on the AWS Lake Formation console displays the custom setting of table optimizers. In Compaction, Compaction strategy is configured to sort and Minimum input files is also configured to 80 files. In Snapshot retention, Snapshot deletion run rate is configured to 12 hours. In Orphan file deletion, Orphan files will be deleted after is configured to 1 days, as shown in the following screenshot.

new-table-level-optimizations

The compaction history shows sort as its table-level compaction strategy even if the strategy in the catalog-level is configured to binpack, as shown in the following screenshot.

sort-compaction-result

In this scenario, the table-specific optimizations are configured along with the catalog-level optimizations. Combining the table and catalog-level optimizations means you can more flexibly manage your Iceberg table data deletions and compactions.

Conclusion

In this post, we demonstrated how to enable and manage using Amazon SageMaker lakehouse architecture with AWS Glue Data Catalog’s catalog-level table optimization feature for Iceberg tables. This enhancement significantly simplifies the management of Iceberg tables because you can enable automated maintenance operations across all tables with a single setting. Instead of configuring optimization settings for individual tables, you can now maintain your entire data lake more efficiently, reducing operational overhead while ensuring consistent optimization policies. We recommend enabling catalog-level table optimization to help you maintain a well-organized, high-performing, and cost-effective data lake while freeing up your teams to focus on deriving value from your data.

Try out this feature for your own use case and share your feedback and questions in the comments. To learn more about AWS Glue Data Catalog table optimizer, visit Optimizing Iceberg tables.

Acknowledgment: A special thanks to everyone who contributed to the development and launch of catalog level optimization: Siddharth Padmanabhan Ramanarayanan, Dhrithi Chidananda, Noella Jiang, Sangeet Lohariwala, Shyam Rathi, Anuj Jigneshkumar Vakil, and Jeremy Song.

About the authors

Tomohiro Tanaka is a Senior Cloud Support Engineer at Amazon Web Services (AWS). He’s passionate about helping customers use Apache Iceberg for their data lakes on AWS. In his free time, he enjoys a coffee break with his colleagues and making coffee at home.

Noritaka Sekiyama is a Principal Big Data Architect with AWS Analytics services. He’s responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Sandeep Adwankar is a Senior Product Manager at Amazon Web Services (AWS). Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products customers can use to improve how they manage, secure, and access data.

Siddharth Padmanabhan Ramanarayanan is a Senior Software Engineer on the AWS Glue and AWS Lake Formation team, where he focuses on building scalable distributed systems for data analytics workloads. He is passionate about helping customers optimize their cloud infrastructure for performance and cost efficiency.

Create an OpenSearch dashboard with Amazon OpenSearch Service

2025-08-05 Smita Singh

Post Syndicated from Smita Singh original https://aws.amazon.com/blogs/big-data/create-an-opensearch-dashboard-with-amazon-opensearch-service/

Effective log analysis is essential for maintaining the health and performance of modern applications. Amazon OpenSearch Service stands out as a powerful, fully managed solution for log analytics and observability. With its advanced indexing, full-text search, and real-time analytics capabilities, OpenSearch Service makes it possible for organizations to seamlessly ingest, process, and search log data across diverse sources—including AWS services like Amazon CloudWatch, VPC Flow Logs, and more.

With OpenSearch Dashboards, you can turn indexed log data into actionable visualizations that reveal insights and help detect anomalies. By querying data stored in OpenSearch Service, you can extract relevant information and display it using a variety of visualization types—such as line charts, bar graphs, pie charts, heatmaps, and more. These tools make it effortless to monitor system behavior, spot trends, and quickly identify issues in your environment.

This post demonstrates how to harness OpenSearch Dashboards to analyze logs visually and interactively. With this solution, IT administrators, developers, and DevOps engineers can create custom dashboards to monitor system behavior, detect anomalies early, and troubleshoot issues faster through interactive charts and graphs.

Solution overview

In this post, we show how to create an index pattern in OpenSearch Dashboards, create two types of visualizations, and display these visualizations on a custom dashboard. We also demonstrate how to export and import visualizations.

Prerequisites

Before diving into log analysis with OpenSearch Dashboards, you must have the following:

A properly configured OpenSearch Service domain
A working log collection and ingestion pipeline

Amazon OpenSearch Service 101: Create your first search application with OpenSearch guides you through setting up your OpenSearch Service domain and configuring the log ingestion pipeline.

For this post, we work with the following log sources, which have already been ingested into an OpenSearch Service cluster as part of the prerequisite steps:

Access OpenSearch Dashboards

Complete the following steps to access OpenSearch Dashboards:

On the OpenSearch Service console, choose Domains in the navigation pane.
Check if your domain status shows as Active.
Choose your domain to open the domain details page.
Choose the OpenSearch Dashboards URL to open it in a new browser window.

Authenticate into OpenSearch Dashboards using one of the supported methods.

Create an index pattern

After you’re logged in to OpenSearch Dashboards, you must create an index pattern. An index pattern allows OpenSearch Dashboards to locate indexes to search. Complete the following steps

In OpenSearch Dashboards, expand the navigation pane and choose Dashboard Management under Management.
Choose Index patterns in the navigation pane.

Choose Create index pattern.
For Index pattern name, enter a name (for example, log-aws-cloudtrail-*).
Choose Next step.

For Time field¸ choose @timestamp.
Choose Create index pattern.

Create visualizations

Now that the index pattern is created, let’s create some visualizations. For this post, we create a pie chart and an area graph.

Create a pie chart

Complete the following steps to create a pie chart:

In OpenSearch Dashboards, choose Visualize in the navigation pane.

Choose Create visualization.

Choose Pie as the visualization type.
For Source¸ choose log-aws-cloudtrail-*.

Under Buckets¸ choose Add and Split slices.

For Aggregation, choose Terms.

For Field, choose eventName.
For Size, enter 10.

Leave all other parameters as default and choose Update.
Choose Save to save the visualization.

Sample ndjson file for the pie chart – EventNamePie.ndjson

Please refer Export and import visualizations for how to import the samples.

The following screenshot shows our pie chart, which displays different types of events and their occurrence percentage in the last 30 minutes.

Create an area graph

Complete the following steps to create an area graph:

In OpenSearch Dashboards, choose Visualize in the navigation pane.
Choose Create visualization.
Choose Area as the visualization type.

For Source¸ choose log-aws-cloudtrail-*.

Under Buckets¸ choose Add and X-axis.

For Aggregation, choose Date Histogram.
For Field, choose @timestamp.
Leave all other parameters as default and choose Update

Under Advanced¸ choose Add and Split series.

For Aggregation, choose Terms.
For Field, choose eventName.
For Size, enter 10.
Leave all other parameters as default and choose Update.

Choose Save.
Update the time range to Last 60 minutes.
Choose Refresh and Save.

The following screenshot shows an area graph with different types of events and their occurrence count in the last 60 minutes.

Sample ndjson file for Area chart – EventNameArea.ndjson

Please refer Export and import visualizations for how to import the samples.

Create a dashboard

Now we will combine the visualizations we just created into a dashboard. A dashboard serves as a customizable interface that consolidates multiple visualizations, saved searches, and various content into a comprehensive view of data. Users can combine diverse visual elements—including charts, graphs, metrics, and tables—into a single cohesive display that can be arranged and resized on a flexible grid layout. You can simultaneously apply filters and time ranges across multiple visualizations, creating a coordinated analytical experience. Complete the following steps to create a dashboard:

In OpenSearch Dashboards, choose Dashboards in the navigation pane.
Choose Create new dashboard.

Choose Add on the menu bar.

Search for and choose the visualizations you created.

You can resize panels by dragging their corners to adjust dimensions. To modify the layout arrangement, you can drag the top portion of panels, which allows you to organize them horizontally in a row formation. When working with tabular visualizations, the system provides a convenient option to export your results in CSV format for further analysis or reporting purposes.

Choose Save.
Change the time range to Last 60 minutes.
Choose Refresh and Save.

Sample ndjson file for dashboard – CloudTrailSummary.ndjson

Please refer Export and import visualizations for how to import the samples.

The following screenshot shows the CloudTrail dashboard displaying both visualizations.

Export and import visualizations

In OpenSearch, an NDJSON file is used to import and export saved objects, such as dashboards, visualizations, maps, and index template. The NDJSON file provides a streamlined approach for handling large datasets by representing each JSON object on a separate line. This format enables efficient import/export operations, simplified data migration between environments, and seamless sharing of complex dashboard configurations. Organizations can back up and restore critical visualizations, saved searches, and dashboard settings while maintaining their integrity. The format’s structure reduces memory overhead during large transfers and improves processing speed for bulk operations. NDJSON’s human-readable nature also facilitates troubleshooting and manual editing when necessary, making it an invaluable tool for maintaining OpenSearch Dashboards deployments across development, testing, and production environments.

Export a visualization

Complete the following steps to export a visualization:

In OpenSearch Dashboards, choose Saved objects in the navigation pane.
Search for and select your object (in this case, a visualization), then choose Export.

The NDJSON file is downloaded in your local host.

Import a visualization

Complete the following steps to import a visualization:

In OpenSearch Dashboards, choose Saved objects in the navigation pane.
Choose Import.
Choose the first NDJSON file to be imported from your local host.
Select Create new objects with random IDs.
Choose Import.

Choose Done.

Choose Import.

You can now open the imported object.

The following screenshot shows our updated dashboard.

Clean up

To clean up your resources, delete the OpenSearch Service domain and relevant information stored or visualizations created on the domain. You will not be able to recover the data after you delete it.

On the OpenSearch Service console, choose Domains in the navigation pane.
Select the domain you created and choose Delete.

Conclusion

OpenSearch Dashboards is a powerful tool for transforming raw log data into actionable visualizations that drive insights and decision-making. In this post, we’ve shown how to create visualizations like pie charts and area graphs, build comprehensive dashboards, and efficiently export and import your work using NDJSON files. By using the fully managed OpenSearch Service features, organizations can focus on extracting valuable insights rather than managing infrastructure, ultimately enhancing their observability posture and operational efficiency.

To further enhance your OpenSearch proficiency, consider exploring advanced visualization options such as heat maps, gauge charts, and geographic maps that can represent your data in more specialized ways. Implementing automated alerting based on predefined thresholds will help you proactively identify anomalies before they become critical issues. You can also use OpenSearch’s powerful machine learning capabilities for sophisticated anomaly detection and predictive analytics to gain deeper insights from your log data. As your implementation grows, customizing security settings with fine-grained access controls will provide appropriate data visibility across different teams in your organization.

For comprehensive learning resources, refer to the Amazon OpenSearch Service Developer Guide, watch Create your first OpenSearch Dashboard on YouTube, explore best practices in Amazon OpenSearch blog posts, and gain hands-on experience through workshops available in AWS Workshops.

About the Authors

Smita Singh is a Senior Solutions Architect at AWS. She focuses on defining technical strategic vision and works on architecture, design, and implementation of modern, scalable platforms for large-scale global enterprises and SaaS providers. She is a data, analytics, and generative AI enthusiast and is passionate about building innovative, highly scalable, resilient, fault-tolerant, self-healing, multi-tenant platform solutions and accelerators.

Dipayan Sarkar is a Specialist Solutions Architect for Analytics at AWS, where he helps customers modernize their data platform using AWS analytics services. He works with customers to design and build analytics solutions, enabling businesses to make data-driven decisions.

Develop and deploy a generative AI application using Amazon SageMaker Unified Studio

2025-08-04 Amit Maindola

Post Syndicated from Amit Maindola original https://aws.amazon.com/blogs/big-data/develop-and-deploy-a-generative-ai-application-using-amazon-sagemaker-unified-studio/

Picture this: You’re a financial analyst starting your Monday morning with a steaming cup of coffee, ready to review your investment portfolio. But instead of manually scouring dozens of news websites, financial reports, and industry analyses, you simply ask your AI assistant: “What global events happened over the weekend that might impact my technology stock holdings?” Within seconds, you receive a comprehensive analysis of relevant news, sentiment scores, and potential investment implications—all powered by a sophisticated generative AI application you built yourself.

This scenario isn’t science fiction; it’s the reality that modern financial professionals can create today. In an era where information moves at the speed of light and industry conditions can shift dramatically overnight, staying informed isn’t just an advantage—it’s essential for survival in competitive financial landscapes. The challenge lies in processing the overwhelming volume of global information that could impact investments while distinguishing reliable insights from noise.

Amazon SageMaker – Develop and scale AI use cases with the broadest set of tools

Luckily for us, technology is making this more straightforward. The next generation of Amazon SageMaker with Amazon SageMaker Unified Studio is a single data and AI development environment where you can find and access the data in your organization and act on it using the best tools across different use cases. SageMaker Unified Studio brings together the functionality and tools from existing AWS analytics and artificial intelligence and machine learning (AI/ML) services, including Amazon EMR , AWS Glue, Amazon Athena, Amazon Redshift , Amazon Bedrock, and Amazon SageMaker AI. From within SageMaker Unified Studio, you can ﬁnd, access, and query data and AI assets across your organization, then work together in projects to securely build and share analytics and AI artifacts, including data, models, and generative AI applications.

With SageMaker Unified Studio, you can efficiently build generative AI applications in a trusted and secure environment using Amazon Bedrock. You can choose from a selection of high-performing foundation models (FMs) and advanced customization capabilities like Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, Amazon Bedrock Agents, and Amazon Bedrock Flows. You can rapidly tailor and deploy generative AI applications and share with the built-in catalog for discovery.

What makes SageMaker Unified Studio particularly powerful for organizations is its integration with Amazon Bedrock Flows to build generative AI workflows, which is changing how organizations think about AI application development.

Amazon Bedrock Flows for generative AI application development

With Amazon Bedrock Flows, you can build and execute complex generative AI workflows without writing code, using an intuitive visual interface that democratizes AI development. This capability is transformative for organizations where speed, accuracy, and adaptability are paramount. It offers the following benefits:

Visual workflow development – Users can design AI applications by dragging and dropping components onto a canvas, making AI logic transparent and modifiable
Business logic flexibility – The service supports complex business logic through conditional branching, multi-path decision trees, and dynamic routing
Democratizing AI development – Business experts can directly contribute to AI application development without requiring extensive technical expertise
Seamless integration – Amazon Bedrock Flows integrates with FMs, knowledge bases, guardrails, and other AWS services
Reduced development complexity – The service handles infrastructure management and scaling through serverless execution and SDK APIs

Solution overview

In this post, we explore a financial use case, in which we want to stay on top of latest global events and determine our investment or financial exposure based on this. We can use a SageMaker Unified Studio flow application to pull in latest news summaries, derive sentiment based on news summary, and determine their effects on my investments. The following diagram illustrates this use case.

In the following sections, we show how to create a new project and build a flow application using a generative AI profile in SageMaker Unified Studio.

Prerequisites

For this walkthrough, you must have the following prerequisites:

An AWS account – If you don’t have an account, you can create one.
A SageMaker Unified Studio domain – For instructions, refer to Create an Amazon SageMaker Unified Studio domain – quick setup, and add a single sign-on (SSO) user for logging and set two-factor authentication. Finally, on the Connections tab, you can create a connection to a code repository.

A demo project – Create a demo project in your SageMaker Unified Studio domain. For instructions, see Create a project. For this example, we choose All capabilities in the project profile section, which includes the generative AI project profile enabled.

Create new project and build a flow application in SageMaker Unified Studio

In this section, we create a new a flow application that uses an Amazon Bedrock knowledge base to provide information about your personal portfolio. Complete the following steps:

In SageMaker Unified Studio, open the project you created as a prerequisite and choose Build and then Flow.

Drag Knowledge Base from Nodes to the design panel to add a knowledge base that will include the user’s investment portfolio and news articles and other information like earnings call transcripts, financial analyst reports, and so on.

Choose the Knowledge Base node and configure the knowledge base as follows:
Add a name for your knowledge base name (for example, portfolio…).
Choose the model (for example, Claude 3.5 Haiku).

Choose Create new Knowledge Base.
Enter a name for the knowledge base.
Select Project data source.
For Select a data source, choose the Amazon Simple Storage Service (Amazon S3) bucket location where you uploaded your data.
Choose Create.

The knowledge base creation process takes a few minutes to complete.

When the knowledge base is ready, choose Save to save it to the flow.

Choose My components, and on the options menu (three vertical dots), choose Sync to sync the knowledge base.

Make sure the S3 bucket has all the data (user portfolio data and latest news information data) before syncing the knowledge base.

We don’t provide any financial or news information data as part of this post. Upload current events or news data and investment portfolio data from your own data sources.

Test the flow application

After the knowledge base sync is complete, you can return to the flow application and ask questions. Using SageMaker Unified Studio flows, a financial analyst can provide a more personalized and customized financial outlook to their customers using rich internal financial information on their customer’s investment portfolio and latest publicly available current events and news information. The following are some example questions that you can ask to test the knowledge base:

Check if Tesla or Apple is in any of user's investment portfolio

Please check latest news information to provide information if Tesla has positive, negative or neutral outlook in the near future

Flow-based applications offer a visual approach to creating complex AI workflows. By chaining different nodes, each optimized for specific functions, you can create sophisticated solutions that are more reliable, maintainable, and efficient than single-prompt approaches. These flows allow for conditional logic and branching paths, mimicking human decision-making processes and enabling more nuanced responses based on context and intermediate results.

Clean up

To avoid ongoing charges in your AWS account, delete the resources you created during this tutorial:

Delete the project.
Delete the domain created as part of the prerequisites.

Conclusion

In this post, we demonstrated how to use Amazon Bedrock Flows in SageMaker Unified Studio to build a sophisticated generative AI application for financial analysis and investment decision-making without extensive coding knowledge. With this integration, you can create sophisticated financial analysis workflows through an intuitive visual interface, where you can process industry data, analyze news sentiment, and assess investment implications in real time. The solution integrates seamlessly with AWS services and FMs while providing essential features like automatic scaling, compliance controls, and audit capabilities. The implementation process involves setting up a SageMaker Unified Studio domain, configuring knowledge bases with portfolio and news data, and creating visual workflows that can analyze complex financial information. This democratized approach to AI development allows both technical and business teams to collaborate effectively, significantly reducing development time while maintaining the sophisticated capabilities needed for modern financial analysis.

To get started, explore the SageMaker Unified Studio documentation, set up a project in your AWS environment, and discover how this solution can transform your organization’s data analytics capabilities.

About the authors

Amit Maindola is a Senior Data Architect focused on data engineering, analytics, and AI/ML at Amazon Web Services. He helps customers in their digital transformation journey and enables them to build highly scalable, robust, and secure cloud-based analytical solutions on AWS to gain timely insights and make critical business decisions.

Arghya Banerjee is a Sr. Solutions Architect at AWS in the San Francisco Bay Area, focused on helping customers adopt and use the AWS Cloud. He is focused on big data, data lakes, streaming and batch analytics services, and generative AI technologies.

Melody Yang is a Principal Analytics Architect for Amazon EMR at AWS. She is an experienced analytics leader working with AWS customers to provide best practice guidance and technical advice in order to assist their success in data transformation. Her areas of interests are open-source frameworks and automation, data engineering and DataOps.

Gaurav Parekh is a Solutions Architect at AWS, specializing in generative AI and data analytics, with extensive experience building production AI systems on AWS.

Near real-time streaming analytics on protobuf with Amazon Redshift

2025-08-04 Konstantinos Tzouvanas

Post Syndicated from Konstantinos Tzouvanas original https://aws.amazon.com/blogs/big-data/near-real-time-streaming-analytics-on-protobuf-with-amazon-redshift/

Organizations must often deal with a vast array of data formats and sources in their data analytics workloads. This range of data types, such as structured relational data, semi-structured formats like JSON and XML and even binary formats like Protobuf and Avro, has presented new challenges for companies looking to extract valuable insights.

Protocol Buffers (protobuf) has gained significant traction in industries that require efficient data serialization and transmission, particularly in streaming data scenarios. Protobuf’s compact binary representation, language-agnostic nature, and strong typing make it an attractive choice for companies in sectors such as finance, gaming, telecommunications, and ecommerce, where high-throughput and low-latency data processing is crucial.

Although protobuf offers advantages in efficient data serialization and transmission, its binary nature poses challenges when it comes to analytics use cases. Unlike formats like JSON or XML, which can be directly queried and analyzed, protobuf data requires an additional deserialization step to convert it from its compact binary format into a structure suitable for processing and analysis. This extra conversion step introduces complexity into data analytics pipelines and tools. It can potentially slow down data exploration and analysis, especially in scenarios where near real-time insights are crucial.

In this post, we explore an end-to-end analytics workload for streaming protobuf data, by showcasing how to handle these data streams with Amazon Redshift Streaming Ingestion, deserializing and processing them using AWS Lambda functions, so that the incoming streams are immediately available for querying and analytical processing on Amazon Redshift.

The solution provides a solid foundation for handling protobuf data in Amazon Redshift. You can further enhance the architecture to support schema evolution by incorporating AWS Glue Schema Registry. By integrating the AWS Glue Schema Registry, you can make sure your Lambda function uses the latest schema version for deserialization, even as your data structure changes over time. However, for the purpose of this post and to maintain simplicity, we focus on demonstrating how to invoke Lambda from Amazon Redshift to convert protobuf messages to JSON format, which serves as a solid foundation for handling binary data in near real-time analytics scenarios.

Solution overview

The following architecture diagram describes the AWS services and features needed to set up a fully functional protobuf streaming ingestion pipeline for near real-time analytics.

Protobuf deserialization flow

The workflow consists of the following steps:

An Amazon Elastic Compute Cloud (Amazon EC2) event producer generates events and forwards them to a message queue. The events are created and serialized using protobuf.
A message queue using Amazon Managed Streaming for Apache Kafka (Amazon MSK) or Amazon Kinesis accepts the protobuf messages sent by the event producer. For this post, we use Amazon MSK Serverless.
A Redshift cluster (provisioned or serverless), in which a materialized view with an external schema is configured, points to the message queue. For this post, we use Amazon Redshift Serverless.
A Lambda protobuf deserialization function is triggered by Amazon Redshift during ingestion and deserializes protobuf data into JSON data.

Schema

To showcase protobuf’s deserialization functionality, we use a sample protobuf schema that represents a financial trade transaction. This schema will be used across the AWS services mentioned in this post.

// trade.proto 
syntax = "proto3"; 
message Trade{
   int32 userId = 1;   
   string userName = 2;   
   int32 volume = 3;   
   int32 pair = 4;   
   int32 action = 5;
   string TimeStamp = 6;
}

Amazon Redshift materialized view

In order for Amazon Redshift to ingest streaming data from Amazon MSK or Kinesis, an appropriate role needs to be assigned to Amazon Redshift and a materialized view needs to be properly defined. For detailed instructions on how to accomplish this, refer to Streaming ingestion to a materialized view or Simplify data streaming ingestion for analytics using Amazon MSK and Amazon Redshift.

In this section, we focus on the materialized view definition that makes it possible to deserialize protobuf data. Our example focuses on streaming ingestion from Amazon MSK. Typically, the materialized view ingests the Kafka metadata fields and the actual data (kafka_value) like in the following example:

CREATE MATERIALIZED VIEW trade_events AUTO REFRESH YES AS 
SELECT
     kafka_partition,     
     kafka_offset,
     kafka_timestamp_type,
     kafka_timestamp,
     kafka_key,
     JSON_PARSE(kafka_value) as Data,
     kafka_headers 
FROM     
     "dev"."msk_external_schema"."entity" 
WHERE     
     CAN_JSON_PARSE(kafka_value)

When the incoming kafka_value is of type JSON, you can apply the built-in JSON_PARSE function and create a column of type SUPER so you can directly query the data.

Amazon Redshift Lambda user-defined function

In our case, accepting protobuf encoded data requires some additional steps. The first step is to create an Amazon Redshift Lambda user-defined function (UDF). This Amazon Redshift function is the link to a Lambda function that executes the actual deserialization. This way, when data is ingested, Amazon Redshift calls the Lambda function for deserialization.

Creating or updating our Amazon Redshift Lambda UDF is straightforward, as illustrated in the following code. Additional examples are available in the GitHub repo.

CREATE OR REPLACE EXTERNAL FUNCTION f_deserialize_protobuf(VARCHAR(MAX)) 
RETURNS VARCHAR(MAX) IMMUTABLE 
LAMBDA 'f-redshift-deserialize-protobuf' IAM_ROLE ':RedshiftRole';

Because Lambda functions don’t (at the time of writing) accept binary data as input, you must first convert incoming binary data to its hex representation, prior to calling the function. You can do this by using the TO_HEX Amazon Redshift function.
Considering the hex conversation and with the Lambda UDF available, you can now use it in your materialized view definition:

CREATE MATERIALIZED VIEW trade_events AUTO REFRESH YES AS 
SELECT      
       kafka_partition,
       kafka_offset,
       kafka_timestamp_type,
       kafka_timestamp,
       kafka_key,
       kafka_value,
       kafka_headers,
       JSON_PARSE(f_deserialize_protobuf(to_hex(kafka_value)))::super as json_data 
FROM      
       "dev"."msk_external_schema"."entity";

Lambda layer

Lambda functions require access to appropriate protobuf libraries, so that deserialization can take place. You can implement this through a Lambda layer. The layer is provided as a zip file, respecting the following folder structure, and contains the protobuf library, its dependencies, and user-provided code inside the custom folder, which includes the protobuf generated classes:

python
      custom
      google
      Protobuf-4.25.2.dist-info

Because we implemented the Lambda functions in Python, the root folder of the zip file is the python folder. For additional languages, refer to the documentation on how to properly structure your folder structure.

Lambda function

A Lambda function converts incoming protobuf records to JSON records. As a first step, you must import your custom classes from the lambda Layer custom folder:

# Import generated protobuf classes
 from custom import trade_pb2

You can now deserialize incoming hex encoded binary data to objects. This is implemented in a two-step process. The first step is to decode the hex encoded binary data:

 # convert incoming hex data to binary  
binary_data = bytes.fromhex(record)

Next, you instantiate the protobuf defined classes and execute the actual deserialization process using the protobuf library method ParseFromString:

 # Instantiate class  
trade_event = trade_pb2.Trade()
               
# Deserialize into class  
trade_event.ParseFromString(binary_data)

After you run deserialization and instantiate your objects, you can convert to other formats. In our case, we serialize into JSON format, so that Amazon Redshift ingests the JSON content in a single field of type SUPER:

# Serialize into json  
elems = trade_event.ListFields() 
fields = {} 
for elem in elems:     
       fields[elem[0].name] = elem[1] 
json_elem = json.dumps(fields)

Combining these steps together, the Lambda function should look as follows:

import json

# Import the generated protobuf classes
from custom import trade_pb2  

def lambda_handler(event, context):
    
    results = []
    
    recordSets = event['arguments']
    for recordSet in recordSets:
        for record in recordSet:

            # convert incoming hex data to binary data
            binary_data = bytes.fromhex(record)
            
            # Instantiate class
            trade_event = trade_pb2.Trade()
            
            # Deserialize into class
            trade_event.ParseFromString(binary_data)
            
            # Serialize into json 
            elems = trade_event.ListFields()
            fields = {}
            for elem in elems:
                fields[elem[0].name] = elem[1]
            json_elem = json.dumps(fields)

            # Append to results            
            results.append(json_elem)
    
    print('OK')
    
    return json.dumps({"success": True,"num_records": len(results),"results": results})

Batch mode

In the preceding code sample, Amazon Redshift is calling our function in batch mode, meaning that a number of records are sent during a single Lambda function call. More specifically, Amazon Redshift is batching records into the arguments property of the request. Therefore, you must loop through the incoming array of data and apply your deserialization logic per record. At the time of writing, this behavior is internal to Amazon Redshift and can’t be configured or controlled through a configuration option. An Amazon Redshift streaming consumer client will read new records on the message queue since the last time it read. The following is a sample of the payload the Lambda handler function receives:

     "user": "IAMR:Admin",
     "cluster": "arn:aws:redshift:*********************************",
     "database": "dev",
     "external_function": "fn_lambda_protobuf_to_json",
     "query_id": 5583858,
     "request_id": "17955ee8-4637-42e6-897c-5f4881db1df5",     
     "arguments": [         
         [             
"088a1112087374723a3231383618c806200128093217323032342d30332d32302031303a34363a33382e363932"         ],         [             "08a74312087374723a3836313518f83c200728093217323032342d30332d32302031303a34363a33382e393031"         ],         [             "08b01e12087374723a3338383818f73d20f8ffffffffffffffff0128053217323032342d30332d32302031303a34363a33392e303134"         
]     
],     
      "num_records":3 
}

Insights from ingested data

With your data stored in Amazon Redshift after the deserialization process, you can now execute queries against your streaming data and directly gain insights. In this section, we present some sample queries to illustrate functionality and behavior.

Examine lag query

To examine the difference between the most recent timestamp value of our streaming source vs. the current date/time (wall clock), we calculate the most recent point in time at which we ingested data. Because streaming data is expected to flow into the system continuously, this metric also reveals the ingestion lag between our streaming source and Amazon Redshift.

select top 1      
      (GETDATE() - kafka_timestamp) as ingestion_lag 
from     
      trade_events 
order by
      kafka_timestamp desc

Examine content query: Fraud detection on an incoming stream

By applying the query functionality available in Amazon Redshift, we can discover behavior hidden in our data in real time. With the following query, we try to match opposite trade volumes played by different users during the last 5 minutes that result in a zero sum game and could support a potential fraud detection concept:

select  
json_data.volume, 
LISTAGG(json_data.userid::int, ', ') as users, 
LISTAGG(json_data.pair::int, ', ') as pairs 
from     
        trade_events 
where      
        trade_events.kafka_timestamp >= DATEADD(minute, -5, GETDATE()) 
group by      
        json_data.volume 
having      
        sum(json_data.pair) = 0  
and min(abs(json_data.pair)) = max(abs(json_data.pair)) 
and count(json_data.pair) > 1

This query is a rudimentary example of how we can use live data to protect systems from fraudsters.

For a more comprehensive example, see Near-real-time fraud detection using Amazon Redshift Streaming Ingestion with Amazon Kinesis Data Streams and Amazon Redshift ML. In this use case, an Amazon Redshift ML model for anomaly detection is trained using the incoming Amazon Kinesis Data Streams data that is streamed into Amazon Redshift. After sufficient training (for example, 90% accuracy for the model is achieved), the trained model is put into inference mode for inferencing decisions on the same incoming credit card data.

Examine content query: Join with non-streaming data

Having our protobuf records streaming in Amazon Redshift makes it possible to join streaming with non-streaming data. A typical example is combining incoming trades with user information data already recorded in the system. In the following query, we join the incoming stream of trades with user information, like email, to get a list of possible alerts targets:

select   
    user_info.email 
from     
    trade_events
inner join    
    user_info 
on user_info.userId = trade_events.json_data.userid 
where     
    trade_events.json_data.volume > 1000 
and trade_events.kafka_timestamp >= DATEADD(minute, -5, GETDATE())

Conclusion

The ability to effectively analyze and derive insights from data streams, regardless of their format, is crucial for data analytics. Although protobuf offers compelling advantages for efficient data serialization and transmission, its binary nature can pose challenges and perhaps impact performance when it comes to analytics workloads. The solution outlined in this post provides a robust and scalable framework for organizations seeking to gain valuable insights, detect anomalies, and make data-driven decisions with agility, even in scenarios where high-throughput and low-latency processing is crucial. By using Amazon Redshift Streaming Ingestion in conjunction with Lambda functions, organizations can seamlessly ingest, deserialize, and query protobuf data streams, enabling near real-time analysis and insights.

For more information about Amazon Redshift Streaming Ingestion, refer to Streaming ingestion to a materialized view.

About the authors

Konstantinos Tzouvanas is a Senior Enterprise Architect on AWS, specializing in data science and AI/ML. He has extensive experience in optimizing real-time decision-making in High-Frequency Trading (HFT) and applying machine learning to genomics research. Known for leveraging generative AI and advanced analytics, he delivers practical, impactful solutions across industries.

Marios Parthenios is a Senior Solutions Architect working with Small and Medium Businesses across Central and Eastern Europe. He empowers organizations to build and scale their cloud solutions with a particular focus on Data Analytics and Generative AI workloads. He enables businesses to harness the power of data and artificial intelligence to drive innovation and digital transformation.

Pavlos Kaimakis is a Senior Solutions Architect at AWS who helps customers design and implement business-critical solutions. With extensive experience in product development and customer support, he focuses on delivering scalable architectures that drive business value. Outside of work, Pavlos is an avid traveler who enjoys exploring new destinations and cultures.

John Mousa is a Senior Solutions Architect at AWS. He helps power and utilities and healthcare and life sciences customers as part of the regulated industries team in Germany. John has interest in the areas of service integration, microservices architectures, as well as analytics and data lakes. Outside of work, he loves to spend time with his family and play video games.

Implementing Defense-in-Depth Security for AWS CodeBuild Pipelines

2025-08-01 Daniel Begimher

Post Syndicated from Daniel Begimher original https://aws.amazon.com/blogs/security/implementing-defense-security-for-aws-codebuild-pipelines/

Recent security research has highlighted the importance of CI/CD pipeline configurations, as documented in AWS Security Bulletin AWS-2025-016. This post pulls together existing guidance and recommendations into one guide.

Continuous integration and continuous deployment (CI/CD) practices help development teams deliver software efficiently and reliably. AWS CodeBuild provides managed build services that integrate with source code repositories like GitHub, GitLab, and other Source Control Management (SCM) systems. While this guide uses GitHub examples, the security principles and webhook configuration approaches apply to other supported source control systems.

However, certain configurations require careful attention. We strongly recommend that you do not use automatic pull request builds from untrusted repository contributors without proper security controls and a clear understanding of your threat model. This configuration allows untrusted code to execute in your build environment with access to repository credentials and environment variables. Webhook configurations determine which repository events trigger builds and what code gets executed during the build process. Understanding these configurations is essential for maintaining appropriate security boundaries while preserving the automation benefits that make CI/CD valuable.

Security teams and DevOps engineers can use these practical approaches to configure AWS CodeBuild to meet their security goals while maintaining development velocity. We’ll explore webhook configurations, trust boundaries, and implementation strategies that emphasize threat model assessment, least-privilege access, and proactive monitoring of your pipeline configurations.

Security of the pipeline implications

Under the shared responsibility model, while AWS manages the security of the underlying AWS CodeBuild infrastructure, customers are responsible for securing their pipeline configurations, access controls, and the code that runs within their build environments. This shared responsibility is critical when considering the security of the pipeline itself.

When AWS CodeBuild processes pull requests automatically, it builds the code in an environment with access to repository credentials, environment variables, and potentially sensitive information. This creates specific security of the pipeline considerations:

Repository access: AWS CodeBuild projects require repository credentials to read source code and create webhooks. These credentials provide specific permissions that vary based on your configuration.
Build execution: The build process runs the retrieved source code, which may include build scripts, dependency definitions, or test files from pull requests.
Build environment: AWS CodeBuild environments may have access to environment variables, AWS credentials, or other configuration data needed for the build process.

Establishing trust boundaries

Effective security of the pipeline starts with clearly defining trust boundaries for different types of code contributions:

Internal contributors: Team members with repository write access who have been verified through your organization’s access management processes.
External contributors: Contributors from outside your organization who submit pull requests from forked repositories.
Automated processing: Code that runs without manual review as part of the build process.

These trust boundaries form the foundation for threat modeling your specific environment. Internal and trusted environments can often rely more heavily on automation with contributor filtering and least-privilege controls. Public and open source projects require more stringent controls due to the inherent risks of processing untrusted contributions – these environments benefit from stricter webhook filtering, comprehensive approval gates, or the self-hosted GitHub Actions runner approach discussed later.

The key principle is finding the appropriate balance between security controls and development velocity based on your specific risk profile and contributor trust levels. With these considerations in mind, let’s examine how to assess and configure your current AWS CodeBuild webhook settings.

Configuring secure webhooks

Webhooks represent the preferred mechanism by which external events trigger AWS CodeBuild processes. When properly configured, webhooks provide a powerful and efficient way to automate your build processes in response to repository changes. However, improper webhook configuration can create security vulnerabilities by allowing untrusted code to execute in privileged environments.The security of your webhook configuration depends on understanding exactly which events trigger builds, what level of access those builds have, and what code gets executed during the build process. This section provides a comprehensive approach to authoring, assessing, configuring, and maintaining secure webhook configurations.

Assessing current webhook configurations

Begin by reviewing your existing AWS CodeBuild projects to understand their current webhook configurations. The following AWS CLI commands provide a systematic approach to gathering this information:

# List all CodeBuild projects in your region
aws codebuild list-projects --region us-west-2

# Retrieve detailed configuration for analysis
aws codebuild batch-get-projects --region us-west-2 \
  --names $(aws codebuild list-projects --region us-west-2 \
  --query 'projects[*]' --output text | tr '\n' ' ')

When you run these commands, pay particular attention to the webhook section in the output. This section contains the filterGroups configuration, which determines exactly which repository events trigger builds.

Now that you understand how to review your current setup, let’s examine common configuration patterns and their security implications.

Webhook configuration patterns

Understanding common webhook configuration patterns helps you quickly identify potential security concerns and implement appropriate improvements. The following patterns represent different approaches to webhook configuration, each with specific security implications.

Note: These patterns are not recommended for use and are shown here to help you identify configurations that may need attention.

Configuration requiring review – Automatic pull request processing


{
  "webhook": {
    "payloadUrl": "https://codebuild.us-west-2.amazonaws.com/webhooks",
    "filterGroups": [
      [
        {
          "type": "EVENT",
          "pattern": "PULL_REQUEST_CREATED,PULL_REQUEST_UPDATED,PULL_REQUEST_REOPENED",
          "excludeMatchedPattern": false
        }
      ]
    ]
  }
}

This configuration allows contributors who can create a pull request to trigger code execution in your build environment. We strongly recommend that you do not use automatic pull request builds from untrusted repository contributors.

Configuration requiring immediate review – No event filtering


{
  "webhook": {
    "payloadUrl": "https://codebuild.us-west-2.amazonaws.com/webhooks",
    "filterGroups": []
  }
}

Without filtering, this configuration can trigger builds for a wide variety of repository events.

Recommended secure webhook configurations

The following configurations represent security best practices that balance automation benefits with appropriate security controls. These patterns help to reduce security risks while maintaining the development velocity that makes CI/CD valuable.

Push-based builds (Recommended for most use cases)

Push-based builds make sure that only users with repository write access can trigger builds, which means contributors have already been vetted through your repository’s access control mechanisms.


{
  "webhook": {
    "payloadUrl": "https://codebuild.us-west-2.amazonaws.com/webhooks",
    "filterGroups": [
      [
        {
          "type": "EVENT",
          "pattern": "PUSH",
          "excludeMatchedPattern": false
        }
      ]
    ]
  }
}

Organizations that rely heavily on external open-source contributions may find this approach too restrictive. For example, a popular open-source project that receives dozens of pull requests daily from external contributors would need to manually merge each contribution before builds can run, significantly slowing down the contribution review process. In such cases, contributor-filtered builds or the self-hosted GitHub Actions runner approach may be more appropriate.

Contributor-filtered builds (Recommended for trusted contributors only)


{
  "webhook": {
    "payloadUrl": "https://codebuild.us-west-2.amazonaws.com/webhooks",
    "filterGroups": [
      [
        {
          "type": "EVENT",
          "pattern": "PULL_REQUEST_CREATED,PULL_REQUEST_UPDATED",
          "excludeMatchedPattern": false
        },
        {
          "type": "GITHUB_ACTOR_ACCOUNT_ID",
          "pattern": "^(12345678|87654321|11223344)$",
          "excludeMatchedPattern": false
        }
      ]
    ]
  }
}

This configuration allows pull request builds from specific, trusted contributors.

Important: Filtering applies to the GitHub account ID, not repository ownership. Contributors working from forked repositories can still introduce untrusted code that executes in your build environment.

Before implementing these configurations in your environment, consider these key factors that will help facilitate a smooth transition.

Webhook configuration implementation steps

While implementing the webhook security measures below, consider these broader practices:

Threat modeling: Assess your specific risk profile before selecting approaches.
Infrastructure as code: Use Infrastructure as Code (IaC) tools for production implementations.
Gradual implementation: Implement changes incrementally with observation periods.
Testing and rollback: Validate changes in non-production environments first.

The following implementation approach moves from most restrictive to more automated configurations. Choose the approach that best fits your organization’s risk tolerance and operational requirements.
This three-step process moves from the most restrictive approach to more automated configurations while maintaining security controls. Each step builds upon the previous one, creating layers of security that work together to protect your pipeline.

Note: The following examples use the AWS CLI for demonstration purposes. Similar configuration steps can be performed using the AWS Management Console through the AWS CodeBuild project settings.

Step 1: Configure push-only builds

Push-based builds help make sure that only verified contributors can trigger builds. This approach is more secure, because contributors must already be vetted through your repository’s access control mechanisms before they can push code.
Configure your webhook to trigger only on push events:

aws codebuild update-webhook \
  --project-name your-project-name \
  --filter-groups '[
    [
      {
        "type": "EVENT",
        "pattern": "PUSH",
        "excludeMatchedPattern": false
      }
    ]
  ]'

Step 2: Implement branch-based filtering

Branch-based filtering adds an additional layer of security by making sure that builds are triggered only for changes to specific branches. This approach recognizes that not all branches in a repository have the same security requirements or risk profiles.

For example, changes to main or production branches typically require more stringent security controls than changes to feature or development branches. By implementing branch-based filtering, you can apply appropriate security measures based on the criticality and exposure of different branches.

Configure filtering for specific branches:

aws codebuild update-webhook \
  --project-name your-project-name \
  --filter-groups '[
    [
      {
        "type": "EVENT",
        "pattern": "PUSH"
      },
      {
        "type": "HEAD_REF",
        "pattern": "^refs/heads/(main|develop|release/.*)$"
      }
    ]
  ]'

Step 3: Configure contributor filtering

Contributor filtering can be used to manage pull request builds by allowing automation for trusted contributors while requiring manual review for others. This approach recognizes that different contributors represent different risk profiles and should be treated accordingly.

The first step in implementing contributor filtering is identifying the GitHub user IDs of your trusted contributors.

Retrieve GitHub user IDs for trusted contributors:

curl -H "Authorization: token YOUR_GITHUB_TOKEN" \
https://api.github.com/users/trusted-username

Once you have the user IDs of your trusted contributors, you can configure webhook filtering to allow automated builds only for these contributors:


aws codebuild update-webhook \
  --project-name your-project-name \
  --filter-groups '[
    [
      {
        "type": "EVENT",
        "pattern": "PULL_REQUEST_CREATED,PULL_REQUEST_UPDATED"
      },
      {
        "type": "GITHUB_ACTOR_ACCOUNT_ID",
        "pattern": "^(1234567|2345678|3456789)$"
      }
    ]
  ]'

Important: Contributor allowlists require ongoing maintenance as team membership changes. Consider using Infrastructure as Code templates like the Cloudformation examples to manage webhook configurations and contributor lists in version control.

Webhook filtering provides the first layer of security by controlling which events trigger builds. However, comprehensive pipeline security requires additional controls around the permissions and credentials available to those builds once they execute. The following section covers how to implement defense-in-depth security through proper access controls and credential management.

Access control and credential management

This section covers specific approaches to limit the permissions available to build processes, scope repository access tokens appropriately, and create isolated environments that help contain potential security issues. These practices work together to implement defense-in-depth security while maintaining the operational benefits of automated CI/CD workflows.

Implementing least-privilege access

AWS CodeBuild projects require IAM service roles to access AWS resources during the build process. The principle of least privilege dictates that each role should have only the minimum permissions necessary to perform its intended function. By creating separate, purpose-built IAM roles for different types of builds, you can help reduce the potential impact of unauthorized access to build environments.

The following examples demonstrate how to structure minimal IAM roles for different build scenarios. These examples serve as starting points that you should customize based on your specific requirements, adding only the permissions your builds actually need.

Service role configuration

Create minimal IAM roles that provide only the permissions required for specific build types:

Test/validation build role

{
	"Version": "2012-10-17",
	"Statement": [
	{
		"Effect": "Allow",
		"Action": [
			"logs:CreateLogGroup",
			"logs:CreateLogStream",
			"logs:PutLogEvents"
		],
		"Resource": "arn:aws:logs:*:*:log-group:/aws/codebuild/test-*"
	},
	{
	"Effect": "Allow",
	"Action": [
		"s3:GetObject"
	],
	"Resource": "arn:aws:s3:::your-test-artifacts-bucket/*"
  }
 ]
}

Release build role (Separate from test)

{
	"Version": "2012-10-17",
	"Statement": [
	  {
		"Effect": "Allow",
		"Action": [
			"s3:PutObject",
			"s3:GetObject"
		],
		"Resource": "arn:aws:s3:::your-production-artifacts-bucket/*"
	  },
	  {
		"Effect": "Allow",
		"Action": [
			"ecr:BatchCheckLayerAvailability",
			"ecr:GetDownloadUrlForLayer",
			"ecr:BatchGetImage",
			"ecr:PutImage"
		],
		"Resource": "arn:aws:ecr:*:*:repository/your-production-repo"
	  }
	]
}

Leveraging IAM Access Analyzer for CodeBuild security

AWS IAM Access Analyzer can generate least-privilege policies for your AWS CodeBuild service roles based on actual CloudTrail activity from your build executions. This eliminates guesswork by analyzing the specific AWS API calls your builds make, rather than requiring you to predict what permissions might be needed.

After running your CodeBuild projects for a representative period, use Access Analyzer’s policy generation feature to create refined policies. This approach proves particularly valuable for complex build processes where the required permissions might not be immediately obvious.

For detailed implementation steps, refer to the IAM Access Analyzer documentation.

Credential scoping and source authentication

When processing external contributions, the principle of least privilege becomes important for repository access tokens. If an unauthorized user gains access to a token through an untrusted build, properly scoped tokens limit the potential impact to only the permissions necessary for the build process.

Configure fine-grained GitHub Personal Access Tokens with minimal permissions to help reduce this risk. Even if accessed inappropriately, a properly scoped token can only read source code (already accessible through the PR) and write status messages – it cannot push code, modify repository settings, or access other repositories.

The following permissions represent the minimum required access for processing external pull requests, demonstrating how to limit token scope to only essential operations:

contents:read – Read-only access to repository source code (already accessible through the PR)
statuses:write – Write commit status messages only (cannot modify code or settings)
metadata:read – Access basic repository information (name, description, public status)

Important: Use fine-grained personal access tokens restricted to the target repository only. Otherwise, this could allow access to other repositories beyond what is necessary for the build process.

This scoped approach ensures that even if a token is accessed inappropriately, the potential impact is limited to reading already-accessible information and writing status messages. The token cannot push code, modify repository settings, create webhooks, or access other repositories.

Credential storage and rotation

The following examples demonstrate how to securely store and reference these tokens using AWS Secrets Manager. AWS Secrets Manager provides automatic rotation capabilities, encryption at rest and in transit, and fine-grained access controls that help prevent tokens from being exposed in build logs or configuration files. This approach also enables centralized token management across multiple CodeBuild projects while maintaining audit trails of token access.

# Store the fine-grained token in AWS Secrets Manager
aws secretsmanager create-secret \
--name "codebuild/github-pat-limited" \
--description "Limited GitHub PAT for external PR processing" \
--secret-string '{"token":"ghp_your_limited_token_here"}'

# Create CodeBuild project with scoped credentials
aws codebuild create-project \
--name external-pr-processor \
--source '{
"type": "GITHUB",
"location": "https://github.com/your-org/your-repo.git",
"sourceCredentialsOverride": {
"serverType": "GITHUB",
"authType": "PERSONAL_ACCESS_TOKEN",
"token": "{{resolve:secretsmanager:codebuild/github-pat-limited:SecretString:token}}"
},
"reportBuildStatus": false
}' \
--service-role arn:aws:iam::account:role/minimal-test-build-role

The centralized storage enables credential rotation capabilities, helping to minimize the window of exposure compared to hardcoded tokens that would require infrastructure updates to rotate.

Build environment isolation

Establishing proper build environment security controls helps maintain pipeline integrity. The foundation of this approach involves implementing separation between test and release builds, which helps prevent credential escalation and limits the scope of potential unauthorized access.

Network isolation represents another layer of protection. Configure VPC settings specifically for builds that process external code by creating dedicated security groups with carefully restricted outbound access. These security groups should permit only necessary connections, such as HTTPS traffic for downloading legitimate dependencies, while blocking unnecessary network access that could be exploited by untrusted code.

Update your AWS CodeBuild projects to leverage this network isolation through proper VPC configuration, including specified subnets and the restricted security groups you’ve established.

Multi-stage pipeline security with human review gates

Implementing security controls across multiple pipeline stages helps provide proper validation and approval processes, especially when processing external contributions. This approach combines automated scanning with human oversight to identify issues before they reach production.

Code inspection integration

Configure your build specification to automatically run security tools like Automated Security Helper during the build process. These tools scan for code security issues and dependency problems, generating detailed reports for review.

Structure the build to continue execution even when issues are found, allowing all scans to complete while automatically failing builds that contain security problems requiring attention. Store all scan artifacts to provide security teams with detailed information for approval decisions.

Manual approval gates

After code passes automated security scans, configure manual approval gates to involve human reviewers for final validation. This helps provide appropriate human review before proceeding to sensitive environments.

The access control and credential management practices outlined in this section provide specific, actionable approaches to implementing defense-in-depth security for AWS CodeBuild pipelines. These controls work together to create multiple layers of protection while maintaining the operational benefits that make CI/CD automation valuable.

Alternative approach – Self-hosted GitHub Actions runners

AWS CodeBuild’s self-hosted GitHub Actions runner capability addresses the configuration issues described in this guide by isolating repository credentials from the build environment and using GitHub Actions’ execution framework instead of AWS CodeBuild webhook processing.

For organizations that need to process external contributions automatically, configure runners with proper access controls, use ephemeral runners to minimize persistent access, and apply standard security practices for runner management.

Configuration details are available in the AWS CodeBuild documentation. For additional implementation guidance, see AWS CodeBuild Managed Self-Hosted GitHub Action Runners blog post.

Monitoring and compliance

The security controls outlined in previous sections provide protection at build time, but comprehensive defense-in-depth security requires ongoing visibility into your pipeline activities and configuration changes. Monitoring and compliance tracking serve as the final layer of your security framework, helping you detect configuration drift, audit access patterns, and maintain security posture over time.

AWS CloudTrail provides detailed logging of API calls made to AWS services, including AWS CodeBuild. Enable CloudTrail logging to create a comprehensive audit trail of all build-related activities in your environment.

AWS Config tracks AWS CodeBuild project configurations over time, providing an inventory of projects and a complete history of configuration changes. This includes webhook modifications, resource relationships, and compliance tracking across your environment. Configure AWS Config to monitor AWS CodeBuild projects and receive notifications when security-critical configurations like webhook filters are modified. For more information, see the AWS Config sample with CodeBuild documentation.

Conclusion

Implementing defense-in-depth security for AWS CodeBuild pipelines requires layered controls that address different security considerations. The most effective approach combines webhook filtering, access controls, credential management, and monitoring to provide comprehensive protection. By implementing these layered practices outlined in this guide, you can maintain development velocity while establishing robust pipeline security.
Key principles to remember:

Assess your threat model first – different projects require different security approaches
Establish clear trust boundaries between different types of contributors
Use webhook filtering to control when builds are triggered
Implement least-privilege access for build environments
Monitor and audit configurations regularly using AWS Config and CloudTrail
Store secrets in AWS Secrets Manager or SSM Parameter Store and enable rotation

AWS CodeBuild provides the flexibility to implement these security measures while maintaining the operational benefits that make pipelines valuable. Apply the configurations and mitigations in this guide based on your specific risk profile and operational requirements. Regular review and updates of your configurations will help your pipelines remain secure as your organization’s needs evolve.

Stay tuned for additional practical guides for implementing CI/CD security best practices. If you have questions or feedback about this post, including suggestions for topics that would help you most, start a new thread on re:Post : Begimher or contact AWS Support.

How to migrate your Amazon EC2 Oracle Transparent Data Encryption database encryption keystore to AWS CloudHSM

2025-07-30 Bhushan Bhale

Post Syndicated from Bhushan Bhale original https://aws.amazon.com/blogs/security/how-to-migrate-your-ec2-oracle-transparent-data-encryption-tde-database-encryption-wallet-to-cloudhsm/

July 30, 2025: This post has been republished to migrate the Amazon EC2 Oracle Transparent Data Encryption database encryption keystore to AWS CloudHSM using AWS CloudHSM Client SDK 5.

Encrypting databases is crucial for protecting sensitive data, helping you to be aligned with security regulations and safeguarding against data loss. Oracle Transparent Data Encryption (TDE) is a feature that you can use to encrypt data at rest within an Oracle database. TDE uses envelope encryption. Envelope encryption is when the encryption key used to encrypt the tables of your database is encrypted by a primary key that resides either in a software keystore or on a hardware keystore, such as a hardware security module (HSM). This primary key is non-exportable by design to protect the confidentiality and integrity of your database operation. This gives you a more granular encryption scheme on your data. Hence, TDE for Oracle is a common use case for HSM devices such as AWS CloudHSM.

Oracle TDE supports keystores to securely store the TDE primary encryption keys. You can use either the TDE wallet (software keystore) or external key managers such as an HSM device. In this solution, we show you how to migrate a TDE keystore for an Oracle 19c database installed on Amazon Elastic Compute Cloud (Amazon EC2) from a software-based TDE wallet to AWS CloudHSM.

Using an external key manager, such as CloudHSM, offers several benefits over keeping keys on the Oracle wallet on the host:

Enhanced security: CloudHSM provides FIPS 140 validated hardware security, keeping the encryption key in a tamper-resistant module.
Centralized key management: CloudHSM supports centralized management of encryption keys, making it straightforward to rotate, back up, and audit keys.
Compliance: Your regulatory requirements may include encryption, and using CloudHSM can help you meet these compliance needs.

When you move from one type of keystore to another, new TDE primary keys are created inside the new keystore. To make sure that you have access to backups that rely on your past encryption keys, consider leaving the keystore running for your normal recovery window period or copying existing keys to the new keystore with exact key labels. Being able to access prior primary keys will help avoid data re-encryption.

You can use TDE to encrypt data online or offline. Encrypting TDE tablespace online minimizes disruption to database operations; however, it requires twice the storage space as the tablespace being encrypted, because the encryption process happens on a copy of the original tablespace.

Solution overview

In this solution, you migrate a TDE keystore for an Oracle 19c database from a software-based TDE wallet to CloudHSM, using the steps shown in Figure 1. Start by moving the current encryption keystore, which is your original TDE wallet, to a software wallet. This is done by replacing the PKCS#11 provider of your original HSM with the CloudHSM PKCS#11 software library (steps 1–2), next you reverse migrate to a local wallet (steps 3–5). The third step is to switch the encryption wallet for your database to your CloudHSM cluster (steps 6 and 7). After this process is complete, your database will automatically re-encrypt the data keys using the new primary key.

Figure 1: Steps to migrate your EC2 Oracle TDE database encryption wallet to CloudHSM

Note: The following instructions were tested using Oracle version 19c.

Prerequisites

You must have the following prerequisites in place to complete the solution in this post.

AWS CloudHSM cluster: You need to have a CloudHSM cluster set up and configured with an admin EC2 instance for interacting with CloudHSM following steps and best practices covered in Getting started with AWS CloudHSM.
Oracle database: Make sure that your Oracle database is up and running. This post assumes that you have an Oracle Database 19c database running on an EC2 Linux instance and there is network connectivity set up to CloudHSM as explained in this Configure the Client Amazon EC2 instance security groups for AWS CloudHSM.

Migrate an Oracle database keystore to a CloudHSM external keystore

As the first step in the migration, you need to migrate your Oracle database keystore to a CloudHSM external keystore. You do this by installing the CloudHSM client and the PKCS#11 library and then configuring the PKCS#11 library to connect to the HSM cluster.

Install the CloudHSM client:

Install the latest CloudHSM client software on your EC2 instance.
Configure the client to connect to HSMs in your cluster. For Linux EC2, use the following command:
```
sudo /opt/cloudhsm/bin/configure-cli -a <The ENI IPv4 / IPv6 addresses of the HSM>
```
Copy the CloudHSM issuing certificate created when you initialized the cluster (customerCA.crt) to the /opt/cloudhsm/etc folder. For more information, see Activate the cluster in AWS CloudHSM.
Validate connectivity to the CloudHSM cluster.
```
/opt/cloudhsm/bin/cloudhsm-cli interactive
```

aws-cloudhsm > login --username hsm-crypto-user --role admin
aws-cloudhsm > user create --username hsm-crypto-user --role crypto-user

Install the PKCS#11 Library

Install the PKCS #11 library for AWS CloudHSM Client SDK 5.
Configure Oracle to use the PKCS library:
1. Copy the PKCS#11 library to the appropriate Oracle folder. Typically, this is:
```
     cp /opt/cloudhsm/libcloudhsm_pkcs11.so /opt/oracle/extapi/[32,64]/hsm/aws/{VERSION}/libcloudhsm_pkcs11.so
```
2. Make sure that the folder /opt/oracle has the correct ownership, usually oracle:dba as owner:group.

Configure PKCS#11 library to connect to the HSM cluster

Use the following commands:

sudo /opt/cloudhsm/bin/configure-pkcs11 -a <HSM IP addresses>
sudo /opt/cloudhsm/bin/configure-pkcs11 --hsm-ca-cert <customerCA certificate file>

Configure the Oracle wallet location

In this section, you configure Oracle to point to CloudHSM using the sqlnet.ora file.

To configure the Oracle wallet location:

Edit the sqlnet.ora parameter ENCRYPTION_WALLET_LOCATION to point to the HSM:
```
ENCRYPTION_WALLET_LOCATION=
  (SOURCE=(METHOD=HSM)
  )
```
Verify that the WALLET_ROOT parameter is pointing to the current file-based TDE wallet location. This parameter defines the location where the TDE wallet (and other related files) will be stored. You can set it to an existing directory, preferably one in your $ORACLE_BASE or $ORACLE_HOME directory, but other locations are also possible.
```
show parameter wallet_root
show parameter tde_configuration
```

Use the following commands to set WALLET_ROOT if it hasn’t already been set.

ALTER SYSTEM SET WALLET_ROOT = '/u01/app/oracle/admin/orcl/wallet' SCOPE=BOTH SID='*';
ALTER SYSTEM SET TDE_CONFIGURATION=“KEYSTORE_CONFIGURATION=FILE” SCOPE=BOTH SID=“*”

Note:

In Oracle Database 19c and later, the ENCRYPTION_WALLET_LOCATION. parameter in sqlnet.ora is deprecated in favor of using WALLET_ROOT and TDE_CONFIGURATION.

You can also use the V$ENCRYPTION_WALLET view to check the current keystore location and status.

Point the Oracle database to use a local file-based keystore and CloudHSM

The KEYSTORE_CONFIGURATION attribute within TDE_CONFIGURATION determines the keystore type.

To point the Oracle database:

Use the following code to point the database to the local keystore and CloudHSM.

ALTER SYSTEM SET TDE_CONFIGURATION="KEYSTORE_CONFIGURATION=HSM|FILE" SCOPE = BOTH SID = '*';

Restart the database to have consistent results.

Verify that the keystore file-based wallet is open

To proceed with encryption key migration, you need to check the current keystore status and make sure that the file-based wallet is open.

To verify that the wallet is open:

Check to see if the file-based wallet is open.
```
Select * from V$encryption_wallet;
```
Figure 2: Verify that the wallet status is OPEN

If the wallet status is not OPEN, use the following command to open it:

  ALTER SYSTEM SET ENCRYPTION WALLET OPEN IDENTIFIED BY "wallet_password";

Migrate the encryption key to CloudHSM

Use the ADMINISTER KEY MANAGEMENT SET ENCRYPTION KEY command to initiate the TDE primary encryption key migration.

To migrate the encryption key:

Use the following command to migrate the encryption key:

ADMINISTER KEY MANAGEMENT SET ENCRYPTION KEY IDENTIFIED BY "hsm-crypto-user:password" MIGRATE USING "wallet-password" WITH BACKUP USING 'backup_tag';

The parameters used to migrate the encryption key are:

SET ENCRYPTION KEY: Specifies that the command is related to the TDE primary encryption key
IDENTIFIED BY: Specifies the details for migrating the keystore, including the external keystore user and password
MIGRATE USING: Specifies the password for the file-based wallet containing the primary encryption key
WITH BACKUP: Creates a backup of the keystore before the migration

Verify that the migration is complete

At this point, the migration from Oracle to Cloud HSM should be complete. Use the following steps to verify it.

To verify the migration:

Check the wallet status again. If the migration was successful, the WALLET_TYPE will be HSM and the WALLET_OR will be PRIMARY.
```
Select * from V$encryption_wallet;
```
Figure 3: Verify that WALLET_TYPE and WALLET_OR are correct

In the wallet is not open, use:

SQL>administer key management set keystore open identified by "hsm-crypto-user:password"

Verify that the database can access encrypted data without issues, confirming that the migration was successful.

Setup auto-login

Create auto-login to open the wallet during database restarts to connect to AWS CloudHSM.

To set up auto_login:

Create a new file-based keystore with the same username and password as the CloudHSM crypto user.

ADMINISTER KEY MANAGEMENT CREATE KEYSTORE '/etc/oracle/wallets/<path>/tde' IDENTIFIED BY "hsm-crypto-user:password";

Add the CloudHSM crypto user password to a keystore (TDE wallet).

ADMINISTER KEY MANAGEMENT ADD SECRET 'hsm-crypto-user:password' FOR CLIENT 'HSM_PASSWORD' TO KEYSTORE '/etc/oracle/wallets/<path>/tde' IDENTIFIED BY "hsm-crypto-user:password" WITH BACKUP;

The following command creates a new auto-login keystore. This is useful for scenarios where the keystore needs to be accessed without human intervention.
```
ADMINISTER KEY MANAGEMENT CREATE AUTO_LOGIN KEYSTORE FROM KEYSTORE '/etc/oracle/wallets/<path>/tde' IDENTIFIED BY "<hsm-crypto-user:password>";
```

Open the newly created file based keystore.

ADMINISTER KEY MANAGEMENT SET KEYSTORE OPEN IDENTIFIED BY "<hsm-crypto-user:password>"

Key rotation

Key rotation helps you to adhere to security best practices by providing several data security benefits. Regular rotation of TDE keys reduces the window of opportunity for a bad actor who might have obtained a key, thereby minimizing the impact of a potential breach.

Many established security frameworks and compliance standards—such as PCI DSS and HIPAA—recommend or require regular key rotation to maintain the integrity and confidentiality of encrypted data. By making sure that keys aren’t used indefinitely, you can help reduce the risk of exposure or compromise, which reinforces overall security.

Encryption algorithms can become less secure over time because of advancements in computing power or newly discovered vulnerabilities. By rotating keys regularly, you can transition to stronger encrypting methods as needed, and so improve protection against emerging risks.

When to rotate TDE keys

The frequency of key rotation depends on several factors, including organizational policies, regulatory requirements, and the sensitivity of the data being protected. Here are some common practices:

Annually: Many organizations rotate TDE keys once a year to align with common compliance requirements.
Quarterly: For higher-security environments or more sensitive data, rotating keys every quarter can provide an additional layer of security.
If keys are compromised or suspected to be compromised: If you believe a key to be compromised, rotating that key as soon as possible is recommended to reduce the impact window.

Oracle TDE primary key rotation with an HSM key

In this section, you choose a 32-bit hex value to use as a prefix when generating a key, then use that key to update the Oracle database to use the new primary key.

Sign in to the database instance as a user who has the ADMINISTER KEY MANAGEMENT or SYSKM privilege and execute following command:
```
ADMINISTER KEY MANAGEMENT SET ENCRYPTION KEY IDENTIFIED BY "<hsm-crypto-user:password>"
```
Decide on a 32-bit hex value pattern to be used. We used 15A5142C9E2D3C2F18FD435814257DFD in this example.

Add the prefix ORACLE.TDE.HSM.MK to the hex pattern.

ORACLE.TDE.HSM.MK.0615A5142C9E2D3C2F18FD435814257DFD

key generate-symmetric aes --label 'ORACLE.TDE.HSM.MK.0615A5142C9E2D3C2F18FD435814257DFD' --key-length-bytes 32 —attributes encrypt=true decrypt=true“

Share the key with the Oracle hsm-crypto-user in case the original key was generated through another user.

 key share --filter attr.label=""ORACLE.TDE.HSM.MK.0615A5142C9E2D3C2F18FD435814257DFD"" attr.class=secret-key --username hsm_crypto_user --role crypto-user

Update the Oracle database to use the new primary TDE key.

ADMINISTER KEY MANAGEMENT USE KEY '0615A5142C9E2D3C2F18FD435814257DFD' FORCE KEYSTORE IDENTIFIED BY "hsm-crypto-user:password"

By following these guidelines, you can enhance the security of your encrypted data, align with regulatory requirements, and maintain robust key management practices.

Conclusion

In this post, we’ve shown you the importance of Transparent Data Encryption (TDE) and the benefits of using an external key manager such as AWS CloudHSM for storing TDE encryption keys. We’ve discussed the benefits of TDE compared to encrypting underlying storage and why using an external key manager is superior to keeping keys on the Oracle wallet on the host. Following these guidelines can help you enhance the security of your encrypted data, align with regulatory requirements, and maintain robust key management practices.

The key takeaways from this post are:

TDE offers granular encryption, compliance benefits, and robust key management.
External key managers provide enhanced security, centralized management, improved auditability and scalability.
Regular rotation of TDE keys is crucial for maintaining security, aligning with regulations, and following recommended practices in key management.

To start securing your Oracle databases with TDE and AWS CloudHSM, visit the AWS Management Console. Follow the steps outlined in this guide to migrate your TDE encryption keystore to AWS CloudHSM and begin rotating your keys regularly to enhance your data security posture. By taking these actions, you can make sure that your sensitive data remains protected, your organization remains aligned with regulations, and you are following the best practices in data encryption and key management.

For more information, see:

If you have feedback about this post, submit comments in the Comments section below.

Optimize traffic costs of Amazon MSK consumers on Amazon EKS with rack awareness

2025-07-30 Austin Groeneveld

Post Syndicated from Austin Groeneveld original https://aws.amazon.com/blogs/big-data/optimize-traffic-costs-of-amazon-msk-consumers-on-amazon-eks-with-rack-awareness/

Are you incurring significant cross Availability Zone traffic costs when running an Apache Kafka client in containerized environments on Amazon Elastic Kubernetes Service (Amazon EKS) that consume data from Amazon Managed Streaming for Apache Kafka (Amazon MSK) topics?

If you’re not familiar with Apache Kafka’s rack awareness feature, we strongly recommend starting with the blog post on how to Reduce network traffic costs of your Amazon MSK consumers with rack awareness for an in-depth explanation of the feature and how Amazon MSK supports it.

Although the solution described in that post uses an Amazon Elastic Compute Cloud (Amazon EC2) instance deployed in a single Availability Zone to consume messages from an Amazon MSK topic, modern cloud-native architectures demand more dynamic and scalable approaches. Amazon EKS has emerged as a leading platform for deploying and managing distributed applications. The dynamic nature of Kubernetes introduces unique implementation challenges compared to static client deployments. In this post, we walk you through a solution for implementing rack awareness in consumer applications that are dynamically deployed across multiple Availability Zones using Amazon EKS.

Here’s a quick recap of some key Apache Kafka terminology from the referenced blog. An Apache Kafka client consumer will register to read against a topic. A topic is the logical data structure that Apache Kafka organizes data into. A topic is segmented into a single or many partitions. Partitions are the unit of parallelism in Apache Kafka. Amazon MSK provides high availability by replicating each partition of a topic across brokers in different Availability Zones. Because there are replicas of each partition that reside across the different brokers that make up your MSK cluster, Amazon MSK also tracks whether a replica partition is in sync with the most recent data for that partition. This means there is one partition that Amazon MSK recognizes as containing the most up-to-date data, and this is known as the leader partition. The collection of replicated partitions is called in-sync replicas. This list of in-sync replicas is used internally when the cluster needs to elect a new leader partition if the current leader were to become unavailable.

When consumer applications read from a topic, the Apache Kafka protocol facilitates a network exchange to determine which broker currently has the leader partition that the consumer needs to read from. This means that the consumer could be told to read from a broker in a different Availability Zone than itself, leading to cross-zone traffic charge in your AWS account. To help optimize this cost, Amazon MSK supports the rack awareness feature, using which clients can ask an Amazon MSK cluster to provide a replica partition to read from, within the same Availability Zone as the client, even if it isn’t the current leader partition. The cluster accomplishes this by checking for an in-sync replica on a broker within the same Availability Zone as the consumer.

The challenge with Kafka clients on Amazon EKS

In Amazon EKS, the underlying units of computes are EC2 instances that are abstracted as Kubernetes nodes. The nodes are organized into node groups for ease of management, scaling, and grouping of applications on certain EC2 instance types. As a best practice for resilience, the nodes in a node group are spread across multiple Availability Zones. Amazon EKS uses the underlying Amazon EC2 metadata about the Availability Zone that it’s located in, and it injects that information into the node’s metadata during node configuration. In particular, the Availability Zone (AZ ID) is injected into the node metadata.

When an application is deployed in a Kubernetes Pod on Amazon EKS, it goes through a process of binding to a node that meets the pod’s requirements. As shown in the following diagram, when you deploy client applications on Amazon EKS, the pod for the application can be bound to a node with available capacity in any Availability Zone. Also, the pod doesn’t automatically inherit the Availability Zone information from the node that it’s bound to, a piece of information necessary for rack awareness. The following architecture diagram illustrates Kafka consumers running on Amazon EKS without rack awareness.

AWS Cloud architecture showing MSK brokers, EKS pods, and EC2 instances in three Availability Zones

To set the client configuration for rack awareness, the pod needs to know what Availability Zone it’s located in, dynamically, as it is bound to a node. During its lifecycle, the same pod can be evicted from the node it was bound to previously and moved to a node in a different Availability Zone, if the matching criteria permit that. Making the pod aware of its Availability Zone dynamically sets the rack awareness parameter client.rack during the initialization of the application container that is encapsulated in the pod.

After rack awareness is enabled on the MSK cluster, what happens if the broker in the same Availability Zone as the client (hosted on Amazon EKS or elsewhere) becomes unavailable? The Apache Kafka protocol is designed to support a distributed data storage system. Assuming customers follow the best practice of implementing a replication factor > 1, Apache Kafka can dynamically reroute the consumer client to the next available in-sync replica on a different broker. This resilience remains consistent even after implementing nearest replica fetching, or rack awareness. Enabling rack awareness optimizes the networking exchange to prefer a partition within the same Availability Zone, but it doesn’t compromise the consumer’s ability to operate if the nearest replica is unavailable.

In this post, we walk you through an example of how to use the Kubernetes metadata label, topology.k8s.aws/zone-id, assigned to each node by Amazon EKS, and use an open source policy engine, Kyverno, to deploy a policy that mutates the pods that are in the binding state to dynamically inject the node’s AZ ID into the pod’s metadata as an annotation, as depicted in the following diagram. This annotation, in turn, is used by the container to create an environment variable that is assigned the pod’s annotated AZ ID information. The environment variable is then used in the container postStart lifecycle hook to generate the Kafka client configuration file with rack awareness setting. The following architecture diagram illustrates Kafka consumers running on Amazon EKS with rack awareness.

AWS architecture with MSK, EKS, Kyverno, and EC2 across three Availability Zones, detailing topology

Solution Walkthrough

Prerequisites

For this walkthrough, we use AWS CloudShell to run the scripts that are provided inline as you progress. For a smooth experience, before getting started, make sure to have kubectl and eksctl installed and configured in the AWS CloudShell environment, following the installation instructions for Linux (amd64). Helm is also required to be install on AWS CloudShell, using the instructions for Linux.

Also, check if the envsubst tool is installed in your CloudShell environment by invoking:

which envsubst

If the tool isn’t installed, you can install it using the command:

sudo dnf -y install gettext-devel

We also assume you already have an MSK cluster deployed in an Amazon Virtual Private Cloud (VPC) in three Availability Zones with the name MSK-AZ-Aware. In this walkthrough, we use AWS Identity and Access Management (IAM) authentication for client access control to the MSK cluster. If you’re using a cluster in your account with a different name, replace the instances of MSK-AZ-Aware in the instructions.

We follow the same MSK cluster configuration mentioned in the Rack Awareness blog mentioned previously, with some modifications. (Ensure you’ve set replica.selector.class = org.apache.kafka.common.replica.RackAwareReplicaSelector for the reasons discussed there). In our configuration, we add one line: num.partitions = 6. Although not mandatory, this ensures that topics that are automatically created will have multiple partitions to support clearer demonstrations in subsequent sections.

Finally, we use the Amazon MSK Data Generator with the following configuration:

{
"name": "msk-data-generator",
    "config": {
    "connector.class": "com.amazonaws.mskdatagen.GeneratorSourceConnector",
    "genkp.MSK-AZ-Aware-Topic.with": "#{Internet.uuid}",
    "genv.MSK-AZ-Aware-Topic.product_id.with": "#{number.number_between '101','200'}",
    "genv.MSK-AZ-Aware-Topic.quantity.with": "#{number.number_between '1','5'}",
    "genv.MSK-AZ-Aware-Topic.customer_id.with": "#{number.number_between '1','5000'}"
    }
}

Running the MSK Data Generator with this configuration will automatically create a six-partition topic named MSK-AZ-Aware-Topic on our cluster for us, and it will push data to that topic. To follow along with the walkthrough, we recommend and assume that you deploy the MSK Data Generator to create the topic and populate it with simulated data.

Create the EKS cluster

The first step is to install an EKS cluster in the same Amazon VPC subnets as the MSK cluster. You can modify the name of the MSK cluster by changing that environment variable MSK_CLUSTER_NAME if your cluster is created with a different name than suggested. You can also change the Amazon EKS cluster name by changing EKS_CLUSTER_NAME.

The environment variables that we define here are used throughout the walkthrough.

The last step is to update the kubeconfig with an entry for the EKS cluster:

AWS_ACCOUNT=$(aws sts get-caller-identity --output text --query Account)
export AWS_ACCOUNT
export AWS_REGION=${AWS_DEFAULT_REGION}
export MSK_CLUSTER_NAME=MSK-AZ-Aware
export EKS_CLUSTER_NAME=EKS-AZ-Aware
export EKS_CLUSTER_SIZE=3
export K8S_VERSION=1.32
export POD_ID_VERSION=1.3.5
 
MSK_BROKER_SG=$(aws kafka list-clusters \
  --query  'ClusterInfoList[?ClusterName==`'${MSK_CLUSTER_NAME}'`].BrokerNodeGroupInfo.SecurityGroups'  \
  --output text | xargs)
export MSK_BROKER_SG

MSK_BROKER_CLIENT_SUBNETS=$(aws kafka list-clusters \
  --query  'ClusterInfoList[?ClusterName==`'${MSK_CLUSTER_NAME}'`].BrokerNodeGroupInfo.ClientSubnets'  \
  --output text | xargs)
export MSK_BROKER_CLIENT_SUBNETS
 
VPC_ID=$(aws ec2 describe-subnets \
  --subnet-ids "$(echo "${MSK_BROKER_CLIENT_SUBNETS}" | cut -d' ' -f1)" \
  --query 'Subnets[0].VpcId' \
  --output text)
export VPC_ID

EKS_SUBNETS=$(echo ${MSK_BROKER_CLIENT_SUBNETS} | sed 's/ \+/,/g')
export EKS_SUBNETS

# Create a minimal config file for encrypted node volumes
cat > eks-config.yaml << EOF
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${EKS_CLUSTER_NAME}
  region: ${AWS_REGION}
  version: "${K8S_VERSION}"
vpc:
  id: "${VPC_ID}"
  subnets:
    public:
$(for subnet in ${MSK_BROKER_CLIENT_SUBNETS}; do
  AZ=$(aws ec2 describe-subnets --subnet-ids "$subnet" --query 'Subnets[0].AvailabilityZone' --output text)
  echo "      $AZ: { id: $subnet }"
done)
nodeGroups:
  - name: ng1
    instanceType: m5.xlarge
    desiredCapacity: ${EKS_CLUSTER_SIZE}
    minSize: ${EKS_CLUSTER_SIZE}
    maxSize: ${EKS_CLUSTER_SIZE}
    securityGroups:
      attachIDs: ["${MSK_BROKER_SG}"]
    volumeSize: 100
    volumeType: gp3
    volumeEncrypted: true
EOF

eksctl create cluster -f eks-config.yaml

aws eks update-kubeconfig \
  --region "${AWS_REGION}" \
  --name ${EKS_CLUSTER_NAME}

Next, you need to create an IAM policy, MSK-AZ-Aware-Policy, to allow access from the Amazon EKS pods to the MSK cluster. Note here that we’re using MSK-AZ-Aware as the cluster name.

Create a file, msk-az-aware-policy.json, with the IAM policy template:

cat > msk-az-aware-policy.json << EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "kafka-cluster:Connect",
                "kafka-cluster:AlterCluster",
                "kafka-cluster:DescribeCluster",
                "kafka-cluster:DescribeClusterDynamicConfiguration",
                "kafka-cluster:AlterClusterDynamicConfiguration"
            ],
            "Resource": [
                "arn:aws:kafka:${AWS_REGION}:${AWS_ACCOUNT}:cluster/${MSK_CLUSTER_NAME}/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "kafka-cluster:*Topic*",
                "kafka-cluster:WriteData",
                "kafka-cluster:ReadData"
            ],
            "Resource": [
                "arn:aws:kafka:${AWS_REGION}:${AWS_ACCOUNT}:topic/${MSK_CLUSTER_NAME}/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "kafka-cluster:AlterGroup",
                "kafka-cluster:DescribeGroup"
            ],
            "Resource": [
                "arn:aws:kafka:${AWS_REGION}:${AWS_ACCOUNT}:group/${MSK_CLUSTER_NAME}/*"
            ]
        }
    ]
}
EOF

To create the IAM policy, use the following command. It first replaces the placeholders in the policy file with values from relevant environment variables, and then creates the IAM policy:

envsubst < msk-az-aware-policy.json | \
xargs -0 -I {} aws iam create-policy \
            --policy-name MSK-AZ-Aware-Policy \
            --policy-document {}

Configure EKS Pod Identity

Amazon EKS Pod Identity offers a simplified experience for obtaining IAM permissions for pods on Amazon EKS. This requires installing an add-on Amazon EKS Pod Identity Agent to the EKS cluster:

eksctl create addon \
  --cluster ${EKS_CLUSTER_NAME} \
  --name eks-pod-identity-agent \
  --version ${POD_ID_VERSION}

Confirm that the add-on has been installed and its status is ACTIVE and that the status of all the pods associated with the add-on is Running.

eksctl get addon \
  --cluster ${EKS_CLUSTER_NAME} \
  --region "${AWS_REGION}" \
  --name eks-pod-identity-agent -o json

kubectl get pods \
  -n kube-system \
  -l app.kubernetes.io/instance=eks-pod-identity-agent

After you’ve installed the add-on, you need to create a pod identity association between a Kubernetes service account and the IAM policy created earlier:

eksctl create podidentityassociation \
  --namespace kafka-ns \
  --service-account-name kafka-sa \
  --role-name EKS-AZ-Aware-Role \
  --permission-policy-arns arn:aws:iam::"${AWS_ACCOUNT}":policy/MSK-AZ-Aware-Policy \
  --cluster ${EKS_CLUSTER_NAME} \
  --region "${AWS_REGION}"

Install Kyverno

Kyverno is an open source policy engine for Kubernetes that allows for validation, mutation, and generation of Kubernetes resources using policies written in YAML, thus simplifying the enforcement of security and compliance requirements. You need to install Kyverno to dynamically inject metadata into the Amazon EKS pods as they enter the binding state to inform them of Availability Zone ID.

In AWS CloudShell, create a file named kyverno-values.yaml. This file defines the Kubernetes RBAC permissions for Kyverno’s Admission Controller to read Amazon EKS node metadata because the default Kyverno (v. 1.13 onwards) settings don’t allow this:

cat > kyverno-values.yaml << EOF
admissionController:
  rbac:
    clusterRole:
      extraResources:
        - apiGroups:
            - ""
          resources:
            - "nodes"
          verbs:
            - get
            - list
            - watch
EOF

After this file is created, you can install Kyverno using helm and providing the values file created in the previous step:

helm repo add kyverno https://kyverno.github.io/kyverno/
helm repo update

helm install kyverno kyverno/kyverno \
  -n kyverno \
  --create-namespace \
  --version 3.3.7 \
  -f kyverno-values.yaml

Starting with Kyverno v 1.13, the Admission Controller is configured to ignore the AdmissionReview requests for pods in binding state. This needs to be changed by editing the Kyverno ConfigMap:

kubectl -n kyverno edit configmap kyverno

The kubectl edit command uses the default editor configured in your environment (in our case Linux VIM).

This will open the ConfigMap in a text editor.

As highlighted in the following screenshot, [Pod/binding,*,*] should be removed from the resourceFilters field for the Kyverno Admission Controller to process AdmissionReview requests for pods in binding state.

Kubernetes YAML configuration detailing Kyverno policy resource filters and cluster roles

If Linux VIM is your default editor, you can delete the entry using VIM command 18x, meaning delete (or cut) 18 characters from the current cursor position. Save the modified configuration using the VIM command :wq, meaning write (or save) the file and quit.

After deleting, the resourceFilters field should look similar to the following screenshot.

Kubernetes YAML configuration with ReplicaSet resource filter highlighted for Kyverno policy management

If you have a different editor configured in your environment, follow the appropriate steps to achieve a similar outcome.

Configure Kyverno policy

You need to configure the policy that will make the pods rack aware. This policy is adapted from the suggested approach in the Kyverno blog post, Assigning Node Metadata to Pods. Create a new file with the name kyverno-inject-node-az-id.yaml:

cat > kyverno-inject-node-az-id.yaml  << EOF
apiVersion: kyverno.io/v2beta1
kind: ClusterPolicy
metadata:
  name: inject-node-az-id
spec:
  background: false
  rules:
    - name: inject-node-az-id
      match:
        any:
        - resources:
            kinds:
            - Pod/binding
      context:
      - name: node
        variable:
          jmesPath: request.object.target.name
          default: ''
      - name: node_az_id
        apiCall:
          urlPath: "/api/v1/nodes/{{node}}"
          jmesPath: "metadata.labels.\"topology.k8s.aws/zone-id\" || 'empty'"
      mutate:
        patchStrategicMerge:
          metadata:
            annotations:
              node_az_id: "{{ node_az_id }}"
EOF

It instructs Kyverno to watch for pods in binding state. After Kyverno receives the AdmissionReview request for a pod, it sets the variable node to the name of the node to which the pod is being bound. It also sets another variable node_az_id to the Availability Zone ID by calling the Kubernetes API /api/v1/nodes/node to get the node metadata label topology.k8s.aws/zone-id. Finally, it defines a mutate rule to inject the obtained AZ ID into the pod’s metadata as an annotation node_az_id.
After you’ve created the file, apply the policy using the following command:

kubectl apply -f kyverno-inject-node-az-id.yaml

Deploy a pod without rack awareness

Now let’s visualize the problem statement. To do this, connect to one of the EKS pods and check how it interacts with the MSK cluster when you run a Kafka consumer from the pod.

First, get the bootstrap string of the MSK cluster. Look up the Amazon Resource Names (ARNs) of the MSK cluster:

MSK_CLUSTER_ARN=$(
    aws kafka list-clusters \
      --query 'ClusterInfoList[?ClusterName==`'${MSK_CLUSTER_NAME}'`].ClusterArn' \
      --output text)
export MSK_CLUSTER_ARN

Using the cluster ARN, you can get the bootstrap string with the following command:

BOOTSTRAP_SERVER_LIST=$(
    aws kafka get-bootstrap-brokers \
        --cluster-arn "${MSK_CLUSTER_ARN}" \
        --query 'BootstrapBrokerStringSaslIam' \
        --output text)
export BOOTSTRAP_SERVER_LIST

Create a new file named kafka-no-az.yaml:

cat > kafka-no-az.yaml << EOF
apiVersion: v1
kind: Namespace
metadata:
 name: kafka-ns
---
apiVersion: v1
kind: ServiceAccount
metadata:
 name: kafka-sa
 namespace: kafka-ns
automountServiceAccountToken: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kafka-no-az
  namespace: kafka-ns
  labels:
    app: kafka-no-az
  annotations:
    node_az_id: ''
spec:
  replicas: 3
  selector:
    matchLabels:
      app: kafka-no-az
  template:
    metadata:
      labels:
        app: kafka-no-az
    spec:
      serviceAccountName: kafka-sa
      containers:
      - image: bitnami/kafka:3.8.0
        name: kafka-no-az
        command: ["/bin/sh", "-ec", "while :; do echo '.'; sleep 5 ; done"]
        env:
        - name: BootstrapServerString
          value: ${BOOTSTRAP_SERVER_LIST}
        - name: MSK_TOPIC
          value: "MSK-AZ-Aware-Topic"
        - name: KAFKA_HOME
          value: /opt/bitnami/kafka
        - name: KAFKA_BIN
          value: /opt/bitnami/kafka/bin
        - name: KAFKA_CONFIG
          value: /opt/bitnami/kafka/config
        - name: KAFKA_LIBS
          value: /opt/bitnami/kafka/libs
        - name: KAFKA_LOG4J_OPTS
          value: "-Dlog4j.configuration=file:/opt/bitnami/kafka/config/log4j.properties"
        lifecycle:
          postStart:
            exec:
              command: 
              - "sh"
              - "-c"
              - |
                export KAFKA_HOME=/opt/bitnami/kafka
                export KAFKA_BIN=\${KAFKA_HOME}/bin
                export KAFKA_CONFIG=\${KAFKA_HOME}/config
                cat > \${KAFKA_CONFIG}/client.properties << EOF1
                security.protocol=SASL_SSL
                sasl.mechanism=AWS_MSK_IAM
                sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
                sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
                EOF1
                
                cat >> \${KAFKA_CONFIG}/log4j.properties << EOF2
                #
                # Enable logging of Kafka Client to stderr
                #
                log4j.rootLogger=WARN, stderr
                log4j.logger.org.apache.kafka.clients.consumer.internals.AbstractFetch=DEBUG
                log4j.appender.stderr=org.apache.log4j.ConsoleAppender
                log4j.appender.stderr.layout=org.apache.log4j.PatternLayout
                log4j.appender.stderr.layout.ConversionPattern=[%d] %p %m (%c)%n
                log4j.appender.stderr.Target=System.err
                EOF2
                cd \${KAFKA_HOME}/libs
                /usr/bin/curl -sS -L https://github.com/aws/aws-msk-iam-auth/releases/download/v2.2.0/aws-msk-iam-auth-2.2.0-all.jar --output \${KAFKA_LIBS}/aws-msk-iam-auth-2.2.0-all.jar
EOF

This pod manifest doesn’t make use of the Availability Zone ID injected into the metadata annotation and hence doesn’t add client.rack to the client.properties configuration.

Deploy the pods using the following command:

kubectl apply -f kafka-no-az.yaml

Run the following command to confirm that the pods have been deployed and are in the Running state:

kubectl -n kafka-ns get pods

Select a pod id from the output of the previous command, and connect to it using:

kubectl -n kafka-ns exec -it POD_ID -- sh

Run the Kafka consumer:

"${KAFKA_BIN}"/kafka-console-consumer.sh \
  --bootstrap-server  "${BootstrapServerString}" \
  --consumer.config  "${KAFKA_CONFIG}"/client.properties \
  --topic "${MSK_TOPIC}" \
  --from-beginning /tmp/non-rack-aware-consumer.log 2>&1 &

This command will dump all the resulting logs into the file, non-rack-aware-consumer.log. There’s a lot of information in those logs, and we encourage you to open them and take a deeper look. Next, examine the EKS pod in action. To do this, run the following command to tail the file to view fetch request results to the MSK cluster. You’ll notice a handful of meaningful logs to review as the consumer access various partitions of the Kafka topic:

grep -E "DEBUG.*Added read_uncommitted fetch request for partition MSK-AZ-Aware-Topic-[0-9]+" /tmp/rack-aware-consumer.log | tail -5

Observe your log output, which should look similar to the following:

[2025-03-12 23:59:05,308] DEBUG [Consumer clientId=console-consumer, groupId=console-consumer-24102] Added read_uncommitted fetch request for partition MSK-AZ-Aware-Topic-3 at position FetchPosition{offset=100, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[b-2.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 2 rack: use1-az6)], epoch=0}} to node b-2.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 2 rack: use1-az6) (org.apache.kafka.clients.consumer.internals.AbstractFetch)
[2025-03-12 23:59:05,308] DEBUG [Consumer clientId=console-consumer, groupId=console-consumer-24102] Added read_uncommitted fetch request for partition MSK-AZ-Aware-Topic-0 at position FetchPosition{offset=83, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[b-2.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 2 rack: use1-az6)], epoch=0}} to node b-2.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 2 rack: use1-az6) (org.apache.kafka.clients.consumer.internals.AbstractFetch)
[2025-03-12 23:59:05,542] DEBUG [Consumer clientId=console-consumer, groupId=console-consumer-24102] Added read_uncommitted fetch request for partition MSK-AZ-Aware-Topic-5 at position FetchPosition{offset=100, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[b-1.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 1 rack: use1-az4)], epoch=0}} to node b-1.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 1 rack: use1-az4) (org.apache.kafka.clients.consumer.internals.AbstractFetch)
[2025-03-12 23:59:05,542] DEBUG [Consumer clientId=console-consumer, groupId=console-consumer-24102] Added read_uncommitted fetch request for partition MSK-AZ-Aware-Topic-2 at position FetchPosition{offset=107, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[b-1.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 1 rack: use1-az4)], epoch=0}} to node b-1.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 1 rack: use1-az4) (org.apache.kafka.clients.consumer.internals.AbstractFetch)
[2025-03-12 23:59:05,720] DEBUG [Consumer clientId=console-consumer, groupId=console-consumer-24102] Added read_uncommitted fetch request for partition MSK-AZ-Aware-Topic-4 at position FetchPosition{offset=84, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[b-3.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 3 rack: use1-az2)], epoch=0}} to node b-3.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 3 rack: use1-az2) (org.apache.kafka.clients.consumer.internals.AbstractFetch)
[2025-03-12 23:59:05,720] DEBUG [Consumer clientId=console-consumer, groupId=console-consumer-24102] Added read_uncommitted fetch request for partition MSK-AZ-Aware-Topic-1 at position FetchPosition{offset=85, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[b-3.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 3 rack: use1-az2)], epoch=0}} to node b-3.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 3 rack: use1-az2) (org.apache.kafka.clients.consumer.internals.AbstractFetch)
[2025-03-12 23:59:05,811] DEBUG [Consumer clientId=console-consumer, groupId=console-consumer-24102] Added read_uncommitted fetch request for partition MSK-AZ-Aware-Topic-3 at position FetchPosition{offset=100, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[b-2.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 2 rack: use1-az6)], epoch=0}} to node b-2.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 2 rack: use1-az6) (org.apache.kafka.clients.consumer.internals.AbstractFetch)
[2025-03-12 23:59:05,811] DEBUG [Consumer clientId=console-consumer, groupId=console-consumer-24102] Added read_uncommitted fetch request for partition MSK-AZ-Aware-Topic-0 at position FetchPosition{offset=83, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[b-2.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 2 rack: use1-az6)], epoch=0}} to node b-2.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 2 rack: use1-az6) (org.apache.kafka.clients.consumer.internals.AbstractFetch)

You’ve now connected to a specific pod in the EKS cluster and run a Kafka consumer to read from the MSK topic without rack awareness. Remember that this pod is running within a single Availability Zone.

Reviewing the log output, you find rack: values as use1-az2, use1-az4, and use1-az6 as the pod makes calls to different partitions of the topic. These rack values represent the Availability Zone IDs that our brokers are running within. This means that our EKS pod is creating networking connections to brokers across three different Availability Zones, which would be accruing networking charges in our account.

Also notice that you have no way to check which node, and therefore Availability Zone, this EKS pod is running in. You can observe in the logs that it’s calling to MSK brokers in different Availability Zones, but there is no way to know which broker is in the same Availability Zone as the EKS pod you’ve connected to. Delete the deployment when you’re done:

kubectl -n kafka-ns delete -f kafka-no-az.yaml

Deploy a pod with rack awareness

Now that you have experienced the consumer behavior without rack awareness, you need to inject the Availability Zone ID to make your pods rack-aware.

Create a new file named kafka-az-aware.yaml:

cat > kafka-az-aware.yaml << EOF
apiVersion: v1
kind: Namespace
metadata:
 name: kafka-ns
---
apiVersion: v1
kind: ServiceAccount
metadata:
 name: kafka-sa
 namespace: kafka-ns
automountServiceAccountToken: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kafka-az-aware
  namespace: kafka-ns
  labels:
    app: kafka-az-aware
  annotations:
    node_az_id: ''
spec:
  replicas: 3
  selector:
    matchLabels:
      app: kafka-az-aware
  template:
    metadata:
      labels:
        app: kafka-az-aware
    spec:
      serviceAccountName: kafka-sa
      containers:
      - image: bitnami/kafka:3.8.0
        name: kafka-az-aware
        command: ["/bin/sh", "-ec", "while :; do echo '.'; sleep 5 ; done"]
        env:
        - name: BootstrapServerString
          value: ${BOOTSTRAP_SERVER_LIST}
        - name: MSK_TOPIC
          value: "MSK-AZ-Aware-Topic"
        - name: KAFKA_HOME
          value: /opt/bitnami/kafka
        - name: KAFKA_BIN
          value: /opt/bitnami/kafka/bin
        - name: KAFKA_CONFIG
          value: /opt/bitnami/kafka/config
        - name: KAFKA_LIBS
          value: /opt/bitnami/kafka/libs
        - name: KAFKA_LOG4J_OPTS
          value: "-Dlog4j.configuration=file:/opt/bitnami/kafka/config/log4j.properties"
        - name: NODE_AZ_ID
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['node_az_id']
        lifecycle:
          postStart:
            exec:
              command: 
              - "sh"
              - "-c"
              - |
                export KAFKA_HOME=/opt/bitnami/kafka
                export KAFKA_BIN=\${KAFKA_HOME}/bin
                export KAFKA_CONFIG=\${KAFKA_HOME}/config
                cat > \${KAFKA_CONFIG}/client.properties << EOF1
                security.protocol=SASL_SSL
                sasl.mechanism=AWS_MSK_IAM
                sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
                sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
                EOF1
                if [ \$NODE_AZ_ID ]
                then
                  echo "client.rack=\$NODE_AZ_ID" >> \${KAFKA_CONFIG}/client.properties
                fi
                
                cat >> \${KAFKA_CONFIG}/log4j.properties << EOF2
                #
                # Enable logging of Kafka Client to stderr
                #
                log4j.rootLogger=WARN, stderr
                log4j.logger.org.apache.kafka.clients.consumer.internals.AbstractFetch=DEBUG
                log4j.appender.stderr=org.apache.log4j.ConsoleAppender
                log4j.appender.stderr.layout=org.apache.log4j.PatternLayout
                log4j.appender.stderr.layout.ConversionPattern=[%d] %p %m (%c)%n
                log4j.appender.stderr.Target=System.err
                EOF2
                
                /usr/bin/curl -sS -L https://github.com/aws/aws-msk-iam-auth/releases/download/v2.2.0/aws-msk-iam-auth-2.2.0-all.jar --output \${KAFKA_LIBS}/aws-msk-iam-auth-2.2.0-all.jar
EOF

As you can observe, the pod manifest defines an environment variable NODE_AZ_ID, assigning it the value from the pod’s own metadata annotation node_az_id that was injected by Kyverno. The manifest then uses the pod’s postStart lifecycle script to add client.rack into the client.properties configuration, setting it equal to the value in the environment variable NODE_AZ_ID.

Deploy the pods using the following command:

kubectl apply -f kafka-az-aware.yaml

Run the following command to confirm that the pods have been deployed and are in the Running state:

kubectl -n kafka-ns get pods

Verify that Availability Zone Ids have been injected into the pods

for pod in $(kubectl -n kafka-ns get pods --field-selector=status.phase==Running -o=name | grep "pod/kafka-az-aware-" | xargs)
do
  kubectl -n kafka-ns get "$pod" -o yaml | grep "node_az_id:"
done

Your output should look similar to:

node_az_id: use1-az2
node_az_id: use1-az4
node_az_id: use1-az6

Or:

AWS CloudShell showing Kafka namespace pods and node assignments in Kubernetes cluster

Select a pod id from the output of the get pods command and shell-in to it.

kubectl -n kafka-ns exec -it POD_ID -- sh

The output of the get $pod command matches the order of results from the get pods command. This matching will help you understand what Availability Zone your pod is running in so you can compare it to log outputs later.

After you’ve connected to your pod, run the Kafka consumer:

"${KAFKA_BIN}"/kafka-console-consumer.sh \
  --bootstrap-server  "${BootstrapServerString}" \
  --consumer.config  "${KAFKA_CONFIG}"/client.properties \
  --topic "${MSK_TOPIC}" \
  --from-beginning /tmp/non-rack-aware-consumer.log 2>&1 &

Similar to before, this command will dump all the resulting logs into the file, rack-aware-consumer.log. You create a new file so there’s no overlap between the Kafka consumers you’ve run. There’s a lot of information in those logs, and we encourage you to open them and take a deeper look. If you want to see the rack awareness of your EKS pod in action, run the following command to tail the file to view fetch request results to the MSK cluster. You can observe a handful of meaningful logs to review here as the consumer access various partitions of the Kafka topic:

grep -E "DEBUG.*Added read_uncommitted fetch request for partition MSK-AZ-Aware-Topic-[0-9]+" /tmp/rack-aware-consumer.log | tail -5

Observe your log output, which should look similar to the following:

[2025-03-13 00:47:51,695] DEBUG [Consumer clientId=console-consumer, groupId=console-consumer-86303] Added read_uncommitted fetch request for partition MSK-AZ-Aware-Topic-5 at position FetchPosition{offset=527, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[b-1.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 1 rack: use1-az4)], epoch=0}} to node b-2.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 2 rack: use1-az6) (org.apache.kafka.clients.consumer.internals.AbstractFetch)
[2025-03-13 00:47:51,695] DEBUG [Consumer clientId=console-consumer, groupId=console-consumer-86303] Added read_uncommitted fetch request for partition MSK-AZ-Aware-Topic-4 at position FetchPosition{offset=509, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[b-3.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 3 rack: use1-az2)], epoch=0}} to node b-2.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 2 rack: use1-az6) (org.apache.kafka.clients.consumer.internals.AbstractFetch)
[2025-03-13 00:47:51,695] DEBUG [Consumer clientId=console-consumer, groupId=console-consumer-86303] Added read_uncommitted fetch request for partition MSK-AZ-Aware-Topic-3 at position FetchPosition{offset=527, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[b-2.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 2 rack: use1-az6)], epoch=0}} to node b-2.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 2 rack: use1-az6) (org.apache.kafka.clients.consumer.internals.AbstractFetch)
[2025-03-13 00:47:51,695] DEBUG [Consumer clientId=console-consumer, groupId=console-consumer-86303] Added read_uncommitted fetch request for partition MSK-AZ-Aware-Topic-2 at position FetchPosition{offset=522, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[b-1.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 1 rack: use1-az4)], epoch=0}} to node b-2.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 2 rack: use1-az6) (org.apache.kafka.clients.consumer.internals.AbstractFetch)
[2025-03-13 00:47:51,695] DEBUG [Consumer clientId=console-consumer, groupId=console-consumer-86303] Added read_uncommitted fetch request for partition MSK-AZ-Aware-Topic-1 at position FetchPosition{offset=533, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[b-3.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 3 rack: use1-az2)], epoch=0}} to node b-2.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 2 rack: use1-az6) (org.apache.kafka.clients.consumer.internals.AbstractFetch)
[2025-03-13 00:47:51,695] DEBUG [Consumer clientId=console-consumer, groupId=console-consumer-86303] Added read_uncommitted fetch request for partition MSK-AZ-Aware-Topic-0 at position FetchPosition{offset=520, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[b-2.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 2 rack: use1-az6)], epoch=0}} to node b-2.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 2 rack: use1-az6) (org.apache.kafka.clients.consumer.internals.AbstractFetch)

For each log line, you can now observe two rack: values. The first rack: value shows the current leader, the second rack: shows the rack that is being used to fetch messages.

For example, look at MSK-AZ-Aware-Topic-5. The leader is identified as rack: use1-az4, but the fetch request is sent to use1-az6 as indicated by to node b-2.mskazaware.hxrzlh.c6.kafka.us-east-1.amazonaws.com:9098 (id: 2 rack: use1-az6) (org.apache.kafka.clients.consumer.internals.AbstractFetch)

You’ll notice something similar in all other log lines. The fetch is always to the broker in use1-az6, which maps to our expectation, given the pod we connected to was in this Availability Zone.

Congratulations! You’re consuming from the closest replica on Amazon EKS.

Clean Up

Delete the deployment when finished:

kubectl -n kafka-ns delete -f kafka-az-aware.yaml

To delete the EKS Pod Identity association:

eksctl delete podidentityassociation \
--cluster ${EKS_CLUSTER_NAME} \
--namespace kafka-ns \
--service-account-name kafka-sa

To delete the IAM policy:

aws iam delete-policy \
  --policy-arn arn:aws:iam::"${AWS_ACCOUNT}":policy/MSK-AZ-Aware-Policy

To delete the EKS cluster:

eksctl delete cluster -n ${EKS_CLUSTER_NAME} --disable-nodegroup-eviction

If you followed along with this post using the Amazon MSK Data Generator, be sure to delete your deployment so it’s no longer attempting to generate and send data after you delete the rest of your resources.

Clean up will depend on which deployment option you used. To read more about the deployment options and the resources created for the Amazon MSK Data Generator, refer to Getting Started in the GitHub repository.

Creating an MSK cluster was a prerequisite of this post, and if you’d like to clean up the MSK cluster as well, you can use the following command:

aws kafka delete-cluster --cluster-arn "${MSK_CLUSTER_ARN}"

There is no additional cost to using AWS CloudShell, but if you’d like to delete your shell, refer to the Delete a shell session home directory in the AWS CloudShell User Guide.

Conclusion

Apache Kafka nearest replica fetching, or rack awareness, is a strategic cost-optimization technique. By implementing it for Amazon MSK consumers on Amazon EKS, you can significantly reduce cross-zone traffic costs while maintaining robust, distributed streaming architectures. Open source tools such as Kyverno can simplify complex configuration challenges and drive meaningful savings.The solution we’ve demonstrated provides a powerful, repeatable approach to dynamically injecting Availability Zone information into Kubernetes pods, optimize Kafka consumer routing, and minimize reduce transfer costs.

Additional resources

To learn more about rack awareness with Amazon MSK, refer to Reduce network traffic costs of your Amazon MSK consumers with rack awareness.

About the authors

Austin Groeneveld is a Streaming Specialist Solutions Architect at Amazon Web Services (AWS), based in the San Francisco Bay Area. In this role, Austin is passionate about helping customers accelerate insights from their data using the AWS platform. He is particularly fascinated by the growing role that data streaming plays in driving innovation in the data analytics space. Outside of his work at AWS, Austin enjoys watching and playing soccer, traveling, and spending quality time with his family.

Farooq Ashraf is a Senior Solutions Architect at AWS, specializing in SaaS, Generative AI, and MLOps. He is passionate about blending multi-tenant SaaS concepts with Cloud services to innovate scalable solutions for the digital enterprise, and has several blog posts, and workshops to his credit.

Automate data lineage in Amazon SageMaker using AWS Glue Crawlers supported data sources

2025-07-30 Mohit Dawar

Post Syndicated from Mohit Dawar original https://aws.amazon.com/blogs/big-data/automate-data-lineage-in-amazon-sagemaker-using-aws-glue-crawlers-supported-data-sources/

The next generation of Amazon SageMaker is the center for all your data, analytics, and AI. Bringing together widely adopted Amazon Web Services (AWS) machine learning (ML) and analytics capabilities, it delivers an integrated experience for analytics and AI with unified access to all your data. From Amazon SageMaker Unified Studio, a single data and AI development environment, you can access your data and use a suite of powerful tools for data processing, SQL analytics, model development, training and inference, and generative AI development.

With data lineage, now part of Amazon SageMaker Catalog, you can centralize lineage metadata of your data assets in a single place. You can track the flow of data over time, determining a clear understanding of where it originated, how it has changed, and its usage across the business. By providing this level of transparency, data lineage helps data consumers gain trust that the data is correct and compliant for their use cases. With data lineage captured at the table, column, and job level, data producers can conduct impact analysis of changes in their data pipelines and respond to data issues when needed, for example, when a column in the resulting dataset is missing the quality required by the business.

Data lineage is a powerful tool that can transform how organizations understand and manage their data flows. In this post, we explore its real-world impact through the lens of an ecommerce company striving to boost their bottom line.

To illustrate this practical application, we walk you through how you can use the prebuilt integration between SageMaker Catalog and AWS Glue crawlers to automatically capture lineage for data assets stored in Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB. Using this workflow, you can capture lineage automatically from additional data sources using AWS Glue crawlers. Refer to the Data lineage support matrix in the SageMaker Unified Studio User Guide for supported sources. We also use SageMaker Unified Studio to navigate these data assets and learn about their origin, transformations, and dependencies, thanks to the lineage metadata captured using the AWS Glue crawlers.

Key features of the SageMaker Catalog lineage graph

In SageMaker Unified Studio, you can explore and discover data assets of your organization suited for your use case. As you dive into these data assets, you can learn more about its business context, schema, quality, and lineage. When you decide to work with a subset of these assets, you can subscribe to them in a self-service fashion and start working with them. For more detail, visit Data discovery, subscription, and consumption in the SageMaker Unified Studio User Guide.

SageMaker Studio provides a visual lineage graph that shows how a data asset has evolved from its source through transformations to its final state. This helps data scientists, engineers, and analysts answer key questions such as:

Where did this data come from?
What transformations has it gone through?
Which downstream assets will be impacted by a change?

With this level of visibility, teams can perform faster impact analysis, find the root cause of data quality issues, and ensure models are built on trusted data. It also supports better collaboration so users can confidently use and share data across the organization. The following screenshot shows how SageMaker Unified Studio visualizes data lineage, making it straightforward to trace data flow and understand dependencies.

Column-level lineage – You can expand column-level lineage when available in dataset nodes. This automatically shows relationships with upstream or downstream dataset nodes if source column information is available.
Column search – If the dataset has more than 10 columns, the node presents pagination to navigate to columns not initially presented. To quickly view a particular column, you can search on the dataset node that lists only the searched column.
Details pane – Each lineage node captures and displays the following details:
- Every dataset node has three tabs: LINEAGE, SCHEMA, and HISTORY. The HISTORY tab lists the different versions of lineage event captured for that node.
- The job node has a details pane to display job details with the tabs Job info and History. The details pane also captures queries or expressions run as part of the job.
View dataset nodes only – If you want to filter out the job nodes, you can choose the open view control icon in the graph viewer and toggle the display dataset nodes only, which will remove all the job nodes from the graph and let you navigate only the dataset nodes.
Version tabs – All lineage nodes in Amazon DataZone data lineage will have versioning, captured as history, based on lineage events captured. You can view lineage at a selected timestamp that opens a new tab on the lineage page to help compare or contrast between the different timestamps.

You can try some of these features as you explore the data assets of this post. To learn more on data lineage in SageMaker, we encourage you to dive deep into the Data lineage in Amazon SageMaker Unified Studio.

Solution overview

Imagine a scenario where an ecommerce company aims to optimize conversion rates and enhance customer experience by gaining deeper insights into the customer journey. They need to connect the dots between user interactions and actual purchases, but with data scattered across multiple sources, where do they begin? This is where data lineage becomes invaluable. To perform their analysis, they need data from two primary sources:

Clickstream data stored in Amazon S3 (in JSON or Parquet format)
Transactional order data stored as items in Amazon DynamoDB

To make these datasets discoverable across the business, you need to:

Create a project in SageMaker Unified Studio that will be used to source and manage the datasets
Enable data lineage capture in the SageMaker Unified Studio project
Set up the resources for this use case, which includes an AWS Glue data source (set up in SageMaker Unified Studio) and AWS Glue crawler (set up in AWS Glue)
Run the AWS Glue crawler to catalog the datasets in AWS Glue Data Catalog
Source the metadata of the data assets into the SageMaker Catalog by running the data source
Use SageMaker Unified Studio to navigate through the lineage of the data assets and visualize their origin
Understand how schema evolution is captured in the data asset’s lineage

Prerequisites

To complete the steps on this post, you need an SageMaker Unified Studio domain already deployed in your AWS account. To get started quickly in a testing environment, we suggest creating your SageMaker domain using the quick setup option as explained in Create an Amazon SageMaker Unified Studio domain – quick setup.

Solution steps

To capture data lineage for AWS Glue tables managed with AWS Glue crawlers using SageMaker Unified Studio, complete the steps in the following sections.

Set up a SageMaker project with SQL capability

In SageMaker Unified Studio, a project profile defines an uber template for projects in your Amazon SageMaker unified domain. By setting up a project with the right tooling (project profile), you will provision resources you can use to work with data, which might include cataloging it in SageMaker, transforming it into new data assets, analyzing it to drive business value, or even use it for ML or AI applications.

To demonstrate data lineage effectively, we use SageMaker SQL analytics project profile for a streamlined setup. Although this profile offers comprehensive data analytics capabilities, we focus specifically on two key components:

AWS Glue database – A lakehouse for storing and managing technical metadata
Data source job – Automatically collects and tracks metadata into SageMaker Catalog

We’ve chosen this profile to bypass complex manual configurations so we can focus on the core concepts of data lineage.

To create a new project in your SageMaker domain using the SQL analytics project profile, follow the steps detailed in SQL analytics project profile. Keep all default configurations when creating the project.

After creating your project in SageMaker Studio, you’ll unlock powerful data lineage capabilities that make tracking and understanding your data flows intuitive. Through the data sourcing feature, you can easily monitor how data moves from source to the AWS Glue database. This visibility becomes particularly valuable when debugging data issues—you can quickly trace data back to its source, understand how changes impact downstream processes, and identify affected analyses or reports. Next, populate the AWS Glue database with sample data to observe these features in action and demonstrate how they can streamline your data operations.

For further guidance on how to access the details of the new SageMaker project, refer to Get project details. After you access the data source details, in the Database name field, take note of the AWS Glue database name associated to the SageMaker project.

Enable data lineage capture in the SageMaker project’s data source

To enable lineage capture, follow these steps:

Expand the Actions menu, then choose Edit data source.
Go to the connections and select Import data lineage to configure lineage capture from the source, as shown in the following screenshot.
Make other changes to the data source fields as desired, then choose Save.

Enabling lineage will make sure the data source job will capture lineage in the next run.

Deploy resources for the use case

Follow these steps:

To deploy the resources required for this post, download the AWS CloudFormation template amazon-datazone-examples in the AWS Samples GitHub repository. Deploy it in your AWS account.

For further guidance on how to deploy a CloudFormation stack, refer to Create a stack from the CloudFormation console. You need to provide a Stack name and the name of the AWS GlueDatabaseName associated to the project of your SageMaker domain, as shown in the following screenshot.

Choose Next.

The template will deploy the following resources:

A S3 bucket with a sample file of clickstream data. The bucket name and location of the file will follow the path pattern s3://ecomm-analytics-<ACCOUNT_ID>-<REGION>/clickstream/<YYYY>/<MM>/<DD>/data.json. The file will contain a sample record with the following structure:

{
    "session_id": "abc123",
    "user_id": "u789",
    "event_type": "product_view",
    "product_id": "prod456",
    "timestamp": "2025-06-04T09:23:12Z"
}

A DynamoDB table with a sample item of order data (transactions). The table will be named OrderTransactionTable. The sample item will have the following structure:

{
    "order_id": "ord789",
    "user_id": "u789",
    "product_id": "prod456",
    "order_total": 79.99,
    "order_timestamp": "2025-06-04T09:27:10Z"
}

An AWS Glue crawler configured to crawl the S3 bucket and DynamoDB table deployed as part of the stack and store the metadata in the AWS Glue database associated to the SageMaker project. You can access the crawler’s details in the AWS console, as shown in the following screenshot.

Two AWS Identity and Access Management (IAM) roles for loading the source data into Amazon S3 and DynamoDB and to use when running the AWS Glue crawler.

Run the AWS Glue crawler

The AWS Glue crawler deployed in the previous step will allow you to capture metadata from the two data sources, Amazon S3 and DynamoDB, and store it in AWS Glue Data Catalog, specifically in the database associated to the SageMaker project. After the metadata is stored, it will be accessible to SageMaker.

Before running the crawler, you need to provide AWS Lake Formation permissions to the IAM role that the AWS Glue crawler will use to interact with your data source and target AWS Glue database. The following command will grant the permissions needed for the crawler to store metadata into the AWS Glue database of the SageMaker project.

To invoke this command, we recommend using AWS CloudShell on the AWS console as explained in AWS CloudShell Concepts. Update the <REGION>, <ACCOUNT_ID> and <GLUE_DATABASE_NAME> placeholders with the right values for your AWS Region, AWS account ID, and name of the AWS Glue database associated to the SageMaker project.

aws lakeformation grant-permissions \
  --region  \
  --principal DataLakePrincipalIdentifier=arn:aws:iam:::role/glue-crawler-role \ 
  --permissions CREATE_TABLE \
  --resource '{ "Database": { "Name": "" } }'

Next, run the AWS Glue Crawler on the AWS console. After the crawler successfully finishes, two new tables, clickstream and ordertransactiontable, will be created in the AWS Glue database associated to the SageMaker project. Refer to Viewing crawler results and details to learn more about AWS Glue crawler results.

Source metadata from the AWS Glue database into SageMaker

To source metadata from data assets in the AWS Glue database, including their lineage, into SageMaker, use the data source that was deployed as part of the SageMaker project creation.

To run the data source, go to the data source details page.
Choose Run. (Data sources can be scheduled to run as well, however, for this demonstration we trigger a manual run).

After the data source run is complete, metadata from both data assets in the AWS Glue database will be imported into the SageMaker domain as the project’s inventory assets. You can find the details of the data source run from within SageMaker Unified Studio, which include:

The data assets from the AWS Glue database that were ingested into SageMaker.
The status of the data lineage import for each data asset, which includes an event ID for traceability. This lineage event ID can be used to debug inconsistencies in the resulting lineage graph. You can use the GetLineageEvent API to retrieve the raw payload of the lineage event.

Visualizing the data lineage graph of the data assets in SageMaker Unified Studio

With SageMaker Unified Studio, you have a single place to manage and discover data assets. When accessing a data asset published in the SageMaker central catalog or in your project’s own inventory, you can dive into the asset’s metadata, which includes its schema, business description, custom metadata forms, quality, lineage, and more. To visualize the lineage graph of each data asset of this post, follow these steps:

In SageMaker Studio, navigate to the Assets section of the SageMaker project details page and choose INVENTORY
Select the asset that you want to explore. You can also access the asset directly from the data source run by selecting the asset name.
To view the lineage graph of the data asset up to its origin, shown in the following screenshots, choose the LINEAGE tab.
- For clickstream table (Sourced from S3)

- For order transactions table (Sourced from DynamoDB)

With lineage, you can now confirm that the data originated from sources such as Amazon S3 and Amazon DynamoDB and understand how it has been transformed along the way. Because of this end-to-end visibility, you can trust the data, make informed decisions, and provide compliance with confidence. The lineage graph captures essential metadata that forms the foundation of lineage tracking.

This includes table schemas, column definitions and their data types.
Column-level lineage becomes particularly powerful in this context. Imagine your clickstream’s AWS Glue table powers an Amazon QuickSight dashboard analyzing customer purchase patterns and notice discrepancies in your revenue reports. With column lineage, you can instantly trace the source of those columns.
This granular visibility not only accelerates debugging but also proves invaluable during schema changes, as we show in the following section by changing the source schema.
The crawler details such as crawlerRunId (present in the source identifier of the lineage node) and crawler start and end times can be used to debug which crawler runs updated the table.

Understanding your data asset’s schema evolution through lineage in SageMaker Unified Studio

Imagine the order transactions source in DynamoDB was updated with new information. Because this source powers an Amazon QuickSight report for the customer using the AWS Glue database table, it’s important for consumers to know what changes in the data pipeline updated the report.

Edit the DynamoDB table item with additional columns to learn how lineage graph can be used to view historical updates:

{
    "order_id": "ord789",
    "user_id": "u789",
    "product_id": "prod456",
    "order_total": 79.99,
    "order_timestamp": "2025-06-04T09:27:10Z",
	"customerSegment": "new-customer",
    "conversionSource": "primeDayEmailCampaign"
}

Enter the OrderTransactionsCrawler Glue crawler again on the AWS console. After completion, you’ll notice that it updated the ordertransactiontable AWS Glue table, as shown in the following screenshot.

Run again the data source associated to the project in SageMaker Unified Studio to import the latest metadata into the SageMaker Catalog. After completion, you’ll notice the data source updated the ordertransactiontable data asset in the SageMaker Catalog, as shown in the following screenshot.

This section explores how lineage can be useful to track the updates.

Navigate to the ordertransactiontable data asset in SageMaker Catalog by selecting it from the data source run and choose the LINEAGE tab, as shown in the following screenshot.

Notice how the new columns are available in the lineage graph. A new crawler run ID is present as the source identifier of the crawler lineage node. The history tab shows multiple crawler runs. You can navigate to check the state of the system during the first run.

Cleanup

After you’re done, we recommend to cleaning up the resources created for this post to avoid unintended charges:

Delete the inventory assets that were cataloged in the SageMaker project’s inventory, as explained in Delete an Amazon SageMaker Unified Studio asset.
Delete the SageMaker project that was created as part of this post, as explained in Delete a project.
Delete the CloudFormation stack that was deployed as part of this post, as explained in Delete a stack from the CloudFormation console.
The S3 bucket created as part of the CloudFormation stack will remain after its deletion because it contains a data file in it. Empty and delete the bucket, as explained in Deleting a general purpose bucket.

Conclusion

In this post, you were able to explore the data lineage capabilities of Amazon SageMaker, specifically when working with AWS Glue crawlers. You learned how you can set up an AWS Glue crawler to infer metadata from data assets in multiple sources such as Amazon S3 and DynamoDB and store it the AWS Glue Data Catalog. You also imported this metadata, including data lineage, into Amazon SageMaker through the data source capability of a SageMaker project. Finally, you explored the resulting lineage graph of data assets in SageMaker Unified Studio and saw some of the functionalities available to understand the origin path of them, understand how columns are transformed, and what impact looks like when performing changes to any step of the pipeline.We encourage you to now test the capabilities you explored in this post with your own data. By following the pattern presented in this post, many customers have been able to achieve governance of their data lake and lakehouse platforms on top of Amazon SageMaker with data lineage and more.

About the authors

Mohit Dawar is a Senior Software Engineer at Amazon Web Services (AWS) working on Amazon DataZone. Over the past 3 years, he has led efforts around the core metadata catalog, generative AI–powered metadata curation, and lineage visualization. He enjoys working on large-scale distributed systems, experimenting with AI to improve user experience, and building tools that make data governance feel effortless. Connect with him on LinkedIn: Mohit Dawar.

Jose Romero is a Senior Solutions Architect for Startups at Amazon Web Services (AWS) based in Austin, TX, US. He is passionate about helping customers architect modern platforms at scale for data, AI, and ML. As a former senior architect in AWS Professional Services, he enjoys building and sharing solutions for common complex problems so that customers can accelerate their cloud journey and adopt best practices. Connect with him on LinkedIn: Jose Romero.

Troubleshooting Elastic Beanstalk Environments with Amazon Q Developer CLI

2025-07-29 Adarsh Suresh

Post Syndicated from Adarsh Suresh original https://aws.amazon.com/blogs/devops/troubleshooting-elastic-beanstalk-environments-with-amazon-q-developer-cli/

Introduction

Developers working with AWS find AWS Elastic Beanstalk to be an invaluable service which makes it straightforward to deploy and run web applications without worrying about the underlying infrastructure. You simply upload your application code, and Elastic Beanstalk automatically handles the details of capacity provisioning, load balancing, scaling, and monitoring, which allows you to focus on writing code.

With the release of the Amazon Q Developer’s new enhanced CLI agent, we’ve already seen how Q CLI can be used to transform the approach to the software development process.

In addition to software development, developers and DevOps teams may spend most of their time on operational tasks such as deploying and testing their code on multiple environments, including troubleshooting any deployment related failures or application health issues. Q CLI’s new agentic features can be used to significantly simplify this process by helping you identify and resolve operational issues in a more efficient manner.

When troubleshooting Elastic Beanstalk environment issues, Q CLI becomes a go-to companion. When environments show degraded health or deployment failures, developers can use Q CLI to quickly investigate without having to navigate through multiple AWS console pages or parse multiple logs manually.

For instance, when facing deployment failures, you can run q chat to start a new conversation and describe the issue. Q CLI can help analyze instance logs, check environment configurations, and identify misconfigurations in applications. It can pull relevant error messages from Elastic Beanstalk logs and suggest specific fixes based on the error patterns it recognizes.

When dealing with health issues, developers can ask Q CLI to check their environment’s status, resource utilization, and recent events. It can identify if an application is experiencing out of memory problems, connectivity issues, or dependency related failures. Q CLI can also examine application logs to find recurring errors that might be causing health degradation.

What developers appreciate most is how Q CLI connects the dots between different AWS services. If an Elastic Beanstalk environment is having issues because of an underlying Amazon VPC configuration issue or Amazon S3 permission issue, Q CLI can identify these connections and provide holistic solutions.

The time savings are significant – what used to take hours of investigation across multiple AWS console pages now takes minutes with targeted Q CLI queries. This has dramatically improved developers’ ability to maintain healthy environments and quickly resolve issues when they arise.

Below, we’ll take you through some examples of how you can use Q CLI to troubleshoot some of the issues that you may face while managing Elastic Beanstalk environments.

Solution Walkthrough

Prerequisites

If you’d like to follow along on your own machine, please make sure you complete the following prerequisites:

An AWS account with Elastic Beanstalk access
Basic familiarity with Elastic Beanstalk concepts (environments, applications, deployments)
AWS CLI installed and configured with appropriate permissions to access Elastic Beanstalk resources, and collect logs
AWS Q Developer CLI installed and setup
EB CLI installed and setup (optional)
Elastic Beanstalk environments created for troubleshooting

Now let’s dive into troubleshooting specific Elastic Beanstalk issues with Q CLI. All the scenarios below were tested with Amazon Q Developer CLI with a Pro tier subscription as it provides higher request limits, but is not required for the purposes of this demo.

Troubleshooting environment health

Let’s consider an Elastic Beanstalk environment running Node.js 22 AL2023, to which we’re going to deploy a new application version. After deploying a new application version to our Node.js based elastic beanstalk environment, we noticed that its health status changed to a “Warning” state with the following message visible in the environment events:

100% of requests failing with HTTP 5xx errors

Figure 1. EB Dashboard showing the Warning health state, along with the reason for the health status

This event message could be a result of a number of issues, including but not limited to Nodejs application failures, reverse proxy configuration issues, resource utilization issues etc.

Let’s use Q CLI to help us investigate further. We’ll initiate a new conversation with the agent by running q chat, and ask the following question:

Why is my beanstalk environment nodejs-app in us-east-1 unhealthy? Check the logs if required, and recommend steps to resolve the issue

Note that we’ve disabled all confirmations for q chat using the /tools trust-all option as we’re using a development environment, but this is generally not recommended as it can lead to unexpected changes.

As you can see, the Q CLI agent is able to use the AWS CLI tool to describe the environment details, its health status and retrieve the tail logs for further analysis. It then parses through the log file to identify the source of the issue, all without requiring additional prompts.

As we can see from the below image, the Q CLI agent was able to parse the logs and identify that the Nodejs application is running on port 3000, but the Nginx proxy is attempting to establish a connection to the application on port 8080 (which is the default forwarding port for Nodejs based elastic beanstalk environments), resulting in HTTP requests failing with a 502 response.

Figure 2. Q CLI solution for port issue

As requested in the prompt, the Q CLI agent also provides multiple ways to implement the recommended solution, along with specific steps or commands for each option. In this specific case, Q CLI correctly advised us to update the elastic beanstalk environment’s configuration to use port 3000 and shared multiple approaches to apply the recommended changes.

Environment creation failures

Here, we’re trying to create a new Elastic Beanstalk environment in a new VPC, but the environment creation fails with the following error message as we can see in the screenshot below:

The EC2 instances failed to communicate with AWS Elastic Beanstalk, either because of configuration problems with the VPC or a failed EC2 instance. Check your VPC configuration and try launching the environment again.

Figure 3. EB events describing the VPC connectivity issue

Now, let’s ask Q CLI to help us investigate this specific issue. We will issue the following prompt to Q CLI with the environment’s name and region, along with the specific error message that is observed:

The beanstalk environment "Dev-env" in the us-west-2 region failed to launch successfully with the following error:

The EC2 instances failed to communicate with AWS Elastic Beanstalk, either because of configuration problems with the VPC or a failed EC2 instance. Check your VPC configuration and try launching the environment again.

Check the environment configuration and recommend steps to resolve the issue.

Here, Q CLI is able to use the given context to invoke relevant AWS CLI commands to check and verify the elastic beanstalk environment’s configuration, including its underlying resources such as the VPC, subnets, route table, security groups and related resources.

Figure 4. Q CLI identified network configuration issues

After retrieving the required information, Q CLI was able to identify the source of the issue. The subnets configured for the elastic beanstalk environment’s EC2 instances do not have outbound internet access, due to which they are unable to communicate with the AWS service endpoints.

As seen in the following screenshot, Q CLI then goes on to recommend multiple solutions to resolve the issue, specifically highlighting the more secure options recommended for production vs other options that are simpler to manage but may not be as secure.

Figure 5. Q CLI solutions for resolving network configuration issues

We can see how using Q CLI here results in significant time saved during troubleshooting as it quickly and efficiently verifies the relevant underlying resource configurations, hence removing the need for the developer to manually identify and check multiple AWS resource configurations.

Command Execution Failures

In this next scenario, we’re attempting to deploy a python application to an elastic beanstalk environment, using a Python 3.13 based solution stack. We noticed that the deployment fails with the following error message, visible in the environment events:

Command failed on instance. Return code: 1 Output: Engine execution has encountered an error.

Let’s ask Q CLI to help us identify and resolve the issue, with the following prompt:

The deployment to the beanstalk environment "modern-python" in the us-east-1 region failed with the error "Command failed on instance". Check the environment details, and logs if required, and recommend steps to resolve the issu

Here, we see how Q CLI can also help with troubleshooting application or dependencies related issues. By checking the environment events and the tail logs, Q CLI was able to identify the source of the deployment failure due to the “Jinja” package that was specified in the requirements.txt file. It correctly advises us to use a newer version of the “Jinja2” package, which is compatible with Python 3.13.

It also goes on to give us recommendations and steps on testing the changes locally before updating the requirements.txt and creating a new application version to be used for the deployment.

Figure 6. Q CLI identified the reason for command failure and provides solutions

Using EB CLI with Amazon Q Developer CLI

To wrap this up, we will demonstrate the benefits of using Q CLI in your development environment, along with EB CLI.

EB CLI enables developers to deploy applications to Elastic Beanstalk with a simple eb deploy command, handling environment provisioning, artifact packaging, and configuration as code. It integrates with Git for version tracking and supports local testing through eb local run, making it ideal for CI/CD pipelines and iterative development workflows.

In this scenario, we have another application deployment that failed. We will use Q CLI to troubleshoot this issue by initiating a new q chat from the same directory where the application files are located, which also has EB CLI installed and setup using the command eb init.

Figure 7. Q CLI prompt to troubleshoot the python deployment issue

As you can see above, we’ve used the following prompt:

The latest deployment to the beanstalk environment "modern-python" in the us-east-1 region failed, and the environment is in a Degraded health state. Check the environment details and logs if required, and recommend steps to resolve the issue.

Q CLI was able to check the relevant logs and identify the following error causing the deployment failure:

ModuleNotFoundError: no module named ‘app’

Because the q chat conversation was initiated from the directory containing the application files, Q CLI is also able to view my application files, and identify the solution to the problem, suggesting that main python file name is application.py, not app.py, and therefore, the Procfile needs to be updated accordingly.

Figure 8. Q CLI identifies the reason for deployment failure, and recommends updating the Procfile

Finally, because we already have EB CLI initiated in this directory with the application files, we can use Q CLI to automatically make the required changes to the Procfile and update the elastic beanstalk environment, all with just the following natural language prompt:

Update the Procfile with the recommended corrections, and deploy to the beanstalk environment

As seen above, Q CLI is able to update the Procfile with the necessary changes and use the eb deploy EB CLI command to deploy the changes to the elastic beanstalk environment.

These examples demonstrate how Amazon Q Developer’s CLI agent supercharges your operational and troubleshooting tasks throughout the entire development process when used in your CLI environment.

Best Practices for Troubleshooting Elastic Beanstalk with Amazon Q Developer CLI

Be specific in your questions: Include environment name, region, and specific symptoms to help Q CLI provide more targeted assistance.
Allow Amazon Q to access logs: When prompted, allow Amazon Q to retrieve and analyze logs for more accurate troubleshooting.
Implement suggested fixes incrementally: If Amazon Q suggests multiple solutions, implement them one at a time to identify which one resolves the issue.
Use caution with the /tools trust-all flag: This flag bypasses confirmation prompts during the troubleshooting. Review the security considerations and use with caution in production environments.

Cleaning up

If you’ve created any Elastic Beanstalk environments, please terminate them if they’re no longer being used to avoid incurring charges for unused AWS resources.

Conclusion

Amazon Q Developer CLI is a powerful tool for troubleshooting Elastic Beanstalk environments, capable of quickly identifying and helping resolve common issues. By leveraging Q CLI’s ability to analyze logs, check environment status, and provide targeted solutions, you can significantly reduce the time and effort required to troubleshoot Elastic Beanstalk problems.

Try Amazon Q Developer CLI today and see how quickly you can resolve Elastic Beanstalk issues. Transform hours of log parsing and console navigation into minutes of focused problem-solving with Amazon Q Developer CLI. Start with a simple q chat command and let AI-powered assistance transform your operational workflows. Install the CLI agent now and experience firsthand how conversational AI can help you maintain healthier Elastic Beanstalk environments with less effort!