Tag Archives: Technical How-to

Enhance event experiences with a generative AI-powered WhatsApp assistant using AWS End User Messaging

2025-09-23 Richard Perez

Post Syndicated from Richard Perez original https://aws.amazon.com/blogs/messaging-and-targeting/enhance-event-experiences-with-a-generative-ai-powered-whatsapp-assistant-using-aws-end-user-messaging/

Technology conferences and events serve as vital opportunities for innovation, knowledge sharing, and networking in the rapidly evolving technology industry. These gatherings range from large international conventions attracting tens of thousands of attendees, or more specialized conferences focused on specific sectors such as AI, cybersecurity, or a particular industry.

In this post, we share how the AWS Communication Developer Services (CDS) team integrated an AWS End User Messaging Social WhatsApp channel with Amazon Bedrock to launch the AWS Summit Assistant Bot at the AWS Dubai Summit 2025, enhancing the experience of attendees in real-world applications.

AWS Global Summits

AWS Global Summits have become an important event in the technology community, offering invaluable opportunities for professionals to explore the latest cloud computing innovations and best practices. These summits, held in major cities worldwide, bring together developers, engineers, and business leaders to share knowledge, network, and gain hands-on experience with AWS technologies. The events typically feature keynote speeches from AWS executives and industry leaders, providing insights into future trends and strategic directions in cloud computing. Attendees can participate in technical sessions, workshops, and demos that cover a wide range of topics, from artificial intelligence and machine learning (AI/ML) to serverless computing. The impact of these summits extends beyond the events themselves, fostering a global community of cloud practitioners and driving innovation across various industries that rely on cloud technologies.

Attendee experience

Despite the valuable experience these events create, attendees often find navigating their way around these events challenging. Participants frequently find themselves uncertain about which sessions align best with their interests and expertise levels. Locating specific sessions within the venue can be time-consuming, and finding essential venue-specific areas such as quiet rooms or lost property offices has been a persistent pain point.

Users are increasingly reluctant to download single-use mobile apps due to friction like account creation, logins, storage, and app fatigue. Storyly notes that users spend 80% of their app time in their top three apps, with most others abandoned quickly. For event apps specifically, this reluctance stems from the same factors: app fatigue, privacy concerns about sharing personal information, and the time and effort required to set up and navigate yet another platform. Unless the value is immediate and clear, many attendees simply opt out, mirroring the broader trend where users avoid the hassle of downloading apps unless they offer sustained, high-value utility.

These challenges can detract from the overall event experience and can lead to missed opportunities for learning and networking.

How we enhanced the attendee experience

The AWS Summit Assistant Bot, launched at the AWS Dubai Summit (May 2025), offered a seamless solution to these attendee challenges. Attendees simply scanned QR codes that were strategically placed throughout the venue to initiate a WhatsApp chat with the AI-powered assistant. This innovative assistant uses advanced natural language processing to understand attendees’ interests and provide tailored session recommendations. Moreover, it offered real-time guidance on session locations and can direct users to various venue-specific areas, providing a smooth and efficient summit experience.

Solution overview

Let’s examine how the AWS Summit Assistant Bot architecture enables seamless interaction between attendees and summit information systems. The AWS Summit Assistant Bot design allows for real-time processing of attendee messages and response generation using Amazon Bedrock Knowledge Bases for helpful and relevant answers. The following diagram illustrates how various AWS services work together to process attendee messages.

The architecture follows a modular pattern with four key components that enable efficient message processing and analytics capabilities:

Custom metrics – Custom Amazon CloudWatch WhatsApp metrics enable real-time monitoring of engagement events through Amazon Simple Notification Service (Amazon SNS) topic integration. These metrics track message delivery status and read receipts, providing crucial operational and performance insights.
Inbound message processing – This pipeline forms the core functionality, implementing message validation filters for length and message type constraints, managing session state, and handling audio transcription workflows. Validated messages are published to a dedicated SNS topic for downstream consumption.
Response generation – This component uses Amazon Bedrock Knowledge Bases for intelligent message handling, with architecture designed for flexible integration of alternative processing engines. Future iterations of the AWS Summit Assistant Bot will use the Strands Agents SDK and tools.
Questions categorization – This framework provides contextual analytics beyond standard CloudWatch and AWS Lambda insights. This component implements an Amazon DynamoDB based categorization system that works in conjunction with Amazon Bedrock to dynamically classify and track inquiry patterns while maintaining user privacy through personally identifiable information (PII)-free analytics.

Technical implementation

Our serverless, event-driven architecture efficiently handles WhatsApp message processing through a seamless multi-stage workflow. When a WhatsApp message arrives, AWS End User Messaging receives it and immediately publishes it to an SNS topic. From there, the messages are written to Amazon Simple Queue Service (Amazon SQS) queues, which enables controlled and systematic processing. Lambda functions then handle the core business logic, processing these messages and managing interactions with DynamoDB. To generate responses, the system uses Amazon Bedrock Knowledge Bases to create personalized content for each user. Finally, these tailored responses are routed back to users through AWS End User Messaging, completing the WhatsApp communication cycle.

The system implements two parallel processes alongside the main message processing pipeline. The first is a categorization service that processes messages through Amazon SNS and Amazon SQS before using Lambda functions to analyze content against existing DynamoDB category records. This function either increments existing category counters or creates new categories as needed. The second parallel process handles custom CloudWatch metrics, following a similar initial flow through Amazon SNS and Amazon SQS, but employs specialized Lambda functions to extract and record engagement metrics for operational monitoring.

Generative AI integration

The Amazon Bedrock implementation encompasses two core AI capabilities:

A knowledge base retrieval system using OpenSearch vector embeddings and Anthropic’s Claude 3.7 Sonnet model for accurate information retrieval
A real-time message categorization engine that dynamically classifies incoming messages into existing categories or creates new ones based on content analysis

Voice message processing

The voice message handling system implements a sophisticated processing chain. WhatsApp voice messages in OGG format are processed through a Lambda based conversion pipeline using the ffmpeg library. The converted audio is then transcribed using Whisper through Amazon Bedrock Marketplace, chosen for its fast processing and robust multi-language support capabilities.

Security and privacy considerations

Our security-first approach implements multiple layers of protection:

Customer managed key encryption for data at rest and in transit across Amazon SNS, Amazon SQS, and DynamoDB
Minimized PII CloudWatch logging and automatic data cleanup through DynamoDB TTL settings
Amazon Bedrock Guardrails to prevent inappropriate content generation and protect against data loss
Custom logic to prevent resource draining by preventing the ingestion of unreasonably large messages and keeping a short conversation context window

Monitoring and analytics

Monitoring is important for operational purposes but also for understanding what questions users are asking. The solution uses the following components:

A Real-time CloudWatch dashboard for tracking operational metrics such as messages published on SNS topics; WhatsApp messages sent, delivered, and read, Lambda invocations; failures; and Amazon Bedrock metrics
CloudWatch logs for granular analytics using CloudWatch insights such as unique users and number of conversations
Generative AI-powered categorization of users’ questions

Conclusion

The AWS Summit Assistant Bot demonstrates how AWS services can be combined to create practical, generative AI-powered solutions that enhance real-world experiences. This framework can be adapted for various event types and scales, such as tech conferences, trade shows, festivals, campus orientations, and shopping centers.

To learn more about building similar solutions:

Explore the AWS End User Messaging documentation
Review Amazon Bedrock capabilities for natural language processing
Visit the AWS Communication Developer Services blog for more insights
Deploy your own communications hub

By using these AWS services and resources, you can create innovative, AI-powered communication solutions for a wide range of applications.

About the authors

How to accelerate security finding reviews using automated business context validation in AWS Security Hub

2025-09-23 Reetesh Surjani

Post Syndicated from Reetesh Surjani original https://aws.amazon.com/blogs/security/how-to-accelerate-security-finding-reviews-using-automated-business-context-validation-in-aws-security-hub/

Security teams must efficiently validate and document exceptions to AWS Security Hub findings, while maintaining proper governance. Enterprise security teams need to make sure that exceptions to security best practices are properly validated and documented, while development teams need a streamlined process for implementing and verifying compensating controls.

In this blog post, we show you an automated solution that’s ideal for organizations using AWS Security Hub that need to manage security exceptions at scale while maintaining governance controls. It’s particularly valuable for enterprises that have complex compliance requirements and multiple development teams. By implementing this solution, you can accelerate the Security Hub findings review process while maintaining proper security governance and providing clear business context for security exceptions.

Note: The solution in this post is provided as a reference architecture and should not be implemented as-is in production environments. Organizations must thoroughly review, customize, and enhance this solution to align with their specific security requirements, compliance frameworks, governance policies, and risk tolerance. Engage with your security, compliance, and legal teams before deploying this automated security validation solution.

The challenge

Security Hub provides a comprehensive view of your AWS security posture across AWS accounts. However, in real-world scenarios, you’ll encounter legitimate business reasons for exceptions to security best practices. For example:

Amazon GuardDuty not enabled: Because of an alternative monitoring solution, an organization has deferred the implementation of Amazon GuardDuty but requires compensating controls such as Amazon Virtual Private Cloud (VPC) Flow logs, Amazon CloudWatch alarms, and organization-specific incident response procedures.
Amazon S3 Block Public Access not enabled: A marketing team might need a public Amazon Simple Storage Service (Amazon S3) bucket for website assets, but should implement the following compensating controls:

- Amazon CloudFront distributions in front of Amazon S3
- Server-side encryption with AWS KMS keys (SSE-KMS) enabled on the S3 bucket
- Enable Amazon S3 bucket logging
- Enable Amazon S3 bucket versioning
- Amazon CloudWatch alarms for suspicious access patterns and comprehensive access logging

Managing exceptions to security best practices can be challenging and typically involve multiple steps. Security teams spend significant time reviewing exception requests and defining and validating compensating controls, and developers must then implement and validate those controls. Multiple teams must be included to create and manage documentation for compliance and audit purposes. Overall, this process, if done manually, is time intensive, error-prone (with a risk of missing implementation issues), and has a risk of poor visibility because of limited or missing documentation of the business context in the security findings.

Solution prerequisites

For this solution, you must have the following elements in place:

An AWS account with appropriate service quotas for Amazon DynamoDB, AWS Lambda, and Amazon Simple Queue Service (Amazon SQS)
Required AWS Identity and Access Management (IAM) permissions for deployment of various AWS resources including:
- IAM create-role and IAM put-role-policy permission (to create the security team role and developer role)
- AWS CloudFormation stack management
- DynamoDB table creation and management
- Amazon SQS
- AWS Lambda event source mapping with Amazon SQS
- Amazon SQS policy
- Lambda function deployment and configuration
- Lambda execution role
- Amazon EventBridge rule configuration
- Amazon S3 bucket operations for deployment artifacts
AWS Command Line Interface (AWS CLI) version 2.17.44 or later
Python version 3.12 or later
jq JSON processing utility for script operations
Security Hub enabled in your target AWS Region

aws securityhub enable-security-hub

AWS Config is recommended for enhanced validation capabilities

aws configservice put-configuration-recorder \
    --configuration-recorder name=default,roleARN=arn:aws:iam::ACCOUNT_ID:role/aws-service-role/config.amazonaws.com/AWSServiceRoleForConfig

Automated validation

The solution includes a pre-deployment validation script (validate-environment.sh) that automatically verifies the following:

Tool versions and installations
AWS service enablement status
Resource conflicts

This validation runs automatically during deployment (Integrated in deploy.sh script) to help make sure that required prerequisites are met before infrastructure creation begins.

Additional resources

See the Cost Estimation Guide for a detailed pricing breakdown of prerequisites and the Troubleshooting Guide for common setup issues and solutions.

Solution overview

This solution provides sample code and CloudFormation templates that organizations can deploy to automate the validation of compensating controls for suppressed Security Hub findings while maintaining proper segregation of duties between the security and development teams.

Architecture

Figure 1: Solution architecture diagram

Figure 1 illustrates the solution workflow that’s initiated when a developer changes a Security Hub finding’s workflow status to SUPPRESSED to request a business-justified security exception. The process concludes with the solution adding validation results as notes to the respective Security Hub finding, maintaining a complete audit trail of the exception request and validation outcome.

Note: Before initiating this workflow, developers must first consult with their organization’s security team to explain their business justification for the exception. During this initial consultation, the security team defines required compensating controls for the finding type. The security team uses the add-controls-role-based.sh script to add controls to DynamoDB. A developer enables the required compensating controls before proceeding with the workflow status change.

The workflow shown in Figure 1 includes the following steps:

A developer changes the Security Hub finding status to SUPPRESSED.
EventBridge detects the status change to SUPPRESSED.
An EventBridge rule sends an event to the Amazon SQS queue.
A Lambda function retrieves messages from the Amazon SQS queue.
The Lambda function fetches compensating controls from the DynamoDB compensating controls table.
The Lambda function validates each control using the appropriate AWS services APIs.
Evidence is collected for each validation and stored in DynamoDB.
Findings validation results and timestamps are stored in the DynamoDB Findings table.
A versioned history of finding validation attempts is stored in the DynamoDB History table.
If the security team provided controls pass validation, the finding remains SUPPRESSED, and a note is added in the respective Security Hub finding with adjusted severity information (the original severity assigned by Security Hub isn’t changed by this solution). If one of these control fails validation, finding status is changed to NOTIFIED, and a note is added in the respective Security Hub finding of failed controls (the original severity assigned by Security Hub isn’t changed by this solution).
OPTIONAL: Extend the solution with Amazon OpenSearch for SOC teams to perform advanced search, correlation, and visualization of validation evidence across findings, and historical trend analysis of compensating control effectiveness. Use Amazon QuickSight for visualization of compliance metrics, and AWS Security Lake to centralize validation data across multiple accounts and Regions, standardizing it in OCSF format for comprehensive cross-account analysis and long-term compliance reporting.

Note: This solution should be deployed in accordance with your organization’s security policies and the AWS Shared Responsibility Model. Review and test security controls before deploying in production environments.

How it works

This solution is designed exclusively for deployment and management by organizational security teams. Only security teams should have permissions to deploy the AWS CloudFormation stack, modify Lambda validation code, add/modify compensating controls, or access the four DynamoDB tables (Controls, Findings, History, Evidence).

Developers are restricted to two specific actions: suppressing Security Hub findings and reading compensating control requirements. This strict role separation facilitates proper governance and helps prevent bypass of security validation logic. Organizations must implement appropriate IAM policies to enforce these access restrictions in production environments.

Here’s how the solution works:

The security team defines controls: A Security team establishes compensating controls for specific Security Hub finding types and stores them in a DynamoDB table. This helps make sure that approved exceptions follow security-approved guidelines and maintain compliance standards.

Key files for security teams:

File	Purpose
`add-controls-role-based.sh`	Utility script for adding compensating controls
`/templates/findings/*.json`	Example compensating controls for reference
`/docs/guides/compensating-controls.md`	Guide for defining controls

Supported validation Types: The solution supports 13 validation methods to accommodate diverse security requirements:

Validation type	Description	Example use case
`CONFIG_RULE`	Validates using AWS Config rules	For GuardDuty not enabled finding: `vpc-flow-logs-enabled` Config rule helps make sure that network traffic is monitored
`API_CALL`	Validates using direct AWS API calls	For Amazon S3 public access finding: API call to verify CloudFront distribution exists in front of the S3 bucket
`SECURITY_HUB_CONTROL`	Validates using Security Hub control status	For GuardDuty not enabled finding: `CloudTrail.1` control passing confirms comprehensive API logging
`CLOUDWATCH`	Validates using CloudWatch alarms	For GuardDuty not enabled finding: Alarms monitoring for suspicious API calls and network traffic patterns
`CLOUDTRAIL`	Validates CloudTrail configuration	For GuardDuty not enabled finding: Multi-Region CloudTrail with log validation and CloudWatch integration
`SYSTEMS_MANAGER`	Validates using Systems Manager parameters	For GuardDuty not enabled finding: Parameter confirming custom threat detection solution is enabled
`PROCESS_CONTROL`	Validates process-based controls	For GuardDuty not enabled finding: Documented incident response process for network security events
`INSPECTOR`	Validates Amazon Inspector configuration	For vulnerability finding: Inspector EC2 scanning enabled with zero critical findings allowed
`ACCESS_ANALYZER`	Validates AWS IAM Access Analyzer	For IAM permission finding: IAM Access Analyzer enabled with zero active findings allowed
`MACIE`	Validates Amazon Macie configuration	For data protection finding: Macie enabled with sensitive data discovery and zero sensitive buckets allowed
`AUDIT_MANAGER`	Validates AWS Audit Manager frameworks	For compliance finding: Custom security framework active with required control sets
`EVENTBRIDGE`	Validates EventBridge rules	For GuardDuty not enabled finding: Rules monitoring AWS CloudTrail events with Lambda targets for automated response
`TRUSTED_ADVISOR`	Validates AWS Trusted Advisor checks	For security best practice finding: S3 bucket permissions check passing with zero warnings or error resources

Note: Only security team members have access to add or modify compensating controls. The solution enforces this through IAM permissions and runtime checks to maintain proper governance.

Approved security exceptions must have an expiration date to facilitate periodic review. The solution automatically enforces these time limits based on the expiration date defined by the security team.

For this post, we provide a utility script (add-controls-role-based.sh) to demonstrate adding compensating controls. However, in a production enterprise environment, organizations should integrate DynamoDB with their existing governance systems (such as Jira, ServiceNow, and so on) to automatically populate controls from authorized security team sources. This solution focuses on validating controls, not prescribing how they’re ingested.

2. Developers implement controls: When Security Hub findings are suppressed, developers must implement the required compensating controls defined by the security team.

How developers interact with the solution:

View required controls: The solution provides clear requirements for each finding type.
Implement compensating controls: Developers should implement the security team provided compensating controls in their AWS environment, referring to the compensating controls defined by Security team. The specific compensating controls depend on the finding type and security team requirements.
Finding status change: Developers change the Security Hub finding status to SUPPRESSED in Security Hub.
Automatic validation: The solution validates compensating controls when Security Hub findings workflow status is changed.
Status updates: Findings remain SUPPRESSED if controls pass validation; they change to NOTIFIED with failure details if validation fails.

Note: This solution doesn’t modify the original severity of findings in Security Hub. It adds business context with security-approved adjusted severity to findings based on security-approved compensating controls validation, helping security teams make informed decisions.

For this solution, we’re simulating the developer workflow of addressing Security Hub findings by implementing and validating compensating controls. In a production environment, developers would receive notifications about findings that require attention, implement the necessary controls according to security team guidance, and use this validation system to verify their implementations. The solution focuses on the validation aspect but assumes organizations will integrate it with their existing developer workflows, ticketing systems, and continuous integration and delivery (CI/CD) pipelines to create a seamless process from finding detection to remediation verification.

Evidence collection and audit trail

The solution automatically captures comprehensive evidence for each validation activity. The key features of the solution are:

Four-table design: Separate tables for Controls, Findings, History, and Evidence (shown in Figure 2) provide security through segregation while maintaining a complete audit trail

Figure 2: The four-table design for storing compensating controls, evidence, findings, and history

Detailed evidence: Each validation stores specific evidence based on its type—from AWS Config rule compliance details to API responses and process documentation verification
Immutable records: Each evidence includes timestamps, validation context, and results that cannot be modified after collection (shown in Figure 3)

Figure 3: Sample evidence collected for a CONFIG_RULE validation showing PASSED status

Historical tracking: The solution maintains a complete history of each validation attempt, allowing organizations to demonstrate continuous compliance over time

Deployment and configuration

You can deploy the solution using the provided scripts.

Use the following command to clone the repository:

git clone https://github.com/aws-samples/sample-automated-securityhub-validator.git
cd automated-securityhub-validator

Use the following command to check service quotas and to create the security team and developer roles:

cd scripts
./create-roles-quotas-check.sh

Use the following command to assume the security team role:

aws sts assume-role --role-arn arn:aws:iam:: ACCOUNT_ID:role/securityhub-validator-SecurityTeamRole --role-session-name SecurityTeamSession

In the preceding command’s output, note the AccessKeyId, SecretAccessKey, and SessionToken. The timestamp in the expiration field is in the UTC time zone and shows when the IAM role’s temporary credentials expire. After the temporary credentials expire, the user must assume the role again.

Note: For temporary credentials, you can use the DurationSeconds parameter to increase the maximum session duration for IAM roles.

Create environment variables to assume the security team role and verify user assumed the IAM role:

Run the following commands to set the environment variables to assume the IAM role:

export AWS_ACCESS_KEY_ID=RoleAccessKeyID
export AWS_SECRET_ACCESS_KEY=RoleSecretKey
export AWS_SESSION_TOKEN=RoleSessionToken

Note: Replace the example values with the values that you noted when you assumed the IAM role. For Windows (OS, replace export with set.

Run the get-caller-identity command to verify that the user assumed the IAM role:

aws sts get-caller-identity

Note: In the preceding command’s output, confirm that the ARN is arn:aws:sts::ACCOUNT_ID:assumed-role/securityhub-validator-SecurityTeamRole/SecurityTeamSession instead of arn:aws:iam::ACCOUNT_ID:user/username.

Use the following command to deploy the solution:

cd scripts
./deploy.sh

You can verify that the stack has been created by going to the AWS Management Console for CloudFormation and using the following steps:
1. In the CloudFormation console, choose Stacks and then Stack details in the navigation pane.
2. Locate and select the stack securityhub-validator to open its details page.
3. On the stack details page, select the Resources tab.
4. In the Resources section, you’ll see a list of the resources that are part of the stack.

Figure 4: Resources created using the CloudFormation stack

The deployment script creates a CloudFormation stack with the necessary resources:

DynamoDB tables for controls, findings, history, and evidence
A Lambda function for validation and Security Hub updates
An EventBridge rule for capturing finding status changes
An Amazon SQS queue and dead letter queue (DLQ) for message processing
IAM roles with least privilege permissions

Add compensating controls (security team):

cd scripts
./add-controls-role-based.sh

Implement controls (developers).

Now, a developer will assume the developer role and implement the required controls based on the security team’s specifications. The solution automatically validates these implementations when the Security Hub finding workflow status is changed to SUPPRESSED by a developer.

For an example implementations of common controls, see the example of compensating controls for GuardDuty.1 finding.

Test the solution

To test the solution, you can validate the compensating controls for a GuardDuty finding using the following example scenario:

A developer wants a security exception for the Security Hub finding GuardDuty.1: GuardDuty should be enabled, and because of cost constraints, the developer’s organization hasn’t implemented GuardDuty and requested a security exception from their organization’s security team.

Compensating controls provided by the security team include:

Amazon Virtual Private Cloud (Amazon VPC) Flow Logs must be enabled for network monitoring
CloudWatch alarms are enabled to monitor for suspicious activity

Note: To simulate this finding, do not enable GuardDuty so that the GuardDuty should be enabled finding appears in the Security Hub console.

Approximately 20–30 mins after enabling AWS Config and Security Hub, you can locate the finding in the console using the following steps and then add the compensating controls provided by the security team.

For this use case, we’re using the GuardDuty should be enabled Security Hub finding:

Navigate to the AWS Security Hub console and choose Findings in the navigation pane.
In the Add filter search bar at the top, select Severity label and set the is value to HIGH.
After applying the filter, select GuardDuty should be enabled in the Finding column to view its details in the righthand pane.
Choose Actions in the top-right corner and select View JSON.

Figure 5: Security Hub findings

In the JSON details window, locate the SecurityControlId field and note the value. You’ll be prompted to enter it by the add-controls-role-based.sh utility in the next step.

Note: The SecurityControlId value is required by the add-controls-role-based.sh utility to properly associate your compensating control with the correct Security Hub finding.

Figure 6: SecurityControlId from the GuardDuty finding

Use the following command to clone the repository:

git clone https://github.com/aws-samples/sample-automated-securityhub-validator.git
cd sample-automated-securityhub-validator

For this demo, you will act as a member of the security team by assuming security team role and use the add-controls-role-based.sh utility to create compensating controls and push them to the compensating control DynamoDB table.

cd sample-automated-securityhub-validator/scripts
./add-controls-role-based.sh

Use the following prompt values in add-controls-role-based.sh to create compensating control table entries using four compensating controls given by the security team for the GuardDuty.1 finding type:

./add-controls-role-based.sh
Security Team - Compensating Controls Management Utility
--------------------------------------------------------
SECURITY NOTICE: This utility is restricted to security team members only
Validating security team role...
✓ Security team role validated: arn:aws:sts::xxxxxxxxxxx:assumed-role/securityhub-validator-SecurityTeamRole/SecurityTeamSession
Using AWS Region: us-east-1
Using stack: securityhub-validator
Using controls table: securityhub-validator-ControlsTable-ARDQCU67CBCN
Enter finding type (e.g., GuardDuty.1): GuardDuty.1
Security approved adjusted risk level [CRITICAL/HIGH/MEDIUM/LOW/INFORMATIONAL]: MEDIUM
Expiration date (YYYY-MM-DD): 2026-12-31
Ticket reference: JIRA-SEC-1234
Business justification: Alternative monitoring solution provides equivalent detection capabilities
Adding Control #1
Control ID: VPC-FLOW-LOGS
Control description: VPC Flow logs must be enabled for network monitoring 
Validation type [CONFIG_RULE/API_CALL/SECURITY_HUB_CONTROL/INSPECTOR/ACCESS_ANALYZER/CLOUDTRAIL/MACIE/AUDIT_MANAGER/CLOUDWATCH/SYSTEMS_MANAGER/EVENTBRIDGE/TRUSTED_ADVISOR/PROCESS_CONTROL]: CONFIG_RULE
Config rule name (exact name): vpc-flow-logs-enabled
Description of how this rule mitigates the finding: Provides comprehensive network traffic visibility similar to GuardDuty's network monitoring capabilities
Add another control? [y/n]: y
Adding Control #2
Control ID: SECURITY-ALARMS
Control description: CloudWatch alarms for suspicious activity
Validation type [CONFIG_RULE/API_CALL/SECURITY_HUB_CONTROL/INSPECTOR/ACCESS_ANALYZER/CLOUDTRAIL/MACIE/AUDIT_MANAGER/CLOUDWATCH/SYSTEMS_MANAGER/EVENTBRIDGE/TRUSTED_ADVISOR/PROCESS_CONTROL]: CLOUDWATCH
Alarm name pattern: SecurityMonitoring-
Required metrics (comma-separated): UnauthorizedAPICalls,NetworkPortProbing
Required alarm state [ALARM/OK/INSUFFICIENT_DATA/ANY]: ANY
Minimum number of matching alarms required: 2
Description of how these alarms mitigate the finding: Alarms detect suspicious API calls and network activity similar to GuardDuty's threat detection
Add another control? [y/n]: n
Generated controls:
{
  "findingType": {
    "S": "GuardDuty.1"
  },
  "securityApprovedAdjustedRiskLevel": {
    "S": "MEDIUM"
  },
  "expirationDate": {
    "S": "2026-12-31T00:00:00Z"
  },
  "ticketReference": {
    "S": "JIRA-SEC-1234"
  },
  "businessJustification": {
    "S": "Alternative monitoring solution provides equivalent detection capabilities"
  },
  "auditInfo": {
    "S": "{\"createdBy\":\"arn:aws:sts::xxxxxxxxxxx:assumed-role/securityhub-validator-SecurityTeamRole/SecurityTeamSession\",\"createdAt\":\"2025-08-05T08:49:51Z\",\"lastModifiedBy\":\"arn:aws:sts::xxxxxxxxxxx:assumed-role/securityhub-validator-SecurityTeamRole/SecurityTeamSession\",\"lastModifiedAt\":\"2025-08-05T08:49:51Z\"}"
  },
  "securityControlHash": {
    "S": "a0b33a0a96a6b282bad1c093586d89cef832d40bb379abd4a004d00afdf603d1"
  },
  "requiredControls": {
    "S": "[{\"controlId\":\"VPC-FLOW-LOGS\",\"description\":\"VPC Flow logs must be enabled for network monitoring\",\"validationType\":\"CONFIG_RULE\",\"validationParams\":{\"ruleName\":\"vpc-flow-logs-enabled\",\"justification\":\"Provides comprehensive network traffic visibility similar to GuardDuty's network monitoring capabilities\"}},{\"controlId\":\"SECURITY-ALARMS\",\"description\":\"CloudWatch alarms for suspicious activity\",\"validationType\":\"CLOUDWATCH\",\"validationParams\":{\"alarmNamePattern\":\"SecurityMonitoring-\",\"requiredMetrics\":[\"UnauthorizedAPICalls\",\"NetworkPortProbing\"],\"requiredState\":\"ANY\",\"minimumAlarms\":2,\"justification\":\"Alarms detect suspicious API calls and network activity similar to GuardDuty's threat detection\"}}]"
  }
}
Save to DynamoDB? [y/n]: y
Compensating controls saved to DynamoDB!
This action has been logged for audit purposes.

When prompted to save to DynamoDB, enter Y. Compensating controls will be added to the DynamoDB compensating controls table.

Figure 7: Compensating controls for GuardDuty.1 finding

For this proof-of-concept demonstration, the compensating controls implementation requires additional AWS permissions beyond what the developer role provides. In a production environment, these controls would typically be implemented by infrastructure teams or through automated deployment pipelines.

Switch to administrative credentials.

For the demonstration, temporarily switch back to your administrative AWS credentials (the ones used to create the roles):

Unset the security team role credentials

unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN

Implement the required controls

Control 1: Enable VPC Flow Logs, starting by getting your VPC IDVPC_ID=$(aws ec2 describe-vpcs --query 'Vpcs[0].VpcId' --output text)

Create flow logs:

aws ec2 create-flow-logs \
    --resource-type VPC \
    --resource-ids $VPC_ID \
    --traffic-type ALL \
    --log-destination-type cloud-watch-logs \
    --log-group-name VPCFlowLogs

Create the AWS Config Rule:

aws configservice put-config-rule \
    --config-rule '{
        "ConfigRuleName": "vpc-flow-logs-enabled",
        "Source": {
            "Owner": "AWS",
            "SourceIdentifier": "VPC_FLOW_LOGS_ENABLED"
        }
    }'

Control 2: Create security monitoring alarms starting with creating metric filters for CloudTrail Logs; start by creating a log group for CloudTrail (if none exists):aws logs create-log-group --log-group-name CloudTrail/SecurityEventsCreate a metric filter for unauthorized API calls:

aws logs put-metric-filter \
    --log-group-name CloudTrail/SecurityEvents \
    --filter-name UnauthorizedAPICallsFilter \
    --filter-pattern '{ ($.errorCode = "*UnauthorizedOperation") || ($.errorCode = "AccessDenied*") }' \
    --metric-transformations metricName=UnauthorizedAPICalls,metricNamespace=SecurityMetrics,metricValue=1

Create a filter for network port probing:

aws logs put-metric-filter \
    --log-group-name CloudTrail/SecurityEvents \
    --filter-name NetworkPortProbingFilter \
    --filter-pattern '[version, account, eni, source, destination, srcport, destport="22" || destport="3389" || destport="1433", protocol, packets, bytes, windowstart, windowend, action="REJECT", flowlogstatus]' \
    --metric-transformations metricName=NetworkPortProbing,metricNamespace=SecurityMetrics,metricValue=1

Create required CloudWatch alarms, starting with Alarm 1 for Unauthorized API calls:

aws cloudwatch put-metric-alarm \
    --alarm-name "SecurityMonitoring-UnauthorizedAPICalls" \
    --alarm-description "Detects unauthorized API calls" \
    --metric-name "UnauthorizedAPICalls" \
    --namespace "SecurityMetrics" \
    --statistic Sum \
    --period 300 \
    --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --evaluation-periods 1

Alarm 2: Network port probing:

aws cloudwatch put-metric-alarm \
    --alarm-name "SecurityMonitoring-NetworkPortProbing" \
    --alarm-description "Detects network port probing activity" \
    --metric-name "NetworkPortProbing" \
    --namespace "SecurityMetrics" \
    --statistic Sum \
    --period 300 \
    --threshold 5 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --evaluation-periods 1

Now assume the DeveloperRole to suppress the finding:

aws sts assume-role \
    --role-arn arn:aws:iam::ACCOUNT_ID:role/securityhub-validator-DeveloperRole \
    --role-session-name DeveloperSession

Configure the returned credentials:

export AWS_ACCESS_KEY_ID=<from assume-role output>
export AWS_SECRET_ACCESS_KEY=<from assume-role output>
export AWS_SESSION_TOKEN=<from assume-role output>

Change the workflow status of the Security Hub finding related to GuardDuty from NEW to SUPPRESSED.

To change the workflow status using the AWS CLI (developer):

# Get the finding ARN first (command shown for reference)
aws securityhub get-findings \
    --filters '{"GeneratorId":[{"Value":"security-control/GuardDuty.1","Comparison":"EQUALS"}]}' \
    --query 'Findings[0].Id'
# Get the product ARN (command shown for reference)
aws securityhub get-findings \
    --filters '{"GeneratorId":[{"Value":"security-control/GuardDuty.1","Comparison":"EQUALS"}]}' \
    --query 'Findings[0].ProductArn' \
    --output text
# Then suppress the finding
aws securityhub batch-update-findings \
  --finding-identifiers '[{"Id":"finding-arn-from-above","ProductArn":"product-arn-from-above"}]' \
  --workflow '{"Status":"SUPPRESSED"}' \
  --note '{"Text":"Implemented compensating controls as per security team requirements","UpdatedBy":"[email protected]"}'

To change the workflow status using the console (developer):

Go to the Security Hub console.
In the navigation pane, choose Findings.
In the search bar, select Compliance Security Control ID filter and enter the value of Is as GuardDuty.1.
Select the finding GuardDuty should be enabled and under Workflow status, select SUPPRESSED.
In the Note field, enter Implemented compensating controls as per security team requirements.
Choose Set status to save the note.

Figure 8: GuardDuty.1 finding workflow status changed from NEW to SUPPRESSED

Note: Only suppress findings after implementing the required compensating controls provided by the security team.

After the Workflow status of the finding is SUPPRESSED, the automated validation process begins and you can see the Lambda function logs in the CloudWatch console related to different validations performed.

To view Lambda function logs in the CloudWatch console:

Go to the Amazon CloudWatch console.
In the navigation pane, under Logs, choose Log groups.
Select the log group with the Lambda function name.
Select the most recent log stream to view the logs.

Figure 9: Lambda function CloudWatch logs

The solution updates the note section of the findings in Security Hub with the validation results:

If all controls pass:

Finding status remains SUPPRESSED.
A note is added with validation results and adjusted risk level.
Business context is added to the finding.

If one of the controls fails:

Finding status changes to NOTIFIED.
A note is added with details about failed controls.
The security team reviews the changes as part of their standard process.

To view the finding’s workflow status and updated note using the console (developer):

Go to the Security Hub console.
In the navigation pane, choose Findings.
In the search bar, select Compliance Security Control ID filter and enter value of Is as GuardDuty.1.
Select the finding GuardDuty should be enabled and check the Workflow status.
For Actions, choose Add note.
Check the Last note added.

Figure 10: Security Hub updated finding note

The finding note shows that automated validation has performed checks and documented the results, also note that the original severity of HIGH that was assigned by Security Hub is maintained and the adjusted severity of MEDIUM that was provided by the security team is added in the Note section and to the Evidence table, providing transparency and accountability while maintaining the original severity assigned by Security Hub.

Clean up

To avoid incurring ongoing charges, use the following command to clean up resources created for this post.

./cleanup.sh

This deployment process is designed to be straightforward and to maintain security best practices such as encryption, least privilege, and segregation of duties.

Conclusion

In this post, we showed you how to implement a solution that security teams can use to define compensating controls for AWS Security Hub findings and automatically validate their implementation. We walked through the challenges of managing security exceptions and demonstrated how this solution helps to bridge the gap between security requirements and practical implementation.

The solution provides a structured workflow where security teams define acceptable compensating controls, developers implement them, and an automated system validates their effectiveness. With support for 13 different validation types, from AWS Config rules to process documentation, the solution offers comprehensive coverage for various security scenarios.

We also demonstrated the end-to-end process of adding compensating controls for a GuardDuty finding and showed how the solution maintains the original finding severity assigned by Security Hub while documenting the adjusted risk level approved by the security team. This approach helps maintain transparency and auditability while allowing for necessary exceptions.

Give it a try and share your feedback in the comments section.

Security Implication Disclaimer: The Amazon S3 configurations demonstrated in this post involve public access settings that expose data to the internet and should only be used for demonstration or non-sensitive content. Public S3 buckets carry significant risks including data exposure, unexpected costs from unauthorized usage, compliance violations, and potential security breaches. For production environments, use IAM roles, implement least privilege access policies, enable S3 Block Public Access settings, and consider CloudFront with Origin Access Control for public content delivery. Consult your security team and make sure of compliance with organizational policies before implementing public S3 configurations in production systems.

A scalable, elastic database and search solution for 1B+ vectors built on LanceDB and Amazon S3

2025-09-22 Audra Devoto

Post Syndicated from Audra Devoto original https://aws.amazon.com/blogs/architecture/a-scalable-elastic-database-and-search-solution-for-1b-vectors-built-on-lancedb-and-amazon-s3/

This post was co-authored with Owen Janson, Audra Devoto, and Christopher Brown of Metagenomi.

From CRISPR gene editing to industrial biocatalysis, enzymes power some of the most transformative technologies in healthcare, energy, and manufacturing. But discovering novel enzymes that can transform an industry — such as Cas9 for genome engineering — requires sifting through the billions of diverse enzymes encoded by organisms spanning the tree of life. Advances in DNA sequencing and metagenomics have enabled the growth of vast public and proprietary databases containing known protein sequences, but scanning through these collections to identify high value candidates is fundamentally a big data problem as well as a biological one.

At Metagenomi, we’re developing potentially curative therapeutics by using our extensive metagenomics database (MGXdb) to build a toolbox of novel gene editing systems. In this post, we highlight how Metagenomi is tackling the challenge of enzyme discovery at the billion protein scale by using the scalable infrastructure of Amazon Web Services (AWS) to build a high-performance protein database and search solution based on embeddings. By embedding every protein in our large proprietary database into a vector space, making the data accessible using LanceDB built on Amazon Simple Storage Service (Amazon S3), and accessed with AWS Lambda, we were able to transform enzyme discovery into a nearest neighbor search problem and rapidly access previously unexplored discovery space.

Solution overview

At the core of our solution is LanceDB. LanceDB is an open source vector database that enables rapid approximate nearest neighbor (ANN) searches on indexed vectors. LanceDB is particularly well suited for a serverless stack because it’s entirely file-based and is also compatible with Amazon S3 storage. As a result, we can store our database of embedded protein sequences on relatively low-cost Amazon S3, rather than a persistent disk storage such as Amazon Elastic Block Store (Amazon EBS). Instead of constantly running servers, all that is needed to rapidly query the database on-demand is a Lambda function that uses LanceDB to find nearest neighbors directly from the data on S3.

To overcome the challenge of ingesting and querying billions of vector embeddings representing Metagenomi’s large protein database, we devised a method for splitting the database into equal sized parts (folders) stored for low cost on Amazon S3 that can be indexed in parallel and searched with a map-reduce approach using Lambda. The following diagram illustrates this architecture.

AWS architecture showing protein vector processing workflow with ECR, Lambda, and LanceDB

The process follows four steps:

Data vectorization
Data bucketing
Indexing and ingesting data
Querying the database

Data vectorization

To make use of LanceDB’s fast ANN search capabilities, the data must be in vector form. Our metagenomics database consists of billions of proteins, each a string of amino acids. To convert each protein into a vector that captures biologically meaningful information, we run them through a protein language model (pLM), capturing the model’s hidden layers as a vector representation of that protein. Many pLMs can be used to generate protein embeddings, depending on the desired biological information and computational requirements. Here, we use the AMPLIFY_350M model, a transformer encoder model that is fast enough to scale to our entire protein database. We perform a mean-pool of the final hidden layer of the model to produce a 960-dimension vector for each protein. These vectors and their respective unique protein IDs are then stored in HDF5 files.

Data bucketing

To turn our protein vectors into a searchable database, we use LanceDB to build an index suitable for quickly finding ANNs to a query. However, indexing can take a long time and is difficult to distribute across nodes. To speed up indexing, we first divide our data into roughly evenly sized buckets. We then assign each of our embedding HDF5 files to buckets of size roughly equal to 200 million total vectors using a best-fit bin packing algorithm. The exact size packing method used to bucket data depends on the number and dimension of the vectors, as well as their format. Each bucket is ingested into a separate table that will separately reside in a single LanceDB database object store on Amazon S3.

S3 bucket structure showing LanceDB database organization with vector buckets

By bucketing our data, we can produce several smaller databases that can be indexed on separate nodes in a much shorter amount of time. We can also add more data to our database incrementally as a new bucket, instead of reindexing all the existing data.

Ingesting and indexing bucketed data

After the vectorized data has been assigned to a bucket, it’s time to turn it into a LanceDB table and index it to enable fast ANN querying. The details on how to convert your specific data into a LanceDB table can be found in the LanceDB documentation. For each of our buckets of approximately 200 million vectors, we create a LanceDB table with an IVF-PQ index on the cosine distance. For indexing, we use several partitions equal to the square root of the number of inserted rows, and several sub vectors equal to the number of dimensions of our vectors divided by 16.

To make things smoother to query, we name each table after the bucket from which it was created and upload them to a single S3 directory such that their file structure indicates a single LanceDB database with multiple tables.

The following code snippet provides an example of how you might ingest vectors from an HDF5 file containing id and embedding columns into a LanceDB database and index for fast ANN searches based on cosine distance. The only requirements for running this snippet are python >= 3.9, as well as the lancedb, pyarrow, and h5py packages. It should be noted that this snippet was tested and developed using lancedb version 0.21.1 using the asynchronous LanceDB API.

from typing import List, Iterable
from itertools import islice
from math import sqrt
import pyarrow as pa
import datetime
import asyncio
import lancedb
import h5py

def batched(iterable: Iterable, n: int) -> Iterable[List]:
    """Yield batches of n items from iterable."""
    while batch := list(islice(iterable, n)):
        yield batch

async def vectors_to_db(
    vectors: str,
    db: str,
    table_name: str,
    vector_dim: int,
    ingestion_batch_size: int,
) -> int:
    """Ingest and index vectors from an HDF5 file into a LanceDB table.
    Args:
        vectors (str): An HDF5 file containing protein IDs and their
            960-dimension vector representations.
        db (str): Path to the LanceDB database.
        table_name (str): Name of the table to create.
        vector_dim (int): Dimension of the vectors.
    """
    # create db and table
    custom_schema = pa.schema(
        [
            pa.field("embedding", pa.list_(pa.float32(), vector_dim)),
            pa.field("id", pa.string()),
        ]
    )

    # count the total number of rows as they are added to the table
    total_rows = 0

    # open a connection to the new database and create a table
    with await lancedb.connect_async(db) as db_connection:
        with await db_connection.create_table(
            table_name, schema=custom_schema
        ) as table_connection:
            # open vectors file
            with h5py.File(vectors, "r") as vectors_handle:
                # create a generator over the rows
                rows = (
                    {"embedding": e, "id": i}
                    for e, i in zip(
                        vectors_handle["embedding"],
                        vectors_handle["id"],
                    )
                )

                # insert rows in batches to avoid memory issues
                for batch in batched(rows, ingestion_batch_size):
                    total_rows += len(batch)
                    await table_connection.add(batch)

            # optimize the table and remove old data
            await table_connection.optimize(
                cleanup_older_than=datetime.timedelta(days=0)
            )

            # configure the index for the table
            index_config = lancedb.index.IvfPq(
                distance_type="cosine",
                num_partitions=int(sqrt(total_rows)),
                num_sub_vectors=int(
                    vector_dim / 16
                ),
            )

            # index the table
            await table_connection.create_index(
                "embedding", config=index_config
            )

# ingest and index your data
asyncio.run(
    vectors_to_db(
        vectors="./my_vectors.h5",
        db="./test_db",
        table_name="bucket1",
        vector_dim=960,
        ingestion_batch_size=50000
    )
)

The task of vectorizing, ingesting, indexing each bucket could be parallelized over multiple AWS Batch jobs or run on a single Amazon Elastic Compute Cloud (Amazon EC2) instance.

Querying the database

After the data has been bucketed and ingested into a LanceDB database on Amazon S3, we need a way to query it. Because LanceDB can be queried directly from Amazon S3 using the LanceDB Python API, we can use Lambda functions to take a user-provided query vector and search for ANNs, then return the data to the user. However, because our data has been bucketed across several tables in the database, we need to search for nearest neighbors in each bucket and aggregate the results before passing them back to the user.

We implement the query workflow as an AWS Step Functions state machine that manages a query process for each bucket as Lambda processes, as well as a single Lambda process at the end that aggregates the data and writes the resulting ANNs to a .csv file on Amazon S3. However, this could also be implemented as a series of AWS Batch processes or even run locally. The following snippet shows how a process assigned to one bucket could run an ANN query against one of the database’s buckets, requiring only pandas and lancedb to run on python >= 3.9. As detailed before in the ingestion section, we use the asynchronous LanceDB API and lancedb package version 0.21.1.

from typing import List, Iterable
import asyncio
import lancedb
import pandas
import random

async def run_query_async(
    lancedb_s3_uri: str,
    table_name: str,
    q_vec: List[float],
    k: int,
    vec_col: str,
    n_probes: int,
    refine_factor: int,
) -> pandas.DataFrame:
    """Run a query on a LanceDB table.
    Args:
        lancedb_s3_uri (str): S3 URI of the LanceDB database.
        table_name (str): Name of the table to query.
        q_vec (List[float]): Query vector.
        k (int): Number of nearest neighbors to return.
        vec_col (str): Column name of the vector column.
        n_probes (int): Number of probes to use for the query.
        refine_factor (int): Refine factor for the query.
    Returns:
        pandas.DataFrame: DataFrame containing the approximate nearest
        neighbors to the query vector.
    """
    # open a connection to the database and table
    with await lancedb.connect_async(
        lancedb_s3_uri, storage_options={"timeout": "120s"}
    ) as db_connection:
        with await db_connection.open_table(table_name) as table_connection:
            # query the approximate nearest neighbors to the query vector
            df = (
                await table_connection.query()
                .nearest_to(q_vec)
                .column(vec_col)
                .nprobes(n_probes)
                .refine_factor(refine_factor)
                .limit(k)
                .distance_type("cosine")
                .to_pandas()
            )

    return df

# query the example bucket we produced in the last section
bucket1_df = asyncio.run(
    snippets.run_query_async(
        lancedb_s3_uri="s3://mg-analysis/owen/20250415_lancedb_snippet_testing/test_db/",
        table_name="bucket1",
        q_vec=[random.random() for _ in range(960)],
        k=3,
        vec_col="embedding",
        n_probes=1,
        refine_factor=1,
    )
)

The preceding query will return a panda DataFrame of the following structure:

embedding	id	_distance
[-5.124435, 4.242000, …]	id_1	0.000000
[-5.783999, 4.340500, …]	id_2	0.001000
[-6.932943, 3.394850, …]	id_3	0.04020

Where the embedding column contains the vector representations of the nearest neighbors, the id column their IDs, and the _distance column their cosine distances to the queried vector.

After each bucket has been independently queried across nodes and each has returned a nearest neighbors DataFrame, the results must be merged and subset to return the user. The following snippet shows how you might do this.

def aggregate_nearest_neighbors(
    dfs: List[pandas.DataFrame], k: int
):
    """Aggregate the nearest neighbors for each query vector.
    Args:
        dfs (List[pandas.DataFrame]): A list of DataFrames containing the
            nearest neighbors queried from each bucket.
        k (int): The number of nearest neighbors to aggregate.
    Returns:
        pd.DataFrame: A DataFrame with the aggregated nearest neighbors.
    """
    # concatenate the DataFrames and get the top k nearest neighbors
    return (
        pandas.concat(dfs, ignore_index=True)
        .sort_values(by=["_distance"], ascending=True)
        .reset_index(drop=True)
        .head(k)
    )

# add the dataframes from querying each bucket to a list
dfs = [bucket1_df, bucket2_df, bucket3_df, bucket4_df, bucket_5]

# aggregate the nearest neighbors across all buckets
nearest_neighbors_all_buckets_df = aggregate_nearest_neighbors(dfs, 5)

Optimizing for large batches of queries

Though querying a LanceDB database directly from its S3 object store on Lambda works well for querying the ANNs of one or a few query vectors, some use cases might require querying thousands or even millions of vectors.

One solution we’ve found that scales well to large batches of queries is to modify the preceding query implementation such that it first downloads one of the database buckets to local storage, then queries it locally using the LanceDB API. Because database buckets can have a large storage footprint, this implementation is better suited for AWS Batch jobs than Lambda, and we recommend using optimized instance storage (for example, i4i instances) rather than EBS volumes. After all query Batch jobs finish, a final job can aggregate their results before returning to the user. Orchestration of parallel query jobs and the aggregation job can be done with Nextflow. Though this implementation will have significantly more overhead and latency from downloading the buckets to disk, it can handle larger batches of queries more efficiently and still requires no continuously running server-based database.

Benchmarking results

Indexing strategies and database split sizes depend on your personal need for performance. Consider the following general optimization guidance when customizing to your use case.

An example database created by Metagenomi consisted of 3.5 billion vector embeddings produced by AMPLIFY, of dimension 960. Ingesting and indexing these 3.5B vector embeddings in split sizes of 200M vectors on i4i.8xlarge instances took 108 total compute hours. Because this solution is serverless and can be queried directly from its S3 object store, the only fixed cost of this database is its storage footprint on Amazon S3 (for an indexed database of 3.5B vectors, this is approximately 12.9 TB). Lambda queries can be an exceptionally low-cost querying solution, with many queries costing fractions of a cent.

In general, larger database splits will be more cost effective to query but will result in longer runtimes and longer indexing times. We recommend scaling up database split sizes to the maximum size that results in an acceptable query return time for a single split while also considering limits of parallelization such as maximum concurrent Lambda functions running. Metagenomi identified database splits of 200M vectors each to yield an optimal trade-off in cost and runtime for both small and large queries. We recommend ingesting and indexing on storage-optimized instances, such as those in the i4i family, for optimal performance and cost savings. If querying is to be done on an instance using a disk-based database (as opposed to Lambda and Amazon S3), we also recommend using storage-optimized instances for queries. We found the Lambda implementation could quickly handle single queries requesting up to 50,000 ANNs, or multi queries of up to 100 sequences with fewer than 5 ANNs. Runtime increases linearly with the number of ANNs requested, as shown in the following graph.

Line graph showing query runtime increasing with number of nearest neighbors

Conclusion

In this post, we showed how Metagenomi was able to store and query billions of protein embeddings at low cost using LanceDB implemented with Amazon S3 and AWS Lambda. This work expands on Metagenomi’s patient-driven mission to create curative genetic medicines by accelerating our discovery and engineering platform. Having quick access to the ANN embedding space of a query protein in seconds has enabled the integration of rapid search methods in our extensive analysis pipelines, accelerated the discovery of several diverse and novel enzyme families, and enabled protein engineering efforts by providing scientists with methods to generate and search embeddings on the fly. As Metagenomi continues to rapidly scale protein and DNA databases, horizontal scaling enabled by database splits that can be indexed and searched in parallel facilitates an embedding database solution that scales to future needs.

The solution outlined in this post focuses on vectors produced by a protein large language model (LLM) but can be applied to other vectorized datasets. To learn more about LanceDB integrated with Amazon S3, refer to the LanceDB documentation.

References

Fournier, Quentin, et al. “Protein language models: is scaling necessary?.” bioRxiv (2024): 2024-09.

About the authors

Use Apache Airflow workflows to orchestrate data processing on Amazon SageMaker Unified Studio

2025-09-22 Vinod Jayendra

Post Syndicated from Vinod Jayendra original https://aws.amazon.com/blogs/big-data/use-apache-airflow-workflows-to-orchestrate-data-processing-on-amazon-sagemaker-unified-studio/

Orchestrating machine learning pipelines is complex, especially when data processing, training, and deployment span multiple services and tools. In this post, we walk through a hands-on, end-to-end example of developing, testing, and running a machine learning (ML) pipeline using workflow capabilities in Amazon SageMaker, accessed through the Amazon SageMaker Unified Studio experience. These workflows are powered by Amazon Managed Workflows for Apache Airflow (Amazon MWAA).

While SageMaker Unified Studio includes a visual builder for low-code workflow creation, this guide focuses on the code-first experience: authoring and managing workflows as Python-based Apache Airflow DAGs (Directed Acyclic Graphs). A DAG is a set of tasks with defined dependencies, where each task runs only after its upstream dependencies are complete, promoting correct execution order and making your ML pipeline more reproducible and resilient.We’ll walk through an example pipeline that ingests weather and taxi data, transforms and joins datasets, and uses ML to predict taxi fares—all orchestrated using SageMaker Unified Studio workflows.

If you prefer a simpler, low-code experience, see Orchestrate data processing jobs, querybooks, and notebooks using visual workflow experience in Amazon SageMaker.

Solution overview

This solution demonstrates how SageMaker Unified Studio workflows can be used to orchestrate a complete data-to-ML pipeline in a centralized environment. The pipeline runs through the following sequential tasks, as shown in the preceding diagram.

Task 1: Ingest and transform weather data: This task uses a Jupyter notebook in SageMaker Unified Studio to ingest and preprocess synthetic weather data. The synthetic weather dataset includes hourly observations with attributes such as time, temperature, precipitation, and cloud cover. For this task, the focus is on time, temperature, rain, precipitation, and wind speed.
Task 2: Ingest, transform and join taxi data: A second Jupyter notebook in SageMaker Unified Studio ingests the raw New York City taxi ride dataset. This dataset includes attributes such as pickup time, drop-off time, trip distance, passenger count, and fare amount. The relevant fields for this task include pickup and drop-off time, trip distance, number of passengers, and total fare amount. The notebook transforms the taxi dataset in preparation for joining it with the weather data. After transformation, the taxi and weather datasets are joined to create a unified dataset, which is then written to Amazon S3 for downstream use.
Task 3: Train and predict using ML: A third Jupyter notebook in SageMaker Unified Studio applies regression techniques to the joined dataset to create a model to determine how attributes of the weather and taxi data such as rain and trip distance impact taxi fares and create a fare prediction model. The trained model is then used to generate fare predictions for new trip data.

This unified approach enables orchestration of extract, transform, and load (ETL) and ML steps with full visibility into the data lifecycle and reproducibility through governed workflows in SageMaker Unified Studio.

Prerequisites

Before you begin, complete the following steps:

Create a SageMaker Unified Studio domain: Follow the instructions in Create an Amazon SageMaker Unified Studio domain – quick setup
Sign in to your SageMaker Unified Studio domain: Use the domain you created in Step 1 sign in. For more information, see Access Amazon SageMaker Unified Studio.
Create a SageMaker Unified Studio project: Create a new project in your domain by following the project creation guide. For Project profile, select All capabilities.

Set up workflows

You can use workflows in SageMaker Unified Studio to set up and run a series of tasks using Apache Airflow to design data processing procedures and orchestrate your querybooks, notebooks, and jobs. You can create workflows in Python code, test and share them with your team, and access the Airflow UI directly from SageMaker Unified Studio. It provides features to view workflow details, including run results, task completions, and parameters. You can run workflows with default or custom parameters and monitor their progress. Now that you have your SageMaker Unified Studio project set up, you can build your workflows.

In your SageMaker Unified Studio project, navigate to the Compute section and select Workflow environment.
Choose Create environment to set up a new workflow environment.
Review the options and choose Create environment. By default, SageMaker Unified Studio creates an mw1.micro class environment, which is suitable for testing and small-scale workflows. To update the environment class before project creation, navigate to Domain and select Project Profiles and then All Capabilities and go to OnDemand Workflows blueprint deployment settings. By using these settings, you can override default parameters and tailor the environment to your specific project requirements.

Develop workflows

You can use workflows to orchestrate notebooks, querybooks, and more in your project repositories. With workflows, you can define a collection of tasks organized as a DAG that can run on a user-defined schedule.To get started:

Download Weather Data Ingestion, Taxi Ingest and Join to Weather, and Prediction notebooks to your local environment.
Go to Build and select JupyterLab; choose Upload files and import the three notebooks you downloaded in the previous step.

Configure your SageMaker Unified Studio space: Spaces are used to manage the storage and resource needs of the relevant application. For this demo, configure the space with an ml.m5.8xlarge instance
1. Choose Configure Space in the right-hand corner and stop the space.
2. Update instance type to ml.m5.8xlarge and start the space. Any active processes will be paused during the restart, and any unsaved changes will be lost. Updating the workspace might take a take few minutes.
Go to Build and select Orchestration and then Workflows.
Select the down arrow (▼) next to Create new workflow. From the dropdown menu that appears, select Create in code editor.
In the editor, create a new Python file named multinotebook_dag.py under src/workflows/dags. Copy the following DAG code, which implements a sequential ML pipeline that orchestrates multiple notebooks in SageMaker Unified Studio. Replace <REPLACE-OWNER> with your username. Update NOTEBOOK_PATHS to match your actual notebook locations.

from airflow.decorators import dag
from airflow.utils.dates import days_ago
from workflows.airflow.providers.amazon.aws.operators.sagemaker_workflows import NotebookOperator

WORKFLOW_SCHEDULE = '@daily'

NOTEBOOK_PATHS = [
'<REPLACE FULL PATH FOR Weather_Data_Ingestion.ipynb>',
'<REPLACE FULL PATH FOR Taxi_Weather_Data_Collection.ipynb>',
'<REPLACE FULL PATH FOR Prediction.ipynb>'
]

default_args = {
    'owner': '<REPLACE-OWNER>',
}

@dag(
    dag_id='workflow-multinotebooks',
    default_args=default_args,
    schedule_interval=WORKFLOW_SCHEDULE,
    start_date=days_ago(2),
    is_paused_upon_creation=False,
    tags=['MLPipeline'],
    catchup=False
)
def multi_notebook():
    previous_task = None

    for idx, notebook_path in enumerate(NOTEBOOK_PATHS, 1):
        current_task = NotebookOperator(
            task_id=f"Notebook{idx}task",
            input_config={'input_path': notebook_path, 'input_params': {}},
            output_config={'output_formats': ['NOTEBOOK']},
            wait_for_completion=True,
            poll_interval=5
        )

        # Ensure tasks run sequentially
        if previous_task:
            previous_task >> current_task

        previous_task = current_task  # Update previous task

multi_notebook()

The code uses the NotebookOperator to execute three notebooks in order: data ingestion for weather data, data ingestion for taxi data, and the trained model created by combining the weather and taxi data. Each notebook runs as a separate task, with dependencies to help ensure that they execute in sequence. You can customize with your own notebooks. You can modify the NOTEBOOK_PATHS list to orchestrate any number of notebooks in their workflow while maintaining sequential execution order.

The workflow schedule can be customized by updating WORKFLOW_SCHEDULE (for example: '@hourly', '@weekly', or cron expressions like ‘13 2 1 * *’) to match your specific business needs.

After a workflow environment has been created by a project owner, and once you’ve saved your workflows DAG files in JupyterLab, they are automatically synced to the project. After the files are synced, all project members can view the workflows you have added in the workflow environment. See Share a code workflow with other project members in an Amazon SageMaker Unified Studio workflow environment.

Test and monitor workflow execution

To validate your DAG, Go to Build > Orchestration > Workflows. You should now see the workflow running in Local Space based on the Schedule.

Once the execution completes, workflow would change to success start as shown below.

For each execution, you can zoom in to get a detailed workflow run details and task logs

Access the airflow UI from actions for more information on the dag and execution.

Results

The model’s output is written to the Amazon Simple Storage Service (Amazon S3) output folder as shown the following figure. These results should be evaluated for correctness of fit, prediction accuracy, and the consistency of relationships between variables. If any results appear unexpected or unclear, it is important to review the data, engineering steps, and model assumptions to verify that they align with the intended use case.

Clean up

To avoid incurring additional charges associated with resources created as part of this post, make sure you delete the items created in the AWS account for this post.

The SageMaker domain
The S3 bucket associated with the SageMaker domain

Conclusion

In this post, we demonstrated how you can use Amazon SageMaker to build powerful, integrated ML workflows that span the full data and AI/ML lifecycle. You learned how to create an Amazon SageMaker Unified Studio project, use a multi-compute notebook to process data, and use the built-in SQL editor to explore and visualize results. Finally, we showed you how to orchestrate the entire workflow within the SageMaker Unified Studio interface.

SageMaker offers a comprehensive set of capabilities for data practitioners to perform end-to-end tasks, including data preparation, model training, and generative AI application development. When accessed through SageMaker Unified Studio, these capabilities come together in a single, centralized workspace that helps eliminate the friction of siloed tools, services, and artifacts.

As organizations build increasingly complex, data-driven applications, teams can use SageMaker, together with SageMaker Unified Studio, to collaborate more effectively and operationalize their AI/ML assets with confidence. You can discover your data, build models, and orchestrate workflows in a single, governed environment.

To learn more, visit the Amazon SageMaker Unified Studio page.

About the authors

Integrate Tableau and PingFederate with Amazon Redshift using AWS IAM Identity Center

2025-09-19 Rohit Vashishtha

Post Syndicated from Rohit Vashishtha original https://aws.amazon.com/blogs/big-data/integrate-tableau-and-pingfederate-with-amazon-redshift-using-aws-iam-identity-center/

The series of posts on single sign-on to Amazon Redshift with AWS IAM Identity Center (successor to AWS Single Sign-On) integration continues from our prior post.

In this post, we outline a comprehensive guide for setting up single sign-on from Tableau desktop to Amazon Redshift using integration with IAM Identity Center and PingFederate as the identity provider (IdP) with an LDAP based data store, AWS Directory Service for Microsoft Active Directory.

Prerequisites

You should have the following prerequisites:

A PingFederate account that has an active subscription. You need an admin role to set up the application on PingFederate. If you’re new to PingFederate, you can reach out to Ping Identity Sales.
A working PingFederate server.
Amazon Redshift Serverless workgroup or a provisioned Amazon Redshift data warehouse.
Download and install the latest Redshift ODBC 2.X driver.
Download and install Tableau Desktop 2024.1 or later
Install Tableau Server 2023.3.9 or later. For Tableau Server installation, see Install and Configure Tableau Server.

Solution overview

PingFederate instance connects to IAM Identity Center using SAML. The users and groups in PingFederate are synced to IAM Identity Center using an open standard SCIM. After you set up SAML and SCIM, you will be able to enable single sign-on to Amazon Redshift from the AWS Management Console using Amazon Redshift Query Editor v2. This is achieved by creating an Identity Center application in the Amazon Redshift console.

To enable single sign-on to Amazon Redshift from outside of AWS using a third-party client like Tableau, you set up a trusted token issuer token exchange using OIDC standard.

Figure 1 : Solution overview for Tableau integration with Amazon Redshift using IAM Identity Center and Ping Federate

The workflow, shown in the preceding figure, includes the following steps:

The user configures Tableau to access Amazon Redshift using IAM Identity Center authentication.
On a user sign-in attempt, Tableau initiates a browser-based OAuth flow and redirects the user to the PingFederate sign in page to enter the sign-in credentials. Password validation is done against the AWS Managed Microsoft AD data store.
On successful authentication, PingFederate issues an authentication token (ID and access token) to Tableau.
The Amazon Redshift driver then makes a call to the Amazon Redshift-enabled Identity Center application and forwards the access token.
Amazon Redshift passes the token to Identity Center and requests an access token.
Identity Center verifies the token using the OIDC discovery connection to the trusted token issuer and returns an Identity Center-generated access token for the same user. In the preceding figure, trusted token issuer (TTI) is the PingFederate server that Identity Center trusts to provide tokens that third-party applications like Tableau use to call AWS services.
Amazon Redshift then uses the token to obtain the user and group membership information from Identity Center.
Tableau user will be able to connect with Amazon Redshift and access data based on the user and group membership returned from Identity Center. The user and group settings in the LDAP-based AWS Managed Microsoft AD data store for PingFederate are propagated to identity center using SCIM protocol for outbound provisioning.

Walkthrough

In this walkthrough, you will use the following steps to build the solution:

SAML and SCIM set up between PingFederate and IAM Identity Center
Connect to Amazon Redshift using Query Editor v2
Configure identity federation from a third-party client
1. Create an access token manager and access token mapping
2. Create an OIDC policy
3. Create an OAuth client
4. Set up a PingFederate Authorization Server
5. Policy Contract Grant Mapping
6. Collect PingFederate information
7. Set up a trusted token issuer in IAM Identity Center
8. Set up client connections and trusted token issuers in Amazon Redshift
9. Configure Tableau OAuth config files for PingFederate to integrate with Amazon Redshift using IAM Identity Center
10. Install a Tableau OAuth config file on a client machine for Tableau Desktop
11. Install a Tableau OAuth config file for a site on Tableau Server or Tableau Cloud
12. Federate to Amazon Redshift from Tableau Desktop using Identity Center
13. Federate to Amazon Redshift from Tableau Server using Identity Center authentication

SAML and SCIM set up between PingFederate and IAM Identity Center

IAM Identity Center integration with PingFederate starts with SAML set up followed by SCIM.

Set up SAML 2.0 for SP Connection of type Browser SSO (single sign-on) in PingFederate.
Set up SCIM 2.0 for outbound provisioning. It will sync the users and groups created in an LDAP based data store like AWS managed Microsoft AD for PingFederate to the users and groups in IAM Identity Center.

The implementation for the cloud based IdP option PingOne is not in scope of this post and follows steps similar to those described in Integrate IdP with Amazon Redshift Query Editor v2 using AWS IAM Identity Center for seamless Single Sign-On.

Further details of SAML and SCIM set up are as follows.

1. Install PingFederate Server.
2. Set up IAM Identity center integration by following the Ping documentation including the download for Identity Center integration files.
  1. Deploy the integration files to your PingFederate installation.
  2. Enable provisioning and configure IdP Browser SSO (SAML connection). (You can implement Browser SSO connection only using IAM Identity Center metadata file.)
    1. Under System > Server > Protocol Settings > Federation Info BASE_URL field, use the publicly accessible fully qualified domain name of the PingFederate server.
    2. Create an LDAP based data store (the name used in this example is AWSManagedMSAD) because SCIM 2.0 protocol for outbound provisioning only works with LDAP based data stores with PingFederate. If you are using a cloud-based solution like PinOne, you can set up outbound provisioning in PingOne itself. Thus for this writing, we have used AWS Managed Microsoft AD as a data store created using AWS Directory Service.
    3. Create a password credential validator (name used in this example is awsmanagedmsadpassval) and IdP adapters (name used in this example is awsmanagedmsadadapter) for your data store as applicable.
    4. Create an SP connection of type Browser SSO using the sp-saml-metadata.xml file as explained in creating a provisioning connection.
  3. Export SAML metadata from PingFederate.
  4. Register PingFederate as an IdP in Identity Center.
  5. Navigate back to the connection saved in step b, and configure outbound provisioning.
3. Enable provisioning in IAM Identity Center by following step 1 in the documentation.
4. Then, configure provisioning in PingFederate by following step 2 in the documentation.
5. Optionally, you can configure and pass user attributes from PingFederate for access control in Identity Center.

Next, connect to Amazon Redshift using its native query editor, Query Editor v2, to validate AWS services’ connectivity using IAM Identity Center.

Connect to Amazon Redshift using Query Editor v2

Complete the Walkthrough section of IAM Identity Center integration with Amazon Redshift, which will set up your Amazon Redshift connectivity with Query Editor v2.

If you need further help with SAML and SCIM set up, and connecting to Amazon Redshift using Query Editor v2, you can also follow step by step guided demo video single sign-on to Amazon Redshift with IAM IDC integration using PingFederate with AWS Managed MSAD Demo

Configure identity federation from a third-party client

Configure identity federation enabled by IAM Identity Center from IdP PingFederate to the service provider Amazon Redshift using an external client like Tableau. The following steps in the PingFederate admin console and Identity Center guide you through the identity federation process.

Create an access token manager and access token mapping

To map PingFederate attributes to OAuth access tokens and OpenID Connect ID (OIDC) tokens, create an access token manager and token mapping. For complete details and set up based on your security needs, see Token mapping in PingFederate, which explains access token management in detail. Complete the following steps to create a token manager.

In the PingFederate administrative console, go to Applications > OAuth > Access Token Management, and choose Create New Instance.
In Type tab,
1. Enter an Instance Name and Instance ID of your choice, for example TrustedTokenIssuerMgr.
2. Select the Type from drop down list as JSON Web Tokens, commonly called JWT.
3. Leave Parent instance as None and choose Next.
In Instance configuration tab,
1. Under Certificates, select Add a new row to ‘Certificates’, select the certificate for token manager from the drop-down list, enter a Key ID such as certkey, and choose Update under Action. You can create a new certificate by navigating to Security > Certificate & Key Management > Signing & Decryption Keys & Certificates > Create New.
2. Select Use Centralized Signing Key.
3. In JWS Algorithm, select RSA using SHA-256.
4. Select Enable Token Revocation. Leave everything else as default and choose Next.
Under Session Validation tab,
1. Select Include Session Identifier in Access Token.
2. Select Check for valid authentication session.
3. Leave other choices as is and choose Next.
In the Access Token Attribute Contract tab, leave the Subject Attribute Name as the e default and proceed to Extend the Contract to add the following attribute and values.
1. Enter aud, leave multi-value unchecked. Choose Add under Action.
2. Repeat the same to enter email, exp, iss, sub. When completed, choose Next.
On each of Resource URIs and Access Control tabs, leave as is and choose Next.
On the Summary tab, review your changes and choose Save. An instance name with the name you provided, like TrustedTokenIssuerMgr appears in Applications > Oauth > Access Token Management.

Figure 2 : Access Token Management Configuration Summary

Navigate to Applications > OAuth > Access Token Mappings, select the default Context and Access Token Manager, TrustedTokenIssuerMgr that was created in the previous step. Choose Add Mapping.
Leave Attribute Sources & User Lookup as is and choose Next.
Under Contract Fulfillment tab,
1. For Contract aud, select Text from the Source, and enter the Value as AWSIdentityCenter.
2. For Contract email, select Persistent Grant from the Source, and Value as email.
3. For Contract exp, select Persistent Grant from the Source, and Value as EXPIRES_AT.
4. For Contract iss, select Text from the Source, and enter your base URL as the Value, like https://yourwebsite.domain.com, the same as in System > Server > Protocol Settings > BASE URL.
5. For Attribute Contract sub, select Persistent Grant from the Source, and Value as USER_KEY.
6. Choose on Next.
Leave Issuance Criteria as is and choose Next.
On the Summary tab, review all your changes and choose Save. A new default Context with Access Token Manager if TrustedTokenIssuerMgr appears in Applications > OAuth > Access Token Mappings.

Figure 3: Access Token Mappings Summary

Create an OIDC policy

For complete details and set up based on your security needs, see to Open ID connect (OIDC) policy management in PingFederate. Complete the following steps to set up an OIDC policy.

In the PingFederate administrative console, go to Applications > OAuth > OpenID Connect Policy Management, and choose Add Policy.
In the Manage Policy tab,
1. Enter the Policy ID and Name of your choice, for example OIDCPolicy.
2. Select the Access Token Manager from drop down list created in the previous section—TrustedTokenIssuerMgr.
3. Select Include Session Identifier in ID Token
4. Select Include User Info in ID Token
5. Select Return ID Token on Refresh Grant
6. Leave others as is and choose Next.
In the Attribute Contract tab, keep only the required attributes in extended contract and delete the others.
1. Leave the sub attribute under Attribute Contract as is.
2. Under Extend the contract, choose delete for all attributes except email. choose Next.
In the Attribute Scopes tab,
1. Select openid from the Scope list.
2. Select email from Attributes.
3. Choose Add from Actions. Choose Next.
Leave Attribute Sources & User Lookup as is and choose Next.
In Contract Fulfillment tab,
1. For Attribute Contract email, select Persistent Grant from the Source, and Value as email.
2. For Attribute Contract sub, select Persistent Grant from the Source, and Value as USER_KEY.
3. Choose Next.
Leave Issuance Criteria as is and choose Next.
On the Summary tab, review your changes and choose Save. A policy ID with the name you provided, like OIDCPolicy, appears in Applications > Oauth > OpenID Connect Policy Management.

Figure 4 : OpenID Connect Policy Management Summary

Create OAuth client

For complete details and set up based on your security needs, see configure an OAuth client in PingFederate, which explains each field in detail. Complete the following steps to create an OAuth client.

In the PingFederate administrative console, go to Applications > OAuth > Clients, and choose Add Client.
In the Client ID field, enter a unique, immutable client ID. We use tableauredshiftpingfed as the name in this example.
Enter a Name and Description for the client.
Select a Client Authentication method. You can select from None, Client TLS Certificate, Private Key JWT, or Client Secret. For this scenario, select Client Secret. Choose Generate Secret to create a new one or use select Change secret to create your own.
Leave Request object signing algorithm set to Allow Any. You can override to use the algorithm of your choice if needed.
In the Redirect URIs field, add each of the following values.
1. http://localhost:8080/authorization-code/callback
2. http://localhost:55556/Callback
3. http://localhost:55557/Callback
4. http://localhost:55558/Callback
5. http://localhost/auth/add_oauth_token
Select Restrict common scopes. Restrict scopes by selecting the checkboxes for email, offline_access, openid, and profile as required.
In Logo URL, optionally enter the URL for logo you want to display on the User Grant Authorization and Revocation pages.
In the Allowed Grant Types list, you can choose from a list of authorization options. In this example, select Authorization code. Optionally, you can select Implicit, Refresh Token, and Client Credentials.
Under Default access token manager, select the access token manager TrustedTokenIssuerMgr created in the earlier section.
Select the Restrict box for Restrict to default access token manager.
Customize Persistent grants max lifetime to match your requirements. Set it to 12 hours for this example by using the third radio button.
For Openid connect, choose your preferred ID token signing algorithm. Select RSA using SHA-256 for this example. Optionally, for Policy you can choose the OIDC policy created in the earlier section.
Leave the remaining settings as default and choose Save.

Figure 5 : OAuth Client Configuration

The Tableau Desktop redirect URLs should always use localhost. The following example, also use localhost for the Tableau Server hostname to simplify testing in a test environment. For this setup, you should also access the server at localhost in the browser. In a production environment, or Tableau Cloud, you should use the full hostname that your users will use to access Tableau on the web, along with HTTPS. If you already have an environment with HTTPS configured, you can skip the localhost configuration and use the full hostname from the start.

Set up a PingFederate authorization server

For complete details and set up based on your security needs, see PingFederate authorization server settings in PingFederate. Complete the following steps to configure an authorization server.

In the PingFederate administrative console, go to System > OAuth Settings > Authorization Server Settings, and make following changes.
Leave the initial configurations as default and scroll down to Persistent Grant Extended Attributes, add Attribute email.
For OAuth Administrative Web Services Settings, in Password Credential Validator, select awsmanagedmsadpassval that you created in the SAML and SCIM set up section.
For Persistent Grant Management API,
1. In Access Token Manager, select the TrustedTokenIssuerMgr created earlier.
2. In Required Scope, select openid.
Leave remaining the settings as default and choose Save.

Figure 6 : PingFederate Authorization Server Setting

Policy contract grant mapping

For complete details and set up based on your security needs, see Grant contract mapping in PingFederate. For this illustration, we set up a policy contract grant mapping for authentication in a three-step process.

Step 1: Create a policy contract

In the PingFederate administrative console, go to Authentication > Policies > Policy Contracts, and choose Create New Contract.
In Contract Info tab, enter a name. For this example, we use OIDCPolicyContract.
In Contract Attributes tab, choose Extend the Contract to add email attribute.
Review and choose Save.

Figure 7 : Policy Contract Summary

Step 2: Add authentication policy

In the PingFederate administrative console, go to Authentication > Policies > Policies, and choose Add Policy.
Enter a policy name. In this example, we use OAuthOIDCPolicy.
In the Policy drop down, select IdP Adapter and select the awsmanagedmsadadapter that you created in the SAML and SCIM set up section.
Set FAIL to Done and under SUCCESS, select Policy Contracts from the drop-down menu and select the OIDCPolicyContract created in step 1. Choose Done.

Figure 8 : Authentication Policy Configuration

Step 3: Policy contract grant mapping

In the PingFederate administrative console, go to Authentication > OAuth > Policy Contract Grant Mapping, and under Mappings, select OIDCPolicyContract created in Step1 and choose Add Mapping.
On the Attribute Sources & User Lookup tab, choose Next.
In the Contract Fulfillment tab,
1. For Contract USER_KEY, pick Authentication Policy Contract from the Source, and Value as subject.
2. For Contract USER_NAME, pick Authentication Policy Contract from the Source, and Value as subject.
3. For Contract email, pick Authentication Policy Contract from the Source, and Value as email.
4. Choose Next.
Leave Issuance Criteria as is, review and choose Save.

Figure 9 : Policy Contract Grant Mapping Summary

Collect PingFederate information

To configure your PingFederate with IAM Identity Center and Amazon Redshift, collect the following parameters. If you don’t have these parameters, contact your PingFederate admin.

Issuer URL, auth URL (authUri), and token URL (tokenUri).

You can get these values from the OIDC IdP URL: https://pingfedserver.example.com/.well-known/openid-configuration. Open this URL in a web browser, replacing pingfedserver.example.com with your IdP server name.

The following is an example screenshot of IdP attributes using OIDC IdP URL where:

The issuer URL corresponds to the issuer
The auth URL (authUri) corresponds to authorization_endpoint
The token URL (tokenUri) corresponds to token_endpoint

Figure 10 : Screenshot of IdP Attributes

Audience value

To get the Audience value from PingFederate, sign in as an admin to PingFederate and navigate to the following path to get the audience value that you created during access token mapping creation in PingFederate:

Applications > OAuth > Access Token Mappings > TrustedTokenIssuerMgr → Summary > aud

Figure 11 : Access Token Mapping

Set up a trusted token issuer in IAM Identity Center

Switch from the PingFederate console to the IAM Identity Center console for the AWS side of configuration. Start by adding a trusted token issuer (TTI), which makes it possible to authorize Tableau to make requests on behalf of their users to access data in Amazon Redshift. A TTI is an OAuth 2.0 authorization server that issues tokens to applications that initiate requests (requesting applications). The tokens authorize these applications to initiate requests on behalf of their users to a receiving application (an AWS service). In this step, you create a TTI in the central management account. To create a TTI,

Open the AWS Management Console and navigate to IAM Identity Center, and then to the Settings page.
Select the Authentication tab and under Trusted token issuers, choose Create trusted token issuer.
On the Set up an external IdP to issue trusted tokens page, under Trusted token issuer details, do the following:
- For Issuer URL, enter the OIDC discovery URL of the external IdP that will issue tokens for trusted identity propagation. You can get issuer the URL as mentioned in step 1 of the preceding section Collect PingFederate information.
For Trusted token issuer name, enter a name to identify this TTI in Identity Center and in the application console.
Under Map attributes, do the following:
1. For the identity provider attribute, select an attribute from the list to map to an attribute in the Identity Center identity store. You can select Email, Object Identifier, Subject, and Other.
2. For Identity Center attribute, select the corresponding attribute for the attribute mapping.
Under Tags (optional), choose Add new tag, enter a value for Key, and optionally for Value. For information about tags, see Tagging AWS IAM Identity Center resources.

The following figure shows the set up for TTI:

Figure 12 : Configuring Trusted Token Issuer

Set up client connections and trusted token issuers in Amazon Redshift

In this step, the Amazon Redshift applications that exchange externally generated tokens must be configured to use the TTI you created in the previous step. Also, the audience claim (or aud claim) from PingFederate must be specified. In this example, you are configuring the Amazon Redshift application in the member account where the Amazon Redshift cluster or serverless instance exists.

Select IAM Identity Center connection from the Amazon Redshift console menu.
Select the Amazon Redshift application that you created as part of the prerequisites.
Select the Client connections tab and choose Edit.
Choose Yes under Configure client connections that use third-party IdPs.
Select the checkbox for Trusted token issuer that you created in the previous section.
Enter the Aud claim value under Configure selected trusted token issuers. For example, AWSIdentityCenter. You can get the audience value from the PingFederate path: Applications > OAuth > Access Token Mappings > TrustedTokenIssuerMgr > Summary > aud.
Choose Save.

Figure 13 : Configure Audience Value in Amazon Redshift

At this point, your IAM Identity Center, Amazon Redshift, and PingFederate configuration are complete. Next, you need to configure Tableau.

Configure Tableau OAuth config files for PingFederate to integrate with Amazon Redshift using IAM Identity Center

This XML file used in this section will be used for all the Tableau products like Tableau Desktop, Server and Cloud.

To integrate Tableau with Amazon Redshift using IAM Identity Center, you need to use a custom XML file. In this step, you will use the following XML and replace the values starting with a $ sign and highlighted in bold. The rest of the values can be kept as it is or you can modify them based on your specific needs. For detailed information on each of the elements in the file, see the Tableau documentation on GitHub.

You can get authUri and tokenUri as mentioned in step 1 of preceding section, Collect PingFederate information.

<?xml version="1.0" encoding="utf-8"?>
<pluginOAuthConfig>
  <dbclass>redshift</dbclass>
  <oauthConfigId>custom_redshift_pingfed</oauthConfigId>
  <clientIdDesktop></clientIdDesktop>
  <clientSecretDesktop></clientSecretDesktop>
  <redirectUrisDesktop>http://localhost:55556/Callback</redirectUrisDesktop>
  <redirectUrisDesktop>http://localhost:55557/Callback</redirectUrisDesktop>
  <redirectUrisDesktop>http://localhost:55558/Callback</redirectUrisDesktop>
  <authUri>https://.com/as/authorization.oauth2</authUri>
  <tokenUri>https://.com/as/token.oauth2</tokenUri>
  <scopes>openid</scopes>
  <scopes>email</scopes>
  <scopes>profile</scopes>
  <scopes>offline_access</scopes>
  <capabilities>
    <entry>
      <key>OAUTH_CAP_FIXED_PORT_IN_CALLBACK_URL</key>
      <value>true</value>
    </entry>
    <entry>
      <key>OAUTH_CAP_PKCE_REQUIRES_CODE_CHALLENGE_METHOD</key>
      <value>true</value>
    </entry>
    <entry>
      <key>OAUTH_CAP_REQUIRE_PKCE</key>
      <value>true</value>
    </entry>
    <entry>
      <key>OAUTH_CAP_SUPPORTS_STATE</key>
      <value>true</value>
    </entry>
    <entry>
      <key>OAUTH_CAP_CLIENT_SECRET_IN_URL_QUERY_PARAM</key>
      <value>true</value>
    </entry>
    <entry>
      <key>OAUTH_CAP_SUPPORTS_GET_USERINFO_FROM_ID_TOKEN</key>
      <value>true</value>
    </entry>
  </capabilities>
  <accessTokenResponseMaps>
    <entry>
      <key>ACCESSTOKEN</key>
      <value>access_token</value>
    </entry>
    <entry>
      <key>REFRESHTOKEN</key>
      <value>refresh_token</value>
    </entry>
    <entry>
      <key>id-token</key>
      <value>id_token</value>
    </entry>
    <entry>
      <key>access-token-issue-time</key>
      <value>issued_at</value>
    </entry>
    <entry>
      <key>access-token-expires-in</key>
      <value>expires_in</value>
    </entry>
    <entry>
      <key>username</key>
      <value>email</value>
    </entry>
  </accessTokenResponseMaps>
</pluginOAuthConfig>

The following is the example XML:

<?xml version="1.0" encoding="utf-8"?>
<pluginOAuthConfig>
  <dbclass>redshift</dbclass>
  <oauthConfigId>custom_redshift_pingfed</oauthConfigId>
  <clientIdDesktop>tableauredshiftpingfed</clientIdDesktop>
  <clientSecretDesktop></clientSecretDesktop>
  <redirectUrisDesktop>http://localhost:55556/Callback</redirectUrisDesktop>
  <redirectUrisDesktop>http://localhost:55557/Callback</redirectUrisDesktop>
  <redirectUrisDesktop>http://localhost:55558/Callback</redirectUrisDesktop>
  <authUri>https://pingfedserver.example.com/as/authorization.oauth2</authUri>
  <tokenUri>https://pingfedserver.example.com/as/token.oauth2</tokenUri>
  <scopes>openid</scopes>
  <scopes>email</scopes>
  <scopes>profile</scopes>
  <scopes>offline_access</scopes>
  <capabilities>
    <entry>
      <key>OAUTH_CAP_FIXED_PORT_IN_CALLBACK_URL</key>
      <value>true</value>
    </entry>
    <entry>
      <key>OAUTH_CAP_PKCE_REQUIRES_CODE_CHALLENGE_METHOD</key>
      <value>true</value>
    </entry>
    <entry>
      <key>OAUTH_CAP_REQUIRE_PKCE</key>
      <value>true</value>
    </entry>
    <entry>
      <key>OAUTH_CAP_SUPPORTS_STATE</key>
      <value>true</value>
    </entry>
    <entry>
      <key>OAUTH_CAP_CLIENT_SECRET_IN_URL_QUERY_PARAM</key>
      <value>true</value>
    </entry>
    <entry>
      <key>OAUTH_CAP_SUPPORTS_GET_USERINFO_FROM_ID_TOKEN</key>
      <value>true</value>
    </entry>
  </capabilities>
  <accessTokenResponseMaps>
    <entry>
      <key>ACCESSTOKEN</key>
      <value>access_token</value>
    </entry>
    <entry>
      <key>REFRESHTOKEN</key>
      <value>refresh_token</value>
    </entry>
    <entry>
      <key>id-token</key>
      <value>id_token</value>
    </entry>
    <entry>
      <key>access-token-issue-time</key>
      <value>issued_at</value>
    </entry>
    <entry>
      <key>access-token-expires-in</key>
      <value>expires_in</value>
    </entry>
    <entry>
      <key>username</key>
      <value>email</value>
    </entry>
  </accessTokenResponseMaps>
</pluginOAuthConfig>

Install Tableau OAuth config file on a client machine for Tableau Desktop

After the XML configuration file is created, it should be copied to a specific location to be used by Amazon Redshift Connector from Tableau Desktop. Save the preceding file as .xml and save it under Documents\My Tableau Repository\OAuthConfigs.

Note: Currently this integration is not supported in macOS because the Amazon Redshift ODBC 2.X Driver is not supported yet for MAC.

Install Tableau OAuth config file for a site on Tableau Server or Tableau Cloud

To integrate with Amazon Redshift using IAM Identity Center authentication, you need to install the Tableau OAuth config file in Tableau Server or Tableau Cloud.

Sign in to the Tableau Server or Tableau Cloud using admin credentials.
Navigate to Settings.
Go to OAuth Clients Registry and select Add OAuth Client.
Choose the following settings:
1. Connection type: Select Amazon Redshift.
2. OAuth Provider: Select Custom_IdP.
3. Client ID: Enter your IdP client ID value.
4. Client Secret: Enter your client secret value.
5. Redirect URL: Enter the value as http://localhost/auth/add_oauth_token. In this post, we are using localhost for testing in the local environment. You should ideally use the full hostname with https.
6. Choose OAuth Config File: Select the XML file that you configured in Configure Tableau Desktop.
7. Select Add OAuth Client and choose Save.

Figure 14: Create an OAuth connection in Tableau Server or Cloud

Federate to Amazon Redshift from Tableau Desktop using IAM Identity Center

Now, you’re ready to connect from Tableau and federated sign-in using IAM Identity Center authentication. In this step, you will create a Tableau Desktop report and publish it to Tableau Server.

Open Tableau Desktop.
Choose Amazon Redshift Connector and enter the following values:
1. Server: Enter the name of the server that hosts the database and the name of the database you want to connect to.
2. Port: Enter 5439.
3. Database: Enter your database name. In this example, we use dev.
4. Authentication: Select OAuth.
5. Federation Type: Select Identity Center
6. Identity Center Namespace: You can leave this blank.
7. OAuth Provider: This value should automatically be pulled from your configured XML. It will be the value from the element oauthConfigId.
8. Select checkbox for Require SSL.
Choose Sign-In.
A browser pop-up will initiate where you will enter your IdP credentials.

Figure 15: Tableau Desktop OAuth connection

When authentication is successful, you will see the message Tableau created this window to authenticate. It is now safe to close it.

Figure 16: Successful authentication using Tableau

Congratulations! You are signed in using the IAM Identity Center integration with Amazon Redshift and are ready to explore and analyze your data using Tableau Desktop.

Figure 17: Successful connection using Tableau Desktop

The following is a screenshot from Amazon Redshift system table (sys_query_history) showing that user Ethan from PingFederate is accessing the sales report.

Figure 18: User audit in sys_query_history

Now you can create your own Tableau Report on the desktop version and publish it to your Tableau Server. For the next section, you create and publish a report named Account Level Sales.

Federate to Amazon Redshift from Tableau Server using IAM Identity Center authentication

After you have published the report from Tableau Desktop to Tableau Server, sign in as non-admin user and view the published report using IAM Identity Center authentication.

Sign in to the Tableau Server site as a non-admin user.
Navigate to Explore and go to the folder where your published report is stored.
Select the report and choose Sign In.

Figure 19: Sign In Prompt on Tableau Cloud/Server

Enter your PingFederate credentials to the browser pop-up to authenticate.
After successful authentication, you can access the data and create reports.

Figure 20: Tableau report

Clean up

Complete the following steps to clean up your resources:

Delete the IdP applications that you created to integrate with IAM Identity Center.
Delete Identity Center configuration.
Delete the Amazon Redshift application and the Amazon Redshift provisioned cluster or Serverless instance that you created for testing.
Delete the IAM role and IAM policy that you created for Identity Center and Amazon Redshift integration.
Delete the permission set from Identity Center that you created for Amazon Redshift Query Editor v2 in the management account.
Clean up resources related to PingFederate.

Conclusion

This post covered streamlining access management for data analytics by using Tableau’s capability to support single sign-on based on the OAuth 2.0 and OIDC protocol. This setup facilitates federated user authentication, where user identities from an external identity provider like PingFederate are trusted and propagated to Amazon Redshift. You walked through the steps to configure Tableau Desktop and Tableau Server to integrate seamlessly with Amazon Redshift using AWS IAM Identity Center for single sign-on. By harnessing this integration of a third-party IdP with IAM Identity Center, analysts can securely access Amazon Redshift data sources within Tableau without managing separate database credentials.

Learn more about Amazon Redshift integration with IAM Identity Center using PingFederate as an identity provider by visiting the following resources.

About the authors

Tailor Amazon SageMaker Unified Studio project environments to your needs using custom blueprints

2025-09-18 Aditya Challa

Post Syndicated from Aditya Challa original https://aws.amazon.com/blogs/big-data/tailor-amazon-sagemaker-unified-studio-project-environments-to-your-needs-using-custom-blueprints/

Amazon SageMaker Unified Studio is a single data and AI development environment that brings together data preparation, analytics, machine learning (ML), and generative AI development in one place. By unifying these workflows, it saves teams from managing multiple tools and makes it straightforward for data scientists, analysts, and developers to build, train, and deploy ML models and AI applications while collaborating seamlessly.

In SageMaker Unified Studio, a project is a boundary where you can collaborate with other users to work on a business use case. A blueprint defines what AWS tools and services members of a project can use as they work with their data. Blueprints are defined by an administrator and are powered by AWS CloudFormation. Instead of manually piecing together project structures or workflow configurations, teams can rapidly spin up secure, compliant, and consistent analytics and AI environments. This streamlined approach significantly reduces setup time and provides standardized workspaces across the organization. Out of the box, SageMaker Unified Studio comes with several default blueprints.

We recently launched the custom blueprints feature in SageMaker Unified Studio. Organizations can now incorporate their specific dependencies, security controls using their own managed AWS Identity and Access Management (IAM) policies, and best practices, making it straightforward for them to align with internal standards. Because they’re defined through infrastructure as code (IaC), blueprints are straightforward to version control, share across teams, and evolve over time. This speeds up onboarding and keeps projects consistent and governed, no matter how big or distributed your data organization becomes.

For enterprises, this means more time focusing on insights, models, and innovation. The custom blueprints feature is designed to help teams move faster and stay consistent while maintaining their organization’s security controls and best practices. In this post, we show how to get started with custom blueprints in SageMaker Unified Studio.

Solution overview

We provide a CloudFormation template to implement a custom blueprint in SageMaker Unified Studio. The template deploys the following resources in the project environment:

AWS Glue database
Amazon Redshift Serverless namespace and workgroup
AWS Lake Formation permissions for the newly created project to access the AWS Glue database
Custom managed policies for AWS Glue and Amazon Redshift

Prerequisites

The post assumes you have a preexisting SageMaker Unified Studio domain. If you don’t have one, refer to Create a Amazon SageMaker Unified Studio domain – quick setup for instructions to create one.

Define reserved environment parameters

The CloudFormation template uses parameters that are reserved to your SageMaker environment, such as datazoneEnvironmentEnvironmentId, datazoneEnvironmentProjectId, s3BucketArn, and privateSubnets. These parameters are automatically populated by SageMaker when creating the project. The parameters also help in retrieving other environment variables, such as SecurityGroupIds, as shown in the following snippets.

The following code illustrates defining reserved environment parameters:

"Parameters": {
        "datazoneEnvironmentEnvironmentId": {
            "Type": "String",
            "Description": "EnvironmentId for which the resource will be created for."
        },
        "datazoneEnvironmentProjectId": {
            "Type": "String",
            "Description": "DZ projectId for which project the resource will be created for."
        },
        "s3BucketArn": {
            "Type": "String",
            "Description": "Project S3 Bucket ARN"
        },
        "privateSubnets": {
            "Type": "String",
            "Description": "Project Private Subnets"
        }
}

The following code illustrates using reserved environment parameters to import other necessary values:

"SecurityGroupIds": [
                    {
                        "Fn::ImportValue": {
                            "Fn::Join": [
                                "",
                                [
                                    "securityGroup-",
                                    {
                                        "Ref": "datazoneEnvironmentProjectId"
                                    },
                                    "-dev"
                                ]
                            ]
                        }
                    }
]

Attach custom IAM policies to project role

By default, SageMaker Unified Studio creates a project role and attaches several managed policies to the role. These managed policies are defined in the tooling blueprint. With custom blueprints, you can configure and attach your own IAM policies, in addition to the default policies, to the project role. To do this, include the IAM policies in your CloudFormation template and use the Export feature in the Outputs section, as shown in the following code. SageMaker Unified Studio gathers the policy information and adds it to the project role.

"GlueAccessManagedPolicy": {
            "Description": "ARN of the created managed policy",
            "Value": {
                "Ref": "GlueAccessManagedPolicy"
            },
            "Export": {
                "Name": {
                    "Fn::Sub": "datazone-managed-policy-glue-${glueDbName}-${datazoneEnvironmentEnvironmentId}"
                }
            }
        },
"RedshiftAccessManagedPolicy": {
            "Description": "ARN of the created Redshift managed policy",
            "Value": {
                "Ref": "RedshiftAccessManagedPolicy"
            },
            "Export": {
                "Name": {
                    "Fn::Sub": "datazone-managed-policy-redshift-${redshiftWorkgroupName}-${datazoneEnvironmentEnvironmentId}"
                }
            }
        }

Create custom blueprint

Complete the following steps to create a custom blueprint using the CloudFormation template:

On the Amazon SageMaker console, open the domain where you want to create a custom blueprint.
On the Blueprints tab, choose Create.
Under Name and description, enter a name and optional description.
Under Upload CloudFormation template, select Upload a template file and upload the provided template.
Choose Next.
SageMaker will automatically detect the reserved parameters defined in the template, as shown in the following screenshot.
For Editable parameters, edit the Value column if necessary, and specify whether the values can be editable at the time of project creation.
Choose Next.
As shown in the following screenshot, the reserved parameters described earlier are not shown on this page.
Select Enable blueprint.
Choose the provisioning role to be used by SageMaker to provision the environment resources.
Choose the domain units authorized to use the blueprint.
Choose Next.
Review the blueprint information and choose Create blueprint.

Create project profile

Complete the following steps to create a custom project profile that includes the custom blueprint created in the previous section:

On the SageMaker console, open your domain.
On the Project profiles tab, choose Create.
Enter the project profile name and optional description.
Select Custom create.
Choose the blueprints to be included in the project profile, including the custom blueprint you created in the previous section.
Choose the account and AWS Region to be used.
Choose the authorized users.
Select Enable project profile on creation.
Choose Create project profile.

Create project

Complete the following steps to create a new project that is based on the custom project profile and custom blueprint created in the previous sections:

In the SageMaker Unified Studio environment, choose Create project.
Enter a project name and optional description.
For Project profile, choose the profile created in the previous section.
Choose Continue.
On the Customize blueprint parameters page, review the parameters, modify as necessary, and choose Continue.
Review your selections and choose Create project.

SageMaker Unified Studio will create the project environments with the resources defined in your custom blueprint.

It will also attach the custom IAM policies defined and add them to the project role, as shown in the following screenshot.

Clean up

To avoid incurring additional costs, complete the following steps:

Delete the project you created in SageMaker Unified Studio.
Delete the custom project profile and custom blueprint you created.
Delete the CloudFormation template.

Conclusion

In this post, we discussed custom blueprints, a new option during administrator setup in SageMaker Unified Studio. We showed how to create new custom blueprints and create custom project profiles that include the newly created custom blueprints. We also demonstrated how to create projects that implement custom blueprints.

Custom blueprints in SageMaker Unified Studio are intended to streamline and standardize data, analytics and AI workflows. By helping organizations create templated environments with preconfigured resources, security controls, and best practices, custom blueprints can reduce setup time while providing consistency and compliance across projects.

Organizations can now enforce their specific security standards and access controls at the project level using the ability to incorporate custom IAM policies directly into these blueprints. This granular control over permissions helps organizations create projects that adhere to corporate security policies right from inception. Custom blueprints can help you scale analytics and AI/ML operations securely, by including tooling designed to version control these templates, share them across teams, and automatically apply custom IAM policies.

To learn more about custom blueprints in SageMaker Unified Studio, refer to Custom blueprints.

About the Authors

Enhance TLS inspection with SNI session holding in AWS Network Firewall

2025-09-18 Amit Gaur

Post Syndicated from Amit Gaur original https://aws.amazon.com/blogs/security/enhance-tls-inspection-with-sni-session-holding-in-aws-network-firewall/

AWS Network Firewall is a managed firewall service that filters and controls network traffic in Amazon Virtual Private Cloud (Amazon VPC). Unlike traditional network controls such as security groups or network access control lists (NACLs), Network Firewall can inspect and make decisions based on information from higher layers of the OSI model, including the Transport through Application layers. Furthermore, you can use the TLS inspection capability of Network Firewall to create firewall rules that match the content of encrypted TLS traffic. Network Firewall decrypts the traffic using your configured certificate and matches the decrypted payload against the rules in the firewall policy.

This post introduces Server Name Indication (SNI) session holding, which enhances TLS inspection by stopping TCP or TLS establishment packets from reaching the destination server until TLS inspection rules for SNI have been applied. When SNI is enabled, Network Firewall will not initiate an outbound TCP connection to the target until it has received the client hello and matched its domain information sent through SNI against firewall rules. The TCP session between the firewall and the upstream server is only initiated after the firewall validates traffic to that domain. This offers you additional security controls on outbound traffic with minimal latency and performance overheads, helping protect against malicious targets.

Network Firewall TLS inspection prior to SNI session holding

When TLS inspection is enabled, Network Firewall acts as an intermediary between the client and server, maintaining separate connections with each endpoint. Throughout this process, Network Firewall evaluates outbound traffic against configured rules to determine whether the traffic should be allowed to exit the firewall.As shown in Figure 1, the steps prior to availability of SNI session holding were:

The client creates a TCP connection, and Network Firewall evaluates the stateless rules to determine if the traffic is allowed. If not, the connection is terminated.
Network Firewall creates a TCP Connection to the destination server.
The client sends a ClientHello message, including SNI information, to Network Firewall. The firewall validates that the SNI is valid, otherwise the connection is terminated.
Network Firewall forwards the ClientHello message to the destination server.
The destination server responds with a ServerHello message and its certificate.
Network Firewall validates the certificates downloaded from the destination server.
At this point, the server name indication is validated against the certificate subject name.
Network Firewall forwards the server’s certificate to the client and completes the TLS connection with the client.
The client encrypts the application payload using the session keys it negotiated during TLS handshake and sends it to Network Firewall.
Network Firewall decrypts the traffic, uses its stateful engine to evaluate rules against the traffic, and determines if it is allowed.
If traffic is allowed, Network Firewall re-encrypts the application layer payload with the destination server’s session keys and forwards it to the destination server.
The destination server sends back response data to Network Firewall.
The Network Firewall stateful engine analyzes the destination server’s response.
Network Firewall forwards the server response to the client. The communication continues until the client or destination server terminates the connection.

Figure 1: Steps prior to availability of SNI session holding

With the current sequence of traffic inspection, the TCP connection is established before the TLS SNI field is evaluated, which could lead to a server learning about a connection before the firewall inspects the SNI.

For example, when customers configure rules to reject traffic based on TLS SNI fields (such as example.com), they expect these connections to be blocked before opening a connection to the destination server and before data transmission occurs. However, because of the inherent protocol sequence, TCP connections are briefly established before SNI rule validation takes place. This processing order creates a narrow window where sophisticated threat actors could potentially attempt to circumvent data exfiltration prevention controls, even with properly configured SNI-based blocking rules.

Session holding addresses this concern so that the traffic originating from within VPCs cannot connect to destination servers until Network Firewall verifies the TLS SNI.

How TLS inspection works with session holding

SNI session holding implements a two-step validation process. First, the firewall examines the TLS layer and validates the SNI when the client sends the TLS client hello message. After the message is approved, Network Firewall allows the connection to the destination server, permitting encrypted upper-layer protocols like HTTP or SMTP to initiate their negotiations. This approach creates a distinct separation between TLS validation and protocol inspection, where protocol examination only occurs after successful TLS handshake authorization.As shown in Figure 2, the steps in this scenario with SNI session holding are:

Note: Steps 2–5 are part of SNI session holding.

The client creates a TCP connection, and Network Firewall evaluates the stateless rules to determine if the traffic is allowed. If not, the connection is terminated.
The Client sends a ClientHello message including SNI information to Network Firewall. Network Firewall performs validation of the SNI.
The firewall evaluates the TLS inspection rules, including the SNI rules, to determine if the traffic is allowed. If not, the connection is terminated.
Network Firewall creates a TCP connection to the destination server.
Network Firewall forwards the ClientHello message to the destination server.
The destination server responds with a ServerHello message and its certificate.
Network Firewall validates the certificates downloaded from the destination server.
Network Firewall forwards the server’s certificate to the client and completes the TLS connection with the client.
The client encrypts the application payload using the session keys it negotiated during TLS handshake and sends it to Network Firewall.
Network Firewall decrypts the traffic, uses its stateful engine to evaluate rules against the traffic, and determines if it is allowed.
If traffic is allowed, Network Firewall re-encrypts the application layer payload with the destination server’s session keys and forwards it to destination server.
The destination server sends back response data to Network Firewall.
Network Firewall stateful engine analyzes the destination server response.
Network Firewall forwards the server response to the client. The communication continues until the client, or the destination server terminates the connection.

Figure 2: Steps after session holding

Getting started

Session holding can be enabled while creating a TLS inspection configuration directly within a Network Firewall policy using the AWS Management Console, AWS Command Line Interface (AWS CLI), or AWS SDK.

Prerequisites

To get started setting up a Network Firewall policy with session holding, visit the Network Firewall console or see the AWS Network Firewall Developers Guide. Session holding is supported in AWS Regions where Network Firewall is available today, including the AWS GovCloud (US) Regions and China Regions.

If this is your first time using Network Firewall, make sure to complete the following prerequisites. If you already have a firewall and TLS inspection configuration, you can skip this section.

Enable session holding

To enable session holding, follow the steps to create a firewall policy. On the step to Add TLS Inspection configuration, you will have an option to enable session holding by selecting the box as shown in Figure 3.

Figure 3: Enable session holding

After adding the TLS inspection configuration and selecting the box to enable session holding, continue to create the new firewall policy and then associate this policy to your firewall.

If you have an existing policy that is attached to a TLS inspection configuration, choose Manage TLS Inspection Configuration on your firewall policy.

Figure 4: TLS inspection configuration

This will provide the option to enable session holding as shown in figure 3.

Pricing

SNI session holding is included in the cost of TLS advanced inspection. For TLS advanced inspection pricing, see AWS Network Firewall pricing.

Considerations

When enabling the session holding, note the following considerations:

Keywords: Session holding is only applicable to Suricata rules using the TLS.SNI keyword. It does not apply to rules using other TLS application keywords, such as TLS.CERT or TLS.VERSION.
Performance: Because TCP connection establishment packets are held until the SNI validation is complete, session holding might introduce latency in the TCP connection establishment. You’ll notice the impact only when there is a surge in new TCP connections being inspected by Network Firewall with TLS inspection enabled.
Compatibility: TLS.SNI takes priority over http.host rules when session holding is enabled. When disabled, the traffic can match rules based on the http.host keyword and tls.sni keyword simultaneously, resulting in an outcome defined by the combination of the actions in these two types of rules. However, when this session holding is enabled, this traffic can only match the rule with TLS.SNI keyword and the rule with http.host keyword is applied only when the decrypted traffic has not matched other TLS.SNI-based pass rules.

Conclusion

As a preventive measure, this session holding helps make sure that SNI validation happens before a connection is established with the destination server, avoiding even initial contact with potentially malicious endpoints. For more information, see What is AWS Network Firewall?

If you have feedback about this post, submit comments in the Comments section below.

Unlock the power of Apache Iceberg v3 deletion vectors on Amazon EMR

2025-09-17 Arun Shanmugam

Post Syndicated from Arun Shanmugam original https://aws.amazon.com/blogs/big-data/unlock-the-power-of-apache-iceberg-v3-deletion-vectors-on-amazon-emr/

As modern data architectures expand, Apache Iceberg has become a widely popular open table format, providing ACID transactions, time travel, and schema evolution. In table format v2, Iceberg introduced merge-on-read, improving delete and update handling through positional delete files. These files improve write performance but can slow down reads when not compacted, since Iceberg must merge them during query execution to return the latest snapshot. Iceberg v3 enhances merge performance during reads by replacing positional delete files with deletion vectors for handling row-level deletes in Merge-on-Read (MoR) tables. This change deprecates the use of positional delete files in v3, which marked specific row positions as deleted, in favor of the more efficient deletion vectors.

In this post, we compare and evaluate the performance of the new binary deletion vectors in Iceberg v3 with respect to traditional position delete files of Iceberg v2 using Amazon EMR version 7.10.0 with Apache Spark 3.5.5. We provide insights into the practical impacts of these advanced row-level delete mechanisms on data management efficiency and performance.

Understanding binary deletion vectors and Puffin files

Binary deletion vectors stored in Puffin files use compressed bitmaps to efficiently represent which rows have been deleted within a data file. In contrast, previous Iceberg versions (v2) relied on positional delete files—Parquet files that enumerated rows to delete by file and position. This older approach resulted in many small delete files, which placed a heavy burden on query engines due to numerous file reads and costly in-memory conversions. Puffin files reduce this overhead by compactly encoding deletions, improving query performance and resource utilization.

Iceberg v3 improves this in the following aspects:

Reduced I/O – Fewer small delete files lower metadata overhead by introducing deletion vectors—compressed bitmaps that efficiently represent deleted rows. These vectors are stored persistently in Puffin files, a compact binary format optimized for low-latency access.
Query performance – Bitmap-based deletion vectors enable faster scan filtering by allowing multiple vectors to be stored in a single Puffin file. This reduces metadata and file count overhead while preserving file-level granularity for efficient reads. The design supports continuous merging of deletion vectors, promoting ongoing compaction that maintains stable query performance and reduces fragmentation over time. It removes the trade-off between partition-level and file-level delete granularity seen in v2, enabling consistently fast reads even in heavy-update scenarios.
Storage efficiency – Iceberg v3 uses a compressed binary format instead of verbose Parquet positioning. Engines maintain a single deletion vector per data file at write time, enabling better compaction and consistent query performance.

Solution overview

To explore the performance characteristics of delete operations in Iceberg v2 and v3, we use PySpark to run our comparison tests focusing on delete operation runtime and delete file size. This implementation helps us effectively benchmark and compare the deletion mechanisms between Iceberg v2’s position-delete files using Parquet and v3’s newer Puffin-based deletion vectors.

Our solution demonstrates how to configure Spark with the AWS Glue Data Catalog and Iceberg, create tables, and run delete operations programmatically. We first create Iceberg tables with format versions 2 and 3, insert 10,000 rows, then perform delete operations on a range of record IDs. We also perform table compaction and then measure delete operation runtime and size and count of associated delete files.

In Iceberg v3, deleting rows introduces binary deletion vectors stored in Puffin files (compact binary sidecar files). These allow more efficient query planning and faster read performance by consolidating deletes and avoiding large numbers of small files.

For this test, the Spark job was submitted by SSH’ing into the EMR cluster and using spark-submit directly from the shell, with the required Iceberg JAR file being referenced directly from the Amazon Simple Storage Service (Amazon S3) bucket in the submission command. When running the job, make sure you provide your S3 bucket name. See the following code:

spark-submit --jars s3://< S3-BUCKET-NAME >/iceberg/jars/iceberg-spark-runtime-3.5_2.12-1.9.2.jar v3_deletion_vector_test.py

Prerequisites

To follow along with this post, you must have the following prerequisites:

Amazon EMR on Amazon EC2 with version 7.10.0 integrated with the Glue Data Catalog, which includes Spark 3.5.5.
The Iceberg 1.9.2 JAR file from the official Iceberg documentation, which includes important deletion vector improvements such as v2 to v3 rewrites and dangling deletion vector detection. Optionally, you can use the default Iceberg 1.8.1-amzn-0 bundled with Amazon EMR 7.10 if these Iceberg 1.9.x improvements are not required.
An S3 bucket to store Iceberg data.
An AWS Identity and Access management (IAM) role for Amazon EMR configured with the necessary permissions.

The upcoming Amazon EMR 7.11 will ship with Iceberg 1.9.1-amzn-1, which includes deletion vector improvements such as v2 to v3 rewrites and dangling deletion vector detection. This means you no longer need to manually download or upload the Iceberg JAR file, because it will be included and managed natively by Amazon EMR.

Code walkthrough

The following PySpark script demonstrates how to create, write, compact, and delete records in Iceberg tables with two different format versions (v2 and v3) using the Glue Data Catalog as the metastore. The main goal is to compare both write and read performance, along with storage characteristics (delete file format and size) between Iceberg format versions 2 and 3.

The code performs the following functions:

Creates a SparkSession configured to use Iceberg with Glue Data Catalog integration.
Creates a synthetic dataset simulating user records:
- Uses a fixed random seed (42) to provide consistent data generation
- Creates identical datasets for both v2 and v3 tables for fair comparison
Defines the function test_read_performance(table_name) to perform the following actions:
- Measure full table scan performance
- Measure filtered read performance (with WHERE clause)
- Track record counts for both operations
Defines the function test_iceberg_table(version, test_df) to perform the following actions:
- Create or use an Iceberg table for the specified format version
- Append data to the Iceberg table
- Trigger Iceberg’s data compaction using a system procedure
- Delete rows with IDs between 1000–1099
- Collect statistics about inserted data files and delete-related files
- Measure and record read performance metrics
- Track operation timing for inserts, deletes, and reads
Defines a function to print a comprehensive comparative report including the following information:
- Delete operation performance
- Read performance (both full table and filtered)
- Delete file characteristics (formats, counts, sizes)
- Performance improvements as percentages
- Storage efficiency metrics
Orchestrate the main execution flow:
- Create a single dataset to ensure identical data for both versions
- Clean up existing tables for fresh testing
- Run tests for Iceberg format version 2 and version 3
- Output a detailed comparison report
- Handle exceptions and shut down the Spark session

See the following code:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import functions as F
import time
import random
import logging
from pyspark.sql.utils import AnalysisException
# Logging
logging.basicConfig(level=logging.INFO, format='%(message)s')
logger = logging.getLogger(__name__)
# Constants
ROWS_COUNT = 10000
DELETE_RANGE_START = 1000
DELETE_RANGE_END = 1099
SAMPLE_NAMES = ["Alice", "Bob", "Charlie", "Diana",
                "Eve", "Frank", "Grace", "Henry", "Ivy", "Jack"]
# Spark Session
spark = (
    SparkSession.builder
    .appName("IcebergWithGlueCatalog")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://<S3-BUCKET-NAME>/blog/glue/")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)
spark.sql("CREATE DATABASE IF NOT EXISTS glue_catalog.blog")
def create_dataset(num_rows=ROWS_COUNT):
    # Set a fixed seed for reproducibility
    random.seed(42)
    
    data = [(i,
             random.choice(SAMPLE_NAMES) + str(i),
             random.randint(18, 80))
            for i in range(1, num_rows + 1)]
    schema = StructType([
        StructField("id", IntegerType(), False),
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True)
    ])
    df = spark.createDataFrame(data, schema)
    df = df.withColumn("created_at", F.current_timestamp())
    return df
def test_read_performance(table_name):
    """Test read performance of the table"""
    start_time = time.time()
    count = spark.sql(f"SELECT COUNT(*) FROM glue_catalog.blog.{table_name}").collect()[0][0]
    read_time = time.time() - start_time
    
    # Test filtered read performance
    start_time = time.time()
    filtered_count = spark.sql(f"""
        SELECT COUNT(*) 
        FROM glue_catalog.blog.{table_name} 
        WHERE age > 30
    """).collect()[0][0]
    filtered_read_time = time.time() - start_time
    
    return read_time, filtered_read_time, count, filtered_count
def test_iceberg_table(version, test_df):
    try:
        table_name = f"iceberg_table_v{version}"
        logger.info(f"\n=== TESTING ICEBERG V{version} ===")
        spark.sql(f"""
            CREATE TABLE IF NOT EXISTS glue_catalog.blog.{table_name} (
                id int,
                name string,
                age int,
                created_at timestamp
            ) USING iceberg
            TBLPROPERTIES (
                'format-version'='{version}',
                'write.delete.mode'='merge-on-read'
            )
        """)
        start_time = time.time()
        test_df.writeTo(f"glue_catalog.blog.{table_name}").append()
        insert_time = time.time() - start_time
        logger.info("Compaction...")
        spark.sql(
            f"CALL glue_catalog.system.rewrite_data_files('glue_catalog.blog.{table_name}')")
        start_time = time.time()
        spark.sql(f"""
            DELETE FROM glue_catalog.blog.{table_name}
            WHERE id BETWEEN {DELETE_RANGE_START} AND {DELETE_RANGE_END}
        """)
        delete_time = time.time() - start_time
        files_df = spark.sql(
            f"SELECT COUNT(*) as data_files FROM glue_catalog.blog.{table_name}.files")
        delete_files_df = spark.sql(f"""
            SELECT COUNT(*) as delete_files,
                   file_format,
                   SUM(file_size_in_bytes) as total_size
            FROM glue_catalog.blog.{table_name}.delete_files
            GROUP BY file_format
        """)
        data_files = files_df.collect()[0]['data_files']
        delete_stats = delete_files_df.collect()
        # Add read performance testing
        logger.info("\nTesting read performance...")
        read_time, filtered_read_time, total_count, filtered_count = test_read_performance(table_name)
        
        logger.info(f"Insert time: {insert_time:.3f}s")
        logger.info(f"Delete time: {delete_time:.3f}s")
        logger.info(f"Full table read time: {read_time:.3f}s")
        logger.info(f"Filtered read time: {filtered_read_time:.3f}s")
        logger.info(f"Data files: {data_files}")
        logger.info(f"Total records: {total_count}")
        logger.info(f"Filtered records: {filtered_count}")
        if len(delete_stats) > 0:
            stats = delete_stats[0]
            logger.info(f"Delete files: {stats.delete_files}")
            logger.info(f"Delete format: {stats.file_format}")
            logger.info(f"Delete files size: {stats.total_size} bytes")
            return delete_time, stats.total_size, stats.file_format, read_time, filtered_read_time
        else:
            logger.info("No delete files found")
            return delete_time, 0, "N/A", read_time, filtered_read_time
    except AnalysisException as e:
        logger.error(f"SQL Error: {str(e)}")
        raise
    except Exception as e:
        logger.error(f"Error: {str(e)}")
        raise
def print_comparison_results(v2_results, v3_results):
    v2_delete_time, v2_size, v2_format, v2_read_time, v2_filtered_read_time = v2_results
    v3_delete_time, v3_size, v3_format, v3_read_time, v3_filtered_read_time = v3_results
    logger.info("\n=== PERFORMANCE COMPARISON ===")
    logger.info(f"v2 delete time: {v2_delete_time:.3f}s")
    logger.info(f"v3 delete time: {v3_delete_time:.3f}s")
    if v2_delete_time > 0:
        improvement = ((v2_delete_time - v3_delete_time) / v2_delete_time) * 100
        logger.info(f"v3 Delete performance improvement: {improvement:.1f}%")
    logger.info("\n=== READ PERFORMANCE COMPARISON ===")
    logger.info(f"v2 full table read time: {v2_read_time:.3f}s")
    logger.info(f"v3 full table read time: {v3_read_time:.3f}s")
    logger.info(f"v2 filtered read time: {v2_filtered_read_time:.3f}s")
    logger.info(f"v3 filtered read time: {v3_filtered_read_time:.3f}s")
    
    if v2_read_time > 0:
        read_improvement = ((v2_read_time - v3_read_time) / v2_read_time) * 100
        logger.info(f"v3 Read performance improvement: {read_improvement:.1f}%")
    
    if v2_filtered_read_time > 0:
        filtered_improvement = ((v2_filtered_read_time - v3_filtered_read_time) / v2_filtered_read_time) * 100
        logger.info(f"v3 Filtered read performance improvement: {filtered_improvement:.1f}%")
    logger.info("\n=== DELETE FILE COMPARISON ===")
    logger.info(f"v2 delete format: {v2_format}")
    logger.info(f"v2 delete size: {v2_size} bytes")
    logger.info(f"v3 delete format: {v3_format}")
    logger.info(f"v3 delete size: {v3_size} bytes")
    if v2_size > 0:
        size_reduction = ((v2_size - v3_size) / v2_size) * 100
        logger.info(f"v3 size reduction: {size_reduction:.1f}%")
# Main
try:
    # Create dataset once and reuse for both versions
    test_dataset = create_dataset()
    
    # Drop existing tables if they exist
    spark.sql("DROP TABLE IF EXISTS glue_catalog.blog.iceberg_table_v2")
    spark.sql("DROP TABLE IF EXISTS glue_catalog.blog.iceberg_table_v3")
    
    # Test both versions with the same dataset
    v2_results = test_iceberg_table(2, test_dataset)
    v3_results = test_iceberg_table(3, test_dataset)
    print_comparison_results(v2_results, v3_results)
finally:
    spark.stop()

Results summary

The output generated by the code includes the results summary section that shows several key comparisons, as shown in the following screenshot. For delete operations, Iceberg v3 uses the Puffin file format compared to Parquet in v2, resulting in significant improvements. The delete operation time decreased from 3.126 seconds in v2 to 1.407 seconds in v3, achieving a 55.0% performance improvement. Additionally, the delete file size was reduced from 1801 bytes using Parquet in v2 to 475 bytes using Puffin in v3, representing a 73.6% reduction in storage overhead. Read operations also saw notable improvements, with full table reads 28.5% faster and filtered reads 23% faster in v3. These improvements demonstrate the efficiency gains from v3’s implementation of binary deletion vectors through the Puffin format.

The actual measured performance and storage improvements depend on workload and environment and might differ from the preceding example.

This following screenshot from the S3 bucket demonstrates a Puffin delete file stored alongside data files.

Clean up

After you finish your tests, it’s important to clean up your environment to avoid unnecessary costs:

Drop the test tables you created to remove associated data from your S3 bucket and prevent ongoing storage charges.
Delete any temporary data left in the S3 bucket used for Iceberg data.
Delete the EMR cluster to stop billing for running compute resources.

Cleaning up resources promptly helps maintain cost-efficiency and resource hygiene in your AWS environment.

Considerations

Iceberg features are introduced through a phased process: first in the specification, then in the core library, and finally in engine implementations. Deletion vector support is currently available in the specification and core library, with Spark being the only supported engine. We validated this capability on Amazon EMR 7.10 with Spark 3.5.5.

Conclusion

Iceberg v3 introduces a significant advancement in managing row-level deletes for merge-on-read operations through binary deletion vectors stored in compact Puffin files. Our performance tests, conducted with Iceberg 1.9.2 on Amazon EMR 7.10.0 and EMR Spark 3.5.5, show clear improvements in both delete operation speed and read performance, along with a considerable reduction in delete file storage compared to Iceberg v2’s positional delete Parquet files. For more information about deletion vectors, refer to Iceberg v3 deletion vectors.

About the authors

Enhance the local testing experience for serverless applications with LocalStack

2025-09-17 Patrick Galvin

Post Syndicated from Patrick Galvin original https://aws.amazon.com/blogs/compute/enhance-the-local-testing-experience-for-serverless-applications-with-localstack/

Serverless applications often comprise multiple AWS services, such as AWS Lambda, Amazon Simple Queue Service (Amazon SQS), Amazon EventBridge, and Amazon DynamoDB. Although serverless architectures make it easy to build applications that are generally simple to operate and scale, testing them requires extra steps for developers. Recently, AWS brought you the capability to help developers remotely debug Lambda functions to accelerate the development process. Today, we’re excited to announce new capabilities that further simplify the local testing experience for Lambda functions and serverless applications through integration with LocalStack, an AWS Partner, in the AWS Toolkit for Visual Studio Code.

In this post, we will show you how you can enhance your local testing experience for serverless applications with LocalStack using AWS Toolkit.

Challenges with local serverless development

When building serverless applications with infrastructure as code (IaC) tools like the AWS Serverless Application Model (AWS SAM), developers often face challenges during local integration testing of applications that depend on interactions across multiple AWS services. These friction points slow down the critical code-test-debug cycle. Developers might encounter the following common roadblocks:

Cloud-based validation slows iteration – Developers previously needed to deploy AWS SAM templates to the cloud to test changes, introducing delays in feedback loops. AWS research shows that developers spend considerable time on deployment and testing, rather than writing code.
Tool context switching adds friction – Developers routinely shift between integrated development environments (IDEs), command line interfaces (CLIs), and resource emulators like LocalStack, leading to fragmented workflows.
Manual setup increases configuration complexity – Port mapping and code edits for local service integration tests can introduce inconsistencies between local and cloud environments.
Service integration debugging is limited – Troubleshooting Lambda functions in the context of AWS service integrations, such as DynamoDB, Amazon Simple Storage Service (Amazon S3), or Amazon SQS, requires manual configuration, extending the duration of troubleshooting efforts.

These challenges directly impact developer productivity and make local testing of integrated serverless applications complex.

Solution overview

Starting today, AWS helps simplify local serverless development by integrating LocalStack directly into the AWS Toolkit for VS Code. This integration helps developers test and debug serverless applications—defined using IaC tools like AWS SAM—entirely within their IDE. The enhanced local testing experience delivers four major improvements:

Integrated LocalStack experience – Connect to LocalStack directly within VS Code and manage local resources alongside cloud resources through a unified interface.
Emulated service interactions – Test Lambda functions with their interactions with other AWS services like Amazon SQS, DynamoDB, and EventBridge locally.
Simplified debugging – Start debugging sessions with LocalStack emulated environment, with a single click – no manual port configurations or code changes required, streamlining the debugging workflow.
Streamlined workflow – Deploy, test, and debug serverless applications without leaving the IDE, avoiding context switching between tools.

To set up LocalStack in VS Code (either the free version supporting over 30 core services like Lambda, Amazon S3, DynamoDB, Amazon SQS, and Amazon API Gateway, or the Ultimate version with over 110 services and advanced debugging features) you need essential development tools, including Docker, the AWS Command Line Interface (AWS CLI), AWS SAM CLI, and your preferred IDE such as VS Code. This combination enables full local integration testing of AWS services, including Lambda functions, messaging queues, databases, event-driven architectures, and serverless workflows, so you can develop and test your entire AWS application stack locally before deploying to the cloud.

Automated setup process

LocalStack is a cloud service emulator that you can use to run AWS applications locally for testing and development. To enhance your local testing capabilities, you can install the LocalStack VSCode Extension directly from AWS Walkthrough in AWS Toolkit, which offers a streamlined setup process through an intelligent wizard. After installation, the extension automatically detects whether LocalStack is configured on your system and prompts you to run the setup wizard through a notification. The entire process is quick and requires no manual configuration.

LocalStack extension has an integrated authentication wizard, that simplifies the process of connecting your development environment to LocalStack. During setup, the wizard opens a browser-based authentication flow and maintains an active connection until authentication completes. After it’s verified, it securely stores the authentication token in the ~/.localstack/auth.json file, enabling communication between your local environment and LocalStack services.

The wizard also checks if LocalStack AWS CLI profiles exist, and if not found, automatically creates them by updating the ~/.aws/config and ~/.aws/credentials files with LocalStack-specific endpoints and credentials. This seamless integration of AWS profiles enhances the development workflow by allowing developers to easily switch between different AWS environments, including the local LocalStack setup. By leveraging these profiles, developers can effortlessly point their AWS CLI or SDK to the appropriate endpoint, whether it’s a real AWS account or the LocalStack instance running on their machine. This configuration not only ensures a clear separation between local and cloud environments but also minimizes the risk of cross-environment interference. The automatic creation of these profiles streamlines the setup process, reducing manual configuration errors and saving valuable development time. Visual Studio Code (VS Code) provides real-time feedback throughout the setup. The status bar initially displays an error or warning indicator when LocalStack is not configured and then transitions to a normal or connected state once a successful connection is established. After setup completes, you’re ready to deploy, test, and debug serverless applications locally—without additional configuration. These settings persist across VS Code sessions, so the setup process is a one-time task.The following figure illustrates the process to start and verify LocalStack from VS Code.

To learn more, including installation steps, configuration examples, and troubleshooting guidance, visit the LocalStack Docs.

Test a serverless application

To demonstrate the enhanced local testing capabilities, let’s explore a practical serverless pattern: building and testing an event-driven order processing system that integrates Lambda with Amazon SQS, API Gateway, and Amazon Simple Notification Service (Amazon SNS). The application processes orders through an event-driven workflow: orders are submitted through API Gateway to an SQS queue and processed by a Lambda function, and the status is published to Amazon SNS to trigger customer email notifications.

After you set up LocalStack in VS Code, you can test your entire serverless workflow without deploying to the cloud:

Deploy locally – Use the LocalStack AWS profile to deploy your AWS SAM application. The process mirrors cloud deployment but targets local endpoints. You can use the Application builder pane to initiate the deployment to LocalStack environment. The following figure illustrates the process of deploying a sample serverless application.

Debug Lambda function deployed in LocalStack – Set breakpoints in your Lambda function and step through execution using VS Code’s integrated debugger. With the AWS Toolkit extension, you can invoke your Lambda with one click and inspect live interactions across services, all while running against a LocalStack container on your machine. This setup makes it possible to debug your AWS applications in a controlled, local environment that mimics the cloud infrastructure, without the need for deploying actual AWS services.

Validate end-to-end Flows – Test complete workflows from message ingestion through processing and notification, confirming all service integrations work correctly before cloud deployment.

For an in-depth technical demonstration of this LocalStack integration, refer to this youtube video.

Best practices for local Lambda function testing

In this section, we discuss various strategies and best practices for local Lambda function testing.

Optimizing your development workflow

Consider the following strategies to optimize your development workflow:

Start with a strong testing foundation – Use the AWS SAM CLI to perform unit tests that validate the core programmatic and business logic of your Lambda functions. Isolating function behavior early helps identify logic errors before introducing external dependencies.
Establish environment parity early in the development process – Many production issues stem from discrepancies between local and cloud environments. Use consistent service versions, configurations, and data structures across environments to confirm that what works locally behaves the same in production.
Adopt IaC from day one – Whether you choose AWS SAM, AWS CloudFormation, or another IaC framework, defining your application infrastructure as code reduces configuration drift and makes your deployments reproducible across teams and environments.
Apply a progressive testing strategy – Follow a structured testing pyramid that starts with fast, isolated unit tests and builds up to broader integration and system-level validation. This layered approach helps you catch issues earlier—when they’re easier and less expensive to fix—while still providing full application coverage.

A strategic approach to testing

Testing should be an integrated part of your serverless development workflow—not an afterthought. Successful teams implement layered testing strategies that use both local and cloud environments to strike a balance between speed and accuracy:

Begin with unit tests that focus on isolated function logic. Use tools like the AWS SAM CLI, AWS Toolkit for VS Code and LocalStack extensions to run and debug functions locally.
After validation, proceed to local integration testing using LocalStack to confirm how your Lambda functions interact with services such as Amazon SQS, DynamoDB, and Amazon SNS. These tests typically complete within minutes and catch most service integration issues before they reach production.
After local testing, validate your application in the actual AWS environment. Cloud testing helps surface issues not present in local emulation, such as AWS Identity and Access Management (IAM) permission mismatches, Amazon Virtual Private Cloud (Amazon VPC) networking challenges, or service-specific nuances such as Lambda concurrency. For troubleshooting issues in the cloud environment, you can also remotely debug your Lambda functions using AWS Toolkit for VS Code.
Lastly, conduct performance testing in AWS to assess how your application handles real-world traffic. These longer-running tests help validate scaling behavior and system resilience under load.

The result is higher-quality applications delivered faster, with fewer production surprises and more confident deployments.

Security considerations

When using LocalStack for local development, follow these security best practices:

Isolate the local environment – Use Docker networking to restrict LocalStack access and bind services to localhost to prevent external connections.
Use placeholder credentials – Use test credentials (for example, test/test) instead of real AWS credentials.
Protect your data – Use synthetic or anonymized datasets instead of production data and regularly purge local data stores to reduce risk.

When to use local versus cloud testing

Although local testing offers significant advantages, it’s important to understand when to use it versus testing in the cloud. The following table lists the potential use cases for each strategy.

Testing Scenario	Local Testing	Cloud Testing	Reason
Function logic validation	✓		Fast feedback for core business logic
Service integration testing	✓		Quick validation of AWS service interactions
Rapid iteration during development	✓		Immediate feedback without deployment overhead
Cost-sensitive development environments	✓		Minimizes cloud resource costs during development
Offline development scenarios	✓		No internet connectivity required
Performance and scalability testing		✓	Requires actual AWS infrastructure for accurate results
IAM permission validation		✓	LocalStack doesn’t fully replicate IAM behavior
VPC networking scenarios		✓	Network configurations can’t be accurately emulated
Production-like load testing		✓	Real performance metrics only available in AWS
Final validation before deployment		✓	Supports compatibility with actual AWS environment

Conclusion

In this post, we discussed how to streamline local testing for AWS Serverless applications using LocalStack and the AWS Toolkit for VS Code. By running and debugging serverless applications directly in your IDE, you can reduce context switching, test complex integrations locally, and catch issues earlier—without deploying to the cloud.

We also showed how to apply progressive testing strategies that combine local emulation with cloud validation, optimize development costs, and build event-driven workflows with confidence.These enhancements lead to faster test cycles, lower development costs, and higher-quality deployments—all while staying fully in control of your development environment.

Have questions or feedback about this post? Connect with us on the AWS Compute Blog or join the AWS Developer community.

Automate OIDC client secret rotation with Application Load Balancer

2025-09-17 Kani Murugan

Post Syndicated from Kani Murugan original https://aws.amazon.com/blogs/security/automate-oidc-client-secret-rotation-with-application-load-balancer/

Elastic Load Balancing simplifies authentication by offloading it to OpenID Connect (OIDC) compatible identity providers (IdPs). This lets builders focus on application logic while using robust identity management.

OIDC client secrets are confidential credentials used in OAuth 2.0 and OIDC protocols for authenticating clients (applications). However, manual management of OIDC client secrets introduces security risks and operational overhead.
As shown in Figure 1, manual management of OIDC client secrets starts with authentication through a third-party IdP.

Figure 1: Manual management of OIDC client secrets

The risks of manual management of OIDC client secrets include:

Exposure of plaintext credentials
The need for manual intervention to adjust the Application Load Balancer (ALB) configuration
Lack of proactive monitoring of credential changes
Lack of continued verification of authentication credentials
Not scalable for ALB configuration with multiple listener rules

In this blog post I show you how to automate OIDC client secret rotation using AWS Secrets Manager, AWS Lambda, and Amazon EventBridge, helping to enhance security and streamline operations. Automating secret rotation is a critical security practice that minimizes the risk of credential compromise and helps facilitate ongoing compliance.

For the ALB-OIDC authentication setup, see Authenticate users using an Application Load Balancer.

Solution overview

This solution provides a flexible framework for automated credential management across various OIDC providers (Auth0 as an example), with a specific implementation demonstrating integration with AWS services. The core architecture supports automated credential rotation, secure secret storage, provider agnostic design, and scalable implementation across different authentication workflows. The key components are:

Secrets Manager: Securely stores and manages OIDC (Auth0) client credentials.
Lambda: Executes the secret rotation logic on a scheduled basis.
Elastic Load Balancing: Offloads authentication using OIDC listener rules.
EventBridge (scheduled): Triggers the Lambda function according to a defined schedule.
Custom AWS CloudFormation resource: Automates the entire stack and architecture used in this post.

Figure 2: Automated OIDC client secret rotation

The authentication workflow, as shown in Figure 2, is:

EventBridge triggers the Auth0CredentialHandler Lambda handler every 15 minutes
The Auth0CredentialHandler Lambda handler connects to the Auth0 management domain and gets the current client credentials—auth0_current.
The Auth0CredentialHandler Lambda handler fetches the existing credentials auth0/credentials/${Auth0-dev-domain} from Secrets Manager and compares them with the credential auth0_current retrieved in the previous step.
- If the secret isn’t found, the handler retries three times within a 30-minute period and then logs AWS CloudWatch alarms.
- Assumes that the secret Amazon Resource Name (ARN) is already present in Secrets Manager.
If the credentials are different, Auth0CredentialHandler updates the auth0/credentials/${Auth0-dev-domain} with the new value. If the credentials are the same, no action is taken. CloudWatch alarms are configured to trigger for successful and for failed secret updates.
The ALB listener rule is configured to pull client credentials dynamically from the auth0/credentials/${Auth0-dev-domain} resource ARN in Secrets Manager.

Security recommendations

There are several things you can do to improve the security of your authentication system, starting with implementing centralized secret management with encryption enabled for data at rest. You can also configure Lambda functions with least-privilege permissions, limiting access to only required Secrets Manager and ALB listener resources, which can reduce the security blast radius.

Use CloudWatch alarms to monitor key operational events, including secret updates, update failures, and ALB credential issues and use AWS Config to track rule configurations and perform regular security audits.

By creating separate secrets for each ALB listener rule, you can enable granular access control and narrow the scope of permissions, helping to enhance overall system security.

By following these practices, you can establish a robust security framework for your application and provide proper data protection and access management.

Prerequisites

This solution assumes that the following prerequisites are met before beginning implementation:

An existing ALB configured with a listener and target groups to be used as Listenerarn and targetarn in the CloudFormation template
An OIDC IdP (for example, Auth0) account and client application
Auth0 IdP application client credentials stored in Secrets Manager

      {
       "domain": "your-tenant.auth0.com",
       "client_id": "your-client-id",
       "client_secret": "your-client-secret"
       }

Implementation details

Note: This solution demonstrates OIDC client secret rotation using Auth0 as the IdP. While the core principles and architectural patterns are generally applicable, specific implementation details might vary across different identity providers. Users are advised to consult their specific IdP’s documentation for precise configuration steps, API interactions, and AWS compatible authentication mechanisms.

This is an automated, simple and scalable approach using a CloudFormation custom resource to create the resources mentioned in architecture diagram. The CloudFormation template and AWS Lambda implementation are hosted in demo-stack

Core components

In this section, I explain the key components of the solution.

Credential refresh rule

An EventBridge rule is scheduled to trigger the Auth0CredentialHandler Lambda function at 15-minute intervals using the LambdaInvokePermission AWS Identity and Access Management (IAM) role.

Auth0CredentialHandler Lambda function

The Auth0CredentialHandler Lambda function is responsible for securely managing client credentials. It retrieves the Auth0 configuration from the Secrets Manager resource auth0/credentials/${Auth0-dev-domain}, makes API calls to the Auth0 domain to obtain new tokens, and manages the updating of these credentials in Secrets Manager. It requires permissions to interact with Secrets Manager, which are provided through its execution role.

This IAM role used by Lambda has two main permission sets.

The AWS managed policy AWSLambdaBasicExecutionRole, which allows the Lambda function to create CloudWatch logs.
A custom policy that grants specific Secrets Manager permissions (GetSecretValue, CreateSecret, UpdateSecret) for secrets under the auth0/credentials/${Auth0-dev-domain} path.

Lambda will retry three times within a 30 minute period. If all attempts fail, then a CloudWatch warning will be logged and create alarms.

  ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Policies:
        - PolicyName: SecretsManagerAccess
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - secretsmanager:GetSecretValue
                  - secretsmanager:CreateSecret
                  - secretsmanager:UpdateSecret
                Resource:
                  - !Sub 
arn:aws:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:auth0/*

# Permission for Amazon EventBridge to invoke Lambda
  
  LambdaInvokePermission:
    Type: AWS::Lambda::Permission
    Properties:
      Action: lambda:InvokeFunction
      FunctionName: !Ref Auth0CredentialHandler
      Principal: events.amazonaws.com
      SourceArn: !GetAtt CredentialRefreshRule.Arn

ALB listener rules

Elastic Load Balancing listener rule resources in CloudFormation are configured to dynamically resolve the client credentials from Secrets Manager and forwards authenticated requests to a specific target group. It integrates with the Auth0 credentials that are regularly refreshed by the Auth0CredentialHandler. This configuration requires read access to Secrets Manager to obtain the Auth0 client credentials for authentication.

# ALB Listener Rules - replace the Oidc config with your endpoints. Only Client credentials are stored in SecretsManager
  ListenerRule1:
    Type: AWS::ElasticLoadBalancingV2::ListenerRule
    Properties:
      ListenerArn: arn:aws:elasticloadbalancing:region:account-id:listener/app/my-load-balancer/1234567890/abcdef
      Priority: 1
      Actions:
        - Type: authenticate-oidc
          AuthenticateOidcConfig:
            ClientId: 
'{{resolve:secretsmanager:auth0/credentials/your-tenant.auth0.com:SecretString:client_id}}'
            ClientSecret: 
'{{resolve:secretsmanager:auth0/credentials/your-tenant.auth0.com:SecretString:client_secret}}'
            Issuer: https://idp1.example.com
            AuthorizationEndpoint: https://idp1.example.com/auth
            TokenEndpoint: https://idp1.example.com/token
            UserInfoEndpoint: https://idp1.example.com/userinfo
            OnUnauthenticatedRequest: authenticate
        - Type: forward
          TargetGroupArn: 
arn:aws:elasticloadbalancing:region:account-id:targetgroup/target-group-1/1234567890abc
      Conditions:
        - Field: path-pattern
          Values:
            - /app1/*

CloudWatch monitoring and alerting

The provided CloudFormation template is configured to establish security monitoring for secret updates. The template provisions alerts for successful and failed secret updates. The template creates CloudWatch metric filters using AWS CloudTrail logs, sets up corresponding alarms with defined thresholds, and establishes an Amazon Simple Notification Service (Amazon SNS) topic for consolidated alert delivery. Upon deployment, this infrastructure-as-code solution enables automated detection and notification of potential security events related to secrets management and unauthorized access attempts.

  # CloudWatch Log Group
  CloudTrailLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: secrets-manager-monitoring
      RetentionInDays: 14
      
  # Combined Metric Filter for Both Success and Failed Updates
  SecretUpdateMetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      LogGroupName: !Ref CloudTrailLogGroup
      FilterPattern: !Sub '{ $.eventSource = secretsmanager.amazonaws.com && ($.eventName = UpdateSecret || $.eventName = PutSecretValue) && $.responseElements.ARN = "${MyCustomResource.SecretArn}" }'
      MetricTransformations:
        - MetricNamespace: 'SecretsManager/Updates'
          MetricName: 'SecretUpdates'
          MetricValue: '1'
          DefaultValue: 0

  # Combined Alarm for Both Success and Failed Updates
  SecretUpdateAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub '${AWS::StackName}-secret-update'
      AlarmDescription: !Sub 'Alarm for any updates (success or failure) to secret ${MyCustomResource.SecretArn}'
      MetricName: SecretUpdates
      Namespace: SecretsManager/Updates
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions:
        - !Ref SecretMonitoringTopic

To enhance the reliability of the secret rotation process, implement comprehensive monitoring by creating CloudWatch alarms to detect Lambda rotation failures beyond threshold and high rates of authentication failures, unusual spikes in HTTP 4xx and 5xx error rates from ALB and using CloudTrail to track API calls and configuration changes related to secrets in Secrets Manager and load balancer settings. By implementing these custom alarms alongside standard configurations, potential security incidents and unauthorized access attempts can be quickly detected across your AWS resources. This multi-layered approach helps maintain visibility into the rotation process and helps quickly identify and respond to potential issues.

See Creating CloudWatch alarms for CloudTrail events: examples for detailed guidance.

Deployment process

Deploy the CloudFormation template using the AWS Command Line Interface (AWS CLI) or AWS Management Console. Replace <your-region> with the AWS Region where you want to deploy the solution.

aws cloudformation deploy \
  --template-file template.yaml \
  --stack-name oidc-credential-manager-stack \
  --capabilities CAPABILITY_IAM \
  --region

Note: You can add additional parameters if required by your IdP configuration.

Testing and verification

Disclaimer: It’s recommended to test in a separate non-critical environment to make sure that any customer-specific settings are fully verified before deploying in production environments.

For secret updates, verify that the configured CloudWatch alarms are triggered. For ALB authentication, examine ALB access logs for authentication_success entries and the presence of OIDC identity tokens.

Set up CloudWatch metrics and alarms to monitor the rotation process and authentication success rates.Verify failure cases by manually editing ALB rule configuration to point to a different secret ARN and confirm that the CloudWatch alarm is triggered.The following is an example CloudTrail event for a successful Secrets Manager update:

{
  "source": ["aws.secretsmanager"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["secretsmanager.amazonaws.com"],
    "eventName": ["UpdateSecret"],
    "responseElements": {"status": "Success"}
  }
}

The following is an example of ALB access logs:

/aws/alb/<your-alb-name>:
- Look for entries containing:
  "authentication_success"
  "id_token_authentication_successful"
  "x-amzn-oidc-identity"
  HTTP status code 200
- Example log pattern:
  timestamp elb_name client:port target:port request_processing_time 
  target_processing_time response_processing_time status_code 
  "authentication_success" "x-amzn-oidc-identity: [token]"

Advanced scenarios

In this section, you learn how to reduce the wait time and make the Secrets Manager update nearly synchronous.

Optimize the Secrets Manager sync: Use EventBridge partner integration to configure EventBridge to invoke the Lambda function based on the events received from third-party IdP. See Receiving events from a SaaS partner with Amazon EventBridge for detailed guidance.
Rotate the client ID: While rotating the client secret is the most common scenario, there might be instances where rotating the client ID is also necessary. In most identity providers, this means creating a new application client and migrating resources. To do this, the Auth0CredentialHandler requires permissions to modify ALB listener rules (elasticloadbalancing:ModifyRule, elasticloadbalancing:DescribeListeners, elasticloadbalancing:DescribeRules). Client ID rotation can cause temporary authentication disruptions, so thorough testing is crucial. Use AWS Config to monitor ALB rule configurations for unexpected changes. This feature empowers a more comprehensive security posture, although it can increase the complexity of the solution and might require manual intervention.
Multi-provider strategies: If your organization handles multiple IdPs, implement a centralized rotation framework that abstracts provider-specific nuances, focusing on core security principles outlined in this post. Key considerations include creating provider-agnostic interfaces to support comprehensive monitoring and minimizing configuration overhead.

Conclusion

In this post, you explored a comprehensive approach to automating OIDC client secret rotation using AWS services. By implementing this solution, you can enhance your application’s security, reduce manual management overhead, and maintain a robust authentication strategy.

Consider exploring advanced identity management techniques or integrating multi-factor authentication with your OIDC implementation. If you are new to automated secrets rotation, visit Back to Basics: Secrets Management.

If you have questions or feedback about this post, contact AWS Support.

Get started with Amazon OpenSearch Service: T-shirt size your domain for log analytics

2025-09-16 Harsh Bansal

Post Syndicated from Harsh Bansal original https://aws.amazon.com/blogs/big-data/get-started-with-amazon-opensearch-service-t-shirt-size-your-domain-for-log-analytics/

When you’re spinning up your Amazon OpenSearch Service domain, you need to figure out the storage, instance types, and instance count; decide the sharding strategies and whether to use a cluster manager; and enable zone awareness. Generally, we consider storage as a guideline for determining instance count, but not other parameters. In this post, we offer some recommendations based on T-shirt sizing for log analytics workloads.

Log analytics and streaming workload characteristics

When you use OpenSearch Service for your streaming workloads, you send data from one or more sources into OpenSearch Service. OpenSearch Service indexes your data in an index that you define.

Log data naturally follows a time series pattern, and therefore a time-based indexing strategy (daily or weekly indexes) is recommended. For efficient management of log data, you must implement time-based index patterns and set retention periods. You further define time slicing and a retention period for the data to manage its lifecycle in your domain.

For illustration, consider that you have a data source producing a continuous stream of log data, and you’ve configured a daily rolling index and set a retention period of 3 days. As the logs arrive, OpenSearch Service creates an index per day with names like stream1_2025.05.21, stream1_2025.05.22, and so on. The prefix stream1_* is what we call an index pattern, a naming convention that helps group-related indexes.

The following diagram shows three primary shards for each daily index. These shards are deployed across three OpenSearch Service data instances, with one replica for each primary shard. (For simplicity, the diagram doesn’t show that primary and replica shards are always placed on different instances for fault tolerance.)

When OpenSearch Service processes new log entries, they are sent to all relevant primary shards and their replicas in the active index, which in this example is only today’s index due to the daily index configuration.

There are several important characteristics of how OpenSearch Service processes your new entries:

Total shard count – Each index pattern will have a D * P * (1 + R) total shards, where D represents retention in days, P represents primary shards, and R is the number of replicas. These shards are distributed across your data nodes.
Active index – Time slicing means that new log entries are only written to today’s index.
Resource utilization – When sending a _bulk request with log entries, these are distributed across all shards in the active index. In our example with three primary shards and one replica per shard, that’s a total of six shards processing new data simultaneously, requiring 6 vCPUs to efficiently handle a single _bulk request.

Similarly, OpenSearch Service distributes queries across the shards for the indexes involved. If you query this index pattern across all 3 days, you will engage 9 shards, and need 9 vCPUs to process the request.

This will get even more complicated when you add in more data streams and index patterns. For each additional data stream or index pattern, you deploy shards for each of the daily indexes and use vCPUs to process requests in proportion to the shards deployed, as shown in the preceding diagram. When you make concurrent requests to more than one index, each shard for all the indexes involved must process those requests.

Cluster capacity

As the number of index patterns and concurrent requests increases, you can quickly overwhelm the cluster’s resources. OpenSearch Service includes internal queues that buffer requests and mitigate this concurrency demand. You can monitor these queues using the _cat/thread_pool API, which shows queue depths and helps you understand when your cluster is approaching capacity limits.

Another complicating dimension is that the time to process your updates and queries depends on the contents of the updates and queries. As requests come in, the queues are filling at the rate you are sending them. They are draining at a rate that is governed by the available vCPUs, the time they take on each request, and the processing time for that request. You can interleave more requests if those requests clear in a millisecond than if they clear in a second. You can use the _nodes/stats OpenSearch API to monitor average load on your CPUs. For more information about the query phases, refer to A query, or There and Back Again on the OpenSearch blog.

If you see the queue depths increasing, you are moving into a “warning” area, where the cluster is handling load. But if you continue, you can start to exceed the available queues and must scale to add more CPUs. If you start to see load increasing, which is correlated with queue depth increasing, you are also in a “warning” area and should consider scaling.

Recommendations

For sizing a domain, consider the following steps:

Determine the storage required – Total storage = (daily source data in bytes × 1.45) × (number_of_replicas + 1) × number of days retained. This accounts for the additional 45% overhead on daily source data, broken down as follows:
- 10% for larger index size than source data.
- 5% for operating system overhead (reserved by Linux for system recovery and disk defragmentation protection).
- 20% for OpenSearch reserved space per instance (segment merges, logs, and internal operations).
- 10% for additional storage buffer (minimizes impact of node failure and Availability Zone outages).
Define the shard count – Approximate number of primary shards = storage size required per index / desired shard size. Round up to the nearest multiple of your data node count to maintain even distribution. For more detailed guidance on shard sizing and distribution strategies, refer to “Amazon OpenSearch Service 101: How many shards do I need” For log analytics workloads, consider the following:
- Recommended shard size: 30–50 GB
- Optimal target: 50 GB per shard
Calculate CPU requirements – Recommended ratio is 1.25 vCPU:1 Shard for lower data volumes. Higher ratios are recommended for larger volumes. Target utilization is 60% average, 80% maximum.
Choose the right instance type – Consider the following based on your nodes:
- Cluster manager nodes: M family AWS Graviton instances (all workload sizes)
- Data nodes (small to large workloads): M or R family AWS Graviton instances with Amazon Elastic Block Store (Amazon EBS)
- Data nodes (very large workloads): I family instances with NVMe SSDs

Let’s look at an example for domain sizing. The initial requirements are as follows:

Daily log volume: 3 TB
Retention period: 3 months (90 days)
Replica count: 1

We make the following instance calculation.

The following table recommends instances, amount of source data, storage needed for 7 days of retention, and active shards based on the preceding guidelines.

T-Shirt Size	Data (Per Day)	Storage Needed (with 7 days Retention)	Active Shards	Data Nodes	Primary Nodes
XSmall	10 GB	175 GB	2 @ 50 GB	3 * r7g.large. search	3 * m7g.large. search
Small	100 GB	1.75 TB	6 @ 50 GB	3 * r7g.xlarge. search	3 * m7g.large. search
Medium	500 GB	8.75 TB	30 @ 50 GB	6 * r7g.2xlarge.search	3 * m7g.large. search
Large	1 TB	17.5 TB	60 @ 50 GB	6 * r7g.4xlarge.search	3 * m7g.large. search
XLarge	10 TB	175 TB	600 @ 50 GB	30 * i4g.8xlarge	3 * m7g.2xlarge.search
XXL	80 TB	1.4 PB	2400 @ 50 GB	87 * I4g.16xlarge	3 * m7g.4xlarge.search

As with all sizing recommendations, these guidelines represent a starting point and are based on assumptions. Your workload will differ, and so your actual needs will differ from these recommendations. Make sure to deploy, monitor, and adjust your configuration as needed.

For T-shirt sizing the workloads, an extra-small use case encompasses 10 GB or less of data per day from a single data stream to a single index pattern. A small use case falls between 10–100 GB per day of data, a medium use case between 100–500 GB of data, and so on. Default instance count per domain is 80 for most of the instance family. Refer to the “Amazon OpenSearch Service quotas “ for details.

Additionally, consider the following best practices:

Choose the right storage tier [Ultra Warm, Hot storage] for your needs in OpenSearch Service. Refer to the Choose the right storage tier for your needs in Amazon OpenSearch Service for details.
Use the OpenSearch Optimized (OR) family for large-scale workloads that require low latency with index-heavy operations. Refer to OpenSearch optimized instance (OR1) is game changing for indexing performance and cost for details.
Isolate the ingestion using an OpenSearch Ingestion pipeline for smaller to large workloads and reduce the operational overhead of managing the ingestion pipelines.
Use reserved instances for long-term cost savings.
Consider using Availability Zone awareness for high availability.

Conclusion

This post provided comprehensive guidelines for sizing your OpenSearch Service domain for log analytic workloads, covering several critical aspects. These recommendations serve as a solid starting point, but each workload has unique characteristics. For optimal performance, consider implementing additional optimizations like data tiering and storage tiers. Evaluate cost-saving options such as reserved instances, and scale your deployment based on actual performance metrics and queue depths.By following these guidelines and actively monitoring your deployment, you can build a well-performing OpenSearch Service domain that meets your log analytics needs while maintaining efficiency and cost-effectiveness.

About the authors

Amazon SageMaker introduces Amazon S3 based shared storage for enhanced project collaboration

2025-09-16 Hari Ramesh

Post Syndicated from Hari Ramesh original https://aws.amazon.com/blogs/big-data/amazon-sagemaker-introduces-amazon-s3-based-shared-storage-for-enhanced-project-collaboration/

AWS recently announced that Amazon SageMaker now offers Amazon Simple Storage Service (Amazon S3) based shared storage as the default project file storage option for new Amazon SageMaker Unified Studio projects. This feature addresses the deprecation of AWS CodeCommit while providing teams with a straightforward and consistent way to collaborate on project files across the integrated development tools in SageMaker.

This new Amazon S3 storage option provides the following benefits:

Simplified collaboration – File sharing between project members directly without Git operations
Universal access – Consistent file access across SageMaker tools (JupyterLab, Query Editor, Visual ETL)
Clear workspace separation – Built-in personal storage separation with Amazon Elastic Block Store (Amazon EBS) volumes
Global availability – Available in AWS Regions where SageMaker is supported

Although Amazon S3 is the default option for file storage, you can also use Git version control for more robust source control capabilities.

In this post, we discuss this new feature and how to get started using Amazon S3 shared storage in SageMaker Unified Studio.

Solution overview

When you create a new SageMaker Unified Studio domain, the service automatically configures Amazon S3 storage as your default project storage option. Each project receives a dedicated shared location in Amazon S3, accessible to project members, following the structure [bucket]/[domain-id]/[project-id]/shared/.

SageMaker tools JupyterLab and Code Editor provide the following to users:

A personal EBS volume for individual work in JupyterLab and Code Editor tools
A mounted shared folder containing the project’s Amazon S3 shared storage
Clear separation between personal and shared spaces

The shared storage is accessible across SageMaker integrated development tools:

JupyterLab and Code Editor show shared files along with personal files
Query Editor filters for relevant SQL notebooks
Visual ETL provides direct access to shared extract, transform, and load (ETL) workflows

Files saved to the shared location are immediately visible and available to project members. Users can continue working with personal files in their EBS volumes in tools like JupyterLab and Code Editor and explicitly move files to shared storage when ready to collaborate.If you want to use Git for collaboration, you can continue to do so by integrating projects with your GitHub version control, GitLab version control, or managed Bitbucket repositories.

Migration and version control options

For teams currently using Amazon CodeCommit, existing projects will remain fully functional. New projects will default to Amazon S3 storage. If you want to have version control for Amazon S3 based projects, you can enable versioning in Amazon S3 directly.

Prerequisites

You will need to complete the following prerequisites before you can follow the instructions in the next section:

Sign up for an AWS account.
Create a user with administrative access.
Enable IAM Identity Center in the same AWS Region you want to create your SageMaker Unified Studio domain. Confirm in which Region SageMaker Unified Studio is currently available. Set up your IdP and synchronize identities and groups with IAM Identity Center. For more information, refer to IAM Identity Center Identity source tutorials.

Get started with Amazon S3 shared storage

To begin using Amazon S3 shared storage, complete the following steps:

Create a new SageMaker Unified Studio domain.
Create a new project (Amazon S3 storage is the default file storage option).
Open the new project and choose JupyterLab from the Build menu.
Save the new notebook you just created.
Rename the file.

After the project is saved, project users can view the saved notebook in the Project files section under the S3 path [bucket]/[domain-id]/[project-id]/shared/.

Enable version control using Git

To enable version control using Git, complete the following steps:

On the SageMaker console, create a new project profile.
Provide the necessary details for your project profile.
In the Project files storage section, the Amazon S3 option is selected by default. To enable version control for the project, you can use existing Git repository connections by selecting Git repository.

Use shared storage in Query Editor

To use the shared storage feature in Query Editor, complete the following steps:

Choose Query Editor from the Build menu.
Compose your query, and on the Actions menu, choose Save to save the query to shared storage.
Navigate back to the Project files section, where you can view the query notebook files under the S3 path [bucket]/[domain-id]/[project-id]/shared/.

Use shared storage in Visual ETL flows

To use the shared storage feature in Visual ETL flows, complete the following steps:

Choose Visual ETL flows from the Build menu.
Develop your ETL workflow and save the code to the project.
Navigate back to the Project files section, where you can view the files under the S3 path [bucket]/[domain-id]/[project-id]/shared/jobs/uploads/<ETL name>.

Clean up

Make sure you remove the SageMaker Unified Studio resources to mitigate any unexpected costs. This involves a few steps:

Delete the projects.
Delete the domain.
Delete the S3 bucket named amazon-datazone-AWSACCOUNTID-AWSREGION-DOMAINID

Conclusion

The launch of Amazon S3 shared storage in SageMaker represents another step in simplifying the analytics and machine learning (ML) development experience for our customers. By reducing the complexity of Git operations while maintaining robust collaboration capabilities, teams can now focus on building and deploying analytics and ML solutions faster. The feature is now available in Regions where SageMaker is available.

For detailed information about this feature, including setup instructions and best practices, refer to Unified storage in Amazon SageMaker Unified Studio. Share your feedback on this feature in the comments section.

About the Authors

Multi-Region keys: A new approach to key replication in AWS Payment Cryptography

2025-09-16 Ruy Cavalcanti

Post Syndicated from Ruy Cavalcanti original https://aws.amazon.com/blogs/security/multi-region-keys-a-new-approach-to-key-replication-in-aws-payment-cryptography/

In our previous blog post (Part 1 of our key replication series), Automatically replicate your card payment keys across AWS Regions, we explored an event-driven, serverless architecture using AWS PrivateLink to securely replicate card payment keys across AWS Regions. That solution demonstrated how to build a custom replication framework for payment cryptography keys.

Based on customer feedback requesting a more automated, no-code approach, we’re excited to announce an additional option to this capability with Multi-Region keys for AWS Payment Cryptography in Part 2 of our series.

By using this new feature, you can automatically synchronize payment cryptography keys from a primary Region to other Regions that you select, improving resilience and availability of payment applications. You can also choose between account-level replication or key-level replication, giving more flexibility in how to manage payment keys across Regions.

Multi-Region keys: Overview and benefits

The new Multi-Region key replication feature for AWS Payment Cryptography offers you flexible control over your key replication strategy through the following primary capabilities:

Control whether keys are replicated
Select specific Regions for key replication
Manage replication configuration changes
Configure either account-level or key-level replication to meet business needs

Multi-Region keys help deliver several benefits for global payment operations, including:

Improved availability: Access your payment keys even if a Region becomes unavailable
Disaster recovery: Maintain business continuity with replicated keys across Regions
Global operations: Support payment processing across multiple geographic regions
Simplified management: Centralized control with distributed availability
Consistent key IDs: The same key ID across Regions simplifies application development

Configuration options

Payment Cryptography provides two distinct methods for configuring Multi-Region key replication, giving flexibility to implement a strategy that best fits your organization’s needs. You can choose between a broad, account-level approach or a more granular, key-level method.

Account-level

With account-level configuration, AWS automatically replicates exportable symmetric keys created in your Payment Cryptography account from your designated primary Region to other Regions you specify. This simplifies key management in multi-Region deployments, provides consistent key availability in the Regions that you specify, and reduces the operational overhead of key management.

To configure account-level replication using the AWS Command Line Interface (AWS CLI), use the new enable-default-key-replication-regions API to set the Regions where AWS will replicate your keys. To remove Regions from your default replication list, use the disable-default-key-replication-regions API.

Note: Only symmetric keys created after the account-level replication is enabled will be replicated.

Key-level replication

By using key-level replication, you can achieve more granular control by:

Designating specific keys as multi-Region keys
Defining custom replication targets for each multi-Region key
Maintaining Region-specific keys when needed

Note: Within each Region, Payment Cryptography maintains redundancy of your keys across multiple Availability Zones for high availability. Multi-Region key replication extends across geographic boundaries, giving you additional resilience against Regional outages while maintaining control over where your keys are stored.

You can specify replication Regions during key creation using the --replication-regions parameter, using the AWS CLI, with the create-key or import-key APIs. For existing keys, you can use the new add-key-replication-regions and remove-key-replication-regions APIs to manage which regions receive your replicated keys.

Important: When you specify replication Regions during key creation, these settings take precedence over default replication Regions configured at the account level.

How it works

Figure 1 shows the process when you replicate a key in Payment Cryptography.

The key is created in your designated primary Region
Payment Cryptography automatically replicates the key material asynchronously to the specified replica Regions
The replicated keys maintain the same key ID across Regions; only the Region portion of the Amazon Resource Name (ARN) changes
The key in the primary Region is marked with MultiRegionKeyType: PRIMARY
Keys in replica Regions are marked with MultiRegionKeyType: REPLICA and include a reference to the primary Region
When deleting a key, its deletion cascades from the primary to replica Regions

Figure 1: Representation of key replication from us-east-1 to us-west-2

Example: Creating a multi-Region key at key level

The following is an example of creating a card verification key (CVK) in the primary Region (us-east-1) with replication to us-west-2:

aws payment-cryptography create-key \
--exportable \
--key-attributes KeyAlgorithm=TDES_2KEY,\
KeyUsage=TR31_C0_CARD_VERIFICATION_KEY,\
KeyClass=SYMMETRIC_KEY,KeyModesOfUse='{Generate=true,Verify=true}' \
--region us-east-1 \
--replication-regions us-west-2

The response shows the key being created with replication in progress:

{
  "Key": {
    "KeyArn": "arn:aws:payment-cryptography:us-east-1:111122223333:key/qs6643jl4ohibtqk",
    "KeyAttributes": {
      "KeyUsage": "TR31_C0_CARD_VERIFICATION_KEY",
      "KeyClass": "SYMMETRIC_KEY",
      "KeyAlgorithm": "TDES_2KEY",
      "KeyModesOfUse": {
        "Encrypt": false,
        "Decrypt": false,
        "Wrap": false,
        "Unwrap": false,
        "Generate": true,
        "Sign": false,
        "Verify": true,
        "DeriveKey": false,
        "NoRestrictions": false
      }
    },
    "KeyCheckValue": "CC5EE2",
    "KeyCheckValueAlgorithm": "ANSI_X9_24",
    "Enabled": true,
    "Exportable": true,
    "KeyState": "CREATE_COMPLETE",
    "KeyOrigin": "AWS_PAYMENT_CRYPTOGRAPHY",
    "CreateTimestamp": "2025-08-21T15:25:54.475000-03:00",
    "UsageStartTimestamp": "2025-08-21T15:25:54.287000-03:00",
    "MultiRegionKeyType": "PRIMARY",
    "ReplicationStatus": {
      "us-west-2": {
        "Status": "IN_PROGRESS"
      }
    },
    "UsingDefaultReplicationRegions": false
  }
}

After replication completes, the status updates to SYNCHRONIZED:

aws payment-cryptography get-key \
--key-identifier arn:aws:payment-cryptography:us-east-1:111122223333:key/qs6643jl4ohibtqk \
--region us-east-1

{
    "Key": {
        "KeyArn": "arn:aws:payment-cryptography:us-east-1:111122223333:key/qs6643jl4ohibtqk",
        "KeyAttributes": {
            "KeyUsage": "TR31_C0_CARD_VERIFICATION_KEY",
            "KeyClass": "SYMMETRIC_KEY",
            "KeyAlgorithm": "TDES_2KEY",
            "KeyModesOfUse": {
                "Encrypt": false,
                "Decrypt": false,
                "Wrap": false,
                "Unwrap": false,
                "Generate": true,
                "Sign": false,
                "Verify": true,
                "DeriveKey": false,
                "NoRestrictions": false
            }
        },
        "KeyCheckValue": "CC5EE2",
        "KeyCheckValueAlgorithm": "ANSI_X9_24",
        "Enabled": true,
        "Exportable": true,
        "KeyState": "CREATE_COMPLETE",
        "KeyOrigin": "AWS_PAYMENT_CRYPTOGRAPHY",
        "CreateTimestamp": "2025-08-21T15:25:54.475000-03:00",
        "UsageStartTimestamp": "2025-08-21T15:25:54.287000-03:00",
        "MultiRegionKeyType": "PRIMARY",
        "ReplicationStatus": {
            "us-west-2": {
                "Status": "SYNCHRONIZED"
            }
        },
        "UsingDefaultReplicationRegions": false
    }
}

You can then access the key in the replica Region (us-west-2) using the same key ID and changing only the Region name:

aws payment-cryptography get-key \
--key-identifier arn:aws:payment-cryptography:us-west-2:111122223333:key/qs6643jl4ohibtqk \
--region us-west-2

The response shows the replica key with a reference to the primary Region:

{
    "Key": {
        "KeyArn": "arn:aws:payment-cryptography:us-west-2:111122223333:key/qs6643jl4ohibtqk",
        "KeyAttributes": {
            "KeyUsage": "TR31_C0_CARD_VERIFICATION_KEY",
            "KeyClass": "SYMMETRIC_KEY",
            "KeyAlgorithm": "TDES_2KEY",
            "KeyModesOfUse": {
                "Encrypt": false,
                "Decrypt": false,
                "Wrap": false,
                "Unwrap": false,
                "Generate": true,
                "Sign": false,
                "Verify": true,
                "DeriveKey": false,
                "NoRestrictions": false
            }
        },
        "KeyCheckValue": "CC5EE2",
        "KeyCheckValueAlgorithm": "ANSI_X9_24",
        "Enabled": true,
        "Exportable": true,
        "KeyState": "CREATE_COMPLETE",
        "KeyOrigin": "AWS_PAYMENT_CRYPTOGRAPHY",
        "CreateTimestamp": "2025-08-21T15:25:54.475000-03:00",
        "UsageStartTimestamp": "2025-08-21T15:25:54.287000-03:00",
        "MultiRegionKeyType": "REPLICA",
        "PrimaryRegion": "us-east-1"
    }
}

Things to consider

When using multi-Region keys, several important aspects should be considered. Multi-Region key replication supports only symmetric keys with the exportable attribute enabled, and asymmetric keys are not supported. For billing purposes, AWS bills per key per Region, which means replicating to three Regions incurs costs for the primary key plus costs for each key in the replica Regions.

Key aliases and tags require separate management in each Region because they are not part of the replication process. While primary keys support modifications and updates, replica keys are read-only copies that support only cryptographic operations. Modifications must be made to the key in the primary Region, and Payment Cryptography automatically propagates these changes to the replica Regions. Monitor the replication status to confirm successful synchronization of these changes.

The deletion process for multi-Region keys follows specific behavior patterns that are important to understand. When a primary key is scheduled for deletion, associated replica keys are deleted immediately. The primary key enters a pending deletion state with a minimum 3-day waiting period, during which the deletion can be canceled. However, if you restore the primary key by canceling its deletion, you will need to re-enable replication to recreate the replica keys in your desired Regions. After the 3-day waiting period expires, the primary key is permanently deleted and becomes unrecoverable. Note that deleting a replica key affects only that specific Region and does not impact the primary key or other replica keys.

Multi-Region key replication operates with eventual consistency. When creating new keys or making changes to existing keys, these updates might not appear immediately across all Regions. Applications should be designed to handle this eventual consistency model and not assume immediate availability of keys or key changes in replica Regions. If your application requires strong consistency, implement polling mechanisms using the GetKey API to verify that changes have been synchronized before proceeding with key operations.

Logging and monitoring

Payment Cryptography logs API activity through AWS CloudTrail, which now includes new events and attributes specific to Multi-Region key replication.

New CloudTrail event

The service logs a new event type called SynchronizeMultiRegionKey, which appears in primary and replica Regions.

Primary Region events:

Two SynchronizeMultiRegionKey events are logged in the primary Region for each replication Region defined:

One event related to a key export process.

{
    "eventVersion": "1.11",
    "userIdentity": {
        "accountId": "111122223333",
        "invokedBy": "payment-cryptography.amazonaws.com"
    },
    "eventTime": "2025-08-21T18:25:56Z",
    "eventSource": "payment-cryptography.amazonaws.com",
    "eventName": "SynchronizeMultiRegionKey",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "payment-cryptography.amazonaws.com",
    "userAgent": "payment-cryptography.amazonaws.com",
    "requestParameters": null,
    "responseElements": null,
    "eventID": "fbae27f1-f2ad-49d1-ab05-d460b0b4ca25",
    "readOnly": false,
    "eventType": "AwsServiceEvent",
    "managementEvent": true,
    "recipientAccountId": "111122223333",
    "serviceEventDetails": {
        "keyArn": "arn:aws:payment-cryptography:us-east-1:111122223333:key/qs6643jl4ohibtqk",
        "replicationRegion": "us-west-2",
        "replicationType": "ExportKeyReplica"
    },
    "eventCategory": "Management"
}

One event related to a key import process.

{
    "eventVersion": "1.11",
    "userIdentity": {
        "accountId": "111122223333",
        "invokedBy": "payment-cryptography.amazonaws.com"
    },
    "eventTime": "2025-08-21T18:25:56Z",
    "eventSource": "payment-cryptography.amazonaws.com",
    "eventName": "SynchronizeMultiRegionKey",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "payment-cryptography.amazonaws.com",
    "userAgent": "payment-cryptography.amazonaws.com",
    "requestParameters": null,
    "responseElements": null,
    "eventID": "5c06716f-88ea-4315-b633-5dde83d7232c",
    "readOnly": false,
    "eventType": "AwsServiceEvent",
    "managementEvent": true,
    "recipientAccountId": "111122223333",
    "serviceEventDetails": {
        "keyArn": "arn:aws:payment-cryptography:us-east-1:111122223333:key/qs6643jl4ohibtqk",
        "replicationRegion": "us-west-2",
        "replicationType": "ImportKeyReplica"
    },
    "eventCategory": "Management"
}

Replica Region events:

One SynchronizeMultiRegionKey event is logged as an import key process in each replicated Region.

{
    "eventVersion": "1.11",
    "userIdentity": {
        "accountId": "111122223333",
        "invokedBy": "payment-cryptography.amazonaws.com"
    },
    "eventTime": "2025-08-21T18:25:56Z",
    "eventSource": "payment-cryptography.amazonaws.com",
    "eventName": "SynchronizeMultiRegionKey",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "payment-cryptography.amazonaws.com",
    "userAgent": "payment-cryptography.amazonaws.com",
    "requestParameters": null,
    "responseElements": null,
    "eventID": "0a952017-dd89-435e-8959-5de7b43c86d5",
    "readOnly": false,
    "eventType": "AwsServiceEvent",
    "managementEvent": true,
    "recipientAccountId": "111122223333",
    "serviceEventDetails": {
        "keyArn": "arn:aws:payment-cryptography:us-west-2:111122223333:key/qs6643jl4ohibtqk",
        "replicationRegion": "us-west-2",
        "replicationType": "ImportKeyReplica"
    },
    "eventCategory": "Management"
}

New CloudTrail event attributes

New attributes were included in the service key management APIs. The following are examples of the CreateKey API highlighting the new attributes.

One CreateKey event in the primary Region:

{
    "eventVersion": "1.11",
...
    "eventTime": "2025-08-21T18:25:54Z",
    "eventSource": "payment-cryptography.amazonaws.com",
    "eventName": "CreateKey",
    "awsRegion": "us-east-1",
...
    "requestParameters": {
        "keyAttributes": {
            "keyUsage": "TR31_C0_CARD_VERIFICATION_KEY",
            "keyClass": "SYMMETRIC_KEY",
            "keyAlgorithm": "TDES_2KEY",
            "keyModesOfUse": {
                "encrypt": false,
                "decrypt": false,
                "wrap": false,
                "unwrap": false,
                "generate": true,
                "sign": false,
                "verify": true,
                "deriveKey": false,
                "noRestrictions": false
            }
        },
        "exportable": true,
        "replicationRegions": [
            "us-west-2"
        ]
    },
    "responseElements": {
        "key": {
            "keyArn": "arn:aws:payment-cryptography:us-east-1:111122223333:key/qs6643jl4ohibtqk",
            "keyAttributes": {
                "keyUsage": "TR31_C0_CARD_VERIFICATION_KEY",
                "keyClass": "SYMMETRIC_KEY",
                "keyAlgorithm": "TDES_2KEY",
                "keyModesOfUse": {
                    "encrypt": false,
                    "decrypt": false,
                    "wrap": false,
                    "unwrap": false,
                    "generate": true,
                    "sign": false,
                    "verify": true,
                    "deriveKey": false,
                    "noRestrictions": false
                }
            },
            "keyCheckValue": "CC5EE2",
            "keyCheckValueAlgorithm": "ANSI_X9_24",
            "enabled": true,
            "exportable": true,
            "keyState": "CREATE_COMPLETE",
            "keyOrigin": "AWS_PAYMENT_CRYPTOGRAPHY",
            "createTimestamp": "Aug 21, 2025, 6:25:54 PM",
            "usageStartTimestamp": "Aug 21, 2025, 6:25:54 PM",
            "multiRegionKeyType": "PRIMARY",
            "replicationStatus": {
                "us-west-2": {
                    "status": "IN_PROGRESS"
                }
            },
            "usingDefaultReplicationRegions": false
        }
    },
...
}

One CreateKey event in a replica Region:

{
    "eventVersion": "1.11",
    "userIdentity": {
...
        "invokedBy": "payment-cryptography.amazonaws.com"
    },
    "eventTime": "2025-08-21T18:25:54Z",
    "eventSource": "payment-cryptography.amazonaws.com",
    "eventName": "CreateKey",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "payment-cryptography.amazonaws.com",
    "userAgent": "payment-cryptography.amazonaws.com",
    "requestParameters": {
        "keyAttributes": {
            "keyUsage": "TR31_C0_CARD_VERIFICATION_KEY",
            "keyClass": "SYMMETRIC_KEY",
            "keyAlgorithm": "TDES_2KEY",
            "keyModesOfUse": {
                "encrypt": false,
                "decrypt": false,
                "wrap": false,
                "unwrap": false,
                "generate": true,
                "sign": false,
                "verify": true,
                "deriveKey": false,
                "noRestrictions": false
            }
        },
        "exportable": true,
        "enabled": true
    },
    "responseElements": {
        "key": {
            "keyArn": "arn:aws:payment-cryptography:us-west-2:111122223333:key/qs6643jl4ohibtqk",
            "keyAttributes": {
                "keyUsage": "TR31_C0_CARD_VERIFICATION_KEY",
                "keyClass": "SYMMETRIC_KEY",
                "keyAlgorithm": "TDES_2KEY",
                "keyModesOfUse": {
                    "encrypt": false,
                    "decrypt": false,
                    "wrap": false,
                    "unwrap": false,
                    "generate": true,
                    "sign": false,
                    "verify": true,
                    "deriveKey": false,
                    "noRestrictions": false
                }
            },
            "keyCheckValue": "CC5EE2",
            "keyCheckValueAlgorithm": "ANSI_X9_24",
            "enabled": true,
            "exportable": true,
            "keyState": "CREATE_COMPLETE",
            "keyOrigin": "AWS_PAYMENT_CRYPTOGRAPHY",
            "usageStartTimestamp": "Aug 21, 2025, 6:25:54 PM"
        }
    },
...
}

Getting started

To start using Multi-Region key replication in Payment Cryptography:

Determine your primary Region.
Determine your replica Regions and if you will use account-level or key-level configuration.
Create new exportable symmetric keys or update existing keys to use the Multi-Region key replication feature.
Update your applications to use the consistent key IDs across Regions.

Conclusion

The new Multi-Region key replication feature in Payment Cryptography enhances our automatic key replication capabilities, providing improved resilience and simplified management for global payment applications. This feature helps make sure your payment cryptography keys are available when and where you need them, with the flexibility to choose between account-level or key-level replication strategies.

For more information about AWS Payment Cryptography, visit https://aws.amazon.com/payment-cryptography/.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Break down data silos and seamlessly query Iceberg tables in Amazon SageMaker from Snowflake

2025-09-15 Nidhi Gupta

Post Syndicated from Nidhi Gupta original https://aws.amazon.com/blogs/big-data/break-down-data-silos-and-seamlessly-query-iceberg-tables-in-amazon-sagemaker-from-snowflake/

Organizations often struggle to unify their data ecosystems across multiple platforms and services. The connectivity between Amazon SageMaker and Snowflake’s AI Data Cloud offers a powerful solution to this challenge, so businesses can take advantage of the strengths of both environments while maintaining a cohesive data strategy.

In this post, we demonstrate how you can break down data silos and enhance your analytical capabilities by querying Apache Iceberg tables in the lakehouse architecture of SageMaker directly from Snowflake. With this capability, you can access and analyze data stored in Amazon Simple Storage Service (Amazon S3) through AWS Glue Data Catalog using an AWS Glue Iceberg REST endpoint, all secured by AWS Lake Formation, without the need for complex extract, transform, and load (ETL) processes or data duplication. You can also automate table discovery and refresh using Snowflake catalog-linked databases for Iceberg. In the following sections, we show how to set up this integration so Snowflake users can seamlessly query and analyze data stored in AWS, thereby improving data accessibility, reducing redundancy, and enabling more comprehensive analytics across your entire data ecosystem.

Business use cases and key benefits

The capability to query Iceberg tables in SageMaker from Snowflake delivers significant value across multiple industries:

Financial services – Enhance fraud detection through unified analysis of transaction data and customer behavior patterns
Healthcare – Improve patient outcomes through integrated access to clinical, claims, and research data
Retail – Increase customer retention rates by connecting sales, inventory, and customer behavior data for personalized experiences
Manufacturing – Boost production efficiency through unified sensor and operational data analytics
Telecommunications – Reduce customer churn with comprehensive analysis of network performance and customer usage data

Key benefits of this capability include:

Accelerated decision-making – Reduce time to insight through integrated data access across platforms
Cost optimization – Accelerate time to insight by querying data directly in storage without the need for ingestion
Improved data fidelity – Reduce data inconsistencies by establishing a single source of truth
Enhanced collaboration – Increase cross-functional productivity through simplified data sharing between data scientists and analysts

By using the lakehouse architecture of SageMaker with Snowflake’s serverless and zero-tuning computational power, you can break down data silos, enabling comprehensive analytics and democratizing data access. This integration supports a modern data architecture that prioritizes flexibility, security, and analytical performance, ultimately driving faster, more informed decision-making across the enterprise.

Solution overview

The following diagram shows the architecture for catalog integration between Snowflake and Iceberg tables in the lakehouse.

Catalog integration to query Iceberg tables in S3 bucket using Iceberg REST Catalog (IRC) with credential vending

The workflow consists of the following components:

Data storage and management:
- Amazon S3 serves as the primary storage layer, hosting the Iceberg table data
- The Data Catalog maintains the metadata for these tables
- Lake Formation provides credential vending
Authentication flow:
- Snowflake initiates queries using a catalog integration configuration
- Lake Formation vends temporary credentials through AWS Security Token Service (AWS STS)
- These credentials are automatically refreshed based on the configured refresh interval
Query flow:
- Snowflake users submit queries against the mounted Iceberg tables
- The AWS Glue Iceberg REST endpoint processes these requests
- Query execution uses Snowflake’s compute resources while reading directly from Amazon S3
- Results are returned to Snowflake users while maintaining all security controls

There are four patterns to query Iceberg tables in SageMaker from Snowflake:

Iceberg tables in an S3 bucket using an AWS Glue Iceberg REST endpoint and Snowflake Iceberg REST catalog integration, with credential vending from Lake Formation
Iceberg tables in an S3 bucket using an AWS Glue Iceberg REST endpoint and Snowflake Iceberg REST catalog integration, using Snowflake external volumes to Amazon S3 data storage
Iceberg tables in an S3 bucket using AWS Glue API catalog integration, also using Snowflake external volumes to Amazon S3
Amazon S3 Tables using Iceberg REST catalog integration with credential vending from Lake Formation

In this post, we implement the first of these four access patterns using catalog integration for the AWS Glue Iceberg REST endpoint with Signature Version 4 (SigV4) authentication in Snowflake.

Prerequisites

You must have the following prerequisites:

A Snowflake account.
An AWS Identity and Access Management (IAM) role that is a Lake Formation data lake administrator in your AWS account. A data lake administrator is an IAM principal that can register Amazon S3 locations, access the Data Catalog, grant Lake Formation permissions to other users, and view AWS CloudTrail. See Create a data lake administrator for more information.
An existing AWS Glue database named iceberg_db and Iceberg table named customer with data stored in an S3 general purpose bucket with a unique name. To create the table, refer to the table schema and dataset.
A user-defined IAM role that Lake Formation assumes when accessing the data in the aforementioned S3 location to vend scoped credentials (see Requirements for roles used to register locations). For this post, we use the IAM role LakeFormationLocationRegistrationRole.

The solution takes approximately 30–45 minutes to set up. Cost varies based on data volume and query frequency. Use the AWS Pricing Calculator for specific estimates.

Create an IAM role for Snowflake

To create an IAM role for Snowflake, you first create a policy for the role:

On the IAM console, choose Policies in the navigation pane.
Choose Create policy.
Choose the JSON editor and enter the following policy (provide your AWS Region and account ID), then choose Next.

{
     "Version": "2012-10-17",
     "Statement": [
         {
             "Sid": "AllowGlueCatalogTableAccess",
             "Effect": "Allow",
             "Action": [
                 "glue:GetCatalog",
                 "glue:GetCatalogs",
                 "glue:GetPartitions",
                 "glue:GetPartition",
                 "glue:GetDatabase",
                 "glue:GetDatabases",
                 "glue:GetTable",
                 "glue:GetTables",
                 "glue:UpdateTable"
             ],
             "Resource": [
                 "arn:aws:glue:<region>:<account-id>:catalog",
                 "arn:aws:glue:<region>:<account-id>:database/iceberg_db",
                 "arn:aws:glue:<region>:<account-id>:table/iceberg_db/*",
             ]
         },
         {
             "Effect": "Allow",
             "Action": [
                 "lakeformation:GetDataAccess"
             ],
             "Resource": "*"
         }
     ]
 }

Enter iceberg-table-access as the policy name.
Choose Create policy.

Now you can create the role and attach the policy you created.

Choose Roles in the navigation pane.
Choose Create role.
Choose AWS account.
Under Options, select Require External Id and enter an external ID of your choice.
Choose Next.
Choose the policy you created (iceberg-table-access policy).
Enter snowflake_access_role as the role name.
Choose Create role.

Configure Lake Formation access controls

To configure your Lake Formation access controls, first set up the application integration:

Sign in to the Lake Formation console as a data lake administrator.
Choose Administration in the navigation pane.
Select Application integration settings.
Enable Allow external engines to access data in Amazon S3 locations with full table access.
Choose Save.

Now you can grant permissions to the IAM role.

Choose Data permissions in the navigation pane.
Choose Grant.
Configure the following settings:
1. For Principals, select IAM users and roles and choose snowflake_access_role.
2. For Resources, select Named Data Catalog resources.
3. For Catalog, choose your AWS account ID.
4. For Database, choose iceberg_db.
5. For Table, choose customer.
6. For Permissions, select SUPER.
Choose Grant.

SUPER access is required for mounting the Iceberg table in Amazon S3 as a Snowflake table.

Register the S3 data lake location

Complete the following steps to register the S3 data lake location:

As data lake administrator on the Lake Formation console, choose Data lake locations in the navigation pane.
Choose Register location.
Configure the following:
1. For S3 path, enter the S3 path to the bucket where you will store your data.
2. For IAM role, choose LakeFormationLocationRegistrationRole.
3. For Permission mode, choose Lake Formation.
Choose Register location.

Set up the Iceberg REST integration in Snowflake

Complete the following steps to set up the Iceberg REST integration in Snowflake:

Log in to Snowflake as an admin user.
Execute the following SQL command (provide your Region, account ID, and external ID that you provided during IAM role creation):

CREATE OR REPLACE CATALOG INTEGRATION glue_irc_catalog_int
CATALOG_SOURCE = ICEBERG_REST
TABLE_FORMAT = ICEBERG
CATALOG_NAMESPACE = 'iceberg_db'
REST_CONFIG = (
    CATALOG_URI = 'https://glue.<region>.amazonaws.com/iceberg'
    CATALOG_API_TYPE = AWS_GLUE
    CATALOG_NAME = '<account-id>'
    ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS
)
REST_AUTHENTICATION = (
    TYPE = SIGV4
    SIGV4_IAM_ROLE = 'arn:aws:iam::<account-id>:role/snowflake_access_role'
    SIGV4_SIGNING_REGION = '<region>'
    SIGV4_EXTERNAL_ID = '<external-id>'
)
REFRESH_INTERVAL_SECONDS = 120
ENABLED = TRUE;

Execute the following SQL command and retrieve the value for API_AWS_IAM_USER_ARN:

DESCRIBE CATALOG INTEGRATION glue_irc_catalog_int;

On the IAM console, update the trust relationship for snowflake_access_role with the value for API_AWS_IAM_USER_ARN:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                   "<API_AWS_IAM_USER_ARN>"
                ]
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": [
                        "<external-id>"
                    ]
                }
            }
        }
    ]
}

Verify the catalog integration:

SELECT SYSTEM$VERIFY_CATALOG_INTEGRATION('glue_irc_catalog_int');

Mount the S3 table as a Snowflake table:

CREATE OR REPLACE ICEBERG TABLE s3iceberg_customer
 CATALOG = 'glue_irc_catalog_int'
 CATALOG_NAMESPACE = 'iceberg_db'
 CATALOG_TABLE_NAME = 'customer'
 AUTO_REFRESH = TRUE;

Query the Iceberg table from Snowflake

To test the configuration, log in to Snowflake as an admin user and run the following sample query:SELECT * FROM s3iceberg_customer LIMIT 10;

Clean up

To clean up your resources, complete the following steps:

Delete the database and table in AWS Glue.
Drop the Iceberg table, catalog integration, and database in Snowflake:

DROP ICEBERG TABLE iceberg_customer;
DROP CATALOG INTEGRATION glue_irc_catalog_int;

Make sure all resources are properly cleaned up to avoid unexpected charges.

Conclusion

In this post, we demonstrated how to establish a secure and efficient connection between your Snowflake environment and SageMaker to query Iceberg tables in Amazon S3. This capability can help your organization maintain a single source of truth while also letting teams use their preferred analytics tools, ultimately breaking down data silos and enhancing collaborative analysis capabilities.

To further explore and implement this solution in your environment, consider the following resources:

Technical documentation:
- Review the Amazon SageMaker Lakehouse User Guide
- Explore Security in AWS Lake Formation for best practices to optimize your security controls
- Learn more about Iceberg table format and its benefits for data lakes
- Refer to Configuring secure access from Snowflake to Amazon S3
Related blog posts:
- Build real-time data lakes with Snowflake and Amazon S3 Tables
- Simplify data access for your enterprise using Amazon SageMaker Lakehouse

These resources can help you to implement and optimize this integration pattern for your specific use case. As you begin this journey, remember to start small, validate your architecture with test data, and gradually scale your implementation based on your organization’s needs.

About the authors

Automate and orchestrate Amazon EMR jobs using AWS Step Functions and Amazon EventBridge

2025-09-15 Senthil Kamala Rathinam

Post Syndicated from Senthil Kamala Rathinam original https://aws.amazon.com/blogs/big-data/automate-and-orchestrate-amazon-emr-jobs-using-aws-step-functions-and-amazon-eventbridge/

Many enterprises are adopting Apache Spark for scalable data processing tasks such as extract, transform, and load (ETL), batch analytics, and data enrichment. As data pipelines evolve, the need for flexible and cost-efficient execution environments that support automation, governance, and performance at scale also evolve in parallel. Amazon EMR provides a powerful environment to run Spark workloads, and depending on workload characteristics and compliance requirements, teams can choose between fully managed options like Amazon EMR Serverless or more customizable configurations using Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2).

In use cases where infrastructure control, data locality, or strict security postures are essential, such as in financial services, healthcare, or government, running transient EMR on EC2 clusters becomes a preferred choice. However, orchestrating the full lifecycle of these clusters, from provisioning to job submission and eventual teardown, can introduce operational overhead and risk if done manually.

To streamline this process, the AWS Cloud offers built-in orchestration capabilities using AWS Step Functions and Amazon EventBridge. Together, these services help you automate and schedule the entire EMR job lifecycle, reducing manual intervention while optimizing cost and compliance. Step Functions provides the workflow logic to manage cluster creation, Spark job execution, and cluster termination, and EventBridge schedules these workflows based on business or operational needs.

In this post, we discuss how to build a fully automated, scheduled Spark processing pipeline using Amazon EMR on EC2, orchestrated with Step Functions and triggered by EventBridge. We walk through how to deploy this solution using AWS CloudFormation, processes COVID-19 public dataset data in Amazon Simple Storage Service (Amazon S3), and store the aggregated results in Amazon S3. This architecture is ideal for periodic or scheduled batch processing scenarios where infrastructure control, auditability, and cost-efficiency are critical.

Solution overview

This solution uses the publicly available COVID-19 dataset to illustrate how to build a modular, scheduled architecture for scalable and cost-efficient batch processing for time-bound data workloads.The solution follows these steps:

Raw COVID-19 data in CSV format is stored in an S3 input bucket.
A scheduled rule in EventBridge triggers a Step Functions workflow.
The Step Functions workflow provisions a transient Amazon EMR cluster using EC2 instances.
A PySpark job is submitted to the cluster to calculate COVID-19 hospital utilization data to compute monthly state-level averages of inpatient and ICU bed utilization, and COVID-19 patient percentages.
The processed results are written back to an S3 output bucket.
After successful job completion, the EMR cluster is automatically deleted.
Logs are persisted to Amazon S3 for observability and troubleshooting.

By automating this workflow, you alleviate the need to manually manage EMR clusters while gaining cost-efficiency by running compute only when needed. This architecture is ideal for periodic Spark jobs such as ETL pipelines, regulatory reporting, and batch analytics, especially when control, compliance, and customization are required.The following diagram illustrates the architecture for this use case.

The infrastructure is deployed using AWS CloudFormation to provide consistency and repeatability. AWS Identity and Access Management (IAM) roles grant least‑privilege access to Step Functions, Amazon EMR, EC2 instances, and S3 buckets, and optional AWS Key Management Service (AWS KMS) encryption can secure data at rest in Amazon S3 and Amazon CloudWatch Logs. By combining a scheduled trigger, stateful orchestration, and centralized logging, this solution delivers a fully automated, cost‑optimized, and secure way to run transient Spark workloads in production.

Prerequisites

Before you get started, make sure you have the following prerequisites:

An AWS account. If you don’t have one, you can sign up for one.
An IAM user with administrator access. For instructions, see Create a user with administrative access.
The AWS Command Line Interface (AWS CLI) is installed on your local machine
A default virtual private cloud (VPC) and subnet in the target AWS Region where you plan to run the CloudFormation template.

Set up resources with AWS CloudFormation

To provision the required resources using a single CloudFormation template, complete the following steps:

Clone the sample repository to your local machine or AWS CloudShell and navigate into the project directory.

git clone https://github.com/aws-samples/sample-emr-transient-cluster-step-functions-eventbridge.git
cd sample-emr-transient-cluster-step-functions-eventbridge

Set an environment variable for the AWS Region where you plan to deploy the resources. Replace the placeholder with your Region code, for example, us-east-1.
```
export AWS_REGION=<YOUR AWS REGION>
```

Deploy the stack using the following command. Update the stack name if needed. In this example, the stack is created with the name covid19-analysis.

aws cloudformation deploy \
--template-file emr_transient_cluster_step_functions_eventbridge.yaml \
--stack-name covid19-analysis \
--capabilities CAPABILITY_IAM \
--region $AWS_REGION

You can monitor the stack creation progress on the AWS CloudFormation console on the Events tab. The deployment typically completes in under 5 minutes.

After the stack is successfully created, go to the Outputs tab on the AWS CloudFormation console and note the following values for use in later steps:

InputBucketName
OutputBucketName
LogBucketName

Set up the COVID-19 dataset

With your infrastructure in place, complete the following steps to set up the input data:

Download the COVID-19 data CSV file from HealthData.gov to your local machine.
Rename the downloaded file to covid19-dataset.csv.
Upload the renamed file to your S3 input bucket under the raw/ folder path.

Set up the PySpark Script

Complete the following steps to set up the PySpark script:

Open AWS CloudShell from the console.
Confirm that you are working inside the sample-emr-transient-cluster-step-functions-eventbridge directory before running the next command.
Copy the PySpark script needed for this walkthrough into your input bucket:
```
aws s3 cp covid19_processor.py s3://<InputBucketName>/scripts/
```

This script processes COVID-19 hospital utilization data stored as CSV files in your S3 input bucket. When running the job, provide the following command-line arguments:

--input – The S3 path to the input CSV files
--output – The S3 path to store the processed results

The script reads the raw dataset, standardizes various date formats, and filters out records with invalid or missing dates. It then extracts key utilization metrics such as inpatient bed usage, ICU bed usage, and the percentage of beds occupied by COVID-19 patients and calculates monthly averages grouped by state. The aggregated output is saved as timestamped CSV files in the specified S3 location.

This example demonstrates how you can use PySpark to efficiently clean, transform, and analyze large-scale healthcare data to gain actionable insights on hospital capacity trends during the pandemic.

Configure a schedule in EventBridge

The Step Functions state machine is by default scheduled to run on December 31, 2025, as a one-time execution. You can update the schedule for recurring or one-time execution as needed. Complete the following steps:

On the EventBridge console, choose Schedules under Scheduler in the navigation pane.
Select the schedule named <StackName>-covid19-analysis and choose Edit.
Set your preferred schedule pattern.
1. If you want to run the schedule one time, select One-time schedule for Occurrence and enter a date and time.
2. If you want to run this on a recurring basis, select Recurring schedule. Specify the schedule type as either Cron-based schedule or Rate-based schedule as needed.
Choose Next twice and choose Save schedule.

Start the workflow in Step Functions

Based on your EventBridge schedule, the Step Functions workflow will run automatically. For this walkthrough, complete the following steps to trigger it manually:

On the Step Functions console, choose State machines in the navigation pane.
Choose the state machine that begins with Covid19AnalysisStateMachine-*.
Choose Start execution.
In the Input section, provide the following JSON (provide the log bucket and output bucket names with the appropriate values captured earlier):
```
{
  "LogUri": "s3://<LogBucketName>/logs/",
  "OutputS3Location": "s3://<OutputBucketName>/processed/"
}
```
Choose Start execution to initiate the workflow.

Monitor the EMR job and workflow execution

After you start the workflow, you can track both the Step Functions state transitions and the EMR job progress in real time on the console.

Monitor the Step Functions state machine

Complete the following steps to monitor the Step Functions state machine:

On the Step Functions console, choose State machines in the navigation pane.
Choose the state machine that begins with Covid19AnalysisStateMachine-*.
Choose the running execution to view the visual workflow.
Each state node will update as it progresses—green for success, red for failure.
To explore a step, choose its node and inspect the input, output, and error details in the side pane.

The following screenshot shows an example of a successfully executed workflow.

Monitor the EMR cluster and EMR step

Complete the following steps to monitor the EMR cluster and EMR step status:

While the cluster is active, open the Amazon EMR console and choose Clusters in the navigation pane.
Locate the Covid19Cluster transient EMR cluster.
Initially, it will be in Starting status.

On the Steps tab, you can see your Spark submit step listed. As the job progresses, the step status changes from Pending to Running to finally Completed or Failed.
Choose the Applications tab to view the application UIs, in which you can access the Spark History Server and YARN Timeline Server for monitoring and troubleshooting.

Monitor CloudWatch logs

To enable CloudWatch logging and enhanced monitoring for your EMR on EC2 cluster, refer to Amazon EMR on EC2 – Enhanced Monitoring with CloudWatch using custom metrics and logs. This guide explains how to install and configure the CloudWatch agent using a bootstrap action, so you can stream system-level metrics (such as CPU, memory, and disk usage) and application logs from EMR nodes directly to CloudWatch. With this setup, you can gain real-time visibility into cluster health and performance, simplify troubleshooting, and retain critical logs even after the cluster is terminated.

For this walkthrough, check the logs in the S3 log output location.

Confirm cluster deletion

When the Spark step is complete, Step Functions will automatically delete the Amazon EMR cluster. Refresh the Clusters page on the Amazon EMR console. You should see your cluster status change from Terminating to Terminated within a minute.

By following these steps, you gain full end-to-end visibility into your workflow from the moment the Step Functions state machine is triggered to the automatic shutdown of the EMR cluster. You can monitor execution progress, troubleshoot issues, confirm job success, and continuously optimize your transient Spark workloads.

Verify job output in Amazon S3

When the job is complete, complete the following steps to check the processed results in the S3 output bucket:

On the Amazon S3 console, choose Buckets in the navigation pane.
Open the output S3 bucket you noted earlier.
Open the processed folder.
Navigate into the timestamped subfolder to view the CSV output file.
Download the CSV file to view the processed results, as shown in the following screenshot.

Monitoring and troubleshooting

To monitor the progress of your Spark job running on a transient EMR on EC2 cluster, use the Step Functions console. It provides real-time visibility into each state transition in your workflow, from cluster creation and job submission to cluster deletion. This makes it straightforward to track execution flow and identify where issues might occur.During job execution, you can use the Amazon EMR console to access cluster-level monitoring. This includes YARN application statuses, step-level logs, and overall cluster health. If CloudWatch logging is enabled in your job configuration, driver and executor logs stream in near real time, so you can quickly detect and diagnose errors, resource constraints, or data skew within your Spark application.

After the workflow is complete, regardless of whether it succeeds or fails, you can perform a detailed post-execution analysis by reviewing the logs stored in the S3 bucket specified in the LogUri parameter. This log directory includes standard output and error logs, along with Spark history files, offering insights into execution behavior and performance metrics.

For continued access to the Spark UI during job execution, you can use persistent application UIs on the EMR console. These links remain accessible even after the cluster is stopped, enabling deeper root-cause analysis and performance tuning for future runs.

This visibility into both workflow orchestration and job execution can help teams optimize their Spark workloads, reduce troubleshooting time, and build confidence in their EMR automation pipelines.

Clean up

To avoid incurring ongoing charges, clean up the resources provisioned during this walkthrough:

Empty the S3 buckets:
1. On the Amazon S3 console, choose Buckets in the navigation pane.
2. Select the input, output, and log buckets used in this tutorial.
3. Choose Empty to remove all objects before deleting the buckets (optional).
Delete the CloudFormation stack:
1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
2. Select the stack you created for this solution and choose Delete.
3. Confirm the deletion to remove associated resources.

Conclusion

In this post, we showed how to build a fully automated and cost-effective Spark processing pipeline using Step Functions, EventBridge, and Amazon EMR on EC2. The workflow provisions a transient EMR cluster, runs a Spark job to process data, and stops the cluster after the job completes. This approach helps reduce costs while giving you full control over the process. This solution is ideal for scheduled data processing tasks such as ETL jobs, log analytics, or batch reporting, especially when you need detailed control over infrastructure, security, and compliance settings.

To get started, deploy the solution in your environment using the CloudFormation stack provided and adjust it to fit your data processing needs. Check out the Step Functions Developer Guide and Amazon EMR Management Guide to explore further.

Share your feedback and ideas in the comments or connect with your AWS Solutions Architect to fine-tune this pattern for your use case.

About the authors

Streamline Spark application development on Amazon EMR with the Data Solutions Framework on AWS

2025-09-15 Vincent Gromakowski

Post Syndicated from Vincent Gromakowski original https://aws.amazon.com/blogs/big-data/streamline-spark-application-development-on-amazon-emr-with-the-data-solutions-framework-on-aws/

Today, organizations are heavily using Apache Spark for their big data processing needs. However, managing the entire development lifecycle of Spark applications—from local development to production deployment—can be complex and time-consuming. Managing the entire code base—including application code, infrastructure provisioning, and continuous integration and delivery (CI/CD) pipelines—is sometimes not fully automated and a shared responsibility across multiple teams, which slows down release cycles. This undifferentiated heavy lifting diverts valuable resources away from core business objectives: deriving value from data.

In this post, we explore how to use Amazon EMR, the AWS Cloud Development Kit (AWS CDK), and the Data Solutions Framework (DSF) on AWS to streamline the development process, from setting up a local development environment to deploying serverless Spark infrastructure, and implementing a CI/CD pipeline for automated testing and deployment.

By adopting this approach, developers gain full control over their code and the infrastructure responsible for running it, alleviating the need for cross-team dependency. Developers can customize the infrastructure to meet specific business needs and optimize performance. Additionally, they can customize CI/CD stages to facilitate comprehensive testing, using the self-mutation capability of AWS CDK Pipelines to automatically update and refine the deployment process. This level of control not only accelerates development cycles but also enhances the reliability and efficiency of the entire application lifecycle, so developers can focus more on innovation and less on manual infrastructure management.

Solution overview

The solution consists of the following key components:

The local development environment to develop and test your Spark code locally
The infrastructure as code (IaC) that will run your Spark application in AWS environments
The CI/CD pipeline running end-to-end tests and deploying into the different AWS environments

In the following sections, we discuss how to set up these components.

Prerequisites

To set up this solution, you must have an AWS account with appropriate permissions, Docker and the AWS CDK CLI.

Set up the local development environment

Developing Spark applications locally can be a challenging task due to the need for a consistent and efficient environment that mirrors your production setup. With Amazon EMR, Docker, and the Amazon EMR toolkit extension for Visual Studio Code, you can quickly set up a local development environment for Spark applications, developing and testing Spark code locally, and seamlessly port it to the cloud.

The Amazon EMR toolkit for VS Code includes an “EMR: Create Local Spark Environment” command that generates a development container. This container is based on an Amazon EMR on Amazon EKS image corresponding to the Amazon EMR version you select. You can develop Spark and PySpark code locally, with full compatibility with your remote Amazon EMR environment. Additionally, the toolkit provides helpers to make it straightforward to connect to the AWS Cloud, including an Amazon EMR explorer, an AWS Glue Data Catalog explorer, and commands to run Amazon EMR Serverless jobs from VS Code.

To set up your local environment, complete the following steps:

Install VS Code and the Amazon EMR Toolkit for VS Code.
Install and launch Docker.
Create a local Amazon EMR environment in your working directory using the command EMR: Create Local Spark Environment.

Amazon EMR Toolkit bootstrap

Choose PySpark, Amazon EMR 7.5, and the AWS Region you want to use, and choose an authentication mechanism.

Amazon EMR toolkit local environment

aws ecr get-login-password --region us-east-1 \
    | docker login \
    --username AWS \
    --password-stdin \
    12345678910.dkr.ecr.us-east-1.amazonaws.com

Now you can launch your dev container using the VS Code command Dev Containers: Rebuild and Reopen in container.

The container will install the latest operating system packages and run a local Spark history server on port 18080.

local Spark history server

The container provides spark-shell, spark-sql, and pyspark from the terminal and a Jupyter Python kernel for connecting a Jupyter notebook to execute interactive Spark code.

local Jupyter notebooks

Using the Amazon EMR Toolkit, you can develop your Spark application and test it locally using Pytest—for example, to validate the business logic. You can also connect to other AWS accounts where you have your development environment.

Build the AWS CDK application with DSF on AWS

After you validate the business logic into your local Spark application, you can implement the infrastructure responsible for running your application. DSF provides AWS CDK L3 Constructs that simplify the creation of Spark-based data pipelines on EMR Serverless or Amazon EMR on EKS.

DSF provides the capability to package your local PySpark application, including the Python dependencies, into artifacts that can consumed by EMR Serverless jobs. The PySparkApplicationPackage is a construct that uses a Dockerfile to perform the packaging of dependencies into a Python virtual environment archive and then upload the archive and the PySpark entrypoint file into a secured Amazon Simple Storage Service (Amazon S3) bucket. The following diagram illustrates this architecture.

PySparkApplicationPackage L3 construct

See the following example code:

spark_app = dsf.processing.PySparkApplicationPackage(
    self,
    "SparkApp",
    entrypoint_path="./../spark/src/agg_trip_distance.py",
    application_name="TaxiAggregation",
    # Path of the Dockerfile used to package the dependencies as a Python venv
    dependencies_folder='./../spark',
    # Path of the venv archive in the docker image
    venv_archive_path="/venv-package/pyspark-env.tar.gz",
    removal_policy=RemovalPolicy.DESTROY)

You just need to provide the paths for the following:

The PySpark entrypoint. This is the main Python script of your Spark application.
The Dockerfile containing the logic for packaging a virtual environment into an archive.
The path of the resulting archive in the container file system.

DSF provides helpers to connect the application package to the EMR Serverless job. The PySparkApplicationPackage construct exposes properties that can directly be used into the SparkEmrServerlessJob construct parameters. This construct simplifies the configuration of a batch job using an AWS Step Functions state machine. The following diagram illustrates this architecture.

EmrServerlessJob L3 construct

The following code is an example of an EMR Serverless job:

spark_job = dsf.processing.SparkEmrServerlessJob(
    self,
    "SparkProcessingJob",
    dsf.processing.SparkEmrServerlessJobProps(
        name=f"taxi-agg-job-{Names.unique_resource_name(self)}",
        # ID of the previously created EMR Serverless runtime
        application_id=spark_runtime.application.attr_application_id,
        # The IAM role used by the EMR Job with permissions required by the application
        execution_role=processing_exec_role,
        spark_submit_entry_point=spark_app.entrypoint_uri,
        # Add the Spark parameters from the PySpark package to configure the dependencies (using venv)
        spark_submit_parameters=spark_app.spark_venv_conf + spark_params,
        removal_policy=RemovalPolicy.DESTROY,
        schedule=schedule))

Note the two parameters of SparkEmrServerlessJob that are provided by PySparkApplicationPackage:

entrypoint_uri, which is the S3 URI of the entrypoint file
spark_venv_conf, which contains the Spark submit parameters for using the Python virtual environment

DSF also provides a SparkEmrServerlessRuntime to simplify the creation of the EMR Serverless application responsible for running the job.

Deploy the Spark application using CI/CD

The final step is to implement a CI/CD pipeline that can test your Spark code and promote from dev/test/stage and then to production. DSF provides a L3 Construct that simplifies the creation of the CI/CD pipeline for your Spark applications. DSF’s implementation of the Spark CI/CD pipeline construct uses the AWS CDK built-in pipeline functionality. One of the key capabilities when using an AWS CDK pipeline is its self-mutating capability. It can update itself whenever you change its definition, avoiding the traditional chicken-and-egg problem of pipeline updates and helping developers fully control their CI/CD pipeline.

When the pipeline runs, it follows a carefully orchestrated sequence. First, it retrieves your code from your repository and synthesizes it into AWS CloudFormation templates. Before doing anything else, it examines these templates to see if you’ve made any changes to the pipeline’s own structure. If the pipeline detects that its definition has changed, it will pause its normal operation and update itself first. After the pipeline has updated itself, it will continue with its regular stages, such as deploying your application.

DSF provides an opinionated implementation of CDK Pipelines for Spark applications, where the PySpark code is automatically unit tested using Pytest and where the configuration is simplified. You only need to configure four components:

The CI/CD stages (testing, staging, production, and so on). This includes the AWS account ID and Region where these environments reside in.
The AWS CDK stack that is deployed in each environment.
(Optional) The integration test script that you want to run against the deployed stack.
The SparkEmrCICDPipeline AWS CDK construct.

The following diagram illustrates how everything works together.

SparkCICDPipeline L3 construct

Let’s dive into each of these components.

Define cross-account deployment and CI/CD stages

With the SparkEmrCICDPipeline construct, you can deploy your Spark application stack across different AWS accounts. For example, you can have a separate account for your CI/CD processes and different accounts for your staging and production environments.To set this up, first bootstrap the various AWS accounts (staging, production, and so on):

cdk bootstrap --profile <ENVIRONMENT_ACCOUNT_PROFILE> \ 
    aws://<ENVIRONMENT_ACCOUNT_ID&gt;/&lt;REGION> \ 
    --trust <CICD_ACCOUNT_ID> \ 
    --cloudformation-execution-policies "POLICY_ARN"

This step sets up the necessary resources in the environment accounts and creates a trust relationship between those accounts and the CI/CD account where the pipeline will run.Next, choose between two options to define the environments (both options require the relevant configuration in the cdk.context.json file.The first option is to use pre-defined environments, which is defined as follows:

{ 
    "staging": { 
        "account": "<STAGING_ACCOUNT_ID>", 
        "region": "<REGION>" 
    }, 
    "prod": { 
        "account": "<PROD_ACCOUNT_ID>", 
        "region": "<REGION>" 
    } 
}

Alternatively, you can use user-defined environments, which is defined as follows:

{
   "environments":[
      {
         "stageName":"<STAGE_NAME_1>",
         "account":"<STAGE_ACCOUNT_ID>",
         "region":"<REGION>",
         "triggerIntegTest":"<OPTIONAL_BOOLEAN_CAN_BE_OMMITTED>"
      },
      {
         "stageName":"<STAGE_NAME_2>",
         "account":"<STAGE_ACCOUNT_ID>",
         "region":"<REGION>",
         "triggerIntegTest":"<OPTIONAL_BOOLEAN_CAN_BE_OMMITTED>"
      },
      {
         "stageName":"<STAGE_NAME_3>",
         "account":"<STAGE_ACCOUNT_ID>",
         "region":"<REGION>",
         "triggerIntegTest":"<OPTIONAL_BOOLEAN_CAN_BE_OMMITTED>"
      }
   ]
}

Customize the stack to be deployed

Now that the environments have been bootstrapped and configured, let’s look at the actual stack that contains the resources that will be deployed in the various environments. Two classes must be implemented:

A class that extends the stack – This is where the resources that are going to be deployed in each of the environments are defined. This can be a normal AWS CDK stack, but it can be deployed in another AWS account depending on the environment configuration defined in the previous section.
A class that extends ApplicationStackFactory – This is DSF specific, and makes it possible to configure and then return the stack that is created.

The following code shows a full example:

class MyApplicationStack(cdk.Stack): 
    def __init__(self, scope, *, stage): 
        super().__init__(scope, "MyApplicationStack") 
        bucket = Bucket(self, "TestBucket",
                        auto_delete_objects=True, 
                        removal_policy=cdk.RemovalPolicy.DESTROY) 
        cdk.CfnOutput(self, "BucketName", value=bucket.bucket_name) 
        
class MyStackFactory(dsf.utils.ApplicationStackFactory): 
    def create_stack(self, scope, stage): 
        return MyApplicationStack(scope, stage=stage)

ApplicationStackFactory supports customization of the stack before returning the initialized object to be deployed by the CI/CD pipeline. You can customize your stack behavior by passing the current stage to your stack. For example, you can skip scheduling the Spark application in the integration tests stage because the integration tests trigger it manually as part of the CI/CD pipeline. For the production stage, the scheduling facilitates automatic execution of the Spark application.

Write the integration test script

The integration test script is a bash script that is triggered after the main application stack has been deployed. Inputs to the bash script can come from the AWS CloudFormation outputs of the main application stack. These outputs are mapped into environment variables that the bash script can access directly.

In the Spark CI/CD example, the application stack uses the SparkEMRServerlessJob CDK construct. This construct uses a Step Functions state machine to manage the execution and monitoring of the Spark job. The following is an example integration test bash script that we use to test that the deployed stack can run the associated Spark job successfully:

#!/bin/bash 
EXECUTION_ARN=$(aws stepfunctions start-execution --state-machine-arn $STEP_FUNCTION_ARN | jq -r '.executionArn')

while true 
do 
    STATUS=$(aws stepfunctions describe-execution --execution-arn $EXECUTION_ARN | jq -r '.status') 
    if [ $STATUS = "SUCCEEDED" ]; then 
        exit 0 
    elif [ $STATUS = "FAILED" ] || [ $STATUS = "TIMED_OUT" ] || [ $STATUS = "ABORTED" ]; then 
        exit 1 
    else 
        sleep 10
        continue 
    fi
done

The integration test scripts are executed within an AWS CodeBuild project. As part of the IntegrationTestStack, we’ve included a custom resource that periodically checks the status of the integration test script as it runs. Failure of the CodeBuild execution causes the parent pipeline (residing in the pipeline account) to fail. This helps teams only promote changes that pass all the required testing.

Bring all the components together

When you have your components ready, you can use the SparkEmrCICDPipeline to bring them together. See the following example code:

dsf.processing.SparkEmrCICDPipeline(
    self,
    "SparkCICDPipeline",
    spark_application_name="SparkTest",
    # The Spark image to use in the CICD unit tests
    spark_image=dsf.processing.SparkImage.EMR_7_5,
    # The factory class to dynamically pass the Application Stack
    application_stack_factory=SparkApplicationStackFactory(),
    # Path of the CDK python application to be used by the CICD build and deploy phases
    cdk_application_path="infra",
    # Path of the Spark application to be built and unit tested in the CICD
    spark_application_path="spark",
    # Path of the bash script responsible to run integration tests 
    integ_test_script='./infra/resources/integ-test.sh',
    # Environment variables used by the integration test script, value is the CFN output name
    integ_test_env={
        "STEP_FUNCTION_ARN": "ProcessingStateMachineArn"
    },
    # Additional permissions to give to the CICD to run the integration tests
    integ_test_permissions=[
        PolicyStatement(
            actions=["states:StartExecution", "states:DescribeExecution"
            ],
            resources=["*"]
        )
    ],
    source= CodePipelineSource.connection("your/repo", "branch",
        connection_arn="arn:aws:codeconnections:us-east-1:222222222222:connection/7d2469ff-514a-4e4f-9003-5ca4a43cdc41"
    ),
    removal_policy=RemovalPolicy.DESTROY,
)

The following elements of the code are worth highlighting:

With the integ_test_env parameter, you can define the environment variable mapping with the output of your application stack that’s defined in the application_stack_factory parameter
The integ_test_permissions parameter specifies the AWS Identity and Access Management (IAM) permissions that are attached to the CodeBuild project where the integration test script runs in
CDK Pipelines needs an AWS code connection Amazon Resource Name (ARN) to connect to your Git repository when you host your code

Now you can deploy the stack containing the CI/CD pipeline. This is a one-time operation because the CI/CD pipeline will dynamically be updated based on code changes that impact the CI/CD pipeline itself:

cd infra 
cdk deploy CICDPipeline

Then you can commit and push the code into the source code repository defined in the source parameter. This step triggers the pipeline and deploys the application in the configured environments. You can check the pipeline definition and status on the AWS CodePipeline console.

AWS CodePipeline

You can find the full example on the Data Solutions Framework GitHub repository.

Clean up

Follow the readme guide to delete the resources created by the solution.

Conclusion

By using Amazon EMR, the AWS CDK, DSF on AWS, and the Amazon EMR toolkit, developers can now streamline their Spark application development process. The solution described in this post helps developers gain full control over their code and infrastructure, making it possible to set up local development environments, implement automated CI/CD pipelines, and deploy serverless Spark infrastructure across multiple environments.

DSF supports other patterns, such as streaming governance and data sharing and Amazon Redshift data warehousing. The DSF roadmap is publicly available, and we look forward to your feature requests, contributions, and feedback. You can get started using DSF by following our Quick start guide.

About the authors

Tuning guide for AMD Amazon EC2 instances

2025-09-12 Suyash Nadkarni

Post Syndicated from Suyash Nadkarni original https://aws.amazon.com/blogs/compute/tuning-guide-for-amd-amazon-ec2-instances/

As organizations migrate more mission-critical workloads to the cloud, optimizing for price-performance becomes a key consideration. Amazon Elastic Compute Cloud (Amazon EC2) instances powered by AMD EPYC processors deliver high core density, large memory bandwidth, and hardware-enabled security features, making them a strong option for a wide range of compute, memory, and I/O-intensive workloads. In this post, we explain how to choose the right AMD-based Amazon EC2 instance types and describe tuning techniques that can help users improve workload efficiency. Whether you’re running simulations, large-scale analytics, or inference workloads, this post provides practical guidance for optimizing AMD-powered Amazon EC2 instance.

Amazon EC2 offers AMD-based instances built on multiple generations of AMD EPYC processors. This post focuses on optimization strategies for the 3rd and 4th generation families, which provide enhanced capabilities for compute and memory-intensive workloads.

3rd generation (M6a, R6a, C6a, Hpc6a): Balances compute, memory, and storage—well-suited for analytics, web servers, and high-performance computing.
4th generation (M7a, R7a, C7a, Hpc7a): Deliver up to 50% better performance over earlier AMD generations These instances introduce AVX-512 support, DDR5 memory, and Simultaneous Multithreading (SMT) turned off, SMT is a technology that allows a single physical core to run multiple threads concurrently; with SMT disabled, each virtual CPU (vCPU) maps directly to a physical core, which can improve workload isolation and consistency.

Choosing the right AMD EPYC powered Amazon EC2 instance type

Selecting the right AMD EPYC powered Amazon EC2 instance type starts with understanding how your application uses compute, memory, storage, and networking resources. Each instance family is optimized for specific workload characteristics.

Compute-intensive workloads

These workloads involve large-scale calculations, simulations, or encoding tasks, and they often need high CPU throughput and advanced instruction set support.

Recommended instances: C7a, Hpc7a, C6a, Hpc6a
Use cases: Scientific computing, financial modelling, media transcoding, encryption, machine learning (ML) inference

Big data and analytics

Applications that process and analyze large datasets benefit from high memory bandwidth and a balanced compute-to-memory ratio.

Recommended instances: R7a, M7a, R6a, M6a
Use cases: Stream processing, real-time analytics, business intelligence tools, distributed caching

Database workloads

Database workloads typically need consistent memory performance and high I/O throughput for read/write operations.

Recommended instances: R7a, M7a, R6a, M6a
Use cases: Relational databases (MySQL, PostgreSQL), NoSQL databases (MongoDB, Cassandra), in-memory databases (Redis)

Web and application servers

These applications handle variable request loads and benefit from balanced compute, memory, and network performance.

Recommended instances: C7a, M7a, C6a, M6a
Use cases: Web servers, content management systems, e-commerce platforms, API endpoints

AI/ML on CPU

ML tasks that do not need GPUs—such as inference or preprocessing—can run efficiently on CPU-based instances.

Recommended instances: M7a, R7a, C7a
Use cases: Model inference, natural language processing, computer vision, recommendation engines

High Performance Computing (HPC)

These workloads need high core counts, memory bandwidth, and low-latency networking for tightly coupled computations.

Recommended instances: Hpc7a, Hpc6a, R7a, M7a
Use cases: Computational fluid dynamics, genomics, seismic analysis, engineering simulations

Aligning your instance type with the needs of your workload helps provide predictable performance and cost efficiency. Services such as Amazon EC2 Auto Scaling and AWS Compute Optimizer can assist with ongoing instance selection and scaling decisions.

Optimizing AMD EPYC powered Amazon EC2 instances

Amazon EC2 instances powered by 4th generation AMD EPYC processors use a modular chiplet architecture, as shown in the following figure. Each processor includes multiple Core Complex Dies (CCDs), and each CCD contains one or more core complexes (CCXs). A CCX groups up to eight physical cores, with each core having 1 MB of dedicated L2 cache and all eight cores sharing a 32 MB L3 cache. These CCDs are connected to a central I/O die, which manages memory and interconnects across the chip.

Figure 1: Layout of the ‘Zen 4’ CPU die with 8 cores per die

The modular architecture of 4th generation AMD EPYC processors enables Amazon EC2 instances such as m7a.24xlarge and m7a.48xlarge to support high core counts-up to 96 physical cores per socket. For example:

m7a.24xlarge provides 96 physical cores from a single socket.
m7a.48xlarge spans two sockets, offering 192 physical cores.

Understanding how Amazon EC2 instance sizes map to physical processor layouts can help you optimize for performance and cache locality. Workloads that involve shared memory access or thread synchronization, such as high-performance computing or in-memory databases, can benefit from selecting instance sizes that minimize cross-socket communication and make efficient use of shared L3 cache, as shown in the following figure.

Figure 2: Layout of the ‘EPYC Chiplet’ CPU

Amazon EC2 instances powered by 4th generation AMD EPYC processors operate with SMT turned off. In this configuration, each vCPU maps directly to a physical core, eliminating resource sharing such as execution units and cache between sibling threads. This design can reduce intra-core interference and help provide more consistent performance under certain workloads. Users can isolate threads at the core level and observe lower variability and more stable throughput for workloads, such as high-performance computing, ML inference, and transactional databases.

CPU optimizations

Tools such as htop can help identify CPU usage patterns, system load averages, and per-process resource consumption. CPU usage should be evaluated in the context of your workload and performance requirements. If usage consistently reaches 100%, then it may indicate that the workload is CPU-bound and not optimally balanced. Before modifying the instance size, enabling Auto Scaling, or switching instance families, evaluations must be conducted for the tuning opportunities that could improve performance without changing infrastructure. Load averages that regularly exceed the number of vCPUs can also signal compute saturation and may warrant further optimization.

L3 cache usage

The L3 cache is a shared, high-speed memory layer used by a group of CPU cores. On AMD-based Amazon EC2, cores are organized into L3 cache slices, each shared by a subset of cores on the same socket. Threads scheduled within the same slice can access shared data more efficiently, reducing memory latency. On 4th generation AMD instances such as m7a.2xlarge or r7a.2xlarge, all vCPUs typically map to cores within a single L3 slice, which ensures consistent cache locality. For larger sizes (for example m7a.8xlarge and above), thread pinning—assigning threads to specific physical cores—can help maintain this locality. Thread pinning can reduce performance variability in workloads with shared-memory access patterns.

You can pin threads using the taskset command:

taskset -c 0-3 ./your_application

This example pins your application to CPU cores 0 through 3. To determine which cores share the same L3 cache region, use tools such as lscpu or lstopo to inspect the system’s CPU topology. Grouping related threads on cores that share an L3 cache can improve performance consistency for workloads with frequent shared-memory access.

Docker container optimization

In containerized environments running on AMD-based Amazon EC2 instances, tuning CPU-related settings can improve workload consistency and efficiency—particularly for compute-intensive or latency-sensitive applications. Although default configurations work for many general-purpose scenarios, certain workloads may benefit from more explicit control over how CPU resources are allocated. By default, container runtimes such as Docker allow the operating system to schedule containers across any available CPU cores. This flexible scheduling can lead to variability in performance when containers move across cores that don’t share cache. To reduce this variability and improve cache efficiency, containers can be pinned to specific cores using the --cpuset-cpus flag.

docker run --cpuset-cpus="1,3" my-container

This setting restricts the container to use only the specified cores. In this example, cores 1 and 3 are used for demonstration. The actual core selection should be based on CPU topology to make sure of cache-efficient scheduling. Pinning containers to cores that share L3 cache can reduce scheduling overhead and improve consistency for workloads with shared-memory access patterns.

CPU frequency governor settings

Some operating systems adjust CPU frequency dynamically to save power. This is typically controlled by a setting called the CPU frequency governor. Although this behavior is efficient for general-purpose workloads, it may introduce latency or performance variability in compute-sensitive environments. For workloads that need consistently high CPU performance—such as high-throughput data processing, simulations, or real-time applications—we recommend setting the CPU governor to performance mode. This makes sure that the CPU runs at its maximum frequency under load, avoiding time spent ramping up from lower power states.

You can apply this setting on bare metal instances or Amazon EC2 Dedicated Hosts using the following command:

sudo cpupower frequency-set -g performance

Before applying, consider benchmarking workload performance with other CPU frequency governors (such as ondemand or schedutil) to make sure that the performance setting provides measurable benefits without unnecessary energy trade-offs.

Use architecture-specific compiler flags

When compiling performance-sensitive C or C++ applications, architecture-specific flags such as -march=znverX can unlock AMD EPYC–specific optimizations, including improved vectorization and floating-point performance. Although this is beneficial for compute-heavy workloads, it may reduce portability across architectures. To balance performance and flexibility, consider implementing runtime feature detection and dispatching an approach used by many optimized libraries to adapt behavior based on the underlying CPU.

Before using these flags, verify that your compiler version supports them and make sure that the target EC2 instance architecture matches the specified flag. For example, a binary compiled with -march=znver4 may fail with an illegal instruction error (SIGILL) if run on earlier-generation instances such as M5a.The following table outlines the appropriate flags and minimum supported compiler versions for each AMD EPYC generation:

AMD EPYC Generation	-march Flag	Minimum GCC Version	Minimum LLVM/Clang Version
4th generation (for example M7a)	znver4	GCC 12	Clang 15
3rd generation (for example M6a)	znver3	GCC 11	Clang 13
2nd generation (for example M5a)	znver2	GCC 9	Clang 11

The following flags are supported for GCC 11+ or LLVM Clang 13+:

# 4th Gen EPYC (M7a, R7a, C7a, Hpc7a)
-march=znver4

# 3rd Gen EPYC (M6a, R6a, C6a)
-march=znver3

# 2nd Gen EPYC (M5a, R5a, C5a)
-march=znver2

When to enable AVX-512 and VNNI instructions

4th generation AMD EPYC powered Amazon EC2 instances support advanced single instruction, multiple data (SIMD) instruction sets such as AVX2, AVX-512, and VNNI. These can improve throughput for vector-heavy workloads such as ML inference, image processing, or scientific simulations. However, these flags are generation-specific—attempting to run binaries compiled with AVX-512 on unsupported instances (for example 2nd generation M5a) may result in runtime errors such as illegal instruction (SIGILL).

When compiling C or C++ code:

gcc -mavx2 -mavx512f -O2 your_program.c -o your_program

To better understand which optimizations are applied, use the following:

-ftree-vectorizer-verbose=2 -fopt-info-vec-missed

This helps identify loops that benefit from vectorization and those that don’t. Only enable these optimizations if your workload benefits and you’ve validated compatibility with the instance generation in use. Avoid applying AVX flags indiscriminately, because it may reduce portability and increase binary complexity.

AMD Optimizing CPU Libraries

The AMD Optimizing CPU Libraries (AOCL) provide performance-tuned math libraries specifically designed for AMD EPYC processors. These libraries include optimized implementations of commonly used functions in scientific computing, engineering, and ML workloads. You can link your applications against AOCL to use processor-specific optimizations without rewriting your code. AOCL includes libraries for vector and scalar math, random number generation, FFT, BLAS, and LAPACK, among others.

Setting up AOCL

Set the AOCL_ROOT environment variable to point to the installation directory:
```
export AOCL_ROOT=/path/to/aocl
```

Compile your application with the appropriate include and library paths:

gcc -I$AOCL_ROOT/include -L$AOCL_ROOT/lib -lamdlibm -lm your_program.c -o your_program

Vector and scalar math optimization: you can enable more vectorized or scalar math tuning flags for specific workloads:

# Vector math optimization
gcc -lamdlibm -fveclib=AMDLIBM -lm your_program.c -o your_program
		
# Faster scalar math
gcc -lamdlibm -fsclrlib=AMDLIBM -lamdlibmfast -lm your_program.c -o your_program

AOCL runtime profiling: AOCL supports runtime profiling, which helps developers identify which mathematical operations dominate execution time. To enable profiling, run the following:
```
export AOCL_PROFILE=1
./your_program
```

After running this, a report file named aocl_profile_report.txt is generated. It provides a function-level breakdown of call counts, execution time, and thread usage. Developers can use this to focus optimization efforts on high-impact operations.

Conclusion

This post explored how to select AMD-based Amazon EC2 instance types that align with specific workload characteristics, and how to apply tuning techniques focused on CPU usage, thread placement, cache efficiency, and math library optimization. These approaches are especially relevant for compute-bound or latency-sensitive workloads where consistent performance is critical.

Ready to get started? Sign in to the AWS Management Console and launch AMD EPYC powered Amazon EC2 instances to begin optimizing your workloads today.

Accelerate your data and AI workflows by connecting to Amazon SageMaker Unified Studio from Visual Studio Code

2025-09-12 Lauren Mullennex

Post Syndicated from Lauren Mullennex original https://aws.amazon.com/blogs/big-data/accelerate-your-data-and-ai-workflows-by-connecting-to-amazon-sagemaker-unified-studio-from-visual-studio-code/

Developers and machine learning (ML) engineers can now connect directly to Amazon SageMaker Unified Studio from their local Visual Studio Code (VS Code) editor. With this capability, you can maintain your existing development workflows and personalized integrated development environment (IDE) configurations while accessing Amazon Web Services (AWS) analytics and artificial intelligence and machine learning (AI/ML) services in a unified data and AI development environment. This integration provides seamless access from your local development environment to scalable infrastructure for running data processing, SQL analytics, and ML workflows. By connecting your local IDE to SageMaker Unified Studio, you can optimize your data and AI development workflows without disrupting your established development practices.

In this post, we demonstrate how to connect your local VS Code to SageMaker Unified Studio so you can build complete end-to-end data and AI workflows while working in your preferred development environment.

Solution overview

The solution architecture consists of three main components:

Local computer – Your development machine running VS Code with AWS Toolkit for Visual Studio Code and Microsoft Remote SSH installed. You can connect through the Toolkit for Visual Studio Code extension in VS Code by browsing available SageMaker Unified Studio spaces and selecting their target environment.
SageMaker Unified Studio – Part of the next generation of Amazon SageMaker, SageMaker Unified Studio is a single data and AI development where you can find and access your data and act on it using familiar AWS tools for SQL analytics, data processing, model development, and generative AI application development.
AWS Systems Manager – A secure, scalable remote access and management service that enables seamless connectivity between your local VS Code and SageMaker Unified Studio spaces to streamline data and AI development workflows.

The following diagram shows the interaction between your local IDE and SageMaker Unified Studio spaces.

Prerequisites

To try the remote IDE connection, you must have the following prerequisites:

Access to a SageMaker Unified Studio domain with connectivity to the internet. For domains set up in virtual private cloud (VPC)-only mode, your domain should have a route out to the internet through a proxy or a NAT gateway. If your domain is completely isolated from the internet, refer to the documentation for setting up the remote connection. If you don’t have a SageMaker Unified Studio domain, you can create one using the quick setup or manual setup option.
A user with SSO credentials through IAM Identity Center is required. To configure SSO user access, review the documentation.
Access to or can create a SageMaker Unified Studio project.
A JupyterLab or Code Editor compute space with a minimum instance type requirement of 8 GB of memory. In this post, we use an ml.t3.large instance. SageMaker Distribution image version 2.8 or later is supported.
You have the latest stable VS Code with Microsoft Remote SSH (version 0.74.0 or later), and AWS Toolkit (version 3.74.0) extension installed on your local machine.

Solution implementation

To enable remote connectivity and connect to the space from VS Code, complete the following steps. To connect to a SageMaker Unified Studio space remotely, the space must have remote access enabled.

Navigate to your JupyterLab or Code Editor space. If it’s running, stop the space and choose Configure space to enable remote access, as shown in the following screenshot.
Turn on Remote access to enable the feature and choose Save and restart, as shown in the following screenshot.
Navigate to AWS Toolkit in your local VS Code installation.
On the SageMaker Unified Studio tab, choose Sign in to get started and provide your SageMaker Unified Studio domain URL, that is, https://<domain-id>.sagemaker.<region>.on.aws.
You will be prompted to be redirected to your web browser to allow access to AWS IDE extensions. Choose Open to open a new web browser tab.
Choose Allow access to connect to the project through VS Code.
You’ll receive a Request approved notification, indicating that you now have permissions to access the domain remotely.

You can now navigate back to your local VS Code to access your project to continue building ETL jobs and data pipelines, training and deploying ML models, or building generative AI applications. To connect to the project for data processing and ML development, follow these steps:

Choose Select a project to view your data and compute resources. All projects in the domain are listed, but you’re only allowed access to projects where you’re a project member.

You can only view one domain and one project at a time. To switch projects or sign out of a domain, choose the ellipsis icon.

You can also view compute and data resources that you created previously.
Connect your JupyterLab or Code Editor space by selecting the connectivity icon, as shown in the following image. Note: If this option does not show as available, then you may have remote access disabled in the space. If the space is in “Stopped” state, hover over the space and choose the connect button. This should enable remote access, start the space and connect to it. If the space is in “Running” state, the space must be restarted with remote access enabled. You can do this by stopping the space and connecting to it as shown below from the toolkit.

Another VS Code window will open that is connected to your SageMaker Unified Studio space using remote SSH.
Navigate to the Explorer to view your space’s notebooks, files, and scripts. From the AWS Toolkit, you can also view your data sources.

Use your custom VS Code setup with SageMaker Unified Studio resources

When you connect VS Code to SageMaker Unified Studio, you keep all your personal shortcuts and customizations. For example, if you use code snippets to quickly insert common analytics and ML code patterns, these continue to work with SageMaker Unified Studio managed infrastructure.

In the following graphic, we demonstrate using analytics workflow shortcuts. The “show-databases” code snippet queries Athena to show available databases, “show-glue-tables” lists tables in AWS Glue Data Catalog, and “query-ecommerce” retrieves data using Spark SQL for analysis.

Graphic showing how to use code snippets in local VS Code to query data resources in SageMaker Unified Studio

You can also use shortcuts to automate building and training an ML model on SageMaker AI. In the below graphic, the code snippets show data processing, configuring, and launching a SageMaker AI training job. This approach demonstrates how data practitioners can maintain their familiar development setup while using managed data and AI resources in SageMaker Unified Studio.

Graphic showing how to do data processing and train a SageMaker AI job remotely in VS Code using code snippets

Disabling remote access in SageMaker Unified Studio

As an administrator, if you want to disable this feature for your users, you can enforce it by adding the following policy to your project’s IAM role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyStartSessionForSpaces",
            "Effect": "Deny",
            "Action": [
                "sagemaker:StartSession"
            ],
            "Resource": "arn:aws:sagemaker:*:*:space/*/*"
        }
    ]
}

Clean up

SageMaker Unified Studio by default shuts down idle resources such as JupyterLab and Code Editor spaces after 1 hour. If you’ve created a SageMaker Unified Studio domain for the purposes of this post, remember to delete the domain.

Conclusion

Connecting directly to Amazon SageMaker Unified Studio from your local IDE reduces the friction of moving between local development and scalable data and AI infrastructure. By maintaining your personalized IDE configurations, this reduces the need to adapt between different development environments. Whether you’re processing large datasets, training foundation models (FMs), or building generative AI applications, you can now work from your local setup while accessing the capabilities of SageMaker Unified Studio. Get started today by connecting your local IDE to SageMaker Unified Studio to streamline your data processing workflows and accelerate your ML model development.

About the authors

How to build resilient SMS delivery with AWS End User Messaging

2025-09-12 Tyler Holmes

Post Syndicated from Tyler Holmes original https://aws.amazon.com/blogs/messaging-and-targeting/how-to-build-resilient-sms-delivery-with-aws-end-user-messaging/

Reliable SMS delivery is a critical requirement for many businesses. However, SMS communications can be impacted by factors outside your direct control, such as carrier availability and delivery challenges.

In this post, we explore strategies for building resilient SMS architectures using AWS End User Messaging. We discuss how to architect your SMS communications at the originator, account, and Regional levels to support high availability and seamless failover, even in the face of disruptions. This includes implementing best practices like using phone pools, dedicated originators, and multi-Region redundancy.

By understanding these strategies, you can create a resilient messaging system that keeps your mission-critical SMS flowing reliably to your customers and stakeholders, regardless of external service interruptions or carrier-specific issues.

How SMS delivery works

The process of delivering an SMS message involves a complex chain of interconnected systems. The message needs to be routed to the appropriate mobile network operator (carrier) based on the recipient’s phone number, and there are many paths the message could take. When a user sends an SMS through AWS End User Messaging, the message is routed appropriately based on the country, carrier, and originator type being used.

The inherent complexity of SMS delivery means there are numerous potential service degradation points, such as issues with an aggregator, carrier configurations, and filtering. The general availability of a mobile device can also be a factor, because it’s dependent on the current health of the network the device uses. Things as simple as weather changes or location (such as parking garages) can impact the delivery of messages and illustrates why alternate channels should be provided.

Understanding this underlying architecture is crucial for building resilient SMS systems that can withstand disruptions at different stages of the process. The following diagram shows a simplified version of how an SMS is delivered to a handset.

The need for SMS resiliency

Given the complex dependencies and potential points of degradation in the SMS delivery chain, it’s critical to architect your SMS communications for resilience. This helps make sure your messages can be delivered reliably, even when facing Regional service disruptions, carrier challenges, or other potential communication barriers.

Levels of resiliency for SMS

The following are the three levels of resiliency to consider for SMS and some potential reasons for disruption:

Originator-level resiliency – Carriers and other downstream entities can sometimes block or filter specific origination numbers, causing delivery issues. Originators must be configured with these downstream entities, so downstream misconfigurations might occur.
Account resiliency – Your primary AWS account might experience a disruption, preventing you from sending messages through that account. Account-level issues, such as reaching an account SMS spending limit or throughput, might limit your ability to send from a specific account.
AWS Region resiliency – Regions can experience degradation of service, and originators are tied to an account and Region when they are configured and can’t be moved.

General best practices for SMS resiliency

A phone pool, also known as a pool, is a collection of originators that share the same settings. When you send messages through a phone pool, it selects an appropriate origination identity to use for sending the message based on the country code. In general, pools will select the highest throughput originator in the pool for the country being sent to. This means that the order from first to last will be short codes, long codes, sender ID, toll-free, and finally shared routes. If one of the origination identities in the phone pool is unable to send the message, the phone pool will automatically fail over to another origination identity, which is part of the same phone pool, for that same country if there is one available.

Having a dedicated originator for each country you send to improves deliverability and allows for two-way communication if the originator supports it. Pools have a setting for shared routes in some countries, which is a pool of shared origination identities that AWS maintains. When you activate shared routes on a pool and don’t have a dedicated originator, AWS End User Messaging SMS attempts to deliver your message using one of the shared identities. The origination identity could be a sender ID, long code, or short code, and could vary within each country. These shared routes are not capable of two-way communication so they are not eligible for any use cases that require it. Deliverability on these shared routes also varies; it’s always a best practice to use a shared route as a last resort option. Using at least one dedicated originator for each destination and use case you support will improve your deliverability and the experience of your end-users. Refer to How to Manage Global Sending of SMS with AWS End User Messaging for more details on getting ready to send SMS. The post includes a template for organizing use cases and selecting originators.

AWS End User Messaging provides several options for sending Delivery Receipts (DLRs), as shown in the following diagram, including Amazon Simple Notification Service (Amazon SNS), Amazon CloudWatch Logs, and Amazon Data Firehose. If you are using a multi-Region or multi-account architecture, it’s important to centralize this data. The following GitHub repo provides a solution as a deployable starter project that builds on top of an Amazon S3 storage location and combines channel engagement and conversational data into a centralized data store. You can also optionally deploy an Amazon QuickSight dashboard to visualize the engagement data.

Using the message feedback feature of AWS End User Messaging also allows for more visible and finite message statuses. You can use signals from customers to determine if they have received the message and set the message feedback status record as delivered. Using message feedback means you don’t have to wait for the DLR to be returned and you can set your message as received and update your message metrics. Message feedback can be used for typical user actions, such as completing a workflow, clicking a link, verifying a one-time password.

When choosing a repository for your DLRs, make sure to consider your data requirements related to data consistency, tolerance for latency, performance requirements, and your unique access patterns.

Strategies for resiliency at each level

Each level at which SMS operates provides layered resiliency. You don’t need to implement all the layers; this will depend on your comfort level for complexity and increased cost. In this section, we review the resiliency strategies at each level.

Originator-level resiliency

AWS recommends provisioning a minimum of two origination number types per country in each Region you are using to provide redundancy. Different originator types often use different paths to send, so if one sending is degraded on one path, you can switch to the other. The implementation itself will depend on the countries you are sending to and the level of complexity and cost you are willing to incur, because some originators have costs associated with owning them.If you decide to have multiple originators, we recommend communicating with your end-users about the methods by which you might communicate with them. This reduces the chance of spam complaints if you need to deliver SMS with an unfamiliar originator.

Let’s explore an example of designing originator-level resiliency for the US (the general pattern is the same across different countries).The US options for originators, in order of highest to lowest throughput, are short codes, 10DLC, and toll-free numbers (TFNs). Each requires registration to be completed. Depending on your throughput needs, there are a few things we recommend when implementing resiliency in the US.

If you’re using 10DLC, we recommend getting at least one other 10DLC number that you don’t use. If you encounter a filtering or blocking event by US carriers, you can use this number to swap into your pool to continue to be able to send while you solve the problem on the blocked or filtered number. This might give you more time to fix an issue while still maintaining your ability to send. The other option, and another layer of redundancy, is to register a TFN that you could swap into your pool. Although TFNs have lower throughput, this can help you continue some level of sending while solving for the blocking issue.

If you’re using a short code, you have an added layer of redundancy because carriers don’t generally block those codes without warning. You will receive an audit and be given a chance to fix whatever issue the carriers have found with your sending. Having a second short code or using a lower throughput backup such as a 10DLC or TFN is also an option.

Account-level resiliency

There is always the chance that your primary account could be degraded in some way. Issues such as an inaccessible account or hitting a spending limit can take time to mitigate. For example, Artificially Inflated Traffic (AIT), also known as SMS pumping fraud, can cause your spending limit to be hit, shutting off your ability to send from that account. To learn more, refer to Defending Against SMS Pumping: New AWS Features to Help Combat Artificially Inflated Traffic.

You can mitigate these issues by having a secondary account in the same Region that you share your originators, pools, and opt-out lists with by using AWS Resource Access Manager (AWS RAM) to enable resource sharing. You can use AWS RAM to share some AWS End User Messaging SMS resources with other AWS accounts or through AWS Organizations. The accounts being shared to must be in the same Region as the account that owns the resources. Configuring this sharing makes it possible to send from a secondary account using the same resources in the primary account. Billing on the volume is attributed to the sending account, whereas charges for the originators are billed to the account that owns them.

Region-level resiliency

There is always the possibility of a Regional degradation of services or a downstream misconfiguration for a particular originator or Region. The only way to protect your sending against this is to configure origination numbers in at least one other Region. This way, you can fail over to a secondary Region if the primary Region experiences a degradation of service. When implementing this approach, keep the following considerations in mind:

If a country requires registration for SMS sending, you must complete that registration separately in each Region where you plan to use an originator for that country. You can submit the same registration for each Region, or for some originators you can specify multiple Regions at the time of registration, rather than applying twice.
Many countries support sender IDs, and as long as they don’t require a registration, the same sender ID can be configured in multiple Regions. This simplifies the multi-Region setup. If you need to configure many sender IDs, refer to Automating Sender ID Configuration for SMS with AWS End User Messaging APIs to learn how to automate the process of configuring sender IDs across Regions.
Carrier availability can also be a point of failure, so it’s important to have multiple origination numbers provisioned in each Region to avoid a single point of failure.

Although this post focuses specifically on SMS resiliency, as a general best practice for your messaging system, you should also enable alternative channels as failover or primary channels. Channels such as WhatsApp, push, voice, or email offer increased resiliency in the event of a degradation of SMS service.

Automating failover

AWS End User Messaging provides DLR data for your sent messages, which is a key piece of information you can use to automate retries and handle failures. As a protocol, SMS doesn’t guarantee delivery. Depending on the country being sent to, DLRs might take up to 72 hours to be returned or in some cases might not be returned at all. For this reason, relying on DLRs alone is not enough. You might also want to monitor the health of your Region or the AWS End User Messaging service, which can be done through the AWS Health Dashboard.

For a deep dive on managing SMS deliverability, refer to A Guide to Optimizing SMS Delivery and Best Practices, which goes into more detail on the complexities of SMS delivery and how to effectively monitor your message performance.

When it comes to automating your failover process, the DLR data provided by AWS End User Messaging can be a powerful tool. By analyzing the delivery statuses and error codes, you can build logic to automatically retry messages that fail on the first attempt. The key is to build in this automation proactively, rather than relying on manual intervention. Building your failover logic ahead of time can provide for a seamless recovery when delivery issues occur, minimizing disruption to your users.

It’s also important to remember that DLRs are fallible and might take up to 72 hours to arrive. The message feedback feature will give you more insight into message status, and you don’t have to wait for the DLR to be returned. You can set your message as received and update your message metrics based on expected user actions.

The goal is to create a resilient messaging architecture that can withstand the inevitable complexities of SMS delivery. Automating your failover process is a crucial component of that strategy.

Pros and cons of multi-Region SMS redundancy

Although implementing multi-Region redundancy can increase the reliability and resilience of your SMS communications, there are both advantages and trade-offs to consider. Evaluating the specific needs of your use cases against the added complexity and costs is important in determining the optimal approach.

The following are key benefits of having a resilient SMS architecture:

Increased reliability and availability of SMS communications – Having redundant originators and routing across multiple Regions strengthens your ability to withstand Regional disruptions or carrier-specific issues, so you can continue sending SMS reliably.
Seamless failover during outages – The ability to automatically fail over to a secondary Region when issues occur in the primary Region minimizes disruptions and keeps your SMS flowing.
Reduced impact of carrier-specific problems – By diversifying your origination numbers across AWS accounts and Regions, you can avoid being heavily impacted by a problem with a single carrier or originator.

However, consider the following important trade-offs:

Increased complexity in configuration and management – Maintaining redundant resources (originators, phone pools, opt-out lists, and so on) across multiple Regions adds complexity to your SMS architecture. A multi-Region setup requires additional configuration and ongoing maintenance.
Additional costs – Provisioning origination numbers, short codes, and so on in multiple Regions can incur additional costs compared to a single-Region setup. There might also be costs for cross-Region data transfers if centralizing delivery logs and event data. Centralizing DLR data from multiple Regions likely requires additional storage and processing costs.
Potential reputation and deliverability challenges – When failing over to a different Region, your SMS messages might come from new originators. If customers aren’t prepared for this change, they might mistake legitimate messages for spam. These spam reports can harm your overall SMS deliverability rates.

Overall, the pros of increased reliability and resilience must be weighed against the cons of higher complexity and costs. The optimal approach will depend on the criticality of the SMS use cases and your organization’s risk tolerance.

Conclusion

By implementing the layered resiliency strategies outlined in this post, you can significantly improve the reliability of your critical SMS communications. Whether you start with originator-level redundancy using phone pools or build a fully Regional-resilient architecture, proactively investing in your setup helps your messages reach your customers, even in the face of unexpected challenges.

To get started, consider the following next steps:

Evaluate your current SMS workloads and determine what level of resiliency is right for your business needs and risk tolerance.
As a first step, implement phone pools in your primary Region to protect against single-originator filtering or blocking.
For critical applications, set up a secondary account and use AWS RAM to share your primary originators, providing a robust layer of account-level redundancy.

To learn more, explore the AWS End User Messaging documentation and the AWS RAM User Guide. For personalized guidance, work with your AWS account team to design the optimal SMS architecture for your business.

About the author

Migrating from API keys to service account tokens in Grafana dashboards using Terraform

2025-09-11 Majdoulina Makbal

Post Syndicated from Majdoulina Makbal original https://aws.amazon.com/blogs/big-data/migrating-from-api-keys-to-service-account-tokens-in-grafana-dashboards-using-terraform/

With the release of Grafana 9.4, Amazon Managed Grafana added support for service accounts, which have become the recommended authentication method for applications interacting with Amazon Managed Grafana, replacing the previous API key system.

While API keys are created with a specific role that determines their level of access, service accounts offer a more flexible and maintainable approach. They support multiple tokens, can be enabled or disabled independently, and aren’t tied to individual users, allowing applications to remain authenticated even if a user is deleted. Permissions can be assigned directly to service accounts using role-based access control, simplifying management of long-lived access for non-human entities like applications or scripts.

In this blog post, we walk through how to migrate from API keys to service account tokens when automating Amazon Managed Grafana resource management. We will also show how to securely store tokens using AWS Secrets Manager and automate token rotation with AWS Lambda. All infrastructure is deployed using Terraform, though the pattern can be adapted to your infrastructure-as-code framework of choice.

What are service accounts and tokens?

A service account is designed to authenticate automated tools and systems with Amazon Managed Grafana and is intended for programmatic access. A service account token is a secure credential issued to a service account and can be used to authenticate requests to the Amazon Managed Grafana HTTP API. Multiple tokens can be associated with a single service account, and tokens can be individually revoked or rotated without affecting other services or requiring changes to user accounts.

For a deeper understanding, see the Grafana service account documentation.

Solution overview

In this solution, we show you how to create a service account, reference it in your Terraform stack, and then implement rotation of the token associated with it using Lambda and Secrets Manager as shown in the following diagram:

Workflow diagram showing automated secret management between Terraform, AWS Secrets Manager, and Grafana workspace with Lambda rotation

Architecture diagram illustrating the integration between Terraform, AWS Secrets Manager secret store, and an Amazon Managed Grafana workspace, with secret rotation functionality.

The following are the basic steps to set up the solution.

Set up Amazon Managed Grafana with service accounts.
Update the secret in Secrets Manager with the token value.
Automate resource creation in Amazon Managed Grafana using service account tokens in Terraform.
Create a service account and token in your Amazon Managed Grafana workspace.
Store the token securely using Secrets Manager.
Use Terraform to automate Amazon Managed Grafana resource creation with the token.
Automate the rotation of the service account token.

GitHub repo for cloning the code and deploying the Terraform stack.

Prerequisites

Before starting this walkthrough, make sure that you have the following:

The Terraform CLI (1.2.0+) installed.
The AWS CLI installed.
An AWS account with permissions to create resources such as Lambda functions, AWS Identity and Access Management (IAM) roles, Secrets Manager secrets, and Amazon Managed Grafana workspaces.

Solution walkthrough

Use the following steps to set up and configure the solution.

Provision resources using the Terraform stack

The full source code of the solution is in sample-migrate-from-apikeys-grafana and is deployed using Terraform.

Clone the repository.

git clone https://github.com/aws-samples/sample-migrate-from-apikeys-grafana.git

Initialise a Terraform project.

terraform init

Create infrastructure for the secrets and the Amazon Managed Grafana instance.

terraform apply —target=aws_secretsmanager_secret.token —target=aws_grafana_workspace.grafana

This step creates the Amazon Managed Grafana workspace and the Secrets Manager secret. In the next step, you bind the workspace with AWS IAM Identity Center and generate the service account token.

Retrieve service account token from the Amazon Managed Grafana workspace

You must have administrative privileges in your Amazon Managed Grafana workspace to perform this step. This applies whether you’re using IAM Identity Center or an external identity provider for authentication.

To change a user’s role in AWS IAM Identity Center (console)
1. Open the Amazon Managed Grafana console.
2. In the navigation pane, choose Workspaces.
3. Select the workspace you want to manage.
4. On the AWS IAM Identity Center, choose the Assigned users tab.
5. Select the row of the user that you want to modify.
6. For Action, choose the following:
  - Make admin
7. Confirm the role change.

Select the workspace URL and sign in using your credentials, you should be able to create a service account under the name grafana-sa (or the name of the variable defined in /variables.tf).

Assign the Editor role to the service account to allow it to create dashboards and folders. Learn more about service account roles in the Assign roles to a service account in Grafana.
After the service account is created, add a service account token to it, again the name should be similar to the one defined in /variables.tf.

Add the token to Secrets Manager and create the rest of the resources

After you complete this step, the access token will be stored in Secrets Manager and will automatically be used in the provider definition during future runs of terraform apply.

Copy the service account token.

Paste it into the plaintext section of the Secrets Manager secret created in the previous section

With the access token stored in Secrets Manager, there is no longer a need to restrict the apply operation to the rotation module using the --target flag. Use the following code to remove the restriction.
```
provider "grafana" {
  url  = "https://${aws_grafana_workspace.grafana.endpoint}"
  auth = module.grafana_sa_key_automation.grafana_sa_token
}
```

Clean up

To avoid incurring future charges, use the following command to delete unused Amazon Managed Grafana service accounts and Terraform-managed resources run the cli command terraform destroy.

Security notes

To protect the security of your organization, we recommend the following best practices:

Always follow least privilege principles. Grant the minimum permissions needed to the service account (for example, Editor instead of Admin).
Make sure that Amazon Simple Queue Service (Amazon SQS) queues, Secrets Manager secrets, and Amazon CloudWatch Logs are encrypted with a customer-managed KMS key if required by your organization.
Rotate secrets regularly to minimize exposure.

Conclusion

In this post, we demonstrated how to migrate from API keys to Amazon Managed Grafana service account tokens using Terraform, with secure storage in AWS Secrets Manager and optional automated token rotation via AWS Lambda.This modern approach improves security, scalability, and auditing in your automation pipelines.

For more information, see the Amazon Managed Grafana service account documentation.