Tag Archives: AWS Lambda

Building serverless event streaming applications with Amazon MSK and AWS Lambda

2025-06-26 Tarun Rai Madan

Post Syndicated from Tarun Rai Madan original https://aws.amazon.com/blogs/big-data/building-serverless-event-streaming-applications-with-amazon-msk-and-aws-lambda/

As organizations build modern applications with event-driven architectures (EDA), they often seek solutions that minimize infrastructure management overhead while maximizing developer productivity. Amazon Managed Streaming for Apache Kafka (Amazon MSK) and AWS Lambda together provide a serverless, scalable, and cost-efficient platform for real-time event-driven processing.

In this post, we describe how you can simplify your event-driven application architecture using AWS Lambda with Amazon MSK. We demonstrate how to configure Lambda as a consumer for Kafka topics, including a cross-account setup and how to optimize price and performance for these applications.

Why use Lambda with Amazon MSK?

Customers building event-driven applications have several key priorities when it comes to their architecture choices. They typically seek to reduce their operational overhead by using Amazon Web Services (AWS) to handle the complex, underlying infrastructure components so their teams can focus on core business logic. Additionally, developers prefer a streamlined experience that minimizes the need for repetitive boilerplate code, enabling them to be more productive and focus on creating value. Furthermore, these customers want to achieve both scalability and cost-effectiveness without the burden of managing compute infrastructure directly. Lambda integration with Amazon MSK effectively addresses these requirements, delivering a comprehensive solution that combines the benefits of serverless computing with managed Kafka services. For example, an ecommerce company can use Amazon MSK to collect real-time clickstream data from its website and process those events using AWS Lambda. With this integration, they can trigger Lambda functions to update recommendation models, send personalized offers, or analyze user behavior instantly—without provisioning or managing servers. The key benefits of using Lambda with Amazon MSK include:

Simplicity through native integration – AWS Lambda offers native integration with Amazon MSK through a connector resource called event source mapping. You can use this integration to directly associate a Kafka topic—whether it’s on Amazon MSK or a self-managed Kafka cluster—as an event source for a Lambda function without writing custom consumer logic. With just a few configuration steps, event source mapping handles partition assignment, offset tracking, and parallelized batch processing under the hood. It uses the Kafka consumer group protocol to distribute topic partitions across multiple concurrent Lambda invocations, supports batch windowing, and enables at-least-once delivery semantics. Moreover, it automatically commits offsets upon successful function execution while handling retries and dead-letter queue (DLQ) routing for failed records, significantly reducing the operational overhead traditionally associated with Kafka consumers.
Auto scaling and throughput controls – When using AWS Lambda with Amazon MSK through event source mapping, Lambda automatically scales by assigning a dedicated event poller per Kafka partition, enabling parallel, partition-based processing. This allows the system to elastically handle varying traffic without manual intervention. For advanced control, provisioned concurrency pre-initializes Lambda execution environments, eliminating cold starts and delivering consistent low-latency performance. Additionally, with provisioned event source mapping, you can configure the minimum and maximum number of Kafka pollers, providing precise control over throughput and concurrency. This is ideal for applications with unpredictable traffic patterns or strict latency requirements.
Cost-effectiveness – AWS Lambda uses a pay-per-use model in which you only pay for compute time and number of invocations. When integrated with Amazon MSK, there are no charges for idle time, making it ideal for bursty or low-frequency Kafka workloads. You can further optimize costs by tuning batch size and batch window settings. For mission-critical workloads, provisioned concurrency provides consistent performance with controlled pricing.
Event filtering – AWS Lambda supports event filtering for Amazon MSK event sources, which means you can process only the Kafka records that match specific criteria. This reduces unnecessary function invocations and optimizes your function costs. You can define up to five filters per event source mapping (with the option to request an increase to ten). Each filter uses a JSON-based pattern to specify the conditions a record must meet to be processed. Filters can be applied using the AWS Management Console, AWS Command Line Interface (AWS CLI), or AWS Serverless Application Model (AWS SAM) templates. For more details and examples, refer to the AWS Lambda documentation on event filtering with Amazon MSK.
Handling Availability Zone outage for your consumer – Amazon MSK enables high availability for your Kafka brokers by distributing them across multiple Availability Zones within a Region. To maintain high availability across your application, you similarly need a consumer that offers high availability. AWS Lambda offers high availability and resilience by running your consumer functions across multiple Availability Zones in a Region. This means that even if one Availability Zone experiences an outage, your Lambda function will continue to operate in other healthy Availability Zones. While Lambda manages security patching and Availability Zone failure scenarios, you can focus on your application logic.
Cross-account event processing – Cross-account connectivity between AWS Lambda and Amazon MSK allows a Lambda function in one AWS account to consume data from an MSK cluster in another account using MSK multi-VPC private connectivity powered by AWS PrivateLink. This setup is particularly beneficial for organizations that centralize Kafka infrastructure while maintaining separate accounts for different applications or teams.
Support for JSON, Avro, Protobuf, and Schema Registries – AWS Lambda supports Kafka events in JSON, Avro and Protobuf formats via event source mapping. It integrates with AWS Glue Schema registry, Confluent Cloud Schema registry, and self-managed Confluent Schema registry , enabling native schema validation, filtering, and deserialization without custom code.

How Lambda processes messages from your Kafka topic

Lambda uses event source mappings to process records from Amazon MSK by actively polling Kafka topics through event pollers that invoke Lambda functions with batches of records. These mappings are Lambda managed resources designed for high-throughput, stream-based processing. By default, Lambda detects the OffsetLag for all partitions in your Kafka topic and automatically scales pollers based on traffic. For high-throughput applications, you can enable provisioned mode to define minimum and maximum pollers, and your event source mapping auto scales between the minimum and maximum defined values. In the provisioned mode, each poller can process up to 5 MBps and supports concurrent Lambda invocations.

After Lambda processes each batch, it commits the offsets of the messages in that batch. If your function returns an error for a message in a batch, Lambda retries the whole batch of messages until processing succeeds or the messages expire. You can send records that fail all retry attempts to an on-failure destination for later processing. To maintain ordered processing within a partition, Lambda limits the maximum event pollers to the number of partitions in the topic. When setting up Kafka as a Lambda event source, you can specify a consumer group ID to let Lambda join an existing Kafka consumer group. If other consumers are active in that group, Lambda will receive only part of the topic’s messages. If the group exists, Lambda starts from the group’s committed offset, ignoring the StartingPosition. The following diagram illustrates this flow.

Walkthrough: Build a serverless Kafka app with AWS Lambda

Follow these steps to build a serverless application that consumes messages from an MSK cluster using AWS Lambda:

Create an Amazon MSK cluster. Use the AWS Management Console or AWS CLI to create your MSK cluster. When the cluster is up, create your Kafka topic(s). For detailed instructions, refer to the Amazon MSK documentation.
Create a Lambda function using the AWS Management Console or the AWS CLI. To learn more about creating a Lambda function, refer to Create your first Lambda function. The Lambda function’s execution role needs to have the following permissions:
1. Access to connect to your MSK cluster
2. Permissions to manage elastic network interfaces in your VPC
To connect Lambda to Amazon MSK as a consumer, set up event source mapping to link your MSK topic with the Lambda function. This allows Lambda to automatically poll for new messages and process them. Follow the guide on how to configure event source mapping.

For reference, configuring event source mapping involves three steps:

Network setup – In the default event source mapping mode, you need to configure a networking setup using a PrivateLink endpoint or NAT gateway for event source mapping to invoke Lambda functions. In provisioned mode, no networking configuration is needed (and you don’t incur the cost of networking components).
Event source mapping parameter configuration – This involves setting necessary configuration parameters for the event source mapping to be able to poll messages from your Kafka cluster. This includes the MSK cluster, topic name, consumer group ID, authentication method, and optionally, schema registry, scaling mode. You can configure the scaling mode for provisioned throughput, along with batch size, batch window, and event filtering for your event source mapping.
Access permissions – This involves configuring required permissions to access the required AWS resources, and includes configuring permissions for the function to execute the code, permissions for the event source mapping to access your MSK cluster, and permissions for Lambda to access your VPC resources.

The following screenshot shows the console setup for configuring Amazon MSK event source mapping, including the Amazon MSK trigger related fields.

The following screenshot shows event poller configuration.

The following screenshot shows additional settings you can use, depending on your use case.

Optimizing AWS Lambda for stream processing with Amazon MSK

When building real-time data processing pipelines with Amazon MSK and AWS Lambda, it’s important to tune your setup for both performance and cost-efficiency. Lambda offers powerful serverless compute capabilities, but to get the most out of it in a streaming context, you need to make a few key optimizations:

Enable provisioned concurrency for low-latency processing – For workloads that are sensitive to latency—cold starts can introduce unwanted delays. By enabling provisioned concurrency, you can pre-warm a specified number of Lambda instances so they’re always ready to handle traffic immediately. This eliminates cold starts and provides consistent response times, which is crucial for latency-critical use cases.
Enable provisioned mode for event source mapping for high-throughput processing – For Kafka workloads with stringent throughput requirements, activate the provisioned mode. The optimal configuration of minimum and maximum event pollers for your Kafka event source mapping depends on your application’s performance requirements. Start with the default minimum event pollers to baseline the performance profile and adjust event pollers based on observed message processing patterns and your application’s performance requirements. For workloads with spiky traffic and strict performance needs, increase the minimum event pollers to handle sudden surges. You can fine-tune the minimum event pollers by evaluating your desired throughput, your observed throughput, which depends on factors such as the ingested messages per second and average payload size, and using the throughput capacity of one event poller (up to 5 MB/s) as reference. To maintain ordered processing within a partition, Lambda caps the maximum event pollers at the number of partitions in the topic.
Optimize message batching using size and windowing – By integrating Lambda with Amazon MSK, you can control how messages are batched before they’re sent to your function. Tuning parameters such as batch size (the number of records per invocation: 1–10,000 records) and maximum batching window (how long to wait for a full batch: 0–300 seconds) can significantly impact performance. Larger batches mean fewer invocations, which reduces overhead and improves throughput. However, it’s important to strike a balance—too large a batch or window might introduce unwanted processing delays. Monitor your stream’s behavior and adjust these settings based on throughput requirements and acceptable latency.
Apply filters to reduce unnecessary invocations – Not every record in your Kafka topic might require processing. To avoid unnecessary Lambda invocations (and associated costs), apply filtering logic directly when configuring the event source mapping. With Lambda, you can define filtering (up to 10 filters) criteria so that only relevant records trigger your function. This helps reduce compute time, minimize noise, and optimize your budget, especially when dealing with high-throughput topics with mixed content. For Amazon MSK, Lambda commits offsets for matched and unmatched messages after successfully invoking the function.

Conclusion

By combining Amazon MSK with AWS Lambda, you can seamlessly build modern, serverless event-driven applications. This integration eliminates the need to manage consumer groups, compute infrastructure, or scaling logic so teams can focus on delivering business value faster.

Whether you’re integrating Kafka into microservices, transforming data pipelines, or building reactive applications, Lambda with Amazon MSK is a powerful and flexible serverless solution. For detailed documentation on how to configure Lambda with Amazon MSK, refer to the AWS Lambda Developer Guide. For more serverless learning resources, visit Serverless Land.

About the Authors

Tarun Rai Madan is a Principal Product Manager at Amazon Web Services (AWS). He specializes in serverless technologies and leads product strategy to help customers achieve accelerated business outcomes with event-driven applications, using services like AWS Lambda, AWS Step Functions, Apache Kafka, and Amazon SQS/SNS. Prior to AWS, he was an engineering leader in the semiconductor industry, and led development of high-performance processors for wireless, automotive, and data center applications.

Masudur Rahaman Sayem is a Streaming Data Architect at AWS with over 25 years of experience in the IT industry. He collaborates with AWS customers worldwide to architect and implement sophisticated data streaming solutions that address complex business challenges. As an expert in distributed computing, Sayem specializes in designing large-scale distributed systems architecture for maximum performance and scalability. He has a keen interest and passion for distributed architecture, which he applies to designing enterprise-grade solutions at internet scale.

Introducing AWS Lambda native support for Avro and Protobuf formatted Apache Kafka events

2025-06-23 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/introducing-aws-lambda-native-support-for-avro-and-protobuf-formatted-apache-kafka-events/

AWS Lambda now provides native support for Apache Avro and Protocol Buffers (Protobuf) formatted events with Apache Kafka event source mapping (ESM) when using Provisioned Mode. The support allows you to validate your schema with popular schema registries. This allows you to use and filter the more efficient binary event formats and share data using schema in a centralized and consistent way. This blog post shows how you can use Lambda to process Avro and Protobuf formatted events from Kafka topics using schema registry integration.

This new capability works with both Amazon Managed Streaming for Apache Kafka (Amazon MSK), Confluent Cloud and self-managed Kafka clusters. To get started, update your existing Kafka ESM to Provisioned Mode and add schema registry configuration, or create a new ESM in Provisioned Mode with schema registry integration enabled.

Avro and Protobuf

Many organizations use Avro and Protobuf formats with Apache Kafka because these binary serialization formats offer advantages over JSON. They provide 50-80% smaller message sizes, faster serialization and deserialization performance, robust schema evolution capabilities, and strong typing across multiple programming languages.Working with these formats in Lambda functions previously necessitated custom code. Developers needed to implement schema registry clients, handle authentication and caching, write format-specific deserialization logic, and manage schema evolution scenarios.

What’s new

Lambda’s Kafka Event Source Mapping (ESM) now provides built-in integration with AWS Glue Schema Registry, Confluent Cloud Schema Registry, and self-managed Confluent Schema Registry. When you configure schema registry settings for your Kafka ESM, the service automatically validates incoming JSON Schema, Avro, and Protobuf records against their registered schema. This moves complex schema registry integration logic from your application layer to the managed Lambda service.

You can build your function with Kafka’s open-source ConsumerRecords interface using Powertools for AWS Lambda to get your Avro or Protobuf generated business objects directly. Optionally you can specify to get your records in the JSON format, where your function receives clean, validated JSON data regardless of the original serialization format, removing the need for custom deserialization code in your Lambda functions. This also allows you to create Kafka consumers across multiple programming languages.

Powertools for AWS Lambda is a developer toolkit that provides specific support for Java, .NET, Python, and TypeScript, maintaining consistency with existing Kafka development patterns. You can directly access business objects without custom deserialization code.

You can also setup filtering rules to discard irrelevant, JSON, Avro or Protobuf formatted events before function invocations, which can improve processing performance and reduce costs.

How schema validation works

When you configure schema registry integration for your Kafka ESM, you specify the registry endpoint, authentication details, and which event fields (key, value, or both) to validate. The ESM polls your Kafka topics for records as usual but now performs additional processing before invoking your Lambda function.For each incoming event, the ESM extracts the schema ID embedded in the serialized data. It fetches the corresponding schema from your configured registry. This process happens transparently, with schema definitions cached for up to 24 hours to optimize performance. The ESM identifies the format of your events using schema metadata and validates the event structure. It keeps either the original binary data or deserializes it to JSON format based on your customer configuration and sends it to your function for processing.

Figure 1: Kafka processing flow diagram.

The ESM handles schema evolution automatically. When producers begin using new schema versions, the service detects the updated schema IDs and fetches the latest definitions from your registry. This makes sure that your functions always receive properly deserialized data without requiring code changes.

Event record format

As a part of the ESM schema registry configuration, you need to specify Event Record Format, which Lambda uses to deliver validated records to your function. The schema registry configuration supports SOURCE and JSON.

SOURCE preserves the original binary format of the data as a base64-encoded string with producer-appended schema-id removed. This allows direct conversion to Avro or Protobuf objects so that you can use Kafka’s ConsumerRecords interface for a Kafka-like experience. Use this format when working with strongly typed languages or when you need to maintain the full capabilities of Avro or Protobuf schemas. Then, you can use any Avro or Protobuf deserializer to convert raw bytes to your business object. Powertools provides native support for this deserialization.

With JSON, the ESM deserializes the data ready for direct use in languages with native JSON support. Use this when you don’t need to preserve the original binary format or work with generated classes. You can also use Powertools to convert the base64 to your business object. See the documentation for payload formats and deserialization behavior.

If you configure filtering rules, then they operate on the JSON-formatted events after deserialization. This upstream filtering prevents unnecessary Lambda invocations for events that don’t match your processing criteria, directly reducing your compute costs.

Configuration and setup

To use this feature, you must enable Provisioned Mode for your Kafka ESM, which provides the dedicated compute resources needed for schema registry integration.

You can configure the integration through the AWS Management Console, AWS Command Line Interface (AWS CLI), AWS Language SDKs, or infrastructure as code (IaC) tools such as the AWS Serverless Application Model (AWS SAM) or AWS Cloud Development Kit (AWS CDK).

Your schema registry configuration includes the registry endpoint URL, authentication method (AWS Identity and Access Management (IAM) for AWS Glue Schema Registry, or Basic Auth, SASL/SCRAM, or mTLS for Confluent registries), and validation settings. You specify which event attributes to validate and optionally define filtering rules using standard Lambda event filtering syntax.

For error handling, configure Lambda failure destinations where events that fail schema validation or deserialization are sent. This makes sure that problematic events don’t disappear silently but are routed to other services such as Amazon Simple Queue Service (Amazon SQS), Amazon Simple Notification Service (Amazon SNS), and Amazon S3 for debugging and analysis.

Seeing the new features in action

There are a number of Serverless Patterns that you can use to process Kafka streams using Lambda. This example uses the Java pattern.

Deploy a sample Amazon MSK cluster

To set up an Amazon MSK cluster, follow the instructions in the GitHub repo and create a new AWS CloudFormation stack using the MSKAndKafkaClientEC2.yaml template file. The stack creates the Amazon MSK cluster, along with a client Amazon EC2 instance, to manage the Kafka cluster. There are costs involved when running this infrastructure.

Connect to the EC2 instance using EC2 Instance Connect.
Check that the Kafka topic is created by checking the contents of the kafka_topic_creator_output.txt file.
```
cat kafka_topic_creator_output.txt
```
The file should contain the text: “Created topic MskIamJavaLambdaTopic.”

Deploy the Glue schema registry and consumer Lambda function

The EC2 instance contains the software needed to deploy the schema registry and Lambda function.

Change directory to the pattern directory.
cd serverless-patterns/msk-lambda-iam-java-sam
Build the application using AWS SAM.
sam build

To deploy your application for the first time, run the following in the EC2 instance shell:

sam deploy --capabilities CAPABILITY_IAM --no-confirm-changeset \
	--no-disable-rollback --region $AWS_REGION --stack-name msk-lambda-schema-avro-java-sam --guided

You can accept all the defaults by hitting Enter. You can browse to the AWS Glue schema registry console and view the ContactSchema definition:

{
  "type": "record",
  "name": "Contact",
  "fields": [
    {"name": "firstname", "type": "string"},
    {"name": "lastname", "type": "string"},
    {"name": "company", "type": "string"},
    {"name": "street", "type": "string"},
    {"name": "city", "type": "string"},
    {"name": "county", "type": "string"},
    {"name": "state", "type": "string"},
    {"name": "zip", "type": "string"},
    {"name": "homePhone", "type": "string"},
    {"name": "cellPhone", "type": "string"},
    {"name": "email", "type": "string"},
    {"name": "website", "type": "string"}
  ]
}

The consumer Lambda function ESM is configured for Provisioned Mode.

View the ESM configuration from the Lambda console for the Lambda function name prefixed with msk-lambda-schema-avro-ja-LambdaMSKConsumer.
Choose the MSK Lambda trigger which opens the Triggers pane under Configuration.
Figure 2: View Lambda ESM schema configuration
The configuration specifies using the Event record format SOURCE so your function can use Kafka’s native open-source ConsumerRecords interface. Powertools then deserializes the payload.
The schema validation attribute is VALUE.
The ESM filter configuration only processes the records that match zip codes of 2000.
In your function code, specify the open-source Kafka ConsumersRecords interface by including Powertools for Lambda as a dependency. ConsumerRecords provides metadata about Kafka records and allows you to get direct access to your Avro/Protobuf generated business objects without requiring any additional deserialization code.

package com.amazonaws.services.lambda.samples.events.msk;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import software.amazon.lambda.powertools.kafka.Deserialization;
import software.amazon.lambda.powertools.kafka.DeserializationType;
import software.amazon.lambda.powertools.logging.Logging;

public class AvroKafkaHandler implements RequestHandler<ConsumerRecords<String, Contact>, String> {
    private static final Logger LOGGER = LoggerFactory.getLogger(AvroKafkaHandler.class);

    @Override
    @Logging(logEvent = true)
    @Deserialization(type = DeserializationType.KAFKA_AVRO)
    public String handleRequest(ConsumerRecords<String, Contact> records, Context context) {
        LOGGER.info("=== AvroKafkaHandler called ===");
        LOGGER.info("Event object: {}", records);
        LOGGER.info("Number of records: {}", records.count());
        
        for (ConsumerRecord<String, Contact> record : records) {
            LOGGER.info("Processing record - Topic: {}, Partition: {}, Offset: {}", 
                       record.topic(), record.partition(), record.offset());
            LOGGER.info("Record key: {}", record.key());
            LOGGER.info("Record value: {}", record.value());
            
            if (record.value() != null) {
                Contact contact = record.value();
                LOGGER.info("Contact details - firstName: {}, zip: {}", 
                           contact.getFirstname(), contact.getZip());
            }
        }
        
        LOGGER.info("=== AvroKafkaHandler completed ===");
        return "OK";
    }
}

Produce and consumer records

To send messages to Kafka, there is a LambdaMSKProducerJava function.

Invoke the function from the Lambda console or CLI within the EC2 instance.

sam remote invoke LambdaMSKProducerJavaFunction --region $AWS_REGION \
	--stack-name msk-lambda-schema-avro-java-sam

You can view the Producer logs to see the 10 records produced.The consumer Lambda function processes the records.

View the consumer Lambda function logs using the Amazon CloudWatch logs console or CLI within the EC2 instance.

sam logs --name LambdaMSKConsumerJavaFunction \
	--stack-name msk-lambda-schema-avro-java-sam --region $AWS_REGION

The Lambda function processes and logs only the records that match the filter FILTER. The Avro binary data is deserialized using Powertools for AWS Lambda. You should see the function logs showing each record processed with the decoded keys and values.

Figure 3: Lambda consumer logs showing Avro processing

Cleaning up

You can clean up the example Lambda function by running the sam delete command.

sam delete

If you created the Amazon MSK cluster and EC2 client instance, then navigate to the CloudFormation console, choose the stack, and choose Delete.

Performance and cost considerations

Schema validation and deserialization can add processing time before your function invocation. However, this overhead is typically minimal when compared to the benefits. ESM caching minimizes schema registry API calls. Using filtering allows you to reduce costs, depending on how effectively your filtering rules eliminate irrelevant events. This feature simplifies the operational overhead of managing schema registry integration code so teams can focus on business logic rather than infrastructure concerns.

Error handling and monitoring

If schema registries become temporarily unavailable, then cached schemas allow event processing to continue until the registry is available again. Authentication failures generate error messages with automatic retry logic. Schema evolution happens seamlessly as Lambda automatically detects and fetches new versions.

If events fail validation or deserialization, they are routed to your configured failure destinations. For Amazon SQS and Amazon SNS destinations, the service sends metadata about the failure. For Amazon S3 destinations, both metadata and the original serialized payload are included for detailed analysis.

You can use standard Lambda monitoring, with more CloudWatch metrics providing visibility into schema validation success rates, registry API usage, and filtering effectiveness.

Conclusion

AWS Lambda now supports Avro and Protobuf formats for Kafka event processing in Provisioned Mode for Kafka ESM. This enables schema validation, event filtering, and integration with both Amazon MSK, Confluent, and self-managed Kafka clusters. Whether you’re building new Kafka applications or migrating existing consumers to Lambda, this native schema registry integration streamlines processing pipelines.

For more information about the Lambda Kafka integration capabilities, go to the learning guide, Lambda ESM documentation. To learn about Lambda pricing, such as Provisioned Mode costs, visit the Lambda pricing page.

For more serverless learning resources, visit Serverless Land.

AWS Weekly Roundup: re:Inforce re:Cap, Valkey GLIDE 2.0, Avro and Protobuf or MCP Servers on Lambda, and more (June 23, 2025)

2025-06-23 Sébastien Stormacq

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-reinforce-recap-valkey-glide-2-0-avro-and-protobuf-or-mcp-servers-on-lambda-and-more-june-23-2025/

Last week’s hallmark event was the security-focused AWS re:Inforce conference.

Now a tradition, the blog team wrote a re:Cap post to summarize the announcements and link to some of the top blog posts.

To further summarize, several new security innovations were announced, including enhanced IAM Access Analyzer capabilities, MFA enforcement for root users, and threat intelligence integration with AWS Network Firewall. Other notable updates include exportable public SSL/TLS certificates from AWS Certificate Manager, a simplified AWS WAF console experience, and a new AWS Shield feature for proactive network security (in preview). Additionally, AWS Security Hub has been enhanced for risk prioritization (Preview), and Amazon GuardDuty now supports Amazon EKS clusters.

But my favorite announcement came from the Amazon Verified Permissions team. They released an open source package for Express.js, enabling developers to implement external fine-grained authorization for web application APIs. This simplifies authorization integration, reducing code complexity and improving application security.

The team also published a blog post that outlines how to create a Verified Permissions policy store, add Cedar and Verified Permissions authorisation middleware to your app, create and deploy a Cedar schema, and create and deploy Cedar policies. The Cedar schema is generated from an OpenAPI specification and formatted for use with the AWS Command Line Interface (CLI).

Let’s look at last week’s other new announcements.

Last week’s launches
Apart from re:Inforce, here are the launches that got my attention.

AWS Lambda announces native support for Avro and Protobuf formatted Kafka events — AWS Lambda now provides native support for Avro and Protobuf formatted Kafka events with Apache Kafka’s event-source-mapping (ESM). This integration allows you to validate your schema, filter events, and process them using open source Kafka consumer interfaces. You can also use Powertools for AWS Lambda to process your Kafka events without writing custom deserialization code, making it easier to build your Kafka applications with AWS Lambda.

Kafka customers use Avro and Protobuf formats for efficient data storage, fast serialization and deserialization, schema evolution support, and interoperability between different programming languages. They utilize schema registries to manage, evolve, and validate schemas before data enters processing pipelines. Previously, you were required to write custom code within your Lambda function to validate, deserialize, and filter events when using these data formats. With this launch, Lambda natively supports Avro and Protobuf, as well as integration with GSR, CCSR, and SCSR. This enables you to process your Kafka events using these data formats without writing custom code. Additionally, you can optimize costs through event filtering to prevent unnecessary function invocations.

Amazon S3 Express One Zone now supports atomic renaming of objects with a single API call – The RenameObject API simplifies data management in S3 directory buckets by transforming a multi-step rename operation into a single API call. This means you can now rename objects in S3 Express One Zone by specifying an existing object’s name as the source and the new name as the destination within the same S3 directory bucket. With no data movement involved, this capability accelerates applications like log file management, media processing, and data analytics, while also lowering costs. For instance, renaming a 1-terabyte log file can now complete in milliseconds, instead of hours, significantly accelerating applications and reducing costs.
Valkey introduces GLIDE 2.0 with support for Go, OpenTelemetry, and pipeline batching – AWS, in partnership with Google and the Valkey community, announces the general availability of General Language Independent Driver for the Enterprise (GLIDE) 2.0. This is the latest release of one of AWS’s official open-source Valkey client libraries. Valkey, the most permissive open-source alternative to Redis, is stewarded by the Linux Foundation and will always remain open-source. Valkey GLIDE is a reliable, high-performance, multi-language client that supports all Valkey commands

GLIDE 2.0 introduces new capabilities that expand developer support, improve observability, and optimise performance for high-throughput workloads. Valkey GLIDE 2.0 extends its multi-language support to Go (contributed by Google), joining Java, Python, and Node.js to provide a consistent, fully compatible API experience across all four languages. More language support is on the way. With this release, Valkey GLIDE now supports OpenTelemetry, an open-source, vendor-neutral framework that enables developers to generate, collect, and export telemetry data and critical client-side performance insights. Additionally, GLIDE 2.0 introduces batching capabilities, reducing network overhead and latency for high-frequency use cases by allowing multiple commands to be grouped and executed as a single operation.

You can discover more about Valkey GLIDE in this recent episode of the AWS Developers Podcast: Inside Valkey GLIDE: building a next-gen Valkey client library with Rust.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Some other reading
My Belgian compatriot Alexis has written the first article of a two-part series explaining how to develop an MCP Tool server with a streamable HTTP transport and deploy it on Lambda and API Gateway. This is a must-read for anyone implementing MCP servers on AWS. I’m eagerly looking forward to the second part, where Alexis will discuss authentication and authorization for remote MCP servers.

Other AWS events
Check your calendar and sign up for upcoming AWS events.

AWS GenAI Lofts are collaborative spaces and immersive experiences that showcase AWS expertise in cloud computing and AI. They provide startups and developers with hands-on access to AI products and services, exclusive sessions with industry leaders, and valuable networking opportunities with investors and peers. Find a GenAI Loft location near you and don’t forget to register.

AWS Summits are free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Register in your nearest city: Japan (this week June 25 – 26), Online in India (June 26), New-York City (July 16).

Save the date for these upcoming Summits in July and August: Taipei (July 29), Jakarta (August 7), Mexico (August 8), São Paulo (August 13), and Johannesburg (August 20) (and more to come in September and October).

Browse all upcoming AWS led in-person and virtual events here.

That’s all for this week. Check back next Monday for another Weekly Roundup!

— seb

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Networking of Amazon MQ for RabbitMQ event source mapping for AWS Lambda

2025-06-18 Rafal Pawlaszek

Post Syndicated from Rafal Pawlaszek original https://aws.amazon.com/blogs/compute/networking-of-amazon-mq-for-rabbitmq-event-source-mapping-for-aws-lambda/

Event-driven architectures with message brokers need careful attention to security best practices. Amazon MQ for RabbitMQ combined with AWS Lambda enables serverless event processing. However, implementing defense in depth and least privilege principles necessitates a clear understanding of networking requirements. This is particularly important when working with different subnet types and their impact on service connectivity.

This post explores the networking aspects of Lambda event source mapping for Amazon MQ for RabbitMQ. Learn how deployment options influence your networking setup and security posture to make informed architectural decisions. These networking concepts are essential for building secure, scalable solutions, regardless of your experience level with message brokers.

For clarity in this post, when we refer to “RabbitMQ”, we mean Amazon MQ for RabbitMQ.

Prerequisites

The following prerequisites are necessary to complete this post:

An Amazon Web Services (AWS) account
Basic understanding of AWS networking concepts
Familiarity with Amazon MQ for RabbitMQ
Basic knowledge of Lambda

Furthermore, to enable setup of the discussed architectures, this post is accompanied by a GitHub repository that uses AWS Cloud Development Kit (AWS CDK).

Repository prerequisites

The following prerequisites are necessary for the repository:

Repository setup

Clone the https://github.com/aws-samples/sample-amazonmq-rabbitmq-lambda-esm repository. This repository contains all the necessary code and instructions to create relevant architectures using AWS CDK.

Install dependencies and build

Install the necessary NPM dependencies by running the following commands:

npm install
npm run build

Amazon MQ for RabbitMQ networking deployment options

Public accessibility is the primary networking differentiator when deploying a RabbitMQ broker in AWS. Although the broker operates in the Amazon MQ service account, the networking configuration varies based on this choice.

Public broker

When you deploy a publicly accessible broker, Amazon MQ provisions all networking components in the service account. The service provides a DNS name that resolves to an IP address of the Network Load Balancer (NLB) in that account. This configuration doesn’t support security groups. All security measures must be implemented through the RabbitMQ broker’s authentication and authorization mechanisms. The following diagram shows this communication flow.

Figure-1 DNS resolution of a public Amazon MQ for RabbitMQ broker.

Private broker

A private broker routes networking through a Amazon Virtual Private Cloud (Amazon VPC) in your account. Amazon MQ uses AWS PrivateLink to provision VPC Endpoints, which serve as entry points for broker communication.

The following diagram shows how client applications communicate with RabbitMQ:

The client application connects to Amazon Route 53 Resolver
Route 53 Resolver resolves the DNS name to the VPC Endpoint’s IP address
The client communicates with the broker through PrivateLink
Security groups protect the VPC Endpoint’s Elastic Network Interfaces (ENIs)

Figure-2 DNS resolution of a private Amazon MQ for RabbitMQ broker.

A private broker deployment offers two networking options:

Custom VPC configuration – Specify:
- Subnets for VPC Endpoint creation
- At least one security group to protect the VPC Endpoints
Default VPC configuration – Leave VPC options blank to use:
- Default VPC
- Default security group

Amazon MQ for RabbitMQ Lambda event source mapping building blocks

RabbitMQ solutions offer two approaches for message processing:

Create a custom client to read messages from broker queues
Use Lambda functions with event source mapping (ESM) for automated message retrieval

The ESM is a Lambda service resource that reads the messages from the broker and invokes the Lambda function synchronously. In the remainder of this post, we refer to this Lambda function as listener.

ESM connectivity depends on the following:

The listener’s AWS Identity and Access Management (IAM) Role for access permissions
RabbitMQ broker networking components

For public brokers, ESM uses public connectivity. For private brokers, ESM:

Assumes the listener’s IAM Role
Creates ENIs in the same subnets as the broker’s VPC Endpoints
Uses the same security groups that protect the VPC Endpoints

The listener’s IAM Role must include these Amazon Elastic Compute Cloud (Amazon EC2) permissions:

CreateNetworkInterface
DeleteNetworkInterface
DescribeNetworkInterfaces
DescribeSecurityGroups
DescribeSubnets
DescribeVpcs

To view ESM ENIs:

Open the AWS Management Console
Navigate to EC2 > Network Interfaces
Look for ENIs with the following naming pattern:
AWS Lambda VPC ENI-armq-<ACCOUNT_ID>-<ESM_ID>-<remainder>

where:
- ACCOUNT_ID – The AWS account number containing the ESM
- ESM_ID – The unique identifier of the ESM

The following image shows example ESM ENIs.

Figure-3 An example list of interfaces that Amazon MQ for RabbitMQ creates for private brokers.

Disabling or deleting the ESM removes the ESM components.

An enabled ESM needs connectivity to the following:

AWS Security Token Service: For IAM role assumption
AWS Secrets Manager: For RabbitMQ credentials
RabbitMQ broker: For queue access
AWS Lambda: For listener invocation

Because the ESM queue polling process follows these steps:

Assumes the listener’s IAM Role
Retrieves RabbitMQ credentials from Secrets Manager
Establishes broker communication
Invokes the listener when messages are present

You have two options to enable private broker connectivity to support the queue polling process:

Deploy VPC endpoints in ESM subnets for:
- AWS Security Token Service (AWS STS)
- Secrets Manager
- Lambda
Deploy NAT gateway in ESM subnets

ESM networking configuration options

The following sections detail ESM networking configurations for different deployment scenarios.

Option 1: Public broker

In this approach all network communication happens on the Amazon MQ service’s side. The ESM, when enabled, uses public connectivity.

To observe the architecture implemented in your account go to the cloned repository root location, make sure that you are signed in with AWS CLI and run the following:

cdk deploy PublicRabbitMqInstanceStack

Option 2: Private broker in a default VPC

Deploying a private RabbitMQ broker without specifying the VPC informs the Amazon MQ service to pick the default VPC for setting up the networking and then the public subnet(s) in that VPC. The default security group is used for securing the broker’s VPC Endpoints.

Creating the ESM provisions dedicated ENIs in the public subnets where the RabbitMQ broker’s VPC Endpoints reside with the default security group applied. The default security group allows itself for inbound traffic on all protocols and full port range, thus the ESM can route traffic through the VPC Endpoint.

Although the subnet is public with internet gateway access, the ESM ENIs operate in private address space, preventing direct communication with AWS services. To enable proper communication, create VPC Endpoints for AWS STS, Secrets Manager, and Lambda. These endpoints allow the ESM to communicate with AWS services through private IP addresses within your VPC. The following diagram shows the complete communication path from the ESM to the broker.

Figure-4 Networking configuration and request flow for a private broker provisioned in the default VPC.

To observe the architecture implemented in your account, go to the cloned repository root location, make sure that you are signed in with AWS CLI, and run the following

cdk deploy PrivateRabbitMqInstanceDefaultVpcStack

Option 3: Private broker in a Custom VPC with NAT

When deploying a private RabbitMQ broker in a custom VPC, specify either a single subnet for a standalone broker or multiple subnets for a cluster deployment. The deployment also needs a security group for the VPC Endpoint ENIs.

Configure the security group with a self-referencing inbound rule on the AMQP port. This configuration enables communication between the ESM and the RabbitMQ VPC Endpoints’ ENIs.

The following diagram shows how ESM resources communicate through networking components when deployed in a private subnet with NAT gateway. This architecture demonstrates the complete communication path from the ESM to the broker.

Figure-5 Networking configuration and request flow for a private broker provisioned in a private VPC subnet with NAT.

To observe the architecture implemented in your account, go to the cloned repository root location, make sure that you are signed in with AWS CLI, and run the following:

cdk deploy PrivateRabbitMqInstanceCustomVpcWithNatStack

Option 4: Private broker in a Custom VPC with isolated subnets

This configuration builds upon the previous architecture but introduces isolated subnets. These subnets restrict all internet connectivity, permitting only internal VPC network traffic. Although the broker networking components mirror Option 3, the isolation introduces more considerations.

The security group still needs an open AMQP port for queue operations, but the subnet isolation prevents the ESM from directly accessing AWS services. To address this limitation, deploy VPC Endpoints for AWS STS, Secrets Manager, and Lambda within the isolated subnets. These endpoints create a private communication path for the ESM to interact with essential AWS services without needing internet access.

The following diagram shows the communication architecture for ESM resources deployed in isolated subnets. It demonstrates how VPC Endpoints enable secure communication between the ESM and AWS services while maintaining network isolation. This architecture makes sure that the ESM can fulfill its message processing responsibilities without compromising security through internet exposure.

Figure-6 Networking configuration and request flow for a private broker provisioned in an isolated VPC subnet.

To observe the architecture implemented in your account, go to the cloned repository root location, make sure that you are signed in with AWS CLI, and run the following:

cdk deploy PrivateRabbitMqInstanceCustomVpcIsolatedSubnetStack

Option 5: Private broker in a Custom VPC with public subnets

The final configuration places the broker in public subnets while maintaining the core deployment requirements from the previous options. Despite the public subnet placement, the ESM’s networking behavior presents an important consideration: ESM ENIs operate in private address space, preventing direct internet communication even with an internet gateway present.

This architecture necessitates VPC Endpoints for AWS service communication, similar to Option 2. Any attempts to route ESM traffic through the internet gateway fail because the ENIs operate in private address space. Understanding this limitation is crucial for proper deployment planning.

The following diagram shows the ESM communication architecture in public subnets. Despite the different subnet type, this configuration mirrors the isolated subnet approach in its use of VPC Endpoints. These endpoints enable the ESM to communicate with AWS STS, Secrets Manager, and Lambda services through private, secure connections within the VPC.

Figure-7 Networking configuration and request flow for a private broker provisioned in a public VPC subnet.

To observe the architecture implemented in your account, go to the cloned repository root location, make sure that you are signed in with AWS CLI, and run the following:

cdk deploy PrivateRabbitMqInstanceCustomVpcPublicSubnetStack

Cleaning up

To prevent unexpected AWS charges, remove resources you’ve created. The following AWS CDK command helps you safely remove all deployed resources:

cdk destroy --all

Conclusion

This post explored the relationship between AWS Lambda event source mapping and RabbitMQ networking configurations. We examined various deployment scenarios, from public brokers to isolated subnets, each presenting unique considerations for secure and effective implementation.

Understanding these networking patterns enables you to make informed architectural decisions when deploying Amazon MQ for RabbitMQ with Lambda event source mapping. Whether choosing public accessibility or implementing private networking with VPC Endpoints, understanding the consequences of choosing specific networking configurations allows you to apply security best practices while meeting your application’s messaging needs. As you implement these patterns, consider your specific security requirements and operational needs to choose the most appropriate configuration for your use case.

Take the next step in optimizing your serverless messaging architecture. Dive in to the AWS documentation, experiment with the RabbitMQ and Lambda integration patterns discussed, and discover how these networking configurations can elevate the security and performance of your own applications. Start implementing these strategies today to build more robust, scalable solutions.

AWS Weekly Roundup: AWS re:Inforce 2025, AWS WAF, AWS Control Tower, and more (June 16, 2025)

2025-06-16 Esra Kayabali

Post Syndicated from Esra Kayabali original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-aws-reinforce-2025-aws-waf-aws-control-tower-and-more-june-16-2025/

Today marks the start of AWS re:Inforce 2025, where security professionals are gathering for three days of technical learning sessions, workshops, and demonstrations. This security-focused conference brings together AWS security specialists who build and maintain the services that organizations rely on for their cloud security needs.

AWS Chief Information Security Officer (CISO) Amy Herzog will deliver the conference keynote along with guest speakers who will share new security capabilities and implementation insights. The event offers multiple learning paths with sessions designed for various technical roles and expertise levels. Many of my colleagues from across AWS are leading hands-on workshops, demonstrating new security features, and facilitating community discussions. For those unable to join us in Philadelphia, the keynote and innovation talks will be viewable by livestream during the event, and available to watch on demand after the event. Look out for the key announcements and technical insights from the conference in upcoming posts!

Let’s look at last week’s new announcements.

Last week’s launches
Here are the launches that got my attention.

Extend Amazon Q Developer IDE plugins with MCP tools – Amazon Q Developer now supports Model Context Protocol (MCP) in its integrated development environment (IDE) plugins, helping developers integrate external tools for enhanced contextual development workflows. You can now augment the built-in tools with any MCP server that supports the stdio transport layer. These servers can be managed within the Amazon Q Developer user interface. This makes it easy to add, remove, and modify tool permissions. The integration enables more customized responses by orchestrating tasks across both native and MCP server-based tools. MCP support is available in Visual Studio Code and JetBrains IDE plugins, as well as in the Amazon Q Developer command line interface (CLI), with detailed documentation and implementation guides available in the Amazon Q Developer documentation.

AWS WAF now supports automatic application layer DDoS protection – AWS has enhanced its application layer (L7) distributed denial of service (DDoS) protection capabilities with faster automatic detection and mitigation that responds to events within seconds. This AWS Managed Rules group automatically detects and mitigates DDoS attacks of any duration to keep applications running on Amazon CloudFront, Application Load Balancer, and other AWS WAF supported services available to users. The system establishes a baseline within minutes of activation using machine learning (ML) models to detect traffic anomalies, then automatically applies rules to address suspicious requests. Configuration options help you customize responses such as presenting challenges or blocking requests. The feature is available to all AWS WAF and AWS Shield Advanced subscribers in all supported AWS Regions, except Asia Pacific (Thailand), Mexico (Central), and China (Beijing and Ningxia). To learn more about AWS WAF application layer (L7) DDoS protection, visit the AWS WAF documentation or the AWS WAF console.

AWS Control Tower now supports service-linked AWS Config managed AWS Config rules – AWS Control Tower now deploys service-linked AWS Config rules directly in managed accounts, replacing the previous CloudFormation StackSets deployment method. This change improves deployment speed when enabling service-linked AWS Config rules across multiple AWS Control Tower managed accounts and Regions. These service-linked rules are managed entirely by AWS services and can’t be edited or deleted by users. This helps maintain consistency and prevent configuration drift. AWS Control Tower Config rules detect resource noncompliance within accounts and provide alerts through the dashboard. You can deploy these controls using the AWS Control Tower console or AWS Control Tower control APIs.

Powertools for AWS Lambda introduces Bedrock Agents Function utility – The new Amazon Bedrock Agents Function utility in Powertools for AWS Lambda simplifies building serverless applications integrated with Amazon Bedrock Agents. This utility helps developers create AWS Lambda functions that respond to Amazon Bedrock Agents action requests with built-in parameter injection and response formatting, eliminating boilerplate code. The utility seamlessly integrates with other Powertools features like Logger and Metrics, making it easier to build production-ready AI applications. This integration improves the developer experience when building agent-based solutions that use AWS Lambda functions to process actions requested by Amazon Bedrock Agents. The utility is available in Python, TypeScript, and .NET versions of Powertools.

Announcing open sourcing pgactive: active-active replication extension for PostgreSQL – Pgactive is a PostgreSQL extension that enables asynchronous active-active replication for streaming data between database instances, and AWS has made it open source. This extension provides additional resiliency and flexibility for moving data between instances, including writers located in different Regions. It helps maintain availability during operations like switching write traffic. Building on PostgreSQL’s logical replication features, pgactive adds capabilities that simplify managing active-active replication scenarios. The open source approach encourages collaboration on developing PostgreSQL’s active-active capabilities while offering features that streamline using PostgreSQL in multi-active instance environments. For more information and implementation guidance, visit the GitHub repository.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

We launched existing services and instance types in additional Regions:

AWS Glue is now available in Asia Pacific (Taipei) Region – AWS Glue is a serverless data integration service that makes it simple to discover, prepare, and combine data for analytics, ML, and application development.
Amazon EC2 I8g instances are now available in Europe (Ireland) Region – Amazon EC2 I8g instances offer the best performance in Amazon EC2 for storage-intensive workloads.
Amazon VPC IP Address Manager is now available in Asia Pacific (Taipei) Region – Amazon VPC IPAM makes it easier to plan, track, and monitor IP addresses for AWS workloads.

Other AWS events
Check your calendar and sign up for upcoming AWS events.

AWS Summits are free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Register in your nearest city: Milano (June 18), Shanghai (June 19 – 20), Mumbai (June 19) and Japan (June 25 – 26).

Browse all upcoming AWS led in-person and virtual events here.

That’s all for this week. Check back next Monday for another Weekly Roundup!

— Esra

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Validating event payload with Powertools for AWS Lambda (TypeScript)

2025-06-12 Alexander Schüren

Post Syndicated from Alexander Schüren original https://aws.amazon.com/blogs/compute/validating-event-payload-with-powertools-for-aws-lambda-typescript/

In this post, learn how the new Powertools for AWS Lambda (TypeScript) Parser utility can help you validate payloads easily and make your Lambda function more resilient.

Validating input payloads is an important aspect of building secure and reliable applications. This ensures that data that an application receives can gracefully handle unexpected or malicious inputs and prevent harmful downstream processing. When writing AWS Lambda functions, developers need to validate and verify the payload and ensure that specific fields and values are correct and safe to process.

Powertools for AWS Lambda is a developer toolkit available in Python, NodeJS/TypeScript, Java and .NET. It helps implement serverless best practices and increase developer velocity. Powertools for AWS Lambda (TypeScript) is introducing a new Parser utility to help developers more easily implement validation in their Lambda functions.

Why payload validation matters

Validating payloads can make your Lambda functions more resilient. Payloads that combine both technical and business information can also be challenging to validate. This requires writing validation logic inside your Lambda function code.This could range from a few if-statements to check payload values to a complex series of validation steps based on custom business logic. You may need to separate the validation of the technical information of payload like AWS Region, accountId, event source and business information inside the event such as productId and payment details.

It can be challenging to know the structure and the values of the event object and how to extract the relevant information. For example, an Amazon SQS event has a body field with a string value, which can be a JSON document. Amazon EventBridge has an object in the detail field that you can read directly without further transformation. You may need to decompress, decode, transform, and validate the payload inside a specific field. Understanding the many transformation layers can be complex, especially if your event object is a result of multiple service invocations.

Using the Powertools for AWS Lambda (TypeScript) Parser Utility

Powertools for AWS Lambda (TypeScript) is a modular library. You can selectively install features such as Logger, Tracer, Metrics, Batch Processing, Idempotency, and more. You can use Powertools for AWS Lambda in both TypeScript and JavaScript code bases. The new Parser utility simplifies validation and uses the popular validation library, Zod.

You can use parser as method decorator, with middyjs middleware or manually in all Lambda provided NodeJS runtimes

To use the utility, install the Powertools parser utility and Zod (<v3.x) using NPM or any package manager of your choice:

npm install @aws-lambda-powertools/parser zod@~3

You can define your schema using Zod. Here is an example of a simple order schema for validating events:

import { z } from 'zod';
const orderSchema = z.object({
    id: z.number().positive(),
	description: z.string(),
	items: z.array(
	    z.object({
		    id: z.number().positive(),
			quantity: z.number(),
			description: z.string(),
			})
		),
	});
export { orderSchema };

This order schema defines id, description and a list of items. You can specify the value types from simple numeric, narrow it down to positive or literal, or more complex like unition, array or even other schema. Zod offers an extensive list of value types that you can use.

Add the parser decorator to your handler function and set the schema parameter and use this schema to parse the event object.

import type {Context} from 'aws-lambda';
import type {LambdaInterface} from '@aws-lambda-powertools/commons/types';
import {parser} from '@aws-lambda-powertools/parser';
import {z} from 'zod';
import {Logger} from '@aws-lambda-powertools/logger';

const logger = new Logger();

const orderSchema = z.object({
    id: z.number().positive(),  
	description: z.string(),  
	items: z.array(
	    z.object({
		    id: z.number().positive(),
			quantity: z.number(),
			description: z.string(),
		})
	),
});

type Order = z.infer<typeof orderSchema>;

class Lambda implements LambdaInterface {
    @parser({schema: orderSchema})  
	public async handler(event: Order, _context: Context): Promise<void> {
	    // event is now typed as Order    
		for (const item of event.items) {      
		    logger.info('Processing item', {item});
			// process order item from the event
		}
	}
}
	
const myFunction = new Lambda();

export const handler = myFunction.handler.bind(myFunction);

Note that z.infer helps to extract the Order type from the schema, which improves development experience with autocomplete when using TypeScript. Zod parses the entire object, including nested fields, and reports all the errors combined, instead of returning only the first error.

Using built-in schema for AWS services

A more common scenario is to validate events from AWS Services that trigger Lambda functions, including Amazon SQS, Amazon EventBridge and many more. To make this easier Powertools includes pre-built schema for AWS events that you can use.

To parse an incoming Amazon EventBridge event, set the built-in schema in your parser configuration:

import {LambdaInterface} from '@aws-lambda-powertools/commons/types';
import {Context} from 'aws-lambda';
import {parser} from '@aws-lambda-powertools/parser';
import {EventBridgeSchema} from '@aws-lambda-powertools/parser/schemas';
import type {EventBridgeEvent} from '@aws-lambda-powertools/parser/types';

class Lambda implements LambdaInterface {  
    @parser({schema: EventBridgeSchema})  
	public async handler(event: EventBridgeEvent, _context: Context): Promise<void> {    
	    // event is parsed but the detail field is not specified  
	}
}

const myFunction = new Lambda();

export const handler = myFunction.handler.bind(myFunction);

The event object is parsed and validated during runtime and the TypeScript type EventBridgeEvent helps you understand the structure and access the fields during development. In this example, you only parse the EventBridge event object, so the detail field can be an arbitrary object.

You can also extend the built-in EventBridge schema and override the detail field with your custom oderSchema.

import {LambdaInterface} from '@aws-lambda-powertools/commons/types';
import {Context} from 'aws-lambda';
import {parser} from '@aws-lambda-powertools/parser';
import {EventBridgeSchema} from '@aws-lambda-powertools/parser/schemas';
import {z} from 'zod';

const orderSchema = z.object({  
    id: z.number().positive(),  
	description: z.string(),  
	items: z.array(
	    z.object({
		    id: z.number().positive(),      
			quantity: z.number(),      
			description: z.string(),
		}), 
	),
});

const eventBridgeOrderSchema = EventBridgeSchema.extend({  detail: orderSchema,});

type EventBridgeOrder = z.infer<typeof eventBridgeOrderSchema>;

class Lambda implements LambdaInterface {  
    @parser({schema: eventBridgeOrderSchema})  public async handler(event: EventBridgeOrder, _context: Context): Promise<void> {
	    // event.detail is now parsed as orderSchema  
	}
}

const myFunction = new Lambda();

export const handler = myFunction.handler.bind(myFunction);

The parser validates the full structure of the entire EventBridge event including the custom business object. Use .extend or other Zod schema functions to change any field of the built-in schema and customize the payload validation.

Using envelopes with custom schema

In some cases, you only need the custom portion of the payload, for example, the detail field of the EventBridge event or the body of SQS records. This requires you to parse the event schema manually, extract the required field, and then parse it again with the custom schema. This is complex as you must know the exact payload field and how to transform and parse it.

Powertools Parser utility helps solve this problem with Envelopes. Envelopes are schema objects with built-in logic to extract custom payloads.

Here is an example of the EventBridgeEnvelope of how it works:

import {LambdaInterface} from '@aws-lambda-powertools/commons/types';
import {Context} from 'aws-lambda';
import {parser} from '@aws-lambda-powertools/parser';
import {EventBridgeEnvelope} from '@aws-lambda-powertools/parser/envelopes';
import {z} from 'zod';

const orderSchema = z.object({  
    id: z.number().positive(), 
	description: z.string(),  
	items: z.array(    
	    z.object({      
		    id: z.number().positive(),      
			quantity: z.number(),      
			description: z.string(),
		}), 
	),
});

type Order = z.infer<typeof orderSchema>;

class Lambda implements LambdaInterface {  
    @parser({schema: orderSchema, envelope: EventBridgeEnvelope})  public async handler(event: Order, _context: Context): Promise<void> {
        // event is now typed as Order inferred from the orderSchema  
    }
}

const myFunction = new Lambda();

export const handler = myFunction.handler.bind(myFunction);

By setting schema and envelope, the parser utility knows how to combine both parameters, extract, and validate the custom payload from the event. Powertools Parser transforms the event object according to the schema definition so you can focus on your business-critical part of the code inside the handler function.

Safe parsing

If the object does not match the provided Zod schema, by default, the parser throws ParserError. If you require control over validation errors and need to implement custom error handling, use the safeParse option.

Here is an example how to capture failed validations as a metric in your handler function:

import {Logger} from "@aws-lambda-powertools/logger";
import {LambdaInterface} from "@aws-lambda-powertools/commons/types";
import {parser} from "@aws-lambda-powertools/parser";
import {orderSchema} from "../app/schema";import {z} from "zod";
import {EventBridgeEnvelope} from "@aws-lambda-powertools/parser/envelopes";
import {Metrics, MetricUnit} from "@aws-lambda-powertools/metrics";
import {ParsedResult, EventBridgeEvent} from "@aws-lambda-powertools/parser/types";

const logger = new Logger();

const metrics = new Metrics();

type Order = z.infer<typeof orderSchema>;

class Lambda implements LambdaInterface {  

    @metrics.logMetrics() 
	@parser({schema: orderSchema, envelope: EventBridgeEnvelope, safeParse: true})  
	public async handler(event: ParsedResult<EventBridgeEvent, Order>, _context: unknown): Promise<void> {
	    if (!event.success) {      
		    // failed validation      
			metrics.addMetric('InvalidPayload', MetricUnit.Count, 1);      
			logger.error('Invalid payload', event.originalEvent);    
		} else {      
		    // successful validation      
			for (const item of event.data.items) {        
			    logger.info('Processing item', item);        
				// event.data is typed as Order      
			}   
		}  
	}
}

const myFunction = new Lambda();

export const handler = myFunction.handler.bind(myFunction);

Setting safeParse option to true does not throw an error, but returns a modified event object that has a success flag and either error or data fields, depending on the validation result. You can then create custom error handling, for example incrementing InvalidPayload metric and access the originalEvent to log an error.

For successful validations, you can access the data field and process the payload. Note that the event object type is now ParsedResult with the EventBridgeEvent as input and Order as output types.

Custom validations

Sometimes you may require more complex business rules for your validation. Because Parser built-in schemas are Zod objects, you can customize the validation by applying .extend, .refine , .transform and other Zod operators. Here is an example of complex rules for the orderSchema:

import {z} from 'zod';

const orderSchema = z.object({
    id: z.number().positive(),  
	description: z.string(),  
	items: z.array(z.object({
	    id: z.number().positive(),    
		quantity: z.number(),    
		description: z.string(),  
	})).refine((items) => items.length > 0, {
	    message: 'Order must have at least one item',  
	}),
})  
    .refine((order) => order.id > 100 && order.items.length > 100, {
	    message:      'All orders with more than 100 items must have an id greater than 100', 
	});

Use .refine on items field to check if there is at least one item in the order. You can also combine multiple fields, here order.id and order.items.length, to have a specific rule for orders with more than 100 items. Keep in mind that .refine runs during the validation step and .transform will be applied after the validation. This allows you to change the shape of the data to normalize the output.

Conclusion

Powertools for AWS Lambda (TypeScript) is introducing a new Parser utility that makes it easier to add validation to your Lambda functions. By relying on the popular validation library Zod, Powertools offers an extensive set of built-in schemas for popular AWS service integrations including Amazon SQS, Amazon DynamoDB, Amazon EventBridge. Developers can use these schemas to validate their event payloads and also customize them according to their business needs.

Visit the documentation to learn more and join our Powertools community Discord to connect with like-minded serverless enthusiasts.

Optimizing ODCR usage through AI-powered capacity insights

2025-06-05 Ankush Goyal

Post Syndicated from Ankush Goyal original https://aws.amazon.com/blogs/compute/optimizing-odcr-usage-through-ai-powered-capacity-insights/

Efficient resource management is crucial for organizations seeking to optimize cloud costs while making sure of seamless access to compute capacity. Amazon Elastic Compute Cloud (Amazon EC2) On-Demand Capacity Reservations (ODCRs) provide the flexibility to reserve compute capacity within a specific Availability Zone (AZ) for any duration. This makes sure that critical workloads always have the necessary resources available, minimizing the risk of capacity shortages.

However, managing ODCRs across multiple teams and accounts presents several challenges:

Limited visibility across teams: In large organizations, tracking ODCR usage across teams and business units can be difficult, often leading to underused or duplicate reservations.
Complexity in optimization: Without clear insights, adjusting, releasing, or modifying reservations to align with changing workloads becomes a cumbersome process.
Cost management challenges: Unused or oversized ODCRs contribute to unnecessary expenses, making it essential to continuously monitor and optimize usage.

In this post, we demonstrate how Amazon Bedrock Agents can help organizations gain actionable insights into ODCR usage across their Amazon Web Services (AWS) environment. Unlike traditional approaches that rely on predefined patterns or manual tracking, this solution dynamically retrieves the latest ODCR usage data and provides intelligent, query-based recommendations to fulfill the capacity needs. This solution is serverless and pay-per-use and incurs minimal operational cost.

This approach allows IT leaders, cloud architects, and finance teams to optimize reservations, control costs, and enhance overall resource management—without the complexity of traditional analysis methods.

Solution overview

This solution addresses two specific use cases:

The team must create a new ODCR to accommodate an upcoming project.
The team needs to expand the capacity of their existing ODCR.

The system consists of three essential components:

ODCRSupervisorAgent: Functions as the main coordinator, handling user inquiries and managing requests to specialized subordinate agents.
CapacityPlanningAgent: Reviews existing ODCRs throughout the organization, recommending SPLIT and MOVE operations to optimize capacity allocation for new projects.
AugmentationAgent: Identifies opportunities to increase existing ODCR capacity by recommending MOVE operations from other organizational ODCRs.

The following figure shows the high-level architecture for this solution.

The system architecture uses AWS services for comprehensive ODCR management, as shown in the following figure. The platform combines Amazon Cognito for authentication, AWS Amplify for front-end delivery, and AWS Lambda for serverless computing. AWS Resource Explorer enables efficient discovery and tracking of ODCRs across AWS Regions, while Lambda functions periodically query Resource Explorer to retrieve detailed ODCR information and usage metrics. AWS Identity and Access Management (IAM) plays a crucial role in securing the system, managing fine-grained access controls and permissions across all components. Amazon Bedrock Agents and its multi-agent capabilities generate intelligent recommendations through coordinated agent interactions, allowing organizations to optimize ODCR usage and implement data-driven capacity planning strategies.

How the solution functions

At the heart of this system is an Amazon Bedrock Agent powered by Action Groups, which map user queries to backend tasks using Lambda. These Lambda functions act as the system’s intelligence layer: fetching, filtering, and formatting ODCR data for the agent to reason over. The following three Lambda functions are deployed as part of the AWS CloudFormation template:

get_az_mapping_info Lamba function: This Lambda function takes an AWS account ID and Region as input. It returns a mapping of AZ names to their corresponding AZ IDs for that account and AWS Region. This helps the Capacity Planning Agent align AZs across accounts.

find_eligiable_odcrs Lambda function: This Lambda function supports the Capacity Planning Agent by identifying suitable ODCRs that meet the specified instance type and AZ. It uses Resource Explorer to search for ODCRs across the organization. Then, the function assumes cross-account roles to access ODCRs in different AWS accounts to gather detailed information. After retrieving the necessary data, it filters the ODCRs based on instance type, tenancy, and AZ. Finally, it returns a formatted list of eligible ODCRs, which are sourced from either the specified account or an alternative account with adequate capacity.

find_eligible_odcrs_for_move Lambda function: This Lambda function assists the Augmentation Agent by finding all capacity reservations within the same account as the specified ODCR. It uses Resource Explorer to discover ODCRs and, when necessary, assumes cross-account roles to gather ODCR information. The function filters ODCRs based on matching instance type, tenancy, and AZ, then returns a formatted list of eligible ODCRs that can provide more capacity through MOVE operations.

Each Lambda function receives structured inputs from the Amazon Bedrock Agent, executes its designated task, and returns structured responses. Then, these responses are used by the agent to generate intelligent, actionable answers to user queries.

Together, these Lambda functions extend the capabilities of the Amazon Bedrock Agent by integrating with external data sources and automating complex ODCR management logic—allowing the system to deliver accurate, dynamic recommendations.

Prerequisites

You must have the following in place to implement the solution in this post:

An AWS account or multiple accounts with AWS Organizations configured.
- Enable Trusted Access for Resource Explorer in Organizations.
Foundational Model (FM) access in Amazon Bedrock for Anthropic’s Claude 3.5 Haiku v1 in the same Region where you plan to deploy this solution.
Turn on Resource Explorer.
- Setting up and configuring Resource Explorer
- We recommend assigning a delegated administrator account—the account where you plan to deploy the ODCR solution—and creating the view in that account. Note this account number because it is needed later while deploying cross account role.
- Create a Resource Explorer view that filters for resourcetype:ec2:capacity-reservation to enable searching for ODCRs in the delegated account.
- After setting up Resource Explorer and creating a view, note the Amazon Resource Name (ARN). You need this ARN when deploying the ODCR CloudFormation template.
Active ODCRs configured and maintained across multiple accounts within your Organization.
The accompanying CloudFormation template downloaded from the aws-samples GitHub repo.
- Cross Account IAM Role
- ODCR Solution

Deploy solution resources using CloudFormation

This solution is designed to be deployed in a single Region. If deploying outside us-east-1, then you must modify the CloudFormation parameters to reference the correct FM version and adjust for Regional availability, such as any necessary cross-Region inference profile configurations.

Deploy the Cross Account IAM Role template: To access ODCR data across your Organization, deploy a CloudFormation StackSet from your payer account. During deployment, you must provide the account number where you plan to deploy the ODCR solution. This is the same account number that you used as delegate account for Resource Explorer under the prerequisites. This account number is added to the trust policy of the IAM role, allowing the solution in that account to assume the role created by the StackSet. The deployment automatically provisions a standardized IAM role with the necessary permissions (for example ec2:DescribeCapacityReservations and ec2:DescribeAvailabilityZones) in each target account.
Deploy the ODCR Solution template: Deploy the provided template in the delegated administrator account that you chose for Resource Explorer. This sets up the core components of the solution and integrates with your payer account for organization-wide ODCR insights.

Resources deployed by the CloudFormation template

After you have deployed the CloudFormation template, the following AWS resources are created in your account.

Cross Account IAM Role template

IAM resource
- CrossAccountODCRAccessRole

ODCR Solution template

Amazon Cognito resources:
- User Pool: ODCRAgentUserPool
- App Client: ODCRAgentApp
- Identity Pool: odcr-identity-pool
- Groups: ODCRAdmins
- User: ODCR User
IAM resources:
- IAM roles:
  - GetAZMappingFunctionRole
  - LambdaODCRAccessRole
  - BedrockAgentExecutionRole
  - ODCRSupervisorAgentExecutionRole
  - CognitoAuthRole
Lambda functions:
- get_az_mapping_info
- find_eligible_odcrs
- find_eligible_odcrs_for_move
Amazon Bedrock Agents
- ODCRSupervisorAgent
- CapacityPlanningAgent with action groups:
  - get_az_mapping_info
  - find_eligible_odcrs
- AugmentationAgent with action group:
  - find_eligible_odcrs_for_move

After you deploy the final CloudFormation template, copy the following from the Outputs tab on the CloudFormation console to use during the configuration of your application after it’s deployed in Amplify:

AWSRegion
BedrockAgentAliasId
BedrockAgentId
BedrockAgentName
IdentityPoolId
UserPoolClientId
UserPoolId

The following screenshot shows you what the Outputs tab looks like.

Deploy the Amplify application for front-end

You need to manually deploy the Amplify application using the front-end code found on GitHub. Complete the following steps, as shown in the following figure:

Download the front-end code AWS-Amplify-Frontend.zip from GitHub.
Use the .zip file to manually deploy the application in Amplify.
Return to the Amplify page and use the domain that it automatically generated to access the application.

After completing these steps, your environment is ready to analyze ODCR usage and receive recommendations powered by Amazon Bedrock Agents.

Solution walkthrough

After deploying the application in Amplify, return to the Amplify console. Use the auto-generated domain URL shown there to access the front-end. Upon accessing the application URL, you’re prompted to provide information related to Amazon Cognito and Amazon Bedrock Agents. This information is necessary to securely authenticate users and allow the front end to interact with the Amazon Bedrock Agent. It enables the application to manage user sessions and make authorized API calls to AWS services on behalf of the user.

You can enter information with the values that you collected from the CloudFormation stack outputs, as shown in the following screenshot:

A temporary password was automatically generated during deployment and sent to the email address provided when launching the CloudFormation template. At first sign-in, you’re prompted to reset your password.When you’re signed in, you can begin interacting with the application to analyze and manage your ODCR usage. The interface supports natural language queries to streamline capacity planning. The following are some example questions that you can ask:

I require [X units] of capacity for the [INSTANCE TYPE] instance type in Availability Zone [AZ ID], within account [ACCOUNT NUMBER]. Could you please advise if there are any existing On-Demand Capacity Reservations (ODCRs) that can fulfill this requirement? Additionally, what operations would be necessary to utilize the available capacity?

I have a requirement to add [X additional units] of capacity to an existing On-Demand Capacity Reservation with ID [cr-0012abcd123456789]. Could you please advise on the steps or operations required to fulfill this request?

Cleaning up

If you decide to discontinue using this solution, then you can follow these steps to remove it, its associated resources deployed using CloudFormation, and the Amplify deployment:

Delete the CloudFormation StackSet:
1. On the CloudFormation console, choose StackSets in the navigation pane.
2. Locate your StackSet for the Cross-Account IAM Role (you assigned a name to it).
3. Choose Delete stacks from StackSet to remove all stack instances.
4. After all instances are deleted, choose StackSet.
5. Choose Delete StackSet from the Actions menu.
Delete the CloudFormation stack:
1. On the CloudFormation console, choose Stacks in the navigation pane.
2. Locate the stack you created during the deployment process (you assigned a name to it).
3. Choose the stack and choose Delete.
Delete the Amplify application and its resources. For instructions, refer to Clean Up Resources.

Conclusion

In this post, we demonstrated how to use Amazon Bedrock Agents and AWS services to build an intelligent ODCR management solution. Combining AWS Resource Explorer for organization-wide visibility, Amazon Bedrock Agents for intelligent recommendations, and a secure front-end powered by AWS Amplify, organizations can now make data-driven decisions about their capacity reservations. This solution helps teams optimize ODCR usage, reduce costs, and efficiently manage compute capacity across their AWS Organization. The artificial intelligence (AI)-powered approach eliminates manual tracking and analysis, allowing teams to focus on strategic capacity planning rather than operational overhead.

Get started today by deploying this solution in your AWS environment, and unlock the power of AI-driven capacity insights for more efficient ODCR management.

How Launchpad from Pega enables secure SaaS extensibility with AWS Lambda

2025-05-30 Anton Aleksandrov

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/architecture/how-launchpad-from-pega-enables-secure-saas-extensibility-with-aws-lambda/

Large organizations increasingly adopt software as a service (SaaS) solutions to focus on business priorities, reduce infrastructure management overhead, and optimize costs. These organizations expect SaaS vendors to provide customizability facilities for tailoring the solution behavior according to their needs. Although traditional approaches like feature flags and webhooks offer some flexibility, they often fall short of providing a high degree of customizability. A new emerging pattern in this space is tenant-supplied custom code execution, which allows tenants to inject their own code into specific workflow points, enabling deep customization while preserving the core SaaS solutions’ integrity and security.

In this post, we share how Pegasystems (Pega) built Launchpad, its new SaaS development platform, to solve a core challenge in multi-tenant environments: enabling secure customer customization. By running tenant code in isolated environments with AWS Lambda, Launchpad offers its customers a secure, scalable foundation, eliminating the need for bespoke code customizations.

Solution overview

Launchpad, which is built on AWS, is an end-to-end platform on which software providers can build, launch, and operate workflow-centric B2B SaaS applications and AI solutions. It provides a managed, secure, scalable cloud environment for hosting multi-tenant applications and data. It accelerates the build experience with generative AI-powered low code tools, prebuilt capabilities, and subscriber-level configuration. Being a multi-tenant platform at its core, Launchpad had to maintain stringent isolation across tenants in its architecture.

One of the requirements Launchpad had was to allow their tenants to augment the workflows natively by providing custom code. Some common scenarios included communicating with external systems with proprietary non-industry-standard protocols, reuse of existing business logic, and SDK-based custom code development.

The solution necessitated the ability for tenants to provide custom code that would implement the required business logic, which Launchpad would be executing. This required architecting a secure runtime environment for custom code execution that maintains the highest degree of cross-tenant isolation within the multi-tenant architecture, at the same time allowing sufficient access to platform APIs and services. It was essential to build an architecture that would decouple the environment running tenant code from the core SaaS platform, as illustrated in the following diagram.

Architecting the solution topology

To achieve the required high level of compute isolation for running code provided by different tenants, Launchpad has adopted Lambda functions in its architecture as the secure ephemeral compute environment. Each untrusted code snippet provided by tenants is bootstrapped as a stand-alone Lambda function, with strong Firecracker-based isolation across different functions and execution environments addressing Launchpad’s requirements. This isolation provides dedicated resources, customizable access permissions, independent monitoring and operations, and automatic scaling for each function, while maintaining complete separation from other functions and their execution environments, as illustrated in the following diagram.

With Lambda being a serverless compute service, adopting it for the Launchpad architecture yielded several significant benefits. The major business benefit was that tenants could implement thousands of custom workflow augmentations on their own simply by providing code snippets, instead of the Launchpad engineering team being responsible for implementing them in the core platform code. Other benefits included:

Managed runtimes – AWS handles patching and updating the underlying infrastructure, operating system, and runtimes for customers, reducing the potential attack surface.
Fine-grained permissions – Each function can have its own set of access policies to tightly control what resources and actions it can access.
No need to pre-provision and pay for overprovisioned capacity – Lambda functions scale up and down automatically based on traffic patterns.
Built-in monitoring – Lambda functions emit detailed metrics, logs, and traces through Amazon CloudWatch and AWS X-Ray out of the box, making it straightforward to monitor tenant code execution.

To further reduce risks, Launchpad runs these Lambda functions with untrusted code in a dedicated AWS account. This account is separated from the core SaaS platform account. When end-users create a new function in the Launchpad authoring portal, they upload their code and specify the code handler to be executed during the invocation. Users can also map function input and output to Launchpad fields for further processing to enable an even higher degree of customizability and integration. The multi-tenant authoring service is a Control Plane component that runs as a microservice on the Amazon Elastic Kubernetes Service (Amazon EKS) cluster and uses the Lambda API for function lifecycle management, as illustrated in the following diagram. After a function resource is created, it can be used for further invocations.

Runtime architecture

At runtime, when Launchpad needs to invoke a function, it calls the Lambda Invoke API. Before the function is invoked, the multi-tenant runtime service performs a tenancy check to make sure the request is coming from an authorized tenant by doing the token validation. After a successful validation, the service invokes the required Lambda function. To invoke functions hosted in a different AWS account, the multi-tenant runtime service uses an AWS Identity and Access Management (IAM) role to assume the required permissions and invokes the Lambda service using the AWS SDK. The sequence of interactions is shown in the following architecture diagram.

The workflow consists of the following steps:

An incoming user request reaches the application gateway service.
The application gateway authenticates the request using the tenancy security service.
After it’s authenticated, the request is forwarded to the multi-tenant runtime service
The multi-tenant runtime service validates the supplied token and performs a tenancy check. This makes sure tenants can only invoke own functions they have permissions for (for example, functions they own).
The multi-tenant runtime service pod assumes the IAM role required for invoking the tenant-specific Lambda function in a different AWS account.
The multi-tenant runtime service pod invokes the required Lambda function.

Invoking the platform API from custom code is as straightforward as connecting to any external API. The custom code can authenticate with the platform using OAuth2. To facilitate the authentication, the developer can pass along the credentials as input parameters to the function from the core platform. Then the developer can create a corresponding record (isolated by tenant) in the platform that stores the credentials per function, and pass credentials as input parameters during invocation.

Distributed architecture observability

Operating a distributed architecture that runs untrusted code across multiple AWS accounts requires a comprehensive observability strategy. Launchpad’s approach combines centralized logging and monitoring with cross-account aggregation to provide a unified operational view of the platform.

The monitoring architecture uses CloudWatch Metrics to observe the Lambda functions, aggregating them through a centralized observability layer. This setup empowers platform operators to correlate Lambda function metrics with the core platform services running on Amazon EKS. Launchpad also collects per-function telemetry like function invocations, error rates, and execution time, which allows them to observe per-tenant metrics. These telemetry dimensions enable both a platform-wide and tenant-specific monitoring perspective.

For logging and troubleshooting, Launchpad implements a unified logging pipeline that aggregates Lambda function logs with application gateway and runtime service logs. Each request flowing through the system carries a correlation ID, so operators can trace execution paths across the core SaaS services and into the tenant functions running in the AWS account running tenant Lambda functions.

With this multi-layer observability architecture, Launchpad can maintain operational excellence while running tenant code securely at scale. Regular operational reviews drive continuous improvements in monitoring coverage and incident response procedures. Having per-tenant Lambda functions make it possible for Launchpad to use tenant-specific cost allocation tags, further empowering them to understand the cost footprint of running tenant custom code.

Best practices

When building a SaaS solution, maintaining a unified core code base is essential for scalability and manageability. Implementing per-tenant variations within the core platform code can lead to maintenance complexity and technical debt. Instead, architect your SaaS solution to have extension points, which allow your tenants to inject their custom code at specific points in the workflow, enabling customization without compromising the platform’s maintainability. This pattern makes sure the core SaaS platform remains clean and standardized while offering the flexibility that customers demand.

Additional best practices include:

Use separate accounts for running Lambda functions with untrusted tenant-provided code to make sure it’s isolated from your core SaaS platform code.
Grant absolute minimum required access permissions to the execution role assigned to the function. The custom code running within the execution environment gets permissions defined in the execution role when making requests to AWS API endpoints. If the function doesn’t need to reach out to AWS API endpoints, remove all permissions from the execution role and add an explicit AWSDenyAll policy.
Use separate Lambda functions for each code snippet and each tenant. This will provide the highest degree of cross-tenant isolation. Resources are not reused across different functions and execution environments.
Use Lambda layers in case you need to add a layer of vendor-provided code in order to keep it separated from the untrusted tenant-provided code.
Implement additional security controls, such as using Amazon Virtual Private Cloud (Amazon VPC) constructs to restrict network access and VPC Flow Logs for network activity monitoring.

Conclusion

The implementation of a secure untrusted code execution environment within SaaS platforms addresses a critical need for tenant customization while maintaining architectural integrity. Lambda offers a built-in isolation model, fine-grained security controls, and serverless scalability, so SaaS providers such as Launchpad can address the requirements of executing tenant-provided code in a multi-tenant environment and offer robust customization capabilities while maintaining strict security boundaries and operational efficiency. This architectural pattern enables providers to focus on core platform development while confidently supporting tenant-specific workflows through the secure and scalable Lambda execution environment.

To learn more, refer to the Security Overview of AWS Lambda white paper. For additional serverless architectural patterns, see Serverlessland.com.

About the authors

PackScan: Building real-time sort center analytics with AWS Services

2025-05-30 Sairam Vangapally

Post Syndicated from Sairam Vangapally original https://aws.amazon.com/blogs/big-data/packscan-building-real-time-sort-center-analytics-with-aws-services/

Amazon manages a complex logistics network with multiple touch points, from fulfillment centers to sort centers to final customer delivery. Among these, sort centers play a crucial role in the middle mile, providing faster and more efficient package movement. Within Amazon’s Middle Mile operations, high-volume sort centers process millions of packages daily, making immediate access to operational data essential for optimizing efficiency and decision-making. Real-time visibility into key metrics—such as package movements, container statuses, and associate productivity—is critical for smooth logistics operations. To address the need for real-time operational planning, the Amazon Middle Mile team developed PackScan, a cloud-based platform designed to provide instant insights across the network. By significantly reducing data latency, PackScan enables proactive decision-making, so teams can monitor inbound package flows, optimize outbound shipments based on live data, track associate productivity, identify bottlenecks, and enhance overall operational efficiency—all in real time.

In this post, we explore how PackScan uses Amazon cloud-based services to drive real-time visibility, improve logistics efficiency, and support the seamless movement of packages across Amazon’s Middle Mile network.

Prerequisites

This post assumes a foundational understanding of the following services and concepts:

Core services such as Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), AWS Lambda, Amazon Data Firehose, and Amazon OpenSearch Service
Infrastructure components like Amazon Elastic Compute Cloud (Amazon EC2), Amazon Load Balancers, and security configurations including network rules and security groups
Visualization tools such as Grafana hosted on Amazon EC2

Although hands-on experience is not required, a conceptual understanding of these services will help in understanding the architecture, design patterns, and components discussed throughout the article.

Business challenges

Amazon’s sort centers handle over 15 million packages daily across more than 120 facilities in North America. Given this scale, even minor delays in operational insights can lead to inefficiencies, increased costs, and escalations. Traditionally, data latencies of up to an hour have restricted the ability to make proactive decisions, directly affecting productivity, resource allocation, and responsiveness—especially during peak periods like holiday seasons and big deal days.

Without immediate visibility into package movements, container statuses, and associate performance, operational teams face challenges in identifying and resolving bottlenecks in real time. The lack of timely insights can disrupt the flow of packages, leading to shipment delays, reduced throughput, and suboptimal facility performance. Addressing these inefficiencies required a solution capable of delivering real-time, high-fidelity data to support rapid decision-making.

To bridge this gap, Amazon’s Middle Mile organization needed a scalable platform that could enhance visibility, minimize latency, and provide up-to-the-minute insights into logistics operations. PackScan was designed to meet these demands, giving teams access to the real-time data necessary to optimize workflows, mitigate bottlenecks, and improve overall efficiency.

Data flow

In 2024, PackScan was deployed across 80 sort centers in the USA, enabling real-time package analytics. The solution powers Grafana dashboards, which refresh every 10 seconds by fetching live package data from OpenSearch Service. With this near real-time visibility, operations teams can monitor package movement and sorting efficiency across sort centers. The following diagram outlines how package scan data is ingested, processed, and made actionable.

Each sort center is equipped with hardware at inbound stations where packages arrive from trailers. Integrated barcode scanners automatically scan each package as it enters the sorting process. Every scan generates an SNS event, capturing key attributes such as the package ID, dimensions, the associate who performed the scan, and the timestamp and location of the scan.

After they’re generated, these SNS events are ingested into Data Firehose through a Lambda function, where the data undergoes real-time enrichment. During this process, additional attributes are appended, including the business logic rules. The enriched data is then streamed into OpenSearch Service, where events are indexed to enable fast and efficient querying. With the indexed package scan events available in OpenSearch Service, real-time analytics and monitoring become possible. The Grafana dashboards query this data every 10 seconds, providing operational insights into package inflow metrics and associate performance.

Solution overview

PackScan was implemented using a structured and scalable approach, using AWS cloud-based services to enable high-frequency data ingestion, real-time processing, and actionable insights. The architecture is designed to minimize latency while providing reliability, scalability, and operational efficiency. The solution is built around a serverless, event-driven architecture that dynamically scales based on data ingestion volumes. The architecture—illustrated in the following figure—enabled us to build a real-time data solution, utilizing the advantages of various AWS services to provide low-latency analytics, high scalability, and real-time operational insights across Amazon’s sort centers.

The following are the key components and features of the solution:

Real-time data processing – Lambda functions serve as the processing backbone of the system, handling 500,000 scan events per second. Each incoming event is processed by applying data transformations, enrichment, and validation before passing it downstream.
High-frequency data ingestion and streaming – Data Firehose is the primary ingestion pipeline, handling millions of scan events daily from thousands of barcode scanners across multiple sort centers. The Firehose streams handle incoming data of 12,000 PUT requests per second, maintaining smooth ingestion and low-latency streaming. Data retention policies are set to buffer and forward enriched events every 60 seconds or upon reaching 5 MB batch size, optimizing storage and processing efficiency.
Optimized querying and operational insights – OpenSearch Service is used to index and store the processed scan events, providing real-time querying and anomaly detection. The OpenSearch cluster consists of 12 data nodes (r5.4xlarge.search) and 3 primary nodes (r5.large.search), processing up to 10 GB of data per day with a rolling index strategy, where indexes are rotated every 24 hours to maintain query performance. The system supports concurrent queries per second, enabling logistics teams to perform rapid lookups and gain instant visibility into package movements.
Live visualization and dashboarding – Grafana, hosted on an m5.12xlarge EC2 instance, provides real-time visualization of key logistics metrics. The dashboards refresh every 10 seconds, querying OpenSearch and displaying up-to-the-minute package analytics. The setup includes multiple preconfigured dashboards, monitoring package flow at different inbound stations, and workforce efficiency. These dashboards support concurrent users, enabling supervisors and associates to track and optimize operations proactively. The following screenshot shows one of the real-time dashboards, with details of package flow by different routes within sort centers.

The entire PackScan architecture is designed for automatic scaling, adjusting dynamically based on data ingestion volume to maintain efficiency during peak and off-peak operations. This approach provides cost-effective resource utilization while maintaining high availability and performance.

Business outcomes

The implementation of PackScan has led to measurable improvements in operational efficiency, workforce productivity, and real-time decision-making across Amazon’s sort centers. By reducing data latency and enabling real-time insights, PackScan has transformed logistics operations in meaningful ways:

Widespread deployment – PackScan was deployed across 80 sort centers, supporting approximately 1,000 display monitors that provide real-time operational insights.
Significant reduction in data latency – Data latency dropped from approximately 1 hour to less than 1 minute, allowing for real-time operational responsiveness and minimizing workflow disruptions.
Proactive operational management – With dynamic workload balancing and instant bottleneck identification, supervisors can now address issues as they arise, leading to smoother operations and fewer escalations.
Boost in workforce productivity – The real-time performance feedback has enhanced associate engagement, resulting in a 25% increase in throughput per hour and 12% reduction in labor hours.

Overall, PackScan has redefined real-time logistics visibility within Amazon’s Middle Mile operations, empowering operational teams with actionable insights, enhanced workforce efficiency, and a data-driven approach to package movement and sort center performance.

Lessons learned and best practices

The deployment and scaling of PackScan provided valuable insights into optimizing real-time logistics visibility. Several key lessons and best practices emerged from this implementation:

Cloud architecture drives efficiency – Adopting Amazon technologies provides seamless scalability, reduced operational overhead, and lower infrastructure costs, while maintaining high reliability. The following table shows an approximate breakdown of monthly service costs observed in production. This is an estimation based on current pricing; we recommend checking the respective AWS service pricing pages to generate the most up-to-date quote. This architecture demonstrates that with combination of provisioned and serverless design, production-ready solutions can be built and scaled at a fraction of the cost of traditional infrastructure.

AWS Service	Description	Estimated Monthly Cost
Amazon EC2	Three EC2 instances of type m5.12xlarge hosting Grafana	$1,700
AWS Lambda	Streams SNS events to Data Firehose	$4,000
Amazon Data Firehose	Real-time data delivery with 12,000 records streaming to OpenSearch Service	$1,500
Amazon OpenSearch Service	Indexing and querying package scan events	$28,000

Real-time visibility is a game changer – Immediate access to operational data enhances agility, enabling teams to make timely, data-driven decisions that prevent bottlenecks and improve throughput.
Continuous monitoring enhances decision-making – Operational dashboards should evolve with business needs. Regular monitoring and updates provide accuracy, usability, and relevance in driving informed decision-making.

By applying these best practices, PackScan has set a foundation for scalable, real-time logistics management, making sure that Amazon’s Middle Mile operations remain proactive, efficient, and highly responsive to changing business demands.

Conclusion

PackScan has successfully transformed real-time operational visibility within Amazon’s sort centers, addressing critical challenges in data latency, workforce productivity, and logistics efficiency. By using AWS services, particularly Data Firehose for real-time data delivery and OpenSearch Service for analytics, PackScan has enabled proactive decision-making, streamlined operations, and enhanced throughput in high-volume sort environments. Looking ahead, future enhancements will focus on further elevating operational intelligence and scalability, including:

Integrating predictive analytics to anticipate workflow bottlenecks and optimize resource allocation
Scaling the solution across additional operational scenarios, providing greater resilience and adaptability to dynamic logistics environments

With these advancements, PackScan will continue to drive operational excellence, cost-efficiency, and real-time decision-making capabilities, reinforcing Amazon’s commitment to innovation in logistics and supply chain management.

For those interested in implementing similar solutions, we recommend exploring AWS Serverless Architecture Patterns and the AWS Architecture Blog for additional insights and best practices in building scalable, real-time analytics solutions.

About the authors

Sairam Vangapally is a Data Engineer at Amazon with extensive experience architecting real-time, large-scale data platforms that power critical logistics operations across North America. He has led the design and deployment of end-to-end data pipelines, enabling high-throughput ingestion, transformation, and analytics at scale. He is passionate about building resilient data infrastructure and driving cross-functional collaboration to deliver solutions that accelerate operational insights and business impact.

Nitin Goyal serves as a Data Engineering Manager in Amazon’s Sort Center organization, where he leads initiatives to optimize operational efficiency across North American facilities. With over nine years of tenure at Amazon spanning multiple teams, he specializes in architecting high-performance data systems, with particular emphasis on real-time streaming pipelines, artificial intelligence, and low-latency solutions. His expertise drives the development of sophisticated operational workflows that enhance sort center productivity and effectiveness.

Enhance AI-assisted development with Amazon ECS, Amazon EKS and AWS Serverless MCP server

2025-05-29 Elizabeth Fuentes

Post Syndicated from Elizabeth Fuentes original https://aws.amazon.com/blogs/aws/enhance-ai-assisted-development-with-amazon-ecs-amazon-eks-and-aws-serverless-mcp-server/

Today, we’re introducing specialized Model Context Protocol (MCP) servers for Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), and AWS Serverless, now available in the AWS Labs GitHub repository. These open source solutions extend AI development assistants capabilities with real-time, contextual responses that go beyond their pre-trained knowledge. While Large Language Models (LLM) within AI assistants rely on public documentation, MCP servers deliver current context and service-specific guidance to help you prevent common deployment errors and provide more accurate service interactions.

You can use these open source solutions to develop applications faster, using up-to-date knowledge of Amazon Web Services (AWS) capabilities and configurations during the build and deployment process. Whether you’re writing code in your integrated development environment (IDE), or debugging production issues, these MCP servers support AI code assistants with deep understanding of Amazon ECS, Amazon EKS, and AWS Serverless capabilities, accelerating the journey from code to production. They work with popular AI-enabled IDEs, including Amazon Q Developer on the command line (CLI), to help you build and deploy applications using natural language commands.

The Amazon ECS MCP Server containerizes and deploys applications to Amazon ECS within minutes by configuring all relevant AWS resources, including load balancers, networking, auto-scaling, monitoring, Amazon ECS task definitions, and services. Using natural language instructions, you can manage cluster operations, implement auto-scaling strategies, and use real-time troubleshooting capabilities to identify and resolve deployment issues quickly.
For Kubernetes environments, the Amazon EKS MCP Server provides AI assistants with up-to-date, contextual information about your specific EKS environment. It offers access to the latest EKS features, knowledge base, and cluster state information. This gives AI code assistants more accurate, tailored guidance throughout the application lifecycle, from initial setup to production deployment.
The AWS Serverless MCP Server enhances the serverless development experience by providing AI coding assistants with comprehensive knowledge of serverless patterns, best practices, and AWS services. Using AWS Serverless Application Model Command Line Interface (AWS SAM CLI) integration, you can handle events and deploy infrastructure while implementing proven architectural patterns. This integration streamlines function lifecycles, service integrations, and operational requirements throughout your application development process. The server also provides contextual guidance for infrastructure as code decisions, AWS Lambda specific best practices, and event schemas for AWS Lambda event source mappings.

Let’s see it in action
If this is your first time using AWS MCP servers, visit the Installation and Setup guide in the AWS Labs GitHub repository to installation instructions. Once installed, add the following MCP server configuration to your local setup:

Install Amazon Q for command line and add the conﬁguration to ~/.aws/amazonq/mcp.json. If you’re already an Amazon Q CLI user, add only the configuration.

{
  "mcpServers": {
    "awslabs.aws-serverless-mcp":  {
      "command": "uvx",
      "timeout": 60,
      "args": ["awslabs.aws_serverless_mcp_server@latest"],
    },
    "awslabs.ecs-mcp-server": {
      "disabled": false,
      "command": "uv",
      "timeout": 60,
      "args": ["awslabs.ecs-mcp-server@latest"],
    },
    "awslabs.eks-mcp-server": {
      "disabled": false,
      "timeout": 60,
      "command": "uv",
      "args": ["awslabs.eks-mcp-server@latest"],
    }
  }
}

For this demo I’m going to use the Amazon Q CLI to create an application that understands video using 02_using_converse_api.ipynb from Amazon Nova model cookbook repository as sample code. To do this, I send the following prompt:

I want to create a backend application that automatically extracts metadata and understands the content of images and videos uploaded to an S3 bucket and stores that information in a database. I'd like to use a serverless system for processing. Could you generate everything I need, including the code and commands or steps to set up the necessary infrastructure, for it to work from start to finish? - Use 02_using_converse_api.ipynb as example code for the image and video understanding.

Amazon Q CLI identifies the necessary tools, including the MCP serverawslabs.aws-serverless-mcp-server. Through a single interaction, the AWS Serverless MCP server determines all requirements and best practices for building a robust architecture.

I ask to Amazon Q CLI that build and test the application, but encountered an error. Amazon Q CLI quickly resolved the issue using available tools. I verified success by checking the record created in the Amazon DynamoDB table and testing the application with the dog2.jpeg file.

To enhance video processing capabilities, I decided to migrate my media analysis application to a containerized architecture. I used this prompt:

I'd like you to create a simple application like the media analysis one, but instead of being serverless, it should be containerized. Please help me build it in a new CDK stack.

Amazon Q Developer begins building the application. I took advantage of this time to grab a coffee. When I returned to my desk, coffee in hand, I was pleasantly surprised to find the application ready. To ensure everything was up to current standards, I simply asked:

please review the code and all app using the awslabsecs_mcp_server tools

Amazon Q Developer CLI gives me a summary with all the improvements and a conclusion.

I ask it to make all the necessary changes, once ready I ask Amazon Q developer CLI to deploy it in my account, all using natural language.

After a few minutes, I review that I have a complete containerized application from the S3 bucket to all the necessary networking.

I ask Amazon Q developer CLI to test the app send it the-sea.mp4 video file and received a timed out error, so Amazon Q CLI decides to use the fetch_task_logs from awslabsecs_mcp_server tool to review the logs, identify the error and then fix it.

After a new deployment, I try it again, and the application successfully processed the video file

I can see the records in my Amazon DynamoDB table.

To test the Amazon EKS MCP server, I have code for a web app in the auction-website-main folder and I want to build a web robust app, for that I asked Amazon Q CLI to help me with this prompt:

Create a web application using the existing code in the auction-website-main folder. This application will grow, so I would like to create it in a new EKS cluster

Once the Docker file is created, Amazon Q CLI identifies generate_app_manifests from awslabseks_mcp_server as a reliable tool to create a Kubernetes manifests for the application.

Then create a new EKS cluster using the manage_eks_staks tool.

Once the app is ready, the Amazon Q CLI deploys it and gives me a summary of what it created.

I can see the cluster status in the console.

After a few minutes and resolving a couple of issues using the search_eks_troubleshoot_guide tool the application is ready to use.

Now I have a Kitties marketplace web app, deployed on Amazon EKS using only natural language commands through Amazon Q CLI.

Get started today
Visit the AWS Labs GitHub repository to start using these AWS MCP servers and enhance your AI-powered developmen there. The repository includes implementation guides, example configurations, and additional specialized servers to run AWS Lambda function, which transforms your existing AWS Lambda functions into AI-accessible tools without code modifications, and Amazon Bedrock Knowledge Bases Retrieval MCP server, which provides seamless access to your Amazon Bedrock knowledge bases. Other AWS specialized servers in the repository include documentation, example configurations, and implementation guides to begin building applications with greater speed and reliability.

To learn more about MCP Servers for AWS Serverless and Containers and how they can transform your AI-assisted application development, visit the Introducing AWS Serverless MCP Server: AI-powered development for modern applications, Automating AI-assisted container deployments with the Amazon ECS MCP Server, and Accelerating application development with the Amazon EKS MCP server deep-dive blogs.

— Eli

Modernizing applications with AWS AppSync Events

2025-05-29 Ricardo Marques

Post Syndicated from Ricardo Marques original https://aws.amazon.com/blogs/compute/modernizing-applications-with-aws-appsync-events/

In today’s fast-paced digital world, organizations are facing challenges for modernizing their applications. A common problem is the smooth shift from synchronous to asynchronous communication without substantial client or frontend alterations. When modernizing applications, it is often necessary to move from a synchronous communication model to an asynchronous one. However, this transition can be complex, especially when the client or frontend communicates synchronously. Adapting the current code for asynchronous communication demands significant time and resources.

AWS AppSync Events helps address this challenge by enabling you to build event-driven APIs that can bridge between synchronous and asynchronous communication models. With AppSync Events, you can modernize your backend architecture to leverage asynchronous patterns while maintaining compatibility with existing synchronous clients.

Overview

The solution comprises an API that converts client synchronous requests to asynchronous backend requests using AppSync Events.

For demonstrating the integration between the API and the backend, I’m simulating the backend processing using an asynchronous AWS Step Functions workflow. This workflow receives a Name and Surname event, waits 10 seconds, and posts a full-name event to the AppSync Event channel. To receive event notifications, the API subscribes to the AppSync channel. At the same time, the backend handles events asynchronously.

Figure 1: Representation of an API integrating a synchronous frontend with an asynchronous backend using AWS AppSync Events.

The Amazon API Gateway makes a synchronous request to AWS Lambda and waits for the response.
Lambda function starts the execution of the asynchronous workflow.
After starting the workflow execution, Lambda connects to AppSync and creates a channel to receive asynchronous notifications (channels are ephemeral and unlimited. Here it creates one channel per request using the workflow execution ID).
The workflow executes asynchronously, calling other workflows.
Upon completion of the main workflow, it sends a POST request to the AppSync events API with the processing result. The POST is made to the channel that was created by the Lambda function using the workflow execution ID.
AppSync receives the POST request and sends a notification to the subscriber, which in this case is the Lambda function. The entire process must be finished within the Lambda functions’s timeout limit you defined.
Lambda sends the response to the API Gateway, which has been waiting for the synchronous response.

To better understand the Event API WebSocket Protocol used in this solution, refer to this AppSync documentation.

You can access the GitHub repo through this link: AppSync_Sync_Async_Integration.

The repository includes a comprehensive README file that walks you through the process of setting up and configuring the preceding solution.

Prerequisites

To follow this walkthrough, you need the following prerequisites:

An AWS account
Necessary AWS permissions to create all specified resources
AWS Command Line Interface (AWS CLI) configured with appropriate credentials
Terraform installed (version 0.12 or later)

With the full code, including API Gateway and Step Functions, on GitHub, this post only covers the core components: the AppSync Events API and the Lambda function.

Walkthrough

The following steps walk you through this solution.

Creating an AppSync event API with API Key Authorization

An AppSync Event API allows calls using API key, Amazon Cognito user pools, Lambda authorizer, OIDC, or AWS identity and Access Management (IAM). This solution uses API Key.

The infrastructure as code (IaC) has been created using Terraform. However, as of writing this post, there weren’t Terraform AppSync Event API resource available. Therefore, the AppSync Event API resources were made with AWS CloudFormation, which is imported and implemented by Terraform.

In the resource AWS:AppSync:Api, define the API name and Auth method:

Resources:
  #Creating the AppSync Events API
  EventAPI:
    Type: AWS::AppSync::Api
    Properties:
      Name: SyncAsyncAPI
      EventConfig:
        AuthProviders:
          - AuthType: API_KEY
        ConnectionAuthModes:
          - AuthType: API_KEY
        DefaultPublishAuthModes:
          - AuthType: API_KEY
        DefaultSubscribeAuthModes:
          - AuthType: API_KEY
#Creating the Events API Namespace
  DefaultNamespace:
    Type: AWS::AppSync::ChannelNamespace
    Properties:
      Name: AsyncEvents
      ApiId: !GetAtt EventAPI.ApiId
  
  #Creating the Events API APIKey
  EventAPIKey:
    Type: AWS::AppSync::ApiKey
    Properties:
      ApiId: !GetAtt EventAPI.ApiId
      Expires: 1748950672
      Description: 'API Key for Event API'

  #Creating the SecretsManager to store the APIKey
  SecretsManagerAPIKey:
    Type: AWS::SecretsManager::Secret
    Properties:
      Name: 'AppSyncEventAPIKEY'
      SecretString: !GetAtt EventAPIKey.ApiKey

To have the Host DNS, Realtime Endpoint, and Secret Manager created referenced by the Terraform template, output them:

Outputs:
  ApiARN:
    Description: 'The ARN ID'
    Value: !GetAtt EventAPI.ApiArn

  AppSyncHost:
    Description: 'The API Endpoint'
    Value: !GetAtt EventAPI.Dns.Http

  AppSyncRealTimeEndpoint:
    Description: 'The Real-time Endpoint'
    Value: !GetAtt EventAPI.Dns.Realtime

  SecretsManagerARN:
    Description: 'The ARN of the Secrets Manager entry'
    Value: !Ref SecretsManagerAPIKey

The key information needed from the AppSync Event API is:

Host DNS: This DNS is used to send events to the API Channel through HTTP Post requests.
Realtime endpoint: This endpoint is a WebSocket endpoint where the Lambda function connects to receive the events posted in the AppSync Channel.
API Key: This key is used not only in the Post HTTP requests, but also to connect and subscribe to the AppSync channel.

Lambda Sync/Async API

In this solution, the Lambda function runs two tasks:

Start an asynchronous workflow
Subscribe to an event channel through WebSocket

To handle the WebSocket connection, use the websocket-client lib, which is a powerful Python lib developed for working with WebSockets.

Request isolation is maintained by using the same UUID for workflow name and AppSync channel name.

try:
        handler = WebSocketHandler()
        sfn_response = wf.start_workflow_async(event["body"])
        
        if sfn_response["status"] == "started":
            handler.execution_name = sfn_response["id"]
            handler.start_websocket_connection()
            
            return {
                'statusCode': 200,
                'body': json.dumps({ 
                        "id": handler.execution_name,
                        "nome completo": handler.final_name
                        })
            }
        else:
            raise ValueError("Workflow failed to start")

First, to initialize the WebSocket Connection, the subprotocols must be defined:

WEBSOCKET_PROTOCOL
Headers:
- Host: The AppSync Host DNS (even with a WebSocket Connection, the HTTP Host must be sent)
- x-api-key: The API key create fot the Event API.
- Sec-Websocket-Protocol: WEBSOCKET_PROTOCOL

def start_websocket_connection(self) -> None:
        try: 
            """Initialize and start WebSocket connection."""
            header_str = self._create_connection_header()
            
            self.ws = websocket.WebSocketApp(
                os.environ["API_URL"],
                subprotocols=[WEBSOCKET_PROTOCOL, f'header-{header_str}'],
                on_open=self.on_open,
                on_message=self.on_message,
                on_error=self.on_error,
                on_close=self.on_close
)
            self.ws.run_forever()
        except Exception as e:
            return e

def _create_connection_header(self) -> str:
        """Create and encode connection header."""
        connection_header = {
            "host": os.environ["API_HOST"],
            "x-api-key": APIKEY,
            "Sec-WebSocket-Protocol": WEBSOCKET_PROTOCOL
        }
        return base64.b64encode(json.dumps(connection_header).encode()).decode()

Once the WebSocket connection is established, a first message with the type CONNECTION_INIT_TYPE must be sent.

To subscribe to the channel by which our function is notified when the Step Functions workflow finishes, send a second message with the type SUBSCRIBE_TYPE, an ID, the channel name and authorization.

For more information about types of message, read this AppSync documentation.

def on_open(self, ws: websocket.WebSocketApp) -> None:
        try:
            """Handle WebSocket connection opening and send initial messages."""
            logger.info("Connection opened")
            
            # Send connection initialization
            connection_init = {"type": CONNECTION_INIT_TYPE}
            ws.send(json.dumps(connection_init))

            # Send subscription
            subscription_msg = {
                "type": SUBSCRIBE_TYPE,
                "id": self.execution_name,
                "channel": f"{os.environ["APPSYNC_NAMESPACE"]}/{self.execution_name}",
                "authorization": {
                    "x-api-key": APIKEY,
                    "host": os.environ["API_HOST"]
                }
            }
            
            logger.info("Sending subscription")
            ws.send(json.dumps(subscription_msg))
        except Exception as e:
            self.on_error = e

After receiving the message confirming the subscription, wait for messages with the type data. Whenever a message with this type arrives, execute the logic to identify if the workflow was successfully executed, and then close the connection.

def on_message(self, ws: websocket.WebSocketApp, message: str) -> None:
        """Handle incoming WebSocket messages."""
        logger.info("Message received: %s", message)
        try:
            message_dict = json.loads(message)
            required_keys = ["id", "type", "event"]
            
            if all(key in message_dict for key in required_keys):
                event_json = json.loads(message_dict["event"])
                
                if (message_dict["id"] == self.execution_name and 
                    message_dict["type"] == "data"):
                    
                    self.final_name = event_json["nome_completo"]
                    logger.info("Message received: %s", self.final_name)
                    logger.info("Successfully received return message")
                    logger.info("Ending processing")
                    
                    self.message_queue = {
                        "status": SUCCESS_STATUS,
                        "executionID": message_dict["id"]
                    }
                    ws.close()
        except json.JSONDecodeError as e:
            logger.error("Failed to parse message: %s", str(e))
        except Exception as e:
            logger.error("Error processing message: %s", str(e))

Conclusion

In this post, you learned how to use event-driven architectures and the capabilities of AWS AppSync Events to integrate synchronous and asynchronous communication patterns in your applications. This allows you to modernize your systems without the need for extensive modifications to your existing frontend codebase. Explore the demonstrations and documentation provided in the GitHub repository to gain a deeper understanding of how AppSync Events can be applied to your specific use cases.

To learn more about serverless architectures and asynchronous invocation patterns, see Serverless Land.

Enhancing multi-account activity monitoring with event-driven architectures

2025-05-28 Anton Aleksandrov

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/compute/enhancing-multi-account-activity-monitoring-with-event-driven-architectures/

Enterprise cloud environments are growing increasingly complex as they scale, with organizations managing hundreds to thousands of Amazon Web Services (AWS) accounts across multiple business units and AWS Regions. Organizations need efficient ways to collect, transport, and analyze activity data for threat detection and compliance monitoring. This presents unique challenges for enterprise Application Security (AppSec) teams, cloud security vendors, and DevSecOps professionals, because traditional polling-based monitoring approaches struggle to provide real-time activity insights needed for modern cloud operations.

In this post, you will learn to use AWS CloudTrail and Amazon EventBridge for real-time cloud activity monitoring and automated response.

Overview

As organizations expand their cloud footprint, account activity monitoring that comprehensively tracks user actions and successfully identifies security threats becomes crucial for threat detection and compliance. Although AWS provides native tools—such as CloudTrail for API activity capture, EventBridge for real-time event routing, AWS Organizations for multi-account management, and AWS Config for resource evaluation—many enterprises struggle with the volume of activities while maintaining efficiency and controlling costs. Organizations need to carefully architect solutions to effectively use these tools as their environments scale.

Traditional polling-based techniques, which worked well for smaller environments, can become unsustainable when scaled to enterprise deployments, where the volume of activity data grows exponentially with each new account and service. API polling limitations, growing data volumes, and increasing demand for real-time analysis are pushing teams to rethink their architectural approach.

Figure 1. Poll model, periodically retrieving the latest state.

Adopting push-based event-driven architectures offers a compelling solution for AppSec teams and cloud security vendors facing these challenges. Using AWS services, such as CloudTrail and EventBridge, allows these teams and vendors to build scalable activity monitoring solutions that overcome the limitations of traditional polling-based approaches and provide real-time notifications across thousands of AWS accounts. This approach not only enables security use cases but also supports broader real-time operational monitoring, compliance reporting, and automation requirements.

“By integrating AWS CloudTrail and Amazon EventBridge, we’ve built a scalable architecture to monitor activity across thousands of AWS accounts. This provides the visibility needed to detect threats in real time and protect large, distributed AWS environments.” — Rob Solomon, Senior Cloud Solution Architect, CrowdStrike

Solution components

Enterprise AppSec teams and cloud security vendors share common requirements when building multi-account monitoring solutions. They need to efficiently collect activity data across thousands of accounts, transport it to a centralized location for analysis, and process it in real-time to detect threats and compliance violations. The solution must scale seamlessly from dozens to thousands of accounts while remaining highly-performant and cost-efficient. At its core, a scalable multi-account activity monitoring solution consists of three components: activity data collection, cross-account transport to a centralized location, and processing. In the following sections, you will learn how AppSec teams and cloud security vendors can implement each step efficiently while avoiding common pitfalls.

Figure 2. Push model. Account activity is collected at the source, and pushed to the AppSec or cloud security vendor account for further processing.

Data collection strategies

Many teams begin their cloud activity monitoring journey by polling the resource status through service management APIs. Although this approach works good for retrieving the latest resource state on-demand, its fundamental limitation is inability to detect state changes efficiently, necessitating continuous querying of all resources at fixed intervals. Consider a scenario where you’re monitoring 1,000 accounts, with 100 resources in each account. A single polling cycle would necessitate 100,000 API requests, consuming over 28 million API calls daily if running at five-minute intervals. This inefficiency compounds as environments grow, leading to throttling issues, increased costs, and scaling challenges.

AWS Config improved upon this by offering continuous resource configuration tracking without manual polling. Although this works excellent for configuration compliance and a history of changes for auditing, AWS Config reports changes on a best-effort basis and is not optimal for real-time threat detection.

To overcome this constraint, your solution can use services such as CloudTrail and EventBridge as primary data sources, complemented by intelligent on-demand targeted API polling. CloudTrail records API activity across AWS services, providing a detailed history of actions taken by users, roles, and AWS services in your accounts. Over 250 AWS services automatically report their activity and API calls to CloudTrail and EventBridge in real-time. This allows you to capture this information, providing a detailed history of actions taken in your accounts, and enabling security analysis, resource state change tracking, and compliance audit.

Figure 3. Over 250 AWS services are automatically emitting activity events to CloudTrail.

When a resource state changes, commonly as a result of a management API call, the affected service sends an event to CloudTrail and EventBridge. Your monitoring solution can examine the event payload to determine if polling for supplementary data is necessary, particularly when the initial payload lacks complete information. This provides you with comprehensive service coverage with reduced maintenance effort. This hybrid approach guarantees delivery of activity data to eliminate monitoring blind spots, while significantly reducing AWS management API quota consumption.

Cross-account data transport

Your solution should transport activity data from thousands of tenant accounts into a small number of centralized accounts, such as a regional AppSec account, for further processing and analysis. The solution must be secure, scalable, resilient, and cost-efficient while maintaining real-time delivery.

The most direct way to achieve it is to enable Amazon S3 event notifications for new objects that are created in the CloudTrail trails S3 bucket. When you receive the notification, you can retrieve and process new activities.

Figure 4. Exporting CloudTrail events into an S3 bucket and retrieving after receiving a notification.

This direct way to consume CloudTrail events has one important consideration: typically it can take an average of five minutes to deliver events to Amazon S3. Teams and vendors looking for lower mean-time-to-detect (MTTD) and mean-time-to-respond (MTTR) should evaluate transporting CloudTrail events across accounts with EventBridge, which provides close-to-real-time delivery.

Transporting events with EventBridge

EventBridge is a serverless event router that connects applications. It receives events from various sources, such as CloudTrail, and routes them to multiple targets based on defined rules.

Using EventBridge for cross-account data transport comes with several major benefits:

EventBridge seamlessly delivers events across accounts and AWS Regions, providing support for automatic retries, exponential backoff with jitter, and dead-letter queues.
EventBridge is natively integrated with CloudTrail.
EventBridge delivers events in close to real-time.
EventBridge is a serverless event router managed and operated by AWS. You don’t need to be concerned with its scaling and you always pay per number of events it transports.

There are two approaches you can take for delivering cross-account events with EventBridge: direct service-to-service or service-to-API-endpoint.

The first approach uses the EventBridge direct bus-to-bus and bus-to-service delivery capabilities. This method is most suitable when you want AWS to handle data ingestion on the receiving end. The delivery target is always either an EventBridge bus, or another AWS service, such as an Amazon Simple Queue Service (Amazon SQS) queue, Amazon Kinesis Data Streams stream, or an AWS Lambda function. With support for up to 18,750 target invocations per second and native AWS Identity and Access Management (IAM) integration, this method is particularly suitable for large multi-account deployments.

The second approach uses the EventBridge API destinations feature. This method is most suitable when you have existing HTTP-based ingestion endpoints in place. Although it offers lower throughput, it provides greater flexibility for ingestion endpoint and authentication methods implementation, making it attractive for AppSec teams and cloud security vendors integrating with existing ingestion infrastructure.

Figure 5. Emitting CloudTrail events in real-time through EventBridge.

The following table summarizes two approaches for transporting events across accounts with EventBridge.

	Direct bus-to-bus or bus-to-service	API destinations
Data ingestion implementation effort	Minimal	Needed
Default target invocations per second (TPS) quotas	Up to 18,750 (region dependent)	Up to 300 (region dependent)
Can the TPS quota be increased	Yes	Yes
Authorization support	Native AWS IAM, fully handled by AWS	Basic, OAuth2, API Key. You’re responsible for implementing credentials validation during ingestion.
Cross-account delivery costs	$1 per million events	$0.20 per million events

Go to the EventBridge quotas and pricing pages for more details.

Processing architecture

Processing would commonly be done by existing products and services the AppSec team or cloud security vendor provides for activity analysis. The architecture for event processing pipeline operating at enterprise scale must consider design decisions to handle large and potentially irregular event volumes while maintaining high performance, as shown in the following figure.

Figure 6. An activity event processing pipeline, with priority-based processing.

Use the following best practices for a robust processing architecture:

Buffer ingested events Use services such as Amazon SQS, Amazon Kinesis Data Streams, or Amazon Managed Streaming for Apache Kafka to buffer incoming events, handle traffic surges, and make sure of reliable processing.
Use serverless services that scale automatically, or invest in automated scaling mechanisms that adjust processing capacity based on event volume
Minimize polling: Resort to intelligent on-demand polling, only poll when you need additional data that is not available in the CloudTrail event payload.
Routing and classification: Rather than processing all events equally, implement intelligent classification and routing early in your pipeline. Security-related events such as IAM changes or security group modifications should take priority over routine activities or data events. This approach helps to control costs while maintaining rapid detection of important security events.
Cost optimization: At the enterprise scale, cost optimization becomes crucial. Use EventBridge rules in source accounts to filter out irrelevant events before they enter your processing pipeline. Consider implementing regional collection points to optimize data transfer costs. When using Lambda functions for data processing, use batch processing to reduce invocation costs. Evaluate which event types must be delivered in real-time through EventBridge, which event types can be delayed and collected through S3 bucket export, and which events should be discarded.
Observability: Monitor the ingestion and processing throughput to react to potential slowdowns early. Detect when source accounts are approaching EventBridge quotas. Consider using AWS Service Quotas to request quota increases automatically through APIs.
Cross-Region considerations: Design your architecture to support efficient cross-Region event collection while respecting data sovereignty requirements. Consider implementing regional processing nodes with centralized aggregation for global security analysis.
Integration patterns: Modern security solutions must integrate with existing security tools and workflows. Implement standardized output formats that allow seamless integration with SIEM systems, ticketing platforms, and automation frameworks. Consider publishing security findings back to EventBridge buses to enable automated remediation workflows. If you’re a cloud security vendor, then consider integrating with EventBridge as an SaaS partner.

Conclusion

Event-driven architectures present a powerful opportunity for building scalable multi-account activity monitoring solutions. Using services such as AWS CloudTrail and Amazon EventBridge allows teams to overcome the limitations of traditional polling-based approaches while achieving close to real-time delivery.The shift to event-driven security monitoring isn’t just an architectural choice—it’s becoming a necessity for teams operating at enterprise scale. This approach enables security teams to achieve the real-time threat detection capabilities needed in today’s cloud environments while maintaining operational efficiency and cost control.

Securing Amazon S3 presigned URLs for serverless applications

2025-05-14 Raaga N.G

Post Syndicated from Raaga N.G original https://aws.amazon.com/blogs/compute/securing-amazon-s3-presigned-urls-for-serverless-applications/

Modern serverless applications must be capable of seamlessly handling large file uploads. This blog demonstrates how to leverage Amazon Simple Storage Service (Amazon S3) presigned URLs to allow your users to securely upload files to S3 without requiring explicit permissions in the AWS Account. This blog post specifically focuses on the security ramifications of using S3 presigned URLs, and explains mitigation steps that serverless developers can take to improve the security of their systems using S3 presigned URLs. Additionally, the blog post also walks through an AWS Lambda function that adheres to the provided recommendations, ensuring a robust and secure approach to handling S3 presigned URLs. For more information on S3 presigned URLs, see Working with presigned URLs.

Presigned URL Workflow for Serverless Applications

The following architecture diagram illustrates a serverless application that generates an S3 presigned URL. By using S3 presigned URLs, serverless applications can offload to S3 the computation required to receive files. The diagram captures a seven-step process between the client, Amazon API Gateway, the Lambda function, and S3.

A typical workflow to upload a file to a serverless application hosted on S3 includes the following steps:

Client submits a request to upload a file.
API Gateway receives the client request and invokes a Lambda function that then generates the S3 presigned URL.
The Lambda function makes a getSignedUrl API call to S3.
S3 returns the presigned URL for the object to be uploaded.
The Lambda function returns a presigned URL to the API.
Client receives the S3 presigned URL to upload the file.
Client uploads the file directly to S3 using the presigned URL.

How to Secure Presigned URLs

When designing a serverless application that utilizes S3 presigned URLs to store data in S3, a developer must consider several primary security aspects. S3 presigned URLs are public resources that do not authenticate users, and anyone in possession of a valid S3 presigned URL can access the associated resource. Consequently, it is important to implement additional security measures to ensure that these URLs are not misused or accessed by unauthorized parties. The following blog post contains techniques you can use to make your presigned URLs more secure.

1. Add a Content-MD5 checksum using the X-Amz-Signed header

When you upload an object to S3, you can include a precalculated checksum of the object as part of your request. S3 will perform an integrity check and verify if the object sent is the same as the object received. S3 supports the use of MD5 checksums to verify the integrity of objects uploaded. You provide the MD5 digest by including a Content-MD5 header in the initial PUT request. Upon receiving the object, S3 will calculate the MD5 digest and compare it with the one you originally provided. The upload operation succeeds only if both MD5 digests match, ensuring end-to-end data integrity. If an unintended party gets their hands on the S3 presigned URL, then they will not be able to use it without possessing the same object. This provides protection against arbitrary file uploads.

The key element for a developer to remember is that when the client uploads the file to the S3 presigned URL, it must supply the correct MD5 in Base64 using the Content-MD5 header. Developers can see a sample serverless application with client-side code to extract the MD5 digest, request a S3 presigned URL, and upload a file in this GitHub repository. This sample application uses NodeJS v20 in the Lambda function.

2. Expire the S3 presigned URLs

An S3 presigned URL remains valid for the period of time specified when the URL is generated. It is important to ensure that the S3 presigned URL does not remain accessible for longer than required as it can be reused when still valid. You can define the expiration time of the S3 presigned URL by either passing X-Amz-Expires as a query parameter or by setting the expiresIn parameter when using the AWS SDK for JavaScript.

S3 validates the expiration time and date at the time of initial HTTP Request. However, to support situations where the connection drops and the client needs to restart uploading a file, you may want your S3 presigned URL to remain valid for the entire anticipated time needed to upload the file to S3. The challenge is to generate an S3 presigned URL that is valid long enough to accommodate the file’s upload, yet still short enough that you prevent reuse.

A solution we propose to overcome these challenges is to dynamically set the S3 presigned URL’s expiration time by using the browser Network Information API. Using this new API, when the client browser places the initial request for an S3 presigned URL, the client also transmits the file’s size and the network type, so the Lambda function can calculate the anticipated transfer time.

Within the Lambda function, we can now estimate the transfer time for this size of file on this type of network, using sample code as featured in this GitHub repository.

With the estimated transfer time calculated, the Lambda function can now request the S3 presigned URL and set the expiresIn parameter to the transfer time, resulting in an S3 presigned URL that is only available for the time needed to upload that size of file on this type of network.

If you are using the AWS SDK, you may also be using AWS Signature Version 4 (SigV4) to sign your requests. To create a defense in depth approach, which will place a ceiling on total expiration time, you can utilize condition keys in bucket policies. For an example policy, see Limiting presigned URL capabilities.

3. Generating a UUID to replace the uploaded filename

When an application allows a user to upload files, the application exposes itself to various security threats, such as path traversal attacks. Path traversal vulnerabilities allow attackers to access files that are not meant to be accessed or to overwrite files outside the intended directory structure. In order to secure your applications against such vulnerabilities, the most effective approach is to incorporate user input validation and sanitization. You can sanitize the filename by replacing it with a generated UUID (Universally Unique Identifier).

You can see an example function in the server-side code for Lambda in this GitHub repository.

4. Applying the Principle of Least Privilege and using a separate Lambda function to create S3 presigned URLs

The capabilities of an S3 presigned URL are constrained by the permissions of the principal that created it. To offer fine-grained access, the very first step in limiting use of an S3 presigned URL should be building a specific Lambda function that generates these URLs. By having a Lambda function dedicated to this purpose, you do not risk an overly permissive Lambda function. The second step is to limit your specific Lambda function’s access to S3.

Adhering to the Principle of Least Privilege, it’s important to restrict the Lambda function’s permissions to only the required prefixes in the bucket and allow it to perform only the required actions on the bucket, instead of granting full bucket access. This minimizes the potential attack surface and mitigates the risk of unintended data exposure or modification. It is important to limit the permissions to the minimum required set of actions and resources.

This example AWS Identity and Access Management (IAM) policy demonstrates how to grant the Lambda function read access (GET) to objects within the "Example-Prefix" prefix of a specific S3 bucket. The IAM policy is attached to the Lambda function via an execution role, which together establish what actions the Lambda function can perform.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadStatement",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::EXAMPLE-BUCKET/Example-Prefix/",
        "arn:aws:s3:::EXAMPLE-BUCKET/Example-Prefix/*"
      ],
      "Effect": "Allow"
    }
  ]
}

This example IAM policy demonstrates how to grant the Lambda function permissions to upload (PUT) objects within the "Example-Prefix" prefix of a specific S3 bucket.

{   
    "Version": "2012-10-17",
    "Statement": [
        {   
            "Sid": "UploadStatement",
            "Action": [
                "s3:PutObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::EXAMPLE-BUCKET/Example-Prefix/",
                "arn:aws:s3:::EXAMPLE-BUCKET/Example-Prefix/*"
            ],
            "Effect": "Allow"
        }
    ]
}

This approach will ensure that your Lambda function possesses the minimum required permissions to perform its intended tasks and reduces the risk of unintended data access or modification.

If you want to restrict the use of S3 presigned URLs and all S3 access to a particular network path, you can also define a network-path restriction policy on the S3 Bucket. This restriction on the bucket requires that all requests to the bucket originate from a specified network. AWS Prescriptive Guidance says, an extension of least privilege is to maintain a data perimeter that’s consistent with your organization’s needs. The goal of an AWS perimeter is to ensure that the access is allowed only if the request is coming from a trusted entity, for trusted resources from a trusted network. These data perimeters are applicable to S3 presigned URLs as well.

5.Creating one-time use S3 presigned URLs

Serverless applications developers may want each S3 presigned URL to only be used once. Developers can incorporate a token-based mechanism to facilitate secure one-time use of an S3 presigned URL. This involves generating unique tokens for each authorized user or client and associating these tokens with the S3 presigned URLs. When a client attempts to access the resource using the S3 presigned URL, they must provide the corresponding token for validation. This additional layer of security ensures that only authorized entities can access the S3 presigned URLs and the associated resources. Furthermore, you can leverage a database to track the issued tokens and expire them after each use. A solution to implement such a mechanism has been discussed in detail in How to securely transfer files with presigned URLs.

Cleaning up

You may clean up the sample application by deleting the API Gateway, Lambda function, and S3 bucket. In addition, please do not forget to delete any IAM execution roles you created for the Lambda function.

Conclusion

In this blog we have discussed various considerations that a developer must make when designing an application that leverages S3 presigned URLs. By incorporating robust security measures, such as proper access control, input sanitization, expiration handling and integrity checks, developers can mitigate potential risks when using S3 presigned URLs.

Accelerate lightweight analytics using PyIceberg with AWS Lambda and an AWS Glue Iceberg REST endpoint

2025-05-09 Sotaro Hikita

Post Syndicated from Sotaro Hikita original https://aws.amazon.com/blogs/big-data/accelerate-lightweight-analytics-using-pyiceberg-with-aws-lambda-and-an-aws-glue-iceberg-rest-endpoint/

For modern organizations built on data insights, effective data management is crucial for powering advanced analytics and machine learning (ML) activities. As data use cases become more complex, data engineering teams require sophisticated tooling to handle versioning, increasing data volumes, and schema changes across multiple data sources and applications.

Apache Iceberg has emerged as a popular choice for data lakes, offering ACID (Atomicity, Consistency, Isolation, Durability) transactions, schema evolution, and time travel capabilities. Iceberg tables can be accessed from various distributed data processing frameworks like Apache Spark and Trino, making it a flexible solution for diverse data processing needs. Among the available tools for working with Iceberg, PyIceberg stands out as a Python implementation that enables table access and management without requiring distributed compute resources.

In this post, we demonstrate how PyIceberg, integrated with the AWS Glue Data Catalog and AWS Lambda, provides a lightweight approach to harness Iceberg’s powerful features through intuitive Python interfaces. We show how this integration enables teams to start working with Iceberg tables with minimal setup and infrastructure dependencies.

PyIceberg’s key capabilities and advantages

One of PyIceberg’s primary advantages is its lightweight nature. Without requiring distributed computing frameworks, teams can perform table operations directly from Python applications, making it suitable for small to medium-scale data exploration and analysis with minimal learning curve. In addition, PyIceberg is integrated with Python data analysis libraries like Pandas and Polars, so data users can use their existing skills and workflows.

When using PyIceberg with the Data Catalog and Amazon Simple Storage Service (Amazon S3), data teams can store and manage their tables in a completely serverless environment. This means data teams can focus on analysis and insights rather than infrastructure management.

Furthermore, Iceberg tables managed through PyIceberg are compatible with AWS data analytics services. Although PyIceberg operates on a single node and has performance limitations with large data volumes, the same tables can be efficiently processed at scale using services such as Amazon Athena and AWS Glue. This enables teams to use PyIceberg for rapid development and testing, then transition to production workloads with larger-scale processing engines—while maintaining consistency in their data management approach.

Representative use case

The following are common scenarios where PyIceberg can be particularly useful:

Data science experimentation and feature engineering – In data science, experiment reproducibility is crucial for maintaining reliable and efficient analyses and models. However, continuously updating organizational data makes it challenging to manage data snapshots for important business events, model training, and consistent reference. Data scientists can query historical snapshots through time travel capabilities and record important versions using tagging features. With PyIceberg, they can receive these benefits in their Python environment using familiar tools like Pandas. Thanks to Iceberg’s ACID capabilities, they can access consistent data even when tables are being actively updated.
Serverless data processing with Lambda – Organizations often need to process data and maintain analytical tables efficiently without managing complex infrastructure. Using PyIceberg with Lambda, teams can build event-driven data processing and scheduled table updates through serverless functions. PyIceberg’s lightweight nature makes it well-suited for serverless environments, enabling simple data processing tasks like data validation, transformation, and ingestion. These tables remain accessible for both updates and analytics through various AWS services, allowing teams to build efficient data pipelines without managing servers or clusters.

Event-driven data ingestion and analysis with PyIceberg

In this section, we explore a practical example of using PyIceberg for data processing and analysis using NYC yellow taxi trip data. To simulate an event-driven data processing scenario, we use Lambda to insert sample data into an Iceberg table, representing how real-time taxi trip records might be processed. This example will demonstrate how PyIceberg can streamline workflows by combining efficient data ingestion with flexible analysis capabilities.

Imagine your team faces several requirements:

The data processing solution needs to be cost-effective and maintainable, avoiding the complexity of managing distributed computing clusters for this moderately-sized dataset.
Analysts need the ability to perform flexible queries and explorations using familiar Python tools. For example, they might need to compare historical snapshots with current data to analyze trends over time.
The solution should have the ability to expand to be more scalable in the future.

To address these requirements, we implement a solution that combines Lambda for data processing with Jupyter notebooks for analysis, both powered by PyIceberg. This approach provides a lightweight yet robust architecture that maintains data consistency while enabling flexible analysis workflows. At the end of the walkthrough, we also query this data using Athena to demonstrate compatibility with multiple Iceberg-supporting tools and show how the architecture can scale.

We walk through the following high-level steps:

Use Lambda to write sample NYC yellow taxi trip data to an Iceberg table on Amazon S3 using PyIceberg with an AWS Glue Iceberg REST endpoint. In a real-world scenario, this Lambda function would be triggered by an event from a queuing component like Amazon Simple Queue Service (Amazon SQS). For more details, see Using Lambda with Amazon SQS.
Analyze table data in a Jupyter notebook using PyIceberg through the AWS Glue Iceberg REST endpoint.
Query the data using Athena to demonstrate Iceberg’s flexibility.

The following diagram illustrates the architecture.

When implementing this architecture, it’s important to note that Lambda functions can have multiple concurrent invocations when triggered by events. This concurrent invocation might lead to transaction conflicts when writing to Iceberg tables. To handle this, you should implement an appropriate retry mechanism and carefully manage concurrency levels. If you’re using Amazon SQS as an event source, you can control concurrent invocations through the SQS event source’s maximum concurrency setting.

Prerequisites

The following prerequisites are necessary for this use case:

An active AWS account that provides access to Lambda, AWS Glue, Amazon S3, Athena, and AWS CloudFormation.
Permissions to create and deploy a CloudFormation stack. For more details, see Create CloudFormation StackSets with self-managed permissions.
A user with full access to AWS CloudShell and its features. For more details, see Getting started with AWS CloudShell.

Set up resources with AWS CloudFormation

You can use the provided CloudFormation template to set up the following resources:

An S3 bucket for metadata and data files of an Iceberg table
An Amazon Elastic Container Registry (Amazon ECR) repository to store the container image for a Lambda function
A Data Catalog database to store the table
An Amazon SageMaker AI notebook instance for the Jupyter notebook environment
An AWS Identity and Access Management (IAM) role for a Lambda function and SageMaker AI notebook instance

Complete the following steps to deploy the resources:

Choose Launch stack.

For Parameters, pyiceberg_lambda_blog_database is set by default. You can also change the default value. If you change the database name, remember to replace pyiceberg_lambda_blog_database with your chosen name in all subsequent steps. Then, choose Next.
Choose Next.
Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
Choose Submit.

Build and run a Lambda function

Let’s build a Lambda function to process incoming records using PyIceberg. This function creates an Iceberg table called nyc_yellow_table in the database pyiceberg_lambda_blog_database in the Data Catalog if it doesn’t exist. It then generates sample NYC taxi trip data to simulate incoming records and inserts it into nyc_yellow_table.

Although we invoke this function manually in this example, in real-world scenarios, this Lambda function would be triggered by actual events, such as messages from Amazon SQS. When implementing real-world use cases, the function code must be modified to receive the event data and process it based on the requirements.

We deploy the function using container images as the deployment package. To create a Lambda function from a container image, build your image on CloudShell and push it to an ECR repository. Complete the following steps:

Sign in to the AWS Management Console and launch CloudShell.
Create a working directory.

mkdir pyiceberg_blog
cd pyiceberg_blog

Download the Lambda script lambda_function.py.

aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-5013/lambda_function.py .

This script performs the following tasks:

Creates an Iceberg table with the NYC taxi schema in the Data Catalog
Generates a random NYC taxi dataset
Inserts this data into the table

Let’s break down the essential parts of this Lambda function:

Iceberg catalog configuration – The following code defines an Iceberg catalog that connects to the AWS Glue Iceberg REST endpoint:

# Configure the catalog
catalog_properties = {
   "type": "rest",
   "uri": f"https://glue.{region}.amazonaws.com/iceberg",
   "s3.region": region,
   "rest.sigv4-enabled": "true",
   "rest.signing-name": "glue",
   "rest.signing-region": region
}
catalog = load_catalog(**catalog_properties)

Table schema definition – The following code defines the Iceberg table schema for the NYC taxi dataset. The table includes:
- Schema columns defined in the Schema
- Partitioning by vendorid and tpep_pickup_datetime using PartitionSpec
- Day transform applied to tpep_pickup_datetime for daily record management
- Sort ordering by tpep_pickup_datetime and tpep_dropoff_datetime

When applying the day transform to timestamp columns, Iceberg automatically handles date-based partitioning hierarchically. This means a single day transform enables partition pruning at the year, month, and day levels without requiring explicit transforms for each level. For more details about Iceberg partitioning, see Partitioning.

# Table Definition
schema = Schema(
    NestedField(field_id=1, name="vendorid", field_type=LongType(), required=False),
    NestedField(field_id=2, name="tpep_pickup_datetime", field_type=TimestampType(), required=False),
    NestedField(field_id=3, name="tpep_dropoff_datetime", field_type=TimestampType(), required=False),
    NestedField(field_id=4, name="passenger_count", field_type=LongType(), required=False),
    NestedField(field_id=5, name="trip_distance", field_type=DoubleType(), required=False),
    NestedField(field_id=6, name="ratecodeid", field_type=LongType(), required=False),
    NestedField(field_id=7, name="store_and_fwd_flag", field_type=StringType(), required=False),
    NestedField(field_id=8, name="pulocationid", field_type=LongType(), required=False),
    NestedField(field_id=9, name="dolocationid", field_type=LongType(), required=False),
    NestedField(field_id=10, name="payment_type", field_type=LongType(), required=False),
    NestedField(field_id=11, name="fare_amount", field_type=DoubleType(), required=False),
    NestedField(field_id=12, name="extra", field_type=DoubleType(), required=False),
    NestedField(field_id=13, name="mta_tax", field_type=DoubleType(), required=False),
    NestedField(field_id=14, name="tip_amount", field_type=DoubleType(), required=False),
    NestedField(field_id=15, name="tolls_amount", field_type=DoubleType(), required=False),
    NestedField(field_id=16, name="improvement_surcharge", field_type=DoubleType(), required=False),
    NestedField(field_id=17, name="total_amount", field_type=DoubleType(), required=False),
    NestedField(field_id=18, name="congestion_surcharge", field_type=DoubleType(), required=False),
    NestedField(field_id=19, name="airport_fee", field_type=DoubleType(), required=False),
)

# Define partition spec
partition_spec = PartitionSpec(
    PartitionField(source_id=1, field_id=1001, transform=IdentityTransform(), name="vendorid_idenitty"),
    PartitionField(source_id=2, field_id=1002, transform=DayTransform(), name="tpep_pickup_day"),
)

# Define sort order
sort_order = SortOrder(
    SortField(source_id=2, transform=DayTransform()),
    SortField(source_id=3, transform=DayTransform())
)

database_name = os.environ.get('GLUE_DATABASE_NAME')
table_name = os.environ.get('ICEBERG_TABLE_NAME')
identifier = f"{database_name}.{table_name}"

# Create the table if it doesn't exist
location = f"s3://pyiceberg-lambda-blog-{account_id}-{region}/{database_name}/{table_name}"
if not catalog.table_exists(identifier):
    table = catalog.create_table(
        identifier=identifier,
        schema=schema,
        location=location,
        partition_spec=partition_spec,
        sort_order=sort_order
    )
else:
    table = catalog.load_table(identifier=identifier)

Data generation and insertion – The following code generates random data and inserts it into the table. This example demonstrates an append-only pattern, where new records are continuously added to track business events and transactions:

# Generate random data
records = generate_random_data()
# Convert to Arrow Table
df = pa.Table.from_pylist(records)
# Write data using PyIceberg
table.append(df)

Download the Dockerfile. It defines the container image for your function code.

aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-5013/Dockerfile .

Download the requirements.txt. It defines the Python packages required for your function code.

aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-5013/requirements.txt .

At this point, your working directory should contain the following three files:

Dockerfile
lambda_function.py
requirements.txt

Set the environment variables. Replace <account_id> with your AWS account ID:

export AWS_ACCOUNT_ID=<account_id>

Build the Docker image:

docker build --provenance=false -t localhost/pyiceberg-lambda .

# Confirm built image
docker images | grep pyiceberg-lambda

Set a tag to the image:

docker tag localhost/pyiceberg-lambda:latest ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/pyiceberg-lambda-repository:latest

aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com

Push the image to the ECR repository:

docker push ${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/pyiceberg-lambda-repository:latest

Create a Lambda function using the container image you pushed to Amazon ECR:

aws lambda create-function \
--function-name pyiceberg-lambda-function \
--package-type Image \
--code ImageUri=${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/pyiceberg-lambda-repository:latest \
--role arn:aws:iam::${AWS_ACCOUNT_ID}:role/pyiceberg-lambda-function-role-${AWS_REGION} \
--environment "Variables={ICEBERG_TABLE_NAME=nyc_yellow_table, GLUE_DATABASE_NAME=pyiceberg_lambda_blog_database}" \
--region ${AWS_REGION} \
--timeout 60 \
--memory-size 1024

Invoke the function at least five times to create multiple snapshots, which we will examine in the following sections. Note that we are invoking the function manually to simulate event-driven data ingestion. In real world scenarios, Lambda functions will be automatically invoked with event-driven fashion.

aws lambda invoke \
--function-name arn:aws:lambda:${AWS_REGION}:${AWS_ACCOUNT_ID}:function:pyiceberg-lambda-function \
--log-type Tail \
outputfile.txt \
--query 'LogResult' | tr -d '"' | base64 -d

At this point, you have deployed and run the Lambda function. The function creates the nyc_yellow_table Iceberg table in the pyiceberg_lambda_blog_database database. It also generates and inserts sample data into this table. We will explore the records in the table in later steps.

For more detailed information about building Lambda functions with containers, see Create a Lambda function using a container image.

Explore the data with Jupyter using PyIceberg

In this section, we demonstrate how to access and analyze the data stored in Iceberg tables registered in the Data Catalog. Using a Jupyter notebook with PyIceberg, we access the taxi trip data created by our Lambda function and examine different snapshots as new records arrive. We also tag specific snapshots to retain important ones, and create new tables for further analysis.

Complete the following steps to open the notebook with Jupyter on the SageMaker AI notebook instance:

On the SageMaker AI console, choose Notebooks in the navigation pane.
Choose Open JupyterLab next to the notebook that you created using the CloudFormation template.

Download the notebook and open it in a Jupyter environment on your SageMaker AI notebook.

Open uploaded pyiceberg_notebook.ipynb.
In the kernel selection dialog, leave the default option and choose Select.

From this point forward, you will work through the notebook by running cells in order.

Connecting Catalog and Scanning Tables

You can access the Iceberg table using PyIceberg. The following code connects to the AWS Glue Iceberg REST endpoint and loads the nyc_yellow_table table on the pyiceberg_lambda_blog_database database:

import pyarrow as pa
from pyiceberg.catalog import load_catalog
import boto3

# Set AWS region
sts = boto3.client('sts')
region = sts._client_config.region_name

# Configure catalog connection properties
catalog_properties = {
    "type": "rest",
    "uri": f"https://glue.{region}.amazonaws.com/iceberg",
    "s3.region": region,
    "rest.sigv4-enabled": "true",
    "rest.signing-name": "glue",
    "rest.signing-region": region
}

# Specify database and table names
database_name = "pyiceberg_lambda_blog_database"
table_name = "nyc_yellow_table"

# Load catalog and get table
catalog = load_catalog(**catalog_properties)
table = catalog.load_table(f"{database_name}.{table_name}")

You can query full data from the Iceberg table as an Apache Arrow table and convert it to a Pandas DataFrame.

Working with Snapshots

One of the important features of Iceberg is snapshot-based version control. Snapshots are automatically created whenever data changes occur in the table. You can retrieve data from a specific snapshot, as shown in the following example.

# Get data from a specific snapshot ID
snapshot_id = snapshots.to_pandas()["snapshot_id"][3]
snapshot_pa_table = table.scan(snapshot_id=snapshot_id).to_arrow()
snapshot_df = snapshot_pa_table.to_pandas()

You can compare the current data with historical data from any point in time based on snapshots. In this case, you are comparing the differences in data distribution between the latest table and a snapshot table:

# Compare the distribution of total_amount in the specified snapshot and the latest data.
import matplotlib.pyplot as plt

plt.figure(figsize=(4, 3))
df['total_amount'].hist(bins=30, density=True, label="latest", alpha=0.5)
snapshot_df['total_amount'].hist(bins=30, density=True, label="snapshot", alpha=0.5)
plt.title('Distribution of total_amount')
plt.xlabel('total_amount')
plt.ylabel('relative Frequency')
plt.legend()
plt.show()

Tagging snapshots

You can tag specific snapshots with an arbitrary name and query specific snapshots with that name later. This is useful when managing snapshots of important events.

In this example, you query a snapshot specifying the tag checkpointTag. Here, you are using the polars to create a new DataFrame by adding a new column called trip_duration based on existing columns tpep_dropoff_datetime and tpep_pickup_datetime columns:

# retrive tagged snapshot table as polars data frame
import polars as pl

# Get snapshot id from tag name
df = table.inspect.refs().to_pandas()
filtered_df = df[df["name"] == tag_name]
tag_snapshot_id = filtered_df["snapshot_id"].iloc[0]

# Scan Table based on the snapshot id
tag_pa_table = table.scan(snapshot_id=tag_snapshot_id).to_arrow()
tag_df = pl.from_arrow(tag_pa_table)

# Process the data adding a new column "trip_duration" from check point snapshot.
def preprocess_data(df):
    df = df.select(["vendorid", "tpep_pickup_datetime", "tpep_dropoff_datetime", 
                    "passenger_count", "trip_distance", "fare_amount"])
    df = df.with_columns(
        ((pl.col("tpep_dropoff_datetime") - pl.col("tpep_pickup_datetime"))
         .dt.total_seconds() // 60).alias("trip_duration"))
    return df

processed_df = preprocess_data(tag_df)
display(processed_df)
print(processed_df["trip_duration"].describe())

Create a new table from the processed DataFrame with the trip_duration column. This step illustrates how to prepare data for potential future analysis. You can explicitly specify the snapshot of the data that the processed data is referring to by using a tag, even if the underlying table has been changed.

# write processed data to new iceberg table
account_id = sts.get_caller_identity()["Account"] 

new_table_name = "processed_" + table_name
location = f"s3://pyiceberg-lambda-blog-{account_id}-{region}/{database_name}/{new_table_name}"

pa_new_table = processed_df.to_arrow()
schema = pa_new_table.schema
identifier = f"{database_name}.{new_table_name}"

new_table = catalog.create_table(
                identifier=identifier,
                schema=schema,
                location=location
            )
            
# show new table's schema
print(new_table.schema())
# insert processed data to new table
new_table.append(pa_new_table)

Let’s query this new table made from processed data with Athena to demonstrate the Iceberg table’s interoperability.

Query the data from Athena

In the Athena query editor, you can query the table pyiceberg_lambda_blog_database.processed_nyc_yellow_table created from the notebook in the previous section:

SELECT * FROM "pyiceberg_lambda_blog_database"."processed_nyc_yellow_table" limit 10;

By completing these steps, you’ve built a serverless data processing solution using PyIceberg with Lambda and an AWS Glue Iceberg REST endpoint. You’ve worked with PyIceberg to manage and analyze data using Python, including snapshot management and table operations. In addition, you ran the query using another engine, Athena, which shows the compatibility of the Iceberg table.

Clean up

To clean up the resources used in this post, complete the following steps:

On the Amazon ECR console, navigate to the repository pyiceberg-lambda-repository and delete all images contained in the repository.
On the CloudShell, delete working directory pyiceberg_blog.
On the Amazon S3 console, navigate to the S3 bucket pyiceberg-lambda-blog-<ACCOUNT_ID>-<REGION>, which you created using the CloudFormation template, and empty the bucket.
After you confirm the repository and the bucket are empty, delete the CloudFormation stack pyiceberg-lambda-blog-stack.
Delete the Lambda function pyiceberg-lambda-function that you created using the Docker image.

Conclusion

In this post, we demonstrated how using PyIceberg with the AWS Glue Data Catalog enables efficient, lightweight data workflows while maintaining robust data management capabilities. We showcased how teams can use Iceberg’s powerful features with minimal setup and infrastructure dependencies. This approach allows organizations to start working with Iceberg tables quickly, without the complexity of setting up and managing distributed computing resources.

This is particularly valuable for organizations looking to adopt Iceberg’s capabilities with a low barrier to entry. The lightweight nature of PyIceberg allows teams to begin working with Iceberg tables immediately, using familiar tools and requiring minimal additional learning. As data needs grow, the same Iceberg tables can be seamlessly accessed by AWS analytics services like Athena and AWS Glue, providing a clear path for future scalability.

To learn more about PyIceberg and AWS analytics services, we encourage you to explore the PyIceberg documentation and What is Apache Iceberg?

About the authors

Sotaro Hikita is a Specialist Solutions Architect focused on analytics with AWS, working with big data technologies and open source software. Outside of work, he always seeks out good food and has recently become passionate about pizza.

Shuhei Fukami is a Specialist Solutions Architect focused on Analytics with AWS. He likes cooking in his spare time and has become obsessed with making pizza these days.

Monitoring network traffic in AWS Lambda functions

2025-05-05 Anton Aleksandrov

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/compute/monitoring-network-traffic-in-aws-lambda-functions/

Network monitoring provides essential visibility into cloud application traffic patterns across large organizations. It enables security and compliance teams to detect anomalies and maintain compliance, while allowing development teams to troubleshoot issues, optimize performance, and track costs in multi-tenant software as a service (SaaS) environments. Implementing robust network monitoring allows organizations to effectively manage their security, compliance, and operational requirements while continuously enhancing their applications.

In this post, you will learn methods for network monitoring in AWS Lambda functions and how to apply them to your scenarios.

Overview

Lambda is a secure and highly scalable serverless compute service where each function operates in an isolated execution environment with strict security boundaries. This architecture delivers key advantages, such as enhanced security, automatic compute capacity scaling, and minimal operational overhead. Minimizing infrastructure management allows Lambda to enable organizations to redirect their focus from managing servers to other critical aspects, such as performance optimization and network traffic analysis. In turn, these enable organizations to build more secure and efficient applications.

Lambda network monitoring addresses diverse organizational needs, such as compliance requirements for audit logs and anomaly detection, business needs for traffic metering and customer billing, and development needs for troubleshooting network issues. Traditional agent-based or host-based monitoring methods often aren’t compatible with the strongly isolated, ephemeral execution environment of Lambda, which necessitates alternative approaches to meet these critical requirements.

You can use AWS-native, integrated network monitoring solutions, such as Amazon Virtual Private Cloud (Amazon VPC) Flow Logs, or build your own custom solution, as detailed in this post. Each solution offers distinct capabilities with varying levels of granularity and real-time visibility. When choosing an approach, you must evaluate key factors such as the desired data granularity, operational complexity, latency tolerance, and cost implications.

Using VPC Flow Logs

VPC Flow Logs is the AWS-recommended tool for network activity monitoring. If your scenario necessitates monitoring of the network activity of Lambda functions, then you can attach these functions to a VPC and enable Flow Logs. This captures detailed network traffic data, such as source and destination IPs, ports, protocols, and traffic volume for all traffic flowing through the network interfaces used by your functions.

When you attach your functions to a VPC, the Lambda service automatically creates an Elastic Network Interface (ENI) for functions to communicate with VPC-based resources. By default, VPC-attached functions can only access private resources within the VPC. If you need your functions to communicate with other AWS services, then you should use VPC Endpoints. If your function needs to communicate with the public internet, then you should use an NAT Gateway for egress traffic. The following diagram shows how you can use VPC Flow Logs for Lambda functions.

Flow Logs provide detailed insights into the IP traffic flowing to and from the network interfaces within a VPC, offering valuable data for network audits and activity monitoring. This approach promotes a clear separation of concerns between application and networking layers, with VPC constructs typically managed by a dedicated operations or infrastructure team.

VPC Flow Logs provides a robust network monitoring solution. However, when using it with Lambda functions, you should evaluate the following considerations:

It captures ENI-level information. ENIs can be reused across multiple functions, thus it won’t provide per-function or per-invoke granularity.
It only logs IP addresses, not DNS names (if capturing DNS names is a requirement for you).
It introduces infrastructure management into your serverless applications. You must learn VPC constructs or involve your infrastructure team.
Potential added data transfer costs. Go to the pricing for NAT Gateway, VPC Endpoints, and Flow Logs for more details.

The following sections explore Lambda network monitoring methods that can either be augmented with VPC Flow Logs for better granularity or used without attaching your functions to a VPC.

Proxying network traffic

You can configure the Lambda runtime to route egress network traffic through a side-car proxy that runs as a Lambda layer within the Lambda execution environment and logs network activity. The proxy layer should be agnostic to the language runtime. AWS recommends that you use compiled languages such as Rust or Golang for maximum reusability and minimal added latency. The following diagram shows emitting logs from a network proxy layer.

Applying proxy configuration differs across language runtimes. In Python you set proxy_http and proxy_https environment variables. Java uses JVM flags. Node.js doesn’t provide a native way to configure proxy using command line flags or environment variables. Therefore, you need to make code changes, such as configuring a proxy for your AWS SDK or using a third-party open source libraries such as global-agent or Interceptors.

The proxy approach is most suitable if you’re okay with making function code or configuration changes that might vary across runtimes. Furthermore, adding a network proxy server process inside the execution environment consumes resources shared with the function code, which can add network latency.

Refer to the post Enhancing runtime security and governance with the AWS Lambda Runtime API proxy extension for ways to intercept inbound requests/responses between the Lambda Runtime API and Runtime process.

Runtime-agnostic techniques

Following techniques use the fact that the Lambda execution environment is a Linux-based micro-VM. Lambda runtimes operate within a restricted user space that prevents the use of traditional OS-level monitoring tools needing elevated privileges, such as tcpdump, iptables, ptrace, or eBPF. The following techniques are specifically designed to work under these user space constraints, allowing their use without needing elevated privileges.

Reading OS networking layer information from procfs

Use this method when you need to obtain the OS-level information, such as metering transferred bytes, or see all open connections. You can use it to implement tenant chargebacks or detect network traffic pattern changes. The method is based on the proc pseudo-filesystem (also known as procfs) available in Linux OS, which provides an interface to kernel data and allows you to read OS-level information. For example, /proc/cpuinfo and /proc/meminfo pseudo-files provide information about current CPU and memory usage, while /proc/net/* provides you with network layer information. Reading /proc/net/tcp and /proc/net/udp gives you a list of active TCP/UDP connections, such as remote IP addresses and ports. Reading /proc/net/dev provides the list of network devices with detailed usage statistics, such as bytes transferred and received.

“The procfs method provides a simple but powerful way to collect critical network telemetry data from Lambda functions, such as up-to-date network statistics and file descriptor counts, which enables us to monitor outbound connections from Lambda functions. Better yet, it enables us to support multiple Lambda runtimes with a single implementation in our Rust-based, next-generation Lambda Extension”—AJ Stuyvenberg, Staff Engineer at Datadog.

The sample project provides the LambdaNetworkMonitor-Procfs stack to show this technique. For every invocation, the function reads /proc/net/dev, and sends network statistics to log and Amazon CloudWatch Metrics, as shown in the following figure.

Reading the /proc/net/dev pseudo-file is a simple and effective way to monitor Lambda functions networking without adding latency. However, it doesn’t capture DNS names and the IP addresses to which they resolve.

Intercepting network-related libc calls

Low-level network operations in Linux, such as DNS lookup and connection creation, are managed by the C Standard Library (libc). You can intercept libc function calls made by Lambda runtimes to monitor network traffic at the OS level. This is a significantly more advanced and complex mechanism, enabling visibility into OS-level activities, as shown in the following figure.

Intercepting libc function calls, such as getaddrinfo (DNS lookup) and connect, allows you to log details such as DNS name, IP addresses, ports, and protocols at a granular level, as shown in the following figure. This method allows you to capture comprehensive information about DNS queries and initiated network connections. It can provide precise per-function and per-invoke network monitoring, such as hostnames and IP addresses.

The following is a simplified flow:

A function sends a request to example.com.
The runtime calls libc getaddrinfo to resolve the DNS name.
You intercept this call, log the DNS name, and forward the call to the original libc getaddrinfo.
The original libc getaddrinfo returns resolved IP addresses. You log them and return to the runtime.
The runtime calls libc connect method to create a new connection.
You intercept this call, log the IP address, forward the call to the original libc connect, and so on.

To implement this technique, you need to use a language that compiles to a shared object (.so) file. To implement libc method signatures you should use a language such as C, C++, or Rust. The following sample code uses Rust for its strong safety guarantees and implements overriding the getaddrinfo libc function (DNS lookup).

pub extern "C" fn getaddrinfo(
    node: *const c_char,
    service: *const c_char,
    hints: *const addrinfo,
    res: *mut *mut addrinfo,
) -> c_int {
    let printable_node = format!("{}", PrintableCString::from(node));
    let printable_service = format!("{}", PrintableCString::from(service));

    log::debug!("> getaddrinfo node={printable_node} service={printable_service}");

    LIBC_GETADDRINFO(node, service, hints, res)
}

The following should be considered:

The method signature fully matches the libc function signature of the same name.
The node and service arguments would commonly be DNS name and port.
At the end, the function invokes the real libc getaddrinfo and returns the result.

When compiled to an .so file, you must package it as a Lambda layer, attach the layer to your functions, and configure the execution environment to use it through the Linux dynamic linker capability called preloading. Set the LD_PRELOAD environment variable to point to your .so file to instruct the OS to preload it before it loads any other library, such as libc. You can configure this either as a function environment variable or through a wrapper script, as shown in the following figure.

#!/bin/sh
echo "running wrapper..."
args=("$@")
export LD_PRELOAD=/opt/liblambda_network_monitor.so
exec "${args[@]}"

This technique allows you to get detailed connection-level monitoring such as DNS lookups, resolved IP addresses, ports, protocols, and count bytes transferred. Depending on your requirements, it can be adapted to track further network-related information as needed.

The sample project provides the LambdaNetworkMonitor-LdPreload stack to show this technique, as shown in the following figure. For every invocation, the function prints intercepted libc functions, DNS names, and connection IP addresses.

Considerations

OS-level techniques necessitate strong understanding of Linux fundamentals and careful implementation. AWS recommends that you closely evaluate which methods to use and keep your solution minimally invasive.
LD_PRELOAD is an advanced low-level technique that allows you to override libc methods and OS behavior. Incorrectly implemented hooks could lead to undefined behavior and crashes. Make sure your code is robust to recursion and thread-safe. Test it thoroughly in a controlled environment before using it in production.
The LD_PRELOAD technique relies on dynamic linking. It works with dynamically linked runtimes such as Node.js, Python, and Java. It doesn’t work with runtimes that use static linking, such as Golang.
When using runtime-dependent functionality, consider using Lambda runtime update controls to make sure that runtimes are only updated when the next function update occurs.
Always install layers from trusted sources only. Use infrastructure as code (IaC) tools to attach and audit layer configurations, such as AWS Identity and Access Management (IAM) permissions.

Conclusion

Monitoring network traffic in Lambda functions is a common requirement for many organizations. In case you need to audit IP-level network logs, AWS recommends that you to attach your functions to a VPC and use Flow Logs. If you need per-function or per-invoke granularity, then you can augment it with techniques described in this post.

These techniques provide valuable insights for debugging, auditing, and monitoring, but they also necessitate a solid understanding of Linux fundamentals and careful implementation. They offer a practical solution for organizations that need Lambda network monitoring, allowing them to troubleshoot issues and maintain compliance.

To learn more about Serverless architectures and asynchronous Lambda invocation patterns, go to Serverless Land.

AWS Weekly Roundup: Amazon Nova Premier, Amazon Q Developer, Amazon Q CLI, Amazon CloudFront, AWS Outposts, and more (May 5, 2025)

2025-05-05 Donnie Prakoso

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-amazon-nova-premier-amazon-q-developer-amazon-q-cli-amazon-cloudfront-aws-outposts-and-more-may-5-2025/

Last week I went to Thailand to attend the AWS Summit Bangkok. It was an energizing and exciting event. We hosted the Developer Lounge, where developers can meet, discuss ideas, enjoy lightning talks, win SWAGs at AWS Builder ID Prize Wheel, take a challenge at Amazon Q Developer Coding Challenge, or learn Generative AI at Learn Amazon Bedrock booth.

Here’s a quick look:

Thank you to AWS Heroes, AWS Community Builders, AWS User Group leaders and developers for your collaboration.

Coming up next in ASEAN is AWS Summit Singapore—make sure you don’t miss it by registering now.

Last Week’s Launches
Here are some launches last week that caught my attention:

Amazon Nova Premier Now Generally Available — Amazon Nova Premier, our most capable model for complex tasks and teacher for model distillation, is now generally available in Amazon Bedrock. It excels at complex tasks requiring deep context understanding and multistep planning, while processing text, images, and videos with a 1M token context length. With Nova Premier and Amazon Bedrock Model Distillation, you can create highly capable, cost-effective, and low-latency versions of Nova Pro, Lite, and Micro, for your specific needs.

Amazon Q Developer elevates the IDE experience with new agentic coding experience — This new interactive, agentic coding experience for Visual Studio Code allows Q Developer to intelligently take actions on behalf of the developer. Amazon Q Developer introduces an interactive coding experience in Visual Studio Code, offering real-time collaboration for coding, documentation, and testing. It provides transparent reasoning, and supports automated or step-by-step changes in multiple languages.

New Foundation Models in Amazon Bedrock — Amazon Bedrock expands its model offerings with two significant additions:
- Writer’s Palmyra X5 and X4 models feature extensive context windows (1M and 128K tokens respectively) and excel in complex reasoning for enterprise applications. They support multistep tool-calling and adaptive thinking with high reliability standards.
- Meta’s Llama 4 Scout 17B and Maverick 17B models offer natively multimodal capabilities using mixture-of-experts architecture for enhanced reasoning and image understanding. They support multiple languages and extended context processing, with simplified integration through the Bedrock Converse API.
Second-Generation AWS Outposts Racks Released — AWS announces the general availability of second-generation Outposts racks with significant enhancements including the latest x86 EC2 instances, simplified networking, and accelerated networking options. These improvements deliver doubled vCPU, memory, and network bandwidth, 40% better performance, and support for ultra-low latency workloads, making them ideal for demanding on-premises deployments.

Amazon CloudFront SaaS Manager Launches — Amazon CloudFront SaaS Manager helps SaaS providers and web hosting platforms efficiently manage content delivery across multiple customer domains. The service dramatically reduces operational complexity while providing high-performance content delivery and enterprise-grade security for every customer domain.

Extend the Amazon Q Developer CLI with Model Context Protocol (MCP) for Richer Context | Amazon Web Services — Amazon Q Developer CLI now supports Model Context Protocol (MCP), enabling integration with external data sources for context-aware responses. This give developers the ability to connect pre-built integrations or MCP Servers supporting stdio, enhancing code accuracy, data understanding, and query execution. The feature streamlines development tasks and will be extended to Amazon Q Developer IDE plugins soon.

Amazon Aurora Now Supports PostgreSQL 17 — Amazon Aurora now supports PostgreSQL 17.4, offering community improvements and Aurora-specific enhancements like optimized memory management and faster failovers. The release includes new features for Babelfish, security fixes, and updated extensions, available in all AWS Regions.
CloudWatch Introduces Tiered Pricing for Lambda Logs — Amazon CloudWatch launches tiered pricing for AWS Lambda logs and new delivery destinations. Pricing in US East starts at $0.50/GB for CloudWatch and $0.25/GB for S3 and Firehose, both tiering down to $0.05/GB. This update enhances flexibility in log management across all supporting Regions.
RDS for MySQL Updates Minor Versions — Amazon RDS for MySQL now supports minor versions 8.0.42 and 8.4.5, delivering security fixes, bug fixes, and performance improvements. Users can upgrade automatically during maintenance windows or use Blue/Green deployments for safer updates.
Amazon Bedrock Model Distillation Generally Available — Amazon Bedrock Model Distillation is now generally available, supporting new models like Amazon Nova and Claude 3.5. It enables smaller models to accurately predict function calling for Agents, delivering up to 500% faster responses and 75% lower costs with minimal accuracy loss for RAG use cases. The service includes automated workflows for data synthesis and student model training.
AI Search Flow Builder for Amazon OpenSearch Service — Amazon OpenSearch Service now offers an AI search flow builder for OpenSearch 2.19+ domains. This low-code designer enables creation of sophisticated AI-enhanced search flows using AWS and third-party services, supporting use cases like RAG, query rewriting, and semantic encoding.

From Community.AWS
Here’s my personal favorites posts from community.aws:

How to Generate AWS Architecture Diagrams Using Amazon Q CLI and MCP — Omshree Butani demonstrates how to quickly generate AWS Architecture Diagrams using Amazon Q CLI and Model Context Protocol (MCP), streamlining the architecture design process.
Implementing Nova Act MCP Server on ECS Fargate — Vivek V details implementing an Amazon Nova Act Model Context Protocol (MCP) server on ECS Fargate for browser automation. The solution includes architecture designs, deployment strategies, server/client implementation, Streamlit UI, AWS CDK infrastructure, and VS Code integration.
Leveraging Crossplane to build single-tenant SaaS control planes on top of Kubernetes — Yehuda Cohen explores leveraging Crossplane to build single-tenant SaaS control planes on Kubernetes. The article highlights how Crossplane extends Kubernetes’ declarative model to manage non-Kubernetes resources, enabling automated tenant provisioning and scalable cloud resource management.
How to Securely Display Objects from an S3 Bucket in a Browser — Osabutey-Anikon Theeophilus Lloyd shares techniques for securely displaying objects from Amazon S3 buckets in web browsers, focusing on proper security measures for browser-based access.

Upcoming AWS events
Check your calendars and sign up for these upcoming AWS events:

AWS Summit — Join free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Register in your nearest city: Poland (6 May), Bengaluru (May 7 – 8), Hong Kong (May 8), Seoul (May 14-15), Singapore (May 29), and Sydney (June 4–5).
AWS re:Inforce – Mark your calendars for AWS re:Inforce (June 16–18) in Philadelphia, PA. AWS re:Inforce is a learning conference focused on AWS security solutions, cloud security, compliance, and identity. You can subscribe for event updates now!
AWS Partners Events – You’ll find a variety of AWS Partner events that will inspire and educate you, whether you are just getting started on your cloud journey or you are looking to solve new business challenges.
AWS Community Days – Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world: Yerevan, Armenia (May 24), Zurich, Switzerland (May 25), and Bengaluru, India (May 25).

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

How is the News Blog doing? Take this 1 minute survey!

(This survey is hosted by an external company. AWS handles your information as described in the AWS Privacy Notice. AWS will own the data gathered via this survey and will not share the information collected with survey respondents.)

AWS Lambda introduces tiered pricing for Amazon CloudWatch logs and additional logging destinations

2025-05-01 Shridhar Pandey

Post Syndicated from Shridhar Pandey original https://aws.amazon.com/blogs/compute/aws-lambda-introduces-tiered-pricing-for-amazon-cloudwatch-logs-and-additional-logging-destinations/

Effective logging is an important part of an observability strategy when building serverless applications using AWS Lambda.

Lambda automatically captures and sends logs to Amazon CloudWatch Logs. This allows you to focus on building application logic rather than setting up logging infrastructure and allows operators to troubleshoot failures and performance issues more easily.

On May 1st, 2025, AWS announced changes to Lambda logging, which can reduce Lambda CloudWatch logging costs and make it easier and more cost-effective to use a wider range of monitoring tools. Lambda logs are now available at volume-based tiered pricing when using CloudWatch Logs Standard and Infrequent Access log classes. When generating Lambda logs at scale, you can expect an immediate cost reduction under this new pricing model. Lambda also now supports Amazon S3 and Amazon Data Firehose as additional destinations for Lambda logs, in addition to CloudWatch Logs. Lambda logs sent to S3 and Firehose are also available at volume-based tiered pricing.

This blog post covers some recent Lambda logging enhancements and describes how this change delivers a simpler, more cost-effective logging experience for Lambda.

Overview

Logging provides developers and operators with valuable data for debugging and troubleshooting application behavior, performance issues, and potential failures. It becomes even more important for serverless applications built using Lambda because of the ephemeral and stateless nature of the Lambda execution environment. Lambda’s built-in integration with CloudWatch Logs ensures that logs for every function invocation are readily available for analysis. The captured log data includes application logs generated by your Lambda function code and system logs generated by the Lambda service while running your function code. CloudWatch Logs allows you to search, filter, and analyze log data to troubleshoot issues, track metrics, and set up alerts.

Logging requirements evolve as serverless applications grow in complexity and scale, sometimes spanning hundreds or thousands of Lambda functions which generate substantial log volumes. Organizations need sophisticated logging solutions that can handle this scale while remaining cost-effective. Some scenarios—such as monitoring critical business transactions—demand real-time log analysis, while others focus on after-the-fact forensic analysis. Debug logs from development and staging environments often need high granularity, whereas you may want lower verbosity in production logs to improve the signal-to-noise ratio.

Recent Lambda logging enhancements

In recent years, Lambda and CloudWatch Logs have expanded Lambda’s logging capabilities to meet the evolving needs of serverless applications. These capabilities provide deeper insights, greater control, and more cost-effective solutions to capture, process, and consume logs to enhancing the serverless observability experience. Lambda advanced logging controls gives developers control over log generation and content. These controls allow you to capture Lambda logs in JSON structured format. You don’t have to use logging libraries and customize log levels (INFO, DEBUG, WARN, ERROR) separately for application and system logs. This helps reduce logging costs by ensuring only necessary logs are generated while maintaining appropriate visibility across different environments. For example, you can set verbose DEBUG level logging in development environments while limiting production logging to ERROR level to improve the signal-to-noise ratio and control costs.

The Infrequent Access log class for CloudWatch Logs introduced a cost-effective solution for logs that need retention but are accessed less frequently. Infrequent Access is 50% lower per GB ingestion price than the Standard log class This tailored set of capabilities allows you to reduce your logging costs while maintaining access to historical data for compliance, audit purposes, or forensic analysis.

CloudWatch Logs Live Tail is an interactive, real-time log streaming and analytics capability. Live Tail streamlines debugging and monitoring workflows; it allows you to observe log output as functions execute without navigating away from the Lambda console. This makes it easier to identify and diagnose issues during development and troubleshooting. Logs Live Tail is also available in Visual Code IDE.

Tiered pricing for Lambda logs in CloudWatch Logs

Starting today, Lambda logs sent to CloudWatch Logs are classed as Vended Logs, which are logs from specific AWS services that are available at volume tiered pricing. This replaces the previous flat rate model when using CloudWatch Logs Standard log class. For example, in the US East (N. Virginia) AWS Region, you were charged at $0.50 per GB when using Standard log class for your Lambda logs. Under the new pricing model, you are charged for sending your Lambda logs to CloudWatch Logs starting at $0.50 per GB for initial usage. As log volume increases, the price per GB automatically decreases through multiple tiers, reaching rates as low as $0.05 per GB in the lowest tier. This pricing change applies automatically to all Lambda logs sent to CloudWatch Logs, requiring no code or configuration changes from you.

Data Ingested	CloudWatch Logs Standard	CloudWatch Logs Infrequent Access
First 10 TB per month	$0.50 per GB	$0.25 per GB
Next 20 TB per month	$0.25 per GB	$0.15 per GB
Next 20 TB per month	$0.10 per GB	$0.075 per GB
Over 50 TB per month	$0.05 per GB	$0.05 per GB

Table 1: Tiered pricing for Lambda logs in CloudWatch Logs in US East (N. Virginia) Region

When generating Lambda logs at scale, you will see an immediate cost reduction under this new pricing model. For example, if you generate 60 TB of Lambda logs monthly in CloudWatch Logs, costs would decrease by 58% (from $30,000 to $12,500). The pricing tiers scale with your logging volume, ensuring that cost benefits increase as your application grows. This allows you to maintain comprehensive logging practices that previously may have been cost-prohibitive. Vended logs tiered pricing is applied on all vended logs ingested to CloudWatch and not tiered per service.

When ingesting other vended logs, such as Amazon Virtual Private Cloud flow logs and Amazon Route 53 resolver query logs, you will see larger discounts as the tiering is applied at a consolidated log ingestion volume.

New Lambda logging destinations: Amazon S3 and Amazon Data Firehose

Starting today, Lambda also supports Amazon S3 and Amazon Data Firehose as destinations for Lambda logs, in addition to CloudWatch Logs. When using S3 or Firehose as a destination, logging costs start at $0.25 per GB. The tiered pricing also applies, with rates reducing to as low as $0.05 per GB in the lowest tier. This tiering is also applied at a consolidated log ingestion volume.

Data Ingested	Delivery Cost to Amazon S3	Delivery Cost to Amazon Data Firehose
First 10TB per month	$0.25 per GB	$0.25 per GB
Next 20TB per month	$0.15 per GB	$0.15 per GB
Next 20TB per month	$0.075 per GB	$0.075 per GB
Over 50TB per month	$0.05 per GB	$0.05 per GB

Table 2:Tiered pricing for Lambda logs delivery to Amazon S3 and Amazon Data Firehose in US East (N. Virginia) Region

Direct delivery of Lambda logs to S3 provides enhanced flexibility in log management. Support for Firehose streamlines Lambda log delivery to additional destinations such as Amazon OpenSearch Service, HTTP endpoints, and third-party observability providers. This matches the established log delivery pattern used with other AWS compute services such as Amazon Elastic Container Service (Amazon ECS) and Amazon Elastic Compute Cloud (Amazon EC2).

This new capability provides significant cost benefits and streamlines log delivery to additional logging destinations, making it easier to use a wider range of monitoring tools (including CloudWatch) when building serverless applications using Lambda.

New Lambda logging destinations in action

All new and existing Lambda functions have CloudWatch Logs as the default logging destination, with S3 and Firehose as alternative choices. When you select S3 or Firehose as your logging destination, Lambda sends logs to the selected destination via a new CloudWatch Logs Delivery log class. This log class enables efficient routing but doesn’t support CloudWatch Logs Standard log class features, such as Logs Insights and Live Tail.

To set up S3 or Firehose as the destination for your Lambda logs in the Lambda console:

Navigate to the Lambda console, and select or create a function to set up an S3 or Firehose logging destination.
In the Configuration tab, select Monitoring and operations tools on the left pane.
Select Edit in the Logging configuration. This opens the Edit logging configuration page.

Figure 1. Edit logging configuration in Lambda console
In the Log destination section, select Amazon S3 or Amazon Data Firehose. Amazon CloudWatch Logs is the default selection.

Figure 2. Select log destination in the Edit logging configuration page
Under CloudWatch delivery log group, choose Create new log group or Existing log group.
To create a new delivery log group to send logs to S3, enter a log group name and specify the destination S3 bucket. Provide an AWS Identity and Access Management (IAM) role for CloudWatch Logs to deliver logs to S3.
Follow similar steps to send logs to a Firehose stream.

Figure 3. Create new CloudWatch delivery log group for S3
To use an existing delivery log group, select one from the Delivery log group. The selected delivery log group must have a configured destination (S3 or Firehose) and match the destination you selected.

Figure 4. Select existing CloudWatch delivery log group for Firehose

Advanced logging controls are also available for S3 and Firehose destinations. These controls include JSON structured format selection and log level filters for both application and system logs. This gives you enhanced log management controls for easier search, filter, and analysis. You can also use AWS Command Line Interface (AWS CLI) and infrastructure as code (IaC) tools such as AWS CloudFormation and AWS Cloud Development Kit (AWS CDK) to set up Lambda logs delivery to S3 and Firehose.

Best practices

To get the most out of the changes announced today, ensure that your logging strategy is closely aligned with the requirements of your workload. For example, consider sending critical production logs to CloudWatch Logs to take advantage of its advanced real-time analytics and alerting features. You now automatically benefit from volume-based discounts through tiered pricing in CloudWatch Logs for high-volume logging scenarios. For logs that need long-term retention for historical analysis, you can use S3’s storage classes to further reduce costs. When using your existing or third-party monitoring tools, direct integration through Firehose eliminates the need for custom forwarding solutions and associated costs.

Logging cost optimization extends beyond destination selection. Monitor log volumes regularly to understand the impact of pricing tiers. Implement appropriate retention policies to prevent unnecessary storage of old logs and log sampling for high-volume debug logs. Consider using different logging strategies across development, staging, and production environments to balance observability needs with cost efficiency.

Conclusion

Tiered pricing for Lambda logs in CloudWatch Logs and support for S3 and Firehose as additional logging destinations improves Lambda application observability. You can now manage logging costs at scale and expand Lambda monitoring solutions through cost-effective, easy-to-configure integrations. Whether you’re building new serverless applications or optimizing existing ones, these enhancements help you implement comprehensive logging strategies that scale cost-effectively with your workload.

The new features announced today are available in all commercial AWS Regions where Lambda and CloudWatch Logs are available. Support for configuring log delivery to S3 and Firehose in the Lambda console is available in US East (Ohio), US East (N. Virginia), US West (Oregon), and Europe (Ireland) Regions, with additional Regions coming soon. Review the Lambda documentation and CloudWatch Logs documentation to learn more about these features and how to use them. Review the CloudWatch pricing page to learn more about how these features are priced.

For more serverless learning resources, visit Serverless Land.

Integrating aggregators and Quick Service Restaurants with AWS serverless architectures

2025-04-30 Mike Gomez

Post Syndicated from Mike Gomez original https://aws.amazon.com/blogs/compute/integrating-aggregators-and-quick-service-restaurants-with-aws-serverless-architectures/

In this post, you learn how to use AWS serverless technologies, such as Amazon EventBridge and AWS Lambda, to build an integration between Quick Service Restaurants (QSRs) and online ordering and food delivery aggregators. These aggregators have taken off as an option to QSRs to expand their consumer base, enabling them with delivery options to help grow their businesses.

QSR overview

QSRs prioritize speedy and convenient service, offering a streamlined menu. To meet evolving consumer expectations, QSRs can use API integrations with third-party aggregators. This technological synergy enables QSRs to expand their capabilities, introducing diverse payment methods and incorporating delivery services. These features have become standard in this restaurant segment.

Behind the scenes, the APIs are used to orchestrate the interaction between the aggregator and the QSR while having a consistent ordering and delivery experience.

QSR business objectives are:

Providing consistent ordering and delivery experiences
Offering personalized menu items
Retaining repeat customers
Reducing third-party delivery cancellation due to lack of delivery personalization options

This post starts with a simple architecture and adds components to solve architectural challenges.

Architecture

As a solutions architect, you’ve been approached by a thriving local restaurant business seeking technological solutions to fuel their expansion. Your task is to design an optimal integration architecture that aligns with their technical requirements, streamlines operations, and enhances customer experience.

At the core of this integration is Amazon API Gateway, which accepts the incoming orders from various delivery aggregators. The API Gateway becomes the front door, connecting the QSRs with the end customers for a streamlined and dynamic order processing system.

Driving the backend of this integration are Lambda functions. These functions validate orders and securely communicate with delivery aggregators. Lambda functions can scale dynamically based on-demand, and make sure of optimal resource usage and cost-effectiveness.

Order placement workflow

The following steps outline the serverless integration between API Gateway and Lambda functions, as shown in the following figure:

Customers can place orders either through food delivery aggregators or the business’s own ordering system.
The order request is sent to API Gateway.

This architecture works for small and simple integrations. To scale this architecture for high traffic, use asynchronous integration to reduce the coupling between API and Lambda function.

Order routing workflow

The following steps outline a serverless integration where API Gateway connects to Lambda functions through Amazon EventBridge as the event routing service, as shown in the following figure:

API Gateway receives the order request.
The API Gateway routes the customer’s order request to an EventBridge bus for processing.

EventBridge routes events (for example order status changes) to Lambda functions, making sure of resiliency during service disruptions. This eliminates manual error handling and keeps QSRs and aggregators synchronized.

EventBridge delivers the following essential capabilities:

EventBridge receives events triggered by various actions, such as new orders or menu updates.
It routes events to the relevant Lambda functions, initiating the appropriate actions.
EventBridge supports event replay, allowing recovery from Lambda deployment issues or function failures. This feature enables business continuity by storing events during service disruptions and automatically resuming processing when the system stabilizes.

To maintain order history and enable fast data retrieval, the system needs a highly performant database. Amazon DynamoDB, a serverless NoSQL database service, meets these requirements by efficiently storing and managing order information and metadata. The order processing Lambda function interacts with DynamoDB to persist order details. This approach enables asynchronous processing of the stored data by other backend processes. The database solution provides the scalability and responsiveness needed to handle growing order volumes while maintaining consistent performance, separating order intake from subsequent processing steps.

Order processing workflow

The following steps outline the order processing workflow, as shown in the following figure:

The order processing Lambda function validates the order and updates the DynamoDB database with the new order details.
The function publishes error events to EventBridge, enabling downstream processing for error handling and retry logic. These events can trigger more Lambda functions designed to manage specific error scenarios and recovery processes.

EventBridge implementation patterns: single or dual bus approaches

EventBridge offers multiple approaches for event bus topology. Architects can choose to either use a single event bus with distinct event patterns based on order status or implement a multi-bus strategy.

The single-bus approach uses one event bus for all events with routing rule patterns based on order status. For example, rules would match specific statuses (for example “new” or “processed”) to trigger appropriate Lambda functions. Although it is architecturally simple, it needs careful management of the event schema to avoid potential errors. However, a single-bus approach requires careful handling to prevent recursive processing, where messages trigger additional messages in an endless loop.

Alternatively, the multi-bus method, separating order placement and processing across different buses, effectively prevents loops and recursion issues. This approach provides better separation of transactions, albeit with a slightly more complex setup.

EventBridge can directly target external services using the API destination option, eliminating the need for Lambda functions for third party integrations.

Orchestrating order processing

In complex order processing systems for QSRs, managing multiple interdependent Lambda functions can become challenging, potentially leading to intricate code and difficult-to-maintain architectures. To address this, AWS Step Functions can be introduced as an orchestration layer.

Step Functions acts as a central coordinator for the business logic needed in QSR order flows. This service manages the progression of activities in the order processing workflow, thereby efficiently coordinating tasks such as kitchen preparation and delivery logistics. Defining and managing complex workflows allows Step Functions to optimize the overall efficiency of QSR operations, providing a structured and adaptable solution. This orchestration enhances the restaurant’s ability to handle dynamic processing, achieving a smooth and responsive integration with delivery services while streamlining the underlying architecture.

The following steps outline the orchestration of order processing, as shown in the following figure:

Order processing trigger respective Lambda function, which updates the order data in the DynamoDB database.
The updated order is made available for subsequent Lambda functions that process more business logic being performed by further Lambda functions.

In a multi-bus EventBridge architecture, the process flows are as follows:

The first EventBridge bus receives the initial order event and routes it to a Step Functions workflow.
The Step Functions workflow orchestrates the order processing, coordinating various tasks and checks.
Upon completion, the Step Functions workflow emits an event with the processing results to the second EventBridge bus.
Based on the output from the Step Function workflow, this second bus contains a rule that triggers the Aggregator API as an API destination.

User engagement workflow

When a customer places an order, there must be a way to confirm or notify them when the order is ready. For this purpose, you can use AWS End User Messaging services to push notifications for order completion and new offers to customers.

Analyzing customer data and individual preferences allows Amazon Personalize to be used to present personalized recommendations and promotions.

Amazon Personalize can analyze historical order data to enhance the user experience through personalized recommendations, such as optimal delivery times, preferred menu items, and tailored promotions based on individual ordering patterns.

Conclusion

This post showed how to use AWS serverless services to build a platform for your order processing without worrying about managing underlying infrastructure. The serverless services included were Amazon API Gateway, AWS Lambda, Amazon EventBridge, AWS Step Functions, AWS End User Messaging, and Amazon Personalize.

This post is a brief introduction to event-driven architectures focused on integrations of internal ordering systems with delivery aggregators and third-party ordering platforms. This can help expand the user base, and it has been a key factor in the growth of many QSRs. Making the ordering, take-out, and delivery experience more efficient translates to revenue growth, reduction of order abandonment, as well as increased recurrent customer retention and brand loyalty.

For more serverless learning resources, visit Serverless Land. To find more patterns, go directly to the Serverless Patterns Collection.

How to use AWS Transfer Family and GuardDuty for malware protection

2025-04-30 James Abbott

Post Syndicated from James Abbott original https://aws.amazon.com/blogs/security/how-to-use-aws-transfer-family-and-guardduty-for-malware-protection/

Organizations often need to securely share files with external parties over the internet. Allowing public access to a file transfer server exposes the organization to potential threats, such as malware-infected files uploaded by threat actors or inadvertently by genuine users. To mitigate this risk, companies can take steps to help make sure that files received through public channels are scanned for malware before processing.

This post demonstrates how to use AWS Transfer Family and Amazon GuardDuty to scan files uploaded through a secure FTP (SFTP) server for malware as part of an overall transfer workflow. For readers who might have read an earlier blog post on this topic, the key difference is that this solution is fully managed and doesn’t require the deployment of compute resources. GuardDuty automatically updates malware signatures every 15 minutes instead of using a container image for scanning, avoiding the need for manual patching to keep the signatures up to date.

Prerequisites

To deploy the solution in this post, you will need:

An AWS account: You need access to AWS to deploy this solution. If you don’t have an account that you can use, see Start building on AWS today.
AWS CLI: Install and configure the AWS Command Line Interface (AWS CLI) to be authenticated to your AWS account. Set up the environment variables for your AWS account using the access token and secret access key for your environment.
Git: You will use Git to pull down the example code from GitHub.
Terraform: You’ll use Terraform to run the automation. Follow the Terraform installation instructions to download and set up Terraform.

Solution overview

This solution uses Transfer Family and GuardDuty. Transfer Family provides a secure file transfer service that you can use to set up an SFTP server, and GuardDuty is an intelligent threat detection service. GuardDuty monitors for malicious activity and anomalous behavior to protect AWS accounts, workloads, and data. At a high level, the solution uses the following steps:

A user uploads a file through a Transfer Family SFTP server.
A Transfer Family managed workflow invokes AWS Lambda to execute an AWS Step Functions workflow.
- The workflow begins only after a successful file upload.
- Partial uploads to the SFTP server will invoke an error handling Lambda function to report a partial upload error.
A step function state machine invokes a Lambda function to move uploaded files to an Amazon Simple Storage Service (Amazon S3) bucket for processing and then starts scanning using GuardDuty.
The GuardDuty scan result is sent as a callback to the step function.
Infected files are moved or cleaned.
The workflow sends the user the results through an Amazon Simple Notification Service (Amazon SNS) topic. This can be a notification of an error or malicious upload during the scan or notification of a successful upload and a clean scan for further processing.

Solution architecture and walkthrough

The solution uses GuardDuty Malware Protection for S3 to scan newly uploaded objects to the S3 bucket. You can use this feature of GuardDuty to set up a malware protection plan for an S3 bucket at the bucket level or to watch for specific object prefixes.

Figure 1: Solution architecture

The following steps (shown in Figure 1) describe the workflow for this solution starting from the point the file is uploaded until it’s scanned and marked as safe or as infected, leading to subsequent steps that can be customized based on your use case.

A file is uploaded using the SFTP protocol through Transfer Family.
If the file is successfully uploaded, Transfer Family uploads the file to the S3 bucket called Unscanned and the Managed Workflow Complete workflow is triggered. This is the workflow used to handle successful uploads and invokes the Step Function Invoker Lambda function.
The Step Function Invoker starts the state machine and kicks off the first step in the process by invoking the GuardDuty – Scan Lambda function.
The GuardDuty – Scan function moves the file to the Processing bucket. This is the bucket from which the files will be scanned.
When an object upload activity is detected, GuardDuty automatically scans the object. In this implementation, a malware protection plan is created for the Processing bucket.
When a scan completes, GuardDuty publishes the scan result to Amazon EventBridge.
An EventBridge rule has been created to invoke a Lambda Callback function whenever a scan event has completed. EventBridge will invoke the function with an event that contains the scan results. See Monitoring S3 object scans with Amazon EventBridge for an example.
The Lambda Callback function notifies the GuardDuty – Scan task using the callback task integration pattern. The results of the GuardDuty scan are returned to the GuardDuty – Scan function and these results are passed to the Move File task.
If the result is a clean scan with no threats detected, the Move File task will place the file in the Clean S3 bucket, indicating that the file is successfully scanned and safe for further processing.
At this point, the Move File function publishes a notification to the Success SNS topic to notify the subscribers.
If the result indicates that the file is malicious, the Move File function will instead move the file to the Quarantine S3 bucket for further investigation. The function will also delete the file from the Processing bucket and publish a notification in the Error topic in SNS to notify the user of a potential malicious file being uploaded.
If the file upload is unsuccessful and the file isn’t fully uploaded, then Transfer Family will trigger the Managed Workflow Partial workflow.
Managed Workflow Partial is an error handling workflow and invokes the Error Publisher function, which is used for reporting errors that occur anywhere in the workflow.
The Error Publisher function identifies the type of error—whether it’s because of the partial upload or an issue elsewhere in the workflow—and sets the error status accordingly. It will then publish an error message to Error Topic in SNS.
The GuardDuty – Scan task has a timeout to make sure that an event is published to Error Topic to prompt a manual intervention to investigate further if the file isn’t successfully scanned. If the GuardDuty – Scan task fails, the Error clean up Lambda function is invoked.

Finally, there’s an S3 Lifecycle policy attached to the Processing bucket. This is to make sure that no file is left in the Processing bucket for more than one day.

Code repository

The GitHub AWS-samples repository has a sample implementation developed using Terraform and Python-based Lambda functions to implement this solution. The same solution can also be implemented using AWS CloudFormation. The code has the components needed to deploy the entire workflow to demonstrate the abilities of Transfer Family and the GuardDuty malware protection plan.

Install the solution

Use the following steps to deploy this solution to your test environment.

Clone the repository to your working directory using Git.
Navigate to the root directory of your cloned project directory.
Update the terraform locals.tf file with the values of your choice for the S3 bucket names, SFTP server names, and other variables.
Run terraform plan.
If everything looks good, run a terraform apply and enter yes to create the resources.

Clean up

After testing and exploring the solution, it’s important to clean up the resources you created to avoid incurring unnecessary costs. To delete the resources created by this solution, navigate to the root directory of your cloned project and run the following command:

terraform destroy

This command will remove the resources created by Terraform, including the SFTP server, S3 buckets, Lambda functions, and other components. Confirm the deletion by entering yes when prompted.

Conclusion

By using the approach outlined in the post, you can make sure that the files received over SFTP and uploaded to your S3 bucket are scanned for threats and are safe for further processing. The solution reduces the exposure surface by making sure that public uploads are scanned in a safe environment before they’re sent to other components of your system.

If you have feedback about this post, submit comments in the Comments section below.

Optimizing cold start performance of AWS Lambda using advanced priming strategies with SnapStart

2025-04-29 Shan Kandaswamy

Post Syndicated from Shan Kandaswamy original https://aws.amazon.com/blogs/compute/optimizing-cold-start-performance-of-aws-lambda-using-advanced-priming-strategies-with-snapstart/

Introduced at re:Invent 2022, SnapStart is a performance optimization that makes it easier to build highly responsive and scalable applications using AWS Lambda. The largest contributor to startup latency (often referred to as cold-start time) is the time spent initializing a function. This includes loading the function’s code and initializing dependencies. For latency-sensitive workloads such as APIs and real-time data processing applications, high startup latency can result in a suboptimal end user experience. Lambda SnapStart can reduce startup duration from several seconds to as low as sub-second, with minimal or no code changes. This post discusses ‘Priming’, a technique to further optimize startup times for AWS Lambda functions built using Java and Spring Boot.

Spring Boot applications typically experience high cold start latency during JVM and framework initialization, where significant time is spent loading classes and performing Just-In-Time (JIT) compilation of Java bytecode. This blog post uses a Spring Boot application as an example that retrieves 10 records from a ‘UnicornEmployee’ table in an Amazon RDS for PostgreSQL database, where each employee record includes employee name, location, and hire date.

The sample application uses Amazon API Gateway which triggers an AWS Lambda function that connects to the database through RDS Proxy to return the employee data. While this sample application uses dummy employee data for demonstration, the patterns and optimization techniques discussed in this post are applicable to real-world scenarios with similar data access patterns. Sample code for this implementation can be found in our GitHub repository at lambda-priming-crac-java-cdk.

Background: How SnapStart works

The post assumes familiarity with SnapStart and provides a short background. For additional details, refer to the SnapStart documentation.

To quickly recap, the INIT phase for a Lambda function involves downloading the function’s code, starting the runtime and any external dependencies, and running the function’s initialization code. For functions that don’t use SnapStart, this phase occurs each time your application scales up to create a new execution environment. When SnapStart is activated, the INIT phase happens when you publish a function version.

The following image shows a comparison of a Lambda request lifecycle with and without SnapStart.

Figure 1 – comparison of a Lambda request lifecycle with and without SnapStart

At the end of the INIT phase, Lambda executes your before-checkpoint runtime hooks. Lambda then snapshots the memory and disk state of the initialized execution environment, persists the encrypted snapshot, and caches it for low-latency access. When the function is subsequently invoked, new execution environments are resumed from the cached snapshot (during the RESTORE phase), speeding up function startup.

Figure 2 – new execution environments are resumed from the cached snapshot.

You can validate this speedup by comparing the RESTORE duration with the INIT duration recorded before SnapStart in your Lambda function’s Amazon CloudWatch Logs. As demonstrated in the following table, enabling SnapStart reduces the startup latency of our sample Spring Boot application by 4.3x from 6.1s to 1.4s. The 6.1s cold start latency for ON_DEMAND is primarily due to the combination of (1) initializing the JVM and Spring Boot framework, (2) JIT compilation of lazy loaded application code during initial invocation and (3) the time needed to establish a database connection with RDS through Amazon RDS Proxy. By enabling SnapStart, Lambda initializes the JVM and Spring Boot prior to the function invocation – resulting in the significantly reduced latency of 1.4s.

Method	Cold Start Invocations	p50	P90	P99	p99.9
PrimingLogGroup-1_ON_DEMAND	128	5047.94 ms	5386.78 ms	6158.80 ms	6195.84 ms
PrimingLogGroup-2_SnapStart_NO_PRIMING	111	1177.87 ms	1288.73 ms	1419.94 ms	1425.63 ms

You can reduce cold starts even further for your latency-sensitive Spring Boot applications by using priming techniques on Lambda functions. Let’s explore how to implement priming techniques.

Priming explained

Priming is the process of preloading dependencies and initializing resources during the INIT phase, rather than during the INVOKE phase to further optimize startup performance with SnapStart. This is required because Java frameworks that use dependency injection load classes into memory when these classes are explicitly invoked, which typically happens during Lambda’s INVOKE phase. You can proactively load classes using Java runtime hooks, that are part of the open source CRaC (Coordinated Restore at Checkpoint) project. This post demonstrates how to use this hook, called beforeCheckpoint(), to prime SnapStart-enabled Java functions, in two ways:

Invoke Priming: This approach involves directly invoking application endpoints or methods in your pre-snapshotting hook so that they are JIT compiled during the INIT phase and included in the snapshot. This can include operations such as invoking API Gateway endpoints or fetching data from an S3 bucket or RDS database to proactively execute the code paths, ensuring that the underlying classes are included in the snapshot.
Class Priming: This approach involves proactive initialization of classes during the INIT phase, ensuring that they are included in the function’s snapshot without risking unwanted changes to application state or data. This can be achieved by leveraging Java’s forName() method, which loads, links, and initializes the specified class. Initialization refers to the JVM process of loading the class definition into memory, verifying the bytecode, preparing static fields with default values, and executing static initializers. This is different from instantiation, which creates objects of the class using constructors. To generate a list of the classes required for pre-loading, you can use the following VM option, writing the list to a file called classes-loaded.txt:
-Xlog:class+load=info:classes-loaded.txt

While invoke priming can offer better performance, it requires additional effort to ensure that the actions performed are idempotent and do not have unintended side effects, for instance processing financial transactions in a banking application. For this reason, invoke priming should only be used when code executed during priming is either idempotent or does not modify state. For scenarios where this is not possible, class priming provides a safer alternative by only initializing classes without executing their methods. Note that this assumes your application does not execute state-modifying code during class initialization.

With this context, let’s look at how to implement Invoke and Class priming for a Spring Boot sample application.

Example priming Implementation using CRaC runtime hooks before taking a Lambda snapshot

This post demonstrates both Invoke priming and Class priming using the sample Spring Boot application. The choice between the two approaches depends on the specific requirements and complexities of your application.

Step 1: Set up your Spring Boot Application using the aws-serverless-springboot3-archetype as explained in our Quick Start Spring Boot3 guide, adding database connectivity code, or simply clone the sample project from GitHub repository.

Create a Spring Boot Application.

// src/main/java/software/amazon/awscdk/examples/unicorn/UnicornApplication.java
package software.amazon.awscdk.examples.unicorn;
…
@Import({ UnicornConfig.class })
@SpringBootApplication
public class UnicornApplication {

    private static final Logger log = LoggerFactory.getLogger(UnicornApplication.class);

    public static void main(String... arguments) {
        SpringApplication.run(UnicornApplication.class, arguments);
    }

}

Add all the necessary Maven dependencies for Spring Boot, AWS Lambda, and Database Connection in your pom.xml file. The following, highlighted, dependency contains the classes required to use the CRaC runtime hooks.
```
...
        <dependency>
            <groupId>org.crac</groupId>
            <artifactId>crac</artifactId>
        </dependency>
...
```

Configure Database Connection – Set up the database connection details in application.properties.

spring.datasource.password=${SPRING_DATASOURCE_PASSWORD} 
spring.datasource.url=${SPRING_DATASOURCE_URL} 
spring.datasource.username=postgres 
spring.datasource.hikari.maximumPoolSize=1

Step 2: Implement Lambda Function Handler with CRaC runtime hooks and Invoke Priming Approach:

Create Lambda Function Handler and integrate CRaC runtime hooks to execute beforeCheckpoint() and afterRestore() methods in your application for before taking and after restoring the snapshot.

Implement the RequestHandler<UnicornRequest, UnicornResponse> interface in the Lambda function handler class.
Implement the CRaC resource interface with two methods: beforeCheckpoint() and afterRestore(), which defines actions performed before Lambda creates the snapshot and after the snapshot is restored.
Add invoke priming by creating a UnicornRequest object with a GET request to a specific endpoint (such as, /unicorn) and call the handleRequest(unicornRequest, null) method.

This ensures that the code paths associated with the specified endpoint are JIT compiled and optimized for faster execution during the first invocation after the snapshot is restored.

/src/main/java/software/amazon/awscdk/examples/unicorn/handler/InvokePriming.java
package software.amazon.awscdk.examples.unicorn.handler;

import org.crac.Core;
import org.crac.Resource;
...
public class InvokePriming implements RequestHandler<APIGatewayV2HTTPEvent, APIGatewayV2HTTPResponse>, Resource {
	...

@Override
public APIGatewayV2HTTPResponse handleRequest(APIGatewayV2HTTPEvent event, Context context) {
    var awsLambdaInitializationType = System.getenv("AWS_LAMBDA_INITIALIZATION_TYPE");
    var unicorns = getUnicorns();
    var body = gson.toJson(unicorns);
    return APIGatewayV2HTTPResponse.builder().withStatusCode(200).withBody(body).build();
}

@Override
public void beforeCheckpoint(org.crac.Context<? extends Resource> context)
        throws Exception {
    var event = APIGatewayV2HTTPEvent.builder().build();
    handleRequest(event, null);
}
...
}

Step 3: Implement Class priming Approach:

The class priming approach focuses on pre-loading required classes to achieve optimal performance. To implement class priming, generate the list of classes that are loaded during the application startup and function execution by running the application locally using the following JVM argument: -Xlog:class+load=info:classes-loaded.txt

Ensure that your application classes included in the generated classes-loaded.txt file are not mutating state during static initialization.
Note: the generated classes-loaded.txt contains class entries in the following format:
```
[0.068s][info][class,load] software.amazon.awscdk.examples.unicorn.handler.ClassPriming source: file:/var/task/
```
Extract only the fully qualified class names from each line and remove the additional logging information. For Example:
```
software.amazon.awscdk.examples.unicorn.handler.ClassPriming
```

Use the ClassLoaderUtil.loadClassesFromFile() utility method to extract the generated class entries.

	     //src/main/java/software/amazon/awscdk/examples/unicorn/service/ClassLoaderUtil.java
package software.amazon.awscdk.examples.unicorn;
	...
public class ClassLoaderUtil {
	...
    public static void loadClassesFromFile() {
        log.info("loadClassesFromFile->started");
        Path path = Paths.get("classes-loaded.txt");

        try (BufferedReader bufferedReader = Files.newBufferedReader(path)) {
            Stream<String> lines = bufferedReader.lines();
            lines.forEach(line -> {
                var index1 = line.indexOf("[class,load] ");
                var index2 = line.indexOf(" source: ");

                if (index1 < 0 || index2 < 0) {
                    return;
                }

                var className = line.substring(index1 + 13, index2);
                try {
                    Class.forName(className, true,
                            ClassPriming.class.getClassLoader());
                } catch (Throwable ignored) {
                }
            });

            log.info("loadClassesFromFile->finished");
        } catch (IOException exception) {
            log.error("Error on newBufferedReader", exception);
        }
    }
...
}

Read a file (such as, /classes-loaded.txt) that contains a list of classes that have been loaded during the application’s execution in the beforeCheckpoint() method.

Use the Class.forName() method to load and initialize the class, ensuring that it is ready during the snapshot.
Note: by systematically pre-loading these classes, the Class priming approach simplifies the optimization process and reduces the complexities associated with Invoke priming.

//src/main/java/software/amazon/awscdk/examples/unicorn/handler/ClassPriming.java
package software.amazon.awscdk.examples.unicorn.handler;

...
import org.crac.Core;
import org.crac.Resource;

public class ClassPriming implements RequestHandler<APIGatewayV2HTTPEvent, APIGatewayV2HTTPResponse>, Resource {

...
        ConfigurableApplicationContext configurableApplicationContext =
				SpringApplication.run(UnicornApplication.class);

        this.unicornService = configurableApplicationContext.getBean(UnicornService.class);
        this.gson = configurableApplicationContext.getBean(Gson.class);

        Core.getGlobalContext().register(this);
    }

    @Override
    public APIGatewayV2HTTPResponse handleRequest(APIGatewayV2HTTPEvent event, Context context) {
        var unicorns = getUnicorns();
        var body = gson.toJson(unicorns);

        return APIGatewayV2HTTPResponse.builder().withStatusCode(200).withBody(body).build();
    }

    @Override
    public void beforeCheckpoint(org.crac.Context<? extends Resource> context)
            throws Exception {

        ClassLoaderUtil.loadClassesFromFile();

    }
...
}

Step 4: AWS CDK Infrastructure Setup

Before proceeding, review the prerequisites in the project README file.

The CDK stack deploys a serverless application and required infrastructure for testing different Lambda optimization strategies. It creates a VPC with private subnets, an RDS for PostgreSQL instance with a database proxy, and five Lambda functions implementing different optimization approaches (ON_DEMAND without SnapStart, SnapStart without priming, SnapStart with invoke priming, and SnapStart with class priming). Each Lambda function is integrated with API Gateway for HTTP access, configured with Java 21 runtime on ARM64 architecture, and includes CloudWatch log groups for monitoring.

Follow these steps to deploy the infrastructure:

Clone the sample repository:

git clone https://github.com/aws-samples/lambda-priming-crac-java-cdk.git

Deploy the CDK stack:

cd lambda-priming-crac-java-cdk/infrastructure
cdk synth
cdk deploy --require-approval never --all 2>&1 | tee cdk_output.txt

Save the API Gateway URLs:
The deployment will output five URLs in this format:

# ON_DEMAND endpoint (without SnapStart)
LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi1ONDEMANDEndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/

# SnapStart without priming endpoint
LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi2SnapStartNOPRIMINGEndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/

# SnapStart with invoke priming endpoint
LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi3SnapStartINVOKEPRIMINGEndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/

# SnapStart with class priming endpoint
LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi4SnapStartCLASSPRIMINGEndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/

# Database setup endpoint
LambdaPrimingCracJavaCdkStack.PrimingJavaRestApi5DBLOADEREndpoint = https://[id].execute-api.us-east-1.amazonaws.com/prod/

Extract the URLs into variables for testing:

ONDEMAND_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 1) \

NOPRIMING_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 2 | tail -n 1) \

INVOKEPRIMING_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 3 | tail -n 1) \

CLASSPRIMING_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 4 | tail -n 1) \

SETUP_URL=$(grep -oE 'https://[a-zA-Z0-9.-]+\.execute-api\.[a-zA-Z0-9-]+\.amazonaws\.com/prod/' "cdk_output.txt" | head -n 5 | tail -n 1)

Step 5: Load database and run performance testing using artillery:

Initialize the database with sample data.

curl -X GET "$SETUP_URL"

#Expected output: {"message":"Database schema initialized and data loaded"}

Run performance tests for all endpoints

artillery run -t "$ONDEMAND_URL" -v '{ "url": "/unicorn" }' ./loadtest.yaml && \
artillery run -t "$NOPRIMING_URL" -v '{ "url": "/unicorn" }' ./loadtest.yaml && \
artillery run -t "$INVOKEPRIMING_URL" -v '{ "url": "/unicorn" }' ./loadtest.yaml && \
artillery run -t "$CLASSPRIMING_URL" -v '{ "url": "/unicorn" }' ./loadtest.yaml

Step 6: Compare the load test results for On-demand (non-SnapStart), SnapStart, Invoke priming, and Class priming

The performance test results in the table below are sorted from slowest to fastest startup latency. The function without SnapStart performs the slowest due to JVM initialization, class loading and JIT compilation that occurs when the function is invoked. Notice a 4.3x improvement with SnapStart, which resumes invocations from a pre-initialized snapshot thereby avoiding JVM initialization and initial JIT compilation. SnapStart with class priming achieves a 1.4x speed-up over SnapStart, by proactively loading/initializing classes during INIT so that they are included in your function’s snapshot. Finally, SnapStart with invoke priming achieves the fastest performance – with a 781.68ms p99.9 cold-start latency that is 1.8x faster than SnapStart. This is because in addition to initializing classes, it also executes methods on the instances of those classes, resulting in even more components being included in the function’s snapshot.

Note that with invoke priming, any application code you execute must either be idempotent or modify stub data only. For instance, consider application code that triggers a financial transaction. If this code is executed during invoke priming with real user data, it may drive unintended effects with potentially serious consequences. Class priming avoids this, since application classes are initialized rather than being instantiated and their methods executed. This assumes that application code does not execute state modifying logic during class initialization. We recommend that you keep these considerations in mind when using invoke and/or class priming, and choose the appropriate approach for your use case.

Method	Cold Start Invocations	p50	P90	P99	p99.9
PrimingLogGroup-1_ON_DEMAND	128	5047.94 ms	5386.78 ms	6158.80 ms	6195.84 ms
PrimingLogGroup-2_SnapStart_NO_PRIMING	111	1177.87 ms	1288.73 ms	1419.94 ms	1425.63 ms
PrimingLogGroup-4_SnapStart_CLASS_PRIMING	82	857.81 ms	997.49 ms	1085.94 ms	1085.94 ms
PrimingLogGroup-3_SnapStart_INVOKE_PRIMING	66	608.42 ms	688.88 ms	781.68 ms	781.68 ms

Conclusion

This post showed how AWS Lambda SnapStart, enhanced by CRaC runtime hooks, unlocks granular control over cold-start optimization for Java applications through two distinct priming strategies:

Invoke Priming: improves performance by executing critical endpoints during snapshot creation, ideal for idempotent workflows.
Class Priming: preloads classes without triggering business logic, mitigating side-effect risks.

To implement these optimization techniques in your applications evaluate your use case and opt for the optimal priming approach. Track latency reductions and resource utilization of your application via Amazon CloudWatch metrics to quantify performance improvements. By integrating these strategies, developers can achieve sub-second cold starts while maintaining the scalability and cost-efficiency of serverless architecture using Java.

To dive deeper, check out the GitHub repository with the full example code, including setup instructions and reusable patterns you can adapt to your own projects. For more examples of Java applications running on AWS Lambda, visit serverlessland.com and explore a wide range of resources, tutorials, and real-world use cases.