Tag Archives: contributed

Enriching and customizing notifications with Amazon EventBridge Pipes

2025-04-09 Chris McPeek

Post Syndicated from Chris McPeek original https://aws.amazon.com/blogs/compute/enriching-and-customizing-notifications-with-amazon-eventbridge-pipes/

This blog post authored by Elie Elmalem, Associate Scale Solutions Architect

When implementing event-driven architectures, customers frequently need to enrich their incoming events with additional information to make them more valuable for downstream consumers. Traditionally, customers using Amazon EventBridge would accomplish this by writing AWS Lambda functions to augment their events with supplementary data. However, this approach requires writing and maintaining custom code, adding complexity to their event processing pipeline.

Amazon EventBridge Pipes simplifies this process by providing a streamlined, managed service for event enrichment without the need to write and manage custom Lambda functions. This blog post demonstrates how you can use EventBridge Pipes’ built-in data enrichment capabilities to dynamically enhance your events with additional context and customer-specific details, making event processing more efficient and easier to maintain.

Amazon EventBridge Pipes

Amazon EventBridge creates a direct connection between sources and targets. Using an EventBridge bus helps you route and fan-out events to services in a pub/sub pattern. EventBridge Pipes on the other hand help you with point-to-point service integrations patterns. What sets it apart from the traditional event bus/rule pattern is its data transformation and enrichment support.

When defining an EventBridge Pipes, you specify the source and the target of the pipe. Pipes support a variety of sources and targets. Between the source and target, EventBridge Pipes supports filtering and enrichment. The filtering enables you to select and process a targeted subset of events. Enrichment allows you to enhance data by adding missing information before sending it to a target. For instance, if an event lacks necessary information, it ensures the target can properly consume the event. Enriching data can be very powerful, as it makes it possible to enhance a generic event and transform it. EventBridge Pipes support enrichment using Lambda functions, AWS Step Functions, Amazon API Gateway and EventBridge API destinations. More details about these concepts can be found in the Amazon EventBridge Pipes concepts documentation.

Figure 1: Representation of EventBridge Pipes showing filter and enrichment steps

This blog post will use the enrichment step of the pipe to create custom notifications.

Overview

To illustrate the functionality, this post uses a use case from a clothing retailer. Businesses such as this retailer want to keep their loyal customers engaged. Often, they rely on bulk promotional emails which lack personalization. In this use case, the retailer wants to send targeted promotion codes. As soon as the 10th order is placed, the code is sent via email or SMS to their customer.

Without EventBridge Pipes, this would be implemented using EventBridge to respond to the order event. All the events are sent to a custom Lambda function to process it. If the order meets the right conditions, the Lambda function sends a notification with the discount code to the customer using Amazon Simple Notification Service (Amazon SNS).

Figure 2: Traditional approach using EventBridge

While this architecture works, it requires you to maintain the integration code as well as the data enriching logic within the Lambda function as the function needs to extract the necessary information from the events and manage routing to SNS. As more microservices follow the same pattern, the code becomes more complex. This can lead to longer execution times along with higher cost and greater maintenance effort.

Simplifying using Amazon EventBridge Pipes

Amazon EventBridge Pipes can be used to simplify the previous implementation by handling the enrichment and integration between services. Amazon EventBridge Pipes take care of sending the event to your configured enrichment step and then routes the enriched event to the target. If the chosen method is a Lambda function, it leaves the function code to only focus on enrichment logic. It eliminates the need for code to extract the necessary fields from the event and to send notifications.

Figure 3: Solution architecture using EventBridge Pipes

As the event comes into the pipe, the enrichment step triggers a Lambda function, which will check eligibility and returns the message to route to SNS. If the customer is not eligible for a discount code, it returns an order confirmed message with the data retrieved from the original order event. If the customer is eligible for a discount, the message also contains the discount code.

This is the architecture for the updated flow:

A customer orders a new item. The order is sent to a Simple Queue Service (SQS) Orders queue.
The new message on the Orders queue triggers the EventBridge Pipes.
The pipe triggers an AWS Lambda function to enrich the data.
The functions checks if the customer is eligible for a discount code against an Amazon DynamoDB table. The table contains the number of times each customer has ordered.
The Lambda function returns the custom message that will be sent to the customer, either with or without the discount code.
The message is routed to an SNS topic by the EventBridge Pipe
Customer receives the notification via its preferred subscription method.

Building the updated flow

To build the updated flow, I have chosen to use the AWS Cloud Development Kit (CDK) in Python. You can use the code given here to deploy it into your account. The code can also be found on GitHub.

Note: This sample code is for testing purposes only and is not intended to be used in a production account.

For this solution, you need the following prerequisites:

The AWS Command Line Interface (CLI) installed and configured for use.
An Identity and Access Management (IAM) role or user with enough permissions to create an IAM policy, DynamoDB table, SQS queue, SNS topic, Lambda Function and EventBridge Pipes.
AWS CDK
Python version 3.9 or above, with pip and virtual virtualenv.

Once the prerequisites are met, set up a new Python CDK project in an empty directory:

mkdir blog_code
cd blog_code
cdk init app –-language python

Then, activate the virtual environment and install the CDK’s dependencies:

source .venv/bin/activate
python -m pip install -r requirements.txt

The cdk init command creates a blog_code folder. The GitHub repository contains the code for the blog_code_stack.py file inside the blog_code folder.

Then, within the blog_code folder, create a new folder called lambda. Inside this new folder, create a file called index.py. This file will contain the code for the enrichment lambda function. Once again, this code can be found in the GitHub repository. Here is a section of the Lambda code:

def lambda_handler(event, context):

    message = json.loads(event[0]['body'])

    id = message['id']
    order_content = message['order_content']
    
    nmb_orders = get_number_of_orders(id)
    
    # Calculate orders left
    orders_left = MAX_ORDERS - nmb_orders
    
    # Update the DynamoDB table with the new number of orders
    if nmb_orders == MAX_ORDERS:
        update_table(id, 0)
    else:
        update_table(id, nmb_orders)
    
    if orders_left == 0:
        return [f"Thank you for your order of {order_content}. You have earned a 10% discount code on your next order: XA5GT2SF"]
    else:  
        # Return the confirmation message
        return [f"Thank you for your order of {order_content}. This is your confirmation message! Only {orders_left} orders left until a 10% discount!"]

The Lambda function works in the following way:

It receives an event from the EventBridge Pipe which consists of the order and the ID of the user who made the order
It gets the number of orders that the user has already placed by calling a GetItem command on the DynamoDB table.
It calculates how many orders are left before the user gets the discount code.
It updates the DynamoDB table with the new number of orders to account for the one that was just placed.
If the user has placed the right number of orders, it returns a confirmation message with the discount code. Otherwise, it notifies the user of the number of orders that still need to be placed to get the discount.

Now, deploy the CDK stack into your account. Make sure that you are in the root directory of your project:

cdk bootstrap
cdk deploy

Once the stack has finished deploying, you will find an EventBridge pipe visible on the console by going to the EventBridge service page and clicking on Pipes in the left panel.

Testing the solution

To test the solution, you must first set up a subscription to the SNS topic to receive notifications. It is recommended to set up email notifications for simplicity and testing purposes. To do so, follow the instructions on the Amazon SNS documentation for the topic with name TargetTopic. When the subscription is set up, don’t forget to check your email inbox and confirm the subscription.

Once notifications are set up, visit the DynamoDB console page. You need to manually add an entry to the eligibility table to mimic a real environment:

Click on Tables in the left panel
Select the EligibilityTable table.
Click Explore table items then Create item
Enter an id with a value of 01.
Click Add new attribute and select String.
Under attribute name, enter orders, and under value enter 8.
Click Create item.

The Items returned table should look like the following. This assumes that the customer has already place 8 orders.

Figure 4: Items returned table after adding a new item

Now, visit the SQS console page. You will need to send a message to the queue to mimic new orders being placed.

Click on the queue called SourceQueue.
Click Send and receive messages.
Under message body, paste in the following message and click Send message:

{
    "order_content": "large shirt",
    "id": "01",
    "username": "johndoe01",
    "transaction_time": "10:04:00"
  }

After a few minutes, you should receive an email confirming your order, as your order message is considered to be the 9^th order. Send the message again to place a 10^th order and you should receive your discount code!

Figure 5: Email received with a discount code

Cleanup

To delete the resources in your account, run the following command in the root directory of your project:

cdk destroy

Conclusion

This blog post showed how Amazon EventBridge Pipes and its enrichment feature can help you create tailored notifications. First, it discussed how it could be implemented using EventBridge and then presented a simplified implementation using EventBridge Pipes.

For more information on common patterns for EventBridge Pipes, you can check out Implementing architectural patterns with Amazon EventBridge Pipes.

For more serverless learning resources, visit Serverless Land. To find more patterns, go directly to the Serverless Patterns Collection.

Simplifying private API integrations with Amazon EventBridge and AWS Step Functions

2025-03-27 Eric Johnson

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/simplifying-private-api-integrations-with-amazon-eventbridge-and-aws-step-functions-2/

This blog written by Pawan Puthran, Principal Specialist TAM, Serverless and Vamsi Vikash Ankam, Senior Serverless Solutions Architect.

In December 2024, AWS announced that Amazon EventBridge and AWS Step Functions support integration with private APIs using AWS PrivateLink and Amazon VPC Lattice. This feature allows users to integrate applications seamlessly across private networks, on-premises infrastructure, and cloud platforms. It provides operational simplicity, enabling secure and controlled communication between services within a Virtual Private Cloud (VPC). This blog post explores how to leverage this new capability to integrate Step Functions with private APIs, making application interactions across private networks more efficient and secure.

Overview

Private integrations are essential for secure communication between cloud services within a VPC. As organizations modernize their applications in the cloud, they often need to integrate existing systems with private network environments. EventBridge and Step Functions previously needed proxies to send events to HTTPS applications. These proxies, such as AWS Lambda or Amazon Simple Queue Service (Amazon SQS), delivered events to applications running on Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Kubernetes Service (Amazon EKS), or Amazon Elastic Container Service (Amazon ECS). Now, users can directly invoke private HTTPS-based endpoints running within their VPC using EventBridge and Step Functions.

This new capability offers several key benefits:

Enhanced security and compliance: Private API integrations significantly enhance security by keeping APIs within private networks, minimizing exposure to internet threats and making sure of compliance in regulated industries such as finance and healthcare.
Simplified architecture and increased developer productivity: This feature streamlines integration by enabling direct access to private APIs, eliminating complex network setups and proxy solutions. It allows developers to focus on core logic, resulting in cleaner architectures, faster development, and reduced maintenance. By removing the need for custom code and unifying application architecture, the integration process accelerates, leading to faster time to market and enhanced innovation.
Improved performance and reliability: Private API integrations to VPC resources enhance performance by leveraging the AWS backbone network. This direct connectivity improves speed, increases reliability, and minimizes external network dependencies and points of failure.

EventBridge and Step Functions use new capabilities of PrivateLink and VPC Lattice, Resource Gateway and Resource Configuration, to facilitate secure network connectivity to services and resources inside of a VPC. To establish the private connectivity, you need the following components:

Resource Gateway: A Resource Gateway serves as a secure entry point for the inbound traffic to the resource. This acts as an ingress point within the VPC where the resources reside.
Resource Configuration: A Resource Configuration is a logical entity that identifies the resource and specifies how and who can access it. Defining a resource configuration allows you to allow private, secure, and unidirectional network connectivity to resources in your VPC from clients and services in other VPCs and accounts.
EventBridge Connections: EventBridge Connections used in EventBridge API destinations and Step Function workflows, establishes connectivity to your private HTTPS endpoints by using resource configurations.
AWS Resource Access Manager: You can share the resource configuration through AWS Resource Access Manager (AWS RAM), a service that securely shares your VPC resources across your organizations and with other AWS accounts.

Workload overview

To illustrate how Step Functions invoke private HTTPS APIs, consider the following workflow that classifies product reviews as fake or real.

The Step Functions workflow processes an array of product reviews using Distributed Map.
It involves calling the Amazon Nova Micro model through Amazon Bedrock to classify the review text.
If a review is classified as fake, then the workflow publishes an event to an EventBridge bus, providing a flexible integration for potential downstream analysis or notifications.
If a review is classified as real, then Step Functions calls the private HTTPS endpoint, using DNS address to further process the reviews.
This private API is hosted in AWS Fargate behind an internal Application Load Balancer (ALB) within a VPC.

Figure 1: Step Functions workflow calling private HTTPS-based endpoint running in AWS Fargate

In real-world scenarios, this includes analyzing text patterns, user behavior, and linguistic cues to determine the authenticity of each review. Suspicious reviews are automatically flagged by building customized workflows to maintain the integrity of the product feedback system.

Deploying the example

Before configuring the private integration, create an Amazon Route53 public hosted zone with a registered domain (such as api.com), and an AWS Certificate Manager (ACM) certificate corresponding to the domain. While Amazon Route53 private hosted zones is currently not supported, utilizing public hosted zones resolves the domain name to a private IP address, accessible only from within the VPC.

This post includes a sample application and deployment instructions. For complete details, refer to the README.

Scenario 1: Single account

In this scenario, the Step Functions, EventBridge connections, and private resources reside in the same account, as shown in the following figure

Figure 2: Overview of a single account setup with Step Functions workflow and private API in the same account

VPC Resource Gateway acts like the entry point to access the private resources running within your VPC. As a best-practice, consider creating a resource gateway to span across multiple private subnets (Availability Zones) for high availability. Refer to the AWS Cloud Development Kit (AWS CDK) code snippet in lib/vpclattice-stack.ts for resource gateway implementation.
Resource Configurations establish the connection between the private endpoint and the Resource Gateway and are used to uniquely identify the private resources running within your VPC. Refer to the AWS CDK code snippet in lib/vpclattice-stack.ts to create Resource Configuration, and configure the domain name and port.
To enable Step Functions to communicate with the private VPC resources, you create an EventBridge Connection. This handles the authorization and private connectivity to connect to the private API. Refer to the AWS CDK code snippet in lib/workflow-stack.ts for creating EventBridge Connections.
The Step Functions state machine deployed as part of the sample application uses the HTTPS Invoke task type to call the private API. Calling private APIs from Step Functions allows you to use features such as built-in error handling like retries for transient issues and redrive for errors.

You can use the following payload to test the Step Functions execution:

{
  "items": [
    {
      "asin": "B000FA64PA",
      "helpful": [ 0, 0],
      "overall": 5,
      "reviewText": "Darth Maul working under cloak of darkness committing sabotage now that is a story worth reading many times over. Great story.",
      "reviewTime": "10 11, 2013",
      "unixReviewTime": 1381449600
    },
    {
      "asin": "B000F83SZQ",
      "helpful": [ 1, 1],
      "overall": 4,
      "reviewText": "Never heard of Amy Brewster. But I don't need to like Amy Brewster to like this book. Actually, Amy Brewster is a sidekick in this story, who added mystery to the story not the one resolved it. The story brings back the old times, simple life, simple people, and straight relationships.",
      "reviewTime": "03 22, 2014",
      "unixReviewTime": 1395446400
    }
  ]
}

The following figure shows the Step Functions execution where the review is classified as real and successfully invokes the private HTTPS endpoint.

Figure 3: Step Functions execution classifying the product reviews as real and successfully invoking the private API

Scenario 2: Cross account

In this scenario, all the private resources reside in Account A. The Step Functions and EventBridge Connections reside in Account B. The cross-account resource sharing is powered by AWS RAM, as shown in the following figure.

Figure 4: Cross-account setup

Following the creation of the Resource Gateway and the Resource Configuration, as described in the previous section, configure the resource share using AWS RAM in Account A.

The sample application creates the AWS RAM resource share in Account A. This allows Account B to access private VPC resources in Account A, enabling secure, AWS Identity and Access Management (IAM) authorized access to the VPC resources in Account A. Refer to the CDK code snippet in lib/vpclattice-stack.ts to create cross-account resource share using AWS RAM.
In Account B, AWS RAM receives an invitation from Account A to access the private VPC resources. Upon acceptance, the resource share status changes to Active, granting access to the private VPC resources in Account A.
To enable access from Account B’s Step Function or EventBridge to Account A’s private VPC resources, create an EventBridge Connection as described in Step 3 (Single account scenario). Map this connection to the shared AWS RAM Resource Configuration created from the previous step.

Enterprises with distributed development teams operate across multiple AWS accounts. The setup described above enables secure cross-account access to VPC resources.

New connection state events

EventBridge now publishes change in the state events for new or existing connections. This is useful when taking actions on state changes or for troubleshooting purposes. The following example shows the state change events published for Connection Authorized and Connection Activated.

Figure 5: EventBridge connections state change

Conclusion

The new integration allows Amazon EventBridge and AWS Step Functions to integrate with private APIs, powered by AWS PrivateLink and Amazon VPC Lattice. Users can integrate legacy on-premises systems with cloud-native applications using event-driven architectures and workflow orchestration. The integration helps enterprises modernize distributed applications across public and private networks, enabling faster innovation, higher performance, and lower costs by eliminating the need for custom networking or integration code.

For more details, refer to the EventBridge and Step Functions documentation. Check out this video on setting up integrations with EventBridge and Step Functions. Get the sample code used in this post from this GitHub repository.

To expand your serverless knowledge, visit Serverless Land.

Optimizing network footprint in serverless applications

2025-03-21 Chris McPeek

Post Syndicated from Chris McPeek original https://aws.amazon.com/blogs/compute/optimizing-network-footprint-in-serverless-applications/

This post is authored by Anton Aleksandrov, Principal Solution Architect, AWS Serverless and Daniel Abib, Senior Specialist Solutions Architect, AWS

Serverless application developers may commonly encounter scenarios where they need to transport large payloads, especially when building modern cloud applications that need rich data. Examples include analytics services with detailed reports, e-commerce platforms with extensive product catalogs, healthcare applications transmitting patient records, or financial services aggregating transactional data.

Many serverless services have a well-defined maximum payload size. For example, AWS Lambda maximum request/response payload size is 6 MB, and Amazon Simple Queue Service (Amazon SQS) and Amazon EventBridge maximum message size is 256 KB. In this post, you will learn how to use data compression techniques to reduce your network footprint and transport larger payloads under existing constraints.

Overview

Cloud applications evolve continuously and need to be adjusted frequently for new requirements, such as new business features or new Service Level Objectives (SLO) for higher throughput and lower latency. As new use cases and data patterns are added, it is common to see request and response payload sizes increase. At some point, you might hit the maximum service payload size limits, such as 6 MB for synchronous Lambda function invokes, 10 MB for Amazon API Gateway, and 256 KB for Amazon SQS, EventBridge, and asynchronous Lambda invokes.

There are several techniques you can apply when dealing with large payloads. If your payloads are tens of MBs or more, or you need to transport large binary objects with API Gateway, you can store the payload on Amazon Simple Storage Service (Amazon S3) and use pre-signed URLs for clients to directly upload and download from S3.

Figure 1. A sample of architecture for handling large payloads

Lambda function URLs response streaming supports up to 20 MB responses. For handling large messages with services such as SQS or EventBridge, you can store the message in S3 and pass a reference. The downstream consumer will use the reference to download the message directly from S3. One common characteristic of these techniques is that they introduce architectural complexity and may necessitate modifications to your existing solution architecture and data flow patterns.

Furthermore, as your payloads grow in size, you will see increased data transfer costs, especially if your solution is transporting data through Amazon Virtual Private Cloud (VPC) NAT Gateways, VPC endpoints, or sending data across AWS Regions. For example, it is common for VPC-based solutions to have Lambda functions in their architecture. A container running on Amazon Elastic Kubernetes Service (Amazon EKS) might need to invoke a Lambda function, or a VPC-attached Lambda function might need to reach out to the public internet.

Figure 2. Examples of using virtual network appliances with serverless applications

Both NAT Gateway and VPC Endpoint are billed per GB of data processed, which makes data compression a valuable optimization technique. Go to NAT Gateway pricing and VPC Endpoint pricing for details.

The following sections explore data compression techniques and demonstrate how to apply them in your serverless applications. You can learn how to send larger payloads within the existing payload size boundaries and reduce your network footprint without significant architectural changes. This post discusses compression techniques in the context of Lambda and API Gateway, but the same principles can be applied to other services, such as SQS, EventBridge, and AWS AppSync. Understanding compression concepts better equips you to optimize your application’s data-handling capabilities.

What is data compression?

Compression is a widely used approach to reduce data size in order to improve cost-effectiveness and performance for data storage and transmission. Many tools and frameworks incorporate data compression techniques, such as gzip or zstd. It is thoroughly documented in the official IANA specification and IETF RFC 9110. Browsers such as Chrome and Firefox, HTTP toolkits such as curl and Postman, and runtimes such as Node.js and Python natively handle compression, often without user involvement.

Consider HTTP protocol. When a client wants to send a compressed payload, it specifies it in the Content-Type header. To receive a compressed response, the client specifies supported compression methods in the Accept-Encoding request header.

Figure 3. Accept-Encoding request header specifying supported compression methods

The server compresses the response payload using one of the supported methods and uses the Content-Encoding response header to indicate the method to the client.

Figure 4. Content-Encoding response header specifying compression method

This mechanism can accelerate client-server communications by reducing the number of bytes transmitted over the network. Compression efficiency depends on the data type. Text-based formats like JSON, XML, HTML, and YAML compress well, while binary data such as PDF and JPEG generally compress less effectively.

Data compression with API Gateway

API Gateway provides built-in compression support. Use the minimumCompressionSize configuration to set the smallest payload size to compress automatically. The value can be between 0 bytes to 10 MB. Compressing very small payloads might actually increase the final payload size, and you should always test with your real payload patterns to determine the optimal threshold.

Figure 5. Handling data compression in API Gateway

API Gateway enables clients to interact with your API using compressed payloads through supported content encodings. The compression mechanism works bi-directionally. For JSON payloads, API Gateway seamlessly handles compression and decompression, maintaining compatibility with mapping templates. It decompresses incoming payloads before applying request mapping templates and compresses outgoing responses after applying response mapping templates. This automated compression optimizes data transfer:

When sending compressed data, clients supply the appropriate Content-Encoding header. API Gateway handles the decompression and applies configured mapping templates before forwarding the request to the integration.
When API Gateway receives an integration response and compression is enabled, it compresses the response payload and returns it to the client, provided that the client has included a matching Accept-Encoding header.

A sample test using the compression technique with API Gateway and JSON payload yielded the following results.

Compression disabled. Response size = 1 MB, response latency = 660 ms
Compression enabled. Response size = 220 KB, response latency = 550 ms

Compressing data resulted in 78% network footprint reduction and improved latency by 110 ms.

This configuration-based technique uses the API Gateway native compression. However, payloads are decompressed before being delivered to downstream integrations, thus they still remain subject to Lambda’s 6 MB max payload size. To address this, you can configure binaryMediaTypes in the API Gateway to pass compressed payloads to Lambda directly, enabling the function to handle decompression.

Figure 6. CDK code to configure API Gateway for data compression and binary data passthrough

Handling compressed data in Lambda functions

The Lambda Invoke API supports payloads in plain-text formats, such as JSON. The maximum payload size is 6 MB for synchronous invocations and 256 KB for asynchronous. Although the Invoke API supports uncompressed text-based payloads, you can introduce data compression in your function code and use API Gateway or Function URLs to facilitate content conversion, as illustrated in the following figure.

Figure 7. Transporting compressed payloads in a serverless applications

Handling data compression in your Lambda function code can be done through libraries commonly embedded in the runtime. The following code snippet shows the compressing response payload using Node.js. Similar techniques can be applied to other runtimes.

Figure 8. Sample code implementing response payload compression in a Lambda function

Line 1: Import gzip functionality from the zlib module.
Lines 11: Compress and Base64-encode data. Gzip compression, similar to many other compression methods, produces a binary stream. Base64 encoding converts it to the text-based format expected by the Lambda service
Lines 13-21: Response object is created with isBase64Encoded=true and response headers telling the client that the response is a gzip-encoded JSON object.

The following screenshot shows the result: 20 MB uncompressed JSON returned from a Lambda function as a 2.5 MB compressed response body. Network footprint reduced by over 80%.

Figure 9. A screenshot from Postman showing the original and compressed payload size

Using this technique, you can reduce your network footprint and transport payload sizes several times higher than the Lambda maximum payload size.

Using Function URLs with compressed payloads

Transporting compressed payloads through Lambda Function URLs doesn’t necessitate any extra configuration. For handler responses, your code needs to compress and Base64-encode the data as shown in the preceding figure. For invocation requests, the Function URL endpoint recognizes the incoming compressed payload as binary and passes it to your handler as a Base64 encoded string in the event body.

Figure 10. Sample code implementing request payload decompression in a Lambda function

Trade-offs and testing results

Compressing data in function code is a CPU-intensive activity, potentially increasing invocation duration and, as a result, function cost. This, however, can be balanced by the benefits of data compression. As you’ve seen in previous sections, while compressing data adds compute latency, transporting smaller payloads over the network reduces network latency. The following section summarizes a series of tests performed to estimate the impact of data compression on Lambda function invocation duration, Lambda function invocation cost, and data transfer savings with both NAT Gateway and VPC Endpoint. The tests were performed with several assumptions and randomly generated JSON data. You can see full testing results in the sample GitHub.com repo.

Test results demonstrated that the impact on function latency and cost primarily depends on two key factors: payload size and allocated memory (which determines vCPU capacity). Using a Node.js runtime with ARM architecture as an example, compressing a 1 MB JSON object in a function with 1 GB of allocated memory resulted in 124 ms of added processing time on average. For 10 million invocations, this extra processing time adds approximately $16. At the same time, the compression yielded a 70% reduction in payload size. With the same number of invocations, this translates to approximately $300 in savings when using NAT Gateway and $70 in savings when using VPC Endpoints (depending on the number of Availability Zones (AZs)).

AWS Service pricing is updated regularly, you should always consult the respective pricing pages for the latest information. Moreover, you should conduct your own performance and cost estimates using payloads that represent your workloads. Compression effectiveness varies significantly depending on the data type: payloads with low compression rates might not benefit from this technique.

Sample application

Follow the instructions in this GitHub repository to provision the sample in your AWS account. The project creates two Lambda functions to demonstrate receiving and returning compressed JSON using Function URLs and API Gateway.

The sample shows how to GET and POST JSON payloads using gzip compression to reduce the network footprint by over 80%.

Figure 11. A screenshot from Postman showing the original and compressed payload size

Conclusion

Data compression enables larger payload transfers and reduces network footprint. It can help to lower network latencies and optimize data transfer costs. When implementing compression within Lambda functions, it is important to consider its CPU-bound nature, which may increase function duration and costs. You should always evaluate the added compute cost against potential data transfer savings to make sure the technique benefits your use case.

Compression is most effective for handling large text-based payloads and when a slight increase in compute latency balanced by reduced network latency is acceptable.

To learn more about Serverless architectures and asynchronous Lambda invocation patterns, see Serverless Land.

Handling billions of invocations – best practices from AWS Lambda

2025-03-17 Chris McPeek

Post Syndicated from Chris McPeek original https://aws.amazon.com/blogs/compute/handling-billions-of-invocations-best-practices-from-aws-lambda/

This post is written by Anton Aleksandrov, Principal Solution Architect, AWS Serverless and Rajesh Kumar Pandey, Principal Engineer, AWS Lambda

AWS Lambda is a highly scalable and resilient serverless compute service. With over 1.5 million monthly active customers and tens of trillions of invocations processed, scalability and reliability are two of the most important service tenets. This post provides recommendations and insights for implementing highly distributed applications based on the Lambda service team’s experience building its robust asynchronous event processing system. It dives into challenges you might face, solution techniques, and best practices for handling noisy neighbors.

Overview

Developers building serverless applications create Lambda functions to run their code in the cloud. After uploading the code, the functions are invoked using synchronous or asynchronous mode.

Synchronous invocations are commonly used for interactive applications that expect immediate responses, such as web APIs. The Lambda service receives the invocation request, invokes the function handler, waits for the handler response, and returns it in response to the original request. With synchronous invocations, the client waits for the function handler to return, and is responsible for managing timeouts and retries for failed invocations.

Figure 1. Synchronous invocation sequence diagram

Asynchronous invocations enable decoupled function executions. Clients submit payloads for processing without expecting immediate responses. This is used for scenarios like asynchronous data processing or order/job submissions. The Lambda service immediately returns a confirmation for accepted invocation and proceeds to manage further handler invocation, timeouts, and retries asynchronously.

Figure 2. Asynchronous invocation sequence diagram

Asynchronous invocations under-the-hood

To accommodate asynchronous invocations, the Lambda service places requests into its internal queue and immediately returns HTTP 202 back to the client. After that, a separate internal poller component reads messages from the queue and synchronously invokes the function.

Figure 3. Asynchronous invocations workflow high-level topology

The same system also takes care of timeouts and retries in case of handler exceptions. When code execution completes, the system sends handler response to either onSuccess or onFailure destination, if configured.

Figure 4. Asynchronous invocations workflow detailed sequence diagram

Scaling highly distributed systems for billions of asynchronous requests presents unique challenges, such as managing noisy neighbors and potential traffic spikes to prevent system overload. Solutions vary by scale – what works for millions of requests may not suite billions. As workload size increases, solutions typically become more complex and costly, so right-sizing the approach is critical and should evolve with changing needs.

Simple queueing

A simple implementation of an asynchronous architecture can start with a single shared queue. This is a common approach for many asynchronous systems, particularly in early stages. It is effective when you’re not concerned about tenant isolation and when capacity planning indicates that a single queue can handle estimated incoming traffic efficiently.

Figure 5. Asynchronous workflow with a single queue

Even with this simple setup, it is critical to instrument your solution for observability to detect potential issues as soon as possible. You should monitor key metrics like queue backlog size, processing time, and errors, to indicate insufficient processing capacity early. Periods of unexpected traffic spikes and degraded performance may be a signal you have noisy neighbors impacting other tenants.Top of FormBottom of Form

To address this, you can scale your solution horizontally. You can implement random request placement across multiple queues to spread the load. Using a serverless service like Amazon SQS allows you to easily add and remove queues on-demand. One notable benefit of this approach is its simplicity – you do not need to introduce any complex routing mechanisms; requests are evenly spread across the queues. The downside is that you still do not have tenant boundaries. As your system grows, high-volume tenants and noisy neighbors can potentially affect all queues, thus impacting all tenants.

Figure 6. Asynchronous workflow with multiple queues and random request placement

Intelligent partitioning with consistent hashing

In order to further reduce potential impact, you can partition your tenants using sticky tenant-to-partition assignment with a hashing technique such as consistent hashing. This method uses a hash function to assign each tenant to a queue on a consistent hash ring.

Figure 7. Asynchronous workflow with multiple queues and consistent hashing placement

This technique ensures individual tenants stay in their queue partitions without the risk of disturbing the whole system. It helps to solve the problem where a few noisy neighbors have the potential to overflow all queues and as such impact all other tenants.

The consistent hashing approach proved to be efficient and enabled Lambda to offer robust asynchronous invocation performance to customers. As the volume of traffic and number of customers continued to grow, the Lambda service team came up with an innovative shuffle-sharding technique to further optimize the experience, and proactively eliminate any potential noisy-neighbor issues.

Shuffle-sharding

Drawing inspiration from the “The Power of Two Random Choices” paper, the Lambda team explored the shuffle-sharding technique for its asynchronous invocations processing. Using this technique, you shuffle-shard tenants into several randomly assigned queues. Upon receiving an asynchronous invocation, you place the message in the queue with the smallest backlog to optimize load distribution. This approach helps to minimize the likelihood of assigning tenants to a busy queue.

Figure 8. Asynchronous workflow with multiple queues and shuffle-sharding placement

To illustrate the benefit of this approach, consider a scenario where you’re using а 100 queues. The following formula helps to calculate the number of unique queue shards (combinations), where n is the total number of queues and r is the shard size (the number of queues you’re assigning per tenant).

With n=100, r=2 (each tenant is assigned randomly to 2 out of 100 queues), you get 4,950 unique combinations (shards). The probability of two tenants assigned to exactly the same shard is 0.02%. In case of r=3, the number of combinations spikes to 161,700. The probability of two tenants assigned to exactly the same shard drops to 0.0006%.

The shuffle-sharding technique proved remarkably effective. By distributing tenants across shards, the approach ensures that only a very small subset of tenants could be affected by a noisy neighbor. The potential impact is also minimized since each affected tenant maintains access to unaffected queues. As your workloads grow, increasing the number of queues enhances resilience and further reduces the probability of multiple tenants being assigned to the same shard. This significantly lowers the risk of a single point of failure, making shuffle sharding a robust strategy for workload isolation and fault tolerance.

Proactive detection, automated isolation, sidelining

Many distributed services will have a cohort of tenants with legitimate spiky asynchronous invocation traffic. This can be driven by seasonal factors, such as holiday shopping, or periodical batch processing. Recognizing these as real business needs, not malicious actions, you want to improve service quality for these tenants as well, while maintaining the overall system stability. For example, you can further improve solution performance by continuously monitoring queue depth to detect traffic spikes and route traffic to dynamically allocated dedicated queues. When you use Lambda asynchronous invocations, this internal complexity is managed for you by the service, ensuring seamless consumption experience.

Figure 9. Tenant D is automatically reallocated to a dedicated queue

Resilience and failure handling

“Everything fails, all the time” is a famous quote from Amazon’s Chief Technology Officer Werner Vogels. Lambda’s distributed and resilient architecture is built to withstand potential outages of its dependencies and internal components to limit the fallout for customers. Specifically for asynchronous invocation processing, the frontend service builds a processing backlog during an outage, allowing the backend to gradually recover without losing any in-flight messages.

Figure 10. Lambda service maintains resilience during component outage

Upon recovery, the service gradually ramps up the traffic to process the accumulated backlog. During this time, automated mechanisms are in place to coordinate between system components, preventing inadvertently DDoSing itself.

To further improve the recovery ramp-up process and provide a smooth restoration of normal operations, the Lambda service uses load-shedding technique to ensure fair resource allocation during recovery. While trying to drain the backlog as fast as possible, the service ensures that no single customer ends up consuming an outsized share of the available resources. Adopting such techniques can help you to improve your mean-time-to-recovery (MTTR).

Observability for asynchronous invocations processing

When using the Lambda service for asynchronous processing, you want to monitor your invocations for situational awareness and potential slowdowns. Use metrics such as AsyncEventReceived, AsyncEventAge, and AsyncEventDropped to get insights about internal processing.

AsyncEventReceived tracks the number of async invocations the Lambda service was able to successfully queue for processing. A drop in this metric indicates that invocations are not being delivered to the Lambda service and you should check your invocation source. Potential issues include misconfigurations, invalid access permissions, or throttling. Check your invocation source configuration, logs, and the function resource policy for further analysis.

AsyncEventAge tracks how long has a message spent in the internal queue before being processed by a function. This metric increases when async invocations processing is delayed due to insufficient concurrency, execution failures, or throttles. Increase your function concurrency to process more asynchronous invocations at a time and optimize function performance for better throughput, i.e. by increasing memory allocation to add more vCPU capacity. Experiment with adjusting batch size to enable functions to process more messages at a time. Use invocation logs to identify whether the problem is caused by function code throwing exceptions. Check Throttles and Errors metrics for further analysis.

AsyncEventDropped tracks the number of messages in the internal queue that were dropped because Lambda could not process them. This can be due to throttling, exceeding number of retries, exceeding maximum message age, or function code throwing an exception. Configure OnFailure destination or a dead-letter queue to avoid losing data and save dropped messages for re-processing. Use function logs and metrics described above to investigate whether you can address the issue by increasing function concurrency or allocating more memory.

By monitoring these metrics and addressing the underlying issues, you can ensure that your Lambda functions run smoothly, with minimal event processing delays and failures. You can also enable AWS X-Ray tracing to capture Lambda service traces. The AWS::Lambda trace segment captures the breakdown of the time that Lambda service spends routing requests to internal queues, the time a message spends in a queue, and the time before a function is invoked. This is a powerful tool to get insights into Lambda’s internal processing.

Conclusion

AWS Lambda processes tens of trillions of monthly invocations across more than 1.5 million active customers, demonstrating its exceptional scalability and resilience. Gaining an understanding of the underlying mechanisms of AWS services like Lambda enables you to proactively address potential challenges in your own applications. By learning how these services handle traffic, manage resources, and recover from failures, you can incorporate similar capabilities into your own solutions. For instance, leveraging Lambda’s asynchronous invocation metrics allows you to optimize workflow performance. This knowledge empowers you to implement strategies such as automated scaling, proactive monitoring, and graceful recovery during outages.

See below resources to learn about using queues and shuffle sharding at scale at Amazon

To learn more about Serverless architectures and asynchronous Lambda invocation patterns, see Serverless Land.

Introducing an enhanced local IDE experience for AWS Step Functions

2025-03-06 Chris McPeek

Post Syndicated from Chris McPeek original https://aws.amazon.com/blogs/compute/introducing-an-enhanced-local-ide-experience-for-aws-step-functions/

This post written by Ben Freiberg, Senior Solutions Architect.

AWS Step Functions introduces an enhanced local IDE experience to simplify building state machines. Workflow Studio is now available within Visual Studio Code (VS Code) through the AWS Toolkit extension. With this integration, developers can author and edit state machines in their local IDE using the same powerful visual authoring experience found in the AWS Console.

Step Functions is a visual workflow service that helps developers use AWS services to build distributed applications, automate processes, orchestrate microservices, and create data and machine learning (ML) pipelines.

Customers choose Step Functions to build workflows that involve multiple services such as AWS Lambda, AWS Fargate, Amazon Bedrock, and HTTP API integrations. Developers create these workflows as state machines through the AWS Console using Workflow Studio or as code using Amazon States Language (ASL), a JSON-based domain specific language. Developers maintain their workflows definitions alongside the application and Infrastructure as code (IaC) code. Now, builders have even more capabilities to build and test their workflow in VS Code that matches the same experience as in AWS console.

Simplifying local workflow development

The integrated Workflow Studio provides developers with a seamless experience for building Step Functions workflows within their local IDE. You’ll use the same canvas used in the AWS Console to drag and drop states to build your workflows. As you modify the workflow visually, the ASL definition updates automatically, so you can focus on business logic rather than syntax. The Workflow Studio integration offers the same intuitive and visual approach to designing state machines as the AWS Console, without switching context.

Getting started

To use the updated IDE experience, verify that you have the AWS Toolkit with at least version 3.49.0 installed as a VS Code Extension.

Figure 1: AWS Toolkit update available

After installing the AWS Toolkit extension, you can start building with Workflow Studio by opening a state machine definition. You can use a definition file from your local workspace or use AWS Explorer to download an existing state machine definition from the cloud. VS Code integration supports ASL definitions in JSON and YAML formats. (Note: Files must end in .asl.json, asl.yml or .asl.yaml for Workflow Studio to automatically open the file.) While working with YAML files, Workflow Studio converts the definition to JSON for editing, then converts back to YAML before saving.

Figure 2: Design mode in Workflow Studio

Workflow Studio in VS Code supports both the Design and Code mode. Design mode provides a graphical interface to build and inspect your workflows. In Code mode, you can use an integrated code editor to view and edit the Amazon States Language (ASL) definition of your workflows. You can always switch back to text-based editing by selecting the Return to Default Editor link at the top right of Workflow Studio, as shown in the following screen.

Figure 3: Code mode in Workflow Studio

To open Workflow Studio in VS Code manually, you can use the “Open with Workflow Studio” action at the top of a workflow definition file or the icon in the top right of the editor pane. Both options are highlighted in the following screen. Additionally, you can use the file context-menu to open Workflow Studio from the file explorer pane.

Figure 4: Integrations of Workflow Studio into the editor

Edits you make in Workflow Studio are automatically synced to the underlying file as unsaved changes. To persist your changes, you must either save the changes from Workflow Studio or the file editor. Similarly, any changes you make to the local file are synced to Workflow Studio on save.

Workflow Studio is aware of Definition Substitutions, so you can even edit workflows which have been integrated with your IaC tooling like AWS CloudFormation or the AWS Cloud Development Kit (CDK). Definition Substitutions is a feature of CloudFormation that lets you add dynamic references in your workflow definition to a value that you provide in your IaC template.

AWSTemplateFormatVersion: "2010-09-09"
Description: "State machine with Definition Substitutions"
Resources:
  MyStateMachine:
    Type: AWS::StepFunctions::StateMachine
    Properties:
      StateMachineName: HelloWorld-StateMachine
      DefinitionS3Location:
        Bucket: amzn-s3-demo-bucket
        Key: state-machine-definition.json
      DefinitionSubstitutions:
        TableName: DemoTable

You can then use the Definition Substitutions in the definition of your state machine.

"Write message to DynamoDB": {
  "Type": "Task",
  "Resource": "arn:aws:states:::dynamodb:putItem",
  "Next": "Remove message from SQS queue",
  "Arguments": {
    "TableName": "${TableName}",
    "Item": {
      ... omitted for brevity ...
     }
  },
  "Output": "{% $states.input %}"
}

The code defines a Step Functions task state that writes a message to DynamoDB using the putItem operation. The ${TableName} substitution syntax allows for a dynamic DynamoDB table name that can be passed as a parameter when the state machine is executed.

Testing and Deployment

Workflow Studio integration supports testing a single state through Step Functions TestState API. With the TestState API, you can test a state in the cloud from your local IDE without creating a state machine or updating an existing state machine. With the power of localized granular testing, you can build and debug changes for individual states without needing to invoking the entire state machine. For example, you can refine the input or output processing, or update the conditional logic in a choice state without ever leaving your IDE.

Testing a state

Open any state machine definition file in Workflow Studio
Select a state from the canvas or the code tab
Open the Inspector panel on the right side if not already open
Figure 5: Arguments of an Individual state
Select Test state button at the top
Select your IAM role and add the input. Make sure that the role has the necessary permissions for using TestState API
If your state contains any Definition Substitutions, you’ll see an additional section where you can replace them with your specific values.
Select Start Test

Figure 6: TestState configuration with a Definition Substitution

After the test succeeds, you can publish your workflow from the IDE using the AWS Toolkit. You can also use IaC tools such as AWS Serverless Application Model, AWS CDK, or CloudFormation to deploy your state machine.

Conclusion

Step Functions is introducing an enhanced local IDE experience to simplify the development of workflows using the VS Code IDE and AWS Toolkit. This streamlines the code-test-deploy-debug cycle and offers developers a seamless integration of Workflow Studio. By combining visual workflow design with the power of a full-featured IDE, developers can now build Step Functions workflows more efficiently.

To get started, install the AWS Toolkit for Visual Studio Code and visit the user guide on Workflow Studio integration. Find hands-on examples, best practices, and useful resources for AWS serverless at Serverless Land.

Introducing JSONL support with Step Functions Distributed Map

2025-02-08 Eric Johnson

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/introducing-jsonl-support-with-step-functions-distributed-map/

This post written by Uma Ramadoss, Principal Specialist SA, Serverless and Vinita Shadangi, Senior Specialist SA, Serverless.

Today, AWS Step Functions is expanding the capabilities of Distributed Map by adding support for JSON Lines (JSONL) format. JSONL, a highly efficient text-based format, stores structured data as individual JSON objects separated by newlines, making it particularly suitable for processing large datasets.

This new capability enables you to process large collection of items stored in JSONL format directly through Distribtued Map and optionally exports the output of the Distributed Map as JSONL file. The enhancement also introduces support for additional delimited file formats, including semicolon and tab-delimited files, providing greater flexibility in data source options. Furthermore, new flexible output transformations gives developers more control over result formatting, enabling better integration with downstream processes for efficient data handling.

Overview

Distributed Map enables parallel processing of large-scale data by concurrently running the same processing steps for millions of entries in a dataset at the maximum scale of 10000. This is particularly useful for use cases like large scale payroll processing, image conversion, document processing and data migrations. Previously, the dataset can come from state input, JSON/CSV files in S3 and collection of S3 objects. With this new feature, the dataset can be a JSONL file in Amazon S3.

The AWS Step Functions workflow

Consider an example of end-to-end GenAI batch inferencing using Amazon Bedrock. Batch inference helps you process a large number of requests efficiently by bundling them as single request and storing the results in an S3 bucket. Since both input and output are handled as JSONL files, the blog uses the scenario as an example to demonstrate the new capabilities of Distributed Map.

The diagram below shows the end-to-end flow –

Step Functions workflow (Batch inference input generation worfklow) uses Distributed Map to build and bundle AI prompts for a collection of product review data. Workflow then invokes the Amazon Bedrock batch inference API.
Amazon Bedrock stores the results in S3 as JSONL file when the batch inference is completed.
An S3 object created event invokes the second Step Functions workflow (Batch inference output processing workflow) that processes the JSONL file and loads the results into an Amazon DynamoDB table.

Diagram showing the overall architecture for the end to end flow

Batch inferencing workflow

Introducing new output transformations through batch inference input generation workflow

The batch inference input generation workflow processes product review data in S3 using Distributed Map. Distributed Map spins multiple child workflows that generate AI prompts for sentiment analysis of each product review and exports the results of the child workflows to S3 as a JSONL file. The workflow calls Amazon Bedrock batch inference API (CreateModelInvocationJob) with the JSONL file as input upon completion of the Distributed Map state. Since the inference API operates asynchronously, the workflow completes immediately after receiving a successful response from the API.

Diagram of the AWS Step Function workflow that creates the batch process

Batch inference input generation workflow

Each child workflow receives a batch of product reviews as an array. It operates on the array using Pass state to create an array of AI prompts, one for each item. The Pass state manipulates the input using JSONata expressions, generates unique recordId using JSONata numeric functions, and outputs the results in a format Amazon Bedrock expects.

JSONata transformation to generate prompts

Once all child workflows are complete, Distributed Map uses the new output transformations to export the outputs from child workflows to S3.

Using the new output transformations to export in JSONL format

Distributed Map now offers more flexible output handling through an optional writer configuration. While it traditionally exports child workflow execution results to three separate JSON files (successful, failed, and pending), the new writer configuration streamlines the output and supports JSONL format in addition to JSON format.
The previous export option included comprehensive execution details – metadata, child workflow inputs, and outputs. The new configuration allows you to streamline output to include only the child workflow execution results, which are valuable for map/reduce patterns, where the output from one Distributed Map needs to feed directly into another without the need for additional transformation steps.

JSON structure showing the ASL for Step Functions

Output writer config for JSONL

Writer config also allows you to flatten the output array. When a child workflow processes batches of the input, it produces an array of results which will eventually become an array of arrays when the Distributed Map aggregates the outputs from all child workflows. With the new output transformation called FLATTEN, you can choose to flatten the array without additional code.

JSON examples showing the multiple arrays being flattened

Flattening output in JSONL

Introducing the new ItemReader for JSONL using batch inference output processing workflow

The second workflow processes output of the batch inference job by launching multiple child workflows using Distributed Map. Each child workflow processes batches of items, examining them for error objects and separating successful inferences from errors. The workflow then loads all successful inferences into a DynamoDB table while sending errors to a dead letter queue for subsequent analysis.

AWS Step Functions workflow architecture

Processing inference results

Using the new InputType to read the JSONL inference results

The Distributed Map in the batch inference results processing workflow uses the newly supported ItemReader-InputType, JSONL. Previously, the InputType only accepted CSV, JSON, and MANIFEST, which is an S3 Inventory manifest file.

AWS Step Functions ASL showing how to read the JSONL file

Reading JSONL file

There is no other change to how Distributed Map processes and shares data with child workflows. The Pass state in the child workflow receives batches of Items from the Map, and uses JSONata expressions to separate the errors from successful items.

AWS Step Functions ASL showing JSONata to separate processing for errors

Separating successful processing from errors

The following shows the input received by the Pass state and the output generated by the state using the above JSONata expression.

Resulting JSON showing processed records

Sample successful processing records

Using S3 events as connective tissue between the workflows

When Amazon Bedrock completes the batch inference job, it stores the output in the S3 location specified in the API request. An EventBridge rule triggers the batch inference results processing workflow using S3 event notifications. The rule looks for “Object Created” event from the specified S3 bucket and a wildcard pattern for JSONL file extension. When the rule matches the incoming event, it triggers the workflow.

EventBridge rule

You can detect failed batch inference jobs by setting up EventBridge rules that listen to Amazon Bedrock status events. Since failed jobs don’t create output files in S3, monitoring status events directly ensures you catch and handle job failures.

Key considerations

The new output transformations do not change the information in the FAILED execution results file in order to help you analyze the reasons for failures. To learn more about the output transformation configurations, visit the documentation.
The new transformation mode FLATTEN, COMPACT stores only the output of the execution results. To inspect the results for fact checking or troubleshooting, use the default transformation.
As a best practice, when implementing code changes, it’s advised to use the versioning and aliasing feature for gradual deployment of changes to production.
When using Distributed Map, there is an option to configure the child workflow as either Standard or Express. Express is the recommended choice if each iteration (child workflow) can be completed within 5 minutes, and batching items will help optimize costs. To learn more about optimizations for Distributed Map, visit the workshop.

Conclusion

Step Functions Distributed Map is a powerful feature that enables developers to create large-scale data processing solutions with ease, eliminating concerns about operational aspects and software challenges like batching, concurrency, and failure handling. The addition of JSONL support for both input and output expands workload capabilities and minimizes additional effort through transformations by natively deserializing and flattening the output. This blog demonstrated the new feature’s capabilities through a practical example of building large-scale data processing applications using Distributed Map.

For more information on Distributed Map and how to use it with JSONL files, refer to the user guide.

To explore generative AI samples with Step Functions, visit the the GitiHub repo.

To expand your serverless knowledge, visit Serverless Land.

AWS Lambda introduces recursive loop detection APIs

2024-08-20 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/aws-lambda-introduces-recursive-loop-detection-apis/

This post is written by James Ngai, Senior Product Manager, AWS Lambda, and Aneel Murari, Senior Specialist SA, Serverless.

Today, AWS Lambda is announcing new recursive loop detection APIs that allow you to set recursive loop detection configuration on individual Lambda functions. This allows you to turn off recursive loop detection on functions that intentionally use recursive patterns, avoiding disruption of these workloads. You can use these APIs to avoid disruption to any intentionally recursive workflows as Lambda expands support of recursive loop detection to other AWS services.

Overview

AWS Lambda functions are triggered in response to events generated by various AWS services. These Lambda functions may interact with other AWS services by invoking the corresponding service APIs. Typically, the service and resource that generates the triggering event is distinct from the service and resource that the Lambda function calls. However, due to coding errors or configuration issues, there may be situations where these two resources are the same, leading to an infinite or recursive loop. Such misconfigurations can result in runaway workloads, which can incur unplanned usage and charges to your AWS account. For example, a Lambda function processes messages from an Amazon Simple Notification Service (SNS) topic but then puts the resulting notification back to the same SNS topic. This causes an infinite loop.

Lambda provides a built-in preventative guardrail that detects and stops functions running in a recursive or infinite loop between Lambda, Amazon Simple Queue Service (SQS), and SNS. This feature, known as recursive loop detection, is enabled by default for all Lambda functions. This serves as a protective mechanism against unintended usage and unexpected billing from runaway workloads.

Lambda uses an AWS X-Ray trace header primitive called “lineage” to track the number of times a function has been invoked with an event. When your function code sends an event using a supported AWS SDK version, Lambda increments the counter in the lineage header. If your function is then invoked with the same triggering event more than 16 times, Lambda stops the next invocation for that event and emits an Amazon CloudWatch RecursiveInvocationsDropped metric. If the function is invoked synchronously, Lambda returns a RecursiveInvocationException to the caller. For asynchronous invocations, Lambda sends the event to a dead-letter queue or on-failure destination if one is configured.

You do not need to configure active X-Ray tracing for this feature to work. For more information on this feature and an example scenario, please refer to Detecting and stopping recursive loops in AWS Lambda functions.

Although AWS generally discourages this practice due to the possibility of runaway workloads, some customers intentionally employ recursive patterns in their workflows. Previously, customers that run workloads that intentionally use recursive patterns could only opt-out of recursive loop detection on a per-account basis by contacting AWS Support. With these new APIs, customers can selectively opt-out of recursive loop detection on individual functions while maintaining this preventative guardrail for the remaining functions in their account that do not use recursive code.

Today we are introducing two new API actions for recursive loop detection:

GetFunctionRecursiveConfig returns details about a function’s recursive loop detection configuration.
PutFunctionRecursiveConfig sets the recursive loop detection configuration for a function. By default, recursive loop detection is turned ON for all functions.

How to use the new recursive loop detection APIs

You can configure recursive loop detection for Lambda functions through the Lambda Console, the AWS CLI, or Infrastructure as Code tools like AWS CloudFormation, AWS Serverless Application Model (AWS SAM), or AWS Cloud Development Kit (CDK). This new configuration option is supported in AWS SAM CLI version 1.123.0 and CDK v2.153.0.

If you turn recursive loop detection off for a function, the metric for RecursiveInvocationsDropped is no longer emitted for that function.

Turning off recursive loop detection on your function means that Lambda no longer prevents recursive invocations caused by misconfiguration. This may lead to unexpected usage and billing to your AWS account. You should explore alternate ways of architecting your workload that do not use recursive patterns. AWS recommends you exercise caution when turning off this guardrail feature.

Setting recursive loop detection configuration on a function using the Lambda Console

You can get recursive loop detection configuration in the AWS Lambda console:

In the AWS Lambda Console, navigate to the Functions page. Select the function that uses intentionally recursive patterns.
Select Configuration. You can find recursive loop detection controls under the Concurrency and recursion detection section.

Recursive loop detection controls in the Lambda console

Recursive loop detection is turned on by default for all functions. You can change the recursive loop detection configuration of a function by choosing Edit.
To turn off recursive loop detection for a function, select Allow recursive loops and select Save.

Setting to allow recursive loops

Setting recursive loop detection configuration using the AWS CLI

You can get the current recursion loop detection configuration of a Lambda function by using the following CLI command:

aws lambda get-function-recursion-config \
--region $AWS_REGION \
--function-name $FUNCTION_NAME

You can update the recursion loop detection configuration for a Lambda function by using the following CLI command:

aws lambda put-function-recursion-config \
--region $AWS_REGION \
--function-name $FUNCTION_NAME \
--recursive-loop Allow|Terminate

Make sure to set appropriate values for AWS_REGION and FUNCTION_NAME in the previous commands. Setting the put-function-recursion-config parameter to Allow turns off the default behavior of detecting recursive loops. Set this value to Terminate to switch back to default behavior.

Setting recursive loop detection configuration using AWS CloudFormation

You can control the recursive loop detection configuration for a Lambda function by setting the RecursiveLoop resource property in CloudFormation. Setting the value of this property to Allow turns off the default behavior of automatically detecting recursive loops. Set this property to Terminate if you want to switch it back to the default behavior. The following CloudFormation snippet shows RecursiveLoop set to Allow.

LambdaFunction:
    Type: AWS::Lambda::Function                                                                                                                                                                                    
    Properties:                                                                                                                                                                                       
      Code:                                                                                                                                                                                          
        S3Bucket:S3_BUCKET                                                                                                                                                                            
        S3Key: S3_KEY      
      Handler: com.example.App::handleRequest                                                                                                                                                        
      MemorySize: 1024
      Role:                                                                                                                                                                                          
        Fn::GetAtt:                                                                                                                                                                                  
        - LambdaFunctionRole                                                                                                                                                                         
        - Arn                                                                                                                                                                               
      Runtime: java17
      RecursiveLoop : Allow                                                                                                                                                                                                                                                                           
      Timeout: 20                                                                                                                                                                        
      TracingConfig:                                                                                                                                                                               
        Mode: Active

Extending recursive loop detection to additional AWS services

Today, recursive loop detection detects and stops loops between Lambda, SQS, and SNS after approximately 16 invocations. Lambda plans to extend support for recursive loop detection to additional AWS services. Using the APIs, you can turn off recursive loop detection for specific functions that use recursive patterns so that they are not impacted when Lambda expands recursive loop detection to additional AWS services in the future.

One way you can identify functions that use recursive patterns is by using the CloudWatch metric RecursiveInvocationsDropped.

Set a CloudWatch alarm on all Lambda functions for the CloudWatch metric RecursiveInvocationsDropped. Configure the alarm to trigger when the metric is greater than a threshold of zero. Refer to CloudWatch documentation to set alarms. You can use the following CLI command to set this alarm:

aws cloudwatch put-metric-alarm --alarm-name lambda-recursive-alarm --metric-name RecursiveInvocationsDropped --namespace AWS/Lambda --statistic Sum --period 60 --threshold 0 --comparison-operator GreaterThanOrEqualToThreshold --evaluation-periods 1 --alarm-actions $arn-of-sns-notification-topic

When Lambda detects recursive invocations, it will emit the RecursiveInvocationsDropped metric, which will trigger the alarm. Note that Lambda will only detect and stop recursive invocations if all the services within the loop support recursive loop detection.
Navigate to the CloudWatch Console and determine which function has emitted the RecursiveInvocationsDropped metric. On the Browse tab, under Metrics, choose to view metrics By Function Name and search for RecursiveInvocationsDropped. This will list all functions that have emitted that metric.

RecursiveInvocationsDropped metric

Determine if recursion is the intended pattern for that function. If so, use the recursive loop detection API to turn off recursive loop detection for this function.

Conclusion

Lambda recursive loop detection automatically detects and stops recursive invocations between Lambda and supported services, preventing runaway workloads. In most cases, you should architect your workloads to avoid any recursive loops. In rare and special circumstances, you may want to turn off the default behavior on a case-by-case basis. The recursive loop detection APIs allow you to set recursive loop detection configuration on individual functions.

This feature is available in all AWS Regions where Lambda supports recursive loop detection.

To learn more about these APIs, refer to the AWS Lambda API Reference.

For more serverless learning resources, visit Serverless Land

Using the circuit-breaker pattern with AWS Lambda extensions and Amazon DynamoDB

2024-05-16 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/using-the-circuit-breaker-pattern-with-aws-lambda-extensions-and-amazon-dynamodb/

This post is written by Alan Oberto Jimenez, Senior Cloud Application Architect, and Tobias Drees, Cloud Application Architect.

Modern software systems frequently rely on remote calls to other systems across networks. When failures occur, they can cascade across multiple services causing service disruptions. One technique for mitigating this risk is the circuit breaker pattern, which can detect and isolate failures in a distributed system. The circuit breaker pattern can help prevent cascading failures and improve overall system stability.

The pattern isolates the failing service and thus prevents cascading failures. It improves the overall responsiveness by preventing long waiting times for timeout periods. Furthermore, it also increases the fault tolerance of the system since it lets the system interact with the affected service again once it is available again.

This blog post presents an example application, showing how AWS Lambda extensions integrate with Amazon DynamoDB to implement the circuit breaker pattern.

Using Lambda extensions to implement the circuit breaker pattern

AWS Lambda extensions provide a way to integrate monitoring, observability, security, and governance tools into the Lambda execution environment without complex installation or configuration management. You can run extensions both as part of the runtime process with an internal extension or as a separate process in the execution environment with an external extension.

Lambda extensions enable the circuit breaker pattern without modifying the core function code. An external extension checks in a separate runtime whether a certain service is reachable or not. This approach decouples the business logic in the Lambda function from failure detection, allowing for the reuse of this Lambda extension across different Lambda functions. Both decoupling of code with different purposes and code reuse is in line with the best practices for building Lambda functions.

Pinging a microservice at each Lambda invocation increases network traffic and latency. Circuit breaker implementations benefit from a caching layer to store the state of the microservices. The Lambda extension fetches the status of a microservice from a database and stores the result in memory for a specified time avoiding a disk write. The Lambda function checks the extension cache before pinging the microservice reducing network traffic. Lambda extensions are an ideal tool to build a caching layer for Lambda functions since its in-memory cache makes it more secure, easier to manage, and more performant due to higher availability compared to calling a network resource instead.

Overview

The main function process handles the event after every AWS Lambda invocation. Before performing any external call against the external components, it listens for HTTP POST events from the Lambda extension process to fetch the last status of the circuits.
The extension process provides the circuit state to the main process via HTTP POST.
1. The extension checks its internal cache and returns a valid value if available, otherwise reads the state of the circuits from the DynamoDB table and updates the cache.
2. Finally, the extension process returns the state of the circuits to the main function via an API call response.
3. Because of the Lambda extensions lifecycle, this process occurs periodically to keep the local cache updated until the execution environment is terminated.
If the circuit is in the OPEN state, the main function process executes calls against the external microservices, otherwise the process returns a local response.
An Amazon EventBridge event periodically invokes a Lambda responsible for updating the circuit states.
This Lambda function performs the validations needed to determine the status of the different remote microservices (circuits) with an Amazon API Gateway entrypoint.
The Lambda function writes the result of the verification process to the DynamoDB table.

Walkthrough

The following prerequisites are required to complete the walkthrough:

An active AWS account
AWS CLI 2.15.17 or later
AWS SAM CLI 1.116.0 or later
Git 2.39.3 or later
Python 3.12

Initial setup

Clone the code from GitHub onto a local machine:

git clone https://github.com/aws-samples/implementing-the-circuit-breaker-pattern-with-lambda-extensions-and-dynamodb.git

To install the packages, utilize a virtual environment:

python -m venv circuit_breaker_venv && source circuit_breaker_venv/bin/activate

To prepare the services for deployment, execute the following AWS Serverless Application Model (SAM) command:
```
sam build
```
To deploy the services, use this command specifying the AWS CLI profile (in the config file in the .aws folder) for the AWS account to deploy the services in:
```
sam deploy --guided --profile <AWSProfile>
```
Answer the question prompts as appropriate.
You can deploy subsequent local changes in the code with:
```
sam build 
sam deploy
```

Testing and adjusting the solution

The Lambda function updating the state in DynamoDB runs every minute as specified by the template. After the function has run for the first time after 1 minute, the DynamoDB entry containing the status (“OPEN” or “CLOSED”) is ready. Since the mock API is part of the stack, the status is “OPEN”.

You can invoke the My Microservice Lambda function manually to see:

The Lambda function updating the state in DynamoDB is invoked with an EventBridge rule that specifies the URL and the ID of the service to be monitored. By creating a new EventBridge rule with the correct URL and a new ID, you can use the AWS SAM template for monitoring multiple services.

To add a new EventBridge rule, add this to the template:

  NewEventRule:
    Type: AWS::Events::Rule
    Properties:
      Description: Event rule to trigger the Lambda function with a JSON payload
      ScheduleExpression: rate(1 minute) 
      State: ENABLED
      Targets:
        - Arn: !GetAtt UpdatingStateLambda.Arn
          Id: TargetFunction
          Input: '{ "URL": "https://aws.amazon.com/", "ID": "NewMicroservice"}'  # Add the JSON payload here

  MyPermissionForNewEventRule:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref UpdatingStateLambda
      Action: lambda:InvokeFunction
      Principal: events.amazonaws.com
      SourceArn: !GetAtt NewEventRule.Arn

In the Lambda function that contains the business logic, add the following environment variables. However, for more complex cases with multiple microservices to be monitored, it’s recommended to use AWS Config. Using AWS Config, configurations for Lambda functions can be stored to enable more granular control than with environment variables.

Environment:
        Variables:
          service_name: "NewMicroservice"

You can adjust the logic of this Lambda function by changing the code in my-microservice/lambda-handler.py or directly in the Lambda section of the AWS Management Console.

If you end up using your own Lambda function to use the circuit breaker Lambda extension, include the circuit breaker extension as a layer:

BusinessLogicMicroservice:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: business-logic-microservice/
      Handler: lambda_function.lambda_handler
      MemorySize: 128
      Policies:
      - DynamoDBCrudPolicy:
          TableName: !Ref CircuitBreakerStateTable
      Timeout: 100
      Runtime: python3.8
      Layers:
      - !Ref CircuitBreakerExtensionLayer

Circuit breaker in closed state

So far, the sample application only features an open circuit breaker state signaling a functioning microservice. This section simulates an unresponsive microservice to test the behavior of the system with a closed-circuit breaker state.

Edit the environment variables of the MyMicroservice Lambda function in line 47 of the template.yaml file and the URL of the input to the Lambda updating the state in the event rule in line 107 to a domain that times out such as ”https://aws.amazon.com:81/“.
```
API_URL: "https://aws.amazon.com:81/"
Input: '{ "URL": "https://aws.amazon.com:81/", "ID": "MyMicroservice"}'
```
Deploy these changes:
```
sam build
sam deploy
```

The event rule invokes the Lambda function, updating the state every minute. To see the output of this Lambda function, invoke it manually:

This Lambda function changes the DynamoDB entry for this URL to:

The MyMicroservice Lambda function receives the DynamoDB entries for the status over HTTP from the Circuit Breaker Lambda extension and proceeds with the logic following a closed state. The output of invoking the Lambda manually is:

This shows the circuit breaker pattern working as intended. In the Lambda updating state, the time it takes for the Lambda function to throw a timeout exception is defined as 4 seconds and can be adjusted to the use case.

requests.get(API_URL, headers=headers, timeout=4)

Clean-up

To delete all resources from this stack, run:

sam delete --stack-name new-circuit-breaker-sam-stack

Security

The provided AWS SAM template does not provide an Amazon Virtual Private Cloud (VPC) in which to host the resources. Integrate the resources into an appropriate networking configuration if you are using it in production applications.

The solution has auditability characteristics, as calls to the circuit breaker and to the microservices are logged to the Amazon CloudWatch log group. The audit log is encrypted using AWS Key Management Service.

To monitor the security of your account with the solution, use Amazon GuardDuty, AWS CloudTrail, AWS Config, and AWS WAF for API Gateway.

Conclusion

The circuit breaker pattern is a powerful tool for helping to ensure the resiliency and stability of serverless applications. Lambda extensions are a good fit for its implementation, as demonstrated in this example. With the provided Lambda extension and code, you can incorporate the circuit breaker pattern into your applications and customize it to suit your specific requirements, helping to ensure a robust and reliable system.

For more serverless learning resources, visit Serverless Land.

Running code after returning a response from an AWS Lambda function

2024-05-07 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/running-code-after-returning-a-response-from-an-aws-lambda-function/

This post is written by Uri Segev, Principal Serverless Specialist SA.

When you invoke an AWS Lambda function synchronously, you expect the function to return a response. For example, this is the case when a client invokes a Lambda function through Amazon API Gateway or from AWS Step Functions. As the client is waiting for the response, you should return the response as soon as possible.

However, there may be instances where you must perform additional work that does not affect the response and you can do it asynchronously, after you send the response. For example, you may store data in a database or send information to a logging system.

Once you send the response from the function, the Lambda service freezes the runtime environment, and the function cannot run additional code. Even if you create a thread for running a task in the background, the Lambda service freezes the runtime environment once the handler returns, causing the thread to freeze until the next invocation. While you can delay returning the response to the client until all work is complete, this approach can negatively impact the user experience.

This blog explores ways to run a task that may start before the function returns but continues running after the function returns the response to the client.

Invoking an asynchronous Lambda function

The first option is to break the code into two functions. The first function runs the synchronous code; the second function runs the asynchronous code. Before the synchronous function returns, it invokes the second function asynchronously, either directly, using the Invoke API, or indirectly, for example, by sending a message to Amazon SQS to trigger the second function.

This Python code demonstrates how to implement this:

import json
import time
import os
import boto3
from aws_lambda_powertools import Logger

logger = Logger()
client = boto3.client('lambda')

def calc_response(event):
    logger.info(f"[Function] Calculating response")
    time.sleep(1) # Simulate sync work
    return {
        "message": "hello from async"
    }

def submit_async_task(response):
    # Invoke async function to continue
    logger.info(f"[Function] Invoking async task in async function")
    client.invoke_async(FunctionName=os.getenv('ASYNC_FUNCTION'), InvokeArgs=json.dumps(response))

def handler(event, context):
    logger.info(f"[Function] Received event: {json.dumps(event)}")

    response = calc_response(event)
    
    # Done calculating response, submit async task
    submit_async_task(response)

    # Return response to client
    logger.info(f"[Function] Returning response to client")
    return {
        "statusCode": 200,
        "body": json.dumps(response)
    }

The following is the Lambda function that performs the asynchronous work:

import json
import time
from aws_lambda_powertools import Logger

logger = Logger()

def handler(event, context):
    logger.info(f"[Async task] Starting async task: {json.dumps(event)}")
    time.sleep(3)  # Simulate async work
    logger.info(f"[Async task] Done")

Use Lambda response streaming

Response streaming enables developers to start streaming the response as soon as they have the first byte of the response, without waiting for the entire response. You usually use response streaming when you must minimize the Time to First Byte (TTFB) or when you must send a response that is larger than 6 MB (the Lambda response payload size limit).

Using this method, the function can send the response using the response streaming mechanism and can continue running code even after sending the last byte of the response. This way, the client receives the response, and the Lambda function can continue running.

This Node.js code demonstrates how to implement this:

import { Logger } from '@aws-lambda-powertools/logger';

const logger = new Logger();

export const handler = awslambda.streamifyResponse(async (event, responseStream, _context) => {
    logger.info("[Function] Received event: ", event);
  
    // Do some stuff with event
    let response = await calc_response(event);
    
    // Return response to client
    logger.info("[Function] Returning response to client");
    responseStream.setContentType('application/json');
    responseStream.write(response);
    responseStream.end();

    await async_task(response);   
});

const calc_response = async (event) => {
    logger.info("[Function] Calculating response");
    await sleep(1);  // Simulate sync work

    return {
        message: "hello from streaming"
    };
};

const async_task = async (response) => {
    logger.info("[Async task] Starting async task");
    await sleep(3);  // Simulate async work
    logger.info("[Async task] Done");
};

const sleep = async (sec) => {
    return new Promise((resolve) => {
        setTimeout(resolve, sec * 1000);
    });
};

Use Lambda extensions

Lambda extensions can augment Lambda functions to integrate with your preferred monitoring, observability, security, and governance tools. You can also use an extension to run your own code in the background so that it continues running after your function returns the response to the client.

There are two types of Lambda extensions: external extensions and internal extensions. External extensions run as separate processes in the same execution environment. The Lambda function can communicate with the extension using files in the /tmp folder or using a local network, for example, via HTTP requests. You must package external extensions as a Lambda layer.

Internal extensions run as separate threads within the same process that runs the handler. The handler can communicate with the extension using any in-process mechanism, such as internal queues. This example shows an internal extension, which is a dedicated thread within the handler process.

When the Lambda service invokes a function, it also notifies all the extensions of the invocation. The Lambda service only freezes the execution environment when the Lambda function returns a response and all the extensions signal to the runtime that they are finished. With this approach, the function has the extension run the task independently from the function itself and the extension notifies the Lambda runtime when it is done processing the task. This way, the execution environment stays active until the task is done.

The following Python code example isolates the extension code into its own file and the handler imports and uses it to run the background task:

import json
import time
import async_processor as ap
from aws_lambda_powertools import Logger

logger = Logger()

def calc_response(event):
    logger.info(f"[Function] Calculating response")
    time.sleep(1) # Simulate sync work
    return {
        "message": "hello from extension"
    }

# This function is performed after the handler code calls submit_async_task 
# and it can continue running after the function returns
def async_task(response):
    logger.info(f"[Async task] Starting async task: {json.dumps(response)}")
    time.sleep(3)  # Simulate async work
    logger.info(f"[Async task] Done")

def handler(event, context):
    logger.info(f"[Function] Received event: {json.dumps(event)}")

    # Calculate response
    response = calc_response(event)

    # Done calculating response
    # call async processor to continue
    logger.info(f"[Function] Invoking async task in extension")
    ap.start_async_task(async_task, response)

    # Return response to client
    logger.info(f"[Function] Returning response to client")
    return {
        "statusCode": 200,
        "body": json.dumps(response)
    }

The following Python code demonstrates how to implement the extension that runs the background task:

import os
import requests
import threading
import queue
from aws_lambda_powertools import Logger

logger = Logger()
LAMBDA_EXTENSION_NAME = "AsyncProcessor"

# An internal queue used by the handler to notify the extension that it can
# start processing the async task.
async_tasks_queue = queue.Queue()

def start_async_processor():
    # Register internal extension
    logger.debug(f"[{LAMBDA_EXTENSION_NAME}] Registering with Lambda service...")
    response = requests.post(
        url=f"http://{os.environ['AWS_LAMBDA_RUNTIME_API']}/2020-01-01/extension/register",
        json={'events': ['INVOKE']},
        headers={'Lambda-Extension-Name': LAMBDA_EXTENSION_NAME}
    )
    ext_id = response.headers['Lambda-Extension-Identifier']
    logger.debug(f"[{LAMBDA_EXTENSION_NAME}] Registered with ID: {ext_id}")

    def process_tasks():
        while True:
            # Call /next to get notified when there is a new invocation and let
            # Lambda know that we are done processing the previous task.

            logger.debug(f"[{LAMBDA_EXTENSION_NAME}] Waiting for invocation...")
            response = requests.get(
                url=f"http://{os.environ['AWS_LAMBDA_RUNTIME_API']}/2020-01-01/extension/event/next",
                headers={'Lambda-Extension-Identifier': ext_id},
                timeout=None
            )

            # Get next task from internal queue
            logger.debug(f"[{LAMBDA_EXTENSION_NAME}] Wok up, waiting for async task from handler")
            async_task, args = async_tasks_queue.get()
            
            if async_task is None:
                # No task to run this invocation
                logger.debug(f"[{LAMBDA_EXTENSION_NAME}] Received null task. Ignoring.")
            else:
                # Invoke task
                logger.debug(f"[{LAMBDA_EXTENSION_NAME}] Received async task from handler. Starting task.")
                async_task(args)
            
            logger.debug(f"[{LAMBDA_EXTENSION_NAME}] Finished processing task")

    # Start processing extension events in a separate thread
    threading.Thread(target=process_tasks, daemon=True, name='AsyncProcessor').start()

# Used by the function to indicate that there is work that needs to be 
# performed by the async task processor
def start_async_task(async_task=None, args=None):
    async_tasks_queue.put((async_task, args))

# Starts the async task processor
start_async_processor()

Use a custom runtime

Lambda supports several runtimes out of the box: Python, Node.js, Java, Dotnet, and Ruby. Lambda also supports custom runtimes, which lets you develop Lambda functions in any other programming language that you need to.

When you invoke a Lambda function that uses a custom runtime, the Lambda service invokes a process called ‘bootstrap’ that contains your custom code. The custom code needs to interact with the Lambda Runtime API. It calls the /next endpoint to obtain information about the next invocation. This API call is blocking and it waits until a request arrives. When the function is done processing the request, it must call the /response endpoint to send the response back to the client and then it must call the /next endpoint again to wait for the next invocation. Lambda freezes the execution environment after you call /next, until a request arrives.

Using this approach, you can run the asynchronous task after calling /response, and sending the response back to the client, and before calling /next, indicating that the processing is done.

The following Python code example isolates the custom runtime code into its own file and the function imports and uses it to interact with the runtime API:

import time
import json
import runtime_interface as rt
from aws_lambda_powertools import Logger

logger = Logger()

def calc_response(event):
    logger.info(f"[Function] Calculating response")
    time.sleep(1) # Simulate sync work
    return {
        "message": "hello from custom"
    }

def async_task(response):
    logger.info(f"[Async task] Starting async task: {json.dumps(response)}")
    time.sleep(3)  # Simulate async work
    logger.info(f"[Async task] Done")

def main():
    # You can add initialization code here

    # The following loop runs forever waiting for the next invocation
    # and sending the response back to the client
    while True:
        # Call /next to wait for next request (and indicate 
        # that we are done processing the previous request)

        requestId, event = rt.get_next()

        # The code from here to send_response() is the code
        # that usually goes inside the Lambda handler()

        logger.info(f"[Function] Received event: {json.dumps(event)}")

        # Calculate response
        response = calc_response(event)

        # Done calculating response, send response to client
        logger.info(f"[Function] Returning response to client")
        rt.send_response(requestId, {
            "statusCode": 200,
            "body": json.dumps(response)
        })

        logger.info(f"[Function] Invoking async task")
        async_task(response)

main()

This Python code demonstrates how to interact with the runtime API:

import requests
import os
from aws_lambda_powertools import Logger

logger = Logger()
run_time_endpoint = os.environ['AWS_LAMBDA_RUNTIME_API']

def get_next():
    logger.debug("[Custom runtime] Waiting for invocation...")
    request = requests.get(
        url=f"http://{run_time_endpoint}/2018-06-01/runtime/invocation/next",
        timeout=None
    )
    event = request.json()
    requestId = request.headers["Lambda-Runtime-Aws-Request-Id"]
    return requestId, event

def send_response(requestId, response):
    logger.debug("[Custom runtime] Sending response")
    requests.post(
        url=f"http://{run_time_endpoint}/2018-06-01/runtime/invocation/{requestId}/response",
        json = response,
        timeout=None
    )

Conclusion

This blog shows four ways of combining synchronous and asynchronous tasks in a Lambda function, allowing you to run tasks that continue running after the function returns a response to the client. The following table summarizes the pros and cons of each solution:

Function URLs, cannot be used with API Gateway, always public

	Asynchronous invocation	Response streaming	Lambda extensions	Custom runtime
Complexity	Easier to implement	Easiest to implement	The most complex solution to implement as it requires interacting with the extensions API and a dedicated thread	Medium as it interacts with the runtime API
Deployment	Need two artifacts: the synchronous function and the asynchronous function	A single deployment artifact that contains all code	A single deployment artifact that contains all code	A single deployment artifact, requires packaging all needed runtime files
Cost	Most expensive as it incurs additional invocation cost as well as the overall duration of both functions is higher than having it in one	Least expensive	Least expensive	Least expensive
Starting the async task	Before returning from handler	Anytime during the handler invocation	Anytime during the handler invocation	After returning the response to the client, unless you use a dedicated thread
Limitations	Payload sent to the asynchronous function cannot exceed 256 KB	Only supported with Node.js and custom runtimes. Requires Lambda Function URLs, cannot be used with API Gateway, always public	–	–
Additional benefits	Better decoupling between synchronous and asynchronous code	Ability to send response in stages. Supports payloads larger than 6 MB (at additional cost)	The asynchronous task runs in its own thread, which can reduce overall duration and cost	–
Retries in case of failure in async code	Managed by the Lambda service	Responsibility of the developer	Responsibility of the developer	Responsibility of the developer

Choosing the right approach depends on your use case. If you write your function in Node.js and you invoke it using Lambda Function URLs, use response streaming. This is the easiest way to implement, and it is the most cost effective.

If there is a chance for a failure in the asynchronous task (for example, a database is not accessible), and you must ensure that the task completes, use the asynchronous Lambda invocation method. The Lambda service retries your asynchronous function until it succeeds. Eventually, if all retries fail, it invokes a Lambda destination so you can take action.

If you need a custom runtime because you need to use a programming language that Lambda does not natively support, use the custom runtime option. Otherwise, use the Lambda extensions option. It is more complex to implement, but it is cost effective. This allows you to package the code in a single artifact and start processing the asynchronous task before you send the response to the client.

For more serverless learning resources, visit Serverless Land.

Accelerating workflow development with the TestState API in AWS Step Functions

2024-05-02 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/accelerating-workflow-development-with-the-teststate-api-in-aws-step-functions/

This post is written by Ben Freiberg, Senior Solutions Architect.

Developers often choose AWS Step Functions to orchestrate the services that comprise their applications. Step Functions is a visual workflow service that makes it easier for developers to build distributed applications, automate processes, orchestrate microservices, and create data and machine learning (ML) pipelines. Step Functions integrates with over 220 AWS services and any publicly accessible HTTP endpoint. Step Functions provides many features that help developers build, such as built-in error handling, real-time and auditable workflow execution history, and large-scale parallel processing.

Several areas can be time consuming for developers when testing Step Functions workflows. For example, authentication with external services, input/output processing, AWS IAM permission, or intrinsic functions. To simplify and speed up resolving these issues, Step Functions released a new capability last year to test individual states: the TestState API. This feature allows you to test states independently from the execution of your workflow. You can change the input and test different scenarios without the need to deploy your workflow or execute the whole state machine. This feature is available for all task, choice, and pass states.

Since developers spend significant time in IDEs and terminals, TestState is also available via an API. This allows you to iterate over changes for an individual state and lets you refine the input/output processing or conditional logic in a choice state without leaving your IDE. In this post, you’ll learn how the TestState API can speed up your testing and development.

Getting started with TestState

Suppose that you are developing a payment processing workflow that consists of three states. First, a Choice state that checks the type of payment based on the input data. Depending on the type, it calls either an AWS Lambda function or an external endpoint. The task state that invokes the Lambda function includes some input/output processing.

To get started with the TestState API, you must create an IAM role that the service can assume. The role must contain the required IAM permissions for the resources your state is accessing. For information about the permissions a state might need, see IAM permissions to test a state. The following snippet shows the minimal necessary permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "states:TestState",
        "iam:PassRole"
      ],
      "Resource": "*"
    }
  ]
}

Next, you must provide the definition of the state being tested. The choice state is configured to check the type of payment and if the voucherId is present, in case of a voucher. The following snippet shows the state definition:

{
    "Type": "Choice",
    "Choices": [
        {
            "And": [
                {
                    "Variable": "$.payment.type",
                    "IsPresent": true
                },
                {
                    "Variable": "$.payment.type",
                    "StringEquals": "voucher"
                }
            ],
            "Next": "Process voucher"
        },
        {
            "Variable": "$.payment.type",
            "StringEquals": "credit",
            "Next": "Call payment provider"
        }
    ],
    "Default": "Fail"
}

Using the role and state definition, you can now test it if an input results in the expected next state:

aws stepfunctions test-state 
--definition file://choice.json 
--role-arn "arn:aws:iam::<account-id>:role/StepFunctions-TestState-Role" 
--input '{"payment":{"type":"voucher"}}'

The response shows that the test did not encounter any errors and that the next state would be invoking the Lambda function to process the voucher as expected.

{
    "output": "{\"payment\":{\"type\":\"voucher\"}}",
    "nextState": "Process voucher",
    "status": "SUCCEEDED"
}

Similarly, with a payment type of credit as input, the next state is invoking the third-party endpoint:

aws stepfunctions test-state
--definition file://choice.json
--role-arn "arn:aws:iam::<account-id>:role/StepFunctions-TestState-Role"
--input '{"payment":{"type":"credit"}}'

{
    "output": "{\"payment\":{\"type\":\"credit\"}}",
    "nextState": "Call payment provider",
    "status": "SUCCEEDED"
}

Because the TestState API takes the state definition as an argument, you do not have to redeploy the state machine when changing the state definition. Instead, you can iterate and test your settings by passing the modified state definition to the TestState API.

Using inspection levels

For each state, you can specify the amount of detail you want to view in the test results. These details provide additional information about the state that you are testing. For example, if you’ve used any input and output data processing filters, such as InputPath or ResultPath in a state, you can view the intermediate and final data processing results. Step Functions provides the following levels to specify the details you want to view, INFO, DEBUG, and TRACE. All these levels return the status and nextState fields.

Next, the Lambda Invoke state is tested. In this scenario, the state includes input/output processing. The output from the function is transformed by renaming and restructuring the field and then merged with the original input. This is the relevant part of the task definition:

"Process voucher": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {...},
      "Retry": [...],
      "Next": "Success",
      "ResultPath": "$.voucherProcessed",
      "ResultSelector": {
        "status.$": "$.Payload.result",
        "workflowId.$": "$.Payload.workflow"
      }
}

This time test using the Step Functions console, which can make it easier to understand the input/output processing steps. To get started, open the state machine in Workflow Studio and select the state, and then choose Test State. Make sure to select DEBUG as the inspection level. After testing the state, switch to the Input/output processing tab to check the intermediate steps.

When you call the TestState API and set the inspectionLevel parameter to DEBUG, the API response includes an object called inspectionData. This object contains fields to help you inspect how data was filtered or manipulated within the state when it was executed. This data is shown in the Input/output processing tab in the console.

Being able to see all the processing steps easily in one place allows developers to spot issues and iterate more quickly, saving time.

Testing third-party endpoint integrations

Applications might call third-party endpoints that require authentication. Step Functions offers the HTTPS endpoint resource to connect to third-party HTTP targets outside of the AWS Cloud.

HTTPS endpoints use Amazon EventBridge connections to manage the authentication credentials for the target. This defines the authorization type used, which can be a basic authentication with a username and password, an API key, or OAuth. EventBridge connections use AWS Secrets Manager to store the secret. This keeps the secrets out of the state machine, reducing the risks of accidentally exposing your secrets in logs or in the state machine definition.

Getting the authentication configuration right might involve several time-consuming iterations. With the TRACE inspection level, developers can see the raw HTTP request and response, which is useful for verifying headers, query parameters, and other API-specific details. This option is only available for the HTTP Task. You can also view the secrets included in the EventBridge connection. To do this, you must set the revealSecrets parameter to true in the TestState API. This can help verifying that the correct authentication parameters are used.

To get started, ensure that the execution role used for testing has the necessary permissions, as shown here:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret"
            ],
            "Resource": "arn:aws:secretsmanager:<your-region>:<account-id>:secret:events!connection/<your-connection-id>"
        }
    ]
}
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RetrieveConnectionCredentials",
            "Effect": "Allow",
            "Action": [
                "events:RetrieveConnectionCredentials"
            ],
            "Resource": [
                "arn:aws:events:<your-region>:<account-id>:connection/<your-connection-id>"
            ]
        }
    ]
}
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeHTTPEndpoint",
            "Effect": "Allow",
            "Action": [
                "states:InvokeHTTPEndpoint"
            ],
            "Resource": [
                "arn:aws:states:<your-region>:<account-id>:stateMachine:<your-statemachine>"
            ]
        }
    ]
}

When you test the HTTP task, make sure to set the inspection level to TRACE. Then use the HTTP request and response tab to check the details. This capability saves you time when debugging complex authentication issues.

Automating testing

Testing is not only a manual activity to get the configuration right. Most often, tests are run as part of a suite of tests, which are automatically performed to validate the correct behavior. It also prevents regressions when making changes. The TestState API can easily be integrated in such tests as well.

The following snippet shows a test using the Jest framework in JavaScript. The test checks if the correct next state is produced given a definition and input. The definition resides in a different file, which can also be used for infrastructure as code (IaC) to create the state machine.

const { SFNClient, TestStateCommand } = require("@aws-sdk/client-sfn");
// Import the state definition 
const definition = require("./definition.json");

const client = new SFNClient({});

describe("Step Functions", () => {
  test("that next state is correct", async () => {
    const command = new TestStateCommand({
      definition: JSON.stringify(definition),
      roleArn: "arn:aws:iam::<account-id>:role/<role-with-sufficient-permissions>",
      input: "{}" # Adjust as necessary
    });
    const data = await client.send(command);

    expect(data.status).toBe("SUCCEEDED");
    expect(data.nextState).toBe("Success"); # Adjust as necessary
  });
});

With automated tests, you can safely change your workflow definitions without the need for manual efforts. That way, you are immediately alerted if a change would result in an incompatibility.

With TestState you can increase your test coverage with less effort because you can test states directly. This is especially helpful for complex workflows and states that require a specific set of circumstances to reach them. It makes it easier to validate the correctness of your error-handling as well. You can now test the potentially many combinations of your configured Retriers and Catchers much easier.

Conclusion

The TestState API helps developers to iterate faster, resolve issues efficiently, and deliver high-quality applications with greater confidence. By enabling developers to test individual states independently and integrating testing into their preferred development workflows, it simplifies the debugging process and reduces context switches. Whether testing input/output processing, authentication with external services, or third-party endpoint integrations, the TestState API can be a useful tool for testing.

Automating chaos experiments with AWS Fault Injection Service and AWS Lambda

2024-03-22 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/automating-chaos-experiments-with-aws-fault-injection-service-and-aws-lambda/

This post is written by André Stoll, Solution Architect.

Chaos engineering is a popular practice for building confidence in system resilience. However, many existing tools assume the ability to alter infrastructure configurations, and cannot be easily applied to the serverless application paradigm. Due to the stateless, ephemeral, and distributed nature of serverless architectures, you must evolve the traditional technique when running chaos experiments on these systems.

This blog post explains a technique for running chaos engineering experiments on AWS Lambda functions. The approach uses Lambda extensions to induce failures in a runtime-agnostic way requiring no function code changes. It shows how you can use the AWS Fault Injection Service (FIS) to automate and manage chaos experiments across different Lambda functions to provide a reusable testing method.

Overview

Chaos experiments are commonly applied to cloud applications to uncover latent issues and prevent service disruptions. IT teams use chaos experiments to build confidence in the robustness of their systems. However, the traditional methods used in server-based chaos engineering do not easily translate to the serverless world since many existing tools are based on altering the underlying infrastructure configurations, such as cluster nodes or server instances of your applications.

In serverless applications, AWS handles the undifferentiated heavy lifting of managing infrastructure, so you can focus on delivering business value. But this also means that engineering teams have limited control over the infrastructure, and must rely on application-level tooling to run chaos experiments. Two techniques commonly used in the serverless community for conducting chaos experiments on Lambda functions are modifying the function configuration or using runtime-specific libraries.

Changing the configuration of a Lambda function allows you to induce rudimentary failures. For example, you can set the reserved concurrency of a Lambda function to simulate invocation throttling. Alternatively, you might change the function execution role permissions or the function policy to simulate IAM access denial. These types of failures are easy to implement, but the range of possible fault injection types is limited.

The other technique—injecting chaos into Lambda functions through purpose-built, runtime-specific libraries—is more flexible. There are various open-source libraries that allow you to inject failures, such as added latency, exceptions, or disk exhaustion. Examples of such libraries are Python’s chaos_lambda and failure-lambda for Node.js. The downside is that you must change the function code for every function you want to run chaos experiments on. In addition, those libraries are runtime-specific and each library comes with a set of different capabilities and configurations. This reduces the reusability of your chaos experiments across Lambda functions implemented in different languages.

Injecting chaos using Lambda extensions

Implementing chaos experiments using Lambda extensions allows you to address all of the previous concerns. Lambda extensions augment your functions by adding functionality, such as capturing diagnostic information or automatically instrumenting your code. You can integrate your preferred monitoring, observability, or security tooling deeply into the Lambda environment without complex installation or configuration management. Lambda extensions are generally packaged as Lambda layers and run as a separate process in the Lambda execution environment. You may use extensions from AWS, AWS Lambda partners, or build your own custom functionality.

With Lambda extensions, you can implement a chaos extension to inject the desired failures into your Lambda environments. This chaos extension uses the Runtime API proxy pattern that enables you to hook into the function invocation request and response lifecycle. Lambda runtimes use the Lambda Runtime API to retrieve the next incoming event to be processed by the function handler and return the handler response to the Lambda service.

The Runtime API HTTP endpoint is available within the Lambda execution environment. Runtimes get the API endpoint from the environment variable AWS_LAMBDA_RUNTIME_API. During the initialization of the execution environment, you can modify the runtime startup behavior. This lets you change the value of AWS_LAMBDA_RUNTIME_API to the port the chaos extension process is listening on. Now, all requests to the Runtime API go through the chaos extension proxy. You can use this workflow for blocking malicious events, auditing payloads, or injecting failures.

The chaos extension intercepts incoming events and outbound responses, and injects failures according to the chaos experiment configuration.
The extension accesses environment variables to read the chaos experiment configuration.
A wrapper script configures the runtime to proxy requests through the chaos extension.

When intercepting incoming events and outbound responses to the Lambda Runtime API, you can simulate failures such as introducing artificial delay or generate an error response to return to the Lambda service. This workflow adds latency to your function calls:

All Lambda runtimes support extensions. Since extensions run as a separate process, you can implement them in a language other than the function code. AWS recommends you implement extensions using a programming language that compiles to a binary executable, such as Golang or Rust. This allows you to use the extension with any Lambda runtime.

Some of the open source projects following this technique are the chaos-lambda-extension, implemented in Rust, or the serverless-chaos-extension, implemented in Python.

Extensions provide you with a flexible and reusable method to run your chaos experiments on Lambda functions. You can reuse the chaos extension for all runtimes without having to change function code. Add the extension to any Lambda function where you want to run chaos experiments.

Automating with AWS FIS experiment templates

According to the Principles of Chaos Engineering, you should “automate your experiments to run continuously”. To achieve this, you can use the AWS Fault Injection Service (FIS).

This service allows you to generate reusable experiment templates. The template specifies the targets and the actions to run on them during the experiment, and an optional stop condition that prevents the experiment from going out of bounds. You can also execute AWS Systems Manager Automation runbooks which support custom fault types. You can write your own custom Systems Manager documents to define the individual steps involved in the automation. To carry out the actions of the experiment, you define scripts in the document to manage your Lambda function and set it up for the chaos experiment.

To use the chaos extension for your serverless chaos experiments:

Set up the Lambda function for the experiment. Add the chaos extension as a layer and configure the experiment, for example, by adding environment variables specifying the fault type and its corresponding value.
Pause the automation and conduct the experiment. To do this, use the aws:sleep automation action. During this period, you conduct the experiment, measure and observe the outcome.
Clean up the experiment. The script removes the layer again and also resets the environment variables.

Running your first serverless chaos experiment

This sample repository provides you with the necessary code to run your first serverless chaos experiment in AWS. The experiment uses the chaos-lambda-extension extension to inject chaos.

The sample deploys the AWS FIS experiment template, the necessary SSM Automation runbooks including the IAM role used by the runbook to configure the Lambda functions. The sample also provisions a Lambda function for testing and an Amazon CloudWatch alarm used to roll back the experiment.

Prerequisites

You have the chaos extension already deployed as a Lambda layer in your AWS account.
You have installed the AWS Cloud Development Kit (CDK) CLI. See the Getting started with the AWS CDK guide for details.
Clone the sample repository locally by running
git clone [email protected]:aws-samples/serverless-chaos-experiments.git
Ensure you have sufficient permissions to interact with the AWS FIS, Lambda, and CloudWatch alarms.

Running the experiment

Follow the steps outlined in the repository to conduct your first experiment. Starting the experiment triggers the automation execution.

This automation includes adding the extension and configuring the experiment, pausing the execution and observing the system and reverting all changes to the initial state.

If you invoke the targeted Lambda function during the second step, failures (in this case, artificial latency) are simulated.

Security best practices

Extensions run within the same execution environment as the function, so they have the same level of access to resources such as file system, networking, and environment variables. IAM permissions assigned to the function are shared with extensions. AWS recommends you assign the least required privileges to your functions.

Always install extensions from a trusted source only. Use Infrastructure as Code (IaC) and automation tools, such as CloudFormation or AWS Systems Manager, to simplify attaching the same extension configuration, including AWS Identity and Access Management (IAM) permissions, to multiple functions. IaC and automation tools allow you to have an audit record of extensions and versions used previously.

When building extensions, do not log sensitive data. Sanitize payloads and metadata before logging or persisting them for audit purposes.

Conclusion

This blog post details how to run chaos experiments for serverless applications built using Lambda. The described approach uses Lambda extension to inject faults into the execution environment. This allows you to use the same method regardless of runtime or configuration of the Lambda function.

To automate and successfully conduct the experiment, you can use the AWS Fault Injection Service. By creating an experiment template, you can specify the actions to run on the defined targets, such as adding the extension during the experiment. Since the extension can be used for any runtime, you can reuse the experiment template to inject failures into different Lambda functions.

Visit this repository to deploy your first serverless chaos experiment, or watch this video guide for learning more about building extensions. Explore the AWS FIS documentation to learn how to create your own experiments.

For more serverless learning resources, visit Serverless Land.

Building a Serverless Streaming Pipeline to Deliver Reliable Messaging

2024-03-06 Chris McPeek

Post Syndicated from Chris McPeek original https://aws.amazon.com/blogs/compute/building-a-serverless-streaming-pipeline-to-deliver-reliable-messaging/

This post is written by Jeff Harman, Senior Prototyping Architect, Vaibhav Shah, Senior Solutions Architect and Erik Olsen, Senior Technical Account Manager.

Many industries are required to provide audit trails for decision and transactional systems. AI assisted decision making requires monitoring the full inputs to the decision system in near real time to prevent fraud, detect model drift, and discrimination. Modern systems often use a much wider array of inputs for decision making, including images, unstructured text, historical values, and other large data elements. These large data elements pose a challenge to traditional audit systems that deal with relatively small text messages in structured formats. This blog shows the use of serverless technology to create a reliable, performant, traceable, and durable streaming pipeline for audit processing.

Overview

Consider the following four requirements to develop an architecture for audit record ingestion:

Audit record size: Store and manage large payloads (256k – 6 MB in size) that may be heterogeneous, including text, binary data, and references to other storage systems.
Audit traceability: The data stored has full traceability of the payload and external processes to monitor the process via subscription-based events.
High Performance: The time required for blocking writes to the system is limited to the time it takes to transmit the audit record over the network.
High data durability: Once the system sends a payload receipt, the payload is at very low risk of loss because of system failures.

The following diagram shows an architecture that meets these requirements and models the flow of the audit record through the system.

The primary source of latency is the time it takes for an audit record to be transmitted across the network. Applications sending audit records make an API call to an Amazon API Gateway endpoint. An AWS Lambda function receives the message and an Amazon ElastiCache for Redis cluster provides a low latency initial storage mechanism for the audit record. Once the data is stored in ElastiCache, the AWS Step Functions workflow then orchestrates the communication and persistence functions.

Subscribers receive four Amazon Simple Notification Service (Amazon SNS) notifications pertaining to arrival and storage of the audit record payload, storage of the audit record metadata, and audit record archive completion. Users can subscribe an Amazon Simple Queue Service (SQS) queue to the SNS topic and use fan out mechanisms to achieve high reliability.

The Ingest Message Lambda function sends an initial receipt notification
The Message Archive Handler Lambda function notifies on storage of the audit record from ElastiCache to Amazon Simple Storage Service (Amazon S3)
The Message Metadata Handler Lambda function notifies on storage of the message metadata into Amazon DynamoDB
The Final State Aggregation Lambda function notifies that the audit record has been archived.

Any failure by the three fundamental processing steps: Ingestion, Data Archive, and Metadata Archive triggers a message in an SQS Dead Letter Queue (DLQ) which contains the original request and an explanation of the failure reason. Any failure in the Ingest Message function invokes the Ingest Message Failure function, which stores the original parameters to the S3 Failed Message Storage bucket for later analysis.

The Step Functions workflow provides orchestration and parallel path execution for the system. The detailed workflow below shows the execution flow and notification actions. The transformer steps convert the internal data structures into the format required for consumers.

Data structures

There are types three events and messages managed by this system:

Incoming message: This is the message the producer sends to an API Gateway endpoint.
Internal message: This event contains the message metadata allowing subsequent systems to understand the originating message producer context.
Notification message: Messages that allow downstream subscribers to act based on the message.

Solution walkthrough

The message producer calls the API Gateway endpoint, which enforces the security requirements defined by the business. In this implementation, API Gateway uses an API key for providing more robust security. API Gateway also creates a security header for consumption by the Ingest Message Lambda function. API Gateway can be configured to enforce message format standards, see Use request validation in API Gateway for more information.

The Ingest Message Lambda function generates a message ID that tracks the message payload throughout its lifecycle. Then it stores the full message in the ElastiCache for Redis cache. The Ingest Message Lambda function generates an internal message with all the elements necessary as described above. Finally, the Lambda function handler code starts the Step Functions workflow with the internal message payload.

If the Ingest Message Lambda function fails for any reason, the Lambda function invokes the Ingestion Failure Handler Lambda function. This Lambda function writes any recoverable incoming message data to an S3 bucket and sends a notification on the Ingest Message dead letter queue.

The Step Functions workflow then runs three processes in parallel.

The Step Functions workflow triggers the Message Archive Data Handler Lambda function to persist message data from the ElastiCache cache to an S3 bucket. Once stored, the Lambda function returns the S3 bucket reference and state information. There are two options to remove the internal message from the cache. Remove the message from cache immediately before sending the internal message and updating the ElastiCache cache flag or wait for the ElastiCache lifecycle to remove a stale message from cache. This solution waits for the ElastiCache lifecycle to remove the message.
The workflow triggers the Message Metadata Handler Lambda function to write all message metadata and security information to DynamoDB. The Lambda function replies with the DynamoDB reference information.
Finally, the Step Functions workflow sends a message to the SNS topic to inform subscribers that the message has arrived and the data persistence processes have started.

After each of the Lambda functions’ processes complete, the Lambda function sends a notification to the SNS notification topic to alert subscribers that each action is complete. When both Message Metadata and Message Archive Lambda functions are done, the Final Aggregation function makes a final update to the metadata in DynamoDB to include S3 reference information and to remove the ElastiCache Redis reference.

Deploying the solution

Prerequisites:

AWS Serverless Application Model (AWS SAM) is installed (see Getting started with AWS SAM)
AWS User/Credentials with appropriate permissions to run AWS CloudFormation templates in the target AWS account
Python 3.8 – 3.10
The AWS SDK for Python (Boto3) is installed
The requests python library is installed

The source code for this implementation can be found at https://github.com/aws-samples/blog-serverless-reliable-messaging

Installing the Solution:

Clone the git repository to a local directory
git clone https://github.com/aws-samples/blog-serverless-reliable-messaging.git
Change into the directory that was created by the clone operation, usually blog_serverless_reliable_messaging
Execute the command: sam build
Execute the command: sam deploy –-guided. You are asked to supply the following parameters:
1. Stack Name: Name given to this deployment (example: serverless-streaming)
2. AWS Region: Where to deploy (example: us-east-1)
3. ElasticacheInstanceClass: EC2 cache instance type to use with (example: cache.t3.small)
4. ElasticReplicaCount: How many replicas should be used with ElastiCache (recommended minimum: 2)
5. ProjectName: Used for naming resources in account (example: serverless-streaming)
6. MultiAZ: True/False if multiple Availability Zones should be used (recommend: True)
7. The default parameters can be selected for the remainder of questions

Testing:

Once you have deployed the stack, you can test it through the API gateway endpoint with the API key that is referenced in the deployment output. There are two methods for retrieving the API key either via the AWS console (from the link provided in the output – ApiKeyConsole) or via the AWS CLI (from the AWS CLI reference in the output – APIKeyCLI).

You can test directly in the Lambda service console by invoking the ingest message function.

A test message is available at the root of the project test_message.json for direct Lambda function testing of the Ingest function.

In the console navigate to the Lambda service
From the list of available functions, select the “<project name> -IngestMessageFunction-xxxxx” function
Under the “Function overview” select the “Test” tab
Enter an event name of your choosing
Copy and paste the contents of test_message.json into the “Event JSON” box
Click “Save” then after it has saved, click the “Test”

If successful, you should see something similar to the below in the details:

{
"isBase64Encoded": false,
"statusCode": 200,
"headers": {
"Access-Control-Allow-Headers": "Content-Type",
"Access-Control-Allow-Origin": "*",
"Access-Control-Allow-Methods": "OPTIONS,POST"
},
"body": "{\"messageID\": \"XXXXXXXXXXXXXX\"}"
}

In the S3 bucket “<project name>-s3messagearchive-xxxxxx“, find the payload of the original json with a key based on the date and time of the script execution, e.g.: YEAR/MONTH/DAY/HOUR/MINUTE with a file name of the messageID
In a DynamoDB table named metaDataTable, you should find a record with a messageID equal to the messageID from above that contains all of the metadata related to the payload

A python script is included with the code in the test_client folder

Replace the <Your API key key here> and the <Your API Gateway URL here (IngestMessageApi)> values with the correct ones for your environment in the test_client.py file
Execute the test script with Python 3.8 or higher with the requests package installed
Example execution (from main directory of git clone):
python3 -m pip install -r ./test_client/requirements.txt
python3 ./test_client/test_client.py
Successful output shows the messageID and the header JSON payload:
```
{
"messageID": " XXXXXXXXXXXXXX"
}
```
In the S3 bucket “<project name>-s3messagearchive-xxxxxx“, you should be able to find the payload of the original json with a key based on the date and time of the script execution, e.g.: YEAR/MONTH/DAY/HOUR/MINUTE with a file name of the messageID
In a DynamoDB table named metaDataTable, you should find a record with a messageID equal to the messageID from above that contains all of the meta data related to the payload

Conclusion

This blog describes architectural patterns, messaging patterns, and data structures that support a highly reliable messaging system for large messages. The use of serverless services including Lambda functions, Step Functions, ElastiCache, DynamoDB, and S3 meet the requirements of modern audit systems to be scalable and reliable. The architecture shared in this blog post is suitable for a highly regulated environment to store and track messages that are larger than typical logging systems, records sized between 256k and 6MB. The architecture serves as a blueprint that can be extended and adapted to fit further serverless use cases.

For serverless learning resources, visit Serverless Land.

Comparing design approaches for building serverless microservices

2024-03-04 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/comparing-design-approaches-for-building-serverless-microservices/

This post is written by Luca Mezzalira, Principal SA, and Matt Diamond, Principal, SA.

Designing a workload with AWS Lambda creates questions for developers due to the modularity that can be expressed either at the code or infrastructure level. Using serverless for running code requires additional planning to extract the business logic from the underlying functional components. This deliberate separation of concerns ensures a robust modularity, paving the way for evolutionary architectures.

This post focuses on synchronous workloads, but similar considerations are applicable in other workload types. After identifying the bounded context of your API and agreeing on API contracts with consumers, it’s time to structure the architecture of your bounded context and the associated infrastructure.

The two most common ways to structure an API using Lambda functions are single responsibility and Lambda-lith. However, this blog post explores an alternative to these approaches, which can provide the best of both.

Single responsibility Lambda functions

Single responsibility Lambda functions are designed to run a specific task or handle a particular event-triggered operation within a serverless architecture:

$c:\temp\design1.png$

This approach provides a strong separation of concerns between business logic and capabilities. You can test in isolation specific capabilities, deploy a Lambda function independently, reduce the surface to introduce bugs, and enable easier debugging for issues in Amazon CloudWatch.

Additionally, single purpose functions enable efficient resource allocation as Lambda automatically scales based on demand, optimizing resource consumption, and minimizing costs. This means you can modify the memory size, architecture, and any other configuration available per function. Moreover, requesting an update of concurrent function execution via a support ticket becomes easier because you are not aggregating the traffic to a single Lambda function that handles every request but you can request specific increase based on the traffic of a single task.

Another advantage is rapid execution time. Considering the business logic for a single-purpose Lambda function designed for a single task, you can optimize the size of a function more easily, without the need of additional libraries required in other approaches. This helps reduce the cold start time due to a smaller bundle size.

Despite these benefits, some issues exist when solely relying on single-purpose Lambda functions. While the cold start time is mitigated, you might experience a higher number of cold starts, particularly for functions with sporadic or infrequent invocations. For example, a function that deletes users in an Amazon DynamoDB table likely won’t be triggered as often as one that reads user data. Also, relying heavily on single-purpose Lambda functions can lead to increased system complexity, especially as the number of functions grows.

A good separation of concerns helps maintain your code base, at the cost of a lack of cohesion. In functions with similar tasks, such as write operations of an API (POST, PUT, DELETE), you might duplicate code and behaviors across multiple functions. Moreover, updating common libraries shared via Lambda Layers, or other dependency management systems, requires multiple changes across every function instead of an atomic change on a single file. This is also true for any other change across multiple functions, for instance, updating the runtime version.

Lambda-lith: Using one single Lambda function

When many workloads use single purpose Lambda functions, developers end up with a proliferation of Lambda functions across an AWS account. One of the main challenges developers face is updating common dependencies or function configurations. Unless there is a clear governance strategy implemented for addressing this problem (such as using Dependabot for enforcing the update of dependencies, or parameterized parameters that are retrieved at provisioning time), developers may opt for a different strategy.

As a result, many development teams move in the opposite direction, aggregating all code related to an API inside the same Lambda function.

This approach is often referred to as a Lambda-lith, because it gathers all the HTTP verbs that compose an API and sometimes multiple APIs in the same function.

This allows you to have a higher code cohesion and colocation across the different parts of the application. Modularity in this case is expressed at the code level, where patterns like single responsibility, dependency injection, and façade are applied to structure your code. The discipline and code best practices applied by the development teams is crucial for maintaining large code bases.

However, considering the reduced number of Lambda functions, updating a configuration or implementing a new standard across multiple APIs can be achieved more easily compared with the single responsibility approach.

Moreover, since every request invokes the same Lambda function for every HTTP verb, it’s more likely that little-used parts of your code have a better response time because an execution environment is more likely to be available to fulfill the request.

Another factor to consider is the function size. This increases when collocating verbs in the same function with all the dependencies and business logic of an API. This may affect the cold start of your Lambda functions with spiky workloads. Customers should evaluate the benefits of this approach, especially when applications have restrictive SLAs, which would be impacted by cold starts. Developers can mitigate this problem by paying attention to the dependencies used and implementing techniques like tree-shaking, minification, and dead code elimination, where the programming language allows.

This coarse grain approach won’t allow you to tune your function configurations individually. But you must find a configuration that matches all the code capabilities with a possibly higher memory size and looser security permissions that might clash with the requirements defined by the security team.

Read and write functions

These two approaches both have trade-offs, but there is a third option that can combine their benefits.

Often, API traffic leans towards more reads or writes and that forces developers to optimize code and configurations more on one side over the other.

For example, consider building a user API that allows consumers to create, update, and delete a user but also to find a user or a list of users. In this scenario, you can change one user at a time with no bulk operations available, but you can get one or more users per API request. Dividing the design of the API into read and write operations results in this architecture:

The cohesion of code for write operations (create, update, and delete) is beneficial for many reasons. For instance, you may need to validate the request body, ensuring it contains all the mandatory parameters. If the workload is heavy on writes, the less-used operations (for instance, Delete) benefit from warm execution environments. The code colocation enables reusability of code on similar actions, reducing the cognitive load to structure your projects with shared libraries or Lambda layers, for instance.

When looking at the read operations side, you can reduce the code bundled with this function, having a faster cold start, and heavily optimize the performance compared to a write operation. You can also store partial or full query results in-memory of an execution environment to improve the execution time of a Lambda function.

This approach helps you further with its evolutionary nature. Imagine if this platform becomes much more popular. Now, you must optimize the API even further by improving reads and adding a cache aside pattern with ElastiCache and Redis. Moreover, you have decided to optimize the read queries with a second database that is optimized for the read capability when the cache is missed.

On the write side, you have agreed with the API consumers that receiving and acknowledging user creation or deletion is adequate, considering they fully embraced the eventual consistency nature of distributed systems.

Now, you can improve the response time of write operations by adding an SQS queue before the Lambda function. You can update the write database in batches to reduce the number of invocations needed for handling write operations, instead of dealing with every request individually.

Command query responsibility segregation (CQRS) is a well-established pattern that separates the data mutation, or the command part of a system, from the query part. You can use the CQRS pattern to separate updates and queries if they have different requirements for throughput, latency, or consistency.

While it’s not mandatory to start with a full CQRS pattern, you can evolve from the infrastructure highlighted more easily in the initial read and write implementation, without massive refactoring of your API.

Comparison of the three approaches

Here is a comparison of the three approaches:

	Single responsibility	Lambda-lith	Read and write
Benefits	Strong separation of concerns Granular configuration Better debug Rapid execution time	Fewer cold start invocations Higher code cohesion Simpler maintenance	Code cohesion where needed Evolutionary architecture Optimization of read and write operations
Issues	Code duplication Complex maintenance Higher cold start invocations	Corse grain configuration Higher cold start time	Using CQRS with two data models CQRS adds eventual consistency to your system

Conclusion

Developers often move from single responsibility functions to the Lambda-lith as their architectures evolve, but both approaches have relative trade-offs. This post shows how it’s possible to have the best of both approaches by dividing your workloads per read and write operations.

All three approaches are viable for designing serverless APIs, and understanding what you are optimizing for is the key for making the best decision. Remember, understanding your context and business requirements to express in your applications leads you towards the acceptable trade-offs to specify inside a specific workload. Keep an open mind and find the solution that solves the problem and balances security, developer experience, cost, and maintainability.

For more serverless learning resources, visit Serverless Land.

Introducing the .NET 8 runtime for AWS Lambda

2024-02-23 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/introducing-the-net-8-runtime-for-aws-lambda/

This post is written by Beau Gosse, Senior Software Engineer and Paras Jain, Senior Technical Account Manager.

AWS Lambda now supports .NET 8 as both a managed runtime and container base image. With this release, Lambda developers can benefit from .NET 8 features including API enhancements, improved Native Ahead of Time (Native AOT) support, and improved performance. .NET 8 supports C# 12, F# 8, and PowerShell 7.4. You can develop Lambda functions in .NET 8 using the AWS Toolkit for Visual Studio, the AWS Extensions for .NET CLI, AWS Serverless Application Model (AWS SAM), AWS CDK, and other infrastructure as code tools.

Creating .NET 8 function in the console

What’s new

Upgraded operating system

The .NET 8 runtime is built on the Amazon Linux 2023 (AL2023) minimal container image. This provides a smaller deployment footprint than earlier Amazon Linux 2 (AL2) based runtimes and updated versions of common libraries such as glibc 2.34 and OpenSSL 3.

The new image also uses microdnf as a package manager, symlinked as dnf. This replaces the yum package manager used in earlier AL2-based images. If you deploy your Lambda functions as container images, you must update your Dockerfiles to use dnf instead of yum when upgrading to the .NET 8 base image. For more information, see Introducing the Amazon Linux 2023 runtime for AWS Lambda.

Performance

There are a number of language performance improvements available as part of .NET 8. Initialization time can impact performance, as Lambda creates new execution environments to scale your function automatically. There are a number of ways to optimize performance for Lambda-based .NET workloads, including using source generators in System.Text.Json or using Native AOT.

Lambda has increased the default memory size from 256 MB to 512 MB in the blueprints and templates for improved performance with .NET 8. Perform your own functional and performance tests on your .NET 8 applications. You can use AWS Compute Optimizer or AWS Lambda Power Tuning for performance profiling.

At launch, new Lambda runtimes receive less usage than existing established runtimes. This can result in longer cold start times due to reduced cache residency within internal Lambda subsystems. Cold start times typically improve in the weeks following launch as usage increases. As a result, AWS recommends not drawing performance comparison conclusions with other Lambda runtimes until the performance has stabilized.

Native AOT

Lambda introduced .NET Native AOT support in November 2022. Benchmarks show up to 86% improvement in cold start times by eliminating the JIT compilation. Deploying .NET 8 Native AOT functions using the managed dotnet8 runtime rather than the OS-only provided.al2023 runtime gives your function access to .NET system libraries. For example, libicu, which is used for globalization, is not included by default in the provided.al2023 runtime but is in the dotnet8 runtime.

While Native AOT is not suitable for all .NET functions, .NET 8 has improved trimming support. This allows you to more easily run ASP.NET APIs. Improved trimming support helps eliminate build time trimming warnings, which highlight possible runtime errors. This can give you confidence that your Native AOT function behaves like a JIT-compiled function. Trimming support has been added to the Lambda runtime libraries, AWS .NET SDK, .NET Lambda Annotations, and .NET 8 itself.

Using.NET 8 with Lambda

To use .NET 8 with Lambda, you must update your tools.

Install or update the .NET 8 SDK.
If you are using AWS SAM, install or update to the latest version.
If you are using Visual Studio, install or update the AWS Toolkit for Visual Studio.
If you use the .NET Lambda Global Tools extension (Amazon.Lambda.Tools), install the CLI extension and templates. You can upgrade existing tools with dotnet tool update -g Amazon.Lambda.Tools and existing templates with dotnet new install Amazon.Lambda.Templates.

You can also use .NET 8 with Powertools for AWS Lambda (.NET), a developer toolkit to implement serverless best practices such as observability, batch processing, retrieving parameters, idempotency, and feature flags.

Building new .NET 8 functions

Using AWS SAM

Run sam init.
Choose 1- AWS Quick Start Templates.
Choose one of the available templates such as Hello World Example.
Select N for Use the most popular runtime and package type?
Select dotnet8 as the runtime. The dotnet8 Hello World Example also includes a Native AOT template option.
Follow the rest of the prompts to create the .NET 8 function.

AWS SAM .NET 8 init options

You can amend the generated function code and use sam deploy --guided to deploy the function.

Using AWS Toolkit for Visual Studio

From the Create a new project wizard, filter the templates to either the Lambda or Serverless project type and select a template. Use Lambda for deploying a single function. Use Serverless for deploying a collection of functions using AWS CloudFormation.
Continue with the steps to finish creating your project.
You can amend the generated function code.
To deploy, right click on the project in the Solution Explorer and select Publish to AWS Lambda.

Using AWS extensions for the .NET CLI

Run dotnet new list --tag Lambda to get a list of available Lambda templates.
Choose a template and run dotnet new <template name>. To build a function using Native AOT, use dotnet new lambda.NativeAOT or dotnet new serverless.NativeAOT when using the .NET Lambda Annotations Framework.
Locate the generated Lambda function in the directory under src which contains the .csproj file. You can amend the generated function code.
To deploy, run dotnet lambda deploy-function and follow the prompts.
You can test the function in the cloud using dotnet lambda invoke-function or by using the test functionality in the Lambda console.

You can build and deploy .NET Lambda functions using container images. Follow the instructions in the documentation.

Migrating from .NET 6 to .NET 8 without Native AOT

Using AWS SAM

Open the template.yaml file.
Update Runtime to dotnet8.
Open a terminal window and rebuild the code using sam build.
Run sam deploy to deploy the changes.

Using AWS Toolkit for Visual Studio

Open the .csproj project file and update the TargetFramework to net8.0. Update NuGet packages for your Lambda functions to the latest version to pull in .NET 8 updates.
Verify that the build command you are using is targeting the .NET 8 runtime.
There may be additional steps depending on what build/deploy tool you’re using. Updating the function runtime may be sufficient.

Using AWS extensions for the .NET CLI or AWS Toolkit for Visual Studio

Open the aws-lambda-tools-defaults.json file if it exists.
1. Set the framework field to net8.0. If unspecified, the value is inferred from the project file.
2. Set the function-runtime field to dotnet8.
Open the serverless.template file if it exists. For any AWS::Lambda::Function or AWS::Servereless::Function resources, set the Runtime property to dotnet8.
Open the .csproj project file if it exists and update the TargetFramework to net8.0. Update NuGet packages for your Lambda functions to the latest version to pull in .NET 8 updates.

Migrating from .NET 6 to .NET 8 Native AOT

The following example migrates a .NET 6 class library function to a .NET 8 Native AOT executable function. This uses the optional Lambda Annotations framework which provides idiomatic .NET coding patterns.

Update your project file

Open the project file.
Set TargetFramework to net8.0.
Set OutputType to exe.
Remove PublishReadyToRun if it exists.
Add PublishAot and set to true.
Add or update NuGet package references to Amazon.Lambda.Annotations and Amazon.Lambda.RuntimeSupport. You can update using the NuGet UI in your IDE, manually, or by running dotnet add package Amazon.Lambda.RuntimeSupport and dotnet add package Amazon.Lambda.Annotations from your project directory.

Your project file should look similar to the following:

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>exe</OutputType>
    <TargetFramework>net8.0</TargetFramework>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
    <AWSProjectType>Lambda</AWSProjectType>
    <CopyLocalLockFileAssemblies>true</CopyLocalLockFileAssemblies>
    <!-- Generate native aot images during publishing to improve cold start time. -->
    <PublishAot>true</PublishAot>
	  <!-- StripSymbols tells the compiler to strip debugging symbols from the final executable if we're on Linux and put them into their own file. 
		This will greatly reduce the final executable's size.-->
	  <StripSymbols>true</StripSymbols>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="Amazon.Lambda.Core" Version="2.2.0" />
    <PackageReference Include="Amazon.Lambda.RuntimeSupport" Version="1.10.0" />
    <PackageReference Include="Amazon.Lambda.Serialization.SystemTextJson" Version="2.4.0" />
  </ItemGroup>
</Project>

Updating your function code

1. Reference the annotations library with using Amazon.Lambda.Annotations;
2. Add [assembly:LambdaGlobalProperties(GenerateMain = true)] to allow the annotations framework to create the main method. This is required as the project is now an executable instead of a library.
3. Add the below partial class and include a JsonSerializable attribute for any types that you need to serialize, including for your function input and output This partial class is used at build time to generate reflection free code dedicated to serializing the listed types. The following is an example:
4. After the using statement, add the following to specify the serializer to use. [assembly: LambdaSerializer(typeof(SourceGeneratorLambdaJsonSerializer<LambdaFunctionJsonSerializerContext>))]
Swap LambdaFunctionJsonSerializerContext for your context if you are not using the partial class from the previous step.

Updating your function configuration

If you are using aws-lambda-tools-defaults.json.
1. Set function-runtime to dotnet8.
2. Set function-architecture to match your build machine – either x86_64 or arm64.
3. Set (or update) environment-variables to include ANNOTATIONS_HANDLER=<YourFunctionHandler>. Replace <YourFunctionHandler> with the method name of your function handler, so the annotations framework knows which method to call from the generated main method.
4. Set function-handler to the name of the executable assembly in your bin directory. By default, this is your project name, which tells the .NET Lambda bootstrap script to run your native binary instead of starting the .NET runtime. If your project file has AssemblyName then use that value for the function handler.
```
{
  "function-architecture": "x86_64",
  "function-runtime": "dotnet8",
  "function-handler": "<your-assembly-name>",
  "environment-variables",
  "ANNOTATIONS_HANDLER=<your-function-handler>",
}
```
Deploy and test
1. Deploy your function. If you are using Amazon.Lambda.Tools, run dotnet lambda deploy-function. Check for trim warnings during build and refactor to eliminate them.
2. Test your function to ensure that the native calls into AL2023 are working correctly. By default, running local unit tests on your development machine won’t run natively and will still use the JIT compiler. Running with the JIT compiler does not allow you to catch native AOT specific runtime errors.
Conclusion

Lambda is introducing the new .NET 8 managed runtime. This post highlights new features in .NET 8. You can create new Lambda functions or migrate existing functions to .NET 8 or .NET 8 Native AOT.

For more information, see the AWS Lambda for .NET repository, documentation, and .NET on Serverless Land.

For more serverless learning resources, visit Serverless Land.

Re-platforming Java applications using the updated AWS Serverless Java Container

2024-02-06 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/re-platforming-java-applications-using-the-updated-aws-serverless-java-container/

This post is written by Dennis Kieselhorst, Principal Solutions Architect.

The combination of portability, efficiency, community, and breadth of features has made Java a popular choice for businesses to build their applications for over 25 years. The introduction of serverless functions, pioneered by AWS Lambda, changed what you need in a programming language and runtime environment. Functions are often short-lived, single-purpose, and do not require extensive infrastructure configuration.

This blog post shows how you can modernize a legacy Java application to run on Lambda with minimal code changes using the updated AWS Serverless Java Container.

Deployment model comparison

Classic Java enterprise applications often run on application servers such as JBoss/ WildFly, Oracle WebLogic and IBM WebSphere, or servlet containers like Apache Tomcat. The underlying Java virtual machine typically runs 24/7 and serves multiple requests using its multithreading capabilities.

Typical long running Java application server

When building Lambda functions with Java, an HTTP server is no longer required and there are other considerations for running code in a Lambda environment. Code runs in an execution environment, which processes a single invocation at a time. Functions can run for up to 15 minutes with a maximum of 10 Gb allocated memory.

Functions are triggered by events such as an HTTP request with a corresponding payload. An Amazon API Gateway HTTP request invokes the function with the following JSON payload:

Amazon API Gateway HTTP request payload

The code to process these events is different from how you implement it in a traditional application.

AWS Serverless Java Container

The AWS Serverless Java Container makes it easier to run Java applications written with frameworks such as Spring, Spring Boot, or JAX-RS/Jersey in Lambda.

The container provides adapter logic to minimize code changes. Incoming events are translated to the Servlet specification so that frameworks work as before.

AWS Serverless Java Container adapter

Version 1 of this library was released in 2018. Today, AWS is announcing the release of version 2, which supports the latest Jakarta EE specification, along with Spring Framework 6.x, Spring Boot 3.x and Jersey 3.x.

Example: Modifying a Spring Boot application

This following example illustrates how to migrate a Spring Boot 3 application. You can find the full example for Spring and other frameworks in the GitHub repository.

Add the AWS Serverless Java dependency to your Maven POM build file (or Gradle accordingly):

<dependency>
    <groupId>com.amazonaws.serverless</groupId>
    <artifactId>aws-serverless-java-container-springboot3</artifactId>
    <version>2.0.0</version>
</dependency>

Spring Boot, by default, embeds Apache Tomcat to deal with HTTP requests. The examples use Amazon API Gateway to handle inbound HTTP requests so you can exclude the dependency.

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <configuration>
                <createDependencyReducedPom>false</createDependencyReducedPom>
            </configuration>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <artifactSet>
                            <excludes>
                                <exclude>org.apache.tomcat.embed:*</exclude>
                            </excludes>
                        </artifactSet>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

The AWS Serverless Java Container accepts API Gateway proxy requests and transforms them into a plain Java object. The library also transforms outputs into a suitable API Gateway response object.

Once you run your build process, Maven’s Shade-plugin now produces an Uber-JAR that bundles all dependencies, which you can upload to Lambda.

The Lambda runtime must know which handler method to invoke. You can configure and use the SpringDelegatingLambdaContainerHandler implementation or implement your own handler Java class that delegates to AWS Serverless Java Container. This is useful if you want to add additional functionality.
Configure the handler name in the runtime settings of your function.

Configure the handler name

Configure an environment variable named MAIN_CLASS to let the generic handler know where to find your original application main class, which is usually annotated with @SpringBootApplication.

Configure MAIN_CLASS environment variable

You can also configure these settings using infrastructure as code (IaC) tools such as AWS CloudFormation, the AWS Cloud Development Kit (AWS CDK), or the AWS Serverless Application Model (AWS SAM).

In an AWS SAM template, the related changes are as follows. Full templates are part of the GitHub repository.

Handler: com.amazonaws.serverless.proxy.spring.SpringDelegatingLambdaContainerHandler 
Environment:
  Variables:
    MAIN_CLASS: com.amazonaws.serverless.sample.springboot3.Application

Optimizing memory configuration

When running Lambda functions, start-up time and memory footprint are important considerations. The amount of memory you configure for your Lambda function also determines the amount of virtual CPU available. Adding more memory proportionally increases the amount of CPU, and therefore increases the overall computational power available. If a function is CPU-, network- or memory-bound, adding more memory can improve performance.

Lambda charges for the total amount of gigabyte-seconds consumed by a function. Gigabyte-seconds are a combination of total memory (in gigabytes) and duration (in seconds). Increasing memory incurs additional cost. However, in many cases, increasing the memory available causes a decrease in the function duration due to the additional CPU available. As a result, the overall cost increase may be negligible for additional performance, or may even decrease.

Choosing the memory allocated to your Lambda functions is an optimization process that balances speed (duration) and cost. You can manually test functions by selecting different memory allocations and measuring the completion time. AWS Lambda Power Tuning is a tool to simplify and automate the process, which you can use to optimize your configuration.

Power Tuning uses AWS Step Functions to run multiple concurrent versions of a Lambda function at different memory allocations and measures the performance. The function runs in your AWS account, performing live HTTP calls and SDK interactions, to measure performance in a production scenario.

Improving cold-start time with AWS Lambda SnapStart

Traditional applications often have a large tree of dependencies. Lambda loads the function code and initializes dependencies during Lambda lifecycle initialization phase. With many dependencies, this initialization time may be too long for your requirements. AWS Lambda SnapStart for Java based functions can deliver up to 10 times faster startup performance.

Instead of running the function initialization phase on every cold-start, Lambda SnapStart runs the function initialization process at deployment time. Lambda takes a snapshot of the initialized execution environment. This snapshot is encrypted and persisted in a tiered cache for low latency access. When the function is invoked and scales, Lambda resumes the execution environment from the persisted snapshot instead of running the full initialization process. This results in lower startup latency.

To enable Lambda SnapStart you must first enable the configuration setting, and also publish a function version.

Enabling SnapStart

Ensure you point your API Gateway endpoint to the published version or an alias to ensure you are using the SnapStart enabled function.

The corresponding settings in an AWS SAM template contain the following:

SnapStart: 
  ApplyOn: PublishedVersions
AutoPublishAlias: my-function-alias

Read the Lambda SnapStart compatibility considerations in the documentation as your application may contain specific code that requires attention.

Conclusion

When building serverless applications with Lambda, you can deliver features faster, but your language and runtime must work within the serverless architectural model. AWS Serverless Java Container helps to bridge between traditional Java Enterprise applications and modern cloud-native serverless functions.

You can optimize the memory configuration of your Java Lambda function using AWS Lambda Power Tuning tool and enable SnapStart to optimize the initial cold-start time.

The self-paced Java on AWS Lambda workshop shows how to build cloud-native Java applications and migrate existing Java application to Lambda.

Explore the AWS Serverless Java Container GitHub repo where you can report related issues and feature requests.

For more serverless learning resources, visit Serverless Land.

Build real-time applications with Amazon EventBridge and AWS AppSync

2024-01-30 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/build-real-time-applications-with-amazon-eventbridge-and-aws-appsync/

This post is written by Josh Kahn, Tech Leader, Serverless.

Amazon EventBridge now supports publishing events to AWS AppSync GraphQL APIs as native targets. The new integration enables builders to publish events easily to a wider variety of consumers and simplifies updating clients with near real-time data. You can use EventBridge and AWS AppSync to build resilient, subscription-based event-driven architectures across consumers.

To illustrate using EventBridge with AWS AppSync, consider a simplified airport operations scenario. In this example, airlines publish flight events (for example, boarding, push back, gate changes, and delays) to a service that maintains flight status on in-airport displays. Airlines also publish events that are useful for other entities at the airport, such as baggage handlers and maintenance, but not to passengers. This depicts a conceptual view of the system:

Passengers want the in-airport displays to be up-to-date and accurate. There are a number of ways to design the display application so that data remains up-to-date. Broadly, these include the application polling some API or the application subscribing to data changes.

Subscriptions for this scenario are better as the data changes are small and incremental relative to the large amount of information displayed. In a delay, for example, the display updates the status and departure time but no other details of a single flight among a larger list of flight information.

AWS AppSync can enable clients to listen for real-time data changes through the use of GraphQL subscriptions. These are implemented using a WebSocket connection between the client and the AWS AppSync service. The display application client invokes the GraphQL subscription operation to establish a secure connection. AWS AppSync will automatically push data changes (or mutations) via the GraphQL API to subscribers using that connection.

Previously, builders could use EventBridge API Destinations to wire events published and routed through EventBridge to AWS AppSync, as described in an earlier blog post, and available in Serverless Land patterns (API Key, OAuth). The approach is useful for dealing with “out-of-band” updates in which data changes outside of an AWS AppSync mutation. Out-of-band updates generally require a NONE data source in AWS AppSync to notify subscribers of changes, as described in the AWS re:Post Knowledge Center. The addition of AWS AppSync as a target for EventBridge simplifies these use cases as you can now trigger a mutation in response to an event without additional code.

Airport Operations Events

Expanding the scenario, airport operations events look like this:

{
  "flightNum": 123,
  "carrierCode": "JK",
  "date": "2024-01-25",
  "event": "FlightDelayed",
  "message": "Delayed 15 minutes, late aircraft",
  "info": "{ \"newDepTime\": \"2024-01-25T13:15:00Z\", \"delayMinutes\": 15 }"
}

The event field identifies the type of event and if it is relevant to passengers. The event details provide further information about the event, which varies based on the type of event. The airport publishes a variety of events but the airport displays only need a subset of those changes.

AWS AppSync GraphQL APIs start with a GraphQL schema that defines the types, fields, and operations available in that API. AWS AppSync documentation provides an overview of schema and other GraphQL essentials. The partial GraphQL schema for the airport scenario is as follows:


type DelayEventInfo implements EventInfo {
	message: String
	delayMinutes: Int
	newDepTime: AWSDateTime
}

interface EventInfo {
	message: String
}

enum StatusEvent {
	FlightArrived
	FlightBoarding
	FlightCancelled
	FlightDelayed
	FlightGateChanged
	FlightLanded
	FlightPushBack
	FlightTookOff
}

type StatusUpdate {
	num: Int!
	carrier: String!
	date: AWSDate!
	event: StatusEvent!
	info: EventInfo
}

input StatusUpdateInput {
	num: Int!
	carrier: String!
	date: AWSDate!
	event: StatusEvent!
	message: String
	extra: AWSJSON
}

type Mutation {
	updateFlightStatus(input: StatusUpdateInput!): StatusUpdate!
}

type Query {
	listStatusUpdates(by: String): [StatusUpdate]
}

type Subscription {
	onFlightStatusUpdate(date: AWSDate, carrier: String): StatusUpdate
		@aws_subscribe(mutations: ["updateFlightStatus"])
}

schema {
	query: Query
	mutation: Mutation
	subscription: Subscription
}

Connect EventBridge to AWS AppSync

EventBridge allows you to filter, transform, and route events to a number of targets. The airport display service only needs events that directly impact passengers. You can define a rule in EventBridge that routes only those events (included in the preceding GraphQL schema) to the AWS AppSync target. Other events are routed elsewhere, as defined by other rules, or dropped. Details on creating EventBridge rules and the event matching pattern format can be found in EventBridge documentation.

The previous flight delayed event would be delivered using EventBridge as follows:

{
  "id": "b051312994104931b0980d1ad1c5340f",
  "detail-type": "Operations: Flight delayed",
  "source": "airport-operations",
  "time": "2024-01-25T16:58:37Z",
  "detail": {
    "flightNum": 123,
    "carrierCode": "JK",
    "date": "2024-01-25",
    "event": "FlightDelayed",
    "message": "Delayed 15 minutes, late aircraft",
    "info": "{ \"newDepTime\": \"2024-01-25T13:15:00Z\", \"delayMinutes\": 15 }"
  }
}

In this scenario, there is a specific list of events of interest, but EventBridge provides a flexible set of operations to match patterns, inspect arrays, and filter by content using prefix, numerical, or other matching. Some organizations will also allow subscribers to define their own rules on an EventBridge event bus, allowing targets to subscribe to events via self-service.

The following event pattern matches on the events needed for the airport display service:

{
  "source": [ "airport-operations" ],
  "detail": {
    "event": [ "FlightArrived", "FlightBoarding", "FlightCancelled", ... ]
  }
}

To create a new EventBridge rule, you can use the AWS Management Console or infrastructure as code. You can find the CloudFormation definition for the completed rule, with the AWS AppSync target, later in this post.

Create the AWS AppSync target

Now that EventBridge is configured to route selected events, define AWS AppSync as the target for the rule. The AWS AppSync API must support IAM authorization to be used as an EventBridge target. AWS AppSync supports multiple authorization types on a single GraphQL type, so you can also use OpenID Connect, Amazon Cognito User Pools, or other authorization methods as needed.

To configure AWS AppSync as an EventBridge target, define the target using the AWS Management Console or infrastructure as code. In the console, select the Target Type as “AWS Service” and Target as “AppSync.” Select your API. EventBridge parses the GraphQL schema and allows you to select the mutation to invoke when the rule is triggered.

When using the AWS Management Console, EventBridge will also configure the necessary AWS IAM role to invoke the selected mutation. Remember to create and associate a role with an appropriate trust policy when configuring with IaC.

EventBridge supports input transformation to customize the contents of an event before passing the information as input to the target. Configure the input transformer to extract needed values from the event using JSON path and a template in the input format expected by the AWS AppSync API. EventBridge provides a handy utility in the Console to pass and test the output of a sample event.

Finally, configure the selection set to include the response from the AWS AppSync API. These are the fields that will be returned to EventBridge when the mutation is invoked. While the result returned to EventBridge is not overly useful (aside from troubleshooting), the mutation selection set will also determine the fields available to subscribers to the onFlightStatusUpdate subscription.

Define the EventBridge to AWS AppSync rule in CloudFormation

Infrastructure as code templates, including AWS CloudFormation and AWS CDK, are useful for codifying infrastructure definitions to deploy across Regions and accounts. While you can write CloudFormation by hand, EventBridge provides a useful CloudFormation export in the AWS Management Console. You can use this feature to export the definition for a defined rule.

This is the CloudFormation for the previous configured rule and AWS AppSync target. This snippet includes both the rule definition and the target configuration.

PassengerEventsToDisplayServiceRule:
    Type: AWS::Events::Rule
    Properties:
      Description: Route passenger related events to the display service endpoint
      EventBusName: eb-to-appsync
      EventPattern:
        source:
          - airport-operations
        detail:
          event:
            - FlightArrived
            - FlightBoarding
            - FlightCancelled
            - FlightDelayed
            - FlightGateChanged
            - FlightLanded
            - FlightPushBack
            - FlightTookOff
      Name: passenger-events-to-display-service
      State: ENABLED
      Targets:
        - Id: 12344535353263463
          Arn: <AppSync API GraphQL API ARN>
          RoleArn: <EventBridge Role ARN (defined elsewhere)>
          InputTransformer:
            InputPathsMap:
              carrier: $.detail.carrierCode
              date: $.detail.date
              event: $.detail.event
              extra: $.detail.info
              message: $.detail.message
              num: $.detail.flightNum
            InputTemplate: |-
              {
                "input": {
                  "num": <num>,
                  "carrier": <carrier>,
                  "date": <date>,
                  "event": <event>,
                  "message": "<message>",
                  "extra": <extra>
                }
              }
          AppSyncParameters:
            GraphQLOperation: >-
              mutation
              UpdateFlightStatus($input:StatusUpdateInput!){updateFlightStatus(input:$input){
                event
                date
                carrier
                num
                info {
                  __typename
                  ... on DelayEventInfo {
                    message
                    delayMinutes
                    newDepTime
                  }
                }
              }}

The ARN of the AWS AppSync API follows the form arn:aws:appsync:<AWS_REGION>:<ACCOUNT_ID>:endpoints/graphql-api/<GRAPHQL_ENDPOINT_ID>. The ARN is available in CloudFormation (see GraphQLEndpointArn return value) or can be created using the identifier found in the AWS AppSync GraphQL endpoint. The ARN included in the EventBridge execution role policy is the AWS AppSync API ARN (a different ARN).

The AppSyncParameters field includes the GraphQL operation for EventBridge to invoke on the AWS AppSync API. This must be well formatted and match the GraphQL schema. Include any fields that must be available to subscribers in the selection set.

Testing subscriptions

AWS AppSync is now configured as a target for the EventBridge rule. The real-life display application would use a GraphQL library, such as AWS Amplify, to subscribe to real-time data changes. The AWS Management Console provides a useful utility to test. Navigate to the AWS AppSync console and select Queries in the menu for your API. Enter the following query and choose Run to subscribe for data changes:

subscription MySubscription {
  onFlightStatusUpdate {
    carrier
    date
    event
    num
    info {
      __typename
      … on DelayEventInfo {
        message
        delayMinutes
        newDepTime
      }
    }
  }
}

In a separate browser tab, navigate to the EventBridge console, and choose Send events. On the Send events page, select the required event bus and set the Event source to “airport-operations.” Then enter a detail type of your choice. Finally, paste the following as the Event detail, then choose Send.

{
  "id": "b051312994104931b0980d1ad1c5340f",
  "detail-type": "Operations: Flight delayed",
  "source": "airport-operations",
  "time": "2024-01-25T16:58:37Z",
  "detail": {
    "flightNum": 123,
    "carrierCode": "JK",
    "date": "2024-01-25",
    "event": "FlightDelayed",
    "message": "Delayed 15 minutes, late aircraft",
    "info": "{ \"newDepTime\": \"2024-01-25T13:15:00Z\", \"delayMinutes\": 15 }"
  }
}

Return to the AWS AppSync tab in your browser to see the changed data in the result pane:

Conclusion

Directly invoking AWS AppSync GraphQL API targets from EventBridge simplifies and streamlines integration between these two services, ideal for notifying a variety of subscribers of data changes in event-driven workloads. You can also take advantage of other features available from the two services. For example, use AWS AppSync enhanced subscription filtering to update only airport displays in the terminal in which they are located.

To learn more about serverless, visit Serverless Land for a wide array of reusable patterns, tutorials, and learning materials. Newly added to the pattern library is an EventBridge to AWS AppSync pattern similar to the one described in this post. Visit EventBridge documentation for more details.

For more serverless learning resources, visit Serverless Land.

Invoking on-premises resources interactively using AWS Step Functions and MQTT

2024-01-30 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/invoking-on-premises-resources-interactively-using-aws-step-functions-and-mqtt/

This post is written by Alex Paramonov, Sr. Solutions Architect, ISV, and Pieter Prinsloo, Customer Solutions Manager.

Workloads in AWS sometimes require access to data stored in on-premises databases and storage locations. Traditional solutions to establish connectivity to the on-premises resources require inbound rules to firewalls, a VPN tunnel, or public endpoints.

This blog post demonstrates how to use the MQTT protocol (AWS IoT Core) with AWS Step Functions to dispatch jobs to on-premises workers to access or retrieve data stored on-premises. The state machine can communicate with the on-premises workers without opening inbound ports or the need for public endpoints on on-premises resources. Workers can run behind Network Access Translation (NAT) routers while keeping bidirectional connectivity with the AWS Cloud. This provides a more secure and cost-effective way to access data stored on-premises.

Overview

By using Step Functions with AWS Lambda and AWS IoT Core, you can access data stored on-premises securely without altering the existing network configuration.

AWS IoT Core lets you connect IoT devices and route messages to AWS services without managing infrastructure. By using a Docker container image running on-premises as a proxy IoT Thing, you can take advantage of AWS IoT Core’s fully managed MQTT message broker for non-IoT use cases.

MQTT subscribers receive information via MQTT topics. An MQTT topic acts as a matching mechanism between publishers and subscribers. Conceptually, an MQTT topic behaves like an ephemeral notification channel. You can create topics at scale with virtually no limit to the number of topics. In SaaS applications, for example, you can create topics per tenant. Learn more about MQTT topic design here.

The following reference architecture shown uses the AWS Serverless Application Model (AWS SAM) for deployment, Step Functions to orchestrate the workflow, AWS Lambda to send and receive on-premises messages, and AWS IoT Core to provide the MQTT message broker, certificate and policy management, and publish/subscribe topics.

Start the state machine, either “on demand” or on a schedule.
The state: “Lambda: Invoke Dispatch Job to On-Premises” publishes a message to an MQTT message broker in AWS IoT Core.
The message broker sends the message to the topic corresponding to the worker (tenant) in the on-premises container that runs the job.
The on-premises container receives the message and starts work execution. Authentication is done using client certificates and the attached policy limits the worker access to only the tenant’s topic.
The worker in the on-premises container can access local resources like DBs or storage locations.
The on-premises container sends the results and job status back to another MQTT topic.
The AWS IoT Core rule invokes the “TaskToken Done” Lambda function.
The Lambda function submits the results to Step Functions via SendTaskSuccess or SendTaskFailure API.

Deploying and testing the sample

Ensure you can manage AWS resources from your terminal and that:

Latest versions of AWS CLI and AWS SAM CLI are installed.
You have an AWS account. If not, visit this page.
Your user has sufficient permissions to manage AWS resources.
Git is installed.
Python version 3.11 or greater is installed.
Docker is installed.

You can access the GitHub repository here and follow these steps to deploy the sample.

The aws-resources directory contains the required AWS resources including the state machine, Lambda functions, topics, and policies. The directory on-prem-worker contains the Docker container image artifacts. Use it to run the on-premises worker locally.

In this example, the worker container adds two numbers, provided as an input in the following format:

{
  "a": 15,
  "b": 42
}

In a real-world scenario, you can substitute this operation with business logic. For example, retrieving data from on-premises databases, generating aggregates, and then submitting the results back to your state machine.

Follow these steps to test the sample end-to-end.

Using AWS IoT Core without IoT devices

There are no IoT devices in the example use case. However, the fully managed MQTT message broker in AWS IoT Core lets you route messages to AWS services without managing infrastructure.

AWS IoT Core authenticates clients using X.509 client certificates. You can attach a policy to a client certificate allowing the client to publish and subscribe only to certain topics. This approach does not require IAM credentials inside the worker container on-premises.

AWS IoT Core’s security, cost efficiency, managed infrastructure, and scalability make it a good fit for many hybrid applications beyond typical IoT use cases.

Dispatching jobs from Step Functions and waiting for a response

When a state machine reaches the state to dispatch the job to an on-premises worker, the execution pauses and waits until the job finishes. Step Functions support three integration patterns: Request-Response, Sync Run a Job, and Wait for a Callback with Task Token. The sample uses the “Wait for a Callback with Task Token“ integration. It allows the state machine to pause and wait for a callback for up to 1 year.

When the on-premises worker completes the job, it publishes a message to the topic in AWS IoT Core. A rule in AWS IoT Core then invokes a Lambda function, which sends the result back to the state machine by calling either SendTaskSuccess or SendTaskFailure API in Step Functions.

You can prevent the state machine from timing out by adding HeartbeatSeconds to the task in the Amazon States Language (ASL). Timeouts happen if the job freezes and the SendTaskFailure API is not called. HeartbeatSeconds send heartbeats from the worker via the SendTaskHeartbeat API call and should be less than the specified TimeoutSeconds.

To create a task in ASL for your state machine, which waits for a callback token, use the following code:

{
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
      "Parameters": {
        "FunctionName": "${LambdaNotifierToWorkerArn}",
        "Payload": {
          "Input.$": "$",
          "TaskToken.$": "$$.Task.Token"
        }
}

The .waitForTaskToken suffix indicates that the task must wait for the callback. The state machine generates a unique callback token, accessible via the $$.Task.Token built-in variable, and passes it as an input to the Lambda function defined in FunctionName.

The Lambda function then sends the token to the on-premises worker via an AWS IoT Core topic.

Lambda is not the only service that supports Wait for Callback integration – see the full list of supported services here.

In addition to dispatching tasks and getting the result back, you can implement progress tracking and shut down mechanisms. To track progress, the worker sends metrics via a separate topic.

Depending on your current implementation, you have several options:

Storing progress data from the worker in Amazon DynamoDB and visualizing it via REST API calls to a Lambda function, which reads from the DynamoDB table. Refer to this tutorial on how to store data in DynamoDB directly from the topic.
For a reactive user experience, create a rule to invoke a Lambda function when new progress data arrives. Open a WebSocket connection to your backend. The Lambda function sends progress data via WebSocket directly to the frontend.

To implement a shutdown mechanism, you can run jobs in separate threads on your worker and subscribe to the topic, to which your state machine publishes the shutdown messages. If a shutdown message arrives, end the job thread on the worker and send back the status including the callback token of the task.

Using AWS IoT Core Rules and Lambda Functions

A message with job results from the worker does not arrive to the Step Functions API directly. Instead, an AWS IoT Core Rule and a dedicated Lambda function forward the status message to Step Functions. This allows for more granular permissions in AWS IoT Core policies, which result in improved security because the worker container can only publish and subscribe to specific topics. No IAM credentials exist on-premises.

The Lambda function’s execution role contains the permissions for SendTaskSuccess, SendTaskHeartbeat, and SendTaskFailure API calls only.

Alternatively, a worker can run API calls in Step Functions workflows directly, which replaces the need for a topic in AWS IoT Core, a rule, and a Lambda function to invoke the Step Functions API. This approach requires IAM credentials inside the worker’s container. You can use AWS Identity and Access Management Roles Anywhere to obtain temporary security credentials. As your worker’s functionality evolves over time, you can add further AWS API calls while adding permissions to the IAM execution role.

Cleaning up

The services used in this solution are eligible for AWS Free Tier. To clean up the resources in the aws-resources/ directory of the repository run:

sam delete

This removes all resources provisioned by the template.yml file.

To remove the client certificate from AWS, navigate to AWS IoT Core Certificates and delete the certificate, which you added during the manual deployment steps.

Lastly, stop the Docker container on-premises and remove it:

docker rm --force mqtt-local-client

Finally, remove the container image:

docker rmi mqtt-client-waitfortoken

Conclusion

Accessing on-premises resources with workers controlled via Step Functions using MQTT and AWS IoT Core is a secure, reactive, and cost effective way to run on-premises jobs. Consider updating your hybrid workloads from using inefficient polling or schedulers to the reactive approach described in this post. This offers an improved user experience with fast dispatching and tracking of jobs outside of cloud.

For more serverless learning resources, visit Serverless Land.

Consuming private Amazon API Gateway APIs using mutual TLS

2024-01-23 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/consuming-private-amazon-api-gateway-apis-using-mutual-tls/

This post is written by Thomas Moore, Senior Solutions Architect and Josh Hart, Senior Solutions Architect.

A previous blog post explores using Amazon API Gateway to create private REST APIs that can be consumed across different AWS accounts inside a virtual private cloud (VPC). Private cross-account APIs are useful for software vendors (ISVs) and SaaS companies providing secure connectivity for customers, and organizations building internal APIs and backend microservices.

Mutual TLS (mTLS) is an advanced security protocol that provides two-way authentication via certificates between a client and server. mTLS requires the client to send an X.509 certificate to prove its identity when making a request, together with the default server certificate verification process. This ensures that both parties are who they claim to be.

The mTLS connection process illustrated in the diagram above:

Client connects to the server.
Server presents its certificate, which is verified by the client.
Client presents its certificate, which is verified by the server.
Encrypted TLS connection established.

Customers use mTLS because it offers stronger security and identity verification than standard TLS connections. mTLS helps prevent man-in-the-middle attacks and protects against threats such as impersonation attempts, data interception, and tampering. As threats become more advanced, mTLS provides an extra layer of defense to validate connections.

Implementing mTLS increases overhead for certificate management, but for applications transmitting valuable or sensitive data, the extra security is important. If security is a priority for your systems and users, you should consider deploying mTLS.

Regional API Gateway endpoints have native support for mTLS but private API Gateway endpoints do not support mTLS, so you must terminate mTLS before API Gateway. The previous blog post shows how to build private mTLS APIs using a self-managed verification process inside a container running an NGINX proxy. Since then, Application Load Balancer (ALB) now supports mTLS natively, simplifying the architecture.

This post explores building mTLS private APIs using this new feature.

Application Load Balancer mTLS configuration

You can enable mutual authentication (mTLS) on a new or existing Application Load Balancer. By enabling mTLS on the load balancer listener, clients are required to present trusted certificates to connect. The load balancer validates the certificates before allowing requests to the backends.

There are two options available when configuring mTLS on the Application Load Balancer: Passthrough mode and Verify with trust store mode.

In Passthrough mode, the client certificate chain is passed as an X-Amzn-Mtls-Clientcert HTTP header for the application to inspect for authorization. In this scenario, there is still a backend verification process. The benefit in adding the ALB to the architecture is that you can perform application (layer 7) routing, such as path-based routing, allowing more complex application routing configurations.

In Verify with trust store mode, the load balancer validates the client certificate and only allows clients providing trusted certificates to connect. This simplifies the management and reduces load on backend applications.

This example uses AWS Private Certificate Authority but the steps are similar for third-party certificate authorities (CA).

To configure the certificate Trust Store for the ALB:

Create an AWS Private Certificate Authority. Specify the Common Name (CN) to be the domain you use to host the application at (for example, api.example.com).
Export the CA using either the CLI or the Console and upload the resulting Certificate.pem to an Amazon S3 bucket.
Create a Trust Store, point this at the certificate uploaded in the previous step.
Update the listener of your Application Load Balancer to use this trust store and select the required mTLS verification behavior.
Generate certificates for the client application against the private certificate authority, for example using the following commands:

openssl req -new -newkey rsa:2048 -days 365 -keyout my_client.key -out my_client.csr

aws acm-pca issue-certificate –certificate-authority-arn arn:aws:acm-pca:us-east-1:111122223333:certificate-authority/certificate_authority_id–csr fileb://my_client.csr –signing-algorithm “SHA256WITHRSA” –validity Value=365,Type=”DAYS” –template-arn arn:aws:acm-pca:::template/EndEntityCertificate/V1

aws acm-pca get-certificate -certificate-authority-arn arn:aws:acm-pca:us-east-1:111122223333:certificate-authority/certificate_authority_id–certificate-arn arn:aws:acm-pca:us-east-1:account_id:certificate-authority/certificate_authority_id/certificate/certificate_id–output text

For more details on this part of the process, see Use ACM Private CA for Amazon API Gateway Mutual TLS.

Private API Gateway mTLS verification using an ALB

Using the ALB Verify with trust store mode together with API Gateway can enable private APIs with mTLS, without the operational burden of a self-managed proxy service.

You can use this pattern to access API Gateway in the same AWS account, or cross-account.

The same account pattern allows clients inside the VPC to consume the private API Gateway by calling the Application Load Balancer URL. The ALB is configured to verify the provided client certificate against the trust store before passing the request to the API Gateway.

If the certificate is invalid, the API never receives the request. A resource policy on the API Gateway ensures that can requests are only allowed via the VPC endpoint, and a security group on the VPC endpoint ensures that it can only receive requests from the ALB. This prevents the client from bypassing mTLS by invoking the API Gateway or VPC endpoints directly.

The cross-account pattern using AWS PrivateLink provides the ability to connect to the ALB endpoint securely across accounts and across VPCs. It avoids the need to connect VPCs together using VPC Peering or AWS Transit Gateway and enables software vendors to deliver SaaS services to be consumed by their end customers. This pattern is available to deploy as sample code in the GitHub repository.

The flow of a client request through the cross-account architecture is as follows:

A client in the consumer application sends a request to the producer API endpoint.
The request is routed via AWS PrivateLink to a Network Load Balancer in the consumer account. The Network Load Balancer is a requirement of AWS PrivateLink services.
The Network Load Balancer uses an Application Load Balancer-type Target Group.
The Application Load Balancer listener is configured for mTLS in verify with trust store mode.
An authorization decision is made comparing the client certificate to the chain in the certificate trust store.
If the client certificate is allowed the request is routed to the API Gateway via the execute-api VPC Endpoint. An API Gateway resource policy is used to allow connections only via the VPC endpoint.
Any additional API Gateway authentication and authorization is performed, such as using a Lambda authorizer to validate a JSON Web Token (JWT).

Using the example deployed from the GitHub repo, this is the expected response from a successful request with a valid certificate:

curl –key my_client.key –cert my_client.pem https://api.example.com/widgets 

{“id”:”1”,”value”:”4.99”}

When passing an invalid certificate, the following response is received:

curl: (35) Recv failure: Connection reset by peer

Custom domain names

An additional benefit to implementing the mTLS solution with an Application Load Balancer is support for private custom domain names. Private API Gateway endpoints do not support custom domain names currently. But in this case, clients first connect to an ALB endpoint, which does support a custom domain. The sample code implements private custom domains using a public AWS Certificate Manager (ACM) certificate on the internal ALB, and an Amazon Route 53 hosted DNS zone. This allows you to provide a static URL to consumers so that if the API Gateway is replaced the consumer does not need to update their code.

Certificate revocation list

Optionally, as another layer of security, you can also configure a certificate revocation list for a trust store on the ALB. Revocation lists allow you to revoke and invalidate issued certificates before their expiry date. You can use this feature to off-boarding customers or denying compromised credentials, for example.

You can add the certificate revocation list to a new or existing trust store. The list is provided via an Amazon S3 URI as a PEM formatted file.

Conclusion

This post explores ways to provide mutual TLS authentication for private API Gateway endpoints. A previous post shows how to achieve this using a self-managed NGINX proxy. This post simplifies the architecture by using the native mTLS support now available for Application Load Balancers.

This new pattern centralizes authentication at the edge, streamlines deployment, and minimizes operational overhead compared to self-managed verification. AWS Private Certificate Authority and certificate revocation lists integrate with managed credentials and security policies. This makes it easier to expose private APIs safely across accounts and VPCs.

Mutual authentication and progressive security controls are growing in importance when architecting secure cloud-based workloads. To get started, visit the GitHub repository.

For more serverless learning resources, visit Serverless Land.

Using generative infrastructure as code with Application Composer

2024-01-16 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/using-generative-infrastructure-as-code-with-application-composer/

This post is written by Anna Spysz, Frontend Engineer, AWS Application Composer

AWS Application Composer launched in the AWS Management Console one year ago, and has now expanded to the VS Code IDE as part of the AWS Toolkit. This includes access to a generative AI partner that helps you write infrastructure as code (IaC) for all 1100+ AWS CloudFormation resources that Application Composer now supports.

Overview

Application Composer lets you create IaC templates by dragging and dropping cards on a virtual canvas. These represent CloudFormation resources, which you can wire together to create permissions and references. With support for all 1100+ resources that CloudFormation allows, you can now build with everything from AWS Amplify to AWS X-Ray.

Previously, standard CloudFormation resources came only with a basic configuration. Adding an Amplify App resource resulted in the following configuration by default:

  MyAmplifyApp:
    Type: AWS::Amplify::App
    Properties:
      Name: <String>

And in the console:

AWS App Composer in the console

Now, Application Composer in the IDE uses generative AI to generate resource-specific configurations with safeguards such as validation against the CloudFormation schema to ensure valid values.

When working on a CloudFormation or AWS Serverless Application Model (AWS SAM) template in VS Code, you can sign in with your Builder ID and generate multiple suggested configurations in Application Composer. Here is an example of an AI generated configuration for the AWS::Amplify::App type:

AI generated configuration for the Amplify App type

These suggestions are specific to the resource type, and are safeguarded by a check against the CloudFormation schema to ensure valid values or helpful placeholders. You can then select, use, and modify the suggestions to fit your needs.

You now know how to generate a basic example with one resource, but let’s look at building a full application with the help of AI-generated suggestions. This example recreates a serverless application from a Serverless Land tutorial, “Use GenAI capabilities to build a chatbot,” using Application Composer and generative AI-powered code suggestions.

Getting started with the AWS Toolkit in VS Code

If you don’t yet have the AWS Toolkit extension, you can find it under the Extensions tab in VS Code. Install or update it to at least version 2.1.0, so that the screen shows Amazon Q and Application Composer:

Amazon Q and Application Composer

Next, to enable gen AI-powered code suggestions, you must enable Amazon CodeWhisperer using your Builder ID. The easiest way is to open Amazon Q chat, and select Authenticate. On the next screen, select the Builder ID option, then sign in with your Builder ID.

Enable Amazon CodeWhisperer using your Builder ID

After sign-in, your connection appears in the VS Code toolkit panel:

Connection in VS Code toolkit panel

Building with Application Composer

With the toolkit installed and connected with your Builder ID, you are ready to start building.

In a new workspace, create a folder for the application and a blank template.yaml file.
Open this file and initiate Application Composer by choosing the icon in the top right.

Original architecture diagram

The original tutorial includes this architecture diagram:

Initiate Application Composer

First, add the services in the diagram to sketch out the application architecture, which simultaneously creates a deployable CloudFormation template:

From the Enhanced components list, drag in a Lambda function and a Lambda layer.
Double-click the Function resource to edit its properties. Rename the Lambda function’s Logical ID to LexGenAIBotLambda.
Change the Source path to src/LexGenAIBotLambda, and the runtime to Python.
Change the handler value to TextGeneration.lambda_handler, and choose Save.
Double-click the Layer resource to edit its properties. Rename the layer Boto3Layer and change its build method to Python. Change its Source path to src/Boto3PillowPyshorteners.zip.
Finally, connect the layer to the function to add a reference between them. Your canvas looks like this:

Your App Composer canvas

The template.yaml file is now updated to include those resources. In the source directory, you can see some generated function files. You will replace them with the tutorial function and layers later.

In the first step, you added some resources and Application Composer generated IaC that includes best practices defaults. Next, you will use standard CloudFormation components.

Using AI for standard components

Start by using the search bar to search for and add several of the Standard components needed for your application.

Search for and add Standard components

In the Resources search bar, enter “lambda” and add the resource type AWS::Lambda::Permission to the canvas.
Enter “iam” in the search bar, and add type AWS::IAM::Policy.
Add two resources of the type AWS::IAM::Role.

Your application now look like this:

Updated canvas

Some standard resources have all the defaults you need. For example, when you add the AWS::Lambda::Permission resource, replace the placeholder values with:

FunctionName: !Ref LexGenAIBotLambda
Action: lambda:InvokeFunction
Principal: lexv2.amazonaws.com

Other resources, such as the IAM roles and IAM policy, have a vanilla configuration. This is where you can use the AI assistant. Select an IAM Role resource and choose Generate suggestions to see what the generative AI suggests.

Generate suggestions

Because these suggestions are generated by a Large Language Model (LLM), they may differ between each generation. These are checked against the CloudFormation schema, ensuring validity and providing a range of configurations for your needs.

Generating different configurations gives you an idea of what a resource’s policy should look like, and often gives you keys that you can then fill in with the values you need. Use the following settings for each resource, replacing the generated values where applicable.

Double-click the “Permission” resource to edit its settings. Change its Logical ID to LexGenAIBotLambdaInvoke and replace its Resource configuration with the following, then choose Save:

Action: lambda:InvokeFunction
FunctionName: !GetAtt LexGenAIBotLambda.Arn
Principal: lexv2.amazonaws.com

Double-click the “Role” resource to edit its settings. Change its Logical ID to CfnLexGenAIDemoRole and replace its Resource configuration with the following, then choose Save:

AssumeRolePolicyDocument:
  Statement:
    - Action: sts:AssumeRole
      Effect: Allow
      Principal:
        Service: lexv2.amazonaws.com
  Version: '2012-10-17'
ManagedPolicyArns:
  - !Join
    - ''
    - - 'arn:'
      - !Ref AWS::Partition
      - ':iam::aws:policy/AWSLambdaExecute'

Double-click the “Role2” resource to edit its settings. Change its Logical ID to LexGenAIBotLambdaServiceRole and replace its Resource configuration with the following, then choose Save:

AssumeRolePolicyDocument:
  Statement:
    - Action: sts:AssumeRole
      Effect: Allow
      Principal:
        Service: lambda.amazonaws.com
  Version: '2012-10-17'
ManagedPolicyArns:
  - !Join
    - ''
    - - 'arn:'
      - !Ref AWS::Partition
      - ':iam::aws:policy/service-role/AWSLambdaBasicExecutionRole'

Double-click the “Policy” resource to edit its settings. Change its Logical ID to LexGenAIBotLambdaServiceRoleDefaultPolicy and replace its Resource configuration with the following, then choose Save:

PolicyDocument:
  Statement:
    - Action:
        - lex:*
        - logs:*
        - s3:DeleteObject
        - s3:GetObject
        - s3:ListBucket
        - s3:PutObject
      Effect: Allow
      Resource: '*'
    - Action: bedrock:InvokeModel
      Effect: Allow
      Resource: !Join
        - ''
        - - 'arn:aws:bedrock:'
          - !Ref AWS::Region
          - '::foundation-model/anthropic.claude-v2'
  Version: '2012-10-17'
PolicyName: LexGenAIBotLambdaServiceRoleDefaultPolicy
Roles:
  - !Ref LexGenAIBotLambdaServiceRole

Once you have updated the properties of each resource, you see the connections and groupings automatically made between them:

Connections and automatic groupings

To add the Amazon Lex bot:

In the resource picker, search for and add the type AWS::Lex::Bot. Here’s another chance to see what configuration the AI suggests.
Change the Amazon Lex bot’s logical ID to LexGenAIBot update its configuration to the following:

DataPrivacy:
  ChildDirected: false
IdleSessionTTLInSeconds: 300
Name: LexGenAIBot
RoleArn: !GetAtt CfnLexGenAIDemoRole.Arn
AutoBuildBotLocales: true
BotLocales:
  - Intents:
      - InitialResponseSetting:
          CodeHook:
            EnableCodeHookInvocation: true
            IsActive: true
            PostCodeHookSpecification: {}
        IntentClosingSetting:
          ClosingResponse:
            MessageGroupsList:
              - Message:
                  PlainTextMessage:
                    Value: Hi there, I'm a GenAI Bot. How can I help you?
        Name: WelcomeIntent
        SampleUtterances:
          - Utterance: Hi
          - Utterance: Hey there
          - Utterance: Hello
          - Utterance: I need some help
          - Utterance: Help needed
          - Utterance: Can I get some help?
      - FulfillmentCodeHook:
          Enabled: true
          IsActive: true
          PostFulfillmentStatusSpecification: {}
        InitialResponseSetting:
          CodeHook:
            EnableCodeHookInvocation: true
            IsActive: true
            PostCodeHookSpecification: {}
        Name: GenerateTextIntent
        SampleUtterances:
          - Utterance: Generate content for
          - Utterance: 'Create text '
          - Utterance: 'Create a response for '
          - Utterance: Text to be generated for
      - FulfillmentCodeHook:
          Enabled: true
          IsActive: true
          PostFulfillmentStatusSpecification: {}
        InitialResponseSetting:
          CodeHook:
            EnableCodeHookInvocation: true
            IsActive: true
            PostCodeHookSpecification: {}
        Name: FallbackIntent
        ParentIntentSignature: AMAZON.FallbackIntent
    LocaleId: en_US
    NluConfidenceThreshold: 0.4
Description: Bot created demonstration of GenAI capabilities.
TestBotAliasSettings:
  BotAliasLocaleSettings:
    - BotAliasLocaleSetting:
        CodeHookSpecification:
          LambdaCodeHook:
            CodeHookInterfaceVersion: '1.0'
            LambdaArn: !GetAtt LexGenAIBotLambda.Arn
        Enabled: true
      LocaleId: en_US

Choose Save on the resource.

Once all of your resources are configured, your application looks like this:

New AI generated canvas

Adding function code and deployment

Once your architecture is defined, review and refine your template.yaml file. For a detailed reference and to ensure all your values are correct, visit the GitHub repository and check against the template.yaml file.

Copy the Lambda layer directly from the repository, and add it to ./src/Boto3PillowPyshorteners.zip.
In the .src/ directory, rename the generated handler.py to TextGeneration.py. You can also delete any unnecessary files.
Open TextGeneration.py and replace the placeholder code with the following:

import json
import boto3
import os
import logging
from botocore.exceptions import ClientError

LOG = logging.getLogger()
LOG.setLevel(logging.INFO)

region_name = os.getenv("region", "us-east-1")
s3_bucket = os.getenv("bucket")
model_id = os.getenv("model_id", "anthropic.claude-v2")

# Bedrock client used to interact with APIs around models
bedrock = boto3.client(service_name="bedrock", region_name=region_name)

# Bedrock Runtime client used to invoke and question the models
bedrock_runtime = boto3.client(service_name="bedrock-runtime", region_name=region_name)


def get_session_attributes(intent_request):
    session_state = intent_request["sessionState"]
    if "sessionAttributes" in session_state:
        return session_state["sessionAttributes"]

    return {}

def close(intent_request, session_attributes, fulfillment_state, message):
    intent_request["sessionState"]["intent"]["state"] = fulfillment_state
    return {
        "sessionState": {
            "sessionAttributes": session_attributes,
            "dialogAction": {"type": "Close"},
            "intent": intent_request["sessionState"]["intent"],
        },
        "messages": [message],
        "sessionId": intent_request["sessionId"],
        "requestAttributes": intent_request["requestAttributes"]
        if "requestAttributes" in intent_request
        else None,
    }

def lambda_handler(event, context):
    LOG.info(f"Event is {event}")
    accept = "application/json"
    content_type = "application/json"
    prompt = event["inputTranscript"]

    try:
        request = json.dumps(
            {
                "prompt": "\n\nHuman:" + prompt + "\n\nAssistant:",
                "max_tokens_to_sample": 4096,
                "temperature": 0.5,
                "top_k": 250,
                "top_p": 1,
                "stop_sequences": ["\\n\\nHuman:"],
            }
        )

        response = bedrock_runtime.invoke_model(
            body=request,
            modelId=model_id,
            accept=accept,
            contentType=content_type,
        )

        response_body = json.loads(response.get("body").read())
        LOG.info(f"Response body: {response_body}")
        response_message = {
            "contentType": "PlainText",
            "content": response_body["completion"],
        }
        session_attributes = get_session_attributes(event)
        fulfillment_state = "Fulfilled"

        return close(event, session_attributes, fulfillment_state, response_message)

    except ClientError as e:
        LOG.error(f"Exception raised while execution and the error is {e}")

To deploy the infrastructure, go back to the App Composer extension, and choose the Sync icon. Follow the guided AWS SAM instructions to complete the deployment.

App Composer Sync

After the message SAM Sync succeeded, navigate to CloudFormation in the AWS Management Console to see the newly created resources. To continue building the chatbot, follow the rest of the original tutorial.

Conclusion

This guide demonstrates how AI-generated CloudFormation can streamline your workflow in Application Composer, enhance your understanding of resource configurations, and speed up the development process. As always, adhere to the AWS Responsible AI Policy when using these features.

Introducing Amazon MQ cross-Region data replication for ActiveMQ brokers

2023-12-19 Pascal Vogel

Post Syndicated from Pascal Vogel original https://aws.amazon.com/blogs/compute/introducing-amazon-mq-cross-region-data-replication-for-activemq-brokers/

This post is written by Dominic Gagné, Senior Software Development Engineer, and Vinodh Kannan Sadayamuthu, Senior Solutions Architect

Amazon MQ now supports cross-Region data replication for ActiveMQ brokers. This feature enables you to build regionally resilient messaging applications and makes it easier to set up cross-Region message replication between ActiveMQ brokers in Amazon MQ. This blog post explains how cross-Region data replication works in Amazon MQ, how to setup cross-Region replica brokers for ActiveMQ, and how to test promoting a replica broker.

Amazon MQ is a managed message broker service for Apache ActiveMQ and RabbitMQ that simplifies setting up and operating message brokers on AWS.

Cross-Region replication improves the resilience and disaster recovery capabilities of your systems. This new Amazon MQ feature makes it easier to increase resilience of your ActiveMQ messaging systems across AWS Regions.

How cross-Region data replication works in Amazon MQ for ActiveMQ

The Amazon MQ for ActiveMQ cross-Region data replication feature replicates broker state from the primary broker in one AWS Region to the replica broker in another Region. Broker state consists of messages that have been sent to a broker by a message producer. Additionally, message acknowledgments and transactions are replicated. Scheduled messages and broker XML configuration are not replicated from the primary to the replica broker.

State replication occurs asynchronously and runs in the background. When a message is sent to a cross-Region data replication enabled broker, the data is persisted both to the primary data store and also on a queue used to replicate data. The replica broker acts as a client of this queue and consumes data that represents broker state from the primary broker.

At any given moment, only the primary broker is available for client connections. The replica broker is a hot standby and passively replicates the primary broker’s state. However, it does not accept client connections. The following diagram shows a simplified version of a cross-Region data replication broker pair. All replication traffic is encrypted using TLS and remains within AWS’ private backbone.

Configuring cross-Region replica brokers for Amazon MQ for ActiveMQ

To set up a cross-Region replica broker, your Amazon MQ for ActiveMQ primary broker must meet the following eligibility criteria:

ActiveMQ version 5.17.6 or above
Instance size m5.large or higher
Active/standby broker deployment enabled
Be in the Running state

If you do not have an ActiveMQ broker that meets these criteria, see Creating and configuring an ActiveMQ broker for instructions on how to create a primary broker.

To configure cross-Region replication

Navigate to the Amazon MQ console and choose Create replica broker.
Select a primary broker from the list of eligible primary brokers and choose Next.
Under Replica broker details, select the Region for your replica broker and enter a Replica broker name.
In the ActiveMQ console user for replica broker panel, enter a Username and Password for broker access.
In the Data replication user to bridge access between brokers panel, enter a replication user Username and Password.
In the Additional settings panel, keep the defaults and choose Next.
Review the settings and choose Create replica broker.
Note: The broker access type is automatically set based on the primary broker access type.
The creation process takes up to 25 minutes. Once the replica broker creation is complete, begin replication between the primary and the replica brokers by rebooting the primary broker.
Once the primary broker is rebooted and its status is Running, you can see the replica details in the Data replication panel of the primary broker.

Both brokers now synchronize with each other to establish an inter-Region network and connection through which broker state is replicated. Once both brokers are in the Running state, the primary broker accepts client connections and passes all broker state changes (messages, acknowledgments, transactions, etc.) to the replica broker.

The replica broker now asynchronously mirrors the state of the primary broker. However, it does not become available for client connections until it is promoted via a switchover or a failover. These operations are covered in the following section.

Testing data replication and promoting the replica broker

There are two ways to promote a replica broker: initiating a switchover or a failover.

Switchover	Failover
Prioritizes consistency over availability.	Prioritizes availability over consistency.
Brokers are guaranteed to have identical states.	Brokers are not guaranteed to be in identical states.
Brokers may not be available immediately to serve client traffic.	Replica broker is immediately available to serve client traffic.

To initiate a failover or switchover

1. Navigate to the Amazon MQ console, choose your primary broker, and log in to the ActiveMQ Web Console using the URLs located in the Connections panel.
2. In the top menu, select Queues. You should be able to see four ActiveMQ.Plugin.Replication queues used by the replication feature.
3. To test message replication from the primary to a replica broker, create a queue and send messages. To create the queue:
  - For Queue Name, enter TestQueue.
  - Choose Create.
4. Under Operations for the TestQueue, choose Send To and perform the following steps:
  - For Number of messages to send, enter 10 and keep the other defaults.
  - Under Message body, enter a test message.
  - Choose Send.
5. To promote the replica broker, navigate to the Amazon MQ console and change the Region to the AWS Region where the replica broker is located.
6. Select the replica broker (in this example called Secondarybroker) and choose Promote replica.
7. In the Promote replica broker pop-up window:
  - Select Failover or Switchover.
  - Enter confirm in text box.
  - Choose Confirm.
8. While a replica broker is being promoted, its replication status changes to Promotion in progress. The corresponding primary broker’s replication status changes to Demotion in progress.

Replica Secondarybroker status – Promotion in progress:

Primary broker status – Demotion in progress:

Secondarybroker status – Promoted to new primary broker:

Once the Secondarybroker status is Running, log in to the ActiveMQ Web Console from the URLs located in the Connections panel. You can see the replicated messages sent from the former primary broker in Step 4 in the TestQueue:

Monitoring cross-Region data replication

To monitor cross-Region data replication progress, you can use the Amazon CloudWatch metrics TotalReplicationLag and ReplicationLag.

You can use these two metrics to monitor the progress of a switchover. When their value reaches zero, the switchover will complete because the broker states have been synchronized and the replica broker begins accepting client connections. If the switchover does not progress fast enough, or if you need the replica broker to be immediately available to serve client traffic, you can request a failover at any time.

Note: A failover can interrupt an ongoing switchover. However, a switchover cannot interrupt an ongoing failover.

Issuing a failover request causes the replica broker to become immediately available, but does not provide any guarantees about what data has been replicated to the replica broker. This means that a failover can make data tracking and reconciliation more challenging for your client application than a switchover.

For this reason, we recommend that you always start with a switchover and interrupt it with a failover if necessary. To interrupt an ongoing switchover, follow the same steps as for promoting a replica broker, select the failover option, and confirm.

Note: If you fail back to the original primary broker, messages that are not replicated from the primary to the replica broker during the failover will still exist on the primary broker. Therefore, consumers must manage these messages. We recommend tracking the processed message IDs in a data store such as Amazon DynamoDB global tables and comparing the message to the processed message IDs.

If you no longer need to replicate broker data across Regions or if you need to delete a primary or replica broker, you must unpair the replica broker and reboot the primary broker. You can unpair the replica broker in the Amazon MQ console by following Delete a CRDR broker.

To unpair the broker using the AWS Command Line Interface (AWS CLI), run the following command, replacing the --broker-id with your primary broker ID:

aws mq update-broker --broker-id <primary broker ID> \
--data-replication-mode "NONE" \
--region us-east-1

Conclusion

Using the cross-Region data replication feature for Amazon MQ for ActiveMQ provides a straightforward way to implement cross-Region replication to improve the resilience of your architecture and meet your business continuity and disaster recovery requirements. This post explains how cross-Region data replication works in Amazon MQ, how to set up a cross-Region replica broker, and how to test and promote the replica broker.

For more details, see the Amazon MQ documentation.

For more serverless learning resources, visit Serverless Land.