All posts by Chris Munns

Measuring Amazon MQ throughput using Maven 2 benchmark and AWS CDK

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/measuring-amazon-mq-throughput-using-maven-2-benchmark-and-aws-cdk/

This post is written by Olajide Enigbokan, Senior Solutions Architect and Mohammed Atiq, Solutions Architect

In this post, you will learn how to evaluate throughput for Amazon MQ, a managed message broker service for ActiveMQ, by using the ActiveMQ Classic Maven Performance test plugin. The post also provides recommendations for configuring Amazon MQ to optimize throughput when using ActiveMQ as the broker engine.

Overview on benchmarking throughput for Amazon MQ for ActiveMQ

To get a good balance of cost and performance when running ActiveMQ on Amazon MQ, AWS recommends that customers benchmark during migration and before any instance type or size upgrade or downgrade. Benchmarking can help you choose the correct instance type and size for your workload requirements. For common benchmark scenarios and benchmark figures for different instance types and sizes, see Amazon MQ for ActiveMQ Throughput benchmarks.

Performance of your ActiveMQ workload depends on the specifics of your use case. For example, if you have a workload where durability is extremely important (meaning that messages cannot be lost), enabling persistence mode ensures that messages are persisted to disk before the broker informs the client that the message send action has completed. The faster the disk I/O capacity and the smaller the message size during these writes, the better the message throughput. For this reason, AWS recommends the mq.m5.* instance types for regular development, testing, and production workloads, as described in Amazon MQ for ActiveMQ instance types. The mq.t2.micro and mq.t3.micro instance types are intended for product evaluation and are subject to burst CPU credits and baseline performance, so they are not suitable for applications that require fixed performance. If you select a larger broker instance type, AWS also recommends batching transactions for the persistent store, which allows you to send multiple messages per transaction and achieve an overall higher message throughput.
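
To illustrate transaction batching, the following is a minimal Python sketch using the stomp.py client (Amazon MQ for ActiveMQ also exposes a STOMP endpoint). The broker endpoint, credentials, and batch size are placeholder assumptions; the same pattern applies to JMS transacted sessions in Java.

import stomp

# Placeholder endpoint and credentials - use your broker's stomp+ssl
# endpoint (port 61614) from the Amazon MQ console.
BROKER = ("b-xxxx-1.mq.us-east-1.amazonaws.com", 61614)

conn = stomp.Connection(host_and_ports=[BROKER])
conn.set_ssl(for_hosts=[BROKER])
conn.connect("testuser", "password", wait=True)

BATCH_SIZE = 100
messages = [f"message-{i}" for i in range(1000)]

# Send persistent messages in transactions of BATCH_SIZE. The broker
# persists the whole batch on commit, amortizing disk writes across
# many messages instead of one write per message.
for start in range(0, len(messages), BATCH_SIZE):
    tx = conn.begin()
    for body in messages[start:start + BATCH_SIZE]:
        conn.send("/queue/PERF.TEST", body,
                  headers={"persistent": "true"}, transaction=tx)
    conn.commit(tx)

conn.disconnect()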

The next section describes how to set up your own benchmark for Amazon MQ using the open-source ActiveMQ Classic Maven Performance test plugin. This benchmark suite is highly recommended because of its straightforward setup and deployment process.

Getting started

This walkthrough guides you through the steps for benchmarking your Amazon MQ brokers:

Step 1 – Build and push container image to Amazon ECR

Clone the mq-benchmarking-container-image-sample repository and follow the steps in the README file to build and push your image to an Amazon Elastic Container Registry (Amazon ECR) public repository. You will need this container image for Step 2.

Step 2 – Automate Your Benchmarking Setup with AWS CDK

Architecture of CDK deployment

To streamline the deployment of an active/standby ActiveMQ broker alongside Amazon Elastic Container Service (Amazon ECS) tasks for this walkthrough, follow the steps below to set up the environment using the AWS Cloud Development Kit (AWS CDK). This deploys the resources shown in the preceding architecture diagram.

2.1. Prerequisites:

Ensure the following packages are installed:

2.2 Repository Setup:

Clone the mq-benchmarking-sample repository. This repository contains all the necessary code and instructions to automate the benchmarking process using the AWS CDK.

2.3 Create a Virtual Environment:

Change directory (cd) to the cloned repository directory and create a Python virtual environment by running the following command:

cd mq-benchmarking-sample

python -m venv .venv

2.4 Activate Virtual Environment:

Run the following commands to activate your virtual environment:

# Linux
source .venv/bin/activate

# Windows
.\.venv\Scripts\activate

2.5 Install Dependencies:

Install the required Python packages using:

pip install -r requirements.txt

2.6 Customize and Deploy:

In this step, deploy the stacks and resources needed for benchmarking into your AWS account. The cdk deploy command below deploys three stacks with resources for Amazon ECS, Amazon MQ, and a VPC. Deploy your application with AWS CDK using the command:

cdk deploy "*" -c container_repo_url=<YOUR CONTAINER REPO URL> -c container_repo_tag=<YOUR CONTAINER REPO TAG>

This command deploys your application with the specified Docker image. Replace <YOUR CONTAINER REPO URL> and <YOUR CONTAINER REPO TAG> with your specific Docker repo image details from Step 1. An example container repo URL would look like this: public.ecr.aws/xxxxxxxxx/xxxxxxxxxx.

The deployment of the stacks and their resources happens in three stages. Select “yes” at each stage to deploy the stated changes as shown below:

First stage of the deploy

Select yes to deploy these changes

Deployed stacks and their resources

Optionally, you can include additional context variables in your command as seen below:

cdk deploy "*" -c vpc_cidr=10.0.0.0/16 -c mq_cidr=10.0.0.0/16 -c broker_instance_type=mq.m5.large -c mq_username=testuser -c tasks=2 -c container_repo_url=<YOUR CONTAINER REPO URL> -c container_repo_tag=<YOUR CONTAINER REPO TAG>

Note: In the example command above, the vpc_cidr specified is the same as the mq_cidr. If you use this command, ensure that your vpc_cidr range matches your mq_cidr range. AWS recommends this as a security best practice so that your broker endpoint is only accessible from recognized IP ranges; see Security best practices for Amazon MQ.

More details on the above context variables:

  • broker_instance_type: Represents the instance type for the Amazon MQ Broker. You can start with the instance type mq.m5.large.
  • vpc_cidr: Allows you to customize the VPC’s CIDR block. The default CIDR is set to 10.42.0.0/16.
  • mq_cidr: Allows you to set a specific security group CIDR range for the broker. This must be set to the vpc_cidr. In the sample command above, it is set to 10.0.0.0/16. For more flexibility with source IP ranges, you can edit the broker security group of your CDK deployment.
  • mq_username: Allows you to specify a username for accessing the ActiveMQ web console and broker.
  • tasks: Determines the number of ECS tasks (1 or more) that run your Docker image. Since the OpenWire configuration files for consumers and producers let you specify the number of clients you want, all the clients in one ECS task share that task’s CPU and memory allocation. You also have the option to run multiple ECS tasks (each with multiple clients) running the benchmark in parallel.

These adjustments allow for a more customized deployment to fit specific benchmarking needs and scenarios.

2.7 Benchmarking Execution

After deployment, you should see an output similar to the following:

Successful deployment of CDK application with output

1. Retrieve the TASK-ARN and access the Container

The exec command shown in “Outputs:” requires that you supply a <TASK-ARN> before it can be run. To retrieve the <TASK-ARN> via the AWS CLI, do the following:

  • Run the below command and note down the Task ARN (needed later):
aws ecs list-tasks --cluster <cluster-name> --region <region>

You can also retrieve this value via the Amazon ECS console by going to your ECS Cluster and choosing Tasks.

  • Access the running ECS task using the ECS Exec feature with the command that is output from the CDK deployment. The command should look like the following:
aws ecs execute-command --region eu-central-1 --cluster arn:aws:ecs:eu-central-1:XXXXXXXX:cluster/ECS-Stack-ClusterEB0386A7-gRmSxC06y4ay --task <TASK-ARN> --container Benchmarking-Container --command "/bin/bash" --interactive

Before running the above command, replace the placeholder value of <TASK-ARN> with the value of the actual Task ARN noted earlier.

After retrieving the <TASK-ARN>, and running the exec command, you should have a directory structure as follows:

Directory Structure within ECS Task using ECS Exec

2. Configure the openwire-producer.properties and openwire-consumer.properties files.

Open both files. Shown below is the content of the openwire-producer.properties and openwire-consumer.properties files.

openwire-producer.properties:

sysTest.reportDir=./reports/
sysTest.samplers=tp
sysTest.spiClass=org.apache.activemq.tool.spi.ActiveMQReflectionSPI
sysTest.clientPrefix=JmsProducer
sysTest.numClients=25


producer.destName=queue://PERF.TEST
producer.deliveryMode=persistent
producer.messageSize=1024
producer.sendDuration=300000

factory.brokerURL=
factory.userName=
factory.password=

openwire-consumer.properties:

sysTest.reportDir=./reports/
sysTest.samplers=tp
sysTest.spiClass=org.apache.activemq.tool.spi.ActiveMQReflectionSPI
sysTest.destDistro=equal
sysTest.clientPrefix=JmsConsumer
sysTest.numClients=25

consumer.destName=queue://PERF.TEST

factory.brokerURL=
factory.userName=
factory.password=

In both files, provide the brokerURL, userName, and password, as they are required before starting the benchmarking process. The broker URL and username can be obtained from the Amazon MQ console.

Amazon MQ broker

Once you choose the deployed broker, you will find the broker URL under the Endpoints section for OpenWire.

Endpoints in Amazon MQ console

The endpoint URL for OpenWire should be in this format:

failover:(ssl://b-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-1.mq.<aws region>.amazonaws.com:61617,ssl://b-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-2.mq.<aws region>.amazonaws.com:61617)

Retrieve username from Amazon MQ console

Since you are using an active/standby broker, the test only uses the active broker URL, not both; the failover transport manages this switchover automatically. The password can be retrieved from the AWS Secrets Manager console or via the CLI.
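
As an illustration, you could also retrieve the password programmatically with a short boto3 sketch like the following. The secret name, Region, and the assumption that the secret stores JSON with a "password" key are placeholders; use the values from your CDK deployment.

import boto3
import json

client = boto3.client("secretsmanager", region_name="eu-central-1")

# Placeholder secret name - use the secret created by the CDK deployment.
response = client.get_secret_value(SecretId="<your-mq-credentials-secret>")

# Assumes the secret stores JSON containing a "password" key.
credentials = json.loads(response["SecretString"])
print(credentials["password"])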

The following parameters and values can be adjusted in both the producer and consumer properties files to suit your use case:

  • sendDuration: The time for which the producer/consumer test runs. The default value is 300000ms (5 minutes).
  • messageSize: Adjusts the size of the messages sent; it is set to 1024 bytes (1KB) by default.
  • deliveryMode: Set to persistent by default.
  • numClients: Sets the number of concurrent producers or consumers, influencing message processing speed. It is set to 25 by default.
  • destName: The name of your destination queue or topic. You can change the name to your preference.

For a more comprehensive guide, refer to the mq-benchmarking-sample documentation.

2.8 Benchmark Results

After populating both the producer and consumer files with the required parameters, run the following Maven commands (one after the other) in separate terminals to start the test:

Maven producer command:

mvn activemq-perf:producer -DsysTest.propsConfigFile=openwire-producer.properties

Maven consumer command:

mvn activemq-perf:consumer -DsysTest.propsConfigFile=openwire-consumer.properties

Once each of the above tests completes, it prints a summary to stdout as shown below:

#########################################
####    SYSTEM THROUGHPUT SUMMARY    ####
#########################################
System Total Throughput: 562020
System Total Clients: 25
System Average Throughput: 1873.4000000000003
System Average Throughput Excluding Min/Max: 1860.8333333333333
System Average Client Throughput: 74.936
System Average Client Throughput Excluding Min/Max: 74.43333333333334
Min Client Throughput Per Sample: clientName=JmsProducer19, value=2
Max Client Throughput Per Sample: clientName=JmsProducer13, value=169
Min Client Total Throughput: clientName=JmsProducer0, value=20224
Max Client Total Throughput: clientName=JmsProducer5, value=23917
Min Average Client Throughput: clientName=JmsProducer0, value=67.41333333333333
Max Average Client Throughput: clientName=JmsProducer5, value=79.72333333333333
Min Average Client Throughput Excluding Min/Max: clientName=JmsProducer0, value=67.04333333333334
Max Average Client Throughput Excluding Min/Max: clientName=JmsProducer8, value=78.91
[main] INFO org.apache.activemq.tool.reports.XmlFilePerfReportWriter - Created performance report: /app/activemq-perftest/./reports/JmsProducer_numClients25_numDests1_all.xml
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 5:05.052s
[INFO] Finished at: Mon Apr 29 10:22:01 UTC 2024
[INFO] Final Memory: 15M/60M
[INFO] ------------------------------------------------------------------------
#########################################
####    SYSTEM THROUGHPUT SUMMARY    ####
#########################################
System Total Throughput: 562023
System Total Clients: 25
System Average Throughput: 1873.4100000000005
System Average Throughput Excluding Min/Max: 1864.6599999999996
System Average Client Throughput: 74.93640000000002
System Average Client Throughput Excluding Min/Max: 74.58639999999998
Min Client Throughput Per Sample: clientName=JmsConsumer13, value=0
Max Client Throughput Per Sample: clientName=JmsConsumer13, value=105
Min Client Total Throughput: clientName=JmsConsumer13, value=22475
Max Client Total Throughput: clientName=JmsConsumer14, value=22495
Min Average Client Throughput: clientName=JmsConsumer13, value=74.91666666666667
Max Average Client Throughput: clientName=JmsConsumer14, value=74.98333333333333
Min Average Client Throughput Excluding Min/Max: clientName=JmsConsumer13, value=74.56666666666666
Max Average Client Throughput Excluding Min/Max: clientName=JmsConsumer14, value=74.63333333333334
[main] INFO org.apache.activemq.tool.reports.XmlFilePerfReportWriter - Created performance report: /app/activemq-perftest/./reports/JmsConsumer_numClients25_numDests1_equal.xml
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 5:02.434s
[INFO] Finished at: Mon Apr 29 10:22:02 UTC 2024
[INFO] Final Memory: 14M/68M
[INFO] ------------------------------------------------------------------------

The above output is a sample taken from one test run.

System Average Throughput and System Total Clients are the most useful metrics.

In the reports directory, look for two XML files with more detailed throughput metrics. In the JmsProducer_numClients25_numDests1_all.xml file, for example, the jmsClientSettings and jmsFactorySettings sections capture the different broker switches used.

Each report file captures the exact test and broker environment. Keeping these files allows you to compare performance between different test cases and analyze how a given set of configurations has impacted performance.
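
To compare runs programmatically, you can parse these reports with a few lines of Python. The following is a deliberately generic sketch: rather than assuming a particular report schema, it walks the document and prints every element, which is enough to diff settings and throughput samples between runs.

import xml.etree.ElementTree as ET

tree = ET.parse("reports/JmsProducer_numClients25_numDests1_all.xml")

# Walk the report and print each element with its attributes and text.
for elem in tree.getroot().iter():
    text = (elem.text or "").strip()
    if elem.attrib or text:
        print(elem.tag, elem.attrib, text[:80])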

For this test, the average throughput for a producer is around 1873 messages per second across 25 clients. Keep in mind that the broker instance is an mq.m5.large; you can achieve higher throughput with more clients and a larger broker instance. This test demonstrates the concept of running fast consumers while producing messages.

More comprehensive information on the test output can be found in performance testing.

By following these guidelines and using ECS Exec for direct access, you can deploy the ActiveMQ Classic Maven Performance test plugin using AWS CDK. This setup allows you to customize and execute benchmark tests on Amazon MQ from within an ECS task, enabling an automated and efficient deployment and testing workflow.

Amazon MQ benchmarking architecture

Amazon MQ for ActiveMQ brokers can be deployed as a single-instance broker or as an active/standby broker. Amazon MQ is architected for high availability (HA) and durability. For HA and broker benchmarking, AWS recommends using the active/standby deployment. After a message is sent to Amazon MQ in persistent mode, the message is written to the highly durable message store which replicates the data across multiple nodes and Availability Zones.

Cleanup

To avoid incurring future charges for the resources deployed in this walkthrough, run the following command and follow the prompts to delete the CloudFormation stacks launched in 2.6 Customize and Deploy:

cdk destroy "*"

Conclusion

This post provides a detailed guide on performing benchmarking for Amazon MQ for ActiveMQ brokers leveraging the ActiveMQ Classic Maven Performance test plugin. Benchmarking plays a crucial role for customers migrating to Amazon MQ, as it offers insights into the broker’s performance under conditions that mirror their existing setup. This process enables customers to fine-tune their configurations and choose the appropriate instance type that aligns with their specific use case, ensuring optimal handling of their workloads’ throughput.

Get started with Amazon MQ by using the AWS Management Console, AWS CLI, AWS Software Development Kit (SDK), or AWS CloudFormation. For information on cost, see Amazon MQ pricing.

Efficiently processing batched data using parallelization in AWS Lambda

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/efficiently-processing-batched-data-using-parallelization-in-aws-lambda/

This post is written by Anton Aleksandrov, Principal Solutions Architect, AWS Serverless

Efficient message processing is crucial when handling large data volumes. By employing batching, distribution, and parallelization techniques, you can optimize the utilization of resources allocated to your AWS Lambda function. This post will demonstrate how to implement parallel data processing within the Lambda function handler, maximizing resource utilization and potentially reducing invocation duration and function concurrency requirements.

Overview

AWS Lambda integrates with various event sources, such as Amazon SQS, Apache Kafka, and Amazon Kinesis, using event-source mappings. When you configure an event-source mapping, Lambda continuously polls the event source and automatically invokes your function to process the retrieved data. Lambda makes more invocations of your function as the number of messages it reads from the event source increases. This can increase the utilized function concurrency and consume the available concurrency in your account. See the documentation to learn more about how Lambda consumes messages from SQS queues and Kinesis streams.

To improve the data processing throughput, you can configure event-source mapping batch window and batch size. These settings ensure that your function is invoked only when a sufficient number of messages have accumulated in the event source. For example, if you configure a batch size of 100 messages and a batch window of 10 seconds, Lambda will invoke your function when either 100 messages have accumulated or 10 seconds have elapsed, whichever happens first.
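
As an illustration, that batching configuration could be applied with a boto3 sketch like the following; the queue ARN and function name are placeholders.

import boto3

lambda_client = boto3.client("lambda")

# Invoke the function once 100 messages have accumulated or 10 seconds
# have elapsed, whichever happens first.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:my-queue",  # placeholder
    FunctionName="my-batch-processor",                             # placeholder
    BatchSize=100,
    MaximumBatchingWindowInSeconds=10,
)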

Event source mapping event batching

By processing messages in batches, rather than individually, you can improve throughput and optimize costs by reducing the number of polling requests to event sources and the number of function invocations. For instance, processing a million messages without batching would require one million function invocations, but configuring a batch size of 100 messages can reduce the number of invocations to 10,000.

Optimizing batch processing within the Lambda execution environment

Each Lambda execution environment processes one event per invocation. With batching enabled, the event object Lambda sends to the function handler contains an array of messages retrieved and batched by the event-source mapping. Once an execution environment starts processing an event object containing a batch of messages, it won’t handle additional invocations until the current one is complete. However, simply iterating over the array of messages and processing them one by one may not fully utilize the allocated compute resources. This can lead to underutilized or idle compute resources, like CPU capacity, and hence longer overall processing times.

Underutilized Lambda environments

Underutilization of compute resources can be generally caused by two things – non-CPU-intensive blocking tasks, such as sending HTTP requests and waiting for responses, and single-threaded processing when you have more than one vCPU core. To address these concerns and maximize resource utilization, you can implement your functions to process data in parallel. This allows more efficient utilization of the allocated compute capacity, reducing invocation duration, time spent idle, and the total concurrency required. In addition, when you allocate more than 1.8GB of memory to your function, it also gets more than one vCPU, which allows threads to land on separate cores for even better performance and true parallel processing.

Improved concurrency in Lambda environment

When processing messages sequentially with a low compute utilization rate, reducing memory allocation may seem intuitive to save costs. This, however, can result in slower performance due to less CPU capacity being allocated. When your function is parallelizing data processing within the execution environment, you’re getting a higher compute utilization rate, and since raising the memory allocation also provides additional CPU capacity, it can lead to better performance. Use the Lambda Power Tuning tool to find the optimal memory configuration, balancing cost with performance.

Understanding the Lambda execution environment lifecycle

After processing an invocation, the Lambda execution environment is “frozen” by the Lambda service. Lambda runtime considers the invocation complete and “freezes” the execution environment when your function handler returns.

When the Lambda service is looking for an execution environment to process a new incoming invocation, it will first try to “thaw” and use any available execution environments that were previously “frozen”. This cycle repeats until the execution environment is eventually shut down.

Lambda worker lifecycle over time

Implementing parallel processing within the Lambda execution environment

You can implement parallel processing by running multiple threads in your function handler, but if those threads are still running when the handler returns, then they will be “frozen” together with the execution environment until the next invocation. This can lead to unexpected behavior, where the execution environment is “thawed” to process a new invocation, however, it still has background threads running and processing data from previous invocations. If you do not handle this properly, the behavior can cascade across multiple invocations, leading to delayed or unfinished processing and complicated debugging.

Threads frozen before finishing

To address this concern, you need to ensure that the background threads you spawn in the function handler are done processing data before returning from the handler. All threads spawned within a particular invocation must complete within the same invocation in order not to spill over to subsequent invocations. This is illustrated in the following diagram. You can see threads start and end within the same invocation, and only once all threads have finished, the function handler returns.

Threads returning before end of invoke

Sample code

Programming languages offer diverse techniques and terminology for parallel and concurrent processing. Java employs multi-threading and thread pools. Node.js, though single-threaded, provides the event loop and promises (for async programming), as well as child processes and worker threads (for actual multi-threading). Python supports both multi-threading (subject to the Global Interpreter Lock) and multi-processing. Coroutines are another concurrency technique gaining attention.

The following sample is provided for illustration purposes only and is based on Node.js promises running concurrently. The sample code uses a language-agnostic term “worker” to denote a unit of parallel processing. Your specific parallelization implementation depends on your choice of runtime language and frameworks. AWS recommends you use battle-tested frameworks like Powertools for AWS Lambda that implement concurrent batch processing when possible. Regardless of the programming language, it is crucial to ensure all background threads/workers/promises/routines/tasks spawned by the function handler are completed within the same invocation before the handler returns.

Sample implementation with Node.js

const NUMBER_OF_WORKERS = 4;

export const handler = async (event) => {
    const workers = []; 
    const messages = event.Records;
    
    // For handling partial batch processing errors
    const batchItemFailures = [];

    for (let i=0; i<NUMBER_OF_WORKERS;i++){
        // No await here! The waiting will happen later
        const worker = spawnWorker(i, messages, batchItemFailures);
        workers.push(worker);
    }
    
    // This line is crucial. This is where the handler
    // waits for all workers to complete their tasks
    const processingResults = await Promise.allSettled(workers);
    console.log('All done!');

    // Return messageIds of all messages that failed 
    // to process in order to retry
    return {batchItemFailures};
};

async function spawnWorker(id, messages, batchItemFailures){
    console.log(`worker.id=${id} spawning`);
    while (messages.length>0){
        const msg = messages.shift();
        console.log(`worker.id=${id} processing message`);
        try {
            // A blocking, but not CPU-intensive operation 
            await processMessage(msg);
        } catch (err){
            // If message processing failed, add it to 
            // the list of batch item failures
            batchItemFailures.push({ itemIdentifier: msg.messageId});
        }
    }
}

// Placeholder used by the sample above: simulates a blocking, but not
// CPU-intensive operation, such as an HTTP request taking 200ms
async function processMessage(msg){
    await new Promise((resolve) => setTimeout(resolve, 200));
}

See the sample code and AWS Cloud Development Kit (CDK) stack at github.com.

Testing results

The following chart illustrates a Lambda function processing messages using an SQS event-source mapping. After enabling message processing with four workers, the invocation duration and concurrent executions dropped to a quarter of their previous values, while still processing the same number of messages per second. Thanks to parallelization, the new function is faster and requires less concurrency.

Function performance dashboard

Looking at the invocation log, you can see that the function handler spawned four workers, and all of them completed before the handler returned the result. You can also see that although the handler received 20 items, with each item taking 200ms to process, the overall duration is only about 1000ms. This is because items were processed in parallel (20 items * 200ms / 4 workers = 1000ms total processing time).

START RequestId: (redacted)  Version: $LATEST
2024-06-18T03:18:03.049Z    INFO    Got messages from SQS
2024-06-18T03:18:03.049Z    INFO    messages.length=20
2024-06-18T03:18:03.049Z    INFO    worker.id=0 spawning
2024-06-18T03:18:03.049Z    INFO    worker.id=0 processing message
2024-06-18T03:18:03.049Z    INFO    worker.id=1 spawning
2024-06-18T03:18:03.049Z    INFO    worker.id=1 processing message
2024-06-18T03:18:03.050Z    INFO    worker.id=2 spawning
2024-06-18T03:18:03.050Z    INFO    worker.id=2 processing message
2024-06-18T03:18:03.050Z    INFO    worker.id=3 spawning
2024-06-18T03:18:03.050Z    INFO    worker.id=3 processing message
2024-06-18T03:18:03.250Z    INFO    worker.id=0 processing message
2024-06-18T03:18:03.250Z    INFO    worker.id=1 processing message
(redacted for brevity)
2024-06-18T03:18:03.852Z    INFO    worker.id=1 processing message
2024-06-18T03:18:03.852Z    INFO    worker.id=2 processing message
2024-06-18T03:18:03.852Z    INFO    worker.id=3 processing message
2024-06-18T03:18:04.052Z    INFO    All done!
END RequestId: (redacted)
REPORT RequestId: (redacted) Duration: 1004.48 ms

Considerations

  • The technique and samples described in this post assume unordered message processing. If you use ordered event sources, such as SQS FIFO queues, and require preserving message order, you will need to address that in your implementation code. One technique might be creating a separate thread for each messageGroupId, as sketched after this list.
  • While providing performance and cost benefits, multi-threading and parallel processing are advanced techniques that require proper error handling. Lambda supports partial batch responses, where you can report back to the event source that specific messages from the batch failed to process, so they can be retried. You can collect failed message IDs from each thread and return them as your function handler response, as illustrated in the sample above. See Handling errors for an SQS event source in Lambda and Best practices for implementing partial batch responses for additional details.
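
The following is a minimal Python sketch of the per-messageGroupId idea; process_message is a placeholder for your own logic, and it assumes the standard SQS event format where FIFO records carry a MessageGroupId attribute.

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def process_message(record):
    ...  # placeholder for your processing logic

def handler(event, context):
    # Group records by MessageGroupId so ordering is preserved within
    # each group while different groups are processed in parallel.
    groups = defaultdict(list)
    for record in event["Records"]:
        groups[record["attributes"]["MessageGroupId"]].append(record)

    def process_group(records):
        for record in records:
            process_message(record)

    # Wait for every group to finish before the handler returns, so no
    # work spills over into the next invocation.
    with ThreadPoolExecutor(max_workers=max(len(groups), 1)) as pool:
        list(pool.map(process_group, groups.values()))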

Conclusion

Efficiently processing large volumes of data implies efficient resource utilization. When processing batches of messages from event sources, validate whether your function would benefit from parallel or concurrent processing within the function handler, thus increasing the compute capacity utilization rate. With a high compute capacity utilization rate, you can allocate more memory to your function, thus getting more CPU allocated as well, for faster and more efficient processing. Use frameworks like Powertools for AWS Lambda that implement concurrent batch processing when possible, and use the Lambda Power Tuning tool to find the best memory configuration for your functions, balancing performance and cost.

For more serverless learning resources, visit Serverless Land.

Strengthening data security in AWS Step Functions with a customer-managed AWS KMS key

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/strengthening-data-security-in-aws-step-functions-with-a-customer-managed-aws-kms-key/

This post is written by Dhiraj Mahapatro, AWS Principal Specialist SA, Serverless.

AWS Step Functions provides enhanced security with a customer-managed AWS KMS key. This allows organizations to maintain complete control over the encryption keys used to protect their data in Step Functions, ensuring that only allowed principals (an IAM role, user, or group) have access to the sensitive information processed in a state machine. This post explores the details of this feature and the new console experience of executing Step Functions workflows when a customer-managed KMS key is used.

Step Functions is a serverless orchestration service that enables you to coordinate multiple AWS services, microservices, and third-party integrations into business-critical applications. Step Functions is widely used for orchestrating complex workflows, such as loan processing, fraud detection, risk management, and compliance processes. By breaking down these processes into a series of steps, Step Functions provides a clear overview and control of the entire workflow. This ensures that each stage executes correctly and in the right order. One critical aspect of using Step Functions in regulated industries is security and data protection: Step Functions workflows manage sensitive customer data, including PII and financial records, which requires protection against unauthorized access and data breaches. Enabling a customer-managed KMS key further strengthens the data security of a state machine.

Using customer-managed AWS KMS keys

With this launch, Step Functions enables encryption of the state machine definition and execution details, including event history, using customer-managed symmetric KMS keys. As part of this feature, you also have the option to encrypt Step Functions activities using a customer-managed key.

This post uses a sample application to show the implementation details of this new feature. See the user guide for a detailed explanation of this feature.

The sample application shows a basic stock trading example where the state machine buys or sells a stock depending on whether its price is above or below 50, and finally saves the transaction.

Example workflow

The Step Functions CloudFormation resource for the state machine has a new property, EncryptionConfiguration, as shown in the following:

StockTradingStateMachine:
  Type: AWS::StepFunctions::StateMachine
  Properties:
    StateMachineName: !FindInMap ['StateMachine', 'Name', 'Value']
    RoleArn: !GetAtt StockTradingStateMachineExecutionRole.Arn
    EncryptionConfiguration:
      KmsKeyId: !Ref StocksKmsKey
      KmsDataKeyReusePeriodSeconds: 100
      Type: CUSTOMER_MANAGED_KMS_KEY
    Definition: . . .

Within EncryptionConfiguration, you specify the KmsKeyId and the Type. This sample application uses the CUSTOMER_MANAGED_KMS_KEY key type. Type is a required field and is AWS_OWNED_KEY when a customer-managed key is not used. The state machine also allows you to set the KmsDataKeyReusePeriodSeconds property to a value between 60 and 900 seconds (default: 300), which is the maximum duration for which the state machine reuses data keys. When the period expires, Step Functions calls the GenerateDataKey API on AWS KMS. Therefore, besides kms:Decrypt, Step Functions needs access to the kms:GenerateDataKey action.
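
The same configuration can also be supplied through the CreateStateMachine API. The following boto3 sketch shows the shape of the call; the role ARN, key ARN, and definition are placeholders.

import boto3

sfn = boto3.client("stepfunctions")

sfn.create_state_machine(
    name="StockTradingStateMachine",
    definition="{ ... }",  # Amazon States Language definition (placeholder)
    roleArn="arn:aws:iam::123456789012:role/sfn-execution-role",  # placeholder
    encryptionConfiguration={
        "kmsKeyId": "arn:aws:kms:us-east-1:123456789012:key/<key-id>",  # placeholder
        "kmsDataKeyReusePeriodSeconds": 100,
        "type": "CUSTOMER_MANAGED_KMS_KEY",
    },
)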

The sample application also creates a customer-managed KMS key with a condition to force the stock trading state machine to only use the key.

Security controls

Within an AWS Organizations setup, the best-practice guidance is to have a dedicated security organizational unit responsible for managing and enforcing security standards, including ownership of KMS keys. The security account provides cross-account access for key usage. You grant admin access only to the root of the security account, while external or member accounts can access the key for purposes like decryption, encryption, description, and data key generation. This can be done through an IAM role, user, or group in the member account. The standard approach for cross-account access combines KMS key policies in the security account with IAM policies attached to the identity in the member account that grant permission for the service.

Cross account access

For Step Functions, you can go a step further: restrict access to the caller’s role in the member account and provide a condition that forces only the Step Functions service to use the key. For example, with a security account (id: 1111111111) and a member account (id: 1234567890), the KMS key policy can use a kms:ViaService condition to restrict access to Step Functions state machines in the us-east-1 Region only:

{
  "Sid": "Allow access to member account via Step Functions service",
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::1234567890:role/MemberAccountRole"
  },
  "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "kms:ViaService": "states.us-east-1.amazonaws.com"
    }
  }
}

Constantly updating the key policy for every new Step Functions workflow in member accounts is cumbersome. Therefore, a combination of a KMS key policy and IAM roles grants fine-grained, least-privilege access to key actions. For organizations that do not have a security account or security organizational unit, the member account owns the KMS key, as shown below. The key policy must then be more restrictive, scoped to the Step Functions execution role and the ARN of the state machine that uses the key.

Member account ownership

For example, a member account with account id 1234567890 sets the Step Functions execution role sfn-execution-role as the Principal and restricts key usage to a specific state machine ARN in the same account by using the kms:EncryptionContext:aws:states:stateMachineArn condition, as shown in the following:

{
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::1234567890:role/sfn-execution-role"
  },
  "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "kms:EncryptionContext:aws:states:stateMachineArn": 
      "arn:aws:states:us-east-1:1234567890:stateMachine:MyStateMachine"
    }
  }
}

Testing

To set up the application in your AWS account, you need the following tools:

Clone the git repository. To build and deploy your application for the first time, run the following in your shell from the repository home directory:

sam build && sam deploy --guided

You can find the State Machine’s ARN in the output values displayed after deployment.

Once deployed, run the application using the AWS CLI. Run the following command after replacing the state machine ARN from the output of the deployment and the region where you have the state machine:

aws stepfunctions start-execution \
  --state-machine-arn <state-machine-arn> \
  --region <region>

You get a successful response in the CLI. You can also see a corresponding execution listed in the AWS Console as RUNNING:

Running workflow

However, opening the execution details will show an “Access Denied” error as expected:

Access denied error

You get the same error while visualizing the Step Functions definition or editing the state machine. The sample application restricts decryption with the KMS key to the Step Functions workflow’s execution role only. Therefore, no other entity can decrypt the state machine’s workflow execution details or its definition. This prevents exposure of information, including the payload passed to Step Functions or passed between state transitions, to external entities. This new feature allows personally identifiable information (PII), payment card information (PCI), and other similar sensitive information to be handled securely in Step Functions, unlocking existing sensitive workloads for the service.

You can integrate Amazon CloudWatch Logs with Step Functions for logging and monitoring capabilities. To send logs, you must provide access for log delivery to decrypt your logs: in your state machine’s customer-managed key policy, grant the kms:Decrypt permission to the principal delivery.logs.amazonaws.com. Logging a workflow does not work without this grant. Your encrypted data is sent to CloudWatch Logs with the same or a different customer-managed KMS key. See the CloudWatch Logs documentation to learn how to set permissions on the KMS key for your log group.

Cleanup

To delete the sample application, use the latest version of the AWS SAM CLI and run:

sam delete

Conclusion

Customer-managed AWS KMS keys in Step Functions allow for access control over sensitive data. The KMS key policy and IAM identity policies determine who can decrypt and access various aspects of the state machine, including the definition, execution details, and the input/output payloads passed between tasks. This is an essential feature for highly regulated industries like financial services. Apply these security guardrails using customer-managed AWS KMS keys at the organizational unit, business unit, or individual account level.

The sample application shows a way of using a customer-managed KMS key with the Step Functions resource in CloudFormation. Support for this feature is available in AWS CDK now, while Terraform support will fast follow. Dive deeper into additional details in the Step Functions user guide.

For more serverless learning resources, visit Serverless Land.

Introducing quorum queues on Amazon MQ for RabbitMQ

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/introducing-quorum-queues-on-amazon-mq-for-rabbitmq/

This post is written by Vignesh Selvam (Senior Product Manager – Amazon MQ), Simon Unge (Senior software development engineer – Amazon MQ).

Amazon MQ for RabbitMQ announced support for quorum queues, a type of replicated queue designed for higher availability and data safety. This post presents an overview of this queue type, describes when you should use it, and shares best practices you can follow. The post also describes how Amazon MQ has improved quorum queues in the open-source RabbitMQ community.

Overview of quorum queues

A quorum queue is a replicated first-in, first-out queue type offered by open-source RabbitMQ that uses the Raft consensus algorithm to maintain data consistency. Each quorum queue has a leader and multiple followers (replicas), which ensure that messages are replicated and persisted across a majority of nodes, providing resilience against node failures. Quorum queues only need a majority of member nodes (a quorum) to make decisions about data. If a RabbitMQ node hosting a leader becomes unavailable, another node hosting one of the followers is automatically elected as the leader. Once the original node becomes available again, it becomes a follower for the quorum queue and catches up with, or synchronizes to, the new leader. Quorum queues can detect network failures faster and recover more quickly than classic mirrored queues, improving the resiliency of the message broker as a whole.

Quorum queues share most of the fundamental features that are key to RabbitMQ replicated queue types, such as consumption, consumer acknowledgements, cancelling consumers, purging, and deletion. Poison message handling is a unique feature of quorum queues that helps developers manage unprocessed messages more efficiently. A poison message is a message that cannot be processed and ends up being repeatedly requeued. Quorum queues keep track of the number of unsuccessful delivery attempts and expose it in the x-delivery-count header included with any redelivered message. A delivery limit can be set using a policy argument for delivery-limit. If the limit is reached, the message can be dropped or put in a dead-letter queue. This feature further improves the data reliability of a quorum queue.

You can get started with quorum queues by explicitly specifying the x-queue-type argument as quorum on a RabbitMQ broker running version 3.13 or above, as in the sketch below. We recommend that you change the default vhost queue type to quorum to ensure that all queues inside a vhost are created as quorum queues by default.
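
For example, with the Python pika client, declaring a quorum queue looks like the following sketch. The endpoint and credentials are placeholders, and quorum queues must be declared durable.

import pika

# Placeholder AMQPS endpoint and credentials from the Amazon MQ console.
params = pika.URLParameters(
    "amqps://testuser:password@b-xxxx.mq.us-east-1.amazonaws.com:5671")
connection = pika.BlockingConnection(params)
channel = connection.channel()

# The x-queue-type argument selects the queue type at declaration time.
channel.queue_declare(
    queue="orders",
    durable=True,
    arguments={"x-queue-type": "quorum"},
)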

RabbitMQ queues console

When should you use quorum queues?

Use quorum queues when you need higher availability and consistency for your messaging infrastructure. Quorum queues are ideal for scenarios where data durability and fault tolerance are critical, such as financial transaction systems, e-commerce data processing systems, or any application requiring high reliability. They are particularly beneficial in environments where node failures are more likely or where maintaining data consistency across distributed systems is essential.

When should you NOT use quorum queues?

Quorum queues are not meant to be temporary. They do not support transient or exclusive queues, and they are not meant for scenarios with high queue churn (high declaration and deletion rates). They are also not recommended where an unreplicated queue is sufficient.

Best practices for quorum queues

Quorum queues perform better when the queues are short. You can set the maximum queue length using a policy or queue arguments (max-length, max-length-bytes) to limit the total memory usage by queues, as in the sketch below.
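
Continuing the pika sketch above, these caps (and the delivery limit discussed earlier) can be set as queue arguments; the values here are arbitrary examples, and the same settings can also be applied centrally with a policy.

# Arbitrary example limits - prefer a policy so they can be changed
# without redeclaring the queue.
channel.queue_declare(
    queue="orders",
    durable=True,
    arguments={
        "x-queue-type": "quorum",
        "x-max-length": 100_000,            # cap message count
        "x-max-length-bytes": 268_435_456,  # cap total size (256 MB)
        "x-delivery-limit": 10,             # poison message handling
    },
)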

Add a new queue dialog

Amazon MQ recommends that publishers use publisher confirms and that consumers use manual acknowledgements on quorum queues. Publisher confirms are only issued once a published message has been successfully replicated to a quorum of nodes and is considered safe within the context of the system. Publisher confirms can also serve as a form of back pressure and protect the availability of the broker during periods of high workload. Manual acknowledgements ensure that messages that are not processed can be returned to the queue for reprocessing.
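
A sketch of both practices with pika follows, continuing the connection from the earlier example; process is a placeholder for your own logic.

# Publisher confirms: with pika's BlockingConnection, confirm_delivery()
# makes basic_publish wait for the broker's confirmation and raise an
# exception if the message is nacked.
channel.confirm_delivery()
channel.basic_publish(exchange="", routing_key="orders", body=b"payload")

def process(body):
    ...  # placeholder for your processing logic

# Manual acknowledgements: ack only after successful processing, so
# unprocessed messages are requeued if the consumer fails.
def on_message(ch, method, properties, body):
    process(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="orders", on_message_callback=on_message,
                      auto_ack=False)
channel.start_consuming()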

Open-source improvements by Amazon MQ

Amazon MQ contributed multiple improvements to the open-source RabbitMQ community to improve quorum queues for operators and users.

Automatic membership reconciliation
Quorum queues depend on a majority of replicas being available for the Raft consensus algorithm. Amazon MQ identified that many users and operators would prefer to maintain a certain minimal number of replicas (generally 3 or 5) at all times to ensure a majority always exists. Quorum queue replica management was also initially available only via CLI tools. Amazon MQ engineers introduced automatic membership reconciliation to improve this experience. Now, RabbitMQ can be configured to identify any queues that are below a target group member size and automatically grow the queue membership onto another node, ensuring that a certain minimum number of replicas always exists.

Voter status
RabbitMQ considers a quorum queue member node to be a full member even if the member has not caught up or fully synced to the quorum. The CLI command rabbitmq-queues check_if_node_is_quorum_critical could therefore provide a false positive and indicate a node is safe to remove, even though another node has queue members that are still synchronizing with the quorum. Amazon MQ introduced a new ‘non-voter’ state for a queue member node to indicate a member that is still catching up or synchronizing to the quorum. If a queue has a member in this state, that member is not considered a full member. Once the member is fully synchronized, it is automatically promoted to voter status and is considered a full member. The command rabbitmq-queues check_if_node_is_quorum_critical now takes this into account and correctly reports whether a node can be safely terminated without any queues becoming unavailable due to a loss of majority.

Inconsistent state management
When a broker is overloaded, a quorum queue can end up in an inconsistent state, where the quorum queue membership state stored in the Raft state machine differs from the RabbitMQ internal state for the queue. Amazon MQ introduced a periodic check per quorum queue that identifies if a queue has an inconsistent state and takes action to fix it.

Default queue type
The default queue type for a RabbitMQ broker vhost was classic queues; you could declare a different queue type by explicitly passing x-queue-type as a queue creation argument. Amazon MQ introduced a global default queue type in the configuration file (rabbitmq.conf) that provides the ability to define a default queue type at the broker level. Now, an operator can change the default queue type to quorum queues, so that queues are created as quorum queues when no type is specified during creation.

Membership management permissions
RabbitMQ users are able to configure quorum queue membership using the management API, which can interfere with automatic membership reconciliation. Amazon MQ introduced the ability for an operator to turn off the membership management permissions available through the management API, preventing customers from accidentally affecting their broker.

Conclusion

Quorum queues on RabbitMQ provide a robust solution for scenarios requiring high availability and resilience. By leveraging the Raft consensus protocol, quorum queues ensure that messages are safely stored and replicated across a quorum of nodes, making them an excellent choice for modern, distributed message queuing systems.

Amazon MQ recommends that you adopt quorum queues as the preferred replicated queue type on RabbitMQ 3.13 brokers. For more details, see the Amazon MQ documentation. To learn more about the open-source feature, see quorum queues.

Get started with quorum queues on Amazon MQ for RabbitMQ 3.13 with a few clicks.

Implementing multi-Region failover for Amazon API Gateway

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/implementing-multi-region-failover-for-amazon-api-gateway/

This post is written by Marcos Ortiz, Principal AWS Solutions Architect and Khubyar Behramsha, Sr. AWS Solutions Architect.

In this post, you learn how organizations can evolve from a single-Region API Gateway architecture to a multi-Region one, using a reliable failover mechanism without dependencies on AWS control plane operations. An AWS Well-Architected best practice is to rely on the data plane, not the control plane, during recovery. Failover controls should work with no dependencies on the primary Region. This pattern shows how to independently fail over discrete services deployed behind a shared public API. Additionally, there is a walkthrough of how to deploy and test the proposed architecture, using our open-source code available on GitHub.

For many organizations, running services behind a Regional Amazon API Gateway endpoint aligned with AWS Well-Architected best practices offers the right balance of resilience, simplicity, and affordability. However, depending on business criticality, regulatory requirements, or disaster recovery objectives, some organizations must deploy their APIs using a multi-Region architecture.

When dealing with business-critical applications, organizations often want full control over how and when to trigger a failover. A manually triggered failover allows dependencies to be failed over in a specific order, and failover actions follow the required chain of approvals, which helps prevent failing over to an unprepared replica or flapping caused by intermittent disruptions. While the failover action or trigger has a human-in-the-loop component, the recommendation is to automate all subsequent actions as much as possible. This approach gives application owners control over the failover process, including the ability to trigger the failover in cases of intermittent issues.

Overview

One common approach for customers is to deploy a public Regional API with a custom domain name, providing more intuitive URLs for their users. The backend uses API mappings to connect multiple API stages to a custom domain. This approach allows service owners to deploy their services independently while sharing the same top-level API domain name. Here is a typical architecture that follows this pattern:

Regional endpoint with mapping

However, when trying to evolve this to a multi-Region architecture, organizations often struggle to fail over each service independently. If the preceding architecture is deployed in two Regions as-is, it becomes an all-or-nothing scenario, where organizations must fail over either all the services behind API Gateway or none.

Evolving to a multi-Region architecture

To enable each team to manage and fail over their services independently, you can implement this new approach for a multi-Region architecture. Each service has its own subdomain, and API Gateway HTTP integrations route requests to a given service. This gives the service APIs the flexibility to be failed over independently, or all at once with the shared public API.

Multi-Region architecture

This is the request flow:

  1. Users access a specific service through the public shared API domain name using a URL suffix. For instance, to access service1, the end user would send a request to http://example.com/service1.
  2. Amazon Route 53 has the top-level domain, example.com, registered with a primary and a secondary failover record. It routes the request to the API Gateway external API endpoint in the primary Region (us-east-1).
  3. API Gateway uses an HTTP integration to forward the request to service1 at https://service1.example.com.
  4. Amazon Route 53 has the domain service1.example.com registered with a primary and a secondary failover record. It routes the request to the API Gateway service1 API Regional endpoint in the primary Region (us-east-1) when healthy, and to the service1 API Regional endpoint in the secondary Region (us-west-2) when unhealthy.
  5. Represents the primary route for service1 configured in Amazon Route 53.
  6. Represents the secondary route for service1 configured in Amazon Route 53.

This solution requires deploying each service API in both the primary (us-east-1) and secondary (us-west-2) Regions. Both Regions use the same custom domain configuration. For the primary Region, primary DNS records for each service point to the Regional API Gateway distribution endpoint. In the secondary Region, secondary DNS records for each service point to the Regional API Gateway distribution endpoint in the secondary Region.
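
As an illustration, one of these failover records could be created with a boto3 sketch like the following; the hosted zone IDs, health check ID, and API Gateway domain name are all placeholders.

import boto3

route53 = boto3.client("route53")

# Primary failover alias record for service1, pointing at the Regional
# API Gateway domain name in us-east-1.
route53.change_resource_record_sets(
    HostedZoneId="Z111111111111",  # placeholder public hosted zone
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "service1.example.com",
                "Type": "A",
                "SetIdentifier": "service1-primary",
                "Failover": "PRIMARY",
                "HealthCheckId": "<route53-arc-health-check-id>",  # placeholder
                "AliasTarget": {
                    "HostedZoneId": "<regional-apigw-zone-id>",  # placeholder
                    "DNSName": "d-abcdef1234.execute-api.us-east-1.amazonaws.com",
                    "EvaluateTargetHealth": True,
                },
            },
        }]
    },
)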

Route 53 records

Active-passive manual failover

The example provided here enables a reliable failover mechanism that does not rely on the Amazon Route 53 control plane. It uses Amazon Route 53 Application Recovery Controller (Route 53 ARC), which provides a cluster with five Regional endpoints across five different AWS Regions. The failover process uses these endpoints instead of manually editing Amazon Route 53 DNS records, which is a control plane operation. The routing controls in Route 53 ARC fail over traffic from the primary Region to the secondary one.

Route 53 ARC routing controls

Route 53 ARC routing controls

Routing controls are on-off switches that enable you to redirect client traffic from one instance of your workload to another. Traffic re-routing is the result of setting associated DNS health checks as healthy or unhealthy.

Route 53 ARC toggles

Deploying the sample application

Pre-requisites

  1. A public domain (example.com) registered with Amazon Route 53. Follow the instructions here on how to register a domain and the instructions here to configure Amazon Route 53 as your DNS service.
  2. An AWS Certificate Manager certificate (*.example.com) for your domain name on both the primary and secondary Regions you plan to deploy the sample APIs.

Deploy the Amazon Route 53 ARC stack

Deploy the Amazon Route 53 ARC stack first, which creates a cluster and the routing controls that enable you to fail over the APIs.

Follow the detailed instructions here to deploy the Amazon Route 53 Application Recovery Controller (ARC) stack.

Deploy the Service1 API both in the primary and secondary Regions

This deploys an API Gateway Regional endpoint in each Region, which calls an AWS Lambda function to return the service name and the current AWS Region serving the request:

{"service": "service1", "region": "us-east-1"}

This is the code for the Lambda function:

import json
import os

def lambda_handler(event, context):
    return {
        "statusCode": 200,
        "body": json.dumps({
            "service": "service1",
            "region": os.environ['AWS_REGION']
        }),
    }

Follow the detailed instructions here to deploy the service1 stack.

Deploy the Service2 API both in the primary and secondary Regions

This stack is similar to service1, but has a different domain name and returns service2 as the service name:

{"service": "service2", "region": "us-east-1"}

Follow the detailed instructions here to deploy the service2 stack.

Deploy the shared public API both in the primary and secondary Regions

This step configures HTTP endpoints so that when you call example.com/service1 or example.com/service2, it routes the request to the respective public DNS records you have set up for service1 and service2.

Follow the detailed instructions here to deploy the external API stack.

Failover tests

To test the deployed example, modify and then run the provided test script:

  1. Update lines 3–5 in the test.sh file to reference the domain name you configured for your APIs.
  2. Provide execute permissions and run the script:
chmod +x ./test.sh
./test.sh

This script sends an HTTP request to each of your three endpoints every 5 seconds. You can then use Amazon Route 53 ARC to fail over your services independently and see the responses served from different Regions.

Initially, all services are routing traffic to the us-east-1 Region:

Initial routing

With the following command, you update two routing controls for service1, setting the primary Region (us-east-1) health check state to off, and the secondary Region (us-west-2) health check state to on:

aws route53-recovery-cluster update-routing-control-states \
 --update-routing-control-state-entries \
 '[{"RoutingControlArn":"arn:aws:route53-recovery-control::111122223333:controlpanel/0123456bbbbbbb0123456bbbbbb0123456/routingcontrol/abcdefg1234567","RoutingControlState":"On"},
{"RoutingControlArn":"arn:aws:route53-recovery-control:: 111122223333:controlpanel/0123456bbbbbbb0123456bbbbbb0123456/routingcontrol/hijklmnop987654321","RoutingControlState":"Off"}]' \
 --region ap-southeast-2 \
 --endpoint-url https://abcd1234.route53-recovery-cluster.ap-southeast-2.amazonaws.com/v1

After a few seconds, the script terminal shows that service1 is now routing traffic to us-west-2, while the other services are still routing traffic to the us-east-1 Region.

Flipping service1 to backup Region

To fail back service1 to the us-east-1 Region, run this command, now setting the service1 primary Region (us-east-1) health check state to on, and the secondary Region (us-west-2) health check state to off:

aws route53-recovery-cluster update-routing-control-states \
 --update-routing-control-state-entries \
 '[{"RoutingControlArn":"arn:aws:route53-recovery-control::111122223333:controlpanel/0123456bbbbbbb0123456bbbbbb0123456/routingcontrol/abcdefg1234567","RoutingControlState":"Off"},
{"RoutingControlArn":"arn:aws:route53-recovery-control:: 111122223333:controlpanel/0123456bbbbbbb0123456bbbbbb0123456/routingcontrol/hijklmnop987654321","RoutingControlState":"On"}]' \
 --region ap-southeast-2 \
 --endpoint-url https://abcd1234.route53-recovery-cluster.ap-southeast-2.amazonaws.com/v1

After a few seconds, the script terminal shows that service1 is now routing traffic to the us-east-1 Region again, like the other services.

Routing recovery

Cleaning up

After you are finished, follow the cleanup instructions on GitHub.

Conclusion

This solution helps put the control back in the hands of the teams managing critical workloads using API Gateway. By decoupling the frontend and backend, this solution gives organizations granular control over failover at the service level using Amazon Route 53 ARC to remove dependencies on control plane actions.

The pattern outlined also reduces the impact to consumers of the service as it allows you to use the same public API and top-level domain when moving from a single-Region to a multi-Region architecture.

For more resilience learning, visit AWS Architecture Blog – Resilience.

For more serverless learning, visit Serverless Land.

Applying Spot-to-Spot consolidation best practices with Karpenter

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/applying-spot-to-spot-consolidation-best-practices-with-karpenter/

This post is written by Robert Northard – AWS Container Specialist Solutions Architect, and Carlos Manzanedo Rueda – AWS WW SA Leader for Efficient Compute

Karpenter is an open source node lifecycle management project built for Kubernetes. In this post, you will learn how to use the new Spot-to-Spot consolidation functionality released in Karpenter v0.34.0, which helps further optimize your cluster. Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances are spare Amazon EC2 capacity available for up to 90% off compared to On-Demand prices. One difference between On-Demand and Spot is that Spot Instances can be interrupted by Amazon EC2 when the capacity is needed back. Karpenter’s built-in support for Spot Instances allows users to seamlessly implement Spot best practices and helps users optimize the cost of stateless, fault tolerant workloads. For example, when Karpenter observes a Spot interruption, it automatically starts a new node in response.

Karpenter provisions nodes in response to unschedulable pods based on aggregated CPU, memory, volume requests, and other scheduling constraints. Over time, Karpenter has added functionality to simplify instance lifecycle configuration, providing a termination controller, instance expiration, and drift detection. Karpenter also helps optimize Kubernetes clusters by selecting the optimal instances while still respecting Kubernetes pod-to-node placement nuances, such as nodeSelector, affinity and anti-affinity, taints and tolerations, and topology spread constraints.

The Kubernetes scheduler assigns pods to nodes based on their scheduling constraints. Over time, as workloads are scaled out and scaled in or as new instances join and leave, the cluster placement and instance load might end up not being optimal. In many cases, it results in unnecessary extra costs. Karpenter has a consolidation feature that improves cluster placement by identifying and taking action in situations such as:

  1. when a node is empty
  2. when a node can be removed as the pods that are running on it can be rescheduled into other existing nodes
  3. when the number of pods in a node has gone down and the node can now be replaced with a lower-priced and rightsized variant (which is shown in the following figure)
Karpenter consolidation, replacing one 2xlarge Amazon EC2 Instance with an xlarge Amazon EC2 Instance.

Karpenter versions prior to v0.34.0 only supported consolidation for Amazon EC2 On-Demand Instances. On-Demand consolidation allowed consolidating from On-Demand into Spot Instances and to lower-priced On-Demand Instances. However, once a pod was placed on a Spot Instance, Spot nodes were only removed when the nodes were empty. In v0.34.0, you can enable the feature gate to use Spot-to-Spot consolidation.

Solution overview

When launching Spot Instances, Karpenter uses the price-capacity-optimized allocation strategy when calling the Amazon EC2 Fleet API in instant mode (shown in the following figure) and passes in a selection of compute instance types based on the Karpenter NodePool configuration. The Amazon EC2 Fleet API in instant mode is a synchronous API call that immediately returns a list of instances that launched and any instances that could not be launched. For any instances that could not be launched, Karpenter might request alternative capacity or remove any soft Kubernetes scheduling constraints for the workload.

Karpenter instance orchestration
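To make this concrete, the following is an illustrative boto3 sketch of the kind of instant-mode Fleet request Karpenter issues on your behalf. The launch template name and the instance type list are hypothetical, and Karpenter makes this call internally rather than through user code:

import boto3

ec2 = boto3.client("ec2")

response = ec2.create_fleet(
    Type="instant",  # synchronous: returns launched instances immediately
    SpotOptions={"AllocationStrategy": "price-capacity-optimized"},
    TargetCapacitySpecification={
        "TotalTargetCapacity": 1,
        "DefaultTargetCapacityType": "spot",
    },
    LaunchTemplateConfigs=[{
        "LaunchTemplateSpecification": {
            "LaunchTemplateName": "my-node-template",  # hypothetical
            "Version": "$Latest",
        },
        # A diversified list of instance type overrides, as Karpenter passes
        "Overrides": [
            {"InstanceType": t}
            for t in ["c6g.large", "m6g.large", "r6g.large"]
        ],
    }],
)

print(response["Instances"])  # capacity that launched
print(response["Errors"])     # capacity that could not be launched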

Spot-to-Spot consolidation needed an approach that was different from On-Demand consolidation. For On-Demand consolidation, rightsizing and lowest price are the main metrics used. For Spot-to-Spot consolidation to take place, Karpenter requires a diversified instance configuration (see the example NodePool defined in the walkthrough) with at least 15 instance types. Without this constraint, there would be a risk of Karpenter selecting an instance type that has lower availability and, therefore, a higher frequency of interruption.

Prerequisites

The following prerequisites are required to complete the walkthrough:

  • Install an Amazon Elastic Kubernetes Service (Amazon EKS) cluster (version 1.29 or higher) with Karpenter (v0.34.0 or higher). The Karpenter Getting Started Guide provides steps for setting up an Amazon EKS cluster and adding Karpenter.
  • Enable replacement with Spot consolidation through the SpotToSpotConsolidation feature gate. This can be enabled during a helm install of the Karpenter chart by adding the --set settings.featureGates.spotToSpotConsolidation=true argument (see the example helm command after this list).
  • Install kubectl, the Kubernetes command line tool for communicating with the Kubernetes control plane API, and kubectl context configured with Cluster Operator and Cluster Developer permissions.
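For reference, a hedged sketch of such a helm command follows; the chart version and cluster name are placeholders, and the remaining values should match your installation from the Karpenter Getting Started Guide:

helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter --create-namespace \
  --version "v0.34.0" \
  --set settings.clusterName=<cluster-name> \
  --set settings.featureGates.spotToSpotConsolidation=true \
  --wait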

Walkthrough

The following walkthrough guides you through the steps for simulating Spot-to-Spot consolidation.

1. Create a Karpenter NodePool and EC2NodeClass

Create a Karpenter NodePool and EC2NodeClass. Replace the following placeholders with your own values. If you used the Karpenter Getting Started Guide to create your installation, then the discovery tag value is your cluster name.

  • Replace <karpenter-discovery-tag-value> with your subnet tag for Karpenter subnet and security group auto-discovery.
  • Replace <role-name> with the name of the AWS Identity and Access Management (IAM) role for node identity.
cat <<EOF > nodepool.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    metadata:
      labels:
        intent: apps
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c","m","r"]
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values: ["nano","micro","small","medium"]
        - key: karpenter.k8s.aws/instance-hypervisor
          operator: In
          values: ["nitro"]
  limits:
    cpu: 100
    memory: 100Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: Bottlerocket
  subnetSelectorTerms:          
    - tags:
        karpenter.sh/discovery: "<karpenter-discovery-tag-value>"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "<karpenter-discovery-tag-value>"
  role: "<role-name>"
  tags:
    Name: karpenter.sh/nodepool/default
    IntentLabel: "apps"
EOF

kubectl apply -f nodepool.yaml

The NodePool definition demonstrates a flexible configuration with instances from the C, M, or R EC2 instance families. The configuration excludes the smallest instance sizes but is otherwise diversified as much as possible. Excluding the smallest sizes might be needed, for example, in scenarios where you deploy observability DaemonSets that consume a baseline of each node's capacity. If your workload has specific requirements, then see the supported well-known labels in the Karpenter documentation.

2. Deploy a sample workload

Deploy a sample workload by running the following command. This command creates a Deployment with five pod replicas using the pause container image:

cat <<EOF > inflate.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 5
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      nodeSelector:
        intent: apps
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          resources:
            requests:
              cpu: 1
              memory: 1.5Gi
EOF
kubectl apply -f inflate.yaml

Next, check the Kubernetes nodes by running a kubectl get nodes CLI command. The capacity pool (instance type and Availability Zone) selected depends on any Kubernetes scheduling constraints and spare capacity size. Therefore, it might differ from this example in the walkthrough. You can see Karpenter launched a new node of instance type c6g.2xlarge, an AWS Graviton2-based instance, in the eu-west-1c Availability Zone:

$ kubectl get nodes -L karpenter.sh/nodepool -L node.kubernetes.io/instance-type -L topology.kubernetes.io/zone -L karpenter.sh/capacity-type

NAME                                     STATUS   ROLES    AGE   VERSION               NODEPOOL   INSTANCE-TYPE   ZONE         CAPACITY-TYPE
ip-10-0-12-17.eu-west-1.compute.internal Ready    <none>   80s   v1.29.0-eks-a5ec690   default    c6g.2xlarge     eu-west-1c   spot

3. Scale in a sample workload to observe consolidation

To invoke a Karpenter consolidation event, scale the inflate deployment down to one replica. Run the following command:

kubectl scale --replicas=1 deployment/inflate 

Tail the Karpenter logs by running the following command. If you installed Karpenter in a different Kubernetes namespace, then replace the name for the -n argument in the command:

kubectl -n karpenter logs -l app.kubernetes.io/name=karpenter --all-containers=true -f --tail=20

After a few seconds, you should see the following disruption via consolidation message in the Karpenter logs. The message indicates the c6g.2xlarge Spot node has been targeted for replacement and Karpenter has passed the following 15 instance types—m6gd.xlarge, m5dn.large, c7a.xlarge, r6g.large, r6a.xlarge and 10 other(s)—to the Amazon EC2 Fleet API:

{"level":"INFO","time":"2024-02-19T12:09:50.299Z","logger":"controller.disruption","message":"disrupting via consolidation replace, terminating 1 candidates ip-10-0-12-181.eu-west-1.compute.internal/c6g.2xlarge/spot and replacing with spot node from types m6gd.xlarge, m5dn.large, c7a.xlarge, r6g.large, r6a.xlarge and 10 other(s)","commit":"17d6c05","command-id":"60f27cb5-98fa-40fb-8231-05b31fd41892"}

Check the Kubernetes nodes by running the following kubectl get nodes CLI command. You can see that Karpenter launched a new node of instance type c6g.large:

$ kubectl get nodes -L karpenter.sh/nodepool -L node.kubernetes.io/instance-type -L topology.kubernetes.io/zone -L karpenter.sh/capacity-type

NAME                                        STATUS   ROLES    AGE    VERSION               NODEPOOL   INSTANCE-TYPE   ZONE         CAPACITY-TYPE
ip-10-0-12-156.eu-west-1.compute.internal   Ready    <none>   2m1s   v1.29.0-eks-a5ec690   default    c6g.large       eu-west-1c   spot

Use kubectl get nodeclaims to list all objects of type NodeClaim and then describe the NodeClaim Kubernetes resource using kubectl get nodeclaim/<claim-name> -o yaml. In the NodeClaim .spec.requirements, you can also see the 15 instance types passed to the Amazon EC2 Fleet API:

apiVersion: karpenter.sh/v1beta1
kind: NodeClaim
...
spec:
  nodeClassRef:
    name: default
  requirements:
  ...
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - c5.large
    - c5ad.large
    - c6g.large
    - c6gn.large
    - c6i.large
    - c6id.large
    - c7a.large
    - c7g.large
    - c7gd.large
    - m6a.large
    - m6g.large
    - m6gd.large
    - m7g.large
    - m7i-flex.large
    - r6g.large
...

What would happen if a Spot node could not be consolidated?

If a Spot node cannot be consolidated because there are fewer than 15 cheaper instance types in the compute selection, then the following message appears in the events for the NodeClaim object. You might get this event if you have overly constrained your instance type selection:

Normal  Unconsolidatable   31s   karpenter  SpotToSpotConsolidation requires 15 cheaper instance type options than the current candidate to consolidate, got 1

Spot best practices with Karpenter

The following are some best practices to consider when using Spot Instances with Karpenter.

  • Avoid overly constraining instance type selection: Karpenter selects Spot Instances using the price-capacity-optimized allocation strategy, which balances the price and availability of AWS spare capacity. Although a minimum of 15 instance types is needed for Spot-to-Spot consolidation, you should avoid constraining instance types as much as possible. The less you constrain instance types, the higher the chance of acquiring Spot capacity at large scale, at lower cost, and with a lower frequency of Spot Instance interruptions.
  • Gracefully handle Spot interruptions and consolidation actions: Karpenter natively handles Spot interruption notifications by consuming events from an Amazon Simple Queue Service (Amazon SQS) queue, which is populated with Spot interruption notifications through Amazon EventBridge. As soon as Karpenter receives a Spot interruption notification, it gracefully drains the interrupted node of any running pods while also provisioning a new node on which those pods can be scheduled. With Spot Instances, this process needs to complete within 2 minutes. For a pod with a termination period longer than 2 minutes, the old node is interrupted prior to those pods being rescheduled. To test how your workload behaves when a node is replaced, you can use AWS Fault Injection Service (FIS) to simulate Spot interruptions.
  • Carefully configure resource requests and limits for workloads: Rightsizing and optimizing your cluster is a shared responsibility. Karpenter effectively optimizes and scales infrastructure, but the end result depends on how well you have rightsized your pod requests and any other Kubernetes scheduling constraints. Karpenter does not consider limits or resource utilization. For most workloads with non-compressible resources, such as memory, it is generally recommended to set requests==limits because if a workload tries to burst beyond the available memory of the host, an out-of-memory (OOM) error occurs. Karpenter consolidation can increase the probability of this as it proactively tries to reduce total allocatable resources for a Kubernetes cluster. For help with rightsizing your Kubernetes pods, consider exploring Kubecost, Vertical Pod Autoscaler configured in recommendation mode, or an open source tool such as Goldilocks.
  • Configure metrics for Karpenter: Karpenter emits metrics in the Prometheus format, so consider using Amazon Managed Service for Prometheus to track interruptions caused by Karpenter Drift, consolidation, Spot interruptions, or other Amazon EC2 maintenance events. These metrics can be used to confirm that interruptions are not having a significant impact on your service’s availability and monitor NodePool usage and pod lifecycles. The Karpenter Getting Started Guide contains an example Grafana dashboard configuration.

You can learn more about other application best practices in the Reliability section of the Amazon EKS Best Practices Guide.

Cleanup

To avoid incurring future charges, delete any resources you created as part of this walkthrough. If you followed the Karpenter Getting Started Guide to set up a cluster and add Karpenter, follow the clean-up instructions in the Karpenter documentation to delete the cluster. Alternatively, if you already had a cluster with Karpenter, delete the resources created as part of this walkthrough:

kubectl delete -f inflate.yaml
kubectl delete -f nodepool.yaml

Conclusion

In this post, you learned how Karpenter can actively replace a Spot node with another, more cost-efficient Spot node. Karpenter can consolidate Spot nodes while keeping the right balance between lower price and low-frequency interruptions when there are at least 15 selectable instance types.

To get started, check out the Karpenter documentation as well as Karpenter Blueprints, which is a repository including common workload scenarios following the best practices.

You can share your feedback on this feature by raising a GitHub issue.

The attendee’s guide to the AWS re:Invent 2023 Compute track

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/the-attendees-guide-to-the-aws-reinvent-2023-compute-track/

This post by Art Baudo – Principal Product Marketing Manager – AWS EC2, and Pranaya Anshu – Product Marketing Manager – AWS EC2

We are just a few weeks away from AWS re:Invent 2023, AWS’s biggest cloud computing event of the year. This event will be a great opportunity for you to meet other cloud enthusiasts, find productive solutions that can transform your company, and learn new skills through 2000+ learning sessions.

Even if you are not able to join in person, you can catch up with many of the sessions on demand and even watch the keynote and innovation sessions live.

If you’re able to join us, here is a reminder that we offer several types of sessions to help maximize your learning across a variety of AWS topics. Breakout sessions are lecture-style, 60-minute informative sessions presented by AWS experts, customers, or partners. These sessions are recorded and uploaded to the AWS Events YouTube channel a few days after the event.

re:Invent attendees can also choose to attend chalk-talks, builder sessions, workshops, or code talk sessions. Each of these is a live, non-recorded, interactive session.

  • Chalk-talk sessions: Attendees will interact with presenters, asking questions and using a whiteboard in session.
  • Builder Sessions: Attendees participate in a one-hour session and build something.
  • Workshop sessions: Attendees join a two-hour interactive session where they work in a small team to solve a real problem using AWS services.
  • Code talk sessions: Attendees participate in engaging code-focused sessions where an expert leads a live coding session.

To start planning your re:Invent week, check out some of the Compute track sessions below. If you find a session you’re interested in, be sure to reserve your seat for it through the AWS attendee portal.

Explore the latest compute innovations

This year, AWS compute services have launched numerous innovations: from over 100 new Amazon EC2 instances, to the general availability of Amazon EC2 Trn1n instances powered by AWS Trainium and Amazon EC2 Inf2 instances powered by AWS Inferentia2, to a new way to reserve GPU capacity with Amazon EC2 Capacity Blocks for ML. There are a lot of exciting launches to take in.

Explore some of these latest and greatest innovations in the following sessions:

  • CMP102 | What’s new with Amazon EC2
    Provides an overview of the latest Amazon EC2 innovations. Hear about recent Amazon EC2 launches, learn about the differences between Amazon EC2 instance families, and find out how you can use a mix of instances to deliver on your cost, performance, and sustainability goals.
  • CMP217 | Select and launch the right instance for your workload and budget
    Learn how to select the right instance for your workload and budget. This session will focus on innovations including Amazon EC2 Flex instances and the new generation of Intel, AMD, and AWS Graviton instances.
  • CMP219-INT | Compute innovation for any application, anywhere
    Provides you with an understanding of the breadth and depth of AWS compute offerings and innovation. Discover how you can run any application, including enterprise applications, HPC, generative artificial intelligence (AI), containers, databases, and games, on AWS.

Customer experiences and applications with machine learning

Machine learning (ML) has been evolving for decades and has reached an inflection point with generative AI applications capturing widespread attention and imagination. More customers, across a diverse set of industries, choose AWS over any other major cloud provider to build, train, and deploy their ML applications. Learn about the generative AI infrastructure at Amazon or get hands-on experience building ML applications through our ML-focused sessions, such as the following:

Discover what powers AWS compute

AWS has invested years designing custom silicon optimized for the cloud to deliver the best price performance for a wide range of applications and workloads using AWS services. Learn more about the AWS Nitro System, processors at AWS, and ML chips.

Optimize your compute costs

At AWS, we focus on delivering the best possible cost structure for our customers. Frugality is one of our founding leadership principles. Cost effective design continues to shape everything we do, from how we develop products to how we run our operations. Come learn of new ways to optimize your compute costs through AWS services, tools, and optimization strategies in the following sessions:

Check out workload-specific sessions

Amazon EC2 offers the broadest and deepest compute platform to help you best match the needs of your workload. More SAP, high performance computing (HPC), ML, and Windows workloads run on AWS than any other cloud. Join sessions focused around your specific workload to learn about how you can leverage AWS solutions to accelerate your innovations.

Hear from AWS customers

AWS serves millions of customers of all sizes across thousands of use cases, every industry, and around the world. Hear customers dive into how AWS compute solutions have helped them transform their businesses.

Ready to unlock new possibilities?

The AWS Compute team looks forward to seeing you in Las Vegas. Come meet us at the Compute Booth in the Expo. And if you’re looking for more session recommendations, check out additional re:Invent attendee guides curated by experts.

It’s About Time: Microsecond-Accurate Clocks on Amazon EC2 Instances

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/its-about-time-microsecond-accurate-clocks-on-amazon-ec2-instances/

This post is written by Josh Levinson, AWS Principal Product Manager and Julien Ridoux, AWS Principal Software Engineer

Today, we announced that we improved the Amazon Time Sync Service to microsecond-level clock accuracy on supported Amazon EC2 instances. This new capability adds a local reference clock to your EC2 instance and is designed to deliver clock accuracy in the low double-digit microsecond range within your instance’s guest OS software. This post shows you how to connect to the improved clocks on your EC2 instances. This post also demonstrates how you can measure your clock accuracy and easily generate and compare timestamps from your EC2 instances with ClockBound, an open source daemon and library.

In general, it’s hard to achieve high-fidelity clock synchronization due to hardware limitations and network variability. While customers have depended on the Amazon Time Sync Service to provide one millisecond clock accuracy, workloads that need microsecond-range accuracy, such as financial trading and broadcasting, required customers to maintain their own time infrastructure, which is a significant operational burden and expense. Other clock-sensitive applications that run on the cloud, including distributed databases and storage, have to incorporate message exchange delays with wait periods, data locks, or transaction journaling to maintain consistency at scale.

With global and reliable microsecond-range clock accuracy, you can now migrate and modernize your most time-sensitive applications in the cloud and retire your burdensome on-premises time infrastructure. You can also simplify your applications and increase their throughput by leveraging the high-accuracy timestamps to determine the ordering of events and transactions on workloads across instances, Availability Zones, and Regions. Additionally, you can audit the improved Amazon Time Sync Service to measure and monitor the expected microsecond-range accuracy.

New improvements to Amazon Time Sync Service

The new local clock source can be accessed over the existing Amazon Time Sync Service’s Network Time Protocol (NTP) IPv4 and IPv6 endpoints, or by configuring a new Precision Time Protocol (PTP) reference clock device, to get the best accuracy possible. It’s important to note that both NTP and the new PTP Hardware Clock (PHC) device share the same highly accurate source of time. The new PHC device is part of the AWS Nitro System, so it is directly accessible on supported bare metal and virtualized Amazon EC2 instances without using any customer resources.

A quick note about Leap Seconds

Leap seconds, introduced in 1972, are occasional one-second adjustments to UTC that account for irregularities in Earth’s rotation and accommodate the difference between International Atomic Time (TAI) and solar time (UT1). To manage leap seconds on behalf of customers, we designed leap second smearing within the Amazon Time Sync Service (details on smearing time in “Look Before You Leap”).

Leap seconds are going away, and we are in full support of the decision made at the 27th General Conference on Weights and Measures to abandon leap seconds by or before 2035.

To support this transition, we still plan on smearing time when accessing the Amazon Time Sync Service over the local NTP connection or our Public NTP pools (time.aws.com). The new PHC device, however, will not provide a smeared time option. In the event of a leap second, the PHC would add the leap second following UTC standards. Leap-smeared and leap second time sources are the same in most cases, but since they differ during a leap second event, we do not recommend mixing smeared and non-smeared time sources in your time client configuration during such an event.

Connect using NTP (automatic for most customers)

You can connect to the new, microsecond-accurate clocks over NTP the same way you use the Amazon Time Sync Service today at the 169.254.169.123 IPv4 address or the fd00:ec2::123 IPv6 address. This is already the default configuration on all Amazon AMIs and many partner AMIs, including RHEL, Ubuntu, and SUSE. You can verify this connection in your NTP daemon. The below example, using the chrony daemon, verifies that chrony is using the 169.254.169.123 IPv4 address of the Amazon Time Sync Service to synchronize the time:

[ec2-user@ ~]$ chronyc sources
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^- pacific.latt.net              3  10   377    69  +5630us[+5632us] +/-   90ms
^- edge-lax.txryan.com           2   8   377   224   -691us[ -689us] +/-   33ms
^* 169.254.169.123               1   4   377     2  -4487ns[-5914ns] +/-   85us
^- blotch.image1tech.net         2   9   377   327  -1710us[-1720us] +/-   64ms
^- 44.190.40.123                 2   9   377   161  +3057us[+3060us] +/-   84ms

The 169.254.169.123 IPv4 address of the Amazon Time Sync Service is designated with a *, showing it is the source of synchronization on this instance. See the EC2 User Guide for more details on configuring the Amazon Time Sync Service if it is not already configured by default.

Connect using the PTP Hardware Clock

First, you need to install the latest Elastic Network Adapter (ENA) driver, which allows you to connect directly to the PHC. Connect to your instance and install version 2.10.0 or later of the Linux kernel driver for the ENA family. For the installation instructions, see Linux kernel driver for Elastic Network Adapter (ENA) family on GitHub. To enable PTP support in the driver, follow the instructions in the section "PTP Hardware Clock (PHC)".

Once the driver is installed, you need to configure your NTP daemon to connect to the PHC. Below is an example of how to do this in chrony by adding the PHC to your chrony configuration file, then restarting chrony for the change to take effect:

[ec2-user ~]$ sudo sh -c 'echo "refclock PHC /dev/ptp0 poll 0 delay 0.000010 prefer" >> /etc/chrony.conf'
[ec2-user ~]$ sudo systemctl restart chronyd

This example configures a 10 microsecond delay (a +/-5 microsecond range) for receiving the reference signal from the PHC, which accounts for operating system latency.

After changing your configuration, you can validate your daemon is correctly syncing to the PHC. Below is an example of output from the chronyc command. An asterisk will appear next to the PHC0 source indicating that you are now syncing to the PHC:

[ec2-user@ ~]$ chronyc sources
MS Name/IP address         Stratum Poll Reach LastRx Last sample
=============================================================================
#* PHC0                           0   0   377     1   +18ns[  +20ns] +/- 5032ns

The PHC0 device of the Amazon Time Sync Service is designated with a *, showing it is the source of synchronization on this instance.

Your chrony tracking information will also show that you are syncing to the PHC:

[ec2-user@ ~]$ chronyc tracking
Reference ID    : 50484330 (PHC0)
Stratum         : 1
Ref time (UTC)  : Mon Nov 13 18:43:09 2023
System time     : 0.000000004 seconds fast of NTP time
Last offset     : -0.000000010 seconds
RMS offset      : 0.000000012 seconds
Frequency       : 7.094 ppm fast
Residual freq   : -0.000 ppm
Skew            : 0.004 ppm
Root delay      : 0.000010000 seconds
Root dispersion : 0.000001912 seconds
Update interval : 1.0 seconds
Leap status     : Normal

See the EC2 User Guide for more details on configuring the PHC.

Measuring your clock accuracy

Clock accuracy is a measure of clock error, typically defined as the offset to UTC. This clock error is the difference between the observed time on the computer and the reference time (also known as true time). If your instance is configured to use the Amazon Time Sync Service where the microsecond-accurate enhancement is available, you will typically see a clock error bound of under 100us using the NTP connection. When configured and synchronized correctly with the new PHC connection, you will typically see a clock error bound of under 40us.

We previously published a blog on measuring and monitoring clock accuracy over NTP, which still applies to the improved NTP connection.

If you are connected to the PHC, your time daemon, such as chronyd, will underestimate the clock error bound. This is because a PTP hardware clock device in Linux does not inherently pass any “error bound” information to chrony, the way NTP does. As a result, your clock synchronization daemon assumes the clock itself is accurate to UTC and thus has an “error bound” of 0. To get around this issue, the Nitro System calculates the error bound of the PTP Hardware Clock itself and exposes it to your EC2 instance over the ENA driver sysfs filesystem. You can read this directly as a value in nanoseconds with the command cat /sys/devices/pci0000:00/0000:00:05.0/phc_error_bound. To get your clock error bound at a given instant, take the clock error bound from chrony or ClockBound at the time that chronyd polls the PTP Hardware Clock and add it to this phc_error_bound value.

Below is how you would calculate the clock error incorporating the PHC clock error to get your true clock error bound:

Clock Error Bound = System Time + (0.5 * Root Delay) + Root Dispersion + PHC Error Bound

For the values in the example:

PHC Error Bound = cat /sys/devices/pci0000:00/0000:00:05.0/phc_error_bound

The System Time, Root Delay, and Root Dispersion are values taken from the chrony tracking information.
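A minimal Python sketch that automates this calculation follows; it is an illustrative helper rather than an official tool, and it assumes the sysfs path shown above:

import subprocess

def tracking_value_seconds(field):
    """Return the absolute value (in seconds) of a field from `chronyc tracking`."""
    out = subprocess.check_output(["chronyc", "tracking"], text=True)
    for line in out.splitlines():
        if line.startswith(field):
            # e.g. "Root delay      : 0.000010000 seconds"
            return abs(float(line.split(":", 1)[1].split()[0]))
    raise KeyError(f"{field} not found in chronyc tracking output")

system_time = tracking_value_seconds("System time")
root_delay = tracking_value_seconds("Root delay")
root_dispersion = tracking_value_seconds("Root dispersion")

# The Nitro System exposes the PHC error bound in nanoseconds via sysfs
with open("/sys/devices/pci0000:00/0000:00:05.0/phc_error_bound") as f:
    phc_error_bound = int(f.read()) / 1e9

clock_error_bound = system_time + 0.5 * root_delay + root_dispersion + phc_error_bound
print(f"Clock error bound: {clock_error_bound * 1e6:.1f} microseconds")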

ClockBound

However accurate, a clock is never perfect. Instead of providing a point estimate of the clock error, ClockBound provides a reliable confidence interval within which the reference time (true time) exists, calculated automatically using the approach described above. The open source ClockBound daemon provides a convenient way to retrieve this confidence interval, and work is continuing to make it easier to integrate into high performance workloads.

Conclusion

The Amazon Time Sync Service’s new microsecond-accurate clocks can be leveraged to migrate and modernize your most clock-sensitive applications in the cloud. In this post, we showed you how to connect to the improved clocks on supported Amazon EC2 instances, how to measure your clock accuracy, and how to easily generate and compare timestamps from your Amazon EC2 instances with ClockBound. Launch a supported instance and get started today to build using this new capability.

To learn more about the Amazon Time Sync Service, see the EC2 User Guide for Linux and Windows.

If you have questions about this post, start a new thread on the AWS Compute re:Post or contact AWS Support.

Hear about the Amazon Time Sync Service at re:Invent

We will speak in more detail about the Amazon Time Sync Service during re:Invent 2023. Look for session ID CMP220 in the AWS re:Invent session catalog to register.

An attendee’s guide to hybrid cloud and edge computing at AWS re:Invent 2023

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/an-attendees-guide-to-hybrid-cloud-and-edge-computing-at-aws-reinvent-2023/

This post is written by Savitha Swaminathan, AWS Sr. Product Marketing Manager

AWS re:Invent 2023 starts on Nov 27th in Las Vegas, Nevada. The event brings technology business leaders, AWS partners, developers, and IT practitioners together to learn about the latest innovations, meet AWS experts, and network among their peer attendees.

This year, AWS re:Invent will once again have a dedicated track for hybrid cloud and edge computing. The sessions in this track will feature the latest innovations from AWS to help you build and run applications securely in the cloud, on premises, and at the edge – wherever you need to. You will hear how AWS customers are using our cloud services to innovate on premises and at the edge. You will also be able to immerse yourself in hands-on experiences with AWS hybrid and edge services through innovative demos and workshops.

At re:Invent there are several session types, each designed to provide you with a way to learn that fits you best:

  • Innovation Talks provide a comprehensive overview of how AWS is working with customers to solve their most important problems.
  • Breakout sessions are lecture-style presentations focused on a topic or area of interest and are well liked by business leaders and IT practitioners alike.
  • Chalk talks deep dive on customer reference architectures and invite audience members to actively participate in the whiteboarding exercise.
  • Workshops and builder sessions, popular with developers and architects, provide the most hands-on experience, where attendees can build real-time solutions with AWS experts.

The hybrid edge track will include one leadership overview session and 15 other sessions (4 breakouts, 6 chalk talks, and 5 workshops). The sessions are organized around 4 key themes: Low latency, Data residency, Migration and modernization, and AWS at the far edge.

Hybrid Cloud & Edge Overview

HYB201 | AWS wherever you need it

Join Jan Hofmeyr, Vice President, Amazon EC2, in this leadership session where he presents a comprehensive overview of AWS hybrid cloud and edge computing services, and how we are helping customers innovate on AWS wherever they need it – from Regions, to metro centers, 5G networks, on premises, and at the far edge. Jun Shi, CEO and President of Accton, will also join Jan on stage to discuss how Accton enables smart manufacturing across its global manufacturing sites using AWS hybrid, IoT, and machine learning (ML) services.

Low latency

Many customer workloads require single-digit millisecond latencies for optimal performance. Customers in every industry are looking for ways to run these latency sensitive portions of their applications in the cloud while simplifying operations and optimizing for costs. You will hear about customer use cases and how AWS edge infrastructure is helping companies like Riot Games meet their application performance goals and innovate at the edge.

Breakout session

HYB305 | Delivering low-latency applications at the edge

Chalk talk

HYB308 | Architecting for low latency and performance at the edge with AWS

Workshops

HYB302 | Architecting and deploying applications at the edge

HYB303 | Deploying a low-latency computer vision application at the edge

Data residency

As cloud has become mainstream, governments and standards bodies continue to develop security, data protection, and privacy regulations. Having control over digital assets and meeting data residency regulations is becoming increasingly important for public sector customers and organizations operating in regulated industries. The data residency sessions deep dive into the challenges, solutions, and innovations that customers are addressing with AWS to meet their data residency requirements.

Breakout session

HYB309 | Navigating data residency and protecting sensitive data

Chalk talk

HYB307 | Architecting for data residency and data protection at the edge

Workshops

HYB301 | Addressing data residency requirements with AWS edge services

Migration and modernization

Migration and modernization in industries that have traditionally operated with on-premises infrastructure or self-managed data centers is helping customers achieve scale, flexibility, cost savings, and performance. We will dive into customer stories and real-world deployments, and share best practices for hybrid cloud migrations.

Breakout session

HYB203 | A migration strategy for edge and on-premises workloads

Chalk talk

HYB313 | Real-world analysis of successful hybrid cloud migrations

AWS at the far edge

Some customers operate in what we call the far edge: remote oil rigs, military and defense territories, and even space! In these sessions we cover customer use cases and explore how AWS brings cloud services to the far edge and helps customers gain the benefits of the cloud regardless of where they operate.

Breakout session

HYB306 | Bringing AWS to remote edge locations

Chalk talk

HYB312 | Deploying cloud-enabled applications starting at the edge

Workshops

HYB304 | Generative AI for robotics: Race for the best drone control assistant

In addition to the sessions across the 4 themes listed above, the track includes two additional chalk talks covering topics that are applicable more broadly to customers operating hybrid workloads. These chalk talks were chosen based on customer interest and will have repeat sessions, due to high customer demand.

HYB310 | Building highly available and fault-tolerant edge applications

HYB311 | AWS hybrid and edge networking architectures

Learn through interactive demos

In addition to breakout sessions, chalk talks, and workshops, make sure you check out our interactive demos to see the benefits of hybrid cloud and edge in action:

Drone Inspector: Generative AI at the Edge

Location: AWS Village | Venetian Level 2, Expo Hall, Booth 852 | AWS for Every App activation

Embark on a competitive adventure where generative artificial intelligence (AI) intersects with edge computing. Experience how drones can swiftly respond to chat instructions for a time-sensitive object detection mission. Learn how you can deploy foundation models and computer vision (CV) models at the edge using AWS hybrid and edge services for real-time insights and actions.

AWS Hybrid Cloud & Edge kiosk

Location: AWS Village | Venetian Level 2, Expo Hall, Booth 852 | Kiosk #9 & 10

Stop by and chat with our experts about AWS Local Zones, AWS Outposts, AWS Snow Family, AWS Wavelength, AWS Private 5G, AWS Telco Network Builder, and Integrated Private Wireless on AWS. Check out the hardware innovations inside an AWS Outposts rack up close and in person. Learn how you can set up a reliable private 5G network within days and live stream video content with minimal latency.

AWS Next Gen Infrastructure Experience

Location: AWS Village | Venetian Level 2, Expo Hall, Booth 852

Check out demos across Global Infrastructure, AWS for Hybrid Cloud & Edge, Compute, Storage, and Networking kiosks, share on social, and win prizes!

The Future of Connected Mobility

Location: Venetian Level 4, EBC Lounge, wall outside of Lando 4201B

Step into the driver’s seat and experience high fidelity 3D terrain driving simulation with AWS Local Zones. Gain real-time insights from vehicle telemetry with AWS IoT Greengrass running on AWS Snowcone and a broader set of AWS IoT services and Amazon Managed Grafana in the Region. Learn how to combine local data processing with cloud analytics for enhanced safety, performance, and operational efficiency. Explore how you can rapidly deliver the same experience to global users in 75+ countries with minimal application changes using AWS Outposts.

Immersive tourism experience powered by 5G and AR/VR

Location: Venetian, Level 2 | Expo Hall | Telco demo area

Explore and travel to Chichen Itza with an augmented reality (AR) application running on a private network fully built on AWS, which includes the Radio Access Network (RAN), the core, security, and applications, combined with services for deployment and operations. This demo features AWS Outposts.

AWS unplugged: A real time remote music collaboration session using 5G and MEC

Location: Venetian, Level 2 | Expo Hall | Telco demo area

We will demonstrate how musicians in Los Angeles and Las Vegas can collaborate in real time with AWS Wavelength. You will witness songwriters and musicians in Los Angeles and Las Vegas in a live jam session.

Disaster relief with AWS Snowball Edge and AWS Wickr

Location: AWS for National Security & Defense | Venetian, Casanova 606

The hurricane has passed, leaving you with no cell coverage and a slim chance of getting on the internet. You need to set up a situational awareness and communications network for your team, fast. Using Wickr on Snowball Edge Compute, you can rapidly deploy a platform that provides secure communications with rich collaboration functionality, as well as real-time situational awareness with the Wickr ATAK integration, allowing you to get on with what’s important.


We hope this guide to the Hybrid Cloud and Edge track at AWS re:Invent 2023 helps you plan for the event and we hope to see you there!

Coming soon: Expansion of AWS Lambda states to all functions

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/coming-soon-expansion-of-aws-lambda-states-to-all-functions/

In November of 2019, we announced AWS Lambda function state attributes, a capability to track the current “state” of a function throughout its lifecycle.

Since launch, states have been used in two primary use cases. First, to move the blocking setup of VPC resources out of the path of function invocation. Second, to allow the Lambda service to optimize new or updated container images for container-image based functions, also before invocation. By moving this additional work out of the path of the invocation, customers see lower latency and better consistency in their function performance. Soon, we will be expanding states to apply to all Lambda functions.

This post outlines the upcoming change, any impact, and actions to take during the roll out of function states to all Lambda functions. Most customers experience no impact from this change.

As functions are created or updated, or potentially fall idle due to low usage, they can transition to a state associated with that lifecycle event. Previously, any function that was zip-file based and not attached to a VPC would only show an Active state. Updates to the application code and modifications of the function configuration would always show the Successful value for the LastUpdateStatus attribute. Now all functions will follow the same function state lifecycles described in the initial announcement post and in the documentation for Monitoring the state of a function with the Lambda API.

All AWS CLIs and SDKs have supported monitoring Lambda function states transitions since the original announcement in 2019. Infrastructure as code tools such as AWS CloudFormation, AWS SAM, Serverless Framework, and Hashicorp Terraform also already support states. Customers using these tools do not need to take any action as part of this, except for one recommended service role policy change for AWS CloudFormation customers (see Updating CloudFormation’s service role below).

However, there are some customers using SDK-based automation workflows, or calling Lambda’s service APIs directly, that must update those workflows for this change. To allow time for testing this change, we are rolling it out in a phased model, much like the initial rollout for VPC attached functions. We encourage all customers to take this opportunity to move to the latest SDKs and tools available.

Change details

Nothing is changing about how functions are created, updated, or operate as part of this. However, this change may impact certain workflows that attempt to invoke or modify a function shortly after a create or an update action. Before making API calls to a function that was recently created or modified, confirm it is first in the Active state, and that the LastUpdateStatus is Successful.

For a full explanation of both the create and update lifecycles, see Tracking the state of AWS Lambda functions.

Create function state lifecycle

Update function state lifecycle
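If your automation uses the AWS SDKs directly, recent SDK versions provide built-in waiters that poll these attributes for you. A minimal boto3 sketch, with a hypothetical function name:

import boto3

lambda_client = boto3.client("lambda")
function_name = "MY_FUNCTION_NAME"  # hypothetical placeholder

# Wait until the function reaches the Active state after creation
lambda_client.get_waiter("function_active").wait(FunctionName=function_name)

# Wait until the most recent update has completed successfully
lambda_client.get_waiter("function_updated").wait(FunctionName=function_name)

response = lambda_client.invoke(FunctionName=function_name)
print(response["StatusCode"])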

Change timeframe

We are rolling out this change in multiple phases, starting with the Begin Testing phase today, July 12, 2021. The phases allow you to update tooling for deploying and managing Lambda functions to account for this change. By the end of the update timeline, all accounts will transition to using the create/update Lambda lifecycle.

July 12, 2021 – Begin Testing: You can now begin testing and updating any deployment or management tools you have to account for the upcoming lifecycle change. You can also use this time to update your function configuration to delay the change until the End of Delayed Update.

September 6, 2021 – General Update (with optional delayed update configuration): All customers without the delayed update configuration begin seeing functions transition through the lifecycles for create and update. Customers that have used the delayed update configuration as described below will not see any change.

October 1, 2021 – End of Delayed Update: The delay mechanism expires and customers now see the Lambda states lifecycle applied during function create or update.

Opt-in and delayed update configurations

Starting today, we are providing an opt-in mechanism. This allows you to update and test your tools and developer workflow processes for this change. We are also providing a mechanism to delay this change until the End of Delayed Update date. After that date, all functions will begin using the Lambda states lifecycle.

This mechanism operates on a function-by-function basis, so you can test and experiment individually without impacting your whole account. Once the General Update phase begins, all functions in an account that do not have the delayed update mechanism in place see the new lifecycle for their functions.

Both mechanisms work by adding a special string to the “Description” parameter of Lambda functions. You can add this string anywhere in the parameter: as a prefix, as a suffix, or as the entire contents of the field. The parameter is processed at create or update in accordance with the requested action.

To opt in:

aws:states:opt-in

To delay the update:

aws:states:opt-out

NOTE: The delay configuration mechanism has no impact after the End of Delayed Update date.
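You can set either flag through the console, as shown next, or programmatically. A minimal boto3 sketch that appends the opt-in string, with a hypothetical function name:

import boto3

lambda_client = boto3.client("lambda")
function_name = "MY_FUNCTION_NAME"  # hypothetical placeholder

config = lambda_client.get_function_configuration(FunctionName=function_name)
description = config.get("Description", "")

# Append the opt-in flag if it is not already present
if "aws:states:opt-in" not in description:
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        Description=(description + " aws:states:opt-in").strip(),
    )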

Here is how this looks in the console:

I add the opt-in configuration to my function’s Description. You can find this under Configuration -> General Configuration in the Lambda console. Choose Edit to change the value.

Edit basic settings

After choosing Save, you can see the value in the console:

Opt-in flag set

Once the opt-in is set for a function, updates on that function go through the preceding update flow.

Checking a function’s state

With this in place, you can now test your development workflow ahead of the General Update phase. Download the latest AWS CLI (version 2.2.18 or greater) or SDKs to see function state and related attribute information.

You can confirm the current state of a function using the AWS APIs or AWS CLI to perform the GetFunction or GetFunctionConfiguration API or command for a specified function:

$ aws lambda get-function --function-name MY_FUNCTION_NAME --query 'Configuration.[State, LastUpdateStatus]'
[
    "Active",
    "Successful"
]

This returns the State and LastUpdateStatus in order for a function.

Updating CloudFormation’s service role

CloudFormation allows customers to create an AWS Identity and Access Management (IAM) service role to make calls to resources in a stack on your behalf. Customers can use service roles to allow or deny the ability to create, update, or delete resources in a stack.

As part of the rollout of function states for all functions, we recommend that customers configure CloudFormation service roles with an Allow for the “lambda:GetFunction” API. This API allows CloudFormation to get the current state of a function, which is required to assist in the creation and deployment of functions.
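For example, a policy statement like the following, attached to the CloudFormation service role, grants this permission; scope the Resource down to your functions as appropriate:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "lambda:GetFunction",
      "Resource": "*"
    }
  ]
}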

Conclusion

With function states, you can have better clarity on how the resources required by your Lambda function are being created. This change does not impact the way that functions are invoked or how your code is run. While this is a minor change to when resources are created for your Lambda function, the result is even better consistency when working with the service.

For more serverless learning resources, visit Serverless Land.

Hosting Hugging Face models on AWS Lambda for serverless inference

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/hosting-hugging-face-models-on-aws-lambda/

This post written by Eddie Pick, AWS Senior Solutions Architect – Startups and Scott Perry, AWS Senior Specialist Solutions Architect – AI/ML

Hugging Face Transformers is a popular open-source project that provides pre-trained, natural language processing (NLP) models for a wide variety of use cases. Customers with minimal machine learning experience can use pre-trained models to enhance their applications quickly using NLP. This includes tasks such as text classification, language translation, summarization, and question answering – to name a few.

First introduced in 2017, the Transformer is a modern neural network architecture that has quickly become the most popular type of machine learning model applied to NLP tasks. It outperforms previous techniques based on convolutional neural networks (CNNs) or recurrent neural networks (RNNs). The Transformer also offers significant improvements in computational efficiency. Notably, Transformers are more conducive to parallel computation. This means that Transformer-based models can be trained more quickly, and on larger datasets than their predecessors.

The computational efficiency of Transformers provides the opportunity to experiment and improve on the original architecture. Over the past few years, the industry has seen the introduction of larger and more powerful Transformer models. For example, BERT was first published in 2018 and achieved better benchmark scores on 11 natural language processing tasks using between 110M and 340M neural network parameters. In 2019, the T5 model using 11B parameters achieved better results on benchmarks such as summarization, question answering, and text classification. More recently, the GPT-3 model was introduced in 2020 with 175B parameters, and in 2021 Switch Transformers scaled to over 1T parameters.

One consequence of this trend toward larger and more powerful models is an increased barrier to entry. As the number of model parameters increases, so does the computational infrastructure necessary to train such a model. This is where the open-source Hugging Face Transformers project helps.

Hugging Face Transformers provides over 30 pretrained Transformer-based models available via a straightforward Python package. Additionally, there are over 10,000 community-developed models available for download from Hugging Face. This lets users apply modern Transformer models within their applications without requiring model training from scratch.

The Hugging Face Transformers project directly addresses challenges associated with training modern Transformer-based models. Many customers want a zero-administration ML inference solution that allows Hugging Face Transformers models to be hosted in AWS easily. This post introduces a low-touch, cost-effective, and scalable mechanism for hosting Hugging Face models for real-time inference using AWS Lambda.

Overview

Our solution consists of an AWS Cloud Development Kit (AWS CDK) script that automatically provisions container image-based Lambda functions that perform ML inference using pre-trained Hugging Face models. This solution also includes Amazon Elastic File System (EFS) storage that is attached to the Lambda functions to cache the pre-trained models and reduce inference latency.

Solution architecture

In this architectural diagram:

  1. Serverless inference is achieved by using Lambda functions that are based on container images
  2. The container image is stored in an Amazon Elastic Container Registry (ECR) repository within your account
  3. Pre-trained models are automatically downloaded from Hugging Face the first time the function is invoked
  4. Pre-trained models are cached within Amazon Elastic File System storage in order to improve inference latency

The solution includes Python scripts for two common NLP use cases:

  • Sentiment analysis: Identifying if a sentence indicates positive or negative sentiment. It uses a fine-tuned model on sst2, which is a GLUE task.
  • Summarization: Summarizing a body of text into a shorter, representative text. It uses a Bart model that was fine-tuned on the CNN / Daily Mail dataset.

For simplicity, both of these use cases are implemented using Hugging Face pipelines.

Prerequisites

The following is required to run this example:

Deploying the example application

  1. Clone the project to your development environment:
    git clone https://github.com/aws-samples/zero-administration-inference-with-aws-lambda-for-hugging-face.git
  2. Install the required dependencies:
    pip install -r requirements.txt
  3. Bootstrap the CDK. This command provisions the initial resources needed by the CDK to perform deployments:
    cdk bootstrap
  4. This command deploys the CDK application to its environment. During the deployment, the toolkit outputs progress indications:
    $ cdk deploy

Testing the application

After deployment, navigate to the AWS Management Console to find and test the Lambda functions. There is one for sentiment analysis and one for summarization.

To test:

  1. Enter “Lambda” in the search bar of the AWS Management Console
  2. Filter the functions by entering “ServerlessHuggingFace”
  3. Select the ServerlessHuggingFaceStack-sentimentXXXXX function
  4. In the Test event, enter the following snippet and then choose Test:
{
   "text": "I'm so happy I could cry!"
}

The first invocation takes approximately one minute to complete. The initial Lambda function environment must be allocated and the pre-trained model must be downloaded from Hugging Face. Subsequent invocations are faster, as the Lambda function is already prepared and the pre-trained model is cached in EFS.

Function test results

The JSON response shows the result of the sentiment analysis:

{
  "statusCode": 200,
  "body": {
    "label": "POSITIVE",
    "score": 0.9997532367706299
  }
}
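You can also invoke the function programmatically. The following is a minimal sketch using boto3; the function name is a placeholder, so copy the exact generated name from the Lambda console:

import json
import boto3

client = boto3.client('lambda')

# Replace with the generated function name from your deployment
response = client.invoke(
    FunctionName='ServerlessHuggingFaceStack-sentimentXXXXX',
    Payload=json.dumps({"text": "I'm so happy I could cry!"}),
)

# The payload stream contains the JSON response shown above
print(json.loads(response['Payload'].read()))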

Understanding the code structure

The code is organized using the following structure:

├── inference
│ ├── Dockerfile
│ ├── sentiment.py
│ └── summarization.py
├── app.py
└── ...

The inference directory contains:

  • The Dockerfile used to build a custom container image that can run PyTorch-based Hugging Face inference in Lambda functions
  • The Python scripts that perform the actual ML inference

The sentiment.py script shows how to use a Hugging Face Transformers model:

from transformers import pipeline

# Load the sentiment analysis pipeline once, outside the handler, so that
# warm invocations reuse the already-initialized model
nlp = pipeline("sentiment-analysis")

def handler(event, context):
    # Run inference on the supplied text and return the top prediction
    response = {
        "statusCode": 200,
        "body": nlp(event['text'])[0]
    }
    return response

For each Python script in the inference directory, the CDK generates a Lambda function backed by a container image that runs that inference script.

CDK script

The CDK script is named app.py in the solution’s repository. The beginning of the script creates a virtual private cloud (VPC).

vpc = ec2.Vpc(self, 'Vpc', max_azs=2)

Next, it creates the EFS file system and an access point in EFS for the cached models:

fs = efs.FileSystem(self, 'FileSystem',
                    vpc=vpc,
                    removal_policy=cdk.RemovalPolicy.DESTROY)
access_point = fs.add_access_point('MLAccessPoint',
                                   create_acl=efs.Acl(
                                       owner_gid='1001', owner_uid='1001', permissions='750'),
                                   path="/export/models",
                                   posix_user=efs.PosixUser(gid="1001", uid="1001"))

It iterates through the Python files in the inference directory:

# Find every Python inference script in the inference directory
docker_folder = os.path.dirname(os.path.realpath(__file__)) + "/inference"
pathlist = Path(docker_folder).rglob('*.py')
for path in pathlist:

And then creates the Lambda function that serves the inference requests:

            base = os.path.basename(path)
            filename = os.path.splitext(base)[0]
            # Lambda Function from docker image
            function = lambda_.DockerImageFunction(
                self, filename,
                code=lambda_.DockerImageCode.from_image_asset(docker_folder,
                                                              cmd=[
                                                                  filename+".handler"]
                                                              ),
                memory_size=8096,
                timeout=cdk.Duration.seconds(600),
                vpc=vpc,
                filesystem=lambda_.FileSystem.from_efs_access_point(
                    access_point, '/mnt/hf_models_cache'),
                environment={
                    "TRANSFORMERS_CACHE": "/mnt/hf_models_cache"},
            )

Adding a translator

Optionally, you can add more models by adding Python scripts in the inference directory. For example, add the following code in a file called translate-en2fr.py:

from transformers import pipeline

# Load the English-to-French translation pipeline once, outside the handler
en_fr_translator = pipeline('translation_en_to_fr')

def handler(event, context):
    # Translate the supplied text and return the top result
    response = {
        "statusCode": 200,
        "body": en_fr_translator(event['text'])[0]
    }
    return response

Then run:

$ cdk synth
$ cdk deploy

This creates a new Lambda function that performs English to French translation.

Cleaning up

After you are finished experimenting with this project, run “cdk destroy” to remove all of the associated infrastructure.

Conclusion

This post shows how to perform ML inference for pre-trained Hugging Face models by using Lambda functions. To avoid repeatedly downloading the pre-trained models, this solution uses an EFS-based approach to model caching. This helps to achieve low-latency, near real-time inference. The solution is provided as infrastructure as code using Python and the AWS CDK.

We hope this blog post allows you to prototype quickly and include modern NLP techniques in your own products.

Improved failure recovery for Amazon EventBridge

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/improved-failure-recovery-for-amazon-eventbridge/

Today we’re announcing two new capabilities for Amazon EventBridge: dead letter queues and custom retry policies. Both give you greater flexibility in how to handle failures in the processing of events with EventBridge. You can enable them on a per-target basis and configure them uniquely for each.

Dead letter queues (DLQs) are a common capability in queuing and messaging systems that allow you to handle failures in event or message receiving systems. They provide a way for failed events or messages to be captured and sent to another system, which can store them for future processing. With DLQs, you can have greater resiliency and improved recovery from any failure that happens.

You can also now configure a custom retry policy that can be set on your event bus targets. Today, there are two attributes that can control how events are retried: maximum number of retries and maximum event age. With these two settings, you could send events to a DLQ sooner and reduce the retries attempted.

For example, this could allow you to recover more quickly if an event bus target is overwhelmed by the number of events received, causing throttling to occur. The events are placed in a DLQ and then processed later.

Failures in event processing

Currently, EventBridge can fail to deliver an event to a target in certain scenarios. Events that fail to be delivered to a target due to client-side errors are dropped immediately. Examples are when EventBridge does not have permission to invoke a target AWS service, or when the target no longer exists. This can happen if the target resource is misconfigured or is deleted by the resource owner.

For service-side issues, EventBridge retries delivery of events for up to 24 hours. This can happen if the target service is unavailable or the target resource is not provisioned to handle the incoming event traffic and the target service is throttling the requests.

EventBridge failures

Previously, when all attempts to deliver an event to the target were exhausted, EventBridge published a CloudWatch metric indicating a failed target invocation. However, this provided no visibility into which events failed to be delivered, and there was no way to recover them.

Dead letter queues

EventBridge’s DLQs are made possible today with Amazon Simple Queue Service (SQS) standard queues. With SQS, you get all of the benefits of a fully serverless queuing service: no servers to manage, automatic scalability, pay for what you consume, and high availability and security built in. You can configure the DLQs for your EventBridge bus and pay nothing until it is used, if and when a target experiences an issue. This makes it a great practice to follow and standardize on, and provides you with a safety net that’s active only when needed.

Optionally, you could later configure an AWS Lambda function to consume from that DLQ. The function is only invoked when messages exist in the queue, allowing you to maintain a serverless stack to recover from a potential failure.

DLQ configured per target

With a DLQ configured, the queue receives the failed event along with important metadata that you can use to troubleshoot the issue. This can include: Error Code, Error Message, Exhausted Retry Condition, Retry Attempts, Rule ARN, and Target ARN.

You can use this data to more easily troubleshoot what went wrong with the original delivery attempt and take action to resolve or prevent such failures in the future. You could also use the information such as Exhausted Retry Condition and Retry Attempts to further tweak your custom retry policy.
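As an illustration, a consumer could inspect this metadata with boto3 before deciding whether to replay or discard an event. The following is a minimal sketch; the queue URL is a placeholder, and you should verify the exact message attribute keys in the EventBridge documentation:

import boto3

sqs = boto3.client('sqs')

# Placeholder queue URL for the DLQ used in this post
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/orders-bus-shipping-service-dlq'

resp = sqs.receive_message(
    QueueUrl=queue_url,
    MessageAttributeNames=['All'],
    MaxNumberOfMessages=1,
)

for msg in resp.get('Messages', []):
    # Failure metadata arrives as SQS message attributes; the original
    # event is carried in the message body
    attrs = msg.get('MessageAttributes', {})
    print(attrs.get('ERROR_CODE', {}).get('StringValue'))
    print(attrs.get('RULE_ARN', {}).get('StringValue'))
    print(msg['Body'])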

You can configure a DLQ when creating or updating rules via the AWS Management Console and AWS Command Line Interface (AWS CLI). You can also use infrastructure as code (IaC) tools such as AWS CloudFormation.

In the console, select the queue to be used for your DLQ configuration from the drop-down as shown here:

DLQ configuration

When configured via API, AWS CLI, or IaC tools, you must specify the ARN of the queue:

arn:aws:sqs:us-east-1:123456789012:orders-bus-shipping-service-dlq

When you configure a DLQ, the target SQS queue requires a resource-based policy that grants EventBridge access. One is created and applied automatically via the console when you create or update an EventBridge rule with a DLQ that exists in your own account.

For any queues created in other accounts, or via the API, AWS CLI, or IaC tools, you must add a policy that grants the EventBridge service principal the sqs:SendMessage permission, with a condition scoping it to the EventBridge rule ARN, as shown below:

{
  "Sid": "Dead-letter queue permissions",
  "Effect": "Allow",
  "Principal": {
     "Service": "events.amazonaws.com"
  },
  "Action": "sqs:SendMessage",
  "Resource": "arn:aws:sqs:us-east-1:123456789012:orders-bus-shipping-service-dlq",
  "Condition": {
    "ArnEquals": {
      "aws:SourceArn": "arn:aws:events:us-east-1:123456789012:rule/MyTestRule"
    }
  }
}

You can read more about setting permissions for DLQs in the documentation under “Granting permissions to the dead-letter queue”.
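If you are scripting this setup, a minimal boto3 sketch that applies the policy shown above could look like the following; the queue URL is a placeholder:

import json
import boto3

sqs = boto3.client('sqs')

# The same resource-based policy shown above, applied programmatically
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "Dead-letter queue permissions",
        "Effect": "Allow",
        "Principal": {"Service": "events.amazonaws.com"},
        "Action": "sqs:SendMessage",
        "Resource": "arn:aws:sqs:us-east-1:123456789012:orders-bus-shipping-service-dlq",
        "Condition": {"ArnEquals": {
            "aws:SourceArn": "arn:aws:events:us-east-1:123456789012:rule/MyTestRule"
        }}
    }]
}

sqs.set_queue_attributes(
    QueueUrl='https://sqs.us-east-1.amazonaws.com/123456789012/orders-bus-shipping-service-dlq',
    Attributes={'Policy': json.dumps(policy)}
)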

Once configured, you can monitor CloudWatch metrics for the DLQ. These show both the successful delivery of messages via the InvocationsSentToDLQ metric and any failures via the InvocationsFailedToBeSentToDLQ metric. Note that these metrics do not exist unless your queue is considered “active”.

Retry policies

By default, EventBridge retries delivery of an event to a target so long as it does not receive a client-side error as described earlier. Retries occur with a back-off, for up to 185 attempts or for up to 24 hours, after which the event is dropped or sent to a DLQ, if configured. Due to the jitter of the back-off and retry process, you may reach the 24-hour limit before reaching 185 retries.

For many workloads, this provides an acceptable way to handle momentary service issues or throttling that might occur. For some, however, this model of back-off and retry can cause increased and ongoing traffic to an already overloaded target system.

For example, consider an Amazon API Gateway target that has a resource constrained backend service behind it.

Constrained target service

Under a consistently high load, the bus could end up generating too many API requests, tripping the API Gateway’s throttling configuration. This would cause API Gateway to respond with throttling errors back to EventBridge.

Throttled API reply

You may decide that allowing the failed events to retry for 24 hours places too much load on this system and prevents it from recovering properly. This could lead to potential data loss unless a DLQ is configured.

Added DLQ

With a DLQ, you could choose to process these events later, once the overwhelmed target service has recovered.

DLQ drained back to API

Or the events in question may no longer have the same value as they did previously. This can occur in systems where data loss is tolerated but the timeliness of data processing matters. In these situations, the DLQ would have less value and dropping the message is acceptable.

For either of these situations, configuring the maximum number of retries or the maximum age of the event could be useful.

Now with retry policies, you can configure per target the following two attributes:

  • MaximumEventAgeInSeconds: between 60 and 86400 seconds (86400 seconds, or 24 hours, is the default)
  • MaximumRetryAttempts: between 0 and 185 (185 is the default)

When either condition is met, the event fails. It’s then either dropped, which generates an increase to the FailedInvocations CloudWatch metric, or sent to a configured DLQ.

You can configure retry policy attributes when creating or updating rules via the AWS Management Console and AWS Command Line Interface (AWS CLI). You can also use infrastructure as code (IaC) tools such as AWS CloudFormation.

Retry policy
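As an example, the following minimal boto3 sketch sets both a retry policy and a DLQ on a single target; the rule name, target ID, and ARNs are placeholders for your own resources:

import boto3

events = boto3.client('events')

events.put_targets(
    Rule='MyTestRule',
    Targets=[{
        'Id': 'shipping-service',
        'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:shipping-service',
        # Give up after 2 retries or once the event is 10 minutes old
        'RetryPolicy': {
            'MaximumRetryAttempts': 2,
            'MaximumEventAgeInSeconds': 600,
        },
        # Failed events are then sent to the DLQ instead of being dropped
        'DeadLetterConfig': {
            'Arn': 'arn:aws:sqs:us-east-1:123456789012:orders-bus-shipping-service-dlq'
        },
    }]
)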

There is no additional cost for configuring either of these new capabilities. You only pay for the usage of the SQS standard queue configured as the dead letter queue during a failure and any application that handles the failed events. SQS pricing can be found here.

Conclusion

With dead letter queues and custom retry policies, you have improved handling and control over failure in distributed systems built with EventBridge. With DLQs you can capture failed events and then process them later, potentially saving yourself from data loss. With custom retry policies, you gain the improved ability to control the number of retries and for how long they can be retried.

I encourage you to explore how both of these new capabilities can help make your applications more resilient to failures, and to standardize on using them both in your infrastructure.

For more serverless learning resources, visit https://serverlessland.com.

Implementing FIFO message ordering with Amazon MQ for Apache ActiveMQ

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/implementing-fifo-message-ordering-with-amazon-mq-for-apache-activemq/

This post is contributed by Ravi Itha, Sr. Big Data Consultant

Messaging plays an important role in building distributed enterprise applications. Amazon MQ is a key offering within the AWS messaging services solution stack focused on enabling messaging services for modern application architectures. Amazon MQ is a managed message broker service for Apache ActiveMQ that simplifies setting up and operating message brokers in the cloud. Amazon MQ uses open standard APIs and protocols such as JMS, NMS, AMQP, STOMP, MQTT, and WebSocket. Using standards means that, in most cases, there’s no need to rewrite any messaging code when you migrate to AWS. This allows you to focus on your business logic and application architecture.

Message ordering via Message Groups

Sometimes it’s important to guarantee the order in which messages are processed. In ActiveMQ, there is no explicit distinction between a standard queue and a FIFO queue. However, a queue can be used to route messages in FIFO order. This ordering can be achieved via two different ActiveMQ features: either by implementing an Exclusive Consumer or by using Message Groups. This blog focuses on Message Groups, an enhancement to Exclusive Consumers. Message groups provide:

  • Guaranteed ordering of the processing of related messages across a single queue
  • Load balancing of the processing of messages across multiple consumers
  • High availability with automatic failover to other consumers if a JVM goes down

This is achieved programmatically as follows:

Sample producer code snippet:

// Assign the message to Group-A; messages in the same group are delivered in order
TextMessage tMsg = session.createTextMessage("SampleMessage");
tMsg.setStringProperty("JMSXGroupID", "Group-A");
producer.send(tMsg);

Sample consumer code snippet:

// Wait up to 50 ms for the next message
Message consumerMessage = consumer.receive(50);
TextMessage txtMessage = (TextMessage) consumerMessage;
String msgBody = txtMessage.getText();
String msgGroup = txtMessage.getStringProperty("JMSXGroupID");

This sample code highlights:

  • A message group is set by the producer during message ingestion
  • A consumer determines the message group once a message is consumed

Additionally, if a queue has messages for multiple message groups, then it’s possible for a single consumer to receive messages belonging to several of those groups. This depends on factors such as the number of consumers of a queue and each consumer’s start time.

Scenarios: Multiple producers and consumers with Message Groups

A FIFO queue in ActiveMQ supports multiple ordered message groups. Due to this, it’s common that a queue is used to exchange messages between multiple producer and consumer applications. By running multiple consumers to process messages from a queue, the message broker is able to partition messages across consumers. This improves the scalability and performance of your application.

In terms of scalability, commonly asked questions center on the ideal number of consumers and how messages are distributed across all consumers. To provide more clarity in this area, we provisioned an Amazon MQ broker and ran various test scenarios.

Scenario 1: All consumers started at the same time

Test setup

  • All producers and consumers have the same start time
  • Each test uses a different combination of number of producers, message groups, and consumers

Setup for Tests 1 to 5

Test results

Test # | # producers | # message groups | # messages sent by each producer | # consumers | Total messages | Received by C1 | Received by C2 | Received by C3 | Received by C4
1 | 3 | 3 | 5000 | 1 | 15000 | 15000 (all groups) | NA | NA | NA
2 | 3 | 3 | 5000 | 2 | 15000 | 5000 (Group-C) | 10000 (Group-A and Group-B) | NA | NA
3 | 3 | 3 | 5000 | 3 | 15000 | 5000 (Group-A) | 5000 (Group-B) | 5000 (Group-C) | NA
4 | 3 | 3 | 5000 | 4 | 15000 | 5000 (Group-C) | 5000 (Group-B) | 5000 (Group-A) | 0
5 | 4 | 4 | 5000 | 3 | 20000 | 5000 (Group-A) | 5000 (Group-B) | 10000 (Group-C and Group-D) | NA

Test conclusions

  • Test 3 – illustrates even message distribution across consumers when a one-to-one relationship exists between message groups and number of consumers
  • Test 4 – illustrates that one of the four consumers did not receive any messages. This highlights that running more consumers than the available number of message groups provides no additional benefit
  • Tests 1, 2, 5 – indicate that a consumer can receive messages belonging to multiple message groups. The following table provides additional granularity to messages received by consumer C2 in test #2. As you can see, these messages belong to Group-A and Group-B message groups, and FIFO ordering is maintained at a message group level
consumer_id | msg_id | msg_group
Consumer C2 | A-1 | Group-A
Consumer C2 | B-1 | Group-B
Consumer C2 | A-2 | Group-A
Consumer C2 | B-2 | Group-B
Consumer C2 | A-3 | Group-A
Consumer C2 | B-3 | Group-B
… | … | …
Consumer C2 | A-4999 | Group-A
Consumer C2 | B-4999 | Group-B
Consumer C2 | A-5000 | Group-A
Consumer C2 | B-5000 | Group-B

Scenario 2a: All consumers not started at same time

Test setup

  • Three producers and one consumer started at the same time
  • The second and third consumers started after 30 seconds and 60 seconds respectively
  • 15,000 messages sent in total across three message groups
Setup for Test 6

Test results

Test # | # producers | # message groups | # messages sent by each producer | # consumers | Total messages | Received by C1 | Received by C2 | Received by C3
6 | 3 | 3 | 5000 | 3 | 15000 | 15000 | 0 | 0

Test conclusion

Consumer C1 received all messages, while consumers C2 and C3 ran idle and did not receive any messages. The key takeaway is that message distribution can be inefficient in real-world scenarios where consumers start at different times.

The next scenario (2b) revisits this same setup, while optimizing message distribution so that all consumers are used.

Scenario 2b: Utilization of all consumers when not started at same time

Test setup

  • Three producers and one consumer started at the same time
  • The second and third consumers started after 30 seconds and 60 seconds respectively
  • 15,000 messages sent in total across three message groups
  • After each producer sends the 2501st message for its message group, it closes the group, after which message distribution is restarted by sending the remaining messages. Closing a message group can be done as in the following code example (specifically the -1 value set for the JMSXGroupSeq property):
TextMessage tMsg = session.createTextMessage("<foo>hey</foo>");
tMsg.setStringProperty("JMSXGroupID", "Group-A");
// Setting JMSXGroupSeq to -1 closes the message group
tMsg.setIntProperty("JMSXGroupSeq", -1);
producer.send(tMsg);

Setup for Test 7

Test results

Test # | # producers | # message groups | # messages sent by each producer | # consumers | Total messages | Received by C1 | Received by C2 | Received by C3
7 | 3 | 3 | 5001 | 3 | 15003 | 10003 | 2500 | 2500

Distribution of messages received by message group

Consumer | Group-A | Group-B | Group-C | Consumer-wise total
Consumer 1 | 2501 | 2501 | 5001 | 10003
Consumer 2 | 2500 | 0 | 0 | 2500
Consumer 3 | 0 | 2500 | 0 | 2500
Group total | 5001 | 5001 | 5001 | NA

Total messages received: 15003

Test conclusions

Message distribution is optimized with the closing and reopening of a message group when all consumers are not started at the same time. This mitigation step results in all consumers receiving messages.

  • After Group-A was closed, the broker assigned subsequent Group-A messages to consumer C2
  • After Group-B was closed, the broker assigned subsequent Group-B messages to consumer C3
  • After Group-C was closed, the broker continued to send Group-C messages to consumer C1. The assignment did not change because there was no other available consumer
Test 7 – Message distribution among consumers

Scalability techniques

Now that we understand how to use Message Groups to implement FIFO use cases within Amazon MQ, let’s look at how they scale. By default, a message queue supports a maximum of 1024 message groups. This means that if you use more than 1024 message groups per queue, message ordering is lost for the oldest message group. This is further explained in the ActiveMQ Message Groups documentation. This can be problematic for complex use cases involving stock exchanges or financial trading scenarios where thousands of ordered message groups are required. The following techniques address this issue.

Technique 1: Verify that the appropriate Amazon MQ broker instance type is used. Select the broker instance type according to your use case. To learn more, refer to the Amazon MQ broker instance types documentation.

Technique 2: Increase the number of message groups per message queue. The default maximum number of message groups can be increased via a custom configuration file at the time of launching a new broker, or by modifying an existing broker. Refer to the next section for an example (requirement #2).

Technique 3: Recycle message groups when they are no longer needed. A message group can be closed programmatically by a producer once it has finished sending all messages to a queue. The following is a sample code snippet:

TextMessage tMsg = session.createTextMessage("<foo>hey</foo>");
tMsg.setStringProperty("JMSXGroupID", "GroupA");
// Setting JMSXGroupSeq to -1 closes the message group
tMsg.setIntProperty("JMSXGroupSeq", -1);
producer.send(tMsg);

In the preceding scenario 2b, we used this technique to improve the message distribution across consumers.

Customize message broker configuration

In the previous section, to improve scalability we suggested increasing the number of message groups per queue by updating the broker configuration. A broker configuration is essentially an XML file that contains all ActiveMQ settings for a given message broker. Let’s look at broker configuration settings that achieve specific requirements. For your reference, we’ve placed a copy of a broker configuration file with these settings in a GitHub repository.

Requirement 1: Change the message group implementation from the default CachedMessageGroupMap to MessageGroupHashBucket.

<!-- valid values: simple, bucket, cached; the default is cached -->
<!-- simple represents SimpleMessageGroupMap -->
<!-- bucket represents MessageGroupHashBucket -->
<!-- cached represents CachedMessageGroupMap -->
<policyEntry messageGroupMapFactoryType="bucket" queue="&gt;"/>

Requirement 2: Increase the number of message groups per queue from 1024 to 2048 and increase the cache size from 64 to 128.

<!-- the default value for bucketCount is 1024 and for cacheSize is 64 -->
<policyEntry queue="&gt;">
  <messageGroupMapFactory>
    <messageGroupHashBucketFactory bucketCount="2048" cacheSize="128"/>
  </messageGroupMapFactory>
</policyEntry>

Requirement 3: Wait for three consumers or 30 seconds before the broker begins dispatching messages.

<policyEntry queue="&gt;" consumersBeforeDispatchStarts="3" timeBeforeDispatchStarts="30000"/>

When must default broker configurations be updated? This would only apply in scenarios where the default settings do not meet your requirements. Additional information on how to update your broker configuration file can be found here.

Amazon MQ starter kit

Want to get up and running with Amazon MQ quickly? Start building with Amazon MQ by cloning the starter kit available on GitHub. The starter kit includes a CloudFormation template to provision a message broker, sample broker configuration file, and source code related to the consumer scenarios in this blog.

Conclusion

In this blog post, you learned how Amazon MQ simplifies the setup and operation of Apache ActiveMQ in the cloud. You also learned how the Message Groups feature can be used to implement FIFO. Lastly, you walked through real scenarios demonstrating how ActiveMQ distributes messages across consumers when queues are used to exchange messages between multiple producers and consumers.