Tag Archives: contributed

Implementing patterns that exit early out of a parallel state in AWS Step Functions

2023-07-21 Benjamin Smith

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/implementing-patterns-that-exit-early-out-of-a-parallel-state-in-aws-step-functions/

This post is written by Madhav Vishnubhatta, Senior Technical Account Manager, Enterprise Support.

This blog post explains how to implement patterns in AWS Step Functions that control the break out of a parallel state as soon as a minimum requirement is met. The parallel state usually completes only when all the parallel flows inside it are completed. But if you do not want to wait for all of the parallel flows to complete before moving to the next step, this post provides patterns to help implement this functionality.

You can use AWS Step Functions to set up visual serverless workflows that orchestrate and coordinate multiple AWS services into a serverless workflow. This allows you to build complex, stateful, and scalable applications without managing the underlying infrastructure. In Step Functions, the individual steps are called states.

Step Functions offers multiple types of states. Some states help control the logic of the workflow. For example, the choice state enables conditional logic to control the flow to any one of the multiple possible next states, depending on the conditions defined in the state. The parallel state helps control the logic, but rather than choose one of multiple next states (as the choice state does), the parallel state allows all the branches to run as parallel flows concurrently. When all the parallel flows are complete, control moves on to the Parallel state’s next state.

Patterns that do not need to wait for all parallel flows to finish

Consider a scenario where the Step Functions workflow represents the process of an employee requesting a laptop in your organization. The process begins with a request from the employee as the first step, but the approval of this request could come from either of two IT managers.

In this case there could be two parallel flows, each waiting for an approval from one IT manager. But, as soon as one person provides approval, the workflow can move forward to the next step of actually issuing a laptop to the employee. This is an “either-or” pattern.

Consider a similar use-case but with a slightly different requirement. Instead of just one person’s approval being enough to issue a laptop, what if approval is needed from a minimum of two out of three IT managers before the laptop is issued. This is the “quorum” pattern.

The parallel state does not directly support these two patterns because the state waits for all the flows to complete. In this case, that means all the managers must provide an approval before a laptop can be issued.

Solution overview

Step Functions provides an error handling mechanism with the fail state, which can be used to fail the workflow with an error. This error can be caught downstream in the workflow and handled as needed. Both the either-or and the quorum patterns can be implemented with this fail state along with the error handling capability.

In case of either-or, as soon as the parallel flow is finished, the fail state can throw an error, which is caught outside the parallel state for further processing. Even though it is the fail state, it might not represent an error scenario in your use-case.

The quorum pattern needs an additional mechanism to store the status of each parallel flow, using an Amazon DynamoDB table. The quorum pattern creates an item in the DynamoDB table at the beginning of the workflow that is updated by each parallel flow as soon as it has completed. Each parallel flow checks the DynamoDB table to look at the number of processes that have completed and compare it against the quorum. If the quorum is met, that flow raises an error with a fail state that can be caught outside the parallel step.

Prerequisites

Both of these patterns are published on Serverless Land:

To deploy and use these patterns, you need:

An AWS Account
Access to login as a user or assume a role that can:
- Create and run a Step Functions workflow.
- Create and update a DynamoDB table.
- Create an AWS Identity and Access Management (IAM) role for the Step Functions to assume when running the workflow.
Familiarity with AWS Serverless Application Model (AWS SAM).
AWS SAM Command Line Interface installed.

Example walkthrough

Either-or pattern

To deploy the Either-or pattern, follow the deployment Instructions section in the GitHub repo. This deployment creates the following resources:

A Step Functions workflow.
An IAM role that is assumed by the Step Functions workflow during execution.

Navigate to the AWS CloudFormation page in the AWS Management Console and choose the stack with the name provided during deployment. Choose the State Machine resource in the Resources section of the CloudFormation stack to go to the Step Functions console. Choose Edit and then choose WorkflowStudio to see a visual representation of the workflow.

You can see the exported workflow in the GitHub repo. This is the logic of the workflow:

Either-or patter. Conceptual flow.

There are three (numbered) parallel flows in this workflow.
Flows #1 and #2 are the main parallel flows, one of which completing should move the control to outside the Parallel state.
Flow #3 is the time out flow so that the workflow can exit after a set amount of time if neither of the other two parallel flows complete by then.
Each of the two main parallel flows follows the following logic:
- Wait for the process to complete. This is a filler and can be replaced with your business logic on how to monitor process completion. This could be a human approval, or any other job that needs to finish.
- Once process is complete, throw a dummy error, which moves control to outside the parallel state.
The dummy errors for the two flows are caught outside the parallel state with corresponding catch condition.
The errors from the two flows need not be caught separately. You might just do the same action no matter which of the parallel flows finished, but I show separate steps in case you need to do something different based on which parallel flow finished.

To test the workflow, follow the instructions provided in the Testing section of the README file at the GitHub repo.

To clean up the resources created, run:

sam delete

Quorum pattern

To deploy the Quorum pattern, follow the Deployment Instructions section in the GitHub repo. This deployment creates the following resources:

A Step Functions workflow.
An IAM role that is assumed by the Step Functions workflow during execution.
A DynamoDB Table called “QuorumWorkflowTable”.

Navigate to CloudFormation in the AWS Management Console and choose the stack with the name provided during deployment. Choose the state machine resource in the Resources section of the CloudFormation stack to go to the Step Functions console.

Choose Edit and then choose WorkflowStudio to see a visual representation of the workflow.

You can see the the exported workflow in the GitHub repo. This is the logic of the workflow:

Quorum pattern. Conceptual flow.

The first step creates an entry in the DynamoDB table with the execution ID of the workflow’s execution. This item in the table tracks the completion of processes.
The next state is the parallel state, which has three parallel flows and a fourth time out flow. All the four flows are numbered.
Flow #1, #2, and #3 are the main parallel flows, two of which completing should move the control to outside the parallel state.
Flow #4 is the timeout flow so that the workflow can exit after a set amount of time, if neither of the other two parallel flows complete by then.
Each of the three main parallel flows uses the following logic:
- Wait for the process to complete.
- Once complete, update the DynamoDB table entry to mark the completion of the process.
- After the update, query the item from DynamoDB to get the list of processes that have completed and check if the quorum has been met.
- If the quorum has been met, raise an “Error” (which is actually a success criterion in terms of business case), to move the control to outside the parallel state.

To test the workflow, follow the instructions provided in the Testing section of the README file at the GitHub repo.

To clean up the resources created, run:

sam delete

Conclusion

This blog post shows how you can implement patterns that must exit early out of a parallel state in an AWS Step Functions workflow.

The use-cases for this approach are not limited to these two patterns. More complicated use-cases like having different combinations of conditions to exit a parallel state can all be implemented using parallel and fail states.

Visit Serverless Land for more Step Functions workflow patterns.

Decoupling event publishing with Amazon EventBridge Pipes

2023-07-11 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/decoupling-event-publishing-with-amazon-eventbridge-pipes/

This post is written by Gregor Hohpe, Sr. Principal Evangelist, Serverless.

Event-driven architectures (EDAs) help decouple applications or application components from one another. Consuming events makes a component less dependent on the sender’s location or implementation details. Because events are self-contained, they can be consumed asynchronously, which allows sender and receiver to follow independent timing considerations. Decoupling through events can improve development agility and operational resilience when building fine-grained distributed applications, which is the preferred style for serverless applications.

Many AWS services publish events through built-in mechanisms to support building event-driven architectures with a minimal amount of custom coding. Modern applications built on top of those services can also send and consume events based on their specific business logic. AWS application integration services like Amazon EventBridge or Amazon SNS, a managed publish-subscribe service, filter those events and route them to the intended destination, providing additional decoupling between event producer and consumer.

Publishing events

Custom applications that act as event producers often use the AWS SDK library, which is available for 12 programming languages, to send an event message. The application code constructs the event as a local data structure and specifies where to send it, for example to an EventBridge event bus.

The application code required to send an event to EventBridge is straightforward and only requires a few lines of code, as shown in this (simplified) helper method that publishes an order event generated by the application:

async function sendEvent(event) {
    const ebClient = new EventBridgeClient();
    const params = { Entries: [ {
          Detail: JSON.stringify(event),
          DetailType: "newOrder",
          Source: "orders",
          EventBusName: eventBusName
        } ] };
    return await ebClient.send(new PutEventsCommand(params));
}

An application most likely calls such a method in the context of another action, for example when persisting a received order to a data store. The code that performs those tasks might look as follows:

const order = { "orderId": "1234", "Items": ["One", "Two", "Three"] }
await writeToDb(order);
await sendEvent(order);

The code populates an order object with multiple line items (in reality, this would be based on data entered by a user or received via an API call), writes it to a database (via another helper method whose implementation isn’t shown), and then sends it to an EventBridge bus via the preceding method.

Code causes coupling

Although this code is not complex, it has drawbacks from an architectural perspective:

It interweaves application logic with the solution’s topology because the destination of the event, both in terms of service (EventBridge versus SNS, for example) and the instance (the service bus name in this case) are defined inside the application’s source code. If the event destination changes, you must change the source code, or at least know which string constant is passed to the function via an environment variable. Both aspects work against the EDA principle of minimizing coupling between components because changes in the communication structure propagate into the producer’s source code.
Sending the event after updating the database is brittle because it lacks transactional guarantees across both steps. You must implement error handling and retry logic to handle cases where sending the event fails, or even undo the database update. Writing such code can be tedious and error-prone.
Code is a liability. After all, that’s where bugs come from. In a real-life example, a helper method similar to preceding code erroneously swapped day and month on the event date, which led to a challenging debugging cycle. Boilerplate code to send events is therefore best avoided.

Performing event routing inside EventBridge can lessen the first concern. You could reconfigure EventBridge’s rules and targets to route events with a specified type and source to a different destination, provided you keep the event bus name stable. However, the other issues would remain.

Serverless: Less infrastructure, less code

AWS serverless integration services can alleviate the need to write custom application code to publish events altogether.

A key benefit of serverless applications is that you can let the AWS Cloud do the undifferentiated heavy lifting for you. Traditionally, we associate serverless with provisioning, scaling, and operating compute infrastructure so that developers can focus on writing code that generates business value.

Serverless application integration services can also take care of application-level tasks for you, including publishing events. Most applications store data in AWS data stores like Amazon Simple Storage Service (S3) or Amazon DynamoDB, which can automatically emit events whenever an update takes place, without any application code.

EventBridge Pipes: Events without application code

EventBridge Pipes allows you to create point-to-point integrations between event producers and consumers with optional transformation, filtering, and enrichment steps. Serverless integration services combined with cloud automation allow ”point-to-point” integrations to be more easily managed than in the past, which makes them a great fit for this use case.

This example takes advantage of EventBridge Pipes’ ability to fetch events actively from sources like DynamoDB Streams. DynamoDB Streams captures a time-ordered sequence of item-level modifications in any DynamoDB table and stores this information in a log for up to 24 hours. EventBridge Pipes picks up events from that log and pushes them to one of over 14 event targets, including an EventBridge bus, SNS, SQS, or API Destinations. It also accommodates batch sizes, timeouts, and rate limiting where needed.

The integration through EventBridge Pipes can replace the custom application code that sends the event, including any retry or error logic. Only the following code remains:

const order = { "orderId": "1234", "Items": ["One", "Two", "Three"] }
await writeToDb(order);

Automation as code

EventBridge Pipes can be configured from the CLI, the AWS Management Console, or from automation code using AWS CloudFormation or AWS CDK. By using AWS CDK, you can use the same programming language that you use to write your application logic to also write your automation code.

For example, the following CDK snippet configures an EventBridge Pipe to read events from a DynamoDB Stream attached to the Orders table and passes them to an EventBridge event bus.

This code references the DynamoDB table via the ordersTable variable that would be set when the table is created:

const pipe = new CfnPipe(this, 'pipe', {
  roleArn: pipeRole.roleArn,
  source: ordersTable.tableStreamArn!,
  sourceParameters: {
    dynamoDbStreamParameters: {
      startingPosition: 'LATEST'
    },
  },
  target: ordersEventBus.eventBusArn,
  targetParameters: {
    eventBridgeEventBusParameters: {
      detailType: 'order-details',
      source: 'my-source',
    },
  },
});

The automation code cleanly defines the dependency between the DynamoDB table and the event destination, independent of application logic.

Decoupling with data transformation

Coupling is not limited to event sources and destinations. A source’s data format can determine the event format and require downstream changes in case the data format or the data source change. EventBridge Pipes can also alleviate that consideration.

Events emitted from the DynamoDB Stream use the native, marshaled DynamoDB format that includes type information, such as an “S” for strings or “L” for lists.

For example, the order event in the DynamoDB stream from this example looks as follows. Some fields are omitted for readability:

{
  "detail": {
    "eventID": "be1234f372dd12552323a2a3362f6bd2",
    "eventName": "INSERT",
    "eventSource": "aws:dynamodb",
    "dynamodb": {
      "Keys": { "orderId": { "S": "ABCDE" } },
      "NewImage": { 
        "orderId": { "S": "ABCDE" },
        "items": {
            "L": [ { "S": "One" }, { "S": "Two" }, { "S": "Three" } ]
        }
      }
    }
  }
}

This format is not well suited for downstream processing because it would unnecessarily couple event consumers to the fact that this event originated from a DynamoDB Stream. EventBridge Pipes can convert this event into a more easily consumable format. The transformation is specified via an inputTemplate parameter using JSONPath expressions. EventBridge Pipes added support for list processing with wildcards proves to be perfect for this scenario.

In this example, add the following transformation template inside the target parameters to the preceding CDK code (the asterisk character matches a complete list of elements):

targetParameters: {
  inputTemplate: '{"orderId": <$.dynamodb.NewImage.orderId.S>,' + 
                 '"items": <$.dynamodb.NewImage.items.L[*].S>}'
}

This transformation formats the event published by EventBridge Pipes like a regular business event, decoupling any event consumer from the fact that it originated from a DynamoDB table:

{
  "time": "2023-06-01T06:18:10Z",
  "detail": {
    "orderId": "ABCDE",
    "items": ["One", "Two", "Three" ]
  }
}

Conclusion

When building event-driven applications, consider whether you can replace application code with serverless integration services to improve the resilience of your application and provide a clean separation between application logic and system dependencies.

EventBridge Pipes can be a helpful feature in these situations, for example to capture and publish events based on DynamoDB table updates.

Learn more about EventBridge Pipes at https://aws.amazon.com/eventbridge/pipes/ and discover additional ways to reduce serverless application code at https://serverlessland.com/refactoring-serverless. For a complete code example, see https://github.com/aws-samples/aws-refactoring-to-serverless.

Detecting and stopping recursive loops in AWS Lambda functions

2023-07-10 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/detecting-and-stopping-recursive-loops-in-aws-lambda-functions/

This post is written by Pawan Puthran, Principal Serverless Specialist TAM, Aneel Murari, Senior Serverless Specialist Solution Architect, and Shree Shrikhande, Senior AWS Lambda Product Manager.

AWS Lambda is announcing a recursion control to detect and stop Lambda functions running in a recursive or infinite loop.

At launch, this feature is available for Lambda integrations with Amazon Simple Queue Service (Amazon SQS), Amazon SNS, or when invoking functions directly using the Lambda invoke API. Lambda now detects functions that appear to be running in a recursive loop and drops requests after exceeding 16 invocations.

This can help reduce costs from unexpected Lambda function invocations because of recursion. You receive notifications about this action through the AWS Health Dashboard, email, or by configuring Amazon CloudWatch Alarms.

Overview

You can invoke Lambda functions in multiple ways. AWS services generate events that invoke Lambda functions, and Lambda functions can send messages to other AWS services. In most architectures, the service or resource that invokes a Lambda function should be different from the service or resource that the function outputs to. Because of misconfiguration or coding bugs, a function can send a processed event to the same service or resource that invokes the Lambda function, causing a recursive loop.

Lambda now detects the function running in a recursive loop between supported services, after exceeding 16 invocations. It returns a RecursiveInvocationException to the caller. There is no additional charge for this feature. For asynchronous invokes, Lambda sends the event to a dead-letter queue or on-failure destination, if one is configured.

The following is an example of an order processing system.

Order processing system

A new order information message is sent to the source SQS queue.
Lambda consumes the message from the source queue using an ESM.
The Lambda function processes the message and sends the updated orders message to a destination SQS queue using the SQS SendMessage API.
The source queue has a dead-letter queue(DLQ) configured for handling any failed or unprocessed messages.
Because of a misconfiguration, the Lambda function sends the message to the source SQS queue instead of the destination queue. This causes a recursive loop of Lambda function invocations.

To explore sample code for this example, see the GitHub repo.

In the preceding example, after 16 invocations, Lambda throws a RecursiveInvocationException to the ESM. The ESM stops invoking the Lambda function and, once the maxReceiveCount is exceeded, SQS moves the message to the source queues configured DLQ.

You receive an AWS Health Dashboard notification with steps to troubleshoot the function.

AWS Health Dashboard notification

You also receive an email notification to the registered email address on the account.

Email notification

Lambda emits a RecursiveInvocationsDropped CloudWatch metric, which you can view in the CloudWatch console.

RecursiveInvocationsDropped CloudWatch metric

How does Lambda detect recursion?

For Lambda to detect recursive loops, your function must use one of the supported AWS SDK versions or higher.

Lambda uses an AWS X-Ray trace header primitive called “Lineage” to track the number of times a function has been invoked with an event. When your function code sends an event using a supported AWS SDK version, Lambda increments the counter in the lineage header. If your function is then invoked with the same triggering event more than 16 times, Lambda stops the next invocation for that event. You do not need to configure active X-Ray tracing for this feature to work.

An example lineage header looks like:

X-Amzn-Trace-Id:Root=1-645f7998-4b1e232810b0bb733dba2eab;Parent=5be88d12eefc1fc0;Sampled=1;Lineage=43e12f0f:5

43e12f0f is the hash of a resource, in this case a Lambda function. 5 is the number of times this function has been invoked with the same event. The logic of hash generation, encoding, and size of the lineage header may change in the future. You should not design any application functionality based on this.

When using an ESM to consume messages from SQS, after the maxReceiveCount value is exceeded, the message is sent to the source queue’s configured DLQ. When Lambda detects a recursive loop and drops subsequent invocations, it returns a RecursiveInvocationException to the ESM. This increments the maxReceiveCount value. When the ESM auto retries to process events, based on the error handling configuration, these retries are not considered recursive invocations.

When using SQS, you can also batch multiple messages into one Lambda event. Where the message batch size is greater than 1, Lambda uses the maximum lineage value within the batch of messages. It drops the entire batch if the value exceeds 16.

Recursion detection in action

You can deploy a sample application example in the GitHub repo to test Lambda recursive loop detection. The application includes a Lambda function that reads from an SQS queue and writes messages back to the same SQS queue.

As prerequisites, you must install:

AWS Command Line Interface (AWS CLI)
AWS Serverless Application Model (AWS SAM) (version 1.81.1 or above)
Docker

To deploy the application:

1. Set your AWS Region:

export REGION=<your AWS region>

1. Clone the GitHub repository

git clone https://github.com/aws-samples/aws-lambda-recursion-detection-sample.git
cd aws-lambda-recursion-detection-sample

1. Use AWS SAM to build and deploy the resources to your AWS account. Enter a stack name, such as lambda-recursion, when prompted. Accept the remaining default values.

sam build –-use-container
sam deploy --guided --region $REGION

To test the application:

1. Save the name of the SQS queue in a local environment variable:

SOURCE_SQS_URL=$(aws cloudformation describe-stacks \ --region $REGION \ --stack-name lambda-recursion \ --query 'Stacks[0].Outputs[?OutputKey==`SourceSQSqueueURL`].OutputValue' --output text)

Publish a message to the source SQS queue:

aws sqs send-message --queue-url $SOURCE_SQS_URL --message-body '{"orderId":"111","productName":"Bolt","orderStatus":"Submitted"}' --region $REGION

This invokes the Lambda function, which writes the message back to the queue.

To verify that Lambda has detected the recursion:

Navigate to the CloudWatch console. Choose All Metrics under Metrics in the left-hand panel and search for RecursiveInvocationsDropped.

Find RecursiveInvocationsDropped.
Choose Lambda > By Function Name and choose RecursiveInvocationsDropped for the function you created. Under Graphed metrics, change the statistic to sum and Period to 1 minute. You see one record. Refresh if you don’t see the metric after a few seconds.

Metrics sum view

Actions to take when Lambda stops a recursive loop

When you receive a notification regarding recursion in your account, the following steps can help address the issue.

To stop further invoke attempts while you fix the underlying configuration issue, set the function concurrency to 0. This acts as an off switch for the Lambda function. You can choose the “Throttle” button in the Lambda console or use the PutFunctionConcurrency API to set the function concurrency to 0.
You can also disable or delete the event source mapping or trigger for the Lambda function.
Check your Lambda function code and configuration for any code defects that create loops. For example, check your environment variables to ensure you are not using the same SQS queue or SNS topic as source and target.
If an SQS Queue is the event source for your Lambda function, configure a DLQ on the source queue.
If an SNS topic is the event source, configure an On-Failure Destination for the Lambda function.

Disabling recursion detection

You may have valid use-cases where Lambda recursion is intentional as part of your design. In this case, use caution and implement suitable guardrails to prevent unexpected charges to your account. To learn more about best practices for using recursive invocation patterns, see Recursive patterns that cause run-away Lambda functions in the AWS Lambda Operator Guide.

This feature is turned on by default to stop recursive loops. To request turning it off for your account, reach out to AWS Support.

Conclusion

Lambda recursion control for SQS and SNS automatically detects and stops functions running in a recursive or infinite loop. This can be due to misconfiguration or coding errors. Recursion control helps reduce unexpected usage with Lambda and downstream services. The post also explains how Lambda detects and stops recursive loops and notifies you through AWS Health Dashboard to troubleshoot the function.

To learn more about the feature, visit the AWS Lambda Developer Guide.

For more serverless learning resources, visit Serverless Land

Understanding AWS Lambda’s invoke throttling limits

2023-07-08 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/understanding-aws-lambdas-invoke-throttle-limits/

This post is written by Archana Srikanta, Principal Engineer, AWS Lambda.

When you call AWS Lambda’s Invoke API, a series of throttle limits are evaluated to decide if your call is let through or throttled with a 429 “Too Many Requests” exception. This blog post explains the most common invoke throttle limits and the relationship between them, so you can better understand scaling workloads on Lambda.

Overview

The throttle limits exist to protect the following components of Lambda’s internal service architecture, and your workload, from noisy neighbors:

Execution environment: An execution environment is a Firecracker microVM where your function code runs. A given execution environment only hosts one invocation at a time, but it can be reused for subsequent invocations of the same function version.
Invoke data plane: These are a series of internal web services that, on an invoke, select (or create) a sandbox and route your request to it. This is also responsible for enforcing the throttle limits.

When you make an Invoke API call, it transits through some or all of the Invoke Data Plane services, before reaching an execution environment where your function code is downloaded and executed.

There are three distinct but related throttle limits which together decide if your invoke request is accepted by the data plane or throttled.

Concurrency

Concurrent means “existing, happening, or done at the same time”. Accordingly, the Lambda concurrency limit is a limit on the simultaneous in-flight invocations allowed at any given time. It is not a rate or transactions per second (TPS) limit in and of itself, but instead a limit on how many invocations can be inflight at the same time. This documentation visually explains the concept of concurrency.

Under the hood, the concurrency limit roughly translates to a limit on the maximum number of execution environments (and thus Firecracker microVMs) that your account can claim at any given point in time. Lambda runs a fleet of multi-tenant bare metal instances, on which Firecracker microVMs are carved out to serve as execution environments for your functions. AWS constantly monitors and scales this fleet based on incoming demand and shares the available capacity fairly among customers.

The concurrency limit helps protect Lambda from a single customer exhausting all the available capacity and causing a denial of service to other customers.

Transactions per second (TPS)

Customers often ask how their concurrency limit translates to TPS. The answer depends on how long your function invocations last.

The diagram above considers three cases, each with a different function invocation duration, but a fixed concurrency limit of 1000. In the first case, invocations have a constant duration of 1 second. This means you can initiate 1000 invokes and claim all 1000 execution environments permitted by your concurrency limit. These execution environments remain busy for the entire second, and you cannot start any more invokes in that second because your concurrency limit prevents you from claiming any more execution environments. So, the TPS you can achieve with a concurrency limit of 1000 and a function duration of 1 second is 1000 TPS.

In case 2, the invocation duration is halved to 500ms, with the same concurrency limit of 1000. You can initiate 1000 concurrent invokes at the start of the second as before. These invokes keep the execution environments busy for the first half of the second. Once finished, you can start an additional 1000 invokes against the same execution environments while still being within your concurrency limit. So, by halving the function duration, you doubled your TPS to 2000.

Similarly, in case 3, if your function duration is 100ms, you can initiate 10 rounds of 1000 invokes each in a second, achieving a TPS of 10K.

Codifying this as an equation, the TPS you can achieve given a concurrency limit is:

TPS = concurrency / function duration in seconds

Taken to an extreme, for a function duration of only 1ms and at a concurrency limit of 1000 (the default limit), an account can drive an invoke TPS of one million. For every additional unit of concurrency granted via a limit increase, it implicitly grants an additional 1000 TPS per unit of concurrency increased. The high TPS doesn’t require any additional execution environments (Firecracker microVMs), so it’s not problematic from a fleet capacity perspective. However, driving over a million TPS from a single account puts stress on the Invoke Data Plane services. They must be protected from noisy neighbor impact as well so all customers have a fair share of the services’ bandwidth. A concurrency limit alone isn’t sufficient to protect against this – the TPS limit provides this protection.

As of this writing, the invoke TPS is capped at 10 times your concurrency. Added to the previous equation:

TPS = min( 10 x concurrency, concurrency / function duration in seconds)

The concurrency factor is common across both terms in the min function, so the key comparison is:

min(10, 1 / function duration in seconds)

If the function duration is exactly 100ms (or 1/10th of a second), both terms in the min function are equal. If the function duration is over 100ms, the second term is lower and TPS is limited as per concurrency/function duration. If the function duration is under 100ms, the first term is lower and TPS is limited as per 10 x concurrency.

To summarize, the TPS limit exists to protect the Invoke Data Plane from the high churn of short-lived invocations, for which the concurrency limit alone affords too high of a TPS. If you drive short invocations of under 100ms, your throughput is capped as though the function duration is 100ms (at 10 x concurrency) as shown in the diagram above. This implies that short lived invocations may be TPS limited, rather than concurrency limited. However, if your function duration is over 100ms you can effectively ignore the 10 x concurrency TPS limit and calculate your available TPS as concurrency/function duration.

Burst

The third throttle limit is the burst limit. Lambda does not keep execution environments provisioned for your entire concurrency limit at all times. That would be wasteful, especially if usage peaks are transient, as is the case with many workloads. Instead, the service spins up execution environments just-in-time as the invoke arrives, if one doesn’t already exist. Once an execution environment is spun up, it remains “warm” for some period of time and is available to host subsequent invocations of the same function version.

However, if an invoke doesn’t find a warm execution environment, it experiences a “cold start” while we provision a new execution environment. Cold starts involve certain additional operations over and above the warm invoke path, such as downloading your code or container and initializing your application within the execution environment. These initialization operations are typically computationally heavy and so have a lower throughput compared to the warm invoke path. If there are sudden and steep spikes in the number of cold starts, it can put pressure on the invoke services that handle these cold start operations, and also cause undesirable side effects for your application such as increased latencies, reduced cache efficiency and increased fan out on downstream dependencies. The burst limit exists to protect against such surges of cold starts, especially for accounts that have a high concurrency limit. It ensures that the climb up to a high concurrency limit is gradual so as to smooth out the number of cold starts in a burst.

The algorithm used to enforce the burst limit is the Token Bucket rate-limiting algorithm. Consider a bucket that holds tokens. The bucket has a maximum capacity of B tokens (burst). The bucket starts full. Each time you send an invoke request that requires an additional unit of concurrency, it costs a token from the bucket. If the token exists, you are granted the additional concurrency and the token is removed from the bucket. The bucket is refilled at a constant rate of r tokens per minute (rate) until it reaches its maximum capacity.

What this means is that the rate of climb of concurrency is limited to r tokens per minute. Even though the algorithm allows you to collect up to B tokens and burst, you must wait for the bucket to refill before you can burst again, effectively limiting your average rate to r per minute.

The chart above shows the burst limit in action with a maximum concurrency limit of 3000, a maximum burst(B) of 1000 and a refill rate(r) of 500/minute. The token bucket starts full with 1000 tokens, as is the available burst headroom.

There is a burst activity between minute one and two, which consumes all tokens in the bucket and claims all 1000 concurrent execution environments allowed by the burst limit. At this point the bucket is empty and any attempt to claim additional concurrent execution environments is burst throttled, in spite of max concurrency not being reached yet.

The token bucket and the burst headroom are replenished at minutes two and three with 500 tokens each minute to bring it back up to its maximum capacity of 1000. At minute four, there is no additional refill because the bucket is at maximum capacity. Between minutes four and five, there is a second burst activity which empties the bucket again and claims an additional 1000 execution environments, bringing the total number of active execution environments to 2000.

The bucket continues to replenish at a rate of 500/minute at minutes five and six. At this point, sufficient tokens have been accumulated to cover the entire concurrency limit of 3000, and so the bucket isn’t refilled anymore even when you have the third burst activity at minute seven. At minute ten, when all the usage ramps down, the available burst headroom slowly stair steps back down to the maximum initial burst of 1K.

The actual numbers for maximum burst and refill rate vary by Region and are subject to change, please visit the Lambda burst limits page for specific values.

It is important to distinguish that the burst limit isn’t a rate limit on the invoke itself, but a rate limit on how quickly concurrency can rise. However, since invoke TPS is a function of concurrency, it also clamps how quickly TPS can rise (a rate limit for a rate limit). The following chart shows how the TPS burst headroom follows a similar stair step pattern as the concurrency burst headroom, only with a multiplier.

Conclusion

This blog explains three key throttle limits applied on Lambda invokes: the concurrency limit, TPS limit and burst limit. It outlines the relationship between these limits and how each one protects the system and your workload from noisy neighbors. Equipped with this knowledge you can better interpret any 429 throttling exceptions you may receive while scaling your applications on Lambda. For more information on getting started with Lambda visit the Developer Guide.

For more serverless learning resources, visit Serverless Land.

Implementing AWS Lambda error handling patterns

2023-07-06 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/implementing-aws-lambda-error-handling-patterns/

This post is written by Jeff Chen, Principal Cloud Application Architect, and Jeff Li, Senior Cloud Application Architect

Event-driven architectures are an architecture style that can help you boost agility and build reliable, scalable applications. Splitting an application into loosely coupled services can help each service scale independently. A distributed, loosely coupled application depends on events to communicate application change states. Each service consumes events from other services and emits events to notify other services of state changes.

Handling errors becomes even more important when designing distributed applications. A service may fail if it cannot handle an invalid payload, dependent resources may be unavailable, or the service may time out. There may be permission errors that can cause failures. AWS services provide many features to handle error conditions, which you can use to improve the resiliency of your applications.

This post explores three use-cases and design patterns for handling failures.

Overview

AWS Lambda, Amazon Simple Queue Service (Amazon SQS), Amazon Simple Notification Service (Amazon SNS), and Amazon EventBridge are core building blocks for building serverless event-driven applications.

The post Understanding the Different Ways to Invoke Lambda Functions lists the three different ways of invoking a Lambda function: synchronous, asynchronous, and poll-based invocation. For a list of services and which invocation method they use, see the documentation.

Lambda’s integration with Amazon API Gateway is an example of a synchronous invocation. A client makes a request to API Gateway, which sends the request to Lambda. API Gateway waits for the function response and returns the response to the client. There are no built-in retries or error handling. If the request fails, the client attempts the request again.

Lambda’s integration with SNS and EventBridge are examples of asynchronous invocations. SNS, for example, sends an event to Lambda for processing. When Lambda receives the event, it places it on an internal event queue and returns an acknowledgment to SNS that it has received the message. Another Lambda process reads events from the internal queue and invokes your Lambda function. If SNS cannot deliver an event to your Lambda function, the service automatically retries the same operation based on a retry policy.

Lambda’s integration with SQS uses poll-based invocations. Lambda runs a fleet of pollers that poll your SQS queue for messages. The pollers read the messages in batches and invoke your Lambda function once per batch.

You can apply this pattern in many scenarios. For example, your operational application can add sales orders to an operational data store. You may then want to load the sales orders to your data warehouse periodically so that the information is available for forecasting and analysis. The operational application can batch completed sales as events and place them on an SQS queue. A Lambda function can then process the events and load the completed sale records into your data warehouse.

If your function processes the batch successfully, the pollers delete the messages from the SQS queue. If the batch is not successfully processed, the pollers do not delete the messages from the queue. Once the visibility timeout expires, the messages are available again to be reprocessed. If the message retention period expires, SQS deletes the message from the queue.

The following table shows the invocation types and retry behavior of the AWS services mentioned.

AWS service example	Invocation type	Retry behavior
Amazon API Gateway	Synchronous	No built-in retry, client attempts retries.
Amazon SNS Amazon EventBridge	Asynchronous	Built-in retries with exponential backoff.
Amazon SQS	Poll-based	Retries after visibility timeout expires until message retention period expires.

There are a number of design patterns to use for poll-based and asynchronous invocation types to retain failed messages for additional processing. These patterns can help you recover from delivery or processing failures.

You can explore the patterns and test the scenarios by deploying the code from this repository which uses the AWS Cloud Development Kit (AWS CDK) using Python.

Lambda poll-based invocation pattern

When using Lambda with SQS, if Lambda isn’t able to process the message and the message retention period expires, SQS drops the message. Failure to process the message can be due to function processing failures, including time-outs or invalid payloads. Processing failures can also occur when the destination function does not exist, or has incorrect permissions.

You can configure a separate dead-letter queue (DLQ) on the source queue for SQS to retain the dropped message. A DLQ preserves the original message and is useful for analyzing root causes, handling error conditions properly, or sending notifications that require manual interventions. In the poll-based invocation scenario, the Lambda function itself does not maintain a DLQ. It relies on the external DLQ configured in SQS. For more information, see Using Lambda with Amazon SQS.

The following shows the design pattern when you configure Lambda to poll events from an SQS queue and invoke a Lambda function.

Lambda synchronously polling catches of messages from SQS

Lambda synchronously polling batches of messages from SQS

To explore this pattern, deploy the code in this repository. Once deployed, you can use this instruction to test the pattern with the happy and unhappy paths.

Lambda asynchronous invocation pattern

With asynchronous invokes, there are two failure aspects to consider when using Lambda. The event source cannot deliver the message to Lambda and the Lambda function errors when processing the event.

Event sources vary in how they handle failures delivering messages to Lambda. If SNS or EventBridge cannot send the event to Lambda after exhausting all their retry attempts, the service drops the event. You can configure a DLQ on an SNS topic or EventBridge event bus to hold the dropped event. This works in the same way as the poll-based invocation pattern with SQS.

Lambda functions may then error due to input payload syntax errors, duration time-outs, or the function throws an exception such as a data resource not available.

For asynchronous invokes, you can configure how long Lambda retains an event in its internal queue, up to 6 hours. You can also configure how many times Lambda retries when the function errors, between 0 and 2. Lambda discards the event when the maximum age passes or all retry attempts fail. To retain a copy of discarded events, you can configure either a DLQ or, preferably, a failed-event destination as part of your Lambda function configuration.

A Lambda destination enables you to specify what to do next if an asynchronous invocation succeeds or fails. You can configure a destination to send invocation records to SQS, SNS, EventBridge, or another Lambda function. Destinations are preferred for failure processing as they support additional targets and include additional information. A DLQ holds the original failed event. With a destination, Lambda also passes details of the function’s response in the invocation record. This includes stack traces, which can be useful for analyzing the root cause.

Using both a DLQ and Lambda destinations

You can apply this pattern in many scenarios. For example, many of your applications may contain customer records. To comply with the California Consumer Privacy Act (CCPA), different organizations may need to delete records for a particular customer. You can set up a consumer delete SNS topic. Each organization creates a Lambda function, which processes the events published by the SNS topic and deletes customer records in its managed applications.

The following shows the design pattern when you configure an SNS topic as the event source for a Lambda function, which uses destination queues for success and failure process.

SNS topic as event source for Lambda

You configure a DLQ on the SNS topic to capture messages that SNS cannot deliver to Lambda. When Lambda invokes the function, it sends details of the successfully processed messages to an on-success SQS destination. You can use this pattern to route an event to multiple services for simpler use cases. For orchestrating multiple services, AWS Step Functions is a better design choice.

Lambda can also send details of unsuccessfully processed messages to an on-failure SQS destination.

A variant of this pattern is to replace an SQS destination with an EventBridge destination so that multiple consumers can process an event based on the destination.

To explore how to use an SQS DLQ and Lambda destinations, deploy the code in this repository. Once deployed, you can use this instruction to test the pattern with the happy and unhappy paths.

Using a DLQ

Although destinations is the preferred method to handle function failures, you can explore using DLQs.

The following shows the design pattern when you configure an SNS topic as the event source for a Lambda function, which uses SQS queues for failure process.

Lambda invoked asynchonously

You configure a DLQ on the SNS topic to capture the messages that SNS cannot deliver to the Lambda function. You also configure a separate DLQ for the Lambda function. Lambda saves an unsuccessful event to this DLQ after Lambda cannot process the event after maximum retry attempts.

To explore how to use a Lambda DLQ, deploy the code in this repository. Once deployed, you can use this instruction to test the pattern with happy and unhappy paths.

Conclusion

This post explains three patterns that you can use to design resilient event-driven serverless applications. Error handling during event processing is an important part of designing serverless cloud applications.

You can deploy the code from the repository to explore how to use poll-based and asynchronous invocations. See how poll-based invocations can send failed messages to a DLQ. See how to use DLQs and Lambda destinations to route and handle unsuccessful events.

Learn more about event-driven architecture on Serverless Land.

Serverless ICYMI Q2 2023

2023-07-05 Benjamin Smith

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/serverless-icymi-q2-2023/

Welcome to the 22^nd edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share all the most recent product launches, feature enhancements, blog posts, webinars, live streams, and other interesting things that you might have missed!

In case you missed our last ICYMI, check out what happened last quarter here.

Serverless Innovation Day

AWS recently hosted the Serverless Innovation Day, a day of live streams that showcased AWS serverless technologies such as AWS Lambda, Amazon ECS with AWS Fargate, Amazon EventBridge, and AWS Step Functions. The event included insights from AWS leaders such as Holly Mesrobian, Ajay Nair, and Usman Khalid, as well as prominent customers and our serverless Developer Advocate team. It provided insights into serverless modernization success stories, use cases, and best practices. If you missed the event, you can catch up on the recorded sessions here.

Serverless Land, your go-to resource for all things serverless, expanded to include a new Serverless Testing section. This provides valuable insights, patterns, and best practices for testing integrations using AWS SAM and CDK templates.

Serverless Land also launched a new learning page featuring a collection of resources, including blog posts, videos, workshops, and training materials, allowing users to choose a learning path from a variety of topics. “EventBridge Visuals“, small, easily digestible visuals focused on EventBridge have also been added.

AWS Lambda

Lambda introduced support for response payload streaming allowing functions to progressively stream response data to clients. This feature significantly improves performance by reducing the time to first byte (TTFB) latency, benefiting web and mobile applications.

Response streaming is particularly useful for applications with large payloads such as images, videos, documents, or database results. It eliminates the need to buffer the entire payload in memory and enables the transfer of responses larger than Lambda’s 6 MB limit, up to a soft limit of 20 MB.

By configuring the Function URL to use the InvokeWithResponseStream API, streaming responses can be accessed through an HTTP client that supports incremental response data. This enhancement expands Lambda’s capabilities, allowing developers to handle larger payloads more efficiently and enhance the overall performance and user experience of their web and mobile applications.

Lambda now supports Java 17 with Amazon Corretto distribution, providing long-term support and improved performance. Java 17 introduces new language features like records, sealed classes, and multi-line strings. The runtime uses ZGC and Shenandoah garbage collectors to reduce latency. Default JVM configuration changes optimize tiered compilation for reduced startup latency. Developers can use Java 17 in Lambda through AWS Management Console, AWS SAM, and AWS CDK. Popular frameworks like Spring Boot 3 and Micronaut 4 require Java 17 as a minimum. Micronaut provides a web service to generate example projects using Java 17 and AWS CDK infrastructure.

Lambda now supports the Ruby 3.2 runtime, enabling you to write serverless functions using the latest version of the Ruby programming language. This update enhances developer productivity and brings new features and improvements to your Ruby-based Lambda functions.

Lambda introduced support for Kafka and Amazon MQ event sources in four additional Regions. This expanded availability allows developers to build event-driven architectures using these messaging systems in more regions around the world, providing greater flexibility and scalability. It also supports Kafka and Amazon MQ event sources in AWS GovCloud (US) Regions, allowing government organizations to leverage the benefits of event-driven architectures in their cloud environments.

Lambda also added support for starting from a specific timestamp for Kafka event sources, allowing for precise message processing and useful scenarios like Disaster Recovery, without any additional charges.

Serverless Land has launched new learning paths for Lambda to help you level up your serverless skills:

The Java Replatforming learning path guides Java developers through the process of migrating existing Java applications to a serverless architecture.
The Lift and Shift to Serverless learning path provides guidance on migrating traditional applications to a serverless environment.
Lambda Fundamentals is a 23-part video series providing practical examples and tips to help you get started with serverless development using Lambda.

The new AWS Tech Talk, Best practices for building interactive applications with AWS Lambda, helps you learn best practices and architectural patterns for building web and mobile backends as well as API-driven microservices on Lambda. Explore how to take advantage of features in Lambda, Amazon API Gateway, Amazon DynamoDB, and more to easily build highly scalable serverless web applications.

AWS Step Functions

The latest update to AWS Step Functions introduces versions and aliases, allows users to run specific state machine revisions, ensuring reliable deployments, reducing risks, and providing version visibility. Appending version numbers to the state machine ARN enables selection of desired versions, even after updates. Aliases distribute execution requests based on weights, supporting incremental deployment patterns.

This enhances confidence in state machine updates, improves observability, auditing, and can be managed through the Step Functions console or AWS CloudFormation. Versions and aliases are available in all supported AWS Regions at no extra cost.

AWS SAM

AWS SAM CLI has introduced a new feature called remote invoke that allows developers to test Lambda functions in the AWS Cloud. This feature enables developers to invoke Lambda functions from their local development environment and provides options for event payloads, output formats, and logging.

It can be used with or without AWS SAM and can be combined with AWS SAM Accelerate for streamlined development and testing. Overall, the remote invoke feature simplifies serverless application testing in the AWS Cloud.

Amazon EventBridge

EventBridge announced an open-source connector for Kafka Connect, providing seamless integration between EventBridge and Kafka Connect. This connector simplifies the process of streaming events from Kafka topics to EventBridge, enabling you to build event-driven architectures with ease.

EventBridge has improved end-to-end latencies for event buses, delivering events up to 80% faster. This enables broader use in latency-sensitive applications such as industrial and medical applications, with the lower latencies applied by default across all AWS Regions at no extra cost.

Amazon Aurora Serverless v2

Amazon Aurora Serverless v2 is now available in four additional Regions, expanding the reach of this scalable and cost-effective serverless database option. With Aurora Serverless v2, you can benefit from automatic scaling, pause-and-resume capability, and pay-per-use pricing, enabling you to optimize costs and manage your databases more efficiently.

Amazon SNS

Amazon SNS now supports message data protection in five additional Regions, ensuring the security and integrity of your message payloads. With this feature, you can encrypt sensitive message data at rest and in transit, meeting compliance requirements and safeguarding your data.

Serverless Blog Posts

April 2023

Apr 27 – AWS Lambda now supports Java 17

Apr 27 – Optimizing Amazon EC2 Spot Instances with Spot Placement Scores

Apr 26 – Building private serverless APIs with AWS Lambda and Amazon VPC Lattice

Apr 25 – Implementing error handling for AWS Lambda asynchronous invocations

Apr 20 – Understanding techniques to reduce AWS Lambda costs in serverless applications

Apr 18 – Python 3.10 runtime now available in AWS Lambda

Apr 13 – Optimizing AWS Lambda extensions in C# and Rust

Apr 7 – Introducing AWS Lambda response streaming

May 2023

May 24 – Developing a serverless Slack app using AWS Step Functions and AWS Lambda

May 11 – Automating stopping and starting Amazon MWAA environments to reduce cost

May 10 – Monitor Amazon SNS-based applications end-to-end with AWS X-Ray active tracing

May 10 – Debugging SnapStart-enabled Lambda functions made easy with AWS X-Ray

May 10 – Implementing cross-account CI/CD with AWS SAM for container-based Lambda functions

May 3 – Extending a serverless, event-driven architecture to existing container workloads

May 3 – Patterns for building an API to upload files to Amazon S3

June 2023

Jun 7 – Ruby 3.2 runtime now available in AWS Lambda

Jun 5 – Implementing custom domain names for Amazon API Gateway private endpoints using a reverse proxy

June 22 – Deploying state machines incrementally with versions and aliases in AWS Step Functions

June 22 – Testing AWS Lambda functions with AWS SAM remote invoke

Videos

Serverless Office Hours – Tues 10AM PT

Weekly live virtual office hours. In each session we talk about a specific topic or technology related to serverless and open it up to helping you with your real serverless challenges and issues.

YouTube: youtube.com/serverlessland
Twitch: twitch.tv/aws

LinkedIn: linkedin.com/company/serverlessland

April 2023

Apr 4 – Serverless AI with ChatGPT and DALL-E

Apr 11 – Building Java apps with AWS SAM

Apr 18 – Managing EventBridge with Kubernetes

Apr 25 – Lambda response streaming

May 2023

May 2 – Automating your life with serverless

May 9 – Building real-life asynchronous architectures

May 16 – Testing Serverless Applications

May 23 – Build faster with Amazon CodeCatalyst

May 30 – Serverless networking with VPC Lattice

June 2023

June 6 – AWS AppSync: Private APIs and Merged APIs

June 13 – Integrating EventBridge and Kafka

June 20 – AWS Copilot for serverless containers

June 27 – Serverless high performance modeling

FooBar Serverless YouTube channel

April 2023

Apr 6 – Designing a DynamoDB Table in 4 Steps: From Entities to Access Patterns

Apr 14 – Amazon CodeWhisperer – Improve developer productivity using machine learning (ML)

Apr 20 – Beginner’s Guide to DynamoDB with AWS CDK: Step-by-Step Tutorial for provisioning NoSQL Databases

Apr 27 – Build a WebApp that uses DynamoDB in 6 steps | DynamoDB Expressions

May 2023

May 4 – How to Migrate Data to DynamoDB?

May 11 – Load Testing DynamoDB: Observability and Performance tuning

May 18 – DynamoDB Streams – THE most powerful feature from DynamoDB for event-driven applications

May 25 – Track Application Events with DynamoDB streams and Email Notifications using EventBridge Pipes

June 2023

Jun 1 – How to filter messages based on the payload using Amazon SNS

June 8 – Getting started with Amazon Kinesis

Still looking for more?

The Serverless landing page has more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials.

You can also follow the Serverless Developer Advocacy team on Twitter to see the latest news, follow conversations, and interact with the team.

Eric Johnson: @edjgeek
James Beswick: @jbesw
Ben Smith: @benjamin_l_s
Julian Wood: @julian_wood
Marcia Villalba: @mavi888uy
David Boyne: @boyney123

Retrieving parameters and secrets with Powertools for AWS Lambda (TypeScript)

2023-06-29 Pascal Vogel

Post Syndicated from Pascal Vogel original https://aws.amazon.com/blogs/compute/retrieving-parameters-and-secrets-with-powertools-for-aws-lambda-typescript/

This post is written by Andrea Amorosi, Senior Solutions Architect and Pascal Vogel, Solutions Architect.

When building serverless applications using AWS Lambda, you often need to retrieve parameters, such as database connection details, API secrets, or global configuration values at runtime. You can make these parameters available to your Lambda functions via secure, scalable, and highly available parameter stores, such as AWS Systems Manager Parameter Store or AWS Secrets Manager.

The Parameters utility for Powertools for AWS Lambda (TypeScript) simplifies the integration of these parameter stores inside your Lambda functions. The utility provides high-level functions for retrieving secrets and parameters, integrates caching and transformations, and reduces the amount of boilerplate code you must write.

The Parameters utility supports the following parameter stores:

AWS Systems Manager Parameter Store
AWS Secrets Manager
AWS AppConfig
Amazon DynamoDB
Custom parameter store providers

The Parameters utility is part of the Powertools for AWS Lambda (TypeScript), which you can use in both JavaScript and TypeScript code bases. Implementing guidance from the Serverless Applications Lens of the AWS Well-Architected Framework, Powertools provides utilities to ease the adoption of best practices such as distributed tracing, structured logging, and asynchronous business and application metrics.

For more details, see the Powertools for AWS Lambda (TypeScript) documentation on GitHub and the introduction blog post.

This blog post shows how to use the new Parameters utility to retrieve parameters and secrets in your JavaScript and TypeScript Lambda functions securely.

Getting started with the Parameters utility

Initial setup

The Powertools toolkit is modular, meaning that you can install the Parameters utility independently from the Logger, Tracing, or Metrics packages. Install the Parameters utility library in your project via npm:

npm install @aws-lambda-powertools/parameters

In addition, you must add the AWS SDK client for the parameter store you are planning to use. The Parameters utility supports AWS SDK v3 for JavaScript only, which allows the utility to be modular. You install only the needed SDK packages to keep your bundle size small.

Next, assign appropriate AWS Identity and Access Management (IAM) permissions to the Lambda function execution role of your Lambda function that allow retrieving parameters from the parameter store.

The following sections illustrate how to perform the previously mentioned steps for some typical parameter retrieval scenarios.

Retrieving a single parameter from SSM Parameter Store

To retrieve parameters from SSM Parameter Store, install the AWS SDK client for SSM in addition to the Parameters utility:

npm install @aws-sdk/client-ssm

To retrieve an individual parameter, the Parameters utility provides the getParameter function:

import { getParameter } from '@aws-lambda-powertools/parameters/ssm';

export const handler = async (): Promise<void> => {
  // Retrieve a single parameter
  const parameter = await getParameter('/my/parameter');
  console.log(parameter);
};

Finally, you need to assign an IAM policy with the ssm:GetParameter permission to your Lambda function execution role. Apply the principle of least privilege by scoping the permission to the specific parameter resource as shown in the following policy example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ssm:GetParameter"
      ],
      "Resource": [
        "arn:aws:ssm:AWS_REGION:AWS_ACCOUNT_ID:my/parameter"
      ]
    }
  ]
}

Adjusting cache TTL

By default, the retrieved parameters are cached in-memory for 5 seconds. This cached value is used for further invocations of the Lambda function until it expires. If your application requires a different behavior, the Parameters utility allows you to adjust the time-to-live (TTL) via the maxAge argument.

Building on the previous example, if you want to cache your retrieved parameter for 30 instead of 5 seconds, you can adapt your function code as follows:

import { getParameter } from '@aws-lambda-powertools/parameters/ssm';

export const handler = async (): Promise<void> => {
  // Retrieve a single parameter with a 30 seconds cache TTL
  const parameter = await getParameter('/my/parameter', { maxAge: 30 });
  console.log(parameter);
};

In other cases, you may want to always retrieve the latest value from the parameter store and ignore any cached value. To achieve this, set the forceFetch parameter to true:

import { getParameter } from '@aws-lambda-powertools/parameters/ssm';

export const handler = async (): Promise<void> => {
  // Always retrieve the latest value of a single parameter
  const parameter = await getParameter('/my/parameter', { forceFetch: true });
  console.log(parameter);
};

For details, see Always fetching the latest in the Powertools for AWS Lambda (TypeScript) documentation.

Decoding parameters stored in JSON or base64 format

If some of your parameters are stored in base64 or JSON, you can deserialize them via the Parameters utility’s transform argument.

Considering a parameter stored in SSM as JSON, it can be retrieved and deserialized as follows:

import { Transform } from '@aws-lambda-powertools/parameters';
import { getParameter } from '@aws-lambda-powertools/parameters/ssm';

export const handler = async (): Promise => {
  // Retrieve and deserialize a single JSON parameter
  const valueFromJson = await getParameter('/my/json/parameter', { transform: Transform.JSON });
  console.log(valueFromJson);
};

The Parameters utility supports the transform argument for all parameter store providers and high-level functions. For details, see Deserializing values with transform parameters.

Working with encrypted parameters in SSM Parameter Store

SSM Parameter Store supports encrypted secure string parameters via the AWS Key Management Service (AWS KMS). The Parameters utility allows you to retrieve these encrypted parameters by adding the decrypt argument to your request.

For example, you could retrieve an encrypted parameter as follows:

import { getParameter } from '@aws-lambda-powertools/parameters/ssm';

export const handler = async (): Promise<void> => {
  // Decrypt the parameter
  const decryptedParameter = await getParameter('/my/encrypted/parameter', { decrypt: true });
  console.log(decryptedParameter);
};

In this case, the Lambda function execution role needs to have the kms:Decrypt IAM permission in addition to ssm:GetParameter.

Retrieving multiple parameters from SSM Parameter Store

Besides retrieving a single parameter using getParameter, you can also use getParameters to recursively retrieve multiple parameters under a SSM Parameter Store path, or getParametersByName to retrieve multiple distinct parameters by their full name.

You can also apply custom caching, transform, or decrypt configurations per parameter when using getParametersByName. The following example retrieves three distinct parameters from SSM Parameter Store with different caching and transform configurations:

import { getParametersByName } from '@aws-lambda-powertools/parameters/ssm';
import type {
  SSMGetParametersByNameOptionsInterface
} from '@aws-lambda-powertools/parameters/ssm/types';

const props: Record<string, SSMGetParametersByNameOptionsInterface> = {
  '/develop/service/commons/telemetry/config': { maxAge: 300, transform: 'json' },
  '/no_cache_param': { maxAge: 0 },
  '/develop/service/payment/api/capture/url': {}, // When empty or undefined, it uses default values
};

export const handler = async (): Promise<void> => {
  // This returns an object with the parameter name as key
  const parameters = await getParametersByName(props);
  for (const [ key, value ] of Object.entries(parameters)) {
    console.log(`${key}: ${value}`);
  }
};

Retrieving multiple parameters requires the GetParameter and GetParameters permissions to be present in the Lambda function execution role.

Retrieving secrets from Secrets Manager

To securely store sensitive parameters such as passwords or API keys for external services, Secrets Manager is a suitable option. To retrieve secrets from Secrets Manager using the Parameters utility, install the AWS SDK client for Secrets Manager in addition to the Parameters utility:

npm install @aws-sdk/client-secrets-manager

Now you can access a secret using its key as follows:

import { getSecret } from '@aws-lambda-powertools/parameters/secrets';

export const handler = async (): Promise<void> => {
  // Retrieve a single secret
  const secret = await getSecret('my-secret');
  console.log(secret);
};

Getting a secret from Secrets Manager requires you to add the secretsmanager:GetSecretValue IAM permission to your Lambda function execution role.

Retrieving an application configuration from AppConfig

If you plan to leverage feature flags or dynamic application configurations in your applications built on Lambda, AppConfig is a suitable option. The Parameters utility makes it easy to fetch configurations from AppConfig while benefitting from utility features such as caching and transformations.

For example, considering an AppConfig application called my-app with an environment called my-env, you can retrieve its configuration profile my-configuration as follows:

import { getAppConfig } from '@aws-lambda-powertools/parameters/appconfig';

export const handler = async (): Promise<void> => {
  // Retrieve a configuration, latest version
  const config = await getAppConfig('my-configuration', {
    environment: 'my-env',
    application: 'my-app'
  });
  console.log(config);
};

Retrieving a configuration requires both the appconfig:GetLatestConfiguration and appconfig:StartConfigurationSession IAM permissions to be attached to the Lambda function execution role.

Retrieving a parameter from a DynamoDB table

DynamoDB’s low latency and high flexibility make it a great option for storing parameters. To use DynamoDB as a parameter store via the Parameters utility, install the DynamoDB AWS SDK client and utility package in addition to the Parameters utility.

npm install @aws-sdk/client-dynamodb @aws-sdk/util-dynamodb

By default, the Parameters utility expects the DynamoDB table containing the parameters to have a partition key of id and an attribute called value. For example, assuming an item with an id of my-parameter and a value of my-value stored in an DynamoDB table called my-table, you can retrieve it as follows:

import { DynamoDBProvider } from '@aws-lambda-powertools/parameters/dynamodb';

const dynamoDBProvider = new DynamoDBProvider({ tableName: 'my-table' });

export const handler = async (): Promise<void> => {
  // Retrieve a value from DynamoDB
  const value = await dynamoDBProvider.get('my-parameter');
  console.log(value);
};

In case of retrieving a single parameter from DynamoDB, the Lambda function execution role needs to have the dynamodb:GetItem IAM permission.

The Parameters utility DynamoDB provider can also retrieve multiple parameters from a table with a single request via a DynamoDB query. See DynamoDB provider in the Powertools for AWS Lambda (TypeScript) documentation for details.

Conclusion

This blog post introduces the Powertools for AWS Lambda (TypeScript) Parameters utility and demonstrates how it is used with different parameter stores. The Parameters utility allows you to retrieve secrets and parameters in your Lambda function from SSM Parameter Store, Secrets Manager, AppConfig, DynamoDB, and custom parameter stores. By using the utility, you get access to functionality such as caching and transformation, and reduce the amount of boilerplate code you need to write for your Lambda functions.

To learn more about the Parameters utility and its full set of functionality, take a look at the Powertools for AWS Lambda (TypeScript) documentation.

Share your feedback for Powertools for AWS Lambda (TypeScript) by opening a GitHub issue.

For more serverless learning resources, visit Serverless Land.

Implementing AWS Well-Architected best practices for Amazon SQS – Part 3

2023-06-27 Pascal Vogel

Post Syndicated from Pascal Vogel original https://aws.amazon.com/blogs/compute/implementing-aws-well-architected-best-practices-for-amazon-sqs-part-3/

This blog is written by Chetan Makvana, Senior Solutions Architect and Hardik Vasa, Senior Solutions Architect.

This is the third part of a three-part blog post series that demonstrates best practices for Amazon Simple Queue Service (Amazon SQS) using the AWS Well-Architected Framework.

This blog post covers best practices using the Performance Efficiency Pillar, Cost Optimization Pillar, and Sustainability Pillar. The inventory management example introduced in part 1 of the series will continue to serve as an example.

See also the other two parts of the series:

Performance Efficiency Pillar

The Performance Efficiency Pillar includes the ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve. It recommends best practices to use trade-offs to improve performance, such as learning about design patterns and services and identify how tradeoffs impact customers and efficiency.

By adopting these best practices, you can optimize the performance of SQS by employing appropriate configurations and techniques while considering trade-offs for the specific use case.

Best practice: Use action batching or horizontal scaling or both to increase throughput

For achieving high throughput in SQS, optimizing the performance of your message processing is crucial. You can use two techniques: horizontal scaling and action batching.

When dealing with high message volume, consider horizontally scaling the message producers and consumers by increasing the number of threads per client, by adding more clients, or both. By distributing the load across multiple threads or clients, you can handle a high number of messages concurrently.

Action batching distributes the latency of the batch action over the multiple messages in a batch request, rather than accepting the entire latency for a single message. Because each round trip carries more work, batch requests make more efficient use of threads and connections, improving throughput. You can combine batching with horizontal scaling to provide throughput with fewer threads, connections, and requests than individual message requests.

In the inventory management example that we introduced in part 1, this scaling behavior is managed by AWS for the AWS Lambda function responsible for backend processing. When a Lambda function subscribes to an SQS queue, Lambda polls the queue as it waits for the inventory updates requests to arrive. Lambda consumes messages in batches, starting at five concurrent batches with five functions at a time. If there are more messages in the queue, Lambda adds up to 60 functions per minute, up to 1,000 functions, to consume those messages.

This means that Lambda can scale up to 1,000 concurrent Lambda functions processing messages from the SQS queue. Batching enables the inventory management system to handle a high volume of inventory update messages efficiently. This ensures real-time visibility into inventory levels and enhances the accuracy and responsiveness of inventory management operations.

Best practice: Trade-off between SQS standard and First-In-First-Out (FIFO) queues

SQS supports two types of queues: standard queues and FIFO queues. Understanding the trade-offs between SQS standard and FIFO queues allows you to make an informed choice that aligns with your application’s requirements and priorities. While SQS standard queues support a nearly unlimited throughput, it sacrifices strict message ordering and occasionally delivers messages in an order different from the one they were sent in. If maintaining the exact order of events is not critical for your application, utilizing SQS standard queues can provide significant benefits in terms of throughput and scalability.

On the other hand, SQS FIFO queues guarantee message ordering and exactly-once processing. This makes them suitable for applications where maintaining the order of events is crucial, such as financial transactions or event-driven workflows. However, FIFO queues have a lower throughput compared to standard queues. They can handle up to 3,000 transactions per second (TPS) per API method with batching, and 300 TPS without batching. Consider using FIFO queues only when the order of events is important for the application, otherwise use standard queues.

In the inventory management example, since the order of inventory records is not crucial, the potential out-of-order message delivery that can occur with SQS standard queues is unlikely to impact the inventory processing. This allows you to take advantage of the benefits provided by SQS standard queues, including their ability to handle a high number of transactions per second.

Cost Optimization Pillar

The Cost Optimization Pillar includes the ability to run systems to deliver business value at the lowest price. It recommends best practices to build and operate cost-aware workloads that achieve business outcomes while minimizing costs and allowing your organization to maximize its return on investment.

Best practice: Configure cost allocation tags for SQS to organize and identify SQS for cost allocation

A well-defined tagging strategy plays a vital role in establishing accurate chargeback or showback models. By assigning appropriate tags to resources, such as SQS queues, you can precisely allocate costs to different teams or applications. This level of granularity ensures fair and transparent cost allocation, enabling better financial management and accountability.

In the inventory management example, tagging the SQS queue allows for specific cost tracking under the Inventory department, enabling a more accurate assessment of expenses. The following code snippet shows how to tag the SQS queue using AWS Could Development Kit (AWS CDK).

# Create the SQS queue with DLQ setting
queue = sqs.Queue(
    self,
    "InventoryUpdatesQueue",
    visibility_timeout=Duration.seconds(300),
)

Tags.of(queue).add("department", "inventory")

Best practice: Use long polling

SQS offers two methods for receiving messages from a queue: short polling and long polling. By default, queues use short polling, where the ReceiveMessage request queries a subset of servers to identify available messages. Even if the query found no messages, SQS sends the response right away.

In contrast, long polling queries all servers in the SQS infrastructure to check for available messages. SQS responds only after collecting at least one message, respecting the specified maximum. If no messages are immediately available, the request is held open until a message becomes available or the polling wait time expires. In such cases, an empty response is sent.

Short polling provides immediate responses, making it suitable for applications that require quick feedback or near-real-time processing. On the other hand, long polling is ideal when efficiency is prioritized over immediate feedback. It reduces API calls, minimizes network traffic, and improves resource utilization, leading to cost savings.

In the inventory management example, long polling enhances the efficiency of processing inventory updates. It collects and retrieves available inventory update messages in a batch of 10, reducing the frequency of API requests. This batching approach optimizes resource utilization, minimizes network traffic, and reduces excessive API consumption, resulting in cost savings. You can configure this behavior using batch size and batch window:

# Add the SQS queue as a trigger to the Lambda function
sqs_to_dynamodb_function.add_event_source_mapping(
    "MyQueueTrigger", event_source_arn=queue.queue_arn, batch_size=10
)

Best practice: Use batching

Batching messages together allows you to send or retrieve multiple messages in a single API call. This reduces the number of API requests required to process or retrieve messages compared to sending or retrieving messages individually. Since SQS pricing is based on the number of API requests, reducing the number of requests can lead to cost savings.

To send, receive, and delete messages, and to change the message visibility timeout for multiple messages with a single action, use Amazon SQS batch API actions. This also helps with transferring less data, effectively reducing the associated data transfer costs, especially if you have many messages.

In the context of the inventory management example, the CSV processing Lambda function groups 10 inventory records together in each API call, forming a batch. By doing so, the number of API requests is reduced by a factor of 10 compared to sending each record separately. This approach optimizes the utilization of API resources, streamlines message processing, and ultimately contributes to cost efficiency. Following is the code snippet from the CSV processing Lambda function showcasing the use of SendMessageBatch to send 10 messages with a single action.

# Parse the CSV records and send them to SQS as batch messages
csv_reader = csv.DictReader(csv_content.splitlines())
message_batch = []
for row in csv_reader:
    # Convert the row to JSON
    json_message = json.dumps(row)

    # Add the message to the batch
    message_batch.append(
        {"Id": str(len(message_batch) + 1), "MessageBody": json_message}
    )

    # Send the batch of messages when it reaches the maximum batch size (10 messages)
    if len(message_batch) == 10:
        sqs_client.send_message_batch(QueueUrl=queue_url, Entries=message_batch)
        message_batch = []
        print("Sent messages in batch")

Best practice: Use temporary queues

In case of short-lived, lightweight messaging with synchronous two-way communication, you can use temporary queues . The temporary queue makes it easy to create and delete many temporary messaging destinations without inflating your AWS bill. The key concept behind this is the virtual queue. Virtual queues let you multiplex many low-traffic queues onto a single SQS queue. Creating a virtual queue only instantiates a local buffer to hold messages for consumers as they arrive; there is no API call to SQS, and no costs associated with creating a virtual queue.

The inventory management example does not use temporary queues. However, in use cases that involve short-lived, lightweight messaging with synchronous two-way communication, adopting the best practice of using temporary queues and virtual queues can enhance the overall efficiency, reduce costs, and simplify the management of messaging destinations.

Sustainability Pillar

The Sustainability Pillar provides best practices to meet sustainability targets for your AWS workloads. It encompasses considerations related to energy efficiency and resource optimization.

Best practice: Use long polling

Besides its cost optimization benefits explained as part of the Cost Optimization Pillar, long polling also plays a crucial role in improving resource efficiency by reducing API requests, minimizing network traffic, and optimizing resource utilization.

By collecting and retrieving available messages in a batch, long polling reduces the frequency of API requests, resulting in improved resource utilization and minimized network traffic. By reducing excessive API consumption through long polling, you can effectively use resources. It collects and retrieves messages in batches, reducing excessive API consumption and unnecessary network traffic.

By reducing API calls, it optimizes data transfer and infrastructure operations. Additionally, long polling’s batching approach optimizes resource allocation, utilizing system resources more efficiently and improving energy efficiency. This enables the inventory management system to handle high message volumes effectively while operating in a cost-efficient and resource-efficient manner.

Conclusion

This blog post explores best practices for SQS using the Performance Efficiency Pillar, Cost Optimization Pillar, and Sustainability Pillar of the AWS Well-Architected Framework. We cover techniques such as batch processing, message batching, and scaling considerations. We also discuss important considerations, such as resource utilization, minimizing resource waste, and reducing cost.

This three-part blog post series covers a wide range of best practices, spanning the Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability Pillars of the AWS Well-Architected Framework. By following these guidelines and leveraging the power of the AWS Well-Architected Framework, you can build robust, secure, and efficient messaging systems using SQS.

For more serverless learning resources, visit Serverless Land.

Implementing AWS Well-Architected best practices for Amazon SQS – Part 2

2023-06-27 Pascal Vogel

Post Syndicated from Pascal Vogel original https://aws.amazon.com/blogs/compute/implementing-aws-well-architected-best-practices-for-amazon-sqs-part-2/

This blog is written by Chetan Makvana, Senior Solutions Architect and Hardik Vasa, Senior Solutions Architect.

This is the second part of a three-part blog post series that demonstrates implementing best practices for Amazon Simple Queue Service (Amazon SQS) using the AWS Well-Architected Framework.

This blog post covers best practices using the Security Pillar and Reliability Pillar of the AWS Well-Architected Framework. The inventory management example introduced in part 1 of the series will continue to serve as an example.

See also the other two parts of the series:

Security Pillar

The Security Pillar includes the ability to protect data, systems, and assets and to take advantage of cloud technologies to improve your security. This pillar recommends putting in place practices that influence security. Using these best practices, you can protect data while in-transit (as it travels to and from SQS) and at rest (while stored on disk in SQS), or control who can do what with SQS.

Best practice: Configure server-side encryption

If your application has a compliance requirement such as HIPAA, GDPR, or PCI-DSS mandating encryption at rest, if you are looking to improve data security to protect against unauthorized access, or if you are just looking for simplified key management for the messages sent to the SQS queue, you can leverage Server-Side Encryption (SSE) to protect the privacy and integrity of your data stored on SQS.

SQS and AWS Key Management Service (KMS) offer two options for configuring server-side encryption. SQS-managed encryptions keys (SSE-SQS) provide automatic encryption of messages stored in SQS queues using AWS-managed keys. This feature is enabled by default when you create a queue. If you choose to use your own AWS KMS keys to encrypt and decrypt messages stored in SQS, you can use the SSE-KMS feature.

SSE-KMS provides greater control and flexibility over encryption keys, while SSE-SQS simplifies the process by managing the encryption keys for you. Both options help you protect sensitive data and comply with regulatory requirements by encrypting data at rest in SQS queues. Note that SSE-SQS only encrypts the message body and not the message attributes.

In the inventory management example introduced in part 1, an AWS Lambda function responsible for CSV processing sends incoming messages to an SQS queue when an inventory updates file is dropped into the Amazon Simple Storage Service (Amazon S3) bucket. SQS encrypts these messages in the queue using SQS-SSE. When a backend processing Lambda polls messages from the queue, the encrypted message is decrypted, and the function inserts inventory updates into Amazon DynamoDB.

The AWS Could Development Kit (AWS CDK) code sets SSE-SQS as the default encryption key type. However, the following AWS CDK code shows how to encrypt the queue with SSE-KMS.

# Create the SQS queue with DLQ setting
queue = sqs.Queue(
    self,
    "InventoryUpdatesQueue",
    visibility_timeout=Duration.seconds(300),
    encryption=sqs.QueueEncryption.KMS_MANAGED,
)

Best practice: Implement least-privilege access using access policy

For securing your resources in AWS, implementing least-privilege access is critical. This means granting users and services the minimum level of access required to perform their tasks. Least-privilege access provides better security, allows you to meet your compliance requirements, and offers accountability via a clear audit trail of who accessed what resources and when.

By implementing least-privilege access using access policies, you can help reduce the risk of security breaches and ensure that your resources are only accessed by authorized users and services. AWS Identity and Access Management (IAM) policies apply to users, groups, and roles, while resource-based policies apply to AWS resources such as SQS queues. To implement least-privilege access, it’s essential to start by defining what actions are required for each user or service to perform their tasks.

In the inventory management example, the CSV processing Lambda function doesn’t perform any other task beyond parsing the inventory updates file and sending the inventory records to the SQS queue for further processing. To ensure that the function has the permissions to send messages to the SQS queue, grant the SQS queue access to the IAM role that the Lambda function assumes. By granting the SQS queue access to the Lambda function’s IAM role, you establish a secure and controlled communication channel. The Lambda function can only interact with the SQS queue and doesn’t have unnecessary access or permissions that might compromise the system’s security.

# Create pre-processing Lambda function
csv_processing_to_sqs_function = _lambda.Function(
    self,
    "CSVProcessingToSQSFunction",
    runtime=_lambda.Runtime.PYTHON_3_8,
    code=_lambda.Code.from_asset("sqs_blog/lambda"),
    handler="CSVProcessingToSQSFunction.lambda_handler",
    role=role,
    tracing=Tracing.ACTIVE,
)

# Define the queue policy to allow messages from the Lambda function's role only
policy = iam.PolicyStatement(
    actions=["sqs:SendMessage"],
    effect=iam.Effect.ALLOW,
    principals=[iam.ArnPrincipal(role.role_arn)],
    resources=[queue.queue_arn],
)

queue.add_to_resource_policy(policy)

Best practice: Allow only encrypted connections over HTTPS using aws:SecureTransport

It is essential to have a secure and reliable method for transferring data between AWS services and on-premises environments or other external systems. With HTTPS, a network-based attacker cannot eavesdrop on network traffic or manipulate it, using an attack such as man-in-the-middle.

With SQS, you can choose to allow only encrypted connections over HTTPS using the aws:SecureTransport condition key in the queue policy. With this condition in place, any requests made over non-secure HTTP receive a 400 InvalidSecurity error from SQS.

In the inventory management example, the CSV processing Lambda function sends inventory updates to the SQS queue. To ensure secure data transfer, the Lambda function uses the HTTPS endpoint provided by SQS. This guarantees that the communication between the Lambda function and the SQS queue remains encrypted and resistant to potential security threats.

# Create an IAM policy statement allowing only HTTPS access to the queue
secure_transport_policy = iam.PolicyStatement(
    effect=iam.Effect.DENY,
    actions=["sqs:*"],
    resources=[queue.queue_arn],
    conditions={
        "Bool": {
            "aws:SecureTransport": "false",
        },
    },
)

Best practice: Use attribute-based access controls (ABAC)

Some use-cases require granular access control. For example, authorizing a user based on user roles, environment, department, or location. Additionally, dynamic authorization is required based on changing user attributes. In this case, you need an access control mechanism based on user attributes.

Attribute-based access controls (ABAC) is an authorization strategy that defines permissions based on tags attached to users and AWS resources. With ABAC, you can use tags to configure IAM access permissions and policies for your queues. ABAC hence enables you to scale your permission management easily. You can author a single permission policy in IAM using tags created for each business role, and no longer need to update the policy when adding new resources.

ABAC for SQS queues enables two key use cases:

Tag-based access control: use tags to control access to your SQS queues, including control plane and data plane API calls.
Tag-on-create: enforce tags at the time of creation of an SQS queues and deny the creation of SQS resources without tags.

Reliability Pillar

The Reliability Pillar encompasses the ability of a workload to perform its intended function correctly and consistently when it’s expected to. By leveraging the best practices outlined in this pillar, you can enhance the way you manage messages in SQS.

Best practice: Configure dead-letter queues

In a distributed system, when messages flow between sub-systems, there is a possibility that some messages may not be processed right away. This could be because of the message being corrupted or downstream processing being temporarily unavailable. In such situations, it is not ideal for the bad message to block other messages in the queue.

Dead Letter Queues (DLQs) in SQS can improve the reliability of your application by providing an additional layer of fault tolerance, simplifying debugging, providing a retry mechanism, and separating problematic messages from the main queue. By incorporating DLQs into your application architecture, you can build a more robust and reliable system that can handle errors and maintain high levels of performance and availability.

In the inventory management example, a DLQ plays a vital role in adding message resiliency and preventing situations where a single bad message blocks the processing of other messages. If the backend Lambda function fails after multiple attempts, the inventory update message is redirected to the DLQ. By inspecting these unconsumed messages, you can troubleshoot and redrive them to the primary queue or to custom destination using the DLQ redrive feature. You can also automate redrive by using a set of APIs programmatically. This ensures accurate inventory updates and prevents data loss.

The following AWS CDK code snippet shows how to create a DLQ for the source queue and sets up a DLQ policy to only allow messages from the source SQS queue. It is recommended not to set the max_receive_count value to 1, especially when using a Lambda function as the consumer, to avoid accumulating many messages in the DLQ.

# Create the Dead Letter Queue (DLQ)
dlq = sqs.Queue(self, "InventoryUpdatesDlq", visibility_timeout=Duration.seconds(300))

# Create the SQS queue with DLQ setting
queue = sqs.Queue(
    self,
    "InventoryUpdatesQueue",
    visibility_timeout=Duration.seconds(300),
    dead_letter_queue=sqs.DeadLetterQueue(
        max_receive_count=3,  # Number of retries before sending the message to the DLQ
        queue=dlq,
    ),
)
# Create an SQS queue policy to allow source queue to send messages to the DLQ
policy = iam.PolicyStatement(
    effect=iam.Effect.ALLOW,
    actions=["sqs:SendMessage"],
    resources=[dlq.queue_arn],
    conditions={"ArnEquals": {"aws:SourceArn": queue.queue_arn}},
)
queue.queue_policy = iam.PolicyDocument(statements=[policy])

Best practice: Process messages in a timely manner by configuring the right visibility timeout

Setting the appropriate visibility timeout is crucial for efficient message processing in SQS. The visibility timeout is the period during which SQS prevents other consumers from receiving and processing a message after it has been polled from the queue.

To determine the ideal visibility timeout for your application, consider your specific use case. If your application typically processes messages within a few seconds, set the visibility timeout to a few minutes. This ensures that multiple consumers don’t process the message simultaneously. If your application requires more time to process messages, consider breaking them down into smaller units or batching them to improve performance.

If a message fails to process and is returned to the queue, it will not be available for processing again until the visibility timeout period has elapsed. Increasing the visibility timeout will increase the overall latency of your application. Therefore, it’s important to balance the tradeoff between reducing the likelihood of message duplication and maintaining a responsive application.

In the inventory management example, setting the right visibility timeout helps the application fail fast and improve the message processing times. Since the Lambda function typically processes messages within milliseconds, a visibility timeout of 30 seconds is set in the following AWS CDK code snippet.

queue = sqs.Queue(
    self,
    " InventoryUpdatesQueue",
    visibility_timeout=Duration.seconds(30),
)

It is recommended to keep the SQS queue visibility timeout to at least six times the Lambda function timeout, plus the value of MaximumBatchingWindowInSeconds. This allows Lambda function to retry the messages if the invocation fails.

Conclusion

This blog post explores best practices for SQS using the Security Pillar and Reliability Pillar of the AWS Well-Architected Framework. We discuss various best practices and considerations to ensure the security of SQS. By following these best practices, you can create a robust and secure messaging system using SQS. We also highlight fault tolerance and processing a message in a timely manner as important aspects of building reliable applications using SQS.

The next part of this blog post series focuses on the Performance Efficiency Pillar, Cost Optimization Pillar, and Sustainability Pillar of the AWS Well-Architected Framework and explore best practices for SQS.

For more serverless learning resources, visit Serverless Land.

Implementing AWS Well-Architected best practices for Amazon SQS – Part 1

2023-06-27 Pascal Vogel

Post Syndicated from Pascal Vogel original https://aws.amazon.com/blogs/compute/implementing-aws-well-architected-best-practices-for-amazon-sqs-part-1/

This blog is written by Chetan Makvana, Senior Solutions Architect and Hardik Vasa, Senior Solutions Architect.

Amazon Simple Queue Service (Amazon SQS) is a fully managed message queuing service that makes it easy to decouple and scale microservices, distributed systems, and serverless applications. AWS customers have constantly discovered powerful new ways to build more scalable, elastic, and reliable applications using SQS. You can leverage SQS in a variety of use-cases requiring loose coupling and high performance at any level of throughput, while reducing cost by only paying for value and remaining confident that no message is lost. When building applications with Amazon SQS, it is important to follow architectural best practices.

To help you identify and implement these best practices, AWS provides the AWS Well-Architected Framework for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems in the AWS Cloud. Built around six pillars—operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability, AWS Well-Architected provides a consistent approach for customers and partners to evaluate architectures and implement scalable designs.

This three-part blog series covers each pillar of the AWS Well-Architected Framework to implement best practices for SQS. This blog post, part 1 of the series, discusses best practices using the Operational Excellence Pillar of the AWS Well-Architected Framework.

See also the other two parts of the series:

Solution overview

This solution architecture shows an example of an inventory management system. The system leverages Amazon Simple Storage Service (Amazon S3), AWS Lambda, Amazon SQS, and Amazon DynamoDB to streamline inventory operations and ensure accurate inventory levels. The system handles frequent updates from multiple sources, such as suppliers, warehouses, and retail stores, which are received as CSV files.

These CSV files are then uploaded to an S3 bucket, consolidating and securing the inventory data for the inventory management system’s access. The system uses a Lambda function to read and parse the CSV file, extracting individual inventory update records. The backend Lambda function transforms each inventory update record into a message and sends it to an SQS queue. Another Lambda function continually polls the SQS queue for new messages. Upon receiving a message, it retrieves the inventory update details and updates the inventory levels in DynamoDB accordingly.

This ensures that the inventory quantities for each product are accurate and reflect the latest changes. This way, the inventory management system provides real-time visibility into inventory levels across different locations and suppliers, enabling the company to monitor product availability with precision. Find the example code for this solution in the GitHub repository.

This example is used throughout this blog series to highlight how SQS best practices can be implemented based on the AWS Well Architected Framework.

Operational Excellence Pillar

The Operational Excellence Pillar includes the ability to support development and run workloads effectively, gain insight into their operation, and continuously improve supporting processes and procedures to deliver business value. To achieve operational excellence, the pillar recommends best practices such as defining workload metrics and implementing transaction traceability. This enables organizations to gain valuable insights into their operations, identify potential issues, and optimize services accordingly to improve customer experience. Furthermore, understanding the health of an application is critical to ensuring that it is functioning as expected.

Best practice: Use infrastructure as code to deploy SQS

Infrastructure as Code (IaC) helps you model, provision, and manage your cloud resources. One of the primary advantages of IaC is that it simplifies infrastructure management. With IaC, you can quickly and easily replicate your environment to multiple AWS Regions with a single turnkey solution. This makes it easy to manage your infrastructure, regardless of where your resources are located. Additionally, IaC enables you to create, deploy, and maintain infrastructure in a programmatic, descriptive, and declarative way repeatably. This reduces errors caused by manual processes, such as creating resources in the AWS Management Console. With IaC, you can easily control and track changes in your infrastructure, which makes it easier to maintain and troubleshoot your systems.

For managing SQS resources, you can use different IaC tools like AWS Serverless Application Model (AWS SAM), AWS CloudFormation, or AWS Could Development Kit (AWS CDK). There are also third-party solutions for creating SQS resources, such as the Serverless Framework. AWS CDK is a popular choice because it allows you to provision AWS resources using familiar programming languages such as Python, Java, TypeScript, Go, JavaScript, and C#/.Net.

This blog series showcases the use of AWS CDK with Python to demonstrate best practices for working with SQS. For example, the following AWS CDK code creates a new SQS queue:

from aws_cdk import (
    Duration,
    Stack,
    aws_sqs as sqs,
)
from constructs import Construct


class SqsCdBlogStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # The code that defines your stack goes here

        # example resource
        queue = sqs.Queue(
            self,
            "InventoryUpdatesQueue",
            visibility_timeout=Duration.seconds(300),
        )

Best practice: Configure CloudWatch alarms for ApproximateAgeofOldestMessage

It is important to understand Amazon CloudWatch metrics and dimensions for SQS, to have a plan in place to assess its behavior, and to add custom metrics where necessary. Once you have a good understanding of the metrics, it is essential to identify the key metrics that are most relevant to your use case and set up appropriate alerts to monitor them.

One of the key metrics that SQS provides is the ApproximateAgeOfOldestMessage metric. By monitoring this metric, you can determine the age of the oldest message in the queue, and take appropriate action to ensure that messages are processed in a timely manner. To set up alerts for the ApproximateAgeOfOldestMessage metric, you can use CloudWatch alarms. You configure these alarms to issue alerts when messages remain in the queue for extended periods of time. You can use these alerts to act, for instance by scaling up consumers to process messages more quickly or investigating potential issues with message processing.

In the inventory management example, leveraging the ApproximateAgeOfOldestMessage metric provides valuable insights into the health and performance of the SQS queue. By monitoring this metric, you can detect processing delays, optimize performance, and ensure that inventory updates are processed within the desired timeframe. This ensures that your inventory levels remain accurate and up-to-date. The following code creates an alarm which is triggered if the oldest inventory updates request is in the queue for more than 30 seconds.

# Create a CloudWatch alarm for ApproximateAgeOfOldestMessage metric
alarm = cloudwatch.Alarm(
	self,
	"OldInventoryUpdatesAlarm",
	alarm_name="OldInventoryUpdatesAlarm",
	metric=queue.metric_approximate_age_of_oldest_message(),
	threshold=600,  # Specify your desired threshold value in seconds
	evaluation_periods=1,
	comparison_operator=cloudwatch.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
)

Best practice: Add a tracing header while sending a message to the queue to provide distributed tracing capabilities for faster troubleshooting

By implementing distributed tracing, you can gain a clear understanding of the flow of messages in SQS queues, identify any bottlenecks or potential issues, and proactively react to any signals that indicate an unhealthy state. Tracing provides a wider continuous view of an application and helps to follow a user journey or transaction through the application.

AWS X-Ray is an example of a distributed tracing solution that integrates with Amazon SQS to trace messages that are passed through an SQS queue. When using the X-Ray SDK, SQS can propagate tracing headers to maintain trace continuity and enable tracking, analysis, and debugging throughout downstream services. SQS supports tracing headers through the Default HTTP header and the AWSTraceHeader System Attribute. AWSTraceHeader is available for use even when auto-instrumentation through the X-Ray SDK is not, for example, when building a tracing SDK for a new language. If you are using a Lambda downstream consumer, trace context propagation is automatic.

In the inventory management example, by utilizing distributed tracing with X-Ray for SQS, you can gain deep insights into the performance, behavior, and dependencies of the inventory management system. This visibility enables you to optimize performance, troubleshoot issues more effectively, and ensure the smooth and efficient operation of the system. The following code sets up a CSV processing Lambda function and a backend processing Lambda function with active tracing enabled. The Lambda function automatically receives the X-Ray TraceId from SQS.

# Create pre-processing Lambda function
csv_processing_to_sqs_function = _lambda.Function(
    self,
    "CSVProcessingToSQSFunction",
    runtime=_lambda.Runtime.PYTHON_3_8,
    code=_lambda.Code.from_asset("sqs_blog/lambda"),
    handler="CSVProcessingToSQSFunction.lambda_handler",
    role=role,
    tracing=Tracing.ACTIVE,  # Enable active tracing with X-Ray
)

# Create a post-processing Lambda function with the specified role
sqs_to_dynamodb_function = _lambda.Function(
    self,
    "SQSToDynamoDBFunction",
    runtime=_lambda.Runtime.PYTHON_3_8,
    code=_lambda.Code.from_asset("sqs_blog/lambda"),
    handler="SQSToDynamoDBFunction.lambda_handler",
    role=role,
    tracing=Tracing.ACTIVE,  # Enable active tracing with X-Ray
)

Conclusion

This blog post explores best practices for SQS with a focus on the Operational Excellence Pillar of the AWS Well-Architected Framework. We explore key considerations for ensuring the smooth operation and optimal performance of applications using SQS. Additionally, we explore the advantages of infrastructure as code in simplifying infrastructure management and showcase how AWS CDK can be used to provision and manage SQS resources.

The next part of this blog post series addresses the Security Pillar and Reliability Pillar of the AWS Well-Architected Framework and explores best practices for SQS.

For more serverless learning resources, visit Serverless Land.

Testing AWS Lambda functions with AWS SAM remote invoke

2023-06-23 Eric Johnson

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/testing-aws-lambda-functions-with-aws-sam-remote-invoke/

Developers are taking advantage of event driven architecture (EDA) to build large distributed applications. To build these applications, developers are using managed service like AWS Lambda, AWS Step Functions, and Amazon EventBridge to handle compute, orchestration, and choreography. Since these services run in the cloud, developers are also looking for ways to test in the cloud. With this in mind, AWS SAM is adding a new feature to the AWS SAM CLI called remote invoke.

AWS SAM remote invoke enables developers to invoke a Lambda function in the AWS Cloud from their development environment. The feature has several options for identifying the Lambda function to invoke, the payload event, and the output type.

Using remote invoke

To test the remote invoke feature, there is a small AWS SAM application that comprises two AWS Lambda functions. The TranslateFunction takes a text string and translates it to the target language using the AI/ML service Amazon Translate. The StreamFunction generates data in a streaming format. To run these demonstrations, be sure to install the latest AWS SAM CLI.

To deploy the application, follow these steps:

Clone the repository:

git clone https://github.com/aws-samples/aws-sam-remote-invoke-example

Change to the root directory of the repository:
```
cd aws-sam-remote-invoke-example
```
Build the AWS Lambda artifacts (use the –use-container option to ensure Python 3.10 and Node 18 are present. If these are both set up on your machine, you can ignore this flag):
```
sam build --use-container
```
Deploy the application to your AWS account:
```
sam deploy --guided
```
Name the application “remote-test” and choose all defaults.

AWS SAM can now remotely invoke the Lambda functions deployed with this application. Use the following command to test the TranslateFunction:

sam remote invoke --stack-name remote-test --event '{"message":"I am testing the power of remote invocation", "target-language":"es"}' TranslateFunction

This is a quick way to test a small event. However, developers often deal with large complex payloads. The AWS SAM remote invoke function also allows an event to be passed as a file. Use the following command to test:

sam remote invoke --stack-name remote-test --event-file './events/translate-event.json' TranslateFunction

With either of these methods, AWS SAM returns the response from the Lambda function as if it were called from a service like Amazon API Gateway. However, AWS SAM also offers the ability to get the raw response as returned from the Python software development kit (SDK), boto3. This format provides additional information such as the version that you invoked, if any retries were attempted, and more. To retrieve this output, run the invocation with the additional –output parameter with the value of json.

sam remote invoke --stack-name remote-test --event '{"message": "I am testing the power of remote invocation", "target-language": "es"}' --output json TranslateFunction

Full output from SDK

It is also possible to invoke Lambda functions that are not created in AWS SAM. Using the name of a Lambda function, AWS SAM can remotely invoke any Lambda function that you have permission to invoke. When you deployed the sample application, AWS SAM prints the name of the Lambda function in the console. Use the following command to print the output again:

sam list stack-outputs --stack-name remote-test

Using the output for the TranslateFunctionName, run:

sam remote invoke --event '{"message": "Testing direct access of the function", "target-language": "fr"}' <TranslateFunctionName>

Lambda recently added support from streaming responses from Lambda functions. Streaming functions do not wait until the entire response is available before they respond to the client. To show this, the StreamFunction generates multiple chunks of text and sends them over a period of time.

To invoke the function, run:

sam remote invoke --stack-name remote-test StreamFunction

Extending remote invoke

The AWS SDKs offer different options when invoking Lambda functions via the Lambda service. Behind the scenes, AWS SAM is using boto3 to power the remote invoke functionality. To make full use of the SDK options for Lambda function invocation, the AWS SAM offers a —parameter flag that can be used multiple times.

For example, you may want to run an invocation as a dry run only. This type of invocation tests Lambda’s ability to invoke the function based on factors like variable values and proper permissions. The command looks like the following:

sam remote invoke --stack-name remote-test --event '{"message": "I am testing the power of remote invocation", "target-language": "es"}' --parameter InvocationType=DryRun --output json TranslateFunction

In a second example, I want to invoke a specific version of the Lambda function:

sam remote invoke --stack-name remote-test --event '{"message": "I am testing the power of remote invocation", "target-language": "es"}' --parameter Qualifier='$LATEST' TranslateFunction

If you need both options:

sam remote invoke --stack-name remote-test --event '{"message": "I am testing the power of remote invocation", "target-language": "es"}' --parameter InvocationType=DryRun --parameter Qualifier='$LATEST' --output json TranslateFunction

Logging

When developing distributed applications, logging is a critical tool to trace the state of a request across decoupled microservices. AWS SAM offers the sam logs functionality to help view aggregated logs and traces from Amazon CloudWatch and AWS X-Ray, respectively. However, when testing individual functions, developers want contextual logs pinpointed to a specific invocation. The new remote invoke function provides these logs by default. Returning to the TranslateFunction, run the following command again:

sam remote invoke --stack-name remote-test --event '{"message": "I am testing the power of remote invocation", "target-language": "es"}' TranslateFunction

Logging response from remote invoke

The remote invocation returns the response from the Lambda function, any logging from within the Lambda function, followed by the final report from the Lambda service about the invocation itself.

Combining remote invoke with AWS SAM Accelerate

Developers are constantly striving to remove complexity and friction and improve speed and agility in the development pipeline. To help serverless developers towards this goal, the AWS SAM team released a feature called AWS SAM Accelerate. AWS SAM Accelerate is a series of features that move debugging and testing from the local machine to the cloud.

To show how AWS SAM Accelerate and remote invoke can work together, follow these steps:

In a separate terminal, start the AWS SAM sync process with the watch option:
```
sam sync --stack-name remote-test --use-container --watch
```

In a second window or tab, run the remote invoke function:

sam remote invoke --stack-name remote-test --event-file './events/translate-event.json' TranslateFunction

The combination of these two options provides a robust auto-deployment and testing environment. During iterations of code in the Lambda function, each time you save the file, AWS SAM syncs the code and any dependencies to the cloud. As needed, the remote invoke is then run to verify the code works as expected, with logging provided for each execution.

Conclusion

Serverless developers are looking for the most efficient way to test their applications in the AWS Cloud. They want to invoke an AWS Lambda function quickly without having to mock security, external services, or other environment variables. This blog shows how to use the new AWS SAM remote invoke feature to do just that.

This post shows how to invoke the Lambda function, change the payload type and location, and change the output format. It explains using this feature in conjunction with the AWS SAM Accelerate features to streamline the serverless development and testing process.

For more serverless learning resources, visit Serverless Land.

Ruby 3.2 runtime now available in AWS Lambda

2023-06-07 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/ruby-3-2-runtime-now-available-in-aws-lambda/

This post is written by Praveen Koorse, Senior Solutions Architect, AWS.

AWS Lambda now supports Ruby 3.2 runtime. With this release, Ruby developers can now take advantage of new features and improvements introduced in Ruby 3 when creating serverless applications on Lambda. Use this runtime today by specifying the runtime parameter of ruby3.2 when creating or updating Lambda functions.

Ruby 3.2 adds many features and performance improvements, including anonymous arguments passing improvements, ‘endless’ methods, Regexp improvements, a new Data class, support for pattern-matching in Time and MatchData, and support for ‘find pattern’ in pattern matching.

Our testing shows Ruby 3.2 cold starts are marginally slower than Ruby 2.7 for a trivial ‘hello world’ function. However, for many real-world workloads, the improved execution performance of Ruby 3.2 results in similar or better performance overall.

Existing Lambda customers using the Ruby 2.7 runtime should migrate to the Ruby 3.2 runtime as soon as possible. Even though community support for Ruby 2.7 has ended, Lambda has extended support for the Ruby 2.7 runtime until December 7, 2023 to provide existing Ruby customers with time to transition to Ruby 3.2. Functions using Ruby 2.7 continue to be eligible for technical support and Lambda will continue to apply OS security updates to the runtime until this date.

Anonymous arguments passing improvements

Ruby 3.2 has introduced improvements to how you can pass anonymous arguments, making it easier and cleaner to work with keyword arguments in code. Previously, you could pass anonymous keyword arguments to a method by using the delegation syntax (…) or use Module#ruby2_keywords and delegate *args, &block. This method was not intuitive and lacked clarity when working with multiple arguments.

def foo(...)
  target(...)
end

ruby2_keywords def foo(*args, &block)
  target(*args, &block)
end

Now, if a method declaration includes anonymous positional or keyword arguments, those can be passed to the next method as arguments themselves. The same advantages of anonymous block forwarding apply to rest and keyword rest argument forwarding.

def keywords(**) # accept keyword arguments
  foo(**) # pass them to the next method
end

def positional(*) # accept positional arguments
  bar(*) # pass to the next method
end

def positional_keywords(*, **) # same as ...
  foobar(*, **)
end

Endless methods

Ruby 3 introduced endless methods that enable developers to define methods of exactly one statement with a syntax def method() = statement. The syntax doesn’t need an end and allows methods to be defined as short-cut one liners, making the creation of basic utility methods easier and helping developers write clean code and improve readability and maintainability of code.

def dbg_args(a, b=1, c:, d: 6, &block) = puts("Args passed: #{[a, b, c, d, block.call]}")
dbg_args(0, c: 5) { 7 }
# Prints: Args passed: [0, 1, 5, 6, 7]

def square(x) = x**2
square(100) 
# => 10000

Regexp improvements

Regular expression matching can take an unexpectedly long time. If your code attempts to match a possibly inefficient Regexp against an untrusted input, an attacker may exploit it for efficient denial of service (so-called regular expression DoS, or ReDoS).

Ruby 3.2 introduces two improvements to mitigate this.

The first is an improved Regexp matching algorithm using a cache-based approach that allows most Regexp matching to be completed in linear time, thus significantly improving overall match performance.

As the prior optimization cannot be applied to some regular expressions, such as those using advanced features or working with a huge fixed number of repetitions, a Regexp timeout improvement now allows you to specify a timeout for Regexp operations as a fallback measure.

There are two APIs to set timeout:

Timeout.timeout: This is a global configuration and applies to all Regexps in the process.
timeout keyword for Regexp.new: This allows you to specify a different timeout setting for some Regexps. If used, it takes precedence over the global configuration.

Regexp.timeout = 2.0 # Global configuration set to two seconds
/^x*y?x*()\1$/ =~ "x" * 45000 + "a"
#=> Regexp::TimeoutError is raised in two seconds

my_long_rexp = Regexp.new('^x*y?x*()\1$', timeout: 4)
my_long_rexp =~ "x" * 45000 + "a"
# Regexp::TimeoutError is raised in four seconds

Data class

Ruby 3.2 introduces a new core class – ‘Data’ – to represent simple value-alike objects. The class is similar to Struct and partially shares the implementation, but is intended to be immutable with a more modern interface.

Once a Data class is defined using Data.define, both positional and keyword arguments can be used while constructing objects.

def handler(event:, context:) 
  employee = Data.define(:firstname, :lastname, :empid, :department)

  emp1 = employee.new('John', 'Doe', 12345, 'Sales')
  emp2 = employee.new(firstname: 'Alice', lastname: 'Doe', empid: 12346, department: 'Marketing')

  # Alternative form to construct an object
  emp3 = employee['Jack', 'Frost', 12354, 'Tech']
  emp4 = employee[firstname: 'Emma', lastname: 'Frost', empid: 12453, department: 'HR']
end

Pattern matching improvements

Pattern matching is a feature allowing deep matching of structured values: checking the structure and binding the matched parts to local variables.

Ruby 3.2 introduces ‘Find pattern’, allowing you to check if the given object has any elements that match a pattern. This pattern is similar to the Array pattern, except that it finds the first match in a given object containing multiple elements.

For example, previously, if you used the following in a Lambda function:

person = {name: "John", children: [{name: "Mark", age: 12}, {name: "Butler", age: 9}], siblings: [{name: "Mary", age: 31}, {name: "Conrad", age: 38}] }

case person
in {name: "John", children: [{name: "Mark", age: age}]}
  p age
end

This wouldn’t match because it does not search for the element in the ‘children’ array. It only matches for an array with a single element named “Mark”.

To find an element in an array with multiple elements, use ‘find pattern’:

case person
in {name: "John", children: [*, {name: "Mark", age: age}, *]}
  p age
end

As part of an effort to make core classes more pattern matching friendly, Ruby 3.2 also introduces the ability to deconstruct keys in Time and MatchData, allowing their use in case/in statements for pattern matching.

# `deconstruct_keys(nil)` provides all available keys:
timestamp = Time.now.deconstruct_keys(nil)

# Usage in pattern-matching:
case timestamp
in year: ...2022
  puts "Last year!"
in year: 2022, month: 1..3
  puts "Last year's first quarter"
in year: 2023, month:, day:
  puts "#{day} of #{month}th month!"
end

Standard Date and DateTime classes also have similar implementations for key deconstruction:

require 'date'
Date.today.deconstruct_keys(nil)
#=> {:year=>2023, :month=>1, :day=>15, :yday=>15, :wday=>0}
DateTime.now.deconstruct_keys(nil)
# => {:year=>2023, :month=>1, :day=>15, :yday=>15, :wday=>0, :hour=>17, :min=>19, :sec=>15, :sec_fraction=>(478525469/500000000), :zone=>"+02:00"}

Pattern matching with MatchData result deconstruction:

case db_connection_string.match(%r{postgres://(\w+):(\w+)@(.+)})
in 'admin', password, server
  # do connection with admin rights
in 'devuser', _, 'dev-server.us-east-1.rds.amazonaws.com'
  # connect to dev server
in user, password, server
  # do regular connection
end

YJIT – Yet Another Ruby JIT

YJIT, a lightweight, minimalistic Ruby JIT compiler built inside CRuby, is now an official part of the Ruby 3.2 runtime. It provides significantly higher performance, but also uses more memory than the Ruby interpreter and is generally suited for Ruby on Rails workloads.

By default, YJIT is not enabled in the Lambda Ruby 3.2 runtime. You can enable it for specific functions by setting the RUBY_YJIT_ENABLE environment variable to 1. Once enabled, you can verify it by printing the result of the RubyVM::YJIT.enabled? method.

puts(RubyVM::YJIT.enabled?())
# => true

Using Ruby 3.2 in Lambda

AWS Cloud Development Kit (AWS CDK):

In AWS CDK, set the runtime attribute to Runtime.RUBY_3_2 when creating the function to use this version. In TypeScript:

import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';

export class MyCdkStack extends cdk.Stack {

  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    	super(scope, id, props);

new lambda.Function(this, 'Ruby32Lambda', {
  		runtime: lambda.Runtime.RUBY_3_2, //execution environment
  		handler: 'test.handler', //file is “test”, function is “handler”
  	code: lambda.Code.fromAsset('lambda'), //code loaded from “lambda” dir
});
}

AWS Management Console

In the Lambda console, specify a runtime parameter value of Ruby 3.2 when creating or updating a function. The Ruby 3.2 runtime is now available in the Runtime dropdown in the Create function page.

To update an existing Lambda function to Ruby 3.2, navigate to the function in the Lambda console, then choose Edit in the Runtime settings panel. The new version of Ruby is available in the Runtime dropdown:

AWS Serverless Application Model

In the AWS Serverless Application Model (AWS SAM), set the Runtime attribute to ruby3.2 to use this version in your application deployments:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: AWS Lambda Ruby 3.2 example

Resources:
  Ruby32Lambda:
    Type: AWS::Serverless::Function
    Description: 'Lambda function that uses the Ruby3.2 runtime'
    Properties:
      FunctionName: Ruby32Lambda
      Handler: function.handler
      Runtime: ruby3.2
      CodeUri: src/

Conclusion

Get started building with Ruby 3.2 today by making necessary changes for compatibility with Ruby 3.2, and specifying a runtime parameter value of ruby3.2 when creating or updating your Lambda functions. You can read about the Ruby programming model in the Lambda documentation to learn more about writing functions in Ruby 3.2.

For more serverless learning resources, visit Serverless Land.

Implementing custom domain names for Amazon API Gateway private endpoints using a reverse proxy

2023-06-05 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/implementing-custom-domain-names-for-amazon-api-gateway-private-endpoints-using-a-reverse-proxy/

This post is written by Heeki Park, Principal Solutions Architect, Sachin Doshi, Senior Application Architect, and Jason Enderle, Senior Solutions Architect.

Amazon API Gateway enables developers to create private REST APIs that can only be accessed from within a Virtual Private Cloud (VPC). Traffic to the private API is transmitted over secure connections and stays within the AWS network and specifically within the customer’s VPC, protecting it from the public internet. This approach can be used to address a customer’s regulatory or security requirements by ensuring the confidentiality of the transmitted traffic. This makes private API Gateway endpoints suitable for publishing internal APIs, such as those used by microservices and data APIs.

In microservice architectures, teams often build and manage components in separate AWS accounts and prefer to access those private API endpoints using company-specific custom domain names. Custom domain names serve as an alias for a hostname and path to your API. This makes it easier for clients to connect using an easy-to-remember vanity URL and also maintains a stable URL in case the underlying API endpoint URL changes. Custom domain names can also improve the organization of APIs according to their functions within the enterprise. For example, the standard API Gateway URL format: “https://api-id.execute-api.region.amazonaws.com/stage” can be transformed into “https://api.private.example.com/myservice”.

Overview

This blog post builds on documentation that covers frontend invokes of private endpoints and backend integration patterns and two previously published blog posts.

The first blog post that helps you consume private API endpoints from API Gateway, using a VPC-enabled Lambda function and a container-based application using mTLS. The second post helps you implement private backend integrations to your API microservices that are deployed using AWS Fargate or Amazon EC2. This post extends these, enabling you to simplify access to your API endpoints by implementing custom domain names for private endpoints using an NGINX reverse proxy.

This solution uses NGINX because it acts as a high-performance intermediary, enabling the efficient forwarding of traffic within a private network. A configuration mapping file associates your custom domain with the corresponding private endpoint across AWS accounts. This configuration mapping file can then be source controlled and used for governed deployments into your lower and production environments.

The following diagram illustrates the interactions between the components and the path for an API request. In this use case, a shared services account (Account A) is responsible for centrally managing the mappings of custom domains and creating an AWS PrivateLink connection to private API endpoints in provider accounts (Account B and Account C).

A request to the API is made using a private custom domain from within a VPC or another device that is able to route to the VPC. For example, the request might use the domain https://api.private.example.com.
An alias record in Amazon Route 53 private hosted zone resolves to the fully qualified domain name of the private Elastic Load Balancing (ELB). The ELB can be configured to be either a Network Load Balancer (NLB) or an Application Load Balancer (ALB).
The ELB uses an AWS Certificate Manager (ACM) certificate to terminate TLS (Transport Layer Security) for corresponding custom private domain.
The ELB listener redirects requests to an associated ELB target group, which in turn forwards the request to an Amazon Elastic Container Service task running on AWS Fargate.
The Fargate service hosts a container based on NGINX that acts as a reverse proxy to the private API endpoint in one or more provider accounts. The Fargate service is configured to scale using a metric that tracks CPU utilization automatically.
The Fargate task, forwards traffic to the appropriate private endpoints in provider Account B or Account C through a PrivateLink VPC Endpoint.
The API Gateway resource policy limits access to the private endpoints based on a specific VPC endpoint, HTTP verbs, and source domain used to request the API.

The solution passes any additional information found in headers from upstream calls, such as authentication headers, content type headers, or custom data headers unmodified to private endpoints in provider accounts (Account B and Account C).

Prerequisites

To use custom domain names, you need two components: a TLS certificate and a DNS alias. This example uses ACM for managing the TLS certificate and Route 53 for creating the DNS alias.

ACM offers various options for integrating a TLS certificate, such as:

Importing an existing TLS certificate into ACM.
Requesting a TLS certificate in ACM with Email-based validation.
Requesting a TLS certificate in ACM with DNS-based validation.
Requesting a TLS certificate from ACM using your Organization’s Private CA in AWS.

The following diagram illustrates the benefits and drawbacks associated with each option.

This solution uses DNS-based validation (option #3) to request TLS certificates from ACM. It is assumed that a public hosted zone with a registered root domain (such as example.com) is already deployed in the target account. The solution then uses ACM to validate ownership of the domain names specified in the configuration mapping file during deployment.

With a deployed public hosted zone, private child domains (such as api.private.example.com) can be deployed using DNS validation, which enables infrastructure as code (IaC) deployment to automate certificate validation during deployment of the solution. Additionally, DNS-based validation automatically renews the ACM certificate before it expires.

This solution requires the presence of specific VPC endpoints, namely execute-api, logs, ecr.dkr, ecr.api, and Amazon S3 gateway in a shared services account (Account A). Enabling private DNS on the execute-api VPC endpoint is optional and is not a requirement of the solution. Some customers may choose not to enable private DNS on the execute-api VPC endpoint, as this then allows applications within the VPC to reach the private API endpoints through the NGINX reverse proxy but also resolve public API Gateway endpoints.

Deploying the example

You can use the example AWS Cloud Development Kit (CDK) or Terraform code available on GitHub to deploy this pattern in your own account.

This solution uses a YAML-based configuration mapping file to add, update, or delete a mapping between a custom domain and a private API endpoint. During deployment, the automated infrastructure as code (IaC) script parses the provided YAML file and does the following:

Create an NGINX configuration file.
Apply the NGINX configuration file to the standard NGINX container image.
Parses the mapping file and creates necessary Route 53 private hosted zones in Account A.
Creates wildcard-based SSL certificates (such as *.example.com) in Account A. ACM validates these certificates using its respective public hosted zone (such as example.com) and attaches them to the ELB listener. By default, an ELB listener supports up to 25 SSL certificates. Wildcards are used to secure an unlimited number of subdomains, making it easier to manage and scale multiple subdomains.

Description of mapping file fields

Property	Required	Example Values	Description
CUSTOM_DOMAIN_URL	true	api.private.example.com	Desired custom URL for private API.
PRIVATE_API_URL	true	https://a1b2c3d4e5.execute-api.us-east-1.amazonaws.com/dev/path1/path2	Execution URL of targeted private endpoint
VERBS	false	[“GET”,” POST”, “PUT”, “PATCH”, “DELETE”, “HEAD”, “OPTIONS”]	This property would be used to create API Gateway resource policies. You can provide one or more verbs as a comma separated list. If this property is not provided, all verbs are permitted.

Using API Gateway resource policies for private endpoints

To allow access to private endpoints from your VPCs or from VPCs in another account, you must implement a resource policy. Resource policies can be used to restrict access based on specific criteria such as VPC endpoints, API paths, and API verbs. To enable this functionality, follow these steps:

Complete the infrastructure as code (IaC) deployment.
Create or update an API Gateway resource policy in the provider accounts (such as Account B and Account C). This policy should include the VPC endpoint id from the shared services account (Account A).
Deploy your API to apply the changes in provider accounts (such as Account B and Account C).

To update the API Gateway resource policy with code, refer to the documentation and code examples in the GitHub repository.

Deploying updates to the mapping file

To add, update, or delete a mapping between your custom domain and private endpoint, you can update the mapping file and then rerun the deployment using the same steps as before.

Deploying subsequent updates to the mapping file using the existing infrastructure as code pipeline reduces the risk of human error, adds traceability, prevents configuration drift, and allows the deployment process to follow your existing DevOps and governance processes in place.

For example, you could store the configuration mapping file in a separate source control repository and commit each change to that repository. Each change could then trigger a deployment process, which would then check the configuration changes and conduct the appropriate deployment. If required, you could introduce gates to enforce either manual checks or a ticketing process to ensure that change control processes are enforced.

Understanding cost of the solution

Most of the services mentioned in this solution are billed according to usage, which is determined by the number of requests made.

However, there are a few services that incur hourly or monthly costs. These include monthly fees for Route 53 hosted zones, hourly charges for VPC endpoints, Elastic Load Balancing, and the hourly cost of running the NGINX reverse proxy on Fargate. To estimate the cost for these options based on your specific workload, you can utilize the AWS pricing calculator. Here is an example outlining the approximate cost associated with the architecture implemented in this solution.

Conclusion

This blog post demonstrates a solution that allows customers to utilize their private endpoints securely with API Gateway across AWS accounts and within a VPC network by using a reverse proxy with a custom domain name. The solution offers a simplified approach to manage the mapping between private endpoints with API Gateway and custom domain names, ensuring seamless connectivity and security.

For more serverless learning resources, visit Serverless Land.

Developing a serverless Slack app using AWS Step Functions and AWS Lambda

2023-05-25 Eric Johnson

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/developing-a-serverless-slack-app-using-aws-step-functions-and-aws-lambda/

This blog was written by Sam Wilson, Cloud Application Architect and John Lopez, Cloud Application Architect.

Slack, as an enterprise collaboration and communication service, presents opportunities for builders to improve efficiency through implementing custom-written Slack Applications (apps). One such opportunity is to expose existing AWS resources to your organization without your employees needing AWS Management Console or AWS CLI access.

For example, a member of your data analytics team needs to trigger an AWS Step Functions workflow to reprocess a batch data job. Instead of granting the user direct access to the Step Functions workflow in the AWS Management Console or AWS CLI, you can provide access to invoke the workflow from within a designated Slack channel.

This blog covers how serverless architecture lets Slack users invoke AWS resources such as AWS Lambda functions and Step Functions via the Slack Desktop UI and Mobile UI using Slack apps. Serverless architecture is ideal for a Slack app because of its ability to scale. It can process thousands of concurrent requests for Slack users without the burden of managing operational overhead.

This example supports integration with other AWS resources via Step Functions. Visit the documentation for more information on integrations with other AWS resources.

This post explains the serverless example architecture, and walks through how to deploy the example in your AWS account. It demonstrates the example and discusses constraints discovered during development.

Overview

The code included in this post creates a Slack app built with a variety of AWS serverless services:

Amazon API Gateway receives all incoming requests from Slack. Step Functions coordinates request activities such as user validation, configuration retrieval, request routing, and response formatting.
A Lambda Function invokes Slack-specific authentication functionality and sends responses to the Slack UI.
Amazon EventBridge serves as a pub-sub integration between a request and the request processor.
Amazon DynamoDB stores permissions for each Slack user to ensure they only have access to resources you designate.
AWS Systems Manager stores the specific Slack channel where you use the Slack app.
AWS Secrets Manager stores the Slack app signing secret and bot token used for authentication.

AWS Cloud Development Kit (AWS CDK) deploys the AWS resources. This example can plug into any existing CI/CD platform of your choice.

Architectural overview

The desktop or mobile Slack user starts requests by using /my-slack-bot slash command or by interacting with a Slack Block Kit UI element.
API Gateway proxies the request and transforms the payload into a format that the request validator Step Functions workflow can accept.
The request validator triggers the Slack authentication Lambda function to verify that the request originates from the configured Slack organization. This Lambda function uses the Slack Bolt library for TypeScript to perform request authentication and extract request details into a consistent payload. Also, Secrets Manager stores a signing secret, which the Slack Bolt API uses during authentication.
The request validator queries the Authorized Users DynamoDB table with the username extracted from the request payload. If the user does not exist, the request ends with an unauthorized response.
The request validator retrieves the permitted channel ID and compares it to the channel ID found in the request. If the two channel IDs do not match, the request ends with an unauthorized response.
The request validator sends the request to the Command event bus in EventBridge.
The Command event bus uses the request’s action property to route the request to a request processor Step Functions workflow.
Each processor Step Functions workflow may build Slack Block Kit UI elements, send updates to existing UI elements, or invoke existing Lambda functions and Step Functions workflows.
The Slack desktop or mobile application displays new UI elements or presents updates to an existing request as it is processed. Users may interact with new UI elements to further a request or start over with an additional request.

This application architecture scales for production loads, providing capacity for processing thousands of concurrent Slack users (through the Slack platform itself), bypassing the need for direct AWS Management Console access. This architecture provides for easier extensibility of the chat application to support new commands as consumer needs arise.

Finally, the application’s architecture and implementation follow the AWS Well-Architected guidance for Operational Excellence, Security, Reliability, and Performance Efficiency.

CDK manages application infrastructure for the Operational Excellence Pillar.
Secrets Manager and DynamoDB encrypt data at rest for the Security Pillar.
This architecture scales horizontally with Lambda functions and Step Functions for the Reliability Pillar.
Serverless architecture removes the need to maintain physical servers for the Performance Efficiency Pillar.

Step Functions is suited for this example because the service supports integrations with many other AWS services. Step Functions allows this example to orchestrate interactions with Lambda functions, DynamoDB tables, and EventBridge event busses with minimal code.

This example takes advantage of Step Functions Express Workflows to support the high-volume, event-driven actions generated by the Slack app. The result of using Express Workflows is a responsive, scalable request processor capable of handling tens of thousands requests per second. To learn more, review the differences between standard and Express Workflows.

The presented example uses AWS Secrets Manager for storing and retrieving application credentials securely. AWS Secrets Manager provides the following benefits:

Central, secure storage of secrets via encryption-at-rest.
Ease of access management through AWS Identity and Access Management (IAM) permissions policies.
Out-of-the-box integration support with all AWS services comprising the architecture

In addition, this example uses the AWS Systems Manager Parameter Store service for persisting our application’s configuration data for the Slack Channel ID. Among the benefits offered by AWS System Manager, this example takes advantage of storing encrypted configuration data with versioning support.

This example features a variety of interactions for the Slack Mobile or Desktop application, including displaying a welcome screen, entering form information, and reporting process status. EventBridge enables this example to route requests from the Request Validator Step Function using the serverless Event Bus and decouple each action from one another. We configure Rules to associate an event to a request processor Step Function.

Walkthrough

As prerequisites, you need the following:

AWS CDK version 2.19.0 or later.
Node version 16+.
Docker-cli.
Git.
A personal or company Slack account with permissions to create applications.
The Slack Channel ID of a channel in your Workspace for integration with the Slack App.

Slack channel details

To get the Channel ID, open the context menu on the Slack channel and select View channel details. The modal displays your Channel ID at the bottom:

Slack bot channel details

To learn more, visit Slack’s Getting Started with Bolt for JavaScript page.

The README document for the GitHub Repository titled, Amazon Interactive Slack App Starter Kit, contains a comprehensive walkthrough, including detailed steps for:

Slack API configuration
Application deployment via AWS Cloud Development Kit (CDK)
Required updates for AWS Systems Manager Parameters and secrets

Demoing the Slack app

Start the Slack app by invoking the /my-slack-bot slash command.

Start sample Lambda invocation
From the My Slack Bot action menu, choose Sample Lambda.

Choosing sample Lambda

Choosing sample Lambda
Enter command input, choose Submit, then observe the response (this input value applies to the sample Lambda function).

Sample Lambda submit

Sample Lambda results
Start the Slack App by invoking the /my-slack-bot slash command, then select Sample State Machine:

Start sample state machine invocation

Select Sample State Machine
Enter command input, choose Submit, then observe the response (this input value applies to the downstream state machine)

Sample state machine submit

Sample state machine results

Constraints in this example

Slack has constraints for sending and receiving requests, but API Gateway’s mapping templates provide mechanisms for integrating with a variety of request constraints.

Slack uses application/x-www-form-urlencoded to send requests to a custom URL. We designed a request template to format the input from Slack into a consistent format for the Request Validator Step Function. Headers such as X-Slack-Signature and X-Slack-Request-Timestamp needed to be passed along to ensure the request from Slack was authentic.

Here is the request mapping template needed for the integration:

{
  "stateMachineArn": "arn:aws:states:us-east-1:827819197510:stateMachine:RequestValidatorB6FDBF18-FACc7f2PzNAv",
  "input": "{\"body\": \"$input.path('$')\", \"headers\": {\"X-Slack-Signature\": \"$input.params().header.get('X-Slack-Signature')\", \"X-Slack-Request-Timestamp\": \"$input.params().header.get('X-Slack-Request-Timestamp')\", \"Content-Type\": \"application/x-www-form-urlencoded\"}}"
}

Slack sends the message payload in two different formats: URL-encoded and JSON. Fortunately, the Slack Bolt for JavaScript library can merge the two request formats into a single JSON payload and handle verification.

Slack requires a 204 status response along with an empty body to recognize that a request was successful. An integration response template overrides the Step Function response into a format that Slack accepts.

Here is the response mapping template needed for the integration:

#set($context.responseOverride.status = 204)
{}

Conclusion

In this blog post, you learned how you can let your organization’s Slack users with the ability to invoke your existing AWS resources with no AWS Management Console or AWS CLI access. This serverless example lets you build your own custom workflows to meet your organization’s needs.

To learn more about the concepts discussed in this blog, visit:

For more serverless learning resources, visit Serverless Land.

Automating stopping and starting Amazon MWAA environments to reduce cost

2023-05-12 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/automating-stopping-and-starting-amazon-mwaa-environments-to-reduce-cost/

This was written by Uma Ramadoss, Specialist Integration Services, and Chandan Rupakheti, Solutions Architect.

This blog post shows how you can save cost by automating the stopping and starting of an Amazon Managed Workflows for Apache Airflow (Amazon MWAA) environment. It describes how you can retain the data stored in a metadata database and presents an automated solution you can use in your AWS account.

Customers run end to end data pipelines at scale with MWAA. It is a common best practice to run non-production environments for development and testing. A nonproduction environment often does not need to run throughout the day due to factors such as working hours of the development team. As there is no automatic way to stop an MWAA environment when not in use and deleting an environment causes metadata loss, customers often run it continually and pay the full cost.

Overview

Amazon MWAA has a distributed architecture with multiple components such as scheduler, worker, webserver, queue, and database. Customers build data pipelines as Directed Acyclic Graphs (DAGs) and run in Amazon MWAA. The DAGs use variables and connections from the Amazon MWAA metadata database. The history of DAG runs and related data are stored in the same metadata database. The database also stores other information such as user roles and permissions.

When you delete the Amazon MWAA environment, all the components including the database are deleted so that you do not incur any cost. As this normal deletion results in loss of metadata, you need a customized solution to back up the data and to automate the deletion and recreation.

The sample application deletes and recreates your MWAA environment at a scheduled interval defined by you using Amazon EventBridge Scheduler. It exports all metadata into an Amazon S3 bucket before deletion and imports the metadata back to the environment after creation. As this is a managed database and you cannot access the database outside the Amazon MWAA environment, it uses DAGs to import and export the data. The entire process is orchestrated using AWS Step Functions.

Deployment architecture

The sample application is in a GitHub repository. Use the instructions in the readme to deploy the application.

The sample application deploys the following resources –

A Step Functions state machine to orchestrate the steps needed to delete the MWAA environment.
A Step Functions state machine to orchestrate the steps needed to recreate the MWAA environment.
EventBridge Scheduler rules to trigger the state machines at the scheduled time.
An S3 bucket to store metadata database backup and environment details.
Two DAG files uploaded to the source S3 bucket configured with the MWAA environment. The export DAG exports metadata from the MWAA metadata database to backup S3 bucket. The import DAG restores the metadata from the backup S3 bucket to the newly created MWAA environment.
AWS Lambda functions for triggering the DAGs using MWAA CLI API.
A Step Functions state machine to wait for the long-running MWAA creation and deletion process.
Amazon EventBridge rule to notify on state machine failures.
Amazon Simple Notification Service (Amazon SNS) topic as a target to the EventBridge rule for failure notifications.
Amazon Interface VPC Endpoint for Step Functions for MWAA environment deployed in the private mode.

Stop workflow

At a scheduled time, Amazon EventBridge Scheduler triggers a Step Functions state machine to stop the MWAA environment. The state machine performs the following actions:

Fetch Amazon MWAA environment details such as airflow configurations, IAM execution role, logging configurations and VPC details.
If the environment is not in the “AVAILABLE” status, it fails the workflow by branching to the “Pausing unsuccessful” state.
Otherwise, it runs the normal workflow and stores the environment details in an S3 bucket so that Start workflow can recreate the environment with this data.
Trigger an MWAA DAG using AWS Lambda function to export metadata to the Amazon S3 bucket. This step uses Step Functions to wait for callback token integration.
Resume the workflow when the task token is returned from the MWAA DAG.
Delete Amazon MWAA environment.
Wait to confirm the deletion.

Start workflow

At a scheduled time, EventBridge Scheduler triggers the Step Functions state machine to recreate the MWAA environment. The steps in the state machine perform the following actions:

Retrieve the environment details stored in Amazon S3 bucket by the stop workflow.
Create an MWAA environment with the same configuration as the original.
Trigger an MWAA DAG through the Lambda function to restore the metadata from the S3 bucket to the newly created environment.

Cost savings

Consider a small MWAA environment in us-east-2 with a minimum of one worker, a maximum of one worker, and 1GB data storage. At the time of this writing, the monthly cost of the environment is $357.80. Let’s assume you use this environment between 6 am and 6 pm on weekdays.

The schedule in the env file of the sample application looks like:

MWAA_PAUSE_CRON_SCHEDULE=’0 18 ? * MON-FRI *’
MWAA_RESUME_CRON_SCHEDULE=’30 5 ? * MON-FRI *’

As MWAA environment creation takes anywhere between 20 and 30 minutes, the MWAA_RESUME_CRON_SCHEDULE is set at 5.30 pm.

Assuming 21 weekdays per month, the monthly cost of the environment is $123.48 and is 65.46% less compared to running the environment continuously:

21 weekdays * 12 hours * 0.49 USD per hour = $123.48

Additional considerations

The sample application only restores at-store data. Though the deletion process pauses all the DAGs before making the backup, it cannot stop any running tasks or in-flight messages in the queue. It also does not backup tasks that are not in completed state. This can result in task history loss for the tasks that were running during the backup.

Over time, the metadata grows in size, which can increase latency in query performance. You can use a DAG as shown in the example to clean up the database regularly.

Avoid setting the catchup by default configuration flag in the environment setting to true or in the DAG definition unless it is required. Catch up feature runs all the DAG runs that are missed for any data interval. When the environment is created again, if the flag is true, it catches up with the missed DAG runs and can overload the environment.

Conclusion

Automating the deletion and recreation of Amazon MWAA environments is a powerful solution for cost optimization and efficient management of resources. By following the steps outlined in this blog post, you can ensure that your MWAA environment is deleted and recreated without losing any of the metadata or configurations. This allows you to deploy new code changes and updates more quickly and easily, without having to configure your environment each time manually.

The potential cost savings of running your MWAA environment for only 12 hours on weekdays are significant. The example shows how you can save up to 65% of your monthly costs by choosing this option. This makes it an attractive solution for organizations that are looking to reduce cost while maintaining a high level of performance.

Visit the samples repository to learn more about Amazon MWAA. It contains a wide variety of examples and templates that you can use to build your own applications.

For more serverless learning resources, visit Serverless Land.

Monitor Amazon SNS-based applications end-to-end with AWS X-Ray active tracing

2023-05-11 Pascal Vogel

Post Syndicated from Pascal Vogel original https://aws.amazon.com/blogs/compute/monitor-amazon-sns-based-applications-end-to-end-with-aws-x-ray-active-tracing/

This post is written by Daniel Lorch, Senior Consultant and David Mbonu, Senior Solutions Architect.

Amazon Simple Notification Service (Amazon SNS), a messaging service that provides high-throughput, push-based, many-to-many messaging between distributed systems, microservices, and event-driven serverless applications, now supports active tracing with AWS X-Ray.

With AWS X-Ray active tracing enabled for SNS, you can identify bottlenecks and monitor the health of event-driven applications by looking at segment details for SNS topics, such as resource metadata, faults, errors, and message delivery latency for each subscriber.

This blog post reviews common use cases where AWS X-Ray active tracing enabled for SNS provides a consistent view of tracing data across AWS services in real-world scenarios. We cover two architectural patterns which allow you to gain accurate visibility of your end-to-end tracing: SNS to Amazon Simple Queue Service (Amazon SQS) queues and SNS topics to Amazon Kinesis Data Firehose streams.

Getting started with the sample serverless application

To demonstrate AWS X-Ray active tracing for SNS, we will use the Wild Rydes serverless application as shown in the following figure. The application uses a microservices architecture which implements asynchronous messaging for integrating independent systems.

Wild Rydes serverless application architecture

This is how the sample serverless application works:

An Amazon API Gateway receives ride requests from users.
An AWS Lambda function processes ride requests.
An Amazon DynamoDB table serves as a store for rides.
An SNS topic serves as a fan-out for ride requests.
Individual SQS queues and Lambda functions are set up for processing requests via various back-office services (customer notification, customer accounting, and others).
An SNS message filter is in place for the subscription of the extraordinary rides service.
A Kinesis Data Firehose delivery stream archives ride requests in an Amazon Simple Storage Service (Amazon S3) bucket.

Deploying the sample serverless application

Prerequisites

Deployment steps using AWS SAM

The sample application is provided as an AWS SAM infrastructure as code template.

This demonstrative application will deploy an API without authorization. Please consider controlling and managing access to your APIs.

Clone the GitHub repository:

git clone https://github.com/aws-samples/sns-xray-active-tracing-blog-source-code
cd sns-xray-active-tracing-blog-source-code

Build the lab artifacts from source:
```
sam build
```

Deploy the sample solution into your AWS account:

export AWS_REGION=$(aws --profile default configure get region)
sam deploy \
--stack-name wild-rydes-async-msg-2 \
--capabilities CAPABILITY_IAM \
--region $AWS_REGION \
--guided

Confirm SubmitRideCompletionFunction may not have authorization defined, Is this okay? [y/N]: with yes.

Wait until the stack reaches status CREATE_COMPLETE.

See the sample application README.md for detailed deployment instructions.

Testing the application

Once the application is successfully deployed, generate messages and validate that the SNS topic is publishing all messages:

Look up the API Gateway endpoint:

export AWS_REGION=$(aws --profile default configure get region)
aws cloudformation describe-stacks \
--stack-name wild-rydes-async-msg-2 \
--query 'Stacks[].Outputs[?OutputKey==`UnicornManagementServiceApiSubmitRideCompletionEndpoint`].OutputValue' \
--output text

Store this API Gateway endpoint in an environment variable:

export ENDPOINT=$(aws cloudformation describe-stacks \
--stack-name wild-rydes-async-msg-2 \
--query 'Stacks[].Outputs[?OutputKey==`UnicornManagementServiceApiSubmitRideCompletionEndpoint`].OutputValue' \
--output text)

Send requests to the submit ride completion endpoint by executing the following command five or more times with varying payloads:

curl -XPOST -i -H "Content-Type\:application/json" -d '{ "from": "Berlin", "to": "Frankfurt", "duration": 420, "distance": 600, "customer": "cmr", "fare": 256.50 }' $ENDPOINT

Validate that messages are being passed in the application using the CloudWatch service map:

See the sample application README.md for detailed testing instructions.

The sample application shows various use-cases, which are described in the following sections.

Amazon SNS to Amazon SQS fanout scenario

A common application integration scenario for SNS is the Fanout scenario. In the Fanout scenario, a message published to an SNS topic is replicated and pushed to multiple endpoints, such as SQS queues. This allows for parallel asynchronous processing and is a common application integration pattern used in event-driven application architectures.

When an SNS topic fans out to SQS queues, the pattern is called topic-queue-chaining. This means that you add a queue, in our case an SQS queue, between the SNS topic and each of the subscriber services. As messages are buffered in a persistent manner in an SQS queue, no message is lost should a subscriber process run into issues for multiple hours or days, or experience exceptions or crashes.

By placing an SQS queue in front of each subscriber service, you can leverage the fact that a queue can act as a buffering load balancer. As every queue message is delivered to one of potentially many consumer processes, subscriber services can be easily scaled out and in, and the message load is distributed over the available consumer processes. In an event where suddenly a large number of messages arrives, the number of consumer processes has to be scaled out to cope with the additional load. This takes time and you need to wait until additional processes become operational. Since messages are buffered in the queue, you do not lose any messages in the process.

To summarize, in the Fanout scenario or the topic-queue-chaining pattern:

SNS replicates and pushes the message to multiple endpoints.
SQS decouples sending and receiving endpoints.

With AWS X-Ray active tracing enabled on the SNS topic, the CloudWatch service map shows us the complete application architecture, as follows.

Prior to the introduction of AWS X-Ray active tracing on the SNS topic, the AWS X-Ray service would not be able to reconstruct the full service map and the SQS nodes would be missing from the diagram.

To see the integration without AWS X-Ray active tracing enabled, open template.yaml and navigate to the resource RideCompletionTopic. Comment out the property TracingConfig: Active, redeploy and test the solution. The service map should then show an incomplete diagram where the SNS topic is linked directly to the consumer Lambda functions, omitting the SQS nodes.

For this use case, given the Fanout scenario, enabling AWS X-Ray active tracing on the SNS topic provides full end-to-end observability of the traces available in the application.

Amazon SNS to Amazon Kinesis Data Firehose delivery streams for message archiving and analytics

SNS is commonly used with Kinesis Data Firehose delivery streams for message archival and analytics use-cases. You can use SNS topics with Kinesis Data Firehose subscriptions to capture, transform, buffer, compress and upload data to Amazon S3, Amazon Redshift, Amazon OpenSearch Service, HTTP endpoints, and third-party service providers.

We will implement this pattern as follows:

An SNS topic to replicate and push the message to its subscribers.
A Kinesis Data Firehose delivery stream to capture and buffer messages.
An S3 bucket to receive uploaded messages for archival.

In order to demonstrate this pattern, an additional consumer has been added to the SNS topic. The same Fanout pattern applies and the Kinesis Data Firehose delivery stream receives messages from the SNS topic alongside the existing consumers.

The Kinesis Data Firehose delivery stream buffers messages and is configured to deliver them to an S3 bucket for archival purposes. Optionally, an SNS message filter could be added to this subscription to select relevant messages for archival.

With AWS X-Ray active tracing enabled on the SNS topic, the Kinesis Data Firehose node will appear on the CloudWatch service map as a separate entity, as can be seen in the following figure. It is worth noting that the S3 bucket does not appear on the CloudWatch service map as Kinesis does not yet support AWS X-Ray active tracing at the time of writing of this blog post.

Prior to the introduction of AWS X-Ray active tracing on the SNS topic, the AWS X-Ray service would not be able to reconstruct the full service map and the Kinesis Data Firehose node would be missing from the diagram. To see the integration without AWS X-Ray active tracing enabled, open template.yaml and navigate to the resource RideCompletionTopic. Comment out the property TracingConfig: Active, redeploy and test the solution. The service map should then show an incomplete diagram where the Kinesis Data Firehose node is missing.

For this use case, given the data archival scenario with Kinesis Delivery Firehose, enabling AWS X-Ray active tracing on the SNS topic provides additional visibility on the Kinesis Data Firehose node in the CloudWatch service map.

Review faults, errors, and message delivery latency on the AWS X-Ray trace details page

The AWS X-Ray trace details page provides a timeline with resource metadata, faults, errors, and message delivery latency for each segment.

With AWS X-Ray active tracing enabled on SNS, additional segments for the SNS topic itself, but also the downstream consumers (AWS::SNS::Topic, AWS::SQS::Queue and AWS::KinesisFirehose) segments are available, providing additional faults, errors, and message delivery latency for these segments. This allows you to analyze latencies in your messages and their backend services. For example, how long a message spends in a topic, and how long it took to deliver the message to each of the topic’s subscriptions.

Enabling AWS X-Ray active tracing for SNS

AWS X-Ray active tracing is not enabled by default on SNS topics and needs to be explicitly enabled.

The example application used in this blog post demonstrates how to enable active tracing using AWS SAM.

You can enable AWS X-Ray active tracing using the SNS SetTopicAttributes API, SNS Management Console, or via AWS CloudFormation. See Active tracing in Amazon SNS in the Amazon SNS Developer Guide for more options.

Cleanup

To clean up the resources provisioned as part of the sample serverless application, follow the instructions as outlined in the sample application README.md.

Conclusion

AWS X-Ray active tracing for SNS enables end-to-end visibility in real-world scenarios involving patterns like SNS to SQS and SNS to Amazon Kinesis.

But it is not only useful for these patterns. With AWS X-Ray active tracing enabled for SNS, you can identify bottlenecks and monitor the health of event-driven applications by looking at segment details for SNS topics and consumers, such as resource metadata, faults, errors, and message delivery latency for each subscriber.

Enable AWS X-Ray active tracing for SNS to gain accurate visibility of your end-to-end tracing.

For more serverless learning resources, visit Serverless Land.

Debugging SnapStart-enabled Lambda functions made easy with AWS X-Ray

2023-05-11 Marcia Villalba

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/compute/debugging-snapstart-enabled-lambda-functions-made-easy-with-aws-x-ray/

This post is written by Rahul Popat (Senior Solutions Architect) and Aneel Murari (Senior Solutions Architect)

Today, AWS X-Ray is announcing support for SnapStart-enabled AWS Lambda functions. Lambda SnapStart is a performance optimization that significantly improves the cold startup times for your functions. Announced at AWS re:Invent 2022, this feature delivers up to 10 times faster function startup times for latency-sensitive Java applications at no extra cost, and with minimal or no code changes.

X-Ray is a distributed tracing system that provides an end-to-end view of how an application is performing. X-Ray collects data about requests that your application serves and provides tools you can use to gain insight into opportunities for optimizations. Now you can use X-Ray to gain insights into the performance improvements of your SnapStart-enabled Lambda function.

With today’s feature launch, by turning on X-Ray tracing for SnapStart-enabled Lambda functions, you see separate subsegments corresponding to the Restore and Invoke phases for your Lambda function’s execution.

How does Lambda SnapStart work?

With SnapStart, the function’s initialization is done ahead of time when you publish a function version. Lambda takes an encrypted snapshot of the initialized execution environment and persists the snapshot in a tiered cache for low latency access.

When the function is first invoked or scaled, Lambda restores the cached execution environment from the persisted snapshot instead of initializing anew. This results in reduced startup times.

X-Ray tracing before this feature launch

Using an example of a Hello World application written in Java, a Lambda function is configured with SnapStart and fronted by Amazon API Gateway:

Before today’s launch, X-Ray was not supported for SnapStart-enabled Lambda functions. So if you had enabled X-Ray tracing for API Gateway, the X-Ray trace for the sample application would look like:

The trace only shows the overall duration of the Lambda service call. You do not have insight into your function’s execution or the breakdown of different phases of Lambda function lifecycle.

Next, enable X-Ray for your Lambda function and see how you can view a breakdown of your function’s total execution duration.

Prerequisites for enabling X-Ray for SnapStart-enabled Lambda function

SnapStart is only supported for Lambda functions with Java 11 and newly launched Java 17 managed runtimes. You can only enable SnapStart for the published versions of your Lambda function. Once you’ve enabled SnapStart, Lambda publishes all subsequent versions with snapshots. You may also create a Lambda function alias, which points to the published version of your Lambda function.

Make sure that the Lambda function’s execution role has appropriate permissions to write to X-Ray.

Enabling AWS X-Ray for your Lambda function with SnapStart

You can enable X-Ray tracing for your Lambda function using AWS Management Console, AWS Command Line Interface (AWS CLI), AWS Serverless Application Model (AWS SAM), AWS CloudFormation template, or via AWS Cloud Deployment Kit (CDK).

This blog shows how you can achieve this via AWS Management Console and AWS SAM. For more information on enabling SnapStart and X-Ray using other methods, refer to AWS Lambda Developer Guide.

Enabling SnapStart and X-Ray via AWS Management Console

To enable SnapStart and X-Ray for Lambda function via the AWS Management Console:

Navigate to your Lambda Function.
On the Configuration tab, choose Edit and change the SnapStart attribute value from None to PublishedVersions.
Choose Save.

To enable X-Ray via the AWS Management Console:

Navigate to your Lambda Function.
On the Configuration tab, scroll down to the Monitoring and operations tools card and choose Edit.
Under AWS X-Ray, enable Active tracing.
Choose Save

To publish a new version of Lambda function via the AWS Management Console:

Navigate to your Lambda Function.
On the Version tab, choose Publish new version.
Verify that PublishedVersions is shown below SnapStart.
Choose Publish.

To create an alias for a published version of your Lambda function via the AWS Management Console:

Navigate to your Lambda Function.
On the Aliases tab, choose Create alias.
Provide a Name for an alias and select a Version of your Lambda function to point the alias to.
Choose Save.

Enabling SnapStart and X-Ray via AWS SAM

To enable SnapStart and X-Ray for Lambda function via AWS SAM:

1. Enable Lambda function versions and create an alias by adding a AutoPublishAlias property in template.yaml file. AWS SAM automatically publishes a new version for each new deployment and automatically assigns the alias to the newly published version.
```
Resources:
  my-function:
    type: AWS::Serverless::Function
    Properties:
      […]
      AutoPublishAlias: live
```
2. Enable SnapStart on Lambda function by adding the SnapStart property in template.yaml file.
```
Resources: 
  my-function: 
    type: AWS::Serverless::Function 
    Properties: 
      […] 
      SnapStart:
       ApplyOn: PublishedVersions
```
3. Enable X-Ray for Lambda function by adding the Tracing property in template.yaml file.
```
Resources:
  my-function:
    type: AWS::Serverless::Function
    Properties:
      […]
      Tracing: Active 
```

You can find the complete AWS SAM template for the preceding example in this GitHub repository.

Using X-Ray to gain insights into SnapStart-enabled Lambda function’s performance

To demonstrate X-Ray integration for your Lambda function with SnapStart, you can build, deploy, and test the sample Hello World application using AWS SAM CLI. To do this, follow the instructions in the README file of the GitHub project.

The build and deployment output with AWS SAM looks like this:

Once your application is deployed to your AWS account, note that SnapStart and X-Ray tracing is enabled for your Lambda function. You should also see an alias `live` created against the published version of your Lambda function.

You should also have an API deployed via API Gateway, which is pointing to the `live` alias of your Lambda function as the backend integration.

Now, invoke your API via `curl` command or any other HTTP client. Make sure to replace the url with your own API’s url.

$ curl --location --request GET https://{rest-api-id}.execute-api.{region}.amazonaws.com/{stage}/hello

Navigate to Amazon CloudWatch and under the X-Ray service map, you see a visual representation of the trace data generated by your application.

Under Traces, you can see the individual traces, Response code, Response time, Duration, and other useful metrics.

Select a trace ID to see the breakdown of total Duration on your API call.

You can now see the complete trace for the Lambda function’s invocation with breakdown of time taken during each phase. You can see the Restore duration and actual Invocation duration separately.

Restore duration shown in the trace includes the time it takes for Lambda to restore a snapshot on the microVM, load the runtime (JVM), and run any afterRestore hooks if specified in your code. Note that, the process of restoring snapshots can include time spent on activities outside the microVM. This time is not reported in the Restore sub-segment, but is part of the AWS::Lambda segment in X-Ray traces.

This helps you better understand the latency of your Lambda function’s execution, and enables you to identify and troubleshoot the performance issues and errors.

Conclusion

This blog post shows how you can enable AWS X-Ray for your Lambda function enabled with SnapStart, and measure the end-to-end performance of such functions using X-Ray console. You can now see a complete breakdown of your Lambda function’s execution time. This includes Restore duration along with the Invocation duration, which can help you to understand your application’s startup times (cold starts), diagnose slowdowns, or troubleshoot any errors and timeouts.

To learn more about the Lambda SnapStart feature, visit the AWS Lambda Developer Guide.

For more serverless learning resources, visit Serverless Land.

Implementing cross-account CI/CD with AWS SAM for container-based Lambda functions

2023-05-10 Eric Johnson

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/implementing-cross-account-cicd-with-aws-sam-for-container-based-lambda/

This post is written by Chetan Makvana, Sr. Solutions Architect.

Customers use modular architectural patterns and serverless operational models to build more sustainable, scalable, and resilient applications in modern application development. AWS Lambda is a popular choice for building these applications.

If customers have invested in container tooling for their development workflows, they deploy to Lambda using the container image packaging format for workloads like machine learning inference or data intensive workloads. Using functions deployed as container images, customers benefit from the same operation simplicity, automation scaling, high availability, and native integration with many services.

Containerized applications often have several distinct environments and accounts, such as dev, test, and prod. An application has to go through a process of deployment and testing in these environments. One common pattern for deploying containerized applications is to have a central AWS create a single container image, and carry out deployment across other AWS accounts. To achieve automated deployment of the application across different environments, customers use CI/CD pipelines with familiar container tooling.

This blog post explores how to use AWS Serverless Application Model (AWS SAM) Pipelines to create a CI/CD deployment pipeline and deploy a container-based Lambda function across multiple accounts.

Solution overview

This example comprises three accounts: tooling, test, and prod. The tooling account is a central account where you provision the pipeline, and build the container. The pipeline deploys the container into Lambda in the test and prod accounts using AWS CodeBuild. It also requires the necessary resources in the test and prod account. This consists of an Identity and Access Management (IAM) role that trusts the tooling account and provides the required deployment-specific permissions. AWS CodeBuild assumes this IAM role in the tooling account to carry out deployment.

The solution uses AWS SAM Pipelines to create CI/CD deployment pipeline resources. It provides commands to generate the required AWS infrastructure resources and a pipeline configuration file that CI/CD system can use to deploy using AWS SAM. Find the example code for this solution in the GitHub repository.

Full solution architecture

AWS CodePipeline goes through these steps to deploy the container-based Lambda function in the test and prod accounts:

The developer commits the code of Lambda function into AWS CodeCommit or other source control repositories, which triggers the CI/CD workflow.
AWS CodeBuild builds the code, creates a container image, and pushes the image to the Amazon Elastic Container Registry (ECR) repository using AWS SAM.
AWS CodeBuild assumes a cross-account role for the test account.
AWS CodeBuild uses AWS SAM to deploy the Lambda function by pulling image from Amazon ECR.
If deployment is successful, AWS CodeBuild deploys the same image in prod account using AWS SAM.

Deploying the example

Prerequisites

An AWS account. The IAM user that you use must have sufficient permissions to make necessary AWS service calls and manage AWS resources.
AWS CLI installed and configured.
Git installed.
AWS SAM installed.
Setup .aws/credentials named profiles for tooling, test and prod accounts so you can run CLI and AWS SAM commands against them.

Set the TOOLS_ACCOUNT_ID, TEST_ACCOUNT_ID, and PROD_ACCOUNT_ID env variables:

export TOOLS_ACCOUNT_ID=<Tooling Account Id>
export TEST_ACCOUNT_ID=<Test Account Id>
export PROD_ACCOUNT_ID=<Prod Account Id>

Creating a Git Repository and pushing the code

Run the following command in the tooling account from your terminal to create a new CodeCommit repository:

aws codecommit create-repository --repository-name lambda-container-repo --profile tooling

Initialize the Git repository and push the code.

cd ~/environment/cicd-lambda-container
git init -b main
git add .
git commit -m "Initial commit"
git remote add origin codecommit://lambda-container-repo
git push -u origin main

Creating cross-account roles in test and prod accounts

For the pipeline to gain access to the test and production environment, it must assume an IAM role. In cross-account scenario, the IAM role for the pipeline must be created on the test and production accounts.

Change directory to the directory templates and run the following command to deploy roles to test and prod using respective named profiles.

Test Profile

cd ~/environment/cicd-lambda-container/templates
aws cloudformation deploy --template-file crossaccount_pipeline_roles.yml --stack-name codepipeline-crossaccount-roles --capabilities CAPABILITY_NAMED_IAM --profile test --parameter-overrides ToolAccountID=${TOOLS_ACCOUNT_ID}
aws cloudformation describe-stacks --stack-name codepipeline-crossaccount-roles --query "Stacks[0].Outputs" --output json --profile test

Open the codepipeline_parameters.json file from the root directory. Replace the value of TestCodePipelineCrossAccountRoleArn and TestCloudFormationCrossAccountRoleArn with the CloudFormation output value of CodePipelineCrossAccountRole and CloudFormationCrossAccountRole respectively.

Prod Profile

aws cloudformation deploy --template-file crossaccount_pipeline_roles.yml --stack-name codepipeline-crossaccount-roles --capabilities CAPABILITY_NAMED_IAM --profile prod --parameter-overrides ToolAccountID=${TOOLS_ACCOUNT_ID}
aws cloudformation describe-stacks --stack-name codepipeline-crossaccount-roles --query "Stacks[0].Outputs" --output json --profile prod

Open the codepipeline_parameters.json file from the root directory. Replace the value of ProdCodePipelineCrossAccountRoleArn and ProdCloudFormationCrossAccountRoleArn with the CloudFormation output value of CodePipelineCrossAccountRole and CloudFormationCrossAccountRole respectively.

Creating the required IAM roles and infrastructure in the tooling account

Change to the templates directory and run the following command using tooling named profile:

aws cloudformation deploy --template-file tooling_resources.yml --stack-name tooling-resources --capabilities CAPABILITY_NAMED_IAM --parameter-overrides TestAccountID=${TEST_ACCOUNT_ID} ProdAccountID=${PROD_ACCOUNT_ID} --profile tooling
aws cloudformation describe-stacks --stack-name tooling-resources --query "Stacks[0].Outputs" --output json --profile tooling

Open the codepipeline_parameters.json file from the root directory. Replace value of ImageRepositoryURI, ArtifactsBucket, ToolingCodePipelineExecutionRoleArn, and ToolingCloudFormationExecutionRoleArn with the corresponding CloudFormation output value.

Updating cross-account IAM roles

The cross-account IAM roles on the test and production account require permission to access artifacts that contain application code (S3 bucket and ECR repository). Note that the cross-account roles are deployed twice. This is because there is a circular dependency on the roles in the test and prod accounts and the pipeline artifact resources provisioned in the tooling account.

The pipeline must reference and resolve the ARNs of the roles it needs to assume to deploy the application to the test and prod accounts, so the roles must be deployed before the pipeline is provisioned. However, the policies attached to the roles need to include the S3 bucket and ECR repository. But the S3 bucket and ECR repository don’t exist until the resources deploy in the preceding step. By deploying the roles twice, once without a policy so their ARNs resolve, and a second time to attach policies to the existing roles that reference the resources in the tooling account.

Replace ImageRepositoryArn and ArtifactBucketArn with output value from the above step in the below command and run it from the templates directory using Test and Prod named profiles.

Test Profile

aws cloudformation deploy --template-file crossaccount_pipeline_roles.yml --stack-name codepipeline-crossaccount-roles --capabilities CAPABILITY_NAMED_IAM --profile test --parameter-overrides ToolAccountID=${TOOLS_ACCOUNT_ID} ImageRepositoryArn=&lt;ImageRepositoryArn value&gt; ArtifactsBucketArn=&lt;ArtifactsBucketArn value&gt;

Prod Profile

aws cloudformation deploy --template-file crossaccount_pipeline_roles.yml --stack-name codepipeline-crossaccount-roles --capabilities CAPABILITY_NAMED_IAM --profile prod --parameter-overrides ToolAccountID=${TOOLS_ACCOUNT_ID} ImageRepositoryArn=&lt;ImageRepositoryArn value&gt; ArtifactsBucketArn=&lt;ArtifactsBucketArn value&gt;

Deploying the pipeline

Replace DeploymentRegion value with the current Region and CodeCommitRepositoryName value with the CodeCommit repository name in codepipeline_parameters.json file.

Push the changes to CodeCommit repository using Git commands.

Replace CodeCommitRepositoryName value with the CodeCommit repository name created in the first step and run the following command from the root directory of the project using tooling named profile.

sam deploy -t codepipeline.yaml --stack-name cicd-lambda-container-pipeline --capabilities=CAPABILITY_IAM --parameter-overrides CodeCommitRepositoryName=&lt;CodeCommit Repository Name&gt; --profile tooling

Cleaning Up

Run the following command in root directory of the project to delete the pipeline:
```
sam delete --stack-name cicd-lambda-container-pipeline --profile tooling
```
Empty the artifacts bucket. Replace the artifacts bucket name with the output value from the preceding step:
```
aws s3 rm s3://&lt;Arifacts bucket name&gt; --recursive --profile tooling
```

Delete the Lambda functions from the test and prod accounts:

aws cloudformation delete-stack --stack-name lambda-container-app-test --profile test
aws cloudformation delete-stack --stack-name lambda-container-app-prod --profile prod

Delete cross-account roles from the test and prod accounts:

aws cloudformation delete-stack --stack-name codepipeline-crossaccount-roles --profile test
aws cloudformation delete-stack --stack-name codepipeline-crossaccount-roles --profile prod

Delete the ECR repository:

aws ecr delete-repository --repository-name image-repository --profile tooling --force

Delete resources from the tooling account:

aws cloudformation delete-stack --stack-name tooling-resources --profile tooling

Conclusion

This blog post discusses how to automate deployment of container-based Lambda across multiple accounts using AWS SAM Pipelines.

Navigate to the GitHub repository and review the implementation to see how CodePipeline pushes container image to Amazon ECR, and deploys image to Lambda using cross-account role. Examine the codepipeline.yml file to see how the AWS SAM Pipelines creates CI/CD resources using this template.

For more serverless learning resources, visit Serverless Land.

Extending a serverless, event-driven architecture to existing container workloads

2023-05-04 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/extending-a-serverless-event-driven-architecture-to-existing-container-workloads/

This post is written by Dhiraj Mahapatro, Principal Specialist SA, and Sascha Moellering, Principal Specialist SA, and Emily Shea, WW Lead, Integration Services.

Many serverless services are a natural fit for event-driven architectures (EDA), as events invoke them and only run when there is an event to process. When building in the cloud, many services emit events by default and have built-in features for managing events. This combination allows customers to build event-driven architectures easier and faster than ever before.

The insurance claims processing sample application in this blog series uses event-driven architecture principles and serverless services like AWS Lambda, AWS Step Functions, Amazon API Gateway, Amazon EventBridge, and Amazon SQS.

When building an event-driven architecture, it’s likely that you have existing services to integrate with the new architecture, ideally without needing to make significant refactoring changes to those services. As services communicate via events, extending applications to new and existing microservices is a key benefit of building with EDA. You can write those microservices in different programming languages or running on different compute options.

This blog post walks through a scenario of integrating an existing, containerized service (a settlement service) to the serverless, event-driven insurance claims processing application described in this blog post.

Overview of sample event-driven architecture

The sample application uses a front-end to sign up a new user and allow the user to upload images of their car and driver’s license. Once signed up, they can file a claim and upload images of their damaged car. Previously, it did not yet integrate with a settlement service for completing the claims and settlement process.

In this scenario, the settlement service is a brownfield application that runs Spring Boot 3 on Amazon ECS with AWS Fargate. AWS Fargate is a serverless, pay-as-you-go compute engine that lets you focus on building container applications without managing servers.

The Spring Boot application exposes a REST endpoint, which accepts a POST request. It applies settlement business logic and creates a settlement record in the database for a car insurance claim. Your goal is to make settlement work with the new EDA application that is designed for claims processing without re-architecting or rewriting. Customer, claims, fraud, document, and notification are the other domains that are shown as blue-colored boxes in the following diagram:

Project structure

The application uses AWS Cloud Development Kit (CDK) to build the stack. With CDK, you get the flexibility to create modular and reusable constructs imperatively using your language of choice. The sample application uses TypeScript for CDK.

The following project structure enables you to build different bounded contexts. Event-driven architecture relies on the choreography of events between domains. The object oriented programming (OOP) concept of CDK helps provision the infrastructure to separate the domain concerns while loosely coupling them via events.

You break the higher level CDK constructs down to these corresponding domains:

Application and infrastructure code are present in each domain. This project structure creates a seamless way to add new domains like settlement with its application and infrastructure code without affecting other areas of the business.

With the preceding structure, you can use the settlement-service.ts CDK construct inside claims-processing-stack.ts:

const settlementService = new SettlementService(this, "SettlementService", {
  bus,
});

The only information the SettlementService construct needs to work is the EventBridge custom event bus resource that is created in the claims-processing-stack.ts.

To run the sample application, follow the setup steps in the sample application’s README file.

Existing container workload

The settlement domain provides a REST service to the rest of the organization. A Docker containerized Spring Boot application runs on Amazon ECS with AWS Fargate. The following sequence diagram shows the synchronous request-response flow from an external REST client to the service:

External REST client makes POST /settlement call via an HTTP API present in front of an internal Application Load Balancer (ALB).
SettlementController.java delegates to SettlementService.java.
SettlementService applies business logic and calls SettlementRepository for data persistence.
SettlementRepository persists the item in the Settlement DynamoDB table.

A request to the HTTP API endpoint looks like:

curl --location <settlement-api-endpoint-from-cloudformation-output> \
--header 'Content-Type: application/json' \
--data '{
  "customerId": "06987bc1-1234-1234-1234-2637edab1e57",
  "claimId": "60ccfe05-1234-1234-1234-a4c1ee6fcc29",
  "color": "green",
  "damage": "bumper_dent"
}'

The response from the API call is:

You can learn more here about optimizing Spring Boot applications on AWS Fargate.

Extending container workload for events

To integrate the settlement service, you must update the service to receive and emit events asynchronously. The core logic of the settlement service remains the same. When you file a claim, upload damaged car images, and the application detects no document fraud, the settlement domain subscribes to Fraud.Not.Detected event and applies its business logic. The settlement service emits an event back upon applying the business logic.

The following sequence diagram shows a new interface in settlement to work with EDA. The settlement service subscribes to events that a producer emits. Here, the event producer is the fraud service that puts an event in an EventBridge custom event bus.

Producer emits Fraud.Not.Detected event to EventBridge custom event bus.
EventBridge evaluates the rules provided by the settlement domain and sends the event payload to the target SQS queue.
SubscriberService.java polls for new messages in the SQS queue.
On message, it transforms the message body to an input object that is accepted by SettlementService.
It then delegates the call to SettlementService, similar to how SettlementController works in the REST implementation.
SettlementService applies business logic. The flow is like the REST use case from 7 to 10.
On receiving the response from the SettlementService, the SubscriberService transforms the response to publish an event back to the event bus with the event type as Settlement.Finalized.

The rest of the architecture consumes this Settlement.Finalized event.

Using EventBridge schema registry and discovery

Schema enforces a contract between a producer and a consumer. A consumer expects the exact structure of the event payload every time an event arrives. EventBridge provides schema registry and discovery to maintain this contract. The consumer (the settlement service) can download the code bindings and use them in the source code.

Enable schema discovery in EventBridge before downloading the code bindings and using them in your repository. The code bindings provide a marshaller that unmarshals the incoming event from SQS queue to a plain old Java object (POJO) FraudNotDetected.java. You download the code bindings using the choice of your IDE. AWS Toolkit for IntelliJ makes it convenient to download and use them.

The final architecture for the settlement service with REST and event-driven architecture looks like:

Transition to become fully event-driven

With the new capability to handle events, the Spring Boot application now supports both the REST endpoint and the event-driven architecture by running the same business logic through different interfaces. In this example scenario, as the event-driven architecture matures and the rest of the organization adopts it, the need for the POST endpoint to save a settlement may diminish. In the future, you can deprecate the endpoint and fully rely on polling messages from the SQS queue.

You start with using an ALB and Fargate service CDK ECS pattern:

const loadBalancedFargateService = new ecs_patterns.ApplicationLoadBalancedFargateService(
  this,
  "settlement-service",
  {
    cluster: cluster,
    taskImageOptions: {
      image: ecs.ContainerImage.fromDockerImageAsset(asset),
      environment: {
        "DYNAMODB_TABLE_NAME": this.table.tableName
      },
      containerPort: 8080,
      logDriver: new ecs.AwsLogDriver({
        streamPrefix: "settlement-service",
        mode: ecs.AwsLogDriverMode.NON_BLOCKING,
        logRetention: RetentionDays.FIVE_DAYS,
      })
    },
    memoryLimitMiB: 2048,
    cpu: 1024,
    publicLoadBalancer: true,
    desiredCount: 2,
    listenerPort: 8080
  });

To adapt to EDA, you update the resources to retrofit the SQS queue to receive messages and EventBridge to put events. Add new environment variables to the ApplicationLoadBalancerFargateService resource:

environment: {
  "SQS_ENDPOINT_URL": queue.queueUrl,
  "EVENTBUS_NAME": props.bus.eventBusName,
  "DYNAMODB_TABLE_NAME": this.table.tableName
}

Grant the Fargate task permission to put events in the custom event bus and consume messages from the SQS queue:

props.bus.grantPutEventsTo(loadBalancedFargateService.taskDefinition.taskRole);
queue.grantConsumeMessages(loadBalancedFargateService.taskDefinition.taskRole);

When you transition the settlement service to become fully event-driven, you do not need the HTTP API endpoint and ALB anymore, as SQS is the source of events.

A better alternative is to use QueueProcessingFargateService ECS pattern for the Fargate service. The pattern provides auto scaling based on the number of visible messages in the SQS queue, besides CPU utilization. In the following example, you can also add two capacity provider strategies while setting up the Fargate service: FARGATE_SPOT and FARGATE. This means, for every one task that is run using FARGATE, there are two tasks that use FARGATE_SPOT. This can help optimize cost.

const queueProcessingFargateService = new ecs_patterns.QueueProcessingFargateService(this, 'Service', {
  cluster,
  memoryLimitMiB: 1024,
  cpu: 512,
  queue: queue,
  image: ecs.ContainerImage.fromDockerImageAsset(asset),
  desiredTaskCount: 2,
  minScalingCapacity: 1,
  maxScalingCapacity: 5,
  maxHealthyPercent: 200,
  minHealthyPercent: 66,
  environment: {
    "SQS_ENDPOINT_URL": queueUrl,
    "EVENTBUS_NAME": props?.bus.eventBusName,
    "DYNAMODB_TABLE_NAME": tableName
  },
  capacityProviderStrategies: [
    {
      capacityProvider: 'FARGATE_SPOT',
      weight: 2,
    },
    {
      capacityProvider: 'FARGATE',
      weight: 1,
    },
  ],
});

This pattern abstracts the automatic scaling behavior of the Fargate service based on the queue depth.

Running the application

To test the application, follow How to use the Application after the initial setup. Once complete, you see that the browser receives a Settlement.Finalized event:

{
  "version": "0",
  "id": "e2a9c866-cb5b-728c-ce18-3b17477fa5ff",
  "detail-type": "Settlement.Finalized",
  "source": "settlement.service",
  "account": "123456789",
  "time": "2023-04-09T23:20:44Z",
  "region": "us-east-2",
  "resources": [],
  "detail": {
    "settlementId": "377d788b-9922-402a-a56c-c8460e34e36d",
    "customerId": "67cac76c-40b1-4d63-a8b5-ad20f6e2e6b9",
    "claimId": "b1192ba0-de7e-450f-ac13-991613c48041",
    "settlementMessage": "Based on our analysis on the damage of your car per claim id b1192ba0-de7e-450f-ac13-991613c48041, your out-of-pocket expense will be $100.00."
  }
}

Cleaning up

The stack creates a custom VPC and other related resources. Be sure to clean up resources after usage to avoid the ongoing cost of running these services. To clean up the infrastructure, follow the clean-up steps shown in the sample application.

Conclusion

The blog explains a way to integrate existing container workload running on AWS Fargate with a new event-driven architecture. You use EventBridge to decouple different services from each other that are built using different compute technologies, languages, and frameworks. Using AWS CDK, you gain the modularity of building services decoupled from each other.

This blog shows an evolutionary architecture that allows you to modernize existing container workloads with minimal changes that still give you the additional benefits of building with serverless and EDA on AWS.

The major difference between the event-driven approach and the REST approach is that you unblock the producer once it emits an event. The event producer from the settlement domain that subscribes to that event is loosely coupled. The business functionality remains intact, and no significant refactoring or re-architecting effort is required. With these agility gains, you may get to the market faster

The sample application shows the implementation details and steps to set up, run, and clean up the application. The app uses ECS Fargate for a domain service, but you do not limit it to just Fargate. You can also bring container-based applications running on Amazon EKS similarly to event-driven architecture.

Learn more about event-driven architecture on Serverless Land.

Patterns for building an API to upload files to Amazon S3

2023-05-03 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/patterns-for-building-an-api-to-upload-files-to-amazon-s3/

This blog is written by Thomas Moore, Senior Solutions Architect and Josh Hart, Senior Solutions Architect.

Applications often require a way for users to upload files. The traditional approach is to use an SFTP service (such as the AWS Transfer Family), but this requires specific clients and management of SSH credentials. Modern applications instead need a way to upload to Amazon S3 via HTTPS. Typical file upload use cases include:

Sharing datasets between businesses as a direct replacement for traditional FTP workflows.
Uploading telemetry and logs from IoT devices and mobile applications.
Uploading media such as videos and images.
Submitting scanned documents and PDFs.

If you have control over the application that sends the uploads, then you can integrate with the AWS SDK from within the browser with a framework such as AWS Amplify. To learn more, read Allowing external users to securely and directly upload files to Amazon S3.

Often you must provide end users direct access to upload files via an endpoint. You could build a bespoke service for this purpose, but this results in more code to build, maintain, and secure.

This post explores three different approaches to securely upload content to an Amazon S3 bucket via HTTPS without the need to build a dedicated API or client application.

Using Amazon API Gateway as a direct proxy

The simplest option is to use API Gateway to proxy an S3 bucket. This allows you to expose S3 objects as REST APIs without additional infrastructure. By configuring an S3 integration in API Gateway, this allows you to manage authentication, authorization, caching, and rate limiting more easily.

This pattern allows you to implement an authorizer at the API Gateway level and requires no changes to the client application or caller. The limitation with this approach is that API Gateway has a maximum request payload size of 10 MB. For step-by-step instructions to implement this pattern, see this knowledge center article.

This is an example implementation (you can deploy this from Serverless Land):

Using API Gateway with presigned URLs

The second pattern uses S3 presigned URLs, which allow you to grant access to S3 objects for a specific period, after which the URL expires. This time-bound access helps prevent unauthorized access to S3 objects and provides an additional layer of security.

They can be used to control access to specific versions or ranges of bytes within an object. This granularity allows you to fine-tune access permissions for different users or applications, and ensures that only authorized parties have access to the required data.

This avoids the 10 MB limit of API Gateway as the API is only used to generate the presigned URL, which is then used by the caller to upload directly to S3. Presigned URLs are straightforward to generate and use programmatically, but it does require the client to make two separate requests: one to generate the URL and one to upload the object. To learn more, read Uploading to Amazon S3 directly from a web or mobile application.

This pattern is limited by the 5GB maximum request size of the S3 Put Object API call. One way to work around this limit with this pattern is to leverage S3 multipart uploads. This requires that the client split the payload into multiple segments and send a separate request for each part.

This adds some complexity to the client and is used by libraries such as AWS Amplify that abstract away the multipart upload implementation. This allows you to upload objects up to 5TB in size. For more details, see uploading large objects to Amazon S3 using multipart upload and transfer acceleration.

An example of this pattern is available on Serverless Land.

Using Amazon CloudFront with Lambda@Edge

The final pattern leverages Amazon CloudFront instead of API Gateway. CloudFront is primarily a content delivery network (CDN) that caches and delivers content from an S3 bucket or other origin. However, CloudFront can also be used to upload data to an S3 bucket. Without any additional configuration, this would essentially make the S3 bucket publicly writable. To secure the solution so that only authenticated users can upload objects, you can use a Lambda@Edge function to verify the users’ permissions.

The maximum size of the object that you can upload with this pattern is 5GB. If you need to upload files larger than 5GB, then you must use multipart uploads. To implement this, deploy the example Serverless Land pattern:

This pattern uses an origin access identity (OAI) to limit access to the S3 bucket to only come from CloudFront. The default OAI has s3:GetObject permission, which is changed to s3:PutObject to allow uploads explicitly and prevent and read operations:

{
    "Version": "2008-10-17",
    "Id": "PolicyForCloudFrontPrivateContent",
    "Statement": [
        {
            "Sid": "1",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::cloudfront:user/CloudFront Origin Access Identity <origin access identity ID>"
            },
            "Action": [
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3::: DOC-EXAMPLE-BUCKET/*"
        }
    ]
}

As CloudFront is not used to cache content, the managed cache policy is set to CachingDisabled.

There are multiple options for implementing the authorization in the Lambda@Edge function. The sample repository uses an Amazon Cognito authorizer that validates a JSON Web Token (JWT) sent as an HTTP authorization header.

Using a JWT is secure as it implies this token is dynamically vended by an Identity Provider, such as Amazon Cognito. This does mean that the caller needs a mechanism to obtain this JWT token. You are in control of this authorizer function, and the exact implementation depends on your use-case. You could instead use an API Key or integrate with an alternate identity provider such as Auth0 or Okta.

Lambda@Edge functions do not currently support environment variables. This means that the configuration parameters are dynamically resolved at runtime. In the example code, AWS Systems Manager Parameter Store is used to store the Amazon Cognito user pool ID and app client ID that is required for the token verification. For more details on how to choose where to store your configuration parameters, see Choosing the right solution for AWS Lambda external parameters.

To verify the JWT token, the example code uses the aws-jwt-verify package. This supports JWTs issued by Amazon Cognito and third-party identity providers.

The Serverless Land pattern uses an Amazon Cognito identity provider to do authentication in the Lambda@Edge function. This code snippet shows an example using a pre-shared key for basic authorization:

import json

def lambda_handler(event, context):
       
    print(event)
       
    response = event["Records"][0]["cf"]["request"]
    headers = response["headers"]
       
    if 'authorization' not in headers or headers['authorization'] == None:
        return unauthorized()
           
    if headers['authorization'] == 'my-secret-key':
        return request

    return response
       
def unauthorized():
    response = {
            'status': "401",
            'statusDescription': 'Unauthorized',
            'body': 'Unauthorized'
        }
    return response

The Lambda function is associated with the CloudFront distribution by creating a Lambda trigger. The CloudFront event is set Viewer request to meaning the function is invoked in reaction to PUT events from the client.

The solution can be tested with an API testing client, such as Postman. In Postman, issue a PUT request to https://<your-cloudfront-domain>/<object-name> with a binary payload as the body. You receive a 401 Unauthorized response.

Next, add the Authorization header with a valid token and submit the request again. For more details on how to obtain a JWT from Amazon Cognito, see the README in the repository. Now the request works and you receive a 200 OK message.

To troubleshoot, the Lambda function logs to Amazon CloudWatch Logs. For Lambda@Edge functions, look for the logs in the Region closest to the request, and not the same Region as the function.

The Lambda@Edge function in this example performs basic authorization. It validates the user has access to the requested resource. You can perform any custom authorization action here. For example, in a multi-tenant environment, you could restrict the prefix so that specific tenants only have permission to write to their own prefix, and validate the requested object name in the function.

Additionally, you could implement controls traditionally performed by the API Gateway such as throttling by tenant or user. Another use for the function is to validate the file type. If users can only upload images, you could validate the content-length to ensure the images are a certain size and the file extension is correct.

Conclusion

Which option you choose depends on your use case. This table summarizes the patterns discussed in this blog post:

	API Gateway as a proxy	Presigned URLs with API Gateway	CloudFront with Lambda@Edge
Max Object Size	10 MB	5 GB (5 TB with multipart upload)	5 GB
Client Complexity	Single HTTP Request	Multiple HTTP Requests	Single HTTP Request
Authorization Options	Amazon Cognito, IAM, Lambda Authorizer	Amazon Cognito, IAM, Lambda Authorizer	Lambda@Edge
Throttling	API Key throttling	API Key throttling	Custom throttling

Each of the available methods has its strengths and weaknesses and the choice of which one to use depends on your specific needs. The maximum object size supported by S3 is 5 TB, regardless of which method you use to upload objects. Additionally, some methods have more complex configuration that requires more technical expertise. Considering these factors with your specific use-case can help you make an informed decision on the best API option for uploading to S3.

For more serverless learning resources, visit Serverless Land.