Tag Archives: serverless

Capturing client events using Amazon API Gateway and Amazon EventBridge

2022-02-08 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/capturing-client-events-using-amazon-api-gateway-and-amazon-eventbridge/

This post is written by Tim Bruce, Senior Solutions Architect, DevAx.

Event producers are one of the three main components in an event-driven architecture. Event producers create and publish events to event routers, which send them to event consumers. Any portion of a system, including a mobile or web client, can be an event producer.

To extend the event model to your mobile and web clients, you must implement standards for security, messaging formats, and event storage.

This post shows how to build a client-enabled event-handling solution. It uses Amazon EventBridge, Amazon API Gateway, AWS Lambda, and Amazon Cognito. This architecture supports routing client events to internal and external destinations. It provides a blueprint that you can use to simplify the integration.

Overview

This example creates a RESTful API using API Gateway. It sends events directly to EventBridge without the need for compute services. In production, you have more requirements than only receiving and forwarding events. Additional requirements include security, user identification, validation, enrichment, transformation, event forwarding, and storing.

In this example, API Gateway provides security and user identification by invoking a Lambda authorizer. The authorizer generates a policy and returns client identification to API Gateway. API Gateway then performs request validation and message enrichment before forwarding the events to EventBridge.

EventBridge evaluates the events against rules and forwards the events to targets. The rules apply transformation to the events and forward an event to up to five targets. Targets include AWS services, such as Amazon Kinesis Data Firehose, and many third-party solutions, such as Zendesk, with HTTPS endpoints.

Lastly, Kinesis Data Firehose provides a cost-effective solution to store events into an Amazon S3 bucket. Before storing the events, Kinesis Data Firehose transforms records via Lambda transformers. It also partitions records using data in the record or calculated data via a Lambda function. Kinesis Data Firehose uses this partitioning data to create keys in the bucket and store matching records within the keys.

Example architecture

The example consists of the following resources defined in the AWS SAM template:

An API Gateway instance to receive the messages.
A Lambda authorizer to validate requests.
An EventBridge event bus to receive events.
An EventBridge rule to forward all events to Kinesis Data Firehose.
An EventBridge rule to forward specific events to Zendesk.
An EventBridge API destination to connect to your Zendesk.
A Kinesis Data Firehose to transform, partition, and store events in an S3 bucket.
A Lambda Kinesis Data Firehose data transformation.
An S3 bucket to store event data.

Data flow

Application clients collect or generate the events.
The client sends the events to API Gateway as URL-encoded JSON. The client includes the user’s JWT in an authorization header with the request for validation.
The Lambda authorizer validates the JWT with Amazon Cognito and returns the user’s unique clientId value to API Gateway.
API Gateway transforms the request into events, appending clientId, the bus name, and environment.
API Gateway sends the events to EventBridge.
EventBridge rules match the events and:
1. Forward all client events to Kinesis Data Firehose.
2. Forward client events with detail.eventType of “loyaltypurchase” to Zendesk.
Kinesis Data Firehose receives the records.
The Kinesis Data Firehose data transformation processes each record, moving the client ID to the detail object.
Kinesis Data Firehose partitions the records and stores them in an S3 bucket.

Overall design

The following sections discuss details of the solution, starting from the event in a web or mobile client. This solution requires the client to create an HTTPS request, including the user’s JWT as an authorization header.

{"entries": [{"entry": "{\"eventType\": \"searching\", \"schemaVersion\":1, \"data\": {\"searchTerm\":\"games\"}}"}]}

The preceding JSON shows a sample request body for this solution. The top-level item “entries” is an array of “entry” items. API Gateway will translate each “entry” to the event-detail field in EventBridge events. The client must escape the data for “entry” to prevent translation errors.
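As a rough illustration of the escaping requirement, the following Python sketch builds a request body in this shape. The helper name `build_request_body` is hypothetical and not part of the sample repository. It serializes each event to a JSON string first, so the outer `json.dumps` call escapes the inner quotes.

```python
import json


def build_request_body(events):
    """Wrap each event dict as an escaped JSON string under "entry"."""
    # Serializing the event first turns it into a string; the outer
    # json.dumps call then escapes the quotes inside that string.
    entries = [{"entry": json.dumps(event)} for event in events]
    return json.dumps({"entries": entries})


body = build_request_body([
    {"eventType": "searching", "schemaVersion": 1, "data": {"searchTerm": "games"}}
])
print(body)
```

Letting a JSON library do the escaping avoids hand-written backslashes, which are a likely source of the translation errors mentioned above.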

API Gateway and Lambda authorizer

API Gateway receives the request and validates the JWT by invoking the Lambda authorizer. The authorizer generates a policy allowing the request for valid tokens. It adds the Amazon Cognito “custom:clientId” custom attribute to the response context before returning the response to API Gateway. The “custom:clientId” attribute is a unique client identifier in the form of a UUID that downstream systems can use to retrieve data about the customer.
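The response that the authorizer returns could look roughly like the following Python sketch. It is not the authorizer from the sample repository: the function name `build_authorizer_response` is hypothetical, and the sketch assumes the JWT claims were already verified elsewhere. It only shows the general shape of a Lambda authorizer response, with an allow policy and the client ID placed in the context.

```python
def build_authorizer_response(claims, method_arn):
    """Return an API Gateway Lambda authorizer response for verified claims."""
    # Assumption: `claims` comes from a JWT whose signature, expiry, and
    # audience were already verified. This sketch does not verify tokens.
    return {
        "principalId": claims["sub"],
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": "Allow",
                    "Resource": method_arn,
                }
            ],
        },
        # Context values become available to API Gateway mapping templates.
        "context": {"clientId": claims["custom:clientId"]},
    }


# Example values are placeholders for illustration only.
response = build_authorizer_response(
    {"sub": "user-1", "custom:clientId": "0f8fad5b-d9cb-469f-a165-70867728950e"},
    "arn:aws:execute-api:us-east-1:123456789012:abcdef1234/prod/POST/events",
)
print(response["context"])
```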

API Gateway validates the request by matching the request body against a model. Models represent what a request should look like. A mapping template then transforms valid requests to the format required by EventBridge. Mapping templates use velocity templating language (VTL) to do this.

This mapping template uses a #foreach loop to process the array “entries” from the request body. The process enriches each event with the user’s “custom:clientId” and stage variables for bus name and environment from API Gateway.

The preceding API Gateway AWS integration enables API Gateway to send the events to EventBridge without using compute services, such as Lambda or Amazon EC2. The integration and IAM execution role enable API Gateway to call the EventBridge PutEvents API to do this.
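The mapping step described above can be approximated in Python to show what the `#foreach` loop produces. This is a sketch, not the VTL template from the repository. The field names `Entries`, `Detail`, `DetailType`, `Source`, and `EventBusName` belong to the EventBridge PutEvents request, but where the template places the client ID and environment is an assumption made here for illustration (the sketch carries them in `DetailType` and `Source`).

```python
import json


def to_put_events_entries(body, client_id, bus_name, environment):
    """Build a PutEvents request with one entry per item in "entries"."""
    entries = []
    for item in body["entries"]:
        # Fail early if the client sent an entry that is not valid JSON.
        json.loads(item["entry"])
        entries.append({
            # Each "entry" string becomes the event detail unchanged.
            "Detail": item["entry"],
            # Assumption: enrichment placement below is hypothetical.
            "DetailType": client_id,
            "Source": f"clientevents-{environment}",
            "EventBusName": bus_name,
        })
    return {"Entries": entries}


request_body = {"entries": [{"entry": "{\"eventType\": \"searching\"}"}]}
print(to_put_events_entries(request_body, "1234", "clientevents-bus", "dev"))
```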

EventBridge rules and transformations

EventBridge rules match events against criteria, transform the events, and forward the events to targets. There are two rules in this example. One processes events for Zendesk tickets and the other forwards data to Kinesis Data Firehose to store events for triage and analytics.

This example creates service tickets in the Zendesk ticketing system. The tickets trigger agents to contact customers who are expecting a call to complete their purchases. By sending the event directly, the software client reduces time-to-action for back-office processes and helps improve customer satisfaction.

This rule matches client event messages for loyalty purchases and forwards details to the Zendesk API. The rule includes a transformation, which selects a portion of the event before sending the information to the target.
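To make the matching step concrete, here is a minimal Python sketch of exact-match event pattern evaluation. The pattern shown is an assumption based on the description above (the deployed rule may also match on other fields), and the `matches` helper is a simplified stand-in that covers only exact value matching, not the full EventBridge pattern syntax.

```python
# Assumed pattern: match events whose detail.eventType is "loyaltypurchase".
LOYALTY_PATTERN = {"detail": {"eventType": ["loyaltypurchase"]}}


def matches(pattern, event):
    """Return True if `event` satisfies an exact-match event pattern."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            # Nested pattern: recurse into the matching event object.
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            # Leaf pattern: a list of accepted values.
            return False
    return True


print(matches(LOYALTY_PATTERN, {"detail": {"eventType": "loyaltypurchase"}}))
print(matches(LOYALTY_PATTERN, {"detail": {"eventType": "searching"}}))
```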

EventBridge uses an API destination to store details about the HTTP endpoint and usage policies. Additionally, an EventBridge connection and an AWS Secrets Manager secret store details. These include the authentication policy and authentication credentials to connect to the API destination.

Successfully processed events open tickets in Zendesk using the API destination. Agents now have a list of customers to contact.

Enterprises often require storing the events for troubleshooting or analytics. EventBridge does not include a newline between records when forwarding events to Kinesis Data Firehose. Because of this, it may be more challenging to discern each record when analyzing the data.

A rule for all client events changes this behavior. This AWS CloudFormation snippet defines the rule that will transform each event, adding a new line after each. The “\n” character in the InputTemplate field adds the separator between records before forwarding the data to Kinesis Data Firehose.

Afterward, Kinesis Data Firehose receives each record separated by a new line, enabling both triage and analytics without extra overhead.
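The following Python sketch shows why the separator matters when reading stored objects back. The sample events are made up for illustration. Records written back to back cannot be parsed with a single `json.loads` call, while newline-delimited records parse one line at a time.

```python
import json

events = [{"eventType": "searching"}, {"eventType": "loyaltypurchase"}]

# Without a separator, records run together in the stored object.
concatenated = "".join(json.dumps(e) for e in events)

# With a trailing "\n" per record, each line is a complete JSON document.
delimited = "".join(json.dumps(e) + "\n" for e in events)
parsed = [json.loads(line) for line in delimited.splitlines()]

try:
    json.loads(concatenated)
    concatenated_parses = True
except json.JSONDecodeError:
    # The parser stops after the first object and reports extra data.
    concatenated_parses = False

print(parsed, concatenated_parses)
```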

Kinesis Data Firehose to S3

Kinesis Data Firehose is a cost-effective way to batch and write records to S3. It offers optional transformation capabilities by invoking a Lambda function. This example uses a Lambda function that moves the “clientID” field to the detail section of the event record.
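A transformation function of this kind could be sketched as follows. This is not the function from the sample repository. The record contract (`recordId`, `result`, and base64-encoded `data`) is the standard Kinesis Data Firehose transformation format, but the incoming field that holds the client ID (`CLIENT_ID_FIELD`) is an assumption chosen for illustration.

```python
import base64
import json

# Assumption: the top-level field that carries the client ID on arrival.
CLIENT_ID_FIELD = "detail-type"


def lambda_handler(event, context):
    """Move the client ID into the detail object of each Firehose record."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Remove the client ID from the top level and store it under detail.
        client_id = payload.pop(CLIENT_ID_FIELD, None)
        payload.setdefault("detail", {})["clientId"] = client_id
        # Re-add the newline so stored records stay one per line.
        data = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(data).decode("utf-8"),
        })
    return {"records": output}
```

Each output record keeps the `recordId` it arrived with so that Kinesis Data Firehose can pair transformed records with their originals.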

Kinesis Data Firehose also supports dynamic partitioning of records when writing to S3. It selects data from the records or data calculated by a Lambda function. In this example, it selects data from the records to store data in separate folders in S3.

Event durability considerations

You can extend this example using an EventBridge archive and Amazon Kinesis Data Streams. Archiving allows you to create an encrypted archive of matching events. You can define the data retention in days, from one through indefinite. You can replay events from your archive when you must re-process data.

Kinesis Data Streams is a serverless data streaming solution. The EventBridge rule for all records can forward data to Kinesis Data Streams instead of Kinesis Data Firehose. Multiple applications can consume the Kinesis Data Streams. Kinesis Data Firehose would consume this stream of data and store it in S3.

Prerequisites

You need the following prerequisites to deploy the example solution:

AWS account
AWS CLI
AWS Serverless Application Model (AWS SAM) CLI
Python 3.9
An AWS Identity and Access Management (IAM) role with appropriate access.
A Zendesk trial account
A Zendesk API key

Implementation

The full source of the solution is in the GitHub repository and is deployed with AWS SAM.

Create a Secrets Manager secret using the AWS CLI:
aws secretsmanager create-secret --name proto/Zendesk --secret-string '{"username":"<YOUR EMAIL>","apiKey":"<YOUR APIKEY>"}'
Clone the solution repository using git:
git clone https://github.com/aws-samples/client-event-sample
Build the AWS SAM project:
sam build --use-container
Deploy the project using AWS SAM:
sam deploy --guided --capabilities CAPABILITY_NAMED_IAM

From the outputs from the deployment, set the following shell variables:

APPCLIENTID=<output APPCLIENTID>
APIID=<output APIID>
REGION=<region you deployed to>

Create a user in Amazon Cognito using the AWS CLI:
aws cognito-idp sign-up --client-id $APPCLIENTID --username <YOUR USER ID> --password <YOUR PASSWORD> --user-attributes Name=email,Value=<YOUR EMAIL>
After you receive the confirmation code, confirm the user using the AWS CLI:
aws cognito-idp confirm-sign-up --client-id $APPCLIENTID --username <userid> --confirmation-code <confirmation code>
Test the user login with the AWS CLI:
aws cognito-idp initiate-auth --auth-flow USER_PASSWORD_AUTH --client-id $APPCLIENTID --auth-parameters USERNAME=<YOUR USER ID>,PASSWORD=<YOUR PASSWORD>

If successful, this returns a JSON web token (JWT).
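To inspect the claims in a returned token, such as the “custom:clientId” attribute, you can decode its payload segment locally. The Python sketch below uses a hypothetical helper, `decode_jwt_claims`, and does not verify the signature, so it is suitable for inspection only and not for authorization decisions.

```python
import base64
import json


def decode_jwt_claims(token):
    """Decode the payload segment of a JWT without verifying the signature."""
    payload_segment = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    padded = payload_segment + "=" * (-len(payload_segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


# Build a throwaway, unsigned token to demonstrate the helper.
sample_claims = {"sub": "user-1", "custom:clientId": "1234"}
segment = base64.urlsafe_b64encode(json.dumps(sample_claims).encode("utf-8"))
sample_token = "header." + segment.decode("utf-8").rstrip("=") + ".signature"
print(decode_jwt_claims(sample_token))
```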

Testing the client event solution

The sample repository includes an event generator in the util directory. The generator uses your credentials and simulates events from a user’s software client. From the utils directory, run the generator:
python3 generator.py --minutes <minutes to run generator> --batch <batch size from 1-10> --errors <True|False> --userid <YOUR USER ID> --password <YOUR PASSWORD> --region $REGION --appclientid $APPCLIENTID --apiid $APIID
Log in to your Zendesk console and view the created tickets.
After five minutes, review the “clientevents” bucket to view the event records.

Cleaning up

To remove the example:

Delete the data stored in the clientevents buckets created from the template.
Delete the stack using the command:
sam delete --stack-name clientevents
Delete the secret using the command:
aws secretsmanager delete-secret --secret-id <arn of secret>

Conclusion

This post shows how to send client events to an API and EventBridge to enable new customer experiences. The example gives software clients a way to send events with minimal custom code. This blueprint shows how you can include client events in your solution, featuring validation, enrichment, transformation, and storage.

You can modify the example code provided here for use in your organization. This enables your client software to register events without modifying backend code.

For more serverless learning resources, visit Serverless Land.

Mocking service integrations with AWS Step Functions Local

2022-01-31 Benjamin Smith

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/mocking-service-integrations-with-aws-step-functions-local/

This post is written by Sam Dengler, Principal Specialist Solutions Architect, and Dhiraj Mahapatro, Senior Specialist Solutions Architect.

AWS Step Functions now supports over 200 AWS Service integrations via AWS SDK Integration. Developers want to build and test control flow logic for workflows using branching logic, error handling, and retries. This allows for precise workflow execution with deterministic results. Additionally, developers use Step Functions’ input and output processing features to transform data as it enters and exits tasks.

Developers can test their state machines locally using Step Functions Local before deploying them to an AWS account. However, state machines that use service integrations like AWS Lambda, Amazon SQS, or Amazon SNS require Step Functions Local to perform calls to AWS service endpoints. Often, developers want to test the control and data flows of their state machine executions in isolation, without any dependency on service integration availability.

Today, AWS is releasing Mocked Service Integrations for Step Functions Local. This allows developers to define sample outputs from AWS service integrations. You can combine them into test case scenarios to validate workflow control and data flow definitions. You can find the code used in this post in the Step Functions examples GitHub repository.

Sales lead generation sample workflow

In this example, new sales leads are created in a customer relationship management system. This triggers the sample workflow execution using input data, which provides information about the contact.

Using the sales lead data, the workflow first validates the contact’s identity and address. If valid, it uses Step Functions’ AWS SDK integration for Amazon Comprehend to call the DetectSentiment API. It uses the sales lead’s comments as input for sentiment analysis.

If the comments have a positive sentiment, it adds the sales lead’s information to a DynamoDB table for follow-up. The event is published to Amazon EventBridge to notify subscribers.

If the sales lead data is invalid or a negative sentiment is detected, it publishes events to EventBridge for notification. No record is added to the Amazon DynamoDB table. The following Step Functions Workflow Studio diagram shows the control logic:

The full workflow definition is available in the code repository. Note the workflow task names in the diagram, such as DetectSentiment, which are important when defining the mocked responses.

Sentiment analysis test case

In this example, you test a scenario in which:

The identity and address are successfully validated using a Lambda function.
A positive sentiment is detected using the Comprehend.DetectSentiment API after three retries.
A contact item is written to a DynamoDB table successfully.
An event is published to an EventBridge event bus successfully.

The execution path for this test scenario is shown in the following diagram (the red and green numbers have been added). 0 represents the first execution; 1, 2, and 3 represent the max retry attempts (MaxAttempts), in case of an InternalServerException.

Mocked response configuration

To use service integration mocking, create a mock configuration file with sections specifying mock AWS service responses. These are grouped into test cases that can be activated when executing state machines locally. The following example provides code snippets and the full mock configuration is available in the code repository.

To mock a successful Lambda function invocation, define a mock response that conforms to the Lambda.Invoke API response elements. Associate it to the first request attempt:

"CheckIdentityLambdaMockedSuccess": {
  "0": {
    "Return": {
      "StatusCode": 200,
      "Payload": {
        "statusCode": 200,
        "body": "{\"approved\":true,\"message\":\"identity validation passed\"}"
      }
    }
  }
}

To mock the DetectSentiment retry behavior, define failure and successful mock responses that conform to the Comprehend.DetectSentiment API call. Associate the failure mocks to three request attempts, and associate the successful mock to the fourth attempt:

"DetectSentimentRetryOnErrorWithSuccess": {
  "0-2": {
    "Throw": {
      "Error": "InternalServerException",
      "Cause": "Server Exception while calling DetectSentiment API in Comprehend Service"
    }
  },
  "3": {
    "Return": {
      "Sentiment": "POSITIVE",
      "SentimentScore": {
        "Mixed": 0.00012647535,
        "Negative": 0.00008031699,
        "Neutral": 0.0051454515,
        "Positive": 0.9946478
      }
    }
  }
}
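The attempt keys in this mock can be read as zero-based ranges: "0-2" covers the first three attempts and "3" covers the fourth. The Python sketch below models that lookup for illustration. The `resolve_mock` helper is hypothetical and is not how Step Functions Local is implemented.

```python
def resolve_mock(mocked_response, attempt):
    """Return the mock entry that covers a zero-based attempt number."""
    for key, response in mocked_response.items():
        # A key is either a single attempt ("3") or a range ("0-2").
        first, _, last = key.partition("-")
        low = int(first)
        high = int(last) if last else low
        if low <= attempt <= high:
            return response
    raise KeyError(f"no mocked response for attempt {attempt}")


detect_sentiment_mock = {
    "0-2": {"Throw": {"Error": "InternalServerException"}},
    "3": {"Return": {"Sentiment": "POSITIVE"}},
}

for attempt in range(4):
    print(attempt, list(resolve_mock(detect_sentiment_mock, attempt)))
```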

Note that Step Functions Local does not validate the structure of the mocked responses. Ensure that your mocked responses conform to actual responses before testing. To review the structure of service responses, either perform the actual service calls using Step Functions or view the documentation for those services.

Next, associate the mocked responses to a test case identifier:

"RetryOnServiceExceptionTest": {
  "Check Identity": "CheckIdentityLambdaMockedSuccess",
  "Check Address": "CheckAddressLambdaMockedSuccess",
  "DetectSentiment": "DetectSentimentRetryOnErrorWithSuccess",
  "Add to FollowUp": "AddToFollowUpSuccess",
  "CustomerAddedToFollowup": "CustomerAddedToFollowupSuccess"
}
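Because mocked responses are referenced by name, a misspelled name is easy to miss. The following Python sketch checks a mock configuration for test cases that reference undefined responses. It assumes the file layout from the Step Functions Developer Guide, with top-level `StateMachines` and `MockedResponses` sections. The `find_undefined_responses` helper and the trimmed configuration are illustrative only.

```python
def find_undefined_responses(config):
    """List test-case references to mocked responses that are not defined."""
    defined = set(config.get("MockedResponses", {}))
    missing = []
    for machine, spec in config.get("StateMachines", {}).items():
        for test_case, states in spec.get("TestCases", {}).items():
            for state, response in states.items():
                if response not in defined:
                    missing.append((machine, test_case, state, response))
    return missing


# Trimmed configuration: the DetectSentiment mock is deliberately left out.
mock_config = {
    "StateMachines": {
        "LeadGenerationStateMachine": {
            "TestCases": {
                "RetryOnServiceExceptionTest": {
                    "Check Identity": "CheckIdentityLambdaMockedSuccess",
                    "DetectSentiment": "DetectSentimentRetryOnErrorWithSuccess",
                }
            }
        }
    },
    "MockedResponses": {
        "CheckIdentityLambdaMockedSuccess": {"0": {"Return": {"StatusCode": 200}}},
    },
}

print(find_undefined_responses(mock_config))
```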

With the test case and mock responses configured, you can use them for testing with Step Functions Local.

Test case execution using Step Functions Local

The Step Functions Developer Guide describes the steps used to set up Step Functions Local on your workstation and create a state machine.

After these steps are complete, you can run a workflow locally using the start-execution AWS CLI command. Activate the mocked responses by appending a pound sign and the test case identifier to the state machine ARN:

aws stepfunctions start-execution \
  --endpoint http://localhost:8083 \
  --state-machine arn:aws:states:us-east-1:123456789012:stateMachine:LeadGenerationStateMachine#RetryOnServiceExceptionTest \
  --input file://events/sfn_valid_input.json

Test case validation

To validate the workflow executed correctly in the test case, examine the state machine execution events using the StepFunctions.GetExecutionHistory API. This ensures that the correct states are used. There are a variety of validation tools available. This post shows how to achieve this using the AWS CLI filtering feature using JMESPath syntax.

In this test case, you validate the TaskFailed and TaskSucceeded events match the retry definition for the DetectSentiment task, which specifies three retries. Use the following AWS CLI command to get the execution history and filter on the execution events:

aws stepfunctions get-execution-history \
  --endpoint http://localhost:8083 \
  --execution-arn <ExecutionArn> \
  --query 'events[?(type==`TaskFailed` && contains(taskFailedEventDetails.cause, `Server Exception while calling DetectSentiment API in Comprehend Service`)) || (type==`TaskSucceeded` && taskSucceededEventDetails.resource==`comprehend:detectSentiment`)]'

The results include matching events:

{
  "timestamp": "2022-01-13T17:24:32.276000-05:00",
  "type": "TaskFailed",
  "id": 19,
  "previousEventId": 18,
  "taskFailedEventDetails": {
    "error": "InternalServerException",
    "cause": "Server Exception while calling DetectSentiment API in Comprehend Service"
  }
}

These results should be compared to the test acceptance criteria to verify the execution behavior. Test cases, acceptance criteria, and validation expressions vary by customer and use case. These techniques are flexible to accommodate various happy path and error scenarios. To explore additional sample test cases and examples, visit the example code repository.
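As one possible way to automate the comparison, the Python sketch below counts the matching events in an execution history and checks them against the expected three failures and one success. It assumes the events were already fetched (for example, from the CLI output saved as JSON). The `count_detect_sentiment_events` helper and the sample history are hypothetical.

```python
CAUSE = "Server Exception while calling DetectSentiment API in Comprehend Service"


def count_detect_sentiment_events(events):
    """Return (failed, succeeded) counts for the DetectSentiment task."""
    failed = sum(
        1 for e in events
        if e["type"] == "TaskFailed"
        and CAUSE in e.get("taskFailedEventDetails", {}).get("cause", "")
    )
    succeeded = sum(
        1 for e in events
        if e["type"] == "TaskSucceeded"
        and e.get("taskSucceededEventDetails", {}).get("resource")
        == "comprehend:detectSentiment"
    )
    return failed, succeeded


# Hand-written sample history: three failures followed by one success.
history = [
    {"type": "TaskFailed", "taskFailedEventDetails": {"cause": CAUSE}},
    {"type": "TaskFailed", "taskFailedEventDetails": {"cause": CAUSE}},
    {"type": "TaskFailed", "taskFailedEventDetails": {"cause": CAUSE}},
    {"type": "TaskSucceeded",
     "taskSucceededEventDetails": {"resource": "comprehend:detectSentiment"}},
    {"type": "TaskStateExited"},
]

print(count_detect_sentiment_events(history))
```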

Conclusion

This post introduces a new, robust way to test AWS Step Functions state machines in isolation. With mocking, developers get more control over the type of scenarios that a state machine can handle, leading to assertion of multiple behaviors. Testing a state machine with mocks can also be part of the software release. Asserting on behaviors like error handling, branching, parallel, and dynamic parallel (map state) helps test the entire state machine’s behavior. For any new behavior in the state machine, such as a new type of exception from a state, you can mock it and add it as a test case.

See the Step Functions Developer Guide for more information on service mocking with Step Functions Local. The sample application covers basic scenarios of testing a state machine. You can use a similar approach for complex scenarios including other Step Functions flows, like map and wait.

For more serverless learning resources, visit Serverless Land.

Using the circuit breaker pattern with AWS Step Functions and Amazon DynamoDB

2022-01-31 Eric Johnson

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/using-the-circuit-breaker-pattern-with-aws-step-functions-and-amazon-dynamodb/

This post is written by Anitha Deenadayalan, Developer Specialist SA, DevAx.

Modern applications use microservices as an architectural and organizational approach to software development, where the application comprises small independent services that communicate over well-defined APIs.

When multiple microservices collaborate to handle requests, one or more services may become unavailable or exhibit a high latency. Microservices communicate through remote procedure calls, and it is always possible that transient errors could occur in the network connectivity, causing failures.

During synchronous execution, cascading timeouts or failures can degrade the performance of the entire application and cause a poor user experience. When complex applications use microservices, an outage in one microservice can lead to application failure. This post shows how to use the circuit breaker design pattern to help with graceful service degradation.

Introducing circuit breakers

In his book Release It!, Michael Nygard popularized the circuit breaker pattern. This design pattern can prevent a caller service from retrying a call to a callee service that has previously caused repeated timeouts or failures. It can also detect when the callee service is functional again.

The fallacies of distributed computing are a set of assertions made by Peter Deutsch and others at Sun Microsystems. They describe the false assumptions that programmers new to distributed applications invariably make. Assuming a reliable network, zero latency, and unlimited bandwidth results in software applications written with minimal error handling for network errors.

During a network outage, applications may indefinitely wait for a reply and continually consume application resources. Failure to retry the operations when the network becomes available can also lead to application degradation. If API calls to a database or an external service time-out due to network issues, repeated calls with no circuit breaker can affect cost and performance.

The circuit breaker pattern

In the circuit breaker pattern, a circuit breaker object routes the calls from the caller to the callee. For example, in an ecommerce application, the order service can call the payment service to collect the payments. When there are no failures, the order service routes all calls to the payment service through the circuit breaker:

Circuit breaker with no failures

If the payment service times out, the circuit breaker can detect the timeout and track the failure. If the timeouts exceed a specified threshold, the application opens the circuit:

Circuit breaker with payment service failure

Once the circuit is open, the circuit breaker object does not route the calls to the payment service. It returns an immediate failure when the order service calls the payment service:

Circuit breaker stops routing to payment service

The circuit breaker object periodically tries to see if the calls to the payment service are successful:

Circuit breaker retries payment service

When the call to payment service succeeds, the circuit is closed, and all further calls are routed to the payment service again:

Circuit breaker with working payment service again
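The states described above can be sketched as a small in-memory Python class. This is a generic illustration of the pattern, not the Step Functions and DynamoDB implementation that this post builds. The class name, threshold, and timeout values are assumptions, and the clock is injectable so that the behavior can be tested without waiting.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the circuit is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=20.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open circuit: fail immediately without calling the callee.
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller wraps each request, for example `breaker.call(charge_payment, order)`, where `charge_payment` is a placeholder for the callee, and treats `CircuitOpenError` as the signal to degrade gracefully.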

Architecture overview

This example uses AWS Step Functions, AWS Lambda, and Amazon DynamoDB to implement the circuit breaker pattern:

Circuit breaker architecture

The Step Functions workflow provides circuit breaker capabilities. When a service wants to call another service, it starts the workflow with the name of the callee service.

The workflow gets the circuit status from the CircuitStatus DynamoDB table, which stores the currently degraded services. If the CircuitStatus table contains a record for the service called, then the circuit is open. The Step Functions workflow returns an immediate failure and exits with a FAIL state.

If the CircuitStatus table does not contain an item for the called service, then the service is operational. The ExecuteLambda step in the state machine definition invokes the Lambda function sent through a parameter value. The Step Functions workflow exits with a SUCCESS state, if the call succeeds.

The items in the DynamoDB table have the following attributes:

DynamoDB items list

If the service call fails or a timeout occurs, the application retries with exponential backoff for a defined number of times. If the service call fails after the retries, the workflow inserts a record in the CircuitStatus table for the service with the CircuitStatus as OPEN, and the workflow exits with a FAIL state. Subsequent calls to the same service return an immediate failure as long as the circuit is open.
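The wait times produced by exponential backoff can be computed with a short Python sketch. It follows the Amazon States Language retry fields `IntervalSeconds`, `MaxAttempts`, and `BackoffRate`. The values used here are placeholders rather than the ones in the sample state machine.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Return the wait before each retry, in seconds."""
    # The first retry waits interval_seconds; each later retry multiplies
    # the previous wait by backoff_rate.
    return [interval_seconds * backoff_rate ** i for i in range(max_attempts)]


delays = retry_delays(interval_seconds=1, max_attempts=3, backoff_rate=2.0)
print(delays, sum(delays))
```

Summing the delays gives a lower bound on how long a caller waits before the workflow opens the circuit, which is useful when choosing these values.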

The workflow writes the item with an associated time-to-live (TTL) value so that the item expires at the defined time and connection attempts eventually resume. DynamoDB’s time to live (TTL) allows you to define a per-item timestamp to determine when an item is no longer needed. Shortly after the date and time of the specified timestamp, DynamoDB deletes the item from your table without consuming write throughput.

For example, if you set the TTL value to 60 seconds to check a service status after a minute, DynamoDB deletes the item from the table after 60 seconds. The workflow invokes the service to check for availability when a new call comes in after the item has expired.
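The TTL arithmetic can be illustrated with the following Python sketch. Only `CircuitStatus` and `ExpireTimeStamp` come from this post; the `ServiceName` attribute and both helper functions are hypothetical. Because DynamoDB removes expired items in the background rather than at the exact expiry time, the sketch also compares the timestamp when deciding whether the circuit is open.

```python
import time


def build_circuit_item(service_name, ttl_seconds, now=None):
    """Build an item that marks the circuit for a service as open."""
    now = int(time.time()) if now is None else int(now)
    return {
        "ServiceName": service_name,  # hypothetical key attribute name
        "CircuitStatus": "OPEN",
        # TTL attributes hold a Unix epoch timestamp in seconds.
        "ExpireTimeStamp": now + ttl_seconds,
    }


def is_circuit_open(item, now=None):
    """Return True while an unexpired item exists for the service."""
    if item is None:
        return False
    now = int(time.time()) if now is None else int(now)
    # An expired item may still be readable until DynamoDB deletes it.
    return item["ExpireTimeStamp"] > now


item = build_circuit_item("payment-service", ttl_seconds=60, now=1_000)
print(is_circuit_open(item, now=1_030), is_circuit_open(item, now=1_061))
```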

Circuit breaker Step Function

Prerequisites

For this walkthrough, you need:

An AWS account and an AWS user with AdministratorAccess (see the instructions on the AWS Identity and Access Management (IAM) console)
Access to the following AWS services: AWS Lambda, AWS Step Functions, and Amazon DynamoDB.
AWS SAM CLI, installed by following the AWS SAM installation instructions.
.NET Core 3.1 SDK installed
JetBrains Rider or Microsoft Visual Studio 2017 or later (or Visual Studio Code)

Setting up the environment

Use the .NET Core 3.1 code in the GitHub repository and the AWS SAM template to create the AWS resources for this walkthrough. These include IAM roles, DynamoDB table, the Step Functions workflow, and Lambda functions.

You need an AWS access key ID and secret access key to configure the AWS Command Line Interface (AWS CLI). To learn more about configuring the AWS CLI, follow these instructions.
Clone the repo:
git clone https://github.com/aws-samples/circuit-breaker-netcore-blog
After cloning, this is the folder structure:

Project file structure

Deploy using Serverless Application Model (AWS SAM)

The AWS Serverless Application Model (AWS SAM) CLI provides developers with a local tool for managing serverless applications on AWS.

The sam build command processes your AWS SAM template file, application code, and applicable language-specific files and dependencies. The command copies build artifacts in the format and location expected for subsequent steps in your workflow. Run these commands to process the template file:
```
cd circuit-breaker
sam build
```
After you build the application, deploy it using the sam deploy command. AWS SAM deploys the application to AWS and displays the output in the terminal.
```
sam deploy --guided
```
Output from sam deploy
You can also view the output on the AWS CloudFormation page.

Output in CloudFormation console
The Step Functions workflow provides the circuit-breaker function. Refer to the circuitbreaker.asl.json file in the statemachine folder for the state machine definition in the Amazon States Language (ASL).

To deploy with the CDK, refer to the GitHub page.

Running the service through the circuit breaker

To provide circuit breaker capabilities to the Lambda microservice, you must send the name or function ARN of the Lambda function to the Step Functions workflow:

{
  "TargetLambda": "<Name or ARN of the Lambda function>"
}

Successful run

To simulate a successful run, use the HelloWorld Lambda function provided by passing the name or ARN of the Lambda function the stack has created. Your input appears as follows:

{
  "TargetLambda": "circuit-breaker-stack-HelloWorldFunction-pP1HNkJGugQz"
}

During the successful run, the Get Circuit Status step checks the circuit status against the DynamoDB table. Suppose that the circuit is CLOSED, which is indicated by zero records for that service in the DynamoDB table. In that case, the Execute Lambda step runs the Lambda function and exits the workflow successfully.

Step Function with closed circuit

Service timeout

To simulate a timeout, use the TestCircuitBreaker Lambda function by passing the name or ARN of the Lambda function the stack has created. Your input appears as:

{
  "TargetLambda": "circuit-breaker-stack-TestCircuitBreakerFunction-mKeyyJq4BjQ7"
}

Again, the circuit status is checked against the DynamoDB table by the Get Circuit Status step in the workflow. The circuit is CLOSED during the first pass, and the Execute Lambda step runs the Lambda function, which times out.

The workflow retries based on the retry count and the exponential backoff values, and finally returns a timeout error. It runs the Update Circuit Status step where a record is inserted in the DynamoDB table for that service, with a predefined time-to-live value specified by TTL attribute ExpireTimeStamp.

Step Function with open circuit

Repeat timeout

As long as there is an item for the service in the DynamoDB table, the circuit breaker workflow returns an immediate failure to the calling service. When you re-execute the call to the Step Functions workflow for the TestCircuitBreaker Lambda function within 20 seconds, the circuit is still open. The workflow immediately fails, ensuring the stability of the overall application performance.

Step Function workflow immediately fails until retry

The item in the DynamoDB table expires after 20 seconds, and the workflow retries the service again. This time, the workflow retries with exponential backoffs, and if it succeeds, the workflow exits successfully.

Cleaning up

To avoid incurring additional charges, clean up all the created resources. Run the following command from a terminal window. This command deletes the created resources that are part of this example.

sam delete --stack-name circuit-breaker-stack --region <region name>

Conclusion

This post showed how to implement the circuit breaker pattern using Step Functions, Lambda, DynamoDB, and .NET Core 3.1. This pattern can help prevent system degradation in service failures or timeouts. Step Functions and the TTL feature of DynamoDB can make it easier to implement the circuit breaker capabilities.

To learn more about developing microservices on AWS, refer to the whitepaper on microservices. To learn more about serverless and AWS SAM, visit the Sessions with SAM series and find more resources at Serverless Land.

Codacy Measures Developer Productivity using AWS Serverless

2022-01-27 Catarina Gralha

Post Syndicated from Catarina Gralha original https://aws.amazon.com/blogs/architecture/codacy-measures-developer-productivity-using-aws-serverless/

Codacy is a DevOps insights company based in Lisbon, Portugal. Since its launch in 2012, Codacy has helped software development and engineering teams reduce defects, keep technical debt in check, and ship better code, faster.

Codacy’s latest product, Pulse, is a service that helps understand and improve the performance of software engineering teams. This includes measuring metrics such as deployment frequency, lead time for changes, or mean time to recover. Codacy’s main platform is built on top of AWS products like Amazon Elastic Kubernetes Service (EKS), but they have taken Pulse one step further with AWS serverless.

In this post, we will explore Pulse’s requirements, architecture, and the services it is built on, including AWS Lambda, Amazon API Gateway, and Amazon DynamoDB.

Pulse prototype requirements

Codacy had three clear requirements for their initial Pulse prototype.

The solution must enable the development team to iterate quickly and have minimal time-to-market (TTM) to validate the idea.
The solution must be easily scalable and match the demands of both startups and large enterprises alike. This was of special importance, as Codacy wanted to onboard Pulse with some of their existing customers. At the time, these customers already had massive amounts of information.
The solution must be cost-effective, particularly during the early stages of the product development.

Enter AWS serverless

Codacy could have built Pulse on top of Amazon EC2 instances. However, this brings the undifferentiated heavy lifting of having to provision, secure, and maintain the instances themselves.

AWS serverless technologies are fully managed services that abstract the complexity of infrastructure maintenance away from developers and operators, so they can focus on building products.

Serverless applications also scale elastically and automatically behind the scenes, so customers don’t need to worry about capacity provisioning. Furthermore, these services are highly available by design and span multiple Availability Zones (AZs) within the Region in which they are deployed. This gives customers higher confidence that their systems will continue running even if one Availability Zone is impaired.

AWS serverless technologies are cost-effective too, as they are billed per unit of value, as opposed to billing per provisioned capacity. For example, billing is calculated by the amount of time a function takes to complete or the number of messages published to a queue, rather than how long an EC2 instance runs. Customers only pay when they are getting value out of the services, for example when serving an actual customer request.

Overview of Pulse’s solution architecture

An event is generated when a developer performs a specific action as part of their day-to-day tasks, such as committing code or merging a pull request. These events are the foundational data that Pulse uses to generate insights and are thus processed by multiple Pulse components called modules.

Let’s take a detailed look at a few of them.

Ingestion module

Figure 1. Pulse ingestion module architecture

Figure 1 shows the ingestion module, which is the entry point of events into the Pulse platform and is built on AWS serverless applications as follows:

The ingestion API is exposed to customers using Amazon API Gateway. This defines REST, HTTP, and WebSocket APIs with sophisticated functionality such as request validation, rate limiting, and more.
The actual business logic of the API is implemented as AWS Lambda functions. Lambda can run custom code in a fully managed way. You only pay for the time that the function takes to run, in 1-millisecond increments. Lambda natively supports multiple languages, but customers can also bring their own runtimes or container images as needed.
API requests are authorized with keys, which are stored in Amazon DynamoDB, a key-value NoSQL database that delivers single-digit millisecond latency at any scale. API Gateway invokes a Lambda function that validates the key against those stored in DynamoDB. This function is called a Lambda authorizer.
While API Gateway provides a default domain name for each API, Codacy customizes it with Amazon Route 53, a service that registers domain names and configures DNS records. Route 53 offers a service level agreement (SLA) of 100% availability.
Events are stored in raw format in Pulse’s data lake, which is built on top of AWS’ object storage service, Amazon Simple Storage Service (S3). With Amazon S3, you can store massive amounts of information at low cost using simple HTTP requests. The data is highly available and durable.
Whenever a new event is ingested by the API, a message is published in Pulse’s message bus. (More information later in this post.)
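The key check performed by the Lambda authorizer in the list above could look roughly like the following. This is a sketch under stated assumptions, not Codacy's code: the API_KEYS dictionary is a hypothetical stand-in for the DynamoDB table, the key and customer names are invented, and the event is assumed to be a token-based authorizer request carrying authorizationToken and methodArn.

```python
# Hypothetical key store; in Pulse the keys live in DynamoDB.
API_KEYS = {"key-123": "customer-a"}


def build_policy(principal_id, effect, resource):
    """Build the policy document that API Gateway expects from an authorizer."""
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": effect,
                    "Resource": resource,
                }
            ],
        },
    }


def lambda_handler(event, context):
    # Token-based authorizers receive the credential in "authorizationToken".
    token = event.get("authorizationToken", "")
    owner = API_KEYS.get(token)
    if owner is None:
        return build_policy("anonymous", "Deny", event["methodArn"])
    return build_policy(owner, "Allow", event["methodArn"])
```

Returning an explicit Deny policy for unknown keys keeps the decision visible in the authorizer's output rather than relying on an exception.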

Events module

Figure 2. Pulse events module architecture

The events module handles the aggregation and storage of events for actual consumption by customers, as shown in Figure 2:

Events are consumed from the message bus and processed with a Lambda function, which stores them in Amazon Redshift.
Amazon Redshift is AWS’ managed data warehouse, and enables Pulse’s users to get insights and metrics by running analytical (OLAP) queries with the highest performance.
These metrics are exposed to customers via another API (the public API), which is also built on API Gateway.
The business logic for this API is implemented using Lambda functions, like the Ingestion module.

Message bus

Figure 3. Message bus architecture

We mentioned earlier that Pulse’s modules communicate messages with each other via the “message bus.” When something occurs at a specific component, a message (event) is published to the bus. At the same time, developers create subscriptions for each module that should receive these messages. This is known as the publisher/subscriber pattern (pub/sub for short), and is a fundamental piece of event-driven architectures.

With the message bus, you can decouple all modules from each other. In this way, a publisher does not need to worry about how many or who their subscribers are, or what to do if a new one arrives. This is all handled by the message bus.

Pulse’s message bus, shown in Figure 3, is built as follows:

Events are published via Amazon Simple Notification Service (SNS), using a construct called a topic. Topics are the basic unit of message publication and consumption. Components are subscribed to this topic, and you can filter out unwanted messages.
Developers configure Amazon SNS subscriptions to have the events sent to a queue, which provides a buffering layer from which workers can process messages. At the same time, queues also ensure that messages are not lost if there is an error. In Pulse’s case, these queues are implemented with Amazon Simple Queue Service (SQS).
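To make the topic, subscription, and queue roles concrete, here is a small in-memory sketch of the publisher/subscriber pattern. It does not call Amazon SNS or Amazon SQS: the Topic class, the queue names, and the event shapes are all hypothetical, and a deque stands in for each queue.

```python
from collections import deque


class Topic:
    """A topic fans each published message out to its subscribed queues."""

    def __init__(self):
        self.subscriptions = []  # list of (queue, filter predicate)

    def subscribe(self, queue, accepts=lambda message: True):
        self.subscriptions.append((queue, accepts))

    def publish(self, message):
        # The publisher does not know who the subscribers are.
        for queue, accepts in self.subscriptions:
            if accepts(message):
                queue.append(message)


events_queue = deque()  # buffer for a module that only wants commits
audit_queue = deque()   # buffer for a module that wants everything

topic = Topic()
topic.subscribe(events_queue, lambda m: m["type"] == "commit")
topic.subscribe(audit_queue)

topic.publish({"type": "commit", "id": 1})
topic.publish({"type": "pull_request_merged", "id": 2})
```

Adding a third module only requires another subscribe call; the publishing code stays unchanged, which is the decoupling the message bus provides.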

Other modules

There are other parts of Pulse architecture that also use AWS serverless. For example, user authentication and sign-up are handled by Amazon Cognito, and Pulse’s frontend application is hosted on Amazon S3. This app is served to customers worldwide with low latency using Amazon CloudFront, a content delivery network.

Summary and next steps

By using AWS serverless, Codacy has been able to reduce the time required to bring Pulse to market by staying focused on developing business logic, rather than managing servers. Furthermore, Codacy is confident they can handle Pulse’s growth, as this serverless architecture will scale automatically according to demand.

Learn more about Serverless on AWS.
Visit Codacy to find out more about Pulse.

Migrating AWS Lambda functions to Arm-based AWS Graviton2 processors

2022-01-24 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/migrating-aws-lambda-functions-to-arm-based-aws-graviton2-processors/

AWS Lambda now allows you to configure new and existing functions to run on Arm-based AWS Graviton2 processors in addition to x86-based functions. Using this processor architecture option allows you to get up to 34% better price performance. This blog post highlights some considerations when moving from x86 to arm64 as the migration process is code and workload dependent.

Functions using the Arm architecture benefit from the performance and security built into the Graviton2 processor, which is designed to deliver up to 19% better performance for compute-intensive workloads. Workloads using multithreading and multiprocessing, or performing many I/O operations, can experience lower invocation time, which reduces costs.

Duration charges, billed with millisecond granularity, are 20 percent lower when compared to current x86 pricing. This also applies to duration charges when using Provisioned Concurrency. Compute Savings Plans supports Lambda functions powered by Graviton2.
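The effect of the lower duration rate can be estimated with simple arithmetic. The sketch below is illustrative only: the per-GB-second price is a placeholder assumption rather than a quoted rate, the function name is hypothetical, and request charges and free tier are ignored. Check the Lambda pricing page for current figures in your Region.

```python
def duration_cost(invocations, avg_ms, memory_mb, price_per_gb_second):
    """Estimate duration charges for a function over a billing period."""
    gb_seconds = invocations * (avg_ms / 1000) * (memory_mb / 1024)
    return gb_seconds * price_per_gb_second


# Placeholder x86 rate (an assumption for this example).
X86_PRICE = 0.0000166667
# The post states that arm64 duration charges are 20 percent lower.
ARM_PRICE = X86_PRICE * 0.8

x86_cost = duration_cost(1_000_000, 200, 512, X86_PRICE)
arm_cost = duration_cost(1_000_000, 200, 512, ARM_PRICE)
print(f"x86: ${x86_cost:.2f}  arm64: ${arm_cost:.2f}")
```

If the arm64 function also runs faster, pass its shorter average duration to see the combined saving.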

The architecture change does not affect the way your functions are invoked or how they communicate their responses back. Integrations with APIs, services, applications, or tools are not affected by the new architecture and continue to work as before.

The following runtimes, which use Amazon Linux 2, are supported on Arm:

Node.js 12 and 14
Python 3.8 and 3.9
Java 8 (java8.al2) and 11
.NET Core 3.1
Ruby 2.7
Custom runtime (provided.al2)

Lambda@Edge does not support Arm as an architecture option.

You can create and manage Lambda functions powered by Graviton2 processor using the AWS Management Console, AWS Command Line Interface (AWS CLI), AWS CloudFormation, AWS Serverless Application Model (AWS SAM), and AWS Cloud Development Kit (AWS CDK). Support is also available through many AWS Lambda Partners.

Understanding Graviton2 processors

AWS Graviton processors are custom built by AWS. Generally, you don’t need to know about the specific Graviton processor architecture, unless your applications can benefit from specific features.

The Graviton2 processor uses the Neoverse-N1 core and supports Arm V8.2 (including CRC and crypto extensions) plus several other architectural extensions. In particular, Graviton2 supports the Large System Extensions (LSE), which improve locking and synchronization performance across large systems.

Migrating x86 Lambda functions to arm64

Many Lambda functions may only need a configuration change to take advantage of the price/performance of Graviton2. Other functions may require repackaging the Lambda function using Arm-specific dependencies, or rebuilding the function binary or container image.

You do not need an Arm processor on your development machine to create Arm-based functions. You can build, test, package, compile, and deploy Arm Lambda functions on x86 machines using AWS SAM and Docker Desktop. If you have an Arm-based system, such as an Apple M1 Mac, you can natively compile binaries.

Functions without architecture-specific dependencies or binaries

If your functions don’t use architecture-specific dependencies or binaries, you can switch from one architecture to the other with a single configuration change. Many functions using interpreted languages such as Node.js and Python, or functions compiled to Java bytecode, can switch without any changes. Ensure you check binaries in dependencies, Lambda layers, and Lambda extensions.

To switch functions from x86 to arm64, you can change the Architecture within the function runtime settings using the Lambda console.

Edit AWS Lambda function Architecture

If you want to display or log the processor architecture from within a Lambda function, you can use OS-specific calls. For example, use process.arch in Node.js or platform.machine() in Python.
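As a minimal sketch of the Python option, the handler below logs the value and returns it. The handler name and return shape are assumptions for illustration. On Linux, platform.machine() typically reports aarch64 for arm64 and x86_64 for x86.

```python
import platform


def lambda_handler(event, context):
    # platform.machine() returns the hardware name reported by the OS.
    arch = platform.machine()
    print(f"Running on architecture: {arch}")
    return {"architecture": arch}
```

Logging this value once per cold start is usually enough to confirm which architecture a deployed version is using.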

When using the AWS CLI to create a Lambda function, specify the --architectures option. If you do not specify the architecture, the default value is x86_64. For example, to create an arm64 function, specify --architectures arm64.

aws lambda create-function \
    --function-name MyArmFunction \
    --runtime nodejs14.x \
    --architectures arm64 \
    --memory-size 512 \
    --zip-file fileb://MyArmFunction.zip \
    --handler lambda.handler \
    --role arn:aws:iam::123456789012:role/service-role/MyArmFunction-role

When using AWS SAM or CloudFormation, add or amend the Architectures property within the function configuration.

MyArmFunction:
  Type: AWS::Lambda::Function
  Properties:
    Runtime: nodejs14.x
    Code: src/
    Architectures:
      - arm64
    Handler: lambda.handler
    MemorySize: 512

When initiating an AWS SAM application, you can specify:

sam init --architecture arm64

When building Lambda layers, you can specify CompatibleArchitectures.

MyArmLayer:
  Type: AWS::Lambda::LayerVersion
  Properties:
    ContentUri: layersrc/
    CompatibleArchitectures:
      - arm64

Building function code for Graviton2

If you have dependencies or binaries in your function packages, you must rebuild the function code for the architecture you want to use. Many packages and dependencies have arm64 equivalent versions. Test your own workloads against arm64 packages to see if your workloads are good migration candidates. Not all workloads show improved performance due to the different processor architecture features.

For compiled languages like Rust and Go, you can use the provided.al2 custom runtime, which supports Arm. You provide a binary that communicates with the Lambda Runtime API.

When compiling for Go, set GOARCH to arm64.

GOOS=linux GOARCH=arm64 go build

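When compiling for Rust, set the target. To also tune for the Graviton2 core, pass the target CPU to the compiler through RUSTFLAGS.

RUSTFLAGS="-C target-cpu=neoverse-n1" cargo build --release --target aarch64-unknown-linux-gnu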

The default installation of Python pip on some Linux distributions is out of date (<19.3). To install binary wheel packages released for Graviton, upgrade the pip installation using:

sudo python3 -m pip install --upgrade pip

The Arm software ecosystem is continually improving. As a general rule, use later versions of compilers and language runtimes whenever possible. The AWS Graviton Getting Started GitHub repository includes known recent changes to popular packages that improve performance, including ffmpeg, PHP, .NET, PyTorch, and zlib.

You can use https://pkgs.org/ as a package repository search tool.

Sometimes code includes architecture-specific optimizations. These can include code optimized in assembly using specific instructions for CRC, or enabling a feature that works well on particular architectures. One way to see if any optimizations are missing for arm64 is to search the code for __x86_64__ ifdefs and see if there is corresponding arm64 code included. If not, consider alternative solutions.
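The search for x86-only guards can be automated with a short script. This is a rough sketch, not a complete audit tool: the function names and the list of file suffixes are assumptions, and it only checks whether a file that mentions an x86-64 macro also mentions a common arm64 macro such as __aarch64__.

```python
import re
from pathlib import Path

# Preprocessor macros commonly used to guard architecture-specific code.
X86_GUARD = re.compile(r"__x86_64__|__amd64__")
ARM_GUARD = re.compile(r"__aarch64__|__ARM_NEON")


def missing_arm64_path(source_text):
    """Return True if the text has x86-64 guards but no arm64 guard."""
    has_x86 = bool(X86_GUARD.search(source_text))
    has_arm = bool(ARM_GUARD.search(source_text))
    return has_x86 and not has_arm


def scan_tree(root, suffixes=(".c", ".h", ".cc", ".cpp")):
    """List source files under root that look x86-only."""
    hits = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            if missing_arm64_path(path.read_text(errors="ignore")):
                hits.append(str(path))
    return sorted(hits)
```

A hit is only a prompt for manual review: the file may fall back to portable code that works on arm64 without a dedicated branch.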

For additional language-specific considerations, see the links within the GitHub repository.

The Graviton performance runbook is a performance profiling reference by the Graviton team to benchmark, debug, and optimize application code.

Building function packages as container images

Functions packaged as container images must be built for the architecture (x86 or arm64) they are going to use. There are arm64 architecture versions of the AWS provided base images for Lambda. To specify a container image for arm64, use the arm64 specific image tag, for example, for Node.js 14:

public.ecr.aws/lambda/nodejs:14-arm64
public.ecr.aws/lambda/nodejs:latest-arm64
public.ecr.aws/lambda/nodejs:14.2021.10.01.16-arm64

Arm64 images are also available from Docker Hub.

You can also use arbitrary Linux base images in addition to the AWS provided Amazon Linux 2 images. Images that support arm64 include Alpine Linux 3.12.7 or later, Debian 10 and 11, Ubuntu 18.04 and 20.04. For more information and details of other supported Linux versions, see Operating systems available for Graviton based instances.

Migrating a function

Here is an example of how to migrate a Lambda function from x86 to arm64 and take advantage of newer software versions to improve price and performance. You can follow a similar approach to test your own code.

I have an existing Lambda function as part of an AWS SAM template configured without an Architectures property, which defaults to x86_64.

  Imagex86Function:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/
      Handler: app.lambda_handler
      Runtime: python3.9

The Lambda function code performs some compute intensive image manipulation. The code uses a dependency configured with the following version:

{
  "dependencies": {
    "imagechange": "^1.1.1"
  }
}

I duplicate the Lambda function within the AWS SAM template using the same source code and specify arm64 as the Architectures.

  ImageArm64Function:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/
      Handler: app.lambda_handler
      Runtime: python3.9
      Architectures:
        - arm64

I use AWS SAM to build both Lambda functions. I specify the --use-container flag to build each function within its architecture-specific build container.

sam build --use-container

I can use sam local invoke to test the arm64 function locally even on an x86 system.

AWS SAM local invoke

I then use sam deploy to deploy the functions to the AWS Cloud.

The AWS Lambda Power Tuning open-source project runs your functions using different settings to suggest a configuration to minimize costs and maximize performance. The tool allows you to compare two results on the same chart and incorporate arm64-based pricing. This is useful to compare two versions of the same function, one using x86 and the other arm64.

I compare the performance of the x86 and arm64 Lambda functions and see that the arm64 Lambda function is 12% cheaper to run:

Compare x86 and arm64 with dependency version 1.1.1

I then upgrade the package dependency to use version 1.2.1, which has been optimized for arm64 processors.

{
  "dependencies": {
    "imagechange": "^1.2.1"
  }
}

I use sam build and sam deploy to redeploy the updated Lambda functions with the updated dependencies.

I compare the original x86 function with the updated arm64 function. Using arm64 with a newer dependency code version increases the performance by 30% and reduces the cost by 43%.

Compare x86 and arm64 with dependency version 1.2.1

You can use Amazon CloudWatch to view performance metrics such as duration, using statistics. You can then compare average and p99 duration between the two architectures. Due to the Graviton2 architecture, functions may be able to use less memory. This could allow you to right-size function memory configuration, which also reduces costs.
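If you export duration samples for each architecture, the average and p99 comparison can be reproduced offline. The sketch below is an assumption-laden illustration: the function names are hypothetical, the input must be a non-empty list of durations in milliseconds, and it uses the nearest-rank percentile method, which may differ slightly from the value CloudWatch reports.

```python
import statistics


def p99(durations_ms):
    """Nearest-rank 99th percentile of a non-empty list of durations."""
    ordered = sorted(durations_ms)
    # Integer ceiling of 0.99 * n avoids floating-point rounding surprises.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize(durations_ms):
    return {"avg": statistics.mean(durations_ms), "p99": p99(durations_ms)}


def compare(x86_ms, arm_ms):
    """Return the arm64 summary values as a fraction of the x86 values."""
    x86, arm = summarize(x86_ms), summarize(arm_ms)
    return {key: arm[key] / x86[key] for key in ("avg", "p99")}
```

A ratio below 1.0 for a given statistic means the arm64 function was faster on that measure.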

Deploying arm64 functions in production

Once you have confirmed your Lambda function performs successfully on arm64, you can migrate your workloads. You can use function versions and aliases with weighted aliases to control the rollout. Traffic gradually shifts to the arm64 version or rolls back automatically if any specified CloudWatch alarms trigger.
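A weighted alias sends a fraction of traffic to a second function version through the alias routing configuration. The helper below only builds that configuration as a dictionary in the AdditionalVersionWeights shape used by the Lambda alias API; it makes no AWS calls. The function names, the version number, and the 10 and 50 percent steps are assumptions for illustration.

```python
def routing_config(new_version, weight):
    """Build an alias routing configuration for a weighted rollout."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return {"AdditionalVersionWeights": {str(new_version): weight}}


def rollout_steps(new_version, weights=(0.1, 0.5)):
    """Return the routing configurations to apply, one per rollout stage."""
    return [routing_config(new_version, w) for w in weights]


# Example: shift 10 percent, then 50 percent, of traffic to version 2.
for step in rollout_steps(2):
    print(step)
```

After the final stage, the alias itself would be pointed at the new version and the routing configuration cleared.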

AWS SAM supports gradual Lambda deployments with a feature called Safe Lambda deployments using AWS CodeDeploy. You can compile package binaries for arm64 using a number of CI/CD systems. AWS CodeBuild supports building Arm based applications natively. CircleCI also has Arm compute resource classes for deployment. GitHub Actions allows you to use self-hosted runners. You can also use AWS SAM within GitHub Actions and other CI/CD pipelines to create arm64 artifacts.

Conclusion

Lambda functions using the Arm/Graviton2 architecture provide up to 34 percent price performance improvement. This blog discusses a number of considerations to help you migrate functions to arm64.

Many functions can migrate seamlessly with a configuration change; others need to be rebuilt to use arm64 packages. I show how to migrate a function and how updating software to newer versions may improve your function performance on arm64. You can test your own functions using the AWS Lambda Power Tuning tool.

Start migrating your Lambda functions to Arm/Graviton2 today.

For more serverless learning resources, visit Serverless Land.

Introducing AWS Lambda batching controls for message broker services

2022-01-20 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/introducing-aws-lambda-batching-controls-for-message-broker-services/

This post is written by Mithun Mallick, Senior Specialist Solutions Architect.

AWS Lambda now supports configuring a maximum batch window for instance-based message broker services to fine tune when Lambda invocations occur. This feature gives you an additional control on batching behavior when processing data. It applies to Amazon Managed Streaming for Apache Kafka (Amazon MSK), self-hosted Apache Kafka, and Amazon MQ for Apache ActiveMQ and RabbitMQ.

Apache Kafka is an open source event streaming platform used to support workloads such as data pipelines and streaming analytics. It is conceptually similar to Amazon Kinesis. Amazon MSK is a fully managed, highly available service that simplifies the setup, scaling, and management of clusters running Kafka.

Amazon MQ is a managed, highly available message broker service for Apache ActiveMQ and RabbitMQ that makes it easier to set up and operate message brokers on AWS. Amazon MQ reduces your operational responsibilities by managing the provisioning, setup, and maintenance of message brokers for you.

Amazon MSK, self-hosted Apache Kafka and Amazon MQ for ActiveMQ and RabbitMQ are all available as event sources for AWS Lambda. You configure an event source mapping to use Lambda to process items from a stream or queue. This allows you to use these message broker services to store messages and asynchronously integrate them with downstream serverless workflows.

In this blog, I explain how message batching works. I show how to use the new maximum batching window control for the managed message broker services and self-managed Apache Kafka.

Understanding batching

For event source mappings, the Lambda service internally polls for new records or messages from the event source, and then synchronously invokes the target Lambda function. Lambda reads the messages in batches and provides these to your function as an event payload. Batching allows higher throughput message processing, up to 10,000 messages in a batch. The payload limit of a single invocation is 6 MB.

Previously, you could only use batch size to configure the maximum number of messages Lambda would poll for. Once a defined batch size is reached, the poller invokes the function with the entire set of messages. This feature is ideal when handling a low volume of messages or batches of data that take time to build up.

Batching window

The new Batch Window control allows you to set the maximum amount of time, in seconds, that Lambda spends gathering records before invoking the function. This brings similar batching functionality that AWS supports with Amazon SQS to Amazon MQ, Amazon MSK and self-managed Apache Kafka. The Lambda event source mapping batching functionality can be described as follows.

Batching controls with Lambda event source mapping

Using MaximumBatchingWindowInSeconds, you can set your function to wait up to 300 seconds for a batch to build before processing it. This allows you to create bigger batches if there are enough messages. You can manage the average number of records processed by the function with each invocation. This increases the efficiency of each invocation, and reduces the frequency.

Setting MaximumBatchingWindowInSeconds to 0 invokes the target Lambda function as soon as the Lambda event source receives a message from the broker.

Message broker batching behavior

For ActiveMQ, the Lambda event source mapping uses the Java Message Service (JMS) API to receive messages. For RabbitMQ, Lambda uses a RabbitMQ client library to get messages from the queue.

The Lambda event source mappings act as a consumer when polling the queue. The batching pattern for all instance-based message broker services is the same. As soon as a message is received, the batching window timer starts. If there are more messages, the consumer makes additional calls to the broker and adds them to a buffer. It keeps a count of the number of messages and the total size of the payload.

The batch is considered complete if the addition of a new message makes the batch size equal to or greater than 6 MB, or the batch window timeout is reached. If the batch size is greater than 6 MB, the last message is returned to the broker.
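The completion rules above can be simulated offline to see how the batch window, batch size, and payload limit interact. This is a simplified model, not Lambda's poller: the function name and the (arrival_time, size_bytes) tuple format are assumptions, and a batch that times out is closed when the next message arrives rather than at the exact timeout instant, which yields the same grouping.

```python
MAX_PAYLOAD_BYTES = 6 * 1024 * 1024


def build_batches(messages, window_seconds, batch_size,
                  max_bytes=MAX_PAYLOAD_BYTES):
    """Group (arrival_time, size_bytes) tuples, sorted by arrival, into batches."""
    batches = []
    current, total, started = [], 0, None
    for arrival, size in messages:
        if current and arrival - started >= window_seconds:
            batches.append(current)  # batch window timeout reached
            current, total = [], 0
        if current and total + size > max_bytes:
            batches.append(current)  # message would exceed the limit, so it
            current, total = [], 0   # goes back and starts the next batch
        if not current:
            started = arrival        # first message starts the window timer
        current.append((arrival, size))
        total += size
        if len(current) >= batch_size or total >= max_bytes:
            batches.append(current)  # batch size or payload limit reached
            current, total = [], 0
    if current:
        batches.append(current)
    return batches
```

For example, with a 5-second window, three small messages arriving at 0, 1, and 2 seconds form one batch, and a message arriving at 10 seconds starts a new one.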

Lambda then invokes the target Lambda function synchronously and passes on the batch of messages to the function. The Lambda event source continues to poll for more messages and as soon as it retrieves the next message, the batching window starts again. Polling and invocation of the target Lambda function occur in separate processes.

Kafka uses a distributed append log architecture to store messages. This works differently from ActiveMQ and RabbitMQ as messages are not removed from the broker once they have been consumed. Instead, consumers must maintain an offset to the last record or message that was consumed from the broker. Kafka provides several options in the consumer API to simplify the tracking of offsets.

Amazon MSK and Apache Kafka store data in multiple partitions to provide higher scalability. Lambda reads the messages sequentially for each partition and a batch may contain messages from different partitions. Lambda then commits the offsets once the target Lambda function is invoked successfully.
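The offset handling described above can be illustrated with a small sketch in which the offset only advances after the handler succeeds. The function name and the (offset, value) record format are hypothetical, and this sketch stores the last processed offset to mirror the wording above, whereas Kafka clients conventionally commit the next offset to read.

```python
def process_partition(records, committed_offset, handler):
    """Process unread records from one partition and return the new offset.

    records is an ordered list of (offset, value) tuples. If the handler
    raises, the exception propagates and the caller keeps the old offset,
    so the same records are read again on the next attempt.
    """
    batch = [record for record in records if record[0] > committed_offset]
    if not batch:
        return committed_offset
    handler([value for _, value in batch])
    # Only reached when the handler returned without raising.
    return batch[-1][0]
```

Because records stay on the broker after they are consumed, a failed invocation costs nothing but a re-read from the unchanged offset.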

Configuring the maximum batching window

To reduce Lambda function invocations for existing or new functions, set the MaximumBatchingWindowInSeconds value close to 300 seconds. A longer batching window can introduce additional latency. For latency-sensitive workloads, set the MaximumBatchingWindowInSeconds value to an appropriate setting.

To configure the maximum batching window on a function in the AWS Management Console, navigate to the function in the Lambda console. Create a new Trigger, or edit an existing one. Along with the Batch size, you can configure a Batch window. The Trigger Configuration page is similar across the broker services.

Max batching trigger window

You can also use the AWS CLI to configure the --maximum-batching-window-in-seconds parameter.

For example, with Amazon MQ:

aws lambda create-event-source-mapping --function-name my-function \
--maximum-batching-window-in-seconds 300 --batch-size 100 \
--queues MyQueue \
--source-access-configurations Type=BASIC_AUTH,URI=<secret-arn> \
--event-source-arn arn:aws:mq:us-east-1:123456789012:broker:ExampleMQBroker:b-24cacbb4-b295-49b7-8543-7ce7ce9dfb98

You can use AWS CloudFormation to configure the parameter. The following example configures the MaximumBatchingWindowInSeconds as part of the AWS::Lambda::EventSourceMapping resource for Amazon MQ:

  LambdaFunctionEventSourceMapping:
    Type: AWS::Lambda::EventSourceMapping
    Properties:
      BatchSize: 10
      MaximumBatchingWindowInSeconds: 300
      Enabled: true
      Queues:
        - "MyQueue"
      EventSourceArn: !GetAtt MyBroker.Arn
      FunctionName: !GetAtt LambdaFunction.Arn
      SourceAccessConfigurations:
        - Type: BASIC_AUTH
          URI: !Ref secretARNParameter

You can also use AWS Serverless Application Model (AWS SAM) to configure the parameter as part of the Lambda function event source.

MQReceiverFunction:
      Type: AWS::Serverless::Function 
      Properties:
        FunctionName: MQReceiverFunction
        CodeUri: src/
        Handler: app.lambda_handler
        Runtime: python3.9
        Events:
          MQEvent:
            Type: MQ
            Properties:
              Broker: !Ref brokerARNParameter
              BatchSize: 10
              MaximumBatchingWindowInSeconds: 300
              Queues:
                - "workshop.queueC"
              SourceAccessConfigurations:
                - Type: BASIC_AUTH
                  URI: !Ref secretARNParameter

Error handling

If your function times out or returns an error for any of the messages in a batch, Lambda retries the whole batch until processing succeeds or the messages expire.

When a function encounters an unrecoverable error, the event source mapping is paused and the consumer stops processing records. Any other consumers can continue processing, provided that they do not encounter the same error. If your Lambda event records exceed the allowed size limit of 6 MB, they can go unprocessed.

For Amazon MQ, you can redeliver messages when there’s a function error. You can configure dead-letter queues (DLQs) for both Apache ActiveMQ and RabbitMQ. For RabbitMQ, you can set a per-message TTL to move failed messages to a DLQ.

Since the same event may be received more than once, functions should be designed to be idempotent. This means that receiving the same event multiple times does not change the result beyond the first time the event was received.
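One common way to make a handler idempotent is to record the identifier of each message once it has been applied and skip any repeat. The sketch below is illustrative only: the event shape, the id and amount fields, and the in-memory set are assumptions. A real function would keep processed identifiers in a durable store, because in-memory state does not survive across execution environments.

```python
# Hypothetical stores; use a durable table in practice.
processed_ids = set()
totals = {"count": 0}


def handle_message(message):
    """Apply a message once; return False if it was already applied."""
    if message["id"] in processed_ids:
        return False
    totals["count"] += message["amount"]
    processed_ids.add(message["id"])
    return True


def lambda_handler(event, context):
    # Simplified, assumed event shape: {"messages": [{"id": ..., "amount": ...}]}
    applied = [handle_message(m) for m in event["messages"]]
    return {"applied": sum(applied), "skipped": len(applied) - sum(applied)}
```

With this check in place, a retried batch leaves the totals unchanged for messages that were already processed.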

Conclusion

Lambda supports a number of event sources including message broker services like Amazon MQ and Amazon MSK. This post explains how batching works with the event sources and how messages are sent to the Lambda function.

Previously, you could only control the batch size. The new Batch Window control allows you to set the maximum amount of time, in seconds, that Lambda spends gathering records before invoking the function. This can increase the overall throughput of message processing and reduce Lambda invocations, which may lower cost.

For more serverless learning resources, visit Serverless Land.

Using Amazon Aurora Global Database for Low Latency without Application Changes

2022-01-11 Roneel Kumar

Post Syndicated from Roneel Kumar original https://aws.amazon.com/blogs/architecture/using-amazon-aurora-global-database-for-low-latency-without-application-changes/

Deploying global applications has many challenges, especially when accessing a database to build custom pages for end users. One example is an application using AWS Lambda@Edge. Two main challenges include performance and availability.

This blog explains how you can optimally deploy a global application with fast response times and without application changes.

The Amazon Aurora Global Database enables a single database cluster to span multiple AWS Regions by asynchronously replicating your data with subsecond latency. This provides fast, low-latency local reads in each Region. It also enables disaster recovery from Region-wide outages using multi-Region writer failover. These capabilities minimize the recovery time objective (RTO) after a cluster failure and reduce data loss during failure, which helps you meet your recovery point objective (RPO).

However, there are some implementation challenges. Most applications are designed to connect to a single hostname with atomic, consistent, isolated, and durable (ACID) consistency. But Global Aurora clusters provide reader hostname endpoints in each Region. In the primary Region, there are two endpoints, one for writes, and one for reads. To achieve strong data consistency, a global application requires the ability to:

Choose the optimal reader endpoints
Change writer endpoints on a database failover
Intelligently select the reader with the most up-to-date, freshest data

These capabilities typically require additional development.

The Heimdall Proxy coupled with Amazon Route 53 allows edge-based applications to access the Aurora Global Database seamlessly, without application changes. Features include automated Read/Write split with ACID compliance and edge results caching.

Figure 1. Heimdall Proxy architecture

The architecture in Figure 1 shows an Aurora Global Database with its primary Region in AP-SOUTHEAST-2, and secondary Regions in AP-SOUTH-1 and US-WEST-2. The Heimdall Proxy uses latency-based routing to determine the closest Reader Instance for read traffic, and redirects all write traffic to the Writer Instance. The Heimdall Configuration stores the Amazon Resource Name (ARN) of the global cluster. It automatically detects failover and cross-Region changes on the cluster, and directs traffic accordingly.

With an Aurora Global Database, there are two approaches to failover:

Managed planned failover. To relocate your primary database cluster to one of the secondary Regions in your Aurora global database, see Managed planned failovers with Amazon Aurora Global Database. With this feature, RPO is 0 (no data loss) and it synchronizes secondary DB clusters with the primary before making any other changes. RTO for this automated process is typically less than that of the manual failover.
Manual unplanned failover. To recover from an unplanned outage, you can manually perform a cross-Region failover to one of the secondaries in your Aurora Global Database. The RTO for this manual process depends on how quickly you can manually recover an Aurora global database from an unplanned outage. The RPO is typically measured in seconds, but this is dependent on the Aurora storage replication lag across the network at the time of the failure.

The Heimdall Proxy automatically detects Amazon Relational Database Service (RDS) / Amazon Aurora configuration changes based on the ARN of the Aurora Global cluster. Therefore, both managed planned and manual unplanned failovers are supported.

Solution benefits for global applications

Implementing the Heimdall Proxy has many benefits for global applications:

An Aurora Global Database has a primary DB cluster in one Region and up to five secondary DB clusters in different Regions. But the Heimdall Proxy deployment does not have this limitation. This allows for a larger number of endpoints to be globally deployed. Combined with Amazon Route 53 latency-based routing, new connections have a shorter establishment time. They can use connection pooling to connect to the database, which reduces overall connection latency.
SQL results are cached to the application for faster response times.
The proxy intelligently routes non-cached queries. When safe to do so, the closest (lowest latency) reader will be used. When not safe to access the reader, the query will be routed to the global writer. Proxy nodes globally synchronize their state to ensure that volatile tables are locked to provide ACID compliance.

For more information on configuring the Heimdall Proxy and Amazon Route 53 for a global database, read the Heimdall Proxy for Aurora Global Database Solution Guide.

Download a free trial from the AWS Marketplace.

Resources:

AWS Blog: How to Split Reads and Writes for Amazon RDS
AWS Blog: Automated Query Caching
AWS Blog: Advanced Connection Pooling
Contact: [email protected]

Heimdall Data, based in the San Francisco Bay Area, is an AWS Advanced ISV partner. They have AWS Service Ready designations for Amazon RDS and Amazon Redshift. Heimdall Data offers a database proxy that offloads SQL, improving database scale. Deployment does not require code changes.

Using Node.js ES modules and top-level await in AWS Lambda

2022-01-06 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/using-node-js-es-modules-and-top-level-await-in-aws-lambda/

This post is written by Dan Fox, Principal Specialist Solutions Architect, Serverless.

AWS Lambda now enables the use of ECMAScript (ES) modules in Node.js 14 runtimes. This feature allows Lambda customers to use dependency libraries that are configured as ES modules, or to designate their own function code as an ES module. It provides customers the benefits of ES module features like import/export operators, language-level support for modules, strict mode by default, and improved static analysis and tree shaking. ES modules also enable top-level await, a feature that can lower cold start latency when used with Provisioned Concurrency.

This blog post shows how to use ES modules in a Lambda function. It also provides guidance on how to use top-level await with Provisioned Concurrency to improve cold start performance for latency sensitive workloads.

Designating a function handler as an ES module

You may designate function code as an ES module in one of two ways. The first way is to specify the “type” in the function’s package.json file. By setting the type to “module”, you designate all “.js” files in the package to be treated as ES modules. Set the “type” as “commonjs” to specify the package contents explicitly as CommonJS modules:

// package.json
{
  "name": "ec-module-example",
  "type": "module",
  "description": "This package will be treated as an ES module.",
  "version": "1.0",
  "main": "index.js",
  "author": "Dan Fox",
  "license": "ISC"
}

// index.js – this file will inherit the type from 
// package.json and be treated as an ES module.

import { double } from './lib.mjs';

export const handler = async () => {
    let result = double(6); // 12
    return result;
};

// lib.mjs

export function double(x) {
    return x + x;
}

The second way to designate a function as either an ES module or a CommonJS module is by using the file name extension. File name extensions override the package type directive.

File names ending in .cjs are always treated as CommonJS modules. File names ending in .mjs are always treated as ES modules. File names ending in .js inherit their type from the package. You may mix ES modules and CommonJS modules within the same package. Packages are designated as CommonJS by default:

// this file is named index.mjs – it will always be treated as an ES module
import { square } from './lib.mjs';

export async function handler() {
    let result = square(6); // 36
    return result;
};

// lib.mjs
export function square(x) {
    return x * x;
}
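The resolution rules above can be summarized as a small decision function. This is only an illustrative sketch written in Python; `resolve_module_type` is a hypothetical helper, not something provided by Node.js or Lambda.

```python
import os


def resolve_module_type(file_name: str, package_type: str = "commonjs") -> str:
    """Return "module" or "commonjs" for a JavaScript file.

    package_type mirrors the "type" field in package.json. It defaults to
    "commonjs" because packages are CommonJS unless they say otherwise.
    """
    extension = os.path.splitext(file_name)[1]
    if extension == ".mjs":
        return "module"      # always an ES module
    if extension == ".cjs":
        return "commonjs"    # always a CommonJS module
    return package_type      # .js files inherit the package type


for name, pkg in [("index.mjs", "commonjs"), ("util.cjs", "module"),
                  ("index.js", "module"), ("index.js", "commonjs")]:
    print(name, pkg, "->", resolve_module_type(name, pkg))
```

The first two cases show the file name extension overriding the package type, and the last two show a .js file inheriting it.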

Understanding Provisioned Concurrency

When a Lambda function scales out, the process of allocating and initializing new runtime environments may increase latency for end users. Provisioned Concurrency gives customers more control over cold start performance by enabling them to create runtime environments in advance.

In addition to creating execution environments, Provisioned Concurrency also performs initialization tasks defined by customers. Customer initialization code performs a variety of tasks including importing libraries and dependencies, retrieving secrets and configurations, and initializing connections to other services. According to an AWS analysis of Lambda service usage, customer initialization code is the largest contributor to cold start latency.

Provisioned Concurrency runs both environment setup and customer initialization code. This enables runtime environments to be ready to respond to invocations with low latency and reduces the impact of cold starts for end users.

Reviewing the Node.js event loop

Node.js has an event loop that causes it to behave differently than other runtimes. Specifically, it uses a non-blocking input/output model that supports asynchronous operations. This model enables it to perform efficiently in most cases.

For example, if a Node.js function makes a network call, that request may be designated as an asynchronous operation and placed into a callback queue. The function may continue to process other operations within the main call stack without getting blocked by waiting for the network call to return. Once the network call is returned, the callback is run and then removed from the callback queue.

This non-blocking model affects the Lambda execution environment lifecycle. Asynchronous functions written in the initialization block of a Node.js Lambda function may not complete before handler invocation. In fact, it is possible for function handlers to be invoked with open items remaining in the callback queue.

Typically, JavaScript developers use the await keyword to instruct a function to block and force it to complete before moving on to the next step. However, await is not permitted in the initialization block of a CommonJS JavaScript function. This behavior limits the amount of asynchronous initialization code that can be run by Provisioned Concurrency before the invocation cycle.

Improving cold start performance with top-level await

With ES modules, developers may use top-level await within their functions. This allows developers to use the await keyword in the top level of the file. With this feature, Node.js functions may now complete asynchronous initialization code before handler invocations. This maximizes the effectiveness of Provisioned Concurrency as a mechanism for limiting cold start latency.

Consider a Lambda function that retrieves a parameter from the AWS Systems Manager Parameter Store. Previously, using CommonJS syntax, you placed the await operator in the body of the handler function:

// method1 – CommonJS

// CommonJS require syntax
const { SSMClient, GetParameterCommand } = require("@aws-sdk/client-ssm"); 

const ssmClient = new SSMClient();
const input = { "Name": "/configItem" };
const command = new GetParameterCommand(input);
const init_promise = ssmClient.send(command);

exports.handler = async () => {
    const parameter = await init_promise; // await inside handler
    console.log(parameter);

    const response = {
        "statusCode": 200,
        "body": parameter.Parameter.Value
    };
    return response;
};

When you designate code as an ES module, you can use the await keyword at the top level of the code. As a result, the code that makes a request to the AWS Systems Manager Parameter Store now completes before the first invocation:

// method2 – ES module

// ES module import syntax
import { SSMClient, GetParameterCommand } from "@aws-sdk/client-ssm"; 

const ssmClient = new SSMClient();
const input = { "Name": "/configItem" };
const command = new GetParameterCommand(input);
const parameter = await ssmClient.send(command); // top-level await

export async function handler() {
    const response = {
        statusCode: 200,
        "body": parameter.Parameter.Value
    };
    return response;
};

With on-demand concurrency, an end user is unlikely to see much difference between these two methods. But when you run these functions using Provisioned Concurrency, you may see performance improvements. Using top-level await, Provisioned Concurrency fetches the parameter during its startup period instead of during the handler invocation. This reduces the duration of the handler execution and improves end user response latency for cold invokes.
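To make the timing difference concrete, here is a simplified simulation written in Python with asyncio rather than Node.js. It is a sketch under stated assumptions: `fetch_parameter` sleeps for a hypothetical 50 ms instead of calling the Parameter Store, and the first scenario assumes the first invocation arrives while the lookup is still pending.

```python
import asyncio
import time

FETCH_SECONDS = 0.05  # hypothetical latency of the parameter lookup


async def fetch_parameter() -> str:
    """Stand-in for the Parameter Store call; sleeps instead of using the network."""
    await asyncio.sleep(FETCH_SECONDS)
    return "config-value"


async def handler_awaits_inside(pending) -> float:
    """Method 1: the handler itself waits for the pending initialization."""
    start = time.perf_counter()
    parameter = await pending
    _response = {"statusCode": 200, "body": parameter}
    return time.perf_counter() - start


async def handler_preinitialized(parameter: str) -> float:
    """Method 2: the value was resolved before the handler was invoked."""
    start = time.perf_counter()
    _response = {"statusCode": 200, "body": parameter}
    return time.perf_counter() - start


async def compare() -> tuple:
    # Method 1: initialization starts, but the first invocation arrives
    # while the lookup is still pending.
    pending = asyncio.create_task(fetch_parameter())
    inside = await handler_awaits_inside(pending)

    # Method 2: initialization finishes before the first invocation.
    parameter = await fetch_parameter()
    preinitialized = await handler_preinitialized(parameter)
    return inside, preinitialized


inside, preinitialized = asyncio.run(compare())
print(f"await inside handler: {inside * 1000:.1f} ms")
print(f"await during init:    {preinitialized * 1000:.1f} ms")
```

The total work is the same in both cases. What changes is whether the wait is counted inside the handler, where the end user experiences it, or during initialization.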

Performing benchmark testing

You can perform benchmark tests to measure the impact of top-level await. I have created a project that contains two Lambda functions, one that contains an ES module and one that contains a CommonJS module.

Both functions are configured to respond to a single API Gateway endpoint. Both functions retrieve a parameter from AWS Systems Manager Parameter Store and are configured to use Provisioned Concurrency. The ES module uses top-level await to retrieve the parameter. The CommonJS function awaits the parameter retrieval in the handler.

Before deploying the solution, you need:

An AWS account (sign up for an account if you don’t have one).
The AWS SAM CLI installed.
Node.js installed (version 14.8 minimum).

To deploy:

From a terminal window, clone the git repo:
git clone https://github.com/aws-samples/aws-lambda-es-module-performance-benchmark
Change directory:
cd ./aws-lambda-es-module-performance-benchmark
Build the application:
sam build
Deploy the application to your AWS account:
sam deploy --guided
Take note of the API Gateway URL in the Outputs section.

This post uses Artillery, a popular open source tool, to provide load testing. To perform load tests:

Open the config.yaml document in the /load_test directory and replace the target string with the URL of the API Gateway:
target: "Put API Gateway url string here"
From a terminal window, navigate to the /load_test directory:
cd load_test
Download and install dependencies:
npm install
Begin load test for the CommonJS function.
./test_commonjs.sh
Begin load test for ES module function.
./test_esmodule.sh

Reviewing the results

Here is a side-by-side comparison of the results of two load tests of 600 requests each. The left shows the results for the CommonJS module and the right shows the results for the ES module. The p99 response time reflects the cold start durations when the Lambda service scales up the function due to load. The p99 for the CommonJS module is 603 ms while the p99 for the ES module is 340.5 ms, a performance improvement of 43.5% (262.5 ms) for the p99 of this comparison load test.
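The arithmetic behind these figures can be checked with a short Python sketch. The `percentile` helper uses the nearest-rank definition, which is an assumption made for illustration; the load testing tool may compute percentiles differently.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def improvement(before_ms, after_ms):
    """Return the absolute and relative reduction in latency."""
    saved = before_ms - after_ms
    return saved, saved / before_ms * 100


saved_ms, saved_pct = improvement(603, 340.5)
print(f"{saved_ms} ms, {saved_pct:.1f}%")  # 262.5 ms, 43.5%

# With 600 samples, the nearest-rank p99 is the 594th fastest response,
# so only the six slowest requests sit above it.
print(math.ceil(99 * 600 / 100))  # 594
```

Because so few requests fall above the p99, this metric is sensitive to the slow outliers that cold starts produce.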

Cleaning up

To delete the sample application, use the latest version of the AWS SAM CLI and run:

sam delete

Conclusion

Lambda functions now support ES modules in Node.js 14.x runtimes. ES modules support await at the top level of function code. Using top-level await maximizes the effectiveness of Provisioned Concurrency and can reduce the latency experienced by end users during cold starts.

This post demonstrates a sample application that can be used to perform benchmark tests that measure the impact of top-level await.

For more serverless content, visit Serverless Land.

Validating addresses with AWS Lambda and the Amazon Location Service

2022-01-06 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/validating-addresses-with-aws-lambda-and-the-amazon-location-service/

This post is written by Matthew Nightingale, Associate Solutions Architect.

Traditional methods of performing address validation on geospatial datasets can be expensive and time consuming. Using Amazon Location Service with AWS Lambda in a serverless data processing pipeline, you may achieve significant performance improvements and cost savings on address validation jobs that use geospatial data.

This blog contains a deployable AWS Serverless Application Model (AWS SAM) template. It also uses sample data sourced from publicly available datasets that you can deploy and use to test the application. This blog offers a starting point to build out a serverless address validation pipeline in your own AWS account.

Overview

This application implements a serverless scatter/gather architecture using Lambda and Amazon S3, performing address validation with the Amazon Location Service. An S3 PUT event triggers each Lambda function to run data processing jobs along each step of the pipeline.

To test the application, a user uploads a .CSV file to S3. This dataset is labeled with fields that are recognized by the 2waygeocoder Lambda function. The application returns a processed dataset to S3 appended with location information from the Amazon Location Places API.

The Scatter Lambda function takes a dataset from the S3 bucket labeled input and splits it into equally sized shards.
The Process Lambda function takes each shard from the pre-processed bucket. It performs address validation in parallel with a 2waygeocoder function calling the Amazon Location Service Places API.
The Gather Lambda function takes each shard from the post-processed bucket. It appends the data into a complete dataset with additional address information.
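The three steps above can be sketched in plain Python without any AWS services. This is not the code from the sample repository: the function names are hypothetical, `fake_geocode` stands in for the Places API call, and the last shard may be smaller than the others.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def scatter(rows: List[Row], shard_size: int) -> List[List[Row]]:
    """Split the dataset into shards of at most shard_size rows."""
    return [rows[i:i + shard_size] for i in range(0, len(rows), shard_size)]


def process(shard: List[Row], geocode: Callable[[Row], Row]) -> List[Row]:
    """Enrich every row in one shard."""
    return [geocode(row) for row in shard]


def gather(shards: List[List[Row]]) -> List[Row]:
    """Append the processed shards back into one dataset."""
    return [row for shard in shards for row in shard]


def fake_geocode(row: Row) -> Row:
    # Placeholder for the Places API call; it only marks the row as processed.
    return {**row, "Validated": "yes"}


rows = [{"Address": f"{n} Main St"} for n in range(1, 8)]
shards = scatter(rows, shard_size=3)
result = gather([process(shard, fake_geocode) for shard in shards])
print(len(shards), len(result))  # 3 7
```

In the deployed application each shard is processed by a separate Lambda invocation, so the `process` calls run in parallel rather than in a loop.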

Amazon Location Service

Amazon Location Service sources high-quality geospatial data from Esri and HERE to support searches by using a place index resource.

With the Amazon Location Places API, you can convert addresses and other textual queries into geographic coordinates (also known as geocoding). You can also convert geographic positions into addresses and place descriptions (known as reverse geocoding).

The example application includes a 2waygeocoder capable of both geocoding and reverse geocoding. The next section shows examples of the call and response from the Amazon Location Places API for both geocoding and reverse geocoding.

Geocoding with Amazon Location Service

Here is an example of calling the Amazon Location Service Places API using the AWS SDK for Python (Boto3). This uses the search_place_index_for_text method:

import boto3

location = boto3.client("location")

# The place index is created using Amazon Location Service
response = location.search_place_index_for_text(
    IndexName="explore.place",
    Text="Boston, MA")
location_response = response["Results"]
print(location_response)

Example response:

Reverse geocoding with Amazon Location Service

Here is another example of calling the Amazon Location Service Places API using the AWS SDK for Python (boto3). This uses the search_place_index_for_position method:

import boto3

location = boto3.client("location")

# The place index is created using Amazon Location Service
# Position is a list in [longitude, latitude] order
response = location.search_place_index_for_position(
    IndexName="explore.place",
    Position=[-71.056739, 42.358660])
location_response = response["Results"]
print(location_response)

Example response:

Design considerations

Processing data with Lambda in parallel using a serverless scatter/gather pipeline helps provide performance efficiency at lower cost. To provide even greater performance, you can optimize your Lambda configuration for higher throughput. There are several strategies you can implement to do this and a few key topics to keep in mind.

Increase the allocated memory for your Lambda function

The simplest way to increase throughput is to increase the allocated memory of the Lambda function.

Faster Lambda functions can process more data and increase throughput. This works even if a Lambda function’s memory utilization is low. This is because increasing memory also increases vCPUs in proportion to the amount configured. Each function supports up to 10 GB of memory and you can access up to six vCPUs per function.

To see the average cost and execution speed for each memory configuration, use the Lambda Power Tuning tool, which helps you visualize the tradeoffs.

Optimize shard size

Another method for increasing performance in a serverless scatter/gather architecture is to optimize the total number of shards created by the scatter function. Increasing the total number of shards consequently reduces the size of any single shard, allowing Lambda to process each shard faster.

When scaling with Lambda, one instance of a function handles one request at a time. When the number of requests increases, Lambda creates more instances of the function to process traffic. Because S3 invokes Lambda asynchronously, there is an internal queue buffering requests between the event source and the Lambda service.

In a serverless scatter/gather architecture, having more shards results in more concurrent invocations of the process Lambda function. For more information about scaling and concurrency with Lambda, see this blog post. Increasing concurrency with Lambda can lead to API request throttling.

Consider API request throttling with your concurrent Lambda functions

In a serverless scatter/gather architecture, the rate at which your code calls APIs increases by a factor equal to the number of concurrent Lambda functions. This means API request limits can quickly be exceeded. You must consider Service Quotas and API request limits when trying to increase the performance of your serverless scatter/gather architecture.

For example, the Amazon Location Places APIs called in the processing function of this application have a default limit of 50 API requests per second. The 2waygeocoder makes about 12 API calls per second on average. Splitting the dataset into more than four shards may cause API throttling exception errors in this case, because five concurrent functions would make about 60 calls per second. Requests to increase Service Quotas can be made through your AWS account.
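The shard limit follows from dividing the quota by the per-function call rate. Here is a minimal Python sketch of that calculation; `max_concurrent_shards` is a hypothetical helper, and the inputs are the figures quoted above.

```python
def max_concurrent_shards(quota_per_second: float,
                          calls_per_second_per_function: float) -> int:
    """Largest number of concurrent functions that stays within the quota."""
    return int(quota_per_second // calls_per_second_per_function)


limit = max_concurrent_shards(50, 12)
print(limit)             # 4
print((limit + 1) * 12)  # 60 calls per second, above the quota of 50
```

If your quota or measured call rate differs, substitute your own numbers to pick a safe shard count.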

Deploying the solution

You need the following prerequisites to deploy the example application:

AWS account.
AWS SAM CLI.
Python 3.9.
An AWS Identity and Access Management (IAM) role with appropriate access.

Deploy the example application:

Clone the repository and download the sample source code to your environment where AWS SAM is installed:
git clone https://github.com/aws-samples/amazon-location-service-serverless-address-validation
Change into the project directory containing the template.yaml file:
cd ~/environment/amazon-location-service-serverless-address-validation
Build the application using AWS SAM:
sam build
Deploy the application to your account using AWS SAM. Be sure to follow proper S3 naming conventions providing globally unique names for S3 buckets:
sam deploy --guided

Testing the application

Testing geocoding

To test the application, download the dataset that is linked in the Testing the Application section of the GitHub repository. These tests demonstrate both the geocoding and reverse-geocoding capabilities of the application.

First, test the geocoding capabilities. You perform address validation on the City of Hartford Business Listing dataset linked in the GitHub repository. The dataset contains a listing of all the active businesses registered in the city of Hartford, CT, and each business address. The GitHub repo links to an external website where you can download the dataset.

Download the .csv version of the City of Hartford Business Listing dataset. The link is found in the Testing the Application section of the README file on GitHub.
Open the file locally to explore its contents.
Ensure that the .csv file contains columns labeled as “Address”, “City”, and “State”. The 2waygeocoder deployed as part of the AWS SAM template recognizes these columns to perform geocoding.
Before testing the application’s geocoding capabilities, explore the pricing of Amazon Location Service. In order to save money, you can trim the length of the dataset for testing by removing rows. Once the dataset is trimmed to a desired length, navigate to S3 in the AWS Management Console.
Upload the dataset to the S3 bucket labeled “input”. This triggers the scatter function.
Navigate to the S3 bucket labeled “raw” to view the shards of your dataset created by the scatter function.
Navigate to Lambda and select the 2waygeocoder function to view the CloudWatch Logs to see any information that is returned by the function code in near-real-time.
Once the data is processed, navigate to the S3 bucket labeled “destination” to view the complete processed dataset that is created by the gather function. It may take several minutes for your dataset to finish processing.

Congratulations! You have successfully geocoded a dataset using Amazon Location Service with a serverless address validation pipeline.

Testing reverse-geocoding

Next, test the reverse-geocoding capabilities of the application. You perform address validation on the Miami Housing Dataset linked in the GitHub repository. This dataset contains information on 13,932 single-family homes sold in Miami. The repo links to an external website where you can download the dataset.

Before testing, explore the pricing of Amazon Location Service. To start the test:

Download the zip file containing the .csv version of the dataset. The link is found in the Testing the Application section of the README file on GitHub.
Open the file locally to explore its contents.
Ensure the .csv file contains columns A and B labeled “Latitude” and “Longitude”. You must edit these column headers to match the correct format that is recognized by the 2waygeocoder to perform reverse-geocoding. Only the “L” should be capitalized.
To minimize cost, trim the length of the dataset for testing by removing rows. At the full size of ~13,933 rows, the dataset takes approximately 5 minutes to process.
Once the dataset is trimmed to a desired length and both column A and B are labeled as “Latitude” and “Longitude” respectively, navigate to S3 in the AWS Management Console, and upload the dataset to your S3 bucket labeled “input”.
Navigate to the S3 bucket labeled “raw” to view the shards of your dataset.
Navigate to Lambda and select the 2waygeocoder function to view the CloudWatch Logs to see any information that is returned by the function code in near-real-time.
Navigate to the S3 bucket labeled “destination” to view the complete processed dataset that is created by the gather function. It may take several minutes for your dataset to finish processing.
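If you prefer to script the header edit and row trimming described in the steps above, the following Python sketch shows one way to do it with the standard csv module. The sample rows and the original header names are made up for illustration and are not taken from the Miami dataset.

```python
import csv
import io


def prepare_dataset(csv_text: str, rename: dict, max_rows: int) -> str:
    """Rename header columns and keep only the first max_rows data rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = [rename.get(name, name) for name in next(reader)]
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    writer.writerow(header)
    for index, row in enumerate(reader):
        if index >= max_rows:
            break
        writer.writerow(row)
    return output.getvalue()


# Hypothetical sample data with upper-case headers.
sample = "LATITUDE,LONGITUDE,PRICE\n25.8,-80.2,100\n25.9,-80.3,200\n26.0,-80.1,300\n"
prepared = prepare_dataset(
    sample, {"LATITUDE": "Latitude", "LONGITUDE": "Longitude"}, max_rows=2)
print(prepared)
```

To work with a file on disk, read its contents into `csv_text` and write the returned string to a new .csv file before uploading it.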

Congratulations! You have successfully reverse-geocoded a dataset with Amazon Location Service using a serverless scatter/gather pipeline. You can move on to the conclusion, or continue to test the geocoding capabilities of the application with additional datasets.

Next steps

To get started testing your own datasets, use the AWS SAM template from GitHub deployed as part of this blog. Ensure that the column labels in your dataset match the constructs used in this blog post. The 2waygeocoder recognizes columns labeled “Latitude” and “Longitude” to perform reverse-geocoding, and “Address”, “City”, and “State” to perform geocoding.
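Before uploading your own dataset, you could check its headers with a small script. This Python sketch is not the 2waygeocoder's actual logic; `detect_mode` is a hypothetical helper, and preferring reverse-geocoding when both sets of columns are present is an assumption.

```python
GEOCODE_COLUMNS = {"Address", "City", "State"}
REVERSE_GEOCODE_COLUMNS = {"Latitude", "Longitude"}


def detect_mode(headers):
    """Return which operation the column labels support, or None if neither."""
    columns = set(headers)
    if REVERSE_GEOCODE_COLUMNS <= columns:
        return "reverse-geocode"
    if GEOCODE_COLUMNS <= columns:
        return "geocode"
    return None


print(detect_mode(["Address", "City", "State", "Name"]))  # geocode
print(detect_mode(["Latitude", "Longitude", "Price"]))    # reverse-geocode
print(detect_mode(["latitude", "longitude"]))             # None
```

The last call returns None because the labels are case-sensitive in this sketch, which mirrors the capitalization requirement described earlier.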

Now that the data has been geocoded by Amazon Location Service and is in S3, you can use Amazon QuickSight geospatial charts to quickly and easily create interactive charts. For information on how to create a Dataset in QuickSight using Amazon S3 Files, check out the QuickSight User Guide.

Below is an example using QuickSight geospatial charts to map the Miami housing dataset. The map shows average sale price by ZIP code:

This example uses QuickSight geospatial charts to map the City of Hartford Business dataset. The map shows DBA (doing business as) by latitude and longitude:

Conclusion

This blog post performs address validation with the Amazon Location Service, demonstrating both geocoding and reverse geocoding capabilities.

Using a serverless architecture with S3 and Lambda, you can achieve both cost optimization and performance improvement compared with traditional methods of address validation. Using this application, your organization can better understand and harness geospatial data.

For more serverless learning resources, visit Serverless Land.

ICYMI: Serverless Q4 2021

2022-01-04 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/icymi-serverless-q4-2021/

Welcome to the 15th edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share all of the most recent product launches, feature enhancements, blog posts, webinars, Twitch live streams, and other interesting things that you might have missed!

In case you missed our last ICYMI, check out what happened last quarter here.

AWS Lambda

For developers using Amazon MSK as an event source, Lambda has expanded authentication options to include IAM, in addition to SASL/SCRAM. Lambda also now supports mutual TLS authentication for Amazon MSK and self-managed Kafka as an event source.

Lambda also launched features to make it easier to operate across AWS accounts. You can now invoke Lambda functions from Amazon SQS queues in different accounts. You must grant permission to the Lambda function’s execution role and have SQS grant cross-account permissions. For developers using container packaging for Lambda functions, Lambda also now supports pulling images from Amazon ECR in other AWS accounts. To learn about the permissions required, see this documentation.

The service now supports a partial batch response when using SQS as an event source for both standard and FIFO queues. When messages fail to process, Lambda marks the failed messages and allows reprocessing of only those messages. This helps to improve processing performance and may reduce compute costs.
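Here is a minimal sketch of a Python handler that reports partial batch failures. The `process_message` logic and the sample event are hypothetical. The handler returns the failed message IDs in a `batchItemFailures` list, and this only takes effect when the event source mapping is configured to report batch item failures.

```python
import json


def process_message(body: str) -> None:
    """Hypothetical business logic; raises on a payload it cannot handle."""
    payload = json.loads(body)
    if "orderId" not in payload:
        raise ValueError("missing orderId")


def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process_message(record["body"])
        except Exception:
            # Report only this message so Lambda retries it alone.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


# Hypothetical two-message batch: the second body is not valid JSON.
event = {"Records": [
    {"messageId": "id-1", "body": json.dumps({"orderId": 1})},
    {"messageId": "id-2", "body": "not json"},
]}
print(handler(event, None))
```

Returning an empty list signals that the whole batch succeeded, so none of the messages are retried.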

Lambda launched content filtering options for functions using SQS, DynamoDB, and Kinesis as an event source. You can specify up to five filter criteria that are combined using OR logic. This uses the same content filtering language that’s used in Amazon EventBridge, and can dramatically reduce the number of downstream Lambda invocations.
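As an illustration of the OR logic, the following Python sketch builds a filter criteria structure with two patterns and evaluates events against it. The patterns and events are hypothetical, and `matches` is a deliberately simplified stand-in that handles only exact-value matching, not the full filtering language.

```python
import json

# Two hypothetical patterns; an event passes if it matches either one (OR logic).
patterns = [
    {"body": {"status": ["shipped"]}},
    {"body": {"priority": ["high"]}},
]
filter_criteria = {"Filters": [{"Pattern": json.dumps(p)} for p in patterns]}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: nested keys must exist and leaf values must be listed."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


def passes(criteria: dict, event: dict) -> bool:
    """An event passes when any one of the filter patterns matches."""
    return any(matches(json.loads(f["Pattern"]), event)
               for f in criteria["Filters"])


print(passes(filter_criteria, {"body": {"status": "shipped", "priority": "low"}}))  # True
print(passes(filter_criteria, {"body": {"status": "pending", "priority": "low"}}))  # False
```

Events that fail every pattern never reach the function, which is where the reduction in invocations comes from.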

Amazon EventBridge

Previously, you could consume Amazon S3 events in EventBridge via CloudTrail. Now, EventBridge receives events from the S3 service directly, making it easier to build serverless workflows triggered by activity in S3. You can use content filtering in rules to identify relevant events and forward these to 18 service targets, including AWS Lambda. You can also use event archive and replay, making it possible to reprocess events in testing, or in the event of an error.

AWS Step Functions

The AWS Batch console has added support for visualizing Step Functions workflows. This makes it easier to combine these services to orchestrate complex workflows over business-critical batch operations, such as data analysis or overnight processes.

Additionally, Amazon Athena has also added console support for visualizing Step Functions workflows. This can help when building distributed data processing pipelines, allowing Step Functions to orchestrate services such as AWS Glue, Amazon S3, or Amazon Kinesis Data Firehose.

Synchronous Express Workflows now supports AWS PrivateLink. This enables you to start these workflows privately from within your virtual private clouds (VPCs) without traversing the internet. To learn more about this feature, read the What’s New post.

Amazon SNS

Amazon SNS announced support for token-based authentication when sending push notifications to Apple devices. This creates a secure, stateless communication between SNS and the Apple Push Notification service (APNs).

SNS also launched the new PublishBatch API which enables developers to send up to 10 messages to SNS in a single request. This can reduce cost by up to 90%, since you need fewer API calls to publish the same number of messages to the service.
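To show where the savings come from, this Python sketch groups messages into batches of up to 10 entries. The `to_batches` helper and the sample messages are hypothetical, and the actual publish call is left as a comment so the example runs without AWS credentials.

```python
def to_batches(messages, batch_size=10):
    """Group messages into PublishBatch-sized lists of request entries."""
    batches = []
    for start in range(0, len(messages), batch_size):
        chunk = messages[start:start + batch_size]
        batches.append([
            # Each entry needs an Id that is unique within its batch.
            {"Id": str(start + offset), "Message": message}
            for offset, message in enumerate(chunk)
        ])
    return batches


messages = [f"event {n}" for n in range(25)]
batches = to_batches(messages)
print(len(messages), "messages ->", len(batches), "requests")

# Each batch could then be sent with the Boto3 SNS client, for example:
# sns.publish_batch(TopicArn=topic_arn, PublishBatchRequestEntries=batch)
```

The 90% figure is the best case, reached when every batch is full: 100 messages need 10 requests instead of 100.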

Amazon SQS

Amazon SQS released an enhanced DLQ management experience for standard queues. This allows you to redrive messages from a DLQ back to the source queue. This can be configured in the AWS Management Console, as shown here.

Amazon DynamoDB

The NoSQL Workbench for DynamoDB is a tool to simplify designing, visualizing, and querying DynamoDB tables. The tool now supports importing sample data from CSV files and exporting the results of queries.

DynamoDB announced the new Standard-Infrequent Access table class. Use this for tables that store infrequently accessed data to reduce your costs by up to 60%. You can switch to the new table class without an impact on performance or availability and without changing application code.

AWS Amplify

AWS Amplify now allows developers to override Amplify-generated IAM, Amazon Cognito, and S3 configurations. This makes it easier to customize the generated resources to best meet your application’s requirements. To learn more about the “amplify override auth” command, visit the feature’s documentation.

Similarly, you can also add custom AWS resources using the AWS Cloud Development Kit (CDK) or AWS CloudFormation. In another new feature, developers can then export Amplify backends as CDK stacks and incorporate them into their deployment pipelines.

AWS Amplify UI has launched a new Authenticator component for React, Angular, and Vue.js. Aside from the visual refresh, this provides the easiest way to incorporate social sign-in in your frontend applications with zero-configuration setup. It also includes more customization options and form capabilities.

AWS launched AWS Amplify Studio, which automatically translates designs made in Figma to React UI component code. This enables you to connect UI components visually to backend data, providing a unified interface that can accelerate development.

AWS AppSync

You can now use custom domain names for AWS AppSync GraphQL endpoints. This enables you to specify a custom domain for both GraphQL API and Realtime API, and have AWS Certificate Manager provide and manage the certificate.

To learn more, read the feature’s documentation page.

News from other services

Serverless blog posts

October

Oct 4 – Simplifying B2B integrations with AWS Step Functions Workflow Studio
Oct 6 – Operating serverless at scale: Implementing governance – Part 1
Oct 7 – Using Okta as an identity provider with Amazon MWAA
Oct 11 – Avoiding recursive invocation with Amazon S3 and AWS Lambda
Oct 12 – Operating serverless at scale: Improving consistency – Part 2
Oct 14 – Using JSONPath effectively in AWS Step Functions
Oct 14 – Accepting API keys as a query string in Amazon API Gateway
Oct 14 – Visualizing AWS Step Functions workflows from the AWS Batch console
Oct 18 – Building dynamic Amazon SNS subscriptions for auto scaling container workloads
Oct 19 – Operating serverless at scale: Keeping control of resources – Part 3
Oct 21 – Creating AWS Serverless batch processing architectures
Oct 25 – Building a difference checker with Amazon S3 and AWS Lambda
Oct 26 – Monitoring and tuning federated GraphQL performance on AWS Lambda
Oct 27 – Accelerating serverless development with AWS SAM Accelerate
Oct 28 – Creating AWS Lambda environment variables from AWS Secrets Manager

November

Nov 1 – Build workflows for Amazon Forecast with AWS Step Functions
Nov 2 – Choosing between storage mechanisms for ML inferencing with AWS Lambda
Nov 4 – Introducing cross-account Amazon ECR access for AWS Lambda
Nov 8 – Implementing header-based API Gateway versioning with Amazon CloudFront
Nov 9 – Creating static custom domain endpoints with Amazon MQ for RabbitMQ
Nov 9 – Token-based authentication for iOS applications with Amazon SNS
Nov 11 – Understanding how AWS Lambda scales with Amazon SQS standard queues
Nov 17 – Modernizing deployments with container images in AWS Lambda
Nov 18 – Deploying AWS Lambda layers automatically across multiple Regions
Nov 18 – Publishing messages in batch to Amazon SNS topics
Nov 19 – Introducing mutual TLS authentication for Amazon MSK as an event source
Nov 22 – Expanding cross-Region event routing with Amazon EventBridge
Nov 22 – Offset lag metric for Amazon MSK as an event source for Lambda
Nov 23 – Visualizing AWS Step Functions workflows from the Amazon Athena console
Nov 26 – Filtering event sources for AWS Lambda functions

December

AWS re:Invent breakouts

AWS re:Invent was held in Las Vegas from November 29 to December 3, 2021. The Serverless DA team presented numerous breakouts, workshops and chalk talks. Rewatch all our breakout content:

We also launched an interactive serverless application at re:Invent to help customers get caffeinated!

Serverlesspresso is a contactless, serverless order management system for a physical coffee bar. The architecture comprises several serverless apps that support an ordering process from a customer’s smartphone to a real espresso bar. The customer can check the virtual line, place an order, and receive a notification when their drink is ready for pickup.

You can learn more about the architecture and download the code repo at https://serverlessland.com/reinvent2021/serverlesspresso. You can also see a video of the exhibit.

Videos

Serverless Office Hours – Tues 10 AM PT

Weekly live virtual office hours. In each session we talk about a specific topic or technology related to serverless and open it up to helping you with your real serverless challenges and issues. Ask us anything you want about serverless technologies and applications.

YouTube: youtube.com/serverlessland
Twitch: twitch.tv/aws

October

Oct 5 – Serverless Surprise! Ben Kehoe & security
Oct 12 – AWS Lambda – ARM support for Lambda functions
Oct 19 – AWS Step Functions – AWS SDK Service Integrations
Oct 20 – Using the AWS Serverless Application Model (AWS SAM) to Build Serverless Applications
Oct 26 – API Gateway – Migration tips for API keys

November

Nov 2 – pre:Invent session #1 – The serverless sessions
Nov 3 – DynamoDB Office Hours – Data Modeling with Dynobase
Nov 9 – pre:Invent session #2
Nov 16 – pre:Invent session #3
Nov 23 – pre:Invent session #4
Nov 29 – Heroes @ re:Invent part one
Nov 30 – Secret projects @ re:Invent

December

Dec 1 – Serverless leadership @ re:Invent
Dec 2 – Heroes @ re:Invent part two

Still looking for more?

The Serverless landing page has more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials.

You can also follow the Serverless Developer Advocacy team on Twitter to see the latest news, follow conversations, and interact with the team.

Eric Johnson: @edjgeek
James Beswick: @jbesw
Ben Smith: @benjamin_l_s
Julian Wood: @julian_wood
Talia Nassi: @talia_nassi

Building a serverless multi-player game that scales: Part 3

2021-12-27 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/building-a-serverless-multi-player-game-that-scales-part-3/

This post is written by Tim Bruce, Sr. Solutions Architect, DevAx, Chelsie Delecki, Solutions Architect, DNB, and Brian Krygsman, Solutions Architect, Enterprise.

This blog series discusses building a serverless game that scales, using Simple Trivia Service:

Part 1 describes the overall architecture, how to deploy to your AWS account, and the different communication methods.
Part 2 describes adding automation to the game to help your teams scale.

This post discusses how the game scales to support concurrent users (CCU) under a load test. While this post focuses on Simple Trivia Service, you can apply the concepts to any serverless workload.

To set up the example, see the instructions in the Simple Trivia Service GitHub repo and the README.md file. This example uses services beyond AWS Free Tier and incurs charges. To remove the example from your account, see the README.md file.

Overview

Simple Trivia Service is launching at a new trivia conference. There are 200,000 registered attendees who are invited to play the game during the conference. The developers follow AWS Well-Architected best practices and load test the game before the launch.

Load testing is the practice of simulating user load to validate the system’s ability to scale. The focus of the load test is the game’s microservices, built using AWS Serverless services, including:

Amazon API Gateway and AWS IoT, which provide serverless endpoints, allowing users to interact with the Simple Trivia Service microservices.
AWS Lambda, which provides serverless compute services for the microservices.
Amazon DynamoDB, which provides a serverless NoSQL database for storing game data.

Preparing for load testing

Defining success criteria is one of the first steps in preparing for a load test. You use success criteria to determine how well the game meets its requirements. They include concurrent users, error rates, and response time. These three criteria help to ensure that your users have a good experience when playing your game.

Excluding one can lead to invalid assumptions about the scale of users that the game can support. If you exclude error rate goals, for example, users may encounter more errors, impacting their experience.

The success criteria used for Simple Trivia Service are:

200,000 concurrent users split across game types.
Error rates below 0.05%.
95th percentile synchronous responses under 1 second.

With these identified, you can develop dashboards to report on the targets. Dashboards allow you to monitor system metrics over the course of load tests. You can develop dashboards using Amazon CloudWatch dashboards, using custom widgets that organize and display metrics.

Common metrics to monitor include:

Error rates – total errors / total invocations.
Throttles – invocations resulting in 429 errors.
Percentage of quota usage – usage against your game’s Service Quotas.
Concurrent execution counts – maximum concurrent Lambda invocations.
Provisioned concurrency invocation rate – provisioned concurrency spillover invocation count / provisioned concurrency invocation count.
Latency – percentile-based response time, such as 90th and 95th percentiles.
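
As a rough illustration of how the error rate and latency metrics in this list are calculated, the following Python sketch derives both from raw samples. The function names and sample values are hypothetical and are not part of Simple Trivia Service. In practice, CloudWatch computes these statistics for you.

```python
import math


def error_rate(total_errors: int, total_invocations: int) -> float:
    """Return errors as a percentage of invocations (0.0 when nothing ran)."""
    if total_invocations == 0:
        return 0.0
    return 100.0 * total_errors / total_invocations


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds from one test window.
latencies_ms = [120, 95, 310, 101, 87, 450, 133, 99, 1020, 140]

p95 = percentile(latencies_ms, 95)
rate = error_rate(total_errors=3, total_invocations=20_000)
print(f"p95 latency: {p95} ms, error rate: {rate:.3f}%")
```

With only ten samples, a single slow response dominates the 95th percentile, which is one reason to let metrics accumulate before drawing conclusions.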

Documentation and other services are also helpful during load testing. Centralized logging via Amazon CloudWatch Logs and AWS CloudTrail provides your team with operational data for the game. This data can help triage issues during testing.

System architecture documents provide key details to help your team focus their work during triage. Amazon DevOps Guru can also provide your team with potential solutions for issues. This uses machine learning to identify operational deviations and deployments and provides recommendations for resolving issues.

A load testing tool simplifies your testing, allowing you to model users playing the game. Popular load testing tools include Apache JMeter, Artillery.io Artillery, and Locust.io Locust. The load testing tool you select can act as your application client and access your endpoints directly.

This example uses Locust to load test Simple Trivia Service based on language and technical requirements. It allows you to accurately model usage and not only generate transactions. In production applications, select a tool that aligns to your team’s skills and meets your technical requirements.

You can place automation around your load testing tool to reduce the manual effort of running tests. Automation can include allocating environments, deploying and running test scripts, and collecting results. You can include this as part of your continuous integration/continuous delivery (CI/CD) pipeline. You can use the Distributed Load Testing on AWS solution to support Taurus-compatible load testing.

Also, document a plan, working backwards from your goals, to help measure your progress. Plans typically use incremental growth of CCU, which can help you to identify constraints in your game. Use your plan during development, once portions of your game are feature complete.

This shows an example plan for load testing Simple Trivia Service:

Start with individual game testing to validate tests and game modes separately.
Add in testing of the three game modes together, mirroring expected real world activity.
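
The incremental growth of CCU in such a plan can be sketched in a few lines of Python. The stage fractions below are an assumption chosen for illustration and are not the values used for Simple Trivia Service.

```python
def ccu_plan(target_ccu: int, fractions=(0.05, 0.25, 0.50, 1.00)) -> list[int]:
    """Return the CCU level for each test stage, ending at the full target."""
    if target_ccu <= 0:
        raise ValueError("target_ccu must be positive")
    return [round(target_ccu * f) for f in fractions]


# Print a four-stage plan that works up to 200,000 concurrent users.
for stage, ccu in enumerate(ccu_plan(200_000), start=1):
    print(f"Stage {stage}: {ccu:,} concurrent users")
```

Starting small keeps early failures cheap to diagnose, because a constraint found at a low user count is easier to isolate than one found at full scale.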

Finally, evaluate your load test and architecture against your AWS Customer Agreement, AWS Acceptable Use Policy, Amazon EC2 Testing Policy, and the AWS Customer Support Policy for Penetration Testing. These policies are put in place to help you to be successful in your load testing efforts. AWS Support requires you to notify them at least two weeks prior to your load test using the Simulated Events Submission Form in the AWS Management Console. You can also use this form if you have questions before your load test.

Additional help for your load test may be available on the AWS Forums, AWS re:Post, or via your account team.

Testing

After triggering a test, automation scales up your infrastructure and initializes the test users. Depending on the number of users you need and their startup behavior, this ramp-up phase can take several minutes. Similarly, when the test run is complete, your test users should ramp down. Unless you have modeled the ramp-up and ramp-down phases to match real-world behavior, exclude these phases from your measurements. If you include them, you may optimize for unrealistic user behavior.
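
One way to exclude the ramp phases is to filter samples by elapsed time before computing any statistics. The following is a minimal sketch; the Sample type, its field names, and the timings are hypothetical and depend on what your load testing tool exports.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    elapsed_s: float   # seconds since the test started
    latency_ms: float  # response time of the request
    ok: bool           # True when the request succeeded


def steady_state(samples, ramp_up_s, test_length_s, ramp_down_s):
    """Keep only samples recorded after ramp-up ends and before ramp-down starts."""
    start = ramp_up_s
    end = test_length_s - ramp_down_s
    if end <= start:
        raise ValueError("ramp phases cover the whole test")
    return [s for s in samples if start <= s.elapsed_s < end]


# Hypothetical 10-minute test with 1-minute ramp-up and ramp-down phases.
samples = [
    Sample(10, 900, True),    # during ramp-up, excluded
    Sample(120, 200, True),
    Sample(300, 250, False),
    Sample(590, 800, True),   # during ramp-down, excluded
]
measured = steady_state(samples, ramp_up_s=60, test_length_s=600, ramp_down_s=60)
print(f"{len(measured)} of {len(samples)} samples are in the measurement window")
```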

While tests are running, let metrics normalize before drawing conclusions. Services may report data at different rates. Investigate when you find metrics that cross your acceptable thresholds. You may need to make adjustments like adding Lambda Provisioned Concurrency or changing application code to resolve constraints. You may even need to re-evaluate your requirements based on how the system performs. When you make changes, re-test to verify any changes had the impact you expected before continuing with your plan.

Finally, keep an organized record of the inputs and outputs of tests, including dashboard exports and your own observations. This record is valuable when sharing test outcomes and comparing test runs. Mark your progress against the plan to stay on track.

Analyzing and improving Simple Trivia Service performance

Running the test plan while measuring performance with observability tools reveals bottlenecks and opportunities to tune performance.

In this example, during single player individual tests, the dashboards show acceptable latency values. As the test size grows, increasing read capacity for retrieving leaderboards indicates a tuning opportunity:

The CloudWatch dashboard reveals that the LeaderboardGet function is leading to high consumed read capacity for the Players DynamoDB table. A process within the function is querying scores and player records with every call to load avatar URLs.
Standardizing the player avatar URL process within the code reduces reads from the table. The update improves DynamoDB reads.

Moving into the full test phase of the plan with combined game types identifies additional areas for performance optimization. In one case, dashboards highlight unexpected error rates for a Lambda function. Consulting the function logs and DevOps Guru to triage the behavior reveals a downstream issue with an Amazon Kinesis Data Stream:

DevOps Guru, within an insight, highlights the problem of the Kinesis:WriteProvisionedThroughputExceeded metric during our test window.
DevOps Guru also correlates that metric with the Kinesis:GetRecords.Latency metric.

DevOps Guru also links to a recommendation for Kinesis Data Streams to troubleshoot and resolve the incident with the data stream. Following this advice helps to resolve the Lambda error rates during the next test.

Load testing results

By following the plan and making incremental changes as optimizations become apparent, you can reach the goals.

The preceding table is a summary of data from Amazon CloudWatch Lambda Insights and statistics captured from Locust:

The test exceeded the goal of 200k CCU with a combined total of 236,820 CCU.
Less than 0.05% error rate with a combined average of 0.010%.
Performance goals are achieved without needing Provisioned Concurrency in Lambda.

The function latency goal of < 1 second is met, based on data from CloudWatch Lambda Insights.
Function concurrency is below Service Quotas for Lambda during the test, based on data from our custom CloudWatch dashboard.

Conclusion

This post discusses how to perform a load test on a serverless workload. The process was used to validate the scale of Simple Trivia Service, a single- and multi-player game built using a serverless-first architecture on AWS. The results show a scale of over 220,000 CCUs while maintaining less than 1-second response time and an error rate under 0.05%.

For more serverless learning resources, visit Serverless Land.

Use AWS Step Functions to Monitor Services Choreography

2021-12-22 Vito De Giosa

Post Syndicated from Vito De Giosa original https://aws.amazon.com/blogs/architecture/use-aws-step-functions-to-monitor-services-choreography/

Organizations frequently need access to quick visual insight on the status of complex workflows. This involves collaboration across different systems. If your customer requires assistance on an order, you need an overview of the fulfillment process, including payment, inventory, dispatching, packaging, and delivery. If your products are expensive assets such as cars, you must track each item’s journey instantly.

Modern applications use event-driven architectures to manage the complexity of system integration at scale. These often use choreography for service collaboration. Instead of directly invoking systems to perform tasks, services interact by exchanging events through a centralized broker. Complex workflows are the result of actions each service initiates in response to events produced by other services. Services do not directly depend on each other. This increases flexibility, development speed, and resilience.

However, choreography can introduce two main challenges for the visibility of your workflow.

It obfuscates the workflow definition. The sequence of events emitted by individual services implicitly defines the workflow. There is no formal statement that describes steps, permitted transitions, and possible failures.
It might be harder to understand the status of workflow executions. Services act independently, based on events. You can implement distributed tracing to collect information related to a single execution across services. However, getting visual insights from traces may require custom applications. This increases time to market (TTM) and cost.

To address these challenges, we will show you how to use AWS Step Functions to model choreographies as state machines. The solution enables stakeholders to gain visual insights on workflow executions, identify failures, and troubleshoot directly from the AWS Management Console.

This GitHub repository provides a Quick Start and examples on how to model choreographies.

Modeling choreographies with Step Functions

Monitoring a choreography requires a formal representation of the distributed system behavior, such as state machines. State machines are mathematical models representing the behavior of systems through states and transitions. States model situations in which the system can operate. Transitions define which input causes a change from the current state to the next. They occur when a new event happens. Figure 1 shows a state machine modeling an order workflow.

Figure 1. Order workflow
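
To make the idea of states and transitions concrete, here is a minimal Python sketch of a state machine as a transition table. The state and event names are hypothetical and are not taken from Figure 1.

```python
# Maps (current state, received event) to the next state.
# All names are illustrative only.
TRANSITIONS = {
    ("OrderPlaced", "PaymentReceived"): "Paid",
    ("OrderPlaced", "PaymentFailed"): "Failed",
    ("Paid", "ItemsShipped"): "Shipped",
    ("Shipped", "OrderDelivered"): "Delivered",
}


def next_state(current: str, event: str) -> str:
    """Return the state reached from `current` on `event`.
    An event with no defined transition fails the workflow."""
    return TRANSITIONS.get((current, event), "Failed")


def run(events, start="OrderPlaced"):
    """Feed a sequence of events through the machine and return the final state."""
    state = start
    for event in events:
        state = next_state(state, event)
        if state == "Failed":
            break
    return state


print(run(["PaymentReceived", "ItemsShipped", "OrderDelivered"]))
print(run(["ItemsShipped"]))  # shipping before payment is not a permitted transition
```

Writing the permitted transitions down explicitly is what turns an implicit sequence of events into a formal workflow definition.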

The solution in this post uses Amazon States Language to describe a choreography as a Step Functions state machine. The state machine pauses, using Task states combined with a callback integration pattern. It then waits for the next event to be published on the broker. Choice states control transitions to the next state by inspecting event payloads. Figure 2 shows how the workflow in Figure 1 translates to a Step Functions state machine.

Figure 2. Order workflow translated into Step Functions state machine
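
The following Python sketch builds a small Amazon States Language definition that pairs one Task state using the callback pattern with one Choice state. It is not the definition from the GitHub repository. The table name, attribute names, state names, and the event_type field are assumptions made for illustration.

```python
import json

definition = {
    "StartAt": "WaitForPayment",
    "States": {
        "WaitForPayment": {
            "Type": "Task",
            # AWS SDK service integration with the callback pattern: the
            # execution pauses here until the task token is sent back.
            "Resource": "arn:aws:states:::aws-sdk:dynamodb:putItem.waitForTaskToken",
            "Parameters": {
                "TableName": "TaskTokens",  # hypothetical table
                "Item": {
                    # Store the token under the execution name (the correlation id).
                    "order_id": {"S.$": "$$.Execution.Name"},
                    "task_token": {"S.$": "$$.Task.Token"},
                },
            },
            "Next": "CheckPaymentEvent",
        },
        "CheckPaymentEvent": {
            "Type": "Choice",
            # Inspect the payload that was passed back with the token.
            "Choices": [
                {
                    "Variable": "$.event_type",  # hypothetical payload field
                    "StringEquals": "Payment Received",
                    "Next": "Paid",
                }
            ],
            "Default": "UnexpectedEvent",
        },
        "Paid": {"Type": "Succeed"},
        "UnexpectedEvent": {"Type": "Fail", "Error": "UnexpectedEvent"},
    },
}

print(json.dumps(definition, indent=2))
```

A real choreography would chain one Task and Choice pair per expected event, where this sketch ends after the first one.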

Figure 3 shows the architecture for monitoring choreographies with Step Functions.

Figure 3. Choreography monitoring with AWS Step Functions

1. Services involved in the choreography publish events to Amazon EventBridge. There are two configured rules. The first rule matches the first event of the choreography sequence, Order Placed in the example. The second rule matches any other event of the sequence. Event payloads contain a correlation id (order_id) to group them by workflow instance.
2. The first rule invokes an AWS Lambda function, which starts a new execution of the choreography state machine. The correlation id is passed in the name parameter, so you can quickly identify an execution in the AWS Management Console.
3. The state machine uses Task states with AWS SDK service integrations to directly call Amazon DynamoDB. Tasks are configured with a callback pattern. They issue a token, which is stored in DynamoDB with the execution name. Then, the workflow pauses.
4. A service publishes another event on the event bus.
5. The second rule invokes another Lambda function with the event payload.
6. The function uses the correlation id to retrieve the task token from DynamoDB.
7. The function invokes the Step Functions SendTaskSuccess API, with the token and the event payload as parameters.
8. The state machine resumes the execution and uses Choice states to transition to the next state. If the choreography definition expects the received event payload, it selects the next state and the process restarts from Step 3. The state machine transitions to a Fail state when it receives an unexpected event.
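
The two Lambda functions in this flow could look roughly like the following Python sketch. It is not the code from the GitHub repository. The key and attribute names (order_id, task_token) and the output shape are assumptions. The sfn and table arguments are passed in so that the logic can be tried without AWS access. With boto3, they would be a Step Functions client and a DynamoDB Table resource.

```python
import json


def start_workflow(event, sfn, state_machine_arn):
    """Target of the first rule: start one execution per order,
    named after the correlation id so it is easy to find in the console."""
    order_id = event["detail"]["order_id"]
    return sfn.start_execution(
        stateMachineArn=state_machine_arn,
        name=order_id,
        input=json.dumps(event["detail"]),
    )


def resume_workflow(event, table, sfn):
    """Target of the second rule: look up the stored task token
    and resume the paused execution with the event payload."""
    order_id = event["detail"]["order_id"]
    item = table.get_item(Key={"order_id": order_id}).get("Item")
    if item is None:
        raise LookupError(f"no task token stored for order {order_id}")
    sfn.send_task_success(
        taskToken=item["task_token"],
        # Assumed output shape; Choice states would inspect these fields.
        output=json.dumps(
            {"event_type": event["detail-type"], "detail": event["detail"]}
        ),
    )
```

Inside a Lambda handler, you would create the real objects once, for example with boto3.client("stepfunctions") and boto3.resource("dynamodb").Table(...), and pass them to these functions.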

Increased visibility with Step Functions console

Modeling service choreographies as Step Functions Standard Workflows increases visibility with out-of-the-box features.

1. You can centrally track events produced by distributed components. Step Functions records full execution history for 90 days after the execution completes. You’ll be able to capture detailed information about the input and output of each state, including event payloads. Additionally, state machines integrate with Amazon CloudWatch to publish execution logs and metrics.

2. You can monitor choreographies visually. The Step Functions console displays a list of executions with information such as execution id, status, and start date (see Figure 4).

Figure 4. Step Functions workflow dashboard

After you’ve selected an execution, a graph inspector is displayed (see Figure 5). It shows states and transitions, and marks individual states with colors. This identifies, at a glance, successful tasks, failures, and tasks that are still in progress.

Figure 5. Step Functions graph inspector

3. You can implement event-driven automation. Step Functions emits execution status changes as events directly to EventBridge, where you can capture them with rules (see Figure 6). Additionally, you can emit events by setting alarms on the metrics that Step Functions publishes to CloudWatch. You can respond to events by initiating corrective actions, sending notifications, or integrating with third-party solutions, such as issue tracking systems.

Figure 6. Automation with Step Functions, EventBridge, and CloudWatch alarms
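
As an example of such automation, the following Python sketch shows an EventBridge event pattern that matches failed, timed-out, or aborted executions of one state machine. The state machine ARN is a placeholder. The matches function is a simplified stand-in for EventBridge's matching logic that handles exact values only, included so that you can try the pattern locally.

```python
import json

pattern = {
    "source": ["aws.states"],
    "detail-type": ["Step Functions Execution Status Change"],
    "detail": {
        "status": ["FAILED", "TIMED_OUT", "ABORTED"],
        # Placeholder ARN; replace with your own state machine.
        "stateMachineArn": [
            "arn:aws:states:us-east-1:123456789012:stateMachine:OrderChoreography"
        ],
    },
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher for exact-value patterns only
    (no prefix, numeric, or exists filters)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# The JSON form is what you would supply as the rule's event pattern.
print(json.dumps(pattern, indent=2))
```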

Enabling access to AWS Step Functions console

Stakeholders need secure access to the Step Functions console. This requires mechanisms to authenticate users and authorize read-only access to specific Step Functions workflows.

AWS Single Sign-On authenticates users by directly managing identities or through federation. SSO supports federation with Active Directory and SAML 2.0 compliant external identity providers (IdP). You give users access to Step Functions state machines by assigning a permission set, which is a collection of AWS Identity and Access Management (IAM) policies. Additionally, with permission sets, you can configure a relay state, which is a URL to redirect the user to after successful authentication. You can authenticate the user through the selected identity provider and immediately show the AWS Step Functions console with the workflow state machine already displayed. Figure 7 shows this process.

Figure 7. Access to Step Functions state machine with AWS SSO

The user logs in through the selected identity provider.
The SSO user portal uses the SSO endpoint to send the response from the previous step. SSO uses AWS Security Token Service (STS) to get temporary security credentials on behalf of the user. It then creates a console sign-in URL using those credentials and the relay state. Finally, it sends the URL back as a redirect.
The browser redirects the user to the Step Functions console.

When the identity provider does not support SAML 2.0, SSO is not a viable solution. In this case, you can create a URL with a sign-in token for users to securely access the AWS Management Console. This approach uses STS AssumeRole to get temporary security credentials. Then, it uses those credentials to obtain a sign-in token from the AWS federation endpoint. Finally, it constructs a URL for the AWS Management Console, which includes the token. You then distribute this URL to users to grant access. This is similar to the SSO process. However, it requires custom development.
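
The URL construction in this approach can be sketched in Python as follows. The sketch only builds the two URLs and makes no network calls. The function names are hypothetical, and the temporary credentials are assumed to come from the Credentials element of an STS AssumeRole response.

```python
import json
from urllib.parse import quote_plus

FEDERATION_ENDPOINT = "https://signin.aws.amazon.com/federation"


def signin_token_request_url(access_key_id, secret_access_key, session_token):
    """URL that exchanges temporary credentials for a sign-in token."""
    session = json.dumps(
        {
            "sessionId": access_key_id,
            "sessionKey": secret_access_key,
            "sessionToken": session_token,
        }
    )
    return f"{FEDERATION_ENDPOINT}?Action=getSigninToken&Session={quote_plus(session)}"


def console_login_url(signin_token, destination, issuer):
    """URL that signs the user in and redirects to `destination` in the console."""
    return (
        f"{FEDERATION_ENDPOINT}?Action=login"
        f"&Issuer={quote_plus(issuer)}"
        f"&Destination={quote_plus(destination)}"
        f"&SigninToken={quote_plus(signin_token)}"
    )
```

An HTTPS GET on the first URL returns a JSON document whose SigninToken value goes into the second URL. Setting the destination to the Step Functions console page of the state machine gives a result similar to the relay state in SSO.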

Conclusion

This post shows how you can increase visibility on choreographed business processes using AWS Step Functions. The solution provides detailed visual insights directly from the AWS Management Console, without requiring custom UI development. This reduces TTM and cost.

Serverless Scheduling with Amazon EventBridge, AWS Lambda, and Amazon DynamoDB

2021-12-16 Peter Grman

Post Syndicated from Peter Grman original https://aws.amazon.com/blogs/architecture/serverless-scheduling-with-amazon-eventbridge-aws-lambda-and-amazon-dynamodb/

Many applications perform scheduled tasks. For instance, you might want to automatically publish an article at a given time, change prices for offers which were defined weeks in advance, or notify customers 8 hours before a flight. These might be one-off tasks, or recurring ones.

On Unix-like operating systems, you might have opted for the cron utility. There are also similar alternatives for many web application frameworks, as well as advanced libraries, to schedule future one-off tasks. In a single server environment, this might seem like a simple solution. However, when you run dozens of instances of your application server, it gets harder to rely on those libraries to schedule tasks reliably at least once, without taking up too many resources. If you decide to build a serverless application, you need a new approach altogether.

This post shows how you can build a scalable serverless job scheduler. You can use this method to scale to thousands, or even millions, of distributed jobs per minute. Because you are using serverless technologies, the underlying infrastructure is fully managed by AWS and you only pay for what you use. You can use this solution as an addition to your existing applications, regardless of whether they already use serverless technologies.

Similar to a cron job running on a single instance, this solution uses an Amazon EventBridge rule, which starts new events periodically on a schedule. For recurring jobs, you would use this capability to start specific actions directly. This works if you have only a few dozen periodic tasks whose execution cycle can be defined as a cron expression. However, remember that there are limits to how many rules can be defined per event bus, and rules with a scheduled expression can only be defined on the default event bus. This post describes a method to multiplex a single Amazon EventBridge rule via an AWS Lambda function and Amazon DynamoDB, to scale beyond thousands of jobs. While this example focuses on one-off tasks, you can use the same approach for recurring jobs as well.

Overview of solution

The following diagram shows the architecture of the serverless scheduling solution.

Figure 1 – Architecture diagram showing Serverless Scheduling with Amazon EventBridge, AWS Lambda, and Amazon DynamoDB

Amazon EventBridge with scheduled expressions periodically starts an AWS Lambda function. An Amazon DynamoDB table stores the future jobs. The Lambda function queries the table for due jobs and distributes them via Amazon EventBridge to the workers.

The following services are used:

Amazon EventBridge: to initiate the serverless scheduling solution. Amazon EventBridge is a serverless event bus that makes it easier to build event-driven applications at scale. It can also schedule events based on time intervals or cron expressions.

In this solution, you’ll use EventBridge for two things:

to periodically start the AWS Lambda function, which checks for new jobs to be executed, and
to distribute those jobs to the workers.

Here, you can control the granularity of your job executions. The fastest rate possible is once every minute. But if you don’t need a 1-minute precision, you can also opt for once every 5 minutes, or even once every hour. Remember that you cannot control at which second the event is started. It might be at the beginning of the minute, in the middle, or at the end.

AWS Lambda: to execute the scheduler logic. AWS Lambda is a serverless, event-driven compute service that lets you run code without provisioning or managing servers. The Lambda function queries the jobs from DynamoDB and distributes them via EventBridge. Based on your requirements, you can adjust this to use different mechanisms to notify the workers about the jobs, such as HTTP APIs, gRPC calls, or AWS services like Amazon Simple Notification Service (SNS) or Amazon Simple Queue Service (SQS).

Amazon DynamoDB: to store scheduled jobs. Amazon DynamoDB is a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale. Defining the right data model is important to be able to scale to thousands or even millions of scheduled and processed jobs per minute. The DynamoDB table in this solution has a partition key “pk” and a sort key “sk”. For the Lambda function to be able to query all due jobs quickly and efficiently, jobs must be partitioned. For this, they are grouped together based on their scheduled times in intervals of 5 minutes. This value is the partition key “pk”. How to calculate this value is explained in detail when you test the solution.

The sort key “sk” contains the precise execution time concatenated with a unique identifier, such as a job ID, because the combination of “pk” and “sk” must be unique. To schedule a job in this example, you write it manually into the DynamoDB table. In your production code you can abstract the synchronous DynamoDB access, by implementing it in a shared library, or using Amazon API Gateway. You could also schedule jobs from a Lambda function reacting to events in your system.

Amazon EventBridge: to distribute the jobs. The Lambda function uses Amazon EventBridge as an example to distribute the jobs. The workers that should receive the jobs must configure the corresponding rules up front. For testing purposes, this solution comes with a rule that logs all events from the Lambda function into Amazon CloudWatch Logs.

Walkthrough

In this section, you will deploy the solution and test it. For this walkthrough, you need the following:

An AWS account.
An AWS user that has access to the AWS Management Console and has the IAM permissions to launch the AWS CloudFormation stack and create the aforementioned resources.

Deploying the solution

To deploy it in your account:

1. Select Launch Stack.

2. Select the Region where you want to launch your serverless scheduler.

3. Define a name for your stack. Leave the parameters with the default values for now and select Next.

4. At the bottom of the page, acknowledge the required Capabilities and select Create stack.

5. Wait until the status of the stack is CREATE_COMPLETE. This can take a minute or two.

Testing the solution

In this section, you test the serverless scheduler. First, you’ll schedule a job for some time in the near future. Afterwards, you will check that the job has been logged in CloudWatch Logs at the time it was scheduled.

1. In the AWS Management Console, navigate to the DynamoDB service and select the Items sub-menu on the left side, between Tables and PartiQL editor.

2. Select the JobsTable, which you created via the CloudFormation stack. It should be empty for now.

3. Select Create item. Make sure you switch to the JSON editor at the top, and disable View DynamoDB JSON. Now copy this item into the editor:

{
  "pk": "j#2015-03-20T09:45",
  "sk": "2015-03-20T09:46:47.123Z#564ade05-efda-4a2e-a7db-933ad3c89a83",
  "detail": {
    "action": "send-reminder",
    "userId": "16f3a019-e3a5-47ed-8c46-f668347503d1",
    "taskId": "6d2f710d-99d8-49d8-9f52-92a56d0c6b81",
    "params": {
      "can_skip": false,
      "reminder_volume": 0.5
    }
  },
  "detail_type": "job-reminder"
}

This is a sample job definition. You need to adjust it so that it starts a few minutes from now. To do this, adjust the first two attributes, the partition key “pk” and the sort key “sk”. Start with “sk”. This is the UTC timestamp for the due date of the job in ISO 8601 format (YYYY-MM-DDTHH:MM:SS), followed by a separator (“#”) and a unique identifier, so that multiple jobs can have the same due timestamp.

Afterwards adjust “pk”. The “pk” looks like the ISO 8601 timestamp in the “sk” reduced to date and time in hours and minutes. The minutes for the partition key must be an integer multiple of 5. This value represents the grouping of the jobs, so they can be queried quickly and efficiently by the Lambda function. For instance, for me 2021-11-26T13:31:55.000Z is in the future and the corresponding partition would be 2021-11-26T13:30.

Note: your local time zone might not be UTC. You can get the current UTC time on timeanddate.com.

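The following table shows the corresponding “pk” minute for every “sk” minute:

“sk” minute | “pk” minute
00 to 04 | 00
05 to 09 | 05
10 to 14 | 10
15 to 19 | 15
20 to 24 | 20
25 to 29 | 25
30 to 34 | 30
35 to 39 | 35
40 to 44 | 40
45 to 49 | 45
50 to 54 | 50
55 to 59 | 55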

The corresponding Python code would be:

f'{(sk_minutes - sk_minutes % 5):02d}'
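
Building on that expression, the following Python sketch derives both keys from a due time. The function name job_keys is hypothetical and not part of the solution. The “j#” prefix and the timestamp layout are copied from the sample item above.

```python
from datetime import datetime, timezone


def job_keys(due: datetime, job_id: str) -> dict:
    """Build the "pk" and "sk" attributes for a job due at `due`."""
    if due.tzinfo is None:
        raise ValueError("due must be timezone-aware")
    due = due.astimezone(timezone.utc)  # keys are always expressed in UTC
    # Round the minutes down to the start of the 5-minute partition.
    partition_minute = due.minute - due.minute % 5
    pk = f"j#{due:%Y-%m-%dT%H}:{partition_minute:02d}"
    sk = f"{due:%Y-%m-%dT%H:%M:%S}.{due.microsecond // 1000:03d}Z#{job_id}"
    return {"pk": pk, "sk": sk}


due = datetime(2015, 3, 20, 9, 46, 47, 123000, tzinfo=timezone.utc)
print(job_keys(due, "564ade05-efda-4a2e-a7db-933ad3c89a83"))
```

Requiring a timezone-aware datetime avoids silently treating local time as UTC, which is the most likely reason for a job not firing when expected.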

4. Now that you have defined your event in the near future, you can optionally adjust the content of the “detail” and “detail_type” attributes. These are forwarded to EventBridge as “detail” and “detail-type” and should be used by your workers to understand which task they are supposed to perform. You can find more details on EventBridge event structure in our documentation. After you have configured the job correctly, select Create item.
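
To illustrate how these attributes could map onto EventBridge events, the following Python sketch converts job items into entries for the PutEvents API. It is not the solution's actual code. The function names, the source value, and the bus name are assumptions, and the job details are assumed to contain plain JSON-serializable values.

```python
import json


def to_event_entries(jobs, source="scheduler", bus_name="default"):
    """Map job items to PutEvents entries ("detail_type" becomes DetailType)."""
    return [
        {
            "Source": source,
            "DetailType": job["detail_type"],
            "Detail": json.dumps(job["detail"]),  # Detail must be a JSON string
            "EventBusName": bus_name,
        }
        for job in jobs
    ]


def batches(entries, size=10):
    """PutEvents accepts at most 10 entries per call, so split the list."""
    for i in range(0, len(entries), size):
        yield entries[i:i + size]


job = {
    "detail": {"action": "send-reminder", "params": {"can_skip": False}},
    "detail_type": "job-reminder",
}
for batch in batches(to_event_entries([job] * 12)):
    print(f"would send {len(batch)} entries in one PutEvents call")
```

With boto3, each batch would be passed as the Entries argument of the EventBridge client's put_events call. Numbers read through boto3's DynamoDB resource API arrive as Decimal values and need converting before json.dumps can serialize them.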

5. It is time to navigate to CloudWatch Log groups and wait for the item to be due and to show up in the debug logs.

CloudWatch log groups

For now, the log streams should be empty:

Log streams screenshot

Once the item is due, you should see a new log stream with the item’s “detail” and “detail_type” attributes logged.

If you don’t see a new log stream with the item, check in your DynamoDB table that the “sk” is in the UTC time zone and that the minutes of the “pk” are a multiple of 5. You can consult the table at the end of step 3 to find the correct “pk” minutes for your “sk” minutes.

Log events screenshot

You might notice that the timestamp of the message is within a minute after the job was scheduled. In my example, I scheduled the job for 2021-11-26T13:31:55.000Z and it was put into EventBridge at 2021-11-26T13:32:33Z. The delay comes from the Lambda function only starting once per minute. As I mentioned in the beginning, the function also isn’t started at second 00 but at a random second within that minute.

Exploring the Lambda function

Now, let’s have a look at the core logic. For this, navigate to AWS Lambda in the AWS Management console and open the SchedulerFunction.

AWS Lambda screenshot

In the function configuration, you can see that it is triggered by EventBridge via a schedule expression at the rate defined in the CloudFormation stack.

Function configuration in CloudFormation stack

When you open the Code tab, you can see that it is less than 100 lines of Python code. The main part is the lambda_handler function:

def lambda_handler(event, context):
    event_time_in_utc = event['time']
    previous_partition, current_partition = get_partitions(event_time_in_utc)

    previous_jobs = query_jobs(previous_partition, event_time_in_utc)
    current_jobs = query_jobs(current_partition, event_time_in_utc)
    all_jobs = previous_jobs + current_jobs

    print('dispatching {} jobs'.format(len(all_jobs)))

    put_all_jobs_into_event_bridge(all_jobs)
    delete_all_jobs(all_jobs)

    print('dispatched and deleted {} jobs'.format(len(all_jobs)))

The function starts by calculating the current and previous partitions. This ensures that no jobs stay unprocessed in the old partition when a new one starts. Afterwards, jobs from these partitions are queried up to the current time, so no future jobs are fetched from the current partition. Lastly, all jobs are put into EventBridge and deleted from the table.
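
The partition calculation can be sketched as follows. This is an illustration of the idea rather than the deployed code: the helper `partition_for` and the constant `PARTITION_MINUTES` are hypothetical names, and the sketch assumes the EventBridge `time` field uses the `YYYY-MM-DDTHH:MM:SSZ` form shown earlier.

```python
from datetime import datetime, timedelta

PARTITION_MINUTES = 5  # grouping interval used for the partition key (assumed constant name)


def partition_for(moment):
    """Format the partition key for the interval that contains the given time."""
    minutes = moment.minute - moment.minute % PARTITION_MINUTES
    return f"{moment.strftime('%Y-%m-%dT%H')}:{minutes:02d}"


def get_partitions(event_time_in_utc):
    """Return (previous_partition, current_partition) for an event time string."""
    now = datetime.strptime(event_time_in_utc, '%Y-%m-%dT%H:%M:%SZ')
    # Step back one full interval to land somewhere in the previous partition
    previous = now - timedelta(minutes=PARTITION_MINUTES)
    return partition_for(previous), partition_for(now)


print(get_partitions('2021-11-26T13:32:33Z'))
# ('2021-11-26T13:25', '2021-11-26T13:30')
```

Subtracting a full interval before rounding also handles hour and day boundaries, because the arithmetic happens on the datetime rather than on the minute value alone.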

Instead of pushing the jobs into EventBridge, they could be started via HTTP(S), gRPC, or pushed into other AWS services, like Amazon Simple Notification Service (SNS) or Amazon Simple Queue Service (SQS). Also note that the function communicates with other AWS services synchronously and does not use batching options when putting jobs into EventBridge or deleting them from the DynamoDB table. This keeps the function simpler and easier to understand. If you plan to distribute thousands of jobs per minute, you should adjust this to improve the throughput of the Lambda function.
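
As one possible starting point for batching, the sketch below groups entries so that each PutEvents call carries up to 10 of them, which is the per-request limit of that API. The function names are hypothetical, and `events_client` is assumed to be a boto3 EventBridge client (for example, `boto3.client('events')`) created by the caller.

```python
def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def put_jobs_in_batches(events_client, entries, batch_size=10):
    """Send entries with as few PutEvents calls as possible.

    Returns the entries that EventBridge reported as failed,
    so the caller can retry them or leave them in the table.
    """
    failed = []
    for batch in chunked(entries, batch_size):
        response = events_client.put_events(Entries=batch)
        # Results come back in request order; failed ones carry an ErrorCode
        for entry, result in zip(batch, response['Entries']):
            if 'ErrorCode' in result:
                failed.append(entry)
    return failed
```

Returning the failed entries matters here because a job should only be deleted from the table after its event was accepted. The same chunking helper could be reused for DynamoDB batch deletes, which have their own per-request limit.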

Cleaning up

To avoid incurring future charges, delete the CloudFormation Stack and all resources you created.

Conclusion

In this post, you learned how to build a serverless scheduling solution. Because it uses only serverless technologies that scale automatically, don’t require maintenance, and offer a pay-as-you-go pricing model, this scheduler can be implemented for use cases with varying throughput requirements for their scheduled jobs. These could range from publishing articles at a scheduled time to notifying hundreds of passengers per minute about their upcoming flight.

You can adjust the Lambda function to distribute the jobs with a technology more fitting to your application, as well as to handle recurring tasks. The grouping interval of 5 minutes for the partition key can also be adjusted based on your throughput requirements. It’s important to note that for this solution to work, the interval by which the jobs are grouped must be longer than the interval at which the Lambda function is started.

Give it a try and let us know your thoughts in the comments!

Modernize your Penetration Testing Architecture on AWS Fargate

2021-12-14 Conor Walsh

Post Syndicated from Conor Walsh original https://aws.amazon.com/blogs/architecture/modernize-your-penetration-testing-architecture-on-aws-fargate/

Organizations in all industries are innovating their application stack through modernization. Developers have found that modular architecture patterns, serverless operational models, and agile development processes provide great benefits. They offer faster innovation, reduced risk, and reduction in total cost of ownership.

Security organizations must evolve and innovate as well. But security practitioners often find themselves stuck between using powerful yet inflexible open-source tools with little support, and monolithic software with expensive and restrictive licenses.

This post describes how you can use modern cloud technologies to build a scalable penetration testing platform, with no infrastructure to manage.

The penetration testing monolith

AWS operates under the shared responsibility model, where AWS is responsible for the security of the cloud, and the customer is responsible for securing workloads in the cloud. This includes validating the security of your internal and external attack surface. Following the AWS penetration testing policy, customers can run tests against their AWS accounts, except for denial of service (DoS).

A legacy model commonly involves a central server for running a scanning application among the team. The server must be powerful enough for peak load and likely runs 24/7. Common licensing for scanner software is capped on the number of targets you can scan. This model does not scale, and incurs cost when no assessments are being performed.

Penetration testers must constantly reinvent their toolkit. Many one-off tools or scripts are built during engagements when encountering a unique problem. These tools and their environments are often customized, making standardization between machines and software difficult. Building, maintaining, and testing UI/UX and platform compatibility can be expensive and difficult to scale. This often leads to these tools being discarded and the value lost when the analyst moves on to the next engagement. Later, other analysts may run into the same scenario and need to rebuild the tool all over again, resulting in duplicated effort.

Network security scanning using modern cloud infrastructure

By using modern cloud container technologies, we can redesign this monolithic architecture to one that scales to meet increased demand, yet incurs no cost when idle. Containerization provides flexibility and secure isolation.

Figure 1. Overview of the serverless security scanning architecture

Scanning task flow

This workflow is based on the architecture shown in Figure 1:

1. User authenticates to Amazon Cognito with their organization’s SSO.
2. User makes authorized request to Amazon API Gateway.
3. Request is forwarded to an AWS Lambda function that pulls configuration from Amazon Simple Storage Service (S3).
4. Lambda function validates parameters, incorporates them into the task definition, and calls Amazon Elastic Container Service (ECS).
5. ECS orchestrates worker nodes using AWS Fargate compute engine and initiates task.
6. ECS asynchronously returns the task configuration to Lambda, which sanitizes sensitive data and sends response through API Gateway.
7. The ECS task launches one or more containers, which run the tool.
8. Scan results are stored in the ephemeral storage provided by Fargate.
9. Final container in the ECS task copies the scan report to S3.
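
The validation and task launch performed by the Lambda function in this flow might look like the following sketch. It is not the post's implementation: the function name, the `config` keys, and the Nmap-style command are assumptions, and the scan parameters are passed as container overrides to ECS RunTask rather than written into a new task definition.

```python
import ipaddress


def build_run_task_args(config, target, ports="1-1024"):
    """Validate scan parameters and build keyword arguments for ECS RunTask."""
    # Reject anything that is not a plain IP address or CIDR range
    network = ipaddress.ip_network(target, strict=False)
    # Allow only digits, commas, and hyphens in the port specification
    if not ports or not set(ports) <= set("0123456789,-"):
        raise ValueError(f"invalid port specification: {ports!r}")
    return {
        "cluster": config["cluster"],
        "taskDefinition": config["task_definition"],
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": config["subnets"],
                "securityGroups": config["security_groups"],
            }
        },
        "overrides": {
            "containerOverrides": [{
                # Assumes the container entrypoint is the scanner binary
                "name": config["container_name"],
                "command": ["-p", ports, str(network)],
            }]
        },
    }
```

With boto3, the result could be passed on as `boto3.client('ecs').run_task(**build_run_task_args(config, '10.0.0.0/24'))`. Validating the target with `ipaddress` before it reaches the container command keeps arbitrary strings out of the scanner's arguments.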

Now we’ll describe the different components of the architecture shown in Figure 1. Start by packaging your favorite tool into a container, and publish it to Amazon Elastic Container Registry (ECR). ECR provides your containers additional layers of security assurance with built-in dependency vulnerability scans.

AWS Fargate is a serverless compute engine powering Amazon ECS to orchestrate container tasks. Fargate scales up capacity to support the current load, and scales down once complete to reduce cost. By default, Fargate offers 20 GB of ephemeral storage to each ECS task for shared storage between containers as volume mounts.

Task input and output can be processed with custom code running on the serverless computing service AWS Lambda. For multi-stage Lambda functionality, you can use AWS Step Functions.

Amazon API Gateway can forward incoming requests to these Lambda functions. API Gateway provides serverless REST endpoints to handle requests processed by Lambda functions. Amazon Cognito authorizes users through API Gateway or your organization’s single sign-on (SSO) provider.

The final step of the ECS task can upload any resulting files to an Amazon S3 bucket. Amazon S3 offers industry-leading scalability, data availability, security, and performance with integration into other AWS services. This means that the results of your data can be consumed by other AWS services for processing, analytics, machine learning, and security controls.

Amazon CloudWatch Events are used to build an event-based workflow. The S3 upload initiates a CloudWatch Event, which can then invoke a Lambda function to process the file, or launch another ECS task.

This solution is completely serverless. It will scale on demand, yet cost nothing when not in use. This architecture can support anything that can be run in a container, regardless of tool function.

Network Mapper workflow

Figure 2. Network Mapper scanner task workflow

The example in Figure 2 was based on using a tool called Network Mapper, or Nmap. However, a variety of tools can be used, including nslookup/dig, Selenium, Nikto, recon-ng, SpiderFoot, Greenbone Vulnerability Manager (GVM), or OWASP ZAP. You can use anything that runs in a container! With some additional work, findings could be fed into AWS security services like AWS Security Hub, or Amazon GuardDuty. You can also use AWS Partner Network services like Splunk and Datadog, or open source frameworks like Metasploit and DefectDojo. The flexibility to add additional applications that integrate with AWS services means that this architecture can be easily deployed into a variety of AWS environments.

Remember, installing and using software not included in an AWS-supported Amazon Machine Image (AMI) or container falls on the customer side of the shared responsibility model. Make sure to do your due diligence in securing any software you decide to use in this or any workload. To reduce blast radius, run this in an isolated account and only provide least privilege access to targets.

Conclusion

In this blog post, we showed how to run a penetration testing workload on a modern platform powered by serverless and container-based services. Amazon API Gateway is the entry point for your architecture, which calls on AWS Lambda. Lambda builds a task definition to launch a fully orchestrated, on-demand container workload using AWS Fargate and Amazon ECS. The final stage of the ECS task copies the results of the scan to Amazon S3. This can be accessed by security analysts or other downstream containers, tools, or services.

We encourage you to go build this architecture in your own environment, and begin conducting your own tests! Construct your Nmap container and store it in Amazon ECR or use securecodebox/nmap, a Docker container built for the Open Web Application Security Project® (OWASP) SecureCodeBox project. Make sure to spend time securing this workload, especially when using open-source software you’re not familiar with. Now go get scanning!

Using an Amazon MQ network of broker topologies for distributed microservices

2021-12-13 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/using-an-amazon-mq-network-of-broker-topologies-for-distributed-microservices/

This post is written by Suranjan Choudhury, Senior Manager SA, and Anil Sharma, Apps Modernization SA.

This blog looks at ActiveMQ topologies that customers can evaluate when planning hybrid deployment architectures spanning AWS Regions and customer data centers, using a network of brokers. A network of brokers can have brokers on-premises and Amazon MQ brokers on AWS.

Distributing broker nodes across AWS and on-premises allows for messaging infrastructure to scale, provide higher performance, and improve reliability. This post also explains a topology spanning two Regions and demonstrates how to deploy on AWS.

A network of brokers is composed of multiple simultaneously active single-instance brokers or active/standby brokers. A network of brokers provides a large-scale messaging fabric in which multiple brokers are networked together. It allows a system to survive the failure of a broker. It also allows distributed messaging. Applications on remote, disparate networks can exchange messages with each other. A network of brokers helps to scale the overall broker throughput in your network, providing increased availability and performance.

Types of ActiveMQ topologies

A network of brokers can be configured in a variety of topologies, for example, mesh, concentrator, and hub and spoke. The topology depends on requirements such as security and network policies, reliability, scaling and throughput, and management and operational overheads. You can configure individual brokers to operate as a single broker or in an active/standby configuration.

Mesh topology

A mesh topology provides multiple brokers that are all connected to each other. This example connects three single-instance brokers, but you can configure more brokers as a mesh. The mesh topology needs subnet security group rules to be opened for allowing brokers in internal subnets to communicate with brokers in external subnets.

For scaling, it’s simpler to add new brokers to increase overall broker capacity. The mesh topology by design offers higher reliability with no single point of failure. Operationally, adding or removing nodes requires reconfiguring the brokers and restarting the broker service.

Concentrator topology

In a concentrator topology, you deploy brokers in two (or more) layers to funnel incoming connections into a smaller collection of services. This topology allows segmenting brokers into internal and external subnets without any additional security group changes. If additional capacity is needed, you can add new brokers without needing to update other brokers’ configurations. The concentrator topology provides higher reliability with alternate paths for each broker. This enables hybrid deployments with lower operational overheads.

Hub and spoke topology

A hub and spoke topology preserves messages if there is disruption to any broker on a spoke. Messages are forwarded throughout and only the central Broker1 is critical to the network’s operation. Subnet security group rules must be opened to allow brokers in internal subnets to communicate with brokers in external subnets.

Adding brokers for scalability is constrained by the hub’s capacity. Hubs are a single point of failure and should be configured as active-standby to increase reliability. In this topology, depending on the location of the hub, there may be increased bandwidth needs and latency challenges.

Using a concentrator topology for large-scale hybrid deployments

When planning deployments spanning AWS and customer data centers, the starting point is the concentrator topology. The brokers are deployed in tiers such that brokers in each tier connect to fewer brokers at the next tier. This allows you to funnel connections and messages from a large number of producers to a smaller number of brokers. This concentrates messages at fewer subscribers:

Deploying ActiveMQ brokers across Regions and on-premises

When placing brokers on-premises and in the AWS Cloud in a hybrid network of broker topologies, security and network routing are key. The following diagram shows a typical hybrid topology:

On-premises ActiveMQ brokers are placed behind a firewall. They communicate with Amazon MQ brokers through an IPsec tunnel that terminates on the on-premises firewall. On the AWS side, this tunnel terminates on an AWS Transit Gateway (TGW). The TGW routes all network traffic to a firewall in a service VPC on AWS.

The firewall inspects the network traffic and routes all inspected traffic back to the transit gateway. Based on the configured routing, the TGW sends the traffic to the Amazon MQ broker in the application VPC. This broker concentrates messages from Amazon MQ brokers hosted on AWS. The on-premises brokers and the AWS brokers form a hybrid network of brokers that spans AWS and the customer data center. This allows applications and services to communicate securely. This architecture exposes only the concentrating broker, which receives messages from and sends messages to the on-premises broker. The applications are protected from outside, non-validated network traffic.

This blog shows how to create a cross-Region network of brokers. For simplicity, this topology omits the multiple brokers in the internal subnet. In a production environment, however, you would have multiple brokers in internal subnets serving multiple producers and consumers. This topology spans an AWS Region and an on-premises customer data center, which is represented by a second AWS Region:

Best practices for configuring network of brokers

Client-side failover

In a network of brokers, failover transport configures a reconnect mechanism on top of the transport protocols. The configuration allows you to specify multiple URIs to connect to. An additional configuration using the randomize transport option allows for random selection of the URI when re-establishing a connection.

The example Lambda functions provided in this blog use the following configuration:

//Failover URI
failoverURI = "failover:(" + uri1 + "," + uri2 + ")?randomize=True";

Broker side failover

Dynamic failover allows a broker to receive a list of all other brokers in the network. It can use the configuration to update producer and consumer clients with this list. The clients can update to rebalance connections to these brokers.

In the broker configuration in this blog, the following configuration is set up:

<transportConnectors>
  <transportConnector name="openwire" updateClusterClients="true" updateClusterClientsOnRemove="false" rebalanceClusterClients="true"/>
</transportConnectors>

Network connector properties – TTL and duplex

TTL values limit how many brokers messages and consumer subscriptions can traverse in the network. There are two TTL values: messageTTL and consumerTTL. Alternatively, you can set networkTTL, which sets both the message and consumer TTL.

The duplex option allows for creating a bidirectional path between two brokers for sending and receiving messages. This blog uses the following configuration:

<networkConnector name="connector_1_to_3" networkTTL="5" uri="static:(ssl://xxxxxxxxx.mq.us-east-2.amazonaws.com:61617)" userName="MQUserName"/>

Connection pooling for producers

In the example Lambda function, a pooled connection factory object is created to optimize connections to the broker:

// Create a conn factory

final ActiveMQSslConnectionFactory connFacty = new ActiveMQSslConnectionFactory(failoverURI);
connFacty.setConnectResponseTimeout(10000);
return connFacty;

// Create a pooled conn factory

final PooledConnectionFactory pooledConnFacty = new PooledConnectionFactory();
pooledConnFacty.setMaxConnections(10);
pooledConnFacty.setConnectionFactory(connFacty);
return pooledConnFacty;

Deploying the example solution

1. Create an IAM role for Lambda by following the steps at https://github.com/aws-samples/aws-mq-network-of-brokers#setup-steps.
2. Create the network of brokers in the first Region. Navigate to the CloudFormation console and choose Create stack:
3. Provide the parameters for the network configuration section:
4. In the Amazon MQ configuration section, configure the following parameters. Ensure that these two parameter values are the same in both Regions.
5. Configure the following in the Lambda configuration section. Deploy mqproducer and mqconsumer in two separate Regions:
6. Create the network of brokers in the second Region. Repeat step 2 to create the network of brokers in the second Region. Ensure that the VPC CIDR in region2 is different than the one in region1. Ensure that the user name and password are the same as in the first Region.
7. Complete VPC peering and update the route tables:
   1. Follow the steps here to complete VPC peering between the two VPCs.
   2. Update the route tables in both VPCs.
   3. Enable DNS resolution for the peering connection.
8. Configure the network of brokers and create network connectors:
   1. In region1, choose Broker3. In the Connections section, copy the endpoint for the openwire protocol.
   2. In region2 on broker3, set up the network of brokers using the networkConnector configuration element.
   3. Edit the configuration revision and add a new NetworkConnector within the NetworkConnectors section. Replace the uri with the URI for the broker3 in region1.
```
<networkConnector name="broker3inRegion2_to_broker3inRegion1" duplex="true" networkTTL="5" userName="MQUserName" uri="static:(ssl://b-123ab4c5-6d7e-8f9g-ab85-fc222b8ac102-1.mq.ap-south-1.amazonaws.com:61617)" />
```

9. Send a test message using the mqProducer Lambda function in region1. Invoke the producer Lambda function:

aws lambda invoke --function-name mqProducer out --log-type Tail --query 'LogResult' --output text | base64 -d

10. Receive the test message. In region2, invoke the consumer Lambda function:

aws lambda invoke --function-name mqConsumer out --log-type Tail --query 'LogResult' --output text | base64 -d

The message receipt confirms that the message has crossed the network of brokers from region1 to region2.

Cleaning up

To avoid incurring ongoing charges, delete all the resources by following the steps at https://github.com/aws-samples/aws-mq-network-of-brokers#clean-up.

Conclusion

This blog explains the choices when designing a cross-Region or a hybrid network of brokers architecture that spans AWS and your data centers. The example starts with a concentrator topology and enhances that with a cross-Region design to help address network routing and network security requirements.

The blog provides a template that you can modify to suit specific network and distributed application scenarios. It also covers best practices when architecting and designing failover strategies for a network of brokers or when developing producers and consumers client applications.

The Lambda functions used as producer and consumer applications demonstrate best practices in designing and developing ActiveMQ clients. This includes storing parameters, such as passwords, in AWS Systems Manager and retrieving them from there.

For more serverless learning resources, visit Serverless Land.

Introducing Amazon Simple Queue Service dead-letter queue redrive to source queues

2021-12-02 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/introducing-amazon-simple-queue-service-dead-letter-queue-redrive-to-source-queues/

This blog post is written by Mark Richman, a Senior Solutions Architect for SMB.

Today AWS is launching a new capability to enhance the dead-letter queue (DLQ) management experience for Amazon Simple Queue Service (SQS). DLQ redrive to source queues allows SQS to manage the lifecycle of unconsumed messages stored in DLQs.

SQS is a fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications. Using Amazon SQS, you can send, store, and receive messages between software components at any volume without losing messages or requiring other services to be available.

To use SQS, a producer sends messages to an SQS queue, and a consumer pulls the messages from the queue. Sometimes, messages can’t be processed due to a number of possible issues. These can include logic errors in consumers that cause message processing to fail, network connectivity issues, or downstream service failures. This can result in unconsumed messages remaining in the queue.

Understanding SQS dead-letter queues (DLQs)

SQS allows you to manage the life cycle of the unconsumed messages using dead-letter queues (DLQs).

A DLQ is a separate SQS queue to which one or many source queues can send messages that can’t be processed or consumed. DLQs allow you to debug your application by letting you isolate messages that can’t be processed correctly to determine why their processing didn’t succeed. Use a DLQ to handle message consumption failures gracefully.

When you create a source queue, you can specify a DLQ and the condition under which SQS moves messages from the source queue to the DLQ. This is called the redrive policy. The redrive policy condition specifies the maxReceiveCount. When a producer places messages on an SQS queue, the ReceiveCount tracks the number of times a consumer tries to process the message. When the ReceiveCount for a message exceeds the maxReceiveCount for a queue, SQS moves the message to the DLQ. The original message ID is retained.

For example, a source queue has a redrive policy with maxReceiveCount set to 5. If the consumer of the source queue receives a message 6 times without successfully consuming it, SQS moves the message to the dead-letter queue.
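
The redrive policy is stored on the source queue as a JSON-encoded queue attribute named `RedrivePolicy`. The following sketch builds that attribute and mirrors the threshold rule described above; the function names and the example ARN are hypothetical.

```python
import json


def redrive_policy_attributes(dlq_arn, max_receive_count=5):
    """Build the queue attributes that attach a DLQ to a source queue."""
    return {
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": str(max_receive_count),
        })
    }


def moves_to_dlq(receive_count, max_receive_count=5):
    """True once a message has been received more often than the limit allows."""
    return receive_count > max_receive_count


# Hypothetical ARN, used for illustration only
attributes = redrive_policy_attributes("arn:aws:sqs:us-east-1:123456789012:MyDLQ")
print([moves_to_dlq(n) for n in (5, 6)])  # [False, True]
```

With boto3, the attributes could be applied with `sqs.set_queue_attributes(QueueUrl=source_queue_url, Attributes=attributes)`, where `sqs` is an SQS client and `source_queue_url` is the URL of your source queue.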

You can configure an alarm to alert you when any messages are delivered to a DLQ. You can then examine logs for exceptions that might have caused them to be delivered to the DLQ. You can analyze the message contents to diagnose consumer application issues. Once the issue has been resolved and the consumer application recovers, these messages can be redriven from the DLQ back to the source queue to process them successfully.
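
One way to set up such an alarm is a CloudWatch alarm on the DLQ's `ApproximateNumberOfMessagesVisible` metric. The sketch below only builds the arguments for the CloudWatch PutMetricAlarm call; the function name, the alarm name pattern, and the one-minute period are assumptions to adapt to your own setup.

```python
def dlq_alarm_args(dlq_name, sns_topic_arn=None):
    """Build PutMetricAlarm arguments that fire when the DLQ holds any message."""
    args = {
        "AlarmName": f"{dlq_name}-has-messages",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": dlq_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
    }
    # Optionally notify an SNS topic when the alarm fires
    if sns_topic_arn:
        args["AlarmActions"] = [sns_topic_arn]
    return args
```

With boto3, this could be applied as `boto3.client('cloudwatch').put_metric_alarm(**dlq_alarm_args('MyDLQ'))`.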

Previously, this required dedicated operational cycles to review and redrive these messages back to their source queue.

DLQ redrive to source queues

DLQ redrive to source queues enables SQS to manage the second part of the lifecycle of unconsumed messages that are stored in DLQs. Once the consumer application is available to consume the failed messages, you can now redrive the messages from the DLQ back to the source queue. You can optionally review a sample of the available messages in the DLQ. You redrive the messages using the Amazon SQS console. This allows you to more easily recover from application failures.

Using redrive to source queues

To show how to use the new functionality there is an existing standard source SQS queue called MySourceQueue.

SQS does not create DLQs automatically. You must first create an SQS queue and then use it as a DLQ. The DLQ must be in the same Region as the source queue.

Create DLQ

Navigate to the SQS Management Console and create a standard SQS queue for the DLQ called MyDLQ. Use the default configuration. Refer to the SQS documentation for instructions on creating a queue.
Navigate to MySourceQueue and choose Edit.
Navigate to the Dead-letter queue section and choose Enabled.
Select the Amazon Resource Name (ARN) of the MyDLQ queue you created previously.
You can configure the number of times that a message can be received before being sent to a DLQ by setting Maximum receives to a value between 1 and 1,000. For this demo, enter a value of 1 to immediately drive messages to the DLQ.
Choose Save.

Configure source queue with DLQ

The console displays the Details page for the queue. Within the Dead-letter queue tab, you can see the Maximum receives value and DLQ ARN.

DLQ configuration

Send and receive test messages

You can send messages to test the functionality in the SQS console.

Navigate to MySourceQueue and choose Send and receive messages.
Send a number of test messages by entering the message content in Message body and choosing Send message.

Send and receive messages

Navigate to the Receive messages section where you can see the number of messages available.
Choose Poll for messages. The Maximum message count is set to 10 by default. If you sent more than 10 test messages, poll multiple times to receive all the messages.

Poll for messages

All the received messages are sent to the DLQ because the maxReceiveCount is set to 1. At this stage you would normally review the messages. You would determine why their processing didn’t succeed and resolve the issue.

Redrive messages to source queue

Navigate to the list of all queues and filter if required to view the DLQ. The queue displays the approximate number of messages available in the DLQ. For standard queues, the result is approximate because of the distributed architecture of SQS. In most cases, the count should be close to the actual number of messages in the queue.
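
The same approximate count is also available programmatically through the `ApproximateNumberOfMessages` queue attribute. A minimal sketch, assuming `sqs_client` is a boto3 SQS client and `queue_url` is the URL of your DLQ (the function name is hypothetical):

```python
def approximate_message_count(sqs_client, queue_url):
    """Return the approximate number of messages available in a queue."""
    response = sqs_client.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    # SQS returns attribute values as strings
    return int(response["Attributes"]["ApproximateNumberOfMessages"])
```

As with the console, treat the result as an estimate rather than an exact count.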

Messages available in DLQ

Select the DLQ and choose Start DLQ redrive.

DLQ redrive

SQS allows you to redrive messages either to their source queue(s) or to a custom destination queue.

Choose to Redrive to source queue(s), which is the default.

Redrive has two velocity control settings.

System optimized sends messages back to the source queue as fast as possible
Custom max velocity allows SQS to redrive messages with a custom maximum rate of messages per second. This feature is useful for minimizing the impact to normal processing of messages in the source queue.

You can optionally inspect messages prior to redrive.

To redrive the messages back to the source queue, choose DLQ redrive.

DLQ redrive

The Dead-letter queue redrive status panel shows the status of the redrive and percentage processed. You can refresh the display or cancel the redrive.

Dead-letter queue redrive status

Once the redrive is complete, which takes a few seconds in this example, the status reads Successfully completed.

Redrive status completed

Navigate back to the source queue and you can see all the messages are redriven back from the DLQ to the source queue.

Messages redriven from DLQ to source queue

Conclusion

Dead-letter queue redrive to source queues allows you to effectively manage the life cycle of unconsumed messages stored in dead-letter queues. You can build applications with the confidence that you can easily examine unconsumed messages, recover from errors, and reprocess failed messages.

You can redrive messages from their DLQs to their source queues using the Amazon SQS console.

Dead-letter queue redrive to source queues is available in all commercial regions, and coming soon to GovCloud.

To get started, visit https://aws.amazon.com/sqs/

For more serverless learning resources, visit Serverless Land.

Introducing Amazon Redshift Serverless – Run Analytics At Any Scale Without Having to Manage Data Warehouse Infrastructure

2021-11-30 Danilo Poccia

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/introducing-amazon-redshift-serverless-run-analytics-at-any-scale-without-having-to-manage-infrastructure/

We’re seeing the use of data analytics expanding among new audiences within organizations, for example with users like developers and line of business analysts who don’t have the expertise or the time to manage a traditional data warehouse. Also, some customers have variable workloads with unpredictable spikes, and it can be very difficult for them to constantly manage capacity.

With Amazon Redshift, you use SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes. Today, I am happy to introduce the public preview of Amazon Redshift Serverless, a new capability that makes it super easy to run analytics in the cloud with high performance at any scale. Just load your data and start querying. There is no need to set up and manage clusters. You pay for the duration in seconds when your data warehouse is in use, for example, while you are querying or loading data. There is no charge when your data warehouse is idle.

Amazon Redshift Serverless automatically provisions the right compute resources for you to get started. As your demand evolves with more concurrent users and new workloads, your data warehouse scales seamlessly and automatically to adapt to the changes. You can optionally specify the base data warehouse size to have additional control on cost and application-specific SLAs.

With the new serverless option, you can continue to query data in other AWS data stores, such as Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Aurora and Amazon Relational Database Service (RDS) databases.

Amazon Redshift Serverless is ideal when it is difficult to predict compute needs such as variable workloads, periodic workloads with idle time, and steady-state workloads with spikes. This approach is also a good fit for ad-hoc analytics needs that need to get started quickly and for test and development environments.

Let’s see how this works in practice.

Using Amazon Redshift Serverless
I go to the Amazon Redshift console and choose the new serverless option. The first time, I set up the serverless endpoint and configure networking and security.

I confirm the default settings that use all subnets in my default Amazon Virtual Private Cloud (VPC) and its default security group. Data is always encrypted, and I use the default AWS-owned key. Optionally, I can customize all settings. I can associate now or later the AWS Identity and Access Management (IAM) roles to give permissions to access other AWS resources, for example, to be able to load data from an S3 bucket. The configuration of the serverless endpoint will be shared by all my serverless data warehouses in the same AWS account and Region.

To query data, I use Amazon Redshift Query Editor V2, a new free web-based tool that we made available a few months back. The query editor provides quick access to a few sample datasets to make it easy to learn Amazon Redshift’s SQL capabilities: TPC-H, TPC-DS, and tickit, a dataset containing information on ticket sales for events.

For a quick test, I use the tickit sample dataset so I don’t need to load any data. I prepare a query to get the list of tickets sold per date, sorted to see the dates with more sales first:

SELECT caldate, sum(qtysold) as sumsold
FROM   tickit.sales, tickit.date
WHERE  sales.dateid = date.dateid 
GROUP BY caldate
ORDER BY sumsold DESC;

By using the web-based query editor, I don’t need to configure a SQL client or set up the network permissions to reach the serverless endpoint. Instead, I just write my SQL query and run it.

I am a visual person. I enable the Chart option on the right of the result table and select a bar chart.

Satisfied with the clarity of the chart, I export it as an image file. In this way, I can quickly share it or include it in a report.

Amazon Redshift Serverless supports all rich SQL functionality of Amazon Redshift such as semi-structured data support. I can use any JDBC/ODBC-compliant tool or the Amazon Redshift Data API to query my data. To migrate data, I can take a snapshot of an Amazon Redshift provisioned cluster and restore it as serverless. Then, I just need to update my SQL applications to use the new serverless endpoint.

Availability and Pricing
Amazon Redshift Serverless is available in public preview in the following AWS Regions: US East (N. Virginia), US West (N. California, Oregon), Europe (Frankfurt, Ireland), Asia Pacific (Tokyo).

With Amazon Redshift Serverless, you pay separately for the compute and storage you use. Compute capacity is measured in Redshift Processing Units (RPUs), and you pay for the workloads in RPU-hours with per-second billing. For storage, you pay for data stored in Amazon Redshift-managed storage and storage used for snapshots, similar to what you’d pay with a provisioned cluster using RA3 instances.
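
The per-second billing described above can be sketched as simple arithmetic. This is only an illustration: the RPU count, the active time, and especially the price per RPU-hour are hypothetical placeholders, not published rates.

```python
def rpu_compute_cost(rpus: int, seconds_active: int, price_per_rpu_hour: float) -> float:
    """Compute charge for a workload billed per second in RPU-hours."""
    rpu_hours = rpus * seconds_active / 3600
    return rpu_hours * price_per_rpu_hour


# Hypothetical numbers for illustration only: 32 RPUs active for 15 minutes
# at a placeholder rate of 0.50 per RPU-hour (see the pricing page for real rates).
cost = rpu_compute_cost(rpus=32, seconds_active=15 * 60, price_per_rpu_hour=0.50)
idle_cost = rpu_compute_cost(rpus=32, seconds_active=0, price_per_rpu_hour=0.50)
print(f"active: ${cost:.2f}, idle: ${idle_cost:.2f}")
```

Because the charge scales with active seconds, an idle data warehouse contributes nothing to the compute portion of the bill.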

To control your costs, you can specify usage limits and define actions that Amazon Redshift automatically takes if those limits are reached. You can specify usage limits in RPU-hours and associate them with a daily, weekly, or monthly duration. Setting higher usage limits can improve the overall throughput of the system, especially for workloads that need to handle high concurrency while maintaining consistently high performance.

Compute resources automatically shut down behind the scenes when there is no activity and resume when you are loading data or there are queries coming in. When accessing your S3 data lake via the new serverless endpoint, you do not pay for Amazon Redshift Spectrum separately. You have a unified serverless experience and also pay for data lake queries in RPU-seconds. For more information, see the Amazon Redshift pricing page.

The serverless endpoint is configured at the AWS account level. If you have multiple teams or projects and want to manage costs separately, you can use separate AWS accounts. You can share data between your provisioned clusters and serverless endpoint and between serverless endpoints across accounts.

To help you get practice, we provide you upfront with $500 in AWS credits to try the Amazon Redshift Serverless public preview. You get the credits when you first create a database with Amazon Redshift Serverless. These credits are used to cover your costs for compute, storage, and snapshot usage of Amazon Redshift Serverless only.

Start using Amazon Redshift Serverless today to run and scale analytics without having to provision and manage data warehouse clusters.

— Danilo

Filtering event sources for AWS Lambda functions

2021-11-26 Benjamin Smith

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/filtering-event-sources-for-aws-lambda-functions/

This post is written by Heeki Park, Principal Specialist Solutions Architect – Serverless.

When an AWS Lambda function is configured with an event source, the Lambda service triggers a Lambda function for each message or record. The exact behavior depends on the choice of event source and the configuration of the event source mapping. The event source mapping defines how the Lambda service handles incoming messages or records from the event source.

Today, AWS announces the ability to filter messages before the invocation of a Lambda function. Filtering is supported for the following event sources: Amazon Kinesis Data Streams, Amazon DynamoDB Streams, and Amazon SQS. This helps reduce requests made to your Lambda functions, may simplify code, and can reduce overall cost.

Overview

Consider a logistics company with a fleet of vehicles in the field. Each vehicle is enabled with sensors and 4G/5G connectivity to emit telemetry data into Kinesis Data Streams:

In one scenario, they use machine learning models to infer the health of vehicles based on each payload of telemetry data, which is outlined in example 2 on the Lambda pricing page.
In another scenario, they want to invoke a function, but only when tire pressure is low on any of the tires.

If tire pressure is low, the company notifies the maintenance team to check the tires when the vehicle returns. The process checks if the warehouse has enough spare replacements. Optionally, it notifies the purchasing team to buy additional tires.

The application responds to the stream of incoming messages and runs business logic if tire pressure is below 32 psi. Each vehicle in the field emits telemetry as follows:

{
    "time": "2021-11-09 13:32:04",
    "fleet_id": "fleet-452",
    "vehicle_id": "a42bb15c-43eb-11ec-81d3-0242ac130003",
    "lat": 47.616226213162406,
    "lon": -122.33989110734133,
    "speed": 43,
    "odometer": 43519,
    "tire_pressure": [41, 40, 31, 41],
    "weather_temp": 76,
    "weather_pressure": 1013,
    "weather_humidity": 66,
    "weather_wind_speed": 8,
    "weather_wind_dir": "ne"
}

To process all messages from a fleet of vehicles, you configure a filter matching the fleet id in the following example. The Lambda service applies the filter pattern against the full payload that it receives.

The schema of the payload for Kinesis and DynamoDB Streams is shown under the “kinesis” attribute in the example Kinesis record event. When building filters for Kinesis or DynamoDB Streams, you filter the payload under the “data” attribute. The schema of the payload for SQS is shown in the array of records in the example SQS message event. When working with SQS, you filter the payload under the “body” attribute:

{
    "data": {
        "fleet_id": ["fleet-452"]
    }
}

To process all messages associated with a specific vehicle, configure a filter on only that vehicle id. The fleet id is kept in the example to show that it matches on both of those filter criteria:

{
    "data": {
        "fleet_id": ["fleet-452"],
        "vehicle_id": ["a42bb15c-43eb-11ec-81d3-0242ac130003"]
    }
}

To process all messages associated with that fleet but only if tire pressure is below 32 psi, you configure the following rule pattern. This pattern searches the array under tire_pressure to match values less than 32:

{
    "data": {
        "fleet_id": ["fleet-452"],
        "tire_pressure": [{"numeric": ["<", 32]}]
    }
}
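
To make the matching behavior concrete, here is a minimal sketch of how such a pattern could be evaluated against a decoded payload. It is an illustration only, not the Lambda service's implementation: the function names are hypothetical, and it covers just exact-value rules and single numeric comparisons, a small subset of the supported pattern syntax.

```python
import operator

_OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
        ">=": operator.ge, "=": operator.eq}


def _value_matches(rule, value):
    """Check one payload value against one rule (a literal or a numeric rule)."""
    if isinstance(rule, dict) and "numeric" in rule:
        op, bound = rule["numeric"]  # only single comparisons in this sketch
        return isinstance(value, (int, float)) and _OPS[op](value, bound)
    return rule == value


def matches(pattern, payload):
    """Return True when every key in the pattern matches the payload."""
    for key, rules in pattern.items():
        if key not in payload:
            return False
        value = payload[key]
        if isinstance(rules, dict):  # nested object: recurse into it
            if not isinstance(value, dict) or not matches(rules, value):
                return False
            continue
        values = value if isinstance(value, list) else [value]
        # A key matches when any payload value satisfies any listed rule.
        if not any(_value_matches(r, v) for r in rules for v in values):
            return False
    return True


pattern = {"data": {"fleet_id": ["fleet-452"],
                    "tire_pressure": [{"numeric": ["<", 32]}]}}
low = {"data": {"fleet_id": "fleet-452", "tire_pressure": [41, 40, 31, 41]}}
ok = {"data": {"fleet_id": "fleet-452", "tire_pressure": [41, 40, 41, 41]}}
print(matches(pattern, low), matches(pattern, ok))
```

The first payload matches because one of its four tire readings (31) is below 32; the second does not, so under this filter it would never reach the function.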

To create the event source mapping with this filter criteria using the AWS CLI, run the following command.

aws lambda create-event-source-mapping \
--function-name fleet-tire-pressure-evaluator \
--batch-size 100 \
--starting-position LATEST \
--event-source-arn arn:aws:kinesis:us-east-1:123456789012:stream/fleet-telemetry \
--filter-criteria '{"Filters": [{"Pattern": "{\"data\": {\"tire_pressure\": [{\"numeric\": [\"<\", 32]}]}}"}]}'

For the CLI, the value for Pattern in the filter criteria requires the double quotes to be escaped in order to be properly captured.
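
Escaping the quotes by hand is error-prone, so one option is to generate the argument. The sketch below serializes the tire-pressure pattern (under the "data" attribute, as described earlier) twice with Python's json module; the variable names are illustrative.

```python
import json

pattern = {"data": {"tire_pressure": [{"numeric": ["<", 32]}]}}

# Pattern must be a JSON string embedded inside the filter criteria JSON,
# so it is serialized twice; the second pass adds the escaped quotes.
filter_criteria = {"Filters": [{"Pattern": json.dumps(pattern)}]}
cli_argument = json.dumps(filter_criteria)
print(cli_argument)
```

The printed string can be pasted between single quotes after --filter-criteria. When using boto3 instead of the CLI, the filter_criteria dictionary can be passed as the FilterCriteria argument of create_event_source_mapping.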

Alternatively, to create the event source mapping with this filter criteria with an AWS Serverless Application Model (AWS SAM) template, use the following snippet.

Events: 
  TirePressureEvent: 
    Type: Kinesis    
    Properties: 
      BatchSize: 100
      StartingPosition: LATEST
      Stream: "arn:aws:kinesis:us-east-1:123456789012:stream/fleet-telemetry"
      Filters: 
        - Pattern: '{"data": {"tire_pressure": [{"numeric": ["<", 32]}]}}'

For the AWS SAM template, the value for Pattern in the filter criteria does not require escaped double quotes when the pattern is wrapped in single quotes.

For more information on how to create filters, refer to examples of event pattern rules in EventBridge, as Lambda filters messages in the same way.

Reducing costs with event filtering

By configuring the event source with this filter criteria, you can reduce the number of messages that are used to invoke your Lambda function.

Using the example from the Lambda pricing page, with a fleet of 10,000 vehicles in the field, each is emitting telemetry once an hour. Each month, the vehicles emit 10,000 * 24 * 31 = 7,440,000 messages, which trigger the same number of Lambda invocations. You configure the function with 256 MB of memory and the average duration of the function is 100 ms. In this example, vehicles emit low-pressure telemetry once every 31 days.

Without filtering, the cost of the application is:

Monthly request charges → 7.44M * $0.20/million = $1.49
Monthly compute duration (seconds) → 7.44M * 0.1 seconds = 0.744M seconds
Monthly compute (GB-s) → 256MB/1024MB * 0.744M seconds = 0.186M GB-s
Monthly compute charges → 0.186M GB-s * $0.0000166667 = $3.10
Monthly total charges = $1.49 + $3.10 = $4.59

With filtering, the cost of the application is:

Monthly request charges → (7.44M / 31)* $0.20/million = $0.05
Monthly compute duration (seconds) → (7.44M / 31) * 0.1 seconds = 0.024M seconds
Monthly compute (GB-s) → 256MB/1024MB * 0.024M seconds = 0.006M GB-s
Monthly compute charges → 0.006M GB-s * $0.0000166667 = $0.10
Monthly total charges = $0.05 + $0.10 = $0.15

By using filtering, the cost is reduced from $4.59 to $0.15, a 96.7% cost reduction.
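
The arithmetic above can be reproduced with a few lines of Python. This sketch uses only the prices and workload figures quoted in this post and ignores any free tier; the function and constant names are illustrative.

```python
REQUEST_PRICE_PER_MILLION = 0.20            # USD, as used in the calculation above
COMPUTE_PRICE_PER_GB_SECOND = 0.0000166667  # USD, as used in the calculation above


def monthly_cost(invocations, memory_mb=256, duration_s=0.1):
    """Request charges plus compute charges for one month."""
    requests = invocations / 1_000_000 * REQUEST_PRICE_PER_MILLION
    gb_seconds = memory_mb / 1024 * invocations * duration_s
    return requests + gb_seconds * COMPUTE_PRICE_PER_GB_SECOND


messages = 10_000 * 24 * 31              # one message per vehicle per hour
unfiltered = monthly_cost(messages)
filtered = monthly_cost(messages / 31)   # one low-pressure message per vehicle
reduction = 1 - filtered / unfiltered
print(f"${unfiltered:.2f} -> ${filtered:.2f} ({reduction:.1%} lower)")
```

Computed without intermediate rounding, the reduction is 30/31, or about 96.8%; the 96.7% figure above comes from dividing the rounded dollar amounts.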

Designing and implementing event filtering

In addition to reducing cost, the functions now operate more efficiently. This is because they no longer iterate through arrays of messages to filter out messages. The Lambda service filters the messages that it receives from the source before batching and sending them as the payload for the function invocation. This is the order of operations:

Event flow with filtering

As you design filter criteria, keep in mind a few additional properties. The event source mapping allows up to five patterns. Each pattern can be up to 2048 characters. As the Lambda service receives messages and filters them with the pattern, it fills the batch per the normal event source behavior.

For example, if the maximum batch size is set to 100 records and the maximum batching window is set to 10 seconds, the Lambda service filters and accumulates records in a batch until one of those two conditions is satisfied. In the case where 100 records that meet the filter criteria come during the batching window, the Lambda service triggers a function with those filtered 100 records in the payload.

If fewer than 100 records meeting the filter criteria arrive during the batch window, Lambda triggers a function with the filtered records that came during the batch window at the end of the 10-second batch window. Be sure to configure the batch window to match your latency requirements.
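
The interaction between filtering, batch size, and batching window can be sketched with a small simulation. This is a simplified model under stated assumptions, not the service's actual algorithm: it assumes the window starts when the first matching record is added to a batch, and batch_filtered and keep are hypothetical names.

```python
def batch_filtered(records, keep, batch_size=100, window_s=10.0):
    """Yield batches of payloads that pass `keep`.

    `records` is an iterable of (arrival_time_seconds, payload) in time order.
    """
    batch, window_start = [], None
    for arrival, payload in records:
        # Close the open batch if its window expired before this record arrived.
        if batch and arrival - window_start >= window_s:
            yield batch
            batch, window_start = [], None
        if not keep(payload):
            continue  # filtered-out records never reach the function
        if not batch:
            window_start = arrival
        batch.append(payload)
        if len(batch) == batch_size:
            yield batch
            batch, window_start = [], None
    if batch:
        yield batch


# One reading per second for 25 seconds; every other reading is below 32 psi.
readings = [(t, 30 if t % 2 == 0 else 40) for t in range(25)]
by_window = [len(b) for b in batch_filtered(readings, lambda psi: psi < 32)]
by_size = [len(b) for b in batch_filtered(readings, lambda psi: psi < 32, batch_size=3)]
print(by_window, by_size)
```

With the default batch size of 100, batches are released by the 10-second window and contain only the matching records. With a batch size of 3, the size limit is reached first.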

The Lambda service ignores filtered messages and treats them as successfully processed. For Kinesis Data Streams and DynamoDB Streams, the iterator advances past the records that were sent via the event source mapping.

For SQS, the messages are deleted from the queue without any additional processing. With SQS, be sure that the messages that are filtered out are not required. For example, you have an Amazon SNS topic with multiple SQS queues subscribed. The Lambda functions consuming each of those SQS queues process different subsets of messages. You could use filters on SNS but that would require the message publisher to add attributes to the messages that it sends. You could instead use filters on the event source mapping for SQS. Now the publisher does not need to make any changes, as the filter is applied on the messages payload directly.

Conclusion

Lambda now supports the ability to filter messages based on criteria that you define. This can reduce the number of messages that your functions process, may reduce cost, and can simplify code.

You can now build applications for specific use cases that use only a subset of the messages that flow through your event-driven architectures. This can help optimize the compute efficiency of your functions.

Learn more about this capability in our AWS Lambda Developer Guide.

Visualizing AWS Step Functions workflows from the Amazon Athena console

2021-11-23 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/visualizing-aws-step-functions-workflows-from-the-amazon-athena-console/

This post is written by Dhiraj Mahapatro, Senior Specialist SA, Serverless.

In October 2021, AWS announced visualizing AWS Step Functions from the AWS Batch console. Now you can also visualize Step Functions from the Amazon Athena console.

Amazon Athena is an interactive query service that makes it easier to analyze Amazon S3 data using standard SQL. Athena is a serverless service and can interact directly with data stored in S3. Athena can process unstructured, semistructured, and structured datasets.

AWS Step Functions is a low-code visual workflow service used to orchestrate AWS services, automate business processes, and build serverless applications. Step Functions workflows manage failures, retries, parallelization, service integrations, and observability so builders can focus on business logic. Athena is one of the service integrations that are available for Step Functions.

This blog walks through Step Functions integration in the Amazon Athena console. It shows how you can visualize and operate Athena queries at scale using Step Functions.

Introducing workflow orchestration in Amazon Athena console

AWS customers store large amounts of historical data on S3 and query the data using Athena to get results quickly. They also use Athena to process unstructured data or analyze structured data as part of a data processing pipeline.

Data processing involves discrete steps for ingesting, processing, storing the transformed data, and post-processing, such as visualizing or analyzing the transformed data. Each step involves multiple AWS services. With Step Functions workflow integration, you can orchestrate these steps. This helps to create repeatable and scalable data processing pipelines as part of a larger business application and visualize the workflows in the Athena console.

With Step Functions, you can run queries on a schedule or based on an event by using Amazon EventBridge. You can poll long-running Athena queries before moving to the next step in the process, and handle errors without writing custom code. Combining these two services provides developers with a single method that is scalable and repeatable.

Step Functions workflows in the Amazon Athena console allow orchestration of Athena queries with Step Functions state machines:

Using Athena query patterns from Step Functions

Execute multiple queries

In Athena, you run SQL queries in the Athena console against Athena workgroups. With Step Functions, you can run Athena queries in a sequence or run independent queries in parallel using a parallel state. Step Functions also natively handles errors and retries related to Athena query tasks.

Workflow orchestration in the Athena console provides these capabilities to run and visualize multiple queries in Step Functions. For example:

Choose Get Started from Execute multiple queries.
From the pop-up, choose Create your own workflow and select Continue.

A new browser tab opens with the Step Functions Workflow Studio. The designer shows a workflow pattern template pre-created. The workflow loads data from a data source running two Athena queries in parallel. The results are then published to an Amazon SNS topic.
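
For readers who prefer to see the shape of such a workflow as text, the following sketch builds a comparable Amazon States Language definition as a Python dictionary. It is not the template that the console generates: the state names, queries, workgroup, and topic ARN are hypothetical placeholders.

```python
import json


def athena_query_state(query, workgroup):
    """Task state that runs one Athena query and waits for it to finish."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
        "Parameters": {"QueryString": query, "WorkGroup": workgroup},
        "End": True,
    }


definition = {
    "StartAt": "RunQueriesInParallel",
    "States": {
        "RunQueriesInParallel": {
            "Type": "Parallel",
            "Branches": [
                {"StartAt": "QueryA", "States": {"QueryA": athena_query_state(
                    "SELECT count(*) FROM example_db.table_a", "example-workgroup")}},
                {"StartAt": "QueryB", "States": {"QueryB": athena_query_state(
                    "SELECT count(*) FROM example_db.table_b", "example-workgroup")}},
            ],
            "Next": "PublishResults",
        },
        "PublishResults": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:example-topic",
                # Serialize the combined branch output into the message body.
                "Message.$": "States.JsonToString($)",
            },
            "End": True,
        },
    },
}

asl_json = json.dumps(definition, indent=2)
print(asl_json)
```

The .sync suffix on the Athena resource makes each branch wait for its query to complete before the Parallel state moves on to the notification step.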

Alternatively, choosing Deploy a sample project from the Get Started pop-up deploys a sample Step Functions workflow.

This option creates a state machine. You then review the workflow definition, deploy an AWS CloudFormation stack, and run the workflow in the Step Functions console.

Once deployed, the state machine is visible in the Step Functions console as:

Select the AthenaMultipleQueriesStateMachine to land on the details page:

The CloudFormation template provisions the required AWS Glue database, S3 bucket, an Athena workgroup, the required AWS Lambda functions, and the SNS topic for query results notification.

To see the Step Functions workflow in action, choose Start execution. Keep the optional name and input and choose Start execution:

The state machine completes the tasks successfully by Executing multiple queries in parallel using Amazon Athena and Sending query results using the SNS topic:

The state machine used the Amazon Athena StartQueryExecution and GetQueryResults tasks. The Workflow orchestration in Athena console now highlights this newly created Step Functions state machine:

Any state machine that uses this task in Step Functions in this account is listed here as a state machine that orchestrates Athena queries.

Query large datasets

You can also ingest an extensive dataset in Amazon S3, partition it using AWS Glue crawlers, then run Amazon Athena queries against that partition.

Select Get Started from the Query large datasets pop-up, then choose Create your own workflow and Continue. This action opens the Step Functions Workflow studio with the following pattern. The Glue crawler starts and partitions large datasets for Athena to query in the subsequent query execution task:

Step Functions allows you to combine Glue crawler tasks and Athena queries to partition where necessary before querying and publishing the results.

Keeping data up to date

You can also use Athena to query a target table to fetch data, then update it with new data from other sources using Step Functions’ choice state. The choice state in Step Functions provides branching logic for a state machine.

You are not limited to the previous three patterns shown in workflow orchestration in the Athena console. You can start from scratch and build a Step Functions state machine by navigating to the bottom right and using Create state machine:

Create State Machine in the Athena console opens a new tab showing the Step Functions console’s Create state machine page.

Refer to building a state machine with AWS Step Functions Workflow Studio for additional details.

Step Functions integrates with all Amazon Athena’s API actions

In September 2021, Step Functions announced integration support for 200 AWS services to enable easier workflow automation. With this announcement, Step Functions can integrate with all Amazon Athena API actions today.

Step Functions can automate the lifecycle of an Athena query: Create/read/update/delete/list workGroups; Create/read/update/delete/list data catalogs, and more.

Other AWS service integrations

Step Functions’ integration with the AWS SDK provides support for 200 AWS Services and over 9,000 API actions. Athena tasks in Step Functions can evolve by integrating available AWS services in the workflow for their pre and post-processing needs.

For example, you can read Athena query results that are written to an S3 bucket by using an S3 GetObject task through the AWS SDK integration in Step Functions. You can combine different AWS services into a single business process that ingests data through Amazon Kinesis, processes it with AWS Lambda or Amazon EMR jobs, and sends notifications to interested parties via Amazon SNS, Amazon SQS, or Amazon EventBridge to trigger other parts of the business application.

There are multiple ways to build around an Amazon Athena job task. Refer to AWS SDK service integrations and optimized integrations for Step Functions for additional details.

Important considerations

Workflow orchestrations in the Athena console only show Step Functions state machines that use Athena’s optimized API integrations. This includes StartQueryExecution, StopQueryExecution, GetQueryExecution, and GetQueryResults.

Step Functions state machines do not show in the Athena console when:

A state machine uses any other AWS SDK Athena API integration task.
The APIs are invoked inside a Lambda function task using an AWS SDK client (like Boto3 or Node.js or Java).
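
The distinction above can be checked against a state machine definition by looking at each task's Resource ARN. The sketch below mirrors the rule as described in this post; it is not how the console itself detects state machines, and the helper names are hypothetical.

```python
OPTIMIZED_PREFIX = "arn:aws:states:::athena:"     # optimized Athena integration
SDK_PREFIX = "arn:aws:states:::aws-sdk:athena:"   # AWS SDK Athena integration


def iter_task_resources(states):
    """Yield the Resource of every Task state, including nested ones."""
    for state in states.values():
        if state.get("Type") == "Task":
            yield state.get("Resource", "")
        for branch in state.get("Branches", []):  # Parallel states
            yield from iter_task_resources(branch["States"])
        for key in ("Iterator", "ItemProcessor"):  # Map states
            if key in state:
                yield from iter_task_resources(state[key]["States"])


def uses_optimized_athena(definition):
    """True when at least one task uses the optimized Athena integration."""
    return any(resource.startswith(OPTIMIZED_PREFIX)
               for resource in iter_task_resources(definition["States"]))


optimized = {"States": {"Query": {
    "Type": "Task", "Resource": "arn:aws:states:::athena:startQueryExecution.sync"}}}
sdk_only = {"States": {"List": {
    "Type": "Task", "Resource": "arn:aws:states:::aws-sdk:athena:listWorkGroups"}}}
via_lambda = {"States": {"Invoke": {
    "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke"}}}
print(uses_optimized_athena(optimized),
      uses_optimized_athena(sdk_only),
      uses_optimized_athena(via_lambda))
```

Only the first definition uses an optimized Athena task, so by the rule above it is the only one expected to appear in the Athena console.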

Cleanup

First, empty DataBucket and AthenaWorkGroup to delete the stack successfully. To delete the sample application stack, use the latest version of AWS CLI and run:

aws cloudformation delete-stack --stack-name <stack-name>

Alternatively, delete the sample application stack in the CloudFormation console by selecting the stack and choosing Delete:

Conclusion

The Amazon Athena console now provides an integration with AWS Step Functions workflows. You can use the provided patterns to create and visualize Step Functions workflows directly from the Amazon Athena console. Step Functions workflows that use Athena’s optimized API integration appear in the Athena console. To learn more about Amazon Athena, read the user guide.

To get started, open the Workflows page in the Athena console. Select Create Athena jobs with Step Functions Workflows to deploy a sample project, if you are new to Step Functions.

For more serverless learning resources, visit Serverless Land.

Offset lag metric for Amazon MSK as an event source for Lambda

2021-11-23 Eric Johnson

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/offset-lag-metric-for-amazon-msk-as-an-event-source-for-lambda/

This post is written by Adam Wagner, Principal Serverless Solutions Architect.

Last year, AWS announced support for Amazon Managed Streaming for Apache Kafka (MSK) and self-managed Apache Kafka clusters as event sources for AWS Lambda. Today, AWS adds a new OffsetLag metric to Lambda functions with MSK or self-managed Apache Kafka event sources.

Offset in Apache Kafka is an integer that marks the current position of a consumer. OffsetLag is the difference in offset between the last record written to the Kafka topic and the last record processed by Lambda. Kafka expresses this in the number of records, not a measure of time. This metric provides visibility into whether your Lambda function is keeping up with the records added to the topic it is processing.
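
As a small worked example of that definition, the sketch below computes the lag per partition from two sets of offsets. The offsets are made-up values, and the post does not describe how the published metric aggregates across partitions, so this shows only the per-partition difference.

```python
def offset_lag(latest_offsets, processed_offsets):
    """Per-partition lag: records written but not yet processed."""
    return {partition: latest_offsets[partition] - processed_offsets[partition]
            for partition in latest_offsets}


# Hypothetical offsets for a topic with two partitions.
latest = {0: 1500, 1: 1480}      # last record written to each partition
processed = {0: 1492, 1: 1480}   # last record processed by the function
lag = offset_lag(latest, processed)
print(lag)
```

Partition 0 is eight records behind while partition 1 is fully caught up. The result counts records and says nothing about how old those records are.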

This blog walks through using the OffsetLag metric along with other Lambda and MSK metrics to understand your streaming application and optimize your Lambda function.

Overview

In this example application, a producer writes messages to a topic on the MSK cluster that is an event source for a Lambda function. Each message contains a number and the Lambda function finds the factors of that number. It outputs the input number and results to an Amazon DynamoDB table.

Finding all the factors of a number is fast if the number is small but takes longer for larger numbers. This difference means the size of the number written to the MSK topic influences the Lambda function duration.

Example application architecture

A Kafka client writes messages to a topic in the MSK cluster.
The Lambda event source polls the MSK topic on your behalf for new messages and triggers your Lambda function with batches of messages.
The Lambda function factors the number in each message and then writes the results to DynamoDB.

In this application, several factors can contribute to offset lag. The first is the volume and size of messages. If more messages are coming in, the Lambda function may take longer to process them. Other factors are the number of partitions in the topic, and the number of concurrent Lambda functions processing messages. A full explanation of how Lambda concurrency scales with the MSK event source is in the documentation.

If the average duration of your Lambda function increases, this also tends to increase the offset lag. The longer duration could be caused by latency in a downstream service or by the complexity of the incoming messages. Lastly, if your Lambda function errors, the MSK event source retries the identical set of records until they succeed. This retry functionality also increases offset lag.

Measuring OffsetLag

To understand how the new OffsetLag metric works, you first need a working MSK topic as an event source for a Lambda function. Follow this blog post to set up an MSK instance.

To find the OffsetLag metric, go to the CloudWatch console and select All Metrics from the left-hand menu. Then select Lambda, followed by By Function Name to see a list of metrics by Lambda function. Scroll or use the search bar to find the metrics for this function and select OffsetLag.

OffsetLag metric example
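
The same metric can also be read programmatically. The sketch below only builds the request parameters for the CloudWatch GetMetricStatistics API, using the namespace and dimension that the console steps above navigate to. The function name is a hypothetical placeholder, and the actual call is left as a comment because it needs boto3 and AWS credentials.

```python
from datetime import datetime, timedelta, timezone


def offset_lag_query(function_name, minutes=60, period_s=60):
    """Build keyword arguments for CloudWatch GetMetricStatistics."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Lambda",
        "MetricName": "OffsetLag",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period_s,
        "Statistics": ["Average", "Maximum"],
    }


params = offset_lag_query("example-msk-consumer")  # hypothetical function name
# With boto3 installed and credentials configured, the call would be:
#   boto3.client("cloudwatch").get_metric_statistics(**params)
print(params["Namespace"], params["MetricName"])
```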

To make it easier to look at multiple metrics at once, create a CloudWatch dashboard starting with the OffsetLag metric. Select Actions -> Add to Dashboard. Select the Create new button and give the dashboard a name. Choose Create, keeping the rest of the options at the defaults.

Adding OffsetLag to dashboard

After choosing Add to dashboard, the new dashboard appears. Choose the Add widget button to add the Lambda duration metric from the same function. Add another widget that combines both Lambda errors and invocations for the function. Then add a widget for the BytesInPerSec metric for the MSK topic. Find this metric under AWS/Kafka -> Broker ID, Cluster Name, Topic. Finally, choose Save dashboard.

After a few minutes, you see a steady stream of invocations, as you would expect when consuming from a busy topic.

Data incoming to dashboard

This example is a CloudWatch dashboard showing the Lambda OffsetLag, Duration, Errors, and Invocations, along with the BytesInPerSec for the MSK topic.

In this example, the OffsetLag metric is averaging about eight, indicating that the Lambda function is eight records behind the latest record in the topic. While this is acceptable, there is room for improvement.

The first thing to look for is Lambda function errors, which can drive up offset lag. The metrics show that there are no errors so the next step is to evaluate and optimize the code.

The Lambda handler function loops through the records and calls the process_msg function on each record:

def lambda_handler(event, context):
    for batch in event['records'].keys():
        for record in event['records'][batch]:
            try:
                process_msg(record)
            except Exception as err:
                # log the failure and continue with the next record
                print("error processing record:", record, err)
    return()

The process_msg function handles base64 decoding, calls a factor function to factor the number, and writes the record to a DynamoDB table:

def process_msg(record):
    #messages are base64 encoded, so we decode it here
    msg_value = base64.b64decode(record['value']).decode()
    msg_dict = json.loads(msg_value)
    #using the number as the hash key in the dynamodb table
    msg_id = f"{msg_dict['number']}"
    if msg_dict['number'] <= MAX_NUMBER:
        factors = factor(msg_dict['number'])
        print(f"number: {msg_dict['number']} has factors: {factors}")
        item = {'msg_id': msg_id, 'msg':msg_value, 'factors':factors}
        resp = ddb_table.put_item(Item=item)
    else:
        print(f"ERROR: {msg_dict['number']} is > limit of {MAX_NUMBER}")
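
To exercise code like this locally, it helps to build a test event with the shape the handler iterates over: a records dictionary keyed by topic-partition, where each record carries a base64-encoded value. The sketch below is an assumption-laden illustration: the topic name, offsets, and numbers are made up, and real events contain additional fields.

```python
import base64
import json


def make_record(topic, partition, offset, number):
    """Build one record whose value is base64-encoded JSON, as the handler expects."""
    value = base64.b64encode(json.dumps({"number": number}).encode()).decode()
    return {"topic": topic, "partition": partition, "offset": offset, "value": value}


sample_event = {
    "eventSource": "aws:kafka",
    "records": {
        "numbers-0": [make_record("numbers", 0, 15, 28)],
        "numbers-1": [make_record("numbers", 1, 7, 97),
                      make_record("numbers", 1, 8, 360)],
    },
}


def decode_numbers(event):
    """Walk every topic-partition batch and return the decoded numbers."""
    numbers = []
    for batch in event["records"].values():
        for record in batch:
            payload = json.loads(base64.b64decode(record["value"]).decode())
            numbers.append(payload["number"])
    return numbers


print(decode_numbers(sample_event))
```

Passing sample_event to the handler in a unit test allows the decoding and factoring logic to be checked without a running MSK cluster, provided the DynamoDB write is stubbed out.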

The heavy computation takes place in the factor function:

def factor(number):
    factors = [1,number]
    for x in range(2, (int(1 + number / 2))):
        if (number % x) == 0:
            factors.append(x)
    return factors

The code loops through all numbers up to the factored number divided by two. The code is optimized by only looping up to the square root of the number.

def factor(number):
    factors = [1,number]
    for x in range(2, 1 + int(number**0.5)):
        if (number % x) == 0:
            factors.append(x)
            # avoid adding the square root twice for perfect squares
            if x != number // x:
                factors.append(number // x)
    return factors

There are further optimizations and libraries for factoring numbers but this provides a noticeable performance improvement in this example.
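
A quick way to confirm that the two approaches agree, and to see the difference in speed, is to run them side by side. In this sketch the functions are renamed factor_half and factor_sqrt so both can coexist, and the square-root version includes a guard against listing the root of a perfect square twice. Timings vary by machine, so none are quoted here.

```python
import timeit


def factor_half(number):
    """Trial division up to number / 2."""
    factors = [1, number]
    for x in range(2, int(1 + number / 2)):
        if number % x == 0:
            factors.append(x)
    return factors


def factor_sqrt(number):
    """Trial division up to the square root, adding both divisors of each pair."""
    factors = [1, number]
    for x in range(2, 1 + int(number ** 0.5)):
        if number % x == 0:
            factors.append(x)
            if x != number // x:  # skip the duplicate for perfect squares
                factors.append(number // x)
    return factors


n = 360_360
assert sorted(factor_half(n)) == sorted(factor_sqrt(n))
slow = timeit.timeit(lambda: factor_half(n), number=5)
fast = timeit.timeit(lambda: factor_sqrt(n), number=5)
print(f"half: {slow:.4f}s  sqrt: {fast:.4f}s")
```

The square-root version returns the factors in a different order, which is why the comparison sorts both lists first.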

Data after optimization

After deploying the code, refresh the metrics after a while to see the improvements:

The average Lambda duration has dropped to single-digit milliseconds and the OffsetLag is now averaging two.

If you see a noticeable change in the OffsetLag metric, there are several things to investigate. Start with the input side of the system: an increase in messages per second or a significant increase in message size are two likely causes.

Conclusion

This post walks through using the OffsetLag metric to understand how far a Lambda function is behind the latest messages in the MSK topic. It also reviews other metrics that help you understand the underlying cause of increases in the offset lag. For more information on this topic, refer to the documentation and other MSK Lambda metrics.

For more serverless learning resources, visit Serverless Land.