Tag Archives: serverless

Validating addresses with AWS Lambda and the Amazon Location Service

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/validating-addresses-with-aws-lambda-and-the-amazon-location-service/

This post is written by Matthew Nightingale, Associate Solutions Architect.

Traditional methods of performing address validation on geospatial datasets can be expensive and time-consuming. Using Amazon Location Service with AWS Lambda in a serverless data processing pipeline, you may achieve significant performance improvements and cost savings on address validation jobs that use geospatial data.

This blog contains a deployable AWS Serverless Application Model (AWS SAM) template. It also uses sample data sourced from publicly available datasets that you can deploy and use to test the application. This blog offers a starting point to build out a serverless address validation pipeline in your own AWS account.

Overview

This application implements a serverless scatter/gather architecture using Lambda and Amazon S3, performing address validation with the Amazon Location Service. An S3 PUT event triggers each Lambda function to run data processing jobs along each step of the pipeline.

To test the application, a user uploads a .CSV file to S3. This dataset is labeled with fields that are recognized by the 2waygeocoder Lambda function. The application returns a processed dataset to S3 appended with location information from the Amazon Location Places API.

Solution overview

  1. The Scatter Lambda function takes a dataset from the S3 bucket labeled input and splits it into equally sized shards.
  2. The Process Lambda function takes each shard from the pre-processed bucket. It performs address validation in parallel with a 2waygeocoder function calling the Amazon Location Service Places API.
  3. The Gather Lambda function takes each shard from the post-processed bucket. It appends the data into a complete dataset with additional address information.

Amazon Location Service

Amazon Location Service sources high-quality geospatial data from HERE and ESRI to support searches by using a place index resource.

With the Amazon Location Service Places API, you can convert addresses and other textual queries into geographic coordinates (also known as geocoding). You can also convert geographic positions into addresses and place descriptions (known as reverse geocoding).

The example application includes a 2waygeocoder capable of both geocoding and reverse geocoding. The next section shows examples of the call and response from the Amazon Location Service Places API for both geocoding and reverse geocoding.

Geocoding with Amazon Location Service

Here is an example of calling the Amazon Location Service Places API using the AWS SDK for Python (Boto3). This uses the search_place_index_for_text method:

import boto3

location = boto3.client("location")

# The place index is created in Amazon Location Service beforehand.
response = location.search_place_index_for_text(
    IndexName="explore.place",
    Text="Boston, MA")
location_response = response["Results"]
print(location_response)

Example response:

Response

Example reverse-geocoding with Amazon Location Service

Here is another example of calling the Amazon Location Service Places API using the AWS SDK for Python (boto3). This uses the search_place_index_for_position method:

# The place index is created in Amazon Location Service beforehand.
response = location.search_place_index_for_position(
    IndexName="explore.place",
    Position=[-71.056739, 42.358660])
location_response = response["Results"]
print(location_response)

Example response:

Response

Design considerations

Processing data with Lambda in parallel using a serverless scatter/gather pipeline helps provide performance efficiency at lower cost. To provide even greater performance, you can optimize your Lambda configuration for higher throughput. There are several strategies you can implement to do this and a few key topics to keep in mind.

Increase the allocated memory for your Lambda function

The simplest way to increase throughput is to increase the allocated memory of the Lambda function.

Faster Lambda functions can process more data and increase throughput. This works even if a Lambda function’s memory utilization is low. This is because increasing memory also increases vCPUs in proportion to the amount configured. Each function supports up to 10 GB of memory and you can access up to six vCPUs per function.

To see the average cost and execution speed for each memory configuration, use the Lambda Power Tuning tool, which helps to visualize the tradeoffs.

Optimize shard size

Another method for increasing performance in a serverless scatter/gather architecture is to optimize the total number of shards created by the scatter function. Increasing the total number of shards consequently reduces the size of any single shard, allowing Lambda to process each shard faster.
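To make the scatter step concrete, here is a minimal sketch of how a scatter function might split an uploaded CSV into equally sized shards and write them to the pre-processed ("raw") bucket. The bucket name, shard count, and key naming are assumptions for illustration, not the code from the sample repository.

import io
import os

import boto3
import pandas as pd

s3 = boto3.client("s3")

# Illustrative values only; the sample application defines its own buckets and shard count.
RAW_BUCKET = os.environ.get("RAW_BUCKET", "address-validation-raw")
SHARD_COUNT = int(os.environ.get("SHARD_COUNT", "4"))

def scatter_handler(event, context):
    # Read the uploaded CSV from the S3 PUT event that triggered the function.
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]
    df = pd.read_csv(s3.get_object(Bucket=bucket, Key=key)["Body"])

    # Split the dataset into roughly equal shards and write each one to the raw bucket.
    shard_size = -(-len(df) // SHARD_COUNT)  # ceiling division
    for i in range(SHARD_COUNT):
        shard = df.iloc[i * shard_size:(i + 1) * shard_size]
        if shard.empty:
            continue
        buffer = io.StringIO()
        shard.to_csv(buffer, index=False)
        s3.put_object(
            Bucket=RAW_BUCKET,
            Key=f"{key.rsplit('.', 1)[0]}-shard-{i}.csv",
            Body=buffer.getvalue(),
        )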

When scaling with Lambda, one instance of a function handles one request at a time. When the number of requests increases, Lambda creates more instances of the function to process traffic. Because S3 invokes Lambda asynchronously, there is an internal queue buffering requests between the event source and the Lambda service.

In a serverless scatter/gather architecture, having more shards results in more concurrent invocations of the process Lambda function. For more information about scaling and concurrency with Lambda, see this blog post. Increasing concurrency with Lambda can lead to API request throttling.

Consider API request throttling with your concurrent Lambda functions

In a serverless scatter/gather architecture, the rate at which your code calls APIs increases by a factor equal to the number of concurrent Lambda functions. This means API request limits can quickly be exceeded. You must consider Service Quotas and API request limits when trying to increase the performance of your serverless scatter/gather architecture.

For example, the Amazon Location Service Places API called in the processing function of this application has a default limit of 50 requests per second. The 2waygeocoder makes about 12 API requests per second on average, so splitting the application into more than four shards may cause API throttling exceptions. You can request Service Quota increases through your AWS account.
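One way to stay below the Places API limit as concurrency grows is to catch throttling errors and retry with exponential backoff inside the processing function. The following is a sketch of that pattern only, not the 2waygeocoder's actual implementation; the place index name and retry settings are assumptions.

import time

import boto3
from botocore.exceptions import ClientError

location = boto3.client("location")

def geocode_with_backoff(text, index_name="explore.place", max_retries=5):
    """Call SearchPlaceIndexForText, backing off when the request is throttled."""
    for attempt in range(max_retries):
        try:
            response = location.search_place_index_for_text(
                IndexName=index_name, Text=text)
            return response["Results"]
        except ClientError as error:
            if error.response["Error"]["Code"] != "ThrottlingException":
                raise
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
    raise RuntimeError(f"Geocoding still throttled after {max_retries} attempts")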

Deploying the solution

You need the following prerequisites to deploy the example application:

Deploy the example application:

  1. Clone the repository and download the sample source code to your environment where AWS SAM is installed:
    git clone https://github.com/aws-samples/amazon-location-service-serverless-address-validation
  2. Change into the project directory containing the template.yaml file:
    cd ~/environment/amazon-location-service-serverless-address-validation
  3. Build the application using AWS SAM:
    sam build
    Terminal output
  4. Deploy the application to your account using AWS SAM. Be sure to follow proper S3 naming conventions, providing globally unique names for S3 buckets:
    sam deploy --guided
    Deployment output

Testing the application

Testing geocoding

To test the application, download the dataset that is linked in the Testing the Application section of the GitHub repository. These tests demonstrate both the geocoding and reverse-geocoding capabilities of the application.

First, test the geocoding capabilities. You perform address validation on the City of Hartford Business Listing dataset linked in the GitHub repository. The dataset contains a listing of all the active businesses registered in the city of Hartford, CT, along with each business address. The GitHub repo links to an external website where you can download the dataset.

  1. Download the .csv version of the City of Hartford Business Listing dataset. The link is found in the Testing the Application section of the README file on GitHub.
  2. Open the file locally to explore its contents.
  3. Ensure that the .csv file contains columns labeled as “Address”, “City”, and “State”. The 2waygeocoder deployed as part of the AWS SAM template recognizes these columns to perform geocoding.
  4. Before testing the application’s geocoding capabilities, explore the pricing of Amazon Location Service. In order to save money, you can trim the length of the dataset for testing by removing rows. Once the dataset is trimmed to a desired length, navigate to S3 in the AWS Management Console.
  5. Upload the dataset to the S3 bucket labeled “input”. This triggers the scatter function.
  6. Navigate to the S3 bucket labeled “raw” to view the shards of your dataset created by the scatter function.
  7. Navigate to Lambda and select the 2waygeocoder function to view the CloudWatch Logs to see any information that is returned by the function code in near-real-time.
  8. Once the data is processed, navigate to the S3 bucket labeled “destination” to view the complete processed dataset that is created by the gather function. It may take several minutes for your dataset to finish processing.

Congratulations! You have successfully geocoded a dataset using Amazon Location Service with a serverless address validation pipeline.

Testing reverse-geocoding

Next, test the reverse-geocoding capabilities of the application. You perform address validation on the Miami Housing Dataset linked in the GitHub repository. This dataset contains information on 13,932 single-family homes sold in Miami. The repo links to an external website where you can download the dataset.

Before testing, explore the pricing of Amazon Location Service. To start the test:

  1. Download the zip file containing the .csv version of the dataset. The link is found in the Testing the Application section of the README file on GitHub.
  2. Open the file locally to explore its contents.
  3. Ensure the .csv file contains columns A and B labeled “Latitude” and “Longitude”. You must edit these column headers to match the correct format that is recognized by the 2waygeocoder to perform reverse-geocoding. Only the “L” should be capitalized.
  4. To minimize cost, trim the length of the dataset for testing by removing rows. At the full size of ~13,933 rows, the dataset takes approx. 5 minutes to process.
  5. Once the dataset is trimmed to a desired length and both column A and B are labeled as “Latitude” and “Longitude” respectively, navigate to S3 in the AWS Management Console, and upload the dataset to your S3 bucket labeled “Input”.
  6. Navigate to the S3 bucket labeled “raw” to view the shards of your dataset.
  7. Navigate to Lambda and select the 2waygeocoder function to view the CloudWatch Logs to see any information that is returned by the function code in near-real-time.
  8. Navigate to the S3 bucket labeled “destination” to view the complete processed dataset that is created by the gather function. It may take several minutes for your dataset to finish processing.

Congratulations! You have successfully reverse-geocoded a dataset with Amazon Location Service using a serverless scatter/gather pipeline. You can move on to the conclusion, or continue to test the geocoding capabilities of the application with additional datasets.

Next steps

To get started testing your own datasets, use the AWS SAM template from GitHub deployed as part of this blog. Ensure that the columns in your dataset are labeled to match the conventions used in this blog post. The 2waygeocoder recognizes columns labeled "Latitude" and "Longitude" to perform reverse-geocoding, and "Address", "City", and "State" to perform geocoding.
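As a convenience, you could normalize headers locally before uploading. This short sketch is a hypothetical helper, not part of the sample repository; it renames common header variants to the labels listed above using pandas.

import pandas as pd

# Map lowercase header variants to the labels the 2waygeocoder expects.
EXPECTED = {
    "address": "Address",
    "city": "City",
    "state": "State",
    "latitude": "Latitude",
    "longitude": "Longitude",
}

df = pd.read_csv("my-dataset.csv")
df = df.rename(columns={c: EXPECTED[c.strip().lower()]
                        for c in df.columns if c.strip().lower() in EXPECTED})
df.to_csv("my-dataset-labeled.csv", index=False)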

Now that the data has been geocoded by Amazon Location Service and is in S3, you can use Amazon QuickSight geospatial charts to quickly and easily create interactive charts. For information on how to create a Dataset in QuickSight using Amazon S3 Files, check out the QuickSight User Guide.

Below is an example using QuickSight geospatial charts to map the Miami housing dataset. The map shows average sale price by ZIP code:

QuickSight map

This example uses QuickSight geospatial charts to map the City of Hartford Business dataset. The map shows DBA (doing business as) by latitude and longitude:

Dataset visualization

Conclusion

This blog post demonstrates address validation with the Amazon Location Service, covering both geocoding and reverse geocoding capabilities.

Using a serverless architecture with S3 and Lambda, you can achieve both cost optimization and performance improvement compared with traditional methods of address validation. Using this application, your organization can better understand and harness geospatial data.

For more serverless learning resources, visit Serverless Land.

ICYMI: Serverless Q4 2021

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/icymi-serverless-q4-2021/

Welcome to the 15th edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share all of the most recent product launches, feature enhancements, blog posts, webinars, Twitch live streams, and other interesting things that you might have missed!

Q4 calendar

In case you missed our last ICYMI, check out what happened last quarter here.

AWS Lambda

For developers using Amazon MSK as an event source, Lambda has expanded authentication options to include IAM, in addition to SASL/SCRAM. Lambda also now supports mutual TLS authentication for Amazon MSK and self-managed Kafka as an event source.

Lambda also launched features to make it easier to operate across AWS accounts. You can now invoke Lambda functions from Amazon SQS queues in different accounts. You must grant permission to the Lambda function’s execution role and have SQS grant cross-account permissions. For developers using container packaging for Lambda functions, Lambda also now supports pulling images from Amazon ECR in other AWS accounts. To learn about the permissions required, see this documentation.

The service now supports a partial batch response when using SQS as an event source for both standard and FIFO queues. When messages fail to process, Lambda marks the failed messages and allows reprocessing of only those messages. This helps to improve processing performance and may reduce compute costs.

Lambda launched content filtering options for functions using SQS, DynamoDB, and Kinesis as an event source. You can specify up to five filter criteria that are combined using OR logic. This uses the same content filtering language that’s used in Amazon EventBridge, and can dramatically reduce the number of downstream Lambda invocations.
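As an illustration of the filter syntax, the sketch below creates an SQS event source mapping with a single filter pattern using the AWS SDK for Python. The queue ARN, function name, and pattern are placeholders; the same FilterCriteria structure accepts up to five patterns combined with OR logic.

import json

import boto3

lambda_client = boto3.client("lambda")

# Hypothetical ARNs and pattern, shown only to illustrate the FilterCriteria shape.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:orders-queue",
    FunctionName="process-orders",
    FilterCriteria={
        "Filters": [
            # Invoke the function only for messages whose JSON body contains "status": "pending".
            {"Pattern": json.dumps({"body": {"status": ["pending"]}})}
        ]
    },
)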

Amazon EventBridge

Previously, you could consume Amazon S3 events in EventBridge via CloudTrail. Now, EventBridge receives events from the S3 service directly, making it easier to build serverless workflows triggered by activity in S3. You can use content filtering in rules to identify relevant events and forward these to 18 service targets, including AWS Lambda. You can also use event archive and replay, making it possible to reprocess events in testing, or in the event of an error.

AWS Step Functions

The AWS Batch console has added support for visualizing Step Functions workflows. This makes it easier to combine these services to orchestrate complex workflows over business-critical batch operations, such as data analysis or overnight processes.

Additionally, Amazon Athena has also added console support for visualizing Step Functions workflows. This can help when building distributed data processing pipelines, allowing Step Functions to orchestrate services such as AWS Glue, Amazon S3, or Amazon Kinesis Data Firehose.

Synchronous Express Workflows now supports AWS PrivateLink. This enables you to start these workflows privately from within your virtual private clouds (VPCs) without traversing the internet. To learn more about this feature, read the What’s New post.

Amazon SNS

Amazon SNS announced support for token-based authentication when sending push notifications to Apple devices. This creates a secure, stateless communication between SNS and the Apple Push Notification (APN) service.

SNS also launched the new PublishBatch API which enables developers to send up to 10 messages to SNS in a single request. This can reduce cost by up to 90%, since you need fewer API calls to publish the same number of messages to the service.
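For example, with the AWS SDK for Python you can publish a batch of up to 10 messages in one call. The topic ARN and message bodies below are placeholders.

import boto3

sns = boto3.client("sns")

# One API call publishes up to 10 messages; each entry needs a unique Id.
sns.publish_batch(
    TopicArn="arn:aws:sns:us-east-1:123456789012:order-events",
    PublishBatchRequestEntries=[
        {"Id": str(i), "Message": f"order-{i} shipped"} for i in range(10)
    ],
)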

Amazon SQS

Amazon SQS released an enhanced DLQ management experience for standard queues. This allows you to redrive messages from a DLQ back to the source queue. This can be configured in the AWS Management Console, as shown here.

Amazon DynamoDB

The NoSQL Workbench for DynamoDB is a tool to simplify designing, visualizing, and querying DynamoDB tables. The tool now supports importing sample data from CSV files and exporting the results of queries.

DynamoDB announced the new Standard-Infrequent Access table class. Use this for tables that store infrequently accessed data to reduce your costs by up to 60%. You can switch to the new table class without an impact on performance or availability and without changing application code.

AWS Amplify

AWS Amplify now allows developers to override Amplify-generated IAM, Amazon Cognito, and S3 configurations. This makes it easier to customize the generated resources to best meet your application’s requirements. To learn more about the “amplify override auth” command, visit the feature’s documentation.

Similarly, you can also add custom AWS resources using the AWS Cloud Development Kit (CDK) or AWS CloudFormation. In another new feature, developers can then export Amplify backends as CDK stacks and incorporate them into their deployment pipelines.

AWS Amplify UI has launched a new Authenticator component for React, Angular, and Vue.js. Aside from the visual refresh, this provides the easiest way to incorporate social sign-in in your frontend applications with zero-configuration setup. It also includes more customization options and form capabilities.

AWS launched AWS Amplify Studio, which automatically translates designs made in Figma to React UI component code. This enables you to connect UI components visually to backend data, providing a unified interface that can accelerate development.

AWS AppSync

You can now use custom domain names for AWS AppSync GraphQL endpoints. This enables you to specify a custom domain for both GraphQL API and Realtime API, and have AWS Certificate Manager provide and manage the certificate.

To learn more, read the feature’s documentation page.

News from other services

Serverless blog posts

October

November

December

AWS re:Invent breakouts

AWS re:Invent was held in Las Vegas from November 29 to December 3, 2021. The Serverless DA team presented numerous breakouts, workshops and chalk talks. Rewatch all our breakout content:

Serverlesspresso

We also launched an interactive serverless application at re:Invent to help customers get caffeinated!

Serverlesspresso is a contactless, serverless order management system for a physical coffee bar. The architecture comprises several serverless apps that support an ordering process from a customer’s smartphone to a real espresso bar. The customer can check the virtual line, place an order, and receive a notification when their drink is ready for pickup.

Serverlesspresso booth

You can learn more about the architecture and download the code repo at https://serverlessland.com/reinvent2021/serverlesspresso. You can also see a video of the exhibit.

Videos

Serverless Land videos

Serverless Office Hours – Tues 10 AM PT

Weekly live virtual office hours. In each session we talk about a specific topic or technology related to serverless and open it up to helping you with your real serverless challenges and issues. Ask us anything you want about serverless technologies and applications.

YouTube: youtube.com/serverlessland
Twitch: twitch.tv/aws

October

November

December

Still looking for more?

The Serverless landing page has more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials.

You can also follow the Serverless Developer Advocacy team on Twitter to see the latest news, follow conversations, and interact with the team.

Building a serverless multi-player game that scales: Part 3

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/building-a-serverless-multi-player-game-that-scales-part-3/

This post is written by Tim Bruce, Sr. Solutions Architect, DevAx, Chelsie Delecki, Solutions Architect, DNB, and Brian Krygsman, Solutions Architect, Enterprise.

This blog series discusses building a serverless game that scales, using Simple Trivia Service:

  • Part 1 describes the overall architecture, how to deploy to your AWS account, and the different communication methods.
  • Part 2 describes adding automation to the game to help your teams scale.

This post discusses how the game scales to support concurrent users (CCU) under a load test. While this post focuses on Simple Trivia Service, you can apply the concepts to any serverless workload.

To set up the example, see the instructions in the Simple Trivia Service GitHub repo and the README.md file. This example uses services beyond AWS Free Tier and incurs charges. To remove the example from your account, see the README.md file.

Overview

Simple Trivia Service is launching at a new trivia conference. There are 200,000 registered attendees who are invited to play the game during the conference. The developers are following AWS Well-Architected best practices and load testing before the launch.

Load testing is the practice of simulating user load to validate the system’s ability to scale. The focus of the load test is the game’s microservices, built using AWS Serverless services, including:

  • Amazon API Gateway and AWS IoT, which provide serverless endpoints, allowing users to interact with the Simple Trivia Service microservices.
  • AWS Lambda, which provides serverless compute services for the microservices.
  • Amazon DynamoDB, which provides a serverless NoSQL database for storing game data.

Preparing for load testing

Defining success criteria is one of the first steps in preparing for a load test. You use success criteria to determine how well the game meets the requirements; they include concurrent users, error rates, and response time. These three criteria help to ensure that your users have a good experience when playing your game.

Excluding one can lead to invalid assumptions about the scale of users that the game can support. If you exclude error rate goals, for example, users may encounter more errors, impacting their experience.

The success criteria used for Simple Trivia Service are:

  • 200,000 concurrent users split across game types.
  • Error rates below 0.05%.
  • 95th percentile synchronous responses under 1 second.

With these identified, you can develop dashboards to report on the targets. Dashboards allow you to monitor system metrics over the course of load tests. You can develop dashboards using Amazon CloudWatch dashboards, using custom widgets that organize and display metrics.

Common metrics to monitor include:

  • Error rates – total errors / total invocations (see the metric math sketch after this list).
  • Throttles – invocations resulting in 429 errors.
  • Percentage of quota usage – usage against your game’s Service Quotas.
  • Concurrent execution counts – maximum concurrent Lambda invocations.
  • Provisioned concurrency invocation rate – provisioned concurrency spillover invocation count / provisioned concurrency invocation count.
  • Latency – percentile-based response time, such as 90th and 95th percentiles.
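As a sketch of how the error rate metric could be computed, the following uses CloudWatch metric math through the GetMetricData API. The function name is a placeholder, and the same expression can back a dashboard widget or an alarm.

from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

def lambda_metric(metric_name, query_id):
    # Sum a standard AWS/Lambda metric for a hypothetical function, without returning it directly.
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/Lambda",
                "MetricName": metric_name,
                "Dimensions": [{"Name": "FunctionName", "Value": "answer-question"}],
            },
            "Period": 60,
            "Stat": "Sum",
        },
        "ReturnData": False,
    }

result = cloudwatch.get_metric_data(
    MetricDataQueries=[
        lambda_metric("Errors", "errors"),
        lambda_metric("Invocations", "invocations"),
        # Metric math: error rate as a percentage of invocations.
        {"Id": "error_rate", "Expression": "100 * errors / invocations"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
)
print(result["MetricDataResults"])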

Documentation and other services are also helpful during load testing. Centralized logging via Amazon CloudWatch Logs and AWS CloudTrail provide your team with operational data for the game. This data can help triage issues during testing.

System architecture documents provide key details to help your team focus their work during triage. Amazon DevOps Guru can also provide your team with potential solutions for issues. This uses machine learning to identify operational deviations and deployments and provides recommendations for resolving issues.

A load testing tool simplifies your testing, allowing you to model users playing the game. Popular load testing tools include Apache JMeter, Artillery (Artillery.io), and Locust (Locust.io). The load testing tool you select can act as your application client and access your endpoints directly.

This example uses Locust to load test Simple Trivia Service, based on its language and technical requirements. Locust allows you to accurately model usage rather than only generate transactions. In production applications, select a tool that aligns with your team's skills and meets your technical requirements.
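A minimal Locust test file might look like the sketch below. The endpoints, task weights, and wait times are assumptions, not the actual Simple Trivia Service test scripts.

from locust import HttpUser, between, task

class TriviaPlayer(HttpUser):
    """Models one simulated player; run with: locust -f locustfile.py --host https://api.example.com"""
    wait_time = between(1, 5)  # seconds of think time between tasks

    @task(3)
    def list_active_games(self):
        self.client.get("/games/active")

    @task(1)
    def submit_answer(self):
        self.client.post("/games/demo/answers", json={"questionId": 1, "answer": "B"})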

You can place automation around your load testing tool to reduce the manual effort of running tests. Automation can include allocating environments, deploying and running test scripts, and collecting results. You can include this as part of your continuous integration/continuous delivery (CI/CD) pipeline. You can use the Distributed Load Testing on AWS solution to support Taurus-compatible load testing.

Also, document a plan, working backwards from your goals, to help measure your progress. Plans typically use incremental growth of CCU, which can help you to identify constraints in your game. Use your plan during development, once portions of your game are feature complete.

Testing plan for STS

This shows an example plan for load testing Simple Trivia Service:

  1. Start with individual game testing to validate tests and game modes separately.
  2. Add in testing of the three game modes together, mirroring expected real world activity.

Finally, evaluate your load test and architecture against your AWS Customer Agreement, AWS Acceptable Use Policy, Amazon EC2 Testing Policy, and the AWS Customer Support Policy for Penetration Testing. These policies are put in place to help you to be successful in your load testing efforts. AWS Support requires you to notify them at least two weeks prior to your load test using the Simulated Events Submission Form in the AWS Management Console. This form can also be used if you have questions before your load test.

Additional help for your load test may be available on the AWS Forums, AWS re:Post, or via your account team.

Testing

After triggering a test, automation scales up your infrastructure and initializes the test users. Depending on the number of users you need and their startup behavior, this ramp-up phase can take several minutes. Similarly, when the test run is complete, your test users should ramp down. Unless you have modeled the ramp-up and ramp-down phases to match real-world behavior, exclude these phases from your measurements. If you include them, you may optimize for unrealistic user behavior.

While tests are running, let metrics normalize before drawing conclusions. Services may report data at different rates. Investigate when you find metrics that cross your acceptable thresholds. You may need to make adjustments like adding Lambda Provisioned Concurrency or changing application code to resolve constraints. You may even need to re-evaluate your requirements based on how the system performs. When you make changes, re-test to verify any changes had the impact you expected before continuing with your plan.

Finally, keep an organized record of the inputs and outputs of tests, including dashboard exports and your own observations. This record is valuable when sharing test outcomes and comparing test runs. Mark your progress against the plan to stay on track.

Analyzing and improving Simple Trivia Service performance

Running the test plan while using observability tools to measure performance reveals opportunities to tune performance bottlenecks.

In this example, during single player individual tests, the dashboards show acceptable latency values. As the test size grows, increasing read capacity for retrieving leaderboards indicates a tuning opportunity:

Dashboard reads 1

Dashboard reads 2

  1. The CloudWatch dashboard reveals that the LeaderboardGet function is leading to high consumed read capacity for the Players DynamoDB table. A process within the function is querying scores and player records with every call to load avatar URLs.
  2. Standardizing the player avatar URL process within the code reduces reads from the table. The update improves DynamoDB reads.

Moving into the full test phase of the plan with combined game types identified additional areas for performance optimization. In one case, dashboards highlight unexpected error rates for a Lambda function. Consulting function logs and DevOps Guru to triage the behavior shows a downstream issue with an Amazon Kinesis data stream:

Identifying error rates in dashboards

  1. DevOps Guru, within an insight, highlights the problem of the Kinesis:WriteProvisionedThroughputExceeded metric during our test window.
  2. DevOps Guru also correlates that metric with the Kinesis:GetRecords.Latency metric.

DevOps Guru also links to a recommendation for Kinesis Data Streams to troubleshoot and resolve the incident with the data stream. Following this advice helps to resolve the Lambda error rates during the next test.

Load testing results

By following the plan and making incremental changes as optimizations became apparent, you can reach the goals.

Table of results

The preceding table is a summary of data from Amazon CloudWatch Lambda Insights and statistics captured from Locust:

  1. The test exceeded the goal of 200k CCU with a combined total of 236,820 CCU.
  2. The error rate stayed below the 0.05% goal, with a combined average of 0.010%.
  3. Performance goals are achieved without needing Provisioned Concurrency in Lambda.

Function latency

Function concurrency and throttles

  1. The function latency goal of < 1 second is met, based on data from CloudWatch Lambda Insights.
  2. Function concurrency is below Service Quotas for Lambda during the test, based on data from our custom CloudWatch dashboard.

Conclusion

This post discusses how to perform a load test on a serverless workload. The process was used to validate the scalability of Simple Trivia Service, a single- and multi-player game built using a serverless-first architecture on AWS. The results show the game scaling to over 220,000 CCU while maintaining less than 1-second response times and an error rate under 0.05%.

For more serverless learning resources, visit Serverless Land.

Use AWS Step Functions to Monitor Services Choreography

Post Syndicated from Vito De Giosa original https://aws.amazon.com/blogs/architecture/use-aws-step-functions-to-monitor-services-choreography/

Organizations frequently need access to quick visual insight on the status of complex workflows. This involves collaboration across different systems. If your customer requires assistance on an order, you need an overview of the fulfillment process, including payment, inventory, dispatching, packaging, and delivery. If your products are expensive assets such as cars, you must track each item’s journey instantly.

Modern applications use event-driven architectures to manage the complexity of system integration at scale. These often use choreography for service collaboration. Instead of directly invoking systems to perform tasks, services interact by exchanging events through a centralized broker. Complex workflows are the result of actions each service initiates in response to events produced by other services. Services do not directly depend on each other. This increases flexibility, development speed, and resilience.

However, choreography can introduce two main challenges for the visibility of your workflow.

  1. It obfuscates the workflow definition. The sequence of events emitted by individual services implicitly defines the workflow. There is no formal statement that describes steps, permitted transitions, and possible failures.
  2. It might be harder to understand the status of workflow executions. Services act independently, based on events. You can implement distributed tracing to collect information related to a single execution across services. However, getting visual insights from traces may require custom applications. This increases time to market (TTM) and cost.

To address these challenges, we will show you how to use AWS Step Functions to model choreographies as state machines. The solution enables stakeholders to gain visual insights on workflow executions, identify failures, and troubleshoot directly from the AWS Management Console.

This GitHub repository provides a Quick Start and examples on how to model choreographies.

Modeling choreographies with Step Functions

Monitoring a choreography requires a formal representation of the distributed system behavior, such as state machines. State machines are mathematical models representing the behavior of systems through states and transitions. States model situations in which the system can operate. Transitions define which input causes a change from the current state to the next. They occur when a new event happens. Figure 1 shows a state machine modeling an order workflow.

Figure 1. Order workflow

Figure 1. Order workflow

The solution in this post uses the Amazon States Language to describe a choreography as a Step Functions state machine. The state machine pauses, using Task states combined with a callback integration pattern. It then waits for the next event to be published on the broker. Choice states control transitions to the next state by inspecting event payloads. Figure 2 shows how the workflow in Figure 1 translates to a Step Functions state machine.

Figure 2. Order workflow translated into Step Functions state machine

Figure 2. Order workflow translated into Step Functions state machine

Figure 3 shows the architecture for monitoring choreographies with Step Functions.

Figure 3. Choreography monitoring with AWS Step Functions

Figure 3. Choreography monitoring with AWS Step Functions

  1. Services involved in the choreography publish events to Amazon EventBridge. There are two configured rules. The first rule matches the first event of the choreography sequence, Order Placed in the example. The second rule matches any other event of the sequence. Event payloads contain a correlation id (order_id) to group them by workflow instance.
  2. The first rule invokes an AWS Lambda function, which starts a new execution of the choreography state machine. The correlation id is passed in the name parameter, so you can quickly identify an execution in the AWS Management Console.
  3. The state machine uses Task states with AWS SDK service integrations, to directly call Amazon DynamoDB. Tasks are configured with a callback pattern. They issue a token, which is stored in DynamoDB with the execution name. Then, the workflow pauses.
  4. A service publishes another event on the event bus.
  5. The second rule invokes another Lambda function with the event payload.
  6. The function uses the correlation id to retrieve the task token from DynamoDB.
  7. The function invokes the Step Functions SendTaskSuccess API, with the token and the event payload as parameters (a sketch of steps 6 and 7 follows this list).
  8. The state machine resumes the execution and uses Choice states to transition to the next state. If the choreography definition expects the received event payload, it selects the next state and the process will restart from Step # 3. The state machine transitions to a Fail state when it receives an unexpected event.
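To make steps 6 and 7 concrete, the sketch below shows a minimal version of the second Lambda function. The table name and key attribute names are assumptions; the GitHub repository contains the full implementation.

import json

import boto3

dynamodb = boto3.resource("dynamodb")
stepfunctions = boto3.client("stepfunctions")

# Hypothetical table storing task tokens, keyed by the execution name (the correlation id).
tokens_table = dynamodb.Table("ChoreographyTaskTokens")

def handler(event, context):
    # The correlation id (order_id) identifies the workflow instance this event belongs to.
    order_id = event["detail"]["order_id"]

    # Retrieve the callback token stored by the state machine's Task state.
    item = tokens_table.get_item(Key={"execution_name": order_id})["Item"]

    # Resume the paused execution, passing the event payload as the task output.
    stepfunctions.send_task_success(
        taskToken=item["task_token"],
        output=json.dumps(event["detail"]),
    )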

Increased visibility with Step Functions console

Modeling service choreographies as Step Functions Standard Workflows increases visibility with out-of-the-box features.

1. You can centrally track events produced by distributed components. Step Functions records full execution history for 90 days after the execution completes. You’ll be able to capture detailed information about the input and output of each state, including event payloads. Additionally, state machines integrate with Amazon CloudWatch to publish execution logs and metrics.

2. You can monitor choreographies visually. The Step Functions console displays a list of executions with information such as execution id, status, and start date (see Figure 4).

Figure 4. Step Functions workflow dashboard

Figure 4. Step Functions workflow dashboard

After you’ve selected an execution, a graph inspector is displayed (see Figure 5). It shows states, transitions, and marks individual states with colors. This identifies at a glance, successful tasks, failures, and tasks that are still in progress.

Figure 5. Step Functions graph inspector

Figure 5. Step Functions graph inspector

3. You can implement event-driven automation. Step Functions enables you to capture execution status changes by emitting events directly to EventBridge (see Figure 6). Additionally, AWS gives you the ability to emit events by setting alarms on top of the metrics that Step Functions publishes to CloudWatch. You can respond to events by initiating corrective actions, sending notifications, or integrating with third-party solutions, such as issue tracking systems.

Figure 6. Automation with Step Functions, EventBridge, and CloudWatch alarms

Figure 6. Automation with Step Functions, EventBridge, and CloudWatch alarms

Enabling access to AWS Step Functions console

Stakeholders need secure access to the Step Functions console. This requires mechanisms to authenticate users and authorize read-only access to specific Step Functions workflows.

AWS Single Sign-On authenticates users by directly managing identities or through federation. SSO supports federation with Active Directory and SAML 2.0 compliant external identity providers (IdP). Users gain access to Step Functions state machines by assigning a permission set, which is a collection of AWS Identity and Access Management (IAM) policies. Additionally, with permission sets, you can configure a relay state, which is a URL to redirect the user after successful authentication. You can authenticate the user through the selected identity provider and immediately show the AWS Step Functions console with the workflow state machine already displayed. Figure 7 shows this process.

Figure 7. Access to Step Functions state machine with AWS SSO

Figure 7. Access to Step Functions state machine with AWS SSO

  1. The user logs in through the selected identity provider.
  2. The SSO user portal uses the SSO endpoint to send the response from the previous step. SSO uses AWS Security Token Service (STS) to get temporary security credentials on behalf of the user. It then creates a console sign-in URL using those credentials and the relay state. Finally, it sends the URL back as a redirect.
  3. The browser redirects the user to the Step Functions console.

When the identity provider does not support SAML 2.0, SSO is not a viable solution. In this case, you can create a URL with a sign-in token for users to securely access the AWS Management Console. This approach uses STS AssumeRole to get temporary security credentials. Then, it uses credentials to obtain a sign-in token from the AWS federation endpoint. Finally, it constructs a URL for the AWS Management Console, which includes the token. It then distributes this to users to grant access. This is similar to the SSO process. However, it requires custom development.
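For reference, the federation flow can be scripted as follows. This is a sketch under the assumption of a pre-existing read-only IAM role and a known state machine console URL; the ARNs and issuer are placeholders.

import json
import urllib.parse

import boto3
import requests

FEDERATION_ENDPOINT = "https://signin.aws.amazon.com/federation"
ROLE_ARN = "arn:aws:iam::123456789012:role/StepFunctionsReadOnly"  # placeholder
DESTINATION = ("https://console.aws.amazon.com/states/home?region=us-east-1#/statemachines/view/"
               "arn:aws:states:us-east-1:123456789012:stateMachine:OrderChoreography")  # placeholder

# 1. Get temporary security credentials on behalf of the user.
sts = boto3.client("sts")
credentials = sts.assume_role(RoleArn=ROLE_ARN, RoleSessionName="console-viewer")["Credentials"]

# 2. Exchange the credentials for a sign-in token at the AWS federation endpoint.
session = json.dumps({
    "sessionId": credentials["AccessKeyId"],
    "sessionKey": credentials["SecretAccessKey"],
    "sessionToken": credentials["SessionToken"],
})
signin_token = requests.get(FEDERATION_ENDPOINT, params={
    "Action": "getSigninToken",
    "Session": session,
}).json()["SigninToken"]

# 3. Build a console URL that includes the sign-in token and redirects to the state machine.
login_url = (f"{FEDERATION_ENDPOINT}?Action=login"
             f"&Issuer=example.com"
             f"&Destination={urllib.parse.quote_plus(DESTINATION)}"
             f"&SigninToken={signin_token}")
print(login_url)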

Conclusion

This post shows how you can increase visibility on choreographed business processes using AWS Step Functions. The solution provides detailed visual insights directly from the AWS Management Console, without requiring custom UI development. This reduces TTM and cost.

To learn more:

Serverless Scheduling with Amazon EventBridge, AWS Lambda, and Amazon DynamoDB

Post Syndicated from Peter Grman original https://aws.amazon.com/blogs/architecture/serverless-scheduling-with-amazon-eventbridge-aws-lambda-and-amazon-dynamodb/

Many applications perform scheduled tasks. For instance, you might want to automatically publish an article at a given time, change prices for offers which were defined weeks in advance, or notify customers 8 hours before a flight. These might be one-off tasks, or recurring ones.

On Unix-like operating systems, you might have opted for the cron utility. There are also similar alternatives for many web application frameworks, as well as advanced libraries, to schedule future one-off tasks. In a single server environment, this might seem like a simple solution. However, when you run dozens of instances of your application server, it gets harder to rely on those libraries to schedule tasks reliably at least once, without taking up too many resources. If you decide to build a serverless application, you need a new approach altogether.

This post shows how you can build a scalable serverless job scheduler. You can use this method to scale to thousands, or even millions, of distributed jobs per minute. Because you are using serverless technologies, the underlying infrastructure is fully managed by AWS and you only pay for what you use. You can use this solution as an addition to your existing applications, regardless of whether they already use serverless technologies.

Similarly to a cron job running on a single instance, this solution uses an Amazon EventBridge rule, which starts new events periodically on a schedule. For recurring jobs, you would use this capability to start specific actions directly. This will work if you have only a few dozen periodic tasks, whose execution cycle can be defined as a cron expression. However, remember that there are limits to how many rules can be defined per event bus, and rules with a scheduled expression can only be defined on the default event bus. This post describes a method to multiplex a single Amazon EventBridge rule via an AWS Lambda function and Amazon DynamoDB, to scale beyond thousands of jobs. While this example focuses on one-off tasks, you can use the same approach for recurring jobs as well.

Overview of solution

The following diagram shows the architecture of the serverless scheduling solution.

Figure 1 - Architecture diagram showing Serverless Scheduling with Amazon EventBridge, AWS Lambda, and Amazon DynamoDB

Figure 1 – Architecture diagram showing Serverless Scheduling with Amazon EventBridge, AWS Lambda, and Amazon DynamoDB

Amazon EventBridge with scheduled expressions periodically starts an AWS Lambda function. An Amazon DynamoDB table stores the future jobs. The Lambda function queries the table for due jobs and distributes them via Amazon EventBridge to the workers.

The following services are used:

Amazon EventBridge: to initiate the serverless scheduling solution. Amazon EventBridge is a serverless event bus that makes it easier to build event-driven applications at scale. It can also schedule events based on time intervals or cron expressions.

In this solution, you’ll use EventBridge for two things:

  1. to periodically start the AWS Lambda function, which checks for new jobs to be executed, and
  2. to distribute those jobs to the workers.

Here, you can control the granularity of your job executions. The fastest rate possible is once every minute. But if you don’t need a 1-minute precision, you can also opt for once every 5 minutes, or even once every hour. Remember that you cannot control at which second the event is started. It might be at the beginning of the minute, in the middle, or at the end.
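The CloudFormation stack deployed later in this post creates the scheduled rule for you, but as an illustration of the schedule expressions involved, a rule like the following could be created with the AWS SDK for Python. The rule name, function ARN, and rate are placeholders, and the target function also needs a resource-based permission allowing events.amazonaws.com to invoke it.

import boto3

events = boto3.client("events")

# Scheduled rules are only supported on the default event bus.
events.put_rule(
    Name="scheduler-tick",
    ScheduleExpression="rate(1 minute)",  # alternatives: "rate(5 minutes)", "cron(0 * * * ? *)"
)
events.put_targets(
    Rule="scheduler-tick",
    Targets=[{
        "Id": "scheduler-function",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:SchedulerFunction",  # placeholder
    }],
)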

AWS Lambda: to execute the scheduler logic. AWS Lambda is a serverless, event-driven compute service that lets you run code without provisioning or managing servers. The Lambda function queries the jobs from DynamoDB and distributes them via EventBridge. Based on your requirements, you can adjust this to use different mechanisms to notify the workers about the jobs, such as HTTP APIs, gRPC calls, or AWS services like Amazon Simple Notification Service (SNS) or Amazon Simple Queue Service (SQS).

Amazon DynamoDB: to store scheduled jobs. Amazon DynamoDB is a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale. Defining the right data model is important to be able to scale to thousands or even millions of scheduled and processed jobs per minute. The DynamoDB table in this solution has a partition key “pk” and a sort key “sk”. For the Lambda function to be able to query all due jobs quickly and efficiently, jobs must be partitioned. For this, they are grouped together based on their scheduled times in intervals of 5 minutes. This value is the partition key “pk”. How to calculate this value is explained in detail when you test the solution.

The sort key “sk” contains the precise execution time concatenated with a unique identifier, such as a job ID, because the combination of “pk” and “sk” must be unique. To schedule a job in this example, you write it manually into the DynamoDB table. In your production code you can abstract the synchronous DynamoDB access, by implementing it in a shared library, or using Amazon API Gateway. You could also schedule jobs from a Lambda function reacting to events in your system.

Amazon EventBridge: to distribute the jobs. The Lambda function uses Amazon EventBridge as an example to distribute the jobs. The workers which should receive the jobs, must configure the corresponding rules upfront. For testing purposes, this solution comes with a rule which logs all events from the Lambda function into Amazon CloudWatch Logs.

Walkthrough

In this section, you will deploy the solution and test it.

Deploying the solution

To deploy it in your account:

1.  Select Launch Stack.

Launch Stack

2. Select the Region where you want to launch your serverless scheduler.

3. Define a name for your stack. Leave the parameters with the default values for now and select Next.

parameters table

4. At the bottom of the page, acknowledge the required Capabilities and select Create stack.

5.  Wait until the status of the stack is CREATE_COMPLETE, this can take a minute or two.

Testing the solution

In this section, you test the serverless scheduler. First, you'll schedule a job for some time in the near future. Afterwards, you will check that the job has been logged in CloudWatch Logs at the time it was scheduled.

1. In the AWS Management Console, navigate to the DynamoDB service and select the Items sub-menu on the left side, between Tables and PartiQL editor.

2. Select the JobsTable which you created via the CloudFormation Stack; it should be empty for now:

Jobstable

3. Select Create item. Make sure you switch to the JSON editor at the top, and disable View DynamoDB JSON. Now copy this item into the editor:

{
  "pk": "j#2015-03-20T09:45",
  "sk": "2015-03-20T09:46:47.123Z#564ade05-efda-4a2e-a7db-933ad3c89a83",
  "detail": {
    "action": "send-reminder",
    "userId": "16f3a019-e3a5-47ed-8c46-f668347503d1",
    "taskId": "6d2f710d-99d8-49d8-9f52-92a56d0c6b81",
    "params": {
      "can_skip": false,
      "reminder_volume": 0.5
    }
  },
  "detail_type": "job-reminder"
}

Create table DynamoDB table

This is a sample job definition. You will need to adjust it so that it is due a few minutes from now. For this, adjust the first two attributes, the partition key “pk” and the sort key “sk”. Start with “sk”: this is the UTC timestamp for the due date of the job in ISO 8601 format (YYYY-MM-DDTHH:MM:SS), followed by a separator (“#”) and a unique identifier, to make sure that multiple jobs can have the same due timestamp.

Afterwards adjust “pk”. The “pk” looks like the ISO 8601 timestamp in the “sk” reduced to date and time in hours and minutes. The minutes for the partition key must be an integer multiple of 5. This value represents the grouping of the jobs, so they can be queried quickly and efficiently by the Lambda function. For instance, for me 2021-11-26T13:31:55.000Z is in the future and the corresponding partition would be 2021-11-26T13:30.

Note: your local time zone might not be UTC. You can get the current UTC time on timeanddate.com.

You can find in the following table for every “sk” minute the corresponding “pk” minute:

SK and PK table

The corresponding python code would be:

f'{(sk_minutes - sk_minutes % 5):02d}'

Create table DynamoDB table
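If you schedule jobs programmatically rather than in the console, a small helper can build both keys for you. The function below is a hypothetical sketch that follows the same format as the sample item: a "j#" prefix and 5-minute buckets for "pk", and an ISO 8601 timestamp plus a unique suffix for "sk".

import uuid
from datetime import datetime, timedelta, timezone

def make_job_keys(due_time, job_id=None):
    """Build the pk/sk pair for a job due at the given (timezone-aware) datetime."""
    due_time = due_time.astimezone(timezone.utc)
    bucket_minute = due_time.minute - due_time.minute % 5  # round down to a multiple of 5
    pk = f"j#{due_time.strftime('%Y-%m-%dT%H')}:{bucket_minute:02d}"
    sk = f"{due_time.strftime('%Y-%m-%dT%H:%M:%S.%f')[:-3]}Z#{job_id or uuid.uuid4()}"
    return pk, sk

# Example: keys for a job due three minutes from now.
print(make_job_keys(datetime.now(timezone.utc) + timedelta(minutes=3)))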

4. Now that you defined your event in the near future, you can optionally adjust the content of the “detail” and “detail_type” attributes. These are forwarded to EventBridge as “detail” and “detail-type” and should be used by your workers to understand which task they are supposed to perform. You can find more details on EventBridge event structure in our documentation. After you configured the job correctly, select Create item.

5. It is time to navigate to CloudWatch Log groups and wait for the item to be due and to show up in the debug logs.

CloudWatch log groups

For now, the log streams should be empty:

Log streams screenshot

After the item was due, you should see a new log stream with the item “detail” and “detail_type” attributes logged.

If you don’t see a new log stream with the item, check back in your DynamoDB table, if the “sk” is in the UTC time zone and the minutes of the “pk” are a multiple of 5. You can consult the table at the end of step 3, to check for the correct “pk” minutes based on your “sk” minutes.

Log events screenshot

You might notice that the timestamp of the message is within a minute after the job was scheduled. In my example, I scheduled the job for 2021-11-26T13:31:55.000Z and it was put into EventBridge at 2021-11-26T13:32:33Z. The delay comes from the Lambda function only starting once per minute. As I mentioned in the beginning, the function also isn’t started at second 00 but at a random second within that minute.

Exploring the Lambda function

Now, let’s have a look at the core logic. For this, navigate to AWS Lambda in the AWS Management console and open the SchedulerFunction.

AWS Lambda screenshot

In the function configuration, you can see that it is triggered by EventBridge via a scheduled expression at the rate, which was defined in the CloudFormation Stack.

Function Configuration in Cloudformation Stack

When you open the Code tab, you can see that it is less than 100 lines of python code. The main part is the lambda_handler function:

def lambda_handler(event, context):
    event_time_in_utc = event['time']
    previous_partition, current_partition = get_partitions(event_time_in_utc)

    previous_jobs = query_jobs(previous_partition, event_time_in_utc)
    current_jobs = query_jobs(current_partition, event_time_in_utc)
    all_jobs = previous_jobs + current_jobs

    print('dispatching {} jobs'.format(len(all_jobs)))

    put_all_jobs_into_event_bridge(all_jobs)
    delete_all_jobs(all_jobs)

    print('dispatched and deleted {} jobs'.format(len(all_jobs)))

The function starts by calculating the current and previous partitions. This is done to ensure that no jobs stay unprocessed in the old partition, when a new one starts. Afterwards, jobs from these partitions are queried up to the current time, so no future jobs will be fetched from the current partition. Lastly, all jobs are put into EventBridge and deleted from the table.

Instead of pushing the jobs into EventBridge, they could be started via HTTP(S), gRPC, or pushed into other AWS services, like Amazon Simple Notification Service (SNS) or Amazon Simple Queue Service (SQS). Also remember that the communication with other AWS services is synchronous and does not use batching options when putting jobs into EventBridge or deleting them from the DynamoDB table. This keeps the function simpler and easier to understand. When you plan to distribute thousands of jobs per minute, you'd want to adjust this to improve the throughput of the Lambda function, as in the sketch below.
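As a sketch of what batching could look like (the event source name and table name below are assumptions), PutEvents accepts up to 10 entries per request and the DynamoDB batch writer groups the deletes for you:

import json

import boto3

events = boto3.client("events")
jobs_table = boto3.resource("dynamodb").Table("JobsTable")  # table name is an assumption

def put_all_jobs_into_event_bridge(all_jobs):
    # Send the jobs in chunks of 10, the PutEvents maximum per request.
    for i in range(0, len(all_jobs), 10):
        events.put_events(Entries=[
            {
                "Source": "serverless-scheduler",  # placeholder event source
                "DetailType": job["detail_type"],
                "Detail": json.dumps(job["detail"]),
            }
            for job in all_jobs[i:i + 10]
        ])

def delete_all_jobs(all_jobs):
    # batch_writer groups deletes into BatchWriteItem calls and retries unprocessed items.
    with jobs_table.batch_writer() as batch:
        for job in all_jobs:
            batch.delete_item(Key={"pk": job["pk"], "sk": job["sk"]})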

Cleaning up

To avoid incurring future charges, delete the CloudFormation Stack and all resources you created.

Conclusion

In this post, you learned how to build a serverless scheduling solution. Using only serverless technologies, which scale automatically, don't require maintenance, and offer a pay-as-you-go pricing model, this scheduler solution can be implemented for use cases with varying throughput requirements for their scheduled jobs. These could range from publishing articles at a scheduled time to notifying hundreds of passengers per minute about their upcoming flight.

You can adjust the Lambda function to distribute the jobs with a technology more fitting to your application, as well as to handle recurring tasks. The 5-minute grouping interval for the partition key can also be adjusted based on your throughput requirements. It's important to note that for this solution to work, the interval by which the jobs are grouped must be longer than the rate at which the Lambda function is started.

Give it a try and let us know your thoughts in the comments!

Modernize your Penetration Testing Architecture on AWS Fargate

Post Syndicated from Conor Walsh original https://aws.amazon.com/blogs/architecture/modernize-your-penetration-testing-architecture-on-aws-fargate/

Organizations in all industries are innovating their application stack through modernization. Developers have found that modular architecture patterns, serverless operational models, and agile development processes provide great benefits. They offer faster innovation, reduced risk, and reduction in total cost of ownership.

Security organizations must evolve and innovate as well. But security practitioners often find themselves stuck between using powerful yet inflexible open-source tools with little support, and monolithic software with expensive and restrictive licenses.

This post describes how you can use modern cloud technologies to build a scalable penetration testing platform, with no infrastructure to manage.

The penetration testing monolith

AWS operates under the shared responsibility model, where AWS is responsible for the security of the cloud, and the customer is responsible for securing workloads in the cloud. This includes validating the security of your internal and external attack surface. Following the AWS penetration testing policy, customers can run tests against their AWS accounts, except for denial of service (DoS).

A legacy model commonly involves a central server for running a scanning application among the team. The server must be powerful enough for peak load and likely runs 24/7. Common licensing for scanner software is capped on the number of targets you can scan. This model does not scale, and incurs cost when no assessments are being performed.

Penetration testers must constantly reinvent their toolkit. Many one-off tools or scripts are built during engagements when encountering a unique problem. These tools and their environments are often customized, making standardization between machines and software difficult. Building, maintaining, and testing UI/UX and platform compatibility can be expensive and difficult to scale. This often leads to these tools being discarded and the value lost when the analyst moves on to the next engagement. Later, other analysts may run into the same scenario and need to rebuild the tool all over again, resulting in duplicated effort.

Network security scanning using modern cloud infrastructure

By using modern cloud container technologies, we can redesign this monolithic architecture to one that scales to meet increased demand, yet incurs no cost when idle. Containerization provides flexibility and secure isolation.

Figure 1. Overview of the serverless security scanning architecture

Figure 1. Overview of the serverless security scanning architecture

Scanning task flow

This workflow is based on the architecture shown in Figure 1:

  1. User authenticates to Amazon Cognito with their organization’s SSO.
  2. User makes authorized request to Amazon API Gateway.
  3. Request is forwarded to an AWS Lambda function that pulls configuration from Amazon Simple Storage Service (S3).
  4. Lambda function validates parameters, incorporates them into the task definition, and calls Amazon Elastic Container Service (ECS).
  5. ECS orchestrates worker nodes using AWS Fargate compute engine and initiates task.
  6. ECS asynchronously returns the task configuration to Lambda, which sanitizes sensitive data and sends response through API Gateway.
  7. The ECS task launches one or more containers, which run the tool.
  8. Scan results are stored in the ephemeral storage provided by Fargate.
  9. Final container in the ECS task copies the scan report to S3.

Now we'll describe the different components of the architecture shown in Figure 1. Start by packaging your favorite tool into a container, and publish it to Amazon Elastic Container Registry (ECR). ECR provides your containers additional layers of security assurance with built-in dependency vulnerability scans.

AWS Fargate is a serverless compute engine powering Amazon ECS to orchestrate container tasks. Fargate scales up capacity to support the current load, and scales down once complete to reduce cost. By default, Fargate offers 20 GB of ephemeral storage to each ECS task for shared storage between containers as volume mounts.

Task input and output can be processed with custom code running on the serverless computing service AWS Lambda. For multi-stage Lambda functionality, you can use AWS Step Functions.

Amazon API Gateway can forward incoming requests to these Lambda functions. API Gateway provides serverless REST endpoints to handle requests processed by Lambda functions. Amazon Cognito authorizes users for API Gateway, either directly or federated through your organization’s single sign-on (SSO) provider.

The final step of the ECS task can upload any resulting files to an Amazon S3 bucket. Amazon S3 offers industry-leading scalability, data availability, security, and performance with integration into other AWS services. This means that the results of your data can be consumed by other AWS services for processing, analytics, machine learning, and security controls.
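
For example, the final container in the task could copy its report with a few lines of Python; the local path and bucket name below are hypothetical.

import boto3

s3 = boto3.client("s3")
# Upload the scan report produced in the task's shared ephemeral storage
s3.upload_file("/scan-output/report.xml", "pentest-results-bucket", "reports/report.xml")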

Amazon CloudWatch Events is used to build an event-based workflow. The S3 upload generates a CloudWatch event, which can then invoke a Lambda function to process the file, or launch another ECS task.

This solution is completely serverless. It will scale on demand, yet cost nothing when not in use. This architecture can support anything that can be run in a container, regardless of tool function.

Network Mapper workflow

Figure 2. Network Mapper scanner task workflow

The example in Figure 2 was based on using a tool called Network Mapper, or Nmap. However, a variety of tools can be used, including nslookup/dig, Selenium, Nikto, recon-ng, SpiderFoot, Greenbone Vulnerability Manager (GVM), or OWASP ZAP. You can use anything that runs in a container! With some additional work, findings could be fed into AWS security services like AWS Security Hub, or Amazon GuardDuty. You can also use AWS Partner Network services like Splunk and Datadog, or open source frameworks like Metasploit and DefectDojo. The flexibility to add additional applications that integrate with AWS services means that this architecture can be easily deployed into a variety of AWS environments.

Remember, installation and use of software not included in an AWS-supported Amazon Machine Image (AMI) or container falls on the customer side of the shared responsibility model. Make sure to do your due diligence in securing any software you decide to use in this or any workload. To reduce blast radius, run this in an isolated account and only provide least-privilege access to targets.

Conclusion

In this blog post, we showed how to run a penetration testing workload on a modern platform, powered with serverless, and container-based services. Amazon API Gateway is the entry point for your architecture, which calls on AWS Lambda. Lambda builds a task definition to launch a fully orchestrated, on-demand container workload using AWS Fargate and Amazon ECS. The final stage of the ECS task copies the results of the scan to Amazon S3. This can be accessed by security analysts or other downstream containers, tools, or services.

We encourage you to go build this architecture in your own environment, and begin conducting your own tests! Construct your Nmap container and store it in Amazon ECR or use securecodebox/nmap, a Docker container built for the Open Web Application Security Project® (OWASP) SecureCodeBox project. Make sure to spend time securing this workload, especially when using open-source software you’re not familiar with. Now go get scanning!

Using an Amazon MQ network of broker topologies for distributed microservices

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/using-an-amazon-mq-network-of-broker-topologies-for-distributed-microservices/

This post is written by Suranjan Choudhury, Senior Manager SA, and Anil Sharma, Apps Modernization SA.

This blog looks at ActiveMQ topologies that customers can evaluate when planning hybrid deployment architectures spanning AWS Regions and customer data centers, using a network of brokers. A network of brokers can have brokers on-premises and Amazon MQ brokers on AWS.

Distributing broker nodes across AWS and on-premises allows for messaging infrastructure to scale, provide higher performance, and improve reliability. This post also explains a topology spanning two Regions and demonstrates how to deploy on AWS.

A network of brokers is composed of multiple simultaneously active single-instance brokers or active/standby brokers. A network of brokers provides a large-scale messaging fabric in which multiple brokers are networked together. It allows a system to survive the failure of a broker. It also allows distributed messaging. Applications on remote, disparate networks can exchange messages with each other. A network of brokers helps to scale the overall broker throughput in your network, providing increased availability and performance.

Types of ActiveMQ topologies

A network of brokers can be configured in a variety of topologies – for example, mesh, concentrator, and hub and spoke. The topology depends on requirements such as security and network policies, reliability, scaling and throughput, and management and operational overheads. You can configure individual brokers to operate as a single broker or in an active/standby configuration.

Mesh topology

A mesh topology provides multiple brokers that are all connected to each other. This example connects three single-instance brokers, but you can configure more brokers as a mesh. The mesh topology needs subnet security group rules to be opened for allowing brokers in internal subnets to communicate with brokers in external subnets.

For scaling, it’s simple to add new brokers to increment overall broker capacity. The mesh topology by design offers higher reliability with no single point of failure. Operationally, adding or deleting nodes requires broker reconfiguration and a restart of the broker service.

Concentrator topology

In a concentrator topology, you deploy brokers in two (or more) layers to funnel incoming connections into a smaller collection of services. This topology allows segmenting brokers into internal and external subnets without any additional security group changes. If additional capacity is needed, you can add new brokers without needing to update other brokers’ configurations. The concentrator topology provides higher reliability with alternate paths for each broker. This enables hybrid deployments with lower operational overheads.

Hub and spoke topology

A hub and spoke topology preserves messages if there is disruption to any broker on a spoke. Messages are forwarded through the hub, and only the central Broker1 is critical to the network’s operation. Subnet security group rules must be opened to allow brokers in internal subnets to communicate with brokers in external subnets.

Adding brokers for scalability is constrained by the hub’s capacity. The hub is a single point of failure and should be configured as active/standby to increase reliability. In this topology, depending on the location of the hub, there may be increased bandwidth needs and latency challenges.

Using a concentrator topology for large-scale hybrid deployments

When planning deployments spanning AWS and customer data centers, the starting point is the concentrator topology. The brokers are deployed in tiers such that brokers in each tier connect to fewer brokers at the next tier. This allows you to funnel connections and messages from a large number of producers to a smaller number of brokers. This concentrates messages at fewer subscribers:

Hybrid deployment topology

Deploying ActiveMQ brokers across Regions and on-premises

When placing brokers on-premises and in the AWS Cloud in a hybrid network of broker topologies, security and network routing are key. The following diagram shows a typical hybrid topology:

Typical hybrid topology

The ActiveMQ brokers on premises are placed behind a firewall. They can communicate with Amazon MQ brokers through an IPsec tunnel terminating on the on-premises firewall. On the AWS side, this tunnel terminates on an AWS Transit Gateway (TGW). The TGW routes all network traffic to a firewall in AWS in a service VPC.

The firewall inspects the network traffic and sends the inspected traffic back to the transit gateway. The TGW, based on the configured routing, sends the traffic to the Amazon MQ broker in the application VPC. This broker concentrates messages from Amazon MQ brokers hosted on AWS. The on-premises brokers and the AWS brokers form a hybrid network of brokers that spans AWS and the customer data center. This allows applications and services to communicate securely. This architecture exposes only the concentrating broker to receive and send messages to the broker on premises. The applications are protected from outside, non-validated network traffic.

This blog shows how to create a cross-Region network of brokers. For simplicity, this topology omits the multiple brokers in the internal subnet. However, in a production environment, you have multiple brokers in internal subnets catering to multiple producers and consumers. This topology spans an AWS Region and an on-premises customer data center, represented in this example by a second AWS Region:

Cross-Region topology

Best practices for configuring network of brokers

Client-side failover

In a network of brokers, failover transport configures a reconnect mechanism on top of the transport protocols. The configuration allows you to specify multiple URIs to connect to. An additional configuration using the randomize transport option allows for random selection of the URI when re-establishing a connection.

The example Lambda functions provided in this blog use the following configuration:

// Failover URI; randomize selects one of the listed brokers at random on reconnect
failoverURI = "failover:(" + uri1 + "," + uri2 + ")?randomize=True";

Broker side failover

Dynamic failover allows a broker to receive a list of all other brokers in the network. It can use the configuration to update producer and consumer clients with this list. The clients can update to rebalance connections to these brokers.

In the broker configuration in this blog, the following configuration is set up:

<transportConnectors>
  <transportConnector name="openwire"
                      updateClusterClients="true"
                      updateClusterClientsOnRemove="false"
                      rebalanceClusterClients="true"/>
</transportConnectors>

Network connector properties – TTL and duplex

TTL values allow messages to traverse the network. There are two TTL values – messageTTL and consumerTTL. Alternatively, you can set the network TTL, which sets both the message and consumer TTL.

The duplex option allows for creating a bidirectional path between two brokers for sending and receiving messages. This blog uses the following configuration:

<networkConnector name="connector_1_to_3" networkTTL="5" uri="static:(ssl://xxxxxxxxx.mq.us-east-2.amazonaws.com:61617)" userName="MQUserName"/>

Connection pooling for producers

In the example Lambda function, a pooled connection factory object is created to optimize connections to broker:

// Create an SSL connection factory that uses the failover URI

final ActiveMQSslConnectionFactory connFacty = new ActiveMQSslConnectionFactory(failoverURI);
connFacty.setConnectResponseTimeout(10000);
return connFacty;

// Wrap the connection factory in a pooled connection factory so producer connections are reused

final PooledConnectionFactory pooledConnFacty = new PooledConnectionFactory();
pooledConnFacty.setMaxConnections(10);
pooledConnFacty.setConnectionFactory(connFacty);
return pooledConnFacty;

Deploying the example solution

  1. Create an IAM role for Lambda by following the steps at https://github.com/aws-samples/aws-mq-network-of-brokers#setup-steps.
  2. Create the network of brokers in the first Region. Navigate to the CloudFormation console and choose Create stack:
    Stack name
  3. Provide the parameters for the network configuration section:
    Network configuration
  4. In the Amazon MQ configuration section, configure the following parameters. Ensure that these two parameter values are the same in both Regions.
    MQ configuration
  5. Configure the following in the Lambda configuration section. Deploy mqproducer and mqconsumer in two separate Regions:
    Lambda Configuration
  6. Create the network of brokers in the second Region by repeating step 2. Ensure that the VPC CIDR in region2 is different from the one in region1, and that the user name and password are the same as in the first Region.
  7. Complete VPC peering and updating route tables:
    1. Follow the steps here to complete VPC peering between the two VPCs.
    2. Update the route tables in both VPCs.
    3. Enable DNS resolution for the peering connection.
      Route table
  8. Configure the network of brokers and create network connectors:
    1. In region1, choose Broker3. In the Connections section, copy the endpoint for the openwire protocol.
    2. In region2 on broker3, set up the network of brokers using the networkConnector configuration element.
      Brokers
    3. Edit the configuration revision and add a new NetworkConnector within the NetworkConnectors section. Replace the uri with the URI for broker3 in region1.
      <networkConnector name="broker3inRegion2_to_broker3inRegion1" duplex="true" networkTTL="5" userName="MQUserName" uri="static:(ssl://b-123ab4c5-6d7e-8f9g-ab85-fc222b8ac102-1.mq.ap-south-1.amazonaws.com:61617)" />
  9. Send a test message using the mqProducer Lambda function in region1. Invoke the producer Lambda function:
    aws lambda invoke --function-name mqProducer out --log-type Tail --query 'LogResult' --output text | base64 -d

    Terminal output

  10. Receive the test message. In region2, invoke the consumer Lambda function:
    aws lambda invoke --function-name mqConsumer out --log-type Tail --query 'LogResult' --output text | base64 -d

    Terminal response

The message receipt confirms that the message has crossed the network of brokers from region1 to region2.

Cleaning up

To avoid incurring ongoing charges, delete all the resources by following the steps at https://github.com/aws-samples/aws-mq-network-of-brokers#clean-up.

Conclusion

This blog explains the choices when designing a cross-Region or a hybrid network of brokers architecture that spans AWS and your data centers. The example starts with a concentrator topology and enhances that with a cross-Region design to help address network routing and network security requirements.

The blog provides a template that you can modify to suit specific network and distributed application scenarios. It also covers best practices when architecting and designing failover strategies for a network of brokers or when developing producers and consumers client applications.

The Lambda functions used as producer and consumer applications demonstrate best practices in designing and developing ActiveMQ clients. This includes storing and retrieving parameters, such as passwords, from AWS Systems Manager.

For more serverless learning resources, visit Serverless Land.

Introducing Amazon Simple Queue Service dead-letter queue redrive to source queues

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/introducing-amazon-simple-queue-service-dead-letter-queue-redrive-to-source-queues/

This blog post is written by Mark Richman, a Senior Solutions Architect for SMB.

Today AWS is launching a new capability to enhance the dead-letter queue (DLQ) management experience for Amazon Simple Queue Service (SQS). DLQ redrive to source queues allows SQS to manage the lifecycle of unconsumed messages stored in DLQs.

SQS is a fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications. Using Amazon SQS, you can send, store, and receive messages between software components at any volume without losing messages or requiring other services to be available.

To use SQS, a producer sends messages to an SQS queue, and a consumer pulls the messages from the queue. Sometimes, messages can’t be processed due to a number of possible issues. These can include logic errors in consumers that cause message processing to fail, network connectivity issues, or downstream service failures. This can result in unconsumed messages remaining in the queue.

Understanding SQS dead-letter queues (DLQs)

SQS allows you to manage the life cycle of the unconsumed messages using dead-letter queues (DLQs).

A DLQ is a separate SQS queue to which one or many source queues can send messages that can’t be processed or consumed. DLQs allow you to debug your application by letting you isolate messages that can’t be processed correctly to determine why their processing didn’t succeed. Use a DLQ to handle message consumption failures gracefully.

When you create a source queue, you can specify a DLQ and the condition under which SQS moves messages from the source queue to the DLQ. This is called the redrive policy. The redrive policy condition specifies the maxReceiveCount. When a producer places messages on an SQS queue, the ReceiveCount tracks the number of times a consumer tries to process the message. When the ReceiveCount for a message exceeds the maxReceiveCount for a queue, SQS moves the message to the DLQ. The original message ID is retained.
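
As an illustration, the following sketch creates a DLQ and a source queue whose redrive policy points at it, using the AWS SDK for Python (Boto3) and the same queue names as the walkthrough below. The maxReceiveCount value is an arbitrary example.

import json
import boto3

sqs = boto3.client("sqs")

# Create the dead-letter queue first; it is a normal SQS queue
dlq_url = sqs.create_queue(QueueName="MyDLQ")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Create the source queue with a redrive policy that targets the DLQ
sqs.create_queue(
    QueueName="MySourceQueue",
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        )
    },
)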

For example, a source queue has a redrive policy with maxReceiveCount set to 5. If the consumer of the source queue receives a message six times without successfully consuming it, SQS moves the message to the dead-letter queue.

You can configure an alarm to alert you when any messages are delivered to a DLQ. You can then examine logs for exceptions that might have caused them to be delivered to the DLQ. You can analyze the message contents to diagnose consumer application issues. Once the issue has been resolved and the consumer application recovers, these messages can be redriven from the DLQ back to the source queue to process them successfully.
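
One way to configure such an alarm is on the DLQ’s ApproximateNumberOfMessagesVisible metric. The sketch below is illustrative only; the alarm name, threshold, and SNS topic ARN are hypothetical.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm as soon as any message becomes visible in the DLQ
cloudwatch.put_metric_alarm(
    AlarmName="MyDLQ-has-messages",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "MyDLQ"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:dlq-alerts"],
)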

Previously, this required dedicated operational cycles to review and redrive these messages back to their source queue.

DLQ redrive to source queues

DLQ redrive to source queues enables SQS to manage the second part of the lifecycle of unconsumed messages that are stored in DLQs. Once the consumer application is available to consume the failed messages, you can now redrive the messages from the DLQ back to the source queue. You can optionally review a sample of the available messages in the DLQ. You redrive the messages using the Amazon SQS console. This allows you to more easily recover from application failures.

Using redrive to source queues

To show how to use the new functionality, this example uses an existing standard source SQS queue called MySourceQueue.

SQS does not create DLQs automatically. You must first create an SQS queue and then use it as a DLQ. The DLQ must be in the same Region as the source queue.

Create DLQ

  1. Navigate to the SQS Management Console and create a standard SQS queue for the DLQ called MyDLQ. Use the default configuration. Refer to the SQS documentation for instructions on creating a queue.
  2. Navigate to MySourceQueue and choose Edit.
  3. Navigate to the Dead-letter queue section and choose Enabled.
  4. Select the Amazon Resource Name (ARN) of the MyDLQ queue you created previously.
  5. You can configure the number of times that a message can be received before being sent to a DLQ by setting Maximum receives to a value between 1 and 1,000. For this demo, enter a value of 1 to immediately drive messages to the DLQ.
  6. Choose Save.

Configure source queue with DLQ

The console displays the Details page for the queue. Within the Dead-letter queue tab, you can see the Maximum receives value and DLQ ARN.

DLQ configuration

Send and receive test messages

You can send messages to test the functionality in the SQS console.

  1. Navigate to MySourceQueue and choose Send and receive messages.
  2. Send a number of test messages by entering the message content in Message body and choosing Send message.
    Send and receive messages

  3. Navigate to the Receive messages section where you can see the number of messages available.
  4. Choose Poll for messages. The Maximum message count is set to 10 by default. If you sent more than 10 test messages, poll multiple times to receive all the messages.

Poll for messages

All the received messages are sent to the DLQ because the maxReceiveCount is set to 1. At this stage you would normally review the messages. You would determine why their processing didn’t succeed and resolve the issue.

Redrive messages to source queue

Navigate to the list of all queues and filter if required to view the DLQ. The queue displays the approximate number of messages available in the DLQ. For standard queues, the result is approximate because of the distributed architecture of SQS. In most cases, the count should be close to the actual number of messages in the queue.
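
You can also read this count programmatically. For example, a quick check with Boto3 (the queue URL below is hypothetical):

import boto3

sqs = boto3.client("sqs")

attrs = sqs.get_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/MyDLQ",
    AttributeNames=["ApproximateNumberOfMessages"],
)
# Approximate count of messages currently available in the DLQ
print(attrs["Attributes"]["ApproximateNumberOfMessages"])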

Messages available in DLQ

  1. Select the DLQ and choose Start DLQ redrive.

    DLQ redrive

    SQS allows you to redrive messages either to their source queue(s) or to a custom destination queue.

  2. Choose to Redrive to source queue(s), which is the default.
  3. Redrive has two velocity control settings:
    • System optimized sends messages back to the source queue as fast as possible.
    • Custom max velocity allows SQS to redrive messages with a custom maximum rate of messages per second. This feature is useful for minimizing the impact to normal processing of messages in the source queue.

    You can optionally inspect messages prior to redrive.

  4. To redrive the messages back to the source queue, choose DLQ redrive.

    DLQ redrive

    The Dead-letter queue redrive status panel shows the status of the redrive and percentage processed. You can refresh the display or cancel the redrive.

    Dead-letter queue redrive status

    Once the redrive is complete, which takes a few seconds in this example, the status reads Successfully completed.

    Redrive status completed

  5. Navigate back to the source queue and you can see all the messages are redriven back from the DLQ to the source queue.

    Messages redriven from DLQ to source queue

Conclusion

Dead-letter queue redrive to source queues allows you to effectively manage the life cycle of unconsumed messages stored in dead-letter queues. You can build applications with the confidence that you can easily examine unconsumed messages, recover from errors, and reprocess failed messages.

You can redrive messages from their DLQs to their source queues using the Amazon SQS console.

Dead-letter queue redrive to source queues is available in all commercial regions, and coming soon to GovCloud.

To get started, visit https://aws.amazon.com/sqs/

For more serverless learning resources, visit Serverless Land.

    Introducing Amazon Redshift Serverless – Run Analytics At Any Scale Without Having to Manage Data Warehouse Infrastructure

    Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/introducing-amazon-redshift-serverless-run-analytics-at-any-scale-without-having-to-manage-infrastructure/

    We’re seeing the use of data analytics expanding among new audiences within organizations, for example with users like developers and line of business analysts who don’t have the expertise or the time to manage a traditional data warehouse. Also, some customers have variable workloads with unpredictable spikes, and it can be very difficult for them to constantly manage capacity.

    With Amazon Redshift, you use SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes. Today, I am happy to introduce the public preview of Amazon Redshift Serverless, a new capability that makes it super easy to run analytics in the cloud with high performance at any scale. Just load your data and start querying. There is no need to set up and manage clusters. You pay for the duration in seconds when your data warehouse is in use, for example, while you are querying or loading data. There is no charge when your data warehouse is idle.

    Amazon Redshift Serverless automatically provisions the right compute resources for you to get started. As your demand evolves with more concurrent users and new workloads, your data warehouse scales seamlessly and automatically to adapt to the changes. You can optionally specify the base data warehouse size to have additional control on cost and application-specific SLAs.

    With the new serverless option, you can continue to query data in other AWS data stores, such as Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Aurora and Amazon Relational Database Service (RDS) databases.

    Amazon Redshift Serverless is ideal when it is difficult to predict compute needs such as variable workloads, periodic workloads with idle time, and steady-state workloads with spikes. This approach is also a good fit for ad-hoc analytics needs that need to get started quickly and for test and development environments.

    Let’s see how this works in practice.

    Using Amazon Redshift Serverless
    I go to the Amazon Redshift console and choose the new serverless option. The first time, I set up the serverless endpoint and configure networking and security.

    I confirm the default settings that use all subnets in my default Amazon Virtual Private Cloud (VPC) and its default security group. Data is always encrypted, and I use the default AWS-owned key. Optionally, I can customize all settings. I can associate now or later the AWS Identity and Access Management (IAM) roles to give permissions to access other AWS resources, for example, to be able to load data from an S3 bucket. The configuration of the serverless endpoint will be shared by all my serverless data warehouses in the same AWS account and Region.

    Console screenshot.

    To query data, I use Amazon Redshift Query Editor V2, a new free web-based tool that we made available a few months back. The query editor provides quick access to a few sample datasets to make it easy to learn Amazon Redshift’s SQL capabilities: TPC-H, TPC-DS, and tickit, a dataset containing information on ticket sales for events.

    For a quick test, I use the tickit sample dataset so I don’t need to load any data. I prepare a query to get the list of tickets sold per date, sorted to see the dates with more sales first:

    SELECT caldate, sum(qtysold) as sumsold
    FROM   tickit.sales, tickit.date
    WHERE  sales.dateid = date.dateid 
    GROUP BY caldate
    ORDER BY sumsold DESC;

    By using the web-based query editor, I don’t need to configure a SQL client or set up the network permissions to reach the serverless endpoint. Instead, I just write my SQL query and run it.

    Console screenshot.

    I am a visual person. I enable the Chart option on the right of the result table and select a bar chart.

    Console screenshot.

    Satisfied with the clarity of the chart, I export it as an image file. In this way, I can quickly share it or include it in a report.

    Bar chart

    Amazon Redshift Serverless supports all rich SQL functionality of Amazon Redshift such as semi-structured data support. I can use any JDBC/ODBC-compliant tool or the Amazon Redshift Data API to query my data. To migrate data, I can take a snapshot of an Amazon Redshift provisioned cluster and restore it as serverless. Then, I just need to update my SQL applications to use the new serverless endpoint.
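
    If I want to call the Data API from Python, a sketch like the following works. The workgroup and database names are hypothetical placeholders for my environment, and WorkgroupName is the Data API parameter that targets a serverless endpoint rather than a provisioned cluster.

    import time
    import boto3

    client = boto3.client("redshift-data")

    # Submit the query; WorkgroupName targets a serverless workgroup (hypothetical name)
    stmt = client.execute_statement(
        WorkgroupName="default",
        Database="sample_data_dev",
        Sql="SELECT caldate, sum(qtysold) AS sumsold "
            "FROM tickit.sales, tickit.date "
            "WHERE sales.dateid = date.dateid "
            "GROUP BY caldate ORDER BY sumsold DESC;",
    )

    # Poll until the statement finishes, then fetch the result set
    while client.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
        time.sleep(1)

    for row in client.get_statement_result(Id=stmt["Id"])["Records"]:
        print(row)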

    Availability and Pricing
    Amazon Redshift Serverless is available in public preview in the following AWS Regions: US East (N. Virginia), US West (N. California, Oregon), Europe (Frankfurt, Ireland), Asia Pacific (Tokyo).

    With Amazon Redshift Serverless, you pay separately for the compute and storage you use. Compute capacity is measured in Redshift Processing Units (RPUs), and you pay for the workloads in RPU-hours with per-second billing. For storage, you pay for data stored in Amazon Redshift-managed storage and storage used for snapshots, similar to what you’d pay with a provisioned cluster using RA3 instances.

    To control your costs, you can specify usage limits and define actions that Amazon Redshift automatically takes if those limits are reached. You can specify usage limits in RPU-hours and associated with a daily, weekly, or monthly duration. Setting higher usage limits can improve the overall throughput of the system, especially for workloads that need to handle high concurrency while maintaining consistently high performance.

    Compute resources automatically shut down behind the scenes when there is no activity and resume when you are loading data or there are queries coming in. When accessing your S3 data lake via the new serverless endpoint, you do not pay for Amazon Redshift Spectrum separately. You have a unified serverless experience and pay for data lake queries also in RPU-seconds. For more information, see the Amazon Redshift pricing page.

    The serverless endpoint is configured at the AWS account level. If you have multiple teams or projects and want to manage costs separately, you can use separate AWS accounts. You can share data between your provisioned clusters and serverless endpoint, and between serverless endpoints across accounts.

    To help you get practice, we provide you upfront with $500 in AWS credits to try the Amazon Redshift Serverless public preview. You get the credits when you first create a database with Amazon Redshift Serverless. These credits are used to cover your costs for compute, storage, and snapshot usage of Amazon Redshift Serverless only.

    Start using Amazon Redshift Serverless today to run and scale analytics without having to provision and manage data warehouse clusters.

    Danilo

    Filtering event sources for AWS Lambda functions

    Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/filtering-event-sources-for-aws-lambda-functions/

    This post is written by Heeki Park, Principal Specialist Solutions Architect – Serverless.

    When an AWS Lambda function is configured with an event source, the Lambda service triggers a Lambda function for each message or record. The exact behavior depends on the choice of event source and the configuration of the event source mapping. The event source mapping defines how the Lambda service handles incoming messages or records from the event source.

    Today, AWS announces the ability to filter messages before the invocation of a Lambda function. Filtering is supported for the following event sources: Amazon Kinesis Data Streams, Amazon DynamoDB Streams, and Amazon SQS. This helps reduce requests made to your Lambda functions, may simplify code, and can reduce overall cost.

    Overview

    Consider a logistics company with a fleet of vehicles in the field. Each vehicle is enabled with sensors and 4G/5G connectivity to emit telemetry data into Kinesis Data Streams:

    • In one scenario, they use machine learning models to infer the health of vehicles based on each payload of telemetry data, which is outlined in example 2 on the Lambda pricing page.
    • In another scenario, they want to invoke a function, but only when tire pressure is low on any of the tires.

    If tire pressure is low, the company notifies the maintenance team to check the tires when the vehicle returns. The process checks if the warehouse has enough spare replacements. Optionally, it notifies the purchasing team to buy additional tires.

    The application responds to the stream of incoming messages and runs business logic if tire pressure is below 32 psi. Each vehicle in the field emits telemetry as follows:

    {
        "time": "2021-11-09 13:32:04",
        "fleet_id": "fleet-452",
        "vehicle_id": "a42bb15c-43eb-11ec-81d3-0242ac130003",
        "lat": 47.616226213162406,
        "lon": -122.33989110734133,
        "speed": 43,
        "odometer": 43519,
        "tire_pressure": [41, 40, 31, 41],
        "weather_temp": 76,
        "weather_pressure": 1013,
        "weather_humidity": 66,
        "weather_wind_speed": 8,
        "weather_wind_dir": "ne"
    }
    

    To process all messages from a fleet of vehicles, you configure a filter matching the fleet id in the following example. The Lambda service applies the filter pattern against the full payload that it receives.

    The schema of the payload for Kinesis and DynamoDB Streams is shown under the “kinesis” attribute in the example Kinesis record event. When building filters for Kinesis or DynamoDB Streams, you filter the payload under the “data” attribute. The schema of the payload for SQS is shown in the array of records in the example SQS message event. When working with SQS, you filter the payload under the “body” attribute:

    {
        "data": {
            "fleet_id": ["fleet-452"]
        }
    }
    

    To process all messages associated with a specific vehicle, configure a filter on only that vehicle id. The fleet id is kept in the example to show that it matches on both of those filter criteria:

    {
        "data": {
            "fleet_id": ["fleet-452"],
            "vehicle_id": ["a42bb15c-43eb-11ec-81d3-0242ac130003"]
        }
    }

    To process all messages associated with that fleet but only if tire pressure is below 32 psi, you configure the following rule pattern. This pattern searches the array under tire_pressure to match values less than 32:

    {
        "data": {
            "fleet_id": ["fleet-452"],
            "tire_pressure": [{"numeric": ["<", 32]}]
        }
    }
    

    To create the event source mapping with this filter criteria using the AWS CLI, run the following command.

    aws lambda create-event-source-mapping \
    --function-name fleet-tire-pressure-evaluator \
    --batch-size 100 \
    --starting-position LATEST \
    --event-source-arn arn:aws:kinesis:us-east-1:0123456789012:stream/fleet-telemetry \
    --filter-criteria '{"Filters": [{"Pattern": "{\"tire_pressure\": [{\"numeric\": [\"<\", 32]}]}"}]}'
    

    For the CLI, the value for Pattern in the filter criteria requires the double quotes to be escaped in order to be properly captured.

    Alternatively, to create the event source mapping with this filter criteria with an AWS Serverless Application Model (AWS SAM) template, use the following snippet.

    Events: 
      TirePressureEvent: 
        Type: Kinesis    
        Properties: 
          BatchSize: 100
          StartingPosition: LATEST
          Stream: "arn:aws:kinesis:us-east-1:0123456789012:stream/fleet-telemetry"
          Filters: 
            - Pattern: "{\"data\": {\"tire_pressure\": [{\"numeric\": [\"<\", 32]}]}}"
    

    For the AWS SAM template, the value for Pattern in the filter criteria does not require escaped double quotes.
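
    If you manage event source mappings programmatically, the equivalent call with the AWS SDK for Python (Boto3) is shown below as a sketch; the function name and stream ARN reuse the hypothetical values from the CLI example.

    import json
    import boto3

    lambda_client = boto3.client("lambda")

    # Create the Kinesis event source mapping with the tire pressure filter
    lambda_client.create_event_source_mapping(
        FunctionName="fleet-tire-pressure-evaluator",
        EventSourceArn="arn:aws:kinesis:us-east-1:0123456789012:stream/fleet-telemetry",
        BatchSize=100,
        StartingPosition="LATEST",
        FilterCriteria={
            "Filters": [
                {"Pattern": json.dumps({"data": {"tire_pressure": [{"numeric": ["<", 32]}]}})}
            ]
        },
    )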

    For more information on how to create filters, refer to examples of event pattern rules in EventBridge, as Lambda filters messages in the same way.

    Reducing costs with event filtering

    By configuring the event source with this filter criteria, you can reduce the number of messages that are used to invoke your Lambda function.

    Using the example from the Lambda pricing page, with a fleet of 10,000 vehicles in the field, each is emitting telemetry once an hour. Each month, the vehicles emit 10,000 * 24 * 31 = 7,440,000 messages, which trigger the same number of Lambda invocations. You configure the function with 256 MB of memory and the average duration of the function is 100 ms. In this example, vehicles emit low-pressure telemetry once every 31 days.

    Without filtering, the cost of the application is:

    • Monthly request charges → 7.44M * $0.20/million = $1.49
    • Monthly compute duration (seconds) → 7.44M * 0.1 seconds = 0.744M seconds
    • Monthly compute (GB-s) → 256MB/1024MB * 0.744M seconds = 0.186M GB-s
    • Monthly compute charges → 0.186M GB-s * $0.0000166667 = $3.10
    • Monthly total charges = $1.49 + $3.10 = $4.59

    With filtering, the cost of the application is:

    • Monthly request charges → (7.44M / 31)* $0.20/million = $0.05
    • Monthly compute duration (seconds) → (7.44M / 31) * 0.1 seconds = 0.024M seconds
    • Monthly compute (GB-s) → 256MB/1024MB * 0.024M seconds = 0.006M GB-s
    • Monthly compute charges → 0.006M GB-s * $0.0000166667 = $0.10
    • Monthly total charges = $0.05 + $0.10 = $0.15

    By using filtering, the cost is reduced from $4.59 to $0.15, a 96.7% cost reduction.

    Designing and implementing event filtering

    In addition to reducing cost, the functions now operate more efficiently. This is because they no longer iterate through arrays of messages to filter out messages. The Lambda service filters the messages that it receives from the source before batching and sending them as the payload for the function invocation. This is the order of operations:

    Event flow with filtering

    Event flow with filtering

    As you design filter criteria, keep in mind a few additional properties. The event source mapping allows up to five patterns. Each pattern can be up to 2048 characters. As the Lambda service receives messages and filters them with the pattern, it fills the batch per the normal event source behavior.

    For example, if the maximum batch size is set to 100 records and the maximum batching window is set to 10 seconds, the Lambda service filters and accumulates records in a batch until one of those two conditions is satisfied. In the case where 100 records that meet the filter criteria come during the batching window, the Lambda service triggers a function with those filtered 100 records in the payload.

    If fewer than 100 records meeting the filter criteria arrive during the batch window, Lambda triggers a function with the filtered records that came during the batch window at the end of the 10-second batch window. Be sure to configure the batch window to match your latency requirements.

    The Lambda service ignores filtered messages and treats them as successfully processed. For Kinesis Data Streams and DynamoDB Streams, the iterator advances past the records that were sent via the event source mapping.

    For SQS, the messages are deleted from the queue without any additional processing. With SQS, be sure that the messages that are filtered out are not required. For example, you have an Amazon SNS topic with multiple SQS queues subscribed. The Lambda functions consuming each of those SQS queues process different subsets of messages. You could use filters on SNS, but that would require the message publisher to add attributes to the messages that it sends. You could instead use filters on the event source mapping for SQS. Now the publisher does not need to make any changes, as the filter is applied to the message payload directly.

    Conclusion

    Lambda now supports the ability to filter messages based on criteria that you define. This can reduce the number of messages that your functions process, may reduce cost, and can simplify code.

    You can now build applications for specific use cases that use only a subset of the messages that flow through your event-driven architectures. This can help optimize the compute efficiency of your functions.

    Learn more about this capability in our AWS Lambda Developer Guide.

    Visualizing AWS Step Functions workflows from the Amazon Athena console

    Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/visualizing-aws-step-functions-workflows-from-the-amazon-athena-console/

    This post is written by Dhiraj Mahapatro, Senior Specialist SA, Serverless.

    In October 2021, AWS announced visualizing AWS Step Functions from the AWS Batch console. Now you can also visualize Step Functions from the Amazon Athena console.

    Amazon Athena is an interactive query service that makes it easier to analyze Amazon S3 data using standard SQL. Athena is a serverless service and can interact directly with data stored in S3. Athena can process unstructured, semistructured, and structured datasets.

    AWS Step Functions is a low-code visual workflow service used to orchestrate AWS services, automate business processes, and build serverless applications. Step Functions workflows manage failures, retries, parallelization, service integrations, and observability so builders can focus on business logic. Athena is one of the service integrations that are available for Step Functions.

    This blog walks through Step Functions integration in Amazon Athena console. It shows how you can visualize and operate Athena queries at scale using Step Functions.

    Introducing workflow orchestration in Amazon Athena console

    AWS customers store large amounts of historical data on S3 and query the data using Athena to get results quickly. They also use Athena to process unstructured data or analyze structured data as part of a data processing pipeline.

    Data processing involves discrete steps for ingesting, processing, storing the transformed data, and post-processing, such as visualizing or analyzing the transformed data. Each step involves multiple AWS services. With Step Functions workflow integration, you can orchestrate these steps. This helps to create repeatable and scalable data processing pipelines as part of a larger business application and visualize the workflows in the Athena console.

    With Step Functions, you can run queries on a schedule or based on an event by using Amazon EventBridge. You can poll long-running Athena queries before moving to the next step in the process, and handle errors without writing custom code. Combining these two services provides developers with a single method that is scalable and repeatable.

    Step Functions workflows in the Amazon Athena console allow orchestration of Athena queries with Step Functions state machines:

    Athena console

    Using Athena query patterns from Step Functions

    Execute multiple queries

    In Athena, you run SQL queries in the Athena console against Athena workgroups. With Step Functions, you can run Athena queries in a sequence or run independent queries simultaneously in parallel using a parallel state. Step Functions also natively handles errors and retries related to Athena query tasks.
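
    For illustration, the sketch below shows what such a query task does under the hood by calling the Athena API directly with Boto3. The workgroup, database, query, and output location are hypothetical; in the Step Functions workflow, the optimized integration makes these calls for you.

    import time
    import boto3

    athena = boto3.client("athena")

    # Start a query against a hypothetical workgroup and database
    execution = athena.start_query_execution(
        QueryString="SELECT * FROM sales LIMIT 10;",
        QueryExecutionContext={"Database": "demo_db"},
        WorkGroup="demo-workgroup",
        ResultConfiguration={"OutputLocation": "s3://demo-athena-results/"},
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query reaches a terminal state
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    # Fetch the first page of results
    results = athena.get_query_results(QueryExecutionId=query_id)
    print(results["ResultSet"]["Rows"][:5])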

    Workflow orchestration in the Athena console provides these capabilities to run and visualize multiple queries in Step Functions. For example:

    UI flow

    1. Choose Get Started from Execute multiple queries.
    2. From the pop-up, choose Create your own workflow and select Continue.

    A new browser tab opens with the Step Functions Workflow Studio. The designer shows a workflow pattern template pre-created. The workflow loads data from a data source running two Athena queries in parallel. The results are then published to an Amazon SNS topic.

    Alternatively, choosing Deploy a sample project from the Get Started pop-up deploys a sample Step Functions workflow.

    Get started flow

    This option creates a state machine. You then review the workflow definition, deploy an AWS CloudFormation stack, and run the workflow in the Step Functions console.

    Deploy and run

    Once deployed, the state machine is visible in the Step Functions console as:

    State machines

    Select the AthenaMultipleQueriesStateMachine to land on the details page:

    Details page

    The CloudFormation template provisions the required AWS Glue database, S3 bucket, an Athena workgroup, the required AWS Lambda functions, and the SNS topic for query results notification.

    To see the Step Functions workflow in action, choose Start execution. Keep the optional name and input and choose Start execution:

    Start execution

    The state machine completes the tasks successfully by Executing multiple queries in parallel using Amazon Athena and Sending query results using the SNS topic:

    Successful execution

    The state machine used the Amazon Athena StartQueryExecution and GetQueryResults tasks. The Workflow orchestration in Athena console now highlights this newly created Step Functions state machine:

    Athena console

    Any state machine that uses this task in Step Functions in this account is listed here as a state machine that orchestrates Athena queries.

    Query large datasets

    You can also ingest an extensive dataset in Amazon S3, partition it using AWS Glue crawlers, then run Amazon Athena queries against that partition.

    Select Get Started from the Query large datasets pop-up, then choose Create your own workflow and Continue. This action opens the Step Functions Workflow studio with the following pattern. The Glue crawler starts and partitions large datasets for Athena to query in the subsequent query execution task:

    Query large datasets

    Step Functions allows you to combine Glue crawler tasks and Athena queries to partition where necessary before querying and publishing the results.

    Keeping data up to date

    You can also use Athena to query a target table to fetch data, then update it with new data from other sources using Step Functions’ choice state. The choice state in Step Functions provides branching logic for a state machine.

    Keep data up to date

    You are not limited to the previous three patterns shown in workflow orchestration in the Athena console. You can start from scratch and build a Step Functions state machine by navigating to the bottom right and using Create state machine:

    Create state machine

    Create State Machine in the Athena console opens a new tab showing the Step Functions console’s Create state machine page.

    Choosing authoring method

    Refer to building a state machine with AWS Step Functions Workflow Studio for additional details.

    Step Functions integrates with all of Amazon Athena’s API actions

    In September 2021, Step Functions announced integration support for 200 AWS services to enable easier workflow automation. With this announcement, Step Functions can integrate with all Amazon Athena API actions today.

    Step Functions can automate the lifecycle of an Athena query: Create/read/update/delete/list workGroups; Create/read/update/delete/list data catalogs, and more.

    Other AWS service integrations

    Step Functions’ integration with the AWS SDK provides support for 200 AWS Services and over 9,000 API actions. Athena tasks in Step Functions can evolve by integrating available AWS services in the workflow for their pre and post-processing needs.

    For example, you can read Athena query results that are put to an S3 bucket by using a GetObject S3 task AWS SDK integration in Step Functions. You can combine different AWS services into a single business process so that they can ingest data through Amazon Kinesis, do processing via AWS Lambda or Amazon EMR jobs, and send notifications to interested parties via Amazon SNS or Amazon SQS or Amazon EventBridge to trigger other parts of their business application.

    There are multiple ways to decorate around an Amazon Athena job task. Refer to AWS SDK service integrations and optimized integrations for Step Functions for additional details.

    Important considerations

    Workflow orchestrations in the Athena console only show Step Functions state machines that use Athena’s optimized API integrations. This includes StartQueryExecution, StopQueryExecution, GetQueryExecution, and GetQueryResults.

    Step Functions state machines do not show in the Athena console when:

    1. A state machine uses any other AWS SDK Athena API integration task.
    2. The APIs are invoked inside a Lambda function task using an AWS SDK client (like Boto3 or Node.js or Java).

    Cleanup

    First, empty DataBucket and AthenaWorkGroup to delete the stack successfully. To delete the sample application stack, use the latest version of AWS CLI and run:

    aws cloudformation delete-stack --stack-name <stack-name>

    Alternatively, delete the sample application stack in the CloudFormation console by selecting the stack and choosing Delete:

    Stack deletion

    Conclusion

    Amazon Athena console now provides an integration with AWS Step Functions’ workflows. You can use the provided patterns to create and visualize Step Functions’ workflows directly from the Amazon Athena console. Step Functions’ workflows that use Athena’s optimized API integration appear in the Athena console. To learn more about Amazon Athena, read the user guide.

    To get started, open the Workflows page in the Athena console. Select Create Athena jobs with Step Functions Workflows to deploy a sample project, if you are new to Step Functions.

    For more serverless learning resources, visit Serverless Land.

    Offset lag metric for Amazon MSK as an event source for Lambda

    Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/offset-lag-metric-for-amazon-msk-as-an-event-source-for-lambda/

    This post is written by Adam Wagner, Principal Serverless Solutions Architect.

    Last year, AWS announced support for Amazon Managed Streaming for Apache Kafka (MSK) and self-managed Apache Kafka clusters as event sources for AWS Lambda. Today, AWS adds a new OffsetLag metric to Lambda functions with MSK or self-managed Apache Kafka event sources.

    Offset in Apache Kafka is an integer that marks the current position of a consumer. OffsetLag is the difference in offset between the last record written to the Kafka topic and the last record processed by Lambda. Kafka expresses this in the number of records, not a measure of time. This metric provides visibility into whether your Lambda function is keeping up with the records added to the topic it is processing.

    This blog walks through using the OffsetLag metric along with other Lambda and MSK metrics to understand your streaming application and optimize your Lambda function.

    Overview

    In this example application, a producer writes messages to a topic on the MSK cluster that is an event source for a Lambda function. Each message contains a number and the Lambda function finds the factors of that number. It outputs the input number and results to an Amazon DynamoDB table.

    Finding all the factors of a number is fast if the number is small but takes longer for larger numbers. This difference means the size of the number written to the MSK topic influences the Lambda function duration.

    Example application architecture

    1. A Kafka client writes messages to a topic in the MSK cluster.
    2. The Lambda event source polls the MSK topic on your behalf for new messages and triggers your Lambda function with batches of messages.
    3. The Lambda function factors the number in each message and then writes the results to DynamoDB.

    In this application, several factors can contribute to offset lag. The first is the volume and size of messages. If more messages are coming in, the Lambda may take longer to process them. Other factors are the number of partitions in the topic, and the number of concurrent Lambda functions processing messages. A full explanation of how Lambda concurrency scales with the MSK event source is in the documentation.

    If the average duration of your Lambda function increases, this also tends to increase the offset lag. This lag could be latency in a downstream service or due to the complexity of the incoming messages. Lastly, if your Lambda function errors, the MSK event source retries the identical records set until they succeed. This retry functionality also increases offset lag.

    Measuring OffsetLag

    To understand how the new OffsetLag metric works, you first need a working MSK topic as an event source for a Lambda function. Follow this blog post to set up an MSK instance.

    To find the OffsetLag metric, go to the CloudWatch console, select All Metrics from the left-hand menu. Then select Lambda, followed by By Function Name to see a list of metrics by Lambda function. Scroll or use the search bar to find the metrics for this function and select OffsetLag.

    OffsetLag metric example
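
    If you prefer to retrieve the metric programmatically, a sketch like the following can pull recent OffsetLag datapoints with Boto3. It assumes the metric is published in the AWS/Lambda namespace with a FunctionName dimension, and the function name is hypothetical.

    from datetime import datetime, timedelta
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Average and maximum OffsetLag over the last hour, in 5-minute periods
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName="OffsetLag",
        Dimensions=[{"Name": "FunctionName", "Value": "msk-factoring-function"}],
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average", "Maximum"],
    )
    for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Average"], point["Maximum"])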

    To make it easier to look at multiple metrics at once, create a CloudWatch dashboard starting with the OffsetLag metric. Select Actions -> Add to Dashboard. Select the Create new button and provide a name for the dashboard. Choose Create, keeping the rest of the options at the defaults.

    Adding OffsetLag to dashboard

    After choosing Add to dashboard, the new dashboard appears. Choose the Add widget button to add the Lambda duration metric from the same function. Add another widget that combines both Lambda errors and invocations for the function. Add a final widget for the BytesInPerSec metric for the MSK topic. Find this metric under AWS/Kafka -> Broker ID, Cluster Name, Topic. Finally, choose Save dashboard.

    After a few minutes, you see a steady stream of invocations, as you would expect when consuming from a busy topic.

    Data incoming to dashboard

    This example is a CloudWatch dashboard showing the Lambda OffsetLag, Duration, Errors, and Invocations, along with the BytesInPerSec for the MSK topic.

    In this example, the OffsetLag metric is averaging about eight, indicating that the Lambda function is eight records behind the latest record in the topic. While this is acceptable, there is room for improvement.

    The first thing to look for is Lambda function errors, which can drive up offset lag. The metrics show that there are no errors so the next step is to evaluate and optimize the code.

    The Lambda handler function loops through the records and calls the process_msg function on each record:

    def lambda_handler(event, context):
        # the MSK event source groups records by topic and partition
        for batch in event['records'].keys():
            for record in event['records'][batch]:
                try:
                    process_msg(record)
                except Exception as err:
                    print("error processing record:", err, record)
        return()

    The process_msg function handles base64 decoding, calls a factor function to factor the number, and writes the record to a DynamoDB table:

    def process_msg(record):
        #messages are base64 encoded, so we decode it here
        msg_value = base64.b64decode(record['value']).decode()
        msg_dict = json.loads(msg_value)
        #using the number as the hash key in the dynamodb table
        msg_id = f"{msg_dict['number']}"
        if msg_dict['number'] <= MAX_NUMBER:
            factors = factor_number(msg_dict['number'])
            print(f"number: {msg_dict['number']} has factors: {factors}")
            item = {'msg_id': msg_id, 'msg':msg_value, 'factors':factors}
            resp = ddb_table.put_item(Item=item)
        else:
            print(f"ERROR: {msg_dict['number']} is >= limit of {MAX_NUMBER}")
    

    The heavy computation takes place in the factor function:

    def factor(number):
        factors = [1,number]
        for x in range(2, (int(1 + number / 2))):
            if (number % x) == 0:
                factors.append(x)
        return factors

    The code loops through all numbers up to the factored number divided by two. The code is optimized by only looping up to the square root of the number.

    def factor(number):
        factors = [1,number]
        for x in range(2, 1 + int(number**0.5)):
            if (number % x) == 0:
                factors.append(x)
                # avoid appending the square root twice for perfect squares
                if x != number // x:
                    factors.append(number // x)
        return factors

    There are further optimizations and libraries for factoring numbers but this provides a noticeable performance improvement in this example.

    Data after optimization

    After deploying the code, refresh the metrics after a while to see the improvements:

    The average Lambda duration has dropped to single-digit milliseconds and the OffsetLag is now averaging two.

    If you see a noticeable change in the OffsetLag metric, there are several things to investigate. The input side of the system, increased messages per second, or a significant increase in the size of the message are a few options.

    Conclusion

    This post walks through implementing the OffsetLag metric to understand latency between the latest messages in the MSK topic and the records a Lambda function is processing. It also reviews other metrics that help you understand the underlying cause of increases to the offset lag. For more information on this topic, refer to the documentation and other MSK Lambda metrics.

    For more serverless learning resources, visit Serverless Land.

    Expanding cross-Region event routing with Amazon EventBridge

    Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/expanding-cross-region-event-routing-with-amazon-eventbridge/

    This post is written by Stephen Liedig, Sr Serverless Specialist SA.

    In April 2021, AWS announced a new feature for Amazon EventBridge that allows you to route events from any commercial AWS Region to US East (N. Virginia), US West (Oregon), and Europe (Ireland). Starting today, you can route events between any AWS Regions, except AWS GovCloud (US) and China.

    EventBridge enables developers to create event-driven applications by routing events between AWS services, integrated software as a service (SaaS) applications, and your own applications. This helps you produce loosely coupled, distributed, and maintainable architectures. With these new capabilities, you can now route events across Regions and accounts using the same model used to route events to existing targets.

    Cross-Region event routing with Amazon EventBridge makes it easier for customers to develop multi-Region workloads to:

    • Centralize your AWS events into one Region for auditing and monitoring purposes, such as aggregating security events for compliance reasons in a single account.
    • Replicate events from source to destination Regions to help synchronize data in cross-Region data stores.
    • Invoke asynchronous workflows in a different Region from a source event. For example, you can load balance from a target Region by routing events to another Region.

    A previous post shows how cross-Region routing works. This blog post expands on these concepts and discusses a common use case for cross-Region event delivery – event auditing. This example explores how you can manage resources using AWS CloudFormation and EventBridge resource policies.

    Multi-Region event auditing example walkthrough

    Compliance is an important part of building event-driven applications and reacting to any potential policy or security violations. Customers use EventBridge to route security events from applications and globally distributed infrastructure into a single account for analysis. In many cases, they share specific AWS CloudTrail events with security teams. Customers also audit events from their custom-built applications to monitor sensitive data usage.

    In this scenario, a company has their base of operations located in Asia Pacific (Singapore) with applications distributed across US East (N. Virginia) and Europe (Frankfurt). The applications in US East (N. Virginia) and Europe (Frankfurt) are using EventBridge for their respective applications and services. The security team in Asia Pacific (Singapore) wants to analyze events from the applications and CloudTrail events for specific API calls to monitor infrastructure security.

    Reference architecture

    To create the rules to receive these events:

    1. Create a new set of rules directly on all the event buses across the global infrastructure. Alternatively, delegate the responsibility of managing security rules to distributed teams that manage the event bus resources.
    2. Provide the security team with the ability to manage rules centrally, and control the lifecycle of rules on the global infrastructure.

    Allowing the security team to manage the resources centrally provides more scalability. It is more consistent with the design principle that event consumers own and manage the rules that define how they process events.

    Deploying the example application

    The following code snippets are shortened for brevity. The full source code of the solution is in the GitHub repository. The solution uses AWS Serverless Application Model (AWS SAM) for deployment. Clone the repo and navigate to the solution directory:

    git clone https://github.com/aws-samples/amazon-eventbridge-resource-policy-samples
    cd ./patterns/cross-region-cross-account-pattern/

    To allow the security team to start receiving events from any of the cross-Region accounts:

    1. Create a security event bus in the Asia Pacific (Singapore) Region with a rule that processes events from the respective event sources.

    ap-southeast-1 architecture

    For simplicity, this example uses an Amazon CloudWatch Logs target to visualize the events arriving from cross-Region accounts:

     SecurityEventBus:
       Type: AWS::Events::EventBus
       Properties:
         Name: !Ref SecurityEventBusName

     # This rule processes events coming in from cross-Region accounts
     SecurityAnalysisRule:
       Type: AWS::Events::Rule
       Properties:
         Name: SecurityAnalysisRule
         Description: Analyze events from cross-Region event buses
         EventBusName: !GetAtt SecurityEventBus.Arn
         EventPattern:
           source:
             - anything-but: com.company.security
         State: ENABLED
         RoleArn: !GetAtt WriteToCwlRole.Arn
         Targets:
           - Id: SendEventToSecurityAnalysisRule
             Arn: !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:${SecurityAnalysisRuleTarget}"
    

    In this example, you set the event pattern to process any event from a source that is not from the security team’s own domain. This allows you to process events from any account in any Region. You can filter this further as needed.
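
    For example, to narrow the rule so that it only matches CloudTrail-delivered API call events from outside the security domain, you could refine the pattern. The following is a minimal boto3 sketch with placeholder names; the CloudFormation definition above remains the source of truth for the deployed rule, and the detail-type filter is an illustrative assumption.

    import json
    import boto3

    events = boto3.client("events", region_name="ap-southeast-1")

    # Hypothetical refinement: still exclude the security team's own events,
    # and additionally match only CloudTrail-delivered API call events.
    event_pattern = {
        "source": [{"anything-but": "com.company.security"}],
        "detail-type": ["AWS API Call via CloudTrail"],
    }

    events.put_rule(
        Name="SecurityAnalysisRule",
        EventBusName="SecurityEventBus",  # placeholder event bus name
        EventPattern=json.dumps(event_pattern),
        State="ENABLED",
    )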

    2. Set an event bus policy on each default and custom event bus that the security team must receive events from.

    Event bus policy

    This policy allows the security team to create rules to route events to its own security event bus in the Asia Pacific (Singapore) Region. The following template defines a custom event bus in Account 2 in US East (N. Virginia) and an AWS::Events::EventBusPolicy that sets the Principal as the security team account.

    This allows the security team to manage rules on the CustomEventBus:

     CustomEventBus:
       Type: AWS::Events::EventBus
       Properties:
         Name: !Ref EventBusName

     SecurityServiceRuleCreationStatement:
       Type: AWS::Events::EventBusPolicy
       Properties:
         EventBusName: !Ref CustomEventBus # If you omit this, the default event bus is used.
         StatementId: "AllowCrossRegionRulesForSecurityTeam"
         Statement:
           Effect: "Allow"
           Principal:
             AWS: !Sub "arn:aws:iam::${SecurityAccountNo}:root"
           Action:
             - "events:PutRule"
             - "events:DeleteRule"
             - "events:DescribeRule"
             - "events:DisableRule"
             - "events:EnableRule"
             - "events:PutTargets"
             - "events:RemoveTargets"
           Resource:
             - !Sub 'arn:aws:events:${AWS::Region}:${AWS::AccountId}:rule/${CustomEventBus.Name}/*'
           Condition:
             StringEqualsIfExists:
               "events:creatorAccount": "${aws:PrincipalAccount}"
    

    3. With the policies set on the cross-Region accounts, now create the rules. Because you cannot create CloudFormation resources across Regions, you must define the rules in separate templates. This also gives you the ability to expand to other Regions.

    Once the template is deployed to the cross-Region accounts, use EventBridge resource policies to propagate rule definitions across accounts in the same Region. The security account must have permission to create CloudFormation resources in the cross-Region accounts to deploy the rule templates.

    Resource policies

    There are two parts to the rule templates. The first specifies a role that allows EventBridge to assume a role to send events to the target event bus in the security account:

     # This IAM role allows EventBridge to assume the permissions necessary to send events
     # from the source event buses to the destination event bus.
     SourceToDestinationEventBusRole:
       Type: "AWS::IAM::Role"
       Properties:
         AssumeRolePolicyDocument:
           Version: 2012-10-17
           Statement:
             - Effect: Allow
               Principal:
                 Service:
                   - events.amazonaws.com
               Action:
                 - "sts:AssumeRole"
         Path: /
         Policies:
           - PolicyName: PutEventsOnDestinationEventBus
             PolicyDocument:
               Version: 2012-10-17
               Statement:
                 - Effect: Allow
                   Action: "events:PutEvents"
                   Resource:
                     - !Ref SecurityEventBusArn
    

    The second is the definition of the rule resource. This requires the Amazon Resource Name (ARN) of the event bus where you want to put the rule, the ARN of the target event bus in the security account, and a reference to the SourceToDestinationEventBusRole role:

     SecurityAuditRule2:
       Type: AWS::Events::Rule
       Properties:
         Name: SecurityAuditRuleAccount2
         Description: Audit rule for the Security team in Singapore
         EventBusName: !Ref EventBusArnAccount2 # ARN of the custom event bus in Account 2
         EventPattern:
           source:
             - com.company.marketing
         State: ENABLED
         Targets:
           - Id: SendEventToSecurityEventBusArn
             Arn: !Ref SecurityEventBusArn
             RoleArn: !GetAtt SourceToDestinationEventBusRole.Arn
    

    You can use the AWS SAM CLI to deploy this:

    sam deploy -t us-east-1-rules.yaml \
      --stack-name us-east-1-rules \
      --region us-east-1 \
      --profile default \
      --capabilities=CAPABILITY_IAM \
      --parameter-overrides SecurityEventBusArn="arn:aws:events:ap-southeast-1:111111111111:event-bus/SecurityEventBus" EventBusArnAccount1="arn:aws:events:us-east-1:111111111111:event-bus/default" EventBusArnAccount2="arn:aws:events:us-east-1:222222222222:event-bus/custom-eventbus-account-2"
    

    Testing the example application

    With the rules deployed across the Regions, you can test by sending events to the event bus in Account 2:

    1. Navigate to the applications/account_2 directory. Here you find an events.json file, which you use as input for the put-events API call.
    2. Run the following command using the AWS CLI. This sends messages to the event bus in us-east-1 which are routed to the security event bus in ap-southeast-1:
      aws events put-events \
       --region us-east-1 \
       --profile [NAMED PROFILE FOR ACCOUNT 2] \
       --entries file://events.json
      

      If you have run this successfully, you see:
      Entries:
      - EventId: a423b35e-3df0-e5dc-b854-db9c42144fa2
      - EventId: 5f22aea8-51ea-371f-7a5f-8300f1c93973
      - EventId: 7279fa46-11a6-7495-d7bb-436e391cfcab
      - EventId: b1e1ecc1-03f7-e3ef-9aa4-5ac3c8625cc7
      - EventId: b68cea94-28e2-bfb9-7b1f-9b2c5089f430
      - EventId: fc48a303-a1b2-bda8-8488-32daa5f809d8
      FailedEntryCount: 0

    3. Navigate to the Amazon CloudWatch console to see a collection of log entries with the events you published. The log group is /aws/events/SecurityAnalysisRule.

      CloudWatch Logs

    Congratulations, you have successfully sent your first events across accounts and Regions!
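
    If you prefer to send test events programmatically rather than editing events.json, the following is a minimal boto3 sketch. The profile name, detail type, and detail payload are illustrative assumptions; the source must match the event pattern of the deployed rule.

    import json
    import boto3

    # Uses the same named profile and Region as the CLI example above.
    session = boto3.Session(profile_name="account-2-profile", region_name="us-east-1")
    events = session.client("events")

    response = events.put_events(
        Entries=[
            {
                "Source": "com.company.marketing",            # must match the rule's event pattern
                "DetailType": "campaign-status-change",       # illustrative detail type
                "Detail": json.dumps({"campaignId": "123", "status": "LAUNCHED"}),
                "EventBusName": "custom-eventbus-account-2",  # placeholder bus name
            }
        ]
    )
    print(response["FailedEntryCount"], response["Entries"])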

    Conclusion

    With cross-Region event routing in EventBridge, you can now route events to and from any AWS Region. This post explains how to manage and configure cross-Region event routing using CloudFormation and EventBridge resource policies to simplify rule propagation across your global event bus infrastructure. Finally, I walk through an example you can deploy to your AWS account.

    For more serverless learning resources, visit Serverless Land.

    Introducing mutual TLS authentication for Amazon MSK as an event source

    Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/introducing-mutual-tls-authentication-for-amazon-msk-as-an-event-source/

    This post is written by Uma Ramadoss, Senior Specialist Solutions Architect, Integration.

    Today, AWS Lambda is introducing mutual TLS (mTLS) authentication for Amazon Managed Streaming for Apache Kafka (Amazon MSK) and self-managed Kafka as an event source.

    Many customers use Amazon MSK for streaming data from multiple producers. Multiple subscribers can then consume the streaming data and build data pipelines, analytics, and data integration. To learn more, read Using Amazon MSK as an event source for AWS Lambda.

    You can activate any combination of authentication modes (mutual TLS, SASL SCRAM, or IAM access control) on new or existing clusters. This is useful if you are migrating to a new authentication mode or must run multiple authentication modes simultaneously. Lambda natively supports consuming messages from both self-managed Kafka and Amazon MSK through event source mapping.

    By default, the TLS protocol only requires a server to authenticate itself to the client. The authentication of the client to the server is managed by the application layer. The TLS protocol also offers the ability for the server to request that the client send an X.509 certificate to prove its identity. This is called mutual TLS as both parties are authenticated via certificates with TLS.

    Mutual TLS is a commonly used authentication mechanism for business-to-business (B2B) applications. It’s used in standards such as Open Banking, which enables secure open API integrations for financial institutions. It is one of the popular authentication mechanisms for customers using Kafka.

    To use mutual TLS authentication for your Kafka-triggered Lambda functions, you provide a signed client certificate, the private key for the certificate, and an optional password if the private key is encrypted. This establishes a trust relationship between Lambda and Amazon MSK or self-managed Kafka. Lambda supports self-signed server certificates or server certificates signed by a private certificate authority (CA) for self-managed Kafka. Lambda trusts the Amazon MSK certificate by default as the certificates are signed by Amazon Trust Services CAs.

    This blog post explains how to set up a Lambda function to process messages from an Amazon MSK cluster using mutual TLS authentication.

    Overview

    Using Amazon MSK as an event source operates in a similar way to using Amazon SQS or Amazon Kinesis. You create an event source mapping by attaching Amazon MSK as an event source to your Lambda function.

    The Lambda service internally polls for new records from the event source, reading the messages from one or more partitions in batches. It then synchronously invokes your Lambda function, sending each batch as an event payload. Lambda continues to process batches until there are no more messages in the topic.

    The Lambda function’s event payload contains an array of records. Each array item contains details of the topic and Kafka partition identifier, together with a timestamp and base64 encoded message.

    Kafka event payload

    You store the signed client certificate, the private key for the certificate, and an optional password (if the private key is encrypted) in AWS Secrets Manager as a secret. You provide the secret in the Lambda event source mapping.

    The steps for using mutual TLS authentication for Amazon MSK as event source for Lambda are:

    1. Create a private certificate authority (CA) using AWS Certificate Manager (ACM) Private Certificate Authority (PCA).
    2. Create a client certificate and private key. Store them as a secret in AWS Secrets Manager.
    3. Create an Amazon MSK cluster and a consuming Lambda function using the AWS Serverless Application Model (AWS SAM).
    4. Attach the event source mapping.

    This blog walks through these steps in detail.

    Prerequisites

    1. Creating a private CA.

    To use mutual TLS client authentication with Amazon MSK, create a root CA using AWS ACM Private Certificate Authority (PCA). We recommend using independent ACM PCAs for each MSK cluster when you use mutual TLS to control access. This ensures that TLS certificates signed by PCAs only authenticate with a single MSK cluster.

    1. From the AWS Certificate Manager console, choose Create a Private CA.
    2. In the Select CA type panel, select Root CA and choose Next.
    3. Select Root CA


    4. In the Configure CA subject name panel, provide your certificate details, and choose Next.
    5. Provide your certificate details


    6. From the Configure CA key algorithm panel, choose the key algorithm for your CA and choose Next.
    7. Configure CA key algorithm


    8. From the Configure revocation panel, choose any optional certificate revocation options you require and choose Next.
    9. Configure revocation


    10. Continue through the screens to add any tags required, allow ACM to renew certificates, review your options, and confirm pricing. Choose Confirm and create.
    11. Once the CA is created, choose Install CA certificate to activate your CA. Configure the validity of the certificate and the signature algorithm and choose Next.
    12. Configure certificate


    13. Review the certificate details and choose Confirm and install. Note down the Amazon Resource Name (ARN) of the private CA for the next section.
    14. Review certificate details


    2. Creating a client certificate.

    You generate a client certificate using the root certificate you previously created, which is used to authenticate the client with the Amazon MSK cluster using mutual TLS. You provide this client certificate and the private key as AWS Secrets Manager secrets to the AWS Lambda event source mapping.

    1. On your local machine, run the following command to create a private key and certificate signing request using OpenSSL. Enter your certificate details. This creates a private key file and a certificate signing request file in the current directory.
    2. openssl req -new -newkey rsa:2048 -days 365 -keyout key.pem -out client_cert.csr -nodes
      OpenSSL create a private key and certificate signing request


    3. Use the AWS CLI to sign your certificate request with the private CA previously created. Replace Private-CA-ARN with the ARN of your private CA. The certificate validity value is set to 300, change this if necessary. Save the certificate ARN provided in the response.
    4. aws acm-pca issue-certificate --certificate-authority-arn Private-CA-ARN --csr fileb://client_cert.csr --signing-algorithm "SHA256WITHRSA" --validity Value=300,Type="DAYS"
    5. Retrieve the certificate that ACM signed for you. Replace the Private-CA-ARN and Certificate-ARN with the ARN you obtained from the previous commands. This creates a signed certificate file called client_cert.pem.
    6. aws acm-pca get-certificate --certificate-authority-arn Private-CA-ARN --certificate-arn Certificate-ARN | jq -r '.Certificate + "\n" + .CertificateChain' >> client_cert.pem
    7. Create a new file called secret.json with the following structure:
    8. {
      "certificate":"",
      "privateKey":""
      }
      
    9. Copy the contents of client_cert.pem into the certificate field and the contents of key.pem into the privateKey field. Ensure that no extra spaces are added. The file structure looks like this:
    10. Certificate file structure


    11. Create the secret and save the ARN for the next section.
    aws secretsmanager create-secret --name msk/mtls/lambda/clientcert --secret-string file://secret.json
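
    Alternatively, rather than pasting the PEM contents into secret.json by hand, you can build and store the secret with a short script. This is a minimal sketch, assuming the client_cert.pem and key.pem files created earlier and the same secret name as the CLI example:

    import json
    import boto3

    # Read the signed client certificate and private key created in the previous steps.
    with open("client_cert.pem") as f:
        certificate = f.read()
    with open("key.pem") as f:
        private_key = f.read()

    secretsmanager = boto3.client("secretsmanager")

    response = secretsmanager.create_secret(
        Name="msk/mtls/lambda/clientcert",
        SecretString=json.dumps({
            "certificate": certificate,
            "privateKey": private_key,
        }),
    )
    print(response["ARN"])  # save this ARN for the event source mapping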

    3. Setting up an Amazon MSK cluster with AWS Lambda as a consumer.

    Amazon MSK is a highly available service, so it must be configured to run in a minimum of two Availability Zones in your preferred Region. To comply with security best practice, the brokers are usually configured in private subnets in each Region.

    You can use the AWS CLI, the AWS Management Console, AWS SDKs, and AWS CloudFormation to create the cluster and the Lambda functions. This blog uses AWS SAM to create the infrastructure and the associated code is available in the GitHub repository.

    The AWS SAM template creates the following resources:

    1. Amazon Virtual Private Cloud (VPC).
    2. Amazon MSK cluster with mutual TLS authentication.
    3. Lambda function for consuming the records from the Amazon MSK cluster.
    4. IAM roles.
    5. Lambda function for testing the Amazon MSK integration by publishing messages to the topic.

    The VPC has public and private subnets in two Availability Zones with the private subnets configured to use a NAT Gateway. You can also set up VPC endpoints with PrivateLink to allow the Amazon MSK cluster to communicate with Lambda. To learn more about different configurations, see this blog post.

    The Lambda function requires permission to describe VPCs and security groups, and manage elastic network interfaces to access the Amazon MSK data stream. The Lambda function also needs two Kafka permissions: kafka:DescribeCluster and kafka:GetBootstrapBrokers. The policy template AWSLambdaMSKExecutionRole includes these permissions. The Lambda function also requires permission to get the secret value from AWS Secrets Manager for the secret you configure in the event source mapping.

      ConsumerLambdaFunctionRole:
        Type: AWS::IAM::Role
        Properties:
          AssumeRolePolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Principal:
                  Service: lambda.amazonaws.com
                Action: sts:AssumeRole
          ManagedPolicyArns:
            - arn:aws:iam::aws:policy/service-role/AWSLambdaMSKExecutionRole
          Policies:
            - PolicyName: SecretAccess
              PolicyDocument:
                Version: "2012-10-17"
                Statement:
                  - Effect: Allow
                    Action: "SecretsManager:GetSecretValue"
                    Resource: "*"

    This release adds two new SourceAccessConfiguration types to the Lambda event source mapping:

    1. CLIENT_CERTIFICATE_TLS_AUTH – (Amazon MSK, Self-managed Apache Kafka) The Secrets Manager ARN of your secret key containing the certificate chain (PEM), private key (PKCS#8 PEM), and private key password (optional) used for mutual TLS authentication of your Amazon MSK/Apache Kafka brokers. A private key password is required if the private key is encrypted.

    2. SERVER_ROOT_CA_CERTIFICATE – This is only for self-managed Apache Kafka. This contains the Secrets Manager ARN of your secret containing the root CA certificate used by your Apache Kafka brokers in PEM format. This is not applicable for Amazon MSK as Amazon MSK brokers use public AWS Certificate Manager certificates which are trusted by AWS Lambda by default.
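
    The walkthrough below creates the event source mapping with the AWS CLI. As a reference, the equivalent call with boto3 looks like the following minimal sketch; the function name, cluster ARN, and secret ARN are placeholders for values from your own deployment:

    import boto3

    lambda_client = boto3.client("lambda")

    response = lambda_client.create_event_source_mapping(
        FunctionName="my-msk-consumer-function",  # placeholder function name
        EventSourceArn="arn:aws:kafka:us-east-1:111111111111:cluster/demo-cluster/11111111-1111-1111-1111-111111111111-1",  # placeholder MSK cluster ARN
        Topics=["exampleTopic"],
        BatchSize=10,
        StartingPosition="TRIM_HORIZON",
        SourceAccessConfigurations=[
            {
                "Type": "CLIENT_CERTIFICATE_TLS_AUTH",
                "URI": "arn:aws:secretsmanager:us-east-1:111111111111:secret:msk/mtls/lambda/clientcert-AbCdEf",  # placeholder secret ARN
            }
        ],
    )
    print(response["UUID"], response["State"])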

    Deploying the resources:

    To deploy the example application:

    1. Clone the GitHub repository
    2. git clone https://github.com/aws-samples/aws-lambda-msk-mtls-integration.git
    3. Navigate to the aws-lambda-msk-mtls-integration directory. Copy the client certificate file and the private key file to the producer lambda function code.
    4. cd aws-lambda-msk-mtls-integration
      cp ../client_cert.pem code/producer/client_cert.pem
      cp ../key.pem code/producer/client_key.pem
    5. Navigate to the code directory and build the application artifacts using the AWS SAM build command.
    6. cd code
      sam build
    7. Run sam deploy to deploy the infrastructure. Provide the stack name, AWS Region, and ARN of the private CA created in section 1. Provide additional information as required by sam deploy and deploy the stack.
    8. sam deploy -g
      Running sam deploy -g


      The stack deployment takes about 30 minutes to complete. Once complete, note the output values.

    9. Create the event source mapping for the Lambda function. Replace the CONSUMER_FUNCTION_NAME and MSK_CLUSTER_ARN from the output of the stack created by the AWS SAM template. Replace SECRET_ARN with the ARN of the AWS Secrets Manager secret created previously.
    10. aws lambda create-event-source-mapping --function-name CONSUMER_FUNCTION_NAME --batch-size 10 --starting-position TRIM_HORIZON --topics exampleTopic --event-source-arn MSK_CLUSTER_ARN --source-access-configurations '[{"Type": "CLIENT_CERTIFICATE_TLS_AUTH","URI": "SECRET_ARN"}]'
    11. Navigate one directory level up and configure the producer function with the Amazon MSK broker details. Replace the PRODUCER_FUNCTION_NAME and MSK_CLUSTER_ARN from the output of the stack created by the AWS SAM template.
    12. cd ../
      ./setup_producer.sh MSK_CLUSTER_ARN PRODUCER_FUNCTION_NAME
    13. Verify that the event source mapping state is enabled before moving on to the next step. Replace UUID with the value returned by the create-event-source-mapping command.
    14. aws lambda get-event-source-mapping --uuid UUID
    15. Publish messages using the producer. Replace PRODUCER_FUNCTION_NAME from the output of the stack created by the AWS SAM template. The following command creates a Kafka topic called exampleTopic and publishes 100 messages to the topic.
    16. ./produce.sh PRODUCER_FUNCTION_NAME exampleTopic 100
    17. Verify that the consumer Lambda function receives and processes the messages by checking in Amazon CloudWatch log groups. Navigate to the log group by searching for aws/lambda/{stackname}-MSKConsumerLambda in the search bar.
    Consumer function log stream


    Conclusion

    Lambda now supports mutual TLS authentication for Amazon MSK and self-managed Kafka as an event source. You now have the option to provide a client certificate to establish a trust relationship between Lambda and MSK or self-managed Kafka brokers. It supports configuration via the AWS Management Console, AWS CLI, AWS SDK, and AWS CloudFormation.

    To learn more about how to use mutual TLS Authentication for your Kafka triggered AWS Lambda function, visit AWS Lambda with self-managed Apache Kafka and Using AWS Lambda with Amazon MSK.

    Publishing messages in batch to Amazon SNS topics

    Post Syndicated from Talia Nassi original https://aws.amazon.com/blogs/compute/publishing-messages-in-batch-to-amazon-sns-topics/

    This post is written by Heeki Park (Principal Solutions Architect, Serverless Specialist), Marc Pinaud (Senior Product Manager, Amazon SNS), Amir Eldesoky (Software Development Engineer, Amazon SNS), Jack Li (Software Development Engineer, Amazon SNS), and William Nguyen (Software Development Engineer, Amazon SNS).

    Today, we are announcing the ability for AWS customers to publish messages in batch to Amazon SNS topics. Until now, you were only able to publish one message to an SNS topic per Publish API request. With the new PublishBatch API, you can send up to 10 messages at a time in a single API request. This reduces cost for API requests by up to 90%, as you need fewer API requests to publish the same number of messages.

    Introducing the PublishBatch API

    Consider a log processing application where you process system logs and have different requirements for downstream processing. For example, you may want to do inference on incoming log data, populate an operational Amazon OpenSearch Service environment, and store log data in an enterprise data lake.

    Systems send log data to a standard SNS topic, and Amazon SQS queues and Amazon Kinesis Data Firehose are configured as subscribers. An AWS Lambda function subscribes to the first SQS queue and uses machine learning models to perform inference to detect security incidents or system access anomalies. A Lambda function subscribes to the second SQS queue and emits those log entries to an Amazon OpenSearch Service cluster. The workload uses Kibana dashboards to visualize log data. An Amazon Kinesis Data Firehose delivery stream subscribes to the SNS topic and archives all log data into Amazon S3. This allows data scientists to conduct further investigation and research on those logs.

    To do this, the application publishes log messages to the SNS topic. In the following Java code, you construct a publish request for a single message to an SNS topic and submit that request via the publish() method:

    // tab 1: standard publish example
    private static AmazonSNS snsClient;
    private static final String MESSAGE_PAYLOAD = " 192.168.1.100 - - [28/Oct/2021:10:27:10 -0500] \"GET /index.html HTTP/1.1\" 200 3395";
    
    PublishRequest request = new PublishRequest()
        .withTopicArn(topicArn)
        .withMessage(MESSAGE_PAYLOAD);
    PublishResult response = snsClient.publish(request);
    
    // tab 2: fifo publish example
    private static AmazonSNS snsClient;
    private static final String MESSAGE_PAYLOAD = " 192.168.1.100 - - [28/Oct/2021:10:27:10 -0500] \"GET /index.html HTTP/1.1\" 200 3395";
    private static final String MESSAGE_FIFO_GROUP = "server1234";
    
    PublishRequest request = new PublishRequest()
        .withTopicArn(topicArn)
        .withMessage(MESSAGE_PAYLOAD)
        .withMessageGroupId(MESSAGE_FIFO_GROUP)
        .withMessageDeduplicationId(UUID.randomUUID().toString());
    PublishResult response = snsClient.publish(request);
    

    If you extended the example above and had 10 log lines that each needed to be published as a message, you would have to write code to construct 10 publish requests, and subsequently submit each of those requests via the publish() method.

    With the new ability to publish batch messages, you can write the following code instead. You construct a list of publish entries first, then create a single publish batch request, and submit that batch request via the new publishBatch() method. The code uses a sample helper method getLoggingPayload(i) to get the appropriate payload for each message, which you can replace with your own business logic.

    // tab 1: standard publish example
    private static final String MESSAGE_BATCH_ID_PREFIX = "server1234-batch-id-";

    List<PublishBatchRequestEntry> entries = IntStream.range(0, 10)
        .mapToObj(i -> new PublishBatchRequestEntry()
            .withId(MESSAGE_BATCH_ID_PREFIX + i)
            .withMessage(getLoggingPayload(i)))
        .collect(Collectors.toList());
    PublishBatchRequest request = new PublishBatchRequest()
        .withTopicArn(topicArn)
        .withPublishBatchRequestEntries(entries);
    PublishBatchResult response = snsClient.publishBatch(request);

    // tab 2: fifo publish example
    private static final String MESSAGE_BATCH_ID_PREFIX = "server1234-batch-id-";
    private static final String MESSAGE_FIFO_GROUP = "server1234";

    List<PublishBatchRequestEntry> entries = IntStream.range(0, 10)
        .mapToObj(i -> new PublishBatchRequestEntry()
            .withId(MESSAGE_BATCH_ID_PREFIX + i)
            .withMessage(getLoggingPayload(i))
            .withMessageGroupId(MESSAGE_FIFO_GROUP)
            .withMessageDeduplicationId(UUID.randomUUID().toString()))
        .collect(Collectors.toList());
    PublishBatchRequest request = new PublishBatchRequest()
        .withTopicArn(topicArn)
        .withPublishBatchRequestEntries(entries);
    PublishBatchResult response = snsClient.publishBatch(request);

    In the list of publish requests, the application must assign a unique batch ID (up to 80 characters) to each publish request within that batch. When the SNS service successfully receives a message, the SNS service assigns a unique message ID and returns that message ID in the response object.

    If publishing to a FIFO topic, the SNS service additionally returns a sequence number in the response. When publishing a batch of messages, the PublishBatchResult object returns a list of response objects for successful and failed messages. If you iterate through the list of response objects for successful messages, you might see the following:

    // tab 1: standard publish output
    {
        "Id": "server1234-batch-id-0",
        "MessageId": "fcaef5b3-e9e3-5c9e-b761-ac46c4a779bb",
        ...
    }

    // tab 2: fifo publish output
    {
        "Id": "server1234-batch-id-0",
        "MessageId": "fcaef5b3-e9e3-5c9e-b761-ac46c4a779bb",
        "SequenceNumber": "10000000000000003000",
        ...
    }

    When receiving the message from SNS in the SQS queue, the application reads the following message:

    // tab 1: standard publish output
    {
        "Type" : "Notification",
        "MessageId" : "fcaef5b3-e9e3-5c9e-b761-ac46c4a779bb",
        "TopicArn" : "arn:aws:sns:us-east-1:112233445566:publishBatchTopic",
        "Message" : "payload-0",
        "Timestamp" : "2021-10-28T22:58:12.862Z",
        "UnsubscribeURL" : "http://sns.us-east-1.amazon.com/?Action=Unsubscribe&SubscriptionArn=arn:aws:sns:us-east-1:112233445566:publishBatchTopic:ff78260a-0953-4b60-9c2c-122ebcb5fc96"
    }

    // tab 2: fifo publish output
    {
        "Type" : "Notification",
        "MessageId" : "fcaef5b3-e9e3-5c9e-b761-ac46c4a779bb",
        "SequenceNumber" : "10000000000000003000",
        "TopicArn" : "arn:aws:sns:us-east-1:112233445566:publishBatchTopic",
        "Message" : "payload-0",
        "Timestamp" : "2021-10-28T22:58:12.862Z",
        "UnsubscribeURL" : "http://sns.us-east-1.amazon.com/?Action=Unsubscribe&SubscriptionArn=arn:aws:sns:us-east-1:112233445566:publishBatchTopic.fifo:ff78260a-0953-4b60-9c2c-122ebcb5fc96"
    }

    In the standard publish example, the MessageId of fcaef5b3-e9e3-5c9e-b761-ac46c4a779bb is propagated down to the message in SQS. In the FIFO publish example, the SequenceNumber of 10000000000000003000 is also propagated down to the message in SQS.

    Handling errors and quotas

    When publishing messages in batch, the application must handle errors that may have occurred during the publish batch request. Errors can occur at two different levels. The first is when publishing the batch request to the SNS topic. For example, if the application does not specify a unique message batch ID, it fails with the following error:

    com.amazonaws.services.sns.model.BatchEntryIdsNotDistinctException: Two or more batch entries in the request have the same Id. (Service: AmazonSNS; Status Code: 400; Error Code: BatchEntryIdsNotDistinct; Request ID: 44cdac03-eeac-5760-9264-f5f99f4914ad; Proxy: null)

    The second is within the batch request at the message level. The application must inspect the returned PublishBatchResult object by iterating through successful and failed responses:

    PublishBatchResult publishBatchResult = snsClient.publishBatch(request);
    publishBatchResult.getSuccessful().forEach(entry -> {
        System.out.println(entry.toString());
    });

    publishBatchResult.getFailed().forEach(entry -> {
        System.out.println(entry.toString());
    });
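
    The same publish-and-inspect pattern is available in other AWS SDKs. As a reference, here is a minimal Python sketch using boto3; the topic ARN and payloads are placeholders:

    import boto3

    sns = boto3.client("sns")
    TOPIC_ARN = "arn:aws:sns:us-east-1:112233445566:publishBatchTopic"  # placeholder

    # Build up to 10 entries per request; each entry needs a unique Id within the batch.
    entries = [
        {"Id": f"server1234-batch-id-{i}", "Message": f"payload-{i}"}
        for i in range(10)
    ]

    response = sns.publish_batch(
        TopicArn=TOPIC_ARN,
        PublishBatchRequestEntries=entries,
    )

    for entry in response.get("Successful", []):
        print("published:", entry["Id"], entry["MessageId"])
    for entry in response.get("Failed", []):
        print("failed:", entry["Id"], entry.get("Code"), entry.get("Message"))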

    With respect to quotas, the overall message throughput for an SNS topic remains the same. For example, in US East (N. Virginia), standard topics support up to 30,000 messages per second. Before this feature, 30,000 messages also meant 30,000 API requests per second. Because SNS now supports up to 10 messages per request, you can publish the same number of messages using only 3,000 API requests. With FIFO topics, the message throughput remains the same at 300 messages per second, but you can now send that volume of messages using only 30 API requests, thus reducing your messaging costs with SNS.

    Conclusion

    SNS now supports the ability to publish up to 10 messages in a single API request, reducing costs for publishing messages into SNS. Your applications can validate the publish status of each of the messages sent in the batch and handle failed publish requests accordingly. Message throughput to SNS topics remains the same for both standard and FIFO topics.

    Learn more about this ability in the SNS Developer Guide.
    Learn more about the details of the API request in the SNS API reference.
    Learn more about SNS quotas.

    For more serverless learning resources, visit Serverless Land.

    Deploying AWS Lambda layers automatically across multiple Regions

    Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/deploying-aws-lambda-layers-automatically-across-multiple-regions/

    This post is written by Ben Freiberg, Solutions Architect, and Markus Ziller, Solutions Architect.

    Many developers import libraries and dependencies into their AWS Lambda functions. These dependencies can be zipped and uploaded as part of the build and deployment process but it’s often easier to use Lambda layers instead.

    A Lambda layer is an archive containing additional code, such as libraries or dependencies. Layers are deployed as immutable versions, and the version number increments each time you publish a new layer. When you include a layer in a function, you specify the layer version you want to use.
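
    For example, attaching a specific layer version to an existing function is a single API call. The following is a minimal boto3 sketch with a placeholder function name and layer version ARN:

    import boto3

    lambda_client = boto3.client("lambda", region_name="eu-central-1")

    # Note: Layers replaces the function's entire layer list, so include every layer you need.
    lambda_client.update_function_configuration(
        FunctionName="my-function",  # placeholder
        Layers=["arn:aws:lambda:eu-central-1:111111111111:layer:sample-layer:3"],  # placeholder ARN
    )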

    Lambda layers simplify and speed up the development process by providing common dependencies and reducing the deployment size of your Lambda functions. To learn more, refer to Using Lambda layers to simplify your development process.

    Many customers build Lambda layers for use across multiple Regions. But maintaining up-to-date and consistent layer versions across multiple Regions is a manual process. Layers are set as private automatically but they can be shared with other AWS accounts or shared publicly. Permissions only apply to a single version of a layer. This solution automates the creation and deployment of Lambda layers across multiple Regions from a centralized pipeline.

    Overview of the solution

    This solution uses AWS Lambda, AWS CodeCommit, AWS CodeBuild and AWS CodePipeline.

    Reference architecture

    This diagram outlines the workflow implemented in this blog:

    1. A CodeCommit repository contains the language-specific definition of dependencies and libraries that the layer contains, such as package.json for Node.js or requirements.txt for Python. Each commit to the main branch triggers an execution of the surrounding CodePipeline.
    2. A CodeBuild job uses the provided buildspec.yaml to create a zipped archive containing the layer contents. CodePipeline automatically stores the output of the CodeBuild job as artifacts in a dedicated Amazon S3 bucket.
    3. A Lambda function is invoked for each configured Region.
    4. The function first downloads the zip archive from S3.
    5. Next, the function creates the layer version in the specified Region with the configured permissions.

    Walkthrough

    The following walkthrough explains the components and how the provisioning can be automated via CDK. For this walkthrough, you need:

    To deploy the sample stack:

    1. Clone the associated GitHub repository by running the following command in a local directory:
      git clone https://github.com/aws-samples/multi-region-lambda-layers
    2. Open the repository in your preferred editor and review the contents of the src and cdk folder.
    3. Follow the instructions in the README.md to deploy the stack.
    4. Check the execution history of your pipeline in the AWS Management Console. The pipeline has been started once already and published a first version of the Lambda layer.
      Execution history

    Code repository

    The source code of the Lambda layers is stored in AWS CodeCommit. This is a secure, highly scalable, managed source control service that hosts private Git repositories. This example initializes a new repository as part of the CDK stack:

        const asset = new Asset(this, 'SampleAsset', {
          path: path.join(__dirname, '../../res')
        });
    
        const cfnRepository = new codecommit.CfnRepository(this, 'lambda-layer-source', {
          repositoryName: 'lambda-layer-source',
          repositoryDescription: 'Contains the source code for a nodejs12+14 Lambda layer.',
          code: {
            branchName: 'main',
            s3: {
              bucket: asset.s3BucketName,
              key: asset.s3ObjectKey
            }
          },
        });
    

    This code uploads the contents of the ./cdk/res/ folder to an S3 bucket that is managed by the CDK. The CDK then initializes a new CodeCommit repository with the contents of the bucket. In this case, the repository gets initialized with the following files:

    • LICENSE: A text file describing the license for this Lambda layer
    • package.json: In Node.js, the package.json file is a manifest for projects. It defines dependencies, scripts, and metainformation about the project. The npm install command installs all project dependencies in a node_modules folder. This is where you define the contents of the Lambda layer.

    The default package.json in the sample code defines a Lambda layer with the latest version of the AWS SDK for JavaScript:

    {
        "name": "lambda-layer",
        "version": "1.0.0",
        "description": "Sample AWS Lambda layer",
        "dependencies": {
          "aws-sdk": "latest"
        }
    }
    

    To see what is included in the layer, run npm install in the ./cdk/res/ directory. This shows the files that are bundled into the Lambda layer. The contents of this folder initialize the CodeCommit repository, so delete node_modules and package-lock.json after inspecting these files.

    Node modules directory

    This blog post uses a new CodeCommit repository as the source but you can adapt this to other providers. CodePipeline also supports repositories on GitHub and Bitbucket. To connect to those providers, see the documentation.

    CI/CD Pipeline

    CodePipeline automates the process of building and distributing Lambda layers across Regions for every change to the main branch of the source repository. It is a fully managed continuous delivery service that helps you automate your release pipelines for fast and reliable application and infrastructure updates. CodePipeline automates the build, test, and deploy phases of your release process every time there is a code change, based on the release model you define.

    The CDK creates a pipeline in CodePipeline and configures it so that every change to the code base of the Lambda layer runs through the following three stages:

    new codepipeline.Pipeline(this, 'Pipeline', {
          pipelineName: 'LambdaLayerBuilderPipeline',
          stages: [
            {
              stageName: 'Source',
              actions: [sourceAction]
            },
            {
              stageName: 'Build',
              actions: [buildAction]
            },
            {
              stageName: 'Distribute',
              actions: parallel,
            }
          ]
        });
    

    Source

    The source phase is the first phase of every run of the pipeline. It is typically triggered by a new commit to the main branch of the source repository. You can also start the source phase manually with the following AWS CLI command:

    aws codepipeline start-pipeline-execution --name LambdaLayerBuilderPipeline

    When started manually, the current head of the main branch is used. Otherwise CodePipeline checks out the code in the revision of the commit that triggered the pipeline execution.

    CodePipeline stores the code locally and uses it as an output artifact of the Source stage. Stages use input and output artifacts that are stored in the Amazon S3 artifact bucket you chose when you created the pipeline. CodePipeline zips and transfers the files for input or output artifacts as appropriate for the action type in the stage.

    Build

    In the second phase of the pipeline, CodePipeline installs all dependencies and packages according to the specs of the targeted Lambda runtime. CodeBuild is a fully managed build service in the cloud. It reduces the need to provision, manage, and scale your own build servers. It provides prepackaged build environments for popular programming languages and build tools like npm for Node.js.

    In CodeBuild, you use build specifications (buildspecs) to define what commands need to run to build your application. Here, it runs commands in a provisioned Docker image with Amazon Linux 2 to do the following:

    • Create the folder structure expected by Lambda layers.
    • Run npm install to install all Node.js dependencies.
    • Package the code into a layer.zip file and define layer.zip as output of the Build stage.

    The following CDK code highlights the specifications of the CodeBuild project.

    const buildAction = new codebuild.PipelineProject(this, 'lambda-layer-builder', {
          buildSpec: codebuild.BuildSpec.fromObject({
            version: '0.2',
            phases: {
              install: {
                commands: [
                  'mkdir -p node_layer/nodejs',
                  'cp package.json ./node_layer/nodejs/package.json',
                  'cd ./node_layer/nodejs',
                  'npm install',
                ]
              },
              build: {
                commands: [
                  'rm package-lock.json',
                  'cd ..',
                  'zip ../layer.zip * -r',
                ]
              }
            },
            artifacts: {
              files: [
                'layer.zip',
              ]
            }
          }),
          environment: {
            buildImage: codebuild.LinuxBuildImage.STANDARD_5_0
          }
        })
    

    Distribute

    In the final stage, Lambda uses layer.zip to create and publish a Lambda layer across multiple Regions. The sample code defines four Regions as targets for the distribution process:

    regionCodesToDistribute: ['eu-central-1', 'eu-west-1', 'us-west-1', 'us-east-1']

    The Distribution phase consists of n (one per Region) parallel invocations of the same Lambda function, each with userParameters.region set to the respective Region. This is defined in the CDK stack:

    const parallel = props.regionCodesToDistribute.map((region) => new codepipelineActions.LambdaInvokeAction({
          actionName: `distribute-${region}`,
          lambda: distributor,
          inputs: [buildOutput],
          userParameters: { region, layerPrincipal: props.layerPrincipal }
    }));
    

    Each Lambda function runs the following code to publish a new Lambda layer in each Region:

    // Simplified code for brevity
    // Omitted error handling, permission management and logging 
    // See code samples for full code.
    export async function handler(event: any) {
        // #1 Get job specific parameters (e.g. target region)
        const { location } = event['CodePipeline.job'].data.inputArtifacts[0];
        const { region, layerPrincipal } = JSON.parse(event["CodePipeline.job"].data.actionConfiguration.configuration.UserParameters);
        
        // #2 Get location of layer.zip and download it locally
        const layerZip = s3.getObject(/* Input artifact location*/);
        const lambda = new Lambda({ region });
        // #3 Publish a new Lambda layer version based on layer.zip
        const layer = lambda.publishLayerVersion({
            Content: {
                ZipFile: layerZip.Body
            },
            LayerName: 'sample-layer',
            CompatibleRuntimes: ['nodejs12.x', 'nodejs14.x']
        })
        
        // #4 Report the status of the operation back to CodePipeline
        return codepipeline.putJobSuccessResult(..);
    }
    
    

    After each Lambda function completes successfully, the pipeline ends. In a production application, you likely would have additional steps after publishing. For example, it may send notifications via Amazon SNS. To learn more about other possible integrations, read Working with pipelines in CodePipeline.

    Pipeline output

    Testing the workflow

    With this automation, you can release a new version of the Lambda layer by changing package.json in the source repository.

    Add the AWS X-Ray SDK for Node.js as a dependency to your project by making the following changes to package.json and committing the new version to your main branch:

    {
        "name": "lambda-layer",
        "version": "1.0.0",
        "description": "Sample AWS Lambda layer",
        "dependencies": {
            "aws-sdk": "latest",
            "aws-xray-sdk": "latest"
        }
    }
    

    After committing the new version to the repository, the pipeline is triggered again. After a while, you see that an updated version of the Lambda layer is published to all Regions:

    Execution history results
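
    To verify programmatically that the new layer version exists in every target Region, you can list the layer versions with a short script. This is a minimal sketch, assuming the layer name sample-layer used by the distributor function and the four Regions configured above:

    import boto3

    REGIONS = ["eu-central-1", "eu-west-1", "us-west-1", "us-east-1"]
    LAYER_NAME = "sample-layer"  # matches the LayerName used by the distributor function

    for region in REGIONS:
        lambda_client = boto3.client("lambda", region_name=region)
        versions = lambda_client.list_layer_versions(LayerName=LAYER_NAME)["LayerVersions"]
        if versions:
            latest = max(versions, key=lambda v: v["Version"])
            print(region, latest["LayerVersionArn"])
        else:
            print(region, "no versions found")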

    Cleaning up

    Many services in this blog post are available in the AWS Free Tier. However, using this solution may incur costs, and you should tear down the stack if you don't need it anymore. Cleanup steps are included in the README in the repository.

    Conclusion

    This blog post shows how to create a centralized pipeline to build and distribute Lambda layers consistently across multiple Regions. The pipeline is configurable and allows you to adapt the Regions and permissions according to your use case.

    For more serverless learning resources, visit Serverless Land.

    Modernizing deployments with container images in AWS Lambda

    Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/modernizing-deployments-with-container-images-in-aws-lambda/

    This post is written by Joseph Keating, AWS Modernization Architect, and Virginia Chu, Sr. DevSecOps Architect.

    Container image support for AWS Lambda enables developers to package function code and dependencies using familiar patterns and tools. With this pattern, developers use standard tools like Docker to package their functions as container images and deploy them to Lambda.

    In a typical deployment process for image-based Lambda functions, the container and Lambda function are created or updated in the same process. However, some use cases require developers to create the image first, and then update one or more Lambda functions from that image. In these situations, organizations may mandate that infrastructure components such as Amazon S3 and Amazon Elastic Container Registry (ECR) are centralized and deployed separately from their application deployment pipelines.

    This post demonstrates how to use AWS continuous integration and deployment (CI/CD) services and Docker to separate the container build process from the application deployment process.

    Overview

    The sample application creates two pipelines to deploy a Java application. The first pipeline uses Docker to build and deploy the container image to Amazon ECR. The second pipeline uses AWS Serverless Application Model (AWS SAM) to deploy a Lambda function based on the container from the first process.

    This shows how to build, manage, and deploy Lambda container images automatically with infrastructure as code (IaC). It also covers automatically updating or creating Lambda functions based on a container image version.

    Example architecture


    The example application uses AWS CloudFormation to configure the AWS Lambda container pipelines. Both pipelines use AWS CodePipeline, AWS CodeBuild, and AWS CodeCommit. The lambda-container-image-deployment-pipeline builds and deploys a container image to ECR. The sam-deployment-pipeline updates or deploys a Lambda function based on the new container image.

    The pipeline deploys the sample application:

    1. The developer pushes code to the main branch.
    2. An update to the main branch invokes the pipeline.
    3. The pipeline clones the CodeCommit repository.
    4. Docker builds the container image and assigns tags.
    5. Docker pushes the image to ECR.
    6. Completion of the lambda-container-image-deployment-pipeline triggers an Amazon EventBridge event.
    7. The pipeline clones the CodeCommit repository.
    8. AWS SAM builds the Lambda-based container image application.
    9. AWS SAM deploys the application to AWS Lambda.
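
    AWS SAM handles the function update in this walkthrough. For reference, if you wanted to point an existing image-based function at a new image directly, the underlying operation is a single API call. The following is a minimal boto3 sketch; the function name matches the one deployed later in this post, and the image URI is a placeholder:

    import boto3

    lambda_client = boto3.client("lambda")

    # Placeholder ECR image URI, including the tag pushed by the first pipeline.
    response = lambda_client.update_function_code(
        FunctionName="MyCustomLambdaContainer",
        ImageUri="111111111111.dkr.ecr.us-east-1.amazonaws.com/demo-java:latest",
    )
    print(response.get("LastUpdateStatus"))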

    Prerequisites

    To provision the pipeline deployment, you must have the following prerequisites:

    Infrastructure configuration

    The pipeline relies on infrastructure elements like AWS Identity and Access Management roles, S3 buckets, and an ECR repository. Due to security and governance considerations, many organizations prefer to keep these infrastructure components separate from their application deployments.

    To start, deploy the core infrastructure components using CloudFormation and the AWS CLI:

    1. Create a local directory called BlogDemoRepo and clone the source code repository found in the following location:
      mkdir -p $HOME/BlogDemoRepo
      cd $HOME/BlogDemoRepo
      git clone https://github.com/aws-samples/modernize-deployments-with-container-images-in-lambda
    2. Change directory into the cloned repository:
      cd modernize-deployments-with-container-images-in-lambda/
    3. Deploy the s3-iam-config CloudFormation template, keeping the following CloudFormation template names:
      aws cloudformation create-stack \
        --stack-name s3-iam-config \
        --template-body file://templates/s3-iam-config.yml \
        --parameters file://parameters/s3-iam-config-params.json \
        --capabilities CAPABILITY_NAMED_IAM

      The output should look like the following:

      Output example for stack creation


    Application overview

    The application uses Docker to build the container image and an ECR repository to store the container image. AWS SAM deploys the Lambda function based on the new container.

    The example application in this post uses a Java-based container image using Amazon Corretto. Amazon Corretto is a no-cost, multi-platform, production-ready Open Java Development Kit (OpenJDK).

    The Lambda container image base includes the Amazon Linux operating system and a set of base dependencies. The image also includes the Lambda Runtime Interface Client (RIC), which allows your runtime to communicate with the Lambda service. Take some time to review the Dockerfile and how it configures the Java application.

    Configure the repository

    The CodeCommit repository contains all of the configurations the pipelines use to deploy the application. To configure the CodeCommit repository:

    1. Get metadata about the CodeCommit repository created in a previous step. Run the following command from the BlogDemoRepo directory created in a previous step:
      aws codecommit get-repository \
        --repository-name DemoRepo \
        --query repositoryMetadata.cloneUrlHttp \
        --output text

      The output should look like the following:

      Output example for get repository


    2. In your terminal, paste the Git URL from the previous step and clone the repository:
      git clone <insert_url_from_step_1_output>

      You receive a warning because the repository is empty.

      Empty repository warning


    3. Create the main branch:
      cd DemoRepo
      git checkout -b main
    4. Copy all of the code from the cloned GitHub repository to the CodeCommit repository:
      cp -r ../modernize-deployments-with-container-images-in-lambda/* .
    5. Commit and push the changes:
      git add .
      git commit -m "Initial commit"
      git push -u origin main

    Pipeline configuration

    This example deploys two separate pipelines. The first, lambda-container-image-deployment-pipeline, builds and deploys a container image to ECR using Docker and the AWS CLI. An EventBridge event starts the pipeline when the CodeCommit branch is updated.

    The second pipeline, sam-deployment-pipeline, is where the container image built by lambda-container-image-deployment-pipeline is deployed to a Lambda function using AWS SAM. This pipeline is also triggered by an Amazon EventBridge event: successful completion of the lambda-container-image-deployment-pipeline invokes the second pipeline.

    Both pipelines consist of AWS CodeBuild jobs configured with a buildspec file. The buildspec file enables developers to run bash commands and scripts to build and deploy applications.

    Deploy the pipeline

    You now configure and deploy the pipelines and test the configured application in the AWS Management Console.

    1. Change directory back to the modernize-deployments-with-container-images-in-lambda directory and deploy the lambda-container-pipeline CloudFormation template:
      cd $HOME/BlogDemoRepo/modernize-deployments-with-container-images-in-lambda/
      aws cloudformation create-stack \
        --stack-name lambda-container-pipeline \
        --template-body file://templates/lambda-container-pipeline.yml \
        --parameters file://parameters/lambda-container-params.json  \
        --capabilities CAPABILITY_IAM \
        --region us-east-1

      The output appears:

      Output example for stack creation

    2. Wait for the lambda-container-pipeline stack from the previous step to complete and deploy the sam-deployment-pipeline CloudFormation template:
      aws cloudformation create-stack \
        --stack-name sam-deployment-pipeline \
        --template-body file://templates/sam-deployment-pipeline.yml \
        --parameters file://parameters/sam-deployment-params.json  \
        --capabilities CAPABILITY_IAM \
        --region us-east-1

      The output appears:

      Output example of stack creation

    3. In the console, navigate to CodePipeline and select Pipelines:

    4. Wait for the status of both pipelines to show Succeeded:
    5. Navigate to the ECR console and choose demo-java. This shows that the pipeline built the container image and pushed it to ECR.
    6. Navigate to the Lambda console and choose the MyCustomLambdaContainer function.
    7. The Image configuration panel shows that the function is configured to use the image created earlier.
    8. To test the function, choose Test.
    9. Keep the default settings and choose Test.

    This completes the walkthrough. To further test the workflow, modify the Java application and commit and push your changes to the main branch. You can then review the updated resources you have deployed.

    Conclusion

    This post shows how to use AWS services to automate the creation of Lambda container images. Using CodePipeline, you create a CI/CD pipeline for updates and deployments of Lambda container images. You then test the Lambda container image in the AWS Management Console.

    For more serverless content visit Serverless Land.

    Cloudflare Pages now partners with your favorite CMS

    Post Syndicated from Nevi Shah original https://blog.cloudflare.com/cloudflare-pages-headless-cms-partnerships/

    Interest in headless CMSes has seen spectacular growth over the past few years, with many businesses looking to adopt the tooling. As audiences consume content through new interfaces taking different forms — smartphones, wearables, personal devices — the idea of decoupling content from its backend begins to provide a better experience both for development teams and end users. Because of this, we believe there are and will be more opportunities to use headless CMSes, which is why today we’re thrilled to announce our partnerships with Sanity and Strapi and to share existing integrations with Contentful and WordPress — all your favorite CMS providers.

    A little on headless CMSes

    Headless CMSes are one of the most common API integrations we’ve seen so far among you and your teams — whether it’s for your marketing site, blog, or e-commerce site. A headless CMS lets your team enter site content through a user-friendly interface and store it in a database, so that updates can easily be made to your site without touching the code base. As a Jamstack platform, a big part of our roadmap is understanding how we can build our own tools or provide integrations for tools that fit into your development ecosystem and Pages, which is why in August this year we announced Pages support for Deploy Hooks.

    What’s a hook got to do with it?

    Deploy Hooks are what allow you to connect and trigger deployments in Pages from updates made in your headless CMS. Instead of developers getting pinged several times a day to make content updates to your site, your marketing team can update the site directly within the headless CMS’s interface by way of a Deploy Hook. This is a URL created on Pages that accepts an HTTP POST request to trigger new deployments outside the realm of your git commands. You can configure settings within your CMS to call the Deploy Hook so that any time content is updated within your CMS, a new deployment starts automatically in the Pages dashboard — it couldn’t be any easier!
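
    If you want to see what the trigger itself looks like, here is a minimal sketch in Python that sends the HTTP POST manually, assuming the requests library is installed. The hook URL below is only a placeholder for the URL that Pages generates; in practice your CMS sends this request for you once the webhook is configured.

      import requests

      # Placeholder for the Deploy Hook URL generated in the Pages dashboard.
      # The URL itself acts as the secret, so no additional authentication is sent.
      DEPLOY_HOOK_URL = "https://api.cloudflare.com/client/v4/pages/webhooks/deploy_hooks/<hook-id>"

      # An empty POST to the hook URL starts a new deployment of the configured branch.
      response = requests.post(DEPLOY_HOOK_URL)
      print("Deploy Hook responded with status", response.status_code)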

    How can I create a Deploy Hook?

    Within the Pages interface, there are two things you need to do to create your Deploy Hook:

    1. Choose your Deploy Hook name: you can name your Deploy Hook anything you’d like.
    2. Select the branch to build: you can specify which branch is built and deployed when the Deploy Hook URL receives a request.

    Once Pages generates your Deploy Hook URL, you’re ready to configure a webhook within your chosen CMS and paste in that URL.

    That’s it! Now leave it to your marketing team to update their rich content and watch the builds trigger automatically to update your site!

    Our partners

    Of course, Deploy Hooks are just a starting point for ways we can provide a better dev experience for your team when using the headless CMS of your choice with your Pages site. But our story of integrations does not stop here. Introducing our CMS partners and integrations: Sanity, Strapi, Contentful, and WordPress!

    We continue to see the highest usage rates of these four CMSes on Pages among you and your teams, and in the months to come we’ll be working closely with our partners to build even more for you.

    We’re delighted to partner with Cloudflare and excited by this new release from Cloudflare Pages. At Sanity, we care deeply about people working with content on our platform. Cloudflare’s new deploy hooks allow developers to automate builds for static sites based on content changes, which is a huge improvement for content creators. Combining these with structured content and our GROQ-powered Webhooks, our customers can be strategic about when these builds should happen. We’re stoked to be part of this release and can’t wait to see what the community will build with Sanity and Cloudflare!
    Even Westvang, Co-founder, Sanity.io

    Check out Sanity’s video tutorial on how to build your site using Pages and Sanity!

    At Strapi, we’re excited about this partnership with Cloudflare because it enables developers to abstract away the complexity of deploying changes to production for content teams. By using the Deploy Hook for Strapi, everyone can push new content and updates quickly and autonomously.
    Pierre Burgy, CEO, Strapi.io

    With this integration, our customers can work more efficiently and cross-functionally with their teams. Marketing teams can update Contentful directly to automatically trigger deployments without relying on their developers to update any of their code base, which results in a better, more productive experience for all teams involved.
    – Jeff Blattel, Director of Technical Partnerships at Contentful

    Get started

    For now, to learn more about how you can connect your Pages project to one of our partner CMSes, check out our Deploy Hooks documentation to deploy your first project today!

    Cloudflare Pages now offers Gitlab support

    Post Syndicated from Nevi Shah original https://blog.cloudflare.com/cloudflare-pages-partners-with-gitlab/

    In the early stages of our ideation of Pages, we set out to build a platform with a smooth developer experience that integrates seamlessly with your existing workflow. However, after announcing Pages’ general availability, we realized our platform may not actually be usable by every developer. Before today, only those of you who used GitHub as your source code management tool could take advantage of the Pages experience.

    As part of Full Stack Week, we’re opening the doors of our platform to even more users by announcing our integration with GitLab, the DevOps platform! You can now create new Pages projects by connecting your repos stored on GitLab and make site changes there via your usual git commands. And what’s more? We’re also launching an official partnership with GitLab to bring you even better integrations with the git provider in the months to come.

    Why GitLab?

    As a Jamstack platform, our goal is to enable you, the developer, to focus on what you do best — code, code, code — without the heavy lifting! Not only does this mean giving you all the tools you need to build out a full stack site, but also providing you with integrations that fit your development needs. By expanding our platform ecosystem to GitLab, Cloudflare can now serve the needs of a broader developer community collaborating on their sites.

    Since our April launch, one of the most common questions and pieces of feedback we’ve received in customer calls, on Discord/Twitter, and on our community threads centered around GitLab. We knew our git integration story couldn’t just stop at one provider, especially given the diversity in tooling we see among our community. So it became glaringly obvious we needed to extend Pages to the GitLab community.

    Our partnership

    Today, we’re proud to now be official technology partners with GitLab Inc. In addition to our git integration, the goal of our partnership is to improve existing and develop future integrations, so your teams can seamlessly collaborate and accelerate site delivery and updates at scale. As you begin using Pages with GitLab, our teams will be working closely together in a cross-collaborative approach for new integrations.

    Developers can be more productive when they create, test, secure and deploy software from a single devops application instead of bouncing between multiple different tools. Cloudflare Pages’ integration with GitLab makes it easier for joint users to develop and deploy new code to Cloudflare’s network using the same syntax and git commands they’re already comfortable using.
    — Michael LeBeau, Alliance Manager at GitLab

    Get started

    To set up your first project with GitLab, just create a new project in the Pages dashboard. Select “GitLab” and Pages will bring you to your GitLab sign-in screen where you can sign in to your account. Then, select the repo with which you’d like to create your project, configure your build settings, and deploy! From here, you can begin making changes to your site directly via commits to GitLab, triggering a new build every time.

    Have questions? To get started, check out the Pages docs and be sure to leave us some feedback by clicking the “Give Feedback” button there. Show us what you build by joining the chatter in our Discord channel.

    Happy developing!

    Understanding how AWS Lambda scales with Amazon SQS standard queues

    Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/understanding-how-aws-lambda-scales-when-subscribed-to-amazon-sqs-queues/

    This post is written by John Lee, Solutions Architect, and Isael Pelletier, Senior Solutions Architect.

    Many architectures use Amazon SQS, a fully managed message queueing service, to decouple producers and consumers; it is a fundamental building block for decoupled architectures. AWS Lambda, which is also fully managed, is a common choice of consumer because it supports native integration with SQS. This combination of services allows you to write and maintain less code and unburdens you from the heavy lifting of server management.

    This blog post looks at optimizing the scaling behavior of Lambda functions subscribed to an SQS standard queue. It discusses how Lambda scaling works in this configuration, reviews best practices for maximizing message throughput, and gives you a starting point for building your own scalable, well-architected workloads.

    Scaling Lambda functions

    When a Lambda function subscribes to an SQS queue, Lambda polls the queue as it waits for messages to arrive. Lambda consumes messages in batches, starting at five concurrent batches with five functions at a time.

    If there are more messages in the queue, Lambda adds up to 60 functions per minute, up to 1,000 functions, to consume those messages. This means that Lambda can scale up to 1,000 concurrent Lambda functions processing messages from the SQS queue.

    Figure 1. The Lambda service polls the SQS queue and batches messages that are processed by automatic scaling Lambda functions. You start with five concurrent Lambda functions.

    This scaling behavior is managed by AWS and cannot be modified. To process more messages, you can optimize your Lambda configuration for higher throughput. There are several strategies you can implement to do this.

    Increase the allocated memory for your Lambda function

    The simplest way to increase throughput is to increase the allocated memory of the Lambda function. While you do not have control over the scaling behavior of the Lambda functions subscribed to an SQS queue, you control the memory configuration.

    Faster Lambda functions can process more messages and increase throughput. This works even if a Lambda function’s memory utilization is low. This is because increasing memory also increases vCPUs in proportion to the amount configured. Each function now supports up to 10 GB of memory and you can access up to six vCPUs per function.
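
    As a sketch of what that change looks like outside the console, the following uses the AWS SDK for Python (Boto3) to raise the memory allocation of a hypothetical function named my-sqs-consumer; the name and value are illustrative.

      import boto3

      lambda_client = boto3.client("lambda")

      # Raising MemorySize also raises the vCPU allocation proportionally.
      lambda_client.update_function_configuration(
          FunctionName="my-sqs-consumer",  # hypothetical function name
          MemorySize=2048,                 # in MB; Lambda supports up to 10,240 MB
      )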

    You may need to modify code to take advantage of the extra vCPUs. This consists of implementing multithreading or parallel processing to use all the vCPUs. You can find a Python example in this blog post.
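
    As a rough illustration of that pattern, the sketch below fans the records of a batch out across threads using only the standard library. The process_record function is a placeholder for your per-message logic, and threads help mainly with I/O-bound work; CPU-bound work would need process-based parallelism instead.

      from concurrent.futures import ThreadPoolExecutor

      def process_record(record):
          # Placeholder: parse and handle a single SQS record.
          return record["messageId"]

      def handler(event, context):
          # Fan the batch out across worker threads to use the extra vCPUs.
          with ThreadPoolExecutor(max_workers=4) as executor:
              results = list(executor.map(process_record, event["Records"]))
          return {"processed": len(results)}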

    To see the average cost and execution speed for each memory configuration before making a decision, the Lambda Power Tuning tool helps you visualize the tradeoffs.

    Optimize batching behavior

    Batching can increase message throughput. By default, Lambda batches up to 10 messages in a queue to process them during a single Lambda execution. You can increase this number up to 10,000 messages, or up to 6 MB of messages in a single batch for standard SQS queues.

    If each payload size is 256 KB (the maximum message size for SQS), Lambda can only take 23 messages per batch, regardless of the batch size setting. Similar to increasing memory for a Lambda function, processing more messages per batch can increase throughput.

    However, increasing the batch size does not always achieve this. It is important to understand how message batches are processed: all messages in a failed batch return to the queue. This means that if a Lambda function processing a batch of five messages fails on the third message, all five messages are returned to the queue, including the two that were processed successfully. The Lambda function code must be able to process the same message multiple times without side effects.

    Figure 2. A Lambda function returning all five messages to the queue after failing to process the third message.

    To prevent successfully processed messages from being returned to SQS, you can add code to delete the processed messages from the queue manually. You can also use existing open source libraries, such as Lambda Powertools for Python or Lambda Powertools for Java, that provide this functionality.
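
    A minimal sketch of the manual approach follows, assuming a hypothetical queue URL and a placeholder process_record function. Each message is deleted as soon as it succeeds, so a later failure in the batch does not return it to the queue.

      import boto3

      sqs = boto3.client("sqs")
      QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/<account-id>/<queue-name>"  # placeholder

      def process_record(record):
          # Placeholder: handle a single SQS record; raise an exception on failure.
          pass

      def handler(event, context):
          for record in event["Records"]:
              process_record(record)
              # Delete immediately after success so a later failure in the batch
              # does not return this message to the queue.
              sqs.delete_message(
                  QueueUrl=QUEUE_URL,
                  ReceiptHandle=record["receiptHandle"],
              )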

    Catch errors from the Lambda function

    The Lambda service scales to process messages from the SQS queue when there are sufficient messages in the queue.

    However, there is a case where Lambda scales down the number of functions even when there are messages remaining in the queue: when a Lambda function throws errors. The service scales the function down to minimize erroneous invocations. To sustain or increase the number of concurrent Lambda functions, you must catch the errors so the function exits successfully.

    To retain the failed messages, use an SQS dead-letter queue (DLQ). There are caveats to this approach. Catching errors without proper error handling and tracking mechanisms can result in errors being ignored instead of raising alerts. This may lead to silent failures while Lambda continues to scale and process messages in the queue.
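
    One hedged pattern for this is sketched below: per-record exceptions are caught so the invocation exits successfully and Lambda keeps scaling, while each failure is logged and the record is forwarded to a separate queue that you monitor. The queue URL and process_record function are placeholders, and this is only one possible way to track failures alongside a DLQ.

      import logging

      import boto3

      logger = logging.getLogger()
      logger.setLevel(logging.INFO)

      sqs = boto3.client("sqs")
      FAILED_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/<account-id>/<failure-queue>"  # placeholder

      def process_record(record):
          # Placeholder: handle a single SQS record; raise an exception on failure.
          pass

      def handler(event, context):
          for record in event["Records"]:
              try:
                  process_record(record)
              except Exception:
                  # Log and track the failure instead of letting the invocation error out.
                  logger.exception("Failed to process message %s", record["messageId"])
                  sqs.send_message(QueueUrl=FAILED_QUEUE_URL, MessageBody=record["body"])
          return {"batchSize": len(event["Records"])}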

    Relevant Lambda configurations

    There are several Lambda configuration settings to consider for optimizing Lambda’s scaling behavior. Paying attention to the following configurations can help prevent throttling and increase throughput of your Lambda function.

    Reserved concurrency

    If you use reserved concurrency for a Lambda function, set this value to at least five. This value sets the maximum number of concurrent instances of the function that can run. Lambda allocates five functions to consume five batches at a time, so if the reserved concurrency value is lower than five, the function is throttled when it tries to process more than this value concurrently.
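
    For illustration, this sketch sets reserved concurrency with Boto3 on a hypothetical function; the value of 50 is arbitrary, as long as it is at least five.

      import boto3

      lambda_client = boto3.client("lambda")

      # Keep reserved concurrency at five or above so the initial five concurrent
      # batches are not throttled.
      lambda_client.put_function_concurrency(
          FunctionName="my-sqs-consumer",  # hypothetical function name
          ReservedConcurrentExecutions=50,
      )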

    Batching Window

    For larger batch sizes, set the MaximumBatchingWindowInSeconds parameter to at least 1 second. This is the maximum amount of time that Lambda spends gathering records before invoking the function. If this value is too small, Lambda may invoke the function with a batch smaller than the batch size. If this value is too large, Lambda polls for a longer time before processing the messages in the batch. You can adjust this value to see how it affects your throughput.
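
    The sketch below shows where these two settings live when creating the SQS event source mapping with Boto3; the queue ARN, function name, and values are illustrative.

      import boto3

      lambda_client = boto3.client("lambda")

      lambda_client.create_event_source_mapping(
          EventSourceArn="arn:aws:sqs:us-east-1:<account-id>:<queue-name>",  # placeholder ARN
          FunctionName="my-sqs-consumer",                                    # hypothetical function name
          BatchSize=1000,                    # batch sizes above 10 require a batching window
          MaximumBatchingWindowInSeconds=1,
      )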

    Queue visibility timeout

    All SQS messages have a visibility timeout that determines how long the message is hidden from the queue after being picked up by a consumer. If the message is not successfully processed and deleted, it reappears in the queue when the visibility timeout ends.

    Give your Lambda function enough time to process the message by setting the visibility timeout based on your function-specific metrics. You should set this value to six times the Lambda function timeout plus the value of MaximumBatchingWindowInSeconds. This prevents other functions from unnecessarily processing the messages while the message is already being processed.
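
    Applied to a hypothetical function with a 30-second timeout and a one-second batching window, the guideline works out as in the following sketch; the queue URL is a placeholder.

      import boto3

      sqs = boto3.client("sqs")

      function_timeout = 30  # seconds; hypothetical Lambda function timeout
      batching_window = 1    # seconds; MaximumBatchingWindowInSeconds
      visibility_timeout = 6 * function_timeout + batching_window  # 181 seconds

      sqs.set_queue_attributes(
          QueueUrl="https://sqs.us-east-1.amazonaws.com/<account-id>/<queue-name>",  # placeholder
          Attributes={"VisibilityTimeout": str(visibility_timeout)},
      )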

    Dead-letter queues (DLQs)

    Failed messages are placed back in the queue to be retried by Lambda. To prevent failed messages from getting added to the queue multiple times, designate a DLQ and send failed messages there.

    The number of times a message is retried is set by the Maximum receives value configured for the DLQ. Once a message has been received more than the Maximum receives value, it is moved to the DLQ. You can then process this message at a later time.

    This allows you to avoid situations where many failed messages are continuously placed back into the queue, consuming Lambda resources. Failed messages scale down the Lambda function and add the entire batch to the queue, which can worsen the situation. To ensure smooth scaling of the Lambda function, move repeatedly failing messages to the DLQ.
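
    As a sketch, attaching a DLQ with Boto3 means setting a redrive policy on the source queue; the ARN, queue URL, and Maximum receives value below are placeholders.

      import json

      import boto3

      sqs = boto3.client("sqs")

      redrive_policy = {
          "deadLetterTargetArn": "arn:aws:sqs:us-east-1:<account-id>:<dlq-name>",  # placeholder DLQ ARN
          "maxReceiveCount": "5",  # the "Maximum receives" value described above
      }

      sqs.set_queue_attributes(
          QueueUrl="https://sqs.us-east-1.amazonaws.com/<account-id>/<queue-name>",  # placeholder source queue
          Attributes={"RedrivePolicy": json.dumps(redrive_policy)},
      )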

    Conclusion

    This post explores Lambda’s scaling behavior when subscribed to SQS standard queues. It walks through several ways to scale faster and maximize Lambda throughput when needed, including increasing the memory allocation for the Lambda function, increasing the batch size, catching errors, and making configuration changes. A better understanding of the levers available in the SQS and Lambda integration can help you meet your scaling needs.

    To learn more about building decoupled architectures, see these videos on Amazon SQS. For more serverless learning resources, visit https://serverlessland.com.