Tag Archives: AWS Lambda

Sending mobile push notifications and managing device tokens with serverless applications

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/sending-mobile-push-notifications-and-managing-device-tokens-with-serverless-application/

This post is written by Rafa Xu, Cloud Architect, Serverless and Joely Huang, Cloud Architect, Serverless.

Amazon Simple Notification Service (SNS) is a fast, flexible, fully managed push messaging service in the cloud. SNS can send mobile push notifications directly to applications on mobile devices such as message alerts and badge updates. SNS sends push notifications to a mobile endpoint created by supplying a mobile token and platform application.

When publishing mobile push notifications, a device token is used to generate an endpoint. This identifies where the push notification is sent (target destination). To push notifications successfully, the token must be up to date and the endpoint must be validated and enabled.

A common challenge when pushing notifications is keeping the token up to date. Tokens can automatically change due to reasons such as mobile operating system (OS) updates and application store updates.

This post provides a serverless solution to this challenge. It also provides a way to publish push notifications to specific end users by maintaining a mapping between users, endpoints, and tokens.

Overview

To publish mobile push notifications using SNS, generate an SNS endpoint to use as a destination target for the push notification. To create the endpoint, you must supply:

  1. A mobile application token: The mobile operating system (OS) issues the token to the application. It is a unique identifier for the application and mobile device pair.
  2. Platform Application Amazon Resource Name (ARN): SNS provides this ARN when you create a platform application object. The platform application object requires a valid set of credentials issued by the mobile platform, which you provide to SNS.

Once the endpoint is generated, you can store and reuse it. This prevents the application from creating endpoints indefinitely, which could exhaust the SNS endpoint limit.

To reuse the endpoints and successfully push notifications, there are a number of challenges:

  • Mobile application tokens can change for a number of reasons, such as application updates. As a result, the publisher must update the platform endpoint to ensure it uses an up-to-date token.
  • Mobile application tokens can become invalid. When this happens, messages are not published, and SNS disables the endpoint with the invalid token. To resolve this, publishers must retrieve a valid token and re-enable the platform endpoint.
  • Mobile applications can have many users; each user can have multiple devices, or one device can be shared by multiple users. To send a push notification to a specific user, a mapping between the user, device, and platform endpoints should be maintained.

For more information on best practices for managing mobile tokens, refer to this post.

Follow along with this blog post to learn how to implement a serverless workflow for managing and maintaining valid endpoints and user mappings.

Solution overview

The solution uses the following AWS services:

  • Amazon API Gateway: Provides a token registration endpoint URL used by the mobile application. Once called, it invokes an AWS Lambda function via the Lambda integration.
  • Amazon SNS: Generates and maintains the target endpoint and manages platform application objects.
  • Amazon DynamoDB: Serverless database for storing endpoints that also maintains a mapping between the user, endpoint, and mobile operating system.
  • AWS Lambda: Retrieves endpoints from DynamoDB, validates and generates endpoints, and publishes notifications by making requests to SNS.

The following diagram represents a simplified interaction flow between the AWS services:

Solution architecture

To register the token, the mobile app invokes the token registration endpoint URL generated by Amazon API Gateway. The token registration happens every time a user logs in or opens the application. This ensures that the token and endpoints are always valid during application usage.

The mobile application passes the token, user, and mobileOS as parameters to API Gateway, which forwards the request to the Lambda function.
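
For reference, a registration request might look like the following test-client sketch. The endpoint URL and field names (username, token, appos) are assumptions chosen to match the DynamoDB attributes used later in this post; adjust them to match your deployed API.

# Test-client sketch (not part of the sample application): registers a device
# token against the API Gateway endpoint. The URL and field names are
# placeholders; align them with your deployed API.
import json
import urllib.request

API_URL = "https://example.execute-api.us-east-1.amazonaws.com/Prod/register"  # hypothetical URL

payload = {
    "username": "alice",                 # application user identifier
    "token": "APA91bE...deviceToken",    # token issued by the mobile OS
    "appos": "ios",                      # mobile operating system
}

req = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode("utf-8"))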

The Lambda function validates the token and endpoint for the user by making API calls to DynamoDB and SNS:

  1. The Lambda function checks DynamoDB to see if the endpoint has been previously created.
    1. If the endpoint does not exist, it creates a platform endpoint via SNS.
  2. Obtain the endpoint attributes from SNS:
    1. Check the “enabled” endpoint attribute and set it to “true” to enable the platform endpoint, if necessary.
    2. Validate the “token” endpoint attribute with the token provided in the API Gateway request. If it does not match, update the “token” attribute.
    3. Send a request to SNS to update the endpoint attributes.
  3. If a new endpoint is created, update DynamoDB with the new endpoint.
  4. Return a successful response to API Gateway.

Deploying the AWS Serverless Application Model (AWS SAM) template

Use the AWS SAM template to deploy the infrastructure for this workflow. Before deploying the template, first create a platform application in SNS.

  1. Navigate to the SNS console. Select Push Notifications on the left-hand menu to create a platform application:
    Mobile push notifications
  2. This shows the creation of a platform application for iOS applications:
    Create platform application
  3. To install AWS SAM, visit the installation page.
  4. To deploy the AWS SAM template, navigate to the directory where the template is located. Run the commands in the terminal:
    git clone https://github.com/aws-samples/serverless-mobile-push-notification
    cd serverless-mobile-push-notification
    sam build
    sam deploy --guided

Lambda function code snippets

The following section explains code from the Lambda function for the workflow.

Create the platform endpoint

If the endpoint exists, store it as a variable in the code. If the platform endpoint does not exist in the DynamoDB database, create a new endpoint:

        need_update_ddb = False
        response = table.get_item(Key={'username': username, 'appos': appos})
        if 'Item' not in response:
            # create endpoint
            response = snsClient.create_platform_endpoint(
                PlatformApplicationArn=SUPPORTED_PLATFORM[appos],
                Token=token,
            )
            devicePushEndpoint = response['EndpointArn']
            need_update_ddb = True
        else:
            # reuse the existing endpoint stored in DynamoDB
            devicePushEndpoint = response['Item']['endpoint']

Check and update endpoint attributes

Check that the token attribute for the platform endpoint matches the token received from the mobile application through the request. This also checks for the endpoint “enabled” attribute and re-enables the endpoint if necessary:

        response = snsClient.get_endpoint_attributes(
            EndpointArn=devicePushEndpoint
        )
        endpointAttributes = response['Attributes']

        previousToken = endpointAttributes['Token']
        previousStatus = endpointAttributes['Enabled']
        if previousStatus.lower() != 'true' or previousToken != token:
            snsClient.set_endpoint_attributes(
                EndpointArn=devicePushEndpoint,
                Attributes={
                    'Token': token,
                    'Enabled': 'true'
                }
            )

Update the DynamoDB table with the newly generated endpoint

If a platform endpoint is newly created, meaning there is no item in the DynamoDB table, create a new item in the table:

        if need_update_ddb:
            table.update_item(
                Key={
                    'username': username,
                    'appos': appos
                },
                UpdateExpression="set endpoint=:e",
                ExpressionAttributeValues={
                    ':e': devicePushEndpoint
                },
                ReturnValues="UPDATED_NEW"
            )

As a best practice, the code cleans up the table in case there are multiple entries for the same endpoint mapped to different users. This can happen when the mobile application is used by multiple users on the same device. When one user logs out and a different user logs in, this creates a new entry in the DynamoDB table to map the endpoint with the new user.

As a result, you must remove the entry that maps the same endpoint to the previously logged in user. This way, you only keep the endpoint that matches the user provided by the mobile application through the request.

        result = table.query(
            # Add the name of the index you want to use in your query.
            IndexName="endpoint-index",
            KeyConditionExpression=Key('endpoint').eq(devicePushEndpoint),
        )
        for item in result['Items']:
            if item['username'] != username and item['appos'] == appos:
                print(f"deleting orphan item: username {item['username']}, os {appos}")
                table.delete_item(
                    Key={
                        'username': item['username'],
                        'appos': appos
                    },
                )

Conclusion

This blog shows how to deploy a serverless solution for validating and managing SNS platform endpoints and tokens. To publish push notifications successfully, use SNS to check the endpoint attributes and ensure that the endpoint is mapped to the correct token and is enabled.

This approach uses DynamoDB to store the device token and platform endpoints for each user. This allows you to send push notifications to specific users and to retrieve and reuse previously created endpoints. You create a Lambda function to facilitate the workflow, including validating and updating the DynamoDB item so that it stores an enabled endpoint with an up-to-date token.

Visit this link to learn more about Amazon SNS mobile push notifications: http://docs.aws.amazon.com/sns/latest/dg/SNSMobilePush.html

For more serverless learning resources, visit Serverless Land.

Improve the performance of Lambda applications with Amazon CodeGuru Profiler

Post Syndicated from Vishaal Thanawala original https://aws.amazon.com/blogs/devops/improve-performance-of-lambda-applications-amazon-codeguru-profiler/

As businesses expand applications to reach more users and devices, they risk higher latencies that could degrade the customer experience. Slow applications can frustrate or even turn away customers. Developers therefore increasingly face the challenge of ensuring that their code runs as efficiently as possible, since application performance can make or break businesses.

Amazon CodeGuru Profiler helps developers improve their application speed by analyzing its runtime. CodeGuru Profiler analyzes single and multi-threaded applications and then generates visualizations to help developers understand the latency sources. Afterwards, CodeGuru Profiler provides recommendations to help resolve the root cause.

CodeGuru Profiler recently began providing recommendations for applications written in Python. Additionally, the new automated onboarding process for AWS Lambda functions makes it even easier to use CodeGuru Profiler with serverless applications built on AWS Lambda.

This post highlights these new features by explaining how to set up and utilize CodeGuru Profiler on an AWS Lambda function written in Python.

Prerequisites

This post focuses on improving the performance of an application written with AWS Lambda, so it’s important to understand the Lambda functions that work best with CodeGuru Profiler. You will get the most out of CodeGuru Profiler on long-duration Lambda functions (>10 seconds) or on frequently invoked, shorter-duration Lambda functions (~100 milliseconds). Because CodeGuru Profiler requires five minutes of runtime data before the Lambda container is recycled, very short-duration Lambda functions with an execution time of 1-10 milliseconds may not provide sufficient data for CodeGuru Profiler to generate meaningful results.

The automated CodeGuru Profiler onboarding process, which automatically creates the profiling group for you, supports Lambda functions running on Java 8 (Amazon Corretto), Java 11, and Python 3.8 runtimes. Additional runtimes, without the automated onboarding process, are supported and can be found in the Java documentation and the Python documentation.
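
For runtimes outside the automated onboarding flow, the Python documentation describes adding the agent manually. A minimal sketch, assuming the codeguru_profiler_agent package is bundled with the function and a profiling group named myProfilingGroup already exists:

# Manual onboarding sketch for a Python Lambda function. Assumes the
# codeguru_profiler_agent package is packaged with the function and the
# profiling group "myProfilingGroup" exists (see the Python documentation).
from codeguru_profiler_agent import with_lambda_profiler

@with_lambda_profiler(profiling_group_name="myProfilingGroup")
def handler(event, context):
    # Your existing business logic goes here.
    return {"statusCode": 200}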

Getting Started

Let’s quickly demonstrate the new Lambda onboarding process and the new Python recommendations. This example assumes you have already created a Lambda function, so we will just walk through the process of turning on CodeGuru Profiler and viewing results. If you don’t already have a Lambda function created, you can create one by following these setup instructions. If you would like to replicate this example, the code we used can be found on GitHub here.

  1. On the AWS Lambda Console page, open your Lambda function. For this example, we’re using a function with a Python 3.8 runtime.

 This image shows the Lambda console page for the Lambda function we are referencing.

  2. Navigate to the Configuration tab, go to the Monitoring and operations tools page, and click Edit on the right side of the page.

This image shows the instructions to open the page to turn on profiling

  3. Scroll down to “Amazon CodeGuru Profiler” and click the button next to “Code profiling” to turn it on. After enabling Code profiling, click Save.

This image shows how to turn on CodeGuru Profiler

  4. Verify that CodeGuru Profiler has been turned on within the Monitoring and operations tools page.

This image shows how to validate Code profiling has been turned on

That’s it! You can now navigate to CodeGuru Profiler within the AWS console and begin viewing results.

Viewing your results

CodeGuru Profiler requires 5 minutes of Lambda runtime data to generate results. After your Lambda function provides this runtime data, which may require multiple invocations if your Lambda function has a short runtime, it will display within the “Profiling group” page in the CodeGuru Profiler console. The profiling group will be given a default name (i.e., aws-lambda-<lambda-function-name>), and it will take approximately 15 minutes after CodeGuru Profiler receives the runtime data before it appears on this page.

This image shows where you can see the profiling group created after turning on CodeGuru Profiler on your Lambda function

After the profile appears, customers can view their profiling results by analyzing the flame graphs. Additionally, after approximately 1 hour, customers will receive their first recommendation set (if applicable). For more information on how to read the CodeGuru Profiler results, see Investigating performance issues with Amazon CodeGuru Profiler.

The images below show the two flame graphs (CPU Utilization and Latency) generated from profiling the Lambda function. Note that the highlighted horizontal bar (also referred to as a frame) in both images corresponds to one of the three frames that generate a recommendation. We’ll dive into more details on the recommendations in the following sections.

CPU Utilization Flame Graph:

This image shows the CPU Utilization flame graph for the Lambda function

Latency Flame Graph:

This image shows the Latency flame graph for the Lambda function

Here are the three recommendations generated from the above Lambda function:

This image shows the 3 recommendations generated by CodeGuru Profiler

Addressing a recommendation

Let’s dive further into an example recommendation. The first recommendation above notes that the Lambda function is spending more than the normal amount of runnable time (6.8% vs. <1%) creating AWS SDK service clients. It recommends ensuring that the function doesn’t unnecessarily create AWS SDK service clients, which wastes CPU time.

This image shows the detailed recommendation for 'Recreation of AWS SDK service clients'

Based on the suggested resolution step, we made a quick and easy code change, moving the client creation outside of the lambda-handler function. This ensures that we don’t create unnecessary AWS SDK clients. The code change below shows how we would resolve the issue.

...
s3_client = boto3.client('s3')
cw_client = boto3.client('cloudwatch')

def lambda_handler(event, context):
...

Reviewing your savings

After making each of the three changes recommended by CodeGuru Profiler, look at the new flame graphs to see how the changes impacted the application’s profile. You’ll notice below that we no longer see the previously wide frames for the boto3 clients, put_metric_data, or the logger in the S3 API call.

CPU Utilization Flame Graph:

This image shows the CPU Utilization flame graph after making the recommended changes

Latency Flame Graph:

This image shows the Latency flame graph after making the recommended changes

Moreover, we can run the Lambda function for one day (1,439 invocations) and review the results in Lambda Insights to understand our total savings. After every recommendation was addressed, this Lambda function, with 128 MB of memory and a 10-second timeout, decreased in CPU time by 10% and dropped in maximum memory usage and network IO, leading to a 30% drop in GB-s. Decreasing GB-s leads to a 30% lower duration cost, as explained in AWS Lambda Pricing.

Latency (before and after CodeGuru Profiler):

The graph below displays the duration change while running the Lambda function for 1 day.

This image shows the duration change while running the Lambda function for 1 day.

Cost (before and after CodeGuru Profiler):

The graph below displays the function cost change while running the lambda function for 1 day.

This image shows the function cost change while running the lambda function for 1 day.

Conclusion

This post showed how developers can easily onboard onto CodeGuru Profiler in order to improve the performance of their serverless applications built on AWS Lambda. Get started with CodeGuru Profiler by visiting the CodeGuru console.

All across Amazon, numerous teams have utilized CodeGuru Profiler’s technology to generate performance optimizations for customers. It has also reduced infrastructure costs, saving millions of dollars annually.

Onboard your Python applications onto CodeGuru Profiler by following the instructions in the documentation. If you’re interested in the Python agent, note that it is open sourced on GitHub. For more demo applications using the Python agent, check out the GitHub repository for additional samples.

About the authors

Headshot of Vishaal

Vishaal is a Product Manager at Amazon Web Services. He enjoys getting to know his customers’ pain points and transforming them into innovative product solutions.

Headshot of Mirela

Mirela is a Software Development Engineer at Amazon Web Services, part of the CodeGuru Profiler team in London. Previously, she worked on Amazon.com teams, and she enjoys working on products that help customers improve their code and infrastructure.

Headshot of Parag

Parag is a Software Development Engineer at Amazon Web Services, part of the CodeGuru Profiler team in Seattle. Previously, he worked on Amazon.com’s catalog and selection team. He enjoys working on large-scale distributed systems and products that help customers build scalable and cost-effective applications.

Building well-architected serverless applications: Optimizing application performance – part 2

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-optimizing-application-performance-part-2/

This series of blog posts uses the AWS Well-Architected Tool with the Serverless Lens to help customers build and operate applications using best practices. In each post, I address the serverless-specific questions identified by the Serverless Lens along with the recommended best practices. See the introduction post for a table of contents and explanation of the example application.

PERF 1. Optimizing your serverless application’s performance

This post continues part 1 of this performance question. Previously, I cover measuring and optimizing function startup time. I explain cold and warm starts and how to reuse the Lambda execution environment to improve performance. I show a number of ways to analyze and optimize the initialization startup time. I explain how importing only the necessary libraries and dependencies increases application performance.

Good practice: Design your function to take advantage of concurrency via asynchronous and stream-based invocations

AWS Lambda functions can be invoked synchronously and asynchronously.

Favor asynchronous over synchronous request-response processing.

Consider using asynchronous event processing rather than synchronous request-response processing. You can use asynchronous processing to aggregate queues, streams, or events for more efficient processing time per invocation. This reduces wait times and latency from requesting apps and functions.

When you invoke a Lambda function with a synchronous invocation, you wait for the function to process the event and return a response.

Synchronous invocation

As synchronous processing involves a request-response pattern, the client caller also needs to wait for a response from a downstream service. If the downstream service then needs to call another service, you end up chaining calls that can impact service reliability, in addition to response times. For example, this POST /order request must wait for the response to the POST /invoice request before responding to the client caller.

Example synchronous processing

The more services you integrate, the longer the response time, and you can no longer sustain complex workflows using synchronous transactions.

Asynchronous processing allows you to decouple the request-response using events without waiting for a response from the function code. This allows you to perform background processing without requiring the client to wait for a response, improving client performance. You pass the event to an internal Lambda queue for processing and Lambda handles the rest. An external process, separate from the function, manages polling and retries. Using this asynchronous approach can also make it easier to handle unpredictable traffic with significant volumes.

Asynchronous invocation
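
As a minimal sketch, a caller can invoke a function asynchronously with the boto3 SDK by setting InvocationType to Event; Lambda queues the event internally and returns immediately. The function name and payload are placeholders.

# Asynchronous invocation sketch: Lambda accepts the event and returns
# immediately (HTTP 202); retries and failure handling happen inside Lambda.
import json
import boto3

lambda_client = boto3.client("lambda")

response = lambda_client.invoke(
    FunctionName="order-processor",      # placeholder function name
    InvocationType="Event",              # asynchronous invocation
    Payload=json.dumps({"orderId": "1234"}).encode("utf-8"),
)

print(response["StatusCode"])  # 202 indicates the event was accepted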

For example, the client makes a POST /order request to the order service. The order service accepts the request and returns that it has been received, without waiting for the invoice service. The order service then makes an asynchronous POST /invoice request to the invoice service, which can then process independently of the order service. If the client must receive data from the invoice service, it can handle this separately via a GET /invoice request.

Example asynchronous processing

You can configure Lambda to send records of asynchronous invocations to another destination service. This helps you to troubleshoot your invocations. You can also send messages or events that can’t be processed correctly into a dedicated Amazon Simple Queue Service (SQS) dead-letter queue for investigation.
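
As a configuration sketch, both options can be set with the boto3 SDK; the function name and queue ARNs below are placeholders.

# Configuration sketch: send failed asynchronous invocation records to an SQS
# destination, and attach a dead-letter queue for events that cannot be
# processed. The ARNs and function name are placeholders.
import boto3

lambda_client = boto3.client("lambda")

# Route failed asynchronous invocation records to an SQS destination.
lambda_client.put_function_event_invoke_config(
    FunctionName="order-processor",
    MaximumRetryAttempts=2,
    DestinationConfig={
        "OnFailure": {"Destination": "arn:aws:sqs:us-east-1:123456789012:failed-invocations"}
    },
)

# Alternatively (or additionally), configure a dead-letter queue.
lambda_client.update_function_configuration(
    FunctionName="order-processor",
    DeadLetterConfig={"TargetArn": "arn:aws:sqs:us-east-1:123456789012:order-dlq"},
)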

You can add triggers to a function to process data automatically. For more information on which processing model Lambda uses for triggers, see “Using AWS Lambda with other services”.

Asynchronous workflows handle a variety of use cases including data ingestion, ETL operations, and order/request fulfillment. In these use cases, data is processed as it arrives and is retrieved as it changes. For example asynchronous patterns, see “Serverless Data Processing” and “Serverless Event Submission with Status Updates”.

For more information on Lambda synchronous and asynchronous invocations, see the AWS re:Invent presentation “Optimizing your serverless applications”.

Tune batch size, batch window, and compress payloads for high throughput

When using Lambda to process records using Amazon Kinesis Data Streams or SQS, there are a number of tuning parameters to consider for performance.

You can configure a batch window to buffer messages or records for up to 5 minutes. You can also set a batch size to limit the maximum number of records Lambda processes in a single invocation. Your Lambda function is invoked when the batch size is reached or the batch window expires, whichever comes first.
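
As a sketch, both settings are applied when creating the event source mapping with the boto3 SDK; the stream ARN and function name below are placeholders.

# Event source mapping sketch for a Kinesis data stream: Lambda invokes the
# function when either 100 records are buffered or 30 seconds elapse,
# whichever comes first. The ARN and function name are placeholders.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/orders",
    FunctionName="order-processor",
    StartingPosition="LATEST",
    BatchSize=100,                        # maximum records per invocation
    MaximumBatchingWindowInSeconds=30,    # batch window (up to 300 seconds)
)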

For high volume SQS standard queue throughput, Lambda can process up to 1000 concurrent batches of records per second. For more information, see “Using AWS Lambda with Amazon SQS”.

For high volume Kinesis Data Streams throughput, there are a number of options. Configure the ParallelizationFactor setting to process one shard of a Kinesis Data Stream with more than one Lambda invocation simultaneously. Lambda can process up to 10 batches in each shard. For more information, see “New AWS Lambda scaling controls for Kinesis and DynamoDB event sources.” You can also add more shards to your data stream to increase the speed at which your function can process records. This increases the function concurrency at the expense of ordering per shard. For more details on using Kinesis and Lambda, see “Monitoring and troubleshooting serverless data analytics applications”.

Kinesis enhanced fan-out can maximize throughput by dedicating a 2 MB/second input/output channel per consumer per shard, instead of sharing 2 MB/second per shard across all consumers. For more information, see “Increasing stream processing performance with Enhanced Fan-Out and Lambda”.

Kinesis stream producers can also compress records. This is at the expense of additional CPU cycles for decompressing the records in your Lambda function code.
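
A minimal consumer sketch, assuming producers gzip-compress each record before putting it on the stream:

# Kinesis consumer sketch: records arrive base64-encoded; if the producer
# gzip-compressed the payload, the function must also decompress it.
import base64
import gzip
import json

def handler(event, context):
    for record in event["Records"]:
        raw = base64.b64decode(record["kinesis"]["data"])
        payload = json.loads(gzip.decompress(raw))  # extra CPU cost per record
        print(payload)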

Required practice: Measure, evaluate, and select optimal capacity units

Capacity units are a unit of consumption for a service. They can include function memory size, number of stream shards, number of database reads/writes, request units, or type of API endpoint. Measure, evaluate and select capacity units to enable optimal configuration of performance, throughput, and cost.

Identify and implement optimal capacity units.

For Lambda functions, memory is the capacity unit for controlling the performance of a function. You can configure the amount of memory allocated to a Lambda function, between 128 MB and 10,240 MB. The amount of memory also determines the amount of virtual CPU available to a function. Adding more memory proportionally increases the amount of CPU, increasing the overall computational power available. If a function is CPU-, network- or memory-bound, then changing the memory setting can dramatically improve its performance.

Choosing the memory allocated to Lambda functions is an optimization process that balances performance (duration) and cost. You can manually run tests on functions by selecting different memory allocations and measuring the time taken to complete. Alternatively, use the AWS Lambda Power Tuning tool to automate the process.

The tool allows you to systematically test different memory size configurations and, depending on your performance strategy (cost, performance, or balanced), it identifies the optimal memory size to use. For more information, see “Operating Lambda: Performance optimization – Part 2”.

AWS Lambda Power Tuning report
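
To manually compare memory settings as described above, one rough approach is to update the function configuration, wait for the update to complete, and time a test invocation. This is only a sketch: the function name and payload are placeholders, and client-side timing is a coarse proxy for the billed duration reported in CloudWatch Logs.

# Manual memory tuning sketch: measure wall-clock time for a test invocation
# at several memory sizes. For accurate numbers, read the billed duration from
# CloudWatch Logs instead of timing the call client-side.
import time
import boto3

lambda_client = boto3.client("lambda")
FUNCTION = "order-processor"  # placeholder function name

for memory_mb in (128, 256, 512, 1024):
    lambda_client.update_function_configuration(FunctionName=FUNCTION, MemorySize=memory_mb)
    lambda_client.get_waiter("function_updated").wait(FunctionName=FUNCTION)

    start = time.perf_counter()
    lambda_client.invoke(FunctionName=FUNCTION, Payload=b'{"test": true}')
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{memory_mb} MB -> {elapsed_ms:.0f} ms")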

Amazon DynamoDB manages table processing throughput using read and write capacity units. There are two different capacity modes, on-demand and provisioned.

On-demand capacity mode supports up to 40K read/write request units per second. This is recommended for unpredictable application traffic and new tables with unknown workloads. For higher and predictable throughputs, provisioned capacity mode along with DynamoDB auto scaling is recommended. For more information, see “Read/Write Capacity Mode”.

For high throughput Amazon Kinesis Data Streams with multiple consumers, consider using enhanced fan-out for dedicated 2 MB/second throughput per consumer. When possible, use Kinesis Producer Library and Kinesis Client Library for effective record aggregation and de-aggregation.

Amazon API Gateway supports multiple endpoint types. Edge-optimized APIs provide a fully managed Amazon CloudFront distribution. These are better for geographically distributed clients. API requests are routed to the nearest CloudFront Point of Presence (POP), which typically improves connection time.

Edge-optimized API Gateway deployment

Regional API endpoints are intended for clients in the same Region. This helps you to reduce request latency and allows you to add your own content delivery network if necessary.

Regional endpoint API Gateway deployment

Private API endpoints are API endpoints that can only be accessed from your Amazon Virtual Private Cloud (VPC) using an interface VPC endpoint. For more information, see “Creating a private API in Amazon API Gateway”.

For more information on endpoint types, see “Choose an endpoint type to set up for an API Gateway API”. For more general information on API Gateway, see the AWS re:Invent presentation “I didn’t know Amazon API Gateway could do that”.

AWS Step Functions has two workflow types, standard and express. Standard Workflows have exactly once workflow execution and can run for up to one year. Express Workflows have at-least-once workflow execution and can run for up to five minutes. Consider the per-second rates you require for both execution start rate and the state transition rate. For more information, see “Standard vs. Express Workflows”.

Performance load testing is recommended at both sustained and burst rates to evaluate the effect of tuning capacity units. Use Amazon CloudWatch service dashboards to analyze key performance metrics including load testing results. I cover performance testing in more detail in “Regulating inbound request rates – part 1”.

For general serverless optimization information, see the AWS re:Invent presentation “Serverless at scale: Design patterns and optimizations”.

Conclusion

Evaluate and optimize your serverless application’s performance based on access patterns, scaling mechanisms, and native integrations. You can improve your overall experience and make more efficient use of the platform in terms of both value and resources.

This post continues from part 1 and looks at designing your function to take advantage of concurrency via asynchronous and stream-based invocations. I cover measuring, evaluating, and selecting optimal capacity units.

This well-architected question will continue in part 3 where I look at integrating with managed services directly over functions when possible. I cover optimizing access patterns and applying caching where applicable.

For more serverless learning resources, visit Serverless Land.

Convert and Watermark Documents Automatically with Amazon S3 Object Lambda

Post Syndicated from Joseph Simon original https://aws.amazon.com/blogs/architecture/convert-and-watermark-documents-automatically-with-amazon-s3-object-lambda/

When you provide access to a sensitive document to someone outside of your organization, you likely need to ensure that the document is read-only. In this case, your document should be associated with a specific user in case it is shared.

For example, authors often embed user-specific watermarks into their ebooks. This way, if their ebook gets posted to a file-sharing site, they can prevent the purchaser from downloading copies of the ebook in the future.

In this blog post, we provide you a cost-efficient, scalable, and secure solution to efficiently generate user-specific versions of sensitive documents. This solution helps users track who their documents are shared with. This helps prevent fraud and ensure that private information isn’t leaked. Our solution uses a RESTful API, which uses Amazon S3 Object Lambda to convert documents to PDF and apply a watermark based on the requesting user. It also provides a method for authentication and tracks access to the original document.

Architectural overview

S3 Object Lambda processes and transforms data that is requested from Amazon Simple Storage Service (Amazon S3) before it’s sent back to a client. The AWS Lambda function is invoked inline via a standard S3 GET request. It can return different results from the same document based on parameters, such as who is requesting the document. Figure 1 provides a high-level view of the different components that make up the solution.

Figure 1. Document processing architectural diagram
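
This post’s transformation uses LibreOffice and PDF-Lib (described later), but the overall S3 Object Lambda request/response flow looks roughly like the following Python skeleton; the transformation step is left as a placeholder.

# S3 Object Lambda skeleton (sketch): fetch the original object through the
# presigned inputS3Url, transform it, and return it with
# WriteGetObjectResponse. The transform step is a placeholder.
import urllib.request
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    ctx = event["getObjectContext"]
    with urllib.request.urlopen(ctx["inputS3Url"]) as resp:
        original = resp.read()

    transformed = original  # placeholder: convert to PDF and embed the watermark here

    s3.write_get_object_response(
        RequestRoute=ctx["outputRoute"],
        RequestToken=ctx["outputToken"],
        Body=transformed,
    )
    return {"statusCode": 200}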

Authenticating users with Amazon Cognito

This architecture defines a RESTful API, but users will likely be using a mobile or web application that calls the API. Thus, the application will first need to authenticate users. We do this via Amazon Cognito, which functions as its own identity provider (IdP). You could also use an external IdP, including those that support OpenID Connect and SAML.

Validating the JSON Web Token with API Gateway

Once the user is successfully authenticated with Amazon Cognito, the application will be sent a JSON Web Token (JWT). This JWT contains information about the user and will be used in subsequent requests to the API.

Now that the application has a token, it will make a request to the API, which is provided by Amazon API Gateway. API Gateway provides a secure, scalable entryway into your application. The API Gateway validates the JWT sent from the client with Amazon Cognito to make sure it is valid. If it is validated, the request is accepted and sent on to the Lambda API Handler. If it’s not, the client gets rejected and sent an error code.

Storing user data with DynamoDB

When the Lambda API Handler receives the request, it parses the JWT to extract the user making the request. It then logs that user, file, and access time into Amazon DynamoDB. Optionally, you may use DynamoDB to store an encoded string that will be used as the watermark, rather than something in plaintext, like user name or email.

Generating the PDF and user-specific watermark

At this point, the Lambda API Handler sends an S3 GET request. However, instead of going to Amazon S3 directly, it goes to a different endpoint that invokes the S3 Object Lambda function. This endpoint is called an S3 Object Lambda Access Point. The S3 GET request contains the original file name and the string that will be used for the watermark.

The S3 Object Lambda function transforms the original file that it downloads from its source S3 bucket. It uses the open-source office suite LibreOffice (and specifically this Lambda layer) to convert the source document to PDF. Once it is converted, a JavaScript library (PDF-Lib) embeds the watermark into the PDF before it’s sent back to the Lambda API Handler function.

The Lambda API Handler stores the converted file in a temporary S3 bucket, generates a presigned URL, and sends that URL back to the client as a 302 redirect. Then the client sends a request to that presigned URL to get the converted file.
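
A minimal sketch of that last step, assuming a Lambda proxy integration and placeholder bucket and key names:

# Presigned URL + 302 redirect sketch (Lambda proxy integration). The bucket
# and key are placeholders; the URL expires after five minutes.
import boto3

s3 = boto3.client("s3")

def redirect_to_converted_file(bucket, key):
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=300,
    )
    return {
        "statusCode": 302,
        "headers": {"Location": url},
        "body": "",
    }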

To keep the temporary S3 bucket tidy, we use an S3 lifecycle configuration with an expiration policy.

Figure 2. Process workflow for document transformation

Alternate approach

Before S3 Object Lambda was available, Lambda@Edge was used. However, there are three main issues with using Lambda@Edge instead of S3 Object Lambda:

  1. It is designed to run code closer to the end user to decrease latency, but in this case, latency is not a major concern.
  2. It requires using an Amazon CloudFront distribution, and the single-download pattern described here will not take advantage of Lambda@Edge’s caching.
  3. It has quotas on memory that don’t lend themselves to complex libraries like LibreOffice.

Extending this solution

This blog post describes the basic building blocks for the solution, but it can be extended relatively easily. For example, you could add another function to the API that would convert, resize, and watermark images. To do this, create an S3 Object Lambda function to perform those tasks. Then, add an S3 Object Lambda Access Point to invoke it based on a different API call.

API Gateway has many built-in security features, but you may want to enhance the security of your RESTful API. To do this, add enhanced security rules via AWS WAF. Integrating your IdP into Amazon Cognito can give you a single place to manage your users.

Monitoring any solution is critical, and understanding how an application is behaving end to end can greatly benefit optimization and troubleshooting. Adding AWS X-Ray and Amazon CloudWatch Lambda Insights will show you how functions and their interactions are performing.

Should you decide to extend this architecture, follow the architectural principles defined in AWS Well-Architected, and pay particular attention to the Serverless Application Lens.

Figure 3. Example expanded document processing architecture

Conclusion

You can implement this solution in a number of ways. However, by using S3 Object Lambda, you can transform documents without needing intermediary storage. S3 Object Lambda will also decouple your file logic from the rest of the application.

The Serverless on AWS components mentioned in this post allow you to reduce administrative overhead, saving you time and money.

Finally, the extensible nature of this architecture allows you to add functionality easily as your organization’s needs grow and change.

The following links provide more information on how to use S3 Object Lambda in your architectures:

Queue Integration with Third-party Services on AWS

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/queue-integration-with-third-party-services-on-aws/

Commercial off-the-shelf software and third-party services can present an integration challenge in event-driven workflows when they do not natively support AWS APIs. This is even more impactful when a workflow is subject to unpredicted usage spikes, and you want to increase decoupling and fault tolerance. Given the third-party nature of services, polling an Amazon Simple Queue Service (SQS) queue and having built-in AWS API handling logic may not be an immediate option.

In such cases, AWS Lambda helps offload the Amazon SQS queue integration and AWS API handling to an additional layer. The success of this depends on how well exception handling is implemented across the different interacting services. In this blog post, we outline issues to consider when adopting this design pattern. We also share a reusable solution.

Design pattern for third-party integration with SQS

With this design pattern, one or more services (producers) asynchronously invoke other third-party downstream consumer services. They publish messages to an Amazon SQS queue, which acts as buffer for requests. Producers provide all commands and other parameters required for consumer service execution with the message.

As messages are written to the queue, the queue is configured to invoke a message broker (implemented as AWS Lambda) for each message. AWS Lambda can interact natively with target AWS services such as Amazon EC2, Amazon Elastic Container Service (ECS), or Amazon Elastic Kubernetes Service (EKS). It can also be configured to use an Amazon Virtual Private Cloud (VPC) interface endpoint to establish a connection to VPC resources without traversing the internet. The message broker assigns the tasks to consumer services by invoking the RunTask API of Amazon ECS and AWS Fargate (see Figure 1.)

Figure 1. On-premises and AWS queue integration for third-party services using AWS Lambda

The message broker asynchronously invokes the API in ‘fire-and-forget’ mode. Therefore, error handling must be built in to respond to API invocation errors. In an event-driven scenario, errors occur if you asynchronously call the third-party service hundreds or thousands of times and reach your service quotas. This is a potential issue with RunTask API actions, or a large volume of concurrent tasks running on AWS Fargate. Two mechanisms can help troubleshoot API request errors.

  1. API retries with exponential backoff. The message broker retries for a number of times with configurable sleep intervals and exponential backoff in-between. This enforces progressively longer waits between retries for consecutive error responses. If the RunTask API fails to process the request and initiate the third-party service, the message remains in the queue for a subsequent retry. The AWS General Reference provides further guidance.
  2. API error handling. Error handling and consequent logging should be implemented at every step. Since there are several services working together in tandem, crucial debugging information from errors may be lost. Additionally, error handling also provides opportunity to define automated corrective actions or notifications on event occurrence. The message broker can publish failure notifications including the root cause to an Amazon Simple Notification Service (SNS) topic.

SNS topic subscription can be configured via different protocols. You can email a distribution group for active monitoring and processing of errors. If persistence is required for messages that failed to process, error handling can be associated directly with SQS by configuring a dead letter queue.
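
Putting these pieces together, a simplified message broker might look like the following sketch. The cluster, task definition, container name, retry limit, and SNS topic ARN are placeholders, and the SQS message body is assumed to carry the task parameters described earlier.

# Message broker sketch: for each SQS message, try RunTask with exponential
# backoff; if the retry budget is exhausted, publish a failure notification to
# SNS. All names, ARNs, and message fields below are placeholders/assumptions.
import json
import time
import boto3
from botocore.exceptions import ClientError

ecs = boto3.client("ecs")
sns = boto3.client("sns")

MAX_RETRIES = 3
FAILURE_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:broker-failures"  # placeholder

def handler(event, context):
    for record in event["Records"]:
        msg = json.loads(record["body"])  # assumed to carry cluster, task, and network details
        for attempt in range(MAX_RETRIES):
            try:
                ecs.run_task(
                    cluster=msg["cluster"],
                    taskDefinition=msg["taskDefinition"],
                    launchType="FARGATE",
                    overrides={"containerOverrides": [
                        {"name": msg["containerName"], "command": msg["command"]}
                    ]},
                    networkConfiguration={"awsvpcConfiguration": {
                        "subnets": msg["subnets"],
                        "securityGroups": msg["securityGroups"],
                        "assignPublicIp": "DISABLED",
                    }},
                )
                break  # fire-and-forget: the task was accepted, move to the next message
            except ClientError:
                time.sleep(2 ** attempt)  # exponential backoff before the next attempt
        else:
            sns.publish(
                TopicArn=FAILURE_TOPIC_ARN,
                Message=f"RunTask failed after {MAX_RETRIES} attempts: {record['body']}",
            )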

Reference implementation for third-party integration with SQS

We implemented the design pattern in Figure 1 with Broad Institute’s Cell Painting application workflow, which performs morphological profiling of microscopy cell images and runs on Amazon EC2. It interacts with CellProfiler version 3.0 cell image analysis software as the downstream consumer hosted on ECS/Fargate. Every invocation of CellProfiler required approximately 1,500 tasks for a single processing step.

Resource constraints determined the rate of scale-out; in this case, the constraint was Amazon ECS task creation. Address space for Amazon ECS subnets should be large enough to prevent running out of available IPs within your VPC. If Amazon ECS Service Quotas provide further constraints, a quota increase can be requested.

Exceptions must be handled both when validating and initiating requests. As part of the validation workflow, exceptions are captured as follows, also shown in Figure 2.

1. Invalid arguments exception. The message broker validates that the SQS message contains all the information needed to initiate the ECS task, including the subnets, security groups, and container names required to start the task. If any of this information is missing, the broker raises an exception.

2. Retry limit exception. On each iteration, the message broker evaluates whether the SQS retry limit has been reached before invoking the RunTask API. When the retry limit is reached, it exits and sends a failure notification to SNS.

Figure 2. Exception handling flow during request validation

As part of the initiation workflow, exceptions are handled as follows, shown in Figure 3:

1. ECS/Fargate API and concurrent execution limitations. The message broker catches API exceptions when calling the API RunTask operation. These exceptions can include:

    • When the call to launch tasks exceeds the maximum allowed API request limit for your AWS account
    • When failing to retrieve security group information
    • When you have reached the limit on the number of tasks you can run concurrently

With each of the preceding exceptions, the broker will increase the retry count.

2. Networking and IP space limitations. Network interface timeouts received after initiating the ECS task set off an Amazon CloudWatch Events rule, causing the message broker to re-initiate the ECS task.

Figure 3. Exception handling flow during request initiation

While we specifically address downstream consumer services running on ECS/Fargate, this solution can be adjusted for third-party services running on Amazon EC2 or EKS. With EC2, the message broker must be adjusted to interact with the RunInstances API, and include troubleshooting API request errors. Integration with downstream consumers on Amazon EKS requires that the AWS Lambda function is associated via the IAM role with a Kubernetes service account. A Python client for Kubernetes can be used to simplify interaction with the Kubernetes REST API and AWS Lambda would invoke the run API.

Conclusion

This pattern is useful when queue polling is not an immediate option. This is typical with event-driven workflows involving third-party services and vendor applications subject to unpredictable, intermittent load spikes. Exception handling is essential for these types of workflows. Offloading AWS API handling to a separate layer orchestrated by AWS Lambda can improve the resiliency of such third-party services on AWS. This pattern represents an incremental optimization until the third party provides native SQS integration. It can be achieved with the initial move to AWS, for example as part of the V1 AWS design strategy for third-party services.

Some limitations should be acknowledged. While the pattern enables graceful failure, it does not prevent the overloading of the ECS RunTask API. Because it invokes the Amazon ECS RunTask API in ‘fire-and-forget’ mode, it does not monitor service execution once a task has been successfully invoked. Therefore, it should be adopted when direct queue polling is not an option. In our example, Broad Institute’s CellProfiler application enabled direct queue polling with its subsequent product version of Distributed CellProfiler.

Further reading

The referenced deployment with consumer services on Amazon ECS can be accessed via AWSLabs.

Adding resiliency to AWS CloudFormation custom resource deployments

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/adding-resiliency-to-aws-cloudformation-custom-resource-deployments/

This post is written by Dathu Patil, Solutions Architect and Naomi Joshi, Cloud Application Architect.

AWS CloudFormation custom resources allow you to write custom provisioning logic in templates. These run anytime you create, update, or delete stacks. Using AWS Lambda-backed custom resources, you can associate a Lambda function with a CloudFormation custom resource. The function is invoked whenever the custom resource is created, updated, or deleted.

When CloudFormation asynchronously invokes the function, it passes the request data, such as the request type and resource properties, to the function. The customizability of Lambda functions in combination with CloudFormation allows a wide range of scenarios. For example, you can dynamically look up Amazon Machine Image (AMI) IDs during stack creation or use utilities such as string reversal functions.

Unhandled exceptions or transient errors in the custom resource Lambda function can cause your code to exit without sending a response. CloudFormation requires an HTTPS response to confirm if the operation is successful or not. An unreported exception causes CloudFormation to wait until the operation times out before starting a stack rollback.

If the exception occurs again on rollback, CloudFormation waits for a timeout exception before ending in a rollback failure. During this time, your stack is unusable. You can learn more about this and best practices by reviewing Best Practices for CloudFormation Custom Resources.
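
For context, a Lambda-backed custom resource typically signals CloudFormation using the cfnresponse helper module (bundled automatically for inline ZipFile functions; otherwise, package it with your code). The following is a minimal success-path sketch with the lookup logic as a placeholder; in the solution described below, unhandled exceptions are deliberately allowed to propagate so that the Lambda failure destination can drive retries.

# Minimal custom resource response sketch: on success, send an HTTPS response
# back to CloudFormation via the cfnresponse helper. The lookup logic is a
# placeholder; an unhandled exception here would leave CloudFormation waiting
# unless a retry mechanism (like the one in this post) re-invokes the function.
import cfnresponse

def handler(event, context):
    data = {}
    if event["RequestType"] in ("Create", "Update"):
        data["AmiId"] = "ami-0123456789abcdef0"  # placeholder for the real AMI lookup
    cfnresponse.send(event, context, cfnresponse.SUCCESS, data)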

In this blog, you learn how you can use Amazon SQS and Lambda to add resiliency to your Lambda-backed CloudFormation custom resource deployments. The example shows how to use CloudFormation custom resource to look up an AMI ID dynamically during Amazon EC2 creation.

Overview

CloudFormation templates that declare an EC2 instance must also specify an AMI ID. This includes an operating system and other software and configuration information used to launch the instance. The correct AMI ID depends on the instance type and Region in which you’re launching your stack. AMI IDs can change regularly, such as when an AMI is updated with software updates.

Customers often implement a CloudFormation custom resource to look up an AMI ID while creating an EC2 instance. In this example, the lookup Lambda function calls the EC2 API. It fetches the available AMI IDs, uses the latest AMI ID, and checks for a compliance tag. This implementation assumes that there are separate processes for creating AMI and running compliance checks. The process that performs compliance and security checks creates a compliance tag on a successful scan.

This solution shows how you can use SQS and Lambda to add resiliency to handle an exception. In this case, the exception occurs in the AMI lookup custom resource due to a missing compliance tag. When the AMI lookup function fails processing, it uses the Lambda destination configuration to send the request to an SQS queue. The message is reprocessed using the SQS queue and Lambda function.

Solution architecture

  1. The CloudFormation custom resource asynchronously invokes the AMI lookup Lambda function to perform appropriate actions.
  2. The AMI lookup Lambda function calls the EC2 API to fetch the list of AMIs and checks for a compliance tag. If the tag is missing, it throws an unhandled exception.
  3. On failure, the Lambda destination configuration sends the request to the retry queue that is configured as a dead-letter queue (DLQ). SQS adds a custom delay between retry processing to support more than two retries.
  4. The retry Lambda function processes messages in the retry queue using Lambda with SQS. Lambda polls the queue and invokes the retry Lambda function synchronously with an event that contains queue messages.
  5. The retry function then synchronously invokes the AMI lookup function using the information from the request SQS message.

The AMI Lookup Lambda function

An AWS Serverless Application Model (AWS SAM) template is used to create the AMI lookup Lambda function. You can configure async event options such as number of retries on the Lambda function. The maximum retries allowed is 2 and there is no option to set a delay between the invocation attempts.

When a transient failure or unhandled error occurs, the request is forwarded to the retry queue. This part of the AWS SAM template creates AMI lookup Lambda function:

  AMILookupFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: amilookup/
      Handler: app.lambda_handler
      Runtime: python3.8
      Timeout: 300
      EventInvokeConfig:
        MaximumEventAgeInSeconds: 60
        MaximumRetryAttempts: 2
        DestinationConfig:
          OnFailure:
            Type: SQS
            Destination: !GetAtt RetryQueue.Arn
      Policies:
        - AMIDescribePolicy: {}

This function calls the EC2 API using the boto3 AWS SDK for Python. It calls the describe_images method to get a list of images with given filter conditions. The Lambda function iterates through the AMI list and checks for compliance tags. If the tag is not present, it raises an exception:

ec2_client = boto3.client('ec2', region_name=region)
# Get AMI IDs with the specified name pattern and owner
describe_response = ec2_client.describe_images(
    Filters=[{'Name': "name", 'Values': architectures},
             {'Name': "tag-key", 'Values': ['ami-compliance-check']}],
    Owners=["amazon"]
)

The queue and the retry Lambda function

The retry queue adds a 60-second delay before a message is available for the processing. The time delay between retry processing attempts provides time for transient errors to be corrected. This is the AWS SAM template for creating these resources:

RetryQueue:
  Type: AWS::SQS::Queue
  Properties:
    VisibilityTimeout: 60
    DelaySeconds: 60
    MessageRetentionPeriod: 600

RetryFunction:
  Type: AWS::Serverless::Function
  Properties:
    CodeUri: retry/
    Handler: app.lambda_handler
    Runtime: python3.8
    Timeout: 60
    Events:
      MySQSEvent:
        Type: SQS
        Properties:
          Queue: !GetAtt RetryQueue.Arn
          BatchSize: 1
    Policies:
      - LambdaInvokePolicy:
          FunctionName: !Ref AMILookupFunction

The retry Lambda function periodically polls for new messages in the retry queue. The function synchronously invokes the AMI lookup Lambda function. On success, a response is sent to CloudFormation. This process runs until the AMI lookup function returns a successful response or the message is deleted from the SQS queue. The deletion is based on the MessageRetentionPeriod, which is set to 600 seconds in this case.

for record in event['Records']:
        body = json.loads(record['body'])
        response = client.invoke(
            FunctionName=body['requestContext']['functionArn'],
            InvocationType='RequestResponse',
            Payload=json.dumps(body['requestPayload']).encode()
        )            

Deployment walkthrough

Prerequisites

To get started with this solution, you need:

  • AWS CLI and AWS SAM CLI installed to deploy the solution.
  • An existing Amazon EC2 public image. You can choose any of the AMIs from the AWS Management Console with Architecture = X86_64 and Owner = amazon for test purposes. Note the AMI ID.

Download the source code from the resilient-cfn-custom-resource GitHub repository. The template.yaml file is an AWS SAM template. It deploys the Lambda functions, SQS, and IAM roles required for the Lambda function. It uses Python 3.8 as the runtime and assigns 128 MB of memory for the Lambda functions.

  1. To build and deploy this application using the AWS SAM CLI build and guided deploy:
    sam build --use-container
    sam deploy --guided

The custom resource stack creation invokes the AMI lookup Lambda function. This fetches the AMI ID from all public EC2 images available in your account with the tag ami-compliance-check. Typically, the compliance tags are created by a process that performs security scans.

In this example, the security scan process is not running and the tag is not yet added to any AMIs. As a result, the custom resource throws an exception, which goes to the retry queue. This is retried by the retry function until it is successfully processed.

  1. Use the console or AWS CLI to add the tag to the chosen EC2 AMI. In this example, this is analogous to a separate governance process that checks for AMI compliance and adds the compliance tag if passed. Replace the $AMI-ID with the AMI ID captured in the prerequisites:
    aws ec2 create-tags --resources $AMI-ID --tags Key=ami-compliance-check,Value=True
  2. After the tags are added, a response is sent successfully from the custom resource Lambda function to the CloudFormation stack. It includes your $AMI-ID and a test EC2 instance is created using that image. The stack creation completes successfully with all resources deployed.

Conclusion

This blog post demonstrates how to use SQS and Lambda to add resiliency to CloudFormation custom resource deployments. This solution can be customized for use cases where CloudFormation stacks have a dependency on a custom resource.

CloudFormation custom resource failures can happen due to unhandled exceptions. These are caused by issues with a dependent component, internal service, or transient system errors. Using this solution, you can handle the failures automatically without the need for manual intervention. To get started, download the code from the GitHub repo and start customizing.

For more serverless learning resources, visit Serverless Land.

Using AWS CodePipeline for deploying container images to AWS Lambda Functions

Post Syndicated from Kirankumar Chandrashekar original https://aws.amazon.com/blogs/devops/using-aws-codepipeline-for-deploying-container-images-to-aws-lambda-functions/

AWS Lambda launched support for packaging and deploying functions as container images at re:Invent 2020. In the post working with Lambda layers and extensions in container images, we demonstrated packaging Lambda Functions with layers while using container images. This post will teach you to use AWS CodePipeline to deploy Docker images for a microservices architecture involving multiple Lambda Functions and a common layer utilized as a separate container image. Lambda functions packaged as container images do not support adding Lambda layers to the function configuration. Instead, we can use a container image as a common layer while building the other container images for the Lambda Functions shown in this post. Packaging Lambda functions as container images enables familiar tooling and larger deployment limits.

Here are some advantages of using container images for Lambda:

  • Easier dependency management and application building with containers
    • Install native operating system packages
    • Install language-compatible dependencies
  • Consistent tool set for containers and Lambda-based applications
    • Utilize the same container registry to store application artifacts (Amazon ECR, Docker Hub)
    • Utilize the same build and pipeline tools to deploy
    • Tools that inspect the Dockerfile work the same way
  • Deploy large applications with AWS-provided or third-party images up to 10 GB
    • Include larger application dependencies that previously were impossible

When using container images with Lambda, CodePipeline detects code changes in the AWS CodeCommit source repository and passes the artifact to AWS CodeBuild, which builds the container images and pushes them to Amazon ECR. The images are then deployed to the Lambda functions.
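
In this post that last step is performed by CloudFormation in a later pipeline stage, but the underlying operation is equivalent to an UpdateFunctionCode call. A minimal boto3 sketch (the function name and image URI below are placeholders) looks like this:

import boto3

lambda_client = boto3.client("lambda")

# Point the function at the freshly pushed image tag in Amazon ECR
response = lambda_client.update_function_code(
    FunctionName="dev-routing-lambda",
    ImageUri="0123456789.dkr.ecr.us-east-2.amazonaws.com/dev-routing-lambda:build-1234",
)
print(response["LastUpdateStatus"])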

Architecture diagram

 

DevOps Architecture

Application Architecture

The architecture diagram combines two views: the DevOps architecture and the microservices application architecture. The DevOps architecture demonstrates the use of AWS developer services such as AWS CodeCommit, AWS CodePipeline, and AWS CodeBuild, along with Amazon Elastic Container Registry (Amazon ECR) and AWS CloudFormation. These support continuous integration and continuous deployment/delivery (CI/CD) for both infrastructure and application code. The microservices application architecture demonstrates how the AWS Lambda functions that make up the microservices use container images for application code. This post focuses on performing CI/CD for Lambda functions that use container images. The application code used here is a simplified version of the Serverless Data Lake Framework (SDLF). For more information, refer to the AWS Samples GitHub repository for SDLF here.

DevOps workflow

There are two CodePipelines: one builds and pushes the common layer Docker image to Amazon ECR, and the other builds and pushes the Docker images for all of the Lambda functions in the microservices architecture to Amazon ECR, and deploys the microservices architecture, including the Lambda functions, via CloudFormation. The common layer container image acts as a common layer in all of the other Lambda function container images, so its code is maintained in a separate CodeCommit repository that serves as the source stage for its own CodePipeline. That pipeline takes the code from the CodeCommit repository, passes it to a CodeBuild project that builds the container image, and pushes the image to an Amazon ECR repository. This common layer ECR repository then acts as a source for the microservices architecture CodePipeline, in addition to the CodeCommit repository that holds the code for all of the other Lambda functions and resources.

Because all (or most) of the Lambda functions in the microservices architecture require the common layer container image as a layer, any change to that image should invoke the microservices architecture CodePipeline, which rebuilds the container images for all Lambda functions. The CodeCommit repository holding the code for every resource in the microservices architecture is the other source that can invoke that CodePipeline. The pipeline therefore has two sources: the container images must be rebuilt both when the common layer container image changes and when code changes are pushed to the CodeCommit repository.

Below is a sample Dockerfile that uses the common layer container image as a layer:

ARG ECR_COMMON_DATALAKE_REPO_URL
FROM ${ECR_COMMON_DATALAKE_REPO_URL}:latest AS layer
FROM public.ecr.aws/lambda/python:3.8
# Layer Code
WORKDIR /opt
COPY --from=layer /opt/ .
# Function Code
WORKDIR /var/task
COPY src/lambda_function.py .
CMD ["lambda_function.lambda_handler"]

The ECR_COMMON_DATALAKE_REPO_URL argument should resolve to the ECR URL for the common layer container image, and is supplied via the --build-arg option of the docker build command. For example:

export ECR_COMMON_DATALAKE_REPO_URL="0123456789.dkr.ecr.us-east-2.amazonaws.com/dev-routing-lambda"
docker build --build-arg ECR_COMMON_DATALAKE_REPO_URL=$ECR_COMMON_DATALAKE_REPO_URL .

Deploying a Sample

  • Step 1: Clone the repository codepipeline-lambda-docker-images to your workstation. If you are using the zip file, unzip it to a local directory.
    • git clone https://github.com/aws-samples/codepipeline-lambda-docker-images.git
  • Step 2: Change to the cloned or extracted directory:
    • cd codepipeline-lambda-docker-images

  • Step 3: Deploy the CloudFormation stack defined in the template file CodePipelineTemplate/codepipeline.yaml to your AWS account. This deploys the resources required for the DevOps architecture, including the CodePipelines for the common layer code and the microservices architecture code. Deploy the CloudFormation stack using the AWS console by following the documentation here, providing a name for the stack (for example, datalake-infra-resources) and passing the parameters while navigating the console. Alternatively, use the AWS CLI to deploy the CloudFormation stack by following the documentation here.
  • Step 4: When the CloudFormation stack deployment completes, navigate to the AWS CloudFormation console, open the Outputs section of the deployed stack, and note the CodeCommit repository URLs. Three URLs are available in the stack outputs for each CodeCommit repository; choose one based on how you want to access it. Refer to the documentation Setting up for AWS CodeCommit. I use the git-remote-codecommit (grc) method throughout this post for CodeCommit access.
  • Step 5: Clone the CodeCommit repositories and add code:
      • Common Layer CodeCommit repository: Take the value of the output key oCommonLayerCodeCommitHttpsGrcRepoUrl from the datalake-infra-resources CloudFormation stack Outputs section.

      • Clone the repository:
        • git clone codecommit::us-east-2://dev-CommonLayerCode
      • Change the directory to dev-CommonLayerCode
        • cd dev-CommonLayerCode
      • Add contents to the cloned repository from the source code downloaded in Step 1. Copy the code from the CommonLayerCode directory into the root of the cloned repository.

      • Create the main branch and push to the remote repository
        git checkout -b main
        git add ./
        git commit -m "Initial Commit"
        git push -u origin main
      • Application CodeCommit repository: Take the value of the output key oAppCodeCommitHttpsGrcRepoUrl from the datalake-infra-resources CloudFormation stack Outputs section.

      • Clone the repository:
        • git clone codecommit::us-east-2://dev-AppCode
      • Change the directory to dev-AppCode
        • cd dev-AppCode
      • Add contents to the cloned repository from the source code downloaded in Step 1. Copy the code from the ApplicationCode directory into the root of the cloned repository.

    • Create the main branch and push to the remote repository
      git checkout -b main
      git add ./
      git commit -m "Initial Commit"
      git push -u origin main

What happens now?

  • Now the Common Layer CodePipeline goes to the InProgress state and invokes the Common Layer CodeBuild project, which builds the Docker image and pushes it to the Common Layer Amazon ECR repository. The image tag used for the container image is the value of the CodeBuild environment variable CODEBUILD_RESOLVED_SOURCE_VERSION, which in this case is the CodeCommit commit ID.
    For example, if the commit ID in CodeCommit is f1769c87, the pushed Docker image carries this tag in addition to latest.
  • The buildspec.yaml file appears as follows:
    version: 0.2
    phases:
      install:
        runtime-versions:
          docker: 19
      pre_build:
        commands:
          - echo Logging in to Amazon ECR...
          - aws --version
          - $(aws ecr get-login --region $AWS_DEFAULT_REGION --no-include-email)
          - REPOSITORY_URI=$ECR_COMMON_DATALAKE_REPO_URL
          - COMMIT_HASH=$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | cut -c 1-7)
          - IMAGE_TAG=${COMMIT_HASH:=latest}
      build:
        commands:
          - echo Build started on `date`
          - echo Building the Docker image...          
          - docker build -t $REPOSITORY_URI:latest .
          - docker tag $REPOSITORY_URI:latest $REPOSITORY_URI:$IMAGE_TAG
      post_build:
        commands:
          - echo Build completed on `date`
          - echo Pushing the Docker images...
          - docker push $REPOSITORY_URI:latest
          - docker push $REPOSITORY_URI:$IMAGE_TAG
  • Now the microservices architecture CodePipeline goes to the InProgress state and invokes the application image builder CodeBuild project, which builds the Docker images and pushes them to the Amazon ECR repository.
    • To improve performance, every Docker image is built in parallel within the CodeBuild project. The buildspec.yaml executes the build.sh script, which has the logic to build the Docker images required for each Lambda function that is part of the microservices architecture. Building the Docker images for this sample architecture serially took approximately 4 to 5 minutes; after switching to parallel builds, it took approximately 40 to 50 seconds.
    • The buildspec.yaml file appears as follows:
      version: 0.2
      phases:
        install:
          runtime-versions:
            docker: 19
          commands:
            - uname -a
            - set -e
            - chmod +x ./build.sh
            - ./build.sh
      artifacts:
        files:
          - cfn/**/*
        name: builds/$CODEBUILD_BUILD_NUMBER/cfn-artifacts
    • build.sh file appears as follows:
      #!/bin/bash
      set -eu
      set -o pipefail
      
      RESOURCE_PREFIX="${RESOURCE_PREFIX:=stg}"
      AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:=us-east-1}"
      ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text 2>&1)
      ECR_COMMON_DATALAKE_REPO_URL="${ECR_COMMON_DATALAKE_REPO_URL:=$ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com\/$RESOURCE_PREFIX-common-datalake-library}"
      pids=()
      pids1=()
      
      PROFILE='new-profile'
      aws configure --profile $PROFILE set credential_source EcsContainer
      
      aws --version
      $(aws ecr get-login --region $AWS_DEFAULT_REGION --no-include-email)
      COMMIT_HASH=$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | cut -c 1-7)
      BUILD_TAG=build-$(echo $CODEBUILD_BUILD_ID | awk -F":" '{print $2}')
      IMAGE_TAG=${BUILD_TAG:=COMMIT_HASH:=latest}
      
      cd dockerfiles;
      mkdir ../logs
      function pwait() {
          while [ $(jobs -p | wc -l) -ge $1 ]; do
              sleep 1
          done
      }
      
      function build_dockerfiles() {
          if [ -d $1 ]; then
              directory=$1
              cd $directory
              echo $directory
              echo "---------------------------------------------------------------------------------"
              echo "Start creating docker image for $directory..."
              echo "---------------------------------------------------------------------------------"
                  REPOSITORY_URI=$ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$RESOURCE_PREFIX-$directory
                  docker build --build-arg ECR_COMMON_DATALAKE_REPO_URL=$ECR_COMMON_DATALAKE_REPO_URL . -t $REPOSITORY_URI:latest -t $REPOSITORY_URI:$IMAGE_TAG -t $REPOSITORY_URI:$COMMIT_HASH
                  echo Build completed on `date`
                  echo Pushing the Docker images...
                  docker push $REPOSITORY_URI
              cd ../
              echo "---------------------------------------------------------------------------------"
              echo "End creating docker image for $directory..."
              echo "---------------------------------------------------------------------------------"
          fi
      }
      
      for directory in *; do 
         echo "------Started processing code in $directory directory-----"
         build_dockerfiles $directory 2>&1 1>../logs/$directory-logs.log | tee -a ../logs/$directory-logs.log &
         pids+=($!)
         pwait 20
      done
      
      for pid in "${pids[@]}"; do
        wait "$pid"
      done
      
      cd ../cfn/
      function build_cfnpackages() {
          if [ -d $1 ]; then
              directory=$1
              cd $directory
              echo $directory
              echo "---------------------------------------------------------------------------------"
              echo "Start packaging cloudformation package for $directory..."
              echo "---------------------------------------------------------------------------------"
              aws cloudformation package --profile $PROFILE --template-file template.yaml --s3-bucket $S3_BUCKET --output-template-file packaged-template.yaml
              echo "Replace the parameter 'pEcrImageTag' value with the latest built tag"
              echo $(jq --arg Image_Tag "$IMAGE_TAG" '.Parameters |= . + {"pEcrImageTag":$Image_Tag}' parameters.json) > parameters.json
              cat parameters.json
              ls -al
              cd ../
              echo "---------------------------------------------------------------------------------"
              echo "End packaging cloudformation package for $directory..."
              echo "---------------------------------------------------------------------------------"
          fi
      }
      
      for directory in *; do
          echo "------Started processing code in $directory directory-----"
          build_cfnpackages $directory 2>&1 1>../logs/$directory-logs.log | tee -a ../logs/$directory-logs.log &
          pids1+=($!)
          pwait 20
      done
      
      for pid in "${pids1[@]}"; do
        wait "$pid"
      done
      
      cd ../logs/
      ls -al
      for f in *; do
        printf '%s\n' "$f"
        paste /dev/null - < "$f"
      done
      
      cd ../
      

The function build_dockerfiles() loops through each directory within the dockerfiles directory and runs the docker build command to build the Docker image. The name of the Docker image, and therefore of the ECR repository, is determined by the name of the directory containing the Dockerfile. For example, if the Dockerfile directory is routing-lambda and the environment variables take the following values:

ACCOUNT_ID=0123456789
AWS_DEFAULT_REGION=us-east-2
RESOURCE_PREFIX=dev
directory=routing-lambda
REPOSITORY_URI=$ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$RESOURCE_PREFIX-$directory

Then REPOSITORY_URI resolves to 0123456789.dkr.ecr.us-east-2.amazonaws.com/dev-routing-lambda, and the Docker image is pushed to this REPOSITORY_URI. Similarly, Docker images for all other directories are built and pushed to Amazon ECR.

Important note: The ECR repository names match the directory names where the Dockerfiles exist and were already created as part of the CloudFormation template codepipeline.yaml deployed in Step 3. To add more Lambda functions to the microservices architecture, make sure that the ECR repository name added in the codepipeline.yaml template matches the directory name within the AppCode repository's dockerfiles directory.

Every Docker image is built in parallel to save time. Each build runs as a separate operating system process and is pushed to the Amazon ECR repository. The number of processes that can run in parallel is controlled by the value passed to the pwait function within the loop; for example, pwait 20 limits the number of parallel processes to 20 at a given time. The image tag for all Docker images used for Lambda functions is constructed from the CodeBuild build ID, available via the environment variable $CODEBUILD_BUILD_ID, so that every new build produces a new tag. This is required for CloudFormation to detect changes and update the Lambda functions with the new container image tag.

Once every Docker image is built and pushed to Amazon ECR, the CodeBuild project packages every CloudFormation template by uploading the local artifacts to Amazon S3 via the aws cloudformation package CLI command for the templates in each directory under the cfn directory. It also updates each directory's parameters.json file, setting the pEcrImageTag parameter to the new ECR image tag. This is required for CloudFormation to detect changes and update the Lambda function with the new image tag.

After this, the CodeBuild project will output the packaged CloudFormation templates and parameters files as an artifact to AWS CodePipeline so that it can be deployed via AWS CloudFormation in further stages. This is done by first creating a ChangeSet and then deploying it at the next stage.
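
In the pipeline these operations are performed by CodePipeline's CloudFormation action. Outside of the pipeline, the equivalent boto3 calls could be sketched as follows; the stack name, change set name, template URL, and tag value are placeholders:

import boto3

cfn = boto3.client("cloudformation")

STACK = "sdlf-engineering-meteorites"
CHANGE_SET = "image-tag-update"

# Create a change set from the packaged template produced by CodeBuild
cfn.create_change_set(
    StackName=STACK,
    ChangeSetName=CHANGE_SET,
    TemplateURL="https://s3.amazonaws.com/my-artifact-bucket/packaged-template.yaml",
    Parameters=[{"ParameterKey": "pEcrImageTag", "ParameterValue": "build-1234"}],
    Capabilities=["CAPABILITY_NAMED_IAM"],
    ChangeSetType="UPDATE",
)

# Wait for the change set to be ready, then execute it (the "next stage")
cfn.get_waiter("change_set_create_complete").wait(StackName=STACK, ChangeSetName=CHANGE_SET)
cfn.execute_change_set(StackName=STACK, ChangeSetName=CHANGE_SET)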

Testing the microservices architecture

As stated above, the sample application used for the microservices architecture involving multiple Lambda functions is a modified version of the Serverless Data Lake Framework (SDLF). The microservices architecture CodePipeline deploys every AWS resource required to run the SDLF application via AWS CloudFormation stages, including a set of DynamoDB tables the application needs. This example uses the meteorites sample, so the DynamoDB tables must be populated with the necessary data for the application to run.

Use the AWS console to write data to the Amazon DynamoDB tables. For more information, refer to this documentation. The sample JSON files are in the utils/DynamoDbConfig/ directory.

1. Add the record below to the octagon-Pipelines-dev DynamoDB table:

{
  "description": "Main Pipeline to Ingest Data",
  "ingestion_frequency": "WEEKLY",
  "last_execution_date": "2020-03-11",
  "last_execution_duration_in_seconds": 4.761,
  "last_execution_id": "5445249c-a097-447a-a957-f54f446adfd2",
  "last_execution_status": "COMPLETED",
  "last_execution_timestamp": "2020-03-11T02:34:23.683Z",
  "last_updated_timestamp": "2020-03-11T02:34:23.683Z",
  "modules": [
    { "name": "pandas", "version": "0.24.2" },
    { "name": "Python", "version": "3.7" }
  ],
  "name": "engineering-main-pre-stage",
  "owner": "Yuri Gagarin",
  "owner_contact": "y.gagarin@",
  "status": "ACTIVE",
  "tags": [
    { "key": "org", "value": "VOSTOK" }
  ],
  "type": "INGESTION",
  "version": 127
}

2. Add the record below to the octagon-Pipelines-dev DynamoDB table:

{
  "description": "Main Pipeline to Merge Data",
  "ingestion_frequency": "WEEKLY",
  "last_execution_date": "2020-03-11",
  "last_execution_duration_in_seconds": 570.559,
  "last_execution_id": "0bb30d20-ace8-4cb2-a9aa-694ad018694f",
  "last_execution_status": "COMPLETED",
  "last_execution_timestamp": "2020-03-11T02:44:36.069Z",
  "last_updated_timestamp": "2020-03-11T02:44:36.069Z",
  "modules": [
    { "name": "PySpark", "version": "1.0" }
  ],
  "name": "engineering-main-post-stage",
  "owner": "Neil Armstrong",
  "owner_contact": "n.armstrong@",
  "status": "ACTIVE",
  "tags": [
    { "key": "org", "value": "NASA" }
  ],
  "type": "TRANSFORM",
  "version": 4
}

3. Add the record below to the octagon-Datsets-dev DynamoDB table:

{
  "classification": "Orange",
  "description": "Meteorites Name, Location and Classification",
  "frequency": "DAILY",
  "max_items_process": 250,
  "min_items_process": 1,
  "name": "engineering-meteorites",
  "owner": "NASA",
  "owner_contact": "[email protected]",
  "pipeline": "main",
  "tags": [
    { "key": "cost", "value": "meteorites division" }
  ],
  "transforms": {
    "stage_a_transform": "light_transform_blueprint",
    "stage_b_transform": "heavy_transform_blueprint"
  },
  "type": "TRANSACTIONAL",
  "version": 1
}

 

If you want to create these samples using AWS CLI, please refer to this documentation.

Record 1:

aws dynamodb put-item --table-name octagon-Pipelines-dev --item '{"description":{"S":"Main Pipeline to Merge Data"},"ingestion_frequency":{"S":"WEEKLY"},"last_execution_date":{"S":"2021-03-16"},"last_execution_duration_in_seconds":{"N":"930.097"},"last_execution_id":{"S":"e23b7dae-8e83-4982-9f97-5784a9831a14"},"last_execution_status":{"S":"COMPLETED"},"last_execution_timestamp":{"S":"2021-03-16T04:31:16.968Z"},"last_updated_timestamp":{"S":"2021-03-16T04:31:16.968Z"},"modules":{"L":[{"M":{"name":{"S":"PySpark"},"version":{"S":"1.0"}}}]},"name":{"S":"engineering-main-post-stage"},"owner":{"S":"Neil Armstrong"},"owner_contact":{"S":"n.armstrong@"},"status":{"S":"ACTIVE"},"tags":{"L":[{"M":{"key":{"S":"org"},"value":{"S":"NASA"}}}]},"type":{"S":"TRANSFORM"},"version":{"N":"8"}}'

Record 2:

aws dynamodb put-item --table-name octagon-Pipelines-dev --item '{"description":{"S":"Main Pipeline to Ingest Data"},"ingestion_frequency":{"S":"WEEKLY"},"last_execution_date":{"S":"2021-03-28"},"last_execution_duration_in_seconds":{"N":"1.75"},"last_execution_id":{"S":"7e0e04e7-b05e-41a6-8ced-829d47866a6a"},"last_execution_status":{"S":"COMPLETED"},"last_execution_timestamp":{"S":"2021-03-28T20:23:06.031Z"},"last_updated_timestamp":{"S":"2021-03-28T20:23:06.031Z"},"modules":{"L":[{"M":{"name":{"S":"pandas"},"version":{"S":"0.24.2"}}},{"M":{"name":{"S":"Python"},"version":{"S":"3.7"}}}]},"name":{"S":"engineering-main-pre-stage"},"owner":{"S":"Yuri Gagarin"},"owner_contact":{"S":"y.gagarin@"},"status":{"S":"ACTIVE"},"tags":{"L":[{"M":{"key":{"S":"org"},"value":{"S":"VOSTOK"}}}]},"type":{"S":"INGESTION"},"version":{"N":"238"}}'

Record 3:

aws dynamodb put-item --table-name octagon-Datsets-dev --item '{"classification":{"S":"Orange"},"description":{"S":"Meteorites Name, Location and Classification"},"frequency":{"S":"DAILY"},"max_items_process":{"N":"250"},"min_items_process":{"N":"1"},"name":{"S":"engineering-meteorites"},"owner":{"S":"NASA"},"owner_contact":{"S":"[email protected]"},"pipeline":{"S":"main"},"tags":{"L":[{"M":{"key":{"S":"cost"},"value":{"S":"meteorites division"}}}]},"transforms":{"M":{"stage_a_transform":{"S":"light_transform_blueprint"},"stage_b_transform":{"S":"heavy_transform_blueprint"}}},"type":{"S":"TRANSACTIONAL"},"version":{"N":"1"}}'
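
If you would rather seed the tables with a short script instead of the console or CLI, a boto3 sketch could load the sample files directly. The file names below are assumptions based on the utils/DynamoDbConfig/ directory; adjust them to match the repository contents:

import json
from decimal import Decimal

import boto3

dynamodb = boto3.resource("dynamodb")

# Map each sample file to its target table (file names are illustrative)
records = [
    ("octagon-Pipelines-dev", "utils/DynamoDbConfig/pipeline-pre-stage.json"),
    ("octagon-Pipelines-dev", "utils/DynamoDbConfig/pipeline-post-stage.json"),
    ("octagon-Datsets-dev", "utils/DynamoDbConfig/dataset-meteorites.json"),
]

for table_name, path in records:
    with open(path) as f:
        # parse_float=Decimal is needed because the resource API rejects floats
        item = json.load(f, parse_float=Decimal)
    dynamodb.Table(table_name).put_item(Item=item)
    print(f"Inserted item into {table_name}")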

Now upload the sample JSON files to the raw S3 bucket. The raw S3 bucket name is available in the outputs of the common-cloudformation stack deployed as part of the microservices architecture CodePipeline. Navigate to the CloudFormation console in the Region where the CodePipeline was deployed, locate the stack named common-cloudformation, open the Outputs section, and note the bucket name for the key oCentralBucket. Navigate to the Amazon S3 console, locate the oCentralBucket bucket, create the path engineering/meteorites, and upload every sample JSON file to this directory. The meteorites sample JSON files are available in the utils/meteorites-test-json-files directory of the previously cloned repository.

Wait a few minutes and then navigate to the stage bucket noted from the common-cloudformation stack output oStageBucket. You can see the JSON files converted into CSV in the pre-stage/engineering/meteorites folder. Wait a few more minutes and then navigate to the post-stage/engineering/meteorites folder in the oStageBucket to see the CSV files converted to Parquet format.

 

Cleanup

Navigate to the AWS CloudFormation console, note the S3 bucket names from the common-cloudformation stack outputs, and empty the S3 buckets. Refer to Emptying the Bucket for more information.

Delete the CloudFormation stacks in the following order:
1. Common-Cloudformation
2. stagea
3. stageb
4. sdlf-engineering-meteorites
Then delete the infrastructure CloudFormation stack datalake-infra-resources deployed using the codepipeline.yaml template. Refer to the following documentation to delete CloudFormation Stacks: Deleting a stack on the AWS CloudFormation console or Deleting a stack using AWS CLI.

 

Conclusion

This method lets us use CI/CD via CodePipeline, CodeCommit, and CodeBuild, along with other AWS services, to automatically deploy container images to the Lambda functions that are part of the microservices architecture. The common layer, which is equivalent to a Lambda layer, is built independently via its own CodePipeline, which builds the container image and pushes it to Amazon ECR. That common layer ECR repository then acts as a source for the microservices architecture CodePipeline, alongside the CodeCommit repository that holds the microservices architecture code. Having two sources lets the pipeline rebuild every Docker image both when the common layer image referenced by the other images changes and when the code for the other microservices, including the Lambda functions, changes.

 

About the Author

Kirankumar Chandrashekar is a Sr. DevOps Consultant at AWS Professional Services. He focuses on leading customers in architecting DevOps technologies. Kirankumar is passionate about DevOps, infrastructure as code, and solving complex customer issues. He enjoys music, as well as cooking and traveling.

 

Query SAP HANA using Athena Federated Query and join with data in your Amazon S3 data lake

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/query-sap-hana-using-athena-federated-query-and-join-with-data-in-your-amazon-s3-data-lake/

If you use data lakes in Amazon Simple Storage Service (Amazon S3) and use SAP HANA as your transactional data store, you may need to join the data in your data lake with SAP HANA in the cloud, SAP HANA running on Amazon Elastic Compute Cloud (Amazon EC2), or with an on-premises SAP HANA, for example to build a dashboard or create consolidated reporting.

In such use cases, Amazon Athena Federated Query allows you to seamlessly access the data from SAP HANA database without building ETL pipelines to copy or unload the data to the S3 data lake or SAP HANA. This removes the overhead of creating additional extract, transform, and load (ETL) processes and shortens the development cycle.

In this post, we walk you through a step-by-step configuration to set up Amazon Athena Federated Query using AWS Lambda to access data in a SAP HANA database running on AWS.

For this post, we use the SAP HANA Athena Federated Query connector developed by Trianz, which you can deploy from the AWS Serverless Application Repository.

Let’s start with discussing the solution and then detailing the steps involved.

Solution overview

Data federation is the capability to integrate data in another data store using a single interface (Athena). The following diagram depicts how Athena federation works by using Lambda to integrate with a federated data source.

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. If you have data in sources other than Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines to extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.

When a federated query is run, Athena identifies the parts of the query that should be routed to the data source connector and executes them with Lambda. The data source connector makes the connection to the source, runs the query, and returns the results to Athena. If the data doesn’t fit into the Lambda function’s memory, the connector spills it to Amazon S3, where Athena accesses it later.

Athena uses data source connectors which internally use Lambda to run federated queries. Data source connectors are pre-built and can be deployed from the Athena console or from the Serverless Application Repository. Based on the user submitting the query, connectors can provide or restrict access to specific data elements.

To implement this solution, we complete the following steps:

  1. Create a secret for the SAP HANA instance using AWS Secrets Manager.
  2. Create an S3 bucket and subfolder for Lambda to use.
  3. Configure Athena federation with the SAP HANA instance.
  4. Run federated queries with Athena.

Prerequisites

Before getting started, make sure you have a SAP HANA database up and running on AWS.

Create a secret for the SAP HANA instance

Our first step is to create a secret for the SAP HANA instance with a username and password using Secrets Manager.

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. Select Other types of secrets.
  4. Set the credentials as key-value pairs (username, password) for your SAP HANA instance.
  5. For Secret name, enter a name for your secret. Use the prefix SAPHANAAFQ so it’s easy to find.
  6. Leave the remaining fields at their defaults and choose Next.
  7. Complete your secret creation.
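
If you prefer to create the secret programmatically, the following is a minimal boto3 sketch; the secret name, username, and password values are placeholders:

import json

import boto3

secrets = boto3.client("secretsmanager")

# Store the SAP HANA credentials as a JSON key-value secret
secrets.create_secret(
    Name="SAPHANAAFQ-demo-instance",
    Description="Credentials for the SAP HANA instance used by Athena Federated Query",
    SecretString=json.dumps({"username": "athena_user", "password": "example-password"}),
)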

Setting up your S3 bucket for Lambda

On the Amazon S3 console, create a new S3 bucket and subfolder for Lambda to use.

For this post, we have used (Amazon S3 bucket name/folder) athena-accelerator/saphana.

Configure Athena federation with the SAP HANA instance

To configure Athena federation with your SAP HANA instance, complete the following steps:

  1. On the AWS Serverless Application Repository console, choose Available applications.
  2. In the search field, enter TrianzSAPHANAAthenaJDBC.

In the Application settings section, provide the following details:

  3. For Application name, enter TrianzSAPHANAAthenaJDBC.
  4. For SecretNamePrefix, enter trianz-saphana-athena-jdbc.
  5. For SpillBucket, enter athena-accelerator/saphana.

For JDBCConnectorConfig, use the format saphana://jdbc:sap://{saphana_instance_url}/?${secretname}.

  6. For DisableSpillEncryption, choose False.
  7. For LambdaFunctionName, enter trsaphana.
  8. For SecurityGroupID, enter the security group ID that allows the Lambda function to connect to the SAP HANA instance.

Make sure to apply valid inbound and outbound rules based on your connection.

  9. For SpillPrefix, create a folder under the S3 bucket you created and specify the name (for example, athena-spill).
  10. For SubnetIds, enter a comma-separated list of subnet IDs from which the Lambda function can connect to the SAP HANA instance.

Make sure the subnet is in a VPC and has a NAT gateway and internet gateway attached.

  11. Select the I acknowledge check box.
  12. Choose Deploy.

Make sure that the AWS Identity and Access Management (IAM) roles have permissions to access AWS Serverless Application Repository, AWS CloudFormation, Amazon S3, Amazon CloudWatch, Amazon CloudTrail, Secrets Manager, Lambda, and Athena. For more information, see Example IAM Permissions Policies to Allow Athena Federated Query.

Run federated queries with Athena

Run your federated queries using lambda:trsaphana against tables in the SAP HANA database. trsaphana is the name of the Lambda function created in step 7 of the previous section.

lambda:trsaphana is a reference data source connector Lambda function using the format lambda:MyLambdaFunctionName. For more information, see Writing Federated Queries.
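
For example, a federated query can be submitted programmatically with the Athena API. In the following boto3 sketch, the data lake database, table and column names, and the output location are placeholders:

import boto3

athena = boto3.client("athena")

# Join an S3 data lake table with a table served by the SAP HANA connector
query = """
SELECT dl.order_id, dl.order_total, hana.CUSTOMER_NAME
FROM "AwsDataCatalog"."datalake_db"."orders" AS dl
JOIN "lambda:trsaphana"."SALES"."CUSTOMERS" AS hana
  ON dl.customer_id = hana.CUSTOMER_ID
LIMIT 10
"""

response = athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://athena-accelerator/results/"},
)
print(response["QueryExecutionId"])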

The following screenshot demonstrates joining the dataset between SAP HANA and the data lake.

Key performance best practice considerations

If you’re considering Athena federation with a SAP HANA database, we recommend the following best practices:

  • Athena federation works great for queries with predicate filtering because the predicates are pushed down to the SAP HANA database. Use filter and limited-range scans in your queries to avoid full table scans.
  • If your SQL query requires returning a large volume of data from the SAP HANA database to Athena (which could lead to query timeouts or slow performance), unload the large tables in your query from SAP HANA to your S3 data lake.
  • Star schema is a commonly used data model in SAP HANA databases. In the star schema model, unload your large fact tables into your S3 data lake and leave the dimension tables in your SAP HANA database. If large dimension tables are contributing to slow performance or query timeouts, unload those tables to your S3 data lake.
  • When you run federated queries, Athena spins up multiple Lambda functions, which causes a spike in database connections. It’s important to monitor the SAP HANA database connection and session limits to ensure that queries are not queued or rejected, and to size the database to handle the expected number of concurrent connections.

Conclusion

In this post, you learned how to configure and use Athena Federated query with SAP HANA using Lambda. Now you don’t need to wait for all the data in your SAP HANA data warehouse to be unloaded to Amazon S3 and maintained on a day-to-day basis to run your queries.

You can use the best practice considerations outlined in the post to help minimize the data transferred from SAP HANA for better performance. When queries are well-written for Federated query, the performance penalties are negligible.

For more information, see the Athena User Guide and Using Amazon Athena Federated Query.


About the Author

Navnit Shukla is an AWS Specialist Solutions Architect in Analytics. He is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

Field Notes: Three Steps to Port Your Containerized Application to AWS Lambda

Post Syndicated from Arthi Jaganathan original https://aws.amazon.com/blogs/architecture/field-notes-three-steps-to-port-your-containerized-application-to-aws-lambda/

AWS Lambda support for container images allows porting containerized web applications to run in a serverless environment. This gives you automatic scaling, built-in high availability, and a pay-for-value billing model so you don’t pay for over-provisioned resources. If you are currently using containers, container image support for AWS Lambda means you can use these benefits without additional upfront engineering efforts to adopt new tooling or development workflows. You can continue to use your team’s familiarity with containers while gaining the benefits from the operational simplicity and cost effectiveness of Serverless computing.

This blog post describes the steps to take a containerized web application and run it on Lambda with only minor changes to the development, packaging, and deployment process. We use Ruby, as an example, but the steps are the same for any programming language.

Overview of solution

The following sample application is a containerized web service that generates PDF invoices. We will migrate this application to run the business logic in a Lambda function, and use Amazon API Gateway to provide a Serverless RESTful web API. API Gateway is a managed service to create and run API operations at scale.

Figure 1: Generating PDF invoice with Lambda

Walkthrough

In this blog post, you will learn how to port the containerized web application to run in a serverless environment using Lambda.

At a high level, you are going to:

  1. Get the containerized application running locally for testing
  2. Port the application to run on Lambda
    1. Create a Lambda function handler
    2. Modify the container image for Lambda
    3. Test the containerized Lambda function locally
  3. Deploy and test on Amazon Web Services (AWS)

Prerequisites

For this walkthrough, you need the following:

1. Get the containerized application running locally for testing

The sample code for this application is available on GitHub. Clone the repository to follow along.

```bash

git clone https://github.com/aws-samples/aws-lambda-containerized-custom-runtime-blog.git

```

1.1. Build the Docker image

Review the Dockerfile in the root of the cloned repository. The Dockerfile uses Bitnami’s Ruby 3.0 image from the Amazon ECR Public Gallery as the base. It follows security best practices by running the application as a non-root user and exposes the invoice generator service on port 4567. Open your terminal and navigate to the folder where you cloned the GitHub repository. Build the Docker image using the following command.

```bash
docker build -t ruby-invoice-generator .
```

1.2 Test locally

Run the application locally using the Docker run command.

```bash
docker run -p 4567:4567 ruby-invoice-generator
```

In a real-world scenario, the order and customer details for the invoice would be passed as POST request body or GET request query string parameters. To keep things simple, we are randomly selecting from a few hard coded values inside lib/utils.rb. Open another terminal, and test invoice creation using the following command.

```bash
curl "http://localhost:4567/invoice" \
  --output invoice.pdf \
  --header 'Accept: application/pdf'
```

This command creates the file invoice.pdf in the folder where you ran the curl command. You can also test the URL directly in a browser. Press Ctrl+C to stop the container. At this point, we know our application works and we are ready to port it to run on Lambda as a container.

2. Port the application to run on Lambda

There is no change to the Lambda operational model and request plane: the function handler is still the entry point to application logic when you package a Lambda function as a container image. By moving the business logic to a Lambda function, we also separate two concerns and replace the web server code in the container image with an HTTP API powered by API Gateway. You can focus on the business logic in the container, with API Gateway acting as a proxy to route requests.

2.1. Create the Lambda function handler

The code for our Lambda function is defined in function.rb; the handler function from that file is shown below. The main difference between the original Sinatra-powered code and our Lambda handler version is the need to Base64 encode the PDF. This is a requirement for returning binary media from an API Gateway Lambda proxy integration. API Gateway automatically decodes this to return the PDF file to the client.

```ruby
def self.process(event:, context:)
  self.logger.debug(JSON.generate(event))
  invoice_pdf = Base64.encode64(Invoice.new.generate)
  {
    'headers' => { 'Content-Type': 'application/pdf' },
    'statusCode' => 200,
    'body' => invoice_pdf,
    'isBase64Encoded' => true
  }
end
```

If you need a reminder on the basics of Lambda function handlers, review the documentation on writing a Lambda handler in Ruby. This completes the new addition to the development workflow—creating a Lambda function handler as the wrapper for the business logic.

2.2 Modify the container image for Lambda

AWS provides open source base images for Lambda. At this time, these base images are only available for Ruby runtime versions 2.5 and 2.7. But, you can bring any version of your preferred runtime (Ruby 3.0 in this case) by packaging it with your Docker image. We will use Bitnami’s Ruby 3.0 image from the Amazon ECR Public Gallery as the base. Amazon ECR is a fully managed container registry, and it is important to note that Lambda only supports running container images that are stored in Amazon ECR; you can’t upload an arbitrary container image to Lambda directly.

Because the function handler is the entry point to business logic, the Dockerfile CMD must be set to the function handler instead of starting the web server. In our case, because we are using our own base image to bring a custom runtime, there is an additional change we must make. Custom images need the runtime interface client to manage the interaction between the Lambda service and our function’s code.

The runtime interface client is an open-source lightweight interface that receives requests from Lambda, passes the requests to the function handler, and returns the runtime results back to the Lambda service. The following are the relevant changes to the Dockerfile.

```bash
ENTRYPOINT ["aws_lambda_ric"]
CMD [ "function.Billing::InvoiceGenerator.process" ]
```

The Docker command that is implemented when the container runs is: aws_lambda_ric function.Billing::InvoiceGenerator.process.

Refer to Dockerfile.lambda in the cloned repository for the complete code. This image follows best practices for optimizing Lambda container images by using a multi-stage build. The final image uses the tag named 3.0-prod as its base, which does not include development dependencies and helps keep the image size down. Create the Lambda-compatible container image using the following command.

```bash

docker build -f Dockerfile.lambda -t lambda-ruby-invoice-generator .

```

This concludes changes to the Dockerfile. We have introduced a new dependency on the runtime interface client and used it as our container’s entrypoint.

2.3. Test the containerized Lambda function locally

The runtime interface client expects requests from the Lambda Runtime API. But when we test on our local development workstation, we don’t run the Lambda service. So, we need a way to proxy the Runtime API for local testing. Because local testing is an integral part of most development workflows, AWS provides the Lambda runtime interface emulator. The emulator is a lightweight web server running on port 8080 that converts HTTP requests to Lambda-compatible JSON events. The flow for local testing is shown in Figure 2.

Figure 2: Testing the Lambda container image locally

When we want to perform local testing, the runtime interface emulator becomes the entrypoint. Consequently, the Docker command that is executed when the container runs locally is: aws-lambda-rie aws_lambda_ric function.Billing::InvoiceGenerator.process.

You can package the emulator with the image. The Lambda container image support launch blog documents steps for this approach. However, to keep the image as slim as possible, we recommend that you install it locally, and mount it while running the container instead. Here is the installation command for Linux platforms.

```bash
mkdir -p ~/.aws-lambda-rie && \
  curl -Lo ~/.aws-lambda-rie/aws-lambda-rie https://github.com/aws/aws-lambda-runtime-interface-emulator/releases/latest/download/aws-lambda-rie && \
  chmod +x ~/.aws-lambda-rie/aws-lambda-rie
```

Use the Docker run command with the appropriate overrides to entrypoint and cmd to start the Lambda container. The emulator is mapped to local port 9000.

```bash
docker run \
  -v ~/.aws-lambda-rie:/aws-lambda \
  -p 9000:8080 \
  -e LOG_LEVEL=DEBUG \
  --entrypoint /aws-lambda/aws-lambda-rie \
  lambda-ruby-invoice-generator \
  /opt/bitnami/ruby/bin/aws_lambda_ric function.Billing::InvoiceGenerator.process
```

Open another terminal and run the curl command below to simulate a Lambda request.

```bash

curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'

```
You will see a JSON output with the base64-encoded value of the invoice for body and the isBase64Encoded property set to true. After Lambda is integrated with API Gateway, the API endpoint uses this flag to decode the text before returning the response to the caller, so the client receives the decoded PDF invoice. Press Ctrl+C to stop the container. This concludes the changes to the local testing workflow.

3. Deploy and test on AWS

The final step is to deploy the invoice generator service. Set your AWS Region and AWS Account ID as environment variables.

```bash

export AWS_REGION=Region

export AWS_ACCOUNT_ID=account

```

3.1. Push the Docker image to Amazon ECR

Create an Amazon ECR repository for the image.

```bash
ECR_REPOSITORY=`aws ecr create-repository \
  --region $AWS_REGION \
  --repository-name lambda-ruby-invoice-generator \
  --tags Key=Project,Value=lambda-ruby-invoice-generator-blog \
  --query "repository.repositoryName" \
  --output text`
```

Login to Amazon ECR, and push the container image to the newly-created repository.

```bash
aws ecr get-login-password \
  --region $AWS_REGION | docker login \
  --username AWS \
  --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com

docker tag \
  lambda-ruby-invoice-generator:latest \
  $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPOSITORY:latest

docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPOSITORY:latest
```

3.2. Create the Lambda function

After the push succeeds, you can create the Lambda function. You need to create a Lambda execution role first and attach the managed IAM policy named AWSLambdaBasicExecutionRole. This gives the function access to Amazon CloudWatch for logging and monitoring.

```bash
LAMBDA_ROLE=`aws iam create-role \
  --region $AWS_REGION \
  --role-name ruby-invoice-generator-lambda-role \
  --assume-role-policy-document file://lambda-role-trust-policy.json \
  --tags Key=Project,Value=lambda-ruby-invoice-generator-blog \
  --query "Role.Arn" \
  --output text`

aws iam attach-role-policy \
  --region $AWS_REGION \
  --role-name ruby-invoice-generator-lambda-role \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

LAMBDA_ARN=`aws lambda create-function \
  --region $AWS_REGION \
  --function-name ruby-invoice-generator \
  --description "[AWS Blog] Lambda Ruby Invoice Generator" \
  --role $LAMBDA_ROLE \
  --package-type Image \
  --code ImageUri=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$ECR_REPOSITORY:latest \
  --timeout 15 \
  --memory-size 256 \
  --tags Project=lambda-ruby-invoice-generator-blog \
  --query "FunctionArn" \
  --output text`
```

Wait for the function to be ready. Use the following command to verify that the function state is set to Active.

```bash
aws lambda get-function \
  --region $AWS_REGION \
  --function-name ruby-invoice-generator \
  --query "Configuration.State"
```

3.3. Integrate with API Gateway

API Gateway offers two options to create RESTful APIs: HTTP APIs and REST APIs. We use an HTTP API because it offers lower cost and latency compared to a REST API, which provides additional features we don’t need for this demo.

```bash
aws apigatewayv2 create-api \
  --region $AWS_REGION \
  --name invoice-generator-api \
  --protocol-type HTTP \
  --target $LAMBDA_ARN \
  --route-key "GET /invoice" \
  --tags Key=Project,Value=lambda-ruby-invoice-generator-blog \
  --query "{ApiEndpoint: ApiEndpoint, ApiId: ApiId}" \
  --output json
```

Record the ApiEndpoint and ApiId from the earlier command, and substitute them for the placeholders in the following command. You need to update the Lambda resource policy to allow HTTP API to invoke it.

```bash
export API_ENDPOINT="<ApiEndpoint>"
export API_ID="<ApiId>"

aws lambda add-permission \
  --region $AWS_REGION \
  --statement-id invoice-generator-api \
  --action lambda:InvokeFunction \
  --function-name $LAMBDA_ARN \
  --principal apigateway.amazonaws.com \
  --source-arn "arn:aws:execute-api:$AWS_REGION:$AWS_ACCOUNT_ID:$API_ID/*/*/invoice"
```

3.4. Verify the AWS deployment

Use the curl command to generate a PDF invoice.

```bash

curl "$API_ENDPOINT/invoice" \
  --output lambda-invoice.pdf \
  --header 'Accept: application/pdf'

```

This creates the file lambda-invoice.pdf in the local folder. You can also test directly from a browser.

This concludes the final step for porting the containerized invoice web service to Lambda. This deployment workflow is very similar to any other containerized application where you first build the image and then create or update your application to the new image as part of deployment. Only the actual deploy command has changed because we are deploying to Lambda instead of a container platform.

Cleaning up

To avoid incurring future charges, delete the resources. Follow cleanup instructions in the README file on the GitHub repo.

Conclusion

In this blog post, we learned how to port an existing containerized application to Lambda with only minor changes to development, packaging, and testing workflows. For teams with limited time, this accelerates the adoption of Serverless by using container domain knowledge and promoting reuse of existing tooling. We also saw how you can easily bring your own runtime by packaging it in your image and how you can simplify your application by eliminating the web application framework.

For more Serverless learning resources, visit https://serverlessland.com.

 

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

 

Building well-architected serverless applications: Optimizing application performance – part 1

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-optimizing-application-performance-part-1/

This series of blog posts uses the AWS Well-Architected Tool with the Serverless Lens to help customers build and operate applications using best practices. In each post, I address the serverless-specific questions identified by the Serverless Lens along with the recommended best practices. See the introduction post for a table of contents and explanation of the example application.

PERF 1. Optimizing your serverless application’s performance

Evaluate and optimize your serverless application’s performance based on access patterns, scaling mechanisms, and native integrations. This allows you to continuously gain more value per transaction. You can improve your overall experience and make more efficient use of the platform in terms of both value and resources.

Good practice: Measure and optimize function startup time

Evaluate your AWS Lambda function startup time for both performance and cost.

Take advantage of execution environment reuse to improve the performance of your function.

Lambda invokes your function in a secure and isolated runtime environment, and manages the resources required to run your function. When a function is first invoked, the Lambda service creates an instance of the function to process the event. This is called a cold start. After completion, the function remains available for a period of time to process subsequent events. These are called warm starts.

Lambda functions must contain a handler method in your code that processes events. During a cold start, Lambda runs the function initialization code, which is the code outside the handler, and then runs the handler code. During a warm start, Lambda runs the handler code.

Lambda function cold and warm starts

Initialize SDK clients, objects, and database connections outside of the function handler so that they are started during the cold start process. These connections then remain during subsequent warm starts, which improves function performance and cost.

Lambda provides a writable local file system available at /tmp. This is local to each function but shared between subsequent invocations within the same execution environment. You can download and cache assets locally in the /tmp folder during the cold start. This data is then available locally by all subsequent warm start invocations, improving performance.
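
The following is a minimal sketch of that pattern; the bucket and object names are placeholders. The download runs once per execution environment, and warm invocations reuse the cached file:

import os

import boto3

s3 = boto3.client("s3")
CACHED_ASSET = "/tmp/invoice-template.html"

# Cold start only: download the asset if it is not cached yet
if not os.path.exists(CACHED_ASSET):
    s3.download_file("my-assets-bucket", "templates/invoice-template.html", CACHED_ASSET)

def handler(event, context):
    # Warm starts reuse the file already present in /tmp
    with open(CACHED_ASSET) as f:
        template = f.read()
    return {"statusCode": 200, "body": f"template size: {len(template)}"}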

In the serverless airline example used in this series, the confirm booking Lambda function initializes a number of components during the cold start. These include the Lambda Powertools utilities and a boto3 session and table resource for the Amazon DynamoDB table specified in the BOOKING_TABLE_NAME environment variable.

import os

import boto3
from aws_lambda_powertools import Logger, Metrics, Tracer
from aws_lambda_powertools.metrics import MetricUnit
from botocore.exceptions import ClientError

# Initialized once per execution environment, during the cold start
logger = Logger()
tracer = Tracer()
metrics = Metrics()

# Reused across subsequent warm invocations
session = boto3.Session()
dynamodb = session.resource("dynamodb")
table_name = os.getenv("BOOKING_TABLE_NAME", "undefined")
table = dynamodb.Table(table_name)

Analyze and improve startup time

There are a number of steps you can take to measure and optimize Lambda function initialization time.

You can view the function cold start initialization time using Amazon CloudWatch Logs and AWS X-Ray. A log REPORT line for a cold start includes the Init Duration value. This is the time the initialization code takes to run before the handler.

CloudWatch Logs cold start report line

When X-Ray tracing is enabled for a function, the trace includes the Initialization segment.

X-Ray trace cold start showing initialization segment

A subsequent warm start REPORT line does not include the Init Duration value, and is not present in the X-Ray trace:

CloudWatch Logs warm start report line

X-Ray trace warm start without showing initialization segment

CloudWatch Logs Insights allows you to search and analyze CloudWatch Logs data over multiple log groups. There are some useful searches to understand cold starts.

Understand cold start percentage over time:

filter @type = "REPORT"
| stats
  sum(strcontains(
    @message,
    "Init Duration"))
  / count(*)
  * 100
  as coldStartPercentage,
  avg(@duration)
  by bin(5m)
Cold start percentage over time

Cold start count and InitDuration:

filter @type="REPORT" 
| fields @memorySize / 1000000 as memorySize
| filter @message like /(?i)(Init Duration)/
| parse @message /^REPORT.*Init Duration: (?<initDuration>.*) ms.*/
| parse @log /^.*\/aws\/lambda\/(?<functionName>.*)/
| stats count() as coldStarts, median(initDuration) as avgInitDuration, max(initDuration) as maxInitDuration by functionName, memorySize
Cold start count and InitDuration

Once you have measured cold start performance, there are a number of ways to optimize startup time. For Python, you can use the PYTHONPROFILEIMPORTTIME=1 environment variable.

PYTHONPROFILEIMPORTTIME environment variable

This shows how long each package import takes to help you understand how packages impact startup time.

Python import time

Previously, for the AWS Node.js SDK, you enabled HTTP keep-alive in your code to maintain TCP connections. Enabling keep-alive allows you to avoid setting up a new TCP connection for every request. Since AWS SDK version 2.463.0, you can also set the Lambda function environment variable AWS_NODEJS_CONNECTION_REUSE_ENABLED=1 to make the SDK reuse connections by default.

You can configure Lambda’s provisioned concurrency feature to pre-initialize a requested number of execution environments. This runs the cold start initialization code so that they are prepared to respond immediately to your function’s invocations.
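
You can configure provisioned concurrency in the console, with infrastructure as code, or via the API. The following boto3 sketch keeps five environments initialized for a function alias; the function name and alias are placeholders:

import boto3

lambda_client = boto3.client("lambda")

# Keep five execution environments initialized for the "live" alias
lambda_client.put_provisioned_concurrency_config(
    FunctionName="confirm-booking",
    Qualifier="live",
    ProvisionedConcurrentExecutions=5,
)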

Use Amazon RDS Proxy to pool and share database connections to improve function performance. For additional options for using RDS with Lambda, see the AWS Serverless Hero blog post “How To: Manage RDS Connections from AWS Lambda Serverless Functions”.

Choose frameworks that load quickly on function initialization. For example, prefer simpler Java dependency injection frameworks like Dagger or Guice over more complex frameworks such as Spring. When using the AWS SDK for Java, there are some cold start performance optimization suggestions in the documentation. For further Java performance optimization tips, see the AWS re:Invent session, “Best practices for AWS Lambda and Java”.

To minimize deployment packages, choose lightweight web frameworks optimized for Lambda. For example, use MiddyJS, Lambda API JS, and Python Chalice over Node.js Express, Python Django or Flask.

If your function has many objects and connections, consider splitting the function into multiple, specialized functions. These are individually smaller and have less initialization code. I cover designing smaller, single purpose functions from a security perspective in “Managing application security boundaries – part 2”.

Minimize your deployment package size to only its runtime necessities

Smaller functions also allow you to separate functionality. Only import the libraries and dependencies that are necessary for your application processing. Use code bundling when you can to reduce the impact of file system lookup calls. This also includes deployment package size.

For example, if you only use Amazon DynamoDB in the AWS SDK, instead of importing the entire SDK, you can import an individual service. Compare the following three examples as shown in the Lambda Operator Guide:

// Instead of const AWS = require('aws-sdk'), use:
const DynamoDB = require('aws-sdk/clients/dynamodb')

// Instead of const AWSXRay = require('aws-xray-sdk'), use:
const AWSXRay = require('aws-xray-sdk-core')

// Instead of const AWS = AWSXRay.captureAWS(require('aws-sdk')), use:
const dynamodb = new DynamoDB.DocumentClient()
AWSXRay.captureAWSClient(dynamodb.service)

In testing, importing the DynamoDB library instead of the entire AWS SDK was 125 ms faster. Importing the X-Ray core library was 5 ms faster than the X-Ray SDK. Similarly, when wrapping a service initialization, preparing a DocumentClient before wrapping showed a 140-ms gain. Version 3 of the AWS SDK for JavaScript supports modular imports, which can further help reduce unused dependencies.

For additional options for optimizing AWS Node.js SDK imports, see the AWS Serverless Hero blog post.

Conclusion

Evaluate and optimize your serverless application’s performance based on access patterns, scaling mechanisms, and native integrations. You can improve your overall experience and make more efficient use of the platform in terms of both value and resources.

In this post, I cover measuring and optimizing function startup time. I explain cold and warm starts and how to reuse the Lambda execution environment to improve performance. I show a number of ways to analyze and optimize the initialization startup time. I explain how only importing necessary libraries and dependencies increases application performance.

This well-architected question continues in part 2, where I look at designing your function to take advantage of concurrency via asynchronous and stream-based invocations. I cover measuring, evaluating, and selecting optimal capacity units.

For more serverless learning resources, visit Serverless Land.

Python 3.9 runtime now available in AWS Lambda

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/python-3-9-runtime-now-available-in-aws-lambda/

You can now use the Python 3.9 runtime to develop your AWS Lambda functions. You can do this in the AWS Management Console, AWS CLI, AWS SDK, AWS Serverless Application Model (AWS SAM), or AWS Cloud Development Kit (AWS CDK). This post outlines some of the improvements to the Python runtime in version 3.9 and how to use this version in your Lambda functions.

New features and improvements to the Python language

Python 3.9 introduces new features for strings and dictionaries. There are new methods to remove prefixes and suffixes in strings. To remove a prefix, use str.removeprefix(prefix). To remove a suffix, use str.removesuffix(suffix). To learn more, read about PEP 616.

Dictionaries now offer two new operators (| and |=). The first is a union operator for merging dictionaries and the second allows developers to update the contents of a dictionary with another dictionary. To learn more, read about PEP 584.
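
The following snippet shows both the new string methods and the new dictionary operators; the values are illustrative.

log_line = "REPORT RequestId: abc123"
print(log_line.removeprefix("REPORT "))   # "RequestId: abc123"
print("handler.py".removesuffix(".py"))   # "handler"

defaults = {"memory": 128, "timeout": 3}
overrides = {"timeout": 30}

merged = defaults | overrides   # union: {'memory': 128, 'timeout': 30}
defaults |= overrides           # updates defaults in place
print(merged, defaults)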

You can alter the behavior of Python functions by using decorators. Previously, these could only consist of the @ symbol, a name, a dotted name, and an optional single call. Decorators can now consist of any valid expression, as explained in PEP 614.

There are also improvements for time zone handling. The zoneinfo module now supports the IANA time zone database. This can help remove boilerplate and brings improvements for code handling multiple timezones.
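
For example, the following snippet converts a timestamp between two IANA time zones; the zone names and date are illustrative.

from datetime import datetime
from zoneinfo import ZoneInfo

meeting = datetime(2021, 8, 16, 9, 0, tzinfo=ZoneInfo("America/New_York"))
print(meeting.astimezone(ZoneInfo("Europe/London")))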

While existing Python 3 versions support TLS1.2, Python 3.9 now provides support for TLS1.3. This helps improve the performance of encrypted connections with features such as False Start and Zero Round Trip Time (0-RTT).

For a complete list of updates in Python 3.9, read the launch documentation on the Python website.

Performance improvements in Python 3.9

There are two important performance improvements in Python 3.9 that you can benefit from without making any code changes.

The first impacts code that uses the built-in Python data structures tuple, list, dict, set, or frozenset. In Python 3.9, these internally use the vectorcall protocol, which can make function calls faster by reducing the number of temporary objects used. Second, Python 3.9 uses a new parser that is more performant than previous versions. To learn more, read about PEP 617.

Changes to how Lambda works with the Python runtime

In Python, the presence of an __init__.py file in a directory causes it to be treated as a package. Frequently, __init__.py is an empty file that’s used to ensure that Python identifies the directory as a package. However, it can also contain initialization code for the package. Before Python 3.9, where you provided your Lambda function in a package, Lambda did not run the __init__.py code in the handler’s directory and parent directories during function initialization. From Python 3.9, Lambda now runs this code during the initialization phase. This ensures that imported packages are properly initialized if they make use of __init__.py. Note that __init__.py code is only run when the execution environment is first initialized.
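
As a hedged sketch, assume a deployment package with a hypothetical my_package directory; from Python 3.9, any code in its __init__.py runs once during the initialization phase, and the handler only reads the values it prepared.

# Hypothetical layout of the deployment package:
#   my_package/__init__.py   <- runs during the Init phase from Python 3.9
#   app.py                   <- contains the handler below
#
# my_package/__init__.py could hold one-time setup such as:
#     import logging
#     logging.getLogger().info("my_package initialized")
#     SETTINGS = {"table": "orders"}

# app.py
import my_package

def lambda_handler(event, context):
    # Reuses the values prepared during initialization on every invocation.
    return {"table": my_package.SETTINGS["table"]}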

Finally, there is a change to the error response in this new version. When previous Python versions threw errors, the formatting appeared as:

{"errorMessage": "name 'x' is not defined", "errorType": "NameError", "stackTrace": [" File \"/var/task/error_function.py\", line 2, in lambda_handler\n return x + 10\n"]}

From Python 3.9, the error response includes a RequestId:

{"errorMessage": "name 'x' is not defined", "errorType": "NameError", **"requestId"**: "<request id of function invoke>" "stackTrace": [" File \"/var/task/error_function.py\", line 2, in lambda_handler\n return x + 10\n"]}

Using Python 3.9 in Lambda

You can now use the Python 3.9 runtime to develop your AWS Lambda functions. To use this version, specify a runtime parameter value of python3.9 when creating or updating Lambda functions. You can see the new version in the Runtime dropdown on the Create function page.

Create function page

To update an existing Lambda function to Python 3.9, navigate to the function in the Lambda console, then choose Edit in the Runtime settings panel. You see the new version in the Runtime dropdown:

Edit runtime settings
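
If you manage functions programmatically, a minimal Boto3 sketch for the same update might look like this; the function name is a placeholder.

import boto3

# Switch an existing function to the Python 3.9 runtime.
boto3.client("lambda").update_function_configuration(
    FunctionName="my-function",   # placeholder
    Runtime="python3.9",
)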

In the AWS Serverless Application Model (AWS SAM), set the Runtime attribute to python3.9 to use this version in your application deployments:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Simple Lambda Function
  
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Description: My Python Lambda Function
      CodeUri: my_function/
      Handler: lambda_function.lambda_handler
      Runtime: python3.9

Conclusion

You can now create new functions or upgrade existing Python functions to Python 3.9. Lambda’s support of the Python 3.9 runtime enables you to take advantage of improved performance and new features in this version. Additionally, the Lambda service now runs the __init__.py code before the handler, and the new runtime supports TLS 1.3 and provides enhanced logging for errors.

For more serverless learning resources, visit Serverless Land.

Field Notes: Creating Custom Analytics Dashboards with FireEye Helix and Amazon QuickSight

Post Syndicated from Karish Chowdhury original https://aws.amazon.com/blogs/architecture/field-notes-creating-custom-analytics-dashboards-with-fireeye-helix-and-amazon-quicksight/

FireEye Helix is a security operations platform that allows organizations to take control of any incident from detection to response. FireEye Helix detects security incidents by correlating logs and configuration settings from sources like VPC Flow Logs, AWS CloudTrail, and Security groups.

In this blog post, we will discuss an architecture that allows you to create custom analytics dashboards with Amazon QuickSight. These dashboards are based on the threat detection logs collected by FireEye Helix. We automate this process so that data can be pulled and ingested based on a provided schedule. This approach uses AWS Lambda, and Amazon Simple Storage Service (Amazon S3) in addition to QuickSight.

Architecture Overview

The solution outlines how to ingest the security log data from FireEye Helix to Amazon S3 and create QuickSight visualizations from the log data. With this approach, you need Amazon EventBridge to invoke a Lambda function to connect to the FireEye Helix API. There are two steps to this process:

  1. Download the log data from FireEye Helix and store it in Amazon S3.
  2. Create a visualization Dashboard in QuickSight.

The architecture shown in Figure 1 represents the process we will walk through in this blog post, along with the AWS services and features involved.

Figure 1: Solution architecture

Prerequisites to implement the solution:

The following items are required to get your environment set up for this walkthrough.

  1. AWS account.
  2. FireEye Helix search alerts API endpoint. This is available under the API documentation in the FireEye Helix console.
  3. FireEye Helix API key. This FireEye community page explains how to generate an API key with appropriate permissions (always follow least privilege principles). This key is used by the Lambda function to periodically fetch alerts.
  4. AWS Secrets Manager secret (to store the FireEye Helix API key). To set it up, follow the steps outlined in the Creating a secret.

Extract the data from FireEye Helix and load it into Amazon S3

You will use the following high-level steps to retrieve the necessary security log data from FireEye Helix and store it on Amazon S3 to make it available for QuickSight.

  1. Establish an AWS Identity and Access Management (IAM) role for the Lambda function. It must have permissions to access Secrets Manager and Amazon S3 so they can retrieve the FireEye Helix API key and store the extracted data, respectively.
  2. Create an Amazon S3 bucket to store the FireEye Helix security log data.
  3. Create a Lambda function that uses the API key from Secrets Manager, calls the FireEye Helix search alerts API to extract FireEye Helix’s threat detection logs, and stores the data in the S3 bucket.
  4. Establish a CloudWatch EventBridge rule to invoke the Lambda function on an automated schedule.

To simplify this deployment, we have developed a CloudFormation template to automate creating the preceding requirements. Follow these steps to deploy the template:

  • Download the source code from the GitHub repository
  • Navigate to CloudFormation console and select Create Stack
  • Select the “Upload a template file” radio button, click the “Choose file” button, and select the “helix-dashboard.yaml” file from the downloaded GitHub repository. Click “Next” to proceed.
  • On “Specify stack details” screen enter the parameters shown in Figure 2.
Figure 2: CloudFormation stack creation with initial parameters

The parameters in Figure 2 include:

  • HelixAPISecretName – Enter the Secrets Manager secret name where the FireEye Helix API key is stored.
  • HelixEndpointUrl – Enter the Helix search API endpoint URL.
  • Amazon S3 bucket – Enter the bucket prefix (a random suffix will be added to make it unique).
  • Schedule – Choose the default option that pulls logs once a day, or enter the CloudWatch event schedule expression.

Select the check box next to “I acknowledge that AWS CloudFormation might create IAM resources.” and then press the Create Stack button. After the CloudFormation stack completes, you will have a fully functional process that will retrieve the FireEye Helix security log data and store it on the S3 bucket.

You can also select the Lambda function from the CloudFormation stack outputs to navigate to the Lambda console. Review the following default code, and add any additional transformation logic according to your needs after fetching the results (line 32).

import boto3
from datetime import datetime
import requests
import os

region_name = os.environ['AWS_REGION']
secret_name = os.environ['APIKEY_SECRET_NAME']
bucket_name = os.environ['S3_BUCKET_NAME']
helix_api_url = os.environ['HELIX_ENDPOINT_URL']

def lambda_handler(event, context):

    now = datetime.now()
    # Create a Secrets Manager client to fetch API Key from Secrets Manager
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )
    apikey = client.get_secret_value(
            SecretId=secret_name
        )['SecretString']
    
    datestr = now.strftime("%B %d, %Y")
    apiheader = {'x-fireeye-api-key': apikey}
    
    try:
        # Call Helix Rest API to fetch the Alerts
        helixalerts = requests.get(
            f'{helix_api_url}?format=csv&query=start:"{datestr}" end:"{datestr}" class=alerts', headers=apiheader)
    
        # Optionally transform the content according to your needs..
        
        # Create a S3 client to upload the CSV file
        s3 = boto3.client('s3')
        path = now.strftime("%Y/%m/%d")+'/alerts-'+now.strftime("%H-%M-%S")+'.csv'
        response = s3.put_object(
            Body=helixalerts.content,
            Bucket=bucket_name,
            Key=path,
        )
        print('S3 upload response:', response)
    except Exception as e:
        print('error while fetching the alerts', e)
        raise e
        
    return {
        'statusCode': 200,
        'body': f'Successfully fetched alerts from Helix and uploaded to {path}'
    }

Creating a visualization in QuickSight

Once the FireEye data is ingested into an S3 bucket, you can start creating custom reports and dashboards using QuickSight. Following is a walkthrough on how to create a visualization in QuickSight based on the data that was ingested from FireEye Helix.

Step 1 – When placing the FireEye data into Amazon S3, ensure you have a clean directory structure so you can partition your data. By partitioning your data, you can restrict the amount of data scanned by each query, thus improving performance and reducing cost. The following is a sample directory structure you could use.

      ssw-fireeye-logs/
          2021/
              04/
              05/

The following is an example of what the data will look like after it is ingested from FireEye Helix into your Amazon S3 bucket. In this blog post, we will use the Alert_Desc column to report on the types of common attacks.

Step 2 – Next, you must create a manifest file that instructs QuickSight how to read the FireEye log files on Amazon S3. The following manifest file instructs QuickSight to recursively search for files in the ssw-fireeye-logs bucket, as defined in the URIPrefixes section. The globalUploadSettings section informs QuickSight of the type and format of the files it will read.

"fileLocations": [
	{
		"URIPrefixes": [
			"s3://ssw-fireeye-logs/"
		]
	},
	],
	"globalUploadSettings": {
		"format": "CSV",
		"delimiter": ",",
		"textqalifier": "'",
		"containsHeader": "true"
	}
}

Step 3 – Open Amazon QuickSight. Use the AWS Management Console and search for QuickSight.

Step 4 – Below the QuickSight logo, find and select Datasets.

Step 5 – Push the blue New dataset button.

Step 6 – Now you are on Create a Dataset page which enables you to select a data source you would like QuickSight to ingest. Because we have stored the FireEye Helix data on S3, you should choose the S3 data source.

Step 7 – A pop-up box will appear called New S3 data source. Type a data source name, and upload the manifest file you created. Next, push the Connect button.

Step 8 – You are now directed to the Visualize screen. For this exercise, let’s choose a Pie chart. You can find this in the Visual types section by hovering over each icon and reading the tool tip that comes up; look for the tool tip that says Pie chart. After selecting the Pie chart visual type, two entries called Group/Color and Value show up in the Field wells section at the top of the screen. Click the drop down in Group/Color and select the Alert_Desc column. Now click the drop down in Value and also select the Alert_Desc column, but choose Count as the aggregate. This will create an informative visualization of the most common attacks based on the sample data shown previously in Step 1.

Figure 3: Visualization screen in QuickSight

Clean up

If you created any resources in AWS for this solution, consider removing them and any example resources you deployed. This includes the FireEye Helix API keys, S3 buckets, Lambda functions, and QuickSight visualizations. This helps ensure that there are no unwanted recurring charges. For more information, review how to delete the CloudFormation stack.

Conclusion

This blog post showed you how to ingest security log data from FireEye Helix and store that data in Amazon S3. We also showed you how to create your own visualizations in QuickSight to better understand the overall security posture of your AWS environment. With this solution, you can perform deep analysis and share these insights among different audiences in your organization (such as, operations, security, and executive management).

 

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

 

Chaos Testing with AWS Fault Injection Simulator and AWS CodePipeline

Post Syndicated from Matt Chastain original https://aws.amazon.com/blogs/architecture/chaos-testing-with-aws-fault-injection-simulator-and-aws-codepipeline/

The COVID-19 pandemic has proven to be the largest stress test of our technology infrastructures in generations. A meteoric increase in internet consumption followed, due in large part to working and schooling from home. The chaotic, early months of the pandemic have clearly demonstrated the value of resiliency in production. How can we better prepare our critical systems for these global events in the future? A more modern approach to testing and validating your application architecture is needed. Chaos engineering has emerged as an innovative approach to solving these types of challenges.

This blog shows an architecture pattern for automating chaos testing as part of your continuous integration/continuous delivery (CI/CD) process. By automating the implementation of chaos experiments inside CI/CD pipelines, complex risks and modeled failure scenarios can be tested against application environments with every deployment. Application teams can use the results of these experiments to prioritize improvements in their architecture. These results will give your team the confidence they need to operate in an unpredictable production environment.

AWS Fault Injection Simulator (FIS) is a managed service that enables you to perform fault injection experiments on your AWS workloads. Fault injection is based on the principles of chaos engineering. These experiments stress an application by creating disruptive events so that you can observe how your application responds. You can then use this information to improve the performance and resiliency of your applications. With AWS FIS, you set up and run experiments that help you create the real-world conditions needed to uncover application issues.

AWS CodePipeline is a fully managed continuous delivery service for fast and reliable application and infrastructure updates. You can use AWS CodePipeline to model and automate your software release processes. Automating your build, test, and release process allows you to quickly test each code change. You can ensure the quality of your application or infrastructure code by running each change through your staging and release process.

Continuous chaos testing

Figure 1. High-level architecture pattern for automating chaos engineering

Create FIS experiments

Begin with creating an FIS experiment template by configuring one or more actions (action set) to run against the target resources of the application architecture. Here we have created an action to stop Amazon EC2 instances in our Amazon Elastic Container Service (ECS) cluster identified by a tag. Target resources can be identified by resource IDs, filters, or tags. We can also set up action parameters, such as the action duration and whether an action runs before or after other actions. Additionally, you can set up Amazon CloudWatch alarms to stop running one or more fault experiments once a particular threshold or boundary has been reached. In Increase your e-commerce website reliability using chaos engineering and AWS Fault Injection Simulator, Bastien Leblanc shares how to set up CloudWatch metric thresholds as stop conditions for experiments.

Figure 2. AWS FIS experiment template

Author AWS Lambda to initiate FIS experiments

Create an IAM role with permissions to start FIS experiments and attach it to the Lambda function in the configuration permissions section. To start a specific FIS experiment, we use the experimentTemplateId parameter in our Lambda code. Refer to the AWS FIS API Reference when writing your Lambda code. When integrating the Lambda function into your pipeline, a new AWS CodePipeline can be created or an existing one can be used. A new pipeline stage is added at the point where we initiate our Lambda function (post-deployment stage), which launches our FIS experiment.

Figure 3. AWS Lambda function initiating AWS FIS experiment

 

Figure 4. AWS CodePipeline with AWS FIS experiment stage

The experimentTemplateId parameter can also be staged as a key/value ‘environment variable’ in your Lambda function configuration. This is useful as it allows you to change your FIS experiment template without having to adjust your function code. You can use the same Lambda function code by dynamically injecting the experimentTemplateId in multiple environments on your way to production.
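
A minimal sketch of such a Lambda function is shown below, assuming the template ID is injected through a hypothetical EXPERIMENT_TEMPLATE_ID environment variable; the CodePipeline callback is included because the function runs as a pipeline action.

import os
import boto3

fis = boto3.client("fis")
codepipeline = boto3.client("codepipeline")

def lambda_handler(event, context):
    # The template ID is injected per environment, so the code stays unchanged.
    template_id = os.environ["EXPERIMENT_TEMPLATE_ID"]
    response = fis.start_experiment(experimentTemplateId=template_id)

    # Signal success back to CodePipeline so the stage can complete.
    codepipeline.put_job_success_result(jobId=event["CodePipeline.job"]["id"])

    return {"experimentId": response["experiment"]["id"]}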

Verify FIS experiment results on deployed application

By continuously performing fault injection post-deployment in AWS CodePipeline, you learn about complex failure conditions, which you must solve. User experience and availability testing on your application during the runtime of the FIS experiment can be started by a notification rule. In AWS CodePipeline, you can use an Amazon Simple Notification Service (SNS) topic or chatbot integration. CloudWatch Synthetics can be used for those looking to automate experience testing on the candidate application while other FIS experiments are running.

Figure 5. AWS CodePipeline notification rule setting

Summary

Using AWS CodePipeline to automate chaos engineering experiments on application architecture with AWS FIS is straightforward. Following are some benefits from automating fault injection testing in our CI/CD pipelines:

  • Our team can achieve a higher degree of confidence in meeting the resiliency requirements of our application. We use a more modern approach to testing and automating this experimentation inside our existing CI/CD process with AWS CodePipeline.
  • We know more about the unknown risks to our application. All testing results we receive provide benefit and learning opportunities for our team. We use these results to understand what we do well, where we need to improve, or what we are willing to tolerate based on our application requirements.
  • Continuously evaluating our architectural fitness inside CI/CD allows our team to validate the impact each feature or component iteration has on the resiliency of our application.

However, the value of automating chaos testing is not limited to finding, fixing, or documenting the risks that surface in our application. Additional confidence is gained by constantly validating your operational practices, such as alerts and alarms, monitoring, and notifications.

FIS gives you a controlled and repeatable way to reproduce necessary conditions to fine-tune your operational procedures and runbooks. Automating this testing inside a CI/CD pipeline ensures a nearly continuous feedback loop for these operational practices.

Building well-architected serverless applications: Building in resiliency – part 2

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-building-in-resiliency-part-2/

This series of blog posts uses the AWS Well-Architected Tool with the Serverless Lens to help customers build and operate applications using best practices. In each post, I address the serverless-specific questions identified by the Serverless Lens along with the recommended best practices. See the introduction post for a table of contents and explanation of the example application.

Reliability question REL2: How do you build resiliency into your serverless application?

This post continues part 1 of this reliability question. Previously, I cover managing failures using retries, exponential backoff, and jitter. I explain how DLQs can isolate failed messages. I show how to use state machines to orchestrate long running transactions rather than handling these in application code.

Required practice: Manage duplicate and unwanted events

Duplicate events can occur when a request is retried or multiple consumers process the same message from a queue or stream. A duplicate can also happen when a request is sent twice at different time intervals with the same parameters. Design your applications to process multiple identical requests to have the same effect as making a single request.

Idempotency refers to the capacity of an application or component to identify repeated events and prevent duplicated, inconsistent, or lost data. This means that receiving the same event multiple times does not change the result beyond the first time the event was received. An idempotent application can, for example, handle multiple identical refund operations. The first refund operation is processed. Any further refund requests to the same customer with the same payment reference should not be processed again.

When using AWS Lambda, you can make your function idempotent. The function’s code must properly validate input events and identify if the events were processed before. For more information, see “How do I make my Lambda function idempotent?”
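
The following is a minimal sketch of this pattern, using an in-memory store and a refund payload purely for illustration; a production function would persist the idempotency key in a durable store such as DynamoDB, as described in the next section.

import hashlib
import json

_processed = {}  # illustration only; persists per execution environment, not durably

def lambda_handler(event, context):
    # Validate the input event before doing any work.
    if "payment_reference" not in event:
        raise ValueError("malformed event")

    # The same input always produces the same key, so retried events are detected.
    key = hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()
    if key in _processed:
        return _processed[key]   # duplicate event: return the original result

    result = {"status": "refunded", "payment_reference": event["payment_reference"]}
    _processed[key] = result
    return result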

When processing streaming data, your application must anticipate and appropriately handle processing individual records multiple times. There are two primary reasons why records may be delivered more than once to your Amazon Kinesis Data Streams application: producer retries and consumer retries. For more information, see “Handling Duplicate Records”.

Generate unique attributes to manage duplicate events at the beginning of the transaction

Create a unique identifier, or use an existing one, at the beginning of a transaction to ensure idempotency. These identifiers are also known as idempotency tokens. A number of Lambda triggers include a unique identifier as part of the event, such as the SQS message ID or the Kinesis record sequence number.

You can also create your own identifiers. These can be business-specific, such as transaction ID, payment ID, or booking ID. You can use an opaque random alphanumeric string, unique correlation identifiers, or the hash of the content.

A Lambda function, for example, can use these identifiers to check whether the event has been previously processed.

Depending on the final destination, duplicate events might write to the same record with the same content instead of generating a duplicate entry. This may therefore not require additional safeguards.

Use an external system to store unique transaction attributes and verify for duplicates

Lambda functions can use Amazon DynamoDB to store and track transactions and idempotency tokens to determine if the transaction has been handled previously. DynamoDB Time to Live (TTL) allows you to define a per-item timestamp to determine when an item is no longer needed. This helps to limit the storage space used. Base the TTL on the event source. For example, the message retention period for SQS.

Using DynamoDB to store idempotent tokens

You can also use DynamoDB conditional writes to ensure a write operation only succeeds if an item attribute meets one or more expected conditions. For example, you can use this to fail a refund operation if a payment reference has already been refunded. This signals to the application that it is a duplicate transaction. The application can then catch this exception and return the same result to the customer as if the refund was processed successfully.
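
A hedged sketch of this pattern follows; the table name, attribute names, and 24-hour TTL are assumptions.

import time
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("idempotency-tokens")   # placeholder table

def record_token(token):
    try:
        table.put_item(
            Item={
                "token": token,
                "expireAt": int(time.time()) + 24 * 3600,   # TTL attribute, e.g. matching SQS retention
            },
            ConditionExpression="attribute_not_exists(#t)",
            ExpressionAttributeNames={"#t": "token"},
        )
        return True   # first time this token is seen
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False   # duplicate transaction
        raise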

Third-party APIs can also support idempotency directly. For example, Stripe allows you to add an Idempotency-Key: <key> header to the request. Stripe saves the resulting status code and body of the first request made for any given idempotency key, regardless of whether it succeeded or failed. Subsequent requests with the same key return the same result.

Validate events using a pre-defined and agreed upon schema

Implicitly trusting data from clients, external sources, or machines could lead to malformed data being processed. Use a schema to validate your event conforms to what you are expecting. Process the event using the schema within your application code or at the event source when applicable. Events not adhering to your schema should be discarded.

For API Gateway, I cover validating incoming HTTP requests against a schema in “Implementing application workload security – part 1”.

Amazon EventBridge rules match event patterns. EventBridge provides schemas for all events that are generated by AWS services. You can create or upload custom schemas or infer schemas directly from events on an event bus. You can also generate code bindings for event schemas.

SNS supports message filtering. This allows a subscriber to receive a subset of the messages sent to the topic using a filter policy. For more information, see the documentation.

JSON Schema is a tool for validating the structure of JSON documents. There are a number of implementations available.
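
For example, a hedged sketch of schema validation inside a Lambda function using the jsonschema package (bundled with your deployment package) might look like this; the schema and field names are assumptions.

from jsonschema import ValidationError, validate

ORDER_SCHEMA = {
    "type": "object",
    "required": ["orderId", "amount"],
    "properties": {
        "orderId": {"type": "string"},
        "amount": {"type": "number", "minimum": 0},
    },
}

def lambda_handler(event, context):
    try:
        validate(instance=event, schema=ORDER_SCHEMA)
    except ValidationError as err:
        # Discard events that do not conform to the agreed schema.
        return {"statusCode": 400, "body": f"Invalid event: {err.message}"}
    return {"statusCode": 200, "body": "Event accepted"}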

Best practice: Consider scaling patterns at burst rates

Load testing your serverless application allows you to monitor the performance of an application before it is deployed to production. Serverless applications can be simpler to load test, thanks to the automatic scaling built into many of the services. For more information, see “How to design Serverless Applications for massive scale”.

In addition to your baseline performance, consider evaluating how your workload handles initial burst rates. This ensures that your workload can sustain burst rates while scaling to meet possibly unexpected demand.

Perform load tests using a burst strategy with random intervals of idleness

Perform load tests using a burst of requests for a short period of time. Also introduce burst delays to allow your components to recover from unexpected load. This allows you to future-proof the workload for key events when you do not know peak traffic levels.

There are a number of AWS Marketplace and AWS Partner Network (APN) solutions available for performance testing, including Gatling FrontLine, BlazeMeter, and Apica.

In “Regulating inbound request rates – part 1”, I cover running a performance test suite using Gatling, an open source tool.

Gatling performance results

Amazon does have a network stress testing policy that defines which high volume network tests are allowed. Tests that purposefully attempt to overwhelm the target and/or infrastructure are considered distributed denial of service (DDoS) tests and are prohibited. For more information, see “Amazon EC2 Testing Policy”.

Review service account limits with combined utilization across resources

AWS accounts have default quotas, also referred to as limits, for each AWS service. These are generally Region-specific. You can request increases for some limits while other limits cannot be increased. Service Quotas is an AWS service that helps you manage your limits for many AWS services. Along with looking up the values, you can also request a limit increase from the Service Quotas console.

Service Quotas dashboard

As these limits are shared within an account, review the combined utilization across resources including the following:

  • Amazon API Gateway: number of requests per second across all APIs. (link)
  • AWS AppSync: throttle rate limits. (link)
  • AWS Lambda: function concurrency reservations and pool capacity to allow other functions to scale. (link)
  • Amazon CloudFront: requests per second per distribution. (link)
  • AWS IoT Core message broker: concurrent requests per second. (link)
  • Amazon EventBridge: API requests and target invocations limit. (link)
  • Amazon Cognito: API limits. (link)
  • Amazon DynamoDB: throughput, indexes, and request rates limits. (link)

Evaluate key metrics to understand how workloads recover from bursts

There are a number of key Amazon CloudWatch metrics to evaluate and alert on to understand whether your workload recovers from bursts.

  • AWS Lambda: Duration, Errors, Throttling, ConcurrentExecutions, UnreservedConcurrentExecutions. (link)
  • Amazon API Gateway: Latency, IntegrationLatency, 5xxError, 4xxError. (link)
  • Application Load Balancer: HTTPCode_ELB_5XX_Count, RejectedConnectionCount, HTTPCode_Target_5XX_Count, UnHealthyHostCount, LambdaInternalError, LambdaUserError. (link)
  • AWS AppSync: 5XX, Latency. (link)
  • Amazon SQS: ApproximateAgeOfOldestMessage. (link)
  • Amazon Kinesis Data Streams: ReadProvisionedThroughputExceeded, WriteProvisionedThroughputExceeded, GetRecords.IteratorAgeMilliseconds, PutRecord.Success, PutRecords.Success (if using Kinesis Producer Library), GetRecords.Success. (link)
  • Amazon SNS: NumberOfNotificationsFailed, NumberOfNotificationsFilteredOut-InvalidAttributes. (link)
  • Amazon Simple Email Service (SES): Rejects, Bounces, Complaints, Rendering Failures. (link)
  • AWS Step Functions: ExecutionThrottled, ExecutionsFailed, ExecutionsTimedOut. (link)
  • Amazon EventBridge: FailedInvocations, ThrottledRules. (link)
  • Amazon S3: 5xxErrors, TotalRequestLatency. (link)
  • Amazon DynamoDB: ReadThrottleEvents, WriteThrottleEvents, SystemErrors, ThrottledRequests, UserErrors. (link)

Conclusion

This post continues from part 1 and looks at managing duplicate and unwanted events with idempotency and an event schema. I cover how to consider scaling patterns at burst rates by managing account limits, and show relevant metrics to evaluate how workloads recover from bursts.

Build resiliency into your workloads. Ensure that applications can withstand partial and intermittent failures across components that may only surface in production. In the next post in the series, I cover the performance efficiency pillar from the Well-Architected Serverless Lens.

For more serverless learning resources, visit Serverless Land.

Classifying Millions of Amazon items with Machine Learning, Part I: Event Driven Architecture

Post Syndicated from Mahmoud Abid original https://aws.amazon.com/blogs/architecture/classifying-millions-of-amazon-items-with-machine-learning-part-i-event-driven-architecture/

As part of AWS Professional Services, we work with customers across different industries to understand their needs and supplement their teams with specialized skills and experience.

Some of our customers are internal teams from the Amazon retail organization who request our help with their initiatives. One of these teams, the Global Environmental Affairs team, identifies the number of electronic products sold. Then they classify these products according to local laws and accurately report this data to regulators. This process covers the products’ end-of-life costs and ensures a high quality of recycling.

These electronic products have classification codes that differ from country to country, and these codes change according to each country’s latest regulations. This poses a complex technical problem. How do we automate our compliance teams’ work to efficiently and accurately classify over three million product classifications every month, in more than 38 countries, while also complying with evolving classification regulations?

To solve this problem, we used Amazon Machine Learning (Amazon ML) capabilities to build a resilient architecture. It ingests and processes data, trains ML models, and predicts (also known as inference workflow) monthly sales data for all countries concurrently.

In this post, we outline how we used AWS Lambda, Amazon EventBridge, and AWS Step Functions to build a scalable and cost-effective solution. We’ll also show you how to keep the data secure while processing it in Amazon ML flows.

Solution overview

Our solution consists of three main parts, which are summarized here and detailed in the following sections:

  1. Training the ML models
  2. Evaluating their performance
  3. Using them in an inference workflow to label the sold items with the correct classification codes

Training the Amazon ML model

For training our Amazon ML model, we use the architecture in Figure 1. It starts with a periodic query against the Amazon.com data warehouse in Amazon Redshift.

Figure 1. Training workflow

  1. A labeled dataset containing pre-recorded classification codes is extracted from Amazon Redshift. This dataset is stored in an Amazon Simple Storage Service (Amazon S3) bucket and split up by country. The data is encrypted at rest with server-side encryption using an AWS Key Management Service (AWS KMS) key. This is also known as server-side encryption with AWS KMS (SSE-KMS). The extraction query uses the AWS KMS key to encrypt the data when storing it in the S3 bucket.
  2. Each time a country’s dataset is uploaded to the S3 bucket, a message is sent to an Amazon Simple Queue Service (Amazon SQS) queue. This prompts a Lambda function. We use Amazon SQS to ensure resiliency. If the Lambda function fails, the message will be tried again automatically. Overall, the message is either processed successfully, or ends up in a dead letter queue that we monitor (not displayed in Figure 1).
  3. If the message is processed successfully, the Lambda function generates the necessary input parameters. Then it starts a Step Functions workflow execution for the training process (a minimal sketch of this step appears after this list).
  4. The training process involves orchestrating Amazon SageMaker Processing jobs to prepare the data. Once the data is prepared, a hyperparameter optimization job invokes multiple training jobs. These run in parallel with different values from a range of hyperparameters. The model that performs the best is chosen to move forward.
  5. After the model is trained successfully, an EventBridge event is prompted, which will be used to invoke the performance comparison process.
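
A minimal sketch of step 3 follows; the state machine ARN environment variable and the message fields are assumptions for illustration.

import json
import os
import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = os.environ["TRAINING_STATE_MACHINE_ARN"]   # assumed variable name

def lambda_handler(event, context):
    # One SQS record is delivered per uploaded country dataset.
    for record in event["Records"]:
        message = json.loads(record["body"])
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps({
                "country": message.get("country"),        # assumed message field
                "dataset_s3_uri": message.get("s3_uri"),  # assumed message field
            }),
        )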

Comparing performance of Amazon ML models

Because Amazon ML models are automatically trained periodically, we want to assess their performance automatically too. Newly created models should perform better than their predecessors. To measure this, we use the flow in Figure 2.

Figure 2. Model performance comparison workflow

  1. The flow is activated by the EventBridge event at the end of the training flow.
  2. A Lambda function gathers the necessary input parameters and uses them to start an inference workflow, implemented as a Step Function.
  3. The inference workflow uses SageMaker Processing jobs to prepare a new test dataset. It performs predictions using SageMaker Batch Transform jobs with the new model. The test dataset is a labeled subset that was not used in model training. Its prediction gives an unbiased estimation of the model’s performance, proving that the model can generalize.
  4. After the inference workflow is completed and the results are stored on Amazon S3, an EventBridge event is emitted, which prompts another Lambda function. This function runs the performance comparison Step Function.
  5. The performance comparison workflow uses a SageMaker Processing job to analyze the inference results and calculate its performance score based on ground truth. For each country, the job compares the performance of the new model with the performance of the last used model to determine which one was best, otherwise known as the “winner model.” The metadata of the winner model is saved in an Amazon DynamoDB table so it can be queried and used in the next production inference job.
  6. At the end of the performance comparison flow, an informational notification is sent to an Amazon Simple Notification Service (Amazon SNS) topic, which will be received by the MLOps team.

Running inference

The inference flow starts with a periodic query against the Amazon.com data warehouse in Amazon Redshift, as shown in Figure 3.

Figure 3. Inference workflow

  1. As with training, the dataset is extracted from Amazon Redshift, split up by country, and stored in an S3 bucket and encrypted at rest using the AWS KMS key.
  2. Every country dataset upload prompts a message to an SQS queue, which invokes a Lambda function.
  3. The Lambda function gathers necessary input parameters and starts a workflow execution for the inference process. This is the same Step Function we used in the performance comparison. Now it runs against the real dataset instead of the test set.
  4. The inference Step Function orchestrates the data preparation and prediction using the winner model for each country, as stored in the model performance DynamoDB table. The predictions are uploaded back to the S3 bucket to be further consumed for reporting.
  5. Lastly, an Amazon SNS message is sent to signal completion of the inference flow, which will be received by different stakeholders.

Data encryption

One of the key requirements of this solution was to provide least privilege access to all data. To achieve this, we use AWS KMS to encrypt all data and restrict decryption permissions, as shown in Figure 4.

Figure 4. Restriction of data decryption permissions

Conclusion

In this post, we outline how we used a serverless architecture to handle the end-to-end flow of data extraction, processing, and storage. We also talk about how we use this data for model training and inference.

With this solution, our customer team onboarded 38 countries and brought 60 Amazon ML models to production to classify 3.3 million items on a monthly basis.

In the next post, we show you how we use AWS Developer Tools to build a comprehensive continuous integration/continuous delivery (CI/CD) pipeline that safeguards the code behind this solution.

 

CICD on Serverless Applications using AWS CodeArtifact

Post Syndicated from Anand Krishna original https://aws.amazon.com/blogs/devops/cicd-on-serverless-applications-using-aws-codeartifact/

Developing and deploying applications rapidly to users requires a working pipeline that accepts the user code (usually via a Git repository). AWS CodeArtifact was announced in 2020. It’s a secure and scalable artifact management product that easily integrates with other AWS products and services. CodeArtifact allows you to publish, store, and view packages, list package dependencies, and share your application’s packages.

In this post, I will show how we can build a simple DevOps pipeline for a sample Java application (JAR file) built with Maven.

Solution Overview

We utilize the following AWS services/Tools/Frameworks to set up our continuous integration, continuous deployment (CI/CD) pipeline:

The following diagram illustrates the pipeline architecture and flow:

 

aws-codeartifact-pipeline

 

Our pipeline is built on CodePipeline with CodeCommit as the source (CodePipeline Source Stage). This triggers the pipeline via a CloudWatch Events rule. Then the code is fetched from the CodeCommit repository branch (main) and sent to the next pipeline phase. This CodeBuild phase is specifically for compiling, packaging, and publishing the code to CodeArtifact by utilizing a package manager—in this case Maven.

After Maven publishes the code to CodeArtifact, the pipeline asks for a manual approval to be directly approved in the pipeline. It can also optionally trigger an email alert via Amazon Simple Notification Service (Amazon SNS). After approval, the pipeline moves to another CodeBuild phase. This downloads the latest packaged JAR file from a CodeArtifact repository and deploys to the AWS Lambda function.

Clone the Repository

Clone the GitHub repository as follows:

git clone https://github.com/aws-samples/aws-cdk-codeartifact-pipeline-sample.git

Code Deep Dive

After the Git repository is cloned, the directory structure is shown in the following screenshot:

aws-codeartifact-pipeline-code

Let’s study the files and code to understand how the pipeline is built.

The directory java-events is a sample Java Maven project. Find numerous sample applications on GitHub. For this post, we use the sample application java-events.

To add your own application code, place the pom.xml and settings.xml files in the root directory for the AWS CDK project.

Let’s study the code in the file lib/cdk-pipeline-codeartifact-new-stack.ts of the stack CdkPipelineCodeartifactStack. This is the heart of the AWS CDK code that builds the whole pipeline. The stack does the following:

  • Creates a CodeCommit repository called ca-pipeline-repository.
  • References a CloudFormation template (lib/ca-template.yaml) in the AWS CDK code via the module @aws-cdk/cloudformation-include.
  • Creates a CodeArtifact domain called cdkpipelines-codeartifact.
  • Creates a CodeArtifact repository called cdkpipelines-codeartifact-repository.
  • Creates a CodeBuild project called JarBuild_CodeArtifact. This CodeBuild phase does all of the code compiling, packaging, and publishing to CodeArtifact into a repository called cdkpipelines-codeartifact-repository.
  • Creates a CodeBuild project called JarDeploy_Lambda_Function. This phase fetches the latest artifact from CodeArtifact created in the previous step (cdkpipelines-codeartifact-repository) and deploys to the Lambda function.
  • Finally, creates a pipeline with four phases:
    • Source as CodeCommit (ca-pipeline-repository).
    • CodeBuild project JarBuild_CodeArtifact.
    • A Manual approval Stage.
    • CodeBuild project JarDeploy_Lambda_Function.

 

CodeArtifact shows the domain-specific and repository-specific connection settings to add to the application’s pom.xml and settings.xml files, as shown below:

aws-codeartifact-repository-connections

Deploy the Pipeline

The AWS CDK code requires the following packages in order to build the CI/CD pipeline:

  • @aws-cdk/core
  • @aws-cdk/aws-codepipeline
  • @aws-cdk/aws-codepipeline-actions
  • @aws-cdk/aws-codecommit
  • @aws-cdk/aws-codebuild
  • @aws-cdk/aws-iam
  • @aws-cdk/cloudformation-include

 

Install the required AWS CDK packages as below:

npm i @aws-cdk/core @aws-cdk/aws-codepipeline @aws-cdk/aws-codepipeline-actions @aws-cdk/aws-codecommit @aws-cdk/aws-codebuild @aws-cdk/pipelines @aws-cdk/aws-iam @aws-cdk/cloudformation-include

Compile the AWS CDK code:

npm run build

Deploy the AWS CDK code:

cdk synth
cdk deploy

After the AWS CDK code is deployed, view the final output on the stack’s detail page in the AWS CloudFormation console:

aws-codeartifact-pipeline-cloudformation-stack

 

How the pipeline works with artifact versions (using SNAPSHOTS)

In this demo, I publish SNAPSHOT to the repository. As per the documentation here and here, a SNAPSHOT refers to the most recent code along a branch. It’s a development version preceding the final release version. Identify a snapshot version of a Maven package by the suffix SNAPSHOT appended to the package version.

The application settings are defined in the pom.xml file. For this post, we define the following:

  • The version to be used, called 1.0-SNAPSHOT.
  • The specific packaging, called jar.
  • The specific project display name, called JavaEvents.
  • The specific group ID, called JavaEvents.

The screenshot below shows the pom.xml settings we utilised in the application:

aws-codeartifact-pipeline-pom-xml

 

You can’t republish a package asset that already exists with different content, as per the documentation here.

When a Maven snapshot is published, its previous version is preserved in a new version called a build. Each time a Maven snapshot is published, a new build version is created.

When a Maven snapshot is published, its status is set to Published, and the status of the build containing the previous version is set to Unlisted. If you request a snapshot, the version with status Published is returned. This is always the most recent Maven snapshot version.

For example, the image below shows the state when the pipeline is run for the FIRST RUN. The latest version has the status Published and previous builds are marked Unlisted.

aws-codeartifact-repository-package-versions

 

For all subsequent pipeline runs, multiple Unlisted versions will occur every time the pipeline is run, as all previous versions of a snapshot are maintained in its build versions.

aws-codeartifact-repository-package-versions

 

Fetching the Latest Code

Retrieve the snapshot from the repository in order to deploy the code to an AWS Lambda Function. I have used AWS CLI to list and fetch the latest asset of package version 1.0-SNAPSHOT.

 

Listing the latest snapshot

export ListLatestArtifact=`aws codeartifact list-package-version-assets --domain cdkpipelines-codeartifact --domain-owner $Account_Id --repository cdkpipelines-codeartifact-repository --namespace JavaEvents --format maven --package JavaEvents --package-version "1.0-SNAPSHOT" | jq ".assets[].name" | grep jar | sed 's/"//g'`

NOTE: $Account_Id is a variable that represents your AWS account ID.

 

Fetching the latest code using Package Version

aws codeartifact get-package-version-asset --domain cdkpipelines-codeartifact --repository cdkpipelines-codeartifact-repository --format maven --package JavaEvents --package-version 1.0-SNAPSHOT --namespace JavaEvents --asset $ListLatestArtifact demooutput

Notice that I’m referring to the latest artifact by using the variable $ListLatestArtifact. This always fetches the latest code, and demooutput is the output file of the AWS CLI command where the content (code) is saved.

 

Testing the Pipeline

Now clone the CodeCommit repository that we created with the following code:

git clone https://git-codecommit.<region>.amazonaws.com/v1/repos/ca-pipeline-repository

 

Enter the following code to push the code to the CodeCommit repository:

cp -rp cdk-pipeline-codeartifact-new/* ca-pipeline-repository
cd ca-pipeline-repository
git checkout -b main
git add .
git commit -m "testing the pipeline"
git push origin main

Once the code is pushed to Git repository, the pipeline is automatically triggered by Amazon CloudWatch events.

The following screenshots show the second phase (AWS CodeBuild Phase – JarBuild_CodeArtifact) of the pipeline, wherein the asset is successfully compiled and published to the CodeArtifact repository by Maven:

aws-codeartifact-pipeline-codebuild-jarbuild

aws-codeartifact-pipeline-codebuild-screenshot

aws-codeartifact-pipeline-codebuild-screenshot2

 

The following screenshots show the last phase (AWS CodeBuild Phase – Deploy-to-Lambda) of the pipeline, wherein the latest asset is successfully pulled and deployed to AWS Lambda Function.

Asset JavaEvents-1.0-20210618.131629-5.jar is the latest snapshot code for the package version 1.0-SNAPSHOT. This is the same asset version code that will be deployed to AWS Lambda Function, as seen in the screenshots below:

aws-codeartifact-pipeline-codebuild-jardeploy

aws-codeartifact-pipeline-codebuild-screenshot-jarbuild

The following screenshot of the pipeline shows a successful run. The code was fetched and deployed to the existing Lambda function (codeartifact-test-function).

aws-codeartifact-pipeline-codepipeline

Cleanup

To clean up, you can either delete the entire stack through the AWS CloudFormation console or use the following AWS CDK command:

cdk destroy

For more information on the AWS CDK commands, please check the documentation here or the sample here.

Summary

In this post, I demonstrated how to build a CI/CD pipeline for your serverless application with AWS CodePipeline by utilizing AWS CDK with AWS CodeArtifact. Please check the documentation here for an in-depth explanation regarding other package managers and the getting started guide.

Automating Your Home with Grafana and Siemens Controllers

Post Syndicated from Viktoria Semaan original https://aws.amazon.com/blogs/architecture/automating-your-home-with-grafana-and-siemens-controllers/

Imagine that you have access to a digital twin of your house that allows you to remotely monitor and control different devices inside your home. Forgot to turn off the heater or air conditioning? Didn’t close water faucets? Wondering how long your kids have been watching TV? Wouldn’t it be nice to have all the information from multiple devices in a single place?

Nowadays, many of us have smart things at home, such as thermostats, security cameras, wireless sensors, switches, etc. The problem is that most of these smart things come with different mobile applications. To get a full picture, we end up switching between applications that serve limited needs.

In this blog, we explain how to use Siemens controllers, AWS IoT, and the open-source visualization platform Grafana to quickly build a digital twin of almost any process. This includes home automation, industrial applications, security systems, and others. As an example, we will monitor environmental conditions, including temperature and humidity sensors, thermostat settings, light switches, as well as monthly water and energy consumption. We will go through the architecture and steps required to integrate different building components to store data for historical analysis, enable voice control, and create interactive near real-time dashboards showing a digital representation of your house. If you would like to learn more about the solution, we will provide links to all the architecture components and detailed configuration steps.

Figure 1. Smart home automation solution with Siemens LOGO! compact controller

Architecture

In this solution overview, we are using a low-cost Siemens LOGO! controller (hardware version 8.3 or higher). This controller supports traditional industrial protocols such as Simatic S7 and Modbus TCP/IP as well as MQTT through the native AWS IoT Core interface. Automation controllers are the brains behind smart systems that allow orchestrating all the devices in your home. This reference architecture can be extended to other devices that support MQTT protocol, have programmatic APIs, or Software Development Kits (SDKs). It could serve as a starting point for building a home automation solution using AWS IoT and Grafana and further customized based on customer needs.

Figure 2. Reference architecture for smart home automation solution

The components of this solution are:

  • The LOGO! controller controls home automation equipment and ingests data to AWS IoT Core.
  • AWS IoT Core collects data at scale and routes messages to multiple AWS services.
  • AWS Lambda is called inside the AWS IoT Core statement to transform the incoming data prior to ingestion.
  • Amazon Timestream stores time series data and optimizes it for fast analytical queries.
  • AWS IoT SiteWise models and stores data from equipment for large scale deployments.
  • Grafana installed on Amazon Elastic Compute Cloud (Amazon EC2) visualizes data in near real-time using interactive dashboards.
  • The Alexa Skills Kit (ASK) allows interaction with devices using voice commands.
Figure 3. Remote monitoring dashboard allows homeowners to view and control conditions

Solution overview

Step 1: Ingest data to AWS IoT

The IoT-enabled LOGO! controller provides the out-of-the-box capability to send data to AWS IoT Core. In a few clicks, you can configure which variables to publish to the AWS Cloud and how frequently to update them. To get started with the LOGO! controller, refer to the Siemens E-learning portal. AWS IoT Core collects and processes messages from remote devices transmitted over the secured MQTT protocol.

The LOGO! controller publishes data to AWS IoT Core in hexadecimal format. The Lambda function converts the data from hexadecimal to the standard decimal numeric system. If your home automation equipment sends data in the standard decimal format, then AWS IoT Core can directly write data to other AWS services without Lambda.
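
As an illustration, a minimal Lambda handler for this conversion might look like the following sketch. The payload shape and field names are assumptions for this example, not the exact format the LOGO! controller sends:

def lambda_handler(event, context):
    """Convert hexadecimal sensor readings into decimal values."""
    converted = {}
    for name, value in event.items():
        try:
            # int(x, 16) parses a hexadecimal string such as "1A" into 26
            converted[name] = int(value, 16)
        except (TypeError, ValueError):
            # Pass through values that are already numeric or not hex-encoded
            converted[name] = value
    return converted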

Step 2: Store data in Timestream or AWS IoT SiteWise

The ingested IoT data is saved for historical analysis. Timestream is a serverless time series database service that is optimized for high-throughput ingestion and has built-in analytical functions. It is one of the options you can use to store IoT data. Time series is a common data format for observing how things change over time, and it is well suited to building IoT applications.

AWS IoT SiteWise is an alternative option to store and organize data at scale. It is beneficial for large-scale commercial building automation and management systems, including offices, hotels, and factories. You can structure data by using built-in asset modeling capabilities.

Step 3: Visualize data in Grafana dashboard

Once data is stored, it can be made available to multiple applications. Grafana is a data visualization platform that you can use to monitor data. It supports near real-time visualization with a refresh rate of 5 seconds or higher. You can visualize data from multiple AWS sources (such as AWS IoT SiteWise, Timestream, and Amazon CloudWatch) and other data sources with a single Grafana dashboard. Grafana can be installed on Amazon Linux, Windows, or macOS, or deployed on Kubernetes (K8s) or in Docker containers. For customers who don’t want to manage infrastructure and are interested in a completely serverless solution, AWS offers Amazon Managed Service for Grafana. At the time of writing this post, the service is available in preview with a limited number of supported plugins.

To build a Grafana dashboard and retrieve data from Timestream, you can use SQL queries. The following Timestream query retrieves humidity values and timestamps for the past 24 hours:

SELECT measure_value::double AS humidity, time
FROM "myhome_db"."livingroom"
WHERE measure_name = 'humidity' AND time >= ago(1d)

To retrieve data from AWS IoT SiteWise, you can select asset properties from the asset navigation tab, which makes it simple for non-technical users to build dashboards.

Figure 4. Grafana dashboard configuration with AWS IoT SiteWise

A common issue with operational dashboards is that it is hard to relate a cluster of readings back to the physical system they describe. To reflect the conditions of physical assets, the information from sensors must be overlaid on top of the original physical objects. The ImageIt Panel Plugin for Grafana allows you to overcome this issue. You can upload a picture of your house or a system and drag sensor readings to their exact locations, creating a digital representation of the physical object.

Step 4: Control using Alexa

Using the Alexa Skills Kit, you can build voice skills that run on Alexa-enabled devices globally. Alexa and AWS IoT enable you to create an end-to-end voice-controlled experience without using any additional hardware. Instead, your functions run in the cloud only when you invoke Alexa with voice commands.

The easiest way to build a custom Alexa skill is with a Lambda function. You upload the code for your skill to Lambda, and it runs in response to Alexa voice interactions and sends commands to the LOGO! controller.
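
As a rough sketch of how this could look, the following Lambda handler responds to a hypothetical Alexa intent and publishes a command over MQTT using the AWS IoT data plane API. The intent name, MQTT topic, and payload are assumptions for illustration only:

import json
import boto3

iot_data = boto3.client("iot-data")

def lambda_handler(event, context):
    intent = event["request"]["intent"]["name"]
    if intent == "TurnOffHeaterIntent":
        # Publish a command message that the controller program subscribes to
        iot_data.publish(
            topic="home/livingroom/heater/set",
            qos=1,
            payload=json.dumps({"state": "off"}),
        )
        speech = "Okay, turning off the heater."
    else:
        speech = "Sorry, I can't help with that yet."
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": True,
        },
    }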

Conclusion

In this blog, we reviewed how you can create a digital twin of your home automation or industrial systems using Siemens controllers, AWS IoT, and Grafana dashboards. Connecting the LOGO! controller to AWS gives it access to the Internet of Things (IoT) and opens many potential applications such as anomaly detection, predictive maintenance, intrusion detection, and others.

Building well-architected serverless applications: Building in resiliency – part 1

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-building-in-resiliency-part-1/

This series of blog posts uses the AWS Well-Architected Tool with the Serverless Lens to help customers build and operate applications using best practices. In each post, I address the serverless-specific questions identified by the Serverless Lens along with the recommended best practices. See the introduction post for a table of contents and explanation of the example application.

Reliability question REL2: How do you build resiliency into your serverless application?

Evaluate scaling mechanisms for serverless and non-serverless resources to meet customer demand. Build resiliency into your workload so that your serverless application can withstand the partial and intermittent failures across components that may only surface in production.

Required practice: Manage transaction, partial, and intermittent failures

Whenever one service or system calls another, there is a chance that failures can happen. Services or systems often don’t fail as a single unit, but rather suffer partial or transient failures. Applications should be designed to handle component failures as part of the architecture. The system should be designed to detect failure and, ideally, automatically heal itself.

Transaction failures can occur when a component is unavailable or under high load. Partial failures can occur when a percentage of requests succeeds, including during batch processing. Intermittent failures might occur when a request fails for a short period of time due to network or other transient issues.

AWS serverless services, including AWS Lambda, are fault-tolerant and designed to handle failures. If a service invokes a Lambda function and there is a service disruption, Lambda invokes the function in a different Availability Zone.

When you invoke a function directly, you determine the strategy for handling errors. You can retry, send the event to a destination or queue for debugging, or ignore the error. Clients such as the AWS Command Line Interface (CLI) and the AWS SDK retry on client timeouts, throttling errors (429), and other errors that are not caused by a bad request.

When you invoke a function indirectly, you must be aware of the retry behavior of the invoker and any service that the request encounters along the way. For more information, see “Error handling and automatic retries in AWS Lambda”. You can configure Maximum Retry Attempts and Maximum Event Age for asynchronous invocations.
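
For example, a minimal sketch using the AWS SDK for Python sets both limits for asynchronous invocations. The function name and values are placeholders:

import boto3

lambda_client = boto3.client("lambda")

# Retry failed asynchronous invocations once and discard events older than one hour
lambda_client.put_function_event_invoke_config(
    FunctionName="my-function",
    MaximumRetryAttempts=1,
    MaximumEventAgeInSeconds=3600,
)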

When reading from Amazon Kinesis Data Streams and Amazon DynamoDB Streams, Lambda retries the entire batch of items. Retries continue until the records expire or exceed the maximum age that you configure on the event source mapping. You can also configure the event source mapping to split a failed batch into two batches. Retrying with smaller batches isolates bad records and works around timeout issues.
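
As a sketch, you might configure these settings on an existing event source mapping like this. The UUID and values are placeholders:

import boto3

lambda_client = boto3.client("lambda")

# Split failed batches, cap retries, and drop records older than one hour
lambda_client.update_event_source_mapping(
    UUID="event-source-mapping-uuid",
    BisectBatchOnFunctionError=True,
    MaximumRetryAttempts=2,
    MaximumRecordAgeInSeconds=3600,
)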

Partial failures can occur in non-atomic operations. PutRecords for Kinesis and BatchWriteItem for DynamoDB return a successful response if at least one record is ingested successfully. Always inspect the response when using such operations and programmatically deal with partial failures.
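
For example, a sketch of handling partial failures for Kinesis PutRecords resends only the records that were rejected. The stream name and retry policy are illustrative; in practice, combine this with the backoff and jitter described in the next section:

import boto3

kinesis = boto3.client("kinesis")

def put_records_with_retry(stream_name, records, max_attempts=3):
    """Resend only the records that failed in previous attempts."""
    for attempt in range(max_attempts):
        response = kinesis.put_records(StreamName=stream_name, Records=records)
        if response["FailedRecordCount"] == 0:
            return
        # Failed entries carry an ErrorCode; keep them for the next attempt
        records = [
            record
            for record, result in zip(records, response["Records"])
            if "ErrorCode" in result
        ]
    raise RuntimeError(f"{len(records)} records still failing after {max_attempts} attempts")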

Use exponential backoff with jitter

The simplest technique for dealing with failures in a networked environment is to retry calls until they succeed. This technique increases the reliability of the application and reduces operational costs for the developer.

However, it is not always safe to retry. A retry can further increase the load on the system being called if the system is already failing due to an overload. To avoid this problem, use backoff. Instead of retrying immediately and aggressively, the client waits some amount of time between tries. The most common pattern is an exponential backoff, which uses exponentially longer wait times between retries. This is typically capped to a maximum delay and number of retries.

If all backoff retries are still happening at the same time, this can still overload a system or cause contention. To avoid this problem, use jitter. Jitter adds some amount of randomness to the backoff to spread the retries around in time. This can help prevent large bursts by spreading out the rate when clients connect. For more information see the Amazon Builders’ Library article “Timeouts, retries, and backoff with jitter” and AWS Architecture blog post “Exponential Backoff And Jitter”.
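
A minimal sketch of a retry helper with capped exponential backoff and full jitter might look like this. In practice, only retry errors that are safe to retry, such as throttling or transient network failures:

import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Call operation(), retrying with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random duration between 0 and the exponential cap
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))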

Exponential backoff and jitter

When your application responds to callers in fail-fast scenarios and when performance is degraded, inform the caller via headers or metadata when they can retry.

Each AWS SDK implements automatic retry logic including exponential backoff. For downstream calls, you can adjust AWS and third-party SDK retries, backoffs, TCP, and HTTP timeouts. This helps you decide when to stop retrying. For more information, see the documentation and troubleshooting steps for Lambda and the AWS SDK.
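
For example, with the AWS SDK for Python you can adjust the retry mode, number of attempts, and timeouts per client. The values shown are illustrative:

import boto3
from botocore.config import Config

# Standard retry mode with fewer attempts and tighter timeouts for downstream calls
config = Config(
    retries={"max_attempts": 4, "mode": "standard"},
    connect_timeout=2,
    read_timeout=5,
)

dynamodb = boto3.client("dynamodb", config=config)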

Use a dead-letter queue mechanism to retain, investigate and retry failed transactions

There are a number of ways to handle message failures, including destinations and dead-letter queues.

You can configure Lambda to send records of asynchronous invocations to another destination service. These include Amazon Simple Queue Service (SQS), Amazon Simple Notification Service (SNS), Lambda, and Amazon EventBridge. You can configure separate destinations for events that fail processing and events that are successfully processed. The invocation record contains details about the event, the response, and the reason that the record was sent.

The following example shows a function that sends a record of a successful invocation to an EventBridge event bus. When an event fails all processing attempts, Lambda sends an invocation record to an SQS queue. It includes the function’s response in the invocation record.

AWS Lambda destinations for asynchronous invocation

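One way to configure such destinations with the AWS SDK for Python might look like the following sketch. The function name and ARNs are placeholders, not the airline application’s resources:

import boto3

lambda_client = boto3.client("lambda")

# Send successful events to an EventBridge event bus and failed events to an SQS queue
lambda_client.put_function_event_invoke_config(
    FunctionName="process-booking",
    DestinationConfig={
        "OnSuccess": {
            "Destination": "arn:aws:events:us-east-1:123456789012:event-bus/bookings"
        },
        "OnFailure": {
            "Destination": "arn:aws:sqs:us-east-1:123456789012:bookings-failures"
        },
    },
)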

SNS, SQS, Lambda, and EventBridge support dead-letter queues (DLQs). DLQs make your applications more resilient and durable by storing messages or events that can’t be processed correctly in a dedicated SQS queue. This helps you debug your application by isolating the problematic messages to determine why their processing failed. Once you have resolved the issue, re-process the failed message. For more information, see “When should I use a dead-letter queue?” There is an example serverless application to redrive the messages from an SQS DLQ back to its source SQS queue.
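
For SQS event sources, a redrive policy moves messages to the DLQ after a set number of failed receives. A sketch with the AWS SDK for Python follows; the queue URL, ARN, and receive count are placeholders:

import json
import boto3

sqs = boto3.client("sqs")

# After five failed receives, SQS moves the message to the dead-letter queue
sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/orders",
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:orders-dlq",
            "maxReceiveCount": "5",
        })
    },
)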

For Lambda, DLQs provide an alternative to an on-failure destination. Using Lambda destinations is preferable for asynchronous invocations.

Good practice: Orchestrate long-running transactions

Long-running transactions can be processed by one or multiple components. Consider implementing the saga pattern using state machines for these types of transactions.

The saga pattern coordinates transactions between multiple microservices as part of a state machine. Each service that performs a transaction publishes an event to trigger the next transaction in the saga. This continues until the transaction chain is complete. If a transaction fails, the saga orchestrates a series of compensating transactions that undo the changes made by the preceding transactions.

This is preferable to handling complex or long-running transactions within application code. State machines prevent cascading failures and avoid tightly coupling components with orchestration and business logic.

Use a state machine to visualize distributed transactions, and to separate business logic from orchestration logic.

AWS Step Functions lets you coordinate multiple AWS services into serverless workflows via state machines. Within Step Functions, you can set separate retries, backoff rates, max attempts, intervals, and timeouts. These are set for every step of your state machine using a declarative language.

In the serverless airline example used in this series, Step Functions is used to orchestrate the Booking microservice. The ProcessBooking state machine handles all the necessary steps to create bookings, including payment.

Booking service Step Functions state machine

The state machine uses a combination of service integrations using DynamoDB, SQS, and Lambda functions to coordinate transactions and handle failures.

For example, the Reserve Booking task invokes a Lambda function. The task has retry and error handling configured as part of the task definition.

"Reserve Booking": {
	"Type": "Task",
	"Resource": "${ReserveBooking.Arn}",
	"TimeoutSeconds": 5,
	"Retry": [
		{
			"ErrorEquals": [
				"BookingReservationException"
			],
			"IntervalSeconds": 1,
			"BackoffRate": 2,
			"MaxAttempts": 2
		}
	],
	"Catch": [
		{
			"ErrorEquals": [
				"States.ALL"
			],
			"ResultPath": "$.bookingError",
			"Next": "Cancel Booking"
		}
	],
	"ResultPath": "$.bookingId",
	"Next": "Collect Payment"
},

Step Functions supports direct service integrations, including DynamoDB. The Reserve Flight task directly updates the flightTable without requiring a Lambda function.

"Reserve Flight": {
	"Type": "Task",
	"Resource": "arn:aws:states:::dynamodb:updateItem",
	"Parameters": {
		"TableName.$": "$.flightTable",
		"Key": {
			"id": {
				"S.$": "$.outboundFlightId"
			}
		},
		"UpdateExpression": "SET seatCapacity = seatCapacity - :dec",
		"ExpressionAttributeValues": {
			":dec": {
				"N": "1"
			},
			":noSeat": {
				"N": "0"
			}
		},
		"ConditionExpression": "seatCapacity > :noSeat"
	},

By default, when a state reports an error, Step Functions causes the execution to fail entirely.

Utilize dead-letter queues in response to failed state machine executions

Any state within the Step Functions workflow can encounter runtime errors. These include state machine definition issues, task failures such as Lambda function exceptions, or transient issues such as network connectivity issues. For more information, see “Error handling in Step Functions”.

Use the Step Functions service integration with SQS to send failed transactions to a DLQ as the final step. This adds a higher level of durability within your state machines.

For example, the airline Notify Failed Booking final task catches failed states from four previous steps. It sends the results to the Booking DLQ.

Booking service Step Functions DLQ

The message includes the output of the previous failed states for further troubleshooting.

"Booking DLQ": {
	"Type": "Task",
	"Resource": "arn:aws:states:::sqs:sendMessage",
	"Parameters": {
		"QueueUrl": "${BookingsDLQ}",
		"MessageBody.$": "$"
	},
	"ResultPath": "$.deadLetterQueue",
	"Next": "Booking Failed"
},

The Step Functions documentation has more information on calling SQS.

Conclusion

Build resiliency into your workloads. This makes sure that your application can withstand partial and intermittent failures across components that may only surface in production.

In this post, I cover managing failures using retries, exponential backoff, and jitter. I explain how DLQs can isolate failed messages. I show how to use state machines to orchestrate long-running transactions rather than handling these in application code.

This well-architected question continues in part 2 where I look at managing duplicate and unwanted events with idempotency and an event schema. I cover how to consider scaling patterns at burst rates by managing account limits and show relevant metrics to evaluate.

For more serverless learning resources, visit Serverless Land.

Expiring Amazon S3 Objects Based on Last Accessed Date to Decrease Costs

Post Syndicated from Hareesh Singireddy original https://aws.amazon.com/blogs/architecture/expiring-amazon-s3-objects-based-on-last-accessed-date-to-decrease-costs/

Organizations are using Amazon Simple Storage Service (S3) for building their data lakes, websites, mobile applications, and enterprise applications. As the number of objects within your S3 bucket increases, you may want to move older objects into lower-cost tiers of Amazon S3. In some cases you may want to delete the objects altogether to further reduce S3 storage costs. A common practice is to use S3 Lifecycle rules to achieve this. These rules can be applied to objects based on their creation date. In certain situations, you may want to keep objects available that are still being accessed, but transition or delete objects that are no longer in use.

In this post, we will demonstrate how you can create custom object expiry rules for Amazon S3 based on the last accessed date of the object. We will first walk through the various features used within the workflow, followed by an architecture diagram outlining the process flow.

Amazon S3 server access logging

S3 Server access logging provides detailed records of the requests that are made to objects in Amazon S3 buckets. Amazon S3 periodically collects access log records, consolidates the records in log files, and then uploads log files to your target bucket as log objects. Each log record consists of information such as bucket name, the operation in the request, and the time at which the request was received. S3 Server Access Log Format provides more details about the format of the log file.

Amazon S3 inventory

Amazon S3 inventory provides a list of your objects and the corresponding metadata on a daily or weekly basis, for an S3 bucket or a shared prefix. The inventory lists are stored as a comma-separated value (CSV) file compressed with GZIP, as an Apache optimized row columnar (ORC) file compressed with ZLIB, or as an Apache Parquet file compressed with Snappy.

Amazon S3 Lifecycle

Amazon S3 Lifecycle policies help you manage your objects through two types of actions: Transition and Expiration. In the architecture shown following in Figure 1, we create an S3 Lifecycle configuration rule that expires objects after ‘x’ days. It has a filter for an object tag of “delete=True”. You can configure the value of ‘x’ based on your requirements.
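
As a sketch, the rule could be created with the AWS SDK for Python like this. The bucket name and the 30-day expiry are examples; substitute your own value of ‘x’:

import boto3

s3 = boto3.client("s3")

# Expire objects tagged delete=True a fixed number of days after creation
s3.put_bucket_lifecycle_configuration(
    Bucket="my-source-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-unused-objects",
                "Filter": {"Tag": {"Key": "delete", "Value": "True"}},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)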

If you are using an S3 bucket to store short lived objects with unknown access patterns, you might want to keep the objects that are still being accessed, but delete the rest. This will let you retain objects in your S3 bucket even after their expiry date as per the S3 lifecycle rules, while saving you costs by deleting objects that are not needed anymore. The following diagram shows an architecture that considers the last accessed date of the object before deleting S3 objects.

Figure 1. Object expiry architecture flow

This architecture uses native S3 features mentioned earlier in combination with other AWS services to achieve the desired outcome.

Here is the architecture flow:

  1. The S3 server access logs capture S3 object requests. These are generated and stored in the target S3 bucket.
  2. An S3 inventory report is generated for the source bucket daily. It is written to the S3 inventory target bucket.
  3. An Amazon EventBridge rule is configured that will initiate an AWS Lambda function once a day, or as desired.
  4. The Lambda function initiates an S3 Batch Operations job to tag the objects in the source bucket that must be expired, using the following logic (see the sketch after this list):
    • Capture the number of days (x) configuration from the S3 Lifecycle configuration.
    • Run an Amazon Athena query that will get the list of objects from the S3 inventory report and server access logs. Create a delta list with objects that were created earlier than ‘x’ days, but not accessed during that time.
    • Write a manifest file with the list of these objects to an S3 bucket.
    • Create an S3 Batch Operations job that will tag all objects in the manifest file with a tag of “delete=True”.
  5. The Lifecycle rule on the source S3 bucket expires all objects that were created more than ‘x’ days ago and that carry the “delete=True” tag applied by the S3 Batch Operations job.
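
The tagging step in item 4 could be implemented with an S3 Batch Operations job similar to the following sketch. The account ID, role ARN, bucket names, and manifest details are placeholders:

import boto3

s3control = boto3.client("s3control")

# Tag every object listed in the manifest so the Lifecycle rule can expire it
s3control.create_job(
    AccountId="123456789012",
    ConfirmationRequired=False,
    Priority=10,
    RoleArn="arn:aws:iam::123456789012:role/s3-batch-tagging-role",
    Operation={
        "S3PutObjectTagging": {"TagSet": [{"Key": "delete", "Value": "True"}]}
    },
    Manifest={
        "Spec": {
            "Format": "S3BatchOperations_CSV_20180820",
            "Fields": ["Bucket", "Key"],
        },
        "Location": {
            "ObjectArn": "arn:aws:s3:::my-manifest-bucket/manifests/expiry-candidates.csv",
            "ETag": "manifest-object-etag",
        },
    },
    Report={
        "Bucket": "arn:aws:s3:::my-manifest-bucket",
        "Enabled": True,
        "Format": "Report_CSV_20180820",
        "Prefix": "batch-reports",
        "ReportScope": "FailedTasksOnly",
    },
)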

The preceding architecture is built for fault tolerance. If a particular run fails, all the objects that must be expired will be picked up during the next run. You can configure error handling and automatic retries in your Lambda function. An Amazon Simple Notification Service (SNS) topic will send out a notification in the event of a failure.

Cost considerations

S3 server access logs, S3 inventory lists, and manifest files can accumulate many objects over time. We recommend you configure an S3 Lifecycle policy on the target bucket to periodically delete older objects. Although following the guidelines in this post can decrease some of your costs, S3 requests, S3 inventory, S3 Object Tagging, and Lifecycle transitions also have costs associated with them. Additional details can be found on the S3 pricing page.

Amazon Athena charges you based on the amount of data scanned by each query. However, Amazon S3 inventory can also output files in Apache ORC or Apache Parquet format, which can reduce the amount of data scanned by Athena. Review the Athena pricing page for details.

AWS Lambda has a free usage tier of 1M free requests per month and 400,000 GB-seconds of compute time per month. However, you are charged based on the number of requests, the amount of memory allocated, and the runtime duration of the function. See more at the Lambda pricing page.

Conclusion

In this blog post, we showed how you can create a custom process to delete objects from your S3 bucket based on the last time the object was accessed. You can use this architecture to customize your object transitions, clean up your S3 buckets for any unnecessary objects, and keep your S3 buckets cost-effective. This architecture can also be used on versioned S3 buckets with some minor modifications.

We hope you found this blog post useful and welcome your feedback!

Read more about queries, rules, and tags:

Insights for CTOs: Part 1 – Building and Operating Cloud Applications

Post Syndicated from Syed Jaffry original https://aws.amazon.com/blogs/architecture/insights-for-ctos-part-1-building-and-operating-cloud-applications/

This 6-part series shares insights gained from various CTOs during their cloud adoption journeys at their respective organizations. This post takes those learnings and summarizes architecture best practices to help you build and operate applications successfully in the cloud. This series will also cover topics on cloud financial management, security, modern data and artificial intelligence (AI), cloud operating models, and strategies for cloud migration.

Optimize cost and performance with AWS services

Your technology costs are likely being evaluated constantly against their return on investment (ROI). In the cloud, you “pay as you go.” This means your technology spend is an operational cost rather than a capital expenditure, as discussed in Part 3: Cloud economics – OPEX vs. CAPEX.

So, how do you maximize the ROI of your cloud spend? The following sections outline hosting options to help you choose the model that best suits your needs.

Evaluating your hosting options

EC2 instances and the lift and shift strategy

Using cloud native dynamic provisioning/de-provisioning of Amazon Elastic Compute Cloud (Amazon EC2) instances will help you meet business needs more accurately and optimize compute costs. EC2 instances allow you to use the “lift and shift” migration strategy for your applications. This helps you avoid overhead costs you may incur from upfront capacity planning.

Figure 1. Comparing on-premises vs. cloud infrastructure provisioning

Containerized hosting (with EC2 hosts)

Engineering teams already skilled in containerized hosting have saved additional costs by using Amazon Elastic Kubernetes Service (Amazon EKS) or Amazon Elastic Container Service (Amazon ECS). This is because your unit of deployment is a container instead of an entire instance, and Amazon EKS or Amazon ECS can pack multiple containers into a single instance. Application change management is also less risky because you can leverage Amazon EKS or Amazon ECS built-in orchestration to manage non-disruptive deployments.

Serverless architecture

Use AWS Lambda and AWS Fargate to scale to match unpredictable usage. We have seen AWS software as a service (SaaS) customers build better measures of “cost per user” of an application into their metering systems using serverless. This is because instead of paying for server uptime, you only pay for runtime usage (down to millisecond increments for Lambda) when you run your application.

Further considerations for choosing the right hosting platform

The following table provides considerations for implementing the most cost-effective model for some use cases you may encounter when building your architecture:

Table 1

Building a cloud operating model and managing risk

Building an effective people, governance, and platform capability is summarized in the following sections and discussed in detail in Part 5: Organizing teams to enable effective build/run/manage.

People

If your team only builds applications on virtual machines, asking them to move to the cloud serverless model without sufficiently training them could go poorly. We suggest starting small. Select a handful of applications that have lower risk yet meaningful business value and allow your team to build their cloud “muscles.”

Governance

If your teams don’t have the “muscle memory” to make cloud architecture decisions, build a Cloud Center of Excellence (CCOE) to enforce a consistent approach to building in the cloud. Without this team, managing cost, security, and reliability will be harder. Ask the CCOE team to regularly review the application architecture suitability (cost, performance, resiliency) against changing business conditions. This will help you incrementally evolve architecture as appropriate.

Platform

In a typical on-premises environment, changes are deployed “in-place.” This requires a slow and “involve everyone” approach. Deploying in the cloud replaces the in-place approach with blue/green deployments, as shown in Figure 2.

With this strategy, new application versions can be deployed on new machines (green) running side by side with the old machines (blue). Once the new version is validated, switch traffic to the new (green) machines and they become production. This model reduces risk and increases velocity of change.

Figure 2. AWS blue/green deployment model

Securing your application and infrastructure

Security controls in the cloud are defined and enforced in software, which brings risks and opportunities. If not managed with a robust change management process, software-defined firewall misconfiguration can create unexpected threat vectors.

To avoid this, use cloud native patterns like “infrastructure as code” that express all infrastructure provisioning and configuration as declarative templates (JSON or YAML files). Then apply the same “Git pull request” process to infrastructure change management as you do for your applications to enforce strong governance. Use tools like AWS CloudFormation or AWS Cloud Development Kit (AWS CDK) to implement infrastructure templates into your cloud environment.
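
For example, a minimal AWS CDK (Python) stack can bake security controls directly into the template; the construct names here are illustrative. Changes to this stack then flow through the same pull request review and pipeline as application code:

from aws_cdk import App, RemovalPolicy, Stack
from aws_cdk import aws_s3 as s3
from constructs import Construct

class DataBucketStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Encryption and public access blocking are declared in code, so every
        # environment deployed from this template gets the same controls
        s3.Bucket(
            self,
            "DataBucket",
            encryption=s3.BucketEncryption.S3_MANAGED,
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
            removal_policy=RemovalPolicy.RETAIN,
        )

app = App()
DataBucketStack(app, "DataBucketStack")
app.synth()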

Apply a layered security model (“defense in depth”) to your application stack, as shown in Figure 3, to protect against distributed denial of service (DDoS) and application layer attacks. Part 2: Protecting AWS account, data, and applications provides a detailed discussion on security.

Figure 3. Defense in depth

Data stores

How many is too many?

In on-premises environments, it is typically difficult to provision a separate database per microservice. As a result, the application or microservice isolation stops at the compute layer, and the database becomes the key shared dependency that slows down change.

The cloud provides API instantiable, fully managed databases like Amazon Relational Database Service (Amazon RDS) (SQL), Amazon DynamoDB (NoSQL), and others. This allows you to isolate your application end to end and create a more resilient architecture. For example, in a cell-based architecture where users are placed into self-contained, isolated application stack “cells,” the “blast radius” of an impact, such as application downtime or user experience degradation, is limited to each cell.

Database engines

Relational databases are typically the default starting point for many organizations. While relational databases offer speed and flexibility to bootstrap a new application, they bring complexity when you need to horizontally scale.

Your application needs will determine whether you use a relational or non-relational database. In the cloud, API instantiated, fully managed databases give you options to closely match your application’s use case. For example, in-memory databases like Amazon ElastiCache reduce latency for website content and key-value databases like DynamoDB provide a horizontally scalable backend for building an ecommerce shopping cart.

Summary

We acknowledge that CTO responsibilities can differ among organizations; however, this blog discusses common key considerations when building and operating an application in the cloud.

Choosing the right application hosting platform depends on your application’s use case and can impact the operational cost of your application in the cloud. Consider the people, governance, and platform aspects carefully because they will influence the success or failure of your cloud adoption. Use lower risk application deployment patterns in the cloud. Managed data stores in the cloud open your choice for data stores beyond relational. In the next post of this series, Part 2: Protecting AWS account, data, and applications, we will explore best practices and principles to apply when thinking about security in the cloud.

Related information

  • Part 2: Protecting AWS account, data and applications
  • Part 3: Cloud economics – OPEX vs CAPEX
  • Part 4: Building a modern data platform for AI
  • Part 5: Organizing teams to enable effective build/run/manage
  • Part 6: Strategies and lessons on migrating workloads to the cloud