Tag Archives: Amazon DynamoDB

Building well-architected serverless applications: Optimizing application performance – part 2

2021-08-24 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-optimizing-application-performance-part-2/

This series of blog posts uses the AWS Well-Architected Tool with the Serverless Lens to help customers build and operate applications using best practices. In each post, I address the serverless-specific questions identified by the Serverless Lens along with the recommended best practices. See the introduction post for a table of contents and explanation of the example application.

PERF 1. Optimizing your serverless application’s performance

This post continues part 1 of this performance question. Previously, I covered measuring and optimizing function startup time. I explained cold and warm starts and how to reuse the Lambda execution environment to improve performance. I showed a number of ways to analyze and optimize the initialization startup time. I explained how importing only the necessary libraries and dependencies increases application performance.

Good practice: Design your function to take advantage of concurrency via asynchronous and stream-based invocations

AWS Lambda functions can be invoked synchronously and asynchronously.

Favor asynchronous over synchronous request-response processing.

Consider using asynchronous event processing rather than synchronous request-response processing. You can use asynchronous processing to aggregate queues, streams, or events for more efficient processing time per invocation. This reduces wait times and latency from requesting apps and functions.

When you invoke a Lambda function with a synchronous invocation, you wait for the function to process the event and return a response.

Synchronous invocation

As synchronous processing involves a request-response pattern, the client caller also needs to wait for a response from a downstream service. If the downstream service then needs to call another service, you end up chaining calls that can impact service reliability, in addition to response times. For example, this POST /order request must wait for the response to the POST /invoice request before responding to the client caller.

Example synchronous processing

The more services you integrate, the longer the response time, and you can no longer sustain complex workflows using synchronous transactions.

Asynchronous processing allows you to decouple the request-response using events without waiting for a response from the function code. This allows you to perform background processing without requiring the client to wait for a response, improving client performance. You pass the event to an internal Lambda queue for processing and Lambda handles the rest. An external process, separate from the function, manages polling and retries. Using this asynchronous approach can also make it easier to handle unpredictable traffic with significant volumes.

Asynchronous invocation

For example, the client makes a POST /order request to the order service. The order service accepts the request and returns that it has been received, without waiting for the invoice service. The order service then makes an asynchronous POST /invoice request to the invoice service, which can then process independently of the order service. If the client must receive data from the invoice service, it can handle this separately via a GET /invoice request.

Example asynchronous processing
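
As a minimal sketch of the order and invoice flow above, the order service can hand the invoice request to Lambda with an asynchronous invocation and acknowledge the client straight away. The function name `invoice-service`, the `orderId` field, and the helper names are hypothetical; `lambda_client` is assumed to be a boto3 Lambda client, for example `boto3.client("lambda")`.

```python
import json
from typing import Any, Dict


def submit_invoice_async(lambda_client: Any, function_name: str, order: Dict[str, Any]) -> bool:
    """Queue an invoice request without waiting for the invoice function to run."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="Event",  # asynchronous: Lambda queues the event and returns
        Payload=json.dumps(order).encode("utf-8"),
    )
    # Lambda answers 202 (Accepted) once the event has been queued.
    return response["StatusCode"] == 202


def handle_order(
    lambda_client: Any,
    order: Dict[str, Any],
    invoice_function: str = "invoice-service",  # hypothetical function name
) -> Dict[str, Any]:
    """Accept the order and respond to the client before the invoice is processed."""
    queued = submit_invoice_async(lambda_client, invoice_function, order)
    return {
        "statusCode": 202 if queued else 500,
        "body": json.dumps({"orderId": order["orderId"]}),
    }
```

The client receives a 202 response as soon as the event is queued. Any invoice data it needs later is fetched separately, as described above.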

You can configure Lambda to send records of asynchronous invocations to another destination service. This helps you to troubleshoot your invocations. You can also send messages or events that can’t be processed correctly into a dedicated Amazon Simple Queue Service (SQS) dead-letter queue for investigation.

You can add triggers to a function to process data automatically. For more information on which processing model Lambda uses for triggers, see “Using AWS Lambda with other services”.

Asynchronous workflows handle a variety of use cases including data ingestion, ETL operations, and order/request fulfillment. In these use cases, data is processed as it arrives and is retrieved as it changes. For example asynchronous patterns, see “Serverless Data Processing” and “Serverless Event Submission with Status Updates”.

For more information on Lambda synchronous and asynchronous invocations, see the AWS re:Invent presentation “Optimizing your serverless applications”.

Tune batch size, batch window, and compress payloads for high throughput

When using Lambda to process records using Amazon Kinesis Data Streams or SQS, there are a number of tuning parameters to consider for performance.

You can configure a batch window to buffer messages or records for up to 5 minutes. You can limit the maximum number of records Lambda processes per invocation by setting a batch size. Lambda invokes your function when the batch window expires or the batch size is reached, whichever comes first.
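
The interaction between batch size and batch window can be illustrated with a small simulation. This is a simplified model, not Lambda's actual poller: it assumes the window starts when the first record of a batch is buffered, and it ignores other flush conditions such as payload size. The function name and the sample arrival times are illustrative only.

```python
from typing import List


def split_into_batches(
    arrival_times: List[float], batch_size: int, batch_window: float
) -> List[List[float]]:
    """Group record arrival times (in seconds) into batches.

    A batch is flushed when it is full or when the window, measured from
    the first buffered record, has elapsed, whichever comes first.
    """
    batches: List[List[float]] = []
    current: List[float] = []
    for arrival in sorted(arrival_times):
        # Window expired before this record arrived: flush what is buffered.
        if current and arrival - current[0] >= batch_window:
            batches.append(current)
            current = []
        current.append(arrival)
        # Batch size reached: flush immediately.
        if len(current) == batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches
```

With a batch size of 3 and a 5-second window, arrivals at 0, 1, 2, 3, 10, and 11 seconds produce three batches: one flushed because it is full, one flushed by the window, and a final partial batch.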

For high volume SQS standard queue throughput, Lambda can process up to 1000 concurrent batches of records per second. For more information, see “Using AWS Lambda with Amazon SQS”.

For high volume Kinesis Data Streams throughput, there are a number of options. Configure the ParallelizationFactor setting to process one shard of a Kinesis Data Stream with more than one Lambda invocation simultaneously. Lambda can process up to 10 batches in each shard. For more information, see “New AWS Lambda scaling controls for Kinesis and DynamoDB event sources.” You can also add more shards to your data stream to increase the speed at which your function can process records. This increases the function concurrency at the expense of ordering per shard. For more details on using Kinesis and Lambda, see “Monitoring and troubleshooting serverless data analytics applications”.
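
As a rough sizing aid, the upper bound on concurrent batches for a stream is the number of shards multiplied by the ParallelizationFactor. The helper below is a sketch of that arithmetic; the function name is hypothetical, and the range check reflects the limit of 10 mentioned above.

```python
def max_concurrent_batches(shard_count: int, parallelization_factor: int = 1) -> int:
    """Upper bound on batches processed at once for a Kinesis event source."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("parallelization_factor must be between 1 and 10")
    return shard_count * parallelization_factor
```

For example, a stream with 4 shards and a factor of 10 allows up to 40 batches to be processed concurrently, compared with 4 at the default factor of 1.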

Kinesis enhanced fan-out can maximize throughput by dedicating 2 MB/second of read throughput per shard to each consumer, instead of sharing 2 MB/second per shard across all consumers. For more information, see “Increasing stream processing performance with Enhanced Fan-Out and Lambda”.

Kinesis stream producers can also compress records. This is at the expense of additional CPU cycles for decompressing the records in your Lambda function code.
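
A minimal sketch of this trade-off, assuming JSON records compressed with gzip: the producer compresses each record before writing it to the stream, and the function pays the CPU cost of decompressing it. The function names are illustrative. The `event["Records"][n]["kinesis"]["data"]` path is where Lambda delivers the base64-encoded record data for Kinesis events.

```python
import base64
import gzip
import json
from typing import Any, Dict, List


def compress_record(payload: Dict[str, Any]) -> bytes:
    """Producer side: serialize and gzip a record before putting it on the stream."""
    return gzip.compress(json.dumps(payload).encode("utf-8"))


def handler(event: Dict[str, Any], context: Any) -> List[Dict[str, Any]]:
    """Consumer side: base64-decode each Kinesis record, then gunzip and parse it."""
    decoded = []
    for record in event["Records"]:
        raw = base64.b64decode(record["kinesis"]["data"])
        decoded.append(json.loads(gzip.decompress(raw)))
    return decoded
```

Compression tends to pay off for larger, repetitive payloads. For very small records the gzip header overhead can outweigh the savings, so it is worth measuring with representative data.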

Required practice: Measure, evaluate, and select optimal capacity units

Capacity units are a unit of consumption for a service. They can include function memory size, number of stream shards, number of database reads/writes, request units, or type of API endpoint. Measure, evaluate and select capacity units to enable optimal configuration of performance, throughput, and cost.

Identify and implement optimal capacity units.

For Lambda functions, memory is the capacity unit for controlling the performance of a function. You can configure the amount of memory allocated to a Lambda function, between 128 MB and 10,240 MB. The amount of memory also determines the amount of virtual CPU available to a function. Adding more memory proportionally increases the amount of CPU, increasing the overall computational power available. If a function is CPU-, network- or memory-bound, then changing the memory setting can dramatically improve its performance.

Choosing the memory allocated to Lambda functions is an optimization process that balances performance (duration) and cost. You can manually run tests on functions by selecting different memory allocations and measuring the time taken to complete. Alternatively, use the AWS Lambda Power Tuning tool to automate the process.

The tool allows you to systematically test different memory size configurations. Depending on your performance strategy (cost, performance, or balanced), it identifies the optimal memory size to use. For more information, see “Operating Lambda: Performance optimization – Part 2”.

AWS Lambda Power Tuning report
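
The balance between duration and cost can be sketched with simple arithmetic: cost per invocation is proportional to memory multiplied by duration. The price per GB-second below is an assumption for illustration, so check current Lambda pricing for your Region and architecture. The measurements in the usage note are invented, not output from the Power Tuning tool.

```python
from typing import Dict

# Assumed price per GB-second; verify against current Lambda pricing.
PRICE_PER_GB_SECOND = 0.0000166667


def invocation_cost(memory_mb: int, duration_ms: float, price: float = PRICE_PER_GB_SECOND) -> float:
    """Compute cost of one invocation, excluding the per-request charge."""
    return (memory_mb / 1024) * (duration_ms / 1000) * price


def cheapest_memory(measurements: Dict[int, float]) -> int:
    """Return the memory size (MB) with the lowest cost, given measured durations (ms)."""
    return min(measurements, key=lambda mb: invocation_cost(mb, measurements[mb]))


def fastest_memory(measurements: Dict[int, float]) -> int:
    """Return the memory size (MB) with the shortest measured duration."""
    return min(measurements, key=lambda mb: measurements[mb])
```

With hypothetical measurements of `{128: 2000, 512: 400, 1024: 250}` (MB to ms), 512 MB is the cheapest setting while 1,024 MB is the fastest, which is the kind of trade-off a cost, performance, or balanced strategy has to resolve.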

Amazon DynamoDB manages table processing throughput using read and write capacity units. There are two different capacity modes, on-demand and provisioned.

On-demand capacity mode supports up to 40K read/write request units per second. This is recommended for unpredictable application traffic and new tables with unknown workloads. For higher and predictable throughputs, provisioned capacity mode along with DynamoDB auto scaling is recommended. For more information, see “Read/Write Capacity Mode”.
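
When sizing provisioned capacity, a back-of-the-envelope estimate helps. The sketch below assumes the standard DynamoDB definitions: one read capacity unit covers one strongly consistent read per second of an item up to 4 KB (eventually consistent reads cost half), and one write capacity unit covers one write per second of an item up to 1 KB. The function names are illustrative, and transactional requests are not modeled.

```python
import math


def read_capacity_units(item_size_bytes: int, reads_per_second: int, strongly_consistent: bool = True) -> int:
    """Estimate RCUs: item size is rounded up to the next 4 KB per read."""
    units_per_read = math.ceil(item_size_bytes / 4096)
    total = units_per_read * reads_per_second
    # Eventually consistent reads consume half the capacity.
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_size_bytes: int, writes_per_second: int) -> int:
    """Estimate WCUs: item size is rounded up to the next 1 KB per write."""
    return math.ceil(item_size_bytes / 1024) * writes_per_second
```

For example, 100 strongly consistent reads per second of 6,000-byte items need about 200 RCUs, or about 100 RCUs if eventual consistency is acceptable.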

For high throughput Amazon Kinesis Data Streams with multiple consumers, consider using enhanced fan-out for dedicated 2 MB/second throughput per consumer. When possible, use Kinesis Producer Library and Kinesis Client Library for effective record aggregation and de-aggregation.

Amazon API Gateway supports multiple endpoint types. Edge-optimized APIs provide a fully managed Amazon CloudFront distribution. These are better for geographically distributed clients. API requests are routed to the nearest CloudFront Point of Presence (POP), which typically improves connection time.

Edge-optimized API Gateway deployment

Regional API endpoints are intended for clients in the same Region. This helps you to reduce request latency and allows you to add your own content delivery network if necessary.

Regional endpoint API Gateway deployment

Private API endpoints are API endpoints that can only be accessed from your Amazon Virtual Private Cloud (VPC) using an interface VPC endpoint. For more information, see “Creating a private API in Amazon API Gateway”.

For more information on endpoint types, see “Choose an endpoint type to set up for an API Gateway API”. For more general information on API Gateway, see the AWS re:Invent presentation “I didn’t know Amazon API Gateway could do that”.

AWS Step Functions has two workflow types, standard and express. Standard Workflows have exactly once workflow execution and can run for up to one year. Express Workflows have at-least-once workflow execution and can run for up to five minutes. Consider the per-second rates you require for both execution start rate and the state transition rate. For more information, see “Standard vs. Express Workflows”.

Performance load testing is recommended at both sustained and burst rates to evaluate the effect of tuning capacity units. Use Amazon CloudWatch service dashboards to analyze key performance metrics including load testing results. I cover performance testing in more detail in “Regulating inbound request rates – part 1”.

For general serverless optimization information, see the AWS re:Invent presentation “Serverless at scale: Design patterns and optimizations”.

Conclusion

Evaluate and optimize your serverless application’s performance based on access patterns, scaling mechanisms, and native integrations. You can improve your overall experience and make more efficient use of the platform in terms of both value and resources.

This post continues from part 1 and looks at designing your function to take advantage of concurrency via asynchronous and stream-based invocations. I cover measuring, evaluating, and selecting optimal capacity units.

This well-architected question will continue in part 3 where I look at integrating with managed services directly over functions when possible. I cover optimizing access patterns and applying caching where applicable.

For more serverless learning resources, visit Serverless Land.

Convert and Watermark Documents Automatically with Amazon S3 Object Lambda

2021-08-24 Joseph Simon

Post Syndicated from Joseph Simon original https://aws.amazon.com/blogs/architecture/convert-and-watermark-documents-automatically-with-amazon-s3-object-lambda/

When you provide access to a sensitive document to someone outside of your organization, you likely need to ensure that the document is read-only. In this case, your document should be associated with a specific user in case it is shared.

For example, authors often embed user-specific watermarks into their ebooks. This way, if their ebook gets posted to a file-sharing site, they can prevent the purchaser from downloading copies of the ebook in the future.

In this blog post, we provide you a cost-efficient, scalable, and secure solution to efficiently generate user-specific versions of sensitive documents. This solution helps users track who their documents are shared with. This helps prevent fraud and ensure that private information isn’t leaked. Our solution uses a RESTful API, which uses Amazon S3 Object Lambda to convert documents to PDF and apply a watermark based on the requesting user. It also provides a method for authentication and tracks access to the original document.

Architectural overview

S3 Object Lambda processes and transforms data that is requested from Amazon Simple Storage Service (Amazon S3) before it’s sent back to a client. The AWS Lambda function is invoked inline via a standard S3 GET request. It can return different results from the same document based on parameters, such as who is requesting the document. Figure 1 provides a high-level view of the different components that make up the solution.

Figure 1. Document processing architectural diagram

Authenticating users with Amazon Cognito

This architecture defines a RESTful API, but users will likely be using a mobile or web application that calls the API. Thus, the application will first need to authenticate users. We do this via Amazon Cognito, which functions as its own identity provider (IdP). You could also use an external IdP, including those that support OpenID Connect and SAML.

Validating the JSON Web Token with API Gateway

Once the user is successfully authenticated with Amazon Cognito, the application will be sent a JSON Web Token (JWT). This JWT contains information about the user and will be used in subsequent requests to the API.

Now that the application has a token, it will make a request to the API, which is provided by Amazon API Gateway. API Gateway provides a secure, scalable entryway into your application. API Gateway validates the JWT sent from the client against Amazon Cognito. If the token is valid, the request is accepted and sent on to the Lambda API Handler. If it’s not, the request is rejected and the client is sent an error code.

Storing user data with DynamoDB

When the Lambda API Handler receives the request, it parses the JWT to extract the user making the request. It then logs that user, file, and access time into Amazon DynamoDB. Optionally, you may use DynamoDB to store an encoded string that will be used as the watermark, rather than something in plaintext, like user name or email.
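
A minimal sketch of this step, under stated assumptions: the handler decodes the JWT payload without re-verifying the signature (because API Gateway has already validated the token), takes the user from the `sub` claim, and builds the item to log. The attribute names `userId`, `fileName`, and `accessedAt` are hypothetical. The resulting dictionary could then be written with the DynamoDB table resource's `put_item(Item=...)`.

```python
import base64
import json
from datetime import datetime, timezone
from typing import Any, Dict


def decode_jwt_claims(token: str) -> Dict[str, Any]:
    """Decode the payload segment of a JWT.

    No signature check is done here; this assumes API Gateway has already
    validated the token before invoking the function.
    """
    payload_segment = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    padded = payload_segment + "=" * (-len(payload_segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def build_access_record(token: str, file_name: str) -> Dict[str, str]:
    """Build the item that records who accessed which file, and when (UTC)."""
    claims = decode_jwt_claims(token)
    return {
        "userId": claims["sub"],
        "fileName": file_name,
        "accessedAt": datetime.now(timezone.utc).isoformat(),
    }
```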

Generating the PDF and user-specific watermark

At this point, the Lambda API Handler sends an S3 GET request. However, instead of going to Amazon S3 directly, it goes to a different endpoint that invokes the S3 Object Lambda function. This endpoint is called an S3 Object Lambda Access Point. The S3 GET request contains the original file name and the string that will be used for the watermark.

The S3 Object Lambda function transforms the original file that it downloads from its source S3 bucket. It uses the open-source office suite LibreOffice (and specifically this Lambda layer) to convert the source document to PDF. Once it is converted, a JavaScript library (PDF-Lib) embeds the watermark into the PDF before it’s sent back to the Lambda API Handler function.
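
The general shape of an S3 Object Lambda function can be sketched as follows. This is not the authors' implementation, which uses LibreOffice and the JavaScript PDF-Lib library. It is a Python outline in which `fetch` and `transform` are hypothetical callables standing in for the download and the convert-and-watermark steps. `s3_client` is assumed to be a boto3 S3 client. The `getObjectContext` fields and `write_get_object_response` are part of the S3 Object Lambda interface.

```python
from typing import Any, Callable, Dict


def handle_get_object(
    event: Dict[str, Any],
    s3_client: Any,
    fetch: Callable[[str], bytes],
    transform: Callable[[bytes], bytes],
) -> Dict[str, int]:
    """Fetch the original object, transform it, and return it to the caller."""
    context = event["getObjectContext"]
    # inputS3Url is a presigned URL for the original object in the source bucket.
    original = fetch(context["inputS3Url"])
    # Send the transformed bytes back as the response to the GET request.
    s3_client.write_get_object_response(
        Body=transform(original),
        RequestRoute=context["outputRoute"],
        RequestToken=context["outputToken"],
    )
    return {"statusCode": 200}
```

Injecting `fetch` and `transform` keeps the conversion logic separate from the S3 plumbing, so each piece can be tested without network access.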

The Lambda API Handler stores the converted file in a temporary S3 bucket, generates a presigned URL, and sends that URL back to the client as a 302 redirect. Then the client sends a request to that presigned URL to get the converted file.
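
The redirect step might look like the sketch below, assuming a boto3 S3 client and an API Gateway Lambda proxy integration. The function name and the 300-second expiry are illustrative choices, not values from the solution.

```python
from typing import Any, Dict


def redirect_to_converted_file(s3_client: Any, bucket: str, key: str, expires_in: int = 300) -> Dict[str, Any]:
    """Build a 302 response that points the client at a short-lived presigned URL."""
    url = s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,  # seconds the URL stays valid (assumed value)
    )
    return {"statusCode": 302, "headers": {"Location": url}, "body": ""}
```

A short expiry limits how long a leaked URL stays usable, and it pairs well with the lifecycle rule that cleans up the temporary bucket.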

To keep the temporary S3 bucket tidy, we use an S3 lifecycle configuration with an expiration policy.

Figure 2. Process workflow for document transformation

Alternate approach

Before S3 Object Lambda was available, Lambda@Edge was used. However, there are three main issues with using Lambda@Edge instead of S3 Object Lambda:

It is designed to run code closer to the end user to decrease latency, but in this case, latency is not a major concern.
It requires using an Amazon CloudFront distribution, and the single-download pattern described here will not take advantage of Lambda@Edge’s caching.
It has quotas on memory that don’t lend themselves to complex libraries like LibreOffice.

Extending this solution

This blog post describes the basic building blocks for the solution, but it can be extended relatively easily. For example, you could add another function to the API that would convert, resize, and watermark images. To do this, create an S3 Object Lambda function to perform those tasks. Then, add an S3 Object Lambda Access Point to invoke it based on a different API call.

API Gateway has many built-in security features, but you may want to enhance the security of your RESTful API. To do this, add enhanced security rules via AWS WAF. Integrating your IdP into Amazon Cognito can give you a single place to manage your users.

Monitoring any solution is critical, and understanding how an application is behaving end to end can greatly benefit optimization and troubleshooting. Adding AWS X-Ray and Amazon CloudWatch Lambda Insights will show you how functions and their interactions are performing.

Should you decide to extend this architecture, follow the architectural principles defined in AWS Well-Architected, and pay particular attention to the Serverless Application Lens.

Figure 3. Example expanded document processing architecture

Conclusion

You can implement this solution in a number of ways. However, by using S3 Object Lambda, you can transform documents without needing intermediary storage. S3 Object Lambda will also decouple your file logic from the rest of the application.

The Serverless on AWS components mentioned in this post allow you to reduce administrative overhead, saving you time and money.

Finally, the extensible nature of this architecture allows you to add functionality easily as your organization’s needs grow and change.

Middleware-assisted Zero-downtime Live Database Migration to AWS

2021-08-20 Seif Elharaki

Post Syndicated from Seif Elharaki original https://aws.amazon.com/blogs/architecture/middleware-assisted-zero-downtime-live-database-migration-to-aws/

When trying to figure out how to refactor your applications to leverage AWS Managed Services, you have some decisions to make. You may have decided to move your storage layer to AWS before the computational layer. This may help with using advanced database features, in addition to reducing costs associated with writing and reading data. AWS Professional Services recently helped a large customer with this implementation.

With more than a quarter billion daily users, this customer uses highly transactional NoSQL databases that are a few hundred GBs in size. The volume of data is growing rapidly. The downtime requirements for the migration were stringently low, as their applications are used globally, 24-7. The source data layer was Cloud Datastore, which runs outside of AWS. The destination was Amazon DynamoDB. Several hundred globally distributed applications (writing to and reading from the database) had little or no room for refactoring in the initial phase. While the go-to solution for this scenario is usually AWS Database Migration Service, Cloud Datastore is not yet supported by AWS DMS as a source.

Using a configurable middleware to migrate in-use data layer

The architectural approach chosen was to develop a custom middleware that the applications would communicate with rather than directly calling the database. Common database operations such as Get, Put, Delete, conditional updates and deletes, and transactions would be issued against this middleware, which is loaded as an in-memory library. Data would be read from and written to this middleware layer. It would then issue reads and writes to multiple databases with configurable load factors. The solution was developed and tested in stages (Dual-Write Single-Read, Dual-Write Dual-Read, and Single-Write Single-Read) as shown in the diagrams following.

Architecture: routing database traffic to multiple storage targets

Figure 1 shows the initial state when the application layer communicates directly with the source database.

Figure 1. Architecture of initial state: Generic 2 or 3 tier application with application and storage layers running inside or outside AWS Cloud

Figures 2 and 3 illustrate the intermediate and final states, respectively, with database traffic moved to DynamoDB progressively.

Figure 2. Architecture of intermediate state: The middleware layer introduced to switch database traffic between source and target databases

Figure 3. Architecture of post-migration final state

A closer look into the configurable stages of the migration

Initially, the middleware should be tested with the source database alone. It can then be configured to work with DynamoDB in Dual-Write mode. Reads will still continue from the source database. The target database is synchronized by copying old data in parallel.

In the next stage, reads are expanded to the target database. Reading from two sources allows in-memory comparison of the final result set. This ensures consistency of the data being returned. Upon successful validation, the system is finally configured as Single-Write Single-Read, operating solely on the target database. This is the “Point of no Return,” where the target database surges ahead with new data. In this mode, the migration is deemed complete, and the older database is ready to be taken offline.

This multi-stage approach results in a “live migration” of the data layer to DynamoDB with zero downtime. Higher-level applications are also left intact. This increases the speed and accuracy of the overall migration.
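
The three configurable stages can be sketched as a mode switch inside the middleware. This is an illustration rather than the customer's actual library: the `Mode` and `Middleware` names are hypothetical, and plain dictionaries stand in for the source and target databases.

```python
from enum import Enum
from typing import Any, Dict, Optional


class Mode(Enum):
    DUAL_WRITE_SINGLE_READ = 1
    DUAL_WRITE_DUAL_READ = 2
    SINGLE_WRITE_SINGLE_READ = 3


class Middleware:
    """Routes reads and writes to the source and/or target store by stage."""

    def __init__(self, source: Dict[str, Any], target: Dict[str, Any], mode: Mode) -> None:
        self.source = source
        self.target = target
        self.mode = mode

    def put(self, key: str, value: Any) -> None:
        # The target receives every write in all three stages.
        self.target[key] = value
        # The source keeps receiving writes until the final stage.
        if self.mode is not Mode.SINGLE_WRITE_SINGLE_READ:
            self.source[key] = value

    def get(self, key: str) -> Optional[Any]:
        if self.mode is Mode.SINGLE_WRITE_SINGLE_READ:
            return self.target.get(key)
        if self.mode is Mode.DUAL_WRITE_DUAL_READ:
            # Both stores are read so results can be compared (comparison
            # omitted here); the source value is returned in this sketch.
            self.target.get(key)
        return self.source.get(key)
```

Because the stage is a configuration value, moving forward (or backing out before the final stage) does not require changes in the calling applications.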

Configurable migration stages load balance traffic to the underlying databases

The middleware layer acts as a valve or switch regulating traffic from the applications to one or more databases. This allows support for a canary-like load balancing, where a certain percentage of traffic can be diverted in either direction. We can visualize this behavior with the analogy of a 3-stage dial, as shown in Figures 4 through 6. These stages are developed and tested in a non-production environment with production-like data. All related sets of tables should be migrated together.
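
One way to implement canary-like routing is to hash each key into one of 100 buckets and send the configured percentage of buckets to the target database. This is a sketch of that idea, not the mechanism used in the project described here. The function name and the choice of SHA-256 are assumptions.

```python
import hashlib


def routes_to_target(key: str, target_percent: int) -> bool:
    """Decide deterministically whether a key's traffic goes to the target database."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Map the key to a stable bucket in the range 0..99.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < target_percent
```

Hashing the key, rather than picking randomly per request, means a given key is always routed the same way at a given percentage, and raising the percentage only ever adds keys to the target set.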

Dual-Write Single-Read stage

Figure 4. Migration Stage 1: Dual-Write Single-Read mode

In this stage, shown in Figure 4, data is written to both the source database and the target database (DynamoDB). At this point, data is read only from the source, because the target is not ready to handle reads yet. While new data is being written to the target database, older data is copied and backfilled by background processes.

Avoid data corruption while copying older data. Don’t change it while you’re bringing the target database on par with the source. The middleware can implement a locking mechanism for write operations based on primary keys. One way to monitor the movement of older data can be a temporary table, which the copying process can update. The middleware can read this table to allow or deny a write operation. In most use cases, writes taper off with time, making it easier for older data to be copied without running into contention.
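
The locking idea can be sketched with an in-memory stand-in for the temporary table. The `BackfillTracker` class and `guarded_write` function are hypothetical names. In practice the tracker would be a shared table that both the copying process and the middleware can reach.

```python
from typing import Any, Dict, Set


class BackfillTracker:
    """In-memory stand-in for the temporary table updated by the copy process."""

    def __init__(self) -> None:
        self._in_flight: Set[str] = set()

    def start_copy(self, key: str) -> None:
        self._in_flight.add(key)

    def finish_copy(self, key: str) -> None:
        self._in_flight.discard(key)

    def is_locked(self, key: str) -> bool:
        return key in self._in_flight


def guarded_write(
    key: str,
    value: Any,
    source: Dict[str, Any],
    target: Dict[str, Any],
    tracker: BackfillTracker,
) -> bool:
    """Dual-write unless the key is currently being backfilled."""
    if tracker.is_locked(key):
        return False  # denied; the caller can retry once the copy finishes
    source[key] = value
    target[key] = value
    return True
```

Locking per primary key keeps contention narrow: only writes that collide with a record being copied at that moment are delayed.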

Dual-Write Dual-Read stage

Figure 5. Migration Stage 2: Dual-Write Dual-Read mode

The prerequisite of this stage is the parity of content between the two databases. As shown in Figure 5, both reads and writes are routed to both databases. In this stage, the middleware layer activates data validation. The records that are read from both the data layers are compared and contrasted for accuracy and consistency. This allows any discrepancies in the data to be fixed and the solution redeployed.
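
The validation in this stage can be sketched as a read that consults both stores and records any key whose values differ. The names are illustrative, dictionaries stand in for the databases, and returning the source value on a mismatch is an assumption made for this sketch.

```python
from typing import Any, Dict, List, Optional


def validated_read(
    key: str,
    source: Dict[str, Any],
    target: Dict[str, Any],
    mismatches: List[str],
) -> Optional[Any]:
    """Read a key from both stores and note any discrepancy for later repair."""
    source_value = source.get(key)
    target_value = target.get(key)
    if source_value != target_value:
        mismatches.append(key)
    # Assumption: the source remains authoritative until validation passes.
    return source_value
```

Collecting mismatched keys, rather than failing the request, lets the application keep serving traffic while discrepancies are investigated and fixed.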

Single-Write Single-Read stage

Figure 6. Migration Stage 3: Single-Write Single-Read mode

In this stage shown in Figure 6, all reads and writes are directed only to the target database on AWS. This is the “Point of no Return”, as new data is written to the target database alone. The source database falls behind, and can be taken offline for eventual retirement.

Dealing with differences in database features

Apart from acting as a switch, the job of the middleware layer in this design pattern is to accept, translate, and forward the generic database call. For example, when it receives a “Put” call, it must invoke the “Put” API on the specific underlying target. After due translation, it follows the rules governing the corresponding service. The middleware layer does this twice for two different underlying databases, when operated in Dual-Write or Dual-Read modes.

You must deal with differences in databases in terms of specific features, limits, and limitations. The following is a non-exhaustive list of such areas:

Specific quantitative limits: DynamoDB imposes a size limit of 4 MB on Transactions. This limit is likely different for the source database.
Behavioral differences for features like indices: Cloud Datastore supports writing empty values to indexed fields which DynamoDB doesn’t support.
Behavioral differences for primary and secondary keys: Other databases might not treat keys the same way DynamoDB treats its hash and sort keys.
Differences in capacity, throughput, and latency: The middleware may need to throttle or even decline requests. This can happen if it starts operating in an Availability Zone where one underlying database is able to scale, but the other can’t scale.

An object-oriented approach can be an efficient way to deal with such differences. Create a base class encapsulating features that are common to different databases. Then use inheritance and polymorphism to account for the differences. This can ensure reusability, readability, and maintainability.
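
A small sketch of this object-oriented approach, using the empty-value difference mentioned in the list above as the example. All class names are hypothetical, and items are stored in memory rather than in real databases.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Set


class DatabaseAdapter(ABC):
    """Behavior shared by every underlying database."""

    def __init__(self) -> None:
        self.items: Dict[str, Dict[str, Any]] = {}

    def put(self, key: str, item: Dict[str, Any]) -> None:
        self.validate(item)  # database-specific rules run before the write
        self.items[key] = dict(item)

    @abstractmethod
    def validate(self, item: Dict[str, Any]) -> None:
        """Raise ValueError if the item breaks a rule of this database."""


class PermissiveAdapter(DatabaseAdapter):
    """Stands in for a store that accepts empty values in indexed fields."""

    def validate(self, item: Dict[str, Any]) -> None:
        return None


class StrictIndexAdapter(DatabaseAdapter):
    """Stands in for a store that rejects empty values in indexed fields."""

    def __init__(self, indexed_fields: Set[str]) -> None:
        super().__init__()
        self.indexed_fields = indexed_fields

    def validate(self, item: Dict[str, Any]) -> None:
        for field in self.indexed_fields:
            if field in item and item[field] in ("", None):
                raise ValueError(f"empty value in indexed field: {field}")
```

The middleware can hold one adapter per database and call `put` on each. The differences stay inside the subclasses, so the routing code remains the same for both.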

As the AWS Professional Services team has experienced, the resulting tool can be reused several times in a large organization to migrate different application suites. It can potentially be applied to other use cases. These include, but are not limited to:

Support for more storage configurations and databases, and abstraction of application code bases by making them largely agnostic of the underlying database technology
Extensive database compatibility testing using granular migration stages
Modularization and containerization with computational platforms such as Amazon Elastic Kubernetes Service (EKS) or Amazon Elastic Container Service (ECS)

Conclusion

This design pattern showcases the power of abstraction in enabling live database migrations. Several optimizations are possible based on the rate of writes and pre-existing size of the database. The key benefit of this approach is the elimination of the need to make extensive changes in the application layer. This can result in significant savings in terms of effort, time, and cost, especially if different applications are managed by different teams in an organization. In addition, migration to DynamoDB alone can yield significant cost savings for AWS customers. This depends on the size and access pattern of data, and whether the solution is architected for cost-savings. Refer to the Cost Optimization Pillar of the Well-Architected Framework for further best practices.

Building well-architected serverless applications: Optimizing application performance – part 1

2021-08-17 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-optimizing-application-performance-part-1/

PERF 1. Optimizing your serverless application’s performance

Evaluate and optimize your serverless application’s performance based on access patterns, scaling mechanisms, and native integrations. This allows you to continuously gain more value per transaction. You can improve your overall experience and make more efficient use of the platform in terms of both value and resources.

Good practice: Measure and optimize function startup time

Evaluate your AWS Lambda function startup time for both performance and cost.

Take advantage of execution environment reuse to improve the performance of your function.

Lambda invokes your function in a secure and isolated runtime environment, and manages the resources required to run your function. When a function is first invoked, the Lambda service creates an instance of the function to process the event. This is called a cold start. After completion, the function remains available for a period of time to process subsequent events. These are called warm starts.

Lambda functions must contain a handler method in your code that processes events. During a cold start, Lambda runs the function initialization code, which is the code outside the handler, and then runs the handler code. During a warm start, Lambda runs only the handler code.

Lambda function cold and warm starts

Initialize SDK clients, objects, and database connections outside of the function handler so that they are started during the cold start process. These connections then remain during subsequent warm starts, which improves function performance and cost.

Lambda provides a writable local file system available at /tmp. This is local to each function but shared between subsequent invocations within the same execution environment. You can download and cache assets locally in the /tmp folder during the cold start. This data is then available locally to all subsequent warm start invocations, improving performance.
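
A minimal sketch of this caching pattern: the asset is downloaded only when it is not already present in the cache directory. The function name and the `download` callable are hypothetical. In a real function, `download` might fetch an object from Amazon S3.

```python
import os
from typing import Callable


def get_cached_asset(name: str, download: Callable[[], bytes], cache_dir: str = "/tmp") -> bytes:
    """Return the asset, downloading it only if it is not cached locally yet."""
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):
        # First call in this execution environment: fetch and store the asset.
        with open(path, "wb") as cache_file:
            cache_file.write(download())
    with open(path, "rb") as cache_file:
        return cache_file.read()
```

Later invocations in the same execution environment find the file already present and skip the download.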

In the serverless airline example used in this series, the confirm booking Lambda function initializes a number of components during the cold start. These include the Lambda Powertools utilities and creating a session to the Amazon DynamoDB table BOOKING_TABLE_NAME.

import os

import boto3
from aws_lambda_powertools import Logger, Metrics, Tracer
from aws_lambda_powertools.metrics import MetricUnit
from botocore.exceptions import ClientError

logger = Logger()
tracer = Tracer()
metrics = Metrics()

session = boto3.Session()
dynamodb = session.resource("dynamodb")
table_name = os.getenv("BOOKING_TABLE_NAME", "undefined")
table = dynamodb.Table(table_name)

Analyze and improve startup time

There are a number of steps you can take to measure and optimize Lambda function initialization time.

You can view the function cold start initialization time using Amazon CloudWatch Logs and AWS X-Ray. A log REPORT line for a cold start includes the Init Duration value. This is the time the initialization code takes to run before the handler.

CloudWatch Logs cold start report line

When X-Ray tracing is enabled for a function, the trace includes the Initialization segment.

X-Ray trace cold start showing initialization segment

A subsequent warm start REPORT line does not include the Init Duration value, and the Initialization segment is not present in the X-Ray trace:

CloudWatch Logs warm start report line

X-Ray trace warm start without showing initialization segment
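
If you export REPORT lines for offline analysis, the presence of Init Duration is enough to tell cold starts from warm starts. The sketch below assumes the usual `Init Duration: <value> ms` wording of the REPORT line. The function names are illustrative.

```python
import re
from typing import Optional

_INIT_DURATION = re.compile(r"Init Duration: (?P<ms>[\d.]+) ms")


def init_duration_ms(report_line: str) -> Optional[float]:
    """Return the Init Duration in milliseconds, or None for a warm start."""
    match = _INIT_DURATION.search(report_line)
    return float(match.group("ms")) if match else None


def is_cold_start(report_line: str) -> bool:
    """Return True if the REPORT line contains an Init Duration value."""
    return init_duration_ms(report_line) is not None
```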

CloudWatch Logs Insights allows you to search and analyze CloudWatch Logs data over multiple log groups. There are some useful searches to understand cold starts.

Understand cold start percentage over time:

filter @type = "REPORT"
| stats
  sum(strcontains(
    @message,
    "Init Duration"))
  / count(*)
  * 100
  as coldStartPercentage,
  avg(@duration)
  by bin(5m)

Cold start percentage over time

Cold start count and InitDuration:

filter @type="REPORT"
| fields @memorySize / 1000000 as memorySize
| filter @message like /(?i)(Init Duration)/
| parse @message /^REPORT.*Init Duration: (?<initDuration>.*) ms.*/
| parse @log /^.*\/aws\/lambda\/(?<functionName>.*)/
| stats count() as coldStarts, median(initDuration) as medianInitDuration, max(initDuration) as maxInitDuration by functionName, memorySize

Cold start count and InitDuration

Once you have measured cold start performance, there are a number of ways to optimize startup time. For Python, you can use the PYTHONPROFILEIMPORTTIME=1 environment variable.

PYTHONPROFILEIMPORTTIME environment variable

This shows how long each package import takes to help you understand how packages impact startup time.

Python import time
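For a quick check of a single dependency, the import can also be timed in-process. This is a rough sketch that complements the environment variable rather than replacing it; the helper name time_import is an assumption, and the measurement is only meaningful the first time a module is loaded in a process.

```python
import importlib
import sys
import time


def time_import(module_name: str) -> float:
    """Return the seconds spent importing module_name for the first time."""
    if module_name in sys.modules:
        # A cached module would return almost instantly and hide the real cost.
        raise ValueError(f"{module_name} is already imported in this process")
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Replace "decimal" with the package you want to measure.
    print(f"decimal: {time_import('decimal') * 1000:.1f} ms")
```

Because the timing includes every transitive import, a slow result points at the whole dependency tree of that package, not only the package itself.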

Previously, for the AWS Node.js SDK, you enabled HTTP keep-alive in your code to maintain TCP connections. Enabling keep-alive allows you to avoid setting up a new TCP connection for every request. Since AWS SDK version 2.463.0, you can also set the Lambda function environment variable AWS_NODEJS_CONNECTION_REUSE_ENABLED=1 to make the SDK reuse connections by default.

You can configure Lambda’s provisioned concurrency feature to pre-initialize a requested number of execution environments. This runs the cold start initialization code so that they are prepared to respond immediately to your function’s invocations.

Use Amazon RDS Proxy to pool and share database connections to improve function performance. For additional options for using RDS with Lambda, see the AWS Serverless Hero blog post “How To: Manage RDS Connections from AWS Lambda Serverless Functions”.

Choose frameworks that load quickly on function initialization startup. For example, prefer simpler Java dependency injection frameworks like Dagger or Guice over more complex frameworks such as Spring. When using the AWS SDK for Java, there are some cold start performance optimization suggestions in the documentation. For further Java performance optimization tips, see the AWS re:Invent session, “Best practices for AWS Lambda and Java”.

To minimize deployment packages, choose lightweight web frameworks optimized for Lambda. For example, use MiddyJS, Lambda API JS, and Python Chalice over Node.js Express, Python Django or Flask.

If your function has many objects and connections, consider splitting the function into multiple, specialized functions. These are individually smaller and have less initialization code. I cover designing smaller, single purpose functions from a security perspective in “Managing application security boundaries – part 2”.

Minimize your deployment package size to only its runtime necessities

Smaller functions also allow you to separate functionality. Only import the libraries and dependencies that are necessary for your application processing. Use code bundling when you can to reduce the impact of file system lookup calls. Bundling can also reduce deployment package size.

For example, if you only use Amazon DynamoDB in the AWS SDK, instead of importing the entire SDK, you can import an individual service. Compare the following three examples as shown in the Lambda Operator Guide:

// Instead of const AWS = require('aws-sdk'), use:
const DynamoDB = require('aws-sdk/clients/dynamodb')

// Instead of const AWSXRay = require('aws-xray-sdk'), use:
const AWSXRay = require('aws-xray-sdk-core')

// Instead of const AWS = AWSXRay.captureAWS(require('aws-sdk')), use:
const dynamodb = new DynamoDB.DocumentClient()
AWSXRay.captureAWSClient(dynamodb.service)

In testing, importing the DynamoDB library instead of the entire AWS SDK was 125 ms faster. Importing the X-Ray core library was 5 ms faster than the X-Ray SDK. Similarly, when wrapping a service initialization, preparing a DocumentClient before wrapping showed a 140-ms gain. Version 3 of the AWS SDK for JavaScript supports modular imports, which can further help reduce unused dependencies.

For additional options for optimizing AWS Node.js SDK imports, see the AWS Serverless Hero blog post.

Conclusion

In this post, I cover measuring and optimizing function startup time. I explain cold and warm starts and how to reuse the Lambda execution environment to improve performance. I show a number of ways to analyze and optimize the initialization startup time. I explain how only importing necessary libraries and dependencies increases application performance.

This well-architected question continues in part 2, where I look at designing your function to take advantage of concurrency via asynchronous and stream-based invocations. I cover measuring, evaluating, and selecting optimal capacity units.

For more serverless learning resources, visit Serverless Land.

Address Modernization Tradeoffs with Lake House Architecture

2021-08-12 Sukhomoy Basak

Post Syndicated from Sukhomoy Basak original https://aws.amazon.com/blogs/architecture/address-modernization-tradeoffs-with-lake-house-architecture/

Many organizations are modernizing their applications to reduce costs and become more efficient. They must adapt to modern application requirements that provide 24×7 global access. The ability to scale up or down quickly to meet demand and process a large volume of data is critical. This is challenging while maintaining strict performance and availability. For many companies, modernization includes decomposing a monolith application into a set of independently developed, deployed, and managed microservices. The decoupled nature of a microservices environment allows each service to evolve agilely and independently. While there are many benefits for moving to a microservices-based architecture, there can be some tradeoffs. As your application monolith evolves into independent microservices, you must consider the implications to your data architecture.

In this blog post, we provide example use cases and show how Lake House Architecture on AWS can streamline your microservices architecture. A Lake House architecture embraces the decentralized nature of microservices by facilitating data movement. These transfers can be between data stores, from data stores to data lake, and from data lake to data stores (Figure 1).

Figure 1. Integrating data lake, data warehouse, and all purpose-built stores into a coherent whole

Health and wellness application challenges

Our fictitious health and wellness customer has an application architecture comprised of several microservices backed by purpose-built data stores. User profiles, assessments, surveys, fitness plans, health preferences, and insurance claims are maintained in an Amazon Aurora MySQL-Compatible relational database. The event service monitors the number of steps walked, sleep pattern, pulse rate, and other behavioral data in Amazon DynamoDB, a NoSQL database (Figure 2).

Figure 2. Microservices architecture for health and wellness company

With this microservices architecture, it’s common to have data spread across various data stores. This is because each microservice uses a purpose-built data store suited to its usage patterns and performance requirements. While this provides agility, it also presents challenges to deriving needed insights.

Here are four challenges that different users might face:

As a health practitioner, how do I efficiently combine the data from multiple data stores to give personalized recommendations that improve patient outcomes?
As a sales and marketing professional, how do I get a 360 view of my customer, when data lives in multiple data stores? Profile and fitness data are in a relational data store, but important behavioral and clickstream data are in NoSQL data stores. It’s hard for me to run targeted marketing campaigns, which can lead to revenue loss.
As a product owner, how do I optimize healthcare costs when designing wellbeing programs for patients?
As a health coach, how do I find patients and help them with their wellness goals?

Our remaining subsections highlight AWS Lake House Architecture capabilities and features that allow data movement and the integration of purpose-built data stores.

1. Patient care use case

In this scenario, a health practitioner is interested in historical patient data to estimate the likelihood of a future outcome. To get the necessary insights and identify patterns, the health practitioner needs event data from Amazon DynamoDB and patient profile data from Aurora MySQL-Compatible. Our health practitioner will use Amazon Athena to run an ad-hoc analysis across these data stores.

Amazon Athena provides an interactive query service for both structured and unstructured data. The federated query functionality in Amazon Athena helps with running SQL queries across data stored in relational, NoSQL, and custom data sources. Amazon Athena uses Lambda-based data source connectors to run federated queries. Figure 3 illustrates the federated query architecture.

Figure 3. Amazon Athena federated query

The patient care team could use an Amazon Athena federated query to find out if a patient needs urgent care. The query can detect anomalies in the combined datasets from claims processing, device data, and electronic health records (EHR), as shown in Figure 4.

Figure 4. Federated query result by combining data from claim, device, and EHR stores

Healthcare data from various sources, including EHRs and genetic data, helps improve personalized care. Machine learning (ML) is able to harness big data and perform predictive analytics. This creates opportunities for researchers to develop personalized treatments for various diseases, including cancer and depression.

To achieve this, you must move all the related data into a centralized repository such as an Amazon S3 data lake. For specific use cases, you also must move the data between the purpose-built data stores. Finally, you must build an ML solution that can predict the outcome. Amazon Redshift ML, combined with its federated query processing capabilities, enables data analysts and database developers to create a platform to detect patterns (Figure 5). With this platform, health practitioners are able to make more accurate, data-driven decisions.

Figure 5. Amazon Redshift federated query with Amazon Redshift ML

2. Sales and marketing use case

To run marketing campaigns, the sales and marketing team must search customer data from a relational database together with event data in a non-relational data store. We will move the data from Aurora MySQL-Compatible and Amazon DynamoDB to Amazon Elasticsearch Service (ES) to meet this requirement.

AWS Database Migration Service (DMS) helps move the change data from Aurora MySQL-Compatible to Amazon ES using Change Data Capture (CDC). AWS Lambda could be used to move the change data from DynamoDB streams to Amazon ES, as shown in Figure 6.

Figure 6. Moving and combining data from Aurora MySQL-Compatible and Amazon DynamoDB to Amazon Elasticsearch Service

The sales and marketing team can now run targeted marketing campaigns by querying data from Amazon Elasticsearch Service (Figure 7). They can improve sales operations by visualizing data with Amazon QuickSight.

Figure 7. Personalized search experience for ad-tech marketing team

3. Healthcare product owner use case

In this scenario, the product owner must define the care delivery value chain. They must develop process maps for patient activity and estimate the cost of patient care. They must analyze these datasets by building business intelligence (BI) reporting and dashboards with a tool like Amazon QuickSight. Amazon Redshift, a cloud scale data warehousing platform, is a good choice for this. Figure 8 illustrates this pattern.

Figure 8. Moving data from Amazon Aurora and Amazon DynamoDB to Amazon Redshift

The product owners can use integrated business intelligence reports with Amazon Redshift to analyze their data. This way they can make more accurate and appropriate decisions (Figure 9).

Figure 9. Business intelligence for patient care processes

4. Health coach use case

In this scenario, the health coach must find a patient based on certain criteria. They would then send personalized communication to connect with the patient to ensure they are following the proposed health plan. This proactive approach contributes to a positive patient outcome. It can also reduce healthcare costs incurred by insurance companies.

To be able to search patient records with multiple data points, it is important to move data from Amazon DynamoDB to Amazon ES. This also will provide a fast and personalized search experience. The health coaches can be notified and will have the information they need to provide guidance to their patients. Figure 10 illustrates this pattern.

Figure 10. Moving Data from Amazon DynamoDB to Amazon ES

The health coaches can use Elasticsearch to search users based on specific criteria. This helps them with counseling and other health plans, as shown in Figure 11.

Figure 11. Simplified personalized search using patient device data

Summary

In this post, we highlight how Lake House Architecture on AWS helps with the challenges and tradeoffs of modernization. A Lake House architecture on AWS can help streamline the movement of data between the microservices data stores. This offers new capabilities for various analytics use cases.

For further reading on architectural patterns, and walkthroughs for building Lake House Architecture, see the following resources:

Building well-architected serverless applications: Building in resiliency – part 2

2021-08-10 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-building-in-resiliency-part-2/

Reliability question REL2: How do you build resiliency into your serverless application?

This post continues part 1 of this reliability question. Previously, I cover managing failures using retries, exponential backoff, and jitter. I explain how DLQs can isolate failed messages. I show how to use state machines to orchestrate long running transactions rather than handling these in application code.

Required practice: Manage duplicate and unwanted events

Duplicate events can occur when a request is retried or multiple consumers process the same message from a queue or stream. A duplicate can also happen when a request is sent twice at different time intervals with the same parameters. Design your applications to process multiple identical requests to have the same effect as making a single request.

Idempotency refers to the capacity of an application or component to identify repeated events and prevent duplicated, inconsistent, or lost data. This means that receiving the same event multiple times does not change the result beyond the first time the event was received. An idempotent application can, for example, handle multiple identical refund operations. The first refund operation is processed. Any further refund requests to the same customer with the same payment reference should not be processed again.

When using AWS Lambda, you can make your function idempotent. The function’s code must properly validate input events and identify if the events were processed before. For more information, see “How do I make my Lambda function idempotent?”

When processing streaming data, your application must anticipate and appropriately handle processing individual records multiple times. There are two primary reasons why records may be delivered more than once to your Amazon Kinesis Data Streams application: producer retries and consumer retries. For more information, see “Handling Duplicate Records”.

Generate unique attributes to manage duplicate events at the beginning of the transaction

Create, or use an existing unique identifier at the beginning of a transaction to ensure idempotency. These identifiers are also known as idempotency tokens. A number of Lambda triggers include a unique identifier as part of the event:

Amazon API Gateway: requestContext.requestId
Kinesis: Records[].eventID
Amazon CloudWatch Events: id
Amazon Simple Notification Service (SNS): Records[].Sns.MessageId
Amazon Simple Queue Service (SQS): Records[].messageId

You can also create your own identifiers. These can be business-specific, such as transaction ID, payment ID, or booking ID. You can use an opaque random alphanumeric string, unique correlation identifiers, or the hash of the content.

A Lambda function, for example, can use these identifiers to check whether the event has been previously processed.

Depending on the final destination, duplicate events might write to the same record with the same content instead of generating a duplicate entry. This may therefore not require additional safeguards.
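One way to pick up these identifiers in a function is sketched below. It reads the fields listed above for each trigger and falls back to a hash of the content. The function names are hypothetical, the detection rules are simplified assumptions, and real events may need stricter checks.

```python
import hashlib
import json


def content_hash(payload: dict) -> str:
    """Hash the payload with sorted keys so equal content gives an equal token."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def idempotency_tokens(event: dict) -> list:
    """Return one idempotency token per logical message in a Lambda event."""
    # Amazon API Gateway: requestContext.requestId
    request_id = event.get("requestContext", {}).get("requestId")
    if request_id:
        return [request_id]

    records = event.get("Records")
    if records:
        tokens = []
        for record in records:
            if "Sns" in record:            # Amazon SNS: Records[].Sns.MessageId
                tokens.append(record["Sns"]["MessageId"])
            elif "messageId" in record:    # Amazon SQS: Records[].messageId
                tokens.append(record["messageId"])
            elif "eventID" in record:      # Kinesis: Records[].eventID
                tokens.append(record["eventID"])
            else:                          # unknown record shape: hash the content
                tokens.append(content_hash(record))
        return tokens

    # Amazon CloudWatch Events: top-level id
    if "id" in event and "detail-type" in event:
        return [event["id"]]

    return [content_hash(event)]
```

Batch sources return one token per record, so each message in the batch can be checked and skipped independently.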

Use an external system to store unique transaction attributes and verify for duplicates

Lambda functions can use Amazon DynamoDB to store and track transactions and idempotency tokens to determine if the transaction has been handled previously. DynamoDB Time to Live (TTL) allows you to define a per-item timestamp to determine when an item is no longer needed. This helps to limit the storage space used. Base the TTL on the event source, for example, the message retention period for SQS.

Using DynamoDB to store idempotent tokens

You can also use DynamoDB conditional writes to ensure a write operation only succeeds if an item attribute meets one or more expected conditions. For example, you can use this to fail a refund operation if a payment reference has already been refunded. This signals to the application that it is a duplicate transaction. The application can then catch this exception and return the same result to the customer as if the refund was processed successfully.
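A conditional write for the refund example might look like the following sketch. It assumes a boto3 DynamoDB Table resource is passed in, that payment_reference is the table's partition key, and that expires_at is the attribute configured for TTL. Those names, the function name, and the default retention are illustrative assumptions.

```python
import time


def record_refund_once(table, payment_reference: str, result: dict,
                       ttl_seconds: int = 4 * 24 * 3600) -> bool:
    """Store the refund result unless this payment reference was already recorded.

    Returns True if this call recorded the refund, False for a duplicate.
    """
    item = {
        "payment_reference": payment_reference,
        "result": result,
        # Epoch seconds after which DynamoDB TTL may delete the item.
        "expires_at": int(time.time()) + ttl_seconds,
    }
    try:
        table.put_item(
            Item=item,
            # The write only succeeds if no item with this key exists yet.
            ConditionExpression="attribute_not_exists(payment_reference)",
        )
        return True
    except Exception as exc:
        # botocore's ClientError carries the service error code in exc.response.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "ConditionalCheckFailedException":
            return False
        raise
```

When the function returns False, the caller can look up the stored result and return it, so the customer sees the same response as for the first request.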

Third-party APIs can also support idempotency directly. For example, Stripe allows you to add an Idempotency-Key: <key> header to the request. Stripe saves the resulting status code and body of the first request made for any given idempotency key, regardless of whether it succeeded or failed. Subsequent requests with the same key return the same result.

Validate events using a pre-defined and agreed upon schema

Implicitly trusting data from clients, external sources, or machines could lead to malformed data being processed. Use a schema to validate that your event conforms to what you are expecting. Process the event using the schema within your application code or at the event source when applicable. Events not adhering to your schema should be discarded.

For API Gateway, I cover validating incoming HTTP requests against a schema in “Implementing application workload security – part 1”.

Amazon EventBridge rules match event patterns. EventBridge provides schemas for all events that are generated by AWS services. You can create or upload custom schemas or infer schemas directly from events on an event bus. You can also generate code bindings for event schemas.

SNS supports message filtering. This allows a subscriber to receive a subset of the messages sent to the topic using a filter policy. For more information, see the documentation.

JSON Schema is a tool for validating the structure of JSON documents. There are a number of implementations available.
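To show the shape of validation inside application code, here is a deliberately small hand-rolled check. It is not a JSON Schema implementation. In practice one of the JSON Schema libraries would take its place, and the field names in BOOKING_SCHEMA are invented for the example.

```python
# Hypothetical schema: field name -> accepted Python type(s).
BOOKING_SCHEMA = {
    "booking_id": str,
    "customer_id": str,
    "amount": (int, float),
}


def validate_event(event: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the event matches the schema."""
    problems = []
    for field, expected in schema.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        # As a consequence this sketch does not support boolean fields.
        elif isinstance(event[field], bool) or not isinstance(event[field], expected):
            problems.append(f"wrong type for field: {field}")
    for field in event:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


def handler(event, context):
    """Discard events that do not adhere to the schema before any processing."""
    problems = validate_event(event, BOOKING_SCHEMA)
    if problems:
        return {"accepted": False, "problems": problems}
    return {"accepted": True}
```

Rejecting unexpected fields as well as missing ones keeps malformed or padded events out of downstream processing.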

Best practice: Consider scaling patterns at burst rates

Load testing your serverless application allows you to monitor the performance of an application before it is deployed to production. Serverless applications can be simpler to load test, thanks to the automatic scaling built into many of the services. For more information, see “How to design Serverless Applications for massive scale”.

In addition to your baseline performance, consider evaluating how your workload handles initial burst rates. This ensures that your workload can sustain burst rates while scaling to meet possibly unexpected demand.

Perform load tests using a burst strategy with random intervals of idleness

Perform load tests using a burst of requests for a short period of time. Also introduce burst delays to allow your components to recover from unexpected load. This allows you to future-proof the workload for key events when you do not know peak traffic levels.
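A burst plan with random idle gaps can be generated ahead of the test run. The sketch below only produces the schedule and leaves sending requests to your load testing tool. The function name and all parameters are assumptions for illustration.

```python
import random


def burst_schedule(bursts, requests_per_burst, burst_seconds,
                   max_idle_seconds, seed=None):
    """Return a list of (start_offset_seconds, request_count) tuples.

    Each burst lasts burst_seconds and is followed by a random idle period
    of up to max_idle_seconds before the next burst starts.
    """
    rng = random.Random(seed)  # a fixed seed makes the plan reproducible
    schedule = []
    offset = 0.0
    for _ in range(bursts):
        schedule.append((offset, requests_per_burst))
        idle = rng.uniform(0, max_idle_seconds)
        offset += burst_seconds + idle
    return schedule


if __name__ == "__main__":
    for start, count in burst_schedule(5, 1000, 10, 60, seed=1):
        print(f"t+{start:7.1f}s: send {count} requests over 10s")
```

Fixing the seed lets you replay the same plan after a configuration change and compare how the workload recovers.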

There are a number of AWS Marketplace and AWS Partner Network (APN) solutions available for performance testing, including Gatling FrontLine, BlazeMeter, and Apica.

In regulating inbound request rates – part 1, I cover running a performance test suite using Gatling, an open source tool.

Gatling performance results

Amazon does have a network stress testing policy that defines which high volume network tests are allowed. Tests that purposefully attempt to overwhelm the target and/or infrastructure are considered distributed denial of service (DDoS) tests and are prohibited. For more information, see “Amazon EC2 Testing Policy”.

Review service account limits with combined utilization across resources

AWS accounts have default quotas, also referred to as limits, for each AWS service. These are generally Region-specific. You can request increases for some limits while other limits cannot be increased. Service Quotas is an AWS service that helps you manage your limits for many AWS services. Along with looking up the values, you can also request a limit increase from the Service Quotas console.

Service Quotas dashboard

As these limits are shared within an account, review the combined utilization across resources including the following:

Amazon API Gateway: number of requests per second across all APIs. (link)
AWS AppSync: throttle rate limits. (link)
AWS Lambda: function concurrency reservations and pool capacity to allow other functions to scale. (link)
Amazon CloudFront: requests per second per distribution. (link)
AWS IoT Core message broker: concurrent requests per second. (link)
Amazon EventBridge: API requests and target invocations limit. (link)
Amazon Cognito: API limits. (link)
Amazon DynamoDB: throughput, indexes, and request rates limits. (link)

Evaluate key metrics to understand how workloads recover from bursts

There are a number of key Amazon CloudWatch metrics to evaluate and alert on to understand whether your workload recovers from bursts.

AWS Lambda: Duration, Errors, Throttles, ConcurrentExecutions, UnreservedConcurrentExecutions. (link)
Amazon API Gateway: Latency, IntegrationLatency, 5xxError, 4xxError. (link)
Application Load Balancer: HTTPCode_ELB_5XX_Count, RejectedConnectionCount, HTTPCode_Target_5XX_Count, UnHealthyHostCount, LambdaInternalError, LambdaUserError. (link)
AWS AppSync: 5XX, Latency. (link)
Amazon SQS: ApproximateAgeOfOldestMessage. (link)
Amazon Kinesis Data Streams: ReadProvisionedThroughputExceeded, WriteProvisionedThroughputExceeded, GetRecords.IteratorAgeMilliseconds, PutRecord.Success, PutRecords.Success (if using Kinesis Producer Library), GetRecords.Success. (link)
Amazon SNS: NumberOfNotificationsFailed, NumberOfNotificationsFilteredOut-InvalidAttributes. (link)
Amazon Simple Email Service (SES): Rejects, Bounces, Complaints, Rendering Failures. (link)
AWS Step Functions: ExecutionThrottled, ExecutionsFailed, ExecutionsTimedOut. (link)
Amazon EventBridge: FailedInvocations, ThrottledRules. (link)
Amazon S3: 5xxErrors, TotalRequestLatency. (link)
Amazon DynamoDB: ReadThrottleEvents, WriteThrottleEvents, SystemErrors, ThrottledRequests, UserErrors. (link)
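To alert on one of these metrics, you can create a CloudWatch alarm. The following sketch builds the arguments for an alarm on Lambda throttles, which Lambda publishes as the Throttles metric in the AWS/Lambda namespace. The alarm name, threshold, and period are assumptions to adapt to your workload.

```python
def throttle_alarm_kwargs(function_name: str, period_seconds: int = 60) -> dict:
    """Build put_metric_alarm arguments that fire on any throttled invocation."""
    return {
        "AlarmName": f"{function_name}-throttles",  # hypothetical naming scheme
        "Namespace": "AWS/Lambda",
        "MetricName": "Throttles",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        # No data points means no throttles, so do not treat gaps as a breach.
        "TreatMissingData": "notBreaching",
    }
```

With boto3 installed, boto3.client("cloudwatch").put_metric_alarm(**throttle_alarm_kwargs("my-function")) would create the alarm, where my-function is a placeholder name.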

Conclusion

This post continues from part 1 and looks at managing duplicate and unwanted events with idempotency and an event schema. I cover how to consider scaling patterns at burst rates by managing account limits and show relevant metrics to evaluate.

Build resiliency into your workloads. Ensure that applications can withstand partial and intermittent failures across components that may only surface in production. In the next post in the series, I cover the performance efficiency pillar from the Well-Architected Serverless Lens.

For more serverless learning resources, visit Serverless Land.

Use ML predictions over Amazon DynamoDB data with Amazon Athena ML

2021-08-04 Sachin Doshi

Post Syndicated from Sachin Doshi original https://aws.amazon.com/blogs/big-data/use-ml-predictions-over-amazon-dynamodb-data-with-amazon-athena-ml/

Today’s modern applications use multiple purpose-built database engines, including relational, key-value, document, and in-memory databases. This purpose-built approach improves the way applications use data by providing better performance and reducing cost. However, the approach raises some challenges for data teams that need to provide a holistic view on top of these database engines, and especially when they need to merge the data with datasets in the organization’s data lake.

In this post, we show how you can use Amazon Athena to build complex aggregations over data from Amazon DynamoDB and enrich the data with ML inference using Amazon SageMaker. You use some of the latest features announced by Athena such as Athena Query Federation, integration with SageMaker for machine learning (ML) predictions, and querying geospatial data in Athena.

For our use case, assume you’re operating a fleet of scooters, and need to forecast whether enough scooters are made available in each part of the city. Specifically, you need to predict the number of scooters needed in each urban neighborhood for the upcoming hour. You have a pre-trained ML model that forecasts the demand for the next hour based on data from the past 4 hours. You use Athena and Amazon QuickSight to predict and visualize the demand, respectively.

Solution overview

The following diagram shows the overall architecture of our solution.

We use the following resources:

A public dataset of dockless vehicle rentals, provided by the Office of Civic Innovation and Technology of the Louisville (KY) Metro Government. This data is pre-populated in DynamoDB as part of the use case. However, in real life, this data would be sent to DynamoDB through various mechanisms such as internet of things (IoT) devices or Amazon Kinesis consumers, which insert data into DynamoDB via AWS Lambda.
Boundaries of historical and cultural neighborhoods within the city of Louisville. The public dataset is provided by the Louisville and Jefferson County, KY Information Consortium (LOJIC). You can download the GIS shapefiles directly. We converted original shapefiles into a text file that you can query with Athena. You can find the Python code for transforming shapefiles in the Jupyter notebook https://github.com/aws-samples/dynamodb-ml-prediction-amazon-athena/blob/main/notebook/GeoSpatialProcessingGISshapefilesWithAmazonAthena.ipynb.
A pre-trained ML model for hourly forecasts. You can find the Python code for training the ML model in the notebook Demand Prediction for scooters using Amazon SageMaker and Amazon Athena.
A SQL query in Athena that brings everything together for live predictions from the data stored in DynamoDB.
Optionally, you can use QuickSight to visualize geospatial data over a map of Louisville, Kentucky (see the following example).

Populate DynamoDB with data and create a SageMaker endpoint to query in Athena

For this post, we populate the DynamoDB table with scooter data. For demand prediction, we create a new SageMaker endpoint using a pre-trained XGBoost model using an elastic container registry path from Region us-east-1.

Choose Launch Stack to launch a CloudFormation stack in us-east-1.

On the CloudFormation console, accept default values for the parameters.
Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
Choose Create stack.

The stack creates five resources:

A DynamoDB table
A Lambda function to load the table with relevant data
A SageMaker endpoint for inference requests, with the pre-trained XGBoost model from an Amazon Simple Storage Service (Amazon S3) location
An Athena workgroup named V2EngineWorkGroup
Named Athena queries to look up the shapefiles and predict the demand for scooters

The CloudFormation template also deploys a pre-built DynamoDB-to-Athena connector, using the AWS Serverless Application Model (AWS SAM).

It can take up to 15–20 minutes for the CloudFormation stack to create these resources.

You can verify the sample data provided by AWS CloudFormation was loaded into DynamoDB by navigating to the DynamoDB console and checking for the table DynamoDBTableDocklessVehicles.

When resource creation is complete, on the Athena console, choose Workgroups.
Select the workgroup V2EngineWorkGroup and choose Switch workgroup.
If you get a prompt to save the query result location, choose an S3 location where you have write permissions.
Choose Save.
In the Athena query editor, select the database athena-ml-db-<your-AWS-Account-Number>.

Now let’s load the geolocation files into the Athena database.

Create an Athena table with geospatial data of neighborhood boundaries

To load the geolocation files into Athena, complete the following steps:

On the Athena console, choose Saved queries.
Search for and select Q1: Neighborhoods.
Return to the Athena SQL window.

This query creates a new table for the geospatial data that represents the urban neighborhoods of the city. The data table has been created from GIS shapefiles. The following CREATE EXTERNAL TABLE statement defines the schema of the table, and the location and format of the underlying data file. You can find the Python code to process shapefiles and produce this table in the notebook Geo-Spatial processing of GIS shapefiles with Amazon Athena.

--------- Start SQL -----
-- Q1: Neighborhoods
-- -----------------
-- This Athena statement creates a table entry in the Glue catalog
-- The data format is a TAB-separated text file without header row
--
CREATE EXTERNAL TABLE "<Here goes your Athena database>"."louisville_ky_neighborhoods"
(
    `objectid` int, 
    `nh_code` int, 
    `nh_name` string, 
    `shapearea` double, 
    `shapelen` double, 
    `bb_west` double, 
    `bb_south` double, 
    `bb_east` double, 
    `bb_north` double, 
    `shape` string, 
    `cog_longitude` double, 
    `cog_latitude` double)
ROW FORMAT DELIMITED 
 FIELDS TERMINATED BY '\t' 
 LINES TERMINATED BY '\n' 
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://<Here goes your S3 bucket>/louisville_ky_neighborhoods/'
TBLPROPERTIES (
    'has_encrypted_data'='false'
)
--------- End SQL -----

Choose Run query or press CTRL+Enter.

This action creates a table named louisville_ky_neighborhoods in your database. Make sure the table is created in the database athena-ml-db-<your-AWS-Account-Number>.

Predict demand for scooters by neighborhood from the aggregated DynamoDB data

Now you can use Athena to query transactional data directly from DynamoDB, and aggregate it for analysis and forecast. This feature isn’t easily achieved by directly querying a DynamoDB NoSQL database.

On the Athena console, choose Saved queries.
Search for and select Q2: DynamoDBAthenaMLScooterPredict.
Return to the Athena SQL window.

This SQL statement demonstrates the use of Athena Query Federation to query the DynamoDB table with the raw trip data, Athena’s geospatial functions to place geographic coordinates into neighborhoods, and ML inference with SageMaker to enrich the data.

The first part of the SQL statement declares the external function to query ML inferences from the SageMaker endpoint that hosts the pre-trained model. We need to define the order and type of the input parameters and the type of the return values.

We use several sub-queries to build the feature table for ML prediction. With the first SELECT statement, we query the raw data from DynamoDB for a given window. Then we locate the start and end locations of each trip record within their urban neighborhoods. Next, we create two aggregations over time and space for the start and end of the trips and combine them to generate a table with the number of trips per hour that started and ended within each of the neighborhoods. Among a few other parameters, we use the hourly counts for the past 4 hours to predict the demand for vehicles for the next hour. You can find the Python code for training the ML model in the notebook Demand Prediction for scooters using Amazon SageMaker and Amazon Athena.

--------------- Start SQL------
-- Q2: DynamoDBAthenaMLScooterPredict
-- ----------------------------------
-- Query demand of scooters for the next hour
--
-- Define function to represent the SageMaker model and endpoint for the prediction
USING EXTERNAL FUNCTION predict_demand( location_id BIGINT, hr BIGINT , dow BIGINT,
 n_pickup_1 BIGINT, n_pickup_2 BIGINT, n_pickup_3 BIGINT, n_pickup_4 BIGINT,
 n_dropoff_1 BIGINT, n_dropoff_2 BIGINT, n_dropoff_3 BIGINT, n_dropoff_4 BIGINT
 )
 RETURNS DOUBLE SAGEMAKER '<Here goes your SageMaker endpoint>'
-- query raw trip data from DynamoDB, we only need the past five hours (i.e. 5 hour time window ending with the time we set as current time)
WITH trips_raw AS (
    SELECT *
    , from_unixtime(start_epoch) AS t_start
    , from_unixtime(end_epoch) AS t_end
    FROM "lambda:<Here goes your DynamoDB connector>"."<Here goes your database>"."<Here goes your DynamoDB table>" dls
      WHERE start_epoch BETWEEN to_unixtime(TIMESTAMP '2019-09-07 15:00' - interval '5' hour) AND to_unixtime(TIMESTAMP '2019-09-07 15:00')
          OR end_epoch BETWEEN to_unixtime(TIMESTAMP '2019-09-07 15:00' - interval '5' hour) AND to_unixtime(TIMESTAMP '2019-09-07 15:00')
),
-- prepare individual trip records
trips AS (
    SELECT tr.*
        , nb1.nh_code AS start_nbid
        , nb2.nh_code AS end_nbid
        , floor( ( tr.start_epoch - to_unixtime(TIMESTAMP '2019-09-07 15:00' - interval '5' hour) )/3600 ) AS t_hour_start
        , floor( ( tr.end_epoch - to_unixtime(TIMESTAMP '2019-09-07 15:00' - interval '5' hour) )/3600 ) AS t_hour_end
    FROM trips_raw tr
    JOIN "AwsDataCatalog"."<Here goes your Athena database>"."louisville_ky_neighborhoods" nb1
        ON ST_Within(ST_POINT(CAST(tr.startlongitude AS DOUBLE), CAST(tr.startlatitude AS DOUBLE)), ST_GeometryFromText(nb1.shape)) 
    JOIN "AwsDataCatalog"."<Here goes your Athena database>"."louisville_ky_neighborhoods" nb2
        ON ST_Within(ST_POINT(CAST(tr.endlongitude AS DOUBLE), CAST(tr.endlatitude AS DOUBLE)), ST_GeometryFromText(nb2.shape))
),
-- aggregating trips over start time and start neighborhood
start_count AS (
    SELECT start_nbid AS nbid, COUNT(start_nbid) AS n_total_start
        , SUM(CASE WHEN t_hour_start=1 THEN 1 ELSE 0 END) AS n1_start
        , SUM(CASE WHEN t_hour_start=2 THEN 1 ELSE 0 END) AS n2_start
        , SUM(CASE WHEN t_hour_start=3 THEN 1 ELSE 0 END) AS n3_start
        , SUM(CASE WHEN t_hour_start=4 THEN 1 ELSE 0 END) AS n4_start
    FROM trips
        WHERE start_nbid BETWEEN 1 AND 98
    GROUP BY start_nbid
),
-- aggregating trips over end time and end neighborhood
end_count AS (
    SELECT end_nbid AS nbid, COUNT(end_nbid) AS n_total_end
        , SUM(CASE WHEN t_hour_end=1 THEN 1 ELSE 0 END) AS n1_end
        , SUM(CASE WHEN t_hour_end=2 THEN 1 ELSE 0 END) AS n2_end
        , SUM(CASE WHEN t_hour_end=3 THEN 1 ELSE 0 END) AS n3_end
        , SUM(CASE WHEN t_hour_end=4 THEN 1 ELSE 0 END) AS n4_end
    FROM trips
    WHERE end_nbid BETWEEN 1 AND 98
    GROUP BY end_nbid
),
-- call the predictive model to get the demand forecast for the next hour
predictions AS (
    SELECT sc.nbid
        , predict_demand(
        CAST(sc.nbid AS BIGINT),
        hour(TIMESTAMP '2019-09-07 15:00'),
        day_of_week(TIMESTAMP '2019-09-07 15:00'),
        sc.n1_start, sc.n2_start, sc.n3_start, sc.n4_start,
        ec.n1_end, ec.n2_end, ec.n3_end, ec.n4_end
        ) AS n_demand 
    FROM start_count sc
    JOIN end_count ec
      ON sc.nbid=ec.nbid
)
-- finally join the predicted values with the neighborhoods' meta data
SELECT nh.nh_code AS nbid, nh.nh_name AS neighborhood, nh.cog_longitude AS longitude, nh.cog_latitude AS latitude
    , ST_POINT(nh.cog_longitude, nh.cog_latitude) AS geo_location
    , COALESCE( round(predictions.n_demand), 0 ) AS demand
FROM "AwsDataCatalog"."<Here goes your Athena database>"."louisville_ky_neighborhoods" nh
LEFT JOIN predictions
    ON nh.nh_code=predictions.nbid
    
    --------------- End SQL------
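
As a quick check of the windowing arithmetic in the trips sub-query, the following Python sketch reproduces the t_hour_start calculation outside Athena. The function name hour_bucket and the sample trip time are illustrative assumptions, and the timestamps are assumed to be UTC. Only the 5-hour window and the 3600-second divisor come from the query itself.

```python
from datetime import datetime, timedelta, timezone

# Reference time used throughout the query (assumed to be UTC here).
NOW = datetime(2019, 9, 7, 15, 0, tzinfo=timezone.utc)
WINDOW_START = NOW - timedelta(hours=5)


def hour_bucket(epoch_seconds: float) -> int:
    """Mirror floor((epoch - window_start) / 3600) from the SQL statement."""
    return int((epoch_seconds - WINDOW_START.timestamp()) // 3600)


# Hypothetical trip that started at 13:30, which is 3.5 hours into the window.
trip_start = datetime(2019, 9, 7, 13, 30, tzinfo=timezone.utc).timestamp()
print(hour_bucket(trip_start))  # 3
```

Buckets run from 0 at the start of the window upward. The start_count and end_count aggregations in the statement count the trips that fall into buckets 1 through 4.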

Choose Run query or press CTRL+Enter to run the query and predict the demand for scooters.

The output table includes the neighborhood, the longitude and latitude of the centroid of the neighborhood, and the number of vehicles that are predicted for the next hour. This query produces the predictions for a selected point in time. You can make predictions for any other time by changing the expression TIMESTAMP '2019-09-07 15:00' everywhere it appears in the statement. Change it to NOW() if you have a real-time data feed from your DynamoDB table.

Visualize predicted demand for scooters in QuickSight

You can use the same Athena query (Q2) to visualize the data in QuickSight. You may have to add additional permissions to your QuickSight role to invoke Lambda functions to access the DynamoDB tables and SageMaker endpoints for the predictions. For detailed instructions on how to set up QuickSight to visualize geo-location coordinates, see the GitHub repo.

As you can see in the following QuickSight visualization, we can spot the demand for scooters on the map of Louisville, Kentucky. Circles with a bigger radius denote higher demand for scooters in that neighborhood. You can also hover over any of the circles to view the detailed demand count and the name of the neighborhood.

With live data, you can also set the refresh rate on your QuickSight dashboard to update the demand prediction automatically.

Clean up

When you’re done, clean up the resources you created as part of this solution.

On the Amazon S3 console, empty the bucket you created as part of the CloudFormation stack.
On the AWS CloudFormation console, find stack bdb-1462-athena-dynamodb-ml-stack and delete the stack.
On the Amazon CloudWatch console, delete the /aws/sagemaker/Endpoints/Sg-athena-ml-dynamodb-model-endpoint log group.

Conclusion

This post demonstrated how to query data in Athena using the Athena Federated Query feature for DynamoDB, and enrich data with ML inference using SageMaker. We also showed how you can integrate geo-location-based queries, using geospatial features in Athena. The Athena Query Federation feature is very extensible and queries data from multiple data sources such as DynamoDB, Amazon Neptune, Amazon ElastiCache for Redis, and Amazon Relational Database Service (Amazon RDS), and gives you a wide range of possibilities to aggregate data.

About the Authors

Sachin Doshi is a Senior Application Architect working in the AWS Professional Services team. He is based in the New York metropolitan area. Sachin helps customers optimize their applications using cloud native AWS services.

Péter Molnár is a Data Scientist with AWS Professional Services. He develops machine learning and artificial intelligence solutions for customer business problems. He holds a doctorate in Theoretical Physics from the University of Stuttgart (Germany) and a master’s degree in Physics from Georgia Augusta University in Göttingen (Germany).

Building well-architected serverless applications: Building in resiliency – part 1

2021-08-03 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-building-in-resiliency-part-1/

Reliability question REL2: How do you build resiliency into your serverless application?

Evaluate scaling mechanisms for serverless and non-serverless resources to meet customer demand. Build resiliency into your workload so that your serverless application can withstand partial and intermittent failures across components that may only surface in production.

Required practice: Manage transaction, partial, and intermittent failures

Whenever one service or system calls another, there is a chance that failures can happen. Services or systems often don’t fail as a single unit, but rather suffer partial or transient failures. Applications should be designed to handle component failures as part of the architecture. The system should be designed to detect failure and, ideally, automatically heal itself.

Transaction failures can occur when a component is unavailable or under high load. Partial failures can occur when a percentage of requests succeeds, including during batch processing. Intermittent failures might occur when a request fails for a short period of time due to network or other transient issues.

AWS serverless services, including AWS Lambda, are fault-tolerant and designed to handle failures. If a service invokes a Lambda function and there is a service disruption, Lambda invokes the function in a different Availability Zone.

When you invoke a function directly, you determine the strategy for handling errors. You can retry, send the event to a destination or queue for debugging, or ignore the error. Clients such as the AWS Command Line Interface (CLI) and the AWS SDK retry on client timeouts, throttling errors (429), and other errors that are not caused by a bad request.

When you invoke a function indirectly, you must be aware of the retry behavior of the invoker and any service that the request encounters along the way. For more information, see “Error handling and automatic retries in AWS Lambda”. You can configure Maximum Retry Attempts and Maximum Event Age for asynchronous invocations.

When reading from Amazon Kinesis Data Streams and Amazon DynamoDB Streams, Lambda retries the entire batch of items. Retries continue until the records expire or exceed the maximum age that you configure on the event source mapping. You can also configure the event source mapping to split a failed batch into two batches. Retrying with smaller batches isolates bad records and works around timeout issues.

Partial failures can occur in non-atomic operations. PutRecords for Kinesis and BatchWriteItem for DynamoDB return a successful response if at least one record is ingested successfully. Always inspect the response when using such operations and programmatically deal with partial failures.
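
The following is a minimal Python sketch of inspecting such a response, assuming the documented shape of a Kinesis PutRecords response (a FailedRecordCount field and a Records list in the same order as the request). The helper name failed_entries and the sample data are hypothetical, and the sample response is hand-written rather than real service output.

```python
from typing import Any, Dict, List


def failed_entries(
    entries: List[Dict[str, Any]], response: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Return the request entries that PutRecords reported as failed.

    The response "Records" list follows the order of the request entries.
    A failed record carries "ErrorCode" instead of "SequenceNumber".
    """
    if response.get("FailedRecordCount", 0) == 0:
        return []
    return [
        entry
        for entry, result in zip(entries, response["Records"])
        if "ErrorCode" in result
    ]


# Hand-written sample shaped like a PutRecords request and response.
entries = [
    {"Data": b"a", "PartitionKey": "k1"},
    {"Data": b"b", "PartitionKey": "k2"},
]
response = {
    "FailedRecordCount": 1,
    "Records": [
        {"SequenceNumber": "1", "ShardId": "shardId-000000000000"},
        {
            "ErrorCode": "ProvisionedThroughputExceededException",
            "ErrorMessage": "Rate exceeded",
        },
    ],
}
to_retry = failed_entries(entries, response)
print(to_retry)  # only the second entry needs to be sent again
```

For DynamoDB BatchWriteItem, the equivalent check is on the UnprocessedItems field of the response, which lists the requests that were not written.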

Use exponential backoff with jitter

The simplest technique for dealing with failures in a networked environment is to retry calls until they succeed. This technique increases the reliability of the application and reduces operational costs for the developer.

However, it is not always safe to retry. A retry can further increase the load on the system being called if the system is already failing due to an overload. To avoid this problem, use backoff. Instead of retrying immediately and aggressively, the client waits some amount of time between tries. The most common pattern is an exponential backoff, which uses exponentially longer wait times between retries. This is typically capped to a maximum delay and number of retries.

If all backoff retries are still happening at the same time, this can still overload a system or cause contention. To avoid this problem, use jitter. Jitter adds some amount of randomness to the backoff to spread the retries around in time. This can help prevent large bursts by spreading out the rate when clients connect. For more information see the Amazon Builders’ Library article “Timeouts, retries, and backoff with jitter” and AWS Architecture blog post “Exponential Backoff And Jitter”.

Exponential backoff and jitter
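
As an illustration, here is a minimal Python sketch of capped exponential backoff with "full jitter", where each wait is drawn uniformly between zero and the capped exponential delay. The function names, default values, and the flaky operation are assumptions made for this example, not part of any AWS SDK.

```python
import random
import time


def jittered_delay(attempt, base, cap, rng):
    """Full jitter: uniform in [0, min(cap, base * 2 ** attempt)]."""
    return rng.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(operation, max_attempts=5, base=0.1, cap=5.0,
                      rng=None, sleep=time.sleep):
    """Call operation(); on failure wait a jittered delay and try again."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:  # real code should catch only retryable errors
            if attempt == max_attempts - 1:
                raise  # retries exhausted, surface the last error
            sleep(jittered_delay(attempt, base, cap, rng))


# Hypothetical operation that fails twice and then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = call_with_retries(flaky, sleep=waits.append)
print(result, len(waits))  # ok 2
```

Passing sleep as a parameter keeps the sketch testable without real delays. The cap and the attempt limit correspond to the maximum delay and number of retries described above.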

When your application responds to callers in fail-fast scenarios and when performance is degraded, inform the caller via headers or metadata when they can retry.

Each AWS SDK implements automatic retry logic including exponential backoff. For downstream calls, you can adjust AWS and third-party SDK retries, backoffs, TCP, and HTTP timeouts. This helps you decide when to stop retrying. For more information, see the documentation and troubleshooting steps for Lambda and the AWS SDK.

Use a dead-letter queue mechanism to retain, investigate and retry failed transactions

There are a number of ways to handle message failures including destinations and dead-letter queues.

You can configure Lambda to send records of asynchronous invocations to another destination service. These include Amazon Simple Queue Service (SQS), Amazon Simple Notification Service (SNS), Lambda, and Amazon EventBridge. You can configure separate destinations for events that fail processing and events that are successfully processed. The invocation record contains details about the event, the response, and the reason that the record was sent.

The following example shows a function that sends a record of a successful invocation to an EventBridge event bus. When an event fails all processing attempts, Lambda sends an invocation record to an SQS queue. It includes the function’s response in the invocation record.

AWS Lambda destinations for asynchronous invocation

SNS, SQS, Lambda, and EventBridge support dead-letter queues (DLQs). DLQs make your applications more resilient and durable by storing messages or events that can’t be processed correctly in a dedicated SQS queue. This helps you debug your application by isolating the problematic messages to determine why their processing failed. Once you have resolved the issue, re-process the failed message. For more information, see “When should I use a dead-letter queue?” There is an example serverless application to redrive the messages from an SQS DLQ back to its source SQS queue.

For Lambda, DLQs provide an alternative to a failure destination. Lambda destinations are preferable for asynchronous invocations.

Good practice: Orchestrate long-running transactions

Long-running transactions can be processed by one or multiple components. Consider implementing the saga pattern using state machines for these types of transactions.

The saga pattern coordinates transactions between multiple microservices as part of a state machine. Each service that performs a transaction publishes an event to trigger the next transaction in the saga. This continues until the transaction chain is complete. If a transaction fails, saga orchestrates a series of compensating transactions that undo the changes that were made by the preceding transactions.
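
To make the compensation idea concrete, here is a small in-process Python sketch. It is only an illustration under simplifying assumptions: a real saga spans separate services and is coordinated by a state machine, and the step names below are hypothetical.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action raises, run the compensations of the steps that already
    completed, in reverse order, and report failure.
    """
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensation)
    return True


log = []


def collect_payment():
    raise RuntimeError("payment declined")  # simulated failure


steps = [
    (lambda: log.append("reserve flight"), lambda: log.append("release flight")),
    (lambda: log.append("reserve booking"), lambda: log.append("cancel booking")),
    (collect_payment, lambda: log.append("refund payment")),
]
succeeded = run_saga(steps)
print(succeeded, log)
# False ['reserve flight', 'reserve booking', 'cancel booking', 'release flight']
```

The failed step itself is not compensated because it never completed. Only the earlier steps are undone, newest first.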

This is preferable to handling complex or long-running transactions within application code. State machines prevent cascading failures and avoid tightly coupling components with orchestrating logic and business logic.

Use a state machine to visualize distributed transactions, and to separate business logic from orchestration logic.

AWS Step Functions lets you coordinate multiple AWS services into serverless workflows via state machines. Within Step Functions, you can set separate retries, backoff rates, max attempts, intervals, and timeouts. These are set for every step of your state machine using a declarative language.

In the serverless airline example used in this series, Step Functions is used to orchestrate the Booking microservice. The ProcessBooking state machine handles all the necessary steps to create bookings, including payment.

Booking service Step Functions state machine

The state machine uses a combination of service integrations using DynamoDB, SQS, and Lambda functions to coordinate transactions and handle failures.

For example, the Reserve Booking task invokes a Lambda function. The task has retry and error handling configured as part of the task definition.

"Reserve Booking": {
	"Type": "Task",
	"Resource": "${ReserveBooking.Arn}",
	"TimeoutSeconds": 5,
	"Retry": [
		{
			"ErrorEquals": [
				"BookingReservationException"
			],
			"IntervalSeconds": 1,
			"BackoffRate": 2,
			"MaxAttempts": 2
		}
	],
	"Catch": [
		{
			"ErrorEquals": [
				"States.ALL"
			],
			"ResultPath": "$.bookingError",
			"Next": "Cancel Booking"
		}
	],
	"ResultPath": "$.bookingId",
	"Next": "Collect Payment"
},
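
To see what these retry settings mean in practice, the following Python sketch computes the wait before each retry, assuming the Amazon States Language semantics in which IntervalSeconds is the wait before the first retry, BackoffRate multiplies the wait on each later retry, and MaxAttempts is the maximum number of retries. The helper name retry_waits is made up for this example.

```python
def retry_waits(interval_seconds, backoff_rate, max_attempts):
    """Wait before each retry: IntervalSeconds * BackoffRate ** retry_index."""
    return [interval_seconds * backoff_rate ** i for i in range(max_attempts)]


# Values from the Reserve Booking task definition.
print(retry_waits(1, 2, 2))  # [1, 2]
```

With these values, a BookingReservationException leads to at most two retries, after 1 and 2 seconds. If the task still fails, or fails with any other error, the Catch block matches States.ALL and the workflow moves to Cancel Booking.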

Step Functions supports direct service integrations, including DynamoDB. The Reserve Flight task directly updates the flightTable without requiring a Lambda function.

"Reserve Flight": {
	"Type": "Task",
	"Resource": "arn:aws:states:::dynamodb:updateItem",
	"Parameters": {
		"TableName.$": "$.flightTable",
		"Key": {
			"id": {
				"S.$": "$.outboundFlightId"
			}
		},
		"UpdateExpression": "SET seatCapacity = seatCapacity - :dec",
		"ExpressionAttributeValues": {
			":dec": {
				"N": "1"
			},
			":noSeat": {
				"N": "0"
			}
		},
		"ConditionExpression": "seatCapacity > :noSeat"
	},

By default, when a state reports an error, Step Functions causes the execution to fail entirely.

Utilize dead-letter queues in response to failed state machine executions

Any state within the Step Functions workflow can encounter runtime errors. These include state machine definition issues, task failures such as Lambda function exceptions, or transient issues such as network connectivity issues. For more information, see “Error handling in Step Functions”.

Use the Step Functions service integration with SQS to send failed transactions to a DLQ as the final step. This adds a higher level of durability within your state machines.

For example, the airline Notify Failed Booking final task catches failed states from four previous steps. It sends the results to the Booking DLQ.

Booking service Step Functions DLQ

The message includes the output of the previous failed states for further troubleshooting.

"Booking DLQ": {
	"Type": "Task",
	"Resource": "arn:aws:states:::sqs:sendMessage",
	"Parameters": {
		"QueueUrl": "${BookingsDLQ}",
		"MessageBody.$": "$"
	},
	"ResultPath": "$.deadLetterQueue",
	"Next": "Booking Failed"
},

The Step Functions documentation has more information on calling SQS.

Conclusion

Build resiliency into your workloads. This makes sure that your application can withstand partial and intermittent failures across components that may only surface in production.

In this post, I cover managing failures using retries, exponential backoff, and jitter. I explain how DLQs can isolate failed messages. I show how to use state machines to orchestrate long-running transactions rather than handling these in application code.

This well-architected question continues in part 2 where I look at managing duplicate and unwanted events with idempotency and an event schema. I cover how to consider scaling patterns at burst rates by managing account limits and show relevant metrics to evaluate.

For more serverless learning resources, visit Serverless Land.

Build a Virtual Waiting Room with Amazon DynamoDB and AWS Lambda at SeatGeek

2021-07-29 Umesh Kalaspurkar

Post Syndicated from Umesh Kalaspurkar original https://aws.amazon.com/blogs/architecture/build-a-virtual-waiting-room-with-amazon-dynamodb-and-aws-lambda-at-seatgeek/

As retail sales, products, and customers continue to expand online, we’ve seen a trend towards releasing products in limited quantities to larger audiences. Demand for these products can be high, due to limited production capacity, venue capacity limits, or product exclusivity. Providers can then experience spikes in transaction volume, especially when multiple event sales occur simultaneously. This increased traffic and load can negatively impact customer experience and infrastructure.

To enhance the customer experience when releasing tickets to high demand events, SeatGeek has introduced a prioritization and queueing mechanism based on event type, venue, and customer type. For example, Dallas Cowboys’ tickets could have a different priority depending on seat type, or whether it’s a suite or a general admission ticket.

SeatGeek previously used a third-party waiting room solution, but it presented a number of shortcomings:

Lack of configuration and customization capabilities
A more manual process that limited the number of concurrent events that could be set up
Inability to capture custom insights and metrics (for example, how long was the customer waiting in the queue before they dropped?)

Resolving these issues is crucial to improve the customer experience and audience engagement. SeatGeek decided to build a custom solution on AWS, in order to create a more robust system and address these third-party issues.

Virtual Waiting Room overview

Our solution redirects overflow customers waiting to complete their purchase to a separate queue. Personalized content is presented to improve the waiting experience. Public services such as school or voting registration can use this solution for limited spots or time slot management.

Figure 1. User path through a Virtual Waiting Room

During a sale event, all customers begin their purchase journey in the Virtual Waiting Room (see Figure 1). When the sale starts, they will be moved from the Virtual Waiting Room to the ticket selection page. This is referred to as the Protected Zone. Here is where the customer will complete their purchase. The Protected Zone is a group of customized pages that guide the user through the purchasing process.

When the virtual waiting room is enabled, it can operate in three modes: Waiting Room mode, Queueing mode, or a combination of the two.

In Waiting Room mode, any request made to an event ticketing page before the designated start time of sale is routed to a separate screen. This displays the on-sale information and other marketing materials. At the desired time, users are then routed to the event page at a predefined throughput rate. Figure 2 shows a screenshot of the Waiting Room mode:

Figure 2. Waiting Room mode

In Queueing mode, the event can be configured to allow a preset number of concurrent users to access the Protected Zone. Those beyond that preconfigured number wait in a First-In-First-Out (FIFO) queue. Exempt users, such as the event coordinator, can bypass the queue for management and operational visibility.

Figure 3. Queueing mode flow

Figure 4. Queueing mode
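
The Queueing mode behavior described above can be sketched in a few lines of Python. This is not SeatGeek’s implementation. The class name, method names, and the choice not to count exempt users against the limit are assumptions made for illustration.

```python
from collections import deque


class QueueingGate:
    """Admit up to `capacity` concurrent users; the rest wait in FIFO order."""

    def __init__(self, capacity, exempt=()):
        self.capacity = capacity
        self.exempt = set(exempt)
        self.active = set()
        self.waiting = deque()

    def request_access(self, user):
        """Return True if the user may enter the Protected Zone now."""
        if user in self.exempt:
            return True  # assumption: exempt users bypass the queue, uncounted
        if user in self.active:
            return True
        if len(self.active) < self.capacity and not self.waiting:
            self.active.add(user)
            return True
        if user not in self.waiting:
            self.waiting.append(user)
        return False

    def leave(self, user):
        """Free a slot and admit the user at the front of the queue, if any."""
        self.active.discard(user)
        admitted = None
        if self.waiting and len(self.active) < self.capacity:
            admitted = self.waiting.popleft()
            self.active.add(admitted)
        return admitted


gate = QueueingGate(capacity=2, exempt={"coordinator"})
results = [gate.request_access(u) for u in ["ann", "bob", "cat", "coordinator"]]
next_user = gate.leave("ann")
print(results, next_user)  # [True, True, False, True] cat
```

In a distributed deployment, the in-memory set and deque would have to be replaced by shared storage so that every request sees the same state.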

In some cases, the two modes can work together sequentially. This can occur when the Waiting Room mode is used before a sale starts and the Queueing mode takes over to control flow.

Once the customers move to the front of the queue, they are next in line for the Protected Zone. A ticket selection page, shown in Figure 5, must be protected from an overflow of customers, which could result in overselling.

Figure 5. Ticket selection page

Virtual Waiting Room implementation

In the following diagram, you can see the AWS services and flow that SeatGeek implemented for the Virtual Waiting Room solution. When a SeatGeek customer requests a protected resource like a concert ticket, a gate keeper application scans to see if the resource has an active waiting room. It also confirms if the configuration rules are satisfied in order to grant the customer access. If the customer isn’t allowed access to the protected resource for whatever reason, then that customer is redirected to the Virtual Waiting Room.

Figure 6. Architecture overview

SeatGeek built this initial iteration of the gate keeper service on Fastly’s Compute@Edge service to leverage its existing content delivery network (CDN) investment. However, similar functionality could be built using Amazon CloudFront and AWS Lambda@Edge.

The Bouncer, handling the user flow into either the protected zone or the waiting room, consists of 3 components – Amazon API Gateway, AWS Lambda, and a Token Service. The token service is at the heart of the Waiting Room’s core logic. Before a concert event sale goes live at SeatGeek, the number of access tokens generated is equivalent to the number of available tickets. The order of assigning access tokens to customers in the waiting room can be based on FIFO or customer status (VIP customers first). Tokens are allocated when the customer is admitted to the waiting room and expire when tickets are purchased or when the customer exits.

For data storage, SeatGeek uses Amazon DynamoDB to monitor protected resources, tokens, and queues. The key tables are:

Protected Zone table: This table contains metadata about available protected zones
Counters table: Monitors the number of access tokens issued per minute for a specific protected zone
User Connection table: Every time a customer connects to the Amazon API Gateway, a record is created in this table recording their visitor token and connection ID using AWS Lambda
Queue table: This is the main table where the visitor token to access token mapping is saved

For analytics, two types of metrics are captured to ensure operational integrity:

System metrics: These are built into the AWS runtime infrastructure, and are stored in Amazon CloudWatch. These metrics provide telemetry of each component of the solution: Lambda latency, DynamoDB throttle (read and write), API Gateway connections, and more.
Business metrics: These are used to understand previous user behavior to improve infrastructure provisioning and user experiences. SeatGeek uses an AWS Lambda function to capture metrics from data in a DynamoDB stream. It then forwards it to Amazon Timestream for time-based analytics processing. Metrics captured include queue length, waiting time per queue, number of users in the protected zone, and more.

For historical needs, long-lived data can be streamed to tiered data storage options such as Amazon Simple Storage Service (S3). They can then be used later for other purposes, such as auditing and data analysis.

Considerations and enhancements for the Virtual Waiting Room

Tokens: We recommend using first-party cookies and token confirmations to track the number of sessions. Detect when the same token is used by more than one session at the same time to stop users from checking out multiple times and cutting in line.
DDoS protection: Token and first-party cookies usage must also comply with General Data Protection Regulation (GDPR) and California Consumer Privacy Act (CCPA) guidelines depending on the geographic region. This system is susceptible to DDoS attacks, XSS attacks, and others, like any web-based solution. But these threats can be mitigated by using AWS Shield, a DDoS protection service, and AWS WAF – Web Application Firewall. For more information on DDoS protection, read this security blog post.
Marketing: Opportunities to educate the customer about the venue or product(s) while they wait in the Virtual Waiting Room (for example, parking or food options).
Alerts: Customers can be alerted via SMS or voice when their turn is up by using Amazon Pinpoint as a marketing communication service.

Conclusion

We have shown how to set up a Virtual Waiting Room for your customers. This can be used to improve the customer experience while they wait to complete their registration or purchase through your website. The solution takes advantage of several AWS services like AWS Lambda, Amazon DynamoDB, and Amazon Timestream.

While this references a retail use case, the waiting room concept can be used whenever throttling access to a specific resource is required. It can be useful during an infrastructure or application outage. You can use it during a load spike, while more resources (EC2 instances) are being launched. To block access to an unreleased feature or product, temporarily place all users in the waiting room and let them in as needed per your own configuration.

Providing a friendly, streamlined, and responsive user experience, even during peak load times, is a valuable way to keep existing customers and gain new ones.

Be mindful that there are costs associated with running these services. To be cost-efficient, see the following pages for details: AWS Lambda, Amazon S3, Amazon DynamoDB, Amazon Timestream.

Introducing Amazon Route 53 Application Recovery Controller

2021-07-28 Sébastien Stormacq

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/amazon-route-53-application-recovery-controller/

I am pleased to announce the availability today of Amazon Route 53 Application Recovery Controller, a set of Amazon Route 53 capabilities that continuously monitors an application’s ability to recover from failures and controls application recovery across multiple AWS Availability Zones, AWS Regions, and on-premises environments. It helps you build applications that must deliver very high availability.

At AWS, the security and availability of your data and workloads are our top priorities. From the very beginning, AWS global infrastructure allowed you to build application architectures that are resilient to different types of failures. When your business or application requires high availability, you typically use AWS global infrastructure to deploy redundant application replicas across AWS Availability Zones inside an AWS Region. Then, you use a Network or Application Load Balancer to route traffic to the appropriate replica. This architecture handles the requirements of the vast majority of workloads.

However, some industries and workloads have higher requirements in terms of high availability: availability rate at or above 99.99% with recovery time objectives (RTO) measured in seconds or minutes. Think about how real-time payment processing or trading engines can affect entire economies if disrupted. To address these requirements, you typically deploy multiple replicas across a variety of AWS Availability Zones, AWS Regions, and on premises environments. Then, you use Amazon Route 53 to reliably route end users to the appropriate replica.

Amazon Route 53 Application Recovery Controller helps you build these applications requiring very high availability and low RTO. These are typically applications using active-active architectures, but other types of redundant architectures might also benefit from it. It is made of two parts: readiness checks and routing controls.

Readiness checks continuously monitor AWS resource configurations, capacity, and network routing policies, and allow you to monitor for any changes that would affect the ability to execute a recovery operation. These checks ensure that the recovery environment is scaled and configured to take over when needed. They check the configuration of Auto Scaling groups, Amazon Elastic Compute Cloud (Amazon EC2) instances, Amazon Elastic Block Store (EBS) volumes, load balancers, Amazon Relational Database Service (RDS) instances, Amazon DynamoDB tables, and several others. For example, a readiness check verifies AWS service limits to ensure enough capacity can be deployed in an AWS Region in case of failover. It also verifies that the capacity and scaling characteristics of application replicas are the same across AWS Regions.

Routing controls help to rebalance traffic across application replicas during failures, to ensure that the application stays available. Routing controls work with Amazon Route 53 health checks to redirect traffic to an application replica, using DNS resolution. Routing controls improve traditional automated Amazon Route 53 health check-based failovers in three ways:

First, routing controls give you a way to fail over the entire application stack based on application metrics or partial failures, such as a 5% increased error rate or a millisecond of increased latency.
Second, routing controls give you safe and simple manual overrides. You can use them to shift traffic for maintenance purposes or to recover from failures when your monitors fail to detect an issue.
Third, routing controls can use a capability called safety rules to prevent common side effects associated with fully automated health checks, such as failing over to an unprepared replica, or flapping issues.

To help you understand how Route 53 Application Recovery Controller works, I’ll walk you through the process I used to configure my own high availability application.

How It Works
For demo purposes, I built an application made up of a load balancer, an Auto Scaling group with two EC2 instances, and a global DynamoDB table. I wrote a CDK script to deploy the application in two AWS Regions: US East (N. Virginia) and US West (Oregon). The global DynamoDB table ensures data is replicated across the two AWS Regions. This is an active-standby architecture, as I described earlier.

The application is a multi-player TicTacToe game, an application that typically needs 99.99% availability or more :-). One DNS record (tictactoe.seb.go-aws.com) points to the load balancer in the US East (N. Virginia) region. The following diagram shows the architecture for this application:

Preparing My Application
To configure Route 53 Application Recovery Controller for my application, I first deployed independent replicas of my application stack so that I can fail over traffic across the stacks. These copies are deployed across AWS high-availability boundaries, such as Availability Zones or AWS Regions. I chose to deploy my application replicas across multiple AWS Regions.

Then, I configured data replication across these independent replicas. I’m using DynamoDB global tables to help replicate my data.

Lastly, I configured each independent stack to expose a DNS name. This DNS name is the entry point into my application, such as a regional load balancer DNS name.

Terminology
Before I configure a readiness check, let me share some basic terminology.

A cell defines the silo that contains my application’s independent units of failover. It groups all AWS resources that are required for my application to operate independently. For my demo, I have two cells: one per AWS Region where my application is deployed. A cell is typically aligned with AWS high-availability boundaries, such as AWS Regions or Availability Zones, but it can be smaller too. It is possible to have multiple cells in one Availability Zone. This is an effective way to reduce blast radius, especially when you follow one-cell-at-a-time change management practices.

A recovery group is a collection of cells that represent an application or group of applications that I want to check for failover readiness. A recovery group typically consists of two or more cells that mirror each other in terms of functionality.

A resource set is a set of AWS resources that can span multiple cells. For this demo, I have three resource sets: one for the two load balancers in us-east-1 and us-west-2, one for the two Auto Scaling groups in the two Regions, and one for the global DynamoDB table.

A readiness check validates the readiness of a set of AWS resources to be failed over to. In this example, I want to audit readiness for my load balancers, Auto Scaling groups, and DynamoDB table. I create a readiness check for the Auto Scaling groups. The service constantly monitors the instance types and counts in the groups to make sure that each group is scaled equally. I repeat the process for the load balancer and the global DynamoDB table.

To help determine recovery readiness for my application, Route 53 Application Recovery Controller continuously audits mismatches in capacity, AWS resource limits, and AWS throttle limits across application cells (Availability Zones or Regions). When Route 53 Application Recovery Controller detects a mismatch in limits, it raises an AWS Service Quota request for the resource across the cells. If Route 53 Application Recovery Controller detects a capacity mismatch in resources, I can take actions to align capacity across the cells. For example, I could trigger a scaling increase for my Auto Scaling groups.

Create a Readiness Check
To create a readiness check, I open the AWS Management Console and navigate to the Application Recovery Controller section under Route 53.

To create a recovery group for my application, I navigate to the Getting Started section, then I choose Create recovery group.

I enter a name (for example AWSNewsBlogDemo) and then choose Next.

In Configure Architecture, I choose Add Cell, then I enter a cell name (AWSNewsBlogDemo-RegionWEST) and then choose Add Cell again to add a second cell. I enter AWSNewsBlogDemo-RegionEAST for the second cell. I choose Next to review my inputs, then I choose Create recovery group.

I now need to associate resources such as my load balancers, Auto Scaling groups, and DynamoDB table with my recovery group.

In the left navigation pane, I choose Resource Set and then I choose Create.

I enter a name for my first resource set (for example, load_balancers). For Resource type, I choose Network Load Balancer or Application Load Balancer and I then choose Add to add the load balancer ARN.

I choose Add again to enter the second load balancer ARN, and then I choose Create resource set.

I repeat the process to create one resource set for the two Auto Scaling groups and a third resource set for the global DynamoDB table (one ARN). I now have three resource sets:

My last step is to create the readiness check. This associates the resources with the cells in the recovery group.

In Readiness check, I choose Create at the top right of the screen, then Readiness check.

Step 1 (Create readiness check), I enter a name (for example, load_balancers). For Resource Type, I choose Network Load Balancer or Application Load Balancer and then choose Next.

Step 2 (Add resource set), I keep the default selection Use an existing resource set and for Resource set name, I choose load_balancers and then I choose Next.

Step 3 (Apply readiness rules), I review the rules and then choose Next.

Step 4 (Recovery Group Options), I keep the default selection Associate with an existing recovery group. For Recovery group name, I choose AWSNewsBlogDemo. Then, I associate the two cells (EAST and WEST) with the two load balancer ARNs. Be sure to associate the correct load balancer with each cell. The Region name is included in the ARN.

Step 5 (Review and create), I review my choices and then choose Create readiness check.

I repeat this process for the Auto Scaling group and the DynamoDB global table.

When all readiness checks in the group are green, the group has a status of Ready.

Now, I can configure and test the routing controls.

Terminology
Before I configure routing controls, let me share some basic terminology.

A cluster is a set of five redundant Regional endpoints against which you can execute API calls to update or get the state of routing controls. You can host multiple control panels and routing controls on one cluster.

A routing control is a simple on/off switch, hosted on a cluster, that you use to control routing of client traffic in and out of cells. When you create a routing control, you add a health check in Route 53 so that you can reroute traffic when you update the routing control in Route 53 Application Recovery Controller. The health checks must be associated with DNS failover records that front each application replica if you want to use them to route traffic with routing controls.

A control panel groups together a set of related routing controls.

Configure Routing Controls
I can use the Route 53 console or API actions to create a routing control for each AWS Region for my application. After I create routing controls, I create an Amazon Route 53 Application Recovery Controller health check for each one, and then associate each health check with a DNS failover record for my load balancers in each Region. Then, to fail over traffic between Regions, I change the routing control state for one routing control to off and another routing control state to on.

The first step is to create a cluster. A cluster is charged at $2.5 / hour. If you create a cluster to experiment with Route 53 Application Recovery Controller, be sure to delete the cluster when you finish.

In the left navigation pane, I navigate to the cluster panel and then I choose Create.

I enter a name for my cluster and then choose Create cluster.

The cluster is in Pending state for a few minutes. After a while, its status changes to Deployed.

After it’s deployed, I select the cluster name to discover the five redundant API endpoints. You must specify one of those endpoints when you build recovery tools to retrieve or set routing control states. You can use any of the cluster endpoints, but in complex or automated scenarios, we recommend that your systems be prepared to retry with each of the available endpoints, using a different endpoint with each retry request.
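The retry recommendation above can be sketched in a few lines. This is a minimal illustration, not the service's own tooling: `update_with_retries` and the `send_update` callable are hypothetical names, and the callable is assumed to perform one state change against one endpoint URL and raise an exception on failure.

```python
import itertools
from typing import Callable, Sequence


def update_with_retries(
    endpoints: Sequence[str],
    send_update: Callable[[str], None],
    max_attempts: int = 5,
) -> str:
    """Try the cluster endpoints in turn until one accepts the update.

    `send_update` is a hypothetical callable that performs the state change
    against one endpoint URL and raises an exception on failure.
    Returns the endpoint that succeeded.
    """
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    last_error = None
    # Cycle through the endpoints so consecutive retries use different ones.
    for endpoint in itertools.islice(itertools.cycle(endpoints), max_attempts):
        try:
            send_update(endpoint)
            return endpoint
        except Exception as error:  # real code should catch narrower error types
            last_error = error
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error
```

In a recovery tool, `send_update` might wrap an SDK or CLI call that targets the given endpoint URL. Keeping the network call behind a callable also makes the retry logic easy to test without touching a real cluster.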

Traffic routing is managed through routing controls that are grouped in a control panel. You can create one or use the default one that is created for you.

I choose DefaultControlPanel.

I choose Add routing control.

I enter a name for my routing control (FailToWEST) and then choose Create routing control. I repeat the operation for the second routing control (FailToEAST).

After the routing control is created, I choose it from the list. On the detail page, I choose Create health check to create a health check in Route 53.

I enter a name for the health check and then choose Create. I navigate to the Route 53 console to verify the health check was correctly created.

I create one health check for each routing control.

You might have noticed that the Control Panel provides a place where you can add Safety Rules. When you work with several routing controls at the same time, you might want some safeguards in place when you enable and disable them. These help you to avoid initiating a failover when a replica is not ready, or unintended consequences like turning both routing controls off and stopping all traffic flow. To create these safeguards, you create safety rules. For more information about safety rules, including usage examples, see the Route 53 Application Recovery Controller developer guide.

Now that the routing controls and the DNS health checks are in place, the last step is to route traffic to my application.

Adjust My DNS Settings
To route traffic to my application, I assign a DNS alias to the top-level entry point of the application in the cell. For this example, using the Route 53 console, I create two ALIAS A records of type FAILOVER and associate one health check with each DNS record. The two records have the same record name. One is the primary record and the other is the secondary record. For more information about Amazon Route 53 health checks, see the Amazon Route 53 developer guide.

On the application recovery routing controls page, I enable one of the two routing controls.

As soon as I do, all the traffic pointed to tictactoe.seb.go-aws.com goes to the infrastructure deployed on us-east-1.

Testing My Setup
To test my setup, I first use the dig command in a terminal. It shows the DNS CNAME record that points to the load balancer deployed in us-east-1.

I also test the application with a web browser. I observe the name tictactoe.seb.go-aws.com goes to us-east-1.

Now, using the update-routing-control-state API action, the CLI, or the console, I turn off the routing control to the us-east-1 Region and turn on the one to the us-west-2 Region. When I use the CLI, I use the endpoints provided by my cluster.

aws route53-recovery-cluster update-routing-control-state \
     --routing-control-arn arn:aws:route53-recovery-control::012345678:controlpanel/xxx/routingcontrol/abcd \
     --routing-control-state On \
     --region us-west-2 \
     --endpoint-url https://host-xxx.us-west-2.cluster.routing-control.amazonaws.com/v1

In the console, I navigate to the control panel, I select the routing control I want to change and click Change routing control states.

After less than a minute, the DNS address is updated. My application traffic is now routed to the us-west-2 Region.

Readiness checks and routing controls provide a controlled failover for my application traffic, redirecting traffic from my active replica to my standby one, in another AWS Region. I can change the traffic routing manually, as I showed in the demo, or I can automate it using Amazon CloudWatch alarms based on technical and business metrics for my application.

Pricing
This new capability is charged on demand. There are no upfront costs. You are charged per readiness check and per cluster per hour. Readiness checks are charged $0.045 / hour. Clusters are charged $2.5 / hour. In the demo example used for this blog post, there are three readiness checks and one cluster. The price per hour for this setup, excluding the application itself, is 3 x $0.045 + 1 x $2.5 = $2.635 / hour. For more details about the pricing, including an example, see the Route 53 pricing page.
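The hourly figure above is easy to recompute for other setups. The following sketch uses only the two prices quoted in this post; the function name is illustrative, and `Decimal` is used to avoid floating-point rounding in the result.

```python
from decimal import Decimal

# Hourly prices quoted in this post (USD).
READINESS_CHECK_PER_HOUR = Decimal("0.045")
CLUSTER_PER_HOUR = Decimal("2.5")


def hourly_cost(readiness_checks: int, clusters: int) -> Decimal:
    """Return the hourly charge for the given number of checks and clusters."""
    return readiness_checks * READINESS_CHECK_PER_HOUR + clusters * CLUSTER_PER_HOUR


# The demo setup: three readiness checks and one cluster.
demo_cost = hourly_cost(readiness_checks=3, clusters=1)
print(demo_cost)  # 2.635
```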

This new capability is a global service that can be used to monitor and control application recovery for applications running in any of the public commercial AWS Regions. Give it a try and let us know what you think. As always, you can send feedback through your usual AWS Support contacts or post it on the AWS forum for Route 53 Application Recovery Controller.

— seb

Creating a single-table design with Amazon DynamoDB

2021-07-26 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/creating-a-single-table-design-with-amazon-dynamodb/

Amazon DynamoDB is a highly performant NoSQL database that provides data storage for many serverless applications. Unlike traditional SQL databases, it does not use table joins and other relational database constructs. However, you can model many common relational designs in a single DynamoDB table, although the process differs with a NoSQL approach.

This blog post uses the Alleycat racing application to explain the benefits of a single-table DynamoDB table. It also shows how to approach modeling data access requirements in a DynamoDB table. Alleycat is a home fitness system that allows users to compete in an intense series of 5-minute virtual bicycle races. Up to 1,000 racers at a time take the saddle and push the limits of cadence and resistance to set personal records and rank on virtual leaderboards.

Alleycat requirements

In the Alleycat example, the application offers a number of exercise classes. Each class has multiple races, and there are multiple racers in each race. The system logs the output for each racer per second of the race. An entity-relationship diagram in a traditional relational database shows how you could use normalized tables and relationships to store this data:

In a relational database, often each table has a key that relates to a foreign key in another table. By joining multiple tables, you can query related tables and return the results in a single table view. While this is flexible and convenient, it’s also computationally expensive and difficult to scale horizontally.

Many serverless architectures are built for scale and the relational database paradigm often does not scale as efficiently as a workload demands. DynamoDB scales to almost any level of traffic but one of the tradeoffs is the lack of joins. Fortunately, it offers alternative ways to model the data to meet Alleycat’s requirements.

DynamoDB terminology and concepts

Unlike traditional databases, there is no limit to how much data can be stored in a DynamoDB table. The service is also designed to provide predictable performance at any scale, so you can expect similar query latency regardless of the level of traffic.

The most important operational aspect of running DynamoDB in production is setting and managing throughput. There is a provisioned mode, where you set the throughput, and on-demand, which is managed by the service. In the provisioned mode, you can also use automatic scaling to let the service set the throughput between lower and upper limits you define.

The choice here is determined by the traffic patterns in your workload. For applications with predictable traffic with gradual changes, provisioned mode is the better choice and is more cost effective. If traffic patterns are unknown or you prefer to have capacity managed automatically, choose on-demand. To learn more about the capacity modes, visit the documentation page.

Within each table, you must have a partition key, which is a string, numeric, or binary value. This key is a hash value used to locate items in constant time regardless of table size. It is conceptually different to an ID or primary key field in a SQL-based database and does not relate to data in other tables. When there is only a partition key, these values must be unique across items in a table.

Each table can optionally have a sort key. This allows you to search and sort within items that match a given partition key. While you must search on exact single values in the partition key, you can pattern search on sort keys. It’s common to use a numeric sort key with timestamps to find items within a date range, or use string search operators to find data in hierarchical relationships.
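To make the difference between the two keys concrete, here is a toy in-memory model. It is only a sketch of the idea and does not use the DynamoDB API. The class name, key values, and items are all hypothetical. The partition key is matched exactly, while the sort key supports range and prefix matches.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class TinyTable:
    """Toy in-memory model of a table with a partition key and a sort key."""

    def __init__(self):
        # partition key -> list of (sort key, item), kept sorted by sort key
        self._partitions = defaultdict(list)

    def put(self, pk, sk, item):
        rows = self._partitions[pk]
        keys = [key for key, _ in rows]
        index = bisect_left(keys, sk)
        if index < len(rows) and rows[index][0] == sk:
            rows[index] = (sk, item)  # same composite key: overwrite the item
        else:
            rows.insert(index, (sk, item))

    def query_between(self, pk, low, high):
        """Exact match on the partition key, inclusive range on the sort key."""
        rows = self._partitions.get(pk, [])
        keys = [key for key, _ in rows]
        selected = rows[bisect_left(keys, low):bisect_right(keys, high)]
        return [item for _, item in selected]

    def query_begins_with(self, pk, prefix):
        """Exact match on the partition key, prefix match on the sort key."""
        rows = self._partitions.get(pk, [])
        return [item for key, item in rows if key.startswith(prefix)]


# Hypothetical data: ISO dates as string sort keys sort chronologically.
table = TinyTable()
table.put("racer-1", "2021-06-30", {"race": "a"})
table.put("racer-1", "2021-07-15", {"race": "b"})
table.put("racer-1", "2021-08-02", {"race": "c"})
table.put("racer-2", "2021-07-15", {"race": "b"})

july = table.query_between("racer-1", "2021-07-01", "2021-07-31")
```

Here `july` contains only the item for race "b", because the query never looks outside the "racer-1" partition and the sort key narrows the result to one month.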

Using only a partition key and a sort key limits the possible types of query without duplicating data in a table. To solve this issue, DynamoDB also offers two types of indexes:

Local secondary indexes (LSIs): these must be created at the same time the table is created and effectively enable another sort key using the same partition key.
Global secondary indexes (GSIs): create and delete these at any time, and optionally use a different partition key from the existing table.

There are other important differences between the two index types:

	LSI	GSI
Create	At table creation	Anytime
Delete	At table deletion	Anytime
Size	Up to 10 GB per partition	Unlimited
Throughput	Shared with table	Separate throughput
Key type	Composite key only (partition key and sort key)	Partition key only or composite key
Consistency model	Both eventual and strong consistency	Eventual consistency only

Determining data access requirements

Relational database design focuses on the normalization process without regard to data access patterns. However, designing NoSQL data schemas starts with the list of questions the application must answer. It’s important to develop a list of data access patterns before building the schema, since NoSQL databases offer less dynamic query flexibility than their SQL equivalents.

To determine data access patterns in new applications, user stories and use-cases can help identify the types of query. If you are migrating an existing application, use the query logs to identify the typical queries used. In the Alleycat example, the frontend application has the following queries:

Get the results for each race by racer ID.
Get a list of races by class ID.
Get the best performance by racer for a class ID.
Get the list of top scores by race ID.
Get the second-by-second performance by racer for all races.

While it’s possible to implement the design with multiple DynamoDB tables, it’s unnecessary and inefficient. A key goal in querying DynamoDB data is to retrieve all the required data in a single query request. This is one of the more difficult conceptual ideas when working with NoSQL databases but the single-table design can help simplify data management and maximize query throughput.

Modeling many-to-many relationships with DynamoDB

In traditional SQL, a many-to-many relationship is classically represented with three tables. In the earlier diagram for the Alleycat application, these tables are racers, raceResults, and races. Populated with sample data, the tables look like this:

In DynamoDB, the adjacency list design pattern enables you to combine multiple SQL-type tables into a single NoSQL table. It has multiple uses but in this case can model many-to-many relationships. To do this, the partition key contains both types of item – races and racers. The key value contains the type of data expected in the item (for example, “race-1” or “racer-2”):

With this table design, you can query by racer ID or by race ID. For a single race, you can query by partition key to return all results for a single race, or use the sort key to limit by a single racer or for the overall results. For per racer results, the second-by-second data is stored in a nested JSON structure.
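As a rough illustration of the adjacency list idea, the sketch below keeps race and racer items in one list and queries it by partition key. The item shapes, IDs, and output values are hypothetical and are not taken from the Alleycat table. The in-memory `query` function only stands in for a real query.

```python
from collections import defaultdict

# Hypothetical items: each result is written twice, once under the race
# and once under the racer, so either ID works as a partition key.
items = [
    {"PK": "race-1", "SK": "racer-1", "output": 201},
    {"PK": "race-1", "SK": "racer-2", "output": 187},
    {"PK": "race-2", "SK": "racer-1", "output": 215},
    {"PK": "racer-1", "SK": "race-1", "output": 201},
    {"PK": "racer-1", "SK": "race-2", "output": 215},
    {"PK": "racer-2", "SK": "race-1", "output": 187},
]

# Group items by partition key, as a stand-in for the table's partitions.
by_partition = defaultdict(list)
for item in items:
    by_partition[item["PK"]].append(item)


def query(pk, sk=None):
    """Return items for one partition key, optionally narrowed by sort key."""
    return [
        item for item in by_partition.get(pk, [])
        if sk is None or item["SK"] == sk
    ]


racers_in_race_1 = [item["SK"] for item in query("race-1")]
races_for_racer_1 = [item["SK"] for item in query("racer-1")]
```

The same `query` call answers both "who raced in race-1?" and "which races did racer-1 enter?", which is the many-to-many relationship that would need a join table in SQL.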

To allow sorting by output to create leaderboard results, the output value must be a sort key. However, the sort key cannot be updated once it is set. Using the main sort key, the application would only be able to write a final race result per racer to query and sort on this data.

To resolve this problem, use an index. The index can use a separate sort key where the value can be updated. This allows Alleycat to store the latest results in this field, and then for queries to sort by output to create a leaderboard.

The preceding table does not represent the races table in the normalized view, so you cannot query by class ID to retrieve a list of races. Depending on your design, you can solve this by adding a second index to the table to enable querying by class ID and returning a list of partition keys (race IDs). However, you can also overload GSIs to contain multiple types of value.

The Alleycat application uses both an LSI and a GSI to accommodate all the data access patterns. This table shows how this is modeled, although the results attribute names are shorter in the application:

Main composite key: PK and SK.
Local secondary index: Partition key is PK and sort key is Numeric.
Global secondary index: Partition key is SK and sort key is Numeric.

Reviewing the data access patterns for Alleycat

Before creating the DynamoDB table, test the proposed schema against the list of data access patterns. In this section, I review Alleycat’s list of queries to ensure that each is supported by the table schema. I use the Item explorer feature to run queries against a test table, after running the Alleycat simulator for multiple races.

1. Get the results for each race by racer ID

Use the table’s partition key, searching for PK = racer ID. This returns a list of all races (PK) for a given racer. See the updateRaceResults function for an example of how this is used:

2. Get a list of races by class ID

Use the local secondary index, searching for partition key = class ID. This results in a list of races (PK) for a given class ID. See the getRaces function code for an example of this query:

3. Get the best performance by racer for a class ID.

Use the table’s partition key, searching for PK = class ID. This returns a list of racers and their best outputs for the given class ID. See the getLeaderboard function code for an example of this query:

4. Get the list of top scores by race ID.

Use the global secondary index, searching for PK = race ID, sorting by the GSI sort key (descending) to rank the results. This returns a sorted list of results for a race. See the updateRaceResults function for an example of how this is used:

5. Get the second-by-second performance by racer for all races.

Use the main table index, searching for PK = racer ID. Optionally use the sort key to restrict to a single race. This returns items with second-by-second performance stored in a nested JSON attribute. See the loadRealtimeHistory function for an example of how this is used:

Optimizing items and capacity

In the Alleycat application, races are only 5 minutes long, so the results attribute contains only 300 separate data points (one per second). By using a nested JSON structure in the items, the schema flattens data that otherwise would use 300 rows in the earlier SQL-based design.

The maximum item size in DynamoDB is 400 KB, which includes attribute names. If you have many more data points, you may reach this limit. To work around this, split the data across multiple items and provide the item order in the sort key. This way, when your application retrieves the items, it can reassemble the attributes to create the original dataset.

For example, if races in Alleycat were an hour long, there would be 3,600 data points. These may be stored in six rows containing 600 second-by-second results each:
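The splitting approach can be sketched as follows. The function names and the "#part-NNN" sort key suffix are assumptions made for this illustration and are not the Alleycat schema. The zero padding ensures that string ordering of the sort keys matches the numeric order of the parts.

```python
def split_results(race_id, racer_id, results, chunk_size=600):
    """Split a long list of per-second results into ordered items."""
    items = []
    for part, start in enumerate(range(0, len(results), chunk_size)):
        items.append({
            "PK": race_id,
            # Zero-padded part number keeps string sort order equal to numeric order.
            "SK": f"{racer_id}#part-{part:03d}",
            "results": results[start:start + chunk_size],
        })
    return items


def reassemble(items):
    """Rebuild the original list by reading the items in sort-key order."""
    combined = []
    for item in sorted(items, key=lambda item: item["SK"]):
        combined.extend(item["results"])
    return combined


# Hypothetical one-hour race: 3,600 data points become six items of 600.
hour_of_results = list(range(3600))
parts = split_results("race-1", "racer-1", hour_of_results)
```

Because the part number lives in the sort key, one query on the partition key with a prefix match on the racer ID returns the parts already in order.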

Additionally, to maximize the storage per row, choose short attribute names. You can also compress data in attributes by storing as GZIP output instead of raw JSON, and using a binary data type for the attribute. This increases processing for the producing and consuming applications, which must compress and decompress the items. However, it can significantly increase the amount of data stored per row.
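A minimal sketch of the compression step, using only the Python standard library. The sample data and the short attribute names ("s" for second, "o" for output) are hypothetical. The resulting bytes would be stored in a binary attribute.

```python
import gzip
import json


def pack(results):
    """Serialize to compact JSON, then GZIP the bytes for a binary attribute."""
    raw = json.dumps(results, separators=(",", ":")).encode("utf-8")
    return gzip.compress(raw)


def unpack(blob):
    """Reverse of pack(): decompress, then parse the JSON."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


# Hypothetical second-by-second data for a one-hour race.
results = [{"s": second, "o": 180 + second % 40} for second in range(3600)]
blob = pack(results)
raw_size = len(json.dumps(results, separators=(",", ":")).encode("utf-8"))
print(raw_size, len(blob))  # raw JSON size vs. compressed size in bytes
```

The compression ratio depends heavily on how repetitive the data is, so it is worth measuring with real results before relying on it to stay under the item size limit.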

To learn more, read Best practices for storing large items and attributes.

Conclusion

This post looks at implementing common relational database patterns using DynamoDB. Instead of using multiple tables, the single-table design pattern can use adjacency lists to provide many-to-many relational functionality.

Using the Alleycat example, I show how to list the data access patterns required by an application, and then model the data using composite keys and indexes to return the relevant data using single queries. Finally, I show how to optimize items and capacity for workloads storing large amounts of data.

For more serverless learning resources, visit Serverless Land.

Building well-architected serverless applications: Regulating inbound request rates – part 1

2021-07-22 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-regulating-inbound-request-rates-part-1/

Reliability question REL1: How do you regulate inbound request rates?

Defining, analyzing, and enforcing inbound request rates helps you achieve better throughput. Regulation helps you adapt different scaling mechanisms based on customer demand and keep client request submissions at a rate that your workload can support.

Required practice: Control inbound request rates using throttling

Throttle inbound request rates using steady-rate and burst rate requests

Throttling requests limits the number of requests a client can make during a certain period of time. Throttling allows you to control your API traffic. This helps your backend services maintain their performance and availability levels by limiting the number of requests to actual system throughput.

To prevent your API from being overwhelmed by too many requests, Amazon API Gateway throttles requests to your API. These limits are applied across all clients using the token bucket algorithm. API Gateway sets a limit on a steady-state rate and a burst of request submissions. The algorithm is based on an analogy of filling and emptying a bucket of tokens representing the number of available requests that can be processed.

Each API request removes a token from the bucket. The throttle rate then determines how many requests are allowed per second. The throttle burst determines how many concurrent requests are allowed. I explain the token bucket algorithm in more detail in “Building well-architected serverless applications: Controlling serverless API access – part 2”.
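For readers who prefer code to analogy, here is a generic, textbook-style token bucket. It is a sketch of the general algorithm only and makes no claim about how API Gateway implements throttling internally. The class and parameter names are illustrative. Time is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Toy token bucket: `rate` tokens are added per second, up to `burst`."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.updated = now

    def allow(self, now):
        """Return True and consume a token if a request at time `now` is allowed."""
        elapsed = now - self.updated
        # Refill for the elapsed time, but never beyond the burst capacity.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Steady-state rate of 2 requests per second with a burst of 5.
bucket = TokenBucket(rate=2, burst=5)
burst_results = [bucket.allow(now=0.0) for _ in range(6)]
```

With these settings, the first five requests at the same instant are allowed and the sixth is rejected. Half a second later, one more token has been added, so one more request is allowed.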

Token bucket algorithm

API Gateway limits the steady-state rate and burst requests per second. These are shared across all APIs per Region in an account. For further information on account-level throttling per Region, see the documentation. You can request account-level rate limit increases using the AWS Support Center. For more information, see Amazon API Gateway quotas and important notes.

You can configure your own throttling levels, within the account and Region limits to improve overall performance across all APIs in your account. This restricts the overall request submissions so that they don’t exceed the account-level throttling limits.

You can also configure per-client throttling limits. Usage plans restrict client request submissions to within specified request rates and quotas. These are applied to clients using API keys that are associated with your usage plan as a client identifier. You can add throttling levels per API route, stage, or method that are applied in a specific order.

For more information on API Gateway throttling, see the AWS re:Invent presentation “I didn’t know Amazon API Gateway could do that”.

API Gateway throttling

You can also throttle requests by introducing a buffering layer using Amazon Kinesis Data Streams or Amazon SQS. Kinesis can limit the number of requests at the shard level while SQS can limit at the consumer level. For more information on using SQS as a buffer with Amazon Simple Notification Service (SNS), read “How To: Use SNS and SQS to Distribute and Throttle Events”.

Identify steady-rate and burst rate requests that your workload can sustain at any point in time before performance degrades

Load testing your serverless application allows you to monitor the performance of an application before it is deployed to production. Serverless applications can be simpler to load test, thanks to the automatic scaling built into many of the services. During a load test, you can identify quotas that may act as a limiting factor for the traffic you expect and take action.

Perform load testing for a sustained period of time. Gradually increase the traffic to your API to determine your steady-state rate of requests. Also use a burst strategy with no ramp up to determine the burst rates that your workload can serve without errors or performance degradation. There are a number of AWS Marketplace and AWS Partner Network (APN) solutions available for performance testing, including Gatling Frontline, BlazeMeter, and Apica.

In the serverless airline example used in this series, you can run a performance test suite using Gatling, an open source tool.

To deploy the test suite, follow the instructions in the GitHub repository perf-tests directory. Uncomment the deploy.perftest line in the repository Makefile.

Perf-test makefile

Once the file is pushed to GitHub, AWS Amplify Console rebuilds the application, and deploys an AWS CloudFormation stack. You can run the load tests locally, or use an AWS Step Functions state machine to run the setup and Gatling load test simulation.

Performance test using Step Functions

The Gatling simulation script uses constantUsersPerSec and rampUsersPerSec to add users for a number of test scenarios. You can use the test to simulate load on the application. Once the tests run, Gatling generates a downloadable report.

Gatling performance results

Artillery Community Edition is another open-source tool for testing serverless APIs. You configure the number of requests per second and overall test duration, and it uses a headless Chromium browser to run its test flows. For Artillery, the maximum number of concurrent tests is constrained by your local computing resources and network. To achieve higher throughput, you can use Serverless Artillery, which runs the Artillery package on Lambda functions. As a result, this tool can scale up to a significantly higher number of tests.

For more information on how to use Artillery, see “Load testing a web application’s serverless backend”. This runs tests against APIs in a demo application. For example, one of the tests fetches 50,000 questions per hour. This calls an API Gateway endpoint and tests whether the AWS Lambda function, which queries an Amazon DynamoDB table, can handle the load.

Artillery performance test

This is a synchronous API so the performance directly impacts the user’s experience of the application. This test shows that the median response time is 165 ms with a p95 time of 201 ms.

Performance test API results

Another consideration for API load testing is whether the authentication and authorization service can handle the load. For more information on load testing Amazon Cognito and API Gateway using Step Functions, see “Using serverless to load test Amazon API Gateway with authorization”.

API load testing with authentication and authorization

Conclusion

Regulating inbound requests helps you adapt different scaling mechanisms based on customer demand. You can achieve better throughput for your workloads and make them more reliable by controlling requests to a rate that your workload can support.

In this post, I cover controlling inbound request rates using throttling. I show how to use throttling to control steady-rate and burst rate requests. I show some solutions for performance testing to identify the request rates that your workload can sustain before performance degradation.

This well-architected question continues in part 2, where I look at using, analyzing, and enforcing API quotas. I also cover mechanisms to protect non-scalable resources.

For more serverless learning resources, visit Serverless Land.

Data Caching Across Microservices in a Serverless Architecture

2021-07-21 Irfan Saleem

Post Syndicated from Irfan Saleem original https://aws.amazon.com/blogs/architecture/data-caching-across-microservices-in-a-serverless-architecture/

Organizations are re-architecting their traditional monolithic applications to incorporate microservices. This helps them gain agility and scalability and accelerate time-to-market for new features.

Each microservice performs a single function. However, a microservice might need to retrieve and process data from multiple disparate sources. These can include data stores, legacy systems, or other shared services deployed on premises in data centers or in the cloud. These scenarios add latency to the microservice response time because multiple real-time calls are required to the backend systems. The latency often ranges from milliseconds to a few seconds depending on the size of the data, network bandwidth, and processing logic. In certain scenarios, it makes sense to maintain a cache close to the microservices layer to improve performance by reducing or eliminating the need for the real-time backend calls.

Caches reduce latency and service-to-service communication of microservice architectures. A cache is a high-speed data storage layer that stores a subset of data. When data is requested from a cache, it is delivered faster than if you accessed the data’s primary storage location.

While working with our customers, we have observed use cases where data caching helps reduce latency in the microservices layer. Caching can be implemented in several ways. In this blog post, we discuss a couple of these use cases that customers have built. In both use cases, the microservices layer is created using Serverless on AWS offerings. It requires data from multiple data sources deployed locally in the cloud or on premises. The compute layer is built using AWS Lambda. Though Lambda functions are short-lived, the cached data can be used by subsequent instances of the same microservice to avoid backend calls.

Use case 1: On-demand cache to reduce real-time calls

In this use case, the Cache-Aside design pattern is used for lazy loading of frequently accessed data. This means that an object is only cached when it is requested by a consumer, and the respective microservice decides if the object is worth saving.

This use case is typically useful when the microservices layer makes multiple real-time calls to fetch and process data. These calls can be greatly reduced by caching frequently accessed data for a short period of time.

Let’s discuss a real-world scenario. Figure 1 shows a customer portal that provides a list of car loans, their status, and the net outstanding amount for a customer:

The Billing microservice gets a request. It then tries to get required objects (for example, the list of car loans, their status, and the net outstanding balance) from the cache using an object_key. If the information is available in the cache, a response is sent back to the requester using cached data.
If requested objects are not available in the cache (a cache miss), the Billing microservice makes multiple calls to local services, applications, and data sources to retrieve data. The result is compiled and sent back to the requester. It also resides in the cache for a short period of time.
Meanwhile, if a customer makes a payment using the Payment microservice, the balance amount in the cache must be invalidated/deleted. The Payment microservice processes the payment and invokes an asynchronous event (payment_processed) with the respective object key for the downstream processes that will remove respective objects from the cache.
The events are stored in the event store.
The CacheManager microservice gets the event (payment_processed) and makes a delete request to the cache for the respective object_key. If necessary, the CacheManager can also refresh cached data. It can call a resource within the Billing service or it can refresh data directly from the source system depending on the data refresh logic.

Figure 1. Reducing latency by caching frequently accessed data on demand
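
The flow in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration of the Cache-Aside pattern, not the customer's implementation: the TTLCache class is an in-memory stand-in for a real cache such as Amazon ElastiCache, and the function names, the load_from_backends callback, and the 60-second TTL are assumptions made for the example.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory stand-in for a shared cache (hypothetical helper)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._items: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # An expired entry behaves like a cache miss.
            del self._items[key]
            return None
        return value

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._items[key] = (self._clock() + ttl_seconds, value)

    def delete(self, key: str) -> None:
        self._items.pop(key, None)


def get_billing_summary(
    cache: TTLCache,
    object_key: str,
    load_from_backends: Callable[[str], Any],
    ttl_seconds: float = 60.0,
) -> Any:
    """Cache-aside read: try the cache first, call the backends only on a miss."""
    cached = cache.get(object_key)
    if cached is not None:
        return cached
    summary = load_from_backends(object_key)
    cache.set(object_key, summary, ttl_seconds)
    return summary


def handle_payment_processed(cache: TTLCache, event: Dict[str, str]) -> None:
    """CacheManager step: delete the stale object named in the event."""
    cache.delete(event["object_key"])
```

The next read after a payment_processed event misses the cache and reloads from the backends, so the requester sees the updated balance.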

Figure 2 shows AWS services for use case 1. The microservices layer (Billing, Payments, and Profile) is created using Lambda. Amazon API Gateway exposes the Lambda functions as API operations to internal or external consumers.

Figure 2. Suggested AWS services for implementing use case 1

All three microservices are connected with the data cache and can save and retrieve objects from the cache. The cache is maintained in-memory using Amazon ElastiCache. The data objects are kept in cache for a short period of time. Every object has an associated TTL (time to live) value assigned to it. After that time period, the object expires. The custom events (such as payment_processed) are published to Amazon EventBridge for downstream processing.

Use case 2: Proactive caching of massive volumes of data

During large modernization and migration initiatives, not all data sources are colocated for a certain period of time. Some legacy systems, such as mainframes, require a longer decommissioning period. Many legacy backend systems process data through periodic batch jobs. In such scenarios, front-end applications can use cached data for a certain period of time (ranging from a few minutes to a few hours) depending on the nature of the data and its usage. The backend systems cannot handle the volume of real-time calls that the front-end application generates.

In such scenarios, required data/objects can be identified up front and loaded directly into the cache through an automated process as shown in Figure 3:

An automated process loads data/objects in the cache during the initial load. Subsequent changes to the data sources (either in a mainframe database or another system of record) are captured and applied to the cache through an automated CDC (change data capture) pipeline.
Unlike use case 1, the microservices layer does not make real-time calls to load data into the cache. In this use case, microservices use data already cached for their processing.
However, the microservices layer may create an event if data in the cache is stale or specific objects have been changed by another service (for example, by the Payment service when a payment is made).
The events are stored in Event Manager. Upon receiving an event, the CacheManager initiates a backend process to refresh stale data on demand.
All data changes are sent directly to the system of record.

Figure 3. Eliminating real-time calls by caching massive data volumes proactively
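
As a rough sketch of the first step in Figure 3, the following Python shows an initial load followed by change data capture (CDC) events applied to a cache. The record shape (key, value), the change shape (op, key, value), and the use of a plain dictionary as the cache are assumptions for illustration; a real CDC pipeline defines its own event format.

```python
from typing import Any, Dict, Iterable


def initial_load(cache: Dict[str, Any], records: Iterable[Dict[str, Any]]) -> None:
    """Populate the cache up front from the system of record."""
    for record in records:
        cache[record["key"]] = record["value"]


def apply_change(cache: Dict[str, Any], change: Dict[str, Any]) -> None:
    """Apply one captured change so the cache follows the system of record."""
    op = change["op"]
    if op in ("insert", "update"):
        cache[change["key"]] = change["value"]
    elif op == "delete":
        cache.pop(change["key"], None)
    else:
        raise ValueError(f"unknown CDC operation: {op!r}")
```

Because the microservices only read from the cache, the freshness of their responses depends on how quickly these changes are applied.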

As shown in Figure 4, the data objects are maintained in Amazon DynamoDB, which provides low-latency data access at any scale. The data retrieval is managed through DynamoDB Accelerator (DAX), a fully managed, highly available, in-memory cache. It delivers up to a 10 times performance improvement, even at millions of requests per second.

Figure 4. Suggested AWS services for implementing use case 2

The data in DynamoDB can be loaded through different methods depending on the customer use case and technology landscape. API Gateway, Lambda, and EventBridge provide functionality similar to that described in use case 1.

Use case 2 is also beneficial in scenarios where front-end applications must cache data for an extended period of time, such as a customer’s shopping cart.

In addition to caching, the following best practices can also be used to reduce latency and to improve performance within the Lambda compute layer:

Best practices for working with AWS Lambda functions shows you how to use execution environment reuse to improve the performance of your Lambda functions
The Introducing AWS Lambda Extensions blog post shows you how to implement configuration and data cache using AWS Lambda Extensions
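
To illustrate execution environment reuse, here is a minimal Python sketch of a Lambda-style handler that keeps data in module-level state. The fetch_config_from_backend function, the configuration keys, and the 300-second TTL are hypothetical placeholders, not part of any AWS API.

```python
import time
from typing import Any, Dict, Optional

# Module-level state is created once per execution environment and is
# available to later invocations that run in the same warm environment.
_CONFIG_CACHE: Dict[str, Any] = {}
_CONFIG_TTL_SECONDS = 300.0  # assumed refresh interval


def fetch_config_from_backend() -> Dict[str, Any]:
    """Hypothetical slow call; replace with a real backend lookup."""
    return {"feature_enabled": True, "loaded_at": time.time()}


def get_config(now: Optional[float] = None) -> Dict[str, Any]:
    """Return cached configuration, reloading it only after the TTL expires."""
    now = time.monotonic() if now is None else now
    expires_at = _CONFIG_CACHE.get("expires_at", 0.0)
    if "value" not in _CONFIG_CACHE or now >= expires_at:
        _CONFIG_CACHE["value"] = fetch_config_from_backend()
        _CONFIG_CACHE["expires_at"] = now + _CONFIG_TTL_SECONDS
    return _CONFIG_CACHE["value"]


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Lambda-style entry point that avoids a backend call on warm invocations."""
    config = get_config()
    return {"feature_enabled": config["feature_enabled"]}
```

A cold start still pays for one backend call, so this approach helps most when the same environment serves many invocations.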

Conclusion

The microservices architecture allows you to build several caching layers depending on your use case. In this blog, we discussed data caching within the compute layer to reduce latency when data is retrieved from disparate sources. The information from use case 1 can help you reduce real-time calls to your back-end system by saving frequently used data to the cache. Use case 2 helps you maintain large volumes of data in caches for extended periods of time when real-time calls to the backend system are not possible.

Using Amazon Macie to Validate S3 Bucket Data Classification

2021-07-16 Bill Magee

Post Syndicated from Bill Magee original https://aws.amazon.com/blogs/architecture/using-amazon-macie-to-validate-s3-bucket-data-classification/

Securing sensitive information is a high priority for organizations for many reasons. At the same time, organizations are looking for ways to empower development teams to stay agile and innovative. Centralized security teams strive to create systems that align to the needs of the development teams, rather than mandating how those teams must operate.

Security teams who create automation for the discovery of sensitive data have some issues to consider. If development teams are able to self-provision data storage, how does the security team protect that data? If teams have a business need to store sensitive data, they must consider how, where, and with what safeguards that data is stored.

Let’s look at how we can set up Amazon Macie to validate data classifications provided by decentralized software development teams. Macie is a fully managed service that uses machine learning (ML) to discover sensitive data in AWS. If you are not familiar with Macie, read New – Enhanced Amazon Macie Now Available with Substantially Reduced Pricing.

Data classification is part of the security pillar of a Well-Architected application. Following the guidelines provided in the AWS Well-Architected Framework, we can develop a resource-tagging scheme that fits our needs.

Overview of decentralized data validation system

In our example, we have multiple levels of data classification that represent different levels of risk associated with each classification. When a software development team creates a new Amazon Simple Storage Service (S3) bucket, they are responsible for labeling that bucket with a tag. This tag represents the classification of data stored in that bucket. The security team must maintain a system to validate that the data in those buckets meets the classification specified by the development teams.

This separation of roles and responsibilities for development and security teams who work independently requires a validation system that’s decoupled from S3 bucket creation. It should automatically detect new buckets or data in the existing buckets, and validate the data against the assigned classification tags. It should also notify the appropriate development teams of misclassified or unclassified buckets in a timely manner. These notifications can be through standard notification channels, such as email or Slack channel notifications.

Validation and alerts with AWS services

Figure 1. Validation system for data classification

We assume that teams are permitted to create S3 buckets and we will use AWS Config to enforce the following required tags: DataClassification and SupportSNSTopic. The DataClassification tag indicates what type of data is allowed in the bucket. The SupportSNSTopic tag indicates an Amazon Simple Notification Service (SNS) topic. If there are issues found with the data in the bucket, a message is published to the topic, and Amazon SNS will deliver an alert. For example, if there is personally identifiable information (PII) data in a bucket that is classified as non-sensitive, the system will alert the owners of the bucket.

Macie is configured to scan all S3 buckets on a scheduled basis. This configuration ensures that any new bucket and data placed in the buckets is analyzed the next time the Macie job runs.

Macie provides several managed data identifiers for discovering and classifying the data. These include bank account numbers, credit card information, authentication credentials, PII, and more. You can also create custom identifiers (or rules) to gather information not covered by the managed identifiers.

Macie integrates with Amazon EventBridge to allow us to capture data classification events and route them to one or more destinations for reporting and alerting needs. In our configuration, the event invokes an AWS Lambda function. The Lambda function validates the data classification inferred by Macie against the classification specified in the DataClassification tag using custom business logic. If a data classification violation is found, the Lambda function sends a message to the Amazon SNS topic specified in the SupportSNSTopic tag.
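
The custom business logic is specific to each organization, but its core comparison could look like the following Python sketch. The classification labels, the ALLOWED_CATEGORIES policy, and the find_violation function are all hypothetical; the category strings are illustrative values, and the code works on already-extracted tags and categories rather than on a raw Macie finding.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical policy: the sensitive-data categories that each
# DataClassification label tolerates. Labels and mapping are examples only.
ALLOWED_CATEGORIES: Dict[str, Set[str]] = {
    "non-sensitive": set(),
    "confidential": {"PERSONAL_INFORMATION"},
    "restricted": {"PERSONAL_INFORMATION", "FINANCIAL_INFORMATION", "CREDENTIALS"},
}


def find_violation(
    tags: Dict[str, str], detected_categories: Iterable[str]
) -> Optional[Dict[str, object]]:
    """Return alert details if the findings conflict with the bucket tags."""
    classification = tags.get("DataClassification")
    topic = tags.get("SupportSNSTopic")
    detected = set(detected_categories)
    if classification not in ALLOWED_CATEGORIES:
        # A missing or unknown label is reported as an unclassified bucket.
        return {"topic": topic, "reason": "unclassified", "categories": sorted(detected)}
    unexpected = detected - ALLOWED_CATEGORIES[classification]
    if not unexpected:
        return None
    return {"topic": topic, "reason": "misclassified", "categories": sorted(unexpected)}
```

When find_violation returns a result, the function would publish a message to the topic it names; when it returns None, no alert is needed.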

The Lambda function also creates custom metrics and sends those to Amazon CloudWatch. The metrics are organized by engineering team and severity. This allows the security team to create a dashboard of metrics based on the Macie findings. The findings can also be filtered per engineering team and severity to determine which teams need to be contacted to ensure remediation.

Conclusion

This solution provides a centralized security team with the tools it needs. The team can validate the data classification of an Amazon S3 bucket that is self-provisioned by a development team. New Amazon S3 buckets are automatically included in the Macie jobs and alerting. Alerts are sent only if the data in the bucket does not conform to the classification specified by the development team. The data auditing process is loosely coupled with the Amazon S3 bucket creation process, enabling self-service capabilities for development teams, while ensuring proper data classification. Your teams can stay agile and innovative, while maintaining a strong security posture.

Learn more about Amazon Macie and Data Classification.

Architecting a Highly Available Serverless, Microservices-Based Ecommerce Site

2021-07-15 Senthil Kumar

Post Syndicated from Senthil Kumar original https://aws.amazon.com/blogs/architecture/architecting-a-highly-available-serverless-microservices-based-ecommerce-site/

The number of ecommerce vendors is growing globally, and they often handle large traffic volumes at different times of the day and on different days of the year. This, in addition to building, managing, and maintaining IT infrastructure in on-premises data centers, can present challenges to the scalability and growth of ecommerce businesses.

This blog post provides a Serverless on AWS solution that offloads the undifferentiated heavy lifting of managing resources and ensures that your business’s architecture can handle peak traffic.

Common architecture setup versus serverless solution

The following sections describe a common monolithic architecture and our suggested alternative approach: setting up microservices-based order submission and product search modules. These modules are independently deployable and scalable.

Typical monolithic architecture

Figure 1 shows how a typical on-premises ecommerce infrastructure with different tiers is set up:

Web servers serve static assets and proxy requests to application servers
Application servers process ecommerce business logic and authentication logic
Databases store user and other dynamic data
Firewall and load balancers provide network components for load balancing and network security

Figure 1. Monolithic on-premises ecommerce infrastructure with different tiers

Monolithic architecture tightly couples different layers of the application. This prevents them from being independently deployed and scaled.

Microservices-based modules

Order submission workflow module

This three-layer architecture can be set up in the AWS Cloud using serverless components:

Static content layer (Amazon CloudFront and Amazon Simple Storage Service (Amazon S3)). This layer stores static assets on Amazon S3. By using CloudFront in front of the S3 storage cache, you can deliver assets to customers globally with low latency and high transfer speeds.
Authentication layer (Amazon Cognito or customer proprietary layer). Ecommerce sites deliver authenticated and unauthenticated content to the user. With Amazon Cognito, you can manage users’ sign-up, sign-in, and access controls, so this authentication layer ensures that only authenticated users have access to secure data.
Dynamic content layer (AWS Lambda and Amazon DynamoDB). All business logic required for the ecommerce site is handled by the dynamic content layer. Using Lambda and DynamoDB ensures that these components are scalable and can handle peak traffic.

As shown in Figure 2, the order submission workflow is split into two sections: synchronous and asynchronous.

By splitting the order submission workflow, you allow users to submit their order details and get an orderId. This makes sure that they don’t have to wait for backend processing to complete. This helps unburden your architecture during peak shopping periods when the backend process can get busy.

Figure 2. Microservices-based order submission workflow

The details of the order, such as credit card information in encrypted form, shipping information, etc., are stored in DynamoDB. This action invokes an asynchronous workflow managed by AWS Step Functions.
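
The split between the synchronous and asynchronous parts can be illustrated with a small Python sketch. The ORDERS dictionary and PENDING queue are in-memory stand-ins for the DynamoDB table and the trigger that starts the asynchronous workflow; the function names and status values are assumptions for the example.

```python
import uuid
from collections import deque
from typing import Any, Callable, Deque, Dict, Optional

ORDERS: Dict[str, Dict[str, Any]] = {}  # stand-in for the DynamoDB table
PENDING: Deque[str] = deque()           # stand-in for the asynchronous trigger


def submit_order(details: Dict[str, Any]) -> Dict[str, str]:
    """Synchronous part: store the order and return an orderId immediately."""
    order_id = str(uuid.uuid4())
    ORDERS[order_id] = {"details": details, "status": "RECEIVED"}
    PENDING.append(order_id)
    return {"orderId": order_id, "status": "RECEIVED"}


def process_next_order(charge: Callable[[Dict[str, Any]], bool]) -> Optional[str]:
    """Asynchronous part: runs later, without the user waiting on it."""
    if not PENDING:
        return None
    order_id = PENDING.popleft()
    paid = charge(ORDERS[order_id]["details"])
    ORDERS[order_id]["status"] = "CONFIRMED" if paid else "CANCELED"
    return order_id
```

The caller receives the orderId as soon as the order is stored, and the order status changes only when the background step completes.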

Figure 3 shows a sample Step Functions workflow from the asynchronous process. In this scenario, you are using external payment processing and shipping systems. When both systems get busy, Step Functions can manage long-running transactions and the required retry logic. The workflow is decision-based, so if a payment transaction fails, the order can be canceled; once payment is successful, the order can proceed.

Amazon Simple Notification Service (Amazon SNS) notifies users whenever their order status changes. You can even extend the Step Functions workflow to react to the shipping status.

Figure 3. Sample AWS Step Functions asynchronous workflow that uses external payment processing service and shipping system
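
Step Functions expresses retries and branching declaratively in the workflow definition, so no application code is needed for them. To make the behavior concrete, the following Python sketch imitates a retry with exponential backoff followed by a payment decision. The TransientError class, the function names, and the default values are hypothetical; the max_attempts, base_delay, and backoff_rate parameters only loosely mirror the retry settings that a workflow definition would carry.

```python
import time
from typing import Any, Callable


class TransientError(Exception):
    """Raised by a step when the external system is busy and a retry may help."""


def call_with_retry(
    step: Callable[[], Any],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    backoff_rate: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Run a step, waiting longer after each transient failure."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(delay)
            delay *= backoff_rate


def run_order_workflow(
    process_payment: Callable[[], bool],
    ship_order: Callable[[], Any],
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Decision-based flow: cancel on failed payment, otherwise ship."""
    try:
        paid = call_with_retry(process_payment, sleep=sleep)
    except TransientError:
        paid = False  # retries exhausted, treat as a failed payment
    if not paid:
        return "CANCELED"
    call_with_retry(ship_order, sleep=sleep)
    return "SHIPPED"
```

Passing sleep as a parameter keeps the sketch testable without real waiting.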

Product search module

Our product search module is set up using the following serverless components:

Amazon Elasticsearch Service (Amazon ES) stores product data, which is updated whenever product-related data changes.
Lambda formats the data.
Amazon API Gateway allows users to search without authentication. As shown in Figure 4, searching for products on the ecommerce portal does not require users to log in. All traffic via API Gateway is unauthenticated.

Figure 4. Microservices-based product search workflow module with dynamic traffic through API Gateway

Replicating data across Regions

If your ecommerce application runs on multiple Regions, it may require the content and data to be replicated. This allows the application to handle local traffic from that Region and also act as a failover option if the application fails in another Region. The content and data are replicated using the multi-Region replication features of Amazon S3 and DynamoDB global tables.

Figure 5 shows a multi-Region ecommerce site built on AWS with serverless services. It uses the following features to make sure that data is in sync between all Regions for data and assets that are not subject to data residency requirements:

Amazon S3 multi-Region replication keeps static assets in sync across Regions.
DynamoDB global tables keep dynamic data in sync across Regions.

Assets that are specific to a Region are stored in Region-specific buckets.

Figure 5. Data replication for a multi-Region ecommerce website built using serverless components

Amazon Route 53 DNS web service manages traffic failover from one Region to another. Route 53 provides different routing policies, and depending on your business requirement, you can choose the failover routing policy.

Best practices

Now that we’ve shown you how to build these applications, make sure you follow these best practices to effectively build, deploy, and monitor the solution stack:

Infrastructure as Code (IaC). A well-defined, repeatable infrastructure is important for managing any solution stack. AWS CloudFormation allows you to treat your infrastructure as code and provides a relatively easy way to model a collection of related AWS and third-party resources.
AWS Serverless Application Model (AWS SAM). An open-source framework. Use it to build serverless applications on AWS.
Deployment automation. AWS CodePipeline is a fully managed continuous delivery service that automates your release pipelines for fast and reliable application and infrastructure updates.
AWS CodeStar. Allows you to quickly develop, build, and deploy applications on AWS. It provides a unified user interface, enabling you to manage all of your software development activities in one place.
AWS Well-Architected Framework. Provides a mechanism for regularly evaluating your workloads, identifying high risk issues, and recording your improvements.
Serverless Applications Lens. Documents how to design, deploy, and architect serverless application workloads.
Monitoring. AWS provides many services that help you monitor and understand your applications, including Amazon CloudWatch, AWS CloudTrail, and AWS X-Ray.

Conclusion

In this blog post, we showed you how to architect a highly available, serverless, and microservices-based ecommerce website that operates in multiple Regions.

We also showed you how to replicate data between different Regions, both for scaling and for failover if your workload fails in one Region. These serverless services reduce the burden of building and managing physical IT infrastructure to help you focus more on building solutions.

Related information

10 Things Serverless Architects Should Know

Create a secure data lake by masking, encrypting data, and enabling fine-grained access with AWS Lake Formation

2021-07-12 Shekar Tippur

Post Syndicated from Shekar Tippur original https://aws.amazon.com/blogs/big-data/create-a-secure-data-lake-by-masking-encrypting-data-and-enabling-fine-grained-access-with-aws-lake-formation/

You can build data lakes with millions of objects on Amazon Simple Storage Service (Amazon S3) and use AWS native analytics and machine learning (ML) services to process, analyze, and extract business insights. You can use a combination of our purpose-built databases and analytics services like Amazon EMR, Amazon Elasticsearch Service (Amazon ES), and Amazon Redshift as the right tool for your specific job and benefit from optimal performance, scale, and cost.

In this post, you learn how to create a secure data lake using AWS Lake Formation for processing sensitive data. The data (simulated patient metrics) is ingested through a serverless pipeline to identify, mask, and encrypt sensitive data before storing it securely in Amazon S3. After the data has been processed and stored, you use Lake Formation to define and enforce fine-grained access permissions to provide secure access for data analysts and data scientists.

Target personas

The proposed solution focuses on the following personas, each with a different level of access:

Cloud engineer – As the cloud infrastructure engineer, you implement the architecture but may not have access to the data itself or to define access permissions
secure-lf-admin – As a data lake administrator, you configure the data lake setting and assign data stewards
secure-lf-business-analyst – As a business analyst, you shouldn’t be able to access sensitive information
secure-lf-data-scientist – As a data scientist, you shouldn’t be able to access sensitive information

Solution overview

We use the following AWS services for ingesting, processing, and analyzing the data:

Amazon Athena is an interactive query service that can query data in Amazon S3 using standard SQL queries using tables in an AWS Glue Data Catalog. The data can be accessed via JDBC for further processing such as displaying in business intelligence (BI) dashboards.
Amazon CloudWatch is a monitoring and observability service that provides you with data and actionable insights to monitor your applications, respond to system-wide performance changes, and more. The logs from AWS Glue jobs and AWS Lambda functions are saved in CloudWatch logs.
Amazon Comprehend is a natural language processing (NLP) service that uses ML to uncover information in unstructured data.
Amazon DynamoDB is a NoSQL database that delivers single-digit millisecond performance at any scale and is used to avoid processing duplicate files.
AWS Glue is a serverless data preparation service that makes it easy to extract, transform, and load (ETL) data. An AWS Glue job encapsulates a script that reads, processes, and writes data to a new schema. This solution uses Python 3.6 AWS Glue jobs for ETL processing.
AWS IoT provides the cloud services that connect your internet of things (IoT) devices to other devices and AWS Cloud services.
Amazon Kinesis Data Firehose is the easiest way to reliably load streaming data into data lakes, data stores, and analytics services.
AWS Lake Formation makes it easy to set up, secure, and manage your data lake. With Lake Formation, you can discover, cleanse, transform, and ingest data into your data lake from various sources; define fine-grained permissions at the database, table, or column level; and share controlled access across analytic, ML, and ETL services.
Amazon S3 is a scalable object storage service that hosts the raw data files and processed files in the data lake for millisecond access.

You can enhance the security of your sensitive data with the following methods:

Implement encryption at rest using AWS Key Management Service (AWS KMS) and customer managed encryption keys
Instrument AWS CloudTrail and audit logging
Restrict access to AWS resources based on the least privilege principle

Architecture overview

The solution emulates diagnostic devices sending Message Queuing Telemetry Transport (MQTT) messages onto an AWS IoT Core topic. We use Kinesis Data Firehose to preprocess and stage the raw data in Amazon S3. We then use AWS Glue for ETL to further process the data by calling Amazon Comprehend to identify any sensitive information. Finally, we use Lake Formation to define fine-grained permissions that restrict access to business analysts and data scientists who use Athena to query the data.

The following diagram illustrates the architecture for our solution.

Prerequisites

To follow the deployment walkthrough, you need an AWS account. Use us-east-1 or us-west-2 as your Region.

For this post, make sure you don’t have Lake Formation enabled in your AWS account.

Stage the data

Download the zipped archive file to use for this solution and unzip the files locally. The patient.csv file contains dummy data created to help demonstrate masking, encryption, and granting fine-grained access. The send-messages.sh script randomly generates simulated diagnostic data to represent body vitals. The AWS Glue job uses the glue-script.py script to perform ETL that detects sensitive information, masks and encrypts the data, and populates the curated table in the AWS Glue Data Catalog.

Create an S3 bucket called secure-datalake-scripts-<ACCOUNT_ID> via the Amazon S3 console. Upload the scripts and CSV files to this location.

Deploy your resources

For this post, we use AWS CloudFormation to create our data lake infrastructure.

Choose Launch Stack:
Select I acknowledge that AWS CloudFormation might create IAM resources with custom names before deploying.

The stack takes approximately 5 minutes to complete.

The following screenshot shows the key-values the stack created. We use the TestUserPassword parameter for the Lake Formation personas to sign in to the AWS Management Console.

Load the simulation data

Stage the send-messages.sh script by running the Amazon S3 copy command:

aws s3 cp s3://secure-datalake-scripts-<ACCOUNT_ID>/send-messages.sh .

Run your script by using the following command:

sh send-messages.sh

The script runs for a few minutes and emits 300 messages. The MQTT messages are sent to the secure_iot_device_analytics topic, filtered using IoT rules, processed using Kinesis Data Firehose, and converted to Parquet format. After a minute, data starts showing up in the raw bucket.

Run the AWS Glue ETL pipeline

Run the AWS Glue workflow (secureGlueWorkflow) from the AWS Glue console; you can also schedule it to run using CloudWatch. It takes approximately 10 minutes to complete.

The AWS Glue job that is triggered as part of the workflow (ProcessSecureData) joins the patient metadata and patient metrics data. See the following code:

# Join Patient metadata and patient metrics dataframe
combined_df=Join.apply(patient_metadata, patient_metrics, 'PatientId', 'pid', transformation_ctx = "combined_df")

The ensuing dataframe contains sensitive information like FirstName, LastName, DOB, Address1, Address2, and AboutYourself. AboutYourself is freeform text entered by the patient during registration. In the following code snippet, the detect_sensitive_info function calls the Amazon Comprehend API to identify personally identifiable information (PII):

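# Apply groupBy to get unique AboutYourself records
group = combined_df.toDF().groupBy("pid", "DOB", "FirstName", "LastName", "Address1", "Address2", "AboutYourself").count()
# Convert the Spark DataFrame back to a DynamicFrame, which Map.apply requires
# (assumes glueContext is the job's GlueContext and DynamicFrame is imported from awsglue.dynamicframe)
group_df = DynamicFrame.fromDF(group, glueContext, "group_df")
# Apply detect_sensitive_info to get the redacted string after masking PII data
df_with_about_yourself = Map.apply(frame = group_df, f = detect_sensitive_info)
# Apply encryption to the identified fields
df_with_about_yourself_encrypted = Map.apply(frame = group_df, f = encrypt_rows)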

Amazon Comprehend returns an object that has information about the entity name and entity type. Based on your needs, you can filter the entity types that need to be masked.
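
As an illustration of filtering by entity type, the following Python sketch masks only the selected types in a text string. It is not the detect_sensitive_info function from the solution. It assumes each entity is a dictionary with Type, BeginOffset, and EndOffset keys, which is the shape of the Amazon Comprehend PII detection response; the mask_entities name and the sample text are hypothetical.

```python
from typing import Any, Dict, Iterable, Set


def mask_entities(
    text: str,
    entities: Iterable[Dict[str, Any]],
    types_to_mask: Set[str],
    mask_char: str = "*",
) -> str:
    """Replace the characters of selected entity types with a mask character."""
    chars = list(text)
    for entity in entities:
        if entity["Type"] not in types_to_mask:
            continue  # leave entity types that were not selected untouched
        for index in range(entity["BeginOffset"], entity["EndOffset"]):
            chars[index] = mask_char
    return "".join(chars)
```

Narrowing types_to_mask keeps more of the free-form text readable, at the cost of leaving the remaining entity types visible.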

These fields are masked, encrypted, and written to their respective S3 buckets where fine-grained access controls are applied via Lake Formation:

Masked data – s3://secure-data-lake-masked-<ACCOUNT_ID>/secure-dl-masked-data/
Encrypted data – s3://secure-data-lake-masked-<ACCOUNT_ID>/secure-dl-encrypted-data/
Curated data – s3://secure-data-lake-<ACCOUNT_ID>/secure-dl-curated-data/

Now that the tables have been defined, we review permissions using Lake Formation.

Enable Lake Formation fine-grained access

To enable fine-grained access, we first add a Lake Formation admin user.

On the Lake Formation console, select Add other AWS users or roles.
On the drop-down menu, choose secure-lf-admin.
Choose Get started.
In the navigation pane, choose Settings.
On the Data Catalog Settings page, deselect Use only IAM access control for new databases and Use only IAM access control for new tables in new databases.
Choose Save.

Grant access to different personas

Before we grant permissions to different user personas, let’s register the S3 locations in Lake Formation so these personas can access S3 data without granting access through AWS Identity and Access Management (IAM).

On the Lake Formation console, choose Register and ingest in the navigation pane.
Choose Data lake locations.
Choose Register location.
Find and select each of the following S3 buckets and choose Register location:
1. s3://secure-raw-bucket-<ACCOUNT_ID>/temp-raw-table
2. s3://secure-data-lake-masked-<ACCOUNT_ID>/secure-dl-encrypted-data
3. s3://secure-data-lake-<ACCOUNT_ID>/secure-dl-curated-data
4. s3://secure-data-lake-masked-<ACCOUNT_ID>/secure-dl-masked-data

We’re now ready to grant access to our different users.

Grant read-only access to all the tables to secure-lf-admin

First, we grant read-only access to all the tables for the user secure-lf-admin.

Sign in to the console with secure-lf-admin (use the password value for TestUserPassword from the CloudFormation stack) and make sure you’re in the same Region.
Navigate to the AWS Lake Formation console.
Under Data Catalog, choose Databases.
Select the database secure-db.
On the Actions drop-down menu, choose Grant.
Select IAM users and roles.
Choose the role secure-lf-admin.
Under Policy tags or catalog resources, select Named data catalog resources.
For Database, choose the database secure-db.
For Tables, choose All tables.
Under Permissions, select Table permissions.
For Table permissions, select Super.
Choose Grant.
Choose the secure_dl_curated_data table.
On the Actions drop-down menu, choose View permissions.
Select IAMAllowedPrincipals, choose Revoke, and then choose the Revoke button to confirm.

You can confirm your user permissions on the Data Permissions page.

Grant read-only access to secure-lf-business-analyst

Now we grant read-only access to certain encrypted columns to the user secure-lf-business-analyst.

On the Lake Formation console, under Data Catalog, choose Databases.
Select the database secure-db and choose View tables.
Select the table secure_dl_encrypted_data.
On the Actions drop-down menu, choose Grant.
Select IAM users and roles.
Choose the role secure-lf-business-analyst.
Under Permissions, select Column-based permissions.
Choose the following columns:
1. count
2. address1_encrypted
3. firstname_encrypted
4. address2_encrypted
5. dob_encrypted
6. lastname_encrypted
For Grantable permissions, select Select.
Choose Grant.
Choose the secure_dl_encrypted_data table.
On the Actions drop-down menu, choose View permissions.
Select IAMAllowedPrincipals, choose Revoke, and then choose the Revoke button to confirm.

You can confirm your user permissions on the Data Permissions page.
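
The column-level grant configured in the console above corresponds to a single Lake Formation GrantPermissions API request. The following is a minimal sketch that only builds the request parameters as a plain object. The helper name buildColumnGrant is hypothetical, and the principal ARN (account ID and principal type) is a placeholder assumption. Passing the result to an AWS SDK Lake Formation client is not shown here.

```javascript
// Hypothetical helper: builds the parameters for a Lake Formation
// GrantPermissions request that grants SELECT on specific columns only.
const buildColumnGrant = (principalArn, databaseName, tableName, columnNames) => ({
    Principal: { DataLakePrincipalIdentifier: principalArn },
    Resource: {
        TableWithColumns: {
            DatabaseName: databaseName,
            Name: tableName,
            ColumnNames: columnNames
        }
    },
    Permissions: ["SELECT"]
});

// Placeholder ARN (assumption): replace it with the ARN of your own principal.
const params = buildColumnGrant(
    "arn:aws:iam::111122223333:user/secure-lf-business-analyst",
    "secure-db",
    "secure_dl_encrypted_data",
    [
        "count",
        "address1_encrypted",
        "firstname_encrypted",
        "address2_encrypted",
        "dob_encrypted",
        "lastname_encrypted"
    ]
);

console.log(JSON.stringify(params, null, 2));
```

Keeping the column list in code makes it easier to review which columns a persona can read than clicking through the console.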

Grant read-only access to secure-lf-data-scientist

Lastly, we grant read-only access to masked data to the user secure-lf-data-scientist.

On the Lake Formation console, under Data Catalog, choose Databases.
Select the database secure-db and choose View tables.
Select the table secure_dl_masked_data.
On the Actions drop-down menu, choose Grant.
Select IAM users and roles.
Choose the role secure-lf-data-scientist.
Under Permissions, select Table permissions.
For Table permissions, select Select.
Choose Grant.
Under Data Catalog, choose Tables.
Choose the secure_dl_masked_data table.
On the Actions drop-down menu, choose View permissions.
Select IAMAllowedPrincipals, choose Revoke, and then choose the Revoke button to confirm.

You can confirm your user permissions on the Data Permissions page.

Query the data lake using Athena from different personas

To validate the permissions of different personas, we use Athena to query against the S3 data lake.

Make sure you set the query result location to the location created as part of the CloudFormation stack (secure-athena-query-<ACCOUNT_ID>). The following screenshot shows the location information in the Settings section on the Athena console.

You can see all the tables listed under secure-db.

Sign in to the console with secure-lf-admin (use the password value for TestUserPassword from the CloudFormation stack) and make sure you’re in the same Region.
Navigate to the Athena console.
Run a SELECT query against the secure_dl_curated_data table.

The user secure-lf-admin should see all the columns with encryption or masking.

Now let’s validate the permissions of the secure-lf-business-analyst user.

Sign in to the console with secure-lf-business-analyst.
Navigate to the Athena console.
Run a SELECT query against the secure_dl_encrypted_data table.

The secure-lf-business-analyst user can only view the selected encrypted columns.

Lastly, let’s validate the permissions of secure-lf-data-scientist.

Sign in to the console with secure-lf-data-scientist.
Run a SELECT query against the secure_dl_masked_data table.

The secure-lf-data-scientist user can only view the selected masked columns.

If you try to run a query on different tables, such as secure_dl_curated_data, you get an error message for insufficient permissions.

Clean up

To avoid unexpected future charges, delete the CloudFormation stack.

Conclusion

In this post, we presented a potential solution for processing and storing sensitive data workloads in an S3 data lake. We demonstrated how to build a data lake on AWS to ingest, transform, aggregate, and analyze data from IoT devices in near-real time. This solution also demonstrates how you can mask and encrypt sensitive data, and use fine-grained column-level security controls with Lake Formation, which benefits those with a higher level of security needs.

Lake Formation recently announced the preview for row-level access, and you can sign up for the preview now.

About the Authors

Shekar Tippur is an AWS Partner Solutions Architect. He specializes in machine learning and analytics workloads. He has been helping partners and customers adopt best practices and discover insights from data.

Ramakant Joshi is an AWS Solution Architect, specializing in the analytics and serverless domain. He has over 20 years of software development and architecture experience, and is passionate about helping customers in their cloud journey.

Navnit Shukla is an AWS Specialist Solution Architect, Analytics, and is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

Developing evolutionary architecture with AWS Lambda

2021-07-08 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/developing-evolutionary-architecture-with-aws-lambda/

This post was written by Luca Mezzalira, Principal Solutions Architect, Media and Entertainment.

Agility enables you to evolve a workload quickly, adding new features, or introducing new infrastructure as required. The key characteristics for achieving agility in a code base are loosely coupled components and strong encapsulation.

Loose coupling can help improve test coverage and create atomic refactoring. With encapsulation, you expose only what is needed to interact with a service without revealing the implementation logic.

Evolutionary architectures can help achieve agility in your design. In the book “Building Evolutionary Architectures”, this architecture is defined as one that “supports guided, incremental change across multiple dimensions”.

This blog post focuses on how to structure code for AWS Lambda functions in a modular fashion. It shows how to embrace the evolutionary aspect provided by the hexagonal architecture pattern and apply it to different use cases.

Introducing ports and adapters

Hexagonal architecture is also known as the ports and adapters architecture. It is an architectural pattern used for encapsulating domain logic and decoupling it from other implementation details, such as infrastructure or client requests.

Domain logic: Represents the task that the application should perform, abstracting any interaction with the external world.
Ports: Provide a way for the primary actors (on the left) to interact with the application, via the domain logic. The domain logic also uses ports for interacting with secondary actors (on the right) when needed.
Adapters: A design pattern for transforming one interface into another interface. They wrap the logic for interacting with a primary or secondary actor.
Primary actors: Users of the system such as a webhook, a UI request, or a test script.
Secondary actors: used by the application, these services are either a Repository (for example, a database) or a Recipient (such as a message queue).
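
To make these roles concrete, here is a minimal sketch of how the pieces could be wired together in Node.js. It is not the sample application's code: the factory names (createInMemoryStockAdapter, createStockDomain, createHandler) are hypothetical, and an in-memory Map stands in for a real database.

```javascript
// Secondary adapter: wraps a data source (a Map stands in for a database).
const createInMemoryStockAdapter = (rows) => ({
    getStockValue: async (stockID) => rows.get(stockID)
});

// Domain logic: knows nothing about the data source.
// It only calls the port (the object with getStockValue) that it is given.
const createStockDomain = (repositoryPort) => ({
    describeStock: async (stockID) => {
        const item = await repositoryPort.getStockValue(stockID);
        if (!item) {
            throw new Error(`Unknown stock: ${stockID}`);
        }
        return `${item.STOCK_ID} is worth ${item.VALUE} EUR`;
    }
});

// Primary adapter: translates an incoming request into a call on the domain port.
const createHandler = (domainPort) => async (event) => {
    const body = await domainPort.describeStock(event.pathParameters.StockID);
    return { statusCode: 200, body };
};

// Wiring: replacing the adapter requires no change to the domain logic.
const rows = new Map([["AMZN", { STOCK_ID: "AMZN", VALUE: 100 }]]);
const handler = createHandler(createStockDomain(createInMemoryStockAdapter(rows)));

handler({ pathParameters: { StockID: "AMZN" } }).then((res) => console.log(res));
```

Because the domain logic receives its port as an argument, a test can pass a stub adapter while production code passes one that talks to a real service.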

Hexagonal architecture with Lambda functions

Lambda functions are units of compute logic that accomplish a specific task. For example, a function could manipulate data in an Amazon Kinesis stream, or process messages from an Amazon SQS queue.

In Lambda functions, hexagonal architecture can help you implement new business requirements and improve the agility of a workload. This approach can help create separation of concerns and separate the domain logic from the infrastructure. For development teams, it can also simplify the implementation of new features and parallelize the work across different developers.

The following example introduces a service for returning a stock value. The service supports different currencies for a frontend application that displays the information in a dashboard. The translation of a stock value between currencies happens in real time. The service must retrieve the exchange rates with every request made by the client.

The architecture for this service uses an Amazon API Gateway endpoint that exposes a REST API. When the client calls the API, it triggers a Lambda function. This gets the stock value from a DynamoDB table and the currency information from a third-party endpoint. The domain logic uses the exchange rate to convert the stock value to other currencies before responding to the client request.

The full example is available in the AWS GitHub samples repository. Here is the architecture for this service:

A client makes a request to the API Gateway endpoint, which invokes the Lambda function.

The primary adapter receives the request. It captures the stock ID and passes it to the port:

exports.lambdaHandler = async (event) => {
    try {
        // retrieve the stockID from the request
        const stockID = event.pathParameters.StockID;
        // pass the stockID to the port
        const response = await getStocksRequest(stockID);
        return response;
    } catch (err) {
        // minimal error handling: log and rethrow
        console.error(err);
        throw err;
    }
};

The port is an interface for communicating with the domain logic. It enforces the separation between an adapter and the domain logic. With this approach, you can change and test the infrastructure and domain logic in isolation without impacting another part of the code base:
```
const retrieveStock = async (stockID) => {
    try {
        // use the port "stock" to access the domain logic
        const stockWithCurrencies = await stock.retrieveStockValues(stockID);
        return stockWithCurrencies;
    } catch (err) {
        // minimal error handling: log and rethrow
        console.error(err);
        throw err;
    }
};
```

The port invokes the domain logic entry point, passing the stock ID. The domain logic fetches the stock value from a DynamoDB table, then it requests the exchange rates. It returns the computed values to the primary adapter via the port. The domain logic always uses a port to interact with an adapter because the ports are the interfaces with the external world:

const CURRENCIES = ["USD", "CAD", "AUD"];
const retrieveStockValues = async (stockID) => {
    try {
        // retrieve the stock value from DynamoDB using a port
        const stockValue = await Repository.getStockData(stockID);
        // fetch the currencies value using a port
        const currencyList = await Currency.getCurrenciesData(CURRENCIES);
        // calculate the stock value in different currencies
        const stockWithCurrencies = {
            stock: stockValue.STOCK_ID,
            values: {
                "EUR": stockValue.VALUE
            }
        };
        for (const currency in currencyList.rates) {
            stockWithCurrencies.values[currency] = (stockValue.VALUE * currencyList.rates[currency]).toFixed(2);
        }
        // return the final computation to the port
        return stockWithCurrencies;
    } catch (err) {
        // minimal error handling: log and rethrow
        console.error(err);
        throw err;
    }
};

This is how the domain logic interacts with the DynamoDB table:

The domain logic uses the Repository port for interacting with the database. There is not a direct connection between the domain and the adapter:

const getStockData = async (stockID) => {
    try {
        // the domain logic passes the request to fetch the stock ID value to this port
        const data = await getStockValue(stockID);
        return data.Item;
    } catch (err) {
        // minimal error handling: log and rethrow
        console.error(err);
        throw err;
    }
};

The secondary adapter encapsulates the logic for reading an item from a DynamoDB table. All the logic for interacting with DynamoDB is encapsulated in this module:

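const AWS = require('aws-sdk');
const documentClient = new AWS.DynamoDB.DocumentClient();
// Assumption: the table name is supplied through an environment variable
const DB_TABLE = process.env.DB_TABLE;

const getStockValue = async (stockID) => {
    const params = {
        TableName: DB_TABLE,
        Key: {
            'STOCK_ID': stockID
        }
    };
    try {
        // read the item for this stock ID from the DynamoDB table
        const stockData = await documentClient.get(params).promise();
        return stockData;
    } catch (err) {
        // minimal error handling: log and rethrow
        console.error(err);
        throw err;
    }
};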

The domain logic uses an adapter for fetching the exchange rates from the third-party service. It then processes the data and responds to the client request:

The second operation in the business logic is retrieving the currency exchange rates. The domain logic requests the operation via a port that proxies the request to the adapter:
```
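const getCurrenciesData = async (currencies) => {
    try {
        // proxy the request from the domain logic to the adapter
        const data = await getCurrencies(currencies);
        return data;
    } catch (err) {
        // minimal error handling: log and rethrow
        console.error(err);
        throw err;
    }
};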
```

The currencies service adapter fetches the data from a third-party endpoint and returns the result to the domain logic.

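const axios = require('axios');

const getCurrencies = async (currencies) => {
    try {
        // fetch the exchange rates for the requested currencies
        const res = await axios.get(`http://api.mycurrency.io?symbols=${currencies.toString()}`);
        return res.data;
    } catch (err) {
        // minimal error handling: log and rethrow
        console.error(err);
        throw err;
    }
};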

These eight steps show how to structure the Lambda function code using a hexagonal architecture.

Adding a cache layer

In this scenario, the production stock service experiences traffic spikes during the day. The external endpoint for the exchange rates cannot support the level of traffic. To address this, you can implement a caching strategy with Amazon ElastiCache using a Redis cluster. This approach uses a cache-aside pattern for offloading traffic to the external service.

Typically, it can be challenging to evolve code to implement this change without the separation of concerns in the code base. However, in this example, there is an adapter that interacts with the external service. Therefore, you can change the implementation to add the cache-aside pattern and maintain the same API contract with the rest of the application:

const getCurrencies = async (currencies) => {
    try {
        // Check whether the exchange rates are available in the Redis cluster
        const res = await asyncClient.get("CURRENCIES");
        if (res) {
            // If present, return the value retrieved from Redis
            return JSON.parse(res);
        }
        // Otherwise, fetch the data from the external service
        const getCurr = await axios.get(`http://api.mycurrency.io?symbols=${currencies.toString()}`);
        // Store the new values in the Redis cluster with an expiry time of 20 seconds
        await asyncClient.set("CURRENCIES", JSON.stringify(getCurr.data), "ex", 20);
        // Return the data to the port
        return getCurr.data;
    } catch (err) {
        // minimal error handling: log and rethrow
        console.error(err);
        throw err;
    }
};

This is a low-effort change only affecting the adapter. The domain logic and port interacting with the adapter are untouched and maintain the same API contract. The encapsulation provided by this architecture helps to evolve the code base. It also preserves many of the tests in place, considering only an adapter is modified.
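
The cache-aside behavior can also be exercised without a Redis cluster. The following sketch rests on stated assumptions: createMemoryCache is a hypothetical in-memory stand-in for Redis, and the rates fetcher is injected so that a stub can replace the third-party endpoint. The adapter keeps the same signature as before, taking a list of currencies and returning the rates.

```javascript
// Hypothetical in-memory cache with per-key expiry, standing in for Redis.
// The clock is injectable so that expiry can be simulated.
const createMemoryCache = (now = () => Date.now()) => {
    const entries = new Map();
    return {
        get: async (key) => {
            const entry = entries.get(key);
            if (!entry || entry.expiresAt <= now()) {
                entries.delete(key);
                return null;
            }
            return entry.value;
        },
        set: async (key, value, ttlSeconds) => {
            entries.set(key, { value, expiresAt: now() + ttlSeconds * 1000 });
        }
    };
};

// Cache-aside adapter: read from the cache first, fall back to the
// external service on a miss, then store the result for 20 seconds.
const createCurrenciesAdapter = (cache, fetchRates) => async (currencies) => {
    const cached = await cache.get("CURRENCIES");
    if (cached) {
        return JSON.parse(cached);
    }
    const rates = await fetchRates(currencies);
    await cache.set("CURRENCIES", JSON.stringify(rates), 20);
    return rates;
};

// Usage with a stub fetcher that counts calls to the "external service".
const demoCacheAside = async () => {
    let externalCalls = 0;
    const stubFetch = async () => {
        externalCalls += 1;
        return { rates: { USD: 1.1 } };
    };
    const getCurrencies = createCurrenciesAdapter(createMemoryCache(), stubFetch);
    await getCurrencies(["USD"]);
    await getCurrencies(["USD"]);
    console.log(`external calls: ${externalCalls}`);
};

demoCacheAside();
```

As in the adapter above, the cache key does not depend on the requested currencies, so this approach assumes every caller asks for the same set.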

Moving domain logic from a container to a Lambda function

In this example, the team working on this workload originally wrapped all the functionality inside a container using AWS Fargate with Amazon ECS. In this case, the developers defined a route for the GET method for retrieving the stock value:

// This web application uses the Fastify framework
fastify.get('/stock/:StockID', async (request, reply) => {
    try {
        const stockID = request.params.StockID;
        const response = await getStocksRequest(stockID);
        return response;
    } catch (err) {
        // minimal error handling: log and rethrow
        console.error(err);
        throw err;
    }
});

In this case, the route’s entry point is exactly the same for the Lambda function. The team does not need to change anything else in the code base, thanks to the characteristics provided by the hexagonal architecture.

This pattern can help you more easily refactor code from containers or virtual machines to multiple Lambda functions. It introduces a level of code portability that can be more challenging with other solutions.
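
As a rough illustration of that portability, the sketch below shows two primary adapters delegating to the same port. The getStocksRequest function here is a stub with a made-up return value, not the real port. The request shapes follow the Lambda and Fastify examples above.

```javascript
// Stub port, standing in for the real getStocksRequest.
const getStocksRequest = async (stockID) => ({ stock: stockID, values: { EUR: 100 } });

// Primary adapter for a Lambda function behind API Gateway.
const lambdaHandler = async (event) => getStocksRequest(event.pathParameters.StockID);

// Primary adapter for a web framework route running in a container.
const routeHandler = async (request) => getStocksRequest(request.params.StockID);

// Both entry points return the same result because they share the port.
Promise.all([
    lambdaHandler({ pathParameters: { StockID: "AMZN" } }),
    routeHandler({ params: { StockID: "AMZN" } })
]).then(([fromLambda, fromRoute]) => console.log(fromLambda, fromRoute));
```

Only the thin adapter differs between the two compute options. Everything behind the port stays unchanged.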

Benefits and drawbacks

As with any pattern, there are benefits and drawbacks to using hexagonal architecture.

The main benefits are:

The domain logic is agnostic and independent from the outside world.
The separation of concerns increases code testability.
It may help reduce technical debt in workloads.

The drawbacks are:

The pattern requires an upfront investment of time.
The domain logic implementation is not opinionated.

Whether you should use this architecture for developing Lambda functions depends upon the needs of your application. With an evolving workload, the extra implementation effort may be worthwhile.

The pattern can help improve code testability because of the encapsulation and separation of concerns provided. This approach can also be used with compute solutions other than Lambda, which may be useful in code migration projects.

Conclusion

This post shows how you can evolve a workload using hexagonal architecture. It explains how to add new functionality, change underlying infrastructure, or port the code base between different compute solutions. The main characteristics enabling this are loose coupling and strong encapsulation.

To learn more about hexagonal architecture and similar patterns, read:

The article written by Alistair Cockburn, the creator of this pattern.
The Onion Architecture, based on the Inversion of Control principle (IoC).
The Clean Architecture, which provides a more opinionated approach on how to structure domain logic via entities and use cases.

For more serverless learning resources, visit Serverless Land.

Implementing a LIFO task queue using AWS Lambda and Amazon DynamoDB

2021-06-29 Eric Johnson

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/implementing-a-lifo-task-queue-using-aws-lambda-and-amazon-dynamodb/

This post was written by Diggory Briercliffe, Senior IoT Architect.

When implementing a task queue, you can use Amazon SQS standard or FIFO (First-In-First-Out) queue types. Both queue types give priority to tasks created earlier over tasks that are created later. However, there are use cases where you need a LIFO (Last-In-First-Out) queue.

This post shows how to implement a serverless LIFO task queue. This uses AWS Lambda, Amazon DynamoDB, AWS Serverless Application Model (AWS SAM), and other AWS Serverless technologies.

The LIFO task queue gives priority to newer queue tasks over earlier tasks. Under heavy load, earlier tasks are deprioritized and eventually removed. This is useful when your workload must communicate with a system that is throughput-constrained and newer tasks should have priority.

To help understand the approach, consider the following use case. As part of optimizing the responsiveness of a mobile application, an IoT application validates device IP addresses after connecting to AWS IoT Core. Users open the application soon after the device connects so the most recent connection events should take priority for the validation work.

If the validation work is not done at connection time, it can be done later. A legacy system validates the IP addresses, but its throughput capacity cannot match the peak connection rate of the IoT devices. A LIFO queue can manage this load, by prioritizing validation of newer connection events. It can buffer or load shed earlier connection event validation.

For a more detailed discussion around insurmountable queue backlogs and queuing theory, read “Avoiding insurmountable queue backlogs” in the Amazon Builders’ Library.

Example application

An example application implementing the LIFO queue approach is available at https://github.com/aws-samples/serverless-lifo-queue-demonstration.

The application uses AWS SAM and the Lambda functions are written in Node.js. The AWS SAM template describes AWS resources required by the application. These include a DynamoDB table, Lambda functions, and Amazon SNS topics.

The README file contains instructions on deploying and testing the application, with detailed information on how it works.

Overview

The example application has the following queue characteristics:

Newer queue tasks are prioritized over earlier tasks.
Queue tasks are buffered if they cannot be processed.
Queue tasks are eventually deleted if they are never processed, such as when the queue is under insurmountable load.
Correct queue task state transition is maintained (such as PENDING to TAKEN, but not PENDING to SUCCESS).

A DynamoDB table stores queue task items. It uses the following DynamoDB features:

A global secondary index (GSI) sorts queue task items by a created timestamp, in reverse chronological (LIFO) order.
Update expressions and condition expressions provide atomic and exclusive queue task item updates. This prevents duplicate processing of queue tasks and ensures that the queue task state transitions are valid.
Time to live (TTL) deletes queue task items once they expire. Under insurmountable load, this ensures that tasks are deleted if they are never processed from the queue. It also deletes queue task items once they have been processed.
DynamoDB Streams invoke a Lambda function when new queue task items are inserted into the table and must be processed.

The application consists of the following resources defined in the AWS SAM template:

QueueTable: A DynamoDB table containing queue task items, which is configured for DynamoDB Streams to invoke a TriggerFunction.
TriggerFunction: A Lambda function, which governs triggering of queue task processing. Source code: app/trigger.js
ProcessTasksFunction: A Lambda function, which processes queue tasks and ensures consistent queue task state flow. Source code: app/process_tasks.js
CreateTasksFunction: A Lambda function, which inserts queue task items into the QueueTable. Source code: app/create_tasks.js
TriggerTopic: An SNS topic which TriggerFunction subscribes to.
ProcessTasksTopic: An SNS topic which ProcessTasksFunction subscribes to.

The following diagram illustrates how those resources interact to implement the LIFO queue.

LIFO Architecture diagram

CreateTasksFunction inserts queue task items into QueueTable with PENDING state.
A DynamoDB stream invokes TriggerFunction for all queue task item activity in QueueTable.
TriggerFunction publishes a notification on ProcessTasksTopic if queue tasks should be processed.
ProcessTasksFunction subscribes to ProcessTasksTopic.
ProcessTasksFunction queries for PENDING queue task items in QueueTable for up to 1 minute, or until no PENDING queue task items remain.
ProcessTasksFunction processes each PENDING queue task by calling the throughput constrained legacy system.
ProcessTasksFunction updates each queue task item during processing to reflect state (first to TAKEN, and then to SUCCESS, FAILURE, or PENDING).
ProcessTasksFunction publishes an SNS notification on TriggerTopic if PENDING tasks remain in the queue.
TriggerFunction subscribes to TriggerTopic.

Application activity continues while DynamoDB Streams receives QueueTable events (2) or TriggerTopic receives notifications (9).

LIFO queue DynamoDB table

A DynamoDB table stores the LIFO queue task items. The AWS SAM template defines this resource (named QueueTable):

Each item in the table represents a queue task. It has the item attributes taskId (hash key), taskStatus, taskCreated, and taskUpdated.
The table has a single global secondary index (GSI) with taskStatus as the hash key and taskCreated as the range key. This GSI is fundamental to LIFO queue characteristics. It allows you to query for PENDING queue tasks, in reverse chronological order, so that the newest tasks can be processed first.
The DynamoDB TTL attribute causes earlier queue tasks to expire and be deleted. This prevents the queue from growing indefinitely if there is insurmountable load.
DynamoDB Streams invokes the TriggerFunction Lambda function for all changes in QueueTable.
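
As an illustration of the item shape described above, the following sketch builds a new queue task item. The helper name createQueueTask and the TTL attribute name ttl are assumptions, because the attribute name used by the sample application is not stated here. The sketch uses millisecond timestamps for taskCreated and taskUpdated, and epoch seconds for the TTL attribute, which is the unit DynamoDB TTL expects.

```javascript
// Hypothetical helper: builds a new queue task item in PENDING state.
const createQueueTask = (taskId, nowMs, ttlSeconds) => ({
    taskId,
    taskStatus: "PENDING",
    taskCreated: nowMs,
    taskUpdated: nowMs,
    // DynamoDB TTL expects a Number attribute holding an epoch time in seconds.
    // The attribute name "ttl" is an assumption for this sketch.
    ttl: Math.floor(nowMs / 1000) + ttlSeconds
});

// A task created now that expires one hour later if it is never processed.
console.log(createQueueTask("task-123", Date.now(), 3600));
```

A shorter TTL sheds load sooner under sustained pressure, while a longer TTL buffers tasks for longer.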

Triggering queue task processing

The application continuously processes all PENDING queue tasks until none remain. With no PENDING queue tasks, the application is idle.

As the application is serverless, task processing is triggered by events. If a single Lambda function cannot process the volume of PENDING tasks, the application notifies itself so that processing can continue in another invocation. This is a tail call, which is an SNS notification sent by ProcessTasksFunction to TriggerTopic.

The Lambda functions that collaborate on managing the LIFO queue are:

TriggerFunction is a proxy to ProcessTasksFunction and decides if task processing should be triggered. This function is invoked by DynamoDB Streams events on item changes in QueueTable or by a tail call SNS notification received from TriggerTopic.
ProcessTasksFunction performs the processing of queue tasks and implements the LIFO queue behavior. An SNS notification published on ProcessTasksTopic invokes this function.

Processing queue task items

The ProcessTasksFunction function processes queue tasks:

The function is invoked by an SNS notification on ProcessTasksTopic.
While the function runs, it polls QueueTable for PENDING queue tasks.
The function processes each queue task and then updates the item.
The function stops polling after 1 minute or if there are no PENDING queue tasks remaining.
If there are more PENDING tasks in the queue, the function triggers another task. It sends a tail call SNS notification to TriggerTopic.

This uses DynamoDB expressions to ensure that tasks are not processed more than once during periods of concurrent function invocations. To prevent higher concurrency, the reserved concurrent executions attribute is set to 1.
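
The polling loop and tail call described above could be sketched as follows. This is not the code in app/process_tasks.js. The function processTasks and its injected dependencies (queryPending, processTask, publishTailCall, now) are hypothetical names, passed in so that the control flow can run without AWS services.

```javascript
// Polls for PENDING tasks until none remain or the time budget is used up.
// If the budget runs out while tasks remain, it publishes a tail call.
const processTasks = async ({ queryPending, processTask, publishTailCall, now, budgetMs }) => {
    const deadline = now() + budgetMs;
    let processed = 0;
    while (now() < deadline) {
        const tasks = await queryPending();
        if (tasks.length === 0) {
            // Queue drained: no tail call needed.
            return { processed, tailCall: false };
        }
        for (const task of tasks) {
            await processTask(task);
            processed += 1;
        }
    }
    // Time budget used up: hand over to another invocation if work remains.
    const remaining = await queryPending();
    if (remaining.length > 0) {
        await publishTailCall();
        return { processed, tailCall: true };
    }
    return { processed, tailCall: false };
};

// Usage with an in-memory queue and a 1 minute budget.
const demoQueue = ["task-1", "task-2", "task-3"];
processTasks({
    queryPending: async () => demoQueue.slice(0, 2),
    processTask: async (task) => { demoQueue.splice(demoQueue.indexOf(task), 1); },
    publishTailCall: async () => console.log("tail call published"),
    now: () => Date.now(),
    budgetMs: 60000
}).then((result) => console.log(result));
```

The time budget keeps each invocation well within the function timeout, and the tail call lets the next invocation continue with the newest tasks.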

Before processing a queue task, the taskStatus item attribute is transitioned from PENDING to TAKEN. Following queue task processing, the taskStatus item attribute is transitioned from TAKEN to SUCCESS or FAILURE.

If a queue task cannot be processed (for example, an external system has reached capacity), the item taskStatus attribute is set to PENDING again. Any aging PENDING queue tasks that cannot be processed are buffered. They are eventually deleted once they expire, due to the TTL configuration.

Querying for queue task items

To get the most recently created PENDING queue tasks, query the task-status-created-index GSI. The following shows the DynamoDB query action request parameters for the task-status-created-index. By using a Limit of 10 and setting ScanIndexForward to false, it retrieves the 10 most recently created queue task items:

{
  "TableName": "QueueTable",
  "IndexName": "task-status-created-index",
  "ExpressionAttributeValues": {
    ":taskStatus": {
      "S": "PENDING"
    }
  },
  "KeyConditionExpression": "taskStatus = :taskStatus",
  "Limit": 10,
  "ScanIndexForward": false
}
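
To see what this query returns, the following sketch emulates it over an in-memory array. It is an illustration of the semantics only: DynamoDB evaluates the key condition and sort order in the index itself, and the function name queryNewestPending is hypothetical.

```javascript
// Emulates the GSI query: match the hash key (taskStatus), order by the
// range key (taskCreated) descending, and apply the limit.
const queryNewestPending = (items, limit) =>
    items
        .filter((item) => item.taskStatus === "PENDING")
        .sort((a, b) => b.taskCreated - a.taskCreated)
        .slice(0, limit);

const items = [
    { taskId: "t1", taskStatus: "PENDING", taskCreated: 100 },
    { taskId: "t2", taskStatus: "SUCCESS", taskCreated: 200 },
    { taskId: "t3", taskStatus: "PENDING", taskCreated: 300 },
    { taskId: "t4", taskStatus: "PENDING", taskCreated: 250 }
];

console.log(queryNewestPending(items, 2).map((item) => item.taskId));
// → [ 't3', 't4' ]
```

The oldest PENDING task (t1) is only returned once newer tasks have been processed, which is the LIFO behavior.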

Updating queue task items

The following code shows request parameters for the DynamoDB UpdateItem action. This sets the taskStatus attribute of a queue task item (to TAKEN from PENDING). The update expression and condition expression ensure that the taskStatus is set (to TAKEN) only if the current value is as expected (from PENDING). It also ensures that the update is atomic. This prevents more-than-once processing of a queue task.

{
  "TableName": "QueueTable",
  "Key": {
    "taskId": {
      "S": "task-123"
    }
  },
  "UpdateExpression": "set taskStatus = :toTaskStatus, taskUpdated = :taskUpdated",
  "ConditionExpression": "taskStatus = :fromTaskStatus",
  "ExpressionAttributeValues": {
    ":fromTaskStatus": {
      "S": "PENDING"
    },
    ":toTaskStatus": {
      "S": "TAKEN"
    },
    ":taskUpdated": {
      "N": "1623241938151"
    }
  }
}
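
The effect of the condition expression can be illustrated with an in-memory compare-and-set. This sketch is an analogy rather than DynamoDB code: the transition function is hypothetical, and where it returns false, DynamoDB would reject the update with a ConditionalCheckFailedException.

```javascript
// Sets taskStatus to toStatus only if its current value is fromStatus.
const transition = (table, taskId, fromStatus, toStatus, nowMs) => {
    const item = table.get(taskId);
    if (!item || item.taskStatus !== fromStatus) {
        // DynamoDB would fail the conditional update at this point.
        return false;
    }
    item.taskStatus = toStatus;
    item.taskUpdated = nowMs;
    return true;
};

const table = new Map([
    ["task-123", { taskId: "task-123", taskStatus: "PENDING", taskUpdated: 0 }]
]);

// The first worker takes the task. A second attempt is rejected.
console.log(transition(table, "task-123", "PENDING", "TAKEN", 1623241938151));
console.log(transition(table, "task-123", "PENDING", "TAKEN", 1623241938152));
```

A worker that receives a rejection skips the task, because another invocation has already taken it.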

Conclusion

This post describes how to implement a LIFO queue with AWS Serverless technologies, using an example application. Newer tasks in the queue are prioritized over earlier tasks. Tasks that cannot be processed are buffered and eventually load shed. This helps for use cases with heavy load and where newer queue tasks must take priority.

For more serverless learning resources, visit Serverless Land.

Building well-architected serverless applications: Managing application security boundaries – part 2

2021-06-29 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-managing-application-security-boundaries-part-2/

This series uses the AWS Well-Architected Tool with the Serverless Lens to help customers build and operate applications using best practices. In each post, I address the nine serverless-specific questions identified by the Serverless Lens along with the recommended best practices. See the introduction post for a table of contents and explanation of the example application.

Security question SEC2: How do you manage your serverless application’s security boundaries?

This post continues part 1 of this security question. Previously, I cover how to evaluate and define resource policies, showing what policies are available for various serverless services. I show some of the features of AWS Web Application Firewall (AWS WAF) to protect APIs. I then go through how to control network traffic at all layers. I explain how AWS Lambda functions connect to VPCs, and how to use private APIs and VPC endpoints. I walk through how to audit your traffic.

Required practice: Use temporary credentials between resources and components

Do not share credentials and permissions policies between resources to maintain a granular segregation of permissions and improve the security posture. Use temporary credentials that are frequently rotated and that have policies tailored to the access the resource needs.

Use dynamic authentication when accessing components and managed services

AWS Identity and Access Management (IAM) roles allow your applications to access AWS services securely without requiring you to manage or hardcode the security credentials. When you use a role, you don’t have to distribute long-term credentials such as a user name and password, or access keys. Instead, the role supplies temporary permissions that applications can use when they make calls to other AWS resources. When you create a Lambda function, for example, you specify an IAM role to associate with the function. The function can then use the role-supplied temporary credentials to sign API requests.

Use IAM for authorizing access to AWS managed services such as Lambda or Amazon S3. Lambda also assumes IAM roles, exposing and rotating temporary credentials to your functions. This enables your application code to access AWS services.

Use IAM to authorize access to internal or private Amazon API Gateway API consumers. See this list of AWS services that work with IAM.

Within the serverless airline example used in this series, the loyalty service uses a Lambda function to fetch loyalty points and next tier progress. AWS AppSync acts as the client using an HTTP resolver, via an API Gateway REST API /loyalty/{customerId}/get resource, to invoke the function.

To ensure only AWS AppSync is authorized to invoke the API, IAM authorization is set within the API Gateway method request.

Viewing API Gateway IAM authorization

The IAM role specifies that appsync.amazonaws.com can perform an execute-api:Invoke on the specific API Gateway resource arn:aws:execute-api:${AWS::Region}:${AWS::AccountId}:${LoyaltyApi}/*/*/*
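
Expressed as IAM policy documents, such a role could look like the following sketch. The builder function buildAppSyncInvokeRole is hypothetical, and the Region, account ID, and API ID are placeholder values. The trust policy and permissions policy are shown as plain objects that could be serialized to JSON.

```javascript
// Hypothetical builder for the two policy documents attached to the role.
const buildAppSyncInvokeRole = (region, accountId, apiId) => ({
    // Trust policy: only the AWS AppSync service can assume the role.
    assumeRolePolicy: {
        Version: "2012-10-17",
        Statement: [{
            Effect: "Allow",
            Principal: { Service: "appsync.amazonaws.com" },
            Action: "sts:AssumeRole"
        }]
    },
    // Permissions policy: the role can invoke only this API.
    permissionsPolicy: {
        Version: "2012-10-17",
        Statement: [{
            Effect: "Allow",
            Action: "execute-api:Invoke",
            Resource: `arn:aws:execute-api:${region}:${accountId}:${apiId}/*/*/*`
        }]
    }
});

// Placeholder values (assumptions) for illustration.
const role = buildAppSyncInvokeRole("eu-west-1", "111122223333", "a1b2c3d4e5");
console.log(JSON.stringify(role, null, 2));
```

The three wildcards in the resource ARN cover any stage, HTTP method, and resource path of that one API. You can narrow them to grant access to a single method or path.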

For more information, see “Using an IAM role to grant permissions to applications”.

Use a framework such as the AWS Serverless Application Model (AWS SAM) to deploy your applications. This ensures that AWS resources are provisioned with unique per resource IAM roles. For example, AWS SAM automatically creates unique IAM roles for every Lambda function you create.

Best practice: Design smaller, single purpose functions

Creating smaller, single purpose functions enables you to keep your permissions aligned to least privileged access. This reduces the risk of compromise since the function does not require access to more than it needs.

Create single purpose functions with their own IAM role

Single purpose Lambda functions allow you to create IAM roles that are specific to your access requirements. For example, a large multipurpose function might need access to multiple AWS resources such as Amazon DynamoDB, Amazon S3, and Amazon Simple Queue Service (SQS). Single purpose functions would not need access to all of them at the same time.

With smaller, single purpose functions, it’s often easier to identify the specific resources and access requirements, and grant only those permissions. Additionally, new features are usually implemented by new functions in this architectural design. You can specifically grant permissions in new IAM roles for these functions.

Avoid sharing IAM roles with multiple cloud resources. As permissions are added to the role, these are shared across all resources using this role. For example, use one dedicated IAM role per Lambda function. This allows you to control permissions more intentionally. Even if some functions have the same policy initially, always separate the IAM roles to ensure least privilege policies.

Use least privilege access policies with your users and roles

When you create IAM policies, follow the standard security advice of granting least privilege, or granting only the permissions required to perform a task. Determine what users (and roles) must do and then craft policies that allow them to perform only those tasks.

Start with a minimum set of permissions and grant additional permissions as necessary. Doing so is more secure than starting with permissions that are too lenient and then trying to tighten them later. In the unlikely event of misused credentials, credentials will only be able to perform limited interactions.

To control access to AWS resources, AWS SAM uses the same mechanisms as AWS CloudFormation. For more information, see “Controlling access with AWS Identity and Access Management” in the AWS CloudFormation User Guide.

For a Lambda function, AWS SAM scopes the permissions of your Lambda functions to the resources that are used by your application. You add IAM policies as part of the AWS SAM template. The policies property can be the name of AWS managed policies, inline IAM policy documents, or AWS SAM policy templates.

For example, the serverless airline has a ConfirmBooking Lambda function that has UpdateItem permissions to the specific DynamoDB BookingTable resource.

Parameters:
    BookingTable:
        Type: AWS::SSM::Parameter::Value<String>
        Description: Parameter Name for Booking Table
Resources:
    ConfirmBooking:
        Type: AWS::Serverless::Function
        Properties:
            FunctionName: !Sub ServerlessAirline-ConfirmBooking-${Stage}
            Policies:
                - Version: "2012-10-17"
                  Statement:
                      Action: dynamodb:UpdateItem
                      Effect: Allow
                      Resource: !Sub "arn:${AWS::Partition}:dynamodb:${AWS::Region}:${AWS::AccountId}:table/${BookingTable}"

One of the fastest ways to scope permissions appropriately is to use AWS SAM policy templates. You can reference these templates directly in the AWS SAM template for your application, providing custom parameters as required.

The serverless patterns collection allows you to build integrations quickly using AWS SAM and AWS Cloud Development Kit (AWS CDK) templates.

The booking service uses the SNSPublishMessagePolicy. This policy gives permission to the NotifyBooking Lambda function to publish a message to an Amazon Simple Notification Service (Amazon SNS) topic.

    BookingTopic:
        Type: AWS::SNS::Topic

    NotifyBooking:
        Type: AWS::Serverless::Function
        Properties:
            Policies:
                - SNSPublishMessagePolicy:
                      TopicName: !Sub ${BookingTopic.TopicName}
        …

Auditing permissions and removing unnecessary permissions

Audit permissions regularly to help you identify unused permissions so that you can remove them. You can use last accessed information to refine your policies and allow access to only the services and actions that your entities use. Use the IAM console to view when an IAM role was last used.

IAM last used

Use IAM access advisor to review when an AWS service was last used by a specific IAM user or role. You can view last accessed information for IAM on the Access Advisor tab in the IAM console. Using this information, you can remove unused IAM policies and access from your IAM roles.

IAM access advisor

When creating and editing policies, you can validate them using IAM Access Analyzer, which provides over 100 policy checks. It generates security warnings when a statement in your policy allows access AWS considers overly permissive. Use the security warning’s actionable recommendations to help grant least privilege. To learn more about policy checks provided by IAM Access Analyzer, see “IAM Access Analyzer policy validation”.

With AWS CloudTrail, you can use CloudTrail event history to review individual actions your IAM role has performed in the past. Using this information, you can detect which permissions were actively used, and decide to remove permissions.

AWS CloudTrail

To work out which permissions you may need, you can generate IAM policies based on access activity. You configure an IAM role with broad permissions while the application is in development. Access Analyzer reviews your CloudTrail logs. It generates a policy template that contains the permissions that the role used in your specified date range. Use the template to create a policy that grants only the permissions needed to support your specific use case. For more information, see “Generate policies based on access activity”.

IAM Access Analyzer

Conclusion

Managing your serverless application’s security boundaries ensures isolation for, within, and between components. In this post, I continue from part 1, looking at using temporary credentials between resources and components. I cover why smaller, single purpose functions are better from a security perspective, and how to audit permissions. I show how to use AWS SAM to create per-function IAM roles.

For more serverless learning resources, visit https://serverlessland.com.

Monitoring and troubleshooting serverless data analytics applications

2021-06-28 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/monitoring-and-troubleshooting-serverless-data-analytics-applications/

This series is about building serverless solutions in streaming data workloads. The application example used in this series is Alleycat, which allows bike racers to compete with each other virtually on home exercise bikes.

The first four posts have explored the architecture behind the application, which is enabled by Amazon Kinesis, Amazon DynamoDB, and AWS Lambda. This post explains how to monitor and troubleshoot issues that are common in streaming applications.

To set up the example, visit the GitHub repo and follow the instructions in the README.md file. Note that this walkthrough uses services that are not covered by the AWS Free Tier and incur cost.

Monitoring the Alleycat application

The business requirements for Alleycat state that it must handle up to 1,000 simultaneous racers. With each racer emitting a message every second, each 5-minute race results in 300,000 messages.
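The message volume above follows from simple multiplication. A minimal Python sketch of the arithmetic, using a hypothetical helper name:

```python
def messages_per_race(racers: int, messages_per_second: int, race_seconds: int) -> int:
    """Total messages emitted when every racer sends at a fixed rate for the whole race."""
    return racers * messages_per_second * race_seconds


# 1,000 racers, one message per second each, for a 5-minute (300-second) race.
total = messages_per_race(racers=1000, messages_per_second=1, race_seconds=5 * 60)
print(total)  # 300000
```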

While the architecture can support this throughput, the settings for each service determine how the workload scales up. The deployment templates in the GitHub repo do not use sufficiently high settings to handle this amount of data. In this section, I show how this results in errors and what steps you can take to resolve the issues. To start, I run the simulator for several races with the maximum racers configuration set to 1,000.

Monitoring the Kinesis stream

The monitoring tab of the Kinesis stream provides visualizations of stream metrics. This immediately shows that there is a problem in the application when running at full capacity:

The iterator age is growing, indicating that the data consumers are falling behind the data producers. The Get records graph also shows the number of records in the stream growing.
The Incoming data (count) metric shows the number of separate records ingested by the stream. The red line indicates the maximum capacity of this single-shard stream. With 1,000 active racers, this is almost at full capacity.
However, the Incoming data – sum (bytes) graph shows that the total amount of data ingested by the stream is currently well under the maximum level shown by the red line.

There are two solutions for improving the capacity on the stream. First, the data producer application (the Alleycat frontend) could combine messages before sending. The stream is currently reaching the maximum number of messages per second, but the total number of bytes ingested is significantly below the maximum. This action improves message packing but increases latency since the frontend waits to group messages.
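One way to picture combining messages is to pack several small messages into a single record. The following Python sketch is an illustration only: the function name, the newline delimiter, and the size limit parameter are assumptions, and the Alleycat frontend does not necessarily work this way. It assumes that individual messages contain no newline characters, so a consumer can split each record on newlines.

```python
from typing import Iterable, List


def pack_messages(messages: Iterable[bytes], max_record_bytes: int) -> List[bytes]:
    """Group newline-delimited messages into records of at most max_record_bytes.

    A single message larger than the limit is emitted alone in its own record.
    """
    records: List[bytes] = []
    current: List[bytes] = []
    current_size = 0
    for message in messages:
        # One extra byte for the newline separator when the record is not empty.
        added = len(message) + (1 if current else 0)
        if current and current_size + added > max_record_bytes:
            records.append(b"\n".join(current))
            current, current_size = [], 0
            added = len(message)
        current.append(message)
        current_size += added
    if current:
        records.append(b"\n".join(current))
    return records
```

With five 10-byte messages and a 32-byte limit, this produces two records instead of five, which reduces the record count at the cost of waiting for messages to accumulate.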

Alternatively, you can add capacity by resharding. This enables you to increase (or decrease) the number of shards in a stream to adapt to the rate of data flowing through the application. You can do this with the UpdateShardCount API action. The existing stream goes into an Updating status and the stream scales by splitting shards. This creates two new child shards that split the partition keyspace of the parent. It also results in another, separate Lambda consumer for the new shard.
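As a hedged sketch, the following Python function builds the parameters for an UpdateShardCount request without calling AWS. The function name is hypothetical. The clamping reflects the documented rule that a single call cannot more than double, or go below half of, the current shard count; other quotas (such as the number of resharding operations per day) are not modeled here.

```python
import math
from typing import Any, Dict


def build_update_shard_count_request(
    stream_name: str, current_shards: int, desired_shards: int
) -> Dict[str, Any]:
    """Return UpdateShardCount parameters, clamped to what one call accepts."""
    if current_shards < 1 or desired_shards < 1:
        raise ValueError("shard counts must be at least 1")
    # One call can at most double the shard count, or reduce it to half.
    lowest = math.ceil(current_shards / 2)
    highest = current_shards * 2
    target = min(max(desired_shards, lowest), highest)
    return {
        "StreamName": stream_name,
        "TargetShardCount": target,
        "ScalingType": "UNIFORM_SCALING",
    }
```

If you use boto3, the resulting dictionary can be passed to the Kinesis client as `update_shard_count(**request)`. Reaching a target that is more than double the current count takes several calls.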

Monitoring the Lambda function

The monitoring tab of the consuming Lambda function provides visualization of metrics that can highlight problems in the workload. At full capacity, the monitoring highlights issues to resolve:

The Duration chart shows that the function is exceeding its 15-second timeout, when the function normally finishes in under a second. This typically indicates that there are too many records to process in a single batch or throttling is occurring downstream.
The Error count metric is growing, which highlights either logical errors in the code or errors from API calls to downstream resources.
The IteratorAge metric appears for Lambda functions that are consuming from streams. In this case, the growing metric confirms that data consumption is falling behind data production in the stream.
Concurrent executions remain at 1 throughout. This is set by the parallelization factor in the event source mapping and can be increased up to 10.

Monitoring the DynamoDB table

The metric tab on the application’s table in the DynamoDB console provides visualizations for the performance of the service:

The consumed Read usage is well within the provisioned maximum and there is no read throttling on the table.
Consumed Write usage, shown in blue, is frequently bursting through the provisioned capacity.
The number of Write throttled requests confirms that the DynamoDB service is throttling requests since the table is over capacity.

You can resolve this issue by increasing the provisioned throughput on the table and related global secondary indexes. Write capacity units (WCUs) provide 1 KB of write throughput per second. You can set this value manually, use automatic scaling to match varying throughput, or enable on-demand mode. Read more about the pricing models for each to determine the best approach for your workload.
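To estimate how much write capacity to provision, you can multiply the write rate by the number of 1 KB units each item consumes, rounding each item up to the next whole unit. This Python sketch assumes standard (non-transactional) writes and covers the base table only; global secondary indexes consume their own write capacity. The function name and the sample item sizes are illustrative.

```python
import math


def required_wcus(writes_per_second: int, item_size_bytes: int) -> int:
    """Estimate WCUs for standard writes: each write costs one WCU per 1 KB, rounded up."""
    if writes_per_second < 0 or item_size_bytes < 1:
        raise ValueError("writes_per_second must be >= 0 and item_size_bytes >= 1")
    wcus_per_write = math.ceil(item_size_bytes / 1024)
    return writes_per_second * wcus_per_write


# 1,000 writes per second of small items (assumed 300 bytes each).
print(required_wcus(1000, 300))  # 1000
```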

Monitoring Kinesis Data Streams

Kinesis Data Streams ingests data into shards, which are fixed capacity sequences of records, up to 1,000 records or 1 MB per second. There is no limit to the amount of data held within a stream but there is a configurable retention period. By default, Kinesis stores records for 24 hours but you can increase this up to 365 days as needed.
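Because a shard is limited by both record count and bytes per second, the number of shards a workload needs is set by whichever limit is reached first. A minimal Python sketch under stated assumptions: the function name is hypothetical, 1 MB is taken as 1024 * 1024 bytes, and the average record size is a value you would measure for your own workload.

```python
import math

RECORDS_PER_SHARD_PER_SECOND = 1000
BYTES_PER_SHARD_PER_SECOND = 1024 * 1024  # assumption: 1 MB taken as 1 MiB


def required_shards(records_per_second: int, average_record_bytes: int) -> int:
    """Shards needed for ingestion, based on the tighter of the two per-shard limits."""
    by_count = math.ceil(records_per_second / RECORDS_PER_SHARD_PER_SECOND)
    by_bytes = math.ceil(
        records_per_second * average_record_bytes / BYTES_PER_SHARD_PER_SECOND
    )
    return max(1, by_count, by_bytes)


# 1,000 small records per second (assumed 200 bytes each) fit in one shard,
# but only just: the record-count limit is fully used.
print(required_shards(1000, 200))  # 1
```

In this example the byte limit is far from being reached, which matches the pattern described for the single-shard Alleycat stream: the record count is the constraint, not the data volume.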

Kinesis is integrated with Amazon CloudWatch. Basic metrics are published every minute, and you can optionally enable enhanced metrics for an additional charge. In this section, I review the most commonly used metrics for monitoring the health of streams in your application.

Metrics for monitoring data producers

When data producers are throttled, they cannot put new records onto a Kinesis stream. Use the WriteProvisionedThroughputExceeded metric to detect if producers are throttled. If this is more than zero, producers are being throttled and some records are not reaching the stream. Monitoring the Average statistic for this metric can help you determine if your producers are healthy.
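One way to act on this metric is a CloudWatch alarm that fires whenever the value rises above zero. The following Python sketch only builds the alarm parameters; it does not call AWS. The function name, the alarm naming scheme, and the choice to treat missing data as healthy are assumptions for illustration.

```python
from typing import Any, Dict


def write_throttle_alarm(stream_name: str) -> Dict[str, Any]:
    """Build PutMetricAlarm parameters for producer throttling on one stream."""
    return {
        "AlarmName": f"{stream_name}-write-throttled",  # hypothetical naming scheme
        "Namespace": "AWS/Kinesis",
        "MetricName": "WriteProvisionedThroughputExceeded",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "Statistic": "Average",
        "Period": 60,  # basic stream metrics are published every minute
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # assumption: no data means no throttling
    }
```

If you use boto3, these parameters can be passed to the CloudWatch client as `put_metric_alarm(**params)`. You would typically also add an alarm action, which is omitted here.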

When producers succeed in sending data to a stream, the PutRecord.Success and PutRecords.Success metrics are incremented. Monitoring for spikes or drops in these metrics can help you monitor the health of producers and catch problems early. There is a separate metric for each of the two API calls, so watch the Average statistic for whichever call your application uses.

Metrics for monitoring data consumers

When data consumers are throttled or start to generate errors, Kinesis continues to accept new records from producers. However, there is growing latency between when records are written and when they are consumed for processing.

Using the GetRecords.IteratorAgeMilliseconds metric, you can measure the difference between the age of the last record consumed and the latest record put to the stream. It is important to monitor the iterator age. If the age is high in relation to the stream’s retention period, you can lose data as records expire from the stream. This value should generally not exceed 50% of the stream’s retention period – when the value reaches 100% of the stream retention period, data is lost.
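The 50% and 100% thresholds can be expressed as a small check. This Python sketch is illustrative: the function name and the status labels are assumptions, and the default retention period is the 24-hour default for a stream.

```python
def iterator_age_status(iterator_age_ms: int, retention_hours: int = 24) -> str:
    """Classify iterator age relative to the stream's retention period."""
    retention_ms = retention_hours * 60 * 60 * 1000
    ratio = iterator_age_ms / retention_ms
    if ratio >= 1.0:
        return "data_loss"  # records are expiring before they are consumed
    if ratio > 0.5:
        return "at_risk"  # above the 50% guideline
    return "ok"


thirteen_hours_ms = 13 * 60 * 60 * 1000
print(iterator_age_status(thirteen_hours_ms))  # at_risk
```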

If the iterator age is growing, one temporary solution is to increase the retention time of the stream. This gives you more time to resolve the issue before losing data. A more permanent solution is to add more consumers to keep up with data production, or resolve any errors that are slowing consumers.

When consumers exceed the provisioned read throughput of the stream, they are throttled and cannot read from the stream. The ReadProvisionedThroughputExceeded metric counts these throttled calls. Throttling results in a growth of records in the stream waiting for processing. Monitor the Average statistic for this metric and aim for values as close to 0 as possible.

The GetRecords.Success metric is the consumer-side equivalent of PutRecords.Success. Monitor this value for spikes or drops to ensure that your consumers are healthy. The Average is usually the most useful statistic for this purpose.

Increasing data processing throughput for Kinesis Data Streams

Adjusting the parallelization factor

Kinesis invokes Lambda consumers every second with a configurable batch size of messages. It’s important that the processing in the function keeps pace with the rate of traffic to avoid a growing iterator age. For compute intensive functions, you can increase the memory allocated in the function, which also increases the amount of virtual CPU available. This can help reduce the duration of a processing function.

If this is not possible or the function is falling behind data production in the stream, consider increasing the parallelization factor. By default, this is set to 1, meaning that each shard has a single instance of a Lambda function it invokes. You can increase this up to 10, which results in multiple instances of the consumer function processing additional batches of messages.
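The effect of the parallelization factor on concurrency is multiplicative: each shard can have up to that many batches processed at the same time. A minimal Python sketch with a hypothetical function name:

```python
def max_concurrent_invocations(shard_count: int, parallelization_factor: int = 1) -> int:
    """Upper bound on concurrent consumer invocations for one event source mapping."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("parallelization_factor must be between 1 and 10")
    return shard_count * parallelization_factor


print(max_concurrent_invocations(1))      # 1
print(max_concurrent_invocations(2, 10))  # 20
```

The first call corresponds to a single-shard stream with the default setting, where concurrent executions stay at 1.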

Using enhanced fan-out to reduce iterator age

Standard consumers use a pull model over HTTP to fetch batches of records. Each consumer operates in serial. A stream with five consumers averages 200 ms of latency each, meaning it takes up to 1 second for all five to receive batches of records.

You can improve the overall latency by removing any unnecessary data consumers. If you use Kinesis Data Firehose and Kinesis Data Analytics on a stream, these count as consumers too. If you can remove subscribers, this helps with overall data consumption throughput.

If the workload needs all of the existing subscribers, use enhanced fan-out (EFO). EFO consumers use a push model over HTTP/2 and are independent of each other. With EFO, the same five consumers in the previous example would receive batches of messages in parallel, using dedicated throughput. Overall latency averages 70 ms and typically data delivery speed is improved by up to 65%. There is an additional charge for this feature.
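The latency figures quoted for standard and EFO consumers can be compared with a simplified model. This Python sketch uses only the averages from the text (200 ms per standard consumer operating in serial, 70 ms for EFO consumers in parallel). The function name is hypothetical and real latencies vary by workload.

```python
STANDARD_LATENCY_MS = 200  # average per standard consumer, from the text
EFO_LATENCY_MS = 70        # average for enhanced fan-out consumers, from the text


def worst_case_delivery_ms(consumers: int, enhanced_fan_out: bool) -> int:
    """Time until every consumer has received a batch, under the simplified model."""
    if consumers < 1:
        raise ValueError("consumers must be at least 1")
    if enhanced_fan_out:
        return EFO_LATENCY_MS  # consumers receive data in parallel
    return consumers * STANDARD_LATENCY_MS  # consumers are served in serial


print(worst_case_delivery_ms(5, enhanced_fan_out=False))  # 1000
print(worst_case_delivery_ms(5, enhanced_fan_out=True))   # 70

# Going from 200 ms to 70 ms per consumer is a 65% reduction.
improvement_percent = (STANDARD_LATENCY_MS - EFO_LATENCY_MS) * 100 // STANDARD_LATENCY_MS
print(improvement_percent)  # 65
```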

To learn more about processing streaming data with Lambda, see this AWS Online Tech Talk presentation.

Conclusion

In this post, I show how the existing settings in the Alleycat application are not sufficient for handling the expected amount of traffic. I walk through the metrics visualizations for Kinesis Data Streams, Lambda, and DynamoDB to find which quotas should be increased.

I explain which CloudWatch metrics can be used with Kinesis Data Streams to ensure that data producers and data consumers are healthy. Finally, I show how you can use the parallelization factor and enhanced fan-out features to increase the throughput of data consumers.

For more serverless learning resources, visit Serverless Land.