
Custom logging with AWS Batch

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/custom-logging-with-aws-batch/

This post was written by Christian Kniep, Senior Developer Advocate for HPC and AWS Batch. 

For HPC workloads, visibility into job logs is important: it helps you debug a job that failed, and it also gives you insight into a running job so you can track its trajectory, influence the configuration of the next job, or terminate a job that has gone off track.

With AWS Batch, customers are able to run batch workloads at scale, reliably and with ease, as this managed service takes care of the undifferentiated heavy lifting. Customers can then focus on submitting jobs and getting work done. Customers told us that, at a certain scale, the single logging driver available within AWS Batch made it hard to separate logs because they all ended up in the same log group in Amazon CloudWatch.

With the new release of custom logging driver support, customers can now adjust how job output is logged. They can not only customize the Amazon CloudWatch settings, but also enable external logging frameworks such as Splunk, fluentd, json-file, syslog, gelf, and journald.

This allows AWS Batch jobs to use the existing systems they are accustomed to, with fine-grained control of the log data for debugging and access control purposes.

In this blog, I show the benefits of custom logging with AWS Batch by adjusting the log targets for jobs. The first example will customize the Amazon CloudWatch log group, the second will log to Splunk, an external logging service.

Example setup

To showcase this new feature, I use the AWS Command Line Interface (CLI) to set up the following:

  1. IAM roles, policies, and profiles to grant access and permissions
  2. A compute environment to provide the compute resources to run jobs
  3. A job queue, which supervises the job execution and schedules jobs on a compute environment
  4. A job definition, which uses a simple job to demonstrate how the new configuration can be applied

Once those tasks are completed, I submit a job and send logs to a customized CloudWatch log-group and Splunk.

Prerequisite

To make things easier, I first set a couple of environment variables to have the information handy for later use. I use the following code to set up the environment variables.

# in case it is not already installed
sudo yum install -y jq
# Collect network and account details from the EC2 instance metadata service
export MD_URL=http://169.254.169.254/latest/meta-data
export IFACE=$(curl -s ${MD_URL}/network/interfaces/macs/)
export SUBNET_ID=$(curl -s ${MD_URL}/network/interfaces/macs/${IFACE}/subnet-id)
export VPC_ID=$(curl -s ${MD_URL}/network/interfaces/macs/${IFACE}/vpc-id)
export AWS_REGION=$(curl -s ${MD_URL}/placement/availability-zone | sed 's/[a-z]$//')
export AWS_ACCT_ID=$(curl -s ${MD_URL}/identity-credentials/ec2/info |jq -r .AccountId)
# Look up the default security group of the VPC
export AWS_SG_DEFAULT=$(aws ec2 describe-security-groups \
--filters Name=group-name,Values=default \
|jq -r '.SecurityGroups[0].GroupId')

IAM

When using the AWS Management Console, you must create IAM roles manually.

Trust Policies

An IAM role is defined to be used by a certain service. In the simplest case, you want a role to be used by Amazon EC2 – the service that provides compute capacity in the cloud. Which entity is able to use an IAM role is defined in its trust policy. To set up a trust policy for an IAM role, use the following code snippet.

cat > ec2-trust-policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "ec2.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}
EOF

Instance role

With the IAM trust policy, I now create an ecsInstanceRole and attach the pre-defined policy AmazonEC2ContainerServiceforEC2Role. This allows an instance to interact with Amazon ECS.

aws iam create-role --role-name ecsInstanceRole \
 --assume-role-policy-document file://ec2-trust-policy.json
aws iam create-instance-profile --instance-profile-name ecsInstanceProfile
aws iam add-role-to-instance-profile \
    --instance-profile-name ecsInstanceProfile \
    --role-name ecsInstanceRole
aws iam attach-role-policy --role-name ecsInstanceRole \
 --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

Service Role

The AWS Batch service uses a role to interact with different services. The trust relationship reflects that the AWS Batch service is going to assume this role.  You can set up this role with the following logic.

cat > svc-trust-policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "batch.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}
EOF
aws iam create-role --role-name AWSBatchServiceRole \
--assume-role-policy-document file://svc-trust-policy.json
aws iam attach-role-policy --role-name AWSBatchServiceRole \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole

In addition to interacting with Amazon ECS, the instance role can create and write to Amazon CloudWatch log groups. To control which log group names are used, a condition is attached.

Next, let us create and attach a policy that makes the use of a new log group possible.

cat > policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "logs:CreateLogGroup"
    ],
    "Resource": "*",
    "Condition": {
      "StringEqualsIfExists": {
        "batch:LogDriver": ["awslogs"],
        "batch:AWSLogsGroup": ["/aws/batch/custom/*"]
      }
    }
  }]
}
EOF
aws iam create-policy --policy-name batch-awslog-policy \
    --policy-document file://policy.json
aws iam attach-role-policy --policy-arn arn:aws:iam::${AWS_ACCT_ID}:policy/batch-awslog-policy --role-name ecsInstanceRole

At this point, I have created the IAM roles and policies so that the instances and the service are able to interact with the AWS APIs, including trust policies that define which services are meant to use them: EC2 for the ecsInstanceRole, and the AWS Batch service itself for the AWSBatchServiceRole.

Compute environment

Now, I am going to create a compute environment, which is going to spin up an instance (one vCPU target) to run the example job in.

cat > compute-environment.json << EOF
{
  "computeEnvironmentName": "od-ce",
  "type": "MANAGED",
  "state": "ENABLED",
  "computeResources": {
    "type": "EC2",
    "allocationStrategy": "BEST_FIT_PROGRESSIVE",
    "minvCpus": 1,
    "maxvCpus": 8,
    "desiredvCpus": 1,
    "instanceTypes": ["m5.xlarge"],
    "subnets": ["${SUBNET_ID}"],
    "securityGroupIds": ["${AWS_SG_DEFAULT}"],
    "instanceRole": "arn:aws:iam::${AWS_ACCT_ID}:instance-profile/ecsInstanceRole",
    "tags": {"Name": "aws-batch-compute"},
    "bidPercentage": 0
  },
  "serviceRole": "arn:aws:iam::${AWS_ACCT_ID}:role/AWSBatchServiceRole"
}
EOF
aws batch create-compute-environment --cli-input-json file://compute-environment.json  

Once this section is complete, a compute environment is spun up in the background. This will take a moment. You can use the following command to check on the status of the compute environment.

aws batch describe-compute-environments
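
If you only want the state and status fields, you can narrow the output down with jq (installed in the prerequisites). This is a small convenience sketch:

aws batch describe-compute-environments \
  --compute-environments od-ce \
  |jq -r '.computeEnvironments[] | .state + " / " + .status'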

Once it is enabled and valid, we can continue by setting up the job queue.

Job Queue

Now that I have a compute environment up and running, I will create a job queue which accepts job submissions and schedules the jobs on the compute environment.

cat > job-queue.json << EOF
{
  "jobQueueName": "jq",
  "state": "ENABLED",
  "priority": 1,
  "computeEnvironmentOrder": [{
    "order": 0,
    "computeEnvironment": "od-ce"
  }]
}
EOF
aws batch create-job-queue --cli-input-json file://job-queue.json

Job definition

The job definition is used as a template for jobs. This example runs a plain container and prints the environment variables. With the new release of AWS Batch, the logging driver awslogs now allows you to change the log group configuration within the job definition.

cat > job-definition.json << EOF
{
  "jobDefinitionName": "alpine-env",
  "type": "container",
  "containerProperties": {
    "image": "alpine",
    "vcpus": 1,
    "memory": 128,
    "command": ["env"],
    "readonlyRootFilesystem": true,
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-region": "${AWS_REGION}",
        "awslogs-group": "/aws/batch/custom/env-queue",
        "awslogs-create-group": "true"
      }
    }
  }
}
EOF
aws batch register-job-definition --cli-input-json file://job-definition.json

Job Submission

Using the above job definition, you can now submit a job.

aws batch submit-job \
  --job-name test-$(date +"%F_%H-%M-%S") \
  --job-queue arn:aws:batch:${AWS_REGION}:${AWS_ACCT_ID}:job-queue/jq \
  --job-definition arn:aws:batch:${AWS_REGION}:${AWS_ACCT_ID}:job-definition/alpine-env:1
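
The response from submit-job contains the job ID, which you can use to follow the job from the CLI as well. A quick sketch; substitute the jobId value returned by your call:

aws batch describe-jobs --jobs <jobId> |jq -r '.jobs[0].status'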

Now, you can check the ‘Log Group’ in CloudWatch. Go to the CloudWatch console and find the ‘Log Group’ section on the left.

log groups in cloudwatch

Now, click on the log group defined above, and you should see the output of the job. This allows you to debug if something within the container went wrong, or to process the logs and create alarms and reports.

cloudwatch log events
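
If you prefer to stay in the terminal, the same output can be read with the CloudWatch Logs CLI. This is a sketch; the first command lists the log streams in the group, and you pass one of the returned names to the second:

aws logs describe-log-streams \
  --log-group-name /aws/batch/custom/env-queue \
  |jq -r '.logStreams[].logStreamName'
aws logs get-log-events \
  --log-group-name /aws/batch/custom/env-queue \
  --log-stream-name <logStreamName>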

Splunk

Splunk is an established log engine for a broad set of customers. You can use the Docker container to set up a Splunk server quickly. More information can be found in the Splunk documentation. You need to configure the HTTP Event Collector, which provides you with a link and a token.

To send logs to Splunk, create an additional job-definition with the Splunk token and URL. Please adjust the splunk-url and splunk-token to match your Splunk setup.

{
  "jobDefinitionName": "alpine-splunk",
  "type": "container",
  "containerProperties": {
    "image": "alpine",
    "vcpus": 1,
    "memory": 128,
    "command": ["env"],
    "readonlyRootFilesystem": false,
    "logConfiguration": {
      "logDriver": "splunk",
      "options": {
        "splunk-url": "https://<splunk-url>",
        "splunk-token": "XXX-YYY-ZZZ"
      }
    }
  }
}

This forwards the logs to Splunk, as you can see in the following image.

forward to splunk

Conclusion

This blog post showed you how to apply custom logging to AWS Batch using the awslogs and Splunk logging drivers. While these are two important logging drivers, head over to the documentation to learn about fluentd, syslog, json-file, and the other drivers, and find the one that best matches your current logging infrastructure.

 

Why Deployment Requirements are Important When Making Architectural Choices

Post Syndicated from Yusuf Mayet original https://aws.amazon.com/blogs/architecture/why-deployment-requirements-are-important-when-making-architectural-choices/

Introduction

Too often, architects fall into the trap of thinking the architecture of an application is restricted to just the runtime part of the architecture. By doing this we focus on only a single customer (such as the application’s users and how they interact with the system) and we forget about other important customers like developers and DevOps teams. This means that requirements regarding deployment ease, deployment frequency, and observability are delegated to the back burner during design time and tacked on after the runtime architecture is built. This leads to increased costs and reduced ability to innovate.

In this post, I discuss the importance of key non-functional requirements, and how they can and should influence the target architecture at design time.

Architectural patterns

When building and designing new applications, we usually start by looking at the functional requirements, which will define the functionality and objective of the application. These are all the things that the users of the application expect, such as shopping online, searching for products, and ordering. We also consider aspects such as usability to ensure a great user experience (UX).

We then consider the non-functional requirements, the so-called “ilities,” which typically include requirements regarding scalability, availability, latency, etc. These are constraints around the functional requirements, like response times for placing orders or searching for products, which will define the expected latency of the system.

These requirements—both functional and non-functional together—dictate the architectural pattern we choose to build the application. These patterns include multi-tier, event-driven architecture, microservices, and others, and each one has benefits and limitations. For example, a microservices architecture allows for a system where services can be deployed and scaled independently, but this also introduces complexity around service discovery.

Aligning the architecture to technical users’ requirements

Amazon is a customer-obsessed organization, so it’s important for us to first identify who the main customers are at each point so that we can meet their needs. The customers of the functional requirements are the application users, so we need to ensure the application meets their needs. For the most part, we will ensure that the desired product features are supported by the architecture.

But who are the users of the architecture? Not the applications’ users—they don’t care if it’s monolithic or microservices based, as long as they can shop and search for products. The main customers of the architecture are the technical teams: the developers, architects, and operations teams that build and support the application. We need to work backwards from the customers’ needs (in this case the technical team), and make sure that the architecture meets their requirements. We have therefore identified three non-functional requirements that are important to consider when designing an architecture that can equally meet the needs of the technical users:

  1. Deployability: flow and agility to consistently deploy new features
  2. Observability: feedback about the state of the application
  3. Disposability: the ability to throw away resources and provision new ones quickly

Together these form part of the Developer Experience (DX), which is focused on providing developers with the APIs, documentation, and other technologies that make the system easy to understand and use. This ensures that we design with Day 2 operations in mind.

Deployability: Flow

There are many reasons that organizations embark on digital transformation journeys, which usually involve moving to the cloud and adopting DevOps. According to Stephen Orban, GM of AWS Data Exchange, in his book Ahead in the Cloud, faster product development is often a key motivator, meaning the most important non-functional requirement is achieving flow, the speed at which you can consistently deploy new applications, respond to competitors, and test and roll out new features. As well, the architecture needs to be designed upfront to support deployability. If the architectural pattern is a monolithic application, this will hamper the developers’ ability to quickly roll out new features to production. So we need to choose and design the architecture to support easy and automated deployments. Results from years of research prove that leaders use DevOps to achieve high levels of throughput:

Graphic - Using DevOps to achieve high levels of throughput

Decisions on the pace and frequency of deployments will dictate whether to use rolling, blue/green, or canary deployment methodologies. This will then inform the architectural pattern chosen for the application.

Using AWS, in order to achieve flow of deployability, we will use services such as AWS CodePipeline, AWS CodeBuild, AWS CodeDeploy, and AWS CodeStar.

Observability: feedback

Once you have achieved a rapid and repeatable flow of features into production, you need a constant feedback loop of logs and metrics in order to detect and avoid problems. Observability is a property of the architecture that will allow us to better understand the application across the delivery pipeline and into production. This requires that we design the architecture to ensure that health reports are generated to analyze and spot trends. This includes error rates and stats from each stage of the development process, how many commits were made, build duration, and frequency of deployments. This not only allows us to measure code characteristics such as test coverage, but also developer productivity.

On AWS, we can leverage Amazon CloudWatch to gather and search through logs and metrics, AWS X-Ray for tracing, and Amazon QuickSight as an analytics tool to measure CI/CD metrics.

Disposability: automation

In his book, Cloud Strategy: A Decision-based Approach to a Successful Cloud Journey, Gregor Hohpe, Enterprise Strategist at AWS, notes that cloud and automation add a new “-ility”: disposability, which is the ability to set up and dispose of new servers in an automated and pain-free manner. Having immutable, disposable infrastructure greatly enhances your ability to achieve high levels of deployability and flow, especially when used in a CI/CD pipeline, which can create new resources and kill off the old ones.

At AWS, we can achieve disposability with serverless using AWS Lambda, or with containers running on Amazon Elastic Container Service (ECS) or Amazon Elastic Kubernetes Service (EKS), or using AWS Auto Scaling with Amazon Elastic Compute Cloud (EC2).

Three different views of the architecture

Once we have designed an architecture that caters for deployability, observability, and disposability, it exposes three lenses across which we can view the architecture:

3 views of the architecture

  1. Build lens: the focus of this part of the architecture is on achieving deployability, with the objective to give the developers an easy-to-use, automated platform that builds, tests, and pushes their code into the different environments, in a repeatable way. Developers can push code changes more reliably and frequently, and the operations team can see greater stability because environments have standard configurations and rollback procedures are automated.
  2. Runtime lens: the focus is on the users of the application and on maximizing their experience by making the application responsive and highly available.
  3. Operate lens: the focus is on achieving observability for the DevOps teams, allowing them to have complete visibility into each part of the architecture.

Summary

When building and designing new applications, the functional requirements (such as UX) are usually the primary drivers for choosing and defining the architecture to support those requirements. In this post I have discussed how DX characteristics like deployability, observability, and disposability are not just operational concerns that get tacked on after the architecture is chosen. Rather, they should be as important as the functional requirements when choosing the architectural pattern. This ensures that the architecture can support the needs of both the developers and users, increasing quality and our ability to innovate.

Enhanced monitoring and automatic scaling for Apache Flink

Post Syndicated from Karthi Thyagarajan original https://aws.amazon.com/blogs/big-data/enhanced-monitoring-and-automatic-scaling-for-apache-flink/

Thousands of developers use Apache Flink to build streaming applications to transform and analyze data in real time. Apache Flink is an open-source framework and engine for processing data streams. It’s highly available and scalable, delivering high throughput and low latency for the most demanding stream-processing applications. Monitoring and scaling your applications is critical to keep your applications running successfully in a production environment.

Amazon Kinesis Data Analytics reduces the complexity of building and managing Apache Flink applications. Amazon Kinesis Data Analytics manages the underlying Apache Flink components that provide durable application state, metrics and logs, and more. Kinesis Data Analytics recently announced new Amazon CloudWatch metrics and the ability to create custom metrics to provide greater visibility into your application.

In this post, we show you how to easily monitor and automatically scale your Apache Flink applications with Amazon Kinesis Data Analytics. We walk through three examples. First, we create a custom metric in the Kinesis Data Analytics for Apache Flink application code. Second, we use application metrics to automatically scale the application. Finally, we share a CloudWatch dashboard for monitoring your application and recommend metrics that you can alarm on.

Custom metrics

Kinesis Data Analytics uses Apache Flink’s metrics system to send custom metrics to CloudWatch from your applications. For more information, see Using Custom Metrics with Amazon Kinesis Data Analytics for Apache Flink.

We use a basic word count program to illustrate the use of custom metrics. The following code shows how to extend RichFlatMapFunction to track the number of words it sees. This word count is then surfaced via the Flink metrics API.

private static final class Tokenizer extends RichFlatMapFunction<String, Tuple2<String, Integer>> {

    private transient Counter counter;

    @Override
    public void open(Configuration config) {
        this.counter = getRuntimeContext().getMetricGroup()
                .addGroup("kinesisanalytics")
                .addGroup("Service", "WordCountApplication")
                .addGroup("Tokenizer")
                .counter("TotalWords");
    }

    @Override
    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
        // normalize and split the line
        String[] tokens = value.toLowerCase().split("\\W+");

        // emit the pairs
        for (String token : tokens) {
            if (token.length() > 0) {
                counter.inc();
                out.collect(new Tuple2<>(token, 1));
            }
        }
    }
}

Custom metrics emitted through the Flink metrics API are forwarded to CloudWatch metrics by Kinesis Data Analytics for Apache Flink. The following screenshot shows the word count metric in CloudWatch.
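
You can also confirm from the command line that the custom metric is being published. This is a sketch and assumes the metric lands in the AWS/KinesisAnalytics namespace that Kinesis Data Analytics uses for application metrics:

aws cloudwatch list-metrics \
  --namespace AWS/KinesisAnalytics \
  --metric-name TotalWords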

Custom automatic scaling

This section describes how to implement an automatic scaling solution for Kinesis Data Analytics for Apache Flink based on CloudWatch metrics. You can configure Kinesis Data Analytics for Apache Flink to perform CPU-based automatic scaling. However, you can automatically scale your application based on something other than CPU utilization. To perform custom automatic scaling, use Application Auto Scaling with the appropriate metric.

For applications that read from a Kinesis stream source, you can use the metric millisBehindLatest. This captures how far behind your application is from the head of the stream.

A target tracking policy is one of two scaling policy types offered by Application Auto Scaling. You can specify a threshold value around which to vary the degree of parallelism of your Kinesis Data Analytics application. The following sample code on GitHub configures Application Auto Scaling when millisBehindLatest for the consuming application exceeds 1 minute. This increases the parallelism, which increases the number of KPUs.
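
The sample uses AWS CloudFormation, but the underlying Application Auto Scaling calls look roughly like the following CLI sketch. The resource ID, dimensions, and thresholds are placeholders to adapt to your own scalable target endpoint and application:

# Register the custom resource (the API Gateway endpoint fronting the scaling Lambda function) as a scalable target
aws application-autoscaling register-scalable-target \
  --service-namespace custom-resource \
  --scalable-dimension "custom-resource:ResourceType:Property" \
  --resource-id "https://abcd1234.execute-api.us-east-1.amazonaws.com/prod/scalableTargetDimensions/flinkParallelism" \
  --min-capacity 1 --max-capacity 8

# Attach a target tracking policy that tries to keep millisBehindLatest around one minute
cat > scaling-policy.json << 'EOF'
{
  "TargetValue": 60000,
  "CustomizedMetricSpecification": {
    "MetricName": "millisBehindLatest",
    "Namespace": "AWS/KinesisAnalytics",
    "Dimensions": [{ "Name": "Application", "Value": "my-flink-app" }],
    "Statistic": "Maximum"
  },
  "ScaleInCooldown": 600,
  "ScaleOutCooldown": 600
}
EOF
aws application-autoscaling put-scaling-policy \
  --policy-name flink-millis-behind-latest \
  --service-namespace custom-resource \
  --scalable-dimension "custom-resource:ResourceType:Property" \
  --resource-id "https://abcd1234.execute-api.us-east-1.amazonaws.com/prod/scalableTargetDimensions/flinkParallelism" \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration file://scaling-policy.json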

The following diagram shows how Application Auto Scaling, used with Amazon API Gateway and AWS Lambda, scales a Kinesis Data Analytics application in response to a CloudWatch alarm.

The sample code includes examples for automatic scaling based on the target tracking policy and step scaling policy.

Automatic scaling solution components

The following is a list of key components used in the automatic scaling solution. You can find these components in the AWS CloudFormation template in the GitHub repo accompanying this post.

  • Application Auto Scaling scalable target – A scalable target is a resource that Application Auto Scaling can scale in and out. It’s uniquely identified by the combination of resource ID, scalable dimension, and namespace. For more information, see RegisterScalableTarget.
  • Scaling policy – The scaling policy defines how your scalable target should scale. As described in the PutScalingPolicy documentation, Application Auto Scaling supports two policy types: TargetTrackingScaling and StepScaling. In addition, you can configure a scheduled scaling action using Application Auto Scaling. If you specify TargetTrackingScaling, Application Auto Scaling also creates corresponding CloudWatch alarms for you.
  • API Gateway – Because the scalable target is a custom resource, we have to specify an API endpoint. Application Auto Scaling invokes this to perform scaling and get information about the current state of our scalable resource. We use an API Gateway and Lambda function to implement this endpoint.
  • Lambda – API Gateway invokes the Lambda function. This is called by Application Auto Scaling to perform the scaling actions. It also fetches information such as current scale value and returns information requested by Application Auto Scaling.

Additionally, you should be aware of the following:

  • When scaling out or in, this sample only updates the overall parallelism. It doesn’t adjust the parallelism per KPU.
  • When scaling occurs, the Kinesis Data Analytics application experiences downtime.
  • The throughput of a Flink application depends on many factors, such as complexity of processing and destination throughput. The step-scaling example assumes a relationship between incoming record throughput and scaling. The millisBehindLatest metric used for target tracking automatic scaling works the same way.
  • We recommend using the default scaling policy provided by Kinesis Data Analytics for CPU-based scaling, the target tracking auto scaling policy for the millisBehindLatest metric, and a step scaling auto scaling policy for a metric such as numRecordsInPerSecond. However, you can use any automatic scaling policy for the metric you choose.

CloudWatch operational dashboard

Customers often ask us about best practices and the operational aspects of Kinesis Data Analytics for Apache Flink. We created a CloudWatch dashboard that captures the key metrics to monitor. We categorize the most common metrics in this dashboard with the recommended statistics for each metric.

This GitHub repo contains a CloudFormation template to deploy the dashboard for any Kinesis Data Analytics for Apache Flink application. You can also deploy a demo application with the dashboard. The dashboard includes the following:

  • Application health metrics:
    • Use uptime to see how long the job has been running without interruption and downtime to determine if a job failed to run. Non-zero downtime can indicate issues with your application.
    • Higher-than-normal job restarts can indicate an unhealthy application.
    • Checkpoint information (size, duration, and number of failed checkpoints) can help you understand application health and progress. Increasing checkpoint duration values can signify application health problems like backpressure and the inability to keep up with input data. Increasing checkpoint size over time can point to an infinitely growing state that can lead to out-of-memory errors.
  • Resource utilization metrics:
    • You can check the CPU and heap memory utilization along with the thread count. You can also check the garbage collection time taken across all Flink task managers.
  • Flink application progress metrics:
    • numRecordsInPerSecond and numRecordsOutPerSecond show the number of records accepted and emitted per second.
    • numLateRecordsDropped shows the number of records this operator or task has dropped due to arriving late.
    • Input and output watermarks are valid only when using event time semantics. You can use the difference between these two values to calculate event time latency.
  • Source metrics:
    • The Kinesis Data Streams-specific metric millisBehindLatest shows how far behind the head of the stream the consumer is, indicating how far behind the current time the consumer is. We used this metric to demonstrate Application Auto Scaling earlier in this post.
    • The Kafka-specific metric recordsLagMax shows the maximum lag in terms of number of records for any partition in this window.

The dashboard contains useful metrics to gauge the operational health of a Flink application. You can modify the threshold, configure additional alarms, and add other system or custom metrics to customize the dashboard for your use. The following screenshot shows a section of the dashboard.
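
For example, an alarm on millisBehindLatest can notify you when the application falls more than a minute behind the stream. A sketch with placeholder dimension and SNS topic values:

aws cloudwatch put-metric-alarm \
  --alarm-name flink-millis-behind-latest \
  --namespace AWS/KinesisAnalytics \
  --metric-name millisBehindLatest \
  --dimensions Name=Application,Value=my-flink-app \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 5 \
  --threshold 60000 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:flink-alerts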

Summary

In this post, we covered how to use the enhanced monitoring features for Kinesis Data Analytics for Apache Flink applications. We created custom metrics for an Apache Flink application within application code and emitted it to CloudWatch. We also used Application Auto Scaling to scale an application. Finally, we shared a CloudWatch dashboard to monitor the operational health of Kinesis Data Analytics for Apache Flink applications. For more information about using Kinesis Data Analytics, see Getting Started with Amazon Kinesis Data Analytics.


About the Authors

Karthi Thyagarajan is a Principal Solutions Architect on the Amazon Kinesis team.

Deepthi Mohan is a Sr. TPM on the Amazon Kinesis Data Analytics team.

Troubleshooting Amazon API Gateway with enhanced observability variables

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/troubleshooting-amazon-api-gateway-with-enhanced-observability-variables/

Amazon API Gateway is often used for managing access to serverless applications. Additionally, it can help developers reduce code and increase security with features like AWS WAF integration and authorizers at the API level.

Because more is handled by API Gateway, developers tell us they would like to see more data points on the individual parts of the request. This data helps developers understand each phase of the API request and how it affects the request as a whole. In response to this request, the API Gateway team has added new enhanced observability variables to the API Gateway access logs. With these new variables, developers can troubleshoot on a more granular level to quickly isolate and resolve request errors and latency issues.

The phases of an API request

API Gateway divides requests into phases, reflected by the variables that have been added. Depending upon the features configured for the application, an API request goes through multiple phases. The phases appear in a specific order as follows:

Phases of an API request

  • WAF: the WAF phase only appears when an AWS WAF web access control list (ACL) is configured for enhanced security. During this phase, WAF rules are evaluated and a decision is made on whether to continue or cancel the request.
  • Authenticate: the authenticate phase is only present when AWS Identity and Access Management (IAM) authorizers are used. During this phase, the credentials of the signed request are verified. Access is granted or denied based on the client’s right to assume the access role.
  • Authorizer: the authorizer phase is only present when a Lambda, JWT, or Amazon Cognito authorizer is used. During this phase, the authorizer logic is processed to verify the user’s right to access the resource.
  • Authorize: the authorize phase is only present when a Lambda or IAM authorizer is used. During this phase, the results from the authenticate and authorizer phase are evaluated and applied.
  • Integration: during this phase, the backend integration processes the request.

Each phase can add latency to the request, return a status, or raise an error. To capture this data, API Gateway now provides enhanced observability variables based on each phase. The variables are named according to the phase they occur in and follow the naming structure $context.phase.property. Therefore, you can get data about WAF latency by using $context.waf.latency.

Some existing variables have also been given aliases to match this naming schema. For example, $context.integrationErrorMessage has a new alias of $context.integration.error. The resulting list of variables is as follows:

Phases and variables for API Gateway requests

API Gateway provides status, latency, and error data for each phase. In the authorizer and integration phases, there are additional variables you can use in logs. The $context.phase.requestId variable provides the request ID from that service, and $context.phase.integrationStatus provides the status code.

For example, when using an AWS Lambda function as the integration, API Gateway receives two status codes. The first, $context.integration.integrationStatus, is the status of the Lambda service itself. This is usually 200, unless there is a service or permissions error. The second, $context.integration.status, is the status of the Lambda function and reports on the success or failure of the code.

A full list of access log variables is in the documentation for REST APIs, WebSocket APIs, and HTTP APIs.
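
Access logging itself is configured per stage. The following sketch shows how a REST API stage could be pointed at a CloudWatch Logs group with a format that includes some of the new variables; the API ID, stage name, log group ARN, and the exact set of variables are placeholders:

cat > access-log-settings.json << 'EOF'
{
  "restApiId": "abcde12345",
  "stageName": "Prod",
  "patchOperations": [
    {
      "op": "replace",
      "path": "/accessLogSettings/destinationArn",
      "value": "arn:aws:logs:us-east-1:123456789012:log-group:api-access-logs"
    },
    {
      "op": "replace",
      "path": "/accessLogSettings/format",
      "value": "{\"requestId\":\"$context.requestId\",\"waf-latency\":\"$context.waf.latency\",\"authenticate-latency\":\"$context.authenticate.latency\",\"integration-latency\":\"$context.integration.latency\",\"status\":\"$context.status\"}"
    }
  ]
}
EOF
aws apigateway update-stage --cli-input-json file://access-log-settings.json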

A troubleshooting example

In this example, an application is built using an API Gateway REST API with a Lambda function for the backend integration. The application uses an IAM authorizer to require AWS account credentials for application access. The application also uses an AWS WAF ACL to rate limit requests to 100 requests per IP, per five minutes. The demo application and deployment instructions can be found in the Sessions With SAM repository.

Because the application involves an AWS WAF and IAM authorizer for security, the request passes through four phases: waf, authenticate, authorize, and integration. The access log format is configured to capture all the data regarding these phases:

{
  "requestId":"$context.requestId",
  "waf-error":"$context.waf.error",
  "waf-status":"$context.waf.status",
  "waf-latency":"$context.waf.latency",
  "waf-response":"$context.wafResponseCode",
  "authenticate-error":"$context.authenticate.error",
  "authenticate-status":"$context.authenticate.status",
  "authenticate-latency":"$context.authenticate.latency",
  "authorize-error":"$context.authorize.error",
  "authorize-status":"$context.authorize.status",
  "authorize-latency":"$context.authorize.latency",
  "integration-error":"$context.integration.error",
  "integration-status":"$context.integration.status",
  "integration-latency":"$context.integration.latency",
  "integration-requestId":"$context.integration.requestId",
  "integration-integrationStatus":"$context.integration.integrationStatus",
  "response-latency":"$context.responseLatency",
  "status":"$context.status"
}

Once the application is deployed, use Postman to test the API with a sigV4 request.

Configuring Postman authorization

To show troubleshooting with the new enhanced observability variables, the first request sent through contains invalid credentials. The user receives a 403 Forbidden error.

Client response view with invalid tokens

The access log for this request is:

{
    "requestId": "70aa9606-26be-4396-991c-405a3671fd9a",
    "waf-error": "-",
    "waf-status": "200",
    "waf-latency": "8",
    "waf-response": "WAF_ALLOW",
    "authenticate-error": "-",
    "authenticate-status": "403",
    "authenticate-latency": "17",
    "authorize-error": "-",
    "authorize-status": "-",
    "authorize-latency": "-",
    "integration-error": "-",
    "integration-status": "-",
    "integration-latency": "-",
    "integration-requestId": "-",
    "integration-integrationStatus": "-",
    "response-latency": "48",
    "status": "403"
}

The request passed through the waf phase first. Since this is the first request and the rate limit has not been exceeded, the request is passed on to the next phase, authenticate. During the authenticate phase, the user’s credentials are verified. In this case, the credentials are invalid and the request is rejected with a 403 response before invoking the downstream phases.

To correct this, the next request uses valid credentials, but those credentials do not have access to invoke the API. Again, the user receives a 403 Forbidden error.

Client response view with unauthorized tokens

The access log for this request is:

{
  "requestId": "c16d9edc-037d-4f42-adf3-eaadf358db2d",
  "waf-error": "-",
  "waf-status": "200",
  "waf-latency": "7",
  "waf-response": "WAF_ALLOW",
  "authenticate-error": "-",
  "authenticate-status": "200",
  "authenticate-latency": "8",
  "authorize-error": "The client is not authorized to perform this operation.",
  "authorize-status": "403",
  "authorize-latency": "0",
  "integration-error": "-",
  "integration-status": "-",
  "integration-latency": "-",
  "integration-requestId": "-",
  "integration-integrationStatus": "-",
  "response-latency": "52",
  "status": "403"
}

This time, the access logs show that the authenticate phase returns a 200. This indicates that the user credentials are valid for this account. However, the authorize phase returns a 403 and states, “The client is not authorized to perform this operation”. Again, the request is rejected with a 403 response before invoking downstream phases.

The last request for the API contains valid credentials for a user that has rights to invoke this API. This time the user receives a 200 OK response and the requested data.

Client response view with valid request

The log for this request is:

{
  "requestId": "ac726ce5-91dd-4f1d-8f34-fcc4ae0bd622",
  "waf-error": "-",
  "waf-status": "200",
  "waf-latency": "7",
  "waf-response": "WAF_ALLOW",
  "authenticate-error": "-",
  "authenticate-status": "200",
  "authenticate-latency": "1",
  "authorize-error": "-",
  "authorize-status": "200",
  "authorize-latency": "0",
  "integration-error": "-",
  "integration-status": "200",
  "integration-latency": "16",
  "integration-requestId": "8dc58335-fa13-4d48-8f99-2b1c97f41a3e",
  "integration-integrationStatus": "200",
  "response-latency": "48",
  "status": "200"
}

This log contains a 200 status code from each of the phases and returns a 200 response to the user. Additionally, each of the phases reports latency. This request had a total of 48 ms of latency. The latency breaks down according to the following:

Request latency breakdown

Developers can use this information to identify the cause of latency within the API request and adjust accordingly. While some phases like authenticate or authorize are immutable, optimizing the integration phase of this request could remove a large chunk of the latency involved.
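
Because the access logs are structured JSON, you can also query them with CloudWatch Logs Insights to find slow or failing requests across many invocations. A sketch, assuming the logs are delivered to a log group named api-access-logs (hyphenated field names need backquotes in the query language):

aws logs start-query \
  --log-group-name api-access-logs \
  --start-time $(($(date +%s) - 3600)) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, requestId, status, `authenticate-latency`, `integration-latency`, `response-latency` | filter status != "200" | sort @timestamp desc | limit 20'
# The query runs asynchronously; fetch the results with the returned queryId
aws logs get-query-results --query-id <queryId>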

Conclusion

This post covers the enhanced observability variables, the phases they occur in, and the order of those phases. With these new variables, developers can quickly isolate the problem and focus on resolving issues.

When configured with the proper access logging variables, API Gateway access logs can provide a detailed story of API performance. They can help developers to continually optimize that performance. To learn how to configure logging in API Gateway with AWS SAM, see the demonstration app for this blog.

#ServerlessForEveryone

Log your VPC DNS queries with Route 53 Resolver Query Logs

Post Syndicated from Martin Beeby original https://aws.amazon.com/blogs/aws/log-your-vpc-dns-queries-with-route-53-resolver-query-logs/

The Amazon Route 53 team has just launched a new feature called Route 53 Resolver Query Logs, which will let you log all DNS queries made by resources within your Amazon Virtual Private Cloud. Whether it’s an Amazon Elastic Compute Cloud (EC2) instance, an AWS Lambda function, or a container, if it lives in your Virtual Private Cloud and makes a DNS query, then this feature will log it; you are then able to explore and better understand how your applications are operating.

Our customers explained to us that DNS query logs were important to them. Some wanted the logs so that they could be compliant with regulations; others wished to monitor DNS querying behavior so they could spot security threats. Others simply wanted to troubleshoot application issues that were related to DNS. The team listened to our customers and developed what I have found to be an elegant and easy-to-use solution.

From knowing very little about the Route 53 Resolver, I was able to configure query logging and have it working with barely a second glance at the documentation, which I assure you is a testament to the intuitiveness of the feature rather than me having any significant experience with Route 53 or DNS query logging.

You can choose to have the DNS query logs sent to one of three AWS services: Amazon CloudWatch Logs, Amazon Simple Storage Service (S3), and Amazon Kinesis Data Firehose. The target service you choose will depend mainly on what you want to do with the data. If you have compliance mandates (for example, Australia’s Information Security Registered Assessors Program), then maybe storing the logs in Amazon Simple Storage Service (S3) is a good option. If you plan to monitor and analyze DNS queries in real time, or you integrate your logs with a third-party data analysis tool like Kibana or a SIEM tool like Splunk, then perhaps Amazon Kinesis Data Firehose is the option for you. For those of you who want an easy way to search, query, monitor metrics, or raise alarms, then Amazon CloudWatch Logs is a great choice, and this is what I will show in the following demo.

Over in the Route 53 Console, near the Resolver menu section, I see a new item called Query logging. Clicking on this takes me to a screen where I can configure the logging.

The dashboard shows the current configurations that are setup. I click Configure query logging to get started.

The console asks me to fill out some necessary information, such as a friendly name; I’ve named mine demoNewsBlog.

I am now prompted to select the destination where I would like my logs to be sent. I choose the CloudWatch Logs log group and select the option to Create log group. I give my new log group the name /aws/route/demothebeebsnet.

Next, I need to select what VPC I would like to log queries for. Any resource that sits inside the VPCs I choose here will have their DNS queries logged. You are also able to add tags to this configuration. I am in the habit of tagging anything that I use as part of a demo with the tag demo. This is so I can easily distinguish between demo resources and live resources in my account.

Finally, I press the Configure query logging button, and the configuration is saved. Within a few moments, the service has successfully enabled the query logging in my VPC.

After a few minutes, I log into the Amazon CloudWatch Logs console and can see that the logs have started to appear.

As you can see below, I was quickly able to start searching my logs and running queries using Amazon CloudWatch Logs Insights.

There is a lot you can do with the Amazon CloudWatch Logs service; for example, I could use CloudWatch Metric Filters to automatically generate metrics or even create dashboards. While putting this demo together, I also discovered a feature inside of Amazon CloudWatch Logs called Contributor Insights that enables you to analyze log data and create time series that display top talkers. Very quickly, I was able to produce this graph, which lists out the most common DNS queries over time.

Route 53 Resolver Query Logs is available in all AWS Commercial Regions that support Route 53 Resolver Endpoints, and you can get started using either the API or the AWS Console. You do not pay for the Route 53 Resolver Query Logs, but you will pay for handling the logs in the destination service that you choose. So, for example, if you decided to use Amazon Kinesis Data Firehose, then you will incur the regular charges for handling logs with the Amazon Kinesis Data Firehose service.
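
If you would rather script the setup than click through the console, the Route 53 Resolver API exposes the same operations. A sketch with placeholder names, ARNs, and IDs:

# Create the query logging configuration, pointing it at a CloudWatch Logs log group
aws route53resolver create-resolver-query-log-config \
  --name demoNewsBlog \
  --destination-arn arn:aws:logs:us-east-1:123456789012:log-group:/aws/route/demothebeebsnet \
  --creator-request-id demo-$(date +%s)

# Associate the configuration with the VPC whose DNS queries you want to log
aws route53resolver associate-resolver-query-log-config \
  --resolver-query-log-config-id rqlc-0123456789abcdef \
  --resource-id vpc-0123456789abcdef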

Happy Logging

— Martin

Updates to the security pillar of the AWS Well-Architected Framework

Post Syndicated from Ben Potter original https://aws.amazon.com/blogs/security/updates-to-security-pillar-aws-well-architected-framework/

We have updated the security pillar of the AWS Well-Architected Framework, based on customer feedback and new best practices. In this post, I’ll take you through the highlights of the updates to the security information in the Security Pillar whitepaper and the AWS Well-Architected Tool, and explain the new best practices and guidance.

AWS developed the Well-Architected Framework to help customers build secure, high-performing, resilient, and efficient infrastructure for their applications. The Well-Architected Framework is based on five pillars — operational excellence, security, reliability, performance efficiency, and cost optimization. We periodically update the framework as we get feedback from customers, partners, and other teams within AWS. Along with the AWS Well-Architected whitepapers, there is an AWS Well-Architected Tool to help you review your architecture for alignment to these best practices.

Changes to identity and access management

The biggest changes are related to identity and access management, and how you operate your workload. Both the security pillar whitepaper, and the review in the tool, now start with the topic of how to securely operate your workload. Instead of just securing your root user, we want you to think about all aspects of your AWS accounts. We advise you to configure the contact information for your account, so that AWS can reach out to the right contact if needed. And we recommend that you use AWS Organizations with Service Control Policies to manage the guardrails on your accounts.

A new best practice is to identify and validate control objectives. This new best practice is about having objectives built from your own threat model, not just a list of controls to help you measure the effectiveness of your risk mitigation. Adding to that is the best practice to automate testing and validation of security controls in pipelines.

Identity and access management is no longer split between credentials and authentication, human and programmatic access. There are now just two questions which have a simpler approach, and which focus on identity and permissions. These questions don’t draw a distinction between humans and machines. You should think about how identity and permissions are applied to your AWS environment, to the systems supporting your workload, and to the workload itself. New best practices include the following:

  • Define permission guardrails for your organization – Establish common controls that restrict access to all identities in your organization, such as restricting which AWS Regions can be used.
  • Reduce permissions continuously – As teams and workloads determine what access they need, you should remove the permissions they no longer use, and you should establish review processes to achieve least privilege permissions.
  • Establish an emergency access process – Have a process that allows emergency access to your workload, so that you can react appropriately in the unlikely event of an issue with an automated process or the pipeline.
  • Analyze public and cross account access – Continuously monitor findings that highlight public and cross-account access.
  • Share resources securely – Govern the consumption of shared resources across accounts, or within your AWS Organization.

Changes to detective controls

The section that was previously called detective controls is now simply called detection. The main update in this section is a new best practice called automate response to events, which covers automating your investigation processes, alerts, and remediation. For example, you can use Amazon GuardDuty managed threat detection findings, or the new AWS Security Hub foundational best practices, to trigger a notification to your team with Amazon CloudWatch Events, and you can use an AWS Step Function to coordinate or automatically remediate what was found.

Changes to infrastructure protection

From a networking and compute perspective, there are only a few minor refinements. A new best practice in this section is to enable people to perform actions at a distance. This aligns to the design principle: keep people away from data. For example, you can use AWS Systems Manager to remotely run commands on an EC2 instance without needing to open ports outside your network or interactively connect to instances, which reduces the risk of human error.

Changes to data protection

One update in the best practice for data at rest is that we changed provide mechanisms to keep people away from data to use mechanisms to keep people away from data. Simply providing mechanisms is good, but it’s better if they are actually used.

One issue in data protection that hasn’t changed is how to enforce encryption at rest and in transit. For encryption in transit, you should never allow insecure protocols, such as HTTP. You can configure security groups and network access control lists (network ACLs) to not use insecure protocols, and you can use Amazon CloudFront to only serve your content over HTTPS with certificates deployed and automatically rotated by AWS Certificate Manager (ACM). For enforcing encryption at rest, you can use default EBS encryption, Amazon S3 encryption using AWS Key Management Service (AWS KMS) customer master keys (CMKs), and you can detect insecure configurations using AWS Config rules or conformance packs.

Changes to incident response

The final section of the security pillar is incident response. The new best practice in this section is automate containment and recovery capability. Your incident response should already include plans, pre-deployed tools, and access, so your next step should be automation. The incident response section of the whitepaper has been rewritten and takes on many parts of the very helpful AWS Security Incident Response Guide. You should also check out the incident response hands-on lab Incident Response Playbook with Jupyter – AWS IAM. This lab uses the power of Jupyter notebooks to perform interactive queries against AWS APIs for semi-automated playbooks. As always, a best practice is to plan and practice your incident response so you can continue to learn and improve as you operate your workloads securely in AWS.

A huge thank you to all our customers who give us feedback on the tool and whitepapers. I encourage you to review your workloads in the AWS Well-Architected Tool, and take a look at all of the updated Well-Architected whitepapers.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Well-Architected Tool forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Ben Potter

Ben is the global security leader for the AWS Well-Architected Framework and is responsible for sharing best practices in security with customers and partners. Ben is also an ambassador for the No More Ransom initiative helping fight cyber crime with Europol, McAfee and law enforcement across the globe. You can learn more about him in this interview.

Load testing a web application’s serverless backend

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/load-testing-a-web-applications-serverless-backend/

Many web applications experience high levels of traffic and spiky load patterns. The goal of load testing is to ensure that the architecture and system design works for the amount of traffic expected. It can help developers find bottlenecks and unexpected behavior before a system is deployed to production. This post uses the Ask Around Me application as an example to show how to test load in a serverless architecture.

In Ask Around Me, users ask and answer questions in their local geographic area. The expected hourly load is 1,000 new questions, 10,000 new answers, and 50,000 question lookup queries. I use these numbers as a baseline for the tests. This is the architecture of the Ask Around Me backend application:

Ask Around Me backend architecture

Focus areas for load testing

In serverless architectures using AWS services, you can perform a round-trip test from an API endpoint. You can also isolate areas in the design where you should test performance. API testing provides the best approximation of the performance that users experience, but it may not always be possible. You can also isolate microservices that consume from an SQS queue or receive events from Amazon EventBridge, and test only those parts of the infrastructure.

While AWS services are built to withstand high levels of traffic, it’s important to consider the effect of Service Quotas on your application. Service Quotas are applied at the Region and account levels depending upon the service. You can view all your quotas in one place from the Service Quotas console. These are designed to protect you and other customers if your applications use more resources than planned. These quotas consist of hard and soft limits. For soft limits, you can request quota increases by opening a support ticket.

You must also consider downstream services. While serverless services like Lambda scale on your behalf, you may use downstream services that could be overwhelmed when traffic increases. Load testing can help identify these areas. You can implement mechanisms like queuing, caching, or pooling to protect those non-serverless parts of your infrastructure. If you are using Amazon RDS, for example, you might implement Amazon RDS Proxy to help pool and scale resources.

Finally, load testing can help identify custom code in Lambda functions that may not run efficiently as traffic scales up. Typically, these issues are caused by the code itself or the function configuration. For example, code may not process event batches effectively, or functions may not be configured with the appropriate concurrency or memory settings. Frequently these issues go unnoticed in development but surface in a load test.

Load testing tools

Load testing serverless infrastructure can be both inexpensive and systematic. There are several tools available for serverless developers to perform this task. One of the most popular is Artillery Community Edition, which is an open-source tool for testing serverless APIs. You configure the number of requests per second and overall test duration, and it uses a headless Chromium browser to run its test flows.

The performance report measures the roundtrip time from the client device, so can be affected by your machine’s performance and network. One way to eliminate your local network’s impact on the results is to use AWS Cloud9 to run the tests remotely.

For Artillery, the maximum number of concurrent tests is constrained by your local computing resources and network. To achieve higher throughput, you can use Serverless Artillery, which runs the Artillery package on Lambda functions. As a result, this tool can scale up to a significantly higher number of tests.

The Ask Around Me application is deployed in my AWS account – see the application’s blog series to learn more about the deployment process. I use an AWS Cloud9 instance to run these API tests:

  1. Adding 1,000 questions per hour using the POST /questions API.
  2. Adding 10,000 answers per hour using the POST /answers API.
  3. Fetching 50,000 questions per hour based upon random geo-location using the GET /questions API.

You can find the test scripts and Artillery configurations in the testing directory of the application’s GitHub repo.

Artillery also enables you to specify custom functions to provide randomized data and custom query parameters, as required by your API. The loadTestFunction.js file contains a function to return randomized geo-point and rating data per test:

// Sets a bounding box around an area in Virginia, USA
const bounds = {
  latMax: 38.735083,
  latMin: 40.898677,
  lngMax: -77.109339,
  lngMin: -81.587841
}

const generateRandomData = (userContext, events, done) => {
  const randomLat = ((bounds.latMax-bounds.latMin) * Math.random()) + bounds.latMin
  const randomLng = ((bounds.lngMax-bounds.lngMin) * Math.random()) + bounds.lngMin

  const id = parseInt(Math.random()*1000000)+1  //random 0-1000000
  const rating = parseInt(Math.random()*5)+1    //returns 1-5

  userContext.vars.lat = randomLat.toFixed(7)
  userContext.vars.lng = randomLng.toFixed(7)
  userContext.vars.id = id
  userContext.vars.rating = rating

  return done()
}

module.exports = { generateRandomData }

Test #1: Adding 1,000 questions per hour

The POST questions API has the following architecture:

POST questions architecture

The Artillery configuration file 1-test.yaml is set to create three requests per second over a 5-minute duration. This equates to 10,800 questions per hour, significantly higher than the estimated load for this function. The scenario specifies the JSON payload expected by the questions API:

config:
  target: 'https://abcd1234567.execute-api.us-east-1.amazonaws.com'
  phases:
    - duration: 300
      arrivalRate: 3
  processor: "./loadTestFunction.js"          
  defaults:
    headers:
      Authorization: 'Bearer <<enter your valid JWT token>>'
scenarios:
  - flow:
    - function: "generateRandomData"
    - post:
        url: "/questions"
        json:
          question: "This is a load test question - #{{ id }}"
          type: "Star rating"
          position:
            latitude: {{ lat }}
            longitude: {{ lng }}
    - log: "Sent POST request to / with {{ lat }}, {{ lng }}"

You execute the Artillery test with the command artillery run ./1-test.yaml. My test concludes with the following results:

Artillery test results

Over the test’s 900 requests, the median response time is 114 ms. The p95 response time shows that 95% of all responses are served within 376 ms. The slowest response of 1401 ms is caused by cold starts when the Lambda service scales up the underlying function due to load.

As this process writes to a DynamoDB table, I can also see how many write capacity units (WCUs) are consumed by the test. From the DynamoDB console, select the table aamQuestions, then choose the Metrics tab. This shows the Write capacity metric:

CloudWatch DynamoDB metrics
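
The same consumed capacity data is also available programmatically. Here is a minimal boto3 sketch, assuming the aamQuestions table name from this walkthrough; the one-hour window and 60-second period are my own choices for illustration:

# Hedged sketch: retrieve consumed write capacity for the aamQuestions table with
# boto3 instead of the console. Window and period values are assumptions.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

now = datetime.now(timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/DynamoDB",
    MetricName="ConsumedWriteCapacityUnits",
    Dimensions=[{"Name": "TableName", "Value": "aamQuestions"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=60,
    Statistics=["Sum"],
)

# Print the consumed WCUs per minute in chronological order
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])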

Test #2: Adding 10,000 answers per hour

The POST answers API has the following architecture:

POST answers architecture

The Artillery configuration in 2-test.yaml creates 10 answers per second over a 5-minute duration. This equates to 36,000 answers per hour, much higher than the estimated load. The scenario defines the randomized rating used by the testing process:

config:
  target: 'https://abcd1234567.execute-api.us-east-1.amazonaws.com'
  phases:
    - duration: 300
      arrivalRate: 10
  processor: "./loadTestFunction.js"          
  defaults:
    headers:
      Authorization: 'Bearer <<enter your valid JWT token>>'
scenarios:
  - flow:
    - function: "generateRandomData"
    - post:
        url: "/answers"
        json:
          type: "Star"
          rating: "{{ rating }}"
          question: 
            type: "Star"
            latitude: 39.08259127440097
            longitude: -77.46246339003038
            rangeKey: "testuser|1-1589380702281"
    - log: "Sent POST request to / with {{ rating }}"

The test results show a median response time of 111 ms with a p95 time of 218 ms. In the worst case, a request took 1102 ms to complete:

Artillery summary report

Checking the Metrics tab for the aamAnswers table, this test consumed just under 11 WCUs at peak:

CloudWatch DynamoDB metrics

Test #3: Fetching 50,000 questions per hour

The GET questions API invokes a Lambda function that uses the Geo Library for Amazon DynamoDB:

GET questions architecture

This process is read-intensive on the underlying DynamoDB table. The testing configuration simulates 20 queries per second over 2 minutes for random locations in a bounding box around Virginia, USA:

config:
  target: 'https://abcd1234567.execute-api.us-east-1.amazonaws.com'
  phases:
    - duration: 120
      arrivalRate: 20
  processor: "./loadTestFunction.js"          
  defaults:
    headers:
      Authorization: 'Bearer <<enter your valid JWT token>>'
scenarios:
  - flow:
    - function: "generateRandomData"
    - get:
        url: "/questions"
        qs:
          lat: "{{ lat }}"
          lng: "{{ lng }}"
    - log: "Sent POST request to / with {{ lat }}, {{ lng }}"

This is a synchronous API so the performance directly impacts the user’s experience of the application. This test shows that the median response time is 165 ms with a p95 time of 201 ms:

Artillery performance results

This level of load equates to 72,000 queries per hour, almost 50% above the expected usage. The DynamoDB metrics show a peak consumption of 82 read capacity units:

CloudWatch monitoring details

Testing authenticated routes

These API routes are protected from public access and require authorization. This application uses HTTP APIs, which accept JWT tokens, and it uses Auth0 in the frontend application to generate these tokens. When you are load testing API Gateway routes with custom authorizers, you have a number of options.

At the early development stage, you may choose to remove the authentication to perform load tests. This simplifies the process but is not recommended beyond research and prototyping. If you turn off authentication for testing, there is a risk that it is not enabled again for production. This would leave your routes open to the public.

A better approach is to create a test user in your identity provider and use the JWT token for testing. Auth0 allows you to obtain a token manually, and use this in the Artillery configuration for the authorization header:

Embedding authorization token in test script

Since custom code frequently uses the decoded identity in processing, supplying a test token provides the closest simulation of actual usage. You must refresh this token in the test scripts periodically, and you can change scopes as needed.
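
As an illustration only, the following minimal Python sketch uses Auth0’s client credentials grant to fetch a test token programmatically rather than copying it from the dashboard. The tenant domain, client ID, client secret, and audience values are placeholders, not values from the Ask Around Me application:

# Minimal sketch (not part of the Ask Around Me repo): fetch a test JWT from Auth0
# using the client credentials grant. Domain, client ID/secret, and audience are
# placeholders you replace with your own tenant's values.
import requests

AUTH0_DOMAIN = "your-tenant.auth0.com"          # placeholder
payload = {
    "client_id": "YOUR_CLIENT_ID",              # placeholder
    "client_secret": "YOUR_CLIENT_SECRET",      # placeholder
    "audience": "https://your-api-identifier",  # placeholder
    "grant_type": "client_credentials",
}

resp = requests.post(f"https://{AUTH0_DOMAIN}/oauth/token", json=payload, timeout=10)
resp.raise_for_status()
token = resp.json()["access_token"]

# Paste this value into the Artillery Authorization header: 'Bearer <token>'
print(token)

You can then paste the printed token into the Authorization header of the Artillery configuration files shown earlier, refreshing it when it expires.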

The testing directory in the GitHub repo also includes a script for testing functions that consume from SQS queues. This allows you to test microservices further down in your infrastructure stack. This script injects messages into the SQS queue, simulating upstream processes.
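
The repo’s script is the reference implementation; as a hedged sketch of the idea, a few lines of boto3 are enough to inject batches of randomized test messages into a queue. The queue URL and message body shape below are hypothetical:

# Hypothetical sketch: inject randomized test messages into an SQS queue with boto3
# to simulate upstream load. The queue URL and message body fields are placeholders.
import json
import random
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # placeholder

def send_batch(batch_size=10):
    entries = []
    for i in range(batch_size):
        body = {"rating": random.randint(1, 5), "questionId": random.randint(1, 1000000)}
        entries.append({"Id": str(i), "MessageBody": json.dumps(body)})
    return sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)

# 100 batches of 10 messages each = 1,000 injected messages
for _ in range(100):
    send_batch()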

Conclusion

In this post, I discuss focus areas for load testing of serverless applications, and highlight two tools commonly used. I show how to configure Artillery with customized functions, and how to run tests to simulate load on the Ask Around Me application.

I cover some of the options for testing authenticated API Gateway routes and how you can use JWT tokens in your load testing configuration. You can also test microservices within a serverless architecture by injecting messages into SQS queues to simulate upstream load.

To learn more about the Ask Around Me serverless applications, read the blog series.

Building well-architected serverless applications: Approaching application lifecycle management – part 2

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-approaching-application-lifecycle-management-part-2/

This series of blog posts uses the AWS Well-Architected Tool with the Serverless Lens to help customers build and operate applications using best practices. In each post, I address the nine serverless-specific questions identified by the Serverless Lens along with the recommended best practices. See the Introduction post for a table of contents and explanation of the example application.

Question OPS2: How do you approach application lifecycle management?

This post continues part 1 of this Operational Excellence question. Previously, I covered using infrastructure as code with version control to deploy applications in a repeatable manner.

Good practice: Prototype new features using temporary environments

Storing application configuration as infrastructure as code allows deployment of multiple, repeatable, isolated versions of an application.

Create multiple temporary environments for new features you may need to prototype, and tear them down as you complete them. Temporary environments enable fine-grained feature isolation and higher-fidelity development when interacting with managed services. This allows you to gain confidence that your workload integrates and operates as intended.

These environments can also be in separate accounts, which helps isolate limits, access to data, and resiliency. For best practices on multi-account deployments, see the AWS Partner Network blog post: Best Practices Guide for Multi-Account AWS Deployments.

There are a number of ways to deploy separate environments for an application. To make the deployment simpler, it is good practice to separate dynamic configuration from your infrastructure logic.

For an application managed via the AWS Serverless Application Model (AWS SAM), use an AWS SAM CLI parameter to specify a new stack-name which deploys a new copy of the application as a separate stack.

For example, there is an existing AWS SAM application with a stack-name of app-test. To deploy a new copy, specify a new stack-name of app-newtest with the following command line:

sam deploy --stack-name app-newtest

This deploys a whole new copy of the application in the same account as a separate stack.

For the serverless airline example used in this series, deploy a whole new copy of the application following the deployment instructions, either into the same AWS account or a completely different account. This is useful when each developer in a team has a sandbox environment. In this example, you only need to configure payment provider credentials as environment variables and seed the database with possible flights, as these are currently manual post-installation tasks.

However, maintaining an entirely separate codebase copy of an application becomes difficult to manage and reconcile over time.

As the airline application code is stored in a fork in a GitHub account, use git branches for separate environments. In typical development teams, developers may deploy a main branch to production, have a dev branch as staging, and create feature branches when working on new functionality. This allows safe prototyping in sandbox environments without affecting the main codebase, using git as the mechanism to merge code and resolve conflicts. Changes are automatically pushed to production once they are merged into the main (or production) branch.

Git branching flow

Git branching flow

As the airline example is using AWS Amplify Console, there are a few different options to create a new environment linked to a feature branch.

You can create a whole new Amplify Console app deployment, either in a separate Region, or in a separate AWS account, which then connects to a feature branch by following the deployment instructions. Create a new branch called new-feature in GitHub and in the Amplify Console, select Connect App, and navigate to the repository and the new-feature branch. Configure the payment provider credentials as environment variables.

Deploy new application pointing to feature branch

You can also connect the existing Amplify Console deployment to a git branch, deploying the new-feature branch into the same AWS account and Region.

Amplify Environments

Amplify Environments

In the Amplify Console, navigate to the existing app, select Connect Branch, and choose the new-feature branch. Create a new Backend environment to deploy the full stack. If the feature branch is only frontend code changes, you can choose to use the same backend components.

Connect Amplify Console to feature branch

Connect Amplify Console to feature branch

Amplify Console then deploys a new stack, in addition to the develop branch, based on the code in the new-feature branch.

New feature branch deploying within existing deployment.

New feature branch deploying within existing deployment.

You do not need to add the payment provider environment variables as these are stored per application, per Region, for all branches.

Amplify environment variables for All Branches.

Amplify environment variables for All Branches.

Using git and branching with Amplify Console, you have automatic deployments when any changes are pushed to the GitHub repository. If there are any issues with a particular deployment, you can revert the changes in git, which kicks off a redeploy to a known good version. Once you are happy with the feature, you can merge the changes into the production branch, which kicks off another deployment.

As it is simple to set up multiple test environments, make sure to practice good application hygiene, as well as cost management, by identifying and deleting any temporary environments that are no longer required. It may be helpful to include the stack owner’s contact details via CloudFormation tags. Use Amazon CloudWatch scheduled tasks to notify and tag temporary environments for deletion, and provide a mechanism to delay their deletion if needed.
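
As a sketch of what such a cleanup task could look like, the following Lambda handler scans CloudFormation stacks for tags and notifies owners of old temporary stacks. The tag keys, the seven-day cutoff, and the SNS topic are assumptions for illustration, not part of the airline example:

# Hypothetical sketch: a scheduled Lambda that finds temporary stacks older than a
# threshold based on tags. Tag keys ("environment", "owner"), the 7-day cutoff, and
# the SNS topic ARN are assumptions for illustration.
from datetime import datetime, timedelta, timezone
import boto3

cfn = boto3.client("cloudformation")
sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:temp-env-cleanup"  # placeholder

def handler(event, context):
    cutoff = datetime.now(timezone.utc) - timedelta(days=7)
    paginator = cfn.get_paginator("describe_stacks")
    for page in paginator.paginate():
        for stack in page["Stacks"]:
            tags = {t["Key"]: t["Value"] for t in stack.get("Tags", [])}
            if tags.get("environment") == "temporary" and stack["CreationTime"] < cutoff:
                owner = tags.get("owner", "unknown")
                sns.publish(
                    TopicArn=TOPIC_ARN,
                    Subject=f"Temporary stack {stack['StackName']} is over 7 days old",
                    Message=f"Owner: {owner}. Delete the stack or extend its lifetime.",
                )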

Prototyping locally

With AWS SAM or a third-party framework, you can run API Gateway and invoke Lambda function code locally for faster development iteration. Local debugging and testing can help quickly confirm that function code is working, and are also useful for some unit tests. Local testing cannot duplicate the full functionality of the cloud. It is suited to testing services with custom logic, such as Lambda, rather than trying to duplicate managed cloud services such as Amazon SNS or Amazon S3 locally. Don’t try to bring the cloud to the test; rather, bring the testing to the cloud.

Here is an example of executing a function locally.

I use AWS SAM CLI to invoke the Airline-GetLoyalty Lambda function locally to test some functionality. AWS SAM CLI uses Docker to simulate the Lambda runtime. As the function only reads from DynamoDB, I use stubbed data, or can set up DynamoDB Local.

1. I pass a JSON event to the function to simulate the event from API Gateway, as well as passing in environment variables as JSON. Create sample events using sam local generate-event.

2. I run sam build GetFunc to build the function dependencies, in this case NodeJS.

$ sam build GetFunc
Building resource 'GetFunc'
Running NodejsNpmBuilder:NpmPack
Running NodejsNpmBuilder:CopyNpmrc
Running NodejsNpmBuilder:CopySource
Running NodejsNpmBuilder:NpmInstall
Running NodejsNpmBuilder:CleanUpNpmrc

Build Succeeded

3. I run sam local invoke passing in the event payload and environment variables. This spins up a Docker container, executes the function, and returns the result.

$ sam local invoke --event src/get/event.json --env-vars local-env-vars.json GetFunc
Invoking index.handler (nodejs10.x)

Fetching lambci/lambda:nodejs10.x Docker container image......
Mounting /home/ec2-user/environment/samples/aws-serverless-airline-booking/src/backend/loyalty/.aws-sam/build/GetFunc as /var/task:ro,delegated inside runtime container
START RequestId: 7be7e9a5-9f2f-1520-fbd1-a013485105d3 Version: $LATEST
END RequestId: 7be7e9a5-9f2f-1520-fbd1-a013485105d3
REPORT RequestId: 7be7e9a5-9f2f-1520-fbd1-a013485105d3 Init Duration: 249.89 ms Duration: 76.40 ms Billed Duration: 100 ms Memory Size: 512 MB Max Memory Used: 54 MB

{"statusCode": 200,"body": "{\"points\":0,\"level\":\"bronze\",\"remainingPoints\":50000}"}

For more information on using AWS SAM to run API Gateway and invoke Lambda functions locally, see the AWS documentation. For third-party framework solutions, see Invoking AWS Lambda functions locally with Serverless framework and Develop locally against cloud services with Stackery.

Improvement plan summary:

  1. Use a serverless framework to deploy temporary environments named after a feature.
  2. Implement a process to identify temporary environments that may not have been deleted over an extended period of time.
  3. Prototype application code locally and test integrations directly with managed services.

Good practice: Use a rollout deployment mechanism

Use a rollout deployment for production workloads as opposed to an all-at-once mechanism. Rollout deployments reduce the risk of a failed deployment by gradually deploying application changes to a limited set of customers. All-at-once deployments deploy the entire application at once and are best suited to non-production systems.

AWS Lambda versions and aliases

For production Lambda functions, it is best to deploy a new function version for every deployment. Versions can represent the stable version or reflect particular features. Create Lambda aliases, which are pointers to particular function versions. Invoke Lambda functions using the aliases, with a specific alias for the stable production version. If an alias is not specified, the latest application code deployment is invoked, which may not reflect a stable version or a desired feature. Use a new feature alias for testing without affecting users of the stable production version.

AWS Lambda function versions and aliases

AWS Lambda function versions and aliases

See AWS Documentation to manage Lambda function versions and aliases using the AWS Management Console, or Lambda API.
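
As a minimal sketch using the Lambda API through boto3, publishing a version and creating a stable alias might look like the following; the function and alias names are placeholders:

# Minimal boto3 sketch, assuming a function named "my-function": publish a new
# immutable version and point a "stable" alias at it. Names are placeholders.
import boto3

lambda_client = boto3.client("lambda")

# Publish the current $LATEST code as an immutable version
version = lambda_client.publish_version(FunctionName="my-function")["Version"]

# Create an alias that production callers invoke
lambda_client.create_alias(
    FunctionName="my-function",
    Name="stable",
    FunctionVersion=version,
    Description="Stable production version",
)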

Alias routing

Use a Lambda alias’s routing configuration to introduce traffic shifting, sending a small percentage of traffic to a second function alias or version during a rolling deployment. This is commonly called a canary release.

For example, configure a Lambda alias named stable to point to function version 2. A new function version 3 is deployed with alias new-feature. Use the new-feature alias to test the new deployment without impacting production traffic to the stable version.

During production rollout, use alias routing. For example, 90% of invocations route to the stable version while 10% route to alias new-feature pointing to version 3. If the 10% is successful, deployment can continue until all traffic is migrated to version 3, and the stable alias is then pointed to version 3.

AWS Lambda alias routing

AWS Lambda alias routing
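
A hedged boto3 sketch of this weighted routing, assuming versions 2 and 3 already exist, could look like the following; the function name is a placeholder:

# Minimal boto3 sketch of weighted alias routing: send 90% of traffic to version 2
# and 10% to version 3 via the "stable" alias. The function name is a placeholder.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.update_alias(
    FunctionName="my-function",      # placeholder
    Name="stable",
    FunctionVersion="2",             # primary version receives 90% of traffic
    RoutingConfig={"AdditionalVersionWeights": {"3": 0.1}},  # 10% to version 3
)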

AWS SAM supports gradual Lambda deployments with a feature called Safe Lambda deployments using AWS CodeDeploy. This creates new versions of a Lambda function, and automatically creates aliases pointing to the new version. Customer traffic gradually shifts to the new version or rolls back automatically if any specified CloudWatch alarms trigger. AWS SAM supports canary, linear, and all-at-once deployment preference types.

Pre-traffic and post-traffic Lambda functions can also verify if the newly deployed code is working as expected.

In the airline example, create a safe deployment for the ReserveBooking Lambda function by adding the example AWS SAM template code specified in the instructions. This migrates 10 percent of traffic every 10 minutes with CloudWatch alarms to check for any function errors. You could also alarm on latency, or any custom metric.

During the Amplify Console build phase, the safe deployment is initiated. Navigate to the CodeDeploy console and see the deployment in progress.

AWS CodeDeploy deployment in progress

AWS CodeDeploy deployment in progress

Selecting the deployment, you can see the Traffic shifting progress and the Deployment details.

AWS CodeDeploy traffic shifting in progress.

AWS CodeDeploy traffic shifting in progress.

Within Deployment details, select the DeploymentGroup, and view the CloudWatch Alarms CodeDeploy is using to test the rollout.

Amazon CloudWatch Alarms AWS CodeDeploy is using to test the rollout

Amazon CloudWatch Alarms AWS CodeDeploy is using to test the rollout

Within Deployment details, select the Application, select the Revisions tab, and select the latest Revision location and view the CurrentVersion and TargetVersion for this deployment.

View deployment versions

View deployment versions

View Deployment status and see the traffic has now shifted to the new version. The Amplify Console build also continues.

Traffic shifting complete

Traffic shifting complete

View the Lambda function versions and aliases in the Lambda console, selecting Qualifiers.

Viewing Lambda function version and aliases

Viewing Lambda function version and aliases

Amazon API Gateway also supports canary release deployments at the API layer.

A rollout deployment provides traffic shifting, A/B testing, and the ability to roll back to any version at any point in time. AWS SAM makes it simple to add safe deployments to serverless applications.

Improvement plan summary

  1. For production systems, use a linear deployment strategy to gradually roll out changes to customers.
  2. For high volume production systems, use a canary deployment strategy when you want to limit changes to a fixed percentage of customers for an extended period of time.

Conclusion

Introducing application lifecycle management improves the development, deployment, and management of serverless applications. In this post I cover a number of methods to prototype new features using temporary environments. I show how to use rollout deployments to gradually shift traffic to new application code.

This well-architected question will continue in an upcoming post where I look at configuration management, CI/CD for serverless applications, and managing function runtime deprecation.

Upgrading to Amazon EventBridge from Amazon CloudWatch Events

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/upgrading-to-amazon-eventbridge-from-amazon-cloudwatch-events/

Amazon EventBridge was announced at the AWS New York Summit in 2019. It’s a serverless event bus service that uses the Amazon CloudWatch Events API, but also includes more functionality, like the ability to ingest events from SaaS apps. Event-based architectures can make it easier to decouple your application services and make your systems more extensible.

Events are emitted from services throughout AWS, and you can create custom events from your own applications. The popularity of event-based compute is why CloudWatch Events grew to process trillions of events every month. See this video for an overview of the features of the EventBridge service.

EventBridge is designed to extend the event model beyond AWS, bringing data from software as a service (SaaS) providers into your AWS environment. This means you can consume events from popular providers such as Zendesk, PagerDuty, and Auth0. You can use these in your applications with the same ease as any AWS-generated event.

This blog post explains the differences between the CloudWatch Events and EventBridge, and the benefits of upgrading. I also provide additional resources to help you gain the benefits of using the newer EventBridge service.

Integrated SaaS providers in EventBridge

Fetching data from third-party providers has typically relied on processes such as polling, or building custom webhooks where these are supported. Polling is compute-intensive and often wasteful, since many requests return no new data. Additionally, there is a lag between new data becoming available and your system receiving the information.

While webhooks offer improvements over polling and may approximate a real-time connection, these requests travel over the public internet. This means you must secure the webhook endpoint, and use a security mechanism between the two services. You must also scale up if the SaaS provider sends large numbers of messages.

EventBridge enables SaaS providers to publish data on an event bus within their AWS environment. These events are then routed to your AWS account, and appear on your partner event bus. All of this happens on the private AWS network, away from the public internet. The service manages scaling, security, and integration for you.

Configuring EventBridge in your SaaS provider account is easy. To learn how to set up the integration, see these videos for a full walkthrough:

A full list of EventBridge SaaS integrations is available on the EventBridge website. Additionally, if you want to integrate your own SaaS software with EventBridge, read more in the onboarding guide.

Custom event buses and enhanced rules

CloudWatch Events provides a default event bus that exists in every AWS account. All AWS events are routed via the default bus. You can also choose to publish your custom events to the default bus.

EventBridge introduces custom event buses you can use exclusively for your own workloads. These can be useful for controlling access to events to a limited set of AWS accounts or custom applications. Custom event buses are free to set up. Watch this video to see how to set up a custom event bus.

You can also use powerful advanced rules for routing events. With content-based filtering, you can use comparison operators to identify and filter values in events. This allows you to use EventBridge to handle more processing on behalf of your application, reducing the load on downstream services. See this blog post to learn more about using content-filtering to build advanced rules.
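
As an illustrative sketch, the following boto3 call creates a rule on a custom event bus that combines prefix and numeric content filters. The bus name, event source, and field names are hypothetical:

# Hypothetical sketch: create an EventBridge rule on a custom bus that uses
# content-based filtering (prefix and numeric comparison operators). Bus name,
# source, and detail fields are placeholders.
import json
import boto3

events = boto3.client("events")

event_pattern = {
    "source": ["myapp.orders"],
    "detail": {
        "location": [{"prefix": "us-"}],          # prefix matching
        "amount": [{"numeric": [">", 100]}],      # numeric comparison
    },
}

events.put_rule(
    Name="orders-over-100-us",
    EventBusName="my-custom-bus",
    EventPattern=json.dumps(event_pattern),
    State="ENABLED",
)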

Schema registry

One of the traditional challenges of working with event-based architectures is managing event structures. With different applications, services, and microservices publishing events, it can be hard to standardize event formats. These formats, or schemas, may also change when developers introduce new versions of their services.

The EventBridge Schema Registry allows you to automate the discovery and creation of schemas. You can find schemas for AWS services, integrated SaaS providers, and your own custom events. EventBridge infers the schemas and stores these in a registry. You can then download custom code bindings for popular type-safe programming languages. This accelerates the development process, making it easy to construct objects based on events.

Watch this video to see how to use the EventBridge Schema Registry, and get started with your own applications.

Migration and compatibility

Comparing the EventBridge console and CloudWatch Events console, EventBridge has a new design that makes it easier to build and manage rules and event buses. If you are new to using events within AWS, it’s recommended that you start with the EventBridge console.

The EventBridge service uses the CloudWatch Events API and it is fully backward compatible. Any rules you have configured in CloudWatch Events continue to work in EventBridge. You can access the default event bus and any configurations you created in CloudWatch Events from EventBridge immediately. If you’re a current CloudWatch Events customer, you can upgrade to EventBridge by simply opening the EventBridge console.

AWS continues to build new functionality to enhance the capabilities of event-based architectures. These features are only released via EventBridge, so it’s recommended that you upgrade to ensure you can take advantage of these new capabilities.

Additional EventBridge resources for serverless developers

Event-based architectures can help serverless developers create dynamic, decoupled applications. Here are additional resources with sample code repos to help you get started:

Conclusion

EventBridge is the evolution of the CloudWatch Events service. It brings new features, including the ability to integrate data from popular SaaS providers as events within AWS.

In this post, I discuss the new ability to create custom event buses, and how you can develop advanced rules for sophisticated event routing. I also discuss the new EventBridge Schema Registry, which automates event schema discovery, and downloading code bindings directly into your IDE.

New event-based features are now released in EventBridge. By migrating from CloudWatch Events, you can take advantage of new capabilities as they are released.

To learn more about using EventBridge for your AWS workloads, visit the EventBridge Learning Path, which includes a range of learning resources.

Monitor Spark streaming applications on Amazon EMR

Post Syndicated from Amir Shenavandeh original https://aws.amazon.com/blogs/big-data/monitor-spark-streaming-applications-on-amazon-emr/

For applications to be enterprise-ready, you need to consider many aspects of the application before moving to a production environment and have operational visibility of your application. You can get that visibility through metrics that measure your application’s health and performance and feed application dashboards and alarms.

In streaming applications, you need to benchmark different stages and the tasks in each stage. Spark provides interfaces to plug in your probes for real-time monitoring and observation of your applications. SparkListeners are a flexible and powerful tool for both streaming and batch applications. You can combine them with Amazon CloudWatch metrics, dashboards, and alarms for visibility, and generate notifications when issues arise or automatically scale clusters and services.

This post demonstrates how to implement a simple SparkListener, monitor and observe Spark streaming applications, and set up some alerts. The post also shows how to use alerts to set up automatic scaling on Amazon EMR clusters, based on your CloudWatch custom metrics.

Monitoring Spark streaming applications

For production use cases, you need to plan ahead to determine the amount of resources your Spark application requires. Real-time applications often have SLAs that they need to meet, such as how long each batch execution can safely run or how long each micro-batch can be delayed. Quite often, in the lifecycle of an application, sudden increases of data in the input stream require more application resources to process and catch up with the influx.

For these use cases, you may be interested in common metrics such as the count of records in each micro-batch, the delay on running scheduled micro-batches, and how long each batch takes to run. For example, in Amazon Kinesis Data Streams, you can monitor the IteratorAge metric. With Apache Kafka as a streaming source, you might monitor consumer lag, such as the delta between the latest offset and the consumer offset. For Kafka, there are various open-source tools for this purpose.

You can react in real time or raise alerts based on environment changes by provisioning more resources or reducing unused resources for cost optimization.

Different methods to monitor Spark streaming applications are already available. A very efficient, out-of-the-box feature of Spark is the Spark metrics system. Additionally, Spark can report metrics to various sinks including HTTP, JMX, and CSV files.

You can also monitor and record application metrics from within the application by emitting logs. This requires running operations such as count().print(), printing metrics inside map operations, and reading data in ways that may cause delays, add stages to the application, or perform unwanted shuffles. These approaches may be useful for testing, but often prove expensive as a long-term solution.

This post discusses another method: using the Spark StreamingListener interface. The following screenshot shows some available metrics on the Spark UI’s Streaming tab.

Apache Spark listeners

Spark internally relies on SparkListeners for event-based communication between its internal components. The Spark scheduler emits events for SparkListeners whenever the stage of each task changes. SparkListeners listen to the events coming from Spark’s DAGScheduler, which is the heart of the Spark execution engine. You can use custom Spark listeners to intercept SparkScheduler events so you know when a task or stage starts and finishes.

The Spark developer API provides eight methods in the StreamingListener trait that are called on different Spark events, mainly at the start and stop, failure, completion, or submission of receivers, batches, and output operations. You can execute application logic at each event by implementing these methods. For more information, see StreamingListener.scala on GitHub.

To register your custom Spark listener, set spark.extraListeners when launching the application, or programmatically by calling addSparkListener when setting up SparkContext in your application.

SparkStreaming micro-batches

By default, SparkStreaming has a micro-batch execution model. Spark starts a job in intervals on a continuous stream. Each micro-batch contains stages, and stages have tasks. Stages are based on the DAG and the operation that the application code defines, and the number of tasks in each stage is based on the number of DStream partitions.

At the start of a streaming application, the receivers are assigned to executors, in a round-robin fashion, as long-running tasks.

Receivers create blocks of data based on blockInterval. The received blocks are distributed by the BlockManager of the executors, and the network input tracker running on the driver is informed about the block locations for further processing.

On the driver, an RDD is created for the blocks in each batchInterval. Each block translates to a partition of the RDD and a task is scheduled to process each partition.

The following diagram illustrates this architecture.

Creating a custom SparkListener and sending metrics to CloudWatch

You can rely on CloudWatch custom metrics to react or raise alarms based on the custom Spark metrics you collect from a custom Spark listener.

You can implement your custom streaming listeners by directly implementing the SparkListener trait if writing in Scala, or its equivalent Java interface or PySpark Python wrapper pyspark.streaming.listener.

For this post, you only override onBatchCompleted and onReceiverError because you’re only collecting metrics about micro-batches.

From onBatchCompleted, you submit the following metrics:

  • Heartbeat – A numeric 1 (one) whenever a batch completes so you can sum or average time periods to see how many micro-batches ran
  • Records – The number of records per batch
  • Scheduling delay – The delay from when the batch was scheduled to run until when it actually ran
  • Processing delay – How long the batch execution took
  • Total delay – The sum of the processing delay and scheduling delay

From onReceiverError, you submit a numeric 1 (one) whenever a receiver fails. See the following code:

/**
    * This method executes when a Spark Streaming batch completes.
    *
    * @param batchCompleted Class having information on the completed batch
    */

  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    log.info("CloudWatch Streaming Listener, onBatchCompleted:" + appName)

    // write performance metrics to CloudWatch Metrics
    writeBatchStatsToCloudWatch(batchCompleted)

  }
  /**
  * This method executes when a Spark Streaming receiver encounters an error.
  *
  * @param receiverError Class having information on the receiver errors
  */

  override def onReceiverError(receiverError: StreamingListenerReceiverError): Unit = { 
    log.warn("CloudWatch Streaming Listener, onReceiverError:" + appName)

    writeRecieverStatsToCloudWatch(receiverError)
  }

For the full source code of this example for Scala implementation and a sample Spark Kinesis streaming application, see the AWSLabs GitHub repository.
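
The listener’s helper methods write to CloudWatch through the AWS SDK. As an illustration only, the equivalent PutMetricData call expressed in Python with boto3 might look like the following; the namespace and dimension names are assumptions rather than values from the AWSLabs code:

# Illustration only: the kind of PutMetricData call the listener's helper makes.
# Namespace and dimension names are assumptions, not taken from the AWSLabs code.
import boto3

cloudwatch = boto3.client("cloudwatch")

def write_batch_stats(app_name, records, scheduling_delay_ms, processing_delay_ms):
    dimension = [{"Name": "ApplicationName", "Value": app_name}]
    cloudwatch.put_metric_data(
        Namespace="SparkStreaming",
        MetricData=[
            {"MetricName": "heartBeat", "Dimensions": dimension,
             "Value": 1, "Unit": "Count"},
            {"MetricName": "numRecords", "Dimensions": dimension,
             "Value": records, "Unit": "Count"},
            {"MetricName": "schedulingDelay", "Dimensions": dimension,
             "Value": scheduling_delay_ms, "Unit": "Milliseconds"},
            {"MetricName": "processingDelay", "Dimensions": dimension,
             "Value": processing_delay_ms, "Unit": "Milliseconds"},
            {"MetricName": "totalDelay", "Dimensions": dimension,
             "Value": scheduling_delay_ms + processing_delay_ms, "Unit": "Milliseconds"},
        ],
    )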

To register your custom listener, make an instance of the custom listener object and pass the object to the streaming context, in the driver code, using the addStreamingListener method. See the following code:

val conf = new SparkConf().setAppName(appName)
val batchInterval = Milliseconds(1000)
val ssc = new StreamingContext(conf, batchInterval)
val cwListener = new CloudWatchSparkListener(appName)

ssc.addStreamingListener(cwListener)

When you run the application, you can find your metrics in CloudWatch in the same account as the one the EMR cluster is running in. See the following screenshot.

Using the sample code

This post provides an AWS CloudFormation template, which demonstrates the code. Download the emrtemplate.json file from the GitHub repo. The template launches an EMR cluster in a public subnet and a Kinesis data stream with three shards with the required default AWS Identity and Access Management (IAM) roles. The sample Spark Kinesis streaming application is a simple word count that an Amazon EMR step script compiles and packages with the sample custom StreamListener.

Using application alarms in CloudWatch

The alerts you need to set up mainly depend on the SLA of your application. As a general rule, you don’t want your batches to take longer than the micro-batch intervals because it causes the scheduled batches to queue and you start falling behind the input stream. Also, if the rate of your receivers reading from the stream is more than what you can process in the batches due to a surge, the read records can spill to disk and cause more delays to shuffle across to other executors. You can set up a CloudWatch alarm to notify you when a processing delay is approaching your application’s batchInterval. For instructions on setting up an alarm, see Using Amazon CloudWatch Alarms.

The CloudFormation template for this post has two sample alarms to monitor. One is based on the anomaly detection band on the processingDelays metric; the second is based on a threshold on a math expression that calculates the ratio of schedulingDelay to totalDelay, or (schedulingDelay / totalDelay) * 100.
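
As a hedged sketch of the second alarm, the following boto3 call builds the same ratio as a CloudWatch metric math expression. The namespace, metric names, dimension, and 25% threshold are assumptions for illustration; adjust them to match the metrics your listener actually publishes:

# Hedged sketch: alarm on (schedulingDelay / totalDelay) * 100 using metric math.
# Namespace, metric and dimension names, and the 25% threshold are assumptions.
import boto3

cloudwatch = boto3.client("cloudwatch")

def metric(metric_id, name):
    return {
        "Id": metric_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "SparkStreaming",
                "MetricName": name,
                "Dimensions": [{"Name": "ApplicationName", "Value": "my-streaming-app"}],
            },
            "Period": 300,
            "Stat": "Average",
        },
        "ReturnData": False,
    }

cloudwatch.put_metric_alarm(
    AlarmName="schedulingDelayRatio",
    EvaluationPeriods=3,
    Threshold=25.0,
    ComparisonOperator="GreaterThanThreshold",
    Metrics=[
        metric("sd", "schedulingDelay"),
        metric("td", "totalDelay"),
        {"Id": "ratio", "Expression": "(sd / td) * 100",
         "Label": "SchedulingDelayRatio", "ReturnData": True},
    ],
)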

Scaling streaming applications

In terms of scaling, as the amount of data grows, you have more DStream partitions, based on the blockIntervals of the streaming application. In addition to the batches that should catch up with the received records and finish within batch intervals, the receivers should also keep up with the influx of records. The source streams should provide enough bandwidth for the receivers to read fast enough from the stream, and there should be enough receivers reading at the right rate to consume the records from the source.

If your DStreams are backed by receivers and WALs, you need to consider the number of receivers in advance. When you launch the application, the number of receivers may not change without restarting the application.

When a SparkStreaming application starts, by default, the driver schedules the receivers in a round-robin fashion on the available executors unless a preferred location is defined for receivers. When all executors are allocated with receivers, the rest of the required receivers are scheduled on the executors to balance the number of receivers on each executor, and the receivers stay up in the executors as long-running tasks. For more information about scheduling receivers on executors, see ReceiverSchedulingPolicy.scala on GitHub and SPARK-8882 on the Spark issues website.

Sometimes you may want to slow down receivers because you want less data in each micro-batch and don’t want to exceed your micro-batch intervals. If your streaming sources can hold on to records when the batches can’t keep up with a surge, you can enable the backpressure feature so that Spark adapts the input rate from the receivers. To do so, set spark.streaming.backpressure.enabled to true.

Another factor you can consider is dynamic allocation for streaming applications. By default, spark.dynamicAllocation is enabled on Amazon EMR, and it is mutually exclusive with spark.streaming.dynamicAllocation. If you want the driver to request more executors for your DStream tasks, set spark.dynamicAllocation.enabled to false and spark.streaming.dynamicAllocation.enabled to true. Spark periodically looks at the average batch duration. If it’s above the scale-up ratio, it requests more executors. If it’s below the scale-down ratio, it releases the idle executors, preferably those that aren’t running any receivers. For more information, see ExecutorAllocationManager.scala on GitHub and the Spark Streaming Programming Guide.

The ExecutorAllocationManager already looks at the average batch execution time and requests more executors based on the scale-up and scale-down ratios. Because of this, you can set up automatic scaling in Amazon EMR, preferably on task instance groups, to add and remove nodes based on the ContainerPendingRatio, and assign a PreferredLocation for receivers to core nodes. The example code for this post provides a custom KinesisInputDStream, which allows assigning the preferred location for every receiver you request. It’s basically a function that returns a hostname where the receiver should preferably be placed. The GitHub repo also has a sample application that uses the customKinesisInputDStream and customKinesisReciever, which allow requesting a preferredLocation for receivers.

At scale-down, Amazon EMR nominates the nodes with the fewest containers running for decommissioning in the task instance group.

For more information about setting up automatic scaling, see Using Automatic Scaling with a Custom Policy for Instance Groups. The example code contains a threshold on schedulingDelay. As a general rule, you should base the threshold on the batchIntervals and processingDelay. A growth in schedulingDelay usually means a lack of resources to schedule a task.

The following table summarizes the configuration attributes to tune when you launch your Spark streaming job.

Configuration attribute: Default value

  • spark.streaming.backpressure.enabled: False
  • spark.streaming.backpressure.pid.proportional: 1.0
  • spark.streaming.backpressure.pid.integral: 0.2
  • spark.streaming.backpressure.pid.derived: 0.0
  • spark.streaming.backpressure.pid.minRate: 100
  • spark.dynamicAllocation.enabled: True
  • spark.streaming.dynamicAllocation.enabled: False
  • spark.streaming.dynamicAllocation.scalingInterval: 60 seconds
  • spark.streaming.dynamicAllocation.minExecutors: max(1, numReceivers)
  • spark.streaming.dynamicAllocation.maxExecutors: Integer.MAX_VALUE

Monitoring structured streaming with a listener

Structured streaming still processes records in micro-batches and triggers queries when there is data from receivers. You can monitor these queries using another listener interface, StreamingQueryListener. This post provides a sample listener for structured streaming on Kafka, with a sample application to run. For more information, see CloudWatchQueryListener.scala on GitHub. The following image is a snapshot of a few CloudWatch custom metrics that the custom StreamingQueryListener collects.

Scaling down your EMR cluster

When you launch a Spark streaming application, Spark evenly schedules the receivers on all available executors at the start of the application. When an EMR cluster is set to scale down, Amazon EMR nominates the nodes running fewer tasks in the instance group with an automatic scaling rule. Although Spark receivers are long-running tasks, Amazon EMR waits for yarn.resourcemanager.decommissioning.timeout, or until the NodeManagers are decommissioned, to gracefully terminate and shrink the nodes. You’re always at risk of losing a running executor with a receiver. You should always consider enough Spark block replication and checkpointing for the DStreams, and ideally define a preferredLocation, so you don’t risk losing receivers.

Metrics pricing

In general, Amazon EMR metrics don’t incur CloudWatch costs. However, custom metrics incur charges based on CloudWatch metrics pricing. For more information, see Amazon CloudWatch pricing. Additionally, Spark Kinesis Streaming relies on the Kinesis Client Library, and it publishes custom CloudWatch metrics that also incur charges based on CloudWatch metrics pricing. For more information, see Monitoring the Kinesis Client Library with Amazon CloudWatch.

Conclusion

Monitoring and tuning Spark streaming and real-time applications is challenging, and you must react to environment changes in real time. You also need to monitor your source streams and job outputs to get a full picture. Spark is a very flexible and rich framework that provides multiple options for monitoring jobs. This post looked into an efficient way to monitor the performance of Spark streaming micro-batches using SparkListeners and integrate the extracted metrics with CloudWatch metrics.

 


About the Author

Amir Shenavandeh is a Hadoop systems engineer with AWS. He helps customers with architectural guidance and technical support using open-source applications, develops and advances the applications of the Hadoop ecosystem and works with the open source community. 

 

 

New – Use CloudWatch Synthetics to Monitor Sites, API Endpoints, Web Workflows, and More

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-use-cloudwatch-synthetics-to-monitor-sites-api-endpoints-web-workflows-and-more/

Today’s applications encompass hundreds or thousands of moving parts including containers, microservices, legacy internal services, and third-party services. In addition to monitoring the health and performance of each part, you need to make sure that the parts come together to deliver an acceptable customer experience.

CloudWatch Synthetics (announced at AWS re:Invent 2019) allows you to monitor your sites, API endpoints, web workflows, and more. You get an outside-in view with better visibility into performance and availability so that you can become aware of and then fix any issues quicker than ever before. You can increase customer satisfaction and be more confident that your application is meeting your performance goals.

You can start using CloudWatch Synthetics in minutes. You simply create canaries that monitor individual web pages, multi-page web workflows such as wizards and checkouts, and API endpoints, with metrics stored in Amazon CloudWatch and other data (screenshots and HTML pages) stored in an S3 bucket. As you create your canaries, you can set CloudWatch alarms so that you are notified when thresholds based on performance, behavior, or site integrity are crossed. You can view screenshots, HAR (HTTP archive) files, and logs to learn more about the failure, with the goal of fixing it as quickly as possible.

CloudWatch Synthetics in Action
Canaries were once used to provide an early warning that deadly gases were present in a coal mine. The canaries provided by CloudWatch Synthetics provide a similar early warning, and are considerably more humane. I open the CloudWatch Console and click Canaries to get started:

I can see the overall status of my canaries at a glance:

I created several canaries last month in preparation for this blog post. I chose a couple of sites including the CNN home page, my personal blog, the Amazon Movers and Shakers page, and the Amazon Best Sellers page. I did not know which sites would return the most interesting results, and I certainly don’t mean to pick on any one of them. I do think that it is important to show you how this (and every) feature performs with real data, so here we go!

I can turn my attention to the Canary runs section, and look at individual data points. Each data point is an aggregation of runs for a single canary:

I can click on the amzn_movers_shakers canary to learn more:

I can see that there was a single TimeoutError issue in the last 24 hours. I can see the screenshots that were captured as part of each run, along with the HAR files and logs. Each HAR file contains a detailed log of the HTTP requests that were made when the canary was run, along with the responses and the amount of time that it took for the request to complete:

Each canary run is implemented using a Lambda function. I can access the function’s execution metrics in the Metrics tab:

And I can see the canary script and other details in the Configuration tab:

Hatching a Canary
Now that you have seen a canary in action, let me show you how to create one. I return to the list of canaries and click Create canary. I can use one of four blueprints to create my canary, or I can upload or import an existing one:

All of these methods ultimately result in a script that is run either once or periodically. The canaries that I showed above were all built from the Heartbeat monitoring blueprint, like this:

I can also create canaries for API endpoints, using either GET or PUT methods, any desired HTTP headers, and some request data:

Another blueprint lets me create a canary that checks a web page for broken links (I’ll use this post):

Finally, the GUI workflow builder lets me create a sophisticated canary that can include simulated clicks, content verification via CSS selector or text, text entry, and navigation to other URLs:

As you can see from these examples, the canary scripts use the syn-1.0 runtime. This runtime supports Node.js scripts that can use the Puppeteer and Chromium packages. Scripts can make use of a set of library functions and can (with the right IAM permissions) access other AWS services and resources. Here’s an example script that calls AWS Secrets Manager:

var synthetics = require('Synthetics');
const log = require('SyntheticsLogger');

const AWS = require('aws-sdk');
const secretsManager = new AWS.SecretsManager();

const getSecrets = async (secretName) => {
    var params = {
        SecretId: secretName
    };
    return await secretsManager.getSecretValue(params).promise();
}

const secretsExample = async function () {
    // Fetch secrets
    var secrets = await getSecrets("secretname")
    
    // Use secrets
    log.info("SECRETS: " + JSON.stringify(secrets));
};

exports.handler = async () => {
    return await secretsExample();
};

Scripts signal success by running to completion, and errors by raising an exception.

After I create my script, I establish a schedule and a pair of data retention periods. I also choose an S3 bucket that will store the artifacts created each time a canary is run:

I can also control the IAM role, set CloudWatch Alarms, and configure access to endpoints that are in a VPC:

Watch the demo video to see CloudWatch Synthetics in action:

Things to Know
Here are a couple of things to know about CloudWatch Synthetics:

Observability – You can use CloudWatch Synthetics in conjunction with ServiceLens and AWS X-Ray to map issues back to the proper part of your application. To learn more about how to do this, read Debugging with Amazon CloudWatch Synthetics and AWS X-Ray and Using ServiceLens to Monitor the Health of Your Applications.

Automation – You can create canaries using the Console, CLI, APIs, and from CloudFormation templates. A programmatic example using the AWS SDK follows this list.

Pricing – As part of the AWS Free Tier you get 100 canary runs per month at no charge. After that, you pay per run, with prices starting at $0.0012 per run, plus the usual charges for S3 storage and Lambda invocations.

Limits – You can create up to 100 canaries per account in the US East (N. Virginia), Europe (Ireland), US West (Oregon), US East (Ohio), and Asia Pacific (Tokyo) Regions, and up to 20 per account in other regions where CloudWatch Synthetics are available.
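
Following up on the Automation item above, here is a hedged boto3 sketch of creating a canary through the API. The bucket names, role ARN, and handler are placeholders; the runtime matches the syn-1.0 runtime shown earlier:

# Hedged sketch: create a canary programmatically instead of via the console.
# Bucket names, role ARN, key, and handler are placeholders.
import boto3

synthetics = boto3.client("synthetics")

synthetics.create_canary(
    Name="my-heartbeat-canary",
    Code={
        "S3Bucket": "my-canary-scripts",              # placeholder
        "S3Key": "canary/heartbeat.zip",              # placeholder zip of the script
        "Handler": "heartbeat.handler",               # placeholder handler name
    },
    ArtifactS3Location="s3://my-canary-artifacts/",   # where screenshots/HAR files go
    ExecutionRoleArn="arn:aws:iam::123456789012:role/CanaryExecutionRole",  # placeholder
    Schedule={"Expression": "rate(5 minutes)"},
    RuntimeVersion="syn-1.0",                         # runtime shown earlier in this post
    SuccessRetentionPeriodInDays=31,
    FailureRetentionPeriodInDays=31,
)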

Available Now
CloudWatch Synthetics are available now and you can start using them today!

Jeff;

Building well-architected serverless applications: Understanding application health – part 2

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-understanding-application-health-part-2/

This series of blog posts uses the AWS Well-Architected Tool with the Serverless Lens to help customers build and operate applications using best practices. In each post, I address the nine serverless-specific questions identified by the Serverless Lens along with the recommended best practices. See the Introduction post for a table of contents and an explanation of the example application.

Question OPS1: How do you evaluate your serverless application’s health?

This post continues part 1 of this Operational Excellence question. Previously, I covered using Amazon CloudWatch out-of-the-box standard metrics and alerts, and structured and centralized logging with the Embedded Metrics Format.

Best practice: Use application, business, and operations metrics

Identifying key performance indicators (KPIs), including business, customer, and operations outcomes, in addition to application metrics, helps to show a higher-level view of how the application and the business are performing.

Business KPIs measure your application performance against business goals. For example, if fewer flight reservations are flowing through the system, the business would like to know.

Customer experience KPIs highlight the overall effectiveness of how customers use the application over time. Examples are perceived latency, time to find a booking, make a payment, etc.

Operational metrics help to see how operationally stable the application is over time. Examples are continuous integration/delivery/deployment feedback time, mean-time-between-failure/recovery, number of on-call pages and time to resolution, etc.

Custom Metrics

The Embedded Metrics Format can also emit custom metrics to help you understand your workload’s health and its impact on the business.

The airline booking service uses AWS Step Functions, AWS Lambda, Amazon SNS, and Amazon DynamoDB.

In the confirm booking module function handler, I add a new namespace and dimension to associate this set of logs with this application and service.

metrics.set_namespace("ServerlessAirlineEMF")
metrics.put_dimensions({"service":"confirm_booking"})

Within the try block of the try/except statement, I emit a metric for a successful booking:

metrics.put_metric("BookingSuccessful", 1, "Count")

Within the except block of the try/except statement, I emit a metric for a failed booking:

metrics.put_metric("BookingFailed", 1, "Count")

Once I make a booking, within the CloudWatch console, I navigate to Logs | Log groups and select the Airline-ConfirmBooking-develop group. I select a log stream and find the custom metric as part of the structured log entry.

structured-log-entry

Custom metric structured log entry

I can also create a custom metric graph. Within the CloudWatch console, I navigate to Metrics. I see the ServerlessAirlineEMF Custom Namespace is available.

custom-namespace

Custom metric namespace

I select the Namespace and the available metric.

available-namespace

Available metric

I select a Line graph, add a name, and see the successful booking now plotted on the graph.

custom-metric-plotted

Plotted CloudWatch metric

I can also visualize and analyze this metric using CloudWatch Insights.

Once a booking is made, within the CloudWatch console, I navigate to Logs | Insights. I select the /aws/lambda/Airline-ConfirmBooking-develop log group. I choose Run query which shows a list of discovered fields on the right of the console.

I can search for discovered booking related fields.

cloudwatch-insights-discovered-fields

I then enter the following in the query pane to search the logs and plot the sum of BookingReference, and choose Run Query:

fields @timestamp, @message
| stats sum(@BookingReference)

cloudwatch-insights-displayed-bookingreference

CloudWatch Insights query

There are a number of other component metrics that are important to measure. Track interactions between upstream and downstream components such as message queue length, integration latency, and throttling.

Improvement plan summary:

  1. Identify user journeys and metrics from each customer transaction.
  2. Create custom metrics asynchronously instead of synchronously for improved performance, cost, and reliability outcomes.
  3. Emit business metrics from within your workload to measure application performance against business goals.
  4. Create and analyze component metrics to measure interactions with upstream and downstream components.
  5. Create and analyze operational metrics to assess the health of your continuous delivery pipeline and operational processes.

Good practice: Use distributed tracing and code is instrumented with additional context

Logging provides information on the individual point in time events the application generates. Tracing provides a wider continuous view of an application. Tracing helps to follow a user journey or transaction through the application.

AWS X-Ray is an example of a distributed tracing solution; there are a number of third-party options as well. X-Ray collects data about the requests that your application serves and also builds a service graph, which shows all the components of an application. This provides visibility to understand how upstream and downstream services may affect workload health.

For the most comprehensive view, enable X-Ray across as many services as possible and include X-Ray tracing instrumentation in code. This is the list of AWS Services integrated with X-Ray.

X-Ray receives data from services as segments. Segments for a common request are then grouped into traces. Segments can be further broken down into more granular subsegments. Custom data key-value pairs are added to segments and subsegments with annotations and metadata. Traces can also be grouped, which helps with filter expressions.

AWS Lambda instruments incoming requests for all supported languages. Lambda application code can be further instrumented to emit information about its status, correlation identifiers, and business outcomes to determine transaction flows across your workload.

X-Ray tracing for a Lambda function is enabled in the Lambda Console. Select a Lambda function. I select the Airline-ReserveBooking-develop function. In the Configuration pane, I select to enable X-Ray Active tracing.

X-Ray tracing enabled

X-Ray can also be enabled via CloudFormation with the following code:

TracingConfig:
  Mode: Active

Lambda IAM permissions to write to X-Ray are added automatically when active tracing is enabled via the console. When using CloudFormation, allow the actions xray:PutTraceSegments and xray:PutTelemetryRecords in the function's execution role.
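
One way to grant these permissions is to attach the AWSXRayDaemonWriteAccess managed policy, which includes both actions, to the function's execution role. A minimal boto3 sketch, assuming a hypothetical execution role name:

import boto3

iam = boto3.client('iam')

# Attach the managed policy that grants xray:PutTraceSegments and xray:PutTelemetryRecords
iam.attach_role_policy(
    RoleName='airline-reservebooking-execution-role',  # assumed role name
    PolicyArn='arn:aws:iam::aws:policy/AWSXRayDaemonWriteAccess'
)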

It is important to understand what invocations X-Ray does trace. X-Ray applies a sampling algorithm. If an upstream service, such as API Gateway with X-Ray tracing enabled, has already sampled a request, the Lambda function request is also sampled. Without an upstream request, X-Ray traces data for the first Lambda invocation each second, and then 5% of additional invocations.

For the airline application, X-Ray tracing is initiated within the shared library with the code:

from aws_xray_sdk.core import models, patch_all, xray_recorder

Segments, subsegments, annotations, and metadata are added to functions with the following example code:

segment = xray_recorder.begin_segment('segment_name')
# Start a subsegment
subsegment = xray_recorder.begin_subsegment('subsegment_name')
# Add metadata and annotations
segment.put_metadata('key', dict, 'namespace')
subsegment.put_annotation('key', 'value')
# Close the subsegment and segment
xray_recorder.end_subsegment()
xray_recorder.end_segment()

For example, within the collect payment module, an annotation is added for a successful payment with:

tracer.put_annotation("PaymentStatus", "SUCCESS")

CloudWatch ServiceLens

Once a booking is made and payment is successful, the tracing is available in the X-Ray console.

I explore how Amazon CloudWatch ServiceLens connects metrics, logs and the X-Ray tracing. Within the CloudWatch console, I navigate to ServiceLens | Service Map.

I can visualize all application resources and dependencies where X-Ray is enabled. I can trace performance or availability issues. If there was an issue connecting to SNS for example, this would be shown.

I select the Airline-CollectPayment-develop node and can view the out-of-the-box standard Lambda metrics.

I can select View Logs to jump to the CloudWatch Logs Insights console.

CloudWatch Insights Service map

I select View dashboard to see the function metrics, node map, and function details.

CloudWatch Insights Service Map dashboard

I select View traces and can filter by the custom annotation PaymentStatus. I select SUCCESS and choose Add to filter. I then select a trace.

CloudWatch Insights Filtered traces

I see the full trace details, which show the full application transaction of a payment collection.

Segments timeline

Selecting the Lambda handler subsegment – ## lambda_handler, I can view the trace Annotations and Metadata, which include the business transaction details such as Customer and PaymentStatus.

X-Ray annotations

Trace groups are another feature of X-Ray and ServiceLens. Trace groups use filter expressions such as Annotation.PaymentStatus = "FAILED" which are used to view traces that match the particular group. Service graphs can also be viewed, and CloudWatch alarms created based on the group.
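
Trace groups can also be created programmatically with the X-Ray API. A minimal boto3 sketch, where the group name is an assumption:

import boto3

xray = boto3.client('xray')

# Group traces where the payment annotation indicates a failure
xray.create_group(
    GroupName='FailedPayments',  # assumed group name
    FilterExpression='annotation.PaymentStatus = "FAILED"'
)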

CloudWatch ServiceLens provides powerful capabilities to understand application performance bottlenecks and issues, helping determine how users are impacted.

Improvement plan summary:

  1. Identify common business context and system data that are commonly present across multiple transactions.
  2. Instrument SDKs and requests to upstream/downstream services to understand the flow of a transaction across system components.

Recent announcements

There have been a number of recent announcements for X-Ray and CloudWatch to improve how to evaluate serverless application health.

Conclusion

Evaluating application health helps you identify which services should be optimized to improve your customer’s experience. In part 1, I cover out-of-the-box standard metrics and alerts, as well as structured and centralized logging. In this post, I explore custom metrics and distributed tracing and show how to use ServiceLens to view logs, metrics, and traces together.

In an upcoming post, I will cover the next Operational Excellence question from the Well-Architected Serverless Lens – Approaching application lifecycle management.

Simplified Time-Series Analysis with Amazon CloudWatch Contributor Insights

Post Syndicated from Steve Roberts original https://aws.amazon.com/blogs/aws/simplified-time-series-analysis-with-amazon-cloudwatch-contributor-insights/

Inspecting multiple log groups and log streams can make it more difficult and time consuming to analyze and diagnose the impact of an issue in real time. What customers are affected? How badly? Are some affected more than others, or are there outliers? Perhaps you performed a deployment of an update using a staged rollout strategy and now want to know if any customers have hit issues, or if everything is behaving as expected for the target customers before continuing further. All of the data points to help answer these questions are potentially buried in a mass of logs, which engineers either query to get ad-hoc measurements, or build and maintain custom dashboards to help track.

Amazon CloudWatch Contributor Insights, generally available today, is a new feature to help simplify analysis of Top-N contributors to time-series data in CloudWatch Logs that can help you more quickly understand who or what is impacting system and application performance, in real-time, at scale. This saves you time during an operational problem by helping you understand what is contributing to the operational issue and who or what is most affected. Amazon CloudWatch Contributor Insights can also help with ongoing analysis for system and business optimization by easily surfacing outliers, performance bottlenecks, top customers, or top utilized resources, all at a glance. In addition to logs, Amazon CloudWatch Contributor Insights can also be used with other products in the CloudWatch portfolio, including Metrics and Alarms.

Amazon CloudWatch Contributor Insights can analyze structured logs in either JSON or Common Log Format (CLF). Log data can be sourced from Amazon Elastic Compute Cloud (EC2) instances, AWS CloudTrail, Amazon Route 53, Apache Access and Error Logs, Amazon Virtual Private Cloud (VPC) Flow Logs, AWS Lambda Logs, and Amazon API Gateway Logs. You also have the choice of using structured logs published directly to CloudWatch, or using the CloudWatch Agent. Amazon CloudWatch Contributor Insights will evaluate these log events in real-time and display reports that show the top contributors and number of unique contributors in a dataset. A contributor is an aggregate metric based on dimensions contained as log fields in CloudWatch Logs, for example account-id, or interface-id in Amazon Virtual Private Cloud Flow Logs, or any other custom set of dimensions. You can sort and filter contributor data based on your own custom criteria. Report data from Amazon CloudWatch Contributor Insights can be displayed on CloudWatch dashboards, graphed alongside CloudWatch metrics, and added to CloudWatch alarms. For example customers can graph values from two Amazon CloudWatch Contributor Insights reports into a single metric describing the percentage of customers impacted by faults, and configure alarms to alert when this percentage breaches pre-defined thresholds.

Getting Started with Amazon CloudWatch Contributor Insights
To use Amazon CloudWatch Contributor Insights I simply need to define one or more rules. A rule is simply a snippet of data that defines what contextual data to extract for metrics reported from CloudWatch Logs. To configure a rule to identify the top contributors for a specific metric I supply three items of data – the log group (or groups), the dimensions for which the top contributors are evaluated, and filters to narrow down those top contributors. To do this, I head to the Amazon CloudWatch console dashboard and select Contributor Insights from the left-hand navigation links. This takes me to the Amazon CloudWatch Contributor Insights home where I can click Create a rule to get started.

To get started quickly, I can select from a library of sample rules for various services that send logs to CloudWatch Logs. You can see above that there are currently a variety of sample rules for Amazon API Gateway, Amazon Route 53 Query Logs, Amazon Virtual Private Cloud Flow Logs, and logs for container services. Alternatively, I can define my own rules, as I’ll do in the rest of this post.

Let’s say I have a deployed application that is publishing structured log data in JSON format directly to CloudWatch Logs. This application has two API versions, one that has been used for some time and is considered stable, and a second that I have just started to roll out to my customers. I want to know as early as possible if anyone who has received the new version, targeting the new api, is receiving any faults and how many faults are being triggered. My stable api version is sending log data to one log group and my new version is using a different group, so I need to monitor multiple log groups (since I also want to know if anyone is experiencing any error, regardless of version).

The JSON to define my rule, to report on 500 errors coming from any of my APIs, and to use account ID, HTTP method, and resource path as dimensions, is shown below.

{
  "Schema": {
    "Name": "CloudWatchLogRule",
    "Version": 1
  },
  "AggregateOn": "Count",
  "Contribution": {
    "Filters": [
      {
        "Match": "$.status",
        "EqualTo": 500
      }
    ],
    "Keys": [
      "$.accountId",
      "$.httpMethod",
      "$.resourcePath"
    ]
  },
  "LogFormat": "JSON",
  "LogGroupNames": [
    "MyApplicationLogsV*"
  ]
}

I can set up my rule using either the Wizard tab, or I can paste the JSON above into the Rule body field on the Syntax tab. Even though I have the JSON above, I’ll show using the Wizard tab in this post and you can see the completed fields below. When selecting log groups I can either select them from the drop down, if they already exist, or I can use wildcard syntax in the Select by prefix match option (MyApplicationLogsV* for example).

Clicking Create saves the new rule and makes it immediately start processing and analyzing data (unless I elect to create it in a disabled state, of course). Note that Amazon CloudWatch Contributor Insights processes new log data created once the rule is active; it does not perform historical inspection, so I need to build rules for operational scenarios that I anticipate happening in the future.
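
The same rule can also be created programmatically. A minimal boto3 sketch, assuming the rule is named MyApplication500Errors:

import json
import boto3

cloudwatch = boto3.client('cloudwatch')

# Same rule definition as the JSON shown above
rule_definition = {
    "Schema": {"Name": "CloudWatchLogRule", "Version": 1},
    "AggregateOn": "Count",
    "Contribution": {
        "Filters": [{"Match": "$.status", "EqualTo": 500}],
        "Keys": ["$.accountId", "$.httpMethod", "$.resourcePath"]
    },
    "LogFormat": "JSON",
    "LogGroupNames": ["MyApplicationLogsV*"]
}

cloudwatch.put_insight_rule(
    RuleName='MyApplication500Errors',  # assumed rule name
    RuleState='ENABLED',
    RuleDefinition=json.dumps(rule_definition)
)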

With the rule in place I need to start generating some log data! To do that I’m going to use a script, written using the AWS Tools for PowerShell, to simulate my deployed application being invoked by a set of customers. Of those customers, a select few (let’s call them the unfortunate ones) will be directed to the new API version which will randomly fail on HTTP POST requests. Customers using the old API version will always succeed. The script, which runs for 5000 iterations, is shown below. The cmdlets being used to work with CloudWatch Logs are the ones with CWL in the name, for example Write-CWLLogEvent.

# Set up some random customer ids, and select a third of them to be our unfortunates
# who will experience random errors due to a bad api update being shipped!
$allCustomerIds = @( 1..15 | % { Get-Random })
$faultingIds = $allCustomerIds | Get-Random -Count 5

# Setup some log groups
$group1 = 'MyApplicationLogsV1'
$group2 = 'MyApplicationLogsV2'
$stream = "MyApplicationLogStream"

# When writing to a log stream we need to specify a sequencing token
$group1Sequence = $null
$group2Sequence = $null

$group1, $group2 | % {
    if (!(Get-CWLLogGroup -LogGroupName $_)) {
        New-CWLLogGroup -LogGroupName $_
        New-CWLLogStream -LogGroupName $_ -LogStreamName $stream
    } else {
        # When the log group and stream exist, we need to seed the sequence token to
        # the next expected value
        $logstream = Get-CWLLogStream -LogGroupName $_ -LogStreamName $stream
        $token = $logstream.UploadSequenceToken
        if ($_ -eq $group1) {
            $group1Sequence = $token
        } else {
            $group2Sequence = $token
        }
    }
}

# generate some log data with random failures for the subset of customers
1..5000 | % {

    Write-Host "Log event iteration $_" # just so we know where we are progress-wise

    $customerId = Get-Random $allCustomerIds

    # first select whether the user called the v1 or the v2 api
    $useV2Api = ((Get-Random) % 2 -eq 1)
    if ($useV2Api) {
        $resourcePath = '/api/v2/some/resource/path/'
        $targetLogGroup = $group2
        $nextToken = $group2Sequence
    } else {
        $resourcePath = '/api/v1/some/resource/path/'
        $targetLogGroup = $group1
        $nextToken = $group1Sequence
    }

    # now select whether they failed or not. GET requests for all customers on
    # all api paths succeed. POST requests to the v2 api fail for a subset of
    # customers.
    $status = 200
    $errorMessage = ''
    if ((Get-Random) % 2 -eq 0) {
        $httpMethod = "GET"
    } else {
        $httpMethod = "POST"
        if ($useV2Api -And $faultingIds.Contains($customerId)) {
            $status = 500
            $errorMessage = 'Uh-oh, something went wrong...'
        }
    }

    # Write an event and gather the sequence token for the next event
    $nextToken = Write-CWLLogEvent -LogGroupName $targetLogGroup -LogStreamName $stream -SequenceToken $nextToken -LogEvent @{
        TimeStamp = [DateTime]::UtcNow
        Message = (ConvertTo-Json -Compress -InputObject @{
            requestId = [Guid]::NewGuid().ToString("D")
            httpMethod = $httpMethod
            resourcePath = $resourcePath
            status = $status
            protocol = "HTTP/1.1"
            accountId = $customerId
            errorMessage = $errorMessage
        })
    }

    if ($targetLogGroup -eq $group1) {
        $group1Sequence = $nextToken
    } else {
        $group2Sequence = $nextToken
    }

    Start-Sleep -Seconds 0.25
}

I start the script running, and with my rule enabled, I start to see failures show up in my graph. Below is a snapshot after several minutes of running the script. I can clearly see a subset of my simulated customers are having issues with HTTP POST requests to the new v2 API.

From the Actions pull down in the Rules panel, I could now go on to create a single metric from this report, describing the percentage of customers impacted by faults, and then configure an alarm on this metric to alert when this percentage breaches pre-defined thresholds.

For the sample scenario outlined here I would use the alarm to halt the rollout of the new API if it fired, preventing the impact spreading to additional customers, while investigation of what is behind the increased faults is performed. Details on how to set up metrics and alarms can be found in the user guide.
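
As a sketch only, an alarm on the number of unique contributors reported by the rule could be created programmatically as shown below. The rule name, alarm name, threshold, and SNS topic ARN are assumptions; the percentage-of-customers metric described above would combine two such INSIGHT_RULE_METRIC expressions with metric math in the same way.

import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='FaultingCustomersHigh',  # assumed alarm name
    Metrics=[
        {
            # INSIGHT_RULE_METRIC exposes Contributor Insights report data to metric math
            'Id': 'faulting_customers',
            'Expression': 'INSIGHT_RULE_METRIC(MyApplication500Errors, UniqueContributors)',
            'Label': 'Customers receiving 500 errors',
            'Period': 60,
            'ReturnData': True
        }
    ],
    EvaluationPeriods=1,
    Threshold=3,
    ComparisonOperator='GreaterThanThreshold',
    TreatMissingData='notBreaching',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:ops-alerts']  # assumed topic ARN
)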

Amazon CloudWatch Contributor Insights is available now to users in all commercial AWS Regions, including China and GovCloud.

— Steve

CloudWatch Contributor Insights for DynamoDB – Now Generally Available

Post Syndicated from Harunobu Kameda original https://aws.amazon.com/blogs/aws/cloudwatch-contributor-insights-for-dynamodb-now-generally-available/

Amazon DynamoDB provides our customers a fully-managed key-value database service that can easily scale from a few requests per month to millions of requests per second. DynamoDB supports some of the world’s largest scale applications by providing consistent, single-digit millisecond response times at any scale. You can build applications with virtually unlimited throughput and storage. DynamoDB global tables replicate your data across multiple AWS Regions to give you fast, local access to data for your globally distributed applications. For use cases that require even faster access with microsecond latency, DynamoDB Accelerator (DAX) provides a fully managed in-memory cache.

In November 2019, we announced Amazon CloudWatch Contributor Insights for Amazon DynamoDB as Preview, and today, I am happy to announce it is Generally Available for all AWS Regions.

Amazon CloudWatch Contributor Insights for Amazon DynamoDB

Amazon CloudWatch Contributor Insights, which also launched in November 2019, analyzes log data and creates time-series visualizations to provide a view of top contributors influencing system performance. You do this by creating Contributor Insights rules to evaluate CloudWatch Logs (including logs from AWS services) and any custom logs sent by your service or on-premises servers. For example, you can find bad hosts, identify the heaviest network users, or find the URLs that generate the most errors.

For developers building applications on top of DynamoDB, it’s useful to understand your database access patterns, such as traffic trends and frequently accessed keys, to help optimize DynamoDB costs and performance. You can create visualizations of these patterns with just a few clicks in the console using CloudWatch Contributor Insights for DynamoDB. DynamoDB automatically creates the required CloudWatch resources, then provides a summary view of the graphs. This summary view lives in the DynamoDB console, but you can also see the individual rule details in the CloudWatch console with other CloudWatch Contributor Insights rules, reports, and graphs of report data.

You can use these graphs to view traffic trends and pinpoint any hot keys in your DynamoDB tables.

How it works

Let’s see how it works and how it benefits developers. Here is a table on DynamoDB.

If we select any table, we can see details of the table. Click the new tab called [Contributor Insights].

CloudWatch Contributor Insights is currently DISABLED. You can enable it by selecting the upper [Contributor Insights] tab.

When you access the [Contributor Insights] tab, you can check its status. The activation process is very easy (this is one of my favorite points of this feature!) If you click [Manage Contributor Insights], a dialog box appears.

If you choose [Enabled] and click [Confirm], the dashboard appears, and CloudWatch Contributor Insights for DynamoDB will record every access to the table.

After a while, you will see table insights displayed in several graphs.

You can change the time range in the upper right corner of the dashboard, or simply click and drag a graph directly.

What does the dashboard tell us?

The dashboard shows us 4 metrics, which are powerful insights for application performance tuning. DynamoDB creates separate visualizations for partition key vs. partition+sort key, so if your table doesn’t have a sort key, then you will only see two graphs, not all four.

  • Most Accessed Items (Partition Key only) – Identifies the partition keys of the most accessed items in your table or global secondary index.
  • Most Accessed Items (Partition Key + Sort Key) – Identifies the partition and sort keys of the most accessed items in your table or global secondary index.
  • Most Throttled Items (Partition Key only) – Identifies the partition keys of the most throttled items in your table or global secondary index.
  • Most Throttled Items (Partition Key + Sort Key) – Identifies the partition and sort keys of the most throttled items in your table or global secondary index.

Most Accessed Items

Let’s break down what these metrics and graphs mean, starting with the “Most Accessed Items” metrics. This metric shows the frequency of which a key is accessed, based on both read and write traffic.

Outliers in these graphs are your most frequently accessed, or hottest, keys. Many DynamoDB workloads have at least some imbalanced traffic, but you can use this graph to see whether your workload will bump against DynamoDB’s per-key limits. On the other hand, if you see several closely clustered lines without any obvious outliers, it indicates that your workload is relatively balanced across items over the given time window (great job balancing your workload!)

Most Throttled Items

The “Most Throttled Items” shows just that, a graph of throttle count over time for your most throttled keys. If you see no data in this graph, it means your requests have not been throttled. If you see isolated points instead of connected lines, that indicates an item was throttled only for a brief period.

The blog article “Choosing the Right DynamoDB Partition Key” covers considerations and strategies for choosing the right partition key when designing a schema that uses Amazon DynamoDB. Choosing the right partition key is an important step in designing and building scalable and reliable applications on top of DynamoDB. You can also check our DynamoDB documentation page “Best Practices for Designing and Using Partition Keys Effectively”.

Integrating with CloudWatch Dashboard

This feature is integrated with CloudWatch for ease of use. You can integrate any of these graphs onto an existing CloudWatch dashboard. Let’s see how to do it. Going back to the DynamoDB dashboard, click [Add to dashboard].
You are redirected to the CloudWatch Management Console and asked which dashboard to add the graphs to.

You can choose any existing dashboard or create a new one. For example, I put these metrics into my existing test dashboard as [test20180321].

Activating the feature does not affect anything in your existing production environment. You can enable it or disable it at any time.

Generally Available Today

This feature is generally available today for all AWS regions.

– Kame;

 

Building well-architected serverless applications: Understanding application health – part 1

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-understanding-application-health-part-1/

This series of blog posts uses the AWS Well-Architected Tool with the Serverless Lens to help customers build and operate applications using best practices. In each post, I address the nine serverless-specific questions identified by the Serverless Lens along with the recommended best practices. See the Introduction post for a table of contents and an explanation of the example application.

Question OPS1: How do you evaluate your serverless application’s health?

Evaluating your metrics, distributed tracing, and logging gives you insight into business and operational events, and helps you understand which services should be optimized to improve your customer’s experience. By understanding the health of your Serverless Application, you will know whether it is functioning as expected, and can proactively react to any signals that indicate it is becoming unhealthy.

Required practice: Understand, analyze, and alert on metrics provided out of the box

It is important to understand metrics for every AWS service used in your application so you can decide how to measure its behavior. AWS services provide a number of out-of-the-box standard metrics to help monitor the operational health of your application.

As these metrics are generated automatically, it is a simple way to start monitoring your application and can also be augmented with custom metrics.

The first stage is to identify which services the application uses. The airline booking component uses AWS Step Functions, AWS Lambda, Amazon SNS, and Amazon DynamoDB.

When I make a booking, as shown in the Introduction post, AWS services emit metrics to Amazon CloudWatch. These are processed asynchronously without impacting the application’s performance.

There are two default CloudWatch dashboards to visualize key metrics quickly: per service and cross service.

Per service

To view the per service metrics dashboard, I open the CloudWatch console.

I select a service where Overview is shown, such as Lambda. Now I can view the metrics for all Lambda functions in the account.

Cross service

To see an overview of key metrics across all AWS services, open the CloudWatch console and choose View cross service dashboard.

I see a list of all services with one or two key metrics displayed. This provides a good overview of all services your application uses.

Alerting

The next stage is to identify the key metrics for comparison and set up alerts for under- and over-performing services. There are a number of recommended metrics to alarm on for each AWS service the application uses.

Alerts can be configured manually or via infrastructure as code tools such as the AWS Serverless Application Model, AWS CloudFormation, or third-party tools.

To configure a manual alert for Lambda function errors using CloudWatch Alarms:

  1. I open the CloudWatch console and select Alarms and select Create Alarm.
  2. I choose Select Metric and, from AWS Namespaces, select Lambda, Across All Functions, then select Errors and choose Select metric.
  3. I change the Statistic to Sum and the Period to 1 minute.
  4. Under Conditions, I select a Static threshold Greater than 1 and select Next.

Alarms can also be created using anomaly detection rather than static values if there is a discernible pattern or trend. Anomaly detection looks at past metric data and uses machine learning to create a model of expected values. Alerts can then be configured if they fall outside this band of “normal” values. I use a Static threshold for this alarm.

  5. For the notification, I set the alarm to send to an existing SNS topic with my email address subscribed, then choose Next.
  6. I enter a descriptive alarm name such as serverlessairline-lambda-prod-errors > 1, select Next, and choose Create alarm.

I have now manually set up an alarm.

Use CloudWatch composite alarms to combine multiple alarms to reduce noise and focus on critical issues. For example, a single alarm could trigger if there are both Lambda function errors as well as high Lambda concurrent executions.

It is simpler and more scalable to include alerting within infrastructure as code, for example by defining the same alarm as a resource in an AWS CloudFormation or AWS SAM template rather than creating it in the console.
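
For illustration, the equivalent alarm could also be created programmatically with the AWS SDK. A minimal boto3 sketch, assuming an existing SNS topic ARN and reusing the alarm name from the manual steps above:

import boto3

cloudwatch = boto3.client('cloudwatch')

# Alarm when the account-wide Lambda Errors metric sums to more than 1 over one minute
cloudwatch.put_metric_alarm(
    AlarmName='serverlessairline-lambda-prod-errors',
    Namespace='AWS/Lambda',
    MetricName='Errors',
    Statistic='Sum',
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:ops-alerts']  # assumed topic ARN
)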

In this example, I view the out-of-the-box standard metrics and manually create an alarm for Lambda function errors.

Improvement plan summary:

  1. Understand what metrics and dimensions each managed service used provides.
  2. Configure alerts on relevant metrics for when services are unhealthy.

Good practice: Use structured and centralized logging

Central logging provides a single place to search and analyze logs. Structured logging means selecting a consistent log format and content structure to simplify querying across multiple components.

To identify a business transaction across components, such as a particular flight booking, log operational information from upstream and downstream services. Add information such as customer_id along with business outcomes such as order=accepted or order=confirmed. Make sure you are not logging any sensitive or personal identifying data in any logs.

Use JSON as your logging output format. Log multiple fields in a single object or dictionary rather than many one line messages for simpler searching.

Here is an example of a structured logging format.
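
A minimal sketch of emitting such an entry from Python is shown below; the field names and values are illustrative assumptions:

import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def log_booking_confirmed(customer_id, booking_reference):
    # Emit one structured JSON object per event rather than several one-line messages
    logger.info(json.dumps({
        "service": "booking",
        "operation": "confirm_booking",
        "customer_id": customer_id,
        "booking_reference": booking_reference,
        "outcome": "order=confirmed"
    }))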

The airline booking component, which is written in Python, currently uses a shared library with a separate log processing stack.

Embedded Metrics Format is a simpler mechanism to replace the shared library and use structured logging. CloudWatch Embedded Metrics adds environmental metadata such as Lambda Function version and also automatically extracts custom metrics so you can visualize and alarm on them. There are open-source client libraries available for Node.js and Python.

I then add embedded metrics to the individual confirm booking module with the following steps:

  1. I install the aws-embedded-metrics library using the instructions.
  2. In the function init code, I import the module and create a metric_scope with the following code:

from aws_embedded_metrics import metric_scope
@metric_scope

  3. In the function handler, I log the generated bookingReference with the following code.

metrics.set_property("BookingReference", ret["bookingReference"])

In this example I also log the entire incoming event details.

metrics.set_property("event", event)

It is best practice to only log what is required to avoid unnecessary costs. Ensure the event does not have any sensitive or personal identifying data which is available to anyone who has access to the logs.

To avoid the duplicate logging in this example airline application which adds cost, I remove the existing shared library logger.*() lines.

When I make a booking, the CloudWatch log message is in structured JSON format. It contains the properties I set, event and BookingReference, as well as function metadata.

I can then search for all log activity related to a specific booking across multiple functions with booking_id. I can track customer activity across multiple bookings using customer_id.

Logging is often created as a shared library resource which all functions reference. Another option is using Lambda Layers, which lets functions import additional code such as external libraries. Multiple functions can share this code.

Improvement plan summary:

  1. Log request identifiers from downstream services, component name, component runtime information, unique correlation identifiers, and information that helps identify a business transaction.
  2. Use JSON as the logging output. Prefer logging entire objects/dictionaries rather than many one line messages. Mask or remove sensitive data when logging.
  3. Minimize logging of debugging information, as it can both incur costs and increase the noise-to-signal ratio.

Conclusion

Evaluating serverless application health helps understand which services should be optimized to improve your customer’s experience. I cover out of the box metrics and alerts, as well as structured and centralized logging.

This well-architected question will be continued in an upcoming post where I look at custom metrics and distributed tracing.

Building a serverless URL shortener app without AWS Lambda – part 3

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/building-a-serverless-url-shortener-app-without-lambda-part-3/

This is the final installment of a three-part series on building a serverless URL shortener without using AWS Lambda. This series highlights the power of Amazon API Gateway and its ability to directly integrate with services like Amazon DynamoDB. The result is a low latency, highly available application that is built with managed services and requires minimal code.

In part one of this series, I demonstrate building a serverless URL shortener application without using AWS Lambda. In part two, I walk through implementing application security using Amazon API Gateway settings and Amazon Cognito. In this part of this series, I cover application observability and performance.

Application observability

Before I can gauge the performance of the application, I must first be able to observe the performance of my application. There are two AWS services that I configure to help with observability, AWS X-Ray and Amazon CloudWatch.

X-Ray

X-Ray is a tracing service that enables developers to observe and debug distributed applications. With X-Ray enabled, every call to the API Gateway endpoint is tagged and monitored throughout the application services. Now that I have the application up and running, I want to test for errors and latency. I use Nordstrom’s open-source load testing library, serverless-artillery, to generate activity to the API endpoint. During the load test, serverless-artillery generates 8,000 requests per second (RPS) for a period of five minutes. The results are as follows:

X-Ray Tracing

This indicates that, from the point the request reaches API Gateway to when a response is generated, the average time for each request is 8 milliseconds (ms) with a 4 ms integration time to DynamoDB. It also indicates that there were no errors, faults, or throttling.

I change the parameters to increase the load and observe how the application performs. This time serverless-artillery generates 11,000rps for a period of 30 seconds. The results are as follows:

X-Ray tracing with throttling

X-Ray now indicates request throttling. This is due to the default throttling limits of API Gateway. Each account has a soft limit of 10,000rps with a burst limit of 5k requests. Since I am load testing the API with 11,000rps, API Gateway is throttling requests over 10k per second. When throttling occurs, API Gateway responds to the client with a status code of 429. Using X-Ray, I can drill down into the response data to get a closer look at requests by status code.

X-Ray analytics

CloudWatch

The next tool I use for application observability is Amazon CloudWatch. CloudWatch captures data for individual services and supports metric based alarms. I create the following alarms to have insight into my application:

  • APIGateway4xxAlarm – One percent of the API calls result in a 4xx error over a one-minute period.
  • APIGateway5xxAlarm – One percent of the API calls result in a 5xx error over a one-minute period.
  • APIGatewayLatencyAlarm – The p99 latency is over 75 ms over a five-minute period.
  • DDB4xxAlarm – One percent of the DynamoDB requests result in a 4xx error over a one-minute period.
  • DDB5xxAlarm – One percent of the DynamoDB requests result in a 5xx error over a one-minute period.
  • CloudFrontTotalErrorRateAlarm – Five requests to CloudFront result in a 4xx or 5xx error over a one-minute period.
  • CloudFrontTotalCacheHitRateAlarm – 80% or less of the requests to CloudFront result in a cache hit over a five-minute period. While this is not an error or a problem, it indicates the need for a more aggressive caching story.

Each of these alarms is configured to publish to a notification topic using Amazon Simple Notification Service (SNS). In this example I have configured my email address as a subscriber to the SNS topic. I could also subscribe a Lambda function or a mobile number for SMS message notification. I can also get a quick view of the status of my alarms on the CloudWatch console.

CloudWatch and X-Ray provide additional alerts when there are problems. They also provide observability to help remediate discovered issues.

Performance

With observability tools in place, I am now able to evaluate the performance of the application. In part one, I discuss using API Gateway and DynamoDB as the primary services for this application and the performance advantage provided. However, these performance advantages are limited to the backend only. To improve performance between the client and the API I configure throttling and a content delivery network with Amazon CloudFront.

Throttling

Request throttling is handled with API Gateway and can be configured at the stage level or at the resource and method level. Because this application is a URL shortener, the most important action is the 301 redirect that happens at /{linkId} – GET. I want to ensure that these calls take priority, so I set a throttling limit on all other actions.

The best way to do this is to set a global throttling of 2000rps with a burst of 1k. I then configure an override on the /{linkId} – GET method to 10,000rps with a burst of 5k. If the API is experiencing an extraordinarily high volume of calls, all other calls are rejected.
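
As an illustration, these stage and method throttling settings could be applied with a boto3 call like the sketch below. The API ID and stage name are assumptions, and '/' characters in the resource path are escaped as ~1 in the patch path:

import boto3

apigateway = boto3.client('apigateway')

apigateway.update_stage(
    restApiId='a1b2c3d4e5',  # assumed API ID
    stageName='prod',        # assumed stage name
    patchOperations=[
        # Stage-wide defaults for all resources and methods
        {'op': 'replace', 'path': '/*/*/throttling/rateLimit', 'value': '2000'},
        {'op': 'replace', 'path': '/*/*/throttling/burstLimit', 'value': '1000'},
        # Override for the /{linkId} GET redirect method
        {'op': 'replace', 'path': '/~1{linkId}/GET/throttling/rateLimit', 'value': '10000'},
        {'op': 'replace', 'path': '/~1{linkId}/GET/throttling/burstLimit', 'value': '5000'},
    ]
)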

Content delivery network

The distance between a user and the API endpoint can severely affect the performance of an application. Simply put, the further the data has to travel, the slower the application. By configuring a CloudFront distribution to use the Amazon CloudFront Global Edge Network, I bring the data closer to the user and increase performance.

I configure the cache for /{linkId} – GET to “max-age=300” which tells CloudFront to store the response of that call for 5 minutes. The first call queries the API and database for a response, while all subsequent calls in the next five minutes receive the local cached response. I then set all other endpoint cache to “no-cache, no-store”, which tells CloudFront to never store the value from these calls. This ensures that as users are creating or editing their short-links, they get the latest data.

By bringing the data closer to the user, I now ensure that regardless of where the user is, they receive improved performance. To evaluate this, I return to serverless-artillery and test the CloudFront endpoint. The results are as follows:

Min      Max     Average  p10      p50      p90    p95    p99
8.12 ms  739 ms  21.7 ms  10.1 ms  12.1 ms  20 ms  34 ms  375 ms

To be clear, these are the 301 redirect response times. I configured serverless-artillery not to follow the redirects as I have no control of the speed of the resulting site. The maximum response time was 739 ms. This would be the initial uncached call. The p50 metric shows that half of the traffic is seeing a 12 ms or better response time, while the p95 indicates that most of my traffic is experiencing a response time of 34 ms or better.

Conclusion

In this series, I talk through building a serverless URL shortener without the use of any Lambda functions. The resulting architecture looks like this:

This application is built by integrating multiple managed services together and applying business logic via mapping templates on API Gateway. Because these are all serverless managed services, they provide inherent availability and scale to meet the client load as needed.

While the “Lambda-less” pattern is not a match for every application, it is a great answer for building highly performant applications with minimal logic. The advantage to this pattern is also in its extensibility. With the data saved to DynamoDB, I can use the DynamoDB streaming feature to connect additional processing as needed. I can also use CloudFront access logs to evaluate internal application metrics. Clone this repo to start serving your own shortened URLs and submit a pull request if you have an improvement.

Did you miss any of this series?

  1. Part 1: Building the application.
  2. Part 2: Securing the application.

Happy coding!

Analyzing API Gateway custom access logs for custom domain names

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/analyzing-api-gateway-custom-access-logs-for-custom-domain-names/

This post is courtesy of Taka Matsumoto, Cloud Support Engineer, AWS

If you are using custom domain names in Amazon API Gateway, it can be useful to gain insights into requests sent to each custom domain name. Although API Gateway provides CloudWatch metrics and options to deliver request logs to Amazon CloudWatch Logs, there is no pre-defined metric or log specific to custom domain names. If there is more than one custom domain name mapped to a single API, understanding the quantity and type of requests by domain name may help understand request patterns.

Using the custom access logging option in API Gateway enables delivery of custom logs to CloudWatch Logs, which can be analyzed using CloudWatch Logs Insights. This blog post walks through the steps to create a CloudWatch log group for API custom access logging, and uses CloudWatch Logs Insights for analysis.

Overview

In the tutorial, you create a CloudWatch log group for custom access logging. You then enable custom access logging for an API stage associated with a custom domain name. The IAM role used in this tutorial must be able to create and update the relevant resources in CloudWatch, IAM, and API Gateway. For this tutorial, use the US East (N. Virginia) Region. In the next steps, the tutorial covers:

  1. Creating a CloudWatch Log group.
  2. Creating an IAM role for access logging.
  3. Enabling custom access logging.
  4. Testing an API using a custom domain name.
  5. Analyzing Logs in CloudWatch Logs Insights.

Create a CloudWatch Log group

Before enabling custom access logging for your API’s stage, create a CloudWatch log group to deliver custom logs. Create a log group called APIGateway_CustomDomainLogs by following these steps:

  1. Go to the CloudWatch Logs console.
  2. Under Actions, click on Create log group and name the log group APIGateway_CustomDomainLogs. Learn more about creating a log group in Working with Log Groups and Log Streams.

Create log group

Create an IAM role for access logging

You must use an IAM role to deliver logs from API Gateway to CloudWatch Logs. If there is no IAM role already available for logging in API Gateway, create a new IAM role:

  1. Navigate to the IAM console.
  2. Under Roles, choose Create role.
  3. Select API Gateway for the service and choose Next: Permissions.

    IAM selection

  4. Leave the attached IAM policy (AmazonAPIGatewayPushToCloudWatchLogs), and choose Next: Tags.
  5. No tags are required for this tutorial. Leave these blank and choose Next: Review.
  6. Name the role APIGatewayCloudWatchLogsRole and choose Create role.
    Create role

3. Enable custom access logging

Now you enable custom access logging. Select one of the API stages that you invoke through a custom domain name:

  1. If there is no CloudWatch log role set for API Gateway, go to the API Gateway Settings page to add the CloudWatch log role ARN. The IAM role ARN follows this format: arn:aws:iam::123456789012:role/APIGatewayCloudWatchLogsRole.
  2. For your API with a custom domain name, go to Stages page and select the Logs/Tracing tab.
  3. Enter the following fields:
    – For CloudWatch Group, add the ARN (for example, arn:aws:logs:us-east-1:123456789012:log-group:APIGateway_CustomDomainLogs).
    – For Log Format, enter:

    {
        "RequestId": "$context.requestId",
        "DomainName": "$context.domainName",
        "APIId": "$context.apiId",
        "RequestPath": "$context.path",
        "RequestTime": "$context.requestTime",
        "SourceIp": "$context.identity.sourceIp",
        "ResourcePath": "$context.resourcePath",
        "Stage": "$context.stage"
    }

    Custom access logging

  4.  Choose Save changes.

Learn more about custom access logging setup in Set up API Logging Using the API Gateway Console. In the Log Format configuration, $context variables retrieve a domain name as well as other API request information. Learn more in $context Variables for Data Models, Authorizers, Mapping Templates, and CloudWatch Access Logging.

4. Test invoke an API using a custom domain name

Once you enable custom access logging, invoke the API using the custom domain name. The logs appear in the specified CloudWatch log group shortly after. A sample response in the CloudWatch log stream looks like the following:

{
    "RequestId": "1b1ebe20-817f-11e9-a796-f5e0ffdcdac7",
    "DomainName": "test.example.com”,
    "APIId": "12345abcde",
    "RequestPath": "/dev",
    "RequestTime": "28/May/2019:19:30:52 +0000",
    "SourceIp": "1.2.3.4",
    "ResourcePath": "/",
    "Stage": "dev"
}

5. Analyze logs in CloudWatch Logs Insights

After setting up the custom access logs, you can query against them to find more insights using the custom domain name.

  1. Go to CloudWatch Logs Insights console.
  2. In the log group text field, select the CloudWatch log group, APIGateway_CustomDomainLogs.
  3. Enter the following query.
    fields @timestamp, @message
    | filter DomainName like /(?i)(test.example.com)/

    This query returns a list of log entries for the custom domain called test.example.com. To run in your account, replace this value with your custom domain name.
    Custom filter

If your network security does not allow the use of web sockets, you cannot access the CloudWatch Logs Insights console. Instead, use the CloudWatch Logs Insights query capabilities through the API. Here are some example queries using the AWS CLI:

1. A sample command for aws logs start-query:

aws logs start-query --log-group-name APIGateway_CustomDomainLogs --start-time 1557085225000 --end-time 1559763625000 --query-string 'fields @timestamp, @message | filter DomainName like /(?i)(test.example.com)/'

The response looks like:

{
    "queryId": "a1234567-bfde-47c7-9d44-41ebed011c66"
}

Learn more about start-query command in aws logs start-query.

2. Run aws logs get-query-results to retrieve the result of the query. A sample command for aws logs get-query-results:

aws logs get-query-results --query-id a1234567-bfde-47c7-9d44-41ebed011c66

The response looks like:

{
    "results": [
        [
            {
                "field": "@timestamp",
                "value": "2019-05-28 19:30:52.494"
            },
            {
                "field": "@message",
                "value": "{\"RequestId\": \"12345678-7cb3-11e9-8896-c30af5588427\",\"DomainName\":\"test.exmaple.com\"}"
            },
            {
                "field": "@ptr",
                "value": "CmEKKAokOTYxNTQyNjM4MjQzOkFQSUdhdGV3YXlfQ3VzdG9tRG9tYWluEAISNRoYAgXM/KQtAAAAAA0L2foABc5YFyAAAAHSIAEoxviFhK4tMM+HhoSuLTgCQOoBSN4OUN4KEAAYAQ=="
            }
        ]
    ],
    "statistics": {
        "recordsMatched": 1.0,
        "recordsScanned": 1.0,
        "bytesScanned": 153.0
    },
    "status": "Complete"
}

You can use other queries to filter the results based on other attributes in the logs. Learn more about get-query-results command in aws logs get-query-results.

Conclusion

In this blog post, I show how to deliver custom access logs from API Gateway to CloudWatch Logs. I also show how to use CloudWatch Logs Insights to run a query against the logs for custom domain name metrics, which help provide insights into custom domain name usage.

To learn more about the query syntax, visit CloudWatch Logs Insights Query Syntax.

Customizing triggers for AWS CodePipeline with AWS Lambda and Amazon CloudWatch Events

Post Syndicated from Bryant Bost original https://aws.amazon.com/blogs/devops/adding-custom-logic-to-aws-codepipeline-with-aws-lambda-and-amazon-cloudwatch-events/

AWS CodePipeline is a fully managed continuous delivery service that helps automate the build, test, and deploy processes of your application. Application owners use CodePipeline to manage releases by configuring “pipelines,” workflow constructs that describe the steps, from source code to deployed application, through which an application progresses as it is released. If you are new to CodePipeline, check out Getting Started with CodePipeline to get familiar with the core concepts and terminology.

Overview

In a default setup, a pipeline is kicked-off whenever a change in the configured pipeline source is detected. CodePipeline currently supports sourcing from AWS CodeCommit, GitHub, Amazon ECR, and Amazon S3. When using CodeCommit, Amazon ECR, or Amazon S3 as the source for a pipeline, CodePipeline uses an Amazon CloudWatch Event to detect changes in the source and immediately kick off a pipeline. When using GitHub as the source for a pipeline, CodePipeline uses a webhook to detect changes in a remote branch and kick off the pipeline. Note that CodePipeline also supports beginning pipeline executions based on periodic checks, although this is not a recommended pattern.

CodePipeline supports adding a number of custom actions and manual approvals to ensure that pipeline functionality is flexible and code releases are deliberate; however, without further customization, pipelines will still be kicked-off for every change in the pipeline source. To customize the logic that controls pipeline executions in the event of a source change, you can introduce a custom CloudWatch Event, which can result in the following benefits:

  • Multiple pipelines with a single source: Trigger select pipelines when multiple pipelines are listening to a single source. This can be useful if your organization is using monorepos, or is using a single repository to host configuration files for multiple instances of identical stacks.
  • Avoid reacting to unimportant files: Avoid triggering a pipeline when changing files that do not affect the application functionality (e.g. documentation files, readme files, and .gitignore files).
  • Conditionally kickoff pipelines based on environmental conditions: Use custom code to evaluate whether a pipeline should be triggered. This allows for further customization beyond polling a source repository or relying on a push event. For example, you could create custom logic to automatically reschedule deployments on holidays to the next available workday.

This post explores and demonstrates how to customize the actions that invoke a pipeline by modifying the default CloudWatch Events configuration that is used for CodeCommit, ECR, or S3 sources. To illustrate this customization, we will walk through two examples: prevent updates to documentation files from triggering a pipeline, and manage execution of multiple pipelines monitoring a single source repository.

The key concepts behind customizing pipeline invocations extend to GitHub sources and webhooks as well; however, creating a custom webhook is outside the scope of this post.

Sample Architecture

This post is only interested in controlling the execution of the pipeline (as opposed to the deploy, test, or approval stages), so it uses simple source and pipeline configurations. The sample architecture considers a simple CodePipeline with only two stages: source and build.

Example CodePipeline Architecture with Custom CloudWatch Event Configuration

The sample CodeCommit repository consists only of buildspec.yml, readme.md, and script.py files.

Normally, after you create a pipeline, it automatically triggers a pipeline execution to release the latest version of your source code. From then on, every time you make a change to your source location, a new pipeline execution is triggered. In addition, you can manually re-run the last revision through a pipeline using the “Release Change” button in the console. This architecture uses a custom CloudWatch Event and AWS Lambda function to avoid commits that change only the readme.md file from initiating an execution of the pipeline.

Creating a custom CloudWatch Event

When we create a CodePipeline that monitors a CodeCommit (or other) source, a default CloudWatch Events rule is created to trigger our pipeline for every change to the CodeCommit repository. This CloudWatch Events rule monitors the CodeCommit repository for changes, and triggers the pipeline for events matching the referenceCreated or referenceUpdated CodeCommit Event (refer to CodeCommit Event Types for more information).

Default CloudWatch Events Rule to Trigger CodePipeline

To introduce custom logic and control the events that kickoff the pipeline, this example configures the default CloudWatch Events rule to detect changes in the source and trigger a Lambda function rather than invoke the pipeline directly. The example uses a CodeCommit source, but the same principle applies to Amazon S3 and Amazon ECR sources as well, as these both use CloudWatch Events rules to notify CodePipeline of changes.

Custom CloudWatch Events Rule to Trigger CodePipeline

When a change is introduced to the CodeCommit repository, the configured Lambda function receives an event from CloudWatch signaling that there has been a source change.

{
   "version":"0",
   "id":"2f9a75be-88f6-6827-729d-34495072e5a1",
   "detail-type":"CodeCommit Repository State Change",
   "source":"aws.codecommit",
   "account":"accountNumber",
   "time":"2019-11-12T04:56:47Z",
   "region":"us-east-1",
   "resources":[
      "arn:aws:codecommit:us-east-1:accountNumber:codepipeline-customization-sandbox-repo"
   ],
   "detail":{
      "callerUserArn":"arn:aws:sts::accountNumber:assumed-role/admin/roleName ",
      "commitId":"92e953e268345c77dd93cec860f7f91f3fd13b44",
      "event":"referenceUpdated",
      "oldCommitId":"5a058542a8dfa0dacf39f3c1e53b88b0f991695e",
      "referenceFullName":"refs/heads/master",
      "referenceName":"master",
      "referenceType":"branch",
      "repositoryId":"658045f1-c468-40c3-93de-5de2c000d84a",
      "repositoryName":"codepipeline-customization-sandbox-repo"
   }
}

The Lambda function is responsible for determining whether a source change necessitates kicking-off the pipeline, which in the example is necessary if the change contains modifications to files other than readme.md. To implement this, the Lambda function uses the commitId and oldCommitId fields provided in the body of the CloudWatch event message to determine which files have changed. If the function determines that a change has occurred to a “non-ignored” file, then the function programmatically executes the pipeline. Note that for S3 sources, it may be necessary to process an entire file zip archive, or to retrieve past versions of an artifact.

import boto3

files_to_ignore = [ "readme.md" ]

codecommit_client = boto3.client('codecommit')
codepipeline_client = boto3.client('codepipeline')

def lambda_handler(event, context):
    # Extract commits
    old_commit_id = event["detail"]["oldCommitId"]
    new_commit_id = event["detail"]["commitId"]
    # Get commit differences
    codecommit_response = codecommit_client.get_differences(
        repositoryName="codepipeline-customization-sandbox-repo",
        beforeCommitSpecifier=str(old_commit_id),
        afterCommitSpecifier=str(new_commit_id)
    )
    # Search commit differences for files to ignore
    for difference in codecommit_response["differences"]:
        file_name = difference["afterBlob"]["path"].lower()
        # If non-ignored file is present, kickoff pipeline
        if file_name not in files_to_ignore:
            codepipeline_response = codepipeline_client.start_pipeline_execution(
                name="codepipeline-customization-sandbox-pipeline"
                )
            # Break to avoid executing the pipeline twice
            break

Multiple pipelines sourcing from a single repository

Architectures that use a single-source repository monitored by multiple pipelines can add custom logic to control the types of events that trigger a specific pipeline to execute. Without customization, any change to the source repository would trigger all pipelines.

Consider the following example:

  • A CodeCommit repository contains a number of config files (for example, config_1.json and config_2.json).
  • Multiple pipelines (for example, codepipeline-customization-sandbox-pipeline-1 and codepipeline-customization-sandbox-pipeline-2) source from this CodeCommit repository.
  • Whenever a config file is updated, a custom CloudWatch Event triggers a Lambda function that is used to determine which config files changed, and therefore which pipelines should be executed.

Example CodePipeline Architecture for Monorepos with Custom CloudWatch Event Configuration

This example follows the same pattern of creating a custom CloudWatch Event and Lambda function shown in the preceding example. However, in this scenario, the Lambda function is responsible for determining which files changed and which pipelines should be kicked off as a result. To execute this logic, the Lambda function uses the config_file_mapping variable to map files to corresponding pipelines. Pipelines are only executed if their designated config file has changed.

Note that the config_file_mapping can be exported to Amazon S3 or Amazon DynamoDB for more complex use cases, as sketched after the function below.

import boto3

# Map config files to pipelines
config_file_mapping = {
        "config_1.json" : "codepipeline-customization-sandbox-pipeline-1",
        "config_2.json" : "codepipeline-customization-sandbox-pipeline-2"
        }
        
codecommit_client = boto3.client('codecommit')
codepipeline_client = boto3.client('codepipeline')

def lambda_handler(event, context):
    # Extract commits
    old_commit_id = event["detail"]["oldCommitId"]
    new_commit_id = event["detail"]["commitId"]
    # Get commit differences
    codecommit_response = codecommit_client.get_differences(
        repositoryName="codepipeline-customization-sandbox-repo",
        beforeCommitSpecifier=str(old_commit_id),
        afterCommitSpecifier=str(new_commit_id)
    )
    # Search commit differences for files that trigger executions
    for difference in codecommit_response["differences"]:
        file_name = difference["afterBlob"]["path"].lower()
        # If file corresponds to pipeline, execute pipeline
        if file_name in config_file_mapping:
            codepipeline_response = codepipeline_client.start_pipeline_execution(
                name=config_file_mapping[file_name]
                )
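
As a minimal sketch of that approach, the mapping could be loaded from DynamoDB at invocation time instead of being hard-coded; the table name and attribute names below are assumptions for illustration only.

import boto3

dynamodb_client = boto3.client('dynamodb')

def load_config_file_mapping(table_name="codepipeline-customization-config-mapping"):
    # Build the file-to-pipeline mapping from a hypothetical DynamoDB table in which
    # each item stores "file_name" and "pipeline_name" string attributes
    mapping = {}
    paginator = dynamodb_client.get_paginator("scan")
    for page in paginator.paginate(TableName=table_name):
        for item in page["Items"]:
            mapping[item["file_name"]["S"]] = item["pipeline_name"]["S"]
    return mapping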

Results

For the first example, updates affecting only the readme.md file are completely ignored by the pipeline, while updates affecting other files begin a normal pipeline execution. For the second example, the two pipelines monitor the same source repository; however, codepipeline-customization-sandbox-pipeline-1 is executed only when config_1.json is updated and codepipeline-customization-sandbox-pipeline-2 is executed only when config_2.json is updated.

These CloudWatch Event and Lambda function combinations serve as good general examples of adding custom logic to pipeline kickoffs, and they can be expanded to handle more complex processing logic.

Cleanup

To avoid additional infrastructure costs from the examples described in this post, be sure to delete all CodeCommit repositories, CodePipeline pipelines, Lambda functions, and CodeBuild projects. When you delete a pipeline, the CloudWatch Events rule that was created automatically for it is also deleted, even if the rule has been customized.

Conclusion

For scenarios that require additional custom logic to control the execution of one or more pipelines, configuring a CloudWatch Event to trigger a Lambda function lets you customize the conditions and types of events that kick off your pipelines.

ICYMI: Serverless Q4 2019

Post Syndicated from Rob Sutter original https://aws.amazon.com/blogs/compute/icymi-serverless-q4-2019/

Welcome to the eighth edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share the most recent product launches, feature enhancements, blog posts, webinars, Twitch live streams, and other interesting things that you might have missed!

In case you missed our last ICYMI, check out what happened last quarter here.

The three months comprising the fourth quarter of 2019

AWS re:Invent

AWS re:Invent 2019

re:Invent 2019 dominated the fourth quarter at AWS. The serverless team presented a number of talks, workshops, and builder sessions to help customers increase their skills and deliver value more rapidly to their own customers.

Serverless talks from re:Invent 2019

Chris Munns presenting 'Building microservices with AWS Lambda' at re:Invent 2019

We presented dozens of sessions showing how customers can improve their architecture and agility with serverless. Here are some of the most popular.

Videos

Decks

You can also find decks for many of the serverless presentations and other re:Invent presentations on our AWS Events Content.

AWS Lambda

For developers needing greater control over the performance of their serverless applications at any scale, AWS Lambda announced Provisioned Concurrency at re:Invent. This feature enables Lambda functions to execute with consistent start-up latency, making them ideal for building latency-sensitive applications.
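
Enabling it is a single configuration call. The following is a minimal sketch using the AWS SDK for Python (Boto3); the function name, alias, and concurrency value are placeholders.

import boto3

lambda_client = boto3.client('lambda')

# Placeholder function name, alias, and concurrency level
response = lambda_client.put_provisioned_concurrency_config(
    FunctionName='my-latency-sensitive-function',
    Qualifier='prod',  # must reference a published version or an alias
    ProvisionedConcurrentExecutions=50
)
# Status is IN_PROGRESS while execution environments are being prepared, then READY
print(response['Status'])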

As shown in the graph below, provisioned concurrency reduces tail latency, directly improving response times and providing a more responsive end-user experience.

Graph showing performance enhancements with AWS Lambda Provisioned Concurrency

Lambda rolled out enhanced VPC networking to 14 additional Regions around the world. This change brings dramatic improvements to startup performance for Lambda functions running in VPCs due to more efficient usage of elastic network interfaces.

Illustration of AWS Lambda VPC to VPC NAT

New VPC to VPC NAT for Lambda functions

Lambda now supports three additional runtimes: Node.js 12, Java 11, and Python 3.8. Each of these new runtimes has new version-specific features and benefits, which are covered in the linked release posts. Like the Node.js 10 runtime, these new runtimes are all based on an Amazon Linux 2 execution environment.

Lambda released a number of controls for both stream and async-based invocations:

  • You can now configure error handling for Lambda functions consuming events from Amazon Kinesis Data Streams or Amazon DynamoDB Streams. It’s now possible to limit the retry count, limit the age of records being retried, configure a failure destination, or split a batch to isolate a problem record. These capabilities help you deal with potential “poison pill” records that would previously cause streams to pause in processing.
  • For asynchronous Lambda invocations, you can now set the maximum event age and retry attempts on the event. If either configured condition is met, the event can be routed to a dead letter queue (DLQ), Lambda destination, or it can be discarded.

AWS Lambda Destinations is a new feature that allows developers to designate an asynchronous target for Lambda function invocation results. You can set separate destinations for success and failure. This unlocks new patterns for distributed event-based applications and can replace custom code previously used to manage routing results.
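
As a rough sketch of the stream, asynchronous, and destination controls described above, the following shows both APIs with placeholder values (the event source mapping UUID, function name, and ARNs are assumptions for illustration):

import boto3

lambda_client = boto3.client('lambda')

# Tighten retry behavior on an existing Kinesis or DynamoDB Streams event source mapping
lambda_client.update_event_source_mapping(
    UUID='11111111-2222-3333-4444-555555555555',
    MaximumRetryAttempts=2,
    MaximumRecordAgeInSeconds=3600,
    BisectBatchOnFunctionError=True,
    DestinationConfig={
        'OnFailure': {'Destination': 'arn:aws:sqs:us-east-1:123456789012:stream-failures'}
    }
)

# Cap event age and retries for asynchronous invocations, and route results to destinations
lambda_client.put_function_event_invoke_config(
    FunctionName='my-async-function',
    MaximumEventAgeInSeconds=7200,
    MaximumRetryAttempts=1,
    DestinationConfig={
        'OnSuccess': {'Destination': 'arn:aws:sns:us-east-1:123456789012:async-successes'},
        'OnFailure': {'Destination': 'arn:aws:sqs:us-east-1:123456789012:async-failures'}
    }
)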

Illustration depicting AWS Lambda Destinations with success and failure configurations

Lambda Destinations

Lambda also now supports setting a Parallelization Factor, which allows you to set multiple Lambda invocations per shard for Kinesis Data Streams and DynamoDB Streams. This enables faster processing without the need to increase your shard count, while still guaranteeing the order of records processed.
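
Setting the factor is one parameter on the event source mapping. A minimal sketch, with a placeholder UUID (valid values are 1 through 10):

import boto3

lambda_client = boto3.client('lambda')

# Placeholder event source mapping UUID; process up to 10 batches per shard concurrently
lambda_client.update_event_source_mapping(
    UUID='11111111-2222-3333-4444-555555555555',
    ParallelizationFactor=10
)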

Illustration of multiple AWS Lambda invocations per Kinesis Data Streams shard

Lambda Parallelization Factor diagram

Lambda introduced Amazon SQS FIFO queues as an event source. “First in, first out” (FIFO) queues guarantee the order of record processing, unlike standard queues. FIFO queues support message batching via a MessageGroupID attribute, which allows parallel Lambda consumers of a single FIFO queue and enables high-throughput record processing by Lambda.
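
A minimal sketch of wiring a FIFO queue to a function follows; the queue ARN and function name are placeholders, and for SQS FIFO sources the batch size can be 1 to 10.

import boto3

lambda_client = boto3.client('lambda')

lambda_client.create_event_source_mapping(
    EventSourceArn='arn:aws:sqs:us-east-1:123456789012:orders.fifo',
    FunctionName='process-orders',
    BatchSize=10
)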

Lambda now supports Environment Variables in the AWS China (Beijing) Region and the AWS China (Ningxia) Region.

You can now view percentile statistics for the duration metric of your Lambda functions. Percentile statistics show the relative standing of a value in a dataset, and are useful when applied to metrics that exhibit large variances. They can help you understand the distribution of a metric, discover outliers, and find hard-to-spot situations that affect customer experience for a subset of your users.
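
For example, the following sketch retrieves p95 and p99 duration statistics for a placeholder function over the last three hours:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

# Placeholder function name; percentiles are requested as extended statistics
response = cloudwatch.get_metric_statistics(
    Namespace='AWS/Lambda',
    MetricName='Duration',
    Dimensions=[{'Name': 'FunctionName', 'Value': 'my-function'}],
    StartTime=datetime.utcnow() - timedelta(hours=3),
    EndTime=datetime.utcnow(),
    Period=300,
    ExtendedStatistics=['p95', 'p99'],
    Unit='Milliseconds'
)
for datapoint in sorted(response['Datapoints'], key=lambda d: d['Timestamp']):
    print(datapoint['Timestamp'], datapoint['ExtendedStatistics'])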

Amazon API Gateway

Screen capture of creating an Amazon API Gateway HTTP API in the AWS Management Console

Amazon API Gateway announced the preview of HTTP APIs. In addition to significant performance improvements, most customers see an average cost savings of 70% when compared with API Gateway REST APIs. With HTTP APIs, you can create an API in four simple steps. Once the API is created, additional configuration for CORS and JWT authorizers can be added.
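
As a minimal sketch of a quickstart-style HTTP API backed by a Lambda function (the API name, Lambda ARN, and CORS origins are placeholders):

import boto3

apigatewayv2 = boto3.client('apigatewayv2')

response = apigatewayv2.create_api(
    Name='my-http-api',
    ProtocolType='HTTP',
    # Target creates a default route integrated with the given Lambda function
    Target='arn:aws:lambda:us-east-1:123456789012:function:my-backend',
    CorsConfiguration={
        'AllowOrigins': ['https://example.com'],
        'AllowMethods': ['GET', 'POST']
    }
)
print(response['ApiEndpoint'])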

AWS SAM CLI

Screen capture of the new 'sam deploy' process in a terminal window

The AWS SAM CLI team simplified the bucket management and deployment process in the SAM CLI. You no longer need to manage a bucket for deployment artifacts – SAM CLI handles this for you. The deployment process has also been streamlined from multiple flagged commands to a single command, sam deploy.

AWS Step Functions

One powerful feature of AWS Step Functions is its ability to integrate directly with AWS services without you needing to write complicated application code. In Q4, Step Functions expanded its integration with Amazon SageMaker to simplify machine learning workflows. Step Functions also added a new integration with Amazon EMR, making EMR big data processing workflows faster to build and easier to monitor.

Screen capture of an AWS Step Functions step with Amazon EMR

Step Functions step with EMR

Step Functions now provides the ability to track state transition usage by integrating with AWS Budgets, allowing you to monitor trends and react to usage on your AWS account.

You can now view CloudWatch Metrics for Step Functions at a one-minute frequency. This makes it easier to set up detailed monitoring for your workflows. You can use one-minute metrics to set up CloudWatch Alarms based on your Step Functions API usage, Lambda functions, service integrations, and execution details.
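
For example, here is a sketch of a one-minute alarm on failed executions for a placeholder state machine; the ARN and threshold are assumptions for illustration.

import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='step-functions-failed-executions',
    Namespace='AWS/States',
    MetricName='ExecutionsFailed',
    Dimensions=[{
        'Name': 'StateMachineArn',
        'Value': 'arn:aws:states:us-east-1:123456789012:stateMachine:my-workflow'
    }],
    Statistic='Sum',
    Period=60,  # one-minute granularity
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    TreatMissingData='notBreaching'
)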

Step Functions now supports higher throughput workflows, making it easier to coordinate applications with high event rates. This increases the limits to 1,500 state transitions per second and a default start rate of 300 state machine executions per second in US East (N. Virginia), US West (Oregon), and Europe (Ireland). Click the above link to learn more about the limit increases in other Regions.

Screen capture of choosing Express Workflows in the AWS Management Console

Step Functions released AWS Step Functions Express Workflows. With the ability to support event rates greater than 100,000 per second, this feature is designed for high-performance workloads at a reduced cost.

Amazon EventBridge

Illustration of the Amazon EventBridge schema registry and discovery service

Amazon EventBridge announced the preview of the Amazon EventBridge schema registry and discovery service. This service allows developers to automate the discovery and cataloging of event schemas for use in their applications. Additionally, once a schema is stored in the registry, you can generate and download a code binding that represents the schema as an object in your code.

Amazon SNS

Amazon SNS now supports the use of dead letter queues (DLQs) to help capture unhandled events. By enabling a DLQ, you can capture events that are not processed and re-submit them, or analyze them to locate processing issues.
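
A DLQ is attached per subscription by setting its redrive policy. A minimal sketch, assuming placeholder subscription and queue ARNs:

import boto3
import json

sns = boto3.client('sns')

# The dead letter target must be an Amazon SQS queue
sns.set_subscription_attributes(
    SubscriptionArn='arn:aws:sns:us-east-1:123456789012:my-topic:11111111-2222-3333-4444-555555555555',
    AttributeName='RedrivePolicy',
    AttributeValue=json.dumps({
        'deadLetterTargetArn': 'arn:aws:sqs:us-east-1:123456789012:my-topic-dlq'
    })
)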

Amazon CloudWatch

Amazon CloudWatch announced Amazon CloudWatch ServiceLens to provide a “single pane of glass” to observe health, performance, and availability of your application.

Screenshot of Amazon CloudWatch ServiceLens in the AWS Management Console

CloudWatch ServiceLens

CloudWatch also announced a preview of a capability called Synthetics. CloudWatch Synthetics allows you to test your application endpoints and URLs using configurable scripts that mimic what a real customer would do. This enables the outside-in view of your customers’ experiences, and your service’s availability from their point of view.

CloudWatch introduced Embedded Metric Format, which helps you ingest complex high-cardinality application data as logs and easily generate actionable metrics. You can publish these metrics from your Lambda function by using the PutLogEvents API or using an open source library for Node.js or Python applications.
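
As a minimal sketch of emitting Embedded Metric Format from a Python Lambda function (the namespace, dimension, and metric names are placeholders), printing a structured log line is enough for CloudWatch to extract the declared metric:

import json
import time

def lambda_handler(event, context):
    # The "_aws" block declares which keys CloudWatch should extract as metrics;
    # the remaining keys stay in the log event as high-cardinality context
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "MyApplication",
                "Dimensions": [["ServiceName"]],
                "Metrics": [{"Name": "ProcessingLatency", "Unit": "Milliseconds"}]
            }]
        },
        "ServiceName": "order-service",
        "ProcessingLatency": 87,
        "requestId": context.aws_request_id
    }))
    return {"statusCode": 200}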

Finally, CloudWatch announced a preview of Contributor Insights, a capability to identify who or what is impacting your system or application performance by identifying outliers or patterns in log data.

AWS X-Ray

AWS X-Ray announced trace maps, which enable you to map the end-to-end path of a single request. Identifiers show issues and how they affect other services in the request’s path. These can help you to identify and isolate service points that are causing degradation or failures.

X-Ray also announced support for Amazon CloudWatch Synthetics, currently in preview. With this support, X-Ray traces canary scripts throughout the application, providing metrics on performance and application issues.

Screen capture of AWS X-Ray Service map in the AWS Management Console

X-Ray Service map with CloudWatch Synthetics

Amazon DynamoDB

Amazon DynamoDB announced support for customer-managed customer master keys (CMKs) to encrypt data in DynamoDB. This allows you to bring your own key (BYOK), giving you full control over how you encrypt and manage the security of your DynamoDB data.
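
As a rough sketch (table name, key schema, and KMS key ARN are placeholders), a customer-managed CMK is specified through the table's SSE settings:

import boto3

dynamodb = boto3.client('dynamodb')

dynamodb.create_table(
    TableName='orders',
    AttributeDefinitions=[{'AttributeName': 'order_id', 'AttributeType': 'S'}],
    KeySchema=[{'AttributeName': 'order_id', 'KeyType': 'HASH'}],
    BillingMode='PAY_PER_REQUEST',
    SSESpecification={
        'Enabled': True,
        'SSEType': 'KMS',
        'KMSMasterKeyId': 'arn:aws:kms:us-east-1:123456789012:key/11111111-2222-3333-4444-555555555555'
    }
)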

It is now possible to add global replicas to existing DynamoDB tables to provide enhanced availability across the globe.

Another new DynamoDB capability to identify frequently accessed keys and database traffic trends is currently in preview. With this, you can now more easily identify “hot keys” and understand usage of your DynamoDB tables.

Screen capture of Amazon CloudWatch Contributor Insights for DynamoDB in the AWS Management Console

CloudWatch Contributor Insights for DynamoDB

DynamoDB also released adaptive capacity. Adaptive capacity helps you handle imbalanced workloads by automatically isolating frequently accessed items and shifting data across partitions to rebalance them. This helps reduce cost by enabling you to provision throughput for a more balanced workload instead of over provisioning for uneven data access patterns.

Amazon RDS

Amazon Relational Database Service (RDS) announced a preview of Amazon RDS Proxy to help developers manage RDS database connections for serverless applications.

Illustration of Amazon RDS Proxy

The RDS Proxy maintains a pool of established connections to your RDS database instances. This pool enables you to support a large number of application connections so your application can scale without compromising performance. It also increases security by enabling IAM authentication for database access and enabling you to centrally manage database credentials using AWS Secrets Manager.

AWS Serverless Application Repository

The AWS Serverless Application Repository (SAR) now offers Verified Author badges. These badges enable consumers to quickly and reliably know who you are. The badge appears next to your name in the SAR and links to your GitHub profile.

Screen capture of SAR Verified developer badge in the AWS Management Console

SAR Verified developer badges

AWS Developer Tools

AWS CodeCommit launched the ability for you to enforce rule workflows for pull requests, making it easier to ensure that code has passed specific rule requirements. You can now create an approval rule specifically for a pull request, or create approval rule templates to be applied to all future pull requests in a repository.
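
A rough sketch of an approval rule template requiring two approvals, applied to a repository; the template name, rule content, and repository name below are illustrative placeholders.

import boto3
import json

codecommit = boto3.client('codecommit')

codecommit.create_approval_rule_template(
    approvalRuleTemplateName='require-two-approvals',
    approvalRuleTemplateContent=json.dumps({
        'Version': '2018-11-08',
        'Statements': [{
            'Type': 'Approvers',
            'NumberOfApprovalsNeeded': 2
        }]
    }),
    approvalRuleTemplateDescription='Require two approvals on every pull request'
)
codecommit.associate_approval_rule_template_with_repository(
    approvalRuleTemplateName='require-two-approvals',
    repositoryName='my-repository'
)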

AWS CodeBuild added beta support for test reporting. With test reporting, you can now view the detailed results, trends, and history for tests executed on CodeBuild for any framework that supports the JUnit XML or Cucumber JSON test format.

Screen capture of AWS CodeBuild

CodeBuild test trends in the AWS Management Console

Amazon CodeGuru

AWS announced a preview of Amazon CodeGuru at re:Invent 2019. CodeGuru is a machine learning based service that makes code reviews more effective and aids developers in writing code that is more secure, performant, and consistent.

AWS Amplify and AWS AppSync

AWS Amplify added iOS and Android as supported platforms. Now developers can build iOS and Android applications using the Amplify Framework with the same category-based programming model that they use for JavaScript apps.

Screen capture of 'amplify init' for an iOS application in a terminal window

The Amplify team has also improved offline data access and synchronization by announcing Amplify DataStore. Developers can now create applications that allow users to continue to access and modify data without an internet connection. Upon reconnection, the data synchronizes transparently with the cloud.

For a summary of Amplify and AppSync announcements before re:Invent, read: “A round up of the recent pre-re:Invent 2019 AWS Amplify Launches”.

Illustration of AWS AppSync integrations with other AWS services

Q4 serverless content

Blog posts

October

November

December

Tech talks

We hold several AWS Online Tech Talks covering serverless topics throughout the year. These are listed in the Serverless section of the AWS Online Tech Talks page.

Here are the ones from Q4:

Twitch

October

There are also a number of other helpful video series covering Serverless available on the AWS Twitch Channel.

AWS Serverless Heroes

We are excited to welcome some new AWS Serverless Heroes to help grow the serverless community. We look forward to some amazing content to help you with your serverless journey.

AWS Serverless Application Repository (SAR) Apps

In this edition of ICYMI, we are introducing a section devoted to SAR apps written by the AWS Serverless Developer Advocacy team. You can run these applications and review their source code to learn more about serverless and to see examples of suggested practices.

Still looking for more?

The Serverless landing page has much more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials. We’re also kicking off a fresh series of Tech Talks in 2020 with new content providing greater detail on everything new coming out of AWS for serverless application developers.

Throughout 2020, the AWS Serverless Developer Advocates are crossing the globe to tell you more about serverless, and to hear more about what you need. Follow this blog to keep up on new launches and announcements, best practices, and examples of serverless applications in action.

You can also follow all of us on Twitter to see the latest news, follow conversations, and interact with the team.

Chris Munns: @chrismunns
Eric Johnson: @edjgeek
James Beswick: @jbesw
Moheeb Zara: @virgilvox
Ben Smith: @benjamin_l_s
Rob Sutter: @rts_rob
Julian Wood: @julian_wood

Happy coding!

Top 10 Architecture Blog Posts of 2019

Post Syndicated from Annik Stahl original https://aws.amazon.com/blogs/architecture/top-10-architecture-blog-posts-of-2019/

As we wind our way toward 2020, I want to take a moment to first thank you, our readers, for spending time on our blog. We grew our audience quite a bit this year and the credit goes to our hard-working Solutions Architects and other blog post writers. Below are the top 10 Architecture blog posts written in 2019.

#10: How to Architect APIs for Scale and Security

by George Mao

George Mao, a Specialist Solutions Architect at AWS, focuses on serverless computing and has FIVE posts in the top ten this year. Way to go, George!

This post was the first in a series that focused on best practices and concepts you should be familiar with when you architect APIs for your applications.

Read George’s post.

#9: From One to Many: Evolving VPC Guidance

by Androski Spicer

Since its inception, the Amazon Virtual Private Cloud (VPC) has acted as the embodiment of security and privacy for customers who are looking to run their applications in a controlled, private, secure, and isolated environment.

This logically isolated space has evolved, and in its evolution has increased the avenues that customers can take to create and manage multi-tenant environments with multiple integration points for access to resources on-premises.

Read Androski’s post.

#8: Things to Consider When You Build REST APIs with Amazon API Gateway

by George Mao

REST API 2

This post dives deeper into the things an API architect or developer should consider when building REST APIs with Amazon API Gateway.

Read George’s post.

#7: How to Design Your Serverless Apps for Massive Scale

by George Mao

Serverless at scale-1

Serverless is one of the hottest design patterns in the cloud today, allowing you to focus on building and innovating, rather than worrying about the heavy lifting of server and OS operations. In this series of posts, we’ll discuss topics that you should consider when designing your serverless architectures. First, we’ll look at architectural patterns designed to achieve massive scale with serverless.

Read George’s post.

#6: Best Practices for Developing on AWS Lambda

by George Mao

RDS instance: When to VPC enable a Lambda function

One of the benefits of using Lambda is that you don’t have to worry about server and infrastructure management. This means AWS will handle the heavy lifting needed to execute your AWS Lambda functions. Take advantage of this architecture with the tips in this post.

Read George’s post.

#5: Stream Amazon CloudWatch Logs to a Centralized Account for Audit and Analysis

by David Bailey

Figure 1 - Initial Landing Zone logging account resources

A key component of enterprise multi-account environments is logging. Centralized logging provides a single point of access to all salient logs generated across accounts and regions, and is critical for auditing, security and compliance. While some customers use the built-in ability to push Amazon CloudWatch Logs directly into Amazon Elasticsearch Service for analysis, others would prefer to move all logs into a centralized Amazon Simple Storage Service (Amazon S3) bucket location for access by several custom and third-party tools. In this blog post, David Bailey will show you how to forward existing and any new CloudWatch Logs log groups created in the future to a cross-account centralized logging Amazon S3 bucket.

Read David’s post.

#4: Updates to Serverless Architectural Patterns and Best Practices

by Drew Dennis

Drew wrote this post at about the halfway point between re:Invent 2018 and re:Invent 2019, where he revisited some of the recent serverless announcements we’ve made. These are all complementary to the patterns discussed in the re:Invent architecture track’s Serverless Architectural Patterns and Best Practices session.

Read Drew’s post.

#3: Understanding the Different Ways to Invoke Lambda Functions

by George Mao

Invoking Lambda

In George’s first post of this series (#7 on this list), he talked about general design patterns to enable massive scale with serverless applications. In this post, he’ll review the different ways you can invoke Lambda functions and what you should be aware of with each invocation model.

Read George’s post.

#2: Using API Gateway as a Single Entry Point for Web Applications and API Microservices

by Anandprasanna Gaitonde and Mohit Malik

In this post, Anand and Mohit talk about a reference architecture that allows API Gateway to act as a single entry point for external-facing, API-based microservices and web applications across multiple external customers by leveraging a different subdomain for each one.

Read Anand’s and Mohit’s post.

#1: 10 Things Serverless Architects Should Know

by Justin Pirtle

Building on the first three parts of the AWS Lambda scaling and best practices series where you learned how to design serverless apps for massive scale, AWS Lambda’s different invocation models, and best practices for developing with AWS Lambda, Justin invited you to take your serverless knowledge to the next level by reviewing 10 topics to deepen your serverless skills.

Read Justin’s post.

Thank You

Thanks again to all our readers and blog post writers. We look forward to learning and building amazing things together in the coming year.

Best of 2019