Tag Archives: Amazon CloudWatch

New – Use CloudWatch Synthetics to Monitor Sites, API Endpoints, Web Workflows, and More

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-use-cloudwatch-synthetics-to-monitor-sites-api-endpoints-web-workflows-and-more/

Today’s applications encompass hundreds or thousands of moving parts including containers, microservices, legacy internal services, and third-party services. In addition to monitoring the health and performance of each part, you need to make sure that the parts come together to deliver an acceptable customer experience.

CloudWatch Synthetics (announced at AWS re:Invent 2019) allows you to monitor your sites, API endpoints, web workflows, and more. You get an outside-in view with better visibility into performance and availability so that you can become aware of and then fix any issues quicker than ever before. You can increase customer satisfaction and be more confident that your application is meeting your performance goals.

You can start using CloudWatch Synthetics in minutes. You simply create canaries that monitor individual web pages, multi-page web workflows such as wizards and checkouts, and API endpoints, with metrics stored in Amazon CloudWatch and other data (screenshots and HTML pages) stored in an S3 bucket. As you create your canaries, you can set CloudWatch alarms so that you are notified when thresholds based on performance, behavior, or site integrity are crossed. You can view screenshots, HAR (HTTP archive) files, and logs to learn more about the failure, with the goal of fixing it as quickly as possible.

CloudWatch Synthetics in Action
Canaries were once used to provide an early warning that deadly gases were present in a coal mine. The canaries provided by CloudWatch Synthetics provide a similar early warning, and are considerably more humane. I open the CloudWatch Console and click Canaries to get started:

I can see the overall status of my canaries at a glance:

I created several canaries last month in preparation for this blog post. I chose a couple of sites including the CNN home page, my personal blog, the Amazon Movers and Shakers page, and the Amazon Best Sellers page. I did not know which sites would return the most interesting results, and I certainly don’t mean to pick on any one of them. I do think that it is important to show you how this (and every) feature performs with real data, so here we go!

I can turn my attention to the Canary runs section, and look at individual data points. Each data point is an aggregation of runs for a single canary:

I can click on the amzn_movers_shakers canary to learn more:

I can see that there was a single TimeoutError issue in the last 24 hours. I can see the screenshots that were captured as part of each run, along with the HAR files and logs. Each HAR file contains a detailed log of the HTTP requests that were made when the canary was run, along with the responses and the amount of time that it took for the request to complete:

Each canary run is implemented using a Lambda function. I can access the function’s execution metrics in the Metrics tab:

And I can see the canary script and other details in the Configuration tab:

Hatching a Canary
Now that you have seen a canary in action, let me show you how to create one. I return to the list of canaries and click Create canary. I can use one of four blueprints to create my canary, or I can upload or import an existing one:

All of these methods ultimately result in a script that is run either once or periodically. The canaries that I showed above were all built from the Heartbeat monitoring blueprint, like this:
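
For reference, a canary built from that blueprint ends up with a script along these lines. This is a minimal sketch using the syn-1.0 runtime's Synthetics library; the URL is a placeholder and the generated code may differ in its details:

const synthetics = require('Synthetics');
const log = require('SyntheticsLogger');

const pageLoadBlueprint = async function () {
    // Placeholder URL to monitor
    const url = 'https://example.com';

    // Get the Puppeteer page object provided by the Synthetics runtime
    const page = await synthetics.getPage();
    const response = await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30000 });

    // Save a screenshot as a run artifact in the S3 bucket
    await synthetics.takeScreenshot('loaded', 'result');

    // Throwing marks the canary run as failed
    if (!response || response.status() !== 200) {
        throw new Error('Failed to load ' + url);
    }
    log.info('Loaded ' + url + ' with status ' + response.status());
};

exports.handler = async () => {
    return await pageLoadBlueprint();
};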

I can also create canaries for API endpoints, using either GET or PUT methods, any desired HTTP headers, and some request data:

Another blueprint lets me create a canary that checks a web page for broken links (I’ll use this post):

Finally, the GUI workflow builder lets me create a sophisticated canary that can include simulated clicks, content verification via CSS selector or text, text entry, and navigation to other URLs:

As you can see from these examples, the canary scripts are using the syn-1.0 runtime. This runtime supports Node.js scripts that can use the Puppeteer and Chromium packages. Scripts can make use of a set of library functions and can (with the right IAM permissions) access other AWS services and resources. Here’s an example script that calls AWS Secrets Manager:

const synthetics = require('Synthetics');
const log = require('SyntheticsLogger');

const AWS = require('aws-sdk');
const secretsManager = new AWS.SecretsManager();

// Retrieve a secret value from AWS Secrets Manager
const getSecrets = async (secretName) => {
    const params = {
        SecretId: secretName
    };
    return await secretsManager.getSecretValue(params).promise();
};

const secretsExample = async function () {
    // Fetch secrets
    var secrets = await getSecrets("secretname")
    
    // Use secrets
    log.info("SECRETS: " + JSON.stringify(secrets));
};

exports.handler = async () => {
    return await secretsExample();
};

Scripts signal success by running to completion, and errors by raising an exception.

After I create my script, I establish a schedule and a pair of data retention periods. I also choose an S3 bucket that will store the artifacts created each time a canary is run:

I can also control the IAM role, set CloudWatch Alarms, and configure access to endpoints that are in a VPC:

Watch the demo video to see CloudWatch Synthetics in action:

Things to Know
Here are a few things to know about CloudWatch Synthetics:

Observability – You can use CloudWatch Synthetics in conjunction with ServiceLens and AWS X-Ray to map issues back to the proper part of your application. To learn more about how to do this, read Debugging with Amazon CloudWatch Synthetics and AWS X-Ray and Using ServiceLens to Monitor the Health of Your Applications.

Automation – You can create canaries using the Console, CLI, APIs, and from CloudFormation templates.

Pricing – As part of the AWS Free Tier you get 100 canary runs per month at no charge. After that, you pay per run, with prices starting at $0.0012 per run, plus the usual charges for S3 storage and Lambda invocations.

Limits – You can create up to 100 canaries per account in the US East (N. Virginia), Europe (Ireland), US West (Oregon), US East (Ohio), and Asia Pacific (Tokyo) Regions, and up to 20 per account in other regions where CloudWatch Synthetics are available.

Available Now
CloudWatch Synthetics are available now and you can start using them today!

Jeff;

Building well-architected serverless applications: Understanding application health – part 2

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-understanding-application-health-part-2/

This series of blog posts uses the AWS Well-Architected Tool with the Serverless Lens to help customers build and operate applications using best practices. In each post, I address the nine serverless-specific questions identified by the Serverless Lens along with the recommended best practices. See the Introduction post for a table of contents and an explanation of the example application.

Question OPS1: How do you evaluate your serverless application’s health?

This post continues part 1 of this Operational Excellence question. Previously, I covered using Amazon CloudWatch out-of-the-box standard metrics and alerts, and structured and centralized logging with the Embedded Metrics Format.

Best practice: Use application, business, and operations metrics

Identifying key performance indicators (KPIs), including business, customer, and operations outcomes, in addition to application metrics, helps to show a higher-level view of how the application and business are performing.

Business KPIs measure your application performance against business goals. For example, if fewer flight reservations are flowing through the system, the business would like to know.

Customer experience KPIs highlight the overall effectiveness of how customers use the application over time. Examples are perceived latency, time to find a booking, make a payment, etc.

Operational metrics help to see how operationally stable the application is over time. Examples are continuous integration/delivery/deployment feedback time, mean-time-between-failure/recovery, number of on-call pages and time to resolution, etc.

Custom Metrics

The Embedded Metrics Format can also emit custom metrics to help you understand how your workload’s health affects the business.

The airline booking service uses AWS Step Functions, AWS Lambda, Amazon SNS, and Amazon DynamoDB.

In the confirm booking module function handler, I add a new namespace and dimension to associate this set of logs with this application and service.

metrics.set_namespace("ServerlessAirlineEMF")
metrics.put_dimensions({"service":"confirm_booking"})

Within the try statement within the try/except block, I emit a metric for a successful booking:

metrics.put_metric("BookingSuccessful", 1, "Count")

And within the except statement within the try/except block, I emit a metric for a failed booking:

metrics.put_metric("BookingFailed", 1, "Count")

Once I make a booking, within the CloudWatch console, I navigate to Logs | Log groups and select the Airline-ConfirmBooking-develop group. I select a log stream and find the custom metric as part of the structured log entry.

Custom metric structured log entry

I can also create a custom metric graph. Within the CloudWatch console, I navigate to Metrics. I see the ServerlessAirlineEMF Custom Namespace is available.

Custom metric namespace

I select the Namespace and the available metric.

Available metric

I select a Line graph, add a name, and see the successful booking now plotted on the graph.

Plotted CloudWatch metric

I can also visualize and analyze this metric using CloudWatch Insights.

Once a booking is made, within the CloudWatch console, I navigate to Logs | Insights. I select the /aws/lambda/Airline-ConfirmBooking-develop log group. I choose Run query which shows a list of discovered fields on the right of the console.

I can search for discovered booking related fields.

I then enter the following in the query pane to search the logs and plot the sum of BookingReference and choose Run Query:

fields @timestamp, @message
| stats sum(@BookingReference)

CloudWatch Insights query

There are a number of other component metrics that are important to measure. Track interactions between upstream and downstream components such as message queue length, integration latency, and throttling.

Improvement plan summary:

  1. Identify user journeys and metrics from each customer transaction.
  2. Create custom metrics asynchronously instead of synchronously for improved performance, cost, and reliability outcomes.
  3. Emit business metrics from within your workload to measure application performance against business goals.
  4. Create and analyze component metrics to measure interactions with upstream and downstream components.
  5. Create and analyze operational metrics to assess the health of your continuous delivery pipeline and operational processes.

Good practice: Use distributed tracing and code is instrumented with additional context

Logging provides information on the individual point in time events the application generates. Tracing provides a wider continuous view of an application. Tracing helps to follow a user journey or transaction through the application.

AWS X-Ray is an example of a distributed tracing solution; there are a number of third-party options as well. X-Ray collects data about the requests that your application serves and also builds a service graph, which shows all the components of an application. This provides visibility to understand how upstream/downstream services may affect workload health.

For the most comprehensive view, enable X-Ray across as many services as possible and include X-Ray tracing instrumentation in code. This is the list of AWS Services integrated with X-Ray.

X-Ray receives data from services as segments. Segments for a common request are then grouped into traces. Segments can be further broken down into more granular subsegments. Custom data key-value pairs are added to segments and subsegments with annotations and metadata. Traces can also be grouped which helps with filter expressions.

AWS Lambda instruments incoming requests for all supported languages. Lambda application code can be further instrumented to emit information about its status, correlation identifiers, and business outcomes to determine transaction flows across your workload.

X-Ray tracing for a Lambda function is enabled in the Lambda console. I select the Airline-ReserveBooking-develop function. In the Configuration pane, I select the option to enable X-Ray active tracing.

X-Ray tracing enabled

X-Ray can also be enabled via CloudFormation with the following code:

TracingConfig:
  Mode: Active

Lambda IAM permissions to write to X-Ray are added automatically when active tracing is enabled via the console. When using CloudFormation, allow the actions xray:PutTraceSegments and xray:PutTelemetryRecords in the function’s execution role.
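
As a rough sketch, the corresponding statement on the function’s execution role could look like this in a CloudFormation template; the policy name is illustrative and the resource is left as a wildcard for brevity:

Policies:
  - PolicyName: xray-write-access
    PolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Action:
            - xray:PutTraceSegments
            - xray:PutTelemetryRecords
          Resource: "*"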

It is important to understand which invocations X-Ray traces. X-Ray applies a sampling algorithm. If an upstream service, such as API Gateway with X-Ray tracing enabled, has already sampled a request, the Lambda function request is also sampled. Without an upstream request, X-Ray traces data for the first Lambda invocation each second, and then 5% of additional invocations.

For the airline application, X-Ray tracing is initiated within the shared library with the code:

from aws_xray_sdk.core import models, patch_all, xray_recorder

Segments, subsegments, annotations, and metadata are added to functions with the following example code:

segment = xray_recorder.begin_segment('segment_name')
# Start a subsegment
subsegment = xray_recorder.begin_subsegment('subsegment_name')
# Add metadata and annotations
segment.put_metadata('key', dict, 'namespace')
subsegment.put_annotation('key', 'value')
# Close the subsegment and segment
xray_recorder.end_subsegment()
xray_recorder.end_segment()

For example, within the collect payment module, an annotation is added for a successful payment with:

tracer.put_annotation("PaymentStatus", "SUCCESS")

CloudWatch ServiceLens

Once a booking is made and payment is successful, the tracing is available in the X-Ray console.

I explore how Amazon CloudWatch ServiceLens connects metrics, logs, and X-Ray tracing. Within the CloudWatch console, I navigate to ServiceLens | Service Map.

I can visualize all application resources and dependencies where X-Ray is enabled. I can trace performance or availability issues. If there was an issue connecting to SNS for example, this would be shown.

I select the Airline-CollectPayment-develop node and can view the out-of-the-box standard Lambda metrics.

I can select View Logs to jump to the CloudWatch Logs Insights console.

CloudWatch Insights Service map

I select View dashboard to see the function metrics, node map, and function details.

CloudWatch Insights Service Map dashboard

I select View traces and can filter by the custom annotation PaymentStatus. I select SUCCESS and choose Add to filter. I then select a trace.

CloudWatch Insights Filtered traces

I see the full trace details, which show the full application transaction of a payment collection.

Segments timeline

Selecting the Lambda handler subsegment – ## lambda_handler, I can view the trace Annotations and Metadata, which include the business transaction details such as Customer and PaymentStatus.

X-Ray annotations

Trace groups are another feature of X-Ray and ServiceLens. Trace groups use filter expressions, such as annotation.PaymentStatus = "FAILED", to view traces that match the particular group. Service graphs can also be viewed, and CloudWatch alarms created based on the group.
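
For example, a trace group like that could be created from the CLI with something like the following sketch; the group name is illustrative:

aws xray create-group \
    --group-name PaymentFailures \
    --filter-expression 'annotation.PaymentStatus = "FAILED"'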

CloudWatch ServiceLens provides powerful capabilities to understand application performance bottlenecks and issues, helping determine how users are impacted.

Improvement plan summary:

  1. Identify common business context and system data that are commonly present across multiple transactions.
  2. Instrument SDKs and requests to upstream/downstream services to understand the flow of a transaction across system components.

Recent announcements

There have been a number of recent announcements for X-Ray and CloudWatch to improve how to evaluate serverless application health.

Conclusion

Evaluating application health helps you identify which services should be optimized to improve your customer’s experience. In part 1, I cover out-of-the-box standard metrics and alerts, as well as structured and centralized logging. In this post, I explore custom metrics and distributed tracing and show how to use ServiceLens to view logs, metrics, and traces together.

In an upcoming post, I will cover the next Operational Excellence question from the Well-Architected Serverless Lens – Approaching application lifecycle management.

Simplified Time-Series Analysis with Amazon CloudWatch Contributor Insights

Post Syndicated from Steve Roberts original https://aws.amazon.com/blogs/aws/simplified-time-series-analysis-with-amazon-cloudwatch-contributor-insights/

Inspecting multiple log groups and log streams to analyze and diagnose the impact of an issue in real time can be difficult and time consuming. What customers are affected? How badly? Are some affected more than others, or are there outliers? Perhaps you performed a deployment of an update using a staged rollout strategy and now want to know if any customers have hit issues or if everything is behaving as expected for the target customers before continuing further. All of the data points that help answer these questions are potentially buried in a mass of logs, which engineers query to get ad-hoc measurements, or build and maintain custom dashboards to help track.

Amazon CloudWatch Contributor Insights, generally available today, is a new feature that simplifies analysis of Top-N contributors to time-series data in CloudWatch Logs, helping you more quickly understand who or what is impacting system and application performance, in real time and at scale. This saves you time during an operational problem by helping you understand what is contributing to the issue and who or what is most affected. Amazon CloudWatch Contributor Insights can also help with ongoing analysis for system and business optimization by easily surfacing outliers, performance bottlenecks, top customers, or top utilized resources, all at a glance. In addition to logs, Amazon CloudWatch Contributor Insights can also be used with other products in the CloudWatch portfolio, including Metrics and Alarms.

Amazon CloudWatch Contributor Insights can analyze structured logs in either JSON or Common Log Format (CLF). Log data can be sourced from Amazon Elastic Compute Cloud (EC2) instances, AWS CloudTrail, Amazon Route 53, Apache Access and Error Logs, Amazon Virtual Private Cloud (VPC) Flow Logs, AWS Lambda Logs, and Amazon API Gateway Logs. You also have the choice of using structured logs published directly to CloudWatch, or using the CloudWatch Agent. Amazon CloudWatch Contributor Insights will evaluate these log events in real time and display reports that show the top contributors and number of unique contributors in a dataset. A contributor is an aggregate metric based on dimensions contained as log fields in CloudWatch Logs, for example account-id or interface-id in Amazon Virtual Private Cloud Flow Logs, or any other custom set of dimensions. You can sort and filter contributor data based on your own custom criteria. Report data from Amazon CloudWatch Contributor Insights can be displayed on CloudWatch dashboards, graphed alongside CloudWatch metrics, and added to CloudWatch alarms. For example, customers can graph values from two Amazon CloudWatch Contributor Insights reports into a single metric describing the percentage of customers impacted by faults, and configure alarms to alert when this percentage breaches pre-defined thresholds.

Getting Started with Amazon CloudWatch Contributor Insights
To use Amazon CloudWatch Contributor Insights I need to define one or more rules. A rule is simply a snippet of data that defines what contextual data to extract for metrics reported from CloudWatch Logs. To configure a rule to identify the top contributors for a specific metric I supply three items of data – the log group (or groups), the dimensions for which the top contributors are evaluated, and filters to narrow down those top contributors. To do this, I head to the Amazon CloudWatch console dashboard and select Contributor Insights from the left-hand navigation links. This takes me to the Amazon CloudWatch Contributor Insights home where I can click Create a rule to get started.

To get started quickly, I can select from a library of sample rules for various services that send logs to CloudWatch Logs. You can see above that there are currently a variety of sample rules for Amazon API Gateway, Amazon Route 53 Query Logs, Amazon Virtual Private Cloud Flow Logs, and logs for container services. Alternatively, I can define my own rules, as I’ll do in the rest of this post.

Let’s say I have a deployed application that is publishing structured log data in JSON format directly to CloudWatch Logs. This application has two API versions, one that has been used for some time and is considered stable, and a second that I have just started to roll out to my customers. I want to know as early as possible if anyone who has received the new version, targeting the new API, is receiving any faults and how many faults are being triggered. My stable API version is sending log data to one log group and my new version is using a different group, so I need to monitor multiple log groups (since I also want to know if anyone is experiencing any error, regardless of version).

The JSON to define my rule, to report on 500 errors coming from any of my APIs, and to use account ID, HTTP method, and resource path as dimensions, is shown below.

{
  "Schema": {
    "Name": "CloudWatchLogRule",
    "Version": 1
  },
  "AggregateOn": "Count",
  "Contribution": {
    "Filters": [
      {
        "Match": "$.status",
        "EqualTo": 500
      }
    ],
    "Keys": [
      "$.accountId",
      "$.httpMethod",
      "$.resourcePath"
    ]
  },
  "LogFormat": "JSON",
  "LogGroupNames": [
    "MyApplicationLogsV*"
  ]
}

I can set up my rule using either the Wizard tab, or I can paste the JSON above into the Rule body field on the Syntax tab. Even though I have the JSON above, I’ll show using the Wizard tab in this post and you can see the completed fields below. When selecting log groups I can either select them from the drop down, if they already exist, or I can use wildcard syntax in the Select by prefix match option (MyApplicationLogsV* for example).

Clicking Create saves the new rule and makes it immediately start processing and analyzing data (unless I elect to create it in a disabled state, of course). Note that Amazon CloudWatch Contributor Insights processes new log data created once the rule is active; it does not perform historical inspection, so I need to build rules for operational scenarios that I anticipate happening in the future.
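
If I prefer to manage rules outside the console, the same definition can be created programmatically. Here is a sketch using the AWS CLI, assuming the JSON shown earlier is saved locally as rule.json and the rule name is my own choice:

aws cloudwatch put-insight-rule \
    --rule-name MyApplication-500-Errors \
    --rule-state ENABLED \
    --rule-definition file://rule.json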

With the rule in place I need to start generating some log data! To do that I’m going to use a script, written using the AWS Tools for PowerShell, to simulate my deployed application being invoked by a set of customers. Of those customers, a select few (let’s call them the unfortunate ones) will be directed to the new API version which will randomly fail on HTTP POST requests. Customers using the old API version will always succeed. The script, which runs for 5000 iterations, is shown below. The cmdlets being used to work with CloudWatch Logs are the ones with CWL in the name, for example Write-CWLLogEvent.

# Set up some random customer ids, and select a third of them to be our unfortunates
# who will experience random errors due to a bad api update being shipped!
$allCustomerIds = @( 1..15 | % { Get-Random })
$faultingIds = $allCustomerIds | Get-Random -Count 5

# Setup some log groups
$group1 = 'MyApplicationLogsV1'
$group2 = 'MyApplicationLogsV2'
$stream = "MyApplicationLogStream"

# When writing to a log stream we need to specify a sequencing token
$group1Sequence = $null
$group2Sequence = $null

$group1, $group2 | % {
    if (!(Get-CWLLogGroup -LogGroupName $_)) {
        New-CWLLogGroup -LogGroupName $_
        New-CWLLogStream -LogGroupName $_ -LogStreamName $stream
    } else {
        # When the log group and stream exist, we need to seed the sequence token to
        # the next expected value
        $logstream = Get-CWLLogStream -LogGroupName $_ -LogStreamName $stream
        $token = $logstream.UploadSequenceToken
        if ($_ -eq $group1) {
            $group1Sequence = $token
        } else {
            $group2Sequence = $token
        }
    }
}

# generate some log data with random failures for the subset of customers
1..5000 | % {

    Write-Host "Log event iteration $_" # just so we know where we are progress-wise

    $customerId = Get-Random $allCustomerIds

    # first select whether the user called the v1 or the v2 api
    $useV2Api = ((Get-Random) % 2 -eq 1)
    if ($useV2Api) {
        $resourcePath = '/api/v2/some/resource/path/'
        $targetLogGroup = $group2
        $nextToken = $group2Sequence
    } else {
        $resourcePath = '/api/v1/some/resource/path/'
        $targetLogGroup = $group1
        $nextToken = $group1Sequence
    }

    # now select whether they failed or not. GET requests for all customers on
    # all api paths succeed. POST requests to the v2 api fail for a subset of
    # customers.
    $status = 200
    $errorMessage = ''
    if ((Get-Random) % 2 -eq 0) {
        $httpMethod = "GET"
    } else {
        $httpMethod = "POST"
        if ($useV2Api -And $faultingIds.Contains($customerId)) {
            $status = 500
            $errorMessage = 'Uh-oh, something went wrong...'
        }
    }

    # Write an event and gather the sequence token for the next event
    $nextToken = Write-CWLLogEvent -LogGroupName $targetLogGroup -LogStreamName $stream -SequenceToken $nextToken -LogEvent @{
        TimeStamp = [DateTime]::UtcNow
        Message = (ConvertTo-Json -Compress -InputObject @{
            requestId = [Guid]::NewGuid().ToString("D")
            httpMethod = $httpMethod
            resourcePath = $resourcePath
            status = $status
            protocol = "HTTP/1.1"
            accountId = $customerId
            errorMessage = $errorMessage
        })
    }

    if ($targetLogGroup -eq $group1) {
        $group1Sequence = $nextToken
    } else {
        $group2Sequence = $nextToken
    }

    Start-Sleep -Milliseconds 250
}

I start the script running, and with my rule enabled, I start to see failures show up in my graph. Below is a snapshot after several minutes of running the script. I can clearly see a subset of my simulated customers are having issues with HTTP POST requests to the new v2 API.

From the Actions pull down in the Rules panel, I could now go on to create a single metric from this report, describing the percentage of customers impacted by faults, and then configure an alarm on this metric to alert when this percentage breaches pre-defined thresholds.
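
As a sketch, that metric can be expressed with CloudWatch metric math using the INSIGHT_RULE_METRIC function; the rule names below are illustrative and assume a second rule that counts all requests per customer, and the exact expression syntax is described in the metric math documentation:

100 * INSIGHT_RULE_METRIC("MyApplication-500-Errors", "UniqueContributors") / INSIGHT_RULE_METRIC("MyApplication-AllRequests", "UniqueContributors")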

For the sample scenario outlined here I would use the alarm to halt the rollout of the new API if it fired, preventing the impact spreading to additional customers, while investigation of what is behind the increased faults is performed. Details on how to set up metrics and alarms can be found in the user guide.

Amazon CloudWatch Contributor Insights is available now to users in all commercial AWS Regions, including China and GovCloud.

— Steve

CloudWatch Contributor Insights for DynamoDB – Now Generally Available

Post Syndicated from Harunobu Kameda original https://aws.amazon.com/blogs/aws/cloudwatch-contributor-insights-for-dynamodb-now-generally-available/

Amazon DynamoDB provides our customers a fully-managed key-value database service that can easily scale from a few requests per month to millions of requests per second. DynamoDB supports some of the world’s largest scale applications by providing consistent, single-digit millisecond response times at any scale. You can build applications with virtually unlimited throughput and storage. DynamoDB global tables replicate your data across multiple AWS Regions to give you fast, local access to data for your globally distributed applications. For use cases that require even faster access with microsecond latency, DynamoDB Accelerator (DAX) provides a fully managed in-memory cache.

In November 2019, we announced Amazon CloudWatch Contributor Insights for Amazon DynamoDB in preview, and today, I am happy to announce that it is generally available in all AWS Regions.

Amazon CloudWatch Contributor Insights for Amazon DynamoDB

Amazon CloudWatch Contributor Insights, which also launched in November 2019, analyzes log data and creates time-series visualizations to provide a view of top contributors influencing system performance. You do this by creating Contributor Insights rules to evaluate CloudWatch Logs (including logs from AWS services) and any custom logs sent by your service or on-premises servers. For example, you can find bad hosts, identify the heaviest network users, or find the URLs that generate the most errors.

For developers building applications on top of DynamoDB, it’s useful to understand your database access patterns, such as traffic trends and frequently accessed keys, to help optimize DynamoDB costs and performance. You can create visualizations of these patterns with just a few clicks in the console using CloudWatch Contributor Insights for DynamoDB. DynamoDB automatically creates the required CloudWatch resources, then provides a summary view of the graphs. This summary view lives in the DynamoDB console, but you can also see the individual rule details in the CloudWatch console with other CloudWatch Contributor Insights rules, reports, and graphs of report data.

You can use these graphs to view traffic trends and pinpoint any hot keys in your DynamoDB tables.

How it works

Let’s see how it works and how it benefits developers. Here is a table in DynamoDB.

If we select any table, we can see its details. Click the new tab called [Contributor Insights].

CloudWatch Contributor Insights is initially DISABLED. You can enable it by selecting the upper [Contributor Insights] tab.

When you access the [Contributor Insights] tab, you can check its status. The activation process is very easy (this is one of my favorite parts of this feature!). If you click [Manage Contributor Insights], a dialog box pops up.

If you choose [Enabled] and click [Confirm], the dashboard comes up, and CloudWatch Contributor Insights for DynamoDB records every access to the table.

After a while, you will see table insights displayed in several graphs.

You can change the time range in the upper right corner of the dashboard, or simply click and drag a graph directly.

What does the dashboard tell us?

The dashboard shows us four metrics, which provide powerful insights for application performance tuning. DynamoDB creates separate visualizations for partition key vs. partition+sort key, so if your table doesn’t have a sort key, then you will only see two graphs, not all four.

  • Most Accessed Items (Partition Key only) – Identifies the partition keys of the most accessed items in your table or global secondary index.
  • Most Accessed Items (Partition Key + Sort Key) – Identifies the partition and sort keys of the most accessed items in your table or global secondary index.
  • Most Throttled Items (Partition Key only) – Identifies the partition keys of the most throttled items in your table or global secondary index.
  • Most Throttled Items (Partition Key + Sort Key) – Identifies the partition and sort keys of the most throttled items in your table or global secondary index.

Most Accessed Items

Let’s break down what these metrics and graphs mean, starting with the “Most Accessed Items” metrics. These metrics show the frequency with which a key is accessed, based on both read and write traffic.

Outliers in these graphs are your most frequently accessed, or hottest, keys. Many DynamoDB workloads have at least some imbalanced traffic, but you can use this graph to see whether your workload will bump against DynamoDB’s per-key limits. On the other hand, if you see several closely clustered lines without any obvious outliers, it indicates that your workload is relatively balanced across items over the given time window (great job balancing your workload!)

Most Throttled Items

The “Most Throttled Items” graphs show just that: the throttle count over time for your most throttled keys. If you see no data in these graphs, it means your requests have not been throttled. If you see isolated points instead of connected lines, that indicates an item was throttled only for a brief period.

The blog article “Choosing the Right DynamoDB Partition Key” covers considerations and strategies for choosing the right partition key when designing a schema that uses Amazon DynamoDB. Choosing the right partition key is an important step in the design and building of scalable and reliable applications on top of DynamoDB. You can also check our DynamoDB documentation page “Best Practices for Designing and Using Partition Keys Effectively”.

Integrating with CloudWatch Dashboard

This feature is integrated with CloudWatch for ease of use. You can add any of these graphs to an existing CloudWatch dashboard. Let’s see how to do it. Going back to the DynamoDB dashboard, click [Add to dashboard].
You are redirected to the CloudWatch Management Console and asked which dashboard to add the graphs to.

You can choose any existing dashboard or create a new one. For example, I put these metrics into my existing test dashboard as [test20180321].

Activating the feature does not affect anything in your existing production environment. You can enable it or disable it at any time.
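
The same setting can also be changed programmatically; here is a sketch using the AWS CLI, with a placeholder table name:

aws dynamodb update-contributor-insights \
    --table-name MyTable \
    --contributor-insights-action ENABLE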

Generally Available Today

This feature is generally available today for all AWS regions.

– Kame;

 

Building well-architected serverless applications: Understanding application health – part 1

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-understanding-application-health-part-1/

This series of blog posts uses the AWS Well-Architected Tool with the Serverless Lens to help customers build and operate applications using best practices. In each post, I address the nine serverless-specific questions identified by the Serverless Lens along with the recommended best practices. See the Introduction post for a table of contents and an explanation of the example application.

Question OPS1: How do you evaluate your serverless application’s health?

Evaluating your metrics, distributed tracing, and logging gives you insight into business and operational events, and helps you understand which services should be optimized to improve your customer’s experience. By understanding the health of your serverless application, you will know whether it is functioning as expected, and can proactively react to any signals that indicate it is becoming unhealthy.

Required practice: Understand, analyze, and alert on metrics provided out of the box

It is important to understand metrics for every AWS service used in your application so you can decide how to measure its behavior. AWS services provide a number of out-of-the-box standard metrics to help monitor the operational health of your application.

As these metrics are generated automatically, they are a simple way to start monitoring your application and can also be augmented with custom metrics.

The first stage is to identify which services the application uses. The airline booking component uses AWS Step Functions, AWS Lambda, Amazon SNS, and Amazon DynamoDB.

When I make a booking, as shown in the Introduction post, AWS services emit metrics to Amazon CloudWatch. These are processed asynchronously without impacting the application’s performance.

There are two default CloudWatch dashboards to visualize key metrics quickly: per service and cross service.

Per service

To view the per service metrics dashboard, I open the CloudWatch console.

I select a service where Overview is shown, such as Lambda. Now I can view the metrics for all Lambda functions in the account.

Cross service

To see an overview of key metrics across all AWS services, open the CloudWatch console and choose View cross service dashboard.

I see a list of all services with one or two key metrics displayed. This provides a good overview of all services your application uses.

Alerting

The next stage is to identify the key metrics for comparison and set up alerts for under- and over-performing services. Here are some recommended metrics to alarm on for a number of AWS services.

Alerts can be configured manually or via infrastructure as code tools such as the AWS Serverless Application Model, AWS CloudFormation, or third-party tools.

To configure a manual alert for Lambda function errors using CloudWatch Alarms:

  1. I open the CloudWatch console, select Alarms, and then select Create Alarm.
  2. I choose Select metric and, from AWS Namespaces, select Lambda, then Across All Functions, select Errors, and choose Select metric.
  3. I change the Statistic to Sum and the Period to 1 minute.
  4. Under Conditions, I select a Static threshold Greater than 1 and select Next.

Alarms can also be created using anomaly detection rather than static values if there is a discernible pattern or trend. Anomaly detection looks at past metric data and uses machine learning to create a model of expected values. Alerts can then be configured to trigger if a metric falls outside this band of “normal” values. I use a Static threshold for this alarm.

  1. For the notification, I set the alarm to notify an existing SNS topic with my email address, then choose Next.
  2. I enter a descriptive alarm name such as serverlessairline-lambda-prod-errors > 1, select Next, and choose Create alarm.

I have now manually set up an alarm.

Use CloudWatch composite alarms to combine multiple alarms to reduce noise and focus on critical issues. For example, a single alarm could trigger if there are both Lambda function errors and high Lambda concurrent executions.
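
A minimal sketch of creating such a composite alarm with the AWS CLI, assuming the two child alarms already exist under the names shown:

aws cloudwatch put-composite-alarm \
    --alarm-name serverlessairline-lambda-prod-critical \
    --alarm-rule "ALARM(serverlessairline-lambda-prod-errors) AND ALARM(serverlessairline-lambda-prod-high-concurrency)"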

It is simpler and more scalable to include alerting within infrastructure as code. Here is an example of alerting programmatically using CloudFormation.
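
A minimal sketch of such an alarm resource follows; the SNS topic reference and the threshold values are placeholders:

LambdaErrorsAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Alert when Lambda function errors exceed 1 over a one-minute period
    Namespace: AWS/Lambda
    MetricName: Errors
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref AlarmNotificationTopic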

I view the out-of-the-box standard metrics and, in this example, manually create an alarm for Lambda function errors.

Improvement plan summary:

  1. Understand what metrics and dimensions each managed service you use provides.
  2. Configure alerts on relevant metrics for when services are unhealthy.

Good practice: Use structured and centralized logging

Central logging provides a single place to search and analyze logs. Structured logging means selecting a consistent log format and content structure to simplify querying across multiple components.

To identify a business transaction across components, such as a particular flight booking, log operational information from upstream and downstream services. Add information such as customer_id along with business outcomes such as order=accepted or order=confirmed. Make sure you are not logging any sensitive or personal identifying data in any logs.

Use JSON as your logging output format. Log multiple fields in a single object or dictionary rather than many one line messages for simpler searching.

Here is an example of a structured logging format.
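
For instance, a single structured log entry might look something like the following sketch; the field names and values are illustrative:

{
    "timestamp": "2020-04-22T19:30:52Z",
    "level": "INFO",
    "service": "booking",
    "customer_id": "3f5ac8c1",
    "order": "accepted",
    "message": "Booking accepted"
}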

The airline booking component, which is written in Python, currently uses a shared library with a separate log processing stack.

Embedded Metrics Format is a simpler mechanism to replace the shared library and use structured logging. CloudWatch Embedded Metrics adds environmental metadata such as Lambda Function version and also automatically extracts custom metrics so you can visualize and alarm on them. There are open-source client libraries available for Node.js and Python.

I then add embedded metrics to the individual confirm booking module with the following steps:

  1. I install the aws-embedded-metrics library using the instructions.
  2. In the function init code, I import the module and create a metric_scope with the following code:

from aws_embedded_metrics import metric_scope
@metric_scope

  3. In the function handler, I log the generated bookingReference with the following code:

metrics.set_property("BookingReference", ret["bookingReference"])

In this example I also log the entire incoming event details.

metrics.set_property("event", event)

It is best practice to only log what is required to avoid unnecessary costs. Ensure the event does not have any sensitive or personal identifying data which is available to anyone who has access to the logs.

To avoid duplicate logging in this example airline application, which adds cost, I remove the existing shared library logger.*() lines.

When I make a booking, the CloudWatch log message is in structured JSON format. It contains the properties I set, event and BookingReference, as well as function metadata.

I can then search for all log activity related to a specific booking across multiple functions with booking_id. I can track customer activity across multiple bookings using customer_id.
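
For example, a CloudWatch Logs Insights query along these lines can surface that activity using the BookingReference property set earlier; the value shown is a placeholder:

fields @timestamp, @message
| filter BookingReference = "7EC9C445"
| sort @timestamp desc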

Logging is often created as a shared library resource which all functions reference. Another option is using Lambda Layers, which lets functions import additional code such as external libraries. Multiple functions can share this code.

Improvement plan summary:

  1. Log request identifiers from downstream services, component name, component runtime information, unique correlation identifiers, and information that helps identify a business transaction.
  2. Use JSON as the logging output. Prefer logging entire objects/dictionaries rather than many one line messages. Mask or remove sensitive data when logging.
  3. Keep logged debugging information to a minimum, as it can incur costs and increase the noise-to-signal ratio.

Conclusion

Evaluating serverless application health helps understand which services should be optimized to improve your customer’s experience. I cover out of the box metrics and alerts, as well as structured and centralized logging.

This well-architected question will be continued in an upcoming post where I look at custom metrics and distributed tracing.

Building a serverless URL shortener app without AWS Lambda – part 3

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/building-a-serverless-url-shortener-app-without-lambda-part-3/

This is the final installment of a three-part series on building a serverless URL shortener without using AWS Lambda. This series highlights the power of Amazon API Gateway and its ability to directly integrate with services like Amazon DynamoDB. The result is a low latency, highly available application that is built with managed services and requires minimal code.

In part one of this series, I demonstrate building a serverless URL shortener application without using AWS Lambda. In part two, I walk through implementing application security using Amazon API Gateway settings and Amazon Cognito. In this part of this series, I cover application observability and performance.

Application observability

Before I can gauge the performance of the application, I must first be able to observe the performance of my application. There are two AWS services that I configure to help with observability: AWS X-Ray and Amazon CloudWatch.

X-Ray

X-Ray is a tracing service that enables developers to observe and debug distributed applications. With X-Ray enabled, every call to the API Gateway endpoint is tagged and monitored throughout the application services. Now that I have the application up and running, I want to test for errors and latency. I use Nordstrom’s open-source load testing library, serverless-artillery, to generate activity to the API endpoint. During the load test, serverless-artillery generates 8,000 requests per second (RPS) for a period of five minutes. The results are as follows:

X-Ray Tracing

This indicates that, from the point the request reaches API Gateway to when a response is generated, the average time for each request is 8 milliseconds (ms) with a 4 ms integration time to DynamoDB. It also indicates that there were no errors, faults, or throttling.

I change the parameters to increase the load and observe how the application performs. This time serverless-artillery generates 11,000 RPS for a period of 30 seconds. The results are as follows:

X-Ray tracing with throttling

X-Ray now indicates request throttling. This is due to the default throttling limits of API Gateway. Each account has a soft limit of 10,000 RPS with a burst limit of 5,000 requests. Since I am load testing the API with 11,000 RPS, API Gateway is throttling requests over 10,000 per second. When throttling occurs, API Gateway responds to the client with a status code of 429. Using X-Ray, I can drill down into the response data to get a closer look at requests by status code.

X-Ray analytics

CloudWatch

The next tool I use for application observability is Amazon CloudWatch. CloudWatch captures data for individual services and supports metric based alarms. I create the following alarms to have insight into my application:

Alarm – Trigger
APIGateway4xxAlarm – One percent of the API calls result in a 4xx error over a one-minute period.
APIGateway5xxAlarm – One percent of the API calls result in a 5xx error over a one-minute period.
APIGatewayLatencyAlarm – The p99 latency experienced is over 75 ms over a five-minute period.
DDB4xxAlarm – One percent of the DynamoDB requests result in a 4xx error over a one-minute period.
DDB5xxAlarm – One percent of the DynamoDB requests result in a 5xx error over a one-minute period.
CloudFrontTotalErrorRateAlarm – Five requests to CloudFront result in a 4xx or 5xx error over a one-minute period.
CloudFrontTotalCacheHitRateAlarm – 80% or less of the requests to CloudFront result in a cache hit over a five-minute period. While this is not an error or a problem, it indicates the need for a more aggressive caching story.

Each of these alarms is configured to publish to a notification topic using Amazon Simple Notification Service (SNS). In this example I have configured my email address as a subscriber to the SNS topic. I could also subscribe a Lambda function or a mobile number for SMS message notification. I can also get a quick view of the status of my alarms on the CloudWatch console.

CloudWatch and X-Ray provide additional alerts when there are problems. They also provide observability to help remediate discovered issues.

Performance

With observability tools in place, I am now able to evaluate the performance of the application. In part one, I discuss using API Gateway and DynamoDB as the primary services for this application and the performance advantage provided. However, these performance advantages are limited to the backend only. To improve performance between the client and the API I configure throttling and a content delivery network with Amazon CloudFront.

Throttling

Request throttling is handled with API Gateway and can be configured at the stage level or at the resource and method level. Because this application is a URL shortener, the most important action is the 301 redirect that happens at /{linkId} – GET. I want to ensure that these calls take priority, so I set a throttling limit on all other actions.

The best way to do this is to set a global throttling limit of 2,000 RPS with a burst of 1,000. I then configure an override on the /{linkId} – GET method to 10,000 RPS with a burst of 5,000. If the API is experiencing an extraordinarily high volume of calls, all other calls are rejected.
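
In a CloudFormation or SAM template, a sketch of that configuration on the stage’s MethodSettings could look like the following (this would sit under an AWS::ApiGateway::Stage or AWS::Serverless::Api resource; note that forward slashes in ResourcePath are encoded as ~1):

MethodSettings:
  - ResourcePath: "/*"
    HttpMethod: "*"
    ThrottlingRateLimit: 2000
    ThrottlingBurstLimit: 1000
  - ResourcePath: "/~1{linkId}"
    HttpMethod: GET
    ThrottlingRateLimit: 10000
    ThrottlingBurstLimit: 5000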

Content delivery network

The distance between a user and the API endpoint can severely affect the performance of an application. Simply put, the further the data has to travel, the slower the application. By configuring a CloudFront distribution to use the Amazon CloudFront Global Edge Network, I bring the data closer to the user and increase performance.

I configure the cache for /{linkId} – GET to “max-age=300” which tells CloudFront to store the response of that call for 5 minutes. The first call queries the API and database for a response, while all subsequent calls in the next five minutes receive the local cached response. I then set all other endpoint cache to “no-cache, no-store”, which tells CloudFront to never store the value from these calls. This ensures that as users are creating or editing their short-links, they get the latest data.

By bringing the data closer to the user, I now ensure that regardless of where the user is, they receive improved performance. To evaluate this, I return to serverless-artillery and test the CloudFront endpoint. The results are as follows:

Min – 8.12 ms
Max – 739 ms
Average – 21.7 ms
p10 – 10.1 ms
p50 – 12.1 ms
p90 – 20 ms
p95 – 34 ms
p99 – 375 ms

To be clear, these are the 301 redirect response times. I configured serverless-artillery not to follow the redirects as I have no control over the speed of the resulting site. The maximum response time was 739 ms. This would be the initial uncached call. The p50 metric shows that half of the traffic sees a 12 ms or better response time, while the p95 indicates that most of my traffic experiences a response time of 34 ms or better.

Conclusion

In this series, I talk through building a serverless URL shortener without the use of any Lambda functions. The resulting architecture looks like this:

This application is built by integrating multiple managed services together and applying business logic via mapping templates on API Gateway. Because these are all serverless managed services, they provide inherent availability and scale to meet the client load as needed.

While the “Lambda-less” pattern is not a match for every application, it is a great answer for building highly performant applications with minimal logic. The advantage to this pattern is also in its extensibility. With the data saved to DynamoDB, I can use the DynamoDB streaming feature to connect additional processing as needed. I can also use CloudFront access logs to evaluate internal application metrics. Clone this repo to start serving your own shortened URLs and submit a pull request if you have an improvement.

Did you miss any of this series?

  1. Part 1: Building the application.
  2. Part 2: Securing the application.

Happy coding!

Analyzing API Gateway custom access logs for custom domain names

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/analyzing-api-gateway-custom-access-logs-for-custom-domain-names/

This post is courtesy of Taka Matsumoto, Cloud Support Engineer, AWS

If you are using custom domain names in Amazon API Gateway, it can be useful to gain insights into requests sent to each custom domain name. Although API Gateway provides CloudWatch metrics and options to deliver request logs to Amazon CloudWatch Logs, there is no pre-defined metric or log specific to custom domain names. If there is more than one custom domain name mapped to a single API, understanding the quantity and type of requests by domain name may help understand request patterns.

Using the custom access logging option in API Gateway enables delivery of custom logs to CloudWatch Logs, which can be analyzed using CloudWatch Logs Insights. This blog post walks through the steps to create a CloudWatch log group for API custom access logging, and uses CloudWatch Logs Insights for analysis.

Overview

In the tutorial, you create a CloudWatch log group for custom access logging. You then enable custom access logging for an API stage associated with a custom domain name. The IAM role used in this tutorial must be able to create and update the relevant resources in CloudWatch, IAM, and API Gateway. For this tutorial, use the US East (N. Virginia) Region. In the next steps, the tutorial covers:

  1. Creating a CloudWatch Log group.
  2. Creating an IAM role for access logging.
  3. Enabling custom access logging.
  4. Testing an API using a custom domain name.
  5. Analyzing Logs in CloudWatch Logs Insights.

1. Create a CloudWatch Log group

Before enabling custom access logging for your API’s stage, create a CloudWatch log group to deliver custom logs. Create a log group called APIGateway_CustomDomainLogs by following these steps:

  1. Go to the CloudWatch Logs console.
  2. Under Actions, click on Create log group and name the log group APIGateway_CustomDomainLogs. Learn more about creating a log group in Working with Log Groups and Log Streams.

Create log group

2. Create an IAM role for access logging

You must use an IAM role to deliver logs from API Gateway to CloudWatch Logs. If there is no IAM role already available for logging in API Gateway, create a new IAM role:

  1. Navigate to the IAM console.
  2. Under Roles, choose Create role.
  3. Select API Gateway for the service and choose Next: Permissions.

    IAM selection

  4. Leave the attached IAM policy (AmazonAPIGatewayPushToCloudWatchLogs), and choose Next: Tags.
  5. No tags are required for this tutorial. Leave these blank and choose Next: Review.
  6. Name the role APIGatewayCloudWatchLogsRole and choose Create role.
    Create role

3. Enable custom access logging

Now you enable custom access logging. Select one of the API stages that you invoke through a custom domain name:

  1. If there is no CloudWatch log role set for API Gateway, go to the API Gateway Settings page to add the CloudWatch log role ARN. The IAM role ARN follows this format: arn:aws:iam::123456789012:role/APIGatewayCloudWatchLogsRole.
  2. For your API with a custom domain name, go to the Stages page and select the Logs/Tracing tab.
  3. Enter the following fields:
    – For CloudWatch Group, add the ARN (for example, arn:aws:logs:us-east-1:123456789012:log-group:APIGateway_CustomDomainLogs).
    – For Log Format, enter:

    {
        "RequestId": "$context.requestId",
        "DomainName": "$context.domainName",
        "APIId": "$context.apiId",
        "RequestPath": "$context.path",
        "RequestTime": "$context.requestTime",
        "SourceIp": "$context.identity.sourceIp",
        "ResourcePath": "$context.resourcePath",
        "Stage": "$context.stage"
    }

    Custom access logging

  4.  Choose Save changes.

Learn more about custom access logging setup in Set up API Logging Using the API Gateway Console. In the Log Format configuration, $context variables retrieve a domain name as well as other API request information. Learn more in $context Variables for Data Models, Authorizers, Mapping Templates, and CloudWatch Access Logging.

4. Test invoke an API using a custom domain name

Once you enable custom access logging, invoke the API using the custom domain name. The logs appear in the specified CloudWatch log group shortly after. A sample response in the CloudWatch log stream looks like the following:

{
    "RequestId": "1b1ebe20-817f-11e9-a796-f5e0ffdcdac7",
    "DomainName": "test.example.com”,
    "APIId": "12345abcde",
    "RequestPath": "/dev",
    "RequestTime": "28/May/2019:19:30:52 +0000",
    "SourceIp": "1.2.3.4",
    "ResourcePath": "/",
    "Stage": "dev"
}

5. Analyze logs in CloudWatch Logs Insights

After setting up the custom access logs, you can query against them to find more insights using the custom domain name.

  1. Go to CloudWatch Logs Insights console.
  2. In the log group text field, select the CloudWatch log group, APIGateway_CustomDomainLogs.
  3. Enter the following query.
    fields @timestamp, @message
    | filter DomainName like /(?i)(test.example.com)/

    This query returns a list of log entries for the custom domain called test.example.com. To run in your account, replace this value with your custom domain name.
    Custom filter

If your network security policy does not allow the use of web sockets, you cannot access the CloudWatch Logs Insights console. Instead, you can use the CloudWatch Logs Insights query capabilities through the API. Here are some example queries using the AWS CLI:

1. A sample command for aws logs start-query:

aws logs start-query --log-group-name APIGateway_CustomDomainLogs --start-time 1557085225000 --end-time 1559763625000 --query-string 'fields @timestamp, @message | filter DomainName like /(?i)(test.example.com)/'

The response looks like:

{
    "queryId": "a1234567-bfde-47c7-9d44-41ebed011c66"
}

Learn more about the start-query command in aws logs start-query.

2. Run aws logs get-query-results to retrieve the result of the query. A sample command for aws logs get-query-results:

aws logs get-query-results --query-id a1234567-bfde-47c7-9d44-41ebed011c66

The response looks like:

{
    "results": [
        [
            {
                "field": "@timestamp",
                "value": "2019-05-28 19:30:52.494"
            },
            {
                "field": "@message",
                "value": "{\"RequestId\": \"12345678-7cb3-11e9-8896-c30af5588427\",\"DomainName\":\"test.exmaple.com\"}"
            },
            {
                "field": "@ptr",
                "value": "CmEKKAokOTYxNTQyNjM4MjQzOkFQSUdhdGV3YXlfQ3VzdG9tRG9tYWluEAISNRoYAgXM/KQtAAAAAA0L2foABc5YFyAAAAHSIAEoxviFhK4tMM+HhoSuLTgCQOoBSN4OUN4KEAAYAQ=="
            }
        ]
    ],
    "statistics": {
        "recordsMatched": 1.0,
        "recordsScanned": 1.0,
        "bytesScanned": 153.0
    },
    "status": "Complete"
}

You can use other queries to filter the results based on other attributes in the logs. Learn more about the get-query-results command in aws logs get-query-results.
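If you prefer to script this flow, the following is a minimal boto3 sketch of the same start-query and get-query-results calls shown above. The log group and domain name match the earlier examples, while the polling loop and time range (epoch seconds) are assumptions for illustration only.

import time

import boto3

logs_client = boto3.client('logs')

# Start the query against the custom access log group (time range is illustrative, in epoch seconds)
query_id = logs_client.start_query(
    logGroupName='APIGateway_CustomDomainLogs',
    startTime=1557085225,
    endTime=1559763625,
    queryString='fields @timestamp, @message | filter DomainName like /(?i)(test.example.com)/'
)['queryId']

# Poll until CloudWatch Logs Insights finishes the query, then print the matched entries
while True:
    response = logs_client.get_query_results(queryId=query_id)
    if response['status'] in ('Complete', 'Failed', 'Cancelled'):
        break
    time.sleep(1)

for row in response.get('results', []):
    print({column['field']: column['value'] for column in row})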

Conclusion

In this blog post, I show how to deliver custom access logs from API Gateway to CloudWatch Logs. I also show how to use CloudWatch Logs Insights to run a query against the logs for custom domain name metrics, which help provide insights into custom domain name usage.

To learn more about the query syntax, visit CloudWatch Logs Insights Query Syntax.

Customizing triggers for AWS CodePipeline with AWS Lambda and Amazon CloudWatch Events

Post Syndicated from Bryant Bost original https://aws.amazon.com/blogs/devops/adding-custom-logic-to-aws-codepipeline-with-aws-lambda-and-amazon-cloudwatch-events/

AWS CodePipeline is a fully managed continuous delivery service that helps automate the build, test, and deploy processes of your application. Application owners use CodePipeline to manage releases by configuring “pipelines,” workflow constructs that describe the steps, from source code to deployed application, through which an application progresses as it is released. If you are new to CodePipeline, check out Getting Started with CodePipeline to get familiar with the core concepts and terminology.

Overview

In a default setup, a pipeline is kicked-off whenever a change in the configured pipeline source is detected. CodePipeline currently supports sourcing from AWS CodeCommit, GitHub, Amazon ECR, and Amazon S3. When using CodeCommit, Amazon ECR, or Amazon S3 as the source for a pipeline, CodePipeline uses an Amazon CloudWatch Event to detect changes in the source and immediately kick off a pipeline. When using GitHub as the source for a pipeline, CodePipeline uses a webhook to detect changes in a remote branch and kick off the pipeline. Note that CodePipeline also supports beginning pipeline executions based on periodic checks, although this is not a recommended pattern.

CodePipeline supports adding a number of custom actions and manual approvals to ensure that pipeline functionality is flexible and code releases are deliberate; however, without further customization, pipelines will still be kicked-off for every change in the pipeline source. To customize the logic that controls pipeline executions in the event of a source change, you can introduce a custom CloudWatch Event, which can result in the following benefits:

  • Multiple pipelines with a single source: Trigger select pipelines when multiple pipelines are listening to a single source. This can be useful if your organization is using monorepos, or is using a single repository to host configuration files for multiple instances of identical stacks.
  • Avoid reacting to unimportant files: Avoid triggering a pipeline when changing files that do not affect the application functionality (e.g. documentation files, readme files, and .gitignore files).
  • Conditionally kickoff pipelines based on environmental conditions: Use custom code to evaluate whether a pipeline should be triggered. This allows for further customization beyond polling a source repository or relying on a push event. For example, you could create custom logic to automatically reschedule deployments on holidays to the next available workday.

This post explores and demonstrates how to customize the actions that invoke a pipeline by modifying the default CloudWatch Events configuration that is used for CodeCommit, ECR, or S3 sources. To illustrate this customization, we will walk through two examples: prevent updates to documentation files from triggering a pipeline, and manage execution of multiple pipelines monitoring a single source repository.

The key concepts behind customizing pipeline invocations extend to GitHub sources and webhooks as well; however, creating a custom webhook is outside the scope of this post.

Sample Architecture

This post is only interested in controlling the execution of the pipeline (as opposed to the deploy, test, or approval stages), so it uses simple source and pipeline configurations. The sample architecture considers a simple CodePipeline with only two stages: source and build.

Example CodePipeline Architecture

Example CodePipeline Architecture with Custom CloudWatch Event Configuration

The sample CodeCommit repository consists only of buildspec.yml, readme.md, and script.py files.

Normally, after you create a pipeline, it automatically triggers a pipeline execution to release the latest version of your source code. From then on, every time you make a change to your source location, a new pipeline execution is triggered. In addition, you can manually re-run the last revision through a pipeline using the “Release Change” button in the console. This architecture uses a custom CloudWatch Events rule and an AWS Lambda function to prevent commits that change only the readme.md file from initiating an execution of the pipeline.

Creating a custom CloudWatch Event

When we create a CodePipeline that monitors a CodeCommit (or other) source, a default CloudWatch Events rule is created to trigger our pipeline for every change to the CodeCommit repository. This CloudWatch Events rule monitors the CodeCommit repository for changes, and triggers the pipeline for events matching the referenceCreated or referenceUpdated CodeCommit Event (refer to CodeCommit Event Types for more information).

Default CloudWatch Events Rule to Trigger CodePipeline

Default CloudWatch Events Rule to Trigger CodePipeline

To introduce custom logic and control the events that kickoff the pipeline, this example configures the default CloudWatch Events rule to detect changes in the source and trigger a Lambda function rather than invoke the pipeline directly. The example uses a CodeCommit source, but the same principle applies to Amazon S3 and Amazon ECR sources as well, as these both use CloudWatch Events rules to notify CodePipeline of changes.

Custom CloudWatch Events Rule to Trigger CodePipeline

Custom CloudWatch Events Rule to Trigger CodePipeline
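As a rough sketch of this re-targeting (the rule name, function ARN, and statement ID below are placeholders, not values taken from this example), the default rule's CodePipeline target could be swapped for the Lambda function with boto3:

import boto3

events_client = boto3.client('events')
lambda_client = boto3.client('lambda')

RULE_NAME = 'codepipeline-customization-sandbox-rule'  # placeholder rule name
FUNCTION_ARN = 'arn:aws:lambda:us-east-1:123456789012:function:pipeline-trigger'  # placeholder

# Remove the existing CodePipeline target from the default rule
existing_targets = events_client.list_targets_by_rule(Rule=RULE_NAME)['Targets']
events_client.remove_targets(Rule=RULE_NAME, Ids=[target['Id'] for target in existing_targets])

# Point the rule at the Lambda function instead
events_client.put_targets(
    Rule=RULE_NAME,
    Targets=[{'Id': 'pipeline-trigger-lambda', 'Arn': FUNCTION_ARN}]
)

# Allow CloudWatch Events to invoke the function
lambda_client.add_permission(
    FunctionName=FUNCTION_ARN,
    StatementId='cloudwatch-events-invoke',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com'
)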

When a change is introduced to the CodeCommit repository, the configured Lambda function receives an event from CloudWatch signaling that there has been a source change.

{
   "version":"0",
   "id":"2f9a75be-88f6-6827-729d-34495072e5a1",
   "detail-type":"CodeCommit Repository State Change",
   "source":"aws.codecommit",
   "account":"accountNumber",
   "time":"2019-11-12T04:56:47Z",
   "region":"us-east-1",
   "resources":[
      "arn:aws:codecommit:us-east-1:accountNumber:codepipeline-customization-sandbox-repo"
   ],
   "detail":{
      "callerUserArn":"arn:aws:sts::accountNumber:assumed-role/admin/roleName ",
      "commitId":"92e953e268345c77dd93cec860f7f91f3fd13b44",
      "event":"referenceUpdated",
      "oldCommitId":"5a058542a8dfa0dacf39f3c1e53b88b0f991695e",
      "referenceFullName":"refs/heads/master",
      "referenceName":"master",
      "referenceType":"branch",
      "repositoryId":"658045f1-c468-40c3-93de-5de2c000d84a",
      "repositoryName":"codepipeline-customization-sandbox-repo"
   }
}

The Lambda function is responsible for determining whether a source change necessitates kicking-off the pipeline, which in the example is necessary if the change contains modifications to files other than readme.md. To implement this, the Lambda function uses the commitId and oldCommitId fields provided in the body of the CloudWatch event message to determine which files have changed. If the function determines that a change has occurred to a “non-ignored” file, then the function programmatically executes the pipeline. Note that for S3 sources, it may be necessary to process an entire file zip archive, or to retrieve past versions of an artifact.

import boto3

files_to_ignore = [ "readme.md" ]

codecommit_client = boto3.client('codecommit')
codepipeline_client = boto3.client('codepipeline')

def lambda_handler(event, context):
    # Extract commits
    old_commit_id = event["detail"]["oldCommitId"]
    new_commit_id = event["detail"]["commitId"]
    # Get commit differences
    codecommit_response = codecommit_client.get_differences(
        repositoryName="codepipeline-customization-sandbox-repo",
        beforeCommitSpecifier=str(old_commit_id),
        afterCommitSpecifier=str(new_commit_id)
    )
    # Search commit differences for files to ignore
    for difference in codecommit_response["differences"]:
        file_name = difference["afterBlob"]["path"].lower()
        # If non-ignored file is present, kickoff pipeline
        if file_name not in files_to_ignore:
            codepipeline_response = codepipeline_client.start_pipeline_execution(
                name="codepipeline-customization-sandbox-pipeline"
                )
            # Break to avoid executing the pipeline twice
            break

Multiple pipelines sourcing from a single repository

Architectures that use a single-source repository monitored by multiple pipelines can add custom logic to control the types of events that trigger a specific pipeline to execute. Without customization, any change to the source repository would trigger all pipelines.

Consider the following example:

  • A CodeCommit repository contains a number of config files (for example, config_1.json and config_2.json).
  • Multiple pipelines (for example, codepipeline-customization-sandbox-pipeline-1 and codepipeline-customization-sandbox-pipeline-2) source from this CodeCommit repository.
  • Whenever a config file is updated, a custom CloudWatch Event triggers a Lambda function that is used to determine which config files changed, and therefore which pipelines should be executed.
Example CodePipeline Architecture

Example CodePipeline Architecture for Monorepos with Custom CloudWatch Event Configuration

This example follows the same pattern of creating a custom CloudWatch Event and Lambda function shown in the preceding example. However, in this scenario, the Lambda function is responsible for determining which files changed and which pipelines should be kicked off as a result. To execute this logic, the Lambda function uses the config_file_mapping variable to map files to corresponding pipelines. Pipelines are only executed if their designated config file has changed.

Note that the config_file_mapping can be exported to Amazon S3 or Amazon DynamoDB for more complex use cases.

import boto3

# Map config files to pipelines
config_file_mapping = {
        "config_1.json" : "codepipeline-customization-sandbox-pipeline-1",
        "config_2.json" : "codepipeline-customization-sandbox-pipeline-2"
        }
        
codecommit_client = boto3.client('codecommit')
codepipeline_client = boto3.client('codepipeline')

def lambda_handler(event, context):
    # Extract commits
    old_commit_id = event["detail"]["oldCommitId"]
    new_commit_id = event["detail"]["commitId"]
    # Get commit differences
    codecommit_response = codecommit_client.get_differences(
        repositoryName="codepipeline-customization-sandbox-repo",
        beforeCommitSpecifier=str(old_commit_id),
        afterCommitSpecifier=str(new_commit_id)
    )
    # Search commit differences for files that trigger executions
    for difference in codecommit_response["differences"]:
        file_name = difference["afterBlob"]["path"].lower()
        # If file corresponds to pipeline, execute pipeline
        if file_name in config_file_mapping:
            codepipeline_response = codepipeline_client.start_pipeline_execution(
                name=config_file_mapping[file_name]
                )
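For the more complex case mentioned above, a minimal sketch of loading the mapping from DynamoDB at invocation time could look like the following; the table name and attribute names here are assumptions, not part of the sample.

import boto3

# Hypothetical table whose items look like {"file_name": "config_1.json", "pipeline_name": "..."}
mapping_table = boto3.resource('dynamodb').Table('pipeline-config-file-mapping')

def load_config_file_mapping():
    # A scan is acceptable here because the mapping table stays small
    items = mapping_table.scan()['Items']
    return {item['file_name']: item['pipeline_name'] for item in items}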

Results

For the first example, updates affecting only the readme.md file are completely ignored by the pipeline, while updates affecting other files begin a normal pipeline execution. For the second example, the two pipelines monitor the same source repository; however, codepipeline-customization-sandbox-pipeline-1 is executed only when config_1.json is updated and codepipeline-customization-sandbox-pipeline-2 is executed only when config_2.json is updated.

These CloudWatch Events and Lambda function combinations serve as good general examples of introducing custom logic to pipeline kickoffs, and can be expanded to handle more complex processing logic.

Cleanup

To avoid additional infrastructure costs from the examples described in this post, be sure to delete all CodeCommit repositories, CodePipeline pipelines, Lambda functions, and CodeBuild projects. When you delete a CodePipeline, the CloudWatch Events rule that was created automatically is deleted, even if the rule has been customized.

Conclusion

For scenarios which need you to define additional custom logic to control the execution of one or multiple pipelines, configuring a CloudWatch Event to trigger a Lambda function allows you to customize the conditions and types of events that can kick-off your pipeline.

ICYMI: Serverless Q4 2019

Post Syndicated from Rob Sutter original https://aws.amazon.com/blogs/compute/icymi-serverless-q4-2019/

Welcome to the eighth edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share the most recent product launches, feature enhancements, blog posts, webinars, Twitch live streams, and other interesting things that you might have missed!

In case you missed our last ICYMI, check out what happened last quarter here.

The three months comprising the fourth quarter of 2019

AWS re:Invent

AWS re:Invent 2019

re:Invent 2019 dominated the fourth quarter at AWS. The serverless team presented a number of talks, workshops, and builder sessions to help customers increase their skills and deliver value more rapidly to their own customers.

Serverless talks from re:Invent 2019

Chris Munns presenting 'Building microservices with AWS Lambda' at re:Invent 2019

We presented dozens of sessions showing how customers can improve their architecture and agility with serverless. Here are some of the most popular.

Videos

Decks

You can also find decks for many of the serverless presentations and other re:Invent presentations on our AWS Events Content.

AWS Lambda

For developers needing greater control over performance of their serverless applications at any scale, AWS Lambda announced Provisioned Concurrency at re:Invent. This feature enables Lambda functions to execute with consistent start-up latency, making them ideal for building latency-sensitive applications.

As shown in the graph below, Provisioned Concurrency reduces tail latency, directly improving response times and providing a more responsive end-user experience.

Graph showing performance enhancements with AWS Lambda Provisioned Concurrency
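Configuring the feature is a single API call per function version or alias. Here is a minimal boto3 sketch; the function name, alias, and concurrency value are placeholders.

import boto3

lambda_client = boto3.client('lambda')

# Keep 50 execution environments initialized for the "live" alias
lambda_client.put_provisioned_concurrency_config(
    FunctionName='my-latency-sensitive-function',
    Qualifier='live',
    ProvisionedConcurrentExecutions=50
)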

Lambda rolled out enhanced VPC networking to 14 additional Regions around the world. This change brings dramatic improvements to startup performance for Lambda functions running in VPCs due to more efficient usage of elastic network interfaces.

Illustration of AWS Lambda VPC to VPC NAT

New VPC to VPC NAT for Lambda functions

Lambda now supports three additional runtimes: Node.js 12, Java 11, and Python 3.8. Each of these new runtimes has new version-specific features and benefits, which are covered in the linked release posts. Like the Node.js 10 runtime, these new runtimes are all based on an Amazon Linux 2 execution environment.

Lambda released a number of controls for both stream and async-based invocations:

  • You can now configure error handling for Lambda functions consuming events from Amazon Kinesis Data Streams or Amazon DynamoDB Streams. It’s now possible to limit the retry count, limit the age of records being retried, configure a failure destination, or split a batch to isolate a problem record. These capabilities help you deal with potential “poison pill” records that would previously cause streams to pause in processing.
  • For asynchronous Lambda invocations, you can now set the maximum event age and retry attempts on the event. If either configured condition is met, the event can be routed to a dead letter queue (DLQ), Lambda destination, or it can be discarded.

AWS Lambda Destinations is a new feature that allows developers to designate an asynchronous target for Lambda function invocation results. You can set separate destinations for success and failure. This unlocks new patterns for distributed event-based applications and can replace custom code previously used to manage routing results.

Illustration depicting AWS Lambda Destinations with success and failure configurations

Lambda Destinations
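A minimal boto3 sketch of combining the asynchronous invocation controls and Destinations described above follows; the function name and destination ARNs are placeholders.

import boto3

lambda_client = boto3.client('lambda')

# Cap retries and event age, and route invocation results to separate destinations
lambda_client.put_function_event_invoke_config(
    FunctionName='my-async-function',
    MaximumRetryAttempts=1,
    MaximumEventAgeInSeconds=3600,
    DestinationConfig={
        'OnSuccess': {'Destination': 'arn:aws:sqs:us-east-1:123456789012:success-queue'},
        'OnFailure': {'Destination': 'arn:aws:sns:us-east-1:123456789012:failure-topic'}
    }
)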

Lambda also now supports setting a Parallelization Factor, which allows you to set multiple Lambda invocations per shard for Kinesis Data Streams and DynamoDB Streams. This enables faster processing without the need to increase your shard count, while still guaranteeing the order of records processed.

Illustration of multiple AWS Lambda invocations per Kinesis Data Streams shard

Lambda Parallelization Factor diagram
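As a sketch, the Parallelization Factor is set on an existing event source mapping; the UUID below is a placeholder.

import boto3

lambda_client = boto3.client('lambda')

# Process up to 10 concurrent batches per shard on an existing Kinesis/DynamoDB Streams mapping
lambda_client.update_event_source_mapping(
    UUID='12345678-1234-1234-1234-123456789012',
    ParallelizationFactor=10
)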

Lambda introduced Amazon SQS FIFO queues as an event source. “First in, first out” (FIFO) queues guarantee the order of record processing, unlike standard queues. FIFO queues support messaging batching via a MessageGroupID attribute that supports parallel Lambda consumers of a single FIFO queue, enabling high throughput of record processing by Lambda.
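Subscribing a function to a FIFO queue is a standard event source mapping; a minimal sketch with placeholder names follows.

import boto3

lambda_client = boto3.client('lambda')

# Subscribe a function to a FIFO queue; records with the same MessageGroupId are processed in order
lambda_client.create_event_source_mapping(
    EventSourceArn='arn:aws:sqs:us-east-1:123456789012:orders.fifo',
    FunctionName='order-processor',
    BatchSize=10
)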

Lambda now supports Environment Variables in the AWS China (Beijing) Region and the AWS China (Ningxia) Region.

You can now view percentile statistics for the duration metric of your Lambda functions. Percentile statistics show the relative standing of a value in a dataset, and are useful when applied to metrics that exhibit large variances. They can help you understand the distribution of a metric, discover outliers, and find hard-to-spot situations that affect customer experience for a subset of your users.
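For example, the p95 and p99 duration of a function can be retrieved with a single call; the function name and time window below are placeholders.

from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client('cloudwatch')

# Retrieve hourly p95/p99 duration for a function over the last 24 hours
response = cloudwatch.get_metric_statistics(
    Namespace='AWS/Lambda',
    MetricName='Duration',
    Dimensions=[{'Name': 'FunctionName', 'Value': 'my-function'}],
    StartTime=datetime.utcnow() - timedelta(hours=24),
    EndTime=datetime.utcnow(),
    Period=3600,
    ExtendedStatistics=['p95', 'p99']
)
for datapoint in response['Datapoints']:
    print(datapoint['Timestamp'], datapoint['ExtendedStatistics'])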

Amazon API Gateway

Screen capture of creating an Amazon API Gateway HTTP API in the AWS Management Console

Amazon API Gateway announced the preview of HTTP APIs. In addition to significant performance improvements, most customers see an average cost savings of 70% when compared with API Gateway REST APIs. With HTTP APIs, you can create an API in four simple steps. Once the API is created, additional configuration for CORS and JWT authorizers can be added.

AWS SAM CLI

Screen capture of the new 'sam deploy' process in a terminal window

The AWS SAM CLI team simplified the bucket management and deployment process in the SAM CLI. You no longer need to manage a bucket for deployment artifacts – SAM CLI handles this for you. The deployment process has also been streamlined from multiple flagged commands to a single command, sam deploy.

AWS Step Functions

One powerful feature of AWS Step Functions is its ability to integrate directly with AWS services without you needing to write complicated application code. In Q4, Step Functions expanded its integration with Amazon SageMaker to simplify machine learning workflows. Step Functions also added a new integration with Amazon EMR, making EMR big data processing workflows faster to build and easier to monitor.

Screen capture of an AWS Step Functions step with Amazon EMR

Step Functions step with EMR

Step Functions now provides the ability to track state transition usage by integrating with AWS Budgets, allowing you to monitor trends and react to usage on your AWS account.

You can now view CloudWatch Metrics for Step Functions at a one-minute frequency. This makes it easier to set up detailed monitoring for your workflows. You can use one-minute metrics to set up CloudWatch Alarms based on your Step Functions API usage, Lambda functions, service integrations, and execution details.
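As a sketch, a one-minute alarm on failed executions might look like the following; the state machine ARN and SNS topic are placeholders.

import boto3

cloudwatch = boto3.client('cloudwatch')

# Alarm when a state machine reports any failed execution within a one-minute period
cloudwatch.put_metric_alarm(
    AlarmName='step-functions-failed-executions',
    Namespace='AWS/States',
    MetricName='ExecutionsFailed',
    Dimensions=[{'Name': 'StateMachineArn',
                 'Value': 'arn:aws:states:us-east-1:123456789012:stateMachine:my-state-machine'}],
    Statistic='Sum',
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:alerts']
)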

Step Functions now supports higher throughput workflows, making it easier to coordinate applications with high event rates. This increases the limits to 1,500 state transitions per second and a default start rate of 300 state machine executions per second in US East (N. Virginia), US West (Oregon), and Europe (Ireland). Click the above link to learn more about the limit increases in other Regions.

Screen capture of choosing Express Workflows in the AWS Management Console

Step Functions released AWS Step Functions Express Workflows. With the ability to support event rates greater than 100,000 per second, this feature is designed for high-performance workloads at a reduced cost.

Amazon EventBridge

Illustration of the Amazon EventBridge schema registry and discovery service

Amazon EventBridge announced the preview of the Amazon EventBridge schema registry and discovery service. This service allows developers to automate discovery and cataloging event schemas for use in their applications. Additionally, once a schema is stored in the registry, you can generate and download a code binding that represents the schema as an object in your code.

Amazon SNS

Amazon SNS now supports the use of dead letter queues (DLQ) to help capture unhandled events. By enabling a DLQ, you can catch events that are not processed and re-submit them or analyze to locate processing issues.
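A DLQ is attached per subscription by setting its RedrivePolicy attribute; here is a minimal sketch with placeholder ARNs.

import json

import boto3

sns_client = boto3.client('sns')

# Attach an SQS queue as the dead letter queue for an SNS subscription
sns_client.set_subscription_attributes(
    SubscriptionArn='arn:aws:sns:us-east-1:123456789012:my-topic:11111111-2222-3333-4444-555555555555',
    AttributeName='RedrivePolicy',
    AttributeValue=json.dumps({'deadLetterTargetArn': 'arn:aws:sqs:us-east-1:123456789012:my-dlq'})
)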

Amazon CloudWatch

Amazon CloudWatch announced Amazon CloudWatch ServiceLens to provide a “single pane of glass” to observe health, performance, and availability of your application.

Screenshot of Amazon CloudWatch ServiceLens in the AWS Management Console

CloudWatch ServiceLens

CloudWatch also announced a preview of a capability called Synthetics. CloudWatch Synthetics allows you to test your application endpoints and URLs using configurable scripts that mimic what a real customer would do. This enables the outside-in view of your customers’ experiences, and your service’s availability from their point of view.

CloudWatch introduced Embedded Metric Format, which helps you ingest complex high-cardinality application data as logs and easily generate actionable metrics. You can publish these metrics from your Lambda function by using the PutLogEvents API or using an open source library for Node.js or Python applications.
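As a sketch, a Python Lambda function can emit an Embedded Metric Format record simply by printing structured JSON to its log; the namespace, dimension, and metric names below are placeholders.

import json
import time

def lambda_handler(event, context):
    # Writing structured JSON to stdout lands in CloudWatch Logs, where the
    # Embedded Metric Format metadata turns "CheckoutLatency" into a metric
    print(json.dumps({
        '_aws': {
            'Timestamp': int(time.time() * 1000),
            'CloudWatchMetrics': [{
                'Namespace': 'MyApplication',
                'Dimensions': [['Service']],
                'Metrics': [{'Name': 'CheckoutLatency', 'Unit': 'Milliseconds'}]
            }]
        },
        'Service': 'checkout',
        'CheckoutLatency': 42.0
    }))
    return {'statusCode': 200}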

Finally, CloudWatch announced a preview of Contributor Insights, a capability to identify who or what is impacting your system or application performance by identifying outliers or patterns in log data.

AWS X-Ray

AWS X-Ray announced trace maps, which enable you to map the end-to-end path of a single request. Identifiers show issues and how they affect other services in the request’s path. These can help you to identify and isolate service points that are causing degradation or failures.

X-Ray also announced support for Amazon CloudWatch Synthetics, currently in preview. CloudWatch Synthetics on X-Ray support tracing canary scripts throughout the application, providing metrics on performance or application issues.

Screen capture of AWS X-Ray Service map in the AWS Management Console

X-Ray Service map with CloudWatch Synthetics

Amazon DynamoDB

Amazon DynamoDB announced support for customer-managed customer master keys (CMKs) to encrypt data in DynamoDB. This allows customers to bring your own key (BYOK) giving you full control over how you encrypt and manage the security of your DynamoDB data.
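Switching an existing table to a customer managed CMK is a single update call; a minimal sketch with placeholder identifiers follows.

import boto3

dynamodb_client = boto3.client('dynamodb')

# Switch an existing table to a customer managed CMK
dynamodb_client.update_table(
    TableName='my-table',
    SSESpecification={
        'Enabled': True,
        'SSEType': 'KMS',
        'KMSMasterKeyId': 'arn:aws:kms:us-east-1:123456789012:key/11111111-2222-3333-4444-555555555555'
    }
)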

It is now possible to add global replicas to existing DynamoDB tables to provide enhanced availability across the globe.

Another new DynamoDB capability to identify frequently accessed keys and database traffic trends is currently in preview. With this, you can now more easily identify “hot keys” and understand usage of your DynamoDB tables.

Screen capture of Amazon CloudWatch Contributor Insights for DynamoDB in the AWS Management Console

CloudWatch Contributor Insights for DynamoDB

DynamoDB also released adaptive capacity. Adaptive capacity helps you handle imbalanced workloads by automatically isolating frequently accessed items and shifting data across partitions to rebalance them. This helps reduce cost by enabling you to provision throughput for a more balanced workload instead of over provisioning for uneven data access patterns.

Amazon RDS

Amazon Relational Database Services (RDS) announced a preview of Amazon RDS Proxy to help developers manage RDS connection strings for serverless applications.

Illustration of Amazon RDS Proxy

The RDS Proxy maintains a pool of established connections to your RDS database instances. This pool enables you to support a large number of application connections so your application can scale without compromising performance. It also increases security by enabling IAM authentication for database access and enabling you to centrally manage database credentials using AWS Secrets Manager.

AWS Serverless Application Repository

The AWS Serverless Application Repository (SAR) now offers Verified Author badges. These badges enable consumers to quickly and reliably know who you are. The badge appears next to your name in the SAR and links to your GitHub profile.

Screen capture of SAR Verified developer badge in the AWS Management Console

SAR Verified developer badges

AWS Developer Tools

AWS CodeCommit launched the ability for you to enforce rule workflows for pull requests, making it easier to ensure that code has passed through specific rule requirements. You can now create an approval rule specifically for a pull request, or create approval rule templates to be applied to all future pull requests in a repository.

AWS CodeBuild added beta support for test reporting. With test reporting, you can now view the detailed results, trends, and history for tests executed on CodeBuild for any framework that supports the JUnit XML or Cucumber JSON test format.

Screen capture of AWS CodeBuild

CodeBuild test trends in the AWS Management Console

Amazon CodeGuru

AWS announced a preview of Amazon CodeGuru at re:Invent 2019. CodeGuru is a machine learning based service that makes code reviews more effective and aids developers in writing code that is more secure, performant, and consistent.

AWS Amplify and AWS AppSync

AWS Amplify added iOS and Android as supported platforms. Now developers can build iOS and Android applications using the Amplify Framework with the same category-based programming model that they use for JavaScript apps.

Screen capture of 'amplify init' for an iOS application in a terminal window

The Amplify team has also improved offline data access and synchronization by announcing Amplify DataStore. Developers can now create applications that allow users to continue to access and modify data, without an internet connection. Upon connection, the data synchronizes transparently with the cloud.

For a summary of Amplify and AppSync announcements before re:Invent, read: “A round up of the recent pre-re:Invent 2019 AWS Amplify Launches”.

Illustration of AWS AppSync integrations with other AWS services

Q4 serverless content

Blog posts

October

November

December

Tech talks

We hold several AWS Online Tech Talks covering serverless tech talks throughout the year. These are listed in the Serverless section of the AWS Online Tech Talks page.

Here are the ones from Q4:

Twitch

October

There are also a number of other helpful video series covering Serverless available on the AWS Twitch Channel.

AWS Serverless Heroes

We are excited to welcome some new AWS Serverless Heroes to help grow the serverless community. We look forward to some amazing content to help you with your serverless journey.

AWS Serverless Application Repository (SAR) Apps

In this edition of ICYMI, we are introducing a section devoted to SAR apps written by the AWS Serverless Developer Advocacy team. You can run these applications and review their source code to learn more about serverless and to see examples of suggested practices.

Still looking for more?

The Serverless landing page has much more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials. We’re also kicking off a fresh series of Tech Talks in 2020 with new content providing greater detail on everything new coming out of AWS for serverless application developers.

Throughout 2020, the AWS Serverless Developer Advocates are crossing the globe to tell you more about serverless, and to hear more about what you need. Follow this blog to keep up on new launches and announcements, best practices, and examples of serverless applications in action.

You can also follow all of us on Twitter to see latest news, follow conversations, and interact with the team.

Chris Munns: @chrismunns
Eric Johnson: @edjgeek
James Beswick: @jbesw
Moheeb Zara: @virgilvox
Ben Smith: @benjamin_l_s
Rob Sutter: @rts_rob
Julian Wood: @julian_wood

Happy coding!

Top 10 Architecture Blog Posts of 2019

Post Syndicated from Annik Stahl original https://aws.amazon.com/blogs/architecture/top-10-architecture-blog-posts-of-2019/

As we wind our way toward 2020, I want to take a moment to first thank you, our readers, for spending time on our blog. We grew our audience quite a bit this year and the credit goes to our hard-working Solutions Architects and other blog post writers. Below are the top 10 Architecture blog posts written in 2019.

#10: How to Architect APIs for Scale and Security

by George Mao

George Mao, a Specialist Solutions Architect at AWS, focuses on serverless computing and has FIVE posts in the top ten this year. Way to go, George!

This post was the first in a series that focused on best practices and concepts you should be familiar with when you architect APIs for your applications.

Read George’s post.

#9: From One to Many: Evolving VPC Guidance

by Androski Spicer

Since its inception, the Amazon Virtual Private Cloud (VPC) has acted as the embodiment of security and privacy for customers who are looking to run their applications in a controlled, private, secure, and isolated environment.

This logically isolated space has evolved, and in its evolution has increased the avenues that customers can take to create and manage multi-tenant environments with multiple integration points for access to resources on-premises.

Read Androski’s post.

#8: Things to Consider When You Build REST APIs with Amazon API Gateway

by George Mao

REST API 2

This post dives deeper into the things an API architect or developer should consider when building REST APIs with Amazon API Gateway.

Read George’s post.

#7: How to Design Your Serverless Apps for Massive Scale

by George Mao

Serverless at scale-1

Serverless is one of the hottest design patterns in the cloud today, allowing you to focus on building and innovating, rather than worrying about the heavy lifting of server and OS operations. In this series of posts, we’ll discuss topics that you should consider when designing your serverless architectures. First, we’ll look at architectural patterns designed to achieve massive scale with serverless.

Read George’s post.

#6: Best Practices for Developing on AWS Lambda

by George Mao

RDS instance: When to VPC enable a Lambda function

One of the benefits of using Lambda, is that you don’t have to worry about server and infrastructure management. This means AWS will handle the heavy lifting needed to execute your AWS Lambda functions. Take advantage of this architecture with the tips in this post.

Read George’s post.

#5: Stream Amazon CloudWatch Logs to a Centralized Account for Audit and Analysis

by David Bailey

Figure 1 - Initial Landing Zone logging account resources

A key component of enterprise multi-account environments is logging. Centralized logging provides a single point of access to all salient logs generated across accounts and regions, and is critical for auditing, security and compliance. While some customers use the built-in ability to push Amazon CloudWatch Logs directly into Amazon Elasticsearch Service for analysis, others would prefer to move all logs into a centralized Amazon Simple Storage Service (Amazon S3) bucket location for access by several custom and third-party tools. In this blog post, David Bailey will show you how to forward existing and any new CloudWatch Logs log groups created in the future to a cross-account centralized logging Amazon S3 bucket.

Read David’s post.

#4: Updates to Serverless Architectural Patterns and Best Practices

by Drew Dennis

Drew wrote this post at about the halfway point between re:Invent 2018 and re:Invent 2019, where he revisited some of the recent serverless announcements we’ve made. These are all complementary to the patterns discussed in the re:Invent architecture track’s Serverless Architectural Patterns and Best Practices session.

Read Drew’s post.

#3: Understanding the Different Ways to Invoke Lambda Functions

by George Mao

Invoking Lambda

In George’s first post of this series (#7 on this list), he talked about general design patterns to enable massive scale with serverless applications. In this post, he’ll review the different ways you can invoke Lambda functions and what you should be aware of with each invocation model.

Read George’s post.

#2: Using API Gateway as a Single Entry Point for Web Applications and API Microservices

by Anandprasanna Gaitonde and Mohit Malik

In this post, Anand and Mohit talk about a reference architecture that allows API Gateway to act as single entry point for external-facing, API-based microservices and web applications across multiple external customers by leveraging a different subdomain for each one.

Read Anand’s and Mohit’s post.

#1: 10 Things Serverless Architects Should Know

by Justin Pirtle

Building on the first three parts of the AWS Lambda scaling and best practices series where you learned how to design serverless apps for massive scale, AWS Lambda’s different invocation models, and best practices for developing with AWS Lambda, Justin invited you to take your serverless knowledge to the next level by reviewing 10 topics to deepen your serverless skills.

Read Justin’s post.

Thank You

Thanks again to all our readers and blog post writers. We look forward to learning and building amazing things together in the coming year.

Best of 2019

Multi-branch CodePipeline strategy with event-driven architecture

Post Syndicated from Henrique Bueno original https://aws.amazon.com/blogs/devops/multi-branch-codepipeline-strategy-with-event-driven-architecture/

Henrique Bueno, DevOps Consultant, Professional Services

This blog post presents a solution for automated pipeline creation in AWS CodePipeline when a new branch is created in an AWS CodeCommit repository. A use case for this solution is when a GitFlow approach using CodePipeline is required. The strategy presented here is used by AWS customers to enable the use of GitFlow using only AWS tools.

CodePipeline is a fully managed continuous delivery service that helps you automate your release pipelines for fast and reliable application and infrastructure updates. CodePipeline automates the build, test, and deploy phases of your release process every time there is a code change, based on the release model you define.

CodeCommit is a fully managed source control service that hosts secure Git-based repositories. It makes it easy for teams to collaborate on code in a secure and highly scalable ecosystem.

GitFlow is a branching model designed around the project release. This provides a robust framework for managing larger projects. Gitflow is ideally suited for projects that have a scheduled release cycle.

Applicability

When using CodePipeline to orchestrate pipelines and CodeCommit as a code source, in addition to setting a repository, you must also set which branch will trigger the pipeline. This configuration works perfectly for the trunk-based strategy, in which you have only one main branch and all the developers interact with this single branch. However, when you need to work with a multi-branching strategy like GitFlow, the requirement to set a pipeline for each branch brings additional challenges.

It’s important to note that trunk-based development is, by far, the best strategy for taking full advantage of a DevOps approach; this is the branching strategy that AWS recommends to its customers. On the other hand, many customers prefer to work with multiple branches and believe the benefits justify the effort and complexity of dealing with branch merges. This solution is for those customers.

Solution Overview

One of the great benefits of working with Infrastructure as Code is the ability to create multiple identical environments through a single template. This example uses AWS CloudFormation templates to provision pipelines and other necessary resources, as shown in the following diagram.

Multi-branch CodePipeline strategy with event-driven architecture

Multi-branch CodePipeline strategy with event-driven architecture

The template is hosted in an Amazon S3 bucket. An AWS Lambda function deploys a new AWS CloudFormation stack based on this template. This Lambda function is triggered by an Amazon CloudWatch Events rule that watches for events from the CodeCommit repository.

The CreatePipeline Events rule

The AWS CloudFormation snippet that creates the Events rule follows. The Events rule monitors branch create and delete events in all repositories, triggering the CreatePipeline Lambda function.

#----------------------------------------------------------------------#
# EventRule to trigger LambdaPipeline lambda
#----------------------------------------------------------------------#
  CreatePipelineRule:
    Type: AWS::Events::Rule
    Properties: 
      Description: "EventRule"
      EventPattern:
        source:
          - aws.codecommit
        detail-type:
          - 'CodeCommit Repository State Change'
        detail:
          event:
              - referenceDeleted
              - referenceCreated
          referenceType:
            - branch
      State: ENABLED
      Targets: 
      - Arn: !GetAtt CreatePipeline.Arn
        Id: CreatePipeline

 

The CreatePipeline Lambda function

The Lambda function receives the event details, parses the variables, and executes the appropriate actions. If the event is referenceCreated, then the stack is created; otherwise the stack is deleted. The name of the stack that is created or deleted is the concatenation of the repository name and the new branch name. This is a very simple function.

#----------------------------------------------------------------------#
# Lambda for Stack Creation
#----------------------------------------------------------------------#
import boto3
def lambda_handler(event, context):
    Region = event['region']
    Account = event['account']
    RepositoryName = event['detail']['repositoryName']
    NewBranch = event['detail']['referenceName']
    Event = event['detail']['event']
    if NewBranch == "master":
        quit()
    if Event == "referenceCreated":
        cf_client = boto3.client('cloudformation')
        cf_client.create_stack(
            StackName=f'Pipeline-{RepositoryName}-{NewBranch}',
            TemplateURL=f'https://s3.amazonaws.com/{Account}-templates/TemplatePipeline.yaml',
            Parameters=[
                {
                    'ParameterKey': 'RepositoryName',
                    'ParameterValue': RepositoryName,
                    'UsePreviousValue': False
                },
                {
                    'ParameterKey': 'BranchName',
                    'ParameterValue': NewBranch,
                    'UsePreviousValue': False
                }
            ],
            OnFailure='ROLLBACK',
            Capabilities=['CAPABILITY_NAMED_IAM']
        )
    else:
        cf_client = boto3.client('cloudformation')
        cf_client.delete_stack(
            StackName=f'Pipeline-{RepositoryName}-{NewBranch}'
        )

The logic for creating only the CI pipeline or the full CI+CD pipeline lives in the AWS CloudFormation template. The Conditions section of the template analyzes the new branch name.

Conditions: 
    BranchMaster: !Equals [ !Ref BranchName, "master" ]
    BranchDevelop: !Equals [ !Ref BranchName, "develop"]
    Setup: !Equals [ !Ref Setup, true ]
  • If the new branch is named master, then a stack will be created containing CI+CD pipelines, with deploy stages in the homologation and production environments.
  • If the new branch is named develop, then a stack will be created containing CI+CD pipelines, with a deploy stage in the Dev environment.
  • If the new branch has any other name, then the stack will be created with only a CI pipeline.

Since the purpose of this blog post is to present only a sample of automated pipeline creation, the pipelines used here are examples only: they don’t deploy to any environment.

 

Applicability

This event-driven strategy permits pipelines to be created or deleted along with the branches. Since the entire environment is created using Infrastructure as Code and the template is the same for all pipelines, there is no possibility of different configuration issues between environments and pipeline stages.

A GitFlow simulation could resemble that shown in the following diagram:

GitFlow Diagram

GitFlow Diagram

  1. First, a CodeCommit repository is created, along with the master branch and its respective pipeline (CI+CD).
  2. The developer creates a branch called develop based on the master branch. The pipeline (CI+CD at Dev) is automatically created.
  3. The developer creates a feature-branch called feature-a based on the develop branch. The CI pipeline for this branch is automatically created.
  4. The developer creates a Pull Request from the feature-a branch to the develop branch. As soon as the Pull Request is accepted and merged and the feature-a branch is deleted, its pipeline is automatically deleted.

The same process can be followed for the release branch and hotfix branch. Once the branch is created, a new pipeline is created for it which follows its branch lifecycle.

Pipelines Stacks

 

Implementation

Before you start, make sure that the AWS CLI is installed and configured on your machine. Then follow these steps:

  1. Clone the repository.
  2. Create the prerequisites stack.
  3. Copy the AWS CloudFormation template to Amazon S3.
  4. Copy the seed.zip file to the Amazon S3 bucket.
  5. Create the first repository and its pipeline.
  6. Create the develop branch.
  7. Create the first feature branch.
  8. Create the first Pull Request.
  9. Execute the Pull Request approval.
  10. Cleanup.

 

1. Clone the repository

Clone the repository with the sample code.

The main files are:

  • Setup.yaml: an AWS CloudFormation template for creating pipeline prerequisites.
  • TemplatePipeline.yaml: an AWS CloudFormation template for pipeline creation.
  • seed/buildspec/CIAction.yaml: a configuration file for an AWS CodeBuild project at the CI stage.
  • seed/buildspec/CDAction.yaml: a configuration file for a CodeBuild project at the CD stage.
# Command to clone the repository
git clone https://github.com/aws-samples/aws-codepipeline-multi-branch-strategy.git
cd aws-codepipeline-multi-branch-strategy

 

2. Create the prerequisite stack

The Setup stack creates the resources that are prerequisites for pipeline creation, as shown in the following chart.

These resources are created only once and they fit all the pipelines created in this example.

Resources

# Command to create Setup stack
aws cloudformation deploy --stack-name Setup-Pipeline \
--template-file Setup.yaml --region us-east-1 --capabilities CAPABILITY_NAMED_IAM

 

3. Copy the AWS CloudFormation template to Amazon S3

For the Lambda function to deploy a new pipeline stack, it needs to get the AWS CloudFormation template from somewhere. To enable it to do so, you need to save the template inside the Amazon S3 bucket that you just created at the Setup stack.

# Command that copy Template to S3 Bucket
aws s3 cp TemplatePipeline.yaml s3://"$(aws sts get-caller-identity --query Account --output text)"-templates/ --acl private

 

4. Copy the seed.zip file to the Amazon S3 bucket

CodeCommit permits you to populate a repository at the moment of its creation as a first commit. The content of this first commit can be saved in a .zip file in an Amazon S3 bucket. Use this CodeCommit option to populate your repository with BuildSpec files for CodeBuild.

# Command to create zip file with the Buildspec folder content.
zip -r seed.zip buildspec

# Command that copy seed.zip file to S3 Bucket.
aws s3 cp seed.zip s3://"$(aws sts get-caller-identity --query Account --output text)"-templates/ --acl private

 

5. Create the first repository and its pipeline

Now that the Setup stack is created and the seed file is stored in an Amazon S3 bucket, create the first CodeCommit repository. Every time that you want to create a new repository, execute the command below to create a new stack.

# Command to create the stack with the CodeCommit repository,
# CodeBuild Projects and the Pipeline for the master branch.
# Note: Change "myapp" by the name you want.

RepoName="myapp"
aws cloudformation deploy --stack-name Repo-$RepoName --template-file TemplatePipeline.yaml \
--parameter-overrides RepositoryName=$RepoName Setup=true \
--region us-east-1 --capabilities CAPABILITY_NAMED_IAM

When the stack is created, in addition to the CodeCommit repository, the CodeBuild projects and the master branch pipeline are also created. By default, a CodeCommit repository is created empty, with no branch. When the repository is populated with the seed.zip file, the master branch is created.

Resources

Access the CodeCommit repository to see the seed files on the master branch. Access the CodePipeline console to see that there’s a new pipeline with the same name as the repository. This pipeline contains the CI+CD stages (homolog and prod).

 

6. Create the develop branch

To simulate a real development scenario, create a new branch called develop based on the master branch. In the GitFlow concept these two (master and develop) branches are fixed and never deleted.

When this new branch is created, the Events rule identifies that there’s a change on this repository and triggers the CreatePipeline Lambda function to create a new pipeline for this branch. Access the CodePipeline console to see that there’s a new pipeline with the name of the repository plus the branch name. This pipeline contains the CI+CD stages (Dev).

# Configure Git Credentials using AWS CLI Credential Helper
mkdir myapp
cd myapp
git config --global credential.helper '!aws codecommit credential-helper $@'
git config --global credential.UseHttpPath true

# Clone the CodeCommit repository
# You can get the URL in the CodeCommit Console
git clone https://git-codecommit.us-east-1.amazonaws.com/v1/repos/myapp .

# Create the develop branch
# For more details: https://docs.aws.amazon.com/codecommit/latest/userguide/how-to-create-branch.html
git checkout -b develop
git push origin develop

 

7. Create the first feature branch

Now that there are two main and fixed branches (master and develop), you can create a feature branch. In the GitFlow concept, feature branches have a short lifetime and are frequently merged to the develop branch. This type of branch only exists during the development period. When the feature development is finished, it is merged to the develop branch and the feature branch is deleted.

# Create the feature-branch branch
# make sure that you are on the develop branch
git checkout -b feature-abc
git push origin feature-abc

Just as with the develop branch, when this new branch is created, the Events rule triggers the CreatePipeline Lambda function to create a new pipeline for this branch. Access the CodePipeline console to see that there’s a new pipeline with the name of the repository plus the branch name. This pipeline contains only the CI stage, without a CD stage.

Pipelines-2

 

8. Create the first Pull Request

It’s possible to simulate the end of a feature development, when the branch is ready to be merged with the develop branch. Keep in mind that the feature branch has a short lifecycle:  when the merge is done, the feature branch is deleted along with its pipeline.

# Create the Pull Request
aws codecommit create-pull-request --title "My Pull Request" \
--description "Please review these changes by Tuesday" \
--targets repositoryName=$RepoName,sourceReference=feature-abc,destinationReference=develop \
--region us-east-1

 

9. Execute the Pull Request approval

To merge the feature branch to the develop branch, the Pull Request needs to be approved. In a real scenario, a teammate should do a peer review before approval.

# Accept the Pull Request
# You can get the Pull-Request-ID in the json output of the create-pull-request command
aws codecommit merge-pull-request-by-fast-forward --pull-request-id <PULL_REQUEST_ID_FROM_PREVIOUS_COMMAND> \
--repository-name $RepoName --region us-east-1

# Delete the feature-branch
aws codecommit delete-branch --repository-name $RepoName --branch-name feature-abc --region us-east-1

The new code is integrated with the develop branch. If there’s a conflict, it needs to be solved. After that, the feature branch is deleted together with its pipeline. The Events rule triggers the CreatePipeline Lambda function to delete the pipeline for that branch. Access the CodePipeline console to see that the pipeline for the feature branch is deleted.

Pipelines

 

10. Cleanup

To remove the resources created as part of this blog post, follow these steps:

Delete Pipeline Stacks

# Delete all the pipeline Stacks created by CreatePipeline Lambda
Pipelines=$(aws cloudformation list-stacks --stack-status-filter CREATE_COMPLETE --region us-east-1 --query 'StackSummaries[? StackStatus==`CREATE_COMPLETE` && starts_with(StackName, `Pipeline`) == `true`].[StackName]' --output text)
while read -r Pipeline rest; do aws cloudformation delete-stack --stack-name $Pipeline --region us-east-1 ; done <<< $Pipelines

Delete Repository Stacks

# Delete all the Repository stacks created by Step 5.
Repos=$(aws cloudformation list-stacks --stack-status-filter CREATE_COMPLETE --region us-east-1 --query 'StackSummaries[? StackStatus==`CREATE_COMPLETE` && starts_with(StackName, `Repo-`) == `true`].[StackName]' --output text)
while read -r Repo rest; do aws cloudformation delete-stack --stack-name $Repo --region us-east-1 ; done <<< $Repos

Delete Setup-Pipeline Stack

# Cleaning Bucket before Stack deletion 
aws s3 rm s3://"$(aws sts get-caller-identity --query Account --output text)"-templates --recursive

# Delete Setup Stack
aws cloudformation delete-stack --stack-name Setup-Pipeline --region us-east-1 

 

Conclusion

This blog post discussed how you can work with event-driven strategy and Infrastructure as Code to implement a multi-branch pipeline flow using CodePipeline. It demonstrated how an Events rule and Lambda function can be used to fully orchestrate the creation and deletion of pipelines.

 

About the Author

Henrique Bueno

Henrique Bueno

Henrique is a DevOps Consultant in the Brazilian Professional Services Team at Amazon Web Services. He has helped AWS customers design, build and deploy cloud native applications by following the Twelve-Factor App methodology.

ICYMI: Serverless pre:Invent 2019

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/icymi-serverless-preinvent-2019/

With Contributions from Chris Munns – Sr Manager – Developer Advocacy – AWS Serverless

The last two weeks have been a frenzy of AWS service and feature launches, building up to AWS re:Invent 2019. Because so much has been announced, we thought we'd ship an ICYMI post summarizing the serverless-specific features. We've also dropped in some announcements from services that are commonly used in serverless application architectures or development.

AWS re:Invent

AWS re:Invent 2019

We also want you to know that we’ll be talking about many of these features (as well as those coming) in sessions at re:Invent.

Here’s what’s new!

AWS Lambda

On September 3, AWS Lambda started rolling out a major improvement to how AWS Lambda functions work with your Amazon VPC networks. This change brings both scale and performance improvements, and addresses several of the limitations of the previous networking model with VPCs.

On November 25, Lambda announced that the rollout of this new capability has completed in 6 additional regions including US East (Virginia) and US West (Oregon).

New VPC to VPC NAT for Lambda functions

New VPC to VPC NAT for Lambda functions

On November 18, Lambda announced three new runtime updates. Lambda now supports Node.js 12, Java 11, and Python 3.8. Each of these new runtimes has new language features and benefits so be sure to check out the linked release posts. These new runtimes are all based on an Amazon Linux 2 execution environment.

Lambda has released a number of controls for both stream and async based invocations:

  • For Lambda functions consuming events from Amazon Kinesis Data Streams or Amazon DynamoDB Streams, it’s now possible to limit the retry count, limit the age of records being retried, configure a failure destination, or split a batch to isolate a problem record. These capabilities will help you deal with potential “poison pill” records that would previously cause streams to pause in processing.
  • For asynchronous Lambda invocations, you can now set the maximum event age and retry attempts on the event. If either configured condition is met, the event can be routed to a dead letter queue (DLQ), Lambda destination, or it can be discarded.

In addition to the above controls, Lambda Destinations is a new feature that allows developers to designate an asynchronous target for Lambda function invocation results. You can set one destination for success and another for failure. This unlocks useful patterns for distributed event-based applications and removes the custom code previously needed to send function results to a destination manually.

Lambda Destinations

Lambda Destinations

Lambda also now supports setting a Parallelization Factor, which allows you to set multiple Lambda invocations per shard for Amazon Kinesis Data Streams and Amazon DynamoDB Streams. This allows for faster processing without the need to increase your shard count, while still guaranteeing the order of records processed.

Lambda Parallelization Factor diagram

Lambda Parallelization Factor diagram

Lambda now supports Amazon SQS FIFO queues as an event source. FIFO queues guarantee the order of record processing compared to standard queues which are unordered. FIFO queues support messaging batching via a MessageGroupID attribute which allows for parallel Lambda consumers of a single FIFO queue. This allows for high throughput of record processing by Lambda.

Lambda now supports Environment Variables in AWS China (Beijing) Region, operated by Sinnet and the AWS China (Ningxia) Region, operated by NWCD.

Lastly, you can now view percentile statistics for the duration metric of your Lambda functions. Percentile statistics tell you the relative standing of a value in a dataset, and are useful when applied to metrics that exhibit large variances. They can help you understand the distribution of a metric, spot outliers, and find hard-to-spot situations that create a poor customer experience for a subset of your users.

AWS SAM CLI

AWS SAM CLI deploy command

The SAM CLI team simplified the bucket management and deployment process in the SAM CLI. You no longer need to manage a bucket for deployment artifacts – SAM CLI handles this for you. The deployment process has also been streamlined from multiple flagged commands to a single command, sam deploy.

AWS Step Functions

One of the powerful features of Step Functions is its ability to integrate directly with AWS services without you needing to write complicated application code. Step Functions has expanded its integration with Amazon SageMaker to simplify machine learning workflows, and added a new integration with Amazon EMR, making it faster to build and easier to monitor EMR big data processing workflows.

Step Functions step with EMR

Step Functions now provides the ability to track state transition usage by integrating with AWS Budgets, allowing you to monitor and react to usage and spending trends on your AWS accounts.

You can now view CloudWatch Metrics for Step Functions at a one-minute frequency. This makes it easier to set up detailed monitoring for your workflows. You can use one-minute metrics to set up CloudWatch Alarms based on your Step Functions API usage, Lambda functions, service integrations, and execution details.
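
For example, a one-minute alarm on failed executions might be configured with boto3 roughly as follows; the state machine and SNS topic ARNs are placeholders:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder state machine and SNS topic ARNs.
cloudwatch.put_metric_alarm(
    AlarmName="order-workflow-failures",
    Namespace="AWS/States",
    MetricName="ExecutionsFailed",
    Dimensions=[{
        "Name": "StateMachineArn",
        "Value": "arn:aws:states:us-east-1:123456789012:stateMachine:OrderWorkflow",
    }],
    Statistic="Sum",
    Period=60,                       # one-minute granularity
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:workflow-alerts"],
)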

AWS Step Functions now supports higher throughput workflows, making it easier to coordinate applications with high event rates.

In US East (N. Virginia), US West (Oregon), and EU (Ireland), throughput has increased from 1,000 state transitions per second to 1,500 state transitions per second with bucket capacity of 5,000 state transitions. The default start rate for state machine executions has also increased from 200 per second to 300 per second, with bucket capacity of up to 1,300 starts in these regions.

In all other regions, throughput has increased from 400 state transitions per second to 500 state transitions per second with bucket capacity of 800 state transitions. The default start rate for AWS Step Functions state machine executions has also increased from 25 per second to 150 per second, with bucket capacity of up to 800 state machine executions.

Amazon SNS

Amazon SNS now supports the use of dead letter queues (DLQs) to help capture unhandled events. By enabling a DLQ, you can catch events that are not processed and re-submit them, or analyze them to locate processing issues.
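
A minimal sketch of enabling a DLQ on an existing subscription with boto3; the subscription and queue ARNs are placeholders, and the queue's access policy must also allow the topic to send messages to it:

import json
import boto3

sns = boto3.client("sns")

# Placeholder subscription ARN and SQS dead letter queue ARN.
sns.set_subscription_attributes(
    SubscriptionArn="arn:aws:sns:us-east-1:123456789012:orders:1a2b3c4d-5678-90ab-cdef-111111111111",
    AttributeName="RedrivePolicy",
    AttributeValue=json.dumps({
        "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:orders-dlq"
    }),
)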

Amazon CloudWatch

CloudWatch announced Amazon CloudWatch ServiceLens to provide a “single pane of glass” to observe health, performance, and availability of your application.

CloudWatch ServiceLens

CloudWatch also announced a preview of a capability called Synthetics. CloudWatch Synthetics allows you to test your application endpoints and URLs using configurable scripts that mimic what a real customer would do. This enables the outside-in view of your customers’ experiences, and your service’s availability from their point of view.

On November 18, CloudWatch launched Embedded Metric Format to help you ingest complex, high-cardinality application data in the form of logs and easily generate actionable metrics from them. You can publish these metrics from your Lambda function by using the PutLogEvents API, or, for Node.js or Python based applications, by using an open source client library.
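
To make the format concrete, here is a hedged sketch of a Lambda handler that writes a single Embedded Metric Format record to its log; the namespace, dimension, and metric names are invented for illustration:

import json
import time

def handler(event, context):
    # Emit one structured log line in CloudWatch Embedded Metric Format;
    # CloudWatch extracts "ProcessingLatency" as a metric automatically.
    emf_record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "MyApplication",          # example namespace
                "Dimensions": [["ServiceName"]],
                "Metrics": [{"Name": "ProcessingLatency", "Unit": "Milliseconds"}],
            }],
        },
        "ServiceName": "order-service",                # dimension value
        "ProcessingLatency": 142,                      # metric value
        "RequestId": context.aws_request_id,           # high-cardinality field kept in the log only
    }
    print(json.dumps(emf_record))
    return {"statusCode": 200}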

Lastly, CloudWatch announced a preview of Contributor Insights, a capability to identify who or what is impacting your system or application performance by identifying outliers or patterns in log data.

AWS X-Ray

X-Ray announced trace maps, which enable you to map the end-to-end path of a single request. Indicators show issues and how they affect other services in the request's path. These can help you to identify and isolate service points that are causing degradation or failures.

X-Ray also announced support for Amazon CloudWatch Synthetics, currently in preview. X-Ray supports tracing canary scripts throughout the application, providing metrics on performance and application issues.

X-Ray Service map with CloudWatch Synthetics

Amazon DynamoDB

DynamoDB announced support for customer managed customer master keys (CMKs) to encrypt data in DynamoDB. This allows you to bring your own key (BYOK), giving you full control over how you encrypt and manage the security of your DynamoDB data.

It is now possible to add global replicas to existing DynamoDB tables to provide enhanced availability across the globe.
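
As a rough boto3 sketch of both capabilities above; the table name, key alias, and Region are placeholders, and adding a replica this way assumes the table uses the newer global tables version:

import boto3

dynamodb = boto3.client("dynamodb")

# Encrypt a new table with a customer managed CMK (key alias is a placeholder).
dynamodb.create_table(
    TableName="orders",
    AttributeDefinitions=[{"AttributeName": "pk", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "pk", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
    SSESpecification={"Enabled": True, "SSEType": "KMS", "KMSMasterKeyId": "alias/my-dynamodb-key"},
)

# Wait for the table to become active before modifying it.
dynamodb.get_waiter("table_exists").wait(TableName="orders")

# Later, add a replica in another Region to the existing table.
dynamodb.update_table(
    TableName="orders",
    ReplicaUpdates=[{"Create": {"RegionName": "eu-west-1"}}],
)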

Currently in preview is another new DynamoDB capability that identifies frequently accessed keys and database traffic trends. With it, you can more easily identify “hot keys” and understand usage of your DynamoDB tables.

CloudWatch Contributor Insights for DynamoDB

Last but far from least for DynamoDB is adaptive capacity, a feature that helps you handle imbalanced workloads by automatically isolating frequently accessed items and shifting data across partitions to rebalance them. This reduces cost by enabling you to provision throughput for a more balanced workload, rather than over-provisioning for uneven data access patterns.

AWS Serverless Application Repository

The AWS Serverless Application Repository (SAR) now offers Verified Author badges. These badges enable consumers to quickly and reliably know who you are. The badge will appear next to your name in the SAR and will deep-link to your GitHub profile.

SAR Verified developer badges

AWS Code Services

AWS CodeCommit launched the ability for you to enforce rule workflows for pull requests, making it easier to ensure that code has passed specific rule requirements. You can now create an approval rule specifically for a pull request, or create approval rule templates to be applied to all future pull requests in a repository.
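
A hedged sketch of creating and attaching an approval rule template with boto3; the template name, repository name, and approver ARN are placeholders, and the rule content schema shown is an assumption worth checking against the CodeCommit documentation:

import json
import boto3

codecommit = boto3.client("codecommit")

# Placeholder approver pool; the content format below follows the 2018-11-08 schema.
rule_content = json.dumps({
    "Version": "2018-11-08",
    "DestinationReferences": ["refs/heads/master"],
    "Statements": [{
        "Type": "Approvers",
        "NumberOfApprovalsNeeded": 2,
        "ApprovalPoolMembers": ["arn:aws:sts::123456789012:assumed-role/CodeCommitReview/*"],
    }],
})

codecommit.create_approval_rule_template(
    approvalRuleTemplateName="two-reviewers-on-master",
    approvalRuleTemplateDescription="Require two approvals for pull requests into master",
    approvalRuleTemplateContent=rule_content,
)

codecommit.associate_approval_rule_template_with_repository(
    approvalRuleTemplateName="two-reviewers-on-master",
    repositoryName="my-repository",
)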

AWS CodeBuild added beta support for test reporting. With test reporting, you can now view the detailed results, trends, and history for tests executed on CodeBuild for any framework that supports the JUnit XML or Cucumber JSON test format.

CodeBuild test trends

AWS Amplify and AWS AppSync

Instead of trying to summarize all the awesome things that our peers over in the Amplify and AppSync teams have done recently we’ll instead link you to their own recent summary: “A round up of the recent pre-re:Invent 2019 AWS Amplify Launches”.

AWS AppSync

Still looking for more?

We only covered a small bit of all the awesome new things that were recently announced. Keep your eyes peeled for more exciting announcements next week during re:Invent and for a future ICYMI Serverless Q4 roundup. We’ll also be kicking off a fresh series of Tech Talks in 2020 with new content helping to dive deeper on everything new coming out of AWS for serverless application developers.

Visualize and Monitor Highly Distributed Applications with Amazon CloudWatch ServiceLens

Post Syndicated from Steve Roberts original https://aws.amazon.com/blogs/aws/visualize-and-monitor-highly-distributed-applications-with-amazon-cloudwatch-servicelens/

Increasingly distributed applications, with thousands of metrics and terabytes of logs, can be a challenge to visualize and monitor. Gaining end-to-end insight into applications and their dependencies, so that you can rapidly pinpoint performance bottlenecks, operational issues, and customer impact, quite often requires multiple dedicated tools, each presenting its own particular facet of information. This in turn leads to more complex data ingestion, manual stitching together of the various insights to determine overall performance, and increased costs from maintaining multiple solutions.

Amazon CloudWatch ServiceLens, announced today, is a new fully managed observability solution to help with visualizing and analyzing the health, performance, and availability of highly distributed applications, including those with dependencies on serverless and container-based technologies, all in one place. By enabling you to easily isolate endpoints and resources that are experiencing issues, together with analysis of correlated metrics, logs, and application traces, CloudWatch ServiceLens helps reduce Mean Time to Resolution (MTTR) by consolidating all of this data in a single place using a service map. From this map you can understand the relationships and dependencies within your applications, and dive deep into the various logs, metrics, and traces from a single tool to help you quickly isolate faults. Crucial time spent correlating metrics, logs, and trace data from across various tools is saved, thus reducing any downtime incurred by end users.

Getting Started with Amazon CloudWatch ServiceLens
Let’s see how we can take advantage of Amazon CloudWatch ServiceLens to diagnose the root cause of an alarm triggered from an application. My sample application reads and writes transaction data to an Amazon DynamoDB table using AWS Lambda functions. An Amazon API Gateway is my application’s front-end, with resources for GET and PUT requests, directing traffic to the corresponding GET and PUT Lambda functions. The API Gateway resources and the Lambda functions have AWS X-Ray tracing enabled, and the API calls to DynamoDB from within the Lambda functions are wrapped using the AWS X-Ray SDK. You can read more about how to instrument your code, and work with AWS X-Ray, in the Developer Guide.
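
As a rough illustration of that instrumentation (not the exact code from this walkthrough), a PUT Lambda function might wrap its DynamoDB calls with the X-Ray SDK for Python roughly as follows; active tracing is assumed to be enabled on the function, and the table name is a placeholder:

import boto3
from aws_xray_sdk.core import patch_all

# Patch boto3 so that calls to DynamoDB appear as subsegments in the trace.
patch_all()

# Placeholder table name for illustration.
table = boto3.resource("dynamodb").Table("transactions")

def handler(event, context):
    # With active tracing enabled on the function, this PutItem call is
    # captured by X-Ray and surfaced on the ServiceLens service map.
    table.put_item(Item={"id": event["id"], "amount": event["amount"]})
    return {"statusCode": 200}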

An error condition has triggered an alarm for my application, so my first stop is the Amazon CloudWatch Console, where I click the Alarm link. I can see that there is an availability issue with one or more of my API Gateway resources.

Let’s drill down to see what might be going on. First I want to get an overview of the running application, so I click Service Map under ServiceLens in the left-hand navigation panel. The map displays nodes representing the distributed resources in my application. The relative size of the nodes represents the amount of request traffic that each is receiving, as does the thickness of the links between them. I can toggle the map between Requests mode and Latency mode. Using the same drop-down I can also toggle the option to change the relative sizing of the nodes. The data shown for Requests mode or Latency mode helps me isolate nodes that I need to triage first. Clicking View connections can also aid in the process, since it helps me understand incoming and outgoing calls, and their impact on the individual nodes.

I’ve closed the map legend in the screenshot so we can get a good look at all the nodes; for reference, here it is below.

From the map I can immediately see that my front-end gateway is the source of the triggered alarm. The red indicator on the node is showing me that there are 5xx faults associated with the resource, and the length of the indicator relative to the circumference gives me some idea of how many requests are faulting compared to successful requests. Secondly, I can see that the Lambda functions that are handling PUT requests through the API are showing 4xx errors. Finally I can see a purple indicator on the DynamoDB table, indicating throttling is occurring. At this point I have a pretty good idea of what might be happening, but let’s dig a little deeper to see what CloudWatch ServiceLens can help me prove.

In addition to Map view, I can also toggle List view. This gives me at-a-glance information on average latency, faults, and requests/min for all nodes. By default it shows the “worst” node first, sorting by fault rate (descending), then by the number of alarms in the alarm state (descending), then by node name (ascending).

Returning to Map view, hovering my mouse over the node representing my API front-end also gives me similar insight into traffic and faulting request percentage, specific to the node.

To see even more data, for any node, clicking it will open a drawer below the map containing graphed data for that resource. In the screenshot below I have clicked on the ApiGatewayPutLambdaFunction node.

The drawer, for each resource, enables me to jump to view logs for the resource (View logs), traces (View traces), or a dashboard (View dashboard). Below, I have clicked View dashboard for the same Lambda function. Scrolling through the data presented for that resource, I note that the duration of execution is not high, while all invocations are erroring in tandem.

Returning to the API front-end that is showing the alarm, I’d like to take a look at the request traces, so I click the node to open the drawer, then click View traces. I already know from the map that 5xx and 4xx status codes are being generated in the code paths selected by PUT requests coming into my application, so I switch Filter type to Status code, then select both the 502 and 504 entries in the list, and finally click Add to filter. The Traces view switches to show me the traces that resulted in those error codes, the response time distribution, and a set of traces.

Ok, now we’re getting close! Clicking the first trace ID, I get a wealth of data including exception messages about that request – more than I can show in a single screenshot! For example, I can see the timelines of each segment that was traced as part of the request.

Scrolling down further, I can view exception details (below this I can also see log messages specific to the trace), and here lies my answer, confirming the throttling indicator that I saw in the original map. I can also see this exception message in the log data specific to the trace, shown at the foot of the page. Previously, I would have had to scan through logs for the entire application to hopefully spot this message; being able to drill down from the map is a significant time saver.

Now I know how to fix the issue and resolve the alarm: increase the write provisioning for the table!

In conjunction with CloudWatch ServiceLens, Amazon CloudWatch has also launched a preview of CloudWatch Synthetics that helps to monitor endpoints using canaries that run 24×7, 365 days a year, so that I am quickly alerted of issues that my customers are facing. These are also visualized on the Service Map and just as I did above, I can drill down to the traces to view transactions that originated from a canary. The faster I can dive deep into a consolidated view of an operational failure or an alarm, the faster I can root cause the issue and help reduce time to resolution and mitigate the customer impact.

Amazon CloudWatch ServiceLens is available now in all commercial AWS Regions.

— Steve

Debugging with Amazon CloudWatch Synthetics and AWS X-Ray

Post Syndicated from Nizar Tyrewalla original https://aws.amazon.com/blogs/devops/debugging-with-amazon-cloudwatch-synthetics-and-aws-x-ray/

Today, AWS X-Ray launches support for Amazon CloudWatch Synthetics, enabling developers to trace end-to-end requests from configurable scripts called “canaries”. These canaries run test scripts that monitor web endpoints and APIs using modular, lightweight tests that run 24×7, once per minute. They continuously capture the behavior and availability of the endpoint or URL being monitored, reporting what end customers are seeing. Customers get alerted immediately in case of failures and can tie them back to the root cause using traces. These canaries provide a complete picture of all the services invoked in the path. With this feature, developers and DevOps engineers can correlate failures in the application with the impact on end customers, determine the root cause of the failures, and identify the upstream and downstream services impacted.

Overview

X-Ray helps developers and DevOps engineers analyze and debug distributed applications, such as applications built with microservice architecture. Using X-Ray, you can understand how your application and its underlying services are performing in order to identify and troubleshoot the root causes of performance issues and errors. X-Ray helps you debug and triage distributed applications, wherever those applications are running, and whether the architecture is serverless, containers, Amazon EC2, on-premises, or a mixture of all of the above.

Amazon CloudWatch Synthetics is a fully managed synthetic monitoring service that allows developers and DevOps engineers to view their application endpoints and URLs using configurable scripts called “canaries” that run 24×7. Canaries alert you as soon as something does not work as expected, as defined by your script. CloudWatch Synthetics canaries can be customized to check for availability, latency, transactions, broken or dead links, step-by-step task completions, page load errors, load latencies for UI assets, complex wizard flows, or checkout flows in your applications.

CloudWatch Synthetics supports End-User Experience Monitoring (EUEM—Synthetic Monitoring), Web Services Monitoring (REST APIs), URL Monitoring, and Website Content Monitoring (protection from unauthorized changes in your websites from phishing, code injection, and cross-site scripting). Canary traffic can continually verify your customers’ experiences even when you don’t have any customer traffic. Canaries can discover issues as soon as your customers do.

Use cases

Here are some of the use cases for which developers and DevOps engineers can use AWS X-Ray with Amazon CloudWatch Synthetics.

Determine if there is a problem

You can determine if there is a problem with your CloudWatch Synthetics canaries by setting CloudWatch alarms based on certain thresholds. You can also determine an increase in errors, faults, throttling rates, or slow response times within your X-Ray Service Map.

Developers can set up thresholds on canaries in the Amazon CloudWatch Synthetics console, which creates CloudWatch alarms to let you know when there is an issue. These thresholds are managed centrally to help you get alerted at the granularity of the test run. You can then use Amazon Simple Notification Service (SNS) or CloudWatch Events for alerts. For example, you can set a notification when a canary fails 5 times in 5 runs if the canary runs once per minute. You can further analyze and triage the HAR file, screenshots, and logs in the CloudWatch Synthetics console.
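
A hedged sketch of that "5 failures in 5 runs" style of alarm with boto3; the canary name and topic ARN are placeholders, and the CloudWatchSynthetics namespace and SuccessPercent metric name are assumptions worth verifying against the metrics your canary actually publishes:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder canary name and SNS topic ARN.
cloudwatch.put_metric_alarm(
    AlarmName="checkout-canary-failing",
    Namespace="CloudWatchSynthetics",    # assumed namespace used by Synthetics
    MetricName="SuccessPercent",         # assumed metric name
    Dimensions=[{"Name": "CanaryName", "Value": "checkout-canary"}],
    Statistic="Average",
    Period=60,                           # the canary runs once per minute
    EvaluationPeriods=5,
    DatapointsToAlarm=5,                 # alarm when all 5 of the last 5 runs miss the threshold
    Threshold=100,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:canary-alerts"],
)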

Enable thresholds in CloudWatch Synthetics

Canary thresholds in Amazon CloudWatch

Screenshots in Amazon CloudWatch Synthetics console

 

 

HAR file in Amazon CloudWatch Synthetics console

Developers can also use the new synthetic canary client nodes, with type Client::Synthetics, in the AWS X-Ray console to pinpoint which synthetic canaries are experiencing issues with errors, faults, or throttling rates for the selected time frame. Select the synthetic canary node to view the response time distribution of the entire request. Zoom into a specific time range and dive deep into analyzing trends using X-Ray Analytics, as shown in the following screenshot of a service map with a synthetic canary client node.

X-Ray Service Map with Synthetics node showing errors

 

 

Find the root cause of ongoing failures

You can find the root cause of ongoing failures in upstream and downstream services to reduce the mean time to resolution.

Developers can view the end-to-end path of requests within the X-Ray service map to easily understand the services invoked, as well as the upstream and downstream services. You can also use trace maps for individual traces to dissect each request and determine which service results in the most latency or is causing an error, as shown in the following diagrams.

 

X-Ray Service Map with end to end path of the request invoked by Synthetic canaries

X-Ray Trace Map to view individual request invoked by Synthetic canaries

Once you receive a CloudWatch alarm for failures in a synthetic canary, it becomes critical to determine the root cause of the issue and reduce the impact on end users. Developers can use X-Ray Analytics to address this. X-Ray performs statistical modeling on trace data to determine the probable root cause of an issue. As you can see in the following screenshots, the synthetic tests for the “www” console running on AWS Elastic Beanstalk are failing because of a throughput capacity exception from the DynamoDB table, which impacts the product microservice. This can be quickly determined in the X-Ray Analytics console. Developers can also determine which of their canary tests are causing this issue using the automatically created X-Ray annotation “Annotation.aws:canary_arn”.

 

Analyze root cause of the issues in Synthetic canaries using X-Ray Analytics

 

Root cause of the issue using X-Ray Analytics

 

Identify performance bottlenecks and trends

You can identify performance bottlenecks and trends seen by your Synthetics canaries’ traces over time.

View trends in the performance of your website or endpoint over time by using the continuous traffic from your synthetic canaries to populate a response time histogram. Understand the p50, p90, and p95 percentiles for requests originating from Synthetics tests, see the proportion of failures, and use the time series activity to determine when issues occurred.

Response time histogram and time series activity for Synthetic canaries

 

Compare latency and error/fault rates and experience of end user with Synthetic canaries

You can compare latency and error/fault rates and end-user experiences with the Synthetic canaries.

Using X-Ray Analytics, developers can compare any two trends or collections of traces. For example, in the diagram below, we are comparing the experience of end users using the system (shown as the blue trend line) to what the Synthetics tests are reporting (shown as the green trend line), which started around 3:00 AM. We can easily see that the Synthetics tests are reporting more 5xx errors than end users.

Compare end-user experience with Synthetics canaries using X-Ray Analytics

 

Identify whether you have enough canary coverage for all of your APIs and URLs

When debugging synthetic canaries, developers and DevOps engineers want to understand whether they have enough coverage for their APIs and URLs, and whether their synthetic canaries have adequate tests. Using X-Ray Analytics, developers can compare the experience of synthetic canaries with that of the rest of their end users. The screenshot below shows a blue trend line for canaries and a green one for the rest of the users. You can also see that two out of the three URLs don't have canary tests.

Identify which URLs and APIs have canaries running.

Slice and dice the X-Ray service map to focus only on Synthetics tests and their paths to downstream services using X-Ray groups

Developers can have multiple APIs and applications running within an account. At that point, it becomes critical to focus on a specific set of workflows, such as the Synthetics tests for the application “www” running on AWS Elastic Beanstalk, as shown below. Developers can create an X-Ray group using the filter expression “edge(id(name: “www”, type: “client::Synthetics”), id(name: “www”, type: “AWS::ElasticBeanstalk::Environment”))” to focus only on traces generated by their Synthetics tests for “www”.

Create X-Ray Groups for Synthetics canaries
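
The same group could also be created programmatically; a minimal boto3 sketch, with a placeholder group name and the filter expression from above:

import boto3

xray = boto3.client("xray")

# Group name is a placeholder; the filter expression matches the one described above.
xray.create_group(
    GroupName="www-synthetics-tests",
    FilterExpression=(
        'edge(id(name: "www", type: "client::Synthetics"), '
        'id(name: "www", type: "AWS::ElasticBeanstalk::Environment"))'
    ),
)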

 

Getting Started

Getting started with Synthetics is easy. Enable X-Ray tracing for your APIs, web application endpoints, and URLs monitored by Synthetics canaries. Visit the documentation to learn more about getting started using AWS X-Ray with Amazon CloudWatch Synthetics.

Conclusion

In this post, I outlined how you can use AWS X-Ray to debug and triage canaries running in Amazon CloudWatch Synthetics. We looked at some of the use cases that enable developers and DevOps engineers to reduce time to resolution by viewing the end-to-end path of a request, determining the root cause of an issue, and understanding which upstream or downstream services are impacted.

New – Amazon CloudWatch Anomaly Detection

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-amazon-cloudwatch-anomaly-detection/

Amazon CloudWatch launched in early 2009 as part of our desire to (as I said at the time) “make it even easier for you to build sophisticated, scalable, and robust web applications using AWS.” We have continued to expand CloudWatch over the years, and our customers now use it to monitor their infrastructure, systems, applications, and even business metrics. They build custom dashboards, set alarms, and count on CloudWatch to alert them to issues that affect the performance or reliability of their applications.

If you have used CloudWatch Alarms, you know that there’s a bit of an art to setting your alarm thresholds. You want to make sure to catch trouble early, but you don’t want to trigger false alarms. You need to deal with growth and with scale, and you also need to make sure that you adjust and recalibrate your thresholds to deal with cyclic and seasonal behavior.

Anomaly Detection
Today we are enhancing CloudWatch with a new feature that will help you to make more effective use of CloudWatch Alarms. Powered by machine learning and building on over a decade of experience, CloudWatch Anomaly Detection has its roots in over 12,000 internal models. It will help you to avoid manual configuration and experimentation, and can be used in conjunction with any standard or custom CloudWatch metric that has a discernible trend or pattern.

Anomaly Detection analyzes the historical values for the chosen metric, and looks for predictable patterns that repeat hourly, daily, or weekly. It then creates a best-fit model that will help you to better predict the future, and to more cleanly differentiate normal and problematic behavior. You can adjust and fine-tune the model as desired, and you can even use multiple models for the same CloudWatch metric.

Using Anomaly Detection
I can create my own models in a matter of seconds! I have an EC2 instance that generates a spike in CPU Utilization every 24 hours:

I select the metric, and click the “wave” icon to enable anomaly detection for this metric and statistic:

This creates a model with default settings. If I select the model and zoom in to see one of the utilization spikes, I can see that the spike is reflected in the prediction bands:

I can use this model as-is to drive alarms on the metric, or I can select the model and click Edit model to customize it:

I can exclude specific time ranges (past or future) from the data that is used to train the model; this is a good idea if the data reflects a one-time event that will not happen again. I can also specify the timezone of the data; this lets me handle metrics that are sensitive to changes in daylight saving time:

After I have set this up, the anomaly detection model goes into effect and I can use it to create alarms as usual. I choose Anomaly detection as my Threshold type, and use the Anomaly detection threshold to control the thickness of the band. I can raise the alarm when the metric is outside of, greater than, or lower than the band:

The remaining steps are identical to the ones that you already use to create other types of alarms.

Things to Know
Here are a few interesting things to keep in mind when you are getting ready to use this new CloudWatch feature:

Suitable Metrics – Anomaly Detection works best when the metrics have a discernible pattern or trend, and when there is a minimal number of missing data points.

Updates – Once the model has been created, it will be updated every five minutes with any new metric data.

One-Time Events – The model cannot predict one-time events such as Black Friday or the holiday shopping season.

API / CLI / CloudFormation – You can create and manage anomaly models from the Console, the CloudWatch API (PutAnomalyDetector) and the CloudWatch CLI. You can also create AWS::CloudWatch::AnomalyDetector resources in your AWS CloudFormation templates.
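
A hedged sketch of the same setup from code, using the PutAnomalyDetector API and an anomaly detection alarm; the instance ID is a placeholder, and the band width of two standard deviations mirrors the default:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Train a model on an EC2 instance's average CPU utilization (instance ID is a placeholder).
cloudwatch.put_anomaly_detector(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Stat="Average",
)

# Alarm when the metric leaves the model's expected band.
cloudwatch.put_metric_alarm(
    AlarmName="cpu-outside-expected-band",
    EvaluationPeriods=2,
    ComparisonOperator="LessThanLowerOrGreaterThanUpperThreshold",
    ThresholdMetricId="band",
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
                },
                "Period": 300,
                "Stat": "Average",
            },
            "ReturnData": True,
        },
        {"Id": "band", "Expression": "ANOMALY_DETECTION_BAND(m1, 2)", "Label": "Expected range", "ReturnData": True},
    ],
)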

Now Available
You can start creating and using CloudWatch Anomaly Detection today in all commercial AWS regions. To learn more, read about CloudWatch Anomaly Detection in the CloudWatch Documentation.

Jeff;

 

Automated Disaster Recovery using CloudEndure

Post Syndicated from Ryan Jaeger original https://aws.amazon.com/blogs/architecture/automated-disaster-recovery-using-cloudendure/

There are any number of events that cause IT outages and impact business continuity. These could include unexpected infrastructure or application outages caused by flooding, earthquakes, fires, hardware failures, or even malicious attacks. Cloud computing opens a new door to support disaster recovery strategies, with benefits such as elasticity, agility, speed to innovate, and cost savings, all of which aid new disaster recovery solutions.

With AWS, organizations can acquire IT resources on demand, and pay only for the resources they use. Automating disaster recovery (DR) has always been challenging. This blog post shows how you can use automation to orchestrate recovery and eliminate manual processes. CloudEndure Disaster Recovery, an AWS company, Amazon Route 53, and AWS Lambda are the building blocks to deliver a cost-effective automated DR solution. The example in this post demonstrates how you can recover a production web application with sub-second Recovery Point Objectives (RPOs) and Recovery Time Objectives (RTOs) in minutes.

As part of a DR strategy, knowing RPOs and RTOs will determine what kind of solution architecture you need. The RPO represents the point in time of the last recoverable data point (for example, the “last backup”). Any disaster after that point would result in data loss.

The time from the outage to restoration is the RTO. Minimizing RTO and RPO is a cost tradeoff. Restoring from backups and recreating infrastructure after the event is the lowest cost but highest RTO. Conversely, the highest cost and lowest RTO is a solution running a duplicate auto-failover environment.

Solution Overview

CloudEndure is an automated IT resilience solution that lets you recover your environment from unexpected infrastructure or application outages, data corruption, ransomware, or other malicious attacks. It utilizes block-level, continuous data replication, which ensures that target machines are spun up in their most current state during a disaster or drill, so that you can achieve sub-second RPOs. In the event of a disaster, CloudEndure triggers a highly automated machine conversion process and a scalable orchestration engine that can spin up machines in the target AWS Region within minutes. This process enables you to achieve RTOs in minutes. The CloudEndure solution uses a software agent that installs on physical or virtual servers. It connects to a self-service, web-based user console, which then issues an API call to the selected AWS target Region to create a Staging Area in the customer’s AWS account designated to receive the source machine’s replicated data.

Architecture

In the above example, a webserver and database server have the CloudEndure Agent installed, and the disk volumes on each server replicated to a staging environment in the customer’s AWS account. The CloudEndure Replication Server receives the encrypted data replication traffic and writes to the appropriate corresponding EBS volumes. It’s also possible to configure data replication traffic to use VPN or AWS Direct Connect.

With this current setup, if an infrastructure or application outage occurs, a failover to AWS is executed by manually starting the process from the CloudEndure Console. When this happens, CloudEndure creates EC2 instances from the synchronized target EBS volumes. After the failover completes, additional manual steps are needed to change the website’s DNS entry to point to the IP address of the failed over webserver.

Could the CloudEndure failover and DNS update be automated? Yes.

Amazon Route 53 is a highly available and scalable Domain Name System (DNS) web service with three main functions: domain registration, DNS routing, and health checking. A configured Route 53 health check monitors the endpoint of a webserver. If the health check fails over a specified period, an alarm is raised to execute an AWS Lambda function that starts the CloudEndure failover process. In addition to health checks, Route 53 DNS Failover allows the DNS record for the webserver to be automatically updated based on a healthy endpoint. Now the previously manual process of updating the DNS record to point to the restored webserver is automated. You can also build Route 53 DNS Failover configurations to support decision trees to handle complex configurations.
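
As a hedged sketch (not the exact configuration used in this walkthrough), a primary and secondary failover record pair with an attached health check might be created with boto3 like this, using placeholder hosted zone, health check, and IP values:

import boto3

route53 = boto3.client("route53")

# Placeholder hosted zone ID, health check ID, and IP addresses.
route53.change_resource_record_sets(
    HostedZoneId="Z1EXAMPLE",
    ChangeBatch={"Changes": [
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "www.example.com",
                "Type": "A",
                "SetIdentifier": "primary-webserver",
                "Failover": "PRIMARY",
                "TTL": 60,
                "HealthCheckId": "11111111-2222-3333-4444-555555555555",
                "ResourceRecords": [{"Value": "203.0.113.10"}],
            },
        },
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "www.example.com",
                "Type": "A",
                "SetIdentifier": "failover-webserver",
                "Failover": "SECONDARY",
                "TTL": 60,
                "ResourceRecords": [{"Value": "203.0.113.20"}],
            },
        },
    ]},
)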

To illustrate this, the following builds on the example by having a primary, secondary, and tertiary DNS Failover choice for the web application:

How Health Checks Work in Complex Amazon Route 53 Configurations

When the CloudEndure failover action executes, it takes several minutes until the target EC2 is launched and configured by CloudEndure. An S3 static web page can be returned to the end-user to improve communication while the failover is happening.

To support this example, the Amazon Route 53 DNS failover decision tree can be configured with a primary, secondary, and tertiary failover. The decision tree logic for the scenario is the following:

  1. If the primary health check passes, return the primary webserver.
  2. Else, if the secondary health check passes, return the failover webserver.
  3. Else, return the S3 static site.

When the Route 53 health check fails when monitoring the primary endpoint for the webserver, a CloudWatch alarm is configured to ALARM after a set time. This CloudWatch alarm then executes a Lambda function that calls the CloudEndure API to begin the failover.

In the screenshot below, both health checks are reporting “Unhealthy” while the primary health check is in a state of ALARM. At this point, the DNS failover logic should be returning the path to the static S3 site, and the Lambda function is executed to start the CloudEndure failover.

The following architecture illustrates the completed scenario:

Conclusion

Having a disaster recovery strategy is critical for business continuity. The benefits of AWS combined with CloudEndure Disaster Recovery create a non-disruptive DR solution that provides minimal RTO and RPO while reducing total cost of ownership for customers. CloudWatch alarms combined with AWS Lambda for serverless computing are building blocks for a variety of automation scenarios.

 

 

Tips for building a cloud security operating model in the financial services industry

Post Syndicated from Stephen Quigg original https://aws.amazon.com/blogs/security/tips-for-building-a-cloud-security-operating-model-in-the-financial-services-industry/

My team helps financial services customers understand how AWS services operate so that you can incorporate AWS into your existing processes and security operations centers (SOCs). As soon as you create your first AWS account for your organization, you’re live in the cloud. So, from day one, you should be equipped with certain information: you should understand some basics about how our products and services work, you should know how to spot when something bad could happen, and you should understand how to recover from that situation. Below is some of the advice I frequently offer to financial services customers who are just getting started.

How to think about cloud security

Security is security – the principles don’t change. Many of the on-premises security processes that you have now can extend directly to an AWS deployment. For example, your processes for vulnerability management, security monitoring, and security logging can all be transitioned over.

That said, AWS is more than just infrastructure. I sometimes talk to customers who are only thinking about the security of their AWS Virtual Private Clouds (VPCs), and about the Amazon Elastic Compute Cloud (EC2) instances running in those VPCs. And that’s good; it’s traditional network security that remains quite standard. But I also ask my customers questions that focus on other services they may be using. For example:

  • How are you thinking about who has Database Administrator (DBA) rights for Amazon Aurora Serverless? Aurora Serverless is a managed database service that lets AWS do the heavy lifting for many DBA tasks.
  • Do you understand how to configure (and monitor the configuration of) your Amazon Athena service? Athena lets you query large amounts of information that you’ve stored in Amazon Simple Storage Service (S3).
  • How will you secure and monitor your AWS Lambda deployments? Lambda is a serverless platform that has no infrastructure for you to manage.

Understanding AWS security services

As a customer, it’s important to understand the information that’s available to you about the state of your cloud infrastructure. Typically, AWS delivers much of that information via the Amazon CloudWatch service. So, I encourage my customers to get comfortable with CloudWatch, alongside our AWS security services. The key services that any security team needs to understand include:

  • Amazon GuardDuty, which is a threat detection system for the cloud.
  • AWS CloudTrail, which logs the AWS API activity in your accounts.
  • VPC Flow Logs, which enables you to capture information about the IP traffic going to and from network interfaces in your VPC.
  • AWS Config, which records all the configuration changes that your teams have made to AWS resources, allowing you to assess those changes.
  • AWS Security Hub, which offers a “single pane of glass” that helps you assess AWS resources and collect information from across your security services. It gives you a unified view of resources per Region, so that you can more easily manage your security and compliance workflow.

These tools make it much quicker for you to get up to speed on your cloud security status and establish a position of safety.

Getting started with automation in the cloud

You don’t have to be a software developer to use AWS. You don’t have to write any code; the basics are straightforward. But to optimize your use of AWS and to get faster at automating, there is a real advantage if you have coding skills. Automation is the core of the operating model. We have a number of tutorials that can help you get up to speed.

Self-service cloud security resources for financial services customers

There are people like me who can come and talk to you. But to keep you from having to wait for us, we also offer a lot of self-service cloud security resources on our website.

We offer a free digital training course on AWS security fundamentals, plus webinars on financial services topics. We also offer an AWS security certification, which lets you show that your security knowledge has been validated by a third-party.

There are also a number of really good videos you can watch. For example, we had our inaugural security conference, re:Inforce, in Boston this past June. The videos and slides from the conference are now on YouTube, so you can sit and watch at your own pace. If you’re not sure where to start, try this list of popular sessions.

Finding additional help

You can work with a number of technology partners to help extend your security tools and processes to the cloud.

  • Our AWS Professional Services team can come and help you on site. In addition, we can simulate security incidents with you to help you get comfortable with security and cloud technology and how to respond to incidents.
  • AWS security consulting partners can also help you develop processes or write the code that you might need.
  • The AWS Marketplace is a wonderful self-service location where you can get all sorts of great security solutions, including finding a consulting partner.

And if you’re interested in speaking directly to AWS, you can always get in touch. There are forms on our website, or you can reach out to your AWS account manager and they can help you find the resources that are necessary for your business.

Conclusion

Financial services customers face some tough security challenges. You handle large amounts of data, and it’s really important that this data is stored securely and that its privacy is respected. We know that our customers do lots of due diligence of AWS before adopting our services, and they have many different regulatory environments within which they have to work. In turn, we want to help customers understand how they can build a cloud security operating model that meets their needs while using our services.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Stephen Quigg

Stephen Quigg is a Principal Securities Solutions Architect within AWS Financial Services. Quigg started his AWS career in Sydney, Australia, but returned home to Scotland three years ago having missed the wind and rain too much. He manages to fit some work in between being a husband and father to two angelic children and making music.

Automating notifications when AMI permissions change

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/automating-notifications-when-ami-permissions-change/

This post is courtesy of Ernes Taljic, Solutions Architect and Sudhanshu Malhotra, Solutions Architect

This post demonstrates how to automate alert notifications when users modify the permissions of an Amazon Machine Image (AMI). You can use it as a blueprint for a wide variety of alert notifications by making simple modifications to the events that you want to receive alerts about. For example, updating the specific operation in Amazon CloudWatch allows you to receive alerts on any activity that AWS CloudTrail captures.

This post walks you through how to configure an event rule in CloudWatch that triggers an AWS Lambda function. The Lambda function uses Amazon SNS to send an email when an AMI changes to public, private, shared, or unshared with one or more AWS accounts.

Solution overview

The following diagram describes the solution at a high level:

  1. A user changes an attribute of an AMI.
  2. CloudTrail logs the change as a ModifyImageAttribute API event.
  3. A CloudWatch Events rule captures this event.
  4. The CloudWatch Events rule triggers a Lambda function.
  5. The Lambda function publishes a message to the defined SNS topic.
  6. SNS sends an email alert to the topic’s subscribers.

Deployment walkthrough

To implement this solution, you must create:

  • An SNS topic
  • An IAM role
  • A Lambda function
  • A CloudWatch Events rule

Step 1: Creating an SNS topic

To create an SNS topic, complete the following steps:

  1. Open the SNS console.
  2. Under Create topic, for Topic name, enter a name and choose Create topic. You can now see the MySNSTopic page. The Details section displays the topic’s Name, ARN, Display name (optional), and the AWS account ID of the Topic owner.
  3. In the Details section, copy the topic ARN to the clipboard, for example:
    arn:aws:sns:us-east-1:123456789012:MySNSTopic
  4. On the left navigation pane, choose Subscriptions, Create subscription.
  5. On the Create subscription page, do the following:
    1. Enter the topic ARN of the topic you created earlier:
      arn:aws:sns:us-east-1:123456789012:MySNSTopic
    2. For Protocol, select Email.
    3. For Endpoint, enter an email address that can receive notifications.
    4. Choose Create subscription.

Step 2: Creating an IAM role

To create an IAM role, complete the following steps. For more information, see Creating an IAM Role.

  1. In the IAM console, choose Policies, Create Policy.
  2. On the JSON tab, enter the following IAM policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:*:*:*"
            ],
            "Effect": "Allow",
            "Sid": "LogStreamAccess"
        },
        {
            "Action": [
                "sns:Publish"
            ],
            "Resource": [
                "arn:aws:sns:*:*:*"
            ],
            "Effect": "Allow",
            "Sid": "SNSPublishAllow"
        },
        {
            "Action": [
                "iam:ListAccountAliases"
            ],
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "ListAccountAlias"
        }
    ]
}

3. Choose Review policy.

4. Enter a name (MyCloudWatchRole) for this policy and choose Create policy. Note the name of this policy for later steps.

5. In the left navigation pane, choose Roles, Create role.

6. On the Select role type page, choose Lambda and the Lambda use case.

7. Choose Next: Permissions.

8. Filter policies by the policy name that you just created, and select the check box.

9. Choose Next: Tags, and give it an appropriate tag.

10. Choose Next: Review. Give this IAM role an appropriate name, and note it for future use.

11.   Choose Create role.

Step 3: Creating a Lambda function

To create a Lambda function, complete the following steps. For more information, see Create a Lambda Function with the Console.

  1. In the Lambda console, choose Author from scratch.
  2. For Function Name, enter the name of your function.
  3. For Runtime, choose Python 3.7.
  4. For Execution role, select Use an existing role, then select the IAM role created in the previous step.
  5. Choose Create Function, remove the default function, and copy the following code into the Function Code window:
# Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License").
# You may not use this file except in compliance with the License.
# A copy of the License is located at
#
#     http://aws.amazon.com/apache2.0/
#
# or in the "license" file accompanying this file.
# This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
# either express or implied. See the License for the specific language governing permissions
# and limitations under the License.
#
# Description: This Lambda function sends an SNS notification to a given AWS SNS topic when an API event of \"Modify Image Attribute\" is detected.
#              The SNS subject is- "API call-<insert call event> by < insert user name> detected in Account-<insert account alias>, see message for further details". 
#              The JSON message body of the SNS notification contains the full event details.
# 
#
# Author: Sudhanshu Malhotra


import json
import boto3
import logging
import os
import botocore.session
from botocore.exceptions import ClientError
session = botocore.session.get_session()

logging.basicConfig(level=logging.DEBUG)
logger=logging.getLogger(__name__)

import ipaddress
import traceback

def lambda_handler(event, context):
	logger.setLevel(logging.DEBUG)
	eventname = event['detail']['eventName']
	snsARN = os.environ['snsARN']          #Getting the SNS Topic ARN passed in by the environment variables.
	user = event['detail']['userIdentity']['type']
	srcIP = event['detail']['sourceIPAddress']
	imageId = event['detail']['requestParameters']['imageId']
	launchPermission = event['detail']['requestParameters']['launchPermission']
	imageAction = list(launchPermission.keys())[0]
	accnt_num =[]
	
	if imageAction == "add":
		if "userId" in launchPermission['add']['items'][0].keys():
			accnt_num = [li['userId'] for li in launchPermission['add']['items']]      # Get the AWS account numbers that the image was shared with
			imageAction = "Image shared with AWS account: " + str(accnt_num)[1:-1]
		else:
			imageAction = "Image made Public"
	
	elif imageAction == "remove":
		if "userId" in launchPermission['remove']['items'][0].keys():
			accnt_num = [li['userId'] for li in launchPermission['remove']['items']]
			imageAction = "Image Unshared with AWS account: " + str(accnt_num)[1:-1]    # Get the AWS account numbers that the image was unshared with
		else:
			imageAction = "Image made Private"
	
	
	logger.debug("Event is --- %s" %event)
	logger.debug("Event Name is--- %s" %eventname)
	logger.debug("SNSARN is-- %s" %snsARN)
	logger.debug("User Name is -- %s" %user)
	logger.debug("Source IP Address is -- %s" %srcIP)
	
	client = boto3.client('iam')
	snsclient = boto3.client('sns')
	response = client.list_account_aliases()
	logger.debug("List Account Alias response --- %s" %response)
	
	# Check if the source IP is a valid IP or AWS service DNS name.
	# If DNS name then we ignore the API activity as this is internal AWS operation
	# For more information check - https://aws.amazon.com/premiumsupport/knowledge-center/cloudtrail-root-action-logs/
	try:
	    validIP = ipaddress.ip_address(srcIP)
	    logger.debug("IP addr is-- %s" %validIP)
	except Exception as e:
	    logger.error("Catching the traceback error: %s" %traceback.format_exc())
	    logger.debug("Seems like the root API activity was caused by an internal operation. IP address is internal service DNS name")
	    return
	try:
		if not response['AccountAliases']:
			accntAliase = (boto3.client('sts').get_caller_identity()['Account'])
			logger.info("Account Aliase is not defined. Account ID is %s" %accntAliase)
		else:
			accntAliase = response['AccountAliases'][0]
			logger.info("Account Aliase is : %s" %accntAliase)
	
	except ClientError as e:
		logger.error("Client Error occured")
	
	try: 
		publish_message = ""
		publish_message += "\nImage Attribute change summary" + "\n\n"
		publish_message += "##########################################################\n"
		publish_message += "# Event Name- " +str(eventname) + "\n" 
		publish_message += "# Account- " +str(accntAliase) +  "\n"
		publish_message += "# AMI ID- " +str(imageId) +  "\n"
		publish_message += "# Image Action- " +str(imageAction) +  "\n"
		publish_message += "# Source IP- " +str(srcIP) +   "\n"
		publish_message += "##########################################################\n"
		publish_message += "\n\n\nThe full event is as below:- \n\n" +str(event) +"\n"
		logger.debug("MESSAGE- %s" %publish_message)
		
		#Sending the notification...
		snspublish = snsclient.publish(
						TargetArn= snsARN,
						Subject=(("Image Attribute change API call-\"%s\" detected in Account-\"%s\"" %(eventname,accntAliase))[:100]),
						Message=publish_message
						)
	except ClientError as e:
		logger.error("An error occured: %s" %e)

6. In the Environment variables section, enter the following key-value pair:

  • Key= snsARN
  • Value= the ARN of the MySNSTopic created earlier

7. Choose Save.

Step 4: Creating a CloudWatch Events rule

To create a CloudWatch Events rule, complete the following steps. This rule catches when a user performs a ModifyImageAttribute API event and triggers the Lambda function (set as a target).

1.       In the CloudWatch console, choose Rules, Create rule.

  • On the Step 1: Create rule page, under Event Source, select Event Pattern.
  • Copy the following event into the preview pane:
{
  "source": [
    "aws.ec2"
  ],
  "detail-type": [
    "AWS API Call via CloudTrail"
  ],
  "detail": {
    "eventSource": [
      "ec2.amazonaws.com"
    ],
    "eventName": [
      "ModifyImageAttribute"
    ]
  }
}
  • For Targets, select Lambda function, and select the Lambda function created in Step 3.

  • Choose Configure details.

2. On the Step 2: Configure rule details page, enter a name and description for the rule.

3. For State, select Enabled.

4. Choose Create rule.

Solution validation

Confirm that the solution works by changing an AMI:

  1. Open the Amazon EC2 console. From the menu, select AMIs under the Images heading.
  2. Select one of the AMIs, and choose Actions, Modify Image Permissions. To create an AMI, see How do I create an AMI that is based on my EBS-backed EC2 instance?
  3. Choose Private.
  4. For AWS Account Number, choose an account with which to share the AMI.
  5. Choose Add Permission, Save.
  6. Check your inbox to verify that you received an email from SNS. The email contains a summary message with the following information, followed by the full event:
  • Event Name
  • Account
  • AMI ID
  • Image Action
  • Source IP

For Image Action, the email lists one of the following events:

  • Image made Public
  • Image made Private
  • Image shared with AWS account
  • Image Unshared with AWS account

The message also includes the account ID of any shared or unshared accounts.

Creating other alert notifications

This post shows you how to automate notifications when AMI permissions change, but the solution is a blueprint for a wide variety of use cases that require alert notifications. You can use the CloudTrail event history to find the API associated with the event that you want to receive notifications about, and create a new CloudWatch Events rule for that event.

  1. Open the CloudTrail console and choose Event history.
  2. Explore the CloudTrail event history and locate the event name associated with the actions that you performed in your AWS account.
  3. Make a note of the event name and modify the eventName parameter in Step 4, 1.b to configure alerts for that particular event.

Automating Notifications - CloudTrail console
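
Once you have the event name, the replacement rule can also be created programmatically. Here is a hedged boto3 sketch using EC2's RunInstances event as an example; the Lambda function ARN is a placeholder, and the function code from Step 3 would need adjusting to parse a different event's payload:

import json
import boto3

events = boto3.client("events")

# Example: alert on EC2 RunInstances instead of ModifyImageAttribute.
# The target function also needs a resource-based permission that allows
# events.amazonaws.com to invoke it.
events.put_rule(
    Name="notify-on-run-instances",
    State="ENABLED",
    EventPattern=json.dumps({
        "source": ["aws.ec2"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["ec2.amazonaws.com"],
            "eventName": ["RunInstances"],
        },
    }),
)

events.put_targets(
    Rule="notify-on-run-instances",
    Targets=[{"Id": "notify-lambda", "Arn": "arn:aws:lambda:us-east-1:123456789012:function:AmiAlertFunction"}],
)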

Conclusion

This post demonstrated how you can use CloudWatch to create automated notifications when AMI permissions change. Additionally, you can use the CloudTrail event history to find the API for other events, and use the preceding walkthrough to create other event alerts.

For further reading, see the following posts:

 

 

Learn From Your VPC Flow Logs With Additional Meta-Data

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/learn-from-your-vpc-flow-logs-with-additional-meta-data/

Flow Logs for Amazon Virtual Private Cloud enables you to capture information about the IP traffic going to and from network interfaces in your VPC. Flow Logs data can be published to Amazon CloudWatch Logs or Amazon Simple Storage Service (S3).

Since we launched VPC Flow Logs in 2015, you have been using it for a variety of use cases, such as troubleshooting connectivity issues across your VPCs, intrusion detection, anomaly detection, or archival for compliance purposes. Until today, VPC Flow Logs provided information that included source IP, source port, destination IP, destination port, action (accept, reject), and status. Once enabled, a VPC Flow Log entry looks like the one below.

While this information was sufficient to understand most flows, it required additional computation and lookup to match IP addresses to instance IDs or to guess the directionality of the flow to come to meaningful conclusions.

Today we are announcing the availability of additional meta data to include in your Flow Logs records to better understand network flows. The enriched Flow Logs will allow you to simplify your scripts or remove the need for postprocessing altogether, by reducing the number of computations or lookups required to extract meaningful information from the log data.

When you create a new VPC Flow Log, in addition to existing fields, you can now choose to add the following meta-data:

  • vpc-id : the ID of the VPC containing the source Elastic Network Interface (ENI).
  • subnet-id : the ID of the subnet containing the source ENI.
  • instance-id : the Amazon Elastic Compute Cloud (EC2) instance ID of the instance associated with the source interface. When the ENI is placed by AWS services (for example, AWS PrivateLink, NAT Gateway, Network Load Balancer, etc.) this field will be “-”.
  • tcp-flags : the bitmask for TCP Flags observed within the aggregation period. For example, FIN is 0x01 (1), SYN is 0x02 (2), ACK is 0x10 (16), SYN + ACK is 0x12 (18), etc. (the bits are specified in “Control Bits” section of RFC793 “Transmission Control Protocol Specification”).
    This allows you to understand who initiated or terminated the connection. TCP uses a three-way handshake to establish a connection. The connecting machine sends a SYN packet to the destination, the destination replies with a SYN + ACK and, finally, the connecting machine sends an ACK. In the Flow Logs, the handshake is shown as two lines, with tcp-flags values of 2 (SYN) and 18 (SYN + ACK). ACK is reported only when it is accompanied by SYN (otherwise it would be too much noise for you to filter out).
  • type : the type of traffic : IPV4, IPV6 or Elastic Fabric Adapter.
  • pkt-srcaddr : the packet-level IP address of the source. You typically use this field in conjunction with srcaddr to distinguish between the IP address of an intermediate layer through which traffic flows, such as a NAT gateway.
  • pkt-dstaddr : the packet-level destination IP address, similar to the previous one, but for destination IP addresses.
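
To make the bitmask concrete, here is a minimal shell sketch that turns a tcp-flags value from a Flow Log record back into flag names. The decode_tcp_flags helper name is just for illustration.

# Minimal sketch: decode a tcp-flags bitmask into flag names
decode_tcp_flags() {
    local flags=$1
    (( flags & 1 ))  && printf 'FIN '   # 0x01
    (( flags & 2 ))  && printf 'SYN '   # 0x02
    (( flags & 4 ))  && printf 'RST '   # 0x04
    (( flags & 16 )) && printf 'ACK '   # 0x10
    printf '\n'
}

decode_tcp_flags 18   # prints: SYN ACK
decode_tcp_flags 3    # prints: FIN SYN (a short connection, with SYN and FIN OR'ed together)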

To create a VPC Flow Log, you can use the AWS Management Console, the AWS Command Line Interface (CLI), or the CreateFlowLogs API, and select which additional fields to include and the order in which they appear, for example:

Or using the AWS Command Line Interface (CLI) as below:

$ aws ec2 create-flow-logs --resource-type VPC \
                            --region eu-west-1 \
                            --resource-ids vpc-12345678 \
                            --traffic-type ALL  \
                            --log-destination-type s3 \
                            --log-destination arn:aws:s3:::sst-vpc-demo \
                            --log-format '${version} ${vpc-id} ${subnet-id} ${instance-id} ${interface-id} ${account-id} ${type} ${srcaddr} ${dstaddr} ${srcport} ${dstport} ${pkt-srcaddr} ${pkt-dstaddr} ${protocol} ${bytes} ${packets} ${start} ${end} ${action} ${tcp-flags} ${log-status}'

# be sure to replace the bucket name and VPC ID !

{
    "ClientToken": "1A....HoP=",
    "FlowLogIds": [
        "fl-12345678123456789"
    ],
    "Unsuccessful": [] 
}

Enriched VPC Flow Logs are delivered to S3. We automatically add the required S3 bucket policy to authorize VPC Flow Logs to write to your S3 bucket. VPC Flow Logs does not capture real-time log streams for your network interfaces; it might take several minutes to begin collecting and publishing data to the chosen destination. Your logs will eventually be available on S3 at s3://<bucket name>/AWSLogs/<account id>/vpcflowlogs/<region>/<year>/<month>/<day>/
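
Once delivery has started, you can verify that objects are arriving with a quick listing. This sketch reuses the sst-vpc-demo bucket from the example above; replace it with your own bucket name.

# List the Flow Log objects delivered so far
$ aws s3 ls s3://sst-vpc-demo/AWSLogs/ --recursive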

An SSH connection from my laptop with IP address 90.90.0.200 to an EC2 instance would appear like this:

3 vpc-exxxxxx2 subnet-8xxxxf3 i-0bfxxxxxxaf eni-08xxxxxxa5 48xxxxxx93 IPv4 172.31.22.145 90.90.0.200 22 62897 172.31.22.145 90.90.0.200 6 5225 24 1566328660 1566328672 ACCEPT 18 OK
3 vpc-exxxxxx2 subnet-8xxxxf3 i-0bfxxxxxxaf eni-08xxxxxxa5 48xxxxxx93 IPv4 90.90.0.200 172.31.22.145 62897 22 90.90.0.200 172.31.22.145 6 4877 29 1566328660 1566328672 ACCEPT 2 OK

172.31.22.145 is the private IP address of the EC2 instance, the one you see when you type ifconfig on the instance. All flags are OR'ed over the aggregation period. When a connection is short, both SYN and FIN (3), as well as SYN + ACK and FIN (19), will likely be set on the same lines.

Once a Flow Log is created, you cannot add fields or modify the structure of the log; this ensures you will not accidentally break scripts consuming this data. Any modification requires you to delete and recreate the VPC Flow Log. There is no additional cost to capture the extra information; normal VPC Flow Logs pricing applies. Keep in mind that enriched VPC Flow Log records might consume more storage when you select all fields, so we recommend selecting only the fields relevant to your use cases.

Enriched VPC Flow Logs are available in all regions where VPC Flow Logs is available, and you can start using them today.

— seb

PS: I heard from the team they are working on adding additional meta-data to the logs, stay tuned for updates.

Operational Insights for Containers and Containerized Applications

Post Syndicated from Steve Roberts original https://aws.amazon.com/blogs/aws/operational-insights-for-containers-and-containerized-applications/

The increasing adoption of containerized applications and microservices also brings an increased burden for monitoring and management. Builders expect, and require, the same level of monitoring as they would use with longer-lived infrastructure such as Amazon Elastic Compute Cloud (EC2) instances. By contrast, containers are relatively short-lived and usually subject to continuous deployment. This can make it difficult to reliably collect monitoring data and to analyze performance or other issues, which in turn affects remediation time. In addition, builders have to resort to a disparate collection of tools to perform this analysis and inspection, manually correlating context across a set of infrastructure and application metrics, logs, and other traces.

Announcing general availability of Amazon CloudWatch Container Insights
At the AWS Summit in New York this past July, Amazon CloudWatch Container Insights support for Amazon ECS and AWS Fargate was announced as an open preview for new clusters. Starting today, Container Insights is generally available, with the added ability to also monitor existing clusters. You can easily obtain immediate insights into compute utilization and failures for both new and existing cluster infrastructure and containerized applications from container management services including Kubernetes, Amazon Elastic Container Service for Kubernetes, Amazon ECS, and AWS Fargate.

Once enabled, Amazon CloudWatch discovers all of the running containers in a cluster and collects performance and operational data at every layer of the container stack. It also continuously monitors the environment and updates as changes occur, reducing the number of tools required to collect, monitor, act on, and analyze container metrics and logs, and giving complete end-to-end visibility.

Being able to easily access this data means customers can shift focus to increased developer productivity and away from building mechanisms to curate and build dashboards.

Getting started with Amazon CloudWatch Container Insights
I can enable Container Insights by following the instructions in the documentation; a minimal CLI sketch is shown below. Once it is enabled and new clusters are launched, when I visit the CloudWatch console for my region I see a new Container Insights option in the list of dashboards available to me.
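
For reference, here is a sketch of enabling Container Insights for ECS from the CLI, assuming a reasonably current AWS CLI; the cluster name my-cluster is a placeholder.

# Turn Container Insights on by default for new ECS clusters in this account and region
$ aws ecs put-account-setting --name containerInsights --value enabled

# Or enable it for an existing cluster (my-cluster is a placeholder name)
$ aws ecs update-cluster-settings --cluster my-cluster \
      --settings name=containerInsights,value=enabled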

Clicking this takes me to the relevant dashboard where I can select the container management service that hosts the clusters that I want to observe.

In the image below, I have chosen to view metrics for my ECS clusters that are hosting a sample application I have deployed on AWS Fargate. I can examine the metrics for standard time periods such as 1 hour or 3 hours, but I can also specify a custom time period. Here I am looking at the metrics for a custom time period of the past 15 minutes.

You can see that I can quickly gain operational oversight of the overall performance of the cluster. Clicking the cluster name takes me deeper to view the metrics for the tasks inside the cluster.

Selecting a container allows me to then dive into either AWS X-Ray traces or performance logs.

Selecting performance logs takes me to the Amazon CloudWatch Logs Insights page, where I can run queries against the performance events collected for my container ecosystem (for example, at the Container, Task/Pod, or Cluster level) that I can then use to troubleshoot and dive deeper; one example query is sketched below.
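
As an illustration, a Logs Insights query along these lines, run against the cluster's Container Insights performance log group, surfaces the busiest tasks by CPU. The field names (Type, TaskId, CpuUtilized, MemoryUtilized) follow the Container Insights performance event schema as I understand it, so treat this as a sketch rather than a copy-paste recipe.

filter Type = "Task"
| stats avg(CpuUtilized) as avg_cpu, avg(MemoryUtilized) as avg_mem by TaskId
| sort avg_cpu desc
| limit 10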

Container Insights makes it easy for me to start monitoring my containers, and enables me to quickly drill down into performance metrics and log analytics without the need to build custom dashboards to curate data from multiple tools. Beyond monitoring and troubleshooting, I can also use the data and dashboards Container Insights provides to support other use cases such as capacity planning, for example by helping me understand compute utilization by Pod/Task, Container, and Service.

Availability
Amazon CloudWatch Container Insights is generally available today to customers in all public AWS regions where Amazon Elastic Container Service for Kubernetes, Kubernetes, Amazon ECS, and AWS Fargate are present.

— Steve