Tag Archives: alarms

Troubleshooting event publishing issues in Amazon SES

Post Syndicated from Dustin Taylor original https://aws.amazon.com/blogs/ses/troubleshooting-event-publishing-issues-in-amazon-ses/

Over the past year, we’ve released several features that make it easier to track the metrics that are associated with your Amazon SES account. The first of these features, launched in November of last year, was event publishing.

Initially, event publishing let you capture basic metrics related to your email sending and publish them to other AWS services, such as Amazon CloudWatch and Amazon Kinesis Data Firehose. Some examples of these basic metrics include the number of emails that were sent and delivered, as well as the number that bounced or received complaints. A few months ago, we expanded this feature by adding engagement metrics—specifically, information about the number of emails that your customers opened or engaged with by clicking links.

As a former Cloud Support Engineer, I’ve seen Amazon SES customers do some amazing things with event publishing, but I’ve also seen some common issues. In this article, we look at some of these issues, and discuss the steps you can take to resolve them.

Before we begin

This post assumes that your Amazon SES account is already out of the sandbox, that you’ve verified an identity (such as an email address or domain), and that you have the necessary permissions to use Amazon SES and the service that you’ll publish event data to (such as Amazon SNS, CloudWatch, or Kinesis Data Firehose).

We also assume that you’re familiar with the process of creating configuration sets and specifying event destinations for those configuration sets. For more information, see Using Amazon SES Configuration Sets in the Amazon SES Developer Guide.

Amazon SNS event destinations

If you want to receive notifications when events occur—such as when recipients click a link in an email, or when they report an email as spam—you can use Amazon SNS as an event destination.

Occasionally, customers ask us why they’re not receiving notifications when they use an Amazon SNS topic as an event destination. One of the most common reasons for this issue is that they haven’t configured subscriptions for their Amazon SNS topic yet.

A single topic in Amazon SNS can have one or more subscriptions. When you subscribe to a topic, you tell that topic which endpoints (such as email addresses or mobile phone numbers) to contact when it receives a notification. If you haven’t set up any subscriptions, nothing will happen when an email event occurs.

For more information about setting up topics and subscriptions, see Getting Started in the Amazon SNS Developer Guide. For information about publishing Amazon SES events to Amazon SNS topics, see Set Up an Amazon SNS Event Destination for Amazon SES Event Publishing in the Amazon SES Developer Guide.

Kinesis Data Firehose event destinations

If you want to store your Amazon SES event data for the long term, choose Amazon Kinesis Data Firehose as a destination for Amazon SES events. With Kinesis Data Firehose, you can stream data to Amazon S3 or Amazon Redshift for storage and analysis.

The process of setting up Kinesis Data Firehose as an event destination is similar to the process for setting up Amazon SNS: you choose the types of events (such as deliveries, opens, clicks, or bounces) that you want to export, and the name of the Kinesis Data Firehose stream that you want to export to. However, there’s one important difference. When you set up a Kinesis Data Firehose event destination, you must also choose the IAM role that Amazon SES uses to send event data to Kinesis Data Firehose.

When you set up the Kinesis Data Firehose event destination, you can choose to have Amazon SES create the IAM role for you automatically. For many users, this is the best solution—it ensures that the IAM role has the appropriate permissions to move event data from Amazon SES to Kinesis Data Firehose.

Customers occasionally run into issues with the Kinesis Data Firehose event destination when they use an existing IAM role. If you use an existing IAM role, or create a new role for this purpose, make sure that the role includes the firehose:PutRecord and firehose:PutRecordBatch permissions. If the role doesn’t include these permissions, then the Amazon SES event data isn’t published to Kinesis Data Firehose. For more information, see Controlling Access with Amazon Kinesis Data Firehose in the Amazon Kinesis Data Firehose Developer Guide.

CloudWatch event destinations

By publishing your Amazon SES event data to Amazon CloudWatch, you can create dashboards that track your sending statistics in real time, as well as alarms that notify you when your event metrics reach certain thresholds.

The amount that you’re charged for using CloudWatch is based on several factors, including the number of metrics you use. In order to give you more control over the specific metrics you send to CloudWatch—and to help you avoid unexpected charges—you can limit the email sending events that are sent to CloudWatch.

When you choose CloudWatch as an event destination, you must choose a value source. The value source can be one of three options: a message tag, a link tag, or an email header. After you choose a value source, you then specify a name and a value. When you send an email using a configuration set that refers to a CloudWatch event destination, it only sends the metrics for that email to CloudWatch if the email contains the name and value that you specified as the value source. This requirement is commonly overlooked.

For example, assume that you chose Message Tag as the value source, and specified “CategoryId” as the dimension name and “31415” as the dimension value. When you want to send events for an email to CloudWatch, you must specify the name of the configuration set that uses the CloudWatch destination. You must also include a tag in your message. The name of the tag must be “CategoryId” and the value must be “31415”.

For more information about adding tags and email headers to your messages, see Send Email Using Amazon SES Event Publishing in the Amazon SES Developer Guide. For more information about adding tags to links, see Amazon SES Email Sending Metrics FAQs in the Amazon SES Developer Guide.

Troubleshooting event publishing for open and click data

Occasionally, customers ask why they’re not seeing open and click data for their emails. This issue most often occurs when the customer only sends text versions of their emails. Because of the way Amazon SES tracks open and click events, you can only see open and click data for emails that are sent as HTML. For more information about how Amazon SES modifies your emails when you enable open and click tracking, see Amazon SES Email Sending Metrics FAQs in the Amazon SES Developer Guide.

The process that you use to send HTML emails varies based on the email sending method you use. The Code Examples section of the Amazon SES Developer Guide contains examples of several methods of sending email by using the Amazon SES SMTP interface or an AWS SDK. All of the examples in this section include methods for sending HTML (as well as text-only) emails.

If you encounter any issues that weren’t covered in this post, please open a case in the Support Center and we’d be more than happy to assist.

New AWS Auto Scaling – Unified Scaling For Your Cloud Applications

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-auto-scaling-unified-scaling-for-your-cloud-applications/

I’ve been talking about scalability for servers and other cloud resources for a very long time! Back in 2006, I wrote “This is the new world of scalable, on-demand web services. Pay for what you need and use, and not a byte more.” Shortly after we launched Amazon Elastic Compute Cloud (EC2), we made it easy for you to do this with the simultaneous launch of Elastic Load Balancing, EC2 Auto Scaling, and Amazon CloudWatch. Since then we have added Auto Scaling to other AWS services including ECS, Spot Fleets, DynamoDB, Aurora, AppStream 2.0, and EMR. We have also added features such as target tracking to make it easier for you to scale based on the metric that is most appropriate for your application.

Introducing AWS Auto Scaling
Today we are making it easier for you to use the Auto Scaling features of multiple AWS services from a single user interface with the introduction of AWS Auto Scaling. This new service unifies and builds on our existing, service-specific, scaling features. It operates on any desired EC2 Auto Scaling groups, EC2 Spot Fleets, ECS tasks, DynamoDB tables, DynamoDB Global Secondary Indexes, and Aurora Replicas that are part of your application, as described by an AWS CloudFormation stack or in AWS Elastic Beanstalk (we’re also exploring some other ways to flag a set of resources as an application for use with AWS Auto Scaling).

You no longer need to set up alarms and scaling actions for each resource and each service. Instead, you simply point AWS Auto Scaling at your application and select the services and resources of interest. Then you select the desired scaling option for each one, and AWS Auto Scaling will do the rest, helping you to discover the scalable resources and then creating a scaling plan that addresses the resources of interest.

If you have tried to use any of our Auto Scaling options in the past, you undoubtedly understand the trade-offs involved in choosing scaling thresholds. AWS Auto Scaling gives you a variety of scaling options: You can optimize for availability, keeping plenty of resources in reserve in order to meet sudden spikes in demand. You can optimize for costs, running close to the line and accepting the possibility that you will tax your resources if that spike arrives. Alternatively, you can aim for the middle, with a generous but not excessive level of spare capacity. In addition to optimizing for availability, cost, or a blend of both, you can also set a custom scaling threshold. In each case, AWS Auto Scaling will create scaling policies on your behalf, including appropriate upper and lower bounds for each resource.

AWS Auto Scaling in Action
I will use AWS Auto Scaling on a simple CloudFormation stack consisting of an Auto Scaling group of EC2 instances and a pair of DynamoDB tables. I start by removing the existing Scaling Policies from my Auto Scaling group:

Then I open up the new Auto Scaling Console and selecting the stack:

Behind the scenes, Elastic Beanstalk applications are always launched via a CloudFormation stack. In the screen shot above, awseb-e-sdwttqizbp-stack is an Elastic Beanstalk application that I launched.

I can click on any stack to learn more about it before proceeding:

I select the desired stack and click on Next to proceed. Then I enter a name for my scaling plan and choose the resources that I’d like it to include:

I choose the scaling strategy for each type of resource:

After I have selected the desired strategies, I click Next to proceed. Then I review the proposed scaling plan, and click Create scaling plan to move ahead:

The scaling plan is created and in effect within a few minutes:

I can click on the plan to learn more:

I can also inspect each scaling policy:

I tested my new policy by applying a load to the initial EC2 instance, and watched the scale out activity take place:

I also took a look at the CloudWatch metrics for the EC2 Auto Scaling group:

Available Now
We are launching AWS Auto Scaling today in the US East (Northern Virginia), US East (Ohio), US West (Oregon), EU (Ireland), and Asia Pacific (Singapore) Regions today, with more to follow. There’s no charge for AWS Auto Scaling; you pay only for the CloudWatch Alarms that it creates and any AWS resources that you consume.

As is often the case with our new services, this is just the first step on what we hope to be a long and interesting journey! We have a long roadmap, and we’ll be adding new features and options throughout 2018 in response to your feedback.

Jeff;

Using Amazon CloudWatch and Amazon SNS to Notify when AWS X-Ray Detects Elevated Levels of Latency, Errors, and Faults in Your Application

Post Syndicated from Bharath Kumar original https://aws.amazon.com/blogs/devops/using-amazon-cloudwatch-and-amazon-sns-to-notify-when-aws-x-ray-detects-elevated-levels-of-latency-errors-and-faults-in-your-application/

AWS X-Ray helps developers analyze and debug production applications built using microservices or serverless architectures and quantify customer impact. With X-Ray, you can understand how your application and its underlying services are performing and identify and troubleshoot the root cause of performance issues and errors. You can use these insights to identify issues and opportunities for optimization.

In this blog post, I will show you how you can use Amazon CloudWatch and Amazon SNS to get notified when X-Ray detects high latency, errors, and faults in your application. Specifically, I will show you how to use this sample app to get notified through an email or SMS message when your end users observe high latencies or server-side errors when they use your application. You can customize the alarms and events by updating the sample app code.

Sample App Overview

The sample app uses the X-Ray GetServiceGraph API to get the following information:

  • Aggregated response time.
  • Requests that failed with 4xx status code (errors).
  • 429 status code (throttle).
  • 5xx status code (faults).
Sample app architecture

Overview of sample app architecture

Getting started

The sample app uses AWS CloudFormation to deploy the required resources.
To install the sample app:

  1. Run git clone to get the sample app.
  2. Update the JSON file in the Setup folder with threshold limits and notification details.
  3. Run the install.py script to install the sample app.

For more information about the installation steps, see the readme file on GitHub.

You can update the app configuration to include your phone number or email to get notified when your application in X-Ray breaches the latency, error, and fault limits you set in the configuration. If you prefer to not provide your phone number and email, then you can use the CloudWatch alarm deployed by the sample app to monitor your application in X-Ray.

The sample app deploys resources with the sample app namespace you provided during setup. This enables you to have multiple sample apps in the same region.

CloudWatch rules

The sample app uses two CloudWatch rules:

  1. SCHEDULEDLAMBDAFOR-sample_app_name to trigger at regular intervals the AWS Lambda function that queries the GetServiceGraph API.
  2. XRAYALERTSFOR-sample_app_name to look for published CloudWatch events that match the pattern defined in this rule.
CloudWatch Rules for sample app

CloudWatch rules created for the sample app

CloudWatch alarms

If you did not provide your phone number or email in the JSON file, the sample app uses a CloudWatch alarm named XRayCloudWatchAlarm-sample_app_name in combination with the CloudWatch event that you can use for monitoring.

CloudWatch Alarm for sample app

CloudWatch alarm created for the sample app

Amazon SNS messages

The sample app creates two SNS topics:

  • sample_app_name-cloudwatcheventsnstopic to send out an SMS message when the CloudWatch event matches a pattern published from the Lambda function.
  • sample_app_name-cloudwatchalarmsnstopic to send out an email message when the CloudWatch alarm goes into an ALARM state.
Amazon SNS for sample app

Amazon SNS created for the sample app

Getting notifications

The CloudWatch event looks for the following matching pattern:

{
  "detail-type": [
    "XCW Notification for Alerts"
  ],
  "source": [
    "<sample_app_name>-xcw.alerts"
  ]
}

The event then invokes an SNS topic that sends out an SMS message.

SMS in sample app

SMS that is sent when CloudWatch Event invokes Amazon SNS topic

The CloudWatch alarm looks for the TriggeredRules metric that is published whenever the CloudWatch event matches the event pattern. It goes into the ALARM state whenever TriggeredRules > 0 for the specified evaluation period and invokes an SNS topic that sends an email message.

Email sent in sample app

Email that is sent when CloudWatch Alarm goes to ALARM state

Stopping notifications

If you provided your phone number or email address, but would like to stop getting notified, change the SUBSCRIBE_TO_EMAIL_SMS environment variable in the Lambda function to No. Then, go to the Amazon SNS console and delete the subscriptions. You can still monitor your application for elevated levels of latency, errors, and faults by using the CloudWatch console.

Lambda environment variable in sample app

Change environment variable in Lambda

 

Delete subscription in SNS for sample app

Delete subscriptions to stop getting notified

Uninstalling the sample app

To uninstall the sample app, run the uninstall.py script in the Setup folder.

Extending the sample app

The sample app notifes you when when X-Ray detects high latency, errors, and faults in your application. You can extend it to provide more value for your use cases (for example, to perform an action on a resource when the state of a CloudWatch alarm changes).

To summarize, after this set up you will be able to get notified through Amazon SNS when X-Ray detects high latency, errors and faults in your application.

I hope you found this information about setting up alarms and alerts for your application in AWS X-Ray helpful. Feel free to leave questions or other feedback in the comments. Feel free to learn more about AWS X-Ray, Amazon SNS and Amazon CloudWatch

About the Author

Bharath Kumar is a Sr.Product Manager with AWS X-Ray. He has developed and launched mobile games, web applications on microservices and serverless architecture.

New – Amazon CloudWatch Agent with AWS Systems Manager Integration – Unified Metrics & Log Collection for Linux & Windows

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-amazon-cloudwatch-agent-with-aws-systems-manager-integration-unified-metrics-log-collection-for-linux-windows/

In the past I’ve talked about several agents, deaemons, and scripts that you could use to collect system metrics and log files for your Windows and Linux instances and on-premise services and publish them to Amazon CloudWatch. The data collected by this somewhat disparate collection of tools gave you visibility into the status and behavior of your compute resources, along with the power to take action when a value goes out of range and indicates a potential issue. You can graph any desired metrics on CloudWatch Dashboards, initiate actions via CloudWatch Alarms, and search CloudWatch Logs to find error messages, while taking advantage of our support for custom high-resolution metrics.

New Unified Agent
Today we are taking a nice step forward and launching a new, unified CloudWatch Agent. It runs in the cloud and on-premises, on Linux and Windows instances and servers, and handles metrics and log files. You can deploy it using AWS Systems Manager (SSM) Run Command, SSM State Manager, or from the CLI. Here are some of the most important features:

Single Agent – A single agent now collects both metrics and logs. This simplifies the setup process and reduces complexity.

Cross-Platform / Cross-Environment – The new agent runs in the cloud and on-premises, on 64-bit Linux and 64-bit Windows, and includes HTTP proxy server support.

Configurable – The new agent captures the most useful system metrics automatically. It can be configured to collect hundreds of others, including fine-grained metrics on sub-resources such as CPU threads, mounted filesystems, and network interfaces.

CloudWatch-Friendly – The new agent supports standard 1-minute metrics and the newer 1-second high-resolution metrics. It automatically includes EC2 dimensions such as Instance Id, Image Id, and Auto Scaling Group Name, and also supports the use of custom dimensions. All of the dimensions can be used for custom aggregation across Auto Scaling Groups, applications, and so forth.

Migration – You can easily migrate existing AWS SSM and EC2Config configurations for use with the new agent.

Installing the Agent
The CloudWatch Agent uses an IAM role when running on an EC2 instance, and an IAM user when running on an on-premises server. The role or the user must include the AmazonSSMFullAccess and AmazonEC2ReadOnlyAccess policies. Here’s my role:

I can easily add it to a running instance (this is a relatively new and very handy EC2 feature):

The SSM Agent is already running on my instance. If it wasn’t, I would follow the steps in Installing and Configuring SSM Agent to set it up.

Next, I install the CloudWatch Agent using the AWS Systems Manager:

This takes just a few seconds. Now I can use a simple wizard to set up the configuration file for the agent:

The wizard also lets me set up the log files to be monitored:

The wizard generates a JSON-format config file and stores it on the instance. It also offers me the option to upload the file to my Parameter Store so that I can deploy it to my other instances (I can also do fine-grained customization of the metrics and log collection configuration by editing the file):

Now I can start the CloudWatch Agent using Run Command, supplying the name of my configuration in the Parameter Store:

This runs in a few seconds and the agent begins to publish metrics right away. As I mentioned earlier, the agent can publish fine-grained metrics on the resources inside of or attached to an instance. For example, here are the metrics for each filesystem:

There’s a separate log stream for each monitored log file on each instance:

I can view and search it, just like I can do for any other log stream:

Now Available
The new CloudWatch Agent is available now and you can start using it today in all public AWS Regions, with AWS GovCloud (US) and the Regions in China to follow.

There’s no charge for the agent; you pay the usual CloudWatch prices for logs and custom metrics.

Jeff;

timeShift(GrafanaBuzz, 1w) Issue 25

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2017/12/08/timeshiftgrafanabuzz-1w-issue-25/

Welcome to TimeShift

This week, a few of us from Grafana Labs, along with 4,000 of our closest friends, headed down to chilly Austin, TX for KubeCon + CloudNativeCon North America 2017. We got to see a number of great talks and were thrilled to see Grafana make appearances in some of the presentations. We were also a sponsor of the conference and handed out a ton of swag (we overnighted some of our custom Grafana scarves, which came in handy for Thursday’s snow).

We also announced Grafana Labs has joined the Cloud Native Computing Foundation as a Silver member! We’re excited to share our expertise in time series data visualization and open source software with the CNCF community.


Latest Release

Grafana 4.6.2 is available and includes some bug fixes:

  • Prometheus: Fixes bug with new Prometheus alerts in Grafana. Make sure to download this version if you’re using Prometheus for alerting. More details in the issue. #9777
  • Color picker: Bug after using textbox input field to change/paste color string #9769
  • Cloudwatch: build using golang 1.9.2 #9667, thanks @mtanda
  • Heatmap: Fixed tooltip for “time series buckets” mode #9332
  • InfluxDB: Fixed query editor issue when using > or < operators in WHERE clause #9871

Download Grafana 4.6.2 Now


From the Blogosphere

Grafana Labs Joins the CNCF: Grafana Labs has officially joined the Cloud Native Computing Foundation (CNCF). We look forward to working with the CNCF community to democratize metrics and help unify traditionally disparate information.

Automating Web Performance Regression Alerts: Peter and his team needed a faster and easier way to find web performance regressions at the Wikimedia Foundation. Grafana 4’s alerting features were exactly what they needed. This post covers their journey on setting up alerts for both RUM and synthetic testing and shares the alerts they’ve set up on their dashboards.

How To Install Grafana on Ubuntu 17.10: As you probably guessed from the title, this article walks you through installing and configuring Grafana in the latest version of Ubuntu (or earlier releases). It also covers installing plugins using the Grafana CLI tool.

Prometheus: Starting the Server with Alertmanager, cAdvisor and Grafana: Learn how to monitor Docker from scratch using cAdvisor, Prometheus and Grafana in this detailed, step-by-step walkthrough.

Monitoring Java EE Servers with Prometheus and Payara: In this screencast, Adam uses firehose; a Java EE 7+ metrics gateway for Prometheus, to convert the JSON output into Prometheus statistics and visualizes the data in Grafana.

Monitoring Spark Streaming with InfluxDB and Grafana: This article focuses on how to monitor Apache Spark Streaming applications with InfluxDB and Grafana at scale.


GrafanaCon EU, March 1-2, 2018

We are currently reaching out to everyone who submitted a talk to GrafanaCon and will soon publish the final schedule at grafanacon.org.

Join us March 1-2, 2018 in Amsterdam for 2 days of talks centered around Grafana and the surrounding monitoring ecosystem including Graphite, Prometheus, InfluxData, Elasticsearch, Kubernetes, and more.

Get Your Ticket Now


Grafana Plugins

Lots of plugin updates and a new OpenNMS Helm App plugin to announce! To install or update any plugin in an on-prem Grafana instance, use the Grafana-cli tool, or install and update with 1 click on Hosted Grafana.

NEW PLUGIN

OpenNMS Helm App – The new OpenNMS Helm App plugin replaces the old OpenNMS data source. Helm allows users to create flexible dashboards using both fault management (FM) and performance management (PM) data from OpenNMS® Horizon™ and/or OpenNMS® Meridian™. The old data source is now deprecated.


Install Now

UPDATED PLUGIN

PNP Data Source – This data source plugin (that uses PNP4Nagios to access RRD files) received a small, but important update that fixes template query parsing.


Update

UPDATED PLUGIN

Vonage Status Panel – The latest version of the Status Panel comes with a number of small fixes and changes. Below are a few of the enhancements:

  • Threshold settings – removed Show Always option, and replaced it with 2 options:
    • Display Alias – Select when to show the metric alias.
    • Display Value – Select when to show the metric value.
  • Text format configuration (bold / italic) for warning / critical / disabled states.
  • Option to change the corner radius of the panel. Now you can change the panel’s shape to have rounded corners.

Update

UPDATED PLUGIN

Google Calendar Plugin – This plugin received a small update, so be sure to install version 1.0.4.


Update

UPDATED PLUGIN

Carpet Plot Panel – The Carpet Plot Panel received a fix for IE 11, and also added the ability to choose custom colors.


Update


Upcoming Events:

In between code pushes we like to speak at, sponsor and attend all kinds of conferences and meetups. We also like to make sure we mention other Grafana-related events happening all over the world. If you’re putting on just such an event, let us know and we’ll list it here.

Docker Meetup @ Tuenti | Madrid, Spain – Dec 12, 2017: Javier Provecho: Intro to Metrics with Swarm, Prometheus and Grafana

Learn how to gain visibility in real time for your micro services. We’ll cover how to deploy a Prometheus server with persistence and Grafana, how to enable metrics endpoints for various service types (docker daemon, traefik proxy and postgres) and how to scrape, visualize and set up alarms based on those metrics.

RSVP

Grafana Lyon Meetup n ° 2 | Lyon, France – Dec 14, 2017: This meetup will cover some of the latest innovations in Grafana and discussion about automation. Also, free beer and chips, so – of course you’re going!

RSVP

FOSDEM | Brussels, Belgium – Feb 3-4, 2018: FOSDEM is a free developer conference where thousands of developers of free and open source software gather to share ideas and technology. Carl Bergquist is managing the Cloud and Monitoring Devroom, and we’ve heard there were some great talks submitted. There is no need to register; all are welcome.


Tweet of the Week

We scour Twitter each week to find an interesting/beautiful dashboard and show it off! #monitoringLove

We were thrilled to see our dashboards bigger than life at KubeCon + CloudNativeCon this week. Thanks for snapping a photo and sharing!


Grafana Labs is Hiring!

We are passionate about open source software and thrive on tackling complex challenges to build the future. We ship code from every corner of the globe and love working with the community. If this sounds exciting, you’re in luck – WE’RE HIRING!

Check out our Open Positions


How are we doing?

Hard to believe this is the 25th issue of Timeshift! I have a blast writing these roundups, but Let me know what you think. Submit a comment on this article below, or post something at our community forum. Find an article I haven’t included? Send it my way. Help us make timeShift better!

Follow us on Twitter, like us on Facebook, and join the Grafana Labs community.

Running Windows Containers on Amazon ECS

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/running-windows-containers-on-amazon-ecs/

This post was developed and written by Jeremy Cowan, Thomas Fuller, Samuel Karp, and Akram Chetibi.

Containers have revolutionized the way that developers build, package, deploy, and run applications. Initially, containers only supported code and tooling for Linux applications. With the release of Docker Engine for Windows Server 2016, Windows developers have started to realize the gains that their Linux counterparts have experienced for the last several years.

This week, we’re adding support for running production workloads in Windows containers using Amazon Elastic Container Service (Amazon ECS). Now, Amazon ECS provides an ECS-Optimized Windows Server Amazon Machine Image (AMI). This AMI is based on the EC2 Windows Server 2016 AMI, and includes Docker 17.06 Enterprise Edition and the ECS Agent 1.16. This AMI provides improved instance and container launch time performance. It’s based on Windows Server 2016 Datacenter and includes Docker 17.06.2-ee-5, along with a new version of the ECS agent that now runs as a native Windows service.

In this post, I discuss the benefits of this new support, and walk you through getting started running Windows containers with Amazon ECS.

When AWS released the Windows Server 2016 Base with Containers AMI, the ECS agent ran as a process that made it difficult to monitor and manage. As a service, the agent can be health-checked, managed, and restarted no differently than other Windows services. The AMI also includes pre-cached images for Windows Server Core 2016 and Windows Server Nano Server 2016. By caching the images in the AMI, launching new Windows containers is significantly faster. When Docker images include a layer that’s already cached on the instance, Docker re-uses that layer instead of pulling it from the Docker registry.

The ECS agent and an accompanying ECS PowerShell module used to install, configure, and run the agent come pre-installed on the AMI. This guarantees there is a specific platform version available on the container instance at launch. Because the software is included, you don’t have to download it from the internet. This saves startup time.

The Windows-compatible ECS-optimized AMI also reports CPU and memory utilization and reservation metrics to Amazon CloudWatch. Using the CloudWatch integration with ECS, you can create alarms that trigger dynamic scaling events to automatically add or remove capacity to your EC2 instances and ECS tasks.

Getting started

To help you get started running Windows containers on ECS, I’ve forked the ECS reference architecture, to build an ECS cluster comprised of Windows instances instead of Linux instances. You can pull the latest version of the reference architecture for Windows.

The reference architecture is a layered CloudFormation stack, in that it calls other stacks to create the environment. Within the stack, the ecs-windows-cluster.yaml file contains the instructions for bootstrapping the Windows instances and configuring the ECS cluster. To configure the instances outside of AWS CloudFormation (for example, through the CLI or the console), you can add the following commands to your instance’s user data:

Import-Module ECSTools
Initialize-ECSAgent

Or

Import-Module ECSTools
Initialize-ECSAgent –Cluster MyCluster -EnableIAMTaskRole

If you don’t specify a cluster name when you initialize the agent, the instance is joined to the default cluster.

Adding -EnableIAMTaskRole when initializing the agent adds support for IAM roles for tasks. Previously, enabling this setting meant running a complex script and setting an environment variable before you could assign roles to your ECS tasks.

When you enable IAM roles for tasks on Windows, it consumes port 80 on the host. If you have tasks that listen on port 80 on the host, I recommend configuring a service for them that uses load balancing. You can use port 80 on the load balancer, and the traffic can be routed to another host port on your container instances. For more information, see Service Load Balancing.

Create a cluster

To create a new ECS cluster, choose Launch stack, or pull the GitHub project to your local machine and run the following command:

aws cloudformation create-stack –template-body file://<path to master-windows.yaml> --stack-name <name>

Upload your container image

Now that you have a cluster running, step through how to build and push an image into a container repository. You use a repository hosted in Amazon Elastic Container Registry (Amazon ECR) for this, but you could also use Docker Hub. To build and push an image to a repository, install Docker on your Windows* workstation. You also create a repository and assign the necessary permissions to the account that pushes your image to Amazon ECR. For detailed instructions, see Pushing an Image.

* If you are building an image that is based on Windows layers, then you must use a Windows environment to build and push your image to the registry.

Write your task definition

Now that your image is built and ready, the next step is to run your Windows containers using a task.

Start by creating a new task definition based on the windows-simple-iis image from Docker Hub.

  1. Open the ECS console.
  2. Choose Task Definitions, Create new task definition.
  3. Scroll to the bottom of the page and choose Configure via JSON.
  4. Copy and paste the following JSON into that field.
  5. Choose Save, Create.
{
   "family": "windows-simple-iis",
   "containerDefinitions": [
   {
     "name": "windows_sample_app",
     "image": "microsoft/iis",
     "cpu": 100,
     "entryPoint":["powershell", "-Command"],
     "command":["New-Item -Path C:\\inetpub\\wwwroot\\index.html -Type file -Value '<html><head><title>Amazon ECS Sample App</title> <style>body {margin-top: 40px; background-color: #333;} </style> </head><body> <div style=color:white;text-align:center><h1>Amazon ECS Sample App</h1> <h2>Congratulations!</h2> <p>Your application is now running on a container in Amazon ECS.</p></body></html>'; C:\\ServiceMonitor.exe w3svc"],
     "portMappings": [
     {
       "protocol": "tcp",
       "containerPort": 80,
       "hostPort": 8080
     }
     ],
     "memory": 500,
     "essential": true
   }
   ]
}

You can now go back into the Task Definition page and see windows-simple-iis as an available task definition.

There are a few important aspects of the task definition file to note when working with Windows containers. First, the hostPort is configured as 8080, which is necessary because the ECS agent currently uses port 80 to enable IAM roles for tasks required for least-privilege security configurations.

There are also some fairly standard task parameters that are intentionally not included. For example, network mode is not available with Windows at the time of this release, so keep that setting blank to allow Docker to configure WinNAT, the only option available today.

Also, some parameters work differently with Windows than they do with Linux. The CPU limits that you define in the task definition are absolute, whereas on Linux they are weights. For information about other task parameters that are supported or possibly different with Windows, see the documentation.

Run your containers

At this point, you are ready to run containers. There are two options to run containers with ECS:

  1. Task
  2. Service

A task is typically a short-lived process that ECS creates. It can’t be configured to actively monitor or scale. A service is meant for longer-running containers and can be configured to use a load balancer, minimum/maximum capacity settings, and a number of other knobs and switches to help ensure that your code keeps running. In both cases, you are able to pick a placement strategy and a specific IAM role for your container.

  1. Select the task definition that you created above and choose Action, Run Task.
  2. Leave the settings on the next page to the default values.
  3. Select the ECS cluster created when you ran the CloudFormation template.
  4. Choose Run Task to start the process of scheduling a Docker container on your ECS cluster.

You can now go to the cluster and watch the status of your task. It may take 5–10 minutes for the task to go from PENDING to RUNNING, mostly because it takes time to download all of the layers necessary to run the microsoft/iis image. After the status is RUNNING, you should see the following results:

You may have noticed that the example task definition is named windows-simple-iis:2. This is because I created a second version of the task definition, which is one of the powerful capabilities of using ECS. You can make the task definitions part of your source code and then version them. You can also roll out new versions and practice blue/green deployment, switching to reduce downtime and improve the velocity of your deployments!

After the task has moved to RUNNING, you can see your website hosted in ECS. Find the public IP or DNS for your ECS host. Remember that you are hosting on port 8080. Make sure that the security group allows ingress from your client IP address to that port and that your VPC has an internet gateway associated with it. You should see a page that looks like the following:

This is a nice start to deploying a simple single instance task, but what if you had a Web API to be scaled out and in based on usage? This is where you could look at defining a service and collecting CloudWatch data to add and remove both instances of the task. You could also use CloudWatch alarms to add more ECS container instances and keep up with the demand. The former is built into the configuration of your service.

  1. Select the task definition and choose Create Service.
  2. Associate a load balancer.
  3. Set up Auto Scaling.

The following screenshot shows an example where you would add an additional task instance when the CPU Utilization CloudWatch metric is over 60% on average over three consecutive measurements. This may not be aggressive enough for your requirements; it’s meant to show you the option to scale tasks the same way you scale ECS instances with an Auto Scaling group. The difference is that these tasks start much faster because all of the base layers are already on the ECS host.

Do not confuse task dynamic scaling with ECS instance dynamic scaling. To add additional hosts, see Tutorial: Scaling Container Instances with CloudWatch Alarms.

Conclusion

This is just scratching the surface of the flexibility that you get from using containers and Amazon ECS. For more information, see the Amazon ECS Developer Guide and ECS Resources.

– Jeremy, Thomas, Samuel, Akram

Implementing Canary Deployments of AWS Lambda Functions with Alias Traffic Shifting

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/implementing-canary-deployments-of-aws-lambda-functions-with-alias-traffic-shifting/

This post courtesy of Ryan Green, Software Development Engineer, AWS Serverless

The concepts of blue/green and canary deployments have been around for a while now and have been well-established as best-practices for reducing the risk of software deployments.

In a traditional, horizontally scaled application, copies of the application code are deployed to multiple nodes (instances, containers, on-premises servers, etc.), typically behind a load balancer. In these applications, deploying new versions of software to too many nodes at the same time can impact application availability as there may not be enough healthy nodes to service requests during the deployment. This aggressive approach to deployments also drastically increases the blast radius of software bugs introduced in the new version and does not typically give adequate time to safely assess the quality of the new version against production traffic.

In such applications, one commonly accepted solution to these problems is to slowly and incrementally roll out application software across the nodes in the fleet while simultaneously verifying application health (canary deployments). Another solution is to stand up an entirely different fleet and weight (or flip) traffic over to the new fleet after verification, ideally with some production traffic (blue/green). Some teams deploy to a single host (“one box environment”), where the new release can bake for some time before promotion to the rest of the fleet. Techniques like this enable the maintainers of complex systems to safely test in production while minimizing customer impact.

Enter Serverless

There is somewhat of an impedance mismatch when mapping these concepts to a serverless world. You can’t incrementally deploy your software across a fleet of servers when there are no servers!* In fact, even the term “deployment” takes on a different meaning with functions as a service (FaaS). In AWS Lambda, a “deployment” can be roughly modeled as a call to CreateFunction, UpdateFunctionCode, or UpdateAlias (I won’t get into the semantics of whether updating configuration counts as a deployment), all of which may affect the version of code that is invoked by clients.

The abstractions provided by Lambda remove the need for developers to be concerned about servers and Availability Zones, and this provides a powerful opportunity to greatly simplify the process of deploying software.
*Of course there are servers, but they are abstracted away from the developer.

Traffic shifting with Lambda aliases

Before the release of traffic shifting for Lambda aliases, deployments of a Lambda function could only be performed in a single “flip” by updating function code for version $LATEST, or by updating an alias to target a different function version. After the update propagates, typically within a few seconds, 100% of function invocations execute the new version. Implementing canary deployments with this model required the development of an additional routing layer, further adding development time, complexity, and invocation latency.
While rolling back a bad deployment of a Lambda function is a trivial operation and takes effect near instantaneously, deployments of new versions for critical functions can still be a potentially nerve-racking experience.

With the introduction of alias traffic shifting, it is now possible to trivially implement canary deployments of Lambda functions. By updating additional version weights on an alias, invocation traffic is routed to the new function versions based on the weight specified. Detailed CloudWatch metrics for the alias and version can be analyzed during the deployment, or other health checks performed, to ensure that the new version is healthy before proceeding.

Note: Sometimes the term “canary deployments” refers to the release of software to a subset of users. In the case of alias traffic shifting, the new version is released to some percentage of all users. It’s not possible to shard based on identity without adding an additional routing layer.

Examples

The simplest possible use of a canary deployment looks like the following:

# Update $LATEST version of function
aws lambda update-function-code --function-name myfunction ….

# Publish new version of function
aws lambda publish-version --function-name myfunction

# Point alias to new version, weighted at 5% (original version at 95% of traffic)
aws lambda update-alias --function-name myfunction --name myalias --routing-config '{"AdditionalVersionWeights" : {"2" : 0.05} }'

# Verify that the new version is healthy
…
# Set the primary version on the alias to the new version and reset the additional versions (100% weighted)
aws lambda update-alias --function-name myfunction --name myalias --function-version 2 --routing-config '{}'

This is begging to be automated! Here are a few options.

Simple deployment automation

This simple Python script runs as a Lambda function and deploys another function (how meta!) by incrementally increasing the weight of the new function version over a prescribed number of steps, while checking the health of the new version. If the health check fails, the alias is rolled back to its initial version. The health check is implemented as a simple check against the existence of Errors metrics in CloudWatch for the alias and new version.

GitHub aws-lambda-deploy repo

Install:

git clone https://github.com/awslabs/aws-lambda-deploy
cd aws-lambda-deploy
export BUCKET_NAME=[YOUR_S3_BUCKET_NAME_FOR_BUILD_ARTIFACTS]
./install.sh

Run:

# Rollout version 2 incrementally over 10 steps, with 120s between each step
aws lambda invoke --function-name SimpleDeployFunction --log-type Tail --payload \
  '{"function-name": "MyFunction",
  "alias-name": "MyAlias",
  "new-version": "2",
  "steps": 10,
  "interval" : 120,
  "type": "linear"
  }' output

Description of input parameters

  • function-name: The name of the Lambda function to deploy
  • alias-name: The name of the alias used to invoke the Lambda function
  • new-version: The version identifier for the new version to deploy
  • steps: The number of times the new version weight is increased
  • interval: The amount of time (in seconds) to wait between weight updates
  • type: The function to use to generate the weights. Supported values: “linear”

Because this runs as a Lambda function, it is subject to the maximum timeout of 5 minutes. This may be acceptable for many use cases, but to achieve a slower rollout of the new version, a different solution is required.

Step Functions workflow

This state machine performs essentially the same task as the simple deployment function, but it runs as an asynchronous workflow in AWS Step Functions. A nice property of Step Functions is that the maximum deployment timeout has now increased from 5 minutes to 1 year!

The step function incrementally updates the new version weight based on the steps parameter, waiting for some time based on the interval parameter, and performing health checks between updates. If the health check fails, the alias is rolled back to the original version and the workflow fails.

For example, to execute the workflow:

export STATE_MACHINE_ARN=`aws cloudformation describe-stack-resources --stack-name aws-lambda-deploy-stack --logical-resource-id DeployStateMachine --output text | cut  -d$'\t' -f3`

aws stepfunctions start-execution --state-machine-arn $STATE_MACHINE_ARN --input '{
  "function-name": "MyFunction",
  "alias-name": "MyAlias",
  "new-version": "2",
  "steps": 10,
  "interval": 120,
  "type": "linear"}'

Getting feedback on the deployment

Because the state machine runs asynchronously, retrieving feedback on the deployment requires polling for the execution status using DescribeExecution or implementing an asynchronous notification (using SNS or email, for example) from the Rollback or Finalize functions. A CloudWatch alarm could also be created to alarm based on the “ExecutionsFailed” metric for the state machine.

A note on health checks and observability

Weighted rollouts like this are considerably more successful if the code is being exercised and monitored continuously. In this example, it would help to have some automation continuously invoking the alias and reporting metrics on these invocations, such as client-side success rates and latencies.

The absence of Lambda Errors metrics used in these examples can be misleading if the function is not getting invoked. It’s also recommended to instrument your Lambda functions with custom metrics, in addition to Lambda’s built-in metrics, that can be used to monitor health during deployments.

Extensibility

These examples could be easily extended in various ways to support different use cases. For example:

  • Health check implementations: CloudWatch alarms, automatic invocations with payload assertions, querying external systems, etc.
  • Weight increase functions: Exponential, geometric progression, single canary step, etc.
  • Custom success/failure notifications: SNS, email, CI/CD systems, service discovery systems, etc.

Traffic shifting with SAM and CodeDeploy

Using the Lambda UpdateAlias operation with additional version weights provides a powerful primitive for you to implement custom traffic shifting solutions for Lambda functions.

For those not interested in building custom deployment solutions, AWS CodeDeploy provides an intuitive turn-key implementation of this functionality integrated directly into the Serverless Application Model. Traffic-shifted deployments can be declared in a SAM template, and CodeDeploy manages the function rollout as part of the CloudFormation stack update. CloudWatch alarms can also be configured to trigger a stack rollback if something goes wrong.

i.e.

MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    FunctionName: MyFunction
    AutoPublishAlias: MyFunctionInvokeAlias
    DeploymentPreference:
      Type: Linear10PercentEvery1Minute
      Role:
        Fn::GetAtt: [ DeploymentRole, Arn ]
      Alarms:
       - { Ref: MyFunctionErrorsAlarm }
...

For more information about using CodeDeploy with SAM, see Automating Updates to Serverless Apps.

Conclusion

It is often the simple features that provide the most value. As I demonstrated in this post, serverless architectures allow the complex deployment orchestration used in traditional applications to be replaced with a simple Lambda function or Step Functions workflow. By allowing invocation traffic to be easily weighted to multiple function versions, Lambda alias traffic shifting provides a simple but powerful feature that I hope empowers you to easily implement safe deployment workflows for your Lambda functions.

Event-Driven Computing with Amazon SNS and AWS Compute, Storage, Database, and Networking Services

Post Syndicated from Christie Gifrin original https://aws.amazon.com/blogs/compute/event-driven-computing-with-amazon-sns-compute-storage-database-and-networking-services/

Contributed by Otavio Ferreira, Manager, Software Development, AWS Messaging

Like other developers around the world, you may be tackling increasingly complex business problems. A key success factor, in that case, is the ability to break down a large project scope into smaller, more manageable components. A service-oriented architecture guides you toward designing systems as a collection of loosely coupled, independently scaled, and highly reusable services. Microservices take this even further. To improve performance and scalability, they promote fine-grained interfaces and lightweight protocols.

However, the communication among isolated microservices can be challenging. Services are often deployed onto independent servers and don’t share any compute or storage resources. Also, you should avoid hard dependencies among microservices, to preserve maintainability and reusability.

If you apply the pub/sub design pattern, you can effortlessly decouple and independently scale out your microservices and serverless architectures. A pub/sub messaging service, such as Amazon SNS, promotes event-driven computing that statically decouples event publishers from subscribers, while dynamically allowing for the exchange of messages between them. An event-driven architecture also introduces the responsiveness needed to deal with complex problems, which are often unpredictable and asynchronous.

What is event-driven computing?

Given the context of microservices, event-driven computing is a model in which subscriber services automatically perform work in response to events triggered by publisher services. This paradigm can be applied to automate workflows while decoupling the services that collectively and independently work to fulfil these workflows. Amazon SNS is an event-driven computing hub, in the AWS Cloud, that has native integration with several AWS publisher and subscriber services.

Which AWS services publish events to SNS natively?

Several AWS services have been integrated as SNS publishers and, therefore, can natively trigger event-driven computing for a variety of use cases. In this post, I specifically cover AWS compute, storage, database, and networking services, as depicted below.

Compute services

  • Auto Scaling: Helps you ensure that you have the correct number of Amazon EC2 instances available to handle the load for your application. You can configure Auto Scaling lifecycle hooks to trigger events, as Auto Scaling resizes your EC2 cluster.As an example, you may want to warm up the local cache store on newly launched EC2 instances, and also download log files from other EC2 instances that are about to be terminated. To make this happen, set an SNS topic as your Auto Scaling group’s notification target, then subscribe two Lambda functions to this SNS topic. The first function is responsible for handling scale-out events (to warm up cache upon provisioning), whereas the second is in charge of handling scale-in events (to download logs upon termination).

  • AWS Elastic Beanstalk: An easy-to-use service for deploying and scaling web applications and web services developed in a number of programming languages. You can configure event notifications for your Elastic Beanstalk environment so that notable events can be automatically published to an SNS topic, then pushed to topic subscribers.As an example, you may use this event-driven architecture to coordinate your continuous integration pipeline (such as Jenkins CI). That way, whenever an environment is created, Elastic Beanstalk publishes this event to an SNS topic, which triggers a subscribing Lambda function, which then kicks off a CI job against your newly created Elastic Beanstalk environment.

  • Elastic Load Balancing: Automatically distributes incoming application traffic across Amazon EC2 instances, containers, or other resources identified by IP addresses.You can configure CloudWatch alarms on Elastic Load Balancing metrics, to automate the handling of events derived from Classic Load Balancers. As an example, you may leverage this event-driven design to automate latency profiling in an Amazon ECS cluster behind a Classic Load Balancer. In this example, whenever your ECS cluster breaches your load balancer latency threshold, an event is posted by CloudWatch to an SNS topic, which then triggers a subscribing Lambda function. This function runs a task on your ECS cluster to trigger a latency profiling tool, hosted on the cluster itself. This can enhance your latency troubleshooting exercise by making it timely.

Storage services

  • Amazon S3: Object storage built to store and retrieve any amount of data.You can enable S3 event notifications, and automatically get them posted to SNS topics, to automate a variety of workflows. For instance, imagine that you have an S3 bucket to store incoming resumes from candidates, and a fleet of EC2 instances to encode these resumes from their original format (such as Word or text) into a portable format (such as PDF).In this example, whenever new files are uploaded to your input bucket, S3 publishes these events to an SNS topic, which in turn pushes these messages into subscribing SQS queues. Then, encoding workers running on EC2 instances poll these messages from the SQS queues; retrieve the original files from the input S3 bucket; encode them into PDF; and finally store them in an output S3 bucket.

  • Amazon EFS: Provides simple and scalable file storage, for use with Amazon EC2 instances, in the AWS Cloud.You can configure CloudWatch alarms on EFS metrics, to automate the management of your EFS systems. For example, consider a highly parallelized genomics analysis application that runs against an EFS system. By default, this file system is instantiated on the “General Purpose” performance mode. Although this performance mode allows for lower latency, it might eventually impose a scaling bottleneck. Therefore, you may leverage an event-driven design to handle it automatically.Basically, as soon as the EFS metric “Percent I/O Limit” breaches 95%, CloudWatch could post this event to an SNS topic, which in turn would push this message into a subscribing Lambda function. This function automatically creates a new file system, this time on the “Max I/O” performance mode, then switches the genomics analysis application to this new file system. As a result, your application starts experiencing higher I/O throughput rates.

  • Amazon Glacier: A secure, durable, and low-cost cloud storage service for data archiving and long-term backup.You can set a notification configuration on an Amazon Glacier vault so that when a job completes, a message is published to an SNS topic. Retrieving an archive from Amazon Glacier is a two-step asynchronous operation, in which you first initiate a job, and then download the output after the job completes. Therefore, SNS helps you eliminate polling your Amazon Glacier vault to check whether your job has been completed, or not. As usual, you may subscribe SQS queues, Lambda functions, and HTTP endpoints to your SNS topic, to be notified when your Amazon Glacier job is done.

  • AWS Snowball: A petabyte-scale data transport solution that uses secure appliances to transfer large amounts of data.You can leverage Snowball notifications to automate workflows related to importing data into and exporting data from AWS. More specifically, whenever your Snowball job status changes, Snowball can publish this event to an SNS topic, which in turn can broadcast the event to all its subscribers.As an example, imagine a Geographic Information System (GIS) that distributes high-resolution satellite images to users via Web browser. In this example, the GIS vendor could capture up to 80 TB of satellite images; create a Snowball job to import these files from an on-premises system to an S3 bucket; and provide an SNS topic ARN to be notified upon job status changes in Snowball. After Snowball changes the job status from “Importing” to “Completed”, Snowball publishes this event to the specified SNS topic, which delivers this message to a subscribing Lambda function, which finally creates a CloudFront web distribution for the target S3 bucket, to serve the images to end users.

Database services

  • Amazon RDS: Makes it easy to set up, operate, and scale a relational database in the cloud.RDS leverages SNS to broadcast notifications when RDS events occur. As usual, these notifications can be delivered via any protocol supported by SNS, including SQS queues, Lambda functions, and HTTP endpoints.As an example, imagine that you own a social network website that has experienced organic growth, and needs to scale its compute and database resources on demand. In this case, you could provide an SNS topic to listen to RDS DB instance events. When the “Low Storage” event is published to the topic, SNS pushes this event to a subscribing Lambda function, which in turn leverages the RDS API to increase the storage capacity allocated to your DB instance. The provisioning itself takes place within the specified DB maintenance window.

  • Amazon ElastiCache: A web service that makes it easy to deploy, operate, and scale an in-memory data store or cache in the cloud.ElastiCache can publish messages using Amazon SNS when significant events happen on your cache cluster. This feature can be used to refresh the list of servers on client machines connected to individual cache node endpoints of a cache cluster. For instance, an ecommerce website fetches product details from a cache cluster, with the goal of offloading a relational database and speeding up page load times. Ideally, you want to make sure that each web server always has an updated list of cache servers to which to connect.To automate this node discovery process, you can get your ElastiCache cluster to publish events to an SNS topic. Thus, when ElastiCache event “AddCacheNodeComplete” is published, your topic then pushes this event to all subscribing HTTP endpoints that serve your ecommerce website, so that these HTTP servers can update their list of cache nodes.

  • Amazon Redshift: A fully managed data warehouse that makes it simple to analyze data using standard SQL and BI (Business Intelligence) tools.Amazon Redshift uses SNS to broadcast relevant events so that data warehouse workflows can be automated. As an example, imagine a news website that sends clickstream data to a Kinesis Firehose stream, which then loads the data into Amazon Redshift, so that popular news and reading preferences might be surfaced on a BI tool. At some point though, this Amazon Redshift cluster might need to be resized, and the cluster enters a ready-only mode. Hence, this Amazon Redshift event is published to an SNS topic, which delivers this event to a subscribing Lambda function, which finally deletes the corresponding Kinesis Firehose delivery stream, so that clickstream data uploads can be put on hold.At a later point, after Amazon Redshift publishes the event that the maintenance window has been closed, SNS notifies a subscribing Lambda function accordingly, so that this function can re-create the Kinesis Firehose delivery stream, and resume clickstream data uploads to Amazon Redshift.

  • AWS DMS: Helps you migrate databases to AWS quickly and securely. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database.DMS also uses SNS to provide notifications when DMS events occur, which can automate database migration workflows. As an example, you might create data replication tasks to migrate an on-premises MS SQL database, composed of multiple tables, to MySQL. Thus, if replication tasks fail due to incompatible data encoding in the source tables, these events can be published to an SNS topic, which can push these messages into a subscribing SQS queue. Then, encoders running on EC2 can poll these messages from the SQS queue, encode the source tables into a compatible character set, and restart the corresponding replication tasks in DMS. This is an event-driven approach to a self-healing database migration process.

Networking services

  • Amazon Route 53: A highly available and scalable cloud-based DNS (Domain Name System). Route 53 health checks monitor the health and performance of your web applications, web servers, and other resources.You can set CloudWatch alarms and get automated Amazon SNS notifications when the status of your Route 53 health check changes. As an example, imagine an online payment gateway that reports the health of its platform to merchants worldwide, via a status page. This page is hosted on EC2 and fetches platform health data from DynamoDB. In this case, you could configure a CloudWatch alarm for your Route 53 health check, so that when the alarm threshold is breached, and the payment gateway is no longer considered healthy, then CloudWatch publishes this event to an SNS topic, which pushes this message to a subscribing Lambda function, which finally updates the DynamoDB table that populates the status page. This event-driven approach avoids any kind of manual update to the status page visited by merchants.

  • AWS Direct Connect (AWS DX): Makes it easy to establish a dedicated network connection from your premises to AWS, which can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections.You can monitor physical DX connections using CloudWatch alarms, and send SNS messages when alarms change their status. As an example, when a DX connection state shifts to 0 (zero), indicating that the connection is down, this event can be published to an SNS topic, which can fan out this message to impacted servers through HTTP endpoints, so that they might reroute their traffic through a different connection instead. This is an event-driven approach to connectivity resilience.

More event-driven computing on AWS

In addition to SNS, event-driven computing is also addressed by Amazon CloudWatch Events, which delivers a near real-time stream of system events that describe changes in AWS resources. With CloudWatch Events, you can route each event type to one or more targets, including:

Many AWS services publish events to CloudWatch. As an example, you can get CloudWatch Events to capture events on your ETL (Extract, Transform, Load) jobs running on AWS Glue and push failed ones to an SQS queue, so that you can retry them later.

Conclusion

Amazon SNS is a pub/sub messaging service that can be used as an event-driven computing hub to AWS customers worldwide. By capturing events natively triggered by AWS services, such as EC2, S3 and RDS, you can automate and optimize all kinds of workflows, namely scaling, testing, encoding, profiling, broadcasting, discovery, failover, and much more. Business use cases presented in this post ranged from recruiting websites, to scientific research, geographic systems, social networks, retail websites, and news portals.

Start now by visiting Amazon SNS in the AWS Management Console, or by trying the AWS 10-Minute Tutorial, Send Fan-out Event Notifications with Amazon SNS and Amazon SQS.

 

Capturing Custom, High-Resolution Metrics from Containers Using AWS Step Functions and AWS Lambda

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/capturing-custom-high-resolution-metrics-from-containers-using-aws-step-functions-and-aws-lambda/

Contributed by Trevor Sullivan, AWS Solutions Architect

When you deploy containers with Amazon ECS, are you gathering all of the key metrics so that you can correctly monitor the overall health of your ECS cluster?

By default, ECS writes metrics to Amazon CloudWatch in 5-minute increments. For complex or large services, this may not be sufficient to make scaling decisions quickly. You may want to respond immediately to changes in workload or to identify application performance problems. Last July, CloudWatch announced support for high-resolution metrics, up to a per-second basis.

These high-resolution metrics can be used to give you a clearer picture of the load and performance for your applications, containers, clusters, and hosts. In this post, I discuss how you can use AWS Step Functions, along with AWS Lambda, to cost effectively record high-resolution metrics into CloudWatch. You implement this solution using a serverless architecture, which keeps your costs low and makes it easier to troubleshoot the solution.

To show how this works, you retrieve some useful metric data from an ECS cluster running in the same AWS account and region (Oregon, us-west-2) as the Step Functions state machine and Lambda function. However, you can use this architecture to retrieve any custom application metrics from any resource in any AWS account and region.

Why Step Functions?

Step Functions enables you to orchestrate multi-step tasks in the AWS Cloud that run for any period of time, up to a year. Effectively, you’re building a blueprint for an end-to-end process. After it’s built, you can execute the process as many times as you want.

For this architecture, you gather metrics from an ECS cluster, every five seconds, and then write the metric data to CloudWatch. After your ECS cluster metrics are stored in CloudWatch, you can create CloudWatch alarms to notify you. An alarm can also trigger an automated remediation activity such as scaling ECS services, when a metric exceeds a threshold defined by you.

When you build a Step Functions state machine, you define the different states inside it as JSON objects. The bulk of the work in Step Functions is handled by the common task state, which invokes Lambda functions or Step Functions activities. There is also a built-in library of other useful states that allow you to control the execution flow of your program.

One of the most useful state types in Step Functions is the parallel state. Each parallel state in your state machine can have one or more branches, each of which is executed in parallel. Another useful state type is the wait state, which waits for a period of time before moving to the next state.

In this walkthrough, you combine these three states (parallel, wait, and task) to create a state machine that triggers a Lambda function, which then gathers metrics from your ECS cluster.

Step Functions pricing

This state machine is executed every minute, resulting in 60 executions per hour, and 1,440 executions per day. Step Functions is billed per state transition, including the Start and End state transitions, and giving you approximately 37,440 state transitions per day. To reach this number, I’m using this estimated math:

26 state transitions per-execution x 60 minutes x 24 hours

Based on current pricing, at $0.000025 per state transition, the daily cost of this metric gathering state machine would be $0.936.

Step Functions offers an indefinite 4,000 free state transitions every month. This benefit is available to all customers, not just customers who are still under the 12-month AWS Free Tier. For more information and cost example scenarios, see Step Functions pricing.

Why Lambda?

The goal is to capture metrics from an ECS cluster, and write the metric data to CloudWatch. This is a straightforward, short-running process that makes Lambda the perfect place to run your code. Lambda is one of the key services that makes up “Serverless” application architectures. It enables you to consume compute capacity only when your code is actually executing.

The process of gathering metric data from ECS and writing it to CloudWatch takes a short period of time. In fact, my average Lambda function execution time, while developing this post, is only about 250 milliseconds on average. For every five-second interval that occurs, I’m only using 1/20th of the compute time that I’d otherwise be paying for.

Lambda pricing

For billing purposes, Lambda execution time is rounded up to the nearest 100-ms interval. In general, based on the metrics that I observed during development, a 250-ms runtime would be billed at 300 ms. Here, I calculate the cost of this Lambda function executing on a daily basis.

Assuming 31 days in each month, there would be 535,680 five-second intervals (31 days x 24 hours x 60 minutes x 12 five-second intervals = 535,680). The Lambda function is invoked every five-second interval, by the Step Functions state machine, and runs for a 300-ms period. At current Lambda pricing, for a 128-MB function, you would be paying approximately the following:

Total compute

Total executions = 535,680
Total compute = total executions x (3 x $0.000000208 per 100 ms) = $0.334 per day

Total requests

Total requests = (535,680 / 1000000) * $0.20 per million requests = $0.11 per day

Total Lambda Cost

$0.11 requests + $0.334 compute time = $0.444 per day

Similar to Step Functions, Lambda offers an indefinite free tier. For more information, see Lambda Pricing.

Walkthrough

In the following sections, I step through the process of configuring the solution just discussed. If you follow along, at a high level, you will:

  • Configure an IAM role and policy
  • Create a Step Functions state machine to control metric gathering execution
  • Create a metric-gathering Lambda function
  • Configure a CloudWatch Events rule to trigger the state machine
  • Validate the solution

Prerequisites

You should already have an AWS account with a running ECS cluster. If you don’t have one running, you can easily deploy a Docker container on an ECS cluster using the AWS Management Console. In the example produced for this post, I use an ECS cluster running Windows Server (currently in beta), but either a Linux or Windows Server cluster works.

Create an IAM role and policy

First, create an IAM role and policy that enables Step Functions, Lambda, and CloudWatch to communicate with each other.

  • The CloudWatch Events rule needs permissions to trigger the Step Functions state machine.
  • The Step Functions state machine needs permissions to trigger the Lambda function.
  • The Lambda function needs permissions to query ECS and then write to CloudWatch Logs and metrics.

When you create the state machine, Lambda function, and CloudWatch Events rule, you assign this role to each of those resources. Upon execution, each of these resources assumes the specified role and executes using the role’s permissions.

  1. Open the IAM console.
  2. Choose Roles, create New Role.
  3. For Role Name, enter WriteMetricFromStepFunction.
  4. Choose Save.

Create the IAM role trust relationship
The trust relationship (also known as the assume role policy document) for your IAM role looks like the following JSON document. As you can see from the document, your IAM role needs to trust the Lambda, CloudWatch Events, and Step Functions services. By configuring your role to trust these services, they can assume this role and inherit the role permissions.

  1. Open the IAM console.
  2. Choose Roles and select the IAM role previously created.
  3. Choose Trust RelationshipsEdit Trust Relationships.
  4. Enter the following trust policy text and choose Save.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "events.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "states.us-west-2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Create an IAM policy

After you’ve finished configuring your role’s trust relationship, grant the role access to the other AWS resources that make up the solution.

The IAM policy is what gives your IAM role permissions to access various resources. You must whitelist explicitly the specific resources to which your role has access, because the default IAM behavior is to deny access to any AWS resources.

I’ve tried to keep this policy document as generic as possible, without allowing permissions to be too open. If the name of your ECS cluster is different than the one in the example policy below, make sure that you update the policy document before attaching it to your IAM role. You can attach this policy as an inline policy, instead of creating the policy separately first. However, either approach is valid.

  1. Open the IAM console.
  2. Select the IAM role, and choose Permissions.
  3. Choose Add in-line policy.
  4. Choose Custom Policy and then enter the following policy. The inline policy name does not matter.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [ "logs:*" ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [ "cloudwatch:PutMetricData" ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [ "states:StartExecution" ],
            "Resource": [
                "arn:aws:states:*:*:stateMachine:WriteMetricFromStepFunction"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [ "lambda:InvokeFunction" ],
            "Resource": "arn:aws:lambda:*:*:function:WriteMetricFromStepFunction"
        },
        {
            "Effect": "Allow",
            "Action": [ "ecs:Describe*" ],
            "Resource": "arn:aws:ecs:*:*:cluster/ECSEsgaroth"
        }
    ]
}

Create a Step Functions state machine

In this section, you create a Step Functions state machine that invokes the metric-gathering Lambda function every five (5) seconds, for a one-minute period. If you divide a minute (60) seconds into equal parts of five-second intervals, you get 12. Based on this math, you create 12 branches, in a single parallel state, in the state machine. Each branch triggers the metric-gathering Lambda function at a different five-second marker, throughout the one-minute period. After all of the parallel branches finish executing, the Step Functions execution completes and another begins.

Follow these steps to create your Step Functions state machine:

  1. Open the Step Functions console.
  2. Choose DashboardCreate State Machine.
  3. For State Machine Name, enter WriteMetricFromStepFunction.
  4. Enter the state machine code below into the editor. Make sure that you insert your own AWS account ID for every instance of “676655494xxx”
  5. Choose Create State Machine.
  6. Select the WriteMetricFromStepFunction IAM role that you previously created.
{
    "Comment": "Writes ECS metrics to CloudWatch every five seconds, for a one-minute period.",
    "StartAt": "ParallelMetric",
    "States": {
      "ParallelMetric": {
        "Type": "Parallel",
        "Branches": [
          {
            "StartAt": "WriteMetricLambda",
            "States": {
             	"WriteMetricLambda": {
                  "Type": "Task",
				  "Resource": "arn:aws:lambda:us-west-2:676655494xxx:function:WriteMetricFromStepFunction",
                  "End": true
                } 
            }
          },
    	  {
            "StartAt": "WaitFive",
            "States": {
            	"WaitFive": {
            		"Type": "Wait",
            		"Seconds": 5,
            		"Next": "WriteMetricLambdaFive"
          		},
             	"WriteMetricLambdaFive": {
                  "Type": "Task",
				  "Resource": "arn:aws:lambda:us-west-2:676655494xxx:function:WriteMetricFromStepFunction",
                  "End": true
                } 
            }
          },
    	  {
            "StartAt": "WaitTen",
            "States": {
            	"WaitTen": {
            		"Type": "Wait",
            		"Seconds": 10,
            		"Next": "WriteMetricLambda10"
          		},
             	"WriteMetricLambda10": {
                  "Type": "Task",
                  "Resource": "arn:aws:lambda:us-west-2:676655494xxx:function:WriteMetricFromStepFunction",
                  "End": true
                } 
            }
          },
    	  {
            "StartAt": "WaitFifteen",
            "States": {
            	"WaitFifteen": {
            		"Type": "Wait",
            		"Seconds": 15,
            		"Next": "WriteMetricLambda15"
          		},
             	"WriteMetricLambda15": {
                  "Type": "Task",
                  "Resource": "arn:aws:lambda:us-west-2:676655494xxx:function:WriteMetricFromStepFunction",
                  "End": true
                } 
            }
          },
          {
            "StartAt": "Wait20",
            "States": {
            	"Wait20": {
            		"Type": "Wait",
            		"Seconds": 20,
            		"Next": "WriteMetricLambda20"
          		},
             	"WriteMetricLambda20": {
                  "Type": "Task",
                  "Resource": "arn:aws:lambda:us-west-2:676655494xxx:function:WriteMetricFromStepFunction",
                  "End": true
                } 
            }
          },
          {
            "StartAt": "Wait25",
            "States": {
            	"Wait25": {
            		"Type": "Wait",
            		"Seconds": 25,
            		"Next": "WriteMetricLambda25"
          		},
             	"WriteMetricLambda25": {
                  "Type": "Task",
                  "Resource": "arn:aws:lambda:us-west-2:676655494xxx:function:WriteMetricFromStepFunction",
                  "End": true
                } 
            }
          },
          {
            "StartAt": "Wait30",
            "States": {
            	"Wait30": {
            		"Type": "Wait",
            		"Seconds": 30,
            		"Next": "WriteMetricLambda30"
          		},
             	"WriteMetricLambda30": {
                  "Type": "Task",
                  "Resource": "arn:aws:lambda:us-west-2:676655494xxx:function:WriteMetricFromStepFunction",
                  "End": true
                } 
            }
          },
          {
            "StartAt": "Wait35",
            "States": {
            	"Wait35": {
            		"Type": "Wait",
            		"Seconds": 35,
            		"Next": "WriteMetricLambda35"
          		},
             	"WriteMetricLambda35": {
                  "Type": "Task",
                  "Resource": "arn:aws:lambda:us-west-2:676655494xxx:function:WriteMetricFromStepFunction",
                  "End": true
                } 
            }
          },
          {
            "StartAt": "Wait40",
            "States": {
            	"Wait40": {
            		"Type": "Wait",
            		"Seconds": 40,
            		"Next": "WriteMetricLambda40"
          		},
             	"WriteMetricLambda40": {
                  "Type": "Task",
                  "Resource": "arn:aws:lambda:us-west-2:676655494xxx:function:WriteMetricFromStepFunction",
                  "End": true
                } 
            }
          },
          {
            "StartAt": "Wait45",
            "States": {
            	"Wait45": {
            		"Type": "Wait",
            		"Seconds": 45,
            		"Next": "WriteMetricLambda45"
          		},
             	"WriteMetricLambda45": {
                  "Type": "Task",
                  "Resource": "arn:aws:lambda:us-west-2:676655494xxx:function:WriteMetricFromStepFunction",
                  "End": true
                } 
            }
          },
          {
            "StartAt": "Wait50",
            "States": {
            	"Wait50": {
            		"Type": "Wait",
            		"Seconds": 50,
            		"Next": "WriteMetricLambda50"
          		},
             	"WriteMetricLambda50": {
                  "Type": "Task",
                  "Resource": "arn:aws:lambda:us-west-2:676655494xxx:function:WriteMetricFromStepFunction",
                  "End": true
                } 
            }
          },
          {
            "StartAt": "Wait55",
            "States": {
            	"Wait55": {
            		"Type": "Wait",
            		"Seconds": 55,
            		"Next": "WriteMetricLambda55"
          		},
             	"WriteMetricLambda55": {
                  "Type": "Task",
                  "Resource": "arn:aws:lambda:us-west-2:676655494xxx:function:WriteMetricFromStepFunction",
                  "End": true
                } 
            }
          }
        ],
        "End": true
      }
  }
}

Now you’ve got a shiny new Step Functions state machine! However, you might ask yourself, “After the state machine has been created, how does it get executed?” Before I answer that question, create the Lambda function that writes the custom metric, and then you get the end-to-end process moving.

Create a Lambda function

The meaty part of the solution is a Lambda function, written to consume the Python 3.6 runtime, that retrieves metric values from ECS, and then writes them to CloudWatch. This Lambda function is what the Step Functions state machine is triggering every five seconds, via the Task states. Key points to remember:

The Lambda function needs permission to:

  • Write CloudWatch metrics (PutMetricData API).
  • Retrieve metrics from ECS clusters (DescribeCluster API).
  • Write StdOut to CloudWatch Logs.

Boto3, the AWS SDK for Python, is included in the Lambda execution environment for Python 2.x and 3.x.

Because Lambda includes the AWS SDK, you don’t have to worry about packaging it up and uploading it to Lambda. You can focus on writing code and automatically take a dependency on boto3.

As for permissions, you’ve already created the IAM role and attached a policy to it that enables your Lambda function to access the necessary API actions. When you create your Lambda function, make sure that you select the correct IAM role, to ensure it is invoked with the correct permissions.

The following Lambda function code is generic. So how does the Lambda function know which ECS cluster to gather metrics for? Your Step Functions state machine automatically passes in its state to the Lambda function. When you create your CloudWatch Events rule, you specify a simple JSON object that passes the desired ECS cluster name into your Step Functions state machine, which then passes it to the Lambda function.

Use the following property values as you create your Lambda function:

Function Name: WriteMetricFromStepFunction
Description: This Lambda function retrieves metric values from an ECS cluster and writes them to Amazon CloudWatch.
Runtime: Python3.6
Memory: 128 MB
IAM Role: WriteMetricFromStepFunction

import boto3

def handler(event, context):
    cw = boto3.client('cloudwatch')
    ecs = boto3.client('ecs')
    print('Got boto3 client objects')
    
    Dimension = {
        'Name': 'ClusterName',
        'Value': event['ECSClusterName']
    }

    cluster = get_ecs_cluster(ecs, Dimension['Value'])
    
    cw_args = {
       'Namespace': 'ECS',
       'MetricData': [
           {
               'MetricName': 'RunningTask',
               'Dimensions': [ Dimension ],
               'Value': cluster['runningTasksCount'],
               'Unit': 'Count',
               'StorageResolution': 1
           },
           {
               'MetricName': 'PendingTask',
               'Dimensions': [ Dimension ],
               'Value': cluster['pendingTasksCount'],
               'Unit': 'Count',
               'StorageResolution': 1
           },
           {
               'MetricName': 'ActiveServices',
               'Dimensions': [ Dimension ],
               'Value': cluster['activeServicesCount'],
               'Unit': 'Count',
               'StorageResolution': 1
           },
           {
               'MetricName': 'RegisteredContainerInstances',
               'Dimensions': [ Dimension ],
               'Value': cluster['registeredContainerInstancesCount'],
               'Unit': 'Count',
               'StorageResolution': 1
           }
        ]
    }
    cw.put_metric_data(**cw_args)
    print('Finished writing metric data')
    
def get_ecs_cluster(client, cluster_name):
    cluster = client.describe_clusters(clusters = [ cluster_name ])
    print('Retrieved cluster details from ECS')
    return cluster['clusters'][0]

Create the CloudWatch Events rule

Now you’ve created an IAM role and policy, Step Functions state machine, and Lambda function. How do these components actually start communicating with each other? The final step in this process is to set up a CloudWatch Events rule that triggers your metric-gathering Step Functions state machine every minute. You have two choices for your CloudWatch Events rule expression: rate or cron. In this example, use the cron expression.

A couple key learning points from creating the CloudWatch Events rule:

  • You can specify one or more targets, of different types (for example, Lambda function, Step Functions state machine, SNS topic, and so on).
  • You’re required to specify an IAM role with permissions to trigger your target.
    NOTE: This applies only to certain types of targets, including Step Functions state machines.
  • Each target that supports IAM roles can be triggered using a different IAM role, in the same CloudWatch Events rule.
  • Optional: You can provide custom JSON that is passed to your target Step Functions state machine as input.

Follow these steps to create the CloudWatch Events rule:

  1. Open the CloudWatch console.
  2. Choose Events, RulesCreate Rule.
  3. Select Schedule, Cron Expression, and then enter the following rule:
    0/1 * * * ? *
  4. Choose Add Target, Step Functions State MachineWriteMetricFromStepFunction.
  5. For Configure Input, select Constant (JSON Text).
  6. Enter the following JSON input, which is passed to Step Functions, while changing the cluster name accordingly:
    { "ECSClusterName": "ECSEsgaroth" }
  7. Choose Use Existing Role, WriteMetricFromStepFunction (the IAM role that you previously created).

After you’ve completed with these steps, your screen should look similar to this:

Validate the solution

Now that you have finished implementing the solution to gather high-resolution metrics from ECS, validate that it’s working properly.

  1. Open the CloudWatch console.
  2. Choose Metrics.
  3. Choose custom and select the ECS namespace.
  4. Choose the ClusterName metric dimension.

You should see your metrics listed below.

Troubleshoot configuration issues

If you aren’t receiving the expected ECS cluster metrics in CloudWatch, check for the following common configuration issues. Review the earlier procedures to make sure that the resources were properly configured.

  • The IAM role’s trust relationship is incorrectly configured.
    Make sure that the IAM role trusts Lambda, CloudWatch Events, and Step Functions in the correct region.
  • The IAM role does not have the correct policies attached to it.
    Make sure that you have copied the IAM policy correctly as an inline policy on the IAM role.
  • The CloudWatch Events rule is not triggering new Step Functions executions.
    Make sure that the target configuration on the rule has the correct Step Functions state machine and IAM role selected.
  • The Step Functions state machine is being executed, but failing part way through.
    Examine the detailed error message on the failed state within the failed Step Functions execution. It’s possible that the
  • IAM role does not have permissions to trigger the target Lambda function, that the target Lambda function may not exist, or that the Lambda function failed to complete successfully due to invalid permissions.
    Although the above list covers several different potential configuration issues, it is not comprehensive. Make sure that you understand how each service is connected to each other, how permissions are granted through IAM policies, and how IAM trust relationships work.

Conclusion

In this post, you implemented a Serverless solution to gather and record high-resolution application metrics from containers running on Amazon ECS into CloudWatch. The solution consists of a Step Functions state machine, Lambda function, CloudWatch Events rule, and an IAM role and policy. The data that you gather from this solution helps you rapidly identify issues with an ECS cluster.

To gather high-resolution metrics from any service, modify your Lambda function to gather the correct metrics from your target. If you prefer not to use Python, you can implement a Lambda function using one of the other supported runtimes, including Node.js, Java, or .NET Core. However, this post should give you the fundamental basics about capturing high-resolution metrics in CloudWatch.

If you found this post useful, or have questions, please comment below.

Protect your Reputation with Email Pausing and Configuration Set Metrics

Post Syndicated from Brent Meyer original https://aws.amazon.com/blogs/ses/protect-your-reputation-with-email-pausing-and-configuration-set-metrics/

In August, we launched the reputation dashboard, which helps you track important metrics that could impact your ability to send emails. By monitoring the metrics in this dashboard, you can protect your sender reputation, which can increase the likelihood that the emails you send will reach your customers’ inboxes.

Today, we’re launching two features that build upon the capabilities of the reputation dashboard. The first is the ability to temporarily pause email sending, either at the configuration set level, or across your entire Amazon SES account. The second is the ability to export reputation metrics for individual configuration sets.

Email Pausing

Today’s update includes new API operations that can temporarily pause your ability to send email using Amazon SES. To disable email sending across your entire Amazon SES account, you can use the UpdateAccountSendingEnabled operation. To pause sending only for emails sent using a specific configuration set, you can use the UpdateConfigurationSetSendingEnabled operation.

Email pausing is helpful because Amazon SES uses automatic enforcement policies. If the bounce or complaint rates for your account are too high, your account is automatically placed on probation. If the bounce or complaint issues continue after the probation period has ended, your account may be suspended.

With email pausing, you can temporarily halt your ability to send email before your account is placed on probation. While your ability to send email is paused, you can identify the issues that were causing your account to register high bounce or complaint rates. You can then resume sending after the issues are resolved.

Email pausing helps ensure that your ability to send email using Amazon SES is not interrupted because of enforcement issues. It helps ensure that your sender reputation won’t be damaged by mistakes or unforeseen issues.

You can learn more about the UpdateAccountSendingEnabled and UpdateConfigurationSetSendingEnabled operations in the Amazon Simple Email Service API Reference.

Configuration Set Reputation Metrics

Amazon SES automatically publishes the bounce and complaint rates for your account to Amazon CloudWatch. In CloudWatch, you can monitor these metrics over time, and create alarms that notify you when your reputation metrics cross certain thresholds.

With today’s update, you can also publish reputation metrics for individual configuration sets to CloudWatch. This feature gives you additional information about the messages you send using Amazon SES. For example, if you send all of your marketing emails using one configuration set, and your transactional emails using a different configuration set, you can view distinct reputation metrics for each type of email.

Because we anticipate that this feature will lead to the creation of many new configuration sets, we’re increasing the maximum number of configuration sets you can create from 50 to 10,000.

For more information about exporting reputation metrics for configuration sets, see Exporting Reputation Metrics for a Configuration Set to CloudWatch in the Amazon Simple Email Service Developer Guide.

Automating These Features

You can use AWS services—including Amazon SNS, AWS Lambda, and Amazon CloudWatch—to create a solution that automatically pauses email sending for your account when your overall reputation metrics cross a certain threshold. Or, to minimize disruption to your email sending program, you can pause email sending for a specific configuration set when the metrics for that configuration set cross a threshold. The following image illustrates the processes that occur when you implement these solutions.

A flow diagram that illustrates a solution for automatically pausing Amazon SES email sending. Amazon SES provides reputation metrics to CloudWatch. If those metrics exceed a threshold, a CloudWatch alarm is triggered, which triggers an SNS topic. The SNS topic sends notifications (email, SMS), and executes a Lambda function, which pauses email sending in SES.

For more information on both of these solutions, see Automatically Pausing Email Sending in the Amazon Simple Email Service Developer Guide.

We’re always looking for ways to help safeguard the reputation you’ve worked hard to build. If you have suggestions, questions, or comments, we’d love to hear from you in the comments below, or in the Amazon SES Forum.

These features are now available in the following AWS Regions: US West (Oregon), US East (N. Virginia), and EU (Ireland).

Amazon ElastiCache Update – Online Resizing for Redis Clusters

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-elasticache-update-online-resizing-for-redis-clusters/

Amazon ElastiCache makes it easy to for you to set up a fast, in-memory data store and cache. With support for the two most popular open source offerings (Redis and Memcached), ElastiCache supports the demanding needs of game leaderboards, in-memory analytics, and large-scale messaging.

Today I would like to tell you about an important addition to Amazon ElastiCache for Redis. You can already create clusters with up to 15 shards, each responsible for storing keys and values for a specific set of slots (each cluster has exactly 16,384 slots). A single cluster can expand to store 3.55 terabytes of in-memory data while supporting up to 20 million reads and 4.5 million writes per second.

Now with Online Resizing
You can now adjust the number of shards in a running ElastiCache for Redis cluster while the cluster remains online and responding to requests. This gives you the power to respond to changes in traffic and data volume without having to take the cluster offline or to start with an empty cache. You can also rebalance a running cluster to uniformly redistribute slot space without changing the number of shards.

When you initiate a resharding or rebalancing operation, ElastiCache for Redis starts by preparing a plan that will result in an even distribution of slots across the shards in the cluster. Then it transfers slots across shards, moving many in parallel for efficiency. This all happens while the cluster continues to respond to requests, with a modest impact on write throughput for writes to a slot that is in motion. The migration rate is dependent on the instance type, network speed, read/write traffic to the slots, and is generally about 1 gigabyte per minute.

The resharding and rebalancing operations apply to Redis clusters that were created with Cluster Mode enabled:

Resharding a Cluster
In general, you will know that it is time to expand a cluster via resharding when it starts to face significant memory pressure or when individual nodes are becoming bottlenecks. You can watch the cluster’s CloudWatch metrics to identify each situation:

Memory Pressure – FreeableMemory, SwapUsage, BytesUsedForCache.

CPU Bottleneck – CPUUtilization, CurrConnections, NewConnections.

Network Bottleneck – NetworkBytesIn, NetworkBytesOut.

You can use CloudWatch Dashboards to monitor these metrics, and CloudWatch Alarms to automate the resharding process.

To reshard a Redis cluster from the ElastiCache Dashboard, click on the cluster to visit the detail page, and then click on the Add shards button:

Enter the number of shards to add and (optionally) the desired Availability Zones, then click on Add:

The status of the cluster will change to modifying and the resharding process will begin. It can take anywhere from a few minutes to several hours, as indicated above. You can track the progress on the detail page for the cluster:

You can see the slots moving from shard to shard:

You can also watch the Events for the cluster:

During the resharding you should avoid the use of the KEYS and SMEMBERS commands, as well as compute-intensive Lua scripts in order to moderate the load on the cluster shards. You should avoid the FLUSHDB and FLUSHALL commands entirely; using them will interrupt and then abort the resharding process.

The status of each shard will return to available when the process is complete:

The same process takes place when you delete shards.

Rebalancing Slots
You can perform this operation by heading to the cluster’s detail page and clicking on Rebalance Slot Distribution:

Things to Know
Here are a couple of things to keep in mind about this new feature:

Engine Version – Your cluster must be running version 3.2.10 of the Redis engine.

Migration Size – Slots that contain items that are larger than 256 megabytes after serialization are not migrated.

Cluster Endpoint – The cluster endpoint does not change as a result of a resharding or rebalancing.

Available Now
This feature is available now and you can start using it today.

Jeff;

 

New – Application Load Balancing via IP Address to AWS & On-Premises Resources

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-application-load-balancing-via-ip-address-to-aws-on-premises-resources/

I told you about the new AWS Application Load Balancer last year and showed you how to use it to do implement Layer 7 (application) routing to EC2 instances and to microservices running in containers.

Some of our customers are building hybrid applications as part of a longer-term move to AWS. These customers have told us that they would like to use a single Application Load Balancer to spread traffic across a combination of existing on-premises resources and new resources running in the AWS Cloud. Other customers would like to spread traffic to web or database servers that are scattered across two or more Virtual Private Clouds (VPCs), host multiple services on the same instance with distinct IP addresses but a common port number, and to offer support for IP-based virtual hosting for clients that do not support Server Name Indication (SNI). Another group of customers would like to host multiple instances of a service on the same instance (perhaps within containers), while using multiple interfaces and security groups to implement fine-grained access control.

These situations arise within a broad set of hybrid, migration, disaster recovery, and on-premises use cases and scenarios.

Route to IP Addresses
In order to address these use cases, Application Load Balancers can now route traffic directly to IP addresses. These addresses can be in the same VPC as the ALB, a peer VPC in the same region, on an EC2 instance connected to a VPC by way of ClassicLink, or on on-premises resources at the other end of a VPN connection or AWS Direct Connect connection.

Application Load Balancers already group targets in to target groups. As part of today’s launch, each target group now has a target type attribute:

instance – Targets are registered by way of EC2 instance IDs, as before.

ip – Targets are registered as IP addresses. You can use any IPv4 address from the load balancer’s VPC CIDR for targets within load balancer’s VPC and any IPv4 address from the RFC 1918 ranges (10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16) or the RFC 6598 range (100.64.0.0/10) for targets located outside the load balancer’s VPC (this includes Peered VPC, EC2-Classic, and on-premises targets reachable over Direct Connect or VPN).

Each target group has a load balancer and health check configuration, and publishes metrics to CloudWatch, as has always been the case.

Let’s say that you are in the transition phase of an application migration to AWS or want to use AWS to augment on-premises resources with EC2 instances and you need to distribute application traffic across both your AWS and on-premises resources. You can achieve this by registering all the resources (AWS and on-premises) to the same target group and associate the target group with a load balancer. Alternatively, you can use DNS based weighted load balancing across AWS and on-premises resources using two load balancers i.e. one load balancer for AWS and other for on-premises resources. In the scenario where application-A back-ends are in VPC and application-B back-ends are in on-premises locations then you can put back-ends for each application in different target groups and use content based routing to route traffic to each target group.

Creating a Target Group
Here’s how I create a target group that sends traffic to some IP addresses as part of the process of creating an Application Load Balancer. I enter a name (ip-target-1) and select ip as the Target type:

Then I enter IP address targets. These can be from the VPC that hosts the load balancer:

Or they can be other private IP addresses within one of the private ranges listed above, for targets outside of the VPC that hosts the load balancer:

After I review the settings and create the load balancer, traffic will be sent to the designated IP addresses as soon as they pass the health checks. Each load balancer can accommodate up to 1000 targets.

I can examine my target group and edit the set of targets at any time:

As you can see, one of my targets was not healthy when I took this screen shot (this was by design). Metrics are published to CloudWatch for each target group; I can see them in the Console and I can create CloudWatch Alarms:

Available Now
This feature is available now and you can start using it today in all AWS Regions.

Jeff;

 

Announcing the Reputation Dashboard

Post Syndicated from Brent Meyer original https://aws.amazon.com/blogs/ses/announcing-the-reputation-dashboard/

The Amazon SES team is pleased to announce the addition of a reputation dashboard to the Amazon SES console. This new feature helps you track issues that could impact the sender reputation of your Amazon SES account.

What information does the reputation dashboard provide?

Amazon SES users must maintain bounce and complaint rates below a certain threshold. We put these rules in place to protect the sender reputations of all Amazon SES users, and to prevent Amazon SES from being used to deliver spam or malicious content. Users with very high rates of bounces or complaints may be put on probation. If the bounce or complaint rates are not within acceptable limits by the end of the probation period, these accounts may be shut down completely.

Previous versions of Amazon SES provided basic sending metrics, including information about bounces and complaints. However, the bounce and complaint metrics in this dashboard only included information for the past few days of email sent from your account, as opposed to an overall rate.

The new reputation dashboard provides overall bounce and complaint rates for your entire account. This enables you to more closely monitor the health of your account and adjust your email sending practices as needed.

Can’t I just calculate these values myself?

Because each Amazon SES account sends different volumes of email at different rates, we do not calculate bounce and complaint rates based on a fixed time period. Instead, we use a representative volume of email. This representative volume is the basis for the bounce and complaint rate calculations.

Why do we use representative volume in our calculations? Let’s imagine that you sent 1,000 emails one week, and 5 of them bounced. If we only considered a week of email sending, your metrics look good. Now imagine that the next week you only sent 5 emails, and one of them bounced. Suddenly, your bounce rate jumps from half a percent to 20%, and your account is automatically placed on probation. This example may be an extreme case, but it illustrates the reason that we don’t use fixed time periods when calculating bounce and complaint rates.

When you open the new reputation dashboard, you will see bounce and complaint rates calculated using the representative volume for your account. We automatically recalculate these rates every time you send email through Amazon SES.

What else can I do with these metrics?

The Bounce and Complaint Rate metrics in the reputation dashboard are automatically sent to Amazon CloudWatch. You can use CloudWatch to create dashboards that track your bounce and complaint rates over time, and to create alarms that send you notifications when these metrics cross certain thresholds. To learn more, see Creating Reputation Monitoring Alarms Using CloudWatch in the Amazon SES Developer Guide.

How can I see the reputation dashboard?

The reputation dashboard is now available to all Amazon SES users. To view the reputation dashboard, sign in to the Amazon SES console. On the left navigation menu, choose Reputation Dashboard. For more information, see Monitoring Your Sender Reputation in the Amazon SES Developer Guide.

We hope you find the information in the reputation dashboard to be useful in managing your email sending programs and campaigns. If you have any questions or comments, please leave a comment on this post, or let us know in the Amazon SES forum.

New – High-Resolution Custom Metrics and Alarms for Amazon CloudWatch

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-high-resolution-custom-metrics-and-alarms-for-amazon-cloudwatch/

Amazon CloudWatch has been an important part of AWS since early 2009! Launched as part of a three-pack that also included Auto Scaling and Elastic Load Balancing, CloudWatch has evolved into a very powerful monitoring service for AWS resources and the applications that you run on the AWS Cloud. CloudWatch custom metrics (launched way back in 2011) allow you to store business and application metrics in CloudWatch, view them in graphs, and initiate actions based on CloudWatch Alarms. Needless to say, we have made many enhancements to CloudWatch over the years! Some of the most recent include Extended Metrics Retention (and a User Interface Update), Dashboards, API/CloudFormation Support for Dashboards, and Alarms on Dashboards.

Originally, metrics were stored at five minute intervals; this was reduced to one minute (also known as Detailed Monitoring) in response to customer requests way back in 2010. This was a welcome change, but now it is time to do better. Our customers are streaming video, running flash sales, deploying code tens or hundreds of times per day, and running applications that scale in and out very quickly as conditions change. In all of these situations, a minute is simply too coarse of an interval. Important, transient spikes can be missed; disparate (yet related) events are difficult to correlate across time, and the MTTR (mean time to repair) when something breaks is too high.

New High-Resolution Metrics
Today we are adding support for high-resolution custom metrics, with plans to add support for AWS services over time. Your applications can now publish metrics to CloudWatch with 1-second resolution. You can watch the metrics scroll across your screen seconds after they are published and you can set up high-resolution CloudWatch Alarms that evaluate as frequently as every 10 seconds.

Imagine alarming when available memory gets low. This is often a transient condition that can be hard to catch with infrequent samples. With high-resolution metrics, you can see, detect (via an alarm), and act on it within seconds:

In this case the alarm on the right would not fire, and you would not know about the issue.

Publishing High-Resolution Metrics
You can publish high-resolution metrics in two different ways:

  • API – The PutMetricData function now accepts an optional StorageResolution parameter. Set this parameter to 1 to publish high-resolution metrics; omit it (or set it to 60) to publish at standard 1-minute resolution.
  • collectd plugin – The CloudWatch plugin for collectd has been updated to support collection and publication of high-resolution metrics. You will need to set the enable_high_definition_metrics parameter in the config file for the plugin.

CloudWatch metrics are rolled up over time; resolution effectively decreases as the metrics age. Here’s the schedule:

  • 1 second metrics are available for 3 hours.
  • 60 second metrics are available for 15 days.
  • 5 minute metrics are available for 63 days.
  • 1 hour metrics are available for 455 days (15 months).

When you call GetMetricStatistics you can specify a period of 1, 5, 10, 30 or any multiple of 60 seconds for high-resolution metrics. You can specify any multiple of 60 seconds for standard metrics.

A Quick Demo
I grabbed my nearest EC2 instance, installed the latest version of collectd and the Python plugin:

$ sudo yum install collectd collectd-python

Then I downloaded the setup script for the plugin, made it executable, and ran it:

$ wget https://raw.githubusercontent.com/awslabs/collectd-cloudwatch/master/src/setup.py
$ chmod a+x setup.py
$ sudo ./setup.py

I had already created a suitable IAM Role and added it to my instance; it was automatically detected during setup. I was asked to enable the high resolution metrics:

collectd started running and publishing metrics within seconds. I opened up the CloudWatch Console to take a look:

Then I zoomed in to see the metrics in detail:

I also created an alarm that will check the memory.percent.used metric at 10 second intervals. This will make it easier for me to detect situations where a lot of memory is being used for a short period of time:

Now Available
High-resolution custom metrics and alarms are available now in all Public AWS Regions, with support for AWS GovCloud (US) coming soon.

As was already the case, you can store 10 metrics at no charge every month; see the CloudWatch Pricing page for more information. Pricing for high-resolution metrics is identical to that for standard resolution metrics, with volume tiers that allow you to realize savings (on a per-metric) basis when you use more metrics. High-resolution alarms are priced at $0.30 per alarm per month.

New – Target Tracking Policies for EC2 Auto Scaling

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-target-tracking-policies-for-ec2-auto-scaling/

I recently told you about DynamoDB Auto Scaling and showed you how it uses multiple CloudWatch Alarms to automate capacity management for DynamoDB tables. Behind the scenes, this feature makes use of a more general Application Auto Scaling model that we plan to put to use across several different AWS services over time.

The new Auto Scaling model includes an important new feature that we call target tracking. When you create an Auto Scaling policy that makes use of target tracking, you choose a target value for a particular CloudWatch metric. Auto Scaling then turns the appropriate knob (so to speak) to drive the metric toward the target, while also adjusting the relevant CloudWatch Alarms. Specifying your desired target, in whatever metrics-driven units make sense for your application, is generally easier and more direct than setting up ranges and thresholds manually using the original step scaling policy type. However, you can use target tracking in conjunction with step scaling in order to implement an advanced scaling strategy. For example, you could use target tracking for scale-out operations and step scaling for scale-in.

Now for EC2
Today we are adding target tracking support to EC2 Auto Scaling. You can now create scaling policies that are driven by application load balancer request counts, CPU load, network traffic, or a custom metric (the Request Count per Target metric is new, and is also part of today’s launch):

These metrics share an important property: adding additional EC2 instances will (with no changes in overall load) drive the metric down, and vice versa.

To create an Auto Scaling Group that makes use of target tracking, you simply enter a name for the policy, choose a metric, and set the desired target value:

You have the option to disable the scale-in side of the policy. If you do this, you can scale-in manually or use a separate policy.

You can create target tracking policies using the AWS Management Console, AWS Command Line Interface (CLI), or the AWS SDKs.

Here are a couple of things to keep in mind as you look forward to using target tracking:

  • You can track more than one target in a single Auto Scaling Group as long as each one references a distinct metric. Scaling will always choose the policy that drives the highest capacity.
  • Scaling will not take place if the metric has insufficient data.
  • Auto Scaling compensates for rapid, transient fluctuations in the metrics, and strives to minimize corresponding fluctuations in capacity.
  • You can set up target tracking for a custom metric through the Auto Scaling API or the AWS Command Line Interface (CLI).
  • In most cases you should elect to scale on metrics that are published with 1-minute frequency (also known as detailed monitoring). Using 5-minute metrics as the basis for scaling will result in a slower response time.

Now Available
This new feature is available now and you can start using it today at no extra charge. To learn more, read about Target Tracking Scaling in the Auto Scaling User Guide.

Jeff;

New – API & CloudFormation Support for Amazon CloudWatch Dashboards

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-api-cloudformation-support-for-amazon-cloudwatch-dashboards/

We launched CloudWatch Dashboards a couple of years ago. In the post that I wrote for the launch, I showed you how to interactively create a dashboard that displayed chosen CloudWatch metrics in graphical form. After the launch, we added additional features including a full screen mode, a dark theme, control over the range of the Y axis, simplified renaming, persistent storage, and new visualization options.

New API & CLI
While console support is wonderful for interactive use, many customers have asked us to support programmatic creation and manipulation of dashboards and the widgets within. They would like to dynamically build and maintain dashboards, adding and removing widgets as the corresponding AWS resources are created and destroyed. Other customers are interested in setting up and maintaining a consistent set of dashboards across two or more AWS accounts.

I am happy to announce that API, CLI, and AWS CloudFormation support for CloudWatch Dashboards is available now and that you can start using it today!

There are four new API functions (and equivalent CLI commands):

ListDashboards / aws cloudwatch list-dashboards – Fetch a list of all dashboards within an account, or a subset that share a common prefix.

GetDashboard / aws cloudwatch get-dashboard – Fetch details for a single dashboard.

PutDashboard / aws cloudwatch put-dashboard – Create a new dashboard or update an existing one.

DeleteDashboards / aws cloudwatch delete-dashboards – Delete one or more dashboards.

Dashboard Concepts
I want to show you how to use these functions and commands. Before I dive in, I should review a couple of important dashboard concepts and attributes.

Global – Dashboards are part of an AWS account, and are not associated with a specific AWS Region. Each account can have up to 500 dashboards.

Named – Each dashboard has a name that is unique within the AWS account. Names can be up to 255 characters long.

Grid Model – Each dashboard is composed of a grid of cells. The grid is 24 cells across and as tall as necessary. Each widget on the dashboard is positioned at a particular set of grid coordinates, and has a size that spans an integral number of grid cells.

Widgets (Visualizations) – Each widget can display text or a set of CloudWatch metrics. Text is specified using Markdown; metrics can be displayed as single values, line charts, or stacked area charts. Each dashboard can have up to 100 widgets. Widgets that display metrics can also be associated with a CloudWatch Alarm.

Dashboards have a JSON representation that you can now see and edit from within the console. Simply click on the Action menu and choose View/edit source:

Here’s the source for my dashboard:

You can use this JSON as a starting point for your own applications. As you can see, there’s an entry in the widgets array for each widget on the dashboard; each entry describes one widget, starting with its type, position, and size.

Creating a Dashboard Using the API
Let’s say I want to create a dashboard that has a widget for each of my EC2 instances in a particular region. I’ll use Python and the AWS SDK for Python, and start as follows (excuse the amateur nature of my code):

import boto3
import json

cw  = boto3.client("cloudwatch")
ec2 = boto3.client("ec2")

x, y          = [0, 0]
width, height = [3, 3]
max_width     = 12
widgets       = []

Then I simply iterate over the instances, creating a widget dictionary for each one, and appending it to the widgets array:

instances = ec2.describe_instances()
for r in instances['Reservations']:
    for i in r['Instances']:

        widget = {'type'      : 'metric',
                  'x'         : x,
                  'y'         : y,
                  'height'    : height,
                  'width'     : width,
                  'properties': {'view'    : 'timeSeries',
                                 'stacked' : False,
                                 'metrics' : [['AWS/EC2', 'NetworkIn', 'InstanceId', i['InstanceId']],
                                              ['.',       'NetworkOut', '.',         '.']
                                             ],
                                 'period'  : 300,
                                 'stat'    : 'Average',
                                 'region'  : 'us-east-1',
                                 'title'   : i['InstanceId']
                                }
                 }

        widgets.append(widget)

I update the position (x and y) within the loop, and form a grid (if I don’t specify positions, the widgets will be laid out left to right, top to bottom):

        x += width
        if (x + width > max_width):
            x = 0
            y += height

After I have processed all of the instances, I create a JSON version of the widget array:

body   = {'widgets' : widgets}
body_j = json.dumps(body)

And I create or update my dashboard:

cw.put_dashboard(DashboardName = "EC2_Networking",
                 DashboardBody = body_j)

I run the code, and get the following dashboard:

The CloudWatch team recommends that dashboards created programmatically include a text widget indicating that the dashboard was generated automatically, along with a link to the source code or CloudFormation template that did the work. This will discourage users from making manual, out-of-band changers to the dashboards.

As I mentioned earlier, each metric widget can also be associated with a CloudWatch Alarm. You can create the alarms programmatically or by using a CloudFormation template such as the Sample CPU Utilization Alarm. If you decide to do this, the alarm threshold will be displayed in the widget. To learn more about this, read Tara Walker’s recent post, Amazon CloudWatch Launches Alarms on Dashboards.

Going one step further, I could use CloudWatch Events and a Lamba Function to track the creation and deletion of certain resources and update a dashboard in concert with the changes. To learn how to do this, read Keeping CloudWatch Dashboards up to Date Using AWS Lambda.

Accessing a Dashboard Using the CLI
I can also access and manipulate my dashboards from the command line. For example, I can generate a simple list:

$ aws cloudwatch list-dashboards --output table
----------------------------------------------
|               ListDashboards               |
+--------------------------------------------+
||             DashboardEntries             ||
|+-----------------+----------------+-------+|
||  DashboardName  | LastModified   | Size  ||
|+-----------------+----------------+-------+|
||  Disk-Metrics   |  1496405221.0  |  316  ||
||  EC2_Networking |  1498090434.0  |  2830 ||
||  Main-Metrics   |  1498085173.0  |  234  ||
|+-----------------+----------------+-------+|

And I can get rid of the Disk-Metrics dashboard:

$ aws cloudwatch delete-dashboards --dashboard-names Disk-Metrics

I can also retrieve the JSON that defines a dashboard:

Creating a Dashboard Using CloudFormation
Dashboards can also be specified in CloudFormation templates. Here’s a simple template in YAML (the DashboardBody is still specified in JSON):

Resources:
  MyDashboard:
    Type: "AWS::CloudWatch::Dashboard"
    Properties:
      DashboardName: SampleDashboard
      DashboardBody: '{"widgets":[{"type":"text","x":0,"y":0,"width":6,"height":6,"properties":{"markdown":"Hi there from CloudFormation"}}]}'

I place the template in a file and then create a stack using the console or the CLI:

$ aws cloudformation create-stack --stack-name MyDashboard --template-body file://dash.yaml
{
    "StackId": "arn:aws:cloudformation:us-east-1:xxxxxxxxxxxx:stack/MyDashboard/a2a3fb20-5708-11e7-8ffd-500c21311262"
}

Here’s the dashboard:

Available Now
This feature is available now and you can start using it today. You can create 3 dashboards with up to 50 metrics per dashboard at no charge; additional dashboards are priced at $3 per month, as listed on the CloudWatch Pricing page. You can make up to 1 million calls to the new API functions each month at no charge; beyond that you pay $.01 for every 1,000 calls.

Jeff;

AWS Bill Simplification – Consolidated CloudWatch Charges

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-bill-simplification-consolidated-cloudwatch-charges/

The bill that you receive for your use of AWS in July will include a change in the way that Amazon CloudWatch charges are presented. The CloudWatch team made this change in order to make your bill simpler and easier to understand.

Consolidating Charges
In the past, charges for your usage of CloudWatch were split between two sections of your bill. For historical reasons, the charges for CloudWatch Alarms, CloudWatch Metrics, and calls to the CloudWatch API were reported in the Elastic Compute Cloud (EC2) detail section, while charges for CloudWatch Logs and CloudWatch Dashboards were reported in the CloudWatch detail section, like this:

We have received feedback that splitting the charges across two sections of the bill made it difficult to locate and understand the entire set of monitoring charges. In order to address this issue, we are moving the charges that were formerly listed in the Elastic Compute Cloud (EC2) detail section to the CloudWatch detail section. We are making the same change to the detailed billing report, moving the affected charges from the AmazonEC2 product code to the AmazonCloudWatch product code and changing to the AmazonCloudWatch product name. This change does not affect your overall bill; it simply consolidates all of the charges for the use of CloudWatch in one section.

Billing Metric
The CloudWatch billing metric named Estimated Charges can be viewed as a Total Estimated Charge, or broken down By Service:

The total will not change. However, as noted above, the charges that formerly had AmazonEC2 as the ServiceName dimension will now have it set to AmazonCloudWatch:

You may need to adjust thresholds on your billing alarms as a result:

Once again, your total AWS bill will not change. You will begin to see the consolidated charges for CloudWatch in your AWS bill for July 2017.

Jeff;

 

Protect Web Sites & Services Using Rate-Based Rules for AWS WAF

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/protect-web-sites-services-using-rate-based-rules-for-aws-waf/

AWS WAF (Web Application Firewall) helps to protect your application from many different types of application-layer attacks that involve requests that are malicious or malformed. As I showed you when I first wrote about this service (New – AWS WAF), you can define rules that match cross-site scripting, IP address, SQL injection, size, or content constraints:

When incoming requests match rules, actions are invoked. Actions can either allow, block, or simply count matches.

The existing rule model is powerful and gives you the ability to detect and respond to many different types of attacks. It does not, however, allow you to respond to attacks that simply consist of a large number of otherwise valid requests from a particular IP address. These requests might be a web-layer DDoS attack, a brute-force login attempt, or even a partner integration gone awry.

New Rate-Based Rules
Today we are adding Rate-based Rules to WAF, giving you control of when IP addresses are added to and removed from a blacklist, along with the flexibility to handle exceptions and special cases:

Blacklisting IP Addresses – You can blacklist IP addresses that make requests at a rate that exceeds a configured threshold rate.

IP Address Tracking– You can see which IP addresses are currently blacklisted.

IP Address Removal – IP addresses that have been blacklisted are automatically removed when they no longer make requests at a rate above the configured threshold.

IP Address Exemption – You can exempt certain IP addresses from blacklisting by using an IP address whitelist inside of the a rate-based rule. For example, you might want to allow trusted partners to access your site at a higher rate.

Monitoring & Alarming – You can watch and alarm on CloudWatch metrics that are published for each rule.

You can combine new Rate-based Rules with WAF Conditions to implement sophisticated rate-limiting strategies. For example, you could use a Rate-based Rule and a WAF Condition that matches your login pages. This would allow you to impose a modest threshold on your login pages (to avoid brute-force password attacks) and allow a more generous one on your marketing or system status pages.

Thresholds are defined in terms of the number of incoming requests from a single IP address within a 5 minute period. Once this threshold is breached, additional requests from the IP address are blocked until the request rate falls below the threshold.

Using Rate-Based Rules
Here’s how you would define a Rate-based Rule that protects the /login portion of your site. Start by defining a WAF condition that matches the desired string in the URI of the page:

Then use this condition to define a Rate-based Rule (the rate limit is expressed in terms of requests within a 5 minute interval, but the blacklisting goes in to effect as soon as the limit is breached):

With the condition and the rule in place, create a Web ACL (ProtectLoginACL) to bring it all together and to attach it to the AWS resource (a CloudFront distribution in this case):

Then attach the rule (ProtectLogin) to the Web ACL:

The resource is now protected in accord with the rule and the web ACL. You can monitor the associated CloudWatch metrics (ProtectLogin and ProtectLoginACL in this case). You could even create CloudWatch Alarms and use them to fire Lambda functions when a protection threshold is breached. The code could examine the offending IP address and make a complex, business-driven decision, perhaps adding a whitelisting rule that gives an extra-generous allowance to a trusted partner or to a user with a special payment plan.

Available Now
The new, Rate-based Rules are available now and you can start using them today! Rate-based rules are priced the same as Regular rules; see the WAF Pricing page for more info.

Jeff;

Building Loosely Coupled, Scalable, C# Applications with Amazon SQS and Amazon SNS

Post Syndicated from Tara Van Unen original https://aws.amazon.com/blogs/compute/building-loosely-coupled-scalable-c-applications-with-amazon-sqs-and-amazon-sns/

 
Stephen Liedig, Solutions Architect

 

One of the many challenges professional software architects and developers face is how to make cloud-native applications scalable, fault-tolerant, and highly available.

Fundamental to your project success is understanding the importance of making systems highly cohesive and loosely coupled. That means considering the multi-dimensional facets of system coupling to support the distributed nature of the applications that you are building for the cloud.

By that, I mean addressing not only the application-level coupling (managing incoming and outgoing dependencies), but also considering the impacts of of platform, spatial, and temporal coupling of your systems. Platform coupling relates to the interoperability, or lack thereof, of heterogeneous systems components. Spatial coupling deals with managing components at a network topology level or protocol level. Temporal, or runtime coupling, refers to the ability of a component within your system to do any kind of meaningful work while it is performing a synchronous, blocking operation.

The AWS messaging services, Amazon SQS and Amazon SNS, help you deal with these forms of coupling by providing mechanisms for:

  • Reliable, durable, and fault-tolerant delivery of messages between application components
  • Logical decomposition of systems and increased autonomy of components
  • Creating unidirectional, non-blocking operations, temporarily decoupling system components at runtime
  • Decreasing the dependencies that components have on each other through standard communication and network channels

Following on the recent topic, Building Scalable Applications and Microservices: Adding Messaging to Your Toolbox, in this post, I look at some of the ways you can introduce SQS and SNS into your architectures to decouple your components, and show how you can implement them using C#.

Walkthrough

To illustrate some of these concepts, consider a web application that processes customer orders. As good architects and developers, you have followed best practices and made your application scalable and highly available. Your solution included implementing load balancing, dynamic scaling across multiple Availability Zones, and persisting orders in a Multi-AZ Amazon RDS database instance, as in the following diagram.


In this example, the application is responsible for handling and persisting the order data, as well as dealing with increases in traffic for popular items.

One potential point of vulnerability in the order processing workflow is in saving the order in the database. The business expects that every order has been persisted into the database. However, any potential deadlock, race condition, or network issue could cause the persistence of the order to fail. Then, the order is lost with no recourse to restore the order.

With good logging capability, you may be able to identify when an error occurred and which customer’s order failed. This wouldn’t allow you to “restore” the transaction, and by that stage, your customer is no longer your customer.

As illustrated in the following diagram, introducing an SQS queue helps improve your ordering application. Using the queue isolates the processing logic into its own component and runs it in a separate process from the web application. This, in turn, allows the system to be more resilient to spikes in traffic, while allowing work to be performed only as fast as necessary in order to manage costs.


In addition, you now have a mechanism for persisting orders as messages (with the queue acting as a temporary database), and have moved the scope of your transaction with your database further down the stack. In the event of an application exception or transaction failure, this ensures that the order processing can be retired or redirected to the Amazon SQS Dead Letter Queue (DLQ), for re-processing at a later stage. (See the recent post, Using Amazon SQS Dead-Letter Queues to Control Message Failure, for more information on dead-letter queues.)

Scaling the order processing nodes

This change allows you now to scale the web application frontend independently from the processing nodes. The frontend application can continue to scale based on metrics such as CPU usage, or the number of requests hitting the load balancer. Processing nodes can scale based on the number of orders in the queue. Here is an example of scale-in and scale-out alarms that you would associate with the scaling policy.

Scale-out Alarm

aws cloudwatch put-metric-alarm --alarm-name AddCapacityToCustomerOrderQueue --metric-name ApproximateNumberOfMessagesVisible --namespace "AWS/SQS" 
--statistic Average --period 300 --threshold 3 --comparison-operator GreaterThanOrEqualToThreshold --dimensions Name=QueueName,Value=customer-orders
--evaluation-periods 2 --alarm-actions <arn of the scale-out autoscaling policy>

Scale-in Alarm

aws cloudwatch put-metric-alarm --alarm-name RemoveCapacityFromCustomerOrderQueue --metric-name ApproximateNumberOfMessagesVisible --namespace "AWS/SQS" 
 --statistic Average --period 300 --threshold 1 --comparison-operator LessThanOrEqualToThreshold --dimensions Name=QueueName,Value=customer-orders
 --evaluation-periods 2 --alarm-actions <arn of the scale-in autoscaling policy>

In the above example, use the ApproximateNumberOfMessagesVisible metric to discover the queue length and drive the scaling policy of the Auto Scaling group. Another useful metric is ApproximateAgeOfOldestMessage, when applications have time-sensitive messages and developers need to ensure that messages are processed within a specific time period.

Scaling the order processing implementation

On top of scaling at an infrastructure level using Auto Scaling, make sure to take advantage of the processing power of your Amazon EC2 instances by using as many of the available threads as possible. There are several ways to implement this. In this post, we build a Windows service that uses the BackgroundWorker class to process the messages from the queue.

Here’s a closer look at the implementation. In the first section of the consuming application, use a loop to continually poll the queue for new messages, and construct a ReceiveMessageRequest variable.

public static void PollQueue()
{
    while (_running)
    {
        Task<ReceiveMessageResponse> receiveMessageResponse;

        // Pull messages off the queue
        using (var sqs = new AmazonSQSClient())
        {
            const int maxMessages = 10;  // 1-10

            //Receiving a message
            var receiveMessageRequest = new ReceiveMessageRequest
            {
                // Get URL from Configuration
                QueueUrl = _queueUrl, 
                // The maximum number of messages to return. 
                // Fewer messages might be returned. 
                MaxNumberOfMessages = maxMessages, 
                // A list of attributes that need to be returned with message.
                AttributeNames = new List<string> { "All" },
                // Enable long polling. 
                // Time to wait for message to arrive on queue.
                WaitTimeSeconds = 5 
            };

            receiveMessageResponse = sqs.ReceiveMessageAsync(receiveMessageRequest);
        }

The WaitTimeSeconds property of the ReceiveMessageRequest specifies the duration (in seconds) that the call waits for a message to arrive in the queue before returning a response to the calling application. There are a few benefits to using long polling:

  • It reduces the number of empty responses by allowing SQS to wait until a message is available in the queue before sending a response.
  • It eliminates false empty responses by querying all (rather than a limited number) of the servers.
  • It returns messages as soon any message becomes available.

For more information, see Amazon SQS Long Polling.

After you have returned messages from the queue, you can start to process them by looping through each message in the response and invoking a new BackgroundWorker thread.

// Process messages
if (receiveMessageResponse.Result.Messages != null)
{
    foreach (var message in receiveMessageResponse.Result.Messages)
    {
        Console.WriteLine("Received SQS message, starting worker thread");

        // Create background worker to process message
        BackgroundWorker worker = new BackgroundWorker();
        worker.DoWork += (obj, e) => ProcessMessage(message);
        worker.RunWorkerAsync();
    }
}
else
{
    Console.WriteLine("No messages on queue");
}

The event handler, ProcessMessage, is where you implement business logic for processing orders. It is important to have a good understanding of how long a typical transaction takes so you can set a message VisibilityTimeout that is long enough to complete your operation. If order processing takes longer than the specified timeout period, the message becomes visible on the queue. Other nodes may pick it and process the same order twice, leading to unintended consequences.

Handling Duplicate Messages

In order to manage duplicate messages, seek to make your processing application idempotent. In mathematics, idempotent describes a function that produces the same result if it is applied to itself:

f(x) = f(f(x))

No matter how many times you process the same message, the end result is the same (definition from Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions, Hohpe and Wolf, 2004).

There are several strategies you could apply to achieve this:

  • Create messages that have inherent idempotent characteristics. That is, they are non-transactional in nature and are unique at a specified point in time. Rather than saying “place new order for Customer A,” which adds a duplicate order to the customer, use “place order <orderid> on <timestamp> for Customer A,” which creates a single order no matter how often it is persisted.
  • Deliver your messages via an Amazon SQS FIFO queue, which provides the benefits of message sequencing, but also mechanisms for content-based deduplication. You can deduplicate using the MessageDeduplicationId property on the SendMessage request or by enabling content-based deduplication on the queue, which generates a hash for MessageDeduplicationId, based on the content of the message, not the attributes.
var sendMessageRequest = new SendMessageRequest
{
    QueueUrl = _queueUrl,
    MessageBody = JsonConvert.SerializeObject(order),
    MessageGroupId = Guid.NewGuid().ToString("N"),
    MessageDeduplicationId = Guid.NewGuid().ToString("N")
};
  • If using SQS FIFO queues is not an option, keep a message log of all messages attributes processed for a specified period of time, as an alternative to message deduplication on the receiving end. Verifying the existence of the message in the log before processing the message adds additional computational overhead to your processing. This can be minimized through low latency persistence solutions such as Amazon DynamoDB. Bear in mind that this solution is dependent on the successful, distributed transaction of the message and the message log.

Handling exceptions

Because of the distributed nature of SQS queues, it does not automatically delete the message. Therefore, you must explicitly delete the message from the queue after processing it, using the message ReceiptHandle property (see the following code example).

However, if at any stage you have an exception, avoid handling it as you normally would. The intention is to make sure that the message ends back on the queue, so that you can gracefully deal with intermittent failures. Instead, log the exception to capture diagnostic information, and swallow it.

By not explicitly deleting the message from the queue, you can take advantage of the VisibilityTimeout behavior described earlier. Gracefully handle the message processing failure and make the unprocessed message available to other nodes to process.

In the event that subsequent retries fail, SQS automatically moves the message to the configured DLQ after the configured number of receives has been reached. You can further investigate why the order process failed. Most importantly, the order has not been lost, and your customer is still your customer.

private static void ProcessMessage(Message message)
{
    using (var sqs = new AmazonSQSClient())
    {
        try
        {
            Console.WriteLine("Processing message id: {0}", message.MessageId);

            // Implement messaging processing here
            // Ensure no downstream resource contention (parallel processing)
            // <your order processing logic in here…>
            Console.WriteLine("{0} Thread {1}: {2}", DateTime.Now.ToString("s"), Thread.CurrentThread.ManagedThreadId, message.MessageId);
            
            // Delete the message off the queue. 
            // Receipt handle is the identifier you must provide 
            // when deleting the message.
            var deleteRequest = new DeleteMessageRequest(_queueName, message.ReceiptHandle);
            sqs.DeleteMessageAsync(deleteRequest);
            Console.WriteLine("Processed message id: {0}", message.MessageId);

        }
        catch (Exception ex)
        {
            // Do nothing.
            // Swallow exception, message will return to the queue when 
            // visibility timeout has been exceeded.
            Console.WriteLine("Could not process message due to error. Exception: {0}", ex.Message);
        }
    }
}

Using SQS to adapt to changing business requirements

One of the benefits of introducing a message queue is that you can accommodate new business requirements without dramatically affecting your application.

If, for example, the business decided that all orders placed over $5000 are to be handled as a priority, you could introduce a new “priority order” queue. The way the orders are processed does not change. The only significant change to the processing application is to ensure that messages from the “priority order” queue are processed before the “standard order” queue.

The following diagram shows how this logic could be isolated in an “order dispatcher,” whose only purpose is to route order messages to the appropriate queue based on whether the order exceeds $5000. Nothing on the web application or the processing nodes changes other than the target queue to which the order is sent. The rates at which orders are processed can be achieved by modifying the poll rates and scalability settings that I have already discussed.

Extending the design pattern with Amazon SNS

Amazon SNS supports reliable publish-subscribe (pub-sub) scenarios and push notifications to known endpoints across a wide variety of protocols. It eliminates the need to periodically check or poll for new information and updates. SNS supports:

  • Reliable storage of messages for immediate or delayed processing
  • Publish / subscribe – direct, broadcast, targeted “push” messaging
  • Multiple subscriber protocols
  • Amazon SQS, HTTP, HTTPS, email, SMS, mobile push, AWS Lambda

With these capabilities, you can provide parallel asynchronous processing of orders in the system and extend it to support any number of different business use cases without affecting the production environment. This is commonly referred to as a “fanout” scenario.

Rather than your web application pushing orders to a queue for processing, send a notification via SNS. The SNS messages are sent to a topic and then replicated and pushed to multiple SQS queues and Lambda functions for processing.

As the diagram above shows, you have the development team consuming “live” data as they work on the next version of the processing application, or potentially using the messages to troubleshoot issues in production.

Marketing is consuming all order information, via a Lambda function that has subscribed to the SNS topic, inserting the records into an Amazon Redshift warehouse for analysis.

All of this, of course, is happening without affecting your order processing application.

Summary

While I haven’t dived deep into the specifics of each service, I have discussed how these services can be applied at an architectural level to build loosely coupled systems that facilitate multiple business use cases. I’ve also shown you how to use infrastructure and application-level scaling techniques, so you can get the most out of your EC2 instances.

One of the many benefits of using these managed services is how quickly and easily you can implement powerful messaging capabilities in your systems, and lower the capital and operational costs of managing your own messaging middleware.

Using Amazon SQS and Amazon SNS together can provide you with a powerful mechanism for decoupling application components. This should be part of design considerations as you architect for the cloud.

For more information, see the Amazon SQS Developer Guide and Amazon SNS Developer Guide. You’ll find tutorials on all the concepts covered in this post, and more. To can get started using the AWS console or SDK of your choice visit:

Happy messaging!