Tag Archives: monitoring

How to monitor Windows and Linux servers and get internal performance metrics

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/how-to-monitor-windows-and-linux-servers-and-get-internal-performance-metrics/

This post was written by Dean Suzuki, Senior Solutions Architect.

Customers who run Windows or Linux instances on AWS frequently ask, “How do I know if my disks are almost full?” or “How do I know if my application is using all the available memory and is paging to disk?” This blog helps answer these questions by walking you through how to set up monitoring to capture these internal performance metrics.

Solution overview

If you open the Amazon EC2 console, select a running Amazon EC2 instance, and select the Monitoring tab, you can see Amazon CloudWatch metrics for that instance. Amazon CloudWatch is an AWS monitoring service. The Monitoring tab (shown in the following image) shows the metrics that can be measured from outside the instance (for example, CPU utilization and network bytes in/out). However, metrics such as the percentage of disk or memory in use require an internal, operating-system-level view of the instance. AWS places an extra safeguard on gathering data inside a customer’s instance, so this capability is not enabled by default.

EC2 console showing Monitoring tab

To capture the server’s internal performance metrics, a CloudWatch agent must be installed on the instance. For Windows, the CloudWatch agent can capture any of the Windows performance monitor counters. For Linux, the CloudWatch agent can capture system-level metrics. For more details, please see Metrics Collected by the CloudWatch Agent. The agent can also capture logs from the server. The agent then sends this information to Amazon CloudWatch, where rules can be created to alert on certain conditions (for example, low free disk space) and automated responses can be set up (for example, perform backup to clear transaction logs). Also, dashboards can be created to view the health of your Windows servers.

There are four steps to implement internal monitoring:

  1. Install the CloudWatch agent onto your servers. AWS provides a service called AWS Systems Manager Run Command, which enables you to do this agent installation across all your servers.
  2. Run the CloudWatch agent configuration wizard, which captures what you want to monitor. These items could be performance counters and logs on the server. This configuration is then stored in AWS Systems Manager Parameter Store.
  3. Use Run Command to configure the CloudWatch agents to use the agent configuration stored in Parameter Store.
  4. Validate that the CloudWatch agents are sending their monitoring data to CloudWatch.

The following image shows the flow of these four steps.

Process to install and configure the CloudWatch agent

In this blog, I walk through these steps so that you can follow along. Note that you are responsible for the cost of running the environment outlined in this blog. So, once you are finished with the steps in the blog, I recommend deleting the resources if you no longer need them. For the cost of running these servers, see Amazon EC2 On-Demand Pricing. For CloudWatch pricing, see Amazon CloudWatch pricing.

If you want a video overview of this process, please see this Monitoring Amazon EC2 Windows Instances using Unified CloudWatch Agent video.

Deploy the CloudWatch agent

The first step is to deploy the Amazon CloudWatch agent. There are multiple ways to deploy the CloudWatch agent (see this documentation on Installing the CloudWatch Agent). In this blog, I walk through how to use the AWS Systems Manager Run Command to deploy the agent. AWS Systems Manager relies on the Systems Manager agent, which comes preinstalled on many Amazon-provided instance images and can be installed on others. The Systems Manager agent must be given the appropriate permissions to connect to AWS Systems Manager and to write the configuration data to the AWS Systems Manager Parameter Store. These access rights are controlled through the use of IAM roles.

Create two IAM roles

IAM roles are identities to which you attach IAM policies. IAM policies define what access is allowed to AWS services. Users, services, or applications can assume an IAM role and receive the rights defined in its permissions policies.

To use Systems Manager, you typically create two IAM roles. The first role has permissions to write the CloudWatch agent configuration information to Systems Manager Parameter Store. This role is called CloudWatchAgentAdminRole.

The second role only has permissions to read the CloudWatch agent configuration from Systems Manager Parameter Store. This role is called CloudWatchAgentServerRole.

For more details on creating these roles, please see the documentation on Create IAM Roles and Users for Use with the CloudWatch Agent.
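
If you prefer to script this step, here is a minimal boto3 sketch of what the documented console steps do. It assumes the AWS managed policies named in the documentation (CloudWatchAgentAdminPolicy, CloudWatchAgentServerPolicy, and AmazonSSMManagedInstanceCore); treat it as an illustration rather than a complete setup, and follow the linked documentation for the authoritative steps.

import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets EC2 instances assume the roles
ec2_trust_policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
})

def create_agent_role(role_name, agent_policy_arn):
    # Create the role, attach the CloudWatch agent policy, and attach the
    # Systems Manager policy so Run Command and Session Manager can manage the instance.
    iam.create_role(RoleName=role_name, AssumeRolePolicyDocument=ec2_trust_policy)
    iam.attach_role_policy(RoleName=role_name, PolicyArn=agent_policy_arn)
    iam.attach_role_policy(
        RoleName=role_name,
        PolicyArn="arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore")

# Role for the instance where you run the configuration wizard (writes to Parameter Store)
create_agent_role("CloudWatchAgentAdminRole",
                  "arn:aws:iam::aws:policy/CloudWatchAgentAdminPolicy")

# Role for every instance that only needs to read the configuration
create_agent_role("CloudWatchAgentServerRole",
                  "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy")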

Attach the IAM roles to the EC2 instances

Once you create the roles, you attach them to your Amazon EC2 instances. By attaching the IAM roles to the EC2 instances, you provide the processes running on the EC2 instance the permissions defined in the IAM role. In this blog, you create two Amazon EC2 instances. Attach the CloudWatchAgentAdminRole to the first instance that is used to create the CloudWatch agent configuration. Attach CloudWatchAgentServerRole to the second instance and any other instances that you want to monitor. For details on how to attach or assign roles to EC2 instances, please see the documentation on How do I assign an existing IAM role to an EC2 instance?.
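
The console does the instance-profile plumbing for you when you attach a role. If you script it instead, keep in mind that EC2 attaches roles through instance profiles; the following boto3 sketch shows the idea (the instance ID is a placeholder):

import boto3

iam = boto3.client("iam")
ec2 = boto3.client("ec2")

# Wrap the role in an instance profile of the same name.
iam.create_instance_profile(InstanceProfileName="CloudWatchAgentServerRole")
iam.add_role_to_instance_profile(
    InstanceProfileName="CloudWatchAgentServerRole",
    RoleName="CloudWatchAgentServerRole")

# Attach the profile to an instance you want to monitor
# (replace the instance ID with one of your own).
ec2.associate_iam_instance_profile(
    IamInstanceProfile={"Name": "CloudWatchAgentServerRole"},
    InstanceId="i-0123456789abcdef0")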

Install the CloudWatch agent

Now that you have set up the permissions, you can install the CloudWatch agent onto the servers that you want to monitor. For details on installing the CloudWatch agent using Systems Manager, please see the documentation on Download and Configure the CloudWatch Agent.
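
For reference, the Run Command step uses the AWS-ConfigureAWSPackage document to install the AmazonCloudWatchAgent package. A scripted sketch of the same call (the instance IDs are placeholders) might look like this:

import boto3

ssm = boto3.client("ssm")

# Install the CloudWatch agent package on the instances you want to monitor.
response = ssm.send_command(
    InstanceIds=["i-0123456789abcdef0", "i-0fedcba9876543210"],
    DocumentName="AWS-ConfigureAWSPackage",
    Parameters={
        "action": ["Install"],
        "name": ["AmazonCloudWatchAgent"],
    },
)
print(response["Command"]["CommandId"])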

Create the CloudWatch agent configuration

Now that you have installed the CloudWatch agent on your server, run the CloudWatch agent configuration wizard to create the agent configuration. For instructions on how to run the wizard, please see the documentation on Create the CloudWatch Agent Configuration File with the Wizard. To establish a command shell on the server, you can use AWS Systems Manager Session Manager to open a session and then run the wizard. If you want to monitor both Linux and Windows servers, you must run the wizard once on a Linux instance and once on a Windows instance, because the configuration file is unique to each OS type.

To run the Agent configuration wizard on Linux instances, run the following command:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard

To run the Agent configuration wizard on Windows instances, run the following commands:

cd "C:\Program Files\Amazon\AmazonCloudWatchAgent"

amazon-cloudwatch-agent-config-wizard.exe

Note for Linux instances: do not choose to collect collectd metrics in the agent configuration wizard unless collectd is installed on your Linux servers. Otherwise, you may encounter an error.

Review the Agent configuration

The CloudWatch agent configuration generated from the wizard is stored in Systems Manager Parameter Store. You can review and modify this configuration if you need to capture extra metrics. To review the agent configuration, perform the following steps:

  1. Open the console for the Systems Manager service.
  2. Choose Parameter Store in the left-hand navigation.
  3. You should see the parameters that were created by the CloudWatch agent configuration wizard. For Linux servers, the configuration is stored in AmazonCloudWatch-linux; for Windows servers, it is stored in AmazonCloudWatch-windows.

Systems Manager Parameter Store: Parameters created by the CloudWatch agent configuration wizard

  4. Click on the parameter’s hyperlink (for example, AmazonCloudWatch-linux) to see all the configuration parameters that you specified in the configuration wizard.

In the following steps, I walk through an example of modifying the Windows configuration parameter (AmazonCloudWatch-windows) to add an additional metric (“Available Mbytes”) to monitor.

  1. Choose the AmazonCloudWatch-windows parameter.
  2. In the parameter overview, scroll down to the “metrics” section. Under “metrics_collected”, you can see the Windows performance monitor counters that the CloudWatch agent gathers. To add another perfmon counter, edit the configuration here.
  3. Press Edit at the top right of the AmazonCloudWatch-windows Parameter Store page.
  4. Scroll down in the Value section and look for “Memory.”
  5. After “% Committed Bytes In Use”, add a comma “,” and press Enter to create a blank line. On that line, enter “Available Mbytes”. The following screenshot demonstrates what this configuration should look like.

AmazonCloudWatch-windows parameter contents and how to add a new metric to monitor

  6. Press Save Changes.

To modify the Linux configuration parameter (AmazonCloudWatch-linux), you perform similar steps, except that you click on the AmazonCloudWatch-linux parameter. Here is additional documentation on creating the CloudWatch agent configuration and modifying the configuration file.
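
If you prefer to make the same change with a script instead of the console, the following boto3 sketch reads the AmazonCloudWatch-windows parameter, appends “Available Mbytes” to the Memory counters, and writes the parameter back. It assumes the wizard-generated layout (metrics > metrics_collected > Memory > measurement); review your own parameter before overwriting it.

import json
import boto3

ssm = boto3.client("ssm")

param = ssm.get_parameter(Name="AmazonCloudWatch-windows")
config = json.loads(param["Parameter"]["Value"])

# Add the "Available Mbytes" perfmon counter to the Memory section,
# assuming the default structure produced by the configuration wizard.
measurements = config["metrics"]["metrics_collected"]["Memory"]["measurement"]
if "Available Mbytes" not in measurements:
    measurements.append("Available Mbytes")

ssm.put_parameter(
    Name="AmazonCloudWatch-windows",
    Type="String",
    Value=json.dumps(config, indent=4),
    Overwrite=True,
)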

Start the CloudWatch agent and use the configuration

In this step, you start the CloudWatch agent and instruct it to use the agent configuration stored in Systems Manager Parameter Store.

  1. Open another tab in your web browser and go to the Systems Manager console.
  2. Choose Run Command in the left-hand navigation of the Systems Manager console.
  3. Press Run Command.
  4. In the search bar,
    • Select Document name prefix
    • Select Equal
    • Specify AmazonCloudWatch (note that the field is case-sensitive)
    • Press enter

Systems Manager Run Command's command document entry field

  5. Select AmazonCloudWatch-ManageAgent. This is the command document that configures the CloudWatch agent.
  6. In the command parameters section,
    • For Action, select Configure
    • For Mode, select ec2
    • For Optional Configuration Source, select ssm
    • For Optional Configuration Location, specify the Parameter Store name: AmazonCloudWatch-windows for Windows instances or AmazonCloudWatch-linux for Linux instances. Note that the field is case-sensitive. This tells the command to read the agent configuration from the specified parameter.
    • For Optional Restart, leave yes
  7. For Targets, choose the target servers that you want to monitor.
  8. Scroll down and press Run. The Run Command may take a couple of minutes to complete; press the refresh button to update its status. The Run Command configures the CloudWatch agent by reading the configuration from Parameter Store and applying those settings to the agent.

For more details on installing the CloudWatch agent using your agent configuration, please see the documentation on Installing the CloudWatch Agent on EC2 Instances Using Your Agent Configuration.
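
The same Run Command step can be scripted with boto3. The sketch below mirrors the console settings described above (the instance ID is a placeholder; use AmazonCloudWatch-linux for Linux targets, and verify the parameter names against the AmazonCloudWatch-ManageAgent document in your account):

import boto3

ssm = boto3.client("ssm")

# Tell the CloudWatch agent to load its configuration from Parameter Store.
response = ssm.send_command(
    InstanceIds=["i-0123456789abcdef0"],
    DocumentName="AmazonCloudWatch-ManageAgent",
    Parameters={
        "action": ["configure"],
        "mode": ["ec2"],
        "optionalConfigurationSource": ["ssm"],
        "optionalConfigurationLocation": ["AmazonCloudWatch-windows"],
        "optionalRestart": ["yes"],
    },
)
print(response["Command"]["CommandId"])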

Review the data collected by the CloudWatch agents

In this step, I walk through how to review the data collected by the CloudWatch agents.

  1. In the AWS Management console, go to CloudWatch.
  2. Click Metrics on the left-hand navigation.
  3. You should see a custom namespace named CWAgent. Click on CWAgent. Note that this namespace might take a couple of minutes to appear; refresh the page periodically until it does.
  4. Then click the ImageId, InstanceId hyperlinks to see the counters under that section.

CloudWatch Metrics: Showing counters under CWAgent

  5. Review the metrics captured by the CloudWatch agent. Notice the metrics that are only observable from inside the instance (for example, LogicalDisk % Free Space). These types of metrics would not be observable without installing the CloudWatch agent on the instance. From these metrics, you could create a CloudWatch Alarm to alert you if they go beyond a certain threshold. You can also add them to a CloudWatch Dashboard to review. To learn more about the metrics collected by the CloudWatch agent, see the documentation Metrics Collected by the CloudWatch Agent.
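
As an illustration of that last point, here is a sketch of an alarm on one of these agent-collected metrics, using boto3. The alarm name, instance ID, and threshold are placeholders, and the exact dimension set under the CWAgent namespace depends on your agent configuration, so check the metric in the console first.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when free space on the C: drive of a monitored Windows instance drops below 10%.
cloudwatch.put_metric_alarm(
    AlarmName="low-free-disk-space-example",
    Namespace="CWAgent",
    MetricName="LogicalDisk % Free Space",
    Dimensions=[
        {"Name": "InstanceId", "Value": "i-0123456789abcdef0"},
        {"Name": "instance", "Value": "C:"},
        {"Name": "objectname", "Value": "LogicalDisk"},
    ],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=10.0,
    ComparisonOperator="LessThanThreshold",
)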

Conclusion

In this blog, you learned how to deploy and configure the CloudWatch agent to capture metrics on either Linux or Windows instances. If you are done with this blog, we recommend deleting the Systems Manager Parameter Store entries, the CloudWatch data, and then the EC2 instances to avoid further charges. If you would like a video tutorial of this process, please see this Monitoring Amazon EC2 Windows Instances using Unified CloudWatch Agent video.

 

 

Investigate VPC flow with Amazon Detective

Post Syndicated from Ross Warren original https://aws.amazon.com/blogs/security/investigate-vpc-flow-with-amazon-detective/

Many Amazon Web Services (AWS) customers need enhanced insight into IP network flow. Traditionally, cost, the complexity of collection, and the time required for analysis have led to incomplete investigations of network flows. Having good telemetry is paramount, and VPC Flow Logs are a very important part of a robust centralized logging architecture. The information that VPC Flow Logs provide is frequently used by security analysts to determine the scope of security issues, to validate that network access rules are working as expected, and to help analysts investigate issues and diagnose network behaviors. Flow logs capture information about the IP traffic going to and from network interfaces in a VPC. Each record describes aspects of the traffic flow, such as where it originated and where it was sent to, what network ports were used, and how many bytes were sent.
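
For reference, the default (version 2) flow log record is a space-separated line with a fixed set of fields. The small parser below is only an illustration of what each record contains; Detective consumes these records for you, so you don't need code like this to follow the rest of the post.

# Field names in the default VPC Flow Logs (version 2) record format.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]

def parse_flow_log_record(line):
    """Turn one default-format flow log record into a field dictionary."""
    return dict(zip(FIELDS, line.split()))

# Example record (the account, interface, and addresses are made up).
record = parse_flow_log_record(
    "2 123456789010 eni-0123456789abcdef0 203.0.113.83 172.16.0.7 "
    "49152 22 6 20 4249 1418530010 1418530070 ACCEPT OK"
)
print(record["srcaddr"], record["dstaddr"], record["bytes"], record["action"])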

Amazon Detective now enables you to interactively examine the details of the virtual private cloud (VPC) network flows of your Amazon Elastic Compute Cloud (Amazon EC2) instances. Amazon Detective makes it easy to analyze, investigate, and quickly identify the root cause of potential security issues or suspicious activities. Detective automatically collects VPC flow logs from your monitored accounts, aggregates them by EC2 instance, and presents visual summaries and analytics about these network flows. Detective doesn’t require VPC Flow Logs to be configured and doesn’t impact existing flow log collection.

In this blog post, I describe how to use the new VPC flow feature in Detective to investigate an UnauthorizedAccess:EC2/TorClient finding from Amazon GuardDuty. Amazon GuardDuty is a threat detection service that continuously monitors for malicious activity and unauthorized behavior to protect your AWS accounts, workloads, and data stored in Amazon S3. GuardDuty documentation states that this alert can indicate unauthorized access to your AWS resources with the intent of hiding the unauthorized user’s true identity. I’ll demonstrate how to use Amazon Detective to investigate an instance that was flagged by Amazon GuardDuty to determine whether it is compromised or not.

Starting the investigation in GuardDuty

In my GuardDuty console, I’m going to select the UnauthorizedAccess:EC2/TorClient finding shown in Figure 1, choose the Actions menu, and select Investigate.
 

Figure 1: Investigating from the GuardDuty console

This opens a new browser tab and launches the Amazon Detective console, where I’m presented with the profile page for this finding, shown in Figure 2. You must have Detective enabled to pivot between a GuardDuty finding and Detective. Detective provides profile pages for supported GuardDuty findings and AWS resources (for example, IP address, EC2 instance, user, and role) that include information and data visualizations that summarize observed behaviors and give guidance for interpreting them. Profiles help analysts to determine whether the finding is of genuine concern or a false positive. For resources, profiles provide supporting details for an investigation into a finding or for a general hunt for suspicious activity.
 

Figure 2: Finding profile panel

In this case, the profile page for this GuardDuty UnauthorizedAccess:EC2/TorClient finding provides contextual and behavioral data about the EC2 instance on which GuardDuty has noted the issue. As I dive into this finding, I’m going to be asking questions that help assess whether the instance was in fact accessed unintentionally, such as, “What IP port or network service was in use at that time?,” “Were any large data transfers involved?,” “Was the traffic allowed by my security groups?” Profile pages in Detective organize content that helps security analysts investigate GuardDuty findings, examine unexpected network behavior, and identify other AWS resources that might be affected by a potential security issue.

I begin scrolling down the page and notice the Findings associated with EC2 instance i-9999999999999999 panel. Detective displays related findings to provide analysts with additional evidence and context about potentially related issues. The finding I’m investigating is listed there, as well as an Unusual Behaviors/VM/Behavior:EC2-NetworkPortUnusual finding. GuardDuty builds a baseline of your network traffic and generates findings when traffic falls outside the calculated normal range. While we might not investigate every instance of anomalous traffic, having these alerts correlated by Detective provides context for validating the issue. Keeping this in mind as I scroll down, at the bottom of this profile page, I find the Overall VPC flow volume panel. If you choose the Info link next to the panel title, you can see helpful tips that describe how to use the visualizations and provide ideas for questions to ask within your investigation. These info links are available throughout Detective. Check them out!

Investigating VPC flow in Detective

In this investigation, I’m very curious about the two large spikes in inbound traffic that I see in the Overall VPC flow volume panel, which seem to be visually associated with some unusual outbound traffic spikes. It’s most likely that these outbound spikes are related to the Unusual Behaviors/VM/Behavior:EC2-NetworkPortUnusual finding I mentioned earlier. To start the investigation, I choose the display details for scope time button, shown circled at the bottom of Figure 2. This expands the VPC Flow Details, shown in Figure 3.
 

Figure 3: Our first look at VPC Flow Details

We now can see that each entry displays the volume of inbound traffic, the volume of outbound traffic, and whether the access request was accepted or rejected. Detective provides annotations on the VPC flows to help guide your investigation. These From finding annotations make it clear which flows and resources were involved in the finding. In this case, we can easily see (in Figure 3) the three IP addresses at the top of the list that triggered this GuardDuty finding.

I’m first going to focus on the spikes in traffic that are above the baseline. When I click on one of the spikes in the graph, the time window for the VPC flow activity now matches the dates of these spikes I’m investigating.

If I choose the Inbound Traffic column header, shown in Figure 4, I can find the flows that contributed to the spike during this time window.
 

Figure 4: Inbound traffic spikes

Note that the two large inbound spikes aren’t associated with the IP address from the UnauthorizedAccess:EC2/TorClient finding, based on the Detective annotation From finding. Let’s check the outbound traffic. If I do a quick sort of the table based on the outbound traffic column, as shown in Figure 5, we can also see the outbound spikes, and it isn’t immediately evident whether the spikes are associated with this finding. I could continue to investigate the spikes (because they are a visual anomaly), or focus just on the VPC flow traffic that GuardDuty and Detective have labeled as associated with this TOR finding.
 

Figure 5: Outbound traffic spikes

Let’s focus on the outbound and inbound spikes and see if we can determine what’s happening. The inbound spikes are on port 443, typically an HTTPS port, or a secure web connection. The outbound spikes are on port 22 (ssh), but go to IP addresses that appear to be internal, based on their 172.16.x.x addresses. The port 443 traffic might indicate a web server that’s open to the internet and receiving traffic. With further investigation, we can determine if this idea is valid, and continue hunting for potentially malicious traffic.

A good next step would be to investigate the two specific IP addresses to rule out their involvement in the finding. I can do this by right-clicking on either of the external IP addresses and opening a new tab, where I can focus on investigating these two specific IP addresses. I would take this line of investigation to possibly rule out the involvement of these IP addresses in this finding, determine if they regularly communicate with my resources, find out what instance(s) they’re related to, and see if there are other findings associated with these instances or IP addresses. This deeper investigation is outside the scope of this blog post, but it’s something you should be doing in your own environment.

IP addresses in AWS are ephemeral in nature. The unique identifier in VPC flow logs is the Instance ID. At the time of this investigation, 172.16.0.7 is assigned to the instance related to this finding, so let’s continue to take a look at the internal 172.16.0.7 IP address with 218 MB outbound traffic on port 22. I choose 172.16.0.7, and Detective opens up the profile page for this specific IP address, as shown in Figure 6. Here we see some interesting correlations: two other GuardDuty findings related to SSH brute-force attacks. These could be related to our outbound port 22 spikes, because they’re certainly in the window of time we’re investigating.
 

Figure 6: IP address profile panel

As part of a deeper investigation, you would investigate the SSH brute-force findings for 198.51.100.254 and 203.0.113.83, but for now I’m interested in what this IP address is involved in. Detective easily associates this 172.16.0.7 IP address with the instance that was assigned the IP during the scope time. I scroll down to the bottom of the profile page for 172.16.0.7 and investigate the i-9999999999999999 instance by choosing the instance name.

Filtering VPC flow activity

In Detective, we’re now looking at an instance profile panel similar to the one in Figure 2. Since we’re interested in VPC flow details, I scroll down and select display details for scope time.

To focus on specific activity, I can filter the activity details by the following values:

  • IP address
  • Local or remote port
  • Direction
  • Protocol
  • Whether the request was accepted or rejected

I’m going to filter these VPC flow details and just look at port 22 (sshd) inbound traffic. I select the Filter check box and select Local Port and 22, as shown in Figure 7. Detective fills in all the available ports for you, making it easy to complete this filter.
 

Figure 7: Port 22 traffic

The activity details show a few IP addresses related to port 22, and we’re still following the large outbound spikes of traffic. It’s outside the scope of this blog post, but now it would be time to start looking at your security groups and network access control lists (ACLs) and determine why port 22 is open to the internet and sending all this traffic.

Understanding traffic behavior

As an investigator, I now have a good picture of the traffic related to the initial finding, and by diving deeper we’re able to discover other interesting traffic during the same timeframe. While we may not always determine “who has done it,” the goal should be to improve our understanding of the behavior of our environment and gather important technical evidence. Detective helps you identify and investigate anomalies to give you insight into your environment. If we were to continue our investigation into the finding, here are some actions we can take within Detective.

Investigate VPC findings with Detective:

  • Perform ports and utilization analysis
    • Identify service and ephemeral ports
    • Determine whether traffic was accepted or rejected based on security groups and NACL configurations
    • Investigate possible reconnaissance traffic by exploring the significant amount of rejected traffic
  • Correlate EC2 instances to TCP/IP ports and IPs
  • Analyze traffic spikes and anomalies
  • Discover traffic patterns and make behavioral correlations

Explore EC2 instance behavior with Detective:

  • Directional Traffic Analysis
  • Investigate possible data exfiltration events by digging into large transfers
  • Enumerate distinct IP connections and sort and filter by protocol, amount of traffic, and traffic direction
  • Gather data related to a spike in port count from a single IP address (potential brute force) or multiple IP addresses (distributed denial of service (DDoS))

Additional forensics steps to consider

  • Snapshot EC2 Volumes
  • Memory dump of EC2 instance
  • Isolate EC2 instance
  • Review your authentication strategy and assess whether the chosen authentication method is sufficient to protect your asset

Summary

Without requiring you to set up infrastructure or spend time configuring log ingestion, Detective collects, organizes, and presents relevant data for your threat analysis and investigations. Security and operations teams will find this new capability helpful for simplifying EC2 traffic analysis, validating security group permissions, and diagnosing EC2 instance behavior. Detective does the heavy lifting of storing and analyzing VPC flow data so you can focus on quickly answering your investigative questions. VPC network flow details are available now in all Detective supported Regions and are included as part of your service subscription.

To get started, you can enable a 30-day free trial of Amazon Detective. See the AWS Regional Services page for all the regions where Detective is available. To learn more, visit the Amazon Detective product page.

Are you a visual learner? Check out Amazon Detective Overview and Demonstration. This video helps you learn how and when to use Amazon Detective to improve the security of your AWS resources.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Ross Warren

Ross Warren is a Solution Architect at AWS based in Northern Virginia. Prior to his work at AWS, Ross’ areas of focus included cyber threat hunting and security operations. Ross has worked at a handful of startups and has enjoyed the transition to AWS because he can continue to build solutions for customers on today’s most innovative platform.

Author

Jim Miller

Jim is a Solution Architect at AWS based in Connecticut. Jim has worked within cyber security his entire career with areas of focus including cyber security architecture and incident response. At AWS he loves building secure solutions for customers to enable teams to build and innovate with confidence.

Telltale: Netflix Application Monitoring Simplified

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/telltale-netflix-application-monitoring-simplified-5c08bfa780ba

By Andrei U., Seth Katz, Janak Ramachandran, Jeff Butsch, Peter Lau, Ram Vaithilingam, and Greg Burrell

Our Telltale Vision

An alert fires and you get paged in the middle of the night. A metric crossed a threshold. You’re half awake and wondering, “Is there really a problem or is this just an alert that needs tuning? When was the last time somebody adjusted our alert thresholds? Maybe it’s due to an upstream or downstream service?” This is a critical application so you drag yourself out of bed, open your laptop, and start poring through dashboards for more info. You’re not yet convinced there’s a real problem but you’re also aware that the clock is ticking as you dig through a mountain of data looking for clues.

Healthy Netflix services are essential to member joy. When you sit down to watch “Tiger King” you expect it to just play. Over the years we’ve learned from on-call engineers about the pain points of application monitoring: too many alerts, too many dashboards to scroll through, and too much configuration and maintenance. Our streaming teams need a monitoring system that enables them to quickly diagnose and remediate problems; seconds count! Our Node team needs a system that empowers a small group to operate a large fleet.

So we built Telltale.

The Telltale UI shows health over time and across multiple services.
The Telltale timeline.

Telltale combines a variety of data sources to create a holistic view of an application’s health. Telltale learns what constitutes typical health for an application, no alert tuning required. And because we know what’s healthy, we can let application owners know when their services are trending towards unhealthy.

Metrics are a key part of understanding application health. But sometimes you can have too many metrics, too many graphs, and too many dashboards. Telltale shows only the relevant data from the application plus that of upstream and downstream services. We use colors to indicate severity (users can opt to have Telltale display numbers in addition to colors) so users can tell, at a glance, the state of their application’s health. We also highlight interesting broader events such as regional traffic evacuations and nearby deployments, information that is vital to understanding health holistically. Especially during an incident.

That is our Telltale vision. It exists today and monitors the health of over 100 Netflix production-facing applications.

A call graph with four layers of services.
An application lives in an ecosystem

The Application Health Model

A microservice doesn’t live in isolation. It usually has dependencies, talks to other services, and lives in different AWS regions. The call graph above is a relatively simple one; call graphs can be much deeper, with dozens of services involved. An application is part of an ecosystem that can be subtly influenced by property changes or radically altered by region-wide events. The launch of a canary can affect an application. As can upstream or downstream deployments.

Telltale uses a variety of signals from multiple sources to assemble a constantly evolving model of the application’s health.

Different signals have different levels of importance to an application’s health. For example, a latency increase is less critical than an error rate increase, and some error codes are less critical than others. A canary launch two layers downstream might not be as significant as a deployment immediately upstream. A regional traffic shift means one region ends up with zero traffic while another region has double. You can imagine the impact that has on metrics. A metric’s meaning determines how we should interpret it.

Telltale takes all those factors into consideration when constructing its view of application health.

The application health model is the heart of Telltale.
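
Purely as an illustration of the idea (this is not Telltale's actual model or code), you can think of it as a weighted blend: each signal carries a score from its analyzer and a weight reflecting how much it matters for that application.

from dataclasses import dataclass

@dataclass
class Signal:
    name: str      # e.g. "error rate", "latency", "downstream canary"
    score: float   # 0.0 (unhealthy) .. 1.0 (healthy), produced by an analyzer
    weight: float  # how much this signal matters for this application

def application_health(signals):
    """Toy weighted blend of signal scores; a stand-in for the real model."""
    total_weight = sum(s.weight for s in signals)
    return sum(s.score * s.weight for s in signals) / total_weight

signals = [
    Signal("error rate", score=0.4, weight=3.0),        # errors matter most
    Signal("latency", score=0.7, weight=1.5),           # latency matters less
    Signal("downstream canary", score=0.9, weight=0.5), # distant events matter least
]
print(f"health: {application_health(signals):.2f}")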

Intelligent Monitoring

Every service operator knows the difficulty of alert tuning. Set thresholds too low and you get a deluge of spurious alerts. So you overcompensate and relax the tuning to the point of missing important health warnings. The end result is a lack of trust in alerts. Telltale is built on the premise that you shouldn’t have to constantly tune configuration.

We make setup and configuration easy for application owners by providing curated and managed signal packs. These packs are combined into application profiles to address most common service types. Telltale automatically tracks dependencies between services to build the topology used in the application health model. Signal packs and topology detection keep configuration up-to-date with minimal effort. Those who want a more hands-on approach can still do manual configuration and tuning.

No single algorithm can account for the wide variety of signals we use. So, instead, we employ a mix of algorithms including statistical, rule based, and machine learning. We’ll do a future Netflix Tech Blog article focused on our algorithms. Telltale also has analyzers to detect long-term trends or memory leaks. Intelligent monitoring means results our users can trust. It means a faster time to detection and a faster time to resolution during an incident.

Intelligent Alerting

Intelligent monitoring yields intelligent alerting. Telltale creates an issue when it detects a health problem in your application’s ecosystem. Teams can opt in to alerting via Slack, email, or PagerDuty (all powered by our internal alerting system). If the issue is caused by an upstream or downstream system then Telltale’s context-aware routing alerts that team instead. Intelligent alerting also means a team receives a single notification, alert storms are a thing of the past.

An example of a Telltale alert notification in Slack.
An example of a Telltale notification in Slack.

When a problem strikes, it’s essential to have the right information. Our Slack alerts also start a thread containing only the most relevant context about the incident. This includes the signals that Telltale identified as unhealthy and the reasons why. The right context provides a better understanding of the application’s current state so the on-call engineer can return it to health.

Incidents evolve and have their own lifecycle, so updates are essential. Are things getting better or worse? Are there new signals or events to consider? Telltale updates the Slack thread as the current incident unfolds. The thread is marked Resolved upon return to healthy state so users know, at a glance, which incidents are ongoing and which have been successfully remediated.

But these Slack threads aren’t just for Telltale. Teams use them to share additional data, observations, theories, and discussion about the incident. Incident data and discussion all in one thread makes for shared understanding, faster resolution, and easier post-incident analysis.

We strive to improve the quality of Telltale alerts. One way to do that is to learn from our users. So we provide feedback buttons right in the Slack message. Users can tell us to suppress future occurrences of an alert. Or provide a reason for why an alert isn’t actionable. Intelligent alerting means alerts our users can trust.

An example of the details found in a Telltale notification in Slack. Which metrics and which triggers fired.
An example of the details found in a Telltale notification in Slack.

Why Is My Service Unhealthy?

A wide variety of signals, knowledge of the application’s ecosystem, and correlation of signals across multiple services helps Telltale to detect the possible causes of an application’s degraded health. Causes such as an outlier instance, a canary or deployment by a dependent service, an unhealthy database, or just a spike in traffic. Highlighting possible causes saves valuable time during an incident.

Incident Management

An incident summary gathers relevant information for the team.
An example of a Telltale incident summary.

When Telltale sends an alert it also creates a snapshot that has references to the unhealthy signals. As new information arrives, it’s added to this snapshot. This simplifies the post-incident review process for many teams. When it’s time to review past issues, the Application Incident Summary feature shows all aspects of recent issues in a single place including key metrics like total downtime and MTTR (Mean Time To Resolution). We want to help our teams see larger patterns of incidents so they can improve overall service availability.

The cluster view shows groupings of similar incidents so teams can understand the larger patterns.
The cluster view groups similar incidents.

Deployment Monitoring

Telltale’s application health model and intelligent monitoring have proven so powerful that we’re also using it for safer deployments. We start with Spinnaker, our open source delivery platform. As Spinnaker slowly rolls out a new build we use Telltale to continuously monitor the health of the instances running the new build. Continuous monitoring means a deployment stops and rolls back at the first sign of a problem. It means deployment problems have smaller blast radius and a shorter duration.

Continuous Improvement

Operating microservices in a complex ecosystem is challenging. We’re thrilled that Telltale’s intelligent monitoring and alerting helps our service operators improve availability, reduce toil, and sleep better at night. But we’re not done. We’re constantly exploring new algorithms to improve the accuracy of our alerts. We’ll write more about that in a future Netflix Tech Blog post. We’re also evaluating improvements to our application health model. We believe there’s useful information in service log and trace data. And benefits to employing higher resolution metrics. We’re looking forward to collaborating with our platform team on building out those new features. Getting new applications onto Telltale has been a white-glove treatment, which doesn’t scale well; we can definitely improve our self-service UI. And we know there are better heuristics to help pinpoint what’s affecting your service health.

Telltale is application monitoring simplified.

A healthy Netflix service enables us to entertain the world. Correlating disparate signals to model health in realtime is challenging. Add in thousands of streaming device types, an ever-evolving architecture, and a growing content production ecosystem and the problem becomes fascinating. If you’re passionate about observability then come talk to us.


Telltale: Netflix Application Monitoring Simplified was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Helping sites get back online: the origin monitoring intern project

Post Syndicated from Annika Garbers original https://blog.cloudflare.com/helping-sites-get-back-online-the-origin-monitoring-intern-project/


The most impactful internship experiences involve building something meaningful from scratch and learning along the way. Those can be tough goals to accomplish during a short summer internship, but our experience with Cloudflare’s 2019 intern program met both of them and more! Over the course of ten weeks, our team of three interns (two engineering, one product management) went from a problem statement to a new feature, which is still working in production for all Cloudflare customers.

The project

Cloudflare sits between customers’ origin servers and end users. This means that all traffic to the origin server runs through Cloudflare, so we know when something goes wrong with a server and sometimes reflect that status back to users. For example, if an origin is refusing connections and there’s no cached version of the site available, Cloudflare will display a 521 error. If customers don’t have monitoring systems configured to detect and notify them when failures like this occur, their websites may go down silently, and they may hear about the issue for the first time from angry users.

When a customer’s origin server is unreachable, Cloudflare sends a 5xx error back to the visitor.

This problem became the starting point for our summer internship project: since Cloudflare knows when customers’ origins are down, let’s send them a notification when it happens so they can take action to get their sites back online and reduce the impact to their users! This work became Cloudflare’s passive origin monitoring feature, which is currently available on all Cloudflare plans.

Over the course of our internship, we ran into lots of interesting technical and product problems, like:

Making big data small

Working with data from all requests going through Cloudflare’s 26 million+ Internet properties to look for unreachable origins is unrealistic from a data volume and performance perspective. Figuring out what datasets were available to analyze for the errors we were looking for, and how to adapt our whiteboarded algorithm ideas to use this data, was a challenge in itself.

Ensuring high alert quality

Because only a fraction of requests show up in the sampled timing and error dataset we chose to use, false positives/negatives were disproportionately likely to occur for low-traffic sites. These are the sites that are least likely to have sophisticated monitoring systems in place (and therefore are most in need of this feature!). In order to make the notifications as accurate and actionable as possible, we analyzed patterns of failed requests throughout different types of Cloudflare Internet properties. We used this data to determine thresholds that would maximize the number of true positive notifications, while making sure they weren’t so sensitive that we end up spamming customers with emails about sporadic failures.
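
As a simplified illustration of that trade-off (not the algorithm or thresholds we shipped), an alerting rule generally needs both a minimum sample size and a minimum failure ratio, so that a handful of sporadic errors on a low-traffic site doesn't trigger an email:

def should_notify(error_count, total_requests,
                  min_requests=100, error_ratio_threshold=0.5):
    """Toy origin-unreachable rule: require enough samples AND a high failure ratio.

    The numbers here are made up for illustration; the real thresholds came from
    analyzing failure patterns across many kinds of Internet properties.
    """
    if total_requests < min_requests:
        return False  # not enough data to judge a low-traffic site
    return (error_count / total_requests) >= error_ratio_threshold

print(should_notify(error_count=3, total_requests=5))      # False: too few samples
print(should_notify(error_count=450, total_requests=600))  # True: sustained failures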

Designing actionable notifications

Cloudflare has lots of different kinds of customers, from people running personal blogs with interest in DDoS mitigation to large enterprise companies with extremely sophisticated monitoring systems and global teams dedicated to incident response. We wanted to make sure that our notifications were understandable and actionable for people with varying technical backgrounds, so we enabled the feature for small samples of customers and tested many variations of the “origin monitoring email”. Customers responded right back to our notification emails, sent in support questions, and posted on our community forums. These were all great sources of feedback that helped us improve the message’s clarity and actionability.

We frontloaded our internship with lots of research (both digging into request data to understand patterns in origin unreachability problems and talking to customers/poring over support tickets about origin unreachability) and then spent the next few weeks iterating. We enabled passive origin monitoring for all customers with some time remaining before the end of our internships, so we could spend time improving the supportability of our product, documenting our design decisions, and working with the team that would be taking ownership of the project.

We were also able to develop some smaller internal capabilities that built on the work we’d done for the customer-facing feature, like notifications on origin outage events for larger sites to help our account teams provide proactive support to customers. It was super rewarding to see our work in production, helping Cloudflare users get their sites back online faster after receiving origin monitoring notifications.

Our internship experience

The Cloudflare internship program was a whirlwind ten weeks, with each day presenting new challenges and learnings! Some factors that led to our productive and memorable summer included:

A well-scoped project

It can be tough to find a project that’s meaningful enough to make an impact but still doable within the short time period available for summer internships. We’re grateful to our managers and mentors for identifying an interesting problem that was the perfect size for us to work on, and for keeping us on the rails if the technical or product scope started to creep beyond what would be realistic for the time we had left.

Working as a team of interns

The immediate team working on the origin monitoring project consisted of three interns: Annika in product management and Ilya and Zhengyao in engineering. Having a dedicated team with similar goals and perspectives on the project helped us stay focused and work together naturally.

Quick, agile cycles

Since our project faced strict time constraints and our team was distributed across two offices (Champaign and San Francisco), it was critical for us to communicate frequently and work in short, iterative sprints. Daily standups, weekly planning meetings, and frequent feedback from customers and internal stakeholders helped us stay on track.

Great mentorship & lots of freedom

Our managers challenged us, but also gave us room to explore our ideas and develop our own work process. Their trust encouraged us to set ambitious goals for ourselves and enabled us to accomplish way more than we may have under strict process requirements.

After the internship

In the last week of our internships, the engineering interns, who were based in the Champaign, IL office, visited the San Francisco office to meet with the team that would be taking over the project when we left and to present our work to the company at our all hands meeting. The most exciting aspect of the visit: our presentation was preempted by Cloudflare’s co-founders announcing the company’s public S-1 filing at the all hands! 🙂


Over the next few months, Cloudflare added a notifications page for easy configurability and announced the availability of passive origin monitoring along with some other tools to help customers monitor their servers and avoid downtime.

Ilya is working for Cloudflare part-time during the school semester and heading back for another internship this summer, and Annika is joining the team full-time after graduation this May. We’re excited to keep working on tools that help make the Internet a better place!

Also, Cloudflare is doubling the size of the 2020 intern class—if you or someone you know is interested in an internship like this one, check out the open positions in software engineering, security engineering, and product management.

Monitoring and management with Amazon QuickSight and Athena in your CI/CD pipeline

Post Syndicated from Umair Nawaz original https://aws.amazon.com/blogs/devops/monitoring-and-management-with-amazon-quicksight-and-athena-in-your-ci-cd-pipeline/

One of the many ways to monitor and manage required CI/CD metrics is to use Amazon QuickSight to build customized visualizations. Additionally, by applying Lean management to software delivery processes, organizations can deliver features faster, pivot when needed, respond to compliance and security changes, and take advantage of instant feedback to improve the customer delivery experience. This blog post demonstrates how AWS resources and tools can provide monitoring of, and insight into, your CI/CD pipelines.

There are three principles in Lean management that this solution enables and contributes to:

  • Limiting work in progress by establishing constraints that drive process improvement and increase throughput.
  • Creating and maintaining dashboards displaying key quality information, productivity metrics, and current status of work (including defects).
  • Using data from development performance and operations monitoring tools to enable business decisions more frequently.

Overview

The following architectural diagram shows how to use AWS services to collect metrics from a CI/CD pipeline and deliver insights through Amazon QuickSight dashboards.

Architecture diagram showing an overview of how CI/CD metrics are extracted and transformed to create a dynamic QuickSight dashboard

In this example, the orchestrator for the CI/CD pipeline is AWS CodePipeline with the entry point as an AWS CodeCommit Git repository for source control. When a developer pushes a code change into the CodeCommit repository, the change goes through a series of phases in CodePipeline. AWS CodeBuild is responsible for performing build actions and, upon successful completion of this phase, AWS CodeDeploy kicks off the actions to execute the deployment.

For each action in CodePipeline, the following series of events occurs:

  • An Amazon CloudWatch rule creates a CloudWatch event containing the action’s metadata.
  • The CloudWatch event triggers an AWS Lambda function.
  • The Lambda function extracts relevant reporting data and writes it to a CSV file in an Amazon S3 bucket.
  • Amazon Athena queries the Amazon S3 bucket and loads the query results into SPICE (an in-memory engine for Amazon QuickSight).
  • Amazon QuickSight obtains data from SPICE to build dashboard displays for the management team.

Note: This solution is for an AWS account with one or more existing CodePipeline pipelines. If you do not have any pipelines, no metrics will be collected.

Getting started

To get started, follow these steps:

  • Create a Lambda function and copy the following code snippet. Be sure to replace the bucket name with the one used to store your event data. This Lambda function takes the payload from a CloudWatch event, extracts the fields pipeline, time, state, execution, stage, and action, and transforms them into a CSV file.

Note: Athena’s performance can be improved by compressing, partitioning, or converting data into columnar formats such as Apache Parquet. In this use case, the dataset size is negligible; therefore, a transformation from CSV to Parquet is not required.


import boto3
import csv
import datetime
import os

# Analyze payload from CloudWatch Event
def pipeline_execution(data):
    print(data)
    # Specify data fields to deliver to S3, starting with a header row
    row = ['pipeline,time,state,execution,stage,action']

    if "stage" in data['detail'].keys():
        stage = data['detail']['stage']
    else:
        stage = 'NA'

    if "action" in data['detail'].keys():
        action = data['detail']['action']
    else:
        action = 'NA'
    row.append(data['detail']['pipeline'] + ',' + data['time'] + ',' + data['detail']['state'] + ',' + data['detail']['execution'] + ',' + stage + ',' + action)
    values = '\n'.join(str(v) for v in row)
    return values

# Upload CSV file to S3 bucket
def upload_data_to_s3(data):
    s3 = boto3.client('s3')
    runDate = datetime.datetime.now().strftime("%Y-%m-%d_%H:%M:%S:%f")
    csv_key = runDate + '.csv'
    response = s3.put_object(
        Body=data,
        Bucket='<example-bucket>',
        Key=csv_key
    )

def lambda_handler(event, context):
    upload_data_to_s3(pipeline_execution(event))
  • Create an Athena table to query the data stored in the Amazon S3 bucket. Execute the following SQL in the Athena query console and provide the bucket name that will hold the data.
CREATE EXTERNAL TABLE `devops`(
   `pipeline` string, 
   `time` string, 
   `state` string, 
   `execution` string, 
   `stage` string, 
   `action` string)
 ROW FORMAT DELIMITED 
   FIELDS TERMINATED BY ',' 
 STORED AS INPUTFORMAT 
   'org.apache.hadoop.mapred.TextInputFormat' 
 OUTPUTFORMAT 
   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
 LOCATION
   's3://<example-bucket>/'
 TBLPROPERTIES (
   'areColumnsQuoted'='false', 
   'classification'='csv', 
   'columnsOrdered'='true', 
   'compressionType'='none', 
   'delimiter'=',', 
   'skip.header.line.count'='1',  
   'typeOfData'='file')  
  • Create a CloudWatch event rule that passes events to the Lambda function created in Step 1. In the event rule configuration, set the Service Name as CodePipeline and, for Event Type, select All Events.
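
These console steps can also be scripted. The boto3 sketch below creates a rule that matches all CodePipeline events (the equivalent of Service Name = CodePipeline, Event Type = All Events) and points it at the Lambda function; the function name, rule name, and ARN are placeholders for your own values.

import json
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Match every event emitted by CodePipeline.
rule_arn = events.put_rule(
    Name="codepipeline-metrics-to-lambda",
    EventPattern=json.dumps({"source": ["aws.codepipeline"]}),
    State="ENABLED",
)["RuleArn"]

# Allow CloudWatch Events to invoke the Lambda function created in the first step.
lambda_client.add_permission(
    FunctionName="pipeline-metrics-collector",   # placeholder function name
    StatementId="AllowCloudWatchEventsInvoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule_arn,
)

# Point the rule at the function.
events.put_targets(
    Rule="codepipeline-metrics-to-lambda",
    Targets=[{
        "Id": "pipeline-metrics-collector",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:pipeline-metrics-collector",
    }],
)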

Sample Dataset view from Athena.

Sample Athena query and the results
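
A summary similar to the one in the screenshot can be produced with a simple aggregation over the devops table. The boto3 sketch below embeds one such query; the database name and output location are placeholders for your own values.

import boto3

athena = boto3.client("athena")

# Count pipeline executions by outcome, the same aggregation the
# QuickSight visuals are built on.
query = """
SELECT pipeline, state, COUNT(*) AS occurrences
FROM devops
GROUP BY pipeline, state
ORDER BY pipeline, state
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "default"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://<example-bucket>/athena-results/"},
)
print(execution["QueryExecutionId"])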

Amazon QuickSight visuals

After the initial setup is done, you are ready to create your QuickSight dashboard. Be sure to check that the Athena permissions are properly set before creating an analysis to be published as an Amazon QuickSight dashboard.

Below are diagrams and figures from Amazon QuickSight that can be generated using the event data queried from Athena. In this example, you can see how many executions happened in the account and how many were successful.

The following screenshot shows that most pipeline executions are failing. A manager might be concerned that this points to a significant issue and prompt an investigation in which they can allocate resources to improve delivery and efficiency.

QuickSight Dashboard showing total execution successes and failures

The visual for this solution is dynamic in nature. If the pipeline has more or fewer actions, the visual adjusts automatically to reflect all of them. After looking at the success and failure rates for each CodePipeline action in Amazon QuickSight, as shown in the following screenshot, users can take targeted actions quickly. For example, if the team sees a lot of failures due to vulnerability scanning, they can work on improving that problem area to drive value for future code releases.

QuickSight Dashboard showing the successes and failures of pipeline actions

Day-over-day visuals reflect date-specific activity and enable teams to see their progress over a period of time.

QuickSight Dashboard showing day over day results of successful CI/CD executions and failures

Amazon QuickSight offers controls that can be configured to apply filters to visuals. For example, the following screenshot demonstrates how users can toggle between visuals for different applications.

QuickSight's control function to switch between different visualization options

Cleanup (optional)

In order to avoid unintended charges, delete the following resources:

  • Amazon CloudWatch event rule
  • Lambda function
  • Amazon S3 Bucket (the location in which CSV files generated by the Lambda function are stored)
  • Athena external table
  • Amazon QuickSight data sets
  • Analysis and dashboard

Conclusion

In this blog, we showed how metrics can be derived from a CI/CD pipeline. Utilizing Amazon QuickSight to create visuals from these metrics allows teams to continuously deliver updates on the deployment process to management. The aggregation of the captured data over time allows individual developers and teams to improve their processes. That is the goal of creating a Lean DevOps process: to oversee the meta-delivery pipeline and optimize all future releases by identifying weak spots and points of risk during the entire release process.

___________________________________________________________

About the Authors

Umair Nawaz is a DevOps Engineer at Amazon Web Services in New York City. He works on building secure architectures and advises enterprises on agile software delivery. He is motivated to solve problems strategically by utilizing modern technologies.
Christopher Flores is an Engagement Manager at Amazon Web Services in New York City. He leads AWS developers, partners, and client teams in using the customer engagement accelerator framework. Christopher expedites stakeholder alignment, enterprise cohesion and risk mitigation while ensuring feedback loops to close the engagement lifecycle.
Carol Liao is a Cloud Infrastructure Architect at Amazon Web Services in New York City. She enjoys designing and developing modern IT solutions in the cloud where there is always more to learn, more problems to solve, and more to build.

 

DevOps at re:Invent 2019!

Post Syndicated from Matt Dwyer original https://aws.amazon.com/blogs/devops/devops-at-reinvent-2019/

re:Invent 2019 is fast approaching (NEXT WEEK!) and we here at the AWS DevOps blog wanted to take a moment to highlight DevOps-focused presentations, share some tips from experienced re:Invent pros, and highlight a few sessions that still have availability for pre-registration. We’ve broken down the track into one overarching leadership session and four topic areas: (a) architecture, (b) culture, (c) software delivery/operations, and (d) AWS tools, services, and CLI.

In total there will be 145 DevOps track sessions, stretched over 5 days, and divided into four distinct session types:

  • Sessions (34) are one-hour presentations delivered by AWS experts and customer speakers who share their expertise / use cases
  • Workshops (20) are two-hours and fifteen minutes, hands-on sessions where you work in teams to solve problems using AWS services
  • Chalk Talks (41) are interactive white-boarding sessions with a smaller audience. They typically begin with a 10–15-minute presentation delivered by an AWS expert, followed by 45–50-minutes of Q&A
  • Builders Sessions (50) are one-hour, small group sessions with six customers and one AWS expert, who is there to help, answer questions, and provide guidance

Select DevOps-focused sessions are highlighted below. If you want to view and/or register for any session, including keynotes, builders’ fairs, and demo theater sessions, you can access the event catalog using your re:Invent registration credentials.

Reserve your seat for AWS re:Invent activities today >>

re:Invent TIP #1: Identify topics you are interested in before attending re:Invent and reserve a seat. We hold space in sessions, workshops, and chalk talks for walk-ups; however, if you want to get into a popular session, be prepared to wait in line!

Please see below for select sessions, workshops, and chalk talks that will be conducted during re:Invent.

LEADERSHIP SESSION DELIVERED BY KEN EXNER, DIRECTOR AWS DEVELOPER TOOLS

[Session] Leadership Session: Developer Tools on AWS (DOP210-L) — SPACE AVAILABLE! REGISTER TODAY!

Speaker 1: Ken Exner – Director, AWS Dev Tools, Amazon Web Services
Speaker 2: Kyle Thomson – SDE3, Amazon Web Services

Join Ken Exner, GM of AWS Developer Tools, as he shares the state of developer tooling on AWS, as well as the future of development on AWS. Ken uses insight from his position managing Amazon’s internal tooling to discuss Amazon’s practices and patterns for releasing software to the cloud. Additionally, Ken provides insight and updates across many areas of developer tooling, including infrastructure as code, authoring and debugging, automation and release, and observability. Throughout this session Ken will recap recent launches and show demos for some of the latest features.

re:Invent TIP #2: Leadership Sessions are a topic area’s State of the Union, where AWS leadership will share the vision and direction for a given topic at AWS.

(a) ARCHITECTURE

[Session] Amazon’s approach to failing successfully (DOP208-R; DOP208-R1) — SPACE AVAILABLE! REGISTER TODAY!

Speaker: Becky Weiss – Senior Principal Engineer, Amazon Web Services

Welcome to the real world, where things don’t always go your way. Systems can fail despite being designed to be highly available, scalable, and resilient. These failures, if used correctly, can be a powerful lever for gaining a deep understanding of how a system actually works, as well as a tool for learning how to avoid future failures. In this session, we cover Amazon’s favorite techniques for defining and reviewing metrics—watching the systems before they fail—as well as how to do an effective postmortem that drives both learning and meaningful improvement.

[Session] Improving resiliency with chaos engineering (DOP309-R; DOP309-R1) — SPACE AVAILABLE! REGISTER TODAY!

Speaker 1: Olga Hall – Senior Manager, Tech Program Management
Speaker 2: Adrian Hornsby – Principal Evangelist, Amazon Web Services

Failures are inevitable. Regardless of the engineering efforts put into building resilient systems and handling edge cases, sometimes a case beyond our reach turns a benign failure into a catastrophic one. Therefore, we should test and continuously improve our system’s resilience to failures to minimize impact on a user’s experience. Chaos engineering is one of the best ways to achieve that. In this session, you learn how Amazon Prime Video has implemented chaos engineering into its regular testing methods, helping it achieve increased resiliency.

[Session] Amazon’s approach to security during development (DOP310-R; DOP310-R1) — SPACE AVAILABLE! REGISTER TODAY!

Speaker: Colm MacCarthaigh – Senior Principal Engineer, Amazon Web Services

At AWS we say that security comes first—and we really mean it. In this session, hear about how AWS teams both minimize security risks in our products and respond to security issues proactively. We talk through how we integrate security reviews, penetration testing, code analysis, and formal verification into the development process. Additionally, we discuss how AWS engineering teams react quickly and decisively to new security risks as they emerge. We also share real-life firefighting examples and the lessons learned in the process.

[Session] Amazon’s approach to building resilient services (DOP342-R; DOP342-R1) — SPACE AVAILABLE! REGISTER TODAY!

Speaker: Marc Brooker – Senior Principal Engineer, Amazon Web Services

One of the biggest challenges of building services and systems is predicting the future. Changing load, business requirements, and customer behavior can all change in unexpected ways. In this talk, we look at how AWS builds, monitors, and operates services that handle the unexpected. Learn how to make your own services handle a changing world, from basic design principles to patterns you can apply today.

re:Invent TIP #3: Not sure where to spend your time? Let an AWS Hero give you some pointers. AWS Heroes are prominent AWS advocates who are passionate about sharing AWS knowledge with others. They have written guides to help attendees find relevant activities by providing recommendations based on specific demographics or areas of interest.

(b) CULTURE

[Session] Driving change and building a high-performance DevOps culture (DOP207-R; DOP207-R1)

Speaker: Mark Schwartz – Enterprise Strategist, Amazon Web Services

When it comes to digital transformation, every enterprise is different. There is often a person or group with a vision, knowledge of good practices, a sense of urgency, and the energy to break through impediments. They may be anywhere in the organizational structure: high, low, or—in a typical scenario—somewhere in middle management. Mark Schwartz, an enterprise strategist at AWS and the author of “The Art of Business Value” and “A Seat at the Table: IT Leadership in the Age of Agility,” shares some of his research into building a high-performance culture by driving change from every level of the organization.

[Session] Amazon’s approach to running service-oriented organizations (DOP301-R; DOP301-R1; DOP301-R2)

Speaker: Andy Troutman – Director AWS Developer Tools, Amazon Web Services

Amazon’s “two-pizza teams” are famously small teams that support a single service or feature. Each of these teams has the autonomy to build and operate their service in a way that best supports their customers. But how do you coordinate across tens, hundreds, or even thousands of two-pizza teams? In this session, we explain how Amazon coordinates technology development at scale by focusing on strategies that help teams coordinate while maintaining autonomy to drive innovation.

re:Invent TIP #4: The max number of 60-minute sessions you can attend during re:Invent is 24! These sessions (e.g., sessions, chalk talks, builders sessions) will usually make up the bulk of your agenda.

(c) SOFTWARE DELIVERY AND OPERATIONS

[Session] Strategies for securing code in the cloud and on premises (DOP320-R; DOP320-R1) — SPACE AVAILABLE! REGISTER TODAY!

Speaker 1: Craig Smith – Senior Solutions Architect
Speaker 2: Lee Packham – Solutions Architect

Some people prefer to keep their code and tooling on premises, though this can create headaches and slow teams down. Others prefer keeping code off of laptops that can be misplaced. In this session, we walk through the alternatives and recommend best practices for securing your code in cloud and on-premises environments. We demonstrate how to use services such as Amazon WorkSpaces to keep code secure in the cloud. We also show how to connect tools such as Amazon Elastic Container Registry (Amazon ECR) and AWS CodeBuild with your on-premises environments so that your teams can go fast while keeping your data off of the public internet.

[Session] Deploy your code, scale your application, and lower Cloud costs using AWS Elastic Beanstalk (DOP326) — SPACE AVAILABLE! REGISTER TODAY!

Speaker: Prashant Prahlad – Sr. Manager

You can effortlessly convert your code into web applications without having to worry about provisioning and managing AWS infrastructure, applying patches and updates to your platform, or using a variety of tools to monitor the health of your application. In this session, we show how anyone, not just professional developers, can use AWS Elastic Beanstalk in various scenarios: from an administrator moving a Windows .NET workload into the cloud, to a developer building a containerized enterprise app as a Docker image, to a data scientist deploying a machine learning model, all without the need to understand or manage the infrastructure details.

[Session] Amazon’s approach to high-availability deployment (DOP404-R; DOP404-R1) — SPACE AVAILABLE! REGISTER TODAY!

Speaker: Peter Ramensky – Senior Manager

Continuous-delivery failures can lead to reduced service availability and bad customer experiences. To maximize the rate of successful deployments, Amazon’s development teams implement guardrails in the end-to-end release process to minimize deployment errors, with a goal of achieving zero deployment failures. In this session, learn the continuous-delivery practices that we invented that help raise the bar and prevent costly deployment failures.

[Session] Introduction to DevOps on AWS (DOP209-R; DOP209-R1)

Speaker 1: Jonathan Weiss – Senior Manager
Speaker 2: Sebastien Stormacq – Senior Technical Evangelist

How can you accelerate the delivery of new, high-quality services? Are you able to experiment and get feedback quickly from your customers? How do you scale your development team from 1 to 1,000? To answer these questions, it is essential to leverage some key DevOps principles and use CI/CD pipelines so you can iterate on and quickly release features. In this talk, we walk you through the journey of a single developer building a successful product and scaling their team and processes to hundreds or thousands of deployments per day. We also walk you through best practices and using AWS tools to achieve your DevOps goals.

[Workshop] DevOps essentials: Introductory workshop on CI/CD practices (DOP201-R; DOP201-R1; DOP201-R2; DOP201-R3)

Speaker 1: Leo Zhadanovsky – Principal Solutions Architect
Speaker 2: Karthik Thirugnanasambandam – Partner Solutions Architect

In this session, learn how to effectively leverage various AWS services to improve developer productivity and reduce the overall time to market for new product capabilities. We demonstrate a prescriptive approach to incrementally adopt and embrace some of the best practices around continuous integration and delivery using AWS developer tools and third-party solutions, including AWS CodeCommit, AWS CodeBuild, Jenkins, AWS CodePipeline, AWS CodeDeploy, AWS X-Ray, and AWS Cloud9. We also highlight some best practices and productivity tips that can help make your software release process fast, automated, and reliable.

[Workshop] Implementing GitFlow with AWS tools (DOP202-R; DOP202-R1; DOP202-R2)

Speaker 1: Amit Jha – Sr. Solutions Architect
Speaker 2: Ashish Gore – Sr. Technical Account Manager

Utilizing short-lived feature branches is the development method of choice for many teams. In this workshop, you learn how to use AWS tools to automate merge-and-release tasks. We cover high-level frameworks for how to implement GitFlow using AWS CodePipeline, AWS CodeCommit, AWS CodeBuild, and AWS CodeDeploy. You also get an opportunity to walk through a prebuilt example and examine how the framework can be adopted for individual use cases.

[Chalk Talk] Generating dynamic deployment pipelines with AWS CDK (DOP311-R; DOP311-R1; DOP311-R2)

Speaker 1: Flynn Bundy – AppDev Consultant
Speaker 2: Koen van Blijderveen – Senior Security Consultant

In this session we dive deep into dynamically generating deployment pipelines that deploy across multiple AWS accounts and Regions. Using the power of the AWS Cloud Development Kit (AWS CDK), we demonstrate how to simplify and abstract the creation of deployment pipelines to suit a range of scenarios. We highlight how AWS CodePipeline—along with AWS CodeBuild, AWS CodeCommit, and AWS CodeDeploy—can be structured together with the AWS deployment framework to get the most out of your infrastructure and application deployments.

[Chalk Talk] Customize AWS CloudFormation with open-source tools (DOP312-R; DOP312-R1; DOP312-E)

Speaker 1: Luis Colon – Senior Developer Advocate
Speaker 2: Ryan Lohan – Senior Software Engineer

In this session, we showcase some of the best open-source tools available for AWS CloudFormation customers, including conversion and validation utilities. Get a glimpse of the many open-source projects that you can use as you create and maintain your AWS CloudFormation stacks.

[Chalk Talk] Optimizing Java applications for scale on AWS (DOP314-R; DOP314-R1; DOP314-R2)

Speaker 1: Sam Fink – SDE II
Speaker 2: Kyle Thomson – SDE3

Executing at scale in the cloud can require more than the conventional best practices. During this talk, we offer a number of different Java-related tools you can add to your AWS tool belt to help you more efficiently develop Java applications on AWS—as well as strategies for optimizing those applications. We adapt the talk on the fly to cover the topics that interest the group most, including more easily accessing Amazon DynamoDB, handling high-throughput uploads to and downloads from Amazon Simple Storage Service (Amazon S3), troubleshooting Amazon ECS services, working with local AWS Lambda invocations, optimizing the Java SDK, and more.

[Chalk Talk] Securing your CI/CD tools and environments (DOP316-R; DOP316-R1; DOP316-R2)

Speaker: Leo Zhadanovsky – Principal Solutions Architect

In this session, we discuss how to configure security for AWS CodePipeline, deployments in AWS CodeDeploy, builds in AWS CodeBuild, and git access with AWS CodeCommit. We discuss AWS Identity and Access Management (IAM) best practices, to allow you to set up least-privilege access to these services. We also demonstrate how to ensure that your pipelines meet your security and compliance standards with the CodePipeline AWS Config integration, as well as manual approvals. Lastly, we show you best-practice patterns for integrating security testing of your deployment artifacts inside of your CI/CD pipelines.

[Chalk Talk] Amazon’s approach to automated testing (DOP317-R; DOP317-R1; DOP317-R2)

Speaker 1: Carlos Arguelles – Principal Engineer
Speaker 2: Charlie Roberts – Senior SDET

Join us for a session about how Amazon uses testing strategies to build a culture of quality. Learn Amazon’s best practices around load testing, unit testing, integration testing, and UI testing. We also discuss what parts of testing are automated and how we take advantage of tools, and share how we strategize to fail early to ensure minimum impact to end users.

[Chalk Talk] Building and deploying applications on AWS with Python (DOP319-R; DOP319-R1; DOP319-R2)

Speaker 1: James Saryerwinnie – Senior Software Engineer
Speaker 2: Kyle Knapp – Software Development Engineer

In this session, hear from core developers of the AWS SDK for Python (Boto3) as we walk through the design of sample Python applications. We cover best practices in using Boto3 and look at other libraries to help build these applications, including AWS Chalice, a serverless microframework for Python. Additionally, we discuss testing and deployment strategies to manage the lifecycle of your applications.

[Chalk Talk] Deploying AWS CloudFormation StackSets across accounts and Regions (DOP325-R; DOP325-R1)

Speaker 1: Mahesh Gundelly – Software Development Manager
Speaker 2: Prabhu Nakkeeran – Software Development Manager

AWS CloudFormation StackSets can be a critical tool to efficiently manage deployments of resources across multiple accounts and regions. In this session, we cover how AWS CloudFormation StackSets can help you ensure that all of your accounts have the proper resources in place to meet security, governance, and regulation requirements. We also cover how to make the most of the latest functionalities and discuss best practices, including how to plan for safe deployments with minimal blast radius for critical changes.

[Chalk Talk] Monitoring and observability of serverless apps using AWS X-Ray (DOP327-R; DOP327-R1; DOP327-R2)

Speaker 1 (R, R1, R2): Shengxin Li – Software Development Engineer
Speaker 2 (R, R1): Sirirat Kongdee – Solutions Architect
Speaker 3 (R2): Eric Scholz – Solutions Architect, Amazon

Monitoring and observability are essential parts of DevOps best practices. You need monitoring to debug and trace unhandled errors, performance bottlenecks, and customer impact in the distributed nature of a microservices architecture. In this chalk talk, we show you how to integrate the AWS X-Ray SDK to your code to provide observability to your overall application and drill down to each service component. We discuss how X-Ray can be used to analyze, identify, and alert on performance issues and errors and how it can help you troubleshoot application issues faster.

[Chalk Talk] Optimizing deployment strategies for speed & safety (DOP341-R; DOP341-R1; DOP341-R2)

Speaker: Karan Mahant – Software Development Manager, Amazon

Modern application development moves fast and demands continuous delivery. However, the greatest risk to an application’s availability can occur during deployments. Join us in this chalk talk to learn about deployment strategies for web servers and for Amazon EC2, container-based, and serverless architectures. Learn how you can optimize your deployments to increase productivity during development cycles and mitigate common risks when deploying to production by using canary and blue/green deployment strategies. Further, we share our learnings from operating production services at AWS.

[Chalk Talk] Continuous integration using AWS tools (DOP216-R; DOP216-R1; DOP216-R2)

Speaker: Richard Boyd – Sr Developer Advocate, Amazon Web Services

Today, more teams are adopting continuous-integration (CI) techniques to enable collaboration, increase agility, and deliver a high-quality product faster. Cloud-based development tools such as AWS CodeCommit and AWS CodeBuild can enable teams to easily adopt CI practices without the need to manage infrastructure. In this session, we showcase best practices for continuous integration and discuss how to effectively use AWS tools for CI.

re:Invent TIP #5: If you’re traveling to another session across campus, give yourself at least 60 minutes!

(d) AWS TOOLS, SERVICES, AND CLI

[Session] Best practices for authoring AWS CloudFormation (DOP302-R; DOP302-R1)

Speaker 1: Olivier Munn – Sr Product Manager Technical, Amazon Web Services
Speaker 2: Dan Blanco – Developer Advocate, Amazon Web Services

Incorporating infrastructure as code into software development practices can help teams and organizations improve automation and throughput without sacrificing quality and uptime. In this session, we cover multiple best practices for writing, testing, and maintaining AWS CloudFormation template code. You learn about IDE plug-ins, reusability, testing tools, modularizing stacks, and more. During the session, we also review sample code that showcases some of the best practices in a way that lends more context and clarity.

[Chalk Talk] Using AWS tools to author and debug applications (DOP215-R; DOP215-R1; DOP215-R2) — SPACE AVAILABLE! REGISTER TODAY!

Speaker: Fabian Jakobs – Principal Engineer, Amazon Web Services

Every organization wants its developers to be faster and more productive. AWS Cloud9 lets you create isolated cloud-based development environments for each project and access them from a powerful web-based IDE anywhere, anytime. In this session, we demonstrate how to use AWS Cloud9 and provide an overview of IDE toolkits that can be used to author application code.

[Session] Migrating .NET frameworks to the cloud (DOP321) — SPACE AVAILABLE! REGISTER TODAY!

Speaker: Robert Zhu – Principal Technical Evangelist, Amazon Web Services

Learn how to migrate your .NET application to AWS with minimal steps. In this demo-heavy session, we share best practices for migrating a three-tiered application on ASP.NET and SQL Server to AWS. Throughout the process, you get to see how AWS Toolkit for Visual Studio can enable you to fully leverage AWS services such as AWS Elastic Beanstalk, modernizing your application for more agile and flexible development.

[Session] Deep dive into AWS Cloud Development Kit (DOP402-R; DOP402-R1)

Speaker 1: Elad Ben-Israel – Principal Software Engineer, Amazon Web Services
Speaker 2: Jason Fulghum – Software Development Manager, Amazon Web Services

The AWS Cloud Development Kit (AWS CDK) is a multi-language, open-source framework that enables developers to harness the full power of familiar programming languages to define reusable cloud components and provision applications built from those components using AWS CloudFormation. In this session, you develop an AWS CDK application and learn how to quickly assemble AWS infrastructure. We explore the AWS Construct Library and show you how easy it is to configure your cloud resources, manage permissions, connect event sources, and build and publish your own constructs.

[Session] Introduction to the AWS CLI v2 (DOP406-R; DOP406-R1)

Speaker 1: James Saryerwinnie – Senior Software Engineer, Amazon Web Services
Speaker 2: Kyle Knapp – Software Development Engineer, Amazon Web Services

The AWS Command Line Interface (AWS CLI) is a command-line tool for interacting with AWS services and managing your AWS resources. We’ve taken all of the lessons learned from AWS CLI v1 (launched in 2013), and have been working on AWS CLI v2—the next major version of the AWS CLI—for the past year. AWS CLI v2 includes features such as improved installation mechanisms, a better getting-started experience, interactive workflows for resource management, and new high-level commands. Come hear from the core developers of the AWS CLI about how to upgrade and start using AWS CLI v2 today.

[Session] What’s new in AWS CloudFormation (DOP408-R; DOP408-R1; DOP408-R2)

Speaker 1: Jing Ling – Senior Product Manager, Amazon Web Services
Speaker 2: Luis Colon – Senior Developer Advocate, Amazon Web Services

AWS CloudFormation is one of the most widely used AWS tools, enabling infrastructure as code, deployment automation, repeatability, compliance, and standardization. In this session, we cover the latest improvements and best practices for AWS CloudFormation customers in particular, and for seasoned infrastructure engineers in general. We cover new features and improvements that span many use cases, including programmability options, cross-region and cross-account automation, operational safety, and additional integration with many other AWS services.

[Workshop] Get hands-on with Python/boto3 with no or minimal Python experience (DOP203-R; DOP203-R1; DOP203-R2)

Speaker 1: Herbert-John Kelly – Solutions Architect, Amazon Web Services
Speaker 2: Carl Johnson – Enterprise Solutions Architect, Amazon Web Services

Learning a programming language can seem like a huge investment. However, solving strategic business problems using modern technology approaches, like machine learning and big-data analytics, often requires some understanding. In this workshop, you learn the basics of using Python, one of the most popular programming languages that can be used for small tasks like simple operations automation, or large tasks like analyzing billions of records and training machine-learning models. You also learn about and use the AWS SDK (software development kit) for Python, called boto3, to write a Python program running on and interacting with resources in AWS.

[Workshop] Building reusable AWS CloudFormation templates (DOP304-R; DOP304-R1; DOP304-R2)

Speaker 1: Chelsey Salberg – Front End Engineer, Amazon Web Services
Speaker 2: Dan Blanco – Developer Advocate, Amazon Web Services

AWS CloudFormation gives you an easy way to define your infrastructure as code, but are you using it to its full potential? In this workshop, we take real-world architecture from a sandbox template to production-ready reusable code. We start by reviewing an initial template, which you update throughout the session to incorporate AWS CloudFormation features, like nested stacks and intrinsic functions. By the end of the workshop, expect to have a set of AWS CloudFormation templates that demonstrate the same best practices used in AWS Quick Starts.

[Workshop] Building a scalable serverless application with AWS CDK (DOP306-R; DOP306-R1; DOP306-R2; DOP306-R3)

Speaker 1: David Christiansen – Senior Partner Solutions Architect, Amazon Web Services
Speaker 2: Daniele Stroppa – Solutions Architect, Amazon Web Services

Dive into AWS and build a web application with the AWS Mythical Mysfits tutorial. In this workshop, you build a serverless application using AWS Lambda, Amazon API Gateway, and the AWS Cloud Development Kit (AWS CDK). Through the tutorial, you get hands-on experience using AWS CDK to model and provision a serverless distributed application infrastructure, you connect your application to a backend database, and you capture and analyze data on user behavior. Other AWS services that are utilized include Amazon Kinesis Data Firehose and Amazon DynamoDB.

[Chalk Talk] Assembling an AWS CloudFormation authoring tool chain (DOP313-R; DOP313-R1; DOP313-R2)

Speaker 1: Nathan McCourtney – Sr System Development Engineer, Amazon Web Services
Speaker 2: Dan Blanco – Developer Advocate, Amazon Web Services

In this session, we provide a prescriptive tool chain and methodology to improve your coding productivity as you create and maintain AWS CloudFormation stacks. We cover authoring recommendations from editors and plugins, to setting up a deployment pipeline for your AWS CloudFormation code.

[Chalk Talk] Build using JavaScript with AWS Amplify, AWS Lambda, and AWS Fargate (DOP315-R; DOP315-R1; DOP315-R2)

Speaker 1: Trivikram Kamat – Software Development Engineer, Amazon Web Services
Speaker 2: Vinod Dinakaran – Software Development Manager, Amazon Web Services

Learn how to build applications with AWS Amplify on the front end and AWS Fargate and AWS Lambda on the backend, using protocols like HTTP/2 and the JavaScript SDKs in the browser and Node.js. Leverage the AWS SDK for JavaScript’s modular NPM packages in resource-constrained environments, and benefit from the built-in async features to run your Node.js applications, mobile applications, and SPAs at scale.

[Chalk Talk] Scaling CI/CD adoption using AWS CodePipeline and AWS CloudFormation (DOP318-R; DOP318-R1; DOP318-R2)

Speaker 1: Andrew Baird – Principal Solutions Architect, Amazon Web Services
Speaker 2: Neal Gamradt – Applications Architect, WarnerMedia

Enabling CI/CD across your organization through repeatable patterns and infrastructure-as-code templates can unlock development speed while encouraging best practices. The SEAD Architecture team at WarnerMedia helps encourage CI/CD adoption across their company. They do so by creating and maintaining easily extensible infrastructure-as-code patterns for creating new services and deploying to them automatically using CI/CD. In this session, learn about the patterns they have created and the lessons they have learned.

re:Invent TIP #6: There are lots of extra activities at re:Invent. Expect your evenings to fill up onsite! Check out the peculiar programs, including board games, bingo, arts & crafts, or ‘80s sing-alongs…

Debugging with Amazon CloudWatch Synthetics and AWS X-Ray

Post Syndicated from Nizar Tyrewalla original https://aws.amazon.com/blogs/devops/debugging-with-amazon-cloudwatch-synthetics-and-aws-x-ray/

Today, AWS X-Ray launches support for Amazon CloudWatch Synthetics, enabling developers to trace end-to-end requests from configurable scripts called “canaries”. These canaries run test scripts that monitor web endpoints and APIs using modular, lightweight tests that run 24×7, once per minute. They continuously capture the behavior and availability of the endpoint or URL being monitored, reporting what end customers are seeing. Customers get alerted immediately in case of failures and can tie them back to the root cause using traces. These canaries provide a complete picture of all the services invoked in the path. With this feature, developers and DevOps engineers can correlate failures in the application with the impact on end customers, determine the root cause of the failures, and identify the upstream and downstream services impacted.

Overview

X-Ray helps developers and DevOps engineers analyze and debug distributed applications, such as applications built with microservice architecture. Using X-Ray, you can understand how your application and its underlying services are performing in order to identify and troubleshoot the root causes of performance issues and errors. X-Ray helps you debug and triage distributed applications, wherever those applications are running, and whether the architecture is serverless, containers, Amazon EC2, on-premises, or a mixture of all of the above.

Amazon CloudWatch Synthetics is a fully managed synthetic monitoring service that allows developers and DevOps engineers to view their application endpoints and URLs using configurable scripts called “canaries” that run 24×7. Canaries alert you as soon as something does not work as expected, as defined by your script. CloudWatch Synthetics canaries can be customized to check for availability, latency, transactions, broken or dead links, step-by-step task completions, page load errors, load latencies for UI assets, complex wizard flows, or checkout flows in your applications.

CloudWatch Synthetics supports End-User Experience Monitoring (EUEM—Synthetic Monitoring), Web Services Monitoring (REST APIs), URL Monitoring, and Website Content Monitoring (protection from unauthorized changes in your websites from phishing, code injection, and cross-site scripting). Canary traffic can continually verify your customers’ experiences even when you don’t have any customer traffic. Canaries can discover issues as soon as your customers do.

Use cases

Here are some of the use cases for which developers and DevOps engineers can use AWS X-Ray with Amazon CloudWatch Synthetics.

Determine if there is a problem

You can determine if there is a problem with your CloudWatch Synthetics canaries by setting CloudWatch alarms based on certain thresholds. You can also determine an increase in errors, faults, throttling rates, or slow response times within your X-Ray Service Map.

Developers can set up thresholds on canaries in the Amazon CloudWatch Synthetics console, which creates CloudWatch alarms that indicate whether there were any issues. These thresholds are managed centrally to help you get alerted based on the granularity of the test run. They can then use Amazon Simple Notification Service (SNS) or CloudWatch Events for their alerts. For example, you can set a notification when the canary fails 5 times in 5 runs if the canary runs once per minute. You can further analyze and triage the HAR file, screenshots, and logs in the CloudWatch Synthetics console.
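
If you prefer to define such an alarm programmatically rather than through the console screens shown below, here is a minimal boto3 sketch. It assumes the CloudWatchSynthetics namespace, the SuccessPercent metric, and the CanaryName dimension that canaries publish by default, plus hypothetical canary and SNS topic names; adjust these for your own account.

import boto3

cloudwatch = boto3.client('cloudwatch')

# Alarm when the canary's success rate drops below 100% for 5 consecutive runs
# (assuming the canary runs once per minute).
cloudwatch.put_metric_alarm(
    AlarmName='my-canary-failing',                               # hypothetical alarm name
    Namespace='CloudWatchSynthetics',
    MetricName='SuccessPercent',
    Dimensions=[{'Name': 'CanaryName', 'Value': 'my-canary'}],   # hypothetical canary name
    Statistic='Average',
    Period=60,
    EvaluationPeriods=5,
    Threshold=100,
    ComparisonOperator='LessThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:canary-alerts'],  # hypothetical SNS topic
)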

Enable thresholds in CloudWatch Synthetics

Canary thresholds in Amazon CloudWatch

Screenshots in Amazon CloudWatch Synthetics console

 

 

HAR file in Amazon CloudWatch Synthetics console

Developers can also use the new synthetic canary client nodes with type Client::Synthetics in the AWS X-Ray console to pinpoint which synthetic canaries are experiencing issues with errors, faults, or throttling rates for the selected time frame. Select the synthetic canary node to view the response time distribution of the entire request. Zoom into a specific time range and dive deep into analyzing trends using X-Ray Analytics, as shown in the following screenshot of a service map with a synthetic canary client node.

X-Ray Service Map with Synthetics node showing errors

 

 

Find the root cause of ongoing failures

You can find the root cause of ongoing failures in upstream and downstream services to reduce the mean time to resolution.

Developers can view the end-to-end path of requests within the X-Ray service map to easily understand the services invoked, as well as the upstream and downstream services. You can also use trace maps for individual traces to dissect each request and determine which service results in the most latency or is causing an error, as shown in the following diagrams.

 

X-Ray Service Map with end to end path of the request invoked by Synthetic canaries

X-Ray Trace Map to view individual request invoked by Synthetic canaries

Once you receive a CloudWatch alarm for failures in a synthetic canary, it becomes critical to determine the root cause of the issue and reduce the impact on end users. Developers can use X-Ray Analytics to address this. X-Ray performs statistical modeling on trace data to determine the probable root cause of an issue. As you can see in the following screenshots, the synthetic tests for the “www” console application running on AWS Elastic Beanstalk are failing because of a throughput capacity exception from the DynamoDB table, which impacts the product microservice. This can be quickly determined in the X-Ray Analytics console. Developers can also determine which of their canary tests are causing the issue using the already created X-Ray annotation “Annotation.aws:canary_arn”.

 

Analyze root cause of the issues in Synthetic canaries using X-Ray Analytics

 

Root cause of the issue using X-Ray Analytics

 

Identify performance bottlenecks and trends

You can identify performance bottlenecks and trends seen by your Synthetics canaries’ traces over time.

View trends in the performance of your website or endpoint over time by using the continuous traffic from your synthetic canaries to populate a response time histogram. Understand the p50, p90, and p95 percentiles for requests originating from Synthetics tests, determine the proportion of failures, and use the time series activity to pinpoint when they occurred.

Response time histogram and time series activity for Synthetic canaries

 

Compare latency, error/fault rates, and end-user experience with Synthetic canaries

You can compare latency and error/fault rates and end-user experiences with the Synthetic canaries.

Using X-Ray Analytics, developers can compare any two trends or collections of traces. For example, in the following diagram, we compare the experience of end users of the system (shown in the blue trend line) to what the Synthetics tests are reporting (shown in the green trend line), starting around 3:00 AM. We can easily see that the Synthetics tests are reporting more 5xx errors than end users are experiencing.

Compare end-user experience with Synthetics canaries using X-Ray Analytics

 

Identify whether you have enough canary coverage for all of your APIs and URLs

When debugging synthetic canaries, developers and DevOps engineers want to understand whether they have enough coverage for their APIs and URLs and whether their Synthetic canaries have adequate tests. Using X-Ray Analytics, developers can compare the experience of Synthetic canaries with that of the rest of their end users. The following screenshot shows a blue trend line for canaries and a green trend line for the rest of the users. You can also see that two out of the three URLs don’t have canary tests.

Identify which URLs and API's have canaries running.

Slice and dice the X-Ray service map to focus only on Synthetics tests and their path to downstream services using X-Ray groups

Developers can have multiple APIs and applications running within an account. At that point it becomes critical to focus on a certain set of workflows, such as the Synthetics tests for the application “www” running on AWS Elastic Beanstalk, as shown below. Developers can create an X-Ray group with the filter expression “edge(id(name: “www”, type: “client::Synthetics”), id(name: “www”, type: “AWS::ElasticBeanstalk::Environment”))” to focus only on traces generated by their Synthetics tests for “www”.

Create X-Ray Groups for Synthetics canaries
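
The same group can also be created programmatically. Here is a minimal boto3 sketch; the group name is a placeholder, and the filter expression is the one quoted in the paragraph above.

import boto3

xray = boto3.client('xray')

# Create a group that matches only traces flowing from the Synthetics canary
# client node to the "www" Elastic Beanstalk environment.
xray.create_group(
    GroupName='www-synthetics',   # hypothetical group name
    FilterExpression=(
        'edge(id(name: "www", type: "client::Synthetics"), '
        'id(name: "www", type: "AWS::ElasticBeanstalk::Environment"))'
    ),
)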

 

Getting Started

Getting started with Synthetics is easy. Enable X-Ray tracing for your APIs, web application endpoints, and URLs being monitored by Synthetics canaries. Visit the documentation to learn more about getting started using AWS X-Ray with Amazon CloudWatch Synthetics.

Conclusion

In this post, I outlined how you can use AWS X-Ray to debug and triage canaries running in Amazon CloudWatch Synthetics. We looked at some of the use cases that enable developers and DevOps engineers to reduce time to resolution by viewing the end-to-end path of the request, determining the root cause of the issue, and understanding which upstream or downstream services are impacted.

Growth Monitor pi: an open monitoring system for plant science

Post Syndicated from Helen Lynn original https://www.raspberrypi.org/blog/growth-monitor-pi-an-open-monitoring-system-for-plant-science/

Plant scientists and agronomists use growth chambers to provide consistent growing conditions for the plants they study. This reduces confounding variables – inconsistent temperature or light levels, for example – that could render the results of their experiments less meaningful. To make sure that conditions really are consistent both within and between growth chambers, which minimises experimental bias and ensures that experiments are reproducible, it’s helpful to monitor and record environmental variables in the chambers.

A neat grid of small leafy plants on a black plastic tray. Metal housing and tubing is visible to the sides.

Arabidopsis thaliana in a growth chamber on the International Space Station. Many experimental plants are less well monitored than these ones.
(“Arabidopsis thaliana plants […]” by Rawpixel Ltd (original by NASA) / CC BY 2.0)

In a recent paper in Applications in Plant Sciences, Brandin Grindstaff and colleagues at the universities of Missouri and Arizona describe how they developed Growth Monitor pi, or GMpi: an affordable growth chamber monitor that provides wider functionality than other devices. As well as sensing growth conditions, it sends the gathered data to cloud storage, captures images, and generates alerts to inform scientists when conditions drift outside of an acceptable range.

The authors emphasise – and we heartily agree – that you don’t need expertise with software and computing to build, use, and adapt a system like this. They’ve written a detailed protocol and made available all the necessary software for any researcher to build GMpi, and they note that commercial solutions with similar functionality range in price from $10,000 to $1,000,000 – something of an incentive to give the DIY approach a go.

GMpi uses a Raspberry Pi Model 3B+, to which are connected temperature-humidity and light sensors from our friends at Adafruit, as well as a Raspberry Pi Camera Module.

The team used open-source app Rclone to upload sensor data to a cloud service, choosing Google Drive since it’s available for free. To alert users when growing conditions fall outside of a set range, they use the incoming webhooks app to generate notifications in a Slack channel. Sensor operation, data gathering, and remote monitoring are supported by a combination of software that’s available for free from the open-source community and software the authors developed themselves. Their package GMPi_Pack is available on GitHub.
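
As a rough illustration of how such an alert might look in practice (this is not the authors’ GMPi_Pack code), here is a minimal Python sketch that posts to a Slack incoming webhook when a temperature reading drifts out of range; the webhook URL and the thresholds are placeholders.

import requests

# Placeholder values -- substitute your own webhook URL and acceptable range.
WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"
TEMP_MIN_C, TEMP_MAX_C = 20.0, 26.0

def alert_if_out_of_range(temperature_c):
    """Post a Slack message if the chamber temperature leaves the set range."""
    if not (TEMP_MIN_C <= temperature_c <= TEMP_MAX_C):
        requests.post(WEBHOOK_URL, json={
            "text": f"Growth chamber temperature out of range: {temperature_c:.1f} C"
        })

alert_if_out_of_range(28.3)  # example reading that would trigger an alert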

With a bill of materials amounting to something in the region of $200, GMpi is another excellent example of affordable, accessible, customisable open labware that’s available to researchers and students. If you want to find out how to build GMpi for your lab, or just for your greenhouse, Affordable remote monitoring of plant growth in facilities using Raspberry Pi computers by Brandin et al. is available on PubMed Central, and it includes appendices with clear and detailed set-up instructions for the whole system.

The post Growth Monitor pi: an open monitoring system for plant science appeared first on Raspberry Pi.

Raspberry Pi internet kill switch

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/internet-kill-switch/

Control the internet in your home with this handy Raspberry Pi Zero W internet kill switch.

Internet Kill Switch!


Internet in my home wasn’t really a thing until I was in my late teens, and even then, there wasn’t that much online fun to be had. Not like there is now, with social media and online gaming and the YouTubes.

If I’d had access to the internet of today in my teens, I’m pretty sure I’d have never been off the thing. And that’s where a button like this would have been a godsend for my mother.

Shared by Nick Donaldson on his YouTube account, the Internet Kill Switch is a Raspberry Pi Zero W–powered emergency button that turns off all internet access in the house — perfect for keeping online activities to a reasonable level. Nick explains:

It’s every teenager’s worst nightmare… no WiFi! I built a standalone wireless Internet Kill Switch that lets me turn the internet off whenever I want. A Raspberry Pi Zero W monitors the switch and sends an alert via SSH over WiFi to my firewall, where another script watches for the alert and turns the external interface off or on. I have challenged the boys to hack it…
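
Nick’s own script isn’t shown here, but a minimal sketch of the Pi side might look something like this, assuming the gpiozero library, a button wired to GPIO 17, and key-based SSH access to the firewall; all names and paths are placeholders.

from signal import pause
import subprocess
from gpiozero import Button

button = Button(17)  # assumed GPIO pin for the kill switch

def toggle_internet():
    # Ask the firewall (over key-based SSH) to flip its external interface.
    subprocess.run(["ssh", "admin@firewall.local", "sudo /usr/local/bin/toggle-wan.sh"])

button.when_pressed = toggle_internet
pause()  # keep the script running, waiting for button presses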

The Raspberry Pi Zero W sits snug within the button casing and is powered by a battery. And so that the battery can be continuously recharged, the device sits on a wireless charging pad. Hence, the button is juiced up and ready to go at any time.

I can pick it up, walk around at any time, threaten the teenagers, and shut down the internet whenever I want, hahaha!

While internet service providers are starting to roll out smartphone apps that offer similar functionality, we like the physicality of this button.

Great job, Nick! Please don’t turn off our internet.

The post Raspberry Pi internet kill switch appeared first on Raspberry Pi.

Monitor air quality with a Raspberry Pi

Post Syndicated from Andrew Gregory original https://www.raspberrypi.org/blog/monitor-air-quality-with-a-raspberry-pi/

Add a sensor and some Python 3 to your Raspberry Pi to keep tabs on your local air pollution, in the project taken from Hackspace magazine issue 21.

Air is the very stuff we breathe. It’s about 78% nitrogen, 21% oxygen, and 1% argon, and then there’s the assorted ‘other’ bits and pieces – many of which have been spewed out by humans and our related machinery. Carbon dioxide is obviously an important polluter for climate change, but there are other bits we should be concerned about for our health, including particulate matter. This is just really small bits of stuff, like soot and smog. They’re grouped together based on their size – the most important, from a health perspective, are those that are smaller than 2.5 microns in width (known as PM2.5), and PM10, which are between 10 and 2.5 microns in width. This pollution is linked with respiratory illness, heart disease, and lung cancer.

Obviously, this is something that’s important to know about, but it’s something that – here in the UK – we have relatively little data on. While there are official sensors in most major towns and cities, the effects can be very localised around busy roads and trapped in valleys. How does the particular make-up of your area affect your air quality? We set out to monitor our environment to see how concerned we should be about our local air.

Getting started

We picked the SDS011 sensor for our project (see ‘Picking a sensor’ below for details on why). This sends output via a binary data format on a serial port. You can read this serial connection directly if you’re using a controller with a UART, but the sensors also usually come with a USB-to-serial connector, allowing you to plug it into any modern computer and read the data.

The USB-to-serial connector makes it easy to connect the sensor to a computer

The very simplest way of using this is to connect it to a computer. You can read the sensor values with software such as DustViewerSharp. If you’re just interested in reading data occasionally, this is a perfectly fine way of using the sensor, but we want a continuous monitoring station – and we didn’t want to leave our laptop in one place, running all the time. When it comes to small, low-power boards with USB ports, there’s one that always springs to mind – the Raspberry Pi.

First, you’ll need a Raspberry Pi (any version) that’s set up with the latest version of Raspbian, connected to your local network, and ideally with SSH enabled. If you’re unsure how to do this, there’s guidance on the Raspberry Pi website.

The wiring for this project is just about the simplest we’ll ever do: connect the SDS011 to the Raspberry Pi with the serial adapter, then plug the Raspberry Pi into a power source.

Before getting started on the code, we also need to set up a data repository. You can store your data wherever you like – on the SD card, or upload it to some cloud service. We’ve opted to upload it to Adafruit IO, an online service for storing data and making dashboards. You’ll need a free account, which you can sign up for on the Adafruit IO website – you’ll need to know your Adafruit username and Adafruit IO key in order to run the code below. If you’d rather use a different service, you’ll need to adjust the code to push your data there.

We’ll use Python 3 for our code, and we need two modules – one to read the data from the SDS011 and one to push it to Adafruit IO. You can install these by entering the following command in a terminal:

pip3 install pyserial adafruit-io

You’ll now need to open a text editor and enter the following code:
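
The original listing isn’t reproduced here, so what follows is a minimal reconstruction based on the description below; the serial port path is an assumption, and you’ll need to substitute your own Adafruit IO username, key, and feed names.

import serial
from Adafruit_IO import Client

aio = Client('your-adafruit-username', 'your-adafruit-io-key')  # substitute your own credentials
ser = serial.Serial('/dev/ttyUSB0')  # assumed device path for the USB-to-serial adapter

while True:
    data = []
    for i in range(10):
        data.append(ser.read())  # the SDS011 sends its readings in ten-byte frames

    pmtwofive = int.from_bytes(b''.join(data[2:4]), byteorder='little') / 10
    pmten = int.from_bytes(b''.join(data[4:6]), byteorder='little') / 10

    aio.send('kingswoodtwofive', pmtwofive)
    aio.send('kingswoodten', pmten)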

This does a few things. First, it reads ten bytes of data over the serial port – exactly ten because that’s the format that the SDS011 sends data in – and sticks these data points together to form a list of bytes that we call data.

We’re interested in bytes 2 and 3 for PM2.5 and 4 and 5 for PM10. We convert these from bytes to integer numbers with the slightly confusing line:

pmtwofive = int.from_bytes(b''.join(data[2:4]), byteorder='little') / 10

The from_bytes command takes a string of bytes and converts it into an integer. However, we don’t have a string of bytes, we have a list of two bytes, so we first need to convert this into a string. The b'' creates an empty byte string. We then use its join method, which takes a list and joins it together using this empty string as a separator. As the empty string contains nothing, this returns a byte string that just contains our two numbers. The byteorder flag denotes which way around the command should read the string. We divide the result by ten because the SDS011 reports its readings in tenths of a microgram per cubic metre and we want micrograms per cubic metre; for example, the two bytes 0x9A and 0x00 become 154, which divided by ten gives a reading of 15.4. Finally, aio.send pushes the data to Adafruit IO. Its first argument is the feed you want the data to go to. We used kingswoodtwofive and kingswoodten, as the sensor is based in Kingswood; you might want to choose a more geographically relevant name. You can now run your sensor with:

python3 airquality.py

…assuming you called the Python file airquality.py and it’s saved in the same directory the terminal’s in.

At this point, everything should work and you can set about running your sensor, but as one final point, let’s set it up to start automatically when you turn the Raspberry Pi on. Enter the command:

crontab -e

…and add this line to the file:

@reboot python3 /home/pi/airquality.py

With the code and electronic setup working, your sensor will need somewhere to live. If you want it outside, it’ll need a waterproof case (but include some way for air to get in). We used a Tupperware box with a hole cut in the bottom mounted on the wall, with a USB cable carrying power out via a window. How you do it, though, is up to you.

Now let’s democratise air quality data so we can make better decisions about the places we live.

Picking a sensor

There are a variety of particulate sensors on the market. We picked the SDS011 for a couple of reasons. Firstly, it’s cheap enough for many makers to be able to buy and build with. Secondly, it’s been reasonably well studied for accuracy. Both the hackAIR and InfluencAir projects have compared the readings from these sensors with more expensive, better-tested sensors, and the results have come back favourably. You can see more details at hsmag.cc/DiYPfg and hsmag.cc/Luhisr.

The one caveat is that the results are unreliable when the humidity is at the extremes (either very high or very low). The SDS011 is only rated to work up to 70% humidity. If you’re collecting data for a study, then you should discard any readings when the humidity is above this. HackAIR has a formula for attempting to correct for this, but it’s not reliable enough to neutralise the effect completely. See their website for more details: hsmag.cc/DhKaWZ.

Safe levels

Once you’re monitoring your PM2.5 data, what should you look out for? The World Health Organisation air quality guideline stipulates that PM2.5 not exceed 10 µg/m3 annual mean, or 25 µg/m3 24-hour mean; and that PM10 not exceed 20 µg/m3 annual mean, or 50 µg/m3 24-hour mean. However, even these might not be safe. In 2013, a large survey published in The Lancet “found a 7% increase in mortality with each 5 micrograms per cubic metre increase in particulate matter with a diameter of 2.5 micrometres (PM2.5).”

Where to locate your sensor

Standard advice for locating your sensor is that it should be outside and four metres above ground level. That’s good advice for general environmental monitoring; however, we’re not necessarily interested in general environmental monitoring – we’re interested in knowing what we’re breathing in.

Locating your monitor near your workbench will give you an idea of what you’re actually inhaling – useless for any environmental study, but useful if you spend a lot of time in there. We found, for example, that the glue gun produced huge amounts of PM2.5, and we’ll be far more careful with ventilation when using this tool in the future.

Adafruit IO

You can use any data platform you like. We chose Adafruit IO because it’s easy to use, lets you share visualisations (in the form of dashboards) with others, and connects with IFTTT to perform actions based on values (ours tweets when the air pollution is above legal limits).

One thing to be aware of is that Adafruit IO only holds data for 30 days (on the free tier at least). If you want historical data, you’ll need to sign up for the Plus option (which stores data for 60 days), or use an alternative storage method. You can use multiple data stores if you like.

Checking accuracy

Now you’ve got your monitoring station up and running, how do you know that it’s running properly? Perhaps there’s an issue with the sensor, or perhaps there’s a problem with the code. The easiest method of calibration is to test it against an accurate sensor, and most cities here in the UK have monitoring stations as part of Defra’s Automatic Urban and Rural Monitoring Network. You can find your local station here. Many other countries have equivalent public networks. Unless there is no other option, we would caution against using crowdsourced data for calibration, as these sensors aren’t themselves calibrated.

With a USB battery pack, you can head to your local monitoring point and see if your monitor is getting similar results to the monitoring network.

HackSpace magazine #21 is out now

You can read the rest of this feature in HackSpace magazine issue 21, out today in Tesco, WHSmith, and all good independent UK newsagents.

Or you can buy HackSpace mag directly from us — worldwide delivery is available. And if you’d like to own a handy digital version of the magazine, you can also download a free PDF.

The post Monitor air quality with a Raspberry Pi appeared first on Raspberry Pi.

Monitoring insects at the Victoria and Albert Museum

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/monitoring-insects-at-the-victoria-and-albert-museum/

A simple Raspberry Pi camera setup is helping staff at the Victoria and Albert Museum track and identify insects that are threatening priceless exhibits.

“Fiacre, I need an image of bug infestation at the V&A!”

The problem with bugs

Bugs: there’s no escaping them. Whether it’s ants in your kitchen or cockroaches in your post-apocalyptic fallout shelter, insects have a habit of inconveniently infesting edifices, intent on damaging beloved belongings.

And museums are as likely as anywhere to be hit by creepy-crawly visitors. Especially when many of their exhibits are old and deliciously dusty. Yum!

Tracking insects at the V&A

As Bhavesh Shah and Maris Ines Carvalho state on the V&A blog, monitoring insect activity has become common practice at their workplace. As part of the Integrated Pest Monitoring (IPM) strategy at the museum, they even have trained staff members who inspect traps and report back their findings.

“But what if we could develop a system that gives more insight into the behaviour of insects and then use this information to prevent future outbreaks?” ask Shah and Carvalho.

The team spent around £50 on a Raspberry Pi and a 160° camera, and used these and Claude Pageau’s PI-TIMOLO software project to build an insect monitoring system. The system is now integrated into the museum, tracking insects and recording their movements — even in low-light conditions.

Emma Ormond, Raspberry Pi Trading Office Manager and Doctor of Bugs, believes this to be a Bristletail or Silverfish.

“The initial results were promising. Temperature, humidity, and light sensors could also be added to find out, for example, what time of day insects are more active or if they favour particular environmental conditions.”

For more information on the project, visit the Victoria & Albert Museum blog. And for more information on the Victoria & Albert Museum, visit the Victoria & Albert Museum, London — it’s delightful. We highly recommend attending their Videogames: Design/Play/Disrupt exhibition, which is running until 24 February.

The post Monitoring insects at the Victoria and Albert Museum appeared first on Raspberry Pi.

New – How to better monitor your custom application metrics using Amazon CloudWatch Agent

Post Syndicated from Helen Lin original https://aws.amazon.com/blogs/devops/new-how-to-better-monitor-your-custom-application-metrics-using-amazon-cloudwatch-agent/

This blog was contributed by Zhou Fang, Sr. Software Development Engineer for Amazon CloudWatch and Helen Lin, Sr. Product Manager for Amazon CloudWatch

Amazon CloudWatch collects monitoring and operational data from both your AWS resources and on-premises servers, providing you with a unified view of your infrastructure and application health. By default, CloudWatch automatically collects and stores many of your AWS services’ metrics and enables you to monitor and alert on metrics such as high CPU utilization of your Amazon EC2 instances. With the CloudWatch Agent that launched last year, you can also deploy the agent to collect system metrics and application logs from both your Windows and Linux environments. Using this data collected by CloudWatch, you can build operational dashboards to monitor your service and application health, set high-resolution alarms to alert and take automated actions, and troubleshoot issues using Amazon CloudWatch Logs.

We recently introduced CloudWatch Agent support for collecting custom metrics using StatsD and collectd. It’s important to collect system metrics like available memory, and you might also want to monitor custom application metrics. You can use these custom application metrics, such as request count, to understand the traffic going through your application, or latency, so you can be alerted when requests take too long to process. StatsD and collectd are popular, open-source solutions that gather system statistics for a wide variety of applications. By combining the system metrics the agent already collects, with the StatsD protocol for instrumenting your own metrics and collectd’s numerous plugins, you can better monitor, analyze, alert, and troubleshoot the performance of your systems and applications.

Let’s dive into an example that demonstrates how to monitor your applications using the CloudWatch Agent.  I am operating a RESTful service that performs simple text encoding. I want to use CloudWatch to help monitor a few key metrics:

  • How many requests are coming into my service?
  • How many of these requests are unique?
  • What is the typical size of a request?
  • How long does it take to process a job?

These metrics help me understand my application performance and throughput, in addition to setting alarms on critical metrics that could indicate service degradation, such as request latency.

Step 1. Collecting StatsD metrics

My service is running on an EC2 instance, using Amazon Linux AMI 2018.03.0. Make sure to attach the CloudWatchAgentServerPolicy AWS managed policy to the instance's IAM role so that the CloudWatch agent can collect and publish metrics from this instance.

Here is the service structure:

 

The “/encode” handler simply returns the base64 encoded string of an input text.  To monitor key metrics, such as total and unique request count as well as request size and method response time, I used StatsD to define these custom metrics.

@RestController
public class EncodeController {

    @RequestMapping("/encode")
    public String encode(@RequestParam(value = "text") String text) {
        long startTime = System.currentTimeMillis();
        statsd.incrementCounter("totalRequest.count", new String[]{"path:/encode"});
        statsd.recordSetValue("uniqueRequest.count", text, new String[]{"path:/encode"});
        statsd.recordHistogramValue("request.size", text.length(), new String[]{"path:/encode"});
        String encodedString = Base64.getEncoder().encodeToString(text.getBytes());
        statsd.recordExecutionTime("latency", System.currentTimeMillis() - startTime, new String[]{"path:/encode"});
        return encodedString;
    }
}

Note that I first need to choose a StatsD client library (the original post links to the list of available clients).
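
As a reference, here is a minimal, hedged sketch of how that client might be initialized. It assumes the java-dogstatsd-client library, whose method names match the calls above (incrementCounter, recordSetValue, and so on), and the CloudWatch agent's default StatsD listener on UDP port 8125; the "encodeservice" prefix is purely illustrative.

import com.timgroup.statsd.NonBlockingStatsDClient;
import com.timgroup.statsd.StatsDClient;

public class Metrics {
    // localhost:8125 is the CloudWatch agent's default StatsD address; the prefix is an assumption
    public static final StatsDClient statsd =
            new NonBlockingStatsDClient("encodeservice", "localhost", 8125);
}

The controllers above can then reference Metrics.statsd, or declare an equivalent field of their own.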

The “/status” handler responds with a health check ping.  Here I am monitoring my available JVM memory:

@RestController
public class StatusController {

    @RequestMapping("/status")
    public int status() {
        statsd.recordGaugeValue("memory.free", Runtime.getRuntime().freeMemory(), new String[]{"path:/status"});
        return 0;
    }
}

 

Step 2. Emit custom metrics using collectd (optional)

collectd is another popular, open-source daemon for collecting application metrics. If I want to use the hundreds of available collectd plugins to gather application metrics, I can also use the CloudWatch Agent to publish collectd metrics to CloudWatch, which retains them for 15 months. In practice, I might choose either StatsD or collectd to collect custom metrics, or use both. All of these use cases are supported by the CloudWatch agent.

Using the same demo RESTful service, I’ll show you how to monitor my service health using the collectd cURL plugin, which passes the collectd metrics to CloudWatch Agent via the network plugin.

For my RESTful service, the “/status” handler returns HTTP code 200 to signify that it’s up and running. This makes it possible to monitor the health of my service and trigger an alert when the application does not respond with an HTTP 200 success code. Additionally, I want to monitor the elapsed time of each health check request.

To collect these metrics using collectd, I have a collectd daemon installed on the EC2 instance, running version 5.8.0. Here is my collectd config:

LoadPlugin logfile
LoadPlugin curl
LoadPlugin network

<Plugin logfile>
  LogLevel "debug"
  File "/var/log/collectd.log"
  Timestamp true
</Plugin>

<Plugin curl>
    <Page "status">
        URL "http://localhost:8080/status";
        MeasureResponseTime true
        MeasureResponseCode true
    </Page>
</Plugin>

<Plugin network>
    <Server "127.0.0.1" "25826">
        SecurityLevel Encrypt
        Username "user"
        Password "secret"
    </Server>
</Plugin>

 

For the cURL plugin, I configured it to measure response time (latency) and response code (HTTP status code) from the RESTful service.

Note that for the network plugin, I used Encrypt mode, which requires an authentication file for the CloudWatch Agent to authenticate incoming collectd requests. Click here for full details on the collectd installation script.
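
For reference, here is a hedged example of what that authentication file can look like. As I understand it, the agent's collectd_auth_file setting defaults to /etc/collectd/auth_file, and each line maps a username to a password that must match the Username and Password in the network plugin's <Server> block:

# /etc/collectd/auth_file (path is the agent's default; credentials are placeholders)
user: secret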

 

Step 3. Configure the CloudWatch agent

So far, I have shown you how to:

A.  Use StatsD to emit custom metrics to monitor my service health
B.  Optionally use collectd to collect metrics using plugins

Next, I will install and configure the CloudWatch agent to accept metrics from both the StatsD client and collectd plugins.

I installed the CloudWatch Agent following the instructions in the user guide, but here are the detailed steps:

Install CloudWatch Agent:

wget https://s3.amazonaws.com/amazoncloudwatch-agent/linux/amd64/latest/AmazonCloudWatchAgent.zip -O AmazonCloudWatchAgent.zip && unzip -o AmazonCloudWatchAgent.zip && sudo ./install.sh

Configure CloudWatch Agent to receive metrics from StatsD and collectd:

{
  "metrics": {
    "append_dimensions": {
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "collectd": {},
      "statsd": {}
    }
  }
}

Pass the above config (config.json) to the CloudWatch Agent:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:config.json -s
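
To confirm that the agent picked up the configuration and is running, you can query its status; the command below uses the standard Linux install path and should report a status of "running":

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a status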

In case you want to skip these steps and just execute my sample agent install script, you can find it here.

 

Step 4. Generate and monitor application traffic in CloudWatch

Now that I have the CloudWatch agent installed and configured to receive StatsD and collectd metrics, I’m going to generate traffic through the service:

echo "send 100 requests"
for i in {1..100}
do
   curl "localhost:8080/encode?text=TextToEncode_${i}[email protected]#%"
   echo ""
   sleep 1
done

 

Next, I log in to the CloudWatch console and check that the service is up and running. Here’s a graph of the StatsD metrics:

 

Here is a graph of the collectd metrics:

 

Conclusion

With StatsD and collectd support, you can now use the CloudWatch Agent to collect and monitor your custom applications in addition to the system metrics and application logs it already collects. Furthermore, you can create operational dashboards with these metrics, set alarms to take automated actions when free memory is low, and troubleshoot issues by diving into the application logs. Note that the agent's StatsD support covers both Windows and Linux operating systems, while collectd support is Linux only. For Windows, you can also continue to use Windows Performance Counters to collect custom metrics instead.

The CloudWatch Agent with custom metrics support (version 1.203420.0 or later) is available in all public AWS Regions and AWS GovCloud (US), with AWS China (Beijing) and AWS China (Ningxia) coming soon.

The agent is free to use; you pay the usual CloudWatch prices for logs and custom metrics.

For more details, head over to the CloudWatch user guide for StatsD and collectd.

Building an Amazon CloudWatch Dashboard Outside of the AWS Management Console

Post Syndicated from Stephen McCurry original https://aws.amazon.com/blogs/devops/building-an-amazon-cloudwatch-dashboard-outside-of-the-aws-management-console/

Steve McCurry is a Senior Product Manager for CloudWatch

This is the second in a series of two blog posts that demonstrate how to use the new CloudWatch
snapshot graphs feature. You can find the first post here.

A key challenge for any DevOps team is to provide sufficient monitoring visibility on service
health. Although CloudWatch dashboards are a powerful tool for monitoring your systems and
applications, the dashboards are accessible only to users with permissions to the AWS
Management Console. You can now use a new CloudWatch feature, snapshot graphs, to create
dashboards that contain CloudWatch graphs and are available outside of the AWS Management
Console. You can display CloudWatch snapshot graphs on your internal wiki pages or TV-based
dashboards. You can integrate them with chat applications and ticketing and bug tracking tools.

This blog post shows you how to embed CloudWatch snapshot graphs into your websites using a
lightweight, embeddable widget written in JavaScript.

Snapshot graphs overview

CloudWatch snapshot graphs are images of CloudWatch charts that are useful for building
custom dashboards or integrating with tools outside of AWS. Although the images are static,
they can be refreshed frequently to create a live dashboard experience.

CloudWatch dashboards and charts provide flexible, interactive visualizations that can be used to
create unified operational views across your AWS resources and metrics. However, maybe you
want to display CloudWatch charts on a TV screen for team-level visibility, take snapshots of
charts for auditing in ticketing systems and bug tracking tools, or share snapshots in chat
applications to collaborate on an issue. For these use cases and more, snapshot graphs are an
ideal tool for integrating CloudWatch charts with your webpages and third-party applications.

Snapshot graphs are available through the CloudWatch API, which you can use through the
AWS SDKs or CLI. The charts you request through the API are represented as JSON. To copy
the JSON definition of the graph and use it in the API request, open the Amazon CloudWatch
console. You’ll find the JSON on the Source tab of the Metrics page, as shown here.

All of the features of the CloudWatch line and stacked graphs are available in snapshot graphs,
including vertical and horizontal annotations.

Embedding a snapshot graph in your webpage

In this demonstration, we will set up monitoring for an EC2 instance and embed a CloudWatch
snapshot graph for CPUUtilization in a website outside of the AWS Management Console. The
embeddable widget can be configured to support any CloudWatch line or stacked chart. This
demonstration involves these steps:

  1. Create an EC2 instance to monitor.
  2. Create the Lambda function that calls CloudWatch GetMetricWidgetImage.
  3. Create an API Gateway endpoint that proxies requests to the Lambda function.
  4. Embed the widget into a website and configure it for the API Gateway request.

The code for this solution is available from the SnapshotWidgetDemo GitHub repo.

The embeddable JavaScript widget will communicate with CloudWatch through an Amazon API
Gateway endpoint and an AWS Lambda backend. The advantage of using API Gateway is
the additional flexibility you have to secure the endpoint and create fine-grained access control.
For example, you can block access to the endpoint from outside of your corporate network.
Amazon Route 53 could be an alternative solution.

The end goal is to have a webpage running on your local machine that displays a CloudWatch
snapshot graph displaying live metric data from a sample EC2 instance. The sample code
includes a basic webpage containing the embed code.

The JavaScript widget requests a snapshot graph from an API Gateway endpoint. API Gateway
proxies the request to a Lambda function that calls the new CloudWatch API,
GetMetricWidgetImage. The retrieved snapshot graph is returned as binary image data and
displayed on the website in an HTML IMG tag.
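
To make that flow concrete, here is a minimal sketch of such a Lambda function. It is not the exact code from the SnapshotWidgetDemo repo; it assumes a Node.js 8.10 or later runtime, an AWS SDK version that includes GetMetricWidgetImage, a Lambda proxy integration with binary support enabled in API Gateway, and an instanceId query string parameter. The widget definition is an illustrative example of the JSON you would otherwise copy from the Source tab.

// Hedged sketch: returns a CPUUtilization snapshot graph as a base64-encoded PNG
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

exports.handler = async (event) => {
  const widget = {
    metrics: [['AWS/EC2', 'CPUUtilization', 'InstanceId', event.queryStringParameters.instanceId]],
    width: 600,
    height: 400,
    start: '-PT3H',   // the last three hours
    end: 'PT0H'
  };

  const data = await cloudwatch
    .getMetricWidgetImage({ MetricWidget: JSON.stringify(widget) })
    .promise();

  return {
    statusCode: 200,
    isBase64Encoded: true,
    headers: { 'Content-Type': 'image/png' },
    body: data.MetricWidgetImage.toString('base64')
  };
};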

Here is what the end-to-end solution looks like:

Server setup

  1. Download the repository.
  2. Navigate to ./server and run npm install
  3. From the server folder, run zip -r snapshotwidgetdemo.zip ./*
  4. Upload snapshotwidgetdemo.zip to any S3 bucket.
  5. Upload ./server/apigateway-lambda.json to any S3 bucket.
  6. Navigate to the AWS CloudFormation console and choose Create Stack.
    • Point the new stack to the S3 location in step 5.
    • During setup, you will be asked for the Lambda S3 bucket name from step 4.

The AWS CloudFormation script will create all the required server-side components described in
the previous section.

Client setup

  1. Navigate to ./client and run npm install
  2. Edit ./demo/index.html to replace the following placeholders with your values.
    a. <YOUR_INSTANCE_ID> You can find the instance ID in the AWS
    CloudFormation stack output.
    b. <YOUR_API_GATEWAY_URL> You can find the full URL in the AWS
    CloudFormation stack output.
    c. <YOUR_API_KEY> The API Gateway endpoint requires a key. The key reference, but not
    the key itself, appears in the AWS CloudFormation stack output. To retrieve the
    key value, go to the Keys tab of the Amazon API Gateway console.
  3. Build the component using webpack: ./node_modules/.bin/webpack --config webpack.config.js
  4. Serve the demo webpage on localhost: ./node_modules/.bin/webpack-dev-server --open

The browser should open at index.html automatically. The page contains one embedded snapshot
graph with the CPU utilization of your EC2 instance.

Troubleshooting

If you don’t see anything on the webpage, use the browser console tools to check for console
error messages.

If you still can’t debug the problem, go to the Amazon API Gateway console. On the Logs tab,
make sure that Enable CloudWatch Logs is selected, as shown here:

To check the Lambda logs, in the CloudWatch console, choose the Logs tab, and then search for
the name of your Lambda function.

Summary

This blog post provided a solution for embedding CloudWatch snapshot graphs into webpages
and wikis outside of the AWS Management Console. To read the other blog post in this series
about CloudWatch snapshot graphs, see Reduce Time to Resolution with Amazon CloudWatch
Snapshot Graphs and Alerts.

For more information, see the snapshot graphs API documentation or visit our home page to
learn more about how Amazon CloudWatch achieves monitoring visibility for your cloud
resources and applications.

It would be great to hear your feedback.

Reduce Time to Resolution with Amazon CloudWatch Snapshot Graphs and Alerts

Post Syndicated from Stephen McCurry original https://aws.amazon.com/blogs/devops/reduce-time-to-resolution-with-amazon-cloudwatch-snapshot-graphs-and-alerts/

Steve McCurry is a Senior Product Manager for Amazon CloudWatch.

This is the first in a series of two blog posts that show how to use the new Amazon CloudWatch
snapshot graphs feature.

Although automated alerts are an important feature of any monitoring solution, including
Amazon CloudWatch, it can be challenging to identify the issues that matter in the noise of
monitoring alerts. Reducing the time to resolution depends on being able to make quick
decisions around the importance of alerts.

When you receive an alert, you ask, “Is this issue urgent?” Unfortunately, a page or email alert
that contains a text description about a symptom doesn’t provide enough context to answer the
question. You need to correlate the alert with metrics to understand what was happening around
the time of the alert. This digging for more information slows down the process of resolving the
issue.

This blog post shows you how to add richer context to an email alert by attaching a CloudWatch
snapshot graph. The snapshot graph shows the behavior of the underlying metric for the three
hours leading up to the alert.

Snapshot graphs overview

You can use snapshot graphs to integrate and display CloudWatch charts outside of the AWS
Management Console to improve monitoring visibility or reduce time to resolution. This feature
makes it possible for you to display CloudWatch charts on your webpage or integrate charts with
third-party tools, such as ticketing, chat applications, and bug tracking.

CloudWatch snapshot graphs are images of CloudWatch charts that are useful for building
custom dashboards. Although the images are static, they can be refreshed frequently to create a
live dashboard experience.

Snapshot graphs are available through the CloudWatch API, which you can use through the
AWS SDKs or AWS CLI. The charts you request through the API are represented as JSON. To
copy the JSON definition of the chart and use it in the API request, open the Amazon
CloudWatch console. You’ll find the JSON on the Source tab of the Metrics page, as shown
here.

All of the features of the CloudWatch line and stacked charts are available in snapshot graphs,
including vertical and horizontal annotations. The example in this post uses horizontal annotations.
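
As a hedged illustration, a metric widget definition with a horizontal annotation looks roughly like the following; the instance ID and threshold are placeholders, and in practice you copy the real definition from the Source tab rather than writing it by hand:

{
  "metrics": [
    [ "AWS/EC2", "CPUUtilization", "InstanceId", "i-0123456789abcdef0" ]
  ],
  "annotations": {
    "horizontal": [ { "label": "Alarm threshold", "value": 80 } ]
  },
  "start": "-PT3H",
  "end": "PT0H",
  "title": "CPUUtilization"
}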

Adding context to an EC2 monitoring alert

In this post, we will set up monitoring for an EC2 instance and generate a monitoring alert. The
alert contains details and a chart that displays the trend of the underlying metric (CPUUtilization)
for three hours leading up to the alert. To follow along, use the code in the SnapshotAlarmDemo
GitHub repo.

These are the steps required for the monitoring workflow:

  1. Create an EC2 instance to monitor.
  2. Create a Lambda function that creates an email alert with a snapshot graph attachment.
  3. Create an Amazon Simple Notification Service (Amazon SNS) topic with a Lambda
    subscription.
  4. Configure Amazon Simple Email Service (Amazon SES) to send email to your
    account(s).
  5. Create an alarm on a metric in CloudWatch whose target is the SNS topic.

After these components are set up and configured, the alarm will be triggered when the metric
breaches the alarm threshold value. The alarm will trigger the SNS topic and the Lambda
function will be executed. The Lambda function will interrogate the SNS message to extract the
details of the underlying metric and will call the Snapshot Graphs API to create the
corresponding chart. The Lambda function will also create an email and add the chart as an
attachment before using SES to send it. The operator will receive the email alert and can view
the chart immediately, without signing in to AWS.

Here is what the end-to-end solution looks like:

Create the EC2 instance to monitor

  1. Open the Amazon EC2 console.
  2. From the console dashboard, choose Launch Instance.
  3. On the Choose an Instance Type page, choose any instance type. For size, choose nano.
  4. Choose Review and Launch to let the wizard complete the other configuration settings
    for you.
  5. On the Review Instance Launch page, choose Launch.
  6. When prompted for a key pair, select Choose an existing key pair, or create one.
  7. Make a note of the instance ID that is displayed on the Launch Status page. You need
    this later, when you create your dashboard.

Create the Lambda function

First, create an IAM role for your Lambda function. Your Lambda function needs a role with the
permissions required to execute Lambda and call the CloudWatch API.

Step 1: Create the IAM role

Navigate to the IAM console.

In the navigation pane, choose Roles, and then choose Create role.

Add the following policies to the new role. These policies are more permissive than the
minimum permissions required for this example, so review and adjust according to your
requirements:

  • CloudWatchReadOnlyAccess
  • AmazonSESFullAccess
  • AmazonSNSReadOnlyAccess

Step 2: Create the Lambda function

Next, navigate to the AWS Lambda console and choose Create a function. If you are using the
demo code provided with this post, choose Node.js 6.10 (or later) for the runtime. Attach the
IAM role you created in step 1.

Step 3: Upload the code

Download the code from the SnapshotAlarmDemo GitHub repo. You will need to run npm install locally and then ZIP the project to upload to the Lambda function.

For Handler, enter emailer.myHandler.

Set the function timeout to 30 seconds, and then choose Save. Requests to the Snapshot Graphs
service will take longer based on the number of metrics and time interval requested. To optimize
request time, keep requests to under three hours of the most recent data.
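
For orientation, here is a hedged sketch of the kind of logic the emailer handler implements; it is not the repo's exact code. It assumes the standard CloudWatch alarm SNS notification format and shows only how the metric details are extracted and turned into a widget definition; the real function then calls GetMetricWidgetImage and attaches the resulting PNG to an SES email.

// Hedged sketch of extracting metric details from the SNS alarm notification (Node.js 6.10 compatible)
exports.myHandler = (event, context, callback) => {
  const message = JSON.parse(event.Records[0].Sns.Message);
  const trigger = message.Trigger;

  // Flatten Dimensions ([{ name, value }, ...]) into the flat metrics array format
  const dims = [];
  trigger.Dimensions.forEach(function (d) {
    dims.push(d.name, d.value);
  });

  const widget = {
    metrics: [[trigger.Namespace, trigger.MetricName].concat(dims)],
    start: '-PT3H',  // the three hours leading up to the alert
    end: 'PT0H',
    title: message.AlarmName
  };

  // The real handler passes JSON.stringify(widget) to GetMetricWidgetImage
  // and emails the image through SES before invoking the callback.
  callback(null, widget);
};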

Step 4: Set the environment variables

The email address and region used by the Lambda function are configured in the environment
variables.

EMAIL_TO_ADDRESS is the email address where the alert email will be sent.
EMAIL_FROM_ADDRESS is the sender email address.

Create the SNS topic

Navigate to the Amazon SNS console and choose Create Topic.

Create a subscription on the topic. For Protocol, choose AWS Lambda, and for Endpoint, select your Lambda function.

Configure SES

Navigate to the Amazon SES console. In the left navigation pane, choose Email Addresses.
Choose Verify a New Email Address, and then enter the email address.

SES will send a verification email to the selected address. To verify the account, you must click
the link in the verification email.

Create an alarm in CloudWatch

Navigate to the Amazon CloudWatch console. In the left navigation pane, choose Alarms, and
then choose Create Alarm.

For the Select Metric step, select an instance and metric (CPUUtilization, as shown here). For
the Define Alarm step, enter a unique name and threshold value. Use your SNS topic as the
target action. Change the period to 1 minute. Create the alarm.
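
If you prefer the AWS CLI, a roughly equivalent alarm can be created as follows; the alarm name, instance ID, threshold, and topic ARN are placeholders for your own values:

aws cloudwatch put-metric-alarm \
  --alarm-name HighCPUUtilization-demo \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average \
  --period 60 \
  --evaluation-periods 1 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:SnapshotAlarmTopic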

Testing the workflow

To simulate a problem, let’s manually change the alarm’s threshold to a very low value.

Save the alarm and then wait for it to go into the ALARM state. (This can take a couple of
minutes.) As soon as your alarm is in the ALARM state, you should receive the alert email almost
immediately.

The email contains information about the alarm. The chart of the last three hours of the
associated metric is embedded in the email:

Troubleshooting

If you didn’t receive an email, make sure that the alarm is in the alarm state. If it is, check the
AWS Lambda logs, which you’ll find on the Logs tab of the CloudWatch console.

Conclusion

As you have seen in this demo, you can create CloudWatch alarm workflows that provide more
context than a basic text alert.

In my next post in this series, I will show you other ways to use CloudWatch snapshot graphs to
improve monitoring visibility.

For more information, see the snapshot graphs API documentation.

Taking it further

Most popular ticketing systems allow you to send emails that autogenerate tickets. You can use
the code provided in this post to email your ticketing system to create a ticket for the alarm and
automatically attach the snapshot graph in the body of the ticket. Try it!

You can also use the code provided in this post as a starting point to programmatically email an
entire CloudWatch dashboard rather than an individual chart. Simply retrieve the dashboard
definition using the GetDashboard API and make multiple calls to GetMetricWidgetImage.

I look forward to seeing what you build!

Monitoring your Amazon SNS message filtering activity with Amazon CloudWatch

Post Syndicated from Rachel Richardson original https://aws.amazon.com/blogs/compute/monitoring-your-amazon-sns-message-filtering-activity-with-amazon-cloudwatch/

This post is courtesy of Otavio Ferreira, Manager, Amazon SNS, AWS Messaging.

Amazon SNS message filtering provides a set of string and numeric matching operators that allow each subscription to receive only the messages of interest. Hence, SNS message filtering can simplify your pub/sub messaging architecture by offloading the message filtering logic from your subscriber systems, as well as the message routing logic from your publisher systems.

After you set the subscription attribute that defines a filter policy, the subscribing endpoint receives only the messages that carry attributes matching this filter policy. Other messages published to the topic are filtered out for this subscription. In this way, the native integration between SNS and Amazon CloudWatch provides visibility into the number of messages delivered, as well as the number of messages filtered out.
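
A filter policy is applied by setting the FilterPolicy attribute on the subscription. The following AWS CLI call is an illustration only; the subscription ARN, attribute name, and values are placeholders:

aws sns set-subscription-attributes \
  --subscription-arn arn:aws:sns:us-east-1:123456789012:MyTopic:1a2b3c4d-example \
  --attribute-name FilterPolicy \
  --attribute-value '{"event_type": ["order_placed"]}'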

CloudWatch metrics are captured automatically for you. To get started with SNS message filtering, see Filtering Messages with Amazon SNS.

Message Filtering Metrics

The following six CloudWatch metrics are relevant to understanding your SNS message filtering activity:

  • NumberOfMessagesPublished – Inbound traffic to SNS. This metric tracks all the messages that have been published to the topic.
  • NumberOfNotificationsDelivered – Outbound traffic from SNS. This metric tracks all the messages that have been successfully delivered to endpoints subscribed to the topic. A delivery takes place either when the incoming message attributes match a subscription filter policy, or when the subscription has no filter policy at all, which results in a catch-all behavior.
  • NumberOfNotificationsFilteredOut – This metric tracks all the messages that were filtered out because they carried attributes that didn’t match the subscription filter policy.
  • NumberOfNotificationsFilteredOut-NoMessageAttributes – This metric tracks all the messages that were filtered out because they didn’t carry any attributes at all and, consequently, didn’t match the subscription filter policy.
  • NumberOfNotificationsFilteredOut-InvalidAttributes – This metric keeps track of messages that were filtered out because they carried invalid or malformed attributes and, thus, didn’t match the subscription filter policy.
  • NumberOfNotificationsFailed – This last metric tracks all the messages that failed to be delivered to subscribing endpoints, regardless of whether a filter policy had been set for the endpoint. This metric is emitted after the message delivery retry policy is exhausted, and SNS stops attempting to deliver the message. At that moment, the subscribing endpoint is likely no longer reachable. For example, the subscribing SQS queue or Lambda function has been deleted by its owner. You may want to closely monitor this metric to address message delivery issues quickly.

Message filtering graphs

Through the AWS Management Console, you can compose graphs to display your SNS message filtering activity. The graph shows the number of messages published, delivered, and filtered out within the timeframe you specify (1h, 3h, 12h, 1d, 3d, 1w, or custom).

SNS message filtering for CloudWatch Metrics

To compose an SNS message filtering graph with CloudWatch:

  1. Open the CloudWatch console.
  2. Choose Metrics, SNS, All Metrics, and Topic Metrics.
  3. Select all metrics to add to the graph, such as:
    • NumberOfMessagesPublished
    • NumberOfNotificationsDelivered
    • NumberOfNotificationsFilteredOut
  4. Choose Graphed metrics.
  5. In the Statistic column, switch from Average to Sum.
  6. Title your graph with a descriptive name, such as “SNS Message Filtering”

After you have your graph set up, you may want to copy the graph link for bookmarking, emailing, or sharing with co-workers. You may also want to add your graph to a CloudWatch dashboard for easy access in the future. Both actions are available to you on the Actions menu, which is found above the graph.

Summary

SNS message filtering defines how SNS topics behave in terms of message delivery. By using CloudWatch metrics, you gain visibility into the number of messages published, delivered, and filtered out. This enables you to validate the operation of filter policies and more easily troubleshoot during development phases.

SNS message filtering can be implemented easily with existing AWS SDKs by applying message and subscription attributes across all SNS supported protocols (Amazon SQS, AWS Lambda, HTTP, SMS, email, and mobile push). CloudWatch metrics for SNS message filtering are available now in all AWS Regions.

For information about pricing, see the CloudWatch pricing page.

For more information, see:

Protecting your API using Amazon API Gateway and AWS WAF — Part I

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/protecting-your-api-using-amazon-api-gateway-and-aws-waf-part-i/

This post courtesy of Thiago Morais, AWS Solutions Architect

When you build web applications or expose any data externally, you probably look for a platform where you can build highly scalable, secure, and robust REST APIs. As APIs are publicly exposed, there are a number of best practices for providing a secure mechanism to consumers using your API.

Amazon API Gateway handles all the tasks involved in accepting and processing up to hundreds of thousands of concurrent API calls, including traffic management, authorization and access control, monitoring, and API version management.

In this post, I show you how to take advantage of the regional API endpoint feature in API Gateway, so that you can create your own Amazon CloudFront distribution and secure your API using AWS WAF.

AWS WAF is a web application firewall that helps protect your web applications from common web exploits that could affect application availability, compromise security, or consume excessive resources.

As you make your APIs publicly available, you are exposed to attackers trying to exploit your services in several ways. The AWS security team published a whitepaper solution using AWS WAF, How to Mitigate OWASP’s Top 10 Web Application Vulnerabilities.

Regional API endpoints

Edge-optimized APIs are endpoints that are accessed through a CloudFront distribution created and managed by API Gateway. Before the launch of regional API endpoints, this was the default option when creating APIs using API Gateway. It primarily helped to reduce latency for API consumers that were located in different geographical locations than your API.

When API requests predominantly originate from an Amazon EC2 instance or other services within the same AWS Region as the API is deployed, a regional API endpoint typically lowers the latency of connections. It is recommended for such scenarios.

For better control around caching strategies, customers can use their own CloudFront distribution for regional APIs. They also have the ability to use AWS WAF protection, as I describe in this post.

Edge-optimized API endpoint

The following diagram is an illustrated example of the edge-optimized API endpoint where your API clients access your API through a CloudFront distribution created and managed by API Gateway.

Regional API endpoint

For the regional API endpoint, your customers access your API from the same Region in which your REST API is deployed. This helps you to reduce request latency and particularly allows you to add your own content delivery network, as needed.

Walkthrough

In this section, you implement the following steps:

  • Create a regional API using the PetStore sample API.
  • Create a CloudFront distribution for the API.
  • Test the CloudFront distribution.
  • Set up AWS WAF and create a web ACL.
  • Attach the web ACL to the CloudFront distribution.
  • Test AWS WAF protection.

Create the regional API

For this walkthrough, use an existing PetStore API. All new APIs launch by default as the regional endpoint type. To change the endpoint type for your existing API, choose the cog icon on the top right corner:

After you have created the PetStore API on your account, deploy a stage called “prod” for the PetStore API.

On the API Gateway console, select the PetStore API and choose Actions, Deploy API.

For Stage name, type prod and add a stage description.

Choose Deploy and the new API stage is created.

Use the following AWS CLI command to update your API from edge-optimized to regional:

aws apigateway update-rest-api \
--rest-api-id {rest-api-id} \
--patch-operations op=replace,path=/endpointConfiguration/types/EDGE,value=REGIONAL

A successful response looks like the following:

{
    "description": "Your first API with Amazon API Gateway. This is a sample API that integrates via HTTP with your demo Pet Store endpoints", 
    "createdDate": 1511525626, 
    "endpointConfiguration": {
        "types": [
            "REGIONAL"
        ]
    }, 
    "id": "{api-id}", 
    "name": "PetStore"
}

After you change your API endpoint to regional, you can now assign your own CloudFront distribution to this API.

Create a CloudFront distribution

To make things easier, I have provided an AWS CloudFormation template to deploy a CloudFront distribution pointing to the API that you just created. Click the button to deploy the template in the us-east-1 Region.

For Stack name, enter RegionalAPI. For APIGWEndpoint, enter your API FQDN in the following format:

{api-id}.execute-api.us-east-1.amazonaws.com

After you fill out the parameters, choose Next to continue the stack deployment. It takes a couple of minutes to finish the deployment. After it finishes, the Output tab lists the following items:

  • A CloudFront domain URL
  • An S3 bucket for CloudFront access logs

Output from CloudFormation

Test the CloudFront distribution

To see if the CloudFront distribution was configured correctly, use a web browser and enter the URL from your distribution, with the following parameters:

https://{your-distribution-url}.cloudfront.net/{api-stage}/pets

You should get the following output:

[
  {
    "id": 1,
    "type": "dog",
    "price": 249.99
  },
  {
    "id": 2,
    "type": "cat",
    "price": 124.99
  },
  {
    "id": 3,
    "type": "fish",
    "price": 0.99
  }
]

Set up AWS WAF and create a web ACL

With the new CloudFront distribution in place, you can now start setting up AWS WAF to protect your API.

For this demo, you deploy the AWS WAF Security Automations solution, which provides fine-grained control over the requests attempting to access your API.

For more information about deployment, see Automated Deployment. If you prefer, you can launch the solution directly into your account using the following button.

For CloudFront Access Log Bucket Name, add the name of the bucket created during the deployment of the CloudFormation stack for your CloudFront distribution.

The solution allows you to adjust thresholds and also choose which automations to enable to protect your API. After you finish configuring these settings, choose Next.

To start the deployment process in your account, follow the creation wizard and choose Create. It takes a few minutes to finish the deployment. You can follow the creation process through the CloudFormation console.

After the deployment finishes, you can see the new web ACL deployed on the AWS WAF console, AWSWAFSecurityAutomations.

Attach the AWS WAF web ACL to the CloudFront distribution

With the solution deployed, you can now attach the AWS WAF web ACL to the CloudFront distribution that you created earlier.

To assign the newly created AWS WAF web ACL, go back to your CloudFront distribution. After you open your distribution for editing, choose General, Edit.

Select the new AWS WAF web ACL that you created earlier, AWSWAFSecurityAutomations.

Save the changes to your CloudFront distribution and wait for the deployment to finish.

Test AWS WAF protection

To validate the AWS WAF Web ACL setup, use Artillery to load test your API and see AWS WAF in action.

To install Artillery on your machine, run the following command:

$ npm install -g artillery

After the installation completes, you can check if Artillery installed successfully by running the following command:

$ artillery -V
1.6.0-12

At the time of publication, Artillery is on version 1.6.0-12.

One of the WAF web ACL rules that you have set up is a rate-based rule. By default, it is set up to block any requesters that exceed 2,000 requests within a 5-minute period. Try this out.
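
The AWS WAF Security Automations template creates this rule for you, but for reference, a comparable rate-based rule can be created directly with the classic AWS WAF CLI; the rule and metric names below are illustrative:

# Classic AWS WAF requires a change token for mutating calls
CHANGE_TOKEN=$(aws waf get-change-token --query ChangeToken --output text)
aws waf create-rate-based-rule \
  --name DemoRateLimitRule \
  --metric-name DemoRateLimitRule \
  --rate-key IP \
  --rate-limit 2000 \
  --change-token "$CHANGE_TOKEN"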

First, use cURL to query your distribution and see the API output:

$ curl -s https://{distribution-name}.cloudfront.net/prod/pets
[
  {
    "id": 1,
    "type": "dog",
    "price": 249.99
  },
  {
    "id": 2,
    "type": "cat",
    "price": 124.99
  },
  {
    "id": 3,
    "type": "fish",
    "price": 0.99
  }
]

Based on the test above, the result looks good. But what if you max out the 2000 requests in under 5 minutes?

Run the following Artillery command:

artillery quick -n 2000 --count 10  https://{distribution-name}.cloudfront.net/prod/pets

What you are doing is firing 2000 requests to your API from 10 concurrent users. For brevity, I am not posting the Artillery output here.

After Artillery finishes its execution, try to run the cURL request again and see what happens:

 

$ curl -s https://{distribution-name}.cloudfront.net/prod/pets

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<TITLE>ERROR: The request could not be satisfied</TITLE>
</HEAD><BODY>
<H1>ERROR</H1>
<H2>The request could not be satisfied.</H2>
<HR noshade size="1px">
Request blocked.
<BR clear="all">
<HR noshade size="1px">
<PRE>
Generated by cloudfront (CloudFront)
Request ID: [removed]
</PRE>
<ADDRESS>
</ADDRESS>
</BODY></HTML>

As you can see from the output above, the request was blocked by AWS WAF. Your IP address is removed from the blocked list after it falls below the request limit rate.

Conclusion

In this first part, you saw how to use the new API Gateway regional API endpoint together with Amazon CloudFront and AWS WAF to secure your API from a series of attacks.

In the second part, I will demonstrate some other techniques to protect your API using API keys and Amazon CloudFront custom headers.

[$] An introduction to MQTT

Post Syndicated from corbet original https://lwn.net/Articles/753705/rss

I was sure that somewhere there must be physically-lightweight sensors with simple power, simple networking, and a lightweight protocol that allowed them to squirt their data down the network with a minimum of overhead. So my interest was piqued when Jan-Piet Mens spoke at FLOSS UK’s Spring Conference on “Small Things for Monitoring”. Once he started passing working demonstration systems around the room without interrupting the demonstration, it was clear that MQTT was what I’d been looking for.

Hard Drive Stats for Q1 2018

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/hard-drive-stats-for-q1-2018/

Backblaze Drive Stats Q1 2018

As of March 31, 2018 we had 100,110 spinning hard drives. Of that number, there were 1,922 boot drives and 98,188 data drives. This review looks at the quarterly and lifetime statistics for the data drive models in operation in our data centers. We’ll also take a look at why we are collecting and reporting 10 new SMART attributes and take a sneak peek at some 8 TB Toshiba drives. Along the way, we’ll share observations and insights on the data presented and we look forward to you doing the same in the comments.

Background

Since April 2013, Backblaze has recorded and saved daily hard drive statistics from the drives in our data centers. Each entry consists of the date, manufacturer, model, serial number, status (operational or failed), and all of the SMART attributes reported by that drive. Currently there are about 97 million entries totaling 26 GB of data. You can download this data from our website if you want to do your own research, but for starters here’s what we found.

Hard Drive Reliability Statistics for Q1 2018

At the end of Q1 2018 Backblaze was monitoring 98,188 hard drives used to store data. For our evaluation below we remove from consideration those drives which were used for testing purposes and those drive models for which we did not have at least 45 drives. This leaves us with 98,046 hard drives. The table below covers just Q1 2018.

Q1 2018 Hard Drive Failure Rates

Notes and Observations

If a drive model has a failure rate of 0%, it only means there were no drive failures of that model during Q1 2018.

The overall Annualized Failure Rate (AFR) for Q1 is just 1.2%, well below the Q4 2017 AFR of 1.65%. Remember that quarterly failure rates can be volatile, especially for models that have a small number of drives and/or a small number of Drive Days.
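
As a quick refresher on the math, an annualized failure rate is derived from drive days and failures, roughly AFR = failures ÷ (drive days ÷ 365) × 100. With made-up numbers, a model with 25 failures over 7,500,000 drive days works out to 25 ÷ (7,500,000 ÷ 365) × 100, or about 0.12%.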

There were 142 drives (98,188 minus 98,046) that were not included in the list above because we did not have at least 45 of a given drive model. We use 45 drives of the same model as the minimum number when we report quarterly, yearly, and lifetime drive statistics.

Welcome Toshiba 8TB drives, almost…

We mentioned Toshiba 8 TB drives in the first paragraph, but they don’t show up in the Q1 Stats chart. What gives? We only had 20 of the Toshiba 8 TB drives in operation in Q1, so they were excluded from the chart. Why do we have only 20 drives? When we test out a new drive model we start with the “tome test” and it takes 20 drives to fill one tome. A tome is the same drive model in the same logical position in each of the 20 Storage Pods that make up a Backblaze Vault. There are 60 tomes in each vault.

In this test, we created a Backblaze Vault of 8 TB drives, with 59 of the tomes being Seagate 8 TB drives and 1 tome being the Toshiba drives. Then we monitored the performance of the vault and its member tomes to see if, in this case, the Toshiba drives performed as expected.

Q1 2018 Hard Drive Failure Rate — Toshiba 8TB

So far the Toshiba drive is performing fine, but they have been in place for only 20 days. Next up is the “pod test” where we fill a Storage Pod with Toshiba drives and integrate it into a Backblaze Vault comprised of like-sized drives. We hope to have a better look at the Toshiba 8 TB drives in our Q2 report — stay tuned.

Lifetime Hard Drive Reliability Statistics

While the quarterly chart presented earlier gets a lot of interest, the real test of any drive model is over time. Below is the lifetime failure rate chart for all the hard drive models which have 45 or more drives in operation as of March 31st, 2018. For each model, we compute their reliability starting from when they were first installed.

Lifetime Hard Drive Failure Rates

Notes and Observations

The failure rates of all of the larger drives (8-, 10- and 12 TB) are very good, 1.2% AFR (Annualized Failure Rate) or less. Many of these drives were deployed in the last year, so there is some volatility in the data, but you can use the Confidence Interval to get a sense of the failure percentage range.

The overall failure rate of 1.84% is the lowest we have ever achieved, besting the previous low of 2.00% from the end of 2017.

Our regular readers and drive stats wonks may have noticed a sizable jump in the number of HGST 8 TB drives (model: HUH728080ALE600), from 45 last quarter to 1,045 this quarter. As the 10 TB and 12 TB drives become more available, the price per terabyte of the 8 TB drives has gone down. This presented an opportunity to purchase the HGST drives at a price in line with our budget.

We purchased and placed into service the 45 original HGST 8 TB drives in Q2 of 2015. They were our first Helium-filled drives and our only ones until the 10 TB and 12 TB Seagate drives arrived in Q3 2017. We’ll take a first look into whether or not Helium makes a difference in drive failure rates in an upcoming blog post.

New SMART Attributes

If you have previously worked with the hard drive stats data or plan to, you’ll notice that we added 10 more columns of data starting in 2018. There are 5 new SMART attributes we are tracking each with a raw and normalized value:

  • 177 – Wear Range Delta
  • 179 – Used Reserved Block Count Total
  • 181 – Program Fail Count Total or Non-4K Aligned Access Count
  • 182 – Erase Fail Count
  • 235 – Good Block Count AND System(Free) Block Count

The 5 values are all related to SSD drives.

Yes, SSD drives, but before you jump to any conclusions, we used 10 Samsung 850 EVO SSDs as boot drives for a period of time in Q1. This was an experiment to see if we could reduce boot up time for the Storage Pods. In our case, the improved boot up speed wasn’t worth the SSD cost, but it did add 10 new columns to the hard drive stats data.

Speaking of hard drive stats data, the complete data set used to create the information used in this review is available on our Hard Drive Test Data page. You can download and use this data for free for your own purpose, all we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone. It is free.

If you just want the summarized data used to create the tables and charts in this blog post, you can download the ZIP file containing the MS Excel spreadsheet.

Good luck and let us know if you find anything interesting.

[Ed: 5/1/2018 – Updated Lifetime chart to fix error in confidence interval for HGST 4TB drive, model: HDS5C4040ALE630]

The post Hard Drive Stats for Q1 2018 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.