Tag Archives: monitoring

Telltale: Netflix Application Monitoring Simplified

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/telltale-netflix-application-monitoring-simplified-5c08bfa780ba

By Andrei U., Seth Katz, Janak Ramachandran, Jeff Butsch, Peter Lau, Ram Vaithilingam, and Greg Burrell

Our Telltale Vision

An alert fires and you get paged in the middle of the night. A metric crossed a threshold. You’re half awake and wondering, “Is there really a problem or is this just an alert that needs tuning? When was the last time somebody adjusted our alert thresholds? Maybe it’s due to an upstream or downstream service?” This is a critical application so you drag yourself out of bed, open your laptop, and start poring through dashboards for more info. You’re not yet convinced there’s a real problem but you’re also aware that the clock is ticking as you dig through a mountain of data looking for clues.

Healthy Netflix services are essential to member joy. When you sit down to watch “Tiger King” you expect it to just play. Over the years we’ve learned from on-call engineers about the pain points of application monitoring: too many alerts, too many dashboards to scroll through, and too much configuration and maintenance. Our streaming teams need a monitoring system that enables them to quickly diagnose and remediate problems; seconds count! Our Node team needs a system that empowers a small group to operate a large fleet.

So we built Telltale.

The Telltale UI shows health over time and across multiple services.
The Telltale timeline.

Telltale combines a variety of data sources to create a holistic view of an application’s health. Telltale learns what constitutes typical health for an application, no alert tuning required. And because we know what’s healthy, we can let application owners know when their services are trending towards unhealthy.

Metrics are a key part of understanding application health. But sometimes you can have too many metrics, too many graphs, and too many dashboards. Telltale shows only the relevant data from the application plus that of upstream and downstream services. We use colors to indicate severity (users can opt to have Telltale display numbers in addition to colors) so users can tell, at a glance, the state of their application’s health. We also highlight interesting broader events such as regional traffic evacuations and nearby deployments, information that is vital to understanding health holistically. Especially during an incident.

That is our Telltale vision. It exists today and monitors the health of over 100 Netflix production-facing applications.

A call graph with four layers of services.
An application lives in an ecosystem

The Application Health Model

A microservice doesn’t live in isolation. It usually has dependencies, talks to other services, and lives in different AWS regions. The call graph above is a relatively simple one, they can be much deeper with dozens of services involved. An application is part of an ecosystem that can be subtly influenced by property changes or radically altered by region-wide events. The launch of a canary can affect an application. As can an upstream or downstream deployments.

Telltale uses a variety of signals from multiple sources to assemble a constantly evolving model of the application’s health:

Different signals have different levels of importance to an application’s health. For example, a latency increase is less critical than error rate increase and some error codes are less critical than others. A canary launch two layers downstream might not be as significant as a deployment immediately upstream. A regional traffic shift means one region ends up with zero traffic while another region has double. You can imagine the impact that has on metrics. A metric’s meaning determines how we should interpret it.

Telltale takes all those factors into consideration when constructing its view of application health.

The application health model is the heart of Telltale.

Intelligent Monitoring

Every service operator knows the difficulty of alert tuning. Set thresholds too low and you get a deluge of spurious alerts. So you overcompensate and relax the tuning to the point of missing important health warnings. The end result is a lack of trust in alerts. Telltale is built on the premise that you shouldn’t have to constantly tune configuration.

We make setup and configuration easy for application owners by providing curated and managed signal packs. These packs are combined into application profiles to address most common service types. Telltale automatically tracks dependencies between services to build the topology used in the application health model. Signal packs and topology detection keep configuration up-to-date with minimal effort. Those who want a more hands-on approach can still do manual configuration and tuning.

No single algorithm can account for the wide variety of signals we use. So, instead, we employ a mix of algorithms including statistical, rule based, and machine learning. We’ll do a future Netflix Tech Blog article focused on our algorithms. Telltale also has analyzers to detect long-term trends or memory leaks. Intelligent monitoring means results our users can trust. It means a faster time to detection and a faster time to resolution during an incident.

Intelligent Alerting

Intelligent monitoring yields intelligent alerting. Telltale creates an issue when it detects a health problem in your application’s ecosystem. Teams can opt in to alerting via Slack, email, or PagerDuty (all powered by our internal alerting system). If the issue is caused by an upstream or downstream system then Telltale’s context-aware routing alerts that team instead. Intelligent alerting also means a team receives a single notification, alert storms are a thing of the past.

An example of a Telltale alert notification in Slack.
An example of a Telltale notification in Slack.

When a problem strikes, it’s essential to have the right information. Our Slack alerts also start a thread containing only the most relevant context about the incident. This includes the signals that Telltale identified as unhealthy and the reasons why. The right context provides a better understanding of the application’s current state so the on-call engineer can return it to health.

Incidents evolve and have their own lifecycle, so updates are essential. Are things getting better or worse? Are there new signals or events to consider? Telltale updates the Slack thread as the current incident unfolds. The thread is marked Resolved upon return to healthy state so users know, at a glance, which incidents are ongoing and which have been successfully remediated.

But these Slack threads aren’t just for Telltale. Teams use them to share additional data, observations, theories, and discussion about the incident. Incident data and discussion all in one thread makes for shared understanding, faster resolution, and easier post-incident analysis.

We strive to improve the quality of Telltale alerts. One way to do that is to learn from our users. So we provide feedback buttons right in the Slack message. Users can tell us to suppress future occurrences of an alert. Or provide a reason for why an alert isn’t actionable. Intelligent alerting means alerts our users can trust.

An example of the details found in a Telltale notification in Slack. Which metrics and which triggers fired.
An example of the details found in a Telltale notification in Slack.

Why Is My Service Unhealthy?

A wide variety of signals, knowledge of the application’s ecosystem, and correlation of signals across multiple services helps Telltale to detect the possible causes of an application’s degraded health. Causes such as an outlier instance, a canary or deployment by a dependent service, an unhealthy database, or just a spike in traffic. Highlighting possible causes saves valuable time during an incident.

Incident Management

An incident summary gathers relevant information for the team.
An example of a Telltale incident summary.

When Telltale sends an alert it also creates a snapshot that has references to the unhealthy signals. As new information arrives, it’s added to this snapshot. This simplifies the post-incident review process for many teams. When it’s time to review past issues, the Application Incident Summary feature shows all aspects of recent issues in a single place including key metrics like total downtime and MTTR (Mean Time To Resolution). We want to help our teams see larger patterns of incidents so they can improve overall service availability.

The cluster view shows groupings of similar incidents so teams can understand the larger patterns.
The cluster view groups similar incidents.

Deployment Monitoring

Telltale’s application health model and intelligent monitoring have proven so powerful that we’re also using it for safer deployments. We start with Spinnaker, our open source delivery platform. As Spinnaker slowly rolls out a new build we use Telltale to continuously monitor the health of the instances running the new build. Continuous monitoring means a deployment stops and rolls back at the first sign of a problem. It means deployment problems have smaller blast radius and a shorter duration.

Continuous Improvement

Operating microservices in a complex ecosystem is challenging. We’re thrilled that Telltale’s intelligent monitoring and alerting helps our service operators improve availability, reduce toil, and sleep better at night. But we’re not done. We’re constantly exploring new algorithms to improve the accuracy of our alerts. We’ll write more about that in a future Netflix Tech Blog post. We’re also evaluating improvements to our application health model. We believe there’s useful information in service log and trace data. And benefits to employing higher resolution metrics. We’re looking forward to collaborating with our platform team on building out those new features. Getting new applications onto Telltale has been a white-glove treatment which doesn’t scale well, we can definitely improve our self-service UI. And we know there’s better heuristics to help pinpoint what’s affecting your service health.

Telltale is application monitoring simplified.

A healthy Netflix service enables us to entertain the world. Correlating disparate signals to model health in realtime is challenging. Add in thousands of streaming device types, an ever-evolving architecture, and a growing content production ecosystem and the problem becomes fascinating. If you’re passionate about observability then come talk to us.

Telltale: Netflix Application Monitoring Simplified was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Helping sites get back online: the origin monitoring intern project

Post Syndicated from Annika Garbers original https://blog.cloudflare.com/helping-sites-get-back-online-the-origin-monitoring-intern-project/

Helping sites get back online: the origin monitoring intern project

Helping sites get back online: the origin monitoring intern project

The most impactful internship experiences involve building something meaningful from scratch and learning along the way. Those can be tough goals to accomplish during a short summer internship, but our experience with Cloudflare’s 2019 intern program met both of them and more! Over the course of ten weeks, our team of three interns (two engineering, one product management) went from a problem statement to a new feature, which is still working in production for all Cloudflare customers.

The project

Cloudflare sits between customers’ origin servers and end users. This means that all traffic to the origin server runs through Cloudflare, so we know when something goes wrong with a server and sometimes reflect that status back to users. For example, if an origin is refusing connections and there’s no cached version of the site available, Cloudflare will display a 521 error. If customers don’t have monitoring systems configured to detect and notify them when failures like this occur, their websites may go down silently, and they may hear about the issue for the first time from angry users.

Helping sites get back online: the origin monitoring intern project
Helping sites get back online: the origin monitoring intern project
When a customer’s origin server is unreachable, Cloudflare sends a 5xx error back to the visitor.‌‌

This problem became the starting point for our summer internship project: since Cloudflare knows when customers’ origins are down, let’s send them a notification when it happens so they can take action to get their sites back online and reduce the impact to their users! This work became Cloudflare’s passive origin monitoring feature, which is currently available on all Cloudflare plans.

Over the course of our internship, we ran into lots of interesting technical and product problems, like:

Making big data small

Working with data from all requests going through Cloudflare’s 26 million+ Internet properties to look for unreachable origins is unrealistic from a data volume and performance perspective. Figuring out what datasets were available to analyze for the errors we were looking for, and how to adapt our whiteboarded algorithm ideas to use this data, was a challenge in itself.

Ensuring high alert quality

Because only a fraction of requests show up in the sampled timing and error dataset we chose to use, false positives/negatives were disproportionately likely to occur for low-traffic sites. These are the sites that are least likely to have sophisticated monitoring systems in place (and therefore are most in need of this feature!). In order to make the notifications as accurate and actionable as possible, we analyzed patterns of failed requests throughout different types of Cloudflare Internet properties. We used this data to determine thresholds that would maximize the number of true positive notifications, while making sure they weren’t so sensitive that we end up spamming customers with emails about sporadic failures.

Designing actionable notifications

Cloudflare has lots of different kinds of customers, from people running personal blogs with interest in DDoS mitigation to large enterprise companies with extremely sophisticated monitoring systems and global teams dedicated to incident response. We wanted to make sure that our notifications were understandable and actionable for people with varying technical backgrounds, so we enabled the feature for small samples of customers and tested many variations of the “origin monitoring email”. Customers responded right back to our notification emails, sent in support questions, and posted on our community forums. These were all great sources of feedback that helped us improve the message’s clarity and actionability.

We frontloaded our internship with lots of research (both digging into request data to understand patterns in origin unreachability problems and talking to customers/poring over support tickets about origin unreachability) and then spent the next few weeks iterating. We enabled passive origin monitoring for all customers with some time remaining before the end of our internships, so we could spend time improving the supportability of our product, documenting our design decisions, and working with the team that would be taking ownership of the project.

We were also able to develop some smaller internal capabilities that built on the work we’d done for the customer-facing feature, like notifications on origin outage events for larger sites to help our account teams provide proactive support to customers. It was super rewarding to see our work in production, helping Cloudflare users get their sites back online faster after receiving origin monitoring notifications.

Our internship experience

The Cloudflare internship program was a whirlwind ten weeks, with each day presenting new challenges and learnings! Some factors that led to our productive and memorable summer included:

A well-scoped project

It can be tough to find a project that’s meaningful enough to make an impact but still doable within the short time period available for summer internships. We’re grateful to our managers and mentors for identifying an interesting problem that was the perfect size for us to work on, and for keeping us on the rails if the technical or product scope started to creep beyond what would be realistic for the time we had left.

Working as a team of interns

The immediate team working on the origin monitoring project consisted of three interns: Annika in product management and Ilya and Zhengyao in engineering. Having a dedicated team with similar goals and perspectives on the project helped us stay focused and work together naturally.

Quick, agile cycles

Since our project faced strict time constraints and our team was distributed across two offices (Champaign and San Francisco), it was critical for us to communicate frequently and work in short, iterative sprints. Daily standups, weekly planning meetings, and frequent feedback from customers and internal stakeholders helped us stay on track.

Great mentorship & lots of freedom

Our managers challenged us, but also gave us room to explore our ideas and develop our own work process. Their trust encouraged us to set ambitious goals for ourselves and enabled us to accomplish way more than we may have under strict process requirements.

After the internship

In the last week of our internships, the engineering interns, who were based in the Champaign, IL office, visited the San Francisco office to meet with the team that would be taking over the project when we left and present our work to the company at our all hands meeting. The most exciting aspect of the visit: our presentation was preempted by Cloudflare’s co-founders announcing public S-1 filing at the all hands! 🙂

Helping sites get back online: the origin monitoring intern project
Helping sites get back online: the origin monitoring intern project

Over the next few months, Cloudflare added a notifications page for easy configurability and announced the availability of passive origin monitoring along with some other tools to help customers monitor their servers and avoid downtime.

Ilya is working for Cloudflare part-time during the school semester and heading back for another internship this summer, and Annika is joining the team full-time after graduation this May. We’re excited to keep working on tools that help make the Internet a better place!

Also, Cloudflare is doubling the size of the 2020 intern class—if you or someone you know is interested in an internship like this one, check out the open positions in software engineering, security engineering, and product management.

Monitoring and management with Amazon QuickSight and Athena in your CI/CD pipeline

Post Syndicated from Umair Nawaz original https://aws.amazon.com/blogs/devops/monitoring-and-management-with-amazon-quicksight-and-athena-in-your-ci-cd-pipeline/

One of the many ways to monitor and manage required CI/CD metrics is to use Amazon QuickSight to build customized visualizations. Additionally, by applying Lean management to software delivery processes, organizations can improve delivery of features faster, pivot when needed, respond to compliance and security changes, and take advantage of instant feedback to improve the customer delivery experience. This blog post demonstrates how AWS resources and tools can provide monitoring and information pertaining to their CI/CD pipelines.

There are three principles in Lean management that this artifact enables and to which it contributes:

  • Limiting work in progress by establishing constraints that drive process improvement and increase throughput.
  • Creating and maintaining dashboards displaying key quality information, productivity metrics, and current status of work (including defects).
  • Using data from development performance and operations monitoring tools to enable business decisions more frequently.


The following architectural diagram shows how to use AWS services to collect metrics from a CI/CD pipeline and deliver insights through Amazon QuickSight dashboards.

Architecture diagram showing an overview of how CI/CD metrics are extracted and transformed to create a dynamic QuickSight dashboard

In this example, the orchestrator for the CI/CD pipeline is AWS CodePipeline with the entry point as an AWS CodeCommit Git repository for source control. When a developer pushes a code change into the CodeCommit repository, the change goes through a series of phases in CodePipeline. AWS CodeBuild is responsible for performing build actions and, upon successful completion of this phase, AWS CodeDeploy kicks off the actions to execute the deployment.

For each action in CodePipeline, the following series of events occurs:

  • An Amazon CloudWatch rule creates a CloudWatch event containing the action’s metadata.
  • The CloudWatch event triggers an AWS Lambda function.
  • The Lambda function extracts relevant reporting data and writes it to a CSV file in an Amazon S3 bucket.
  • Amazon Athena queries the Amazon S3 bucket and loads the query results into SPICE (an in-memory engine for Amazon QuickSight).
  • Amazon QuickSight obtains data from SPICE to build dashboard displays for the management team.

Note: This solution is for an AWS account with an existing CodePipeline(s). If you do not have a CodePipeline, no metrics will be collected.

Getting started

To get started, follow these steps:

  • Create a Lambda function and copy the following code snippet. Be sure to replace the bucket name with the one used to store your event data. This Lambda function takes the payload from a CloudWatch event and extracts the field’s pipeline, time, state, execution, stage, and action to transform into a CSV file.

Note: Athena’s performance can be improved by compressing, partitioning, or converting data into columnar formats such as Apache Parquet. In this use-case, the dataset size is negligible therefore, a transformation from CSV to Parquet is not required.

import boto3
import csv
import datetime
import os

 # Analyze payload from CloudWatch Event
 def pipeline_execution(data):
     print (data)
     # Specify data fields to deliver to S3
     if "stage" in data['detail'].keys():
     if "action" in data['detail'].keys():
     values = '\n'.join(str(v) for v in row)
     return values

 # Upload CSV file to S3 bucket
 def upload_data_to_s3(data):
     runDate = datetime.datetime.now().strftime("%Y-%m-%d_%H:%M:%S:%f")
     response = s3.put_object(

 def lambda_handler(event, context):
  • Create an Athena table to query the data stored in the Amazon S3 bucket. Execute the following SQL in the Athena query console and provide the bucket name that will hold the data.
   `pipeline` string, 
   `time` string, 
   `state` string, 
   `execution` string, 
   `stage` string, 
   `action` string)
  • Create a CloudWatch event rule that passes events to the Lambda function created in Step 1. In the event rule configuration, set the Service Name as CodePipeline and, for Event Type, select All Events.

Sample Dataset view from Athena.

Sample Athena query and the results

Amazon QuickSight visuals

After the initial setup is done, you are ready to create your QuickSight dashboard. Be sure to check that the Athena permissions are properly set before creating an analysis to be published as an Amazon QuickSight dashboard.

Below are diagrams and figures from Amazon QuickSight that can be generated using the event data queried from Athena. In this example, you can see how many executions happened in the account and how many were successful.

The following screenshot shows that most pipeline executions are failing. A manager might be concerned that this points to a significant issue and prompt an investigation in which they can allocate resources to improve delivery and efficiency.

QuickSight Dashboard showing total execution successes and failures

The visual for this solution is dynamic in nature. In case the pipeline has more or fewer actions, the visual will adjust automatically to reflect all actions. After looking at the success and failure rates for each CodePipeline action in Amazon QuickSight, as shown in the following screenshot, users can take targeted actions quickly. For example, if the team sees a lot of failures due to vulnerability scanning, they can work on improving that problem area to drive value for future code releases.

QuickSight Dashboard showing the successes and failures of pipeline actions

Day-over-day visuals reflect date-specific activity and enable teams to see their progress over a period of time.

QuickSight Dashboard showing day over day results of successful CI/CD executions and failures

Amazon QuickSight offers controls that can be configured to apply filters to visuals. For example, the following screenshot demonstrates how users can toggle between visuals for different applications.

QuickSight's control function to switch between different visualization options

Cleanup (optional)

In order to avoid unintended charges, delete the following resources:

  • Amazon CloudWatch event rule
  • Lambda function
  • Amazon S3 Bucket (the location in which CSV files generated by the Lambda function are stored)
  • Athena external table
  • Amazon QuickSight data sets
  • Analysis and dashboard


In this blog, we showed how metrics can be derived from a CI/CD pipeline. Utilizing Amazon QuickSight to create visuals from these metrics allows teams to continuously deliver updates on the deployment process to management. The aggregation of the captured data over time allows individual developers and teams to improve their processes. That is the goal of creating a Lean DevOps process: to oversee the meta-delivery pipeline and optimize all future releases by identifying weak spots and points of risk during the entire release process.


About the Authors

Umair Nawaz is a DevOps Engineer at Amazon Web Services in New York City. He works on building secure architectures and advises enterprises on agile software delivery. He is motivated to solve problems strategically by utilizing modern technologies.
Christopher Flores is an Engagement Manager at Amazon Web Services in New York City. He leads AWS developers, partners, and client teams in using the customer engagement accelerator framework. Christopher expedites stakeholder alignment, enterprise cohesion and risk mitigation while ensuring feedback loops to close the engagement lifecycle.
Carol Liao is a Cloud Infrastructure Architect at Amazon Web Services in New York City. She enjoys designing and developing modern IT solutions in the cloud where there is always more to learn, more problems to solve, and more to build.


DevOps at re:Invent 2019!

Post Syndicated from Matt Dwyer original https://aws.amazon.com/blogs/devops/devops-at-reinvent-2019/

re:Invent 2019 is fast approaching (NEXT WEEK!) and we here at the AWS DevOps blog wanted to take a moment to highlight DevOps focused presentations, share some tips from experienced re:Invent pro’s, and highlight a few sessions that still have availability for pre-registration. We’ve broken down the track into one overarching leadership session and four topic areas: (a) architecture, (b) culture, (c) software delivery/operations, and (d) AWS tools, services, and CLI.

In total there will be 145 DevOps track sessions, stretched over 5 days, and divided into four distinct session types:

  • Sessions (34) are one-hour presentations delivered by AWS experts and customer speakers who share their expertise / use cases
  • Workshops (20) are two-hours and fifteen minutes, hands-on sessions where you work in teams to solve problems using AWS services
  • Chalk Talks (41) are interactive white-boarding sessions with a smaller audience. They typically begin with a 10–15-minute presentation delivered by an AWS expert, followed by 45–50-minutes of Q&A
  • Builders Sessions (50) are one-hour, small group sessions with six customers and one AWS expert, who is there to help, answer questions, and provide guidance
  • Select DevOps focused sessions have been highlighted below. If you want to view and/or register for any session, including Keynotes, builders’ fairs, and demo theater sessions, you can access the event catalog using your re:Invent registration credentials.

Reserve your seat for AWS re:Invent activities today >>

re:Invent TIP #1: Identify topics you are interested in before attending re:Invent and reserve a seat. We hold space in sessions, workshops, and chalk talks for walk-ups, however, if you want to get into a popular session be prepared to wait in line!

Please see below for select sessions, workshops, and chalk talks that will be conducted during re:Invent.


[Session] Leadership Session: Developer Tools on AWS (DOP210-L) — SPACE AVAILABLE! REGISTER TODAY!

Speaker 1: Ken Exner – Director, AWS Dev Tools, Amazon Web Services
Speaker 2: Kyle Thomson – SDE3, Amazon Web Services

Join Ken Exner, GM of AWS Developer Tools, as he shares the state of developer tooling on AWS, as well as the future of development on AWS. Ken uses insight from his position managing Amazon’s internal tooling to discuss Amazon’s practices and patterns for releasing software to the cloud. Additionally, Ken provides insight and updates across many areas of developer tooling, including infrastructure as code, authoring and debugging, automation and release, and observability. Throughout this session Ken will recap recent launches and show demos for some of the latest features.

re:Invent TIP #2: Leadership Sessions are a topic area’s State of the Union, where AWS leadership will share the vision and direction for a given topic at AWS.re:Invent.


[Session] Amazon’s approach to failing successfully (DOP208-RDOP208-R1) — SPACE AVAILABLE! REGISTER TODAY!

Speaker: Becky Weiss – Senior Principal Engineer, Amazon Web Services

Welcome to the real world, where things don’t always go your way. Systems can fail despite being designed to be highly available, scalable, and resilient. These failures, if used correctly, can be a powerful lever for gaining a deep understanding of how a system actually works, as well as a tool for learning how to avoid future failures. In this session, we cover Amazon’s favorite techniques for defining and reviewing metrics—watching the systems before they fail—as well as how to do an effective postmortem that drives both learning and meaningful improvement.

[Session] Improving resiliency with chaos engineering (DOP309-RDOP309-R1) — SPACE AVAILABLE! REGISTER TODAY!

Speaker 1: Olga Hall – Senior Manager, Tech Program Management
Speaker 2: Adrian Hornsby – Principal Evangelist, Amazon Web Services

Failures are inevitable. Regardless of the engineering efforts put into building resilient systems and handling edge cases, sometimes a case beyond our reach turns a benign failure into a catastrophic one. Therefore, we should test and continuously improve our system’s resilience to failures to minimize impact on a user’s experience. Chaos engineering is one of the best ways to achieve that. In this session, you learn how Amazon Prime Video has implemented chaos engineering into its regular testing methods, helping it achieve increased resiliency.

[Session] Amazon’s approach to security during development (DOP310-RDOP310-R1) — SPACE AVAILABLE! REGISTER TODAY!

Speaker: Colm MacCarthaigh – Senior Principal Engineer, Amazon Web Services

At AWS we say that security comes first—and we really mean it. In this session, hear about how AWS teams both minimize security risks in our products and respond to security issues proactively. We talk through how we integrate security reviews, penetration testing, code analysis, and formal verification into the development process. Additionally, we discuss how AWS engineering teams react quickly and decisively to new security risks as they emerge. We also share real-life firefighting examples and the lessons learned in the process.

[Session] Amazon’s approach to building resilient services (DOP342-RDOP342-R1) — SPACE AVAILABLE! REGISTER TODAY!

Speaker: Marc Brooker – Senior Principal Engineer, Amazon Web Services

One of the biggest challenges of building services and systems is predicting the future. Changing load, business requirements, and customer behavior can all change in unexpected ways. In this talk, we look at how AWS builds, monitors, and operates services that handle the unexpected. Learn how to make your own services handle a changing world, from basic design principles to patterns you can apply today.

re:Invent TIP #3: Not sure where to spend your time? Let an AWS Hero give you some pointers. AWS Heroes are prominent AWS advocates who are passionate about sharing AWS knowledge with others. They have written guides to help attendees find relevant activities by providing recommendations based on specific demographics or areas of interest.


[Session] Driving change and building a high-performance DevOps culture (DOP207-R; DOP207-R1)

Speaker: Mark Schwartz – Enterprise Strategist, Amazon Web Services

When it comes to digital transformation, every enterprise is different. There is often a person or group with a vision, knowledge of good practices, a sense of urgency, and the energy to break through impediments. They may be anywhere in the organizational structure: high, low, or—in a typical scenario—somewhere in middle management. Mark Schwartz, an enterprise strategist at AWS and the author of “The Art of Business Value” and “A Seat at the Table: IT Leadership in the Age of Agility,” shares some of his research into building a high-performance culture by driving change from every level of the organization.

[Session] Amazon’s approach to running service-oriented organizations (DOP301-R; DOP301-R1DOP301-R2)

Speaker: Andy Troutman – Director AWS Developer Tools, Amazon Web Services

Amazon’s “two-pizza teams” are famously small teams that support a single service or feature. Each of these teams has the autonomy to build and operate their service in a way that best supports their customers. But how do you coordinate across tens, hundreds, or even thousands of two-pizza teams? In this session, we explain how Amazon coordinates technology development at scale by focusing on strategies that help teams coordinate while maintaining autonomy to drive innovation.

re:Invent TIP #4: The max number of 60-minute sessions you can attend during re:Invent is 24! These sessions (e.g., sessions, chalk talks, builders sessions) will usually make up the bulk of your agenda.


[Session] Strategies for securing code in the cloud and on premises. Speakers: (DOP320-RDOP320-R1) — SPACE AVAILABLE! REGISTER TODAY!

Speaker 1: Craig Smith – Senior Solutions Architect
Speaker 2: Lee Packham – Solutions Architect

Some people prefer to keep their code and tooling on premises, though this can create headaches and slow teams down. Others prefer keeping code off of laptops that can be misplaced. In this session, we walk through the alternatives and recommend best practices for securing your code in cloud and on-premises environments. We demonstrate how to use services such as Amazon WorkSpaces to keep code secure in the cloud. We also show how to connect tools such as Amazon Elastic Container Registry (Amazon ECR) and AWS CodeBuild with your on-premises environments so that your teams can go fast while keeping your data off of the public internet.

[Session] Deploy your code, scale your application, and lower Cloud costs using AWS Elastic Beanstalk (DOP326) — SPACE AVAILABLE! REGISTER TODAY!

Speaker: Prashant Prahlad – Sr. Manager

You can effortlessly convert your code into web applications without having to worry about provisioning and managing AWS infrastructure, applying patches and updates to your platform or using a variety of tools to monitor health of your application. In this session, we show how anyone- not just professional developers – can use AWS Elastic Beanstalk in various scenarios: From an administrator moving a Windows .NET workload into the Cloud, a developer building a containerized enterprise app as a Docker image, to a data scientist being able to deploy a machine learning model, all without the need to understand or manage the infrastructure details.

[Session] Amazon’s approach to high-availability deployment (DOP404-RDOP404-R1) — SPACE AVAILABLE! REGISTER TODAY!

Speaker: Peter Ramensky – Senior Manager

Continuous-delivery failures can lead to reduced service availability and bad customer experiences. To maximize the rate of successful deployments, Amazon’s development teams implement guardrails in the end-to-end release process to minimize deployment errors, with a goal of achieving zero deployment failures. In this session, learn the continuous-delivery practices that we invented that help raise the bar and prevent costly deployment failures.

[Session] Introduction to DevOps on AWS (DOP209-R; DOP209-R1)

Speaker 1: Jonathan Weiss – Senior Manager
Speaker 2: Sebastien Stormacq – Senior Technical Evangelist

How can you accelerate the delivery of new, high-quality services? Are you able to experiment and get feedback quickly from your customers? How do you scale your development team from 1 to 1,000? To answer these questions, it is essential to leverage some key DevOps principles and use CI/CD pipelines so you can iterate on and quickly release features. In this talk, we walk you through the journey of a single developer building a successful product and scaling their team and processes to hundreds or thousands of deployments per day. We also walk you through best practices and using AWS tools to achieve your DevOps goals.

[Workshop] DevOps essentials: Introductory workshop on CI/CD practices (DOP201-R; DOP201-R1; DOP201-R2; DOP201-R3)

Speaker 1: Leo Zhadanovsky – Principal Solutions Architect
Speaker 2: Karthik Thirugnanasambandam – Partner Solutions Architect

In this session, learn how to effectively leverage various AWS services to improve developer productivity and reduce the overall time to market for new product capabilities. We demonstrate a prescriptive approach to incrementally adopt and embrace some of the best practices around continuous integration and delivery using AWS developer tools and third-party solutions, including, AWS CodeCommit, AWS CodeBuild, Jenkins, AWS CodePipeline, AWS CodeDeploy, AWS X-Ray and AWS Cloud9. We also highlight some best practices and productivity tips that can help make your software release process fast, automated, and reliable.

[Workshop] Implementing GitFLow with AWS tools (DOP202-R; DOP202-R1; DOP202-R2)

Speaker 1: Amit Jha – Sr. Solutions Architect
Speaker 2: Ashish Gore – Sr. Technical Account Manager

Utilizing short-lived feature branches is the development method of choice for many teams. In this workshop, you learn how to use AWS tools to automate merge-and-release tasks. We cover high-level frameworks for how to implement GitFlow using AWS CodePipeline, AWS CodeCommit, AWS CodeBuild, and AWS CodeDeploy. You also get an opportunity to walk through a prebuilt example and examine how the framework can be adopted for individual use cases.

[Chalk Talk] Generating dynamic deployment pipelines with AWS CDK (DOP311-R; DOP311-R1; DOP311-R2)

Speaker 1: Flynn Bundy – AppDev Consultant
Speaker 2: Koen van Blijderveen – Senior Security Consultant

In this session we dive deep into dynamically generating deployment pipelines that deploy across multiple AWS accounts and Regions. Using the power of the AWS Cloud Development Kit (AWS CDK), we demonstrate how to simplify and abstract the creation of deployment pipelines to suit a range of scenarios. We highlight how AWS CodePipeline—along with AWS CodeBuild, AWS CodeCommit, and AWS CodeDeploy—can be structured together with the AWS deployment framework to get the most out of your infrastructure and application deployments.

[Chalk Talk] Customize AWS CloudFormation with open-source tools (DOP312-R; DOP312-R1; DOP312-E)

Speaker 1: Luis Colon – Senior Developer Advocate
Speaker 2: Ryan Lohan – Senior Software Engineer

In this session, we showcase some of the best open-source tools available for AWS CloudFormation customers, including conversion and validation utilities. Get a glimpse of the many open-source projects that you can use as you create and maintain your AWS CloudFormation stacks.

[Chalk Talk] Optimizing Java applications for scale on AWS (DOP314-R; DOP314-R1; DOP314-R2)

Speaker 1: Sam Fink – SDE II
Speaker 2: Kyle Thomson – SDE3

Executing at scale in the cloud can require more than the conventional best practices. During this talk, we offer a number of different Java-related tools you can add to your AWS tool belt to help you more efficiently develop Java applications on AWS—as well as strategies for optimizing those applications. We adapt the talk on the fly to cover the topics that interest the group most, including more easily accessing Amazon DynamoDB, handling high-throughput uploads to and downloads from Amazon Simple Storage Service (Amazon S3), troubleshooting Amazon ECS services, working with local AWS Lambda invocations, optimizing the Java SDK, and more.

[Chalk Talk] Securing your CI/CD tools and environments (DOP316-R; DOP316-R1; DOP316-R2)

Speaker: Leo Zhadanovsky – Principal Solutions Architect

In this session, we discuss how to configure security for AWS CodePipeline, deployments in AWS CodeDeploy, builds in AWS CodeBuild, and git access with AWS CodeCommit. We discuss AWS Identity and Access Management (IAM) best practices, to allow you to set up least-privilege access to these services. We also demonstrate how to ensure that your pipelines meet your security and compliance standards with the CodePipeline AWS Config integration, as well as manual approvals. Lastly, we show you best-practice patterns for integrating security testing of your deployment artifacts inside of your CI/CD pipelines.

[Chalk Talk] Amazon’s approach to automated testing (DOP317-R; DOP317-R1; DOP317-R2)

Speaker 1: Carlos Arguelles – Principal Engineer
Speaker 2: Charlie Roberts – Senior SDET

Join us for a session about how Amazon uses testing strategies to build a culture of quality. Learn Amazon’s best practices around load testing, unit testing, integration testing, and UI testing. We also discuss what parts of testing are automated and how we take advantage of tools, and share how we strategize to fail early to ensure minimum impact to end users.

[Chalk Talk] Building and deploying applications on AWS with Python (DOP319-R; DOP319-R1; DOP319-R2)

Speaker 1: James Saryerwinnie – Senior Software Engineer
Speaker 2: Kyle Knapp – Software Development Engineer

In this session, hear from core developers of the AWS SDK for Python (Boto3) as we walk through the design of sample Python applications. We cover best practices in using Boto3 and look at other libraries to help build these applications, including AWS Chalice, a serverless microframework for Python. Additionally, we discuss testing and deployment strategies to manage the lifecycle of your applications.

[Chalk Talk] Deploying AWS CloudFormation StackSets across accounts and Regions (DOP325-R; DOP325-R1)

Speaker 1: Mahesh Gundelly – Software Development Manager
Speaker 2: Prabhu Nakkeeran – Software Development Manager

AWS CloudFormation StackSets can be a critical tool to efficiently manage deployments of resources across multiple accounts and regions. In this session, we cover how AWS CloudFormation StackSets can help you ensure that all of your accounts have the proper resources in place to meet security, governance, and regulation requirements. We also cover how to make the most of the latest functionalities and discuss best practices, including how to plan for safe deployments with minimal blast radius for critical changes.

[Chalk Talk] Monitoring and observability of serverless apps using AWS X-Ray (DOP327-R; DOP327-R1; DOP327-R2)

Speaker 1 (R, R1, R2): Shengxin Li – Software Development Engineer
Speaker 2 (R, R1): Sirirat Kongdee – Solutions Architect
Speaker 3 (R2): Eric Scholz – Solutions Architect, Amazon

Monitoring and observability are essential parts of DevOps best practices. You need monitoring to debug and trace unhandled errors, performance bottlenecks, and customer impact in the distributed nature of a microservices architecture. In this chalk talk, we show you how to integrate the AWS X-Ray SDK to your code to provide observability to your overall application and drill down to each service component. We discuss how X-Ray can be used to analyze, identify, and alert on performance issues and errors and how it can help you troubleshoot application issues faster.

[Chalk Talk] Optimizing deployment strategies for speed & safety (DOP341-R; DOP341-R1; DOP341-R2)

Speaker: Karan Mahant – Software Development Manager, Amazon

Modern application development moves fast and demands continuous delivery. However, the greatest risk to an application’s availability can occur during deployments. Join us in this chalk talk to learn about deployment strategies for web servers and for Amazon EC2, container-based, and serverless architectures. Learn how you can optimize your deployments to increase productivity during development cycles and mitigate common risks when deploying to production by using canary and blue/green deployment strategies. Further, we share our learnings from operating production services at AWS.

[Chalk Talk] Continuous integration using AWS tools (DOP216-R; DOP216-R1; DOP216-R2)

Speaker: Richard Boyd – Sr Developer Advocate, Amazon Web Services

Today, more teams are adopting continuous-integration (CI) techniques to enable collaboration, increase agility, and deliver a high-quality product faster. Cloud-based development tools such as AWS CodeCommit and AWS CodeBuild can enable teams to easily adopt CI practices without the need to manage infrastructure. In this session, we showcase best practices for continuous integration and discuss how to effectively use AWS tools for CI.

re:Invent TIP #5: If you’re traveling to another session across campus, give yourself at least 60 minutes!


[Session] Best practices for authoring AWS CloudFormation (DOP302-R; DOP302-R1)

Speaker 1: Olivier Munn – Sr Product Manager Technical, Amazon Web Services
Speaker 2: Dan Blanco – Developer Advocate, Amazon Web Services

Incorporating infrastructure as code into software development practices can help teams and organizations improve automation and throughput without sacrificing quality and uptime. In this session, we cover multiple best practices for writing, testing, and maintaining AWS CloudFormation template code. You learn about IDE plug-ins, reusability, testing tools, modularizing stacks, and more. During the session, we also review sample code that showcases some of the best practices in a way that lends more context and clarity.

[Chalk Talk] Using AWS tools to author and debug applications (DOP215-RDOP215-R1DOP215-R2) — SPACE AVAILABLE! REGISTER TODAY!

Speaker: Fabian Jakobs – Principal Engineer, Amazon Web Services

Every organization wants its developers to be faster and more productive. AWS Cloud9 lets you create isolated cloud-based development environments for each project and access them from a powerful web-based IDE anywhere, anytime. In this session, we demonstrate how to use AWS Cloud9 and provide an overview of IDE toolkits that can be used to author application code.

[Session] Migrating .Net frameworks to the cloud (DOP321) — SPACE AVAILABLE! REGISTER TODAY!

Speaker: Robert Zhu – Principal Technical Evangelist, Amazon Web Services

Learn how to migrate your .NET application to AWS with minimal steps. In this demo-heavy session, we share best practices for migrating a three-tiered application on ASP.NET and SQL Server to AWS. Throughout the process, you get to see how AWS Toolkit for Visual Studio can enable you to fully leverage AWS services such as AWS Elastic Beanstalk, modernizing your application for more agile and flexible development.

[Session] Deep dive into AWS Cloud Development Kit (DOP402-R; DOP402-R1)

Speaker 1: Elad Ben-Israel – Principal Software Engineer, Amazon Web Services
Speaker 2: Jason Fulghum – Software Development Manager, Amazon Web Services

The AWS Cloud Development Kit (AWS CDK) is a multi-language, open-source framework that enables developers to harness the full power of familiar programming languages to define reusable cloud components and provision applications built from those components using AWS CloudFormation. In this session, you develop an AWS CDK application and learn how to quickly assemble AWS infrastructure. We explore the AWS Construct Library and show you how easy it is to configure your cloud resources, manage permissions, connect event sources, and build and publish your own constructs.

[Session] Introduction to the AWS CLI v2 (DOP406-R; DOP406-R1)

Speaker 1: James Saryerwinnie – Senior Software Engineer, Amazon Web Services
Speaker 2: Kyle Knapp – Software Development Engineer, Amazon Web Services

The AWS Command Line Interface (AWS CLI) is a command-line tool for interacting with AWS services and managing your AWS resources. We’ve taken all of the lessons learned from AWS CLI v1 (launched in 2013), and have been working on AWS CLI v2—the next major version of the AWS CLI—for the past year. AWS CLI v2 includes features such as improved installation mechanisms, a better getting-started experience, interactive workflows for resource management, and new high-level commands. Come hear from the core developers of the AWS CLI about how to upgrade and start using AWS CLI v2 today.

[Session] What’s new in AWS CloudFormation (DOP408-R; DOP408-R1; DOP408-R2)

Speaker 1: Jing Ling – Senior Product Manager, Amazon Web Services
Speaker 2: Luis Colon – Senior Developer Advocate, Amazon Web Services

AWS CloudFormation is one of the most widely used AWS tools, enabling infrastructure as code, deployment automation, repeatability, compliance, and standardization. In this session, we cover the latest improvements and best practices for AWS CloudFormation customers in particular, and for seasoned infrastructure engineers in general. We cover new features and improvements that span many use cases, including programmability options, cross-region and cross-account automation, operational safety, and additional integration with many other AWS services.

[Workshop] Get hands-on with Python/boto3 with no or minimal Python experience (DOP203-R; DOP203-R1; DOP203-R2)

Speaker 1: Herbert-John Kelly – Solutions Architect, Amazon Web Services
Speaker 2: Carl Johnson – Enterprise Solutions Architect, Amazon Web Services

Learning a programming language can seem like a huge investment. However, solving strategic business problems using modern technology approaches, like machine learning and big-data analytics, often requires some understanding. In this workshop, you learn the basics of using Python, one of the most popular programming languages that can be used for small tasks like simple operations automation, or large tasks like analyzing billions of records and training machine-learning models. You also learn about and use the AWS SDK (software development kit) for Python, called boto3, to write a Python program running on and interacting with resources in AWS.

[Workshop] Building reusable AWS CloudFormation templates (DOP304-R; DOP304-R1; DOP304-R2)

Speaker 1: Chelsey Salberg – Front End Engineer, Amazon Web Services
Speaker 2: Dan Blanco – Developer Advocate, Amazon Web Services

AWS CloudFormation gives you an easy way to define your infrastructure as code, but are you using it to its full potential? In this workshop, we take real-world architecture from a sandbox template to production-ready reusable code. We start by reviewing an initial template, which you update throughout the session to incorporate AWS CloudFormation features, like nested stacks and intrinsic functions. By the end of the workshop, expect to have a set of AWS CloudFormation templates that demonstrate the same best practices used in AWS Quick Starts.

[Workshop] Building a scalable serverless application with AWS CDK (DOP306-R; DOP306-R1; DOP306-R2; DOP306-R3)

Speaker 1: David Christiansen – Senior Partner Solutions Architect, Amazon Web Services
Speaker 2: Daniele Stroppa – Solutions Architect, Amazon Web Services

Dive into AWS and build a web application with the AWS Mythical Mysfits tutorial. In this workshop, you build a serverless application using AWS Lambda, Amazon API Gateway, and the AWS Cloud Development Kit (AWS CDK). Through the tutorial, you get hands-on experience using AWS CDK to model and provision a serverless distributed application infrastructure, you connect your application to a backend database, and you capture and analyze data on user behavior. Other AWS services that are utilized include Amazon Kinesis Data Firehose and Amazon DynamoDB.

[Chalk Talk] Assembling an AWS CloudFormation authoring tool chain (DOP313-R; DOP313-R1; DOP313-R2)

Speaker 1: Nathan McCourtney – Sr System Development Engineer, Amazon Web Services
Speaker 2: Dan Blanco – Developer Advocate, Amazon Web Services

In this session, we provide a prescriptive tool chain and methodology to improve your coding productivity as you create and maintain AWS CloudFormation stacks. We cover authoring recommendations from editors and plugins, to setting up a deployment pipeline for your AWS CloudFormation code.

[Chalk Talk] Build using JavaScript with AWS Amplify, AWS Lambda, and AWS Fargate (DOP315-R; DOP315-R1; DOP315-R2)

Speaker 1: Trivikram Kamat – Software Development Engineer, Amazon Web Services
Speaker 2: Vinod Dinakaran – Software Development Manager, Amazon Web Services

Learn how to build applications with AWS Amplify on the front end and AWS Fargate and AWS Lambda on the backend, and protocols (like HTTP/2), using the JavaScript SDKs in the browser and node. Leverage the AWS SDK for JavaScript’s modular NPM packages in resource-constrained environments, and benefit from the built-in async features to run your node and mobile applications, and SPAs, at scale.

[Chalk Talk] Scaling CI/CD adoption using AWS CodePipeline and AWS CloudFormation (DOP318-R; DOP318-R1; DOP318-R2)

Speaker 1: Andrew Baird – Principal Solutions Architect, Amazon Web Services
Speaker 2: Neal Gamradt – Applications Architect, WarnerMedia

Enabling CI/CD across your organization through repeatable patterns and infrastructure-as-code templates can unlock development speed while encouraging best practices. The SEAD Architecture team at WarnerMedia helps encourage CI/CD adoption across their company. They do so by creating and maintaining easily extensible infrastructure-as-code patterns for creating new services and deploying to them automatically using CI/CD. In this session, learn about the patterns they have created and the lessons they have learned.

re:Invent TIP #6: There are lots of extra activities at re:Invent. Expect your evenings to fill up onsite! Check out the peculiar programs including, board games, bingo, arts & crafts or ‘80s sing-alongs…

Debugging with Amazon CloudWatch Synthetics and AWS X-Ray

Post Syndicated from Nizar Tyrewalla original https://aws.amazon.com/blogs/devops/debugging-with-amazon-cloudwatch-synthetics-and-aws-x-ray/

Today, AWS X-Ray launches support for Amazon CloudWatch Synthetics, enabling developers to trace end-to-end requests from configurable scripts called “canaries”.  These canaries run the test script to monitor web endpoints and APIs using modular, light-weight tests that run 24×7, once per minute. It continuously captures the behavior and availability of the endpoint or URL being monitored, reporting what end-customers are seeing. Customers get alerted immediately in case of failures and tie them back to the root-cause using traces. These canaries provide a complete picture of all the services invoked in the path. With this feature, developers and DevOps engineers can correlate failures in the application with the impact on end customers, determine the root cause of the failures, and identify the upstream and downstream services impacted.


X-Ray helps developers and DevOps engineers analyze and debug distributed applications, such as applications built with microservice architecture. Using X-Ray, you can understand how your application and its underlying services are performing in order to identify and troubleshoot the root causes of performance issues and errors. X-Ray helps you debug and triage distributed applications, wherever those applications are running, and whether the architecture is serverless, containers, Amazon EC2, on-premises, or a mixture of all of the above.

Amazon CloudWatch Synthetics is a fully managed synthetic monitoring service that allows developers and DevOps engineers to view their application endpoints and URLs using configurable scripts called “canaries” that run 24×7. Canaries alert you as soon as something does not work as expected, as defined by your script. CloudWatch Synthetics canaries can be customized to check for availability, latency, transactions, broken or dead links, step-by-step task completions, page load errors, load latencies for UI assets, complex wizard flows, or checkout flows in your applications.

CloudWatch Synthetics supports End-User Experience Monitoring (EUEM—Synthetic Monitoring), Web Services Monitoring (REST APIs), URL Monitoring, and Website Content Monitoring (protection from unauthorized changes in your websites from phishing, code injection, and cross-site scripting). Canary traffic can continually verify your customers’ experiences even when you don’t have any customer traffic. Canaries can discover issues as soon as your customers do.

Use cases

Here are some of the use cases for which developers and DevOps engineers can use AWS X-Ray with Amazon CloudWatch Synthetics.

Determine if there is a problem

You can determine if there is a problem with your CloudWatch Synthetics canaries by setting CloudWatch alarms based on certain thresholds. You can also determine an increase in errors, faults, throttling rates, or slow response times within your X-Ray Service Map.

Developers can set up thresholds on canaries in Amazon CloudWatch Synthetics console which creates CloudWatch alarms to understand if there were any issues. These thresholds are managed centrally to help you get alerted based on the granularity of the test run. They can then use Amazon Simple Notification Service (SNS) or CloudWatch Event for their alerts. For example, you can set a notification when the canary fails 5 times in 5 runs if the canary runs once per min. You can further analyze and triage HAR File, screenshots and logs in the CloudWatch Synthetics console.

Enable thresholds in CloudWatch Synthetics

Canary thresholds in Amazon CloudWatch

Screenshots in Amazon CloudWatch Synthetics console



HAR file in Amazon CloudWatch Synthetics console

Developers can also use the new synthetic canary client nodes with type Client::Synthetics in AWS X-Ray console to pinpoint which synthetic canaries are experiencing issues with errors, faults, or throttling rates for the selected time frame. Select the synthetic canary node to view the response time distribution of the entire request. Zoom into a specific time range and dive deep into analyzing trends using X-Ray Analytics, as shown in the following screenshot of a service map with a synthetic canary client node.

X-Ray Service Map with Synthetics node showing errors



Find the root cause of ongoing failures

You can find the root cause of ongoing failures in upstream and downstream services to reduce the mean time to resolution.

Developers can view the end-to-end path of requests within the X-Ray service map to easily understand the services invoked, as well as the upstream and downstream services. You can also use trace maps for individual traces to dissect each request and determine which service results in the most latency or is causing an error, as shown in the following diagrams.


X-Ray Service Map with end to end path of the request invoked by Synthetic canaries

X-Ray Trace Map to view individual request invoked by Synthetic canaries

Once you receive a CloudWatch alarm for failures in a synthetic canary, it becomes critical to determine the root cause of the issue and reduce the impact on end users. Developers can use X-Ray Analytics to address this. X-Ray performs statistical modeling on trace data to determine the probable root cause of an issue. As you can see in the following screenshots, the synthetic tests for console “www” that is running on AWS Elastic Beanstalk are failing. It is because of the throughput capacity exception from the DynamoDB table, impacting the product microservice. This can be quickly determined in the X-Ray Analytics console. Also, developers can determine which of their canary tests are causing this issue using the already created X-Ray annotation “Annotation.aws:canary_arn”.


Analyze root cause of the issues in Synthetic canaries using X-Ray Analytics


Root cause of the issue using X-Ray Analytics


Identify performance bottlenecks and trends

You can identify performance bottlenecks and trends seen by your Synthetics canaries’ traces over time.

View trends in performance of your website or endpoint over time using the continuous traffic from your synthetic canaries to populate a response time histogram over a period of time. Understand the p50, p90, and p95 percentile for your requests originating from Synthetics tests and determine proportion of failures and time series activity to determine the time of occurrence.

Response time histogram and time series activity for Synthetic canaries


Compare latency and error/fault rates and experience of end user with Synthetic canaries

You can compare latency and error/fault rates and end-user experiences with the Synthetic canaries.

Using X-Ray Analytics, developers can compare any two trends or collection of traces. For example, in the below diagram, we are comparing the experience of end users using the system (shown in blue trend line) to what the Synthetics tests are reporting (shown in green trend line), which started around 3:00 am. We can easily identify that Synthetics tests are reporting more 5xx errors as compared to end users.

Compare end-user experience with Synthetics canaries using X-Ray Analytics


Identify if you have enough canary coverage for all the API’s and URLs.

When debugging synthetic canaries, developers and DevOps engineers are looking to understand if they have enough coverage for their APIs and URLs and if their Synthetic canaries have adequate tests. Using X-Ray Analytics, developers will be able to compare the experience of Synthetic canaries with the rest of their end users. The screenshot below shows blue trend line for canaries and green for the rest of the users. You can also identify that two out of the three URLs don’t have canary tests.

Identify which URLs and API's have canaries running.

Slice and dice the X-Ray service map to focus only on Synthetics tests and its path to downstream services using X-Ray groups

Developers can have multiple APIs and applications running within an account. It becomes critical at that point to focus on certain set of workflows like Synthetics tests for application “www” running on AWS Elastic Beanstalk as shown below. Developers can create X-Ray Group by using a filter expression “edge(id(name: “www”, type: “client::Synthetics”), id(name: “www”, type: “AWS::ElasticBeanstalk::Environment”))” to just focus on traces generated by their Synthetics tests for “www”.

Create X-Ray Groups for Synthetics canaries


Getting Started

Getting started with Synthetics is easy. Enable X-Ray tracing for your APIs, web application endpoints and, URLs being monitored by Synthetics canaries.  Visit the documentation to learn more about getting started using AWS X-Ray with Amazon CloudWatch Synthetics.


In this post, I outlined how you can use AWS X-Ray to debug and triage canaries running in Amazon CloudWatch Synthetics. We looked at some of the use cases that will enable developer and DevOps engineers to reduce time to resolution by viewing the end-to-end path of the request, determining the root cause of the issue and understanding which upstream or downstream services are impacted.

Growth Monitor pi: an open monitoring system for plant science

Post Syndicated from Helen Lynn original https://www.raspberrypi.org/blog/growth-monitor-pi-an-open-monitoring-system-for-plant-science/

Plant scientists and agronomists use growth chambers to provide consistent growing conditions for the plants they study. This reduces confounding variables – inconsistent temperature or light levels, for example – that could render the results of their experiments less meaningful. To make sure that conditions really are consistent both within and between growth chambers, which minimises experimental bias and ensures that experiments are reproducible, it’s helpful to monitor and record environmental variables in the chambers.

A neat grid of small leafy plants on a black plastic tray. Metal housing and tubing is visible to the sides.

Arabidopsis thaliana in a growth chamber on the International Space Station. Many experimental plants are less well monitored than these ones.
(“Arabidopsis thaliana plants […]” by Rawpixel Ltd (original by NASA) / CC BY 2.0)

In a recent paper in Applications in Plant Sciences, Brandin Grindstaff and colleagues at the universities of Missouri and Arizona describe how they developed Growth Monitor pi, or GMpi: an affordable growth chamber monitor that provides wider functionality than other devices. As well as sensing growth conditions, it sends the gathered data to cloud storage, captures images, and generates alerts to inform scientists when conditions drift outside of an acceptable range.

The authors emphasise – and we heartily agree – that you don’t need expertise with software and computing to build, use, and adapt a system like this. They’ve written a detailed protocol and made available all the necessary software for any researcher to build GMpi, and they note that commercial solutions with similar functionality range in price from $10,000 to $1,000,000 – something of an incentive to give the DIY approach a go.

GMpi uses a Raspberry Pi Model 3B+, to which are connected temperature-humidity and light sensors from our friends at Adafruit, as well as a Raspberry Pi Camera Module.

The team used open-source app Rclone to upload sensor data to a cloud service, choosing Google Drive since it’s available for free. To alert users when growing conditions fall outside of a set range, they use the incoming webhooks app to generate notifications in a Slack channel. Sensor operation, data gathering, and remote monitoring are supported by a combination of software that’s available for free from the open-source community and software the authors developed themselves. Their package GMPi_Pack is available on GitHub.

With a bill of materials amounting to something in the region of $200, GMpi is another excellent example of affordable, accessible, customisable open labware that’s available to researchers and students. If you want to find out how to build GMpi for your lab, or just for your greenhouse, Affordable remote monitoring of plant growth in facilities using Raspberry Pi computers by Brandin et al. is available on PubMed Central, and it includes appendices with clear and detailed set-up instructions for the whole system.

The post Growth Monitor pi: an open monitoring system for plant science appeared first on Raspberry Pi.

Raspberry Pi internet kill switch

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/internet-kill-switch/

Control the internet in your home with this handy Raspberry Pi Zero W internet kill switch.

Internet Kill Switch!

It’s every teenager’s worst nightmare… no WIFI! I built a standalone wireless Internet Kill Switch that lets me turn the Internet off whenever I want. A Raspberry Pi Zero W monitors the switch and sends an alert via SSH over WIFI to my firewall where another script watches for the alert and turns the external interface off or on.

Internet in my home wasn’t really a thing until I was in my late teens, and even then, there wasn’t that much online fun to be had. Not like there is now, with social media and online gaming and the YouTubes.

If I’d had access to the internet of today in my teens, I’m pretty sure I’d have never been off the thing. And that’s where a button like this would have been a godsend for my mother.

Shared by Nick Donaldson on his YouTube account, the Internet Kill Switch is a Raspberry Pi Zero W–powered emergency button that turns off all internet access in the house — perfect for keeping online activities to a reasonable level. Nick explains:

It’s every teenager’s worst nightmare… no WiFi! I built a standalone wireless Internet Kill Switch that lets me turn the internet off whenever I want. A Raspberry Pi Zero W monitors the switch and sends an alert via SSH over WiFi to my firewall, where another script watches for the alert and turns the external interface off or on. I have challenged the boys to hack it…

The Raspberry Pi Zero W sits snug within the button casing and is powered by a battery. And so that the battery can be continuously recharged, the device sits on a wireless charging pad. Hence, the button is juiced up and ready to go at any time.

I can pick it up, walk around at any time, threaten the teenagers, and shut down the internet whenever I want, hahaha!

While internet service providers are starting to roll out smartphone apps that offer similar functionality, we like the physicality of this button.

Great job, Nick! Please don’t turn off our internet.

The post Raspberry Pi internet kill switch appeared first on Raspberry Pi.

Monitor air quality with a Raspberry Pi

Post Syndicated from Andrew Gregory original https://www.raspberrypi.org/blog/monitor-air-quality-with-a-raspberry-pi/

Add a sensor and some Python 3 to your Raspberry Pi to keep tabs on your local air pollution, in the project taken from Hackspace magazine issue 21.

Air is the very stuff we breathe. It’s about 78% nitrogen, 21% oxygen, and 1% argon, and then there’s the assorted ‘other’ bits and pieces – many of which have been spewed out by humans and our related machinery. Carbon dioxide is obviously an important polluter for climate change, but there are other bits we should be concerned about for our health, including particulate matter. This is just really small bits of stuff, like soot and smog. They’re grouped together based on their size – the most important, from a health perspective, are those that are smaller than 2.5 microns in width (known as PM2.5), and PM10, which are between 10 and 2.5 microns in width. This pollution is linked with respiratory illness, heart disease, and lung cancer.

Obviously, this is something that’s important to know about, but it’s something that – here in the UK – we have relatively little data on. While there are official sensors in most major towns and cities, the effects can be very localised around busy roads and trapped in valleys. How does the particular make-up of your area affect your air quality? We set out to monitor our environment to see how concerned we should be about our local air.

Getting started

We picked the SDS011 sensor for our project (see ‘Picking a sensor’ below for details on why). This sends output via a binary data format on a serial port. You can read this serial connection directly if you’re using a controller with a UART, but the sensors also usually come with a USB-to-serial connector, allowing you to plug it into any modern computer and read the data.

The USB-to-serial connector makes it easy to connect the sensor to a computer

The very simplest way of using this is to connect it to a computer. You can read the sensor values with software such as DustViewerSharp. If you’re just interested in reading data occasionally, this is a perfectly fine way of using the sensor, but we want a continuous monitoring station – and we didn’t want to leave our laptop in one place, running all the time. When it comes to small, low-power boards with USB ports, there’s one that always springs to mind – the Raspberry Pi.

First, you’ll need a Raspberry Pi (any version) that’s set up with the latest version of Raspbian, connected to your local network, and ideally with SSH enabled. If you’re unsure how to do this, there’s guidance on the Raspberry Pi website.

The wiring for this project is just about the simplest we’ll ever do: connect the SDS011 to the Raspberry Pi with the serial adapter, then plug the Raspberry Pi into a power source.

Before getting started on the code, we also need to set up a data repository. You can store your data wherever you like – on the SD card, or upload it to some cloud service. We’ve opted to upload it to Adafruit IO, an online service for storing data and making dashboards. You’ll need a free account, which you can sign up for on the Adafruit IO website – you’ll need to know your Adafruit username and Adafruit IO key in order to run the code below. If you’d rather use a different service, you’ll need to adjust the code to push your data there.

We’ll use Python 3 for our code, and we need two modules – one to read the data from the SDS011 and one to push it to Adafruit IO. You can install this by entering the following commands in a terminal:

pip3 install pyserial adafruit-io

You’ll now need to open a text editor and enter the following code:

This does a few things. First, it reads ten bytes of data over the serial port – exactly ten because that’s the format that the SDS011 sends data in – and sticks these data points together to form a list of bytes that we call data.

We’re interested in bytes 2 and 3 for PM2.5 and 4 and 5 for PM10. We convert these from bytes to integer numbers with the slightly confusing line:

pmtwofive = int.from_bytes(b’’.join(data[2:4]), byteorder=’little’) / 10

from_byte command takes a string of bytes and converts them into an integer. However, we don’t have a string of bytes, we have a list of two bytes, so we first need to convert this into a string. The b’’ creates an empty string of bytes. We then use the join method of this which takes a list and joins it together using this empty string as a separator. As the empty string contains nothing, this returns a byte string that just contains our two numbers. The byte_order flag is used to denote which way around the command should read the string. We divide the result by ten, because the SDS011 returns data in units of tens of grams per metre cubed and we want the result in that format aio.send is used to push data to Adafruit IO. The first command is the feed value you want the data to go to. We used kingswoodtwofive and kingswoodten, as the sensor is based in Kingswood. You might want to choose a more geographically relevant name. You can now run your sensor with:

python3 airquality.py

…assuming you called the Python file airquality.py
and it’s saved in the same directory the terminal’s in.

At this point, everything should work and you can set about running your sensor, but as one final point, let’s set it up to start automatically when you turn the Raspberry Pi on. Enter the command:

crontab -e

…and add this line to the file:

@reboot python3 /home/pi/airquality.py

With the code and electronic setup working, your sensor will need somewhere to live. If you want it outside, it’ll need a waterproof case (but include some way for air to get in). We used a Tupperware box with a hole cut in the bottom mounted on the wall, with a USB cable carrying power out via a window. How you do it, though, is up to you.

Now let’s democratise air quality data so we can make better decisions about the places we live.

Picking a sensor

There are a variety of particulate sensors on the market. We picked the SDS011 for a couple of reasons. Firstly, it’s cheap enough for many makers to be able to buy and build with. Secondly, it’s been reasonably well studied for accuracy. Both the hackAIR and InfluencAir projects have compared the readings from these sensors with more expensive, better-tested sensors, and the results have come back favourably. You can see more details at hsmag.cc/DiYPfg and hsmag.cc/Luhisr.

The one caveat is that the results are unreliable when the humidity is at the extremes (either very high or very low). The SDS011 is only rated to work up to 70% humidity. If you’re collecting data for a study, then you should discard any readings when the humidity is above this. HackAIR has a formula for attempting to correct for this, but it’s not reliable enough to neutralise the effect completely. See their website for more details: hsmag.cc/DhKaWZ.

Safe levels

Once you’re monitoring your PM2.5 data, what should you look out for? The World Health Organisation air quality guideline stipulates that PM2.5 not exceed 10 µg/m3 annual mean, or 25 µg/m3 24-hour mean; and that PM10 not exceed 20 µg/m3 annual mean, or 50 µg/m3 24-hour mean. However, even these might not be safe. In 2013, a large survey published in The Lancet “found a 7% increase in mortality with each 5 micrograms per cubic metre increase in particulate matter with a diameter of 2.5 micrometres (PM2.5).”

Where to locate your sensor

Standard advice for locating your sensor is that it should be outside and four metres above ground level. That’s good advice for general environmental monitoring; however, we’re not necessarily interested in general environmental monitoring – we’re interested in knowing what we’re breathing in.

Locating your monitor near your workbench will give you an idea of what you’re actually inhaling – useless for any environmental study, but useful if you spend a lot of time in there. We found, for example, that the glue gun produced huge amounts of PM2.5, and we’ll be far more careful with ventilation when using this tool in the future.

Adafruit IO

You can use any data platform you like. We chose Adafruit IO because it’s easy to use, lets you share visualisations (in the form of dashboards) with others, and connects with IFTTT to perform actions based on values (ours tweets when the air pollution is above legal limits).

One thing to be aware of is that Adafruit IO only holds data for 30 days (on the free tier at least). If you want historical data, you’ll need to sign up for the Plus option (which stores data for 60 days), or use an alternative storage method. You can use multiple data stores if you like.

Checking accuracy

Now you’ve got your monitoring station up and running, how do you know that it’s running properly? Perhaps there’s an issue with the sensor, or perhaps there’s a problem with the code. The easiest method of calibration is to test it against an accurate sensor, and most cities here in the UK have monitoring stations as part of Defra’s Automatic Urban and Rural Monitoring Network. You can find your local station here. Many other countries have equivalent public networks. Unless there is no other option, we would caution against using crowdsourced data for calibration, as these sensors aren’t themselves calibrated.

With a USB battery pack, you can head to your local monitoring point and see if your monitor is getting similar results to the monitoring network.

HackSpace magazine #21 is out now

You can read the rest of this feature in HackSpace magazine issue 21, out today in Tesco, WHSmith, and all good independent UK newsagents.

Or you can buy HackSpace mag directly from us — worldwide delivery is available. And if you’d like to own a handy digital version of the magazine, you can also download a free PDF.

The post Monitor air quality with a Raspberry Pi appeared first on Raspberry Pi.

Monitoring insects at the Victoria and Albert Museum

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/monitoring-insects-at-the-victoria-and-albert-museum/

A simple Raspberry Pi camera setup is helping staff at the Victoria and Albert Museum track and identify insects that are threatening priceless exhibits.

“Fiacre, I need an image of bug infestation at the V&A!”

The problem with bugs

Bugs: there’s no escaping them. Whether it’s ants in your kitchen or cockroaches in your post-apocalyptic fallout shelter, insects have a habit of inconveniently infesting edifices, intent on damaging beloved belongings.

And museums are as likely as anywhere to be hit by creepy-crawly visitors. Especially when many of their exhibits are old and deliciously dusty. Yum!

Tracking insects at the V&A

As Bhavesh Shah and Maris Ines Carvalho state on the V&A blog, monitoring insect activity has become common practice at their workplace. As part of the Integrated Pest Monitoring (IPM) strategy at the museum, they even have trained staff members who inspect traps and report back their findings.

“But what if we could develop a system that gives more insight into the behaviour of insects and then use this information to prevent future outbreaks?” ask Shah and Carvalho.

The team spent around £50 on a Raspberry Pi and a 160° camera, and used these and Claude Pageau’s PI-TIMOLO software project to build an insect monitoring system. The system is now integrated into the museum, tracking insects and recording their movements — even in low-light conditions.

Emma Ormond, Raspberry Pi Trading Office Manager and Doctor of Bugs, believes this to be a Bristletail or Silverfish.

“The initial results were promising. Temperature, humidity, and light sensors could also be added to find out, for example, what time of day insects are more active or if they favour particular environmental conditions.”

For more information on the project, visit the Victoria & Albert Museum blog. And for more information on the Victoria & Albert Museum, visit the Victoria & Albert Museum, London — it’s delightful. We highly recommend attending their Videogames: Design/Play/Disrupt exhibition, which is running until 24 February.

The post Monitoring insects at the Victoria and Albert Museum appeared first on Raspberry Pi.

New – How to better monitor your custom application metrics using Amazon CloudWatch Agent

Post Syndicated from Helen Lin original https://aws.amazon.com/blogs/devops/new-how-to-better-monitor-your-custom-application-metrics-using-amazon-cloudwatch-agent/

This blog was contributed by Zhou Fang, Sr. Software Development Engineer for Amazon CloudWatch and Helen Lin, Sr. Product Manager for Amazon CloudWatch

Amazon CloudWatch collects monitoring and operational data from both your AWS resources and on-premises servers, providing you with a unified view of your infrastructure and application health. By default, CloudWatch automatically collects and stores many of your AWS services’ metrics and enables you to monitor and alert on metrics such as high CPU utilization of your Amazon EC2 instances. With the CloudWatch Agent that launched last year, you can also deploy the agent to collect system metrics and application logs from both your Windows and Linux environments. Using this data collected by CloudWatch, you can build operational dashboards to monitor your service and application health, set high-resolution alarms to alert and take automated actions, and troubleshoot issues using Amazon CloudWatch Logs.

We recently introduced CloudWatch Agent support for collecting custom metrics using StatsD and collectd. It’s important to collect system metrics like available memory, and you might also want to monitor custom application metrics. You can use these custom application metrics, such as request count to understand the traffic going through your application or understand latency so you can be alerted when requests take too long to process. StatsD and collectd are popular, open-source solutions that gather system statistics for a wide variety of applications. By combining the system metrics the agent already collects, with the StatsD protocol for instrumenting your own metrics and collectd’s numerous plugins, you can better monitor, analyze, alert, and troubleshoot the performance of your systems and applications.

Let’s dive into an example that demonstrates how to monitor your applications using the CloudWatch Agent.  I am operating a RESTful service that performs simple text encoding. I want to use CloudWatch to help monitor a few key metrics:

  • How many requests are coming into my service?
  • How many of these requests are unique?
  • What is the typical size of a request?
  • How long does it take to process a job?

These metrics help me understand my application performance and throughput, in addition to setting alarms on critical metrics that could indicate service degradation, such as request latency.

Step 1. Collecting StatsD metrics

My service is running on an EC2 instance, using Amazon Linux AMI 2018.03.0. Make sure to attach the CloudWatchAgentServerPolicy AWS managed policy so that the CloudWatch agent can collect and publish metrics from this instance:

Here is the service structure:


The “/encode” handler simply returns the base64 encoded string of an input text.  To monitor key metrics, such as total and unique request count as well as request size and method response time, I used StatsD to define these custom metrics.


public class EncodeController {

    public String encode(@RequestParam(value = "text") String text) {
        long startTime = System.currentTimeMillis();
        statsd.incrementCounter("totalRequest.count", new String[]{"path:/encode"});
        statsd.recordSetValue("uniqueRequest.count", text, new String[]{"path:/encode"});
        statsd.recordHistogramValue("request.size", text.length(), new String[]{"path:/encode"});
        String encodedString = Base64.getEncoder().encodeToString(text.getBytes());
        statsd.recordExecutionTime("latency", System.currentTimeMillis() - startTime, new String[]{"path:/encode"});
        return encodedString;

Note that I need to first choose a StatsD client from here.

The “/status” handler responds with a health check ping.  Here I am monitoring my available JVM memory:

public class StatusController {

    public int status() {
        statsd.recordGaugeValue("memory.free", Runtime.getRuntime().freeMemory(), new String[]{"path:/status"});
        return 0;


Step 2. Emit custom metrics using collectd (optional)

collectd is another popular, open-source daemon for collecting application metrics. If I want to use the hundreds of available collectd plugins to gather application metrics, I can also use the CloudWatch Agent to publish collectd metrics to CloudWatch for 15-months retention. In practice, I might choose to use either StatsD or collectd to collect custom metrics, or I have the option to use both. All of these use cases  are supported by the CloudWatch agent.

Using the same demo RESTful service, I’ll show you how to monitor my service health using the collectd cURL plugin, which passes the collectd metrics to CloudWatch Agent via the network plugin.

For my RESTful service, the “/status” handler returns HTTP code 200 to signify that it’s up and running. This is important to monitor the health of my service and trigger an alert when the application does not respond with a HTTP 200 success code. Additionally, I want to monitor the lapsed time for each health check request.

To collect these metrics using collectd, I have a collectd daemon installed on the EC2 instance, running version 5.8.0. Here is my collectd config:

LoadPlugin logfile
LoadPlugin curl
LoadPlugin network

<Plugin logfile>
  LogLevel "debug"
  File "/var/log/collectd.log"
  Timestamp true

<Plugin curl>
    <Page "status">
        URL "http://localhost:8080/status";
        MeasureResponseTime true
        MeasureResponseCode true

<Plugin network>
    <Server "" "25826">
        SecurityLevel Encrypt
        Username "user"
        Password "secret"


For the cURL plugin, I configured it to measure response time (latency) and response code (HTTP status code) from the RESTful service.

Note that for the network plugin, I used Encrypt mode which requires an authentication file for the CloudWatch Agent to authenticate incoming collectd requests.  Click here for full details on the collectd installation script.


Step 3. Configure the CloudWatch agent

So far, I have shown you how to:

A.  Use StatsD to emit custom metrics to monitor my service health
B.  Optionally use collectd to collect metrics using plugins

Next, I will install and configure the CloudWatch agent to accept metrics from both the StatsD client and collectd plugins.

I installed the CloudWatch Agent following the instructions in the user guide, but here are the detailed steps:

Install CloudWatch Agent:

wget https://s3.amazonaws.com/amazoncloudwatch-agent/linux/amd64/latest/AmazonCloudWatchAgent.zip -O AmazonCloudWatchAgent.zip && unzip -o AmazonCloudWatchAgent.zip && sudo ./install.sh

Configure CloudWatch Agent to receive metrics from StatsD and collectd:

  "metrics": {
    "append_dimensions": {
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
      "InstanceId": "${aws:InstanceId}"
    "metrics_collected": {
      "collectd": {},
      "statsd": {}

Pass the above config (config.json) to the CloudWatch Agent:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:config.json -s

In case you want to skip these steps and just execute my sample agent install script, you can find it here.


Step 4. Generate and monitor application traffic in CloudWatch

Now that I have the CloudWatch agent installed and configured to receive StatsD and collect metrics, I’m going to generate traffic through the service:

echo "send 100 requests"
for i in {1..100}
   curl "localhost:8080/encode?text=TextToEncode_${i}[email protected]#%"
   echo ""
   sleep 1


Next, I log in to the CloudWatch console and check that the service is up and running. Here’s a graph of the StatsD metrics:


Here is a graph of the collectd metrics:



With StatsD and collectd support, you can now use the CloudWatch Agent to collect and monitor your custom applications in addition to the system metrics and application logs it already collects. Furthermore, you can create operational dashboards with these metrics, set alarms to take automated actions when free memory is low, and troubleshoot issues by diving into the application logs.  Note that StatsD supports both Windows and Linux operating systems while collectd is Linux only.  For Windows, you can also continue to use Windows Performance Counters to collect custom metrics instead.

The CloudWatch Agent with custom metrics support (version 1.203420.0 or later) is available in all public AWS Regions, AWS GovCloud (US), with AWS China (Beijing) and AWS China (Ningxia) coming soon.

The agent is free to use; you pay the usual CloudWatch prices for logs and custom metrics.

For more details, head over to the CloudWatch user guide for StatsD and collectd.

Building an Amazon CloudWatch Dashboard Outside of the AWS Management Console

Post Syndicated from Stephen McCurry original https://aws.amazon.com/blogs/devops/building-an-amazon-cloudwatch-dashboard-outside-of-the-aws-management-console/

Steve McCurry is a Senior Product Manager for CloudWatch

This is the second in a series of two blog posts that demonstrate how to use the new CloudWatch
snapshot graphs feature. You can find the first post here.

A key challenge for any DevOps team is to provide sufficient monitoring visibility on service
health. Although CloudWatch dashboards are a powerful tool for monitoring your systems and
applications, the dashboards are accessible only to users with permissions to the AWS
Management Console. You can now use a new CloudWatch feature, snapshot graphs, to create
dashboards that contain CloudWatch graphs and are available outside of the AWS Management
Console. You can display CloudWatch snapshot graphs on your internal wiki pages or TV-based
dashboards. You can integrate them with chat applications and ticketing and bug tracking tools.

This blog post shows you how to embed CloudWatch snapshot graphs into your websites using a
lightweight, embeddable widget written in JavaScript.

Snapshot graphs overview

CloudWatch snapshot graphs are images of CloudWatch charts that are useful for building
custom dashboards or integrating with tools outside of AWS. Although the images are static,
they can be refreshed frequently to create a live dashboard experience.

CloudWatch dashboards and charts provide flexible, interactive visualizations that can be used to
create unified operational views across your AWS resources and metrics. However, maybe you
want to display CloudWatch charts on a TV screen for team-level visibility, take snapshots of
charts for auditing in ticketing systems and bug tracking tools, or share snapshots in chat
applications to collaborate on an issue. For these use cases and more, snapshot graphs are an
ideal tool for integrating CloudWatch charts with your webpages and third-party applications.

Snapshot graphs are available through the CloudWatch API, which you can use through the
AWS SDKs or CLI. The charts you request through the API are represented as JSON. To copy
the JSON definition of the graph and use it in the API request, open the Amazon CloudWatch
console. You’ll find the JSON on the Source tab of the Metrics page, as shown here.

All of the features of the CloudWatch line and stacked graphs are available in snapshot graphs,
including vertical and horizontal annotations.

Embedding a snapshot graph in your webpage

In this demonstration, we will set up monitoring for an EC2 instance and embed a CloudWatch
snapshot graph for CPUUtilization in a website outside of the AWS Management Console. The
embeddable widget can be configured to support any CloudWatch line or stacked chart. This
demonstration involves these steps:

  1. Create an EC2 instance to monitor.
  2. Create the Lambda function that calls CloudWatch GetMetricWidgetImage.
  3. Create an API Gateway endpoint that proxies requests to the Lambda function.
  4. Embed the widget into a website and configure it for the API Gateway request.

The code for this solution is available from the SnapshotWidgetDemo GitHub repo.

The embeddable JavaScript widget will communicate with CloudWatch through a gateway in
Amazon API Gateway and an AWS Lambda backend. The advantage of using API Gateway is
the additional flexibility you have to secure the endpoint and create fine-grained access control.
For example, you can block access to the endpoint from outside of your corporate network.
Amazon Route 53 could be an alternative solution.

The end goal is to have a webpage running on your local machine that displays a CloudWatch
snapshot graph displaying live metric data from a sample EC2 instance. The sample code
includes a basic webpage containing the embed code.

The JavaScript widget requests a snapshot graph from an API Gateway endpoint. API Gateway
proxies the request to a Lambda function that calls the new CloudWatch API service,
GetMetricWidgetImage. The retrieved snapshot graph is returned in binary and displayed on the
website in an IMG HTML tag.

Here is what the end-to-end solution looks like:

Server setup

  1. Download the repository.
  2. Navigate to ./server and run npm install
  3. From the server folder, run zip -r snapshotwidgetdemo.zip ./*
  4. Upload snapshotwidgetdemo.zip to any S3 bucket.
  5. Upload ./server/apigateway-lambda.json to any S3 bucket.
  6. Navigate to the AWS CloudFormation console and choose Create Stack.
    • Point the new stack to the S3 location in step 5.
    • During setup, you will be asked for the Lambda S3 bucket name from step 4.

The AWS CloudFormation script will create all the required server-side components described in
the previous section.

Client setup

  1. Navigate to ./client and run npm install
  2. Edit ./demo/index.html to replace the following placeholders with your values.
    a. <YOUR_INSTANCE_ID> You can find the instance ID in the AWS
    CloudFormation stack output.
    b. <YOUR_API_GATEWAY_URL> You can find the full URL in the AWS
    CloudFormation stack output.
    c. <YOUR_API_KEY> The API gateway requires a key. The key reference but not
    the key itself appears in theAWS CloudFormation stack output. To retrieve the
    key value, go to the Keys tab of the Amazon API Gateway console.
  3. Build the component using WebPack ./node_modules/.bin/webpack –config
  4. Server the demo webpage on localhost ./node_modules/.bin/webpack-dev-server —

The browser should open at index.html automatically. The page contains one embedded snapshot
graph with the CPU utilization of your EC2 instance.


If you don’t see anything on the webpage, use the browser console tools to check for console
error messages.

If you still can’t debug the problem, go to the Amazon API Gateway console. On the Logs tab,
make sure that Enable CloudWatch Logs is selected, as shown here:

To check the Lambda logs, in the CloudWatch console, choose the Logs tab, and then search for
the name of your Lambda function.


This blog post provided a solution for embedding CloudWatch snapshot graphs into webpages
and wikis outside of the AWS Management Console. To read the other blog post in this series
about CloudWatch snapshot graphs, see Reduce Time to Resolution with Amazon CloudWatch
Snapshot Graphs and Alerts.

For more information, see the snapshot graphs API documentation or visit our home page to
learn more about how Amazon CloudWatch achieves monitoring visibility for your cloud
resources and applications.

It would be great to hear your feedback.

Reduce Time to Resolution with Amazon CloudWatch Snapshot Graphs and Alerts

Post Syndicated from Stephen McCurry original https://aws.amazon.com/blogs/devops/reduce-time-to-resolution-with-amazon-cloudwatch-snapshot-graphs-and-alerts/

Steve McCurry is a Senior Product Manager for Amazon CloudWatch.

This is the first in a series of two blog posts that show how to use the new Amazon CloudWatch
snapshot graphs feature.

Although automated alerts are an important feature of any monitoring solution, including
Amazon CloudWatch, it can be challenging to identify the issues that matter in the noise of
monitoring alerts. Reducing the time to resolution depends on being able to make quick
decisions around the importance of alerts.

When you receive an alert, you ask, “Is this issue urgent?” Unfortunately, a page or email alert
that contains a text description about a symptom doesn’t provide enough context to answer the
question. You need to correlate the alert with metrics to understand what was happening around
the time of the alert. This digging for more information slows down the process of resolving the

This blog post shows you how to add richer context to an email alert by attaching a CloudWatch
snapshot graph. The snapshot graph shows the behavior of the underlying metric for the three
hours leading up to the alert.

Snapshot graphs overview

You can use snapshot graphs to integrate and display CloudWatch charts outside of the AWS
Management Console to improve monitoring visibility or reduce time to resolution. This feature
makes it possible for you to display CloudWatch charts on your webpage or integrate charts with
third-party tools, such as ticketing, chat applications, and bug tracking.

CloudWatch snapshot graphs are images of CloudWatch charts that are useful for building
custom dashboards. Although the images are static, they can be refreshed frequently to create a
live dashboard experience.

Snapshot graphs are available through the CloudWatch API, which you can use through the
AWS SDKs or AWS CLI. The charts you request through the API are represented as JSON. To
copy the JSON definition of the chart and use it in the API request, open the Amazon
CloudWatch console. You’ll find the JSON on the Source tab of the Metrics page, as shown

All of the features of the CloudWatch line and stacked charts are available in snapshot graphs,
including vertical and horizontal annotations. The example in this post uses horizontal annotations.

Adding context to an EC2 monitoring alert

In this post, we will set up monitoring for an EC2 instance and generate a monitoring alert. The
alert contains details and a chart that displays the trend of the underlying metric (CPUUtilization)
for three hours leading up to the alert. To follow along, use the code in the SnapshotAlarmDemo
GitHub repo.

These are the steps required for the monitoring workflow:

  1. Create an EC2 instance to monitor.
  2. Create a Lambda function that creates an email alert with a snapshot graph attachment.
  3. Create an Amazon Simple Notification Service (Amazon SNS) topic with a Lambda
  4. Configure Amazon Simple Email Service (Amazon SES) to send email to your
  5. Create an alarm on a metric in CloudWatch whose target is the SNS topic.

After these components are set up and configured, the alarm will be triggered when the metric
breaches the alarm threshold value. The alarm will trigger the SNS topic and the Lambda
function will be executed. The Lambda function will interrogate the SNS message to extract the
details of the underlying metric and will call the Snapshot Graphs API to create the
corresponding chart. The Lambda function will also create an email and add the chart as an
attachment before using SES to send it. The operator will receive the email alert and can view
the chart immediately, without signing in to AWS.

Here is what the end-to-end solution looks like:

Create the EC2 instance to monitor

  1. Open the Amazon EC2 console.
  2. From the console dashboard, choose Launch Instance.
  3. On the Choose an Instance Type page, choose any instance type. For size, choose nano.
  4. Choose Review and Launch to let the wizard complete the other configuration settings
    for you.
  5. On the Review Instance Launch page, choose Launch.
  6. When prompted for a key pair, select Choose an existing key pair, or create one.
  7. Make a note of the instance ID that is displayed on the Launch Status page. You need
    this later, when you create your dashboard.

Create the Lambda function

First, create an IAM role for your Lambda function. Your Lambda function needs a role with the
permissions required to execute Lambda and call the CloudWatch API.

Step 1: Create the IAM role

Navigate to the IAM console.

In the navigation pane, choose Roles, and then choose Create role.

Add the following policies to the new role. These policies are more permissive than the
minimum permissions required for this example, so review and adjust according to your

  • CloudWatchReadOnlyAccess
  • AmazonSESFullAccess
  • AmazonSNSReadOnlyAccess

Step 2: Create the Lambda function

Next, navigate to the AWS Lambda console and choose Create a function. If you are using the
demo code provided with this post, choose Node.js 6.10 (or later) for the runtime. Attach the
IAM role you created in step 1.

Step 3: Upload the code

Download the code from the SnapshotAlarmDemo GitHub repo. You will need to run npm install locally and then ZIP the project to upload to the Lambda function.

For Handler, enter emailer.myHandler.

Set the function timeout to 30 seconds, and then choose Save. Requests to the Snapshot Graphs
service will take longer based on the number of metrics and time interval requested. To optimize
request time, keep requests to under three hours of the most recent data.

Step 4: Set the environment variables

The email address and region used by the Lambda function are configured in the environment

EMAIL_TO_ADDRESS is the email address where the alert email will be sent.
EMAIL_FROM_ADDRESS is the sender email address.

Create the SNS topic

Navigate to the Amazon SNS console and choose Create Topic.

Create a subscription on the topic. For Endpoint, enter your Lambda function.

Configure SES

Navigate to the Amazon SES console. In the left navigation pane, choose Email Addresses.
Choose Verify a New Email Address, and then enter the email address.

SES will send a verification email to the selected address. To verify the account, you must click
the link in the verification email.

Create an alarm in CloudWatch

Navigate to the Amazon CloudWatch console. In the left navigation pane, choose Alarms, and
then choose Create Alarm.

For the Select Metric step, select an instance and metric (CPUUtilization, as shown here). For
the Define Alarm step, enter a unique name and threshold value. Use your SNS topic as the
target action. Change the period to 1 minute. Create the alarm.

Testing the workflow

To simulate a problem, let’s manually change the threshold of the metric to a very low value.

Save the alarm and then wait for the alarm to go into an error state. (This can take a couple of
minutes.) As soon as your alarm is in the error state, you should receive the alert email almost

The email contains information about the alarm. The chart of the last three hours of the
associated metric is embedded in the email:


If you didn’t receive an email, make sure that the alarm is in the alarm state. If it is, check the
AWS Lambda logs, which you’ll find on the Logs tab of the CloudWatch console.


As you have seen in this demo, you can create CloudWatch alarm workflows that provide more
context than a basic text alert.

In my next post in this series, I will show you other ways to use CloudWatch snapshot graphs to
improve monitoring visibility.

For more information, see the snapshot graphs API documentation.

Taking it further

Most popular ticketing systems allow you to send emails that autogenerate tickets. You can use
the code provided in this post to email your ticketing system to create a ticket for the alarm and
automatically attach the snapshot graph in the body of the ticket. Try it!

You can also use the code provided in this post as a starting point to programmatically email an
entire CloudWatch dashboard rather than an individual chart. Simply retrieve the dashboard
definition using the GetDashboard API and make multiple calls to GetMetricWidgetImage.

I look forward to seeing what you build!

timeShift(GrafanaBuzz, 1w) Issue 47

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2018/06/01/timeshiftgrafanabuzz-1w-issue-47/

Welcome to TimeShift We cover a lot of ground this week with posts on general monitoring principles, home automation, how CERN uses open source projects in their particle acceleration work, and more. Have an article you’d like highlighted here? Get in touch.
We’re excited to be a sponsor of Monitorama PDX June 4-6. If you’re going, please be sure and say hello! Latest Release: Grafana 5.1.3 This latest point release fixes a scrolling issue that was reported in Firefox.

Monitoring with Azure and Grafana

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2018/05/31/monitoring-with-azure-and-grafana/

Monitoring with Azure and Grafana What is whitebox monitoring?
Why do we monitor our systems?
What is the Azure Monitor plugin and how can I use it to monitor my Azure resources?
Recently, I spoke at Swetugg 2018, a .NET conference held in Stockholm, Sweden to answer these questions. In this video you’ll learn some basic monitoring principles, some of the tools we use to monitor our systems, and get an inside look at the new Azure Monitor plugin for Grafana.

Monitoring your Amazon SNS message filtering activity with Amazon CloudWatch

Post Syndicated from Rachel Richardson original https://aws.amazon.com/blogs/compute/monitoring-your-amazon-sns-message-filtering-activity-with-amazon-cloudwatch/

This post is courtesy of Otavio Ferreira, Manager, Amazon SNS, AWS Messaging.

Amazon SNS message filtering provides a set of string and numeric matching operators that allow each subscription to receive only the messages of interest. Hence, SNS message filtering can simplify your pub/sub messaging architecture by offloading the message filtering logic from your subscriber systems, as well as the message routing logic from your publisher systems.

After you set the subscription attribute that defines a filter policy, the subscribing endpoint receives only the messages that carry attributes matching this filter policy. Other messages published to the topic are filtered out for this subscription. In this way, the native integration between SNS and Amazon CloudWatch provides visibility into the number of messages delivered, as well as the number of messages filtered out.

CloudWatch metrics are captured automatically for you. To get started with SNS message filtering, see Filtering Messages with Amazon SNS.

Message Filtering Metrics

The following six CloudWatch metrics are relevant to understanding your SNS message filtering activity:

  • NumberOfMessagesPublished – Inbound traffic to SNS. This metric tracks all the messages that have been published to the topic.
  • NumberOfNotificationsDelivered – Outbound traffic from SNS. This metric tracks all the messages that have been successfully delivered to endpoints subscribed to the topic. A delivery takes place either when the incoming message attributes match a subscription filter policy, or when the subscription has no filter policy at all, which results in a catch-all behavior.
  • NumberOfNotificationsFilteredOut – This metric tracks all the messages that were filtered out because they carried attributes that didn’t match the subscription filter policy.
  • NumberOfNotificationsFilteredOut-NoMessageAttributes – This metric tracks all the messages that were filtered out because they didn’t carry any attributes at all and, consequently, didn’t match the subscription filter policy.
  • NumberOfNotificationsFilteredOut-InvalidAttributes – This metric keeps track of messages that were filtered out because they carried invalid or malformed attributes and, thus, didn’t match the subscription filter policy.
  • NumberOfNotificationsFailed – This last metric tracks all the messages that failed to be delivered to subscribing endpoints, regardless of whether a filter policy had been set for the endpoint. This metric is emitted after the message delivery retry policy is exhausted, and SNS stops attempting to deliver the message. At that moment, the subscribing endpoint is likely no longer reachable. For example, the subscribing SQS queue or Lambda function has been deleted by its owner. You may want to closely monitor this metric to address message delivery issues quickly.

Message filtering graphs

Through the AWS Management Console, you can compose graphs to display your SNS message filtering activity. The graph shows the number of messages published, delivered, and filtered out within the timeframe you specify (1h, 3h, 12h, 1d, 3d, 1w, or custom).

SNS message filtering for CloudWatch Metrics

To compose an SNS message filtering graph with CloudWatch:

  1. Open the CloudWatch console.
  2. Choose Metrics, SNS, All Metrics, and Topic Metrics.
  3. Select all metrics to add to the graph, such as:
    • NumberOfMessagesPublished
    • NumberOfNotificationsDelivered
    • NumberOfNotificationsFilteredOut
  4. Choose Graphed metrics.
  5. In the Statistic column, switch from Average to Sum.
  6. Title your graph with a descriptive name, such as “SNS Message Filtering”

After you have your graph set up, you may want to copy the graph link for bookmarking, emailing, or sharing with co-workers. You may also want to add your graph to a CloudWatch dashboard for easy access in the future. Both actions are available to you on the Actions menu, which is found above the graph.


SNS message filtering defines how SNS topics behave in terms of message delivery. By using CloudWatch metrics, you gain visibility into the number of messages published, delivered, and filtered out. This enables you to validate the operation of filter policies and more easily troubleshoot during development phases.

SNS message filtering can be implemented easily with existing AWS SDKs by applying message and subscription attributes across all SNS supported protocols (Amazon SQS, AWS Lambda, HTTP, SMS, email, and mobile push). CloudWatch metrics for SNS message filtering is available now, in all AWS Regions.

For information about pricing, see the CloudWatch pricing page.

For more information, see:

Protecting your API using Amazon API Gateway and AWS WAF — Part I

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/protecting-your-api-using-amazon-api-gateway-and-aws-waf-part-i/

This post courtesy of Thiago Morais, AWS Solutions Architect

When you build web applications or expose any data externally, you probably look for a platform where you can build highly scalable, secure, and robust REST APIs. As APIs are publicly exposed, there are a number of best practices for providing a secure mechanism to consumers using your API.

Amazon API Gateway handles all the tasks involved in accepting and processing up to hundreds of thousands of concurrent API calls, including traffic management, authorization and access control, monitoring, and API version management.

In this post, I show you how to take advantage of the regional API endpoint feature in API Gateway, so that you can create your own Amazon CloudFront distribution and secure your API using AWS WAF.

AWS WAF is a web application firewall that helps protect your web applications from common web exploits that could affect application availability, compromise security, or consume excessive resources.

As you make your APIs publicly available, you are exposed to attackers trying to exploit your services in several ways. The AWS security team published a whitepaper solution using AWS WAF, How to Mitigate OWASP’s Top 10 Web Application Vulnerabilities.

Regional API endpoints

Edge-optimized APIs are endpoints that are accessed through a CloudFront distribution created and managed by API Gateway. Before the launch of regional API endpoints, this was the default option when creating APIs using API Gateway. It primarily helped to reduce latency for API consumers that were located in different geographical locations than your API.

When API requests predominantly originate from an Amazon EC2 instance or other services within the same AWS Region as the API is deployed, a regional API endpoint typically lowers the latency of connections. It is recommended for such scenarios.

For better control around caching strategies, customers can use their own CloudFront distribution for regional APIs. They also have the ability to use AWS WAF protection, as I describe in this post.

Edge-optimized API endpoint

The following diagram is an illustrated example of the edge-optimized API endpoint where your API clients access your API through a CloudFront distribution created and managed by API Gateway.

Regional API endpoint

For the regional API endpoint, your customers access your API from the same Region in which your REST API is deployed. This helps you to reduce request latency and particularly allows you to add your own content delivery network, as needed.


In this section, you implement the following steps:

  • Create a regional API using the PetStore sample API.
  • Create a CloudFront distribution for the API.
  • Test the CloudFront distribution.
  • Set up AWS WAF and create a web ACL.
  • Attach the web ACL to the CloudFront distribution.
  • Test AWS WAF protection.

Create the regional API

For this walkthrough, use an existing PetStore API. All new APIs launch by default as the regional endpoint type. To change the endpoint type for your existing API, choose the cog icon on the top right corner:

After you have created the PetStore API on your account, deploy a stage called “prod” for the PetStore API.

On the API Gateway console, select the PetStore API and choose Actions, Deploy API.

For Stage name, type prod and add a stage description.

Choose Deploy and the new API stage is created.

Use the following AWS CLI command to update your API from edge-optimized to regional:

aws apigateway update-rest-api \
--rest-api-id {rest-api-id} \
--patch-operations op=replace,path=/endpointConfiguration/types/EDGE,value=REGIONAL

A successful response looks like the following:

    "description": "Your first API with Amazon API Gateway. This is a sample API that integrates via HTTP with your demo Pet Store endpoints", 
    "createdDate": 1511525626, 
    "endpointConfiguration": {
        "types": [
    "id": "{api-id}", 
    "name": "PetStore"

After you change your API endpoint to regional, you can now assign your own CloudFront distribution to this API.

Create a CloudFront distribution

To make things easier, I have provided an AWS CloudFormation template to deploy a CloudFront distribution pointing to the API that you just created. Click the button to deploy the template in the us-east-1 Region.

For Stack name, enter RegionalAPI. For APIGWEndpoint, enter your API FQDN in the following format:


After you fill out the parameters, choose Next to continue the stack deployment. It takes a couple of minutes to finish the deployment. After it finishes, the Output tab lists the following items:

  • A CloudFront domain URL
  • An S3 bucket for CloudFront access logs
Output from CloudFormation

Output from CloudFormation

Test the CloudFront distribution

To see if the CloudFront distribution was configured correctly, use a web browser and enter the URL from your distribution, with the following parameters:


You should get the following output:

    "id": 1,
    "type": "dog",
    "price": 249.99
    "id": 2,
    "type": "cat",
    "price": 124.99
    "id": 3,
    "type": "fish",
    "price": 0.99

Set up AWS WAF and create a web ACL

With the new CloudFront distribution in place, you can now start setting up AWS WAF to protect your API.

For this demo, you deploy the AWS WAF Security Automations solution, which provides fine-grained control over the requests attempting to access your API.

For more information about deployment, see Automated Deployment. If you prefer, you can launch the solution directly into your account using the following button.

For CloudFront Access Log Bucket Name, add the name of the bucket created during the deployment of the CloudFormation stack for your CloudFront distribution.

The solution allows you to adjust thresholds and also choose which automations to enable to protect your API. After you finish configuring these settings, choose Next.

To start the deployment process in your account, follow the creation wizard and choose Create. It takes a few minutes do finish the deployment. You can follow the creation process through the CloudFormation console.

After the deployment finishes, you can see the new web ACL deployed on the AWS WAF console, AWSWAFSecurityAutomations.

Attach the AWS WAF web ACL to the CloudFront distribution

With the solution deployed, you can now attach the AWS WAF web ACL to the CloudFront distribution that you created earlier.

To assign the newly created AWS WAF web ACL, go back to your CloudFront distribution. After you open your distribution for editing, choose General, Edit.

Select the new AWS WAF web ACL that you created earlier, AWSWAFSecurityAutomations.

Save the changes to your CloudFront distribution and wait for the deployment to finish.

Test AWS WAF protection

To validate the AWS WAF Web ACL setup, use Artillery to load test your API and see AWS WAF in action.

To install Artillery on your machine, run the following command:

$ npm install -g artillery

After the installation completes, you can check if Artillery installed successfully by running the following command:

$ artillery -V
$ 1.6.0-12

As the time of publication, Artillery is on version 1.6.0-12.

One of the WAF web ACL rules that you have set up is a rate-based rule. By default, it is set up to block any requesters that exceed 2000 requests under 5 minutes. Try this out.

First, use cURL to query your distribution and see the API output:

$ curl -s https://{distribution-name}.cloudfront.net/prod/pets
    "id": 1,
    "type": "dog",
    "price": 249.99
    "id": 2,
    "type": "cat",
    "price": 124.99
    "id": 3,
    "type": "fish",
    "price": 0.99

Based on the test above, the result looks good. But what if you max out the 2000 requests in under 5 minutes?

Run the following Artillery command:

artillery quick -n 2000 --count 10  https://{distribution-name}.cloudfront.net/prod/pets

What you are doing is firing 2000 requests to your API from 10 concurrent users. For brevity, I am not posting the Artillery output here.

After Artillery finishes its execution, try to run the cURL request again and see what happens:


$ curl -s https://{distribution-name}.cloudfront.net/prod/pets

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<TITLE>ERROR: The request could not be satisfied</TITLE>
<H2>The request could not be satisfied.</H2>
<HR noshade size="1px">
Request blocked.
<BR clear="all">
<HR noshade size="1px">
Generated by cloudfront (CloudFront)
Request ID: [removed]

As you can see from the output above, the request was blocked by AWS WAF. Your IP address is removed from the blocked list after it falls below the request limit rate.


In this first part, you saw how to use the new API Gateway regional API endpoint together with Amazon CloudFront and AWS WAF to secure your API from a series of attacks.

In the second part, I will demonstrate some other techniques to protect your API using API keys and Amazon CloudFront custom headers.

Monitoring for Everyone

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2018/05/23/monitoring-for-everyone/

Øredev – Carl Bergquist – Monitoring for Everyone What is monitoring?
What do the terms log, metric, and distributed tracing actually mean?
What makes a good alert?
Why should I care?
At a recent developer conference in Malmö, Sweden, I gave a presentation on monitoring and observability to discuss the high level concepts and common tools that are out there.
Monitoring and observability can easily become quite complex, but at the heart of it, we simply want to know how our systems are performing, and when performance drops – be able to find out why.

[$] An introduction to MQTT

Post Syndicated from corbet original https://lwn.net/Articles/753705/rss

I was sure that somewhere there must be
physically-lightweight sensors with simple power, simple networking, and
a lightweight protocol that allowed them to squirt their data down the
network with a minimum of overhead. So my interest was piqued when Jan-Piet Mens spoke at FLOSS
UK’s Spring
on “Small Things for Monitoring”. Once he started passing
working demonstration systems around the room without interrupting the
demonstration, it was clear that MQTT was what I’d been looking for.

timeShift(GrafanaBuzz, 1w) Issue 43

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2018/05/04/timeshiftgrafanabuzz-1w-issue-43/

Welcome to TimeShift This week, Grafana Labs was happy to speak at and sponsor KubeCon + CloudNativeCon EU in Copenhagen, Denmark. We got to chat with a ton of Grafana users, attended amazing talks, and generally had a blast! From Grafana Labs, Goutham Veeramanchaneni gave two talks focusing on TSDB – the engine behind Prometheus, and Tom Wilkie discussed a technique for using Jsonnet for packaging and deploying “Monitoring Mixins” – extensible and customizable combinations of dashboards, alert definitions and exporters.