Smart(er) Origin Service Level Monitoring

Post Syndicated from Natasha Wissmann original https://blog.cloudflare.com/smarter-origin-service-level-monitoring/

Today we’re excited to announce Origin Error Rate notifications: a new, sophisticated way to detect and notify you when Cloudflare sees elevated levels of 5xx errors from your origin.

In 2019, we announced Passive Origin Monitoring alerts to notify you when your origin goes down. Passive Origin Monitoring is great — it tells you if every request to your origin is returning a 521 error (web server down) for a full five minutes. But sometimes that’s not enough. You don’t want to wait for all of your users to have issues; you want to be notified when more users than normal are having issues before it becomes a big problem.

Calculating Anomalies

No service is perfect — we know that a very small percentage of your origin responses are likely to be 5xx errors. Most of the time, these issues are one-offs and nothing actually needs to be done. However, for Internet properties with very high traffic, even a very small percentage can translate into a large absolute number of errors. If we alerted you on every one of them, you would never stop getting useless notifications. When it comes to notifying, the question isn’t whether there are any errors, but how many errors need to occur before it’s a problem.

So how do we actually tell if something is a problem? As humans, it’s relatively easy for us to look at a graph, identify a spike, and think “hmm, that’s not supposed to be there.” For a computer it gets a little more complicated. Unlike humans, who have intuition and can exercise judgement in grey areas, a computer needs an exact set of criteria to tell whether something is out of the ordinary.

The simplest way to detect abnormalities in a time series dataset is to set a single threshold — for example, “notify me whenever more than 5% of the requests to my Internet properties result in errors.” Unfortunately, it’s really hard to pick an effective threshold. Too high and you never actually get notified; too low, and you’re drowning in notifications.
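
To make this concrete, here’s a minimal sketch of a single static threshold check (the 5% value and the `error_rate` input are illustrative, not part of the product):

```python
# Minimal sketch of a single static threshold, as described above.
# `error_rate` is the fraction of requests in the last measurement
# window that returned 5xx errors (a hypothetical input).

THRESHOLD = 0.05  # "notify me whenever more than 5% of requests are errors"

def should_notify(error_rate: float) -> bool:
    return error_rate > THRESHOLD
```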

Even when we find that happy medium, we can still miss issues that burn “low and slow”. This is where there’s no obvious, dramatic spike, but something has been going a little wrong for a long time.

We can try layering on multiple thresholds. For example: notify you if the error rate is ever over 10%, or if it’s over 5% for more than five minutes, or if it’s over 2% for more than 10 minutes. Not only does this quickly become complicated, but it also doesn’t account for periodic issues, such as Kubernetes pods restarting or deployments going out at a regular interval. What if the error rate is over 5% for only four minutes, but it happens every five minutes? We know that a lot of your end users are being affected, but even the long set of rules listed above wouldn’t catch it.
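
A rough sketch of those layered rules, just to show how quickly the logic grows (the per-minute bookkeeping is hypothetical and simplified):

```python
# Sketch of the layered threshold rules described above.
# `per_minute_error_rates` holds one 5xx error rate per minute,
# most recent last (a hypothetical input).

def should_notify(per_minute_error_rates: list[float]) -> bool:
    last_5 = per_minute_error_rates[-5:]
    last_10 = per_minute_error_rates[-10:]

    if per_minute_error_rates and per_minute_error_rates[-1] > 0.10:
        return True                                 # ever over 10%
    if len(last_5) == 5 and min(last_5) > 0.05:
        return True                                 # over 5% for 5 minutes
    if len(last_10) == 10 and min(last_10) > 0.02:
        return True                                 # over 2% for 10 minutes
    # Note: a spike over 5% that lasts only four minutes, repeated every
    # five minutes, never trips any of these rules.
    return False
```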

So thresholds probably aren’t sophisticated enough to detect origin issues. Instead, we turn to the Google SRE Handbook for alerting based on Service Level Objectives (SLOs). An SLO is a part of an agreement between a customer and a service provider. It’s a defined metric and value that both parties agree on. One of the most common types of SLOs is availability, or “the service will be available for a certain percentage of the time”. In this case, the service is your origin and the agreement is between you and your end users. Your end users expect your origin to be available a certain percent of the time.

If we revisit our original concept, we know that you’re comfortable with your origin returning a certain number of errors. We define that number as your SLO. An SLO of 99.9% means that you’re OK with 0.1% of all of your requests over a month being errors. Therefore, 0.1% of all the requests that reach your origin is your total error budget — you can have this many errors in a month and never be notified, because you’ve defined that as acceptable.
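
To make that concrete with assumed numbers: at a 99.9% SLO, an Internet property that receives 10 million requests in a month has an error budget of 0.1% × 10,000,000 = 10,000 error responses for that month.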

What you really want to know is when you’re burning through that error budget too quickly — this probably means that something is actually going wrong (instead of being a one-time occurrence). We can measure a burn rate to gauge how quickly you’re burning through your error budget, given the rate of errors that we’re currently seeing. A burn rate of one means that the entirety of the error budget will be used up exactly within the set time period — an ideal scenario. A burn rate of zero is even better since we’re not seeing any errors at all, but it’s ultimately pretty unrealistic. A burn rate of 10 is most likely a problem — if that rate keeps up for the full month, you’ll have had 10x the number of errors you originally said was acceptable.
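
A minimal sketch of the burn rate calculation, assuming a 99.9% SLO and made-up request counts (this follows the general SRE definition: the observed error rate divided by the error rate your budget allows):

```python
# Sketch of a burn rate calculation against a 99.9% SLO.
# The request and error counts passed in are hypothetical measurements.

SLO = 0.999
ERROR_BUDGET_RATE = 1 - SLO  # 0.1% of requests are allowed to be errors

def burn_rate(errors: int, total_requests: int) -> float:
    """How fast the error budget is being consumed:
    0.0  -> no errors at all
    1.0  -> budget used up exactly over the SLO period
    10.0 -> errors arriving 10x faster than the budget allows
    """
    observed_error_rate = errors / total_requests
    return observed_error_rate / ERROR_BUDGET_RATE

print(round(burn_rate(errors=0, total_requests=100_000), 3))      # 0.0
print(round(burn_rate(errors=100, total_requests=100_000), 3))    # 1.0
print(round(burn_rate(errors=1_000, total_requests=100_000), 3))  # 10.0
```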

Even when using burn rates instead of thresholds, we still want to have multiple criteria. We want to measure a short time period with a high burn rate (a short indicator). This covers your need to “alert me quickly when something dramatic is happening.” But we also want to have a longer time period with a lower burn rate (a long indicator), in order to cover your need to “don’t alert me on issues that are over quickly.” This way we can ensure that we alert quickly without sending too many false positives.
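
Putting the two indicators together, the alerting condition might be sketched like this (window lengths and burn rate limits are illustrative, not the values Cloudflare actually uses):

```python
# Sketch of a multi-window, multi-burn-rate check as described above:
# notify only when both the short and the long indicator are burning
# the error budget too fast.

SLO = 0.999
ERROR_BUDGET_RATE = 1 - SLO

def window_burn_rate(errors: int, requests: int) -> float:
    return (errors / requests) / ERROR_BUDGET_RATE if requests else 0.0

def should_notify(short_window: tuple[int, int], long_window: tuple[int, int]) -> bool:
    """Each window is an (errors, requests) pair over its time period."""
    short_burn = window_burn_rate(*short_window)  # e.g. the last 5 minutes
    long_burn = window_burn_rate(*long_window)    # e.g. the last hour

    # Short indicator: high burn rate over a short period ("alert me quickly
    # when something dramatic is happening").
    # Long indicator: lower burn rate over a longer period ("don't alert me
    # on issues that are over quickly").
    return short_burn > 14 and long_burn > 7
```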

Let’s take a look at the life of an incident using this methodology. In our first measurement, the short indicator tells us it looks like something is starting to go wrong. However, the long indicator is still within reasonable bounds. We’re not going to sound the alarm yet.

When we measure next, we see that the problem is worse. Now we’re at the point where there are enough errors that not only is the short indicator telling us there’s something wrong, but the long indicator has been impacted too. We feel confident that there’s a problem, and it’s time to notify you.

A couple cycles later, the incident is over. The long indicator is still telling us that something is wrong, because the incident is still within the long time period. However, the short indicator shows that nothing is currently concerning. Since we don’t have both indicators telling us that something is wrong, we won’t notify you. This keeps us from sending notifications for incidents that are already over.

This methodology is cool because of how well it responds to different incidents. If the original spike had been more dramatic, it would have triggered both the long and short indicators immediately. The more errors we’re seeing, the more confident we are that there’s an issue and the sooner we can notify you.

Even with this methodology, we know that different services behave differently. So for this notification, you can choose the Service Level Objective (SLO) you want to use to monitor your Internet property: 99.9% (high sensitivity), 99.8% (medium sensitivity), or 99.7% (low sensitivity). You can also choose which Internet properties you want to monitor — no need to be notified for test properties or lower priority domains.

Getting started today

HTTP Origin Error Rate notifications can be configured in the Notifications tab of the dashboard. Select Origin Error Rate Alert as your alert type. As with all Cloudflare notifications, you’re able to name and describe your notification, and choose how you want to be notified. From there, you can select which domains you want to monitor, and at what SLO.
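
If you’d rather configure this programmatically, the same policy can generally be created through Cloudflare’s Notification policies API. The sketch below assumes the v3 alerting policies endpoint and uses Python’s requests library; the exact alert_type string and filter keys are assumptions inferred from the dashboard labels, so verify them against the current API documentation before relying on them.

```python
# Hypothetical sketch: creating an Origin Error Rate notification policy
# via Cloudflare's Notification (alerting) API using the `requests` library.
# The alert_type value and filter keys are assumptions -- verify against
# the current API docs before use.
import requests

ACCOUNT_ID = "your_account_id"   # placeholder
API_TOKEN = "your_api_token"     # placeholder

policy = {
    "name": "Origin Error Rate - production",
    "description": "Notify when origin 5xx errors burn the error budget too fast",
    "enabled": True,
    "alert_type": "http_alert_origin_error",   # assumed identifier
    "filters": {
        "zones": ["your_zone_id"],             # which Internet properties to monitor
        "slo": ["99.9"],                       # high sensitivity
    },
    "mechanisms": {
        "email": [{"id": "oncall@example.com"}],
    },
}

resp = requests.post(
    f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/alerting/v3/policies",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=policy,
)
resp.raise_for_status()
print(resp.json())
```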

This notification is available to all Enterprise customers. If you’re interested in monitoring your origin, we encourage you to give it a try.

Our team is hiring in Austin, Lisbon and London.