Post Syndicated from Kyle Brandt original http://blog.serverfault.com/2014/11/10/announcing-bosun/
Imagine if alerting was what you wanted it to be:
Every alert you received was actionable, and there were few false alerts
Notifications were actually informative
You received alerts in time to fix problems before they impacted your users
This isn’t the world we live in…
We accept lots of notifications from our alerting system that are not actionable
The notifications don’t tell us about the problem
We get paged when stuff is dead and not when it is sick
In order to resolve the dissonance between reality and what alerting should be we need:
An expressive way to evaluate alert conditions that isn’t a 1:1 mapping to the metrics
Alerts backed by time-series and not just recent values
A way to make rich notifications that include useful information
A way to iterate fast with alert design so that our alerts are continuously improved
A little less than a year ago, Matt Jibson and Kyle Brandt set out to create a system to solve this and other problems in monitoring; we call it Bosun. Our belief is that achieving excellence in alerting is a complex problem and requires a powerful and flexible platform to design alerts. Therefore, Bosun’s strategy is to provide a framework that enables the operator to create intelligent and informative alerts. We believe that you are smarter and more creative than any monitoring system can be when it comes to your environment.
In order to achieve that, at the highest level Bosun provides:
An expression language (a small domain-specific language) designed to allow for the creation of highly flexible and specific alerts
Notification templates that allow you to include whatever information you think is relevant
A web interface that provides a workflow for more rapid iteration when creating and improving alerts: Graph -> Expression -> Rule + Template -> Test the rule over history
The Expression Language
We believe that every alert requires action. An alert asks for your attention, and human attention and time are valuable assets. So alerting is about owning the operator’s attention. In practice, taking action on an alert means one of two things. If the alert was accurate, you fix the issue that triggered it. If the alert was a false positive, you tune the alert so that the same false positive won’t trigger it again. This is where things tend to fall down, because alert evaluations are often not powerful enough to be tuned. With Bosun’s expression language, you can tune alerts in the following ways:
Alert thresholds based on history vs static thresholds (or both combined)
Statistics functions: Min, Percentile, Median, Deviations, Forecasting. You can change the duration that these evaluate over (e.g. 5 minutes, 1 hour, or 1 week)
Scope-aware: How should components in your environment be grouped? By host, subsystem, cluster, or a combination of those things
Boolean conditions: The interaction of multiple components
These possibilities, when applied selectively by a skilled operator, provide ample ways to reduce alerting noise.
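To make the first few items concrete, here is a minimal sketch in Go of a threshold that combines history with a static floor, evaluated per host. It is not Bosun's expression language or its implementation; the function names, host names, and sample values are all hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// median returns the middle value of xs (the mean of the two middle
// values for even-length input). It returns NaN for an empty slice.
func median(xs []float64) float64 {
	if len(xs) == 0 {
		return math.NaN()
	}
	s := append([]float64(nil), xs...) // copy so the caller's slice is not reordered
	sort.Float64s(s)
	mid := len(s) / 2
	if len(s)%2 == 1 {
		return s[mid]
	}
	return (s[mid-1] + s[mid]) / 2
}

// meanStddev returns the mean and population standard deviation of xs.
// Callers must pass a non-empty slice.
func meanStddev(xs []float64) (mean, sd float64) {
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	for _, x := range xs {
		sd += (x - mean) * (x - mean)
	}
	return mean, math.Sqrt(sd / float64(len(xs)))
}

// shouldAlert combines a history-based threshold with a static floor:
// it fires only when the recent median is more than k deviations above
// the historical mean AND above the absolute floor.
func shouldAlert(recent, history []float64, k, floor float64) bool {
	if len(recent) == 0 || len(history) == 0 {
		return false
	}
	m := median(recent)
	mean, sd := meanStddev(history)
	return m > mean+k*sd && m > floor
}

func main() {
	// Hypothetical CPU samples (percent), scoped by host:
	// a longer history versus the last few minutes.
	history := map[string][]float64{
		"web01": {20, 22, 19, 21, 20, 23},
		"db01":  {70, 72, 69, 71, 70, 73},
	}
	recent := map[string][]float64{
		"web01": {60, 62, 61},
		"db01":  {74, 72, 75},
	}
	for _, host := range []string{"db01", "web01"} {
		fmt.Printf("%s alert=%v\n", host, shouldAlert(recent[host], history[host], 3, 50))
	}
}
```

With these made-up numbers, a static "CPU above 50%" rule alone would fire for db01 all the time, even though that load is normal for it. The history-aware version fires only for web01, whose recent load is far outside its own baseline.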
Once you have someone’s attention with a valid alert, you need to direct them to the problem as accurately as possible. Our notification templates use the Go template language, which means they can be quite flexible. Notifications in Bosun allow you to:
Include breakdowns of information related to your alert as embedded graphs, HTML tables, or whatever else you think makes sense
Include information that wasn’t directly related to the alert: e.g. CPU of a host even though it was a memory alert
Generate links to your dashboards or other sources of information
Include notes about why you created that alert, caveats, and other information the person being notified should be aware of
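To give a feel for what the Go template language allows, here is a minimal sketch using the standard text/template package. The Incident struct, its fields, the metric names, and the dashboard URL are hypothetical and invented for this example; they are not the data Bosun actually exposes to its templates.

```go
package main

import (
	"os"
	"strings"
	"text/template"
)

// Incident is a hypothetical data shape for this sketch only.
type Incident struct {
	Alert     string
	Host      string
	Value     float64
	Related   map[string]float64 // metrics not directly part of the alert
	Dashboard string
	Notes     string
}

// body is the notification template. Ranging over a map in
// text/template visits the keys in sorted order.
const body = `{{.Alert}} on {{.Host}}: {{printf "%.1f" .Value}}% free
Related metrics:
{{range $name, $v := .Related}}  {{$name}} = {{printf "%.1f" $v}}
{{end}}Dashboard: {{.Dashboard}}?host={{.Host}}
Notes: {{.Notes}}
`

// render executes the template against one incident.
func render(inc Incident) (string, error) {
	t, err := template.New("notification").Parse(body)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, inc); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := render(Incident{
		Alert:     "mem.low",
		Host:      "web01",
		Value:     4.2,
		Related:   map[string]float64{"cpu.percent": 87.5, "swap.percent": 63},
		Dashboard: "https://dashboards.example.com/host",
		Notes:     "Check for a leaking worker before restarting.",
	})
	if err != nil {
		panic(err)
	}
	os.Stdout.WriteString(out)
}
```

The point of the sketch is that the same template can carry the alerting value, unrelated context such as CPU, a generated link, and the author's notes in one message.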
One of the main issues with alerting is that there is so much friction to tuning alerts that it doesn’t get done. One of Bosun’s goals was to provide a faster iteration cycle for creating and tuning alerts by making the web interface an alerting IDE: Graphs in Bosun’s interface link to expressions, which then link to alert rules and templates. You can then test alerts before implementing; the results of a rule and template can be tested in the interface. You can test how they will behave currently, how they might have behaved at a past time, or generate a timeline of how they might have behaved over a range of time.
This means that your alert tuning doesn’t need to be purely reactive. You can test alert changes and see how and when they would have triggered over the past weeks (or longer, if you are patient). This results in less alert noise being sent to operators.
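The idea of testing a rule over a range of time can be sketched in a few lines of Go. This is only an illustration of the concept, not how Bosun implements it: the Point and Rule types, the avgAbove and backtest functions, and the sample data are all hypothetical.

```go
package main

import "fmt"

// Point is one sample in a time series (Unix seconds, value).
type Point struct {
	T int64
	V float64
}

// Rule decides whether to alert given the samples in the evaluation window.
type Rule func(window []Point) bool

// avgAbove returns a rule that fires when the window average exceeds limit.
func avgAbove(limit float64) Rule {
	return func(w []Point) bool {
		if len(w) == 0 {
			return false
		}
		sum := 0.0
		for _, p := range w {
			sum += p.V
		}
		return sum/float64(len(w)) > limit
	}
}

// backtest evaluates rule as if "now" were each step between from and to,
// looking back window seconds each time, and returns the times at which
// the rule would have fired.
func backtest(series []Point, rule Rule, from, to, step, window int64) []int64 {
	if step <= 0 {
		return nil
	}
	var fired []int64
	for now := from; now <= to; now += step {
		var w []Point
		for _, p := range series {
			if p.T > now-window && p.T <= now {
				w = append(w, p)
			}
		}
		if rule(w) {
			fired = append(fired, now)
		}
	}
	return fired
}

func main() {
	// Hypothetical one-sample-per-minute series with a short spike.
	vals := []float64{10, 12, 11, 95, 97, 96, 12, 11, 10, 9}
	series := make([]Point, len(vals))
	for i, v := range vals {
		series[i] = Point{T: int64(i) * 60, V: v}
	}
	// Compare two candidate thresholds over the same history.
	for _, limit := range []float64{50, 80} {
		fired := backtest(series, avgAbove(limit), 0, 540, 60, 180)
		fmt.Printf("limit=%v would have fired %d times: %v\n", limit, len(fired), fired)
	}
}
```

Replaying both candidate thresholds against the same history shows how many notifications each would have produced before either one is deployed.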
But wait! There’s more!
Bosun has also attempted to make some problems in monitoring easier:
Getting data into the system: our agent (called “scollector”) runs on Windows and Linux and starts sending data to Bosun
Applications can push metrics to the system via JSON API calls
Human maintenance: Properly designed alerts will apply to new systems, and services are auto-discovered by scollector. This means that most of the time you don’t have to remember to update your monitoring when new services and hosts are deployed (as long as scollector is pushed out via your build or configuration management process)
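For the point about applications pushing metrics via JSON, here is a rough sketch of what a client might look like in Go. It assumes an OpenTSDB-style data point (metric, timestamp, value, tags); the field names, the metric name, and the URL in the comment are assumptions made for illustration, so check the Bosun documentation for the actual endpoint and payload before relying on it.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// DataPoint is an assumed OpenTSDB-style JSON shape, not a confirmed Bosun type.
type DataPoint struct {
	Metric    string            `json:"metric"`
	Timestamp int64             `json:"timestamp"` // Unix seconds
	Value     float64           `json:"value"`
	Tags      map[string]string `json:"tags"`
}

// encode serializes a batch of data points as a JSON array.
func encode(points []DataPoint) ([]byte, error) {
	return json.Marshal(points)
}

// push POSTs the batch to url and treats any non-2xx status as an error.
func push(client *http.Client, url string, points []DataPoint) error {
	body, err := encode(points)
	if err != nil {
		return err
	}
	resp, err := client.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
		return fmt.Errorf("push failed: %s", resp.Status)
	}
	return nil
}

func main() {
	points := []DataPoint{{
		Metric:    "myapp.requests", // hypothetical metric name
		Timestamp: time.Now().Unix(),
		Value:     42,
		Tags:      map[string]string{"host": "web01"},
	}}
	body, err := encode(points)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
	// To actually send the batch, point push at your own server
	// (hypothetical URL):
	// err = push(http.DefaultClient, "http://bosun.example.com/api/put", points)
}
```

The sketch only prints the payload by default so that it runs without a server; the network call is left commented out.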
We hope you go try this out. We have a Docker image that has everything you need; just follow the getting started guide. We hope Bosun is useful to the community. We need your creativity and ideas to continue to grow it (and some contributors would be nice too!). We owe a special thanks to everyone else at Stack Exchange for:
Contributing to scollector – Greg Bray has been working hard to fill out our Windows metrics, and Sam Torno did the same for Linux
Getting a docker build – Peter Grace (who also did a lot of the dogfooding)
Manning the front lines to keep the site up while we built this – the rest of the SRE Team
Feature ideas and monitoring concepts – Tom Limoncelli and his monitoring chapters in The Practice of Cloud System Administration
Letting Matt and me go tilting at windmills – Stack Exchange, Inc.