Announcing Bosun

Post Syndicated from Kyle Brandt original http://blog.serverfault.com/2014/11/10/announcing-bosun/

Imagine if alerting was what you wanted it to be:

Every alert you received was actionable, and there were few false alerts
Notifications were actually informative
You received alerts in time to fix problems before they impacted your users

This isn’t the world we live in…

We accept lots of notifications from our alerting system that are not actionable
The notifications don’t tell us about the problem
We get paged when stuff is dead and not when it is sick

In order to resolve the dissonance between reality and what alerting should be we need:

An expressive way to evaluate alert conditions that isn’t a 1:1 mapping to the metrics
Alerts backed by time-series and not just recent values
A way to make rich notifications that include useful information
A way to iterate fast with alert design so that our alerts are continuously improved

A little less than a year ago, Matt Jibson and Kyle Brandt set out to create a system to solve this and other problems in monitoring; we call it Bosun. Our belief is that achieving excellence in alerting is a complex problem and requires a powerful and flexible platform to design alerts. Therefore, Bosun’s strategy is to provide a framework that enables the operator to create intelligent and informative alerts. We believe that you are smarter and more creative than any monitoring system can be when it comes to your environment.

In order to achieve that, at the highest level Bosun provides:

An expression language (a small domain-specific language) designed to allow for the creation of highly flexible and specific alerts
Notification templates that allow you to include whatever information you think is relevant
A web interface that provides a workflow for faster iteration when creating and improving alerts: Graph -> Expression -> Rule + Template -> Test the rule over history


The Expression Language

We believe that every alert requires action. An alert asks for your attention, and human attention and time are valuable assets. So alerting is about owning the operator’s attention. Taking action on an alert practically means one of two things. If the alert was accurate, you fix the issue that triggered it. If the alert was a false positive, the alert should be tuned so that the false positive no longer triggers it. This is where things tend to fall down, because alert evaluations are not powerful enough to be tuned. With Bosun’s expression language, you can tune alerts in the following ways:

Alert thresholds based on history vs static thresholds (or both combined)
Statistics functions: Min, Percentile, Median, Deviations, Forecasting. You can change the duration that these evaluate over (e.g. 5 minutes, 1 hour, 1 week)
Scope-aware: How should components in your environment be grouped? By host, subsystem, cluster, or a combination of those things
Boolean conditions: The interaction of multiple components

These possibilities, when applied selectively by a skilled operator, provide ample ways to reduce alerting noise.
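
As a rough illustration of combining a history-based threshold with a static one, here is a minimal Python sketch. It is not Bosun’s expression language; the function name, the sample data, and the choice of "median plus a multiple of the standard deviation" are all assumptions made for the example.

```python
from statistics import median, pstdev


def should_alert(history, current, static_limit, deviations=3.0):
    """Return True when `current` is unusual compared with history
    and also above a fixed floor.

    history      -- past values of the metric (hypothetical sample data)
    current      -- the latest value
    static_limit -- a fixed threshold that must also be exceeded
    deviations   -- standard deviations above the median that count as unusual
    """
    baseline = median(history)
    spread = pstdev(history)
    dynamic_limit = baseline + deviations * spread
    # Both conditions must hold: unusual for this host AND high in absolute terms.
    return current > dynamic_limit and current > static_limit


# Hypothetical CPU percentages for one host.
cpu_history = [20, 22, 19, 21, 23, 20, 22, 21]
```

With this data, a reading of 30 is unusual for the host but stays below a static limit of 80, so it would not alert; a reading of 95 satisfies both conditions.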

Notification Templates

Once you have someone’s attention with a valid alert, you need to direct them to the problem as accurately as possible. Our notification templates use the Go template language, which means they can be quite flexible. Notifications in Bosun allow you to:

Include breakdowns of information related to your alert as embedded graphs, html tables, or whatever else you think makes sense
Include information that wasn’t directly related to the alert: e.g. CPU of a host even though it was a memory alert
Generate links to your dashboards or other sources of information
Include notes about why you created that alert, caveats, and other information the person being notified should be aware of
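
To make the idea concrete, here is a small sketch of a notification that carries related information, a dashboard link, and a note. Bosun itself uses Go templates; this example uses Python’s `string.Template` purely as a stand-in, and every field name, value, and URL in it is hypothetical.

```python
from string import Template

# A stand-in for a notification template; not Bosun's template syntax.
NOTIFICATION = Template(
    "Subject: $alert_name on $host\n"
    "\n"
    "Memory used: $mem_pct% (threshold $threshold%)\n"
    "CPU used:    $cpu_pct%\n"
    "Dashboard:   $dashboard_url\n"
    "Notes:       $notes\n"
)


def render_notification(context):
    """Fill the template; raises KeyError if a field is missing."""
    return NOTIFICATION.substitute(context)


# Hypothetical alert context, including CPU even though this is a memory alert.
message = render_notification({
    "alert_name": "high memory",
    "host": "web01",
    "mem_pct": 93,
    "threshold": 90,
    "cpu_pct": 35,
    "dashboard_url": "https://dashboards.example.com/host/web01",
    "notes": "Created after a cache leak; check the cache size first.",
})
```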

The Workflow

One of the main issues with alerting is that there is so much friction to tuning alerts that it doesn’t get done. One of Bosun’s goals was to provide a faster iteration cycle for creating and tuning alerts by making the web interface an alerting IDE: Graphs in Bosun’s interface link to expressions, which then link to alert rules and templates. You can then test alerts before implementing; the results of a rule and template can be tested in the interface. You can test how they will behave currently, how they might have behaved at a past time, or generate a timeline of how they might have behaved over a range of time.

This means that your alert tuning doesn’t need to be totally reactionary. You can test alert changes and see how and when they would have triggered over the past weeks (or longer, if you are patient). This results in less alert noise being sent to operators.
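
The idea of testing a rule over history can be sketched in a few lines of Python. This is only an illustration of the concept, not how Bosun implements it; the function names, the rules, and the sample data are assumptions.

```python
def backtest(samples, rule, window):
    """Replay `rule` over historical samples and report when it would have fired.

    samples -- list of (timestamp, value) pairs, oldest first
    rule    -- function taking a list of recent values, returning True to alert
    window  -- how many trailing samples the rule sees at each step
    """
    fired = []
    for end in range(window, len(samples) + 1):
        recent = [value for _, value in samples[end - window:end]]
        if rule(recent):
            # Record the timestamp of the newest sample in the window.
            fired.append(samples[end - 1][0])
    return fired


def latest_above_90(values):
    """Noisy rule: alert whenever the newest value is above 90."""
    return values[-1] > 90


def min_above_90(values):
    """Tuned rule: alert only if every value in the window is above 90."""
    return min(values) > 90


# Hypothetical (timestamp, value) history with short spikes and one sustained rise.
history = [(0, 50), (1, 95), (2, 40), (3, 92), (4, 96), (5, 97), (6, 60)]

noisy = backtest(history, latest_above_90, window=1)
tuned = backtest(history, min_above_90, window=3)
```

Replaying both rules over the same history shows the noisy rule firing at timestamps 1, 3, 4, and 5, while the tuned rule fires once, at timestamp 5, when the value has stayed high for three samples.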

But wait! There’s more!

Bosun has also attempted to make some problems in monitoring easier:

Getting data into the system: our agent (called “scollector”) runs on Windows and Linux and starts sending data to Bosun
Applications can push metrics to the system via JSON API calls
Human maintenance: Properly designed alerts will apply to new systems, and services are auto-discovered by scollector. This means that most of the time you don’t have to remember to update your monitoring when new services and hosts are deployed (as long as scollector is pushed out via your build or configuration management process)
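
For applications that push their own metrics, a datapoint might be built along these lines. The post does not describe the payload, so the field names below (an OpenTSDB-style layout of metric, timestamp, value, and tags) and the metric name are assumptions; check the Bosun documentation for the actual API and endpoint before sending anything.

```python
import json
import time


def build_datapoint(metric, value, tags, timestamp=None):
    """Build one datapoint as a plain dict (assumed OpenTSDB-style layout)."""
    if timestamp is None:
        timestamp = int(time.time())
    return {"metric": metric, "timestamp": timestamp, "value": value, "tags": tags}


def encode_batch(datapoints):
    """Serialize a list of datapoints to UTF-8 JSON, ready for an HTTP POST body."""
    return json.dumps(datapoints).encode("utf-8")


# Hypothetical metric name and tags.
payload = encode_batch([
    build_datapoint("myapp.requests", 42, {"host": "web01"}, timestamp=1415577600),
])
```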

We hope you go try this out. We have a docker image that has everything you need—just follow the getting started guide. We hope Bosun is useful to the community. We need your creativity and ideas to continue to grow it (and some contributors would be nice too!). We owe a special thanks to everyone else at Stack Exchange for:

Contributing to scollector – Greg Bray has been working hard to fill out our Windows metrics, and Sam Torno did the same for Linux
Getting a docker build – Peter Grace (who also did a lot of the dogfooding)
Manning the front lines to keep the site up while we built this – the rest of the SRE Team
Feature ideas and monitoring concepts – Tom Limoncelli and his monitoring chapters in The Practice of Cloud System Administration
Letting Matt and me go tilting at windmills – Stack Exchange, Inc.

Homegrown DevOps Tools at Stack Exchange

Post Syndicated from Kyle Brandt original http://blog.serverfault.com/2013/09/05/homegrown-devops-tools-at-stack-exchange/

A lot of tools available in IT/Sysadmin/Ops/DevOps are disappointing:

They don’t fit your environment. They lack features or are designed for a different sort of environment (e.g. cloud vs hardware, Linux vs Windows, distributed vs centralized etc)
You can’t interact with them programmatically
They cost too much
They are not customizable enough, or require too much customization to get off the ground
They feel kludgy, unreliable, outdated, or like the programmers were stoned
They don’t fit with your company’s culture (e.g. Enterprise vs Agile)

In short a lot of stuff is too expensive, isn’t a good fit, or is simply bad software. This ends up leaving an ops team with two options. They can whine about it, or create their own tools. So at Stack Exchange we build our own DevOps tools.


Nick Craver’s baby, which we just call “Status”, is at first glance a monitoring dashboard, but is essentially a collection of tools that fill various needs:

An overview of CPU, memory, and network utilization for all our servers, as well as a detailed view. Built with responsive and interactive D3 graphs as well as sparklines, it helps compensate for SolarWinds’ terrible interface.
SQL Server monitoring. SQL’s built-in clustering views are deeply flawed. If a node loses connectivity, it stops updating the remote nodes’ status, so it could show everything as connected and fine even if there is no connectivity. We also get to see the most expensive queries, active queries utilizing whoisactive, current connections, and which DBs are on which server
HAProxy Monitoring and Administration: With multiple instances of HAProxy we needed a single view instead of HAProxy’s built-in display. Also, this gave us a nice web interface to take servers out of rotation
Redis: A nice presentation of Redis Info across all instances and all servers. Also a display that shows what is slaved to what at a quick glance
Elastic Search: Health overview of our clusters (as well as index and shard data)
A dashboard of all the exceptions generated by our applications

Status is a C# / .NET app. It polls data from various sources – sometimes the system directly, and other times it gets it from Orion. There is a lot more to Status that makes it awesome. The real accomplishment is that Status enables us to see the general health of our main infrastructure at a glance.

Web Logging

If your business is creating and running websites, your web logs are gold. We use the logs generated by our load balancer, HAProxy, as our canonical web logs. In their raw text format, web logs are often not that useful (this is particularly true with over 100 million records a day). However, we parse and structure our web logs in a few different ways:


We have a C# service that Jarrod Dixon wrote that inserts them into SQL so we can query them. To query them we use an instance of Data Explorer and SQL Server Management Studio, and we also have certain lookups directly from our sites
We display realtime graphs of various log information with Realog, a system I created with Go, Redis, and NVD3.js so we could view activity live without having to write queries

One of the interesting things we do with our web logs is to add extra information by adding headers inside the app and stripping them from the response at HAProxy. For example, we capture how many Redis and SQL queries were involved in that request and how long they took.
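
As a sketch of how such captured values could be pulled back out of a log line: HAProxy writes captured headers inside curly braces, separated by `|`. The field names, their order, and the sample line below are hypothetical, since the post does not show the actual capture configuration or log format.

```python
import re

# Hypothetical order of the captured response headers; the real names and
# order depend on the HAProxy configuration, which the post does not show.
CAPTURED_FIELDS = ("sql_count", "sql_ms", "redis_count", "redis_ms")

BRACES = re.compile(r"\{([^{}]*)\}")


def parse_captured(line):
    """Return the captured header values from one log line as a dict.

    Uses the last {...} group on the line, assumed here to hold the
    captured response headers. Missing or non-numeric values become None.
    """
    groups = BRACES.findall(line)
    if not groups:
        return {}
    values = groups[-1].split("|")
    parsed = {}
    for name, raw in zip(CAPTURED_FIELDS, values):
        parsed[name] = int(raw) if raw.isdigit() else None
    return parsed


# A made-up log line for illustration only.
sample = (
    "10.0.0.1:51234 [05/Sep/2013:10:00:00.123] http-in web/web01 "
    '0/0/1/42/43 200 5120 - - ---- 10/10/2/1/0 0/0 {3|12|5|2} '
    '"GET /questions HTTP/1.1"'
)
```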

Patch Dashboard

OS updates can be a bit tedious, even more so in a mixed Windows and Linux environment. Steven Murawski and George Beech created a dashboard that allows us to:

View the outstanding patches and patch count for both Linux and Windows
Trigger updates on either Linux or Windows
Schedule time frames for automatic Linux updates

What’s Next

If you want to learn more about these tools and DevOps at Stack Exchange, come see George, Nick, and Steven present “Building for Operations” at Velocity.

Keeping all this stuff to ourselves feels a bit greedy. However, for something open sourced to be very useful it usually needs to be made a bit more generic, which takes time. We also want to build a lot more. Our inventory system, Racktables, lacks an API, so we need a new one or a way to extend it. We want to build our own monitoring system (likely on top of OpenTSDB). In order to create more and open source it, we need help. So we are looking for a full-time developer with ops experience to join our SRE team. If you are awesome, want to build awesome ops stuff and open source it, come join us!

MonitorControls – Utilities for monitor management on Windows

Post Syndicated from Laurie Denness original https://laur.ie/blog/2010/11/monitorcontrols-utilties-for-monitor-management-on-windows/

When I ended up using Windows to power the overhead information screens at Last.fm, I lost the ability to have a one-line crontab entry that put the monitors into DPMS standby (and woke them up) when we’re in and out of office hours. It makes no sense to waste power but, more importantly, leaving the screens on when the office is empty shortens their life.

I didn’t think I would have any issue finding a utility to place the screens into standby mode. I didn’t, but unfortunately the ones I found were either not free, massively complicated, or simply didn’t work.

So I found a code snippet online, fired up a copy of Visual Studio and compiled two exe files: MonitorOn.exe and MonitorOff.exe. MonitorOff sends a signal to all attached monitors on the system to go into sleep mode, and if you move the mouse you can wake them up as normal. Or you can run MonitorOn, which will send the wake signal manually. Simply place these into the Windows Task Scheduler, and you have a simple, effective way to manage your information screens.
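
The original snippet is not shown here, so the following is only a guess at the general approach: one common way to do this on Windows is to broadcast a `WM_SYSCOMMAND` message with `SC_MONITORPOWER`. The Python sketch below uses `ctypes` for that; the function name is made up, and it does nothing on other platforms.

```python
import sys

# Standard Win32 constants for the monitor power system command.
HWND_BROADCAST = 0xFFFF
WM_SYSCOMMAND = 0x0112
SC_MONITORPOWER = 0xF170
MONITOR_ON = -1   # lParam value asking the display to power on
MONITOR_OFF = 2   # lParam value asking the display to power off


def set_monitor_power(state):
    """Ask all monitors to change power state. Returns False off Windows."""
    if sys.platform != "win32":
        return False
    import ctypes
    from ctypes import wintypes

    send_message = ctypes.windll.user32.SendMessageW
    send_message.argtypes = [wintypes.HWND, wintypes.UINT,
                             wintypes.WPARAM, wintypes.LPARAM]
    send_message(HWND_BROADCAST, WM_SYSCOMMAND, SC_MONITORPOWER, state)
    return True
```

Calling `set_monitor_power(MONITOR_OFF)` from a scheduled script would play the role of MonitorOff.exe, and `set_monitor_power(MONITOR_ON)` the role of MonitorOn.exe, though waking displays this way may not work on every system.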

You can download MonitorOn and MonitorOff here.

Leaving Last.fm

Post Syndicated from Laurie Denness original https://laur.ie/blog/2010/09/leaving-last-fm/

I’ve spent 3.43 years at Last.fm, which seems almost like a lifetime. For a long time, I couldn’t ever imagine leaving; every morning I would wake up excited to go and face new challenges and do fascinating new things. In the last 6-12 months so much has changed, as Last.fm gradually slips out of being a startup to being a company that, for better or for worse, has to make some money. I will certainly think twice before working for a company that has anything to do with the music industry… it’s a pain of a situation.

I’ve babysat the wonderful creation that is Last.fm through launches (both expected and unexpected), crashes (always unexpected), overheatings (and break-ins, and power failures… All the kind of thing that should never happen to a datacentre) and plenty of blood, sweat and tears.

It’s been an amazing experience, working with some of the most amazing people I have ever met (some of whom have come and gone), but it’s time for me to help another startup through getting up at 4am to fix databases and exciting scaling questions.

And that will be Etsy: another website that has an awesome product that I love, plenty of traffic and graphs that point upwards, and a bunch of guys who are passionate and have an awesome method of working. I’m really excited about getting involved and learning things again, as well as enabling a different group of passionate users to go about their day-to-day business. I’ll still be in London, but popping to NY on occasion.

Let’s hope the next 3.43 years will be just as exciting.