Post Syndicated from Erick Galinkin original https://blog.rapid7.com/2022/01/18/2022-planning-metrics-that-matter-and-curtailing-the-cobra-effect/
During the British rule of India, the British government became concerned about the number of cobras in the city of Delhi. The ambitious bureaucrats came up with what they thought was the perfect solution, and they issued a bounty for cobra skins. The plan worked wonderfully at first, as cobra skins poured in and reports of cobras in Delhi declined.
However, it wasn’t long before some of the Indian people began breeding these snakes for their lucrative scales. Once the British discovered this scheme, they immediately cancelled the bounty program, and the Indian snake farmers promptly released their now-worthless cobras into the wild.
Now, the cobra conundrum was even worse than before the bounty was offered, giving rise to the term “the cobra effect.” Later, the economist Charles Goodhart coined the closely related Goodhart’s Law, widely paraphrased as, “When a measure becomes a target, it ceases to be a good measure.”
Creating metrics in cybersecurity is hard enough, but creating metrics that matter is a harder challenge still. Any business-minded person can tell you that effective metrics (in any field) need to meet these 5 criteria:
- Cheap to create
- Consistently measured
- Significant to someone
- Tied to a business need
If your proposed metrics don’t meet any one of the above criteria, you are setting yourself up for a fantastic failure. Yet if they do meet those criteria, you aren’t totally out of the woods yet. You must still avoid the cobra effect.
A case study
I’d like to take a moment to recount a story from one of the more effective security operations centers (SOCs) I’ve had the pleasure of working with. They had a quite well-oiled 24/7 operation going. There was a dedicated team of data scientists who spent their time writing custom tooling and detections, as well as a wholly separate team of traditional SOC analysts, who were responsible for responding to the generated alerts. The data scientists were judged by the number of new threat detections they were able to come up with. The analysts were judged by the number of alerts they were able to triage, and they were bound by a (rather rapid) service-level agreement (SLA).
This largely worked well, with one fairly substantial caveat. The team of analysts had to sign off on any new detection that entered the production alerting system. These analysts, however, were largely motivated by being able to triage a new issue quickly.
I’m not here to say that I believe they were doing anything morally ambiguous, but the organizational incentive encouraged them to accept detections that could quickly and easily be marked as false positives and reject detections that took more time to investigate, even if they were more densely populated with true positives. The end effect was a system structured to create a success condition that was a massive number of false-positive alerts that could be quickly clicked away.
Avoiding common pitfalls
The most common metrics used by SOCs are number of issues closed and mean time to close.
While I personally am not in love with these particular quantifiers, there is a very obvious reason these are the go-to data points. They very easily fit all 5 criteria listed above. But on their own, they can lead you down a path of negative incentivization.
So how can we take metrics like this, and make them effective? Ideally, we could use these in conjunction with some analysis on false/true positivity rate to arrive at an efficacy rate that will maximize your true positive detections per dollar.
Arriving at efficacy
Before we get started, let’s make some assumptions. We are going to talk about SOC alerts that must be responded to by a human being. The ideal state is for high-fidelity alerting with automated response, but there is always a state where human intervention is necessary to make a disposition. We are also going to assume that there are a variety of types of detections that have different false-positive and true-positive rates, and for the sheer sake of brevity, we are going to pretend that false negatives incur no cost (an obvious absurdity, but my college physics professor taught me that this is fine for demonstration purposes). We are also going to assume, safely, I imagine, that reviewing these alerts takes time and that time incurs a dollars-and-cents cost.
For any alert type, you would want to establish the number of expected true positives, which is the alert rate multiplied by the true-positive rate (which you must be somehow tracking, by the way). This will give you the expected number of true positives over the alert rate period.
Great! So we know how many true positives to expect in a big bucket of alerts. Now what? Well, we need to know how much it costs to look through the alerts in that bucket! Take the alert rate, multiply by the alert review time, and if you are feeling up to it, multiply by the cost of the manpower, and you’ll arrive at the expected cost to review all the alerts in that bucket.
But the real question you want to know is, is the juice worth the squeeze? The detection efficacy will tell you the cost of each true positive and can be calculated by dividing the number of expected true positives by the expected cost. Or to simplify the whole process, divine the true-positive rate by the average alert review time, and multiply by the manpower cost.
If you capture detection efficacy this way, you can effectively discover which detections are costing you the most and which are most effective.
Dragging down distributions
Another important option to consider is the use of distributions in your metric calculation. We all remember mean, median, and mode from grade school — these and other statistics are tools we can use to tell us how effective we are. In particular, we want to ask whether our measure should be sensitive to outliers — data points that don’t look typical. We should also consider whether our mean and median are being dragged down by our distribution.
As a quick numerical example, assume we have 100 alerts come in, and we bulk-close 75 of them based on some heuristic. The other 25 alerts are all reviewed, taking 15 minutes each, and handed off as true positives. Then our median time to close is 0 minutes, and our mean time to close is 3 minutes and 45 seconds.
Those numbers are great, right? Well, not exactly. They tell us what “normal” looks like but give us no insight into what is actually happening.
To that end, we have two options. Firstly, we can remove zero values from our data! This is typical in data science as a way to clean data, since in most cases, zeros are values that are either useless or erroneous. This gives us a better idea of what “normal” looks like.
Second, we can use a value like the upper quartile to see that the 75th-percentile time to close is 15 minutes, which in this case is a much more representative example of how long an analyst would expect to spend on events. In particular, it’s easy to drag down the average — just close false positives quickly! But it’s much harder to drag down the upper quartile without making some real improvements.
3 keys to keep in mind
When creating metrics for your security program, there are a lot of available options. When choosing your metrics, there are a few keys:
- Watch out for the cobra effect. Your metrics should describe something meaningful, but they should be hard to game. When in doubt, remember Goodhart’s Law — if it’s a target, it’s not a good metric.
- Remember efficacy. In general, we are aiming to get high-quality responses to alerts that require human expertise. To that point, we want our analysts to be as efficient and our detections to be as effective as possible. Efficacy gives us a powerful metric that is tailor-made for this purpose.
- When in doubt, spread it out. A single number is rarely able to give a truly representative measure of what is happening in your environment. However, having two or more metrics — such as mean time to response and upper-quartile time to response — can make those metrics more robust to outliers and against being gamed, ensuring you get better information.