Tom Webster is a seasoned SRE with experience at Dropbox and AWS.
A lot of businesses strive to “simplify” how they think about incident impact through incident metrics. However, organizations often overemphasize what they call “lost revenue.” This is a bad idea and should be avoided at all costs.
Leadership teams want a simple metric to determine how “bad” an incident is for their business. One of the easier choices is something based on revenue—either how much money was directly lost from missed sales opportunities during the system outage (more accurate) or how much money was possibly lost while the system was impacted (less accurate). Depending on your business, this can range from a straightforward calculation to a source of endless frustration. In all cases, though, this metric—hereafter referred to as Big Number—is deceptive and dangerous.
Big Number is not only unhelpful; it is actively harmful and may be nearly impossible to get rid of once implemented. This is because Big Number is addictive. It’s easy to put an arbitrary number on an incident and say, “This is how bad this was.” It’s extremely appealing to have one metric to point to when assessing any large-scale outage. Big Number is easy to reason about, easy to track, and easy to plug into a spreadsheet. It doesn’t require critical thought or a deep dive into the nuances of large-scale incidents. It doesn’t compel you to consider the myriad side effects caused by outages or other high severity incidents. It’s easy to ignore everything else and trust Big Number.
Here’s the problem, though: Big Number is a lie.
The Direct Cost of Big Number
First, let’s talk about the direct cost of Big Number: How do you calculate it in the first place?
If your business model is relatively simple, assigning a number to an incident could be as straightforward as identifying how many customers tried to use your service and calculating the amount of missed revenue due to the outage. If your company provides SLA-breach credits, that’s an easy Big Number to attach to incidents.
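For the simple case described above, the arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not a recommended practice: the `Outage` fields, the `big_number` helper, and every figure are hypothetical, and the sketch assumes each failed purchase attempt would otherwise have produced one average-sized sale.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    # All fields are hypothetical inputs chosen for illustration.
    failed_purchase_attempts: int   # customers who tried to buy and could not
    average_order_value: float      # typical revenue per completed purchase
    sla_credits_owed: float = 0.0   # credits paid out for breaching an SLA


def big_number(outage: Outage) -> float:
    """Naive 'lost revenue' estimate: missed sales plus SLA credits."""
    missed_sales = outage.failed_purchase_attempts * outage.average_order_value
    return missed_sales + outage.sla_credits_owed


# Made-up example values.
checkout_outage = Outage(
    failed_purchase_attempts=1200,
    average_order_value=40.0,
    sla_credits_owed=5000.0,
)
print(big_number(checkout_outage))  # 53000.0
```

The ease of this calculation is exactly what makes the resulting number tempting: it looks precise even though the assumption behind it (every failed attempt was a lost sale) is rarely checked.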
Unfortunately, most businesses have more complex revenue flows. In these cases, you may need finance or data science teams to extrapolate and estimate how much money could have been lost during the outage. This takes time.
If calculating this metric is required to complete your post-incident analysis, the wait for that calculation can artificially inflate other incident metrics (any that measure how long the analysis takes, for example) and strain the engineers responsible for those analyses. Remember: Every step you add to your incident response playbook adds time. More time and complexity mean people are less inclined to engage honestly with the process. If the process becomes too burdensome or the metric collection too tedious, engineers are more likely to rush through it just to “get it over with.” Your job as the steward of the Incident Management process is to keep it seamless to maximize its value.
The Insidious Indirect Cost
While the direct cost of Big Number may be painful for some businesses, the truly dangerous consequences are universal.
We’ve already highlighted the most dangerous part: Big Number is addictive because it’s easy. It’s “just money,” right? But there’s far more to incident costs than lost revenue. Big Number ignores critical costs that are hard to quantify.
Big Number ignores the cost of negative user experiences. An outage might have impacted a “low-cost” feature, such as a sub-feature of a paid product—or even a free feature. Where should that fall on your priority list? According to Big Number, it doesn’t matter much because it doesn’t drive revenue. However, users will remember the times your product didn’t work for them. Over time, this erodes trust and leads to customer attrition.
Big Number ignores the cost of negative marketing. If you only focus on incidents affecting revenue-generating features, people will notice. Customers will stop using your product because they see you only care about certain features. If the feature they rely on isn’t a priority for you, they’ll know. Some large companies today face exactly this perception problem—parts of their business are well-supported, while others are visibly neglected. This makes it harder for them to launch new products because people assume they’ll abandon them. Over-reliance on Big Number can undermine your company’s reputation and future.
Big Number ignores free users. Suppose an incident affects only Free-Tier or Free-Trial customers. No revenue was lost and there are no SLA guarantees to pay out, so why care? Big Number doesn’t care, but you should. Free users are vital to companies that rely on free tiers or trial subscriptions to drive conversions. At those companies, most paid users started as free users. If these free users have a bad experience, and you deprioritize the incident because “nothing was lost,” you’re jeopardizing future growth.
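To make the blind spot concrete, here is a small Python sketch that ranks two incidents by lost revenue and then by number of affected users. The `Incident` type, the incident names, and all figures are invented for illustration. Affected-user count appears only as a contrast; it is not offered as a replacement single metric.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical record; real incident data will look different.
    name: str
    lost_revenue: float   # the "Big Number" for this incident
    affected_users: int   # how many users hit the problem


# Made-up incidents: one touches paying customers, one touches only free users.
incidents = [
    Incident("paid checkout errors", lost_revenue=20000.0, affected_users=500),
    Incident("free-tier signup outage", lost_revenue=0.0, affected_users=40000),
]

# Ranking by Big Number pushes the free-tier outage to the bottom.
by_big_number = sorted(incidents, key=lambda i: i.lost_revenue, reverse=True)

# Ranking by affected users tells the opposite story.
by_affected_users = sorted(incidents, key=lambda i: i.affected_users, reverse=True)

print([i.name for i in by_big_number])
print([i.name for i in by_affected_users])
```

The two orderings disagree, and the revenue ordering assigns a weight of zero to the outage that reached the most people. That is the sense in which Big Number “doesn’t care” about free users.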
What Should You Do?
Divorce Incident Management from finance. Large-scale problems cannot be addressed based on financial factors alone. If there’s a problem, it must be fixed—full stop. Financial metrics should not be part of your Incident Management process because they muddy the waters and detract from its purpose: solving problems, learning from them, and preventing future incidents.
That’s not to say Big Number has no place in your organization. It’s fine to collect and analyze this data—but do it outside your Incident Management process. Assign it to finance or data science teams, but don’t integrate it into your standard incident workflow or post-incident analysis.
The biggest danger of Big Number is that it can infiltrate your triage process because it’s simple and addictive. People don’t need to think too hard about “dollars lost,” so it becomes a pseudo-scientific way to gauge the impact of outages. This metric is ultimately misleading and self-defeating for effective Incident Management.