Should I buy or build incident management software?

TL;DR

You should use professional, outsourced incident management software if you:

Have a home-grown system, but it’s not core to your business

Are growing, either in scale or in the number of incidents you experience

Need to manage your costs so they are reliable and predictable

Introduction

Everyone knows incident management is hard. Dealing with a complicated service when it’s gone down is like being stuck in a collapsed mineshaft: everyone’s screaming in the dark, no-one knows what’s really going on. Cue frantic digging, trying to restore service as quickly as possible. After a few of these accidents, it’s natural to start to wonder if doing incident management really has to involve so much darkness and screaming.

What happens next often depends on who’s responsible for incident management in the org, and what their background is. If they’ve got a PM background, for example, they might start looking at improving their incident processes, figuring that the difficulties really stem from the chaos of trying to run an incident call. They might pick up a copy of the SRE book(s), or start looking at incident management protocols used by firefighters, or institute professionalised on-call rotations to cover what was previously handled by a combination of luck and “everybody knows only Chris can fix this.” (Someone should make sure to tell Chris.)

If they’ve got a DevOps or an SRE background on the other hand, they might instinctually turn to software to fix their problems. Of course, software could be used in a thousand ways to help incident management. Today we see everything from home directory scripts that grep through a bunch of awkwardly placed log files, to sophisticated ML models looking for unknown patterns across terabytes of data. For incident management, though, most people end up looking for software support in actually conducting an incident.

This context is a little unusual: not only does the system have to do the same data capture, transformation, storage, and transmission as every other online system, it also has to be super reliable. If the system you’re using to help fix the broken things is itself broken, that’s just a surefire recipe for more darkness and screaming. It’s also vitally important for it not to get in the way: IM software should speed things up, not enmesh one in a web of box-ticking bureaucracy.

Though the requirements are a little unusual for normal online software, one of the main decisions you need to make is extremely traditional: that is, build-versus-buy.

Compare and Contrast

Build-versus-buy is a question lots of companies ask themselves all the time, and the set of things you have to think about is pretty well understood. Let’s look at the main concerns now.

Is this my core business?

As with most things in the world of software, this can get quite complicated, and there’s a lot of detail to understanding it fully, but the basics are fairly straightforward. Basically, going with a commercial SaaS provider gives you bounded but guaranteed recurring costs, and going in-house gives you the opposite. To be clear, it might well be cheaper in-house - particularly if the developer salaries are being paid in some other way - but it might also turn out to be a lot more expensive, particularly for maintenance.

The best outcome, from a cost-savings perspective, is doing it yourself. But the circumstances in which that happens are pretty rare: it means that you can scrape together the resources across the org for a quick-and-dirty project that ends up not needing much care and feeding. Yes, it happens, but not very often. The worst outcome is thinking you can do that, and discovering, expensively, that you can’t.

Ultimately, maintaining a non-core system is always going to be a headache; for example, if an engineer paid $250k a year spends 3 months building it and then 25% of their time maintaining it, that’s $62.5k up front and another $62.5k every year, without adding product, project, or other management costs, never mind the unanticipated surprises that never happen in software development (*coughs gently*).
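To make the arithmetic explicit, here is a minimal back-of-envelope sketch; the salary, build time, and maintenance fraction are just the illustrative figures above, not benchmarks, so substitute your own numbers:

# Back-of-envelope cost of building an IM tool in-house (illustrative figures only).
annual_salary = 250_000          # fully-loaded engineer cost per year (assumption)
build_months = 3                 # one-off time to build the tool (assumption)
maintenance_fraction = 0.25      # ongoing share of that engineer's time (assumption)

build_cost = annual_salary * build_months / 12             # $62,500 one-off
annual_maintenance = annual_salary * maintenance_fraction   # $62,500 per year

print(f"Build cost: ${build_cost:,.0f}")
print(f"Yearly maintenance: ${annual_maintenance:,.0f}")
# Compare the yearly maintenance figure against a SaaS quote; management
# overhead and surprise rework are deliberately left out of this sketch.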

Answer: in the best possible case, it can be cheaper to do internally, but outsourcing gives you useful cost certainty.

What are the risks?

In general, the risks when you build in-house centre on a failure to execute, which we often see with non-core systems. (This is especially risky if the IM developer(s) leave at short notice.) The risks when you outsource are typically either a product misfit, or business continuity risks on the provider side. That is, the provider could run out of money, have a significant technical problem, get sued out of existence, etc. A product misfit is of course much less existential: the product might not end up doing the things you want, or the provider might take the tool in a direction you don’t agree with.

A full accounting for all these risks is tricky, but ultimately you are balancing the (assumed) increased control you have over things in your local scope against the external risks in the SaaS provider, coupled with their obvious motivation to make their core product a success.

Answer: could go either way, depending very much on the client situation, though for the small company for whom IM systems are non-core, outsourcing probably edges it.

What is the reliability of in-sourcing versus out-sourcing?

Here is one concern which is probably weighted strongly towards out-sourcing. Having the incident management software running on systems which could themselves be taken out by the incident is a very serious problem. Against that, you balance the risk that the Internet will be inaccessible locally, or that the external provider will be unavailable for some other reason.

But given how widely accessible Internet connectivity is these days, and how rarely well-hosted services go down, we would expect it to be substantially less risky to use an out-sourced provider.

Answer: almost certainly externally out-sourced, unless you’re actually a cloud provider yourself.

What if we’re basically doing this already?

“We’ve got a slackbot and a bit of a process together. It’s not great, but it’s okay, and we tinker with the slackbot regularly enough, so we’re good, thanks.” We hear this argument and understand it - displacing something which is already a sunk cost is always a tricky prospect.

The argument that you are too busy to change away from your internal system, though you know it has problems, and that this is “okay for now” is correct only if you never encounter an incident which your internal system can’t handle. Our experience leads us to believe that, eventually, all organizations will encounter this. Ironically, it will happen sooner to the successful ones, since they’ll often scale very quickly as their customer base grows, and scaling very quickly is unfortunately strongly associated with incidents!

If this is right, then the question becomes not whether to switch over, but when to switch over. The wrong moment to switch is, of course, just after the big incident you didn’t handle correctly (but it is sadly a really common moment for companies to realise they had probably better do that thing they were putting off for ages - that is reality). The right moment is probably some function of how rapidly you’re scaling, how rapidly you believe you’ll scale in the future, and a complicated assessment of where you would otherwise put your effort, product roadmap, staff capability, and so on. As a rule of thumb, somewhere between 2x and 10x growth usually presents a company with new problems that its previous ways of doing things don’t accommodate. If you find yourself at the bottom of one of those inflections, but clearly going in the “up and to the right” direction, strongly consider switching before you get to the top of that hill, because chances are you’ll have lost something along the way.

However, if you feel you’re not growing rapidly enough to worry, and have the effort to keep things ticking over in hand, we certainly won’t try to persuade you otherwise. There is an argument from the cost perspective too, of course - to evaluate that, you’d have to start putting numbers on what the existing system/approach is actually costing you. This can be tricky, particularly because intangibles like staff morale should also go into the equation! In general, though, our recommendation would be that rising growth numbers, rising incident numbers, or decreasing staff morale are all strong signals to move to professionalising your IM situation.

Answer: Keep in-house until the numbers are going in the wrong direction.

The #1 Incident Management Platform

Find out why companies like NVIDIA, Squarespace, Canva, OpenSea, and more rely on Rootly to power their incident response process.

We look forward to meeting you! In the meantime, feel free to see why customers have rated us 5 stars on G2.