How to Choose the Best On-Call Management Software for Your Team
Your on-call management software can make or break your reliability story. Find out which boxes your on-call solution should be checking for you.
February 7, 2021
3 min read
Successful and blameless postmortems can turn incidents into a gift of learning and prevent repeat mistakes.
The word “postmortem” can mean both the process and its artifact: the document in which you describe the incident, its resolution, and what could be done to prevent it from happening again.
Your system is much more than your IT system. It includes parts of the real world: yourself, your fellow engineers, your boss, your users, your vendors, space, and, worst of all, time.
This complexity makes it difficult to predict, let alone to prevent, failures.
Incidents will certainly happen, so you want to benefit from them rather than be harmed by them.
“Antifragile systems benefit (to some degree) from uncertainty, disorder, error, time…”[1]
In failure, a system reveals new information about itself, particularly hidden relationships between components.
Imagine a simple system with 3 components A, B and C, with the following properties:
Your mental model is the following:
Suddenly, B starts to slow down. This causes A to keep many connections to C open, until C eventually starts dropping new incoming connections. When A can’t open new connections to C, A starts failing as well.
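The mechanism can be made concrete with a back-of-the-envelope sketch. It assumes, purely for illustration, that A holds a connection to C open while it waits for B on each request, so the number of open connections follows Little's law (arrival rate multiplied by the time each connection is held). The request rate, the latencies, and C's connection limit are hypothetical numbers, not measurements from a real system.

```python
def open_connections(requests_per_second: float, seconds_held: float) -> float:
    """Little's law: average number in the system = arrival rate * time in system."""
    return requests_per_second * seconds_held


# Hypothetical figures, chosen only to illustrate the failure mode.
REQUESTS_PER_SECOND = 100  # traffic arriving at A
C_CONNECTION_LIMIT = 50    # connections C accepts before dropping new ones


def c_is_saturated(b_latency_seconds: float) -> bool:
    """Assumes A keeps its connection to C open while it waits for B."""
    held = open_connections(REQUESTS_PER_SECOND, b_latency_seconds)
    return held > C_CONNECTION_LIMIT


print(c_is_saturated(0.05))  # B healthy: about 5 open connections to C
print(c_is_saturated(2.0))   # B slow: about 200 open connections to C
```

Under these assumptions nothing about C changed, yet C is the component that starts refusing connections. The only variable that moved was B's latency.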
You have discovered a hidden relationship between B and C. Your new mental model is:
Your ideal postmortem would produce a document that describes how to change the system to prevent the problem from happening again.
Failing that, the document should explain how to reduce the impact and reach a resolution faster if the problem does happen again.
Run a postmortem when your system suffers an incident that hurts you enough.
In theory, you should run one for every incident. In practice, you probably don’t have infinite resources allocated to this, so you could start by focusing on incidents that directly impact customers and/or stakeholders.
You should run it as soon after the incident as possible, maybe even start during its resolution, while everybody’s memory is still fresh.
Clearly state the impact of the incident:
Give a timeline of events and communication between people involved, relating to the incident.
Try to find the root cause, insofar as it is actionable:
"There is an important difference between truth and utility."[2]
Involve the whole team if you can: you get as much brainpower as possible, everybody learns, and you will hopefully resolve the incident before it cascades downstream.
"Problems are swarmed and solved, resulting in quick construction of new knowledge."[3]
If it happens again:
Pay attention to information flow, feedback, and delays to communication:
"Changing the length of a delay may make a large change in the behaviour of the system."[4]
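A timeline makes these delays measurable. The sketch below is one possible way to do it, assuming each event is recorded with a timestamp. The events, the times, the 30-minute threshold, and the `delays` helper are all hypothetical and only serve as an illustration.

```python
from datetime import datetime, timedelta

# Hypothetical timeline: (timestamp, what happened).
timeline = [
    (datetime(2021, 1, 1, 10, 0), "B starts to slow down"),
    (datetime(2021, 1, 1, 10, 25), "First alert fires"),
    (datetime(2021, 1, 1, 10, 40), "On-call engineer acknowledges"),
    (datetime(2021, 1, 1, 11, 30), "Customers are informed"),
    (datetime(2021, 1, 1, 12, 5), "Incident resolved"),
]

# Hypothetical threshold above which a delay deserves a closer look.
SLOW = timedelta(minutes=30)


def delays(events):
    """Return (delay, earlier event, later event) for each consecutive pair."""
    ordered = sorted(events)  # sort by timestamp
    return [
        (later[0] - earlier[0], earlier[1], later[1])
        for earlier, later in zip(ordered, ordered[1:])
    ]


slow_steps = [step for step in delays(timeline) if step[0] > SLOW]
longest = max(delays(timeline))

for delay, earlier, later in slow_steps:
    print(f"{delay} between '{earlier}' and '{later}'")
```

In this made-up timeline the longest gap is not in detection or in the fix, but between the acknowledgement and the first message to customers, which is the kind of communication delay worth discussing in the postmortem.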
If you start naming and shaming, people will have an incentive to hide information, which undermines the whole process.
There is no real boundary between your system and the rest of the world; it is a continuum. In your quest for the root cause, you may be tempted to reach far into territory you have no control over, which brings rapidly diminishing returns.
Remember that the objective is to help yourself and your organization learn:
Here is a list of public (and verbose) incident postmortem documents:
Camille Hodoul is a JavaScript and PHP developer living in Grenoble, France. You can learn more about his work on his website.