Making Your On-call and Incident Management Program Stick
Maintenance of your incident management practice is as important as creation - find out what you can do to keep your engineering organization strong and consistent year over year.
August 22, 2024
15 mins
Learn how to develop an enterprise incident management strategy for your organization in 2024 with practical tips and industry benchmark data.
Cixin Liu, the famous science fiction author, is good at conveying the scale of the universe, which is usually hard to imagine. For example, the sun is about 149 million kilometers (or 93 million miles) away from the earth, but that doesn’t tell me much other than that it’s pretty far. When Liu explains that if the sun were the size of a soccer ball, the earth would be a speck of dust floating meters away from it, I get a much better sense of the scale we’re dealing with.
When I worked at Shopify, we processed around 675 million transactions yearly. If each kilometer that separates the earth from the sun were a transaction, you could go to the sun nearly five times with Shopify’s processing scale. It’s a silly analogy, but it helps put things into perspective.
Enterprise scale is not abstract, but it is hard to imagine unless you experience it first-hand. Enterprises face unique challenges that only appear at that scale. You won’t see their problems at a startup or a mid-sized company, and worrying about them is pointless until you have to deal with them.
When your company is larger than entire nations, what does incident management look like? When you’re a startup, you’ll get a headache if your service is down for a few hours. You’d be six feet under within a week if you took that approach in an enterprise: even a sunny day can see dozens of ongoing incidents across the organization.
In this article, you’ll get an overview of the kind of challenges to expect when dealing with incidents at scale, how to identify gaps in your reliability strategy, and how to measure the success of your incident management.
These insights are based on my experience working with enterprise partners at Rootly, such as LinkedIn, NVIDIA, and Cisco. I’m also drawing best practices from my conversations with reliability leaders from Google, Microsoft, and Okta on my Humans of Reliability series.
When you have a single service, you could say your company is reliable if that service is available, let’s say, 99.9% of the time. What does it mean to be reliable when you have thousands of services supporting different business lines in different regions?
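To make the single-service case concrete, an availability target translates into a downtime budget, and a fleet of services can be summarized in more than one way. The sketch below is only an illustration: the function names, the 30-day window, and the sample figures are assumptions, not data from this article.

```python
from statistics import mean

MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed by an availability target over a window."""
    return (1.0 - target) * window_days * MINUTES_PER_DAY


def share_meeting_target(availabilities: dict[str, float], target: float) -> float:
    """Fraction of services whose measured availability meets the target."""
    met = sum(1 for value in availabilities.values() if value >= target)
    return met / len(availabilities)


# Hypothetical measurements for three services (made-up numbers).
measured = {"checkout": 0.9995, "search": 0.9990, "reports": 0.9950}

# Downtime budget for a 99.9% target over 30 days, in minutes.
print(round(downtime_budget_minutes(0.999), 1))
# Share of services at or above the target.
print(share_meeting_target(measured, 0.999))
# A plain average across services, which can hide one weak service.
print(round(mean(measured.values()), 4))
```

With thousands of services, the share of services meeting their own target is often more informative than a single averaged number, because the average can look healthy while one business line is struggling.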
Does the CTO need to know about every incident? Probably not. But how do you determine who actually needs to know when hundreds of colleagues from different teams and functions are loosely connected to every incident you run into?
You have to have people on-call for hundreds of services around the world. This translates to having dozens of rotations scheduled to cover all services across all time zones.
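One practical check when rotations span time zones is whether the combined shifts actually cover the full day. The following is a minimal sketch under assumed conventions: shifts are expressed in whole UTC hours from 0 to 23, the end hour is exclusive and may wrap past midnight, and the team names are hypothetical.

```python
def uncovered_hours(shifts: list[tuple[str, int, int]]) -> list[int]:
    """Return the UTC hours (0-23) that no shift covers.

    Each shift is (team, start_hour, end_hour) with hours in 0..23.
    The end hour is exclusive and may wrap past midnight.
    A shift whose start equals its end is treated as empty.
    """
    covered: set[int] = set()
    for _team, start, end in shifts:
        length = (end - start) % 24
        for offset in range(length):
            covered.add((start + offset) % 24)
    return [hour for hour in range(24) if hour not in covered]


# Hypothetical follow-the-sun rotation for one service.
follow_the_sun = [("apac", 0, 8), ("emea", 8, 16), ("amer", 16, 0)]
print(uncovered_hours(follow_the_sun))

# The same service with the APAC shift missing and AMER ending early.
with_gap = [("emea", 8, 16), ("amer", 16, 23)]
print(uncovered_hours(with_gap))
```

Running a check like this per service makes coverage gaps visible before an alert fires with nobody on call.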
Few things are as critical as security for enterprise teams. Everything is on the line: data breaches, stock price collapses, class actions. That is why businesses at scale go all out to ensure their processes and vendors are secure and compliant.
Implementing automations at an enterprise company is not just about improving efficiency anymore. Automations around incident management are a necessity to speed up mitigation times and prevent compliance issues.
You don’t arrive at an enterprise scale without some sort of incident management solution. Whether it’s an ad-hoc collection of utilities that were put together as the need arose over time or a legacy vendor you inherited from the previous management, your teams are somehow managing and resolving incidents.
However, just as your team keeps shipping new products, paying off tech debt, and renovating their infrastructure, your alerting and incident management solution needs to keep evolving to meet new demands and objectives.
Before you decide something is broken or works well, you’ll need to get a good understanding of how incidents are managed at the moment. This is not a simple task. It’ll require you to go into the trenches to see how people on-call are responding to incidents.
After mapping your existing incident management process, some gaps will become apparent. The next step is about prioritizing and planning how to actually improve your process.
Tool sprawl can hurt your reliability by forcing responders to juggle disconnected tools and switch context repeatedly while addressing an incident. Opt for a centralized platform, such as Rootly, that integrates alerting, on-call schedules, incident management, and retrospectives in one place. This will help you save on integration costs and consolidate your practice.
Incidents can be hectic and have many people interested in their evolution. Your incident management software can do a lot for you to keep communications streamlined. First of all, solutions like Rootly bring clarity by becoming the single source of truth for any incident-related insight. Plus, you can offload communication tasks through automatic workflows that, for example, notify certain Slack channels when a SEV1 incident is declared in a specific service.
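The idea of notifying certain channels when a SEV1 is declared in a specific service boils down to matching incidents against rules. The sketch below shows that matching logic in plain Python. It is not Rootly’s API or workflow syntax, and the class names, fields, and channel names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    title: str
    severity: str  # for example "SEV1"
    service: str


@dataclass(frozen=True)
class Rule:
    severity: str
    service: str
    channels: tuple[str, ...]


def channels_to_notify(incident: Incident, rules: list[Rule]) -> list[str]:
    """Collect channels from every matching rule, in order, without duplicates."""
    matched: list[str] = []
    for rule in rules:
        if rule.severity == incident.severity and rule.service == incident.service:
            for channel in rule.channels:
                if channel not in matched:
                    matched.append(channel)
    return matched


# Hypothetical rules and incident.
rules = [
    Rule("SEV1", "payments", ("#payments-oncall", "#exec-updates")),
    Rule("SEV1", "payments", ("#exec-updates", "#support-leads")),
    Rule("SEV2", "payments", ("#payments-oncall",)),
]
incident = Incident("Card authorizations failing", "SEV1", "payments")
print(channels_to_notify(incident, rules))
```

Keeping the rules as data rather than ad-hoc scripts makes it easier to audit who gets notified for which severity and service.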
Enterprise teams invest heavily in automation. The amount of manual work involved in incidents is surprising, especially after they’ve been resolved. Retrospectives can take time as you gather the facts after the fact. Incident management platforms like Rootly automatically keep track of what happens throughout your incident resolution process and can construct a timeline, suggest resolution causes with GenAI, and file action items in Jira.
AI can significantly enhance incident management by providing data-driven insights, predictive analytics, and automated decision support. When leveraging AI, it’s crucial to ensure that it complements human judgment rather than replacing it. AI should be used to process vast amounts of data quickly, identify patterns, and suggest possible actions, leaving the final decision-making to experienced incident responders who can interpret the context and nuances of each situation.
Rootly’s AI capabilities provide advanced insights and suggestions, helping incident responders resolve issues faster. With a privacy-first approach and human-in-the-loop design, Rootly’s AI supports decision-making without replacing human judgment, ensuring that your incident response is both efficient and effective.
Incident management needs to evolve along with its context. Make sure you set a yearly cadence to check the entire process and tools. The review would ideally include inputs from responders and the results from the period of evaluation.
Deciding which SRE metrics are useful to you and which ones aren’t is the first step to evaluating the performance of your reliability strategy. Even then, what constitutes “good” or “needs improvement” will entirely depend on your context and SLOs.
To provide some orientation, the team at Rootly processed the anonymized data of about 150,000 high-severity incidents across enterprise-tier customers, excluding accounts with more than 5,000 employees because they distorted the dataset disproportionately.
MTTR Benchmark
The Mean Time to Resolution / Recovery in our dataset, from detection to recovery, is distributed as follows:
Follow-up Actions
Using the same dataset, we evaluated incidents in the following state: all follow-up action items completed and a retrospective published.
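Both measures described above can be derived from basic incident records. The following sketch assumes a hypothetical record shape (the field names are illustrative and do not reflect Rootly’s data model) and uses made-up sample incidents rather than the benchmark dataset.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentRecord:
    detected_at: datetime
    recovered_at: datetime
    open_action_items: int
    retrospective_published: bool


def mean_time_to_recovery(records: list[IncidentRecord]) -> timedelta:
    """Average of (recovery - detection) across incidents."""
    total = sum((r.recovered_at - r.detected_at for r in records), timedelta())
    return total / len(records)


def fully_closed_share(records: list[IncidentRecord]) -> float:
    """Share of incidents with no open action items and a published retrospective."""
    closed = sum(
        1 for r in records
        if r.open_action_items == 0 and r.retrospective_published
    )
    return closed / len(records)


# Made-up sample incidents.
sample = [
    IncidentRecord(datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 30), 0, True),
    IncidentRecord(datetime(2024, 5, 3, 22, 0), datetime(2024, 5, 3, 23, 30), 2, True),
]
print(mean_time_to_recovery(sample))
print(fully_closed_share(sample))
```

Because a mean is sensitive to a few very long incidents, it can be worth reporting a median or percentiles alongside it when comparing against any benchmark.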
Rootly allows organizations to build custom dashboards. This is vital because not every team inside a large enterprise needs, or wants, to see metrics from other teams. Rootly customers often set up dashboards for specific teams (e.g., the mobile app team views its most common incident causes). The customization is powerful yet easy to use.
The incident management software you choose to partner with to develop your reliability strategy can make a big difference in your implementation. Ease of use paired with the right amount of configurability, cost-effectiveness, and the ability to integrate with your existing tools play a critical role when choosing an incident management solution.
Rootly caters to true enterprises globally such as LinkedIn, NVIDIA, Figma, Elastic, Cisco, and Shell. We specialize in large-scale enterprise deployments such as LinkedIn with 10,000 users, and the majority of our customers are large global enterprises with 5,000+ users on Rootly.
Building an incident management strategy at scale requires you to continuously assess how your responders are working and how effectively you’re hitting your SLOs.
Check out our full-cycle Incident Response Guide to get more insights into each step of the process.