August 27, 2024
8 mins
An effective incident response team requires carefully planned roles and continuous improvement. Learn about common pitfalls and best practices when structuring a high-performing response team.
In 2021, Meta faced two huge incidents at the same time. First, all its services—including Facebook, Instagram, and WhatsApp—suffered a severe global outage for several hours. During this period, a whistleblower leaked internal documents to the press, accusing the company of prioritizing profits over user safety.
A double-feature SEV0? Perhaps, but how Meta handled the situation is a testament to the readiness of their response teams. The issue brought together IT, Engineering, Legal, PR, and executives to ensure a coordinated response, minimize reputational damage, and reassure users and shareholders.
You never know what kind of incident will hit you next. That’s why structuring response teams covering a variety of functions, as in Meta’s case, can dramatically change how effectively you handle and mitigate emergencies.
However, most companies do not have $118 billion in annual revenue and 66,000 employees, and cannot possibly structure response teams with liaisons for every possible function. That doesn’t mean your response team can’t be just as effective. In this blog post, you’ll learn which roles you can include in your incident response team and get a heads-up on a few challenges you’ll find along the way.
An effective incident response team is essential for any organization, but it’s difficult to strike the balance between team size and expertise or information flow and confidentiality.
Incidents are high-stress events where confusion is one of the biggest threats when working towards a resolution. Figuring out which bases you need covered on the fly is a risky strategy and will likely leave loose ends.
Instead, set time aside to review the roles your response team is using and dive into how they’ve resolved past incidents. Figure out whether a role covers too much or is unnecessary to keep on the on-call rotation.
Make sure each role has clearly defined responsibilities and that roles don’t overlap, which simplifies task distribution for responders. Talk to your team often to see how they view the role definitions and how closely those definitions match the actual workload. Keep adjusting to improve your reliability.
It’s tempting to have a relatively compact incident response team because it’s easier to manage and cheaper. The problem with that comes when a larger-than-usual incident pops up and overwhelms your team’s capacity or goes beyond their expertise, causing significantly longer resolution times.
A common structure to tackle this tension is to set up a core incident response team and a network of subject matter experts (SMEs) who can be called in if needed. The core response team needs to be well-staffed for several rotations and have a wide range of skills. Since SMEs won’t be on on-call rotations, it’s better to build extra redundancy into the areas of expertise you need the most.
On average, roughly half of the incidents your team runs into will take at least 2 hours to resolve. That’s high-stress time that doesn’t quite end when the incident does: responders still need to file compliance forms, write a retrospective, and complete other outstanding actions that came out of the incident.
Constant fire-fighting can be exhausting, especially if you have to do it over the weekend or at 3 am. It takes a lot of investment and iterations to put together an effective incident resolution team, so burnout is one of your biggest concerns. If you don’t want to start over, you’ll need to proactively protect your team and keep them motivated.
The Incident Commander, or Incident Manager, is the assigned decision maker during an incident. When chaos and confusion reign, you need somebody who can call the shots. Your Incident Commander will be responsible for putting together a response team, delegating tasks, and deciding when it’s time to escalate an incident.
Your Incident Managers need to have a solid technical background paired with a deep understanding of your business. The ability to deal with people and uncertain environments is also key to succeeding in this role.
Most incidents will have an important technical component; that’s why you need somebody to execute and coordinate the root cause analysis and mitigation processes. The Technical Lead will reach out to subject matter experts if needed for the incident at hand.
Additionally, the Technical Lead will collaborate with other incident response roles to minimize disruptions to customers.
Subject Matter Experts (SMEs) are called into the incident as deemed necessary by the Technical Lead. They provide their expertise to help understand and resolve an incident. An SME might specialize in, for example, CI/CD pipelines or the database infrastructure used by the organization.
It’s recommended that you nurture a network of SMEs in your organization, trying to cover all the technologies you use across your systems. That may mean having multiple SMEs for each aspect of the software development lifecycle. For example, you may need a Jenkins and a GitHub SME because they are both CI/CD solutions used in different product teams.
Incident communication with external parties is critical and delicate. Your Communications Lead will be a media-trained professional who can navigate the technical nuance and business impact of an incident.
Your Communications Lead will coordinate communications with the public regarding an incident. This communication can range from public status page updates to setting up a press release if needed.
During the incident, documenting what is being done and when may seem like a chore. But proper documentation can turn out to be critical for compliance and to understand the incident resolution process.
The Scribe logs information as the incident unfolds, noting key events and keeping track of what’s being done by whom at any given time. After the incident, the Scribe also helps write the retrospective.
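To make the Scribe’s job concrete, here is a minimal Python sketch of an append-only incident timeline. The class names, fields, and output format are hypothetical and chosen only for illustration; a real incident management tool would typically capture this for you.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime   # when it happened (timezone-aware)
    who: str       # responder who acted
    role: str      # role they held at the time
    note: str      # what was done or observed


@dataclass
class IncidentTimeline:
    entries: list = field(default_factory=list)

    def log(self, who, role, note, at=None):
        """Append an entry; defaults to the current UTC time."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), who, role, note)
        self.entries.append(entry)
        return entry

    def render(self):
        """Return the timeline in chronological order, one line per entry."""
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(
            f"{e.at:%H:%M} [{e.role}] {e.who}: {e.note}" for e in ordered
        )


if __name__ == "__main__":
    timeline = IncidentTimeline()
    timeline.log("dana", "Incident Commander", "Declared incident")
    timeline.log("lee", "Technical Lead", "Rolled back latest deploy")
    print(timeline.render())
```

Sorting at render time means entries logged late, for instance a note the Scribe adds after the fact with an explicit `at` timestamp, still appear in the right place in the retrospective.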
Depending on your business and scale, you may find it convenient to add more incident roles to your response team. Here are a few roles that are popular in different types of industries:
An incident can rapidly require touching on many areas, from networking and databases to customer communications and SLAs. When designing your incident playbooks, define what kind of roles you need covered, including IT, PR, or legal. Then staff them according to your scale, which may mean that one person needs to wear several hats or that one function is shared by multiple people.
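As a rough illustration of staffing roles according to your scale, the Python sketch below checks a role assignment for gaps and reports who is wearing several hats. The role identifiers and the `check_staffing` helper are hypothetical; adapt the required set to whatever your playbooks define.

```python
from collections import defaultdict

# Assumed set of required roles, based on the roles described in this post.
REQUIRED_ROLES = {
    "incident_commander",
    "technical_lead",
    "communications_lead",
    "scribe",
}


def check_staffing(assignments, required=REQUIRED_ROLES):
    """Inspect a mapping of role -> list of people.

    Returns a tuple:
      - sorted list of required roles nobody is assigned to
      - dict of person -> sorted roles, for people holding more than one role
    """
    unfilled = sorted(role for role in required if not assignments.get(role))

    roles_by_person = defaultdict(list)
    for role, people in assignments.items():
        for person in people:
            roles_by_person[person].append(role)

    multiple_hats = {
        person: sorted(roles)
        for person, roles in roles_by_person.items()
        if len(roles) > 1
    }
    return unfilled, multiple_hats


if __name__ == "__main__":
    small_team = {
        "incident_commander": ["dana"],
        "technical_lead": ["lee"],
        "scribe": ["dana"],
    }
    unfilled, multiple_hats = check_staffing(small_team)
    print("Unfilled roles:", unfilled)
    print("Wearing several hats:", multiple_hats)
```

Running a check like this when you write or revise a playbook surfaces uncovered functions before an incident does.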
A good incident response plan is concise but includes roles, communication protocols, escalation paths, and example scenarios. Make sure everything is defined in enough detail but with simple and straightforward language. Your responders may come to the plan to figure out the next steps when they’re dealing with an incident, so make the format easy to skim through.
Yes, you’re swamped with work, and it’s not a fun thing to do. But drills help you reality-check your incident response playbooks and escalation plans. Your organization changes fast, so it’s likely you’ll run into sections of your incident management procedures that no longer apply or need to be adjusted. Make frequent shadow rotations part of your culture to keep your responders sharp and critical about their systems.
There’s no way to improve your reliability without continuously improving your setup, processes, and team. Block time to assess the performance of your practice regularly, and keep in close touch with your responders. Make sure your SRE team has enough bandwidth to implement proactive measures to improve overall reliability instead of being buried in a permanent state of crisis.
Tools can simplify or complicate your SRE team’s life. For example, invest in keeping your alerts relevant and reducing noisiness. Provide your team with a robust and flexible on-call and incident management solution, like Rootly, so they can focus on resolving incidents instead of implementing and patching ad hoc internal software.
Communication is essential in every single incident, and not only internally, where it makes collaboration more effective and helps resolve the incident. You’ll also want clear communication with impacted customers and external parties.
Incidents are not a one-time kind of deal. You’ll keep running into incidents no matter how “good” your software is. Scale and complexity will only bring more issues. You’ll soon find yourself and your responders performing tedious and repetitive tasks. Automate them, ideally through your incident management tool, so you don’t have to maintain more software from scratch.
Incident Response teams don’t have to be sitting next to each other in the same room. Just as async collaboration through Slack has become a main channel for communication and coordination, incident response can be done across time zones and locations. Tools like Rootly integrate with Slack so your responders don’t have to leave it to assign roles, coordinate a response, and write a retrospective.
Having a distributed team can also help when filling on-call rotations. You can leverage the natural time difference between your responders to extend business-hours coverage, which means people have less time to cover on their weekday evenings.
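The coverage gain from a distributed team can be estimated with a few lines of Python. This is a simplified sketch: the responder names and UTC offsets are made up, offsets are whole hours, daylight saving time is ignored, and business hours are assumed to run from 9:00 to 17:00 local time.

```python
def business_hours_utc(utc_offset, start_local=9, end_local=17):
    """Return the set of UTC hours (0-23) that fall within local business hours."""
    return {(hour - utc_offset) % 24 for hour in range(start_local, end_local)}


def uncovered_hours(responder_offsets):
    """Return the sorted UTC hours that no responder covers during business hours."""
    covered = set()
    for offset in responder_offsets.values():
        covered |= business_hours_utc(offset)
    return sorted(set(range(24)) - covered)


# Hypothetical team: name -> UTC offset in whole hours.
TEAM = {"ana": 1, "mei": 8, "sam": -5}

if __name__ == "__main__":
    gaps = uncovered_hours(TEAM)
    print(f"UTC hours left for after-hours on-call: {gaps}")
```

With these three assumed locations, only a few UTC hours per weekday fall outside someone’s working day, compared with 16 for a single-site team on the same schedule.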
Reliability is a measurable practice, which often demands a minimum service level due to SLAs. You’ll need to understand how effective your incident response team is by digging into the metrics associated with incidents.
There are dozens of metrics available; you need to make sure you choose the SRE metrics that matter to your business objectives. Look into the time it takes your team to resolve incidents, and which steps of the resolution process are taking longer than others.
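As one way to look at resolution time and its slowest step, the Python sketch below computes the median time to resolve and the median duration of each phase. The incident records, phase names, and timestamps are invented for illustration; in practice you would export this data from your incident management tool.

```python
from datetime import datetime
from statistics import median

# Hypothetical incident records with one timestamp per phase boundary.
INCIDENTS = [
    {"detected": "2024-05-01T10:00", "acknowledged": "2024-05-01T10:05",
     "mitigated": "2024-05-01T11:00", "resolved": "2024-05-01T12:30"},
    {"detected": "2024-05-03T22:00", "acknowledged": "2024-05-03T22:20",
     "mitigated": "2024-05-03T22:50", "resolved": "2024-05-03T23:10"},
    {"detected": "2024-05-07T03:00", "acknowledged": "2024-05-07T03:10",
     "mitigated": "2024-05-07T05:00", "resolved": "2024-05-07T05:30"},
]

# Consecutive phase boundaries to measure.
PHASES = [
    ("detected", "acknowledged"),
    ("acknowledged", "mitigated"),
    ("mitigated", "resolved"),
]


def minutes_between(incident, start, end):
    """Minutes elapsed between two recorded timestamps of one incident."""
    delta = datetime.fromisoformat(incident[end]) - datetime.fromisoformat(incident[start])
    return delta.total_seconds() / 60


def median_time_to_resolve(incidents):
    """Median minutes from detection to resolution."""
    return median(minutes_between(i, "detected", "resolved") for i in incidents)


def median_phase_durations(incidents):
    """Median minutes spent in each phase, keyed as 'start->end'."""
    return {
        f"{start}->{end}": median(minutes_between(i, start, end) for i in incidents)
        for start, end in PHASES
    }


def slowest_phase(incidents):
    """Name of the phase with the highest median duration."""
    durations = median_phase_durations(incidents)
    return max(durations, key=durations.get)


if __name__ == "__main__":
    print("Median time to resolve (min):", median_time_to_resolve(INCIDENTS))
    print("Median phase durations (min):", median_phase_durations(INCIDENTS))
    print("Slowest phase:", slowest_phase(INCIDENTS))
```

The median is used here instead of the mean so that a single unusually long incident does not dominate the picture; whether that is the right choice depends on what your SLAs measure.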
Structuring an effective response team takes several iterations and more than a few incidents. Make sure you define roles succinctly and that, between them, your responders cover all the areas an incident can possibly impact.
Modern incident management tools like Rootly can help you assign roles when an incident breaks, without leaving Slack. Rootly also helps your team keep track of actions associated with each role and logs the activity done by each responder to ease the retrospective process.
Book a demo to see first-hand how Rootly can help your response team.