October 22, 2021
Learn about the key roles within an incident response team, as well as optional incident roles you may not have thought about.
In the world of reliability engineering, folks talk frequently about “incident response teams.” But they rarely explain what, exactly, an incident response team looks like, how it’s structured or which roles organizations should define for incident response.
That’s a problem because your incident response team is only as effective as the roles that go into it. Without the right structure and responsibilities, you risk leaving gaps in your incident response plan that could undercut your team’s ability to respond quickly and efficiently to all aspects of an incident.
This article explains how to define incident response roles in order to build a team that works as effectively and efficiently as possible.
Before defining incident response roles, let’s take a look at what an incident response team does collectively.
An incident response team is a group of personnel who respond to incidents that disrupt IT resources (and that also, by extension, disrupt the business). Designating staff to be part of an incident response team is important because you don’t want to waste time in the midst of an incident trying to decide who needs to do what to handle the problem. By creating an incident response team ahead of time, you have a group of experts who are prepared to respond quickly to issues whenever they arise.
Of course, the exact nature of incident response teams varies from one organization to the next. So do the titles given to different roles within incident response. But in general, most incident response teams include the following core roles.
The incident commander, or incident manager, is in charge of the incident response as a whole (although they need not be an actual executive at your company). The person in this role is the lead decision-maker and is responsible for overseeing the rest of the incident response team. Incident managers often come from a technical background, but people skills and management experience are just as important for this role as technical expertise.
Technical leads are in charge of managing the technical dimensions of incident response. Their main responsibility is determining what went wrong, devising a remediation strategy and implementing the fix as efficiently as possible.
To do this work well, technical leads should coordinate with the other incident response roles to ensure that technical problem-solving aligns with other priorities, such as minimizing disruptions to customers and protecting the business’s brand.
Subject matter experts, who are overseen by the team’s technical lead, provide the technical expertise and labor required to work through an incident.
Depending on which types of systems you are supporting, you may need a variety of subject matter experts who are prepared to respond to different types of issues. For example, you may want to define one subject matter expert role for a networking engineer, another for a storage or database engineer and another for a software engineer. Each of these areas of expertise may be necessary when responding to an incident.
The customer relations lead is in charge of managing the customer-impacting aspects of an incident. This person is responsible for determining how an incident affects customers and for helping the rest of the team proceed in a way that results in the best possible customer experience.
The communications lead oversees communications with the public about an incident. This person will typically come from a PR background, but an ability to understand how IT systems impact business operations and branding is critical, too.
Note that the communications lead doesn’t manage communications within the incident response team itself. That job typically falls to an assistant incident manager, with help from the scribe.
Documentation is always important, and incident response is no exception. The scribe addresses this requirement by recording information about the incident response process as it unfolds. The scribe can also help to generate postmortem reports.
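If you keep your team roster in a machine-readable form, the core roles described above can be checked for gaps before an incident happens. The following is a minimal sketch, not a prescribed tool: the `CoreRole` enum, the `unfilled_roles` helper and the names in the roster are all hypothetical and exist only for illustration.

```python
from enum import Enum


class CoreRole(Enum):
    """The core incident response roles described in this article."""
    INCIDENT_COMMANDER = "incident commander"
    TECHNICAL_LEAD = "technical lead"
    SUBJECT_MATTER_EXPERT = "subject matter expert"
    CUSTOMER_RELATIONS_LEAD = "customer relations lead"
    COMMUNICATIONS_LEAD = "communications lead"
    SCRIBE = "scribe"


def unfilled_roles(roster):
    """Return the core roles with nobody assigned, in definition order.

    A role counts as unfilled if it is missing from the roster
    or mapped to an empty list.
    """
    return [role for role in CoreRole if not roster.get(role)]


# Hypothetical roster: the names are placeholders, not real people.
roster = {
    CoreRole.INCIDENT_COMMANDER: ["Avery"],
    CoreRole.TECHNICAL_LEAD: ["Blake"],
    CoreRole.SUBJECT_MATTER_EXPERT: ["Casey", "Devon"],
    CoreRole.COMMUNICATIONS_LEAD: ["Emery"],
    CoreRole.SCRIBE: [],  # listed, but nobody assigned yet
}

gaps = unfilled_roles(roster)
print([role.value for role in gaps])
# prints ['customer relations lead', 'scribe']
```

A check like this only confirms that each role has a name next to it. Whether the people assigned have the right skills for the role is still a judgment call for the team.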
Beyond the core incident response roles defined above, you may want to consider adding some other roles to your team, depending on your priorities and the type of business you support.
Although the communications lead will oversee public communications about an incident, companies with a large social media presence may benefit from assigning someone to oversee social media communications specifically during incident response.
If your company depends extensively on relationships with partners, consider creating an incident response role that will take the lead in communicating with partners during the incident and help to minimize the impact of the incident on partner relationships.
Although security should be everyone’s responsibility, it can be easy in the midst of a hectic incident response process to make mistakes or fail to follow best practices regarding security. Designating a security lead for your team helps to avoid these risks by ensuring that there is someone whose main goal is to enforce security during all stages of the response.
Not all incidents have repercussions for compliance or could lead to legal issues, but some do. Consider creating a legal or compliance lead role to help the team manage these aspects of incident response.
Some parting advice: There are many incident response roles you can define, but not every company needs every role. Your main priority when defining roles should be to create a team that is agile and flexible, while also covering all of the areas of expertise that the team is likely to need.
After all, incidents by definition are problems that you don’t predict ahead of time, and defining an agile set of roles is the best means of preparing for whichever incidents may come your way.