December 6, 2023
When incidents reach a heightened level of complexity and scale, Strong argues that companies ought to consider having multiple lead roles present, rather than a single Commander overseeing the entire response. In this post, he breaks down when and how he recommends you consider bringing additional command roles in.
This post was contributed by Strong Liang. It has been lightly revised and reposted with his permission from the original article on Medium.
Leading major incident responses can be extremely stressful. You have to quickly gather an ad-hoc team, figure out what went wrong, identify a fix, and make sure the fix doesn't make things worse, all while senior leadership breathes down your neck. Are we having fun yet? Many people think having a dedicated incident commander role will solve the problem. But when the stakes are high, my experience is that one is not enough. You need at least two or three lead roles in dedicated rotations to achieve sustained, optimal results. Otherwise, you’ll be counting on luck. And luck isn’t a strategy.
To be clear, a major incident usually means a production issue that’s significantly breaching the product SLA and/or SLO (or will do so if no immediate action is taken). Examples: service unavailable (site down), extremely high latency, missed critical process deadlines. In short, anything that will have your important partners or customers banging on your door and freak out your C-suite.
Google’s SRE books talk about the best practice of having multiple incident lead roles, such as communication lead and operations lead. You may say “Well, that’s Google. Having one dedicated role is expensive enough, how can my org/company afford more?” I think the key here is incident severity. If your company rarely has any incidents that require all hands on deck, sure, this is overkill. But if it happens, and you don’t have the additional lead roles, the response team tends to scramble, which means:
Having multiple lead roles is expensive, so the question is when it matters most. Here are some scenarios where the incident commander needs help:
High stakes partners/customers: Communication is critical in managing expectations and reputation here. Having a dedicated external comms lead makes a world of difference.
Large-scale incidents: Usually there are many threads to pursue simultaneously, which requires excellent coordination. As an incident commander, if I’m not sure the team is capable of managing multiple war rooms, I will not start the rooms because it may make things worse. Having an internal communication lead and an ops lead who are trained to do this can prevent the coordination from imploding.
The first 10 minutes: Even for a “normal” scale incident, the beginning can be overwhelmingly chaotic. Having multiple lead roles spreads the load of establishing order.
In general, you want to monitor whether the response team is struggling with any tasks, due to complexity or load. In a crisis, any signs of struggle predict suboptimal results. External communication is one example, where untrained engineers will have a hard time striking the right balance between clarity, level of detail, and tone. Another example is debugging a multifaceted technical problem, where someone well versed in the monitoring tools can provide critical guidance. Load problems usually are a symptom of uneven distribution, where some members can’t keep up with their tasks while the rest do not have much to do. Having an ops lead to help distribute work can quickly unlock team bandwidth.
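The load problem described above can be sketched in a few lines. This is only an illustration of what an ops lead does by judgment, not a tool the post describes: the function name `rebalance`, the responder names, and the rule "move tasks until counts differ by at most one" are all assumptions made for the example. A real ops lead would also weigh skills and context, which this sketch ignores.

```python
def rebalance(tasks_by_responder):
    """Move open tasks from the busiest responder to the idlest one
    until no two responders differ by more than one task.

    Hypothetical helper: task counts stand in for "load", which is a
    simplifying assumption. The input mapping is not modified.
    """
    balanced = {name: list(tasks) for name, tasks in tasks_by_responder.items()}
    if not balanced:
        return balanced
    while True:
        busiest = max(balanced, key=lambda name: len(balanced[name]))
        idlest = min(balanced, key=lambda name: len(balanced[name]))
        if len(balanced[busiest]) - len(balanced[idlest]) <= 1:
            return balanced
        # Hand the most recently added task to whoever has the least to do.
        balanced[idlest].append(balanced[busiest].pop())


# Example: one responder is swamped while the others are nearly idle.
open_tasks = {
    "alice": ["check db", "check cache", "roll back", "page vendor", "read logs"],
    "bob": [],
    "carol": ["update status page"],
}
after = rebalance(open_tasks)
```

Tasks stay with their current owner unless a move is needed, which mirrors the point that responders who already have context should keep it where possible.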
On the other hand, if the team is well organized and the situation is stable, having multiple lead roles may be wasteful. To preserve strength, the incident commander should feel free to dismiss some folks. One common scenario for this is when the major incident is an escalation from a long-running lower severity one, where there is already a team with full context, maybe even a clear path to mitigation.
Not sure how to implement multiple lead roles? Here’s an example: one rotation for incident commander and one rotation for communication lead. When a major incident happens, both rotations get paged. This is especially useful when external communication is required. More roles, such as an operations lead, can be added if the troubleshooting task is complex enough.
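As a rough sketch, the paging rule in this example could be written as follows. The `Incident` fields, the rotation names, and the `rotations_to_page` function are hypothetical and not tied to any paging tool. The choice to page only the incident commander for non-major incidents is also an assumption, since the example above only covers the major case.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    title: str
    major: bool                            # breaching SLA/SLO, or about to
    complex_troubleshooting: bool = False  # judgment call made at triage


def rotations_to_page(incident):
    """Return the on-call rotations to page for an incident.

    Hypothetical policy: a major incident pages both the incident
    commander and the communication lead; an operations lead is added
    when troubleshooting looks complex.
    """
    rotations = ["incident_commander"]
    if incident.major:
        rotations.append("communication_lead")
        if incident.complex_troubleshooting:
            rotations.append("operations_lead")
    return rotations


site_down = Incident("Site down", major=True, complex_troubleshooting=True)
paged = rotations_to_page(site_down)
```

Keeping the rule this small makes it easy to review and to change as the team learns which incidents actually need the extra roles.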
Alternatively, if major incidents typically involve multiple SRE rotations, the oncallers can negotiate their roles at the start of the incident. This can also apply to any oncall rotation whose members are sufficiently trained for the situation.
High-stakes incidents require high-stakes responses. With a small group of lead roles, the incident response team can navigate challenging situations and consistently deliver a low time to mitigation (TTM). This way, you solve the problem and keep the panicky breath of leadership at bay.
Zhuang (Strong) Liang is a software engineering leader with over 16 years of experience, specializing in Reliability and Infrastructure at world-class companies like Affirm, Google, and Uber. You can keep up with his posts on Medium.