July 16, 2024
7 mins
Discover the best on-call scheduling strategies for SREs in 2024
Reliability is a team sport. As SREs, you and your team are always fighting to keep the score positive for availability, performance, and recoverability across all SLOs. A big part of that fight lies in how effectively you respond when something goes down. And as you’ve likely experienced firsthand, alerts tend to go off right when you’re getting ready to decompress, which calls for setting up 24/7 on-call schedules.
The objective of an on-call schedule is to ensure there’s always someone available to handle issues that pop up, even if it’s at 3 am on a Sunday. A single person can’t possibly cover the entire schedule (without getting burnt out), so you’ll have to distribute the responsibility across the team.
However, it’s not enough to have one single person responsible for acknowledging and handling any incoming alerts. That would introduce a single point of failure in your reliability strategy. That’s why on-call schedules usually have several levels of redundancy built into them through escalation policies (e.g., if Rob doesn’t acknowledge an alert within a few minutes, the alert will be sent to Sara at the managerial level).
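As a rough illustration of the escalation idea, the sketch below models a policy as an ordered list of levels, each with its own acknowledgment timeout. The class and function names, and the 5- and 10-minute timeouts, are assumptions made for this example; they do not come from any particular on-call tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EscalationLevel:
    responder: str
    ack_timeout_min: int  # minutes this level has to acknowledge (assumed value)


def responder_to_page(levels: List[EscalationLevel], minutes_since_alert: int) -> Optional[str]:
    """Return who holds the unacknowledged alert, or None once every level timed out."""
    elapsed = 0
    for level in levels:
        elapsed += level.ack_timeout_min
        if minutes_since_alert < elapsed:
            return level.responder
    return None


# Hypothetical policy: Rob gets 5 minutes, then Sara gets 10 more.
policy = [EscalationLevel("Rob", 5), EscalationLevel("Sara", 10)]
print(responder_to_page(policy, 3))  # still within Rob's window
print(responder_to_page(policy, 7))  # Rob did not acknowledge, so Sara is paged
```

A `None` result means nobody acknowledged in time, which is the case a real policy would handle with a final fallback.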
On-call schedules can be set up using different strategies, depending on your team structure, objectives, and constraints. In this article, we’ll dissect the three most common ways of distributing on-call duties in a team.
The most common way of organizing on-call schedules is through rotations. This means that the 24/7 coverage is distributed through different shifts.
The basic idea is that you split up all the hours of a week and distribute them into a set of rotations that go one after the other, making sure there’s no gap left between them. Then, you assign team members to each rotation.
The convenient aspect of rotational on-call schedules is that they can be tuned to the team’s preferences, local labor laws, and any other circumstances.
You can start your incident response team of two using on-call rotations. Juan and Andrea can each take a rotation, with one person handling all alerts on odd weeks and the other on even weeks.
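A two-person alternating rotation like this can be sketched in a few lines. The example counts whole weeks from an anchor date rather than using calendar week numbers, so the alternation does not break at year boundaries. The anchor date and the function name are assumptions for illustration.

```python
from datetime import date
from typing import List

# Hypothetical start of the rotation; January 1, 2024 is a Monday.
ROTATION_START = date(2024, 1, 1)


def on_call_for(day: date, responders: List[str], start: date = ROTATION_START) -> str:
    """Return the responder covering the week that contains `day`."""
    weeks_elapsed = (day - start).days // 7
    return responders[weeks_elapsed % len(responders)]


team = ["Juan", "Andrea"]
print(on_call_for(date(2024, 1, 3), team))   # first week of the rotation
print(on_call_for(date(2024, 1, 10), team))  # second week of the rotation
```

The same function works for larger teams: adding a third name to the list turns it into a three-week cycle.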
However, as your response team grows—and you start delegating on-call duties to more teams—you’ll need an on-call platform to organize rotations and responders effectively. An on-call scheduler like Rootly On-Call can help you set up schedules and escalation policies with minimal effort.
Rotations are also used in large enterprises, although coordinating shifts across the whole organization is far more complex than in the Juan and Andrea startup case.
Rotations can be as short and frequent, or long and spaced out, as the team prefers. For example, responders in Amy’s team prefer to have a 24/7 week-long shift, but only have to go through it every six weeks. By contrast, the people in Karim’s team voted to opt for shorter shifts of one full day every two weeks.
The simplest way of organizing on-call schedules would be to split rotations in equally sized shifts. For example, if you establish seven one-day shifts, you’ll happily have a one-to-one mapping for the week. That means you just assign people to these shifts and you’ve got 24/7 coverage, easy!
Except, you’ll have people always covering weekends, while others only serve on-call during weekdays. This translates to some team members getting burdened during their rest days, which will likely result in tensions and burnout.
That’s why it’s common for teams to establish separate rotations for business-hours on-call and for out-of-hours on-call duty, which includes nights and weekends. Assigning these shifts requires some admin work and keeping all responders involved in the loop.
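One way to picture the split is a small function that decides which rotation a given moment belongs to. The 9:00 to 17:00 weekday window and the rotation labels are assumptions for this sketch; your team’s business hours may differ.

```python
from datetime import datetime


def rotation_for(moment: datetime, start_hour: int = 9, end_hour: int = 17) -> str:
    """Classify a local timestamp as business-hours or off-hours on-call."""
    is_weekday = moment.weekday() < 5  # Monday is 0, Sunday is 6
    in_hours = start_hour <= moment.hour < end_hour
    return "business-hours" if is_weekday and in_hours else "off-hours"


print(rotation_for(datetime(2024, 7, 16, 10, 0)))  # a Tuesday morning
print(rotation_for(datetime(2024, 7, 14, 10, 0)))  # a Sunday morning
```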
If you have teams in San Francisco, New York, London, and Bangalore, you have pretty much somebody in your organization working during most hours of the weekday. The idea of follow-the-sun schedules is to leverage the existing time differences within your teams to assign on-call shifts to people who are on their business hours.
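To make the idea concrete, here is a minimal sketch that lists which offices are inside business hours at a given UTC time. It uses fixed UTC offsets (standard time, daylight saving ignored) to stay self-contained; a more realistic version would likely rely on Python’s `zoneinfo` module. The offsets, the 9:00 to 17:00 window, and the function name are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import List

# Hypothetical offices with fixed UTC offsets in hours (DST deliberately ignored).
OFFICES = {
    "San Francisco": -8,
    "New York": -5,
    "London": 0,
    "Bangalore": 5.5,
}


def offices_in_business_hours(now_utc: datetime, start_hour: int = 9, end_hour: int = 17) -> List[str]:
    """Return offices whose local time is a weekday within [start_hour, end_hour)."""
    available = []
    for office, offset in OFFICES.items():
        local = now_utc.astimezone(timezone(timedelta(hours=offset)))
        if local.weekday() < 5 and start_hour <= local.hour < end_hour:
            available.append(office)
    return available


# Tuesday 10:00 UTC: London and Bangalore are in their working day.
print(offices_in_business_hours(datetime(2024, 7, 16, 10, 0, tzinfo=timezone.utc)))
# Tuesday 02:00 UTC: no office is in business hours, so a gap remains to cover.
print(offices_in_business_hours(datetime(2024, 7, 16, 2, 0, tzinfo=timezone.utc)))
```

An empty list marks a window that follow-the-sun alone does not cover and that another rotation would need to pick up.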
On paper, the follow-the-sun schedule strategy is the most effective way of distributing on-call duty while minimizing out-of-hours disturbances for responders. However, distributing 24/7 coverage like this can be tricky because, in practice, teams in different locations tend to vary in scale and domain of expertise.
Follow-the-sun can be applied only when you have teams geographically distributed in different time zones. This means smaller organizations have fewer chances of using this strategy as a pillar for their on-call schedule.
However, even if you don’t have a globally distributed workforce, you can still make use of a follow-the-sun schedule. If you have teammates in Portland and Toronto, you still cover a wider range of business hours, which can mean less on-call time outside business hours for responders.
Your on-call schedule should include an overlap of on-call duty where knowledge transfer can take place. The team in time zone A needs some overlap with the team in time zone B in order to hand off any ongoing incidents and their context.
Even if you have enough teams to cover all weekday hours with a follow-the-sun on-call schedule, that will still leave you with weekends and holidays to cover. You’ll need to set up traditional rotations or Round Robin schedules to ensure your organization can effectively respond to an incident at any time.
Take into account that local holidays vary significantly across geographic locations. For example, US Labor Day falls on the first Monday of September (September 2nd in 2024), while Spain observes its Labor Day on May 1st. When a geography is not available due to a holiday, you’ll have to readjust your on-call schedule.
One of the disadvantages of follow-the-sun schedules is the amount of work needed to set them up and maintain them. You’ll need to map the time zones you have available and plan schedules for each so that there’s enough overlap for handoffs.
However, a modern on-call platform like Rootly On-Call can help you generate follow-the-sun schedules with a few inputs about your teams.
A Round Robin on-call schedule distributes the incoming alerts among a group of responders on-call. This means that responders get fewer alerts when they are on-call, reducing the risk of alert fatigue. Round Robins can be introduced into other on-call rotation strategies with relative ease to unburden teams or escalation layers with a particularly high number of alerts.
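The distribution itself is simple to sketch: each incoming alert goes to the next responder in a repeating cycle. The function name and the sample alert labels below are made up for this example.

```python
from itertools import cycle
from typing import List, Tuple


def assign_round_robin(alerts: List[str], responders: List[str]) -> List[Tuple[str, str]]:
    """Pair each alert with the next responder, wrapping around the group."""
    if not responders:
        raise ValueError("at least one responder is required")
    turn = cycle(responders)
    return [(alert, next(turn)) for alert in alerts]


# Hypothetical alerts spread across two responders on the same shift.
assignments = assign_round_robin(
    ["db-latency", "disk-full", "5xx-spike", "cert-expiry"],
    ["Amy", "Karim"],
)
for alert, responder in assignments:
    print(f"{alert} -> {responder}")
```

With two responders, each person handles every other alert, which is where the reduction in per-person load comes from.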
Teams who rely heavily on Round Robin set up on-call rotations with more responders staffed at each level. Instead of having a few responders distributed through more shifts, you can have longer on-call rotations with more team members assigned to each.
The alert load will be evenly distributed among the people who are on-call, meaning fewer alerts for each person. Knowing you’re not alone covering the frontlines can make on-call duty less intimidating and improve the team’s response when an incident comes up.
Being able to respond to incidents effectively is a skill that can only be gained through experience. To drive up the reliability of your organization, you’ll want to keep training more people with on-call skills such as live debugging and stakeholder communication.
Round Robin schedules are a good way to expose more team members to your incident response process without overwhelming them. Having more people on-call also means less experienced responders can feel more confident to reach out for help when needed and resolve any incident faster.
Just because you’re less likely to get paged while you’re on-call doesn’t mean you cannot be paged. This pressure makes it difficult for responders on-call to be able to fully relax during their time off, which can lead to negative consequences in their performance.
That’s why Round Robin best practices include making sure your responders are taken off the schedule when they are out of office (OOO), even if they are very unlikely to get paged due to their position in the queue.
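A minimal sketch of that practice: when picking the next responder in the queue, skip anyone marked as out of office. The function name and the way OOO status is stored (a plain set of names) are assumptions for illustration.

```python
from typing import List, Optional, Set


def next_available(responders: List[str], last_index: int, out_of_office: Set[str]) -> Optional[int]:
    """Return the queue index of the next responder who is not OOO, or None if all are."""
    n = len(responders)
    for step in range(1, n + 1):
        idx = (last_index + step) % n
        if responders[idx] not in out_of_office:
            return idx
    return None


queue = ["Amy", "Karim", "Sara"]
# Amy (index 0) took the last alert and Karim is OOO, so Sara is next.
idx = next_available(queue, 0, {"Karim"})
print(queue[idx])
```

Returning `None` when everyone is out makes the gap explicit, so it can be escalated instead of silently paging someone on leave.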
Round Robin is a strategy that you can use to complement any other on-call rotations you have. For example, make the first layer of your escalation policy a Round Robin if you foresee your team will be overwhelmed by alerts. Or, reduce the load in higher escalation layers by making the management team take requests in a Round Robin schedule.
Most on-call schedulers have a way of helping you set up Round Robin schedules based on a group of responders. In the case of Rootly On-Call, you need a single click to transform your escalation layer into a Round Robin.
No matter which strategy you use to structure your on-call schedule, you’ll be confronted with the question of how often and for how long a responder should be on-call.
It’ll take you a few iterations to land on an on-call scheduling system that works for everyone in your team and addresses the SLOs you’re committed to. It is common for SREs to mix and match rotational schedules, use follow-the-sun as much as possible, and introduce Round Robin escalation policies at certain levels.
To find the right on-call scheduling strategy, you’ll need to set up a feedback loop that keeps driving up your objectives. Remember, responders are humans: keep in mind their personal situation and strive to make on-call duty a more rewarding experience.
Rootly On-Call helps you set up schedules and escalation policies for modern teams. Save time setting up rotational, Round Robin, and follow-the-sun schedules with smart templates that do most of the heavy lifting for you. Learn more by booking a demo with our reliability experts.