May 22, 2024

15 min read

Building On-Call Schedules for Humans

Learn how to navigate vacations, parenthood and personal preferences to improve your reliability practice.


Service degradations, third-party failures, and malicious attacks don't wait for work hours—they often strike while you're having your best night. When chaos breaks and stress levels skyrocket, waiting until the next workday is not an option. This urgency is why many organizations ensure they have a team ready to respond around the clock, no matter when an incident strikes.

To manage this continuous readiness, companies implement on-call rotation schedules. Team members like Bob and Maya might be on call, for example, the first week of every month, ready to spring into action if an alert sounds. Their responsibility is to promptly acknowledge the alert, assess its urgency, and determine the severity of the incident. While they often resolve issues independently, complex situations might require escalating the problem and collaborating with others.

Incidents aren't just disruptions; they can dominate your focus, becoming worlds unto themselves. Organizations thus begin to measure the maturity of their reliability practices by metrics such as Mean Time to Recovery (MTTR) and DORA indicators. It’s only fair: as you scale up, you need abstractions to navigate the complexity.

Yet, amid these metrics, the human aspect of incident management sometimes gets overlooked. When systems fail, and downtime ticks into costly hours, it's the on-call people who become the unsung heroes, stepping in to mitigate disasters. Their role is pivotal, yet it's essential to recognize that they are more than just problem solvers—they're people with lives outside of work.

Acknowledging the human side means understanding that team members have personal lives filled with well-deserved holidays, loved ones who get sick, and surprise broken fridges that become little urgencies. Effective incident response strategies must consider these factors to maintain a resilient and responsive team.

This guide aims to explore what "on-call for humans" really means, emphasizing why it's crucial for both the organization and its employees. We'll discuss how to create rotation schedules that are equitable and sensitive to personal needs, ensuring that all team members, including parents and new hires, feel supported and valued. This approach not only helps in retaining talent but also in fostering a work environment where everyone is empowered to perform at their best.

On-call basics

The objective of on-call scheduling is to assign shifts to staff members such that the workload is distributed fairly. You'll need to decide how you split up the time in which people stay on-call and how often they do. On-call scheduling is built around three foundational components: organizing rotations, defining escalation policies, and maintaining schedules in your tool of choice.

Rotations

Incidents can pop up at any time, so you must plan to always have at least one person on-call. Depending on the complexity of your system, you may need more than one person per shift.

You can’t have the same few people on-call all the time. It’s unfair, and it also introduces a single point of failure into your strategy. Being able to respond to incidents is a specific skill that you want to cultivate across your team. That’s why you need to organize rotations that distribute on-call duty in shifts among your team members.

Traditionally, SRE teams handled all on-call duties, bringing systems back online as needed. However, with the rise of the "you build it, you run it" mantra, the responsibility is increasingly being delegated to the teams that own and operate specific components.

Organizing rotations by team can add some admin work, but distributing the on-call load by component can help resolve incidents faster because they are tackled by the people who built and maintain the affected component.

The industry uses a handful of best-practice scheduling templates, such as biweekly or "follow-the-sun" rotations, to assign shifts to people. In general, though, you'll want to ask each team member about their preferences and needs before organizing rotations.
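As a rough illustration of what a rotation template boils down to, here is a minimal sketch of a weekly round-robin rotation. The function name `weekly_rotation`, the team list, and the start date are all assumptions made for the example; real scheduling tools layer preferences, time zones, and overrides on top of something like this.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def weekly_rotation(people, start, weeks):
    """Assign one person per week, cycling through the list in order."""
    shifts = []
    for index, person in enumerate(islice(cycle(people), weeks)):
        shift_start = start + timedelta(weeks=index)
        shift_end = shift_start + timedelta(weeks=1)  # end date is exclusive
        shifts.append((shift_start, shift_end, person))
    return shifts


# Hypothetical team; the names are borrowed from the article's examples.
schedule = weekly_rotation(["Bob", "Maya", "Anton"], date(2024, 6, 3), weeks=4)
for shift_start, shift_end, person in schedule:
    print(f"{shift_start} to {shift_end}: {person}")
```

With three people and four weeks, the first person comes back around in week four, which makes it easy to see how often each teammate carries the pager.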

Escalation policies

Let’s say you have Anton on-call right now, but he’s not answering his phone when an alert pops up. Who else should you contact? A colleague, a manager? And if the incident has critical severity, who should be notified?

Escalation policies act as roadmaps to make sure the right people are notified at the right time when an incident breaks. An escalation policy can consist of several layers, to ensure enough redundancy in the scheduling system.

Screenshot of an Escalation Policy
Escalation policies can notify different channels and repeat until the alert is acknowledged.

Escalation policies can also be used to keep alert fatigue at bay. Imagine it's 3 am, you're the primary contact on-call, and your night shift is far from over. You've already been awakened four times, but this new alert turns out to be a significant incident. Your stress levels are through the roof and your capacity to respond is limited, so you use the escalation policy to bring in extra hands.
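To make the idea of layers concrete, here is a minimal sketch of an escalation policy as data. The `EscalationLevel` class, the delays, and the targets are hypothetical and not taken from any particular tool. Each level is paged once the alert has gone unacknowledged for its delay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    delay_minutes: int  # minutes after the alert fires before this level is paged
    targets: tuple      # who gets notified at this level


# Hypothetical three-layer policy; delays and roles are placeholders.
POLICY = (
    EscalationLevel(0, ("primary on-call",)),
    EscalationLevel(10, ("secondary on-call",)),
    EscalationLevel(25, ("engineering manager",)),
)


def notified_so_far(policy, minutes_unacknowledged):
    """Return everyone who should have been paged while the alert stays unacknowledged."""
    paged = []
    for level in policy:
        if minutes_unacknowledged >= level.delay_minutes:
            paged.extend(level.targets)
    return paged


print(notified_so_far(POLICY, 0))
print(notified_so_far(POLICY, 12))
```

Keeping the policy as plain data makes it easy to review in one glance who gets woken up, and when.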

Scheduling software

Spreadsheets and manual calendar management will not be enough when you have to manage rotations and escalation policies. Using specialized software will help you avoid errors in your scheduling system.

The oldest on-call scheduler is called PagerDuty, built in 2009 to help teams know when their Java monoliths went down. Nowadays, though, teams build software differently and have more demanding needs. You need to distribute on-call ownership, you need scheduling flexibility, you need a tool that doesn't just ping your phone.

There are a few modern on-call solutions in the market right now. When you go around evaluating tools, make sure to look out for the following:

  • Flexible scheduler: your scheduling software should let you easily set up and modify rotations across teams and services. Even if you design the perfect schedule aligned with employee preferences, short notice changes are unavoidable. You need a tool that doesn't make you re-do all the schedules when you introduce a change.
  • Support for different time zones: when you work with a distributed team, being able to specify rotations in your team's time zone will help you avoid time conversion errors and headaches with daylight saving time changes.
  • Gap detection: juggling teams and rotations is difficult, so having an automated way to know when you're missing a coverage slice helps you ensure you'll always have someone on call.
  • Load distribution: helps you identify people who are being given more on-call duties than others, making sure shift assignments are distributed equally throughout your team.
  • Does more than pinging your phone: on-call solutions should give you more context and action items when an alert happens. For example, surfacing relevant playbooks or past incidents.
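Gap detection from the list above can be sketched in a few lines. This is only an illustration under simple assumptions: shifts are (start, end) pairs with an exclusive end, and the `find_gaps` helper and sample times are made up for the example.

```python
from datetime import datetime


def find_gaps(shifts, window_start, window_end):
    """Return (start, end) intervals inside the window that no shift covers."""
    gaps = []
    covered_until = window_start
    for shift_start, shift_end in sorted(shifts):
        if shift_start >= window_end:
            break
        if shift_start > covered_until:
            gaps.append((covered_until, shift_start))
        covered_until = max(covered_until, shift_end)
    if covered_until < window_end:
        gaps.append((covered_until, window_end))
    return gaps


# Hypothetical shifts for one day; nobody is scheduled between 01:00 and 03:00.
shifts = [
    (datetime(2024, 6, 3, 9), datetime(2024, 6, 3, 17)),
    (datetime(2024, 6, 3, 17), datetime(2024, 6, 4, 1)),
    (datetime(2024, 6, 4, 3), datetime(2024, 6, 4, 9)),
]
gaps = find_gaps(shifts, datetime(2024, 6, 3, 9), datetime(2024, 6, 4, 9))
for gap_start, gap_end in gaps:
    print(f"No coverage from {gap_start} to {gap_end}")
```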

Why human-centric on-call matters

CERN’s Large Hadron Collider is a 27km particle accelerator where protons travel at close to the speed of light to replicate the primal state of the universe. Yet, what makes the billions of data points make any sense is the work of more than 100 thousand scientists and their life-long dedication to research.

“Tools and vendors come and go, technology evolves; what matters is that you have a capable team” — Global Head of Cloud and Container Platform Engineering at Saxo Bank

Your team and you may not be out there trying to find the next Higgs boson, but your objective is far from simple: making complex systems reliable. To get there, you need the best tools to simplify alerting and incident management. But as Jinhong Brejnholt, Global Head of Cloud and Container Platform Engineering at Saxo Bank, says, “tools and vendors come and go, technology evolves, what matters is that you have a capable team to keep your platform performant”. Your scheduling practices must therefore prioritize fostering your team in the long term.

Good SREs are hard to find

SREs and other roles that are likely to have on-call duties are not easy hires. You need someone who knows about application code, architecture, storage, security, performance, and CI/CD, so finding the right person is not trivial. Being on-call also requires a special mindset: SREs need to be ready to dive deep into someone else’s code in the midst of chaos.

And, when you do find the new golden SRE, it takes a while before they’re ready to handle an incident independently. They need to learn about your system and gain a lot of context in order to respond to incidents nimbly.

Retention and efficiency

People on-call are humans: when you put that at the core of your strategy, things go really well.

Google’s SRE State of DevOps 2023 report found that cultivating a culture that prioritizes employees’ well-being decreases burnout substantially, while improving employee satisfaction and increasing productivity significantly.

That means that when you give SREs flexibility, cultivate a safe working environment, and nurture work-life balance, you’re not only investing in talent retention. You’re also improving your net reliability practice.

Google's DevOps Report shows how flexibility improves software delivery productivity and reduces burnout.

We're all human after all

Acknowledge that your team is not an impersonal set of machines filling up your rotations schedule. Embrace their personal situations, preferences, and priorities. It takes some more effort, but it pays off. Below we’ll cover common considerations and tips to manage the human aspect of on-call shifts without compromising your availability.

Vacations & time off: how to adjust on-call schedules

Your team works hard; it’s only fair they can use their Paid Time Off (PTO) to enjoy a lush holiday in Lanzarote or reconnect with their family on a cross-state road trip.

But it’s not only vacations that lead people to take time off. There are other types of Out Of Office (OOO) situations, such as temporary absences due to a relative’s illness or mental health breaks.

In all cases, you need to make sure the person’s OOO time is respected without compromising your 100% on-call coverage or upsetting the rest of your team. Here are some tips:

Plan ahead

  • Encourage your team to share their OOO plans early. This helps you adjust the on-call schedule well in advance.
  • Keep a shared calendar with everyone's PTO to spot any coverage gaps easily.
  • Use tools like Rootly On-Call to automate and simplify the scheduling process. These tools help manage escalations and make it easier to tweak schedules when someone is away.
  • Establish a system for requesting and implementing overrides (shift changes to an existing on-call schedule).
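One way to picture overrides is as a thin layer on top of the base schedule: when someone asks who is on call, any matching override wins. The sketch below assumes shifts are (start, end, person) tuples with an exclusive end. The `who_is_on_call` helper and the sample data are hypothetical.

```python
from datetime import datetime


def who_is_on_call(base_shifts, overrides, moment):
    """Resolve the responder at `moment`; an override wins over the base shift."""
    for start, end, person in overrides:
        if start <= moment < end:
            return person
    for start, end, person in base_shifts:
        if start <= moment < end:
            return person
    return None  # nobody scheduled: a coverage gap


# Hypothetical schedule: Maya has the week, Bob covers one day while she is OOO.
base_shifts = [
    (datetime(2024, 6, 3), datetime(2024, 6, 10), "Maya"),
    (datetime(2024, 6, 10), datetime(2024, 6, 17), "Anton"),
]
overrides = [(datetime(2024, 6, 5), datetime(2024, 6, 6), "Bob")]

print(who_is_on_call(base_shifts, overrides, datetime(2024, 6, 4, 12)))
print(who_is_on_call(base_shifts, overrides, datetime(2024, 6, 5, 14)))
```

Because the base schedule is never edited, removing the override restores the original rotation without re-doing anything.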

Cross-training and documentation

  • Promote cross-training so several team members can handle various incidents, crucial when someone is on PTO.
  • Use shadowing on on-call shifts to help less experienced team members get up to speed under guidance.
  • Make sure everyone logs ongoing issues and important updates before heading out on PTO. This makes it easier for the on-call person to pick up where they left off.
  • Maintain open communication lines so the on-call team can reach out to off-duty folks in emergencies, if needed.

Review and adjust regularly

  • Recognize that being on-call can be draining. Encouraging or scheduling PTO after tough on-call shifts can help your team recover and maintain balance.
  • Periodically check how your on-call system is working and adjust based on your team’s feedback. This includes looking at workload balance and overall morale.
  • Try to rotate demanding on-call duties fairly among all capable team members.
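A quick workload-balance check can back up these reviews with numbers. The sketch below is only illustrative: the `on_call_hours` and `overloaded` helpers, the 1.25 tolerance, and the sample hours are assumptions, not a recommended threshold.

```python
from collections import Counter


def on_call_hours(shifts):
    """Sum scheduled hours per person from (person, hours) pairs."""
    totals = Counter()
    for person, hours in shifts:
        totals[person] += hours
    return totals


def overloaded(totals, tolerance=1.25):
    """List people whose hours exceed `tolerance` times the team average."""
    if not totals:
        return []
    average = sum(totals.values()) / len(totals)
    return sorted(p for p, h in totals.items() if h > average * tolerance)


# Hypothetical month: Bob picked up an extra week plus a weekend.
shifts = [("Bob", 168), ("Maya", 168), ("Anton", 168), ("Bob", 168), ("Bob", 48)]
totals = on_call_hours(shifts)
print(dict(totals))
print(overloaded(totals))
```

Hours alone do not capture how rough a shift was, so treat a report like this as a conversation starter alongside your team's feedback.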

Paying for On-Call

People on-call have to be mentally and physically ready to react to an incident at any time during their personal time, which means they won’t be able to fully relax—even their sleep can be interrupted! Thus, more and more companies are building compensation schemes for their on-call teammates.

Common compensation strategies

  • Additional Pay and Stipends: Offers direct, tangible rewards for on-call work, though it may not address potential long-term stress.
  • Compensatory Time Off: Provides time off to compensate for on-call hours, supporting work-life balance but may not appeal to everyone as much as direct pay.
  • Differential Pay Rates: Pays more during undesirable hours, compensating for the inconvenience but can be complex to manage.
  • Tiered On-Call Levels: Aligns compensation with the level of on-call duty, reflecting the true demand placed on the SRE but may seem unfair or complex.
  • Performance-Based Bonuses: Ties bonuses to performance metrics, fostering high standards but potentially reducing teamwork if too focused on individual metrics.
  • Wellness and Support Programs: Recognizes the physical and psychological toll of on-call duties by including support programs to enhance overall well-being.
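To see how a stipend and a differential rate combine, here is a small worked sketch. Every number in it is a made-up placeholder, and the `on_call_pay` function is hypothetical; actual schemes vary by company and by local labor law.

```python
# All rates below are made-up placeholders for illustration only.
WEEKLY_STIPEND = 200.0      # flat amount for carrying the pager for a week
BASE_HOURLY = 40.0          # paid per hour spent actively responding
OFF_HOURS_MULTIPLIER = 1.5  # applied to night and weekend response hours


def on_call_pay(weeks_on_call, business_hours_worked, off_hours_worked):
    """Combine a flat stipend with regular and differential response pay."""
    stipend = weeks_on_call * WEEKLY_STIPEND
    regular = business_hours_worked * BASE_HOURLY
    differential = off_hours_worked * BASE_HOURLY * OFF_HOURS_MULTIPLIER
    return stipend + regular + differential


# One week on call, 2 hours of daytime response, 3 hours at night:
# 200 + (2 * 40) + (3 * 40 * 1.5) = 460
print(on_call_pay(1, 2, 3))
```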

For a more in-depth look at compensation, check out this article on paying for on-call.

Onboarding new joiners to on-call shifts

Being on-call for the first time in a new company can cause anxiety. Ensure your new hire feels empowered and safe from day one by preparing the road for them so they don’t feel lost and alone as they tackle their first incidents while on-call.

Curate documentation and resources

Having a trustworthy reference to find information is invaluable while you get accustomed to a new system. Take the time to curate a list of up-to-date docs and resources that can help your new teammate navigate an incident.

Progressively increase ownership

Avoid throwing a new joiner into a big fire on their first day without the right context and preparation. You can start by providing them with incident exercises and by exposing them to how your team goes through resolution processes via shadowing:

  • Simulation Practices: use both high and low-fidelity simulations to safely expose new hires to incident management without risking real-world stakes.
  • Shadowing: have the new hire shadow a senior SRE during incident resolutions to gain firsthand experience.
  • Reverse Shadowing: let the new teammate lead incident resolutions with a senior as backup, transitioning them into a more active role.
Screenshot of an On-Call shadowing rotation
Tools like Rootly On-Call make it simple to schedule shadow rotations.

Mentorship and support

Getting ramped up for on-call requires one-on-one support and mentorship so that new joiners understand their responsibilities and have everything they need to succeed.

  • Set explicit expectations: clearly communicate what is expected of them in their new role regarding on-call duties.
  • Introduce key roles: Help the new hire navigate the org chart by suggesting introductions to people they’d likely need to work with in case of an incident.
  • Encourage ownership: Foster decision-making confidence by ensuring they understand and feel confident in their responsibilities.

Inclusion and diversity in SRE teams

When you’re always fighting fires, inclusion and diversity rapidly fall down your priority list. However, research suggests that actively pushing for diversity in your team might actually be the key to getting you out of that permanent crisis state.

Women and gender diversity on-call

Globally, women represent 29% of the STEM workforce. But if you look at the self-reported gender statistics for KubeCon, only 6% of attendees identify as women (<1% as non-binary). If you have ever attended these events, you’ll easily recognize that we still have a lot of work to do on diversity and inclusion.

DevOps and SRE are ecosystems that lag particularly far behind on inclusion. Below are a few best practices to help your organization:

  • Don’t assume, ask: good intentions can sometimes be perceived as patronizing. It's important to directly ask team members what they need instead of making assumptions, especially if you're not from a marginalized group yourself.
  • Intentional and safe environments: women in male-dominated fields often face more pressure to prove their competence. Acknowledge these challenges and actively create a supportive space where concerns and preferences can be freely expressed.
  • Promote career paths: encourage the growth of underrepresented groups into SRE roles through active mentorship programs that provide opportunities for all team members to reach their full potential.

Parents on-call

Parenthood is to be celebrated! More and more countries and companies are recognizing the need for parental leave. Sweden, for example, offers 480 days of paid parental leave. New parents will change your schedule while they're away, but that's only the beginning. Your colleagues with kids of any age should be able to prioritize their family without compromising their careers. This is where you shine: making on-call rotations parent-friendly. Check out these best practices:

  • Flexible schedules: kids have different holidays, get sick, or sometimes need to be picked up early. Parents need flexibility during on-call times, which might mean asking for coverage or exchanging a weekend shift with a colleague. Tools like Rootly On-Call make it easy for anyone to ask for partial or full coverage of a shift through a mobile app.
  • Work-life balance: being an SRE is demanding by nature, but being a parent is even more so. You'll need regular check-ins with staff members who have children to make sure they're not risking burnout and that their job satisfaction stays on the positive side.
  • Childcare support: some teams offer direct support for childcare, such as subsidies or on-site childcare facilities. This reduces the logistical and financial burdens on parents and can significantly ease the stress associated with balancing professional and personal responsibilities.

Conclusion

Since the industrial revolution, we’ve been forced to choose between human life and productivity. However, technology has gone through several cycles in which making things human-first was a catalyst for success. From interfaces to user experience, and tooling to developer experience, acknowledging the complexity of humans leads to better performance. Being on-call is no different: it’s time to enter the responder experience era.
