The Pros and Cons of Embedded SREs
A comparison of the two main SRE team models: Embedded SREs vs. standalone SRE teams.
July 9, 2024
Minimize alert fatigue by distributing incoming alerts evenly across responders with a Round Robin schedule. This strategy comes in two variations and can benefit some teams more than others.
The concept of Round Robin comes from sports. The name has nothing to do with anyone called Robin; it is said to derive from the French word ruban (ribbon). In a Round Robin tournament, all participants face each other by taking turns. When applied to on-call schedules, a Round Robin escalation policy means that responders assigned to a level will take turns responding to alerts.
When is this strategy useful, and when isn’t it? In this article, we’ll dig into the key aspects of Round Robin escalation policies, including the two types available, and best practices to improve the responder team’s dynamics.
You never know when something will go wrong with your website, app, or a provider you rely on. That’s why having someone available—even if it’s outside business hours or on a holiday—to get your service back to normal at any time is crucial. This is where on-call schedules come into the reliability story: you distribute the responsibility for keeping everything running 24/7 across different shifts or rotations that you organize in a schedule.
However, it’s not enough to have a single person available to respond to alerts. What if that on-call person is out of reach for any reason? Maybe their phone ran out of battery, or they got into an accident. That’s when escalation policies come in: they define a hierarchy of layers so that no alert goes through the system without being acknowledged and handled adequately.
Escalation policies let you define a set of responders or teams in each layer, as well as define how you want them to be contacted. If a layer doesn’t acknowledge the alert within a certain timeframe, then the next layer of responders will be notified. The whole escalation policy can be repeated a few times if needed.
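To make the layer-by-layer behavior concrete, here is a minimal Python sketch of an escalation policy. It is an illustration only, not how any particular on-call product implements escalation: the `Layer` class, the `escalate` function, and the `acknowledges` callback are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    responders: list[str]   # who is notified at this layer
    timeout_minutes: int    # how long the layer has to acknowledge


def escalate(layers, acknowledges, repeats=1):
    """Walk the policy and return (layer_index, responder) for the first ack.

    `acknowledges(responder, timeout_minutes)` is a hypothetical callback that
    returns True if that responder acknowledges the alert within the timeout.
    Returns None if the whole policy runs `repeats` times with no ack.
    """
    for _ in range(repeats):
        for index, layer in enumerate(layers):
            for responder in layer.responders:
                if acknowledges(responder, layer.timeout_minutes):
                    return index, responder
            # Nobody in this layer acknowledged in time: notify the next layer.
    return None


policy = [Layer(["alice"], 5), Layer(["bob", "carol"], 10)]
reachable = {"carol"}  # assume only carol can pick up right now
result = escalate(policy, lambda who, timeout: who in reachable, repeats=2)
print(result)  # (1, 'carol')
```

In this run, the first layer times out because alice is unreachable, so the alert moves to the second layer, where carol acknowledges it.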
It’s a quiet Saturday but you’re on-call. You get an alert at 7 pm that interrupts your dinner. You get another alert at 9 pm, interrupting wine time with your partner. At 11 pm, right before bed, another alert pops up. It’s getting annoying. When a new alert wakes you up at 3 am, will you throw the phone at the wall? This feeling is called alert fatigue, and it’s unfortunately common in people with on-call shifts.
A Round Robin escalation policy can help reduce alert fatigue in your team by distributing the alerts evenly among responders. In a Round Robin schedule, incoming alerts are not all given to a single responder. Instead, responders take turns handling incoming alerts in sequential order.
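As a rough sketch of what "taking turns in sequential order" means, the Python snippet below assigns the four alerts from the Saturday example to three responders. The responder names and the `assign_round_robin` helper are made up for illustration.

```python
from itertools import cycle


def assign_round_robin(alerts, responders):
    """Pair each alert with the next responder in line, wrapping around."""
    turn = cycle(responders)
    return [(alert, next(turn)) for alert in alerts]


alerts = ["7 pm", "9 pm", "11 pm", "3 am"]
assignments = assign_round_robin(alerts, ["alice", "bob", "carol"])
for alert, responder in assignments:
    print(f"{alert} alert -> {responder}")
```

With three responders, nobody handles two alerts in a row, and the 3 am alert goes back to the first responder only after the other two have had their turn.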
Traditionally, implementing a Round Robin escalation policy required a lot of manual work, especially from the on-call manager. The manager would set up a spreadsheet where people could fill in when they responded to an alert and resolved an incident. This manual tracking determined whose turn it was to respond to the next alert.
However, modern on-call solutions like Rootly On-Call automate this work for you. With a single click, you can make a level in your escalation policy behave like a Round Robin cycle.
In a Round Robin escalation policy, responders take turns addressing incoming alerts. But what happens when the responder in charge of the next alert doesn’t acknowledge it in time? This is where Round Robin can vary: in an alert-based Round Robin, an alert that isn’t acknowledged in time jumps to the next escalation level; in a cycle-based Round Robin, the alert instead goes to the next responder in line on the same level.
Alert-based Round Robin escalation policies are the most popular option, even supported by legacy on-call solutions like PagerDuty. In this type, each incoming alert is assigned to a different responder in turn. If the assigned responder fails to acknowledge the alert they received, the escalation policy will notify the next level.
The advantage of this type is that responders in the same level share alerts and responsibility more or less evenly.
In a cycle-based Round Robin, an incoming alert is assigned to responders in turns. But if the responder who received the request doesn’t acknowledge it within a specified timeframe, it is passed to the next user in the Round Robin level. The alert will only escalate to the next level if it goes unacknowledged throughout the entire cycle.
The advantage of this type is that the next level in the escalation policy is less burdened as they don’t have to handle every alert that slips through the previous stage.
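The difference between the two types can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the implementation used by Rootly, PagerDuty, or any other tool: `notification_path`, `route`, the `mode` strings, and the `reachable` set are all hypothetical names for this example.

```python
NEXT_LEVEL = "escalate to next level"


def notification_path(responders, start, mode):
    """List who is notified, in order, before the alert leaves this level."""
    if mode == "alert":
        # Alert-based: only the responder whose turn it is gets notified.
        return [responders[start]]
    if mode == "cycle":
        # Cycle-based: start with the responder whose turn it is, then go
        # through everyone else on the level once.
        count = len(responders)
        return [responders[(start + i) % count] for i in range(count)]
    raise ValueError(f"unknown mode: {mode}")


def route(responders, start, mode, reachable):
    """Return the responder who acknowledges, or NEXT_LEVEL if nobody does."""
    for responder in notification_path(responders, start, mode):
        if responder in reachable:
            return responder
    return NEXT_LEVEL


level = ["alice", "bob", "carol"]
# It is alice's turn, but only bob is reachable.
print(route(level, 0, "alert", {"bob"}))  # escalate to next level
print(route(level, 0, "cycle", {"bob"}))  # bob
```

The same missed acknowledgement produces two different outcomes: the alert-based level hands the alert upward immediately, while the cycle-based level gives the remaining responders a chance first.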
Most teams can benefit from using a Round Robin escalation policy, as long as they have multiple responders available on call. However, this strategy can be a bigger win for teams that expect a large number of alerts.
When a team consistently receives alerts across time zones, weekends, and holidays, having a single on-call responder take care of every incident that pops up can be challenging. Not only is it overwhelming for the person on call, but it also affects their ability to respond to alerts effectively throughout their shift. In the longer term, it can lead to burnout among responders.
By organizing shifts with multiple responders on-call, you can distribute the alert burden among them to minimize the risk of alert fatigue and improve your overall MTTR.
Distributing on-call work through Round Robin escalation policies has several advantages, but it also requires adapting the approach to your team’s dynamics and culture.
Using Round Robin escalation policies can help you mitigate alert fatigue in your team and improve how quickly incidents are resolved. Consider whether an alert-based or a cycle-based Round Robin is the best fit for the structure of your on-call team. Rootly On-Call comes with both strategies built in, so you can enable them with a click on any of your escalation layers. Feel free to schedule a demo if you want to take a closer look.