August 30, 2024
6 mins
Learn five common incident response anti-patterns that could be sabotaging your team’s efficiency, and how to avoid them.
It’s been roughly 20 years since Google coined the term Site Reliability Engineering. Thousands of teams have adopted these ideas since, creating a wealth of experience on what works and what doesn’t. Some practices seem intuitive and may even work for some time, but become expensive problems in the long run or when your circumstances change. That’s what we call anti-patterns: practices that deteriorate overall reliability and become a burden to your incident management team.
Ricardo Castro, Principal SRE Engineer at FanDuel, explains that, just as in other software disciplines like DevOps and Agile, reliability teams need to be self-critical and ask themselves, “Where might we be going wrong?” Asking that question pushes teams to look a layer deeper and better define what they are trying to achieve.
In this article, you’ll learn about common anti-patterns that SRE teams run into, usually without even realizing it. These are not rigid norms, but general approaches to several areas of your incident management process.
Incidents are high-stress moments where responders have to manage a lot of complexity, and do so as fast as possible. In the midst of this chaos, it’s not rare to run into misunderstandings about who does what. The result is duplicated work, forgotten tasks, and delayed resolution times.
A lack of clear ownership leads to ineffective incident resolution. Without a clear vision of who’s in charge of what, you may end up leaving bases uncovered. Or the opposite: you may end up waking up too many people for nothing, which will impact their mental health and productivity the next day.
Response teams are unique in the sense that they’re usually small and have to figure out ways of moving nimbly without stepping on each other’s toes. That’s why it’s crucial to establish clear roles when building your incident response team.
Incident response roles give each responder ownership over one aspect of an incident. Roles can, of course, be tied to specific tasks. But incidents come in all shapes, so delegating ownership rather than a fixed checklist works well: it gives responders the autonomy to act as they see fit according to the circumstances.
Plus, ownership comes with accountability. That means responders are more likely to commit to their responsibilities and be more intentional about them.
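As a rough illustration of the idea, here is a minimal Python sketch of explicit role ownership. The role names, the `Incident` class, and the responder names are all hypothetical and not taken from any particular tool; the point is only that each role has exactly one owner and that uncovered roles are easy to spot.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    # Hypothetical role names; adapt them to your own team.
    INCIDENT_COMMANDER = "incident commander"
    COMMUNICATIONS_LEAD = "communications lead"
    OPERATIONS_LEAD = "operations lead"


@dataclass
class Incident:
    title: str
    owners: dict[Role, str] = field(default_factory=dict)

    def assign(self, role: Role, responder: str) -> None:
        """Give one responder ownership of a role; refuse to silently overwrite."""
        if role in self.owners:
            raise ValueError(f"{role.value} is already owned by {self.owners[role]}")
        self.owners[role] = responder

    def unowned_roles(self) -> list[Role]:
        """Roles nobody owns yet, i.e. the bases left uncovered."""
        return [role for role in Role if role not in self.owners]


incident = Incident("Checkout latency spike")
incident.assign(Role.INCIDENT_COMMANDER, "alice")
incident.assign(Role.COMMUNICATIONS_LEAD, "bob")
print([role.value for role in incident.unowned_roles()])  # ['operations lead']
```

Refusing to overwrite an existing owner is a deliberate choice in this sketch: a handoff should be an explicit step, not a side effect of two people claiming the same role.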
There are several types of communication involved in an incident, from the updates responders share with each other to the ones stakeholders expect.
However, what usually happens is that communication types and expectations can get mixed up along the way. For example, VPs get nervous when they don’t see new updates, thinking that perhaps silence means there is no progress being made. Then they interrupt the team to ask for updates, which is often unproductive as responders need to stop what they’re doing to write a summary for the person demanding information.
Having inconsistent communications causes tensions and misunderstandings that end up delaying the incident resolution. Ashley Sawatsky, Reliability Advocate at Rootly and former Incident Communications Lead at Shopify, explains that SRE teams should work with stakeholders to establish explicit communication plans.
Having a structured communication plan sets explicit expectations between incident responders and any other interested parties, making it easier to have information flow effectively. The key is pinning down who needs to know what and when, as well as establishing who will own each communication item.
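One way to picture such a plan is as a small table of who needs to know what, how often, and who owns sending it. The Python sketch below is only an illustration under assumed values: the audiences, cadences, owners, and the `overdue` helper are hypothetical placeholders, not a recommended standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CommItem:
    audience: str     # who needs to know
    content: str      # what they need to know
    every: timedelta  # how often they expect an update
    owner: str        # who is responsible for sending it


# Hypothetical plan; audiences, cadences, and owners are placeholders.
PLAN = [
    CommItem("responders", "technical findings", timedelta(minutes=10), "operations lead"),
    CommItem("executives", "impact and next update time", timedelta(minutes=30), "communications lead"),
    CommItem("customers", "status page update", timedelta(minutes=60), "communications lead"),
]


def overdue(plan, last_sent, now):
    """Return items whose audience has waited at least as long as the agreed cadence."""
    return [item for item in plan if now - last_sent[item.audience] >= item.every]


start = datetime(2024, 8, 30, 3, 0)
last_sent = {item.audience: start for item in PLAN}
due = overdue(PLAN, last_sent, start + timedelta(minutes=35))
print([(item.audience, item.owner) for item in due])
# [('responders', 'operations lead'), ('executives', 'communications lead')]
```

With the cadence written down, a stakeholder who has not heard anything in 20 minutes knows an update is still 10 minutes away, instead of assuming that silence means no progress.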
Incident response management software like Rootly can help coordinate a lot of this communication, either through workflows that trigger notifications to specific Slack channels or through GenAI that drafts updates on the incident state so responders don’t have to.
I hear you, you just stayed up until 3 am this Saturday getting that one weird service back up. The last thing you want to do this week is go back to that and write a report about it. But going through the retrospective process can actually prevent it from happening again, even if it doesn’t feel like it at the time.
There are many reasons why teams end up delaying the retrospective: new fires to put out, tickets and requests piling up, random meetings, and a long list of etceteras. But once you let it wait for more than a few days, there are only two possible outcomes: you do it quickly to get it over with (and nobody remembers much detail by now), or you quietly decide you won’t do it.
Perhaps not all incidents require a retrospective. But running a retrospective process effectively is a skill that your team needs to cultivate, so it’s best if your criteria for a retrospective make it a frequent practice rather than an extraordinary one.
You can leverage tools like Rootly to automate a lot of your retrospective writing, reducing the workload imposed on your responders. For example, Rootly can construct a timeline for you based on key events and can guide your team through the right retrospective template according to the incident type and severity.
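To make the timeline idea concrete, here is a minimal Python sketch that orders a handful of key events and labels each with the minutes elapsed since the first one. It is not how Rootly builds timelines; the events, timestamps, and the `build_timeline` helper are invented for illustration.

```python
from datetime import datetime

# Hypothetical key events, as they might be collected from chat, alerts, and deploys.
events = [
    ("2024-08-30T03:12:00", "Rollback deployed"),
    ("2024-08-30T02:41:00", "Alert fired: checkout error rate above threshold"),
    ("2024-08-30T02:47:00", "Incident declared"),
    ("2024-08-30T03:20:00", "Error rate back to normal"),
]


def build_timeline(raw_events):
    """Sort events chronologically and label each with minutes since the first one."""
    parsed = sorted((datetime.fromisoformat(ts), text) for ts, text in raw_events)
    start = parsed[0][0]
    return [
        f"T+{int((ts - start).total_seconds() // 60):>3} min  {text}"
        for ts, text in parsed
    ]


timeline = build_timeline(events)
print("\n".join(timeline))
```

Even a bare list like this gives the retrospective a shared starting point, so the discussion is not limited to what people happen to remember days later.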
Repeatability of your incident management process is a clear sign of maturity. You’ve collected enough experience to structure playbooks that work for your systems and teams, and you’ve trained your team to work effectively within those guidelines. You’re reducing guesswork for your responders so they can focus on resolving the incident.
Naturally, that leads to having repetitive tasks across incidents. Thankfully, incident response solutions can help you automate those tasks so your responders don’t even have to deal with them anymore.
Boasting 70+ integrations, Rootly can fit into the way your team already works. Whether you want action items that come out of an incident to be filed and tracked in Jira, or to give developers visibility into active incidents through Backstage, Rootly lets you do it through no-code integrations.
Legacy on-call tools like PagerDuty require a lot of manual work to achieve simple tasks. For example, if you have responders going on PTO, you’ll have to manually update several schedules to reflect the change. Or if one of your on-call responders has to pick up their kid from school early, PagerDuty makes it a heavily bureaucratic process to have a colleague cover for them.
Legacy alerting and incident response tools slow your team down, tying it to the monolithic pace for which these solutions were designed. Tools like PagerDuty and OpsGenie will always force you to make fragile setups to make your schedules fit the way you work, often having to purchase additional features or seats from them.
Instead, using modern tools like Rootly not only simplifies your incident response but also enables your team to collaborate in new, nimble ways. For example, shadow rotations can be easily set up with a few clicks in Rootly, enabling your team to reinforce a constant-training culture. Setting up Round Robin escalation policies to reduce alert fatigue in your responders is also as simple as one click in Rootly.
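For readers unfamiliar with the term, a round robin policy simply hands each new alert to the next responder in a fixed rotation. The Python sketch below shows the general technique only; the `RoundRobinPolicy` class and responder names are hypothetical and do not reflect Rootly’s implementation or API.

```python
from itertools import cycle


class RoundRobinPolicy:
    """Rotate alerts across responders so no single person takes every page."""

    def __init__(self, responders):
        if not responders:
            raise ValueError("at least one responder is required")
        # cycle() repeats the responders in order, indefinitely.
        self._rotation = cycle(responders)

    def next_responder(self):
        """Return whoever is next in the rotation."""
        return next(self._rotation)


policy = RoundRobinPolicy(["alice", "bob", "carol"])
pages = [policy.next_responder() for _ in range(5)]
print(pages)  # ['alice', 'bob', 'carol', 'alice', 'bob']
```

Spreading pages evenly like this is what reduces alert fatigue: over five alerts, nobody in the example is paged more than twice.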
Recognizing and remediating anti-patterns in your reliability practice can help you enable your responders to do more while minimizing the risk of burnout. Take the time to periodically review your incident management process and look for signs where you see problems arising from a lack of clear ownership or inconsistent communications. Talk with your response team to gauge whether they find the current retrospective process helpful and if your existing tools are helping your team get ahead towards their SLOs.
Book a demo with our reliability experts to see how Rootly can help you break anti-patterns. Hundreds of leading SRE teams, including LinkedIn, NVIDIA, and Webflow, trust Rootly to help them improve their performance.