A Site Reliability Engineer’s Guide to the Holiday Season
SREs face special challenges during the holidays. Here’s how to manage them.
August 29, 2024
5 mins
Treat emails, vendor updates, and calls as alerts using your existing escalation policies and rotations.
Observability is vital for any organization, but it is complex and usually expensive, and for good reason: gaining visibility into your system’s internal state requires sophisticated instrumentation, large-scale data processing, and skilled professionals.
However, no matter how good your observability stack is, there are issues it won’t pick up. For example, a customer may hit a problem in an edge case nobody thought of. Your telemetry doesn’t capture it, so nobody gets an alert, yet there is a real incident impacting your customers.
Alternative ways to collect alerts can be a valuable safety net for your team, and best of all, they require almost no implementation work. Each one suits some teams better than others, so we’re including best practices for each of them. Let’s get to it!
If your observability solution doesn’t detect a flaw in a service, that doesn’t mean the issue your users are experiencing isn’t real. Giving customers a way to bring a problem to your attention can also strengthen your relationship with them. An inbox that creates alerts is especially useful outside working hours.
Setting up email as an alert source lets you integrate those messages into your regular alert workflows. Each email is routed to the on-call responder and goes through regular triage, and you can then create an incident and escalate it if needed. You use your existing incident management workflows instead of relying on manual steps and a separate inbox to check.
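To make the idea concrete, here is a minimal sketch of how an inbound email could be turned into an alert before it enters the usual routing. It relies only on Python’s standard `email` package. The alert fields (`source`, `title`, `reporter`, `description`, `urgency`) and the "urgent" keyword rule are assumptions made for illustration, not the schema of any particular on-call product.

```python
from email import message_from_string, policy
from email.utils import parseaddr


def email_to_alert(raw_email: str) -> dict:
    """Convert a raw email into a generic alert dict (hypothetical schema)."""
    msg = message_from_string(raw_email, policy=policy.default)

    # Extract the bare sender address, e.g. "ana@example.com".
    _, sender = parseaddr(str(msg.get("From", "")))
    subject = str(msg.get("Subject", "")).strip() or "(no subject)"

    # Use the plain-text body, if there is one, as the alert description.
    body_part = msg.get_body(preferencelist=("plain",))
    description = body_part.get_content().strip() if body_part is not None else ""

    # Assumed triage rule: a subject containing "urgent" pages with high urgency.
    urgency = "high" if "urgent" in subject.lower() else "normal"

    return {
        "source": "email",
        "title": subject,
        "reporter": sender,
        "description": description,
        "urgency": urgency,
    }
```

In practice the on-call tool does this conversion for you. The point is only that an email carries enough structure (sender, subject, body) to be triaged like any other alert.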
You likely rely on dozens of vendors to deliver your software, some more critical than others. An outage in certain external dependencies can compromise your business and overall availability, which is why some SRE teams set up alerts to get early notice of potential disruptions.
You can use services like IsDown to aggregate status pages from different vendors and send notifications to your alerting software through webhooks.
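As a rough illustration, a webhook receiver might translate a vendor status event into an alert along these lines. The payload keys (`vendor`, `status`, `title`), the status values, and the `CRITICAL_VENDORS` set are hypothetical; the real payload depends on the status aggregator you use, so check its documentation before relying on any field name.

```python
from typing import Optional

# Hypothetical list of vendors whose outages should page someone immediately.
CRITICAL_VENDORS = {"stripe", "aws"}

# Assumed status values that mean "nothing to alert on".
HEALTHY_STATUSES = {"operational", "resolved"}


def vendor_event_to_alert(payload: dict) -> Optional[dict]:
    """Map a (hypothetical) status-page webhook payload to an alert, or None."""
    status = str(payload.get("status", "")).lower()
    if status in HEALTHY_STATUSES:
        return None  # healthy or recovered: no alert needed

    vendor = str(payload.get("vendor", "unknown"))
    # Page on critical dependencies; everything else is informational.
    urgency = "high" if vendor.lower() in CRITICAL_VENDORS else "low"

    return {
        "source": "vendor-status",
        "title": f"{vendor}: {payload.get('title', status)}",
        "urgency": urgency,
    }
```

Separating critical from non-critical vendors keeps low-impact status changes from paging the on-call responder while still recording them.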
An email inbox that generates alerts is good enough for many users. But if you’re a business-critical partner to your customers, giving them a more immediate way to reach you when something goes wrong reinforces your commitment to their success.
Usually, you’d offer a fixed phone number that connects your customer to whoever is on call in your team, either through a live call or by leaving a voicemail. Live rerouting means that the fixed number dynamically routes the call or voicemail to the right responder, and can even go through an escalation policy if applicable.
To enable on-call rerouting, you’ll need an alerting solution that supports it, such as Rootly On-Call. That way, you reuse your existing rotations and escalation policies, which minimizes the work needed to offer this option to your customers.
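The routing logic can be sketched as a walk through escalation levels. This is a simplified model under stated assumptions, not how any specific product implements it: the `levels` structure and the `answers` callback are hypothetical stand-ins for a real rotation schedule and telephony integration.

```python
from typing import Callable, List, Optional


def route_call(levels: List[List[str]], answers: Callable[[str], bool]) -> Optional[str]:
    """Return the first responder who picks up, trying each escalation level in order.

    levels:  escalation policy, e.g. [["primary"], ["secondary", "manager"]]
    answers: hypothetical callback that rings a responder and reports
             whether they answered.
    """
    for level in levels:
        for responder in level:
            if answers(responder):
                return responder
    # Nobody answered at any level: the caller would be sent to voicemail.
    return None


# Example: the primary does not pick up, so the call escalates to level two.
policy_levels = [["alice"], ["bob", "carol"]]
picked_up = route_call(policy_levels, lambda responder: responder == "bob")
```

A real system would also apply ring timeouts and pick each level’s responders from the current rotation, but the fallthrough order is the core of the idea.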
Best practices when enabling live call rerouting:
Enabling emails, vendor statuses, and live calls as alert sources can be as easy as ticking a few checkboxes in modern on-call solutions like Rootly On-Call. You can support valuable use cases without burdening your SRE team with implementation work, and strengthen your position as a partner to your customers.