Top 9 Skills for SREs from ex-Instacart SRE
A list of the top nine SRE skills, from incident management, to cloud computing, to networking and beyond.
October 23, 2024
7 mins
SREs need an incident management solution that’s intuitive, flexible, and powerful. In this post, we explore the key features to consider when evaluating incident management tools, from automation to multi-cloud redundancy.
If you know a professional cook, you’ve likely seen them carrying around their knife roll. Their work is very precise, and they have specific preferences for how it’s done. That’s why they bring their own exceptionally sharp knives to the kitchen rather than using whatever knife happens to be lying around.
SREs also perform very precise work and need the best tools available. You need an incident management solution that works for your team based on your processes, tech stack, and budget. For example, if your team uses Slack, Linear, and Datadog, your incident management software should integrate seamlessly with them. If your team relies on automations, your incident response tool should offer simple yet powerful options.
Following the cooking analogy: a dull knife is not only less effective, it’s dangerous. For your SREs, relying on sub-par tools can be detrimental. The stakes are higher than "we’d be 15% faster if we had XYZ tool." An inadequate incident response tool can introduce confusion and frustration into your resolution process.
In this blog post, you’ll find out which features to look for in an incident management tool and get ideas for criteria you can use when choosing a vendor.
If you need people to go through “PagerDuty University” (50+ corporate videos) just to get a push notification, you’re probably not making the best use of everybody’s time.
Incident management software shouldn’t require you to look at a manual or Google how to perform everyday tasks like shift overrides or setting up a 24/7 schedule.
An easy-to-use, intuitive tool helps ensure faster onboarding and promotes cross-functional collaboration during an incident, unlike legacy tools that are clunky and designed for engineers only.
When your on-call scheduler requires you to buy a dummy seat just so you can leave intentional gaps in your schedule (looking at you, PagerDuty and OpsGenie), the level of customization of your alerting solution is quite low.
After years of frustration, many SREs have accepted that PagerDuty won’t let them create schedules that work for them without jumping through hoops. Want to page multiple teams? Need different rules for incidents that occur off-hours vs. during business hours? You’ll have to implement numerous hacks that will keep breaking over time.
The truth is, your on-call and incident management should adapt to your team’s workflows and current needs. Reliability is difficult enough on its own; your incident response solution should work for you, not become another chore for your team.
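To make the off-hours vs. business-hours case concrete, here is a minimal Python sketch of what time-aware routing to multiple teams could look like. Everything in it is an assumption for illustration: the `RoutingRule` type, the team names, and the 9-to-5 weekday window are hypothetical and don’t reflect any vendor’s actual API.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class RoutingRule:
    """Hypothetical rule: which teams to page for a severity, by time of day."""
    severity: str
    business_hours_teams: tuple
    off_hours_teams: tuple


# Assumed business window: Monday to Friday, 09:00 (inclusive) to 17:00 (exclusive).
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def is_business_hours(at: datetime) -> bool:
    # weekday() returns 0 for Monday, so values below 5 are weekdays.
    return at.weekday() < 5 and BUSINESS_START <= at.time() < BUSINESS_END


def teams_to_page(rules, severity: str, at: datetime) -> tuple:
    """Return the teams to page, or an empty tuple if no rule matches."""
    for rule in rules:
        if rule.severity == severity:
            if is_business_hours(at):
                return rule.business_hours_teams
            return rule.off_hours_teams
    return ()


# Example: a SEV1 pages two teams during the day and one on-call rotation at night.
rules = [RoutingRule("sev1", ("payments", "platform"), ("platform-oncall",))]
print(teams_to_page(rules, "sev1", datetime(2024, 10, 23, 10, 0)))  # a Wednesday morning
print(teams_to_page(rules, "sev1", datetime(2024, 10, 26, 10, 0)))  # a Saturday morning
```

If a tool can’t express something this small without workarounds, that’s a sign its customization won’t keep up with your team.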
When someone is on call, they need to ensure they’re reachable and can address potential issues immediately. But what if I’m on call and I get a call from my kid’s school saying she’s unwell and needs to be picked up? I just need someone to cover for me for an hour or two. That’s when your incident management software’s flexibility is put to the test.
In legacy on-call tools, making an override in the schedule feels like performing a ritual dance with intricate steps. Making a partial shift override in PagerDuty also involves a lot of frustration.
Your alerting solution should handle real-life incident management scenarios, whether that’s covering for someone who’s sick, managing special events like a Black Friday sale, or handling last-minute changes. In each case, making the change should take a few clicks, not a workaround.
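Conceptually, a partial shift override is simple: a short window layered on top of the base schedule that wins while it’s active. The sketch below shows one way to model that in Python. The `Shift` type and `who_is_on_call` function are hypothetical names invented for this example, not how any particular tool implements overrides.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Shift:
    """A responder covering the half-open interval [start, end)."""
    responder: str
    start: datetime
    end: datetime


def who_is_on_call(base, overrides, at: datetime) -> Optional[str]:
    """Resolve the responder at a point in time.

    Overrides take priority over the base schedule. If several overrides
    overlap, the one listed last wins (an assumption made for this sketch).
    """
    for override in reversed(overrides):
        if override.start <= at < override.end:
            return override.responder
    for shift in base:
        if shift.start <= at < shift.end:
            return shift.responder
    return None  # an intentional gap: nobody is on call


# Alice has the day shift; Bob covers 13:00 to 15:00 for the school pickup.
base = [Shift("alice", datetime(2024, 10, 23, 9), datetime(2024, 10, 23, 17))]
overrides = [Shift("bob", datetime(2024, 10, 23, 13), datetime(2024, 10, 23, 15))]

for hour in (12, 14, 15, 18):
    print(hour, who_is_on_call(base, overrides, datetime(2024, 10, 23, hour)))
```

Note that the base shift is never edited. Once the override window ends, the schedule falls back to Alice on its own, and a time with no shift at all resolves to `None` without needing a dummy user.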
You need to make absolutely sure that your alerting and incident management solution is reliable. The lack of reliable options from 2009 through the early 2020s made PagerDuty the industry leader. Its rivals at the time, OpsGenie and ZenDuty, were known for long maintenance windows or worrying outages for their alerting services.
However, modern on-call solutions are built with state-of-the-art infrastructure. Rootly On-Call, for example, is the only alerting solution offering multi-cloud redundancy. That means even if AWS has an outage, you still won’t miss a single alert.
Check each vendor’s status page to see how many incidents they’ve had over the past year. Be critical about the SLA offered and investigate what measures they’re taking to ensure the availability of critical services.
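When comparing SLAs, it helps to translate the percentage into the downtime it actually permits. The short Python sketch below does that arithmetic. The function name and the SLA tiers in the loop are illustrative choices, not figures from any vendor’s contract.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float = 365.0) -> float:
    """Minutes of downtime an availability SLA permits over a period."""
    if not 0.0 <= sla_percent <= 100.0:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# Compare a few common availability tiers over one year.
for sla in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(sla)
    print(f"{sla}% allows about {minutes:.1f} minutes per year")
```

A 99.9% SLA, for instance, still permits roughly 8.76 hours of downtime a year. For an alerting service, it’s worth asking whether that budget could land in the middle of your own outage.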
Automation is often used as a catch-all phrase, but there are several areas throughout an incident’s lifecycle where automation can add real value.
Your on-call and alerting solution may support some of these automations or offer them as paid add-ons.
By using AI, the SRE team at Google experienced a 50% increase in velocity when writing incident reports. You can also leverage AI to make incident response faster. Modern on-call and incident management solutions offer AI features that boost responders’ productivity.
Ensure that your organization’s privacy guidelines align with how the incident management software handles data for AI processing. Also, be aware that some vendors charge a base fee for AI access plus a cost per usage.
Most legacy on-call vendors are known for their opaque pricing strategies. They often charge per seat per month, but that only includes access to core features. They’ll try to upsell you on add-ons for even basic features like status pages. That’s why many SRE teams are questioning whether their PagerDuty costs are still worth it.
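One way to cut through opaque pricing is to model the full yearly bill: seats, per-seat add-ons, and any AI base fee plus usage charges. The Python sketch below is a rough calculator under those assumptions. Every number in the example is a made-up placeholder, not a real quote from any vendor.

```python
def annual_cost(
    seats: int,
    seat_price_per_month: float,
    addons_per_seat_per_month=(),
    ai_base_fee_per_month: float = 0.0,
    ai_cost_per_use: float = 0.0,
    ai_uses_per_month: int = 0,
) -> float:
    """Estimate a yearly bill from per-seat, add-on, and AI usage pricing."""
    # Add-ons are assumed to be billed per seat, on top of the core seat price.
    per_seat = seat_price_per_month + sum(addons_per_seat_per_month)
    # AI is assumed to be a flat monthly fee plus a charge for each use.
    ai_monthly = ai_base_fee_per_month + ai_cost_per_use * ai_uses_per_month
    return 12 * (seats * per_seat + ai_monthly)


# Placeholder numbers only: the same team with and without add-ons and AI charges.
core_only = annual_cost(seats=10, seat_price_per_month=20.0)
with_extras = annual_cost(
    seats=10,
    seat_price_per_month=20.0,
    addons_per_seat_per_month=(5.0, 3.0),
    ai_base_fee_per_month=100.0,
    ai_cost_per_use=0.5,
    ai_uses_per_month=200,
)
print(core_only, with_extras)
```

Running each vendor’s real quote through the same function makes it easier to see how far the advertised seat price is from what you’d actually pay.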
Finding the right incident response tool for your team involves figuring out what you need and putting the vendor to the test. Below are examples of questions to consider for each feature in your next on-call and incident response solution. Adapt them to your circumstances and add questions as needed.
Ease of Use:
Customization:
Flexibility:
Automation:
Reliability:
Rootly is a modern on-call and incident response solution trusted by leading SRE teams like LinkedIn, Dropbox, NVIDIA, and Webflow. Rootly offers an intuitive UI that’s easy for both engineers and non-technical staff to use, allows full customization to fit your team’s workflows, and offers transparent pricing with all features included.
Talk with one of our reliability experts to see if Rootly is the right incident response vendor for you.