
February 11, 2022


Top 9 Skills for SREs from ex-Instacart SRE

A list of the top nine SRE skills, from incident management, to cloud computing, to networking and beyond.

Written by
Quentin Rousseau

It’s easy to talk at a high level about what Site Reliability Engineers do: They ensure that IT systems meet availability and performance requirements.

But which skills, exactly, do SREs need to do their jobs? That’s a more complicated question.

To answer it, this article walks through the top nine skills that modern SREs (or aspiring SREs) should master. SRE skills vary from one team to the next, depending on the types of systems the team manages and the reliability challenges it faces. Even so, virtually all SREs need a core set of skills that allow them to understand and manage the complex, distributed systems that a typical organization runs today.

Without further ado, here’s a breakdown of top SRE skills.

Networking expertise

The network plays a pivotal role in connecting modern, distributed environments. As such, it’s often the culprit when something goes wrong -- a lesson that Facebook, for example, learned in October 2021, when a networking problem took its services offline worldwide.

Situations like this are why SREs should master networking concepts. Even if their organization also employs networking engineers, SREs need a deep understanding of networking themselves to know when the network is the root cause of an incident and how to resolve network-caused issues effectively.
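As a rough illustration of what "knowing when the network is the root cause" can look like in practice, here is a minimal Python sketch that separates a DNS failure from a TCP connection failure for a single target. The function name `classify_connectivity` and its return labels are hypothetical, chosen for this example only; a real investigation would go well beyond these two checks.

```python
import socket


def classify_connectivity(host: str, port: int, timeout: float = 3.0) -> str:
    """Return 'dns_failure', 'tcp_failure', or 'ok' for a host:port target.

    Hypothetical helper for illustration; the labels are not a standard.
    """
    try:
        # Step 1: name resolution. A failure here points at DNS, not the service.
        addresses = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"

    # Step 2: attempt a TCP handshake against each resolved address.
    for family, socktype, proto, _canonname, sockaddr in addresses:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return "ok"
        except OSError:
            # Refused, unreachable, or timed out; try the next address.
            continue
    return "tcp_failure"
```

If the name resolves but no address accepts a connection, the problem likely sits between the client and the service (routing, firewall, or the service itself not listening) rather than in DNS, which narrows where to look next.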

Cloud computing

Like networking, cloud computing is a category of skill that modern SREs can’t live without.

The reason is almost self-explanatory: Around 90 percent of businesses use the cloud, and you can’t manage reliability for cloud environments very well if you don’t understand cloud architectures, cloud networking, cloud data storage, cloud observability and so on.

CI/CD pipelines

SREs don’t typically help to develop software, but they nonetheless need a deep understanding of how software is written and deployed -- which, at most organizations today, is a process that happens via a CI/CD pipeline.

It’s hard to engineer reliability if you don’t know how to address reliability problems that emerge from application source code or deployment processes. Understanding how CI/CD processes work and which tools drive them is key for virtually every SRE today.

Linux and Unix

If you come from a Windows background but you want to be an SRE, there’s no getting around it: You’ll need to learn how to work with Linux and other Unix-like systems in addition to Windows.

That’s because, even at organizations that don’t rely heavily on Linux servers, you’re likely to find that Linux and Unix concepts are deeply embedded within other systems that you have to work with. Most public cloud management tools follow the conventions of Linux CLI tools, for example. So do systems like Docker and Kubernetes, even if you run them in a Windows environment.

Quality assurance and software testing automation

SREs also don’t usually help to test software pre-deployment. That task falls to developers or quality assurance engineers.

Nonetheless, understanding how software is tested -- and how to use test automation to speed tests and expand test coverage -- is a vital SRE skill. After all, the more thoroughly and efficiently your team can test software, the greater your chances of catching reliability problems pre-deployment, when they are easier to fix and pose a much lower risk to the business.
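To make the idea concrete, here is a small sketch of an automated test that exercises failure-handling behavior before deployment. It uses Python's standard `unittest` module; the helper `call_with_retries` is a hypothetical function invented for this example, not part of any particular tool.

```python
import unittest


def call_with_retries(operation, attempts=3):
    """Call operation() until it succeeds or attempts run out (hypothetical helper)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try again
    raise last_error


class RetryTests(unittest.TestCase):
    def test_recovers_from_transient_failure(self):
        calls = []

        def flaky():
            # Fails twice, then succeeds on the third call.
            calls.append(1)
            if len(calls) < 3:
                raise ConnectionError("transient")
            return "ok"

        self.assertEqual(call_with_retries(flaky, attempts=3), "ok")
        self.assertEqual(len(calls), 3)

    def test_gives_up_after_limit(self):
        def always_down():
            raise ConnectionError("down")

        with self.assertRaises(ConnectionError):
            call_with_retries(always_down, attempts=2)
```

Saved to a file, tests like these can be run with `python -m unittest` as one stage of a pipeline, so a change that breaks failure handling is caught before it reaches production.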

Security engineering and response

Security is another domain that SREs don’t “own,” but where they nonetheless require significant skills.

Indeed, good reliability engineering makes security a priority, and vice versa. SREs who don’t understand security fundamentals are at risk of implementing reliability solutions that are effective from a reliability standpoint, but not necessarily secure.

DevOps

Although SREs are not DevOps engineers, SRE and DevOps are closely related domains. SREs at most organizations today will be expected to understand DevOps concepts and, in many cases, work alongside DevOps teams.

So, plan to master DevOps skills as part of your SRE skills acquisition strategy.

Incident management

Perhaps the single most important type of skill for SREs to learn is incident management. Although many roles may participate in incident response, SREs usually take the lead in organizing the incident response team, communicating with stakeholders and devising the best strategy for resolving each incident as quickly as possible.

This means SREs should know how incident response roles are structured and understand incident response concepts. They should also be familiar with incident response platforms that automate the complex processes required to ensure rapid, effective incident resolution.
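One way to picture how incident response roles are structured is as a simple data model. The sketch below is an assumption made for illustration: the role names, status values, and the `Incident` class are hypothetical and do not reflect how any specific incident response platform works.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Role(Enum):
    # Example role names (assumed); teams define their own.
    INCIDENT_COMMANDER = "incident commander"
    COMMUNICATIONS_LEAD = "communications lead"
    OPERATIONS_LEAD = "operations lead"


class Status(Enum):
    INVESTIGATING = "investigating"
    MITIGATED = "mitigated"
    RESOLVED = "resolved"


@dataclass
class Incident:
    title: str
    status: Status = Status.INVESTIGATING
    roles: dict = field(default_factory=dict)     # Role -> person
    timeline: list = field(default_factory=list)  # (timestamp, note) entries

    def _log(self, note: str) -> None:
        self.timeline.append((datetime.now(timezone.utc), note))

    def assign(self, role: Role, person: str) -> None:
        self.roles[role] = person
        self._log(f"{person} assigned as {role.value}")

    def update_status(self, status: Status, note: str) -> None:
        self.status = status
        self._log(f"status -> {status.value}: {note}")

    def unfilled_roles(self) -> list:
        """Roles nobody holds yet, in declaration order."""
        return [role for role in Role if role not in self.roles]
```

Recording every assignment and status change in a timeline also gives the team a record to draw on once the incident is over.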

Postmortems

In addition to overseeing incident response, SREs may be tasked with managing postmortems. Knowing how to run a postmortem -- as well as when a postmortem is necessary, and when it makes sense to use a “blameless” postmortem approach -- is an essential SRE skill.

Conclusion

The list of SRE skills could certainly go on. Above are only the most fundamental types of skills SREs will need for most modern environments. But if you’re just starting out on your journey to becoming an SRE, the nine skill domains described above are a good place to begin acquiring the knowledge you’ll need to excel in an SRE career.
