So you’ve decided to take advantage of Site Reliability Engineering by hiring SREs for your company.
Now, you have a second decision to make: Exactly how many SREs to hire. Do you need just one or two SREs? Or should you build a sprawling SRE team, with a dozen or more SREs on hand to support your organization’s reliability needs?
The answers to these questions will, of course, vary; every business’s needs are different. But every company should consider some core criteria when choosing how large an SRE team to create.
Why SRE numbers matter
Choosing how many SREs to hire is important because both over-hiring and under-hiring carry real costs, and no one wants to be stuck with more or fewer SREs than necessary.
SREs earn pretty hefty salaries compared to most other types of engineers, so business leaders have an incentive to avoid hiring more SREs than they need. On the other hand, if you have too few SREs, your company will likely struggle to optimize reliability across all of its systems, because there is simply not enough SRE personpower to go around.
So, it’s worth spending some time thinking strategically about exactly how many SREs to hire.
Tips for deciding how many SREs to hire
There’s no simple formula you can use to arrive at the right number of SREs for your business – and, again, every business is different.
That said, you can gain some insight into the total number of SREs you need by considering the following factors.
Overall engineering team size
The more engineers you have on staff in total, the more SREs you’re likely to need.
That may seem obvious. But exactly how many SREs do you need per engineer? That’s the more complicated question.
Wojtek Cichon provides some guidance on this front. As he writes, the “magic number” for deciding whether to hire an SRE in the first place is 25 – meaning that companies with at least 25 engineers can benefit from an SRE. Extrapolating from that figure, you could argue that you need one SRE for every 25 engineers in your company, although that ratio is a rough starting point rather than something Cichon prescribes.
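To make the headcount extrapolation concrete, here is a minimal Python sketch. The function name and the choice to round down are assumptions made for illustration (rounding down keeps the result at zero below the 25-engineer threshold); treat the output as a starting estimate, not a rule.

```python
def sres_from_headcount(engineers: int, engineers_per_sre: int = 25) -> int:
    """Estimate SRE headcount from a one-SRE-per-N-engineers ratio.

    Hypothetical helper: the default of 25 is the "magic number" cited in
    the text. Rounding down means teams under the threshold get zero.
    """
    if engineers < 0:
        raise ValueError("engineers must be non-negative")
    if engineers_per_sre <= 0:
        raise ValueError("engineers_per_sre must be positive")
    return engineers // engineers_per_sre


if __name__ == "__main__":
    for headcount in (10, 25, 120, 250):
        print(headcount, "engineers ->", sres_from_headcount(headcount), "SREs")
```

Under this ratio, a company with 120 engineers would start from an estimate of 4 SREs, then adjust for the other factors discussed below.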
The 25% on-call rule
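25 is also a magic number of sorts in the Google SRE book, which describes the so-called “25% on-call rule.” According to this rule, an SRE should spend no more than 25 percent of their time on call. In a team that shares on-call duty evenly, that works out to no more than a quarter of your SREs being on call at any given moment.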
Of course, the number of SREs you need on call at once may vary. But the Google SRE book suggests that having two on-call engineers – a primary and a secondary – is a good approach.
Based on these numbers, the total number of SREs you need is at least 8: two on-call SREs divided by the 25 percent on-call share gives 2 ÷ 0.25 = 8, which is enough to keep a primary and a secondary on call on a constant basis.
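The arithmetic behind that minimum can be sketched in a few lines of Python. The function and parameter names are hypothetical; the only inputs taken from the text are the two on-call engineers and the 25 percent figure.

```python
import math


def min_sres_for_on_call(people_on_call: int = 2,
                         max_on_call_fraction: float = 0.25) -> int:
    """Smallest team that keeps `people_on_call` engineers on call at all
    times while the on-call share stays at or below `max_on_call_fraction`.

    Hypothetical helper; defaults mirror the figures quoted in the text.
    """
    if people_on_call < 1:
        raise ValueError("people_on_call must be at least 1")
    if not 0 < max_on_call_fraction <= 1:
        raise ValueError("max_on_call_fraction must be in (0, 1]")
    # Round up: a fractional engineer cannot cover a shift.
    return math.ceil(people_on_call / max_on_call_fraction)


if __name__ == "__main__":
    print(min_sres_for_on_call())       # primary + secondary at 25 percent
    print(min_sres_for_on_call(1))      # a single on-call engineer
```

With the defaults this returns 8. Dropping to a single on-call engineer would lower the minimum to 4, at the cost of having no secondary to escalate to.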
SLA requirements
Another factor to consider is how strict your SLAs are. If you’re shooting for 98 percent uptime, for example, you probably don’t need as many SREs as a company that guarantees 99.99 percent uptime.
This isn’t to say that simply hiring more SREs guarantees higher uptime, of course. There are a number of factors that affect which uptime levels you can achieve. But in general, the stricter your SLA guarantees, the more SREs you’ll need to meet them.
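To see how far apart those two uptime targets really are, it helps to convert each into the downtime it permits. The following Python sketch does that conversion; the function name and the 365-day year are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(uptime_percent: float,
                             period_days: float = 365.0) -> float:
    """Minutes of downtime permitted over a period for a given uptime target.

    Hypothetical helper: assumes downtime is simply the share of the
    period not covered by the uptime percentage.
    """
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


if __name__ == "__main__":
    for target in (98.0, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.0f} minutes "
              f"({minutes / 60 / 24:.2f} days) of downtime per year")
```

A 98 percent target leaves roughly 7.3 days of downtime per year, while 99.99 percent leaves less than an hour. That gap is why stricter SLAs tend to demand more reliability engineering effort.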
System complexity
Consider, too, the types of technologies you use, and how complex they are. Managing reliability for complex systems like Kubernetes requires more work than managing simpler types of environments, like virtual machines and monolithic applications.
So, if you’ve gone head-first into the cloud-native world, you’ll probably want to hire more SREs. If, on the other hand, your technology stack is relatively simple, hiring a large SRE team may be overkill.
SRE team structure
A final factor to weigh is how your SREs are structured within the organization, and how they interface with other engineers.
If you use the embedded SRE model – in which SREs are embedded into other teams – you are likely to need more SREs, because you’ll have to hire at least one for each team that requires SRE support. If you have a standalone SRE team, the SREs can be shared by other teams, so you can often get by with fewer.
Conclusion
Figuring out exactly how many SREs to hire is tough work, and you should expect to have to experiment a bit to arrive at the right number. But to minimize the risk of over- or under-hiring SREs, you should consider core criteria like overall engineering team size, how many SREs you need to be on call and which types of systems they have to support.