Get to know

Tony Holmes

Quick facts

💼 Ex-Apple, YouTube & Netflix
👨‍🏫 Tech mentor
💿 PC Gamer
😎 30-year tech vet

Five Questions with Tony

You got your start as a sysadmin long before the term “SRE” had even entered the lexicon. How do you view the difference between sysadmin and SRE work today? Is SRE an evolution of the sysadmin role, or a totally different craft?

SRE is a techno-social craft. You have to weigh the social side of the equation equally with the technological side, because you're dealing with people using systems, or trying to interact with them or bring them back up into a good state, so you need to consider the human conditions in that moment.

I think of SRE as a Venn diagram. You have your network administrators, storage administrators, developers, DevOps engineers, CI/CD engineers, and so on. Approximately 10% of them will approach it from a reliability-first perspective. This means asking, first and foremost, “How will this thing fail? And, in the face of that failure, how will we respond?” When this mindset is the first reflex in a discipline, that’s when you get a reliability engineer. I have seen so many companies get it wrong. They had sysadmins and said “Okay, you're now SREs.” No. The skill set that you need is that curiosity and holistic view of the system, in addition to the ability to debug and understand the technology itself. In SRE, you’re looking beyond the systems in isolation, and considering the whole experience that your systems create for users.

Your role at Affirm is the Head of SRE. What does that look like for you day-to-day? What type of decisions are you making and what actually consumes a lot of your time leading an SRE organization?

Yeah, definitely. So right now, it’s kind of a dual role. I’m leading the team while also redefining the function as a whole, compared with how it was run under previous leadership. I’m making little nudges, tweaks and optimizations here and there, while also thinking about the big picture: defining the program goals and methodologies. Right now, that means working closely with my team to recognize their pain points and address them, to support their ability to execute on their mission. One of the things we’re working on now is our incident management flow. We’ve identified opportunities to reduce what we call “self-inflicted pain”: we found that 73% of our incidents were being caused by human changes. We’re also redefining our severity guidelines so they work in line with our SLOs. So my goal, to sum all that up, is to create and execute on a framework that factors in the human side of the equation, making things clearer, more consistent, and easier to reason about.

You mentioned rethinking severity levels - let’s dive into that a bit more. Tell me about what makes severity guidelines effective.

Google does it really well. For anyone who hasn’t read the Google SRE book, their chapters on defining SLOs and multi-window alerting are a great place to start. You want to base your alerting on burn, meaning error budget burn. If you are delivering three nines of reliability, that means you have an error budget of 0.1% over whatever period. So for us, we have monthly SLOs, and three nines will be one of them. Some services are tighter, some are looser, but I'll use that as an example. We want to alert urgently if we're burning the error budget very quickly. So if there's a major incident, maybe a system like DNS has gone out, that will impact a lot of things, which means you're burning a lot of SLOs broadly everywhere. You'll want your alerting to indicate that. So we'll actually base severity on our error rates against our normal volume of transactions. A SEV0 essentially means that over a very short period of time, you're going to burn a very significant chunk of your error budget. As an example, let’s say SEV0 means you are going to burn 10% of your error budget in six hours. That's a lot. So you want your alerts aligned around that, and then you set the various levels. So the SEV0s will be your very, very sharp ones. Your next one will be the SEV1s, which is probably going to be 1% in the same period or 10% over a longer period of time. Google has a pattern of six to one, where the height versus the width tends to be a six-to-one ratio for all the definitions. So a SEV1 looks over a window six times longer than a SEV0.
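To make the arithmetic in Tony's example concrete, here is a minimal Python sketch of burn-rate-based severity. It is not Affirm's or Google's implementation. The 30-day period, the tier table, and every function name are assumptions for illustration; only the 99.9% SLO and the "10% of budget in six hours" and "1% in the same period" figures come from the interview.

```python
# Sketch only: hypothetical names and tiers, assuming a 30-day SLO period.
PERIOD_HOURS = 30 * 24  # assumed length of the "monthly" SLO period


def error_budget(slo: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.999 -> roughly 0.001."""
    return 1.0 - slo


def burn_rate(budget_fraction: float, window_hours: float,
              period_hours: float = PERIOD_HOURS) -> float:
    """How many times faster than 'exactly on budget' errors must occur
    to consume budget_fraction of the budget within window_hours."""
    return budget_fraction * period_hours / window_hours


def error_rate_threshold(slo: float, budget_fraction: float,
                         window_hours: float) -> float:
    """Error rate, sustained over the window, that burns budget_fraction."""
    return burn_rate(budget_fraction, window_hours) * error_budget(slo)


# Hypothetical tiers, most severe first: (name, budget fraction, window hours).
SEVERITY_TIERS = [
    ("SEV0", 0.10, 6),  # 10% of the budget in six hours (from the interview)
    ("SEV1", 0.01, 6),  # 1% of the budget in the same window
]


def classify(observed_error_rate: float, slo: float = 0.999):
    """Return the first tier whose threshold the observed rate meets.
    The observed rate is assumed to be measured over the tier's window."""
    for name, fraction, window in SEVERITY_TIERS:
        if observed_error_rate >= error_rate_threshold(slo, fraction, window):
            return name
    return None


if __name__ == "__main__":
    print(burn_rate(0.10, 6))                      # SEV0 burn-rate multiple
    print(error_rate_threshold(0.999, 0.10, 6))    # SEV0 error-rate threshold
    print(classify(0.02), classify(0.005), classify(0.0005))
```

Under these assumptions, a SEV0 corresponds to errors arriving about 12 times faster than the budget allows, which for a three-nines service is an error rate of roughly 1.2% sustained over six hours. A real alerting setup would also pair each long window with a shorter one so that alerts clear quickly once the problem stops.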

As a senior leader in SRE, how do you approach mentorship? What does being a good mentor mean to you?

First, there are a lot of different types of mentorship. I have a mentee from Toronto that I've known for 14 years. He's a close friend and a mentor around life goals, aspirations, direction, and we meet once a month and we have these really intense conversations and we meander through our thoughts, our focuses, and our curiosity.

At Affirm, I believe that everyone in our team should be mentoring somebody in some capacity, with the exception of the juniors, whose entire reason to be is to learn inside the team. You have the technical role-based mentors. I will typically mentor the most senior engineers in my team. They will then mentor the next layer down, so on and so forth, and there'll be some skips and stuff going on in there to give perspective.

So there's technical mentorship, which is easy. This is a skill, here's how I recommend you approach it, here's how you go and learn it. “Soft” skills, that's the harder one. There are a lot of people that have a very difficult time going from a technical skill set to a soft, influence-based one. So the ones that can do that well, that’s a superpower. That’s when I’ll say “I want you to go talk to this person because they are really good at influence without authority. I want you to go and spend six months with them, whatever cadence works, 30 or 60 minutes a week, and I want you to make it very clear this is what you're looking for and you want guidance on how you can grow that skill.”

I also recommend people have mentorship from outside of their immediate team, like meeting with a high-level IC from another team so you can understand how they perceive your team. And then you want to go to a product team. How do they view the role of our platform, our team? Mentorships should have a purpose, and some of them should end once you get the skill, and then you find another one for the next skill you need to build.

Affirm is a huge player in the ecommerce space. Are you a big online shopper? What’s the best purchase you made online recently?

My wife is the big online shopper in our house, but I’ve done my fair share and I'm actually staring at mine right now. It is a wonderful large curved monitor. Affirm has a “wallet” program that allows you to buy equipment for your home office, since we primarily work remotely. So I upgraded my work monitor, which is fantastic, because I also get to use it for gaming. I got it two months ago and I'm absolutely loving it.