Get to know

Steve McGhee

Quick facts

🐾
Has 9 pets
🥃
Go-to drink: Negroni
🏊🏻
Strong swimmer
🎙️
Podcast host + OG SRE

Five Questions with Steve

What’s your earliest memory of being interested in computers?

Very young. We had an 8086 and I remember teaching my older brother about BASIC and how you could play games through the BASIC command line. I was probably six or seven at the time? I remember in early high school reading the entire DOS manual for 3.1 (I think... maybe it was 3.0) cover to cover just to learn all the things that I could possibly do with this crazy beast. I also remember the first program I ever wrote was a Spider-Man game out of some magazine. I don't think it worked, but I wrote a program, so that was cool.

Tell me about the journey from SRE to Reliability Advocate - how has your day-to-day changed?

I left SRE at Google, mainly (or so I joke) for weather reasons. I wanted to live back in California, and I wanted to try something else. I'd been at Google for 12 years at that point and I had never really had a job outside of Google. I basically went straight from university to Google, so I wanted to be in the “outside” world and see what it really looked like. I definitely learned about that. The Google Site Reliability Engineering book had come out at that point and there were a lot of misunderstandings. I don't place blame on the book or the readers of the book especially, but it's a deep subject and one high-level book is not the same as being immersed in it for a decade. I saw that these misunderstandings were leading to poorly designed systems and to teams responding poorly when a ‘bad thing’ happens. I remember being in a big room with lights and sounds beeping and bad stuff happening and the vibe was just way off. And I was just like “Why is this happening? Why are we doing it this way?” I wanted to help people crawl out of that into some other shaped thing. I thought about how we can use the beauty of distributed systems to make people suffer less and have a better time on the internet. So that's the high-level meta goal: to make the internet more reliable and have people who are running it not have to suffer for it.

Aside from the famous Google SRE book, what book or resource would you recommend first to someone beginning a career in Reliability?

The Seeking SRE book by David Blank-Edelman is good. I know that he's got a new book coming out soon as well. And then the SLO book by Alex Hidalgo is also very good. That's a strict subset of all the stuff, but it's a very important one. So please, that one's a good one. Thomas Limoncelli’s books also. They don't say SRE on the cover, but they are spiritually very similar. And honestly, going on GitHub and finding the awesome SRE canon of things to look at. I made my own silly thing that I'm happy to promote even though I'm a little embarrassed by it, because it's hard to understand. It's called r9y.dev because I can't spell the word reliability reliably.

What’s the most underrated skill for people working in Reliability/incident response?

Being able to read and write effectively: reading code, for example. I hate to glamorize incidents, but when a thing is super broken and you're trying to debug it, ideally you can mitigate it quickly and not have your hair on fire. Then once you debug it, you're gonna be trying to figure out why it happened in the first place. So you need to read someone else's code with the intent of finding something that they didn't intend to put in that code. You're building a system in your head as you're reading it and you're comparing it to your history of what you just saw happen in real life and you're trying to make that connection.

If you had to start a career from scratch in a completely new field - not tech at all - what would you choose?

Probably something in athletics, like cycling or swimming. I really like swim coaching; I wish I had more time for it. I also like teaching, which is a form of coaching, so something along those lines.