I like to start with an analogy. Chaos engineering is a lot like crash testing a car or getting a vaccine. We want to inject a little bit of harm so we can see the failure condition, understand it, and either prevent it or lessen its impact. Our complex distributed systems are a lot like organisms. Things are always happening. Say you've got a flu virus floating around in there; your body learns how to adapt and defend against it. Similarly, when we inject failure with chaos engineering, it helps us understand how our system responds and how the people who operate these systems respond.
“Chaos engineering,” in my opinion, is a fun name but maybe also a bit of a misnomer. People tend to (reasonably) think chaos engineering must be chaotic. That's not actually what I recommend; I don't think that's good engineering. To me, chaos engineering is thoughtful failure injection with a clear goal in mind: to understand a system better.
Yeah well, failure’s the key, right? You learn by making a lot of mistakes and seeing how quickly you can adapt. It’s maybe not an exciting answer but it’s a realistic answer. I'd had the chance to be part of management at Amazon and I'd worked closely with leaders at some of the startups I was at before Amazon and Netflix. So I had some sense from having seen people do it a few different ways. I think we live in a day and age where getting advice is not your problem, it's how you filter the advice that is actually effective and works well for you. One of the challenges with startup life is how you balance success and growth. When things are going well, there’s a tendency to double down and go all in on what’s working. But you have to be careful, because things change fast and when you invest too much in one path it’s harder to pivot. Or you might overinvest, overhire, and focus on the wrong things. Like many, we’ve gone through a journey. We’ve had success and happy customers, we’ve scaled up, and we’ve learned lessons about how to run a frugal and efficient company too.
Chaos Monkey is a resiliency tool, built and open-sourced by Netflix, that randomly terminates virtual machine instances and containers running inside your production environment.
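For readers who want to picture what that looks like mechanically, here is a minimal sketch in Python of the underlying idea, not Netflix's actual implementation. It assumes an AWS environment with boto3 configured and a hypothetical "chaos-enabled" tag that services use to opt in, then picks one running instance at random and terminates it.

```python
import random

import boto3
from botocore.exceptions import ClientError


def terminate_random_instance(dry_run: bool = True):
    """Pick one opted-in, running instance at random and terminate it."""
    ec2 = boto3.client("ec2")

    # Only consider running instances that have explicitly opted in via the
    # (hypothetical) chaos-enabled tag.
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos-enabled", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]
    if not instance_ids:
        return None  # Nothing opted in, so do no harm.

    victim = random.choice(instance_ids)
    try:
        # DryRun=True asks AWS to validate the call without terminating.
        ec2.terminate_instances(InstanceIds=[victim], DryRun=dry_run)
    except ClientError as err:
        # AWS reports a successful dry run as a "DryRunOperation" error.
        if err.response["Error"]["Code"] != "DryRunOperation":
            raise
    return victim
```

The randomness is the point: teams can't special-case the one failure they know is coming, so they have to build services that tolerate any instance disappearing.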
Yeah, it's a great question. So I think one half of the answer is that Chaos Monkey is as much a social solution as it is a technical solution. When Netflix was moving to the cloud, you suddenly live in a world where hosts are ephemeral and a host could be replaced at any time, whereas engineers 10 or 20 years ago relied on the box being there. And so there was state stored on the box, or things specific to that host that people cared about. So when that host goes away in the cloud, that causes a failure. I think Netflix made a brilliant management move when they essentially said, “Hey, this is what reality looks like, we're gonna force you to prepare for it in staging, in your environment, so that you have to tackle that pain head on and address it.” That meant that people really had to feel the pain and go fix it. So I think at Netflix in general my observation was that people were culturally bought in. As I've taken this out into the market, I've seen the full gamut of reactions. A lot of teams are hesitant because they're afraid of what they don't know, like “We don't think that we're going to pass.” I'm empathetic there. But it's kind of like, “Hey, I want to get in better shape, but I'm afraid to go to the gym.” It's a perfectly understandable feeling, but ultimately you've got to get in there. You have to start putting in the effort to get better or else things are never going to change.
So I think some of the problem with the name “chaos engineering” is that a lot of people think it's got to be in live production, all at once, full bore. But once you explain that there's actually a very rational, iterative approach that builds on itself and mitigates the risk, they start to see a version of it that's more realistic and approachable for them. You can start in staging, even in a particular part of staging. You don't just go and break all of staging for everybody and say, “What did we learn?” You look at how you expect a host to behave, run an experiment on that single host, then ask, “Did that host fail the way you expected?” Despite the name, it's not actually a chaotic practice when done right. Ultimately, there are a set of problems that exist in production that don't exist in staging. They're related to your load balancer, your traffic, your security group, how you shed traffic, how you route DNS, all sorts of things. So you should get to a point where you are testing the system customers rely on. But you can build up to it.
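To make that concrete, here is a small sketch of what a first single-host experiment might look like, under assumed details: a hypothetical staging host name, health endpoint, and systemd service, with SSH access to the box. The shape is what matters: check the host is healthy, state a hypothesis, inject exactly one failure, watch whether the host fails the way you expected, and roll back.

```python
# Sketch of a single-host staging experiment. Host, port, path, and
# service name are hypothetical; adapt them to your own environment.
import subprocess
import time

import requests

HOST = "staging-host-01.internal"            # hypothetical staging host
HEALTH_URL = f"http://{HOST}:8080/healthz"   # hypothetical health endpoint


def is_healthy(timeout: float = 2.0) -> bool:
    try:
        return requests.get(HEALTH_URL, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False


def run_experiment() -> None:
    # Hypothesis: if the service process dies, the health check should
    # start failing within 30 seconds (and, in a fuller setup, the host
    # should be pulled out of the load balancer).
    assert is_healthy(), "Abort: host must be healthy before injecting failure."

    # Inject exactly one failure on exactly one host (assumes SSH access).
    subprocess.run(
        ["ssh", HOST, "sudo", "systemctl", "stop", "my-service"], check=True
    )

    failed_as_expected = False
    deadline = time.time() + 30
    while time.time() < deadline:
        if not is_healthy():
            failed_as_expected = True
            break
        time.sleep(2)

    # Always roll back, whatever the result.
    subprocess.run(
        ["ssh", HOST, "sudo", "systemctl", "start", "my-service"], check=True
    )

    if failed_as_expected:
        print("Host failed the way we expected.")
    else:
        print("Hypothesis not confirmed; investigate before expanding the blast radius.")


if __name__ == "__main__":
    run_experiment()
```

The same loop scales up from here: widen the blast radius one step at a time, from one host to a cluster to a slice of production, only after the previous step behaved the way you expected.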
Well, I have 5 kids. So that's where my motivation comes from. I got married young and had my kids young, so I've always been focused on raising them, helping them, preparing them. I also just love being a good engineer. I love seeing good systems. I'm a builder at heart; if you look at my video game library, a fair number of the games are about building. I love to just build and grow things over time.
Gremlin is hiring! Check out their careers page at gremlin.com/careers.