Creating Chaos to Achieve Reliability
How can creating chaos achieve better reliability? Chaos and reliability might seem mutually exclusive, but through the use of Chaos Engineering, SREs can bring about meaningful changes to system resiliency.
June 21, 2024
5 min
Being an SRE requires you to know infrastructure, systems design, and software engineering, and to have other superpowers, like being able to debug other people’s code while everything is on fire. The key to becoming a stellar SRE and getting better at the job is experience, experience, and more experience. However, building a strong foundation in reliability will give you a head start in the field.
The book Site Reliability Engineering: How Google Runs Production Systems is arguably one of the best-known SRE resources out there, but there are many other great options for folks looking to expand their learning. We recently spoke to Google's Reliability Advocate, Steve McGhee, in our Humans of Reliability interview series. In addition to his interesting anecdotes on the early days of SRE at Google, and his journey to becoming a Reliability Advocate, he also shared a handful of his favorite SRE resources, which we compiled here into a list.
Published a few months ago, David’s new book is definitely the most comprehensive reference for becoming an SRE in 2024. The book is an introduction to reliability that lays down the fundamental concepts of the practice and provides an actionable curriculum for people who want to break into the space. It’s also a useful reference for organizations starting their journey towards reliability.
Now we’re getting to the nitty-gritty of being an SRE. Reliability is not about vibes or feelings. Reliability has hard numbers and direct business implications. In this book, Alex gives an overview of how to define and measure Service Level Indicators (SLIs) as a foundation for setting Service Level Objectives (SLOs). The treasure here is that Alex actually dives into implementation strategies to achieve your SLOs.
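To make the "hard numbers" point concrete, here is a minimal Python sketch of a request-based availability SLI checked against an SLO target. It is not taken from the book: the `evaluate_slo` function, the `SLOReport` type, and the sample request counts are all hypothetical, and a real setup would pull these counts from your monitoring system.

```python
from dataclasses import dataclass


@dataclass
class SLOReport:
    sli: float               # measured fraction of good requests
    slo_target: float        # target fraction of good requests, e.g. 0.999
    budget_remaining: float  # share of the error budget left (negative if overspent)


def evaluate_slo(good: int, total: int, slo_target: float) -> SLOReport:
    """Compare a request-based availability SLI against an SLO target.

    Hypothetical helper for illustration only.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= good <= total:
        raise ValueError("good must be between 0 and total")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good / total
    # The error budget is the number of bad requests the SLO tolerates.
    allowed_bad = (1 - slo_target) * total
    actual_bad = total - good
    budget_remaining = 1 - actual_bad / allowed_bad
    return SLOReport(sli=sli, slo_target=slo_target, budget_remaining=budget_remaining)


# Made-up numbers: 1,000,000 requests in the window, 500 of them failed.
report = evaluate_slo(good=999_500, total=1_000_000, slo_target=0.999)
print(f"SLI: {report.sli:.4%}, error budget remaining: {report.budget_remaining:.1%}")
```

With a 99.9% target, one million requests allow roughly 1,000 failures, so 500 failures use up about half of the error budget for that window.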
This book lays out the principles of reliability at scale and how organizations approach it. It’s a great general deep dive into SRE because it takes a systematic approach to every aspect of reliability. From culture and tooling to monitoring and career development, David provides an overview of theoretical principles coupled with real-life case studies.
It doesn’t say SRE in the title, but this book is a gem for understanding the underlying concepts that make reliability possible in any system. The book is platform-agnostic and explores best practices for system and network administration that apply to desktop services, server management, and security.
Check out Steve's full interview and get to know more SREs at rootly.com/humans-of-reliability.