

January 8, 2025

7 mins

Google SREs are changing the game again: a breakdown of their new approach

Google SREs are redefining reliability practices with STAMP, addressing the limitations of traditional models as systems scale. Their approach highlights the need for system-wide hazard analysis.

Written by Jorge Lainfiesta

This article is based on the paper released by Tim Falzone and Ben Treynor Sloss, both Google SREs.

It’s been over 20 years since Google outlined what SRE, as we know it today, means. SLOs and error budgets are now widely known concepts, applied at thousands of organizations to manage the reliability of their systems.

However, Google's complexity has grown enormously over the past two decades. In 2004, the year of its IPO, Google had annual revenue of $3.2 billion. By 2024, the firm's revenue had soared past $307 billion, roughly a 100-fold increase. That revenue is backed by an ever-evolving portfolio of products and offerings, all supported by increasingly sophisticated systems.

While traditional SRE thinking remains valuable and in use at Google, the teams have continuously pushed its boundaries—eventually hitting a limit. Now, Google’s SRE team is adopting a new approach to reliability through STAMP, a framework based on control theory that introduces a fundamental shift in how incidents are approached.

The Problem with SLOs and Error Budgets

SLOs and error budgets are foundational pillars of the SRE framework in most organizations. While they remain effective for many, Google SREs have encountered limitations when applying them to highly complex and large-scale systems.
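
As a refresher on the arithmetic behind these terms, here is a minimal sketch of how an availability SLO translates into an error budget. The numbers are illustrative and not tied to any particular service.

```python
# Minimal error-budget arithmetic for an availability SLO.
# All figures are illustrative.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime (in minutes) for a given SLO over a rolling window."""
    return (1.0 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget

# A 99.9% availability SLO allows ~43.2 minutes of downtime per 30 days.
print(f"Budget: {error_budget_minutes(0.999):.1f} min")
# After 30 minutes of outages this window, ~31% of the budget remains.
print(f"Remaining: {budget_remaining(0.999, 30):.0%}")
```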

Zero Tolerance on Critical Systems

Some aspects of critical systems—such as data integrity, privacy, and regulatory compliance—cannot tolerate errors. In these cases, the goal isn’t low-frequency incidents and rapid mitigation, but absolute prevention.

SLOs Apply to Individual Components

SLOs work well for systems built from stateless web services. As complexity increases, though, you end up managing sophisticated state and intricate dynamics between components.

Many incidents stem from interactions between components that are each working perfectly well according to their own standards.
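
To see why component-level SLOs can mask system-level risk, consider a hypothetical request path that traverses several services in series, each individually meeting a 99.9% SLO. A rough sketch:

```python
# Illustrative: components that each meet their own SLO can still
# compose into a system that misses the end-to-end target.

def serial_availability(component_slos: list[float]) -> float:
    """Worst-case availability of a request path that needs every
    component to succeed (assumes independent failures)."""
    result = 1.0
    for slo in component_slos:
        result *= slo
    return result

# Ten services in a call chain, each "healthy" at 99.9%...
chain = [0.999] * 10
print(f"{serial_availability(chain):.2%}")  # ~99.00% end to end
```

Note that this toy model only captures independent failures; the interaction-driven incidents described above are even harder to see in per-component numbers.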

The Limits of the Traditional SRE Model

Architecture models are one of the cornerstones of SRE at Google. A model that explains the data flow between components is essential to understand potential risks and follow an incident’s logic. However, the way models are commonly built has limitations that complicate SRE practices at scale.

Unclear System Dynamics

RPC diagrams are the gold standard for representing systems. While they show the relationships between components, they don't reveal all of their possible interactions. Tim Falzone and Ben Treynor Sloss outline a list of questions that highlight the usual gaps in a system model:

  • Which RPCs can initiate a flow?
  • How do errors propagate?
  • Which components could cause a critical outage? Which can only cause minor issues?
  • What if one component interaction is safe in some contexts but unsafe in others?
  • What is the overall goal the system is trying to achieve?
  • What responsibility does each component in the system have with respect to that overall goal?

It would be impractical to annotate a traditional diagram with all these possibilities: the result would be difficult to capture and even harder to read.

Models Don’t Scale Well

As your system gets more complex, navigating its model becomes harder. When you have hundreds of components, it becomes overwhelming to even figure out where to start.

Given the sheer complexity of a system like Google’s, it is quite difficult to maintain a complete and up-to-date version of the model at any given time.

Preventing Failures in the AI Era Is Tough

Rather than solely putting out the fires of the day, SREs strive to predict and prevent future failures. However, the introduction of AI and ML in systems makes predictability more challenging.

Root Cause Analysis Can Be Subjective

When an incident begins is open to interpretation. Does it start when it's detected, when it impacts a customer, or when the bug was first introduced, even if it went unnoticed? Or does it begin when you implemented a CI/CD pipeline that could let the bug through?

This fuzziness and infinite regress make it harder to prioritize tasks and distribute ownership for proactive reliability.

Incidents Are Not Linear Events

Incidents are often studied as cause-and-effect phenomena—A happened because B happened. But system dynamics are more complex. Each system component has multiple dependencies, and interactions can be influenced by environmental factors.

A New Approach: STAMP

STAMP is a theoretical framework developed at MIT in the 2000s that applies control theory to system safety.

The core premise of STAMP is that safety can only be understood as a system-wide property, not the property of an individual component. In this framework, ‘accidents’ result from complex interactions between components—not just a linear chain of events.

STAMP also considers more than machine-to-machine interactions. It includes human action and external disturbances.

A System Under Control

The shift introduced by STAMP is from “Did A cause B?” to “Which interactions in the system were inadequately controlled for A to happen?” Answering the latter question requires control over your system, meaning you have:

  • A model of your system
  • A goal for your system
  • A way to understand the state of the system (observability)
  • A way to influence the state of your system
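
As a mental model, these four requirements map onto a classic control loop: a controller observes the system's state, compares it against the goal using its model, and issues actions to steer the system back. The sketch below is a hypothetical illustration; the names (SystemModel, observe, actuate) and the thresholds are invented, not code from the paper.

```python
# Hypothetical sketch of STAMP's control-loop framing.
# Names and thresholds are invented for illustration.

from dataclasses import dataclass

@dataclass
class SystemModel:
    """The controller's belief about how the system behaves (the model)."""
    target_error_rate: float  # the goal for the system

def observe() -> float:
    """Observability: read the current state (stubbed with a fixed value)."""
    return 0.02  # e.g., current error rate from monitoring

def actuate(action: str) -> None:
    """A way to influence the state of the system (stubbed)."""
    print(f"control action: {action}")

def control_loop(model: SystemModel) -> None:
    """Compare the observed state against the goal, via the model,
    and issue a control action when the two diverge."""
    if observe() > model.target_error_rate:
        actuate("shed load / roll back / page a human")

control_loop(SystemModel(target_error_rate=0.01))
```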

More Than “Normal” and “Loss”

While SLOs help manage risk at the component level, the traditional model ultimately places your system in one of two states: normal or loss.

STAMP introduces a third state: hazard. A hazard isn't a single event but a system condition that, combined with worst-case environmental conditions, can lead to a loss.

More Time to Prepare and Prevent

A major disadvantage of traditional SRE approaches is the abrupt switch from OK to Problem. With STAMP, you gain better insight into the system’s condition, allowing you to detect potential hazards in advance.

A system can stay in a hazardous state long before an incident happens—like when a bug is present but untriggered or a server is under-provisioned ahead of a traffic surge.
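
To make the three states concrete, here is a toy classifier over system conditions. The conditions and thresholds are invented for illustration; this is not Google's tooling.

```python
# Toy illustration of the normal / hazard / loss distinction.

from enum import Enum

class SystemState(Enum):
    NORMAL = "normal"
    HAZARD = "hazard"  # no user impact yet, but the worst case is a loss
    LOSS = "loss"

def classify(serving_errors: bool, latent_bug: bool,
             capacity_headroom: float) -> SystemState:
    """Classify the system, treating latent conditions as hazards."""
    if serving_errors:
        return SystemState.LOSS
    # Hazard: a bug is present but untriggered, or headroom is too
    # thin to absorb a plausible traffic surge.
    if latent_bug or capacity_headroom < 0.2:
        return SystemState.HAZARD
    return SystemState.NORMAL

# Under-provisioned ahead of a surge: hazardous, even with zero errors.
print(classify(serving_errors=False, latent_bug=False,
               capacity_headroom=0.05))  # SystemState.HAZARD
```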

Putting It into Practice with STPA

STAMP serves as the foundation for STPA (System-Theoretic Process Analysis), a hazard assessment methodology used in aviation, manufacturing, and other industries.

Google SREs are applying the STPA Handbook to analyze system interactions and identify ineffective controls at the system level.

By applying STPA to their most complex systems, Google SREs have uncovered hazardous scenarios that could lead to outages. This knowledge enabled them to mitigate issues with quick fixes and long-term engineering efforts.
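
The STPA Handbook organizes this analysis around "unsafe control actions": for each control action in the system, you ask whether a hazard arises when the action is not provided, when it is provided, when it comes too early, too late, or out of order, or when it is stopped too soon or applied too long. Here is a hypothetical worksheet entry expressed as data; the control action and hazards are invented examples.

```python
# Sketch of an STPA "unsafe control action" worksheet entry.
# The four categories come from the STPA Handbook; the example
# control action and hazards are invented.

from dataclasses import dataclass

@dataclass
class UnsafeControlActions:
    control_action: str
    not_provided: str           # hazard if the action is never issued
    provided: str               # hazard if the action is issued at all
    wrong_timing_or_order: str  # too early, too late, out of order
    wrong_duration: str         # stopped too soon / applied too long

uca = UnsafeControlActions(
    control_action="Drain traffic away from a failing cluster",
    not_provided="Users keep hitting the failing cluster",
    provided="Remaining clusters overload if headroom is thin",
    wrong_timing_or_order="Draining before the failover target warms up",
    wrong_duration="Drain left in place long after recovery, wasting capacity",
)
print(uca.control_action)
```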

Should I Start Using STAMP at My Organization?

Google’s work with STAMP is impressive from both theoretical and practical perspectives. However, Google operates at a scale most companies won’t experience.

Even though the authors describe each STPA analysis as requiring "little effort," it still typically takes several weeks of engineering work.

If your SLOs and models are reaching their limits, you might be approaching the complexity that warrants non-linear approaches to hazard prevention. Frameworks like STPA and CAST offer ways to operationalize these concepts.