April 22, 2021
5 min read
How can creating chaos achieve better reliability? Chaos and reliability might seem mutually exclusive, but through the use of Chaos Engineering, SREs can bring about meaningful changes to system resiliency.
The tech world is often frenetically fast-paced and growth-focused. Constant change can seem like its own form of chaos.
It might appear counterintuitive, but chaos can be leveraged as a means of increasing reliability.
Site Reliability Engineers have a lot of tools and processes to choose from when it comes to improving system reliability. One of these processes is Chaos Engineering.
Netflix is probably the company best known for pioneering the concept of Chaos Engineering. As early as 2011, the Netflix Tech Blog described how the company used a suite of specialized tools called the Simian Army to inject faults into its systems as a way to find and fix problems before those issues could affect customers.
Under controlled circumstances, experiments would be conducted where random problems were created to test system resiliency. Things such as unplugging a server, making a service unavailable, or even taking an entire cloud region offline.
Today, many companies including LinkedIn, Facebook, and Amazon Web Services leverage some form of chaos engineering to improve system reliability. Although Netflix's Simian Army project is no longer actively maintained, some of the tools are still in use, and newer tools have come along that focus on cloud native platforms like Kubernetes.
It probably goes without saying that a failure in one part of a system shouldn't bring down the whole system; that idea ought to be well understood and universally accepted. But how do software engineers uncover the hidden dependencies that can cause exactly that kind of failure?
Postmortems and Root Cause Analysis (RCA) are a great way to examine problems that have already happened to find the "why" of prior failures. Using RCA, teams can root out causes and devise ways to prevent or minimize similar problems when they occur in the future.
The issue is that waiting around for problems to randomly happen so they can be remediated is not enough on its own to effectively improve system reliability. Chaos Engineering seeks to address this by proactively injecting faults into systems and observing the results.
In doing so, one could detect hidden dependencies within a system and proactively fix problematic behaviors at any level of the application stack. This includes not just application failures themselves, but those unseen at the network and infrastructure levels, as well as with people and processes.
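To make the idea concrete, here is a minimal sketch of a single fault-injection experiment in Python. The container name and health-check endpoint are hypothetical, and the example assumes the Docker CLI is available on the host; a real experiment would target whatever dependency you actually want to test.

```python
import subprocess
import time

import requests

HEALTH_URL = "http://localhost:8080/health"   # hypothetical health endpoint
TARGET_CONTAINER = "payments-worker"          # hypothetical dependency container


def service_healthy() -> bool:
    """Return True if the service answers its health check."""
    try:
        return requests.get(HEALTH_URL, timeout=2).status_code == 200
    except requests.RequestException:
        return False


def run_experiment() -> None:
    # 1. Verify the steady state before injecting anything.
    assert service_healthy(), "System is not healthy; aborting experiment."

    # 2. Inject the fault: stop one dependency container.
    subprocess.run(["docker", "stop", TARGET_CONTAINER], check=True)

    try:
        # 3. Observe: does the system degrade gracefully or fall over?
        time.sleep(30)
        print("Still healthy during fault:", service_healthy())
    finally:
        # 4. Always restore the system, even if observation fails.
        subprocess.run(["docker", "start", TARGET_CONTAINER], check=True)


if __name__ == "__main__":
    run_experiment()
```

The interesting part isn't the fault itself; it's what you observe while the fault is active and whether the system behaves the way you assumed it would.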
Traditional testing only provides assurance for some parts of a system, usually by exercising the application code directly. But a software stack can fail in ways that traditional testing methodologies never cover.
For example, how can a team validate runbooks without actually executing them?
What happens when the underlying infrastructure such as a router or load balancer breaks (or is unplugged)?
Is the application distributed and resilient enough to keep running when someone accidentally drops a database table? What if a single virtual instance goes offline?
How well do monitoring and alerting systems function? Do teams even know when a problem occurs? How quickly can they respond?
How does a team know if their incident response skills are up to the task of addressing issues while they are happening? Have they practiced their incident response under real conditions?
By knowing what's expected of a system in a steady state (in other words, under normal circumstances), one can use targeted chaos to push the system away from those expected conditions. Over time, this helps build confidence that the system will withstand the turbulent real-world conditions that occur in production.
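A steady-state hypothesis works best when it's written down explicitly rather than held in someone's head. Here's a small sketch of what that might look like; the availability and latency thresholds, and the observed values, are illustrative rather than drawn from any real system.

```python
from dataclasses import dataclass


@dataclass
class SteadyState:
    """Explicit definition of 'normal' for the system under test."""
    min_availability: float = 0.999    # 99.9% of requests succeed
    max_p99_latency_ms: float = 300.0  # p99 latency stays under 300 ms


def within_steady_state(availability: float, p99_latency_ms: float,
                        hypothesis: SteadyState) -> bool:
    """Compare observed metrics against the steady-state hypothesis."""
    return (availability >= hypothesis.min_availability
            and p99_latency_ms <= hypothesis.max_p99_latency_ms)


# Example: metrics observed while a fault is being injected
# (values here are made up for illustration).
observed_availability = 0.9995
observed_p99_ms = 240.0

if within_steady_state(observed_availability, observed_p99_ms, SteadyState()):
    print("Hypothesis holds: the system tolerates the injected fault.")
else:
    print("Hypothesis violated: investigate the weakness the fault revealed.")
```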
Chaos experiments should be controlled, and easily turned off if something gets out of hand. Safety mechanisms are required so that problems can be detected and the experiment can be stopped before the system becomes unusable.
An experimental platform for testing, along with a control platform as a baseline, is needed so that divergences can be detected without affecting the rest of the customer-facing production platform.
All teams affected should be aware that testing is occurring and have a game plan as to how they will respond. Key metrics such as application availability, latency, and error rates (among others) should be monitored.
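One way to wire those safety mechanisms together is a guard that polls the key metrics and trips a kill switch the moment a threshold is breached. The sketch below is illustrative: `fetch_metrics` is a stand-in for a real monitoring query (Prometheus, Datadog, or whatever you use), and the thresholds would come from your own SLOs.

```python
import time

# Illustrative thresholds; real values would come from the team's SLOs.
MAX_ERROR_RATE = 0.02    # abort if more than 2% of requests fail
MAX_LATENCY_MS = 500.0   # abort if p99 latency exceeds 500 ms


def fetch_metrics() -> dict:
    """Placeholder for a real monitoring query."""
    return {"error_rate": 0.004, "p99_latency_ms": 210.0}


def guard_experiment(stop_experiment) -> None:
    """Poll key metrics and halt the experiment if safety limits are breached."""
    for _ in range(12):                      # watch for roughly a minute
        metrics = fetch_metrics()
        if (metrics["error_rate"] > MAX_ERROR_RATE
                or metrics["p99_latency_ms"] > MAX_LATENCY_MS):
            stop_experiment()                # kill switch: roll back immediately
            raise RuntimeError(f"Aborted chaos experiment: {metrics}")
        time.sleep(5)


if __name__ == "__main__":
    guard_experiment(stop_experiment=lambda: print("Fault injection stopped."))
```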
The idea is not to cause problems, but rather to reveal them. Start small by keeping the "blast radius" small, then work out from there. It certainly doesn't help if your customers are negatively affected by your testing activities, since that tends to breed ill will and harm revenue.
Once confidence in the system increases, the blast radius can be slowly increased. Over time, these experiments can be automated so that they run all the time to constantly root out previously undetected issues.
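As a rough sketch of what that automation might look like, the loop below picks one small, reversible experiment at random each hour, and only during business hours so engineers are around to respond. The experiment functions are hypothetical placeholders for real fault injections.

```python
import random
import time
from datetime import datetime


# Hypothetical experiments; each would inject one small, reversible fault.
def kill_one_replica() -> None: print("Terminating one replica...")
def add_network_latency() -> None: print("Adding 100 ms of latency...")
def fill_disk_on_one_host() -> None: print("Filling disk on one host...")


EXPERIMENTS = [kill_one_replica, add_network_latency, fill_disk_on_one_host]


def run_forever() -> None:
    """Run one randomly chosen experiment per hour, weekdays 9am-5pm only."""
    while True:
        now = datetime.now()
        if now.weekday() < 5 and 9 <= now.hour < 17:
            experiment = random.choice(EXPERIMENTS)
            print(f"{now.isoformat()} running {experiment.__name__}")
            experiment()
        time.sleep(3600)


if __name__ == "__main__":
    run_forever()
```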
There are a number of tools SREs can use to introduce controlled chaos into their systems. While parts of the Simian Army are currently unsupported and unmaintained, others have changed form or been rolled into other projects.
Chaos Monkey is probably the most famous member of the Simian Army. It's still actively maintained and available as open source. It works by randomly terminating virtual instances or containers, which incentivizes engineering teams to build services that are more resilient and fault tolerant.
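The core idea fits in a few lines. The sketch below is not Chaos Monkey's actual implementation, just a simplified analog that randomly terminates one EC2 instance that has explicitly opted in via a hypothetical `chaos-opt-in` tag; it assumes boto3 is installed and AWS credentials are configured.

```python
import random
from typing import Optional

import boto3

# Only instances explicitly opted in to chaos testing are eligible.
OPT_IN_TAG = {"Name": "tag:chaos-opt-in", "Values": ["true"]}

ec2 = boto3.client("ec2", region_name="us-east-1")


def pick_victim() -> Optional[str]:
    """Return the ID of one randomly chosen, running, opted-in instance."""
    response = ec2.describe_instances(
        Filters=[OPT_IN_TAG,
                 {"Name": "instance-state-name", "Values": ["running"]}]
    )
    instance_ids = [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]
    return random.choice(instance_ids) if instance_ids else None


if __name__ == "__main__":
    victim = pick_victim()
    if victim:
        print(f"Terminating {victim}")
        ec2.terminate_instances(InstanceIds=[victim])
```

The opt-in tag is what keeps the blast radius small: only services that have explicitly signed up for chaos testing can ever be disrupted.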
Gremlin is a very popular commercial product with a sleek interface. It was created in part to safely automate Chaos Engineering at several different levels of a software system.
Among many other features, Gremlin can create latency, introduce packet loss, manipulate system time, and shut down or restart hosts. Probably its best feature is the ability to limit the blast radius, quickly shut down experiments, and return the system to a steady state should you run into problems during an experiment.
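Gremlin's internals are its own, but a low-level analog of its latency and packet-loss attacks on a single Linux host might look like the sketch below, which uses `tc`/`netem` (run as root; the interface name is a hypothetical placeholder) and pairs the attack with a halt function that restores the steady state.

```python
import subprocess

INTERFACE = "eth0"   # hypothetical network interface on the target host


def inject_latency_and_loss(delay_ms: int = 100, loss_pct: float = 1.0) -> None:
    """Add artificial latency and packet loss with Linux tc/netem (needs root)."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", INTERFACE, "root", "netem",
         "delay", f"{delay_ms}ms", "loss", f"{loss_pct}%"],
        check=True,
    )


def halt_experiment() -> None:
    """The kill switch: remove the netem rules and return to steady state."""
    subprocess.run(["tc", "qdisc", "del", "dev", INTERFACE, "root", "netem"],
                   check=True)


if __name__ == "__main__":
    inject_latency_and_loss()
    input("Fault active. Press Enter to halt and restore steady state...")
    halt_experiment()
```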
On the incident response side is Rootly. Rootly lets you test your incident detection and response while performing chaos experiments. It has collaboration features such as Slack integration that automatically sets up channels for centralized incident communication and builds your incident timeline on the fly.
Rootly can also fetch relevant information, like recent git commits, for your impacted services. Workflows can be customized based on any incident condition. Once incidents are over, Rootly helps teams learn from them through postmortems, without the manual toil of copying and pasting information into a ticketing system.
There are a lot of tools available to help you implement your Chaos Engineering experiments, but listing them all is outside the scope of today's post. In the near future, we'll probably create a series of posts that cover a much larger swath of Chaos Engineering tools and best practices.
This year, Rootly is sponsoring Failover Conf, a virtual conference where you can learn more about using Chaos Engineering to evolve your incident response and improve reliability. Join us on April 27 from 9am to 3:30pm PDT to learn how to fail smarter.