Why More Incidents Are Better
Totally preventing all incidents is not only unrealistic. It’s actually undesirable in some respects.
May 13, 2022
5 min read
A look at the Atlassian outage of April 2022, and what it stands to teach Site Reliability Engineers.
What happens when the tools and services you depend on to drive Site Reliability Engineering turn out to be susceptible to reliability failures of their own?
That’s the question that teams at about 400 businesses have presumably had to ask themselves this month in the wake of a major outage in Atlassian Cloud. The incident offers a number of insights for SREs about reliability risks within reliability management software itself – as well as how to work through complex outages efficiently and transparently, as Atlassian has done following the incident.
The outage began on April 4, was resolved on April 18, and affected about 400 Atlassian Cloud customer accounts. Atlassian Cloud is a hosted suite of popular Atlassian products, such as Jira and OpsGenie. The outage meant that affected customers could no longer access these tools or the data they managed in them.
According to Atlassian, the problem was triggered by a faulty application migration process. Engineers wrote a script to deactivate an obsolete version of an application. However, due to what Atlassian called a “communication gap” between teams, the script was written in such a way that it deactivated all Atlassian Cloud products, not just the obsolete application.
To make matters worse, the script was apparently configured to delete data permanently, rather than mark it for deletion, which was the intention. As a result, data in affected accounts was removed permanently from production environments.
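To make the difference between marking data for deletion and deleting it permanently concrete, here is a minimal Python sketch of a deactivation routine that only soft-deletes, checks its scope before touching anything, and defaults to a dry run. It is purely illustrative: the `Record` type, the `deactivate` function, and the `kind` field are hypothetical names invented for this example, and nothing here reflects Atlassian's actual script or data model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Record:
    """Hypothetical stored item; `kind` says what sort of thing it is."""
    record_id: str
    kind: str  # assumed values for this sketch: "app" or "site"
    deleted_at: Optional[datetime] = None  # None means the record is still active


def deactivate(records, target_ids, expected_kind="app", dry_run=True):
    """Soft-delete only the records named in target_ids.

    The whole batch is validated first, so a wrong ID stops the run
    before any record has been modified.
    """
    by_id = {r.record_id: r for r in records}

    # Scope guard: every target must exist and be of the expected kind.
    for rid in target_ids:
        if rid not in by_id:
            raise KeyError(f"unknown id: {rid}")
        if by_id[rid].kind != expected_kind:
            raise ValueError(
                f"{rid} is a {by_id[rid].kind}, expected {expected_kind}"
            )

    if dry_run:
        # Report what would change without modifying anything.
        return list(target_ids)

    now = datetime.now(timezone.utc)
    for rid in target_ids:
        # Mark for deletion; the underlying data stays recoverable.
        by_id[rid].deleted_at = now
    return list(target_ids)


if __name__ == "__main__":
    records = [Record("app-1", "app"), Record("site-1", "site")]
    print(deactivate(records, ["app-1"]))             # dry run: nothing changes
    deactivate(records, ["app-1"], dry_run=False)     # soft-deletes app-1 only
```

The design choice worth noting is that the destructive path has to be requested explicitly (`dry_run=False`), and even then it only sets a timestamp. A separate, later job would be responsible for any permanent removal, which leaves a window in which a mistake can be undone.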
The Atlassian Cloud outage may not be the very worst type of incident imaginable – failures like Facebook’s 2021 outage were arguably worse because they affected more people and because service restoration was complicated by physical access issues – but it was still pretty bad. Production data was permanently deleted, and hundreds of enterprise customers experienced total service disruptions that have lasted several days and counting.
Given the seriousness of the incident, it’s tempting to point fingers at Atlassian engineers for letting an incident like this happen in the first place. They seem to have written a script with some serious issues, then presumably deployed it without testing it first – which is exactly the opposite of what you might call an SRE best practice.
On the other hand, Atlassian deserves lots of points for responding to the incident efficiently and transparently. Although the company was silent at first, it ultimately shared details about what happened and why, even though those details were a bit embarrassing to its engineers.
Crucially, Atlassian also had backups and failover environments in place, which it has used to speed the recovery process. The major reason why the outage has lasted so long, the company said, is that restoring data from backups to production requires integrating backup data for individual customers into storage that is shared by multiple customers, a tedious process that Atlassian apparently can’t perform automatically (or doesn’t want to, presumably because it would be too risky to automate).
Unfortunately for impacted customers, it does not appear that any fallback tools or services were made available while they waited for Atlassian to restore operations. We imagine this poses more than minor issues for teams that rely on tools like Jira to manage projects and OpsGenie to handle incidents. Perhaps those teams have stood up alternative tools in the meantime – or perhaps they have just spent the past several days crossing their fingers, hoping their project and reliability management tools will come back online ASAP. The full outage postmortem can be found here.
For SREs, then, the key takeaways from this incident would seem to be:
The Atlassian Cloud outage is notable both for its length and for the fact that, somewhat ironically, it took out software that teams use to help prevent these types of issues from happening at their own businesses.
The good news is that Atlassian had the necessary resources in place to restore service as quickly as possible. A shared data storage architecture has led to slow recovery, which is unfortunate, but again, it’s hard to blame Atlassian too much for not setting up dedicated storage for each customer.