January 9, 2023
What SREs can learn from the CircleCI security incident of January 2023.
In some respects, security and reliability are competing priorities. Security controls may reduce reliability, and responding to security incidents may require mission-critical systems to be paused or shut down until they're secure.
The recent security incident involving CircleCI, however, shows that it's not always necessary to choose between security and reliability. To its credit, CircleCI has handled the incident in a way that minimized the impact on reliability, despite the apparent severity of the situation from a security perspective.
On January 4, 2023, CircleCI, which develops a CI/CD platform that development and DevOps teams use to build applications, announced that it was investigating a security incident affecting its platform.
The company hasn't released many details so far about the exact cause or nature of the incident, but it has emphasized that it is not aware of any malicious actors currently inside its systems. That implies CircleCI is confident the breach has been contained.
Nonetheless, the company has urged its customers to "rotate any and all secrets stored in CircleCI." By secrets, it's referring to various types of tokens, SSH keys and even environment variables that might store access information.
Based on that advice, it seems likely that the attack involved malicious actors gaining access to secrets that customers store on the platform. Because the company hasn't warned about exfiltration of any other type of data, the breach appears to have been limited to secrets. If so, once users rotate those secrets, attackers can no longer use the stolen values to reach sensitive resources, and no further damage should occur.
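For teams working through CircleCI's advice, it can help to track rotation as an explicit checklist rather than from memory. The Python sketch below assumes a hypothetical inventory in which each secret records its name, its kind and when it was last rotated; `StoredSecret`, `plan_rotation`, `new_token_value` and the inventory entries are illustrative names, not part of any CircleCI API. Because CircleCI asked customers to rotate every secret, the sketch treats anything not rotated since the January 4 disclosure as still outstanding (the exact cutoff timestamp is an assumption).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import secrets

# Assumption: treat midnight UTC on the disclosure date as the cutoff.
# Any secret last rotated before this moment is considered exposed.
DISCLOSURE = datetime(2023, 1, 4, tzinfo=timezone.utc)


@dataclass
class StoredSecret:
    """One entry in a hypothetical inventory of secrets kept in a CI system."""
    name: str
    kind: str  # e.g. "api_token", "ssh_key", "env_var"
    last_rotated: datetime


def needs_rotation(secret: StoredSecret, cutoff: datetime = DISCLOSURE) -> bool:
    """A secret is outstanding if it has not been rotated since the cutoff."""
    return secret.last_rotated < cutoff


def plan_rotation(inventory: list[StoredSecret], cutoff: datetime = DISCLOSURE) -> list[str]:
    """Return the names of outstanding secrets, sorted for a stable checklist."""
    return sorted(s.name for s in inventory if needs_rotation(s, cutoff))


def new_token_value(nbytes: int = 32) -> str:
    """Generate a fresh random value for a token you control yourself."""
    return secrets.token_urlsafe(nbytes)


# Hypothetical inventory for illustration only.
inventory = [
    StoredSecret("DEPLOY_TOKEN", "api_token", datetime(2022, 11, 2, tzinfo=timezone.utc)),
    StoredSecret("DB_PASSWORD", "env_var", datetime(2023, 1, 5, tzinfo=timezone.utc)),
    StoredSecret("github-deploy-key", "ssh_key", datetime(2022, 6, 30, tzinfo=timezone.utc)),
]

print(plan_rotation(inventory))  # ['DEPLOY_TOKEN', 'github-deploy-key']
```

This only tracks what is left to do. Tokens issued by third parties and SSH key pairs have to be reissued or regenerated at their source, and the new values still have to be updated in CircleCI itself; `new_token_value` is only appropriate for secrets your own systems mint and verify.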
CircleCI has also emphasized that "the number one question we’ve received from customers is, ‘Can I build?’ The answer is yes."
That's noteworthy because, again, major security incidents like the one that apparently occurred at CircleCI often result in downtime for mission-critical systems while those systems are updated. And no matter how much you've invested in backup systems, automated failover, redundancy or other reliability techniques, they won't protect your operations if there are security vulnerabilities at the core of your systems. A compromised secret is just as dangerous in a backup system as in production, especially when both rely on the same credentials.
But in this case, CircleCI customers fared better than the norm. They were able to resume operations within about a day of the incident's disclosure, and the only step required to keep using CircleCI securely, updating their secrets, was relatively minor.
CircleCI deserves credit for minimizing the operational impact of this incident. Its blog post on the incident also spells out exactly how customers should update their secrets, and which secrets management best practices can harden their CI/CD pipeline security in the future.
Of course, there's no guarantee that future security incidents will be resolved with as little reliability impact as this one. That's why it's important for SREs to take steps like the following to ensure that security incidents don't undercut reliability:
We're hoping that security incidents like the one CircleCI disclosed this month are few and far between. But there's no reason to think they will be, given that thousands of cyberattacks take place each day. SREs can do their part to help businesses prepare by striking a healthy balance between security and reliability, so that operations can continue through security incidents, as they did in this case.