In some respects, security and reliability are competing priorities. Security controls may reduce reliability, and responding to security incidents may require mission-critical systems to be paused or shut down until they're secure.

The recent security incident involving CircleCI, however, shows that it's not always necessary to choose between security and reliability. To its credit, CircleCI handled the incident in a way that minimized the impact on reliability, despite the apparent severity of the situation from a security perspective.

What happened at CircleCI?

On January 4, CircleCI, which develops a CI/CD platform that development and DevOps teams use to build applications, disclosed a security incident affecting its platform.

The company hasn't released many details so far about the exact cause or nature of the incident, but it has emphasized that it is not aware of any malicious actors currently inside its systems. That would imply that CircleCI is confident that the breach has been successfully contained.

Nonetheless, the company has urged its customers to "rotate any and all secrets stored in CircleCI." By secrets, it's referring to various types of tokens, SSH keys and even environment variables that might store access information.

Based on that advice, it seems likely that the CircleCI incident involved a breach in which malicious actors gained access to secrets that customers store on the platform. Since the company hasn't warned about exfiltration of other types of data, it would appear that the breach was limited to secrets. If so, customers who rotate their secrets promptly should be able to keep attackers from using stolen credentials to access sensitive resources, although any access that took place before rotation would still need to be investigated.

Maintaining operations during security incident response

CircleCI has also emphasized: "The number one question we've received from customers is, 'Can I build?' The answer is yes."

That's noteworthy because, again, major security incidents like the one that apparently occurred at CircleCI often result in downtime for mission-critical systems while those systems are updated. And no matter how much you've invested in backup systems, automated failover, redundancy or other reliability techniques, they won't protect your operations if there are security vulnerabilities at the core of your systems. Insecure secrets are just as problematic in a production system as in a backup system.

But in this case, CircleCI customers fared better than the norm. They were able to resume operations within about a day of disclosure of the incident, and the steps they were required to perform to use CircleCI securely – which amounted to updating their secrets – were relatively minor.

CircleCI deserves credit for minimizing the operational impact of this incident. The company's blog post on the incident also spelled out exactly how customers should update their secrets, as well as which secrets management best practices can harden their CI/CD pipeline security in the future.

Lessons for SREs from the CircleCI incident

Of course, there's no guarantee that future security incidents will be resolved with as little reliability impact as this one. That's why it's important for SREs to take steps like the following to ensure that security incidents don't undercut reliability:

  • Establish playbooks that cover security routines like updating secrets, in order to minimize the time necessary to work through incidents like this one. Although CircleCI offered fairly specific steps to help its customers update their secrets, in other cases businesses might have to work out the steps on their own, and having a pre-established playbook would help.
  • Factor security into your redundancy strategy. Again, traditional backup and redundancy methods don't necessarily guarantee that you can recover quickly from a security incident; they mostly protect against other types of failures, like infrastructure outages. But forward-thinking SREs can get ahead of this limitation by planning strategies for maximizing the chances that backup systems are secure even if production systems are hacked. For example, disconnecting backups from the network can help minimize the risk that hacks of production systems will spread to backups, too.
  • Keep on operating. When news of the CircleCI incident first arrived, the knee-jerk reaction of many teams was no doubt to halt CI/CD operations due to the security risk of using a vulnerable platform. That was a reasonable stance at first, but as soon as it became clear that the risks had been mitigated, SREs at companies that use CircleCI hopefully pushed to get CI/CD pipelines back into operation. The point here is that, while it's important to take reasonable security precautions, it's equally important to prioritize operations, and not allow operations to be sacrificed in the name of tight security.
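
As a rough illustration of the first point, a secrets-rotation playbook can start from something as simple as an inventory that records where each secret is stored and when it was last rotated. The Python sketch below assumes such an inventory exists. The Secret class, its fields and the helper functions are hypothetical names chosen for this example, not part of CircleCI's tooling or API, and the actual rotation step would depend on each system that issues the credential.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Secret:
    """One entry in a hypothetical secrets inventory."""
    name: str                               # illustrative identifier
    kind: str                               # e.g. "api_token", "ssh_key", "env_var"
    stored_in: str                          # platform that holds the secret
    rotated_at: Optional[datetime] = None   # None means never rotated


def pending_rotation(secrets: List[Secret],
                     affected_system: str,
                     incident_start: datetime) -> List[Secret]:
    """Return secrets held by the affected system that have not been
    rotated since the incident started."""
    return [
        s for s in secrets
        if s.stored_in == affected_system
        and (s.rotated_at is None or s.rotated_at < incident_start)
    ]


def mark_rotated(secret: Secret, when: Optional[datetime] = None) -> None:
    """Record that a secret was rotated (defaults to the current UTC time)."""
    secret.rotated_at = when or datetime.now(timezone.utc)
```

During an incident, a team could call pending_rotation with the name of the affected platform and the incident start time, work through the returned list, and call mark_rotated on each entry as it is replaced. When the list comes back empty, the rotation step of the playbook is done.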

We're hoping that security incidents like the one that CircleCI disclosed this month are few and far between. But there's no reason to think they will be, given that thousands of cyberattacks take place each day. SREs can do their part to help businesses prepare by striking a healthy balance between security and reliability, and by planning ahead so that operations can continue despite security incidents, as they did in this case.