How Rootly helps Replit make better business decisions after incidents

"Because of how intelligently Rootly collects and compiles data from our incidents, it’s simple for us to get an overview of the system’s health and past incidents."

Matthew Iselin

Engineering Manager - SRE, Security

About Replit

Replit is a collaborative programming environment that allows users to create online projects (called Repls) and write code.

Tech Stack: Node.js, React, NGINX, Google Cloud Storage, Slack, Google Suite, Linear

Why Incident Management?

In our early days, we were managing incidents from a single Slack channel. As you might guess, things got complicated when there were multiple incidents happening simultaneously. If you weren’t clear about which incident you were referring to, conversations could easily overlap and get mixed up. We tried using threads, but it didn’t scale well for us and made conversations difficult to track.

We ended up building our own Slack bot to help organize things, but it was very rudimentary. It created the incident in Replit and let us pull messages from Slack into an incident timeline. At the end of the incident, it would generate a big unformatted Slack message with the timeline to copy and paste into a Markdown document.

There were a couple of problems we ran into with this. One, we were running the bot on our own infrastructure, which meant that in some cases it was inaccessible during an incident if things were completely down. The other problem was maintenance. The bot was a shared responsibility with no formal ownership within the company, so any issues with it were fixed ad hoc, and implementing improvements or new features in the bot wasn’t a priority.

Why Rootly?

We’re big on Slack, so we knew we wanted to continue using Slack during incidents, but we needed more functionality to make the incident management experience better and more reliable for our responders. When we looked at the tools on the market, Rootly’s Slack integration really stood out. With some other tools that integrated with Slack, there was still a need to jump between Slack and the web version of the tool, which wasn’t ideal. Then there are the other integrations that are invaluable to us, Linear for example. The bidirectional sync makes it so easy to keep our action items tracked and organized. Honestly, we’re really just scratching the surface of what Rootly can do with integrations right now, so that’s something for us to look forward to as we scale and continue to build on our Rootly setup.

Insightful Metrics for Better Business Decisions

Another huge benefit of Rootly for us is the metrics. Because of how intelligently Rootly collects and compiles data from our incidents, it’s simple for us to get an overview of the system’s health and past incidents. And not just the volume of incidents, but information like the causes, the services impacted, etc. I manage our SRE and Security teams, so I’m not always active in individual incidents, but it’s important for me to be able to access that information easily to make decisions about where we need to invest in our infrastructure and security. Before Rootly, tracking that accurately and compiling it would have been a huge effort. Now, I can look at our past incidents over any period of time and quickly identify problem areas to address.

A Better Experience for Incident Responders

Another thing we care about is the actual experience of being on call and responding to incidents. We don’t want our responders bogged down by toil and context switching. We love the way Rootly completes tasks that used to be time-intensive, like pulling and formatting an incident timeline and building a retrospective doc, and makes them completely effortless. All the little things add up too—like a consistent way to keep track of the incident commander and other roles, or automated escalation paths. When we decide something should take place during incidents, we can build a workflow once to support it instead of needing to train people and expect them to remember yet another task. We’ve seen measurable improvements that prove the impact of this. We don’t index too heavily on MTTR as a metric for success because incidents have so many variables, but we have seen a reduction in our overall MTTR since adopting Rootly, which is a positive signal for us.

We also have the added advantage that we implemented Rootly when we were a really small team, so everyone who has joined since has just grown up with it. It’s really helpful in getting people up to speed with our incident response process quickly.

We’re looking forward to continuing to build out our incident automation using Rootly. The ongoing support from the Rootly team is a big help with this as well. Any time we run into trouble, the team is super responsive and knowledgeable. We don’t need to reach out often, but whenever we do, we get the support we need.
