July 8, 2021
4 min read
Rootly is on a mission to create a world where maintaining reliability is frictionless, delightful, and accessible to anyone. We want to make resolving and learning from incidents every organization's superpower.
We are delighted to announce that Rootly has raised a $3.2 million seed round led by Ross Fubini at XYZ Venture Capital with participation from 8VC (investors in Asana, Palantir) and Y Combinator.
Joining them are incredible operators and investors Jason Warner (CTO GitHub), Bharat Mediratta (CTO Dropbox), Max Mullen & Brandon Leonardo (Co-founders Instacart), John Lilly (former Partner Greylock), Akshay Kothari (COO Notion), Andrew Fong (CTO Vise), Ed Shrager (former Softbank), Jett Fein (Partner Headline), Romain Huet (Stripe), and many more.
Before starting Rootly, I was an early engineer and one of the first SRE hires at Instacart back in 2015. I was responsible for our cloud infrastructure, CI/CD pipeline, and overall site reliability as the platform went from processing hundreds of orders to millions. My co-founder JJ was a senior product manager at Instacart, where he built core enterprise offerings responsible for more than 20% of total company GMV and helped lead COVID-19 response efforts.
As Instacart rapidly grew, it became evident that being able to scale our incident management process to resolve and learn from incidents quickly was critical. Ensuring our reliability was my top priority as every minute we were down meant substantial revenue loss, shoppers weren't able to receive orders (imagine being stuck in a grocery store waiting for the application to load...), and customers couldn't place orders.
We invested heavily in monitoring and alerting solutions that did a great job of acting like a smoke alarm, waking us up at 4 am to say that something was wrong. However, none of them actually helped orchestrate the rapid response or the learning from the incident.
As a result, like most companies, we resorted to manual processes and internal hacks as a substitute. For example, we were manually creating dedicated Slack channels to collaborate in, searching for static runbooks in Google Docs, relying on tribal knowledge to guess which team should be involved, manually fetching Datadog metrics, frantically updating leadership and customers, and more. The manual admin overhead served as an expensive and frustrating distraction from what was important at the time - putting out the 🔥.
After the incident was over, we had to copy-paste timestamps out of Slack to create our postmortem timeline, and we did not track incident metrics. We were creating action items in Asana and Jira, and 3 weeks later we didn't know who should be working on what. The valuable insights, learnings, and key follow-up actions that would help prevent similar incidents in the future were lost to manual toil and fragmented tooling.
We searched for tools on the market, but nothing effectively addressed this pain point or helped us resolve incidents quickly and learn how to prevent them. After being in the trenches and involved in millions of dollars' worth of incidents, JJ and I partnered to start Rootly.
Rootly is an incident management platform and Slackbot designed to help companies resolve incidents faster by automating manual admin tasks and providing the insight needed to prevent similar incidents in the future. We are on a mission to create a world where maintaining reliability is frictionless, delightful, and accessible to anyone. We want to make resolving and learning from incidents every organization's superpower.
Companies of all sizes, regardless of maturity, eventually encounter incidents. Already in 2021, companies such as Slack, Robinhood, and most recently Fastly have experienced headline-grabbing outages, demonstrating that no company is immune.
Without the right platform to quickly resolve and learn from incidents, they tend to be:
However, the challenges companies experience have been exacerbated over the last 18 months. 51% of companies report slower incident response times as COVID-19 forced the world remote. System complexity continues to grow as companies shift toward cloud-native architectures and microservices and depend more on third-party services.
Before Rootly, companies were forced to build prohibitively expensive homegrown solutions outside of their core competency.
Today, we are managing hundreds of incidents daily through the Rootly platform with incredible customers such as Nylas, Taplytics, Ritual, and many more. This new funding will allow us to accelerate hiring, expand our core product offering, and continue to onboard even more customers.
We are not simply transforming an industry; we are carving out an entirely new billion-dollar-plus segment ourselves and need incredible talent to achieve this ambitious goal. We are hiring across all roles and looking for unconventional thinkers to come join our team!
Thank you, everyone, for your support over the past year. Now back to work!
— Quentin, JJ