September 18, 2024
14 mins
Once a leading on-call and alerting solution, PagerDuty is now seen as a legacy tool that struggles to meet the demands of modern SRE teams. Discover the seven most popular, cost-effective, and innovative solutions in the market for 2024.
Not all wines get better with time. In fact, only around 15% of all wine production can age well. By 2024, about 90% of all wine bottles produced in 2009 are past their peak or have noticeably deteriorated in quality.
Motorola Razrs, portable DVD players, and iPods were pervasive fifteen years ago, but you rarely see any of them anymore. Obsolescence in technology happens much faster than in wine. To survive in this industry, you must radically adapt and improve your products and services.
Yet PagerDuty remains essentially the same product it was at its launch in 2009: an expensive phone call service. Instead of investing in its product, PagerDuty has managed to stay afloat by managing legacy enterprise accounts and selling basic features as separate paid add-ons.
Frustrated by the lack of essential functionality a modern SRE team needs, and upset with PagerDuty's inflated costs at a time of tight budgets, many organizations are looking for a robust PagerDuty alternative.
In this article, you’ll learn about seven popular PagerDuty alternatives with their pros, cons, and pricing details.
Launched in 2009, PagerDuty was the go-to on-call and alerting solution, reliable enough to be trusted by enterprises. However, the product was designed around the assumptions and practices of that time. Over the past 15 years, the way organizations develop and ship code has radically changed. Reliability is now a much more mature practice with more demanding requirements.
Some people consider PagerDuty a legacy on-call solution because it keeps offering essentially the same product, without accounting for the many new needs SREs have. Common complaints from PagerDuty customers include a low ROI, unnecessary complexity, and a lack of value beyond delivering push notifications.
According to G2, excessive cost is the top complaint among PagerDuty's customers. PagerDuty's base product is an on-call scheduler that sends push notifications to whoever is on call. Everything else is a paid add-on, each with its own tiers: status pages (and "premium" status pages), automations, call rerouting, AI, and "advanced" integrations.
The lightest plan starts at $21/user/month, but its functionality is so limited that most customers end up needing the $41/user/month plan. Teams that need automated workflows have to jump to the Enterprise tier at $60/user/month. You'll also pay extra if you want status pages or AI.
Here’s a breakdown of the minimum annual costs of setting up your on-call and incident response with PagerDuty, with references for a small and a medium team.
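As a rough sketch of that arithmetic, the per-seat prices quoted above can be multiplied out to an annual figure. The team sizes (10 and 50 seats) and the plan labels `entry` and `mid` are assumptions made for illustration only, and the paid add-ons (status pages, AI) are left out because their prices are not stated here.

```python
# Per-seat monthly prices (USD) as quoted in this article.
# "entry" and "mid" are placeholder labels, not PagerDuty's plan names.
PAGERDUTY_PLANS = {"entry": 21, "mid": 41, "enterprise": 60}

# Hypothetical team sizes, chosen only to illustrate the calculation.
TEAM_SIZES = {"small": 10, "medium": 50}


def annual_cost(usd_per_user_month: int, users: int) -> int:
    """Annual seat cost in USD: monthly per-seat price x seats x 12 months."""
    return usd_per_user_month * users * 12


def breakdown(plans: dict, team_sizes: dict) -> dict:
    """Map each plan to its annual seat cost for every team size."""
    return {
        plan: {team: annual_cost(price, seats) for team, seats in team_sizes.items()}
        for plan, price in plans.items()
    }


if __name__ == "__main__":
    for plan, costs in breakdown(PAGERDUTY_PLANS, TEAM_SIZES).items():
        for team, usd in costs.items():
            print(f"{plan:<10} {team:<6} ${usd:,}/year (seats only, add-ons excluded)")
```

Under these assumptions, a 10-seat team on the $41/user/month plan pays $4,920 per year for seats alone, before any add-ons.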
Ask anybody who has used PagerDuty how much time they spent setting up their on-call rotations and how much they dislike having to update them. You’ll trigger them. G2 reviews reflect this frustration too, with complexity being listed as the second-largest problem PagerDuty customers report.
It takes training and many hours of manual work to set up everything you need for your on-call strategy with PagerDuty. And if you need to update anything, good luck: you'll likely have to redo a lot of your work.
Leave aside the fact that most basic features are sold separately in PagerDuty. Many features that modern teams need are not available in PagerDuty at all, which means SREs must constantly put together hacky and fragile configurations just to get on-call set up.
For example, most organizations these days want to page teams outside of just engineering, but PagerDuty makes this incredibly difficult by requiring you to register your teams as services in order to attach them to an on-call schedule.
In modern on-call solutions like Rootly, more complex features like on-call shadowing can be activated with a click, but PagerDuty requires you to duplicate schedules and update them manually.
PagerDuty only offers on-call and alerting; what comes after is left for you to figure out.
In an attempt to close this gap, PagerDuty acquired an incident management vendor called Jeli in 2023. Jeli focused on post-incident tasks such as writing retrospectives and tracking the action items derived from an incident. Its incident response capabilities were still quite bare by the time it was acquired.
However, Jeli’s rollout as part of PagerDuty remains to be seen. So far, Jeli is only being offered to a select number of PagerDuty enterprise customers. As in most product acquisitions of this kind, customers are unlikely to get a smooth experience for a while.
Fortunately for SREs, PagerDuty is not the only on-call and alerting solution in the market in 2024. We’ve compiled a list of the most popular options and outlined their competitive features, trade-offs, and costs.
Backed by Google and Y Combinator, Rootly is the leading modern on-call and incident management solution.
Rootly, founded in 2021, revolutionized reliability by bringing modern practices into alerting and incident response. Rootly is trusted by industry leaders like LinkedIn, NVIDIA, and Elastic.
Rootly’s comprehensive platform (including on-call, incident response, status pages, AI, and premium support) comes at competitive prices, saving you at least 50% of the costs of PagerDuty. Check out Rootly’s Pricing page for plans and prices.
Rootly also offers a special program for startups getting started with on-call and incident management.
Datadog, founded in 2010, is a popular observability tool that made waves in the industry by introducing ergonomic patterns to make instrumentation and visualization much easier to implement than in traditional solutions, especially in cloud-native environments.
Over the years, Datadog has grown to offer more than a dozen add-ons to their platform. One of the latest additions is Datadog On-Call, currently available only as a private beta.
The information available on the product is limited to a few blog posts, but it looks promising as the content hints at the possibility of accessing more metrics related to the alert. Datadog On-Call could be a strong PagerDuty alternative for Datadog customers due to the context-rich pages it promises.
Datadog On-Call is still in private beta, and there's no public information about how much the company will charge for the add-on. However, complaints about skyrocketing Datadog invoices remain common in the community and are pushing teams to look for alternative observability solutions.
OpsGenie was acquired by Atlassian in 2018, a decision its users lament to this day because the company has stopped investing in the product since then. For Atlassian, OpsGenie is just one more among its 18 products.
Atlassian's strategy for OpsGenie, as a PagerDuty alternative, has always been to compete on pricing rather than platform quality or innovation. Downtime is frequent, which means you could lose critical alerts if their platform is down.
The product has remained the same for years. For example, during the recent AI wave even PagerDuty implemented LLM features to stay competitive, but OpsGenie remained unmoved.
OpsGenie’s “Essentials Plan” starts at $9/user/month, but as the plan name suggests, it only includes basic features that are unlikely to be enough for most teams. To access features like user roles or analytics, you’ll have to opt for the “Enterprise Plan” at $29/user/month. Status pages are sold as a separate product.
Don't forget that you also have to buy an incident response solution if you want your SRE team to do more than receive a push notification.
Below are the annual costs of setting up your on-call and incident response with OpsGenie, with references for a small and a medium team:
Released in 2014, Grafana is a well-known visualization tool, popular in the open-source community as a companion for Prometheus, an OSS monitoring tool. Grafana Labs offers managed versions and enterprise support for their open-source initiatives.
Grafana OnCall, however, was not developed by the Grafana Labs team but bought from a solo developer in 2021. It is not meant to be used as a standalone product, and it’s not commercialized as such.
You can only get managed Grafana OnCall if you’re a customer of Grafana Cloud. There is an OSS version of Grafana OnCall, but setting it up and maintaining it will require at least a full-time engineer dedicated to the project.
Grafana OnCall is not sold as a standalone product, so it doesn't have a cost of its own. It is included at no additional charge in the Grafana Cloud suite of managed solutions, which hints at the level of investment that goes into this on-call product.
The costs of Grafana Cloud are usage-based and have tiered rates. It is difficult to estimate how much you’ll pay for Grafana Cloud without implementing Grafana and Prometheus as part of your observability strategy.
You’ll likely have to speak with a Grafana Labs sales representative to get a better idea of which costs to expect.
Splunk, launched in 2003, is a popular observability solution that caters to traditional enterprises. Splunk acquired VictorOps in 2018 and rebranded it as Splunk On-Call, which it offers as a paid add-on to its platform.
Splunk On-Call is not sold as a standalone product, so a standalone price isn't entirely applicable.
However, Splunk discloses that Splunk On-Call is available as an add-on for $5/user/month if you require fewer than 10 seats. The per-seat cost for larger teams is likely to increase steeply, but you'll need to contact their sales team for details.
BetterStack, founded in 2021, is part of a new generation of observability solutions, focused on uptime monitoring. BetterStack offers solutions that can solve specific use cases, like the availability of your HTTP endpoints, which makes setting it up a breeze compared to Prometheus. However, BetterStack is not meant to be a full-blown replacement for a full-stack observability tool.
BetterStack also offers on-call schedules, so you can set up simple escalation policies and rotations to respond to what their platform reports.
BetterStack is billed at $29/user/month, although they won't charge you for users who only view logs and traces.
If monitoring uptime of your public endpoints is all you need, BetterStack is definitely the perfect option for you, as you’ll get monitoring, a simple on-call suite, and a status page.
Zenduty emerged in 2015 as a cheaper alternative to PagerDuty. However, various issues with its platform, such as recurring scheduling errors and missed alerts, have consistently held the company back. Nearly a decade after its founding, Zenduty remains unprofitable and had to sell a stake in the company in 2021 for $1.9 million to stay in business.
Zenduty starts at $5/user/month when you have fewer than 10 users and offers everything in their platform without limits with the Enterprise tier at $21/user/month.
A robust on-call and incident management solution will be critical for your reliability strategy. To pick the best option, you’ll need to assess your tech stack, budget, and priorities.
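For the budget part of that assessment, a minimal sketch like the one below can shortlist the plans whose seat cost fits a given team size and annual budget. It uses only the per-seat prices quoted in this article. The tier labels `entry`, `mid`, and `add-on` are placeholders, the "fewer than 10" conditions are modeled as a 9-seat cap, and the example team sizes and budgets are hypothetical. Vendors with no public per-seat price in this article (Rootly, Datadog On-Call, Grafana OnCall) are left out, and add-ons or required platform subscriptions are not counted.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Plan:
    vendor: str
    tier: str                        # label used here; may not match the vendor's naming
    usd_per_user_month: int
    max_users: Optional[int] = None  # None means no seat cap is stated in the article


# Per-seat prices as quoted in this article (seat cost only).
PLANS = [
    Plan("PagerDuty", "entry", 21),
    Plan("PagerDuty", "mid", 41),
    Plan("PagerDuty", "enterprise", 60),
    Plan("OpsGenie", "Essentials", 9),
    Plan("OpsGenie", "Enterprise", 29),
    Plan("Splunk On-Call", "add-on", 5, max_users=9),  # "fewer than 10 seats"
    Plan("BetterStack", "standard", 29),
    Plan("Zenduty", "entry", 5, max_users=9),           # "fewer than 10 users"
    Plan("Zenduty", "Enterprise", 21),
]


def affordable(plans: List[Plan], users: int, annual_budget_usd: int) -> List[Tuple[Plan, int]]:
    """Return (plan, annual seat cost) pairs within budget, cheapest first."""
    fits = []
    for plan in plans:
        if plan.max_users is not None and users > plan.max_users:
            continue  # quoted price only applies below the seat cap
        cost = plan.usd_per_user_month * users * 12
        if cost <= annual_budget_usd:
            fits.append((plan, cost))
    # Sort by cost, then vendor name, so ties come out in a stable order.
    return sorted(fits, key=lambda pair: (pair[1], pair[0].vendor))


if __name__ == "__main__":
    # Hypothetical inputs: a 20-person team with a $6,000 annual budget.
    for plan, cost in affordable(PLANS, users=20, annual_budget_usd=6000):
        print(f"{plan.vendor} ({plan.tier}): ${cost:,}/year")
```

Seat price is only one input. A plan that fits the budget may still lack features you need, which is why tech stack and priorities belong in the same assessment.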