Rootly | Tristan Watson

Quick facts

💻 SRE @ Oak National Academy

✨ OpenTelemetry Expert

🚲 Avid Cyclist

📍 Based in London, UK

Five Questions with Tristan Watson

What are the hallmarks of a strong observability culture?

The key is to make monitoring and observability approachable and digestible for everyone in the organization. Alerts should be understandable even to non-technical team members. For modern serverless architectures, where multiple platforms and services are stitched together, centralizing everything into a "single pane of glass" is essential. Metrics, alerts, and events should all be integrated, giving clear visibility into the system’s health and potential issues.

Additionally, it’s crucial to establish robust incident response processes. In serverless setups, where you’re often dependent on external services, understanding where and how failures can cascade is paramount.

How do you approach observability in serverless environments?

Serverless architectures introduce unique challenges because they rely heavily on cloud provider-managed infrastructure. For example, with AWS, you might use Lambda, SQS, or state machines to handle events. This event-driven model—while powerful—is difficult to monitor due to the abstraction and lack of direct control over infrastructure.

Tracing is really useful here. By instrumenting your code to track the lifecycle of events, you can pinpoint bottlenecks, latency issues, or unexpected errors. Tools like OpenTelemetry allow you to follow requests through their entire lifecycle, providing developers with invaluable insights. Business events, such as password resets, can also serve as a starting point for leveling up observability—helping teams benchmark performance and detect anomalies proactively.
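To make the idea of following an event through its lifecycle concrete, here is a minimal sketch using only the Python standard library. It is not the OpenTelemetry SDK, which handles context propagation for you; it only imitates the W3C Trace Context `traceparent` format that OpenTelemetry propagates by default. The function names and the shape of the event are hypothetical, chosen for illustration around the password-reset example.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-spanid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace: random 16-byte trace ID and 8-byte span ID, sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(parent: str) -> str:
    """Keep the parent's trace ID and flags, but mint a new span ID for this hop."""
    match = TRACEPARENT_RE.match(parent)
    if match is None:
        # Missing or malformed context: start a fresh trace rather than fail.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def publish_password_reset(user_id: str) -> dict:
    """Hypothetical producer: attach trace context to the outgoing event."""
    return {
        "type": "password_reset",
        "user_id": user_id,
        "traceparent": new_traceparent(),
    }


def handle_event(event: dict) -> dict:
    """Hypothetical consumer (e.g. a queue-triggered function): continue the same trace."""
    return {**event, "traceparent": child_traceparent(event.get("traceparent", ""))}
```

Because every hop carries the same trace ID, a tracing backend can stitch the producer and consumer spans into one end-to-end view of the business event, even though they run in separate managed services.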

How does incident response differ in serverless architectures?

Incident response in serverless setups can be more challenging because of the distributed nature of the systems. Events flow between various services, making it harder to isolate and diagnose issues. Tracing becomes invaluable in these situations, acting as a "cheat code" for understanding failures.

That said, the fundamentals remain similar: monitoring metrics, setting clear error thresholds, and having robust incident response workflows. However, serverless systems require a heightened focus on recovery processes, as you often don’t own the underlying infrastructure. The balance between cost and visibility also plays a significant role—serverless environments can offer cost savings, but investing in effective monitoring and tracing is critical to ensure reliability.
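As a rough illustration of a clear error threshold, the sketch below checks one monitoring window against a maximum error rate. The 5% limit, the minimum sample size, and the `WindowStats` type are assumptions made for the example, not values from the interview.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Counts for one monitoring window (for example, the last five minutes)."""
    invocations: int
    errors: int


def error_rate(stats: WindowStats) -> float:
    """Fraction of invocations that failed; 0.0 when there was no traffic."""
    if stats.invocations == 0:
        return 0.0
    return stats.errors / stats.invocations


def breaches_threshold(
    stats: WindowStats,
    max_error_rate: float = 0.05,  # assumed limit: alert above 5% errors
    min_invocations: int = 20,     # assumed floor: ignore very small samples
) -> bool:
    """Return True when the window has enough traffic and exceeds the error limit."""
    if stats.invocations < min_invocations:
        return False
    return error_rate(stats) > max_error_rate
```

The minimum-invocation floor is a design choice to stop a single failed call in a quiet period from paging anyone; low-traffic serverless functions may need a different rule.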

What role does Slack play in observability and incident management?

Slack is an integral tool for observability and incident management, but it needs to be managed thoughtfully. Treating Slack like a well-maintained garden—with clear channels and workflows—is essential to avoid information overload and alert fatigue.

Start by analyzing Slack analytics to understand channel usage and set up public channels for key notifications. Introduce alerts gradually, focusing first on high-severity issues. Tools like Rootly can streamline this process by integrating incident management directly into Slack, enabling teams to work where they’re most comfortable.
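One way to picture the gradual rollout is a severity gate in front of the channel: only alerts at or above a minimum level are posted, and the minimum is lowered as the team gets comfortable. This is a sketch under assumed names; the `Severity` levels and the alert dictionaries are hypothetical and do not reflect the Slack or Rootly APIs.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; higher means more urgent."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def should_post(severity: Severity, minimum: Severity = Severity.HIGH) -> bool:
    """Post only alerts at or above the current minimum severity."""
    return severity >= minimum


def select_alerts(alerts: list[dict], minimum: Severity = Severity.HIGH) -> list[dict]:
    """Filter a batch of alerts down to the ones worth sending to the channel."""
    return [alert for alert in alerts if should_post(alert["severity"], minimum)]
```

Starting with `minimum=Severity.HIGH` and later dropping it to `Severity.MEDIUM` mirrors the advice to introduce alerts gradually rather than all at once.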

Ultimately, Slack’s ubiquity and ease of use make it a powerful platform for driving passive learning and fostering a culture of observability. Teams can stay informed without leaving their preferred workspace, improving adoption and engagement.

Why is OpenTelemetry such a hot topic in observability?

OpenTelemetry is reshaping the observability landscape by providing a standard for metrics, traces, and logs. Previously, developers relied on vendor-specific SDKs and APIs, leading to fragmentation and lock-in. OTel’s open-source framework, backed by the CNCF, offers portability and flexibility, allowing teams to experiment with different platforms without rewriting their telemetry setups.

One of OTel’s most significant advantages is its ability to standardize metrics, making them more intuitive and portable. For example, an HTTP status code is simply labeled as such, regardless of the provider. This normalization reduces friction and simplifies adoption, enabling teams to focus on insights rather than implementation.
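The normalization can be sketched as a simple key mapping. The target names `http.response.status_code` and `http.request.method` come from the OpenTelemetry HTTP semantic conventions; the vendor-style keys on the left are made up for illustration, and the helper is not part of any OpenTelemetry library.

```python
# Hypothetical vendor-specific keys mapped to OpenTelemetry semantic convention names.
VENDOR_KEY_ALIASES = {
    "statusCode": "http.response.status_code",
    "http_status": "http.response.status_code",
    "httpMethod": "http.request.method",
}


def normalize_attributes(raw: dict) -> dict:
    """Rename known vendor keys to their standard names; pass other keys through unchanged.

    If two input keys map to the same standard name, the later one wins.
    """
    return {VENDOR_KEY_ALIASES.get(key, key): value for key, value in raw.items()}
```

With every source reporting the status code under the same attribute name, a dashboard or alert written once can be reused across providers.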

I’m particularly excited about OTel’s potential for extending observability beyond traditional applications. For instance, I’ve been brainstorming ways to use it for personal projects, like tracking air quality during bike rides or monitoring urban wildlife activity. The possibilities are endless, and OTel’s flexibility ensures that innovation in this space will continue to grow.

What’s your favorite cycling gear?

I’ve accumulated quite a collection of bikes over the years, from a titanium bike to a mountain bike. My favorite, though, is my pub bike: the one I ride to the pub. It’s simple, practical, and perfect for urban commutes or snowy rides.

I’ve even thought about integrating OpenTelemetry into cycling projects, like using sensors to track air quality on my rides. There’s so much potential for combining technology with everyday activities to create meaningful insights—something I hope to explore further in the future.

👉 Where to Find Tristan

You can connect with Tristan on LinkedIn or X to learn more about his latest projects and thoughts on observability and cycling.
