Metrics & Mayhem | Signal Drop: The Alert That Just Says "We Need To Talk"

What you get from this one: a context-free page doesn't inform the on-call, it frightens them. Here is the three-part rewrite that fixes it.

In this drop

The point: A context-free page is an open loop, and open loops fill with fear.
Why it matters: The first cost of a vague alert isn't the interruption; it's the dread before the on-call knows what they face.
Try this next week: Rewrite your three noisiest alerts to answer what broke, how bad, what to do first.

The point

A duty engineer showed me the page that woke her. It said: error rate elevated. Three words and a graph. She didn't reach for the dashboard first. She lay there guessing. Which service. How bad. The big one or the noisy one.

None of that was in the page. It said something is wrong and left her to supply the rest. That gap, the open loop with no conclusion, is where the fear goes. It's the on-call equivalent of a text that reads 'we need to talk', full stop.

We tune thresholds for weeks and spend ninety seconds on the message. But an alert isn't a trigger. It's a message to a tired human at the worst hour of their day, and it has one job: tell them what they're walking into before they have a chance to be afraid of it.

Reality check

The first cost of a vague alert isn't the lost sleep, it's the ten seconds of dread before the on-call knows what they're dealing with.

One proof

Rob Ewaschuk's long-circulated 'My Philosophy on Alerting', written at Google, made the case years ago that every page should be actionable and carry context. Field note: on one team, we rewrote the summary line on our five noisiest alerts, so that each gained a one-line 'what broke / how bad / first action' plus an inline runbook link. That cut the gap between page and first correct action from roughly nine minutes to under three, measured across the on-call rota over the following month.

Where this breaks

This breaks if you treat it as a wording exercise. If an alert can't honestly answer the three questions, the fix isn't better copy, it's deciding whether it should page a human at all.

Try this next week

Pull your three noisiest alerts. For each, read the message exactly as a human receives it on their phone in the dark.
Rewrite the summary line to answer three things: what broke, how bad, what to do first. Put the runbook link in the message itself.
Any alert that can't answer all three gets demoted from paging until it can.
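
If it helps to see those three steps as a check you could run over your alert definitions, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the Alert fields, the 'page' and 'ticket' routing labels, the ' | ' separator, and the example service and runbook URL are hypothetical, not taken from any particular alerting tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Alert:
    """Hypothetical alert definition; field names are illustrative only."""
    name: str
    what_broke: Optional[str] = None
    how_bad: Optional[str] = None
    first_action: Optional[str] = None
    runbook_url: Optional[str] = None


def missing_answers(alert: Alert) -> List[str]:
    """Return which of the three questions the alert cannot answer."""
    answers = {
        "what broke": alert.what_broke,
        "how bad": alert.how_bad,
        "what to do first": alert.first_action,
    }
    # An empty or whitespace-only answer counts as no answer.
    return [q for q, a in answers.items() if not (a and a.strip())]


def route(alert: Alert) -> str:
    """Demote to a ticket (assumed non-paging channel) until all three are answered."""
    return "ticket" if missing_answers(alert) else "page"


def summary_line(alert: Alert) -> str:
    """Build the one-line summary a human reads on their phone."""
    gaps = missing_answers(alert)
    if gaps:
        raise ValueError(f"{alert.name} cannot answer: {', '.join(gaps)}")
    parts = [
        alert.what_broke.strip(),
        alert.how_bad.strip(),
        f"First: {alert.first_action.strip()}",
    ]
    # The runbook link goes in the message itself when there is one.
    if alert.runbook_url:
        parts.append(f"Runbook: {alert.runbook_url}")
    return " | ".join(parts)


if __name__ == "__main__":
    vague = Alert(name="error rate elevated")
    clear = Alert(
        name="checkout-5xx",
        what_broke="Checkout API returning 5xx",
        how_bad="12% of requests failing for 5 min",
        first_action="roll back the latest deploy",
        runbook_url="https://runbooks.example.com/checkout-5xx",
    )
    for alert in (vague, clear):
        print(alert.name, "->", route(alert), missing_answers(alert))
    print(summary_line(clear))
```

The check is deliberately blunt. It can tell you a field is empty, not whether the answer in it is honest, so the reading-it-in-the-dark step still has to be done by a person.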

About the book

If this Signal Drop lands with you, the same thinking runs through Metrics & Mayhem: A CTO's Guide to Observability That Actually Works. Out today in paperback and hardback. Kindle's already live.

Want to taste it first? The free first chapter is yours: FREE CHAPTER.

Or skip the chapter and go straight in: BOOK LINK.

Three links I'm watching

Google SRE Workbook, 'Alerting on SLOs' (sre.google/workbook): the cleanest argument for alerting on symptoms users feel, not raw causes.
Rob Ewaschuk, 'My Philosophy on Alerting': old, still right. Every page should be actionable.
Signal Drop 8, 'Context, Intent, Headline': the companion idea for any operational message, not just alerts.

One question for you

What's the vaguest alert your team still pages on, and what would it say if it had to tell the truth in one line?

Allan

PS: The episode runs about five minutes. Listen here: SPOTIFY.