What you get from this one: a four-question habit you can run before merge on every feature, so the team stops paying the heroism tax at three in the morning.
In this drop
The point: Most observability work is reacting to the failure, not preventing it. We dress that reaction up as professionalism.
Why it matters: Heroism at 3 a.m. is the polite word for an unbuilt position. The cost of positioning is paid once; the cost of not positioning is paid every incident.
Try this next week: before the next merge, run the four-question positioning checklist.
The point
I got a page at a quarter past three in the morning. Customer-facing P99 above two seconds, sustained for six minutes. By the time the three of us were on the call, none of us could name the service's dashboard. The runbook was a stub. The SLO was a slide.
We got the customer back inside an hour. By Friday, the story we told ourselves was that the team had responded brilliantly under pressure.
That story is a lie.
Reality check
Reality check: the response to an incident is not a strategy. The position you built before the incident is the strategy. Everything else is the receipt.
One proof
Source: This is the same lesson that Buffett, Bezos, and every operations text from Deming to the SRE Workbook teach in different vocabulary. Position before the event. Patience is paid in advance, never on the day.
Field note: the team above didn't lose that Tuesday night to bad luck. They lost it the previous quarter, when the launch slipped and we didn't go back and write the dashboard. They lost it in the planning meeting, where we agreed the SLO in principle and never converted it to an alert. They lost it in every review where we accepted 'we'll tune it later' as an answer.
Where this breaks
Positioning is not a substitute for response. A live incident on a well-positioned system is still a live incident, and the on-call still does the work. The habit below does not replace incident response. It reduces the number of incidents that have to be heroic.
Try this next week
Pick the next feature your team is shipping. Before merge, write down the three failure modes you can actually name. Three, in plain words.
Decide the SLO for each. A number you would defend in a room.
Decide the dashboard tile the on-call should see at 3 a.m. Not the dashboard. The tile.
Decide the paging rule. What number, sustained for how long, wakes the senior? Written when you are not tired.
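If it helps to see the four answers as something you could check in next to the feature, here is a minimal Python sketch. Everything in it is an assumption for illustration: the names Position, PagingRule, is_positioned and should_page are hypothetical, the samples are one P99 reading per minute, and the two-seconds-for-six-minutes numbers are borrowed from the page described above, not a recommendation. Your alerting tool will have its own syntax for the same idea.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PagingRule:
    """Hypothetical rule: page when P99 stays above a threshold."""
    threshold_seconds: float  # the number
    sustained_minutes: int    # sustained for how long

    def __post_init__(self) -> None:
        if self.sustained_minutes <= 0:
            raise ValueError("sustained_minutes must be positive")


@dataclass(frozen=True)
class Position:
    """The four answers for one failure mode, written before merge."""
    failure_mode: str
    slo: str
    dashboard_tile: str
    paging_rule: PagingRule


def is_positioned(position: Position) -> bool:
    """Merge gate: every written answer must be non-empty."""
    answers = (position.failure_mode, position.slo, position.dashboard_tile)
    return all(answer.strip() for answer in answers)


def should_page(rule: PagingRule, p99_per_minute: Sequence[float]) -> bool:
    """True only if the most recent `sustained_minutes` samples all breach."""
    window = p99_per_minute[-rule.sustained_minutes:]
    if len(window) < rule.sustained_minutes:
        return False  # not enough history to call it sustained
    return all(sample > rule.threshold_seconds for sample in window)


# Illustrative values only (assumed, not measured).
rule = PagingRule(threshold_seconds=2.0, sustained_minutes=6)
checkout = Position(
    failure_mode="Checkout latency climbs when the payment provider slows down",
    slo="P99 under 2 seconds",
    dashboard_tile="Checkout P99 latency",
    paging_rule=rule,
)

sustained_breach = [1.2, 2.4, 2.6, 2.5, 2.9, 3.1, 2.7]
brief_recovery = [2.4, 2.6, 1.9, 2.5, 2.9, 3.1]

print(is_positioned(checkout))              # True
print(should_page(rule, sustained_breach))  # True
print(should_page(rule, brief_recovery))    # False
```

The code is not the point. The point is that the paging rule becomes two numbers somebody typed on a calm afternoon, and an empty answer becomes a reason to hold the merge.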
About the book
If this Signal Drop lands with you, the same thinking runs through Metrics & Mayhem: A CTO's Guide to Observability That Actually Works. Out today in paperback and hardback. Kindle's already live.
Want to taste it first? The free first chapter is yours: FREE CHAPTER.
Or skip the chapter and go straight in: BOOK LINK.
Two links I'm watching
Shane Parrish, Clear Thinking: where the positioning frame in this episode comes from. The Buffett example sits in there.
Google SRE Workbook, 'Implementing SLOs': the cleanest practical guide to defining an SLO before the feature ships, not after.
One question for you
What's the last 3 a.m. page your team took where, in hindsight, the position could have been built and wasn't? Hit reply. I read every one.
Allan
PS: The episode runs about five minutes. Listen here: SPOTIFY.