One of the biggest lies we still tell ourselves in IT Ops is this:
If we test harder, plan better, and review longer, we can stop outages from happening.
It’s comforting. It’s also bullshit.
Modern systems are complex by default. They’re distributed, interconnected, and constantly changing. Failure isn’t an edge case anymore. It’s the baseline.
Here’s the catch: the real question isn’t how we stop every failure. It’s how fast we regain control when one happens.
That shift matters because a lot of teams build their release process like it’s meant to prevent embarrassment, not protect users. They optimise for the “perfect release”. They add layers. They add gates. They add meetings. They add sign-offs. They assume nothing will go wrong.
Then something does go wrong. And the only move left is a two-hour rollback and a prayer.
That isn’t resilience. That’s hope wearing a process hat.
I’ve sat in those war rooms. Engineers exhausted. Revenue bleeding. Leadership under pressure. Everyone asking the wrong question.
“Why did this happen?”
That question comes later.
The question that matters in the moment is simple.
How do we stop the impact, right now?
This is where the best teams separate themselves.
They don’t treat a release as a single irreversible event. They treat it as something they can control at runtime. Not because it’s trendy. Because it gives them options when things go sideways.
Feature flags.
Kill switches.
Traffic shaping.
Progressive rollout.
Here’s what this looks like in plain terms.
A two-hour emergency rollback is chaos.
A 20-second toggle is control.
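To make that concrete, here is a minimal Python sketch of a runtime flag with a kill switch and a percentage rollout. Everything in it is hypothetical: `FlagStore`, the `new_checkout` flag, and the in-memory dictionary stand in for whatever flag service you actually run. Treat it as an illustration of the idea, not a recommended implementation.

```python
import hashlib


class FlagStore:
    """In-memory stand-in for a runtime flag service (hypothetical)."""

    def __init__(self):
        # Maps flag name -> percentage of users who see the feature (0-100).
        self._rollout = {}

    def set_rollout(self, name, percent):
        # Progressive rollout: expose the feature to `percent` of users.
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._rollout[name] = percent

    def kill(self, name):
        # Kill switch: one call, no deploy, feature off for everyone.
        self._rollout[name] = 0

    def is_enabled(self, name, user_id):
        percent = self._rollout.get(name, 0)  # unknown flags default to off
        # Stable hash, so a given user stays in the same bucket between calls.
        digest = hashlib.sha256(f"{name}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < percent


def checkout(flags, user_id):
    # The new path ships behind the flag; the old path stays as the fallback.
    if flags.is_enabled("new_checkout", user_id):
        return "new"
    return "old"


flags = FlagStore()
flags.set_rollout("new_checkout", 10)  # start with roughly 10% of users
# ... impact shows up on the new path ...
flags.kill("new_checkout")             # everyone is back on the old path
```

The detail that matters is the last line. Turning the feature off is a data change, not a deployment, which is what makes a toggle measured in seconds plausible.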
Customers don’t care how elegant your architecture is. They never see it. They care whether the service works.
Leadership doesn’t need a perfect explanation in the middle of an incident. They need one thing.
Are we in control?
This is where observability earns its keep.
Not as dashboards. Not as charts. Not as a wall of green boxes that makes everyone feel better for five minutes.
As decision support.
If you can see the impact clearly, you can act decisively.
If you can’t, you hesitate. Everyone does. And hesitation is expensive.
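Here is a rough sketch of what “decision support” could mean in practice: compare the error rate of users on the new path with the rate for everyone else, and turn that into a recommendation. The names (`CohortStats`, `recommend_action`) and every threshold below are assumptions made up for illustration. Your own numbers will depend on your traffic and your tolerance for risk.

```python
from dataclasses import dataclass


@dataclass
class CohortStats:
    """Request and error counts for one group of users (hypothetical shape)."""
    requests: int
    errors: int

    @property
    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0


def recommend_action(flag_on, flag_off, min_requests=200, max_ratio=2.0, min_rate=0.01):
    """Return 'kill', 'hold', or 'continue' for a rollout.

    All thresholds are illustrative assumptions, not recommended values.
    """
    if flag_on.requests < min_requests:
        # Too little traffic on the new path to judge: do not widen the rollout.
        return "hold"
    on_rate = flag_on.error_rate
    off_rate = flag_off.error_rate
    # Recommend the kill switch when the new path is both meaningfully broken
    # (above min_rate) and clearly worse than the old path (above max_ratio).
    if on_rate >= min_rate and on_rate > off_rate * max_ratio:
        return "kill"
    return "continue"


# New path: 80 errors in 1,000 requests. Old path: 90 errors in 9,000 requests.
decision = recommend_action(CohortStats(1000, 80), CohortStats(9000, 90))
```

With those example counts the new path fails 8% of the time against 1% for the old one, so `decision` is `"kill"`. Nobody in the room has to argue about what the dashboard means.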
Allan’s Hard Stop
Resilience isn’t about being perfect. It’s about being in control when things break.
So here’s the habit for next week.
Next time you design a system, a release, or a change process, ask one question:
What’s our fastest safe move if this goes wrong?
Not the cleanest.
Not the most elegant.
The fastest way to reduce impact.
If the answer is “rollback and hope”, you don’t have resilience. You have optimism. And optimism doesn’t scale.
One last thought.
You don’t need more theatre.
You need more control.
Because outages will happen. The teams that win are the ones who can say, calmly and truthfully:
We’ve got this under control.
Allan
PS: Want the audio version? Listen to the Signal Drop here: [Spotify link]
