On June 12, 2025, a single policy update entered Google Cloud's Spanner database. The update contained blank fields. Within seconds (because that's what Spanner does) it replicated globally. Every regional instance of Service Control tried to process the same malformed record, hit a null pointer, and crashed. Seventy-six services went down. Gmail. Spotify. Discord. Snapchat. Seven hours.
Not a code deployment. Not a cyberattack. A configuration change with a blank field.
I've sat in post-incident reviews where the timeline was longer than the outage itself, and still nobody in the room could explain what actually changed. That's the pattern now. The system knows exactly what happened. The humans find out later, sometimes much later.
Four months after Google, a DNS race condition in AWS US-EAST-1 cascaded through DynamoDB's global tables. Fifteen hours. Four million users. Over a thousand companies were stalled because two internal components disagreed about who owned a DNS record.
What a Hundred Outages in One Year Actually Proved
Between August 2024 and August 2025, AWS, Azure, and Google Cloud together experienced more than a hundred service outages. Not a hundred alerts. A hundred outages. The pattern wasn't subtle: configuration changes, not releases, not attacks, were the dominant root cause. A malformed bot config file took down Cloudflare's global network in November. An Azure Front Door routing change cascaded to Xbox, Microsoft 365, and airline check-in kiosks in October.
Every single one of these propagated faster than the humans responsible for them could react.
Reality check: the bottleneck in 2025 wasn't tooling. It wasn't talent. It was the growing distance between what systems can replicate and what leaders can comprehend.
Grafana Labs' 2025 Observability Survey (1,255 respondents) found complexity ranked as the number one concern at 39%, with signal-to-noise challenges at 38%. We're not short of telemetry. We're drowning in it. As The New Stack reported, the economics of observability have been upside down for years, with costs rising linearly with volume while value hasn't kept pace.
More data. Less understanding. That's the gap nobody owns.
Why the CTO Role Wasn't Built for This
I've worked with CTOs who could whiteboard an entire platform from memory. Brilliant architects. But the role they trained for assumed that if you understood the blueprint, you could predict the failure modes.
That assumption didn't survive 2025.
When a null pointer in a quota-checking feature propagates globally through Spanner in under ten seconds, no blueprint helps you. When a DNS race condition brings down a thousand companies because of a dependency chain that doesn't appear in any architecture diagram, the grand plan is just a PDF nobody opened. (Harsh? Maybe. But I've seen the PDFs.)
The Grafana survey found roughly three-quarters of organisations say observability is business-critical at CTO or VP level. That sounds encouraging until you look closer: only a third cite the CTO specifically as the highest decision-maker for observability. In most organisations, it still sits below the leadership table, surfaced only when something breaks badly enough to make the news.
I'm not blaming CTOs. The role was shaped for an era where systems changed on release cycles, not replication cycles. The skills that made someone a great architect (long-horizon planning, vendor strategy, technology selection) are not the skills that explain why seventy-six services just died because of a blank field in a policy record. Those are different muscles. And most leadership structures haven't built them yet.
The Case for a Chief Observability Officer (and Why It Might Not Work)
So here's the thought experiment. What if observability had its own seat at the leadership table?
Not someone who owns the tools. Platform engineering keeps that. Someone who owns the organisational capability: making sure that when configuration changes move at machine speed, human understanding keeps pace. The translation layer between what telemetry shows and what the business needs to decide.
Some serious people are making this case. Abid Neemuchwala, former CEO of Wipro and now co-founder at Dallas Venture Capital, argued on the VuNet Observability Talk podcast that boards should treat observability as strategic, not a back-office IT concern. If digital experience depends on platform uptime, and platform uptime depends on understanding real-time system behaviour, then observability is a board-level risk. Hard to argue with the logic.
In practice, a CObO would own three things: defining outcome indicators executives can act on, ensuring explicit ownership of the link between business outcomes and engineering signals, and maintaining a live map of dependency chains that no static document can capture.
Adding a title doesn't fix an accountability vacuum. It just gives it a nicer chair.
Where this breaks: most organisations can't even agree on who owns the runbook. A CObO without the authority to halt a deployment that ships without observability coverage is decoration, not leadership. And we have enough decoration.
To be fair, the role might never need its own title. What it needs is the function. Whether it sits with the CTO, a VP of Engineering, or someone else matters far less than whether anyone is explicitly on the hook for the distance between replication speed and understanding speed.
What to Do Before You Hire Anyone
Titles are slow. These aren't.
Pick one outcome indicator and two early signals for your most critical service. For a support system: outcome = cases resolved in two or fewer interactions. Early signals = hedged-response rate and retrieval latency p95. Outcome owner: Product. Signal owner: SRE. Write the names on the service page. If ownership is unclear, the metric becomes a blame magnet. Fix ownership first.
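If you want that on an actual page rather than in someone's head, it can be as small as this. A minimal sketch in Python; the metric names, thresholds, and owners are illustrative, not prescriptive:

```python
from dataclasses import dataclass, field

@dataclass
class Signal:
    """An early signal that moves before the outcome does."""
    name: str
    owner: str   # team accountable for keeping the signal healthy
    target: str  # human-readable threshold, kept next to the owner on purpose

@dataclass
class ServicePage:
    """One critical service, one outcome, two signals, names on the page."""
    service: str
    outcome: str
    outcome_owner: str
    signals: list[Signal] = field(default_factory=list)

    def unowned(self) -> list[str]:
        """Anything without a named owner is a blame magnet waiting to happen."""
        gaps = [] if self.outcome_owner else [self.outcome]
        gaps.extend(s.name for s in self.signals if not s.owner)
        return gaps

# Illustrative values from the support-system example above.
support = ServicePage(
    service="customer-support",
    outcome="cases resolved in two or fewer interactions",
    outcome_owner="Product",
    signals=[
        Signal("hedged-response rate", owner="SRE", target="< 5% of replies"),
        Signal("retrieval latency p95", owner="SRE", target="< 800 ms"),
    ],
)

assert support.unowned() == []  # fix ownership first
```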
Run a weekly dependency review. Ten minutes. One page. What configuration changes shipped? Which ones lacked rollback plans? Which ones propagated globally without a feature flag? Google's June outage happened because new code shipped active across all regions without feature-flag protection. The SRE team found and threw the emergency kill switch, but only after the crash had gone global. That's not a process. That's luck with good reflexes.
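The review is easier to keep honest if the one-pager is generated from change records instead of from memory. A rough sketch, assuming your deployment tooling can export each change with a few boolean fields (the field names are mine, not any particular CD system's):

```python
from dataclasses import dataclass

@dataclass
class ConfigChange:
    name: str
    rollback_plan: bool   # is there a documented, tested way back?
    feature_flag: bool    # can it be disabled without a redeploy?
    global_rollout: bool  # did it ship to all regions at once?

def weekly_review(changes: list[ConfigChange]) -> str:
    """Produce the one-page summary: what shipped, and which changes
    propagated globally with no way to stop or reverse them."""
    lines = [f"{len(changes)} configuration changes shipped this week"]
    for c in changes:
        risks = []
        if not c.rollback_plan:
            risks.append("no rollback plan")
        if c.global_rollout and not c.feature_flag:
            risks.append("global rollout, no kill switch")
        if risks:
            lines.append(f"  - {c.name}: {', '.join(risks)}")
    return "\n".join(lines)

# Illustrative data; in practice this would come from your pipeline's export.
print(weekly_review([
    ConfigChange("quota-policy-update", rollback_plan=False,
                 feature_flag=False, global_rollout=True),
    ConfigChange("edge-routing-weights", rollback_plan=True,
                 feature_flag=True, global_rollout=False),
]))
```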
Version what changes answers. If your system includes AI components, expose model version, prompt version, retrieval index, and policy version as labels in your logs and tickets. Not as afterthoughts. As configuration.
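Here's one way "as configuration" can look in practice, using Python's standard logging module to stamp every log line with the versions that shape an answer. The label names and values are placeholders, not a standard:

```python
import logging

# Versions of everything that changes what the system answers.
# Placeholder values; in practice they'd come from your deploy metadata.
ANSWER_CONFIG = {
    "model_version": "m-2025-06-01",
    "prompt_version": "p-014",
    "retrieval_index": "idx-2025-05-28",
    "policy_version": "pol-7",
}

class AnswerConfigFilter(logging.Filter):
    """Stamp every record with the configuration that shaped the answer."""
    def filter(self, record: logging.LogRecord) -> bool:
        for key, value in ANSWER_CONFIG.items():
            setattr(record, key, value)
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s %(message)s "
    "model=%(model_version)s prompt=%(prompt_version)s "
    "index=%(retrieval_index)s policy=%(policy_version)s"
))

logger = logging.getLogger("support-bot")
logger.addHandler(handler)
logger.addFilter(AnswerConfigFilter())
logger.setLevel(logging.INFO)

logger.info("answered case 4821 in 2 interactions")
# When answer quality shifts next week, the log shows which of these
# four versions changed with it.
```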
Stop building dashboards nobody reads. If the last three incidents were discovered by customers before your monitoring caught them, the dashboard isn't the problem. The assumption behind it is. (I've built those dashboards. I know how it feels.)
The outages of 2025 weren't caused by a lack of talent or tooling. They were caused by a structural gap that nobody owns: systems that replicate changes in seconds, governed by leaders who review dashboards in quarters.
Whether the answer is a Chief Observability Officer, an expanded CTO mandate, or just a VP who refuses to ship without observability coverage, the function matters more than the title. Someone has to own the distance between speed and understanding. Right now, in most organisations, nobody does.
Start with one outcome, two signals, and a name on the page. Review weekly. If it feels too simple, good. You're finally measuring understanding, not just uptime.
Observability isn't about seeing more. It's about seeing what matters.
Further Reading
Grafana Labs 2025 Observability Survey: State of observability across 1,255 organisations
How the Google Cloud Outage Crashed the Internet: ByteByteGo's technical breakdown of the June 12 incident
Major Cloud Outages of 2025: IncidentHub's comprehensive tracker
11 Biggest Network Outages of 2025: ThousandEyes / Network World analysis
Can OpenTelemetry Save Observability in 2026?: The New Stack on cost and complexity

