The gap that no dashboard shows

Your monitoring says the payment system is healthy. CPU is fine. Response times look good. Error rates are within tolerance.

But customers can't complete transactions. Support lines are ringing. Trust is eroding.

This is the reliability gap: the space between systems that look healthy and customers who can't get their work done. Most organisations buy more tools to fix this. But observability isn't a tooling project. It's a decision-making capability.

This book shows you how to build that capability through people, process, and technology.

Who this is for (and who it's not for)

This is for you if:

• You're a CTO, Head of Engineering, or senior technical leader responsible for reliability.

• You've bought observability tools but still can't answer the question: "Are customers okay?"

• You work in banking, finance, or regulated industries where downtime means real consequences.

• You want a strategic guide, not a tool manual.

This is not for you if:

• You're looking for a tool comparison or vendor guide.

• You need step-by-step technical setup instructions.

• You think observability is only about logs, metrics, and traces.

• You don't have operational responsibility for live systems.

The three-move framework

Observability isn't a tool. It's a decision-making capability built through three connected moves.

Move 1: Pick your critical journeys

You can't observe everything. Start with 3–5 customer journeys that define your business. Payments is one example. Login is another. Focus your instrumentation and attention where it matters most.

Move 2: Pair outcomes with early warnings

Define what "healthy" looks like from the customer's point of view (like completed payments). Then connect that to the technical signals that warn you before customers feel pain. This pairing is where observability becomes actionable.
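As a rough illustration of that pairing, here is a minimal Python sketch. Everything in it is hypothetical and chosen for illustration only: the `Journey` and `EarlyWarning` names, the signal names, the thresholds, and the 99% completion target are assumptions, not taken from the book. It pairs one customer outcome (completed payments) with two technical early warnings.

```python
from dataclasses import dataclass


@dataclass
class EarlyWarning:
    name: str         # technical signal (hypothetical), e.g. a latency percentile
    value: float      # latest observed value
    threshold: float  # level at which the signal counts as a warning

    def firing(self) -> bool:
        return self.value >= self.threshold


@dataclass
class Journey:
    name: str
    attempted: int       # customer attempts in the window
    completed: int       # attempts that succeeded
    target_rate: float   # assumed definition of "healthy" for this journey
    warnings: list[EarlyWarning]

    def completion_rate(self) -> float:
        # With no traffic there is nothing to fail, so treat the rate as 1.0.
        return self.completed / self.attempted if self.attempted else 1.0

    def status(self) -> str:
        # The customer outcome is checked first: if it is below target,
        # customers are already feeling pain.
        if self.completion_rate() < self.target_rate:
            return "customers affected"
        # Outcome still fine, but a technical signal says trouble may be coming.
        if any(w.firing() for w in self.warnings):
            return "at risk"
        return "healthy"


payments = Journey(
    name="payments",
    attempted=1000,
    completed=995,
    target_rate=0.99,
    warnings=[
        EarlyWarning("gateway_p95_latency_ms", value=850, threshold=800),
        EarlyWarning("retry_queue_depth", value=12, threshold=100),
    ],
)
print(payments.name, payments.status())
```

In this sketch the payment completion rate is still above target, but one early warning has crossed its threshold, so the journey reports "at risk" before customers notice anything.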

Move 3: Build the operating model

Observability only works if teams know who owns what, when to meet, and how to learn from incidents. This isn't bureaucracy. It's the difference between reactive firefighting and proactive control.

About the author

Allan Mann

Allan has spent 25+ years leading IT operations in banking and government, building resilient systems where failure is not an option. He's been on the receiving end of vendor pitches, budget battles, and 3am outages that weren't supposed to happen.

He writes and speaks about observability, resilience, and what actually works in highly regulated environments. He's the voice behind the Mastering Observability newsletter and the Metrics & Mayhem podcast, where he breaks down complex operational challenges into clear, actionable guidance for technical leaders.

No vendor agenda. No fluff. Just hard-won experience.

What's inside

Eleven chapters on observability that actually works. Each one leads to a single decision before you move on to the next.

1. Why This Is a Board-Level Problem

The MTTR paradox: more tools, slower recovery.

2. What Observability Actually Means for a CTO

From watching uptime to answering business questions.

3. What It's Really Costing You

The real bill below the licence fee.

4. Who Owns What When It Breaks

The three-layer ownership model. The CrowdStrike test.

Read this chapter free →

5. The People Problem Nobody's Solving

Training is not capability. Practice is.

6. Why Your Best Engineers Aren't Telling You the Truth

Psychological safety, and the cost of silence.

7. The Machines Are Getting Better. Are Your Teams?

AI in operations, governed properly.

8. The CTO Nobody Trained You to Be

Three shifts the role has not kept pace with.

9. What the Organisations That Get This Right Actually Did

GitHub, Fastly, Atlassian: three real recoveries.

10. How to Build This Without Burning It Down

One outcome, one owner, one quarter.

11. What to Do Monday Morning

Three conversations, a 90-day plan.

Ready when you are.

Start with a chapter, or pick the edition that suits you.

Kindle · Paperback · Hardback

Frequently asked questions

Is this technical?

It's strategic first, technical second. You won't need to code, but you will need to understand how systems produce signals and why those signals matter for decision-making. If you can read a dashboard and ask questions about what it means, you're technical enough.

Do I need a specific tool?

No. This book is vendor-neutral. It covers principles and practices that work regardless of which monitoring, logging, or tracing tools you use. The framework applies whether you're using open source, commercial platforms, or a mix of both.

How long is it?

Approximately 200 pages. It's designed to be read in 3–4 focused sessions, or you can jump to the chapters most relevant to your current challenges. Each chapter stands alone, so you don't need to read cover to cover.

Is it suitable for regulated industries?

Yes. The book includes a dedicated chapter on observability in highly regulated environments, drawing on real-world experience in banking and government. It addresses compliance, audit requirements, data sovereignty, and how to build trust without compromising control.

What will I be able to do after reading?

You'll be able to identify your critical customer journeys, define what healthy looks like, set up early warning signals, assign ownership, and build the operating cadence needed to turn signals into decisions. You'll also have a clear 90-day roadmap to get started.

How does this relate to SLOs?

SLOs (Service Level Objectives) are service promises: commitments you make about how reliable your systems will be. This book shows you how to set SLOs that matter to customers, not just technical teams. You'll learn how to tie them to business outcomes and use them as decision-making tools.
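To make the idea of an SLO as a decision-making tool concrete, here is a minimal Python sketch of an error-budget calculation, a common convention in reliability engineering rather than something this page attributes to the book. The function name, the 99.5% target, and the traffic figures are all hypothetical.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the fraction of the error budget still unspent in the window.

    slo_target is the promised success rate, e.g. 0.995 for 99.5%.
    The result is 1.0 when nothing has failed and goes negative once
    the promise has been broken.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        raise ValueError("total must be positive")
    # Number of failures the promise tolerates in this window.
    allowed_failures = (1 - slo_target) * total
    return 1 - failed / allowed_failures


# Hypothetical month: 200,000 payment attempts, 400 failures, 99.5% promise.
remaining = error_budget_remaining(slo_target=0.995, total=200_000, failed=400)
print(f"Error budget remaining: {remaining:.0%}")
```

A figure like this gives leaders a simple question to ask: with this much budget left, is it safe to ship the next change, or should the team prioritise reliability work?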

Who is this NOT for?

This isn't a tool comparison guide, a vendor evaluation checklist, or a step-by-step configuration manual. If you're looking for "how to set up Prometheus" or "which APM tool to buy", this isn't the book. It's for leaders who need to build the capability, not just buy the tools.

Can I share this with my team?

Yes. Many CTOs buy copies for their direct reports, SRE leads, and product owners. The book works as a shared language for aligning technical and business stakeholders around what observability actually means.

About the author

Allan Mann, author of Metrics & Mayhem

Allan Mann

Allan Mann has spent more than 25 years leading IT operations and infrastructure teams in banking, government, and large software organisations. He's built monitoring platforms, led incident response teams, and helped CTOs make sense of the gap between system health and customer experience.

He writes and speaks on observability, resilience, and technical leadership. Allan runs Mastering Observability, a weekly newsletter read by senior tech leaders, and hosts the Metrics & Mayhem podcast, where he interviews CTOs and engineering leaders about building reliable systems in the real world.

He believes observability is a leadership capability, not a tooling project. And he's allergic to vendor pitches.

Ready to close the observability gap?

Stop buying visibility. Start building control.

Written for CTOs. Vendor neutral. Focused on outcomes.