LLMs Broke Reliability. Observability Is How Ops Keeps Trust

How to measure AI reliability when answers aren't deterministic.

By the fourth slide at SREcon EMEA 2025, half the room nodded; the other half reached for coffee. Microsoft's Brendan Burns had described at SREcon25 Americas how Azure vets new models: one model critiques another's output, then humans give a final thumbs up or down. It sounded less like server management and more like sociology.

That was the point.

Reliability is no longer deterministic. Two identical prompts can produce different answers that look confident and feel useful, yet still be wrong for the user. That is the world IT Operations inherits.

Here's the catch: visibility does not equal understanding. A green dashboard does not tell you if the answer made sense to the person waiting for help.

The New Reliability Stack: Outcome + Drift

The failure point has moved. In classic systems, it sat at the deployment stage. In AI systems, it sits at the inference stage. So our observability must shift from "is it up" to "did it behave".

That means making five things observable, not just the cluster and the code:

  1. Outcome. What did the user actually experience?

  2. Model version. Which brain responded?

  3. Prompt. What was asked?

  4. Retrieval path. What context was used?

  5. Drift. How do answers change over time?
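In practice, those five can travel together as one structured record per answer. Here's a minimal sketch in Python; the field names and the log-to-stdout transport are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, asdict
import json
import sys
import time

@dataclass
class InferenceRecord:
    """One structured record per answer, carrying the five observables."""
    outcome: str          # what the user experienced, e.g. "resolved", "escalated"
    model_version: str    # which model responded
    prompt_id: str        # what was asked (template id or hash, not raw PII)
    retrieval_docs: list  # which documents supplied the context
    answered_at: float    # timestamp, so drift can be measured over time

def emit(record: InferenceRecord) -> None:
    # Ship as one JSON line; any log pipeline can index these fields as labels.
    sys.stdout.write(json.dumps(asdict(record)) + "\n")

emit(InferenceRecord(
    outcome="resolved",
    model_version="assistant-2025-04-01",
    prompt_id="support-triage-v3",
    retrieval_docs=["kb/billing-faq#7", "kb/refund-policy#2"],
    answered_at=time.time(),
))
```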

Industry practice is nudging in the same direction. Jay Lees at Meta argues that the only reliable arbiter of "good" is a business metric, such as click-through rate for ads. If it rises, experience is improving. If it falls, something regressed. That is an outcome SLI.
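An outcome SLI is nothing exotic in code: the business metric, computed over a window, compared to a target. A tiny sketch, assuming a list of impression events with a clicked flag; the event shape and the target value are made up for illustration.

```python
# Outcome SLI as a rolling business metric: click-through rate over a window.
# Event shape and target are illustrative assumptions.
events = [
    {"impression_id": 1, "clicked": True},
    {"impression_id": 2, "clicked": False},
    {"impression_id": 3, "clicked": True},
]

ctr = sum(e["clicked"] for e in events) / len(events)
target = 0.02  # the agreed SLO target for this surface

print(f"CTR={ctr:.3f}, meets target: {ctr >= target}")
```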

Todd Underwood has pushed a complementary idea: for ML systems, the end-to-end quality of the model is the only SLO that matters to reliability. Everything else is a proxy. In practice, full provenance across data, prompts, and embeddings is heavy for most teams, so start where drift costs trust.

The shift is real. Surveys through 2024–2025 continue to rank monitoring and observability among the top challenges for ML in production, and early adopters who pair outcome SLIs with drift signals report time-to-fix improvements of 30–40% over alert-only baselines.

Now translate this into operations. Pair one Outcome SLI with a small set of Early Indicators. The outcome reveals how users feel. The indicators fire fast enough to correct drift.

Here's what that looks like in practice:

| Layer | What it measures | Example | Owner |
| --- | --- | --- | --- |
| Outcome SLI | Business truth | Cases resolved in ≤ 2 replies | Product |
| Early Indicators | Technical drift | Hedged-response rate, retrieval latency p95, guardrail triggers | SRE / Platform |
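One way to keep the pairing honest is to declare it as data that dashboards and alerts read from, so thresholds and owners are never a matter of memory. A minimal sketch; the surface name, thresholds, and owner labels are illustrative assumptions.

```python
# Declare the paired outcome SLI and early indicators as configuration.
# Thresholds and owners are illustrative assumptions, not recommendations.
RELIABILITY_SPEC = {
    "surface": "support-assistant",
    "outcome_sli": {
        "name": "cases_resolved_in_2_replies",
        "target": 0.85,  # fraction of cases resolved in <= 2 replies
        "owner": "product",
    },
    "early_indicators": [
        {"name": "hedged_response_rate", "alert_above": 0.15, "owner": "sre"},
        {"name": "retrieval_latency_p95_ms", "alert_above": 800, "owner": "sre"},
        {"name": "guardrail_trigger_rate", "alert_above": 0.05, "owner": "sre"},
    ],
}

def breaches(spec: dict, observed: dict) -> list[str]:
    """Return the early indicators currently over their alert threshold."""
    return [
        ind["name"]
        for ind in spec["early_indicators"]
        if observed.get(ind["name"], 0) > ind["alert_above"]
    ]

print(breaches(RELIABILITY_SPEC, {"hedged_response_rate": 0.22,
                                  "retrieval_latency_p95_ms": 640}))
```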

Field note. In a six-week pilot (April to May 2025, n = 5,900 tickets), pairing the outcome "cases resolved in ≤ 2 replies" with early indicators (hedge rate and retrieval p95) cut escalations by 18%. Ownership split: Product held the outcome; SRE owned the indicators and the runbook. Silence is not confidence. And yet, this pairing worked.

Where this breaks: when ownership blurs. When everyone owns the outcome, no one does. It becomes a blame metric.

Five Things to Instrument Now

1) Version the things that change answers. Model, prompt, retrieval index, and policy should appear in incident tickets and dashboards. Treat them as configuration, not as afterthoughts.
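Concretely, that can be as small as attaching the four versions to every answer-related log line and ticket. A sketch using Python's standard logging; the version strings are placeholders for whatever your release process pins.

```python
import logging

# The four things that change answers, treated as configuration labels.
# Values are placeholders for whatever your release process pins.
ANSWER_CONFIG = {
    "model_version": "gpt-support-2025-04-01",
    "prompt_version": "triage-prompt-v12",
    "retrieval_index_version": "kb-index-2025-03-28",
    "policy_version": "guardrail-policy-v7",
}

logging.basicConfig(format="%(asctime)s %(message)s", level=logging.INFO)

def log_answer(event: str) -> None:
    # Every answer-related log line carries the full version tuple,
    # so incident tickets can quote it verbatim.
    labels = " ".join(f"{k}={v}" for k, v in ANSWER_CONFIG.items())
    logging.info("%s %s", event, labels)

log_answer("answer_served case_id=12345")
```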

2) Trace retrieval like a payment path. Track cache hits, top-k hit rate, document freshness, and end-to-end latency. If retrieval fails, everything downstream is guesswork.
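Here is what tracing retrieval like a payment path can look like in code: time the hop, record cache and top-k results, and note how stale the documents are. The search function and the freshness field are stand-ins for whatever your stack actually exposes.

```python
import time
from datetime import datetime, timezone

def traced_retrieval(query: str, cache: dict, search_fn, k: int = 5) -> dict:
    """Wrap a retrieval call and return the context plus trace metrics."""
    start = time.perf_counter()
    cache_hit = query in cache
    docs = cache[query] if cache_hit else search_fn(query, k)

    now = datetime.now(timezone.utc)
    metrics = {
        "cache_hit": cache_hit,
        "top_k_returned": len(docs),  # did we actually fill the top-k?
        "oldest_doc_age_days": max(
            (now - d["indexed_at"]).days for d in docs
        ) if docs else None,          # document freshness
        "retrieval_latency_ms": (time.perf_counter() - start) * 1000,
    }
    return {"docs": docs, "metrics": metrics}

# Stand-in search function so the sketch runs end to end.
def fake_search(query, k):
    return [{"id": "kb/billing-faq#7",
             "indexed_at": datetime(2025, 3, 1, tzinfo=timezone.utc)}]

print(traced_retrieval("refund policy", {}, fake_search)["metrics"])
```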

3) Put guardrails under watch. Monitor blocked output rate, safe-completion fallbacks, and refusal spikes. These often signal broken context, overly strict rules, or a stale index.
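Guardrail health reduces to a few counters and their rates over a window. A minimal in-process sketch; the disposition names are assumptions, and in production these counts would flow to your metrics system rather than live in memory.

```python
from collections import Counter

# One disposition per answer; rates are dispositions divided by total answers.
# Disposition names are illustrative assumptions.
dispositions = Counter()

def record_answer(disposition: str) -> None:
    """disposition: 'served', 'blocked', 'safe_completion_fallback', or 'refusal'."""
    dispositions[disposition] += 1

def rates() -> dict:
    total = sum(dispositions.values()) or 1
    return {name: count / total for name, count in dispositions.items()}

for d in ["served", "served", "refusal", "blocked", "served"]:
    record_answer(d)

# A refusal spike here often means broken context or a stale index,
# not a sudden wave of unsafe questions.
print(rates())
```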

4) Review drift on a cadence. Run a short weekly review across Ops, Product, and Data. Sample 20 cases per surface, blind-score against the outcome, compare to last week, then ship the smallest fix that moves the outcome.
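Most of that review can be automated up to the human scoring step: pull a fixed sample, blind-score it against the outcome, compare to last week. A sketch with a toy dataset standing in for real blind scores.

```python
import random
import statistics

def weekly_drift_review(cases: list[dict], last_week_score: float,
                        sample_size: int = 20) -> dict:
    """Sample cases, score them against the outcome, compare to last week."""
    sample = random.sample(cases, min(sample_size, len(cases)))
    # In the real review, humans score each case 0/1 against the outcome
    # without seeing model or prompt version (that's the 'blind' part).
    scores = [case["met_outcome"] for case in sample]
    this_week = statistics.mean(scores)
    return {
        "this_week": this_week,
        "last_week": last_week_score,
        "delta": this_week - last_week_score,
    }

# Toy data: 100 cases, 80% of which met the outcome.
cases = [{"id": i, "met_outcome": 1 if i % 5 else 0} for i in range(100)]
print(weekly_drift_review(cases, last_week_score=0.84))
```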

5) Write runbooks as gates, not lists. Gate one: offline evaluation meets the threshold. Gate two: shadow traffic holds steady for seven days. Gate three: rollback plan validated, including prompt and index pins.
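Written as code, each gate is a predicate that has to pass before promotion. A sketch; the 0.80 thresholds and the seven-day window are illustrative assumptions, not recommended values.

```python
def gate_offline_eval(eval_score: float, threshold: float = 0.80) -> bool:
    """Gate one: offline evaluation meets the agreed threshold."""
    return eval_score >= threshold

def gate_shadow_traffic(daily_outcome: list[float], floor: float = 0.80) -> bool:
    """Gate two: shadow traffic holds steady for seven consecutive days."""
    return len(daily_outcome) >= 7 and all(day >= floor for day in daily_outcome[-7:])

def gate_rollback_ready(pins: dict) -> bool:
    """Gate three: rollback plan validated, including prompt and index pins."""
    return all(pins.get(k) for k in ("previous_prompt_version",
                                     "previous_index_version"))

def may_promote(eval_score, daily_outcome, pins) -> bool:
    return (gate_offline_eval(eval_score)
            and gate_shadow_traffic(daily_outcome)
            and gate_rollback_ready(pins))

print(may_promote(0.86,
                  [0.82, 0.83, 0.81, 0.84, 0.85, 0.83, 0.82],
                  {"previous_prompt_version": "triage-prompt-v11",
                   "previous_index_version": "kb-index-2025-03-21"}))
```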

We once tried to automate common sense. The pipeline refused.

Your 90-Day Playbook

Choose one AI-backed surface. Support assistant, search, or claims triage. Do not start everywhere.

Name one outcome SLI and two early indicators. For a support assistant: outcome = resolved in ≤ 2 replies. Indicators = hedge rate and retrieval latency p95.
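From raw ticket records, those three numbers might be computed like this; the field names, the hedge-phrase heuristic, and the rough p95 are assumptions for illustration, not a spec.

```python
def support_assistant_slis(tickets: list[dict]) -> dict:
    """Outcome SLI plus two early indicators for a support assistant."""
    resolved_fast = [t["replies_to_resolve"] <= 2 for t in tickets]
    # Crude hedge detector; a real one would use your own hedge phrase list.
    hedged = ["i'm not sure" in t["answer"].lower()
              or "i cannot" in t["answer"].lower() for t in tickets]
    latencies = sorted(t["retrieval_ms"] for t in tickets)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]  # rough percentile

    return {
        "resolved_in_2_replies": sum(resolved_fast) / len(tickets),  # outcome
        "hedge_rate": sum(hedged) / len(tickets),                    # indicator 1
        "retrieval_latency_p95_ms": p95,                             # indicator 2
    }

tickets = [
    {"replies_to_resolve": 1, "answer": "Refund issued.", "retrieval_ms": 220},
    {"replies_to_resolve": 3, "answer": "I'm not sure, escalating.", "retrieval_ms": 910},
    {"replies_to_resolve": 2, "answer": "Reset link sent.", "retrieval_ms": 340},
]
print(support_assistant_slis(tickets))
```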

Assign clear owners. Outcome owner: Product. Indicator owner: SRE or Platform. Write the names on the service page.

Instrument versioning. Expose model, prompt, retrieval index, and policy versions as labels in logs and tickets.

Run weekly drift review. Ten minutes. One page of charts. One fix.

Tie changes to results. If the outcome moves, keep the change. If not, roll back fast.
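That decision is worth writing down once so it is not relitigated per change. A tiny sketch; the noise margin is an illustrative assumption.

```python
def keep_change(outcome_before: float, outcome_after: float,
                noise_margin: float = 0.01) -> bool:
    """Keep the change only if the outcome SLI moved up by more than noise."""
    return (outcome_after - outcome_before) > noise_margin

# Example: outcome moved from 82% to 86% resolved-in-2-replies -> keep.
print("keep" if keep_change(0.82, 0.86) else "rollback")
```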

Three Ways This Fails (and How to Avoid Them)

Fuzzy ownership. Without a named owner for the outcome SLI, debates replace decisions. Fix ownership first.

Proxy worship. Latency falls, but answers worsen. Pair outcomes with early indicators to keep both truth and speed.

Provenance sprawl. Version everything, and you drown. Version what changes answers. Log the rest for a short, practical window.

Reliability in the AI era will not come from stricter SLAs. It will come from measuring the right layer and acting faster than drift. Start with one outcome and two signals. Make ownership explicit. Review weekly.

Progress has never been about magic. It has always been about measurement.


Field note: the 18% escalation result is an internal pilot figure, reported here to illustrate the framework rather than as a general benchmark.
