The pager didn't care about your evening. It didn't care about your kid's school play or the first decent sleep you'd had in a week.

I've carried that weight.

Two years ago, the 3 AM page meant a scramble: Slack threads lighting up, eight people on a bridge call with no clear owner, someone SSH-ing into a box while half-asleep. I've been the one on mute, checking runbooks I wrote six months ago and hoping they still applied. You don't forget those calls. The silence in the first thirty seconds isn't confidence. It's people working out who owns what.

February 2026, and the anomaly that would have woken your checkout team last Tuesday resolved itself before anyone's phone buzzed. A predictive model flagged the memory saturation pattern, correlated it with a deployment from ninety minutes earlier, and rolled back the canary. Total human involvement: zero. The dashboard barely flickered.

This is the story the industry wants to tell right now. Gartner predicts that over 60% of large enterprises will adopt self-healing infrastructure powered by AIOps by 2026. PagerDuty's AI agent suite (https://www.pagerduty.com/?page_id=96831), launched in October 2025, reports early adopters resolving incidents up to 50% faster. Jeffrey Hausman, PagerDuty's Chief Product Development Officer, called it "a turning point for digital operations." The direction is real, and the tooling is moving faster than the organisations adopting it.

The uncomfortable truth: the midnight pager isn't dying because we finally built better tools. It's dying because the tools don't need us for the easy stuff anymore. And the hard stuff? That's where it gets difficult.

What Auto-Remediation Actually Looks Like (And What It Doesn't)

If you've been running auto-remediation for pod restarts and disk cleanup for years, fair enough. That's not new, and you don't need AI for it. What's changed is the scope.

A Unite.AI analysis from this month (https://www.unite.ai/agentic-sre-how-self-healing-infrastructure-is-redefining-enterprise-aiops-in-2026/) describes the emerging "Agentic SRE" model: multi-agent architectures where one agent detects anomalies, another evaluates probable root causes, a third executes remediation, and a fourth verifies recovery against defined reliability objectives. Each step follows policies and safety constraints. Engineers observe and review outcomes rather than execute commands.

If you've spent any time fighting alert fatigue across 2,000 weekly alerts (only 3% of which actually need action, per Grafana Labs), you can see why this matters.

A detect, diagnose, remediate, verify loop like that is qualitatively different from a cron job that clears /tmp.
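To make the four-agent shape concrete, here is a minimal Python sketch of that loop. It is an illustration under stated assumptions, not any vendor's implementation: the `Incident` type, the function names, the toy "blame the recent deploy" heuristic, and the `allowed_actions` policy set are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Incident:
    # Hypothetical record passed between the four agents.
    service: str
    metric: str
    value: float
    threshold: float
    probable_cause: Optional[str] = None
    action_taken: Optional[str] = None
    recovered: bool = False
    log: list[str] = field(default_factory=list)


def detect(incident: Incident) -> bool:
    """Agent 1: flag an anomaly when the metric exceeds its threshold."""
    anomalous = incident.value > incident.threshold
    incident.log.append(f"detect: anomalous={anomalous}")
    return anomalous


def diagnose(incident: Incident, recent_deploys: dict[str, str]) -> None:
    """Agent 2: toy root-cause guess, blaming a recent deploy if one exists."""
    deploy = recent_deploys.get(incident.service)
    incident.probable_cause = f"deploy:{deploy}" if deploy else "unknown"
    incident.log.append(f"diagnose: {incident.probable_cause}")


def remediate(incident: Incident, allowed_actions: set[str]) -> None:
    """Agent 3: choose an action, escalating if policy does not allow it."""
    cause = incident.probable_cause or "unknown"
    action = "rollback" if cause.startswith("deploy:") else "escalate"
    if action not in allowed_actions:
        action = "escalate"
    incident.action_taken = action
    incident.log.append(f"remediate: {action}")


def verify(incident: Incident, read_metric: Callable[[], float]) -> None:
    """Agent 4: re-read the metric and check it is back under the threshold."""
    incident.recovered = read_metric() <= incident.threshold
    incident.log.append(f"verify: recovered={incident.recovered}")


def run_pipeline(
    incident: Incident,
    recent_deploys: dict[str, str],
    allowed_actions: set[str],
    read_metric: Callable[[], float],
) -> Incident:
    if not detect(incident):
        return incident
    diagnose(incident, recent_deploys)
    remediate(incident, allowed_actions)
    if incident.action_taken != "escalate":
        verify(incident, read_metric)
    return incident


if __name__ == "__main__":
    inc = Incident("checkout", "memory_pct", value=93.0, threshold=85.0)
    result = run_pipeline(inc, {"checkout": "v2-canary"}, {"rollback"}, lambda: 61.0)
    print("\n".join(result.log))
```

The part worth noticing is `allowed_actions`: the remediation step can only do what policy permits, and everything else falls through to a human. The log is what the engineer "on the loop" would review afterwards.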

The shift from "human-in-the-loop" to "human-on-the-loop" sounds subtle, but the operational implications are real. Engineers stop running the playbook and start designing the policies that govern what the playbook is allowed to do. PagerDuty's SRE Agent learns from related incidents, surfaces context, and executes validated runbooks automatically. Their Shift Agent handles scheduling conflicts that used to eat half a manager's Monday. Microsoft's Azure SRE Agent, announced at Build 2025 (https://www.pagerduty.com/blog/ai/pagerduty-azure-ai-sre-agent/), was purpose-built to integrate with PagerDuty's Operations Cloud for automated triage and mitigation.

The maturity path is progressive automation: suggestions first, then approvals, then autonomous action, staged by blast radius. Not a binary switch. One team I spoke with moved from manual approvals for every rollback to policy-gated autonomous rollback for their tier-3 services, keeping human sign-off for anything touching payments. The decision to widen the blast radius belongs to the team that owns the service. That part hasn't changed.
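A staged policy like that can be small enough to read in one sitting. The sketch below is one possible way to encode it in Python, assuming a hypothetical `ServicePolicy` record and three modes (suggest, approve, autonomous). The rule that anything touching payments is capped at human approval mirrors the team described above, but the names and structure are mine, not theirs.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Mode(IntEnum):
    SUGGEST = 1      # automation only proposes an action
    APPROVE = 2      # automation acts after a human approves
    AUTONOMOUS = 3   # automation acts on its own, within policy


@dataclass(frozen=True)
class ServicePolicy:
    # Hypothetical policy record, set by the team that owns the service.
    service: str
    tier: int
    touches_payments: bool
    owner: str
    max_mode: Mode


def effective_mode(policy: ServicePolicy) -> Mode:
    """Cap anything touching payments at APPROVE, whatever max_mode says."""
    if policy.touches_payments:
        return min(policy.max_mode, Mode.APPROVE)
    return policy.max_mode


def request_rollback(policy: ServicePolicy, approved_by: Optional[str] = None) -> str:
    """Return what the automation is allowed to do for this rollback request."""
    mode = effective_mode(policy)
    if mode == Mode.SUGGEST:
        return f"suggest rollback to {policy.owner}"
    if mode == Mode.APPROVE and approved_by is None:
        return f"await approval from {policy.owner}"
    return "execute rollback"


if __name__ == "__main__":
    resizer = ServicePolicy("image-resizer", 3, False, "media-team", Mode.AUTONOMOUS)
    payments = ServicePolicy("payments-api", 1, True, "payments-team", Mode.AUTONOMOUS)
    print(request_rollback(resizer))
    print(request_rollback(payments))
    print(request_rollback(payments, approved_by="on-call lead"))
```

Widening the blast radius then becomes a reviewed change to `max_mode` by the owning team, rather than a side effect of a tool upgrade.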

The Part Nobody Wants to Hear

Here's where I need to be honest about the counterargument, because it's strong.

Heath Newburn, Distinguished Field Engineer at PagerDuty, put it plainly in an APMdigest interview (https://www.apmdigest.com/discovering-aiops-9): the vast majority of the time, when something goes badly wrong, you need smart people with experience to keep things running. He's right. And he works for PagerDuty, which tells you something about where even the vendors think the line is. Shamus McGillicuddy, VP of Research at EMA, was blunter: auto-remediation today is mostly for low-impact, repeatable things, and you don't even necessarily need AI for that. Hard to argue.

The data support them. The 2025 State of DevOps Report (https://devtechinsights.com/toil-in-sre-why-ai-hasnt-solved-burnout-2025/) found that 57% of SREs still spend more than half their week on toil, despite AI tool adoption. The Catchpoint SRE Report 2025 (https://www.apmdigest.com/maximizing-resilience-insights-2025-sre-report) showed median toil rising from 25% to 30% year-on-year. Not falling. Rising. We added the damn tools and the toil went up.

I've been part of that pattern. I've recommended tooling that promised to reduce noise and watched the team drown in a different kind of noise instead. That's not a vendor problem. That's an organisational one, and I've contributed to it.

And here's the finding that should make every auto-remediation advocate pause: the DORA 2024 report (https://getdx.com/blog/2024-dora-report/) found that AI tooling correlates with worsened software delivery performance for the second consecutive year. Not because the code is bad. Because batch sizes grow when AI makes it easier to write more code, and bigger changesets mean more risk. Sound familiar? Better tools, worse outcomes. We've seen this film before.

Auto-remediation handles the restarts. It doesn't handle the judgment calls.

The toil didn't disappear. It was displaced. Engineers aren't restarting pods at 3 AM anymore. They're waking at 3 AM to verify that the AI restarted the pod correctly. Different alert, same exhaustion. One DevOps practitioner captured it well on Reddit in late 2024: "AI doesn't eliminate the pager. It just makes the pager sound smarter when it goes off." That's not cynicism. That's operational reality.

Where this breaks: if your organisation hasn't invested in the governance layer (policies, approval workflows, blast-radius controls), autonomous remediation becomes autonomous risk. Carlos Casanova, Principal Analyst at Forrester, confirmed in APMdigest (https://www.apmdigest.com/discovering-aiops-9) that enterprises doing this successfully have only done it for specific, tested use cases. Not the whole estate. Not yet. Casanova knows the terrain, and the nuance matters: auto-remediation works where you've earned the right to trust it.

What On-Call Looks Like When the Easy Stuff Is Gone

So what happens to on-call when predictive remediation handles 80% of incidents autonomously?

The remaining 20% gets harder. More ambiguous. Higher stakes. Novel failures, cascading dependencies, and multi-service incidents where business context matters more than telemetry correlation. The kind of problem where you need someone who understands what the system is supposed to do for customers, not just what metrics it's emitting.

Here's what I've seen, and it keeps me honest: MTTR has worsened. The share of organisations taking over an hour to resolve incidents rose from 47% in 2020 to 82% in 2024, despite widespread observability tool adoption. Better tools didn't fix the problem because the problem was never purely technical. It was organisational. Ownership, communication, decision authority. The stuff that doesn't fit in a dashboard. Auto-remediation risks repeating the same pattern at a different layer. If you automate the response but nobody owns the policy, you've just moved the failure mode upstream.

The new on-call demands skills that most teams haven't built yet. Policy design. Remediation governance. What the Cloud Native Now analysis (https://cloudnativenow.com/contributed-content/how-sres-are-using-ai-to-transform-incident-response-in-the-real-world/) calls "AI reliability engineering": ensuring the quality and transparency of AI-driven incident response systems. And the Rootly SRE Report 2025 (https://rootly.com/blog/sre-report-2025---key-takeaway) found that 67% of SREs don't have enough time for technical training. So we're asking engineers to govern systems they haven't been trained to govern. It's messy, but progress usually is.

If you lead one of these teams, here's what I'd do next sprint: pick one service where auto-remediation is already running (or about to). Name the owner of the remediation policy. Name the owner of the escalation path when the automation gets it wrong. Write both names down. If you can't, that's your gap. Fix that before you widen the blast radius.
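If your service inventory lives somewhere machine-readable, that check can be a few lines of Python. This is a rough sketch that assumes a hypothetical inventory shape (a list of dicts with `name`, `auto_remediation`, `policy_owner` and `escalation_owner` keys). Adapt the field names to whatever your catalogue actually uses.

```python
def ownership_gaps(services: list[dict]) -> list[str]:
    """List every auto-remediated service that lacks a named owner."""
    gaps = []
    for svc in services:
        if not svc.get("auto_remediation"):
            continue  # nothing automated here, so nothing to govern yet
        for role in ("policy_owner", "escalation_owner"):
            # Treat a missing key, None, or a blank string as "no owner".
            if not (svc.get(role) or "").strip():
                gaps.append(f"{svc['name']}: missing {role}")
    return gaps


if __name__ == "__main__":
    # Hypothetical inventory for illustration only.
    inventory = [
        {"name": "image-resizer", "auto_remediation": True,
         "policy_owner": "media-team", "escalation_owner": ""},
        {"name": "payments-api", "auto_remediation": False},
    ]
    for gap in ownership_gaps(inventory):
        print(gap)
```

An empty result means both names are written down for every automated service. Anything else is the gap to close first.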

The midnight pager is dying. Good. It should.

What replaces it isn't silence.

It's the weight of designing the policies, trusting the automation, and being the human who decides what happens when the machine gets it wrong. That's harder than restarting a pod at 3 AM.

But it's more honest about what reliability actually requires.

Further Reading

Gartner Predicts 2026: AI Agents Will Transform IT Infrastructure and Operations (https://www.itential.com/resource/analyst-report/gartner-predicts-2026-ai-agents-will-reshape-infrastructure-operations/) (Dec 2025)

Agentic SRE: How Self-Healing Infrastructure Is Redefining Enterprise AIOps (https://www.unite.ai/agentic-sre-how-self-healing-infrastructure-is-redefining-enterprise-aiops-in-2026/) (Feb 2026)

AIOps for SRE: Using AI to Reduce On-Call Fatigue (https://devops.com/aiops-for-sre-using-ai-to-reduce-on-call-fatigue-and-improve-reliability/) (Nov 2025)

Catchpoint SRE Report 2025: Maximizing Resilience (https://www.apmdigest.com/maximizing-resilience-insights-2025-sre-report)

PagerDuty H2 2025: AI Agent Suite Launch (https://www.pagerduty.com/?page_id=96831) (Oct 2025)
