How to safely introduce AI and agentic capabilities into your on-call processes.
That third page at 3am is exhausting, but it is more than a nuisance; it’s a signal that your incident response orchestration is broken.
While most teams blame noisy alerts, the real bottleneck is the manual effort required for humans to stitch together disparate data points to react effectively.
Google SRE teams aim to limit on-call bandwidth to 25% per engineer to prevent burnout and ensure a sustainable workload.
The percentage of time spent on active on-call duties is part of a strict 50/25/25 balance: at least 50% of an SRE's time must be dedicated to engineering projects that scale the service, while the rest is split between other operational tasks and active paging response.
These constraints reveal that companies cannot scale by merely handling paging manually. If the majority of time is not invested in engineering automation, the operational load becomes “toil” that grows linearly with the service, eventually becoming intractable.
Orchestration, not just noise
Alert fatigue is a measurable burden. Google data shows the average incident, including triage, remediation, and postmortem, requires six hours of focused effort.
This means any shift exceeding two incidents per 12 hours forces an engineer into a state of operational overload, where stress impairs rational cognition and leads to suboptimal, habitual decision-making.
Good orchestration moves beyond simple noise filtering to ensure a 1:1 alert-to-incident ratio. Without it, a single database failure can cascade into 50 uncoordinated pages across microservices.
Coordination means grouping these related signals to prevent “alert fan-out,” while orchestration involves utilizing an automated protocol to handle mundane actions like status updates and role handoffs.
In an orchestrated system – where production services are integrated with automated incident-management tools – you wouldn’t receive ten disparate alerts. Instead, the system senses the issue, determines the root cause, and notifies the correct owner with “git blame” context just once.
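As a rough illustration of that grouping step, here is a minimal sketch in Python. It assumes each alert carries a correlation key (the hypothetical `correlation_id` field below) that ties it back to a shared root cause; real systems infer this from topology or traces.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Alert:
    service: str
    correlation_id: str   # hypothetical key: the shared root cause this alert traces back to
    message: str

def group_alerts(alerts: list[Alert]) -> dict[str, list[Alert]]:
    """Collapse related alerts so one root cause becomes one incident, not dozens of pages."""
    incidents: dict[str, list[Alert]] = defaultdict(list)
    for alert in alerts:
        incidents[alert.correlation_id].append(alert)
    return incidents

# 50 microservice alerts that all point back at the same failing database
alerts = [Alert(f"svc-{i}", "db-primary-outage", "upstream timeout") for i in range(50)]
grouped = group_alerts(alerts)
print(len(grouped))                        # 1 incident
print(len(grouped["db-primary-outage"]))   # 50 correlated signals behind it
```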
According to Google SRE best practices, maintaining this balance is vital for team well-being; if you are buried in alerts overnight, you cannot function as the voice of reason the next day.
Reframing incidents as a pipeline
What if we treated incident response like an assembly line? Divide the response into discrete steps and assign an “agent” to each one – whether that’s a person or a robot.
In typical manual processes, a single engineer carries out these steps under pressure. With an agent-based approach, we divide this workload between:
- Detector: Rather than fixed threshold values, a smart detector employs statistical analysis to determine whether an anomaly is actually significant (i.e., notable enough to flag). Sophisticated detectors used in companies such as Dynatrace and Datadog learn seasonality to avoid alerting for everyday latency peaks.
- Triager: As soon as an alert fires, triage begins. The initial digging an engineer would otherwise spend the first 5-10 minutes on gets done automatically.
- Router: Based on that information, this agent determines whom or what to alert. Older systems frequently relied on static alert mappings to page an individual in a rotation. Smarter routers can now review the incident type and severity and page accordingly.
- Remediator: When an incident is straightforward, like a full disk or memory leak, why wait for a human? A remediator can execute predefined runbook actions to fix issues before engineers finish their first cup of coffee. Facebook’s auto-remediation system has been safely rebooting servers and fixing hardware issues for years.
- Reporter: After the fire is out, we need documentation for accountability and learning. The reporter documents the incident timeline: the alerts, what the triager found, what the remediator did, who was paged. When every action gets logged in an audit trail, you can reconstruct what happened.
In an optimal system, these agents function like an assembly line. That does not take humans out of the equation; it increases their bandwidth. Engineers own the complex, strategic problem-solving, while agents handle the grunt work.
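One way to picture the assembly line is as a chain of narrowly scoped roles passing a shared incident record along and logging every step. The sketch below is only illustrative – the class names, the hard-coded `disk_full` category, and the runbook entry are hypothetical stand-ins, and only three of the five roles are shown.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Incident:
    alert: str
    context: dict = field(default_factory=dict)    # filled in by the triager
    owner: Optional[str] = None                    # filled in by the router
    actions: list = field(default_factory=list)
    timeline: list = field(default_factory=list)   # the reporter's raw material

class Triager:
    def enrich(self, incident: Incident) -> Incident:
        # The "first 5-10 minutes" of digging: classify and attach recent-change context.
        incident.context["category"] = "disk_full"            # stand-in for real classification
        incident.context["recent_deploy"] = "checkout@abc123"
        incident.timeline.append("triager: classified as disk_full, attached deploy context")
        return incident

class Router:
    def route(self, incident: Incident) -> Incident:
        # A smarter router picks an owner from the incident type, not a static mapping.
        category = incident.context.get("category")
        incident.owner = "storage-oncall" if category == "disk_full" else "generalist-oncall"
        incident.timeline.append(f"router: paged {incident.owner}")
        return incident

class Remediator:
    RUNBOOK = {"disk_full": "clean_tmp_dirs"}   # predefined, reviewed actions only

    def attempt(self, incident: Incident) -> Incident:
        action = self.RUNBOOK.get(incident.context.get("category", ""))
        if action:
            incident.actions.append(action)
            incident.timeline.append(f"remediator: ran runbook action {action}")
        return incident

# Each agent does one narrow job and records what it did.
incident = Incident(alert="disk usage > 95% on svc-api")
for step in (Triager().enrich, Router().route, Remediator().attempt):
    incident = step(incident)
print(incident.timeline)
```

With this shape, the reporter's job reduces to formatting a timeline the other agents have already written.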
Keeping the robots in check
Handing over parts of incident response to AI agents is powerful but also potentially scary. Nobody wants a runaway script turning a minor hiccup into a major outage. Governance mechanisms need to be a first-class part of any production AI orchestration.
- Audit trails for everything: Modern AI-driven systems record every agent move in real-time, including timestamps, input/output values, and human approvals. Cryptographically signed trails provide an indisputable record that insulates the team during regulatory audits or internal “what went wrong” exercises. By replacing subjective memory with verifiable data, these logs facilitate a blameless postmortem, ensuring the focus remains on systematic analysis rather than individual liability.
- Human-in-the-loop triggers: Not every action needs to happen automatically. In situations where an AI system isn’t sure, or where the risk is too high, we create checkpoints that require human approval. A system could suggest a database failover but wait for approval if it involves a master database used in production. Policy rules could include thresholds for customer impact.
- Typed contracts and strict interfaces: One way to stop AI agents before they go rogue is to wrap them in tight interfaces, or strict typed contracts. Rather than accepting free-form control prompts, you limit what the agents can do: a remediator has to choose from a list of safe functions with a specified set of parameters. Strict typed contracts make generative AI systems more deterministic (see the sketch after this list).
- Policy sandboxes: Before an agent is deployed, it should first be exercised in a sandbox environment against predetermined rules. Policy engines apply negative rules such as “the AI must never delete databases,” to name one. If an agent’s plan violates a rule, it is halted or modified.
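Here is a minimal sketch of such a contract, assuming a hypothetical remediation agent whose output is parsed into an allow-listed action with validated parameters instead of being executed as free text.

```python
from dataclasses import dataclass
from enum import Enum

class SafeAction(Enum):
    """The only operations a remediator is ever allowed to request."""
    RESTART_SERVICE = "restart_service"
    CLEAN_TMP_DIRS = "clean_tmp_dirs"
    SCALE_UP = "scale_up"

@dataclass(frozen=True)
class RemediationRequest:
    action: SafeAction
    target: str
    max_instances: int = 0

def parse_agent_output(raw: dict) -> RemediationRequest:
    """Reject anything outside the contract before it can reach production."""
    action = SafeAction(raw["action"])   # raises ValueError for unknown actions
    if action is SafeAction.SCALE_UP and not (0 < raw.get("max_instances", 0) <= 10):
        raise ValueError("scale_up must request between 1 and 10 instances")
    return RemediationRequest(action=action,
                              target=raw["target"],
                              max_instances=raw.get("max_instances", 0))

parse_agent_output({"action": "restart_service", "target": "svc-api"})    # ok
# parse_agent_output({"action": "drop_database", "target": "users"})      # ValueError
```

Anything the model proposes outside the enum, or outside the parameter bounds, fails validation before it touches a production system.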
Start small, learn fast
How do you introduce this into a live, mission-critical environment? Begin by deploying your agents in an advisory role: they receive the same input your engineers receive when there’s a problem to fix, but they do not apply the fixes.
During this shadow phase, the goal is to build trust in the agents’ accuracy. Once their recommended fixes prove reliable, you move to a phase that blends human decision-making with AI.
Don’t flip the switch for every incident type at once. Instead, pick the painful, repetitive ones. Maybe it’s that 3am disk-full alert that just needs a cleanup. Automate those, watch the results, then gradually expand.
From day one, design the handoff from AI to humans. If the AI tries a fix and the problem persists, that is the moment to alert a person. The automation needs to know when to escalate.
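A minimal sketch of that phased rollout, assuming hypothetical per-category modes and a `page_human` hook; none of these names come from a specific tool.

```python
from enum import Enum
from typing import Callable

class Mode(Enum):
    SHADOW = "shadow"      # recommend only; humans still apply the fix
    AUTONOMOUS = "auto"    # apply the fix, but escalate immediately if it fails

# Per-category rollout: automate the painful, repetitive incidents first.
ROLLOUT = {"disk_full": Mode.AUTONOMOUS, "db_failover": Mode.SHADOW}

def handle(category: str,
           apply_fix: Callable[[], None],
           verify_fix: Callable[[], bool],
           page_human: Callable[[str], None]) -> None:
    mode = ROLLOUT.get(category, Mode.SHADOW)   # unknown categories stay advisory
    if mode is Mode.SHADOW:
        page_human(f"[advisory] suggested fix available for {category}")
        return
    apply_fix()
    if not verify_fix():
        # Designed-in handoff: if the automated fix does not take, a human gets paged.
        page_human(f"automated fix for {category} did not resolve the issue")

handle("disk_full",
       apply_fix=lambda: print("cleaning /tmp on svc-api"),
       verify_fix=lambda: True,
       page_human=lambda msg: print("PAGE:", msg))
```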
What not to do
Be wary when using a large language model (LLM) to improvise actions in a production environment. There are documented instances where failing to control LLM output resulted in hazardous behavior. Encapsulate LLM agents in strict output formats or decision scopes.
Make sure to supply context to your detectors involving aspects like the time of day or prior deployments. Without that, you’ll add to alert fatigue rather than alleviating it.
Not all services are equal. Rebooting a server automatically could be perfectly acceptable for a stateless service; rebooting a database with dependencies could cause data problems.
Engineers should always have an easy way to pause or turn off the automation if it misbehaves. And never forget that automation done well complements good engineering; it does not replace it. Highly reliable signals and processes are the right goal to pursue first.
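Those cautions can be encoded as guardrails rather than left as tribal knowledge. This sketch assumes a hypothetical global pause flag and a per-service `stateless` attribute:

```python
AUTOMATION_PAUSED = False   # a single, well-known switch that engineers control

SERVICES = {
    "web-frontend": {"stateless": True},
    "orders-db":    {"stateless": False},
}

def may_auto_reboot(service: str) -> bool:
    """Only reboot automatically when automation is enabled and the service is stateless."""
    if AUTOMATION_PAUSED:
        return False
    return SERVICES.get(service, {}).get("stateless", False)

print(may_auto_reboot("web-frontend"))  # True
print(may_auto_reboot("orders-db"))     # False: stateful, leave it to a human
```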
The payoff: Better reliability, happier engineers
A well-orchestrated on-call system powered by AI agents can transform the daily life of your engineers and the reliability of your services.
As detectors get smarter and agents learn to resolve multiple issues autonomously, there are fewer pages to alert humans. Many trivial problems are either never paged to a human or are solved before anyone notices them. As a result, when that wake-up page does come in, there’s probably something real there.
Mean time to acknowledge (MTTA) approaches zero because acknowledgment happens instantly. Diagnosis algorithms identify root causes earlier than humans can, and automated fixes are applied within seconds. Even when a human needs to get involved, they are working with a pre-triaged problem that’s already well-defined.
Skilled engineers are freed from mundane effort to focus on strategic engineering. With fewer fires and fewer interruptions, being on-call shifts from firefighter to something closer to a supervisor.

Under the hood: Why math matters
Finally, let’s take a peek under the hood.
Smart agents rely on battle-tested computational mathematics to distinguish signal from noise. This foundation provides the mechanical advantage needed to scale without increasing human “toil”:
- Bayesian Change Point Detection: Instead of rigid thresholds, this identifies abrupt shifts in data patterns. It helps your detector catch real issues faster while ignoring normal variance.
- Extreme Value Theory: Rather than binary alerts, this calculates an anomaly score based on the distribution of rare spikes. It distinguishes between a daily blip and a critical, rare event.
- Queueing Theory: By modeling response capacity like a call center, this math confirms that only two incidents per shift are sustainable for a human. It guides when to autoscale AI assistance to prevent human bottlenecks.
- Seasonality Decomposition: Factors in time-based cycles like traffic drops at 3:00 AM to ensure the system doesn’t trigger false alarms for “normal” periodic behavior.
By grounding automation in these frameworks, we transform on-call from a reactive “firefighting” exercise into a predictable, mathematically-optimized pipeline.
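As a concrete (and deliberately simplified) example of the seasonality idea, the sketch below scores a value against the history for the same hour of day using a robust z-score per hour bucket, rather than a fixed threshold. Production detectors would use the richer methods listed above.

```python
import numpy as np

def seasonal_anomaly_score(history: np.ndarray, hour: int, value: float) -> float:
    """Score `value` against past observations for the same hour of day.

    `history` is shaped (days, 24): one row per day, one column per hour.
    Returns a robust z-score; roughly >4 is worth waking someone up for.
    """
    same_hour = history[:, hour]
    median = np.median(same_hour)
    mad = np.median(np.abs(same_hour - median)) or 1e-9   # avoid divide-by-zero
    return abs(value - median) / (1.4826 * mad)

rng = np.random.default_rng(0)
# 30 days of hourly latency with a quiet overnight dip baked in
history = 100 + 20 * np.sin(np.linspace(0, 2 * np.pi, 24)) + rng.normal(0, 5, (30, 24))
print(seasonal_anomaly_score(history, hour=3, value=history[:, 3].mean()))  # low: normal 3am
print(seasonal_anomaly_score(history, hour=3, value=400.0))                 # high: real anomaly
```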