Shipping an agent is not the finish line; it’s the start of operations. In production, the most expensive failures are the silent ones: quality drift, cost creep, and integration breakage that degrade outcomes over time.
The 5 categories of monitoring you need
1) Quality signals
- Correction rate (how often humans edit outputs)
- Escalation rate (how often the agent asks for help)
- Online checks (required fields present, grounded answer checks; sketched below)
- Sampled review score for high-impact workflows
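A minimal sketch of the online checks, assuming outputs arrive as dicts and that your pipeline supplies the required field names and retrieved source snippets (the field names here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class CheckResult:
    passed: bool
    reasons: list[str]

def run_online_checks(
    output: dict, required_fields: list[str], source_snippets: list[str]
) -> CheckResult:
    """Cheap checks that run on every output, not just sampled ones."""
    reasons = []

    # Structural check: every required field must be present and non-empty.
    for name in required_fields:
        if not output.get(name):
            reasons.append(f"missing required field: {name}")

    # Naive groundedness check: an answer that shares no terms with the
    # retrieved sources is a strong hallucination signal. Crude on purpose;
    # it has to be cheap enough to run on every request.
    answer_tokens = set(output.get("answer", "").lower().split())
    source_tokens = set(" ".join(source_snippets).lower().split())
    if answer_tokens and not answer_tokens & source_tokens:
        reasons.append("answer shares no terms with retrieved sources")

    return CheckResult(passed=not reasons, reasons=reasons)
```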
2) Tool behavior and safety
- Tool call success rate
- Retries, timeouts, and error clusters
- Disallowed tool attempts (must be zero; enforcement sketched below)
- Approval events and approval rejection rate
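One way to capture all four signals is to route every tool invocation through a single wrapper. A sketch, assuming hypothetical tool names and a simple in-process counter:

```python
import time
from collections import Counter

# Hypothetical allowlist; in practice this comes from the agent's config.
ALLOWED_TOOLS = {"search_kb", "create_ticket", "send_reply"}
metrics: Counter = Counter()

class DisallowedToolError(Exception):
    pass

def call_tool(name: str, fn, *args, **kwargs):
    # Block (not just log) tools outside the allowlist, and count the
    # attempt: this counter should stay at zero, so alert on any value > 0.
    if name not in ALLOWED_TOOLS:
        metrics["disallowed_attempts"] += 1
        raise DisallowedToolError(name)
    start = time.monotonic()
    try:
        result = fn(*args, **kwargs)
        metrics[f"{name}.success"] += 1
        return result
    except TimeoutError:
        metrics[f"{name}.timeout"] += 1
        raise
    except Exception:
        metrics[f"{name}.error"] += 1
        raise
    finally:
        metrics[f"{name}.calls"] += 1
        metrics[f"{name}.total_ms"] += (time.monotonic() - start) * 1000
```

Blocking the disallowed call, rather than merely logging it, is the design choice that keeps that counter at zero instead of just observed.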
3) Cost and efficiency
- Cost per run
- Cost per successful outcome (computed in the sketch below)
- Token usage trends
- Expensive loops (repeated calls without progress)
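Cost per successful outcome is the number that catches silent waste: cost per run can look flat while the success rate quietly drops. A sketch, assuming each run record carries hypothetical cost_usd and succeeded fields:

```python
def cost_metrics(runs: list[dict]) -> dict:
    """Assumes a non-empty list of run records with (hypothetical)
    cost_usd and succeeded fields."""
    total = sum(r["cost_usd"] for r in runs)
    successes = sum(1 for r in runs if r["succeeded"])
    return {
        "cost_per_run": total / len(runs),
        "cost_per_success": total / successes if successes else float("inf"),
    }

def is_expensive_loop(tool_calls: list[dict], window: int = 3) -> bool:
    """Flag a run whose last few tool calls are identical: tokens being
    spent without progress."""
    recent = [(c["name"], repr(c["args"])) for c in tool_calls[-window:]]
    return len(recent) == window and len(set(recent)) == 1
```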
4) Latency and throughput
- End-to-end latency (median and p95; see the percentile sketch below)
- Latency by workflow step (where time is spent)
- Backlog/queue depth at peak hours
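Medians hide tail pain, so compute p95 alongside. A sketch using the nearest-rank method, assuming you collect raw latency samples per workflow step:

```python
import math
import statistics

def latency_summary(samples_ms: list[float]) -> dict:
    """Median is the typical experience; p95 is what your unluckiest users
    see. Run this per workflow step to find the bottleneck. Assumes a
    non-empty sample list."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank p95
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[rank - 1],
    }
```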
5) Drift and change detection
- Input distribution shifts (new ticket types, new exceptions; one detection approach is sketched below)
- Retrieval top-doc changes (knowledge freshness/quality)
- Eval score trends over time
- New failure clusters emerging
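For input distribution shifts, even a crude categorical comparison beats nothing. A sketch using total variation distance over, say, ticket types (the choice of category label is an assumption here):

```python
from collections import Counter

def distribution_shift(baseline: list[str], current: list[str]) -> float:
    """Total variation distance between two categorical distributions,
    e.g. ticket types this week vs. a reference week. 0 means identical,
    1 means completely disjoint; alert on a sustained rise. Assumes both
    lists are non-empty."""
    b, c = Counter(baseline), Counter(current)
    b_total, c_total = sum(b.values()), sum(c.values())
    return 0.5 * sum(
        abs(b[k] / b_total - c[k] / c_total) for k in set(b) | set(c)
    )
```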
A monitoring dashboard that works (minimum viable)
- Workflow health: runs/day, success rate, top failure reasons
- Quality: correction rate, escalation rate, sampled quality score
- Tools: success %, retries, timeouts, approvals
- Cost: cost/run, cost spikes, budget alerts
- Latency: median + p95 latency, bottleneck step
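Behind such a dashboard sits a daily rollup. A sketch, assuming a hypothetical run-record schema with the fields named below:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class DailyRollup:
    runs: int
    success_rate: float
    correction_rate: float
    escalation_rate: float
    cost_per_run: float
    top_failure_reasons: list[tuple[str, int]]

def daily_rollup(runs: list[dict]) -> DailyRollup:
    """One row per day. Each run record is assumed to carry succeeded,
    was_corrected, escalated, cost_usd, and failure_reason fields."""
    n = len(runs)  # assumes at least one run per day
    failures = Counter(
        r["failure_reason"] for r in runs if not r["succeeded"]
    )
    return DailyRollup(
        runs=n,
        success_rate=sum(r["succeeded"] for r in runs) / n,
        correction_rate=sum(r["was_corrected"] for r in runs) / n,
        escalation_rate=sum(r["escalated"] for r in runs) / n,
        cost_per_run=sum(r["cost_usd"] for r in runs) / n,
        top_failure_reasons=failures.most_common(3),
    )
```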
Alerts: what to page on vs what to review weekly
Page immediately
- Tool safety violation
- Repeated failures on a critical step
- Sustained cost explosion
- Integration outage impacting production runs
Review weekly
- Gradual increase in corrections
- Cost creep
- Drift indicators
- New failure clusters
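The page-vs-weekly split is worth encoding in the alerting config itself, so every new signal gets classified deliberately rather than defaulting to noise. A sketch with hypothetical signal names:

```python
# Hypothetical signal names; the point is that routing is explicit,
# so a new signal must be classified before it can fire.
PAGE_NOW = {
    "tool_safety_violation",
    "critical_step_repeated_failure",
    "sustained_cost_explosion",
    "integration_outage",
}
WEEKLY_REVIEW = {
    "correction_rate_rising",
    "cost_creep",
    "drift_indicator",
    "new_failure_cluster",
}

def route_alert(signal: str) -> str:
    if signal in PAGE_NOW:
        return "page_oncall"
    if signal in WEEKLY_REVIEW:
        return "weekly_review_queue"
    return "triage"  # unclassified signals get a human decision
```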
Monitoring answers: “What’s happening now?” Evaluation gates answer: “Is it safe to ship the next change?” You need both to run production agents sustainably.