Shipping an agent is not the finish line; it’s the start of operations. In production, the most expensive failures are the silent ones: quality drift, cost creep, and integration breakage that degrade outcomes over time.
The 5 categories of monitoring you need
1) Quality signals
- Correction rate (how often humans edit outputs)
- Escalation rate (how often the agent asks for help)
- Online checks (required fields present, grounded answer checks; sketched below)
- Sampled review score for high-impact workflows
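A minimal sketch of the online checks, assuming outputs arrive as dicts and that your pipeline supplies the required field names and retrieved source snippets (the field names here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class CheckResult:
    passed: bool
    reasons: list[str]

def run_online_checks(
    output: dict, required_fields: list[str], source_snippets: list[str]
) -> CheckResult:
    """Cheap checks that run on every output, not just sampled ones."""
    reasons = []

    # Structural check: every required field must be present and non-empty.
    for name in required_fields:
        if not output.get(name):
            reasons.append(f"missing required field: {name}")

    # Naive groundedness check: an answer that shares no terms with the
    # retrieved sources is a strong hallucination signal. Crude on purpose;
    # it has to be cheap enough to run on every request.
    answer_tokens = set(output.get("answer", "").lower().split())
    source_tokens = set(" ".join(source_snippets).lower().split())
    if answer_tokens and not answer_tokens & source_tokens:
        reasons.append("answer shares no terms with retrieved sources")

    return CheckResult(passed=not reasons, reasons=reasons)
```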
2) Tool behavior and safety
- Tool call success rate
- Retries, timeouts, and error clusters
- Disallowed tool attempts (must be zero; enforcement sketched below)
- Approval events and approval rejection rate
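One way to capture all four signals is to route every tool invocation through a single wrapper. A sketch, assuming hypothetical tool names and a simple in-process counter:

```python
import time
from collections import Counter

# Hypothetical allowlist; in practice this comes from the agent's config.
ALLOWED_TOOLS = {"search_kb", "create_ticket", "send_reply"}
metrics: Counter = Counter()

class DisallowedToolError(Exception):
    pass

def call_tool(name: str, fn, *args, **kwargs):
    # Block (not just log) tools outside the allowlist, and count the
    # attempt: this counter should stay at zero, so alert on any value > 0.
    if name not in ALLOWED_TOOLS:
        metrics["disallowed_attempts"] += 1
        raise DisallowedToolError(name)
    start = time.monotonic()
    try:
        result = fn(*args, **kwargs)
        metrics[f"{name}.success"] += 1
        return result
    except TimeoutError:
        metrics[f"{name}.timeout"] += 1
        raise
    except Exception:
        metrics[f"{name}.error"] += 1
        raise
    finally:
        metrics[f"{name}.calls"] += 1
        metrics[f"{name}.total_ms"] += (time.monotonic() - start) * 1000
```

Blocking the disallowed call, rather than merely logging it, is the design choice that keeps that counter at zero instead of just observed.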
3) Cost and efficiency
- Cost per run
- Cost per successful outcome (computed in the sketch below)
- Token usage trends
- Expensive loops (repeated calls without progress)
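Cost per successful outcome is the number that catches silent waste: cost per run can look flat while the success rate quietly drops. A sketch, assuming each run record carries hypothetical cost_usd and succeeded fields:

```python
def cost_metrics(runs: list[dict]) -> dict:
    """Assumes a non-empty list of run records with (hypothetical)
    cost_usd and succeeded fields."""
    total = sum(r["cost_usd"] for r in runs)
    successes = sum(1 for r in runs if r["succeeded"])
    return {
        "cost_per_run": total / len(runs),
        "cost_per_success": total / successes if successes else float("inf"),
    }

def is_expensive_loop(tool_calls: list[dict], window: int = 3) -> bool:
    """Flag a run whose last few tool calls are identical: tokens being
    spent without progress."""
    recent = [(c["name"], repr(c["args"])) for c in tool_calls[-window:]]
    return len(recent) == window and len(set(recent)) == 1
```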
4) Latency and throughput
- End-to-end latency (median and p95; see the percentile sketch below)
- Latency by workflow step (where time is spent)
- Backlog/queue depth at peak hours
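Medians hide tail pain, so compute p95 alongside. A sketch using the nearest-rank method, assuming you collect raw latency samples per workflow step:

```python
import math
import statistics

def latency_summary(samples_ms: list[float]) -> dict:
    """Median is the typical experience; p95 is what your unluckiest users
    see. Run this per workflow step to find the bottleneck. Assumes a
    non-empty sample list."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank p95
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[rank - 1],
    }
```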
5) Drift and change detection
- Input distribution shifts (new ticket types, new exceptions; one detection approach is sketched below)
- Retrieval top-doc changes (knowledge freshness/quality)
- Eval score trends over time
- New failure clusters emerging
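For input distribution shifts, even a crude categorical comparison beats nothing. A sketch using total variation distance over, say, ticket types (the choice of category label is an assumption here):

```python
from collections import Counter

def distribution_shift(baseline: list[str], current: list[str]) -> float:
    """Total variation distance between two categorical distributions,
    e.g. ticket types this week vs. a reference week. 0 means identical,
    1 means completely disjoint; alert on a sustained rise. Assumes both
    lists are non-empty."""
    b, c = Counter(baseline), Counter(current)
    b_total, c_total = sum(b.values()), sum(c.values())
    return 0.5 * sum(
        abs(b[k] / b_total - c[k] / c_total) for k in set(b) | set(c)
    )
```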
A monitoring dashboard that works (minimum viable)
- Workflow health: runs/day, success rate, top failure reasons
- Quality: correction rate, escalation rate, sampled quality score
- Tools: success %, retries, timeouts, approvals
- Cost: cost/run, cost spikes, budget alerts
- Latency: median + p95 latency, bottleneck step
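Behind such a dashboard sits a daily rollup. A sketch, assuming a hypothetical run-record schema with the fields named below:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class DailyRollup:
    runs: int
    success_rate: float
    correction_rate: float
    escalation_rate: float
    cost_per_run: float
    top_failure_reasons: list[tuple[str, int]]

def daily_rollup(runs: list[dict]) -> DailyRollup:
    """One row per day. Each run record is assumed to carry succeeded,
    was_corrected, escalated, cost_usd, and failure_reason fields."""
    n = len(runs)  # assumes at least one run per day
    failures = Counter(
        r["failure_reason"] for r in runs if not r["succeeded"]
    )
    return DailyRollup(
        runs=n,
        success_rate=sum(r["succeeded"] for r in runs) / n,
        correction_rate=sum(r["was_corrected"] for r in runs) / n,
        escalation_rate=sum(r["escalated"] for r in runs) / n,
        cost_per_run=sum(r["cost_usd"] for r in runs) / n,
        top_failure_reasons=failures.most_common(3),
    )
```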
Alerts: what to page on vs what to review weekly
Page immediately
- Tool safety violation
- Repeated failures on a critical step
- Sustained cost explosion
- Integration outage impacting production runs
Review weekly
- Gradual increase in corrections
- Cost creep
- Drift indicators
- New failure clusters
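The page-vs-weekly split is worth encoding in the alerting config itself, so every new signal gets classified deliberately rather than defaulting to noise. A sketch with hypothetical signal names:

```python
# Hypothetical signal names; the point is that routing is explicit,
# so a new signal must be classified before it can fire.
PAGE_NOW = {
    "tool_safety_violation",
    "critical_step_repeated_failure",
    "sustained_cost_explosion",
    "integration_outage",
}
WEEKLY_REVIEW = {
    "correction_rate_rising",
    "cost_creep",
    "drift_indicator",
    "new_failure_cluster",
}

def route_alert(signal: str) -> str:
    if signal in PAGE_NOW:
        return "page_oncall"
    if signal in WEEKLY_REVIEW:
        return "weekly_review_queue"
    return "triage"  # unclassified signals get a human decision
```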
Monitoring answers: “What’s happening now?” Evaluation gates answer: “Is it safe to ship the next change?” You need both to run production agents sustainably.