Evals for agents: beyond 'does it work?'
Evals for production agents: suite design, metrics that matter, regression detection. What separates real eval from demo testing.
Most teams measure agents with “I ran 10 prompts and they seem fine.” That’s not eval, it’s demo testing. Real eval is statistical, repeatable, cheap to run, and able to detect regressions.
This post covers eval suite design that works for production agents.
Why “seems fine” fails
Three reasons demo testing deceives:
1. Confirmation bias. You test scenarios where you believe the agent works. Edge cases stay out.
2. No regression detection. When you change the prompt next week, you have no way to know whether something that worked has stopped working.
3. No basis for comparison. “Version A vs B” requires a numeric metric, not an impression.
The 4 eval levels
Level 1 · Prompt unit test
50-200 representative inputs, each with an expected output (or a programmatic criterion). Runs in CI before any prompt or model change.
Example entry:
- input: "I'm frustrated, this doesn't work"
  expect:
    - sentiment_detected: "negative"
    - escalation_suggested: true
    - tone: "empathic"
The validator can be a regex, an LLM-as-judge, or a deterministic metric.
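A minimal sketch of the deterministic option, assuming the agent returns structured fields that can be compared directly. The `check_entry` and `fake_agent` names are hypothetical, and the `expect` block is flattened into a plain dict for brevity.

```python
from typing import Any, Callable

# One suite entry, mirroring the example entry above.
ENTRY = {
    "input": "I'm frustrated, this doesn't work",
    "expect": {
        "sentiment_detected": "negative",
        "escalation_suggested": True,
        "tone": "empathic",
    },
}


def check_entry(entry: dict[str, Any], agent: Callable[[str], dict[str, Any]]) -> list[str]:
    """Return a list of mismatches; an empty list means the entry passed."""
    actual = agent(entry["input"])
    failures = []
    for field_name, expected in entry["expect"].items():
        got = actual.get(field_name)
        if got != expected:
            failures.append(f"{field_name}: expected {expected!r}, got {got!r}")
    return failures


def fake_agent(text: str) -> dict[str, Any]:
    # Hypothetical stand-in for the real agent call; it gets the tone wrong on purpose.
    return {"sentiment_detected": "negative", "escalation_suggested": True, "tone": "neutral"}


failures = check_entry(ENTRY, fake_agent)
print(failures)
```

Returning the list of mismatches rather than a bare boolean makes a CI failure readable without rerunning the entry by hand.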
Level 2 · Regression suite
Grows automatically from the failure corpus. Each production bug becomes a test, so the same failure can’t come back unnoticed.
Schema:
- id: REG-042
  introduced: 2026-04-12
  cause: "Agent suggested plan cancellation before offering downgrade"
  input: "I'm not happy with the plan"
  must_not_output: ["cancel", "cancellation"]
  must_output: ["downgrade", "alternative"]
The more the system runs, the more the suite grows. Good sign.
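One way a runner might evaluate a case with this schema is sketched below. The matching rules are assumptions, not part of the schema: terms are matched as case-insensitive substrings, every `must_not_output` term is forbidden, and `must_output` is satisfied if any one of its terms appears.

```python
from dataclasses import dataclass, field


@dataclass
class RegressionCase:
    id: str
    input: str
    must_not_output: list[str] = field(default_factory=list)
    must_output: list[str] = field(default_factory=list)


def check_regression(case: RegressionCase, output: str) -> list[str]:
    """Return violations for one case; an empty list means no regression."""
    text = output.lower()
    # Substring matching: "cancel" also matches inside "cancellation".
    violations = [
        f"forbidden term present: {term}"
        for term in case.must_not_output
        if term.lower() in text
    ]
    # Assumption: one required term is enough to satisfy must_output.
    if case.must_output and not any(term.lower() in text for term in case.must_output):
        violations.append(f"none of the required terms present: {case.must_output}")
    return violations


case = RegressionCase(
    id="REG-042",
    input="I'm not happy with the plan",
    must_not_output=["cancel", "cancellation"],
    must_output=["downgrade", "alternative"],
)

good = check_regression(case, "We could look at a downgrade that fits you better.")
bad = check_regression(case, "I can process your cancellation right away.")
```

Keyword checks like this are cheap but brittle. A case where wording varies widely may be better served by a judge-based check.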
Level 3 · Behavioral eval (LLM-as-judge)
For subjective qualities (tone, naturalness, completeness), use another LLM as judge. Pattern:
[INPUT] [Agent output]
[JUDGE] Rate 1-5 on:
- Tone (appropriately formal?)
- Completeness (resolves the request?)
- Safety (revealed sensitive info?)
Caveat: LLM-as-judge is biased (judges tend to prefer outputs from their own model, and longer outputs). Calibrate against a human-scored sample before trusting it.
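Calibration can be as simple as scoring the same outputs with both the judge and a human, then comparing. The sketch below assumes integer scores on the 1-5 scale. The function name, the three summary statistics, and the sample scores are illustrative choices, not real data.

```python
def calibration_report(judge: list[int], human: list[int]) -> dict[str, float]:
    """Compare judge scores with human scores on the same outputs (1-5 scale)."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(judge)
    diffs = [j - h for j, h in zip(judge, human)]
    return {
        # Share of outputs where judge and human gave the same score.
        "exact_agreement": sum(d == 0 for d in diffs) / n,
        # Share of outputs where the scores differ by at most one point.
        "within_one": sum(abs(d) <= 1 for d in diffs) / n,
        # Positive bias means the judge scores higher than humans on average.
        "mean_bias": sum(diffs) / n,
    }


# Illustrative scores only.
judge_scores = [5, 4, 4, 3, 5, 2, 4, 5]
human_scores = [4, 4, 3, 3, 5, 2, 2, 4]
report = calibration_report(judge_scores, human_scores)
print(report)
```

A consistently positive `mean_bias` suggests the judge is lenient. What level of agreement is good enough is a judgment call for each team.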
Level 4 · Production canary
Runs on 1-5% of real traffic and compares outputs to the previous version. Detects subtle drift that synthetic tests miss.
Typical structure:
- 95-99% traffic → stable version.
- 1-5% → candidate version.
- Metrics: human acceptance, retries, time to resolution.
- Promote or rollback after minimum volume (1,000-5,000 turns).
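The structure above could be sketched as a deterministic traffic split plus a promotion rule. Everything named here is hypothetical: `route`, `decide`, the conversation ID as the routing key, and the `max_drop` threshold of 2 points in acceptance rate are assumptions for illustration.

```python
import hashlib


def route(conversation_id: str, canary_percent: float = 5.0) -> str:
    """Deterministically assign a conversation to 'stable' or 'candidate'."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a bucket in [0, 10000).
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return "candidate" if bucket < canary_percent * 100 else "stable"


def decide(stable_accept: float, cand_accept: float, cand_turns: int,
           min_turns: int = 1000, max_drop: float = 0.02) -> str:
    """Return 'wait', 'promote', or 'rollback' for the candidate version."""
    if cand_turns < min_turns:
        return "wait"  # not enough volume yet to judge
    if stable_accept - cand_accept > max_drop:
        return "rollback"  # acceptance dropped more than tolerated
    return "promote"


print(route("conv-12345"))
print(decide(stable_accept=0.80, cand_accept=0.79, cand_turns=2000))
```

Hashing the conversation ID, rather than drawing a random number per turn, keeps a whole conversation on one version, so the comparison is not muddied by mid-conversation switches.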
Metrics that matter
| Metric | What it measures | When it matters |
|---|---|---|
| Pass rate | % of tests passed | Prompt/model change |
| Regression count | How many previously passing tests now fail | Before deploy |
| Behavioral score | Subjective quality | Periodic review |
| Production acceptance | % accepted by user | Canary |
| Cost per task | $ avg per interaction | Optimization |
| Latency p99 | Slow tail | UX critical |
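Three of these metrics fall straight out of per-run records. The sketch below assumes each run logs a pass flag, a cost in dollars, and a latency in milliseconds. The `RunRecord` fields are hypothetical, and the p99 uses the nearest-rank definition, which is one of several common conventions.

```python
import math
from dataclasses import dataclass


@dataclass
class RunRecord:
    passed: bool
    cost_usd: float
    latency_ms: float


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records: list[RunRecord]) -> dict[str, float]:
    """Aggregate pass rate, average cost, and tail latency for one eval run."""
    n = len(records)
    return {
        "pass_rate": sum(r.passed for r in records) / n,
        "cost_per_task_usd": sum(r.cost_usd for r in records) / n,
        "latency_p99_ms": percentile([r.latency_ms for r in records], 99),
    }


# Synthetic records for illustration: 100 runs, every fifth one fails.
records = [
    RunRecord(passed=(i % 5 != 0), cost_usd=0.01, latency_ms=float(i))
    for i in range(1, 101)
]
summary = summarize(records)
print(summary)
```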
The counter-intuitive insight
A 100% pass rate is suspicious. It means your suite is too easy, or the model has memorized the tests (leakage). Aim for a 70-90% pass rate on a well-designed suite: the 10-30% that fail is where you learn.
Pattern that works
Inspired by what we see in HES:
- Week 1-2: build a unit suite of 50 entries. Run it on every deploy.
- Week 3-12: feed regression suite from real failures.
- Month 3+: add LLM-as-judge for subjective dimensions.
- Month 6+: production canary for non-trivial changes.
Don’t skip steps. A production canary without a unit suite is a lottery.
Common anti-patterns
1. Eval identical to the prompt. A suite that reuses examples from the system prompt measures memorization, not capability. Use unseen inputs.
2. Single metric. Pass rate alone deceives. Combine it with cost, latency, and behavioral score.
3. Eval that only runs on a dev laptop. It has to run in CI, with versioned result snapshots.
4. Eval missing locale edge cases. Accents, regional slang, ID formats, date formats. A global model can fail there silently.
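The first anti-pattern can be partly caught mechanically. The sketch below flags eval inputs that appear near-verbatim in the system prompt. The function names are hypothetical, and the check only catches copied text, not paraphrased examples.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation and whitespace so near-verbatim copies match."""
    return re.sub(r"[^\w]+", " ", text.lower()).strip()


def leaked_inputs(system_prompt: str, eval_inputs: list[str]) -> list[str]:
    """Return eval inputs whose normalized text appears inside the system prompt."""
    haystack = normalize(system_prompt)
    leaked = []
    for item in eval_inputs:
        needle = normalize(item)
        if needle and needle in haystack:
            leaked.append(item)
    return leaked


# Illustrative prompt and inputs.
prompt = 'Example: user says "I\'m not happy with the plan" -> offer a downgrade first.'
inputs = ["I'm not happy with the plan", "This plan costs too much for me"]
leaks = leaked_inputs(prompt, inputs)
print(leaks)
```

Running a check like this in CI alongside the suite would fail the build when someone copies a prompt example into the eval set, or the reverse.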
Where to go deeper
For the runtime governance framework that these evals validate, see Harness Stack.