
Evals for agents: beyond 'does it work?'

Evals for production agents: suite design, metrics that matter, regression detection. What separates real eval from demo testing.

Most teams measure agents with “I ran 10 prompts and they seem fine.” That’s not eval; it’s demo testing. Real eval is statistical, repeatable, cheap to run, and able to detect regressions.

This post covers eval suite design that works for production agents.

Why “seems fine” fails

Three reasons demo testing deceives:

1. Confirmation bias. You test the scenarios where you already believe the agent works. Edge cases stay out.

2. It doesn’t catch regressions. When you change the prompt next week, you have no way to know whether something that worked has stopped working.

3. It doesn’t allow comparison. “Version A vs B” requires a numeric metric, not an impression.

The 4 eval levels

Level 1 · Prompt unit test

50-200 representative inputs, each with an expected output (or a programmatic criterion). The suite runs in CI before any prompt or model change.

Example entry:

- input: "I'm frustrated, this doesn't work"
  expect:
    - sentiment_detected: "negative"
    - escalation_suggested: true
    - tone: "empathic"

The validator can be a regex, an LLM-as-judge, or a deterministic metric.
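Here is a minimal sketch of a deterministic validator for entries like the one above. It assumes a hypothetical `run_agent` function that returns structured fields; the stub below exists only so the snippet runs. The `tone` field is left out because it would need a judge rather than an equality check.

```python
from typing import Callable

# Hypothetical suite format mirroring the example entry above.
SUITE = [
    {
        "input": "I'm frustrated, this doesn't work",
        "expect": {
            "sentiment_detected": "negative",
            "escalation_suggested": True,
        },
    },
]


def run_agent(text: str) -> dict:
    """Stand-in for the real agent call (assumption: it returns structured fields)."""
    negative = any(w in text.lower() for w in ("frustrated", "doesn't work"))
    return {
        "sentiment_detected": "negative" if negative else "neutral",
        "escalation_suggested": negative,
    }


def check_entry(entry: dict, agent: Callable[[str], dict]) -> list[str]:
    """Return human-readable failures; an empty list means the entry passed."""
    output = agent(entry["input"])
    failures = []
    for field, expected in entry["expect"].items():
        actual = output.get(field)
        if actual != expected:
            failures.append(f"{field}: expected {expected!r}, got {actual!r}")
    return failures


def run_suite(suite: list[dict], agent: Callable[[str], dict]) -> float:
    """Run every entry and return the pass rate in [0, 1]."""
    passed = sum(1 for entry in suite if not check_entry(entry, agent))
    return passed / len(suite)
```

In CI, the pass rate from `run_suite` would be compared against a threshold or against the previous run, and the failure messages printed for the entries that did not pass.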

Level 2 · Regression suite

Grows automatically from the failure corpus. Each production bug becomes a test, so the bug can’t come back unnoticed.

Schema:

- id: REG-042
  introduced: 2026-04-12
  cause: "Agent suggested plan cancellation before offering downgrade"
  input: "I'm not happy with the plan"
  must_not_output: ["cancel", "cancellation"]
  must_output: ["downgrade", "alternative"]

The more the system runs, the more the suite grows. Good sign.
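An entry with this schema could be checked roughly as follows. This is a sketch under two assumptions that the schema itself does not state: matching is a case-insensitive substring check, and `must_output` means "at least one of these terms", not "all of them".

```python
REG_042 = {
    "id": "REG-042",
    "input": "I'm not happy with the plan",
    "must_not_output": ["cancel", "cancellation"],
    "must_output": ["downgrade", "alternative"],
}


def check_regression(entry: dict, agent_output: str) -> list[str]:
    """Return a list of violations for one regression entry (empty = pass)."""
    text = agent_output.lower()
    problems = []

    # Any forbidden term in the output is a violation.
    for term in entry.get("must_not_output", []):
        if term.lower() in text:
            problems.append(f"{entry['id']}: forbidden term {term!r} present")

    # Assumption: one required term is enough (any-of semantics).
    required = entry.get("must_output", [])
    if required and not any(term.lower() in text for term in required):
        problems.append(f"{entry['id']}: none of {required} present")

    return problems
```

Substring matching is crude (it would also flag "I won't cancel anything"), so a real suite may prefer a regex or a judge for entries where wording matters.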

Level 3 · Behavioral eval (LLM-as-judge)

For subjective qualities (tone, naturalness, completeness), use another LLM as judge. Pattern:

[INPUT] [Agent output]
[JUDGE] Rate 1-5 on:
  - Tone (appropriately formal?)
  - Completeness (resolves the request?)
  - Safety (revealed sensitive info?)

Caveat: LLM-as-judge is biased (models tend to prefer outputs from the same model, or longer ones). Calibrate against a human-labeled sample before trusting the scores.
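One simple way to calibrate is to score the same sample with the judge and with humans, then look at agreement and systematic bias. The sketch below assumes both lists hold 1-5 ratings for the same outputs in the same order; the function name and the choice of statistics are illustrative, not a standard.

```python
def calibration_report(judge_scores: list[int], human_scores: list[int]) -> dict:
    """Compare judge ratings with human ratings on the same outputs."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty score lists of equal length")

    n = len(judge_scores)
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    return {
        # Share of outputs where judge and human gave the same rating.
        "exact_agreement": sum(d == 0 for d in diffs) / n,
        # Share of outputs where they differ by at most one point.
        "within_one": sum(abs(d) <= 1 for d in diffs) / n,
        # Positive value: the judge rates higher than humans on average.
        "mean_bias": sum(diffs) / n,
    }
```

A large positive `mean_bias` on long outputs only, for example, would point to the length preference mentioned above.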

Level 4 · Production canary

Runs on 1-5% of real traffic and compares outputs with the previous version. Detects subtle drift that synthetic tests miss.

Typical structure:

  • 95-99% traffic → stable version.
  • 1-5% → candidate version.
  • Metrics: human acceptance, retries, time to resolution.
  • Promote or rollback after minimum volume (1,000-5,000 turns).
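The split and the promote/rollback decision above can be sketched like this. The hashing scheme, the `max_drop` tolerance, and the use of acceptance rate as the single deciding metric are assumptions made for illustration; only the traffic percentages and the minimum volume come from the structure above.

```python
import hashlib


def route(session_id: str, canary_percent: float = 5.0) -> str:
    """Deterministically assign a session to 'candidate' or 'stable'.

    Hashing the session id keeps a given session on the same version
    for all of its turns.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return "candidate" if bucket < canary_percent * 100 else "stable"


def decide(
    stable_accept: float,
    candidate_accept: float,
    candidate_turns: int,
    min_turns: int = 1000,
    max_drop: float = 0.02,  # hypothetical tolerance, tune per product
) -> str:
    """Return 'wait', 'rollback', or 'promote' for the candidate version."""
    if candidate_turns < min_turns:
        return "wait"  # not enough volume to judge yet
    if candidate_accept < stable_accept - max_drop:
        return "rollback"
    return "promote"
```

A production version would also look at retries and time to resolution, and would likely use a proper significance test instead of a fixed tolerance.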

Metrics that matter

| Metric | What it measures | When it matters |
| --- | --- | --- |
| Pass rate | % of tests passed | Prompt/model change |
| Regression count | How many regressed | Before deploy |
| Behavioral score | Subjective quality | Periodic review |
| Production acceptance | % accepted by user | Canary |
| Cost per task | $ avg per interaction | Optimization |
| Latency p99 | Slow tail | UX critical |
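Three of these metrics fall out of the raw per-test results directly. The sketch below assumes each result is a dict with hypothetical keys `passed`, `cost_usd`, and `latency_ms`, and uses the nearest-rank definition of p99 (other percentile definitions give slightly different values on small samples).

```python
def summarize(results: list[dict]) -> dict:
    """Compute pass rate, average cost, and p99 latency from per-test results."""
    if not results:
        raise ValueError("no results to summarize")

    n = len(results)
    latencies = sorted(r["latency_ms"] for r in results)
    # Nearest-rank p99: ceil(0.99 * n) in integer arithmetic, as a 0-based index.
    p99_index = (99 * n + 99) // 100 - 1

    return {
        "pass_rate": sum(bool(r["passed"]) for r in results) / n,
        "avg_cost_usd": sum(r["cost_usd"] for r in results) / n,
        "latency_p99_ms": latencies[p99_index],
    }
```

With only a few dozen samples the p99 is effectively the maximum, so the tail metric becomes meaningful mainly on canary-sized volumes.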

The counter-intuitive insight

A 100% pass rate is suspicious. It means your suite is too easy, or the model has memorized the tests (leakage). Aim for a 70-90% pass rate on a well-designed suite: the 10-30% that fail is where you learn.

Pattern that works

Inspired by what we see in HES:

  1. Week 1-2: build a unit suite of 50 entries. Run it on every deploy.
  2. Week 3-12: feed regression suite from real failures.
  3. Month 3+: add LLM-as-judge for subjective dimensions.
  4. Month 6+: production canary for non-trivial changes.

Don’t skip steps. A production canary without a unit suite is a lottery.

Common anti-patterns

1. Eval identical to the prompt. A suite that reuses examples from the system prompt measures memorization, not capability. Use unseen inputs.

2. Single metric. Pass rate alone deceives. Combine it with cost, latency, and behavioral score.

3. Eval that only runs on a dev laptop. It has to run in CI, with versioned result snapshots.

4. Eval missing locale edge cases. Accents, regional slang, ID formats, date formats. A global model can fail there silently.
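For the third anti-pattern, versioned snapshots can be as simple as a JSON file per run plus a diff against the previous one. This is a sketch with a hypothetical snapshot layout (`version` plus a map of test id to pass/fail); it treats a test that disappeared from the current run as a regression, which is a deliberate assumption.

```python
import json


def save_snapshot(path: str, version: str, results: dict[str, bool]) -> None:
    """Write one run's results to a JSON file that can be committed or archived."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"version": version, "results": results}, fh, indent=2, sort_keys=True)


def load_snapshot(path: str) -> dict[str, bool]:
    """Read back the test-id -> passed map from a snapshot file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)["results"]


def find_regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return ids of tests that passed in the baseline but fail (or are missing) now."""
    return sorted(
        test_id
        for test_id, passed_before in baseline.items()
        if passed_before and not current.get(test_id, False)
    )
```

A CI step would load the last snapshot, run the suite, call `find_regressions`, and fail the build if the returned list is not empty.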

Where to go deeper

For the runtime governance framework that these evals validate, see Harness Stack.