🟠 Builder

Harness Stack: introduction to the runtime governance framework

Extended version of Harness Stack — the 9 layers wrapping a prompt in production, with implementation order and failure signals per layer.

May 15, 2026 · 12 min · ai-engineering

The Harness Stack defines the 9 layers wrapping a prompt in production. This post extends it with a field-tested implementation order, the failure signals for each layer, and the practical difference between an “agent that demos well” and an “agent that holds in production.”

The motivating question

Why do agents that run beautifully in vendor demos fail in real production? Short answer: what separates demo from production isn’t the model. It’s what’s around the model.

The demo has a happy human, giving expected prompts, in an environment without real data. Production has unexpected users, real data with edge cases, irreversible actions, and audits.

The Harness Stack is the checklist for everything around the model.

Implementation order (field-tested)

Don’t build the 9 layers in parallel. Build them in order, validating each one before moving on. This is the order learned in the field:

Block 1 (weeks 1-2): Context + Constraint + Verification

Layer 1 · Context. What the agent sees: system prompt + RAG + tool state. Start here, because every other layer fails if the context is wrong.

Failure signal: agent cites info that doesn’t exist (hallucination), or ignores info that does (context loss).
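Both failure signals can be checked mechanically if the agent is asked to cite document ids. Below is a minimal sketch, not the framework's own implementation: `Context`, `audit_citations`, and the idea of citing by id are assumptions made for illustration, and "unused" documents are only a crude proxy for context loss.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """What the agent sees (hypothetical shape)."""
    system_prompt: str
    documents: dict  # doc_id -> text retrieved by RAG
    tool_state: dict = field(default_factory=dict)


def audit_citations(context, cited_ids):
    """Compare the ids the agent cited with the ids it was actually given."""
    cited = set(cited_ids)
    available = set(context.documents)
    return {
        # Cited but never provided: hallucination signal.
        "nonexistent": sorted(cited - available),
        # Provided but never cited: possible context loss.
        "unused": sorted(available - cited),
    }
```

Logging the result of `audit_citations` per response gives a simple baseline before any other layer exists.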

Layer 2 · Constraint. What the agent can and can’t do. Tool allowlist, scope, rate limit. Critical before any real-world action.

Failure signal: agent tries an action it shouldn’t, or hits a rate limit without a fallback.
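A tool allowlist plus a rate limit with an explicit fallback can be sketched in a few lines. This is an illustration under stated assumptions: `ConstraintLayer`, `ConstraintError`, and the sliding-window limit are hypothetical choices, not part of the Harness Stack definition.

```python
import time


class ConstraintError(Exception):
    """Raised when a tool call falls outside the agent's allowed scope."""


class ConstraintLayer:
    def __init__(self, allowed_tools, max_calls, window_seconds, clock=time.monotonic):
        self.allowed_tools = set(allowed_tools)
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limit can be tested without sleeping
        self._calls = []    # timestamps of recently accepted calls

    def check(self, tool_name):
        """Return "allow" or "fallback"; raise if the tool is not allowlisted."""
        if tool_name not in self.allowed_tools:
            raise ConstraintError(f"tool not allowed: {tool_name}")
        now = self.clock()
        # Keep only the calls that are still inside the sliding window.
        self._calls = [t for t in self._calls if now - t < self.window_seconds]
        if len(self._calls) >= self.max_calls:
            return "fallback"  # rate limit hit: caller degrades instead of crashing
        self._calls.append(now)
        return "allow"
```

Returning `"fallback"` rather than raising forces the caller to decide what the degraded path is, which is exactly the gap named in the failure signal.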

Layer 3 · Verification. Output checked before applying effect. Schema validation, dry-run, sandbox execution.

Failure signal: real-world action with wrong payload (syntactically valid, semantically wrong).
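The gap between syntactically valid and semantically wrong is easiest to see on a concrete payload. The sketch below assumes a hypothetical refund tool; `RefundRequest`, `verify_refund`, and the "refund cannot exceed the order total" rule are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class RefundRequest:
    order_id: str
    amount_cents: int


def verify_refund(payload, order_total_cents):
    """Return (request, errors). Apply the effect only when errors is empty."""
    errors = []
    # Syntactic check: required fields with the expected types.
    if not isinstance(payload.get("order_id"), str):
        errors.append("order_id must be a string")
    amount = payload.get("amount_cents")
    if not isinstance(amount, int) or isinstance(amount, bool):
        errors.append("amount_cents must be an integer")
        return None, errors
    # Semantic check: a valid shape is not enough, the value must make sense.
    if amount <= 0:
        errors.append("amount_cents must be positive")
    if amount > order_total_cents:
        errors.append("refund exceeds order total")
    if errors:
        return None, errors
    return RefundRequest(payload["order_id"], amount), []
```

A dry-run mode can reuse the same function: run `verify_refund`, log the result, and skip the real call.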

Block 2 (weeks 3-4): Durable pause + Confidence gating

Layer 7 · Durable pause. On an irreversible action, the agent stops and awaits human confirmation within a timeout window. This layer comes before Feedback and Advisor because production needs this safeguard first.

Failure signal: agent sends mass email, transfers funds, or deletes data without confirmation.
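"Durable" here is taken to mean that the pending action survives a process restart. The sketch below assumes that reading and uses a JSON file as the store; `DurablePause`, the ticket id, and the file layout are hypothetical, and a real system would more likely use a database or a workflow engine.

```python
import json
import time
import uuid
from pathlib import Path


class DurablePause:
    def __init__(self, store_path, timeout_seconds, clock=time.time):
        self.store = Path(store_path)
        self.timeout_seconds = timeout_seconds
        self.clock = clock

    def _load(self):
        return json.loads(self.store.read_text()) if self.store.exists() else {}

    def _save(self, pending):
        self.store.write_text(json.dumps(pending))

    def request(self, action, payload):
        """Persist the action and return a ticket id; nothing executes yet."""
        pending = self._load()
        ticket = uuid.uuid4().hex
        pending[ticket] = {
            "action": action,
            "payload": payload,
            "expires_at": self.clock() + self.timeout_seconds,
        }
        self._save(pending)
        return ticket

    def confirm(self, ticket):
        """Return the stored action if confirmed in time, else None."""
        pending = self._load()
        entry = pending.pop(ticket, None)
        self._save(pending)
        if entry is None or self.clock() > entry["expires_at"]:
            return None  # unknown or expired: the default is to do nothing
        return entry
```

The important design choice is the default: an expired or unknown ticket never executes, so silence from the human means no action.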

Layer 8 · Confidence gating. The agent declares its confidence before acting. Confidence below a threshold, typically set between 0.7 and 0.85, triggers escalation.

Failure signal: agent delivers confident answer for something it clearly didn’t know.
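The gate itself is small. The sketch below assumes that escalation fires when the declared confidence falls below the threshold, and picks 0.8 as a default inside the 0.7-0.85 range; the function name `gate` and the return values are hypothetical.

```python
def gate(confidence, threshold=0.8):
    """Decide whether the agent may act on its own declared confidence."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    # At or above the threshold the agent acts; below it, a human is pulled in.
    return "act" if confidence >= threshold else "escalate"
```

The gate only helps if the declared confidence tracks reality, which is why the failure signal above is worth logging every time it occurs.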

Block 3 (weeks 5-6): Failure corpus

Layer 9 · Failure corpus. Versioned repository of every production failure, with analysis.

Start collecting from day 1, even manually. In 90 days you’ll have the first useful corpus to feed other layers.
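Collecting from day 1 needs nothing more than an append-only file. This is a minimal sketch assuming a JSON Lines file kept under version control; `FailureRecord` and its fields are a hypothetical schema, not the one the framework prescribes.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class FailureRecord:
    layer: str        # which layer should have caught the failure
    summary: str      # what happened, in one sentence
    trace_id: str     # pointer to the full production trace
    analysis: str = ""
    recorded_at: str = ""


def append_failure(path, record):
    """Append one failure as a JSON line, stamping the time if it is missing."""
    if not record.recorded_at:
        record.recorded_at = datetime.now(timezone.utc).isoformat()
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")


def load_failures(path, layer=None):
    """Load all failures, optionally filtered by layer."""
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    return [r for r in records if layer is None or r["layer"] == layer]
```

Filtering by layer is what lets the corpus feed the other layers: each one can pull the failures it should have caught.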

Block 4 (month 3+): Feedback + Advisor + Emotion

Layer 4 · Feedback. Accepted/rejected/edited signal returns to the prompt. Takes 2-3 months to have enough volume to iterate.
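One way to make the signal usable is to aggregate it per prompt version. The sketch below assumes each event is a `(prompt_version, signal)` pair; the function name and the choice of acceptance rate as the metric are illustrative assumptions.

```python
from collections import Counter, defaultdict

VALID_SIGNALS = {"accepted", "rejected", "edited"}


def acceptance_by_version(events):
    """Return the share of accepted outputs for each prompt version."""
    counts = defaultdict(Counter)
    for version, signal in events:
        if signal not in VALID_SIGNALS:
            raise ValueError(f"unknown signal: {signal}")
        counts[version][signal] += 1
    return {
        version: c["accepted"] / sum(c.values())
        for version, c in counts.items()
    }
```

With only a handful of events per version the rates are noise, which is one reason this layer waits for volume.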

Layer 5 · Advisor. A second opinion under load. It is expensive to implement: usually another model (Opus reviewing Sonnet) or a static rule. Worth it only for critical actions.
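The static-rule variant is the cheap end of this layer. Here is a minimal sketch, with hypothetical action names and a hypothetical amount limit; a model-based advisor would plug into the same `advisors` list as another callable.

```python
CRITICAL_ACTIONS = {"transfer_funds", "delete_data"}  # hypothetical action names


def max_amount_rule(action, payload, limit_cents=100_000):
    """Static-rule advisor: object to transfers above a fixed limit."""
    if action == "transfer_funds" and payload.get("amount_cents", 0) > limit_cents:
        return "amount above limit"
    return None


def second_opinion(action, payload, advisors):
    """Return a list of objections; an empty list means proceed."""
    if action not in CRITICAL_ACTIONS:
        return []  # skip the expensive review for non-critical actions
    opinions = (advise(action, payload) for advise in advisors)
    return [objection for objection in opinions if objection]
```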

Layer 6 · Emotion. User friction signals are detected and used to trigger extra verification. It is the last layer because it depends on volume and on the rest working.

The most common mistake: jumping to “cool demo”

Teams new to agents want Multi-agent + Advisor + complex RAG in the MVP. Result: nine systems that half-work, none reliable.

Pattern that works: build the MVP with only Block 1 (Context + Constraint + Verification). Put it into limited production. Suffer. Learn. Add Durable pause where it hurts. Then Confidence gating. Only then add Advisor.

The counter-intuitive insight

Failure corpus (layer 9) seems least important because “it’s just logging.” It’s the most important. Without corpus, the other 8 layers don’t evolve. You’re iterating in the dark.

At HES, we spend more tokens curating traces into the failure corpus than running the proposer itself. It’s the foundation everything else is built on.

Where to go deeper

For the framework itself: Harness Stack hub. For the related question of delegation (which agent gets which task): Agent Trust Stack.