
Harness Stack: the nine layers wrapping a prompt in production

Context · Constraint · Verification · Feedback · Advisor · Emotion · Durable pause · Confidence gating · Failure corpus

Harness Stack is the Automation Labs framework of nine layers that separate a working demo prompt from a trustworthy production agent, derived from the Harness Evolution System (HES).


The nine layers

1 · Context. What the agent sees before generating a response: system prompt, RAG results, persistent memory, tool state. Context engineering is the discipline of deciding what’s in and what’s out. This is the layer where most “weird” agent behavior originates.
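The "what's in and what's out" decision can be sketched as an assembler with an explicit budget. Everything here (the system prompt text, the character budget, the ordering of memory before retrieved docs) is an illustrative assumption, not part of the framework:

```python
# Sketch of a context assembler: decide what the agent sees, under a budget.
SYSTEM_PROMPT = "You are a billing-support agent."  # illustrative

def assemble_context(query: str, retrieved_docs: list[str],
                     memory: list[str], budget_chars: int = 2000) -> str:
    """Build the prompt window: system prompt first, then memory, then
    retrieved docs, dropping whatever no longer fits the budget."""
    parts = [SYSTEM_PROMPT]
    for candidate in memory + retrieved_docs:
        used = sum(len(p) for p in parts)
        if used + len(candidate) > budget_chars:
            break  # an explicit "what's out" decision, not silent truncation
        parts.append(candidate)
    parts.append(f"User: {query}")
    return "\n\n".join(parts)

ctx = assemble_context("Why was I charged twice?",
                       retrieved_docs=["Refund policy: ..."],
                       memory=["User plan: Pro"])
```

The point of the sketch is that exclusion is a decision the code makes visibly, which is exactly what goes missing when context is "implicit in the prompt".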

2 · Constraint. What the agent can and cannot do: tool allowlist, permission scope, rate limits, time budget. Constraints are the biggest operational safety lever. An agent without clear constraints is a demo, not a product.
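A minimal sketch of the three checks named above (allowlist, rate limit, time budget), combined into one gate in front of every tool call. The tool names and limits are assumptions for illustration:

```python
import time

ALLOWED_TOOLS = {"search_docs", "read_ticket"}  # illustrative allowlist
MAX_CALLS_PER_MIN = 10
TIME_BUDGET_S = 30.0

class ConstraintError(Exception):
    """Raised when a tool call falls outside the agent's constraints."""

call_log: list[float] = []
started = time.monotonic()

def check_tool_call(tool: str) -> None:
    """Reject any call outside the allowlist, rate limit, or time budget."""
    if tool not in ALLOWED_TOOLS:
        raise ConstraintError(f"tool {tool!r} not allowlisted")
    now = time.monotonic()
    if now - started > TIME_BUDGET_S:
        raise ConstraintError("time budget exhausted")
    recent = [t for t in call_log if now - t < 60]
    if len(recent) >= MAX_CALLS_PER_MIN:
        raise ConstraintError("rate limit exceeded")
    call_log.append(now)
```

Keeping all constraint checks in one function makes the layer auditable: the allowlist is data, not prompt text.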

3 · Verification. How the system knows the agent’s output is correct before applying its effect: schema validation, dry runs, sandbox execution, regex asserts. Without verification, the agent’s mistakes go straight to the real world.
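Two of those checks, schema validation and a regex assert, can be sketched as a pure function that runs before any side effect. The field names, the `TCK-1234` ticket-id format, and the refund domain are illustrative assumptions:

```python
import re

def verify_refund_output(output: dict) -> list[str]:
    """Validate agent output before any side effect is applied.
    Returns a list of errors; empty means the effect may proceed."""
    errors = []
    # Schema check: required keys and types (illustrative schema).
    if not isinstance(output.get("ticket_id"), str):
        errors.append("ticket_id must be a string")
    amount = output.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        errors.append("amount must be a non-negative number")
    # Regex assert: ticket ids look like TCK-1234 (assumed format).
    if isinstance(output.get("ticket_id"), str) and \
            not re.fullmatch(r"TCK-\d{4}", output["ticket_id"]):
        errors.append("ticket_id has unexpected format")
    return errors
```

Because the verifier is pure (no I/O), it can double as a dry run: the same function gates both the sandbox and the real effect.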

4 · Feedback. How the agent receives signal about its output: accepted, rejected, edited, ignored. That signal flows back into prompt refinements, template updates, or policy changes. Without feedback, the agent doesn’t evolve.
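The four outcomes above can be captured as a tiny event log plus one aggregate that the refinement loop watches. This is a sketch under the assumption that feedback arrives as per-output events; the storage and metric are illustrative:

```python
from collections import Counter

# One feedback event per agent output; signal is one of the four outcomes.
VALID_SIGNALS = {"accepted", "rejected", "edited", "ignored"}
events: list[dict] = []

def record_feedback(output_id: str, signal: str) -> None:
    """Append one feedback event for a given agent output."""
    if signal not in VALID_SIGNALS:
        raise ValueError(f"unknown signal {signal!r}")
    events.append({"output_id": output_id, "signal": signal})

def acceptance_rate() -> float:
    """Fraction of outputs accepted as-is. A drop here is the trigger for
    prompt, template, or policy review described in the text."""
    if not events:
        return 0.0
    counts = Counter(e["signal"] for e in events)
    return counts["accepted"] / len(events)
```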

5 · Advisor. A second opinion under load: another model (or another agent, or a static rule) consulted when confidence drops, stakes rise, or the chain has run too long. The Advisor doesn’t replace the agent; it steps in at moments of risk.
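The three escalation triggers (low confidence, high stakes, long chain) reduce to one predicate. The thresholds and the function shapes are illustrative assumptions, not values prescribed by the framework:

```python
def needs_advisor(confidence: float, stakes: str, chain_steps: int,
                  conf_floor: float = 0.7, max_steps: int = 8) -> bool:
    """Decide whether to consult a second opinion (model, agent, or rule)."""
    return (confidence < conf_floor
            or stakes == "high"
            or chain_steps > max_steps)

def answer(primary_fn, advisor_fn, question, confidence, stakes, chain_steps):
    """The primary agent always drafts; the advisor only reviews the draft
    when a trigger fires -- it never replaces the agent outright."""
    draft = primary_fn(question)
    if needs_advisor(confidence, stakes, chain_steps):
        return advisor_fn(question, draft)
    return draft
```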

6 · Emotion. User friction signals captured as governance data: repeated retries, short tense messages, abandonment. It’s not the agent recognizing feelings; it’s the system recognizing that something is off and opening extra verification.
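A friction detector can be as crude as counting signals. The heuristics below (message length, all-caps, repeated punctuation) and the threshold are illustrative assumptions; the point is that the output is a governance signal, not a sentiment label:

```python
def friction_score(messages: list[str], retries: int) -> int:
    """Count friction signals in a user session: repeated retries plus
    short or tense messages. Heuristics are illustrative."""
    score = retries
    for msg in messages:
        text = msg.strip()
        if len(text) < 15:                   # short, clipped replies
            score += 1
        if text.isupper() or "!!" in text:   # tense phrasing
            score += 1
    return score

def open_extra_verification(messages: list[str], retries: int,
                            threshold: int = 3) -> bool:
    """Above the threshold, the system tightens verification for this session."""
    return friction_score(messages, retries) >= threshold
```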

7 · Durable pause. For irreversible actions (commit to prod, bulk email send, bank transfer), the agent stops and requests human confirmation with a durable timeout window. Missing this layer is the root cause of 80% of agent incidents in production.
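"Durable" is the key word: the pending confirmation must survive a process restart, so it lives in storage, not in memory. A minimal sketch, assuming a JSON file as the durable store and an hour-long default window (both illustrative):

```python
import json
import pathlib
import time

PENDING = pathlib.Path("pending_actions.json")  # survives process restarts

def request_confirmation(action_id: str, description: str,
                         ttl_s: float = 3600) -> None:
    """Persist an irreversible action as pending; it is NOT executed here."""
    pending = json.loads(PENDING.read_text()) if PENDING.exists() else {}
    pending[action_id] = {"description": description,
                          "expires_at": time.time() + ttl_s}
    PENDING.write_text(json.dumps(pending))

def confirm(action_id: str) -> bool:
    """Human confirms within the window; expired or unknown requests are void."""
    pending = json.loads(PENDING.read_text()) if PENDING.exists() else {}
    entry = pending.pop(action_id, None)
    PENDING.write_text(json.dumps(pending))
    return entry is not None and time.time() < entry["expires_at"]
```

In a real deployment the store would be a database and the confirmation would arrive over a channel (Slack, email, dashboard); the invariant is the same: the effect runs only after `confirm` returns true.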

8 · Confidence gating. The agent declares its confidence before declaring an answer. When confidence falls below a threshold, the system escalates to the Advisor or to a human. Calibrated confidence is more valuable than raw accuracy.
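The gate itself is a small routing function. The two thresholds are illustrative defaults, not part of the framework; what matters is that confidence is declared first and the route is decided by policy, not by the agent:

```python
def route(answer: str, confidence: float,
          gate: float = 0.75, human_gate: float = 0.4) -> tuple[str, str]:
    """Route a declared (answer, confidence) pair:
    high confidence -> answer directly,
    mid confidence  -> escalate to Advisor,
    low confidence  -> escalate to a human."""
    if confidence >= gate:
        return ("answer", answer)
    if confidence >= human_gate:
        return ("advisor", answer)
    return ("human", answer)
```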

9 · Failure corpus. A versioned repository of every failure observed in production: what came in, what went out, why it was wrong, which layer failed. The failure corpus feeds the other eight layers continuously. Without it, the system forgets its errors and repeats them.
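The record structure follows directly from the four questions in the text. A minimal sketch, assuming an in-memory list stands in for the versioned store and a content hash stands in for a version id:

```python
import datetime
import hashlib
import json

corpus: list[dict] = []  # in production: a versioned store, not a list

def record_failure(input_text: str, output_text: str,
                   reason: str, failed_layer: str) -> str:
    """Append one failure record: what came in, what went out,
    why it was wrong, and which layer failed. Returns a stable id."""
    entry = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "input": input_text,
        "output": output_text,
        "reason": reason,
        "failed_layer": failed_layer,
    }
    entry_id = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()).hexdigest()[:12]
    entry["id"] = entry_id
    corpus.append(entry)
    return entry_id

def failures_by_layer() -> dict[str, int]:
    """The feed back into the other layers: which layer fails most often?"""
    counts: dict[str, int] = {}
    for e in corpus:
        counts[e["failed_layer"]] = counts.get(e["failed_layer"], 0) + 1
    return counts
```

`failures_by_layer` is the simplest example of the corpus feeding the stack: a spike in one layer's count tells you where to harden next.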

How to apply

Use Harness Stack as an audit checklist: for each layer, does the agent have explicit implementation? “Implicit in the prompt” is a failure answer. Each layer should be code, configuration, or named policy — not hope.
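The audit rule above is mechanical enough to encode. A sketch, assuming the audit input is a map from layer name to a pointer at its implementation (file, config key, or policy name, all illustrative):

```python
LAYERS = ["Context", "Constraint", "Verification", "Feedback", "Advisor",
          "Emotion", "Durable pause", "Confidence gating", "Failure corpus"]

def audit(implementations: dict[str, str]) -> list[str]:
    """Return the layers that fail the audit: missing entirely, or
    answered with 'implicit in the prompt' (a failure answer)."""
    failing = []
    for layer in LAYERS:
        impl = implementations.get(layer, "").strip().lower()
        if impl in ("", "implicit in the prompt"):
            failing.append(layer)
    return failing
```

An agent passes only when `audit` returns an empty list: every layer points at code, configuration, or a named policy.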

Implementation order matters. Build layers 1-3 first (Context, Constraint, Verification); then 7-8 (Durable pause, Confidence gating); then 9 (Failure corpus). Layers 4-6 (Feedback, Advisor, Emotion) come after the agent is running.

Use cases applying Harness Stack

  • Harness Evolution System (HES) — proprietary system that automatically evolves the nine layers via trace analysis.
  • OpenClaw — multi-channel gateway applying layers 2 (Constraint) and 9 (Failure corpus) at the gateway level.

When to use

  • AI agent in production (not a demo, not a POC) taking real-world actions.
  • Agent incident audit: identifying which layer failed.
  • Brief for an AI engineer building a new agent on n8n/Cowork/OpenClaw stack.

When NOT to use

  • Quick prototyping or throwaway demos — overhead not justified.
  • Purely conversational agents without tool use — some layers (Verification, Failure corpus) don't apply.