The eval harness is the deliverable, not the feature

Why we ship the test suite before the agent — and what changes when you do.

Priya Ramanathan · Principal, Engineering · 8 min read

A small reframing with large consequences

When we scope a Pro or Custom engagement, the deliverable in week one is not a working agent. It's the eval harness — the test suite, the labeled dataset, the scoring rubric, and the dashboard that displays today's number. The agent comes after. This was not always how we worked. We changed because the alternative kept producing agents that demoed beautifully and degraded quietly.

The reframe: the eval harness is the product. The agent is the implementation of the contract the harness enforces. You can swap models, swap prompts, swap retrieval strategies, swap providers — the harness stays. It's the only artifact that matters past month three.

What goes in the harness

A complete harness has six parts. We've stopped shipping anything without all six.

1. A labeled eval set, version-pinned in git

50–500 examples per agent step, labeled with the correct output and the criteria that make an output "correct." Pinned by file hash; tagged with the agent version it certified. When you replay it next quarter, you replay the exact same set, not a drifted snapshot.
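
As one illustration, pinning can be as simple as recording the eval file's hash next to the agent version it certified, and refusing to replay a drifted file. The paths and manifest layout below are assumptions for the sketch, not a prescribed format.

```python
import hashlib
import json
from pathlib import Path

def pin_eval_set(eval_path: str, agent_version: str,
                 manifest_path: str = "eval_manifest.json") -> str:
    """Record the SHA-256 of the eval file next to the agent version it certified."""
    digest = hashlib.sha256(Path(eval_path).read_bytes()).hexdigest()
    manifest = {"eval_file": eval_path, "sha256": digest, "agent_version": agent_version}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return digest

def verify_pin(manifest_path: str = "eval_manifest.json") -> None:
    """When replaying next quarter, fail loudly if the file on disk has drifted from the pin."""
    manifest = json.loads(Path(manifest_path).read_text())
    current = hashlib.sha256(Path(manifest["eval_file"]).read_bytes()).hexdigest()
    if current != manifest["sha256"]:
        raise RuntimeError(f"Eval set {manifest['eval_file']} has drifted from the pinned snapshot")
```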

2. Step-level scoring, not just end-to-end

If the agent has four steps, the harness scores each step independently in addition to the end-to-end trace. End-to-end-only metrics hide which step is regressing. Step-level lets you isolate the regression to one prompt and fix only that one.
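
A rough sketch of what step-level scoring looks like, assuming each step's output is available in the trace; the step keys and scorer signature are illustrative, not a fixed interface.

```python
from typing import Callable, Dict

# A scorer takes (expected, actual) for one step and returns a score in [0, 1].
StepScorer = Callable[[dict, dict], float]

def score_trace(trace: Dict[str, dict],
                expected: Dict[str, dict],
                step_scorers: Dict[str, StepScorer],
                end_to_end_scorer: StepScorer) -> Dict[str, float]:
    """Score every step independently, then the whole trace, so a regression
    shows up as one step's number dropping while the others hold."""
    scores = {step: scorer(expected[step], trace[step])
              for step, scorer in step_scorers.items()}
    scores["end_to_end"] = end_to_end_scorer(expected["final"], trace["final"])
    return scores
```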

3. Both deterministic and judge scoring

Deterministic checks (regex match, JSON schema, exact equality) for things that have right answers. Judge models (a smaller model with a rubric) for things that don't: tone, completeness, faithfulness to source. Use both. Deterministic alone misses the entire class of "factually wrong but parses fine." Judge-only is expensive and flaky.
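
A minimal sketch of combining the two: real regex and equality checks on one side, and the judge model injected as a callable on the other. The rubric format and the judge client are whatever you already use, so they stay abstract here.

```python
import re
from typing import Callable

def deterministic_checks(output: str, expected: dict) -> bool:
    """Checks with right answers: a required pattern and an optional exact match."""
    if expected.get("must_match") and not re.search(expected["must_match"], output):
        return False
    if expected.get("exact") is not None and output.strip() != expected["exact"]:
        return False
    return True

def score_example(output: str, expected: dict, rubric: str,
                  judge: Callable[[str], float]) -> dict:
    """`judge` wraps whatever small judge model you already call: it takes a
    prompt and returns a 0-1 score against the rubric."""
    judge_prompt = f"Rubric:\n{rubric}\n\nCandidate answer:\n{output}\n\nScore from 0 to 1:"
    return {
        "deterministic_pass": deterministic_checks(output, expected),
        "judge_score": judge(judge_prompt),
    }
```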

4. A "false-confident" check

This is the test that flags when the agent asserts something not present in its retrieved context. It's the single highest-value check in any RAG-shaped agent. We've never seen a production failure from a too-cautious agent. We've seen many from a too-confident one.

The check: extract assertions from the agent's output, check each against the retrieved context, fail the example if any assertion isn't grounded. It catches the failure mode that burned three of the last four teams who came to us with "AI for support" projects that publicly embarrassed them.
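
Here is the shape of that check in a few lines. Real assertion extraction and grounding usually lean on a model; the sentence split and word-overlap test below are deliberately naive placeholders that show the structure of the check, not the method itself.

```python
import re

def extract_assertions(output: str) -> list[str]:
    # Placeholder extraction: treat each sentence as one assertion.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", output) if s.strip()]

def is_grounded(assertion: str, context: str, threshold: float = 0.6) -> bool:
    # Placeholder grounding test: enough of the assertion's content words
    # must appear somewhere in the retrieved context.
    words = {w.lower() for w in re.findall(r"[A-Za-z0-9']+", assertion) if len(w) > 3}
    if not words:
        return True
    hits = sum(1 for w in words if w in context.lower())
    return hits / len(words) >= threshold

def false_confident_check(output: str, retrieved_context: str) -> tuple[bool, list[str]]:
    """Fail the example if any assertion in the output is not grounded in the context."""
    ungrounded = [a for a in extract_assertions(output)
                  if not is_grounded(a, retrieved_context)]
    return len(ungrounded) == 0, ungrounded
```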

5. A cost-per-task line

The harness reports cost per evaluated task, not just accuracy. If accuracy went up 4 points and cost went up 4x, that's not a win — it's a trade. The harness surfaces the trade so the team can make the call deliberately.
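
A sketch of that line, assuming the harness records token counts per task; the per-million-token prices are placeholders to swap for your provider's actual rates.

```python
def cost_per_task(input_tokens: int, output_tokens: int,
                  price_in_per_mtok: float = 3.00,
                  price_out_per_mtok: float = 15.00) -> float:
    # Placeholder prices in USD per million tokens; substitute your provider's rates.
    return (input_tokens * price_in_per_mtok + output_tokens * price_out_per_mtok) / 1_000_000

def report(results: list[dict]) -> dict:
    """Each result row carries pass/fail plus the token counts for that task."""
    accuracy = sum(r["passed"] for r in results) / len(results)
    avg_cost = sum(cost_per_task(r["input_tokens"], r["output_tokens"])
                   for r in results) / len(results)
    return {"accuracy": round(accuracy, 3), "cost_per_task_usd": round(avg_cost, 4)}
```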

6. A dashboard the operator looks at

Not the engineer. The operator — the support manager, the merchandising lead, the underwriter. If they can't read the dashboard without an engineer present, the harness is incomplete. Numbers that aren't seen don't influence behavior.

What changes when the harness is the contract

Three things change immediately when you ship the harness first.

Scope conversations get unstuck. "Should the agent do X?" becomes "what eval would prove the agent can do X?" If the eval is hard to write, the feature is probably ill-defined. Most scope creep dies at this question.

Model choice becomes a configuration. With the harness in place, swapping Sonnet for Haiku, or Claude for GPT, or a fine-tune for a base model, becomes a one-day experiment. Run the harness. Compare numbers. Pick the winner. Without the harness, model choice is a religious argument with vibes for evidence.
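
What "configuration" means here can be as plain as the sketch below, assuming a run_harness entry point that accepts a model name and returns the report from the cost section above; the model identifiers are illustrative.

```python
from typing import Callable

def compare_models(model_names: list[str],
                   run_harness: Callable[..., dict]) -> dict:
    """Run the identical harness once per candidate and put the two numbers
    that matter side by side."""
    reports = {name: run_harness(model=name) for name in model_names}
    for name, rep in reports.items():
        print(f"{name}: accuracy={rep['accuracy']}, cost/task=${rep['cost_per_task_usd']}")
    return reports

# e.g. compare_models(["candidate-model-a", "candidate-model-b"], run_harness)
```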

Drift gets caught the day it happens. A regression in production trips the same harness that approved the agent in CI. The fix loop stays under 24 hours. Without the harness, drift is discovered weeks later by a customer.
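
The gate itself can be a few lines, assuming a pinned baseline report checked into the repo; the file name and tolerance below are illustrative.

```python
import json
from pathlib import Path

def check_no_regression(current: dict,
                        baseline_path: str = "eval_baseline.json",
                        tolerance: float = 0.02) -> None:
    """Raise if accuracy drops below the pinned baseline; wire this into CI and
    into the production monitor so both trip the same check."""
    baseline = json.loads(Path(baseline_path).read_text())
    if current["accuracy"] < baseline["accuracy"] - tolerance:
        raise AssertionError(
            f"Accuracy regressed: {current['accuracy']:.3f} vs baseline {baseline['accuracy']:.3f}"
        )
```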

The objection we hear most

"We don't have labeled data." Almost nobody does on day one. The labeled set is the work, not the prerequisite.

We start every Pro engagement with a labeling session — usually 2–4 hours with the operator champion (the person who lives the workflow today). 100 examples in an afternoon, scored by the person who knows what "right" looks like. That set is more valuable than the next month of model iteration. It becomes the seed that grows; we add 10–30 examples per week pulled from production traces.

The "we don't have data" objection is almost always cover for "we don't want to commit to what right looks like." Which is exactly why the eval has to come first — it forces the commitment.

A concrete case

A logistics SaaS came to us with two failed internal agent prototypes. Same model, same prompt patterns, similar retrieval. The third attempt — ours — landed at 94% classification accuracy and shipped in 8 days. The difference was not the model. It was that we built the 1,200-example eval set on day one, before writing a single prompt. The prompt iteration that followed was directed by failure modes the eval surfaced. The first two attempts had iterated against vibes.

Cost of building the eval: about 6 hours of senior engineering plus 4 hours of the customer's lead support agent labeling. Cost of not building it: two prior failed projects and a CTO who'd written off "AI for support" as two years out.

What you can do this week

If you have an agent without an eval, write 50 labeled examples. Pick the most common task. Define what "right" looks like in three columns: deterministic checks, judge criteria, and the false-confident check. Run it. The number you get back is your real starting line.

You will be tempted to skip this and go straight to prompt iteration. Don't. The reason your agent feels stuck is that every change moves the number in an unmeasured direction.
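
If it helps, one way to lay out those 50 examples is a JSONL row per example carrying all three columns; the field names and the example content below are illustrative, not a required schema.

```python
import json

example = {
    "input": "Customer asks whether order 48213 ships internationally.",
    "expected": {
        "deterministic": {"must_match": r"48213", "exact": None},
        "judge_rubric": "Complete, polite, and cites only the shipping policy provided.",
        "false_confident": True,  # run the groundedness check on this example
    },
    "retrieved_context": "Shipping policy: international shipping to EU and UK only.",
}

with open("starter_set.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")
```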

QNTX AI Ops ships eval-first on every Pro and Custom build. [See how it works](/services/accelerator).

evals · testing · production · engineering