Frameworks
How we ship multi-step agents that don't drift
The four primitives every chained agent needs before it touches a customer.
The drift you can't blame on the model
Every multi-step agent we've shipped that survived a year in production has the same four primitives. Every one we've inherited that was drifting was missing at least two of them. The model is almost never the cause. The cause is usually structural — the team built the agent and forgot to build the scaffolding that keeps it honest.
A "multi-step agent" here means anything where the model makes more than one decision in a single task: classify → retrieve → draft → confirm, or plan → tool-call → reason → write. The compounding-error problem is real. If each step is 95% reliable, a four-step chain is 81% reliable end-to-end. You don't fix that with a better model. You fix it with the four primitives below.
Primitive 1: A pinned eval set, refreshed weekly
The eval set is the deliverable. Not the agent. The agent is what you ship to extract value from the eval set.
A useful eval set has three properties. It's pinned — the version of the eval set is checked into git and tagged with the agent version it certified. When the agent regresses, you can replay the exact eval that passed last week. It's stratified — the set covers each step in the chain independently, plus end-to-end traces. A failing classify step shouldn't be discoverable only by an end-to-end test, because then you don't know which step broke. It's refreshed weekly — pulled from production traces with a labeling pass. The corpus your agent sees in week 12 is not the corpus it saw in week 1, and a stale eval will greenlight a drifted agent every time.
The cheapest version of this is a Promptfoo config in your repo with a JSON eval set per step. The expensive version is Langfuse + a labeling tool + a weekly review. Both work. The version that doesn't work is "we'll add evals later."
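As a rough illustration of the cheap version, here is what loading and scoring a pinned, per-step eval set could look like. The directory name, file layout, and field names are placeholders, not a prescribed format:

```python
# Sketch of a pinned, stratified eval set: one JSON file per step, checked into
# the repo and tagged alongside the agent version it certified.
import json
from pathlib import Path

EVAL_DIR = Path("evals/v12")  # pinned: tag this directory with the agent version

def load_cases(step: str) -> list[dict]:
    # One file per step (classify.json, retrieve.json, ...) plus end_to_end.json.
    return json.loads((EVAL_DIR / f"{step}.json").read_text())

def score_step(step_fn, cases: list[dict]) -> float:
    # Stratified: each step is scored in isolation, so a regression points at
    # the step that broke instead of hiding inside the end-to-end number.
    hits = sum(1 for c in cases if step_fn(c["input"]) == c["expected"])
    return hits / len(cases)
```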
Primitive 2: A confidence gate at every handoff
Between every step in the chain, the agent should be able to answer: am I confident enough to continue, or should I escalate?
Two practical patterns. Heuristic gates — token logprobs, retrieval similarity scores, schema-validation pass/fail. Cheap, fast, brittle if used alone. Judge model gates — a small fast model (Haiku, GPT-4o-mini) scoring the previous step's output against criteria. More expensive (~10–30% of step cost), much more robust.
The gate doesn't decide for the agent. It routes. Below threshold → human, with the partial work and the gate's reasoning attached. Above threshold → continue. The most expensive failure mode in production agents is silently continuing past a low-confidence step. A missing draft is cheaper than a wrong one — every time.
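The gate itself is a small piece of code; the discipline is in the routing. A minimal sketch, with the scoring function, threshold, and queue name as placeholders:

```python
# Minimal gate sketch: the gate routes, it never silently continues.
# Threshold, queue name, and the scoring function are placeholders.
from dataclasses import dataclass

@dataclass
class GateResult:
    confidence: float   # 0..1, from a heuristic, a judge model, or a blend
    reasoning: str      # attached to the escalation so the human sees why

def gate(step_output: dict, score_fn, threshold: float = 0.8) -> dict:
    result: GateResult = score_fn(step_output)
    if result.confidence < threshold:
        # Below threshold: stop the chain, hand partial work + reasoning to a human.
        return {"route": "human_queue", "partial": step_output, "why": result.reasoning}
    # Above threshold: continue to the next step.
    return {"route": "continue", "payload": step_output}
```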
We tune gate thresholds against the eval set, not in production. Production tuning gets you a great-feeling demo and a 9% wrong-answer rate.
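Tuning against the eval set can be as simple as a sweep: score every eval case, then pick the threshold that fits your escalation budget. A sketch, assuming each eval case already carries a gate confidence and a correctness label:

```python
# Sketch: sweep gate thresholds against the pinned eval set, not production.
# Assumes each case has a "confidence" score and a boolean "correct" label.
def sweep_thresholds(cases: list[dict], thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)) -> None:
    for t in thresholds:
        passed = [c for c in cases if c["confidence"] >= t]
        escalated = len(cases) - len(passed)
        wrong_passed = sum(1 for c in passed if not c["correct"])
        print(f"threshold={t:.2f}  escalation_rate={escalated / len(cases):.0%}  "
              f"wrong_answers_let_through={wrong_passed}")
```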
Primitive 3: A typed schema for state between steps
If your agent passes free-text strings between steps, you have an agent that will drift. Period.
Every step's output gets a schema (Pydantic, Zod, JSON Schema — pick one). The next step parses against the schema before doing any work. If parsing fails, the step retries with a structured-output mode or escalates. This catches roughly 60% of the silent failures we see in inherited codebases — a step quietly produced text the next step partially misinterpreted, and the agent kept walking.
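A minimal Pydantic sketch of the parse-before-work pattern; the model and field names are illustrative:

```python
# Sketch: typed state between steps. The next step only ever sees a validated
# object, never free text. Field names here are illustrative.
from pydantic import BaseModel, ValidationError

class ClassifyOutput(BaseModel):
    intent: str
    confidence: float

def parse_or_escalate(raw: str):
    try:
        return ClassifyOutput.model_validate_json(raw)
    except ValidationError as err:
        # Retry with a structured-output mode, or escalate with the error attached.
        return {"route": "escalate", "error": str(err), "raw": raw}
```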
Bonus: typed state makes the audit trail trivially loggable. Every transition is a row. Every row is replayable.
Primitive 4: A daily drift probe
The agent runs in production. The eval runs in CI. What watches the gap?
A daily drift probe samples N production traces (we use 50–200, depending on volume), runs them against the current eval scoring rubric, and posts a single number to a dashboard: today's production accuracy on the same metric the eval uses. When that number drops more than 2 points week-over-week, somebody gets a notification. Not "we should look at evals." A specific number on a specific dashboard with a specific alert.
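The probe itself is small. A sketch, with the trace source, scoring rubric, and alert hook stubbed out as placeholders you'd wire to whatever you already run:

```python
# Sketch of a daily drift probe. Trace source, scoring rubric, and alert hook
# are placeholders; wire them to your own tracing store and notification channel.
import random

def drift_probe(traces: list[dict], score_fn, last_week: float,
                sample_size: int = 100, alert_fn=print,
                drop_threshold: float = 0.02) -> float:
    sample = random.sample(traces, min(sample_size, len(traces)))
    today = sum(score_fn(t) for t in sample) / len(sample)  # same metric the eval uses
    if last_week - today > drop_threshold:  # more than 2 points week-over-week
        alert_fn(f"Drift alert: accuracy {today:.1%}, down from {last_week:.1%} last week")
    return today  # the single number that goes on the dashboard
```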
The probe is the cheapest piece of infrastructure on this list and the one most often skipped. We've watched a $2k/month agent silently drift to 64% accuracy over six weeks because nobody looked. The probe would have caught it on day 4.
What this looks like in a Pro-tier build
The Pro tier ($250–$400/mo) ships all four primitives by default, because the whole point of productizing the build is that the scaffolding is non-negotiable. Day 1: pinned eval set written before the agent. Days 2–3: typed schemas for every transition. Days 4–5: confidence gates with thresholds tuned against the eval. Day 6: drift probe wired to the dashboard. Day 7 onward: the actual agent, layered on top of the scaffolding.
The reason this fits in a one-week productized build is that we've stopped customizing the scaffolding. The scaffolding is the same on every Pro engagement. What's customized is the agent — the chain, the tools, the corpus. Everything underneath is the same shape.
What you can do this week
If you have an agent in production with no eval set, build one. Doesn't have to be perfect. 50 examples beats zero examples by an enormous margin.
If you have an eval set but no drift probe, write the probe. It's roughly 80 lines of Python plus a cron.
If you have both but no confidence gates, add them at the most expensive handoff first — typically the one feeding a write-action (sending an email, posting to an API, executing a transaction). Below-threshold routes to a human queue. You'll be surprised how much you catch in the first week.
The model will keep getting better. The drift will keep happening anyway. The teams whose agents survive are the ones who built the scaffolding first.
QNTX AI Ops ships production AI systems with eval, gating, and drift probes built in. [See the Pro-tier playbook](/services/accelerator).