
Post-Mortems

The audit-first playbook for AI in operations

Before you build anything, audit what you have. Here's the two-week diagnostic we run.

Marcus Whitfield · Director, Operations · 8 min read

The pattern we keep seeing

A team ships an AI tool in Q1. It works for six weeks. By Q2 it's drifting. By Q3 it's mothballed. In Q4, the team scopes a new build with a different vendor, different model, different framework — and reproduces the same failure six months later.

Nobody audited why the first one died. The diagnostic was skipped because "we're moving on to v2." This is the most expensive mistake we see in operations AI. The next build inherits the same organizational gravity that killed the first one — wrong incentive on the champion, no eval discipline, dirty data that nobody owns. Different code, same outcome.

The audit-first playbook is two weeks of structured diagnostic before any new build is scoped. It's the cheapest engagement we run, and the one with the highest leverage on the next twelve months of AI spend.

The five pillars of the audit

The audit looks at five layers. Each layer has specific questions and specific evidence we ask for.

1. Champion alignment

Who owns the AI system? What is their variable compensation tied to? Is the metric the AI moves the same metric the champion is paid against?

We've audited systems where the champion was incentivized on cost reduction while the AI was moving a growth metric, and nobody upstream cared because the dashboard didn't roll up to anyone's bonus. The system died of indifference.

Evidence we ask for: org chart of the AI program, written compensation linkages, the dashboard the executive committee actually opens.

2. Pipeline ownership

Where does the data come from? Who maintains the pipeline? When was it last broken? How was the breakage detected — by a monitor or by a customer?

The most common technical cause of AI degradation is upstream pipeline drift that nobody caught. A schema change, a deprecation, a silent null where there used to be a value. The AI keeps running and slowly produces garbage.

Evidence we ask for: pipeline diagram, last 90 days of incidents, the monitoring dashboard, names of the people responsible.
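
To make "detected by a monitor, not a customer" concrete, here is a minimal sketch of the kind of per-batch check that catches silent drift. The field names and thresholds are invented for the example; the point is that schema membership and null rates get asserted on every batch instead of assumed.

```python
# Minimal upstream-drift check, run per batch (field names and
# thresholds are invented for illustration).

EXPECTED_FIELDS = {"order_id", "region", "unit_cost", "shipped_at"}
MAX_NULL_RATE = 0.02  # alert when more than 2% of a field goes null

def check_batch(records: list[dict]) -> list[str]:
    """Return drift alerts for one batch of upstream records."""
    if not records:
        return ["empty batch: upstream produced no records"]

    alerts = []
    # Schema drift: a field added or dropped upstream.
    seen = set().union(*(r.keys() for r in records))
    if missing := EXPECTED_FIELDS - seen:
        alerts.append(f"missing fields: {sorted(missing)}")
    if extra := seen - EXPECTED_FIELDS:
        alerts.append(f"unexpected fields: {sorted(extra)}")

    # Silent nulls: the field still exists but stops carrying values.
    for field in EXPECTED_FIELDS & seen:
        null_rate = sum(r.get(field) is None for r in records) / len(records)
        if null_rate > MAX_NULL_RATE:
            alerts.append(f"{field}: null rate {null_rate:.1%}")
    return alerts
```

Wired to a pager instead of a log file, twenty lines like this are the difference between "a monitor caught it" and "a customer caught it."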

3. Evaluation discipline

Is there an eval set? When was it last refreshed? Does anyone look at the production metrics? What's the cadence of "eval ran, action taken"?

A surprising number of "eval suites" are never run after week 4 of the build. They exist in the repo. They were green when shipped. Nobody has run them since.

Evidence we ask for: the eval set itself, the dashboard, the last three changes that were made because the eval surfaced something.
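
The cadence question ("eval ran, action taken") can be enforced mechanically. A minimal sketch, assuming a hypothetical score_task function, a stored baseline, and an append-only history file; none of these names come from a specific eval framework:

```python
# Scheduled eval run that fails loudly on regression (illustrative;
# score_task, the eval set format, and the baseline are assumptions,
# not the API of any particular framework).
import json
import statistics
import sys
from datetime import datetime, timezone

BASELINE = 0.91        # suite score when the system shipped
MAX_REGRESSION = 0.03  # tolerated drop before the run fails

def run_eval(eval_set: list[dict], score_task) -> None:
    scores = [score_task(case["input"], case["expected"]) for case in eval_set]
    score = statistics.mean(scores)

    # Append to a log so "when was this last run?" has an answer.
    record = {"ran_at": datetime.now(timezone.utc).isoformat(), "score": score}
    with open("eval_history.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

    if score < BASELINE - MAX_REGRESSION:
        sys.exit(f"eval regressed: {score:.3f} vs baseline {BASELINE:.3f}")
    print(f"eval ok: {score:.3f}")
```

Run weekly from cron or CI, the history file gives "when was the eval last run?" a checkable answer, and the hard exit makes a regression impossible to ignore.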

4. Cost reality

What does the system actually cost per task, including verification, retrieval, observability, and the human time it takes to maintain? How does that compare to the original ROI case?

We have yet to do an audit where the actual cost matched the slide. Usually it's 2-5x. Sometimes 10x. The slide was made before the system met production volume and was never refreshed.

Evidence we ask for: trailing 90 days of provider bills, observability spend, an honest estimate of human-in-the-loop hours.
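
The fully loaded cost per task is simple arithmetic once the hidden line items are on the page. A sketch with illustrative numbers (every figure here is invented for the example):

```python
# Fully loaded cost per task over a trailing 90 days. Every figure
# below is invented; plug in your own bills and hours.

tasks_completed = 42_000      # production tasks in the window
provider_bill   = 6_300.00    # model API spend, USD
retrieval_infra = 1_100.00    # vector store / search spend, USD
observability   = 900.00      # tracing and eval tooling spend, USD
human_hours     = 160         # human-in-the-loop + maintenance hours
loaded_hourly   = 85.00       # fully loaded hourly cost, USD

total_cost = (provider_bill + retrieval_infra + observability
              + human_hours * loaded_hourly)
per_task = total_cost / tasks_completed

print(f"90-day loaded cost: ${total_cost:,.0f} -> ${per_task:.3f}/task")
# Compare this against the per-task figure on the original ROI slide.
```

In this toy example the human hours dominate the total, which is exactly why the audit asks for an honest estimate of human-in-the-loop time alongside the provider bills.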

5. Kill criterion

Was there a written condition under which the system would be retired? Has any condition been hit? If yes, what happened?

Almost no AI systems have written kill criteria. As a result, dying systems aren't retired — they're absorbed into other initiatives, defunded over time, or quietly mothballed. The organization never learns from the death because there was no death event to learn from.

Evidence we ask for: the original ROI memo, any written success criteria, current performance against those criteria.
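
A kill criterion only has force if it is written down and checkable. One minimal sketch of what "written" can mean, with invented thresholds and a hypothetical monthly review step:

```python
# Kill criteria as explicit data, checked on a schedule (thresholds
# are invented; 'breached' would be set from live metrics at review).
from dataclasses import dataclass

@dataclass
class KillCriterion:
    name: str
    condition: str
    breached: bool

CRITERIA = [
    KillCriterion("cost", "loaded cost/task above 3x the ROI-memo figure for two quarters", False),
    KillCriterion("quality", "eval score more than 0.05 below baseline for 60 days", False),
    KillCriterion("usage", "fewer than 10 operator sessions/week for a full quarter", False),
]

def monthly_review(criteria: list[KillCriterion]) -> None:
    breached = [c for c in criteria if c.breached]
    if breached:
        names = ", ".join(c.name for c in breached)
        print(f"RETIREMENT REVIEW TRIGGERED: {names}")  # the death event, on the record
    else:
        print("all kill criteria clear")
```

The mechanism matters less than the artifact: a dated, explicit record that a retirement condition was or was not hit, so a dying system produces a death event the organization can actually learn from.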

What the audit produces

A 12-15 page memo with five sections (one per pillar), each covering the current state, the gap to a working state, the lowest-cost remediation, and a recommendation on whether to fix in place or retire and rebuild.

Plus a one-page executive summary that names: the single highest-leverage fix the team can make in the next 30 days, the single most expensive thing the current architecture commits the team to over the next 12 months, and a yes/no on whether the team is ready to scope a new build.

The yes/no is the most valuable part. About 30% of the audits we run end in "no — fix the existing system first, don't start a new one." That recommendation often saves the customer a $200k+ second mistake.

How long it takes

Two weeks of elapsed time, roughly 30-40 hours of our work. We interview the champion, the engineering lead, the operator(s) who use the system, and the finance person who owns the budget. We pull the actual production traces and the actual bills. If an eval set exists, we replay it against the current production system.

We deliberately don't try to fix anything in the audit. The audit's job is to produce honest information. The fixing is a separate engagement, with its own scope and its own price, and either we do that work or we hand the memo to the team and they do it themselves. About a third of audits don't lead to a follow-on engagement with us, which is exactly what we'd predict if the audits are honest.

Why this fits before any Pro or Custom build

We've started recommending the audit on roughly half the new-build conversations we have. Not all — sometimes a team is genuinely greenfield with no prior AI system to audit. But anyone with a previous attempt should diagnose it before committing to a successor.

The audit's price is small relative to a Pro retainer or a Custom build. The information it produces routinely changes the scope of the build that follows. We'd rather discover in week 2 of the diagnostic that the real problem is pipeline ownership than in week 12 of a build that there's no clean data source.

This is also the discipline we apply to ourselves on any system we inherit. Before we change a single prompt, we read the eval set, run it against current production, look at the cost dashboard, and interview the operator. Half the time, the prompt change we were going to make would have been a fix on the wrong layer.

What you can do this week

If you have a previous AI system that didn't work, write down — honestly, on one page — what failed and why. Champion alignment, pipeline, evals, cost, kill criterion. Even doing this informally will surface most of what a formal audit would.

If the answer is "we don't actually know why it failed," that's the signal that an audit is worth its price. The cost of not doing it shows up six months into the next build.

QNTX AI Ops runs structured AI audits before new builds. [Schedule a diagnostic](/contact).

Tags: audit, operations, diagnostic, implementation