Field Notes
Cost guardrails for production LLM apps
The eight runaway-bill patterns we've watched in the wild — and how to bound each one.
The bill that arrives without warning
Every team that has shipped an LLM app to production has at least one cost story. Ours include: a retry loop that multiplied a customer's bill 11x in a weekend; a streaming endpoint that left connections open for hours after the user closed the tab; an agent that recursively summarized its own context window until it hit the model's max-tokens limit on every call. None of these were exotic. All were preventable. The fix is the same in every case: guardrails enforced at the infrastructure layer, not at the application layer.
Here are eight patterns we've watched in the wild and the guardrail for each.
1. Per-request token cap
The most basic guardrail and the one most often missing. Every LLM call should have a hard cap on input tokens and a hard `max_tokens` on output, both sized to the task. A summarization endpoint shouldn't be able to send 100k tokens because some upstream service appended a debug payload.
Implement at the SDK wrapper layer, not in each call site. If a call wants more, it has to override explicitly. Default cap should be 2-3x the largest legitimate use case.
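A minimal sketch of that wrapper, assuming an OpenAI-style client; the cap values and the `estimate_tokens` helper are illustrative placeholders, not recommendations.

```python
DEFAULT_MAX_INPUT_TOKENS = 8_000     # size to roughly 2-3x your largest legitimate request
DEFAULT_MAX_OUTPUT_TOKENS = 1_024

def estimate_tokens(messages) -> int:
    # Crude stand-in (~4 characters per token); use a real tokenizer in production.
    return sum(len(m["content"]) for m in messages) // 4

def capped_completion(client, messages, *,
                      max_input_tokens: int = DEFAULT_MAX_INPUT_TOKENS,
                      max_output_tokens: int = DEFAULT_MAX_OUTPUT_TOKENS,
                      **kwargs):
    """Single chokepoint for every call site; raising a cap is an explicit override."""
    input_tokens = estimate_tokens(messages)
    if input_tokens > max_input_tokens:
        raise ValueError(f"{input_tokens} input tokens exceeds the {max_input_tokens} cap")
    # The output cap rides on the provider's own max_tokens parameter.
    return client.chat.completions.create(
        messages=messages, max_tokens=max_output_tokens, **kwargs
    )
```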
2. Per-user / per-tenant quota
Even with per-request caps, a single user can run a million requests if your app loops. Every tenant in a multi-tenant system needs a daily token budget enforced before the request reaches the model provider. Redis counter, sliding window, hard ceiling.
Costs roughly 1ms per request to check. Saves an unbounded amount when a customer's webhook integration goes into a retry storm.
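A sketch of that check, assuming Redis and a fixed daily window keyed by date (a true sliding window adds a sorted set, but the shape is the same); the budget number is illustrative.

```python
import datetime
import redis

r = redis.Redis()
DAILY_TOKEN_BUDGET = 2_000_000   # illustrative per-tenant ceiling

def check_and_charge(tenant_id: str, tokens: int) -> None:
    """Reject the request before it reaches the provider if the tenant is over budget."""
    key = f"llm_budget:{tenant_id}:{datetime.date.today().isoformat()}"
    used = r.incrby(key, tokens)   # atomic add-and-read, ~1ms round trip
    r.expire(key, 172_800)         # keep yesterday's counter around for debugging
    if used > DAILY_TOKEN_BUDGET:
        r.decrby(key, tokens)      # roll back the charge we just added
        raise RuntimeError(f"tenant {tenant_id} exceeded its daily token budget")
```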
3. Retry budget, not retry count
Most teams set "retries: 3" and call it done. This is wrong. Three retries on a request that costs $0.40 is fine. Three retries on a request that costs $4 is a problem if the failure is permanent. Budget retries by spend, not count. After $X of retries on this user / tenant / endpoint in the last hour, stop retrying and escalate to a human.
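A sketch of the spend-based budget, assuming the same Redis instance; the one-hour window and $2 ceiling are placeholders for whatever your unit economics support.

```python
import redis

r = redis.Redis()
RETRY_BUDGET_USD = 2.00      # per tenant + endpoint, per window
WINDOW_SECONDS = 3_600

def may_retry(tenant_id: str, endpoint: str, attempt_cost_usd: float) -> bool:
    """Charge each retry's estimated cost to a rolling counter; stop once it's spent."""
    key = f"retry_spend:{tenant_id}:{endpoint}"
    spent_cents = r.incrby(key, int(attempt_cost_usd * 100))
    r.expire(key, WINDOW_SECONDS)   # window resets an hour after the last retry
    # Once the budget is spent, stop retrying and escalate to a human instead.
    return spent_cents <= RETRY_BUDGET_USD * 100
```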
The 11x weekend bill we mentioned in the intro was a "retries: 5" config on a 90% error rate from a vendor outage.
4. Streaming connection timeout
Streaming endpoints feel free because the user is "actively reading." Often they aren't: browsers close, networks drop, users tab away. If you don't have a server-side max-stream-duration, you'll have streams running for the entire model timeout (often 10 minutes), generating tokens nobody reads.
Set it at 60-90 seconds. Track the abort rate. If it's high, your UX needs work, not your stream limit.
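A sketch of the server-side cap, assuming the provider SDK hands you an iterator of chunks; 75 seconds is just a point inside the 60-90 second range above.

```python
import time

MAX_STREAM_SECONDS = 75

def bounded_stream(chunks):
    """Relay streamed chunks, but cut the stream off at the server-side limit."""
    started = time.monotonic()
    for chunk in chunks:
        if time.monotonic() - started > MAX_STREAM_SECONDS:
            # Count this as an abort so the abort-rate metric stays honest.
            break
        yield chunk
```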
5. Context-window self-defense
Agents that summarize their own context, append tool outputs, or accumulate conversation history will eventually hit the context limit. The interesting failure mode isn't "request fails" — it's "request succeeds at the maximum-allowed token count, every single call, forever."
Track input-token distribution per endpoint. If P95 input tokens is within 20% of the model's context limit, you have a context-management bug. Truncate, summarize, or sliding-window before you hit the ceiling.
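A sketch of the sliding-window defense, assuming a `count_tokens` helper you already have; the 80% soft ceiling is an assumption, and the point is to trip well before the hard limit.

```python
CONTEXT_LIMIT = 128_000                    # the model's hard limit
SOFT_CEILING = int(CONTEXT_LIMIT * 0.8)    # trip long before you hit it

def trim_history(system_prompt: str, messages: list[dict], count_tokens) -> list[dict]:
    """Drop the oldest turns until the conversation fits under the soft ceiling."""
    kept = list(messages)
    while kept and count_tokens(system_prompt, kept) > SOFT_CEILING:
        kept.pop(0)   # oldest turn goes first; the system prompt always stays
    return kept
```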
6. Model fallback ladder
Every endpoint should have an answer to "if this model is rate-limited, what's plan B?" Usually: try the frontier model, fall back to the next tier down, fall back to a different provider, fall back to a cached / stale response, fall back to a human queue.
The ladder is not just for cost. It's for reliability. But it lets you put the cheapest acceptable model first when load is high, which is its own cost guardrail.
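A sketch of the ladder, with placeholder model names; the caller supplies the actual provider calls and decides which exceptions count as "move to the next rung."

```python
LADDER = ["frontier-model", "mid-tier-model", "other-provider-model"]  # placeholder names

def call_with_fallback(call_model, messages):
    """Walk the ladder top to bottom; call_model(model, messages) does the real work."""
    last_error = None
    for model in LADDER:
        try:
            return call_model(model, messages)
        except Exception as exc:   # narrow this to rate-limit / availability errors
            last_error = exc
    # The last rungs (cached response, human queue) live in the caller;
    # re-raise only if even those aren't available.
    raise last_error
```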
7. Observability with per-call cost annotation
You cannot guard what you don't measure. Every call should log: model, input tokens, output tokens, cost (computed), latency, user, endpoint, retry count. We use Helicone or Langfuse for this; rolling your own is fine. The point is the per-call line item, not the dashboard.
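If you do roll your own, the per-call line item can be a single structured log call; the pricing table here is an assumption, not current list prices.

```python
import json
import logging
import time

PRICE_PER_1K = {"example-model": {"in": 0.003, "out": 0.015}}   # assumed prices, USD per 1k tokens
log = logging.getLogger("llm.cost")

def log_call(model, input_tokens, output_tokens, latency_ms, user, endpoint, retries):
    """Emit one structured line per call with the computed cost attached."""
    price = PRICE_PER_1K[model]
    cost = input_tokens / 1000 * price["in"] + output_tokens / 1000 * price["out"]
    log.info(json.dumps({
        "ts": time.time(), "model": model, "endpoint": endpoint, "user": user,
        "input_tokens": input_tokens, "output_tokens": output_tokens,
        "cost_usd": round(cost, 6), "latency_ms": latency_ms, "retries": retries,
    }))
```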
Daily roll-up alert: cost per endpoint > 2x trailing 7-day average → page someone. Catches the new feature shipped on Friday that's running 5x the budget on Monday morning.
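The roll-up itself reduces to one comparison once you have daily per-endpoint spend; `daily_costs` here is assumed to be the last eight days of totals, newest last.

```python
def cost_anomaly(daily_costs: list[float]) -> bool:
    """True when today's spend exceeds 2x the trailing 7-day average."""
    *trailing, today = daily_costs
    return today > 2 * (sum(trailing) / len(trailing))
```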
8. The "free tier" trap
A surprising number of cost overruns come from silently outgrowing a vendor's free tier. "We were using the free 100k traces/month tier." Then traces hit 850k/month and the bill became $4,800. The team had no internal alert because the spend wasn't on their statement until day 31.
Audit every dependency for free-tier limits. Set internal alerts at 60%, 80%, 100% of each. Don't trust the vendor to tell you — they make money when you blow through.
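A sketch of the internal alert, assuming you can pull each dependency's current usage from its API or your own counters; deduplicating which threshold has already fired is left to your alerting layer.

```python
THRESHOLDS = (0.60, 0.80, 1.00)

def free_tier_alert(name: str, used: float, limit: float, alert) -> None:
    """Fire at 60%, 80%, and 100% of a dependency's free-tier limit."""
    ratio = used / limit
    crossed = [t for t in THRESHOLDS if ratio >= t]
    if crossed:
        alert(f"{name} at {ratio:.0%} of its free tier (crossed the {max(crossed):.0%} mark)")
```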
Putting it together
A complete cost-guardrail stack has three layers.
Pre-flight (before the call leaves your service): per-request token cap, per-tenant quota check, model selection from the fallback ladder.
In-flight (during the call): streaming timeout, observability annotation.
Post-flight (after the call returns): retry budget update, cost log line, anomaly check against trailing average.
Each layer is a few hours of engineering; putting all of them in is well under a sprint. The cost of not having them is unbounded, or rather, bounded only by the credit limit on whatever card the API key is attached to.
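One way the three layers could compose into a single request path, reusing the hypothetical helpers sketched above; the streaming timeout and the post-flight logging would wrap this same call.

```python
def guarded_call(client, tenant_id, endpoint, messages):
    # Pre-flight: token cap, tenant quota, model selection.
    check_and_charge(tenant_id, estimate_tokens(messages))
    response = call_with_fallback(
        lambda model, msgs: capped_completion(client, msgs, model=model),
        messages,
    )
    # Post-flight: retry-budget update, cost log line, anomaly check (see pattern 7).
    return response
```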
How we ship this
Every Pro-tier build ($250-$400/mo) ships with the first six guardrails by default — they're part of the productized scaffolding, not a customization. Custom-tier builds add the fallback ladder and the cost-anomaly alerting because both require integration with the customer's existing observability stack.
The reason cost guardrails are productized is that they don't vary much by use case. The patterns above work the same for support agents, RAG assistants, document extractors. We stopped customizing them years ago.
What you can do this week
If you have an LLM app in production with no per-request token cap, add it. One config line at the SDK level. Pick a number 3x your current P99 input tokens.
If you have it but no per-tenant quota, add it next. A Redis counter and a check function. Half a day.
If you have both but no per-call cost log line, add observability — even just structured logs piped to your existing log aggregator. You can't optimize what you can't measure, and you can't catch the runaway you don't see.
The bill won't surprise you if you've already drawn the lines.
QNTX AI Ops ships production LLM apps with guardrails as default. [See the Pro tier](/services/accelerator).