Decision Tools
When to use RAG vs. fine-tuning in 2026
A practical decision tree, with the questions we actually ask in scoping calls.
The argument that won't die
Every quarter, someone asks us "should we be fine-tuning?" The honest answer most of the time is "no, but not for the reason you think." The argument has gotten muddier in 2026 because the tooling has improved on both sides — managed RAG is faster to ship than ever, and parameter-efficient fine-tuning (LoRA, QLoRA) has dropped the cost floor of a custom adapter to roughly $50–$500.
So the decision is no longer "which is cheaper" or "which is technically possible." Both are cheap. Both are possible. The decision is which one matches the shape of your problem. Below is the tree we use in scoping calls.
Question 1: Is the gap a knowledge gap or a behavior gap?
This is the question that resolves 80% of the conversation.
A knowledge gap means the model doesn't know facts it needs. Your product changelog, your customer's account history, your internal policy doc, your catalog. The base model has never seen this. RAG is the answer. Fine-tuning here is a category error — you'd be asking the model to memorize a corpus that changes weekly.
A behavior gap means the model knows the facts but doesn't respond the way you need. Tone, output format, multi-step reasoning patterns specific to your domain, refusing classes of requests, structured output that the prompt-engineering ceiling can't reach. Fine-tuning earns its keep here.
About 70% of the projects we scope are knowledge gaps. People conflate "the model doesn't behave right on my data" with "we need to fine-tune," when the actual fix is "give it the right context."
Question 2: How fresh does the answer need to be?
If the underlying ground truth changes more than monthly, fine-tuning is structurally wrong. You'd be retraining as fast as your data drifts. RAG is mandatory.
If the ground truth is stable for years (medical guidelines, legal precedent, fixed product specs), fine-tuning is a coherent option. Even then, RAG is usually the cheaper first move.
A useful rule: if you can't answer "when does the source-of-truth update?" in one sentence, don't fine-tune. You don't have the operational discipline to keep a tuned model fresh.
Question 3: How big is the prompt, and how often is it served?
RAG has a per-call cost: retrieval + context tokens. At low volume (under 10k calls/day), the cost is irrelevant. At high volume (millions/day), context tokens dominate the bill. A fine-tuned model with the knowledge baked in can serve the same task on shorter prompts.
Worked numbers: a 6k-token RAG prompt at $3/M input tokens is $0.018/call. At 5M calls/month, that's $90k/month. The same task on a fine-tuned model with a 1k-token prompt is $0.003/call, or $15k/month. The fine-tune amortizes its training cost (roughly $200–$2,000) in a single day at that volume.
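If you want to sanity-check these numbers against your own volume, the arithmetic is easy to script. A minimal sketch using the article's assumed prices (your provider's rates and your token counts will differ):

```python
# Break-even sketch for the worked numbers above. Prices and token
# counts are this article's assumptions, not live vendor pricing.

PRICE_PER_M_INPUT = 3.00     # dollars per 1M input tokens (assumed)
CALLS_PER_MONTH = 5_000_000

def monthly_cost(prompt_tokens: int) -> float:
    """Monthly input-token spend for a given prompt size."""
    per_call = prompt_tokens * PRICE_PER_M_INPUT / 1_000_000
    return per_call * CALLS_PER_MONTH

rag = monthly_cost(6_000)    # $90,000/month at 6k-token prompts
tuned = monthly_cost(1_000)  # $15,000/month at 1k-token prompts
savings_per_day = (rag - tuned) / 30  # roughly $2,500/day

# A $200-$2,000 fine-tune pays for itself within a day at this volume.
for training_cost in (200, 2_000):
    print(f"${training_cost:,} fine-tune breaks even in "
          f"{training_cost / savings_per_day:.2f} days")
```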
But — and this matters — fine-tuning only helps if the task is genuinely repetitive enough that the same compressed prompt works. If queries vary widely, you can't compress the prompt without losing accuracy.
Question 4: How auditable does the answer need to be?
RAG outputs naturally cite the documents they pulled from. You can trace any answer back to its source. For regulated work — finance, legal, healthcare, compliance — this is non-negotiable. The audit trail is the product.
Fine-tuned models cannot tell you why they said what they said. The knowledge is baked into weights. For regulated work, this means fine-tuning is mostly limited to behavior shaping (tone, format) rather than knowledge replacement.
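Concretely, the trail falls out of the architecture: every chunk that went into the prompt can be stored next to the answer it produced. A hedged sketch of that shape (field names are illustrative, not any framework's schema):

```python
# Why RAG answers are auditable: the retrieved sources travel with the
# answer. All field names here are illustrative, not a real schema.
from dataclasses import dataclass

@dataclass
class SourceChunk:
    doc_id: str        # which document the passage came from
    passage: str       # the exact text the model saw
    retrieved_at: str  # when the index served it

@dataclass
class AuditableAnswer:
    text: str                   # what the model said
    sources: list[SourceChunk]  # every chunk that was in the prompt

    def citation_trail(self) -> list[str]:
        """The record a reviewer or regulator can replay."""
        return [f"{s.doc_id} @ {s.retrieved_at}" for s in self.sources]
```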
Question 5: Do you have labeled training data?
Fine-tuning needs hundreds to thousands of labeled examples — not just inputs, but inputs paired with the desired outputs. RAG needs documents, which most teams already have.
The cost of producing labeled fine-tune data is almost always larger than the cost of the fine-tune itself. We've watched teams spend $80k of internal time labeling for a $400 fine-tune. The fine-tune was the cheap part.
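For scale, "labeled" means each row pairs a real input with the exact output you want back. A sketch of the chat-style JSONL shape many fine-tuning APIs accept, with hypothetical content:

```python
# Labeled fine-tune data is input/output pairs, not documents. This
# writes the chat-style JSONL shape common across fine-tuning APIs;
# the example content is hypothetical.
import json

examples = [
    {
        "messages": [
            {"role": "user",
             "content": "Summarize ticket 4521 for the exec digest."},
            {"role": "assistant",
             "content": "Ticket 4521: billing sync failure. Impact: 12 "
                        "accounts. Status: patched, monitoring."},
        ]
    },
    # ...hundreds more pairs, each showing the output you actually want
]

with open("finetune_train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```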
The combinations that actually win
In practice, most production systems use both. The pattern that's working in 2026:
- Behavior layer (fine-tune): a small adapter on a frontier base model that pins output format, tone, and refusal behavior. Trained once on a few hundred examples. Refreshed quarterly.
- Knowledge layer (RAG): hybrid retrieval over the live corpus. Refreshed continuously.
- Reasoning layer (base model): the frontier model's general capability, untouched.
This split lets each layer change at its own cadence. Knowledge updates daily without retraining. Behavior updates quarterly. The base model upgrades when the provider ships a better one.
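Sketched as a request path, with every name a placeholder (the point is which layer owns what, not any particular SDK):

```python
# How the three layers meet at request time. Everything here is a
# stub; no real retrieval or model API is being called.

ADAPTER_ID = "behavior-v3"  # fine-tuned behavior layer, retrained quarterly

def retrieve(query: str, top_k: int = 8) -> list[str]:
    """Knowledge layer: hybrid retrieval over the live corpus
    (refreshed continuously). Stubbed for illustration."""
    return [f"[chunk {i} relevant to: {query}]" for i in range(top_k)]

def call_model(prompt: str, adapter: str) -> str:
    """Reasoning layer: the untouched frontier base model, with the
    small behavior adapter applied on top. Stubbed for illustration."""
    return f"(answer from base model + {adapter})"

def answer(query: str) -> str:
    chunks = retrieve(query)
    prompt = "\n\n".join(chunks) + f"\n\nQuestion: {query}"
    return call_model(prompt, adapter=ADAPTER_ID)

print(answer("What changed in the March release?"))
```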
What we actually do in Pro-tier builds
The Pro tier defaults to RAG with no fine-tuning. We can ship that in a week. About 90% of customers don't need anything else.
The Custom tier opens fine-tuning as an option. We add it only after the RAG version has shipped, run for 4–6 weeks, and surfaced specific behaviors that prompt engineering can't fix. By that point, we have actual production traces to label, so the fine-tune has real data.
The order matters. Fine-tuning before shipping the RAG version is committing to an architecture before you've measured what the architecture is for.
A short version
- Knowledge gap → RAG
- Behavior gap → fine-tune (after RAG, with real traces)
- Fresh data → RAG
- Stable data + high volume → consider both
- Need citations → RAG
- No labeled data → RAG
- High-stakes, regulated → RAG, possibly with fine-tuned behavior layer
The default in 2026 is RAG-first, fine-tune-later, both-eventually-for-mature-systems. The teams who fine-tune first are usually solving a problem they haven't yet measured.
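If it's easier to reason about as code, here is the same tree as a function (a sketch; real scoping has more nuance than five booleans):

```python
# The short version above, as a decision function. Parameter names are
# ours; treat the return value as a starting point, not a verdict.

def recommend(knowledge_gap: bool,
              data_changes_monthly_or_faster: bool,
              high_volume: bool,
              needs_citations: bool,
              has_labeled_data: bool) -> str:
    if needs_citations or data_changes_monthly_or_faster:
        return "RAG; fine-tune at most a behavior layer"
    if knowledge_gap or not has_labeled_data:
        return "RAG first"
    if high_volume:
        return "RAG first, then fine-tune to compress the prompt"
    return "RAG first; fine-tune later with real production traces"
```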
If you're scoping a build and stuck on architecture, [we'll work through it on a 30-minute call](/contact).