Fine-tuning vs Prompting vs RAG
A decision framework for how to adapt a model to your task
When a foundation model doesn't behave the way you need, you have three main levers: change what you say to it (prompting), give it external information at runtime (RAG), or change the model's weights to alter its behaviour permanently (fine-tuning). Each lever solves a different problem and comes with different costs. Most teams reach for fine-tuning too early.
The three approaches
Prompting
Tell the model what you want via instructions, examples, and context in the input.
Cost: Low. Iteration takes minutes.
RAG
Retrieve relevant external content and supply it in context at inference time.
Cost: Moderate. Requires retrieval infra.
Fine-tuning
Update the model's weights using examples of desired behaviour.
Cost: High. Requires data, GPU time, ongoing maintenance.
These aren't mutually exclusive. Real systems often combine them: a fine-tuned model that also does RAG, guided by a well-crafted system prompt. But understanding what each approach actually does — and what problem it solves — is the prerequisite for combining them sensibly.
What each approach actually changes
Prompting changes the input
A prompt is everything you send to the model before it generates. System prompts, few-shot examples, chain-of-thought instructions, output format specifications — all of these are prompting. When you write a better prompt, you're not changing the model. You're changing the context within which the model interprets its task.
Prompting is cheap to iterate and requires no infrastructure beyond the model API. It's also limited: you can only influence the model through the information present in the context window. You can't change how the model handles tasks it wasn't trained on, correct systematic errors in its reasoning, or inject knowledge that exceeds your context budget.
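As a concrete illustration, a few-shot prompt is just structured input. The sketch below assembles a chat-style message list (the format mirrors common chat-completion APIs; the classification task and examples are hypothetical):

```python
# Sketch: assembling a few-shot prompt as a chat message list.
# No model call happens here; this is only the input-construction step.

def build_prompt(system: str, examples: list[tuple[str, str]], query: str) -> list[dict]:
    """Build a message list: system prompt, few-shot pairs, then the real query."""
    messages = [{"role": "system", "content": system}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": query})
    return messages

messages = build_prompt(
    system="Classify each review as positive or negative. Answer with one word.",
    examples=[
        ("Great battery life, highly recommend.", "positive"),
        ("Broke after two days.", "negative"),
    ],
    query="Arrived late but works perfectly.",
)
```

Everything the model sees lives in this list: improving the prompt means improving these strings, nothing more.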
RAG changes what information is available at runtime
RAG doesn't change the model or the prompt structure — it changes the content. Instead of relying on the model's parametric memory (what it learned during training), you retrieve relevant content from an external source and include it in the context. The model's weights don't change; what it reads does.
RAG is the right choice when your problem is information access: the model doesn't know about your private data, recent events, or a specific corpus. It's not the right choice if the problem is that the model doesn't behave the right way — RAG can't teach a model to follow a specific output format, adopt a tone, or correct systematic errors in its reasoning.
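The core RAG loop is retrieve-then-prepend. A minimal sketch, with plain word overlap standing in for a real embedding-based retriever and a made-up three-document corpus:

```python
# Sketch: retrieve relevant text, then build the prompt around it.
# Real systems use a vector store; term overlap is a stand-in retriever.

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query and return the top k."""
    q_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_rag_prompt(query: str, corpus: list[str]) -> str:
    context = "\n".join(retrieve(query, corpus))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

corpus = [
    "Refunds are processed within 14 days of the return request.",
    "Our office is closed on public holidays.",
    "Shipping to the EU takes 3-5 business days.",
]
prompt = build_rag_prompt("How long do refunds take?", corpus)
```

Note that the model's weights never change: the refund policy reaches the model only because it lands in the context window.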
Fine-tuning changes the model's weights
Fine-tuning updates the model's parameters using gradient descent on a dataset of examples. This permanently (for that model checkpoint) changes how the model behaves — what it tends to say, how it formats output, how it reasons, what it believes about the world. Fine-tuning can produce behaviour changes that no prompt can reliably achieve.
The cost is real: you need labelled examples of the desired behaviour (usually hundreds to thousands, sometimes more), GPU compute for the training run, evaluation infrastructure to confirm the fine-tune worked, and an operational plan for hosting or serving the resulting model. You also need to repeat this process when you want to update the model's behaviour or move to a newer base model.
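The "labelled examples" are typically a JSONL file of example conversations. A sketch of the data-preparation step, using a schema that mirrors common chat-style fine-tuning formats (the file name and both examples are illustrative):

```python
# Sketch: fine-tuning data as JSONL, one example conversation per line.
import json

examples = [
    {"messages": [
        {"role": "user", "content": "Summarise: The Thursday meeting moved to Friday."},
        {"role": "assistant", "content": "Meeting moved from Thursday to Friday."},
    ]},
    {"messages": [
        {"role": "user", "content": "Summarise: Invoice 1042 was paid in full today."},
        {"role": "assistant", "content": "Invoice 1042 paid in full."},
    ]},
]

# One JSON object per line, the shape most fine-tuning pipelines ingest.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

Each line is one demonstration of the target behaviour; the training run adjusts the weights to make outputs like these more likely.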
Mapping problems to approaches
| Problem | Best approach | Why |
|---|---|---|
| Model doesn't know about your private data | RAG | Information access problem — retrieval provides it at runtime without storing it in weights |
| Model doesn't know about recent events | RAG | Same — knowledge cutoff is an information problem |
| Model doesn't follow your output format | Prompting first, then fine-tuning if prompting fails | Format adherence is often achievable with good few-shot examples; fine-tune only if that's unreliable |
| Model uses the wrong tone or persona | Prompting first, then fine-tuning for consistency | Persona can be set in system prompt; fine-tune when it needs to be deeply embedded and consistent |
| Model makes systematic domain errors (e.g. wrong medical reasoning) | Fine-tuning with domain examples | Prompting can't fix systematic reasoning patterns; need weight updates |
| Model is too slow / too expensive | Smaller model + fine-tuning to recover capability | Fine-tuning a smaller model on task-specific examples can match a larger model's quality at lower cost |
| Model doesn't know a specialised vocabulary or domain | RAG first; fine-tuning if RAG is insufficient | RAG can supply domain content in-context; fine-tuning can embed it in weights if RAG proves too unreliable |
| Need structured output (JSON, function calls) | Prompting + structured output feature; fine-tuning if unreliable | Most modern models support structured output natively — try prompting before training |
Types of fine-tuning
Full fine-tuning
All model weights are updated during training. This produces the strongest possible behaviour change but requires the most compute, the most memory (you need to store gradients for every parameter), and produces the largest model artefact. For very large models (70B+), full fine-tuning is often impractical without substantial GPU cluster access.
LoRA and parameter-efficient fine-tuning
LoRA (Low-Rank Adaptation) freezes the original model weights and trains small, low-rank adapter matrices alongside them. The adapters are much smaller than the full model — typically well under a few percent of the total parameter count, depending on the rank and which layers receive adapters. During inference, the adapter weights are merged with or applied on top of the base model. QLoRA extends this by also quantising the base model weights during training, reducing memory requirements further.
LoRA has become the default fine-tuning method for most practical use cases because it's fast to train, cheap to store (adapters are small), and produces results close to full fine-tuning on most tasks. The main limitation is that it can't update every type of behaviour as deeply as full fine-tuning — but for most tasks, the difference doesn't matter.
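The size advantage is simple arithmetic: for a weight matrix of shape d_out x d_in, LoRA trains two factors B (d_out x r) and A (r x d_in), shrinking the trainable parameters from d_out * d_in to r * (d_out + d_in). A sketch with illustrative dimensions, roughly one attention projection in a 7B-class model:

```python
# Sketch: why LoRA adapters are small. Dimensions are illustrative.

def lora_params(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Return (full matrix params, LoRA adapter params) for one layer."""
    full = d_out * d_in          # trainable params under full fine-tuning
    adapter = r * (d_out + d_in)  # trainable params under LoRA at rank r
    return full, adapter

full, adapter = lora_params(4096, 4096, r=8)
# adapter is ~0.39% of the full matrix at rank 8
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.2%}")
```

Raising the rank r trades adapter size against capacity, which is why the practical ratio varies with rank and layer coverage.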
Instruction fine-tuning and RLHF
The base models released by labs aren't directly usable for chat or instruction-following — they're trained to predict text, not to follow instructions helpfully. The models you use via APIs have already been fine-tuned to follow instructions (often using RLHF — reinforcement learning from human feedback). When people talk about fine-tuning their own models, they're usually fine-tuning these already instruction-tuned checkpoints on task-specific data.
Data requirements for fine-tuning
Fine-tuning requires data — examples of the input-output behaviour you want. The number of examples needed varies enormously by task. Simple format changes might be achievable with a few hundred examples. Complex reasoning tasks may require thousands. Behaviours that are deeply contrary to the base model's training may require more still, and may degrade other capabilities in the process (catastrophic forgetting).
Data quality matters more than quantity. A fine-tune trained on 500 high-quality, consistent examples will outperform one trained on 5,000 noisy, inconsistent ones. Annotation consistency is particularly important: if different annotators would label the same input differently, the model will learn the inconsistency.
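One cheap pre-training check for the consistency problem is to flag inputs that appear in the dataset with conflicting labels. A sketch, assuming a simple (text, label) dataset shape:

```python
# Sketch: flag annotation conflicts before fine-tuning.
from collections import defaultdict

def find_conflicts(dataset: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Return inputs mapped to more than one distinct label."""
    labels = defaultdict(set)
    for text, label in dataset:
        labels[text].add(label)
    return {text: ls for text, ls in labels.items() if len(ls) > 1}

data = [
    ("the food was fine", "positive"),
    ("terrible service", "negative"),
    ("the food was fine", "neutral"),  # conflicts with the first example
]
conflicts = find_conflicts(data)
# {'the food was fine': {'positive', 'neutral'}}
```

Exact-duplicate conflicts like these are the easy case; near-duplicates with different labels need fuzzier matching but signal the same problem.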
// Don't skip the baseline
Before investing in fine-tuning, establish a prompt-engineering baseline. Write the best system prompt you can, use few-shot examples, and measure performance on a held-out evaluation set. Fine-tuning is only worth the investment if it measurably outperforms a well-engineered prompt on your actual task. Many teams fine-tune and then discover that better prompting would have got them most of the way there at a fraction of the cost.
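The baseline measurement itself can be tiny. A sketch of an exact-match eval harness, where `prompt_baseline` is a stand-in for "model plus the best prompt you could write" and the four-item eval set is hypothetical:

```python
# Sketch: score a prompting baseline on a held-out set before fine-tuning.

def accuracy(predict, eval_set: list[tuple[str, str]]) -> float:
    """Exact-match accuracy of a predict(input) -> output function."""
    correct = sum(1 for text, expected in eval_set if predict(text) == expected)
    return correct / len(eval_set)

def prompt_baseline(text: str) -> str:
    # Stand-in for calling the model with your best prompt.
    return "negative" if "not" in text else "positive"

eval_set = [
    ("would buy again", "positive"),
    ("did not work at all", "negative"),
    ("not worth the price", "negative"),
    ("love it", "positive"),
]
baseline_score = accuracy(prompt_baseline, eval_set)
```

A fine-tune then has to beat `baseline_score` on the same held-out set, by a margin that justifies its cost, before it earns a place in production.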
A practical decision guide
// decision_framework.txt
Is the problem that the model doesn't have access to specific information (private data, recent events, a large corpus)?
→ Yes: start with RAG. It's faster to build and easier to update than a fine-tune, and it directly solves information access problems.
Is the problem that the model doesn't follow instructions correctly — format, tone, or task framing?
→ Yes: start with prompting. Invest time in the system prompt and few-shot examples. Measure with evals. If prompting doesn't get you there after genuine effort, consider fine-tuning.
Is the problem systematic errors in reasoning or domain knowledge that prompting can't fix?
→ Yes: consider fine-tuning. But first, do you have enough high-quality labelled examples? Is the improvement measurable? Is the maintenance cost acceptable?
Is the problem latency or cost — the model is too slow or too expensive?
→ Consider fine-tuning a smaller base model on your specific task. A 7B model fine-tuned on your task may match a 70B model's quality on that task at much lower inference cost.
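The guide above can be collapsed into a first-cut routing function. The problem categories are the ones from this section; a real project needs more nuance than a lookup table:

```python
# Sketch: the decision guide as a routing function. Category names
# are invented labels for the problems discussed in this section.

def first_approach(problem: str) -> str:
    routes = {
        "missing_private_data": "RAG",
        "knowledge_cutoff": "RAG",
        "format_or_tone": "prompting (fine-tune only if prompting fails)",
        "systematic_domain_errors": "fine-tuning (given quality data and evals)",
        "latency_or_cost": "fine-tune a smaller base model",
    }
    return routes.get(problem, "start with prompting and measure")

first_approach("missing_private_data")  # -> 'RAG'
```

The default branch encodes the section's overall advice: when in doubt, start with the cheapest lever and measure.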
Combining approaches
The three approaches are complementary, not competing. A production system might use a fine-tuned model (to ensure format adherence and domain-appropriate tone), with a system prompt (to set the task context per-request), and RAG (to supply relevant information from a private knowledge base). Each layer solves a distinct problem that the others don't.
The sequencing matters. Get the model's core behaviour right through prompting first — then layer in RAG for information access, and add fine-tuning only when you have a measured gap that prompting and RAG can't close. Building in that order keeps iteration cycles short and costs manageable.
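The three layers compose naturally in code. A sketch of one request passing through all of them, where `retrieve` and `call_finetuned_model` are placeholders for real components and only the assembly logic is the point:

```python
# Sketch: system prompt + RAG context + fine-tuned model in one request.

SYSTEM_PROMPT = "You are a support assistant. Answer concisely, using the context."

def retrieve(query: str) -> list[str]:
    # Placeholder retriever; a real system queries a vector store.
    return ["Refunds are processed within 14 days."]

def call_finetuned_model(messages: list[dict]) -> str:
    # Placeholder for an API call to the fine-tuned checkpoint.
    return "Refunds take up to 14 days."

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))                       # RAG layer
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},          # prompting layer
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ]
    return call_finetuned_model(messages)                      # fine-tuned model layer
```

Each layer stays swappable: you can improve the retriever, the system prompt, or the checkpoint independently, which is exactly why the combination is manageable.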
// In short
Start with prompting: it is the cheapest lever and sets your baseline. Add RAG when the problem is information access. Reach for fine-tuning only when you have a measured gap that prompting and RAG can't close, plus the data, evaluation infrastructure, and maintenance budget to support it.