Fine-tuning vs Prompting vs RAG
A decision framework for how to adapt a model to your task
When a foundation model doesn't behave the way you need, you have three main levers: change what you say to it (prompting), give it external information at runtime (RAG), or change the model's weights to alter its behaviour permanently (fine-tuning). Each lever solves a different problem and comes with different costs. Most teams reach for fine-tuning too early.
The three approaches
Prompting
Tell the model what you want via instructions, examples, and context in the input.
Cost: Low. Iteration takes minutes.
RAG
Retrieve relevant external content and supply it in context at inference time.
Cost: Moderate. Requires retrieval infra.
Fine-tuning
Update the model's weights using examples of desired behaviour.
Cost: High. Requires data, GPU time, ongoing maintenance.
These aren't mutually exclusive. Real systems often combine them: a fine-tuned model that also does RAG, guided by a well-crafted system prompt. But understanding what each approach actually does — and what problem it solves — is the prerequisite for combining them sensibly.
What each approach actually changes
Prompting changes the input
A prompt is everything you send to the model before it generates. System prompts, few-shot examples, chain-of-thought instructions, output format specifications — all of these are prompting. When you write a better prompt, you're not changing the model. You're changing the context within which the model interprets its task.
Prompting is cheap to iterate and requires no infrastructure beyond the model API. It's also limited: you can only influence the model through the information present in the context window. You can't change how the model handles tasks it wasn't trained on, correct systematic errors in its reasoning, or inject knowledge that exceeds your context budget.
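As a concrete illustration, a few-shot prompt is just structured input. The sketch below assembles a chat-style message list (the format mirrors common chat-completion APIs; the classification task and examples are hypothetical):

```python
# Sketch: assembling a few-shot prompt as a chat message list.
# No model call happens here; this is only the input-construction step.

def build_prompt(system: str, examples: list[tuple[str, str]], query: str) -> list[dict]:
    """Build a message list: system prompt, few-shot pairs, then the real query."""
    messages = [{"role": "system", "content": system}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": query})
    return messages

messages = build_prompt(
    system="Classify each review as positive or negative. Answer with one word.",
    examples=[
        ("Great battery life, highly recommend.", "positive"),
        ("Broke after two days.", "negative"),
    ],
    query="Arrived late but works perfectly.",
)
```

Everything the model sees lives in this list: improving the prompt means improving these strings, nothing more.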
RAG changes what information is available at runtime
RAG doesn't change the model or the prompt structure — it changes the content. Instead of relying on the model's parametric memory (what it learned during training), you retrieve relevant content from an external source and include it in the context. The model's weights don't change; what it reads does.
RAG is the right choice when your problem is information access: the model doesn't know about your private data, recent events, or a specific corpus. It's not the right choice if the problem is that the model doesn't behave the right way — RAG can't teach a model to follow a specific output format, adopt a tone, or correct systematic errors in its reasoning.
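The core RAG loop is retrieve-then-prepend. A minimal sketch, with plain word overlap standing in for a real embedding-based retriever and a made-up three-document corpus:

```python
# Sketch: retrieve relevant text, then build the prompt around it.
# Real systems use a vector store; term overlap is a stand-in retriever.

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query and return the top k."""
    q_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_rag_prompt(query: str, corpus: list[str]) -> str:
    context = "\n".join(retrieve(query, corpus))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

corpus = [
    "Refunds are processed within 14 days of the return request.",
    "Our office is closed on public holidays.",
    "Shipping to the EU takes 3-5 business days.",
]
prompt = build_rag_prompt("How long do refunds take?", corpus)
```

Note that the model's weights never change: the refund policy reaches the model only because it lands in the context window.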
Fine-tuning changes the model's weights
Fine-tuning updates the model's parameters using gradient descent on a dataset of examples. This permanently (for that model checkpoint) changes how the model behaves — what it tends to say, how it formats output, how it reasons, what it believes about the world. Fine-tuning can produce behaviour changes that no prompt can reliably achieve.
The cost is real: you need labelled examples of the desired behaviour (usually hundreds to thousands, sometimes more), GPU compute for the training run, evaluation infrastructure to confirm the fine-tune worked, and an operational plan for hosting or serving the resulting model. You also need to repeat this process when you want to update the model's behaviour or move to a newer base model.
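The "labelled examples" are typically a JSONL file of example conversations. A sketch of the data-preparation step, using a schema that mirrors common chat-style fine-tuning formats (the file name and both examples are illustrative):

```python
# Sketch: fine-tuning data as JSONL, one example conversation per line.
import json

examples = [
    {"messages": [
        {"role": "user", "content": "Summarise: The Thursday meeting moved to Friday."},
        {"role": "assistant", "content": "Meeting moved from Thursday to Friday."},
    ]},
    {"messages": [
        {"role": "user", "content": "Summarise: Invoice 1042 was paid in full today."},
        {"role": "assistant", "content": "Invoice 1042 paid in full."},
    ]},
]

# One JSON object per line, the shape most fine-tuning pipelines ingest.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

Each line is one demonstration of the target behaviour; the training run adjusts the weights to make outputs like these more likely.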
Mapping problems to approaches
| Problem | Best approach | Why |
|---|---|---|
| Model doesn't know about your private data | RAG | Information access problem — retrieval provides it at runtime without storing it in weights |
| Model doesn't know about recent events | RAG | Same — knowledge cutoff is an information problem |
| Model doesn't follow your output format | Prompting first, then fine-tuning if prompting fails | Format adherence is often achievable with good few-shot examples; fine-tune only if that's unreliable |
| Model uses the wrong tone or persona | Prompting first, then fine-tuning for consistency | Persona can be set in system prompt; fine-tune when it needs to be deeply embedded and consistent |
| Model makes systematic domain errors (e.g. wrong medical reasoning) | Fine-tuning with domain examples | Prompting can't fix systematic reasoning patterns; need weight updates |
| Model is too slow / too expensive | Smaller model + fine-tuning to recover capability | Fine-tuning a smaller model on task-specific examples can match a larger model's quality at lower cost |
| Model doesn't know a specialised vocabulary or domain | RAG first; fine-tuning if RAG is insufficient | RAG can supply domain content in-context; fine-tuning can embed it in weights if RAG proves too unreliable |
| Need structured output (JSON, function calls) | Prompting + structured output feature; fine-tuning if unreliable | Most modern models support structured output natively — try prompting before training |
Types of fine-tuning
Full fine-tuning
All model weights are updated during training. This produces the strongest possible behaviour change but requires the most compute, the most memory (you need to store gradients for every parameter), and produces the largest model artefact. For very large models (70B+), full fine-tuning is often impractical without substantial GPU cluster access.
LoRA and parameter-efficient fine-tuning
LoRA (Low-Rank Adaptation) freezes the original model weights and trains small, low-rank adapter matrices alongside them. The adapters are much smaller than the full model — typically well under a few percent of the total parameter count, depending on the rank and which layers receive adapters. During inference, the adapter weights are merged with or applied on top of the base model. QLoRA extends this by also quantising the base model weights during training, reducing memory requirements further.
LoRA has become the default fine-tuning method for most practical use cases because it's fast to train, cheap to store (adapters are small), and produces results close to full fine-tuning on most tasks. The main limitation is that it can't update every type of behaviour as deeply as full fine-tuning — but for most tasks, the difference doesn't matter.
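The size advantage is simple arithmetic: for a weight matrix of shape d_out x d_in, LoRA trains two factors B (d_out x r) and A (r x d_in), shrinking the trainable parameters from d_out * d_in to r * (d_out + d_in). A sketch with illustrative dimensions, roughly one attention projection in a 7B-class model:

```python
# Sketch: why LoRA adapters are small. Dimensions are illustrative.

def lora_params(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Return (full matrix params, LoRA adapter params) for one layer."""
    full = d_out * d_in          # trainable params under full fine-tuning
    adapter = r * (d_out + d_in)  # trainable params under LoRA at rank r
    return full, adapter

full, adapter = lora_params(4096, 4096, r=8)
# adapter is ~0.39% of the full matrix at rank 8
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.2%}")
```

Raising the rank r trades adapter size against capacity, which is why the practical ratio varies with rank and layer coverage.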
Instruction fine-tuning and RLHF
The base models released by labs aren't directly usable for chat or instruction-following — they're trained to predict text, not to follow instructions helpfully. The models you use via APIs have already been fine-tuned to follow instructions (often using RLHF — reinforcement learning from human feedback). When people talk about fine-tuning their own models, they're usually fine-tuning these already instruction-tuned checkpoints on task-specific data.
Data requirements for fine-tuning
Fine-tuning requires data — examples of the input-output behaviour you want. The number of examples needed varies enormously by task. Simple format changes might be achievable with a few hundred examples. Complex reasoning tasks may require thousands. Behaviours that are deeply contrary to the base model's training may require more still, and may degrade other capabilities in the process (catastrophic forgetting).
Data quality matters more than quantity. A fine-tune trained on 500 high-quality, consistent examples will outperform one trained on 5,000 noisy, inconsistent ones. Annotation consistency is particularly important: if different annotators would label the same input differently, the model will learn the inconsistency.
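One cheap pre-training check for the consistency problem is to flag inputs that appear in the dataset with conflicting labels. A sketch, assuming a simple (text, label) dataset shape:

```python
# Sketch: flag annotation conflicts before fine-tuning.
from collections import defaultdict

def find_conflicts(dataset: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Return inputs mapped to more than one distinct label."""
    labels = defaultdict(set)
    for text, label in dataset:
        labels[text].add(label)
    return {text: ls for text, ls in labels.items() if len(ls) > 1}

data = [
    ("the food was fine", "positive"),
    ("terrible service", "negative"),
    ("the food was fine", "neutral"),  # conflicts with the first example
]
conflicts = find_conflicts(data)
# {'the food was fine': {'positive', 'neutral'}}
```

Exact-duplicate conflicts like these are the easy case; near-duplicates with different labels need fuzzier matching but signal the same problem.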
// Don't skip the baseline
Before investing in fine-tuning, establish a prompt-engineering baseline. Write the best system prompt you can, use few-shot examples, and measure performance on a held-out evaluation set. Fine-tuning is only worth the investment if it measurably outperforms a well-engineered prompt on your actual task. Many teams fine-tune and then discover that better prompting would have got them most of the way there at a fraction of the cost.
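The baseline measurement itself can be tiny. A sketch of an exact-match eval harness, where `prompt_baseline` is a stand-in for "model plus the best prompt you could write" and the four-item eval set is hypothetical:

```python
# Sketch: score a prompting baseline on a held-out set before fine-tuning.

def accuracy(predict, eval_set: list[tuple[str, str]]) -> float:
    """Exact-match accuracy of a predict(input) -> output function."""
    correct = sum(1 for text, expected in eval_set if predict(text) == expected)
    return correct / len(eval_set)

def prompt_baseline(text: str) -> str:
    # Stand-in for calling the model with your best prompt.
    return "negative" if "not" in text else "positive"

eval_set = [
    ("would buy again", "positive"),
    ("did not work at all", "negative"),
    ("not worth the price", "negative"),
    ("love it", "positive"),
]
baseline_score = accuracy(prompt_baseline, eval_set)
```

A fine-tune then has to beat `baseline_score` on the same held-out set, by a margin that justifies its cost, before it earns a place in production.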
A practical decision guide
// decision_framework.txt
Is the problem that the model doesn't have access to specific information (private data, recent events, a large corpus)?
→ Yes: start with RAG. It's faster to build and easier to update than a fine-tune, and it directly solves information access problems.
Is the problem that the model doesn't follow instructions correctly — format, tone, or task framing?
→ Yes: start with prompting. Invest time in the system prompt and few-shot examples. Measure with evals. If prompting doesn't get you there after genuine effort, consider fine-tuning.
Is the problem systematic errors in reasoning or domain knowledge that prompting can't fix?
→ Yes: consider fine-tuning. But first, do you have enough high-quality labelled examples? Is the improvement measurable? Is the maintenance cost acceptable?
Is the problem latency or cost — the model is too slow or too expensive?
→ Consider fine-tuning a smaller base model on your specific task. A 7B model fine-tuned on your task may match a 70B model's quality on that task at much lower inference cost.
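The guide above can be collapsed into a first-cut routing function. The problem categories are the ones from this section; a real project needs more nuance than a lookup table:

```python
# Sketch: the decision guide as a routing function. Category names
# are invented labels for the problems discussed in this section.

def first_approach(problem: str) -> str:
    routes = {
        "missing_private_data": "RAG",
        "knowledge_cutoff": "RAG",
        "format_or_tone": "prompting (fine-tune only if prompting fails)",
        "systematic_domain_errors": "fine-tuning (given quality data and evals)",
        "latency_or_cost": "fine-tune a smaller base model",
    }
    return routes.get(problem, "start with prompting and measure")

first_approach("missing_private_data")  # -> 'RAG'
```

The default branch encodes the section's overall advice: when in doubt, start with the cheapest lever and measure.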
Combining approaches
The three approaches are complementary, not competing. A production system might use a fine-tuned model (to ensure format adherence and domain-appropriate tone), with a system prompt (to set the task context per-request), and RAG (to supply relevant information from a private knowledge base). Each layer solves a distinct problem that the others don't.
The sequencing matters. Get the model's core behaviour right through prompting first — then layer in RAG for information access, and add fine-tuning only when you have a measured gap that prompting and RAG can't close. Building in that order keeps iteration cycles short and costs manageable.
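The three layers compose naturally in code. A sketch of one request passing through all of them, where `retrieve` and `call_finetuned_model` are placeholders for real components and only the assembly logic is the point:

```python
# Sketch: system prompt + RAG context + fine-tuned model in one request.

SYSTEM_PROMPT = "You are a support assistant. Answer concisely, using the context."

def retrieve(query: str) -> list[str]:
    # Placeholder retriever; a real system queries a vector store.
    return ["Refunds are processed within 14 days."]

def call_finetuned_model(messages: list[dict]) -> str:
    # Placeholder for an API call to the fine-tuned checkpoint.
    return "Refunds take up to 14 days."

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))                       # RAG layer
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},          # prompting layer
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ]
    return call_finetuned_model(messages)                      # fine-tuned model layer
```

Each layer stays swappable: you can improve the retriever, the system prompt, or the checkpoint independently, which is exactly why the combination is manageable.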
// In short
Start with prompting: it is the cheapest lever and sets your baseline. Add RAG when the problem is information access. Reach for fine-tuning only when you have a measured gap that prompting and RAG can't close, plus the data, evaluation infrastructure, and maintenance budget to support it.