Training vs. Inference
The Core Concept to Understand
Before optimising latency, selecting a GPU, or choosing an inference engine, you need to understand what inference actually is and how it differs from training. This is the foundation everything else is built on.
Why this distinction matters
Most mistakes in AI infrastructure — the expensive kind, the kind that cause engineers to rebuild systems from scratch — originate from a single misunderstanding: treating training and inference as variations of the same problem. They are not.
Training and inference share a model and share a GPU, and that is roughly where the similarity ends. They have different objectives, different computational profiles, different memory patterns, different hardware requirements, and different economic structures. An infrastructure decision that is correct for training is frequently wrong for inference, and vice versa.
// Key Principle
Training is a one-time cost paid to create a model. Inference is a recurring cost paid every time the model is used. For any model that reaches production, inference will cost orders of magnitude more than training over its lifetime.
The bulk of an AI model's lifetime compute is spent on inference, not training. The economics follow: optimising training by 10% saves 10% of a one-time cost; optimising inference by 10% saves 10% of an ongoing cost that never stops accumulating.
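A toy calculation makes the asymmetry concrete. Every dollar figure below is invented purely for illustration:

```python
# Toy illustration: a one-time training cost vs. a recurring inference cost.
# All figures are invented for illustration, not real prices.

training_cost = 1_000_000          # one-time cost to train the model ($)
inference_cost_per_day = 10_000    # recurring serving cost ($/day)
days_in_production = 365 * 2       # two years of serving

lifetime_inference = inference_cost_per_day * days_in_production
print(f"training:  ${training_cost:,}")
print(f"inference: ${lifetime_inference:,}")   # $7,300,000 over two years

# A 10% training optimisation saves a fixed amount once; a 10% inference
# optimisation keeps saving for as long as the model is served.
saving_training = 0.10 * training_cost
saving_inference = 0.10 * lifetime_inference
print(f"10% of training:  ${saving_training:,.0f}")
print(f"10% of inference: ${saving_inference:,.0f}")
```

With these assumed numbers, two years of serving already costs seven times the training run, and the gap keeps widening for as long as the model stays in production.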
What training does
Training is the process by which a model learns. Given a dataset of examples, the model makes predictions, measures how wrong they are, and adjusts its internal parameters to be less wrong next time. This cycle repeats millions or billions of times until the model's predictions are useful.
The internal parameters — called weights — are the model's "knowledge". A large model with 70 billion parameters has seventy billion of these numbers. Training is the process of finding good values for all of them.
The two passes
Training requires two passes through the model on every iteration:
// training_loop
Forward pass: the model turns inputs into predictions and measures the loss.
Backward pass (backpropagation): gradients of the loss are computed for every parameter and used to update the weights.
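The two-pass loop can be sketched in plain Python. This is a deliberately minimal stand-in for a real framework: a one-parameter linear model (`y = w * x`) trained with hand-written gradient descent, so the forward pass, backward pass, and weight update are all visible:

```python
# Minimal sketch of the training loop: forward pass, backward pass, update.
# A one-parameter model y = w * x stands in for a real neural network.

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # examples of y = 2x
w = 0.0     # the single "weight" being trained
lr = 0.05   # learning rate

for step in range(200):
    grad = 0.0
    for x, y in data:
        # Forward pass: compute the prediction and how wrong it is.
        pred = w * x
        error = pred - y
        # Backward pass: gradient of squared error w.r.t. w is 2 * error * x.
        grad += 2 * error * x
    # Update: nudge the weight against the gradient. This write to w
    # is exactly the step that inference never performs.
    w -= lr * grad / len(data)

print(round(w, 3))  # converges toward 2.0
```

A real training run does the same three steps per iteration, just with billions of parameters, automatic differentiation, and an optimiser such as AdamW in place of plain gradient descent.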
The backward pass is computationally intensive and requires storing gradients for every parameter — a structure roughly as large as the model weights themselves. This is why training a 70B model requires far more memory than running it: you need to hold the weights, the gradients, and the optimiser states simultaneously.
Training is intentionally destructive to the weights — that is the point. Each step nudges the weights toward values that produce better predictions. When training concludes, the weights are frozen and the model is ready to use.
// Common Misconception
Models do not "learn" during normal production use. When ChatGPT answers your question, its weights do not change. The model is not updating itself based on your input. Inference is strictly a read-only operation on the weights.
What inference does
Inference is the process by which a trained model uses what it has learned. Given a new input — a prompt, an image, an audio clip — the model runs a single forward pass and produces an output. No gradients are computed. No weights are updated. The model is frozen.
For a large language model, inference is the process of generating tokens one at a time. Given a prompt, the model predicts a probability distribution over possible next tokens, selects one (often the most likely), appends it to the sequence, and runs another forward pass to predict the token after that. This continues until the model generates a stop token or hits a length limit.
// llm_inference
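The generation loop looks like this sketch. Here `toy_next_token` is a hypothetical stand-in that returns a fixed continuation; in a real system that call is the model's full forward pass:

```python
# Sketch of the autoregressive generation loop. `toy_next_token` is a
# stand-in for the model's forward pass, returning a fixed continuation.

def toy_next_token(tokens):
    continuation = ["inference", "is", "a", "forward", "pass", "<stop>"]
    return continuation[min(len(tokens) - 1, len(continuation) - 1)]

def generate(prompt_tokens, max_new_tokens=16, stop="<stop>"):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_tok = toy_next_token(tokens)  # one full forward pass per token
        if next_tok == stop:
            break
        tokens.append(next_tok)            # append, then predict again
    return tokens

print(generate(["hello"]))
# ['hello', 'inference', 'is', 'a', 'forward', 'pass']
```

The structural point is that each new token costs a complete trip through the model, which is what the next paragraph's performance observation rests on.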
This is why inference has a fundamentally different performance profile than training. Every generated token requires streaming the entire model's weights through memory, yet at small batch sizes each weight loaded is used for only a handful of arithmetic operations. This makes inference memory-bandwidth-bound, not compute-bound.
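A back-of-envelope bound follows directly: if every token must stream all the weights through memory once, memory bandwidth caps tokens per second. Both numbers below are illustrative assumptions, not a specific GPU:

```python
# Upper bound on single-request decode speed: each generated token must
# stream all weights through GPU memory once. Assumed figures, not a
# specific product: a 70B FP16 model on ~2 TB/s of HBM bandwidth.

weight_bytes = 70e9 * 2    # 70B params at 2 bytes each = 140 GB
mem_bandwidth = 2e12       # assumed ~2 TB/s memory bandwidth

max_tokens_per_sec = mem_bandwidth / weight_bytes
print(f"~{max_tokens_per_sec:.0f} tokens/s upper bound at batch size 1")
```

Under these assumptions the ceiling is roughly 14 tokens per second per request, regardless of how many FLOPS the GPU offers; batching multiple requests amortises the weight loads and is how serving systems climb above this bound.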
The key differences, side by side
| | Training | Inference |
|---|---|---|
| Purpose | Adjust weights to learn from data | Use frozen weights to generate output |
| Passes per step | Forward + backward pass | Forward pass only |
| Weights | Updated on every step | Read-only, never updated |
| Gradients | Required — stored in memory | Not computed, not stored |
| Memory profile | Weights + gradients + optimiser states (~4–8× model size) | Weights + KV cache only (~1.2× model size) |
| Compute profile | Compute-bound (matrix multiply dominates) | Memory-bandwidth-bound (weight loading dominates) |
| Parallelism | Batch many examples, process together | Often single-request; batching adds complexity |
| Latency sensitivity | None — hours/days is acceptable | High — users wait in real time |
| Cost structure | One-time (or periodic retraining) | Per-request, ongoing forever |
| Hardware optimisation | Maximise FLOPS and interconnect bandwidth | Maximise memory bandwidth + capacity |
The memory difference, in concrete numbers
The memory requirements are worth making concrete. Take a 70B open-weight model as an example: widely deployed, and representative of the scale at which inference costs become significant.
In FP16 (half-precision), each parameter occupies 2 bytes. The model weights alone therefore require:
70,000,000,000 × 2 bytes = 140 GB for weights alone
For inference, you need that 140 GB plus a relatively modest overhead for activations and the KV cache (which grows with context length). Two high-end datacenter GPUs with 80 GB each can serve this model, though the roughly 20 GB of headroom limits batch size and context length.
For training, you additionally need to store gradients (another ~140 GB in FP16) and optimiser states: AdamW keeps two FP32 moment vectors per parameter, adding a further ~560 GB. Total training memory: roughly 840 GB, before counting the FP32 master copy of the weights that mixed-precision training typically keeps, or activation memory. That requires a multi-node GPU cluster, not a pair of GPUs.
// Rule of Thumb
Inference memory ≈ 2 × params (in billions) GB at FP16. Training memory ≈ 16–20 × params (in billions) GB at FP16 with AdamW. The inference footprint is roughly 8–10× smaller than the training footprint for the same model.
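The rule of thumb can be written as a small calculator. The per-parameter byte counts below follow the usual FP16 + AdamW accounting and are approximations, not exact for every training setup:

```python
# The rule of thumb above as a calculator. Byte counts per parameter are
# the common FP16 + AdamW accounting, approximate rather than exact.

def inference_gb(params_b):
    # FP16 weights: 2 bytes/param -> ~2 GB per billion parameters,
    # before KV cache and activation overhead.
    return params_b * 2

def training_gb(params_b):
    # FP16 weights (2) + FP16 gradients (2) + FP32 AdamW moments (8)
    # + FP32 master weights (4) = 16 bytes/param; activations push higher.
    return params_b * 16

for n in (7, 13, 70):
    print(f"{n}B params: inference ~{inference_gb(n)} GB, "
          f"training ~{training_gb(n)} GB")
```

For the 70B example this gives ~140 GB for inference and over 1.1 TB for training, matching the 8x or more gap between the two footprints.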
Why engineers confuse them
The confusion arises because both phases use the same artefact — a neural network — running on the same hardware — a GPU. The code that defines the model architecture is often identical. Many frameworks run the same model.forward() call in both cases.
But the context is completely different. Training runs model.forward() to collect gradients it will use to update the model. Inference runs model.forward() to produce an output it will return to a user. The objectives, constraints, and optimisation strategies diverge from that point.
The practical consequence: a company that spends months building training infrastructure, then assumes inference requires "the same thing, just smaller", will discover expensive surprises. Inference requires low-latency serving, request batching, KV cache management, streaming outputs, and hardware provisioned for memory bandwidth — none of which feature in a training pipeline.
// A Common Expensive Mistake
Buying the highest-end training-optimised GPUs (with their premium interconnect hardware) for an inference cluster, when more bandwidth-focused alternatives often offer better cost-per-token for serving workloads. The interconnect bandwidth that matters enormously for training is largely irrelevant for single-model inference.
Why inference is the harder engineering problem
Training is difficult science. Inference is difficult engineering.
In training, you control the entire environment: batch size, precision, hardware, duration. You can afford to be slow as long as you converge. You run it once, or periodically, on a cluster you own.
In inference, you are serving real users with unpredictable traffic patterns, latency requirements, and wildly varying prompt lengths. You must make decisions in milliseconds that training pipelines make in minutes. You must maximise throughput without violating per-user latency SLAs. You must manage GPU memory dynamically as requests arrive and complete. You must serve hundreds or thousands of concurrent requests on hardware that was designed for batch computation.
This is why inference engineering is a distinct discipline — and why the optimisation techniques it employs (continuous batching, paged attention, speculative decoding, quantization) exist specifically to address constraints that training never encounters.
// In short
Training writes the weights once; inference reads them forever. Training is compute-bound, latency-insensitive, and a one-time cost; inference is memory-bandwidth-bound, latency-critical, and a recurring cost. Build and provision for each accordingly.