// Guide — Concepts

Training vs. Inference

The Core Concept to Understand

Before optimising latency, selecting a GPU, or choosing an inference engine, you need to understand what inference actually is and how it differs from training. This is the foundation everything else is built on.


Why this distinction matters

Most mistakes in AI infrastructure — the expensive kind, the kind that cause engineers to rebuild systems from scratch — originate from a single misunderstanding: treating training and inference as variations of the same problem. They are not.

Training and inference share a model and share a GPU, and that is roughly where the similarity ends. They have different objectives, different computational profiles, different memory patterns, different hardware requirements, and different economic structures. An infrastructure decision that is correct for training is frequently wrong for inference, and vice versa.

// Key Principle

Training is a one-time cost paid to create a model. Inference is a recurring cost paid every time the model is used. For any model that reaches production, inference will cost orders of magnitude more than training over its lifetime.

The bulk of an AI model's lifetime compute is spent on inference, not training. The economics follow directly: optimising training by 10% saves 10% of a one-time cost, while optimising inference by 10% saves 10% of an ongoing cost that never stops accumulating.
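To make the asymmetry concrete, here is a back-of-the-envelope comparison. Every dollar figure and traffic number below is an illustrative assumption, not a measurement:

```python
# Hypothetical cost model: training is paid once, inference accrues per request.
# All figures below are illustrative assumptions, not real prices.
training_cost = 5_000_000       # one-time cost to train the model ($)
cost_per_request = 0.002        # serving cost per request ($)
requests_per_day = 10_000_000   # assumed production traffic

def lifetime_cost(days: int) -> float:
    """Total spend after `days` of production use."""
    return training_cost + cost_per_request * requests_per_day * days

# A 10% training optimisation saves a fixed amount, once.
# A 10% inference optimisation saves money every single day, forever.
saved_training = 0.10 * training_cost
saved_inference_per_day = 0.10 * cost_per_request * requests_per_day

breakeven_days = saved_training / saved_inference_per_day
print(f"Total after 1 year: ${lifetime_cost(365):,.0f}")
print(f"Inference savings overtake training savings after {breakeven_days:.0f} days")
```

Under these assumptions the inference savings overtake the one-time training savings in well under a year, and keep compounding after that.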

What training does

Training is the process by which a model learns. Given a dataset of examples, the model makes predictions, measures how wrong they are, and adjusts its internal parameters to be less wrong next time. This cycle repeats millions or billions of times until the model's predictions are useful.

The internal parameters — called weights — are the model's "knowledge". A large model with 70 billion parameters has seventy billion of these numbers. Training is the process of finding good values for all of them.

The two passes

Training requires two passes through the model on every iteration:

// training_loop

Forward Pass

1. Input data fed into the model
2. Model produces a prediction
3. Loss function measures error vs. ground truth

Backward Pass (Backpropagation)

4. Error propagated back through all layers
5. Gradients computed for every weight
6. Weights updated via optimiser (e.g. AdamW)

This cycle — forward pass + backward pass + weight update — is one training step. A large model trains for millions of steps.
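The cycle above can be sketched in miniature. This toy has one weight and fits y = 3x with plain gradient descent rather than AdamW, but the three steps — forward pass, backward pass, weight update — are the same ones a real training loop performs:

```python
# Minimal training loop: one weight, squared-error loss, gradient descent.
# A deliberately tiny stand-in for the forward/backward/update cycle.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # examples of y = 3x
w = 0.0    # the model's single trainable weight
lr = 0.02  # learning rate

for step in range(500):
    for x, y_true in data:
        # Forward pass: prediction, then loss against ground truth
        y_pred = w * x
        loss = (y_pred - y_true) ** 2
        # Backward pass: gradient of the loss w.r.t. the weight
        grad = 2 * (y_pred - y_true) * x
        # Weight update (a real optimiser like AdamW adds momentum state here)
        w -= lr * grad

print(f"learned weight: {w:.4f}")  # converges toward 3.0
```

Note that for a 70B-parameter model the `grad` and optimiser state exist for every one of the 70 billion weights — which is exactly the memory overhead discussed below.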

The backward pass is computationally intensive and requires storing gradients for every parameter — a structure roughly as large as the model weights themselves. This is why training a 70B model requires far more memory than running it: you need to hold the weights, the gradients, and the optimiser states simultaneously.

Training is intentionally destructive to the weights — that is the point. Each step nudges the weights toward values that produce better predictions. When training concludes, the weights are frozen and the model is ready to use.

// Common Misconception

Models do not "learn" during normal production use. When ChatGPT answers your question, its weights do not change. The model is not updating itself based on your input. Inference is strictly a read-only operation on the weights.

What inference does

Inference is the process by which a trained model uses what it has learned. Given a new input — a prompt, an image, an audio clip — the model runs a single forward pass and produces an output. No gradients are computed. No weights are updated. The model is frozen.

For a large language model, inference is the process of generating tokens one at a time. Given a prompt, the model predicts the most likely next token, appends it to the sequence, and runs another forward pass to predict the token after that. This continues until the model generates a stop token or hits a length limit.
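That generation loop can be sketched with a stub standing in for the model. The `next_token` function below is a hypothetical placeholder that returns canned tokens; in a real LLM each call would be a full forward pass over all the weights:

```python
# Autoregressive generation: one forward pass per token, appending each
# prediction to the sequence until a stop token appears.
STOP = "<stop>"

def next_token(sequence: list[str]) -> str:
    """Stand-in for a full forward pass. A real LLM would compute the
    next-token distribution from the prompt plus the sequence so far."""
    canned = ["The", "capital", "of", "France", "is", "Paris.", STOP]
    return canned[len(sequence)]

def generate(prompt: str, max_tokens: int = 32) -> str:
    tokens: list[str] = []
    for _ in range(max_tokens):
        tok = next_token(tokens)  # one forward pass per generated token
        if tok == STOP:
            break
        tokens.append(tok)        # append and go again
    return " ".join(tokens)

print(generate("What is the capital of France?"))
# → The capital of France is Paris.
```

The structure is the point: output length determines how many forward passes run, and every one of them touches the full model weights.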

// llm_inference

IN: "What is the capital of France?"
Forward pass #1  →  "The"
Forward pass #2  →  "capital"
Forward pass #3  →  "of"
Forward pass #4  →  "France"
Forward pass #5  →  "is"
Forward pass #6  →  "Paris."
Forward pass #7  →  [STOP]

Each forward pass reads the full model weights from GPU memory. For a 70B model in FP16, that is ~140GB read on every single token generated.

This is why inference has a fundamentally different performance profile than training. Every token generation requires loading the entire model's weights through memory — even though only a small fraction of the computation changes between tokens. This makes inference memory-bandwidth-bound, not compute-bound.
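A quick roofline estimate shows why. Assume a GPU with roughly 3.35 TB/s of HBM bandwidth — in the ballpark of current high-end datacenter parts, but treat the figure as an assumption — and the 140GB FP16 model above. If every token must stream the full weights through memory, bandwidth alone caps single-request decode speed:

```python
# Upper bound on single-request decode speed when every generated token
# must stream the full weights from HBM (ignores KV cache and activations).
model_bytes = 70e9 * 2    # 70B parameters at 2 bytes each (FP16) = 140GB
hbm_bandwidth = 3.35e12   # bytes/sec -- assumed figure for a high-end GPU

max_tokens_per_sec = hbm_bandwidth / model_bytes
print(f"<= {max_tokens_per_sec:.1f} tokens/sec per copy of the weights")
```

Around two dozen tokens per second, no matter how many FLOPS the GPU has to spare. Batching many requests amortises each weight read across users, which is why serving engines batch so aggressively.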

The key differences, side by side

|                       | Training                                                   | Inference                                              |
|-----------------------|------------------------------------------------------------|--------------------------------------------------------|
| Purpose               | Adjust weights to learn from data                          | Use frozen weights to generate output                  |
| Passes per step       | Forward + backward pass                                    | Forward pass only                                      |
| Weights               | Updated on every step                                      | Read-only, never updated                               |
| Gradients             | Required — stored in memory                                | Not computed, not stored                               |
| Memory profile        | Weights + gradients + optimiser states (~4–8× model size)  | Weights + KV cache only (~1.2× model size)             |
| Compute profile       | Compute-bound (matrix multiply dominates)                  | Memory-bandwidth-bound (weight loading dominates)      |
| Parallelism           | Batch many examples, process together                      | Often single-request; batching adds complexity         |
| Latency sensitivity   | None — hours/days is acceptable                            | High — users wait in real time                         |
| Cost structure        | One-time (or periodic retraining)                          | Per-request, ongoing forever                           |
| Hardware optimisation | Maximise FLOPS and interconnect bandwidth                  | Maximise memory bandwidth + capacity                   |

The memory difference, in concrete numbers

The memory requirements are worth making concrete. Take a 70B open-weight model as an example: widely deployed, and representative of the scale at which inference costs become significant.

In FP16 (half-precision), each parameter occupies 2 bytes. The model weights alone therefore require:

70,000,000,000 × 2 bytes = 140 GB for weights alone

For inference, you need that 140GB plus a relatively modest overhead for activations and the KV cache (which grows with context length). Two high-end datacenter GPUs with 80GB each can serve this model.
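The KV cache overhead can itself be estimated. The architecture numbers below — 80 layers, 8 KV heads via grouped-query attention, head dimension 128 — are assumptions typical of a 70B-class model, not figures for any specific one:

```python
# Per-token KV cache: each layer stores one key and one value vector per
# KV head. Architecture numbers are assumptions for a generic 70B-class
# model with grouped-query attention, not any specific released model.
layers = 80
kv_heads = 8
head_dim = 128
bytes_per_value = 2  # FP16

# 2x for key + value, per layer, per KV head
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
kv_gb_32k_context = kv_bytes_per_token * 32_000 / 1e9

print(f"KV cache: {kv_bytes_per_token / 1e6:.2f} MB per token, "
      f"{kv_gb_32k_context:.1f} GB at a 32k-token context")
```

A fraction of a megabyte per token sounds small, but multiplied by long contexts and many concurrent requests it becomes the dominant dynamic allocation on an inference GPU — which is why KV cache management gets its own serving techniques.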

For training, you additionally need to store gradients (another ~140GB in FP16) and optimiser states: AdamW keeps two moment vectors per parameter, which in FP32 adds a further ~560GB. Total training memory: roughly 840GB. That requires a multi-GPU training cluster, not the two GPUs inference needs.

// Rule of Thumb

Inference memory ≈ 2 × params (in billions) GB at FP16. Training memory ≈ 16–20 × params (in billions) GB at FP16 with AdamW. The inference footprint is roughly 8–10× smaller than the training footprint for the same model.
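The rule of thumb can be wrapped in a small helper. The multipliers below are the rough ones stated above (1.2× weights for inference per the table; the midpoint of the 16–20× training range, which folds in activation memory), not precise accounting:

```python
# Rough FP16 memory footprints from the rules of thumb above.
# Multipliers are approximations, not exact accounting.
def inference_memory_gb(params_billion: float) -> float:
    """Weights at 2 bytes/param, plus ~20% for KV cache and activations."""
    return 2 * params_billion * 1.2

def training_memory_gb(params_billion: float) -> float:
    """Weights + gradients (FP16) + AdamW moments (FP32) + activations:
    roughly the midpoint of the 16-20x rule of thumb."""
    return 18 * params_billion

for p in (7, 70):
    print(f"{p}B model: ~{inference_memory_gb(p):.0f} GB inference, "
          f"~{training_memory_gb(p):.0f} GB training")
```

Even for a small 7B model the gap is the same order of magnitude — the ratio between the two footprints is a property of the training process, not of model size.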

Why engineers confuse them

The confusion arises because both phases use the same artefact — a neural network — running on the same hardware — a GPU. The code that defines the model architecture is often identical. Many frameworks run the same model.forward() call in both cases.

But the context is completely different. Training runs model.forward() to collect gradients it will use to update the model. Inference runs model.forward() to produce an output it will return to a user. The objectives, constraints, and optimisation strategies diverge from that point.

The practical consequence: a company that spends months building training infrastructure, then assumes inference requires "the same thing, just smaller", will discover expensive surprises. Inference requires low-latency serving, request batching, KV cache management, streaming outputs, and hardware provisioned for memory bandwidth — none of which feature in a training pipeline.

// A Common Expensive Mistake

Buying the highest-end training-optimised GPUs (with their premium interconnect hardware) for an inference cluster, when more bandwidth-focused alternatives often offer better cost-per-token for serving workloads. The interconnect bandwidth that matters enormously for training is largely irrelevant for single-model inference.

Why inference is the harder engineering problem

Training is difficult science. Inference is difficult engineering.

In training, you control the entire environment: batch size, precision, hardware, duration. You can afford to be slow as long as you converge. You run it once, or periodically, on a cluster you own.

In inference, you are serving real users with unpredictable traffic patterns, latency requirements, and wildly varying prompt lengths. You must make decisions in milliseconds that training pipelines make in minutes. You must maximise throughput without violating per-user latency SLAs. You must manage GPU memory dynamically as requests arrive and complete. You must serve hundreds or thousands of concurrent requests on hardware that was designed for batch computation.

This is why inference engineering is a distinct discipline — and why the optimisation techniques it employs (continuous batching, paged attention, speculative decoding, quantization) exist specifically to address constraints that training never encounters.

// In short

01 Training updates weights. Inference reads them. These are not variations of the same operation — they are fundamentally different computational processes.
02 Inference is memory-bandwidth-bound. Every generated token requires reading the full model weights. Compute is not the bottleneck — memory is.
03 Inference is the dominant lifetime cost. Training is one-time. Inference accumulates with every request forever. Optimising inference has compounding returns.
04 Inference requires different hardware choices. Memory capacity and bandwidth matter more than raw FLOPS. The best training GPU is rarely the best inference GPU.
05 Inference is harder to engineer. You are serving real users, in real time, with unpredictable load. The techniques in the rest of this site exist specifically to solve inference's unique constraints.