Mixture of Experts
More Model Capacity Without Proportionally More Compute
A dense model runs every parameter on every token. MoE breaks that constraint — you can have a much larger model while only activating a fraction of it per token. The tradeoff lands differently for training, inference, and deployment, and it's worth understanding each separately.
The dense baseline
In a standard transformer — what people usually call a dense model — every token passes through every layer, and every layer runs all of its parameters. The feed-forward network (FFN) in each transformer block processes every single token, regardless of what that token is or what it means. A comma and a rare technical term get the same amount of compute.
This works well, but it means compute scales directly with model size. Double the parameters, roughly double the compute per token. If you want more model capacity — more ability to represent complex patterns — you pay for it on every single forward pass.
// The core tension
Larger models generally perform better, but they cost more to run per token. Dense scaling forces you to pay the full compute cost for every token, even though most tokens probably don't need all that capacity. MoE is an attempt to break that coupling.
What MoE does differently
A Mixture of Experts model replaces the single feed-forward network in each transformer block with a set of parallel FFNs — the experts — and a small learned network called a router. For each token, the router picks a small number of experts to use (typically two), passes the token to those experts, combines their outputs, and moves on. The rest of the experts do nothing for that token.
The model's total parameter count includes all the experts. But the compute for any single token only involves the active ones. You get the representational capacity of a large model while paying a fraction of the per-token compute cost.
// dense_vs_moe_per_token
Dense block: the same FFN runs for every token — compute cost tied directly to parameter count.
MoE block: only the selected experts run for each token — total params >> active params per token.
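The scaling difference is easy to put in numbers. A back-of-envelope sketch, assuming illustrative sizes rather than any particular model:

```python
# Per-token FFN cost, dense vs MoE. All sizes are illustrative assumptions.
d_model, d_ff = 4096, 16384     # hidden width and FFN width (assumed)
n_experts, k = 8, 2             # 8 experts, 2 active per token

ffn_params = 2 * d_model * d_ff      # up- and down-projection weights
ffn_flops = 2 * ffn_params           # one multiply + one add per weight

dense_params = ffn_params            # a dense block has one FFN
moe_params = n_experts * ffn_params  # all experts must sit in memory
moe_flops = k * ffn_flops            # but only k experts actually run

print(moe_params / dense_params)  # 8.0 — memory scales with N
print(moe_flops / ffn_flops)      # 2.0 — compute scales with k
```

The two ratios are the whole story: memory tracks the total expert count, compute tracks only the number of active experts.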
The router
The router is a small linear layer that takes the token's representation as input and outputs a score for each expert. The top-k experts by score are selected — where k is a hyperparameter set at training time, usually 1 or 2. Those experts process the token, their outputs are weighted by the router scores, and the weighted sum becomes the layer output.
The router is learned during training, not hand-designed. The model figures out how to route. In practice, experts tend to develop loose specialisations — certain experts activate more frequently for certain types of content — but this is an emergent property, not something you specify.
// routing_for_a_single_token
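The routing step above can be sketched end to end. A minimal single-token MoE forward pass, assuming numpy; the dimensions, the ReLU FFN, and the softmax-over-selected-experts weighting are illustrative choices, not a specific model's implementation:

```python
# Single-token MoE feed-forward sketch with top-k routing (illustrative).
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, k = 16, 64, 8, 2

# One FFN weight pair per expert (biases omitted for brevity).
W1 = rng.standard_normal((n_experts, d_model, d_ff)) * 0.02
W2 = rng.standard_normal((n_experts, d_ff, d_model)) * 0.02
W_router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_ffn(x):                       # x: (d_model,) — one token
    scores = x @ W_router             # router: one score per expert
    top = np.argsort(scores)[-k:]     # indices of the top-k experts
    weights = np.exp(scores[top])
    weights /= weights.sum()          # softmax over the selected experts
    out = np.zeros(d_model)
    for w, e in zip(weights, top):    # only k of n_experts ever run
        hidden = np.maximum(x @ W1[e], 0.0)   # expert FFN with ReLU
        out += w * (hidden @ W2[e])
    return out

token = rng.standard_normal(d_model)
y = moe_ffn(token)
print(y.shape)  # (16,)
```

The loop body is the per-token compute: however many experts exist, only k matrix-multiply pairs run.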
Expert load balancing
Left to itself, a router will converge on a small number of popular experts and essentially ignore the rest. This is called expert collapse, and it wastes most of the capacity you added. Training MoE models requires an auxiliary loss term that penalises imbalanced routing — pushing the model to distribute tokens more evenly across experts. Getting this right is one of the harder parts of training a MoE model well.
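One common form of that auxiliary loss — the Switch-Transformer-style version, sketched here as an assumption rather than a prescription — multiplies the fraction of tokens each expert actually receives by its mean router probability:

```python
# Sketch of a load-balancing auxiliary loss (one common formulation).
import numpy as np

def load_balance_loss(router_probs, expert_assignment, n_experts):
    """router_probs: (tokens, n_experts) softmax outputs.
    expert_assignment: (tokens,) index of each token's chosen expert."""
    tokens = router_probs.shape[0]
    # f_i: fraction of tokens actually dispatched to expert i
    f = np.bincount(expert_assignment, minlength=n_experts) / tokens
    # P_i: mean router probability assigned to expert i
    P = router_probs.mean(axis=0)
    # Minimised (value 1.0) when both distributions are uniform
    return n_experts * float(np.dot(f, P))

# Perfectly balanced routing scores the minimum ...
uniform = np.full((8, 4), 0.25)
balanced = load_balance_loss(uniform, np.array([0, 1, 2, 3] * 2), 4)
# ... while collapse onto one expert scores N times higher.
peaked = np.zeros((8, 4)); peaked[:, 0] = 1.0
collapsed = load_balance_loss(peaked, np.zeros(8, dtype=int), 4)
print(balanced, collapsed)  # 1.0 4.0
```

The loss bottoms out at 1.0 under perfectly uniform routing and grows as routing collapses onto fewer experts, which is exactly the failure mode it exists to penalise.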
At inference, load imbalance shows up as a different problem: some experts receive far more traffic than others, creating hotspots on whichever GPU holds them. More on this in the deployment section.
// token_routing_across_8_experts — one batch step
Expert utilisation (relative token count):
E0 ▓▓▓▓▓
E1 ▓
E2 ▓▓▓▓
E3 ▓▓
E4 ▓▓▓▓▓
E5 ▓
E6 ▓▓▓
E7 ▓
The actual tradeoff
MoE gives you more total parameters for a given compute budget — but it doesn't give you anything for free. The tradeoff lands differently depending on what you care about.
| Dimension | Dense model | MoE model (same active params) |
|---|---|---|
| Total parameters | Smaller — all params are active params | Much larger — most params are inactive per token |
| Compute per token | Proportional to total size | Lower — only active experts run |
| Memory required | Proportional to total size | Also proportional to total size — all experts must be loaded |
| Training cost | Proportional to total size × tokens | Lower per token, but requires load balancing and is harder to stabilise |
| Inference latency (single request) | Predictable | Similar to a dense model of the active-param size — routing adds minimal overhead |
| Inference throughput | Limited by memory bandwidth and GPU count | Complicated — expert parallelism required at scale, routing creates communication overhead |
The key insight is in the memory row. A MoE model with eight experts and two active per token has roughly four times the parameters of a dense model with the same per-token compute (the attention layers stay dense, so the multiple applies to the FFN weights) — and it needs roughly four times the GPU memory to hold all those experts. You can't leave inactive experts on disk and load them on demand without destroying latency.
// The memory catch
Compute efficiency and memory efficiency point in opposite directions with MoE. You get better compute-per-token than a parameter-equivalent dense model, but you need the same memory as if you were running all experts all the time. This is the constraint that shapes almost every MoE deployment decision.
What MoE means for deployment
Deploying a MoE model at inference is meaningfully more complex than deploying a dense model of similar capability. A few specific issues come up repeatedly.
Expert parallelism
When a MoE model is too large for a single GPU, the natural sharding strategy is to put different experts on different GPUs — this is called expert parallelism. During a forward pass, tokens are routed to whichever GPU holds their assigned expert. This requires all-to-all communication between GPUs as tokens scatter to their experts and results come back.
All-to-all communication is expensive and scales badly with the number of GPUs. Tensor parallelism (splitting each layer across GPUs) avoids the routing communication but requires even more interconnect bandwidth. In practice, MoE deployments often use a combination — tensor parallelism within a node, expert parallelism across nodes — and tuning this for throughput is non-trivial.
Load imbalance at inference
If some experts consistently receive more tokens than others, the GPUs holding the popular experts become bottlenecks. The batch can't complete until the slowest GPU finishes. Strategies for managing this include routing capacity limits (rejecting tokens routed to an overloaded expert and sending them to a backup), but these can degrade quality if the router's first choice is frequently overridden.
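A capacity limit can be illustrated with a toy count, assuming a capacity factor of 1.25 and a deliberately skewed token distribution (both numbers are illustrative):

```python
# Toy illustration of expert capacity limits at inference.
import numpy as np

n_tokens, n_experts, capacity_factor = 64, 8, 1.25
capacity = int(capacity_factor * n_tokens / n_experts)  # max tokens/expert

# Skewed routing: expert 0 is far more popular than the rest (assumed).
choices = np.array([0] * 30 + list(range(1, 8)) * 4 + [1, 2, 3, 4, 5, 6])

load = np.bincount(choices, minlength=n_experts)   # tokens per expert
overflow = np.maximum(load - capacity, 0)          # rerouted or dropped
print(capacity, load[0], overflow.sum())  # 10 30 20
```

Here two thirds of the popular expert's tokens exceed its capacity — every one of them must be rerouted to a backup expert or dropped, which is where the quality risk comes from.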
Quantization and MoE
Because memory is the binding constraint, quantization is especially relevant for MoE models — you want to compress those idle expert weights as much as possible. The complication is that experts don't all behave the same way under quantization. Some experts handle niche patterns with less redundancy than others, and aggressive quantization can hurt quality unevenly. Per-expert calibration helps but adds complexity to the quantization pipeline.
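A minimal sketch of what per-expert calibration means in practice, assuming symmetric absmax int8 quantization with one scale per expert; the sizes and helper names are illustrative, not a production pipeline:

```python
# Per-expert int8 weight quantization with a separate absmax scale each.
import numpy as np

def quantize_expert(W):
    scale = np.abs(W).max() / 127.0                          # per-expert scale
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(2)
# Three experts with different weight ranges — a single shared scale
# would over-quantize the small-range experts.
experts = [rng.standard_normal((4, 4)) * 0.1 * (i + 1) for i in range(3)]

for i, W in enumerate(experts):
    q, scale = quantize_expert(W)
    err = np.abs(dequantize(q, scale) - W).max()
    print(f"expert {i}: scale={scale:.4f} max_err={err:.5f}")
```

With one shared scale, the widest-range expert would dictate the step size for everyone; per-expert scales keep each expert's quantization error proportional to its own weight range, which is the point of per-expert calibration.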
// Where MoE shines
MoE is at its best when you need a lot of model capacity but are sensitive to per-token inference cost — and when you have enough GPUs to hold the full model in memory without needing to shard aggressively. Large-batch, high-throughput serving with capable interconnects is the sweet spot. It's harder to justify for latency-sensitive single-user deployments on limited hardware.
How to think about sparse vs dense
The right mental model isn't "MoE is better than dense" or vice versa. They make different bets. A dense model concentrates capacity — every parameter contributes to every token, so the parameters are heavily utilised but there aren't that many of them. A MoE model distributes capacity — most parameters are idle at any moment, but the model can draw on a much larger pool of specialised representations when needed.
For tasks where a smaller, focused model would do fine, a MoE model wastes a lot of hardware loading and maintaining experts that rarely activate. For tasks that require broad knowledge and nuanced handling of diverse input types, the larger expert pool earns its keep.
// memory vs compute summary
memory ∝ total_params (all experts loaded)
compute ∝ active_params (k/N experts per token)
These scale independently. A model with 8 experts and 2 active per token has the memory footprint of the full model but roughly 2/8 of its per-token FFN compute. The ratio k/N — the fraction of experts active per token — sets the sparsity level.
// In short