Mixture of Experts
More Model Capacity Without Proportionally More Compute
A dense model runs every parameter on every token. MoE breaks that constraint — you can have a much larger model while only activating a fraction of it per token. The tradeoff lands differently for training, inference, and deployment, and it's worth understanding each separately.
The dense baseline
In a standard transformer — what people usually call a dense model — every token passes through every layer, and every layer runs all of its parameters. The feed-forward network (FFN) in each transformer block processes every single token, regardless of what that token is or what it means. A comma and a rare technical term get the same amount of compute.
This works well, but it means compute scales directly with model size. Double the parameters, roughly double the compute per token. If you want more model capacity — more ability to represent complex patterns — you pay for it on every single forward pass.
// The core tension
Larger models generally perform better, but they cost more to run per token. Dense scaling forces you to pay the full compute cost for every token, even though most tokens probably don't need all that capacity. MoE is an attempt to break that coupling.
What MoE does differently
A Mixture of Experts model replaces the single feed-forward network in each transformer block with a set of parallel FFNs — the experts — and a small learned network called a router. For each token, the router picks a small number of experts to use (typically two), passes the token to those experts, combines their outputs, and moves on. The rest of the experts do nothing for that token.
The model's total parameter count includes all the experts. But the compute for any single token only involves the active ones. You get the representational capacity of a large model while paying a fraction of the per-token compute cost.
// dense_vs_moe_per_token
Dense block: the same FFN runs for every token — compute cost tied directly to parameter count.
MoE block: only the selected experts run for each token — total params >> active params per token.
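The scaling difference is easy to put in numbers. A back-of-envelope sketch, assuming illustrative sizes rather than any particular model:

```python
# Per-token FFN cost, dense vs MoE. All sizes are illustrative assumptions.
d_model, d_ff = 4096, 16384     # hidden width and FFN width (assumed)
n_experts, k = 8, 2             # 8 experts, 2 active per token

ffn_params = 2 * d_model * d_ff      # up- and down-projection weights
ffn_flops = 2 * ffn_params           # one multiply + one add per weight

dense_params = ffn_params            # a dense block has one FFN
moe_params = n_experts * ffn_params  # all experts must sit in memory
moe_flops = k * ffn_flops            # but only k experts actually run

print(moe_params / dense_params)  # 8.0 — memory scales with N
print(moe_flops / ffn_flops)      # 2.0 — compute scales with k
```

The two ratios are the whole story: memory tracks the total expert count, compute tracks only the number of active experts.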
The router
The router is a small linear layer that takes the token's representation as input and outputs a score for each expert. The top-k experts by score are selected — where k is a hyperparameter set at training time, usually 1 or 2. Those experts process the token, their outputs are weighted by the router scores, and the weighted sum becomes the layer output.
The router is learned during training, not hand-designed. The model figures out how to route. In practice, experts tend to develop loose specialisations — certain experts activate more frequently for certain types of content — but this is an emergent property, not something you specify.
// routing_for_a_single_token
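The routing step above can be sketched end to end. A minimal single-token MoE forward pass, assuming numpy; the dimensions, the ReLU FFN, and the softmax-over-selected-experts weighting are illustrative choices, not a specific model's implementation:

```python
# Single-token MoE feed-forward sketch with top-k routing (illustrative).
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, k = 16, 64, 8, 2

# One FFN weight pair per expert (biases omitted for brevity).
W1 = rng.standard_normal((n_experts, d_model, d_ff)) * 0.02
W2 = rng.standard_normal((n_experts, d_ff, d_model)) * 0.02
W_router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_ffn(x):                       # x: (d_model,) — one token
    scores = x @ W_router             # router: one score per expert
    top = np.argsort(scores)[-k:]     # indices of the top-k experts
    weights = np.exp(scores[top])
    weights /= weights.sum()          # softmax over the selected experts
    out = np.zeros(d_model)
    for w, e in zip(weights, top):    # only k of n_experts ever run
        hidden = np.maximum(x @ W1[e], 0.0)   # expert FFN with ReLU
        out += w * (hidden @ W2[e])
    return out

token = rng.standard_normal(d_model)
y = moe_ffn(token)
print(y.shape)  # (16,)
```

The loop body is the per-token compute: however many experts exist, only k matrix-multiply pairs run.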
Expert load balancing
Left to itself, a router will converge on a small number of popular experts and essentially ignore the rest. This is called expert collapse, and it wastes most of the capacity you added. Training MoE models requires an auxiliary loss term that penalises imbalanced routing — pushing the model to distribute tokens more evenly across experts. Getting this right is one of the harder parts of training a MoE model well.
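One common form of that auxiliary loss — the Switch-Transformer-style version, sketched here as an assumption rather than a prescription — multiplies the fraction of tokens each expert actually receives by its mean router probability:

```python
# Sketch of a load-balancing auxiliary loss (one common formulation).
import numpy as np

def load_balance_loss(router_probs, expert_assignment, n_experts):
    """router_probs: (tokens, n_experts) softmax outputs.
    expert_assignment: (tokens,) index of each token's chosen expert."""
    tokens = router_probs.shape[0]
    # f_i: fraction of tokens actually dispatched to expert i
    f = np.bincount(expert_assignment, minlength=n_experts) / tokens
    # P_i: mean router probability assigned to expert i
    P = router_probs.mean(axis=0)
    # Minimised (value 1.0) when both distributions are uniform
    return n_experts * float(np.dot(f, P))

# Perfectly balanced routing scores the minimum ...
uniform = np.full((8, 4), 0.25)
balanced = load_balance_loss(uniform, np.array([0, 1, 2, 3] * 2), 4)
# ... while collapse onto one expert scores N times higher.
peaked = np.zeros((8, 4)); peaked[:, 0] = 1.0
collapsed = load_balance_loss(peaked, np.zeros(8, dtype=int), 4)
print(balanced, collapsed)  # 1.0 4.0
```

The loss bottoms out at 1.0 under perfectly uniform routing and grows as routing collapses onto fewer experts, which is exactly the failure mode it exists to penalise.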
At inference, load imbalance shows up as a different problem: some experts receive far more traffic than others, creating hotspots on whichever GPU holds them. More on this in the deployment section.
// token_routing_across_8_experts — one batch step
Expert utilisation (relative token count):
E0 ▓▓▓▓▓
E1 ▓
E2 ▓▓▓▓
E3 ▓▓
E4 ▓▓▓▓▓
E5 ▓
E6 ▓▓▓
E7 ▓
The actual tradeoff
MoE gives you more total parameters for a given compute budget — but it doesn't give you anything for free. The tradeoff lands differently depending on what you care about.
| Dimension | Dense model | MoE model (same active params) |
|---|---|---|
| Total parameters | Smaller — all params are active params | Much larger — most params are inactive per token |
| Compute per token | Proportional to total size | Lower — only active experts run |
| Memory required | Proportional to total size | Also proportional to total size — all experts must be loaded |
| Training cost | Proportional to total size × tokens | Lower per token, but requires load balancing and is harder to stabilise |
| Inference latency (single request) | Predictable | Similar to a dense model of the active-param size — routing adds minimal overhead |
| Inference throughput | Limited by memory bandwidth and GPU count | Complicated — expert parallelism required at scale, routing creates communication overhead |
The key insight is in the memory row. A MoE model with eight experts and two active per token has roughly four times the parameters of a dense model with the same per-token compute (the attention layers stay dense, so the multiple applies to the FFN weights) — and it needs roughly four times the GPU memory to hold all those experts. You can't leave inactive experts on disk and load them on demand without destroying latency.
// The memory catch
Compute efficiency and memory efficiency point in opposite directions with MoE. You get better compute-per-token than a parameter-equivalent dense model, but you need the same memory as if you were running all experts all the time. This is the constraint that shapes almost every MoE deployment decision.
What MoE means for deployment
Deploying a MoE model at inference is meaningfully more complex than deploying a dense model of similar capability. A few specific issues come up repeatedly.
Expert parallelism
When a MoE model is too large for a single GPU, the natural sharding strategy is to put different experts on different GPUs — this is called expert parallelism. During a forward pass, tokens are routed to whichever GPU holds their assigned expert. This requires all-to-all communication between GPUs as tokens scatter to their experts and results come back.
All-to-all communication is expensive and scales badly with the number of GPUs. Tensor parallelism (splitting each layer across GPUs) avoids the routing communication but requires even more interconnect bandwidth. In practice, MoE deployments often use a combination — tensor parallelism within a node, expert parallelism across nodes — and tuning this for throughput is non-trivial.
Load imbalance at inference
If some experts consistently receive more tokens than others, the GPUs holding the popular experts become bottlenecks. The batch can't complete until the slowest GPU finishes. Strategies for managing this include routing capacity limits (rejecting tokens routed to an overloaded expert and sending them to a backup), but these can degrade quality if the router's first choice is frequently overridden.
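A capacity limit can be illustrated with a toy count, assuming a capacity factor of 1.25 and a deliberately skewed token distribution (both numbers are illustrative):

```python
# Toy illustration of expert capacity limits at inference.
import numpy as np

n_tokens, n_experts, capacity_factor = 64, 8, 1.25
capacity = int(capacity_factor * n_tokens / n_experts)  # max tokens/expert

# Skewed routing: expert 0 is far more popular than the rest (assumed).
choices = np.array([0] * 30 + list(range(1, 8)) * 4 + [1, 2, 3, 4, 5, 6])

load = np.bincount(choices, minlength=n_experts)   # tokens per expert
overflow = np.maximum(load - capacity, 0)          # rerouted or dropped
print(capacity, load[0], overflow.sum())  # 10 30 20
```

Here two thirds of the popular expert's tokens exceed its capacity — every one of them must be rerouted to a backup expert or dropped, which is where the quality risk comes from.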
Quantization and MoE
Because memory is the binding constraint, quantization is especially relevant for MoE models — you want to compress those idle expert weights as much as possible. The complication is that experts don't all behave the same way under quantization. Some experts handle niche patterns with less redundancy than others, and aggressive quantization can hurt quality unevenly. Per-expert calibration helps but adds complexity to the quantization pipeline.
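A minimal sketch of what per-expert calibration means in practice, assuming symmetric absmax int8 quantization with one scale per expert; the sizes and helper names are illustrative, not a production pipeline:

```python
# Per-expert int8 weight quantization with a separate absmax scale each.
import numpy as np

def quantize_expert(W):
    scale = np.abs(W).max() / 127.0                          # per-expert scale
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(2)
# Three experts with different weight ranges — a single shared scale
# would over-quantize the small-range experts.
experts = [rng.standard_normal((4, 4)) * 0.1 * (i + 1) for i in range(3)]

for i, W in enumerate(experts):
    q, scale = quantize_expert(W)
    err = np.abs(dequantize(q, scale) - W).max()
    print(f"expert {i}: scale={scale:.4f} max_err={err:.5f}")
```

With one shared scale, the widest-range expert would dictate the step size for everyone; per-expert scales keep each expert's quantization error proportional to its own weight range, which is the point of per-expert calibration.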
// Where MoE shines
MoE is at its best when you need a lot of model capacity but are sensitive to per-token inference cost — and when you have enough GPUs to hold the full model in memory without needing to shard aggressively. Large-batch, high-throughput serving with capable interconnects is the sweet spot. It's harder to justify for latency-sensitive single-user deployments on limited hardware.
How to think about sparse vs dense
The right mental model isn't "MoE is better than dense" or vice versa. They make different bets. A dense model concentrates capacity — every parameter contributes to every token, so the parameters are heavily utilised but there aren't that many of them. A MoE model distributes capacity — most parameters are idle at any moment, but the model can draw on a much larger pool of specialised representations when needed.
For tasks where a smaller, focused model would do fine, a MoE model wastes a lot of hardware loading and maintaining experts that rarely activate. For tasks that require broad knowledge and nuanced handling of diverse input types, the larger expert pool earns its keep.
// memory vs compute summary
memory ∝ total_params (all experts loaded)
compute ∝ active_params (k/N experts per token)
These scale independently. A model with 8 experts and 2 active per token has the memory footprint of the full model but roughly 2/8 of its per-token FFN compute. The ratio k/N — the fraction of experts active per token — sets the sparsity level.
// In short