Speculative Decoding
Draft, Verify, and Get More Tokens Per Forward Pass
Standard LLM decoding is serial by design: one token per forward pass. Speculative decoding breaks that constraint. A small model drafts several tokens, a large model verifies them all in one pass, and the output is provably identical to running the large model alone.
The core insight
Standard autoregressive decoding is fundamentally serial. To generate token N, you must have token N-1. Each step requires a full forward pass through the entire model — for a large model this is an expensive operation that cannot be parallelised over the output sequence. This is the bottleneck.
Speculative decoding breaks this serialisation by exploiting an asymmetry in the cost of generation versus verification. Generating a token requires computing a distribution over the entire vocabulary and sampling from it. Verifying whether a given token is acceptable only requires computing the probability of that specific token, and you can verify multiple tokens in a single forward pass: a causal transformer fed the full candidate sequence produces the next-token distribution at every position simultaneously.
// The key asymmetry
Generating K tokens sequentially requires K forward passes. Verifying K candidate tokens simultaneously requires only 1 forward pass. Speculative decoding exploits this to produce multiple tokens per large model forward pass.
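The asymmetry can be made concrete with a toy counter. The `forward` function here is a hypothetical stand-in for one full large-model forward pass; a real transformer likewise returns a next-token distribution for every position of its input in one call:

```python
# Hypothetical stand-in for one large-model forward pass: given a token
# sequence, it returns a next-token distribution for every position.
calls = {"n": 0}

def forward(tokens):
    calls["n"] += 1
    # Dummy distributions; a real model would compute these with attention.
    return [{"the": 0.6, "a": 0.4} for _ in tokens]

# Sequential generation of K tokens costs K forward passes.
K = 4
seq = ["<s>"]
for _ in range(K):
    dist = forward(seq)[-1]
    seq.append(max(dist, key=dist.get))
assert calls["n"] == K

# Verifying K pre-specified candidates costs 1 forward pass: the model
# scores every position of the given sequence simultaneously.
calls["n"] = 0
candidates = ["the", "the", "a", "the"]
dists = forward(["<s>"] + candidates)
scores = [dists[i][tok] for i, tok in enumerate(candidates)]
assert calls["n"] == 1
```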
How it works
The algorithm uses two models: a small, fast draft model and the original large target model. The draft model is substantially smaller — typically an order of magnitude fewer parameters — fast enough that generating several draft tokens adds little overhead relative to a single target model forward pass.
// speculative_decoding_loop
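A minimal sketch of the loop in Python. The draft and target models are toy softmax stand-ins (their names, shapes, and distributions are illustrative assumptions, not a real model API); the rejection-sampling step follows the standard accept/correct scheme:

```python
import numpy as np

VOCAB = 8
rng = np.random.default_rng(0)

def _softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def draft_probs(seq):
    # Toy small model: next-token distribution after `seq`.
    return _softmax(np.cos(np.arange(VOCAB) + len(seq)))

def target_probs_all(prefix, draft):
    # Toy large model. Stands in for ONE forward pass that returns the
    # next-token distribution after the prefix and after each draft token.
    seqs = [list(prefix) + draft[:i] for i in range(len(draft) + 1)]
    return [_softmax(1.1 * np.cos(np.arange(VOCAB) + len(s))) for s in seqs]

def speculative_step(prefix, k=4):
    # 1. Draft k tokens autoregressively with the cheap model.
    seq, draft, qs = list(prefix), [], []
    for _ in range(k):
        q = draft_probs(seq)
        t = int(rng.choice(VOCAB, p=q))
        qs.append(q); draft.append(t); seq.append(t)
    # 2. One target pass yields k+1 distributions: one per draft position,
    #    plus one beyond the last draft token.
    ps = target_probs_all(prefix, draft)
    # 3. Accept/reject each draft token left to right.
    out = []
    for i, t in enumerate(draft):
        if rng.random() < min(1.0, ps[i][t] / qs[i][t]):
            out.append(t)                       # accepted
        else:
            resid = np.maximum(ps[i] - qs[i], 0.0)
            out.append(int(rng.choice(VOCAB, p=resid / resid.sum())))
            return out                          # corrected token ends the step
    # All k accepted: the same target pass also gives one bonus token.
    out.append(int(rng.choice(VOCAB, p=ps[k])))
    return out

print(speculative_step([0, 1, 2], k=4))  # between 1 and 5 new tokens
```

Each call to `speculative_step` consumes one target forward pass and emits between 1 token (first draft token rejected) and k+1 tokens (all accepted, plus the bonus token).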
A concrete example
Draft model proposes 5 tokens continuing "The Eiffel Tower is in".
The target model runs one forward pass over the prefix plus all 5 draft tokens and checks each one against its own distribution. Suppose the first 3 are accepted, the 4th is rejected and replaced with a token sampled from the target's corrected distribution, and the 5th (now invalidated) is discarded:
3 accepted + 1 corrected = 4 tokens from 1 large-model pass (vs. 1 in standard decoding)
Acceptance rate: the critical variable
The speedup from speculative decoding depends entirely on the acceptance rate — what fraction of draft tokens the target model accepts. If the draft model predicts well (its output distribution closely matches the target's), acceptance rates are high on well-matched workloads, yielding multiple tokens per large-model pass. If the distributions diverge (on out-of-distribution inputs, highly creative tasks, or with mismatched architectures), acceptance rates drop and the method provides no benefit.
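Under the simplifying assumption that each draft token is accepted independently with probability α, the expected number of tokens per target pass has a closed form (this is the standard analysis from the speculative decoding literature, used here as an illustrative model):

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens per target forward pass with k draft tokens,
    assuming i.i.d. acceptance probability alpha:
    1 + alpha + alpha**2 + ... + alpha**k = (1 - alpha**(k+1)) / (1 - alpha)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

print(round(expected_tokens_per_pass(0.8, 5), 2))  # 3.69: strong drafts pay off
print(round(expected_tokens_per_pass(0.3, 5), 2))  # 1.43: weak drafts barely help
```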
// When it doesn't help
Speculative decoding provides less benefit in two scenarios: (1) high-temperature sampling, where the target model's outputs are inherently unpredictable and the draft model's guesses are frequently rejected, and (2) high-concurrency serving, where the accelerator is already compute-bound across the batch, so parallel verification is no longer close to free and the draft model's extra compute outweighs the savings.
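Folding the draft model's cost into the acceptance-rate analysis makes the break-even point explicit. A rough model, assumed here for illustration: one speculative step costs k draft passes at a fraction c of a target pass each, plus one target pass.

```python
def approx_speedup(alpha: float, k: int, c: float) -> float:
    """Approximate speedup over plain decoding: expected tokens per
    speculative step divided by its relative cost (k draft passes at
    cost c each, plus 1 target pass), assuming i.i.d. acceptance alpha."""
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    return expected_tokens / (c * k + 1)

print(round(approx_speedup(0.8, 5, 0.05), 2))  # 2.95: high acceptance, cheap draft
print(round(approx_speedup(0.3, 5, 0.30), 2))  # 0.57: speculation makes things worse
```

When acceptance is low and the draft is not cheap relative to the target, the ratio drops below 1 and speculation is a net loss.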
Draft model choices
| Approach | Draft source | Acceptance rate | Best for |
|---|---|---|---|
| External draft model | Smaller model of same family (e.g., a 7–8B draft for a 70B target) | High, when well-matched | General purpose; requires extra VRAM |
| EAGLE / EAGLE-2 | Lightweight autoregressive head trained on target's features | Very high (80–95%) | Latency-critical deployments; highest acceptance rates |
| Medusa | Multiple parallel decoding heads on target model | Medium-high | Single-model serving, no separate draft VRAM |
| Lookahead decoding | N-gram cache built from generation history | Variable | Repetitive text, low-overhead alternative |
Output equivalence: does it change the model?
This is the most common concern about speculative decoding, and the answer is: no, the output distribution is provably identical to the target model.
The acceptance/rejection algorithm is a modified rejection sampling scheme. When a draft token is rejected, the corrected token is sampled from a distribution that accounts for the probability mass already used by the draft token. Mathematically, the marginal distribution over outputs is identical to what the target model alone would produce.
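A quick empirical check of this claim, with toy three-token distributions (the numbers are arbitrary assumptions for illustration): sample a draft token x from q, accept it with probability min(1, p(x)/q(x)), and otherwise resample from the corrected distribution norm(max(p - q, 0)). The resulting marginal matches the target p exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])   # toy target distribution
q = np.array([0.2, 0.3, 0.5])   # toy draft distribution

def corrected(p, q):
    # Distribution for the replacement token after a rejection.
    r = np.maximum(p - q, 0.0)
    return r / r.sum()

def one_token():
    x = rng.choice(3, p=q)                      # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):    # target accepts...
        return x
    return rng.choice(3, p=corrected(p, q))     # ...or resamples

samples = [one_token() for _ in range(200_000)]
freqs = np.bincount(samples, minlength=3) / len(samples)
# The empirical marginal tracks the target p, not the draft q.
assert np.allclose(freqs, p, atol=0.01)
print(freqs)
```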
In practice, a speculative run and a non-speculative run will not be token-for-token identical even at the same seed, because the sampler consumes randomness in a different order; the distribution over outputs, however, is exactly the same.
What kind of speedup to expect
Observed speedups vary considerably based on workload, draft model quality, and sampling temperature. Rough ranges from published evaluations and production deployments:
// expected_speedup_ranges
// In short
Speculative decoding trades cheap draft-model passes for fewer expensive target-model passes: draft several tokens, verify them all in one target forward pass, and keep the accepted prefix plus one corrected token. When the draft tracks the target well, you get several tokens per large-model pass with an output distribution that is provably unchanged; when it doesn't, the draft overhead buys little or nothing.