Speculative Decoding
Draft, Verify, and Get More Tokens Per Forward Pass
Standard LLM decoding is serial by design: one token per forward pass. Speculative decoding breaks that constraint. A small model drafts several tokens, a large model verifies them all in one pass, and the output is provably identical to running the large model alone.
The core insight
Standard autoregressive decoding is fundamentally serial. To generate token N, you must have token N-1. Each step requires a full forward pass through the entire model — for a large model this is an expensive operation that cannot be parallelised over the output sequence. This is the bottleneck.
Speculative decoding breaks this serialisation by exploiting an asymmetry in the cost of generation versus verification. Generating a token requires computing a distribution over the entire vocabulary and sampling from it. Verifying whether a given token is acceptable only requires computing the probability of that specific token, and you can verify multiple tokens in a single forward pass: a causal transformer fed the full candidate sequence produces the next-token distribution at every position simultaneously.
// The key asymmetry
Generating K tokens sequentially requires K forward passes. Verifying K candidate tokens simultaneously requires only 1 forward pass. Speculative decoding exploits this to produce multiple tokens per large model forward pass.
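The asymmetry can be made concrete with a toy counter. The `forward` function here is a hypothetical stand-in for one full large-model forward pass; a real transformer likewise returns a next-token distribution for every position of its input in one call:

```python
# Hypothetical stand-in for one large-model forward pass: given a token
# sequence, it returns a next-token distribution for every position.
calls = {"n": 0}

def forward(tokens):
    calls["n"] += 1
    # Dummy distributions; a real model would compute these with attention.
    return [{"the": 0.6, "a": 0.4} for _ in tokens]

# Sequential generation of K tokens costs K forward passes.
K = 4
seq = ["<s>"]
for _ in range(K):
    dist = forward(seq)[-1]
    seq.append(max(dist, key=dist.get))
assert calls["n"] == K

# Verifying K pre-specified candidates costs 1 forward pass: the model
# scores every position of the given sequence simultaneously.
calls["n"] = 0
candidates = ["the", "the", "a", "the"]
dists = forward(["<s>"] + candidates)
scores = [dists[i][tok] for i, tok in enumerate(candidates)]
assert calls["n"] == 1
```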
How it works
The algorithm uses two models: a small, fast draft model and the original large target model. The draft model is substantially smaller — typically an order of magnitude fewer parameters — fast enough that generating several draft tokens adds little overhead relative to a single target model forward pass.
// speculative_decoding_loop
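A minimal sketch of the loop in Python. The draft and target models are toy softmax stand-ins (their names, shapes, and distributions are illustrative assumptions, not a real model API); the rejection-sampling step follows the standard accept/correct scheme:

```python
import numpy as np

VOCAB = 8
rng = np.random.default_rng(0)

def _softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def draft_probs(seq):
    # Toy small model: next-token distribution after `seq`.
    return _softmax(np.cos(np.arange(VOCAB) + len(seq)))

def target_probs_all(prefix, draft):
    # Toy large model. Stands in for ONE forward pass that returns the
    # next-token distribution after the prefix and after each draft token.
    seqs = [list(prefix) + draft[:i] for i in range(len(draft) + 1)]
    return [_softmax(1.1 * np.cos(np.arange(VOCAB) + len(s))) for s in seqs]

def speculative_step(prefix, k=4):
    # 1. Draft k tokens autoregressively with the cheap model.
    seq, draft, qs = list(prefix), [], []
    for _ in range(k):
        q = draft_probs(seq)
        t = int(rng.choice(VOCAB, p=q))
        qs.append(q); draft.append(t); seq.append(t)
    # 2. One target pass yields k+1 distributions: one per draft position,
    #    plus one beyond the last draft token.
    ps = target_probs_all(prefix, draft)
    # 3. Accept/reject each draft token left to right.
    out = []
    for i, t in enumerate(draft):
        if rng.random() < min(1.0, ps[i][t] / qs[i][t]):
            out.append(t)                       # accepted
        else:
            resid = np.maximum(ps[i] - qs[i], 0.0)
            out.append(int(rng.choice(VOCAB, p=resid / resid.sum())))
            return out                          # corrected token ends the step
    # All k accepted: the same target pass also gives one bonus token.
    out.append(int(rng.choice(VOCAB, p=ps[k])))
    return out

print(speculative_step([0, 1, 2], k=4))  # between 1 and 5 new tokens
```

Each call to `speculative_step` consumes one target forward pass and emits between 1 token (first draft token rejected) and k+1 tokens (all accepted, plus the bonus token).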
A concrete example
Draft model proposes 5 tokens continuing "The Eiffel Tower is in".
The target model runs one forward pass over the prefix plus all 5 draft tokens and checks each one against its own distribution. Suppose the first 3 are accepted, the 4th is rejected and replaced with a token sampled from the target's corrected distribution, and the 5th (now invalidated) is discarded:
3 accepted + 1 corrected = 4 tokens from 1 large-model pass (vs. 1 in standard decoding)
Acceptance rate: the critical variable
The speedup from speculative decoding depends entirely on the acceptance rate — what fraction of draft tokens the target model accepts. If the draft model predicts well (its output distribution closely matches the target's), acceptance rates are high on well-matched workloads, yielding multiple tokens per large-model pass. If the distributions diverge (on out-of-distribution inputs, highly creative tasks, or with mismatched architectures), acceptance rates drop and the method provides no benefit.
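Under the simplifying assumption that each draft token is accepted independently with probability α, the expected number of tokens per target pass has a closed form (this is the standard analysis from the speculative decoding literature, used here as an illustrative model):

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens per target forward pass with k draft tokens,
    assuming i.i.d. acceptance probability alpha:
    1 + alpha + alpha**2 + ... + alpha**k = (1 - alpha**(k+1)) / (1 - alpha)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

print(round(expected_tokens_per_pass(0.8, 5), 2))  # 3.69: strong drafts pay off
print(round(expected_tokens_per_pass(0.3, 5), 2))  # 1.43: weak drafts barely help
```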
// When it doesn't help
Speculative decoding provides less benefit in two scenarios: (1) high-temperature sampling, where the target model's outputs are inherently unpredictable and the draft model's guesses are frequently rejected, and (2) high-concurrency serving, where the accelerator is already compute-bound across the batch, so parallel verification is no longer close to free and the draft model's extra compute outweighs the savings.
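Folding the draft model's cost into the acceptance-rate analysis makes the break-even point explicit. A rough model, assumed here for illustration: one speculative step costs k draft passes at a fraction c of a target pass each, plus one target pass.

```python
def approx_speedup(alpha: float, k: int, c: float) -> float:
    """Approximate speedup over plain decoding: expected tokens per
    speculative step divided by its relative cost (k draft passes at
    cost c each, plus 1 target pass), assuming i.i.d. acceptance alpha."""
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    return expected_tokens / (c * k + 1)

print(round(approx_speedup(0.8, 5, 0.05), 2))  # 2.95: high acceptance, cheap draft
print(round(approx_speedup(0.3, 5, 0.30), 2))  # 0.57: speculation makes things worse
```

When acceptance is low and the draft is not cheap relative to the target, the ratio drops below 1 and speculation is a net loss.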
Draft model choices
| Approach | Draft source | Acceptance rate | Best for |
|---|---|---|---|
| External draft model | Smaller model of same family (e.g., a 7–8B draft for a 70B target) | High, when well-matched | General purpose; requires extra VRAM |
| EAGLE / EAGLE-2 | Lightweight autoregressive head trained on target's features | Very high (80–95%) | Latency-critical deployments; highest acceptance rates |
| Medusa | Multiple parallel decoding heads on target model | Medium-high | Single-model serving, no separate draft VRAM |
| Lookahead decoding | N-gram cache built from generation history | Variable | Repetitive text, low-overhead alternative |
Output equivalence: does it change the model?
This is the most common concern about speculative decoding, and the answer is: no, the output distribution is provably identical to the target model.
The acceptance/rejection algorithm is a modified rejection sampling scheme. When a draft token is rejected, the corrected token is sampled from a distribution that accounts for the probability mass already used by the draft token. Mathematically, the marginal distribution over outputs is identical to what the target model alone would produce.
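A quick empirical check of this claim, with toy three-token distributions (the numbers are arbitrary assumptions for illustration): sample a draft token x from q, accept it with probability min(1, p(x)/q(x)), and otherwise resample from the corrected distribution norm(max(p - q, 0)). The resulting marginal matches the target p exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])   # toy target distribution
q = np.array([0.2, 0.3, 0.5])   # toy draft distribution

def corrected(p, q):
    # Distribution for the replacement token after a rejection.
    r = np.maximum(p - q, 0.0)
    return r / r.sum()

def one_token():
    x = rng.choice(3, p=q)                      # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):    # target accepts...
        return x
    return rng.choice(3, p=corrected(p, q))     # ...or resamples

samples = [one_token() for _ in range(200_000)]
freqs = np.bincount(samples, minlength=3) / len(samples)
# The empirical marginal tracks the target p, not the draft q.
assert np.allclose(freqs, p, atol=0.01)
print(freqs)
```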
In practice, a speculative run and a non-speculative run will not be token-for-token identical even at the same seed, because the sampler consumes randomness in a different order; the distribution over outputs, however, is exactly the same.
What kind of speedup to expect
Observed speedups vary considerably based on workload, draft model quality, and sampling temperature. Rough ranges from published evaluations and production deployments:
// expected_speedup_ranges
// In short
Speculative decoding trades cheap draft-model passes for fewer expensive target-model passes: draft several tokens, verify them all in one target forward pass, and keep the accepted prefix plus one corrected token. When the draft tracks the target well, you get several tokens per large-model pass with an output distribution that is provably unchanged; when it doesn't, the draft overhead buys little or nothing.