Speculative decoding

Speculative decoding is an inference-time optimization for autoregressive large language models (LLMs) that generates multiple tokens per decoding step instead of one. A smaller draft model proposes a sequence of candidate tokens, and the larger target model verifies them in a single forward pass through a modified rejection sampling scheme. The verification preserves the target model’s original output distribution, so the technique produces the same results as standard decoding while reducing latency by a factor of roughly two to three.[1][2] The name is an analogy to speculative execution in CPU design, where a processor runs instructions along a predicted branch before the outcome is known.[3]

Background

Standard autoregressive decoding in large language models generates one token at a time. The model computes a probability distribution over its vocabulary, samples the next token, and feeds that token back as input. For large models, this process is bottlenecked by memory bandwidth rather than arithmetic throughput: loading the model’s parameters from high-bandwidth memory (HBM) to the processor takes up most of the wall-clock time at each step.[1] Because of this, a forward pass over one token and a forward pass over several tokens in a batch take roughly the same time. Speculative decoding relies on this property.[3]
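The memory-bound behavior can be illustrated with a back-of-the-envelope cost model for a single weight matrix (a toy sketch; the hidden size and precision below are hypothetical, not taken from any specific model):

```python
# Toy cost model for one d x d weight matrix in a transformer layer,
# illustrating why decoding is memory-bandwidth bound: the weight bytes
# streamed from HBM are the same whether the forward pass covers
# 1 token or K tokens, while only the arithmetic grows.
d = 8192             # hypothetical hidden size
bytes_per_param = 2  # fp16 weights

def matmul_cost(num_tokens):
    flops = 2 * d * d * num_tokens          # multiply-accumulates performed
    weight_bytes = d * d * bytes_per_param  # weights read once per pass
    return flops, weight_bytes

flops_1, bytes_1 = matmul_cost(1)
flops_8, bytes_8 = matmul_cost(8)
assert bytes_1 == bytes_8      # identical weight traffic for 1 or 8 tokens
assert flops_8 == 8 * flops_1  # arithmetic grows, but it is not the bottleneck
```

Because the dominant cost (streaming the weights) is independent of the number of tokens in the batch, verifying several drafted tokens at once is nearly free relative to generating them one by one.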

Mechanism

The technique alternates between two phases: drafting and verification.[4]

During drafting, a fast approximation model generates a short run of K candidate tokens, typically between 3 and 12. The draft model is usually a much smaller version of the target model or a lightweight auxiliary network.[5]

During verification, the target model scores the entire draft sequence in one batched forward pass. A modified rejection sampling algorithm compares the draft and target probabilities at each position: a draft token is accepted with probability min(1, p/q), where p and q are the probabilities the target and draft models assign to that token. Tokens the target model finds at least as likely as the draft did are therefore always accepted; the first token that fails is resampled from a corrected distribution, and everything after it is discarded. The result is that the output distribution is identical to generating each token one at a time.[1][2]
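The accept/resample rule can be sketched as follows (a minimal NumPy sketch; the function name and the per-position probability arrays are illustrative — a real implementation would obtain them from the draft and target models):

```python
import numpy as np

rng = np.random.default_rng(0)

def verify(draft_tokens, q_dists, p_dists):
    # q_dists[i] / p_dists[i]: draft- and target-model probability
    # vectors over the vocabulary at draft position i.
    accepted = []
    for i, tok in enumerate(draft_tokens):
        q, p = q_dists[i][tok], p_dists[i][tok]
        if rng.random() < min(1.0, p / q):  # accept with prob min(1, p/q)
            accepted.append(int(tok))
        else:
            # First rejection: resample from the corrected distribution
            # proportional to max(0, p - q), then discard the rest.
            residual = np.maximum(p_dists[i] - q_dists[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            break
    return accepted
```

When every draft token is accepted, the full algorithm additionally samples one bonus token from the target model’s distribution at the next position; this sketch omits that step for brevity.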

How many tokens get accepted per cycle depends on how well the draft model matches the target. For common words and predictable continuations the match tends to be good, so the target model can confirm several tokens at once.[3]
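Under a simplified model in which each draft token is accepted independently with probability α, Leviathan et al. derive the expected number of tokens produced per draft-verify cycle. A small sketch of that formula (the function name is illustrative):

```python
def expected_tokens(alpha, k):
    """Expected tokens per draft-verify cycle when each of the k draft
    tokens is accepted independently with probability alpha (alpha < 1).
    Every cycle yields at least one token (the resampled or bonus token),
    giving the geometric series 1 + alpha + ... + alpha**k."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With a 50% per-token acceptance rate and k = 3 drafted tokens,
# each cycle yields 1.875 tokens on average.
```

The formula shows the diminishing return of longer drafts: raising k helps only as long as the acceptance rate stays high, which is why draft lengths in practice stay small.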

History

An early precursor was blockwise parallel decoding, proposed in 2018 by Stern, Shazeer, and Uszkoreit. Their method predicted multiple future tokens through auxiliary prediction heads and validated them against the autoregressive model, but it only worked with greedy decoding and did not preserve the full sampling distribution.[6]

The modern form of the technique came from Yaniv Leviathan, Matan Kalman, and Yossi Matias at Google Research, who posted “Fast Inference from Transformers via Speculative Decoding” on arXiv in November 2022.[1] Separately and at about the same time, Charlie Chen and colleagues at DeepMind arrived at a closely related method they called speculative sampling, published in February 2023.[2] Both papers introduced the use of rejection sampling to guarantee that the output distribution is unchanged. Leviathan et al. showed roughly 2–3x speedup on T5-XXL (11 billion parameters); Chen et al. reported 2–2.5x on the Chinchilla model (70 billion parameters).

The Leviathan et al. paper was presented as an oral at the International Conference on Machine Learning in July 2023.[1]

Variants

SpecInfer (Miao et al., 2024) uses multiple small language models to jointly build a tree of candidate continuations rather than a single chain. The target model verifies the whole tree in parallel and keeps the longest valid path, with reported speedups of 1.5–3.5x.[7]

Medusa (Cai et al., 2024) takes a different approach by not using a separate draft model at all. Extra lightweight decoding heads are attached to the target model itself, and each one predicts a token at a different future position. The candidates are evaluated through a tree-structured attention mechanism. The authors measured 2.2–3.6x speedup.[8]

EAGLE (Li et al., 2024) performs autoregression on the target model’s internal feature representations (specifically the second-to-top layer) rather than on tokens directly. On LLaMA 2 Chat 70B, this gave a 2.7–3.5x latency reduction. Later versions added dynamic draft trees (EAGLE-2) and further optimizations (EAGLE-3), reaching 3–6.5x speedup.[9]

Adoption

By 2024, speculative decoding had become a standard part of production LLM serving. Google uses it in the AI Overviews feature of Google Search.[3] Open-source inference frameworks such as vLLM, NVIDIA’s TensorRT-LLM, and SGLang all include built-in support for speculative decoding and its variants.[5][10] Apple, AWS, and Meta have also published research extending the method or deploying it at scale.[11]

References

  1. ^ a b c d e Leviathan, Yaniv; Kalman, Matan; Matias, Yossi (2023). Fast Inference from Transformers via Speculative Decoding. Proceedings of the 40th International Conference on Machine Learning (ICML).
  2. ^ a b c Chen, Charlie; Borgeaud, Sebastian; Irving, Geoffrey; Lespiau, Jean-Baptiste; Sifre, Laurent; Jumper, John (2023-02-02). “Accelerating Large Language Model Decoding with Speculative Sampling”. arXiv:2302.01318 [cs.CL].
  3. ^ a b c d “Looking back at speculative decoding”. research.google. Retrieved 2026-04-05.
  4. ^ Xia, Heming; Yang, Zhe; Dong, Qingxiu; Wang, Peiyi; Li, Yongqi; Ge, Tao; Liu, Tianyu; Li, Wenjie; Sui, Zhifang (2024-06-04), Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding, arXiv, doi:10.48550/arXiv.2401.07851, arXiv:2401.07851, retrieved 2026-04-05
  5. ^ a b “An Introduction to Speculative Decoding for Reducing Latency in AI Inference”. NVIDIA Technical Blog. 2025-09-17. Retrieved 2026-04-05.
  6. ^ Stern, Mitchell; Shazeer, Noam; Uszkoreit, Jakob (2018-11-07), Blockwise Parallel Decoding for Deep Autoregressive Models, arXiv, doi:10.48550/arXiv.1811.03115, arXiv:1811.03115, retrieved 2026-04-05
  7. ^ Miao, Xupeng; Oliaro, Gabriele; Zhang, Zhihao; Cheng, Xinhao; Wang, Zeyu; Zhang, Zhengxin; Wong, Rae Ying Yee; Zhu, Alan; Yang, Lijie (2024-04-01), SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification, arXiv, doi:10.48550/arXiv.2305.09781, arXiv:2305.09781, retrieved 2026-04-05
  8. ^ Cai, Tianle; Li, Yuhong; Geng, Zhengyang; Peng, Hongwu; Lee, Jason D.; Chen, Deming; Dao, Tri (2024-06-14), Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads, arXiv, doi:10.48550/arXiv.2401.10774, arXiv:2401.10774, retrieved 2026-04-05
  9. ^ Li, Yuhui; Wei, Fangyun; Zhang, Chao; Zhang, Hongyang (2025-03-04), EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty, arXiv, doi:10.48550/arXiv.2401.15077, arXiv:2401.15077, retrieved 2026-04-05
  10. ^ “Boost Llama 3.3 70B Inference Throughput 3x with NVIDIA TensorRT-LLM Speculative Decoding”. NVIDIA Technical Blog. 2024-12-17. Retrieved 2026-04-05.
  11. ^ Ryu, Hyun; Kim, Eric (2024-11-20). “Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding”. arXiv:2411.13157 [cs.CL].