What is Speculative Decoding?

Speculative Decoding is a clever optimization that speeds up inference for LLMs. The idea is to use a smaller "draft" model to generate a completion to a prompt. The larger model then processes this completion in parallel, and wherever the draft model's predictions align with the larger model's preferences, we can potentially generate multiple tokens in a single pass, all while preserving the original output distribution.

This leverages the concept of speculative execution, an optimization technique usually seen in processors, where a task is performed in parallel with checking whether it will actually be required. The classic example is branch prediction. For speculative execution to be effective, we need an efficient mechanism to suggest which speculative tasks to execute, specifically, ones likely to be needed.

This optimization is particularly valuable for LLM inference because autoregressive decoding is bottlenecked by memory bandwidth, while compute resources often sit idle. With speculative decoding, the next-token predictions are batched together: the draft model's k-token completion becomes k+1 parallel prediction tasks for the main model, letting us potentially accept 1 to k+1 tokens in a single forward pass (a rough sketch of the acceptance rule follows the links below).

In the paper, the authors demonstrated an out-of-the-box latency improvement of 2X-3X without any change to the outputs. Intriguingly, they showed that the draft model doesn't even need to be a small LLM: a simple bigram model achieved a 1.25X speedup in their experiments. Theoretically, even random completions should lead to minor speedups.

Speculative Decoding paper: https://lnkd.in/eUBpwp5S
Influential previous work:
Blockwise Parallel Decoding for Deep Autoregressive Models: https://lnkd.in/eFyT7CTt
Shallow Aggressive Decoding (SAD): https://lnkd.in/edHYU8Bg
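
For the curious, here is a minimal sketch of the verification step in Python. It assumes we already have the k draft tokens plus per-position next-token probabilities from both models; the function name, array shapes, and variable names are my own illustration, not code from the paper.

import numpy as np

def speculative_step(target_probs, draft_probs, draft_tokens, rng=None):
    # target_probs: (k+1, vocab) next-token probabilities from the large model
    #               at each of the k draft positions plus one extra position.
    # draft_probs:  (k, vocab) probabilities the draft model sampled draft_tokens from.
    # draft_tokens: the k token ids proposed by the draft model.
    # Returns the accepted token ids (between 1 and k+1 of them).
    rng = rng or np.random.default_rng()
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        # Accept the draft token with probability min(1, p/q); this rejection
        # scheme keeps the output distribution identical to the large model's.
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)
        else:
            # On rejection, resample from the normalized residual max(0, p - q)
            # and stop: later draft tokens were conditioned on a rejected prefix.
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            accepted.append(int(rng.choice(len(residual), p=residual / residual.sum())))
            return accepted
    # All k draft tokens accepted: take one bonus token from the large model's
    # extra position, so every pass yields at least one new token.
    accepted.append(int(rng.choice(len(target_probs[-1]), p=target_probs[-1])))
    return accepted

In a full decoding loop you would call something like this after each batched forward pass of the large model, append the accepted tokens to the context, and let the draft model speculate again from there.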