Speculative pre-positioning: off-path decode for stateful inference sessions
The paper introduces speculative pre-positioning, a technique that reduces tail latency in stateful LLM inference by moving decoding work off the critical path. It reports up to a 2.3× speedup for interactive applications, with no changes to the model.
Background
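- Large language models (LLMs), such as those behind ChatGPT, generate text one token at a time (a token is a word or word fragment), so each output token costs a full pass through the model. "Speculative decoding" is a known trick: a smaller, faster draft model proposes a short run of candidate tokens, and the big model checks them all in parallel, keeping the longest prefix it agrees with. When the draft is usually right, several tokens are produced per pass of the big model.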
- This paper introduces "speculative pre-positioning" (SPP), a variant aimed at "stateful" sessions: repeated back-and-forth calls in which the model retains the earlier context (e.g., long code-editing sessions or chatbot conversations).
- The key idea: SPP precomputes and caches intermediate results from the draft model ahead of time, so that when the main model catches up, it can skip redundant work. This is analogous to prefetching in CPU architecture, where data is fetched before it is needed.
- The practical payoff: the paper claims up to a 2.2× speedup over standard speculative decoding on stateful inference workloads, with no loss in output quality. This matters because, according to the paper, stateful sessions are the dominant use case for production LLM APIs.
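To make the two ideas above concrete, here is a minimal Python sketch. It is not the paper's implementation. The "models" are toy deterministic functions, verification is greedy (exact match), and `DraftCache`, `propose`, and `speculative_decode` are hypothetical names invented for illustration. The sketch only shows that (a) accepting draft tokens solely when the large model agrees leaves the output unchanged, and (b) draft proposals computed ahead of a request can be found in a cache when the request arrives.

```python
from typing import Dict, List, Sequence, Tuple

Token = int
Context = Tuple[Token, ...]


def target_model(ctx: Context) -> Token:
    # Toy stand-in for the large model (assumption): greedy next token.
    return (sum(ctx) * 7 + 3) % 11


def draft_model(ctx: Context) -> Token:
    # Toy stand-in for the small model (assumption): agrees with the target
    # except when the context length is a multiple of 4.
    guess = target_model(ctx)
    return (guess + 1) % 11 if len(ctx) % 4 == 0 else guess


class DraftCache:
    """Hypothetical per-session cache of draft proposals, keyed by context."""

    def __init__(self) -> None:
        self.store: Dict[Context, List[Token]] = {}
        self.hits = 0
        self.misses = 0

    def propose(self, ctx: Context, k: int) -> List[Token]:
        cached = self.store.get(ctx)
        if cached is not None and len(cached) >= k:
            self.hits += 1
            return cached[:k]
        self.misses += 1
        drafted: List[Token] = []
        cur = ctx
        for _ in range(k):
            tok = draft_model(cur)
            drafted.append(tok)
            cur = cur + (tok,)
        self.store[ctx] = drafted
        return drafted


def greedy_decode(prompt: Sequence[Token], n_new: int) -> List[Token]:
    # Baseline: one large-model call per generated token.
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_model(tuple(out)))
    return out[len(prompt):]


def speculative_decode(prompt: Sequence[Token], n_new: int, k: int,
                       cache: DraftCache) -> List[Token]:
    out = list(prompt)
    limit = len(prompt) + n_new
    while len(out) < limit:
        drafted = cache.propose(tuple(out), k)
        # In a real system these checks would be one batched forward pass.
        for tok in drafted:
            expected = target_model(tuple(out))
            out.append(expected)  # always keep the target's own token
            if expected != tok or len(out) >= limit:
                break  # later draft tokens were built on a wrong prefix
    return out[len(prompt):]


if __name__ == "__main__":
    cache = DraftCache()
    # "Off-path" step: draft for the expected context before the request.
    cache.propose((1, 2, 3), 4)
    spec = speculative_decode([1, 2, 3], 8, 4, cache)
    print(spec == greedy_decode([1, 2, 3], 8), cache.hits, cache.misses)
```

Because every appended token comes from `target_model`, the speculative output matches plain greedy decoding by construction; the draft only determines how many tokens can be confirmed per verification round. The cache hit on the first lookup stands in for work that was moved off the request's critical path.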