Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
CoRR (2024)
Abstract
To mitigate the high inference latency stemming from autoregressive decoding
in Large Language Models (LLMs), Speculative Decoding has emerged as a novel
decoding paradigm for LLM inference. In each decoding step, this method first
efficiently drafts several future tokens and then verifies them in parallel.
Unlike autoregressive decoding, Speculative Decoding facilitates the
simultaneous decoding of multiple tokens per step, thereby accelerating
inference. This paper presents a comprehensive overview and analysis of this
promising decoding paradigm. We begin by providing a formal definition and
formulation of Speculative Decoding. Then, we organize in-depth discussions on
its key facets, including current leading techniques, the challenges faced, and
potential future directions in this field. We aim for this work to serve as a
catalyst for further research on Speculative Decoding, ultimately contributing
to more efficient LLM inference.
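The draft-then-verify loop described above can be illustrated with a minimal sketch. This is not code from the paper: the toy `draft_model` and `target_model` functions, the draft length `K`, and the acceptance rule are all illustrative stand-ins for a cheap drafter and a large target LLM, chosen so the mechanics of parallel verification and prefix acceptance are visible.

```python
# Illustrative sketch of speculative decoding (hypothetical, not the
# paper's implementation): a cheap "draft" model proposes K future
# tokens, the "target" model checks them, and the longest agreeing
# prefix is accepted, so multiple tokens can be emitted per step.

K = 4  # number of tokens drafted per step (assumed setting)

def draft_model(prefix):
    # Toy drafter: predicts last + 1, but errs whenever the true
    # next token would be a multiple of 5.
    nxt = prefix[-1] + 1
    return nxt + 1 if nxt % 5 == 0 else nxt

def target_model(prefix):
    # Toy target: always predicts last + 1 (the "ground truth").
    return prefix[-1] + 1

def speculative_step(prefix):
    # 1) Draft K tokens autoregressively with the cheap model.
    drafted, ctx = [], list(prefix)
    for _ in range(K):
        t = draft_model(ctx)
        drafted.append(t)
        ctx.append(t)
    # 2) Verify the drafts with the target model; in a real LLM this
    #    is a single parallel forward pass over all draft positions.
    accepted, ctx = [], list(prefix)
    for t in drafted:
        if target_model(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break  # reject this draft and everything after it
    # 3) Emit one target-model token at the first mismatch (or after
    #    full acceptance), so every step is guaranteed to progress.
    correction = target_model(ctx)
    return accepted + [correction]

seq = [1]
while len(seq) < 12:
    seq.extend(speculative_step(seq))
print(seq)  # tokens 1..15, generated several at a time
```

Because verification only rejects drafts the target model disagrees with, the output matches what plain autoregressive decoding with the target model would produce, while each step may accept up to K + 1 tokens at once.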