PaSS: Parallel Speculative Sampling.
CoRR(2023)
摘要
Scaling the size of language models to tens of billions of parameters has led
to impressive performance on a wide range of tasks. At generation, these models
are used auto-regressively, requiring a forward pass for each generated token,
and thus reading the full set of parameters from memory. This memory access
forms the primary bottleneck for generation and it worsens as the model size
increases. Moreover, executing a forward pass for multiple tokens in parallel
often takes nearly the same time as it does for just one token. These two
observations lead to the development of speculative sampling, where a second
smaller model is used to draft a few tokens, that are then validated or
rejected using a single forward pass of the large model. Unfortunately, this
method requires two models that share the same tokenizer and thus limits its
adoption. As an alternative, we propose to use parallel decoding as a way to
draft multiple tokens from a single model with no computational cost, nor the
need for a second model. Our approach only requires an additional input token
that marks the words that will be generated simultaneously. We show promising
performance (up to $30\%$ speed-up) while requiring only as few as $O(d_{emb})$
additional parameters.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要