RelayAttention for Efficient Large Language Model Serving with Long System Prompts
CoRR (2024)
Abstract
Practical large language model (LLM) services may involve a long system
prompt, which specifies the instructions, examples, and knowledge documents of
the task and is reused across numerous requests. However, the long system
prompt causes throughput/latency bottlenecks, because the cost of generating the
next token grows with the sequence length. This paper aims to improve the
efficiency of LLM services that involve long system prompts. Our key
observation is that handling these system prompts requires heavily redundant
memory accesses in existing causal attention computation algorithms.
Specifically, for batched requests, the cached hidden states (i.e., key-value
pairs) of system prompts are transferred from off-chip DRAM to on-chip SRAM
multiple times, once for each individual request. To eliminate this redundancy,
we propose RelayAttention, an attention algorithm that allows
reading these hidden states from DRAM exactly once for a batch of input tokens.
RelayAttention is a free lunch: it maintains the generation quality while
requiring no model retraining, as it is based on a mathematical reformulation
of causal attention.
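The abstract does not reproduce the reformulation itself, but a decomposition consistent with its description is the exact two-segment softmax merge. A sketch, assuming the following notation (not from the paper): K_s, V_s are the cached system-prompt keys/values shared across the batch, K_r, V_r are the request-specific ones, and d is the head dimension:

    \mathrm{Attn}(q, [K_s; K_r], [V_s; V_r])
        = \alpha\,\mathrm{Attn}(q, K_s, V_s) + (1-\alpha)\,\mathrm{Attn}(q, K_r, V_r),
    \quad \text{where } \alpha = \frac{Z_s}{Z_s + Z_r}, \qquad
    Z_x = \sum_j \exp\!\big(q\,k_{x,j}^{\top} / \sqrt{d}\big).

The identity is exact because the softmax over the concatenated keys renormalizes the two partial softmaxes by their partition functions Z_s and Z_r; since causal masking never hides the prompt prefix from later tokens, both partial attentions are mask-free, and the system-prompt term can be evaluated for the whole batch in a single pass over K_s and V_s.

A minimal PyTorch sketch of this two-phase computation, assuming single-token decoding per request; the function name, tensor shapes, and log-sum-exp merge are illustrative choices, not the paper's implementation:

    import torch

    def two_phase_attention(q, k_sys, v_sys, k_req, v_req):
        """Exact attention over [system prompt; request context], split in two.

        Assumed shapes:
          q:     (batch, heads, 1, d)       one decoding token per request
          k_sys: (heads, n_sys, d)          shared prefix KV, stored once
          v_sys: (heads, n_sys, d)
          k_req: (batch, heads, n_req, d)   per-request KV
          v_req: (batch, heads, n_req, d)
        """
        scale = q.shape[-1] ** -0.5

        # Phase 1: attention over the shared prefix. k_sys/v_sys have no
        # batch dimension, so one batched matmul serves every request.
        s_sys = torch.einsum("bhqd,hkd->bhqk", q, k_sys) * scale
        m_sys = s_sys.max(dim=-1, keepdim=True).values
        p_sys = (s_sys - m_sys).exp()
        z_sys = p_sys.sum(dim=-1, keepdim=True)
        o_sys = torch.einsum("bhqk,hkd->bhqd", p_sys, v_sys) / z_sys

        # Phase 2: ordinary per-request attention over request-specific KV.
        s_req = torch.einsum("bhqd,bhkd->bhqk", q, k_req) * scale
        m_req = s_req.max(dim=-1, keepdim=True).values
        p_req = (s_req - m_req).exp()
        z_req = p_req.sum(dim=-1, keepdim=True)
        o_req = torch.einsum("bhqk,bhkd->bhqd", p_req, v_req) / z_req

        # Exact merge: weight each partial output by its share of the global
        # partition function; sigmoid of the log-sum-exp difference equals
        # Z_sys / (Z_sys + Z_req) and is numerically stable.
        lse_sys = m_sys + z_sys.log()
        lse_req = m_req + z_req.log()
        alpha = torch.sigmoid(lse_sys - lse_req)
        return alpha * o_sys + (1.0 - alpha) * o_req

Phase 1 is a dense matmul against tensors shared by the whole batch, which is what would let a fused kernel stream the system-prompt keys and values from DRAM once per decoding step instead of once per request; phase 2 is unchanged per-request attention, so the merged result matches standard causal attention exactly.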