Bifurcated Attention for Single-Context Large-Batch Sampling
arXiv (2024)
Abstract
In our study, we present bifurcated attention, a method developed for
language model inference in single-context batch sampling settings. This
approach aims to reduce redundant memory IO costs, a significant contributor
to latency at high batch sizes and long context lengths. Bifurcated attention
achieves this by dividing the attention mechanism during incremental decoding
into two distinct GEMM operations: one over the KV cache from the prefill
stage and one over the KV cache produced during decoding. The method computes
exactly the same result and retains the usual computational load (FLOPs) of
standard attention, but with reduced memory IO. Bifurcated attention is also
compatible with the multi-query attention mechanism, which is itself known for
reducing KV-cache memory IO, further enabling higher batch sizes and context
lengths. The resulting efficiency leads to lower latency and improved
suitability for real-time applications, e.g., enabling massively parallel
answer generation without substantially increasing latency, which enhances
performance when combined with postprocessing techniques such as reranking.
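The split described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; all function and variable names here are illustrative, and the assumed setting is single-context batch sampling, where every sampled continuation shares one prefill KV cache but holds its own decode-time KV cache. The key point is that the shared prefill keys/values are stored and loaded once rather than replicated per batch element, while the final output matches standard attention over the full (replicated) cache.

```python
import numpy as np

def bifurcated_attention(q, k_ctx, v_ctx, k_dec, v_dec):
    """Sketch of bifurcated attention for single-context batch sampling.

    q:     (batch, d)           current-step queries, one per continuation
    k_ctx: (ctx_len, d)         shared prefill keys (stored once, not per sample)
    v_ctx: (ctx_len, d)         shared prefill values
    k_dec: (batch, dec_len, d)  per-sample keys generated during decoding
    v_dec: (batch, dec_len, d)  per-sample values generated during decoding
    """
    d = q.shape[-1]
    # GEMM 1: all queries attend to the single shared prefill KV cache.
    s_ctx = (q @ k_ctx.T) / np.sqrt(d)                       # (batch, ctx_len)
    # GEMM 2: each query attends to its own decode-time KV cache.
    s_dec = np.einsum('bd,btd->bt', q, k_dec) / np.sqrt(d)   # (batch, dec_len)
    # Softmax over the concatenated scores, then split the weights back.
    s = np.concatenate([s_ctx, s_dec], axis=-1)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    ctx_len = k_ctx.shape[0]
    w_ctx, w_dec = w[:, :ctx_len], w[:, ctx_len:]
    # Combine the two value reads; the result equals standard attention over
    # the full cache, but k_ctx/v_ctx were loaded from memory only once.
    return w_ctx @ v_ctx + np.einsum('bt,btd->bd', w_dec, v_dec)
```

Because the softmax is taken over the concatenated score vector before the weights are split, the output is numerically identical to attending over a per-sample copy of the full cache; only the memory-access pattern changes.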