Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
CoRR (2024)
Abstract
The rapid proliferation of Large Language Models (LLMs) has been a driving
force in the growth of cloud-based LLM services, which are now integral to
advancing AI applications. However, the dynamic auto-regressive nature of LLM
service, along with the need to support exceptionally long context lengths,
demands the flexible allocation and release of substantial resources. This
presents considerable challenges in designing cloud-based LLM service systems,
where inefficient management can lead to performance degradation or resource
wastage. In response to these challenges, this work introduces DistAttention, a
novel distributed attention algorithm that segments the KV Cache into smaller,
manageable units, enabling distributed processing and storage of the attention
module. Building on this, we propose DistKV-LLM, a distributed LLM serving system
that dynamically manages the KV Cache and effectively orchestrates all accessible
GPU and CPU memory spanning the data center. This ensures a
high-performance LLM service on the cloud, adaptable to a broad range of
context lengths. Validated in a cloud environment with 32 NVIDIA A100 GPUs in
configurations from 2 to 32 instances, our system exhibited 1.03-2.4x
end-to-end throughput improvements and supported context lengths 2-19x longer
than current state-of-the-art LLM service systems, as evidenced by extensive
testing across 18 datasets with context lengths up to 1,900K.
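The abstract does not detail how attention is computed over the segmented KV Cache, but one natural way to realize distributed processing of the attention module is a block-wise (online) softmax merge: each KV segment produces a partial, unnormalized attention output plus two scalars, and the partial results can be combined in any order. The NumPy sketch below illustrates this idea; the function name `attention_over_kv_blocks`, the single-query single-head setting, and the block layout are illustrative assumptions, not the paper's API.

```python
import numpy as np

def attention_over_kv_blocks(q, kv_blocks):
    """Compute softmax(q K^T / sqrt(d)) V over a KV cache split into blocks.

    q:         (d,) query vector for one head.
    kv_blocks: list of (K_i, V_i) pairs with K_i, V_i of shape (n_i, d).
    Each block's partial result could be produced on a different device;
    only (partial output, block max, block exp-sum) must be exchanged.
    """
    d = q.shape[-1]
    scale = 1.0 / np.sqrt(d)

    run_max = -np.inf           # running max of logits seen so far
    run_sum = 0.0               # running sum of exp(logit - run_max)
    out = np.zeros(d)           # running unnormalized weighted sum of values

    for K, V in kv_blocks:
        logits = (K @ q) * scale             # (n_i,)
        blk_max = logits.max()
        blk_exp = np.exp(logits - blk_max)   # (n_i,)
        blk_sum = blk_exp.sum()
        blk_out = blk_exp @ V                # (d,), unnormalized partial output

        # Merge this block's partial softmax statistics with the running result.
        new_max = max(run_max, blk_max)
        run_scale = np.exp(run_max - new_max)
        blk_scale = np.exp(blk_max - new_max)
        out = out * run_scale + blk_out * blk_scale
        run_sum = run_sum * run_scale + blk_sum * blk_scale
        run_max = new_max

    return out / run_sum


if __name__ == "__main__":
    # Sanity check: merged block-wise result matches monolithic attention
    # over the concatenated cache.
    rng = np.random.default_rng(0)
    d, lengths = 64, [128, 256, 512]         # three KV segments of different sizes
    q = rng.standard_normal(d)
    blocks = [(rng.standard_normal((n, d)), rng.standard_normal((n, d))) for n in lengths]
    K_all = np.concatenate([K for K, _ in blocks])
    V_all = np.concatenate([V for _, V in blocks])
    logits = (K_all @ q) / np.sqrt(d)
    ref = (np.exp(logits - logits.max()) / np.exp(logits - logits.max()).sum()) @ V_all
    assert np.allclose(attention_over_kv_blocks(q, blocks), ref)
```

Because only the partial output, the block's logit maximum, and its exponential sum need to cross device boundaries, each (K_i, V_i) segment can in principle reside on a different GPU or in CPU memory, which is the property a distributed KV Cache manager can exploit.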