DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving
arXiv (2024)
Abstract
Distributed LLM serving is costly and often underutilizes hardware
accelerators due to three key challenges: bubbles in pipeline-parallel
deployments caused by the bimodal latency of prompt and token processing, GPU
memory overprovisioning, and long recovery times in case of failures. In this
paper, we propose DéjàVu, a system to address all these challenges using a
versatile and efficient KV cache streaming library (DéjàVuLib). Using
DéjàVuLib, we propose and implement efficient prompt-token disaggregation
to reduce pipeline bubbles, microbatch swapping for efficient GPU memory
management, and state replication for fault tolerance. We highlight the
efficacy of these solutions on a range of large models across cloud
deployments.
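
The abstract does not spell out DéjàVuLib's interface, so the following is only a hypothetical sketch of the KV-cache streaming idea it describes: KV tensors are asynchronously copied to a host-side replica, which supports both microbatch swapping (freeing GPU memory and restoring it later) and recovery from a failed worker. All names here (KVCacheStreamer, stream_out, swap_in) are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch, not DéjàVuLib's real API: asynchronously replicate
# per-microbatch KV caches off the GPU so they can be swapped back in or
# used to recover after a failure.
import threading
import queue
import numpy as np

class KVCacheStreamer:
    def __init__(self):
        self._replica = {}             # microbatch_id -> host copy of the KV cache
        self._pending = queue.Queue()  # stream requests, drained by a background worker
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def stream_out(self, microbatch_id, kv_cache):
        """Enqueue an asynchronous copy of this microbatch's KV cache."""
        self._pending.put((microbatch_id, kv_cache))

    def _drain(self):
        while True:
            mb_id, kv = self._pending.get()
            # A real system would issue a GPU-to-host or GPU-to-remote transfer
            # overlapped with compute; here we simply deep-copy on the CPU.
            self._replica[mb_id] = {layer: t.copy() for layer, t in kv.items()}
            self._pending.task_done()

    def swap_in(self, microbatch_id):
        """Restore a swapped-out (or lost) microbatch's KV cache from the replica."""
        return self._replica[microbatch_id]

# Usage: stream out one microbatch's KV cache, then recover it as if after a failure.
streamer = KVCacheStreamer()
kv = {f"layer{i}": np.random.rand(2, 16, 64).astype(np.float32) for i in range(4)}
streamer.stream_out(microbatch_id=0, kv_cache=kv)
streamer._pending.join()  # wait for the asynchronous copy to land
restored = streamer.swap_in(0)
assert np.allclose(restored["layer0"], kv["layer0"])
```

The key design point this sketch mirrors is that streaming happens off the critical path: generation enqueues copies and continues, so replication and swapping overlap with compute rather than stalling it.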