CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
arXiv (2023)
Abstract
As large language models (LLMs) take on complex tasks, their inputs are
supplemented with longer contexts that incorporate domain knowledge. Yet using
long contexts is challenging, as nothing can be generated until the whole
context is processed by the LLM. While the context-processing delay can be
reduced by reusing the KV cache of a context across different inputs, fetching
the KV cache, which contains large tensors, over the network can cause high
extra network delays.
CacheGen is a fast context-loading module for LLM systems. First, CacheGen
uses a custom tensor encoder, leveraging KV cache's distributional properties
to encode a KV cache into more compact bitstream representations with
negligible decoding overhead, to save bandwidth usage. Second, CacheGen adapts
the compression level of different parts of a KV cache to cope with changes in
available bandwidth, in order to maintain low context-loading delay and high
generation quality. When available bandwidth drops, CacheGen may raise the
compression level for a part of the context or recompute its KV cache on the
fly. We test CacheGen on popular LLMs and datasets. Compared to the recent
systems that reuse the KV cache, CacheGen reduces the KV cache size by 3.5-4.3x
and the total delay in fetching and processing contexts by 3.2-3.7x with
negligible impact on the LLM response quality. Our code is at:
https://github.com/UChi-JCL/CacheGen.
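
The following is a minimal, hypothetical sketch of the two ideas described above: quantizing a KV-cache chunk and entropy-coding it into a compact bitstream, then choosing a per-chunk compression level under a bandwidth budget. The function names (`encode_chunk`, `pick_level`), the chunk shape, and the use of zlib as a stand-in for CacheGen's custom arithmetic coder are all illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: quantize-then-entropy-code a KV cache chunk and
# pick the compression level that fits a transfer budget. All names and the
# zlib codec are assumptions standing in for CacheGen's custom encoder.
import zlib
import numpy as np


def encode_chunk(kv_chunk: np.ndarray, bits: int) -> bytes:
    """Uniformly quantize a float KV-cache chunk to `bits` bits, then entropy-code it."""
    lo, hi = float(kv_chunk.min()), float(kv_chunk.max())
    scale = (hi - lo) / (2 ** bits - 1) or 1.0
    q = np.round((kv_chunk - lo) / scale).astype(np.uint16)
    return zlib.compress(q.tobytes())  # stand-in for a custom arithmetic coder


def pick_level(sizes: dict[int, int], deadline_s: float, bandwidth_bps: float) -> int:
    """Choose the highest-precision level whose bitstream still fits the budget."""
    budget = deadline_s * bandwidth_bps / 8  # bytes transferable before the deadline
    for bits in sorted(sizes, reverse=True):  # prefer more precision
        if sizes[bits] <= budget:
            return bits
    return min(sizes)  # fall back to the coarsest level


# Usage: encode one chunk of a (layers, tokens, heads, head_dim) KV cache at
# several levels, then select a level for an assumed bandwidth estimate.
kv = np.random.randn(4, 1024, 8, 128).astype(np.float32)
streams = {bits: encode_chunk(kv, bits) for bits in (8, 6, 4)}
sizes = {bits: len(s) for bits, s in streams.items()}
best = pick_level(sizes, deadline_s=0.2, bandwidth_bps=200e6)
print(sizes, "-> sending the", best, "bit stream")
```

The per-chunk choice mirrors the abstract's point: when the estimated bandwidth drops, the selector falls back to a coarser quantization level for that part of the context rather than blowing the loading deadline.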