Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
CoRR (2023)
Abstract
The growing demand for Large Language Models (LLMs) in applications such as
content generation, intelligent chatbots, and sentiment analysis poses
considerable challenges for LLM service providers. To efficiently use GPU
resources and boost throughput, batching multiple requests has emerged as a
popular paradigm; to further speed up batching, LLM quantization techniques
reduce memory consumption and increase computing capacity. However, prevalent
quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully
leverage the capabilities of modern GPUs, such as 4-bit integer operators,
resulting in sub-optimal performance.
To maximize LLMs' serving throughput, we introduce Atom, a low-bit
quantization method that achieves high throughput improvements with negligible
accuracy loss. Atom significantly boosts serving throughput by using low-bit
operators and considerably reduces memory consumption via low-bit quantization.
It attains high accuracy by applying a novel mixed-precision and fine-grained
quantization process. We evaluate Atom on 4-bit weight-activation quantization
setups in the serving context. Atom improves end-to-end throughput by up to
$7.73\times$ compared to FP16 and by $2.53\times$ compared to INT8
quantization, while maintaining the same latency target.
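
To give a concrete sense of the fine-grained (group-wise) quantization idea the abstract refers to, the sketch below shows symmetric per-group INT4 quantization of a tensor. It is only an illustration of the general technique, not Atom's actual kernels or algorithm; the function names and the group size of 128 are assumptions for the example.

```python
import numpy as np

def quantize_int4_groupwise(x: np.ndarray, group_size: int = 128):
    """Illustrative symmetric per-group INT4 quantization of a 1-D tensor.

    Splits x into groups of `group_size`, scales each group by its own
    max-abs value, and rounds to integers in [-7, 7].
    (Sketch only; not Atom's exact quantization scheme.)
    """
    x = x.reshape(-1, group_size)                      # (num_groups, group_size)
    scales = np.abs(x).max(axis=1, keepdims=True) / 7  # one FP scale per group
    scales = np.where(scales == 0, 1.0, scales)        # avoid divide-by-zero
    q = np.clip(np.round(x / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover an FP32 approximation from INT4 codes and per-group scales."""
    return (q.astype(np.float32) * scales).reshape(-1)

# Example: quantize 512 random activations with group size 128.
x = np.random.randn(512).astype(np.float32)
q, s = quantize_int4_groupwise(x, group_size=128)
x_hat = dequantize(q, s)
print("max abs error:", np.abs(x - x_hat).max())
```

Per-group scales keep quantization error local to each group, which is why fine-grained schemes tolerate outliers better than a single per-tensor scale would.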
Keywords
accurate LLM, low-bit