What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation
arXiv (2024)
Abstract
Quantization has emerged as a promising technique for improving the memory
and computational efficiency of large language models (LLMs). Though the
trade-off between performance and efficiency is well-known, there is still much
to be learned about the relationship between quantization and LLM performance.
To shed light on this relationship, we propose a new perspective on
quantization, viewing it as perturbations added to the weights and activations
of LLMs. We call this approach "the lens of perturbation". Using this lens, we
conduct experiments with various artificial perturbations to explore their
impact on LLM performance. Our findings reveal several connections between the
properties of perturbations and LLM performance, providing insights into the
failure cases of uniform quantization and suggesting potential solutions to
improve the robustness of LLM quantization. To demonstrate the significance of
our findings, we implement a simple non-uniform quantization approach based on
our insights. Our experiments show that this approach achieves minimal
performance degradation on both 4-bit weight quantization and 8-bit
quantization for weights and activations. These results support our
findings and highlight their potential to improve the efficiency of LLMs
with little loss in performance.
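
As a rough illustration of the "lens of perturbation", the sketch below treats quantization as adding a perturbation delta = Q(w) - w to a weight vector. The abstract does not specify the quantizer or the perturbations used in the study, so everything here is an assumption for illustration: a generic symmetric round-to-nearest uniform quantizer, synthetic Gaussian weights, and the hypothetical helper names `quantize_uniform` and `perturbation`.

```python
import random


def quantize_uniform(weights, num_bits):
    """Symmetric round-to-nearest uniform quantization of a list of floats.

    Returns the dequantized values, i.e. each weight snapped to the nearest
    point on a uniform grid. This is a generic textbook scheme, assumed here
    for illustration; it is not taken from the paper.
    """
    qmax = 2 ** (num_bits - 1) - 1           # largest integer code, e.g. 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)                 # all zeros: nothing to quantize
    scale = max_abs / qmax                   # spacing between grid points
    dequantized = []
    for w in weights:
        code = round(w / scale)              # nearest integer code
        code = max(-qmax, min(qmax, code))   # clamp to the representable range
        dequantized.append(code * scale)
    return dequantized


def perturbation(weights, num_bits):
    """Perturbation that quantization adds to the weights: delta = Q(w) - w."""
    quantized = quantize_uniform(weights, num_bits)
    return [q - w for q, w in zip(quantized, weights)]


# Synthetic stand-in for one weight tensor (assumption: Gaussian entries).
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]

delta4 = perturbation(weights, 4)
delta8 = perturbation(weights, 8)
max4 = max(abs(d) for d in delta4)
max8 = max(abs(d) for d in delta8)
print("max |delta| at 4 bits:", max4)
print("max |delta| at 8 bits:", max8)
```

With this quantizer each entry of delta is bounded by half the grid spacing, so a single large-magnitude weight widens the grid and increases the perturbation applied to every other weight. That is one plausible reading of why uniform quantization can fail, offered here only as intuition and not as a result from the paper.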