AffineQuant: Affine Transformation Quantization for Large Language Models
ICLR 2024
Abstract
The significant resource requirements associated with Large-scale Language
Models (LLMs) have generated considerable interest in the development of
techniques aimed at compressing and accelerating neural networks. Among these
techniques, Post-Training Quantization (PTQ) has emerged as a subject of
considerable interest due to its noteworthy compression efficiency and
cost-effectiveness in the context of training. Existing PTQ methods for LLMs
limit the optimization scope to scaling transformations between pre- and
post-quantization weights. In this paper, we advocate for the direct
optimization using equivalent Affine transformations in PTQ (AffineQuant). This
approach extends the optimization scope and thus significantly minimizing
quantization errors. Additionally, by employing the corresponding inverse
matrix, we can ensure equivalence between the pre- and post-quantization
outputs of PTQ, thereby maintaining its efficiency and generalization
capabilities. To ensure the invertibility of the transformation during
optimization, we further introduce a gradual mask optimization method. This
method initially focuses on optimizing the diagonal elements and gradually
extends to the other elements. Such an approach aligns with the
Levy-Desplanques theorem, theoretically ensuring invertibility of the
transformation. As a result, significant performance improvements are evident
across different LLMs on diverse datasets. For example, we attain a C4
perplexity of 15.76 (2.26 lower than OmniQuant's 18.02) on the LLaMA2-7B model
under W4A4 quantization, without inference overhead. On zero-shot tasks,
AffineQuant achieves an average accuracy of 58.61% (1.98% higher than
OmniQuant's 56.63%) with 4/4-bit quantization on LLaMA-30B, setting a new
state-of-the-art benchmark for PTQ in LLMs.
Keywords
post-training quantization, large language model, affine transformation
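
The abstract describes two mechanisms: an equivalent affine transformation whose inverse is folded into the activations so the layer output is preserved, and a gradual mask that starts from the diagonal and expands outward so the transformation stays invertible (by the Levy-Desplanques theorem, a strictly diagonally dominant matrix is nonsingular). The following is a minimal, illustrative PyTorch sketch of these ideas, not the authors' implementation; the quantizer, masking schedule, and layer shapes are assumptions for demonstration.

```python
# Illustrative sketch (assumed details, not the AffineQuant reference code):
# optimize an affine matrix A so that quantizing A @ W, while feeding the
# layer X @ A^{-1}, keeps the output close to the full-precision X @ W.
import torch

def fake_quantize(w, n_bits=4):
    # Per-tensor min-max fake quantization with a straight-through estimator.
    qmax = 2 ** n_bits - 1
    scale = (w.max() - w.min()).clamp(min=1e-8) / qmax
    zero = (-w.min() / scale).round()
    q = torch.clamp((w / scale).round() + zero, 0, qmax)
    w_q = (q - zero) * scale
    return w + (w_q - w).detach()  # pass gradients straight through the rounding

def gradual_mask(dim, step, total_steps):
    # Unmask elements band by band: only the diagonal at first, then entries
    # progressively farther from the diagonal as optimization proceeds.
    idx = torch.arange(dim)
    bandwidth = (dim * step) // total_steps
    return ((idx[:, None] - idx[None, :]).abs() <= bandwidth).float()

torch.manual_seed(0)
in_dim, out_dim = 64, 128
X = torch.randn(256, in_dim)        # calibration activations (assumed)
W = torch.randn(in_dim, out_dim)    # a linear layer's full-precision weight
Y_ref = X @ W                       # full-precision output to match

A = torch.eye(in_dim).requires_grad_(True)   # affine transform, init to identity
opt = torch.optim.Adam([A], lr=1e-3)
total_steps = 200
for step in range(total_steps):
    mask = gradual_mask(in_dim, step, total_steps)
    A_m = A * mask                               # masked entries stay at zero
    W_q = fake_quantize(A_m @ W)                 # quantize the transformed weight
    Y_q = (X @ torch.linalg.inv(A_m)) @ W_q      # equivalent quantized forward pass
    loss = torch.nn.functional.mse_loss(Y_q, Y_ref)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final output MSE: {loss.item():.4f}")
```

Note that this sketch relies only on the identity initialization and a small learning rate to keep A_m invertible, whereas the paper's gradual mask is designed so the optimized matrix remains diagonally dominant. In deployment, the transformed weight is quantized offline and the inverse is merged into the preceding operation, which is why the method adds no inference overhead.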