LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices.
ICLR 2024(2024)
Researcher
Abstract
With the commercialization of large language models (LLMs), weight-activation quantization has emerged to compress and accelerate LLMs, achieving high throughput while reducing inference costs. However, existing post-training quantization (PTQ) techniques for quantizing both weights and activations of LLMs still suffer from non-negligible performance drops, especially on massive multitask language understanding. To address this issue, we propose Low-Rank Quantization (LRQ) - a simple yet effective post-training weight quantization method for LLMs that reconstructs the outputs of an intermediate Transformer block by leveraging low-rank weight-scaling matrices, replacing the conventional full weight-scaling matrices that entail as many learnable scales as their associated weights. Thanks to parameter sharing via low-rank structure, LRQ only need to learn significantly fewer parameters while enabling the individual scaling of weights, thus boosting the generalization capability of quantized LLMs. Through extensive experiments, we demonstrate the superiority of LRQ to prior LLM INT8 PTQ works. Remarkably, we confirm for the first time the possibility of 4-bit weight and 8-bit activation quantization for LLMs with minimal accuracy loss among LLM PTQ studies.
MoreTranslated text
Key words
Efficient Inference,Post-Training Quantization,Large Language Models
PDF
View via Publisher
AI Read Science
Must-Reading Tree
Example

Generate MRT to find the research sequence of this paper
Data Disclaimer
The page data are from open Internet sources, cooperative publishers and automatic analysis results through AI technology. We do not make any commitments and guarantees for the validity, accuracy, correctness, reliability, completeness and timeliness of the page data. If you have any questions, please contact us by email: report@aminer.cn
Chat Paper
Summary is being generated by the instructions you defined