L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ
CoRR (2024)
Abstract
Post-training quantization (PTQ) and quantization-aware training (QAT)
methods are gaining popularity in mitigating the high memory and computational
costs associated with Large Language Models (LLMs). In resource-constrained
scenarios, PTQ, with its reduced training overhead, is often preferred over
QAT, despite the latter's potential for higher accuracy. Meanwhile,
parameter-efficient fine-tuning (PEFT) methods like low-rank adaptation (LoRA)
have been introduced, and recent efforts have explored quantization-aware PEFT
techniques. However, these approaches may lack generality due to their reliance
on the pre-quantized model's configuration. Their effectiveness may be
compromised by non-linearly quantized or mixed-precision weights, and the
retraining of specific quantization parameters might impede optimal
performance. To address these challenges, we propose L4Q, an algorithm for
parameter-efficient quantization-aware training. L4Q leverages LoRA-wise
learned quantization step size for LLMs, aiming to enhance generality. The
simultaneous quantization-and-fine-tuning process of L4Q is applicable to
high-precision models, yielding linearly quantized weights with superior
accuracy. Our experiments, conducted on the LLaMA and LLaMA2 model families
using an instructional dataset, showcase L4Q's capabilities in language
comprehension and few-shot in-context learning, achieving sub-4-bit precision
while keeping training times comparable to those of applying PEFT to an
already-quantized model.
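
The abstract describes L4Q's core idea only at a high level: LoRA adapters and a learned quantization step size (as in LSQ) are trained jointly while the pretrained weights stay frozen, so that fine-tuning is aware of the linear quantizer applied at deployment. The sketch below is a minimal illustration of that idea in PyTorch. All names (`lsq_fake_quantize`, `L4QLinearSketch`) and details such as the step-size granularity and its initialization are assumptions made for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def lsq_fake_quantize(w: torch.Tensor, step: torch.Tensor, n_bits: int) -> torch.Tensor:
    """Uniform (linear) fake quantization with a learned step size (LSQ-style).

    Rounding is made differentiable with a straight-through estimator, so
    gradients reach both the weights and the step size during training.
    """
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    w_scaled = torch.clamp(w / step, qmin, qmax)
    w_rounded = (w_scaled.round() - w_scaled).detach() + w_scaled  # STE for round()
    return w_rounded * step


class L4QLinearSketch(nn.Module):
    """Hypothetical linear layer: frozen base weight, trainable LoRA adapters,
    and a trainable quantization step size, optimized jointly."""

    def __init__(self, in_features: int, out_features: int, rank: int = 16, n_bits: int = 4):
        super().__init__()
        # Frozen high-precision base weight (would come from the pretrained LLM).
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02,
                                   requires_grad=False)
        # Trainable low-rank adapters (LoRA): weight update = B @ A.
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.02)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))
        # Trainable per-output-channel step size, initialized from weight statistics
        # (a common LSQ heuristic; the paper may use a different scheme).
        init_step = 2 * self.weight.abs().mean(dim=1, keepdim=True) / (2 ** (n_bits - 1)) ** 0.5
        self.step = nn.Parameter(init_step)
        self.n_bits = n_bits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Fake-quantize the *sum* of base weight and LoRA update, so the
        # adapters and step size are fitted to the quantized forward pass.
        w = self.weight + self.lora_b @ self.lora_a
        w_q = lsq_fake_quantize(w, self.step, self.n_bits)
        return F.linear(x, w_q)
```

Because the quantizer acts on the merged weight (base plus LoRA update) rather than on a pre-quantized model, the result after training is a single uniformly quantized weight tensor, which matches the abstract's claim that L4Q starts from high-precision models and yields linearly quantized weights.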