ApiQ: Finetuning of 2-Bit Quantized Large Language Model
CoRR (2024)
Abstract
Memory-efficient finetuning of large language models (LLMs) has recently
attracted significant attention as LLMs grow in size, primarily due to the
constraints posed by GPU memory and the fact that these methods achieve
results comparable to full finetuning. Despite these advances, current
memory-efficient finetuning strategies, such as QLoRA, exhibit inconsistent
performance across diverse bit-width quantizations and multifaceted tasks.
This inconsistency largely stems from the detrimental impact of the
quantization process on the preserved knowledge, leading to catastrophic
forgetting and undermining the use of the pretrained model for finetuning. In
this work, we introduce a novel quantization framework named ApiQ, designed to
restore the information lost during quantization by concurrently initializing
the LoRA components and quantizing the weights of the LLM. This approach
preserves the activation precision of the original LLM while mitigating the
propagation of quantization error from shallower into deeper layers. Through
comprehensive evaluations conducted on a spectrum of language tasks with
various models, ApiQ demonstrably minimizes the activation error during
quantization. Consequently, it consistently achieves superior finetuning
results across various quantization bit-widths.
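The abstract's key mechanism, jointly choosing the quantized weights and the LoRA initialization so that each layer's activations (rather than its weights) are preserved, can be illustrated with a short sketch. This is a minimal illustration of the idea, not the authors' implementation: the toy quantizer, the calibration loop, and all names here (uniform_quantize, apiq_style_init, calib_x, the rank and step counts) are assumptions for demonstration, and the paper's actual objective and optimization details may differ.

```python
import torch

def uniform_quantize(w: torch.Tensor, bits: int = 2) -> torch.Tensor:
    """Toy per-tensor uniform quantizer (real schemes use per-group scales)."""
    levels = 2 ** bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / levels
    return torch.round((w - lo) / scale) * scale + lo

def apiq_style_init(W: torch.Tensor, calib_x: torch.Tensor,
                    rank: int = 16, bits: int = 2,
                    steps: int = 200, lr: float = 1e-2):
    """Quantize W, then fit LoRA factors A, B to the *activation* error,
    so that (Q + B @ A) reproduces the full-precision layer outputs."""
    Q = uniform_quantize(W, bits)                     # frozen low-bit weight
    d_out, d_in = W.shape
    A = torch.zeros(rank, d_in, requires_grad=True)   # LoRA down-projection
    B = torch.randn(d_out, rank, requires_grad=True)  # LoRA up-projection
    target = calib_x @ W.T                            # full-precision activations
    opt = torch.optim.Adam([A, B], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        approx = calib_x @ (Q + B @ A).T              # quantized + LoRA activations
        loss = (approx - target).pow(2).mean()        # activation error, not weight error
        loss.backward()
        opt.step()
    return Q, A.detach(), B.detach()

# Toy usage: after this initialization, Q + B @ A approximates the layer's
# input-output behavior, giving finetuning a less "forgetful" starting point.
W = torch.randn(64, 128)
calib_x = torch.randn(256, 128)
Q, A, B = apiq_style_init(W, calib_x)
```

Under the abstract's framing, calibrating to the activation error rather than the weight error is what limits error propagation: each layer's output stays close to its full-precision output, so deeper layers receive inputs near their original distribution.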