How Does Calibration Data Affect the Post-training Pruning and Quantization of Large Language Models?
CoRR (2023)
Abstract
Pruning and quantization form the foundation of model compression for neural
networks, enabling efficient inference for large language models (LLMs).
Recently, various quantization and pruning techniques have demonstrated
state-of-the-art performance in a post-training setting. They rely upon
calibration data, a small set of unlabeled examples, to generate layer
activations. However, no prior work has systematically investigated how the
calibration data impacts the effectiveness of model compression methods. In
this paper, we present the first extensive empirical study on the effect of
calibration data upon LLM performance. We trial a variety of pruning and
quantization methods, tasks, models, and datasets. Surprisingly, we find
substantial variation in downstream task performance, in contrast to existing
work that suggests a greater level of robustness to the calibration data.
Finally, we make a series of recommendations for the effective use of
calibration data in LLM quantization and pruning.
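To make the role of calibration data concrete, the sketch below shows one common way post-training pruning methods use it: calibration examples are passed through a layer to collect input activations, and weights are scored by combining weight magnitude with per-feature activation norms (the activation-aware criterion used by methods such as Wanda). All shapes and names here are hypothetical; this is a minimal NumPy illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: one linear layer and a tiny calibration set.
d_in, d_out, n_calib = 8, 4, 16
W = rng.normal(size=(d_out, d_in))    # layer weight matrix
X = rng.normal(size=(n_calib, d_in))  # calibration inputs to this layer

# Score each weight by |w_ij| * ||x_j||_2 over the calibration set:
# weights attached to rarely-activated input features score low.
act_norm = np.linalg.norm(X, axis=0)  # per-input-feature activation norm
scores = np.abs(W) * act_norm         # broadcasts over output rows

# Prune 50% of weights per output row (lowest scores removed).
k = d_in // 2
threshold = np.sort(scores, axis=1)[:, k - 1:k]  # k-th smallest score per row
mask = scores > threshold
W_pruned = W * mask  # each row keeps d_in - k weights
```

Because the scores depend on `X`, a different calibration set can yield a different mask and thus different pruned weights, which is exactly the sensitivity the paper studies.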