Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models

Yingtao Zhang, Haoli Bai, Jialin Zhao, Haokun Lin, Lu Hou, Carlo Vittorio Cannistraci

ICLR 2024 (2024)

Abstract
With the rapid growth of large language models (LLMs), the memory and computation they demand keep increasing. Recent efforts on post-training pruning of LLMs aim to reduce model size and computation, yet their performance remains sub-optimal. In this paper, we present a plug-and-play solution for post-training pruning of LLMs. The proposed solution has two innovative components: 1) **Relative Importance and Activations** (RIA), a new pruning metric that jointly considers weights and activations and is efficient to compute on LLMs; and 2) **Channel Permutation**, a new approach that maximally preserves important weights under N:M sparsity. The two components can be readily combined to further enhance N:M structured pruning of LLMs. Our empirical experiments show that RIA alone already surpasses all existing post-training pruning methods on prevalent LLMs, e.g., LLaMA models ranging from 7B to 65B parameters. Furthermore, N:M structured pruning with channel permutation can even outperform the original LLaMA2 70B on zero-shot tasks, while delivering practical speed-ups on specific hardware.
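To make the two components concrete, below is a minimal sketch of how an RIA-style score and N:M (here 2:4) masking might fit together. The scoring formula (per-weight relative importance, i.e. |W_ij| normalized by its row and column sums, scaled by a tempered input-activation norm), the default alpha = 0.5, and all function names (`ria_score`, `nm_mask`, `permute_channels`) are assumptions inferred from the abstract, not the authors' released code; the round-robin `permute_channels` is a deliberately simplified stand-in for the paper's channel-permutation procedure.

```python
# Hedged sketch of RIA-style scoring + 2:4 masking; formula and names are
# assumptions based on the abstract, not the authors' implementation.
import torch

def ria_score(W: torch.Tensor, X: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Score each weight by relative importance times a tempered activation norm.

    W: (out_features, in_features) weight of a linear layer.
    X: (num_tokens, in_features) calibration activations feeding that layer.
    """
    absW = W.abs()
    # Relative importance: |W_ij| normalized by its row sum plus its column sum.
    rel = absW / absW.sum(dim=1, keepdim=True) + absW / absW.sum(dim=0, keepdim=True)
    # Per-input-channel L2 norm of the activations, raised to alpha (assumed 0.5).
    act = X.norm(p=2, dim=0).pow(alpha)  # shape: (in_features,)
    return rel * act                     # broadcasts over output rows

def nm_mask(score: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the top-n scores in every consecutive group of m input channels."""
    out_f, in_f = score.shape
    assert in_f % m == 0, "in_features must be divisible by m"
    groups = score.view(out_f, in_f // m, m)
    top = groups.topk(n, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, top, True)
    return mask.view(out_f, in_f)

def permute_channels(score: torch.Tensor, m: int = 4) -> torch.Tensor:
    """Simplified channel permutation: deal input channels into groups of m in
    round-robin order of total score, so high-scoring channels do not compete
    for the same N:M group. Returns permutation indices over the columns."""
    in_f = score.shape[1]
    n_groups = in_f // m
    order = score.sum(dim=0).argsort(descending=True)  # channels by total score
    perm = torch.empty(in_f, dtype=torch.long)
    for rank, ch in enumerate(order):
        group = rank % n_groups   # round-robin group assignment
        slot = rank // n_groups   # position inside that group
        perm[group * m + slot] = ch
    return perm

# Usage: score the weights, permute columns, apply 2:4 masking in the permuted
# order, then scatter the mask back to the original channel layout.
W, X = torch.randn(8, 16), torch.randn(32, 16)
s = ria_score(W, X)
perm = permute_channels(s)
mask = torch.empty_like(s, dtype=torch.bool)
mask[:, perm] = nm_mask(s[:, perm])
W_pruned = W * mask
```

Note that the paper frames channel permutation as a combinatorial optimization (assignment) problem; the round-robin heuristic above only illustrates the intent of spreading important input channels across N:M groups. Scattering the mask back to the original layout keeps the example self-contained, but a real deployment would keep the permuted, contiguous 2:4 layout to obtain the hardware speed-up.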
Keywords
Post-Training Pruning,Combinatorial Optimization,Large Language Models,Inference Acceleration