Learn To be Efficient: Build Structured Sparsity in Large Language Models
CoRR (2024)
Abstract
Large Language Models (LLMs) have achieved remarkable success with their
billion-level parameters, yet they incur high inference overheads. The
emergence of activation sparsity in LLMs provides a natural approach to reduce
this cost by involving only a subset of the parameters in inference. Existing
methods focus only on utilizing this naturally formed activation sparsity,
overlooking the potential for further amplifying this inherent sparsity. In
this paper, we hypothesize that LLMs can learn to be efficient by achieving
more structured activation sparsity. To achieve this, we introduce a novel
algorithm, Learn-To-be-Efficient (LTE), designed to train efficiency-aware LLMs
to learn to activate fewer neurons and achieve a better trade-off between
sparsity and performance. Furthermore, unlike state-of-the-art MoEfication
methods, which mainly focus on ReLU-based models, LTE can also be applied to
LLMs such as GPT and LLaMA that use soft activation functions. We evaluate LTE
on four models and eleven datasets. The experiments show that LTE achieves a
better trade-off between sparsity and task performance. For instance, LTE with
LLaMA provides a 1.83x-2.59x FLOPs speed-up on language generation tasks,
outperforming state-of-the-art methods.
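To make the activation-sparsity idea in the abstract concrete, the PyTorch sketch below thresholds the intermediate activations of a feed-forward block so that only a subset of neurons contributes to the output. This is a minimal illustration, not the paper's LTE routing: the module name, dimensions, and threshold value are assumptions chosen for the example.

```python
import torch
import torch.nn as nn


class ThresholdedFFN(nn.Module):
    """Illustrative feed-forward block: low-magnitude intermediate
    activations are masked out, so only part of the layer's neurons
    (and thus only part of its parameters) affect the output."""

    def __init__(self, d_model: int = 64, d_ff: int = 256, threshold: float = 0.1):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)
        self.threshold = threshold  # hypothetical cutoff, not a value from the paper

    def forward(self, x: torch.Tensor):
        h = torch.relu(self.w_in(x))
        # Keep only neurons whose activation exceeds the threshold.
        mask = (h > self.threshold).float()
        h = h * mask
        # A real sparse kernel would skip the masked rows of w_out
        # entirely to realize the FLOPs savings; here we only measure them.
        fraction_skipped = 1.0 - mask.mean().item()
        return self.w_out(h), fraction_skipped


if __name__ == "__main__":
    torch.manual_seed(0)
    ffn = ThresholdedFFN()
    x = torch.randn(4, 64)
    y, fraction_skipped = ffn(x)
    print(f"output shape: {tuple(y.shape)}, neurons skipped: {fraction_skipped:.2f}")
```

In this toy version the mask comes from a fixed magnitude threshold; LTE as described in the abstract instead trains the model so that fewer neurons activate in a more structured way, which is what enables the reported sparsity-performance trade-off.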