FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference
CoRR (2024)
Abstract
The large number of parameters in pretrained language models enhances their
performance, but also makes them resource-intensive and challenging to
deploy on commodity hardware such as a single GPU. Due to the memory and
power limitations of these devices, model compression techniques are often used
to decrease both the model's size and its inference latency, which usually
results in a trade-off between model accuracy and efficiency. Optimizing this
balance is therefore essential for effectively deploying LLMs on
commodity hardware. A significant portion of the efficiency challenge lies in the
feed-forward network (FFN) component, which accounts for roughly 2/3 of the
total parameters and inference latency. In this paper, we first observe that
only a few neurons of the FFN module have large output norms for any input token,
a.k.a. heavy hitters, while the others are sparsely triggered by different
tokens. Based on this observation, we explicitly split the FFN into two parts
according to the heavy hitters. We improve the efficiency-accuracy trade-off of
existing compression methods by allocating more resources to the FFN part
containing heavy hitters. In practice, our method reduces model size by 43.1% and
brings a 1.25∼1.56× wall-clock-time speedup on different hardware with
negligible accuracy drop.
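The split described in the abstract can be illustrated with a minimal NumPy sketch. All sizes, the random weights, and the ReLU-based importance score below are hypothetical stand-ins, not the paper's exact scoring method; the sketch only shows that partitioning FFN neurons into a "heavy hitter" group and a sparsely-triggered remainder is an exact decomposition of the original FFN output:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical FFN weights: W1 (d_model x d_ff), W2 (d_ff x d_model).
d_model, d_ff, n_tokens = 16, 64, 128
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))
X = rng.normal(size=(n_tokens, d_model))  # calibration tokens

# Per-neuron importance (illustrative heuristic): mean ReLU activation
# over calibration tokens, times the norm of that neuron's output row in W2.
H = np.maximum(X @ W1, 0.0)
scores = H.mean(axis=0) * np.linalg.norm(W2, axis=1)

# Split the intermediate neurons: top-k heavy hitters vs. the rest.
k = d_ff // 4
order = np.argsort(scores)
heavy, light = order[-k:], order[:-k]

def ffn(x, w1, w2):
    # Standard two-layer FFN with ReLU.
    return np.maximum(x @ w1, 0.0) @ w2

# Because ReLU is elementwise, splitting columns of W1 / rows of W2
# partitions the neurons exactly: the two halves sum to the original.
full = ffn(X, W1, W2)
split = ffn(X, W1[:, heavy], W2[heavy]) + ffn(X, W1[:, light], W2[light])
assert np.allclose(full, split)
```

In this framing, the heavy-hitter part can be kept at full precision or capacity while the sparsely-triggered part is compressed more aggressively, which is how a method of this kind can shift the accuracy-efficiency trade-off.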