Revisiting Knowledge Distillation for Autoregressive Language Models
CoRR (2024)
Abstract
Knowledge distillation (KD) is a common approach to compress a teacher model
to reduce its inference cost and memory footprint, by training a smaller
student model. However, in the context of autoregressive language models (LMs),
we empirically find that larger teacher LMs can, counterintuitively, result in a
dramatically poorer student. In response to this problem, we conduct a series of analyses
and reveal that different tokens have different teaching modes, neglecting
which will lead to performance degradation. Motivated by this, we propose a
simple yet effective adaptive teaching approach (ATKD) to improve the KD. The
core of ATKD is to reduce rote learning and make teaching more diverse and
flexible. Extensive experiments on 8 LM tasks show that, with the help of ATKD,
various baseline KD methods can achieve consistent and significant performance
gains (up to +3.04% average score) across all model types and sizes. More
encouragingly, ATKD can improve the student model generalization effectively.
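For context, the sketch below shows a generic token-level KD objective for autoregressive LMs: a temperature-scaled KL divergence between the teacher's and the student's next-token distributions, with an optional per-token weight slot. The function name, the weighting argument, and the temperature value are illustrative assumptions; this is not the paper's actual ATKD formulation, whose adaptive per-token weighting is defined only in the full text.

import torch
import torch.nn.functional as F

def token_level_kd_loss(student_logits: torch.Tensor,
                        teacher_logits: torch.Tensor,
                        token_weights: torch.Tensor | None = None,
                        temperature: float = 2.0) -> torch.Tensor:
    """Generic per-token KD loss for autoregressive LMs (illustrative sketch).

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    token_weights: optional (batch, seq_len) weights, e.g. to emphasise some
        tokens over others (a stand-in for adaptive teaching; NOT the paper's
        ATKD scheme).
    """
    t = temperature
    # Temperature-softened distributions.
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    teacher_log_probs = F.log_softmax(teacher_logits / t, dim=-1)

    # KL(teacher || student) per token, summed over the vocabulary.
    per_token_kl = (teacher_probs * (teacher_log_probs - student_log_probs)).sum(dim=-1)

    if token_weights is not None:
        per_token_kl = per_token_kl * token_weights

    # Standard T^2 scaling keeps gradients comparable across temperatures.
    return (t * t) * per_token_kl.mean()


# Usage with random tensors standing in for real teacher/student outputs.
if __name__ == "__main__":
    batch, seq_len, vocab = 2, 16, 100
    s_logits = torch.randn(batch, seq_len, vocab, requires_grad=True)
    t_logits = torch.randn(batch, seq_len, vocab)
    loss = token_level_kd_loss(s_logits, t_logits)
    loss.backward()
    print(f"KD loss: {loss.item():.4f}")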