PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation
arXiv (2024)
Abstract
Recently, the scale of transformers has grown rapidly, which introduces
considerable challenges in terms of training overhead and inference efficiency
for task adaptation. Existing lines of work, namely Parameter-Efficient
Fine-Tuning (PEFT) and model compression, have investigated these challenges
separately. However, PEFT cannot guarantee the inference efficiency of the
original backbone, especially for large-scale models, while model compression
requires significant training costs for structure search and re-training.
Consequently, simply combining the two cannot achieve both training efficiency
and inference efficiency at minimal cost. In this paper, we propose a novel
Parallel Yielding Re-Activation (PYRA) method for the challenge of
training-inference efficient task adaptation. PYRA first utilizes parallel
yielding adaptive weights to comprehensively perceive the data distribution of
downstream tasks. A re-activation strategy for token modulation is then applied
to the tokens to be merged, yielding calibrated token features. Extensive
experiments demonstrate that PYRA outperforms all competing methods at both low
and high compression rates, confirming its effectiveness and superiority in
maintaining both training efficiency and inference efficiency for large-scale
foundation models. Our code will be released to the public.
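
The abstract does not detail the mechanism, so the following is a minimal
illustrative sketch only, not the authors' implementation: it shows one way
learned modulation weights could calibrate ("re-activate") token features
before pairs of tokens are merged to reduce sequence length. The class name,
the pairwise merging scheme, and the low-rank parameterization of the
modulation weights are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class ReActivatedTokenMerging(nn.Module):
    """Hypothetical sketch: modulate tokens with learned weights, then
    merge adjacent pairs by averaging, halving the token count."""

    def __init__(self, dim: int, rank: int = 16):
        super().__init__()
        # Assumed low-rank projection producing per-channel modulation
        # weights; the paper's actual parameterization may differ.
        self.down = nn.Linear(dim, dim // rank, bias=False)
        self.up = nn.Linear(dim // rank, dim, bias=False)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n, dim) with n even; output: (batch, n // 2, dim).
        b, n, d = tokens.shape
        pairs = tokens.view(b, n // 2, 2, d)
        # Re-activation: scale each token to be merged by a learned,
        # input-dependent gate, then average the calibrated pair.
        modulation = torch.sigmoid(self.up(self.down(pairs)))
        return (pairs * modulation).mean(dim=2)


# Example usage: merging 16 tokens of width 64 down to 8.
merger = ReActivatedTokenMerging(dim=64)
out = merger(torch.randn(2, 16, 64))
print(out.shape)  # torch.Size([2, 8, 64])
```

Because only the small projection matrices are trained, a module of this shape
would keep the adaptation parameter-efficient while reducing the token count
at inference, which matches the training-inference efficiency goal the
abstract describes.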