Weight subcloning: direct initialization of transformers using larger pretrained ones
CoRR(2023)
摘要
Training large transformer models from scratch for a target task requires
lots of data and is computationally demanding. The usual practice of transfer
learning overcomes this challenge by initializing the model with weights of a
pretrained model of the same size and specification to increase the convergence
and training speed. However, what if no pretrained model of the required size
is available? In this paper, we introduce a simple yet effective technique to
transfer the knowledge of a pretrained model to smaller variants. Our approach
called weight subcloning expedites the training of scaled-down transformers by
initializing their weights from larger pretrained models.
Weight subcloning involves an operation on the pretrained model to obtain the
equivalent initialized scaled-down model. It consists of two key steps: first,
we introduce neuron importance ranking to decrease the embedding dimension per
layer in the pretrained model. Then, we remove blocks from the transformer
model to match the number of layers in the scaled-down network. The result is a
network ready to undergo training, which gains significant improvements in
training speed compared to random initialization. For instance, we achieve 4x
faster training for vision transformers in image classification and language
models designed for next token prediction.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要