When Attention Meets Fast Recurrence - Training Language Models with Reduced Compute.
EMNLP (2021)
Abstract
Large language models have become increasingly difficult to train because of the required computation time and cost. In this work, we present SRU++, a recurrent unit with optional built-in attention that exhibits state-of-the-art modeling capacity and training efficiency. On standard language modeling benchmarks such as enwik8 and Wiki-103 datasets, our model obtains better perplexity and bits-per-character (bpc) while using 2.5x-10x less training time and cost compared to top-performing Transformer models. Our results reaffirm that attention is not all we need and can be complementary to other sequential modeling modules. Moreover, fast recurrence with little attention can be a leading model architecture.
Keywords
training language models, fast recurrence, attention