Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

    Annual Meeting of the Association for Computational Linguistics (ACL), 2019. arXiv:1901.02860.

    Keywords:
    long short-term memory (LSTM), neural language model, effective context length, fixed-length context

    Abstract:

    Transformer networks have the potential to learn longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. As a solution, we propose a novel neural architecture, Transformer-XL, that enables the Transformer to learn dependency beyond a fixed length without disrupting temporal coherence. Concretely, it consists of a segment-level recurrence mechanism and a novel positional encoding scheme.

    Introduction
    • Language modeling is among the important problems that require modeling long-term dependency, with successful applications such as unsupervised pretraining (Dai and Le, 2015; Peters et al, 2018; Radford et al, 2018; Devlin et al, 2018).
    • It has been a challenge to equip neural networks with the capability to model long-term dependency in sequential data.
    • Recurrent neural networks (RNNs), in particular Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997), have been a standard solution to language modeling and obtained strong results on multiple benchmarks.
    • Previous work has found that LSTM language models use 200 context words on average (Khandelwal et al, 2018), indicating room for further improvement
    Highlights
    • Language modeling is among the important problems that require modeling long-term dependency, with successful applications such as unsupervised pretraining (Dai and Le, 2015; Peters et al, 2018; Radford et al, 2018; Devlin et al, 2018)
    • Previous work has found that LSTM language models use 200 context words on average (Khandelwal et al, 2018), indicating room for further improvement
    • To address the aforementioned limitations of fixed-length contexts, we propose a new architecture called Transformer-XL
    • To address the limitations of using a fixed-length context, we propose to introduce a recurrence mechanism to the Transformer architecture
    • Relative Effective Context Length (RECL) is defined on a model group instead of a single model, and the gain of a long context is measured by the relative improvement over the best short-context model
    • As shown in Table 9, due to the state-reuse scheme (sketched below), Transformer-XL achieves a speedup of up to 1,874 times during evaluation
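    A minimal sketch of the state-reuse idea may help make the highlights concrete. The snippet below is a single-head PyTorch illustration, not the authors' implementation: the hidden states cached from the previous segment are detached from the gradient graph and concatenated to the current segment when forming keys and values, so queries can attend across the segment boundary. The function name and projection matrices are illustrative assumptions.

      import torch
      import torch.nn.functional as F

      def attend_with_memory(h_curr, mem, w_q, w_k, w_v):
          """Minimal single-head sketch of segment-level recurrence.

          h_curr: (seg_len, d)  hidden states of the current segment
          mem:    (mem_len, d)  cached hidden states of the previous segment
          w_q, w_k, w_v: (d, d) illustrative projection matrices
          """
          # Keys and values see the cached memory plus the current segment;
          # queries come only from the current segment. detach() plays the
          # role of the stop-gradient SG(.) used in the paper.
          h_ext = torch.cat([mem.detach(), h_curr], dim=0)   # (mem_len + seg_len, d)
          q = h_curr @ w_q
          k = h_ext @ w_k
          v = h_ext @ w_v

          scores = (q @ k.t()) / q.size(-1) ** 0.5           # (seg_len, mem_len + seg_len)

          # Causal mask: position i attends to every memory slot and to
          # current positions j <= i.
          seg_len, mem_len = h_curr.size(0), mem.size(0)
          causal = torch.tril(torch.ones(seg_len, seg_len)).bool()
          mask = torch.cat([torch.ones(seg_len, mem_len).bool(), causal], dim=1)
          scores = scores.masked_fill(~mask, float("-inf"))

          out = F.softmax(scores, dim=-1) @ v                # (seg_len, d)
          # The cache handed to the next segment is (a slice of) the current
          # hidden states, stored without gradients.
          return out, h_curr.detach()

    In the full model this is applied at every layer, with the memory for layer n taken from layer n-1's hidden states of the previous segment, which is what lets the effective context grow well beyond the training segment length.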
    Methods
    • The previously used effective context length (ECL) ignores the fact that it is harder to get an improvement when a model already achieves a lower perplexity using only a shorter context, so it is not suitable for fair comparison among multiple models; this motivates RECL.
    • The RECL of Transformer-XL is 80% and 450% longer than that of recurrent networks and Transformer, respectively
    • Both the recurrence mechanism and the relative positional encodings contribute to a longer RECL (a sketch of the relative attention scoring follows this list).
    • This further substantiates the argument that Transformer-XL is able to model longer-term dependency
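    The relative positional encoding mentioned above decomposes each attention score into a content term, a content-dependent positional term, and two global bias terms (u and v) that replace the absolute positional queries. The sketch below scores a single query position against all keys; it omits the batched "relative shift" trick used for efficiency, and all variable names are illustrative rather than the authors' code.

      import torch

      def rel_attn_score_row(q_i, k, r, w_kR, u, v):
          """Relative attention scores of one query position i against all keys,
          i.e. the (a)+(b)+(c)+(d) decomposition described in the paper.

          q_i:  (d,)       projected query at position i
          k:    (klen, d)  projected content keys over [memory; current segment]
          r:    (klen, d)  sinusoidal encodings, row j holding R_{i-j}
          w_kR: (d, d)     projection applied to the relative encodings
          u, v: (d,)       learned global content / position biases
          """
          pos_k = r @ w_kR                   # W_{k,R} R_{i-j} for every key position j
          content = (q_i + u) @ k.t()        # terms (a) + (c): content-based attention
          position = (q_i + v) @ pos_k.t()   # terms (b) + (d): relative-position attention
          return content + position          # (klen,), before scaling and softmax

    Because the score depends only on the relative distance i - j, the same parameters stay valid when cached states from earlier segments are prepended, which is why the recurrence mechanism and the encoding work together.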
    Results
    • Evaluation speed: the authors compare the evaluation speed of Transformer-XL with that of the vanilla Transformer model (Al-Rfou et al, 2018); the contrast between re-encoding a full context window per token and reusing cached states is sketched after these results.
    • As shown in Table 9, due to the state-reuse scheme, Transformer-XL achieves a speedup of up to 1,874 times during evaluation
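    The source of this speedup can be seen by contrasting the two evaluation loops below. This is only an illustration: model, model.loss, and model.loss_with_mems are hypothetical interfaces, not the paper's actual code.

      def eval_vanilla(model, tokens, ctx_len):
          """Vanilla evaluation: re-encode a full ctx_len window for every
          predicted token, so each step pays for the whole context again."""
          losses = []
          for t in range(ctx_len, len(tokens)):
              window = tokens[t - ctx_len:t]          # recomputed from scratch
              losses.append(model.loss(window, tokens[t]))
          return losses

      def eval_with_state_reuse(model, tokens, seg_len):
          """Transformer-XL-style evaluation: advance one segment at a time and
          reuse cached hidden states (mems) as context instead of re-encoding it."""
          mems, losses = None, []
          for s in range(0, len(tokens) - 1, seg_len):
              targets = tokens[s + 1:s + seg_len + 1]
              seg = tokens[s:s + len(targets)]        # align inputs with targets
              seg_losses, mems = model.loss_with_mems(seg, targets, mems)
              losses.extend(seg_losses)
          return losses

    The largest speedups reported in Table 9 correspond to the longest attention lengths, where re-encoding the full window for every token is most expensive.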
    Conclusion
    • Transformer-XL obtains strong perplexity results, models longer-term dependency than RNNs and vanilla Transformers, achieves a substantial speedup during evaluation, and is able to generate coherent text articles.
    • The authors envision interesting applications of Transformer-XL in the fields of text generation, unsupervised feature learning, and image and speech modeling
    Tables
    • Table1: Comparison with state-of-the-art results on WikiText-103. ⇧ indicates contemporary work
    • Table2: Comparison with state-of-the-art results on enwik8
    • Table3: Comparison with state-of-the-art results on text8
    • Table4: Comparison with state-of-the-art results on One Billion Word. ⇧ indicates contemporary work
    • Table5: Comparison with state-of-the-art results on Penn Treebank. † indicates using two-step finetuning
    • Table6: Ablation study on WikiText-103. For the first two blocks, we use a slightly smaller model (128M parameters). † indicates that the corresponding row is reduced to the same setting as the Transformer network in (Al-Rfou et al, 2018), except that two auxiliary losses are not implemented in our experiments. “PPL init” refers to using the same length as training. “PPL best” indicates the perplexity obtained by using the optimal length. “Attn Len” is the shortest possible attention length during evaluation to achieve the corresponding result (PPL best). Increasing the attention length during evaluation improves performance only when our positional encoding is used. The “Transformer-XL (151M)” setting uses a standard parameter budget as previous work (Merity et al, 2018), where we observe a similar effect when increasing the attention length during evaluation
    • Table7: Ablation study on One Billion Word, a dataset without long-term dependency
    • Table8: Relative effective context length (RECL) comparison. See text for the definition of RECL and r. The first three models and the last four models are compared as two model groups when we calculate RECL (RECL is computed on a model group rather than a single model). Each group has the same parameter budget
    • Table9: Slowdown in terms of running time during evaluation. Evaluation is based on per-token time on one GPU
    Related work
    • In the last few years, the field of language modeling has witnessed many significant advances, including but not limited to devising novel architectures to better encode the context (Bengio et al, 2003; Mikolov et al, 2010; Merity et al, 2016; Al-Rfou et al, 2018), improving regularization and optimization algorithms (Gal and Ghahramani, 2016), speeding up the Softmax computation (Grave et al, 2016a), and enriching the output distribution family (Yang et al, 2017).

      To capture the long-range context in language modeling, a line of work directly feeds a representation of the wider context into the network as an additional input. Existing works range from ones where context representations are manually defined (Mikolov and Zweig, 2012; Ji et al, 2015; Wang and Cho, 2015) to others that rely on document-level topics learned from data (Dieng et al, 2016; Wang et al, 2017).

      More broadly, in generic sequence modeling, how to capture long-term dependency has been a long-standing research problem. From this perspective, since the ubiquitous adoption of LSTM, many efforts have been devoted to relieving the vanishing gradient problem, including better initialization (Le et al, 2015), additional loss signals (Trinh et al, 2018), augmented memory structures (Ke et al, 2018), and others that modify the internal architecture of RNNs to ease optimization (Wu et al, 2016; Li et al, 2018). Unlike these approaches, our work is based on the Transformer architecture and shows that language modeling, as a real-world task, benefits from the ability to learn longer-term dependency.
    Funding
    • ZD and YY were supported in part by National Science Foundation (NSF) under the grant IIS1546329 and by the DOE-Office of Science under the grant ASCR #KJ040201
    • ZY and RS were supported in part by the Office of Naval Research grant N000141812861, the NSF grant IIS1763562, the Nvidia fellowship, and the Siebel scholarship
    References
    • Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. 2018. Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444.
    • Alexei Baevski and Michael Auli. 2018. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853.
    • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
    • Shaojie Bai, J Zico Kolter, and Vladlen Koltun. 2018. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.
    • Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155.
    • Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.
    • Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. 2016. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704.
    • Tim Cooijmans, Nicolas Ballas, César Laurent, Çaglar Gülçehre, and Aaron Courville. 2016. Recurrent batch normalization. arXiv preprint arXiv:1603.09025.
    • Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in neural information processing systems, pages 3079–3087.
    • Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. 2016. Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083.
    • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
    • Adji B Dieng, Chong Wang, Jianfeng Gao, and John Paisley. 2016. Topicrnn: A recurrent neural network with long-range semantic dependency. arXiv preprint arXiv:1611.01702.
    • Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems, pages 1019–1027.
    • Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou. 2016a. Efficient softmax approximation for gpus. arXiv preprint arXiv:1609.04309.
    • Edouard Grave, Armand Joulin, and Nicolas Usunier. 2016b. Improving neural language models with a continuous cache. arXiv preprint arXiv:1612.04426.
    • Alex Graves. 2013. Generating sequences with recurrent neural networks.
    • Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural turing machines. arXiv preprint arXiv:1410.5401.
    • David Ha, Andrew Dai, and Quoc V Le. 2016. Hypernetworks. arXiv preprint arXiv:1609.09106.
    • Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber, et al. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.
    • Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
    • 2018. An improved relative self-attention mechanism for transformer with application to music generation. arXiv preprint arXiv:1809.04281.
    • Hakan Inan, Khashayar Khosravi, and Richard Socher. 2016. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462.
    • Yangfeng Ji, Trevor Cohn, Lingpeng Kong, Chris Dyer, and Jacob Eisenstein. 2015. Document context language models. arXiv preprint arXiv:1511.03962.
    • Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.
    • Nan Rosemary Ke, Anirudh Goyal ALIAS PARTH GOYAL, Olexa Bilaniuk, Jonathan Binas, Michael C Mozer, Chris Pal, and Yoshua Bengio. 2018. Sparse attentive backtracking: Temporal credit assignment through reminding. In Advances in Neural Information Processing Systems, pages 7650–7661.
    • Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp nearby, fuzzy far away: How neural language models use context. arXiv preprint arXiv:1805.04623.
    • Byron Knoll. 2017. cmix v13. http://www.byronknoll.com/cmix.html.
    • Ben Krause, Liang Lu, Iain Murray, and Steve Renals. 2016. Multiplicative lstm for sequence modelling. arXiv preprint arXiv:1609.07959.
    • Oleksii Kuchaiev and Boris Ginsburg. 2017. Factorization tricks for lstm networks. arXiv preprint arXiv:1703.10722.
    • Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. 2015. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941.
    • Shuai Li, Wanqing Li, Chris Cook, Ce Zhu, and Yanbo Gao. 2018. Independently recurrent neural network (indrnn): Building a longer and deeper rnn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5457–5466.
    • Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055.
    • MultiMedia LLC. 2009. Large text compression benchmark.
    • Gábor Melis, Charles Blundell, Tomáš Kocisky, Karl Moritz Hermann, Chris Dyer, and Phil Blunsom. 2018. Pushing the bounds of dropout. arXiv preprint arXiv:1805.09208.
    • Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing lstm language models. arXiv preprint arXiv:1708.02182.
    • Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240.
    • Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
    • Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.
    • Tomas Mikolov and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. SLT, 12(234-239):8.
    • Asier Mujika, Florian Meier, and Angelika Steger. 2017. Fast-slow recurrent neural networks. In Advances in Neural Information Processing Systems, pages 5915–5924.
    • Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
    • Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. 2018. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268.
    • Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language understanding paper.pdf.
    • Jack W Rae, Chris Dyer, Peter Dayan, and Timothy P Lillicrap. 2018. Fast parametric learning with activation memorization. arXiv preprint arXiv:1803.10049.
    • Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155.
    • Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, et al. 2018. Mesh-tensorflow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems, pages 10434–10443.
    • Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
    • Noam Shazeer, Joris Pelemans, and Ciprian Chelba. 2014. Skip-gram language modeling using sparse non-negative matrix probability estimation. arXiv preprint arXiv:1412.1454.
    • Trieu H Trinh, Andrew M Dai, Thang Luong, and Quoc V Le. 2018. Learning longer-term dependencies in rnns with auxiliary losses. arXiv preprint arXiv:1803.00144.
    • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
    • Tian Wang and Kyunghyun Cho. 2015. Largercontext language modelling. arXiv preprint arXiv:1511.03729.
    • Wenlin Wang, Zhe Gan, Wenqi Wang, Dinghan Shen, Jiaji Huang, Wei Ping, Sanjeev Satheesh, and Lawrence Carin. 2017. Topic compositional neural language model. arXiv preprint arXiv:1712.09783.
    • Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory networks. arXiv preprint arXiv:1410.3916.
    • Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan R Salakhutdinov. 2016. On multiplicative integration with recurrent neural networks. In Advances in neural information processing systems, pages 2856–2864.
    • Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. 2017. Breaking the softmax bottleneck: A high-rank rnn language model. arXiv preprint arXiv:1711.03953.
    • Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. 2016. Recurrent highway networks. arXiv preprint arXiv:1607.03474.
    • Barret Zoph and Quoc V Le. 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.