Fast and Accurate Reading Comprehension by Combining Self-Attention and Convolution

    David Dohan
    Thang Luong
    Rui Zhao

    International Conference on Learning Representations, 2018.

    Keywords:
    Stanford Question Answering Dataset, machine reading comprehension, SQuAD dataset, Google’s NMT, end-to-end model
    Weibo:
    We propose a fast and accurate end-to-end model for machine reading comprehension

    Abstract:

    Current end-to-end machine reading and question answering (Q&A) models are primarily based on recurrent neural networks (RNNs) with attention. Despite their success, these models are often slow for both training and inference due to the sequential nature of RNNs. We propose a new Q&A model that does not require recurrent networks: ...


    Introduction
    • There is growing interest in the tasks of machine reading comprehension and automated question answering.
    • A successful combination of a recurrent model and attention is the Bidirectional Attention Flow (BiDAF) model of Seo et al. (2016), which achieves strong results on the SQuAD dataset (Rajpurkar et al., 2016).
    • A weakness of these models is that they are often slow for both training and inference due to their recurrent nature, especially for long texts.
    • Slow inference prevents such machine comprehension systems from being deployed in real-time applications.
    Highlights
    • There is growing interest in the tasks of machine reading comprehension and automated question answering
    • The most successful models generally employ two key ingredients: (1) a recurrent model to process sequential inputs, and (2) an attention component to cope with long-term interactions
    • In this paper, aiming to make machine comprehension fast, we propose to remove the recurrent nature of these models
    • F1 measures the proportion of overlapping tokens between the predicted answer and the ground truth, while the Exact Match (EM) score is 1 if the prediction is identical to the ground truth and 0 otherwise (see the sketch after this list)
    • Our model trained on the original dataset outperforms all the documented results in the literature, in terms of both Exact Match and F1 scores
    • We propose a fast and accurate end-to-end model for machine reading comprehension
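    The following is a minimal Python sketch of these two metrics, assuming whitespace tokenization. It is illustrative only: the official SQuAD evaluation script additionally normalizes answers (lowercasing, stripping punctuation and articles) and takes the maximum score over all ground-truth answers.

```python
from collections import Counter

def exact_match(prediction: str, ground_truth: str) -> float:
    """1.0 if the predicted string equals the ground truth, else 0.0."""
    return float(prediction.strip() == ground_truth.strip())

def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = prediction.split()
    gt_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gt_tokens)   # multiset intersection
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("in the park", "the park"))   # 0.8
print(exact_match("the park", "the park"))   # 1.0
```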
    Methods
    • The authors conduct experiments to study the performance of the model and the data augmentation technique.
    • SQuAD contains 107.7K query-answer pairs, with 87.5K for training, 10.1K for validation, and another 10.1K for testing (the sketch after this list shows how such counts can be read from the released JSON files).
    • The authors also test the model on the TriviaQA dataset (Joshi et al., 2017), which consists of 650K context-query-answer triples.
    • According to previous work (Joshi et al., 2017; Hu et al., 2017; Pan et al., 2017), the same model tends to perform similarly on both the Wikipedia and Web domains, but the latter is five times larger.
    • To keep the training time manageable, the authors omit the experiments on the Web data
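    As a small illustration of the dataset statistics above, the sketch below counts question-answer pairs in a SQuAD v1.1-formatted JSON file. The file name shown is the standard name of the public release; the exact counts depend on the split files used.

```python
import json

def count_qa_pairs(path: str) -> int:
    """Count question-answer pairs in a SQuAD v1.1-style JSON file."""
    with open(path, encoding="utf-8") as f:
        dataset = json.load(f)["data"]
    return sum(
        len(paragraph["qas"])
        for article in dataset                 # each article has a title and paragraphs
        for paragraph in article["paragraphs"]
    )

# Roughly 87.5K pairs are expected for train-v1.1.json and 10.1K for dev-v1.1.json.
print(count_qa_pairs("train-v1.1.json"))
```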
    Results
    • F1 and Exact Match (EM) are the two accuracy metrics used to evaluate model performance.
    • To make a fair and thorough comparison, the authors report both the results published in the latest papers/preprints and the updated but undocumented results on the leaderboard.
    • The authors treat the latter as unpublished results.
    • The authors' result on the official test set is 76.2/84.6 (EM/F1), which significantly outperforms the best documented result of 73.2/81.8
    Conclusion
    • The authors propose a fast and accurate end-to-end model for machine reading comprehension.
    • The authors' core innovation is to completely remove the recurrent networks in the base model.
    • The resulting model is fully feedforward, composed entirely of separable convolutions, attention, linear layers, and layer normalization, and is well suited to parallel computation (a minimal sketch of such an encoder block follows this list).
    • The resulting model is both fast and accurate: it surpasses the best published results on the SQuAD dataset while being up to 13x faster in training and 9x faster in inference per iteration than a competitive recurrent model.
    • The authors find that they are able to achieve significant gains by utilizing data augmentation consisting of translating context and passage pairs to and from another language as a way of paraphrasing the questions and contexts
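    Below is a minimal PyTorch sketch, under stated assumptions, of one recurrence-free encoder block of the kind described above: a stack of depthwise separable convolutions followed by multi-head self-attention and a position-wise feedforward layer, each wrapped in layer normalization and a residual connection. The hyperparameters (hidden size 128, kernel size 7, 4 convolution layers, 8 attention heads) follow the paper's description of its embedding encoder, but the module itself is illustrative and is not the authors' TensorFlow implementation (which also adds positional encodings and stochastic depth).

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, d_model: int, kernel_size: int = 7):
        super().__init__()
        # Depthwise: one filter per channel; pointwise: 1x1 conv mixes channels.
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.pointwise = nn.Conv1d(d_model, d_model, kernel_size=1)

    def forward(self, x):                    # x: (batch, length, d_model)
        y = x.transpose(1, 2)                # Conv1d expects (batch, channels, length)
        y = torch.relu(self.pointwise(self.depthwise(y)))
        return y.transpose(1, 2)

class EncoderBlock(nn.Module):
    def __init__(self, d_model: int = 128, num_convs: int = 4,
                 kernel_size: int = 7, num_heads: int = 8):
        super().__init__()
        self.conv_norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(num_convs)])
        self.convs = nn.ModuleList([DepthwiseSeparableConv(d_model, kernel_size)
                                    for _ in range(num_convs)])
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_model))

    def forward(self, x):                    # x: (batch, length, d_model)
        for norm, conv in zip(self.conv_norms, self.convs):
            x = x + conv(norm(x))            # layernorm -> separable conv -> residual
        y = self.attn_norm(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]   # self-attention
        x = x + self.ffn(self.ffn_norm(x))                   # position-wise feedforward
        return x

block = EncoderBlock()
print(block(torch.randn(2, 50, 128)).shape)  # torch.Size([2, 50, 128])
```

    Because nothing in the block is recurrent, all positions are processed in parallel, which is the source of the training and inference speedups reported above.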
    Tables
    • Table1: Comparison between answers in the original sentence and the paraphrased sentence (a sketch of the backtranslation-based paraphrasing appears after this table list)
    • Table2: The performances of different models on SQuAD dataset
    • Table3: Speed comparison between our model and RNN-based models on SQuAD dataset, all with batch size 32. RNN-x-y indicates an RNN with x layers each containing y hidden units. Here, we use bidirectional LSTM as the RNN. The speed is measured by batches/second, so higher is faster
    • Table4: Speed comparison between our model and BiDAF (Seo et al., 2016) on SQuAD dataset
    • Table5: An ablation study of data augmentation and other aspects of our model. The reported results are obtained on the development set. For rows containing entry “data augmentation”, “×N ” means the data is enhanced to N times as large as the original size, while the ratio in the bracket indicates the sampling ratio among the original, English-French-English and English-German-English data during training
    • Table6: The F1 scores on the adversarial SQuAD test set
    • Table7: The development set performances of different single-paragraph reading models on the Wikipedia domain of TriviaQA dataset. Note that ∗ indicates the result on test set
    • Table8: Speed comparison between the proposed model and RNN-based models on TriviaQA Wikipedia dataset, all with batch size 32. RNN-x-y indicates an RNN with x layers each containing y hidden units. The RNNs used here are bidirectional LSTM. The processing speed is measured by batches/second, so higher is faster
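    The data augmentation referenced in Table 1 and Table 5 paraphrases training data by backtranslation. The sketch below shows the overall shape of that pipeline under stated assumptions: translate is a hypothetical placeholder (not a real API) standing in for the English-French and English-German NMT systems used in the paper, and the answer re-alignment shown is a crude substring check, whereas the paper uses a character-overlap heuristic to relocate the answer span in the paraphrased sentence.

```python
from typing import Callable, Optional, Tuple

def backtranslate(sentence: str,
                  translate: Callable[[str, str, str], str],
                  pivot: str = "fr") -> str:
    """Paraphrase an English sentence by translating it to `pivot` and back."""
    foreign = translate(sentence, "en", pivot)   # English -> pivot language
    return translate(foreign, pivot, "en")       # pivot language -> English

def augment_example(context: str, question: str, answer: str,
                    translate: Callable[[str, str, str], str]
                    ) -> Optional[Tuple[str, str, str]]:
    """Return a paraphrased (context, question, answer) triple, or None if the
    answer can no longer be located in the paraphrased context."""
    new_context = backtranslate(context, translate)
    if answer in new_context:                    # crude stand-in for answer re-alignment
        return new_context, question, answer
    return None
```

    Each original example can be paraphrased through more than one pivot language, which is presumably how the ×N augmented datasets in Table 5 are built (original plus English-French-English plus English-German-English copies).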
    Related work
    • Machine reading comprehension and automated question answering have become an important topic in the NLP domain. Their popularity can be attributed to an increase in publicly available annotated datasets, such as SQuAD (Rajpurkar et al., 2016), TriviaQA (Joshi et al., 2017), CNN/Daily Mail (Hermann et al., 2015), WikiReading (Hewlett et al., 2016), the Children's Book Test (Hill et al., 2015), etc. A great number of end-to-end neural network models have been proposed to tackle these challenges, including BiDAF (Seo et al., 2016), r-net (Wang et al., 2017), DCN (Xiong et al., 2016), ReasoNet (Shen et al., 2017b), Document Reader (Chen et al., 2017), Interactive AoA Reader (Cui et al., 2017) and Reinforced Mnemonic Reader (Hu et al., 2017).

      Recurrent Neural Networks (RNNs) have featured predominantly in Natural Language Processing over the past few years. The sequential nature of text coincides with the design philosophy of RNNs, hence their popularity. In fact, all the reading comprehension models mentioned above are based on RNNs. Despite being common, the sequential nature of RNNs prevents parallel computation, as tokens must be fed into the RNN in order. Another drawback of RNNs is the difficulty of modeling long dependencies, although this is somewhat alleviated by the Gated Recurrent Unit (Chung et al., 2014) or Long Short-Term Memory architectures (Hochreiter & Schmidhuber, 1997). For simple tasks such as text classification, models trained with reinforcement learning (Yu et al., 2017) have been proposed to skip irrelevant tokens, both to further address the long-dependency issue and to speed up processing. However, it is not clear whether such methods can handle complicated tasks such as Q&A. The reading comprehension task considered in this paper always needs to deal with long text, as the context paragraphs may be hundreds of words long. Recently, attempts have been made to replace the recurrent networks with fully convolutional or fully attentional architectures (Kim, 2014; Gehring et al., 2017; Vaswani et al., 2017b; Shen et al., 2017a). Those models have been shown to be not only faster than RNN architectures, but also effective in other tasks, such as text classification, machine translation, and sentiment analysis.
    Reference
    • Martın Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. CoRR, abs/1603.04467, 2016. URL http://arxiv.org/abs/1603.04467.
    • Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016. URL http://arxiv.org/abs/1607.06450.
    • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.
    • Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pp. 1870–1879, 2017.
    • Francois Chollet. Xception: Deep learning with depthwise separable convolutions. CoRR, abs/1610.02357, 2016. URL http://arxiv.org/abs/1610.02357.
    • Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
    • Christopher Clark and Matt Gardner. Simple and effective multi-paragraph reading comprehension. CoRR, abs/1710.10723, 2017. URL http://arxiv.org/abs/1710.10723.
    • Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. Attention-over-attention neural networks for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pp. 593–602, 2017.
    • Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. In International Conference on Machine Learning, 2017.
    • Yichen Gong and Samuel R. Bowman. Ruminating reader: Reasoning with gated multi-hop attention. CoRR, abs/1704.07415, 2017. URL http://arxiv.org/abs/1704.07415.
    • Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 1693–1701, 2015.
    • Daniel Hewlett, Alexandre Lacoste, Llion Jones, Illia Polosukhin, Andrew Fandrianto, Jay Han, Matthew Kelcey, and David Berthelot. Wikireading: A novel large-scale language understanding task over wikipedia. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, 2016.
    • Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. The goldilocks principle: Reading children’s books with explicit memory representations. CoRR, abs/1511.02301, 2015.
    • Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural computation, 9(8): 1735–1780, 1997.
    • Minghao Hu, Yuxing Peng, and Xipeng Qiu. Reinforced mnemonic reader for machine comprehension. CoRR, abs/1705.02798, 2017. URL http://arxiv.org/abs/1705.02798.
    • Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, pp. 646–661, 2016.
    • Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pp. 2021–2031, 2017.
    • Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pp. 1601–1611, 2017.
    • Lukasz Kaiser, Aidan N Gomez, and Francois Chollet. Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059, 2017.
    • Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 1746–1751, 2014.
    • Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.
    • Mirella Lapata, Rico Sennrich, and Jonathan Mallinson. Paraphrasing revisited with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 1: Long Papers, pp. 881–893, 2017.
    • Kenton Lee, Tom Kwiatkowski, Ankur P. Parikh, and Dipanjan Das. Learning recurrent span representations for extractive question answering. CoRR, abs/1611.01436, 2016.
    • Rui Liu, Junjie Hu, Wei Wei, Zi Yang, and Eric Nyberg. Structural embedding of syntactic trees for machine comprehension. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pp. 826–835, 2017a.
    • Xiaodong Liu, Yelong Shen, Kevin Duh, and Jianfeng Gao. Stochastic answer networks for machine reading comprehension. CoRR, abs/1712.03556, 2017b. URL http://arxiv.org/abs/1712.03556.
    • Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In EMNLP, 2015.
    • Minh-Thang Luong, Eugene Brevdo, and Rui Zhao. Neural machine translation (seq2seq) tutorial. https://github.com/tensorflow/nmt, 2017.
    • Boyuan Pan, Hao Li, Zhou Zhao, Bin Cao, Deng Cai, and Xiaofei He. MEMEN: multi-layer embedding with memory networks for machine comprehension. CoRR, abs/1707.09098, 2017.
    • Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014. URL http://www.aclweb.org/anthology/D14-1162.
    • Jonathan Raiman and John Miller. Globally normalized reader. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pp. 1070–1080, 2017.
    • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pp. 2383–2392, 2016.
    • Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603, 2016. URL http://arxiv.org/abs/1611.01603.
    • Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. Disan: Directional self-attention network for rnn/cnn-free language understanding. CoRR, abs/1709.04696, 2017a. URL http://arxiv.org/abs/1709.04696.
    • Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. Reasonet: Learning to stop reading in machine comprehension. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017, pp. 1047–1055, 2017b.
    • Rupesh Kumar Srivastava, Klaus Greff, and Jurgen Schmidhuber. Highway networks. CoRR, abs/1505.00387, 2015. URL http://arxiv.org/abs/1505.00387.
    • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017a. URL http://arxiv.org/abs/1706.03762.
    • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neural Information Processing Systems, 2017b.
    • Shuohang Wang and Jing Jiang. Machine comprehension using match-lstm and answer pointer. CoRR, abs/1608.07905, 2016. URL http://arxiv.org/abs/1608.07905.
    • Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pp. 189–198, 2017.
    • Zhiguo Wang, Haitao Mi, Wael Hamza, and Radu Florian. Multi-perspective context matching for machine comprehension. CoRR, abs/1612.04211, 2016. URL http://arxiv.org/abs/1612.04211.
    • Dirk Weissenborn, Georg Wiese, and Laura Seiffe. Making neural QA as simple as possible but not simpler. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, Canada, August 3-4, 2017, pp. 271–280, 2017.
    • Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
    • Caiming Xiong, Victor Zhong, and Richard Socher. Dynamic coattention networks for question answering. CoRR, abs/1611.01604, 2016. URL http://arxiv.org/abs/1611.01604.
    • Adams Wei Yu, Hongrae Lee, and Quoc V. Le. Learning to skim text. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pp. 1880–1890, 2017.
    • Yang Yu, Wei Zhang, Kazi Saidul Hasan, Mo Yu, Bing Xiang, and Bowen Zhou. End-to-end reading comprehension with dynamic answer chunk ranking. CoRR, abs/1610.09996, 2016. URL http://arxiv.org/abs/1610.09996.
    • Junbei Zhang, Xiao-Dan Zhu, Qian Chen, Li-Rong Dai, Si Wei, and Hui Jiang. Exploring question understanding and adaptation in neural-network-based question answering. CoRR, abs/1703.04617, 2017.
    • Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 649–657, 2015.
    • Qingyu Zhou, Nan Yang, Furu Wei, Chuanqi Tan, Hangbo Bao, and Ming Zhou. Neural question generation from text: A preliminary study. CoRR, abs/1704.01792, 2017. URL http://arxiv.org/abs/1704.01792.