Multi-Passage Machine Reading Comprehension with Cross-Passage Answer Verification

In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2018). arXiv:1805.02220.

Keywords:
multi-passage question answering, MS MARCO, search engine, web data

Abstract:

Machine reading comprehension (MRC) on real web data usually requires the machine to answer a question by analyzing multiple passages retrieved by a search engine. Compared with MRC on a single passage, multi-passage MRC is more challenging, since we are likely to get multiple confusing answer candidates from different passages. To address ...

Introduction
  • Machine reading comprehension (MRC), empowering computers with the ability to acquire knowledge and answer questions from textual data, is believed to be a crucial step in building a general intelligent agent (Chen et al., 2016).
  • A significant milestone is that several MRC models have exceeded the performance of human annotators on the SQuAD dataset (Rajpurkar et al., 2016).
  • However, this success on single Wikipedia passages is still not adequate, considering the ultimate goal of reading the whole web.
  • Datasets such as MS-MARCO and DuReader therefore use a search engine to retrieve multiple passages, and MRC models are required to read all of these passages in order to give the final answer.
Highlights
  • Machine reading comprehension (MRC), empowering computers with the ability to acquire knowledge and answer questions from textual data, is believed to be a crucial step in building a general intelligent agent (Chen et al., 2016)
  • The BiDAF and Match-LSTM models are provided as two baseline systems (He et al., 2017)
  • We implement our system based on this new strategy, and our system achieves further improvement by a large margin
  • We propose an end-to-end framework to tackle the multi-passage MRC task
  • We creatively design three different modules in our model, which can find the answer boundary, model the answer content, and conduct cross-passage answer verification, respectively (a sketch of how the scores from these modules might be combined follows this list). All three modules can be trained with different forms of the answer labels, and training them jointly provides further improvement
  • The experimental results demonstrate that our model outperforms the baseline models by a large margin and achieves the state-of-the-art performance on two challenging datasets, both of which are designed for MRC on real web data
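
A minimal Python sketch of how the three module scores could be combined to pick the final answer across passages. The multiplicative combination and the numbers below are illustrative assumptions; the page only states that boundary, content, and verification scores are all predicted for each candidate (cf. Table 6).

    def select_final_answer(boundary, content, verification):
        """Choose the answer candidate with the highest combined score.

        Each argument holds one score per candidate (one candidate extracted
        from each passage). Multiplying the three scores is an assumption made
        for illustration; the exact combination is not spelled out on this page.
        """
        combined = [b * c * v for b, c, v in zip(boundary, content, verification)]
        return max(range(len(combined)), key=combined.__getitem__)

    # Illustrative values (not the actual Table 6 scores): candidate 0 wins on
    # boundary and content alone, but verification flips the choice to candidate 5.
    best = select_final_answer(
        boundary=[0.70, 0.05, 0.06, 0.04, 0.05, 0.40],
        content=[0.60, 0.10, 0.20, 0.15, 0.20, 0.55],
        verification=[0.05, 0.05, 0.20, 0.20, 0.20, 0.80],
    )
    print(best)  # -> 5
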
Methods
  • To verify the effectiveness of the model on multi-passage machine reading comprehension, the authors conduct experiments on the MS-MARCO (Nguyen et al., 2016) and DuReader (He et al., 2017) datasets.
  • One prerequisite for answer verification is that there should be multiple correct answers so that they can verify each other
  • Both the MS-MARCO and DuReader datasets require the human annotators to generate multiple answers if possible.
  • A span is taken as valid if it achieves an F1 score larger than 0.7 against any reference answer (a minimal implementation of this validity check follows this list)
  • From these statistics, the authors can see that the phenomenon of multiple answers is quite common for both MS-MARCO and DuReader.
  • These answers will provide strong signals for answer verification if the authors can leverage them properly
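
A minimal sketch of the span-validity check described above, assuming whitespace tokenization and the standard token-overlap F1; the paper's exact tokenization and matching details may differ.

    from collections import Counter

    def token_f1(candidate: str, reference: str) -> float:
        """Token-level F1 between a candidate span and one reference answer."""
        cand_tokens = candidate.split()
        ref_tokens = reference.split()
        common = Counter(cand_tokens) & Counter(ref_tokens)
        num_same = sum(common.values())
        if num_same == 0:
            return 0.0
        precision = num_same / len(cand_tokens)
        recall = num_same / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)

    def is_valid_span(candidate: str, references: list, threshold: float = 0.7) -> bool:
        """A span counts as valid if its F1 against any reference exceeds the threshold."""
        return max(token_f1(candidate, ref) for ref in references) > threshold
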
Results
  • Results on MS-MARCO: Table 3 shows the results of the system and other state-of-the-art models on the MS-MARCO test set.
  • The results of the model and several baseline systems on the test set of DuReader are shown in Table 4.
  • The BiDAF and Match-LSTM models are provided as two baseline systems (He et al, 2017).
  • Based on BiDAF, as described in Section 3.2, the authors tried a new paragraph selection strategy by employing a paragraph ranking (PR) model.
  • Example answer candidates from the case study in Table 1: [4] "A pure culture comprises a single species or strains. A mixed ..."; [5] "A pure culture is a culture consisting of only one strain."; [6] "A pure culture is one in which only one kind of microbial species ..."
Conclusion
  • 4.1 Ablation Study

    To get better insight into the system, the authors conduct an in-depth ablation study on the development set of MS-MARCO, which is shown in Table 5.
  • The authors creatively design three different modules in the model, which can find the answer boundary, model the answer content, and conduct cross-passage answer verification, respectively.
  • All three modules can be trained with different forms of the answer labels, and training them jointly provides further improvement; a sketch of the implied joint objective follows this list.
  • The experimental results demonstrate that the model outperforms the baseline models by a large margin and achieves the state-of-the-art performance on two challenging datasets, both of which are designed for MRC on real web data
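
A short sketch of the joint training objective implied by the task weights β1 and β2 reported in the Table 3 caption (both 0.5). Treating the total loss as a weighted sum of the boundary, content, and verification losses is an assumption consistent with that caption; the exact form of each component loss (negative log-likelihood / cross-entropy) is also an assumption.

    import torch

    def joint_loss(boundary_loss: torch.Tensor,
                   content_loss: torch.Tensor,
                   verification_loss: torch.Tensor,
                   beta1: float = 0.5,
                   beta2: float = 0.5) -> torch.Tensor:
        """Weighted sum of the three module losses used for joint training.

        beta1 and beta2 are the task weights from the reported setting (both 0.5);
        combining the losses as boundary + beta1 * content + beta2 * verification
        is an assumption made for illustration.
        """
        return boundary_loss + beta1 * content_loss + beta2 * verification_loss
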
Tables
  • Table1: An example from MS-MARCO. The text in bold is the predicted answer candidate from each passage according to the boundary model. The candidate from [1] is chosen as the final answer by this model, while the correct answer is from [6] and can be verified by the answers from [3], [4], [5]
  • Table2: Percentage of questions that have multiple valid answers or answer spans. Both datasets are built from real-world search engines and involve a large number of passages retrieved from the web. One difference between the two datasets is that MS-MARCO mainly focuses on English web data, while DuReader is designed for Chinese MRC. This diversity is expected to reflect the generality of our method. In terms of data size, MS-MARCO contains 102,023 questions, each of which is paired with approximately 10 passages for reading comprehension. As for DuReader, it keeps the top-5 search results for each question and contains 201,574 questions in total.
  • Table3: Performance of our method and competing models on the MS-MARCO test set. Hyperparameters are tuned according to validation performance on the MS-MARCO development set: the hidden size is set to 150 and L2 regularization is applied with weight 0.0003. The task weights β1 and β2 are both set to 0.5. To train the model, the Adam algorithm (Kingma and Ba, 2014) is employed with an initial learning rate of 0.0004 and a mini-batch size of 32. An exponential moving average with decay rate 0.9999 is applied to all trainable variables (a configuration sketch follows this list).
  • Table4: Performance on the DuReader test set
  • Table5: Ablation study on the MS-MARCO development set. If we ensemble the models trained with different random seeds and hyper-parameters, the results can be further improved and outperform the ensemble model in Tan et al. (2017), especially in terms of BLEU-1.
  • Table6: Scores predicted by our model for the answer candidates shown in Table 1. Although the candidate [1] gets high boundary and content scores, the correct answer [6] is preferred by the verification model and is chosen as the final answer
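
A configuration sketch collecting the hyperparameters reported in the Table 3 caption. The use of PyTorch, the EMA helper, and expressing L2 regularization as Adam weight decay are illustrative assumptions; only the numeric values come from this page.

    import torch

    # Reported hyperparameters (Table 3 caption); the framework choice is an assumption.
    HIDDEN_SIZE = 150
    L2_WEIGHT = 3e-4
    BETA1, BETA2 = 0.5, 0.5          # task weights for the joint loss
    LEARNING_RATE = 4e-4
    BATCH_SIZE = 32
    EMA_DECAY = 0.9999

    def make_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
        """Adam optimizer; L2 regularization approximated via weight decay."""
        return torch.optim.Adam(model.parameters(),
                                lr=LEARNING_RATE, weight_decay=L2_WEIGHT)

    class EMA:
        """Exponential moving average over all trainable variables."""

        def __init__(self, model: torch.nn.Module, decay: float = EMA_DECAY):
            self.decay = decay
            self.shadow = {n: p.detach().clone()
                           for n, p in model.named_parameters() if p.requires_grad}

        @torch.no_grad()
        def update(self, model: torch.nn.Module) -> None:
            # Blend current parameter values into the shadow copies.
            for n, p in model.named_parameters():
                if p.requires_grad:
                    self.shadow[n].mul_(self.decay).add_(p.detach(), alpha=1 - self.decay)
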
Related work
  • Machine reading comprehension has made rapid progress in recent years, especially for the single-passage MRC task, such as SQuAD (Rajpurkar et al., 2016). Mainstream studies (Seo et al., 2016; Wang and Jiang, 2016; Xiong et al., 2016) treat reading comprehension as extracting an answer span from the given passage, which is usually achieved by predicting the start and end positions of the answer. We implement our boundary model similarly by employing the boundary-based pointer network (Wang and Jiang, 2016). Another inspiring work is from Wang et al. (2017c), where the authors propose to match the passage against itself so that the representation can aggregate evidence from the whole passage. Our verification model adopts a similar idea. However, we collect information across passages, and our attention is computed over the answer representations, which is much more efficient than attention over all passages (a sketch of this idea follows). For the model training, Xiong et al. (2017) argue that the boundary loss encourages exact answers at the cost of penalizing overlapping answers. Therefore, they propose a mixed objective that incorporates rewards derived from word overlap. Our joint training approach has a similar function: by taking the content and verification losses into consideration, our model incurs a smaller loss for overlapping answers than for unmatched answers, and our loss function is fully differentiable.
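
A minimal sketch of attention over answer representations for cross-passage verification, following the general idea described above. The specific scoring function (dot-product attention between candidates plus a linear layer over concatenated features) is an assumption made for illustration, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def cross_passage_verification(answer_reprs: torch.Tensor,
                                   w: torch.Tensor) -> torch.Tensor:
        """Score each answer candidate against the candidates from other passages.

        answer_reprs: (n, d) tensor, one vector per candidate (one per passage, n >= 2)
        w:            (3 * d,) scoring vector (hypothetical parameter)
        Returns an (n,) tensor of softmax-normalized verification scores.
        """
        n, _ = answer_reprs.shape
        # Dot-product attention between candidates, excluding self-attention.
        sim = answer_reprs @ answer_reprs.t()                           # (n, n)
        sim = sim.masked_fill(torch.eye(n, dtype=torch.bool), float("-inf"))
        attn = F.softmax(sim, dim=-1)                                   # (n, n)
        # Evidence gathered for each candidate from the other passages.
        evidence = attn @ answer_reprs                                  # (n, d)
        features = torch.cat(
            [answer_reprs, evidence, answer_reprs * evidence], dim=-1)  # (n, 3d)
        return F.softmax(features @ w, dim=0)

    # Usage: scores = cross_passage_verification(torch.randn(6, 150), torch.randn(450))
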
Funding
  • This work is supported by the National Basic Research Program of China (973 Program, No. 2014CB340505) and the Baidu-Peking University Joint Project
Reference
  • Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough examination of the cnn/daily mail reading comprehension task. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.
  • Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. Searchqa: A new q&a dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179.
  • Wei He, Kai Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, Xuan Liu, Tian Wu, and Haifeng Wang. 2017. Dureader: a chinese machine reading comprehension dataset from real-world applications. arXiv preprint arXiv:1711.05073.
  • Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015.
  • Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2015. The goldilocks principle: Reading children’s books with explicit memory representations. arXiv preprint arXiv:1511.02301.
  • Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016, co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016).
  • Boyuan Pan, Hao Li, Zhou Zhao, Bin Cao, Deng Cai, and Xiaofei He. 2017. Memen: Multi-layer embedding with memory networks for machine comprehension. arXiv preprint arXiv:1707.09098.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pages 311–318.
  • Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP). pages 1532– 1543.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016.
  • Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.
  • Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.
  • Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. 2017. Reasonet: Learning to stop reading in machine comprehension. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017. pages 1047– 1055.
  • Chuanqi Tan, Furu Wei, Nan Yang, Weifeng Lv, and Ming Zhou. 2017. S-net: From answer extraction to answer generation for machine reading comprehension. arXiv preprint arXiv:1706.04815.
  • Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 2692–2700.
  • Shuohang Wang and Jing Jiang. 2016. Machine comprehension using match-lstm and answer pointer. arXiv preprint arXiv:1608.07905.
  • Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerald Tesauro, Bowen Zhou, and Jing Jiang. 2017a. R^3: Reinforced reader-ranker for open-domain question answering. arXiv preprint arXiv:1709.00023.
  • Shuohang Wang, Mo Yu, Jing Jiang, Wei Zhang, Xiaoxiao Guo, Shiyu Chang, Zhiguo Wang, Tim Klinger, Gerald Tesauro, and Murray Campbell. 2017b. Evidence aggregation for answer re-ranking in open-domain question answering. arXiv preprint arXiv:1711.05116.
  • Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017c. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers.
  • Dirk Weissenborn, Georg Wiese, and Laura Seiffe. 2017. Making neural QA as simple as possible but not simpler. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, Canada, August 3-4, 2017. pages 271–280.
  • Caiming Xiong, Victor Zhong, and Richard Socher. 2016. Dynamic coattention networks for question answering. arXiv preprint arXiv:1611.01604.
  • Caiming Xiong, Victor Zhong, and Richard Socher. 2017. DCN+: mixed objective and deep residual coattention for question answering. arXiv preprint arXiv:1711.00106.