DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications

Yajuan Lyu
Xinyan Xiao
Yizhong Wang
Qiaoqiao She

Annual Meeting of the Association for Computational Linguistics, 2018.


Abstract:

This paper introduces DuReader, a new large-scale, open-domain Chinese machine reading comprehension (MRC) dataset, designed to address real-world MRC. DuReader has three advantages over previous MRC datasets: (1) data sources: questions and documents are based on Baidu Search and Baidu Zhidao, and answers are manually generated; (2) question types: it covers a rich set of question types, including yes-no and opinion questions; (3) scale: 200k questions, 420k answers and 1M documents, the largest Chinese MRC dataset so far.

Introduction
  • The task of machine reading comprehension (MRC) aims to empower machines to answer questions after reading articles (Rajpurkar et al., 2016; Nguyen et al., 2016). The DuReader data is available at http://ai.baidu.com/broad/download?dataset=dureader and the baseline code at https://github.com/baidu/DuReader.
  • A number of datasets have been developed for MRC, as shown in Table 1.
  • This paper hopes to advance MRC even further with the release of DuReader, challenging the community to deal with more realistic data sources, more question types and larger scale, as illustrated in Tables 1-4.
  • Table 1 highlights DuReader’s advantages over previous datasets in terms of data sources and scale.
Highlights
  • The task of machine reading comprehension (MRC) aims to empower machines to answer questions after reading articles (Rajpurkar et al., 2016; Nguyen et al., 2016)
  • A number of datasets have been developed for MRC, as shown in Table 1
  • This paper hopes to advance MRC even further with the release of DuReader, challenging the community to deal with more realistic data sources, more question types and larger scale, as illustrated in Tables 1-4
  • What types of question queries do we find in the logs of a search engine? A pilot study was performed to create a taxonomy of question types
  • If we directly apply state-of-the-art MRC models that were designed for answer span selection, there will be efficiency issues
  • If the accuracy is lower than 95%, the corresponding workers and the experts need to revise the answers again
  • This paper announced the release of DuReader, a new dataset for researchers interested in machine reading comprehension (MRC)
Methods
  • The authors implement and evaluate the baseline systems with two state-of-the-art models.
  • If the authors directly apply state-of-the-art MRC models that were designed for answer span selection, there will be efficiency issues.
  • To improve the efficiency of both training and testing, the designed systems have two steps: (1) select the most relevant paragraph from each document, and (2) apply the state-of-the-art MRC models on the selected paragraphs.
  • MRC models designed for answer span selection will be trained on these selected paragraphs
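The first stage of this two-step pipeline can be sketched as below. The character-level recall scoring is an assumption for illustration, a simple stand-in for whatever relevance measure the actual baseline uses:

```python
from collections import Counter

def select_paragraph(question, paragraphs):
    """Pick the single paragraph of a document that best matches the
    question.  Scoring here is character-level recall of question
    characters (a hypothetical criterion; the paper's baseline may use
    a different overlap measure)."""
    q_counts = Counter(question)
    q_total = max(sum(q_counts.values()), 1)

    def recall(paragraph):
        p_counts = Counter(paragraph)
        overlap = sum(min(n, p_counts[ch]) for ch, n in q_counts.items())
        return overlap / q_total

    return max(paragraphs, key=recall)

# Stage 2 would then run an answer-span MRC model (e.g. BiDAF or
# Match-LSTM, as in the paper) on the selected paragraphs only.
```

Character-level overlap is a natural choice for Chinese text, where word segmentation itself is nontrivial.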
Results
  • The authors evaluate the reading comprehension task via character-level BLEU-4 (Papineni et al., 2002) and ROUGE-L (Lin, 2004), which are widely used for evaluating the quality of language generation.
  • The authors also evaluate a Selected Paragraph baseline: the paragraph that has the largest overlap with the question among all documents.
  • The authors find that the reading comprehension models achieve much higher scores on the Zhidao data.
  • This shows that it is much harder for the models to comprehend open-domain web articles than to find answers in passages from a question answering community.
  • Human performance on these two datasets shows little difference, which suggests that human reading skill is more stable across different types of documents
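A minimal sketch of the character-level metrics used in these evaluations follows. The add-one smoothing in BLEU-4 and the beta value in ROUGE-L are assumptions, not necessarily the paper's exact configuration:

```python
import math
from collections import Counter

def char_bleu4(candidate, reference):
    """Character-level BLEU-4: geometric mean of 1..4-gram precisions
    with a brevity penalty.  Add-one smoothing is an assumption; real
    evaluations use a standard smoothing scheme."""
    precisions = []
    for n in range(1, 5):
        cand = Counter(candidate[i:i + n] for i in range(len(candidate) - n + 1))
        ref = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
        match = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append((match + 1) / (total + 1))
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

def char_rouge_l(candidate, reference, beta=1.2):
    """Character-level ROUGE-L: F-measure over the longest common
    subsequence of the two character strings."""
    m, n = len(candidate), len(reference)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if candidate[i] == reference[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    p, r = lcs / m, lcs / n
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)
```

Both metrics operate on characters rather than words, which sidesteps Chinese word segmentation when comparing generated answers against references.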
Conclusion
  • As shown in the experiments, the current state-of-the-art models still underperform human beings by a large margin on the dataset.
  • There are some questions in the dataset that have not been extensively studied before, such as yes-no questions and opinion questions requiring multi-document MRC.
  • New methods are needed for opinion recognition, cross-sentence reasoning, and multi-document summarization.
  • It is necessary to design a more sophisticated paragraph ranking model for the real-world MRC problem.
  • This paper announced the release of DuReader, a new dataset for researchers interested in machine reading comprehension (MRC).
  • Since the release of the task, the authors have already seen significant improvements from more sophisticated models
Tables
  • Table1: DuReader has three advantages over previous MRC datasets: (1) data sources: questions and documents are based on Baidu Search & Baidu Zhidao; answers are manually generated, (2) question types, and (3) scale: 200k questions, 420k answers and 1M documents (largest Chinese MRC dataset so far). The next three tables address advantage (2)
  • Table2: Examples of the six types of questions in Chinese (with glosses in English). Previous datasets have focused on fact-entity and fact-description, though all six types are common in search logs
  • Table3: Pilot Study found that all six types of question queries are common in search logs
  • Table4: The distribution of question types in DuReader
  • Table5: Examples from DuReader. Annotations for these questions include both the answers, as well as supporting sentences
  • Table6: Performance of typical MRC systems on the DuReader
  • Table7: Model performance with gold paragraph
  • Table8: Performance on various question types. Current MRC models achieve impressive improvements compared with the selected paragraph baseline. However, there is still a large gap between these models and humans
  • Table9: Performance of opinion-aware model on YesNo questions
Reference
  • Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. 2017. Attention-over-attention neural networks for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 593–602.
  • Yiming Cui, Ting Liu, Zhipeng Chen, Shijin Wang, and Guoping Hu. 2016. Consensus attention-based neural networks for Chinese reading comprehension.
  • Matthew Dunn, Levent Sagun, Mike Higgins, Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179.
  • Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.
  • Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2015. The Goldilocks principle: Reading children's books with explicit memory representations. arXiv preprint arXiv:1511.02301.
  • Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. CoRR.
  • Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR.
  • Tomas Kocisky, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gabor Melis, and Edward Grefenstette. 2017. The NarrativeQA reading comprehension challenge. arXiv preprint arXiv:1712.07040.
  • Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.
  • Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81.
  • Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
  • Matthew Richardson, Christopher J.C. Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 193–203.
  • Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. CoRR.
  • Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191–200.
  • Shuohang Wang and Jing Jiang. 2017. Machine comprehension using Match-LSTM and answer pointer. In ICLR, pages 1–15.
  • Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 189–198.
  • Caiming Xiong, Victor Zhong, and Richard Socher. 2017. Dynamic coattention networks for question answering. In Proceedings of the International Conference on Learning Representations.