Adaptations of ROUGE and BLEU to Better Evaluate Machine Reading Comprehension Task

An Yang
Yajuan Lyu

MACHINE READING FOR QUESTION ANSWERING, pp. 98-104, 2018.

Keywords:
MRC model, machine reading comprehension, mean reciprocal rank, human judgment, reading comprehension dataset

Abstract:

Current evaluation metrics for question answering based machine reading comprehension (MRC) systems generally focus on the lexical overlap between candidate and reference answers, such as ROUGE and BLEU. However, bias may appear when these metrics are used for specific question types, especially questions inquiring about yes-no opinions and entities.

Introduction
  • The goal of current MRC tasks is to develop agents which are able to comprehend passages automatically and answer open-domain questions correctly.
  • Answers that omit or mispredict entities should be distinguished from correct answers, but such mistakes have little effect on BLEU and ROUGE scores, especially when the entity is a number (see the sketch after this list).
  • These two question types are quite common in MRC datasets and in real-world scenarios.
  • For the reasons above, developing an automatic evaluation system that takes the inherent characteristics of these question types into consideration is highly necessary.
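To make this bias concrete, the following is a minimal token-level ROUGE-L sketch (a simplification for illustration only, not the official ROUGE package or the paper's evaluation code); the flight-duration sentences are hypothetical examples, while the wireless-router pair is the one cited in the highlights below.

```python
# Minimal token-level ROUGE-L sketch: answers with a wrong number or the opposite
# yes-no opinion still score ~0.8-0.9 because nearly every token overlaps.

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1] else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    """Simplified ROUGE-L F-score; beta > 1 weights recall (the exact value here is illustrative)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)

ref = "the radiation of wireless routers has an impact on people"
print(rouge_l("the radiation of wireless routers has no impact on people", ref))    # ~0.90
print(rouge_l("the flight takes about 3 hours", "the flight takes about 2 hours"))  # ~0.83
```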
Highlights
  • The goal of current machine reading comprehension (MRC) tasks is to develop agents which are able to comprehend passages automatically and answer open-domain questions correctly
  • With the release of several large-scale datasets like SQuAD (Rajpurkar et al., 2016), MS-MARCO (Nguyen et al., 2016) and DuReader (He et al., 2017), many MRC models have been proposed in previous works (Wang and Jiang, 2016; Seo et al., 2016; Wang et al., 2017)
  • Answers with contrary opinions may have high lexical overlap, such as “The radiation of wireless routers has an impact on people” and “The radiation of wireless routers has no impact on people”
  • For question answering MRC tasks, automatic evaluation metrics are commonly based on measuring lexical overlap, such as BLEU and ROUGE
  • In some cases, we notice that these automatic evaluation metrics may deviate from human judgment, especially for yes-no and entity questions
  • The statistical analysis shows that our adaptations achieve higher correlation with human judgment than the original ROUGE-L and BLEU, demonstrating the effectiveness of our methodology
Methods
  • The core idea of the adaptations is to add, as a bonus, extra lexical overlap terms that reflect opinion and entity agreement.
  • In the official evaluation of MS-MARCO and DuReader, ROUGE-L and BLEU are employed as metrics at the same time, with the former as the primary criterion for ranking participating systems
  • Their modifications will be elaborated separately.
  • For one question sample with a single candidate and several gold answers, Papineni et al. (2002) define cumulative BLEU-n with uniform n-gram weights as $\mathrm{BLEU}_{\mathrm{cum}} = BP \cdot \prod_{i=1}^{n} P_i^{\frac{1}{n}}$ (1), where $P_i$ is the modified $i$-gram precision and $BP$ is the brevity penalty.
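As a reference point for Eq. (1), here is a minimal sketch of cumulative BLEU-n for a single candidate against multiple gold answers (a simplified illustration under the standard BLEU definitions, not the official evaluation script; the opinion and entity bonus terms the paper adds on top of these metrics are not reproduced here).

```python
# Cumulative BLEU-n, Eq. (1): brevity penalty times the geometric mean of the
# modified 1..n-gram precisions against several reference answers.
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(cand, refs, n):
    """Clipped n-gram precision P_n: candidate counts capped by the max count in any reference."""
    cand_counts = ngrams(cand, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in refs:
        for gram, cnt in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], cnt)
    clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu_cum(candidate, references, n=4):
    """BLEU_cum = BP * prod_i P_i^(1/n) for one candidate and several gold answers."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    precisions = [modified_precision(cand, refs, i) for i in range(1, n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # the geometric mean collapses when any precision is zero (no smoothing here)
    # Brevity penalty uses the reference length closest to the candidate length.
    r = min((abs(len(ref) - len(cand)), len(ref)) for ref in refs)[1]
    bp = 1.0 if len(cand) > r else math.exp(1 - r / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / n)

print(bleu_cum("about 2 hours by high speed train",
               ["about 2 hours by high speed rail", "roughly two hours by train"], n=2))  # ~0.91
```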
Conclusion
  • For question answering MRC tasks, automatic evaluation metrics are commonly based on measuring lexical overlap, such as BLEU and ROUGE.
  • In some cases, the authors notice that these automatic evaluation metrics may deviate from human judgment, especially for yes-no and entity questions.
  • The authors think this may mislead the development of real-world MRC systems.
  • The authors hope this exploration will bring more research attention to the design of MRC evaluation metrics.
Tables
  • Table1: Pearson correlation coefficients (PCC) between annotators
  • Table2: PCCs between various automatic metrics and human judgment for different question types on single question level
  • Table3: PCCs between various automatic metrics and human judgment for different question types on overall score level
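As a rough illustration of how figures like those in Tables 1-3 are obtained, the sketch below computes a Pearson correlation coefficient between automatic metric scores and human ratings (the numbers are invented for illustration and scipy is an assumed dependency; this is not the paper's analysis code).

```python
# Hypothetical PCC computation between per-answer metric scores and human ratings.
from scipy.stats import pearsonr

metric_scores = [0.91, 0.35, 0.78, 0.10, 0.66, 0.52]  # e.g. ROUGE-L per candidate answer (made up)
human_ratings = [3, 1, 3, 1, 2, 2]                     # e.g. human scores on an ordinal scale (made up)

pcc, p_value = pearsonr(metric_scores, human_ratings)
print(f"PCC = {pcc:.3f} (p = {p_value:.3f})")
```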
Related work
  • MRC Task: Recent years have witnessed growing research interest in machine reading comprehension. The annotation of large-scale datasets has been a strong driving force behind the recent progress of MRC systems. The paradigms of such datasets range from cloze tests (Hermann et al., 2015; Hill et al., 2015) and multiple choice (Lai et al., 2017) to span extraction (Rajpurkar et al., 2016) and answer generation (Nguyen et al., 2016; He et al., 2017). The last paradigm, with multiple passages and manually annotated answers for each question, is closest to real applications. Based on these resources, end-to-end neural MRC architectures have been proposed, including match-LSTM (Wang and Jiang, 2016), BiDAF (Seo et al., 2016), DCN (Xiong et al., 2016) and R-NET (Wang et al., 2017). Trained toward lexical overlap based evaluation metrics, these models focus more on matching the text of reference answers, which can be biased with respect to human demands. Conceiving opinion- and entity-aware metrics instead will encourage future MRC systems to look more into real application cases.
Funding
  • This work was partially supported by the National Natural Science Foundation of China (61572049) and the Baidu-Peking University Joint Project.
Reference
  • Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.
  • Eric Breck, John D. Burger, Lisa Ferro, Lynette Hirschman, David House, Marc Light, and Inderjeet Mani. 2000. How to evaluate your question answering system every day and still get real work done. arXiv preprint cs/0004008.
  • David Chiang, Steve DeNeefe, Yee Seng Chan, and Hwee Tou Ng. 2008. Decomposability of translation metrics for improved evaluation and efficient algorithms. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 610–619.
  • Hoa Trang Dang, Diane Kelly, and Jimmy J. Lin. 2007. Overview of the TREC 2007 question answering track. In TREC, volume 7, page 63.
  • Wei He, Kai Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, Xuan Liu, et al. 2017. DuReader: A Chinese machine reading comprehension dataset from real-world applications. arXiv preprint arXiv:1711.05073.
  • Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.
  • Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2015. The Goldilocks principle: Reading children's books with explicit memory representations. arXiv preprint arXiv:1511.02301.
  • Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.
  • Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.
  • Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.
  • Feifan Liu and Yang Liu. 2008. Correlation between ROUGE and human evaluation of extractive meeting summaries. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pages 201–204.
  • Ani Nenkova, Rebecca Passonneau, and Kathleen McKeown. 2007. The pyramid method: Incorporating human content selection variation in summarization evaluation. ACM Transactions on Speech and Language Processing (TSLP), 4(2):4.
  • Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
  • Matthew Richardson, Christopher J. C. Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 193–203.
  • Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.
  • Ellen M. Voorhees. 2003. Overview of TREC 2003. In TREC, pages 1–13.
  • Ellen M. Voorhees and D. M. Tice. 2000. Overview of the TREC-9 question answering track. In TREC.
  • Ellen M. Voorhees et al. 1999. The TREC-8 question answering track report. In TREC, volume 99, pages 77–82.
  • Shuohang Wang and Jing Jiang. 2016. Machine comprehension using match-LSTM and answer pointer. arXiv preprint arXiv:1608.07905.
  • Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 189–198.
  • Caiming Xiong, Victor Zhong, and Richard Socher. 2016. Dynamic coattention networks for question answering. arXiv preprint arXiv:1611.01604.