Large-scale Cloze Test Dataset Created by Teachers

EMNLP, pp. 2344-2356, 2018.

Keywords:
language model, test dataset, large-scale cloze test, cloze question, Children's Books Test (12+ more)

Abstract:

Cloze tests are widely adopted in language exams to evaluate students' language proficiency. In this paper, we propose the first large-scale human-created cloze test dataset CLOTH, containing questions used in middle-school and high-school language exams. With missing blanks carefully created by teachers and candidate choices purposely designed to test different aspects of language phenomena, CLOTH requires a deep language understanding and better captures the complexity of human language.

Introduction
  • Being a classic language exercise, the cloze test (Taylor, 1953) is an accurate assessment of language proficiency (Fotos, 1991; Jonz, 1991; Tremblay, 2011) and has been widely employed in language examinations.
  • To facilitate natural language understanding, automatically-generated cloze datasets have been introduced to measure machines' ability in reading comprehension (Hermann et al., 2015; Hill et al., 2016; Onishi et al., 2016).
  • In these datasets, each cloze question is typically created by automatically deleting a word from a passage, with candidate answers sampled automatically.
  • Automatically-generated candidates can be totally irrelevant or grammatically unsuitable for the blank, resulting in purposeless or trivial questions (a schematic cloze-question example is sketched below).
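To make the question format concrete, here is a minimal sketch of how one cloze question could be represented in code. The passage, options, and field names are invented for illustration; they are not taken from CLOTH and do not reflect the dataset's actual file format.

```python
# Illustrative only: an assumed, simplified representation of a single
# cloze question (a passage with one blank, four candidate options, and
# exactly one correct answer). Not the actual CLOTH schema.
sample_question = {
    "passage": "It was her first day at the new job, so she was very _ "
               "and arrived at the office early.",
    "options": ["bored", "excited", "angry", "tired"],
    "answer": "excited",
    "split": "CLOTH-M",  # middle-school portion; high-school is CLOTH-H
}

def is_correct(predicted_option: str, question: dict) -> bool:
    """Compare a model's chosen option against the gold answer."""
    return predicted_option == question["answer"]

print(is_correct("excited", sample_question))  # True
```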
Highlights
  • Being a classic language exercise, the cloze test (Taylor, 1953) is an accurate assessment of language proficiency (Fotos, 1991; Jonz, 1991; Tremblay, 2011) and has been widely employed in language examinations
  • Motivated by the aforementioned drawbacks, we propose CLOTH, a large-scale cloze test dataset collected from English exams
  • We find that a language model (LM) trained on the One Billion Word Corpus achieves a remarkable score but still cannot solve the cloze test (a candidate-scoring sketch follows this list).
  • We propose a large-scale cloze test dataset CLOTH that is designed by teachers
  • With missing blanks and candidate options carefully created by teachers to test different aspects of language phenomena, CLOTH requires a deep language understanding and better captures the complexity of human language
  • We find that humans outperform the 1B-LM by a significant margin.
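As a rough illustration of how a language model can be applied to such a question, the sketch below fills the blank with each candidate and keeps the option whose filled-in sequence receives the highest log-probability. The unigram scorer is only a toy stand-in so the snippet runs on its own; it is not the 1B-LM, and the selection rule is a common approach rather than a confirmed detail of the authors' implementation.

```python
import math
from collections import Counter

def make_unigram_scorer(corpus_tokens):
    """Toy stand-in for a trained LM: add-one-smoothed unigram log-probability."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1
    def log_prob(tokens):
        return sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
    return log_prob

def choose_option(tokens, blank_index, options, log_prob):
    """Return the candidate whose filled-in sentence scores highest under the LM."""
    def score(option):
        filled = tokens[:blank_index] + [option] + tokens[blank_index + 1:]
        return log_prob(filled)
    return max(options, key=score)

# Tiny made-up usage example.
scorer = make_unigram_scorer("she was very excited and arrived early".split())
sentence = "she was very _ and arrived early".split()
print(choose_option(sentence, 3, ["bored", "excited", "tired"], scorer))  # excited
```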
Results
  • The comparison is shown in Table 4
  • Both attentive readers achieve similar accuracy to the LSTM.
  • The authors hypothesize that the reason for the attention model's unsatisfactory performance is that the evidence needed to answer a question cannot be found by directly matching the context.
  • When the authors further remove the human-created data so that only generated data is employed, the accuracy drops to 0.543, similar to the performance of the LM (a sketch of building such a mixed training set follows this list).
  • The authors' prediction model achieves an F1 score of 36.5 on the test set (see https://gist.github.com/ihsgnef/).
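The ablation behind that result (see Table 7) varies the share of automatically-generated versus human-created questions in the training set. Below is a generic sketch of building such a mixture with a fixed total size; the sampling scheme is an assumption made for illustration, not the authors' actual data pipeline.

```python
import random

def mix_training_data(generated, human, alpha, seed=0):
    """Build a training set with alpha% automatically-generated questions and
    (100 - alpha)% human-created questions, keeping the total size fixed at
    the size of the human-created pool (an assumption made for this sketch)."""
    rng = random.Random(seed)
    total = len(human)
    n_generated = round(total * alpha / 100)
    mixed = rng.sample(generated, n_generated) + rng.sample(human, total - n_generated)
    rng.shuffle(mixed)
    return mixed

# Made-up question pools; alpha=40 -> 20 generated + 30 human out of 50.
generated_pool = [f"gen_{i}" for i in range(200)]
human_pool = [f"hum_{i}" for i in range(50)]
print(len(mix_training_data(generated_pool, human_pool, alpha=40)))  # 50
```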
Conclusion
  • The authors propose a large-scale cloze test dataset CLOTH that is designed by teachers.
  • With missing blanks and candidate options carefully created by teachers to test different aspects of language phenomena, CLOTH requires a deep language understanding and better captures the complexity of human language.
  • Despite the excellent performance of 1B-LM when compared with models trained only on CLOTH, it is still important to investigate and create more effective models and algorithms which provide complementary advantages to having a large amount of data.
  • The authors suggest training models only on the training set of CLOTH and comparing with models that do not utilize external data (a minimal accuracy-comparison sketch follows).
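Under that suggested protocol, comparing systems reduces to reporting blank-level accuracy on the same held-out questions. The helper below is a generic sketch with made-up predictions; it is not an official CLOTH evaluation script.

```python
def accuracy(predictions, answers):
    """Fraction of blanks answered correctly (both lists hold option strings)."""
    assert len(predictions) == len(answers) and answers
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)

# Hypothetical comparison on the same development questions.
gold = ["excited", "however", "smile"]
print(accuracy(["excited", "however", "laugh"], gold))  # CLOTH-only model: 2/3 correct
print(accuracy(["excited", "but", "smile"], gold))      # model with external data: 2/3 correct
```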
Tables
  • Table1: The statistics of the training, development and test sets of CLOTH-M (middle school questions), CLOTH-H (high school questions) and CLOTH
  • Table2: A Sample passage from our dataset. Bold faces highlight the correct answers. There is only one best answer among four candidates, although several candidates may seem correct
  • Table3: The question-type statistics of 3,000 sampled questions, where GM, STR, MP, LTR and O denote grammar, short-term reasoning, matching/paraphrasing, long-term reasoning and others, respectively
  • Table4: Models' performance and human-level performance on CLOTH. The LSTM, the Stanford Attentive Reader and the Attentive Reader with position-aware attention shown in the top part only use supervised data labelled by humans. The LM outperforms the LSTM since it receives more supervision in learning to predict each word. Training on a large external corpus further significantly enhances the LM's accuracy
  • Table5: Error analysis of the 1-billion-word language model (1B-LM) with three sentences as the context. The questions are sampled from the sample passage shown in Table 2. The correct answer is in bold text; the incorrectly selected options are in italics
  • Table6: Humans' performance compared with the 1B-LM. In the short-context part, both the 1B-LM and humans only use information from one sentence. In the long-context part, humans have the whole passage as the context, while the 1B-LM uses contexts of three sentences
  • Table7: The model’s performance when trained on α percent of automatically-generated data and 100 − α percent of human-created data
  • Table8: Overall results on CLOTH. Ex. denotes external data
  • Table9: Ablation study on using the representativeness information (denoted as rep.) and the human-created data (denoted as hum.)
  • Table10: An Amazon Mechanical Turk worker's labels for the sample passage
Related work
  • Large-scale automatically-generated cloze tests (Hermann et al., 2015; Hill et al., 2016; Onishi et al., 2016) have led to significant research advancements. However, the generated questions do not consider the language phenomena to be tested and are relatively easy to solve. Recently proposed reading comprehension datasets are all labeled by humans to ensure high quality (Rajpurkar et al., 2016; Joshi et al., 2017; Trischler et al., 2016; Nguyen et al., 2016).

    Perhaps the closest work to CLOTH is the LAMBADA dataset (Paperno et al., 2016). LAMBADA also aims to find challenging words that test an LM's ability to comprehend a longer context. However, LAMBADA does not provide a candidate set for each question, which can cause ambiguity when multiple words fit the blank. Furthermore, only the test and development sets are labeled manually; the provided training set is the unlabeled Book Corpus (Zhu et al., 2015). Such unlabeled data do not emphasize long-dependency questions and have a distribution mismatched with the test set, as shown in Section 5. Further, the Book Corpus is too large to allow rapid algorithm development for researchers who do not have access to a huge amount of computational power.
Funding
  • This research was supported in part by DARPA grant FA8750-12-20342 funded under the DEFT program
Reference
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Haw-Shiuan Chang, Erik Learned-Miller, and Andrew McCallum. 2017. Active bias: Training more accurate neural networks by emphasizing high variance samples. In Advances in Neural Information Processing Systems, pages 1003–1013.
  • Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.
  • Danqi Chen, Jason Bolton, and Christopher D Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. arXiv preprint arXiv:1606.02858.
  • Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.
  • Rui Correia, Jorge Baptista, Maxine Eskenazi, and Nuno J Mamede. 2012. Automatic generation of cloze question stems. In PROPOR, pages 168–178. Springer.
  • Rui Correia, Jorge Baptista, Nuno Mamede, Isabel Trancoso, and Maxine Eskenazi. 2010. Automatic generation of cloze question distractors. In Proceedings of the Interspeech 2010 Satellite Workshop on Second Language Studies: Acquisition, Learning, Education and Technology, Waseda University, Tokyo, Japan.
  • Pradeep Dasigi, Waleed Ammar, Chris Dyer, and Eduard Hovy. 2017. Ontology-aware token embeddings for prepositional phrase attachment. arXiv preprint arXiv:1705.02925.
  • Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. 2016. Gated-attention readers for text comprehension. arXiv preprint arXiv:1606.01549.
  • Li Dong, Jonathan Mallinson, Siva Reddy, and Mirella Lapata. 2017. Learning to paraphrase for question answering. arXiv preprint arXiv:1708.06022.
  • Sandra S Fotos. 1991. The cloze test as an integrative measure of EFL proficiency: A substitute for essays on college entrance examinations? Language Learning, 41(3):313–336.
  • Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In NIPS.
  • Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2016. The goldilocks principle: Reading children's books with explicit memory representations. ICLR.
  • Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  • Jon Jonz. 1991. Cloze item types and second language comprehension. Language Testing, 8(1):1–22.
  • Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. ACL.
  • Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.
  • Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Andrey Kurtasov. 2013. A system for generating cloze test items from Russian-language text. In Proceedings of the Student Research Workshop associated with RANLP 2013, pages 107–112.
  • Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. EMNLP.
  • Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
  • Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. 2016. Who did what: A large-scale person-centered cloze dataset. arXiv preprint arXiv:1608.05457.
  • Denis Paperno, German Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernandez. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031.
  • Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.
  • Anselmo Penas, Yusuke Miyao, Alvaro Rodrigo, Eduard H Hovy, and Noriko Kando. 2014. Overview of CLEF QA entrance exams task 2014. In CLEF (Working Notes), pages 1194–1200.
  • Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543.
  • Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
  • Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
  • Alvaro Rodrigo, Anselmo Penas, Yusuke Miyao, Eduard H Hovy, and Noriko Kando. 2015. Overview of CLEF QA entrance exams task 2015. In CLEF (Working Notes).
  • J Sachs, P Tung, and RYH Lam. 1997. How to construct a cloze test: Lessons from testing measurement theory models. Perspectives.
  • Carissa Schoenick, Peter Clark, Oyvind Tafjord, Peter Turney, and Oren Etzioni. 2017. Moving beyond the Turing test with the Allen AI science challenge. Communications of the ACM, 60(9):60–64.
  • Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603.
  • Burr Settles. 2009. Active learning literature survey.
  • Hideyuki Shibuki, Kotaro Sakamoto, Yoshinobu Kano, Teruko Mitamura, Madoka Ishioroshi, Kelly Y Itakura, Di Wang, Tatsunori Mori, and Noriko Kando. 2014. Overview of the NTCIR-11 QA-Lab task. In NTCIR.
  • Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. 2016. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769.
  • Adam Skory and Maxine Eskenazi. 2010. Predicting cloze task quality for vocabulary training. In Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications, pages 49–56. Association for Computational Linguistics.
  • Wilson L Taylor. 1953. Cloze procedure: A new tool for measuring readability. Journalism Bulletin, 30(4):415–433.
  • Annie Tremblay. 2011. Proficiency assessment standards in second language acquisition research. Studies in Second Language Acquisition, 33(3):339–372.
  • Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2016. NewsQA: A machine comprehension dataset. arXiv preprint arXiv:1611.09830.
  • Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 189–198.
  • Yichong Xu, Jingjing Liu, Jianfeng Gao, Yelong Shen, and Xiaodong Liu. 2017. Towards human-level machine reading comprehension: Reasoning and inference with multiple strategies. arXiv preprint arXiv:1711.04964.
  • Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 35–45.
  • Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.
  • Geoffrey Zweig and Christopher JC Burges. 2011. The Microsoft Research sentence completion challenge. Technical Report MSR-TR-2011-129, Microsoft.