MOOCCube: A Large-scale Data Repository for NLP Applications in MOOCs

ACL, pp. 3135-3142, 2020.

Keywords:
course concept extraction, Massive open online courses, prerequisite relation, course concept expansion, example application
TL;DR:
We present MOOCCube, a large-scale data repository of over 700 MOOC courses, 100k concepts, and 8 million student behaviors, together with an external resource.

Abstract:

The prosperity of Massive Open Online Courses (MOOCs) provides fodder for much NLP and AI research on educational applications, e.g., course concept extraction, prerequisite relation discovery, etc. However, the publicly available MOOC datasets are limited in size and contain few types of data, which hinders advanced models and novel attempts …
Introduction
  • Massive open online courses (MOOCs) have boomed in recent years, providing convenient education for over 100 million users worldwide (Shah, 2019).
  • Most of the publicly available datasets are designed for a specific task or method, e.g., Zhang et al. (2019) build a MOOC enrollment dataset for course recommendation and Yu et al. (2019) release a dataset only for course concept expansion; each merely contains a subset of MOOC elements.
  • They are therefore not flexible enough to support ideas that demand more types of information.
  • These datasets also contain only a small number of specific entities or relation instances, e.g., the prerequisite relations of TutorialBank (Fabbri et al., 2018) comprise only 794 cases, making them insufficient for advanced models.
Highlights
  • Massive open online courses (MOOCs) have boomed in recent years, providing convenient education for over 100 million users worldwide (Shah, 2019).
  • From extracting course concepts and their prerequisite relations (Pan et al., 2017b; Roy et al., 2019; Li et al., 2019) to analyzing student behaviors (Zhang et al., 2019; Feng et al., 2019), MOOC-related topics, tasks, and methods have snowballed in recent years.
  • Most of the publicly available datasets are designed for a specific task or method, e.g., Zhang et al. (2019) build a MOOC enrollment dataset for course recommendation and Yu et al. (2019) release a dataset only for course concept expansion; each merely contains a subset of MOOC elements.
  • They are therefore not flexible enough to support ideas that demand more types of information. These datasets also contain only a small number of specific entities or relation instances, e.g., the prerequisite relations of TutorialBank (Fabbri et al., 2018) comprise only 794 cases, making them insufficient for advanced models.
  • The PREREQ models perform best in F1-score, while student behavior data further benefits precision (PREREQ-S improves precision to 0.651).
  • We present MOOCCube, a multi-dimensional data repository containing courses, concepts, and student activities from real MOOC websites (see the sketch below).
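As a minimal sketch of how these dimensions might be tied together in code, consider the snippet below. The file names and field names are illustrative assumptions, not the released MOOCCube schema:

```python
import json
from collections import defaultdict

# Hypothetical MOOCCube-style records: the file names and field names below
# are illustrative assumptions, not the repository's released schema.
def load_jsonl(path):
    """Read one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

courses = load_jsonl("course.json")            # e.g. {"id": "C_db", "name": "Database", "video_ids": ["V_42"]}
behaviors = load_jsonl("user_video_act.json")  # e.g. {"user_id": "U_1", "video_id": "V_42", "watch_time": 310}

# Link the course and student-activity dimensions: count watched videos per course.
video_to_course = {v: c["id"] for c in courses for v in c.get("video_ids", [])}
views_per_course = defaultdict(int)
for act in behaviors:
    course_id = video_to_course.get(act["video_id"])
    if course_id is not None:
        views_per_course[course_id] += 1

print(dict(views_per_course))
```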
Results
  • The PREREQ models perform best in F1-score, while student behavior data further benefits precision (PREREQ-S improves precision to 0.651).
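For reference, the reported precision, recall, and F1 follow the usual set-based definitions over predicted versus gold prerequisite pairs; a minimal sketch with made-up concept pairs:

```python
# Set-based precision/recall/F1 over predicted vs. gold prerequisite pairs.
# The concept pairs below are invented purely for illustration.
gold = {("binary tree", "binary search tree"), ("set", "relation"), ("graph", "shortest path")}
pred = {("binary tree", "binary search tree"), ("graph", "shortest path"), ("queue", "stack")}

tp = len(pred & gold)                                   # correctly predicted prerequisite pairs
precision = tp / len(pred) if pred else 0.0
recall = tp / len(gold) if gold else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")  # P=0.667 R=0.667 F1=0.667
```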
Conclusion
  • The authors present MOOCCube, a multi-dimensional data repository containing courses, concepts, and student activities from real MOOC websites.
  • With large-scale data in all dimensions, MOOCCube can support new models and diverse NLP applications in MOOCs. The authors conduct prerequisite relation extraction as an example application, and the experimental results show the potential of such a repository.
  • Promising future directions include: 1) utilizing more types of data from MOOCCube to facilitate existing topics; 2) employing advanced models on existing tasks; and 3) exploring more innovative NLP application tasks in the online education domain.
Tables
  • Table 1: Statistics of existing NLP-in-Education datasets
  • Table 2: Results of prerequisite discovery
  • Table 3: Course name list
  • Table 4: Statistics of questions
  • Table 5: Statistics of entities and relationships in MOOC Q&A
Related work
  • In this section, we introduce research on NLP in education, especially in MOOCs, as well as several publicly available related datasets.

    Existing research in MOOCs uses courses and students as the main resources and can be divided into two categories according to the research object: one line focuses on the content of the courses, such as course concept extraction (Pan et al., 2017b), prerequisite relation discovery (Pan et al., 2017a), and course concept expansion (Yu et al., 2019); the other focuses on the learning behavior of students, such as dropout prediction (Feng et al., 2019) and course recommendation (Zhang et al., 2019; Cao et al., 2019). Because of these differing tasks, researchers have to repeatedly build their own datasets, which is the original motivation for MOOCCube.

    In addition, some researchers try to obtain educational information from other resources, e.g., the ACL Anthology (Radev et al., 2013), TutorialBank (Fabbri et al., 2018), and LectureBank (Li et al., 2019). They collect concepts and relationships from papers and lectures and also build diverse datasets. Though these datasets are likewise limited in scale, such beneficial attempts guide the construction of MOOCCube.
Funding
  • Zhiyuan Liu is supported by the National Key Research and Development Program of China (No. 2018YFB1004503), and the other authors are supported by NSFC key projects (U1736204, 61533018), a grant from the Beijing Academy of Artificial Intelligence (BAAI2019ZD0502), a grant from the Institute for Guo Qiang, Tsinghua University, the THU-NUS NExT Co-Lab, the Center for Massive Online Education of Tsinghua University, and XuetangX.
Reference
  • Yixin Cao, Lifu Huang, Heng Ji, Xu Chen, and Juanzi Li. 2017. Bridge text and knowledge by learning multi-prototype entity mention embedding. In ACL.
  • Yixin Cao, Xiang Wang, Xiangnan He, Zikun Hu, and Tat-Seng Chua. 2019. Unifying knowledge graph learning and recommendation: Towards a better understanding of user preferences. In The World Wide Web Conference.
  • KDD Cup. 2015. KDD Cup 2015: Predicting dropouts in MOOC.
  • Alexander Fabbri, Irene Li, Prawat Trairatvorakul, Yijiao He, Weitai Ting, Robert Tung, Caitlin Westerfield, and Dragomir Radev. 2018. TutorialBank: A manually-collected corpus for prerequisite chains, survey extraction and resource recommendation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 611–620.
  • Wenzheng Feng, Jie Tang, and Tracy Xiao Liu. 2019. Understanding dropouts in MOOCs. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, No. 01: AAAI-19, IAAI-19, EAAI-20).
  • Jonathan Gordon, Stephen Aguilar, Emily Sheng, and Gully Burns. 2017. Structured generation of technical reading lists. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 261–270, Copenhagen, Denmark. Association for Computational Linguistics.
  • Jonathan Gordon, Linhong Zhu, Aram Galstyan, Prem Natarajan, and Gully Burns. 2016. Modeling concept dependencies in a scientific corpus. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 866–875, Berlin, Germany. Association for Computational Linguistics.
  • Irene Li, Alexander R. Fabbri, Robert R. Tung, and Dragomir R. Radev. 2019. What should I learn first: Introducing LectureBank for NLP education and prerequisite chain learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6674–6681.
  • Chen Liang, Zhaohui Wu, Wenyi Huang, and C Lee Giles. 2015. Measuring prerequisite relations among concepts. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1668–1674.
  • Hugo Liu and Push Singh. 2004. ConceptNet: a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211–226.
  • Liangming Pan, Chengjiang Li, Juanzi Li, and Jie Tang. 2017a. Prerequisite relation learning for concepts in MOOCs. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1447–1456.
  • Liangming Pan, Xiaochen Wang, Chengjiang Li, Juanzi Li, and Jie Tang. 2017b. Course concept extraction in MOOCs via embedding-based graph propagation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 875–884.
  • Simone Paolo Ponzetto and Michael Strube. 2007. Deriving a large scale taxonomy from Wikipedia. In AAAI, volume 7, pages 1440–1445.
  • Dragomir R. Radev, Pradeep Muthukrishnan, Vahed Qazvinian, and Amjad Abu-Jbara. 2013. The ACL Anthology network corpus. Language Resources and Evaluation, 47(4):919–944.
  • Sudeshna Roy, Meghana Madhyastha, Sheril Lawrence, and Vaibhav Rajan. 2019. Inferring concept prerequisite relations from online educational resources. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9589–9594.
  • D. Shah. 2019. Year of MOOC-based degrees: A review of MOOC stats and trends in 2018. Class Central's MOOC Report.
  • Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. 2008. ArnetMiner: extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 990–998. ACM.
  • Thierry Volery and Deborah Lord. 2000. Critical success factors in online education. International journal of educational management, 14(5):216–223.
  • Jifan Yu, Chenyu Wang, Gan Luo, Lei Hou, Juanzi Li, Zhiyuan Liu, and Jie Tang. 2019. Course concept expansion in MOOCs with external knowledge and interactive game. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 4292–4302.
  • Han Zhang, Maosong Sun, Xiaochen Wang, Zhengyang Song, Jie Tang, and Jimeng Sun. 2017. Smart Jump: Automated navigation suggestion for videos in MOOCs. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 331–339. International World Wide Web Conferences Steering Committee.
  • Jing Zhang, Bowen Hao, Bo Chen, Cuiping Li, Hong Chen, and Jimeng Sun. 2019. Hierarchical reinforcement learning for course recommendation in MOOCs. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, No. 01: AAAI-19, IAAI-19, EAAI-20).
  • Many efforts on extracting prerequisite relations utilize this information (Liang et al., 2015; Roy et al., 2019). For each domain of courses, we invite three experts with corresponding teaching experience to annotate the dependency relations among the concepts.
  • Quality Control. Both the concept taxonomy and the prerequisite relations are subjective (Liang et al., 2015). To prevent low-quality annotation results, we mix some gold standards (drawn from existing well-organized datasets (Fabbri et al., 2018)) into the annotation pool. Whenever a labeling result differs from the gold standard, we conduct another round of expert assessment to confirm the truth of these conflicts and to identify annotators who cannot meet the requirements. A sketch of this aggregation and screening procedure follows below.
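A rough sketch of how such an annotation pipeline could be implemented (majority vote over the three expert labels, plus screening annotators against the injected gold-standard items). The helper names, label format, and the 0.8 accuracy threshold are illustrative assumptions, not the authors' released tooling:

```python
from collections import Counter

# Aggregate three expert labels per concept pair by majority vote.
# labels maps (concept_a, concept_b) -> list of "yes"/"no" votes (illustrative data).
def majority_vote(labels):
    aggregated = {}
    for pair, votes in labels.items():
        label, count = Counter(votes).most_common(1)[0]
        aggregated[pair] = (label, count / len(votes))   # keep the agreement ratio for auditing
    return aggregated

# Screen annotators against gold-standard items mixed into the annotation pool.
# gold maps pair -> label; answers[annotator] maps pair -> that annotator's label.
def flag_annotators(gold, answers, min_accuracy=0.8):    # the 0.8 threshold is an assumption
    flagged = []
    for annotator, response in answers.items():
        checked = [pair for pair in gold if pair in response]
        if not checked:
            continue
        accuracy = sum(response[p] == gold[p] for p in checked) / len(checked)
        if accuracy < min_accuracy:
            flagged.append((annotator, accuracy))         # route to another round of expert review
    return flagged

labels = {("set", "relation"): ["yes", "yes", "no"]}
gold = {("graph", "shortest path"): "yes"}
answers = {"expert_1": {("graph", "shortest path"): "no"}}
print(majority_vote(labels))           # {('set', 'relation'): ('yes', 0.666...)}
print(flag_annotators(gold, answers))  # [('expert_1', 0.0)]
```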