Language-Agnostic Modeling of Wikipedia Articles for Content Quality Assessment across Languages
arXiv (2024)
Abstract
Wikipedia is the largest web repository of free knowledge. Volunteer editors
devote time and effort to creating and expanding articles in more than 300
language editions. As content quality varies from article to article, editors
also spend substantial time rating articles with specific criteria. However,
keeping these assessments complete and up to date is largely impossible given
the ever-changing nature of Wikipedia. To overcome this limitation, we propose
a novel computational framework for modeling the quality of Wikipedia articles.
State-of-the-art approaches to modeling Wikipedia article quality have leveraged
machine learning techniques with language-specific features. In contrast, our
framework is based on language-agnostic structural features extracted from the
articles, a set of universal weights, and a language-version-specific
normalization criterion. Therefore, we ensure that all language editions of
Wikipedia can benefit from our framework, even those that do not have their own
quality assessment scheme. Using this framework, we have built datasets with
the feature values and quality scores of all revisions of all articles in the
existing language versions of Wikipedia. We provide a descriptive analysis of
these resources and a benchmark of our framework. In addition, we discuss
possible downstream tasks to be addressed with these datasets, which are
released for public use.
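
The abstract names three ingredients: structural features, universal weights, and a normalization criterion specific to each language version. One way these might combine is sketched below. It is a minimal illustration and not the authors' implementation: the feature names, the weight values, the per-language caps, and the min-capped normalization rule are all hypothetical assumptions, since the abstract does not specify them.

```python
from typing import Dict

# Hypothetical universal weights: shared by every language edition.
# The feature names and values are illustrative assumptions only.
UNIVERSAL_WEIGHTS: Dict[str, float] = {
    "page_length": 0.4,
    "references": 0.3,
    "sections": 0.2,
    "images": 0.1,
}


def quality_score(
    features: Dict[str, float],
    language_norms: Dict[str, float],
    weights: Dict[str, float] = UNIVERSAL_WEIGHTS,
) -> float:
    """Return a score in [0, 1] for one article revision.

    Each raw feature is divided by a cap specific to the language
    edition and clipped at 1.0 (an assumed normalization rule), then
    combined with the shared weights.
    """
    score = 0.0
    for name, weight in weights.items():
        cap = language_norms.get(name, 0.0)
        if cap <= 0:
            # No usable norm for this feature: it contributes nothing.
            continue
        normalized = min(features.get(name, 0.0) / cap, 1.0)
        score += weight * normalized
    return score


# Hypothetical caps for one language edition.
example_norms = {"page_length": 5000, "references": 10, "sections": 5, "images": 2}
# Hypothetical structural features of a single revision.
example_revision = {"page_length": 2500, "references": 5, "sections": 5, "images": 0}

example_score = quality_score(example_revision, example_norms)
print(round(example_score, 2))
```

With these made-up numbers the score is about 0.55. Because only the caps change between language editions, the same revision features could score differently in a smaller edition, which is the role the abstract assigns to the language-specific normalization.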