Vandalism detection in Wikipedia: a high-performing, feature-rich model and its reduction through Lasso

WikiSym '11: Proceedings of the 7th International Symposium on Wikis and Open Collaboration (2011)

Abstract
User-generated content (UGC) constitutes a significant fraction of the Web. However, some wiki-based sites, such as Wikipedia, are so popular that they have become a favorite target of spammers and other vandals. On such popular sites, human vigilance is not enough to combat vandalism, and tools that detect possible vandalism and poor-quality contributions become a necessity. Machine learning techniques hold promise for developing efficient online algorithms that help users detect vandalism. We describe an efficient and accurate classifier that performs vandalism detection on UGC sites, and we report its results on the PAN Wikipedia dataset. We explore the effectiveness of a combination of 66 individual features that produces an AUC of 0.9553 on a test dataset, the best result to our knowledge. Using Lasso optimization, we then reduce our feature-rich model to a much smaller and more efficient model of 28 features that performs almost as well, with a drop in AUC of only 0.005. We describe how this approach can be generalized to other user-generated content systems and describe several applications of this classifier to help users identify potential vandalism.
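The abstract does not spell out how the Lasso step is formulated, so the following is only a minimal sketch of the general idea: an L1 penalty drives the weights of uninformative features to exactly zero, and the features with nonzero weights form the reduced model. Everything here is an assumption made for illustration: the synthetic data (not the PAN Wikipedia dataset), the 10 stand-in features (not the paper's 66), the penalty value `alpha`, and the hypothetical helper names `soft_threshold` and `lasso_coordinate_descent`. It is not the authors' implementation.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values within the threshold become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iters=100):
    """Minimise (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    # Mean squared value of each column, used to scale each coordinate update.
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iters):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes j's own contribution.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_wj = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_wj
    return w


# Synthetic stand-in data (an assumption, NOT the PAN Wikipedia dataset):
# 10 features, of which only the first three influence the target.
random.seed(0)
n_samples, n_features = 200, 10
true_weights = [3.0, -2.0, 1.5] + [0.0] * 7
X = [[random.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
y = [sum(w * x for w, x in zip(true_weights, row)) + random.gauss(0.0, 0.1) for row in X]

weights = lasso_coordinate_descent(X, y, alpha=0.2)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print("kept feature indices:", selected)
```

Raising `alpha` discards more features and lowering it keeps more. In a setting like the one described above, the penalty would presumably be tuned so that the reduced model loses as little AUC as possible.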
Keywords
UGC site, PAN Wikipedia dataset, efficient online algorithm, efficient model, accurate classifier, vandalism detection, potential vandalism, possible vandalism, content system, popular site, feature-rich model, Wikipedia, Lasso, online algorithm, random forest, user generated content, machine learning, random forests