Towards Authorship Attribution in Arabic Short-Microblog Text

IEEE Access (2021)

Abstract
Authorship attribution is the task of identifying individuals by their writing style without knowing their actual identities. It is a challenging task in natural language processing. Most work on authorship attribution has focused on English, whereas the problem is understudied for Arabic. Because of the complex and distinct morphology of Arabic, techniques developed for English are not directly applicable to it. This paper explores the use of state-of-the-art classifiers, namely Support Vector Machines (SVM), K-Nearest Neighbours (KNN) and Random Forest, to predict authorship in Arabic short microblog text. We employed three commonly used types of linguistic features (character-, lexical- and syntactic-based) in an incremental manner and measured the accuracy of the selected classifiers. The results show that a systematic combination of linguistic features improves authorship classification. However, classification accuracy was inversely related to the number of authors. Overall, the SVM and Random Forest classifiers are comparable and attained approximately 65% accuracy, whereas KNN barely attained approximately 35% accuracy. In addition, lexical features offer more discriminatory power than character and syntactic features.
Keywords
Authorship attribution, Arabic microblogs, classification, grid search CV
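
The incremental use of feature sets described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' pipeline: the toy posts, the author labels, the choice of character bigrams and whitespace tokens as features, and the cosine-similarity nearest-neighbour classifier are all assumptions made for illustration. Syntactic features are omitted because they would require an Arabic part-of-speech tagger, and no grid search is performed.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=2):
    """Character-based features: counts of overlapping character n-grams."""
    return Counter("c:" + text[i:i + n] for i in range(len(text) - n + 1))


def lexical(text):
    """Lexical features: counts of whitespace-separated tokens."""
    return Counter("w:" + tok for tok in text.split())


def featurize(text, extractors):
    """Merge the outputs of the selected feature extractors into one vector."""
    feats = Counter()
    for extract in extractors:
        feats.update(extract(text))
    return feats


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, text, extractors, k=1):
    """Return the majority author among the k most similar training posts."""
    query = featurize(text, extractors)
    scored = sorted(
        ((cosine(query, featurize(post, extractors)), author)
         for post, author in train),
        reverse=True,
    )
    votes = Counter(author for _, author in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus: (post, author label). Not data from the paper.
train = [
    ("صباح الخير يا أصدقاء", "author_a"),
    ("صباح النور والسرور", "author_a"),
    ("مباراة اليوم كانت رائعة", "author_b"),
    ("مباراة الأمس كانت مملة", "author_b"),
]
query = "مباراة الغد ستكون رائعة"

# Add feature sets incrementally and compare the predictions.
for extractors in ([char_ngrams], [char_ngrams, lexical]):
    names = "+".join(e.__name__ for e in extractors)
    print(names, "->", knn_predict(train, query, extractors))
```

In a realistic setting, each feature configuration would be evaluated with cross-validated accuracy over many authors rather than a single query, which is where the reported drop in accuracy with a growing number of authors would become visible.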