Is word length inaccurate for authorship attribution?

Digital Scholarship in the Humanities (2022)

Abstract
Word length is a feature extracted from texts and used to characterize authorial style; it was quantitatively demonstrated by Mendenhall (Mendenhall, T. C., 1887, The characteristic curves of composition. Science, IX: 237-49). Many similar features for describing authorial style have been proposed; however, research indicates that word length identifies authors with lower accuracy than other features. This study proposes a feature, referred to as c-wordL, to improve the accuracy of authorship attribution: words are classified into several types according to their part-of-speech (POS) tags, and these types are combined with word length data. The proposed method was tested on 200 literary texts from ten different authors in Japanese, English, and Chinese. The results indicated that c-wordL was more accurate than the existing word length-based features and provided useful information that word unigrams and POS tag bigrams could not capture. In addition, the ease of interpretation of the different types of features was discussed. In summary, c-wordL outperformed the existing superior features in explaining distinct writing styles and identifying authors.
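The idea of combining POS-based word types with word length can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify which word types are used or how lengths are binned, so the `COARSE_TYPES` mapping, the `max_len` cap, and the function name `c_wordl_features` are all hypothetical choices made for illustration. The input is assumed to be already tokenized and POS-tagged.

```python
from collections import Counter

# Hypothetical coarse word types keyed by POS tag.
# The abstract does not list the categories the paper actually uses.
COARSE_TYPES = {
    "NOUN": "noun", "PROPN": "noun",
    "VERB": "verb", "AUX": "verb",
    "ADJ": "modifier", "ADV": "modifier",
}


def c_wordl_features(tagged_tokens, max_len=10):
    """Return the relative frequency of each (word type, word length) pair.

    tagged_tokens: iterable of (word, pos_tag) tuples.
    max_len: assumed cap; longer words share one bin.
    """
    counts = Counter()
    for word, pos in tagged_tokens:
        word_type = COARSE_TYPES.get(pos, "other")
        length = min(len(word), max_len)
        counts[(word_type, length)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: n / total for key, n in counts.items()}


# Toy input, tagged by hand for illustration.
tagged = [("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"), ("quietly", "ADV")]
features = c_wordl_features(tagged)
```

Each text then yields a vector over (type, length) pairs that could be passed to any standard classifier. A plain word length feature would merge "The", "cat", and "sat" into a single length-3 count, whereas this sketch keeps them apart by word type.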
Keywords
authorship attribution, word length inaccurate