Using citation networks to evaluate the impact of text length on keyword extraction

PLOS ONE (2023)

Abstract
The identification of key concepts within unstructured data is of paramount importance in practical applications. Despite the abundance of proposed methods for extracting primary topics, few works have investigated the influence of text length on the performance of keyword extraction (KE) methods. Specifically, many studies rely on abstracts and titles for content extraction from papers, leaving it uncertain whether using the complete content of papers yields consistent results. Hence, in this study, we employ a network-based approach to evaluate the concordance between keywords extracted from abstracts and those extracted from entire papers. Community detection methods are used to identify interconnected papers in citation networks. The resulting paper clusters are then used to identify salient terms within each cluster, following a methodology akin to the term frequency-inverse document frequency (tf-idf) approach. Once each cluster has been assigned its distinctive set of key terms, these terms serve as representative keywords at the paper level: the top-ranked words at the cluster level that also appear in a paper's abstract are chosen as keywords for that paper. Our findings indicate that the various community detection methods used in KE yield similar levels of accuracy. Notably, text clustering approaches outperform all citation-based methods, although all approaches yield relatively low accuracy values. We also identified a lack of concordance between keywords extracted from the abstracts and those extracted from the corresponding full-text source. Considering that citations and text clustering yield distinct outcomes, combining them in hybrid approaches could offer improved performance.
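
The cluster-then-rank procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the community labels are assumed to be given already (for instance, produced by a community detection method run on the citation network), the scoring is a generic tf-idf variant in which each cluster is treated as one document, and the names `tokenize`, `cluster_keywords`, and `top_k` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cluster_keywords(abstracts, communities, top_k=3):
    """Pick paper-level keywords from cluster-level term rankings.

    abstracts:   dict mapping paper id -> abstract text
    communities: dict mapping paper id -> cluster label (assumed to be
                 precomputed by some community detection method)
    """
    # Term counts per cluster: each cluster is treated as one document.
    cluster_tf = {}
    for pid, text in abstracts.items():
        cluster_tf.setdefault(communities[pid], Counter()).update(tokenize(text))

    # Number of clusters in which each term occurs.
    df = Counter()
    for counts in cluster_tf.values():
        df.update(counts.keys())

    n_clusters = len(cluster_tf)
    ranked = {}
    for label, counts in cluster_tf.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(n_clusters / df[term])
            for term, count in counts.items()
        }
        # Keep terms with a positive score; break ties alphabetically.
        ranked[label] = sorted(
            (t for t in scores if scores[t] > 0),
            key=lambda t: (-scores[t], t),
        )

    # A paper's keywords are the top-ranked terms of its cluster
    # that also appear in the paper's own abstract.
    result = {}
    for pid, text in abstracts.items():
        words = set(tokenize(text))
        result[pid] = [t for t in ranked[communities[pid]] if t in words][:top_k]
    return result


# Toy data, invented purely for illustration.
abstracts = {
    "p1": "graph community detection in citation graph",
    "p2": "community detection with modularity",
    "p3": "neural keyword extraction model",
    "p4": "keyword extraction with neural embeddings",
}
communities = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}
keywords = cluster_keywords(abstracts, communities)
```

In this sketch, a term that occurs in every cluster receives a score of zero and is never selected, which mirrors the role of the inverse document frequency factor in suppressing terms that do not distinguish one cluster from another.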