An ensemble approach for text document clustering using Wikipedia concepts.

DOCENG(2014)

引用 13|浏览37
暂无评分
摘要
ABSTRACTMost text clustering algorithms represent a corpus as a document-term matrix in the bag of words model. The feature values are computed based on term frequencies in documents and no semantic relatedness between terms is considered. Therefore, two semantically similar documents may sit in different clusters if they do not share any terms. One solution to this problem is to enrich the document representation using an external resource like Wikipedia. We propose a new way to integrate Wikipedia concepts in partitional text document clustering in this work. A text corpus is first represented as a document-term matrix and a document-concept matrix. Terms that exist in the corpus are then clustered based on the document-term representation. Given the term clusters, we propose two methods, one based on the document-term representation and the other one based on the document-concept representation, to find two sets of seed documents. The two sets are then used in our text clustering algorithm in an ensemble approach to cluster documents. The experimental results show that even though the document-concept representations do not result in good document clusters per se, integrating them in our ensemble approach improves the quality of document clusters significantly.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要