Topical term weighting based on extended random sets for relevance feature selection
WI (2017)
Abstract
It is challenging to discover relevant features in long documents that describe user information needs, because synonymy, polysemy, noise, and high dimensionality are inherent problems of text. Traditional feature selection methods cannot deal with these problems effectively, because they assume that a document describes only one topic. Topic-based techniques, such as Latent Dirichlet Allocation (LDA), relax this assumption: they are built on the premise that a document can exhibit multiple hidden topics. However, LDA does not show encouraging results in selecting relevant features, because it calculates term weights locally, within individual documents, and does not generalise them globally at the collection level. To address this problem, we propose an innovative and effective extended random set model that generalises LDA's local document term weights. The model is used as a weighting scheme for topical terms: it assigns a more discriminative and accurate weight to these terms based on their appearance in LDA topics and in relevant documents. Experimental results on the standard RCV1 dataset, TREC topics, and five standard performance measures show that the proposed model significantly outperforms eight state-of-the-art baseline models in information filtering.
Keywords
Feature Selection, Term Weighting, Latent Dirichlet Allocation, Extended Random Set, Text Mining
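As a toy illustration of the local-vs-global weighting problem the abstract describes (not the paper's extended random set model), the sketch below computes an LDA-style local term weight within one document, P(term | d) = Σ_t P(t | d) · P(term | t), and then naively pools it across a small "collection" by averaging. All topic distributions are invented toy numbers, and the pooling rule is an assumption for illustration only.

```python
# Sketch of local vs. collection-level term weighting with LDA-style
# distributions. Toy numbers, not real LDA output; the averaging step
# is a naive stand-in, not the paper's extended-random-set weighting.

def local_weight(term, doc_topics, topic_terms):
    """Local weight of a term in one document:
    P(term | d) = sum_t P(t | d) * P(term | t)."""
    return sum(p_t * topic_terms[t].get(term, 0.0)
               for t, p_t in doc_topics.items())

def collection_weight(term, corpus_topics, topic_terms):
    """Naive global weight: average the local weights over all
    documents in the collection."""
    weights = [local_weight(term, dt, topic_terms) for dt in corpus_topics]
    return sum(weights) / len(weights)

# Two toy topics over a tiny vocabulary (P(term | topic)).
topic_terms = {
    "t0": {"feature": 0.6, "topic": 0.4},
    "t1": {"feature": 0.1, "model": 0.9},
}
# Per-document topic mixtures (P(topic | document)) for two documents.
corpus_topics = [
    {"t0": 0.8, "t1": 0.2},
    {"t0": 0.3, "t1": 0.7},
]

print(local_weight("feature", corpus_topics[0], topic_terms))   # 0.5
print(collection_weight("feature", corpus_topics, topic_terms)) # 0.375
```

The gap between the local weight (0.5 in the first document) and the pooled collection-level weight (0.375) is the kind of discrepancy the proposed model is designed to resolve in a principled way.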