Representing Documents via Latent Keyphrase Inference.
WWW '16: 25th International World Wide Web Conference, Montréal, Québec, Canada, April 2016
Abstract
Many text mining approaches adopt bag-of-words or bag-of-n-grams models to represent documents. Looking beyond just the words, i.e., the explicit surface forms, in a document can improve a computer's understanding of text. Aware of this, researchers have proposed concept-based models that rely on a human-curated knowledge base to incorporate other related concepts into the document representation. But these methods are not desirable when applied to vertical domains (e.g., literature, enterprise) due to low coverage of in-domain concepts in the general knowledge base and interference from out-of-domain concepts. In this paper, we propose a data-driven model named Latent Keyphrase Inference (LAKI) that represents documents with a vector of closely related domain keyphrases instead of single words or existing concepts in the knowledge base. We show that given a corpus of in-domain documents, topical content units can be learned for each domain keyphrase, which enables a computer to do smart inference to discover latent document keyphrases, going beyond just explicit mentions. Compared with state-of-the-art document representation approaches, LAKI fills the gap between bag-of-words and concept-based models by using domain keyphrases as the basic representation unit. It removes the dependency on a knowledge base while providing, with keyphrases, readily interpretable representations. When evaluated against 8 other methods on two text mining tasks over two corpora, LAKI outperformed all.
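To make the core idea concrete, below is a minimal, hypothetical sketch (not the authors' implementation): each domain keyphrase is paired with weighted content units that would, in LAKI, be learned from an in-domain corpus; a document is then scored against every keyphrase, so keyphrases never mentioned verbatim can still receive weight. All names and weights here are illustrative assumptions.

```python
from collections import Counter

# Toy "content units" per domain keyphrase (in LAKI these are learned
# from an in-domain corpus; the entries below are invented for illustration).
CONTENT_UNITS = {
    "information extraction": {"entity": 2.0, "relation": 1.5, "corpus": 0.5},
    "topic modeling":         {"topic": 2.0, "latent": 1.5, "corpus": 0.5},
}

def keyphrase_vector(doc_tokens):
    """Represent a document as scores over domain keyphrases.

    A keyphrase scores highly when its content units appear often,
    even if the keyphrase itself is never mentioned explicitly.
    """
    counts = Counter(doc_tokens)
    vec = {}
    for phrase, units in CONTENT_UNITS.items():
        score = sum(weight * counts[unit] for unit, weight in units.items())
        if score > 0:
            vec[phrase] = score
    return vec

doc = "we extract each entity and relation from a large corpus".split()
print(keyphrase_vector(doc))
# → {'information extraction': 4.0, 'topic modeling': 0.5}
```

Note how "information extraction" dominates the representation even though that phrase never occurs in the document, which is the kind of latent-keyphrase inference the abstract describes; the actual model replaces this overlap scoring with probabilistic inference over learned content units.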