Dirichlet Process Mixture Models based topic identification for short text streams

NLPKE(2011)

Abstract
Topic detection and tracking (TDT) has been extensively studied and applied in recent years. However, prior work is mostly based on regular news text, and the problem of scaling to short stories remains largely open. In addition, prior work performs topic identification on already-separated stories, treating story segmentation as a prerequisite, even though segmentation is itself a challenging and critical task in TDT research. In this paper, we propose a Dirichlet Process Mixture Model (DPMM) based topic identification method that handles topic segmentation, topic detection, and tracking in a unified model and achieves reasonable results on short stories. We first present the DPMM and its application to the topic identification task. We then discuss two solutions designed specifically to address the sparseness problem associated with short stories. The first concerns the design of the algorithm flow: instead of a single short text, the processing unit of topic identification is first converted to a session. The second applies an extended DPMM that takes word dependency into account when estimating the word distributions associated with each known topic. Next, we extend the DPMM to identify topics in spontaneous text streams by handling topic segmentation, topic detection, and tracking simultaneously. An attractive advantage of the DPMM is that the number of mixture components need not be fixed in advance, and no prior knowledge about the number or content of topics is required. Compared with existing methods, it is therefore better suited to streaming topic identification. Our empirical results on the TDT3 evaluation data verify that the DPMM is valid for topic identification on short text data with stream properties, and that the extended DPMM outperforms the original DPMM.
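To illustrate why a DPMM does not need the number of topics fixed in advance, the following is a minimal sketch of a greedy, one-pass topic assignment under a Chinese restaurant process prior with a Dirichlet-multinomial word model. It is not the paper's inference procedure or its extended model; the function names (`log_predictive`, `assign_stream`), the greedy argmax in place of sampling, and the parameters `alpha`, `beta`, and `vocab_size` are all assumptions made for illustration.

```python
import math
from collections import Counter


def log_predictive(tokens, word_counts, total, beta, vocab_size):
    """Log probability of `tokens` under a Dirichlet-multinomial topic.

    Tokens are scored one at a time, so repeated words within the same
    text raise each other's probability (sequential predictive form).
    """
    logp = 0.0
    seen = Counter()
    for i, w in enumerate(tokens):
        numerator = word_counts.get(w, 0) + seen[w] + beta
        denominator = total + i + beta * vocab_size
        logp += math.log(numerator / denominator)
        seen[w] += 1
    return logp


def assign_stream(docs, alpha=1.0, beta=0.1, vocab_size=1000):
    """Greedily assign each incoming text to an existing or a new topic.

    alpha (assumed): concentration parameter; larger values favour new topics.
    beta (assumed): symmetric Dirichlet smoothing over the vocabulary.
    """
    topics = []  # each topic: {"n": text count, "words": Counter, "total": token count}
    labels = []
    for n_seen, tokens in enumerate(docs):
        scores = []
        # Existing topics: prior proportional to the number of texts already assigned.
        for t in topics:
            prior = math.log(t["n"] / (n_seen + alpha))
            scores.append(prior + log_predictive(tokens, t["words"], t["total"], beta, vocab_size))
        # New topic: prior proportional to alpha, likelihood from the empty topic.
        prior_new = math.log(alpha / (n_seen + alpha))
        scores.append(prior_new + log_predictive(tokens, {}, 0, beta, vocab_size))

        # Greedy choice (an assumption; a full DPMM would sample instead).
        k = max(range(len(scores)), key=scores.__getitem__)
        if k == len(topics):
            topics.append({"n": 0, "words": Counter(), "total": 0})
        topics[k]["n"] += 1
        topics[k]["words"].update(tokens)
        topics[k]["total"] += len(tokens)
        labels.append(k)
    return labels


stream = [
    ["goal", "match", "team"],
    ["stock", "market", "bank"],
    ["team", "goal", "match"],
    ["bank", "stock", "market"],
]
print(assign_stream(stream))  # [0, 1, 0, 1]
```

In this toy stream, the second text opens a new topic because it shares no words with the first, while the third and fourth texts join the topics whose word counts they match. The number of topics grows with the data rather than being set beforehand.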
Keywords
stochastic processes,topic identification,short text streams,extended dpmm,topic detection and tracking,data streams,dirichlet process mixture model,static short text,dirichlet process mixture models,story segmentation,news text,topic segmentation,algorithm flow design,text analysis,dpmm,word dependency,tdt3 evaluation data,manganese,unified model