# The dual-sparse topic model: mining focused topics and focused terms in short text

WWW, 2014.

Abstract:

Topic modeling has been proved to be an effective method for exploratory text mining. It is a common assumption of most topic models that a document is generated from a mixture of topics. In real-world scenarios, individual documents usually concentrate on several salient topics instead of covering a wide variety of topics. A real topic a…

Introduction

- We are living in an era of information revolution where social media is gradually taking over the role of traditional media.
- More than 500 million tweets are posted by Twitter users on a daily basis.
- This huge volume of user-generated content, normally in the form of very short documents, contains rich and useful information that can hardly be found in traditional information sources [31].
- Statistical topic models have proved to be effective tools for exploratory analysis of this overload of text content [4].
- A topic model provides an effective organization of latent semantics for an unstructured text collection

Highlights

- We are living in an era of information revolution where social media is gradually taking over the role of traditional media
- A topic model can provide an effective organization of latent semantics to the unstructured text collection
- Notably, Wang and Blei address only the sparsity of topic-word distributions; however, the “Spike and Slab” prior they introduce to topic modeling is closely related to the key practice of our treatment [27]
- We address the dual sparsity of the topic representation for documents and the word representation for topics in topic modeling
- This problem is especially important for analyzing short text such as user-generated content on the Web
- We propose a novel topic model, DsparseTM, which employs a “Spike and Slab” process and introduces a smoothing prior and a weak smoothing prior for focused/unfocused topics and focused/unfocused terms

Results

- 5.4.1 Topic coherence

The PMI scores of all candidate methods (DsparseTM, LDA, STC) at varying numbers of topics on the four data sets (DBLP, 20NG, Twitter, Twitter-A), along with the conference-based topics, are presented in Table 3.
- The proposed DsparseTM model yields the highest PMI score, followed by LDA and the mixture of unigrams, all of which outperform STC by a large margin.
- The average PMI score of the 22 conference-based topics is 0.586.
- It is interesting to see that both DsparseTM and LDA achieve higher PMI scores than the conference-based topics
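The PMI-based coherence measure used in these results (following Newman et al.) scores a topic by the average pointwise mutual information of its top word pairs under a reference corpus. A minimal sketch is below; the `eps` smoothing constant for unseen pairs and the convention that co-occurrence keys follow the word order of `top_words` are assumptions of this sketch, not details from the paper.

```python
from itertools import combinations
from math import log

def topic_pmi(top_words, doc_count, cooc_count, n_docs, eps=1.0):
    """Average pairwise PMI over a topic's top words.

    doc_count[w]      -- number of documents containing word w
    cooc_count[(a,b)] -- number of documents containing both a and b,
                         keyed in the order the words appear in top_words
    n_docs            -- size of the reference corpus
    eps               -- smoothing count for unseen pairs (an assumption
                         of this sketch, not from the paper)
    """
    scores = []
    for a, b in combinations(top_words, 2):
        p_a = doc_count[a] / n_docs
        p_b = doc_count[b] / n_docs
        p_ab = (cooc_count.get((a, b), 0) + eps) / n_docs  # smoothed joint probability
        scores.append(log(p_ab / (p_a * p_b)))
    return sum(scores) / len(scores)
```

A topic whose top words frequently co-occur in the reference corpus receives a high (positive) score; unrelated word pairs pull the average toward or below zero.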

Conclusion

- The authors address the dual sparsity of the topic representation for documents and the word representation for topics in topic modeling.
- This problem is especially important for analyzing short text such as user-generated content on the Web.
- The authors propose a novel topic model, DsparseTM, which employs a “Spike and Slab” process and introduces a smoothing prior and a weak smoothing prior for focused/unfocused topics and focused/unfocused terms.
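The “Spike and Slab” intuition can be illustrated schematically: a Bernoulli selector decides whether each term is focused for a topic, and focused terms receive the smoothing prior while unfocused terms receive only the weak smoothing prior, yielding a distribution that is both sparse and smooth over its support. This is a toy sketch of the generative idea, not the authors' actual model or inference procedure; the parameter names (`pi`, `beta`, `beta_bar`) are illustrative.

```python
import random

def sample_focused_distribution(vocab_size, pi=0.2, beta=1.0, beta_bar=1e-3, rng=None):
    """Draw a term distribution under a schematic spike-and-slab prior.

    pi       -- prior probability that a term is 'focused' for this topic
    beta     -- smoothing prior for focused terms (the slab)
    beta_bar -- weak smoothing prior for unfocused terms (the spike)
    All parameter names are illustrative, not from the paper.
    """
    rng = rng or random.Random(0)
    # Bernoulli selectors: which terms are focused for this topic
    selectors = [rng.random() < pi for _ in range(vocab_size)]
    # Normalized Gamma draws give a Dirichlet sample; focused terms use the
    # large smoothing prior, unfocused terms the weak one
    weights = [rng.gammavariate(beta if s else beta_bar, 1.0) for s in selectors]
    total = sum(weights)
    return [w / total for w in weights], selectors
```

With a weak prior like `beta_bar=1e-3`, nearly all probability mass lands on the focused terms, while the focused terms themselves remain smoothly distributed rather than collapsing onto a single word.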

Tables

- Table1: Variables and Notations
- Table2: Statistics of the data sets
- Table3: Topic coherence (PMI) on four data sets
- Table4: Focused topics and average sparsity ratio for different DBLP categories
- Table5: Focused topics and average sparsity ratio for different 20NG categories (comp.graphics, comp.os.ms-windows.misc, soc.religion.christian, talk.politics.mideast)
- Table6: Focused terms and sparsity ratio of selected topics on DBLP
- Table7: Focused terms and sparsity ratio of selected topics on 20NG
- Table8: Most frequent terms of a category, and most relevant topics on DBLP
- Table9: Most frequent terms of a category, and most relevant topics on 20NG

Related work

- To the best of our knowledge, this is the first study that simultaneously mines focused topics and focused terms from short text. This is achieved by addressing the dual sparsity of topic mixtures and topic-word distributions in topic modeling. Our work is related to the following lines of literature.

2.1 Classical Probabilistic Topic Models

Classical probabilistic topic models such as probabilistic latent semantic analysis (PLSA) [13] and latent Dirichlet allocation (LDA) [6] have been widely adopted in text mining. Without utilizing auxiliary information such as higher-level context, classical topic models generally regard each document as an admixture of topics, where each topic is defined as a unigram distribution over all the terms in the vocabulary. In practice, the benefit of LDA over PLSA comes from the smoothing of the document-topic and topic-word distributions introduced by the Dirichlet priors, which alleviates the overfitting problem of PLSA. To relax the assumption that the user knows the number of topics a priori, Blei et al. proposed a hierarchical topic model that utilizes the nested Chinese Restaurant Process to construct an unbounded number of topics [5].

These classical models generally lack the ability to directly control the posterior sparsity [11] of the inferred representations, and thus fail to address the skewness of the topic mixtures and the word distributions. Indeed, one could enhance sparsity by pushing the Dirichlet priors in LDA toward zero. However, such a crude treatment inevitably weakens the effect of smoothing. Previous work has shown that simply applying a small Dirichlet prior is not only ineffective in controlling the posterior sparsity [32], but also results in compromised, less smooth document-topic and topic-word distributions [27]. In other words, a weakened Dirichlet smoothing yields sparsity only because of the scarcity of information. As a result, when applied to short documents, classical topic models are usually unable to perform as well as they do on professional documents, even when the Dirichlet priors are optimized.
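The trade-off described above can be seen in a toy calculation. Under a symmetric Dirichlet(α) prior, the posterior mean of a document's topic mixture is θ_k = (n_k + α) / (n + Kα): shrinking α toward zero recovers the spiky empirical proportions but leaves the zero-count topics essentially unsmoothed, while a larger α smooths every topic away from zero. The counts below are illustrative.

```python
def posterior_topic_mixture(topic_counts, alpha):
    """Posterior mean of a document's topic mixture under a symmetric
    Dirichlet(alpha) prior: theta_k = (n_k + alpha) / (n + K * alpha)."""
    n = sum(topic_counts)
    k = len(topic_counts)
    return [(c + alpha) / (n + k * alpha) for c in topic_counts]

counts = [8, 2, 0, 0, 0]  # a short document touching only 2 of 5 topics
sharp  = posterior_topic_mixture(counts, alpha=0.01)  # near-zero prior: sparse but barely smoothed
smooth = posterior_topic_mixture(counts, alpha=1.0)   # larger prior: smoothed, no near-zero entries
```

With α = 0.01 the unseen topics stay near zero (sparse, but the estimate is driven almost entirely by the scarce counts), whereas α = 1.0 assigns them noticeable mass, which is exactly the sparsity-versus-smoothing tension the dual-sparse model is designed to decouple.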

Funding

- This work is supported by the Hong Kong Research Grants Council (RGC) General Research Fund (GRF) Projects No. CUHK 411211 and 411310, the Chinese University of Hong Kong Direct Grant No. 4055015, and the National Science Foundation under grant numbers IIS-0968489, IIS-1054199, and CCF-1048168

Reference

- C. Archambeau, B. Lakshminarayanan, and G. Bouchard. Latent IBP compound Dirichlet allocation. In NIPS Bayesian Nonparametrics Workshop, 2011.
- A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh. On smoothing and inference for topic models. In UAI, pages 27–34, 2009.
- Y. Bengio, A. C. Courville, and J. S. Bergstra. Unsupervised models of images by spike-and-slab RBMs. In ICML, pages 1145–1152, 2011.
- D. M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.
- D. M. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In NIPS, pages 106–114, 2003.
- D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
- J. Chang, J. L. Boyd-Graber, S. Gerrish, C. Wang, and D. M. Blei. Reading tea leaves: How humans interpret topic models. In NIPS, pages 288–296, 2009.
- X. Chen, M. Zhou, and L. Carin. The contextual focused topic model. In KDD, pages 96–104, 2012.
- A. C. Courville, J. Bergstra, and Y. Bengio. A spike and slab restricted Boltzmann machine. In International Conference on Artificial Intelligence and Statistics, pages 233–241, 2011.
- K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2:265–292, 2002.
- J. V. Graca, K. Ganchev, B. Taskar, and F. Pereira. Posterior vs. parameter sparsity in latent variable models. In NIPS, pages 664–672, 2009.
- T. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 101:5228–5235, 2004.
- T. Hofmann. Probabilistic latent semantic analysis. In UAI, pages 289–296, 1999.
- P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. JMLR, 5:1457–1469, 2004.
- H. Ishwaran and J. S. Rao. Spike and slab variable selection: Frequentist and Bayesian strategies. The Annals of Statistics, 33(2):730–773, 2005.
- A. Kaban, E. Bingham, and T. Hirsimaki. Learning to read between the lines: The aspect Bernoulli model. In SDM, pages 462–466, 2004.
- D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
- Y. Lu, Q. Mei, and C. Zhai. Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA. Information Retrieval, 14(2):178–203, 2011.
- R. Mehrotra, S. Sanner, W. Buntine, and L. Xie. Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In SIGIR, pages 889–892, 2013.
- D. Newman, J. H. Lau, K. Grieser, and T. Baldwin. Automatic evaluation of topic coherence. In NAACL, pages 100–108, 2010.
- I. Sato and H. Nakagawa. Rethinking collapsed variational Bayes inference for LDA. In ICML, 2012.
- E. Saund. A multiply cause mixture model for unsupervised learning. Neural Comput., 7(1):51–71, 1995.
- M. Shashanka, B. Raj, and P. Smaragdis. Sparse overcomplete latent variable decomposition of counts data. In NIPS, pages 1313–1320, 2007.
- J. Tang, M. Zhang, and Q. Mei. One theme in all views: Modeling consensus topics in multiple contexts authors. In KDD, pages 5–13, 2013.
- Y. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In NIPS, pages 1353–1360, 2006.
- H. M. Wallach, D. Mimno, and A. McCallum. Rethinking LDA: Why priors matter. In NIPS, pages 1973–1981, 2009.
- C. Wang and D. M. Blei. Decoupling sparsity and smoothness in the discrete hierarchical Dirichlet process. In NIPS, pages 1982–1989, 2009.
- Q. Wang, J. Xu, H. Li, and N. Craswell. Regularized latent semantic indexing. In SIGIR, pages 685–694, 2011.
- S. Williamson, C. Wang, K. A. Heller, and D. M. Blei. Focused topic models. In NIPS Workshop on Applications for Topic Models: Text and Beyond, 2009.
- S. Williamson, C. Wang, K. A. Heller, and D. M. Blei. The IBP compound Dirichlet process and its application to focused topic modeling. In ICML, pages 1151–1158, 2010.
- W. X. Zhao, J. Jiang, J. Weng, J. He, E.-P. Lim, H. Yan, and X. Li. Comparing Twitter and traditional media using topic models. In ECIR, pages 338–349, 2011.
- J. Zhu and E. P. Xing. Sparse topical coding. In UAI, pages 831–838, 2011.
